[STBDA2023] 14wk-62: NLP with Disaster Tweets (Text) / Data Analysis (AutoGluon)

Author

김보람

Published

December 17, 2023

14wk-62: NLP with Disaster Tweets (Text) / Data Analysis (AutoGluon)

최규빈
2023-12-01

1. Lecture Video

???

2. Imports

#!pip install autogluon.multimodal 
import numpy as np
import pandas as pd
#---#
from autogluon.multimodal import MultiModalPredictor # from autogluon.tabular import TabularPredictor
#---#
import warnings
warnings.filterwarnings('ignore')
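
The commented-out import above hints at the tabular route with `TabularPredictor`. As a point of comparison, here is a minimal sketch of that alternative, assuming the same `df_train`/`df_test` loaded below; it follows the same create → fit → predict pattern, but treats the text column as an ordinary tabular feature:

```python
# Sketch only: the TabularPredictor alternative hinted at by the commented import above.
from autogluon.tabular import TabularPredictor

predictr_tab = TabularPredictor(label='target')    # step2 -- create the predictor
predictr_tab.fit(df_train)                         # step3 -- fit on the labeled tweets
yhat_tab = predictr_tab.predict(df_test)           # step4 -- predict the test labels
```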

3. Data

!kaggle competitions download -c nlp-getting-started
Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/coco/.kaggle/kaggle.json'
Downloading nlp-getting-started.zip to /home/coco/Dropbox/Class/STBDA23/posts
100%|████████████████████████████████████████| 593k/593k [00:00<00:00, 2.28MB/s]
!unzip nlp-getting-started.zip -d data 
Archive:  nlp-getting-started.zip
  inflating: data/sample_submission.csv  
  inflating: data/test.csv           
  inflating: data/train.csv          
df_train = pd.read_csv('data/train.csv')
df_test = pd.read_csv('data/test.csv')
sample_submission = pd.read_csv('data/sample_submission.csv')
!rm -rf data
!rm nlp-getting-started.zip

4. Analysis

df_train.head()
   id keyword location                                               text  target
0   1     NaN      NaN  Our Deeds are the Reason of this #earthquake M...       1
1   4     NaN      NaN             Forest fire near La Ronge Sask. Canada       1
2   5     NaN      NaN  All residents asked to 'shelter in place' are ...       1
3   6     NaN      NaN  13,000 people receive #wildfires evacuation or...       1
4   7     NaN      NaN  Just got sent this photo from Ruby #Alaska as ...       1
df_test.head()
   id keyword location                                               text
0   0     NaN      NaN                 Just happened a terrible car crash
1   2     NaN      NaN  Heard about #earthquake is different cities, s...
2   3     NaN      NaN  there is a forest fire at spot pond, geese are...
3   9     NaN      NaN           Apocalypse lighting. #Spokane #wildfires
4  11     NaN      NaN      Typhoon Soudelor kills 28 in China and Taiwan
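
Before fitting, a quick look at the data size and label balance gives a sense of how skewed the problem is (a minimal sketch; the outputs are not reproduced here):

```python
# Shapes of the train/test sets and the class balance of the target
print(df_train.shape, df_test.shape)
df_train['target'].value_counts(normalize=True)   # proportion of disaster (1) vs. non-disaster (0) tweets
```
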
# step1 -- pass (use the raw text column as-is; no preprocessing)
# step2 -- create the predictor
predictr = MultiModalPredictor(label='target')
# step3 -- fit on the labeled tweets
predictr.fit(df_train)
# step4 -- predict the labels of the test tweets
yhat = predictr.predict(df_test)
No path specified. Models will be saved in: "AutogluonModels/ag-20231218_074742/"
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
    2 unique label values:  [1, 0]
    If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Global seed set to 0
AutoMM starts to create your model. ✨

- AutoGluon version is 0.8.2.

- Pytorch version is 1.13.1+cu117.

- Model will be saved to "/home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742".

- Validation metric is "roc_auc".

- To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742
    ```

Enjoy your coffee, and let AutoMM do the job ☕☕☕ Learn more at https://auto.gluon.ai

0 GPUs are detected, and 0 GPUs will be used.

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name              | Type                | Params
----------------------------------------------------------
0 | model             | MultimodalFusionMLP | 109 M 
1 | validation_metric | BinaryAUROC         | 0     
2 | loss_func         | CrossEntropyLoss    | 0     
----------------------------------------------------------
109 M     Trainable params
0         Non-trainable params
109 M     Total params
439.134   Total estimated model params size (MB)
Epoch 0, global step 26: 'val_roc_auc' reached 0.81472 (best 0.81472), saving model to '/home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742/epoch=0-step=26.ckpt' as top 3
Epoch 0, global step 53: 'val_roc_auc' reached 0.87681 (best 0.87681), saving model to '/home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742/epoch=0-step=53.ckpt' as top 3
Epoch 1, global step 80: 'val_roc_auc' reached 0.87866 (best 0.87866), saving model to '/home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742/epoch=1-step=80.ckpt' as top 3
Epoch 1, global step 107: 'val_roc_auc' reached 0.89115 (best 0.89115), saving model to '/home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742/epoch=1-step=107.ckpt' as top 3
Epoch 2, global step 134: 'val_roc_auc' reached 0.88618 (best 0.89115), saving model to '/home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742/epoch=2-step=134.ckpt' as top 3
Epoch 2, global step 161: 'val_roc_auc' reached 0.88654 (best 0.89115), saving model to '/home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742/epoch=2-step=161.ckpt' as top 3
Epoch 3, global step 188: 'val_roc_auc' reached 0.89034 (best 0.89115), saving model to '/home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742/epoch=3-step=188.ckpt' as top 3
Epoch 3, global step 215: 'val_roc_auc' was not in top 3
Epoch 4, global step 242: 'val_roc_auc' was not in top 3
Epoch 4, global step 269: 'val_roc_auc' reached 0.89090 (best 0.89115), saving model to '/home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742/epoch=4-step=269.ckpt' as top 3
Epoch 5, global step 296: 'val_roc_auc' was not in top 3
Epoch 5, global step 323: 'val_roc_auc' was not in top 3
Epoch 6, global step 350: 'val_roc_auc' was not in top 3
Epoch 6, global step 377: 'val_roc_auc' was not in top 3
Start to fuse 3 checkpoints via the greedy soup algorithm.
AutoMM has created your model 🎉🎉🎉

- To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742")
    ```

- You can open a terminal and launch Tensorboard to visualize the training log:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742
    ```

- If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub: https://github.com/autogluon/autogluon

  • This takes quite a while.. one way to shorten the run is sketched below.
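
If the full run is too slow, `MultiModalPredictor.fit` accepts a `time_limit` (in seconds) and a `presets` argument that trade accuracy for speed. A minimal sketch, assuming the same `df_train`; the preset name and time budget are illustrative, not what was run above:

```python
# Faster (usually slightly weaker) fit: cap training time and use a lighter preset.
quick_predictr = MultiModalPredictor(label='target')
quick_predictr.fit(df_train, presets='medium_quality', time_limit=600)   # 600 seconds = 10 minutes
```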

5. Submission

sample_submission
         id  target
0         0       0
1         2       0
2         3       0
3         9       0
4        11       0
...     ...     ...
3258  10861       0
3259  10865       0
3260  10868       0
3261  10874       0
3262  10875       0

[3263 rows × 2 columns]

sample_submission['target'] = yhat 
sample_submission.to_csv("submission.csv",index=False)
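
Before uploading, a quick sanity check that the predicted labels look plausible (using the objects already created above):

```python
# How many test tweets were predicted as disasters (1) vs. not (0)?
sample_submission['target'].value_counts()
```
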
!kaggle competitions submit -c nlp-getting-started -f submission.csv -m "AutoGluon, MultiModalPredictor"
Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/coco/.kaggle/kaggle.json'
100%|██████████████████████████████████████| 22.2k/22.2k [00:01<00:00, 12.2kB/s]
Successfully submitted to Natural Language Processing with Disaster Tweets
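
The public score can also be checked from the command line after the upload; the Kaggle CLI lists recent submissions for the competition:

```python
# List recent submissions (and their public scores) for this competition
!kaggle competitions submissions -c nlp-getting-started
```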

250/1094
0.22851919561243145

A result at this level (roughly the top 23% of the leaderboard) is reasonable.
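
To estimate performance without spending a Kaggle submission, one option is to hold out part of the training data and score locally. A minimal sketch under that assumption (the split, time budget, and metrics are illustrative; this was not part of the run above):

```python
from sklearn.model_selection import train_test_split
from autogluon.multimodal import MultiModalPredictor

# Hold out 20% of the labeled tweets for local evaluation
tr, val = train_test_split(df_train, test_size=0.2, random_state=42, stratify=df_train['target'])

local_predictr = MultiModalPredictor(label='target')
local_predictr.fit(tr, time_limit=600)                        # illustrative time budget
print(local_predictr.evaluate(val, metrics=['acc', 'f1']))    # the competition is scored with F1
```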