[STBDA2023] 14wk-62: NLP with Disaster Tweets (Text) / Data Analysis (AutoGluon)

Author

김보람

Published

December 17, 2023

14wk-62: NLP with Disaster Tweets (Text) / Data Analysis (AutoGluon)

최규빈
2023-12-01

1. Lecture Video

???

2. Imports

#!pip install autogluon.multimodal 
import numpy as np
import pandas as pd
#---#
from autogluon.multimodal import MultiModalPredictor # from autogluon.tabular import TabularPredictor
#---#
import warnings
warnings.filterwarnings('ignore')
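
The commented-out import above hints at the tabular route with `TabularPredictor`. As a point of comparison, here is a minimal sketch of that alternative, assuming the same `df_train`/`df_test` loaded below; it follows the same create → fit → predict pattern, but treats the text column as an ordinary tabular feature:

```python
# Sketch only: the TabularPredictor alternative hinted at by the commented import above.
from autogluon.tabular import TabularPredictor

predictr_tab = TabularPredictor(label='target')    # step2 -- create the predictor
predictr_tab.fit(df_train)                         # step3 -- fit on the labeled tweets
yhat_tab = predictr_tab.predict(df_test)           # step4 -- predict the test labels
```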

3. Data

!kaggle competitions download -c nlp-getting-started
Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/coco/.kaggle/kaggle.json'
Downloading nlp-getting-started.zip to /home/coco/Dropbox/Class/STBDA23/posts
100%|████████████████████████████████████████| 593k/593k [00:00<00:00, 2.28MB/s]
!unzip nlp-getting-started.zip -d data 
Archive:  nlp-getting-started.zip
  inflating: data/sample_submission.csv  
  inflating: data/test.csv           
  inflating: data/train.csv          
df_train = pd.read_csv('data/train.csv')
df_test = pd.read_csv('data/test.csv')
sample_submission = pd.read_csv('data/sample_submission.csv')
!rm -rf data
!rm nlp-getting-started.zip

4. Analysis

df_train.head()
   id keyword location                                               text  target
0   1     NaN      NaN  Our Deeds are the Reason of this #earthquake M...       1
1   4     NaN      NaN             Forest fire near La Ronge Sask. Canada       1
2   5     NaN      NaN  All residents asked to 'shelter in place' are ...       1
3   6     NaN      NaN  13,000 people receive #wildfires evacuation or...       1
4   7     NaN      NaN  Just got sent this photo from Ruby #Alaska as ...       1
df_test.head()
   id keyword location                                               text
0   0     NaN      NaN                 Just happened a terrible car crash
1   2     NaN      NaN  Heard about #earthquake is different cities, s...
2   3     NaN      NaN  there is a forest fire at spot pond, geese are...
3   9     NaN      NaN           Apocalypse lighting. #Spokane #wildfires
4  11     NaN      NaN      Typhoon Soudelor kills 28 in China and Taiwan
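
Before fitting, a quick look at the data size and label balance gives a sense of how skewed the problem is (a minimal sketch; the outputs are not reproduced here):

```python
# Shapes of the train/test sets and the class balance of the target
print(df_train.shape, df_test.shape)
df_train['target'].value_counts(normalize=True)   # proportion of disaster (1) vs. non-disaster (0) tweets
```
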
# step1 -- pass (use the raw text column as-is; no preprocessing)
# step2 -- create the predictor
predictr = MultiModalPredictor(label='target')
# step3 -- fit on the labeled tweets
predictr.fit(df_train)
# step4 -- predict the labels of the test tweets
yhat = predictr.predict(df_test)
No path specified. Models will be saved in: "AutogluonModels/ag-20231218_074742/"
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
    2 unique label values:  [1, 0]
    If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Global seed set to 0
AutoMM starts to create your model. ✨

- AutoGluon version is 0.8.2.

- Pytorch version is 1.13.1+cu117.

- Model will be saved to "/home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742".

- Validation metric is "roc_auc".

- To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742
    ```

Enjoy your coffee, and let AutoMM do the job ☕☕☕ Learn more at https://auto.gluon.ai

0 GPUs are detected, and 0 GPUs will be used.

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

  | Name              | Type                | Params
----------------------------------------------------------
0 | model             | MultimodalFusionMLP | 109 M 
1 | validation_metric | BinaryAUROC         | 0     
2 | loss_func         | CrossEntropyLoss    | 0     
----------------------------------------------------------
109 M     Trainable params
0         Non-trainable params
109 M     Total params
439.134   Total estimated model params size (MB)
Epoch 0, global step 26: 'val_roc_auc' reached 0.81472 (best 0.81472), saving model to '/home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742/epoch=0-step=26.ckpt' as top 3
Epoch 0, global step 53: 'val_roc_auc' reached 0.87681 (best 0.87681), saving model to '/home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742/epoch=0-step=53.ckpt' as top 3
Epoch 1, global step 80: 'val_roc_auc' reached 0.87866 (best 0.87866), saving model to '/home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742/epoch=1-step=80.ckpt' as top 3
Epoch 1, global step 107: 'val_roc_auc' reached 0.89115 (best 0.89115), saving model to '/home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742/epoch=1-step=107.ckpt' as top 3
Epoch 2, global step 134: 'val_roc_auc' reached 0.88618 (best 0.89115), saving model to '/home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742/epoch=2-step=134.ckpt' as top 3
Epoch 2, global step 161: 'val_roc_auc' reached 0.88654 (best 0.89115), saving model to '/home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742/epoch=2-step=161.ckpt' as top 3
Epoch 3, global step 188: 'val_roc_auc' reached 0.89034 (best 0.89115), saving model to '/home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742/epoch=3-step=188.ckpt' as top 3
Epoch 3, global step 215: 'val_roc_auc' was not in top 3
Epoch 4, global step 242: 'val_roc_auc' was not in top 3
Epoch 4, global step 269: 'val_roc_auc' reached 0.89090 (best 0.89115), saving model to '/home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742/epoch=4-step=269.ckpt' as top 3
Epoch 5, global step 296: 'val_roc_auc' was not in top 3
Epoch 5, global step 323: 'val_roc_auc' was not in top 3
Epoch 6, global step 350: 'val_roc_auc' was not in top 3
Epoch 6, global step 377: 'val_roc_auc' was not in top 3
Start to fuse 3 checkpoints via the greedy soup algorithm.
AutoMM has created your model 🎉🎉🎉

- To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742")
    ```

- You can open a terminal and launch Tensorboard to visualize the training log:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/coco/Dropbox/Class/STBDA23/posts/AutogluonModels/ag-20231218_074742
    ```

- If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub: https://github.com/autogluon/autogluon

  • This takes quite a while.. one way to shorten the run is sketched below.
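
If the full run is too slow, `MultiModalPredictor.fit` accepts a `time_limit` (in seconds) and a `presets` argument that trade accuracy for speed. A minimal sketch, assuming the same `df_train`; the preset name and time budget are illustrative, not what was run above:

```python
# Faster (usually slightly weaker) fit: cap training time and use a lighter preset.
quick_predictr = MultiModalPredictor(label='target')
quick_predictr.fit(df_train, presets='medium_quality', time_limit=600)   # 600 seconds = 10 minutes
```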

5. Submission

sample_submission
         id  target
0         0       0
1         2       0
2         3       0
3         9       0
4        11       0
...     ...     ...
3258  10861       0
3259  10865       0
3260  10868       0
3261  10874       0
3262  10875       0

[3263 rows × 2 columns]

sample_submission['target'] = yhat 
sample_submission.to_csv("submission.csv",index=False)
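
Before uploading, a quick sanity check that the predicted labels look plausible (using the objects already created above):

```python
# How many test tweets were predicted as disasters (1) vs. not (0)?
sample_submission['target'].value_counts()
```
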
!kaggle competitions submit -c nlp-getting-started -f submission.csv -m "AutoGluon, MultiModalPredictor"
Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/coco/.kaggle/kaggle.json'
100%|██████████████████████████████████████| 22.2k/22.2k [00:01<00:00, 12.2kB/s]
Successfully submitted to Natural Language Processing with Disaster Tweets
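
The public score can also be checked from the command line after the upload; the Kaggle CLI lists recent submissions for the competition:

```python
# List recent submissions (and their public scores) for this competition
!kaggle competitions submissions -c nlp-getting-started
```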

250/1094
0.22851919561243145

A result at this level (roughly the top 23% of the leaderboard) is reasonable.
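
To estimate performance without spending a Kaggle submission, one option is to hold out part of the training data and score locally. A minimal sketch under that assumption (the split, time budget, and metrics are illustrative; this was not part of the run above):

```python
from sklearn.model_selection import train_test_split
from autogluon.multimodal import MultiModalPredictor

# Hold out 20% of the labeled tweets for local evaluation
tr, val = train_test_split(df_train, test_size=0.2, random_state=42, stratify=df_train['target'])

local_predictr = MultiModalPredictor(label='target')
local_predictr.fit(tr, time_limit=600)                        # illustrative time budget
print(local_predictr.evaluate(val, metrics=['acc', 'f1']))    # the competition is scored with F1
```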