#pip install autogluon
This material is from Prof. 최규빈's Special Topics in Big Data Analysis course at Jeonbuk National University (2nd semester, 2023).
02wk-008: Titanic, Autogluon (best_quality)
최규빈
2023-09-12
1. Lecture video
https://youtu.be/playlist?list=PLQqh36zP38-x6USW3HM9Lm-B19o9qrm19&si=EFy8hdlgDJ-LUFHi
2. Import
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/titanic/train.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/gender_submission.csv
from autogluon.tabular import TabularDataset, TabularPredictor
3. Analysis procedure
A. Data
- Analogy: this step is like being handed the problems to solve.
tr = TabularDataset("~/Desktop/titanic/train.csv")
tst = TabularDataset("~/Desktop/titanic/test.csv")
Loaded data from: ~/Desktop/titanic/train.csv | Columns = 12 / 12 | Rows = 891 -> 891
Loaded data from: ~/Desktop/titanic/test.csv | Columns = 11 / 11 | Rows = 418 -> 418
tst
| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | 
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S | 
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q | 
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S | 
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 413 | 1305 | 3 | Spector, Mr. Woolf | male | NaN | 0 | 0 | A.5. 3236 | 8.0500 | NaN | S | 
| 414 | 1306 | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C | 
| 415 | 1307 | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S | 
| 416 | 1308 | 3 | Ware, Mr. Frederick | male | NaN | 0 | 0 | 359309 | 8.0500 | NaN | S | 
| 417 | 1309 | 3 | Peter, Master. Michael J | male | NaN | 1 | 1 | 2668 | 22.3583 | NaN | C | 
418 rows × 11 columns
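The loader output reports 12 columns for the training file and 11 for the test file. A minimal sketch of where the difference comes from: the test file lacks only the label column `Survived`. The lists below are typed in by hand for illustration (the training columns are an assumption based on the test columns plus the label); with the loaded data one could compare `set(tr.columns) - set(tst.columns)` instead.

```python
# Column names of the test file, as shown in the table above.
test_columns = ["PassengerId", "Pclass", "Name", "Sex", "Age", "SibSp",
                "Parch", "Ticket", "Fare", "Cabin", "Embarked"]

# Assumption for illustration: the training file has the same columns
# plus the label column "Survived".
train_columns = ["PassengerId", "Survived"] + test_columns[1:]

# Columns that exist only in the training data.
only_in_train = set(train_columns) - set(test_columns)
print(len(train_columns), len(test_columns), only_in_train)
```

This is why the test set has to be "solved" by the predictor: the answers are not in the file.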
B. Creating the predictor
- Analogy: this step is like creating the student who will solve the problems.
predictr = TabularPredictor("Survived")
No path specified. Models will be saved in: "AutogluonModels/ag-20230917_141828/"
C. Fitting (fit)
- Analogy: this step is like the student studying.
- Training
predictr.fit(tr,presets='best_quality') # give the problems (tr) to the student (predictr) and have it study (predictr.fit())
Presets specified: ['best_quality']
Stack configuration (auto_stack=True): num_stack_levels=0, num_bag_folds=8, num_bag_sets=1
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20230917_141828/"
AutoGluon Version:  0.8.2
Python Version:     3.8.18
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2
Disk Space Avail:   775.46 GB / 982.82 GB (78.9%)
Train Data Rows:    891
Train Data Columns: 11
Label Column: Survived
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
    2 unique label values:  [0, 1]
    If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
    Available Memory:                    37750.16 MB
    Train Data (Original)  Memory Usage: 0.31 MB (0.0% of available memory)
    Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
    Stage 1 Generators:
        Fitting AsTypeFeatureGenerator...
            Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
    Stage 2 Generators:
        Fitting FillNaFeatureGenerator...
    Stage 3 Generators:
        Fitting IdentityFeatureGenerator...
        Fitting CategoryFeatureGenerator...
            Fitting CategoryMemoryMinimizeFeatureGenerator...
        Fitting TextSpecialFeatureGenerator...
            Fitting BinnedFeatureGenerator...
            Fitting DropDuplicatesFeatureGenerator...
        Fitting TextNgramFeatureGenerator...
            Fitting CountVectorizer for text features: ['Name']
            CountVectorizer fit with vocabulary size = 8
    Stage 4 Generators:
        Fitting DropUniqueFeatureGenerator...
    Stage 5 Generators:
        Fitting DropDuplicatesFeatureGenerator...
    Types of features in original data (raw dtype, special dtypes):
        ('float', [])        : 2 | ['Age', 'Fare']
        ('int', [])          : 4 | ['PassengerId', 'Pclass', 'SibSp', 'Parch']
        ('object', [])       : 4 | ['Sex', 'Ticket', 'Cabin', 'Embarked']
        ('object', ['text']) : 1 | ['Name']
    Types of features in processed data (raw dtype, special dtypes):
        ('category', [])                    : 3 | ['Ticket', 'Cabin', 'Embarked']
        ('float', [])                       : 2 | ['Age', 'Fare']
        ('int', [])                         : 4 | ['PassengerId', 'Pclass', 'SibSp', 'Parch']
        ('int', ['binned', 'text_special']) : 9 | ['Name.char_count', 'Name.word_count', 'Name.capital_ratio', 'Name.lower_ratio', 'Name.special_ratio', ...]
        ('int', ['bool'])                   : 1 | ['Sex']
        ('int', ['text_ngram'])             : 9 | ['__nlp__.henry', '__nlp__.john', '__nlp__.master', '__nlp__.miss', '__nlp__.mr', ...]
    0.1s = Fit runtime
    11 features in original data used to generate 28 features in processed data.
    Train Data (Processed) Memory Usage: 0.07 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.16s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
    To change this, specify the eval_metric parameter of Predictor()
User-specified model hyperparameters to be fit:
{
    'NN_TORCH': {},
    'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
    'CAT': {},
    'XGB': {},
    'FASTAI': {},
    'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif_BAG_L1 ...
Exception ignored on calling ctypes callback function: <function _ThreadpoolInfo._find_modules_with_dl_iterate_phdr.<locals>.match_module_callback at 0x7f20705933a0>
Traceback (most recent call last):
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 400, in match_module_callback
    self._make_module_from_path(filepath)
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 515, in _make_module_from_path
    module = module_class(filepath, prefix, user_api, internal_api)
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 606, in __init__
    self.version = self.get_version()
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 646, in get_version
    config = get_config().split()
AttributeError: 'NoneType' object has no attribute 'split'
    0.6308   = Validation score   (accuracy)
    0.0s     = Training   runtime
    0.01s    = Validation runtime
Fitting model: KNeighborsDist_BAG_L1 ...
Exception ignored on calling ctypes callback function: <function _ThreadpoolInfo._find_modules_with_dl_iterate_phdr.<locals>.match_module_callback at 0x7f20705e85e0>
Traceback (most recent call last):
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 400, in match_module_callback
    self._make_module_from_path(filepath)
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 515, in _make_module_from_path
    module = module_class(filepath, prefix, user_api, internal_api)
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 606, in __init__
    self.version = self.get_version()
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 646, in get_version
    config = get_config().split()
AttributeError: 'NoneType' object has no attribute 'split'
    0.6364   = Validation score   (accuracy)
    0.0s     = Training   runtime
    0.01s    = Validation runtime
Fitting model: LightGBMXT_BAG_L1 ...
    Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
    0.835    = Validation score   (accuracy)
    0.64s    = Training   runtime
    0.02s    = Validation runtime
Fitting model: LightGBM_BAG_L1 ...
    Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
    0.8406   = Validation score   (accuracy)
    0.69s    = Training   runtime
    0.02s    = Validation runtime
Fitting model: RandomForestGini_BAG_L1 ...
    0.8373   = Validation score   (accuracy)
    0.25s    = Training   runtime
    0.06s    = Validation runtime
Fitting model: RandomForestEntr_BAG_L1 ...
    0.8361   = Validation score   (accuracy)
    0.26s    = Training   runtime
    0.06s    = Validation runtime
Fitting model: CatBoost_BAG_L1 ...
    Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
    0.8608   = Validation score   (accuracy)
    1.7s     = Training   runtime
    0.02s    = Validation runtime
Fitting model: ExtraTreesGini_BAG_L1 ...
    0.8294   = Validation score   (accuracy)
    0.26s    = Training   runtime
    0.06s    = Validation runtime
Fitting model: ExtraTreesEntr_BAG_L1 ...
    0.8328   = Validation score   (accuracy)
    0.25s    = Training   runtime
    0.06s    = Validation runtime
Fitting model: NeuralNetFastAI_BAG_L1 ...
    Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
    0.853    = Validation score   (accuracy)
    2.19s    = Training   runtime
    0.07s    = Validation runtime
Fitting model: XGBoost_BAG_L1 ...
    Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
    0.8406   = Validation score   (accuracy)
    0.55s    = Training   runtime
    0.04s    = Validation runtime
Fitting model: NeuralNetTorch_BAG_L1 ...
    Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
    0.8462   = Validation score   (accuracy)
    3.98s    = Training   runtime
    0.09s    = Validation runtime
Fitting model: LightGBMLarge_BAG_L1 ...
    Fitting 8 child models (S1F1 - S1F8) | Fitting with ParallelLocalFoldFittingStrategy
    0.8406   = Validation score   (accuracy)
    1.04s    = Training   runtime
    0.03s    = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
    0.8608   = Validation score   (accuracy)
    0.5s     = Training   runtime
    0.0s     = Validation runtime
AutoGluon training complete, total runtime = 19.49s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230917_141828/")
<autogluon.tabular.predictor.predictor.TabularPredictor at 0x7f212cb60f40>
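The log reports `num_bag_folds=8` and "Fitting 8 child models (S1F1 - S1F8)" for the bagged models. A simplified sketch of what an 8-fold split of the 891 training rows looks like: every row lands in exactly one validation fold, so each row can be scored by a child model that did not train on it. The helper `kfold_indices` is hypothetical and uses plain contiguous folds; AutoGluon's own splitting logic may shuffle or stratify the rows and is not reproduced here.

```python
def kfold_indices(n_rows, n_folds):
    """Split row indices 0..n_rows-1 into n_folds (train, validation) pairs."""
    base, extra = divmod(n_rows, n_folds)
    folds = []
    start = 0
    for k in range(n_folds):
        # The first `extra` folds get one additional row.
        size = base + (1 if k < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        folds.append((train_idx, val_idx))
        start += size
    return folds

folds = kfold_indices(n_rows=891, n_folds=8)
val_sizes = [len(val) for _, val in folds]
print(val_sizes)  # sizes of the 8 validation folds
```

Each child model trains on about seven eighths of the data and is validated on the remaining eighth, which is one way a validation score can be obtained without setting aside a separate hold-out set.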
- Check the leaderboard (grading the mock exam)
predictr.leaderboard()
                      model  score_val  pred_time_val  fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0           CatBoost_BAG_L1   0.860831       0.023981  1.700775                0.023981           1.700775            1       True          7
1       WeightedEnsemble_L2   0.860831       0.025379  2.204065                0.001398           0.503290            2       True         14
2    NeuralNetFastAI_BAG_L1   0.852974       0.067810  2.190649                0.067810           2.190649            1       True         10
3     NeuralNetTorch_BAG_L1   0.846240       0.086196  3.984127                0.086196           3.984127            1       True         12
4           LightGBM_BAG_L1   0.840629       0.023935  0.687209                0.023935           0.687209            1       True          4
5      LightGBMLarge_BAG_L1   0.840629       0.025161  1.040782                0.025161           1.040782            1       True         13
6            XGBoost_BAG_L1   0.840629       0.039286  0.545162                0.039286           0.545162            1       True         11
7   RandomForestGini_BAG_L1   0.837262       0.058148  0.252888                0.058148           0.252888            1       True          5
8   RandomForestEntr_BAG_L1   0.836139       0.058385  0.259539                0.058385           0.259539            1       True          6
9         LightGBMXT_BAG_L1   0.835017       0.022836  0.638991                0.022836           0.638991            1       True          3
10    ExtraTreesEntr_BAG_L1   0.832772       0.056459  0.251367                0.056459           0.251367            1       True          9
11    ExtraTreesGini_BAG_L1   0.829405       0.058829  0.257241                0.058829           0.257241            1       True          8
12    KNeighborsDist_BAG_L1   0.636364       0.012997  0.003672                0.012997           0.003672            1       True          2
13    KNeighborsUnif_BAG_L1   0.630752       0.011647  0.003759                0.011647           0.003759            1       True          1
| | model | score_val | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|
| 0 | CatBoost_BAG_L1 | 0.860831 | 0.023981 | 1.700775 | 0.023981 | 1.700775 | 1 | True | 7 | 
| 1 | WeightedEnsemble_L2 | 0.860831 | 0.025379 | 2.204065 | 0.001398 | 0.503290 | 2 | True | 14 | 
| 2 | NeuralNetFastAI_BAG_L1 | 0.852974 | 0.067810 | 2.190649 | 0.067810 | 2.190649 | 1 | True | 10 | 
| 3 | NeuralNetTorch_BAG_L1 | 0.846240 | 0.086196 | 3.984127 | 0.086196 | 3.984127 | 1 | True | 12 | 
| 4 | LightGBM_BAG_L1 | 0.840629 | 0.023935 | 0.687209 | 0.023935 | 0.687209 | 1 | True | 4 | 
| 5 | LightGBMLarge_BAG_L1 | 0.840629 | 0.025161 | 1.040782 | 0.025161 | 1.040782 | 1 | True | 13 | 
| 6 | XGBoost_BAG_L1 | 0.840629 | 0.039286 | 0.545162 | 0.039286 | 0.545162 | 1 | True | 11 | 
| 7 | RandomForestGini_BAG_L1 | 0.837262 | 0.058148 | 0.252888 | 0.058148 | 0.252888 | 1 | True | 5 | 
| 8 | RandomForestEntr_BAG_L1 | 0.836139 | 0.058385 | 0.259539 | 0.058385 | 0.259539 | 1 | True | 6 | 
| 9 | LightGBMXT_BAG_L1 | 0.835017 | 0.022836 | 0.638991 | 0.022836 | 0.638991 | 1 | True | 3 | 
| 10 | ExtraTreesEntr_BAG_L1 | 0.832772 | 0.056459 | 0.251367 | 0.056459 | 0.251367 | 1 | True | 9 | 
| 11 | ExtraTreesGini_BAG_L1 | 0.829405 | 0.058829 | 0.257241 | 0.058829 | 0.257241 | 1 | True | 8 | 
| 12 | KNeighborsDist_BAG_L1 | 0.636364 | 0.012997 | 0.003672 | 0.012997 | 0.003672 | 1 | True | 2 | 
| 13 | KNeighborsUnif_BAG_L1 | 0.630752 | 0.011647 | 0.003759 | 0.011647 | 0.003759 | 1 | True | 1 | 
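WeightedEnsemble_L2 sits at stack_level 2 and combines the predictions of the level-1 models. A minimal sketch of the idea, a weighted average of predicted probabilities, is shown below. The probabilities, weights, and the function `weighted_ensemble` are all made up for illustration; this is not AutoGluon's implementation, which chooses the weights itself during fit.

```python
def weighted_ensemble(probas_by_model, weights):
    """Weighted average of per-model probabilities for the positive class."""
    total = sum(weights.values())
    n_rows = len(next(iter(probas_by_model.values())))
    combined = []
    for i in range(n_rows):
        s = sum(weights[name] * probas[i]
                for name, probas in probas_by_model.items())
        combined.append(s / total)
    return combined

# Hypothetical probabilities of Survived = 1 for three passengers.
probas_by_model = {
    "model_a": [0.9, 0.2, 0.6],
    "model_b": [0.7, 0.4, 0.3],
}
# Hypothetical weights; a weight of 0 would drop a model from the ensemble.
weights = {"model_a": 0.75, "model_b": 0.25}

combined = weighted_ensemble(probas_by_model, weights)
# Turn the averaged probabilities into 0/1 labels with a 0.5 threshold.
labels = [1 if p >= 0.5 else 0 for p in combined]
print(labels)
```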
D. Prediction (predict)
- Analogy: this step is like solving problems after studying.
- Solve the training set (predict) \(\to\) check the score
(tr.Survived == predictr.predict(tr)).mean()
0.898989898989899
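The expression above compares the true labels with the predictions element by element and averages the resulting True/False values, which gives the accuracy. A small sketch of the same computation on made-up lists (the helper `accuracy` and the toy labels are hypothetical) also shows that the reported score corresponds to 801 correct answers out of 891 rows. Since these are the rows the model was trained on, this score is typically optimistic compared with the validation scores in the leaderboard.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = [t == p for t, p in zip(y_true, y_pred)]
    return sum(matches) / len(matches)

# Toy labels and predictions, made up for illustration.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(accuracy(y_true, y_pred))  # 4 of 5 match

# 801 correct out of 891 rows reproduces the score reported above.
print(801 / 891)
```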
- Solve the test set (predict) \(\to\) submit the result to Kaggle to check the score
tst.assign(Survived = predictr.predict(tst)).loc[:,['PassengerId','Survived']]\
.to_csv("autogluon(best_quality)_submission.csv",index=False)
4. Homework
- Capture a screenshot of your Kaggle submission result and submit it to the LMS