#!pip install autogluon.eda13wk-51: 아이스크림(type무의미) / 자료분석(Autogluon)
최규빈
2023-12-01
1. 강의영상
https://youtu.be/playlist?list=PLQqh36zP38-yqIX6EyIErVtflwk7p8LeM&si=v_1Q2rBbNPCgSnta
2. Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#---#
from autogluon.tabular import TabularPredictor
import autogluon.eda.auto as auto
#---#
import warnings
warnings.filterwarnings('ignore')3. Data
df_train = pd.read_csv('https://raw.githubusercontent.com/guebin/MP2023/main/posts/mid/icesales_train.csv')df_train.head()| temp | type | sales | |
|---|---|---|---|
| 0 | 19.4 | choco | 64.807407 |
| 1 | 0.9 | vanilla | 25.656697 |
| 2 | 7.4 | vanilla | 34.756650 |
| 3 | 4.5 | choco | 27.265442 |
| 4 | 21.1 | choco | 70.606946 |
sns.scatterplot(df_train,x='temp',y='sales',hue='type')<AxesSubplot: xlabel='temp', ylabel='sales'>

4. 적합
# step1 -- pass
# step2
predictr = TabularPredictor(label='sales')
# step3
predictr.fit(df_train)
# step4
yhat = predictr.predict(df_train)No path specified. Models will be saved in: "AutogluonModels/ag-20231203_074129/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20231203_074129/"
AutoGluon Version: 0.8.2
Python Version: 3.8.18
Operating System: Linux
Platform Machine: x86_64
Platform Version: #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2
Disk Space Avail: 673.31 GB / 982.82 GB (68.5%)
Train Data Rows: 280
Train Data Columns: 2
Label Column: sales
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
Label info (max, min, mean, stddev): (88.99437629756306, 10.335207096486446, 51.10189, 21.16757)
If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 106466.93 MB
Train Data (Original) Memory Usage: 0.02 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('float', []) : 1 | ['temp']
('object', []) : 1 | ['type']
Types of features in processed data (raw dtype, special dtypes):
('float', []) : 1 | ['temp']
('int', ['bool']) : 1 | ['type']
0.0s = Fit runtime
2 features in original data used to generate 2 features in processed data.
Train Data (Processed) Memory Usage: 0.0 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.04s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 224, Val Rows: 56
User-specified model hyperparameters to be fit:
{
'NN_TORCH': {},
'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
'CAT': {},
'XGB': {},
'FASTAI': {},
'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif ...
-2.7316 = Validation score (-root_mean_squared_error)
0.01s = Training runtime
0.01s = Validation runtime
Fitting model: KNeighborsDist ...
-3.5558 = Validation score (-root_mean_squared_error)
0.0s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBMXT ...
-3.1036 = Validation score (-root_mean_squared_error)
0.17s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBM ...
-3.0864 = Validation score (-root_mean_squared_error)
0.12s = Training runtime
0.0s = Validation runtime
Fitting model: RandomForestMSE ...
-2.9027 = Validation score (-root_mean_squared_error)
0.19s = Training runtime
0.02s = Validation runtime
Fitting model: CatBoost ...
-2.7878 = Validation score (-root_mean_squared_error)
0.22s = Training runtime
0.0s = Validation runtime
Fitting model: ExtraTreesMSE ...
-2.88 = Validation score (-root_mean_squared_error)
0.19s = Training runtime
0.02s = Validation runtime
Fitting model: NeuralNetFastAI ...
-2.6541 = Validation score (-root_mean_squared_error)
0.89s = Training runtime
0.0s = Validation runtime
Fitting model: XGBoost ...
-3.061 = Validation score (-root_mean_squared_error)
0.07s = Training runtime
0.0s = Validation runtime
Fitting model: NeuralNetTorch ...
-2.6289 = Validation score (-root_mean_squared_error)
0.6s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBMLarge ...
-2.9062 = Validation score (-root_mean_squared_error)
0.18s = Training runtime
0.0s = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
-2.5618 = Validation score (-root_mean_squared_error)
0.15s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 2.97s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20231203_074129/")
[1000] valid_set's rmse: 3.11194
sns.scatterplot(df_train,x='temp',y='sales',hue='type',alpha=0.3)
sns.lineplot(df_train,x='temp',y=yhat,hue='type')<AxesSubplot: xlabel='temp', ylabel='sales'>

predictr.leaderboard(silent=True)| model | score_val | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | WeightedEnsemble_L2 | -2.561803 | 0.029295 | 1.844234 | 0.000218 | 0.150833 | 2 | True | 12 |
| 1 | NeuralNetTorch | -2.628865 | 0.002556 | 0.603938 | 0.002556 | 0.603938 | 1 | True | 10 |
| 2 | NeuralNetFastAI | -2.654070 | 0.004787 | 0.889270 | 0.004787 | 0.889270 | 1 | True | 8 |
| 3 | KNeighborsUnif | -2.731556 | 0.006626 | 0.011376 | 0.006626 | 0.011376 | 1 | True | 1 |
| 4 | CatBoost | -2.787790 | 0.000773 | 0.223884 | 0.000773 | 0.223884 | 1 | True | 6 |
| 5 | ExtraTreesMSE | -2.879997 | 0.015717 | 0.192446 | 0.015717 | 0.192446 | 1 | True | 7 |
| 6 | RandomForestMSE | -2.902703 | 0.015109 | 0.188818 | 0.015109 | 0.188818 | 1 | True | 5 |
| 7 | LightGBMLarge | -2.906174 | 0.000660 | 0.182818 | 0.000660 | 0.182818 | 1 | True | 11 |
| 8 | XGBoost | -3.061021 | 0.001390 | 0.066261 | 0.001390 | 0.066261 | 1 | True | 9 |
| 9 | LightGBM | -3.086423 | 0.000620 | 0.123639 | 0.000620 | 0.123639 | 1 | True | 4 |
| 10 | LightGBMXT | -3.103597 | 0.001114 | 0.171221 | 0.001114 | 0.171221 | 1 | True | 3 |
| 11 | KNeighborsDist | -3.555839 | 0.003564 | 0.003017 | 0.003564 | 0.003017 | 1 | True | 2 |
5. 해석 및 시각화
A. y의 분포, (X,y)의 관계 시각화
auto.target_analysis(
train_data=df_train,
label='sales',
fit_distributions=False
)Target variable analysis
| count | mean | std | min | 25% | 50% | 75% | max | dtypes | unique | missing_count | missing_ratio | raw_type | special_types | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sales | 280 | 51.101886 | 21.167573 | 10.335207 | 33.053077 | 47.844021 | 70.451589 | 88.994376 | float64 | 280 | float |

Target variable correlations
train_data - spearman correlation matrix; focus: absolute correlation for sales >= 0.5

Feature interaction between temp/sales in train_data

B. 중요한 설명변수
auto.quick_fit(
train_data=df_train,
label='sales',
show_feature_importance_barplots=True
)No path specified. Models will be saved in: "AutogluonModels/ag-20231203_074133/"
Model Prediction for sales
Using validation data for Test points

Model Leaderboard
| model | score_test | score_val | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LightGBMXT | -3.549018 | -4.206044 | 0.002413 | 0.00081 | 0.149133 | 0.002413 | 0.00081 | 0.149133 | 1 | True | 1 |
Feature Importance for Trained Model
| importance | stddev | p_value | n | p99_high | p99_low | |
|---|---|---|---|---|---|---|
| temp | 25.164809 | 1.617020 | 0.000002 | 5 | 28.494276 | 21.835342 |
| type | -0.048470 | 0.059119 | 0.929654 | 5 | 0.073258 | -0.170197 |

Rows with the highest prediction error
Rows in this category worth inspecting for the causes of the error
| temp | type | sales | sales_pred | error | |
|---|---|---|---|---|---|
| 73 | -3.7 | vanilla | 12.432354 | 24.724379 | 12.292025 |
| 191 | -0.3 | vanilla | 16.436525 | 24.724379 | 8.287854 |
| 218 | 14.7 | choco | 60.178468 | 52.961044 | 7.217424 |
| 166 | 16.1 | choco | 66.821367 | 59.932861 | 6.888506 |
| 5 | 23.2 | vanilla | 75.697957 | 82.155197 | 6.457240 |
| 118 | 8.3 | choco | 45.364110 | 38.923119 | 6.440991 |
| 198 | 4.4 | vanilla | 24.924572 | 31.039103 | 6.114530 |
| 7 | 11.2 | choco | 45.593168 | 51.416027 | 5.822859 |
| 89 | 25.7 | vanilla | 87.788320 | 82.155197 | 5.633123 |
| 109 | 2.0 | vanilla | 19.398204 | 24.724379 | 5.326174 |
C. 관측치별 해석
auto.explain_rows(
train_data=df_train,
model=predictr,
rows=df_train.iloc[[0]],
display_rows=True,
plot='waterfall'
) | temp | type | sales | |
|---|---|---|---|
| 0 | 19.4 | choco | 64.807407 |

auto.explain_rows(
train_data=df_train,
model=predictr,
rows=df_train.iloc[[1]],
display_rows=True,
plot='waterfall'
) | temp | type | sales | |
|---|---|---|---|
| 1 | 0.9 | vanilla | 25.656697 |
