13wk-48: Ice Cream / Data Analysis (AutoGluon)
최규빈
2023-12-01
1. Lecture Video
https://youtu.be/playlist?list=PLQqh36zP38-xD3gCT0L9vDSnXHHEpxP0E&si=PUo87ILydVK7moIk
2. Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.metrics
#---#
import pickle
from autogluon.tabular import TabularPredictor
import autogluon.eda.auto as auto
#---#
import warnings
warnings.filterwarnings('ignore')
3. Data
temp = pd.read_csv('https://raw.githubusercontent.com/guebin/DV2022/master/posts/temp.csv').iloc[:,3].to_numpy()[:100]
temp.sort()
np.random.seed(43052)
eps = np.random.randn(100)*3 # noise
icecream_sales = 20 + temp * 2.5 + eps
df_train = pd.DataFrame({'temp':temp,'sales':icecream_sales})
df_train
|   | temp | sales |
|---|---|---|
| 0 | -4.1 | 10.900261 |
| 1 | -3.7 | 14.002524 |
| 2 | -3.0 | 15.928335 |
| 3 | -1.3 | 17.673681 |
| 4 | -0.5 | 19.463362 |
| ... | ... | ... |
| 95 | 12.4 | 54.926065 |
| 96 | 13.4 | 54.716129 |
| 97 | 14.7 | 56.194791 |
| 98 | 15.0 | 60.666163 |
| 99 | 15.2 | 61.561043 |
100 rows × 2 columns
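Since the data were generated as sales = 20 + 2.5·temp + noise, a quick least-squares fit should roughly recover those coefficients. A minimal sanity-check sketch:

```python
# Sanity check: OLS should recover (slope, intercept) ≈ (2.5, 20), up to noise.
slope, intercept = np.polyfit(df_train.temp, df_train.sales, deg=1)
print(f"slope ≈ {slope:.3f}, intercept ≈ {intercept:.3f}")
```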
sns.scatterplot(df_train, x='temp', y='sales')
(figure: scatter plot of sales vs. temp)
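Because we know the true mean function used to generate the data, it can be overlaid on the scatter for reference; a quick sketch:

```python
# Overlay the true regression line sales = 20 + 2.5*temp on the scatter.
sns.scatterplot(df_train, x='temp', y='sales', label='y')
sns.lineplot(x=df_train.temp, y=20 + 2.5*df_train.temp, color='C1', label='true mean')
plt.legend()
```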
4. Fitting
A. The usual code
# step1 -- pass (no separate preprocessing needed here)
# step2 -- create a predictor for the 'sales' label
predictr = TabularPredictor(label='sales')
# step3 -- fit on the training data
predictr.fit(df_train)
# step4 -- predict (fitted values on the training data)
yhat = predictr.predict(df_train)
No path specified. Models will be saved in: "AutogluonModels/ag-20231203_071110/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20231203_071110/"
AutoGluon Version: 0.8.2
Python Version: 3.8.18
Operating System: Linux
Platform Machine: x86_64
Platform Version: #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2
Disk Space Avail: 673.03 GB / 982.82 GB (68.5%)
Train Data Rows: 100
Train Data Columns: 1
Label Column: sales
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
Label info (max, min, mean, stddev): (61.561043278721556, 10.90026146402572, 33.97342, 10.63375)
If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 107392.5 MB
Train Data (Original) Memory Usage: 0.0 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('float', []) : 1 | ['temp']
Types of features in processed data (raw dtype, special dtypes):
('float', []) : 1 | ['temp']
0.0s = Fit runtime
1 features in original data used to generate 1 features in processed data.
Train Data (Processed) Memory Usage: 0.0 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.03s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 80, Val Rows: 20
User-specified model hyperparameters to be fit:
{
'NN_TORCH': {},
'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
'CAT': {},
'XGB': {},
'FASTAI': {},
'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif ...
-4.2111 = Validation score (-root_mean_squared_error)
0.01s = Training runtime
0.01s = Validation runtime
Fitting model: KNeighborsDist ...
-4.6714 = Validation score (-root_mean_squared_error)
0.0s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBMXT ...
-35.2477 = Validation score (-root_mean_squared_error)
0.17s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBM ...
-5.3708 = Validation score (-root_mean_squared_error)
0.1s = Training runtime
0.0s = Validation runtime
Fitting model: RandomForestMSE ...
-4.4041 = Validation score (-root_mean_squared_error)
0.2s = Training runtime
0.01s = Validation runtime
Fitting model: CatBoost ...
-3.8364 = Validation score (-root_mean_squared_error)
0.18s = Training runtime
0.0s = Validation runtime
Fitting model: ExtraTreesMSE ...
-4.2375 = Validation score (-root_mean_squared_error)
0.21s = Training runtime
0.01s = Validation runtime
Fitting model: NeuralNetFastAI ...
-3.7128 = Validation score (-root_mean_squared_error)
0.4s = Training runtime
0.0s = Validation runtime
Fitting model: XGBoost ...
-4.0555 = Validation score (-root_mean_squared_error)
0.06s = Training runtime
0.0s = Validation runtime
Fitting model: NeuralNetTorch ...
-3.4399 = Validation score (-root_mean_squared_error)
0.49s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBMLarge ...
-3.979 = Validation score (-root_mean_squared_error)
0.12s = Training runtime
0.0s = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
-3.4399 = Validation score (-root_mean_squared_error)
0.15s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 2.22s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20231203_071110/")
[1000] valid_set's rmse: 5.45375
[2000] valid_set's rmse: 5.40245
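The log above shows where the fitted predictor was saved and how to reload it later without refitting. A sketch using this run's path (the timestamped folder name changes on every run):

```python
# Reload the fitted predictor from disk; path copied from the log above.
predictr2 = TabularPredictor.load("AutogluonModels/ag-20231203_071110/")
yhat2 = predictr2.predict(df_train)  # same predictions as the original predictr
```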
B. Visualizing the fitted results
sns.scatterplot(df_train, x='temp',y='sales',label='y')
sns.lineplot(df_train, x='temp',y=yhat,color='C1',linestyle='--',label='yhat')
plt.legend()
(figure: training data with the AutoGluon fit overlaid as a dashed line)
- A good fit?
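To put a number on "a good fit", TabularPredictor also provides an evaluate() method that scores a labeled dataset; a sketch, here applied to the training data (so this is an in-sample score):

```python
# In-sample metrics; AutoGluon reports scores in higher-is-better form,
# so RMSE comes back negated (as noted in the training log above).
predictr.evaluate(df_train)
```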
C. Checking the models
predictr.leaderboard(silent=True)
|   | model | score_val | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|
| 0 | NeuralNetTorch | -3.439941 | 0.002146 | 0.485392 | 0.002146 | 0.485392 | 1 | True | 10 |
| 1 | WeightedEnsemble_L2 | -3.439941 | 0.002357 | 0.635780 | 0.000212 | 0.150388 | 2 | True | 12 |
| 2 | NeuralNetFastAI | -3.712791 | 0.003752 | 0.395201 | 0.003752 | 0.395201 | 1 | True | 8 |
| 3 | CatBoost | -3.836449 | 0.000815 | 0.176307 | 0.000815 | 0.176307 | 1 | True | 6 |
| 4 | LightGBMLarge | -3.978956 | 0.000636 | 0.123188 | 0.000636 | 0.123188 | 1 | True | 11 |
| 5 | XGBoost | -4.055491 | 0.001284 | 0.062169 | 0.001284 | 0.062169 | 1 | True | 9 |
| 6 | KNeighborsUnif | -4.211090 | 0.005753 | 0.010933 | 0.005753 | 0.010933 | 1 | True | 1 |
| 7 | ExtraTreesMSE | -4.237516 | 0.014664 | 0.206832 | 0.014664 | 0.206832 | 1 | True | 7 |
| 8 | RandomForestMSE | -4.404096 | 0.013223 | 0.202250 | 0.013223 | 0.202250 | 1 | True | 5 |
| 9 | KNeighborsDist | -4.671405 | 0.002980 | 0.002525 | 0.002980 | 0.002525 | 1 | True | 2 |
| 10 | LightGBM | -5.370826 | 0.000522 | 0.100535 | 0.000522 | 0.100535 | 1 | True | 4 |
| 11 | LightGBMXT | -35.247682 | 0.000819 | 0.170966 | 0.000819 | 0.170966 | 1 | True | 3 |
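The leaderboard is sorted by validation score, and the model that predict() uses by default can also be queried directly; a sketch assuming the 0.8-era get_model_best() API:

```python
# Name of the best model (what predict() uses when model= is not given).
predictr.get_model_best()  # 'WeightedEnsemble_L2' in this run
```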
D. Computing the best model's r2_score
- Computing r2_score – method 1
_y = df_train.sales
_yhat = predictr.predict(df_train)
sklearn.metrics.r2_score(_y,_yhat)
0.929782932976498
- Computing r2_score – method 2
_y = df_train.sales
_yhat = predictr.predict(df_train,model='NeuralNetTorch')
sklearn.metrics.r2_score(_y,_yhat)
0.929782932976498
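Methods 1 and 2 return the same value, which suggests the weighted ensemble here effectively reduces to NeuralNetTorch alone (their validation scores on the leaderboard are identical too). The same comparison can be looped over every fitted model; a sketch assuming the 0.8-era get_model_names():

```python
# Training-set r2 for every fitted model.
_y = df_train.sales
for name in predictr.get_model_names():
    _yhat = predictr.predict(df_train, model=name)
    print(f"{name}: {sklearn.metrics.r2_score(_y, _yhat):.4f}")
```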
E. Inspecting a specific model's fitted values
- Curious about XGBoost..
_y = df_train.sales
_yhat = predictr.predict(df_train,model='XGBoost')
sklearn.metrics.r2_score(_y,_yhat)
0.9516437954914487
- Note that XGBoost's training-set r2 is even higher than the best model's, yet the leaderboard ranks it lower: model selection is based on the validation score, where XGBoost does worse.
sns.scatterplot(df_train, x='temp', y='sales', label='y')
sns.lineplot(df_train, x='temp',y=_yhat,color='C1',linestyle='--',label='yhat')
ax = plt.gca()
ax.set_title("XGBoost")
plt.legend()
(figure: training data with the XGBoost fit overlaid as a dashed line, titled "XGBoost")
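The same inspection works for any model on the leaderboard; a small helper sketch (plot_model_fit is a hypothetical name, not an AutoGluon API):

```python
# Hypothetical helper: scatter the data, overlay one model's fitted values.
def plot_model_fit(model_name):
    _yhat = predictr.predict(df_train, model=model_name)
    sns.scatterplot(df_train, x='temp', y='sales', label='y')
    sns.lineplot(df_train, x='temp', y=_yhat, color='C1', linestyle='--', label='yhat')
    plt.gca().set_title(model_name)
    plt.legend()

plot_model_fit('CatBoost')  # any model name from the leaderboard works
```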