#!pip install autogluon.eda
13wk-50: Ice Cream (Outliers) / Data Analysis (Autogluon)
최규빈
2023-12-01
1. Lecture Video
https://youtu.be/playlist?list=PLQqh36zP38-z2-S-MmANJvtOOfd7LMcTF&si=SJ-QPCnXqnz8OwZe
2. Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#---#"
from autogluon.tabular import TabularPredictor
import autogluon.eda.auto as auto
#---#
import warnings
warnings.filterwarnings('ignore')
3. Data
np.random.seed(43052)
temp = pd.read_csv('https://raw.githubusercontent.com/guebin/DV2022/master/posts/temp.csv').iloc[:100,3].to_numpy()
temp.sort()
ice_sales = 10 + temp * 0.5 + np.random.randn(100)
ice_sales[0] = 50  # inject an outlier at the first observation
df_train = pd.DataFrame({'temp':temp,'ice_sales':ice_sales})
df_train
|    | temp | ice_sales |
|----|------|-----------|
| 0  | -4.1 | 50.000000 |
| 1  | -3.7 | 9.234175  |
| 2  | -3.0 | 9.642778  |
| 3  | -1.3 | 9.657894  |
| 4  | -0.5 | 9.987787  |
| ... | ... | ...      |
| 95 | 12.4 | 17.508688 |
| 96 | 13.4 | 17.105376 |
| 97 | 14.7 | 17.164930 |
| 98 | 15.0 | 18.555388 |
| 99 | 15.2 | 18.787014 |

100 rows × 2 columns
sns.scatterplot(df_train,x='temp',y='ice_sales')
<AxesSubplot: xlabel='temp', ylabel='ice_sales'>
- An outlier! (the point injected above: temp = -4.1, ice_sales = 50)
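Before any model is fit, this point can already be flagged programmatically. The sketch below is an addition for illustration (not part of the original notes); it applies the common 1.5×IQR whisker rule to `ice_sales`, reusing `df_train` from above:

```python
# Flag observations outside the 1.5*IQR whiskers of ice_sales (a common rule of thumb).
q1, q3 = df_train['ice_sales'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df_train['ice_sales'] < q1 - 1.5 * iqr) | (df_train['ice_sales'] > q3 + 1.5 * iqr)
df_train[mask]  # with the summary statistics shown in Section 5, this flags row 0 (ice_sales = 50)
```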
4. Fitting
# step1 -- pass (the data is already prepared)
# step2 -- create the predictor
predictr = TabularPredictor(label='ice_sales')
# step3 -- fit
predictr.fit(df_train)
# step4 -- predict
yhat = predictr.predict(df_train)
No path specified. Models will be saved in: "AutogluonModels/ag-20231203_073224/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20231203_073224/"
AutoGluon Version: 0.8.2
Python Version: 3.8.18
Operating System: Linux
Platform Machine: x86_64
Platform Version: #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2
Disk Space Avail: 673.31 GB / 982.82 GB (68.5%)
Train Data Rows: 100
Train Data Columns: 1
Label Column: ice_sales
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
Label info (max, min, mean, stddev): (50.0, 8.273155164108418, 13.17881, 4.33788)
If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 106861.72 MB
Train Data (Original) Memory Usage: 0.0 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('float', []) : 1 | ['temp']
Types of features in processed data (raw dtype, special dtypes):
('float', []) : 1 | ['temp']
0.0s = Fit runtime
1 features in original data used to generate 1 features in processed data.
Train Data (Processed) Memory Usage: 0.0 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.03s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 80, Val Rows: 20
User-specified model hyperparameters to be fit:
{
'NN_TORCH': {},
'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
'CAT': {},
'XGB': {},
'FASTAI': {},
'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif ...
-2.2756 = Validation score (-root_mean_squared_error)
0.01s = Training runtime
0.01s = Validation runtime
Fitting model: KNeighborsDist ...
-2.6883 = Validation score (-root_mean_squared_error)
0.0s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBMXT ...
-12.8059 = Validation score (-root_mean_squared_error)
0.13s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBM ...
-1.6843 = Validation score (-root_mean_squared_error)
0.11s = Training runtime
0.0s = Validation runtime
Fitting model: RandomForestMSE ...
-2.562 = Validation score (-root_mean_squared_error)
0.2s = Training runtime
0.01s = Validation runtime
Fitting model: CatBoost ...
-1.2192 = Validation score (-root_mean_squared_error)
0.2s = Training runtime
0.0s = Validation runtime
Fitting model: ExtraTreesMSE ...
-1.8631 = Validation score (-root_mean_squared_error)
0.2s = Training runtime
0.01s = Validation runtime
Fitting model: NeuralNetFastAI ...
-1.752 = Validation score (-root_mean_squared_error)
0.36s = Training runtime
0.0s = Validation runtime
Fitting model: XGBoost ...
-1.3017 = Validation score (-root_mean_squared_error)
0.06s = Training runtime
0.0s = Validation runtime
Fitting model: NeuralNetTorch ...
-1.1863 = Validation score (-root_mean_squared_error)
0.29s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBMLarge ...
-2.0798 = Validation score (-root_mean_squared_error)
0.12s = Training runtime
0.0s = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
-1.1737 = Validation score (-root_mean_squared_error)
0.16s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 1.98s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20231203_073224/")
[1000] valid_set's rmse: 1.69432
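Which models ended up in the final ensemble can be checked with the leaderboard; a quick optional check (output omitted here):

```python
# Validation scores of all fitted models, best first.
predictr.leaderboard(silent=True)
```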
sns.scatterplot(df_train,x='temp',y='ice_sales',alpha=0.2)
sns.lineplot(df_train,x='temp',y=yhat)
<AxesSubplot: xlabel='temp', ylabel='ice_sales'>
5. Interpretation and Visualization
A. Visualizing the distribution of y and the (X, y) relationship
auto.target_analysis(
    train_data=df_train,
    label='ice_sales',
    fit_distributions=False
)
Target variable analysis
|           | count | mean      | std      | min      | 25%       | 50%       | 75%       | max  | dtypes  | unique | missing_count | missing_ratio | raw_type | special_types |
|-----------|-------|-----------|----------|----------|-----------|-----------|-----------|------|---------|--------|---------------|---------------|----------|---------------|
| ice_sales | 100   | 13.178805 | 4.337878 | 8.273155 | 11.296645 | 12.856589 | 14.294614 | 50.0 | float64 | 100    |               |               | float    |               |
Target variable correlations

train_data - spearman correlation matrix; focus: absolute correlation for ice_sales >= 0.5

Feature interaction between temp/ice_sales in train_data
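The Spearman correlation reported above can be reproduced directly with pandas, e.g.:

```python
# Rank-based (Spearman) correlation, the same statistic target_analysis reports.
df_train.corr(method='spearman')
```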
B. Important explanatory variables?
pass # skipped -- there is only one explanatory variable
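For reference, AutoGluon's permutation-based importance could still be computed even here, though with a single feature the result is trivial:

```python
# Permutation feature importance (trivial with only one feature).
predictr.feature_importance(df_train)
```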
C. Interpretation of individual observations
df_train.iloc[[0]]
|   | temp | ice_sales |
|---|------|-----------|
| 0 | -4.1 | 50.0      |
predictr.predict(df_train.iloc[[0]])
0 18.292442
Name: ice_sales, dtype: float32
auto.explain_rows(
    train_data=df_train,
    model=predictr,
    rows=df_train.iloc[[0]],
    display_rows=True,
    plot='waterfall'
)
|   | temp | ice_sales |
|---|------|-----------|
| 0 | -4.1 | 50.0      |
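As a follow-up, one could quantify how much the outlier distorts the fit by refitting without row 0 and predicting at temp = -4.1; by the generating formula 10 + 0.5·temp used in Section 3, the non-outlying value there would be near 8. A sketch (an addition, output not shown):

```python
# Refit without the injected outlier and re-predict at temp = -4.1.
predictr2 = TabularPredictor(label='ice_sales').fit(df_train.drop(index=0))
predictr2.predict(df_train.iloc[[0]])  # expected to move toward the 10 + 0.5*temp trend (~8)
```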