13wk-54: Weight Loss (Interaction Effect) / Data Analysis (Autogluon)

최규빈

2023-12-01

1. Lecture Video

https://youtu.be/playlist?list=PLQqh36zP38-wZhtZDo138uLCeMS-lpdIs&si=eivrWDCk28v-KCXN

2. Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#---#
from autogluon.tabular import TabularPredictor
import autogluon.eda.auto as auto
#---#
import warnings
warnings.filterwarnings('ignore')

3. Data
df_train = pd.read_csv('https://raw.githubusercontent.com/guebin/MP2023/main/posts/weightloss.csv')
df_train

| | Supplement | Exercise | Weight_Loss |
|---|---|---|---|
| 0 | False | False | -0.877103 |
| 1 | True | False | 1.604542 |
| 2 | True | True | 13.824148 |
| 3 | True | True | 13.004505 |
| 4 | True | True | 13.701128 |
| ... | ... | ... | ... |
| 9995 | True | False | 1.558841 |
| 9996 | False | False | -0.217816 |
| 9997 | False | True | 4.072701 |
| 9998 | True | False | -0.253796 |
| 9999 | False | False | -1.399092 |

10000 rows × 3 columns
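Since the topic is the interaction effect, it helps to look at the cell means before fitting anything. A minimal check (added here; not part of the original notebook):

# Mean Weight_Loss in each (Supplement, Exercise) cell. Under a purely
# additive model the (True, True) mean would be near
# baseline + supplement effect + exercise effect; a much larger value
# signals an interaction.
df_train.groupby(['Supplement', 'Exercise'])['Weight_Loss'].mean()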
4. Fitting
# step1 -- pass (no separate preprocessing needed)
# step2 -- create the predictor
predictr = TabularPredictor(label='Weight_Loss')
# step3 -- fit
predictr.fit(df_train)
# step4 -- predict
yhat = predictr.predict(df_train)

No path specified. Models will be saved in: "AutogluonModels/ag-20231203_082814/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20231203_082814/"
AutoGluon Version: 0.8.2
Python Version: 3.8.18
Operating System: Linux
Platform Machine: x86_64
Platform Version: #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2
Disk Space Avail: 673.22 GB / 982.82 GB (68.5%)
Train Data Rows: 10000
Train Data Columns: 2
Label Column: Weight_Loss
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
Label info (max, min, mean, stddev): (18.725299456466026, -3.4848875790233675, 5.11908, 6.09267)
If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 104947.45 MB
Train Data (Original) Memory Usage: 0.02 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 2 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('bool', []) : 2 | ['Supplement', 'Exercise']
Types of features in processed data (raw dtype, special dtypes):
('int', ['bool']) : 2 | ['Supplement', 'Exercise']
0.0s = Fit runtime
2 features in original data used to generate 2 features in processed data.
Train Data (Processed) Memory Usage: 0.02 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.04s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.1, Train Rows: 9000, Val Rows: 1000
User-specified model hyperparameters to be fit:
{
'NN_TORCH': {},
'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
'CAT': {},
'XGB': {},
'FASTAI': {},
'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif ...
No valid features to train KNeighborsUnif... Skipping this model.
Fitting model: KNeighborsDist ...
No valid features to train KNeighborsDist... Skipping this model.
Fitting model: LightGBMXT ...
-1.0098 = Validation score (-root_mean_squared_error)
0.16s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBM ...
-1.0098 = Validation score (-root_mean_squared_error)
0.14s = Training runtime
0.0s = Validation runtime
Fitting model: RandomForestMSE ...
-1.0098 = Validation score (-root_mean_squared_error)
0.22s = Training runtime
0.02s = Validation runtime
Fitting model: CatBoost ...
-1.0098 = Validation score (-root_mean_squared_error)
0.3s = Training runtime
0.0s = Validation runtime
Fitting model: ExtraTreesMSE ...
-1.0098 = Validation score (-root_mean_squared_error)
0.22s = Training runtime
0.02s = Validation runtime
Fitting model: NeuralNetFastAI ...
-1.0087 = Validation score (-root_mean_squared_error)
4.18s = Training runtime
0.01s = Validation runtime
Fitting model: XGBoost ...
-1.0098 = Validation score (-root_mean_squared_error)
0.07s = Training runtime
0.0s = Validation runtime
Fitting model: NeuralNetTorch ...
-1.0088 = Validation score (-root_mean_squared_error)
14.88s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBMLarge ...
-1.0098 = Validation score (-root_mean_squared_error)
0.17s = Training runtime
0.0s = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
-1.0084 = Validation score (-root_mean_squared_error)
0.13s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 20.62s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20231203_082814/")
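The best model is WeightedEnsemble_L2 with a validation RMSE of about 1.008. As a quick sanity check (a sketch, not run in the original), the training RMSE of the yhat computed in step 4 can be compared against it:

# Training-set RMSE; a value close to the validation score (~1.008)
# suggests the ensemble is not overfitting this simple two-factor design.
np.sqrt(((df_train['Weight_Loss'] - yhat) ** 2).mean())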
5. Interpretation and Visualization

A. Distribution of y and visualization of the (X, y) relationship
auto.target_analysis(
train_data=df_train,
label='Weight_Loss',
fit_distributions=False
)

Target variable analysis
| | count | mean | std | min | 25% | 50% | 75% | max | dtypes | unique | missing_count | missing_ratio | raw_type | special_types |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Weight_Loss | 10000 | 5.11908 | 6.092669 | -3.484888 | 0.277797 | 2.683447 | 12.075458 | 18.725299 | float64 | 10000 | | | float | |
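The same summary is available straight from pandas; a quick equivalent check (not part of the EDA output):

# Reproduces the count/mean/std/min/quantile/max columns shown above.
df_train['Weight_Loss'].describe()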

Target variable correlations
(plot: train_data spearman correlation matrix; focus: absolute correlation for Weight_Loss >= 0.5)

(plot: feature interaction between Exercise/Weight_Loss in train_data)
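The interaction plot itself renders only inside the notebook. A rough stand-in using the seaborn imported in Section 2 (an approximation of the EDA figure, not the exact call the module makes):

# Weight_Loss by Exercise, split by Supplement. The (True, True) group sitting
# far above the other three is the visual signature of the interaction.
sns.boxplot(data=df_train, x='Exercise', y='Weight_Loss', hue='Supplement')
plt.show()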

B. Important explanatory variables
auto.quick_fit(
train_data=df_train,
label='Weight_Loss',
show_feature_importance_barplots=True
)

No path specified. Models will be saved in: "AutogluonModels/ag-20231203_082842/"
(plot: Model Prediction for Weight_Loss, using validation data for test points)

Model Leaderboard
| | model | score_test | score_val | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LightGBMXT | -1.013205 | -0.973093 | 0.005561 | 0.003141 | 0.168503 | 0.005561 | 0.003141 | 0.168503 | 1 | True | 1 |
Feature Importance for Trained Model
| | importance | stddev | p_value | n | p99_high | p99_low |
|---|---|---|---|---|---|---|
| Exercise | 6.735479 | 0.105037 | 7.094759e-09 | 5 | 6.951752 | 6.519206 |
| Supplement | 4.018616 | 0.073537 | 1.344948e-08 | 5 | 4.170030 | 3.867202 |

Rows with the highest prediction error
Rows in this category worth inspecting for the causes of the error
| | Supplement | Exercise | Weight_Loss | Weight_Loss_pred | error |
|---|---|---|---|---|---|
| 4639 | True | False | 4.358143 | 0.484748 | 3.873395 |
| 1683 | True | True | 18.432093 | 14.966088 | 3.466005 |
| 1275 | False | True | 1.693150 | 5.012056 | 3.318906 |
| 2631 | False | True | 8.328789 | 5.012056 | 3.316733 |
| 5334 | True | False | 3.769029 | 0.484748 | 3.284281 |
| 3437 | False | False | -3.205225 | 0.039978 | 3.245204 |
| 2761 | False | True | 1.853100 | 5.012056 | 3.158956 |
| 9675 | True | True | 18.070419 | 14.966088 | 3.104331 |
| 3161 | False | False | -3.020254 | 0.039978 | 3.060232 |
| 4637 | False | True | 2.077349 | 5.012056 | 2.934706 |
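quick_fit scores a single LightGBMXT model; the ensemble fitted in Section 4 exposes the same permutation-importance computation. A sketch using the predictr and df_train defined above:

# Permutation importance for the full predictor: how much shuffling each
# column degrades the evaluation metric (RMSE here).
predictr.feature_importance(df_train)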
C. Per-observation interpretation

- Observation 2
df_train.iloc[[2]]

| | Supplement | Exercise | Weight_Loss |
|---|---|---|---|
| 2 | True | True | 13.824148 |

predictr.predict(df_train.iloc[[2]])

2    14.999411
Name: Weight_Loss, dtype: float32
auto.explain_rows(
    train_data=df_train,
    model=predictr,
    rows=df_train.iloc[[2]]*1,  # *1 casts the boolean columns to 0/1 for the explainer
    display_rows=True,
    plot='waterfall'
)

| | Supplement | Exercise | Weight_Loss |
|---|---|---|---|
| 2 | 1.0 | 1.0 | 13.824148 |

(plot: waterfall of feature contributions for observation 2)

- Observation 5

df_train.iloc[[5]]

| | Supplement | Exercise | Weight_Loss |
|---|---|---|---|
| 5 | True | False | -0.379401 |

predictr.predict(df_train.iloc[[5]])

5    0.584295
Name: Weight_Loss, dtype: float32
auto.explain_rows(
    train_data=df_train,
    model=predictr,
    rows=df_train.iloc[[5]]*1,  # *1 casts the boolean columns to 0/1 for the explainer
    display_rows=True,
    plot='waterfall'
)

| | Supplement | Exercise | Weight_Loss |
|---|---|---|---|
| 5 | 1.0 | 0.0 | -0.379401 |

(plot: waterfall of feature contributions for observation 5)

- Observation 10

df_train.iloc[[10]]

| | Supplement | Exercise | Weight_Loss |
|---|---|---|---|
| 10 | False | True | 4.356397 |

predictr.predict(df_train.iloc[[10]])

10    4.990985
Name: Weight_Loss, dtype: float32
auto.explain_rows(
    train_data=df_train,
    model=predictr,
    rows=df_train.iloc[[10]]*1,  # *1 casts the boolean columns to 0/1 for the explainer
    display_rows=True,
    plot='waterfall'
)

| | Supplement | Exercise | Weight_Loss |
|---|---|---|---|
| 10 | 0.0 | 1.0 | 4.356397 |

(plot: waterfall of feature contributions for observation 10)
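Reading the three waterfalls together, the four predicted cell means already appeared in the error table of Section B: roughly 0.04 (neither), 0.48 (supplement only), 5.01 (exercise only), and 14.97 (both). A back-of-the-envelope additivity check using those numbers:

# Predicted cell means copied from the model output above.
base      = 0.039978   # Supplement=False, Exercise=False
supp_only = 0.484748   # Supplement=True,  Exercise=False
exer_only = 5.012056   # Supplement=False, Exercise=True
both      = 14.966088  # Supplement=True,  Exercise=True

additive = base + (supp_only - base) + (exer_only - base)  # about 5.46
interaction = both - additive                              # about 9.51
print(additive, interaction)

The gap of roughly 9.5 between the additive prediction and the actual (True, True) cell is the interaction effect: supplement and exercise together achieve far more than the sum of their separate effects.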
