13wk-49: Height and Weight (Missing Values, Interaction with Sex) / Data Analysis (AutoGluon)
최규빈
2023-12-01
1. Lecture Video
https://youtu.be/playlist?list=PLQqh36zP38-wlppn6TBGZYzyd9FOp1uB9&si=6MMfePEnviGCFWbE
2. Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#---#
from autogluon.tabular import TabularPredictor
import autogluon.eda.auto as auto
#---#
import warnings
warnings.filterwarnings('ignore')
3. Data
df_train = pd.read_csv('https://raw.githubusercontent.com/guebin/MP2023/master/posts/mid/height_train.csv')
df_test = pd.read_csv('https://raw.githubusercontent.com/guebin/MP2023/master/posts/mid/height_test.csv')
df_train.head()
|   | weight | sex | height |
|---|---|---|---|
| 0 | 71.169041 | male | 180.906857 |
| 1 | 69.204748 | male | 178.123281 |
| 2 | 49.037293 | female | 165.106085 |
| 3 | 74.472874 | male | 177.467439 |
| 4 | 74.239599 | male | 177.439925 |
- This was a midterm exam problem, remember?
- Data concept:
- There is an interaction with sex.
- There are missing values (handling them separately by sex worked best); a quick check is sketched below.
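- As a quick look at that missingness pattern, one can count the NaNs in weight by sex (a minimal sketch using the objects imported above):
df_train.isna().sum()                        # NaN count per column
df_train.groupby('sex')['weight'].apply(
    lambda s: s.isna().sum()
)                                            # NaN count in weight, split by sex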
sns.scatterplot(df_train, x='weight', y='height', hue='sex')

4. Fitting
# step1 -- pass
# step2
predictr = TabularPredictor(label='height')
# step3
predictr.fit(df_train)
# step4
yhat = predictr.predict(df_train)
No path specified. Models will be saved in: "AutogluonModels/ag-20231221_040659/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20231221_040659/"
AutoGluon Version: 0.8.2
Python Version: 3.8.18
Operating System: Linux
Platform Machine: x86_64
Platform Version: #38~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 2 18:01:13 UTC 2
Disk Space Avail: 650.40 GB / 982.82 GB (66.2%)
Train Data Rows: 280
Train Data Columns: 2
Label Column: height
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
Label info (max, min, mean, stddev): (195.79716947992372, 148.97529810482766, 174.60543, 9.4301)
If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 61087.3 MB
Train Data (Original) Memory Usage: 0.02 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Stage 5 Generators:
Fitting DropDuplicatesFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('float', []) : 1 | ['weight']
('object', []) : 1 | ['sex']
Types of features in processed data (raw dtype, special dtypes):
('float', []) : 1 | ['weight']
('int', ['bool']) : 1 | ['sex']
0.0s = Fit runtime
2 features in original data used to generate 2 features in processed data.
Train Data (Processed) Memory Usage: 0.0 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.04s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 224, Val Rows: 56
User-specified model hyperparameters to be fit:
{
'NN_TORCH': {},
'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
'CAT': {},
'XGB': {},
'FASTAI': {},
'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif ...
-3.463 = Validation score (-root_mean_squared_error)
0.21s = Training runtime
0.01s = Validation runtime
Fitting model: KNeighborsDist ...
-3.3929 = Validation score (-root_mean_squared_error)
0.02s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBMXT ...
-3.0349 = Validation score (-root_mean_squared_error)
0.22s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBM ...
-3.1331 = Validation score (-root_mean_squared_error)
0.16s = Training runtime
0.0s = Validation runtime
Fitting model: RandomForestMSE ...
-3.0811 = Validation score (-root_mean_squared_error)
0.29s = Training runtime
0.02s = Validation runtime
Fitting model: CatBoost ...
-2.8341 = Validation score (-root_mean_squared_error)
0.32s = Training runtime
0.0s = Validation runtime
Fitting model: ExtraTreesMSE ...
-3.0481 = Validation score (-root_mean_squared_error)
0.24s = Training runtime
0.02s = Validation runtime
Fitting model: NeuralNetFastAI ...
-3.0808 = Validation score (-root_mean_squared_error)
0.86s = Training runtime
0.0s = Validation runtime
Fitting model: XGBoost ...
-3.0563 = Validation score (-root_mean_squared_error)
0.27s = Training runtime
0.0s = Validation runtime
Fitting model: NeuralNetTorch ...
-2.8592 = Validation score (-root_mean_squared_error)
0.74s = Training runtime
0.0s = Validation runtime
Fitting model: LightGBMLarge ...
-3.1759 = Validation score (-root_mean_squared_error)
0.21s = Training runtime
0.0s = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
-2.7259 = Validation score (-root_mean_squared_error)
0.15s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 3.92s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20231221_040659/")
[1000] valid_set's rmse: 3.05149
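- The call above used AutoGluon's defaults; fit also accepts a time budget and presets if a quicker or more thorough run is wanted (an illustrative call only, not executed here; the values are arbitrary):
# illustrative: cap total training time and request a stronger preset
# predictr.fit(df_train, time_limit=60, presets='best_quality')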
sns.scatterplot(df_train, x='weight', y='height', hue='sex', alpha=0.3)
sns.lineplot(df_train, x='weight', y=yhat, hue='sex')

predictr.leaderboard(silent=True)
|   | model | score_val | pred_time_val | fit_time | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|
| 0 | WeightedEnsemble_L2 | -2.725899 | 0.016733 | 2.555090 | 0.000230 | 0.151280 | 2 | True | 12 |
| 1 | CatBoost | -2.834127 | 0.000724 | 0.324964 | 0.000724 | 0.324964 | 1 | True | 6 |
| 2 | NeuralNetTorch | -2.859244 | 0.002492 | 0.736833 | 0.002492 | 0.736833 | 1 | True | 10 |
| 3 | LightGBMXT | -3.034869 | 0.001134 | 0.220292 | 0.001134 | 0.220292 | 1 | True | 3 |
| 4 | ExtraTreesMSE | -3.048093 | 0.018088 | 0.243790 | 0.018088 | 0.243790 | 1 | True | 7 |
| 5 | XGBoost | -3.056270 | 0.002459 | 0.271035 | 0.002459 | 0.271035 | 1 | True | 9 |
| 6 | NeuralNetFastAI | -3.080801 | 0.004651 | 0.863921 | 0.004651 | 0.863921 | 1 | True | 8 |
| 7 | RandomForestMSE | -3.081138 | 0.018505 | 0.294994 | 0.018505 | 0.294994 | 1 | True | 5 |
| 8 | LightGBM | -3.133082 | 0.000750 | 0.155065 | 0.000750 | 0.155065 | 1 | True | 4 |
| 9 | LightGBMLarge | -3.175914 | 0.000838 | 0.211776 | 0.000838 | 0.211776 | 1 | True | 11 |
| 10 | KNeighborsDist | -3.392890 | 0.004761 | 0.023989 | 0.004761 | 0.023989 | 1 | True | 2 |
| 11 | KNeighborsUnif | -3.463039 | 0.006178 | 0.207058 | 0.006178 | 0.207058 | 1 | True | 1 |
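- Since a held-out test set was loaded in Section 3, the final model can also be scored on it with AutoGluon's built-in evaluate (output omitted here):
predictr.evaluate(df_test, silent=True)    # RMSE and auxiliary metrics on df_test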
5. Interpretation and Visualization
A. Visualizing the distribution of y and the (X, y) relationship
auto.target_analysis(
    train_data=df_train,
    label='height',
    fit_distributions=False
)
Target variable analysis
|   | count | mean | std | min | 25% | 50% | 75% | max | dtypes | unique | missing_count | missing_ratio | raw_type | special_types |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| height | 280 | 174.605431 | 9.430102 | 148.975298 | 167.572671 | 175.186487 | 181.132612 | 195.797169 | float64 | 280 | | | float | |

Target variable correlations
train_data - spearman correlation matrix; focus: absolute correlation for height >= 0.5

Feature interaction between weight/height in train_data

Feature interaction between sex/height in train_data

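- The Spearman correlation shown in the matrix above can be reproduced directly with pandas:
df_train[['weight', 'height']].corr(method='spearman')   # pairwise Spearman correlation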
B. Which explanatory variables are important?
- Reports a ranking of the explanatory variables the model considers important
auto.quick_fit(
    train_data=df_train,
    label='height',
    show_feature_importance_barplots=True
)
No path specified. Models will be saved in: "AutogluonModels/ag-20231203_071354/"
Model Prediction for height
Using validation data for Test points

Model Leaderboard
|   | model | score_test | score_val | pred_time_test | pred_time_val | fit_time | pred_time_test_marginal | pred_time_val_marginal | fit_time_marginal | stack_level | can_infer | fit_order |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LightGBMXT | -3.441217 | -3.881789 | 0.006712 | 0.001602 | 0.306659 | 0.006712 | 0.001602 | 0.306659 | 1 | True | 1 |
Feature Importance for Trained Model
|   | importance | stddev | p_value | n | p99_high | p99_low |
|---|---|---|---|---|---|---|
| weight | 9.433162 | 0.468715 | 7.290537e-07 | 5 | 10.398253 | 8.468071 |
| sex | 1.710680 | 0.464422 | 5.924364e-04 | 5 | 2.666932 | 0.754427 |

Rows with the highest prediction error
Rows in this category worth inspecting for the causes of the error
|   | weight | sex | height | height_pred | error |
|---|---|---|---|---|---|
| 208 | NaN | female | 159.027430 | 168.600342 | 9.572911 |
| 263 | 54.145913 | female | 165.791300 | 173.811340 | 8.020041 |
| 146 | 76.642564 | male | 175.011295 | 182.954391 | 7.943097 |
| 228 | 56.473758 | female | 165.962051 | 173.811340 | 7.849289 |
| 92 | 51.018586 | female | 160.851952 | 167.398270 | 6.546318 |
| 198 | NaN | male | 173.915293 | 180.355164 | 6.439870 |
| 157 | 46.214566 | female | 154.289882 | 160.576065 | 6.286183 |
| 106 | 69.667856 | male | 179.665916 | 173.611328 | 6.054588 |
| 118 | 48.711791 | female | 168.305739 | 162.763138 | 5.542602 |
| 166 | 77.068343 | male | 177.439194 | 182.954391 | 5.515197 |
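- The same permutation-style importance ranking is also available straight from the predictor fitted in Section 4 (output omitted):
predictr.feature_importance(df_train)    # permutation importance of weight and sex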
C. Interpretation by observation
Observation 0
- Observation 0
df_train.iloc[[0]]
|   | weight | sex | height |
|---|---|---|---|
| 0 | 71.169041 | male | 180.906857 |
predictr.predict(df_train.iloc[[0]])
0    178.642868
Name: height, dtype: float32
- Why did it predict 178.642868?
- Interpretation:
auto.explain_rows(
    train_data=df_train,
    model=predictr,
    rows=df_train.iloc[[0]],
    display_rows=True,
    plot='waterfall'
)
|   | weight | sex | height |
|---|---|---|---|
| 0 | 71.169041 | male | 180.906857 |

- Why did it predict 178.642868?
- Start from the overall mean, 173.115.
- Considering sex: +2.1
- Considering weight: +3.43
- Final result: 178.643 (an additivity check is sketched below)
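- Since the waterfall is additive, the contributions should recover the prediction up to the rounding shown in the plot labels:
173.115 + 2.1 + 3.43    # 178.645, matching the predicted 178.643 up to rounding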
Observation 208
- Observation 208
df_train.iloc[[208]]
|   | weight | sex | height |
|---|---|---|---|
| 208 | NaN | female | 159.02743 |
predictr.predict(df_train.iloc[[208]])
208    168.788971
Name: height, dtype: float32
- Why did it predict 168.788971?
- Interpretation:
auto.explain_rows(
    train_data=df_train,
    model=predictr,
    rows=df_train.iloc[[208]],
    display_rows=True,
    plot='waterfall'
)
|   | weight | sex | height |
|---|---|---|---|
| 208 | NaN | female | 159.02743 |

- Why did it predict 168.788971?
- Start from the overall mean, 173.115.
- Considering sex: -4.57
- Considering weight=nan: +0.25
- Final result: 168.789
- The missing value is simply treated as one more value the feature can take (as if "nan" itself were an observed value).
- Why is this possible? There are many techniques that make it work; imagine recoding nan as -9999 and running a tree, as sketched below.
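- A minimal sketch of that "nan is just another value" idea with scikit-learn (not part of the lecture code; uses weight only and ignores sex, purely for illustration):
from sklearn.tree import DecisionTreeRegressor
X = df_train[['weight']].fillna(-9999)              # recode nan as a sentinel value
y = df_train['height']
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
tree.predict(pd.DataFrame({'weight': [-9999]}))     # missing-weight rows land in their own leaf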
Observation 211
- Observation 211
df_train.iloc[[211]]
|   | weight | sex | height |
|---|---|---|---|
| 211 | NaN | female | 165.076235 |
predictr.predict(df_train.iloc[[211]])
211    168.788971
Name: height, dtype: float32
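- Note that observations 208 and 211 receive exactly the same prediction: both have weight=NaN and sex=female, so the model sees identical inputs.
predictr.predict(df_train.iloc[[208, 211]])    # two identical predictions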
- Interpretation:
auto.explain_rows(
    train_data=df_train,
    model=predictr,
    rows=df_train.iloc[[211]],
    display_rows=True,
    plot='waterfall'
)
|   | weight | sex | height |
|---|---|---|---|
| 211 | NaN | female | 165.076235 |

- The realistic fit we had in mind actually looks like this:
df_train[df_train.sex == 'female'].weight.mean()
49.567060917121516
onerow = df_train.iloc[[211]].copy()
onerow.weight = 49.567060917121516
onerow
|   | weight | sex | height |
|---|---|---|---|
| 211 | 49.567061 | female | 165.076235 |
predictr.predict(onerow)
211    164.488647
Name: height, dtype: float32
auto.explain_rows(
    train_data=df_train,
    model=predictr,
    rows=onerow,
    display_rows=True,
    plot='waterfall'
)
|   | weight | sex | height |
|---|---|---|---|
| 211 | 49.567061 | female | 165.076235 |

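- The one-row experiment generalizes: sex-wise mean imputation for every missing weight can be written with a groupby transform (a sketch; df_imputed is our own name, not from the lecture code):
df_imputed = df_train.copy()
df_imputed['weight'] = df_imputed.groupby('sex')['weight'].transform(
    lambda s: s.fillna(s.mean())
)    # each missing weight is replaced by its sex-group mean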
Observation 198
- Observation 198
df_train.iloc[[198]]
|   | weight | sex | height |
|---|---|---|---|
| 198 | NaN | male | 173.915293 |
predictr.predict(df_train.iloc[[198]])
198    178.869781
Name: height, dtype: float32
- Interpretation:
auto.explain_rows(
    train_data=df_train,
    model=predictr,
    rows=df_train.iloc[[198]],
    display_rows=True,
    plot='waterfall'
)
|   | weight | sex | height |
|---|---|---|---|
| 198 | NaN | male | 173.915293 |
