[STBDA2023] 13wk-57: House Prices / Data Analysis (AutoGluon)

Author

김보람

Published

December 3, 2023

13wk-57: House Prices / Data Analysis (AutoGluon)

최규빈
2023-12-01

1. Lecture Video

https://youtu.be/playlist?list=PLQqh36zP38-x-PYcds3K7ck8ELQyVlVoN&si=ZCdvUB2r4dQ7cnQx

2. Imports

#!pip install autogluon.eda
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
#---#
from autogluon.tabular import TabularPredictor
import autogluon.eda.auto as auto
#---#
import warnings
warnings.filterwarnings('ignore')

3. Data

ref: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview

!kaggle competitions download -c house-prices-advanced-regression-techniques
Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/coco/.kaggle/kaggle.json'
Downloading house-prices-advanced-regression-techniques.zip to /home/coco/Dropbox/Class/STBDA23/posts
100%|█████████████████████████████████████████| 199k/199k [00:00<00:00, 436kB/s]
100%|█████████████████████████████████████████| 199k/199k [00:00<00:00, 435kB/s]
!unzip house-prices-advanced-regression-techniques.zip
Archive:  house-prices-advanced-regression-techniques.zip
  inflating: data_description.txt    
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               
df_submission = pd.read_csv("sample_submission.csv")
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
!rm sample_submission.csv
!rm train.csv
!rm test.csv
!rm data_description.txt
!rm house-prices-advanced-regression-techniques.zip
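
A quick check of what was loaded (a minimal sketch using only the frames defined above):

# number of rows/columns in each frame
df_train.shape, df_test.shape, df_submission.shape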

4. Fitting

set(df_train.columns) - set(df_test.columns)
{'SalePrice'}
# step1 -- pass
# step2 -- create the predictor
predictr = TabularPredictor(label='SalePrice')
# step3 -- fit the models
predictr.fit(df_train)
# step4 -- predict
yhat = predictr.predict(df_train)   # in-sample predictions
yyhat = predictr.predict(df_test)   # predictions for the test set (for submission)
No path specified. Models will be saved in: "AutogluonModels/ag-20231210_084023/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20231210_084023/"
AutoGluon Version:  0.8.2
Python Version:     3.8.18
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #38~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov  2 18:01:13 UTC 2
Disk Space Avail:   643.98 GB / 982.82 GB (65.5%)
Train Data Rows:    1460
Train Data Columns: 80
Label Column: SalePrice
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == int and many unique label-values observed).
    Label info (max, min, mean, stddev): (755000, 34900, 180921.19589, 79442.50288)
    If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
    Available Memory:                    128242.01 MB
    Train Data (Original)  Memory Usage: 4.06 MB (0.0% of available memory)
    Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
    Stage 1 Generators:
        Fitting AsTypeFeatureGenerator...
            Note: Converting 3 features to boolean dtype as they only contain 2 unique values.
    Stage 2 Generators:
        Fitting FillNaFeatureGenerator...
    Stage 3 Generators:
        Fitting IdentityFeatureGenerator...
        Fitting CategoryFeatureGenerator...
            Fitting CategoryMemoryMinimizeFeatureGenerator...
    Stage 4 Generators:
        Fitting DropUniqueFeatureGenerator...
    Stage 5 Generators:
        Fitting DropDuplicatesFeatureGenerator...
    Types of features in original data (raw dtype, special dtypes):
        ('float', [])  :  3 | ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']
        ('int', [])    : 34 | ['Id', 'MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', ...]
        ('object', []) : 43 | ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', ...]
    Types of features in processed data (raw dtype, special dtypes):
        ('category', [])  : 40 | ['MSZoning', 'Alley', 'LotShape', 'LandContour', 'LotConfig', ...]
        ('float', [])     :  3 | ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']
        ('int', [])       : 34 | ['Id', 'MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', ...]
        ('int', ['bool']) :  3 | ['Street', 'Utilities', 'CentralAir']
    0.2s = Fit runtime
    80 features in original data used to generate 80 features in processed data.
    Train Data (Processed) Memory Usage: 0.52 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.18s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
    This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
    To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 1168, Val Rows: 292
User-specified model hyperparameters to be fit:
{
    'NN_TORCH': {},
    'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
    'CAT': {},
    'XGB': {},
    'FASTAI': {},
    'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif ...
Exception ignored on calling ctypes callback function: <function _ThreadpoolInfo._find_modules_with_dl_iterate_phdr.<locals>.match_module_callback at 0x7f2e74356550>
Traceback (most recent call last):
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 400, in match_module_callback
    self._make_module_from_path(filepath)
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 515, in _make_module_from_path
    module = module_class(filepath, prefix, user_api, internal_api)
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 606, in __init__
    self.version = self.get_version()
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 646, in get_version
    config = get_config().split()
AttributeError: 'NoneType' object has no attribute 'split'
    -52278.8213  = Validation score   (-root_mean_squared_error)
    0.2s     = Training   runtime
    0.04s    = Validation runtime
Fitting model: KNeighborsDist ...
Exception ignored on calling ctypes callback function: <function _ThreadpoolInfo._find_modules_with_dl_iterate_phdr.<locals>.match_module_callback at 0x7f2e4f9dd670>
Traceback (most recent call last):
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 400, in match_module_callback
    self._make_module_from_path(filepath)
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 515, in _make_module_from_path
    module = module_class(filepath, prefix, user_api, internal_api)
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 606, in __init__
    self.version = self.get_version()
  File "/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/threadpoolctl.py", line 646, in get_version
    config = get_config().split()
AttributeError: 'NoneType' object has no attribute 'split'
    -51314.2734  = Validation score   (-root_mean_squared_error)
    0.03s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: LightGBMXT ...
    -27196.7065  = Validation score   (-root_mean_squared_error)
    2.18s    = Training   runtime
    0.02s    = Validation runtime
Fitting model: LightGBM ...
    -28692.2871  = Validation score   (-root_mean_squared_error)
    5.13s    = Training   runtime
    0.05s    = Validation runtime
Fitting model: RandomForestMSE ...
    -32785.3519  = Validation score   (-root_mean_squared_error)
    0.37s    = Training   runtime
    0.02s    = Validation runtime
Fitting model: CatBoost ...
    -28465.6966  = Validation score   (-root_mean_squared_error)
    43.66s   = Training   runtime
    0.01s    = Validation runtime
Fitting model: ExtraTreesMSE ...
    -32045.9062  = Validation score   (-root_mean_squared_error)
    0.3s     = Training   runtime
    0.02s    = Validation runtime
Fitting model: NeuralNetFastAI ...
    -33846.1211  = Validation score   (-root_mean_squared_error)
    1.29s    = Training   runtime
    0.02s    = Validation runtime
Fitting model: XGBoost ...
    -27778.2437  = Validation score   (-root_mean_squared_error)
    0.87s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: NeuralNetTorch ...
    -36076.0341  = Validation score   (-root_mean_squared_error)
    2.24s    = Training   runtime
    0.02s    = Validation runtime
Fitting model: LightGBMLarge ...
    -32084.1712  = Validation score   (-root_mean_squared_error)
    7.97s    = Training   runtime
    0.04s    = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
    -26322.571   = Validation score   (-root_mean_squared_error)
    0.16s    = Training   runtime
    0.0s     = Validation runtime
AutoGluon training complete, total runtime = 65.86s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20231210_084023/")
WARNING: Int features without null values at train time contain null values at inference time! Imputing nulls to 0. To avoid this, pass the features as floats during fit!
WARNING: Int features with nulls: ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath', 'GarageCars', 'GarageArea']
[1000]  valid_set's rmse: 27505.1
[2000]  valid_set's rmse: 27240.4
[3000]  valid_set's rmse: 27201.5
[4000]  valid_set's rmse: 27197.3
[5000]  valid_set's rmse: 27197.2
[1000]  valid_set's rmse: 29499.8
[2000]  valid_set's rmse: 28896.4
[3000]  valid_set's rmse: 28752.1
[4000]  valid_set's rmse: 28705.7
[5000]  valid_set's rmse: 28695.2
[6000]  valid_set's rmse: 28693
[7000]  valid_set's rmse: 28692.5
[8000]  valid_set's rmse: 28692.3
[9000]  valid_set's rmse: 28692.3
[10000] valid_set's rmse: 28692.3
[1000]  valid_set's rmse: 32134.9
[2000]  valid_set's rmse: 32087.8
[3000]  valid_set's rmse: 32084.2
[4000]  valid_set's rmse: 32084.2
[5000]  valid_set's rmse: 32084.2
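
To see the per-model validation scores from the log above as a single table, the fitted predictor's leaderboard can be displayed (a minimal sketch, assuming the AutoGluon 0.8 TabularPredictor API):

# Summarize all fitted models and their validation RMSE in one table
predictr.leaderboard(silent=True)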

5. Submission

df_submission['SalePrice'] = yyhat 
df_submission.to_csv("submission.csv",index=False)
!kaggle competitions submit -c house-prices-advanced-regression-techniques -f submission.csv -m "오토글루온을 이용하여 첫제출"
!rm submission.csv
Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/coco/.kaggle/kaggle.json'
100%|██████████████████████████████████████| 21.1k/21.1k [00:01<00:00, 11.4kB/s]
Successfully submitted to House Prices - Advanced Regression Techniques

Public leaderboard rank: 958 / 4955
Public leaderboard score: 0.19334006054490413

Not a bad ranking.
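
For context, this competition scores submissions by the RMSE between the logarithm of the prediction and the logarithm of the actual sale price; a rough in-sample analogue of that number (a minimal sketch, reusing yhat from the fit above) is:

# In-sample RMSE on the log scale -- the leaderboard computes the same quantity
# on the hidden test labels, so this is only a rough, optimistic reference point.
np.sqrt(np.mean((np.log(df_train['SalePrice']) - np.log(yhat))**2))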

6. Interpretation and Visualization (HW)

- Investigate which of the variables are suitable for predicting SalePrice (see the correlation sketch below).

df_train
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1455 1456 60 RL 62.0 7917 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 8 2007 WD Normal 175000
1456 1457 20 RL 85.0 13175 Pave NaN Reg Lvl AllPub ... 0 NaN MnPrv NaN 0 2 2010 WD Normal 210000
1457 1458 70 RL 66.0 9042 Pave NaN Reg Lvl AllPub ... 0 NaN GdPrv Shed 2500 5 2010 WD Normal 266500
1458 1459 20 RL 68.0 9717 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 4 2010 WD Normal 142125
1459 1460 20 RL 75.0 9937 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 6 2008 WD Normal 147500

1460 rows × 81 columns
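
Before running auto.quick_fit below, a simple first pass at the homework question is to rank the numeric columns by their correlation with SalePrice (a minimal sketch using pandas only):

# Absolute Pearson correlation of each numeric column with SalePrice,
# sorted so the strongest linear relationships come first
(df_train
 .select_dtypes('number')
 .corr()['SalePrice']
 .abs()
 .sort_values(ascending=False)
 .head(15))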

auto.quick_fit(
    train_data=df_train,
    label='SalePrice',
    show_feature_importance_barplots=True
)
No path specified. Models will be saved in: "AutogluonModels/ag-20231210_084134/"

Model Prediction for SalePrice

Using validation data for Test points

Model Leaderboard

model score_test score_val pred_time_test pred_time_val fit_time pred_time_test_marginal pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 LightGBMXT -30529.412291 -32535.182194 0.008621 0.006892 0.269097 0.008621 0.006892 0.269097 1 True 1

Feature Importance for Trained Model

importance stddev p_value n p99_high p99_low
OverallQual 11716.915326 705.955990 1.573763e-06 5 13170.488468 10263.342185
GrLivArea 6071.089919 394.407430 2.125470e-06 5 6883.180269 5258.999569
GarageCars 3209.379857 372.418575 2.137296e-05 5 3976.194851 2442.564863
BsmtFinSF1 2719.389615 125.895615 5.496647e-07 5 2978.610426 2460.168805
TotalBsmtSF 2187.752068 328.342663 5.909712e-05 5 2863.814149 1511.689987
1stFlrSF 1688.228494 248.951746 5.513520e-05 5 2200.823579 1175.633408
Neighborhood 1534.658374 377.814804 4.073284e-04 5 2312.584278 756.732471
HalfBath 881.236830 353.675558 2.542584e-03 5 1609.459692 153.013968
2ndFlrSF 872.631138 123.229525 4.647906e-05 5 1126.362433 618.899844
Fireplaces 868.173697 364.469844 2.989959e-03 5 1618.622144 117.725249
ExterQual 829.910645 401.498301 4.933027e-03 5 1656.601195 3.220095
YearRemodAdd 675.119616 166.871475 4.136303e-04 5 1018.710290 331.528942
LotArea 645.731150 240.401151 1.933927e-03 5 1140.720443 150.741856
FullBath 587.941341 102.476487 1.064026e-04 5 798.941844 376.940838
TotRmsAbvGrd 455.362588 146.798615 1.134317e-03 5 757.622965 153.102211
BsmtExposure 452.574670 100.266016 2.711034e-04 5 659.023783 246.125556
OverallCond 415.878748 101.116481 3.882599e-04 5 624.078980 207.678516
YearBuilt 328.365122 49.414644 5.972756e-05 5 430.110557 226.619687
MasVnrArea 291.987049 102.297731 1.546173e-03 5 502.619491 81.354608
GarageArea 266.571657 194.368541 1.870635e-02 5 666.779168 -133.635855
CentralAir 252.005310 43.263826 1.002687e-04 5 341.086127 162.924494
LotFrontage 234.476025 232.700363 4.367093e-02 5 713.609289 -244.657239
GarageYrBlt 179.569936 41.348994 3.147824e-04 5 264.708087 94.431785
BsmtFullBath 130.488441 27.317412 2.176156e-04 5 186.735370 74.241513
MSSubClass 115.302650 77.955090 1.486407e-02 5 275.813256 -45.207957
BsmtFinType1 89.671801 23.805159 5.438720e-04 5 138.686952 40.656649
KitchenAbvGr 88.796843 13.144665 5.597648e-05 5 115.861890 61.731796
OpenPorchSF 76.599171 32.768044 3.198232e-03 5 144.069027 9.129315
YrSold 55.041099 17.204734 1.010331e-03 5 90.465884 19.616315
MSZoning 52.239352 30.345403 9.156204e-03 5 114.720955 -10.242251
BsmtQual 45.470810 50.360612 5.681646e-02 5 149.164006 -58.222385
PavedDrive 27.952965 11.965434 3.205450e-03 5 52.589960 3.315970
Foundation 15.764567 21.136187 8.534210e-02 5 59.284268 -27.755134
WoodDeckSF 14.490268 49.028633 2.724108e-01 5 115.440900 -86.460364
BsmtUnfSF 14.251367 14.062238 4.304693e-02 5 43.205709 -14.702976
ScreenPorch 10.514393 6.927148 1.371401e-02 5 24.777486 -3.748700
HeatingQC 10.365567 6.499702 1.172951e-02 5 23.748544 -3.017409
Id 9.875743 25.538459 2.179917e-01 5 62.459783 -42.708298
RoofStyle 8.089551 53.663840 3.765020e-01 5 118.584139 -102.405037
LandSlope 6.869875 7.000689 4.662255e-02 5 21.284391 -7.544641
BsmtFinSF2 6.856908 2.939085 3.220858e-03 5 12.908525 0.805292
Exterior1st 4.492434 8.697021 1.561884e-01 5 22.399720 -13.414852
HouseStyle 3.571663 3.437669 4.042508e-02 5 10.649870 -3.506545
BsmtHalfBath 2.760108 1.020890 1.888221e-03 5 4.862135 0.658082
MoSold 1.776995 18.530416 4.203498e-01 5 39.931378 -36.377389
FireplaceQu 0.246895 8.880015 4.767049e-01 5 18.530969 -18.037180
GarageQual 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
GarageCond 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
Utilities 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
LandContour 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
3SsnPorch 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
PoolArea 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
Alley 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
PoolQC 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
Fence 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
MiscFeature 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
MiscVal 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
Street 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
SaleType 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
LotConfig 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
Condition1 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
SaleCondition 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
MasVnrType 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
GarageType 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
Condition2 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
Functional 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
BldgType 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
BsmtFinType2 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
BsmtCond 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
KitchenQual 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
Heating 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
RoofMatl 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
LowQualFinSF 0.000000 0.000000 5.000000e-01 5 0.000000 0.000000
Electrical -0.249231 0.232446 9.627220e-01 5 0.229380 -0.727841
BedroomAbvGr -0.281759 5.209289 5.452164e-01 5 10.444239 -11.007758
EnclosedPorch -0.726761 1.153234 8.842100e-01 5 1.647763 -3.101286
Exterior2nd -1.300222 2.040234 8.863621e-01 5 2.900648 -5.501092
GarageFinish -4.409206 10.141071 8.070016e-01 5 16.471400 -25.289812
ExterCond -4.973063 0.837647 9.999070e-01 5 -3.248336 -6.697791
LotShape -56.043807 64.453373 9.381189e-01 5 76.666578 -188.754193

Rows with the highest prediction error

Rows in this category worth inspecting for the causes of the error

Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice SalePrice_pred error
1182 1183 60 RL 160.0 15623 Pave NaN IR1 Lvl AllPub ... MnPrv NaN 0 7 2007 WD Abnorml 745000 458492.125000 286507.875000
1298 1299 60 RL 313.0 63887 Pave NaN IR3 Bnk AllPub ... NaN NaN 0 1 2008 New Partial 160000 394141.625000 234141.625000
688 689 20 RL 60.0 8089 Pave NaN Reg HLS AllPub ... NaN NaN 0 10 2007 New Partial 392000 236151.000000 155849.000000
898 899 20 RL 100.0 12919 Pave NaN IR1 Lvl AllPub ... NaN NaN 0 3 2010 New Partial 611657 463175.906250 148481.093750
440 441 20 RL 105.0 15431 Pave NaN Reg Lvl AllPub ... NaN NaN 0 4 2009 WD Normal 555000 419354.593750 135645.406250
581 582 20 RL 98.0 12704 Pave NaN Reg Lvl AllPub ... NaN NaN 0 8 2009 New Partial 253293 382104.968750 128811.968750
769 770 60 RL 47.0 53504 Pave NaN IR2 HLS AllPub ... NaN NaN 0 6 2010 WD Normal 538000 429698.281250 108301.718750
632 633 20 RL 85.0 11900 Pave NaN Reg Lvl AllPub ... NaN NaN 0 4 2009 WD Family 82500 182936.734375 100436.734375
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... NaN NaN 0 12 2008 WD Normal 250000 333203.062500 83203.062500
666 667 60 RL NaN 18450 Pave NaN IR1 Lvl AllPub ... NaN NaN 0 8 2007 WD Abnorml 129000 210298.609375 81298.609375

10 rows × 83 columns

df_train.iloc[[1]]
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500

1 rows × 81 columns

predictr.predict(df_train.iloc[[1]])
1    170013.546875
Name: SalePrice, dtype: float32
# auto.explain_rows(
#     train_data=df_train,
#     model=predictr,
#     rows=df_train.iloc[[1]],
#     display_rows=True,
#     plot='waterfall'
# )
  • The code above takes too long to run.
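
A cheaper alternative to the row-level explanation is the predictor's own permutation feature importance (a minimal sketch, assuming the AutoGluon 0.8 TabularPredictor API; it shuffles one column at a time and measures the resulting drop in score):

# Permutation importance of each feature for the already-fitted predictor
fi = predictr.feature_importance(df_train)
fi.head(10)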

ChatGPT's explanation of the variables

This data comes from Kaggle's "House Prices - Advanced Regression Techniques" competition, where the task is to predict house sale prices. Below is a description of the variables in the data.

  1. SalePrice: the sale price of the house (the target variable to predict)

  2. MSSubClass: code identifying the type of dwelling

  3. MSZoning: the general zoning classification

  4. LotFrontage: linear feet of street connected to the property

  5. LotArea: lot size in square feet

  6. Street: type of road access (paved road, etc.)

  7. Alley: type of alley access

  8. LotShape: general shape of the lot

  9. LandContour: flatness of the lot

  10. Utilities: type of utilities available

  11. LotConfig: lot configuration

  12. LandSlope: slope of the lot

  13. Neighborhood: physical location within the Ames city limits

  14. Condition1: proximity to a main road or railroad

  15. Condition2: proximity to a main road or railroad (if a second is present)

  16. BldgType: type of dwelling

  17. HouseStyle: style of dwelling

  18. OverallQual: overall material and finish quality

  19. OverallCond: overall condition rating

  20. YearBuilt: original construction year

… The data also includes many other housing features and their descriptions.

Each of these variables describes a different characteristic of a house, and the goal is to use these features to predict each house's sale price.

  • Of the variables above, OverallQual is the most important for predicting SalePrice.
set(df_train['OverallQual'])
{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

The higher the quality grade, the higher the SalePrice.
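
This can be checked directly, e.g. with a group mean or a boxplot per grade (a minimal sketch using the pandas/seaborn imports above):

# Mean SalePrice per OverallQual grade -- should rise with the grade
df_train.groupby('OverallQual')['SalePrice'].agg(['mean', 'count'])

# Distribution of SalePrice within each grade
sns.boxplot(data=df_train, x='OverallQual', y='SalePrice')
plt.show()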

- Let's try fitting with only the OverallQual variable.

df_tr_ = df_train[['OverallQual','SalePrice']]
df_ts_ = df_test[['OverallQual']]
df_tr_
OverallQual SalePrice
0 7 208500
1 6 181500
2 7 223500
3 7 140000
4 8 250000
... ... ...
1455 6 175000
1456 6 210000
1457 7 266500
1458 5 142125
1459 5 147500

1460 rows × 2 columns

# step1 -- pass
# step2 -- create the predictor
predictr = TabularPredictor(label='SalePrice')
# step3 -- fit using only OverallQual
predictr.fit(df_tr_)
# step4 -- predict
yhat = predictr.predict(df_tr_)
#yyhat = predictr.predict(df_ts_)
No path specified. Models will be saved in: "AutogluonModels/ag-20231210_090017/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20231210_090017/"
AutoGluon Version:  0.8.2
Python Version:     3.8.18
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #38~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov  2 18:01:13 UTC 2
Disk Space Avail:   643.82 GB / 982.82 GB (65.5%)
Train Data Rows:    1460
Train Data Columns: 1
Label Column: SalePrice
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == int and many unique label-values observed).
    Label info (max, min, mean, stddev): (755000, 34900, 180921.19589, 79442.50288)
    If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
    Available Memory:                    128082.22 MB
    Train Data (Original)  Memory Usage: 0.01 MB (0.0% of available memory)
    Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
    Stage 1 Generators:
        Fitting AsTypeFeatureGenerator...
    Stage 2 Generators:
        Fitting FillNaFeatureGenerator...
    Stage 3 Generators:
        Fitting IdentityFeatureGenerator...
    Stage 4 Generators:
        Fitting DropUniqueFeatureGenerator...
    Stage 5 Generators:
        Fitting DropDuplicatesFeatureGenerator...
    Types of features in original data (raw dtype, special dtypes):
        ('int', []) : 1 | ['OverallQual']
    Types of features in processed data (raw dtype, special dtypes):
        ('int', []) : 1 | ['OverallQual']
    0.0s = Fit runtime
    1 features in original data used to generate 1 features in processed data.
    Train Data (Processed) Memory Usage: 0.01 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.03s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
    This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
    To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 1168, Val Rows: 292
User-specified model hyperparameters to be fit:
{
    'NN_TORCH': {},
    'GBM': [{'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}}, {}, 'GBMLarge'],
    'CAT': {},
    'XGB': {},
    'FASTAI': {},
    'RF': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'XT': [{'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}}, {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression', 'quantile']}}],
    'KNN': [{'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}}, {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}}],
}
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif ...
    -51302.9262  = Validation score   (-root_mean_squared_error)
    0.04s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: KNeighborsDist ...
    -51302.9262  = Validation score   (-root_mean_squared_error)
    0.03s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: LightGBMXT ...
    -61288.6739  = Validation score   (-root_mean_squared_error)
    0.28s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: LightGBM ...
    -48771.8547  = Validation score   (-root_mean_squared_error)
    0.21s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: RandomForestMSE ...
    -47823.2599  = Validation score   (-root_mean_squared_error)
    0.25s    = Training   runtime
    0.02s    = Validation runtime
Fitting model: CatBoost ...
    -47813.9877  = Validation score   (-root_mean_squared_error)
    0.13s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: ExtraTreesMSE ...
    -47823.2599  = Validation score   (-root_mean_squared_error)
    0.24s    = Training   runtime
    0.02s    = Validation runtime
Fitting model: NeuralNetFastAI ...
    -47262.4708  = Validation score   (-root_mean_squared_error)
    0.67s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: XGBoost ...
    -47576.4125  = Validation score   (-root_mean_squared_error)
    0.09s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: NeuralNetTorch ...
    -47489.2325  = Validation score   (-root_mean_squared_error)
    1.25s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: LightGBMLarge ...
    -47629.3214  = Validation score   (-root_mean_squared_error)
    0.22s    = Training   runtime
    0.0s     = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
    -47251.0188  = Validation score   (-root_mean_squared_error)
    0.16s    = Training   runtime
    0.0s     = Validation runtime
AutoGluon training complete, total runtime = 3.77s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20231210_090017/")
[1000]  valid_set's rmse: 61288.7
[2000]  valid_set's rmse: 61288.7
yyhat = predictr.predict(df_ts_)
auto.quick_fit(
    train_data=df_tr_,
    label='SalePrice',
    show_feature_importance_barplots=True
)
No path specified. Models will be saved in: "AutogluonModels/ag-20231210_090127/"

Model Prediction for SalePrice

Using validation data for Test points

Model Leaderboard

model score_test score_val pred_time_test pred_time_val fit_time pred_time_test_marginal pred_time_val_marginal fit_time_marginal stack_level can_infer fit_order
0 LightGBMXT -52272.947289 -50360.652261 0.000823 0.000755 0.215841 0.000823 0.000755 0.215841 1 True 1

Feature Importance for Trained Model

importance stddev p_value n p99_high p99_low
OverallQual 52258.930244 2494.060513 6.206537e-07 5 57394.235311 47123.625176

Rows with the highest prediction error

Rows in this category worth inspecting for the causes of the error

OverallQual SalePrice SalePrice_pred error
1182 10 745000 302755.9375 442244.0625
898 9 611657 302755.9375 308901.0625
440 10 555000 302755.9375 252244.0625
769 8 538000 302755.9375 235244.0625
1243 10 465000 302755.9375 162244.0625
1298 10 160000 302755.9375 142755.9375
458 8 161000 302755.9375 141755.9375
1211 8 164000 302755.9375 138755.9375
58 10 438780 302755.9375 136024.0625
991 8 168000 302755.9375 134755.9375
df_submission['SalePrice'] = yyhat 
df_submission.to_csv("submission.csv",index=False)
!kaggle competitions submit -c house-prices-advanced-regression-techniques -f submission.csv -m "overallqual"
!rm submission.csv
Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/coco/.kaggle/kaggle.json'
100%|██████████████████████████████████████| 21.4k/21.4k [00:01<00:00, 11.6kB/s]
Successfully submitted to House Prices - Advanced Regression Techniques

Public leaderboard rank: 1110 / 5012
Public leaderboard score: 0.2214684756584198

The score went up from 0.193 to 0.221 (i.e., got worse), which is expected for a model that uses only a single feature.