import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.linear_model
import sklearn.model_selection  # needed below for train_test_split
This material is from Professor Guebin Choi's Special Topics in Big Data Analysis course, Fall 2023, Jeonbuk National University.
06wk-024: Employment + Various English Scores, RidgeCV
Guebin Choi
2023-10-05
1. Lecture Video
https://youtu.be/playlist?list=PLQqh36zP38-wCfFLHO2uCcH6izfJPTKro&si=YYn_bwPwcuwTk0Ld
2. Imports
3. Data
df = pd.read_csv("https://raw.githubusercontent.com/guebin/MP2023/main/posts/employment_multicollinearity.csv")
np.random.seed(43052)
df['employment_score'] = df.gpa * 1.0 + df.toeic * 1/100 + np.random.randn(500)
df
 | employment_score | gpa | toeic | toeic0 | toeic1 | toeic2 | toeic3 | toeic4 | toeic5 | toeic6 | ... | toeic490 | toeic491 | toeic492 | toeic493 | toeic494 | toeic495 | toeic496 | toeic497 | toeic498 | toeic499 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.784955 | 0.051535 | 135 | 129.566309 | 133.078481 | 121.678398 | 113.457366 | 133.564200 | 136.026566 | 141.793547 | ... | 132.014696 | 140.013265 | 135.575816 | 143.863346 | 152.162740 | 132.850033 | 115.956496 | 131.842126 | 125.090801 | 143.568527 |
1 | 10.789671 | 0.355496 | 935 | 940.563187 | 935.723570 | 939.190519 | 938.995672 | 945.376482 | 927.469901 | 952.424087 | ... | 942.251184 | 923.241548 | 939.924802 | 921.912261 | 953.250300 | 931.743615 | 940.205853 | 930.575825 | 941.530348 | 934.221055 |
2 | 8.221213 | 2.228435 | 485 | 493.671390 | 493.909118 | 475.500970 | 480.363752 | 478.868942 | 493.321602 | 490.059102 | ... | 484.438233 | 488.101275 | 485.626742 | 475.330715 | 485.147363 | 468.553780 | 486.870976 | 481.640957 | 499.340808 | 488.197332 |
3 | 2.137594 | 1.179701 | 65 | 62.272565 | 55.957257 | 68.521468 | 76.866765 | 51.436321 | 57.166824 | 67.834920 | ... | 67.653225 | 65.710588 | 64.146780 | 76.662194 | 66.837839 | 82.379018 | 69.174745 | 64.475993 | 52.647087 | 59.493275 |
4 | 8.650144 | 3.962356 | 445 | 449.280637 | 438.895582 | 433.598274 | 444.081141 | 437.005100 | 434.761142 | 443.135269 | ... | 455.940348 | 435.952854 | 441.521145 | 443.038886 | 433.118847 | 466.103355 | 430.056944 | 423.632873 | 446.973484 | 442.793633 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
495 | 9.057243 | 4.288465 | 280 | 276.680902 | 274.502675 | 277.868536 | 292.283300 | 277.476630 | 281.671647 | 296.307373 | ... | 269.541846 | 278.220546 | 278.484758 | 284.901284 | 272.451612 | 265.784490 | 275.795948 | 280.465992 | 268.528889 | 283.638470 |
496 | 4.108020 | 2.601212 | 310 | 296.940263 | 301.545000 | 306.725610 | 314.811407 | 311.935810 | 309.695838 | 301.979914 | ... | 304.680578 | 295.476836 | 316.582100 | 319.412132 | 312.984039 | 312.372112 | 312.106944 | 314.101927 | 309.409533 | 297.429968 |
497 | 2.430590 | 0.042323 | 225 | 206.793217 | 228.335345 | 222.115146 | 216.479498 | 227.469560 | 238.710310 | 233.797065 | ... | 233.469238 | 235.160919 | 228.517306 | 228.349646 | 224.153606 | 230.860484 | 218.683195 | 232.949484 | 236.951938 | 227.997629 |
498 | 5.343171 | 1.041416 | 320 | 327.461442 | 323.019899 | 329.589337 | 313.312233 | 315.645050 | 324.448247 | 314.271045 | ... | 326.297700 | 309.893822 | 312.873223 | 322.356584 | 319.332809 | 319.405283 | 324.021917 | 312.363694 | 318.493866 | 310.973930 |
499 | 6.505106 | 3.626883 | 375 | 370.966595 | 364.668477 | 371.853566 | 373.574930 | 376.701708 | 356.905085 | 354.584022 | ... | 382.278782 | 379.460816 | 371.031640 | 370.272639 | 375.618182 | 369.252740 | 376.925543 | 391.863103 | 368.735260 | 368.520844 |
500 rows × 503 columns
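The columns toeic0 through toeic499 look like noisy copies of toeic, which is the multicollinearity the file name advertises. A quick check along these lines (a sketch; the exact correlation values are not shown in the lecture):

# sanity check: a few of the toeicK columns against the original toeic score
df[['toeic', 'toeic0', 'toeic1', 'toeic2']].corr()
# each toeicK column should be almost perfectly correlated with toeic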
4. RidgeCV
- Let's select a model using the RidgeCV class.
## step1
df_train, df_test = sklearn.model_selection.train_test_split(df, test_size=0.3, random_state=42)
X = df_train.loc[:, 'gpa':'toeic499']
y = df_train.loc[:, 'employment_score']
XX = df_test.loc[:, 'gpa':'toeic499']
yy = df_test.loc[:, 'employment_score']
## step2
predictr = sklearn.linear_model.RidgeCV()
## step3
predictr.fit(X, y)
## step4 -- pass
RidgeCV()
predictr.score(X,y)
0.9999996840224943
predictr.score(XX,yy)
0.11914945948787758
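To check which of the three default candidates the leave-one-out CV actually selected, we can inspect the fitted attributes. A minimal sketch (alpha_ and best_score_ are documented RidgeCV attributes; the printed values depend on the data):

predictr.alpha_       # the selected penalty, one of the defaults (0.1, 1.0, 10.0)
predictr.best_score_  # CV score achieved by the selected alpha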
- The training score is nearly 1 while the test score is only about 0.12, i.e., severe overfitting: the default candidates alphas=(0.1, 1.0, 10.0) are far too weak for these 502 highly collinear features. Let's choose the alpha candidates ourselves.
sklearn.linear_model.RidgeCV?
Init signature:
sklearn.linear_model.RidgeCV(
    alphas=(0.1, 1.0, 10.0),
    *,
    fit_intercept=True,
    scoring=None,
    cv=None,
    gcv_mode=None,
    store_cv_values=False,
    alpha_per_target=False,
)
Docstring:
Ridge regression with built-in cross-validation.

See glossary entry for :term:`cross-validation estimator`.

By default, it performs efficient Leave-One-Out Cross-Validation.

Read more in the :ref:`User Guide <ridge_regression>`.

Parameters
----------
alphas : array-like of shape (n_alphas,), default=(0.1, 1.0, 10.0)
    Array of alpha values to try. Regularization strength; must be a
    positive float. Regularization improves the conditioning of the
    problem and reduces the variance of the estimates. Larger values
    specify stronger regularization. Alpha corresponds to ``1 / (2C)``
    in other linear models such as
    :class:`~sklearn.linear_model.LogisticRegression` or
    :class:`~sklearn.svm.LinearSVC`.
    If using Leave-One-Out cross-validation, alphas must be positive.

fit_intercept : bool, default=True
    Whether to calculate the intercept for this model. If set to false,
    no intercept will be used in calculations (i.e. data is expected to
    be centered).

scoring : str, callable, default=None
    A string (see model evaluation documentation) or a scorer callable
    object / function with signature ``scorer(estimator, X, y)``.
    If None, the negative mean squared error if cv is 'auto' or None
    (i.e. when using leave-one-out cross-validation), and r2 score
    otherwise.

cv : int, cross-validation generator or an iterable, default=None
    Determines the cross-validation splitting strategy.
    Possible inputs for cv are:

    - None, to use the efficient Leave-One-Out cross-validation
    - integer, to specify the number of folds.
    - :term:`CV splitter`,
    - An iterable yielding (train, test) splits as arrays of indices.

    For integer/None inputs, if ``y`` is binary or multiclass,
    :class:`~sklearn.model_selection.StratifiedKFold` is used, else,
    :class:`~sklearn.model_selection.KFold` is used.

    Refer :ref:`User Guide <cross_validation>` for the various
    cross-validation strategies that can be used here.

gcv_mode : {'auto', 'svd', 'eigen'}, default='auto'
    Flag indicating which strategy to use when performing
    Leave-One-Out Cross-Validation. Options are::

        'auto' : use 'svd' if n_samples > n_features, otherwise use 'eigen'
        'svd' : force use of singular value decomposition of X when X is
            dense, eigenvalue decomposition of X^T.X when X is sparse.
        'eigen' : force computation via eigendecomposition of X.X^T

    The 'auto' mode is the default and is intended to pick the cheaper
    option of the two depending on the shape of the training data.

store_cv_values : bool, default=False
    Flag indicating if the cross-validation values corresponding to each
    alpha should be stored in the ``cv_values_`` attribute (see below).
    This flag is only compatible with ``cv=None`` (i.e. using
    Leave-One-Out Cross-Validation).

alpha_per_target : bool, default=False
    Flag indicating whether to optimize the alpha value (picked from the
    `alphas` parameter list) for each target separately (for multi-output
    settings: multiple prediction targets). When set to `True`, after
    fitting, the `alpha_` attribute will contain a value for each target.
    When set to `False`, a single alpha is used for all targets.

    .. versionadded:: 0.24

Attributes
----------
cv_values_ : ndarray of shape (n_samples, n_alphas) or shape (n_samples, n_targets, n_alphas), optional
    Cross-validation values for each alpha (only available if
    ``store_cv_values=True`` and ``cv=None``). After ``fit()`` has been
    called, this attribute will contain the mean squared errors if
    `scoring is None` otherwise it will contain standardized per point
    prediction values.

coef_ : ndarray of shape (n_features) or (n_targets, n_features)
    Weight vector(s).

intercept_ : float or ndarray of shape (n_targets,)
    Independent term in decision function. Set to 0.0 if
    ``fit_intercept = False``.

alpha_ : float or ndarray of shape (n_targets,)
    Estimated regularization parameter, or, if ``alpha_per_target=True``,
    the estimated regularization parameter for each target.

best_score_ : float or ndarray of shape (n_targets,)
    Score of base estimator with best alpha, or, if
    ``alpha_per_target=True``, a score for each target.

    .. versionadded:: 0.23

n_features_in_ : int
    Number of features seen during :term:`fit`.

    .. versionadded:: 0.24

feature_names_in_ : ndarray of shape (`n_features_in_`,)
    Names of features seen during :term:`fit`. Defined only when `X` has
    feature names that are all strings.

    .. versionadded:: 1.0

See Also
--------
Ridge : Ridge regression.
RidgeClassifier : Classifier based on ridge regression on {-1, 1} labels.
RidgeClassifierCV : Ridge classifier with built-in cross validation.

Examples
--------
>>> from sklearn.datasets import load_diabetes
>>> from sklearn.linear_model import RidgeCV
>>> X, y = load_diabetes(return_X_y=True)
>>> clf = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1]).fit(X, y)
>>> clf.score(X, y)
0.5166...
File:      ~/anaconda3/envs/py38/lib/python3.8/site-packages/sklearn/linear_model/_ridge.py
Type:      ABCMeta
Subclasses:
It seems to pick whichever value performs best among the default candidates alphas=(0.1, 1.0, 10.0).
## step1
df_train, df_test = sklearn.model_selection.train_test_split(df, test_size=0.3, random_state=42)
X = df_train.loc[:, 'gpa':'toeic499']
y = df_train.loc[:, 'employment_score']
XX = df_test.loc[:, 'gpa':'toeic499']
yy = df_test.loc[:, 'employment_score']
## step2
predictr = sklearn.linear_model.RidgeCV(alphas=[5e2, 5e3, 5e4, 5e5, 5e6, 5e7, 5e8])
## step3
predictr.fit(X, y)
## step4 -- pass
RidgeCV(alphas=[500.0, 5000.0, 50000.0, 500000.0, 5000000.0, 50000000.0, 500000000.0])
predictr.score(X,y)
0.752126856015936
predictr.score(XX,yy)
0.7450309251010896
predictr.alpha_
50000000.0
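Since the winner (5e7) is an interior point of the grid rather than an endpoint, the candidate range was wide enough. To fine-tune, one could re-run RidgeCV on a denser log-spaced grid around it. A minimal sketch, assuming the train/test split above (the grid itself is an assumption, not part of the lecture):

# hypothetical refinement around the selected alpha
fine_alphas = np.logspace(6, 9, 31)   # 31 log-spaced candidates from 1e6 to 1e9
refined = sklearn.linear_model.RidgeCV(alphas=fine_alphas).fit(X, y)
refined.alpha_                        # possibly a finer-grained penalty choice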
The training and test scores now agree (both around 0.75). For reference, the RidgeCV fit above is identical to running Ridge directly with the selected alpha, as in the code below:
## step1
df_train, df_test = sklearn.model_selection.train_test_split(df, test_size=0.3, random_state=42)
X = df_train.loc[:, 'gpa':'toeic499']
y = df_train.loc[:, 'employment_score']
XX = df_test.loc[:, 'gpa':'toeic499']
yy = df_test.loc[:, 'employment_score']
## step2
predictr = sklearn.linear_model.Ridge(alpha=50000000.0)
## step3
predictr.fit(X, y)
## step4 -- pass
Ridge(alpha=50000000.0)
predictr.score(X,y)
0.752126856015936
predictr.score(XX,yy)
0.7450309251010894
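One way to verify this equivalence directly is to compare the fitted coefficients. A minimal sketch, assuming X and y from the split above (m1 and m2 are illustrative names):

# refit both estimators and compare their coefficients
m1 = sklearn.linear_model.RidgeCV(alphas=[5e2, 5e3, 5e4, 5e5, 5e6, 5e7, 5e8]).fit(X, y)
m2 = sklearn.linear_model.Ridge(alpha=m1.alpha_).fit(X, y)
np.allclose(m1.coef_, m2.coef_)   # expected True: RidgeCV refits Ridge with the chosen alpha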