[STBDA2023] 06wk-024: Employment + Various English Scores, RidgeCV

Author

김보람

Published

October 14, 2023

This material is from Professor 최규빈's Special Topics in Big Data Analysis course (Fall semester 2023) at Jeonbuk National University.

06wk-024: Employment + Various English Scores, RidgeCV

최규빈
2023-10-05

1. Lecture Video

https://youtu.be/playlist?list=PLQqh36zP38-wCfFLHO2uCcH6izfJPTKro&si=YYn_bwPwcuwTk0Ld

2. Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import sklearn.linear_model
import sklearn.model_selection

3. Data

df = pd.read_csv("https://raw.githubusercontent.com/guebin/MP2023/main/posts/employment_multicollinearity.csv")
np.random.seed(43052)
df['employment_score'] = df.gpa * 1.0 + df.toeic * 1/100 + np.random.randn(500)
df
employment_score gpa toeic toeic0 toeic1 toeic2 toeic3 toeic4 toeic5 toeic6 ... toeic490 toeic491 toeic492 toeic493 toeic494 toeic495 toeic496 toeic497 toeic498 toeic499
0 1.784955 0.051535 135 129.566309 133.078481 121.678398 113.457366 133.564200 136.026566 141.793547 ... 132.014696 140.013265 135.575816 143.863346 152.162740 132.850033 115.956496 131.842126 125.090801 143.568527
1 10.789671 0.355496 935 940.563187 935.723570 939.190519 938.995672 945.376482 927.469901 952.424087 ... 942.251184 923.241548 939.924802 921.912261 953.250300 931.743615 940.205853 930.575825 941.530348 934.221055
2 8.221213 2.228435 485 493.671390 493.909118 475.500970 480.363752 478.868942 493.321602 490.059102 ... 484.438233 488.101275 485.626742 475.330715 485.147363 468.553780 486.870976 481.640957 499.340808 488.197332
3 2.137594 1.179701 65 62.272565 55.957257 68.521468 76.866765 51.436321 57.166824 67.834920 ... 67.653225 65.710588 64.146780 76.662194 66.837839 82.379018 69.174745 64.475993 52.647087 59.493275
4 8.650144 3.962356 445 449.280637 438.895582 433.598274 444.081141 437.005100 434.761142 443.135269 ... 455.940348 435.952854 441.521145 443.038886 433.118847 466.103355 430.056944 423.632873 446.973484 442.793633
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
495 9.057243 4.288465 280 276.680902 274.502675 277.868536 292.283300 277.476630 281.671647 296.307373 ... 269.541846 278.220546 278.484758 284.901284 272.451612 265.784490 275.795948 280.465992 268.528889 283.638470
496 4.108020 2.601212 310 296.940263 301.545000 306.725610 314.811407 311.935810 309.695838 301.979914 ... 304.680578 295.476836 316.582100 319.412132 312.984039 312.372112 312.106944 314.101927 309.409533 297.429968
497 2.430590 0.042323 225 206.793217 228.335345 222.115146 216.479498 227.469560 238.710310 233.797065 ... 233.469238 235.160919 228.517306 228.349646 224.153606 230.860484 218.683195 232.949484 236.951938 227.997629
498 5.343171 1.041416 320 327.461442 323.019899 329.589337 313.312233 315.645050 324.448247 314.271045 ... 326.297700 309.893822 312.873223 322.356584 319.332809 319.405283 324.021917 312.363694 318.493866 310.973930
499 6.505106 3.626883 375 370.966595 364.668477 371.853566 373.574930 376.701708 356.905085 354.584022 ... 382.278782 379.460816 371.031640 370.272639 375.618182 369.252740 376.925543 391.863103 368.735260 368.520844

500 rows × 503 columns
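The generating code above sets employment_score = gpa + toeic/100 + noise with noise ~ N(0, 1), and the columns toeic0 through toeic499 look like noisy replicates of toeic, which is exactly the multicollinearity the file name advertises. A quick sanity check of that structure (a sketch, not part of the original lecture):

# sketch: each toeic-K column should be almost perfectly correlated with toeic
# if it is just toeic plus small noise; values near 1.0 across all 500 columns
# confirm the multicollinearity
df.loc[:, 'toeic0':'toeic499'].corrwith(df.toeic)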

4. RidgeCV

- Let's select a model using the RidgeCV class.

## step1 
df_train, df_test = sklearn.model_selection.train_test_split(df,test_size=0.3,random_state=42)
X = df_train.loc[:,'gpa':'toeic499']
y = df_train.loc[:,'employment_score']
XX = df_test.loc[:,'gpa':'toeic499']
yy = df_test.loc[:,'employment_score']
## step2 
predictr = sklearn.linear_model.RidgeCV()
## step3
predictr.fit(X,y)
## step4 -- pass 
RidgeCV()
predictr.score(X,y)
0.9999996840224943
predictr.score(XX,yy)
0.11914945948787758
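Train R² is essentially 1 while test R² is about 0.12, a textbook overfit. To see which regularization strength the default grid settled on (a sketch; the value will be one of the three defaults):

predictr.alpha_  # one of (0.1, 1.0, 10.0); all far too weak for 501 collinear features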

- The default candidates are clearly too weak here, so let's choose the candidate alphas ourselves.

sklearn.linear_model.RidgeCV?
Init signature:
sklearn.linear_model.RidgeCV(
    alphas=(0.1, 1.0, 10.0),
    *,
    fit_intercept=True,
    scoring=None,
    cv=None,
    gcv_mode=None,
    store_cv_values=False,
    alpha_per_target=False,
)
Docstring:     
Ridge regression with built-in cross-validation.
See glossary entry for :term:`cross-validation estimator`.
By default, it performs efficient Leave-One-Out Cross-Validation.
Read more in the :ref:`User Guide <ridge_regression>`.
Parameters
----------
alphas : array-like of shape (n_alphas,), default=(0.1, 1.0, 10.0)
    Array of alpha values to try.
    Regularization strength; must be a positive float. Regularization
    improves the conditioning of the problem and reduces the variance of
    the estimates. Larger values specify stronger regularization.
    Alpha corresponds to ``1 / (2C)`` in other linear models such as
    :class:`~sklearn.linear_model.LogisticRegression` or
    :class:`~sklearn.svm.LinearSVC`.
    If using Leave-One-Out cross-validation, alphas must be positive.
fit_intercept : bool, default=True
    Whether to calculate the intercept for this model. If set
    to false, no intercept will be used in calculations
    (i.e. data is expected to be centered).
scoring : str, callable, default=None
    A string (see model evaluation documentation) or
    a scorer callable object / function with signature
    ``scorer(estimator, X, y)``.
    If None, the negative mean squared error if cv is 'auto' or None
    (i.e. when using leave-one-out cross-validation), and r2 score
    otherwise.
cv : int, cross-validation generator or an iterable, default=None
    Determines the cross-validation splitting strategy.
    Possible inputs for cv are:
    - None, to use the efficient Leave-One-Out cross-validation
    - integer, to specify the number of folds.
    - :term:`CV splitter`,
    - An iterable yielding (train, test) splits as arrays of indices.
    For integer/None inputs, if ``y`` is binary or multiclass,
    :class:`~sklearn.model_selection.StratifiedKFold` is used, else,
    :class:`~sklearn.model_selection.KFold` is used.
    Refer :ref:`User Guide <cross_validation>` for the various
    cross-validation strategies that can be used here.
gcv_mode : {'auto', 'svd', 'eigen'}, default='auto'
    Flag indicating which strategy to use when performing
    Leave-One-Out Cross-Validation. Options are::
        'auto' : use 'svd' if n_samples > n_features, otherwise use 'eigen'
        'svd' : force use of singular value decomposition of X when X is
            dense, eigenvalue decomposition of X^T.X when X is sparse.
        'eigen' : force computation via eigendecomposition of X.X^T
    The 'auto' mode is the default and is intended to pick the cheaper
    option of the two depending on the shape of the training data.
store_cv_values : bool, default=False
    Flag indicating if the cross-validation values corresponding to
    each alpha should be stored in the ``cv_values_`` attribute (see
    below). This flag is only compatible with ``cv=None`` (i.e. using
    Leave-One-Out Cross-Validation).
alpha_per_target : bool, default=False
    Flag indicating whether to optimize the alpha value (picked from the
    `alphas` parameter list) for each target separately (for multi-output
    settings: multiple prediction targets). When set to `True`, after
    fitting, the `alpha_` attribute will contain a value for each target.
    When set to `False`, a single alpha is used for all targets.
    .. versionadded:: 0.24
Attributes
----------
cv_values_ : ndarray of shape (n_samples, n_alphas) or shape (n_samples, n_targets, n_alphas), optional
    Cross-validation values for each alpha (only available if
    ``store_cv_values=True`` and ``cv=None``). After ``fit()`` has been
    called, this attribute will contain the mean squared errors if
    `scoring is None` otherwise it will contain standardized per point
    prediction values.
coef_ : ndarray of shape (n_features) or (n_targets, n_features)
    Weight vector(s).
intercept_ : float or ndarray of shape (n_targets,)
    Independent term in decision function. Set to 0.0 if
    ``fit_intercept = False``.
alpha_ : float or ndarray of shape (n_targets,)
    Estimated regularization parameter, or, if ``alpha_per_target=True``,
    the estimated regularization parameter for each target.
best_score_ : float or ndarray of shape (n_targets,)
    Score of base estimator with best alpha, or, if
    ``alpha_per_target=True``, a score for each target.
    .. versionadded:: 0.23
n_features_in_ : int
    Number of features seen during :term:`fit`.
    .. versionadded:: 0.24
feature_names_in_ : ndarray of shape (`n_features_in_`,)
    Names of features seen during :term:`fit`. Defined only when `X`
    has feature names that are all strings.
    .. versionadded:: 1.0
See Also
--------
Ridge : Ridge regression.
RidgeClassifier : Classifier based on ridge regression on {-1, 1} labels.
RidgeClassifierCV : Ridge classifier with built-in cross validation.
Examples
--------
>>> from sklearn.datasets import load_diabetes
>>> from sklearn.linear_model import RidgeCV
>>> X, y = load_diabetes(return_X_y=True)
>>> clf = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1]).fit(X, y)
>>> clf.score(X, y)
0.5166...
File:           ~/anaconda3/envs/py38/lib/python3.8/site-packages/sklearn/linear_model/_ridge.py
Type:           ABCMeta
Subclasses:     

It appears to pick whichever value among alphas=(0.1, 1.0, 10.0) scores best, by default via the efficient leave-one-out cross-validation, so supplying a wider grid is the natural fix.
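The run below hand-picks a grid running from 5e2 to 5e8. As a side note, the same grid can be built programmatically (a sketch, equivalent to the hand-written list below):

alphas = 5 * np.logspace(2, 8, 7)  # array([5.e+02, 5.e+03, ..., 5.e+08])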

## step1 
df_train, df_test = sklearn.model_selection.train_test_split(df,test_size=0.3,random_state=42)
X = df_train.loc[:,'gpa':'toeic499']
y = df_train.loc[:,'employment_score']
XX = df_test.loc[:,'gpa':'toeic499']
yy = df_test.loc[:,'employment_score']
## step2 
predictr = sklearn.linear_model.RidgeCV(alphas=[5e2, 5e3, 5e4, 5e5, 5e6, 5e7, 5e8])
## step3
predictr.fit(X,y)
## step4 -- pass 
RidgeCV(alphas=[500.0, 5000.0, 50000.0, 500000.0, 5000000.0, 50000000.0,
                500000000.0])
predictr.score(X,y)
0.752126856015936
predictr.score(XX,yy)
0.7450309251010896
predictr.alpha_
50000000.0
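The selected alpha, 5e7, sits one step from the edge of the grid; if alpha_ ever lands on the boundary itself, the grid should be widened and the search rerun. Using the store_cv_values option documented above, the leave-one-out errors behind this choice can also be inspected directly (a sketch):

predictr = sklearn.linear_model.RidgeCV(
    alphas=[5e2, 5e3, 5e4, 5e5, 5e6, 5e7, 5e8],
    store_cv_values=True  # keep per-sample LOOCV values in cv_values_
)
predictr.fit(X, y)
predictr.cv_values_.mean(axis=0)  # mean squared error per alpha; the argmin matches alpha_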

For reference, this fit gives the same result as running the code below.

## step1 
df_train, df_test = sklearn.model_selection.train_test_split(df,test_size=0.3,random_state=42)
X = df_train.loc[:,'gpa':'toeic499']
y = df_train.loc[:,'employment_score']
XX = df_test.loc[:,'gpa':'toeic499']
yy = df_test.loc[:,'employment_score']
## step2 
predictr = sklearn.linear_model.Ridge(alpha=50000000.0)
## step3
predictr.fit(X,y)
## step4 -- pass 
Ridge(alpha=50000000.0)
predictr.score(X,y)
0.752126856015936
predictr.score(XX,yy)
0.7450309251010894
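The matching scores are no coincidence: after choosing alpha_ by cross-validation, RidgeCV refits on the full training data with that alpha, so it should coincide with a plain Ridge fit. A sketch to confirm:

ridgecv = sklearn.linear_model.RidgeCV(alphas=[5e2, 5e3, 5e4, 5e5, 5e6, 5e7, 5e8]).fit(X, y)
ridge = sklearn.linear_model.Ridge(alpha=ridgecv.alpha_).fit(X, y)
np.allclose(ridgecv.coef_, ridge.coef_)  # expected: True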