[STBDA2023] 06wk-024: Employment + Various English Scores, RidgeCV

Author

김보람

Published

October 14, 2023

This material is from Professor 최규빈's Special Topics in Big Data Analysis course (Fall semester 2023) at Jeonbuk National University.

06wk-024: Employment + Various English Scores, RidgeCV

최규빈
2023-10-05

1. Lecture Video

https://youtu.be/playlist?list=PLQqh36zP38-wCfFLHO2uCcH6izfJPTKro&si=YYn_bwPwcuwTk0Ld

2. Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import sklearn.linear_model
import sklearn.model_selection

3. Data

df = pd.read_csv("https://raw.githubusercontent.com/guebin/MP2023/main/posts/employment_multicollinearity.csv")
np.random.seed(43052)
df['employment_score'] = df.gpa * 1.0 + df.toeic * 1/100 + np.random.randn(500)
df
employment_score gpa toeic toeic0 toeic1 toeic2 toeic3 toeic4 toeic5 toeic6 ... toeic490 toeic491 toeic492 toeic493 toeic494 toeic495 toeic496 toeic497 toeic498 toeic499
0 1.784955 0.051535 135 129.566309 133.078481 121.678398 113.457366 133.564200 136.026566 141.793547 ... 132.014696 140.013265 135.575816 143.863346 152.162740 132.850033 115.956496 131.842126 125.090801 143.568527
1 10.789671 0.355496 935 940.563187 935.723570 939.190519 938.995672 945.376482 927.469901 952.424087 ... 942.251184 923.241548 939.924802 921.912261 953.250300 931.743615 940.205853 930.575825 941.530348 934.221055
2 8.221213 2.228435 485 493.671390 493.909118 475.500970 480.363752 478.868942 493.321602 490.059102 ... 484.438233 488.101275 485.626742 475.330715 485.147363 468.553780 486.870976 481.640957 499.340808 488.197332
3 2.137594 1.179701 65 62.272565 55.957257 68.521468 76.866765 51.436321 57.166824 67.834920 ... 67.653225 65.710588 64.146780 76.662194 66.837839 82.379018 69.174745 64.475993 52.647087 59.493275
4 8.650144 3.962356 445 449.280637 438.895582 433.598274 444.081141 437.005100 434.761142 443.135269 ... 455.940348 435.952854 441.521145 443.038886 433.118847 466.103355 430.056944 423.632873 446.973484 442.793633
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
495 9.057243 4.288465 280 276.680902 274.502675 277.868536 292.283300 277.476630 281.671647 296.307373 ... 269.541846 278.220546 278.484758 284.901284 272.451612 265.784490 275.795948 280.465992 268.528889 283.638470
496 4.108020 2.601212 310 296.940263 301.545000 306.725610 314.811407 311.935810 309.695838 301.979914 ... 304.680578 295.476836 316.582100 319.412132 312.984039 312.372112 312.106944 314.101927 309.409533 297.429968
497 2.430590 0.042323 225 206.793217 228.335345 222.115146 216.479498 227.469560 238.710310 233.797065 ... 233.469238 235.160919 228.517306 228.349646 224.153606 230.860484 218.683195 232.949484 236.951938 227.997629
498 5.343171 1.041416 320 327.461442 323.019899 329.589337 313.312233 315.645050 324.448247 314.271045 ... 326.297700 309.893822 312.873223 322.356584 319.332809 319.405283 324.021917 312.363694 318.493866 310.973930
499 6.505106 3.626883 375 370.966595 364.668477 371.853566 373.574930 376.701708 356.905085 354.584022 ... 382.278782 379.460816 371.031640 370.272639 375.618182 369.252740 376.925543 391.863103 368.735260 368.520844

500 rows × 503 columns
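The generating code above sets employment_score = gpa + toeic/100 + noise with noise ~ N(0, 1), and the columns toeic0 through toeic499 look like noisy replicates of toeic, which is exactly the multicollinearity the file name advertises. A quick sanity check of that structure (a sketch, not part of the original lecture):

# sketch: each toeic-K column should be almost perfectly correlated with toeic
# if it is just toeic plus small noise; values near 1.0 across all 500 columns
# confirm the multicollinearity
df.loc[:, 'toeic0':'toeic499'].corrwith(df.toeic)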

4. RidgeCV

- Let's select a model using the RidgeCV class.

## step1 
df_train, df_test = sklearn.model_selection.train_test_split(df,test_size=0.3,random_state=42)
X = df_train.loc[:,'gpa':'toeic499']
y = df_train.loc[:,'employment_score']
XX = df_test.loc[:,'gpa':'toeic499']
yy = df_test.loc[:,'employment_score']
## step2 
predictr = sklearn.linear_model.RidgeCV()
## step3
predictr.fit(X,y)
## step4 -- pass 
RidgeCV()
predictr.score(X,y)
0.9999996840224943
predictr.score(XX,yy)
0.11914945948787758
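Train R² is essentially 1 while test R² is about 0.12, a textbook overfit. To see which regularization strength the default grid settled on (a sketch; the value will be one of the three defaults):

predictr.alpha_  # one of (0.1, 1.0, 10.0); all far too weak for 501 collinear features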

- The default candidates are clearly too weak here, so let's choose the candidate alphas ourselves.

sklearn.linear_model.RidgeCV?
Init signature:
sklearn.linear_model.RidgeCV(
    alphas=(0.1, 1.0, 10.0),
    *,
    fit_intercept=True,
    scoring=None,
    cv=None,
    gcv_mode=None,
    store_cv_values=False,
    alpha_per_target=False,
)
Docstring:     
Ridge regression with built-in cross-validation.
See glossary entry for :term:`cross-validation estimator`.
By default, it performs efficient Leave-One-Out Cross-Validation.
Read more in the :ref:`User Guide <ridge_regression>`.
Parameters
----------
alphas : array-like of shape (n_alphas,), default=(0.1, 1.0, 10.0)
    Array of alpha values to try.
    Regularization strength; must be a positive float. Regularization
    improves the conditioning of the problem and reduces the variance of
    the estimates. Larger values specify stronger regularization.
    Alpha corresponds to ``1 / (2C)`` in other linear models such as
    :class:`~sklearn.linear_model.LogisticRegression` or
    :class:`~sklearn.svm.LinearSVC`.
    If using Leave-One-Out cross-validation, alphas must be positive.
fit_intercept : bool, default=True
    Whether to calculate the intercept for this model. If set
    to false, no intercept will be used in calculations
    (i.e. data is expected to be centered).
scoring : str, callable, default=None
    A string (see model evaluation documentation) or
    a scorer callable object / function with signature
    ``scorer(estimator, X, y)``.
    If None, the negative mean squared error if cv is 'auto' or None
    (i.e. when using leave-one-out cross-validation), and r2 score
    otherwise.
cv : int, cross-validation generator or an iterable, default=None
    Determines the cross-validation splitting strategy.
    Possible inputs for cv are:
    - None, to use the efficient Leave-One-Out cross-validation
    - integer, to specify the number of folds.
    - :term:`CV splitter`,
    - An iterable yielding (train, test) splits as arrays of indices.
    For integer/None inputs, if ``y`` is binary or multiclass,
    :class:`~sklearn.model_selection.StratifiedKFold` is used, else,
    :class:`~sklearn.model_selection.KFold` is used.
    Refer :ref:`User Guide <cross_validation>` for the various
    cross-validation strategies that can be used here.
gcv_mode : {'auto', 'svd', 'eigen'}, default='auto'
    Flag indicating which strategy to use when performing
    Leave-One-Out Cross-Validation. Options are::
        'auto' : use 'svd' if n_samples > n_features, otherwise use 'eigen'
        'svd' : force use of singular value decomposition of X when X is
            dense, eigenvalue decomposition of X^T.X when X is sparse.
        'eigen' : force computation via eigendecomposition of X.X^T
    The 'auto' mode is the default and is intended to pick the cheaper
    option of the two depending on the shape of the training data.
store_cv_values : bool, default=False
    Flag indicating if the cross-validation values corresponding to
    each alpha should be stored in the ``cv_values_`` attribute (see
    below). This flag is only compatible with ``cv=None`` (i.e. using
    Leave-One-Out Cross-Validation).
alpha_per_target : bool, default=False
    Flag indicating whether to optimize the alpha value (picked from the
    `alphas` parameter list) for each target separately (for multi-output
    settings: multiple prediction targets). When set to `True`, after
    fitting, the `alpha_` attribute will contain a value for each target.
    When set to `False`, a single alpha is used for all targets.
    .. versionadded:: 0.24
Attributes
----------
cv_values_ : ndarray of shape (n_samples, n_alphas) or shape (n_samples, n_targets, n_alphas), optional
    Cross-validation values for each alpha (only available if
    ``store_cv_values=True`` and ``cv=None``). After ``fit()`` has been
    called, this attribute will contain the mean squared errors if
    `scoring is None` otherwise it will contain standardized per point
    prediction values.
coef_ : ndarray of shape (n_features) or (n_targets, n_features)
    Weight vector(s).
intercept_ : float or ndarray of shape (n_targets,)
    Independent term in decision function. Set to 0.0 if
    ``fit_intercept = False``.
alpha_ : float or ndarray of shape (n_targets,)
    Estimated regularization parameter, or, if ``alpha_per_target=True``,
    the estimated regularization parameter for each target.
best_score_ : float or ndarray of shape (n_targets,)
    Score of base estimator with best alpha, or, if
    ``alpha_per_target=True``, a score for each target.
    .. versionadded:: 0.23
n_features_in_ : int
    Number of features seen during :term:`fit`.
    .. versionadded:: 0.24
feature_names_in_ : ndarray of shape (`n_features_in_`,)
    Names of features seen during :term:`fit`. Defined only when `X`
    has feature names that are all strings.
    .. versionadded:: 1.0
See Also
--------
Ridge : Ridge regression.
RidgeClassifier : Classifier based on ridge regression on {-1, 1} labels.
RidgeClassifierCV : Ridge classifier with built-in cross validation.
Examples
--------
>>> from sklearn.datasets import load_diabetes
>>> from sklearn.linear_model import RidgeCV
>>> X, y = load_diabetes(return_X_y=True)
>>> clf = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1]).fit(X, y)
>>> clf.score(X, y)
0.5166...
File:           ~/anaconda3/envs/py38/lib/python3.8/site-packages/sklearn/linear_model/_ridge.py
Type:           ABCMeta
Subclasses:     

It appears to pick whichever value among alphas=(0.1, 1.0, 10.0) scores best, by default via the efficient leave-one-out cross-validation, so supplying a wider grid is the natural fix.
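The run below hand-picks a grid running from 5e2 to 5e8. As a side note, the same grid can be built programmatically (a sketch, equivalent to the hand-written list below):

alphas = 5 * np.logspace(2, 8, 7)  # array([5.e+02, 5.e+03, ..., 5.e+08])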

## step1 
df_train, df_test = sklearn.model_selection.train_test_split(df,test_size=0.3,random_state=42)
X = df_train.loc[:,'gpa':'toeic499']
y = df_train.loc[:,'employment_score']
XX = df_test.loc[:,'gpa':'toeic499']
yy = df_test.loc[:,'employment_score']
## step2 
predictr = sklearn.linear_model.RidgeCV(alphas=[5e2, 5e3, 5e4, 5e5, 5e6, 5e7, 5e8])
## step3
predictr.fit(X,y)
## step4 -- pass 
RidgeCV(alphas=[500.0, 5000.0, 50000.0, 500000.0, 5000000.0, 50000000.0,
                500000000.0])
predictr.score(X,y)
0.752126856015936
predictr.score(XX,yy)
0.7450309251010896
predictr.alpha_
50000000.0
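The selected alpha, 5e7, sits one step from the edge of the grid; if alpha_ ever lands on the boundary itself, the grid should be widened and the search rerun. Using the store_cv_values option documented above, the leave-one-out errors behind this choice can also be inspected directly (a sketch):

predictr = sklearn.linear_model.RidgeCV(
    alphas=[5e2, 5e3, 5e4, 5e5, 5e6, 5e7, 5e8],
    store_cv_values=True  # keep per-sample LOOCV values in cv_values_
)
predictr.fit(X, y)
predictr.cv_values_.mean(axis=0)  # mean squared error per alpha; the argmin matches alpha_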

For reference, this fit gives the same result as running the code below.

## step1 
df_train, df_test = sklearn.model_selection.train_test_split(df,test_size=0.3,random_state=42)
X = df_train.loc[:,'gpa':'toeic499']
y = df_train.loc[:,'employment_score']
XX = df_test.loc[:,'gpa':'toeic499']
yy = df_test.loc[:,'employment_score']
## step2 
predictr = sklearn.linear_model.Ridge(alpha=50000000.0)
## step3
predictr.fit(X,y)
## step4 -- pass 
Ridge(alpha=50000000.0)
predictr.score(X,y)
0.752126856015936
predictr.score(XX,yy)
0.7450309251010894
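The matching scores are no coincidence: after choosing alpha_ by cross-validation, RidgeCV refits on the full training data with that alpha, so it should coincide with a plain Ridge fit. A sketch to confirm:

ridgecv = sklearn.linear_model.RidgeCV(alphas=[5e2, 5e3, 5e4, 5e5, 5e6, 5e7, 5e8]).fit(X, y)
ridge = sklearn.linear_model.Ridge(alpha=ridgecv.alpha_).fit(X, y)
np.allclose(ridgecv.coef_, ridge.coef_)  # expected: True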