import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.linear_model
import sklearn.model_selection
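These notes are based on material from Professor 최규빈's Special Topics in Big Data Analysis course (전북대학교, Jeonbuk National University, Fall semester 2023).
06wk-024: Employment + various English test scores, RidgeCV
1. Lecture video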
2. Imports
3. Data
df = pd.read_csv("")
np.random.seed(43052)
df['employment_score'] = df.gpa * 1.0 + df.toeic* 1/100 + np.random.randn(500)
df
| employment_score | gpa | toeic | toeic0 | toeic1 | toeic2 | toeic3 | toeic4 | toeic5 | toeic6 | ... | toeic490 | toeic491 | toeic492 | toeic493 | toeic494 | toeic495 | toeic496 | toeic497 | toeic498 | toeic499 |
0 | 1.784955 | 0.051535 | 135 | 129.566309 | 133.078481 | 121.678398 | 113.457366 | 133.564200 | 136.026566 | 141.793547 | ... | 132.014696 | 140.013265 | 135.575816 | 143.863346 | 152.162740 | 132.850033 | 115.956496 | 131.842126 | 125.090801 | 143.568527 |
1 | 10.789671 | 0.355496 | 935 | 940.563187 | 935.723570 | 939.190519 | 938.995672 | 945.376482 | 927.469901 | 952.424087 | ... | 942.251184 | 923.241548 | 939.924802 | 921.912261 | 953.250300 | 931.743615 | 940.205853 | 930.575825 | 941.530348 | 934.221055 |
2 | 8.221213 | 2.228435 | 485 | 493.671390 | 493.909118 | 475.500970 | 480.363752 | 478.868942 | 493.321602 | 490.059102 | ... | 484.438233 | 488.101275 | 485.626742 | 475.330715 | 485.147363 | 468.553780 | 486.870976 | 481.640957 | 499.340808 | 488.197332 |
3 | 2.137594 | 1.179701 | 65 | 62.272565 | 55.957257 | 68.521468 | 76.866765 | 51.436321 | 57.166824 | 67.834920 | ... | 67.653225 | 65.710588 | 64.146780 | 76.662194 | 66.837839 | 82.379018 | 69.174745 | 64.475993 | 52.647087 | 59.493275 |
4 | 8.650144 | 3.962356 | 445 | 449.280637 | 438.895582 | 433.598274 | 444.081141 | 437.005100 | 434.761142 | 443.135269 | ... | 455.940348 | 435.952854 | 441.521145 | 443.038886 | 433.118847 | 466.103355 | 430.056944 | 423.632873 | 446.973484 | 442.793633 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
495 | 9.057243 | 4.288465 | 280 | 276.680902 | 274.502675 | 277.868536 | 292.283300 | 277.476630 | 281.671647 | 296.307373 | ... | 269.541846 | 278.220546 | 278.484758 | 284.901284 | 272.451612 | 265.784490 | 275.795948 | 280.465992 | 268.528889 | 283.638470 |
496 | 4.108020 | 2.601212 | 310 | 296.940263 | 301.545000 | 306.725610 | 314.811407 | 311.935810 | 309.695838 | 301.979914 | ... | 304.680578 | 295.476836 | 316.582100 | 319.412132 | 312.984039 | 312.372112 | 312.106944 | 314.101927 | 309.409533 | 297.429968 |
497 | 2.430590 | 0.042323 | 225 | 206.793217 | 228.335345 | 222.115146 | 216.479498 | 227.469560 | 238.710310 | 233.797065 | ... | 233.469238 | 235.160919 | 228.517306 | 228.349646 | 224.153606 | 230.860484 | 218.683195 | 232.949484 | 236.951938 | 227.997629 |
498 | 5.343171 | 1.041416 | 320 | 327.461442 | 323.019899 | 329.589337 | 313.312233 | 315.645050 | 324.448247 | 314.271045 | ... | 326.297700 | 309.893822 | 312.873223 | 322.356584 | 319.332809 | 319.405283 | 324.021917 | 312.363694 | 318.493866 | 310.973930 |
499 | 6.505106 | 3.626883 | 375 | 370.966595 | 364.668477 | 371.853566 | 373.574930 | 376.701708 | 356.905085 | 354.584022 | ... | 382.278782 | 379.460816 | 371.031640 | 370.272639 | 375.618182 | 369.252740 | 376.925543 | 391.863103 | 368.735260 | 368.520844 |
500 rows × 503 columns
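The CSV path is left blank in these notes, so the data cannot be reloaded from here. As a rough stand-in, the sketch below builds a frame with the same column layout (`employment_score`, `gpa`, `toeic`, `toeic0` to `toeic499`). The function name `make_toy_df`, the uniform `gpa` range, the step-of-5 `toeic` values, and the noise standard deviation of 10 are all assumptions read off the printed table, not the course's actual data-generating code.

```python
import numpy as np
import pandas as pd

def make_toy_df(n=500, n_copies=500, seed=43052):
    """Hypothetical stand-in for the course CSV (assumed distributions)."""
    rng = np.random.default_rng(seed)
    gpa = rng.uniform(0, 4.5, size=n)                  # assumed GPA range
    toeic = rng.integers(2, 199, size=n) * 5           # multiples of 5 in [10, 990]
    # Each toeic{i} column is the true toeic score plus independent noise.
    copies = toeic[:, None] + rng.normal(0, 10, size=(n, n_copies))
    # Same formula as in the notes: gpa * 1.0 + toeic / 100 + standard normal noise.
    score = gpa * 1.0 + toeic * 1 / 100 + rng.normal(size=n)
    base = pd.DataFrame({"employment_score": score, "gpa": gpa, "toeic": toeic})
    noisy = pd.DataFrame(copies, columns=[f"toeic{i}" for i in range(n_copies)])
    return pd.concat([base, noisy], axis=1)

toy_df = make_toy_df()
print(toy_df.shape)
```

Because the column names match, `toy_df.loc[:, 'gpa':'toeic499']` selects the same 502 feature columns as the slicing used in the steps that follow.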
4. RidgeCV
Let us select a model using the RidgeCV class.
## step1
df_train, df_test = sklearn.model_selection.train_test_split(df,test_size=0.3,random_state=42)
X = df_train.loc[:,'gpa':'toeic499']
y = df_train.loc[:,'employment_score']
XX = df_test.loc[:,'gpa':'toeic499']
yy = df_test.loc[:,'employment_score']
## step2
predictr = sklearn.linear_model.RidgeCV()
## step3
predictr.fit(X,y)
## step4 -- pass
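After fitting, the chosen penalty is available as `alpha_`, and `score` returns R² on any dataset. The sketch below shows this on small synthetic data; the data generator and the sizes `n` and `p` are assumptions made only so that the block runs on its own, not the course data.

```python
import numpy as np
import sklearn.linear_model
import sklearn.model_selection

# Hypothetical stand-in data: gpa, toeic, and p noisy copies of toeic.
rng = np.random.default_rng(0)
n, p = 200, 300
gpa = rng.uniform(0, 4.5, n)
toeic = rng.integers(2, 199, n) * 5.0
X_all = np.column_stack([gpa, toeic, toeic[:, None] + rng.normal(0, 10, (n, p))])
y_all = gpa + toeic / 100 + rng.normal(size=n)

# step1: train/test split
X, XX, y, yy = sklearn.model_selection.train_test_split(
    X_all, y_all, test_size=0.3, random_state=42
)

# step2 and step3: default RidgeCV, candidates alphas=(0.1, 1.0, 10.0)
predictr = sklearn.linear_model.RidgeCV()
predictr.fit(X, y)

# Inspect the selected alpha and compare train vs. test R^2.
print("selected alpha:", predictr.alpha_)
print("train R^2:", predictr.score(X, y))
print("test  R^2:", predictr.score(XX, yy))
```

A large gap between the two scores would suggest that none of the default candidates penalizes strongly enough for this many correlated features.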
Let us choose the candidate alphas ourselves.
Init signature:
sklearn.linear_model.RidgeCV(
    alphas=(0.1, 1.0, 10.0),
    *,
    fit_intercept=True,
    scoring=None,
    cv=None,
    gcv_mode=None,
    store_cv_values=False,
    alpha_per_target=False,
)

    If using Leave-One-Out cross-validation, alphas must be positive.

fit_intercept : bool, default=True
    Whether to calculate the intercept for this model. If set
    to false, no intercept will be used in calculations
    (i.e. data is expected to be centered).

scoring : str, callable, default=None
    A string (see model evaluation documentation) or
    a scorer callable object / function with signature
    ``scorer(estimator, X, y)``.
    If None, the negative mean squared error if cv is 'auto' or None
    (i.e. when using leave-one-out cross-validation), and r2 score
    otherwise.

cv : int, cross-validation generator or an iterable, default=None
    Determines the cross-validation splitting strategy.
    Possible inputs for cv are:

    - None, to use the efficient Leave-One-Out cross-validation
    - integer, to specify the number of folds.
    - :term:`CV splitter`,
    - An iterable yielding (train, test) splits as arrays of indices.

    For integer/None inputs, if ``y`` is binary or multiclass,
    :class:`~sklearn.model_selection.StratifiedKFold` is used, else,
    :class:`~sklearn.model_selection.KFold` is used.

    Refer :ref:`User Guide <cross_validation>` for the various
    cross-validation strategies that can be used here.

gcv_mode : {'auto', 'svd', 'eigen'}, default='auto'
    Flag indicating which strategy to use when performing
    Leave-One-Out Cross-Validation. Options are::

        'auto' : use 'svd' if n_samples > n_features, otherwise use 'eigen'
        'svd' : force use of singular value decomposition of X when X is
            dense, eigenvalue decomposition of X^T.X when X is sparse.
        'eigen' : force computation via eigendecomposition of X.X^T

    The 'auto' mode is the default and is intended to pick the cheaper
    option of the two depending on the shape of the training data.

store_cv_values : bool, default=False
    Flag indicating if the cross-validation values corresponding to
    each alpha should be stored in the ``cv_values_`` attribute (see
    below). This flag is only compatible with ``cv=None`` (i.e. using
    Leave-One-Out Cross-Validation).

alpha_per_target : bool, default=False
    Flag indicating whether to optimize the alpha value (picked from the
    `alphas` parameter list) for each target separately (for multi-output
    settings: multiple prediction targets). When set to `True`, after
    fitting, the `alpha_` attribute will contain a value for each target.
    When set to `False`, a single alpha is used for all targets.

    .. versionadded:: 0.24

Attributes
----------
cv_values_ : ndarray of shape (n_samples, n_alphas) or shape (n_samples, n_targets, n_alphas), optional
    Cross-validation values for each alpha (only available if
    ``store_cv_values=True`` and ``cv=None``). After ``fit()`` has been
    called, this attribute will contain the mean squared errors if
    `scoring is None` otherwise it will contain standardized per point
    prediction values.

coef_ : ndarray of shape (n_features) or (n_targets, n_features)
    Weight vector(s).

intercept_ : float or ndarray of shape (n_targets,)
    Independent term in decision function. Set to 0.0 if
    ``fit_intercept = False``.

alpha_ : float or ndarray of shape (n_targets,)
    Estimated regularization parameter, or, if ``alpha_per_target=True``,
    the estimated regularization parameter for each target.

best_score_ : float or ndarray of shape (n_targets,)
    Score of base estimator with best alpha, or, if
    ``alpha_per_target=True``, a score for each target.

    .. versionadded:: 0.23

n_features_in_ : int
    Number of features seen during :term:`fit`.

    .. versionadded:: 0.24

feature_names_in_ : ndarray of shape (`n_features_in_`,)
    Names of features seen during :term:`fit`. Defined only when `X`
    has feature names that are all strings.

    .. versionadded:: 1.0

See Also
--------
Ridge : Ridge regression.
RidgeClassifier : Classifier based on ridge regression on {-1, 1} labels.
RidgeClassifierCV : Ridge classifier with built-in cross validation.

Examples
--------
>>> from sklearn.datasets import load_diabetes
>>> from sklearn.linear_model import RidgeCV
>>> X, y = load_diabetes(return_X_y=True)
>>> clf = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1]).fit(X, y)
>>> clf.score(X, y)
0.5166...

File:           ~/anaconda3/envs/py38/lib/python3.8/site-packages/sklearn/linear_model/
Type:           ABCMeta
Subclasses:
By default, RidgeCV appears to try each value in alphas=(0.1, 1.0, 10.0) and keep the one with the best cross-validation score.
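To make "keeps the best candidate" concrete, the sketch below scores each candidate alpha by hand with `cross_val_score` and compares the winner with what `RidgeCV` reports. Note the assumption: it passes `cv=5` to both, so this is 5-fold cross-validation scored by R², not the efficient leave-one-out scheme that `RidgeCV` uses when `cv=None`. The synthetic data is likewise only a stand-in.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in data: gpa, toeic, and p noisy copies of toeic.
rng = np.random.default_rng(1)
n, p = 150, 50
gpa = rng.uniform(0, 4.5, n)
toeic = rng.integers(2, 199, n) * 5.0
X = np.column_stack([gpa, toeic, toeic[:, None] + rng.normal(0, 10, (n, p))])
y = gpa + toeic / 100 + rng.normal(size=n)

alphas = [0.1, 1.0, 10.0]

# Manual search: mean 5-fold R^2 of a plain Ridge model for each candidate.
mean_scores = [cross_val_score(Ridge(alpha=a), X, y, cv=5).mean() for a in alphas]
manual_best = alphas[int(np.argmax(mean_scores))]

# RidgeCV running the same 5-fold search internally.
auto_best = RidgeCV(alphas=alphas, cv=5).fit(X, y).alpha_

print("manual choice:", manual_best)
print("RidgeCV choice:", auto_best)
```

With an integer `cv`, both paths use the same unshuffled folds, so the two choices should coincide.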
## step1
df_train, df_test = sklearn.model_selection.train_test_split(df,test_size=0.3,random_state=42)
X = df_train.loc[:,'gpa':'toeic499']
y = df_train.loc[:,'employment_score']
XX = df_test.loc[:,'gpa':'toeic499']
yy = df_test.loc[:,'employment_score']
## step2
predictr = sklearn.linear_model.RidgeCV(alphas=[5e2, 5e3, 5e4, 5e5, 5e6, 5e7, 5e8])
## step3
predictr.fit(X,y)
## step4 -- pass
RidgeCV(alphas=[500.0, 5000.0, 50000.0, 500000.0, 5000000.0, 50000000.0, 500000000.0])
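With a hand-picked grid it can help to check which candidate was actually selected and whether it sits at the edge of the grid. The following is a sketch on synthetic stand-in data (an assumption, not the course data); `alpha_` and `best_score_` are the fitted attributes listed in the docstring above, and with the default settings `best_score_` is a negative mean squared error.

```python
import numpy as np
import sklearn.linear_model

# Hypothetical stand-in data with more features than observations.
rng = np.random.default_rng(2)
n, p = 140, 300
gpa = rng.uniform(0, 4.5, n)
toeic = rng.integers(2, 199, n) * 5.0
X = np.column_stack([gpa, toeic, toeic[:, None] + rng.normal(0, 10, (n, p))])
y = gpa + toeic / 100 + rng.normal(size=n)

# Same candidate grid as in the notes.
alphas = [5e2, 5e3, 5e4, 5e5, 5e6, 5e7, 5e8]
predictr = sklearn.linear_model.RidgeCV(alphas=alphas)
predictr.fit(X, y)

# If the winner is the smallest or largest candidate, the grid may be too narrow.
on_edge = predictr.alpha_ in (min(alphas), max(alphas))

print("selected alpha:", predictr.alpha_)
print("best leave-one-out score (negative MSE):", predictr.best_score_)
print("selected alpha is at the edge of the grid:", on_edge)
```

If `on_edge` is true, extending the grid in that direction and refitting is a reasonable next step.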
For reference, this fit is equivalent to running the code below, which fits Ridge directly with alpha=50000000.0.
## step1
df_train, df_test = sklearn.model_selection.train_test_split(df,test_size=0.3,random_state=42)
X = df_train.loc[:,'gpa':'toeic499']
y = df_train.loc[:,'employment_score']
XX = df_test.loc[:,'gpa':'toeic499']
yy = df_test.loc[:,'employment_score']
## step2
predictr = sklearn.linear_model.Ridge(alpha=50000000.0)
## step3
predictr.fit(X,y)
## step4 -- pass
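The equivalence can be checked numerically: fit `RidgeCV`, then fit a plain `Ridge` with the selected `alpha_`, and compare predictions. The sketch below does this on synthetic stand-in data (an assumption, so the selected alpha here need not be 50000000.0); the two models use different solvers, so the comparison is up to a small numerical tolerance.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV

# Hypothetical stand-in data: first 140 rows for training, last 60 for testing.
rng = np.random.default_rng(3)
n, p = 200, 300
gpa = rng.uniform(0, 4.5, n)
toeic = rng.integers(2, 199, n) * 5.0
X_all = np.column_stack([gpa, toeic, toeic[:, None] + rng.normal(0, 10, (n, p))])
y_all = gpa + toeic / 100 + rng.normal(size=n)
X, XX = X_all[:140], X_all[140:]
y, yy = y_all[:140], y_all[140:]

alphas = [5e2, 5e3, 5e4, 5e5, 5e6, 5e7, 5e8]

# Model 1: RidgeCV chooses alpha from the candidates.
cv_model = RidgeCV(alphas=alphas).fit(X, y)

# Model 2: plain Ridge using the alpha that RidgeCV selected.
plain_model = Ridge(alpha=cv_model.alpha_).fit(X, y)

# Compare test-set predictions of the two fitted models.
max_diff = np.max(np.abs(cv_model.predict(XX) - plain_model.predict(XX)))
same_pred = np.allclose(cv_model.predict(XX), plain_model.predict(XX),
                        rtol=1e-6, atol=1e-6)

print("selected alpha:", cv_model.alpha_)
print("max |prediction difference|:", max_diff)
print("predictions agree:", same_pred)
```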
