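This material is from Professor Guebin Choi's (최규빈) Special Topics in Big Data Analysis course at Jeonbuk National University, second semester of the 2023 academic year.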

04wk-017: Employment, a deeper look at logistic regression

최규빈 (Guebin Choi)
2023-09-26

1. Lecture video

https://youtu.be/playlist?list=PLQqh36zP38-yZKLoD4xCQYvimA2q_8lCl&si=vBBY-dA7arD2SCSy

2. Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.linear_model

3. Load the data \(\to\) fit

df = pd.read_csv('https://raw.githubusercontent.com/guebin/MP2023/main/posts/employment.csv')
df

	toeic	gpa	employment
0	135	0.051535	0
1	935	0.355496	0
2	485	2.228435	0
3	65	1.179701	0
4	445	3.962356	1
...	...	...	...
495	280	4.288465	1
496	310	2.601212	1
497	225	0.042323	0
498	320	1.041416	0
499	375	3.626883	1

500 rows × 3 columns

X = df[['toeic','gpa']]
y = df['employment']  # 1-D target of shape (n_samples,), so fit() raises no DataConversionWarning
predictr = sklearn.linear_model.LogisticRegression()
predictr.fit(X,y)

LogisticRegression()

4. How is yhat produced?

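- Check: the predictions do come from an explicit formula, which can be rebuilt from the fitted coefficients and intercept.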

predictr.coef_, predictr.intercept_

(array([[0.00571598, 2.46520018]]), array([-8.45433334]))

u = X.toeic*0.00571598 + X.gpa*2.46520018 - 8.45433334  # linear score built from coef_ and intercept_
v = 1/(1+np.exp(-u))  # sigmoid of u
v # behaves like a probability

0      0.000523
1      0.096780
2      0.453003
3      0.005627
4      0.979312
         ...   
495    0.976295
496    0.432939
497    0.000855
498    0.016991
499    0.932777
Length: 500, dtype: float64
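
To make the arithmetic concrete, here is a minimal sketch that recomputes u and v by hand for rows 2 and 4 of df, using only the standard library. The coefficients are copied from the predictr.coef_ and predictr.intercept_ output above. The helper name employment_score and the constants W_TOEIC, W_GPA, BIAS are introduced here for illustration only.

```python
import math

# Coefficients copied from predictr.coef_ and predictr.intercept_ above.
W_TOEIC, W_GPA, BIAS = 0.00571598, 2.46520018, -8.45433334

def employment_score(toeic, gpa):
    """Return (u, v): the linear score and its sigmoid (hypothetical helper)."""
    u = toeic * W_TOEIC + gpa * W_GPA + BIAS
    v = 1 / (1 + math.exp(-u))
    return u, v

# Rows 2 and 4 of df, as printed in the table above.
for toeic, gpa in [(485, 2.228435), (445, 3.962356)]:
    u, v = employment_score(toeic, gpa)
    print(f"toeic={toeic}, gpa={gpa}: u={u:.3f}, v={v:.3f}, yhat={int(v > 0.5)}")
# toeic=485, gpa=2.228435: u=-0.189, v=0.453, yhat=0
# toeic=445, gpa=3.962356: u=3.857, v=0.979, yhat=1
```

The two v values agree with entries 2 and 4 of the series printed above (0.453003 and 0.979312).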

((v > 0.5) == predictr.predict(X)).mean()

1.0
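
The same check can be written without hard-coding the coefficients. The sketch below is an illustration under assumptions: it uses synthetic stand-in data rather than employment.csv, and the names X_demo, y_demo, and model are made up here. It rebuilds u from coef_ and intercept_, compares it with decision_function, and confirms that thresholding v at 0.5 reproduces predict.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the employment.csv used above).
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.uniform(0, 990, size=200),   # toeic-like score
    rng.uniform(0, 4.5, size=200),   # gpa-like score
])
u_true = 0.005 * X_demo[:, 0] + 2.0 * X_demo[:, 1] - 7.0
y_demo = (rng.uniform(size=200) < 1 / (1 + np.exp(-u_true))).astype(int)

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Rebuild the linear score and its sigmoid from the fitted parameters.
u = X_demo @ model.coef_.ravel() + model.intercept_[0]
v = 1 / (1 + np.exp(-u))

same_score = np.allclose(u, model.decision_function(X_demo))
same_label = np.array_equal((v > 0.5).astype(int), model.predict(X_demo))
print(same_score, same_label)
```

Since v > 0.5 holds exactly when u > 0, the 0.5 cutoff on the probability is the same rule as a zero cutoff on the linear score.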

- In any case, the structure is as follows.

(Structure 1)

(Structure 2): simplified
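
The diagrams for the two structures are not reproduced here. As a sketch based only on the code above, the pipeline from X to yhat can be written as equations, where \(w_1, w_2\) stand for the two entries of predictr.coef_ and \(b\) for predictr.intercept_ (these symbols are introduced here, not taken from the original figures):

\[
u = w_1\,\text{toeic} + w_2\,\text{gpa} + b, \qquad
v = \frac{1}{1+e^{-u}}, \qquad
\hat{y} = \begin{cases} 1 & \text{if } v > 0.5 \\ 0 & \text{otherwise} \end{cases}
\]

In short: \(X \to u \to v \to \hat{y}\).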

- What if we want the values of v directly?

v[:5].round(3)

0    0.001
1    0.097
2    0.453
3    0.006
4    0.979
dtype: float64

predictr.predict_proba(X)[:5].round(3)

array([[0.999, 0.001],
       [0.903, 0.097],
       [0.547, 0.453],
       [0.994, 0.006],
       [0.021, 0.979]])

In predictr.predict_proba(X), the right column is v (the probability of being employed).

The left column is the probability of not being employed, that is, 1 - v.
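
The column order of predict_proba follows the fitted model's classes_ attribute, so it is safer to look up the column for class 1 than to rely on "right" and "left". The sketch below uses a tiny made-up dataset (an assumption for illustration; X_demo, y_demo, and model are hypothetical names) to show this.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset (an assumption, for illustration only).
X_demo = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y_demo = np.array([0, 0, 0, 1, 0, 1, 1, 1])

model = LogisticRegression().fit(X_demo, y_demo)
proba = model.predict_proba(X_demo)

# classes_ gives the label that each predict_proba column refers to.
print(model.classes_)
col_of_1 = list(model.classes_).index(1)

v = proba[:, col_of_1]           # P(y = 1), the "right" column here
not_v = proba[:, 1 - col_of_1]   # P(y = 0), the "left" column here
print(np.allclose(v + not_v, 1.0))
```

With labels 0 and 1, classes_ is sorted as [0, 1], which is why the right column holds v in the output above.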