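These notes are from Professor Guebin Choi's course Special Topics in Big Data Analysis at Jeonbuk National University, second semester of the 2023 academic year.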

05wk-021: Employment + balance game, overfitting

Guebin Choi
2023-10-05

1. Lecture video

https://youtu.be/playlist?list=PLQqh36zP38-zxnl-vcjgZTeJxvNiOT4V0&si=q98aHqMKQyrCr8dW

2. Imports

import numpy as np
import pandas as pd
import sklearn.linear_model

3. Balance game

- ref: https://verovero1.tistory.com/136

- The point: suppose I can collect as much data as I want.[1]

4. Scenario 1

- Consider the data below.

[1] Big data??

data = {
    'toeic': [640, 705, 930, 755, 410, 655, 400, 915, 970, 895],
    'gpa': [0.712533, 0.639031, 0.148400, 1.230271, 3.279419, 1.255110, 4.157389, 3.714847, 1.584432, 2.170776],
    'employment': [1, 0, 0, 0, 1, 0, 1, 1, 1, 0],
    'balance_game_1': [1, 0, 0, 0, 1, 0, 1, 0, 0, 0]
}

df = pd.DataFrame(data)
df

	toeic	gpa	employment	balance_game_1
0	640	0.712533	1	1
1	705	0.639031	0	0
2	930	0.148400	0	0
3	755	1.230271	0	0
4	410	3.279419	1	1
5	655	1.255110	0	0
6	400	4.157389	1	1
7	915	3.714847	1	0
8	970	1.584432	1	0
9	895	2.170776	0	0

- Suppose we split the data into train and test sets as follows.

df_train = df[:7]
df_test = df[7:]

df_train

	toeic	gpa	employment	balance_game_1
0	640	0.712533	1	1
1	705	0.639031	0	0
2	930	0.148400	0	0
3	755	1.230271	0	0
4	410	3.279419	1	1
5	655	1.255110	0	0
6	400	4.157389	1	1

df_test

	toeic	gpa	employment
7	915	3.714847	1
8	970	1.584432	1
9	895	2.170776	0

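- Intuition: training on this data is bound to go wrong. In the train set balance_game_1 happens to match employment exactly, so a model will latch onto it, yet in rows 7-9 it is 0 for everyone regardless of employment.

- Conclusion: more data is not always better. Useless data can actually get in the way of learning, so a model may predict well on the train set but poorly on the test set. This phenomenon is called overfitting.

- Objection: isn't this example too contrived?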
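The intuition in Scenario 1 can be checked by hand. The sketch below does not fit any model; it assumes a hypothetical rule, "predict employment by copying balance_game_1" (the helper name `accuracy_of_copy_rule` is made up for illustration), and measures how often that rule is right on each split.

```python
import pandas as pd

# Same toy data as in Scenario 1.
df = pd.DataFrame({
    'toeic': [640, 705, 930, 755, 410, 655, 400, 915, 970, 895],
    'gpa': [0.712533, 0.639031, 0.148400, 1.230271, 3.279419,
            1.255110, 4.157389, 3.714847, 1.584432, 2.170776],
    'employment': [1, 0, 0, 0, 1, 0, 1, 1, 1, 0],
    'balance_game_1': [1, 0, 0, 0, 1, 0, 1, 0, 0, 0],
})
df_train, df_test = df[:7], df[7:]

def accuracy_of_copy_rule(frame):
    # Hypothetical rule: predict employment by copying the balance-game answer.
    return (frame['balance_game_1'] == frame['employment']).mean()

train_acc = accuracy_of_copy_rule(df_train)  # right on all 7 train rows
test_acc = accuracy_of_copy_rule(df_test)    # right on only 1 of 3 test rows
print(train_acc, test_acc)
```

The rule is perfect on the train rows and right only once on the test rows, which is the train/test gap the conclusion calls overfitting.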

5. Scenario 2

- Again, suppose we have the data below.

data = {
    'toeic': [640, 705, 930, 755, 410, 655, 400, 915, 970, 895],
    'gpa': [0.712533, 0.639031, 0.148400, 1.230271, 3.279419, 1.255110, 4.157389, 3.714847, 1.584432, 2.170776],
    'employment': [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]
}
df = pd.DataFrame(data)
df

	toeic	gpa	employment
0	640	0.712533	1
1	705	0.639031	0
2	930	0.148400	0
3	755	1.230271	0
4	410	3.279419	1
5	655	1.255110	0
6	400	4.157389	1
7	915	3.714847	1
8	970	1.584432	1
9	895	2.170776	0

np.random.seed(43052)
arr = (np.random.rand(10*50).reshape(10,50) >0.5)*1.0
df_balance = pd.DataFrame(arr,columns=['X'+str(i) for i in range(50)])
df = pd.concat([df,df_balance],axis=1)

df

	toeic	gpa	employment	X0	X1	X2	X3	X4	X5	X6	...	X40	X41	X42	X43	X44	X45	X46	X47	X48	X49
0	640	0.712533	1	1.0	0.0	1.0	1.0	0.0	0.0	1.0	...	1.0	0.0	1.0	1.0	0.0	1.0	1.0	0.0	0.0	0.0
1	705	0.639031	0	0.0	1.0	1.0	1.0	0.0	0.0	1.0	...	1.0	1.0	1.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0
2	930	0.148400	0	1.0	1.0	0.0	0.0	0.0	0.0	0.0	...	0.0	1.0	0.0	0.0	0.0	1.0	1.0	1.0	0.0	0.0
3	755	1.230271	0	1.0	0.0	0.0	0.0	1.0	1.0	1.0	...	1.0	1.0	0.0	0.0	0.0	0.0	1.0	1.0	1.0	1.0
4	410	3.279419	1	1.0	0.0	0.0	0.0	1.0	1.0	1.0	...	0.0	0.0	0.0	0.0	1.0	0.0	1.0	0.0	1.0	1.0
5	655	1.255110	0	1.0	0.0	0.0	1.0	0.0	1.0	1.0	...	1.0	0.0	1.0	0.0	1.0	0.0	0.0	1.0	1.0	1.0
6	400	4.157389	1	1.0	0.0	0.0	0.0	0.0	1.0	1.0	...	0.0	1.0	1.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0
7	915	3.714847	1	1.0	1.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	1.0	0.0	1.0	0.0	1.0	1.0	1.0
8	970	1.584432	1	0.0	0.0	1.0	0.0	0.0	0.0	0.0	...	1.0	1.0	0.0	1.0	0.0	1.0	1.0	1.0	0.0	0.0
9	895	2.170776	0	0.0	0.0	0.0	1.0	0.0	1.0	1.0	...	1.0	1.0	1.0	0.0	1.0	0.0	1.0	0.0	1.0	1.0

10 rows × 53 columns

(df.X12 + df.X43 + df.X8 - df.X29)>0

0     True
1    False
2    False
3    False
4     True
5    False
6     True
7     True
8     True
9    False
dtype: bool

df.employment

0    1
1    0
2    0
3    0
4    1
5    0
6    1
7    1
8    1
9    0
Name: employment, dtype: int64

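- If it comes to that, I could run the balance game 5 billion times.. With that many random questions, some combination of answers is bound to match employment by pure chance, just like the combination above.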
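How easily chance alone produces a "perfect predictor" can be estimated with a short simulation. This is only a sketch: the 50,000 questions, the fair-coin answers, and the seed are assumptions chosen for illustration, not part of the lecture data.

```python
import numpy as np

employment = np.array([1, 0, 0, 0, 1, 0, 1, 1, 1, 0])

# Assumption: 50,000 yes/no questions, each answered by a fair coin flip.
rng = np.random.default_rng(0)
n_questions = 50_000
answers = rng.integers(0, 2, size=(employment.size, n_questions))

# A question "predicts" employment perfectly if its answers equal
# employment on all 10 rows.
perfect = (answers == employment[:, None]).all(axis=0)
n_perfect = int(perfect.sum())

# Each question matches with probability (1/2)**10, so about
# n_questions / 1024 perfect questions are expected.
expected = n_questions / 2**10
print(n_perfect, expected)
```

With only 10 observations, a single random question reproduces employment with probability 1/1024, so asking enough questions makes a spurious match almost unavoidable.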

6. Analysis with strange data

- Switch to a different employment dataset

df = pd.read_csv('https://raw.githubusercontent.com/guebin/MP2023/main/posts/employment.csv')
df

	toeic	gpa	employment
0	135	0.051535	0
1	935	0.355496	0
2	485	2.228435	0
3	65	1.179701	0
4	445	3.962356	1
...	...	...	...
495	280	4.288465	1
496	310	2.601212	1
497	225	0.042323	0
498	320	1.041416	0
499	375	3.626883	1

500 rows × 3 columns

df_balance = pd.DataFrame((np.random.randn(500,5000)>0.5)*1,columns = ['X'+str(i) for i in range(5000)])
df_merged = pd.concat([df,df_balance],axis=1)
df_merged

	toeic	gpa	employment	X0	X1	X2	X3	X4	X5	X6	...	X4990	X4991	X4992	X4993	X4994	X4995	X4996	X4997	X4998	X4999
0	135	0.051535	0	0	0	0	0	1	0	0	...	1	0	0	0	0	1	0	0	1	0
1	935	0.355496	0	0	0	1	0	0	0	1	...	0	0	0	0	1	0	0	1	1	0
2	485	2.228435	0	1	0	0	1	0	0	0	...	0	1	0	1	0	0	0	0	1	0
3	65	1.179701	0	0	0	0	1	0	0	0	...	1	0	0	1	1	0	0	0	0	0
4	445	3.962356	1	0	1	0	1	0	1	0	...	1	0	0	0	0	1	0	0	1	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
495	280	4.288465	1	0	0	1	0	0	0	1	...	0	0	0	0	0	1	1	0	1	0
496	310	2.601212	1	0	1	0	0	0	0	0	...	0	1	1	0	0	0	1	0	0	0
497	225	0.042323	0	0	0	0	0	0	0	1	...	1	0	0	0	0	0	0	0	1	1
498	320	1.041416	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	1	0	0
499	375	3.626883	1	0	0	1	0	0	0	0	...	1	0	0	0	0	0	0	0	1	1

500 rows × 5003 columns

- Train set (X,y) and test set (XX,yy)

X = df_merged.drop(['employment'],axis=1)[:400]
XX = df_merged.drop(['employment'],axis=1)[400:]
y = df_merged[['employment']][:400]
yy = df_merged[['employment']][400:]

- Create the predictor, fit it, and evaluate it

prdtr = sklearn.linear_model.LogisticRegression()
prdtr.fit(X,y)

/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

LogisticRegression()

prdtr.score(X,y)

1.0

prdtr.score(XX,yy)

0.71
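The two warnings printed by `fit` above suggest their own remedies: pass `y` as a 1-D object, and scale the features or raise `max_iter`. The sketch below applies both on a small made-up dataset (`df_demo` and its label rule are hypothetical stand-ins, not employment.csv). It only removes the warnings and is not offered as a cure for the overfitting itself.

```python
import warnings
import numpy as np
import pandas as pd
from sklearn.exceptions import DataConversionWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data; the real columns come from employment.csv.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    'toeic': rng.integers(0, 991, size=200).astype(float),
    'gpa': rng.uniform(0.0, 4.5, size=200),
})
df_demo['employment'] = (df_demo['gpa'] + rng.normal(0, 1, size=200) > 2.25) * 1

X_demo = df_demo.drop(['employment'], axis=1)
y_demo = df_demo['employment']  # single brackets give a 1-D Series

# Scaling puts toeic and gpa on comparable ranges; max_iter gives lbfgs more room.
prdtr_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    prdtr_demo.fit(X_demo, y_demo)

shape_warnings = [w for w in caught if issubclass(w.category, DataConversionWarning)]
demo_score = prdtr_demo.score(X_demo, y_demo)
print(len(shape_warnings), demo_score)
```

Selecting the label with `df['employment']` instead of `df[['employment']]` is what avoids the column-vector warning; `y.values.ravel()` would be an equivalent alternative.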

7. Analysis with proper data

- The proper employment data

df = pd.read_csv('https://raw.githubusercontent.com/guebin/MP2023/main/posts/employment.csv')
df

	toeic	gpa	employment
0	135	0.051535	0
1	935	0.355496	0
2	485	2.228435	0
3	65	1.179701	0
4	445	3.962356	1
...	...	...	...
495	280	4.288465	1
496	310	2.601212	1
497	225	0.042323	0
498	320	1.041416	0
499	375	3.626883	1

500 rows × 3 columns

- Train set (X,y) and test set (XX,yy)

X = df.drop(['employment'],axis=1)[:400]
XX = df.drop(['employment'],axis=1)[400:]
y = df[['employment']][:400]
yy = df[['employment']][400:]

- Create the predictor, fit it, and evaluate it

prdtr = sklearn.linear_model.LogisticRegression()
prdtr.fit(X,y)

/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)

LogisticRegression()

prdtr.score(X,y)

0.8925

prdtr.score(XX,yy)

0.83
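The contrast between sections 6 and 7 can be reproduced without downloading anything. The sketch below uses synthetic data as an assumption: two standardized informative features stand in for toeic and gpa, the label depends only on them, and 5000 random yes/no columns play the role of the balance-game answers. The exact scores will differ from the ones above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, n_train, n_noise = 500, 400, 5000

# Hypothetical data: two informative features and a label that depends only on them.
signal = rng.standard_normal((n, 2))
prob = 1 / (1 + np.exp(-(1.5 * signal[:, 0] + 1.5 * signal[:, 1])))
label = (rng.random(n) < prob).astype(int)

# Random yes/no answers that are unrelated to the label.
noise = rng.integers(0, 2, size=(n, n_noise))

def train_test_scores(features):
    # Fit on the first n_train rows, then score on both splits.
    model = LogisticRegression(max_iter=1000)
    model.fit(features[:n_train], label[:n_train])
    return (model.score(features[:n_train], label[:n_train]),
            model.score(features[n_train:], label[n_train:]))

clean_train, clean_test = train_test_scores(signal)
noisy_train, noisy_test = train_test_scores(np.hstack([signal, noise]))
print('informative only:', clean_train, clean_test)
print('with noise      :', noisy_train, noisy_test)
```

With far more columns than training rows, the model has enough freedom to fit the training labels through the noise columns, so its train score climbs while its test score falls behind.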

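8. Overfitting

A. Explanation at the first- or second-year undergraduate level

- Overfitting: a problem that comes up often in machine learning and statistics, in which a model is tuned so closely to the training data that its performance degrades on new data or test data.

- Causes of overfitting:

    - Unnecessary features: if the data contain unnecessary features, overfitting can occur.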

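B. Explanation for a general audience

- Studying for an exam (1): told to study, the student just memorizes the answers..

- Studying for an exam (2): (the day before the exam) let's stop studying and go for a drink.. studying more will only reduce the train error..

- Driving: you practiced driving on one particular road only, so you know its potholes, its curves, and even the positions of its traffic lights perfectly. As a result you drive well on that road, but you run into trouble on any other road.

- Language: you repeatedly practiced conversations on one particular topic or situation only, so you can converse perfectly in that situation but cannot keep a conversation going in any other.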

9. Homework

- Think of an example that explains overfitting in simple terms, and submit it.

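If you studied a foreign language (Japanese) for work by watching Japanese anime, you would not be able to use it on the job, haha. You would only be able to say things like "ganbare" lol.

If you studied Korean and Japanese cuisine to prepare for a cooking competition but ended up entering a Western cuisine competition, you would not be able to cook a proper dish.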