This material is from Prof. 최규빈's Special Topics in Big Data Analysis course (Fall semester 2023) at Jeonbuk National University.
03wk-013: Titanic, Logistic Regression
최규빈
2023-09-21
1. Lecture video
https://youtu.be/playlist?list=PLQqh36zP38-wi9Mkfc849jTCMENvydI8B&si=hotGT-ErLB8dukhs
2. Import
import numpy as np
import pandas as pd
import sklearn.linear_model
3. Loading the data
df_train = pd.read_csv('~/Desktop/titanic/train.csv')
df_test = pd.read_csv('~/Desktop/titanic/test.csv')
df_train
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
set(df_train) - set(df_test)
{'Survived'}
Survived is present in train but missing from test -> let's use it as y.
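The column comparison above can be sketched on toy data. The frames and the names `toy_train`, `toy_test`, `X_toy`, and `y_toy` are made up for illustration and are not the lecture data. The only point is that iterating over a DataFrame yields its column names, so a set difference isolates the target column.

```python
import pandas as pd

# Toy stand-ins for train/test (values are invented for illustration).
toy_train = pd.DataFrame({
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Survived": [0, 1],
})
toy_test = pd.DataFrame({"Pclass": [2], "Sex": ["male"]})

# set(df) is the set of column names, because iterating a DataFrame
# iterates over its columns.
target_cols = set(toy_train) - set(toy_test)
print(target_cols)

# Split the training frame into features and target.
target = next(iter(target_cols))
y_toy = toy_train[target]
X_toy = toy_train.drop(columns=[target])
print(list(X_toy.columns))
```

Here the difference contains exactly one name, so taking the single element is safe; with several differing columns the target would have to be chosen explicitly.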
4. Analysis – Failure
A. Preparing the data
X = pd.get_dummies(df_train.drop(['PassengerId','Survived'],axis=1))
y = df_train[['Survived']]
B. Creating the predictor
predictr = sklearn.linear_model.LogisticRegression()
predictr
LogisticRegression()
C. Training (fit, learn)
predictr.fit(X,y)
ValueError: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
- The fit fails because X contains NaN values.
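To see which columns carry the NaN values, a per-column count is usually enough. The sketch below uses an invented toy frame (`toy` and `X_toy` are hypothetical names) and assumes the pandas default `dummy_na=False`, under which a missing value in an object column becomes an all-zero dummy row while a missing value in a numeric column passes through unchanged.

```python
import numpy as np
import pandas as pd

# Toy frame (invented values): one numeric and one object column with NaN.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Fare": [7.25, 71.28, 7.92],
})

# Object columns are one-hot encoded; numeric columns are left untouched.
X_toy = pd.get_dummies(toy)
print(list(X_toy.columns))

# Count missing values per column after encoding.
nan_counts = X_toy.isna().sum()
print(nan_counts[nan_counts > 0])
```

In this toy case only the numeric `Age` column still contains NaN after encoding, which is the kind of column an estimator such as `LogisticRegression` rejects.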
5. Diagnosing the cause
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
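- Problem 1: The Cabin column has too many missing values (only 204 of 891 entries are non-null). (Let's drop it later.)
- Problem 2: Variables such as Name or Ticket are awkward to one-hot encode. (object variables)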
len(set(df_train['Name']))
891
len(set(df_train['Ticket']))
681
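The counts above explain why one-hot encoding these columns is awkward: a column with as many distinct values as rows produces one dummy column per row. A minimal sketch on an invented toy frame (`toy` and `X_toy` are hypothetical names):

```python
import pandas as pd

# Toy frame (invented values): Name is unique per row, Sex has two levels.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "female", "male"],
})

# Number of distinct values per column.
print(toy.nunique())

# One dummy column is created per distinct value of each object column.
X_toy = pd.get_dummies(toy)
print(X_toy.shape)
```

With four rows, `Name` alone contributes four dummy columns, each of which is nonzero for exactly one row, so the model could memorize the training rows without learning anything that transfers to new passengers.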
- Problem 3: The few missing values in Age and Embarked in df_train are a concern. \(\to\) Leave these columns out!
- Problem 4: The missing value in Fare in df_test is also a concern. \(\to\) Leave it out too!
df_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 418 non-null int64
1 Pclass 418 non-null int64
2 Name 418 non-null object
3 Sex 418 non-null object
4 Age 332 non-null float64
5 SibSp 418 non-null int64
6 Parch 418 non-null int64
7 Ticket 418 non-null object
8 Fare 417 non-null float64
9 Cabin 91 non-null object
10 Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
- Fare has 417 non-null values out of 418, so a single value is missing, which is an awkward case. Let's look at how autogluon handled this.
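For reference, one common alternative to leaving a column out is to impute the few missing entries. The sketch below fills a missing Fare with the median computed on the training data. This is an assumption made for illustration: it is not necessarily what autogluon does, and it is not the approach taken in this lecture. The frames and names (`toy_train`, `toy_test`, `fare_median`) are invented.

```python
import numpy as np
import pandas as pd

# Toy frames (invented values); the test frame has one missing Fare.
toy_train = pd.DataFrame({"Fare": [7.25, 71.28, 8.05]})
toy_test = pd.DataFrame({"Fare": [7.83, np.nan, 9.69]})

# Compute the statistic on the training data only, then apply it to test,
# so that no information from the test set leaks into preprocessing.
fare_median = toy_train["Fare"].median()
toy_test_filled = toy_test.assign(Fare=toy_test["Fare"].fillna(fare_median))

print(fare_median)
print(toy_test_filled)
```

Dropping rows is not an option for the test set, since every PassengerId needs a prediction in the submission file, which is why the choice is between imputing and excluding the column.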
6. Analysis – Success
A. Preparing the data
X = pd.get_dummies(df_train[["Pclass", "Sex", "SibSp", "Parch"]])
y = df_train[["Survived"]]
X
 | Pclass | SibSp | Parch | Sex_female | Sex_male |
---|---|---|---|---|---|
0 | 3 | 1 | 0 | 0 | 1 |
1 | 1 | 1 | 0 | 1 | 0 |
2 | 3 | 0 | 0 | 1 | 0 |
3 | 1 | 1 | 0 | 1 | 0 |
4 | 3 | 0 | 0 | 0 | 1 |
... | ... | ... | ... | ... | ... |
886 | 2 | 0 | 0 | 0 | 1 |
887 | 1 | 0 | 0 | 1 | 0 |
888 | 3 | 1 | 2 | 1 | 0 |
889 | 1 | 0 | 0 | 0 | 1 |
890 | 3 | 0 | 0 | 0 | 1 |
891 rows × 5 columns
B. Creating the predictor
predictr = sklearn.linear_model.LogisticRegression()
C. Training
predictr.fit(X, y)
/home/coco/anaconda3/envs/py38/lib/python3.8/site-packages/sklearn/utils/validation.py:1143: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
LogisticRegression()
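The DataConversionWarning in the output above appears because y was passed as a one-column DataFrame of shape (n, 1) rather than a 1d array. A minimal sketch of the difference, on invented toy data; `count_conversion_warnings` is a hypothetical helper written for this illustration only.

```python
import warnings

import pandas as pd
from sklearn.exceptions import DataConversionWarning
from sklearn.linear_model import LogisticRegression

# Toy data (invented values).
toy = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 2, 3],
    "Survived": [0, 1, 0, 1, 1, 0],
})
X_toy = toy[["Pclass"]]
y_2d = toy[["Survived"]]   # DataFrame, shape (6, 1): a column vector
y_1d = toy["Survived"]     # Series, shape (6,): the shape fit() expects


def count_conversion_warnings(y):
    """Fit a fresh model and count the DataConversionWarnings it raises."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        LogisticRegression().fit(X_toy, y)
    return sum(issubclass(w.category, DataConversionWarning) for w in caught)


print(y_2d.shape, y_1d.shape)
print(count_conversion_warnings(y_2d), count_conversion_warnings(y_1d))
```

Selecting with single brackets (`df["Survived"]`) or calling `.values.ravel()` on the one-column frame both give the 1d shape; the fitted model is the same either way, so the warning is harmless here.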
D. Prediction
#predictr.predict(X)
df_train.assign(Survived_hat=predictr.predict(X)).loc[:,['Survived','Survived_hat']]
 | Survived | Survived_hat |
---|---|---|
0 | 0 | 0 |
1 | 1 | 1 |
2 | 1 | 1 |
3 | 1 | 1 |
4 | 0 | 0 |
... | ... | ... |
886 | 0 | 0 |
887 | 1 | 1 |
888 | 0 | 1 |
889 | 1 | 0 |
890 | 0 | 0 |
891 rows × 2 columns
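The 0/1 labels returned by `predict` come from estimated probabilities. As a rough sketch on invented toy data (`toy`, `model`, and `p_survive` are hypothetical names), the labels of a binary logistic regression can be reproduced by thresholding the class-1 probability from `predict_proba` at 0.5.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data (invented values).
toy = pd.DataFrame({
    "Pclass":     [3, 1, 3, 1, 2, 3, 2, 1],
    "Sex_female": [0, 1, 0, 1, 1, 0, 0, 1],
    "Survived":   [0, 1, 0, 1, 1, 0, 1, 0],
})
X_toy = toy[["Pclass", "Sex_female"]]
y_toy = toy["Survived"]
model = LogisticRegression().fit(X_toy, y_toy)

# Columns of predict_proba follow model.classes_, here [0, 1],
# so column 1 holds the estimated P(Survived = 1).
p_survive = model.predict_proba(X_toy)[:, 1]

# Thresholding at 0.5 reproduces the hard labels from predict().
manual_labels = (p_survive > 0.5).astype(int)
print(np.array_equal(manual_labels, model.predict(X_toy)))
```

Keeping the probabilities around can be useful when a threshold other than 0.5 is wanted, although the submission in this lecture only needs the hard labels.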
E. Evaluation
predictr.score(X,y)
0.8002244668911336
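For a classifier, `score` returns the mean accuracy, that is, the fraction of rows where the predicted label equals the true label. A small sketch on invented toy data (names such as `toy` and `manual_accuracy` are hypothetical) that computes the same quantity by hand:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data (invented values).
toy = pd.DataFrame({
    "Pclass":     [3, 1, 3, 1, 2, 3, 2, 1],
    "Sex_female": [0, 1, 0, 1, 1, 0, 0, 1],
    "Survived":   [0, 1, 0, 1, 1, 0, 1, 0],
})
X_toy = toy[["Pclass", "Sex_female"]]
y_toy = toy["Survived"]
model = LogisticRegression().fit(X_toy, y_toy)

# Accuracy by hand: share of predictions that match the true labels.
manual_accuracy = np.mean(model.predict(X_toy) == y_toy)
print(manual_accuracy, model.score(X_toy, y_toy))
```

Note that the value reported above is measured on the same rows the model was trained on, so it tends to be optimistic compared with the score on unseen data such as the Kaggle test set.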
7. Submission (HW)
df_test
 | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
413 | 1305 | 3 | Spector, Mr. Woolf | male | NaN | 0 | 0 | A.5. 3236 | 8.0500 | NaN | S |
414 | 1306 | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C |
415 | 1307 | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S |
416 | 1308 | 3 | Ware, Mr. Frederick | male | NaN | 0 | 0 | 359309 | 8.0500 | NaN | S |
417 | 1309 | 3 | Peter, Master. Michael J | male | NaN | 1 | 1 | 2668 | 22.3583 | NaN | C |
418 rows × 11 columns
X = pd.get_dummies(df_test[["Pclass", "Sex", "SibSp", "Parch"]])
X
 | Pclass | SibSp | Parch | Sex_female | Sex_male |
---|---|---|---|---|---|
0 | 3 | 0 | 0 | 0 | 1 |
1 | 3 | 1 | 0 | 1 | 0 |
2 | 2 | 0 | 0 | 0 | 1 |
3 | 3 | 0 | 0 | 0 | 1 |
4 | 3 | 1 | 1 | 1 | 0 |
... | ... | ... | ... | ... | ... |
413 | 3 | 0 | 0 | 0 | 1 |
414 | 1 | 0 | 0 | 1 | 0 |
415 | 3 | 0 | 0 | 0 | 1 |
416 | 3 | 0 | 0 | 0 | 1 |
417 | 3 | 1 | 1 | 0 | 1 |
418 rows × 5 columns
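Here the test matrix happens to have the same five columns as the training matrix, because both Sex levels occur in both files. That is not guaranteed in general: when `get_dummies` is applied to train and test separately, a category missing from one file produces mismatched columns. One possible safeguard, sketched on invented toy data with hypothetical names, is to reindex the test columns to the training columns.

```python
import pandas as pd

# Toy frames (invented values): the test frame has no "Q" category.
toy_train = pd.DataFrame({"Embarked": ["S", "C", "Q"]})
toy_test = pd.DataFrame({"Embarked": ["S", "S", "C"]})

X_train_toy = pd.get_dummies(toy_train)
X_test_toy = pd.get_dummies(toy_test)
print(list(X_train_toy.columns))
print(list(X_test_toy.columns))

# Force the test matrix to have exactly the training columns, in the same
# order; dummy columns absent from test are filled with 0.
X_test_aligned = X_test_toy.reindex(columns=X_train_toy.columns, fill_value=0)
print(list(X_test_aligned.columns))
```

Reindexing also drops any dummy column that exists only in the test data, which matches what the fitted model can actually use.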
df_test.assign(Survived=predictr.predict(X)).loc[:,['PassengerId','Survived']]\
.to_csv("03wk(get_dummies).csv",index=False)
- Kaggle submission score -> 0.77511