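This material is from the Fall 2023 course "Special Topics in Big Data Analysis" (빅데이터분석특강) taught by Professor Choi Gyu-bin (최규빈) at Jeonbuk National University (전북대학교).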

1. Lecture videos

https://youtu.be/playlist?list=PLQqh36zP38-yR3MqhN9-OgAtewojoYoKD&si=U1GTMdGiWFvlppYH

2. Alexis Cook's analysis

Logging into Kaggle for the first time can be daunting. Our competitions often have large cash prizes, public leaderboards, and involve complex data. Nevertheless, we really think all data scientists can rapidly learn from machine learning competitions and meaningfully contribute to our community. To give you a clear understanding of how our platform works and a mental model of the type of learning you could do on Kaggle, we’ve created a Getting Started tutorial for the Titanic competition. It walks you through the initial steps required to get your first decent submission on the leaderboard. By the end of the tutorial, you’ll also have a solid understanding of how to use Kaggle’s online coding environment, where you’ll have trained your own machine learning model.

So if this is your first time entering a Kaggle competition, regardless of whether you:
- have experience with handling large datasets,
- haven’t done much coding,
- are newer to data science, or
- are relatively experienced (but are just unfamiliar with Kaggle’s platform),

you’re in the right place!

Part 1: Get started

In this section, you’ll learn more about the competition and make your first submission.

Join the competition!

The first thing to do is to join the competition! Open a new window with the competition page, and click on the “Join Competition” button, if you haven’t already. (If you see a “Submit Predictions” button instead of a “Join Competition” button, you have already joined the competition, and don’t need to do so again.)

This takes you to the rules acceptance page. You must accept the competition rules in order to participate. These rules govern how many submissions you can make per day, the maximum team size, and other competition-specific details. Then, click on “I Understand and Accept” to indicate that you will abide by the competition rules.

The challenge

The competition is simple: we want you to use the Titanic passenger data (name, age, price of ticket, etc.) to try to predict who will survive and who will die.

The data

To take a look at the competition data, click on the Data tab at the top of the competition page. Then, scroll down to find the list of files.
There are three files in the data: (1) train.csv, (2) test.csv, and (3) gender_submission.csv.

(1) train.csv

train.csv contains the details of a subset of the passengers on board (891 passengers, to be exact – where each passenger gets a different row in the table). To investigate this data, click on the name of the file on the left of the screen. Once you’ve done this, you can view all of the data in the window.

The values in the second column (“Survived”) can be used to determine whether each passenger survived or not:
- if it’s a “1”, the passenger survived.
- if it’s a “0”, the passenger died.

For instance, the first passenger listed in train.csv is Mr. Owen Harris Braund. He was 22 years old when he died on the Titanic.

(2) test.csv

Using the patterns you find in train.csv, you have to predict whether the other 418 passengers on board (in test.csv) survived.

Click on test.csv (on the left of the screen) to examine its contents. Note that test.csv does not have a “Survived” column - this information is hidden from you, and how well you do at predicting these hidden values will determine how highly you score in the competition!

(3) gender_submission.csv

The gender_submission.csv file is provided as an example that shows how you should structure your predictions. It predicts that all female passengers survived, and all male passengers died. Your hypotheses regarding survival will probably be different, which will lead to a different submission file. But, just like this file, your submission should have:
- a “PassengerId” column containing the IDs of each passenger from test.csv.
- a “Survived” column (that you will create!) with a “1” for the rows where you think the passenger survived, and a “0” where you predict that the passenger died.
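The file structure described above can be sketched in a few lines of pandas. This is only an illustration on a hypothetical three-row stand-in for test.csv (the name `toy_test` is made up here); it applies the same rule as gender_submission.csv, predicting 1 for women and 0 for men.

```python
import pandas as pd

# Hypothetical stand-in for test.csv: only the two columns needed here.
toy_test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

# Rule from gender_submission.csv: predict 1 for women, 0 for men.
toy_submission = pd.DataFrame({
    "PassengerId": toy_test["PassengerId"],
    "Survived": (toy_test["Sex"] == "female").astype(int),
})
print(toy_submission)
```

The result has exactly the two required columns, one row per passenger.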

Part 2: Your coding environment

In this section, you’ll train your own machine learning model to improve your predictions. If you’ve never written code before or don’t have any experience with machine learning, don’t worry! We don’t assume any prior experience in this tutorial.

The Notebook

The first thing to do is to create a Kaggle Notebook where you’ll store all of your code. You can use Kaggle Notebooks to get up and running with writing code quickly, without having to install anything on your computer. (If you are interested in deep learning, we also offer free GPU access!)

Begin by clicking on the Code tab on the competition page. Then, click on “New Notebook”.

Your notebook will take a few seconds to load. In the top left corner, you can see the name of your notebook – something like “kernel2daed3cd79”.

You can edit the name by clicking on it. Change it to something more descriptive, like “Getting Started with Titanic”.

Your first lines of code

When you start a new notebook, it has two gray boxes for storing code. We refer to these gray boxes as “code cells”.

The first code cell already has some code in it. To run this code, put your cursor in the code cell. (If your cursor is in the right place, you’ll notice a blue vertical line to the left of the gray box.) Then, either hit the play button (which appears to the left of the blue line), or hit [Shift] + [Enter] on your keyboard.

If the code runs successfully, three lines of output are returned. Below, you can see the same code that you just ran, along with the output that you should see in your notebook.

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

This shows us where the competition data is stored, so that we can load the files into the notebook. We’ll do that next.

Load the data

The second code cell in your notebook now appears below the three lines of output with the file locations.

Type the two lines of code below into your second code cell. Then, once you’re done, either click on the blue play button, or hit [Shift] + [Enter].

train_data = pd.read_csv('~/Desktop/titanic/train.csv')
train_data.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

Your code should return the output above, which corresponds to the first five rows of the table in train.csv. It’s very important that you see this output in your notebook before proceeding with the tutorial!
> If your code does not produce this output, double-check that your code is identical to the two lines above. And, make sure your cursor is in the code cell before hitting [Shift] + [Enter].

The code that you’ve just written is in the Python programming language. It uses a Python “module” called pandas (abbreviated as pd) to load the table from the train.csv file into the notebook. To do this, we needed to plug in the location of the file (which we saw was /kaggle/input/titanic/train.csv). The code shown in these notes reads a local copy at ~/Desktop/titanic/train.csv instead; inside a Kaggle notebook, use the /kaggle/input path.
> If you’re not already familiar with Python (and pandas), the code shouldn’t make sense to you – but don’t worry! The point of this tutorial is to (quickly!) make your first submission to the competition. At the end of the tutorial, we suggest resources to continue your learning.

At this point, you should have at least three code cells in your notebook.

Copy the code below into the third code cell of your notebook to load the contents of the test.csv file. Don’t forget to click on the play button (or hit [Shift] + [Enter])!


test_data = pd.read_csv("~/Desktop/titanic/test.csv")
test_data.head()

	PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	892	3	Kelly, Mr. James	male	34.5	0	0	330911	7.8292	NaN	Q
1	893	3	Wilkes, Mrs. James (Ellen Needs)	female	47.0	1	0	363272	7.0000	NaN	S
2	894	2	Myles, Mr. Thomas Francis	male	62.0	0	0	240276	9.6875	NaN	Q
3	895	3	Wirz, Mr. Albert	male	27.0	0	0	315154	8.6625	NaN	S
4	896	3	Hirvonen, Mrs. Alexander (Helga E Lindqvist)	female	22.0	1	1	3101298	12.2875	NaN	S

As before, make sure that you see the output above in your notebook before continuing.

Once all of the code runs successfully, all of the data (in train.csv and test.csv) is loaded in the notebook. (The code above shows only the first 5 rows of each table, but all of the data is there – all 891 rows of train.csv and all 418 rows of test.csv!)

Part 3: Your first submission

Remember our goal: we want to find patterns in train.csv that help us predict whether the passengers in test.csv survived.

It might initially feel overwhelming to look for patterns, when there’s so much data to sort through. So, we’ll start simple.

Explore a pattern

Remember that the sample submission file in gender_submission.csv assumes that all female passengers survived (and all male passengers died).

Is this a reasonable first guess? We’ll check if this pattern holds true in the data (in train.csv).

Copy the code below into a new code cell. Then, run the cell.

women = train_data.loc[train_data.Sex == 'female']["Survived"]
rate_women = sum(women)/len(women)

print("% of women who survived:", rate_women)

% of women who survived: 0.7420382165605095

The code above computes the survival rate of women. Using the same technique we used earlier to compute accuracy, the code below works as well.

train_data[train_data.Sex == 'female'].Survived.mean()

0.7420382165605095

Before moving on, make sure that your code returns the output above. The code above calculates the percentage of female passengers (in train.csv) who survived.

Then, run the code below in another code cell:

men = train_data.loc[train_data.Sex == 'male']["Survived"]
rate_men = sum(men)/len(men)

print("% of men who survived:", rate_men)

% of men who survived: 0.18890814558058924

The code above calculates the percentage of male passengers (in train.csv) who survived.

From this you can see that almost 75% of the women on board survived, whereas only 19% of the men lived to tell about it. Since gender seems to be such a strong indicator of survival, the submission file in gender_submission.csv is not a bad first guess!
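Both rates can also be obtained in a single call by grouping on the Sex column. The sketch below uses a made-up eight-row table (`toy_train` is a hypothetical name, and its values are not taken from train.csv) purely to show the pattern; on the real data the same expression would be applied to train_data.

```python
import pandas as pd

# Toy stand-in for train_data (values are made up for illustration).
toy_train = pd.DataFrame({
    "Sex": ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# The mean of a 0/1 column within each group is that group's survival rate.
rate_by_sex = toy_train.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```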

But at the end of the day, this gender-based submission bases its predictions on only a single column. As you can imagine, by considering multiple columns, we can discover more complex patterns that can potentially yield better-informed predictions. Since it is quite difficult to consider several columns at once (or, it would take a long time to consider all possible patterns in many different columns simultaneously), we’ll use machine learning to automate this for us.

Your first machine learning model

We’ll build what’s known as a random forest model. This model is constructed of several “trees” (there are three trees in the picture below, but we’ll construct 100!) that will individually consider each passenger’s data and vote on whether the individual survived. Then, the random forest model makes a democratic decision: the outcome with the most votes wins!
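The voting idea can be sketched with plain NumPy. The vote matrix below is hypothetical (three trees, four passengers) and only illustrates the majority rule described above. As far as we know, scikit-learn's RandomForestClassifier averages each tree's predicted class probabilities rather than counting hard votes, so treat this as a simplified picture rather than the library's exact procedure.

```python
import numpy as np

# Hypothetical votes: rows are trees, columns are passengers (1 = survived).
tree_votes = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
])

# A passenger is predicted to survive when more than half of the trees say so.
survival_share = tree_votes.mean(axis=0)
forest_prediction = (survival_share > 0.5).astype(int)
print(forest_prediction)  # [1 0 1 0]
```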

The code cell below looks for patterns in four different columns (“Pclass”, “Sex”, “SibSp”, and “Parch”) of the data. It constructs the trees in the random forest model based on patterns in the train.csv file, before generating predictions for the passengers in test.csv. The code also saves these new predictions in a CSV file (named submission_AlexisCook.csv in the code below).

Copy this code into your notebook, and run it in a new code cell.

from sklearn.ensemble import RandomForestClassifier

y = train_data["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission_AlexisCook.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!

Make sure that your notebook outputs the same message above (Your submission was successfully saved!) before moving on.
> Again, don’t worry if this code doesn’t make sense to you! For now, we’ll focus on how to generate and submit predictions.
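One step in the code above that may be unfamiliar is pd.get_dummies, which turns the text column “Sex” into indicator columns and leaves numeric columns unchanged. The sketch below shows this on hypothetical toy tables (all `toy_` names are made up). It also shows a precaution that the tutorial code does not need, since both Titanic files contain both sexes: aligning the test columns to the training columns in case a category is missing from one table.

```python
import pandas as pd

# Toy stand-ins for train_data[features] and test_data[features].
toy_X_raw = pd.DataFrame({"Pclass": [3, 1, 3], "Sex": ["male", "female", "female"]})
toy_X_test_raw = pd.DataFrame({"Pclass": [2, 3], "Sex": ["male", "male"]})

# Text columns become one indicator column per category; numeric columns pass through.
toy_X = pd.get_dummies(toy_X_raw)
toy_X_test = pd.get_dummies(toy_X_test_raw)
print(list(toy_X.columns))       # ['Pclass', 'Sex_female', 'Sex_male']
print(list(toy_X_test.columns))  # ['Pclass', 'Sex_male']

# Align the test columns to the training columns, filling missing indicators with 0.
toy_X_test = toy_X_test.reindex(columns=toy_X.columns, fill_value=0)
print(list(toy_X_test.columns))  # ['Pclass', 'Sex_female', 'Sex_male']
```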

Once you’re ready, click on the “Save Version” button in the top right corner of your notebook. This will generate a pop-up window.
- Ensure that the “Save and Run All” option is selected, and then click on the “Save” button.
- This generates a window in the bottom left corner of the notebook. After it has finished running, click on the number to the right of the “Save Version” button. This pulls up a list of versions on the right of the screen. Click on the ellipsis (…) to the right of the most recent version, and select Open in Viewer.
- Click on the Data tab on the top of the screen. Then, click on the “Submit” button to submit your results.

Congratulations on making your first submission to a Kaggle competition! Within ten minutes, you should receive a message providing your spot on the leaderboard. Great work!

Part 4: Learn more!

If you’re interested in learning more, we strongly suggest our (3-hour) Intro to Machine Learning course, which will help you fully understand all of the code that we’ve presented here. You’ll also know enough to generate even better predictions!

3. A slight variation building on Alexis Cook's analysis

A. How well does Alexis Cook's analysis predict on train?

- Original code

from sklearn.ensemble import RandomForestClassifier

y = train_data["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission_AlexisCook.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!

len(predictions), len(X_test)

(418, 418)

- Modifying it like this should work: predict on X (train) instead of X_test and compare with y.

from sklearn.ensemble import RandomForestClassifier

y = train_data["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)

####

predictions = model.predict(X)

len(predictions),len(y)

(891, 891)

(predictions == y).mean()

0.8159371492704826
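The expression `(predictions == y).mean()` is the fraction of predictions that match the labels. As a quick check on made-up arrays (the names and values below are illustrative, not from the Titanic data), it should agree with scikit-learn's `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions, for illustration only.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

manual_accuracy = (y_pred == y_true).mean()        # fraction of matching entries
library_accuracy = accuracy_score(y_true, y_pred)  # same quantity via scikit-learn
print(manual_accuracy, library_accuracy)           # 0.8 0.8
```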

B. Let's modify Alexis Cook's code!

- Let's modify the code.

- Tuning the model's hyperparameters (here n_estimators and max_depth) improves the accuracy on train.

from sklearn.ensemble import RandomForestClassifier

y = train_data["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

model = RandomForestClassifier(n_estimators=5000, max_depth=1000, random_state=1)
model.fit(X, y)

####

predictions = model.predict(X)

(predictions == y).mean()

0.8170594837261503

Mine looks better, doesn't it??

- Let's turn this into a submission file as well.

predictions = model.predict(X_test)

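- Submitting a file saved as below causes an error: without index=False, to_csv writes the DataFrame index as an extra unnamed first column, so the file no longer contains only the PassengerId and Survived columns.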

pd.read_csv("~/Desktop/titanic/gender_submission.csv")\
.assign(Survived=predictions)\
.to_csv("AlexisCook수정_submission.csv")

- The submission file must be saved as below, with index=False.

pd.read_csv("~/Desktop/titanic/gender_submission.csv")\
.assign(Survived=predictions)\
.to_csv("AlexisCook수정2_submission.csv",index=False)
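The difference between the two calls can be inspected without writing any file, because `to_csv` returns the CSV text when no path is given. The sketch below uses a made-up three-row table (`toy_output` is a hypothetical name) and prints only the header line of each variant.

```python
import pandas as pd

# Toy predictions for three test passengers (made-up values).
toy_output = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})

# With no path argument, to_csv returns the CSV text instead of writing a file.
with_index = toy_output.to_csv()
without_index = toy_output.to_csv(index=False)

print(with_index.splitlines()[0])     # ,PassengerId,Survived  (extra unnamed index column)
print(without_index.splitlines()[0])  # PassengerId,Survived
```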

4. Comparing the submission results

- Lesson: doing well on the mock exam (train) does not necessarily mean doing well on the real exam (test).

1 Alexis Cook's code

train accuracy: 81.593%
test (submitted) accuracy: 77.511%

2 Modified version of the code above

train accuracy: 81.70594%
test (submitted) accuracy: 76.555%
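One common way to anticipate this gap before submitting is to hold out part of the training rows and score the model on rows it never saw. The sketch below is not part of the original analysis. It uses synthetic data (not the Titanic data), and all names and settings in it (`X_demo`, `y_demo`, `demo_model`, the 25% split) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (not the Titanic data): two numeric features and a noisy 0/1 label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 2))
clean_label = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
flip = rng.random(400) < 0.2                      # flip about 20% of the labels
y_demo = np.where(flip, 1 - clean_label, clean_label)

# Hold out a quarter of the rows; the model never sees them during fitting.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1
)

demo_model = RandomForestClassifier(n_estimators=100, random_state=1)
demo_model.fit(X_fit, y_fit)

train_acc = (demo_model.predict(X_fit) == y_fit).mean()
val_acc = (demo_model.predict(X_val) == y_val).mean()
print("train accuracy:", train_acc)
print("held-out accuracy:", val_acc)
```

With deep trees and noisy labels, the accuracy on the fitted rows is typically well above the held-out accuracy, which mirrors the train versus test gap in the comparison above.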