Kaggle Titanic
Kaggle is a site where people create algorithms and compete against machine learning practitioners around the world. Your algorithm wins the competition if it’s the most accurate on a particular data set. Kaggle is a fun way to practice your machine learning skills.
This tutorial is based on part of our free, four-part interactive course on Kaggle’s Titanic competition, which includes a certificate on completion.
In this tutorial we’ll learn how to explore the competition data, prepare it for machine learning, train a model, measure its accuracy, and make a first Kaggle submission.
This tutorial presumes you have an understanding of Python and the pandas library. If you need to learn about these, we recommend our blog post.
Kaggle has created a number of competitions designed for beginners. The most popular of these competitions, and the one we’ll be looking at, is about predicting which passengers survived the sinking of the Titanic.
In this competition, we have a data set of different information about passengers onboard the Titanic, and we see if we can use that information to predict whether those people survived or not. Before we start looking at this specific competition, let’s take a moment to understand how Kaggle competitions work.
Each Kaggle competition has two key data files that you will work with – a training set and a testing set.
The training set contains data we can use to train our model. It has a number of feature columns which contain various descriptive data, as well as a column of the target values we are trying to predict: in this case, survival (the `Survived` column).
The testing set contains all of the same feature columns, but is missing the target value column. Additionally, the testing set usually has fewer observations (rows) than the training set.
This is useful because we want as much data as we can to train our model on. Once we have trained our model on the training set, we will use that model to make predictions on the data from the testing set, and submit those predictions to Kaggle.
In this competition, the two files are named `test.csv` and `train.csv`. We’ll start by using the pandas library to read both files and then inspect their size.

    import pandas as pd

    test = pd.read_csv("test.csv")
    train = pd.read_csv("train.csv")

    print("Dimensions of train: {}".format(train.shape))
    print("Dimensions of test: {}".format(test.shape))

    Dimensions of train: (891, 12)
    Dimensions of test: (418, 11)

The files we just opened are available on the data page for the Titanic competition on Kaggle. That page also has a data dictionary, which explains the various columns that make up the data set. Below are the descriptions contained in that data dictionary:

- `PassengerId`: A column added by Kaggle to identify each row and make submissions easier
- `Survived`: Whether the passenger survived or not, and the value we are predicting (0=No, 1=Yes)
- `Pclass`: The class of the ticket the passenger purchased (1=1st, 2=2nd, 3=3rd)
- `Sex`: The passenger’s sex
- `Age`: The passenger’s age in years
- `SibSp`: The number of siblings or spouses the passenger had aboard the Titanic
- `Parch`: The number of parents or children the passenger had aboard the Titanic
- `Ticket`: The passenger’s ticket number
- `Fare`: The fare the passenger paid
- `Cabin`: The passenger’s cabin number
- `Embarked`: The port where the passenger embarked (C=Cherbourg, Q=Queenstown, S=Southampton)

The data page on Kaggle has some additional notes about some of the columns. It’s always worth exploring this in detail to get a full understanding of the data.
Let’s take a look at the first few rows of the `train` dataframe.

(index) | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th… | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |

The type of machine learning we will be doing is called classification, because when we make predictions we are classifying each passenger as ‘survived’ or not. More specifically, we are performing binary classification, which means that there are only two different states we are classifying.
In any machine learning exercise, thinking about the topic you are predicting is very important. We call this step acquiring domain knowledge, and it’s one of the most important determinants for success in machine learning.
In this case, understanding the Titanic disaster, and specifically which variables might affect the outcome of survival, is important. Anyone who has watched the movie Titanic would remember that women and children were given priority for places in the lifeboats (as they were in real life). You would also remember the vast class disparity among the passengers.
This indicates that `Age`, `Sex`, and `Pclass` may be good predictors of survival. We’ll start by exploring `Sex` and `Pclass` by visualizing the data.
Because the `Survived` column contains `0` if the passenger did not survive and `1` if they did, we can segment our data by sex and calculate the mean of this column, which is the proportion of passengers in each group who survived. We can use `DataFrame.pivot_table()` to easily do this:

    import matplotlib.pyplot as plt
    # Jupyter notebook magic: draw plots inline
    %matplotlib inline

    sex_pivot = train.pivot_table(index="Sex", values="Survived")
    sex_pivot.plot.bar()
    plt.show()

We can immediately see that females survived in much higher proportions than males did. Let’s do the same with the `Pclass` column.
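The `Pclass` version follows the same pattern as the `Sex` pivot. Here is a minimal sketch, assuming a small made-up sample in place of the real `train` dataframe (the rows and the name `class_pivot` are illustrative, not taken from the competition data):

```python
import pandas as pd

# Hypothetical sample rows standing in for train.csv (not the real data).
sample = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# pivot_table() aggregates with the mean by default, and the mean of a
# 0/1 column is the proportion of passengers who survived in each class.
class_pivot = sample.pivot_table(index="Pclass", values="Survived")
print(class_pivot)

# On the real data, class_pivot.plot.bar() would draw the bar chart.
```

Swapping `sample` for the real `train` dataframe should give the survival rate per ticket class.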
The `Sex` and `Pclass` columns are what we call categorical features. That means that the values represent a few separate options (for instance, whether the passenger was male or female).
Let’s take a look at the `Age` column using `Series.describe()`.

    train["Age"].describe()

    count    714.000000
    mean      29.699118
    std       14.526497
    min        0.420000
    25%       20.125000
    50%       28.000000
    75%       38.000000
    max       80.000000
    Name: Age, dtype: float64

The `Age` column contains numbers ranging from `0.42` to `80.0` (if you look at Kaggle’s data page, it informs us that `Age` is fractional if the passenger is less than one year old). The other thing to note here is that there are 714 values in this column, fewer than the 891 rows we discovered that the `train` data set had earlier in this tutorial, which indicates we have some missing values.
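The gap between the row count and the `count` reported by `describe()` can be checked directly. This is a small sketch, assuming a made-up `ages` series rather than the real `Age` column:

```python
import numpy as np
import pandas as pd

# Hypothetical ages; np.nan marks passengers whose age was not recorded.
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 0.42], name="Age")

total_rows = len(ages)        # every row, including the missing ones
non_missing = ages.count()    # what describe() reports as "count"
missing = ages.isna().sum()   # rows with no age recorded

print(total_rows, non_missing, missing)
```

On the real data the same arithmetic gives 891 - 714 = 177 rows with no recorded age.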
All of this means that the `Age` column needs to be treated slightly differently, as this is a continuous numerical column. One way to look at the distribution of values in a continuous numerical set is to use histograms. We can create two histograms to visually compare those who survived with those who died across different age ranges:
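One way to draw the two overlaid histograms is sketched below. It assumes a small made-up sample instead of the real `train` dataframe, and the bin count, colors, and variable names are our own choices:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample standing in for the real train dataframe.
sample = pd.DataFrame({
    "Age":      [2, 4, 19, 22, 25, 30, 31, 45, 58, 70],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0, 1, 0],
})

survived = sample[sample["Survived"] == 1]
died = sample[sample["Survived"] == 0]

# Overlay two semi-transparent histograms on the same axes:
# red for passengers who survived, blue for those who died.
ax = survived["Age"].plot.hist(alpha=0.5, color="red", bins=10)
died["Age"].plot.hist(alpha=0.5, color="blue", bins=10, ax=ax)
ax.legend(["Survived", "Died"])
ax.set_xlabel("Age")
plt.show()
```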
The relationship here is not simple, but we can see that in some age ranges more passengers survived – where the red bars are higher than the blue bars.
In order for this to be useful to our machine learning model, we can separate this continuous feature into a categorical feature by dividing it into ranges. We can use the `pandas.cut()` function to help us out.
The `pandas.cut()` function has two required parameters: the column we wish to cut, and a list of numbers which define the boundaries of our cuts. We are also going to use the optional parameter `labels`, which takes a list of labels for the resultant bins. This will make it easier for us to understand our results.
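To see how the boundaries and labels line up, here is a minimal sketch of `pandas.cut()` on a few made-up ages (the values, boundaries, and labels are illustrative only):

```python
import pandas as pd

ages = pd.Series([3, 15, 40, 70])  # hypothetical ages

# Four boundaries define three bins: (0, 12], (12, 60] and (60, 100].
cut_points = [0, 12, 60, 100]
label_names = ["Child", "Adult", "Senior"]  # illustrative labels

categories = pd.cut(ages, cut_points, labels=label_names)
print(categories.tolist())
```

By default each bin includes its upper boundary and excludes its lower one, so an age of exactly 12 would land in the first bin.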
Before we modify this column, we have to be aware of two things. Firstly, any change we make to the `train` data, we also need to make to the `test` data, otherwise we will be unable to use our model to make predictions for our submissions. Secondly, we need to remember to handle the missing values we observed above.
We’ll create a function that:

- Fills all of the missing values with `-0.5`
- Cuts the `Age` column into seven segments:
  - `Missing`, from `-1` to `0`
  - `Infant`, from `0` to `5`
  - `Child`, from `5` to `12`
  - `Teenager`, from `12` to `18`
  - `Young Adult`, from `18` to `35`
  - `Adult`, from `35` to `60`
  - `Senior`, from `60` to `100`

We’ll then use that function on both the `train` and `test` dataframes.
The diagram below shows how the function converts the data:
Note that the `cut_points` list has one more element than the `label_names` list, since it needs to define the upper boundary for the last segment.

    def process_age(df, cut_points, label_names):
        # Missing ages become -0.5 so they fall into the "Missing" bin
        df["Age"] = df["Age"].fillna(-0.5)
        df["Age_categories"] = pd.cut(df["Age"], cut_points, labels=label_names)
        return df

    cut_points = [-1, 0, 5, 12, 18, 35, 60, 100]
    label_names = ["Missing", "Infant", "Child", "Teenager",
                   "Young Adult", "Adult", "Senior"]

    train = process_age(train, cut_points, label_names)
    test = process_age(test, cut_points, label_names)

    pivot = train.pivot_table(index="Age_categories", values='Survived')
    pivot.plot.bar()
    plt.show()

So far we have identified three columns that may be useful for predicting survival:

- `Sex`
- `Pclass`
- `Age`, or more specifically our newly created `Age_categories`

Before we build our model, we need to prepare these columns for machine learning. Most machine learning algorithms can’t understand text labels, so we have to convert our values into numbers.
Additionally, we need to be careful that we don’t imply any numeric relationship where there isn’t one. The data dictionary tells us that the values in the `Pclass` column are `1`, `2`, and `3`. We can confirm this with pandas, using `train["Pclass"].value_counts()`:

    3    491
    1    216
    2    184
    Name: Pclass, dtype: int64

While the class of each passenger certainly has some sort of ordered relationship, the relationship between each class is not the same as the relationship between the numbers `1`, `2`, and `3`. For instance, class `2` isn’t “worth” double what class `1` is, and class `3` isn’t “worth” triple what class `1` is.
In order to remove this relationship, we can create dummy columns for each unique value in `Pclass`: `Pclass_1`, `Pclass_2`, and `Pclass_3`, each marking whether the passenger is in that class.
Rather than doing this manually, we can use the `pandas.get_dummies()` function, which will generate these columns for us.
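As a quick illustration of what the dummy columns look like, here is a sketch using a few made-up `Pclass` values (the `sample` frame is hypothetical):

```python
import pandas as pd

sample = pd.DataFrame({"Pclass": [3, 1, 2, 3]})  # hypothetical rows

# One new column per unique value, named <prefix>_<value>.
dummies = pd.get_dummies(sample["Pclass"], prefix="Pclass")

print(dummies.columns.tolist())
# astype(int) shows the indicators as 0/1 regardless of pandas version.
print(dummies.astype(int))
```

Each row has exactly one indicator set, so no ordering or distance between classes is implied.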
We’ll create a function to create the dummy columns for the `Pclass` column and add it back to the original dataframe. We’ll then apply that function on the `train` and `test` dataframes for each of the `Pclass`, `Sex`, and `Age_categories` columns.

    def create_dummies(df, column_name):
        # One indicator column per unique value, named <column_name>_<value>
        dummies = pd.get_dummies(df[column_name], prefix=column_name)
        df = pd.concat([df, dummies], axis=1)
        return df

    for column in ["Pclass", "Sex", "Age_categories"]:
        train = create_dummies(train, column)
        test = create_dummies(test, column)

Now that our data has been prepared, we are ready to train our first model. The first model we will use is called Logistic Regression, which is often the first model you will train when performing classification.
We will be using the scikit-learn library as it has many tools that make performing machine learning easier. The scikit-learn workflow consists of four main steps: instantiate (or create) the model you want to use, fit the model to the training data, use the model to make predictions, and evaluate the accuracy of the predictions.
Each model in scikit-learn is implemented as a separate class, and the first step is to identify the class we want to create an instance of. In our case, we want to use the `LogisticRegression` class.
We’ll start by looking at the first two steps. First, we need to import the class with `from sklearn.linear_model import LogisticRegression`. Next, we create a `LogisticRegression` object:

    lr = LogisticRegression()

Lastly, we use the `LogisticRegression.fit()` method to train our model. The `.fit()` method accepts two arguments: `X` and `y`. `X` must be a two-dimensional array (like a dataframe) of the features that we wish to train our model on, and `y` must be a one-dimensional array (like a series) of our target, or the column we wish to predict.
For example, `lr.fit(train[["Pclass_2", "Pclass_3", "Sex_male"]], train["Survived"])` fits (or trains) our `LogisticRegression` model using three columns: `Pclass_2`, `Pclass_3`, and `Sex_male`.
Let’s train our model using all of the columns we created with our `create_dummies()` function.

    from sklearn.linear_model import LogisticRegression

    columns = ['Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex_female', 'Sex_male',
               'Age_categories_Missing', 'Age_categories_Infant',
               'Age_categories_Child', 'Age_categories_Teenager',
               'Age_categories_Young Adult', 'Age_categories_Adult',
               'Age_categories_Senior']

    lr = LogisticRegression()
    lr.fit(train[columns], train["Survived"])

    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
              intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
              penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
              verbose=0, warm_start=False)

Congratulations, you’ve trained your first machine learning model! Our next step is to find out how accurate our model is, and to do that, we’ll have to make some predictions.
If you recall from earlier, we do have a `test` dataframe that we could use to make predictions. We could make predictions on that data set, but because it doesn’t have the `Survived` column we would have to submit it to Kaggle to find out our accuracy. This would quickly become a pain if we had to submit to find out the accuracy every time we optimized our model.
We could also fit and predict on our `train` dataframe. However, if we do this there is a high likelihood that our model will overfit, which means it will appear to perform well because we’re testing on the same data we’ve trained on, but then perform much worse on new, unseen data.
Instead, we can split our `train` dataframe into two parts: one to train our model on (80% of the observations in this tutorial), and one to make predictions with and test our model (the remaining 20%).
The convention in machine learning is to call these two parts `train` and `test`. This can become confusing, since we already have our `test` dataframe that we will eventually use to make predictions to submit to Kaggle. To avoid confusion, from here on, we’re going to call this Kaggle ‘test’ data holdout data, which is the technical name given to this type of data used for final predictions.
The scikit-learn library has a handy `model_selection.train_test_split()` function that we can use to split our data. `train_test_split()` accepts two parameters, `X` and `y`, which contain all the data we want to train and test on, and returns four objects, in this order: `train_X`, `test_X`, `train_y`, `test_y`.
When we call it, we’ll also pass two extra parameters: `test_size`, which lets us control what proportions our data are split into, and `random_state`. The `train_test_split()` function randomizes observations before dividing them, and setting a random seed means that our results will be reproducible, so you can follow along and get the same result as we did.
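The split described above can be sketched as follows. This assumes a tiny made-up feature table in place of the prepared `train` columns, and the seed value `0` is our own choice rather than something the tutorial states:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical feature table and target standing in for the real data.
all_X = pd.DataFrame({"Sex_male": [0, 1, 1, 0, 1, 0, 1, 1, 0, 1],
                      "Pclass_3": [0, 1, 1, 0, 0, 1, 1, 0, 0, 1]})
all_y = pd.Series([1, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="Survived")

# Note the return order: both X pieces first, then both y pieces.
# test_size=0.2 keeps 20% of the rows for testing; random_state fixes the shuffle.
train_X, test_X, train_y, test_y = train_test_split(
    all_X, all_y, test_size=0.2, random_state=0)

print(train_X.shape, test_X.shape)
```

Unpacking the four results in the wrong order is an easy mistake to make, so it is worth checking the shapes after the split.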
Now that we have our data split into train and test sets, we can fit our model again on our training set, and then use that model to make predictions on our test set.
Once we have fit our model, we can use the `LogisticRegression.predict()` method to make predictions.
The `predict()` method takes a single parameter `X`, a two-dimensional array of features for the observations we wish to predict. `X` must have the exact same features as the array we used to fit our model. The method returns a one-dimensional array of predictions.

    lr = LogisticRegression()
    lr.fit(train_X, train_y)
    predictions = lr.predict(test_X)

There are a number of ways to measure the accuracy of machine learning models, but when competing in Kaggle competitions you want to make sure you use the same method that Kaggle uses to calculate accuracy for that specific competition.
In this case, Kaggle tells us that our score is calculated as “the percentage of passengers correctly predicted”. This is by far the most common form of accuracy for binary classification.
As an example, imagine we were predicting a small data set of five observations.

Our model’s prediction | The actual value | Correct |
---|---|---|
0 | 0 | Yes |
1 | 0 | No |
0 | 1 | No |
1 | 1 | Yes |
1 | 1 | Yes |

In this case, our model correctly predicted three out of five values, so the accuracy based on this prediction set would be 60%.
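The arithmetic behind that 60% can be reproduced in a few lines of plain Python, using the five predictions and actual values from the table (the variable names are our own):

```python
predictions = [0, 1, 0, 1, 1]  # our model's predictions, from the table
actual = [0, 0, 1, 1, 1]       # the actual values, from the table

# Count the positions where the prediction matches the actual value.
correct = sum(p == a for p, a in zip(predictions, actual))
accuracy = correct / len(actual)

print(correct, accuracy)
```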
Again, scikit-learn has a handy function we can use to calculate accuracy: `metrics.accuracy_score()`. The function accepts two parameters, `y_true` and `y_pred`, which are the actual values and our predicted values respectively, and returns our accuracy score.
Let’s put all of these steps together, and get our first accuracy score.

    from sklearn.metrics import accuracy_score

    lr = LogisticRegression()
    lr.fit(train_X, train_y)
    predictions = lr.predict(test_X)
    accuracy = accuracy_score(test_y, predictions)
    print(accuracy)

    0.810055865922

Our model has an accuracy score of 81.0% when tested against our 20% test set. Given that this data set is quite small, there is a good chance that our model is overfitting, and will not perform as well on totally unseen data.
To give us a better understanding of the real performance of our model, we can use a technique called cross validation to train and test our model on different splits of our data, and then average the accuracy scores.
The most common form of cross validation, and the one we will be using, is called k-fold cross validation. ‘Fold’ refers to each different iteration that we train our model on, and ‘k’ just refers to the number of folds. In the diagram above, we have illustrated k-fold validation where k is 5.
We will use scikit-learn’s `model_selection.cross_val_score()` function to automate the process. The basic syntax for `cross_val_score()` is `cross_val_score(estimator, X, y, cv=None)`, where:

- `estimator` is a scikit-learn estimator object, like the `LogisticRegression()` objects we have been creating.
- `X` is all features from our data set.
- `y` is the target variable.
- `cv` specifies the number of folds.

The function returns a numpy ndarray of the accuracy scores of each fold. It’s worth noting that the `cross_val_score()` function can use a variety of cross validation techniques and scoring types, but for our input types it defaults to k-fold validation (stratified, since our estimator is a classifier) and accuracy scores.
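To make the call concrete, here is a small sketch of `cross_val_score()` on made-up data. The feature matrix, target, and fold count are assumptions chosen so the example runs on its own:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: 40 observations with two binary features.
X = np.array([[i % 2, (i // 2) % 2] for i in range(40)])
y = X[:, 0]  # the target simply copies the first feature in this toy set

lr = LogisticRegression()

# cv=5 trains and scores the model five times, once per fold,
# and returns one accuracy score for each fold.
scores = cross_val_score(lr, X, y, cv=5)

print(scores)
print(scores.mean())
```

The array has one entry per fold, so averaging it gives a single summary figure that depends less on how any one split happened to fall.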
We’ll use `model_selection.cross_val_score()` to perform cross-validation on our data, before calculating the mean of the scores produced:

    from sklearn.model_selection import cross_val_score

    # Assumed definitions: every feature row and target value from train
    all_X = train[columns]
    all_y = train["Survived"]

    lr = LogisticRegression()
    scores = cross_val_score(lr, all_X, all_y, cv=10)
    scores.sort()  # sort in place so the range of fold scores is easy to read
    accuracy = scores.mean()

    print(scores)
    print(accuracy)

    [ 0.76404494  0.76404494  0.7752809   0.78651685  0.8         0.80681818
      0.80898876  0.81111111  0.83146067  0.87640449]
    0.802467086596

From the results of our k-fold validation, you can see that the accuracy number varies with each fold – ranging between 76.4% and 87.6%. This demonstrates why cross validation is important.
As it happens, our average accuracy score was 80.2%, which is not far from the 81.0% we got from our simple train/test split. However, this will not always be the case, and you should always use cross-validation to make sure the error metrics you are getting from your model are accurate.
We are now ready to use the model we have built to train our final model and then make predictions on our unseen holdout data, or what Kaggle calls the ‘test’ data set.
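A minimal sketch of this final step is shown below. It assumes tiny made-up stand-ins for the prepared `train` dataframe and for the Kaggle ‘test’ data, which we call `holdout` here; the rows and the two-column feature list are illustrative only:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for the prepared train and holdout dataframes.
train = pd.DataFrame({"Sex_male": [0, 1, 1, 0, 1, 0],
                      "Pclass_3": [0, 1, 1, 0, 0, 1],
                      "Survived": [1, 0, 0, 1, 0, 1]})
holdout = pd.DataFrame({"PassengerId": [892, 893, 894],
                        "Sex_male": [1, 0, 1],
                        "Pclass_3": [1, 0, 0]})

columns = ["Sex_male", "Pclass_3"]

# Fit on every labelled row, then predict the unlabelled holdout rows.
lr = LogisticRegression()
lr.fit(train[columns], train["Survived"])
holdout_predictions = lr.predict(holdout[columns])

print(holdout_predictions)
```

The key point is that the holdout frame must contain the same feature columns, in the same order, as the data the model was fitted on.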
The last thing we need to do is create a submission file. Each Kaggle competition can have slightly different requirements for the submission file. Here’s what Kaggle specifies for the Titanic competition:
You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows.
The file should have exactly 2 columns:

- PassengerId (sorted in any order)
- Survived (contains your binary predictions: 1 for survived, 0 for deceased)

The table below shows this in a slightly easier to understand format, so we can visualize what we are aiming for.

PassengerId | Survived |
---|---|
892 | 0 |
893 | 1 |
894 | 0 |

We will need to create a new dataframe that contains the `holdout_predictions` we created in the previous step and the `PassengerId` column from the `holdout` dataframe. We don’t need to worry about matching the data up, as both of these remain in their original order.
To do this, we can pass a dictionary to the `pandas.DataFrame()` constructor:

    holdout_ids = holdout["PassengerId"]
    submission_df = {"PassengerId": holdout_ids,
                     "Survived": holdout_predictions}
    submission = pd.DataFrame(submission_df)

Finally, we’ll use the `DataFrame.to_csv()` method to save the dataframe to a CSV file. We need to make sure the `index` parameter is set to `False`, otherwise we will add an extra column to our CSV.
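The effect of `index=False` is easy to check on a small frame. This sketch assumes three made-up rows, and the file name in the final comment is our own choice:

```python
import pandas as pd

# Hypothetical IDs and predictions; the real file needs all 418 holdout rows.
submission = pd.DataFrame({"PassengerId": [892, 893, 894],
                           "Survived": [0, 1, 0]})

# With no path argument, to_csv() returns the CSV text instead of writing a file,
# which makes it easy to confirm there are exactly two columns.
csv_text = submission.to_csv(index=False)
print(csv_text)

# To write the file itself (the name is an assumption):
# submission.to_csv("submission.csv", index=False)
```

Leaving out `index=False` would add the dataframe’s row index as an unnamed first column, which the submission rules above do not allow.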
When working on your own computer, the submission file will be in the same directory as your notebook.
Now that we have our submission file, we can start our submission to Kaggle by clicking the blue ‘Submit Predictions’ button on the competition page.
You will then be prompted to upload your CSV file, and add a brief description of your submission. When you make your submission, Kaggle will process your predictions and give you your accuracy for the holdout data and your ranking. When it is finished processing you will see our first submission gets an accuracy score of 0.75598, or 75.6%.
The fact that our accuracy on the holdout data is 75.6% compared with the 80.2% accuracy we got with cross-validation indicates that our model is overfitting slightly to our training data.
At the time of writing, accuracy of 75.6% gives a rank of 6,663 out of 7,954. It’s easy to look at Kaggle leaderboards after your first submission and get discouraged, but keep in mind that this is just a starting point.
It’s also very common to see a small number of scores of 100% at the top of the Titanic leaderboard and think that you have a long way to go. In reality, anyone scoring above 90% on this competition is likely cheating (it’s easy to look up the names of the passengers in the holdout set online and see if they survived).
There is a great analysis on Kaggle which uses a few different strategies and suggests that a minimum score for this competition is 62.7% (achieved by presuming that every passenger died) and a maximum is around 82%. We are a little over halfway between the minimum and maximum, which is a great starting point.
There are many things we can do to improve the accuracy of our model, and the rest of our course covers some of them.
Reposted from: http://fiqwd.baihongyu.com/