りんだろぐ rindalog: タイタニック号データ分析：Feature Engineering

これは、Kaggle の記事 Exploring the Titanic Dataset から。次は、本文 "1. Introduction" からの引用。

I’ll then use randomForest to create a model predicting survival on the Titanic.

randomForest を使って、Titanic 号の乗組員の生存を予測するモデルを構築。

本来は、Kaggle という企業や、そのサイトで提供していることを紹介すべきだろうが、ざっくりと無視。私の紹介よりも、サイトを眺めてもらった方が良いから。

それよりも、この「Tianic 号の予測」はすぐに惹かれた。この手のデータ分析に関する情報は、日本の書籍よりも、海外のネット情報の方が有益。高い質もそうだが、圧倒的な量の多さも魅力だ。

ちなみに、この記事の投稿者の Megan Risdal さんは「機械学習初心者」とのこと。

I am new to machine learning and hoping to learn a lot, so feedback is very welcome!

記事も読まずに「ダメじゃん、それじゃ」と思ったあなたは、考え方を変えた方が良いと思うぞ。

この投稿は、どこぞの権威ある人の本より、全くもって実用的でリアルだ。「本が売れない」と嘆くよりも「良い本を出して下さい」と、日本の出版業界に対しては切に願う。「人口知能は人間を超えるか」とかいうタイトルの本を書く暇があったらさ... 。

今回は、主に "2 Feature Engineering" を取り上げる。

備考：「Feature Engineering とは」を最初に投稿しようと思った。feature selection との違いも調べたかった。しかし、本投稿を読みながら、「そんなことは後回し」となった次第。ここでの具体例でイメージできれば「定義は後回し」という感じ。

1.1 Load and check data

まずは、最後の予測モデル構築まで必要なパッケージのロード。細かいことだが、こんなコメントは嬉しい。

# Load packages
library('ggplot2') # visualization
library('ggthemes') # visualization
library('scales') # visualization
library('dplyr') # data manipulation
library('mice') # imputation
library('randomForest') # classification algorithm

訓練データ train とテストデータ test をロード、それを結合したのが full 。CSV ファイルはサイトからダウンロードしたもの。

> train <- read.csv("Downloads/kaggle/titanic/train.csv",
stringsAsFactors = F)
> test <- read.csv("Downloads/kaggle/titanic/test.csv",
stringsAsFactors = F)
> full <- bind_rows(train,test)

訓練とテストデータはあわせて 1309 件、変数が 16 個。

> str(full)
'data.frame': 1309 obs. of 16 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
$ Sex : chr "male" "female" "female" "female" ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : chr "" "C85" "" "C123" ...
$ Embarked : chr "S" "C" "S" "S" ...
$ Title : chr "Mr" "Mrs" "Miss" "Mrs" ...
$ Surname : chr "Braund" "Cumings" "Heikkinen" "Futrelle" ...
$ Fsize : num 2 2 1 2 1 1 1 5 3 2 ...
$ FsizeD : chr "small" "small" "singleton" "small" ...

以下に、記事にある変数の説明、16 変数すべてではなく、分析で扱うものだけ。

Survived 変数が「生き残った or 死んだ」で、1, 0 という表記が生々しい...、その中間はないのだ...。

2 Feature Engineering

次は、”2.1 What’s in a name?" の解説。

The first variable which catches my attention is passenger name because we can break it down into additional meaningful variables which can feed predictions or be used in the creation of additional new variables. For instance, passenger title is contained within the passenger name variable and we can use surname to represent families. Let’s do some feature engineering!

最初に気になったのは、乗組員の氏名。他の意味のある変数に分割できるから。例えば、乗組員の肩書き（title）は、氏名変数に含まれて、surname を家族の代表するものに使える。

これだけでも、何となく feature engineering が分かってくる。

# 以下のような名前から title を取り出す
> head(full$Name, 3)
[1] "Braund, Mr. Owen Harris"
[2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
[3] "Heikkinen, Miss. Laina"

> full$Title <- gsub('(.*,)|(\\..*)','',full$Name)
# " Mr" のように冒頭の空白文字を削除、この処理は記事にない（バグか？）
> full$Title <- gsub(' ','',full$Title)
> table(full$Sex, full$Title)

Capt Col Don Dona Dr Jonkheer Lady ...
female 0 0 0 1 1 0 1 ...
male 1 4 1 0 7 1 0 ...

# 次の変数は、"rare" としてまとめる数が少ない title 。
# 空白文字を削除したので 'the Countess' から'theCountess' に変更。
> rare_title <- c('Dona', 'Lady', 'theCountess','Capt',
'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer')

# Mlle, MS, Mme を置換
> full$Title[full$Title == 'Mlle'] <- 'Miss'
> full$Title[full$Title == 'Ms'] <- 'Miss'
> full$Title[full$Title == 'Mme'] <- 'Mrs'
# 'Rare Title' に置換
> full$Title[full$Title %in% rare_title] <- 'Rare Title'
> table(full$Sex, full$Title)

Master Miss Mr Mrs Rare Title

female 0 264 0 198 4

male 61 0 757 0 25

# 名前から surname を取り出す
> full$Surname <- sapply(full$Name,
function(x) strsplit(x,split='[,.]')[[1]][1])
# こんな感じ...
> head(full$Surname)
[1] "Braund" "Cumings" "Heikkinen" "Futrelle" "Allen" "Moran"
# 以上の結果を、HTML文法で出力
> cat(paste('We have <b>', nlevels(factor(full$Surname)), '</b> unique surnames. I would be interested to infer ethnicity based on surname --- another time.'))

We have <b> 875 </b> unique surnames. I would be interested to infer ethnicity based on surname --- another time.

これで、「surname を元に民族性での推定」が可能になった。

家族は一緒に沈んだか、それとも泳いだか？

これも生々しいタイトルだが、原文は "2.2 Do families sink or swim together?" から。要するに、「家族構成と生死」の関係を見ようというもの。

family size 変数を作る。

# 乗組員も含めて +1 している
> full$Fsize <- full$SibSp + full$Parch + 1

# 今回、用途は不明だが、Family 変数を作った。
> full$Family <- paste(full$Surname, full$Fsize, sep='_')
> head(full$Family)
[1] "Braund_2" "Cumings_2" "Heikkinen_1" "Futrelle_2" "Allen_1"
[6] "Moran_1"

次は、家族数と生存の関係を、訓練データ（891 件目まで）でプロットしたもの。

> ggplot(full[1:891,], aes(x = Fsize, fill = factor(Survived))) +

geom_bar(stat='count', position='dodge') +
scale_x_continuous(breaks=c(1:11)) +
labs(x = 'Family Size') +
theme_few()

Ah hah. We can see that there’s a survival penalty to singletons and those with family sizes above 4. We can collapse this variable into three levels which will be helpful since there are comparatively fewer large families. Let’s create a discretized family size variable.

一人、そして 4 人以上の家族が生存に不利、そして少ない家族数の 3 レベルに分けられそう。

変数 FsizeD にこの 3 レベルを設定

# Discretize family size
> full$FsizeD[full$Fsize == 1] <- 'singleton'
> full$FsizeD[full$Fsize < 5 & full$Fsize > 1] <- 'small'
> full$FsizeD[full$Fsize > 4] <- 'large'

# Show family size by survival using a mosaic plot
> mosaicplot(table(full$FsizeD, full$Survived),
main='Family Size by Survival', shade=TRUE)

次はモザイクプロット。Standardized Residua 値の読み方は、本投稿の末尾を参考。

The mosaic plot shows that we preserve our rule that there’s a survival penalty among singletons and large families, but a benefit for passengers in small families. I want to do something further with our age variable, but 263 rows have missing age values, so we will have to wait until after we address missingness.

モザイクプロットが示すのは、singletons と large families には生き残る上での短所があり、small families には長所があったということ。さらに年齢で調べたいところだが、263 件で年齢情報が欠損している。

もう少し詳しく解説。

small の場合に survival = 0（死亡）が極端に少なく（-4 以下）、survival = 1（生存）が極端に多い（4 以上）。一方 singleton, large の場合には small の逆で、survival = 0（死亡）が多く（2 から 4）、survival = 1（生存）が少ない（-4 から -2）。

年齢の欠損は次の通りで、次回対応する。

> summary(full$Age)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.17 21.00 28.00 29.88 39.00 80.00 263

他に使えそうな情報として、客席 Cabin がある。

> full$Cabin[1:28]
[1] "" "C85" "" "C123" ""
[6] "" "E46" "" "" ""
[11] "G6" "C103" "" "" ""
[16] "" "" "" "" ""
[21] "" "D56" "" "A6" ""
[26] "" "" "C23 C25 C27"

これも欠損が多い。

> length(full$Cabin[full$Cabin == ""])
[1] 1014

それでも、ここでは Deck 変数に、最初のアルファベットを切り出している。

> full$Deck <- factor(sapply(full$Cabin,
function(x) strsplit(x, NULL)[[1]][1]))
> summary(full$Deck)
A B C D E F G T NA's
22 65 94 46 41 21 5 1 1014

There’s more that likely could be done here including looking into cabins with multiple rooms listed (e.g., row 28: “C23 C25 C27”), but given the sparseness of the column we’ll stop here.

訳さないが、「もっとやった方がいい編集はあるが、この辺で止める」という感じ。

この後の分析は、Missing value imputation, そして Prediction! と続く。

「Missingness」に続く。

Standardized Residuals

解説というよりも、以下のサイトからの引用。standardized residuals の理屈ではなく、解釈部分のみを引用。

Rule of Thumb for Interpreting Standardized Residuals

A general rule of thumb for figuring out what the standardized residual means, is:

If the residual is less than -2, the cell’s observed frequency is less than the expected frequency.
Greater than 2 and the observed frequency is greater than the expected frequency.

If your residuals are +/-3, then it means that something extremely unusual is happening. If you get +/-4, it’s something from the Twilight Zone! This makes sense if you think about the 68 95 99.7 rule: if your data is normally distributed, 95% of your data should be within 2 standard deviations from the mean. If you have something greater than that, then you’re looking at an outlier.

りんだろぐ rindalog

2016年10月15日土曜日

タイタニック号データ分析：Feature Engineering

0 件のコメント:

コメントを投稿