りんだろぐ rindalog: Data Manipulation: Handling NAs

For the next several posts, starting with this one, I'll be working from Chapter 4, "Managing Data", of "Practical Data Science with R" by Nina Zumel and John Mount.


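This chapter is about how to handle problematic data, the kind you would probably already have spotted at the visualization stage covered last time.

This time: how to handle NAs, from "4.1.1 Treating missing values (NAs)".

Dropping

NA stands for "Not Available", that is, missing data.

First, consider the option of simply dropping such records.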

> summary(custdata$housing.type)
Homeowner free and clear ... NA's
157 56
> summary(custdata$recent.move)
   Mode   FALSE    TRUE    NA's
logical     820     124      56
> summary(custdata$num.vehicles)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   1.000   2.000   1.916   2.000   6.000      56
> summary(custdata$is.employed)
   Mode   FALSE    TRUE    NA's
logical      73     599     328

> 56/nrow(custdata)
[1] 0.056

housing.type, recent.move, and num.vehicles each have 56 NAs, about 5.6% of all rows.
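Instead of calling summary() column by column, the NA counts and rates can be computed for all columns at once. This is only a minimal sketch on a made-up data frame (toy and its columns are hypothetical, not the book's custdata):

```r
# Toy data (hypothetical), standing in for custdata.
toy <- data.frame(
  age   = c(30, NA, 45, 52),
  state = c(NA, NA, "CA", "NY"),
  id    = 1:4
)

# is.na() on a data frame returns a logical matrix,
# so colSums()/colMeans() give the NA count and NA rate per column.
na.counts <- colSums(is.na(toy))
na.rates  <- colMeans(is.na(toy))

print(na.counts)
print(na.rates)
```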

As shown below, we can also confirm that these 56 NAs all belong to the same customers, so we can decide to drop those rows.

# Restrict to the rows where housing.type is NA,
# and look only at the columns recent.move and num.vehicles.
> summary(custdata[is.na(custdata$housing.type),
                   c("recent.move","num.vehicles")])
 recent.move     num.vehicles
 Mode:logical    Min.   : NA
 NA's:56         1st Qu.: NA
                 Median : NA
                 Mean   :NaN
                 3rd Qu.: NA
                 Max.   : NA
                 NA's   :56
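The same check, followed by the actual drop, could be sketched like this. The book excerpt above does not show the dropping step, so this is my own assumption of one way to do it, on a made-up data frame (toy, na.count, and toy.clean are hypothetical names):

```r
# Toy data (hypothetical): rows 2 and 4 are missing all three fields.
toy <- data.frame(
  custid       = 1:5,
  housing.type = c("Rented", NA, "Homeowner free and clear", NA, "Rented"),
  recent.move  = c(FALSE, NA, TRUE, NA, FALSE),
  num.vehicles = c(1, NA, 2, NA, 0)
)
cols <- c("housing.type", "recent.move", "num.vehicles")

# Number of NAs per row across the three columns.
na.count <- rowSums(is.na(toy[, cols]))

# If every row has either 0 or 3 NAs, the NAs all sit on the same rows.
same.rows <- all(na.count %in% c(0, length(cols)))

# Keep only the rows with no NAs in those columns.
toy.clean <- toy[na.count == 0, ]
print(same.rows)
print(toy.clean)
```

Working on a copy (toy.clean) rather than overwriting toy keeps the original rows around in case the decision to drop has to be revisited.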

The problem is is.employed, whose NA rate of about 33% is far too high to drop.

> 328/nrow(custdata)
[1] 0.328

The most straightforward approach is to add a "missing" category, and to convert the TRUE/FALSE values to "employed" and "not employed".

> custdata$is.employed.fix <- ifelse(is.na(custdata$is.employed),
    "missing", ifelse(custdata$is.employed==T,"employed","not employed"))
> summary(as.factor(custdata$is.employed.fix))
    employed      missing not employed
         599          328           73

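That said, the very first thing to consider is

why there are so many NAs in the first place.

For example, couldn't these be homemakers, students, retirees, and so on? On that assumption, the NAs are labeled explicitly as "not in active workforce" here.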

> custdata$is.employed.fix <- ifelse(is.na(custdata$is.employed),
    "not in active workforce",
    ifelse(custdata$is.employed==T,"employed","not employed"))
> summary(as.factor(custdata$is.employed.fix))
               employed            not employed not in active workforce
                    599                      73                     328

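Key points

Unless told otherwise, many of R's modeling functions silently drop records that contain NAs. This is convenient, but it also raises the risk of losing information. When a variable has as many NAs as is.employed does, it has to be dealt with somehow, for example by recoding the NAs as "missing" as was done here.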
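To see the silent dropping for myself, here is a small sketch with lm() on made-up data (toy is hypothetical). It assumes options("na.action") is still at its default, na.omit:

```r
# Toy data (hypothetical): two of the six rows have a missing predictor.
toy <- data.frame(
  y = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  x = c(1, 2, NA, 4, NA, 6)
)

# lm() takes its na.action from options(), which is na.omit by default,
# so the rows with NA in x are dropped without any warning.
fit <- lm(y ~ x, data = toy)
n.used    <- nobs(fit)             # rows actually used in the fit
n.dropped <- nrow(toy) - n.used    # rows silently discarded

# mean() behaves differently: it returns NA unless na.rm = TRUE.
m.default <- mean(toy$x)
m.narm    <- mean(toy$x, na.rm = TRUE)

print(c(used = n.used, dropped = n.dropped))
print(c(default = m.default, na.rm = m.narm))
```

So "ignoring NAs" is not uniform across R: modeling functions tend to drop rows, while summary statistics such as mean() propagate the NA unless asked not to.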

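Furthermore, if this example required distinguishing students from homemakers, the data would have to be collected again. In other words, the question "does this data serve the purpose of the analysis?" has to be asked continuously, starting from the graphing stage.

Incidentally, a new variable, is.employed.fix, was created here. I think this is better than modifying is.employed itself, because it keeps the original data intact.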
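One practical benefit of keeping the original column is that the recoding can be checked against it with a cross-tabulation. A minimal sketch on made-up data (toy and check are hypothetical names):

```r
# Toy data (hypothetical), standing in for custdata$is.employed.
toy <- data.frame(is.employed = c(TRUE, NA, FALSE, TRUE, NA, TRUE))

# Recode into a new column and leave the original untouched.
toy$is.employed.fix <- ifelse(is.na(toy$is.employed),
                              "not in active workforce",
                              ifelse(toy$is.employed, "employed", "not employed"))

# Cross-tabulate original against recoded values; useNA = "ifany"
# adds a row for the original NAs so they show up in the check.
check <- table(original = toy$is.employed,
               fixed    = toy$is.employed.fix,
               useNA    = "ifany")
print(check)
```

Each row of the table should have exactly one non-zero cell if the recoding is consistent.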

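Numeric data

Next, how to handle NAs in numeric data such as income.

Here, assume that the income variable is indispensable (an important predictor) for predicting health insurance coverage. The goal is therefore to do something about the NAs so that the sample data is as suitable as possible.

There are several approaches.

When NAs are random

One view is that the NAs are caused by something like a faulty sensor, in other words that they occur at random. In that case, the NAs are replaced with a suitable value such as the mean.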

> summary(custdata$Income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
      0   25000   45000   66200   82000  615000     328
# Don't forget the argument "na.rm=T"! Otherwise, the mean() function will include the NAs by default, and meanIncome will be NA.
> meanIncome <- mean(custdata$Income, na.rm=T)
> Income.fix <- ifelse(is.na(custdata$Income), meanIncome, custdata$Income)
> summary(Income.fix)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0   35000   66200   66200   66200  615000

An important point about replacing NAs:

It’s important to remember that replacing missing values by the mean, as well as many more sophisticated methods for imputing missing values, assumes that the customers with missing income are in some sense random (the “faulty sensor” situation).

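Replacing missing data with the mean (or similar) assumes that the customers with missing data are, to some degree, random.

However, there are cases where replacement is inappropriate.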
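One side effect worth seeing in numbers: mean replacement keeps the mean but shrinks the spread, which is why the median and 3rd quartile in the summary above both collapse onto 66200. A small sketch on made-up incomes (all values hypothetical):

```r
# Toy incomes (hypothetical): two of six values are missing.
income <- c(20000, 40000, NA, 60000, NA, 80000)

# Mean of the observed values only.
mean.income <- mean(income, na.rm = TRUE)

# Replace the NAs with that mean.
income.fix <- ifelse(is.na(income), mean.income, income)

# The mean is unchanged, but the standard deviation gets smaller,
# because the imputed points sit exactly on the mean.
sd.observed <- sd(income, na.rm = TRUE)
sd.imputed  <- sd(income.fix)

print(c(mean.before = mean.income, mean.after = mean(income.fix)))
print(c(sd.observed = sd.observed, sd.imputed = sd.imputed))
```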

The customers with missing income data are systematically different from the others.

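The missing data differ systematically from the rest of the data.

For example, if the customers with no value in the income variable really have no income, replacing the NAs with the mean is the wrong approach.

There are two alternatives: converting to categories, and replacing with zero.

Converting to categories

Split the numeric values into intervals and convert them to a categorical variable. In this example, the NAs become a "no income" category.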

# Select some income ranges of interest. To use the cut() function, the upper and lower bounds should encompass the full income range of the data.
> breaks <- c(0,10^4,5*10^4,10*10^4,25*10^4,100*10^4)
# Cut the data into income ranges. The include.lowest=T argument makes sure that zero income data is included in the lowest income range category. By default it would be excluded.
> Income.groups <- cut(custdata$Income,breaks=breaks,include.lowest=T)
# The cut() function produces factor variables. Note the NAs are preserved.
> summary(Income.groups)
      [0,1e+04]   (1e+04,5e+04]   (5e+04,1e+05] (1e+05,2.5e+05] (2.5e+05,1e+06]
             63             312             178              98              21
           NA's
            328
# To preserve the category names before adding a new category, convert the variables to strings.
> Income.groups <- as.character(Income.groups)
# Add the "no income" category to replace the NAs.
> Income.groups <- ifelse(is.na(Income.groups),"no income", Income.groups)
> summary(as.factor(Income.groups))
  (1e+04,5e+04] (1e+05,2.5e+05] (2.5e+05,1e+06]   (5e+04,1e+05]       [0,1e+04]
            312              98              21             178              63
      no income
            328
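Note that after the round trip through as.character(), as.factor() sorts the categories alphabetically, so the income ranges are no longer in increasing order. If the order matters, one alternative (my own sketch, not the book's code; the income values are made up) is to add the new level to the factor directly:

```r
# Toy incomes (hypothetical), including two NAs.
income <- c(0, 5000, 30000, NA, 75000, 200000, NA, 500000)
breaks <- c(0, 10^4, 5*10^4, 10*10^4, 25*10^4, 100*10^4)

# Same cut() call as in the book's example.
groups <- cut(income, breaks = breaks, include.lowest = TRUE)

# Append "no income" as an extra level, then assign it to the NAs.
# The existing levels keep their original (increasing) order.
levels(groups) <- c(levels(groups), "no income")
groups[is.na(groups)] <- "no income"

print(summary(groups))
```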

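Replacing with zero

The second approach is simple. In this case, assume that these customers really have no income, and replace the NAs with zero.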

# The missing variable lets you differentiate the two kinds of zeros in the data: the ones that you are about to add, and the ones that were already there.
> missingIncome <- is.na(custdata$Income)
# Replace the NAs with zeros.
> Income.fix <- ifelse(is.na(custdata$Income),0,custdata$Income)
> summary(Income.fix)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0       0   25250   44490   60000  615000

Naturally, the mean is now lower than the 66200 obtained when replacing with the mean.
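The point of the missingIncome indicator is that the two kinds of zeros stay distinguishable afterwards. A small sketch on made-up data (toy and the count variables are hypothetical names):

```r
# Toy data (hypothetical): two genuine zeros and two missing incomes.
toy <- data.frame(Income = c(0, 25000, NA, 60000, NA, 0))

# Record which values were missing before overwriting them.
toy$missingIncome <- is.na(toy$Income)

# Replace the NAs with zeros in a new column.
toy$Income.fix <- ifelse(toy$missingIncome, 0, toy$Income)

# Zeros that were in the data from the start vs. zeros added by the fix.
original.zeros <- sum(toy$Income.fix == 0 & !toy$missingIncome)
added.zeros    <- sum(toy$Income.fix == 0 &  toy$missingIncome)

print(c(original = original.zeros, added = added.zeros))
```

Presumably the indicator would be passed to the model together with Income.fix, so that the model can treat the added zeros differently from the real ones.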

Summary

On how to handle NA data:

In summary, to properly handle missing data you need to know why the data is missing in the first place. If you don’t know whether the missing values are random or systematic, we recommend assuming that the difference is systematic, rather than trying to impute values to the variables based on the faulty sensor assumption.

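First, find out why the data is missing. If it is unclear whether the missing data is "random" or "systematic", the authors recommend treating it as "systematic".

In other words, they do not recommend replacing it with a convenient value such as the mean.

This also makes intuitive sense to me. For example, it is hard to believe that the mean of the missing values would be close to the sample mean. It would probably get closer as the number of missing values grows, but having that many missing values is the real problem in the first place. I think excluding the variable from the analysis, or collecting the data again, should take priority.

Continued in "Data Transformations".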

Sunday, September 25, 2016