りんだろぐ rindalog: Histogram and Density Plot: Log Scale

Continued from "Ease and Misuse".

Using the data loaded last time, this post draws a density plot of income (figure 3.4).

> library(ggplot2)  # repeated here so the snippet runs on its own
> c1 <- "Most of the distribution is concentrated at\nthe low end: less than $100,000 a year.\nIt's hard to get good resolution here."
> c2 <- "Subpopulation\nof wealthy\ncustomers in\nthe $400,000\nrange."
> c3 <- "Wide data range:\nseveral orders of\nmagnitude."
> # Density of income on a linear x axis, with three text annotations
> ggplot(custdata) + geom_density(aes(x=income)) +
    scale_x_continuous(labels=scales::dollar) +
    geom_text(data=data.frame(), aes(170000, 1.2e-05, label=c1, colour=1), size=4) +
    geom_text(data=data.frame(), aes(400000, 2.5e-06, label=c2, colour=1), size=4) +
    geom_text(data=data.frame(), aes(550000, 1.5e-06, label=c3, colour=1), size=4) +
    theme(legend.position="none")

R note: to label the x axis in dollars, use scale_x_continuous(labels=scales::dollar).

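The English annotations in the plot describe the distribution:

Most of the distribution is concentrated at the low end, below $100,000 a year, where the plot gives poor resolution.
There is a subpopulation of wealthy customers in the $400,000 range.
The data range is wide, spanning several orders of magnitude.

In other words, this view makes the data hard to analyze.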

When the data range is very wide and the mass of the distribution is heavily concentrated to one side, like the distribution in figure 3.4, it’s difficult to see the details of its shape. For instance, it’s hard to tell the exact value where the income distribution has its peak.

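In short: when the data range is very wide and the mass sits heavily on one side, as in this figure, the shape of the distribution is hard to read. For example, you cannot easily tell at which income the density peaks.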

The summary also shows how widely the values are spread.

> summary(custdata$income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  -8700   14600   35000   53500   67000  615000

As an aside, a histogram is out of the question for this data.

Log scale

Plotting the same income data on a logarithmic scale (log scale) gives figure 3.5 below.

> c1 <- "Very-low-income outliers"
> c2 <- "More customers have income in the\n$10,000 range than you would expect."
> c3 <- "Most customers have income in the\n$20,000-$100,000 range."
> c4 <- "Peak of income\ndistribution at $40,000"
> # Same density, but with a log10 x axis and log tick marks
> ggplot(custdata) + geom_density(aes(x=income)) +
    scale_x_log10(breaks=c(100,1000,10000,100000), labels=scales::dollar) +
    annotation_logticks(sides="bt") +
    geom_text(data=data.frame(), aes(100, 0.05, label=c1, colour=1), size=3.3) +
    geom_text(data=data.frame(), aes(2000, 0.45, label=c2, colour=1), size=3.3) +
    geom_text(data=data.frame(), aes(5000, 0.70, label=c3, colour=1), size=3.3) +
    geom_text(data=data.frame(), aes(10000, 0.90, label=c4, colour=1), size=3.3) +
    theme(legend.position="none")

R note: scale_x_log10(breaks=c(100,...),labels=scales::dollar) puts the x axis on a log10 scale and sets the tick positions and their labels. annotation_logticks(sides="bt") adds log-scale tick marks along the bottom and top of the plot.
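As a quick check of why these break values suit a log axis, here is a minimal sketch in base R (my own illustration, not from the book). It shows that the four breaks are equally spaced after a log10 transform, even though their linear gaps grow tenfold at each step.

```r
# Break values passed to scale_x_log10() in the plot above
breaks <- c(100, 1000, 10000, 100000)

# Position of each break on the log10 axis
log_pos <- log10(breaks)
print(log_pos)

# Gaps between neighbouring breaks:
# constant on the log scale, tenfold larger each step on the linear scale
log_gaps <- diff(log_pos)
lin_gaps <- diff(breaks)
print(log_gaps)
print(lin_gaps)
```

Each step of 1 on the log10 axis corresponds to multiplying income by 10, which is what "order of magnitude" means in the annotations.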

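The English annotations in the plot describe the distribution:

Very-low-income outliers.
More customers than you would expect have income in the $10,000 range. A linear (evenly spaced) scale does not reveal this.
Most customers have income in the $20,000 to $100,000 range.
The distribution peaks at $40,000.

The book has one more annotation, which I left out here for plotting reasons: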

Customers with income over $200,000 are rare, but they no longer look like "outliers" in log space.

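That is, customers earning more than $200,000 are rare, but on the log scale they no longer look like outliers.

Put differently, on the linear scale of figure 3.4, incomes above $200,000 fall in the "long tail" and look like outliers. In figure 3.5 it is the other way around: the low-income customers look like outliers.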
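One rough way to put a number on this, using only the quartiles printed by summary() earlier in this post, is to measure how far the maximum sits above the median in units of the interquartile range, once on the linear scale and once after log10. This is an illustrative sketch of my own, not something the book does. The variable names are mine, and the inputs are the rounded values from the summary output, so the results are approximate.

```r
# Rounded values copied from the summary(custdata$income) output
q1 <- 14600
med <- 35000
q3 <- 67000
max_inc <- 615000

# Linear scale: distance of the maximum above the median, in IQR units
lin_dist <- (max_inc - med) / (q3 - q1)

# Log scale: the same distance measured after a log10 transform
log_dist <- (log10(max_inc) - log10(med)) / (log10(q3) - log10(q1))

cat("linear scale:", round(lin_dist, 1), "IQRs above the median\n")
cat("log10 scale :", round(log_dist, 1), "IQRs above the median\n")
```

With these inputs the maximum is roughly 11 interquartile ranges above the median on the linear scale but fewer than 2 in log space, which matches the impression that high earners stop looking like outliers.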

In log space, income is distributed as something that looks like a “normalish” distribution.

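That is, on a log scale the income distribution looks like a "normalish" distribution.

I plan a separate post on "normalish", based on the book's “Appendix B. Important statistical concepts”. I have not read it yet, but my guess is that it means something like "roughly normal, and therefore easy to analyze".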
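Until then, here is a small simulation of what "normalish in log space" could look like. It is a sketch under stated assumptions: the data is drawn from a log-normal distribution with rlnorm(), and the meanlog and sdlog values are made up for illustration, not estimated from custdata.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical incomes: log-normal with a median around $40,000
# (meanlog and sdlog are made-up values, not fitted to custdata)
income_sim <- rlnorm(10000, meanlog = log(40000), sdlog = 0.8)

# Linear scale: the long right tail pulls the mean well above the median
mean_lin <- mean(income_sim)
median_lin <- median(income_sim)

# Log scale: mean and median nearly coincide, as they do for a
# symmetric, roughly normal distribution
log_income <- log10(income_sim)
mean_log <- mean(log_income)
median_log <- median(log_income)

cat("linear: mean / median =", mean_lin / median_lin, "\n")
cat("log10 : mean - median =", mean_log - median_log, "\n")
```

A mean far above the median is the same pattern the real summary shows (mean 53500 against median 35000), so a log transform is a natural thing to try on such data.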

Incidentally, drawing this plot produces the following warnings:

Warning messages:
1: In self$trans$transform(x) : NaNs produced
2: Removed 79 rows containing non-finite values (stat_density).

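The reason is that the logarithm of zero is negative infinity:

> log10(0)
[1] -Inf

and the logarithm of a negative number is NaN, which is what the first warning reports. The second warning therefore says that the data contains 79 records with zero or negative income, and that they were dropped from the plot.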
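The behaviour can be reproduced in base R without ggplot2. The sketch below uses a made-up five-element vector (not custdata) to show which values turn non-finite under log10() and how to filter them out explicitly before plotting.

```r
# Hypothetical income values: one negative, one zero, three positive
income <- c(-8700, 0, 100, 35000, 615000)

# log10() of a negative number is NaN (with a warning); log10(0) is -Inf
log_income <- suppressWarnings(log10(income))
print(log_income)

# Neither NaN nor -Inf is finite, so both kinds of rows would be dropped
n_dropped <- sum(!is.finite(log_income))
print(n_dropped)

# Filtering explicitly makes the exclusion visible in the code
income_pos <- income[income > 0]
print(length(income_pos))
```

Filtering with a condition such as income > 0 before calling ggplot() should also silence the warnings, at the cost of having to remember that those records are missing from the figure.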

Finally, from the section "When should you use a logarithmic scale?", I quote and summarize only the first paragraph, and take only the example from the second paragraph onward.

You should use a logarithmic scale when percent change, or change in orders of magnitude, is more important than changes in absolute units. You should also use a log scale to better visualize data that is heavily skewed.

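That is, a log scale fits when relative change (percent change, or change in orders of magnitude) matters more than change in absolute units. It is also a good way to visualize heavily skewed data.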

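For example, a difference of $5,000 means much more in a population whose incomes are in the tens of thousands of dollars than in one whose incomes are in the hundreds of thousands or millions.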
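The arithmetic behind this example can be sketched in a few lines of base R. The three income levels are hypothetical values chosen for illustration; they are not taken from the book.

```r
# A fixed $5,000 raise applied to three hypothetical income levels
base <- c(20000, 200000, 2000000)
raised <- base + 5000

# Absolute change is identical for all three
abs_change <- raised - base

# Percent change shrinks as the base income grows
pct_change <- 100 * abs_change / base

# Distance on a log10 axis tracks the ratio, not the difference
log_shift <- log10(raised) - log10(base)

print(data.frame(base, abs_change, pct_change, log_shift))
```

The same $5,000 is a 25% change at $20,000 but only 0.25% at $2,000,000, and the shift along a log10 axis shrinks in the same way.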

By the way, there is apparently such a thing as a "log-scale histogram" too, but I cannot picture a use for it yet, so I will ignore it for now.

Continues in "A Bar Chart Is a Histogram for Discrete Data".


Tuesday, September 20, 2016
