りんだろぐ rindalog: Visualizing Two Variables: Fitting a Curve to a Scatter Plot

Continued from "Line より Scatter" (Scatter Rather Than Line).

Figure 3.11 below fits a straight line to the earlier scatter plot of age versus income.

> ggplot(custdata2, aes(x=age,y=income)) + geom_point() +
    stat_smooth(method="lm") + ylim(0,200000)

The ggplot2 stat layer used here is stat_smooth(method="lm"), where "lm" stands for linear model.
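As a rough illustration of what method="lm" amounts to: the layer fits an ordinary least-squares line of y on x, the same kind of model as lm(income ~ age). The sketch below uses base R only and synthetic data; the data frame `toy` and its coefficients are made up for illustration and are not custdata2.

```r
# Synthetic stand-in for custdata2 (hypothetical values, for illustration only).
set.seed(1)
toy <- data.frame(age = runif(200, 20, 80))
toy$income <- 20000 + 500 * toy$age + rnorm(200, sd = 5000)

# Ordinary least-squares line of income on age, the kind of model
# that stat_smooth(method = "lm") fits for aes(x = age, y = income).
fit <- lm(income ~ age, data = toy)
coefs <- coef(fit)  # intercept and slope of the fitted line

# Fitted values and a pointwise 95% confidence band on a grid of ages.
grid <- data.frame(age = seq(20, 80, by = 10))
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)

print(coefs)
print(cbind(grid, band))
```

The `lwr` and `upr` columns play the role of the gray ribbon drawn around the line in the plot.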

A straight line, however, clearly fails to capture the relationship between age and income. Figure 3.12 below fits a smoothing curve instead.

> c1 <- "The smoothing curve makes it easier to see that\nincome increases up to about age 40, then tends to\ndecrease after about age 55 or 60."
> c2 <- "The ribbon shows the standard\nerror around the smoothed estimate.\nIt tends to be wider where data is\nsparse, narrower where data is dense."
> myarrow <- arrow(type = "closed", length = unit(3,"mm"))
> ggplot(custdata2, aes(x=age,y=income)) + geom_point() +
    stat_smooth() + ylim(0,200000) +
    geom_text(data=data.frame(),aes(x=40,y=190000),label=c1,color="#56B4E9",fontface="bold",size=4) +
    geom_text(data=data.frame(),aes(x=77,y=130000),label=c2,color="#56B4E9",fontface="bold",size=4) +
    geom_segment(aes(x=80,y=100000,xend=80,yend=40000),size=0.5,color="#56B4E9",arrow=myarrow)

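The comments on the plot say the following.

The curve makes it clear that income rises until about age 40 and falls from around age 55 or 60.
The shaded ribbon along the estimated curve shows the standard error. It is wider where data is sparse and narrower where data is dense.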

In short, the relationship between age and income is roughly described by a curve like this one. A straight line cannot capture it.
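For reference, stat_smooth() with no method argument picks the smoother automatically: ggplot2 uses loess for fewer than 1,000 observations and a GAM otherwise. The sketch below fits a loess curve directly in base R and extracts its standard error, which is what the ribbon is built from. The data is synthetic (a made-up hump-shaped relation), so the numbers say nothing about custdata2.

```r
# Synthetic data with a hypothetical hump-shaped age/income relation.
set.seed(2)
toy <- data.frame(age = runif(300, 20, 90))
toy$income <- 60000 - 30 * (toy$age - 50)^2 + rnorm(300, sd = 8000)

# Local regression (loess) with its default span and degree.
fit <- loess(income ~ age, data = toy)

# Smoothed estimate and its standard error on a grid of ages.
grid <- data.frame(age = seq(25, 85, by = 5))
pred <- predict(fit, newdata = grid, se = TRUE)
curve <- data.frame(age = grid$age, fit = pred$fit, se = pred$se.fit)
print(curve)

# Age at which the smoothed income is highest on this grid.
peak_age <- curve$age[which.max(curve$fit)]
print(peak_age)
```

Unlike the straight line, the fitted values here are free to rise and then fall, which is why the smoother can follow the shape of the data.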

Boolean Values and Numeric Values

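Age and income are both numeric. Next we visualize the relationship between health.ins (whether the customer has health insurance) and age. As shown below, health.ins has mode "logical", that is, it is a two-valued TRUE / FALSE variable.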

> summary(custdata2$health.ins)
   Mode   FALSE    TRUE    NA's
logical     119     791       0

Figure 3.13 below shows the likelihood of having health insurance as a function of age.

> c1 <- "Here, the y-variable is Boolean (0/1);\nwe've jittered it for legibility."
> c2 <- "The smoothing curve shows the fraction\nof customers with health insurance, as a\nfunction of age."
> # The Boolean variable health.ins must be converted to a 0/1 variable using as.numeric.
> ggplot(custdata2,aes(x=age,y=as.numeric(health.ins))) +
    # Since y values can only be 0 or 1, add a small jitter to get a sense of data density.
    geom_point(position=position_jitter(width=0.05,height=0.05)) +
    # Add smoothing curve.
    geom_smooth() +
    geom_text(data=data.frame(),aes(x=60,y=0.6),label=c1,color="#56B4E9",fontface="bold",size=4) +
    geom_text(data=data.frame(),aes(x=60,y=0.4),label=c2,color="#56B4E9",fontface="bold",size=4)

The curve shows how the health insurance coverage rate changes as age increases.
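Why a smoothed 0/1 variable reads as a coverage rate: as.numeric() maps FALSE to 0 and TRUE to 1, so any local average of the result is the fraction of TRUE values. A minimal sketch with synthetic data follows. The vectors `age` and `health.ins` are generated here, the age bands are arbitrary, and the rising probability is an assumption for illustration only.

```r
# Synthetic customers (hypothetical): chance of being insured rises with age.
set.seed(3)
n <- 500
age <- round(runif(n, 18, 90))
p <- plogis(-1 + 0.06 * age)
health.ins <- runif(n) < p  # logical TRUE / FALSE

# FALSE -> 0, TRUE -> 1, so a mean of y is a fraction insured.
y <- as.numeric(health.ins)

# Crude stand-in for the smoother: the fraction insured within each age band.
age_band <- cut(age, breaks = c(17, 30, 45, 60, 75, 90))
frac <- tapply(y, age_band, mean)
print(round(frac, 2))
```

The smoothing curve in the plot does the same thing continuously instead of in fixed bands.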

Hexbin Plots

As the number of data points grows, however, a scatter plot becomes hard to read: the points overlap more and more until the plot is a solid smear.

A hexbin plot solves this. It is essentially a two-dimensional histogram that shows the number of data points in each bin as color intensity. Figure 3.14 below is a hexbin plot of the earlier age and income data.

> library(hexbin)
> # Named c1 rather than c, to avoid masking the base function c().
> c1 <- "The hexbin plot gives a\nsense of the shape of a\ndense data cloud.\nThe lighter the\nbin, the more\ncustomers in\nthat bin."
> ggplot(custdata2,aes(x=age,y=as.numeric(income))) +
    geom_hex(binwidth=c(5,10000)) +
    geom_smooth(color="white", se=FALSE) + ylim(0,200000) +
    geom_text(data=data.frame(),aes(x=85,y=170000),label=c1,color="#56B4E9",fontface="bold",size=4)

In this plot, lighter means a higher count. Intuitively I would have expected darker to mean more, but...
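To make the "two-dimensional histogram" idea concrete, here is a small base R sketch that counts points per bin. It is an approximation under stated assumptions: it uses rectangular bins rather than hexagons, the data is synthetic, and the only thing borrowed from the plot above is the bin widths (5 years by 10,000).

```r
# Synthetic age/income data (hypothetical), clamped to the plotted range.
set.seed(4)
age <- runif(1000, 20, 80)
income <- pmin(pmax(0, 50000 + rnorm(1000, sd = 30000)), 200000)

# Rectangular bins with the same widths as binwidth = c(5, 10000).
age_bin <- cut(age, breaks = seq(20, 80, by = 5), include.lowest = TRUE)
income_bin <- cut(income, breaks = seq(0, 200000, by = 10000), include.lowest = TRUE)

# Each cell holds the number of points in that bin; a hexbin plot
# maps a count like this to the fill color of the bin.
counts <- table(age_bin, income_bin)
print(dim(counts))
print(max(counts))
```

A hexbin plot does the same counting over a hexagonal grid and then colors each hexagon by its count.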

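In the relationships visualized so far (age and income, age and insurance status), at least one of the two variables was numeric. Next time we visualize the case where both variables are categorical.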

Continued in "カテゴリー同士" (Categorical vs. Categorical).

Thursday, September 22, 2016