りんだろぐ rindalog: Three Distributions: Normal Distribution

For this post and the next several, I am working from "B.1. Distributions" in Appendix B of "Practical Data Science with R" by Nina Zumel and John Mount.

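Distributions are hardly a new topic, but the book's R-based explanation is remarkably clear. I find this approach far better than a clumsy prose explanation: before getting into the fine theory, computing probabilities and drawing graphs like this makes the big picture easy to grasp. I started from the theory and struggled to understand it for a long time...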

For some reason, R refers to the cumulative distribution function (or CDF) as the probability distribution function (hence the convention pDIST). This drives us crazy, because most people use the term probability distribution function (or PDF) to refer to what R calls dDIST. Be aware of this.

In other words, R calls the cumulative distribution function (CDF) the "probability distribution function" (pDIST), which invites confusion, because what most people mean by probability distribution function (PDF) is dDIST in R.

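I was confused at first and thought I had worked it out for myself, but reading it spelled out like this cleared things up considerably. Loosely defined terms cause confusion later on. That is also why I prefer not to use the Japanese terminology.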

{d,p,r,q}DIST

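More than two years ago I wrote something similar, relying on the R help pages, but this version is far easier to follow. Perhaps the passage of time has something to do with it?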

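R's probability distribution functions follow the general pattern {d,p,r,q}DIST: a prefix letter for distribution (density in R's own help pages), probability, random, or quantile, followed by DIST, which names the distribution (for example, norm for the normal distribution).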

dDIST(x, ...) is the distribution function (PDF) that returns the probability of observing the value x.

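The distribution function (PDF): returns the probability of observing the value x. Strictly speaking, for a continuous distribution such as the normal, this is the probability density at x rather than a probability.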
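A quick check of my own (not from the book), assuming the standard normal: dnorm() should agree with the textbook density formula 1 / sqrt(2 * pi) at x = 0, and because it is a density, its value can exceed 1 when the standard deviation is small. The variable names here are just for illustration.

```r
# Density of the standard normal at 0, from R and from the formula
d0 <- dnorm(0)
d0_formula <- 1 / sqrt(2 * pi)
print(all.equal(d0, d0_formula))

# With a small standard deviation the density at the mean is greater than 1,
# so the value cannot be read as a probability
d_narrow <- dnorm(0, mean = 0, sd = 0.1)
print(d_narrow > 1)
```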

pDIST(x, ...) is the cumulative distribution function (CDF) that returns the probability of observing a value less than x. The flag lower.tail=F will cause pDIST(x, ...) to return the probability of observing a value greater than x (the area under the right tail, rather than the left).

The cumulative distribution function (CDF): returns the probability of observing a value at or below x. For the probability of observing a value above x, use lower.tail=F.

rDIST(n, ...) is the random number generation function that returns n values drawn from the distribution DIST.

Generates n random numbers from the distribution DIST.
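As a rough sanity check (my own sketch, not from the book), a large sample from rnorm() should have a mean near 0 and a standard deviation near 1, and the fraction of values below -2 should land close to pnorm(-2). The seed and sample size are arbitrary choices.

```r
set.seed(42)                 # arbitrary seed, only for reproducibility
n <- 100000
x <- rnorm(n)                # mean = 0, sd = 1 by default

sample_mean <- mean(x)
sample_sd <- sd(x)
frac_below <- mean(x < -2)   # empirical counterpart of pnorm(-2)

print(c(mean = sample_mean, sd = sample_sd))
print(c(empirical = frac_below, theoretical = pnorm(-2)))
```

The numbers will not match exactly, and they change with the seed, but they should get closer as n grows.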

qDIST(p, ...) is the quantile function that returns the x corresponding to the pth percentile of DIST. The flag lower.tail=F will cause qDIST(p, ...) to return the x that corresponds to the 1 - pth percentile of DIST.

Returns the x corresponding to the pth quantile of the distribution DIST. With lower.tail=F it returns the x for the (1 - p)th quantile.
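To see the lower.tail flag at work on qnorm(), here is a small sketch of my own, assuming the standard normal: asking for the point with 25% of the probability to its right should give the same x as asking for the 75th percentile.

```r
p <- 0.25
upper <- qnorm(p, lower.tail = FALSE)  # x with probability p to its right
lower <- qnorm(1 - p)                  # x with probability 1 - p to its left

print(c(upper = upper, lower = lower))
print(all.equal(upper, lower))
```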

For the normal distribution with mean 0 and standard deviation 1:

> pnorm(-2, lower.tail = T)
[1] 0.02275013
> pnorm(-2, lower.tail = F)
[1] 0.9772499
> pnorm(-2, lower.tail = T) + pnorm(-2, lower.tail = F)
[1] 1

pnorm() and qnorm() are inverses of each other:

> pnorm(0)
[1] 0.5
> qnorm(0.5)
[1] 0
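The same round trip can be checked over a range of values rather than a single point. This is my own sketch; the grids of p and x values are arbitrary.

```r
# Probabilities -> quantiles -> probabilities
p <- seq(0.05, 0.95, by = 0.05)
round_trip_p <- pnorm(qnorm(p))

# Quantiles -> probabilities -> quantiles
x <- seq(-3, 3, by = 0.5)
round_trip_x <- qnorm(pnorm(x))

# all.equal() compares up to a small numerical tolerance
print(all.equal(round_trip_p, p))
print(all.equal(round_trip_x, x))
```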

Note: for discrete distributions, pDIST and qDIST are not inverses of each other.
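A minimal illustration of that caveat, using the binomial distribution as my own example (the book's note does not name a specific distribution): qbinom() returns the smallest x whose cumulative probability reaches p, so feeding that x back into pbinom() generally overshoots p.

```r
size <- 10
prob <- 0.5
p <- 0.5

x <- qbinom(p, size, prob)        # smallest x with P(X <= x) >= p
p_back <- pbinom(x, size, prob)   # P(X <= x), which overshoots p

print(x)        # 5
print(p_back)   # 0.6230469, not 0.5
```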

75% of the Area Under the Curve

The "lower" 75% of the standard normal distribution, shaded (Figure B.3).

Compared with the shading method I used before, geom_area is far simpler.

# create a graph of the normal distribution with mean 0, sd 1
library(ggplot2)
x <- seq(from=-5, to=5, length.out=100)
f <- dnorm(x)
nframe <- data.frame(x=x, y=f)

# calculate the 75th percentile
line <- qnorm(0.75)
xstr <- sprintf("qnorm(0.75) = %1.3f", line)

# the part of the normal distribution to the left of the 75th percentile
nframe75 <- subset(nframe, x < line)

# Plot it. The shaded area is 75% of the area under the normal curve.
ggplot(nframe, aes(x=x, y=y)) + geom_line() +
  geom_area(data=nframe75, aes(x=x, y=y), fill="gray") +
  geom_vline(xintercept=line, linetype=2) +
  annotate("text", x=line, y=0, label=xstr, vjust=1)
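To confirm that the shaded region really covers 75% of the area, one option (my own check, not from the book) is to integrate the density numerically up to the 75th percentile with integrate() from base R.

```r
line <- qnorm(0.75)

# Numerically integrate the standard normal density from -Inf to the 75th percentile
shaded <- integrate(dnorm, lower = -Inf, upper = line)

print(shaded$value)   # approximately 0.75, up to numerical integration error
```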

Continued in "Lognormal Distribution".

Saturday, September 24, 2016
