2016年9月24日土曜日

三つの分布:Binomial Distribution(二項分布)

Lognormal Distribution(対数正規分布)」からの続き。

「コインの裏表」など、binary classification problems(「二値分割問題」、正式な訳は不明)のモデル化の代表格なのが、今回の binomial distribution「二項分布」。

次の図 B.6 は、表が出る確率が異なるコインを、それぞれ 50 回投げた場合の「表の回数」の確率分布。
> numflips <- 50
> x <- 0:numflips

# probalility of heads for several different coins
> p <- c(0.05, 0.15, 0.5, 0.75)
> plabels <- paste("p =",p)

# calculate the probability of seeing x heads in numflips flips for all the coins. This probably isn't the most elegant way to do this, but at least it's easy to read
> flips <- NULL
> for (i in 1:length(p)) {
 coin <- p[i]
 label <- plabels[i]
 tmp <- data.frame(number.of.heads=x, probability=dbinom(x,numflips,coin),coin.type=label)
 flips <- rbind(flips, tmp)
}
> ggplot(flips, aes(x=number.of.heads, y=probability)) +
 geom_point(aes(color=coin.type,shape=coin.type)) +
 geom_line(aes(color=coin.type))


次の 図 B.7 は、次の問題の解答。
問題:クラスの生徒の男女比が 50 : 50 の母集団におて、1 クラス 20 人クラスの標本データを 100 個集めた。1 クラス当たりの女性の人数を分布を予想しなさい。
本描画では、stat_summary, scale_*_continuous については末尾の「R コード」を参照。

> p <- 0.5          # the percentage of females in this student population
> class.size <- 20  # size of a classroom
> numclasses <- 100 # how many classrooms we observe

# what might a typecal outcome look like?
# Because we didn't call set.seed, we expect different results each time we run this line.
> numFemales <- rbinom(numclasses, class.size, p)

# the theoretical counts (not necessarily integral)
> probs <- dbinom(0:class.size, class.size, p)
> tcount <- numclasses*probs

# the obvious way to plot this is with histogram or geom_bar but this might just look better
> zero <- function(x) {0} # a dummy function that returns only 0
> ggplot(data.frame(number.of.girls=numFemales, dummy=1),
   aes(x=number.of.girls, y=dummy)) +
 # count the number of times you see x heads
 stat_summary(fun.y="sum", geom="point", size=2) +
 stat_summary(fun.ymax="sum", fun.ymin="zero", geom="linerange") + 
 # superimpose the theoretical number of times you see x heads.
 geom_line(data=data.frame(x=0:class.size, y=probs),
   aes(x=x, y=tcount), linetype=2) +
 scale_x_continuous(breaks=0:class.size, labels=0:class.size) +
 scale_y_continuous("number of classrooms")


{p,q}binom

pbinom(k, nflips, p)
The cumulative distribution function (CDF) that returns the probability of observing k heads or fewer from nflips of a coin with heads probability p. 
確率 pnflips 回のコイン投げで、表が k 回以下の確率
> pbinom(5,10,0.5)
[1] 0.6230469
> pbinom(5,10,0.5, lower.tail=F)
[1] 0.3769531
> pbinom(50,100,0.5)
[1] 0.5397946
> pbinom(50,100,0.5, lower.tail=F)
[1] 0.4602054

qbinom(q, nflips, p)
The quantile function that returns the number of heads k that corresponds to the qth percentile of the binomial distribution corresponding to nflips of a coin with heads probability p.
次の条件のもとでの表の回数。表の確率 p のコインを nflips 回投げた時の二項分布の q 番目の分位
> qbinom(0.5,10,0.5)
[1] 5
> qbinom(0.5,100,0.5)
[1] 50

では、{p,q}binom を使って解く解く問題、先ずは qbinom の例。

> nflips <- 100
> nheads <- c(25,45,50,60) # number of heads
# What are the probabilities of observing at most that number of heads on a fair coin?
# コインに偏りがない確率が最も高い結果はどれか?
> left.tail <- pbinom(nheads, nflips, 0.5)
> sprintf("%2.2f", left.tail)
[1] "0.00" "0.18" "0.54" "0.98"

# The probabilities of observing more than that number of heads on a fair coin?
# 偏りのないコインで、この結果以上の数の表が出る確率は?
> right.tail <- pbinom(nheads, nflips, 0.5, lower.tail=F)
> sprintf("%2.2f", right.tail)
[1] "1.00" "0.82" "0.46" "0.02"
> left.tail + right.tail
[1] 1 1 1 1

So if you flip a fair coin 100 times, you are guaranteed to see more than 10  25 heads, almost guaranteed to see fewer than 60, and probably more than 45. 
つまり、偏りのないコインを 100 回投げると、25 回以上の表は保証されて、45 回以上 60 回未満は概ね保証される。

次は qbinom の例。

What's the 95% "central" interval of heads that you would expect to observe on 100 flips of a fair coin? 
偏りのないコインの 100 回の試行で、95%の中央区間の表の数は?

> left.edge <- qbinom(0.025, nflips, 0.5)
> right.edge <- qbinom(0.025, nflips, 0.5, lower.tail=F)
> c(left.edge, right.edge)
[1] 40 60

So with 95% probability you should see between 40 and 60 heads. 
よって、表の回数は 40 から 60 回の間。

pbinom() と qbinom() の関係の注意点。

One thing to keep in mind is that because the binomial distribution is discrete, pbinom() and qbinom() won’t be perfect inverses of each other, as is the case with continuous distributions like the normal. 
二項分布は離散型のため、正規分布などの連続型と違い、pbinom() と qbinom() 完全な逆関数の関係ではない。

例えば、次では一致する

> pbinom(45, nflips, 0.5)
[1] 0.1841008
> qbinom(0.1841008, nflips, 0.5)
[1] 45

が、次の例では一致しない。

> qbinom(0.75, nflips, 0.5)
[1] 53
> pbinom(53, nflips, 0.5)
[1] 0.7579408


R コード

ここでは解説はしないが、以下の 4 つのプロットとスクリプトから、stat_summary, scale_*_continuous の役割はわかるでしょう。
> p1 <- ggplot(data.frame(number.of.girls=numFemales, dummy=1),
    aes(x=number.of.girls, y=dummy)) +
  stat_summary(fun.y="sum", geom="point", size=2)
> p2 <- p1 + stat_summary(fun.ymax="sum", fun.ymin="zero", geom="linerange")
> p3 <- p2 + geom_line(data=data.frame(x=0:class.size, y=probs),
   aes(x=x, y=tcount), linetype=2)
> p4 <- p3 + scale_x_continuous(breaks=0:class.size, labels=0:class.size) +  scale_y_continuous("number of classrooms")
> library(grid)
> nplot(list(p1,p2,p3,p4))
> grid.newpage()
> pushViewport(viewport(layout=grid.layout(2,2)))
> print(p1,vp=viewport(layout.pos.row=1,layout.pos.col=1))
> print(p2,vp=viewport(layout.pos.row=1,layout.pos.col=2))
> print(p3,vp=viewport(layout.pos.row=2,layout.pos.col=1))
> print(p4,vp=viewport(layout.pos.row=2,layout.pos.col=2))

0 件のコメント:

コメントを投稿