This post is based on Introducing the ebbr package for empirical Bayes estimation (using baseball statistics).
Note: in the original series order, this installment would have been Understanding mixture models and expectation-maximization (using baseball statistics), but because of a "glitch?" that kept me from reproducing that article's results, I covered the next post instead. This affects parts of the text below as well.
Introducing the ebbr package for empirical Bayes estimation (using baseball statistics)
We’ve introduced a number of statistical techniques in this series: estimating a beta prior, beta-binomial regression, hypothesis testing, mixture models, and many other components of the empirical Bayes approach. These approaches are useful whenever you have many observations of success/total data.
Since I’ve provided the R code within each post, you can certainly apply these methods to your own data (as some already have!). However, you’d probably find yourself copying and pasting a rather large amount of code, which can take you out of the flow of your own data analysis and introduces opportunities for mistakes.
Here, I’ll introduce the new ebbr package for performing empirical Bayes on binomial data. The package offers convenient tools for performing almost all the analyses we’ve done during this series, along with documentation and examples. This post also serves as review of our entire empirical Bayes series so far: we’ll touch on each post briefly and recreate some of the key results.
As described in ebbr: Empirical Bayes on the Binomial in R, first run:
devtools::install_github("dgrtwo/ebbr")
Setup
We start, as always, by assembling some per-player batting data (note that all the code in this post can be found here). It’s worth remembering that while we’re using batting averages as an example, this package can be applied to many types of data.
library(knitr)
opts_chunk$set(message = FALSE, warning = FALSE)
library(scales)
library(ggplot2)
theme_set(theme_bw())
library(Lahman)
library(dplyr)
library(tidyr)
# Grab career batting average of non-pitchers
# (allow players that have pitched <= 3 games, like Ty Cobb)
pitchers <- Pitching %>%
group_by(playerID) %>%
summarize(gamesPitched = sum(G)) %>%
filter(gamesPitched > 3)
# Add player names
player_names <- Master %>%
tbl_df() %>%
dplyr::select(playerID, nameFirst, nameLast, bats) %>%
unite(name, nameFirst, nameLast, sep = " ")
# include the "bats" (handedness) and "year" column for later
career_full <- Batting %>%
filter(AB > 0) %>%
anti_join(pitchers, by = "playerID") %>%
group_by(playerID) %>%
summarize(H = sum(H), AB = sum(AB), year = mean(yearID)) %>%
inner_join(player_names, by = "playerID") %>%
filter(!is.na(bats))
# We don't need all this data for every step
career <- career_full %>%
select(-bats, -year)
Empirical Bayes estimation
In one of the first posts in the series, we noticed that the distribution of player batting averages looked roughly like a beta distribution:
career %>%
filter(AB >= 100) %>%
ggplot(aes(H / AB)) +
geom_histogram()
We thus wanted to estimate the beta prior for the overall dataset, which is the first step of empirical Bayes analysis. The ebb_fit_prior function encapsulates this, taking the data along with the success/total columns and fitting the beta through maximum likelihood.
library(ebbr)
prior <- career %>%
filter(AB >= 500) %>%
ebb_fit_prior(H, AB)
prior
## Empirical Bayes binomial fit with method mle
## Parameters:
## # A tibble: 1 × 2
## alpha beta
## <dbl> <dbl>
## 1 96.8 275
prior is an ebb_prior object, which is a statistical model object containing the details of the beta distribution. Here we can see the overall alpha and beta parameters. (We’ll see later that it can store more complicated models).
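ebb_fit_prior fits the beta prior by maximum likelihood in R. As a rough stand-in for the idea (not the package's actual code), here is a Python sketch that recovers alpha and beta by moment matching from the mean and variance of the raw averages; the input list is made-up data:

```python
from statistics import fmean, pvariance

def fit_beta_moments(averages):
    """Method-of-moments estimate of a Beta(alpha, beta) prior
    from a list of observed proportions (a simpler cousin of MLE)."""
    m = fmean(averages)
    v = pvariance(averages)
    # For Beta(a, b): mean = a/(a+b), var = mean(1-mean)/(a+b+1)
    common = m * (1 - m) / v - 1  # estimate of alpha + beta
    return m * common, (1 - m) * common

alpha, beta = fit_beta_moments([0.20, 0.25, 0.30, 0.25])
print(round(alpha, 2), round(beta, 2))  # 37.25 111.75
```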
The second step of empirical Bayes analysis is updating each observation based on the overall statistical model. Based on the philosophy of the broom package, this is achieved with the augment function:
augment(prior, data = career)
# A tibble: 9,653 × 10
playerID H AB name .alpha1 .beta1 .fitted .raw .low .high
<chr> <int> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 aaronha01 3771 12364 Hank Aaron 3867.8 8868 0.304 0.3050 0.296 0.312
2 aaronto01 216 944 Tommie Aaron 312.8 1003 0.238 0.2288 0.215 0.261
3 abadan01 2 21 Andy Abad 98.8 294 0.252 0.0952 0.210 0.296
4 abadijo01 11 49 John Abadie 107.8 313 0.257 0.2245 0.216 0.299
5 abbated01 772 3044 Ed Abbaticchio 868.8 2547 0.254 0.2536 0.240 0.269
6 abbeych01 492 1751 Charlie Abbey 588.8 1534 0.277 0.2810 0.259 0.297
7 abbotda01 1 7 Dan Abbott 97.8 281 0.259 0.1429 0.216 0.304
8 abbotfr01 107 513 Fred Abbott 203.8 681 0.230 0.2086 0.203 0.259
9 abbotje01 157 596 Jeff Abbott 253.8 714 0.262 0.2634 0.235 0.291
10 abbotku01 523 2044 Kurt Abbott 619.8 1796 0.257 0.2559 0.239 0.274
# ... with 9,643 more rows
Notice we’ve now added several columns to the original data, each beginning with . (which is a convention of the augment verb to avoid overwriting existing columns). We have the .alpha1 and .beta1 columns as the parameters for each player’s posterior distribution, as well as .fitted representing the new posterior mean (the “shrunken average”).
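The updated columns follow the standard beta-binomial conjugate update. A minimal Python sketch (illustrative, not ebbr's code), checked against Hank Aaron's row in the table above under the Beta(96.8, 275) prior:

```python
def eb_update(alpha0, beta0, hits, at_bats):
    """Conjugate update: posterior Beta parameters and shrunken mean
    after observing `hits` successes in `at_bats` trials."""
    alpha1 = alpha0 + hits
    beta1 = beta0 + (at_bats - hits)
    return alpha1, beta1, alpha1 / (alpha1 + beta1)

# Hank Aaron: H = 3771, AB = 12364
a1, b1, fitted = eb_update(96.8, 275, 3771, 12364)
print(round(a1, 1), b1, round(fitted, 3))  # 3867.8 8868 0.304
```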
We often want to run these two steps in sequence: estimating a model, then using it as a prior for each observation. The ebbr package provides a shortcut, combining them into one step with add_ebb_estimate:
eb_career <- career %>%
add_ebb_estimate(H, AB, prior_subset = AB >= 500)
eb_career
# A tibble: 9,653 × 10
playerID H AB name .alpha1 .beta1 .fitted .raw .low .high
<chr> <int> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 aaronha01 3771 12364 Hank Aaron 3867.8 8868 0.304 0.3050 0.296 0.312
2 aaronto01 216 944 Tommie Aaron 312.8 1003 0.238 0.2288 0.215 0.261
3 abadan01 2 21 Andy Abad 98.8 294 0.252 0.0952 0.210 0.296
4 abadijo01 11 49 John Abadie 107.8 313 0.257 0.2245 0.216 0.299
5 abbated01 772 3044 Ed Abbaticchio 868.8 2547 0.254 0.2536 0.240 0.269
6 abbeych01 492 1751 Charlie Abbey 588.8 1534 0.277 0.2810 0.259 0.297
7 abbotda01 1 7 Dan Abbott 97.8 281 0.259 0.1429 0.216 0.304
8 abbotfr01 107 513 Fred Abbott 203.8 681 0.230 0.2086 0.203 0.259
9 abbotje01 157 596 Jeff Abbott 253.8 714 0.262 0.2634 0.235 0.291
10 abbotku01 523 2044 Kurt Abbott 619.8 1796 0.257 0.2559 0.239 0.274
# ... with 9,643 more rows
(The add_ prefix is inspired by modelr’s add_residuals and add_predictions, which also fit a statistical model then append columns). Note the prior_subset argument, which specifies that while we keep all the shrunken values in the output, the prior is fit only on players with at least 500 at-bats.
Estimates and credible intervals
Having the posterior estimates for each player lets us explore the model results using our normal tidy tools like dplyr and ggplot2. For example, we could visualize how batting averages were shrunken towards the mean of the prior:
eb_career %>%
ggplot(aes(.raw, .fitted, color = AB)) +
geom_point() +
geom_abline(color = "red") +
scale_color_continuous(trans = "log",
breaks = c(1, 10, 100, 1000)) +
geom_hline(yintercept = tidy(prior)$mean, color = "red", lty = 2) +
labs(x = "Raw batting average",
y = "Shrunken batting average")
This was one of our first visualizations in the empirical Bayes estimation post. I like how it captures what empirical Bayes estimation is doing: moving all batting averages towards the prior mean (the dashed red line), but moving them less if there is a lot of information about that player (high AB).
In the following post, we used credible intervals to visualize our uncertainty about each player’s true batting average. The output of add_ebb_estimate comes with those credible intervals in the form of .low and .high. This makes it easy to visualize these intervals for particular players, such as the 1998 Yankees:
yankee_1998 <- c("brosisc01", "jeterde01", "knoblch01",
"martiti02", "posadjo01", "strawda01", "willibe02")
eb_career %>%
filter(playerID %in% yankee_1998) %>%
mutate(name = reorder(name, .fitted)) %>%
ggplot(aes(.fitted, name)) +
geom_point() +
geom_errorbarh(aes(xmin = .low, xmax = .high)) +
labs(x = "Estimated batting average (w/ 95% confidence interval)",
y = "Player")
Notice that once we have the output from add_ebb_estimate, we’re no longer relying on the ebbr package, only on dplyr and ggplot2.
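The .low and .high columns are just the 2.5% and 97.5% quantiles of each posterior beta (qbeta in R). As a hedged Python sketch, we can approximate them with Monte Carlo draws from the posterior (an exact version would use a beta inverse-CDF such as scipy.stats.beta.ppf):

```python
import random

def credible_interval(alpha1, beta1, level=0.95, draws=20000, seed=42):
    """Approximate equal-tailed credible interval for a Beta posterior
    via Monte Carlo (stands in for R's qbeta)."""
    rng = random.Random(seed)
    xs = sorted(rng.betavariate(alpha1, beta1) for _ in range(draws))
    tail = (1 - level) / 2
    return xs[int(tail * draws)], xs[int((1 - tail) * draws) - 1]

# Hank Aaron's posterior from the table above: Beta(3867.8, 8868)
low, high = credible_interval(3867.8, 8868)
print(round(low, 3), round(high, 3))  # roughly 0.296 0.312
```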
Hierarchical modeling
In two subsequent posts, we examined how this beta-binomial model may not be appropriate, because of the relationship between a player’s at-bats and their batting average. Good batters tend to have long careers, while poor batters may retire quickly.
- Beta-binomial regression
- Understanding empirical Bayesian hierarchical modeling
career %>%
filter(AB >= 10) %>%
ggplot(aes(AB, H / AB)) +
geom_point() +
geom_smooth(method = "lm") +
scale_x_log10()
We solved this by fitting a prior that depended on AB, through beta-binomial regression. The add_ebb_estimate function offers this option, by setting method = "gamlss" and providing a formula to mu_predictors.
eb_career_ab <- career %>%
add_ebb_estimate(H, AB, method = "gamlss",
mu_predictors = ~ log10(AB))
eb_career_ab
## # A tibble: 9,653 × 14
## playerID H AB name .mu .sigma .alpha0 .beta0 .alpha1 .beta1 .fitted .raw
## <chr> <int> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 aaronha01 3771 12364 Hank Aaron 0.289 0.00181 159.4 393 3930 8986 0.304 0.3050
## 2 aaronto01 216 944 Tommie Aaron 0.247 0.00181 136.4 416 352 1144 0.235 0.2288
## 3 abadan01 2 21 Andy Abad 0.193 0.00181 106.7 446 109 465 0.190 0.0952
## 4 abadijo01 11 49 John Abadie 0.204 0.00181 112.9 440 124 478 0.206 0.2245
## 5 abbated01 772 3044 Ed Abbaticchio 0.265 0.00181 146.6 406 919 2678 0.255 0.2536
## 6 abbeych01 492 1751 Charlie Abbey 0.257 0.00181 141.7 411 634 1670 0.275 0.2810
## 7 abbotda01 1 7 Dan Abbott 0.179 0.00181 99.1 453 100 459 0.179 0.1429
## 8 abbotfr01 107 513 Fred Abbott 0.238 0.00181 131.3 421 238 827 0.224 0.2086
## 9 abbotje01 157 596 Jeff Abbott 0.240 0.00181 132.5 420 290 859 0.252 0.2634
## 10 abbotku01 523 2044 Kurt Abbott 0.259 0.00181 143.1 409 666 1930 0.257 0.2559
## # ... with 9,643 more rows, and 2 more variables: .low <dbl>, .high <dbl>
(You can also provide sigma_predictors to have the variance of the beta depend on regressors, though it’s less common that you’d want to do so).
The augmented output is now a bit more complicated: besides the posterior parameters such as .alpha1, .beta1, and .fitted, each observation also has its own prior parameters .alpha0 and .beta0. These are predicted based on the regression on AB.
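Judging from the table above, each row's prior appears to come from the gamlss-style beta-binomial parameterization, alpha0 = mu/sigma and beta0 = (1 - mu)/sigma; this mapping is an inference from the numbers shown, not ebbr's documented internals. A Python sketch, checked against Hank Aaron's row (.mu = 0.289, .sigma = 0.00181, .alpha0 = 159.4, .beta0 = 393):

```python
def prior_from_mu_sigma(mu, sigma):
    """Convert a fitted mean/dispersion pair into Beta prior parameters,
    assuming the gamlss-style mapping alpha0 = mu/sigma, beta0 = (1-mu)/sigma."""
    return mu / sigma, (1 - mu) / sigma

a0, b0 = prior_from_mu_sigma(0.289, 0.00181)
print(round(a0, 1), round(b0, 1))  # roughly 159.7 392.8, matching the table's rounded 159.4 / 393
```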
The other parameters, such as .fitted and the credible interval, are now shrinking towards a trend rather than towards a constant. We can see this by plotting AB against the original and the shrunken estimates:
eb_career_ab %>%
filter(AB > 10) %>%
rename(Raw = .raw, Shrunken = .fitted) %>%
gather(type, estimate, Raw, Shrunken) %>%
ggplot(aes(AB, estimate)) +
geom_point() +
facet_wrap(~ type) +
scale_x_log10()
As we saw in the post on hierarchical modeling, the model can incorporate still more useful information. For example, it could include what year they played in, and whether they are left- or right-handed, both of which tend to affect batting average.
library(splines)
eb_career_prior <- career_full %>%
ebb_fit_prior(H, AB, method = "gamlss",
mu_predictors = ~ 0 + ns(year, df = 5) * bats + log(AB))
In this case I’m fitting the prior with ebb_fit_prior rather than adding the estimates with add_ebb_estimate. This lets us feed it new data that we generate ourselves, and examine how the posterior distribution would change. This takes a bit more work, but lets us re-generate one of the more interesting plots from that post about how time and handedness relate:
# fake data ranging from 1885 to 2013
fake_data <- crossing(H = 300,
AB = 1000,
year = seq(1885, 2013),
bats = c("L", "R"))
# find the mean of the prior, as well as the 95% quantiles,
# for each of these combinations. This does require a bit of
# manual manipulation of alpha0 and beta0:
augment(eb_career_prior, newdata = fake_data) %>%
mutate(prior = .alpha0 / (.alpha0 + .beta0),
prior.low = qbeta(.025, .alpha0, .beta0),
prior.high = qbeta(.975, .alpha0, .beta0)) %>%
ggplot(aes(year, prior, color = bats)) +
geom_line() +
geom_ribbon(aes(ymin = prior.low,
ymax = prior.high), alpha = .1, lty = 2) +
ylab("Prior distribution (mean + 95% quantiles)")
Since the ebbr package makes these models convenient to fit, you can try a few variations on the model and compare them.
Hypothesis testing
Another pair of posts examined the problem of hypothesis testing. For example, we wanted to get a posterior probability for the statement “this player’s true batting average is greater than .300”, so that we could construct a “Hall of Fame” of such players.
This method is implemented in the add_ebb_prop_test (notice that like add_ebb_estimate, it adds columns to existing data). add_ebb_prop_test takes the output of an earlier add_ebb_estimate operation, which contains posterior parameters for each observation, and appends columns to it:
test_300 <- career %>%
add_ebb_estimate(H, AB, method = "gamlss",
mu_predictors = ~ log10(AB)) %>%
add_ebb_prop_test(.300, sort = TRUE)
test_300
## # A tibble: 9,653 × 16
## playerID H AB name .mu .sigma .alpha0 .beta0 .alpha1 .beta1 .fitted
## <chr> <int> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 cobbty01 4189 11434 Ty Cobb 0.287 0.00181 159 394 4348 7639 0.363
## 2 hornsro01 2930 8173 Rogers Hornsby 0.282 0.00181 156 397 3086 5640 0.354
## 3 speaktr01 3514 10195 Tris Speaker 0.285 0.00181 158 395 3672 7076 0.342
## 4 delahed01 2596 7505 Ed Delahanty 0.280 0.00181 155 398 2751 5307 0.341
## 5 willite01 2654 7706 Ted Williams 0.281 0.00181 155 397 2809 5449 0.340
## 6 keelewi01 2932 8591 Willie Keeler 0.282 0.00181 156 396 3088 6055 0.338
## 7 lajoina01 3242 9589 Nap Lajoie 0.284 0.00181 157 395 3399 6742 0.335
## 8 jacksjo01 1772 4981 Shoeless Joe Jackson 0.273 0.00181 151 401 1923 3610 0.348
## 9 gwynnto01 3141 9288 Tony Gwynn 0.284 0.00181 157 396 3298 6543 0.335
## 10 heilmha01 2660 7787 Harry Heilmann 0.281 0.00181 155 397 2815 5524 0.338
## # ... with 9,643 more rows, and 5 more variables: .raw <dbl>, .low <dbl>, .high <dbl>, .pep <dbl>,
## # .qvalue <dbl>
(Note the sort = TRUE argument, which sorts in order of our confidence in each player). There are now too many columns to read easily, so we’ll select only a few of the most interesting ones:
test_300 %>%
select(name, H, AB, .fitted, .low, .high, .pep, .qvalue)
## # A tibble: 9,653 × 8
## name H AB .fitted .low .high .pep .qvalue
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Ty Cobb 4189 11434 0.363 0.354 0.371 2.48e-49 2.48e-49
## 2 Rogers Hornsby 2930 8173 0.354 0.344 0.364 2.62e-27 1.31e-27
## 3 Tris Speaker 3514 10195 0.342 0.333 0.351 7.03e-21 2.34e-21
## 4 Ed Delahanty 2596 7505 0.341 0.331 0.352 5.86e-16 1.46e-16
## 5 Ted Williams 2654 7706 0.340 0.330 0.350 1.87e-15 4.91e-16
## 6 Willie Keeler 2932 8591 0.338 0.328 0.347 3.55e-15 1.00e-15
## 7 Nap Lajoie 3242 9589 0.335 0.326 0.344 1.05e-14 2.36e-15
## 8 Shoeless Joe Jackson 1772 4981 0.348 0.335 0.360 1.42e-14 3.84e-15
## 9 Tony Gwynn 3141 9288 0.335 0.326 0.344 2.72e-14 6.43e-15
## 10 Harry Heilmann 2660 7787 0.338 0.327 0.348 6.79e-14 1.26e-14
## # ... with 9,643 more rows
Notice that two columns have been added, with per-player metrics described in the post on Bayesian false discovery rates (FDR).
- .pep: the posterior error probability: the probability that this player’s true batting average is less than .300 (that is, the probability that including the player is a false discovery).
- .qvalue: the q-value, which corrects for multiple testing by controlling the false discovery rate (FDR). Keeping only players with a q-value below .05 would mean that only 5% of those included are expected to be false discoveries.
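As a hedged sketch of what these columns mean: .pep is the posterior mass below the threshold (pbeta(.300, alpha1, beta1) in R), and a player's q-value is the running mean of the PEPs sorted in increasing order. In Python, approximating the beta CDF by Monte Carlo:

```python
import random

def pep(alpha1, beta1, threshold=0.300, draws=10000, seed=1):
    """Posterior error probability: P(true average < threshold) under
    the Beta posterior, approximated by Monte Carlo (pbeta in R)."""
    rng = random.Random(seed)
    return sum(rng.betavariate(alpha1, beta1) < threshold
               for _ in range(draws)) / draws

def qvalues(peps):
    """q-values: cumulative mean of the PEPs sorted in increasing order."""
    out, total = [], 0.0
    for i, p in enumerate(sorted(peps), start=1):
        total += p
        out.append(total / i)
    return out

print(pep(4348, 7639))  # Ty Cobb's posterior: essentially 0
print([round(q, 4) for q in qvalues([0.01, 0.02, 0.06])])  # [0.01, 0.015, 0.03]
```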
For example, we could find how many players would be added to our “Hall of Fame” with an FDR of 5%, or 1%:
sum(test_300$.qvalue < .05)
## [1] 114
sum(test_300$.qvalue < .01)
## [1] 78
Player-player A/B test
Another post, on Bayesian A/B testing, discussed the case where instead of comparing each observation to a single threshold (like .300) we want to compare it to another player’s posterior distribution. I noted that this is similar to the problem of “A/B testing”, where we might be comparing two clickthrough rates, each represented by successes / total.
The post compared each player in history to Mike Piazza, and found players we were confident were better batters. We’d first find Piazza’s posterior parameters:
piazza <- eb_career_ab %>%
filter(name == "Mike Piazza")
piazza_params <- c(piazza$.alpha1, piazza$.beta1)
piazza_params
## [1] 2281 5182
This vector of two parameters, an alpha and a beta, can be passed into add_ebb_prop_test just like we passed in a threshold:*1
compare_piazza <- eb_career_ab %>%
add_ebb_prop_test(piazza_params, approx = TRUE, sort = TRUE)
*1: This is rather similar to how the built-in prop.test function handles the difference between one-sample and two-sample tests. I’ve often found this kind of behavior annoying (determining what test to use based on the length of the input), but I must admit it is convenient.
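Conceptually, the two-sample test asks for the posterior probability that one player's true average exceeds the other's (ebbr's approx = TRUE uses an approximation rather than simulation). A Monte Carlo sketch in Python, with the posterior parameters taken from the tables above:

```python
import random

def prob_better(a1, b1, a2, b2, draws=10000, seed=7):
    """Monte Carlo estimate of P(theta_1 > theta_2) for two Beta posteriors."""
    rng = random.Random(seed)
    return sum(rng.betavariate(a1, b1) > rng.betavariate(a2, b2)
               for _ in range(draws)) / draws

# Ty Cobb's posterior Beta(4348, 7639) vs. Mike Piazza's Beta(2281, 5182)
p = prob_better(4348, 7639, 2281, 5182)
print(p)  # very close to 1
```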
Again we select only a few interesting columns:
compare_piazza %>%
select(name, H, AB, .fitted, .low, .high, .pep, .qvalue)
## # A tibble: 9,653 × 8
## name H AB .fitted .low .high .pep .qvalue
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Ty Cobb 4189 11434 0.363 0.354 0.371 6.97e-17 6.97e-17
## 2 Rogers Hornsby 2930 8173 0.354 0.344 0.364 4.15e-11 2.08e-11
## 3 Tris Speaker 3514 10195 0.342 0.333 0.351 1.49e-07 4.97e-08
## 4 Shoeless Joe Jackson 1772 4981 0.348 0.335 0.360 2.45e-07 9.86e-08
## 5 Ed Delahanty 2596 7505 0.341 0.331 0.352 9.40e-07 2.67e-07
## 6 Ted Williams 2654 7706 0.340 0.330 0.350 1.84e-06 5.29e-07
## 7 Willie Keeler 2932 8591 0.338 0.328 0.347 5.06e-06 1.18e-06
## 8 Harry Heilmann 2660 7787 0.338 0.327 0.348 8.64e-06 2.11e-06
## 9 Billy Hamilton 2158 6268 0.339 0.328 0.350 1.09e-05 3.09e-06
## 10 Nap Lajoie 3242 9589 0.335 0.326 0.344 1.59e-05 4.37e-06
## # ... with 9,643 more rows
Just like the one-sample test, we’ve added .pep and .qvalue columns. From this we can see a few players who we’re extremely confident are better than Piazza.
Mixture models
This brings us to my recent post on mixture models, where we noticed that when pitchers are included, the data looks a lot less like a beta distribution and more like a combination of two betas.
career_w_pitchers <- Batting %>%
filter(AB >= 25, lgID == "NL", yearID >= 1980) %>%
group_by(playerID) %>%
summarize(H = sum(H), AB = sum(AB), year = mean(yearID)) %>%
mutate(isPitcher = playerID %in% pitchers$playerID) %>%
inner_join(player_names, by = "playerID")
ggplot(career_w_pitchers, aes(H / AB)) +
geom_histogram()
Fitting a mixture model, to separate out the two beta distributions so they could be shrunken separately, took a solid amount of code.
The ebbr package thus provides tools for fitting a mixture model using an iterative expectation-maximization algorithm, with the ebb_fit_mixture function. Like the other estimation functions, it takes a table as the first argument, followed by two arguments for the “successes” column and the “total” column:
set.seed(2017)
mm <- ebb_fit_mixture(career_w_pitchers, H, AB, clusters = 2)
It returns the parameters of two (or more) beta distributions:
tidy(mm)
## # A tibble: 2 × 6
## cluster alpha beta mean number probability
## <chr> <dbl> <dbl> <dbl> <int> <dbl>
## 1 1 154.6 443 0.259 2071 0.5
## 2 2 27.5 165 0.143 912 0.5
It also assigns each observation to the most likely cluster. Here, we can see that cluster 1 is made up of pitchers, while cluster 2 is the non-pitchers:
ggplot(mm$assignments, aes(H / AB, fill = .cluster)) +
geom_histogram(alpha = 0.8, position = "identity")
As of February 6, 2017, the original article's plot looks like the one shown on the left. Its shape is roughly the same as the one I produced, but the clusters are swapped. That would mean pitchers are estimated to be the better hitters, which is clearly wrong.
The original mixture-models article shows the same phenomenon, and in some runs everything is assigned to cluster B.
The point of the analysis is very interesting, so I will wait for the original article to be corrected.
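ebb_fit_mixture alternates an expectation step (giving each player a responsibility for each cluster) with a maximization step (refitting each cluster's distribution). A much-simplified Python sketch with two plain binomial components and fixed 50/50 mixing weights (ebbr fits beta-binomials, so this is illustrative only; the toy data is made up):

```python
from math import exp, lgamma, log

def log_binom_pmf(h, n, p):
    """Log of the binomial pmf, computed stably via lgamma."""
    return (lgamma(n + 1) - lgamma(h + 1) - lgamma(n - h + 1)
            + h * log(p) + (n - h) * log(1 - p))

def em_two_binomials(data, init=(0.10, 0.30), iters=20):
    """EM for a 50/50 mixture of two binomials over (hits, at_bats) pairs.
    E-step: responsibilities; M-step: pooled success rate per cluster."""
    p1, p2 = init
    for _ in range(iters):
        # E-step: posterior probability each record came from cluster 1
        r = []
        for h, n in data:
            w1 = exp(log_binom_pmf(h, n, p1))
            w2 = exp(log_binom_pmf(h, n, p2))
            r.append(w1 / (w1 + w2))
        # M-step: responsibility-weighted pooled batting average per cluster
        p1 = (sum(ri * h for ri, (h, n) in zip(r, data))
              / sum(ri * n for ri, (h, n) in zip(r, data)))
        p2 = (sum((1 - ri) * h for ri, (h, n) in zip(r, data))
              / sum((1 - ri) * n for ri, (h, n) in zip(r, data)))
    return sorted((p1, p2))

# toy data: three "pitchers" batting around .14, three hitters around .26
data = [(68, 500), (70, 500), (72, 500), (128, 500), (130, 500), (132, 500)]
print(em_two_binomials(data))  # roughly [0.14, 0.26]
```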
Mixture model across iterations
Note: because of the "glitch?" described in the previous section, I could not reproduce the article's results here, so I have not translated this section's text.
You may be interested in how the mixture model converged to its parameters. The iterations component of the ebb_mixture object contains details on the iterations, which can be visualized.
fits <- mm$iterations$fits
fits
## # A tibble: 10 × 7
## iteration cluster alpha beta mean number probability
## <int> <chr> <dbl> <dbl> <dbl> <int> <dbl>
## 1 1 2 14.0 48.6 0.224 1511 0.5
## 2 1 1 18.5 62.0 0.230 1472 0.5
## 3 2 1 147.9 426.7 0.257 2137 0.5
## 4 2 2 18.3 113.4 0.139 846 0.5
## 5 3 1 153.6 440.0 0.259 2092 0.5
## 6 3 2 27.4 165.6 0.142 891 0.5
## 7 4 1 153.9 440.6 0.259 2075 0.5
## 8 4 2 27.8 167.2 0.142 908 0.5
## 9 5 1 154.6 442.5 0.259 2071 0.5
## 10 5 2 27.5 165.1 0.143 912 0.5
fits %>%
gather(parameter, value, alpha, beta, mean) %>%
ggplot(aes(iteration, value, color = parameter, lty = cluster)) +
geom_line() +
facet_wrap(~ parameter, scales = "free_y")
Here too the result differs from the article's plot (shown on the left); swapping clusters 1 and 2 makes them look roughly the same.
Note that it took only about one full iteration for the parameters to get pretty close to their eventual values. We can also examine the change in cluster assignments for each observation:
assignments <- mm$iterations$assignments
assignments
## # A tibble: 17,898 × 10
## iteration playerID H AB year isPitcher name bats .cluster .likelihood
## <int> <chr> <int> <int> <dbl> <lgl> <chr> <fctr> <chr> <dbl>
## 1 1 abbotje01 11 42 2001 FALSE Jeff Abbott R 2 NA
## 2 1 abbotku01 473 1851 1997 FALSE Kurt Abbott R 2 NA
## 3 1 abbotky01 2 29 1992 TRUE Kyle Abbott L 1 NA
## 4 1 abercre01 86 386 2007 FALSE Reggie Abercrombie R 1 NA
## 5 1 abnersh01 110 531 1989 FALSE Shawn Abner R 2 NA
## 6 1 abreubo01 1602 5373 2003 FALSE Bobby Abreu L 2 NA
## 7 1 abreuto01 127 497 2010 FALSE Tony Abreu B 1 NA
## 8 1 acevejo01 6 77 2002 TRUE Jose Acevedo R 1 NA
## 9 1 ackerji01 3 28 1986 TRUE Jim Acker R 1 NA
## 10 1 adamecr01 13 53 2015 FALSE Cristhian Adames B 1 NA
## # ... with 17,888 more rows
assignments %>%
ggplot(aes(H / AB, fill = .cluster)) +
geom_histogram(alpha = 0.8, position = "identity") +
facet_wrap(~ iteration)
The article's plot is shown on the left.
The package’s functions for mixture modeling are a bit more primitive than the rest: I’m still working out the right output format, and some of the details are likely to change (this is one of the reasons I haven’t yet submitted ebbr to CRAN). Still, the package makes it easier to experiment with expectation-maximization algorithms.
What’s Next: Simulation
We’re approaching the end of this series on empirical Bayesian methods, and have touched on many statistical approaches for analyzing binomial data, all with the goal of estimating the “true” batting average of each player. There’s one question we haven’t answered, though: do these methods actually work?
Even if we assume each player has a “true” batting average as our model suggests, we don’t know it, so we can’t see if our methods estimated it accurately. We believe that empirical Bayes shrinkage gets closer to the true probabilities than raw batting averages do, but we can’t measure its bias or mean-squared error. We can’t see whether credible intervals actually contain the true parameters, or whether our FDR control was successful in finding a set of players above a cutoff. This means we can’t test our methods, or examine when they work well and when they don’t.
In the next post, we’ll simulate some fake batting average data, which will let us know the true probabilities for each player, then examine how close our statistical methods got. Simulation is a universally useful way to test a statistical method, to build intuition about its mathematical properties, and to gain confidence that we can trust its results. This will be especially easy now that we have the ebbr package to encapsulate the methods themselves, and will be a good chance to demonstrate the tidyverse approach to simulation.
Continued in: Simulation (of empirical Bayesian methods).