りんだろぐ rindalog: Empirical over Theoretical (What Actually Helps Me)

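I have been posting these entries under the heading of baseball statistics, and the series of source articles ends, at least for now, with Simulation of empirical Bayesian methods (using baseball statistics). That article is quite long, but it also works as a review of the methods introduced throughout the series, and it is very good. I read it to the end and translated a summary of it into Japanese. There is something I want to look into a little further, though, so I will publish that translation at a later date.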

This post will overlap with what I publish then, but I agree so strongly with the concluding section, "Conclusion: “I have only proved it correct, not simulated it”", that I am taking it up here.

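It would be more natural to give the quotation first, but with my translation attached the quotation is not short and the post would become top-heavy, so I give my own impressions first.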

Empirical over the Theories in Papers

What I want to emphasize here is the basic stance of the author, who applied empirical Bayesian methods to real baseball data.

From Oxford Learner's Dictionaries:

　empirical adjective
　(formal) based on experiments or experience rather than ideas or theories

empirical evidence/knowledge/research
an empirical study

　opposite theoretical
　See related entries: Experiments and research

I think "rather than ideas or theories" is the key point: it is the exact opposite of armchair theorizing.

The following is my loose translation of part of the author's article.

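Papers like this are important to their field, but they are almost meaningless to me. I am not good at juggling equations, and I have only gotten rustier since leaving grad school. When I apply a statistical method to a dataset, papers like the one above do not help me.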

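Before I started studying statistics and data analysis in English, as I do now, I hunted through the Japanese literature, partly out of ignorance. There were almost no translations of good foreign books. What caught my eye were thin titles along the lines of "The Basics of Statistics", "Data Analysis Anyone Can Do", "Easy Data Analysis with Excel", and "Such-and-Such Is the Most Powerful Discipline", none of which could be called practical. At the opposite extreme were books written by some professor or other that seemed to contain more equations than Japanese.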

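The "papers like this" in the quotation above correspond to those books written by some professor or other. I agree with the author: such books "do not help me". I understand that proofs written in equations are "important to their field", but they have no direct bearing on me.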

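I doubt whether those professors are really empirical. If they were, there would surely be books and papers of an empirical kind to show for it. That said, I am not claiming that anything non-empirical is worthless. I am only saying that empirical technical information is the only kind that helps me.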

The author of the original article has published a book, Introduction to Empirical Bayes: Examples from Baseball Statistics. I expect more useful articles and books from this author.

Conclusion: “I have only proved it correct, not simulated it”

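Note: there are points in the content that I have not fully understood. It is probably best to skip my translation of the logistic regression example, or even that passage itself.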

Computer scientist Donald Knuth has a famous quote: “Beware of bugs in the above code; I have only proved it correct, not tried it.” I feel the same way about statistical methods.

“Beware of bugs in the above code; I have only proved it correct, not tried it.” That is how I feel about statistical methods, too.

When I look at mathematical papers about statistical methods, the text tends to look something like:

When I look at papers that describe statistical methods mathematically, the content generally goes something like this:

Smith et al (2008) proved several asymptotic properties of empirical Bayes estimators of the exponential family under regularity assumptions i, ii, and iii. We extend this to prove the estimator of Jones et al (2001) is inadmissable, in the case that θ̂ (x) is an unbiased estimator and g(y) is convex…
This kind of paper is an important part of the field, but it does almost nothing for me. I’m not particularly good at manipulating equations, and I get rustier every year out of grad school. If I’m considering applying a statistical method to a dataset, papers and proofs like this won’t help me judge whether it will work. (“Oh- I should have known that my data didn’t follow regularity assumption ii!”) What does help me is the approach we’ve used here, where I can see for myself just how accurate the method tends to be.

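Papers like this are important to their field, but they are almost meaningless to me. I am not good at juggling equations, and I have only gotten rustier since leaving grad school. When I apply a statistical method to a dataset, papers like the one above do not help me judge whether it will work ("Oh, I should have realized that this data did not satisfy regularity assumption ii!"). What does help is the approach used here: seeing for myself how accurate the method tends to be.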
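A minimal sketch of what "seeing for myself" can look like, written in Python. Everything in it is an assumption made for illustration: the Beta(81, 219) prior, the range of 10 to 500 at-bats, and all function names are hypothetical. The prior is fitted by a simple method of moments on the raw averages, which is a stand-in and not the fitting procedure of the original article. Because the true averages are simulated, the error of each estimate can be measured directly.

```python
import random


def simulate_players(n_players, alpha, beta, rng):
    """Simulate (true_average, hits, at_bats) for each hypothetical player."""
    players = []
    for _ in range(n_players):
        true_avg = rng.betavariate(alpha, beta)   # known "truth"
        at_bats = rng.randint(10, 500)            # assumed range of at-bats
        hits = sum(rng.random() < true_avg for _ in range(at_bats))
        players.append((true_avg, hits, at_bats))
    return players


def fit_beta_prior(players):
    """Method-of-moments beta prior from raw averages (a crude stand-in)."""
    raw = [hits / at_bats for _, hits, at_bats in players]
    mean = sum(raw) / len(raw)
    var = sum((r - mean) ** 2 for r in raw) / (len(raw) - 1)
    total = mean * (1 - mean) / var - 1           # estimate of alpha + beta
    return mean * total, (1 - mean) * total


def mean_squared_error(estimates, truths):
    return sum((e - t) ** 2 for e, t in zip(estimates, truths)) / len(truths)


def compare(n_players=2000, alpha=81.0, beta=219.0, seed=1):
    """Return (MSE of raw averages, MSE of shrunken estimates)."""
    rng = random.Random(seed)
    players = simulate_players(n_players, alpha, beta, rng)
    a0, b0 = fit_beta_prior(players)
    truths = [t for t, _, _ in players]
    raw = [h / ab for _, h, ab in players]
    shrunk = [(h + a0) / (ab + a0 + b0) for _, h, ab in players]
    return mean_squared_error(raw, truths), mean_squared_error(shrunk, truths)


raw_mse, shrunk_mse = compare()
print(f"raw MSE:    {raw_mse:.6f}")
print(f"shrunk MSE: {shrunk_mse:.6f}")
```

If shrinking toward the fitted prior did not lower the error on data like this, that would be a reason to distrust the method before touching real data.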

For example, last month I was working on a problem of logistic regression that I suspected had mislabeled outcomes (some zeroes turned to ones, and vice versa), and read up on some robust logistic regression methods, implemented in the robust package. But I wasn’t sure they would be effective on my data, so I did some random simulation of mislabeled outcomes and applied the method. The method didn’t work as well as I needed it to, which saved me from applying it to my data and thinking I’d solved the problem.

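Last month I was working on a logistic regression problem in which I suspected that some outcomes were mislabeled (zeroes turned into ones, and vice versa). I read up on several robust logistic regression methods (implemented in the robust package). I was not sure they would be effective on my data, so I ran a random simulation of mislabeled outcomes and applied the method to it. The method did not work as well as I needed. That spared me from applying it to my own data and believing I had solved the problem.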
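The same habit can be sketched for the mislabeling problem. The Python below is an illustration only, not the robust package and not the author's code. It simulates outcomes from a known logistic model, flips a hypothetical 10% of the labels, and fits ordinary (non-robust) logistic regression by Newton's method to see how far the estimated slope drifts from the true one. All names and parameter values are assumptions.

```python
import math
import random


def simulate(n, b0, b1, flip_rate, rng):
    """Draw (x, y) from a logistic model, then flip each label with flip_rate."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        y = 1 if rng.random() < p else 0
        if rng.random() < flip_rate:      # simulated mislabeling
            y = 1 - y
        xs.append(x)
        ys.append(y)
    return xs, ys


def fit_logistic(xs, ys, iterations=25):
    """Ordinary logistic regression (intercept and slope) via Newton's method."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p                   # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w                      # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


TRUE_B0, TRUE_B1 = -0.5, 2.0              # hypothetical "true" coefficients

# The same seed gives identical data before flipping, so only the labels differ.
clean_fit = fit_logistic(*simulate(5000, TRUE_B0, TRUE_B1, 0.0, random.Random(7)))
noisy_fit = fit_logistic(*simulate(5000, TRUE_B0, TRUE_B1, 0.1, random.Random(7)))

print(f"true slope:            {TRUE_B1:.3f}")
print(f"slope, clean labels:   {clean_fit[1]:.3f}")
print(f"slope, 10% mislabeled: {noisy_fit[1]:.3f}")
```

A candidate robust method could be dropped in place of `fit_logistic` and judged the same way, by how close it gets to the known slope on the mislabeled data.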

For this reason, no matter how much math and proofs there are that show a method is reasonable, I really only feel comfortable with a method once I’ve worked with it on simulated data. It’s also a great way to teach myself about the statistical method.

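So, however much math and however many proofs stand behind a method, I do not trust it until I have analyzed simulated data with it. It is also a very good way for me to learn a statistical method.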

Friday, February 10, 2017