りんだろぐ rindalog: Model Building, Evaluation & Judgment: Probability Evaluation

Continued from "Scoring Evaluation".

These notes are based on section "5.2.3 Evaluating probability models" of the book.

Probability models are useful for both classification and scoring tasks. Probability models are models that both decide if an item is in a given class and return an estimated probability (or confidence) of the item being in the class.

In short, a probability model serves both classification and scoring: it decides whether an item belongs to a given class and also returns the estimated probability (or confidence) of that membership.

Well-known examples of such models are logistic regression and decision trees.
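As a rough illustration of "decide the class and return a probability", here is a minimal sketch using logistic regression. The built-in `mtcars` data and the 0.5 threshold are my own stand-ins, not something from the book.

```r
# Minimal sketch: a probability model returns a probability, and the class
# decision follows from applying a threshold to it.
# Assumption: the built-in mtcars data stands in for a real dataset.
model <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))

# Estimated probability that each car has a manual transmission (am == 1).
prob <- predict(model, newdata = mtcars, type = "response")

# Class decision at an illustrative threshold of 0.5.
decision <- ifelse(prob > 0.5, "manual", "automatic")

head(data.frame(actual = mtcars$am, prob = round(prob, 3), decision))
```

The same `prob` column can be used as a score, which is why the evaluation methods below look at the probabilities rather than only at the decisions.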

Next is Figure 5.8, "Distribution of score broken up by known classes", which is a double density plot.

> library(ggplot2)
> library(grid)  # arrow() and unit()
> c1 <- "Distribution of score for items\nknown to not be spam"
> c2 <- "Distribution of score for items\nknown to be spam"
> myarrow <- arrow(length = unit(2,"mm"))
> ggplot(data=spamTest) +
    geom_density(aes(x=pred,color=spam,linetype=spam))+
    # Annotations: an arrow and a label for each density curve
    geom_segment(aes(x=0.2,y=7,xend=0.05,yend=7),size=0.5,color="#56B4E9",
      arrow=myarrow) +
    geom_text(data=data.frame(),aes(x=0.45,y=7),label=c1,
      color="#56B4E9",fontface="bold",size=4) +
    geom_segment(aes(x=0.77,y=3.5,xend=0.9,yend=3.5),size=0.5,color="#56B4E9",
      arrow=myarrow) +
    geom_text(data=data.frame(),aes(x=0.52,y=3.5),label=c2,
      color="#56B4E9",fontface="bold",size=4)

The book explains this distribution as follows.

Figure 5.8 is particularly useful at picking and explaining classifier thresholds. It also illustrates what we’re going to try to check when evaluating estimated probability models: examples in the class should mostly have high scores and examples not in the class should mostly have low scores.
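To make the point about thresholds concrete, here is a small sketch with made-up scores (not the book's `spamTest` data). It checks that in-class examples score higher on average, then shows how the confusion matrix shifts when the threshold moves.

```r
# Toy scores, made up for illustration (not the book's spam data).
toy <- data.frame(
  spam = factor(c("non-spam", "non-spam", "non-spam", "non-spam",
                  "spam", "spam", "spam", "spam"),
                levels = c("non-spam", "spam")),
  pred = c(0.05, 0.10, 0.30, 0.60, 0.40, 0.80, 0.90, 0.95)
)

# Examples in the class should mostly score high, the others mostly low.
tapply(toy$pred, toy$spam, mean)

# Confusion matrix for a given threshold.
confusion <- function(threshold) {
  predicted <- factor(ifelse(toy$pred > threshold, "spam", "non-spam"),
                      levels = c("non-spam", "spam"))
  table(actual = toy$spam, predicted = predicted)
}

confusion(0.5)  # stricter threshold
confusion(0.2)  # looser threshold: catches more spam, flags more non-spam
```

Where the two densities overlap is exactly where the choice of threshold trades one kind of error for the other.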

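I will leave that passage untranslated. As if to back up the claim below, though, this is not a distribution that a general audience would find clear.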

In our opinion, most of the measures for probability models are very technical and very good at comparing the qualities of different models on the same dataset. But these criteria aren’t easy to precisely translate into businesses needs. So we recommend tracking them, but not using them with your project sponsor or client.

These measures are not easy to translate precisely into business needs, so using them in discussions with the project sponsor or client is not recommended.

The rest of this post covers the ROC curve, log likelihood, and deviance.

ROC Curve

　See: ROC Curve

The receiver operating characteristic curve (ROC curve) is a popular alternative to the double density plot above.
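For intuition, the points of an ROC curve can be computed by hand: each threshold gives one pair of false positive rate and true positive rate. The sketch below uses made-up labels and scores; `roc_point` is a hypothetical helper of mine, not a ROCR function.

```r
# Toy labels and scores, made up for illustration.
actual <- c(0, 0, 0, 0, 1, 1, 1, 1)  # 1 = in the class
score  <- c(0.05, 0.10, 0.30, 0.60, 0.40, 0.80, 0.90, 0.95)

# One ROC point per threshold: predict positive when score >= threshold.
roc_point <- function(threshold) {
  predicted <- score >= threshold
  c(threshold = threshold,
    fpr = sum(predicted & actual == 0) / sum(actual == 0),
    tpr = sum(predicted & actual == 1) / sum(actual == 1))
}

# Sweep thresholds from "nothing is positive" down to "everything is positive".
thresholds <- c(Inf, sort(unique(score), decreasing = TRUE))
roc <- as.data.frame(t(sapply(thresholds, roc_point)))
roc
```

Plotting `tpr` against `fpr` from this table gives the curve, which starts at (0, 0) and ends at (1, 1).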

> library(ROCR)

> eval <- prediction(spamTest$pred, spamTest$spam)

> plot(performance(eval, "tpr", "fpr"))

> attributes(performance(eval,'auc'))$y.values[[1]]

[1] 0.9660072

In the last line, we compute the AUC or area under the curve, which is 1.0 for perfect classifiers and 0.5 for classifiers that do no better than random guesses.

An AUC of 1.0 means a perfect classifier, and 0.5 means a classifier that does no better than random guessing.
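One way to see where 1.0 and 0.5 come from: the AUC equals the probability that a randomly chosen in-class example gets a higher score than a randomly chosen out-of-class example, with ties counted as one half. The `auc` function below is my own sketch of that pairwise definition on toy data, not how ROCR computes it internally.

```r
# AUC via pairwise comparison of positive and negative scores.
# Ties count as one half.
auc <- function(actual, score) {
  pos <- score[actual == 1]
  neg <- score[actual == 0]
  cmp <- outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n))
  mean(cmp)
}

actual <- c(0, 0, 0, 0, 1, 1, 1, 1)

# Perfectly separated scores.
auc_perfect <- auc(actual, c(0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9))
# A constant score carries no information about the class.
auc_constant <- auc(actual, rep(0.5, 8))
# Mostly separated scores with one overlapping pair.
auc_overlap <- auc(actual, c(0.05, 0.10, 0.30, 0.60, 0.40, 0.80, 0.90, 0.95))

c(perfect = auc_perfect, constant = auc_constant, overlap = auc_overlap)
```

The perfectly separated scores give 1.0 and the uninformative constant score gives 0.5, matching the two reference values above.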

We’re not big fans of the AUC; many of its claimed interpretations are either incorrect or not relevant to actual business questions. But working around the ROC curve with your project client is a good way to explore possible project goal trade-offs.

The AUC itself is not that helpful, but the ROC curve is handy for exploring project trade-offs together with the client.

Log Likelihood

For example, if the model gives an item a spam probability of 0.9, the log likelihood is log(0.9) when the item really is spam and log(1 - 0.9) when it is not.

An important evaluation of an estimated probability is the log likelihood. The log likelihood is the logarithm of the product of the probability the model assigned to each example.

The log likelihood is an important evaluation of estimated probabilities.

The principle is this: if the model is a good explanation, then the data should look likely (not implausible) under the model.

The principle: if the model explains things well, the data should fit comfortably under the model (the data should look "not implausible").

As shown below, the logarithm becomes more and more negative as the probability drops from 1.0.

> log(seq(1.0,0.1,by=-0.1))
[1] 0.0000000 -0.1053605 -0.2231436 -0.3566749 -0.5108256 -0.6931472 -0.9162907
[8] -1.2039728 -1.6094379 -2.3025851

So the log likelihood is never positive (it is 0 only when every probability is exactly 1), and the closer it is to 0, the better the model.

The traditional way of calculating the log likelihood is to compute the sum of the logarithms of the probabilities the model assigns to each example.

The log likelihood is computed as the sum of the logarithms of the probabilities the model assigned to what actually happened.

In other words, the higher the probability assigned to the true class (0.9 rather than 0.5, say), the better the log likelihood score.
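A tiny worked example may help here. The labels and the three sets of predictions are made up, and `log_likelihood` is just a wrapper around the same `sum(ifelse(...))` expression used for the spam data.

```r
# Log likelihood: sum of log(probability assigned to the true class).
log_likelihood <- function(is_spam, pred) {
  sum(ifelse(is_spam, log(pred), log(1 - pred)))
}

# Toy data: two spam items followed by two non-spam items.
is_spam <- c(TRUE, TRUE, FALSE, FALSE)

ll_confident <- log_likelihood(is_spam, c(0.9, 0.9, 0.1, 0.1))  # confident and right
ll_hedging   <- log_likelihood(is_spam, c(0.5, 0.5, 0.5, 0.5))  # always says 0.5
ll_wrong     <- log_likelihood(is_spam, c(0.1, 0.1, 0.9, 0.9))  # confident and wrong

c(confident = ll_confident, hedging = ll_hedging, wrong = ll_wrong)
```

The confident and correct predictions score closest to 0, the hedging ones are in the middle, and the confidently wrong ones are penalized most heavily.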

Now compute the log likelihood of the spam model.

# Data: spam = actual label, pred = predicted probability of spam
> head(spamTest$spam)
[1] spam spam spam spam spam spam
Levels: non-spam spam
> head(spamTest$pred)
[1] 0.7719024 0.9998770 0.8198918 0.9106905 0.9790891 0.8916659
> sum(ifelse(spamTest$spam=='spam',log(spamTest$pred),log(1-spamTest$pred)))
[1] -134.9478
# The number of rows.

> dim(spamTest)[[1]]
[1] 458
> nrow(spamTest)
[1] 458
# Rescaled by the number of data points.

> sum(ifelse(spamTest$spam=='spam',log(spamTest$pred),log(1-spamTest$pred))) /
dim(spamTest)[[1]]
[1] -0.2946458

The null model predicts 180 / 458 for every item. As shown below, this is simply the overall spam rate.

> summary(spamTest$spam)

non-spam spam
278 180

This probability is "the best single number estimate of the chance of spam". In short:

An analysis that models nothing, a guess-only model

Is the spam detection model better than such a null model? Compare their log likelihoods.

> pNull <- sum(ifelse(spamTest$spam=='spam',1,0))/dim(spamTest)[[1]]
# i.e., the spam rate
> pNull
[1] 0.3930131
> 180/458
[1] 0.3930131
> sum(ifelse(spamTest$spam=='spam',1,0))*log(pNull) +
sum(ifelse(spamTest$spam=='spam',0,1))*log(1-pNull)
[1] -306.8952
# i.e., {number of spam}*log(pNull) + {number of non-spam}*log(1 - pNull)
> sum(ifelse(spamTest$spam=='spam',1,0)); sum(ifelse(spamTest$spam=='spam',0,1))
[1] 180
[1] 278

The spam detection model's log likelihood of -134.9478 is better than the null model's -306.8952.
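Since the null model assigns the same probability to every item, its log likelihood needs only the class counts. The sketch below recomputes it from the counts 180 and 278 shown in the summary above; `null_log_likelihood` is a hypothetical helper name, and the model's value is copied from the earlier output rather than recomputed.

```r
# Log likelihood of the null model, which predicts the base rate for every item.
# Note: if either count is 0 this returns NaN (0 * log(0)); not handled here.
null_log_likelihood <- function(n_pos, n_neg) {
  p_null <- n_pos / (n_pos + n_neg)
  n_pos * log(p_null) + n_neg * log(1 - p_null)
}

null_ll <- null_log_likelihood(180, 278)  # 180 spam, 278 non-spam
null_ll

model_ll <- -134.9478  # value reported above for the spam model
model_ll > null_ll     # TRUE means the model beats the null model
```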

Deviance

Another yardstick is deviance, defined as:

　　-2 * (loglikelihood - S)

where S is a technical constant called “the log likelihood of the saturated model.”

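The saturated model is a perfect model: it returns probability 1 for items in the class and probability 0 for items not in the class. Every item then contributes log(1) = 0 to the log likelihood, so S = 0.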

What matters in practice, though, is the difference between the null deviance and the deviance of the model being evaluated:

　-2*(-306.8952 - S) - (-2*(-134.9478 - S))

As you can see, S cancels out.

> -2*(-306.8952) - (-2*(-134.9478))

[1] 343.8948
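To double-check that S really drops out, the sketch below computes the difference for a few arbitrary values of S. `deviance_of` is a hypothetical helper (named to avoid masking `stats::deviance`), and the two log likelihoods are the values reported above.

```r
# Deviance as defined above: -2 * (logLikelihood - S).
deviance_of <- function(log_likelihood, S = 0) {
  -2 * (log_likelihood - S)
}

null_ll  <- -306.8952  # null model, from above
model_ll <- -134.9478  # spam model, from above

# Difference between null deviance and model deviance for several S values.
# The non-zero S values are arbitrary and only there to show the cancellation.
diffs <- sapply(c(0, -10, 5), function(S) {
  deviance_of(null_ll, S) - deviance_of(model_ll, S)
})
diffs
```

Every entry of `diffs` comes out the same, which is why the exact value of S does not matter when comparing models on the same data.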

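With the deviances at S = 0 we can compute a pseudo R-squared (see the R-squared part of "Model Building, Evaluation & Judgment: Scoring Evaluation"). Below is the pseudo R-squared for this model, which indicates how much of the deviance the model explains.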

> S= 0; 1 - (-2*(-134.9478 - S)) / (-2*(-306.8952 - S))

[1] 0.5602805

A value slightly above 50% is not bad, but it cannot be called good either.

After this, the book continues with AIC and entropy, which I skip here (see "GLM Basics: AIC Model Selection" and "Entropy and Information Gain").

Continues in "Ranking Evaluation".

Monday, October 10, 2016