りんだろぐ rindalog: Confusion Matrix, Revisited

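Consider a modeling analysis whose job is to find bad loan applications, that is, borrowers who look likely to default. As an example of a goal for such an analysis: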

Identify 90% of accounts that will go into default at least two months before the first missed payment with a false positive rate of no more than 25%.

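In other words: flag 90% of the accounts that will go into default, at least two months before their first missed payment, while keeping the false positive rate at 25% or lower.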

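This is from "Practical Data Science with R" by Nina Zumel and John Mount. I originally started reading the book at Chapter 8. Wanting to sort out the overall flow of a data analysis once more, I then went back and started from the beginning, and it is proving to be quite interesting.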


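The tool used to check a model against a goal like this is the confusion matrix. Here I pick up the confusion matrix calculations from "Chapter 1. The data science process" of the book. For the data used, see the section on the analysis data at the end of this post.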

> resultframe <- data.frame(Good.Loan=creditdata$Good.Loan,
+                           pred=predict(model, type="class"))
> rtab <- table(resultframe)
> rtab
          pred
Good.Loan  BadLoan GoodLoan
  BadLoan       41      259
  GoodLoan      13      687
> diag(rtab)
 BadLoan GoodLoan
      41      687
> sum(diag(rtab))/sum(rtab)
[1] 0.728

Overall model accuracy: 73% of the predictions were correct.

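Accuracy: the 41 + 687 = 728 loans on the diagonal of the table were classified correctly, out of 1,000 loans in total.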
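For anyone who wants to follow along without the credit data, here is a minimal sketch that rebuilds the table by hand from the counts printed above and recomputes the accuracy. The name `rtab` follows the transcript; constructing it with `matrix()` and `as.table()` is my own shortcut, not something the book does.

```r
# Rebuild the 2x2 table from the printed counts
# (rows = actual Good.Loan, columns = predicted class).
rtab <- as.table(matrix(c(41, 259,
                          13, 687),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(Good.Loan = c("BadLoan", "GoodLoan"),
                                        pred      = c("BadLoan", "GoodLoan"))))

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy <- sum(diag(rtab)) / sum(rtab)

print(rtab)
print(accuracy)
```

The later calculations only need this table, so they can all be tried on the hand-built copy.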

> rtab[,1]
 BadLoan GoodLoan
      41       13
> sum(rtab[1,1])/sum(rtab[,1])
[1] 0.7592593

Model precision: 76% of the applications predicted as bad really did default.

Precision: of the 54 loans the model predicted to be bad, 41 (76%) actually defaulted.

Since the goal is to find bad loans, "bad loan" is the positive class here.
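With "bad loan" as the positive class, each cell of the table gets its usual name. The sketch below is one way to spell that out. The variable names `tp`, `fn`, `fp`, `tn` are my own labels, and the table is rebuilt by hand from the printed counts rather than from the data.

```r
# Hand-built copy of the table (rows = actual, columns = predicted).
rtab <- as.table(matrix(c(41, 259,
                          13, 687),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(Good.Loan = c("BadLoan", "GoodLoan"),
                                        pred      = c("BadLoan", "GoodLoan"))))

positive <- "BadLoan"   # the class we are trying to find
negative <- "GoodLoan"

tp <- rtab[positive, positive]  # bad loans predicted bad
fn <- rtab[positive, negative]  # bad loans predicted good (missed)
fp <- rtab[negative, positive]  # good loans predicted bad (false alarms)
tn <- rtab[negative, negative]  # good loans predicted good

# Precision: how many of the predicted positives are real positives.
precision <- tp / (tp + fp)
print(c(tp = tp, fn = fn, fp = fp, tn = tn))
print(precision)
```

Indexing by name instead of by position makes it harder to mix up rows and columns when the positive class happens to come first.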

Reference: "データ分析は予測モデルの構築" (Data analysis is the building of predictive models)

> rtab[1,]
 BadLoan GoodLoan
      41      259
> sum(rtab[1,1])/sum(rtab[1,])
[1] 0.1366667

Model recall: the model found 14% of the defaulting loans.

Recall: of the 300 loans that actually defaulted, the model found only 41 (14%).

> rtab[2,1];rtab[2,]
[1] 13
 BadLoan GoodLoan
      13      687
> sum(rtab[2,1])/sum(rtab[2,])
[1] 0.01857143

False positive rate: 2% of the good applications were mistakenly identified as bad.

False positive rate: 13 of the 700 good loans (2%) were wrongly predicted to default.

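A low false positive rate is, of course, preferable. In this case the misclassified 2% would be subjected to unnecessary countermeasures even though they were never going to default.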
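To put the four numbers side by side and compare them with the goal quoted at the top, here is a rough sketch. The helper `confusion_metrics()` and the variable `meets_goal` are hypothetical names of my own, the table is rebuilt by hand from the printed counts, and the check only covers the 90% recall and 25% false positive rate parts of the goal. The "two months before the first missed payment" condition cannot be read off a confusion matrix.

```r
# Hand-built copy of the table (rows = actual, columns = predicted).
rtab <- as.table(matrix(c(41, 259,
                          13, 687),
                        nrow = 2, byrow = TRUE,
                        dimnames = list(Good.Loan = c("BadLoan", "GoodLoan"),
                                        pred      = c("BadLoan", "GoodLoan"))))

# Hypothetical helper: metrics for a 2x2 table, given the positive class.
confusion_metrics <- function(tab, positive) {
  negative <- setdiff(rownames(tab), positive)
  tp <- tab[positive, positive]
  fn <- tab[positive, negative]
  fp <- tab[negative, positive]
  tn <- tab[negative, negative]
  c(accuracy  = (tp + tn) / sum(tab),
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn),
    fpr       = fp / (fp + tn))
}

m <- confusion_metrics(rtab, positive = "BadLoan")
print(round(m, 3))

# Thresholds taken from the goal quoted at the top of the post.
meets_goal <- m[["recall"]] >= 0.90 && m[["fpr"]] <= 0.25
print(meets_goal)
```

On these counts the false positive rate is comfortably inside the 25% limit, but a recall of about 14% is nowhere near the 90% target, so this model does not meet the goal.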

Analysis data

The analysis data covers past borrowers: their repayment status, whether they defaulted, and so on.

> load("/*/zmPDSwR-master/Statlog/GCDData.RData")
> library(rpart)
> model <- rpart(Good.Loan ~ Duration.in.month +
+ Installment.rate.in.percentage.of.disposable.income +
+ Credit.amount +
+ Other.installment.plans,
+ data=d, control=rpart.control(maxdepth=4),method="class")

I plan to take up the decision tree behind this model in a future post.
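If the credit data is not at hand, the same fit, predict, tabulate workflow can still be tried on something else. The sketch below assumes the `kyphosis` example dataset that ships with the rpart package. It is not the credit data, and the names `fit`, `res` and `tab` are mine. It only shows the shape of the steps.

```r
library(rpart)

# kyphosis is a small example dataset bundled with rpart (NOT the credit data).
fit <- rpart(Kyphosis ~ Age + Number + Start,
             data = kyphosis,
             control = rpart.control(maxdepth = 4),
             method = "class")

# Predicted class on the training data, tabulated against the actual class.
res <- data.frame(actual = kyphosis$Kyphosis,
                  pred   = predict(fit, type = "class"))
tab <- table(res)

print(tab)
print(sum(diag(tab)) / sum(tab))  # training-set accuracy
```

As in the post, the predictions here are made on the same rows the tree was trained on, so the resulting accuracy is likely to be optimistic.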

Saturday, September 10, 2016
