rindalog (りんだろぐ): Model Building, Evaluation & Judgment: Model Building

Continued from "The Big Picture".

Based on "5.1 Mapping problems to machine learning tasks" in the book.

Your task is to map a business problem to a good machine learning method.

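In short, the job is to fit a business problem to an appropriate machine learning technique.

The emphasized terms, machine learning tasks and machine learning method, make one think of AI, but that can't be helped, since AI and machine learning are closely related. Building and operating a predictive model in this way is what the book calls a

machine learning task / method

in this section.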

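Taking an online retailer as the example, the problems such a model solves include:

Predicting purchases based on past purchase history
Identifying fraudulent purchases
Determining the price elasticity of individual products or product classes
Determining the best way to present product listings in search results
Segmenting customers by purchasing behavior
Deciding how much to spend on a particular AdWords placement
Evaluating marketing campaigns that have been run
Organizing new products into the product catalog

Many of these are problems that any retailer has to deal with, not just online ones.

What follows introduces the methods used to solve these problems. Only an overview is given here; the details come in later chapters. Table 5.2, "Problem-to-method Mapping", is reproduced at the end.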

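Classification Problems

The last item in the list, "organizing new products into the product catalog", is an example of a classification problem. It simply means assigning a product to a category such as "appliances", "phones", or "computers". Classifying tens or hundreds of thousands of products by hand is obviously folly, and it may be folly even for a few hundred.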

Classification itself is an example of what is called supervised learning.

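Classification is a form of supervised learning.

In other words, classification needs a set of already-classified data, called the training set, to learn from.

Main classification methods: Naive Bayes, Decision trees, Logistic regression, Support vector machines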
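As a rough illustration of what "learning from a training set" means, here is a minimal Naive Bayes sketch in plain Python. It is not the book's code (the book works in R); the product titles, the category labels, and the function names `train_naive_bayes` and `classify` are all made up for this example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_titles):
    """Count words per category from (title, category) pairs: the training set."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for title, category in labeled_titles:
        class_counts[category] += 1
        for word in title.lower().split():
            word_counts[category][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def classify(title, model):
    """Return the category with the highest log posterior for this title."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_category, best_score = None, -math.inf
    for category, n_docs in class_counts.items():
        # Log prior plus log likelihoods, with add-one (Laplace) smoothing
        # so that unseen words do not zero out the probability.
        score = math.log(n_docs / total_docs)
        n_words = sum(word_counts[category].values())
        for word in title.lower().split():
            count = word_counts[category][word]
            score += math.log((count + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Hypothetical hand-labeled training set (invented for illustration).
training_set = [
    ("wireless smartphone 64gb", "phones"),
    ("smartphone case black", "phones"),
    ("gaming laptop 16gb ram", "computers"),
    ("laptop ssd 512gb", "computers"),
    ("microwave oven 900w", "appliances"),
    ("washing machine front load", "appliances"),
]
model = train_naive_bayes(training_set)
new_product_category = classify("budget smartphone", model)
```

With only six labeled titles this is a toy, but the shape is the same at scale: label a sample by hand once, then let the model sort the rest of the catalog.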

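Scoring Problems

Scoring is also a form of supervised learning. An example is measuring the effect of a marketing campaign, not just by the change in website traffic but also by the change in sales and so on.

As the following passage shows, there are many factors to look at.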

You’re looking at a number of different factors: the communication channel (ads on websites, YouTube videos, print media, email, and so on); the traffic source (Facebook, Google, radio stations, and so on); the demographic targeted; the time of year; and so on.

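That is: the communication channel, the traffic source, the target demographic, the time of year, and so on.

Figure 5.3 is an example of estimating the probability that a purchase is fraudulent.

The estimate comes from a model trained on a training set of past fraudulent purchases.

The book covers the following two common scoring methods.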

Linear regression

This can be a very effective approximation, even when the underlying situation is in fact nonlinear. Linear regression is often a good first model to try when trying to predict a numeric value.

So it can be a very effective approximation even when the real relationship is not linear, and it is a good first model to try for numeric prediction.
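To make the idea concrete, here is a minimal sketch of one-variable linear regression by ordinary least squares in plain Python. The numbers (ad spend versus weekly sales) and the helper name `fit_line` are invented for illustration and do not come from the book.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: ad spend and weekly sales, both in thousands.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [12.1, 13.9, 16.2, 17.8, 20.0]

slope, intercept = fit_line(ad_spend, sales)
predicted_sales = slope * 6.0 + intercept  # forecast for a spend of 6.0
```

Here the fitted slope is about 1.97, which reads as roughly 1.97 thousand in extra sales per extra thousand of ad spend, within the range of the data.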

Logistic regression

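Suited to making a "0 or 1" decision in terms of a probability, as in the fraud detection example above.

Unsupervised Learning: Without Known Target

The methods so far all had data paired with outcomes (a training dataset of situations with known outcomes). Sometimes, though, there is no specific outcome to predict, and instead: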
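A minimal sketch of that idea, assuming a tiny hand-made dataset: logistic regression fitted by plain stochastic gradient descent, returning a fraud probability between 0 and 1. The two features, the learning rate, the epoch count, and the function names are all assumptions made for this example, not anything from the book.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weights, bias):
    """Estimated probability that the label is 1 (here: fraud)."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights by stochastic gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            # (p - y) is the gradient of the log loss w.r.t. the linear score.
            error = predict_proba(x, weights, bias) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

# Hypothetical features: [order amount in $100s, 1 if shipping != billing address]
purchases = [[0.3, 0], [0.5, 0], [0.8, 0], [1.0, 1],
             [4.0, 1], [5.5, 1], [6.0, 0], [0.4, 1]]
is_fraud = [0, 0, 0, 0, 1, 1, 1, 0]

weights, bias = fit_logistic(purchases, is_fraud)
p_large_order = predict_proba([5.0, 1], weights, bias)
p_small_order = predict_proba([0.5, 0], weights, bias)
```

The output is a score rather than a verdict; turning it into a yes/no classification means picking a threshold, which is what "Logistic regression (with a threshold)" in Table 5.2 refers to.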

You may be looking for patterns and relationships in the data that will help you understand your customers or your business better.

That is, you look for patterns and relationships in the data that lead to a better understanding of the customers or the business.

These situations correspond to a class of approaches called unsupervised learning: rather than predicting outputs based on inputs, the objective of unsupervised learning is to discover similarities and relationships in the data.

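This is called unsupervised learning. Its goal is not to predict outputs from inputs but to discover similarities and relationships in the data.

The book covers the following common unsupervised methods (only the first is clustering in the strict sense):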

K-means clustering
Apriori algorithm for finding association rules
Nearest neighbor

K-means clustering
Groups customers by similar purchasing patterns.
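A minimal sketch of the k-means loop in plain Python, alternating between assigning points to the nearest centroid and moving each centroid to the mean of its members. The customer data and the naive "first k points" initialization are assumptions made for this example; real implementations usually pick starting centroids more carefully and rescale the features first.

```python
def kmeans(points, k, iterations=20):
    """Return (cluster index per point, centroids) after a fixed number of passes."""
    # Naive initialization (an assumption of this sketch): the first k points.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids

# Hypothetical customers: (orders per month, average basket value).
customers = [(1, 20), (10, 200), (2, 25), (1, 30), (9, 180), (11, 210)]
assignments, centroids = kmeans(customers, k=2)
```

No labels go in; the two groups (occasional small-basket buyers and frequent large-basket buyers) come out of the data alone, which is the point of the "without known target" setting.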

Apriori
Finds products that are bought together. This is an application of association rules, as used in recommendation systems. Figure 5.5 is an example of searching for purchase patterns.
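A simplified sketch of the Apriori idea in plain Python, limited to pairs of items: count single items first, discard the infrequent ones, and only then count pairs built from the survivors (a pair cannot be frequent unless both of its items are). The baskets, the thresholds, and the function name `association_rules` are invented for this example; the full algorithm continues level by level to larger itemsets.

```python
from collections import Counter
from itertools import combinations

def association_rules(baskets, min_support, min_confidence):
    """Return (lhs, rhs, support, confidence) rules for frequent item pairs."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in set(basket))
    # Apriori pruning: only frequent single items can form frequent pairs.
    frequent_items = {i for i, c in item_counts.items() if c / n >= min_support}
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        if count / n < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # confidence = share of baskets containing lhs that also contain rhs
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, count / n, confidence))
    return rules

# Hypothetical shopping baskets.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "butter"],
    ["bread", "butter", "jam"],
]
rules = association_rules(baskets, min_support=0.5, min_confidence=0.7)
```

On this data only bread and butter pass the support threshold, so the result is the two rules "bread implies butter" and "butter implies bread", each with support 0.6 and confidence 0.75.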

Nearest Neighbor
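Figure 5.6 shows a search for movies to recommend to the customer JaneB.

The procedure is to find the three people whose movie rental histories are most similar to JaneB's, then take the movies they have rented that JaneB has not as the candidate recommendations. In this example the method is nearest neighbor (or k-nearest neighbor methods, with K = 3).

Table 5.2 below is the Problem-to-method Mapping. It is not something to follow rigidly every time, but it works well enough as a starting point for an analysis.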
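The procedure above can be sketched in a few lines of plain Python. Only the name JaneB comes from the book's example; the other customers, the movie titles, and the choice of Jaccard similarity as the measure of "similar rental history" are assumptions made for this sketch.

```python
from collections import Counter

def jaccard(a, b):
    """Overlap between two sets of rentals: |intersection| / |union|."""
    return len(a & b) / len(a | b)

def recommend(target, rentals, k=3):
    """Find the k most similar customers and suggest what they rented that target has not."""
    mine = rentals[target]
    others = [name for name in rentals if name != target]
    neighbors = sorted(
        others, key=lambda name: jaccard(mine, rentals[name]), reverse=True
    )[:k]
    # Each neighbor "votes" for the movies the target has not rented yet.
    votes = Counter(movie for name in neighbors for movie in rentals[name] - mine)
    return neighbors, [movie for movie, _ in votes.most_common()]

# Hypothetical rental histories (titles are placeholders).
rentals = {
    "JaneB": {"Comedy1", "Comedy2", "Drama1"},
    "UserA": {"Comedy1", "Comedy2", "Drama1", "Comedy3"},
    "UserB": {"Comedy1", "Comedy2", "Comedy3"},
    "UserC": {"Comedy2", "Drama1", "Drama2"},
    "UserD": {"Horror1", "Horror2"},
    "UserE": {"Comedy1", "Horror1"},
}
neighbors, suggestions = recommend("JaneB", rentals, k=3)
```

Here the three nearest neighbors are UserA, UserB, and UserC, and "Comedy3" ranks first because two of the three rented it. Nothing is trained in advance; the stored data itself is the model, which is why the method gets slower as the customer base grows.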

Example tasks	Machine learning terminology	Typical algorithms
Identifying spam email. Sorting products in a product catalog. Identifying loans that are about to default. Assigning customers to customer clusters.	Classification: assigning known labels to objects	Decision trees; Naive Bayes; Logistic regression (with a threshold); Support vector machines
Predicting the value of AdWords. Estimating the probability that a loan will default. Predicting how much a marketing campaign will increase traffic or sales.	Regression: predicting or forecasting numerical values	Linear regression; Logistic regression
Finding products that are purchased together. Identifying web pages that are often visited in the same session. Identifying successful (much-clicked) combinations of web pages and AdWords.	Association rules: finding objects that tend to appear in the data together	Apriori
Identifying groups of customers with the same buying patterns. Identifying groups of products that are popular in the same regions or with the same customer clusters. Identifying news items that are all discussing similar events.	Clustering: finding groups of objects that are more similar to each other than to objects in other groups	K-means
Making product recommendations for a customer based on the purchases of other similar customers. Predicting the final price of an auction item based on the final prices of similar products that have been auctioned in the past.	Nearest neighbor: predicting a property of a datum based on the datum or data that are most similar to it	Nearest neighbor

Continued in "Model Validation".


Saturday, October 1, 2016
