りんだろぐ rindalog: Practical Data Science with R: Table of Contents

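It is no exaggeration to say that I read Practical Data Science with R by Nina Zumel and John Mount word for word. Naturally, I made sure I understood all of the theory and the R code implementations as I went, because I did not want to miss any piece of information in the book.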

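The word "practical" in the title is no lie. I recently posted "Empirical over Theoretical (what helps me)", and this book is exactly "what helps me". I have no intention of telling anyone "You should at least read this book!", but I suspect quite a few people would agree with this opinion.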

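Here I have organized my posts based on the book in table-of-contents form. For every chapter, including the ones I did not write about, I have listed the key points given at the start of the chapter.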

Practical Data Science With R

Nina Zumel John Mount
Manning Pubns Co

Part 1. Introduction to data science

Chapter 1. The data science process

Defining data science project roles
Understanding the stages of a data science project
Setting expectations for a new data science project

　　The Confusion Matrix, revisited

Chapter 2. Loading data into R

Understanding R’s data frame structure
Loading data into R from files and from relational databases
Transforming data for analysis

Chapter 3. Exploring data

Using summary statistics to explore data
Exploring data using visualization
Finding problems and issues during data exploration

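　Histogram and Density Plot

　　　Convenience and misuse

　　　Log scale

　　　A Bar Chart is a Histogram for discrete data

　Visualizing two variables
　　　Scatter rather than Line
　　　Fitting a curve to a Scatter Plot
　　　Categorical vs. categorical
　　　The purpose of graphing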

Chapter 4. Managing data

Fixing data quality problems
Organizing your data for the modeling process

Part 2. Modeling methods

Chapter 5. Choosing and evaluating models

Mapping business problems to machine learning tasks
Evaluating model quality
Validating model soundness

　Model building, evaluation & judgment
　　　The big picture
　　　Model building
　　　Model Validation
　　　Yardsticks for evaluation
　　　Evaluating Classification
　　　Evaluating Scoring
　　　Evaluating Probability
　　　Evaluating Ranking

Chapter 6. Memorization methods

Building single-variable models
Cross-validated variable selection
Building basic multivariable models
Starting with decision trees, nearest neighbor, and naive Bayes models

　Simple Methods

　　　Datasets (Training, Calibration, Test)

　　　Single Model (categorical variables)

　　　Single Model (numeric variables)

　　　Cross-Validation to estimate Overfitting

　　　Variable selection

　　　Decision Trees
　　　Nearest Neighbor Methods
　　　Naive Bayes and Smoothing (the probability that the sun will rise tomorrow)

Chapter 7. Linear and logistic regression

Using linear regression to predict quantities
Using logistic regression to predict probabilities or categories
Extracting relations and advice from functional models
Interpreting the diagnostics from R’s lm() call
Interpreting the diagnostics from R’s glm() call

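　Linear and logistic regression
　　　Linear regression even when the relationship is not a straight line
　　　Plotting predicted vs. actual values
　　　R-squared
　　　Interpreting coefficients (the value of a bachelor's degree)
　　　Summary #1 (in linear regression, residuals are everything)
　　　Summary #2 (signs of collinearity)
　　　Summary #3 (checking for overfitting)
　　　Linear regression summary
　　　Logistic Regression
　　　Building a GLM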

Chapter 8. Unsupervised methods

Using R’s clustering functions to explore data and look for similarities
Choosing the right number of clusters
Evaluating a clustering
Using R’s association rules functions to find patterns of co-occurrence in data
Evaluating a set of association rules

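　Cluster analysis

　　　Unsupervised Methods

　　　Hierarchical Clustering

　　　Visualization with principal component analysis

　　　Cluster Stabilities

　　　How to determine the number of clusters
　　　The k-means method
　　　Classifying new data with the analysis results
　　　Clustering summary
　　　Association Rule Mining
　　　Transaction data
　　　Starting the search for Rules
　　　Mining is Interactive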

Chapter 9. Exploring advanced methods

Reducing training variance with bagging and random forests
Learning non-monotone relationships with generalized additive models
Increasing data separation with kernel methods
Modeling complex decision boundaries with support vector machines

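　Advanced analysis methods
　　　Improving models with Bagging
　　　Improving models with Random Forests
　　　Variable reduction
　　　Linear regression and "monotone relationships"
　　　Trying GAM
　　　Overfit risk, s-curves
　　　Predicting newborn weight
　　　GAM logistic regression, GAM summary
　　　Kernel Methods, SVM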

Part 3. Delivering Results

Chapter 10. Documentation and deployment

Producing effective milestone documentation
Managing project history using source control
Deploying results and making demonstrations

Chapter 11. Producing effective presentations

Presenting your results to project sponsors
Communicating with your model’s end users
Presenting your results to fellow data scientists

　Effective presentations
　　　For sponsors #1
　　　For sponsors #2
　　　For users
　　　For peers & summary

Appendix A. Working with R and other tools
　　　Which to use for selection, [[]] or []?

Appendix B. Important statistical concepts: B.1. Distributions

　Three distributions

　　　Normal Distribution
　　　Lognormal Distribution
　　　Binomial Distribution

Appendix B. Important statistical concepts: B.2. Statistical theory

　Statistical theory

　　　Exchangeability

　　　Concept drift, Goodhart's law
　　　Bias variance decomposition
　　　Statistical efficiency (taking refuge in the "Bayes temple")
　　　A/B testing

　　　statistical test power
　　　omitted variable bias
　　　offset is not a silver bullet
　　　More Tools and Ideas #1

　　　More Tools and Ideas #2

Friday, February 10, 2017
