2017年2月11日土曜日

歌われる州

先日の投稿の元記事  Understanding mixture models and expectation-maximization (using baseball statistics) で紹介されているのが Song Lyrics Across the United States 。このタイトル通り、「歌詞」についての分析で

全米の州の名称のうち、歌詞に最も登場する州は?

このテーマ自体だけじゃなく、結果の地図へのマッピングも良い。その後 empirical Bayes に適用する。


Song Lyrics Across the United States

各州の人口データは、acs パッケージを使って American Community Survey から取得。ここでパッケージを利用するための key を要求して、次のように key をインストールする。

# do this once, you never need to do it again
api.key.install(key = "the key you've got")

library(acs)
library(dplyr)
library(reshape2)
stategeo <- geo.make(state = "*") # Create a new geo.set object for use with the acs package
popfetch <- acs.fetch(geography = stategeo
                       endyear = 2014,
                       span = 5, 
                       table.number = "B01003",
                       col.names = "pretty")
head(popfetch,3)
## ACS DATA: 
##  2010 -- 2014 ;
##   Estimates w/90% confidence intervals;
##   for different intervals, see confint()
##         Total Population: Total
## Alabama 4817678 +/- 0          
## Alaska  728300 +/- 0           
## Arizona 6561516 +/- 0          

####
# You might need to add "acs::" as I was.
####
pop_df <- tbl_df(melt(acs::estimate(popfetch))) %>%
head(acs::estimate(popfetch),3)
##         Total Population: Total
## Alabama                 4817678
## Alaska                   728300
## Arizona                 6561516
melt(head(acs::estimate(popfetch),3))
##      Var1                    Var2   value
## 1 Alabama Total Population: Total 4817678
## 2  Alaska Total Population: Total  728300
## 3 Arizona Total Population: Total 6561516
tbl_df(melt(head(acs::estimate(popfetch),3)))
## # A tibble: 3 × 3
##      Var1                    Var2   value
##    <fctr>                  <fctr>   <dbl>
## 1 Alabama Total Population: Total 4817678
## 2  Alaska Total Population: Total  728300
## 3 Arizona Total Population: Total 6561516
    mutate(state_name = tolower(Var1),
           pop2014 = value) %>%
    select(state_name, pop2014) %>%
    filter(state_name != "puerto rico")

歌詞のデータは 1958 年から現在までの 50 Years of Pop Music分析例やデータは GitHub 50 Years of Pop Music Lyrics 。ダウンロードした csv ファイルから歌詞データを読み込む。

library(readr)
song_lyrics <- read_csv(
  "~/Downloads/musiclyrics-master/billboard_lyrics_1964-2015.csv")
names(song_lyrics)
## [1] "Rank"   "Song"   "Artist" "Year"   "Lyrics" "Source"

例えば、Otis Redding の (Sittin' On) The Dock of the Bay の歌詞

(song_lyrics %>%
  filter(Song == "sittin on the dock of the bay") %>%
  select(Lyrics))[[1]]
## [1] "sittin in the mornin sun ill be sittin when the evenin comes watchin the ships roll in then i watch em roll away again yeah im sittin on the dock of the bay watchin the tide roll away ooo im just sittin on the dock of the bay wastin timei left my home in georgia headed for the frisco bay cause i had nothin to live for it look like nothins gonna come my way so im just goin sittin on the dock of the bay watchin the tide roll away ooo im sittin on the dock of the bay wastin timelook like nothins gonna change everything still remains the same i cant do what ten people tell me to do so i guess ill remain the same yessittin here restin my bones and this loneliness wont leave me alone yes two thousand miles i roam just to make this dock my home now im just gonna sit at the dock of the bay watchin the tide roll away ooo yea sittin on the dock of the bay wastin time"

このデータと実際の歌詞と比べる

Sittin' in the mornin' sun
I'll be sittin' when the evenin' come
Watching the ships roll in
And then I watch 'em roll away again, yeah

データでは全て小文字で、カンマ、ピリオド、アポストロフィなど、記号は除かれている。


n-gram

任意の文字列の連続は n-gram で、ここでは歌詞にある単語二つを組み合わせた bigram を作る。

library(tidytext)
tidy_lyrics <- bind_rows(song_lyrics %>% 
                             unnest_tokens(state_name, Lyrics),
                         song_lyrics %>% 
                             unnest_tokens(state_name, Lyrics, 
                                           token = "ngrams", n = 2))

これは、unset_tokens を二度実行した結果を結合したもの。一度目は、単語にバラしただけ。先ほどの The Dog of the Baye の歌詞は、次のようにバラされて

head(song_lyrics %>%
  unnest_tokens(state_name, Lyrics) %>%
  filter(Song == "sittin on the dock of the bay"),3)
## # A tibble: 3 × 6
##    Rank                          Song       Artist  Year Source state_name
##   <int>                         <chr>        <chr> <int>  <int>      <chr>
## 1     4 sittin on the dock of the bay otis redding  1968      1     sittin
## 2     4 sittin on the dock of the bay otis redding  1968      1         in
## 3     4 sittin on the dock of the bay otis redding  1968      1        the

次に二単語の組み合わせ。

tail(tidy_lyrics %>%
  filter(Song == "sittin on the dock of the bay"),3)
## # A tibble: 3 × 6
##    Rank                          Song       Artist  Year Source  state_name
##   <int>                         <chr>        <chr> <int>  <int>       <chr>
## 1     4 sittin on the dock of the bay otis redding  1968      1     the bay
## 2     4 sittin on the dock of the bay otis redding  1968      1  bay wastin
## 3     4 sittin on the dock of the bay otis redding  1968      1 wastin time

歌詞データと人口データを結合。

> inner_join(tidy_lyrics,pop_df)
Joining, by = "state_name"
# A tibble: 526 × 7
    Rank               Song         Artist  Year Source state_name  pop2014
   <int>              <chr>          <chr> <int>  <int>      <chr>    <dbl>
1     12   king of the road   roger miller  1965      1      maine  1328535
2     29 eve of destruction  barry mcguire  1965      1    alabama  4817678
3     49   california girls the beach boys  1965      3 california 38066920
4     49   california girls the beach boys  1965      3 california 38066920
5     49   california girls the beach boys  1965      3 california 38066920
6     49   california girls the beach boys  1965      3 california 38066920
7     49   california girls the beach boys  1965      3 california 38066920
8     49   california girls the beach boys  1965      3 california 38066920
9     49   california girls the beach boys  1965      3 california 38066920
10    49   california girls the beach boys  1965      3 california 38066920
# ... with 516 more rows

一つの歌で同じ州名が複数回登場する場合でも一回とカウントする。

tidy_lyrics <- inner_join(tidy_lyrics, pop_df) %>%
    distinct(Rank, Song, Artist, Year, state_name, .keep_all = TRUE)
> tidy_lyrics
# A tibble: 253 × 7
    Rank                          Song                         Artist  Year Source state_name
   <int>                         <chr>                          <chr> <int>  <int>      <chr>
1     12              king of the road                   roger miller  1965      1      maine
2     29            eve of destruction                  barry mcguire  1965      1    alabama
3     49              california girls                 the beach boys  1965      3 california
4     10            california dreamin           the mamas  the papas  1966      3 california
5     77            message to michael                 dionne warwick  1966      1   kentucky
6     61             california nights                    lesley gore  1967      1 california
7      4 sittin on the dock of the bay                   otis redding  1968      1    georgia
8     10                    tighten up        archie bell  the drells  1968      3      texas
9     25                      get back the beatles with billy preston  1969      3    arizona
10    25                      get back the beatles with billy preston  1969      3 california
# ... with 243 more rows, and 1 more variables: pop2014 <dbl>


それぞれの州名の頻出回数。

state_counts <- tidy_lyrics %>% 
    group_by(state_name) %>% 
    summarise(n = n()) %>% 
    arrange(desc(n))
# summarize() 以降は tally() で代用可能
# tidy_lyrics %>% group_by(state_name) %>% tally(sort=T)
state_counts

## # A tibble: 33 × 2

##     state_name     n

##          <chr> <int>
## 1     new york    64
## 2   california    34
## 3      georgia    22
## 4    tennessee    14
## 5        texas    14
## 6      alabama    12
## 7  mississippi    10
## 8     kentucky     7
## 9       hawaii     6
## 10    illinois     6
## # ... with 23 more rows

「ニューヨーク州ではなくてニューヨーク市のことが歌われている」との疑いは強いが、ここでは無視する、とのこと。
Now, I am going to use my vast knowledge of pop culture here and suggest that these mentions of New York are referencing New York City, not the state of New York, as lovely as it may be. I’ll keep them in for now but we should be aware of that. Also, I am a bit surprised the numbers are this low overall; this makes me long for BIGGER DATA.

100 万人当たりの回数

pop_df <- pop_df %>% 
    left_join(state_counts) %>% 
    mutate(rate = n/pop2014*1e6)

pop_df %>%
    arrange(desc(rate)) %>%
    top_n(10)

## # A tibble: 10 × 4

##     state_name  pop2014     n     rate

##          <chr>    <dbl> <int>    <dbl>
## 1       hawaii  1392704     6 4.308166
## 2  mississippi  2984345    10 3.350819
## 3     new york 19594330    64 3.266251
## 4      alabama  4817678    12 2.490826
## 5        maine  1328535     3 2.258126
## 6      georgia  9907756    22 2.220483
## 7    tennessee  6451365    14 2.170083
## 8      montana  1006370     2 1.987341
## 9     nebraska  1855617     3 1.616713
## 10    kentucky  4383272     7 1.596981

main 州が高すぎるので、理由を探る。

tidy_lyrics %>%
     filter(state_name == "maine") %>%
     select(Song, Artist, Year)
## # A tibble: 3 × 3
##               Song                      Artist  Year
##              <chr>                       <chr> <int>
## 1 king of the road                roger miller  1965
## 2       every girl                 young money  2009
## 3          bedrock young money featuring lloyd  2010

# “King of the Road”, OK, sure, but it turns out that Mack Maine is a rap artist who is the president of a label named Young Money. It is possible there are other examples of this kind of confusion in this analysis, but I checked most of the other states and did not find anyway. The other state names seen here seem less likely to fall into such a mistake anyway. Let’s drop Maine’s number down to 1 and recalculate the rate.
pop_df$n[pop_df$state_name == "maine"] <- 1
pop_df <- pop_df %>% 
    mutate(rate = n/pop2014*1e6)


地図作成

minimap package を使用。

## abbreviations for states
> state.abb
##  [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA" "KS" "KY"
## [18] "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ" "NM" "NY" "NC" "ND"
## [35] "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT" "VA" "WA" "WV" "WI" "WY"
state_abb <- c(state.abb[1:7], "DC", state.abb[8:50])
pop_df <- pop_df %>% 
    bind_cols(data_frame(abb = state_abb))

## colors for each state
library(RColorBrewer)
pop_df <- pop_df %>% 
    bind_cols(data_frame(n_ntile = ntile(desc(pop_df$n), 11), 
                     rate_ntile = ntile(desc(pop_df$rate), 11))) %>%
    mutate(color_n = if_else(is.na(n_ntile),
                           "gray95",
                           brewer.pal(9,"BuPu")[9-n_ntile]),
           color_rate = if_else(is.na(rate_ntile),
                                "gray90",
                                brewer.pal(9,"BuPu")[9-rate_ntile]))

以下の地図の作成で、コード中の強調した箇所は、元記事から変更したもの。フォント関連エラーのため。

library(minimap)
quartzFonts(sans = quartzFont(rep("AppleGothic", 4)))
par(family = 'sans')
miniusa(pop_df$abb, pop_df$color_n, 
        state_name_cex = 1.2, font = "sans")
title(main = "What States Are Mentioned in Song Lyrics?\nNumber of Mentions",
      line = -1)
100 万人当たりの回数

miniusa(pop_df$abb, pop_df$color_rate,
        state_name_cex = 1.2, font = "sans")
title(main = "What States Are Mentioned in Song Lyrics?\nNumber of Mentions per Million Population", 
      line = -1)
この二枚から画像ファイルを作成。

library(magick)
map1 <- image_read("~/Desktop/minimap-1.png")
map2 <- image_read("~/Desktop/minimap-2.png")
(animate_map <- image_animate(c(map1, map2), fps = 0.8))
##   format width height colorspace filesize
## 1    gif   579    481       sRGB        0
## 2    gif   579    481       sRGB        0
image_write(animate_map,"~/Desktop/animate_map.gif")

0 件のコメント:

コメントを投稿