りんだろぐ rindalog: 歌われる州

先日の投稿の元記事 Understanding mixture models and expectation-maximization (using baseball statistics) で紹介されているのが Song Lyrics Across the United States 。このタイトル通り、「歌詞」についての分析で

全米の州の名称のうち、歌詞に最も登場する州は？

このテーマ自体だけじゃなく、結果の地図へのマッピングも良い。その後 empirical Bayes に適用する。

Song Lyrics Across the United States

各州の人口データは、acs パッケージを使って American Community Survey から取得。ここでパッケージを利用するための key を要求して、次のように key をインストールする。

# do this once, you never need to do it again
api.key.install(key = "the key you've got")

library(acs)
library(dplyr)
library(reshape2)
stategeo <- geo.make(state = "*") # Create a new geo.set object for use with the acs package
popfetch <- acs.fetch(geography = stategeo,
endyear = 2014,
span = 5,
table.number = "B01003",
col.names = "pretty")
head(popfetch,3)
## ACS DATA:
##  2010 -- 2014 ;
##   Estimates w/90% confidence intervals;
##   for different intervals, see confint()
##   Total Population: Total
## Alabama 4817678 +/- 0
## Alaska 728300 +/- 0
## Arizona 6561516 +/- 0

####
# You might need to add "acs::" as I was.
####
pop_df <- tbl_df(melt(acs::estimate(popfetch))) %>%
head(acs::estimate(popfetch),3)
## Total Population: Total
## Alabama 4817678
## Alaska 728300
## Arizona 6561516
melt(head(acs::estimate(popfetch),3))
## Var1 Var2 value
## 1 Alabama Total Population: Total 4817678
## 2 Alaska Total Population: Total 728300
## 3 Arizona Total Population: Total 6561516
tbl_df(melt(head(acs::estimate(popfetch),3)))
## # A tibble: 3 × 3
##   Var1 Var2 value
##   <fctr> <fctr> <dbl>
## 1 Alabama Total Population: Total 4817678
## 2 Alaska Total Population: Total 728300
## 3 Arizona Total Population: Total 6561516

mutate(state_name = tolower(Var1),

pop2014 = value) %>%

select(state_name, pop2014) %>%

filter(state_name != "puerto rico")

歌詞のデータは 1958 年から現在までの 50 Years of Pop Music、分析例やデータは GitHub 50 Years of Pop Music Lyrics 。ダウンロードした csv ファイルから歌詞データを読み込む。

library(readr)

song_lyrics <- read_csv(

"~/Downloads/musiclyrics-master/billboard_lyrics_1964-2015.csv")

names(song_lyrics)

## [1] "Rank" "Song" "Artist" "Year" "Lyrics" "Source"

例えば、Otis Redding の (Sittin' On) The Dock of the Bay の歌詞

(song_lyrics %>%

filter(Song == "sittin on the dock of the bay") %>%

select(Lyrics))[[1]]

## [1] "sittin in the mornin sun ill be sittin when the evenin comes watchin the ships roll in then i watch em roll away again yeah im sittin on the dock of the bay watchin the tide roll away ooo im just sittin on the dock of the bay wastin timei left my home in georgia headed for the frisco bay cause i had nothin to live for it look like nothins gonna come my way so im just goin sittin on the dock of the bay watchin the tide roll away ooo im sittin on the dock of the bay wastin timelook like nothins gonna change everything still remains the same i cant do what ten people tell me to do so i guess ill remain the same yessittin here restin my bones and this loneliness wont leave me alone yes two thousand miles i roam just to make this dock my home now im just gonna sit at the dock of the bay watchin the tide roll away ooo yea sittin on the dock of the bay wastin time"

このデータと実際の歌詞と比べる

Sittin' in the mornin' sun

I'll be sittin' when the evenin' come

Watching the ships roll in

And then I watch 'em roll away again, yeah

データでは全て小文字で、カンマ、ピリオド、アポストロフィなど、記号は除かれている。

n-gram

任意の文字列の連続は n-gram で、ここでは歌詞にある単語二つを組み合わせた bigram を作る。

library(tidytext)

tidy_lyrics <- bind_rows(song_lyrics %>%

unnest_tokens(state_name, Lyrics),

song_lyrics %>%

unnest_tokens(state_name, Lyrics,

token = "ngrams", n = 2))

これは、unset_tokens を二度実行した結果を結合したもの。一度目は、単語にバラしただけ。先ほどの The Dog of the Baye の歌詞は、次のようにバラされて

head(song_lyrics %>%

　　unnest_tokens(state_name, Lyrics) %>%

　　filter(Song == "sittin on the dock of the bay"),3)

## # A tibble: 3 × 6

## Rank Song Artist Year Source state_name

## <int> <chr> <chr> <int> <int> <chr>

## 1 4 sittin on the dock of the bay otis redding 1968 1 sittin

## 2 4 sittin on the dock of the bay otis redding 1968 1 in

## 3 4 sittin on the dock of the bay otis redding 1968 1 the

次に二単語の組み合わせ。

tail(tidy_lyrics %>%

　　filter(Song == "sittin on the dock of the bay"),3)

## # A tibble: 3 × 6

## Rank Song Artist Year Source state_name

## <int> <chr> <chr> <int> <int> <chr>

## 1 4 sittin on the dock of the bay otis redding 1968 1 the bay

## 2 4 sittin on the dock of the bay otis redding 1968 1 bay wastin

## 3 4 sittin on the dock of the bay otis redding 1968 1 wastin time

歌詞データと人口データを結合。

> inner_join(tidy_lyrics,pop_df)

Joining, by = "state_name"

# A tibble: 526 × 7

Rank Song Artist Year Source state_name pop2014

1 12 king of the road roger miller 1965 1 maine 1328535

2 29 eve of destruction barry mcguire 1965 1 alabama 4817678

3 49 california girls the beach boys 1965 3 california 38066920

4 49 california girls the beach boys 1965 3 california 38066920

5 49 california girls the beach boys 1965 3 california 38066920

6 49 california girls the beach boys 1965 3 california 38066920

7 49 california girls the beach boys 1965 3 california 38066920

8 49 california girls the beach boys 1965 3 california 38066920

9 49 california girls the beach boys 1965 3 california 38066920

10 49 california girls the beach boys 1965 3 california 38066920

# ... with 516 more rows

一つの歌で同じ州名が複数回登場する場合でも一回とカウントする。

tidy_lyrics <- inner_join(tidy_lyrics, pop_df) %>%

distinct(Rank, Song, Artist, Year, state_name, .keep_all = TRUE)

> tidy_lyrics

# A tibble: 253 × 7

Rank Song Artist Year Source state_name

1 12 king of the road roger miller 1965 1 maine

2 29 eve of destruction barry mcguire 1965 1 alabama

3 49 california girls the beach boys 1965 3 california

4 10 california dreamin the mamas the papas 1966 3 california

5 77 message to michael dionne warwick 1966 1 kentucky

6 61 california nights lesley gore 1967 1 california

7 4 sittin on the dock of the bay otis redding 1968 1 georgia

8 10 tighten up archie bell the drells 1968 3 texas

9 25 get back the beatles with billy preston 1969 3 arizona

10 25 get back the beatles with billy preston 1969 3 california

# ... with 243 more rows, and 1 more variables: pop2014 <dbl>

それぞれの州名の頻出回数。

state_counts <- tidy_lyrics %>%

group_by(state_name) %>%

summarise(n = n()) %>%

arrange(desc(n))

# summarize() 以降は tally() で代用可能

# tidy_lyrics %>% group_by(state_name) %>% tally(sort=T)

state_counts

## # A tibble: 33 × 2

## state_name n

## <chr> <int>

## 1 new york 64

## 2 california 34

## 3 georgia 22

## 4 tennessee 14

## 5 texas 14

## 6 alabama 12

## 7 mississippi 10

## 8 kentucky 7

## 9 hawaii 6

## 10 illinois 6

## # ... with 23 more rows

「ニューヨーク州ではなくてニューヨーク市のことが歌われている」との疑いは強いが、ここでは無視する、とのこと。

Now, I am going to use my vast knowledge of pop culture here and suggest that these mentions of New York are referencing New York City, not the state of New York, as lovely as it may be. I’ll keep them in for now but we should be aware of that. Also, I am a bit surprised the numbers are this low overall; this makes me long for BIGGER DATA.

100 万人当たりの回数

pop_df <- pop_df %>%

left_join(state_counts) %>%

mutate(rate = n/pop2014*1e6)

pop_df %>%

arrange(desc(rate)) %>%

top_n(10)

## # A tibble: 10 × 4

## state_name pop2014 n rate

## <chr> <dbl> <int> <dbl>

## 1 hawaii 1392704 6 4.308166

## 2 mississippi 2984345 10 3.350819

## 3 new york 19594330 64 3.266251

## 4 alabama 4817678 12 2.490826

## 5 maine 1328535 3 2.258126

## 6 georgia 9907756 22 2.220483

## 7 tennessee 6451365 14 2.170083

## 8 montana 1006370 2 1.987341

## 9 nebraska 1855617 3 1.616713

## 10 kentucky 4383272 7 1.596981

main 州が高すぎるので、理由を探る。

tidy_lyrics %>%

filter(state_name == "maine") %>%

select(Song, Artist, Year)

## # A tibble: 3 × 3

## Song Artist Year

## <chr> <chr> <int>

## 1 king of the road roger miller 1965

## 2 every girl young money 2009

## 3 bedrock young money featuring lloyd 2010

# “King of the Road”, OK, sure, but it turns out that Mack Maine is a rap artist who is the president of a label named Young Money. It is possible there are other examples of this kind of confusion in this analysis, but I checked most of the other states and did not find anyway. The other state names seen here seem less likely to fall into such a mistake anyway. Let’s drop Maine’s number down to 1 and recalculate the rate.

pop_df$n[pop_df$state_name == "maine"] <- 1

pop_df <- pop_df %>%

mutate(rate = n/pop2014*1e6)

地図作成

minimap package を使用。

## abbreviations for states
> state.abb
## [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA" "KS" "KY"
## [18] "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ" "NM" "NY" "NC" "ND"
## [35] "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT" "VA" "WA" "WV" "WI" "WY"

state_abb <- c(state.abb[1:7], "DC", state.abb[8:50])

pop_df <- pop_df %>%

bind_cols(data_frame(abb = state_abb))

## colors for each state

library(RColorBrewer)

pop_df <- pop_df %>%

bind_cols(data_frame(n_ntile = ntile(desc(pop_df$n), 11),

rate_ntile = ntile(desc(pop_df$rate), 11))) %>%

mutate(color_n = if_else(is.na(n_ntile),

"gray95",

brewer.pal(9,"BuPu")[9-n_ntile]),

color_rate = if_else(is.na(rate_ntile),

"gray90",

brewer.pal(9,"BuPu")[9-rate_ntile]))

以下の地図の作成で、コード中の強調した箇所は、元記事から変更したもの。フォント関連エラーのため。

library(minimap)

quartzFonts(sans = quartzFont(rep("AppleGothic", 4)))

par(family = 'sans')

miniusa(pop_df$abb, pop_df$color_n,

state_name_cex = 1.2, font = "sans")

title(main = "What States Are Mentioned in Song Lyrics?\nNumber of Mentions",

line = -1)

100 万人当たりの回数

miniusa(pop_df$abb, pop_df$color_rate,

state_name_cex = 1.2, font = "sans")

title(main = "What States Are Mentioned in Song Lyrics?\nNumber of Mentions per Million Population",

line = -1)

この二枚から画像ファイルを作成。

library(magick)

map1 <- image_read("~/Desktop/minimap-1.png")

map2 <- image_read("~/Desktop/minimap-2.png")

(animate_map <- image_animate(c(map1, map2), fps = 0.8))

## format width height colorspace filesize

## 1 gif 579 481 sRGB 0

## 2 gif 579 481 sRGB 0

image_write(animate_map,"~/Desktop/animate_map.gif")

Cartogram Map（統計地図）と Tile Grid Map に続く。

りんだろぐ rindalog

2017年2月11日土曜日

歌われる州

0 件のコメント:

コメントを投稿