Rap Lyrics Text Mining

A Kaggle Version

A website Version

As a rap listener and a data analyst, I wanted to combine my two hobbies. I could not do that until I found suitable data.

I found a GitHub repo that contains famous rappers’ lyrics; shout out to the author. I know and have listened to 90% of them, and they are all widely regarded as among the greatest rappers. Note that the lyrics come mostly from relatively older rappers; see Figure 5 for the full list of rappers.

I use R because it makes nice data visualization easy, although Python has better features for machine learning. Since the dataset has no labels for classification, I will only use it for some basic analysis and text metrics.

`%>%` <- magrittr::`%>%`
library(readr)
library(ggplot2)
library(dplyr)
library(purrr)
library(stringr)
library(tidytext)
#../input/rap-lyrics-from-36-rappers/rap_all.csv
rap <- read_csv("E:/RapLyrics/lyrics_US/rap_all.csv")
# strip non-ASCII characters; iconv() is vectorized, so text stays a character column
rap$text <- iconv(rap$text, "UTF-8", "ASCII", sub = "")

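rap1 <- rap %>%
  unnest_tokens(word, text) %>%                 # one lowercase token per row
  anti_join(stop_words, by = "word") %>%        # drop common English stop words
  mutate(word = SnowballC::wordStem(word)) %>%  # reduce each word to its stem
  filter(!str_detect(word, "^[0-9]*$")) %>%     # drop digit-only and empty tokens
  filter(str_detect(word, "^[a-z]+$"))          # keep only tokens made of the letters a-z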
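
To make the last filter concrete, here is a minimal base R sketch on a few made-up tokens (the token list is hypothetical, not taken from the dataset). It shows that the pattern `^[a-z]+$` keeps only tokens made entirely of lowercase letters, so it also rejects digit-only and empty tokens.

```r
# Hypothetical tokens, chosen only to illustrate the filter
tokens <- c("money", "2pac", "00a0", "1999", "don't", "repli", "")

# TRUE only when the whole token consists of one or more letters a-z
keep <- grepl("^[a-z]+$", tokens)

kept_tokens <- tokens[keep]
kept_tokens
```

Tokens that mix letters with digits or apostrophes are dropped as well, which is stricter than only removing numbers.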

wdf <- rap1 %>%
  group_by(rapper) %>% 
  count(word, sort = TRUE) %>% 
  filter(word != 'na') 
wdf
# A tibble: 108,957 x 3
# Groups:   rapper [36]
   rapper            word      n
   <chr>             <chr> <int>
 1 Tyler The Creator fuck    726
 2 Montana of 300    nigga   629
 3 Montana of 300    bitch   482
 4 A$AP Rocky        nigga   461
 5 the_notorious_big nigga   440
 6 Deniro Farrar     nigga   413
 7 Tyler The Creator nigga   387
 8 J Cole            nigga   366
 9 Royce Da 59       nigga   355
10 Big L             nigga   326
# ... with 108,947 more rows
library(wordcloud)
rap1 %>%
  count(word) %>%
  filter(word!='na',word!='00a0') %>% 
  with(wordcloud(word, n, max.words = 100))

Figure 1: Word Cloud

Wow! Rappers really love N-words and F-words. Don’t be shocked: if you dig deeper, you will find this is very common in rap songs.

Furthermore, we can analyze the sentiment of those words, which I plot in different colors. Although these lyrics contain a lot of negative words, rappers still want to show love & peace to the world.

library(reshape2)

rap1 %>%
  inner_join(get_sentiments("bing")) %>% # use bing sentiment lexicon
  count(word, sentiment, sort = T) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("blue", "red"),max.words = 100)

Figure 2: Word Cloud by sentiment

Now we test Zipf’s law, which states that a term’s frequency is inversely proportional to its rank. For example, if Nxgga is the word rappers use most, the second most used word, Fxxk, should appear about half as often.
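
As a small illustration of what the law predicts, the sketch below computes the expected counts for the first five ranks. The top count of 1000 is a made-up value, not a number from the dataset, and the exponent is assumed to be exactly 1.

```r
# Hypothetical count for the most frequent word (illustrative, not from the dataset)
top_count <- 1000
rank <- 1:5

# Zipf's law with exponent 1: expected count is proportional to 1 / rank
expected <- top_count / rank

data.frame(rank, expected)
```

Under these assumptions the second word appears half as often as the first, the third a third as often, and so on.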

Now we plot it and apply the eyeball test. Each color represents one rapper’s vocabulary. Figure 3 shows the two variables on a log-log scale, while Figure 4 shows the relation between term frequency (TF) and rank on the original scale.

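# total word count per rapper
total_words <- wdf %>% 
  group_by(rapper) %>% 
  summarize(total = sum(n))

new_words <- left_join(wdf, total_words, by = "rapper")

# wdf is sorted by descending n, so the row number within each rapper is the rank
song_words <- new_words %>% 
  group_by(rapper) %>% 
  mutate(rank = row_number(), 
         `term frequency` = n/total)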

song_words %>% 
  ggplot(aes(log10(rank), log10(`term frequency`),
             col = rapper)) + 
  geom_line(size = 1.1, alpha = 0.8,
            show.legend = F) 

Figure 3: Test Zipf’s Law: log-log scale.

song_words %>% 
  ggplot(aes(rank, `term frequency`,
             col = rapper)) + 
  geom_line(size = 1.1, alpha = 0.8, 
            show.legend = F) +
  xlim(0, 500) +
  ylim(0, 0.01)

Figure 4: Test Zipf’s Law: original scale.
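
The eyeball test could be backed by a number: if Zipf’s law holds, the log-log curve is roughly a straight line with a slope near -1. The sketch below checks this on synthetic data, not on the lyrics. The constant 0.05, the noise level, and the 500 ranks are all assumptions made for the example. On the real data, the same `lm()` call could be fitted per rapper to `rank` and `term frequency` in `song_words`.

```r
set.seed(42)

rank <- 1:500
# Synthetic term frequencies following tf = C / rank, with mild multiplicative noise
tf <- (0.05 / rank) * exp(rnorm(length(rank), sd = 0.05))

# Linear fit on the log-log scale; the slope estimates the (negative) Zipf exponent
fit <- lm(log10(tf) ~ log10(rank))
slope <- unname(coef(fit)[2])
slope
```

A slope far from -1, or a clearly curved line, would suggest that the simple form of the law does not describe the data well.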

To tell the rappers apart, I calculate each word’s tf-idf value, which represents how important a specific word is to a rapper, e.g., Yeezy for Kanye West or Slim Shady for Eminem. Note that songs with featured artists might bring some noise into this dataset. For instance, Runaway is a Kanye West song featuring Pusha T, who contributes a verse. Nevertheless, the dataset attributes the whole song to Kanye.
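
For readers new to tf-idf, here is a minimal base R sketch on toy counts. The three words and the three rappers A, B and C are hypothetical. It uses tf = count / total words of the rapper and idf = ln(number of rappers / number of rappers using the word), which is, as far as I know, the definition `bind_tf_idf` applies.

```r
# Toy counts: rows are words, columns are hypothetical rappers
counts <- matrix(
  c(10, 8, 12,   # "money": used by everyone
     5, 0,  0,   # "yeezy": used by rapper A only
     0, 3,  0),  # "shady": used by rapper B only
  nrow = 3, byrow = TRUE,
  dimnames = list(c("money", "yeezy", "shady"), c("A", "B", "C"))
)

tf  <- sweep(counts, 2, colSums(counts), "/")   # share of each rapper's words
idf <- log(ncol(counts) / rowSums(counts > 0))  # ln(rappers / rappers using the word)
tf_idf <- tf * idf                              # idf[i] multiplies row i

round(tf_idf, 3)
```

A word that every rapper uses gets an idf of 0 and therefore a tf-idf of 0, while a word used by a single rapper stands out. With 36 rappers, a word used by 35 of them gets idf = ln(36/35) ≈ 0.0282, which matches the idf reported for the most common word in the output below.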

Now see the result:

totalword <- wdf %>% 
  group_by(rapper) %>% 
  summarize(total = sum(n))

book_words <- left_join(wdf, totalword)
totalword
# A tibble: 36 x 2
   rapper              total
   <chr>               <int>
 1 A$AP Ant             3747
 2 A$AP Rocky          12098
 3 Action Bronson       7496
 4 "Andr\xa8\xa6 3000"  1142
 5 Bas                  7001
 6 Big L               15752
 7 Chance The Rapper   10844
 8 Childish Gambino    13645
 9 Common              11975
10 CunninLynguists      9618
# ... with 26 more rows
book_words <- book_words %>%
  bind_tf_idf(word,rapper, n)

book_words
# A tibble: 108,957 x 7
# Groups:   rapper [36]
   rapper            word      n total     tf    idf   tf_idf
   <chr>             <chr> <int> <int>  <dbl>  <dbl>    <dbl>
 1 Tyler The Creator fuck    726 14908 0.0487 0.0572 0.00278 
 2 Montana of 300    nigga   629 18510 0.0340 0.0282 0.000957
 3 Montana of 300    bitch   482 18510 0.0260 0.0572 0.00149 
 4 A$AP Rocky        nigga   461 12098 0.0381 0.0282 0.00107 
 5 the_notorious_big nigga   440 15618 0.0282 0.0282 0.000794
 6 Deniro Farrar     nigga   413  7089 0.0583 0.0282 0.00164 
 7 Tyler The Creator nigga   387 14908 0.0260 0.0282 0.000731
 8 J Cole            nigga   366  9440 0.0388 0.0282 0.00109 
 9 Royce Da 59       nigga   355 17105 0.0208 0.0282 0.000585
10 Big L             nigga   326 15752 0.0207 0.0282 0.000583
# ... with 108,947 more rows
book_words %>%
  group_by(rapper) %>% 
  top_n(15, tf_idf) %>%   # 15 highest tf-idf words per rapper
  ungroup() %>%
  ggplot(aes(reorder(word, tf_idf), 
             tf_idf, fill = rapper)) +
  geom_col(show.legend = F) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~rapper, ncol = 3, scales = "free") +
  coord_flip()

Figure 5: Tf-Idf value and Top words for every rapper

Famous rappers often have their own slang and nicknames. Lil Wayne, for example, is also known as Weezy or Carter. Similarly, The Notorious B.I.G. calls himself Biggie Smalls or Big Poppa. This can be seen in Figure 5.

I love it when you call me big poppa

R.I.P Biggie!

We can see that some words end in i. For example, reply is converted to repli by SnowballC::wordStem.

SnowballC::wordStem('reply',language = "eng")
[1] "repli"

I don’t know why this happens. Could someone explain it?
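
A likely explanation, based on my reading of the Porter-style English rules behind SnowballC rather than on anything in the output above: the stemmer rewrites a final y after a consonant as i, and reduces endings such as -ies and -ied to i, so that different forms of the same word end up with one shared stem. The stem is only a grouping key and does not have to be a dictionary word. The sketch below uses three example words of my own choosing.

```r
# Requires the SnowballC package, which this post already uses
words <- c("reply", "replies", "replied")

stems <- SnowballC::wordStem(words, language = "english")
stems

# If the explanation holds, all three forms share a single stem
length(unique(stems)) == 1
```

If readable words matter more than grouping, for example in the word clouds, one option is to skip the stemming step or map each stem back to its most frequent original form.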