As a rap listener and a data analyst, I have long wanted to combine my two hobbies. Once I found suitable data, I finally could.
I found a GitHub repo that contains famous rappers’ lyrics; shout-out to the author. I know and have listened to about 90% of them, and they are all widely regarded as among the greatest rappers. Note that the lyrics mostly come from relatively older rappers; see figure 5 for the full list.
I use R because it makes good-looking data visualization easy, although Python has better tooling for machine learning. Since the dataset has no labels for classification, I will only use it for some basic analysis and text metrics.
`%>%` <- magrittr::`%>%`
library(readr)
library(ggplot2)
library(dplyr)
library(purrr)
library(stringr)
library(tidytext)

# ../input/rap-lyrics-from-36-rappers/rap_all.csv
rap <- read_csv("E:/RapLyrics/lyrics_US/rap_all.csv")
# remove non-English (non-ASCII) characters; iconv() is vectorized, so lapply() is not needed
rap$text <- iconv(rap$text, "UTF-8", "ASCII", sub = "")

# one row per word, minus stop words, stemmed
rap1 <- rap %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  mutate(word = SnowballC::wordStem(word)) %>%
  filter(!str_detect(word, "^[0-9]*$")) %>%
  filter(str_detect(word, "^[a-z]+$")) # match only english alphabet

# word counts per rapper, most frequent first
wdf <- rap1 %>%
  group_by(rapper) %>%
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n), word = as.character(word)) %>%
  filter(word != 'na')
wdf
# A tibble: 108,957 x 3
# Groups:   rapper
   rapper            word      n
   <chr>             <chr> <int>
 1 Tyler The Creator fuck    726
 2 Montana of 300    nigga   629
 3 Montana of 300    bitch   482
 4 A$AP Rocky        nigga   461
 5 the_notorious_big nigga   440
 6 Deniro Farrar     nigga   413
 7 Tyler The Creator nigga   387
 8 J Cole            nigga   366
 9 Royce Da 59       nigga   355
10 Big L             nigga   326
# ... with 108,947 more rows
library(wordcloud)
rap1 %>%
  count(word) %>%
  filter(word != 'na', word != '00a0') %>%
  with(wordcloud(word, n, max.words = 100))
Wow! Rappers really love the N-word and the F-word. Don’t be shocked: if you dig deeper, you will find this is very common in rap songs.
Furthermore, we can analyze the sentiment of these words. I plot positive and negative words in different colors. Although the lyrics contain a lot of negative words, rappers still want to show love and peace to the world.
library(reshape2)
rap1 %>%
  inner_join(get_sentiments("bing")) %>% # use bing sentiment lexicon
  count(word, sentiment, sort = T) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("blue", "red"), max.words = 100)
Now let’s test Zipf’s law, which states that a term’s frequency is inversely proportional to its rank. For example, if the most used word is Nxgga, then the second most used word, Fxxk, should appear about half as often.
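As a reference for what the ideal law predicts, here is a minimal base R sketch. The vocabulary size of 1,000 words is an arbitrary assumption for illustration, not a property of the dataset.

```r
# Ideal Zipf frequencies for a hypothetical vocabulary of 1,000 words
rank <- 1:1000
zipf_freq <- (1 / rank) / sum(1 / rank)  # normalize so the frequencies sum to 1

# Under the ideal law, the rank 2 word is half as frequent as the rank 1 word
zipf_freq[2] / zipf_freq[1]  # 0.5

# On a log-log scale the ideal law is a straight line with slope -1
fit <- lm(log10(zipf_freq) ~ log10(rank))
coef(fit)[["log10(rank)"]]
```

Real lyrics should only approximate this straight line, so the interesting part of the plots is how far each rapper’s curve bends away from it.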
total_words <- wdf %>%
  group_by(rapper) %>%
  summarize(total = sum(n))
new_words <- left_join(wdf, total_words)
song_words <- new_words %>%
  group_by(rapper) %>%
  mutate(rank = row_number(), `term frequency` = n/total)

# log-log view: Zipf's law predicts a roughly straight line
song_words %>%
  ggplot(aes(log10(rank), log10(`term frequency`), col = rapper)) +
  geom_line(size = 1.1, alpha = 0.8, show.legend = F)

# linear view, zoomed in on the first 500 ranks
song_words %>%
  ggplot(aes(rank, `term frequency`, col = rapper)) +
  geom_line(size = 1.1, alpha = 0.8, show.legend = F) +
  xlim(0, 500) +
  ylim(0, 0.01)
To tell the rappers apart, I will calculate each word’s tf-idf value, which represents how important a specific word is to a rapper, e.g. Yeezy for Kanye West or Slim Shady for Eminem. Note that featured songs (collaborations with other rappers) might add some noise to the dataset. For instance, Runaway is a Kanye West song featuring Pusha T, who contributes a verse. Nevertheless, the dataset attributes the whole song to Kanye.
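To make the idea concrete, here is a small base R sketch on made-up counts. The three rappers and their numbers are hypothetical; only the formula (term share times the natural log of total documents over documents containing the word) matches what bind_tf_idf() computes.

```r
# Hypothetical word counts for three made-up rappers (not from the dataset)
counts <- rbind(
  rapper_a = c(money = 10, love = 5, yeezy = 8),
  rapper_b = c(money = 12, love = 7, yeezy = 0),
  rapper_c = c(money = 9,  love = 0, yeezy = 0)
)

tf  <- counts / rowSums(counts)                 # share of each word within a rapper's lyrics
idf <- log(nrow(counts) / colSums(counts > 0))  # rarer across rappers means a higher idf
tf_idf <- sweep(tf, 2, idf, `*`)                # multiply each word's column by its idf

round(tf_idf, 3)
```

A word that every rapper uses (money) gets an idf of zero and drops out, while a word used by only one rapper (yeezy) ends up with the highest score for that rapper.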
Now for the results on the rap dataset:
totalword <- wdf %>%
  group_by(rapper) %>%
  summarize(total = sum(n))
book_words <- left_join(wdf, totalword)
totalword
# A tibble: 36 x 2
   rapper            total
   <chr>             <int>
 1 A$AP Ant           3747
 2 A$AP Rocky        12098
 3 Action Bronson     7496
 4 André 3000         1142
 5 Bas                7001
 6 Big L             15752
 7 Chance The Rapper 10844
 8 Childish Gambino  13645
 9 Common            11975
10 CunninLynguists    9618
# ... with 26 more rows
book_words <- book_words %>%
  bind_tf_idf(word, rapper, n)
book_words
# A tibble: 108,957 x 7
# Groups:   rapper
   rapper            word      n total     tf    idf   tf_idf
   <chr>             <chr> <int> <int>  <dbl>  <dbl>    <dbl>
 1 Tyler The Creator fuck    726 14908 0.0487 0.0572 0.00278
 2 Montana of 300    nigga   629 18510 0.0340 0.0282 0.000957
 3 Montana of 300    bitch   482 18510 0.0260 0.0572 0.00149
 4 A$AP Rocky        nigga   461 12098 0.0381 0.0282 0.00107
 5 the_notorious_big nigga   440 15618 0.0282 0.0282 0.000794
 6 Deniro Farrar     nigga   413  7089 0.0583 0.0282 0.00164
 7 Tyler The Creator nigga   387 14908 0.0260 0.0282 0.000731
 8 J Cole            nigga   366  9440 0.0388 0.0282 0.00109
 9 Royce Da 59       nigga   355 17105 0.0208 0.0282 0.000585
10 Big L             nigga   326 15752 0.0207 0.0282 0.000583
# ... with 108,947 more rows
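As a sanity check, the first row of this output can be reproduced by hand. This is only a sketch: the count of 34 rappers using the word is my assumption, inferred from the printed idf of 0.0572, and is not read from the data.

```r
# Values read from the first row of the output above
n         <- 726    # times the word appears for Tyler The Creator
total     <- 14908  # his total word count in wdf
n_rappers <- 36     # number of rappers (documents) in the dataset

# Assumption: the word shows up for 34 of the 36 rappers,
# inferred from the printed idf rather than counted in the data
n_rappers_with_word <- 34

tf  <- n / total
idf <- log(n_rappers / n_rappers_with_word)  # natural log
signif(c(tf = tf, idf = idf, tf_idf = tf * idf), 3)
```

The low idf values in the table explain why the most frequent words end up with small tf-idf scores: almost every rapper uses them, so they say little about any single one.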
# top 15 words by tf-idf for each rapper, one panel per rapper
book_words %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(rapper) %>%
  top_n(15, tf_idf) %>% # rank by tf_idf explicitly instead of relying on the last column
  ungroup() %>%
  ggplot(aes(reorder(word, tf_idf), tf_idf, fill = rapper)) +
  geom_col(show.legend = F) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~rapper, ncol = 3, scales = "free") +
  coord_flip()
Famous rappers often have their own slang and nicknames. Lil Wayne, for example, is also known as Weezy or Carter. Similarly, The Notorious B.I.G. calls himself Biggie Smalls or Big Poppa. This can be seen in figure 5.
I love it when you call me big poppa
We can also see that some words end in i. For example, reply is converted to repli by
SnowballC::wordStem('reply',language = "eng")
I don’t know why this happens. Could someone explain it?
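One likely explanation, as far as I can tell from how the Snowball English stemmer is described: one of its steps rewrites a final y as i when the letter before it is not a vowel and is not the first letter of the word. That way reply, replies and replied should all collapse to the same stem, repli. The function below is a toy imitation of that single rule in base R, written for illustration; it is not SnowballC’s code and it skips every other stemming step.

```r
# Toy imitation of one Snowball English rule (hypothetical helper, not part of SnowballC):
# a final "y" becomes "i" when the preceding letter is a non-vowel
# that is not the first letter of the word.
stem_final_y <- function(word) {
  sub("^(.+[^aeiouy])y$", "\\1i", word)
}

stem_final_y(c("reply", "cry", "say", "by"))
# "repli" "cri" "say" "by"
```

So the trailing i is expected: stems are grouping keys, not dictionary words.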