What started off as a project to learn more about language modeling turned into a trove of really interesting data. I scraped together a dataset of hundreds of thousands of rap verses and created a rap language model. The details aren't important, but the result is that each word is represented as a collection of numbers which can be used to determine relationships between words, and in this case, the artists using them. I'm going to try to update this website frequently with findings from the data, but I've put together a summary here to get started.
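For the curious, here is a rough sketch of what "a collection of numbers" buys you. The vectors below are made up and only three numbers long, and averaging word vectors to stand in for a verse or an artist is an assumption for illustration, not necessarily how this site does it. The idea is just that words (or verses) whose numbers point in a similar direction score close to 1 under cosine similarity.

```python
import math

# Toy word vectors. Real models use many more dimensions;
# these values are invented purely for illustration.
WORD_VECTORS = {
    "money": [0.9, 0.1, 0.0],
    "cash": [0.8, 0.2, 0.1],
    "love": [0.0, 0.9, 0.3],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word):
    """Return the other known word whose vector is closest to `word`."""
    target = WORD_VECTORS[word]
    others = [w for w in WORD_VECTORS if w != word]
    return max(others, key=lambda w: cosine_similarity(target, WORD_VECTORS[w]))


def average_vector(words):
    """Average the vectors of the known words (an assumed way to
    summarize a verse or an artist as a single vector)."""
    vectors = [WORD_VECTORS[w] for w in words if w in WORD_VECTORS]
    if not vectors:
        raise ValueError("no known words to average")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


if __name__ == "__main__":
    print(most_similar("money"))
    verse = average_vector(["money", "cash"])
    print(cosine_similarity(verse, WORD_VECTORS["love"]))
```

Under these assumptions, comparing two artists comes down to building one averaged vector per artist and calling `cosine_similarity` on the pair.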
Check out the two sections below to explore the lyrical similarity between artists, and the syntactical similarity between words. If you want to dig in more, check out the "Explore Words" and "Explore Artists" sections. For the aspiring rappers among you, try your hand at "Spit a Verse" to see which rappers are most similar to your rap.
Every rapper likes to brag about their lyricism being leagues above their competition. While we can't currently measure cleverness, it turns out that unless you're literally speaking a different language, you probably have a closely related lyrical peer. The only exceptions are E-40 and Mac-Dre, who are in their own little bubble (middle left).
You can hover over the points on the graph below to see how rappers tend to cluster lyrically.
An interesting part of this graph is the difference across eras. 2010s rappers, in purple, are almost completely distinct from rappers from the 1980s and 1990s, in pink and green, respectively.
Here's the same kind of graph as above, but this time showing the similarity between the 1000 most common words. I've identified a handful of common rap themes, which are colored accordingly, and you can see some pretty strong clustering here. In the top left corner you'll find that Drugs, Money, and Partying all ended up clustering pretty close to one another.