Analysing The Top 100 Weekly Charts From 1958 To 2019 Using AI
I’m a big fan of Music & Data Science and wanted to do a project combining the two. I originally set out to find some data containing the charts with the songs, artists, lyrics and genre but I couldn’t find anything available online. Although the data exists, it is disjointed; the charts, lyrics and genres are all found on totally different websites. Therefore, I decided to gather the data myself and do some analysis. I wanted to understand how different genres have become more and less popular over the years. I also wanted to visualise the lyrics between genres and build an AI to try and predict the genre using only the lyrics.
For the purpose of this project I will be specifically looking at the top 100 charts on a weekly basis from 1958 to 2019 (60+ years). I will pull in the weekly rank, the song name, artist name, genre & lyrics. The genre of a song can be heavily debated, so I will consider multiple sources for the genre to try to mitigate this.
I will then do some analysis to:
- Trend the proportion of the weekly charts held by each genre
- Analyse which artists have performed the best
- Analyse which songs have performed the best
- Create a custom wordcloud for each genre using the lyrics
- Develop an LSTM neural network to predict the genre using the lyrics
The first step was to gather the data. In the interest of everyone’s time, I won’t delve too deeply into the approach. All of the scripts used are hosted on GitHub and a link will be provided at the bottom if you would like to explore the code yourself.
I used billboard.com to gather data on the top 100 songs for the last 60 years. Since the data is on a weekly basis, I had to web scrape data from over 3k pages and pull in data for over 300k songs.
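As a rough sanity check on the page count, the weekly chart dates can be enumerated directly. This is a minimal sketch and not the script from the repository: the start date, the end date and the URL pattern are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Assumed values for illustration; the real scraper may use different ones.
FIRST_CHART = date(1958, 8, 4)
LAST_CHART = date(2019, 12, 31)
# Hypothetical URL pattern, not confirmed against the site.
URL_TEMPLATE = "https://www.billboard.com/charts/hot-100/{:%Y-%m-%d}"


def weekly_chart_dates(start, end):
    """Yield one date per week from start up to and including end."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=7)


chart_dates = list(weekly_chart_dates(FIRST_CHART, LAST_CHART))
chart_urls = [URL_TEMPLATE.format(d) for d in chart_dates]

print(len(chart_urls))        # number of weekly pages to request
print(len(chart_urls) * 100)  # upper bound on chart rows (100 songs per week)
```

With these assumed dates the list comes to roughly 3,200 weeks, which is consistent with the "over 3k pages" and "over 300k songs" figures above.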
The next stage was to pull in the genre of the song. I did this using the artist name as opposed to the song name and considered multiple sources: Wikipedia.com & AllMusic.com. The issue with the genre pulled from AllMusic.com is that Pop & Rock are combined, whereas the match rate on Wikipedia was significantly lower. Due to the large number of unique artists I had to web scrape data from over 25k pages at this point.
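One way to combine two genre sources with these trade-offs is a simple priority lookup. The sketch below is only an illustration: the priority order (Wikipedia first because it separates Pop and Rock, AllMusic as the fallback because it matches more artists) is my assumption, and the lookup tables are made up.

```python
def resolve_genre(artist, wikipedia_genres, allmusic_genres):
    """Return (genre, source) for an artist, preferring the finer-grained source."""
    key = artist.strip().lower()
    if key in wikipedia_genres:
        return wikipedia_genres[key], "wikipedia"
    if key in allmusic_genres:
        return allmusic_genres[key], "allmusic"
    return None, None


# Made-up lookup tables keyed by lower-cased artist name.
wikipedia_genres = {"artist a": "Rock"}
allmusic_genres = {"artist a": "Pop/Rock", "artist b": "R&B"}

for name in ["Artist A", "Artist B", "Artist C"]:
    print(name, resolve_genre(name, wikipedia_genres, allmusic_genres))
```

Returning the source alongside the genre makes it easy to measure the match rate of each site afterwards.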
Finally, I pulled in the lyrics from chartslyrics.com. There was a very large number of unique songs and hence this step required over 50k web scrapes. The end result had a fairly high match rate.
At this point, I had finally obtained & cleaned the data I needed and could begin the analysis. I first looked at visualising how different genres have gained and lost popularity over the last 60 years. The results can be seen below.
Interestingly, although “Pop” is often considered to be the most “Mainstream” genre, it has spent a very small amount of time in 1st place over the last 60 years. What’s also obvious is the clear decline in the popularity of “Rock” since its peak in the early 80s. At approximately the same time we begin to see the rise of “Hip Hop”, continuing all the way to the present, where hip hop dominates the weekly top 100 charts. “R&B” had its peak popularity in the 90s and 00s and has performed fairly consistently apart from this.
For those that find the above visualisation slightly hard to interpret, below is a more traditional (albeit slightly busy) line graph showing the % of the top 100 charts that each genre occupied.
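The percentage behind a graph like this is just each genre's share of the chart entries in a given week. A minimal sketch, using made-up rows and hypothetical field names rather than the real dataset:

```python
from collections import Counter, defaultdict

# Made-up rows: (chart_week, rank, genre). Real data would have 100 rows per week.
rows = [
    ("1985-01-05", 1, "Rock"),
    ("1985-01-05", 2, "Pop"),
    ("1985-01-05", 3, "Rock"),
    ("1985-01-05", 4, "R&B"),
    ("2018-06-02", 1, "Hip Hop"),
    ("2018-06-02", 2, "Hip Hop"),
    ("2018-06-02", 3, "Pop"),
    ("2018-06-02", 4, "Hip Hop"),
]


def genre_share_by_week(rows):
    """Map each chart week to {genre: percentage of that week's chart entries}."""
    counts = defaultdict(Counter)
    for week, _rank, genre in rows:
        counts[week][genre] += 1
    shares = {}
    for week, genre_counts in counts.items():
        total = sum(genre_counts.values())
        shares[week] = {g: 100 * n / total for g, n in genre_counts.items()}
    return shares


shares = genre_share_by_week(rows)
for week in sorted(shares):
    print(week, shares[week])
```

Plotting one line per genre over the sorted weeks (or over yearly averages, to smooth the noise) would give a chart of the kind described.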
Top Performing Artists
Given that the data that I’ve gathered also contained the rank of each song on a weekly basis, I thought it would be interesting to see which artists were consistently in the charts.
For each artist, I calculated how many weeks they had a song in the number one spot in the charts. The below table shows the top artists. Mariah Carey has spent the most time at the top of the charts with 76 weeks across the 60 years.
I then calculated how many weeks each artist had songs in the top 10 spots in the charts. The below table shows the top artists. Mariah Carey has spent the most time with songs in the top 10 of the charts with 247 weeks across the 60 years.
I then calculated how many weeks each artist had songs in the top 100 spots in the charts. The below table shows the top artists. Drake has spent the most time with songs in the top 100 of the charts with 920 weeks across the 60 years.
Drake, Madonna & Mariah Carey all did well across the three leaderboards above. Given that Drake is still performing, it’s likely that he will overtake the others in coming years. Rihanna is also consistently doing quite well and should move up the leaderboards over the next few years.
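The three leaderboards are the same calculation with a different rank cut-off. The sketch below uses made-up rows and is not taken from the repository. The text does not say whether two songs by the same artist in the same week count once or twice, so the function supports both through a hypothetical distinct_weeks flag.

```python
from collections import defaultdict

# Made-up rows: (chart_week, rank, artist).
rows = [
    ("2000-01-01", 1, "Artist A"),
    ("2000-01-01", 5, "Artist A"),   # second song in the same week
    ("2000-01-01", 40, "Artist B"),
    ("2000-01-08", 1, "Artist B"),
    ("2000-01-08", 9, "Artist A"),
    ("2000-01-15", 1, "Artist B"),
]


def weeks_in_top(rows, n, distinct_weeks=False):
    """Count chart entries ranked n or better for each artist.

    With distinct_weeks=True, several songs in one week count as a single week.
    """
    entries = defaultdict(list)
    for week, rank, artist in rows:
        if rank <= n:
            entries[artist].append(week)
    if distinct_weeks:
        return {artist: len(set(weeks)) for artist, weeks in entries.items()}
    return {artist: len(weeks) for artist, weeks in entries.items()}


for n in (1, 10, 100):
    ranked = sorted(weeks_in_top(rows, n).items(), key=lambda kv: kv[1], reverse=True)
    print(f"top {n}:", ranked)
```

Grouping by song title instead of artist gives the equivalent leaderboards for individual songs.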
Top Performing Songs
I replicated the above analysis but looked exclusively at individual songs as opposed to artists.
I calculated how many weeks each song had the number one spot in the charts. The below table shows the top songs. “One Sweet Day By Mariah Carey” & “Despacito By Luis Fonsi” are tied first, spending the most time at the top of the charts with 16 weeks across the 60 years.
I then calculated how many weeks each song was in the top 10 spots in the charts. The below table shows the top songs. “Shape Of You By Ed Sheeran”, “Girls Like You By Maroon 5” & “Sicko Mode By Travis Scott” are all tied first with 32 weeks in the top 10 charts.
I then calculated how many weeks each song was in the top 100 spots in the charts. The below table shows the top songs. “Radioactive By Imagine Dragons” is first with 86 weeks in the top 100 charts.
The top performing songs change quite dramatically depending on whether you look at weeks spent in the top 1, 10 or 100 spots, whereas the artist leaderboards are more consistent.
Genre Lyrics Wordclouds
I produced a collection of wordclouds for each of the genres, looking at the unique songs that appeared in the top 100 charts at any point over the last 60 years. I used custom shapes for the wordclouds to try and represent the genre. Please ignore any offensive language; the lyrics are taken as is, with no words removed apart from stop words (e.g. the, to, then etc.)
The words used in the lyrics for “Country” songs are very similar to most of the other genres. There are a few distinct words appearing more often in “Country” than other genres such as: man, old, little & time.
The words used in the lyrics for “Pop” songs are very similar to most of the other genres. There are a few distinct words appearing more often in “Pop” than other genres such as: now, heart, dance & night.
The words used in “Hip Hop” songs are fairly diverse with a lot of unique words appearing, especially offensive & aggressive terms.
The words used in “Rock” songs are quite close to “Pop” & “Country”. This arguably could be due to the ambiguity when defining the genres or could be a result of the genres being fairly similar.
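Word cloud tools generally size each word by how often it occurs, so the input for each genre is a table of word counts with the stop words removed. A minimal sketch of that counting step, using a tiny illustrative stop-word list and made-up lyrics (the real analysis was done in R and may differ):

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "to", "then", "a", "and", "i", "you", "me", "my", "it", "in", "of"}


def word_frequencies(lyrics_by_song):
    """Count words across songs after lower-casing and dropping stop words."""
    counts = Counter()
    for lyrics in lyrics_by_song:
        words = re.findall(r"[a-z']+", lyrics.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


# Made-up lyrics, one string per unique song in a genre.
country_lyrics = [
    "The old man and the little old truck",
    "Take me back to the old time",
]
country_counts = word_frequencies(country_lyrics)
print(country_counts.most_common(3))
```

Because each unique song is counted once, a song that stayed in the charts for many weeks does not dominate its genre's cloud.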
“Techno” has a fair few words that appear often that are different to the other genres. What’s also interesting is that the terms that appear often are much larger than the others, implying that the diversity of words used in “Techno” is fairly limited. This aligns with “Techno” generally being played in clubs with repetitive lyrics.
“Jazz” lyrics are fairly similar to “R&B”, although there are certainly some distinct words such as: sweet, world & ooh.
“Reggae” is by far the most different when compared to the other genres. This makes sense as the majority of reggae songs use a form of “broken English”. Words such as: love & baby appear often nevertheless.
“R&B” seems to have quite a diverse set of words. There is a lot of similarity with “Pop” & “Blues”. The relationship with “Pop” makes sense given the evolution of “R&B”.
I hope the above illustrations helped to highlight the differences between the terms used in the genres. It’s also quite clear that: love, baby & like are by far the most popular words used in songs (at least those that appear in the charts at some point). The differences in lyrics between the genres should hopefully be evident in the next stage of the project where the neural network tries to predict the genres using the lyrics.
Using AI To Predict Genre Using Lyrics
As mentioned, I decided to build a neural network to see how viable it is to determine a song’s genre using only the lyrics. Recent breakthroughs in neural networks and natural language processing have massively improved the viability of using computers to automatically classify text. This approach is often used in areas such as sentiment analysis or classifying reviews.
For the purpose of this article I won’t spend too much time discussing the technicalities of the approach. Effectively, the approach is to first calculate TF-IDF scores, which are used to filter out words that commonly appear across genres and leave words that are relatively unique (and hopefully useful). These words are then used in a “bag of words” model and fed into an LSTM neural network. LSTM (long short-term memory) networks are loosely modelled on human memory, having both a short term memory capability and a long term memory capability.
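To make the TF-IDF step concrete, here is a minimal sketch of the textbook formula, tf × log(N / df), written in plain Python. It is not the code from the repository: treating each genre as one document, the made-up lyrics and the "keep anything scoring above zero" rule are all assumptions, and libraries such as scikit-learn use smoothed variants of the formula.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: score} dict per document using tf * log(N / df)."""
    n_docs = len(documents)
    tokenised = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for words in tokenised:
        doc_freq.update(set(words))
    scores = []
    for words in tokenised:
        counts = Counter(words)
        scores.append({
            w: (c / len(words)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return scores


# Made-up "documents": all lyrics for a genre joined into one string.
genres = ["Country", "Pop", "Techno"]
documents = [
    "love baby old truck old road",
    "love baby dance night dance",
    "love bass bass bass drop",
]
genre_scores = tf_idf(documents)
for genre, scores in zip(genres, genre_scores):
    kept = sorted(w for w, s in scores.items() if s > 0)
    print(genre, kept)
```

A word such as "love" that appears in every genre gets a score of zero and drops out, while words concentrated in one genre score highest.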
The AI was built to try and classify songs into 10 genres. As a baseline, choosing genres uniformly at random would give an accuracy of 10%, and categorising all songs as “Pop” would give an accuracy of 30%. In the end the AI was able to correctly classify over 80% of the songs it was given.
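The two baselines can be checked with a few lines of arithmetic. The label counts below are made up, chosen only so that “Pop” is 30% of the songs across 10 genres, and the other genre names are placeholders.

```python
from collections import Counter

# Made-up label counts: 10 genres, with "Pop" making up 30% of the songs.
label_counts = Counter({
    "Pop": 30, "Genre 2": 20, "Genre 3": 15, "Genre 4": 10, "Genre 5": 8,
    "Genre 6": 6, "Genre 7": 4, "Genre 8": 3, "Genre 9": 2, "Genre 10": 2,
})
total = sum(label_counts.values())
n_classes = len(label_counts)

# Majority baseline: always predict the most common genre.
majority_genre, majority_count = label_counts.most_common(1)[0]
majority_accuracy = majority_count / total

# Random baseline: each genre is guessed with probability 1 / n_classes, so the
# expected accuracy is sum(p_genre * 1 / n_classes) = 1 / n_classes.
random_accuracy = sum((n / total) * (1 / n_classes) for n in label_counts.values())

print(majority_genre, majority_accuracy)
print(random_accuracy)
```

The random baseline depends only on the number of genres, whereas the majority baseline depends on how unbalanced the genres are, which is why it is the harder one to beat.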
With further development and testing, I’m sure this figure can be increased significantly. It’s also worth mentioning that the AI is only looking at songs in the top 100 weekly charts for the last 60 years. More data could easily be gathered by pulling in songs that didn’t make it to the charts. Neural networks generally perform much better with more data so this would be an easy way to potentially make the AI more accurate.
That’s the end of the analysis for now. I hope it was insightful and easy to understand. For a first version, the AI looks promising. As a proof of concept it has demonstrated the capability of using neural networks to classify unstructured texts such as song lyrics.
Eventually I would like to develop a generative adversarial network where the AI can generate its own sound frequencies to create music but that’s a project for further down the line.
I deliberately kept this article fairly nontechnical. If you would like to explore the code yourself and run through this analysis on your own, all of the scripts are hosted on the GitHub link below.
The data engineering part of the process was primarily done using Python, BeautifulSoup, Requests & Pandas. Unfortunately the size of the data is too large to host on GitHub but you can run the web scraping scripts yourself and this will give you the data. (warning: it takes about 7 hours on my server)
The data cleaning took quite a long time as the web scrape didn’t return perfect results and needed a lot of RegEx & NLP to wrangle it into a usable format. This is primarily due to the fact that all of the columns come from different datasets and needed to be fuzzy matched together. Surprisingly the match rate was quite good for the lyrics and the genres once the data was clean.
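As an illustration of what cleaning followed by fuzzy matching can look like, here is a minimal sketch using only Python's standard library (re and difflib). The normalisation rules, the 0.85 cut-off and the example titles are my assumptions, and the repository may well use a different matching approach.

```python
import difflib
import re


def normalise(title):
    """Lower-case, drop bracketed text and punctuation, collapse whitespace."""
    title = re.sub(r"\(.*?\)|\[.*?\]", " ", title.lower())
    title = re.sub(r"[^a-z0-9 ]", "", title)
    return " ".join(title.split())


def fuzzy_match(chart_title, lyric_titles, cutoff=0.85):
    """Return the closest lyrics-site title, or None if nothing clears the cutoff."""
    lookup = {normalise(t): t for t in lyric_titles}
    hits = difflib.get_close_matches(
        normalise(chart_title), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None


# Illustrative titles as they might appear on two different sites.
lyric_titles = ["Dont Stop Believing", "One Sweet Day", "Shape of You"]
print(fuzzy_match("Don't Stop Believin' (Live)", lyric_titles))
print(fuzzy_match("Completely Different Song", lyric_titles))
```

Normalising both sides first does most of the work; the similarity cut-off then only has to absorb small spelling differences, which keeps false matches down.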
The data visualisation was primarily done in R, because R has the wonderful Wordcloud2 package (warning: the package currently has some bugs and might not work properly without some fixes). I also used ggplot2 for most of the graphs as the aesthetics are generally cleaner and more customisable than Python’s Matplotlib.
The neural network was built in Python using Keras with a TensorFlow backend and trained on the GPU. The architecture and build of the neural network are also included on GitHub. The architecture was relatively simple: it contained a word embedding layer, an LSTM and some dense layers with dropout. The dropout was included to reduce overfitting, and the LSTM was chosen because LSTM networks have recently been doing very well with textual data.
A natural next step would be to use pretrained word embeddings such as word2vec or GloVe. I may do this at some point.
All scripts can be found at – https://github.com/DataInspector/MusicOverTime