In my final week of bootcamp, we were tasked with presenting our capstones. Our capstones were to be the culmination of everything we had learned in class. From week one’s Python challenges all the way to learning about object-oriented programming (OOP), we were finally down to our last project. Our capstones could be anything we wanted them to be, as long as they fell within data science. I chose to center my project around Natural Language Processing (NLP). The main topic of my capstone was music, more specifically, songs and their genres. My goal was to build two classifiers, one using audio attributes and one using lyrics, to predict a song’s genre. This turned out to be a much more time-consuming project than I initially thought.
Before starting work on my capstone, I spent the first few weeks brainstorming what I wanted to do. Since I’ve always loved music and I spend my weekend nights DJ-ing, I figured I’d focus on songs. Once I decided on building a genre classifier, I started gathering data. I got two of my main datasets from Kaggle. These consisted of the audio attributes and the artist names along with their genres. I cleaned the audio attributes dataset first. I wanted to see how these attributes reflected the type of music each artist made, so I indexed the songs by artist name and looked at the average values for their music. I found that different levels were consistent with different genre classifications. Hip hop artists, for example, had high levels of speechiness, popularity, and explicitness. Other genres, like pop, had lower speechiness, high popularity, low explicitness, and high liveness. This helped me get an idea of how these attributes affect the classification of a genre. The problem with the dataset, however, was that each artist had a list of multiple genre classifications. I would have to narrow these down to one overarching genre, which would fall under an umbrella such as ‘pop’, ‘classical’, ‘metal’, ‘latin’, etc. I ended up creating a function that counted the number of times it saw each overarching genre in the list, and the genre with the highest count became the classification for that song. This process took me a while to debug, and looking back, I realize it probably wasn’t the best way to classify my songs, but I will go more in depth about my conclusions at the end.
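To make the counting idea concrete, here is a minimal sketch of a function like the one described above. The umbrella list and the substring matching rule are my assumptions for illustration, not the exact code from the project.

```python
from collections import Counter

# Hypothetical umbrella genres; the real project used its own list.
UMBRELLA_GENRES = ["pop", "classical", "metal", "latin", "hip hop", "rock"]


def overarching_genre(genre_tags):
    """Return the umbrella genre that appears most often in a list of tags."""
    counts = Counter()
    for tag in genre_tags:
        tag = tag.lower()
        for umbrella in UMBRELLA_GENRES:
            # Simple substring match: "dance pop" counts toward "pop".
            if umbrella in tag:
                counts[umbrella] += 1
    if not counts:
        return None  # no tag matched any umbrella genre
    # On a tie, most_common keeps the genre that was counted first.
    return counts.most_common(1)[0][0]


print(overarching_genre(["dance pop", "pop rap", "latin pop", "reggaeton"]))
```

A tag such as "latin pop" counts toward two umbrellas at once, which is one reason a simple highest-count rule can mislabel artists who sit between genres.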
Next was web scraping, and this process was rather daunting. I’d say it was actually the most challenging part for me. After messing around with a finicky web scraper, I found the lyricsgenius Python package. This scraper is pretty much already set up for you to easily grab lyrics off genius.com, but you do have to take the necessary steps to set up your access to the Genius API. Once I had installed the package and gotten familiar with the code, I made some changes that would allow me to iterate through a list of artist names and add the song lyrics to a dictionary keyed by artist. It seemed like an easy task, but I ran into many challenges, such as the lyrics not being added to the dictionary correctly, the web scraper running for days at a time, and my not putting checkpoints in place to make sure the data was being collected properly. This took about a week in itself, but I eventually got the data that I needed.
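Looking back, a checkpointed loop would have saved me a lot of pain. Below is a rough sketch of what that could look like. The `fetch_lyrics` argument and the `fake_fetch` stand-in are hypothetical placeholders so the example runs offline; in practice you would pass in a function that calls the Genius API.

```python
import json
import tempfile
from pathlib import Path


def collect_lyrics(artists, fetch_lyrics, checkpoint_path, every=2):
    """Fetch lyrics per artist, saving progress to disk along the way."""
    path = Path(checkpoint_path)
    # Resume from an earlier run if a checkpoint file already exists.
    lyrics = json.loads(path.read_text()) if path.exists() else {}
    for i, artist in enumerate(artists, start=1):
        if artist in lyrics:
            continue  # already collected in a previous run
        try:
            lyrics[artist] = fetch_lyrics(artist)
        except Exception as err:  # a real scraper should catch narrower errors
            print(f"skipping {artist}: {err}")
            continue
        if i % every == 0:
            path.write_text(json.dumps(lyrics))  # periodic checkpoint
    path.write_text(json.dumps(lyrics))  # final save
    return lyrics


# Stand-in fetcher (hypothetical) so the sketch runs without network access.
def fake_fetch(artist):
    if artist == "Unknown Artist":
        raise LookupError("no songs found")
    return f"placeholder lyrics for {artist}"


checkpoint = Path(tempfile.mkdtemp()) / "lyrics_checkpoint.json"
result = collect_lyrics(["Artist A", "Unknown Artist", "Artist B"], fake_fetch, checkpoint)
print(sorted(result))
```

Because the loop skips artists that are already in the checkpoint file, a scraper that dies overnight can pick up where it left off instead of starting over.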
To clarify, for every artist, I added their most popular song from Genius to a dictionary, converted that dictionary into a dataframe, and then merged that dataframe with the dataset I got from Kaggle.
I continued cleaning my data by removing any NaN values and duplicates. I dropped unnecessary columns and then ran my lyrics column through a function I made that removed stop words and punctuation, lowercased the text, and lemmatized the words. I tokenized the column later on as well. Tokenizing allowed me to split all the words into their own “tokens”. Once I was done with this, I was finally able to concatenate this dataset with the newly created genres column. I also added an additional lyrics dataset from Kaggle that had more accurate song lyrics for each song listed by an artist in my dataset. This gave me a better array of words and a clearer view of how they’re used differently in each genre.
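Here is a small sketch of what a cleaning function along those lines can look like. It is not my original function: the stop-word list is a tiny made-up sample, and lemmatization is left as an optional hook because a real lemmatizer needs a library such as NLTK or spaCy.

```python
import string

# Tiny illustrative stop-word list; a real run would use a full list.
STOP_WORDS = {"the", "a", "an", "and", "i", "you", "me", "my", "is", "in", "to", "of"}


def clean_lyrics(text, lemmatize=None):
    """Lowercase, strip punctuation, drop stop words, and return tokens."""
    text = text.lower()
    # Remove ASCII punctuation; curly quotes would need extra handling.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()  # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]
    if lemmatize is not None:
        # Optional hook: pass in a function that maps a token to its lemma.
        tokens = [lemmatize(t) for t in tokens]
    return tokens


print(clean_lyrics("Baby, I love you! You love me, too."))
```

One thing to watch is that stripping punctuation turns contractions like "don't" into "dont", so the stop-word list needs to account for that.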
This portion of my capstone really pushed my definition of EDA when it comes to NLP. I initially started out with a simple count of the top words in each genre, but this was not nearly enough for me to really understand the words I was dealing with. After discussing with my instructors and coming to the realization that I would need to do way more word exploration, I went back to the drawing board and implemented Word2Vec. Word2Vec takes every word in a corpus and assigns a vector to it. This way you can see how one word relates to another in space. I was able to type in words like ‘love’ and ‘baby’ and find the five most similar words to each.
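The "most similar words" lookup boils down to comparing vectors with cosine similarity. The sketch below shows that idea with made-up three-dimensional vectors; it does not train Word2Vec, and the numbers are purely illustrative. A trained model would supply real vectors with many more dimensions.

```python
import math

# Made-up 3-dimensional vectors purely for illustration.
TOY_VECTORS = {
    "love": [0.9, 0.1, 0.0],
    "baby": [0.8, 0.3, 0.1],
    "heart": [0.85, 0.15, 0.05],
    "guitar": [0.1, 0.9, 0.2],
    "drum": [0.0, 0.8, 0.4],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def most_similar(word, vectors, topn=5):
    """Rank every other word by cosine similarity to `word`."""
    target = vectors[word]
    scores = [(other, cosine(target, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


for neighbor, score in most_similar("love", TOY_VECTORS, topn=2):
    print(neighbor, round(score, 3))
```

With gensim, the equivalent query on a trained model is `model.wv.most_similar("love", topn=5)`.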
I also used Doc2Vec to create vectors for variable-length pieces of text. This means pieces of text that are similar in context will be closer together in space. I was able to see how similar one piece of text was to text in other tagged documents. Lastly, I performed some sentiment analysis on each genre, and I found that there was a high neutral outlier at zero for every genre. Removing this outlier left a fairly normal distribution for all except classical music.
I finished out my NLP analysis with some visualizations using Plotly and t-SNE, which are both great for visualizing high-dimensional data.
When it came down to modeling, I created two different models: one that used audio attributes to classify genre and another that used the lyrics I had just performed my NLP on. I went with decision trees and bagging for my baseline, and I also used cross-validation and a random forest grid search to optimize my scores. For both of my models, my decision tree scores were very low. My models kept scoring at around 50%, which felt like flipping a coin for a result. For my audio attributes model, I fine-tuned things by grouping the genre classifications with the lowest scores together to boost their signal.
I also removed features that had very weak correlations, and I reran my model using the random forest grid search. This helped boost my training score up to 70%, but my testing score was still at a mere 57%. My lyrics model scored much better when using a random forest grid search. I only added one attribute into the dataset, so the model was training on the lyrics along with a few other features such as popularity and sentiment. I ended up with a training score of 86% and a testing score of 72%. I did receive much higher scores with the random forest models, but the gap between training and testing scores showed that my models were severely overfit. After some reflection and running back through my project, I was able to draw a number of conclusions.
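For anyone curious what a random forest grid search looks like in scikit-learn, here is a minimal sketch. It runs on synthetic data from `make_classification`, not my song data, and the parameter grid is an assumption chosen to keep the example quick. The point is to show how comparing the training and testing scores exposes overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the real features were audio attributes or lyrics.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Hypothetical grid; shallower trees and larger leaves tend to curb overfitting.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [3, 6],
    "min_samples_leaf": [1, 5],
}

# 3-fold cross-validation over every combination in the grid.
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
search.fit(X_train, y_train)

# score() uses the best estimator, refit on the full training set.
train_score = search.score(X_train, y_train)
test_score = search.score(X_test, y_test)
print(search.best_params_)
print(f"train={train_score:.2f} test={test_score:.2f} gap={train_score - test_score:.2f}")
```

A large gap between the two scores, like the 86% versus 72% I saw, is the signal to constrain the trees further or revisit the features.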
Well, what are my next steps to improve my model? I found that my features caused many of my issues when it came to model performance. The method of genre classification that I used was not the best (as shown by my training results), and though I explored my audio attributes, I did not use them to my full advantage. I would need to revisit my features and genre classes. The classification system can be improved by exploring the audio quality levels themselves and classifying the songs based on the average set of levels for each artist. Largely, what I would need to improve is how I explored my features and the methods I used to feed them into the model. My NLP analysis pointed out a few key things. The first is that I made an error in not omitting Spanish or Latin stop words. These showed up in my plots, and they dominated my word count for the Latin genre. The second is that, since I stripped my lyrics classifier of all audio attributes, I did not get a chance to run it in tandem with my audio features model. It would be beneficial to combine both datasets and train another model on a master dataset. Ultimately, my lyrics classifier and my audio classifier are promising, but more feature selection and exploration are necessary if I want this to become a reliable way to predict the genres of songs.
What’s Next After the Bootcamp Life?
Now that I’ve completed General Assembly’s Data Science bootcamp, the first thing I’m going to do is take a brain break. No, seriously, my brain has been in overdrive and we learned a lot of stuff in a short time. I am now on the cusp of the job search. I realize that I’ve learned a lot, and I should reflect on what I’ve done and where I started from. Twelve weeks ago, I really knew nothing. Aside from my own efforts at coding and taking self-paced courses, I was nowhere near as knowledgeable as I am now. I think the best part about this experience is knowing that I will only get better from here. Now I’ve got the ball rolling on my data science path, and I am excited to see what the future holds. I want to give a special shout-out to my instructors, BingYune and James, for all of their continued support throughout the program. I couldn’t have asked for better people to teach me the foundations of data science!
If you are interested in taking a bootcamp and really investing in your future, I highly recommend the General Assembly Data Science Program. It was very challenging, but it pushed me to be better, and I was given tools that I’ll use for the rest of my life.