Subreddit Identification Using NLP Analysis

Yetti Obasade
5 min read · Apr 5, 2021

For this project, the goal was to collect data from two similar subreddit forums and identify which forum a given post came from. To do this, I used Natural Language Processing (NLP) techniques along with the NLTK toolkit. I will go into detail about the project and its challenges in this post.

To start, I chose two subreddit forums, rollerblading and rollerskating, and used them as inputs to my web scraping function. The function pulled the created_utc timestamp of the most recent post and used it to request the posts created before that point, looping to pull 100 posts at a time. Each batch of pulled posts was then transformed into a dataframe and stored in a list.
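A minimal sketch of what that scraping function might look like, assuming the Pushshift submission-search endpoint (the post does not name the API, so the URL, parameters, and field names here are assumptions):

```python
import time

import pandas as pd
import requests

def scrape_subreddit(subreddit, n_batches=10):
    """Pull posts 100 at a time, walking backwards through created_utc."""
    url = "https://api.pushshift.io/reddit/search/submission"
    before = None          # start from the most recent post
    frames = []
    for _ in range(n_batches):
        params = {"subreddit": subreddit, "size": 100}
        if before is not None:
            params["before"] = before
        posts = requests.get(url, params=params).json()["data"]
        if not posts:
            break
        frames.append(pd.DataFrame(posts))
        before = posts[-1]["created_utc"]   # oldest post in this batch
        time.sleep(1)                       # be polite to the API
    return pd.concat(frames, ignore_index=True)

rollerblading_df = scrape_subreddit("rollerblading")
```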

The same steps were repeated for the rollerskating subreddit.

Once both datasets were gathered, I saved them as CSV files for easy loading. I converted the datasets to dataframes, dropped every column except “selftext”, “title”, and “subreddit”, and replaced the NaN values with spaces. I then created a function to clean the text and title columns of the dataframes: it removes stop words and special characters, lowercases all words, and lemmatizes them before rejoining the text.
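A sketch of that cleaning function, assuming NLTK’s English stop word list and WordNetLemmatizer (the dataframe and column names are illustrative; the same steps are applied to each subreddit’s dataframe):

```python
import re

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words("english"))   # requires nltk.download("stopwords")
lemmatizer = WordNetLemmatizer()               # requires nltk.download("wordnet")

def clean_text(text):
    """Lowercase, strip special characters, drop stop words, and lemmatize."""
    text = re.sub(r"[^a-zA-Z\s]", " ", str(text)).lower()
    words = [lemmatizer.lemmatize(w) for w in text.split() if w not in stop_words]
    return " ".join(words)

rollerblading_df = rollerblading_df[["selftext", "title", "subreddit"]].fillna(" ")
rollerblading_df["clean_text"] = rollerblading_df["selftext"].apply(clean_text)
rollerblading_df["title"] = rollerblading_df["title"].apply(clean_text)
```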

I wanted to view the top words from each dataframe, so I explored the frequency and relationships among the words using tokenization. During my exploration, I dropped all NaN values and created smaller dataframes from my data so that I could analyze the word counts. The top word for both of my subreddits was “skate”, while “wheel” came in second for rollerblading and “skating” came in second for rollerskating.

Figure: Rollerblading top words
Figure: Rollerskating top words
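A sketch of the frequency exploration, assuming NLTK tokenization and a simple counter over the cleaned text:

```python
from collections import Counter

from nltk.tokenize import word_tokenize   # requires nltk.download("punkt")

tokens = word_tokenize(" ".join(rollerblading_df["clean_text"].dropna()))
print(Counter(tokens).most_common(10))    # e.g. ('skate', ...), ('wheel', ...)
```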

I also used Word2Vec to view the similarities among the top words. I found that though “skate” was a top word for both dataframes, the words most similar to it differed immensely, reflecting the nature of the conversations taking place in each subreddit. Though rollerblading and rollerskating are very similar activities, the discussions have different content.
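A sketch of the Word2Vec comparison, assuming gensim’s Word2Vec (version 4 API) trained on the cleaned, tokenized posts of one subreddit at a time:

```python
from gensim.models import Word2Vec

sentences = [post.split() for post in rollerblading_df["clean_text"].dropna()]
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=2)

# Words most similar to the shared top word "skate" in this subreddit
print(w2v.wv.most_similar("skate", topn=5))
```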

Modeling the Data

I concatenated both dataframes and mapped zeros and ones to the subreddit column: 1 if the value is “rollerblading” and 0 if it is “rollerskating”. I then merged the “title” column with the cleaned text column to add more words to the dataset.
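A sketch of that step (dataframe names are illustrative):

```python
import pandas as pd

df = pd.concat([rollerblading_df, rollerskating_df], ignore_index=True)

# 1 for rollerblading, 0 for rollerskating
df["subreddit"] = df["subreddit"].map({"rollerblading": 1, "rollerskating": 0})

# Merge the cleaned title into the cleaned text to add more words
df["clean_text"] = df["title"] + " " + df["clean_text"]
```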

To model the data, I utilized a train test split and started with a baseline Multinomial Naïve Bayes model. This baseline gives us a point of comparison for every model that we try. I fit a count vectorizer on the cleaned text in X_train and used it to transform both X_train and X_test.

I then fit naïve Bayes to my X_train and y_train, and scored the model’s accuracy on both the training and testing sets.
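A sketch of the baseline, assuming scikit-learn’s CountVectorizer and MultinomialNB (the split settings are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(
    df["clean_text"], df["subreddit"], stratify=df["subreddit"], random_state=42
)

cvec = CountVectorizer()
X_train_cv = cvec.fit_transform(X_train)   # fit on the training data only
X_test_cv = cvec.transform(X_test)

nb = MultinomialNB()
nb.fit(X_train_cv, y_train)
print("Baseline Train score", nb.score(X_train_cv, y_train))
print("Baseline Test score", nb.score(X_test_cv, y_test))
```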

The model’s accuracy came out to be 89%, while our null accuracy was much lower. The baseline training score was very high, roughly 10 percentage points above the testing score, so we can see that the model is overfit.

Baseline Train score 0.9905437352245863
Baseline Test score 0.8929133858267716

For my second model, I implemented a pipeline with a Random Forest Classifier as my estimator of choice, and I used grid search for both of my pipelines. After instantiating the model, I set the parameters and ran the grid search.
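A sketch of that pipeline and grid search, assuming scikit-learn’s Pipeline and GridSearchCV; the parameter grid below is illustrative rather than the exact grid used:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe_rf = Pipeline([
    ("cvec", CountVectorizer()),
    ("rf", RandomForestClassifier()),
])

params_rf = {
    "cvec__ngram_range": [(1, 1), (1, 2)],
    "rf__n_estimators": [100, 120],
    "rf__max_depth": [20, 30],
    "rf__min_samples_leaf": [2, 3],
}

gs_rf = GridSearchCV(pipe_rf, params_rf, cv=5)
gs_rf.fit(X_train, y_train)   # raw text goes in; the pipeline vectorizes it
print(gs_rf.best_params_)
```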

I then scored my random forest model.
Random Forest Train score 0.9822925798086055
Random Forest Test score 0.9658593424265889

There is still evidence of overfitting, but the training score is much closer to the testing score than it was for the baseline.

For my last model, I instantiated a logistic regression method and put it through a pipeline.
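A sketch of the logistic regression pipeline, again with an illustrative grid search over the regularization strength:

```python
from sklearn.linear_model import LogisticRegression

pipe_lr = Pipeline([
    ("cvec", CountVectorizer()),
    ("lr", LogisticRegression(max_iter=1000)),
])

params_lr = {"lr__C": [0.1, 1.0, 10.0]}

gs_lr = GridSearchCV(pipe_lr, params_lr, cv=5)
gs_lr.fit(X_train, y_train)
```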

Our logistic regression training score was nearly 100%; however, the gap between the training and testing scores was wider than the random forest’s, so this model was more overfit.

Logistic Regression Train score 0.9999936080064284
Logistic Regression Test score 0.9569410526733774

Conclusions

In conclusion, in relation to our problem statement, we can predict, fairly accurately, which subreddit a submission came from. If we want the best results, we should use our random forest classifier model. It has the smallest margin between the training and testing scores, meaning that even though it is slightly overfit, it still predicts well on unseen posts.

Though logistic regression had the highest training score, there was a larger margin of overfitting between the train and test scores, and the roc_auc score was the lowest of all models at 88%.

  • The model with the highest training score was our count vectorized logistic regression model.
  • The model with the lowest margin of overfitting was our count vectorized random forest classifier.
  • Our random forest classifier also had the lowest number of false positives at 37; however, its false negatives were the highest at 97.
  • Our best parameters for random forest were:

CountVectorizer(ngram_range=(1, 2))

RandomForestClassifier(class_weight={0: 1, 1: 1},
                       max_depth=30,
                       min_samples_leaf=3,
                       min_samples_split=3,
                       n_estimators=120)

Our next steps would be to change and optimize the random forest pipeline, for example by swapping in a TF-IDF vectorizer, in order to raise the testing score and shrink the margin of overfitting; a sketch of that swap follows below. Conversely, we could also explore the logistic regression parameters to close the gap between its training and testing scores.
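Swapping a TF-IDF vectorizer into the same pipeline could look something like this (a sketch of a possible next step, not something tried in the original project):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

pipe_tfidf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("rf", RandomForestClassifier(n_estimators=120, max_depth=30)),
])
pipe_tfidf.fit(X_train, y_train)
print("TF-IDF Train score", pipe_tfidf.score(X_train, y_train))
print("TF-IDF Test score", pipe_tfidf.score(X_test, y_test))
```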
