
Sentiment Analysis with Self-Built Naive Bayes Classifier

Classification lies at the heart of both human and machine intelligence.

Introduction

I decided to initiate a sentiment analysis project to enhance my understanding of probabilistic machine learning classifiers. In this binary classification project, I built a Naive Bayes classifier from scratch and used it for sentiment classification, that is, extracting the positive or negative orientation that a writer expresses toward some object. The data comes from the NLTK movie review corpus. Reviews are split into a training set (80% of the data) and a development set (10% of the data), and within each set they are sorted by sentiment (positive/negative). The files are already tokenized, and each review is in its own file. (You can find the complete code on my GitHub.)

Two assumptions of the Naive Bayes model

Naive Bayes is a generative model that makes

  1. the bag-of-words assumption (position doesn’t matter; see the sketch after this list) and
  2. the conditional independence assumption (words are conditionally independent of each other given the class).
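As a quick illustration of the first assumption, the two token orderings below map to the same bag-of-words representation. This is a made-up example, not a review from the corpus:

```python
from collections import Counter

# Two orderings of the same review tokens
tokens_a = ["great", "plot", "terrible", "acting"]
tokens_b = ["terrible", "acting", "great", "plot"]

# Under the bag-of-words assumption only the counts matter,
# so both orderings yield the same feature representation.
assert Counter(tokens_a) == Counter(tokens_b)
```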

Main Functions

Training function

This function trains the Naive Bayes classifier. It takes two inputs: the training data and the select rate, a hyperparameter that users can set themselves. The training function computes maximum likelihood estimates from frequencies in the data. For example, to derive the maximum likelihood estimate of P(wi|c), I treat a feature as the presence of a word in the document's bag of words and compute P(wi|c) as the fraction of times the word wi appears among all words in all documents of class c. Note that I use add-1 smoothing to avoid probabilities of 0 when encountering unseen words.
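A minimal sketch of this training step is shown below. It assumes the training data is a list of (tokens, label) pairs and that the feature vocabulary has already been chosen (e.g. by the feature-selection step described later), so the names and interfaces are illustrative rather than the project's exact code:

```python
import math
from collections import Counter, defaultdict

def train(train_set, vocab):
    """Sketch of Naive Bayes training with add-1 smoothing.

    train_set: list of (tokens, label) pairs (assumed format).
    vocab: set of selected feature words (assumed to come from feature selection).
    """
    log_prior = {}
    log_likelihood = defaultdict(dict)
    labels = {label for _, label in train_set}
    n_docs = len(train_set)

    for c in labels:
        docs_c = [tokens for tokens, label in train_set if label == c]
        log_prior[c] = math.log(len(docs_c) / n_docs)

        # Count occurrences of each vocabulary word across all documents of class c
        counts = Counter(w for tokens in docs_c for w in tokens if w in vocab)
        total = sum(counts.values())

        # Add-1 smoothing: words unseen in class c still get a small nonzero probability
        for w in vocab:
            log_likelihood[c][w] = math.log((counts[w] + 1) / (total + len(vocab)))

    return log_prior, log_likelihood
```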

Test function

This function applies the classifier to the test data and takes the `argmax` to find the most probable class. Note that this function generates a dictionary `results` such that:

results[filename]['correct'] = correct class

results[filename]['predicted'] = predicted class
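A minimal sketch of such a test step, building on the training sketch above, is shown below; the test-set format and variable names are assumptions for illustration:

```python
def test(test_set, log_prior, log_likelihood):
    """Sketch of the test step: score each document and take the argmax.

    test_set: assumed to map filename -> (tokens, correct_label).
    """
    results = {}
    for filename, (tokens, correct) in test_set.items():
        scores = {}
        for c in log_prior:
            # Sum log prior and per-word log likelihoods,
            # skipping words outside the selected vocabulary
            scores[c] = log_prior[c] + sum(
                log_likelihood[c][w] for w in tokens if w in log_likelihood[c]
            )
        results[filename] = {
            "correct": correct,
            "predicted": max(scores, key=scores.get),  # argmax over classes
        }
    return results
```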

Figure 1. The Naive Bayes algorithm, using add-1 smoothing (from Jurafsky and Martin's Speech and Language Processing, Chapter 4)

Evaluate function

Given the results on the test set, this function computes precision, recall, and F1 score for each class, as well as overall accuracy, and prints out the classifier's performance.
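A sketch of this evaluation step, assuming the `results` dictionary format described above, could look like the following:

```python
def evaluate(results):
    """Sketch: per-class precision, recall, F1, plus overall accuracy."""
    classes = {r["correct"] for r in results.values()}
    n_correct = sum(r["correct"] == r["predicted"] for r in results.values())
    print(f"accuracy: {n_correct / len(results):.3f}")

    for c in classes:
        tp = sum(r["correct"] == c and r["predicted"] == c for r in results.values())
        fp = sum(r["correct"] != c and r["predicted"] == c for r in results.values())
        fn = sum(r["correct"] == c and r["predicted"] != c for r in results.values())
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        print(f"{c}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```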

Select feature function

The select_features function also takes two inputs, train_set and select_rate. It selects features by computing the mutual information between each word's presence and the class label (for two binary variables, a value between 0 and 1 bit) and keeps the top n features with the highest information gain, where n is determined by the user-defined select_rate.

Figure 2. Formula for mutual information calculation
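The sketch below follows the general form of the mutual information formula referenced in Figure 2, treating a word's presence as a binary variable; the exact summation and data format the project uses may differ, so the names here are illustrative:

```python
import math
from itertools import product

def mutual_information(word, train_set):
    """Sketch: mutual information (in bits) between a word's presence and the class.

    train_set: assumed list of (tokens, label) pairs.
    """
    n = len(train_set)
    docs = [(1 if word in tokens else 0, label) for tokens, label in train_set]
    labels = {label for _, label in docs}

    mi = 0.0
    for presence, c in product((0, 1), labels):
        p_joint = sum(1 for x, y in docs if x == presence and y == c) / n
        p_word = sum(1 for x, _ in docs if x == presence) / n
        p_class = sum(1 for _, y in docs if y == c) / n
        if p_joint > 0:
            mi += p_joint * math.log2(p_joint / (p_word * p_class))
    return mi
```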

For example, the following list shows the 60 words with the highest information gain.

Figure 3. Word list of high information gain

Grid search function

The grid_search function takes the training set, the validation set, and a list of select rates as input and generates a dataframe containing the classifier's performance for each candidate feature set. In this case, I use F1 score and accuracy as the main metrics for ranking the classifiers.
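A rough sketch of such a grid search is shown below; it assumes `select_features`, `train`, and `test` behave as described and sketched above, and it records only accuracy for brevity (F1 can be collected in the same loop):

```python
import pandas as pd

def grid_search(train_set, dev_set, select_rates):
    """Sketch: evaluate one classifier per select rate and tabulate the results."""
    rows = []
    for rate in select_rates:
        vocab = select_features(train_set, rate)          # assumed feature-selection step
        log_prior, log_likelihood = train(train_set, vocab)
        results = test(dev_set, log_prior, log_likelihood)
        accuracy = sum(
            r["correct"] == r["predicted"] for r in results.values()
        ) / len(results)
        rows.append({"select_rate": rate, "accuracy": accuracy})
    # Rank candidate feature sets by their performance on the development set
    return pd.DataFrame(rows).sort_values("accuracy", ascending=False)
```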

Figure 4. Partial result of classifier's performance

Figure 5. Performance Plot

Improvement of predictive power

At first, I tried using the top 100 features with the highest information gain; however, the result was quite disappointing. The F1 score and accuracy were low, and the precision wasn't ideal either (as shown in the following figure).

Figure 6. The initial output without optimization


To optimize the classifier, I implemented a self-defined grid search function, which looks for the best set of features and trains the model on the selected features. Luckily, this time the results were much better.

Figure 7. The final result after grid search