NLP: A Complete Sentiment Classification on Amazon Reviews

What is Sentiment Analysis?

In this article, we perform a sentiment analysis of product reviews written by online users on Amazon. The review text comes with a numerical rating ranging from 1 to 5 (1: negative, 5: positive). This rating will be used as the label that represents the sentiment of the review text. Thus, we treat the problem as multi-class classification and seek to predict the sentiment level of user reviews with machine learning classifiers and deep learning algorithms.


  • Text Preprocessing & Lemmatization
  • Word Vectorization (BOW vs. TF-IDF)
  • Implementing ML Classifiers & Evaluation Metrics
  • Results
  • Hyperparameter Tuning
  • Discussion
  • Any Future Work?

Required Python Libraries


The raw data is compressed in gzip format, so some extra code is needed to load it into a pandas DataFrame.
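As a minimal sketch of that loading step, the snippet below assumes the file stores one JSON object per line (the exact layout of the raw file is an assumption here) and uses only the standard library. The helper names `read_reviews` and `load_columns` are made up for illustration.

```python
import gzip
import json


def read_reviews(path):
    """Yield one review (a dict) per line of a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def load_columns(path, columns=("reviewText", "overall")):
    """Keep only the requested fields of every review, as a list of dicts."""
    return [
        {col: review.get(col) for col in columns}
        for review in read_reviews(path)
    ]
```

The resulting list of dicts can be passed straight to `pandas.DataFrame(...)` to get the DataFrame used in the rest of the post.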

We now have the review data in DataFrame format as shown below. Let’s not worry too much about the other features. For this post, we will focus only on the ‘reviewText’ and ‘overall’ columns, so let’s keep just these two. ‘reviewText’ contains the raw text of a product review, and ‘overall’ holds a numerical rating from 1 to 5. Here, these values can be seen as sentiment labels: 1 for ‘Very Negative’, 2 for ‘Somewhat Negative’, 3 for ‘Neutral’, 4 for ‘Somewhat Positive’, and 5 for ‘Very Positive’.

Let’s briefly look at the distribution of the ‘overall’ ratings in our dataset. As shown, the data is imbalanced! This can hurt our machine learning classifiers, since many of these algorithms perform best when each class has a roughly equal number of examples.

Thus, we apply a short piece of code that samples an equal number of observations for each class.
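One way to do this kind of downsampling is sketched below in plain Python. It assumes each review is a dict with an ‘overall’ key; the function name `balance_classes` and the fixed seed are illustrative choices, not the original code.

```python
import random
from collections import defaultdict


def balance_classes(rows, label_key="overall", per_class=None, seed=42):
    """Return a shuffled subset of rows with the same number of rows per label."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    # The rarest class caps how many rows every class can contribute.
    smallest = min(len(group) for group in by_label.values())
    n = smallest if per_class is None else min(per_class, smallest)

    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], n))
    rng.shuffle(balanced)
    return balanced
```

On a pandas DataFrame (pandas 1.1 or newer), `df.groupby("overall").sample(n=50000, random_state=42)` does the same thing in one line; 50,000 rows per class would be consistent with a 250,000-row balanced dataset.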

Now we have a total of 250,000 rows, with an equal count for each class. Time to move on to the next step.

What’s Next? Text preprocessing!

Step 1: Fundamental preprocessing tasks

Step 2: Lemmatization

Don’t worry about the cell’s running time. As mentioned earlier, lemmatization takes some computation time. After applying the functions from steps 1 and 2, our data looks like this.

Word Vectorization (Embedding)

Bag of Words and TF-IDF?

The Bag of Words (BoW) model is the simplest way of representing text as numerical values. It creates a vector space in which each dimension holds the count of one word within a sentence or document.

For instance, suppose there are two reviews:
1. Review 1: ‘This TV is nice looking but very heavy’
2. Review 2: ‘This TV is bad looking and heavy’

Taking the vocabulary in the order [this, TV, is, nice, bad, looking, but, and, very, heavy], the BOW approach would vectorize Review 1 as [1, 1, 1, 1, 0, 1, 1, 0, 1, 1] and Review 2 as [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]. The downside of the BOW model is that it tends to create sparse vectors with numerous 0 values, and the dimensionality of the dataset grows with the vocabulary. Also, this occurrence-counting approach ignores the fact that longer sentences tend to have higher counts of certain words. This is where TF-IDF comes in.
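The counting behind the two example reviews can be reproduced with a few lines of plain Python. This is only a sketch: the vocabulary order is fixed by hand so that it matches the vector quoted for Review 1, and the tokenization is a simple lowercase-and-split.

```python
from collections import Counter

reviews = [
    "This TV is nice looking but very heavy",
    "This TV is bad looking and heavy",
]

# Hand-picked order (an assumption) that reproduces the Review 1 vector.
vocabulary = ["this", "tv", "is", "nice", "bad",
              "looking", "but", "and", "very", "heavy"]


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


vectors = [bag_of_words(review, vocabulary) for review in reviews]
for vector in vectors:
    print(vector)
```

With thousands of distinct words in a real review corpus, most entries of each vector are 0, which is the sparsity issue mentioned above.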

TF-IDF (Term Frequency-Inverse Document Frequency) is a scoring measure widely used in information retrieval that captures how important or relevant a term is to a document (here, a review) within a given collection of documents.

The equation of TF-IDF is shown below.
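One common textbook form of the weighting is written out here, where N is the number of documents and df(t) is the number of documents containing term t. Libraries such as scikit-learn apply additional smoothing and normalization, so the exact variant used in a given implementation may differ.

```latex
\mathrm{tf}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{number of terms in } d},
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)},
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)
```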

While TF measures how frequently a term t appears in a document, IDF measures how rare, and therefore how informative, the term is across the whole collection. The product of the two gives the composite weight of each term. In other words, each word in the preprocessed reviews is assigned a score/weight that represents its importance within the review corpus. A high TF-IDF score indicates that the word is informative for telling the reviews apart.
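To make the weighting concrete, here is a small sketch that scores the two example reviews with the plain form tf(t, d) × log(N / df(t)), where tf is the term count divided by the document length. It is a simplified illustration and will not match scikit-learn’s numbers exactly, since scikit-learn smooths and normalizes.

```python
import math
from collections import Counter

documents = [
    "this tv is nice looking but very heavy".split(),
    "this tv is bad looking and heavy".split(),
]


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


weights = tf_idf(documents)
```

Words that occur in both reviews, such as ‘this’ or ‘heavy’, get a weight of zero, while words that appear in only one review, such as ‘nice’ or ‘bad’, get a positive weight. Those are exactly the words that separate the two reviews.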

Implementing TF-IDF Weighting in Python

Here, we set the n-gram range to consider both unigrams (= single words) and bigrams (= pairs of adjacent words). Afterward, the TfidfTransformer class is applied to convert the count matrix into a normalized TF-IDF representation. The example code for running the multinomial Naive Bayes classifier is shown below.
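A minimal sketch of such a pipeline with scikit-learn might look like the following. The toy training texts and labels are made up for illustration, and the step names (‘counts’, ‘tfidf’, ‘clf’) are arbitrary; the real inputs would be the preprocessed reviews and their ratings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data for illustration only.
train_texts = [
    "terrible product broke after one day",
    "awful quality waste of money",
    "works great love it",
    "excellent quality highly recommend",
]
train_labels = [1, 1, 5, 5]

nb_pipeline = Pipeline([
    ("counts", CountVectorizer(ngram_range=(1, 2))),  # unigrams and bigrams
    ("tfidf", TfidfTransformer()),                    # counts -> normalized TF-IDF
    ("clf", MultinomialNB()),
])
nb_pipeline.fit(train_texts, train_labels)

predictions = nb_pipeline.predict(["love it works great", "broke after one day"])
```

Wrapping the steps in a Pipeline means the vocabulary and IDF weights are learned only from the training data and then reused unchanged on new text.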

ML Classifiers for Multi-Class Classification

1. Multinomial Naive Bayes Classifier
Multinomial Naive Bayes is a simple but powerful algorithm for classification problems. It is based on Bayes’ theorem with the assumption that all observed features are independent of each other given the class. In our project’s case, this means assuming that every word in a review is independent of the others.

2. Linear Support Vector Machine
The linear support vector machine is a linear classifier which, in our case, is optimized by Stochastic Gradient Descent (SGD). SGD approaches the minimum of the cost function by iteratively updating the model weights, each time using the gradient computed on a single randomly selected sample.

3. Logistic Regression Classifier
Logistic regression is a common supervised machine learning algorithm for categorizing data into two or more discrete classes. It is a regression model that uses the sigmoid function (or, for more than two classes, its softmax generalization) to compute the probability that each data point belongs to each category. It assumes a linear relationship between the independent variables and the logit (log-odds) of the response.
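To illustrate how logistic regression turns linear scores into class probabilities, the sketch below implements the sigmoid (two classes) and the softmax (several classes) with the standard library. The five scores are made-up numbers standing in for the linear outputs of a fitted model, one per rating.

```python
import math


def sigmoid(z):
    """Map a single linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Map one linear score per class to probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical linear scores for ratings 1 to 5 of a single review.
scores = [0.2, -0.5, 0.1, 1.3, 2.4]
probabilities = softmax(scores)
predicted_rating = probabilities.index(max(probabilities)) + 1
```

The predicted class is simply the one with the highest probability, here the fifth rating.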

Before feeding our review data to the selected algorithms, let’s first split the dataset into train and test sets using the ‘train_test_split’ function from the scikit-learn library. Here, it is important to use the ‘stratify’ parameter, which keeps the proportion of each class (= sentiment label) the same in both the train and test sets.
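A small sketch of a stratified split is shown below on placeholder data. The 80/20 split ratio and the random seed are assumptions made for illustration, not values taken from the original code.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder data: 10 reviews for each of the five ratings.
texts = [f"review {i}" for i in range(50)]
labels = [rating for rating in (1, 2, 3, 4, 5) for _ in range(10)]

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,     # assumed split ratio
    stratify=labels,   # keep class proportions identical in both sets
    random_state=42,
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
```

Because the split is stratified, every rating ends up with 8 training and 2 test examples here, instead of whatever a purely random split happens to produce.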

Evaluation Metrics

Initial ML Classifier’s Results

Hyperparameter Tuning

> C (Inverse Regularization Parameter):
The inverse regularization parameter C in logistic regression is the inverse of the regularization strength lambda (C = 1/lambda). It can be regarded as a way of controlling the overfitting of the regression model. A higher value of C means weaker regularization, so the model fits the training data more closely, while a lower value of C means stronger regularization. The grid for this parameter was [0.01, 0.1, 1].

> Optimization Algorithms
– ‘lbfgs’: A limited-memory quasi-Newton method that approximates the Hessian matrix from recent gradient evaluations instead of computing it exactly.
– ‘sag’: The ‘sag’ (Stochastic Average Gradient) algorithm keeps a memory of previously computed gradients and uses their average at each step, which often gives faster convergence than plain stochastic gradient descent. Thus, it is generally preferred for large datasets.
– ‘saga’: The ‘saga’ algorithm is a variant of ‘sag’ that can also handle the non-smooth L1 penalty. Similar to ‘sag’, it is suitable for large datasets.
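To make the role of C concrete, the L2-penalized objective can be written, up to notation, as follows. Here ℓ is the logistic loss on sample i and λ is the regularization strength; the symbols are chosen for illustration.

```latex
\min_{w} \; \frac{1}{2} \lVert w \rVert_2^2 \; + \; C \sum_{i=1}^{n} \ell\bigl(y_i, f_w(x_i)\bigr),
\qquad C = \frac{1}{\lambda}
```

A large C puts more weight on the data-fitting term, which means weak regularization. A small C lets the penalty term dominate, which means strong regularization.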

Let’s use the GridSearchCV class from the scikit-learn library as below. The code first outputs the best estimator, that is, the one with the highest prediction accuracy under stratified k-fold cross-validation. This estimator is then used to get the accuracy score on the test set.
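A sketch of such a grid search is given below. It runs on a small synthetic dataset standing in for the TF-IDF features, and the number of folds (5), the `max_iter` value, and the random seeds are assumptions. Only the grid of C values and solvers comes from the description above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the TF-IDF features and the five sentiment labels.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

param_grid = {
    "C": [0.01, 0.1, 1],
    "solver": ["lbfgs", "sag", "saga"],
}
search = GridSearchCV(
    LogisticRegression(max_iter=5000),
    param_grid,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)

best_model = search.best_estimator_               # refit on the full training set
test_accuracy = best_model.score(X_test, y_test)  # accuracy on held-out data
print(search.best_params_, round(test_accuracy, 4))
```

The test set is touched only once, after the best combination has been chosen by cross-validation on the training data.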

GridSearchCV selected C=1 and solver=‘saga’ as the best combination, which achieved an accuracy of 0.5310 in predicting review sentiment on the test set.


The confusion matrix is a summary of a classifier’s prediction results that also shows which types of errors the model is making. Despite the unsatisfactory overall accuracy, we can observe that our model classifies the extreme reviews (‘Very Negative’, ‘Very Positive’) into their sentiment classes well. On the other hand, it performs poorly at distinguishing between ‘Somewhat Negative’, ‘Neutral’, and ‘Somewhat Positive’. Thus, future work should focus on improving prediction accuracy for these weaker sentiment classes in order to raise the overall accuracy of our selected model.
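For readers who want to see how such a matrix is built, here is a plain-Python sketch. The true and predicted labels are made up to mimic the pattern described above; they are not the article’s results. Rows correspond to true classes and columns to predicted classes, the same layout that `sklearn.metrics.confusion_matrix` uses.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count how often each true label (row) was predicted as each label (column)."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


labels = [1, 2, 3, 4, 5]
# Made-up labels for illustration only.
y_true = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
y_pred = [1, 1, 1, 3, 2, 3, 5, 3, 5, 5]

matrix = confusion_matrix(y_true, y_pred, labels)
recall = per_class_recall(matrix)
```

In this toy example the extreme classes (1 and 5) are fully recovered, while the middle classes are often confused with their neighbours. The per-class recall makes that visible even when overall accuracy looks mediocre.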

Any Future Work?

  • One can also try to gain a deeper understanding of the reviews’ context with higher-order n-grams (such as trigrams). Our post considered only unigrams and bigrams to limit the size of the vocabulary, but higher-order n-grams can become feasible if we reduce the dimensionality by dropping terms with low frequency counts.
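As a rough sketch of that idea, the snippet below builds a vocabulary of n-grams up to trigrams and keeps only the terms that occur at least `min_count` times. The function names and the threshold are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(documents, max_n=3, min_count=2):
    """Collect 1- to max_n-grams that occur at least min_count times overall."""
    counts = Counter()
    for doc in documents:
        tokens = doc.split()
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return {term for term, count in counts.items() if count >= min_count}


documents = ["tv is very heavy", "tv is very nice", "remote is bad"]
full_vocabulary = build_vocabulary(documents, min_count=1)
pruned_vocabulary = build_vocabulary(documents, min_count=2)
```

In scikit-learn, a similar effect can be obtained with `CountVectorizer(ngram_range=(1, 3), min_df=...)`, where `min_df` drops terms that appear in too few documents.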

Current Masters student of the Data Science program at the Univ. of Michigan. I strive for bringing societal impacts by leveraging tools of Data Science.
