NLP: A Complete Sentiment Classification on Amazon Reviews

What is Sentiment Analysis?

Sentiment analysis, also known as opinion mining, is a common research area in Natural Language Processing (NLP) that extracts subjective information from text written by users. For review data, the main goal is to identify the user’s subjectivity and classify each statement into a sentiment group.

In this article, we perform a sentiment analysis of product reviews written by Amazon users. Each review comes with a numerical rating from 1 to 5 (1: negative, 5: positive), which we use as the label representing the sentiment of the review text. The problem is therefore treated as multi-class classification: we seek to predict the sentiment level of each review using machine learning classifiers and deep learning algorithms.


  • About the Dataset
  • Text Preprocessing & Lemmatization
  • Word Vectorization (BOW vs. TF-IDF)
  • Implementing ML Classifiers & Evaluation Metrics
  • Results
  • Hyperparameter Tuning
  • Discussion
  • Any Future Work?

Required Python Libraries
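
A plausible set of imports for this walkthrough is sketched below. It is an assumption based on the tools named in this article (pandas and scikit-learn), not a definitive list, and the lemmatization step needs an additional NLP library that is not shown here.

```python
# Assumed imports, inferred from the libraries mentioned in the article.
import gzip   # the raw review file is gzip-compressed
import json   # each review is assumed to be stored as a JSON record

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
```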


About the Dataset

Let’s start by downloading the Amazon review dataset, either from its download page or directly from within a notebook.

The raw data is compressed in gzip format, so a little extra code is needed to load it into a pandas DataFrame.
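
A minimal sketch of that loading step could look like the following. It assumes the file holds one JSON object per line; the function name `load_reviews` and the file name in the usage note are hypothetical.

```python
import gzip
import json

import pandas as pd


def load_reviews(path):
    """Read a gzip-compressed JSON Lines file into a pandas DataFrame.

    Assumes one JSON object (one review) per line.
    """
    records = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return pd.DataFrame(records)
```

With a local file, usage would be along the lines of `df = load_reviews("reviews.json.gz")[["reviewText", "overall"]]`, where the file name is a placeholder.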

We now have the review data in a DataFrame. The other features can be set aside: for this post we focus only on the ‘reviewText’ and ‘overall’ columns, so let’s keep just those. ‘reviewText’ contains the raw text of a product review, and ‘overall’ holds numerical ratings from 1 to 5. These values can be read as sentiment labels: 1 for ‘Very Negative’, 2 for ‘Somewhat Negative’, 3 for ‘Neutral’, 4 for ‘Somewhat Positive’, and 5 for ‘Very Positive’.

Let’s briefly look at the distribution of the ‘overall’ ratings in our dataset. It turns out that the classes are imbalanced! This can hurt our machine learning classifiers, since many of them become biased toward the majority class when some classes have far more examples than others.

Thus, we draw an equal number of observations from each class.
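
One way to do this is sketched below. It downsamples every class to the same size; the helper name `balance_classes`, the default of using the smallest class size, and the random seed are assumptions for illustration.

```python
import pandas as pd


def balance_classes(df, label_col="overall", n_per_class=None, seed=42):
    """Downsample every class to the same number of rows."""
    if n_per_class is None:
        # Default (an assumption): use the size of the smallest class.
        n_per_class = int(df[label_col].value_counts().min())
    parts = [
        group.sample(n=n_per_class, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    balanced = pd.concat(parts)
    # Shuffle so the classes are not stored in contiguous blocks.
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)
```

Passing an explicit `n_per_class` fixes the per-class sample size instead of deriving it from the smallest class.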

We now have a total of 250,000 rows, with an equal count (50,000) for each of the five classes. Time to move on to the next step.

What’s Next? Text preprocessing!

For NLP projects, text preprocessing is traditionally a vital step because it can affect the final performance of the classifier. It not only reduces the dimensionality of the input data but also converts the text into a more consistent, predictable form. We’ll go through each step.

Step 1: Fundamental preprocessing tasks

Fundamental text preprocessing steps include lowercasing, punctuation removal, and stopword removal. Stopwords are a set of commonly used words such as ‘to’, ‘from’, and ‘or’. In many NLP approaches these words are treated as carrying little information, and removing them also reduces the number of input tokens. A single simple function handles all of these steps at once.
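
A minimal sketch of such a function is shown below. The tiny stopword set is an assumption for illustration only; a real project would use a fuller list from an NLP library.

```python
import string

# Tiny illustrative stopword list (an assumption, far from complete).
STOPWORDS = {"a", "an", "and", "the", "to", "from", "or", "is", "this", "but", "very"}

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def basic_preprocess(text):
    """Lowercase, strip ASCII punctuation, and drop stopwords."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    tokens = [tok for tok in no_punct.split() if tok not in STOPWORDS]
    return " ".join(tokens)


print(basic_preprocess("This TV is nice looking, but very heavy!"))
# tv nice looking heavy
```

Note that `string.punctuation` only covers ASCII characters, so typographic quotes would need extra handling.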

Step 2: Lemmatization

Lemmatization is the process of reducing inflected words to their root form (lemma). For instance, ‘ran’ is lemmatized into ‘run’, ‘crying’ into ‘cry’, and ‘better’ into ‘good’. Although it can take a while to run, lemmatizing is valuable for reducing the number of unique words and for reducing noise (= unwanted words). An additional function (morphological analysis) is added on top of the lemmatizing function to first identify the inflectional forms and then cut them down to a common base word.

Don’t be anxious about the cell’s running time; as mentioned earlier, lemmatization requires some computation time. After applying the functions from steps 1 and 2, the review text is lowercased, stripped of punctuation and stopwords, and lemmatized.

Word Vectorization (Embedding)

Word embedding is an NLP technique in which words or phrases are represented as numerical values in a pre-defined vector space. This step is unavoidable because raw strings cannot be used directly as input for machine learning algorithms or deep learning architectures. While numerous approaches exist, such as Bag of Words, TF-IDF, word2vec, and GloVe, we will focus on TF-IDF in this post.

Bag of Words and TF-IDF?

The Bag of Words (BoW) model is the simplest way to represent text as numerical values. It simply creates a vector holding the count of each vocabulary word within a sentence or document.

For instance, suppose there are two reviews:
1. Review 1: ‘This TV is nice looking but very heavy’
2. Review 2: ‘This TV is bad looking and heavy’

Taking the combined vocabulary in the order [this, tv, is, nice, bad, looking, but, and, very, heavy], the BoW approach would vectorize Review 1 as [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]. The downside of the BoW model is that it tends to create sparse vectors with many 0 values, and the dimensionality grows with the size of the vocabulary. Also, raw occurrence counts neglect the fact that longer documents tend to have higher counts of certain words. This is where TF-IDF comes in.
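
The Review 1 vector can be reproduced with a few lines of plain Python. The vocabulary order in this sketch is an assumption, chosen so that the output matches the vector quoted in the text.

```python
from collections import Counter

# Vocabulary order is an assumption chosen to match the vector in the text.
VOCAB = ["this", "tv", "is", "nice", "bad", "looking", "but", "and", "very", "heavy"]


def bow_vector(text, vocab=VOCAB):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]


review_1 = "This TV is nice looking but very heavy"
review_2 = "This TV is bad looking and heavy"

print(bow_vector(review_1))  # [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]
print(bow_vector(review_2))  # [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]
```

Even in this two-review toy case, each vector already contains zeros for words that only occur in the other review, which is where the sparsity comes from.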

TF-IDF (Term Frequency-Inverse Document Frequency) is a scoring measure widely used in information retrieval that reflects how important or relevant a term is to a sentence within a given collection of sentences (= documents).

The equation of TF-IDF is shown below.
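
One common textbook form of the weighting is given here; it is an assumption in the sense that implementations differ in detail:

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
```

Here tf(t, d) is the frequency of term t in document d, N is the number of documents, and df(t) is the number of documents that contain t. By default, scikit-learn uses a smoothed variant, idf(t) = ln((1 + N) / (1 + df(t))) + 1, and then L2-normalizes each document vector.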

While TF measures how frequently a term t appears in a document, IDF measures how informative t is across the whole collection (terms that occur in fewer documents score higher), and the product of the two gives the composite weight of each term. In other words, each word in the preprocessed reviews is assigned a score/weight that represents its importance within the review corpus. A high TF-IDF score indicates that the word is informative for differentiating the reviews.
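
To make the weighting concrete, here is a small hand-rolled sketch on two toy reviews. It assumes the plain, unsmoothed form tf × log(N / df) with length-normalized term frequency, which is not exactly what scikit-learn computes by default.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


reviews = [
    "This TV is nice looking but very heavy",
    "This TV is bad looking and heavy",
]
scores = tf_idf(reviews)
```

Words shared by both reviews, such as ‘tv’, receive a weight of 0 here, while words unique to one review, such as ‘nice’ or ‘bad’, receive a positive weight. That is exactly the "informative for differentiating the reviews" behavior described above.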

Implementing TF-IDF Weighting in Python

For this section, we use the Pipeline class from the scikit-learn library, which first applies CountVectorizer to transform the review texts into a matrix of token counts (for TF).

Here, we set the n-gram range to consider both unigrams (= single words) and bigrams (= pairs of consecutive words). Afterward, TfidfTransformer converts the count matrix into a normalized TF-IDF representation. The final pipeline step is the classifier itself, starting with the multinomial Naive Bayes classifier.
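
A minimal sketch of such a pipeline follows. The step names (`vect`, `tfidf`, `clf`) and the toy training data are assumptions for illustration; all other settings are scikit-learn defaults.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

nb_pipeline = Pipeline([
    ("vect", CountVectorizer(ngram_range=(1, 2))),  # unigram + bigram counts
    ("tfidf", TfidfTransformer()),                  # counts -> normalized TF-IDF
    ("clf", MultinomialNB()),                       # multinomial Naive Bayes
])

# Toy data standing in for the preprocessed reviews and their ratings.
toy_texts = [
    "love great product",
    "terrible waste money",
    "great value love",
    "terrible broke fast",
]
toy_labels = [5, 1, 5, 1]

nb_pipeline.fit(toy_texts, toy_labels)
prediction = nb_pipeline.predict(["great product"])
```

Because every step lives in one Pipeline object, the vectorizer vocabulary is learned only from the data passed to `fit`, and the same transformations are applied automatically at prediction time.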

ML Classifiers for Multi-Classification

Now that we have word vectors for the review data obtained from TF-IDF weighting, let’s train some machine learning classifiers to predict the sentiment of each review. We will be testing these classifiers:

1. Multinomial Naive Bayes Classifier
is one of the simple but powerful algorithms used for classification problems. It is based on Bayes’ theorem with the assumption that all observed features are conditionally independent of each other given the class. In our project’s case, this means assuming that every word in a review is independent of the others.

2. Linear Support Vector Machine
is a linear classifier which, in this post, is optimized by Stochastic Gradient Descent (SGD). The SGD algorithm searches for the minimum of the cost function by iteratively updating the model parameters using the gradient computed on a single randomly selected sample.

3. Logistic Regression Classifier
is a common supervised machine learning algorithm for categorizing data into two or more groups with discrete prediction values. It is a regression model that uses the sigmoid function (or its multi-class generalization, softmax) to compute the probability of each data point belonging to each category. It assumes a linear relationship between the independent variables and the logit (log-odds) of the response.

Before feeding our review data to the selected algorithms, let’s first split the dataset into train and test sets using the ‘train_test_split’ function from the scikit-learn library. Here, it is important to use the ‘stratify’ parameter, which preserves the proportion of each class (= sentiment label) in both splits.
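
The effect of ‘stratify’ can be seen in a small sketch. The 80/20 split ratio, the random seed, and the toy data are assumptions, since the split settings are not stated in this post.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in: 50 "reviews" with 10 examples per rating.
texts = [f"review {i}" for i in range(50)]
labels = [1, 2, 3, 4, 5] * 10

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.2,      # assumed ratio
    stratify=labels,    # keep class proportions identical in both splits
    random_state=42,
)

print(Counter(y_test))  # each of the 5 ratings appears exactly twice
```

Without `stratify`, a random split could leave some ratings over- or under-represented in the test set purely by chance.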

Evaluation Metrics

To evaluate our candidate models for multi-class sentiment prediction, we use accuracy on the test set. Accuracy is simply the number of correct predictions divided by the total number of predictions. We concluded that accuracy is the most appropriate and concise measure because we previously re-sampled the data to be balanced across all 5 sentiment classes. As a quick note, when dealing with imbalanced data, the F1-score (the harmonic mean of precision and recall) is recommended.
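
The following toy sketch illustrates why accuracy alone can mislead on imbalanced data. The labels are made up for illustration: a "model" that always predicts 5 stars looks good on accuracy while never finding the single 1-star review.

```python
def accuracy(y_true, y_pred):
    """Number of correct predictions divided by total predictions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall_for(label, y_true, y_pred):
    """Share of the true `label` examples that were predicted as `label`."""
    predictions = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in predictions) / len(predictions)


# Imbalanced toy data: nine 5-star reviews and one 1-star review.
y_true = [5] * 9 + [1]
y_pred = [5] * 10  # a "model" that always predicts 5

print(accuracy(y_true, y_pred))       # 0.9
print(recall_for(1, y_true, y_pred))  # 0.0
```

On the balanced data used in this post, this failure mode is much less of a concern, which is why plain accuracy is a reasonable choice.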

Initial ML Classifier’s Results

For each method, we compute the accuracy score, confusion matrix, and classification report. In terms of classification accuracy on the test set, Naive Bayes reached 0.4582, the SGD-trained linear SVM 0.4729, and Logistic Regression 0.5080.
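
A sketch of such an evaluation loop is shown below. The helper name `evaluate_all`, the hinge loss used for the SGD-trained linear SVM, `max_iter=1000`, and the random seed are assumptions; the accuracies reported above come from the original experiment, not from this snippet.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

CLASSIFIERS = {
    "Naive Bayes": MultinomialNB(),
    "Linear SVM (SGD)": SGDClassifier(loss="hinge", random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}


def evaluate_all(X_train, y_train, X_test, y_test):
    """Fit each classifier on TF-IDF features and collect test metrics."""
    results = {}
    for name, clf in CLASSIFIERS.items():
        model = Pipeline([
            ("vect", CountVectorizer(ngram_range=(1, 2))),
            ("tfidf", TfidfTransformer()),
            ("clf", clone(clf)),  # fresh, unfitted copy for every run
        ])
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, y_pred),
            "confusion_matrix": confusion_matrix(y_test, y_pred),
            "report": classification_report(y_test, y_pred, zero_division=0),
        }
    return results
```

Calling `evaluate_all(X_train, y_train, X_test, y_test)` returns a dictionary keyed by classifier name, so the three methods can be compared side by side.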

Hyperparameter Tuning

Overall, our initial approach showed that the logistic regression classifier gives the best sentiment classification performance, with an accuracy of 0.5080. Hence, we tune the hyperparameters of this model with scikit-learn’s GridSearchCV. Below are descriptions of the hyperparameters selected for this process.

> C (Inverse Regularization Parameter):
The inverse regularization parameter C in logistic regression is the inverse of the regularization strength lambda (C = 1/lambda). It is a way of controlling overfitting of the regression model: a smaller value of C means stronger regularization, while a larger value of C weakens it. The grid for this parameter was [0.01, 0.1, 1].

> Optimization Algorithms
– ‘lbfgs’: A limited-memory quasi-Newton algorithm that approximates the Hessian matrix from successive gradient evaluations instead of computing it exactly.
– ‘sag’: The Stochastic Average Gradient algorithm updates the model with one randomly selected sample at a time while keeping an average of previously computed gradients, which gives it a faster convergence rate. It is therefore generally preferred for large datasets.
– ‘saga’: A variant of ‘sag’ that can also handle the non-smooth L1 penalty. Similar to ‘sag’, it is suitable for large datasets.

Let’s use the GridSearchCV class from the scikit-learn library. The search first identifies the best estimator, that is, the one with the highest prediction accuracy under stratified k-fold cross-validation. This estimator is then used to compute the accuracy score on the test set.
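
A sketch of this search is given below. The grid values come from the descriptions above, while the helper name `tune_logistic_regression`, the 5 folds, `max_iter=1000`, and the pipeline step names are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# The "clf__" prefix routes each parameter to the pipeline step named "clf".
PARAM_GRID = {
    "clf__C": [0.01, 0.1, 1],
    "clf__solver": ["lbfgs", "sag", "saga"],
}


def tune_logistic_regression(X_train, y_train, n_splits=5):
    """Grid-search C and solver with stratified k-fold cross-validation."""
    pipeline = Pipeline([
        ("vect", CountVectorizer(ngram_range=(1, 2))),
        ("tfidf", TfidfTransformer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    search = GridSearchCV(
        pipeline,
        PARAM_GRID,
        scoring="accuracy",
        cv=StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42),
    )
    search.fit(X_train, y_train)
    return search
```

After fitting, `search.best_params_` holds the winning combination, and `search.best_estimator_.score(X_test, y_test)` gives the accuracy of the refitted best model on the held-out test set.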

The grid search selected C=1 and solver=‘saga’ as the best estimator, which achieved an accuracy of 0.5310 on the test set.

Discussion

The sentiment classification accuracy achieved in this post is not sufficient for end-user use. Therefore, let’s look into the confusion matrix for further observations. We used an existing helper function built to visualize the confusion matrix concisely.

The confusion matrix summarizes the prediction results of a classifier and shows what types of errors the model is making. In contrast to the unsatisfactory overall accuracy, we can observe that the model classifies the extreme reviews (= ‘Very Negative’, ‘Very Positive’) into their sentiment classes quite well. On the other hand, it performs poorly at distinguishing between ‘Somewhat Negative’, ‘Neutral’, and ‘Somewhat Positive’. Thus, future work should focus on improving prediction accuracy for these weaker sentiment classes in order to raise the overall accuracy of the selected model.

Any Future Work?

  • As discussed earlier, there are other options, such as word2vec and GloVe embeddings, for vectorizing the words in our review dataset. It may be worth trying these methods and comparing their performance against the TF-IDF weighting approach. Pre-trained embeddings are typically regarded as boosting NLP performance because they capture semantic and syntactic meaning from large datasets (Wikipedia, Twitter, etc.).
  • Also, one can try to capture more of a review’s context with higher-order n-grams (such as trigrams). This post only considered unigrams and bigrams to limit the size of the vocabulary, but higher-order n-grams become feasible if we reduce dimensionality by dropping terms with low frequency counts.
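
As a rough sketch of the second idea, scikit-learn’s CountVectorizer can extend the n-gram range to trigrams while its `min_df` parameter drops rare terms. The toy documents and the threshold of 2 are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "this tv is nice looking",
    "this tv is bad looking",
    "this tv is very heavy",
]

# Unigrams through trigrams, keeping every term.
full = CountVectorizer(ngram_range=(1, 3)).fit(docs)

# Same n-gram range, but keep only terms found in at least 2 documents.
pruned = CountVectorizer(ngram_range=(1, 3), min_df=2).fit(docs)

print(len(full.vocabulary_), len(pruned.vocabulary_))
```

The pruned vocabulary keeps frequent n-grams such as ‘this tv is’ and discards one-off combinations such as ‘nice looking’, which is how the dimensionality stays manageable.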



