Paper Discussion: Gamon et al. (2005)

Pulse: Mining Customer Opinions from Free Text

- Michael Gamon, Anthony Aue, Simon Corston-Oliver, and Eric Ringger

Characteristics Summary
Domain Car reviews
Sentiment Classes Positive / Negative / Other
Aspect Detection Method Keyword-based soft-clustering
Sentiment Analysis Method Naïve Bayes classifier using Expectation Maximization bootstrapping
Performance Summary
Not really specified as evaluation scores are not properly presented.

This research aims at performing sentiment analysis on a set of car reviews. It is mainly focused at the sentence level. This is slightly different compared to what we've seen before, where the focus is on finding aspects and from there, the corresponding sentiment. Here, sentences are first soft-clustered be assigning them to one or more aspect categories in an automatically created taxonomy. Then the sentence is classified for sentiment, which is then attributed to all the aspects (the categories from the taxonomy) which this sentence was clustered into.


The data set consists of a sample from a car review database, containing 406,818 reviews and almost 900,000 sentences. A subset of 3,000 sentences has been annotated manually for sentiment, with 2,500 being used as a training set, and 500 as a test set. The sentiment categories that are used are positive, negative, and other, with the latter consisting of all sentence that have mixed sentiment, no sentiment at all, or sentiment that can only be detected using external world-knowledge.

System Design

The first step is the creation of a taxonomy of major categories (the makes) and minor categories (the models) of cars by just querying the car reviews database. The sentences are then retrieved for each make and model to be processed. So beforehand, the entity is known (the make and model of the car), but the aspects and sentiment are not. This is quite often the case in related research.

Then the set of sentences is processed by a clustering algorithm. As existing clustering algorithms are either designed to work with larger pieces of text (i.e. documents) or smaller ones (i.e. on the word level), the authors designed their own clustering algorithm. Besides the set of review sentences, the input also entails a list of stop-words and a list of, what they call, go-words. Stop-words are words that definitely should be in a cluster and a stop0-list is mandatory input. Go-words on the other hand are the opposite in that beforehand it is known that these words do belong in a cluster. It is an optional piece of input, which allows additional external knowledge to be fed into the system. The algorithm consists of the following steps:

  1. Sentences are stemmed using the Porter stemmer.
  2. Frequencies F are recorded for each stem not appearing in the stop-list.
  3. The frequencies for stems in the go-list are multiplied by a parameter λ1.
  4. The frequencies for stems with a high tf-idf value, calculated over all the data, are multiplied by a parameter λ2.
  5. The frequencies for stems with a high tf-idf value, calculated over all sentences within the same leaf node of the taxonomy (i.e., over all reviews for the same car model), are multiplied by a parameter λ3.
  6. The list of stems are sorted by descending adjusted frequency.
  7. One cluster is created for each of the most frequent N stems, with all sentences that contain that stem being classified in that cluster. Optionally, one could specify a minimum number of sentences for each cluster.
  8. Two clusters are merged if the overlap of sentences that are in both clusters is more than 50% of the number of sentence in either one of the clusters. A merged cluster is labeled with both words, unless one cluster is labeled with a stem that is part of the label of the other cluster, in which case the new label will be the larger phrase.

The stop-list is very important here as it not only contains a manually created list of function words and high-frequency semantically empty words (like 'put'), but also the set of top N features from the sentiment classifier. The latter is computed using the log likelihood ratio of the sentiment word with the target feature. By excluding words that are highly correlated with positive or negative sentiment, the clusters are bound to be orthogonal to the sentiment of the feedback.

As mentioned earlier, the sentiment classifier is trained to make a three-way distinction between the sentiment classes. The authors argue that a Naïve Bayes classifier is a good choice, as it can be bootstrapped (using the Expectation Maximization algorithm) with the loads of unlabeled data they have at their disposal. The steps are as follows:

  1. An initial Naïve Bayes classifier is trained with parameters θ on the set of labeled documents.
  2. This initial classifier is used to estimate a probability distribution over all three classes for each document in the set of unlabeled documents. (This is the E-step in Expectation Maximization)
  3. Both the labeled and unlabeled data are then used to estimate the parameters for a new classifier. (This is the M-step in Expectation Maximization)
  4. Step 2 and 3 are repeated until the joint probability of the data and the parameters are within some specified threshold.

Additionally, a free parameter δ is used to vary the weight given to the unlabeled documents, and mixtures(?) are used to model each class.


While the intuition behind this model seems fair, there is no proper evaluation present in this paper. The clusters are not evaluated at all, and even though the sentiment classifier is, no proper results (as in tables) are presented. Only some random numbers and a graph is given to show that it works. For the positive class, a recall between 0.95 and 0.97 is given, which is extremely high (one wonders what the precision will be for that class...). For the negative and other class, a precision-recall graph is given, but it looks rather weird. I'll include it for you, so you can see for yourself.


It is a shame that the evaluation is so poor, because the ideas are really interesting, especially given the fact that this algorithm requires only a limited amount of labeled data to work. I hope to encounter similar research in the future, which does provide answers to whether this is actually a good approach.


Leave a Reply

Your email address will not be published. Required fields are marked *