Paper Discussion: Hu & Liu (2004)

Mining Opinion Features in Customer Reviews and Mining and Summarizing Customer Reviews

- Minqing Hu and Bing Liu

Domain: Product reviews (find data set here)
Sentiment Classes: Positive / Negative
Aspect Detection Method: Frequent nouns
Sentiment Analysis Method: Rule-based
Sentiment Lexicon: Adjectives only; custom WordNet propagation algorithm
Aspect detection: Precision 72%, Recall 80%
Opinion sentence extraction: Precision 64.2%, Recall 69.3%
Opinion sentence orientation detection: Accuracy 84.2%

This research is one of the earliest solutions for aspect-level sentiment analysis. It is well-known and much-cited in subsequent publications. Since this research is split into two publications with significant overlap, I will discuss the content from both papers. I will try to be detailed enough for you to implement it yourself, but also succinct enough to be a reasonable read.

The goal of the proposed method is to generate an opinion summary for some product, given a set of reviews of that product. Instead of just stating the number of positive and negative reviews for that product, the opinion summary is aspect-based, meaning that positive and negative reviews are mentioned for each detected aspect of the product. Examples of aspects of a mobile phone are the camera, screen, weight, price, etc. Thus, one of the most important contributions of this research is a method to find these aspects in plain text (you can find this most extensively in the first of the two papers).

The general intuition is as follows: while people have a diverse vocabulary and will use many different words, there is only a limited set of words that describes the aspects of the product. Therefore, the vocabulary will converge on these words; in other words, they will be used more often than other words. The aspects everyone talks about are called the frequent features in this research (however, to avoid confusion with the features from machine learning, I will try to consistently use 'aspect' instead of 'feature'). An architectural overview of the method can be found here, which is basically just a sequence of steps:

Feature-based opinion summarization framework


The first step is to crawl reviews, which is not discussed in the paper as such. You can find the set of reviews that the authors used to test their implementation here, on the homepage of Bing Liu.

Frequent Aspect Detection

The first operation here is Part-of-Speech tagging and phrase chunking. The authors use NLProcessor for this task, but any other library that performs the same operation will probably suffice as well. A good alternative is the Apache OpenNLP project, which has components for both of these tasks as well.

To find the frequent aspects, a transaction file is created with each line in this file being the set of nouns and/or noun phrases from a sentence. Note that some pre-processing is done here as well: stopwords are removed, words are stemmed and a fuzzy matching technique (derived from this paper) is used to deal with alternative spellings or typos. So, this transaction file should contain the stemmed, spelling-corrected nouns/noun phrases from the reviews and no stopwords.
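As a minimal sketch of this pre-processing step (assuming POS tags are already supplied by an external tagger such as OpenNLP, and with a toy strip-the-s stemmer and stopword list standing in for the real ones), each sentence could be turned into a transaction like this:

```python
# Build one 'transaction' per sentence: the stemmed, stopword-free nouns.
# The stopword list and stemmer here are toy stand-ins, not the paper's.
STOPWORDS = {"the", "a", "an", "of", "and"}

def naive_stem(word):
    # Toy stand-in for a real stemmer (the authors stem their words).
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def sentence_to_transaction(tagged_sentence):
    """tagged_sentence: list of (token, POS) pairs; returns the set of noun stems."""
    return {
        naive_stem(tok.lower())
        for tok, pos in tagged_sentence
        if pos.startswith("NN") and tok.lower() not in STOPWORDS
    }

sent = [("The", "DT"), ("pictures", "NNS"), ("of", "IN"),
        ("the", "DT"), ("camera", "NN"), ("are", "VBP"), ("great", "JJ")]
print(sorted(sentence_to_transaction(sent)))  # ['camera', 'picture']
```

A real implementation would also apply the fuzzy-matching step for typos, which is omitted here.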

To find the most frequently occurring nouns/noun phrases, the authors use the Apriori algorithm on the transaction file. This allows them to specify a minimum support, which is the minimum percentage of sentences in which a certain noun or noun phrase should occur. While the Apriori algorithm uses the frequent itemsets as an intermediate step to generate association rules, the latter part of the algorithm is not needed in this context. The minimum support used by the authors is 1%. The maximum number of words in a frequent itemset is 3 for this research, as aspects with more than three words are extremely rare. But even so, this limit is easily adjusted.
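Since only the itemset-generation half of Apriori is needed, a compact self-contained version (function and parameter names are my own) could look like this:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support=0.01, max_size=3):
    """Apriori-style frequent itemset mining over sentence transactions.
    Rule generation is skipped: only the frequent itemsets are returned,
    mapped to their absolute counts."""
    n = len(transactions)
    min_count = min_support * n
    # Size-1 candidates: every distinct item.
    current = [frozenset([item]) for item in {i for t in transactions for i in t}]
    frequent = {}
    size = 1
    while current and size <= max_size:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: cnt for c, cnt in counts.items() if cnt >= min_count}
        frequent.update(survivors)
        # Join step: size+1 candidates whose size-sized subsets are all frequent.
        candidates = {a | b for a, b in combinations(survivors, 2)
                      if len(a | b) == size + 1}
        current = [c for c in candidates
                   if all(frozenset(sub) in survivors
                          for sub in combinations(c, size))]
        size += 1
    return frequent
```

With the paper's settings you would call it as `frequent_itemsets(transactions, min_support=0.01, max_size=3)`; this brute-force counting pass is fine for a few thousand review sentences, though slower than a tuned Apriori implementation.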

Now, we have a rather large number of frequent itemsets, containing both genuine aspects and non-aspects. Two pruning steps are performed to shave away as many of the non-aspects as possible.

We start with what is called compactness pruning: as the Apriori algorithm does not consider the position of a word in the sentence, there can be frequent itemsets consisting of words that are not even close together. Such itemsets are unlikely to be genuine aspects. To test for compactness, the word distance between any two adjacent words of the frequent itemset, in a sentence containing all its words, should not be greater than 3. When there are at least 2 sentences for which this is the case, the frequent itemset is compact.
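A simplified sketch of this check (using only the first occurrence of each itemset word per sentence, which glosses over some corner cases in the paper's definition):

```python
def is_compact_in(itemset_words, sentence_tokens, max_gap=3):
    """True if the sentence contains all itemset words with a word distance
    of at most max_gap between adjacent occurrences. Simplification: only
    the first occurrence of each word is considered."""
    positions = sorted(
        min(i for i, tok in enumerate(sentence_tokens) if tok == w)
        for w in itemset_words
        if w in sentence_tokens
    )
    if len(positions) < len(itemset_words):
        return False  # not all words present
    return all(b - a <= max_gap for a, b in zip(positions, positions[1:]))

def compactness_prune(itemset_words, sentences, min_sentences=2):
    """Keep the itemset only if at least min_sentences contain it compactly."""
    return sum(is_compact_in(itemset_words, s) for s in sentences) >= min_sentences
```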

The next pruning step is called redundancy pruning, which is used to remove single-word aspects that are not aspects on their own but are just part of a multi-word aspect. To measure this, the p-support is computed as the number of sentences that contain just the single-word aspect (so no larger aspect that includes the single word). When the p-support is lower than 3, the aspect is pruned. An example from the paper is life, which is not interesting on its own as an aspect, but only occurs as part of battery life. On the other hand, manual is interesting as an aspect as it occurs often as a single word aspect, even though there are larger aspects that include the word manual, such as manual mode, manual setting, etc.
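Representing each sentence as the set of candidate aspects it contains, the p-support computation can be sketched as follows (names are mine, not the paper's):

```python
def p_support(aspect, sentence_aspects, supersets):
    """Count sentences that contain the aspect but none of its superset
    aspects. sentence_aspects: one set of candidate aspects per sentence."""
    return sum(
        1 for aspects in sentence_aspects
        if aspect in aspects and not any(s in aspects for s in supersets)
    )

def redundancy_prune(aspect, sentence_aspects, supersets, min_psupport=3):
    """Keep the single-word aspect only if its p-support is at least 3."""
    return p_support(aspect, sentence_aspects, supersets) >= min_psupport
```

For the paper's example: if "life" almost always co-occurs with "battery life", its p-support stays below 3 and it is pruned, while "battery life" itself survives.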

Opinion Words

Now, for each sentence that contains a frequent aspect, all adjectives are extracted as opinion words. For each aspect in the sentence (yes, there can be more than one!), the nearest adjective is recorded as its effective opinion.
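A minimal sketch of this pairing, again assuming POS-tagged input and considering only single-word aspects:

```python
def effective_opinions(tagged_sentence, aspects):
    """For each aspect word present in the sentence, record the nearest
    adjective (by word distance) as its effective opinion."""
    tokens = [tok.lower() for tok, _ in tagged_sentence]
    adj_positions = [i for i, (_, pos) in enumerate(tagged_sentence)
                     if pos.startswith("JJ")]
    result = {}
    for aspect in aspects:
        if aspect in tokens and adj_positions:
            a_pos = tokens.index(aspect)
            result[aspect] = tokens[min(adj_positions,
                                        key=lambda i: abs(i - a_pos))]
    return result
```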

To find the sentiment orientation of these adjectives, a WordNet propagation technique is used. Using a small set of known positive and negative words, their sentiment is propagated through the WordNet graph to determine the orientation for all (or at least a large number) of the adjectives in WordNet. This method, while smart in 2004, is obsolete with the advent of SentiWordNet, a version of WordNet already tagged with polarity orientation scores.
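The propagation idea itself is simple: synonyms inherit the same orientation, antonyms the opposite. A breadth-first sketch over a toy synonym/antonym graph (the graph here is hand-made; the paper walks the actual WordNet graph):

```python
from collections import deque

def propagate_orientation(seeds, synonyms, antonyms):
    """BFS propagation: starting from seed words mapped to +1/-1,
    synonyms keep the orientation, antonyms flip it."""
    orientation = dict(seeds)  # word -> +1 (positive) or -1 (negative)
    queue = deque(orientation)
    while queue:
        word = queue.popleft()
        for neigh in synonyms.get(word, []):
            if neigh not in orientation:
                orientation[neigh] = orientation[word]
                queue.append(neigh)
        for neigh in antonyms.get(word, []):
            if neigh not in orientation:
                orientation[neigh] = -orientation[word]
                queue.append(neigh)
    return orientation

syn = {"good": ["great"], "great": ["fantastic"]}
ant = {"good": ["bad"]}
print(propagate_orientation({"good": 1}, syn, ant))
```

With SentiWordNet you would skip this step entirely and just look up the polarity scores.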

Infrequent Aspects

While the frequent aspects are the most important, there are also some aspects that only a few people talk about. It would, however, be a shame to miss these just because they are not mentioned that often. To find them, the authors devised the following algorithm:

For each sentence in the review database that contains one or more opinions words but does not contain any frequent aspect, the nearest noun or noun phrase around the opinion word is stored as an infrequent aspect.
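This rule could be sketched as follows for single-word nouns (assuming POS-tagged sentences; noun phrases are omitted for brevity):

```python
def infrequent_aspects(tagged_sentences, opinion_words, frequent_aspects):
    """For sentences with an opinion word but no frequent aspect, record the
    noun nearest to each opinion word as an infrequent aspect."""
    found = set()
    for sent in tagged_sentences:
        tokens = [tok.lower() for tok, _ in sent]
        noun_positions = [i for i, (_, pos) in enumerate(sent)
                          if pos.startswith("NN")]
        if any(t in frequent_aspects for t in tokens) or not noun_positions:
            continue  # sentence already covered, or no candidate nouns
        for i, tok in enumerate(tokens):
            if tok in opinion_words:
                found.add(tokens[min(noun_positions, key=lambda j: abs(j - i))])
    return found
```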

Note that this will result in quite a lot of non-aspects as well: precision will decrease in favor of a larger recall.

Orientation of Opinion Sentences

Using all the extracted information from the previous steps, we can now predict the polarity of each opinion sentence. The best way to explain this is to show you the original pseudocode of the authors:


Essentially, there are three scenarios in this pseudocode:

  1. There is a majority in positive/negative words. Sentence orientation will be the same as the majority.
  2. There are as many positive as negative words. Now only effective opinion words are counted (yes, the same opinion word can be effective for multiple aspects and will be counted for each aspect). A simple form of negation handling is used here (in the wordOrientation method, the term 'appears closely' is defined as being within a word distance of 5). Again, the majority determines the orientation of the sentence.
  3. There is still no majority: it is impossible to determine polarity from the words in this sentence. The sentence is assigned the polarity of the previous sentence.

A notable exception is formed by sentences with a clause that starts with a word like but, however, etc. If no opinion word appears in such a clause, the opposite of the sentence polarity will be used for the aspects in the clause.
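The three scenarios above can be condensed into a rough Python rendering (this is my reconstruction, not the authors' pseudocode; negation and but-clause handling are omitted for brevity):

```python
def sentence_orientation(opinion_words_in_sentence, effective_opinion_words,
                         orientations, previous_orientation=1):
    """Decide a sentence's polarity: (1) majority vote over all opinion
    words; (2) on a tie, vote over effective opinions only; (3) on a second
    tie, inherit the previous sentence's orientation."""
    total = sum(orientations[w] for w in opinion_words_in_sentence)
    if total != 0:
        return 1 if total > 0 else -1
    total = sum(orientations[w] for w in effective_opinion_words)
    if total != 0:
        return 1 if total > 0 else -1
    return previous_orientation
```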

Summary Generation

The last step is the generation of an opinion summary, which is done as follows: for each aspect, all sentences that mention this aspect are grouped into positive and negative sentences. Note that the sentence polarity is used, regardless of the polarity of the actual aspect within the sentence. The aspects are then ranked according to their frequency in the reviews. Longer aspects are shown before single-word aspects.
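A minimal sketch of this aggregation (one plausible reading of the ranking rule: multi-word aspects first, then by frequency):

```python
from collections import defaultdict

def build_summary(sentences):
    """sentences: list of (text, aspects, orientation) triples, with
    orientation +1/-1. Groups sentences per aspect by sentence polarity and
    ranks aspects: multi-word aspects first, then by mention frequency."""
    summary = defaultdict(lambda: {"positive": [], "negative": []})
    for text, aspects, orientation in sentences:
        for aspect in aspects:
            key = "positive" if orientation > 0 else "negative"
            summary[aspect][key].append(text)
    ranked = sorted(
        summary,
        key=lambda a: (-len(a.split()),
                       -(len(summary[a]["positive"]) + len(summary[a]["negative"]))),
    )
    return ranked, dict(summary)
```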


The method is evaluated with respect to the following three tasks:

  • Aspect extraction
  • Opinion sentence extraction
  • Orientation prediction of opinion sentences

The following table has the precision and recall scores for all the steps in the aspect extraction phase. This can be very useful when you want to implement this method, as you can test intermediate steps instead of only the final numbers.

The results for the other two tasks are summarized in this table:



This research has set the stage for many researchers to develop methods of their own for aspect-level sentiment analysis. As such, it has been very inspiring, which is definitely a good point.

As it is one of the first solutions for this problem, there are, of course, some shortcomings in this research. Many of these have been addressed in later papers, but I think it makes sense to mention them here anyway. Note that what I call shortcomings are often just design decisions, where simplicity (and thus speed) is favored over precision. This is not wrong per se, as a method that gives you a perfect result... in 100 years, is not so useful. I'll just point out these design decisions and their possible ramifications.

  • On multiple occasions, word distance is used as a proxy for syntactic relatedness in a sentence. While in a lot of cases this might work, it is essentially incorrect. For true syntactic relatedness you would want a full syntactic parser, which will give you all the grammatical relations in a sentence. This is however a costly (read: slow) operation.
  • By computing the orientation on a sentence level, the individual orientation of aspects is lost. I don't really understand why the authors have chosen this over just using the effective opinions of the aspects when aggregating the information into the opinion summary.
  • Only adjectives are used as opinion-bearing words. This has indeed been the standard for some time, but now we know that also other words, for example nouns, can bear sentiment. You can see that in SentiWordNet where a lot of words have a polarity score, not only adjectives.
  • The frequent itemset method from the Apriori algorithm assumes a uniform distribution as a baseline. However, words are far from uniformly distributed. In fact, they have something close to a Zipfian distribution. Some words, and hence, certain word combinations, occur much more frequently than others. The 1% minimum support rule used in this research does not take this into account.

A big advantage of a method like this, is that it does not need any training: it will work right out of the box with decent performance. In summary, this is an excellent base paper in the field of aspect-level sentiment analysis. Obviously, as it is one of the first, there are many improvements that can be made.

Some students I know have tried to implement this method, but were somehow not able to reproduce the reported performance for aspect extraction. While sentiment determination, using SentiWordNet, is just as good, if not better, aspect extraction performance was far worse. If you implemented this and encountered this problem as well, please let me know! If you did not encounter this, then definitely let me know :), there might be a bug somewhere after all...
