# Paper Discussion: Du & Tan (2009)

#### An Iterative Reinforcement Approach for Fine-Grained Opinion Mining

- Weifu Du & Songbo Tan

##### Characteristics Summary
| Characteristic | Value |
| --- | --- |
| Domain | Chinese hotel reviews from www.ctrip.com |
| Sentiment classes | Unknown |
| Method | Improved Information Bottleneck algorithm with semantic information |
##### Performance Summary
| Task | Metric | Score |
| --- | --- | --- |
| Review aspect category construction | Rand index | 0.71 |
| Sentiment association | Precision | 78.90% |
##### Introduction

Approaching the problem of aspect-level sentiment analysis from a completely different perspective makes this paper an interesting, albeit difficult, read. I hope I will be able to correctly summarize and explain what's going on in this research. Feel free to help me out in the comments if I've made a mistake.

If we define the process of aspect-level sentiment analysis as the following three sub-tasks:

1. get opinion word set O (with polarity labels);
2. get product feature set F;
3. identify relationships between F and O.

Then this research focuses on steps 2 and 3, assuming that step 1 has already been done. In particular, it tries to find not only explicit features, but implicit ones as well. The idea is that while these implicit features are not mentioned as such in the text, they can be deduced from the opinion words used to describe them. Instead of using adjacency to find the relationships between $F$ and $O$, this approach detects these relations based on review feature categories and opinion word groups from the corpus. Admittedly, this sounds a bit vague, so let's try to explain it some more.

##### Clustering based on the Information Bottleneck algorithm

Let's define the set of product feature words $F = \{f_1, f_2, ..., f_m\}$ and the set of opinion words $O = \{o_1, o_2, ..., o_n\}$. Then a weighted bipartite graph $G = \{F, O, R\}$ can be built, where $R = [r_{ij}]$ is the $m \times n$ link weight matrix containing all the pair-wise weights between $F$ and $O$. While multiple weighting schemes exist, the one chosen here uses the co-occurrence frequency of $f_i$ and $o_j$ at the clause level as the weight.
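To make the weighting concrete, here is a minimal Python sketch that builds the clause-level co-occurrence matrix $R$ (the function and variable names are my own, not from the paper):

```python
def build_link_weights(clauses, feature_words, opinion_words):
    """Build the m x n link weight matrix R, where r_ij counts how often
    feature word f_i and opinion word o_j co-appear in the same clause.
    Illustrative sketch; assumes whitespace-tokenizable input."""
    f_index = {f: i for i, f in enumerate(feature_words)}
    o_index = {o: j for j, o in enumerate(opinion_words)}
    R = [[0] * len(opinion_words) for _ in feature_words]
    for clause in clauses:
        tokens = set(clause.split())
        for f in tokens & f_index.keys():
            for o in tokens & o_index.keys():
                R[f_index[f]][o_index[o]] += 1
    return R

clauses = ["the room was clean", "room spacious", "staff friendly", "clean room"]
R = build_link_weights(clauses, ["room", "staff"], ["clean", "spacious", "friendly"])
# R[0] holds the co-occurrence counts of "room" with each opinion word
```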

Now comes the hard part: both $F$ and $O$ are taken as random variables, and the problem of constructing or clustering the object groups is defined as finding a compressed representation of each variable while preserving as much of the information about the other variable as possible. The iterative reinforcement framework proposed in this paper is based on the information bottleneck algorithm and the above definitions.

First, the mutual information between the two random variables can be defined as the symmetric functional

$$I(F, O) = \sum_{f \in F} \sum_{o \in O} p(f, o) \log \frac{p(f, o)}{p(f)\,p(o)},$$

which measures the relative entropy between their joint distribution $p(f, o)$ and the product of the respective marginal distributions $p(f)\,p(o)$. It is the only consistent statistical measure of the information that variable $F$ contains about variable $O$ (and the other way around). The intuition is that when you compress $F$ into $C$, some of that mutual information is lost. Or in other words: $I(C, O) \leq I(F, O)$.
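The mutual information is straightforward to compute from an empirical joint distribution; a small illustrative sketch (my own code, not the paper's):

```python
import math

def mutual_information(joint):
    """I(F, O) = sum over f, o of p(f, o) * log(p(f, o) / (p(f) * p(o))).
    `joint` is a 2-D list of joint probabilities summing to 1."""
    p_f = [sum(row) for row in joint]                 # marginal over F
    p_o = [sum(col) for col in zip(*joint)]           # marginal over O
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # 0 * log 0 is taken as 0
                mi += p * math.log(p / (p_f[i] * p_o[j]))
    return mi

# Independent variables carry zero mutual information:
print(round(mutual_information([[0.25, 0.25], [0.25, 0.25]]), 6))  # 0.0
# Perfectly correlated variables yield I = log 2:
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))
```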

So $C$ is the compressed representation of $F$: essentially a (possibly stochastic) mapping from each value $f \in F$ to some $c \in C$. Formally, this mapping can be characterized by the conditional probability $p(c \mid f)$, i.e. a soft partitioning of the $F$ values, as each value of $F$ is associated to some extent (with a normalized probability) with each value of $C$. Given the empirical joint distribution of the two variables, one variable is compressed so that the mutual information about the other variable is preserved as much as possible. In other words: the Information Bottleneck method can be seen as finding a minimal sufficient partition of one variable with respect to the other. With the help of a Lagrange multiplier, the problem can be solved by minimizing $L[p(c \mid f)] = I(C, F) - \beta I(C, O)$. Working this out gives a solution defined in terms of $p(c)$, $p(c \mid f)$, and $p(o \mid c)$. The quality of the solution can then be computed using the Kullback-Leibler divergence between $p(o \mid f)$ and $p(o \mid c)$.

For the math enthusiasts: the KL divergence, or the distortion between points $f$ and $c$, is

$$d(f, c) = D_{KL}\left[\,p(o \mid f)\,\|\,p(o \mid c)\,\right] = \sum_{o \in O} p(o \mid f) \log \frac{p(o \mid f)}{p(o \mid c)}.$$

The clustering algorithm starts by trivially assigning each value of $F$ to its own singleton cluster. Clusters are then merged in a way that locally minimizes the loss of mutual information.
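This agglomerative scheme appears to follow the standard agglomerative Information Bottleneck, where the cost of merging two clusters can be written as a weighted Jensen-Shannon divergence between their conditional distributions. A sketch of that merge cost (my own formulation of the standard algorithm, not code from the paper):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) between two distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def merge_cost(p_c1, p_c2, p_o_given_c1, p_o_given_c2):
    """Loss of mutual information incurred by merging clusters c1 and c2:
    (p(c1) + p(c2)) times the weighted Jensen-Shannon divergence between
    p(o|c1) and p(o|c2)."""
    total = p_c1 + p_c2
    w1, w2 = p_c1 / total, p_c2 / total
    # Conditional distribution of the merged cluster.
    p_merged = [w1 * a + w2 * b for a, b in zip(p_o_given_c1, p_o_given_c2)]
    return total * (w1 * kl(p_o_given_c1, p_merged) + w2 * kl(p_o_given_c2, p_merged))

# Clusters with identical conditionals merge for free:
print(merge_cost(0.5, 0.5, [0.9, 0.1], [0.9, 0.1]))  # 0.0
# Dissimilar conditionals cost mutual information:
print(merge_cost(0.5, 0.5, [0.9, 0.1], [0.1, 0.9]) > 0)  # True
```

At every step the greedy algorithm merges the pair with the smallest such cost, which is exactly the "locally minimal loss of mutual information" described above.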

##### An improved Information Bottleneck algorithm

In the above algorithm, the distance between two objects is measured as the difference in information values. In this case, that is based on the co-occurrence values of $F$ and $O$. However, such an approach completely ignores any semantic relatedness that may or may not be present between these two variables. Therefore, the distance between two objects is augmented to be a linear combination of the information value distance and the semantic relatedness distance. A semantic lexicon like WordNet (or HowNet in this case for Chinese) is used to determine the semantic relatedness. A parameter $\alpha$ is used to determine the influence of the semantic relatedness distance, and $1 - \alpha$ then obviously is the influence of the traditional information value distance.
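The augmented distance itself is just a convex combination; a trivial sketch (assuming both component distances have already been computed and normalized to comparable scales):

```python
def combined_distance(info_dist, semantic_dist, alpha):
    """Distance used by the improved IB algorithm: alpha weights the
    semantic relatedness distance, (1 - alpha) the information value
    distance. Both inputs are assumed to be precomputed."""
    return alpha * semantic_dist + (1 - alpha) * info_dist

# With alpha = 0.75, the semantic term dominates:
print(round(combined_distance(0.8, 0.2, 0.75), 6))  # 0.35
```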

##### Evaluation

The system is evaluated on a set of Chinese hotel reviews extracted from www.ctrip.com, where hotels are given an overall rating of one to four stars. The set of opinion words is taken to consist of all adjectives in the text, while nouns and noun phrases together form the set of feature words. The latter set is then pruned by removing all named entities (as found by a named entity recognizer).

Two tasks are evaluated: the effectiveness of product feature category construction, and the precision of the sentiment association between product feature categories and opinion word groups.

For the first task, the Rand index is used, which is a measure of agreement between two partitions of the same data: given one partition of a data set, it quantifies how similar another partition is to the first. It can be computed as $Rand(P_1, P_2) = \frac{a + b}{n(n - 1) / 2}$, where each partition is viewed as a set of $n(n-1)/2$ pairwise decisions, with $n$ being the size of the data set. For each pair of points, a partition either assigns them to the same cluster or to different clusters. Then $a$ is the number of pairs that are in the same cluster in both partitions, and $b$ is the number of pairs that are in different clusters in both partitions.
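A small Python sketch of the Rand index over two toy partitions (my own illustration, not from the paper):

```python
from itertools import combinations

def rand_index(p1, p2):
    """Rand(P1, P2) = (a + b) / (n(n-1)/2). a counts pairs placed in the
    same cluster by both partitions, b counts pairs placed in different
    clusters by both. p1 and p2 map each point to a cluster label."""
    points = list(p1)
    agree = 0
    for x, y in combinations(points, 2):
        same_in_p1 = p1[x] == p1[y]
        same_in_p2 = p2[x] == p2[y]
        if same_in_p1 == same_in_p2:  # both "same" or both "different"
            agree += 1
    n = len(points)
    return agree / (n * (n - 1) / 2)

# Toy example: the partitions disagree only on the pair (c, d).
p1 = {"a": 0, "b": 0, "c": 1, "d": 1}
p2 = {"a": 0, "b": 0, "c": 1, "d": 2}
print(rand_index(p1, p2))  # 5 of 6 pairwise decisions agree
```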

To find the optimal value for parameter $\alpha$, it is increased from 0 to 1 in steps of 0.2, with the results recorded in the graph below:

As expected, the semantic relatedness distance is more important than traditional information value, although the latter is not completely unimportant.

For the second task, the precision of the association between feature categories and opinion word groups is computed (no recall, unfortunately!). It is computed as $\frac{\textit{number of correctly associated pairs}}{\textit{number of detected pairs}}$. A simple template-based system is cooked up to compare the results against: by means of regular expressions, noun (phrases) and gerund (phrases) are extracted as the review features, and the nearest adjective is taken as the associated opinion word. This is of course reminiscent of Hu & Liu (2004), although much simpler. The results are shown in the table below:

| Approach | Pairs | Precision |
| --- | --- | --- |
| Template extraction | 27,683 | 65.89% |
| Proposed approach | 141,826 | 78.90% |
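For illustration, the nearest-adjective pairing of the template baseline could look roughly like this (a toy sketch over pre-tagged English tokens with a made-up tagset; the actual system works on Chinese text with regular expressions):

```python
def nearest_adjective_pairs(tagged_tokens):
    """Pair each noun with the nearest adjective in the clause, in the
    spirit of the paper's template baseline. Tags here follow a toy
    scheme: 'NN' for nouns, 'JJ' for adjectives."""
    noun_positions = [i for i, (_, tag) in enumerate(tagged_tokens) if tag == "NN"]
    adj_positions = [i for i, (_, tag) in enumerate(tagged_tokens) if tag == "JJ"]
    pairs = []
    for n in noun_positions:
        if adj_positions:
            a = min(adj_positions, key=lambda j: abs(j - n))  # nearest adjective
            pairs.append((tagged_tokens[n][0], tagged_tokens[a][0]))
    return pairs

sentence = [("the", "DT"), ("room", "NN"), ("was", "VB"), ("clean", "JJ")]
print(nearest_adjective_pairs(sentence))  # [('room', 'clean')]
```

Adjacency-based pairing like this breaks down exactly where the paper's corpus-level approach is supposed to help: when the opinion word is far from the feature, or when the feature is implicit.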
##### Discussion

While the obtained precision seems good, it is hard to really value this research. The baseline is not strong enough to properly serve as a benchmark, as it is a simplified version of a method that was already a couple of years old when this paper was published. Furthermore, the recall is missing, and, as we've seen before, there are multiple papers that boast a very high precision, but at the cost of recall. As there is always a trade-off between the two, reporting only precision is not that valuable. I would personally also have liked to see some examples of the clusters generated by this method. As it is now, it remains a bit vague what the actual output is. Grouping product aspect words into categories is something I can imagine, but what about the opinion words? Are they clustered into groups as well? And if so, into what kind of groups: just positive/negative, or something more sophisticated?

A final remark is that choosing only adjectives as opinion words and nouns/noun phrases as aspects is a bit of a safe bet. Indeed, these word categories correspond quite closely to opinion words and aspects respectively, but other types of words can express aspects or opinions as well. As the authors explain, they didn't want to include other categories as that would also introduce a lot of noise. Decisions like these will indeed improve precision, but at the cost of recall (which is not reported).

The idea of using this Information Bottleneck algorithm is however interesting, and I am curious to see if this can be found in other papers as well.

P.S. This is my first post that uses LaTeX formulas, and I must say: it looks very nice! I use the LaTeX for WordPress plugin for this. What I especially like is that it uses the MathJax library to render the LaTeX code as actual characters instead of as a picture (although I believe a picture is used as a fallback in case JavaScript fails).