January 02, 2008

Stars and Bars

We are often asked why we use a histogram visualization we refer to as a "snip" for ratings rather than the traditional stars or numeric scores used on most other sites. The answer is a bit more complex than "it looks cool" and thus seemed like a great topic for a post.  Star and numeric single value summaries for a product often hide valuable information. This hurts users because controversial books are often the great books, or the negative review is often the most insightful.  So today I am going to give a quick review of the history of attitude encoding, examine the issues associated with central tendency summaries (single numeric or star value) and demonstrate how a histogram allows users to better understand the review-o-sphere.

Reviews, surveys and questionnaires allow people to express their attitudes and opinions on various topics.  While verbal or textual free-form methods of capturing sentiment give people the greatest freedom in what they can talk about, they are not well suited to comparison or summary statistics. How do you compare two movie reviews if all you have is text?  In 1932 Rensis Likert was developing methods for encoding attitudes on various topics and came up with the idea of encoding agreement with a question or topic on a numbered scale.  His system is now called the Likert scale.

(Likert, Rensis (1932), "A Technique for the Measurement of Attitudes", Archives of Psychology 140: pp. 1-55).

While many are not familiar with his name, we have all encountered this technique in questionnaires, reviews or ratings when we are asked to rate something on a scale of 1 to X or to say how much we agree or disagree with a statement. Because attitudes are encoded as a number, there is no question of how to compare text: just compare the numeric values (stars), and summaries of that data can easily be calculated.

The idea of encoding a rating on a 1 to X point scale has been in use for years.  Online review companies have adopted it too; examples are IMDb, Epinions.com, Yahoo! Movies, Amazon.com, and many more. Once users express their ratings numerically, the natural reaction is to use a central tendency statistic to quickly understand the data set. Here is where the common use of the mean (average) as the statistic becomes a problem.

A central tendency statistic is a single value that is quoted as representative of an entire data set. Most commonly we see the average, or mean, used to represent a set of reviews.  The problem with using the average for reviews is simple -- people often disagree!  In other words, reviews often follow a bi-modal distribution rather than a normal (Laplace-Gaussian) distribution.

Take the following example bi-modal distribution.  If we were to take the average from this data, it would give us a value between the two humps. That value would not show us that there are two camps of views on a topic. Other central tendency statistics like mode or median have the same problem. 

[Figure: example bi-modal distribution]
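Here is a minimal sketch of that effect in Python. The ratings are made up for illustration (two camps of 40 people with only a few in between) and are not real review data; `share_of` is just a helper name chosen for this example.

```python
from statistics import mean, median

# Made-up ratings on a 1-5 scale: two camps of 40 with few people in between.
ratings = [1] * 40 + [2] * 8 + [3] * 4 + [4] * 8 + [5] * 40


def share_of(value, data):
    """Return the fraction of ratings exactly equal to `value`."""
    return sum(1 for r in data if r == value) / len(data)


avg = mean(ratings)
mid = median(ratings)

# Both summaries land on 3 stars, a score that very few people actually gave.
print("mean:", avg)
print("median:", mid)
print("share of reviewers who gave 3 stars:", share_of(3, ratings))
```

In this toy data set both the mean and the median sit at 3 stars, even though only 4 of the 100 reviewers chose that rating, so neither number hints that there are two camps.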

Examining our data we don't see ratings following a normal distribution.  As in the examples below, reviews often have several opinion groups being represented in the data, thus a single value can never tell the full story. 

[Figure: example review distributions]

Looking only at the average number of stars also hides another inherent bias in reviews:  not only do people tend to write polarized reviews, they also tend to write positive ones.  If you look at our home page, we provide a visualization of all the reviews we have indexed. In it you can see that green (positive or 5 star) reviews far outnumber the other classes.  You might expect about as many one-star books, movies and gadgets as five-star ones, but in fact the majority have a high average rating.  That just makes it harder to make a decision!

[Figure: overall opinion across all indexed reviews]

We were not the first to see this pattern and conclude that the mean (average) is problematic for online reviews. Hu et al. have a great paper examining review distributions.

Hu, Nan, Paul A. Pavlou, and Jennifer Zhang, "Can Online Word-of-Mouth Communication Reveal True Product Quality? Experimental Insights, Econometric Results, and Analytical Modeling," (April 2006).

Also, the NY Times did a piece a few years ago looking at Amazon ratings and found that some of the best sellers out there have a bi-modal distribution (what they called "horseshoe-shaped") of ratings. The reason is simple: some of the best books are the most controversial. A little feature on Summize called "disagree on" helps you find those gems. Try it with the query "non-fiction books": click on "disagree on" and find books like:

all are controversial books of some kind and best sellers.

[Figure: "disagree on" results]

When Greg Pass (one of the co-founders of Summize) was looking for the best way to visualize our sentiment classification of reviews, he wanted to express more information in the same amount of space as current star representations. He has a post coming up with his story on that visualization, but I will give a brief spoiler:  his histogram shows the full data set, which avoids the issues with central tendency statistics like mean, median, mode or fancier encodings. I think his visualization does that quite effectively, but of course I am a bit biased on that point. The visualization is sometimes called a stacked bar histogram or spinogram.
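To make the idea concrete, here is a rough Python sketch of a stacked bar drawn in plain text. It is not the actual Summize implementation; the function names, the digit symbols and the made-up ratings are all assumptions for illustration. Each star level gets a segment whose length is proportional to its share of the reviews, so the whole distribution fits in one fixed-width strip.

```python
from collections import Counter


def rating_counts(ratings, scale=5):
    """Count how many ratings fall on each star level from 1 to `scale`."""
    counts = Counter(ratings)
    return [counts.get(star, 0) for star in range(1, scale + 1)]


def stacked_bar(counts, width=50, symbols="12345"):
    """Draw one fixed-width bar; each segment is proportional to its count."""
    total = sum(counts)
    if total == 0:
        return " " * width
    exact = [c * width / total for c in counts]
    cells = [int(x) for x in exact]
    # Hand out cells lost to rounding, largest fractional remainder first,
    # so the segments always add up to exactly `width` characters.
    leftover = width - sum(cells)
    by_remainder = sorted(
        range(len(counts)), key=lambda i: exact[i] - cells[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        cells[i] += 1
    return "".join(sym * n for sym, n in zip(symbols, cells))


# Made-up polarized ratings: two camps of 40 with few people in between.
ratings = [1] * 40 + [2] * 8 + [3] * 4 + [4] * 8 + [5] * 40
bar = stacked_bar(rating_counts(ratings))
print(bar)
```

With these ratings the long runs of 1s and 5s at either end of the strip make the two camps visible at a glance, in the same horizontal space a row of stars would take.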

Summize is not the only company trying to help people fully understand reviews.  A few months after Summize released its visualization, Amazon came out with one of its own. Below are two visualizations of the review distribution data for "The Fair Tax.." book, Amazon's vs Summize's. Both offer interesting views into the data; I will let you pick which one you like better.

[Figure: Amazon's review distribution]

[Figure: Summize's review distribution]

One good theory for why ratings do not follow a normal distribution is "sampling bias". People tend not to write reviews unless they have a strong opinion on a topic.  Hu et al. looked at this sampling bias in reviews and proposed that if all users were forced to review a product, a normal distribution would occur for most products. The reality is that most people do not write reviews of the products they consume.  The people who do write reviews tend to have polar opinions, and so this sampling bias occurs.
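A small simulation can illustrate the theory. This is a sketch under invented assumptions, not a model taken from the paper: everyone's true opinion is drawn from a normal distribution centered on 3 stars, and the chance of actually writing a review (the `write_prob` table below) is much higher for people at the extremes.

```python
import random
from collections import Counter

# Assumed (invented) probability that someone with this opinion writes a review.
write_prob = {1: 0.30, 2: 0.05, 3: 0.02, 4: 0.05, 5: 0.30}


def simulate(n=100_000, seed=0):
    """Return (everyone, reviewers): star counts for all users and for
    the subset who chose to write a review."""
    rng = random.Random(seed)
    everyone = Counter()
    reviewers = Counter()
    for _ in range(n):
        # True opinion: normal around 3 stars, rounded and clamped to 1..5.
        star = min(5, max(1, round(rng.gauss(3, 1))))
        everyone[star] += 1
        if rng.random() < write_prob[star]:
            reviewers[star] += 1
    return everyone, reviewers


everyone, reviewers = simulate()
for star in range(1, 6):
    print(star, "stars  everyone:", everyone[star], " reviewers:", reviewers[star])
```

Under these assumptions the full population peaks at 3 stars, while the people who actually write reviews pile up at 1 and 5 stars. The self-selected sample looks bi-modal even though the underlying opinions are not.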

Our goal in summarizing reviews is to expose the many facets they cover about a product.  A single value hides all that great insight people have.  While I am all for simplification, I think that showing the full set of data in a quick and easy visualization makes for a more informative, more interesting experience. 

Abdur
abdur@summize.com
