<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="/stylesheets/rss.css" type="text/css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>Tom Moertel's Weblog: Tag imdb</title>
    <link>http://blog.moertel.com/articles/tag/imdb?tag=imdb</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>Quality rants on programming theory and stuff geeks like</description>
    <item>
      <title>Greasemonkey script annotates IMDb movies with their decoder-ring percentile ranks</title>
      <description>&lt;p&gt;Sam at &lt;a href="http://rephrase.net/"&gt;rephrase.net&lt;/a&gt; has harnessed the earth-shattering power of the &lt;a href="http://community.moertel.com/ss/space/IMDB+Movie-Rating+Decoder+Ring"&gt;IMDb movie-rating decoder ring&lt;/a&gt; to create a &lt;a href="http://rephrase.net/days/07/06/imdb-decoder"&gt;Greasemonkey script that annotates IMDb-listed movies with their percentile ranks&lt;/a&gt;. Now you don&amp;#8217;t need to look up a movie&amp;#8217;s &amp;#8220;star rating&amp;#8221; in the decoder ring to see where the movie ranks; the ranking appears right on the movie&amp;#8217;s IMDb page.&lt;/p&gt;


	&lt;p&gt;Do check out the &lt;a href="http://rephrase.net/box/user-js/scripts/imdb-percentile-ratings.user.js"&gt;script itself&lt;/a&gt;  to see how Sam cleverly embeds a copy of the decoder ring and plucks scores from it as needed.&lt;/p&gt;


	&lt;p&gt;For more on the IMDb movie-rating decoder ring, see:&lt;/p&gt;


	&lt;ul&gt;
	&lt;li&gt;&lt;a href="http://community.moertel.com/ss/space/IMDB+Movie-Rating+Decoder+Ring"&gt;the decoder ring itself&lt;/a&gt;&lt;/li&gt;
		&lt;li&gt;&lt;a href="http://blog.moertel.com/articles/2007/06/21/talk-fun-with-numbers-r-and-perl-and-imdb-data"&gt;my talk &lt;em&gt;Fun with Numbers: R and Perl and &lt;span class="caps"&gt;IMDB&lt;/span&gt; data&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;
		&lt;li&gt;&lt;a href="http://blog.moertel.com/articles/2006/01/17/mining-gold-from-the-internet-movie-database-part-1"&gt;Mining gold from the IMDb&lt;/a&gt;&lt;/li&gt;
	&lt;/ul&gt;</description>
      <pubDate>Wed, 11 Jul 2007 13:49:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:ccf23640-312c-49f8-9e89-7bae08d56c4f</guid>
      <author>Tom Moertel</author>
      <link>http://blog.moertel.com/articles/2007/07/11/greasmonkey-script-for-imdb-decoder-ring</link>
      <category>hacks</category>
      <category>imdb</category>
      <category>statistics</category>
      <category>greasmonkey</category>
      <trackback:ping>http://blog.moertel.com/articles/trackback/513</trackback:ping>
    </item>
    <item>
      <title>Talk: Fun with Numbers: R and Perl (and IMDB data)</title>
      <description>&lt;p&gt;Last week I gave a talk on the &lt;a href="http://www.r-project.org/"&gt;R statistics
system&lt;/a&gt; and Perl for the &lt;a href="http://pgh.pm.org/"&gt;Pittsburgh Perl
Mongers&lt;/a&gt;.  The example that threaded through the
talk was something I have written about here before, &lt;a href="http://blog.moertel.com/articles/2006/01/17/mining-gold-from-the-internet-movie-database-part-1"&gt;extracting
useful information from the Internet Movie
Database&lt;/a&gt;.
If you&amp;#8217;ve read my earlier &lt;a href="http://blog.moertel.com/articles/2006/01/17/mining-gold-from-the-internet-movie-database-part-1"&gt;blog
post&lt;/a&gt;
or have used the &lt;a href="http://community.moertel.com/ss/space/IMDB+Movie-Rating+Decoder+Ring"&gt;Grand Unified &lt;span class="caps"&gt;IMDB&lt;/span&gt; Movie Rating Decoder
Ring&lt;/a&gt;,
you might find the slides from the talk interesting.  They provide
some more details about the R and Perl code used to analyze the &lt;span class="caps"&gt;IMDB&lt;/span&gt; data
and create the decoder ring.&lt;/p&gt;


	&lt;p&gt;You can get the slides here:&lt;/p&gt;


&lt;div class="slide"&gt;
&lt;a href="http://community.moertel.com/~thor/talks/pgh-pm-perl-and-r.pdf"&gt;&lt;img src="http://community.moertel.com/~thor/talks/pgh-pm-perl-and-r.png" title="Title slide from my talk on R and Perl" alt="Title slide from my talk on R and Perl" /&gt;&lt;/a&gt;
&lt;/div&gt;</description>
      <pubDate>Thu, 21 Jun 2007 14:38:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:790fc9ef-72d5-43fc-b140-f0aaeccad6ee</guid>
      <author>Tom Moertel</author>
      <link>http://blog.moertel.com/articles/2007/06/21/talk-fun-with-numbers-r-and-perl-and-imdb-data</link>
      <category>talks</category>
      <category>perl</category>
      <category>talks</category>
      <category>R</category>
      <category>imdb</category>
      <category>statistics</category>
      <trackback:ping>http://blog.moertel.com/articles/trackback/481</trackback:ping>
    </item>
    <item>
      <title>The IMDB Movie Rating Decoder Ring: updated w/ 2 March 2007 data</title>
      <description>&lt;p&gt;If you want to get more out of &lt;a href="http://imdb.com/"&gt;&lt;span class="caps"&gt;IMDB&lt;/span&gt;&lt;/a&gt; movie ratings, check out my
&lt;a href="http://community.moertel.com/ss/space/IMDB+Movie-Rating+Decoder+Ring"&gt;&lt;span class="caps"&gt;IMDB&lt;/span&gt; Movie Rating Decoder Ring&lt;/a&gt;, now updated with fresher data (as of 2 March 2007).&lt;/p&gt;</description>
      <pubDate>Fri, 09 Mar 2007 17:40:00 -0500</pubDate>
      <guid isPermaLink="false">urn:uuid:f75cbc12-2c78-4a30-9863-968dc535d1a3</guid>
      <author>Tom Moertel</author>
      <link>http://blog.moertel.com/articles/2007/03/09/the-imdb-movie-rating-decoder-ring-updated-w-2-march-2007-data</link>
      <category>statistics</category>
      <category>imdb</category>
      <category>statistics</category>
      <category>movies</category>
      <category>decoder_rinng</category>
      <category>ratings</category>
      <category>stars</category>
      <category>data</category>
      <trackback:ping>http://blog.moertel.com/articles/trackback/409</trackback:ping>
    </item>
    <item>
      <title>Mining gold from the Internet Movie Database, part 1: decoding user ratings</title>
      <description>&lt;p&gt;&lt;a href="http://imdb.com/"&gt;The Internet Movie Database&lt;/a&gt; (IMDb) is a rich source
of online movie information.  The problem is, the true gold is buried
deep beneath the site&amp;#8217;s user-friendly exterior and hidden within the
database itself.  With a little digging, however, we can extract the
gold, nugget by nugget, and learn about fun statistical tools for data
analysis.&lt;/p&gt;


	&lt;p&gt;Today, in the first part of our analysis, we will put our intuition
about rating systems to the test.  We will decode IMDb &amp;#8220;user ratings,&amp;#8221; 
those numbers such as 6.1 and 7.8 that summarize how the registered
users of the IMDb rated movies on a scale from 1 to 10, typically
depicted as a series of stars on the screen:&lt;/p&gt;


&lt;div style="text-align: center; margin: 1.5ex; "&gt;
&lt;img src="http://community.moertel.com/~thor/pix/20060114/sample-user-rating.png" title="sample user rating" alt="sample user rating" /&gt;
&lt;/div&gt;

	&lt;p&gt;We will extract the collective wisdom of registered IMDb users in
order to convert a movie&amp;#8217;s user rating into the movie&amp;#8217;s standing
within the database.  This gives us a good indicator of how the movie
stacks up against other movies in general, and that&amp;#8217;s good information
to have when deciding which movies to see in the theater or add to
your Netflix list.&lt;/p&gt;


	&lt;p&gt;Ready to start digging?  Let&amp;#8217;s go!&lt;/p&gt;&lt;h3&gt;Getting to know user ratings: fundamental descriptive statistics&lt;/h3&gt;


	&lt;p&gt;Like most online movie databases, the IMDb encourages its users to
rate movies on a numerical scale, in this case from 1 to 10.  The IMDb
software averages these ratings into a composite &amp;#8220;user rating&amp;#8221; for
each movie.  &lt;a href="http://imdb.com/title/tt0360717/"&gt;King Kong&lt;/a&gt;, for
example, currently has a user rating of 7.8.  &lt;a href="http://imdb.com/title/tt0388482/"&gt;Transporter
2&lt;/a&gt;, on the other hand, has a user
rating of 6.1.&lt;/p&gt;


	&lt;p&gt;Certainly, we have some sense of what these ratings mean.  A 6.1, for
example, is somewhat higher than the midpoint of the 1-to-10 scale.
Thus we might expect a 6.1-rated movie to be somewhat better than the
typical movie.  But is this expectation justified?  Also, we know a
7.8 is better than a 6.1.  But how much better?  Is it 1.7 stars
better?  And, if so, what does that mean?&lt;/p&gt;


	&lt;p&gt;To understand what user ratings mean, we must put them into context.
Let&amp;#8217;s assume that buried within the IMDb is some kind of useful
information that reflects the collective wisdom of the site&amp;#8217;s users.
When a movie is rated 7.8, we will assume that the rating means the
movie is &amp;#8220;better&amp;#8221; than lower-rated movies and &amp;#8220;worse&amp;#8221; than
higher-rated movies.  To what degree, we don&amp;#8217;t know for sure, but
that is what we are about to find out.&lt;/p&gt;


	&lt;p&gt;While we might not know what it means for a movie to be a &amp;#8220;7.8,&amp;#8221; we
probably do have a genuine sense for what it means for a movie to be
among the best of movies, or among the worst, or among the middle of
the pack.  We have developed this sense by experience, by watching
movies over our lifetimes.  What we need is some way of converting the
number 7.8 into something that registers with this
hard-earned experience.&lt;/p&gt;


	&lt;p&gt;As a starting point, let&amp;#8217;s examine the most fundamental descriptive
statistics of the IMDb&amp;#8217;s user ratings:&lt;/p&gt;


	&lt;table&gt;
		&lt;tr&gt;
			&lt;th style="text-align:right;"&gt;Count &lt;/th&gt;
			&lt;th style="text-align:right;"&gt;&amp;nbsp;Mean &lt;/th&gt;
			&lt;th style="text-align:right;"&gt;&amp;nbsp;Median &lt;/th&gt;
			&lt;th style="text-align:right;"&gt;&amp;nbsp;St.Dev. &lt;/th&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;23,396 &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;        6.2 &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;          6.4 &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;           1.4 &lt;/td&gt;
		&lt;/tr&gt;
	&lt;/table&gt;




	&lt;p&gt;Breaking them down:&lt;/p&gt;


	&lt;ul&gt;
	&lt;li&gt;&lt;em&gt;count&lt;/em&gt; &amp;#8211; There are 23,396 user ratings in the database.  (There are
  actually more, but to eliminate fringe movies I am considering only
  those movies that have been rated by more than 100 users.)&lt;/li&gt;
		&lt;li&gt;&lt;em&gt;mean&lt;/em&gt; &amp;#8211; The average user rating is 6.2.  While some ratings are
  lower and others higher, if you were to put all of the ratings in
  a blender and purée them into a homogeneous soup, the soup&amp;#8217;s
  overall rating would balance out to 6.2.&lt;/li&gt;
		&lt;li&gt;&lt;em&gt;median&lt;/em&gt; &amp;#8211; The rating that divides the database in half.  Ratings
  higher than 6.4 fall into the better half; ratings lower than 6.4, the
  worse half.&lt;/li&gt;
		&lt;li&gt;&lt;em&gt;standard deviation&lt;/em&gt; &amp;#8211; This is a measure of how spread out the
  ratings are.  Assuming the distribution of the ratings has a
  bell-curve shape, which we will investigate in a moment, about 68
  percent of the ratings will fall within one standard deviation of
  the mean, i.e., in the range 6.2 +/- 1.4 = 4.8 to 7.6.&lt;/li&gt;
	&lt;/ul&gt;
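

	&lt;p&gt;If you want to compute the same four statistics yourself, here is a minimal sketch in Python.  (My own analysis used R; the function name and the eight sample ratings below are made up for illustration and are not the real IMDb data.)&lt;/p&gt;

```python
import statistics

def describe(ratings):
    """Return the count, mean, median, and standard deviation of ratings."""
    return {
        "count": len(ratings),
        "mean": round(statistics.mean(ratings), 2),
        "median": round(statistics.median(ratings), 2),
        # statistics.stdev is the sample standard deviation (n - 1 divisor).
        "stdev": round(statistics.stdev(ratings), 2),
    }

# Hypothetical sample; the real data set covered 23,396 movies.
sample_ratings = [3.4, 5.2, 6.1, 6.4, 6.6, 7.0, 7.8, 8.3]
summary = describe(sample_ratings)
print(summary)
```

	&lt;p&gt;With only a handful of ratings the numbers will not match the table above, but the recipe is the same one that would be applied to the full set.&lt;/p&gt;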


	&lt;p&gt;Another way to examine the ratings is graphically.  The following
chart, called a &lt;em&gt;histogram&lt;/em&gt;, shows how many movies had each possible
user rating:&lt;/p&gt;


	&lt;p&gt;&lt;img src="http://community.moertel.com/~thor/pix/20060114/hist-all.png" title="Histogram of IMDb movie ratings" alt="Histogram of IMDb movie ratings" /&gt;&lt;/p&gt;


	&lt;p&gt;The ratings form a pointy bell curve.  It&amp;#8217;s easy to see that few
movies have ratings lower than 4 or higher than 8; most movies fall in
between.  The movies are most densely packed in the range that is a bit
higher than 6 and a bit lower than 8.  I have plotted the mean
(the triangle) and median (the &amp;#8220;X&amp;#8221;) along the bottom of the chart to put
them into perspective.&lt;/p&gt;


	&lt;h3&gt;Exploring the extremes&lt;/h3&gt;


	&lt;p&gt;With this information, we can begin to make crude interpretations of
user ratings.  Say we hear that
&lt;a href="http://imdb.com/title/tt0327554/"&gt;Catwoman&lt;/a&gt; has a user rating of 3.4.
Before we looked at the histogram, we probably could have guessed that
the movie was not good.  (We may even have heard as much from friends.)
But now that we have seen the histogram, we know that very few movies
had a rating lower than 4, let alone 3.4, and so we know the movie is
among the worst ever released.  It is, no pun intended, an outright
dog.&lt;/p&gt;


	&lt;p&gt;On the other side of the spectrum, 
&lt;a href="http://imdb.com/title/tt0372784/"&gt;Batman Begins&lt;/a&gt; has a user
rating of 8.3.  Since we know that few movies rate better than 8,
we know that this movie is probably among the very best.&lt;/p&gt;


	&lt;p&gt;The following histogram shows where both movies stand:&lt;/p&gt;


	&lt;p&gt;&lt;img src="http://community.moertel.com/~thor/pix/20060114/hist-all-catwoman.png" title="Histogram of IMDb movie ratings, augmented" alt="Histogram of IMDb movie ratings, augmented" /&gt;&lt;/p&gt;


	&lt;p&gt;So far, we understand the extremes of the rating system.  Movies lower
than 4 are probably terrible, and movies higher than 8 are probably
great.  No doubt, that is useful information.  But, what about that big
lump in the middle which represents the bulk of movies?  That is where
the real gold is hidden.  To get it, we must dig deeper.&lt;/p&gt;


	&lt;h3&gt;Charting the inner masses&lt;/h3&gt;


	&lt;p&gt;We already know Catwoman is bad, but how bad is it?  One way to
quantify its badness is to count how many movies in the database are
equally bad or worse, and compare that count to the size of the entire
database.  In the database, there are 1,060 movies with Catwoman&amp;#8217;s 3.4
user rating or lower.  The size of the entire database is 23,396
movies.  Dividing the first number by the second, we find that
Catwoman is among the worst 5 percent of movies in the database.
It is in the &lt;em&gt;5th percentile.&lt;/em&gt;&lt;/p&gt;
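

	&lt;p&gt;The arithmetic is simple enough to check by machine.  The sketch below, in Python, first repeats the division using the two counts quoted above and then shows one way to do the counting itself on a sorted list of ratings.  The helper name and the five-movie list are hypothetical.&lt;/p&gt;

```python
from bisect import bisect_right

def percentile_rank(sorted_ratings, rating):
    """Percent of movies rated at or below the given rating."""
    at_or_below = bisect_right(sorted_ratings, rating)
    return 100.0 * at_or_below / len(sorted_ratings)

# Counts quoted in the article: 1,060 of 23,396 movies are rated 3.4 or lower.
catwoman_percentile = 100.0 * 1060 / 23396
print(round(catwoman_percentile, 1))  # 4.5, i.e. about the 5th percentile

# Hypothetical miniature database, already sorted.
tiny_database = [3.4, 5.2, 6.1, 6.4, 7.8]
print(percentile_rank(tiny_database, 6.1))  # 60.0
```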


	&lt;p&gt;We just turned a 3.4 user rating into a percentage that tells us where
3.4-rated movies stand with respect to all of the movies within the
database.  If we repeat the process for all possible movie ratings and
plot the results, we get a chart like this:&lt;/p&gt;


	&lt;p&gt;&lt;img src="http://community.moertel.com/~thor/pix/20060114/ecdf-all-catwoman.png" title="Empirical cumulative distribution of IMDb movie ratings" alt="Empirical cumulative distribution of IMDb movie ratings" /&gt;&lt;/p&gt;


	&lt;p&gt;Each point on the S-shaped curve relates a movie&amp;#8217;s rating with its
standing in the database.  The circle on the lower portion of the curve,
for example, represents Catwoman.  Its position corresponds to a 3.4
user rating on the horizontal axis and a 0.05 portion (5 percent) on
the vertical axis.  Thus a 3.4-rated movie is in the 5th percentile.
The triangle on the upper portion of the curve corresponds to Batman
Begins, relating the movie&amp;#8217;s 8.3 rating to its glorious standing in the
97th percentile.&lt;/p&gt;
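

	&lt;p&gt;For the curious, a curve like this is an &lt;em&gt;empirical cumulative distribution&lt;/em&gt;: for each distinct rating, the fraction of movies rated at or below it.  Here is a rough Python sketch of how such points might be computed.  It is not the R code behind the chart, and the four ratings are invented for the example.&lt;/p&gt;

```python
from collections import Counter
from itertools import accumulate

def ecdf(ratings):
    """Return (rating, fraction rated at or below it) for each distinct rating."""
    counts = Counter(ratings)
    values = sorted(counts)
    # Running total of movies seen so far, in order of increasing rating.
    running_totals = accumulate(counts[value] for value in values)
    total = len(ratings)
    return [(value, seen / total) for value, seen in zip(values, running_totals)]

# Hypothetical ratings for four movies.
points = ecdf([6.1, 3.4, 7.8, 6.1])
print(points)  # [(3.4, 0.25), (6.1, 0.75), (7.8, 1.0)]
```

	&lt;p&gt;Plotting the first element of each pair against the second gives the S-shaped curve.&lt;/p&gt;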


	&lt;p&gt;Because the curve covers all ratings, not just the extremes, we now
have a way to quantify the goodness or badness of middle-ground
movies.  Let&amp;#8217;s return to &lt;a href="http://imdb.com/title/tt0360717/"&gt;King Kong&lt;/a&gt;,
currently rated 7.8, and &lt;a href="http://imdb.com/title/tt0388482/"&gt;Transporter
2&lt;/a&gt;, currently rated 6.1.  Look up
their percentiles on the curve above.  (Try it.)  If you are careful,
you should get close to the actual values of 91 and 42, respectively.&lt;/p&gt;


	&lt;p&gt;This would be a good time to reflect upon our intuition about user
ratings.  Earlier, we thought a 6.1 user rating suggested that a movie
was somewhat better than the typical movie.  Now, however, we see that
a 6.1 is worth somewhat less than is typical.&lt;/p&gt;


	&lt;p&gt;Even though their ratings differ by only 1.7 user-rating units, King
Kong is in the 91st percentile &amp;#8211; very good &amp;#8211; and Transporter 2 is way
down in the 42nd percentile &amp;#8211; not so good.  To look at the difference
another way, about &lt;em&gt;half&lt;/em&gt; of the movies in the database fall in
between Transporter 2 and King Kong: 0.91 &amp;#8211; 0.42 = 0.49.  A small
difference in user ratings can represent a large difference in
standings, which might further challenge our intuition about ratings.&lt;/p&gt;


	&lt;p&gt;Additionally, differences in standings are not proportional to
differences in user ratings.  Catwoman, for example, has a user rating
of 3.4 and falls into the 5th percentile.  Transporter 2, with its 6.1
user rating, is a whole 2.7 user-rating units away from Catwoman, but
only 37 percent of movies stand between them.  Even though Transporter
2 is closer to King Kong in terms of user ratings, it is really closer
to Catwoman in terms of standing.&lt;/p&gt;
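

	&lt;p&gt;To see the two comparisons side by side, here is a small Python sketch that works only from the ratings and percentiles already quoted in this article.  The dictionary and the helper function are my own illustrative names.&lt;/p&gt;

```python
# Ratings and percentiles quoted in the article: (user rating, percentile).
movies = {
    "Catwoman": (3.4, 5),
    "Transporter 2": (6.1, 42),
    "King Kong": (7.8, 91),
}

def gaps(lower, higher):
    """Return (rating gap, percentile gap) between two movies."""
    low_rating, low_percentile = movies[lower]
    high_rating, high_percentile = movies[higher]
    return round(high_rating - low_rating, 1), high_percentile - low_percentile

print(gaps("Transporter 2", "King Kong"))  # (1.7, 49)
print(gaps("Catwoman", "Transporter 2"))   # (2.7, 37)
```

	&lt;p&gt;The smaller rating gap spans the larger share of the database, which is exactly the non-proportionality described above.&lt;/p&gt;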


	&lt;h3&gt;Movie-rating decoder ring&lt;/h3&gt;


	&lt;p&gt;A chart is great for understanding the relationship between user
ratings and movie standings, but it is not ideal for day-to-day use,
when we just want to figure out where a movie stands before deciding
whether it is worth watching.  For times like that, a lookup table is a
convenient alternative.  The table below, for example, summarizes the
rating-standing relationship in a convenient &amp;#8220;decoder-ring&amp;#8221; format.
Find a movie&amp;#8217;s rating in the left column, and the corresponding entry
in the right column gives the movie&amp;#8217;s standing.&lt;/p&gt;


	&lt;table&gt;
		&lt;tr&gt;
			&lt;th&gt; Rating  &lt;/th&gt;
			&lt;th&gt;Percentile &lt;/th&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   4.00  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;      8     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   5.00  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     19     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   5.25  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     22     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   5.50  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     29     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   5.75  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     33     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   6.00  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     40     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   6.25  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     45     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   6.50  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     55     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   6.75  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     61     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   7.00  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     70     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   7.25  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     76     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   7.50  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     84     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   7.75  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     89     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   8.00  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     94     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   8.25  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     96     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   8.50  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     98     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   8.75  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     99     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   9.00  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    100     &lt;/td&gt;
		&lt;/tr&gt;
	&lt;/table&gt;




	&lt;p&gt;Using King Kong as an example again, let&amp;#8217;s look up 7.8.  It turns out
that 7.8 is not in the table, but 7.75 is, and it corresponds to the
89th percentile.  So we can guesstimate that King Kong is a bit above
the 89th percentile, which, as we know from earlier, is correct, the
actual value being 91.  The decoder ring is not as precise as the
chart, but it is more than good enough for finding a movie&amp;#8217;s
approximate standing quickly &amp;#8211; something that might be handy on
a Friday night.&lt;/p&gt;
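

	&lt;p&gt;If you would rather let a computer do the lookup, the table translates directly into code.  The following Python sketch copies the rating and percentile columns from the table above and returns the entry for the nearest table rating at or below the one you ask about.  The names are hypothetical, and returning nothing for ratings below 4.00 is my own choice, since the table does not cover them.&lt;/p&gt;

```python
from bisect import bisect_right

# Columns copied from the decoder-ring table in the article.
RING_RATINGS = [4.00, 5.00, 5.25, 5.50, 5.75, 6.00, 6.25, 6.50, 6.75,
                7.00, 7.25, 7.50, 7.75, 8.00, 8.25, 8.50, 8.75, 9.00]
RING_PERCENTILES = [8, 19, 22, 29, 33, 40, 45, 55, 61,
                    70, 76, 84, 89, 94, 96, 98, 99, 100]

def decode(rating):
    """Return the percentile for the nearest table rating at or below rating.

    Returns None when the rating is lower than the first table entry.
    """
    position = bisect_right(RING_RATINGS, rating)
    if position == 0:
        return None
    return RING_PERCENTILES[position - 1]

print(decode(7.8))  # 89, the entry for 7.75
print(decode(6.1))  # 40, the entry for 6.00
print(decode(3.4))  # None, below the table
```

	&lt;p&gt;Like the printed table, this rounds down, so it slightly understates a movie&amp;#8217;s standing when the rating falls between two rows.&lt;/p&gt;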


	&lt;h3&gt;Summary: weighing the gold&lt;/h3&gt;


	&lt;p&gt;What have we dug up so far?  First, we computed a few essential
descriptive statistics of the IMDb&amp;#8217;s user ratings.  We learned that
the average rating is 6.2 and that the median, which divides the
ratings into better and worse halves, is 6.4.&lt;/p&gt;


	&lt;p&gt;Second, we plotted a histogram in order to inspect the ratings
visually.  Right away, we could tell that movies rated lower than 4
are among the very worst, and movies rated higher than 8 are among the
very best.&lt;/p&gt;


	&lt;p&gt;Third, in order to give more meaning to ratings in between those two
extremes, we turned to percentiles.  We computed Catwoman&amp;#8217;s by hand;
it&amp;#8217;s in the 5th percentile &amp;#8211; ouch!  Then we plotted a curve that
represents the relationship between user ratings and percentiles.
Using this curve we determined that King Kong is in the 91st
percentile and Transporter 2 is in the 42nd percentile &amp;#8211; a large
difference in movie standings.&lt;/p&gt;


	&lt;p&gt;Finally, we created a tabular &amp;#8220;decoder ring&amp;#8221; to summarize what the
curve depicted.  It is a quick and easy way to find a movie&amp;#8217;s
standing given its user rating.&lt;/p&gt;


	&lt;p&gt;That concludes our first dig of the Internet Movie Database.  Next
time, we will examine the factors that influence movie ratings.  Are
Documentaries better than Horror flicks?  Are old movies generally
better than new movies?  We will ask those questions and more in the
next part of the series.&lt;/p&gt;


	&lt;p&gt;Until then, enjoy a movie or two.  And don&amp;#8217;t forget your slide-rule.&lt;/p&gt;


	&lt;h3&gt;Acknowledgments&lt;/h3&gt;


	&lt;p&gt;The movie information used in this article is courtesy of &lt;a href="http://www.imdb.com"&gt;The Internet
Movie Database&lt;/a&gt; and used with permission.&lt;/p&gt;


	&lt;p&gt;My analysis was performed with
&lt;a href="http://www.r-project.org/about.html"&gt;R&lt;/a&gt; software from the &lt;a href="http://www.r-project.org/"&gt;R Project
for Statistical Computing&lt;/a&gt;.  R is a great
statistics package.  It&amp;#8217;s Free Software, and it has a great community
around it.  &lt;em&gt;Do&lt;/em&gt; check it out.&lt;/p&gt;</description>
      <pubDate>Tue, 17 Jan 2006 20:59:00 -0500</pubDate>
      <guid isPermaLink="false">urn:uuid:305161727c36a521acc37a1452ee7be2</guid>
      <author>Tom Moertel</author>
      <link>http://blog.moertel.com/articles/2006/01/17/mining-gold-from-the-internet-movie-database-part-1</link>
      <category>movies</category>
      <category>statistics</category>
      <category>R</category>
      <category>imdb</category>
      <category>statistics</category>
      <category>movies</category>
      <trackback:ping>http://blog.moertel.com/articles/trackback/22</trackback:ping>
    </item>
  </channel>
</rss>
