<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="/stylesheets/rss.css" type="text/css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>Tom Moertel's Weblog: Tag statistics</title>
    <link>http://blog.moertel.com/articles/tag/statistics?tag=statistics</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>Quality rants on programming theory and stuff geeks like</description>
    <item>
      <title> R tips and tricks:  Producing smooth bitmap plots</title>
      <description>&lt;p&gt;The &lt;a href="http://www.r-project.org/"&gt;R statistics system&lt;/a&gt; can produce
first-class data visualizations, commonly known as plots.  Internally,
plots are represented in an abstract graphics format that can be
rendered on any of R&amp;#8217;s wide range of graphics &amp;#8220;devices&amp;#8221; to produce
concrete output &amp;#8211; windows, bitmap files, PostScript files, &lt;span class="caps"&gt;PDF&lt;/span&gt; files,
and others.&lt;/p&gt;


	&lt;p&gt;The bitmap formats, such as &lt;span class="caps"&gt;PNG&lt;/span&gt;, are preferred for posting
plots online because of their widespread support by web browsers.  The
default bitmap-rendering devices in R, unfortunately, produce graphics
that look a little too &amp;#8220;bitmapped&amp;#8221; for modern web tastes.  Here, for example,
is a plot rendered by R&amp;#8217;s &amp;#8220;png&amp;#8221; device:&lt;/p&gt;


&lt;div class="photo"&gt;

	&lt;p&gt;&lt;img src="http://community.moertel.com/~thor/blog/pix-20070825/plot.png" title="Plot rendered via R's PNG device" alt="Plot rendered via R's PNG device" /&gt;&lt;/p&gt;


&lt;/div&gt;

	&lt;p&gt;There&amp;#8217;s nothing technically wrong with the plot, but it looks out
of place on a web page.  That&amp;#8217;s because modern web
browsers use font-smoothing and anti-aliasing techniques to render
just about everything else on the page.  Against this clean, un-jagged
backdrop, the oh-so-bitmapped plot looks like a throwback to
a previous era.&lt;/p&gt;


	&lt;p&gt;Happily, we &lt;em&gt;can&lt;/em&gt; produce clean, anti-aliased R plots with a little
help.  Here&amp;#8217;s the earlier plot, anti-aliased:&lt;/p&gt;


&lt;div class="photo"&gt;

	&lt;p&gt;&lt;img src="http://community.moertel.com/~thor/blog/pix-20070825/plot2.png" title="Plot rendered via R's PDF device, then post-processed" alt="Plot rendered via R's PDF device, then post-processed" /&gt;&lt;/p&gt;


&lt;/div&gt;

	&lt;p&gt;To produce the anti-aliased plot, I used R to produce a &lt;span class="caps"&gt;PDF&lt;/span&gt; file.  Then I
rendered the &lt;span class="caps"&gt;PDF&lt;/span&gt; file into a &lt;span class="caps"&gt;PNG&lt;/span&gt; image at 300 dpi using Ghostscript.
Finally, I scaled the 300-dpi image down to screen resolution,
producing a high-quality, anti-aliased result.&lt;/p&gt;


	&lt;p&gt;Here&amp;#8217;s the recipe in detail.&lt;/p&gt;


	&lt;p&gt;First, I define an R function called &lt;em&gt;pdfit&lt;/em&gt; that takes an
abstract graphics object and makes a &lt;span class="caps"&gt;PDF&lt;/span&gt;-file rendering of it, using
my preferred graphics-device settings:&lt;/p&gt;


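&lt;pre&gt;&lt;code&gt;library(lattice)

# Render a lattice plot object into a PDF file.
# Extra arguments (e.g., file=) are passed through to the pdf device.
pdfit &amp;lt;- function(f, ...) {
  trellis.device(device = pdf, theme = "col.whitebg", ...)
  print(f)
  dev.off()
}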
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Then, when I create a plot I want to publish, I use &lt;em&gt;pdfit&lt;/em&gt; to render
it into a &lt;span class="caps"&gt;PDF&lt;/span&gt; file:&lt;/p&gt;


&lt;pre&gt;&lt;code&gt;P.img &amp;lt;- xyplot( subs.low + subs.high ~ date, ... )

pdfit(P.img, file="image-downloads.pdf")  # render plot into PDF file
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Finally, I use &lt;a href="http://www.ghostscript.com/"&gt;Ghostscript&lt;/a&gt; and
&lt;a href="http://www.imagemagick.org/"&gt;ImageMagick&lt;/a&gt; to convert the &lt;span class="caps"&gt;PDF&lt;/span&gt; file into
a high-quality, anti-aliased &lt;span class="caps"&gt;PNG&lt;/span&gt; file.  (I keep both formats: the &lt;span class="caps"&gt;PDF&lt;/span&gt;
file is best for publishing in printed papers, and the &lt;span class="caps"&gt;PNG&lt;/span&gt; file is
best for posting online.)  I use a simple Makefile to automate the
process of converting the &lt;span class="caps"&gt;PDF&lt;/span&gt; files into &lt;span class="caps"&gt;PNG&lt;/span&gt; files:&lt;/p&gt;


&lt;pre&gt;&lt;code&gt;# Makefile (GNU make)

pdfs := $(wildcard *.pdf)
pngs := $(pdfs:.pdf=.png)

all: $(pngs)
.PHONY: all

%.png: %.pdf
    gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m \
       -dGraphicsAlphaBits=4 -dTextAlphaBits=4 -r300 \
       -dBackgroundColor='16#ffffff' \
       -sOutputFile=$@ &amp;gt; /dev/null \
       $&amp;lt; &amp;#38;&amp;#38; \
    mogrify -resize 500 $@
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;With this Makefile in my graphics directory, just a single &amp;#8220;make&amp;#8221; 
command is all it takes to convert my &lt;span class="caps"&gt;PDF&lt;/span&gt; images into
anti-aliased &lt;span class="caps"&gt;PNG&lt;/span&gt; files, ready to post online.&lt;/p&gt;


	&lt;p&gt;And that&amp;#8217;s it.&lt;/p&gt;


	&lt;p&gt;Do you have any tips or tricks for making good-looking graphics with
R?  If so, please do share.&lt;/p&gt;


&lt;div class="update"&gt;

	&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt;  There is one downside to the sexy, anti-aliased plots: they
are not as compressible as the old-style jagged plots.  For the
images above, for instance, the anti-aliased &lt;span class="caps"&gt;PNG&lt;/span&gt; file weighs in
at 45&amp;#160;KB, but the original &lt;span class="caps"&gt;PNG&lt;/span&gt; file is a feathery 4.7&amp;#160;KB.
So, if bandwidth is precious to you &amp;#8211; or you&amp;#8217;re planning on getting
Slashdotted &amp;#8211; you might want to stick
with the jaggies.&lt;/p&gt;


&lt;/div&gt;</description>
      <pubDate>Sat, 25 Aug 2007 21:56:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:c0e03deb-df96-4c4b-aea7-a4d9a256421b</guid>
      <author>Tom Moertel</author>
      <link>http://blog.moertel.com/articles/2007/08/25/r-tips-and-tricks-producing-smooth-bitmap-plots</link>
      <category>statistics</category>
      <category>R</category>
      <category>plots</category>
      <category>graphics</category>
      <category>tips</category>
      <category>tricks</category>
      <trackback:ping>http://blog.moertel.com/articles/trackback/550</trackback:ping>
    </item>
    <item>
      <title>Fun with statistics:  estimating blog readership (a do-it-yourself recipe)</title>
      <description>&lt;p&gt;As everybody knows, &lt;em&gt;statistics is fun&lt;/em&gt;.  Is there
anything cooler than crushing a heap of seemingly uninteresting
numbers into gleaming jewels of meaning?  Of course not!  Models,
data-visualization plots, and fat data sets are &lt;em&gt;way cool&lt;/em&gt;.
So, let&amp;#8217;s find an excuse to play with them.&lt;/p&gt;


	&lt;p&gt;Here&amp;#8217;s &lt;span style="text-decoration: line-through"&gt;an excuse&lt;/span&gt; &amp;#8211;
I mean, an important and highly relevant question that many of us share:
&lt;em&gt;How many people actually read our blogs&lt;/em&gt;?  To answer the
question, we will need to use statistics, data, and cool plots.
Further, if you&amp;#8217;ve got the raw data for your blog, you can follow
along with your own analysis.  Even more fun!&lt;/p&gt;


	&lt;p&gt;We&amp;#8217;ll start with a simple inspection of common web-log data, using
command-line tools.  After developing a rough understanding of what
useful information we can extract, we&amp;#8217;ll analyze the raw data using a
series of successively more sophisticated techniques.  In the end, we
will derive a simple formula for estimating readership from easily
obtainable data.&lt;/p&gt;


	&lt;p&gt;Sound good?  Then let&amp;#8217;s get rocking.&lt;/p&gt;


	&lt;p&gt;But first, a preemptive strike on would-be pooh-poohers: I know all about
FeedBurner.  I know they will track my blog&amp;#8217;s subscribers and use
their mystical powers to infer the number of &amp;#8220;real&amp;#8221; subscribers I
have.  I know it&amp;#8217;s &lt;em&gt;all so easy&lt;/em&gt;.  But easy isn&amp;#8217;t the point.  I want to
&lt;em&gt;understand&lt;/em&gt; what&amp;#8217;s going on.  Just taking somebody&amp;#8217;s word for it isn&amp;#8217;t
nearly as satisfying as figuring it out yourself &amp;#8211; nor as fun.&lt;/p&gt;


	&lt;p&gt;OK.  For real this time, &lt;em&gt;let&amp;#8217;s get rocking.&lt;/em&gt;&lt;/p&gt;


	&lt;h3&gt; The goal&lt;/h3&gt;


	&lt;p&gt;We want to know how many people read my blog regularly.  By regularly,
I mean that if I post something today, we want to count the people who
will read it within a week&amp;#8217;s time.  That way we&amp;#8217;ll count the weekend
readers but not the one-time readers who will trickle in from search
engines over the months ahead.&lt;/p&gt;


	&lt;p&gt;We can&amp;#8217;t just look at my web-log stats to determine my blog&amp;#8217;s
readership, however.  That&amp;#8217;s because a lot of people read my blog
through online feed aggregators, such as Bloglines and Google Reader,
and never actually &amp;#8220;hit&amp;#8221; my blog when they read it.  (My blog is so
ugly, in fact, that I would expect &lt;em&gt;lots&lt;/em&gt; of my readers to use a feed
aggregator just to protect themselves from my design &amp;#8220;skills.&amp;#8221;)&lt;/p&gt;


	&lt;p&gt;So the goal is to figure out how to count my readers using the data
we can actually get our hands on.&lt;/p&gt;


	&lt;h3&gt; The data&lt;/h3&gt;


	&lt;p&gt;Here&amp;#8217;s what we have: my &lt;span class="caps"&gt;HTTP&lt;/span&gt; server&amp;#8217;s log.  That&amp;#8217;s it.  Can we squeeze
the good stuff from it?  Let&amp;#8217;s find out.&lt;/p&gt;


	&lt;p&gt;Each entry in the log represents a single request for something on my
site.  A typical entry looks like this (split over multiple
lines for your reading pleasure):&lt;/p&gt;


&lt;pre&gt;&lt;code&gt;72.14.199.81 - - [19/Aug/2007:19:31:43 -0400]
"GET /xml/atom/article/472/feed.xml HTTP/1.1" 200 1959 "-" 
"Feedfetcher-Google; (+http://www.google.com/...; 1 subscribers; ...)" 
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;There&amp;#8217;s a lot of potentially useful information in there:&lt;/p&gt;


	&lt;ul&gt;
	&lt;li&gt;the IP address of the host that made the request&lt;/li&gt;
		&lt;li&gt;the date and time that the request was received&lt;/li&gt;
		&lt;li&gt;a summary of the request (e.g., &amp;#8220;GET /xml/atom/article/472/feed.xml &lt;span class="caps"&gt;HTTP&lt;/span&gt;/1.1&amp;#8221;)&lt;/li&gt;
		&lt;li&gt;the response code, typically 200 for a successful response&lt;/li&gt;
		&lt;li&gt;the string sent by the requester&amp;#8217;s user agent to identify itself (e.g., &amp;#8220;Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; ...)&amp;#8221;)&lt;/li&gt;
	&lt;/ul&gt;


	&lt;p&gt;Note that this particular request was made by Google&amp;#8217;s Feedfetcher
for an Atom feed.  Also note that Feedfetcher told us,
via its user-agent identification string, how many of its users have
subscribed to this particular feed.  That&amp;#8217;s good stuff we can use.&lt;/p&gt;
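

	&lt;p&gt;To make that field breakdown concrete, here is a minimal Python sketch that splits one entry into its parts and pulls out the reported subscriber count.  It assumes the log uses the Apache-style combined format shown above, with each entry on a single line; the names &lt;em&gt;parse_entry&lt;/em&gt;, &lt;em&gt;LOG_LINE&lt;/em&gt;, and &lt;em&gt;SUBSCRIBERS&lt;/em&gt; are made up for illustration and are not part of my actual tooling.&lt;/p&gt;

```python
import re

# Combined log format (assumed): host, identity, user, [timestamp],
# "request", status, size, "referer", "user agent".
LOG_LINE = re.compile(
    r'(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"'
)
SUBSCRIBERS = re.compile(r"(\d+) subscribers?", re.IGNORECASE)


def parse_entry(line):
    """Split one log line into fields; return None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    host, timestamp, request, status, size, referer, agent = match.groups()
    reported = SUBSCRIBERS.search(agent)
    return {
        "host": host,
        "timestamp": timestamp,
        "request": request,
        "status": int(status),
        "agent": agent,
        # None when the user agent does not report a subscriber count
        "subscribers": int(reported.group(1)) if reported else None,
    }


# The sample entry from above, joined back into a single line.
entry = parse_entry(
    '72.14.199.81 - - [19/Aug/2007:19:31:43 -0400] '
    '"GET /xml/atom/article/472/feed.xml HTTP/1.1" 200 1959 "-" '
    '"Feedfetcher-Google; (+http://www.google.com/...; 1 subscribers; ...)"'
)
print(entry["host"], entry["status"], entry["subscribers"])
# prints: 72.14.199.81 200 1
```

	&lt;p&gt;Lines that don&amp;#8217;t fit the pattern come back as None, so a caller can skip them rather than guess at their meaning.&lt;/p&gt;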


	&lt;p&gt;My blog&amp;#8217;s main Atom feed is at /xml/atom10/feed.xml.  There are other
&amp;#8220;main&amp;#8221; feeds as well (e.g., &lt;span class="caps"&gt;RSS&lt;/span&gt;), but let&amp;#8217;s focus on this one for
now.  Let&amp;#8217;s see who&amp;#8217;s been asking for it recently.  First, I&amp;#8217;ll create a
bash-shell function to grab the subset of the log corresponding to 19
August:&lt;/p&gt;


&lt;pre&gt;&lt;code class="typedin"&gt;$ get_subset() {
    fgrep "GET /xml/atom10/feed.xml" blog_log |
    fgrep 19/Aug/2007;
  }
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Then I&amp;#8217;ll summarize the user-agent part of that subset&amp;#8217;s log entries:&lt;/p&gt;


&lt;pre&gt;&lt;code class="typedin"&gt;$ get_subset |
  perl -lne 'print $1 if /"([^";(]+)[^"]*"$/' |
  sort | uniq -c | sort -rn
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;78 NewsGatorOnline/2.0
47 Vienna/2.1.3.2111
38 Mozilla/5.0
27 YandexBlog/0.99.101
21 NewsFire/69
20 Planet Haskell +http://planet.haskell.org/ ...
19 Feedfetcher-Google
19 AppleSyndication/54
14 Zhuaxia.com 1 Subscribers
14 NetNewsWire/2.1b33
13 RssFwd
13 Bloglines/3.1
11 livedoor FeedFetcher/0.01
10 Feeds2.0
 8 RssBandit/1.5.0.10
 8 Akregator/1.2.6
 7 Eldono
 6 Netvibes
 4 NetNewsWire/3.0
 2 trawlr.com
 2 Opera/9.21
 2 NetNewsWire/3.1b5
 2 NetNewsWire/2.1
 2 Mozilla/3.0
 1 Vienna/2.2.0.2206
 1 Vienna/2.1.0.2107
 1 NetNewsWire/2.1.1
 1 Liferea/1.2.10
 1 JetBrains Omea Reader 2.2
 1 FeedTools/0.2.26 +http://www.sporkmonger.com/projects/feedtools/
 1 Feedshow/2.0
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Of the user agents that fetched my feed, only some, such as
Bloglines and Google Reader, aggregate on behalf of other users, and
only some of those mass aggregators reported how many people have
subscribed through them:&lt;/p&gt;


&lt;pre&gt;&lt;code class="typedin"&gt;$ get_subset |
  perl -lne 'print $1 if /"([^"]*?\d+ subscribers?)/i' |
  sort | uniq -c | sort -rn
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;78 NewsGatorOnline/2.0 (... 22 subscribers
19 Feedfetcher-Google; (... 102 subscribers
14 Zhuaxia.com 1 Subscribers
13 RssFwd (... 1 subscribers
13 Bloglines/3.1 (http://www.bloglines.com; 82 subscribers
11 livedoor FeedFetcher/0.01 (... 1 subscriber
10 Mozilla/5.0 (Rojo 1.0; ... 4 subscriber
 7 Eldono (http://www.eldono.de; 1 subscribers
 6 Netvibes (http://www.netvibes.com/; 12 subscribers
 2 trawlr.com (+http://www.trawlr.com; 4 subscribers
 1 Feedshow/2.0 (http://www.feedshow.com; 1 subscriber
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Of the user agents that don&amp;#8217;t report subscriber counts, most are
single-user feed readers.  The 47 requests from the Vienna/2.1.3.2111
reader, for example, came from 5 distinct IP addresses (which I&amp;#8217;ve
obscured to protect my innocent readers&amp;#8217; identities):&lt;/p&gt;


&lt;pre&gt;&lt;code class="typedin"&gt;$ get_subset |
  perl -lane 'print $F[0] if m{"Vienna/2.1.3.2111}' |
  sort | uniq -c | sort -rn
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;22 121.44.xxx.xxx
20 208.120.xxx.xxx
 3 69.154.xxx.xxx
 1 84.163.xxx.xxx
 1 202.89.xxx.xxx
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Does that mean I have only 5 distinct readers using Vienna 2.1.3.2111?
Not necessarily.  The first IP address, for example, could represent a
firewall that serves several people from a single corporate
campus.  So there could, indeed, be more than 5 users lurking behind
those addresses, but it&amp;#8217;s hard to know for sure.&lt;/p&gt;


	&lt;p&gt;Thus we can&amp;#8217;t rely on feed-fetching statistics to determine the
count of readers.  The mass aggregators don&amp;#8217;t
all report their subscriber counts, and the stand-alone aggregators&amp;#8217;
fetching habits are not readily interpreted.  And, even if we could
draw reliable inferences from the fetches, they would tell us only how many
people fetched my blog&amp;#8217;s feeds.  We want to know how many people &lt;em&gt;read&lt;/em&gt;
my blog &amp;#8211; actually look at the articles.&lt;/p&gt;


	&lt;p&gt;To do that, we&amp;#8217;ll need a more-sophisticated approach.&lt;/p&gt;


	&lt;h3&gt; A different approach: counting image downloads&lt;/h3&gt;


	&lt;p&gt;Every once in a while, I&amp;#8217;ll post an article that contains photos or
graphs of something I&amp;#8217;m trying to explain.  Since images like that are
included by reference, they are not actually part of the article
itself.  So when a feed fetcher grabs a syndicated copy of the
article, it won&amp;#8217;t bother to fetch the images. There&amp;#8217;s no need to use
the bandwidth unless the person on the other side of the feed actually
reads the article, at which time the person&amp;#8217;s feed reader can download
the images on demand.&lt;/p&gt;


	&lt;p&gt;Thus we can use the number of image downloads as an estimate of the
number of people who actually read my blog.  For each article that has
images, we can count how many times each image was downloaded during
the article&amp;#8217;s first week online and take the average of the counts as
an estimate of the number of people who read the article.  (Marketing
weasels use this technique, too, to track your reading habits.  The
only difference is that they will often insert gratuitous, personally
identifying images &amp;#8211; &lt;a href="http://en.wikipedia.org/wiki/Web_bug"&gt;web bugs&lt;/a&gt;
&amp;#8211; into their documents to track you specifically.)&lt;/p&gt;


	&lt;p&gt;The image-counting technique isn&amp;#8217;t foolproof, however.  Requests from
people behind proxy servers may never actually make it to my server
to be counted, leading to under-counting.  Also, some web crawlers
fetch images, which may artificially inflate the count of &amp;#8220;readers.&amp;#8221; 
Examining the logs, I didn&amp;#8217;t see many image requests from crawlers,
so our primary concern is under-counting.  Since I&amp;#8217;m OK with a
conservative count, under-counting is acceptable.&lt;/p&gt;


	&lt;p&gt;Let&amp;#8217;s give image-counting a try.  On 15 July 2007, I posted &lt;a href="http://blog.moertel.com/articles/2007/07/15/hailstorm"&gt;a story
about a nasty
hailstorm&lt;/a&gt; that
hit my neighborhood.  The story included some photos of the storm and
its aftermath.  Let&amp;#8217;s count how many times the second photo in the
story was requested on the day the story was posted:&lt;/p&gt;


&lt;pre&gt;&lt;code class="typedin"&gt;$ fgrep "webcam-2007-07-13--153421.jpg" mc_log |
  fgrep 15/Jul/2007 | wc -l
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;884
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;884 times.  Many of those downloads, however, were made by just a few
requesting hosts.  Here are the top ten downloaders:&lt;/p&gt;


&lt;pre&gt;&lt;code class="typedin"&gt;$ fgrep "webcam-2007-07-13--153421.jpg" mc_log |
  fgrep 15/Jul/2007 |
  perl -lane 'print $F[0]' |
  sort | uniq -c | sort -rn | head
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;
 42 84.45.xxx.xxx
 27 192.168.xxx.xxx
 10 75.182.xxx.xxx
  7 83.132.xxx.xxx
  7 213.203.xxx.xxx
  6 72.173.xxx.xxx
  6 67.180.xxx.xxx
  6 65.214.xxx.xxx
  5 89.98.xxx.xxx
  5 85.104.xxx.xxx
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;How do we interpret these duplicate requests? One way would be to say
that each request, duplicate or not, represents a unique reader.  It&amp;#8217;s
plausible.  When many readers share a gateway
firewall, say in a corporate setting, they will all end up making
requests from the same IP address(es).  Thus, if we want to count
all such readers, we should count all of the requests.&lt;/p&gt;


	&lt;p&gt;The more conservative interpretation is that all of the requests from
the same IP address represent only a single reader.  All of the
duplicate requests might be reloads or, perhaps, the work of an
overzealous user-agent working (inefficiently) on behalf of that
user.  Let&amp;#8217;s recount using this conservative assumption:&lt;/p&gt;


&lt;pre&gt;&lt;code class="typedin"&gt;$ fgrep "webcam-2007-07-13--153421.jpg" mc_log |
  fgrep 15/Jul/2007 |
  perl -lane 'print $F[0]' | sort -u | wc -l
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;635
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;So what&amp;#8217;s the real count, 635 or 884?  The truth probably lies
somewhere in between.  To make sure we capture the truth, then, let&amp;#8217;s
use both interpretations in our ongoing analysis.  We will develop low
and high estimates from now on.&lt;/p&gt;
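

	&lt;p&gt;The two interpretations boil down to two different counts over the same stream of requesting hosts.  The Python sketch below shows the idea under simple assumptions: the function name &lt;em&gt;low_high&lt;/em&gt; and the host labels are invented for illustration, and the input is assumed to hold one host per image request.&lt;/p&gt;

```python
from collections import Counter


def low_high(hosts):
    """Return (low, high): distinct requesting hosts and total requests."""
    per_host = Counter(hosts)
    return len(per_host), sum(per_host.values())


# Made-up host labels, one per image request (not real log data).
sample = ["host-a", "host-a", "host-b", "host-a", "host-c"]
low, high = low_high(sample)
print(low, high)
# prints: 3 5
```

	&lt;p&gt;The low figure treats every host as a single reader; the high figure treats every request as one.&lt;/p&gt;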


	&lt;p&gt;If you have sharp eyes, you may have noticed that the second IP
address in the list above was from a private network.  That address,
in fact, belongs to my workstation.  When I write articles, I
frequently reload the drafts, and reloading causes the images within
the drafts to be re-fetched.  We&amp;#8217;ll need to filter out my addresses
during our later analyses.&lt;/p&gt;


	&lt;p&gt;There&amp;#8217;s one more thing to consider.  We still need to count the image
downloads for the rest of the week.  So far, we have only counted
those for the article&amp;#8217;s first day online.  So, let&amp;#8217;s re-do our
conservative count, only this time for the whole week. Let&amp;#8217;s also
filter out my private addresses and ignore all but &lt;span class="caps"&gt;HTTP 200&lt;/span&gt;
&amp;#8220;OK&amp;#8221; responses:&lt;/p&gt;


&lt;pre&gt;&lt;code class="typedin"&gt;$ fgrep "webcam-2007-07-13--153421.jpg" mc_log |
  fgrep " 200 " |  # only count full downloads (status code = 200)
  grep -P '(1[56789]|2[01])/Jul/2007' |  # Jul 15 thru 21 (7 days)
  perl -lane 'print $F[0] unless $F[0] =~ /^192\.168\./' |
  sort -u | wc -l
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;1601
&lt;/code&gt;&lt;/pre&gt;
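
	&lt;p&gt;For readers who prefer a script to a pipeline, the same three filters (status 200, a seven-day window, no 192.168 addresses) can be sketched in Python.  This is only an illustration: &lt;em&gt;weekly_low_estimate&lt;/em&gt; is a hypothetical name, the sample records are invented, and the input is assumed to be already-parsed (host, timestamp, status) tuples.&lt;/p&gt;

```python
from datetime import datetime, timedelta


def weekly_low_estimate(requests, first_day, days=7):
    """Count distinct non-private hosts with a 200 response in the window.

    `requests` is an iterable of (host, timestamp, status) tuples, with
    timestamps in the log's own format, e.g. "15/Jul/2007:10:00:00 -0400".
    """
    start = datetime.strptime(first_day, "%Y-%m-%d").date()
    window = {start + timedelta(days=n) for n in range(days)}
    hosts = set()
    for host, timestamp, status in requests:
        day = datetime.strptime(timestamp, "%d/%b/%Y:%H:%M:%S %z").date()
        # Same test as the pipeline above: drop 192.168.x.x addresses.
        private = host.startswith("192.168.")
        if status == 200 and day in window and not private:
            hosts.add(host)
    return len(hosts)


# Invented records for illustration only.
sample = [
    ("203.0.113.5", "15/Jul/2007:10:00:00 -0400", 200),
    ("203.0.113.5", "16/Jul/2007:09:30:00 -0400", 200),   # repeat host
    ("198.51.100.7", "21/Jul/2007:23:59:59 -0400", 200),
    ("198.51.100.8", "22/Jul/2007:00:00:01 -0400", 200),  # outside the week
    ("192.0.2.9", "17/Jul/2007:12:00:00 -0400", 304),     # not a full download
    ("192.168.1.20", "15/Jul/2007:08:00:00 -0400", 200),  # private address
]
print(weekly_low_estimate(sample, "2007-07-15"))
# prints: 2
```

	&lt;p&gt;Building the window as a set of dates keeps the day test explicit, which is easier to adjust than the regular expression when the week spans a month boundary.&lt;/p&gt;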

	&lt;p&gt;So, we estimate conservatively that my article on the hailstorm was
read by about 1600 people in its first week.  Since the article was
published on 15 July 2007, we can conservatively estimate that my
blog&amp;#8217;s regular readership was about 1600 at that time, too.&lt;/p&gt;


	&lt;p&gt;But that&amp;#8217;s just a single point estimate.  We&amp;#8217;ll need more data
if we&amp;#8217;re to draw reliable conclusions.&lt;/p&gt;


	&lt;h3&gt; Compiling the image data&lt;/h3&gt;


	&lt;p&gt;To compile enough data for meaningful inferences, I have whipped up a
small script (in Perl) to extract and summarize image-download
statistics, given an &lt;span class="caps"&gt;HTTP&lt;/span&gt;-server log.  Running the script on my blog&amp;#8217;s
log, here&amp;#8217;s what we get:&lt;/p&gt;


&lt;div style="font-size: smaller; line-height: 1em; margin-bottom: 1em;"&gt;

	&lt;table&gt;
		&lt;tr&gt;
			&lt;th&gt; Date       &lt;/th&gt;
			&lt;th style="text-align:right;"&gt;Hits low &lt;/th&gt;
			&lt;th style="text-align:right;"&gt;Hits high  &lt;/th&gt;
			&lt;th&gt; Image &lt;/th&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2006-06-18   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    158   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     192   &lt;/td&gt;
			&lt;td&gt;  lady-beetle-larva-upside-down-small.jpg &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2006-06-18   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    157   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     191   &lt;/td&gt;
			&lt;td&gt;  lady-battle-larva-upside-down-close.jpg &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2006-07-06   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    163   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     200   &lt;/td&gt;
			&lt;td&gt;  lectro-shirt-before-and-after-wash-small &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2006-07-06   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    163   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     203   &lt;/td&gt;
			&lt;td&gt;  lectro-shirt-before-wash-300dpi.jpg &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2006-07-06   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    168   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     206   &lt;/td&gt;
			&lt;td&gt;  lectro-shirt-before-wash-small.jpg &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2006-07-07   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    155   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     194   &lt;/td&gt;
			&lt;td&gt;  Cladonia-cristatella-close.jpg &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2006-07-07   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    155   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     194   &lt;/td&gt;
			&lt;td&gt;  Cladonia-cristatella.jpg &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2006-08-03   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    147   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     188   &lt;/td&gt;
			&lt;td&gt;  annies-mixup-0003.jpg &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2006-08-03   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    146   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     188   &lt;/td&gt;
			&lt;td&gt;  annies-mixup-0002.jpg &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2006-08-24   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    173   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     217   &lt;/td&gt;
			&lt;td&gt;  blog-fd-usage-vs-time.png &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2006-09-12   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    271   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     328   &lt;/td&gt;
			&lt;td&gt;  perl-at-work-sign.png &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2006-10-18   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;   1448   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    1582   &lt;/td&gt;
			&lt;td&gt;  safe-strings.png * &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2006-11-04   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;   1005   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    1351   &lt;/td&gt;
			&lt;td&gt;  old-web-site-3.png &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2006-11-04   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;   1011   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    1364   &lt;/td&gt;
			&lt;td&gt;  old-web-site.png &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2006-11-14   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;   1265   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    1747   &lt;/td&gt;
			&lt;td&gt;  toms-apple-pie.jpg &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2007-05-25   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;   1567   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    2406   &lt;/td&gt;
			&lt;td&gt;  problem-close.jpg &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2007-05-25   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;   1563   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    2400   &lt;/td&gt;
			&lt;td&gt;  receiver-insides.jpg &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2007-05-25   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;   1551   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    2383   &lt;/td&gt;
			&lt;td&gt;  repair.jpg &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2007-06-21   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;   2290   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    3024   &lt;/td&gt;
			&lt;td&gt;  perl-and-r.png * &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2007-07-15   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;   1574   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    2360   &lt;/td&gt;
			&lt;td&gt;  webcam-2007-07-13--153751.jpg &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2007-07-15   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;   1562   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    2379   &lt;/td&gt;
			&lt;td&gt;  backyard-ice.jpg &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2007-07-15   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;   1553   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    2364   &lt;/td&gt;
			&lt;td&gt;  shredded.jpg &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2007-07-15   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;   1567   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    2346   &lt;/td&gt;
			&lt;td&gt;  webcam-2007-07-13--153757.jpg &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2007-07-15   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;   1561   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    2355   &lt;/td&gt;
			&lt;td&gt;  webcam-2007-07-13--153808.jpg &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2007-07-15   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;   1612   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    2469   &lt;/td&gt;
			&lt;td&gt;  hailstorm2.jpg &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2007-07-15   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;   1592   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    2382   &lt;/td&gt;
			&lt;td&gt;  webcam-2007-07-13--153726.jpg &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2007-07-15   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;   1586   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    2381   &lt;/td&gt;
			&lt;td&gt;  webcam-2007-07-13--153747.jpg &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td&gt;  2007-07-15   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;   1601   &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    2404   &lt;/td&gt;
			&lt;td&gt;  webcam-2007-07-13--153421.jpg &lt;/td&gt;
		&lt;/tr&gt;
	&lt;/table&gt;




&lt;/div&gt;
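
	&lt;p&gt;Earlier I said we would average the per-image counts to get one estimate per article.  Here is a minimal Python sketch of that step, using the rows for two of the dates in the table.  It assumes each date corresponds to a single article, and the name &lt;em&gt;article_estimates&lt;/em&gt; is invented for illustration.&lt;/p&gt;

```python
from collections import defaultdict
from statistics import mean

# (article date, hits low, hits high), copied from the table above.
rows = [
    ("2006-07-07", 155, 194),
    ("2006-07-07", 155, 194),
    ("2007-05-25", 1567, 2406),
    ("2007-05-25", 1563, 2400),
    ("2007-05-25", 1551, 2383),
]


def article_estimates(rows):
    """Average each article's per-image counts into a (low, high) pair."""
    by_date = defaultdict(list)
    for date, low, high in rows:
        by_date[date].append((low, high))
    return {
        date: (
            round(mean(lo for lo, _ in counts)),
            round(mean(hi for _, hi in counts)),
        )
        for date, counts in by_date.items()
    }


print(article_estimates(rows))
```

	&lt;p&gt;Because the images within one article draw nearly identical counts, the average differs little from any single image&amp;#8217;s count.&lt;/p&gt;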

	&lt;p&gt;Like most data sets, this one looks better in graphical form:&lt;/p&gt;


&lt;div class="photo"&gt;
&lt;img src="http://community.moertel.com/~thor/blog/pix-20070821/image-downloads.png" title="Image downloads by date" alt="Image downloads by date" /&gt;
&lt;/div&gt;

	&lt;p&gt;The circles represent our conservative readership estimates, and the
pluses represent our liberal readership estimates.  To
interpret the overall readership trend, focus on one set of estimates,
either circles or pluses.&lt;/p&gt;


	&lt;p&gt;What do we see?  First, it looks like the quantity of downloads has
increased steadily, from a few hundred in July 2006 to the low
thousands by July 2007.  That&amp;#8217;s nice.&lt;/p&gt;


	&lt;p&gt;Second, the data are sparse.  I don&amp;#8217;t post images often, so we don&amp;#8217;t
have much data to go on.&lt;/p&gt;


	&lt;p&gt;Third, it looks like we have some outliers.  If you look at the points
near October 2006 and June 2007, you&amp;#8217;ll see that they jump up from the
surrounding points.  (In the lower-bound series, I have marked these
outliers with a short, orange vertical line segment.) If these jumps
truly represented a sudden increase in readership, we would expect
them to be permanent, reflected in later readership data.  What we
see, however, is that these gains are only temporary.&lt;/p&gt;


	&lt;p&gt;Thus it seems reasonable to conclude that something else is going
on for these images.  If you look back at the data table, I
have marked the pair of curious images with asterisks.  As it
turns out, both of these images were part of stories that were
featured on Reddit.  So, what these data reflect is the normal
readership &lt;em&gt;plus&lt;/em&gt; the Reddit effect.  To avoid throwing off our
inferences, let&amp;#8217;s discard the data for these two images.&lt;/p&gt;


	&lt;p&gt;In the end, we have a pretty good means of estimating my blog&amp;#8217;s
readership on the dates when I posted articles that contained images
(provided those wily Redditers didn&amp;#8217;t pile on the articles).  The
problem is, I would like to know what my readership is all the time,
not just on those rare occasions I post images.  I certainly don&amp;#8217;t
want to resort to using web bugs.  Hey, I&amp;#8217;m no marketing weasel.&lt;/p&gt;


	&lt;p&gt;It&amp;#8217;s time to add yet another layer of sophistication to our analysis.&lt;/p&gt;


	&lt;h3&gt; A combined model: reported subscribers &lt;em&gt;with&lt;/em&gt; image downloads&lt;/h3&gt;


	&lt;p&gt;Let&amp;#8217;s go back to the subscriber numbers reported by online aggregators
such as Bloglines and Google Reader.  If we assume that those
aggregators represent a decent slice of my readers, and that the size
of that slice as a proportion of the whole universe of readers doesn&amp;#8217;t
change much over time, we can model actual readership (as gathered
from image downloads) in terms of reported subscriber numbers.
Then, we can use that model to predict actual readership for the
dates when no image-download data are available.&lt;/p&gt;


	&lt;p&gt;That&amp;#8217;s the plan.  So, let&amp;#8217;s get going.&lt;/p&gt;


	&lt;h4&gt; Gathering subscriber data&lt;/h4&gt;


	&lt;p&gt;So, let&amp;#8217;s grab those subscriber numbers.  Again, I&amp;#8217;ve whipped up
a Perl script to gather the data.  Here&amp;#8217;s what the script does.  It &amp;#8211;&lt;/p&gt;


	&lt;ul&gt;
	&lt;li&gt;scans my blog&amp;#8217;s &lt;span class="caps"&gt;HTTP&lt;/span&gt; server log&lt;/li&gt;
		&lt;li&gt;ignores requests from private networks&lt;/li&gt;
		&lt;li&gt;ignores requests that don&amp;#8217;t report a subscriber count&lt;/li&gt;
		&lt;li&gt;emits one subscriber count for each day of data in the log, computed as the sum of each feed&amp;#8217;s subscriber count, as reported by each aggregator (if an aggregator fetches a feed more than once in a day, all but the final request are ignored)&lt;/li&gt;
	&lt;/ul&gt;
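

	&lt;p&gt;For readers who think in R rather than Perl, the last step in the list
above can be sketched roughly as follows.  This is not the script itself, only
an illustration: the data frame &lt;em&gt;hits&lt;/em&gt;, its columns, and the numbers in
it are hypothetical stand-ins for log entries that have already been parsed
and filtered.&lt;/p&gt;


&lt;pre&gt;&lt;code class="typedin"&gt;# Hypothetical, already-parsed log entries (made-up numbers).
hits &amp;lt;- data.frame(
  time        = as.POSIXct(c("2007-08-18 01:00:00", "2007-08-18 13:00:00",
                             "2007-08-18 02:00:00", "2007-08-19 03:00:00"),
                           tz = "UTC"),
  aggregator  = c("Bloglines", "Bloglines", "Google Reader", "Bloglines"),
  feed        = c("/xml/rss", "/xml/rss", "/xml/rss", "/xml/rss"),
  subscribers = c(40, 41, 300, 42),
  stringsAsFactors = FALSE)
hits$date &amp;lt;- as.Date(hits$time, tz = "UTC")

# Keep only the final request per aggregator, feed, and day.
hits &amp;lt;- hits[order(hits$time), ]
last &amp;lt;- !duplicated(hits[c("date", "aggregator", "feed")], fromLast = TRUE)

# Sum the surviving counts to get one total per day.
daily &amp;lt;- aggregate(subscribers ~ date, data = hits[last, ], FUN = sum)
print(daily)
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Sorting by time first is what makes &amp;#8220;all but the final request&amp;#8221;
easy to express: within each day, the last row for an aggregator and feed is the
one that survives.&lt;/p&gt;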


	&lt;p&gt;Running the script on my server log, I got a large data set.  It&amp;#8217;s
so large that I&amp;#8217;ll go straight to the plot:&lt;/p&gt;


&lt;div class="photo"&gt;
&lt;img src="http://community.moertel.com/~thor/blog/pix-20070821/agg-reported-subs.png" title="Subscriber counts, as reported by online aggregators" alt="Subscriber counts, as reported by online aggregators" /&gt;
&lt;/div&gt;

	&lt;p&gt;As you would expect, these subscriber counts are less than the
corresponding reader counts we gathered from image downloads.  Not
everybody uses an online feed reader, after all.&lt;/p&gt;


	&lt;p&gt;One thing that leaps out is the discontinuity around February 2007.
What happened back then?  As it turns out, that is when Google finally
started reporting its subscriber counts.  Since Google has a large
share of the online aggregator market, that one little change resulted
in a big increase in the total of reported counts.&lt;/p&gt;


	&lt;p&gt;Still, that jump is going to make our analysis a bit more difficult.
When we relate subscriber counts to actual readers, we will need to
account for the &amp;#8220;Google effect.&amp;#8221;&lt;/p&gt;


	&lt;p&gt;Likewise, there are a few other sets of outliers &amp;#8211; points that look
like bogus data &amp;#8211; we should keep in mind.  To see whether any of our
image-download data coincide with these outliers, let&amp;#8217;s highlight our
subscriber data for the days when we also have image data:&lt;/p&gt;


&lt;div class="photo"&gt;
&lt;img src="http://community.moertel.com/~thor/blog/pix-20070821/subs-and-dls.png" title="Subscriber counts, highlighted if corresponding image-download data are available" alt="Subscriber counts, highlighted if corresponding image-download data are available" /&gt;
&lt;/div&gt;

	&lt;p&gt;Sure enough, some of our early download data coincide with an outlier
group in July 2006.  Let&amp;#8217;s remove that download data from our analysis
set, too.&lt;/p&gt;


	&lt;p&gt;Our data cleaned, let&amp;#8217;s move on.&lt;/p&gt;


	&lt;h4&gt; The model&lt;/h4&gt;


	&lt;p&gt;Now we are ready to relate subscribers to
readers (as determined by downloads).  Here&amp;#8217;s our model:&lt;/p&gt;


&lt;div style="text-align: center"&gt;
&lt;em&gt;y&lt;sub&gt;i&lt;/sub&gt;&lt;/em&gt; = &lt;em&gt;a&amp;#160;&amp;#xB7;&amp;#160;g&lt;sub&gt;i&lt;/sub&gt;&amp;#160;&amp;#xB7;&amp;#160;x&lt;sub&gt;i&lt;/sub&gt;&lt;/em&gt; + &lt;em&gt;e&lt;sub&gt;i&lt;/sub&gt;&lt;/em&gt;
&lt;/div&gt;

	&lt;p&gt;Where:&lt;/p&gt;


	&lt;ul&gt;
	&lt;li&gt;&lt;em&gt;y&lt;/em&gt; represents actual readers (as estimated from image downloads)&lt;/li&gt;
		&lt;li&gt;&lt;em&gt;x&lt;/em&gt; represents subscribers as reported by online aggregators&lt;/li&gt;
		&lt;li&gt;&lt;em&gt;i&lt;/em&gt; ranges over 1&amp;#8211;&lt;em&gt;N&lt;/em&gt; for our &lt;em&gt;N&lt;/em&gt; data points&lt;/li&gt;
		&lt;li&gt;&lt;em&gt;a&lt;/em&gt; is the coefficient that relates readers to subscribers&lt;/li&gt;
		&lt;li&gt;&lt;em&gt;g&lt;/em&gt; is a true/false factor to indicate whether &lt;em&gt;x&lt;/em&gt; includes Google Reader users&lt;/li&gt;
		&lt;li&gt;&lt;em&gt;e&lt;/em&gt; is the model&amp;#8217;s error term&lt;/li&gt;
	&lt;/ul&gt;


	&lt;p&gt;What the model says is that readership (&lt;em&gt;y&lt;/em&gt;) varies linearly
with subscriber counts (&lt;em&gt;x&lt;/em&gt;) and that the rate at which it
varies is given by &lt;em&gt;a&amp;#160;&amp;#xB7;&amp;#160;g&lt;sub&gt;i&lt;/sub&gt;&lt;/em&gt;, which is shorthand
for a separate slope for each value of &lt;em&gt;g&lt;/em&gt;.  (Model aficionados may
note that this is a varying-slope model.) The model does not include a
constant term; this fixes the &lt;em&gt;y&lt;/em&gt;-intercept at 0
because when we have no actual readers, we cannot have any subscribers,
either.  Thus we know the point (0,0) must be part of the fitted model.&lt;/p&gt;


	&lt;p&gt;Here&amp;#8217;s the data set we will use to fit our model:&lt;/p&gt;


&lt;pre&gt;&lt;code&gt;        date y.low y.high   x     g
1 2006-06-18   158    192  53 FALSE
2 2006-08-03   146    188  68 FALSE
3 2006-08-24   173    217  89 FALSE
4 2006-09-12   271    328  97 FALSE
5 2006-11-04  1008   1358 112 FALSE
6 2006-11-14  1265   1747 114 FALSE
7 2007-05-25  1560   2396 385  TRUE
8 2007-07-15  1579   2382 401  TRUE
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;This data set combines a summarized version of our image-download data
set with the corresponding data from our aggregator-reported subscriber
set (the red points in the previous plot).&lt;/p&gt;


	&lt;p&gt;The low and high &lt;em&gt;y&lt;/em&gt; values represent our conservative and
liberal interpretations of readership, which we discussed earlier.
You&amp;#8217;ll also note that where multiple images were available for any
particular date, I have averaged their download counts to give a
centralized readership estimate for that date.  (Exercise: For this
model, why shouldn&amp;#8217;t we include multiple images for a single date?)&lt;/p&gt;
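

	&lt;p&gt;If you&amp;#8217;d like to follow along, a data frame like the one above can be
typed in directly.  One caveat: the values below are copied from the printed
table, which appears to be rounded, so fits based on them may differ
slightly in the last digit from the output shown in this article.&lt;/p&gt;


&lt;pre&gt;&lt;code class="typedin"&gt;# Values transcribed from the printed table (rounded).
subs.readers &amp;lt;- data.frame(
  date   = as.Date(c("2006-06-18", "2006-08-03", "2006-08-24", "2006-09-12",
                     "2006-11-04", "2006-11-14", "2007-05-25", "2007-07-15")),
  y.low  = c(158, 146, 173, 271, 1008, 1265, 1560, 1579),
  y.high = c(192, 188, 217, 328, 1358, 1747, 2396, 2382),
  x      = c(53, 68, 89, 97, 112, 114, 385, 401),
  g      = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE))
&lt;/code&gt;&lt;/pre&gt;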


	&lt;p&gt;Let&amp;#8217;s plot this data set (just the &lt;em&gt;y.low&lt;/em&gt; part):&lt;/p&gt;


&lt;div class="photo"&gt;
&lt;img src="http://community.moertel.com/~thor/blog/pix-20070821/model-fitting-data.png" title="Data set for model fitting" alt="Data set for model fitting" /&gt;
&lt;/div&gt;

	&lt;p&gt;There aren&amp;#8217;t many points to go on, but because our model is so simple,
there are probably enough.  That means it&amp;#8217;s time to fit our model to
our data.&lt;/p&gt;


	&lt;p&gt;To fit our linear model, I&amp;#8217;ll use the &lt;em&gt;lm&lt;/em&gt; function from the amazingly
cool &lt;a href="http://www.r-project.org/"&gt;R statistics system&lt;/a&gt; (which I&amp;#8217;ve also
been using for our plots).  To summarize the results,
I&amp;#8217;ll use the &lt;em&gt;display&lt;/em&gt; function from the
&lt;a href="http://cran.r-project.org/src/contrib/Descriptions/arm.html"&gt;&amp;#8220;arm&amp;#8221; 
&lt;span class="caps"&gt;CRAN&lt;/span&gt; package&lt;/a&gt;, which accompanies Andrew Gelman and Jennifer Hill&amp;#8217;s
wonderful book &lt;a href="http://www.amazon.com/exec/obidos/ASIN/052168689X/ref=nosim/tommoertesweb-20"&gt;&lt;em&gt;Data Analysis Using Regression and
Multilevel/Hierarchical
Models&lt;/em&gt;&lt;/a&gt;.
(BTW, &lt;a href="http://www.stat.columbia.edu/~gelman/blog/"&gt;Gelman&amp;#8217;s blog&lt;/a&gt; is
fascinating.  It&amp;#8217;s one of my favorite reads.)  If you are following
along and don&amp;#8217;t have the &amp;#8220;arm&amp;#8221; package installed, you can use the
&lt;em&gt;summary&lt;/em&gt; function instead of &lt;em&gt;display&lt;/em&gt;.&lt;/p&gt;


	&lt;p&gt;First, let&amp;#8217;s fit the model to the conservative
data:&lt;/p&gt;


&lt;pre&gt;&lt;code class="typedin"&gt;M1.low &amp;lt;- lm (y.low ~ g:x + 0, data=subs.readers)
display(M1.low)
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;lm(formula = y.low ~ g:x + 0, data = subs.readers)
         coef.est coef.se
gFALSE:x 6.31     1.59
gTRUE:x  3.99     0.64
  n = 8, k = 2
  residual sd = 356.94, R-Squared = 0.90
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;That&amp;#8217;s a pretty good fit.  Both of our model parameters are
significant (even at the 1-percent level).  The resulting model says
that each subscriber represents about 4 actual readers (or 6.3 readers
if the subscriber count doesn&amp;#8217;t include Google Reader users).&lt;/p&gt;


	&lt;p&gt;Let&amp;#8217;s visualize the model, now fit to our data:&lt;/p&gt;


&lt;div class="photo"&gt;
&lt;img src="http://community.moertel.com/~thor/blog/pix-20070821/m1-fit.png" title="Our model, fit to our data" alt="Our model, fit to our data" /&gt;
&lt;/div&gt;

	&lt;p&gt;The gray line segments represent our fitted model&amp;#8217;s predictions.
Thus, for example, when we have &lt;em&gt;x&lt;/em&gt;&amp;#160;=&amp;#160;100 reported
subscribers, the model predicts that we have about
&lt;em&gt;y&lt;/em&gt;&amp;#160;=&amp;#160;630 actual readers.  Likewise, when we have
400 subscribers, the model predicts that we have about 1600 actual
readers.&lt;/p&gt;


	&lt;p&gt;The two line segments show how our model accommodates the &amp;#8220;Google
effect.&amp;#8221; On the left, we have the pre-Google slope; on the right, the
post-Google slope.  In effect, our model combines two simpler models
and chooses between them based on the Boolean factor &lt;em&gt;g&lt;/em&gt;.&lt;/p&gt;
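

	&lt;p&gt;To see the &amp;#8220;two simpler models&amp;#8221; idea in action, here is a small
sketch.  It re-enters the conservative data from the printed table (rounded
values, so expect small differences from the output shown above), fits the
combined model, and then fits a separate through-the-origin line to each group.
The names &lt;em&gt;d&lt;/em&gt;, &lt;em&gt;combined&lt;/em&gt;, &lt;em&gt;pre.google&lt;/em&gt;, and
&lt;em&gt;post.google&lt;/em&gt; are hypothetical, chosen only for illustration.&lt;/p&gt;


&lt;pre&gt;&lt;code class="typedin"&gt;d &amp;lt;- data.frame(
  y.low = c(158, 146, 173, 271, 1008, 1265, 1560, 1579),
  x     = c(53, 68, 89, 97, 112, 114, 385, 401),
  g     = rep(c(FALSE, TRUE), times = c(6, 2)))

# The combined varying-slope model: one slope per value of g, no intercept.
combined &amp;lt;- lm(y.low ~ g:x + 0, data = d)

# Two simpler models, each fit to one group only.
pre.google  &amp;lt;- lm(y.low ~ x + 0, data = d, subset = !g)
post.google &amp;lt;- lm(y.low ~ x + 0, data = d, subset = g)

# Put the slopes side by side for comparison.
slopes &amp;lt;- cbind(combined = unname(coef(combined)),
                separate = unname(c(coef(pre.google), coef(post.google))))
print(slopes)
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;The point estimates should match pair for pair.  The standard errors will
not, because the combined model pools the residual variance across both
groups.&lt;/p&gt;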


	&lt;p&gt;And that&amp;#8217;s all there is to the fitting process.
Let&amp;#8217;s repeat the process for the liberal-interpretation data.&lt;/p&gt;


&lt;pre&gt;&lt;code class="typedin"&gt;M1.high &amp;lt;- lm (y.high ~ g:x + 0, data=subs.readers)
display(M1.high)
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;lm(formula = y.high ~ g:x + 0, data = subs.readers)
         coef.est coef.se
gFALSE:x 8.46     2.25
gTRUE:x  6.08     0.91
  n = 8, k = 2
  residual sd = 504.22, R-Squared = 0.91
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Under this model, each subscriber represents about 6 actual readers
(or 8.5 if our subscriber count doesn&amp;#8217;t include Google Reader users).&lt;/p&gt;


	&lt;p&gt;Now that we have our models, let&amp;#8217;s use them to
predict actual readership.&lt;/p&gt;


	&lt;h3&gt;Using our models for prediction&lt;/h3&gt;


	&lt;p&gt;Models ready, we can now predict my blog&amp;#8217;s readership for any day, not
just those days on which I happened to include images in my
postings.&lt;/p&gt;


	&lt;p&gt;I have subscriber data in an R data frame called, unsurprisingly,
&lt;em&gt;subscriber.data&lt;/em&gt;.  It provides, for each day I have subscriber
statistics, values for &lt;em&gt;x&lt;/em&gt; and &lt;em&gt;g&lt;/em&gt;.  (This is the same
data set visualized in the earlier plot &amp;#8220;Aggregator-reported
subscribers to blog.moertel.com.&amp;#8221;) We can tell R to plug these values
into our model to predict the actual number of readers for those days.
Let&amp;#8217;s make both conservative and liberal predictions, storing them in
a new data frame called &lt;em&gt;predicted.readers&lt;/em&gt;:&lt;/p&gt;


&lt;pre&gt;&lt;code class="typedin"&gt;predicted.readers &amp;lt;-
  transform(subscriber.data,
            readers.low  = predict(M1.low, subscriber.data),
            readers.high = predict(M1.high, subscriber.data))
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Now let&amp;#8217;s plot our predictions.  First the plot code, just
so you can see how it&amp;#8217;s done in R:&lt;/p&gt;


&lt;pre&gt;&lt;code class="typedin"&gt;xyplot(readers.low + readers.high ~ date,
       data = predicted.readers,
       main = "Predicted actual readers of blog.moertel.com",
       ylab = "Readers",
       xlab = "Date",
       auto.key = list(x = .35, y = .9, corner = c(0,0),
                       text = c("conservative estimate",
                                "liberal estimate"),
                       reverse.rows = TRUE, between = -19))
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;And the resulting plot:&lt;/p&gt;


&lt;div class="photo"&gt;
&lt;img src="http://community.moertel.com/~thor/blog/pix-20070821/predicted-readers.png" title="Readership of blog.moertel.com, low and high predictions" alt="Readership of blog.moertel.com, low and high predictions" /&gt;
&lt;/div&gt;

	&lt;h3&gt;The bottom line&lt;/h3&gt;


	&lt;p&gt;We have distilled a ton of raw data into a simple formula for
predicting my blog&amp;#8217;s actual readership from readily available
subscriber counts.  Just take the total
subscriber count and multiply by 4 and 6, respectively, for low and
high estimates of readership.&lt;/p&gt;


	&lt;p&gt;So, to answer our original question, how many readers does my blog
have?  Only a few days ago, on 18 August, the online aggregators reported
that they were serving my feeds to 442 subscribers.  So we can predict
that, right now, my blog has 1750 to 2650 readers.&lt;/p&gt;
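

	&lt;p&gt;As a back-of-the-envelope check, the arithmetic can be done two ways:
with the rule-of-thumb multipliers of 4 and 6, or with the rounded post-Google
slopes printed by &lt;em&gt;display&lt;/em&gt; above.  The variable names here are
hypothetical, and because the slopes are rounded, the results are
approximate.&lt;/p&gt;


&lt;pre&gt;&lt;code class="typedin"&gt;subscribers &amp;lt;- 442   # reported on 18 August

# Rule-of-thumb multipliers.
rule.of.thumb &amp;lt;- subscribers * c(low = 4, high = 6)

# Rounded post-Google slopes from the two fitted models.
from.slopes &amp;lt;- subscribers * c(low = 3.99, high = 6.08)

print(rbind(rule.of.thumb, from.slopes))
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Both rows land near the rounded range quoted above.&lt;/p&gt;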


	&lt;p&gt;We have our answer.  Getting it took some doing, but the doing was
fun, so all&amp;#8217;s good.&lt;/p&gt;


	&lt;p&gt;Certainly, we could go on.  There are many interesting questions left
to be answered.  What, for example, is the growth trend of my
readership?  What is Google Reader&amp;#8217;s market share? For now, however,
it&amp;#8217;s time to take a break.&lt;/p&gt;


	&lt;p&gt;I hope you had fun following along.  If you have your own data, I&amp;#8217;d be
interested in hearing about your analytical explorations.  (And, if you
haven&amp;#8217;t installed &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; on your computer yet,
&lt;em&gt;do it now&lt;/em&gt;.  R is seriously cool and comes with great
documentation, examples, and sample data.  If you&amp;#8217;re not using R,
you&amp;#8217;re not having all the fun you deserve.)&lt;/p&gt;


&lt;div class="update"&gt;

	&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; minor editing tweaks for clarity.&lt;/p&gt;


&lt;/div&gt;</description>
      <pubDate>Wed, 22 Aug 2007 21:34:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:01ec2aa2-2a63-4f48-8ab6-a7c1b6af4c20</guid>
      <author>Tom Moertel</author>
      <link>http://blog.moertel.com/articles/2007/08/22/fun-with-statistics-estiating-blog-readership</link>
      <category>statistics</category>
      <category>R</category>
      <category>statistics</category>
      <category>blog</category>
      <category>fun</category>
      <category>modeling</category>
      <trackback:ping>http://blog.moertel.com/articles/trackback/544</trackback:ping>
    </item>
    <item>
      <title>Greasemonkey script annotates IMDb movies with their decoder-ring percentile ranks</title>
      <description>&lt;p&gt;Sam at &lt;a href="http://rephrase.net/"&gt;rephrase.net&lt;/a&gt; has harnessed the earth-shattering power of the &lt;a href="http://community.moertel.com/ss/space/IMDB+Movie-Rating+Decoder+Ring"&gt;IMDb movie-rating decoder ring&lt;/a&gt; to create a &lt;a href="http://rephrase.net/days/07/06/imdb-decoder"&gt;Greasemonkey script that annotates IMDb-listed movies with their percentile ranks&lt;/a&gt;. Now you don&amp;#8217;t need to look up a movie&amp;#8217;s &amp;#8220;star rating&amp;#8221; in the decoder ring to see where the movie ranks; the ranking appears right on the movie&amp;#8217;s IMDb page.&lt;/p&gt;


	&lt;p&gt;Do check out the &lt;a href="http://rephrase.net/box/user-js/scripts/imdb-percentile-ratings.user.js"&gt;script itself&lt;/a&gt;  to see how Sam cleverly embeds a copy of the decoder ring and plucks scores from it as needed.&lt;/p&gt;


	&lt;p&gt;For more on the IMDb movie-rating decoder ring, see:&lt;/p&gt;


	&lt;ul&gt;
	&lt;li&gt;&lt;a href="http://community.moertel.com/ss/space/IMDB+Movie-Rating+Decoder+Ring"&gt;the decoder ring itself&lt;/a&gt;&lt;/li&gt;
		&lt;li&gt;&lt;a href="http://blog.moertel.com/articles/2007/06/21/talk-fun-with-numbers-r-and-perl-and-imdb-data"&gt;my talk &lt;em&gt;Fun with Numbers: R and Perl and &lt;span class="caps"&gt;IMDB&lt;/span&gt; data&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;
		&lt;li&gt;&lt;a href="http://blog.moertel.com/articles/2006/01/17/mining-gold-from-the-internet-movie-database-part-1"&gt;Mining gold from the IMDb&lt;/a&gt;&lt;/li&gt;
	&lt;/ul&gt;</description>
      <pubDate>Wed, 11 Jul 2007 13:49:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:ccf23640-312c-49f8-9e89-7bae08d56c4f</guid>
      <author>Tom Moertel</author>
      <link>http://blog.moertel.com/articles/2007/07/11/greasmonkey-script-for-imdb-decoder-ring</link>
      <category>hacks</category>
      <category>imdb</category>
      <category>statistics</category>
      <category>greasmonkey</category>
      <trackback:ping>http://blog.moertel.com/articles/trackback/513</trackback:ping>
    </item>
    <item>
      <title>Talk: Fun with Numbers: R and Perl (and IMDB data)</title>
      <description>&lt;p&gt;Last week I gave a talk on the &lt;a href="http://www.r-project.org/"&gt;R statistics
system&lt;/a&gt; and Perl for the &lt;a href="http://pgh.pm.org/"&gt;Pittsburgh Perl
Mongers&lt;/a&gt;.  The example that threaded through the
talk was something I have written about here before, &lt;a href="http://blog.moertel.com/articles/2006/01/17/mining-gold-from-the-internet-movie-database-part-1"&gt;extracting
useful information from the Internet Movie
Database&lt;/a&gt;.
If you&amp;#8217;ve read my earlier &lt;a href="http://blog.moertel.com/articles/2006/01/17/mining-gold-from-the-internet-movie-database-part-1"&gt;blog
post&lt;/a&gt;
or have used the &lt;a href="http://community.moertel.com/ss/space/IMDB+Movie-Rating+Decoder+Ring"&gt;Grand Unified &lt;span class="caps"&gt;IMDB&lt;/span&gt; Movie Rating Decoder
Ring&lt;/a&gt;,
you might find the slides from the talk interesting.  They provide
some more details about the R and Perl code used to analyze the &lt;span class="caps"&gt;IMDB&lt;/span&gt; data
and create the decoder ring.&lt;/p&gt;


	&lt;p&gt;You can get the slides here:&lt;/p&gt;


&lt;div class="slide"&gt;
&lt;a href="http://community.moertel.com/~thor/talks/pgh-pm-perl-and-r.pdf"&gt;&lt;img src="http://community.moertel.com/~thor/talks/pgh-pm-perl-and-r.png" title="Title slide from my talk on R and Perl" alt="Title slide from my talk on R and Perl" /&gt;&lt;/a&gt;
&lt;/div&gt;</description>
      <pubDate>Thu, 21 Jun 2007 14:38:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:790fc9ef-72d5-43fc-b140-f0aaeccad6ee</guid>
      <author>Tom Moertel</author>
      <link>http://blog.moertel.com/articles/2007/06/21/talk-fun-with-numbers-r-and-perl-and-imdb-data</link>
      <category>talks</category>
      <category>perl</category>
      <category>talks</category>
      <category>R</category>
      <category>imdb</category>
      <category>statistics</category>
      <trackback:ping>http://blog.moertel.com/articles/trackback/481</trackback:ping>
    </item>
    <item>
      <title>New Fedora Core RPMS for CRAN packages arm, Matrix, lme4, car, coda, leaps, and mlmRev</title>
      <description>&lt;p&gt;Just a quick note for folks using the &lt;a href="http://www.r-project.org"&gt;R statistics
system&lt;/a&gt; on &lt;a href="http://fedoraproject.org/"&gt;Fedora
Linux&lt;/a&gt;.  I have packaged for Fedora a
bunch of R packages from the &lt;a href="http://cran.r-project.org/"&gt;&lt;span class="caps"&gt;CRAN&lt;/span&gt;&lt;/a&gt;.  (R
packages have to be packaged again, as &lt;span class="caps"&gt;RPM&lt;/span&gt; packages, to integrate with
Fedora Linux.)&lt;/p&gt;


	&lt;p&gt;My initial goal was to package
&lt;a href="http://cran.r-project.org/src/contrib/Descriptions/arm.html"&gt;arm&lt;/a&gt;,
which contains tools for working with various regression models.
(This package accompanies Andrew Gelman and Jennifer Hill&amp;#8217;s wonderful
book &lt;a href="http://www.amazon.com/exec/obidos/ASIN/0521867061/ref=nosim/tommoertesweb-20"&gt;Data Analysis Using Regression and Multilevel/Hierarchical Models&lt;/a&gt;.)
Packaging &amp;#8220;arm,&amp;#8221; however, quickly snowballed into packaging a bunch of
prerequisites.  Thankfully, I have now completed that task and can
share the fruits of my labor with you.&lt;/p&gt;


	&lt;p&gt;All in all, to install &amp;#8220;arm,&amp;#8221; you will need the following RPMs:&lt;/p&gt;


	&lt;ul&gt;
	&lt;li&gt;R-arm-1.0-2&lt;/li&gt;
		&lt;li&gt;R-car-1.2-1&lt;/li&gt;
		&lt;li&gt;R-lme4-0.9975-1&lt;/li&gt;
		&lt;li&gt;R-Matrix-0.9975-1&lt;/li&gt;
		&lt;li&gt;R-R2WinBUGS-2.0-1&lt;/li&gt;
	&lt;/ul&gt;


	&lt;p&gt;The following RPMs are optional (but you will need them if you
want to rebuild the RPMs):&lt;/p&gt;


	&lt;ul&gt;
	&lt;li&gt;R-coda-0.10-1&lt;/li&gt;
		&lt;li&gt;R-leaps-2.7-1&lt;/li&gt;
		&lt;li&gt;R-mlmRev-0.995-1&lt;/li&gt;
	&lt;/ul&gt;


	&lt;p&gt;You can download the packages from the &lt;a href="http://community.moertel.com/ss/space/RPMs"&gt;RPMs
section&lt;/a&gt; of the Community
Projects site.  Better yet, you can use Yum to
download them for you.  Just add the &lt;em&gt;moertel-community&lt;/em&gt;
Yum repository to your /etc/yum.repos.d directory (see &lt;a href="http://community.moertel.com/ss/space/RPMs"&gt;RPMs&lt;/a&gt; for the recipe) and then use the
following command:&lt;/p&gt;


&lt;pre&gt;&lt;code&gt;$ sudo yum install R-arm
&lt;/code&gt;&lt;/pre&gt;

	&lt;p&gt;Yum will automatically resolve dependencies and install the required
packages.  If you want any of the optional packages, add them after
&amp;#8220;R-arm&amp;#8221; on the command line.&lt;/p&gt;


	&lt;p&gt;I have built the packages for Fedora Core 6 on the x86_64 architecture, but the
&lt;a href="http://community.moertel.com/rpms/fedora/6/SPECS/"&gt;&lt;span class="caps"&gt;RPM&lt;/span&gt; specs are available&lt;/a&gt;
if you want to rebuild the packages for other architectures.  (See
the instructions for &lt;a href="http://community.moertel.com/ss/space/Rebuilding+RPMs"&gt;rebuilding RPMs&lt;/a&gt; for help.)&lt;/p&gt;


	&lt;p&gt;&lt;strong&gt;Caveat:&lt;/strong&gt;
I&amp;#8217;m not sure that the R-R2WinBUGS package is fully functional.  It
depends on BRugs, which doesn&amp;#8217;t yet build on the Linux platform.  To
get around this problem, I made R-R2WinBUGS&amp;#8217;s dependency on BRugs
weak; the first package no longer requires the second to install.&lt;/p&gt;</description>
      <pubDate>Wed, 25 Apr 2007 14:07:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:1f62fc7e-a01e-462d-b4ca-e7d8f92f3648</guid>
      <author>Tom Moertel</author>
      <link>http://blog.moertel.com/articles/2007/04/25/new-fedora-core-rpms-for-cran-packages</link>
      <category>statistics</category>
      <category>linux</category>
      <category>fedora</category>
      <category>R</category>
      <category>statistics</category>
      <category>rpms</category>
      <trackback:ping>http://blog.moertel.com/articles/trackback/446</trackback:ping>
    </item>
    <item>
      <title>Engauge Digitizer: a handy tool for extracting data from charts</title>
      <description>&lt;p&gt;Today I wanted to extract the data that were visualized in a
chart I saw on &lt;a href="http://www.blog.sethroberts.net/2007/04/14/omega-3-and-arithmetic-continued/"&gt;Seth Roberts&amp;#8217;s blog&lt;/a&gt;.  That is, I had a &lt;em&gt;picture&lt;/em&gt; of a data set, and I wanted the numbers behind the picture.&lt;/p&gt;


	&lt;p&gt;This task turned out to be surprisingly easy &amp;#8211; once I found &lt;a href="http://digitizer.sourceforge.net/"&gt;Engauge Digitizer&lt;/a&gt;, an open-source (GPL) tool made for this very task.  After I launched Engauge, the digitization process was straightforward:&lt;/p&gt;


	&lt;ol&gt;
	&lt;li&gt;I established the chart&amp;#8217;s coordinate system by clicking in the corners and entering the associated coordinates.&lt;/li&gt;
		&lt;li&gt;Then I had Engauge identify data points.  With the mouse, I selected a data point by hand, teaching Engauge what a point looks like. Then Engauge identified spots on the chart that looked like data points and locked on to them.  I was able to step through the points to tell Engauge to skip the few it misidentified.&lt;/li&gt;
		&lt;li&gt;I manually selected a few more data points that were scrunched into blobs and had eluded Engauge&amp;#8217;s point-detection heuristics.&lt;/li&gt;
		&lt;li&gt;Finally, I exported the data set in &lt;span class="caps"&gt;CSV&lt;/span&gt; format.&lt;/li&gt;
	&lt;/ol&gt;


	&lt;p&gt;If you ever need to extract the data behind a chart, do check out
Engauge Digitizer.  (If you use &lt;a href="http://fedoraproject.org/"&gt;Fedora Linux&lt;/a&gt;,
you&amp;#8217;ll be happy to know that I have packaged Engauge for you.
Get it at the &lt;a href="http://community.moertel.com/ss/space/RPMs"&gt;RPMs section&lt;/a&gt; 
of the community site.)&lt;/p&gt;</description>
      <pubDate>Tue, 17 Apr 2007 03:45:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:f8b0d9b7-7322-4d32-bed8-6f5ded82940f</guid>
      <author>Tom Moertel</author>
      <link>http://blog.moertel.com/articles/2007/04/17/engauge-digitizer-a-handy-tool-for-extracting-data-from-charts</link>
      <category>statistics</category>
      <category>fedora</category>
      <category>statistics</category>
      <category>data</category>
      <category>charts</category>
      <category>plots</category>
      <category>rpms</category>
      <category>tools</category>
      <trackback:ping>http://blog.moertel.com/articles/trackback/441</trackback:ping>
    </item>
    <item>
      <title>The IMDB Movie Rating Decoder Ring: updated w/ 2 March 2007 data</title>
      <description>&lt;p&gt;If you want to get more out of &lt;a href="http://imdb.com/"&gt;&lt;span class="caps"&gt;IMDB&lt;/span&gt;&lt;/a&gt; movie ratings, check out my
&lt;a href="http://community.moertel.com/ss/space/IMDB+Movie-Rating+Decoder+Ring"&gt;&lt;span class="caps"&gt;IMDB&lt;/span&gt; Movie Rating Decoder Ring&lt;/a&gt;, now updated with fresher data (as of 2 March 2007).&lt;/p&gt;</description>
      <pubDate>Fri, 09 Mar 2007 17:40:00 -0500</pubDate>
      <guid isPermaLink="false">urn:uuid:f75cbc12-2c78-4a30-9863-968dc535d1a3</guid>
      <author>Tom Moertel</author>
      <link>http://blog.moertel.com/articles/2007/03/09/the-imdb-movie-rating-decoder-ring-updated-w-2-march-2007-data</link>
      <category>statistics</category>
      <category>imdb</category>
      <category>statistics</category>
      <category>movies</category>
      <category>decoder_rinng</category>
      <category>ratings</category>
      <category>stars</category>
      <category>data</category>
      <trackback:ping>http://blog.moertel.com/articles/trackback/409</trackback:ping>
    </item>
    <item>
      <title>Mining gold from the Internet Movie Database, part 1: decoding user ratings</title>
      <description>&lt;p&gt;&lt;a href="http://imdb.com/"&gt;The Internet Movie Database&lt;/a&gt; (IMDb) is a rich source
of online movie information.  The problem is, the true gold is buried
deep beneath the site&amp;#8217;s user-friendly exterior and hidden within the
database itself.  With a little digging, however, we can extract the
gold, nugget by nugget, and learn about fun statistical tools for data
analysis.&lt;/p&gt;


	&lt;p&gt;Today, in the first part of our analysis, we will put our intuition
about rating systems to the test.  We will decode IMDb &amp;#8220;user ratings,&amp;#8221; 
those numbers such as 6.1 and 7.8 that summarize how the registered
users of the IMDb rated movies on a scale from 1 to 10, typically
depicted as a series of stars on the screen:&lt;/p&gt;


&lt;div style="text-align: center; margin: 1.5ex; "&gt;
&lt;img src="http://community.moertel.com/~thor/pix/20060114/sample-user-rating.png" title="sample user rating" alt="sample user rating" /&gt;
&lt;/div&gt;

	&lt;p&gt;We will extract the collective wisdom of registered IMDb users in
order to convert a movie&amp;#8217;s user rating into the movie&amp;#8217;s standing
within the database.  This gives us a good indicator of how the movie
stacks up against other movies in general, and that&amp;#8217;s good information
to have when deciding which movies to see in the theater or add to
your Netflix list.&lt;/p&gt;


	&lt;p&gt;Ready to start digging?  Let&amp;#8217;s go!&lt;/p&gt;&lt;h3&gt;Getting to know user ratings: fundamental descriptive statistics&lt;/h3&gt;


	&lt;p&gt;Like most online movie databases, the IMDb encourages its users to
rate movies on a numerical scale, in this case from 1 to 10.  The IMDb
software averages these ratings into a composite &amp;#8220;user rating&amp;#8221; for
each movie.  &lt;a href="http://imdb.com/title/tt0360717/"&gt;King Kong&lt;/a&gt;, for
example, currently has a user rating of 7.8.  &lt;a href="http://imdb.com/title/tt0388482/"&gt;Transporter
2&lt;/a&gt;, on the other hand, has a user
rating of 6.1.&lt;/p&gt;


	&lt;p&gt;Certainly, we have some sense of what these ratings mean.  A 6.1, for
example, is somewhat higher than the midpoint of the 1-to-10 scale.
Thus we might expect a 6.1-rated movie to be somewhat better than the
typical movie.  But is this expectation justified?  Also, we know a
7.8 is better than a 6.1.  But how much better?  Is it 1.7 stars
better?  And, if so, what does that mean?&lt;/p&gt;


	&lt;p&gt;To understand what user ratings mean, we must put them into context.
Let&amp;#8217;s assume that buried within the IMDb is some kind of useful
information that reflects the collective wisdom of the site&amp;#8217;s users.
When a movie is rated 7.8, we will assume that the rating means the
movie is &amp;#8220;better&amp;#8221; than lower-rated movies and &amp;#8220;worse&amp;#8221; than
higher-rated movies.  To what degree, we don&amp;#8217;t know for sure, but
that is what we are about to find out.&lt;/p&gt;


	&lt;p&gt;While we might not know what it means for a movie to be a &amp;#8220;7.8,&amp;#8221; we
probably do have a genuine sense for what it means for a movie to be
among the best of movies, or among the worst, or among the middle of
the pack.  We have developed this sense by experience, by watching
movies over our lifetimes.  What we need is some way of converting the
number 7.8 into something that registers with this
hard-earned experience.&lt;/p&gt;


	&lt;p&gt;As a starting point, let&amp;#8217;s examine the most fundamental descriptive
statistics of the IMDb&amp;#8217;s user ratings:&lt;/p&gt;


	&lt;table&gt;
		&lt;tr&gt;
			&lt;th style="text-align:right;"&gt;Count &lt;/th&gt;
			&lt;th style="text-align:right;"&gt;&amp;nbsp;Mean &lt;/th&gt;
			&lt;th style="text-align:right;"&gt;&amp;nbsp;Median &lt;/th&gt;
			&lt;th style="text-align:right;"&gt;&amp;nbsp;St.Dev. &lt;/th&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;23,396 &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;        6.2 &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;          6.4 &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;           1.4 &lt;/td&gt;
		&lt;/tr&gt;
	&lt;/table&gt;




	&lt;p&gt;Breaking them down:&lt;/p&gt;


	&lt;ul&gt;
	&lt;li&gt;&lt;em&gt;count&lt;/em&gt; &amp;#8211; There are 23,396 user ratings in the database.  (There are
  actually more, but to eliminate fringe movies I am considering only
  those movies that have been rated by more than 100 users.)&lt;/li&gt;
		&lt;li&gt;&lt;em&gt;mean&lt;/em&gt; &amp;#8211; The average user rating is 6.2.  While some ratings are
  lower and others higher, if you were to put all of the ratings in
  a blender and purée them into a homogeneous soup, the soup&amp;#8217;s
  overall rating would balance out to 6.2.&lt;/li&gt;
		&lt;li&gt;&lt;em&gt;median&lt;/em&gt; &amp;#8211; The rating that divides the database in half.  Ratings
  higher than 6.4 fall into the better half; ratings lower than 6.4, the
  worse half.&lt;/li&gt;
		&lt;li&gt;&lt;em&gt;standard deviation&lt;/em&gt; &amp;#8211; This is a measure of how spread out the
  ratings are.  Assuming the distribution of the ratings has a
  bell-curve shape, which we will investigate in a moment, about 68
  percent of the ratings will fall within one standard deviation of
  the mean, i.e., in the range 6.2 +/- 1.4 = 4.8 to 7.6.&lt;/li&gt;
	&lt;/ul&gt;
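

	&lt;p&gt;If you have R handy, the bell-curve share within one standard deviation is
easy to compute.  This sketch assumes an exactly normal distribution, which the
real ratings only approximate; &lt;em&gt;within.one.sd&lt;/em&gt; and
&lt;em&gt;one.sd.range&lt;/em&gt; are hypothetical names chosen for illustration.&lt;/p&gt;


&lt;pre&gt;&lt;code class="typedin"&gt;# Share of a normal distribution within one standard deviation of its mean.
within.one.sd &amp;lt;- pnorm(1) - pnorm(-1)
round(100 * within.one.sd)   # about 68 percent

# The corresponding range for mean 6.2 and standard deviation 1.4.
one.sd.range &amp;lt;- 6.2 + c(-1, 1) * 1.4
one.sd.range                 # 4.8 to 7.6
&lt;/code&gt;&lt;/pre&gt;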


	&lt;p&gt;Another way to examine the ratings is graphically.  The following
chart, called a &lt;em&gt;histogram&lt;/em&gt;, shows how many movies had each possible
user rating:&lt;/p&gt;


	&lt;p&gt;&lt;img src="http://community.moertel.com/~thor/pix/20060114/hist-all.png" title="Histogram of IMDb movie ratings" alt="Histogram of IMDb movie ratings" /&gt;&lt;/p&gt;


	&lt;p&gt;The ratings form a pointy bell curve.  It&amp;#8217;s easy to see that few
movies have ratings lower than 4 or higher than 8; most movies fall in
between.  The movies are most densely packed in the range that is a bit
higher than 6 and a bit lower than 8.  I have plotted the mean
(the triangle) and median (the &amp;#8220;X&amp;#8221;) along the bottom of the chart to put
them into perspective.&lt;/p&gt;
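

	&lt;p&gt;A histogram is nothing more than a tally of ratings per bin.  The following Python sketch shows the idea; the function name &lt;code&gt;histogram&lt;/code&gt;, the bin width, and the toy sample are illustrative assumptions, not the code or data behind the chart above.&lt;/p&gt;

&lt;pre&gt;
```python
from collections import Counter
import math

def histogram(ratings, bin_width=0.5):
    """Count ratings per bin; each key is the lower edge of a bin."""
    counts = Counter(math.floor(r / bin_width) * bin_width for r in ratings)
    return dict(sorted(counts.items()))

# Hypothetical toy sample; NOT the IMDb data.
toy = [2.9, 3.4, 5.0, 5.8, 6.1, 6.4, 6.6, 7.0, 7.8, 8.3]

# Print a crude text histogram, one row per non-empty bin.
for edge, n in histogram(toy, 1.0).items():
    print(f"{edge:4.1f} {'#' * n}")
```
&lt;/pre&gt;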


	&lt;h3&gt;Exploring the extremes&lt;/h3&gt;


	&lt;p&gt;With this information, we can begin to make crude interpretations of
user ratings.  Say we hear that
&lt;a href="http://imdb.com/title/tt0327554/"&gt;Catwoman&lt;/a&gt; has a user rating of 3.4.
Before we looked at the histogram, we probably could have guessed that
the movie was not good.  (We may even have heard as much from friends.)
But now that we have seen the histogram, we know that very few movies
had a rating lower than 4, let alone 3.4, and so we know the movie is
among the worst ever released.  It is, no pun intended, an outright
dog.&lt;/p&gt;


	&lt;p&gt;On the other side of the spectrum, 
&lt;a href="http://imdb.com/title/tt0372784/"&gt;Batman Begins&lt;/a&gt; has a user
rating of 8.3.  Since we know that few movies rate better than 8,
we know that this movie is probably among the very best.&lt;/p&gt;


	&lt;p&gt;The following histogram shows where both movies stand:&lt;/p&gt;


	&lt;p&gt;&lt;img src="http://community.moertel.com/~thor/pix/20060114/hist-all-catwoman.png" title="Histogram of IMDb movie ratings, augmented" alt="Histogram of IMDb movie ratings, augmented" /&gt;&lt;/p&gt;


	&lt;p&gt;So far, we understand the extremes of the rating system.  Movies lower
than 4 are probably terrible, and movies higher than 8 are probably
great.  No doubt, that is useful information.  But, what about that big
lump in the middle which represents the bulk of movies?  That is where
the real gold is hidden.  To get it, we must dig deeper.&lt;/p&gt;


	&lt;h3&gt;Charting the inner masses&lt;/h3&gt;


	&lt;p&gt;We already know Catwoman is bad, but how bad is it?  One way to
quantify its badness is to count how many movies in the database are
equally bad or worse, and compare that count to the size of the entire
database.  In the database, there are 1,060 movies with Catwoman&amp;#8217;s 3.4
user rating or lower.  The size of the entire database is 23,396
movies.  Dividing the first number by the second, we find that
Catwoman is among the worst 5 percent of movies in the database.
It is in the &lt;em&gt;5th percentile.&lt;/em&gt;&lt;/p&gt;
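

	&lt;p&gt;The division is simple enough to check by hand, but for the record here it is as a Python sketch.  The helper name &lt;code&gt;percentile_rank&lt;/code&gt; is an assumption for illustration; the two counts are the ones quoted above.&lt;/p&gt;

&lt;pre&gt;
```python
def percentile_rank(count_at_or_below, total):
    """Percentage of the database rated at or below a given movie."""
    return 100 * count_at_or_below / total

# Counts quoted in the article: 1,060 movies at or below 3.4, out of 23,396.
catwoman = percentile_rank(1060, 23396)
print(f"{catwoman:.1f}")  # about 4.5, which rounds to the 5th percentile
```
&lt;/pre&gt;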


	&lt;p&gt;We just turned a 3.4 user rating into a percentage that tells us where
3.4-rated movies stand with respect to all of the movies within the
database.  If we repeat the process for all possible movie ratings and
plot the results, we get a chart like this:&lt;/p&gt;


	&lt;p&gt;&lt;img src="http://community.moertel.com/~thor/pix/20060114/ecdf-all-catwoman.png" title="Empirical cumulative distribution of IMDb movie ratings" alt="Empirical cumulative distribution of IMDb movie ratings" /&gt;&lt;/p&gt;


	&lt;p&gt;Each point on the S-shaped curve relates a movie&amp;#8217;s rating with its
standing in the database.  The circle on the lower portion of the curve,
for example, represents Catwoman.  Its position corresponds to a 3.4
user rating on the horizontal axis and a 0.05 portion (5 percent) on
the vertical axis.  Thus a 3.4-rated movie is in the 5th percentile.
The triangle on the upper portion of the curve corresponds to Batman
Begins, relating the movie&amp;#8217;s 8.3 rating to its glorious standing in the
97th percentile.&lt;/p&gt;
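

	&lt;p&gt;Each point on the curve is just the Catwoman calculation repeated for a different rating.  A minimal Python sketch of that idea, an empirical cumulative distribution function, follows.  The names &lt;code&gt;ecdf&lt;/code&gt; and &lt;code&gt;standing&lt;/code&gt; and the toy sample are assumptions for illustration only; they are not the code or data used to draw the chart.&lt;/p&gt;

&lt;pre&gt;
```python
from bisect import bisect_right

def ecdf(ratings):
    """Return a function mapping a rating to the share of ratings at or below it."""
    ordered = sorted(ratings)
    n = len(ordered)

    def standing(rating):
        # bisect_right counts how many sorted values are at or below `rating`.
        return bisect_right(ordered, rating) / n

    return standing

# Hypothetical toy sample; NOT the IMDb data.
toy = [2.9, 3.4, 5.0, 5.8, 6.1, 6.4, 6.6, 7.0, 7.8, 8.3]
standing = ecdf(toy)

# Evaluate the curve at ratings 1.0, 1.5, ..., 10.0.
curve = [(r / 2, standing(r / 2)) for r in range(2, 21)]
for rating, share in curve:
    print(f"{rating:4.1f} {share:.2f}")
```
&lt;/pre&gt;

	&lt;p&gt;Because the function only ever counts more ratings as the rating grows, the resulting curve can never go down, which is why it climbs from 0 to 1.&lt;/p&gt;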


	&lt;p&gt;Because the curve covers all ratings, not just the extremes, we now
have a way to quantify the goodness or badness of middle-ground
movies.  Let&amp;#8217;s return to &lt;a href="http://imdb.com/title/tt0360717/"&gt;King Kong&lt;/a&gt;,
currently rated 7.8, and &lt;a href="http://imdb.com/title/tt0388482/"&gt;Transporter
2&lt;/a&gt;, currently rated 6.1.  Look up
their percentiles on the curve above.  (Try it.)  If you are careful,
you should get close to the actual values of 91 and 42, respectively.&lt;/p&gt;


	&lt;p&gt;This would be a good time to reflect upon our intuition about user
ratings.  Earlier, we thought a 6.1 user rating suggested that a movie
was somewhat better than the typical movie.  Now, however, we see that
a 6.1 is worth somewhat less than is typical.&lt;/p&gt;


	&lt;p&gt;Even though their ratings differ by only 1.7 user-rating units, King
Kong is in the 91st percentile &amp;#8211; very good &amp;#8211; and Transporter 2 is way
down in the 42nd percentile &amp;#8211; not so good.  To look at the difference
another way, about &lt;em&gt;half&lt;/em&gt; of the movies in the database fall in
between Transporter 2 and King Kong: 0.91 &amp;#8211; 0.42 = 0.49.  A small
difference in user ratings can represent a large difference in
standings, which might further challenge our intuition about ratings.&lt;/p&gt;


	&lt;p&gt;Additionally, differences in standings are not proportional to
differences in user ratings.  Catwoman, for example, has a user rating
of 3.4 and falls into the 5th percentile.  Transporter 2, with its 6.1
user rating, is a whole 2.7 user-rating units away from Catwoman, but
only 37 percent of movies stand between them.  Even though Transporter
2 is closer to King Kong in terms of user ratings, it is really closer
to Catwoman in terms of standing.&lt;/p&gt;
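

	&lt;p&gt;The comparisons above can be laid side by side with a few lines of Python.  This is only a sketch: the &lt;code&gt;movies&lt;/code&gt; dictionary and the &lt;code&gt;gaps&lt;/code&gt; helper are illustrative names, and the numbers are the ratings and percentiles quoted in the text.&lt;/p&gt;

&lt;pre&gt;
```python
# (user rating, percentile) pairs as quoted in the article.
movies = {
    "Catwoman": (3.4, 5),
    "Transporter 2": (6.1, 42),
    "King Kong": (7.8, 91),
}

def gaps(a, b):
    """Return (rating gap, percentile gap) between two movies."""
    (rating_a, pct_a), (rating_b, pct_b) = movies[a], movies[b]
    return round(abs(rating_a - rating_b), 1), abs(pct_a - pct_b)

print(gaps("Transporter 2", "King Kong"))  # (1.7, 49)
print(gaps("Transporter 2", "Catwoman"))   # (2.7, 37)
```
&lt;/pre&gt;

	&lt;p&gt;The smaller rating gap goes with the larger gap in standing, and vice versa.&lt;/p&gt;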


	&lt;h3&gt; Movie-rating decoder ring&lt;/h3&gt;


	&lt;p&gt;A chart is great for understanding the relationship between user
ratings and movie standings, but it is not ideal for day-to-day use,
when we just want to figure out where a movie stands before deciding
whether it is worth watching.  For times like that, a lookup table is a
convenient alternative.  The table below, for example, summarizes the
rating-standing relationship in a convenient &amp;#8220;decoder-ring&amp;#8221; format.
Find a movie&amp;#8217;s rating in the left column, and the corresponding entry
in the right column gives the movie&amp;#8217;s standing.&lt;/p&gt;


	&lt;table&gt;
		&lt;tr&gt;
			&lt;th&gt; Rating  &lt;/th&gt;
			&lt;th&gt;Percentile &lt;/th&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   4.00  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;      8     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   5.00  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     19     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   5.25  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     22     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   5.50  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     29     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   5.75  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     33     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   6.00  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     40     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   6.25  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     45     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   6.50  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     55     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   6.75  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     61     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   7.00  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     70     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   7.25  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     76     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   7.50  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     84     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   7.75  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     89     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   8.00  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     94     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   8.25  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     96     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   8.50  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     98     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   8.75  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;     99     &lt;/td&gt;
		&lt;/tr&gt;
		&lt;tr&gt;
			&lt;td style="text-align:right;"&gt;   9.00  &lt;/td&gt;
			&lt;td style="text-align:right;"&gt;    100     &lt;/td&gt;
		&lt;/tr&gt;
	&lt;/table&gt;




	&lt;p&gt;Using King Kong as an example again, let&amp;#8217;s look up 7.8.  It turns out
that 7.8 is not in the table, but 7.75 is, and it corresponds to the
89th percentile.  So we can guesstimate that King Kong is a bit above
the 89th percentile, which, as we know from earlier, is correct, the
actual value being 91.  The decoder ring is not as precise as the
chart, but it is more than good enough for finding a movie&amp;#8217;s
approximate standing quickly &amp;#8211; something that might be handy on
a Friday night.&lt;/p&gt;
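

	&lt;p&gt;The same lookup is easy to automate.  The Python sketch below copies the table rows and returns the row at or just below a given rating; the names &lt;code&gt;DECODER_RING&lt;/code&gt; and &lt;code&gt;lookup&lt;/code&gt; are assumptions for illustration, and rounding down to the nearest row is one reasonable choice among several.&lt;/p&gt;

&lt;pre&gt;
```python
from bisect import bisect_right

# (rating, percentile) rows copied from the decoder-ring table.
DECODER_RING = [
    (4.00, 8), (5.00, 19), (5.25, 22), (5.50, 29), (5.75, 33),
    (6.00, 40), (6.25, 45), (6.50, 55), (6.75, 61), (7.00, 70),
    (7.25, 76), (7.50, 84), (7.75, 89), (8.00, 94), (8.25, 96),
    (8.50, 98), (8.75, 99), (9.00, 100),
]
RATINGS = [rating for rating, _ in DECODER_RING]

def lookup(rating):
    """Return the table row at or just below `rating`, or None if below 4.00."""
    i = bisect_right(RATINGS, rating)
    return DECODER_RING[i - 1] if i else None

print(lookup(7.8))  # (7.75, 89)
```
&lt;/pre&gt;

	&lt;p&gt;Since the function rounds down to the nearest row, the true standing is at least the value it returns, just as King Kong turned out to be a bit above the 89th percentile.&lt;/p&gt;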


	&lt;h3&gt; Summary: weighing the gold&lt;/h3&gt;


	&lt;p&gt;What have we dug up so far?  First, we computed a few essential
descriptive statistics of the IMDb&amp;#8217;s user ratings.  We learned that
the average rating is 6.2 and that the median, which divides the
ratings into better and worse halves, is 6.4.&lt;/p&gt;


	&lt;p&gt;Second, we plotted a histogram in order to inspect the ratings
visually.  Right away, we could tell that movies rated lower than 4
are among the very worst, and movies rated higher than 8 are among the
very best.&lt;/p&gt;


	&lt;p&gt;Third, in order to give more meaning to ratings in between those two
extremes, we turned to percentiles.  We computed Catwoman&amp;#8217;s by hand;
it&amp;#8217;s in the 5th percentile &amp;#8211; ouch!  Then we plotted a curve that
represents the relationship between user ratings and percentiles.
Using this curve we determined that King Kong is in the 91st
percentile and Transporter 2 is in the 42nd percentile &amp;#8211; a large
difference in movie standings.&lt;/p&gt;


	&lt;p&gt;Finally, we created a tabular &amp;#8220;decoder ring&amp;#8221; to summarize what the
curve depicted.  It is a quick and easy way to find a movie&amp;#8217;s
standing given its user rating.&lt;/p&gt;


	&lt;p&gt;That concludes our first dig of the Internet Movie Database.  Next
time, we will examine the factors that influence movie ratings.  Are
Documentaries better than Horror flicks?  Are old movies generally
better than new movies?  We will ask those questions and more in the
next part of the series.&lt;/p&gt;


	&lt;p&gt;Until then, enjoy a movie or two.  And don&amp;#8217;t forget your slide-rule.&lt;/p&gt;


	&lt;h3&gt;Acknowledgments&lt;/h3&gt;


	&lt;p&gt;The movie information used in this article is courtesy of &lt;a href="http://www.imdb.com"&gt;The Internet
Movie Database&lt;/a&gt; and used with permission.&lt;/p&gt;


	&lt;p&gt;Second, my analysis was performed with
&lt;a href="http://www.r-project.org/about.html"&gt;R&lt;/a&gt; software from the &lt;a href="http://www.r-project.org/"&gt;R Project
for Statistical Computing&lt;/a&gt;.  R is a great
statistics package.  It&amp;#8217;s Free Software, and it has a great community
around it.  &lt;em&gt;Do&lt;/em&gt; check it out.&lt;/p&gt;</description>
      <pubDate>Tue, 17 Jan 2006 20:59:00 -0500</pubDate>
      <guid isPermaLink="false">urn:uuid:305161727c36a521acc37a1452ee7be2</guid>
      <author>Tom Moertel</author>
      <link>http://blog.moertel.com/articles/2006/01/17/mining-gold-from-the-internet-movie-database-part-1</link>
      <category>movies</category>
      <category>statistics</category>
      <category>R</category>
      <category>imdb</category>
      <category>statistics</category>
      <category>movies</category>
      <trackback:ping>http://blog.moertel.com/articles/trackback/22</trackback:ping>
    </item>
    <item>
      <title>Open-source statistics: R and ESS</title>
      <description>&lt;p&gt;Recently, I needed to perform some statistical work. But I didn&amp;#8217;t want to use my previous tool-of-choice, Mathematica, because I decided after my switch to Linux not to rely on proprietary software when viable open-source alternatives existed. And thus I embarked on a short search for open-source statistics software.&lt;/p&gt;


	&lt;h3&gt;R&lt;/h3&gt;


	&lt;p&gt;My search was fruitful, leading me immediately to the delightfully &lt;span class="caps"&gt;GPL&lt;/span&gt;-licensed &lt;a href="http://www.r-project.org/"&gt;R Project for Statistical Computing&lt;/a&gt;: &amp;#8220;R is a language and environment for statistical computing and graphics.&amp;#8221; (The R system and language are similar to S, developed at Bell Labs.) The R language has functional-programming semantics (which I love) and supports (among others) the object-oriented style of programming, which is used extensively for R&amp;#8217;s statistical interface. Most results in R are delivered in terms of objects, such as tables and vectors and linear models, whose properties you can inspect and manipulate as you would expect. The underlying classes provide specialized methods for common operations so that the objects do the right things in response to generic commands.&lt;/p&gt;


	&lt;p&gt;Immediately, I was hooked on R. Despite having a steep initial learning curve, R is straightforward to use. Once you get the lay of the land, you can reliably guess what functions and their arguments mean. The help facility is good, too, and can integrate with your web browser if you desire.&lt;/p&gt;


	&lt;p&gt;And the graphics! Graphs and charts are often the first, best way to size up data sets. R makes it easy to create publication-quality graphs and charts, drawing on any number of supported &amp;#8220;graphical devices.&amp;#8221; Among the stock devices are postscript, pdf, LaTeX, png, xfig, postscript-rendered bitmaps, and &lt;span class="caps"&gt;X11&lt;/span&gt; (windows). For a tiny example of R&amp;#8217;s graphics, see my posts on &lt;a href="http://blog.moertel.com/articles/2006/01/17/mining-gold-from-the-internet-movie-database-part-1"&gt;Mining gold from the Internet Movie Database&lt;/a&gt;.&lt;/p&gt;


	&lt;p&gt;To make the already-attractive R downright irresistible, the R community offers the &lt;a href="http://cran.r-project.org/mirrors.html"&gt;Comprehensive R Archive Network&lt;/a&gt; (CRAN), the R equivalent of Perl&amp;#8217;s &lt;a href="http://search.cpan.org/"&gt;&lt;span class="caps"&gt;CPAN&lt;/span&gt;&lt;/a&gt;. (One of the &lt;span class="caps"&gt;CRAN&lt;/span&gt; mirrors is hosted by Pittsburgh&amp;#8217;s own &lt;a href="http://www.pair.com/"&gt;pair networks&lt;/a&gt;.) &lt;span class="caps"&gt;CRAN&lt;/span&gt; provides packages for esoteric methods of analysis, database integration, genetics, time series analysis, &lt;span class="caps"&gt;HTTP&lt;/span&gt; (!), map projections, vegetation science, and myriad others. Additionally, &lt;span class="caps"&gt;CRAN&lt;/span&gt; provides numerous sample data sets, many corresponding to examples and problem sets from popular statistics textbooks. (I should note that R, out of the box, comes loaded with tools and sample data. &lt;span class="caps"&gt;CRAN&lt;/span&gt; isn&amp;#8217;t in any way remedial but rather expands R&amp;#8217;s initial richness to mind-blowing proportions.)&lt;/p&gt;


	&lt;h3&gt;&lt;span class="caps"&gt;ESS&lt;/span&gt;&lt;/h3&gt;


	&lt;p&gt;Once I started to use R frequently, I grew tired of the command-line interface. That&amp;#8217;s where &lt;a href="http://ess.r-project.org/"&gt;Emacs Speaks Statistics&lt;/a&gt; (ESS) comes in. It&amp;#8217;s an add-on to Emacs that provides a seamless, rich interface to R (and other statistics packages). Since I live in Emacs, &lt;span class="caps"&gt;ESS&lt;/span&gt; was a natural fit for my working style. Highly recommended. (If you&amp;#8217;re interested, I have made a Fedora/RedHat &lt;span class="caps"&gt;RPM&lt;/span&gt; package for &lt;span class="caps"&gt;ESS&lt;/span&gt;. Get it in the RPMs section of the site.)&lt;/p&gt;


	&lt;h3&gt;Summary&lt;/h3&gt;


	&lt;p&gt;If you&amp;#8217;re looking for a good statistics system, get R. Now. And if you use Emacs, too, by all means get &lt;span class="caps"&gt;ESS&lt;/span&gt;. (If you just need a few bare-bones tools, however, you might want to check out my tiny statistics tools in &lt;a href="http://community.moertel.com/ss/space/Tom%27s+Perl+code"&gt;Tom&amp;#8217;s Perl code&lt;/a&gt; on the &lt;a href="http://community.moertel.com/"&gt;Community Projects site&lt;/a&gt;.)&lt;/p&gt;</description>
      <pubDate>Fri, 27 Aug 2004 12:00:00 -0400</pubDate>
      <guid isPermaLink="false">urn:uuid:8851b77d64405a297f0c7d1758379e9c</guid>
      <author>Tom Moertel</author>
      <link>http://blog.moertel.com/articles/2004/08/27/open-source-statistics-r-and-ess</link>
      <category>statistics</category>
      <category>math</category>
      <category>R</category>
      <category>statistics</category>
      <category>ess</category>
      <category>mathematica</category>
      <category>oss</category>
      <trackback:ping>http://blog.moertel.com/articles/trackback/28</trackback:ping>
    </item>
  </channel>
</rss>
