Posted by Tom Moertel
Sun, 26 Aug 2007 01:56:00 GMT
The R statistics system can produce
first-class data visualizations, commonly known as plots. Internally,
plots are represented in an abstract graphics format that can be
rendered on any of R’s wide range of graphics “devices” to produce
concrete output – windows, bitmap files, PostScript files, PDF files,
and others.
The bitmap formats, such as PNG, are preferred for posting
plots online because of their widespread support by web browsers. The
default bitmap-rendering devices in R, unfortunately, produce graphics
that look a little too “bitmapped” for modern web tastes. Here, for example,
is a plot rendered by R’s “png” device:
There’s nothing technically wrong with the plot, but it looks out
of place on a web page. That’s because modern web
browsers use font-smoothing and anti-aliasing techniques to render
just about everything else on the page. Against this clean, un-jagged
backdrop, the oh-so-bitmapped plot looks like a throwback to
a previous era.
Happily, we can produce clean, anti-aliased R plots with a little
help. Here’s the earlier plot, anti-aliased:
To produce the anti-aliased plot, I used R to produce a PDF file. Then I
rendered the PDF file into a PNG image at 300 dpi using Ghostscript.
Finally, I scaled the 300-dpi image down to screen resolution,
producing a high-quality, anti-aliased result.
Here’s the recipe in detail.
First, I define an R function called pdfit that takes an
abstract graphics object and makes a PDF-file rendering of it, using
my preferred graphics-device settings:
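library(lattice)

# Render a lattice plot object f into a PDF file. Extra arguments
# (such as file=) are passed through to the pdf device.
pdfit <- function(f, ...) {
  trellis.device(device = pdf, theme = "col.whitebg", ...)
  print(f)
  dev.off()
}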
Then, when I create a plot I want to publish, I use pdfit to render
it into a PDF file:
P.img <- xyplot( subs.low + subs.high ~ date, ... )
pdfit(P.img, file="image-downloads.pdf") # render plot into PDF file
Finally, I use Ghostscript and
ImageMagick to convert the PDF file into
a high-quality, anti-aliased PNG file. (I keep both formats: the PDF
file is best for publishing in printed papers, and the PNG file is
best for posting online.) I use a simple Makefile to automate the
process of converting the PDF files into PNG files:
# Makefile (GNU make)
pdfs := $(wildcard *.pdf)
pngs := $(pdfs:.pdf=.png)
all: $(pngs)
.PHONY: all
%.png: %.pdf
	gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m \
	  -dGraphicsAlphaBits=4 -dTextAlphaBits=4 -r300 \
	  -dBackgroundColor='16#ffffff' \
	  -sOutputFile=$@ $< > /dev/null && \
	mogrify -resize 500 $@
With this Makefile in my graphics directory, a single “make”
command is all it takes to convert my PDF images into
anti-aliased PNG files, ready to post online.
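If you would rather stay inside R than reach for make, the same two steps can be sketched with system2(). This is only a sketch under a few assumptions: the gs and mogrify programs are on your PATH, and the helper names gs_args and pdf_to_png are hypothetical names made up for this example, not part of R or of any package. The Ghostscript flags are the ones from the Makefile.

```r
# Build the Ghostscript arguments for rendering one PDF file into a PNG.
# (gs_args is a hypothetical helper name, not an existing R function.)
gs_args <- function(pdf, png = sub("\\.pdf$", ".png", pdf), dpi = 300) {
  c("-dSAFER", "-dBATCH", "-dNOPAUSE", "-sDEVICE=png16m",
    "-dGraphicsAlphaBits=4", "-dTextAlphaBits=4",
    paste0("-r", dpi),
    "-dBackgroundColor=16#ffffff",
    paste0("-sOutputFile=", png),
    pdf)
}

# Render the PDF at high resolution, then scale it down to `width` pixels.
# Assumes the external programs gs and mogrify are installed.
pdf_to_png <- function(pdf, width = 500, dpi = 300) {
  png <- sub("\\.pdf$", ".png", pdf)
  status <- system2("gs", shQuote(gs_args(pdf, png, dpi)), stdout = FALSE)
  if (status != 0) stop("Ghostscript failed on ", pdf)
  status <- system2("mogrify", shQuote(c("-resize", width, png)))
  if (status != 0) stop("mogrify failed on ", png)
  invisible(png)
}

# Example call (needs an existing PDF file plus gs and mogrify):
# pdf_to_png("image-downloads.pdf")
```

The arguments are passed through shQuote() because system2() hands them to a shell, and the background-color value contains a “#” character.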
And that’s it.
Do you have any tips or tricks for making good-looking graphics with
R? If so, please do share.
Update: There is one downside to the sexy, anti-aliased plots: they
are not as compressible as the old-style jagged plots. For the
images above, for instance, the anti-aliased PNG file weighs in
at 45 KB, but the original PNG file is a feathery 4.7 KB.
So, if bandwidth is precious to you – or you’re planning on getting
Slashdotted – you might want to stick
with the jaggies.
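To check the trade-off for your own plots, you can compare file sizes from within R. A minimal sketch: size_ratio is a hypothetical helper name, and the file names in the commented-out call are placeholders, not files from this post.

```r
# Ratio of two file sizes in bytes (first divided by second).
# (size_ratio is a hypothetical helper name made up for this sketch.)
size_ratio <- function(smooth_png, jagged_png) {
  sizes <- file.size(c(smooth_png, jagged_png))
  if (anyNA(sizes)) stop("one of the files does not exist")
  sizes[1] / sizes[2]
}

# The figures quoted above: 45 KB anti-aliased vs. 4.7 KB original.
round(45 / 4.7, 1)   # 9.6

# Example call with placeholder file names:
# size_ratio("plot-antialiased.png", "plot-original.png")
```

So, for the images above, the anti-aliased file is a little under ten times the size of the jagged one.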
Posted in statistics
Tags graphics, plots, R, statistics, tips, tricks

Posted by Tom Moertel
Thu, 23 Aug 2007 01:34:00 GMT
As everybody knows, statistics is fun. Is there
anything cooler than crushing a heap of seemingly uninteresting
numbers into gleaming jewels of meaning? Of course not! Models,
data-visualization plots, and fat data sets are way cool.
So, let’s find an excuse to play with them.
Here’s an excuse –
I mean, an important and highly relevant question that many of us share:
How many people actually read our blogs? To answer the
question, we will need to use statistics, data, and cool plots.
Further, if you’ve got the raw data for your blog, you can follow
along with your own analysis. Even more fun!
We’ll start with a simple inspection of common web-log data, using
command-line tools. After developing a rough understanding of what
useful information we can extract, we’ll analyze the raw data using a
series of successively more sophisticated techniques. In the end, we
will derive a simple formula for estimating readership from easily
obtainable data.
Sound good? Then let’s get rocking.
But first, a preemptive strike on would-be pooh-poohers: I know all about
FeedBurner. I know they will track my blog’s subscribers and use
their mystical powers to infer the number of “real” subscribers I
have. I know it’s all so easy. But easy isn’t the point. I want to
understand what’s going on. Just taking somebody’s word for it isn’t
nearly as satisfying as figuring it out yourself – nor as fun.
OK. For real this time, let’s get rocking.
Read more...
Posted in statistics
Tags blog, fun, modeling, R, statistics

Posted by Tom Moertel
Wed, 11 Jul 2007 17:49:00 GMT
Sam at rephase.net has harnessed the earth-shattering power of the IMDb movie-rating decoder ring to create a Greasemonkey script that annotates IMDb-listed movies with their percentile ranks. Now you don’t need to look up a movie’s “star rating” in the decoder ring to see where the movie ranks; the ranking appears right on the movie’s IMDb page.
Do check out the script itself to see how Sam cleverly embeds a copy of the decoder ring and plucks scores from it as needed.
For more on the IMDb movie-rating decoder ring, see:
Posted in hacks
Tags greasmonkey, imdb, statistics

Posted by Tom Moertel
Thu, 21 Jun 2007 18:38:00 GMT
Last week I gave a talk on the R statistics
system and Perl for the Pittsburgh Perl
Mongers. The example that threaded through the
talk was something I have written about here before, extracting
useful information from the Internet Movie
Database.
If you’ve read my earlier blog
post
or have used the Grand Unified IMDB Movie Rating Decoder
Ring,
you might find the slides from the talk interesting. They provide
some more details about the R and Perl code used to analyze the IMDB data
and create the decoder ring.
You can get the slides here:
Posted in talks
Tags imdb, perl, R, statistics, talks

Posted by Tom Moertel
Wed, 25 Apr 2007 18:07:00 GMT
Just a quick note for folks using the R statistics
system on Fedora
Linux. I have packaged for Fedora a
bunch of R packages from CRAN. (R
packages have to be packaged again, as RPM packages, to integrate with
Fedora Linux.)
My initial goal was to package
arm,
which contains tools for working with various regression models.
(This package accompanies Andrew Gelman and Jennifer Hill’s wonderful
book Data Analysis Using Regression and Multilevel/Hierarchical Models.)
Packaging “arm,” however, quickly snowballed into packaging a bunch of
prerequisites. Thankfully, I have now completed that task and can
share the fruits of my labor with you.
All in all, to install “arm,” you will need the following RPMs:
- R-arm-1.0-2
- R-car-1.2-1
- R-lme4-0.9975-1
- R-Matrix-0.9975-1
- R-R2WinBUGS-2.0-1
The following RPMs are optional (but you will need them if you
want to rebuild the RPMs):
- R-coda-0.10-1
- R-leaps-2.7-1
- R-mlmRev-0.995-1
You can download the packages from the RPMs
section of the Community
Projects site. Better yet, you can use Yum to
download them for you. Just add the moertel-community
Yum repository to your /etc/yum.repos.d directory (see RPMs for the recipe) and then use the
following command:
$ sudo yum install R-arm
Yum will automatically resolve dependencies and install the required
packages. If you want any of the optional packages, add them after
“R-arm” on the command line.
I have built the packages for Fedora Core 6 on the x86_64 architecture, but the
RPM specs are available
if you want to rebuild the packages for other architectures. (See
the instructions for rebuilding RPMs for help.)
Caveat:
I’m not sure that the R-R2WinBUGS package is fully functional. It
depends on BRugs, which doesn’t yet build on the Linux platform. To
get around this problem, I made R-R2WinBUGS’s dependency on BRugs
weak; the first package no longer requires the second to install.
Posted in statistics, linux
Tags fedora, R, rpms, statistics

Posted by Tom Moertel
Tue, 17 Apr 2007 07:45:00 GMT
Today I wanted to extract the data that were visualized in a
chart I saw on Seth Roberts’s blog. That is, I had a picture of a data set, and I wanted the numbers behind the picture.
This task turned out to be surprisingly easy – once I found Engauge Digitizer, an open-source (GPL) tool made for this very task. After I launched Engauge, the digitization process was straightforward:
- I established the chart’s coordinate system by clicking in the corners and entering the associated coordinates.
- Then I had Engauge identify data points. With the mouse, I selected a data point by hand, teaching Engauge what a point looks like. Then Engauge identified spots on the chart that looked like data points and locked on to them. I was able to step through the points to tell Engauge to skip the few it misidentified.
- I manually selected a few more data points that were scrunched into blobs and had eluded Engauge’s point-detection heuristics.
- Finally, I exported the data set in CSV format.
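What the first step establishes is a mapping from pixel positions in the image to data values. The sketch below is not Engauge's code, and I don't know how Engauge does it internally; it is only a minimal R illustration of the idea, assuming linear, axis-aligned axes. The helper name make_axis_map and all the pixel positions and axis values are made up for the example.

```r
# Given the pixel positions of two reference points on one axis and the
# data values they stand for, return a function that converts any pixel
# position on that axis into a data value by linear interpolation.
# (make_axis_map is a hypothetical helper name.)
make_axis_map <- function(pix, val) {
  slope <- (val[2] - val[1]) / (pix[2] - pix[1])
  function(p) val[1] + slope * (p - pix[1])
}

# Made-up calibration: x pixels 50..450 span data values 0..100;
# y pixels 400..40 span data values 0..10 (pixel y grows downward).
px_to_x <- make_axis_map(c(50, 450), c(0, 100))
px_to_y <- make_axis_map(c(400, 40), c(0, 10))

px_to_x(250)   # 50
px_to_y(220)   # 5
```

Logarithmic axes or a skewed scan would need a different mapping, which is one reason to let a dedicated tool handle it.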
If you ever need to extract the data behind a chart, do check out
Engauge Digitizer. (If you use Fedora Linux,
you’ll be happy to know that I have packaged Engauge for you.
Get it at the RPMs section
of the community site.)
Posted in statistics
Tags charts, data, fedora, plots, rpms, statistics, tools

Posted by Tom Moertel
Fri, 09 Mar 2007 22:40:00 GMT
If you want to get more out of IMDB movie ratings, check out my
IMDB Movie Rating Decoder Ring, now updated with fresher data (as of 2 March 2007).
Posted in statistics
Tags data, decoder_rinng, imdb, movies, ratings, stars, statistics

Posted by Tom Moertel
Wed, 18 Jan 2006 01:59:00 GMT
The Internet Movie Database (IMDb) is a rich source
of online movie information. The problem is, the true gold is buried
deep beneath the site’s user-friendly exterior and hidden within the
database itself. With a little digging, however, we can extract the
gold, nugget by nugget, and learn about fun statistical tools for data
analysis.
Today, in the first part of our analysis, we will put our intuition
about rating systems to the test. We will decode IMDb “user ratings,”
those numbers such as 6.1 and 7.8 that summarize how the registered
users of the IMDb rated movies on a scale from 1 to 10, typically
depicted as a series of stars on the screen:
We will extract the collective wisdom of registered IMDb users in
order to convert a movie’s user rating into the movie’s standing
within the database. This gives us a good indicator of how the movie
stacks up against other movies in general, and that’s good information
to have when deciding which movies to see in the theater or add to
your Netflix list.
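To give a flavor of what converting a rating into a standing means, here is a minimal sketch using R's empirical cumulative distribution function. It is not necessarily the method used in the full analysis, and the ratings are synthetic numbers generated for illustration, not IMDb data.

```r
# Synthetic ratings on a 1-to-10 scale, for illustration only.
set.seed(1)
ratings <- round(pmin(pmax(rnorm(1000, mean = 6.2, sd = 1.2), 1), 10), 1)

# ecdf() returns a function giving the fraction of ratings at or below x.
percentile <- ecdf(ratings)

# Fraction of the synthetic movies rated 7.8 or lower:
percentile(7.8)
```

A movie's standing is then just the fraction of movies it matches or beats.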
Ready to start digging? Let’s go!
Read more...
Posted in movies, statistics
Tags imdb, movies, R, statistics

Posted by Tom Moertel
Fri, 27 Aug 2004 16:00:00 GMT
Recently, I needed to perform some statistical work. But I didn’t want to use my previous tool-of-choice, Mathematica, because I decided after my switch to Linux not to rely on proprietary software when viable open-source alternatives existed. And thus I embarked on a short search for open-source statistics software.
R
My search was fruitful, leading me immediately to the delightfully GPL-licensed R Project for Statistical Computing: “R is a language and environment for statistical computing and graphics.” (The R system and language are similar to S, developed at Bell Labs.) The R language has functional-programming semantics (which I love) and supports (among others) the object-oriented style of programming, which is used extensively for R’s statistical interface. Most results in R are delivered in terms of objects, such as tables, vectors, and linear models, whose properties you can inspect and manipulate as you would expect. The underlying classes provide specialized methods for common operations so that the objects do the right things in response to generic commands.
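Here is a small illustration of objects responding to generic commands, using the cars and mtcars data sets that ship with R. The variable names fit and tab are just names chosen for this example.

```r
# Fit a linear model; the result is an object of class "lm".
fit <- lm(dist ~ speed, data = cars)
class(fit)      # "lm"
coef(fit)       # the intercept and the slope for speed

# summary() is a generic: for an "lm" object it dispatches to summary.lm.
summary(fit)

# The same generic applied to a table object dispatches to summary.table.
tab <- table(mtcars$cyl)
class(tab)      # "table"
summary(tab)
```

The same summary() call gives a regression report in one case and a description of the table in the other, because each class supplies its own method.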
Immediately, I was hooked on R. Despite having a sharp initial learning curve, R is straightforward to use. Once you get the lay of the land, you can reliably guess what functions and their arguments mean. The help facility is good, too, and can integrate with your web browser if you desire.
And the graphics! Graphs and charts are often the first, best way to size up data sets. R makes it easy to create publication-quality graphs and charts, drawing on any number of supported “graphical devices.” Among the stock devices are postscript, pdf, LaTeX, png, xfig, postscript-rendered bitmaps, and X11 (windows). For a tiny example of R’s graphics, see my posts on Mining gold from the Internet Movie Database.
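As a quick sketch of how devices work, the same plotting code can be sent to two different devices just by opening a different one first. The function name draw_plot and the output file names are arbitrary choices for this example; the files go into R's temporary directory.

```r
# One piece of plotting code, independent of the output format.
draw_plot <- function() {
  plot(cars, main = "Stopping distance vs. speed")
}

out_dir  <- tempdir()
pdf_file <- file.path(out_dir, "cars.pdf")
ps_file  <- file.path(out_dir, "cars.ps")

# Render to the pdf device.
pdf(pdf_file, width = 6, height = 4)
draw_plot()
dev.off()

# Render the same plot to the postscript device.
postscript(ps_file, width = 6, height = 4)
draw_plot()
dev.off()
```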
To make the already-attractive R downright irresistible, the R community offers the Comprehensive R Archive Network (CRAN), the R equivalent of Perl’s CPAN. (One of the CRAN mirrors is hosted by Pittsburgh’s own pair networks.) CRAN provides packages for esoteric methods of analysis, database integration, genetics, time series analysis, HTTP (!), map projections, vegetation science, and myriad others. Additionally, CRAN provides numerous sample data sets, many corresponding to examples and problem sets from popular statistics textbooks. (I should note that R, out of the box, comes loaded with tools and sample data. CRAN isn’t in any way remedial but rather expands R’s initial richness to mind-blowing proportions.)
ESS
Once I started to use R frequently, I grew tired of the command-line interface. That’s where Emacs Speaks Statistics (ESS) comes in. It’s an add-on to Emacs that provides a seamless, rich interface to R (and other statistics packages). Since I live in Emacs, ESS was a natural fit for my working style. Highly recommended. (If you’re interested, I have made a Fedora/RedHat RPM package for ESS. Get it in the RPMs section of the site.)
Summary
If you’re looking for a good statistics system, get R. Now. And if you use Emacs, too, by all means get ESS. (If you just need a few bare-bones tools, however, you might want to check out my tiny statistics tools in Tom’s Perl code on the Community Projects site.)
Posted in statistics
Tags ess, math, mathematica, oss, R, statistics
