Finally! I have blogged 100 thousand words.

Posted by Tom Moertel Mon, 30 Jan 2012 04:48:00 GMT

I have finally done it! With my recent post on tree traversals, I have managed to write 100 thousand words for my blog:

>> Article.find(:all).inject(0) { |sum,a| sum +=
?>        (a.body + a.extended.to_s).split(/\s+/).length }
=> 100334

That sounds impressive until you realize that my first blog post, Fun with Asterisk, was about nine years ago. So we’re only talking, on average, about 11 thousand words per year. And that’s not hard, if you stick with it.
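
For anyone without the blog's Rails console at hand, here is a minimal sketch of the same counting technique and of the per-year average. The `word_count` helper and the two sample strings are hypothetical stand-ins for the `Article` records; only the 100,334 total and the roughly nine-year span come from the figures above.

```ruby
# Count whitespace-separated words, the same way the irb one-liner does.
def word_count(text)
  text.to_s.split(/\s+/).length
end

# Hypothetical stand-ins for article bodies; these are not real posts.
sample_posts = ["Fun with Asterisk", "A short post about tree traversals"]
sample_total = sample_posts.inject(0) { |sum, body| sum + word_count(body) }

# Average pace implied by the figures quoted in the post.
total_words = 100_334
years = 9
words_per_year = total_words / years.to_f

puts sample_total          # words in the two sample strings
puts words_per_year.round  # roughly 11 thousand words per year
```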

For me, the trick has been sticking with it. I joined a startup at the end of 2007, and my blogging abruptly lost about four-fifths of its pace:

[Figure: Tom's spotty writing record for blog.moertel.com]

So I need to discipline myself to blog more frequently. I hope the next 100 thousand words won’t take so long to write.

Finally, I’d like to take this opportunity to thank you for reading and commenting. You’re the reason I wrote those words in the first place. You made the first 100 thousand words fun.

Thank you!

Your pal,
Tom Moertel


Most popular articles on my blog for 2010: the old stuff rules

Posted by Tom Moertel Tue, 28 Dec 2010 19:02:00 GMT

What did people read on my blog in 2010? Mostly, it was older content. Here are the ten most-popular pages, ordered by unique page views relative to those of the home page (1.0):

1. A Coder’s Guide to Coffee (2002, popularity = 5.30). This oldie continues to be popular mainly because coders still drink coffee – and because the Guide gets rediscovered every few months and posted to Reddit or Hacker News. This year it got an additional boost from being the cover story of Hacker Monthly #4.

2. Never store passwords in a database! (2006, popularity = 3.18). Despite being 4 years old, this article gets a steady flow of readers because lots of programmers are still storing passwords in databases. And getting owned.

3. Ruby 1.9 gets handy new method Object#tap (2007, popularity = 1.37). I’m not sure why this article keeps getting the hits, but it does. People just love Object#tap, I guess.

4. Wondrous oddities: R’s function-call semantics (2006, popularity = 1.22). This article’s popularity is easy to explain: R continues to steamroll just about everything else in statistical computing and has a continuous influx of new, curious users who want to know more about R’s inner workings.

5. Verizon FiOS fiber-optic Internet service: a first look (2005, popularity = 1.05). I think this article is popular because I was an early adopter of FiOS and had one of the first hands-on reviews. It gets lots of search hits.

6. A couple of tips for writing Puppet manifests (2007, popularity = 1.02). I’m not sure either of these tips is still relevant. Still, this article brings in readers.

7. How I stopped missing Darcs and started loving Git (2007, popularity = 1.01). Programmers love to talk about DVCSs, Git and Darcs especially. Plus, if you search on “darcs git”, this article is one of the first results.

8. A type-based solution to the ‘strings problem’: a fitting end to XSS and SQL-injection holes? (2006, popularity = 1.00). This article remains popular because it gets readers from two sources: from religious wars over typing systems and from discussions of what to do about XSS vulnerabilities.

9. Don’t let password recovery keep you from protecting your users (2007, popularity = 0.93). This article is a follow-up to Never store passwords…! and tends to pick up a share of its sibling’s traffic.

10. On the evidence of a single coin toss (2010, popularity = 0.78). This short article raises a simple question: If I hand you a coin and claim that it always comes up heads, and you toss the coin and it does come up heads, how much more should you believe my claim compared to before the coin toss? This kind of question is irresistible to anyone even remotely Bayesian, so it ended up on Hacker News and got a lot of traffic in a few days. (The follow-up article is also popular, but didn’t make the top-ten list.)
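
One common way to frame the coin-toss question is with Bayes' rule. The sketch below is not necessarily the analysis in the linked article. It assumes the only alternative to an "always heads" coin is a fair coin, and the 1-in-100 prior is a made-up number for illustration.

```ruby
# Posterior probability that the coin is "always heads" after seeing one head,
# assuming the only alternative hypothesis is a fair coin.
def posterior_always_heads(prior)
  likelihood_claim = Rational(1)     # P(heads | always-heads coin)
  likelihood_fair  = Rational(1, 2)  # P(heads | fair coin)
  evidence = prior * likelihood_claim + (1 - prior) * likelihood_fair
  prior * likelihood_claim / evidence
end

prior = Rational(1, 100)             # hypothetical prior belief in the claim
posterior = posterior_always_heads(prior)

prior_odds = prior / (1 - prior)
posterior_odds = posterior / (1 - posterior)

puts posterior                       # updated belief in the claim
puts posterior_odds / prior_odds     # factor by which the odds changed
```

Under these assumptions, a single head doubles the odds in favor of the claim whatever the prior, because heads is twice as likely under the claim as under a fair coin.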

So, once again, it looks like the old content dominates. Only one article from 2010 made the top ten, and just barely at that.


R tips and tricks: Producing smooth bitmap plots

Posted by Tom Moertel Sun, 26 Aug 2007 01:56:00 GMT

The R statistics system can produce first-class data visualizations, commonly known as plots. Internally, plots are represented in an abstract graphics format that can be rendered on any of R’s wide range of graphics “devices” to produce concrete output – windows, bitmap files, PostScript files, PDF files, and others.

The bitmap formats, such as PNG, are preferred for posting plots online because of their widespread support by web browsers. The default bitmap-rendering devices in R, unfortunately, produce graphics that look a little too “bitmapped” for modern web tastes. Here, for example, is a plot rendered by R’s “png” device:

[Figure: Plot rendered via R's PNG device]

There’s nothing technically wrong with the plot, but it looks out of place on a web page. That’s because modern web browsers use font-smoothing and anti-aliasing techniques to render just about everything else on the page. Against this clean, un-jagged backdrop, the oh-so-bitmapped plot looks like a throwback to a previous era.

Happily, we can produce clean, anti-aliased R plots with a little help. Here’s the earlier plot, anti-aliased:

[Figure: Plot rendered via R's PDF device, then post-processed]

To produce the anti-aliased plot, I used R to produce a PDF file. Then I rendered the PDF file into a PNG image at 300 dpi using Ghostscript. Finally, I scaled the 300-dpi image down to screen resolution, producing a high-quality, anti-aliased result.

Here’s the recipe in detail.

First, I define an R function called pdfit that takes an abstract graphics object and makes a PDF-file rendering of it, using my preferred graphics-device settings:

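library(lattice)

# Render a lattice plot object into a PDF file using my preferred settings.
pdfit <- function(f, ...) {
  trellis.device(dev=pdf, theme="col.whitebg", ...)  # open the PDF device
  print(f)                                           # draw the plot on it
  dev.off()                                          # close the PDF file
}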

Then, when I create a plot I want to publish, I use pdfit to render it into a PDF file:

P.img <- xyplot( subs.low + subs.high ~ date, ... )

pdfit(P.img, file="image-downloads.pdf")  # render plot into PDF file

Finally, I use Ghostscript and ImageMagick to convert the PDF file into a high-quality, anti-aliased PNG file. (I keep both formats: the PDF file is best for publishing in printed papers, and the PNG file is best for posting online.) I use a simple Makefile to automate the process of converting the PDF files into PNG files:

# Makefile (GNU make)

pdfs := $(wildcard *.pdf)
pngs := $(pdfs:.pdf=.png)

all: $(pngs)
.PHONY: all

%.png: %.pdf
	gs -dSAFER -dBATCH -dNOPAUSE -sDEVICE=png16m \
	   -dGraphicsAlphaBits=4 -dTextAlphaBits=4 -r300 \
	   -dBackgroundColor='16#ffffff' \
	   -sOutputFile=$@ > /dev/null \
	   $< && \
	mogrify -resize 500 $@

With this Makefile in my graphics directory, just a single “make” command is all it takes to convert my PDF images into anti-aliased PNG files, ready to post online.
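
If you would rather drive the conversion from a script than from make, the same two steps can be sketched in Ruby. This is only an illustration: `png_commands` is a hypothetical helper that builds the command lines without running them, and it uses the recipe's flags with Ghostscript's `-dSAFER` option.

```ruby
# Build the Ghostscript and ImageMagick command lines for one PDF file.
# The flags follow the Makefile recipe; nothing is executed here.
def png_commands(pdf_path, width: 500, dpi: 300)
  png_path = pdf_path.sub(/\.pdf\z/, ".png")
  gs = ["gs", "-dSAFER", "-dBATCH", "-dNOPAUSE", "-sDEVICE=png16m",
        "-dGraphicsAlphaBits=4", "-dTextAlphaBits=4", "-r#{dpi}",
        "-dBackgroundColor=16#ffffff",
        "-sOutputFile=#{png_path}", pdf_path]
  mogrify = ["mogrify", "-resize", width.to_s, png_path]
  [gs, mogrify]
end

gs_cmd, mogrify_cmd = png_commands("image-downloads.pdf")
puts gs_cmd.join(" ")
puts mogrify_cmd.join(" ")

# To actually run the conversion (requires gs and mogrify on the PATH):
#   system(*gs_cmd, out: File::NULL) && system(*mogrify_cmd)
```

Passing the arguments as an array avoids shell quoting issues, which is why the background color needs no quotes here.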

And that’s it.

Do you have any tips or tricks for making good-looking graphics with R? If so, please do share.

Update: There is one downside to the sexy, anti-aliased plots: they are not as compressible as the old-style jagged plots. For the images above, for instance, the anti-aliased PNG file weighs in at 45 KB, but the original PNG file is a feathery 4.7 KB. So, if bandwidth is precious to you – or you’re planning on getting Slashdotted – you might want to stick with the jaggies.


Fun with statistics: estimating blog readership (a do-it-yourself recipe)

Posted by Tom Moertel Thu, 23 Aug 2007 01:34:00 GMT

As everybody knows, statistics is fun. Is there anything cooler than crushing a heap of seemingly uninteresting numbers into gleaming jewels of meaning? Of course not! Models, data-visualization plots, and fat data sets are way cool. So, let’s find an excuse to play with them.

Here’s an excuse – I mean, an important and highly relevant question that many of us share: How many people actually read our blogs? To answer the question, we will need to use statistics, data, and cool plots. Further, if you’ve got the raw data for your blog, you can follow along with your own analysis. Even more fun!

We’ll start with a simple inspection of common web-log data, using command-line tools. After developing a rough understanding of what useful information we can extract, we’ll analyze the raw data using a series of successively more sophisticated techniques. In the end, we will derive a simple formula for estimating readership from easily obtainable data.

Sound good? Then let’s get rocking.

But first, a preemptive strike on would-be pooh-poohers: I know all about FeedBurner. I know they will track my blog’s subscribers and use their mystical powers to infer the number of “real” subscribers I have. I know it’s all so easy. But easy isn’t the point. I want to understand what’s going on. Just taking somebody’s word for it isn’t nearly as satisfying as figuring it out yourself – nor as fun.

OK. For real this time, let’s get rocking.


Greasemonkey script annotates IMDb movies with their decoder-ring percentile ranks

Posted by Tom Moertel Wed, 11 Jul 2007 17:49:00 GMT

Sam at rephase.net has harnessed the earth-shattering power of the IMDb movie-rating decoder ring to create a Greasemonkey script that annotates IMDb-listed movies with their percentile ranks. Now you don’t need to look up a movie’s “star rating” in the decoder ring to see where the movie ranks; the ranking appears right on the movie’s IMDb page.

Do check out the script itself to see how Sam cleverly embeds a copy of the decoder ring and plucks scores from it as needed.

For more on the IMDb movie-rating decoder ring, see:


Talk: Fun with Numbers: R and Perl (and IMDB data)

Posted by Tom Moertel Thu, 21 Jun 2007 18:38:00 GMT

Last week I gave a talk on the R statistics system and Perl for the Pittsburgh Perl Mongers. The example that threaded through the talk was something I have written about here before, extracting useful information from the Internet Movie Database. If you’ve read my earlier blog post or have used the Grand Unified IMDB Movie Rating Decoder Ring, you might find the slides from the talk interesting. They provide some more details about the R and Perl code used to analyze the IMDB data and create the decoder ring.

You can get the slides here:

[Figure: Title slide from my talk on R and Perl]


New Fedora Core RPMS for CRAN packages arm, Matrix, lme4, car, coda, leaps, and mlmRev

Posted by Tom Moertel Wed, 25 Apr 2007 18:07:00 GMT

Just a quick note for folks using the R statistics system on Fedora Linux. I have packaged for Fedora a bunch of R packages from the CRAN. (R packages have to be packaged again, as RPM packages, to integrate with Fedora Linux.)

My initial goal was to package arm, which contains tools for working with various regression models. (This package accompanies Andrew Gelman and Jennifer Hill’s wonderful book Data Analysis Using Regression and Multilevel/Hierarchical Models.) Packaging “arm,” however, quickly snowballed into packaging a bunch of prerequisites. Thankfully, I have now completed that task and can share the fruits of my labor with you.

All in all, to install “arm,” you will need the following RPMs:

  • R-arm-1.0-2
  • R-car-1.2-1
  • R-lme4-0.9975-1
  • R-Matrix-0.9975-1
  • R-R2WinBUGS-2.0-1

The following RPMs are optional (but you will need them if you want to rebuild the RPMs):

  • R-coda-0.10-1
  • R-leaps-2.7-1
  • R-mlmRev-0.995-1

You can download the packages from the RPMs section of the Community Projects site. Better yet, you can use Yum to download them for you. Just add the moertel-community Yum repository to your /etc/yum.repos.d directory (see RPMs for the recipe) and then use the following command:

$ sudo yum install R-arm

Yum will automatically resolve dependencies and install the required packages. If you want any of the optional packages, add them after “R-arm” on the command line.

I have built the packages for Fedora Core 6 on the x86_64 architecture, but the RPM specs are available if you want to rebuild the packages for other architectures. (See the instructions for rebuilding RPMs for help.)

Caveat: I’m not sure that the R-R2WinBUGS package is fully functional. It depends on BRugs, which doesn’t yet build on the Linux platform. To get around this problem, I made R-R2WinBUGS’s dependency on BRugs weak; the first package no longer requires the second to install.


Engauge Digitizer: a handy tool for extracting data from charts

Posted by Tom Moertel Tue, 17 Apr 2007 07:45:00 GMT

Today I wanted to extract the data that were visualized in a chart I saw on Seth Roberts’s blog. That is, I had a picture of a data set, and I wanted the numbers behind the picture.

This task turned out to be surprisingly easy – once I found Engauge Digitizer, an open-source (GPL) tool made for this very task. After I launched Engauge, the digitization process was straightforward:

  1. I established the chart’s coordinate system by clicking in the corners and entering the associated coordinates.
  2. Then I had Engauge identify data points. With the mouse, I selected a data point by hand, teaching Engauge what a point looks like. Then Engauge identified spots on the chart that looked like data points and locked on to them. I was able to step through the points to tell Engauge to skip the few it misidentified.
  3. I manually selected a few more data points that were scrunched into blobs and had eluded Engauge’s point-detection heuristics.
  4. Finally, I exported the data set in CSV format.
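
The first step above, establishing the coordinate system, amounts to a mapping from pixel positions to data values. Here is a rough sketch of that idea, assuming linear, axis-aligned scales. It is not Engauge's actual code, and the `make_axis_mapper` helper and all calibration numbers are hypothetical.

```ruby
# Map a pixel position to chart coordinates by linear interpolation between
# two calibration points per axis (assumes linear, axis-aligned scales).
def make_axis_mapper(pixel_a, value_a, pixel_b, value_b)
  scale = (value_b - value_a).to_f / (pixel_b - pixel_a)
  ->(pixel) { value_a + (pixel - pixel_a) * scale }
end

# Hypothetical calibration: pixel x=50 is data x=0, pixel x=450 is data x=100;
# pixel y=400 is data y=0, pixel y=40 is data y=90 (screen y grows downward).
to_x = make_axis_mapper(50, 0, 450, 100)
to_y = make_axis_mapper(400, 0, 40, 90)

# Hypothetical pixel positions of three digitized points.
clicked_pixels = [[50, 400], [250, 220], [450, 40]]
data_points = clicked_pixels.map { |px, py| [to_x.call(px), to_y.call(py)] }

# Print the recovered data as CSV rows.
data_points.each { |x, y| puts "#{x},#{y}" }
```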

If you ever need to extract the data behind a chart, do check out Engauge Digitizer. (If you use Fedora Linux, you’ll be happy to know that I have packaged Engauge for you. Get it at the RPMs section of the community site.)


The IMDB Movie Rating Decoder Ring: updated w/ 2 March 2007 data

Posted by Tom Moertel Fri, 09 Mar 2007 22:40:00 GMT

If you want to get more out of IMDB movie ratings, check out my IMDB Movie Rating Decoder Ring, now updated with fresher data (as of 2 March 2007).


Mining gold from the Internet Movie Database, part 1: decoding user ratings

Posted by Tom Moertel Wed, 18 Jan 2006 01:59:00 GMT

The Internet Movie Database (IMDb) is a rich source of online movie information. The problem is, the true gold is buried deep beneath the site’s user-friendly exterior and hidden within the database itself. With a little digging, however, we can extract the gold, nugget by nugget, and learn about fun statistical tools for data analysis.

Today, in the first part of our analysis, we will put our intuition about rating systems to the test. We will decode IMDb “user ratings,” those numbers such as 6.1 and 7.8 that summarize how the registered users of the IMDb rated movies on a scale from 1 to 10, typically depicted as a series of stars on the screen:

[Figure: sample user rating]

We will extract the collective wisdom of registered IMDb users in order to convert a movie’s user rating into the movie’s standing within the database. This gives us a good indicator of how the movie stacks up against other movies in general, and that’s good information to have when deciding which movies to see in the theater or add to your Netflix list.
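
To make "standing within the database" concrete, one simple measure is a percentile rank: the share of movies rated at or below a given rating. The sketch below illustrates that idea only. The full analysis is not shown in this excerpt, and the ratings listed are hypothetical, not real IMDb data.

```ruby
# Percentile rank: the percentage of ratings in the collection that are
# less than or equal to the given rating.
def percentile_rank(all_ratings, rating)
  at_or_below = all_ratings.count { |r| r <= rating }
  100.0 * at_or_below / all_ratings.length
end

# Hypothetical ratings for illustration only; not real IMDb data.
ratings = [3.2, 4.5, 5.1, 5.8, 6.1, 6.4, 6.9, 7.2, 7.8, 8.5]

puts percentile_rank(ratings, 6.1)
puts percentile_rank(ratings, 7.8)
```

With this made-up list, a 6.1 lands in the middle of the pack while a 7.8 beats most of it, which is the kind of context a bare star rating does not give you.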

Ready to start digging? Let’s go!
