Posted by Tom Moertel
Tue, 17 Apr 2007 07:45:00 GMT
Today I wanted to extract the data that were visualized in a chart I saw on Seth Roberts’s blog. That is, I had a picture of a data set, and I wanted the numbers behind the picture.
This task turned out to be surprisingly easy – once I found Engauge Digitizer, an open-source (GPL) tool made for this very task. After I launched Engauge, the digitization process was straightforward:
- I established the chart’s coordinate system by clicking in the corners and entering the associated coordinates.
- Then I had Engauge identify data points. With the mouse, I selected a data point by hand, teaching Engauge what a point looks like. Then Engauge identified spots on the chart that looked like data points and locked on to them. I was able to step through the points and tell Engauge to skip the few it had misidentified.
- I manually selected a few more data points that were scrunched into blobs and had eluded Engauge’s point-detection heuristics.
- Finally, I exported the data set in CSV format.
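The calibration step in the list above amounts to fitting a map from pixel positions to data values. As a rough illustration (this is not Engauge's actual implementation), the sketch below assumes linear, axis-aligned axes and two reference points per axis. The function name `make_axis_map` and all of the numbers are made up for the example.

```python
def make_axis_map(pixel_a, value_a, pixel_b, value_b):
    """Return a function mapping a pixel coordinate to a data value.

    Assumes a linear axis calibrated by two known reference points.
    """
    if pixel_a == pixel_b:
        raise ValueError("reference points must differ in pixel position")
    scale = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * scale


# Hypothetical calibration: pixel column 50 is x = 0, column 450 is x = 100;
# pixel row 400 is y = 0, row 40 is y = 90 (pixel rows grow downward).
to_x = make_axis_map(50, 0.0, 450, 100.0)
to_y = make_axis_map(400, 0.0, 40, 90.0)

# Convert a clicked point at pixel (250, 220) into data coordinates.
point = (to_x(250), to_y(220))
print(point)  # (50.0, 45.0)
```

A logarithmic axis or a rotated scan would need a different map; the two-point linear version is only the simplest case.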
If you ever need to extract the data behind a chart, do check out Engauge Digitizer. (If you use Fedora Linux, you’ll be happy to know that I have packaged Engauge for you. Get it at the RPMs section of the community site.)
Posted in statistics
Tags charts, data, fedora, plots, rpms, statistics, tools

Posted by Tom Moertel
Wed, 01 Mar 2006 20:17:00 GMT
While we read, our minds subconsciously correct mistakes and overlook omissions in the stream of words we see, especially when reading familiar texts. This mental feature, which allows us to skim long documents, has a nasty drawback when we are writing: it makes our own mistakes harder to spot.
One of the most common writing mistakes that our brains stealthily correct is the the duplicate word problem. For example, I inserted a double the into the previous sentence. Did you catch it?
If so, don’t be too proud of your accomplishment. It is easier to see errors in others’ writing than in your own. Your brain is attuned to your natural writing patterns and much more likely to repair your mistakes without your knowing.
To overcome this problem, some writers recommend reading your work backward, but I think computers are a more practical solution.
Here’s the Perl script that I use to spot duplicate words:
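#!/usr/bin/perl -n00
# dupwords.pl - find duplicate words in the input stream
# (-n loops over the input; -00 reads it one paragraph at a time)
print "$ARGV: para $.: ($1)\n"
    while /(\b(\w+)\b\s+\b\2\b)/sg;
close ARGV if eof;  # restart paragraph numbering for each input file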
I use this script from Emacs via shell-command-on-region. I also use it from the command line to find duplicate-word errors in batch:
find . -name '*.txt' -print0 | xargs -0 dupwords.pl
The duplicate-words problem is a favorite for programming cookbooks, so if you don’t like my recipe (or Perl), you have many other options.
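As one such alternative, here is a minimal Python sketch of the same idea. It is an approximation of the Perl recipe rather than an exact port: it treats any whitespace-only line as a paragraph break, and the name `find_duplicate_words` and the sample text are invented for the example.

```python
import re

# A word, whitespace (possibly spanning a line break), then the same word.
DUP_WORD = re.compile(r"\b(\w+)\s+\1\b")


def find_duplicate_words(text):
    """Return (paragraph_number, matched_text) pairs for repeated words.

    Paragraphs are separated by one or more blank lines and numbered from 1.
    """
    hits = []
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    for number, paragraph in enumerate(paragraphs, start=1):
        for match in DUP_WORD.finditer(paragraph):
            hits.append((number, match.group(0)))
    return hits


sample = "This line is fine.\n\nHere is the the problem, split\nsplit across lines."
for number, words in find_duplicate_words(sample):
    print(f"para {number}: ({words!r})")
```

Like the Perl version, the match is case-sensitive, so "The the" at the start of a sentence slips through. Passing `re.IGNORECASE` to `re.compile` would catch it.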
Posted in perl, writing
Tags perl, tools, writing
