Posted by Tom Moertel
Thu, 16 Jun 2011 04:13:00 GMT
Recently, I wrote a simple plagiarism detector in Haskell using
n-grams. The problem it solves is this: You have a document, the
suspect, which you believe may be derived from other documents, the
sources. You want to find regions in the suspect that are identical
to those in the sources.
One simple solution is to use n-grams, which are just the n-word
neighborhoods taken from some sequence of words. For example, the
4-grams of the sequence a b c d e f are a b c d, b c d e, and
c d e f.
The idea for plagiarism detection with n-grams is that, as you
increase n, the probability that a given n-gram will be
shared between two arbitrary documents rapidly vanishes. For example,
Google says, right now, that the 2-gram “rapidly vanishes” appears in
about 11,000 documents, but the 5-gram “two different documents
rapidly vanishes” occurs in no documents. (Once Google indexes this
post, however, that will change.)
But when one document is copied partly from another, they will
probably share many long sequences of words. So, the sharing of longer
n-grams is evidence of copying.
It’s not proof, however. If two papers about U.S. history both contain
the 5-gram “The United States of America,” that coincidence is not surprising. But
when lots of unlikely sharing occurs, it’s strong evidence of some
form of common ancestry.
Fortunately, when copying occurs, it’s often easy to see, once
you know where to look. So my detector just annotates the suspect
document to indicate which portions of it share n-grams with the
source documents. Shared regions are converted to ALL CAPS. For
example, looking for shared 4-grams, my detector finds ample evidence
that the following paragraph was copied from the text above:
$ ./Plag 4 suspect.txt blog-post-plagiarism-detector.txt
But it's not rock solid. If two documents ABOUT U.S. HISTORY
BOTH CONTAIN THE 5-GRAM "THE UNITED STATES OF AMERICA,"
that's not shocking. BUT WHEN LOTS OF sharing occurs, that's
pretty good EVIDENCE OF SOME FORM OF COMMON ANCESTRY.
In fact, it was copied: I duplicated an earlier paragraph and then
changed a few words.
Using n-grams to detect sharing is not a perfect system,
but it’s a simple system, and, because its output is
easily interpreted, it’s an effective system. When your eyeballs see
a lot of ALL CAPS TEXT, it’s easy to judge what
it represents.
Anyway, here’s the code.
module Main where
import Control.Applicative ((<$>))
import Data.Char (toLower, toUpper, isAlphaNum, isSpace)
import Data.List (tails)
import qualified Data.Set as S
import System (getArgs)
main = do
n:suspect:sources <- getArgs
putStrLn =<< findMatches (read n) suspect sources
findMatches n suspect sources = do
db <- S.fromList . concat <$> mapM (loadNgrams n) sources
stxt <- readFile suspect
return $ concat (match n db (nGrams n stxt) (whiteWords stxt))
match :: Int -> S.Set [String] -> [[String]] -> [String] -> [String]
match n db ngs wws = mapAt (map toUpper) matchlocs wws
where
matchlocs = uniq . concat $ zipWith isMatch [0..] ngs
isMatch loc ng | ng `S.member` db = [loc .. loc + n 1]
| otherwise = []
mapAt :: (a -> a) -> [Int] -> [a] -> [a]
mapAt f locs ws = go 0 locs ws
where
go _ [] ws = ws
go _ _ [] = []
go i (l:ls) (w:ws) | i < l = w : go (i + 1) (l:ls) ws
go i (l:ls) (w:ws) | otherwise = f w : go (i + 1) ls ws
uniq = S.toList . S.fromList
loadNgrams n file = nGrams n <$> readFile file
nGrams :: Int -> String -> [[String]]
nGrams n = takeWhile ((n ==) . length) . map (take n) .
tails . map normalize . words
normalize = map toLower . filter isAlphaNum
whiteWords :: String -> [String]
whiteWords s = case span isSpace s of
("", "") -> []
(s1, "") -> [s1]
(s1, s') -> (s1 ++ w) : whiteWords s''
where
(w, s'') = break isSpace s'
Posted in programming
Tags haskell, ngrams, plagiarism
9 comments
no trackbacks

Posted by Tom Moertel
Fri, 13 Aug 2010 04:38:00 GMT
For those of you who use Emacs, Git, and Fedora Linux, I have just pushed emacs-magit-0.8.2 to updates-testing. (This package brings the wonderful Magit mode for Emacs to Fedora). For now, you can install it with yum by enabling the updates-testing repo:
sudo yum install emacs-magit --enablerepo=updates-testing
Posted in programming
Tags emacs, fedora, git, magit
no comments
no trackbacks

Posted by Tom Moertel
Thu, 23 Jul 2009 19:50:00 GMT
Recently there has been a lot of talk about “magic” in software. Most of the people doing the talking, however, seem to be talking past one another, primarily because they do not share a common understanding of what “magic” means. The few definitions of the term I have seen seem to miss something essential, so I will provide the definition that I think makes the most sense:
In a software system, magic is any behavior whose semantics are poorly defined and unduly expensive to figure out.
Some notes:
- The “unduly expensive” part means that people with different background knowledge are going to find different things magical. If you use certain techniques every day, you will probably find behaviors that make use of those techniques easier to figure out and thus less magical than other programmers might.
- One way, then, to make things less magical is to exhort programmers to get more knowledge. A more practical approach, however, appears in the next point.
- Most “magic” can be de-magicked simply by defining its semantics.
So when I complain about magic in a software system, I am complaining about code whose behavior is not readily discernible. When I did a lot of Rails programming, for example, I groaned when I encountered a function in the Rails API that took a string but did not make clear what that string was supposed to represent. Was it text? Was it HTML? There was only one way to find out: trace through layers of Rails code to figure out what the framework actually did with the string. That is unduly expensive, so I considered such functions to be magic – annoying, waste-my-time magic.
I do a lot more Python programming these days, and while I encounter less magic, I am mystified by the part of Python culture that turns its back on code that is perceived as being overly clever or, to use the popular phrase, not explicit. What is wrong with cleverness, as long as the result is clearly understood? Cleverness, by itself, is not magic. For real magic, you need fuzzy semantics, too.
That is why, when I hear that code is “magical” or “too clever,” I think “poorly defined.” If you define and document the semantics of such code, you dispel its magic. And if you cannot define the semantics, then the code is magical and too clever, and probably ought to be rewritten.
Thus the test for whether something is magical is the clarity of its semantics. Code having clear semantics is not magical. Code having unfathomable semantics is.
Posted in programming
Tags clarity, cleverness, definitions, magic, programming, semantics
10 comments
no trackbacks

Posted by Tom Moertel
Mon, 02 Feb 2009 17:22:00 GMT
Do you use Emacs? Do you use Git? If so, check out Magit, a delightful Emacs
mode that provides a convenient interface to working with Git. Unlike
many VCS modes, Magit is fully Git-centric: It understands the Git
way of branching, staging, committing, history-rewriting, tagging,
merging, pushing and pulling. It even knows about the reflog and has
some git-svn support.
If you’re a Fedora user,
just install the emacs-magit package. The package is currently in
testing, so install it with the following command:
$ sudo yum --enablereo=updates-testing install emacs-magit
One more thing, Fedora users: please don’t forget to provide
feedback on the package. It’s easy:
- Just visit the emacs-magit page in Bodhi
- Click on the package you installed (e.g., the Fedora 9 or 10 flavor)
- Add a comment, selecting “Works for me” or “Does not work” as appropriate
Hack on!
Posted in programming
Tags emacs, fedora, git, magit
1 comment
no trackbacks

Posted by Tom Moertel
Wed, 07 Jan 2009 20:08:00 GMT
Here are three Emacs tips that I seem to have a hard time
remembering. I’m posting them here to help them stick to
my brain.
Align
Emacs has a number of powerful align commands, but most times the following incantation will suffice to magically align text into neat columns:
C-u M-x align
Repeat
I don’t know how I manage to forget this handy helper, but I do. The following will repeat the most-recently executed command:
C-x z
Continue hitting “z” to continue repeating.
Recursive edit
While you are in the middle of doing something in Emacs, such as
a query-replace-regexp, you can put that something
on hold, do some editing, and then return to what you were doing.
This maneuver is accomplished with Recursive Editing. In short:
C-r (enter recursive edit)
C-M-c (exit recursive edit)
C-] (abort: exit all nested recursive-edit sessions)
Posted in programming
Tags emacs
no comments
no trackbacks

Posted by Tom Moertel
Tue, 18 Dec 2007 03:33:00 GMT
XML is fine for representing document-like things, but when it’s
twisted to represent build recipes, configuration files, and little
programming languages, it opens the gates to XML Hell. Once the
gates are opened, the demons of cargo-cult thinking are loosed upon
the world, where they are free to trick innocent programmers into
working with grotesquely twisted XML documents – something no human
mind was designed to comprehend. Ensnared, these programmers are
slowly drawn into the depths of XML Hell, from which their
lamentations echo across the
universe.
When the demons of cargo-cult thinking come for you, don’t be
ensnared! Instead, be prepared – with PXSL – the Parsimonious XML
Shorthand Language
(pronounced “pixel”).
What’s PXSL? It’s a luxurious, thermonuclear smoking jacket that you
can slip on using a convenient preprocessor. Use it whenever you see
grotesque XML on the horizon. Within PXSL’s plush (and stylish)
protection, you can create all the nasty, twisted XML that may be
demanded of you, but you need not descend into XML Hell to do it.
Instead, you can work from the comfort of a well-stocked lounge, where
clarity and conciseness are always on tap.
For example, here’s a snippet from an XSLT stylesheet, in the
original XML:
<xsl:template match="/">
<xsl:for-each select="//*/@src|//*/@href">
<xsl:value-of select="."/>
<xsl:text> </xsl:text>
</xsl:for-each>
</xsl:template>
And here’s the same snippet, written in PXSL:
template /
for-each //*/@src|//*/@href
value-of .
text << >>
Isn’t that refreshing?
Why PXSL?
There are lots of XML shorthands available. (The PXSL FAQ lists about ten of them.) So why choose
PXSL? Here’s why:
Also, PXSL is battle tested. It was first released in 2003 and has
been saving people from XML Hell since. People who try it seem to like it:
- I think PXSL could do wonders for soothing my irrational hatred for all things XML. —kowey
- Impressive… I converted some of my files from XML to PXSL and the readability was much improved. —chris
- Quite aside from the fact that XSLT is finally somewhat readable, the fact that you’ve added a serious macro system means that some serious scripting of XML can occur. I’m very impressed. —invisible
The next time you’re headed for XML Hell, why not give the venerable PXSL a try? You might just find that you like it, too.
This public service announcement was brought to you in celebration of
the 1.0 release of the pxsl-tools package. The PXSL-to-XML compiler
pxslcc is written in Haskell and uses the
cross-platform Haskell Cabal
build/package system to let you use PXSL just about anywhere.
Posted in programming
Tags haskell, pxsl, xml, xslt
13 comments
no trackbacks

Posted by Tom Moertel
Mon, 10 Dec 2007 21:52:00 GMT
About three years ago, I switched to
Darcs
as my primary source-code management system. It was simple,
intuitive, and powerful, and it made managing my projects more fun and
less frustrating than any centralized VCS ever had. That it was
written in Haskell, one of my favorite programming languages, made
it even better. I was hooked.
Since then, the distributed SCM landscape has changed. Darcs hasn’t
improved much, but its competitors have made long strides, especially
Git and
Mercurial. Both
are crazy fast, vigorously developed, and widely used on large, highly
active real-world projects, such as the Linux kernel and Mozilla 2.
In comparison, Darcs has
stagnated.
When I started working for a new company recently, I had to consider
whether to advocate Darcs or something else. In the end, I decided
that Darcs would be a hard sell. Nobody else at the company uses
Haskell, and having to explain how to avoid the occasional corner
case
seemed liked a losing proposition.
After researching and playing around with Git and Mercurial, I settled
on Git. I like Git’s underlying hashed-blobs model better than
Mercurial’s revlogs, and Git seems to have slightly more development
momentum. Still, it was a close call. Either choice would have been
completely reasonable.
Missing Darcs
When I started using Git on real projects, the one thing I really
missed was the ability to easily amend earlier patches, something
Darcs made trivial. Let me
explain. The typical development workflow goes something like this:
- Checkout copy of upstream code base.
- Implement feature X.
- Commit.
- Implement independent feature Y.
- Commit.
- Implement independent feature Z.
- Commit.
- Push new features back upstream.
Now, what really happens is that when I’m implementing Y or Z,
I’ll realize that I made a mistake in X. The trick is then
fixing X so that my fix is part of the changeset/patch for X that
ultimately gets pushed upstream in the last step. That way, the
upstream folks will see only a single, clean patch for feature X – not
a mishmash of patches that together represent X.
In Darcs, amending the original patch is easy because its patch theory
lets me tweak the patch for X independently of the other patches.
Darcs will simply ask me which patch I want to amend, and I’ll select
the orignal patch for X:
$ emacs # fix X
$ darcs amend-record # amend original patch for X
Mon Dec 10 14:43:13 EST 2007 Tom Moertel <tom@moertel.com>
* Implemented Z
Shall I amend this patch? [yNvpq], or ? for help: n
Mon Dec 10 14:42:12 EST 2007 Tom Moertel <tom@moertel.com>
* Implemented Y
Shall I amend this patch? [yNvpq], or ? for help: n
Mon Dec 10 14:41:46 EST 2007 Tom Moertel <tom@moertel.com>
* Implemented X
Shall I amend this patch? [yNvpq], or ? for help: y
hunk ./x 1
-X1
+X2
Shall I add this change? (1/?) [ynWsfqadjkc], or ? for help: y
Finished amending patch:
Mon Dec 10 14:43:25 EST 2007 Tom Moertel <tom@moertel.com>
* Implemented X
That’s it. The exact same process will work regardless of when I
realize I need to fix X: before I start Y, while I’m implementing Y,
after I’ve committed Y, while I’m working on Z, or after I’ve committed
Z.
Learning to love Git
With Git, however, I can amend a commit only if I haven’t committed anything else before making my fix. In Git’s mind, Y depends on X, and Z
depends on Y, even if they really are independent of one another.
So if I commit the original patch for X and then immediately realize I
need to make a fix, before I start working on Y or Z, it’s easy:
$ emacs # implement X
$ git commit -m 'Implemented X'
# discover problem in X
$ emacs # fix X
$ git commit --amend # amend original patch
More typically, it’s only while I’m working on Y that I’ll
realize I need to fix X. Then it’s more complicated
to amend the original commit:
$ emacs # implement X
$ git commit -m 'Implemented X'
$ emacs # start working on Y
# discover problem in X
$ git stash # stash away half-completed work on Y
$ emacs # fix X
$ git commit --amend # amend original patch for X
$ git stash apply # restore work on Y
$ emacs # continue working on Y
While not as convenient as Darcs’s workflow, it’s perfectly workable.
Now let’s consider another fairly typical case: I commit X and Y and
then start working on Z before I notice the problem in X. I used to
think that Git couldn’t handle this case, but it can, thanks to
git rebase --interactive:
$ emacs # implement X
$ git commit -m 'Implemented X'
$ emacs # implement Y
$ git commit -m 'Implemented Y'
$ emacs # start working on Z
# discover problem in X
$ git stash # stash away half-completed work on Z
$ emacs # fix X
$ git commit -m 'Fixed X'
$ git rebase --interactive HEAD~3 # see comments below
$ git stash apply # restore work on Z
$ emacs # continue working on Z
The
git rebase --interactive command is
powerful. What the
command does, as called in the snippet above, is invoke my editor of
choice on a text file describing the last 3 commits (that’s the
HEAD~3 part):
# Rebasing 3ad99a7..b9a8405 onto 3ad99a7
#
# Commands:
# pick = use commit
# edit = use commit, but stop for amending
# squash = use commit, but meld into previous commit
#
# If you remove a line here THAT COMMIT WILL BE LOST.
#
pick 0885540 Implemented X
pick 320b115 Implemented Y
pick b9a8405 Fixed X
I can then edit the file to reorder, merge (squash), and/or remove
the commits. In this example, I want to merge the fix for X into
the original commit that implemented X. So I edit the file like so:
pick 0885540 Implemented X
squash b9a8405 Fixed X
pick 320b115 Implemented Y
Then I save the file, at which point Git takes over and makes the
requested changes, merging the fix for X into the
original commit for X. Now the log shows the original implementation
and fix as one commit:
$ git log
commit f387d650976246c0854d028b040cca40e542be56
Author: Tom Moertel <tom@moertel.com>
Date: Mon Dec 10 15:11:26 2007 -0500
Implemented Y
commit 82a1c849ffd1bd688d5bc9d99be0e63548a89c4c
Author: Tom Moertel <tom@moertel.com>
Date: Mon Dec 10 15:13:03 2007 -0500
Implemented X
Fixed X
commit 3ad99a7ef537b7ae99e435e0d2b4b0d03de92c65
Author: Tom Moertel <tom@moertel.com>
Date: Mon Dec 10 15:11:14 2007 -0500
Initial checkin
Once I figured out how to use git rebase --interactive, I stopped
missing Darcs and started loving Git.
Posted in programming
Tags darcs, dvcs, git, haskell, scm
42 comments
no trackbacks

Posted by Tom Moertel
Thu, 25 Oct 2007 20:26:00 GMT
On the Darcs Users mailing list, I ran across an interesting thread: practical differences between darcs’ patch model and git/mercurial’s?
Among the interesting points of discussion:
- Do the mechanics that give rise to Darcs’s strong cherry-picking abilities also make it susceptible to naughty time-complexity behavior?
- When you merge non-conflicting changes in Git or Mercurial, you must record a merge patch, which binds the two in the development timeline, but in Darcs the respective patches are free to commute. Which behavior is better for real-world development?
If you’re interested in distributed source-code management, it’s an interesting thread to follow.
Posted in programming
Tags darcs, git, mercurial, patches, vcs
no comments
no trackbacks

Posted by Tom Moertel
Sat, 01 Sep 2007 19:39:00 GMT
Via Reddit I found Mark Nelson’s post about a recent word puzzle from NPR’s
Weekend Edition:
Take the names of two U.S. States, mix them all together, then rearrange the letters to form the names of two other U.S. States. What states are these?
The puzzle is fairly straightforward to solve by hand (think about
it), but let’s write a program to solve it. That will give us a convenient
excuse to discuss a super-handy function I use all the time:
clusterBy. In Haskell, it looks like this:
import Control.Arrow ((&&&))
import qualified Data.Map as M
clusterBy :: Ord b => (a -> b) -> [a] -> [[a]]
clusterBy f = M.elems . M.map reverse . M.fromListWith (++)
. map (f &&& return)
What clusterBy does is group a list of values by their signatures,
as computed by a given signature function f, and returns
the groups in order of ascending signature. For example, we
can cluster the words “the tan ant gets some fat” by length, by
first letter, or by last letter just by changing the
signature function we give to clusterBy:
*Main> let antwords = words "the tan ant gets some fat"
*Main> clusterBy length antwords
[["the","tan","ant","fat"],["gets","some"]]
*Main> clusterBy head antwords
[["ant"],["fat"],["gets"],["some"],["the","tan"]]
*Main> clusterBy last antwords
[["the","some"],["tan"],["gets"],["ant","fat"]]
If we use sort as the signature function, we can find anagrams:
*Main> clusterBy sort antwords
[["fat"],["tan","ant"],["gets"],["the"],["some"]]
And that brings us back to the original puzzle. To find the solution,
we must consider each unique pair of state names to form a “word” and
find the anagrams among a list of such “words.”
Assuming we are given
a list of state names on standard input, one state per line, we can
write the shell of our solution as follows:
main = mapM_ print . solve . lines =<< getContents
The shell delegates the real work to solve. It’s job is to
compute the unique, 2-state combinations from the original
list of states, and then find the anagrams among these combinations.
As before, finding the anagrams is simply a matter of calling
clusterBy with the right signature function. We also filter
out the trivial results, which are not valid solutions:
solve = filter ((>1) . length) . clusterBy signature . ucombos
ucombos xs = [[x,y] | x <- xs, y <- xs, x < y]
signature = sort . filter isAlpha . concat
That’s it. Now we can solve the puzzle by feeding our program a list of states:
$ runhaskell anagrams2.hs < states.txt
[["NORTH CAROLINA","SOUTH DAKOTA"],
["NORTH DAKOTA","SOUTH CAROLINA"]]
What a handy little function, that clusterBy.
Posted in programming
Tags clusterby, functions, haskell, hof, puzzles
12 comments
no trackbacks

Posted by Tom Moertel
Sat, 18 Aug 2007 17:01:00 GMT
Like chromatic, I have
watched the recent irrational exuberance for domain-specific languages
(DSLs) with bewilderment. In certain quarters of the programming
universe, it seems that creating DSLs is nearly a rite of passage.
The problem is, more and more of these DSLs appear to have been
created mainly because, well, DSLs are cool these days, even if less
“novel” solutions probably would have been more sensible.
Whereas chromatic unhesitatingly confronted the madness
head-on,
I have so far managed to avoid the fray. Sure, I’ve asked the
occasional probing question of the DSL
enthusiast,
but mostly my reaction has been limited to standing back and staring
in mute amazement at the runaway Domain-Specific Fun-Time Language
Train, screaming down the tracks, destined for its inevitable high-speed
derailment into what I can only expect will be a bridge abutment.
But I’m starting to get the feeling that some of the train’s passengers are
aboard because they think it’s the Right Thing To Do Train,
so maybe it’s time to say something.
To set the record straight, I don’t have anything against DSLs,
embedded or otherwise. (I have created my fair
share,
some of which are actually
useful.) No, my concern is
limited strictly to the rise of the Gratuitous DSL. So let’s talk
about it.
The reason – the right reason – for creating a DSL is because it ultimately lowers the cost
of solving problems. If, then, you create a DSL and the cost of
solving your problems does not go down, why did you create
it? Think about it. Creating a DSL is an expensive proposition. Making
people learn your DSL’s syntax,
semantics, and underlying domain is a lot to ask – it’s costly. If you do ask, if you do make
the imposition, you had better be sure your DSL pays its bills.
But what if your DSL turns out to be a deadbeat? What if using your DSL doesn’t lower the cost of solving problems? Well, guess what? You have
created a Gratuitous Domain Specific Language.
Still unsure of whether you’re on the DSL Train for the wrong reason? No problem. Just take
this simple, seven-step test:
Seven signs you may have created a Gratuitous Domain Specific Language (GDSL)
- You can’t actually explain what a DSL is.
- For your DSL, you can’t explain what the domain is.
- You have a hard time explaining the DSL’s syntax and semantics.
- You have a hard time explaining how the DSL interacts with the language it is embedded in. (For embedded DSLs only.)
- A vanilla library API would have captured the domain’s semantics without awkwardness.
- It’s easier to express complex domain concepts in general-purpose code than in your DSL.
- Your colleagues have a hard time writing things in your DSL.
Did more than a few of the statements ring true? If so, take a bow.
You are the proud creator of a Gratuitous
DSL!1
Even so, it’s not too late. You can always hop off the DSL Train at the next stop.
Update: minor edit for clarity.
Update 2008-03-22: edits for clarity.
Posted in programming, humor
Tags cargocult, coding, culture, dsl, edsl, humor
9 comments
no trackbacks
