Posted by Tom Moertel
Sun, 10 Jul 2011 22:06:00 GMT
Recently, I wrote a simple plagiarism detector as a fun programming exercise. Then, merely a few days later, some company gave me cause to use it.
This company, it seems, was looking to hire a programmer. So they posted a job ad that was more or less word-for-word copied from a job ad that I had written for the company where I work. The duplication, being so extensive, was hard to miss. (The offending company, to its credit, promptly removed the copied ad from its web site when we let them know about it.)
I had originally written the text used in that ad back in December 2010, when we were starting another round of hiring. I had hoped that when the right kind of programmers read it, they would discern that we were programmers just like them, programmers who cared for their craft enough and who cared for their team enough to take hiring other programmers seriously. I didn’t want our ads to seem anything like those spat out by people just mouthing the words that everyone else was mouthing to “get talent.” So I worked on getting the words right, thinking the investment would somehow help us stand apart, if just a bit, when hiring.
But I had failed to consider that authenticity in job ads can be faked by just copying what seems authentic. So, a few days ago, when Googling for statistically unlikely phrases from the text I had written, I was actually surprised to discover that a number of companies and recruiters were now using my words, more or less unchanged, to signal how “authentic” they were.
At first, I was annoyed. But, upon reflection, I realized that the plagiarism was telling me that I had written something worth stealing. That’s a good thing, right?
After all, in a society where all too many people are willing to claim your words as their own, which is worse: to write something and have it stolen or to write something and have it not?
P.S. For the pedants who would point out that nothing was actually “stolen” here, please understand that steal has a well-established sense that means basically to plagiarize, as in T.S. Eliot’s quip that “Mediocre writers borrow. Great writers steal.”
Posted in writing
Tags hiring, plagiarism, recruiters, weaselry, writing
5 comments
no trackbacks

Posted by Tom Moertel
Thu, 16 Jun 2011 04:13:00 GMT
Recently, I wrote a simple plagiarism detector in Haskell using
n-grams. The problem it solves is this: You have a document, the
suspect, which you believe may be derived from other documents, the
sources. You want to find regions in the suspect that are identical
to those in the sources.
One simple solution is to use n-grams, which are just the n-word
neighborhoods taken from some sequence of words. For example, the
4-grams of the sequence a b c d e f are a b c d, b c d e, and
c d e f.
The idea for plagiarism detection with n-grams is that, as you
increase n, the probability that a given n-gram will be
shared between two arbitrary documents rapidly vanishes. For example,
Google says, right now, that the 2-gram “rapidly vanishes” appears in
about 11,000 documents, but the 5-gram “two different documents
rapidly vanishes” occurs in no documents. (Once Google indexes this
post, however, that will change.)
But when one document is copied partly from another, they will
probably share many long sequences of words. So, the sharing of longer
n-grams is evidence of copying.
It’s not proof, however. If two papers about U.S. history both contain
the 5-gram “The United States of America,” that coincidence is not surprising. But
when lots of unlikely sharing occurs, it’s strong evidence of some
form of common ancestry.
Fortunately, when copying occurs, it’s often easy to see, once
you know where to look. So my detector just annotates the suspect
document to indicate which portions of it share n-grams with the
source documents. Shared regions are converted to ALL CAPS. For
example, looking for shared 4-grams, my detector finds ample evidence
that the following paragraph was copied from the text above:
$ ./Plag 4 suspect.txt blog-post-plagiarism-detector.txt
But it's not rock solid. If two documents ABOUT U.S. HISTORY
BOTH CONTAIN THE 5-GRAM "THE UNITED STATES OF AMERICA,"
that's not shocking. BUT WHEN LOTS OF sharing occurs, that's
pretty good EVIDENCE OF SOME FORM OF COMMON ANCESTRY.
In fact, it was copied: I duplicated an earlier paragraph and then
changed a few words.
Using n-grams to detect sharing is not a perfect system,
but it’s a simple system, and, because its output is
easily interpreted, it’s an effective system. When your eyeballs see
a lot of ALL CAPS TEXT, it’s easy to judge what
it represents.
Anyway, here’s the code.
module Main where
import Control.Applicative ((<$>))
import Data.Char (toLower, toUpper, isAlphaNum, isSpace)
import Data.List (tails)
import qualified Data.Set as S
import System (getArgs)
main = do
n:suspect:sources <- getArgs
putStrLn =<< findMatches (read n) suspect sources
findMatches n suspect sources = do
db <- S.fromList . concat <$> mapM (loadNgrams n) sources
stxt <- readFile suspect
return $ concat (match n db (nGrams n stxt) (whiteWords stxt))
match :: Int -> S.Set [String] -> [[String]] -> [String] -> [String]
match n db ngs wws = mapAt (map toUpper) matchlocs wws
where
matchlocs = uniq . concat $ zipWith isMatch [0..] ngs
isMatch loc ng | ng `S.member` db = [loc .. loc + n 1]
| otherwise = []
mapAt :: (a -> a) -> [Int] -> [a] -> [a]
mapAt f locs ws = go 0 locs ws
where
go _ [] ws = ws
go _ _ [] = []
go i (l:ls) (w:ws) | i < l = w : go (i + 1) (l:ls) ws
go i (l:ls) (w:ws) | otherwise = f w : go (i + 1) ls ws
uniq = S.toList . S.fromList
loadNgrams n file = nGrams n <$> readFile file
nGrams :: Int -> String -> [[String]]
nGrams n = takeWhile ((n ==) . length) . map (take n) .
tails . map normalize . words
normalize = map toLower . filter isAlphaNum
whiteWords :: String -> [String]
whiteWords s = case span isSpace s of
("", "") -> []
(s1, "") -> [s1]
(s1, s') -> (s1 ++ w) : whiteWords s''
where
(w, s'') = break isSpace s'
Posted in programming
Tags haskell, ngrams, plagiarism
9 comments
no trackbacks
