Good enough to steal

Posted by Tom Moertel Sun, 10 Jul 2011 22:06:00 GMT

Recently, I wrote a simple plagiarism detector as a fun programming exercise. Then, merely a few days later, some company gave me cause to use it.

This company, it seems, was looking to hire a programmer. So they posted a job ad that was more or less word-for-word copied from a job ad that I had written for the company where I work. The duplication, being so extensive, was hard to miss. (The offending company, to its credit, promptly removed the copied ad from its web site when we let them know about it.)

I had originally written the text used in that ad back in December 2010, when we were starting another round of hiring. I had hoped that when the right kind of programmers read it, they would discern that we were programmers just like them, programmers who cared for their craft enough and who cared for their team enough to take hiring other programmers seriously. I didn’t want our ads to seem anything like those spat out by people just mouthing the words that everyone else was mouthing to “get talent.” So I worked on getting the words right, thinking the investment would somehow help us stand apart, if just a bit, when hiring.

But I had failed to consider that authenticity in job ads can be faked by just copying what seems authentic. So, a few days ago, when Googling for statistically unlikely phrases from the text I had written, I was actually surprised to discover that a number of companies and recruiters were now using my words, more or less unchanged, to signal how “authentic” they were.

At first, I was annoyed. But, upon reflection, I realized that the plagiarism was telling me that I had written something worth stealing. That’s a good thing, right?

After all, in a society where all too many people are willing to claim your words as their own, which is worse: to write something and have it stolen or to write something and have it not?

P.S. For the pedants who would point out that nothing was actually “stolen” here, please understand that steal has a well-established sense that means basically to plagiarize, as in T.S. Eliot’s quip that “Mediocre writers borrow. Great writers steal.”

Posted in
Tags , , , ,
5 comments
no trackbacks
Reddit Delicious

Writing a simple plagiarism detector in Haskell

Posted by Tom Moertel Thu, 16 Jun 2011 04:13:00 GMT

Recently, I wrote a simple plagiarism detector in Haskell using n-grams. The problem it solves is this: You have a document, the suspect, which you believe may be derived from other documents, the sources. You want to find regions in the suspect that are identical to those in the sources.

One simple solution is to use n-grams, which are just the n-word neighborhoods taken from some sequence of words. For example, the 4-grams of the sequence a b c d e f are a b c d, b c d e, and c d e f.

The idea for plagiarism detection with n-grams is that, as you increase n, the probability that a given n-gram will be shared between two arbitrary documents rapidly vanishes. For example, Google says, right now, that the 2-gram “rapidly vanishes” appears in about 11,000 documents, but the 5-gram “two different documents rapidly vanishes” occurs in no documents. (Once Google indexes this post, however, that will change.)

But when one document is copied partly from another, they will probably share many long sequences of words. So, the sharing of longer n-grams is evidence of copying.

It’s not proof, however. If two papers about U.S. history both contain the 5-gram “The United States of America,” that coincidence is not surprising. But when lots of unlikely sharing occurs, it’s strong evidence of some form of common ancestry.

Fortunately, when copying occurs, it’s often easy to see, once you know where to look. So my detector just annotates the suspect document to indicate which portions of it share n-grams with the source documents. Shared regions are converted to ALL CAPS. For example, looking for shared 4-grams, my detector finds ample evidence that the following paragraph was copied from the text above:

$ ./Plag 4 suspect.txt blog-post-plagiarism-detector.txt

But it's not rock solid. If two documents ABOUT U.S. HISTORY
BOTH CONTAIN THE 5-GRAM "THE UNITED STATES OF AMERICA," 
that's not shocking. BUT WHEN LOTS OF sharing occurs, that's
pretty good EVIDENCE OF SOME FORM OF COMMON ANCESTRY.

In fact, it was copied: I duplicated an earlier paragraph and then changed a few words.

Using n-grams to detect sharing is not a perfect system, but it’s a simple system, and, because its output is easily interpreted, it’s an effective system. When your eyeballs see a lot of ALL CAPS TEXT, it’s easy to judge what it represents.

Anyway, here’s the code.

{-

Simple plagiarism detector that detects n-gram duplication.

Usage: ./Plag N suspect.txt source.txt... > annotated-suspect.txt
(where N is the n-gram size; try 3, 4, and 5).

Tom Moertel <tmoertel@gmail.com>
2011-06-11

-}

module Main where

import Control.Applicative ((<$>))
import Data.Char (toLower, toUpper, isAlphaNum, isSpace)
import Data.List (tails)
import qualified Data.Set as S
import System (getArgs)

main = do
  n:suspect:sources <- getArgs
  putStrLn =<< findMatches (read n) suspect sources

findMatches n suspect sources = do
  db <- S.fromList . concat  <$> mapM (loadNgrams n) sources
  stxt <- readFile suspect
  return $ concat (match n db (nGrams n stxt) (whiteWords stxt))

match :: Int -> S.Set [String] -> [[String]] -> [String] -> [String]
match n db ngs wws = mapAt (map toUpper) matchlocs wws
  where
    matchlocs = uniq . concat $ zipWith isMatch [0..] ngs
    isMatch loc ng | ng `S.member` db = [loc .. loc + n - 1]
                   | otherwise        = []

mapAt :: (a -> a) -> [Int] -> [a] -> [a]
mapAt f locs ws = go 0 locs ws
  where
    go _   []     ws               = ws
    go _   _      []               = []
    go i (l:ls) (w:ws) | i < l     = w : go (i + 1) (l:ls) ws
    go i (l:ls) (w:ws) | otherwise = f w :  go (i + 1) ls ws

uniq = S.toList . S.fromList

loadNgrams n file = nGrams n <$> readFile file

nGrams :: Int -> String -> [[String]]
nGrams n = takeWhile ((n ==) . length) . map (take n) .
           tails . map normalize . words

normalize = map toLower . filter isAlphaNum

-- returns words like 'words' but preserves leading whitespace for each

whiteWords :: String -> [String]
whiteWords s = case span isSpace s of
  ("", "") -> []
  (s1, "") -> [s1]
  (s1, s') -> (s1 ++ w) : whiteWords s''
    where
      (w, s'') = break isSpace s'

Posted in
Tags , ,
9 comments
no trackbacks
Reddit Delicious