Writing a simple plagiarism detector in Haskell

Recently, I wrote a simple plagiarism detector in Haskell using n-grams. The problem it solves is this: You have a document, the suspect, which you believe may be derived from other documents, the sources. You want to find regions in the suspect that are identical to those in the sources.

One simple solution is to use n-grams, which are just the runs of n consecutive words taken from a sequence of words. For example, the 4-grams of the sequence a b c d e f are a b c d, b c d e, and c d e f.
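Here's a quick sketch of that in Haskell; the helper name ngrams is mine, but it's essentially the nGrams function from the full program below, minus the word normalization:

import Data.List (tails)

-- All runs of n consecutive words, in order.
ngrams :: Int -> [String] -> [[String]]
ngrams n = takeWhile ((n ==) . length) . map (take n) . tails

-- ghci> ngrams 4 (words "a b c d e f")
-- [["a","b","c","d"],["b","c","d","e"],["c","d","e","f"]]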

The idea for plagiarism detection with n-grams is that, as you increase n, the probability that a given n-gram will be shared between two arbitrary documents rapidly vanishes. For example, Google says, right now, that the 2-gram “rapidly vanishes” appears in about 11,000 documents, but the 5-gram “two arbitrary documents rapidly vanishes” occurs in no documents. (Once Google indexes this post, however, that will change.)

But when one document is copied partly from another, they will probably share many long sequences of words. So, the sharing of longer n-grams is evidence of copying.
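As a rough sketch of that test (sharedNgrams is my own name, and this simplification ignores word positions, which the real program below needs in order to annotate the text), the shared n-grams of two documents are just a set intersection:

import qualified Data.Set as S

-- N-grams common to two word lists (ngrams is the helper sketched above).
sharedNgrams :: Int -> [String] -> [String] -> S.Set [String]
sharedNgrams n xs ys =
  S.fromList (ngrams n xs) `S.intersection` S.fromList (ngrams n ys)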

It’s not proof, however. If two papers about U.S. history both contain the 5-gram “The United States of America,” that coincidence is not surprising. But when lots of unlikely sharing occurs, it’s strong evidence of some form of common ancestry.

Fortunately, when copying occurs, it’s often easy to see, once you know where to look. So my detector just annotates the suspect document to indicate which portions of it share n-grams with the source documents. Shared regions are converted to ALL CAPS. For example, looking for shared 4-grams, my detector finds ample evidence that the following paragraph was copied from the text above:

$ ./Plag 4 suspect.txt blog-post-plagiarism-detector.txt

But it's not rock solid. If two documents ABOUT U.S. HISTORY
BOTH CONTAIN THE 5-GRAM "THE UNITED STATES OF AMERICA,"
that's not shocking. BUT WHEN LOTS OF sharing occurs, that's
pretty good EVIDENCE OF SOME FORM OF COMMON ANCESTRY.

In fact, it was copied: I duplicated an earlier paragraph and then changed a few words.

Using n-grams to detect sharing is not a perfect system, but it’s a simple system, and, because its output is easily interpreted, it’s an effective system. When your eyeballs see a lot of ALL CAPS TEXT, it’s easy to judge what it represents.

Anyway, here’s the code.

{-

Simple plagiarism detector that detects n-gram duplication.

Usage: ./Plag N suspect.txt source.txt... > annotated-suspect.txt
(where N is the n-gram size; try 3, 4, and 5).

Tom Moertel <tmoertel@gmail.com>
2011-06-11

-}


module Main where

import Control.Applicative ((<$>))
import Data.Char (toLower, toUpper, isAlphaNum, isSpace)
import Data.List (tails)
import qualified Data.Set as S
import System.Environment (getArgs)


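-- Parse the command line (n-gram size, suspect file, source files)
-- and print the suspect's text with shared regions in ALL CAPS.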
main = do
  n:suspect:sources <- getArgs
  putStrLn =<< findMatches (read n) suspect sources

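-- Build a set of the sources' n-grams, then reassemble the suspect's
-- text, upper-casing the words that fall inside shared n-grams.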
findMatches n suspect sources = do
  db <- S.fromList . concat <$> mapM (loadNgrams n) sources
  stxt <- readFile suspect
  return $ concat (match n db (nGrams n stxt) (whiteWords stxt))

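-- Upper-case every suspect word that lies within an n-gram that also
-- appears in the source database db.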
match :: Int -> S.Set [String] -> [[String]] -> [String] -> [String]
match n db ngs wws = mapAt (map toUpper) matchlocs wws
  where
    matchlocs = uniq . concat $ zipWith isMatch [0..] ngs
    isMatch loc ng | ng `S.member` db = [loc .. loc + n - 1]
                   | otherwise        = []

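-- Apply f to the list elements at the given indices (assumed ascending).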
mapAt :: (a -> a) -> [Int] -> [a] -> [a]
mapAt f locs ws = go 0 locs ws
  where
    go _ []     ws     = ws
    go _ _      []     = []
    go i (l:ls) (w:ws)
      | i < l     = w   : go (i + 1) (l:ls) ws
      | otherwise = f w : go (i + 1) ls     ws

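-- Sort and deduplicate; the ascending result is what mapAt expects.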
uniq = S.toList . S.fromList

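-- Read a source file and extract its n-grams.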
loadNgrams n file = nGrams n <$> readFile file

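-- All n-word runs of a text, with each word normalized.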
nGrams :: Int -> String -> [[String]]
nGrams n = takeWhile ((n ==) . length) . map (take n) .
           tails . map normalize . words

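-- Strip non-alphanumeric characters and lowercase what remains, so
-- differences in case and punctuation don't hide matches.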
normalize = map toLower . filter isAlphaNum


-- Like 'words', but each word keeps its leading whitespace, so that
-- concatenating the results reproduces the suspect's original spacing.
whiteWords :: String -> [String]
whiteWords s = case span isSpace s of
  ("", "") -> []
  (s1, "") -> [s1]
  (s1, s') -> (s1 ++ w) : whiteWords s''
    where
      (w, s'') = break isSpace s'