Writing a simple plagiarism detector in Haskell
Posted by Tom Moertel Thu, 16 Jun 2011 04:13:00 GMT
Recently, I wrote a simple plagiarism detector in Haskell using n-grams. The problem it solves is this: You have a document, the suspect, which you believe may be derived from other documents, the sources. You want to find regions in the suspect that are identical to those in the sources.
One simple solution is to use n-grams, which are just the n-word neighborhoods taken from some sequence of words. For example, the 4-grams of the sequence a b c d e f are a b c d, b c d e, and c d e f.
The idea for plagiarism detection with n-grams is that, as you increase n, the probability that a given n-gram will be shared between two arbitrary documents rapidly vanishes. For example, Google says, right now, that the 2-gram “rapidly vanishes” appears in about 11,000 documents, but the 5-gram “two different documents rapidly vanishes” occurs in no documents. (Once Google indexes this post, however, that will change.)
But when one document is copied partly from another, they will probably share many long sequences of words. So, the sharing of longer n-grams is evidence of copying.
It’s not proof, however. If two papers about U.S. history both contain the 5-gram “The United States of America,” that coincidence is not surprising. But when lots of unlikely sharing occurs, it’s strong evidence of some form of common ancestry.
Fortunately, when copying occurs, it’s often easy to see, once you know where to look. So my detector just annotates the suspect document to indicate which portions of it share n-grams with the source documents. Shared regions are converted to ALL CAPS. For example, looking for shared 4-grams, my detector finds ample evidence that the following paragraph was copied from the text above:
$ ./Plag 4 suspect.txt blog-post-plagiarism-detector.txt But it's not rock solid. If two documents ABOUT U.S. HISTORY BOTH CONTAIN THE 5-GRAM "THE UNITED STATES OF AMERICA," that's not shocking. BUT WHEN LOTS OF sharing occurs, that's pretty good EVIDENCE OF SOME FORM OF COMMON ANCESTRY.
In fact, it was copied: I duplicated an earlier paragraph and then changed a few words.
Using n-grams to detect sharing is not a perfect system, but it’s a simple system, and, because its output is easily interpreted, it’s an effective system. When your eyeballs see a lot of ALL CAPS TEXT, it’s easy to judge what it represents.
Anyway, here’s the code.
{-
Simple plagiarism detector that detects n-gram duplication.
Usage: ./Plag N suspect.txt source.txt... > annotated-suspect.txt
(where N is the n-gram size; try 3, 4, and 5).
Tom Moertel <tmoertel@gmail.com>
2011-06-11
-}
module Main where
import Control.Applicative ((<$>))
import Data.Char (toLower, toUpper, isAlphaNum, isSpace)
import Data.List (tails)
import qualified Data.Set as S
import System (getArgs)
main = do
n:suspect:sources <- getArgs
putStrLn =<< findMatches (read n) suspect sources
findMatches n suspect sources = do
db <- S.fromList . concat <$> mapM (loadNgrams n) sources
stxt <- readFile suspect
return $ concat (match n db (nGrams n stxt) (whiteWords stxt))
match :: Int -> S.Set [String] -> [[String]] -> [String] -> [String]
match n db ngs wws = mapAt (map toUpper) matchlocs wws
where
matchlocs = uniq . concat $ zipWith isMatch [0..] ngs
isMatch loc ng | ng `S.member` db = [loc .. loc + n - 1]
| otherwise = []
mapAt :: (a -> a) -> [Int] -> [a] -> [a]
mapAt f locs ws = go 0 locs ws
where
go _ [] ws = ws
go _ _ [] = []
go i (l:ls) (w:ws) | i < l = w : go (i + 1) (l:ls) ws
go i (l:ls) (w:ws) | otherwise = f w : go (i + 1) ls ws
uniq = S.toList . S.fromList
loadNgrams n file = nGrams n <$> readFile file
nGrams :: Int -> String -> [[String]]
nGrams n = takeWhile ((n ==) . length) . map (take n) .
tails . map normalize . words
normalize = map toLower . filter isAlphaNum
-- returns words like 'words' but preserves leading whitespace for each
whiteWords :: String -> [String]
whiteWords s = case span isSpace s of
("", "") -> []
(s1, "") -> [s1]
(s1, s') -> (s1 ++ w) : whiteWords s''
where
(w, s'') = break isSpace s'

The difficult part is where to get enough sources (or better, n-grams database). I suppose it is an easy way to get banned in Google.
And this is an n-grams’ database. 6DVDs (24 GB gzipped) to be exact.
And this is Google Books Ngrams database, which covers many published books, but includes only n-grams which appear over 40 times:
Somehow, links disappeared
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13
http://ngrams.googlelabs.com/datasets
Fun stuff! Since I work in bioinformatics, one thing that keeps recurring is applying phylogeny to other things that have ancestry. For instance, building a phylogeny of documents, or of source code. If I understand correctly, you highlight (capitalize) just the n-grams that are shared. Have you considered doing a local alignment (edit distance)? Anyway, great stuff, keep it coming.
It seems like you could do much more interesting plagiarism detection for programming assignments. With haskell-src-exts, you could parse and compare Haskell trees. So if you get two assignments handed in like this:
factorial 0 = 1 factorial n = n * factorial (n – 1)
and
factorial 0 = 1 factorial x = x * factorial (x – 1)
You just do a structural comparison.
$a 0 = 1 $a $x = $x * $a ($x – 1)
And reduce:
factorial 0 = 1 factorial x = z where z = x * factorial (x – 1)
=>
factorial 0 = 1 factorial x = x * factorial (x – 1)
Etc.
Chris:
You should check out the paper entitled ‘Clone Detection and Elimination for Haskell’ PEPM 2010. By Chris Brown and Simon Thompson.
In the paper we compare Haskell trees, using alpha renaming, where we look for instances of a common generalisation. But the work could easily be extended to plagiarism detection.
Ketil,
Edit-distance-based methods (including the Smith-Waterman Algorithm used in computational biology) have in fact been used for plagiarism detection. See, for example, Robert W. Irving’s paper on Plagiarism and Collusion Detection using the Smith-Waterman Algorithm.
For my uses, however, something simple was all I needed, so I went no further than n-grams.
You are correct: I highlight the shared n-grams. To keep things simple, the version I posted just capitalizes the them, but in the version I actually use, I added logic to color code the n-grams to indicate which source document they match.
Cheers,
Tom
Please publish the source code for Visual Basic .NET.
This way, more people will appreciate the cleverness of your idea and adopt your power tricks.
I made a similar plagiarism detector using Factor.
If you’re curious how a concatenative approach compares to Haskell, you might find it interesting.