Recently, I wrote a simple plagiarism detector in Haskell using n-grams. The problem it solves is this: You have a document, the suspect, which you believe may be derived from other documents, the sources. You want to find regions in the suspect that are identical to those in the sources.
One simple solution is to use n-grams, which are just the runs of n consecutive words taken from a sequence. For example, the 4-grams of the sequence a b c d e f are a b c d, b c d e, and c d e f.
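In Haskell, this windowing falls out of Data.List's tails function. Here's a tiny standalone sketch (not the detector itself, which appears below):

```haskell
import Data.List (tails)

-- All n-word windows of a word list, in order. The takeWhile
-- drops the short fragments at the end of the tails list.
ngrams :: Int -> [String] -> [[String]]
ngrams n = takeWhile ((== n) . length) . map (take n) . tails
```

In GHCi, `ngrams 4 (words "a b c d e f")` yields exactly the three 4-grams from the example above.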
The idea for plagiarism detection with n-grams is that, as you increase n, the probability that a given n-gram will be shared between two arbitrary documents rapidly vanishes. For example, Google says, right now, that the 2-gram “rapidly vanishes” appears in about 11,000 documents, but the 5-gram “two arbitrary documents rapidly vanishes” occurs in no documents. (Once Google indexes this post, however, that will change.)
But when one document is copied partly from another, they will probably share many long sequences of words. So, the sharing of longer n-grams is evidence of copying.
It’s not proof, however. If two papers about U.S. history both contain the 5-gram “The United States of America,” that coincidence is not surprising. But when lots of unlikely sharing occurs, it’s strong evidence of some form of common ancestry.
Fortunately, when copying occurs, it’s often easy to see, once you know where to look. So my detector just annotates the suspect document to indicate which portions of it share n-grams with the source documents. Shared regions are converted to ALL CAPS. For example, looking for shared 4-grams, my detector finds ample evidence that the following paragraph was copied from the text above:
$ ./Plag 4 suspect.txt blog-post-plagiarism-detector.txt
But it's not rock solid. If two documents ABOUT U.S. HISTORY
BOTH CONTAIN THE 5-GRAM "THE UNITED STATES OF AMERICA,"
that's not shocking. BUT WHEN LOTS OF sharing occurs, that's
pretty good EVIDENCE OF SOME FORM OF COMMON ANCESTRY.
In fact, it was copied: I duplicated an earlier paragraph and then changed a few words.
Using n-grams to detect sharing is not a perfect system, but it’s a simple system, and, because its output is easily interpreted, it’s an effective system. When your eyeballs see a lot of ALL CAPS TEXT, it’s easy to judge what it represents.
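The uppercasing itself is a small transformation: apply toUpper to the words at the matched positions. Here's a minimal standalone sketch (markAt is a hypothetical name for illustration; the real program's mapAt does the same job more efficiently by walking the sorted positions once, and also preserves each word's leading whitespace):

```haskell
import Data.Char (toUpper)

-- Uppercase the words at the given positions (a toy version of
-- the marking step; quadratic because of the `elem` lookups).
markAt :: [Int] -> [String] -> [String]
markAt locs ws = [ if i `elem` locs then map toUpper w else w
                 | (i, w) <- zip [0 ..] ws ]
```

For example, `unwords (markAt [2,3,4] (words "but when lots of sharing occurs"))` gives "but when LOTS OF SHARING occurs".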
Anyway, here’s the code.
{-
Simple plagiarism detector that detects n-gram duplication.
Usage: ./Plag N suspect.txt source.txt... > annotated-suspect.txt
(where N is the n-gram size; try 3, 4, and 5).
Tom Moertel <tmoertel@gmail.com>
2011-06-11
-}
module Main where
import Control.Applicative ((<$>))
import Data.Char (toLower, toUpper, isAlphaNum, isSpace)
import Data.List (tails)
import qualified Data.Set as S
import System.Environment (getArgs)
main = do
  n:suspect:sources <- getArgs
  putStrLn =<< findMatches (read n) suspect sources

findMatches n suspect sources = do
  db <- S.fromList . concat <$> mapM (loadNgrams n) sources
  stxt <- readFile suspect
  return $ concat (match n db (nGrams n stxt) (whiteWords stxt))

match :: Int -> S.Set [String] -> [[String]] -> [String] -> [String]
match n db ngs wws = mapAt (map toUpper) matchlocs wws
  where
    matchlocs = uniq . concat $ zipWith isMatch [0..] ngs
    isMatch loc ng | ng `S.member` db = [loc .. loc + n - 1]
                   | otherwise        = []

mapAt :: (a -> a) -> [Int] -> [a] -> [a]
mapAt f locs ws = go 0 locs ws
  where
    go _ [] ws = ws
    go _ _  [] = []
    go i (l:ls) (w:ws) | i < l     = w   : go (i + 1) (l:ls) ws
                       | otherwise = f w : go (i + 1) ls ws

uniq = S.toList . S.fromList

loadNgrams n file = nGrams n <$> readFile file

nGrams :: Int -> String -> [[String]]
nGrams n = takeWhile ((n ==) . length) . map (take n) .
           tails . map normalize . words

normalize = map toLower . filter isAlphaNum

-- returns words like 'words' but preserves leading whitespace for each
whiteWords :: String -> [String]
whiteWords s = case span isSpace s of
  ("", "") -> []
  (s1, "") -> [s1]
  (s1, s') -> (s1 ++ w) : whiteWords s''
    where
      (w, s'') = break isSpace s'