<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="/stylesheets/rss.css" type="text/css"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/">
  <channel>
    <title>Tom Moertel's Weblog: Improving Typo's spam protection</title>
    <link>http://blog.moertel.com/articles/2006/01/16/improving-typos-spam-protection</link>
    <language>en-us</language>
    <ttl>40</ttl>
    <description>Quality rants on programming theory and stuff geeks like</description>
    <item>
      <title>Improving Typo's spam protection</title>
      <description>&lt;p&gt;I noticed that my site has been picking up more comment spam recently.
&lt;a href="http://www.typosphere.org/"&gt;Typo&lt;/a&gt; has built-in spam protection, but for
some reason a few spam comments that ought to have been caught slipped
through its filters.  Curious, I investigated.&lt;/p&gt;


	&lt;p&gt;Most spam comments contain links to sites favored by the spammers.
The sites are almost always of the form &lt;em&gt;x.domain&lt;/em&gt;.com,
where &lt;em&gt;domain&lt;/em&gt; is one of a few higher-level domains and &lt;em&gt;x&lt;/em&gt; is drawn
from a large set of values from the realms of gambling, pornography,
and male enhancement.  It seems that the spammers pay for a few real
domains and then create a ton of subdomains under them.&lt;/p&gt;


	&lt;p&gt;One of the ways to detect comment spam is to find URIs in comments and
look up the sites they point to in &lt;span class="caps"&gt;DNS&lt;/span&gt;-based
&lt;acronym title="spam-URI realtime blocklists"&gt;SURBL&lt;/acronym&gt;s,
such as &lt;a href="http://www.surbl.org/"&gt;multi.surbl.org&lt;/a&gt; and
&lt;a href="http://bsb.empty.us/"&gt;bsb.empty.us&lt;/a&gt;.  The thing is, when SURBLs list a
spammy site &lt;em&gt;x.domain&lt;/em&gt;.com, sometimes they list it under the full
hostname &lt;em&gt;x.domain&lt;/em&gt;.com and sometimes they list it
under the higher-level domain
&lt;em&gt;domain&lt;/em&gt;.com.  To be safe, Typo looks up both forms when it checks
for spam.&lt;/p&gt;
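
&lt;p&gt;As a rough sketch of what looking up both forms means, the snippet below builds the two &lt;span class="caps"&gt;DNS&lt;/span&gt; names that would be queried for one host in one list.  It is not taken from Typo: the helper name &lt;code&gt;surbl_query_names&lt;/code&gt; and the example host are hypothetical, and it assumes the higher-level domain is simply the last two labels of the hostname, which holds for &lt;em&gt;x.domain&lt;/em&gt;.com but not for suffixes such as .co.uk.&lt;/p&gt;

```ruby
# Hypothetical helper (not part of Typo): build the DNS names to query
# for one host in one SURBL.
# Assumption: the higher-level domain is the last two labels of the host.
def surbl_query_names(host, rbl)
  labels = host.split('.')
  domain = labels.last(2).join('.')
  # uniq avoids a duplicate query when the host already is the domain
  [host, domain].uniq.map { |name| [name, rbl].join('.') }
end

names = surbl_query_names('pills.example.com', 'multi.surbl.org')
puts names
# prints:
#   pills.example.com.multi.surbl.org
#   example.com.multi.surbl.org
```

&lt;p&gt;A list counts the site as spammy if either of those names resolves.&lt;/p&gt;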


	&lt;p&gt;Here&amp;#8217;s the code Typo uses:&lt;/p&gt;


&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="constant"&gt;HOST_RBLS&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;each&lt;/span&gt; &lt;span class="keyword"&gt;do&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;rbl&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
  &lt;span class="keyword"&gt;begin&lt;/span&gt;
    &lt;span class="keyword"&gt;if&lt;/span&gt; &lt;span class="punct"&gt;[&lt;/span&gt;
        &lt;span class="constant"&gt;IPSocket&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;getaddress&lt;/span&gt;&lt;span class="punct"&gt;([&lt;/span&gt;&lt;span class="ident"&gt;host&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;rbl&lt;/span&gt;&lt;span class="punct"&gt;].&lt;/span&gt;&lt;span class="ident"&gt;join&lt;/span&gt;&lt;span class="punct"&gt;('&lt;/span&gt;&lt;span class="string"&gt;.&lt;/span&gt;&lt;span class="punct"&gt;')),&lt;/span&gt;
        &lt;span class="constant"&gt;IPSocket&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;getaddress&lt;/span&gt;&lt;span class="punct"&gt;((&lt;/span&gt;&lt;span class="ident"&gt;domain&lt;/span&gt; &lt;span class="punct"&gt;+&lt;/span&gt; &lt;span class="punct"&gt;[&lt;/span&gt;&lt;span class="ident"&gt;rbl&lt;/span&gt;&lt;span class="punct"&gt;]).&lt;/span&gt;&lt;span class="ident"&gt;join&lt;/span&gt;&lt;span class="punct"&gt;('&lt;/span&gt;&lt;span class="string"&gt;.&lt;/span&gt;&lt;span class="punct"&gt;'))&lt;/span&gt;
       &lt;span class="punct"&gt;].&lt;/span&gt;&lt;span class="ident"&gt;include?&lt;/span&gt;&lt;span class="punct"&gt;(&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;127.0.0.2&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;)&lt;/span&gt;
      &lt;span class="ident"&gt;throw&lt;/span&gt; &lt;span class="symbol"&gt;:hit&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;&lt;span class="expr"&gt;#{rbl}&lt;/span&gt; positively resolved &lt;span class="expr"&gt;#{domain.join('.')}&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
    &lt;span class="keyword"&gt;end&lt;/span&gt;
  &lt;span class="keyword"&gt;rescue&lt;/span&gt; &lt;span class="constant"&gt;SocketError&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

	&lt;p&gt;The code iterates over the list of SURBLs it has and queries each
twice &amp;#8211; once for the host and once for the domain in question &amp;#8211; saving
the results of the queries in an array.  Then if the array includes a
positive response (127.0.0.2), it throws a &amp;#8220;hit&amp;#8221; notice to the
calling code, which will block the associated comment.&lt;/p&gt;


	&lt;p&gt;Unfortunately, the code doesn&amp;#8217;t quite work as intended.  Although a
positive response for &lt;em&gt;either&lt;/em&gt; the host or the domain should register
as a hit, the code requires &lt;em&gt;both&lt;/em&gt; queries to return positive
responses.  As a result, the code yields a lot of false negatives
because most lists don&amp;#8217;t include both host and domain forms of spammy
sites; the required double positive is thus hard to obtain.&lt;/p&gt;


&lt;p&gt;The cause of the problem is the attempt to query for both forms of the
site before checking either response.  The queries are performed by
calling &lt;code&gt;IPSocket.getaddress&lt;/code&gt;, which performs a &lt;span class="caps"&gt;DNS&lt;/span&gt; query
for the &amp;#8220;A&amp;#8221; record associated with its argument.  If the record
exists, the call returns the address as a string; otherwise, the call raises a
&lt;code&gt;SocketError&lt;/code&gt; exception.&lt;/p&gt;

	&lt;p&gt;The exception is what causes the logic to break down.  When either the
host or domain is &lt;em&gt;not&lt;/em&gt; in the queried &lt;span class="caps"&gt;SURBL&lt;/span&gt;, which will almost always
be the case for reasons I explained earlier, one of the queries will
result in a &lt;code&gt;SocketError&lt;/code&gt; exception.  The exception will be
caught by the &lt;code&gt;rescue&lt;/code&gt; clause later in the code, but not
before the opportunity to test the other query&amp;#8217;s response and throw a
&amp;#8220;hit&amp;#8221; has been lost.&lt;/p&gt;


	&lt;p&gt;My fix was to replace the above code with a call to a new helper
method:&lt;/p&gt;


&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="ident"&gt;query_rbls&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="constant"&gt;HOST_RBLS&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;host&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;domain&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;join&lt;/span&gt;&lt;span class="punct"&gt;('&lt;/span&gt;&lt;span class="string"&gt;.&lt;/span&gt;&lt;span class="punct"&gt;'))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

	&lt;p&gt;The helper, defined later, makes the actual queries:&lt;/p&gt;


&lt;div class="typocode"&gt;&lt;pre&gt;&lt;code class="typocode_ruby "&gt;&lt;span class="keyword"&gt;def &lt;/span&gt;&lt;span class="method"&gt;query_rbls&lt;/span&gt;&lt;span class="punct"&gt;(&lt;/span&gt;&lt;span class="ident"&gt;rbls&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="punct"&gt;*&lt;/span&gt;&lt;span class="ident"&gt;subdomains&lt;/span&gt;&lt;span class="punct"&gt;)&lt;/span&gt;
  &lt;span class="ident"&gt;rbls&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;each&lt;/span&gt; &lt;span class="keyword"&gt;do&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;rbl&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
    &lt;span class="ident"&gt;subdomains&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;uniq&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;each&lt;/span&gt; &lt;span class="keyword"&gt;do&lt;/span&gt; &lt;span class="punct"&gt;|&lt;/span&gt;&lt;span class="ident"&gt;d&lt;/span&gt;&lt;span class="punct"&gt;|&lt;/span&gt;
      &lt;span class="keyword"&gt;begin&lt;/span&gt;
        &lt;span class="ident"&gt;response&lt;/span&gt; &lt;span class="punct"&gt;=&lt;/span&gt; &lt;span class="constant"&gt;IPSocket&lt;/span&gt;&lt;span class="punct"&gt;.&lt;/span&gt;&lt;span class="ident"&gt;getaddress&lt;/span&gt;&lt;span class="punct"&gt;([&lt;/span&gt;&lt;span class="ident"&gt;d&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="ident"&gt;rbl&lt;/span&gt;&lt;span class="punct"&gt;].&lt;/span&gt;&lt;span class="ident"&gt;join&lt;/span&gt;&lt;span class="punct"&gt;('&lt;/span&gt;&lt;span class="string"&gt;.&lt;/span&gt;&lt;span class="punct"&gt;'))&lt;/span&gt;
        &lt;span class="ident"&gt;throw&lt;/span&gt; &lt;span class="symbol"&gt;:hit&lt;/span&gt;&lt;span class="punct"&gt;,&lt;/span&gt; &lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;&lt;span class="string"&gt;&lt;span class="expr"&gt;#{rbl}&lt;/span&gt; positively resolved &lt;span class="expr"&gt;#{d}&lt;/span&gt; =&amp;gt; &lt;span class="expr"&gt;#{response}&lt;/span&gt;&lt;/span&gt;&lt;span class="punct"&gt;&amp;quot;&lt;/span&gt;
      &lt;span class="keyword"&gt;rescue&lt;/span&gt; &lt;span class="constant"&gt;SocketError&lt;/span&gt;
        &lt;span class="comment"&gt;# NXDOMAIN response =&amp;gt; negative:  d is not in RBL&lt;/span&gt;
      &lt;span class="keyword"&gt;end&lt;/span&gt;
    &lt;span class="keyword"&gt;end&lt;/span&gt;
  &lt;span class="keyword"&gt;end&lt;/span&gt;
  &lt;span class="keyword"&gt;return&lt;/span&gt; &lt;span class="constant"&gt;false&lt;/span&gt;
&lt;span class="keyword"&gt;end&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

	&lt;p&gt;Because some SURBLs signal a positive response with an &amp;#8220;A&amp;#8221; record
other than 127.0.0.2, my helper drops the hard-coded address test and
treats any successful lookup as a hit.&lt;/p&gt;
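
&lt;p&gt;The difference between the two approaches can be reproduced without touching &lt;span class="caps"&gt;DNS&lt;/span&gt;.  The following is a minimal sketch, not the patched Typo code: &lt;code&gt;fake_getaddress&lt;/code&gt;, &lt;code&gt;LISTED_NAMES&lt;/code&gt;, &lt;code&gt;both_at_once&lt;/code&gt;, &lt;code&gt;one_at_a_time&lt;/code&gt;, and the &lt;code&gt;rbl.test&lt;/code&gt; zone are all made up for illustration.  The stub resolver pretends that a list contains only the domain form of a site.&lt;/p&gt;

```ruby
require 'socket' # defines SocketError

# Hypothetical stand-in for IPSocket.getaddress so the sketch runs offline:
# only the domain form is "listed"; any other name raises SocketError.
LISTED_NAMES = ['example.com.rbl.test']

def fake_getaddress(name)
  raise SocketError, "no A record for #{name}" unless LISTED_NAMES.include?(name)
  '127.0.0.2'
end

# Shape of the original code: both lookups run inside the array literal,
# so the first SocketError aborts the expression before include? is reached.
def both_at_once(host, domain, rbl)
  [
    fake_getaddress([host, rbl].join('.')),
    fake_getaddress([domain, rbl].join('.'))
  ].include?('127.0.0.2')
rescue SocketError
  false
end

# Shape of the fix: each lookup has its own rescue, so a miss on one
# form cannot hide a hit on the other.
def one_at_a_time(host, domain, rbl)
  [host, domain].any? do |name|
    begin
      fake_getaddress([name, rbl].join('.'))
      true
    rescue SocketError
      false
    end
  end
end

missed = both_at_once('pills.example.com', 'example.com', 'rbl.test')
caught = one_at_a_time('pills.example.com', 'example.com', 'rbl.test')
puts "both_at_once: #{missed}, one_at_a_time: #{caught}"
# prints: both_at_once: false, one_at_a_time: true
```

&lt;p&gt;With the host form unlisted, the first lookup raises and the combined test never sees the positive answer for the domain form, whereas the per-lookup version reports the hit.&lt;/p&gt;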


	&lt;p&gt;I also made a few more improvements to the spam-protection
code.  The full set of changes is available as &lt;a href="http://www.typosphere.org/trac/ticket/657"&gt;Patch
657&lt;/a&gt; on the Typo Trac site.&lt;/p&gt;</description>
      <pubDate>Mon, 16 Jan 2006 01:34:00 -0500</pubDate>
      <guid isPermaLink="false">urn:uuid:a154474c903e93da6922f1a53a563f0a</guid>
      <author>Tom Moertel</author>
      <link>http://blog.moertel.com/articles/2006/01/16/improving-typos-spam-protection</link>
      <category>typo</category>
      <category>ruby</category>
      <category>spam</category>
      <trackback:ping>http://blog.moertel.com/articles/trackback/23</trackback:ping>
    </item>
  </channel>
</rss>
