Improving Typo's spam protection

I noticed that my site has been picking up more comment spam recently. Typo has built-in spam protection, but for some reason a few spam comments that ought to have been caught slipped through its filters. Curious, I investigated.

Most spam comments contain links to sites favored by the spammers. The sites are almost always of the form x.domain.com, where domain is one of a few higher-level domains and x is drawn from a large set of values from the realms of gambling, pornography, and male enhancement. It seems that the spammers pay for a few real domains and then create a ton of subdomains under them.

One way to detect comment spam is to find URIs in comments and look up the sites they point to in DNS-based SURBLs (spam URI realtime blocklists), such as multi.surbl.org and bsb.empty.us. When SURBLs list a spammy site x.domain.com, they sometimes list it under the full hostname x.domain.com and sometimes under the higher-level domain domain.com. To be safe, Typo looks up both forms when it checks for spam.
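To make the two lookup forms concrete, here is a minimal sketch of how the query names could be built. The helper name `rbl_query_names` is hypothetical and not part of Typo, and treating the last two labels as the higher-level domain is a simplifying assumption that would be wrong for suffixes such as co.uk.

```ruby
# Hypothetical helper (not Typo's code): build the DNS names to look up
# for one host on one blocklist.
# Assumption: the higher-level domain is the last `domain_labels` labels.
def rbl_query_names(host, rbl, domain_labels = 2)
  labels = host.split('.')
  domain = labels.last(domain_labels).join('.')
  # uniq avoids a duplicate query when host and domain are the same name
  [host, domain].uniq.map { |name| "#{name}.#{rbl}" }
end

puts rbl_query_names('x.domain.com', 'multi.surbl.org')
# x.domain.com.multi.surbl.org
# domain.com.multi.surbl.org
```

Each resulting name is then resolved like any other hostname; whether it resolves is what signals a listing.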

Here’s the code it uses:

HOST_RBLS.each do |rbl|
  begin
    if [
        IPSocket.getaddress([host, rbl].join('.')),
        IPSocket.getaddress((domain + [rbl]).join('.'))
       ].include?("127.0.0.2")
      throw :hit, "#{rbl} positively resolved #{domain.join('.')}"
    end
  rescue SocketError
  end
end

The code iterates over the list of SURBLs it has and queries each twice – once for the host and once for the domain in question – saving the results of the queries in an array. Then if the array includes a positive response (127.0.0.2), it throws a “hit” notice to the calling code, which will block the associated comment.
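For readers who haven't used Ruby's catch and throw, the sketch below shows how a thrown notice reaches the calling code. The method names and the "casino" rule are made up for illustration, and the :hit symbol is assumed from the "hit" wording above; this is not Typo's actual caller.

```ruby
# Hypothetical checker: throws :hit with a reason when a comment looks spammy.
def check_comment(body)
  throw :hit, 'body mentions casino' if body.include?('casino')
  false
end

# The caller wraps the check in catch(:hit). If something is thrown,
# catch returns the thrown value; otherwise it returns the block's value.
def spam_reason(body)
  catch(:hit) do
    check_comment(body)
    nil # reached only when nothing was thrown
  end
end

p spam_reason('Visit my casino') # => "body mentions casino"
p spam_reason('Nice post')       # => nil
```

Unlike an exception, a throw is not intercepted by an ordinary rescue clause; it unwinds directly to the matching catch.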

Unfortunately, the code doesn’t quite work as intended. Although a positive response for either the host or the domain should register as a hit, the code in effect requires both queries to return positive responses. As a result, it yields a lot of false negatives: most lists don’t include both the host and domain forms of a spammy site, so the required double positive is hard to obtain.

The cause of the problem is the attempt to query for both forms of the site before checking either response. The queries are performed by calling IPSocket.getaddress, which performs a DNS query for the “A” record associated with its argument. If the record exists, the call returns it; otherwise, the call raises a SocketError exception.

The exception is what causes the logic to break down. When either the host or the domain is not in the queried SURBL, which will almost always be the case for the reasons I explained earlier, at least one of the queries will raise a SocketError. The exception will be caught by the rescue clause later in the code, but not before the opportunity to test the other query’s response and throw a “hit” has been lost.
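The failure can be reproduced offline with a stand-in resolver. In this sketch, `fake_getaddress`, the `LISTED` data, and the list name rbl.example are all made up; only the shape of the check mirrors the original code.

```ruby
require 'socket' # defines SocketError

# Made-up listing: only the domain form is on the list.
LISTED = { 'domain.com.rbl.example' => '127.0.0.2' }.freeze

# Stand-in for IPSocket.getaddress: returns the address or raises SocketError.
def fake_getaddress(name)
  LISTED.fetch(name) { raise SocketError, "no address for #{name}" }
end

def original_style_check(host, domain, rbl)
  begin
    if [
         fake_getaddress("#{host}.#{rbl}"),
         fake_getaddress("#{domain}.#{rbl}")
       ].include?('127.0.0.2')
      return :hit
    end
  rescue SocketError
    # Raised while the array is still being built, so include? never runs.
  end
  :miss
end

p original_style_check('x.domain.com', 'domain.com', 'rbl.example') # => :miss
```

The domain is listed, yet the check reports a miss, because the host lookup raises before the domain lookup is even attempted.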

My fix was to replace the above code with a call to a new helper method:

query_rbls(HOST_RBLS, host, domain.join('.'))

The helper, defined later, makes the actual queries:

def query_rbls(rbls, *subdomains)
  rbls.each do |rbl|
    subdomains.uniq.each do |d|
      begin
        response = IPSocket.getaddress([d, rbl].join('.'))
        throw :hit, "#{rbl} positively resolved #{d} => #{response}"
      rescue SocketError
        # NXDOMAIN response => negative:  d is not in RBL
      end
    end
  end
  return false
end
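To try the per-name approach without touching the network, the lookup can be made injectable. The sketch below is a variant of the helper, not the patched Typo code: the `resolver:` keyword, `FAKE_ZONE`, `FAKE_RESOLVER`, `first_hit`, and the list name rbl.example are assumptions added for illustration, and :hit is assumed to be the symbol the caller catches.

```ruby
require 'socket' # IPSocket and SocketError

# Variant of the helper with an injectable resolver (an assumption made
# here so the example can run offline).
def query_rbls_with(rbls, *names, resolver: IPSocket.method(:getaddress))
  rbls.each do |rbl|
    names.uniq.each do |name|
      begin
        response = resolver.call([name, rbl].join('.'))
        throw :hit, "#{rbl} positively resolved #{name} => #{response}"
      rescue SocketError
        # Not listed under this name; move on to the next one.
      end
    end
  end
  false
end

# Made-up listing: only the domain form is present.
FAKE_ZONE = { 'domain.com.rbl.example' => '127.0.0.4' }.freeze
FAKE_RESOLVER = lambda do |name|
  FAKE_ZONE.fetch(name) { raise SocketError, "no address for #{name}" }
end

def first_hit(host, domain)
  catch(:hit) do
    query_rbls_with(['rbl.example'], host, domain, resolver: FAKE_RESOLVER)
  end
end

p first_hit('x.domain.com', 'domain.com')
p first_hit('blog.example.org', 'example.org')
```

Because each lookup has its own rescue, a failed host query no longer prevents the domain query from running, so the first call reports a hit and the second returns false.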

Because some SURBLs don’t use 127.0.0.2 but some other “A” record to indicate a positive response, my helper removes the hard-coded address test.

I also made a few more improvements to the spam-protection code. The full set of changes is available as Patch 657 on the Typo Trac site.