I noticed that my site has been picking up more comment spam recently. Typo has built-in spam protection, but for some reason a few spam comments that ought to have been caught slipped through its filters. Curious, I investigated.
Most spam comments contain links to sites favored by the spammers. The sites are almost always of the form x.domain.com, where domain is one of a few higher-level domains and x is drawn from a large set of values from the realms of gambling, pornography, and male enhancement. It seems that the spammers pay for a few real domains and then create a ton of subdomains under them.
One of the ways to detect comment spam is to find URIs in comments and look up the sites they point to in DNS-based SURBLs (spam-URI realtime blocklists), such as multi.surbl.org and bsb.empty.us. The thing is, when SURBLs list a spammy site x.domain.com, sometimes they list it under the full hostname x.domain.com and sometimes they list it under the higher-level domain domain.com. To be safe, Typo looks up both forms when it checks for spam.
Here’s the code it uses:
HOST_RBLS.each do |rbl|
  begin
    if [
      IPSocket.getaddress([host, rbl].join('.')),
      IPSocket.getaddress((domain + [rbl]).join('.'))
    ].include?("127.0.0.2")
      throw :hit, "#{rbl} positively resolved #{domain.join('.')}"
    end
  rescue SocketError
  end
end
The code iterates over the list of SURBLs it has and queries each twice – once for the host and once for the domain in question – saving the results of the queries in an array. Then if the array includes a positive response (127.0.0.2), it throws a “hit” notice to the calling code, which will block the associated comment.
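To make the two lookups concrete, here is a minimal sketch of the names that end up being queried. The `host` value is made up, and deriving `domain` from the last two labels of the host is my assumption for illustration (it would be wrong for suffixes such as co.uk); the snippet above only shows that `domain` is an array of labels.

```ruby
# Hypothetical inputs: a host taken from a comment link and one SURBL zone.
host   = 'x.domain.com'
domain = host.split('.').last(2)  # assumption: domain = last two labels of the host
rbl    = 'multi.surbl.org'

# The same joins the snippet above performs before each DNS lookup.
host_query   = [host, rbl].join('.')
domain_query = (domain + [rbl]).join('.')

puts host_query    # x.domain.com.multi.surbl.org
puts domain_query  # domain.com.multi.surbl.org
```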
Unfortunately, the code doesn’t quite work as intended. Although a positive response for either the host or the domain should register as a hit, the code requires both queries to return positive responses. As a result, the code yields a lot of false negatives because most lists don’t include both host and domain forms of spammy sites; the required double positive is thus hard to obtain.
The cause of the problem is the attempt to query for both forms of the site before checking either response. The queries are performed by calling IPSocket.getaddress, which performs a DNS query for the “A” record associated with its argument. If the record exists, the call returns it; otherwise, the call raises a SocketError exception.
The exception is what causes the logic to break down. When either the host or domain is not in the queried SURBL, which will almost always be the case for reasons I explained earlier, one of the queries will result in a SocketError exception. The exception will be caught by the rescue clause later in the code, but not before the opportunity to test the other query’s response and throw a “hit” has been lost.
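The failure can be reproduced without touching DNS. The sketch below swaps IPSocket.getaddress for a hypothetical stub, `fake_getaddress`, that resolves only names in a made-up `LISTED` table and raises SocketError for anything else. The zone `rbl.example` and the method `original_style_check` are likewise invented for the demonstration.

```ruby
require 'socket'  # defines SocketError

# Hypothetical blocklist contents: only the domain form is listed.
LISTED = { 'domain.com.rbl.example' => '127.0.0.2' }.freeze

# Stand-in for IPSocket.getaddress: returns the listed address or raises
# SocketError, as a real lookup does when the name does not resolve.
def fake_getaddress(name)
  LISTED.fetch(name) { raise SocketError, "cannot resolve #{name}" }
end

# Same shape as the original check: both lookups sit inside one array literal.
def original_style_check(host, domain, rbl)
  begin
    if [
      fake_getaddress("#{host}.#{rbl}"),   # raises: the host form is not listed
      fake_getaddress("#{domain}.#{rbl}")  # never evaluated
    ].include?('127.0.0.2')
      return :hit
    end
  rescue SocketError
    # The exception is swallowed here, and the listed domain goes unnoticed.
  end
  :miss
end

result = original_style_check('x.domain.com', 'domain.com', 'rbl.example')
puts result  # prints "miss" even though domain.com is listed
```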
My fix was to replace the above code with a call to a new helper method:
query_rbls(HOST_RBLS, host, domain.join('.'))
The helper, defined later, makes the actual queries:
def query_rbls(rbls, *subdomains)
  rbls.each do |rbl|
    subdomains.uniq.each do |d|
      begin
        response = IPSocket.getaddress([d, rbl].join('.'))
        throw :hit, "#{rbl} positively resolved #{d} => #{response}"
      rescue SocketError
        # NXDOMAIN response => negative: d is not in RBL
      end
    end
  end
  return false
end
Because some SURBLs don’t use 127.0.0.2 but some other “A” record to indicate a positive response, my helper removes the hard-coded address test.
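For illustration, here is a sketch of how calling code might receive the hit. It is not Typo's actual calling code. The variant `query_rbls_with` takes the resolver as an extra argument, a change I made only so the example runs offline; the `stub` lambda, the zone `rbl.example`, and the listed address 127.0.0.4 are all made up.

```ruby
require 'socket'  # defines SocketError

# Same logic as query_rbls, but the resolver is injected (hypothetical change)
# so that no real DNS lookups are needed to try it out.
def query_rbls_with(resolver, rbls, *subdomains)
  rbls.each do |rbl|
    subdomains.uniq.each do |d|
      begin
        response = resolver.call([d, rbl].join('.'))
        throw :hit, "#{rbl} positively resolved #{d} => #{response}"
      rescue SocketError
        # Name did not resolve: d is not listed in this RBL.
      end
    end
  end
  false
end

# Made-up blocklist: lists only the domain form, with a non-127.0.0.2 answer.
listed = { 'domain.com.rbl.example' => '127.0.0.4' }
stub = ->(name) { listed.fetch(name) { raise SocketError, "cannot resolve #{name}" } }

# catch returns the thrown message on a hit, or the block's value (false) otherwise.
verdict = catch(:hit) do
  query_rbls_with(stub, ['rbl.example'], 'x.domain.com', 'domain.com')
end
clean = catch(:hit) do
  query_rbls_with(stub, ['rbl.example'], 'blog.example.org', 'example.org')
end

puts verdict  # rbl.example positively resolved domain.com => 127.0.0.4
puts clean    # false
```

Because each lookup has its own begin/rescue, the failed host query no longer hides the listed domain, and any address the list returns counts as a hit.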
I also made a few more improvements to the spam-protection code. The full set of changes is available as Patch 657 on the Typo Trac site.