Spam stuff: early prototype of “too many user unknown DNSBL”

As posted to SPAM-L… reposting in my journal mostly for my records…

---------- Forwarded message ----------
Date: Mon, 3 Jan 2005 13:06:28 -0800 (PST)
From: Greg Connor
To: SPAM-L@PEACH.EASE.LSOFT.COM
Subject: MISC: early prototype “too many user unknown DNSBL”

This is a project that I’m working on, sort of for work but mostly in my own spare time. It doesn’t actually do anything useful yet, but I wanted to get some feedback on it from you fine folks…

The idea is that I want to keep track of the last 10 transactions from each IP, and if 9 of the last 10 transactions were user unknown, then that IP should go on a local DNSBL for something like 2 hours.
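For readers who haven't run a DNSBL: a listing is conventionally published as a DNS name built from the IP's octets in reverse order, followed by the list's zone. The sketch below only builds that name. The zone `dnsbl.example.org` is a placeholder and not part of this project, which doesn't publish to DNS yet.

```python
import ipaddress


def dnsbl_query_name(ip, zone="dnsbl.example.org"):
    """Build the conventional DNSBL lookup name for an IPv4 address.

    The zone is a made-up placeholder; nothing in this project uses it.
    """
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"
```

A mail server would look this name up and treat any answer as "listed"; a name that doesn't resolve means the IP is not on the list.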

Currently there are two pieces of the puzzle somewhat working.

http://www.nekodojo.org/~gconnor/mydnsbl/myscanner

The “myscanner” script tails a logfile where my mailservers send their syslog output. It takes multiple lines with the same transaction ID and puts the pieces back together, so that the output contains one line per transaction, telling the IP and the result.

Currently “myscanner” only understands Barracuda output, but a similar framework could be used to make sendmail logs into transactional output. It is currently highly dependent on our specific output though.

Transaction ids are kept in a hash, which is pruned of old entries to keep it to a certain memory size. Statistics output at the end (or more frequently if you like) show how many log lines were from an “unknown” transaction and therefore could not be tracked to a specific IP.
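Here is a rough sketch of that reassembly idea in Python. It is not the actual myscanner code. It assumes the log lines have already been parsed into (transaction id, key, value) fragments, since the real Barracuda format isn't shown here, and the class name, the cache cap and the key names are all invented for illustration.

```python
from collections import OrderedDict


class Reassembler:
    """Stitch per-transaction log fragments back into one line each."""

    def __init__(self, max_cache=1000):  # cap is an arbitrary assumption
        self.pending = OrderedDict()  # txn_id -> {"ip": ...}, oldest first
        self.max_cache = max_cache
        self.orphaned = 0  # fragments whose transaction we never saw start

    def feed(self, txn_id, key, value):
        """Record one fragment; return a tally line once the result arrives."""
        if key == "ip":
            self.pending[txn_id] = {"ip": value}
            # Prune the oldest in-flight transactions to bound memory use.
            while len(self.pending) > self.max_cache:
                self.pending.popitem(last=False)
            return None
        record = self.pending.get(txn_id)
        if record is None:
            self.orphaned += 1  # cannot be tied to a specific IP
            return None
        if key == "result":
            del self.pending[txn_id]
            return f"ip={record['ip']}, result={value}"
        return None
```

Fragments that arrive for a transaction that was never seen (or was already pruned) are only counted, which roughly corresponds to the "orphaned" figure in the statistics.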

myscanner has a lot of different keywords for different types of transactions to show or hide, different levels of detail, and whether to tail the end of the log (x bytes in and forward) or to start at the beginning and run through to the end. It could probably be modified to create a named pipe so that syslog could write directly to it independent of a disk file, but this was not needed for our purposes.

The default mode is “tally” which produces output suitable for “mycollect” (see below). Sample output looks like this:

var> ~gconnor/bin/myscanner
ip=192.48.159.22, result=Sender address rejected: Domain not found
ip=70.58.157.59, result=Recipient address rejected: Blocked
ip=216.120.237.254, result=No such user: brushlike19debacle@holodeck.engr.sgi.com
ip=70.65.159.153, result=Recipient address rejected: Blocked
ip=62.47.11.178, result=Blocked: spam
ip=85.96.172.59, result=Recipient address rejected: Blocked
ip=153.2.234.220, result=Delivered
ip=210.235.235.149, result=Blocked: spam
ip=82.81.114.69, result=No such user: karl.m@sgi.com
ip=81.225.57.70, result=No such user: dizzy523@holodeck.engr.sgi.com
^C
From: Jan  3 12:27:22
  To: Jan  3 12:27:27
lines=1206 orphaned=281(23%) unmatched=0
transcount=242 cache=49 completed=193 incomplete=0(0%)
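If you want to feed lines like these into your own tooling, a parser along the following lines should do. This is a sketch, not what mycollect does. In particular, the mapping from result text to bad/ok/wtf is a guess based only on the sample output above.

```python
import re

# Matches tally lines such as "ip=153.2.234.220, result=Delivered".
LINE_RE = re.compile(r"^ip=(?P<ip>[^,\s]+), result=(?P<result>.*)$")


def parse_tally_line(line):
    """Return (ip, result) for a tally line, or None for anything else."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None  # e.g. the statistics lines printed at the end
    return match.group("ip"), match.group("result").strip()


def classify(result):
    """Guess an outcome bucket; this mapping is an assumption."""
    if result.startswith("No such user"):
        return "bad"
    if result == "Delivered":
        return "ok"
    return "wtf"
```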

Note that if your mail program already reports the result (Sent OK, Unknown user, spam, unknown domain, etc) on the same line as the IP, you probably don’t need myscanner; just alter mycollect to interpret the log format of your mailer.

http://www.nekodojo.org/~gconnor/mydnsbl/mycollect

mycollect keeps track of every IP seen so far in a hash, and with each IP it keeps the result of the most recent 10 transactions, where “result” is either bad, ok or wtf. If 9 of the last 10 transactions are “bad” then the IP is added to the internal “blocked list”. (I haven’t arranged to push the data to DNS since I’m still investigating the feasibility and cost/benefit of such a DNSBL.)
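The "9 of the last 10" rule can be sketched with a fixed-length queue per IP. This is an illustration in Python, not the mycollect source, and the class and method names are invented. One assumption to flag: this version blocks as soon as 9 bad results are in the window, even if the IP has only been seen 9 times.

```python
from collections import defaultdict, deque


class Collector:
    """Track the most recent results per IP and flag repeat offenders."""

    def __init__(self, window=10, threshold=9):
        self.threshold = threshold
        # Each deque silently drops its oldest entry once it is full.
        self.history = defaultdict(lambda: deque(maxlen=window))
        self.blocked = set()

    def record(self, ip, outcome):
        """Store one outcome ("bad", "ok" or "wtf").

        Returns True only when this call newly adds the IP to the block list.
        """
        results = self.history[ip]
        results.append(outcome)
        bad_count = sum(1 for item in results if item == "bad")
        if ip not in self.blocked and bad_count >= self.threshold:
            self.blocked.add(ip)
            return True
        return False
```

Because the queue has a maximum length, two good results among the last ten are enough to keep an IP off the list no matter how many bad ones came earlier.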

mycollect also keeps detailed statistics, which is its main job right now. Because I’m still in the investigation phase, I wanted to get detailed info about how many IPs would be blocked, and how many messages that made it through would have been blocked.

Statistics are reported at every 10,000 transactions. Output looks like:

> ~gconnor/bin/myscanner today show rbl | ~gconnor/bin/mycollect
total = 10000 (100%) (rbl = 73, ok = 2, bad = 24)
would block = 63 (0%) (rbl = 0, ok = 0, bad = 0)
cache size = 5850, blocks size = 13
...
total = 150000 (100%) (rbl = 69, ok = 1, bad = 28)
would block = 13992 (9%) (rbl = 0, ok = 0, bad = 9)
cache size = 26296, blocks size = 1163

This indicates that after 10,000 transactions, 13 IPs would have been added to the blocked list, and 63 of those transactions would have been avoided had the DNSBL been real. Later, after 150,000 transactions (about an hour for us), the BL has grown to 1163 IPs and would be blocking 9% of all transactions if it were live.

As I said, right now my main target is to see statistics from the simulation as to how much incoming mail would really be blocked. I have seen the number grow to about 26% on certain days, meaning that the DIY-BL would cancel 26% of all mail transactions if conditions are right. It looks like most of the mail that would have been blocked would have resulted in “User unknown” or would have been caught by other tests anyway, but the real test will be to compare the messages that would have gotten through (ok) but are now stopped, to see if the “ok” number drops significantly.

Still to do are purging the IP cache (to keep it to a certain size by removing old entries and keeping recent ones) and expiring entries from the block list after 2 hours to see if there’s any change in effectiveness.
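One possible shape for those two to-do items, sketched in Python. None of this exists in the scripts yet. The class names and the default cache size are made up, and the 2-hour timeout is simply the figure mentioned above.

```python
import time
from collections import OrderedDict

BLOCK_TTL = 2 * 60 * 60  # two hours, in seconds


class BlockList:
    """Block list whose entries lapse after a fixed time."""

    def __init__(self, ttl=BLOCK_TTL):
        self.ttl = ttl
        self.expires = {}  # ip -> timestamp at which the block ends

    def add(self, ip, now=None):
        now = time.time() if now is None else now
        self.expires[ip] = now + self.ttl

    def is_blocked(self, ip, now=None):
        now = time.time() if now is None else now
        expiry = self.expires.get(ip)
        if expiry is None:
            return False
        if now >= expiry:
            del self.expires[ip]  # lazily drop the expired entry
            return False
        return True


class IPCache:
    """Per-IP storage that evicts the least recently seen IPs."""

    def __init__(self, max_ips=50000):  # size is an arbitrary assumption
        self.max_ips = max_ips
        self.entries = OrderedDict()  # ip -> list of recent results

    def touch(self, ip):
        """Return the entry for ip, marking it as most recently seen."""
        entry = self.entries.setdefault(ip, [])
        self.entries.move_to_end(ip)
        while len(self.entries) > self.max_ips:
            self.entries.popitem(last=False)  # evict the stalest IP
        return entry
```

Expired blocks are removed lazily on lookup, so no background timer is needed. A periodic sweep would be the alternative if the list has to be exported somewhere.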

Any feedback is appreciated. Right now it is pretty customized to my environment, but if you feel like playing with it and running your own output through it, please feel free. I would be interested to see what kind of numbers you come up with :)

Thanks for taking the time to read!
gregc

EDIT:
Stats gathered from the work server so far today: Out of 2M transactions, 64% were blocked by sbl/xbl/njabl, 33% were user unknown/spam/other error, 1% was accepted. Using a simulated “9 strikes you’re out” BL, 19% would move from user unknown to BL for a total of 83%. (1% of transactions were added to my simulated BL and then showed up on a real RBL later.) Those 2M transactions came from 124716 unique IPs, and 15416 were added to the simulated BL.

0 thoughts on “Spam stuff: early prototype of “too many user unknown DNSBL”

  1. torquemada

    I like the idea of this kind of monitoring, but I think being able to unblock by IP might be a good idea; that is, continue to monitor traffic from blocked IPs, and if it looks like valid traffic is coming in, allow the option to remove the block. I’m thinking here in particular of IP ranges handed out over DHCP by large hosts (Comcast, etc), where a particular spammer may change IPs daily (or more often), so that the blocked IP can get handed off to someone who isn’t spamming.

    1. nekodojo

      I agree. I think the first version will just keep IPs on the list for 2 hours and then give them another bite at the apple. 2h or 4h is probably not enough to be annoying.

      I also like the policy CBL has which is “Removal is free, just click here” – anyone getting their mail refused can request removal, but the IP will get added back to the list if more spew comes from it.

      Probably a combination of the two would work well. Also if the data could be used to feed another BL (like CBL or XBL) then a timeout would probably be the best – by the time the bad IP falls off the local list, it has already been submitted to a larger BL.

  2. traveller_blues

    Does this cover the corner case of someone spoofing someone else’s IP — or, for that matter, the dynamic IP allocation done by some ISPs? To wit — will I get my e-mail to you blocked for two hours because someone else had my IP for a little while, uploaded tons of spam, and then shut down their dialup?

    Other than those concerns? I like this idea. Lots.

    -Traveller

    1. nekodojo

      You are right, a dynamic IP might come with a “history” – only a problem if your actual mail server is also a static IP (though it’s possible for a server to “camp” on an IP and keep the same address indefinitely as long as the server is up). A timeout would probably take care of that, or some quick-request removal service.

      It’s near impossible to spoof the IP, since there has to be a two-way communication for the spam coming in to work. That’s because the first packet from the spammer says “I’d like to start a conversation”, and the reply from me says “Great, here is a session number, use it for any further packets”. For this purpose, I would only use the IP that my server says it was actually talking to, not something picked up from the headers of the message which might be more easily spoofed.
