As posted to SPAM-L… reposting in my journal mostly for my records…
---------- Forwarded message ----------
Date: Mon, 3 Jan 2005 13:06:28 -0800 (PST)
From: Greg Connor
Subject: MISC: early prototype “too many user unknown DNSBL”
This is a project that I’m working on, sort of for work but mostly in my own spare time. It doesn’t actually do anything useful yet, but I wanted to get some feedback on it from you fine folks…
The idea is that I want to keep track of the last 10 transactions from each IP, and if 9 of the last 10 transactions were user unknown, then that IP should go on a local DNSBL for something like 2 hours.
Currently there are two pieces of the puzzle somewhat working.
The “myscanner” script tails a logfile where my mailservers send their syslog output. It takes multiple lines with the same transaction ID and puts the pieces back together, so that the output contains one line per transaction, telling the IP and the result.
Currently “myscanner” only understands Barracuda output, but a similar framework could be used to make sendmail logs into transactional output. It is currently highly dependent on our specific output though.
Transaction IDs are kept in a hash, which is pruned of old entries to cap its memory use. Statistics printed at the end (or more frequently if you like) show how many log lines belonged to an “unknown” transaction and therefore could not be traced to a specific IP.
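The reassembly and pruning described above can be sketched roughly as follows. This is not the actual myscanner code: the two-field log format (`<txid> ip=...` and `<txid> result=...`), the `Reassembler` class, and the `MAX_CACHE` cap are all assumptions made for illustration, since the real Barracuda log format is not shown here.

```python
from collections import OrderedDict

MAX_CACHE = 1000  # assumed cap on in-flight transactions (hypothetical value)


class Reassembler:
    """Joins log lines that share a transaction ID into one tally line.

    Assumes a made-up format: '<txid> ip=<addr>' opens a transaction and
    '<txid> result=<text>' closes it.
    """

    def __init__(self, max_cache=MAX_CACHE):
        self.pending = OrderedDict()  # txid -> ip, oldest entry first
        self.max_cache = max_cache
        self.lines = 0
        self.orphaned = 0   # result lines whose transaction is unknown or pruned
        self.unparsed = 0   # lines that match neither assumed pattern
        self.completed = 0

    def feed(self, line):
        """Consume one log line; return a tally line when a transaction completes."""
        self.lines += 1
        txid, _, rest = line.strip().partition(" ")
        key, _, value = rest.partition("=")
        if key == "ip":
            self.pending[txid] = value
            self.pending.move_to_end(txid)
            # Prune the oldest entries so the hash stays within its budget.
            while len(self.pending) > self.max_cache:
                self.pending.popitem(last=False)
            return None
        if key == "result":
            ip = self.pending.pop(txid, None)
            if ip is None:
                self.orphaned += 1
                return None
            self.completed += 1
            return f"ip={ip}, result={value}"
        self.unparsed += 1
        return None
```

Feeding each tailed line to `feed()` and printing every non-`None` return value would give one line per transaction in the same `ip=..., result=...` shape as the tally output.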
myscanner has a lot of different keywords for different types of transactions to show or hide, different levels of detail, and whether to tail the end of the log (x bytes in and forward) or to start at the beginning and run through to the end. It could probably be modified to create a named pipe so that syslog could write directly to it independent of a disk file, but this was not needed for our purposes.
The default mode is “tally” which produces output suitable for “mycollect” (see below). Sample output looks like this:
    var> ~gconnor/bin/myscanner
    ip=188.8.131.52, result=Sender address rejected: Domain not found
    ip=184.108.40.206, result=Recipient address rejected: Blocked
    ip=220.127.116.11, result=No such user: email@example.com
    ip=18.104.22.168, result=Recipient address rejected: Blocked
    ip=22.214.171.124, result=Blocked: spam
    ip=126.96.36.199, result=Recipient address rejected: Blocked
    ip=188.8.131.52, result=Delivered
    ip=184.108.40.206, result=Blocked: spam
    ip=220.127.116.11, result=No such user: firstname.lastname@example.org
    ip=18.104.22.168, result=No such user: email@example.com
    ^C
    From: Jan 3 12:27:22 To: Jan 3 12:27:27
    lines=1206 orphaned=281(23%) unmatched=0 transcount=242 cache=49 completed=193 incomplete=0(0%)
Note that if your mail program already reports the result (Sent OK, Unknown user, spam, unknown domain, etc) on the same line as the IP, you probably don’t need myscanner; just alter mycollect to interpret the log format of your mailer.
mycollect keeps track of every IP seen so far in a hash, and with each IP it keeps the result of the most recent 10 transactions, where “result” is either bad, ok or wtf. If 9 of the last 10 transactions are “bad” then the IP is added to the internal “blocked list”. (I haven’t arranged to push the data to DNS yet, since I’m still investigating the feasibility and cost/benefit of such a DNSBL.)
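A minimal sketch of that per-IP window, assuming the tally line format shown earlier. The `Collector` class and the mapping from result text to bad/ok/wtf are guesses for illustration; the real mycollect rules are not shown in this post.

```python
from collections import defaultdict, deque

WINDOW = 10     # transactions remembered per IP
THRESHOLD = 9   # "bad" results within the window that trigger a block

# Assumed classification rules (hypothetical; adjust to your mailer's wording).
BAD_PREFIXES = ("No such user",)
OK_PREFIXES = ("Delivered",)


def classify(result):
    """Map a result string to 'bad', 'ok' or 'wtf'."""
    if result.startswith(BAD_PREFIXES):
        return "bad"
    if result.startswith(OK_PREFIXES):
        return "ok"
    return "wtf"


def parse_tally(line):
    """Split 'ip=A, result=B' into (A, B)."""
    ip_part, _, result_part = line.strip().partition(", ")
    return ip_part.partition("=")[2], result_part.partition("=")[2]


class Collector:
    def __init__(self):
        # deque(maxlen=...) silently drops the oldest result on overflow.
        self.history = defaultdict(lambda: deque(maxlen=WINDOW))
        self.blocked = set()

    def add(self, line):
        """Record one tally line; return True if the IP is on the blocked list."""
        ip, result = parse_tally(line)
        window = self.history[ip]
        window.append(classify(result))
        if window.count("bad") >= THRESHOLD:
            self.blocked.add(ip)
        return ip in self.blocked
```

One design choice to be aware of: in this sketch an IP whose first 9 transactions are all bad is blocked before it has 10 on record. Requiring a full window first would be a one-line change (`len(window) == WINDOW`).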
mycollect also keeps detailed statistics, which is its main job right now. Because I’m still in the investigation phase, I wanted to get detailed info about how many IPs would be blocked, and how many messages that made it through would have been blocked.
Statistics are reported every 10,000 transactions. Output looks like:
    > ~gconnor/bin/myscanner today show rbl | ~gconnor/bin/mycollect
    total = 10000 (100%) (rbl = 73, ok = 2, bad = 24)
    would block = 63 (0%) (rbl = 0, ok = 0, bad = 0)
    cache size = 5850, blocks size = 13
    ...
    total = 150000 (100%) (rbl = 69, ok = 1, bad = 28)
    would block = 13992 (9%) (rbl = 0, ok = 0, bad = 9)
    cache size = 26296, blocks size = 1163
This indicates that after 10,000 transactions, 13 IPs would have been added to the blocked list, and 63 of those transactions would have been avoided if the DNSBL had been real. Later, after 150,000 transactions (about an hour for us), the BL has grown to 1163 IPs and would be blocking 9% of all transactions if it were live.
As I said, right now my main goal is to see statistics from the simulation showing how much incoming mail would really be blocked. I have seen the number grow to about 26% on certain days, meaning that the DIY-BL would cancel 26% of all mail transactions if conditions are right. It looks like most of the mail that would have been blocked would have resulted in “User unknown” or would have been caught by other tests anyway, but the real test will be to compare the messages that would have gotten through (ok) but are now stopped, to see if the “ok” number drops significantly.
Still to do are purging the IP cache (to keep it to a certain size by removing old entries and keeping recent ones) and expiring entries from the block list after 2 hours to see if there’s any change in effectiveness.
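Neither of those pieces exists yet, so the following is only one possible shape for them, not a description of mycollect. The `BlockList` and `IpCache` classes, the `MAX_IPS` cap, and the injectable `clock` parameter are all hypothetical; only the 2-hour lifetime comes from the plan above.

```python
import time
from collections import OrderedDict

BLOCK_TTL = 2 * 60 * 60  # two hours, in seconds
MAX_IPS = 50000          # assumed cap on tracked IPs (hypothetical value)


class BlockList:
    """Block entries that lapse after a fixed lifetime."""

    def __init__(self, ttl=BLOCK_TTL, clock=time.time):
        self.ttl = ttl
        self.clock = clock   # injectable so expiry can be tested without waiting
        self.expires = {}    # ip -> expiry timestamp

    def block(self, ip):
        self.expires[ip] = self.clock() + self.ttl

    def is_blocked(self, ip):
        expiry = self.expires.get(ip)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            del self.expires[ip]  # lazily drop the expired entry
            return False
        return True


class IpCache:
    """Per-IP records, evicting the least recently seen IP first."""

    def __init__(self, max_ips=MAX_IPS):
        self.max_ips = max_ips
        self.entries = OrderedDict()  # ip -> list of recent results

    def touch(self, ip):
        """Return the record for ip, creating it and evicting old IPs if needed."""
        record = self.entries.setdefault(ip, [])
        self.entries.move_to_end(ip)
        while len(self.entries) > self.max_ips:
            self.entries.popitem(last=False)
        return record
```

Expiring lazily on lookup keeps the sketch short, but `blocks size` would then overcount stale entries until they are queried; a periodic sweep over `expires` would be needed for accurate statistics.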
Any feedback is appreciated. Right now it is pretty customized to my environment, but if you feel like playing with it and running your own output through it, please feel free. I would be interested to see what kind of numbers you come up with :)
Thanks for taking the time to read!
Stats gathered from the work server so far today: Out of 2M transactions, 64% were blocked by sbl/xbl/njabl, 33% were user unknown/spam/other error, 1% was accepted. Using a simulated “9 strikes you’re out” BL, 19% would move from user unknown to BL for a total of 83%. (1% of transactions were added to my simulated BL and then showed up on a real RBL later.) Those 2M transactions came from 124716 unique IPs, and 15416 of those IPs were added to the simulated BL.
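For anyone checking the arithmetic, the figures above combine as follows. The variable names are just labels for the numbers quoted in this post; nothing here is measured independently.

```python
rbl_pct = 64              # blocked by sbl/xbl/njabl
error_pct = 33            # user unknown / spam / other error
moved_pct = 19            # would shift from "user unknown" to the simulated BL
total_ips = 124716        # unique IPs seen
simulated_bl_ips = 15416  # IPs added to the simulated BL

combined_bl_pct = rbl_pct + moved_pct        # share stopped by real + simulated BLs
remaining_error_pct = error_pct - moved_pct  # errors still handled per message
ip_share_pct = 100 * simulated_bl_ips / total_ips

print(f"stopped by a BL: {combined_bl_pct}%")
print(f"still rejected per message: {remaining_error_pct}%")
print(f"share of IPs on the simulated BL: {ip_share_pct:.1f}%")
```

So roughly one in eight of the IPs seen would have landed on the simulated BL.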