Something I am thinking about working on this month. This is probably only interesting to anti-spam advocates, sysadmins, and other hard-core geeks.
I recently posted to SPAM-L:
(This is just a summary of what I am thinking about doing with our incoming crap mail. Feedback is appreciated but I’m mostly just posting to summarize for those who are curious.)
The biggest challenge will be the large volume of mail, so I have decided to take a slightly different approach to scaling the problem: I will start by sampling only a percentage of the load, load-balancing it the same way we do for our web site. That means of the 3M sendmail connections coming in, I can pick 1% of those (30,000) to scrutinize closely, and turn that number up once we prove the approach scales. (This is for altavista.com and a few other “nomail” domains that have not been used for legit email in 18+ months :)
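A rough sketch of the sampling idea in Python (a hypothetical helper – in practice the selection would happen in the load balancer, and hashing the client IP is just one way to do it deterministically):

```python
import hashlib

def sample_connection(client_ip: str, percent: int = 1) -> bool:
    """Deterministically select `percent`% of connections for close
    scrutiny. Hashing the client IP means the same source always
    lands in the same bucket, so a spammer's traffic is either all
    sampled or all passed through."""
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return int(digest, 16) % 100 < percent

# At 1%, roughly 30,000 of 3M daily connections would be selected.
```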
My big concern is that I don’t want to kill ORDB, SBL, or SORBS/SPEWS with 35 queries per second. For now I will probably process only 1% of the traffic (about 0.35 qps) against these, through a normal caching nameserver (the rest will be sent to our normal MX setup, which accepts postmaster and abuse and rejects everything else). I want to get started on processing the mail and looking for sources not yet listed on these. I really do want to set up rsync for those feeds, but for now this is sufficient.
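For the curious, a DNSBL lookup is just a DNS query against a name built from the reversed octets of the client IP. A minimal sketch (the helper names are mine, and `relays.ordb.org` is used only as an example zone; a local caching nameserver in front of this is what keeps the query rate on the list operators down):

```python
import socket

def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the reversed-octet lookup name a DNSBL expects,
    e.g. 127.0.0.2 against relays.ordb.org -> 2.0.0.127.relays.ordb.org"""
    octets = ip.split(".")
    return ".".join(reversed(octets)) + "." + zone

def listed(ip: str, zone: str) -> bool:
    """True if the IP resolves in the given DNSBL zone (i.e. is listed)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```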
My goal is to start building a local blacklist that I can consult before the other DNSBLs, and once it has built up enough to catch a large percentage of the crap, I can turn up the volume. The starting criteria will be something like:
1. sighted in ORDB, SBL, etc – block for 24 hrs, then eligible to query again (because we don’t need to find spam sources someone else already found)
2. sent us 100 or more messages that were 99% spam according to SpamAssassin – block for 2 weeks (this is the useful info I want – spammers’ IP addresses and headers from the spam)
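Those two starting criteria could be sketched as (a hypothetical helper; the thresholds are the ones named above):

```python
DAY = 86400  # seconds

def block_duration(found_in_dnsbl: bool, msg_count: int, spam_ratio: float) -> int:
    """Return how long (in seconds) to blacklist a source locally.
    Rule 1: already listed elsewhere -> short block, re-query later.
    Rule 2: our own evidence via SpamAssassin -> longer block."""
    if found_in_dnsbl:
        return 1 * DAY
    if msg_count >= 100 and spam_ratio >= 0.99:
        return 14 * DAY
    return 0  # not blocked

# Example: a host that sent 150 messages, 99.5% spam, not yet in any
# DNSBL, gets a two-week local block.
```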
I’ll post again when I have anything interesting to report :)
My previous post on the subject was this:
I wrote a while ago about our “dead domains mail server” – a pair of sendmail servers that serve nothing but bounces for our dead and much-abused domains. I am now seeking some more specific advice from folks who have had to deal with a large volume of crap mail. We currently get about 3 million sendmail connections per day; 99.9% of these just get a “User unknown” rejection. Since I have some time on my hands, I would like to start analyzing the stream of crap so I can feed some useful info back to the anti-spam community.
Here is what I want to accomplish.
1. Use best-available technology to filter the crap. Dump anything already catchable by ORDB, SBL, SPEWS. (I can figure out how to do this, but I’m interested in comments on which DNSBLs seem to be most useful and timely.)
2. Send the crap that does make it through step 1 into SpamAssassin, so that it can feed DCC/Razor. (I can work out how to do this, but I’m not sure how much benefit this gives back to the community.)
3. Periodically view a “sample” of the crap to try to catch more hosts that should be blocked but aren’t in any DNSBL yet. (Tips are welcome here: things that people usually look for, what criteria something should meet before posting it to “sightings”, etc.)
4. Catch any bounces from forged mail and let those postmasters know that they should (a) reject the mail instead of accepting then bouncing, and (b) check out SPF to help catch forgery on the front line.
5. If I can figure out how, I would like to add a record to a mysql database for each incoming mail, with fields something like:
- originating IP and rDNS, maybe AS or ARIN netblock
- claimed mail-from address
- found in SBL, SPEWS, ORDB
(This last one will be the most challenging, and I may or may not do it. It would be useful to report on where the largest streams of crap come from, but it would take a lot of work, and others have probably already done it better than I could.)
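For what it’s worth, here is a sketch of what such a table might look like, using Python’s sqlite3 in place of MySQL purely for illustration (all column names here are hypothetical, not a finished design):

```python
import sqlite3

# In-memory database stands in for the proposed MySQL instance.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE crap_mail (
        id          INTEGER PRIMARY KEY,
        seen_at     TEXT,     -- when the connection arrived
        src_ip      TEXT,     -- originating IP
        src_rdns    TEXT,     -- reverse DNS, if any
        netblock    TEXT,     -- AS or ARIN netblock (optional)
        mail_from   TEXT,     -- claimed envelope sender
        dnsbl_hits  TEXT      -- e.g. 'SBL,SPEWS', empty if unlisted
    )
""")

# One record per incoming mail; reporting on top spam sources is then
# a simple GROUP BY src_ip or GROUP BY netblock.
conn.execute(
    "INSERT INTO crap_mail (seen_at, src_ip, src_rdns, mail_from, dnsbl_hits)"
    " VALUES (?, ?, ?, ?, ?)",
    ("2004-01-01T00:00:00", "127.0.0.2", "example.invalid",
     "spam@example.invalid", "SBL"),
)
rows = conn.execute("SELECT src_ip, dnsbl_hits FROM crap_mail").fetchall()
```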
One thing I am thinking about is possibly switching to Postfix, if it would make the job of sending stuff to SA, logging to a database, etc. a bit easier. Anyone have thoughts or suggestions on this?
Thanks for any pointers.