SpamAssassin/procmail autolearning setup

Posted for willowisp and callicrates but others might be interested too.

Here is how I have SpamAssassin set up at work and at home. I wanted a system where SpamAssassin mostly runs by itself, but where I can correct it when it learns something wrong.

1. Edit ~/.procmailrc to contain the following:

# UNALTERED BACKUP JUST IN CASE
# (MMYY must be set earlier in .procmailrc; MMYY=`date +%m%y` is one way, assumed here)

:0 c:
Incoming.backup.$MMYY

# SPAMASSASSIN FILTERING (only if size < 256 k)

:0fw: spamassassin.lock
* < 256000
| /usr/bin/spamassassin

# Second copy to track the autolearn
# Classified as learned-spam, learned-not-spam, unlearned

:0c:
* ^X-Spam-Status:.*autolearn=spam
Spam/track-learned-spam

:0c:
* ^X-Spam-Status:.*autolearn=ham
Spam/track-learned-not-spam

:0c:
* ^X-Spam-Status:.*autolearn=no
Spam/track-unlearned

# Tag not found means unlearned
:0c:
* ! ^X-Spam-Status:.*autolearn
Spam/track-unlearned

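# Decide on the spam level you want to auto-reject (7 stars here = score of 7 or more)
# Leave commented until you are sure spam tagging is working right
# The stars are backslash-escaped because the condition is a regular expression
#:0
#* ^X-Spam-Level: \*\*\*\*\*\*\*
#/dev/null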
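
To check where a saved message would end up without running it through procmail, the tracking recipes above can be mimicked in plain shell. This is only a sketch: the function name autolearn_class is made up, it assumes the X-Spam-Status header may be folded across several lines, and it reports "other" for any autolearn value that none of the recipes file.

```shell
#!/bin/sh
# Hypothetical helper (not part of procmail or SpamAssassin):
# read one message on stdin and print which tracking folder
# the recipes above would pick for it.
autolearn_class() {
    awk '
        /^$/     { exit }                       # blank line ends the headers
        /^[ \t]/ { if (in_status) status = status $0; next }  # folded line
                 { in_status = 0 }
        /^X-Spam-Status:/ { in_status = 1; status = $0 }
        END {
            if (status ~ /autolearn=spam/)
                print "learned-spam"
            else if (status ~ /autolearn=ham/)
                print "learned-not-spam"
            else if (status ~ /autolearn=no/ || status !~ /autolearn/)
                print "unlearned"
            else
                print "other"                   # value not covered by any recipe
        }
    '
}
```

For example, "autolearn_class < some-saved-message" prints one of the four labels. Only the headers are inspected, so an "autolearn=" string in the body does not affect the result.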

2. Enable auto learning in ~/.spamassassin/user_prefs

# Enable the Bayes system
use_bayes               1

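# Enable Bayes auto-learning
# (current releases call this option bayes_auto_learn, matching the threshold
# options below; very old releases used plain auto_learn)
bayes_auto_learn        1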

# Alter the thresholds for auto-learning a bit
# (site defaults are 0.1 for nonspam and 12.0 for spam; anything in between stays unlearned)
bayes_auto_learn_threshold_nonspam      2.0
bayes_auto_learn_threshold_spam         10.0

# Just so that the BAYES_ header will appear in every message
# default score of 0 means the header doesn't appear at all
score BAYES_40 0.001
score BAYES_44 0.001
score BAYES_50 0.001
score BAYES_56 0.001

# Downgrade anything that is not English
ok_languages            en
ok_locales              en
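
As a rough illustration of what the two thresholds above mean, here is a simplified shell sketch. The function name autolearn_decision is invented, and the logic is an assumption: real SpamAssassin decides on a score computed without the Bayes rules and applies further conditions, and the exact handling of scores that sit on a threshold is guessed here.

```shell
#!/bin/sh
# Hypothetical helper: simplified view of the auto-learn decision.
# Usage: autolearn_decision SCORE [NONSPAM_THRESHOLD] [SPAM_THRESHOLD]
# The defaults match the user_prefs values shown above (2.0 and 10.0).
autolearn_decision() {
    awk -v s="$1" -v lo="${2:-2.0}" -v hi="${3:-10.0}" 'BEGIN {
        # "+ 0" forces numeric rather than string comparison
        if (s + 0 < lo + 0)       print "ham"    # low enough to learn as ham
        else if (s + 0 >= hi + 0) print "spam"   # high enough to learn as spam
        else                      print "no"     # in between: not learned
    }'
}
```

With these settings a message scoring 1.0 would be learned as ham, while with the site defaults (autolearn_decision 1.0 0.1 12.0) the same message would be left unlearned.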

3. Set up a crontab entry (mine, not root's) that runs once a day and learns from the messages SpamAssassin got wrong or skipped. After learning, it appends those messages to the appropriate learned folder and empties the missed folders. (Folders must be plain unix mailboxes for this to work, not .mbx)

>vi /udir/gconnor/crontab.gconnor
05 03 * * *     /udir/gconnor/bin/spamlearner
:wq

> crontab /udir/gconnor/crontab.gconnor

> vi /udir/gconnor/bin/spamlearner
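#!/bin/sh
# Paths are relative to the home directory
cd "$HOME" || exit 1

# Learn missed ham, append it to the learned folder, then empty the missed folder
sa-learn --ham --mbox mail/Spam/missed-not-spam &&
  cat mail/Spam/missed-not-spam >> mail/Spam/track-learned-not-spam &&
  cp /dev/null mail/Spam/missed-not-spam

# Same for missed spam
sa-learn --spam --mbox mail/Spam/missed-spam &&
  cat mail/Spam/missed-spam >> mail/Spam/track-learned-spam &&
  cp /dev/null mail/Spam/missed-spam

# Newer SpamAssassin releases call this option --sync
sa-learn --rebuild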
:wq

> chmod +x /udir/gconnor/bin/spamlearner

> /udir/gconnor/bin/spamlearner
Learned from 0 message(s) (0 message(s) examined).
Learned from 0 message(s) (0 message(s) examined).

After this runs correctly for a few days, you may alter the crontab entry to send stdout to /dev/null (append "> /dev/null" to the command) so cron stops mailing you the output.

4. If spamassassin learns something incorrectly, remove it from the “learned” folder and place it in the “missed” folder.

Incorrectly learned as good:
Spam/track-learned-not-spam -> Spam/missed-spam

Incorrectly learned as spam:
Spam/track-learned-spam -> Spam/missed-not-spam

5. Periodically look at the “unlearned” folder and place stuff in missed-spam or missed-not-spam to be learned.
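
To see at a glance which folders need attention, something like the following could be used. It is a minimal sketch: the function name mbox_count is made up, and it assumes plain unix mailboxes in which every message starts with a "From " line and body lines that begin with "From " have been escaped by the delivery agent.

```shell
#!/bin/sh
# Hypothetical helper: print the number of messages in each mbox file given.
# A missing or unreadable file is reported as 0.
mbox_count() {
    for f in "$@"; do
        # grep -c prints the count of message separator lines
        n=$(grep -c '^From ' "$f" 2>/dev/null)
        printf '%s\t%s\n' "${n:-0}" "$f"
    done
}
```

For example, "mbox_count mail/Spam/track-unlearned mail/Spam/missed-spam mail/Spam/missed-not-spam" prints one count per folder.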

This system is designed for one person. The Bayes system seems to work better when it has a good sample of both good mail and spam, so sharing it is tempting, but if it is used for multiple people they will be able to view everyone else’s good mail (unless you only give them access to the missed folders). It is possible to make the folders appear to multiple people by making symlinks in their mail directories and making sure the original is writeable by the correct group.

I also have a recipe for creating “imap shared folders” in case the symlinks don’t work right. Ask via comment if you want it :)
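
The symlink approach could look roughly like this. It is a sketch under assumptions: the function name share_mbox, the group name, and the directory layout are all invented for illustration, and whether your mail client follows the symlinks and can lock the shared file is something to test.

```shell
#!/bin/sh
# Hypothetical helper: make one mailbox file group-writeable and
# symlink it into one or more users' mail directories.
# Usage: share_mbox MAILBOX GROUP DIR [DIR...]
share_mbox() {
    src=$1
    group=$2
    shift 2
    # The original must be readable and writeable by the shared group
    chgrp "$group" "$src" && chmod g+rw "$src" || return 1
    for dir in "$@"; do
        # Each user sees the folder under its original name
        ln -s "$src" "$dir/$(basename "$src")" || return 1
    done
}
```

A call such as "share_mbox /path/to/mail/Spam/missed-spam spamteam /path/to/other-user/mail" (all names made up) would expose only the missed-spam folder, which keeps the learned good mail private.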
