[conspire] anti spam

Mon Mar 21 17:12:57 PDT 2011

Quoting Adam Cozzette (acozzette at cs.hmc.edu):

> I have to admit that I haven't actually tweaked SpamAssassin that
> much, because most of the time it works well enough. I did lower the
> maximum allowed spam score down from 5.0 to 3.0 some time ago after
> starting to receive a lot of false negatives. Of course, that results
> in a false positive once in a while, but not too often.
> 
> I might try your suggestion of upping the weight of the Bayesian test,
> and then see if that will allow me to increase the allowed spam score
> (and thus get fewer false negatives) without getting too many false
> positives. Thanks for your suggestions.

You're very welcome.  I suspect you should be able to get a fair amount
of improvement.  Don't forget, also, the option to customise your SA 
installation by backfilling custom rules files.  (Use caution:
Third-party custom SA rulesets have not always been well written, and
some have caused Perl to chew up ungodly amounts of RAM and system load.
See:  http://wiki.apache.org/spamassassin/OutOfMemoryProblems)

Implementing custom rulesets is a process in transition, one that has
been put into (one hopes, temporary) confusion by the collapse of the
SARE (SpamAssassin Rules Emporium) effort and Web site.  The confusion
is in part because of the sheer mass of obsolete references to SARE
rules / channels / etc., which has led to the need for further verbiage
warning people to disable/remove SARE rules, as they are no longer being
maintained.  (Oh well.)

Looks like this page is fairly current:
http://wiki.apache.org/spamassassin/CustomRulesets

Since I gave my unsolicited opinion about why MUA-level spam-blocking 
is _inherently inadequate_, and why prior interception at the MTA
(receiving SMTP server) level is orders of magnitude more effective, let
me elaborate for a moment, on why that is:

A receiving SMTP server has a unique window of opportunity to perform
receipt-time checking on the arriving mail, for traits highly correlated
with spam, to do so at low RAM/CPU/time expense, and to take intelligent
action based on those traits _even_ before accepting the mail at all.
That way, the SMTP host avoids letting spam land in disk queues in the
first place, and a 55x Reject DSN (Delivery Status Notification) can be
issued _directly_ to the delivering IP.  This sidesteps the usual
dilemma of whether to discard suspect mail, hide it away in an
inevitably-ignored mbox of suspect messages (what is
euphemistically/unrealistically called 'quarantining'), or (worst
option) issue an outbound reject to the claimed (but probably forged)
sender, which generates 'backscatter spam'.
See: http://en.wikipedia.org/wiki/Backscatter_(e-mail)

The 'traits' I'm thinking of are various but include gross violation of
numerous SMTP technical requirements.  These violations are easily
detectable through low-load checks right inside the MTA's front-end
code.  E.g.:

o  Sending IP is trying a lower-preference MX host first.
o  Sending IP doesn't use valid hostname in the HELO/EHLO SMTP handshake.
o  Sending IP cites a non-existent domain in the envelope headers.
o  Sending IP cites a domain with no valid MX or A records.
o  Sending IP cites a domain whose MX/A hosts refuse return mail to
   the claimed sender.
o  Sending IP cites a domain whose MX/A hosts refuse return mail to
   the RFC-mandated 'postmaster' user.  (The 'abuse' user is also RFC-
   mandated, but far less widely honoured by legitimate systems.)
o  Sending IP disconnected immediately upon getting a 250 accept reply,
   without bothering to do a QUIT.
o  Sending IP puts RFC-disallowed whitespace between MAIL FROM: and
   the angle-bracketed claimed sender.
o  Sending IP is claiming to send mail for a domain that publishes
   SPF or DKIM authentication data, but isn't listed as an authorised
   IP for that domain.
o  Sending IP cannot handle a multiline 220 banner greeting by my
   receiving MTA and drops connections.
o  Sending IP drops connections if asked to wait for modest delays
   before mail is accepted.
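
To make a couple of the purely syntactic checks above concrete, here is
a minimal Python sketch.  It is an illustration only, not code from any
real MTA:  The function names (helo_looks_valid,
mail_from_has_stray_whitespace) are hypothetical, and the hostname test
is a deliberately simplified approximation of what RFC 5321 permits.
A real MTA would normally do this through its own ACL or policy
configuration.

```python
import re

# One DNS label: letters, digits, hyphens; no leading or trailing hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def helo_looks_valid(helo_name: str) -> bool:
    """Return True if a HELO/EHLO argument is a plausible FQDN or
    a bracketed address literal (simplified check, hypothetical helper)."""
    name = helo_name.strip()
    # Address literals such as [192.0.2.1] are allowed in HELO/EHLO.
    if name.startswith("[") and name.endswith("]"):
        return len(name) > 2
    labels = name.split(".")
    # Require at least two labels, so bare names like "localhost" fail.
    if len(labels) < 2 or len(name) > 253:
        return False
    # An all-digit final label means a bare, unbracketed IP address.
    if labels[-1].isdigit():
        return False
    return all(_LABEL.fullmatch(label) for label in labels)


def mail_from_has_stray_whitespace(command_line: str) -> bool:
    """Return True if whitespace sits between 'MAIL FROM:' and '<'."""
    return re.match(r"(?i)MAIL FROM:\s+<", command_line) is not None
```

For instance, helo_looks_valid("localhost") and
helo_looks_valid("192.0.2.1") both return False, while
mail_from_has_stray_whitespace("MAIL FROM: <a@example.com>") returns
True.  The DNS-based checks in the list (MX/A lookups, sender callouts)
need a resolver and network access, so they are left out of this sketch.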

Checking these things at SMTP receiving time is _extremely_ effective
at filtering out spam quickly, with very low load, and with very low
levels of false positives.  It does the heavy lifting of rejecting a 
high percentage of obvious spam without the need to spawn high-load 
external parsers such as SA.

Why do those traits correlate so strongly with spam?  Because spammers'
motto is 'We make it up in volume.'  What I mean is:  The coders who
create malware for legions of zombified Windows machines to crank out
spam (1) cannot be bothered to code carefully and observe SMTP technical
requirements, and (2) are in a tearing hurry.  On the first point:
They're spewing in all directions and see no need to behave like good
Internet citizens.  They just connect to any target IP that answers on
port 25 and see if it'll accept slop.  If it won't, the connection gets
dropped and the malware moves on to the next IP.  On the second:  If the
slop isn't accepted instantly, again, they drop and move to the next
victim.

Performing such checks incidentally _also_ makes any subsequent checking
by high-load parsers like SA more effective:  Because the overwhelming
majority of easily detectable spam has already been 55x-rejected before
SA even enters the picture, it's easier to justify letting SA take its
time and carefully check the mail.  E.g., you can enable SA's consulting
of RBLs, SPF records, and DKIM records, and apply custom SA rulesets,
before deciding spamicity.

We folks who run our own MTAs also take one other easy step that makes
SA more efficient:  We run it in daemonised mode as 'spamd', so that
you don't have a huge Perl process starting and stopping all the time.
(That _might_ also be useful in your usage scenario, so you might want
to look into it.  Used like that, the 'spamc' spam client handles the
query and the communication with the back-end daemon.)
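
As a rough illustration of the spamc/spamd split, here is a small
Python sketch that pipes a message to 'spamc -c' (check-only mode,
which prints a score/threshold pair) and parses the reply.  It assumes
spamd is already running and spamc is on the PATH.  The function names
parse_spamc_check and check_with_spamd are hypothetical, made up for
this example.

```python
import subprocess
from typing import Tuple


def parse_spamc_check(output: str) -> Tuple[float, float]:
    """Parse the 'score/threshold' line printed by 'spamc -c'."""
    score_text, threshold_text = output.strip().split("/", 1)
    return float(score_text), float(threshold_text)


def check_with_spamd(raw_message: bytes,
                     spamc_path: str = "spamc") -> Tuple[float, float]:
    """Hand one raw message to spamd via spamc; return (score, threshold).

    Assumes a running spamd that spamc can reach with its defaults.
    """
    result = subprocess.run(
        [spamc_path, "-c"],      # -c: check only, print score/threshold
        input=raw_message,       # the message is sent on spamc's stdin
        stdout=subprocess.PIPE,
        check=False,             # a non-zero exit is not treated as an error
    )
    return parse_spamc_check(result.stdout.decode("ascii", "replace"))
```

The point of the design is that only the small spamc client starts per
message, while the Perl-based spamd stays resident.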

Anyway, thank you for indulging this and the immediately preceding
mini-rants:  I just get really, really tired of seeing Linux users
speaking as if GMail were the pinnacle of anti-spam effectiveness, when
in fact it's just _not_ very good, even by the standards of half-assed
Linux hobbyist sysadmin efforts with zero administrative effort.