[conspire] Some spam handling

Mon Mar 2 03:29:17 PST 2015

Daniel --

Top of my personal mbox, as seen in mutt:

      1 NDX 150302 Dr. Nelson Cody   ( 53) Top of the day                         
->    2  D+ 150302 ottokar-rybnik at wp (4456) FW: YTZC_Wyploty_UCU_opisy            
      3 NDX 150302 aragornwp at wp.pl   (4456) Re: HTYC_Wyploty_UBW_opisy            
      4 NDX 150302 dfrazierchsc at wp.p (4457) PD: FW: BQOX_Wyploty_UKV_opisy        
      5 NDX 150302 praca at wp.pl       (4456) FW: TDLG_Wyploty_UHZ_opisy            
      6 ND+ 150302 florex at wp.pl      (4456) Re: TMVA_Wyploty_UTL_opisy            
      7 NDX 150302 theressajzlnrajvl (4457) Re: CYZS_Wyploty_UXF_opisy            
      8 NDX 150302 ireneusznych at wp.p (4456) Re: PD: FW: FSNP_Wyploty_UGR_opisy    
      9 NDX 150302 elux at wp.pl        (4456) Re: PD: FW: JXAV_Wyploty_UMO_opisy    
     10 NDX 150302 artbed at wp.pl      (4457) PD: FW: VDCG_Wyploty_UQV_opisy        
     11 NDX 150302 lford at wp.pl       (4456) PD: FW: UDBJ_Wyploty_URD_opisy        
     12 NDX 150302 doniec at wp.pl      (4456) PD: ZKNV_Wyploty_UVD_opisy            
     13 NDX 150302 noconwojciech at wp. (4456) PD: FW: QRBM_Wyploty_UGW_opisy        
     14 NDX 150302 basia1937 at wp.pl   (4456) Re: PD: FW: PSLH_Wyploty_UJQ_opisy    
     15 N X 150302 logcheck system a (  4) linuxmafia.com 2015-03-02 01:02 System 
     16 NDX 150302 detainmentk95 at wp. (4456) Re: ZXSM_Wyploty_UBE_opisy            

All of those similar-looking ones are doubtless the same spams you spoke
of, consisting pretty much entirely of just a Zip archive attachment.

So, the thing is, there's only so much you can do to _automatically_
recognise spam.  The closest one can come to telling the software
'Consider to be spam anything that looks approximately like _this_' 
is to feed mails of that sort to a Bayesian classifier.  This cannot 
really be done fully programmatically:  A human needs to pick them out
and do the feeding.

And that's just what I'm doing.  One of a number of factors Exim4 (the
MTA) uses to decide spamicity is the assessment of spamd, the daemonised
(and system-wide) form of SpamAssassin.  spamd includes a Bayesian
classifier, and you need to continually feed it examples of spam and of
ham (non-spam) that you wish it to generalise from.

I saved the ones I'm referring to below, writing them out to mbox
/tmp/spam.

Then:

linuxmafia:/# su - Debian-exim
Debian-exim@linuxmafia:~$ sa-learn --spam --mbox /tmp/spam
Learned tokens from 0 message(s) (0 message(s) examined)
Debian-exim@linuxmafia:~$

What the hell?

Let's compare against using the same tool to 'learn' an mbox of known 
non-spam, /tmp/ham:

Debian-exim@linuxmafia:~$ sa-learn --ham --mbox /tmp/ham
Learned tokens from 11 message(s) (11 message(s) examined)
Debian-exim@linuxmafia:~$

OK, nothing particularly wrong with the Bayesian classifier; it's
something about those particular messages (the spams).  Doesn't take
much Web-searching to confirm my suspicion:

http://fixunix.com/spamassassin/253119-re-sa-learn-max-message-size.html

  [The] maximal size of message parsed by SA is hardcoded at 256K.
  I think that applies for reporting as well as for checking

That was my recollection, too.  If really huge messages were scanned and
classified, the tokens database files would be easily overwhelmed, and
basically you would end up DoSing yourself.  After manually using mutt
to whack down the size of each of the 14 spam messages in /tmp/spam
(essentially deleting all but about 20 lines of each message's attached
Base64-encoded Zip archive):

Debian-exim@linuxmafia:~$ sa-learn --spam --mbox /tmp/spam
Learned tokens from 14 message(s) (14 message(s) examined)
Debian-exim@linuxmafia:~$

There.  However, I fear that this really won't help much, because spamd
lacks the ability (in the version I have installed, at least) to, say,
read and analyse the first 256kB of any large message and ignore
everything after that.
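
To see which messages in an mbox are over that size, something like the
following shell/awk sketch might do.  It is an illustration only: the
function name mbox_sizes is hypothetical, the 262144-byte default is just
the 256K figure from the post quoted above, and it assumes that every
line starting with 'From ' begins a new message (true only if the mbox
writer escapes such lines inside message bodies).  Sizes are approximate,
counting one byte per newline.

```shell
#!/bin/sh
# mbox_sizes FILE [LIMIT]
# Hypothetical helper: print "index size status" for each message in an
# mbox, where status is OVER if the size exceeds LIMIT bytes, else ok.
# LIMIT defaults to 262144, i.e. the 256K figure quoted in the text.
mbox_sizes() {
    mbox_file=$1
    mbox_limit=${2:-262144}
    awk -v limit="$mbox_limit" '
        # Print the tally for the message that has just ended.
        function report() {
            printf "%d %d %s\n", n, size, (size > limit ? "OVER" : "ok")
        }
        # A "From " line starts a new message (naive mbox splitting).
        /^From / { if (n) report(); n++; size = 0 }
        # Count every line, plus one for its newline.
        { size += length($0) + 1 }
        END { if (n) report() }
    ' "$mbox_file"
}
```

Running 'mbox_sizes /tmp/spam' would then print one line per message,
making it obvious which ones a size-limited scanner would skip.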

So, that probably explains why there's been a flurry of such things
arriving at Mailman.  Not _onto_ the mailing lists, of course, but I'm
sure listadmins see some of it lodging in the Mailman admin queues.
As with all such held spam, it's easy to just disregard it in the queues
and let it age out and get thrown away.

I'm afraid I can't spare the time to do this sort of thing _very_
frequently, especially the bits that require diagnosis time.
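
The hand-trimming in mutt is the part that could plausibly be scripted.
A crude sketch, in shell and awk, under stated assumptions: the function
name mbox_trim and the default of 200 lines are hypothetical, messages
are assumed to be delimited by lines starting with 'From ', and it simply
keeps the first N lines of each message, which chops attachments
mid-stream rather than tidily the way a hand edit would.  Whether
sa-learn is happy with the result is something to test, not something
verified here.

```shell
#!/bin/sh
# mbox_trim [MAX_LINES]
# Hypothetical helper: read an mbox on stdin and write to stdout a copy
# in which each message is cut to at most MAX_LINES lines (default 200).
mbox_trim() {
    mbox_max=${1:-200}
    awk -v max="$mbox_max" '
        # New message: if the previous one was cut short, restore the
        # blank separator line before this "From " line, then reset.
        /^From / { if (n > max) print ""; n = 0 }
        # Keep only the first max lines of the current message.
        { n++; if (n <= max) print }
        # Terminate a truncated final message with a blank line too.
        END { if (n > max) print "" }
    '
}
```

Usage would be along the lines of 'mbox_trim 200 < /tmp/spam >
/tmp/spam.trimmed', followed by the same 'sa-learn --spam --mbox' run
against the trimmed file.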