Spam filtering best practice
Most email users are delighted when, after receiving spam, they tell their ISP, who throws a switch, and suddenly, voilà, no more spam. But the problems do not become apparent for some months, because the ISP's clients don't know what wanted emails they have missed. "Hands up who isn't here!" is clearly not the best way of handling wanted email.
Many ISPs rely on block lists. Osirusoft was a particularly bad example, which mis-identified so many sources of email as spam that it was forced out of business. The service relied on a well-meaning but misguided individual whose methods were flawed enough for people to become angry. Those ISPs who thought, and still think, that they can simply identify certain servers as only ever sending spam are brazenly stupid and potentially negligent. However, perhaps Osirusoft was shut down more on account of inappropriate use of his data than of the data itself. The fallacy of IP blocking, and of those internet providers who use it, is the assumption that genuine email and spam can be told apart with certainty. Other services identify "open relays" and technical attributes likely to lead to a server being misused for spam, but again ISPs who block emails absolutely on the basis of those results and services are misguided.
The only compilation of IP black-lists that has validity is Spamcop. This is because the data is compiled in real time as a result of spam sending servers pinging spam-trap addresses. We used to provide such facilities to Spamcop but the forged addresses of viral emails are potentially a concern to which we did not want to contribute.
So people are beginning to learn that neither the "Hands up who isn't here!" approach nor approaches that assume certainty are good ideas.
The three principles of good spam filtering
- Rejected email should be sent to a spam bin for inspection.
- The nature of emails should be treated as uncertainties rather than certainties.
- No one method of analysis is likely to be wholly correct. Any particular method is likely to be only partially right, so no single method should be relied upon alone.
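As a rough illustration of the three principles, the sketch below keeps rejected mail in a spam bin instead of discarding it, treats each verdict as a probability, and refuses to act on a single test. The Mailbox class, the averaging combiner and the 0.8 threshold are all assumptions made for the example, not details given in this article.

```python
from dataclasses import dataclass, field


@dataclass
class Mailbox:
    """Hypothetical container: an inbox plus a spam bin kept for inspection."""
    inbox: list = field(default_factory=list)
    spam_bin: list = field(default_factory=list)


def combined_spam_probability(estimates):
    """Average several estimates, each between 0 and 1.

    Averaging is only a placeholder combiner for this sketch.
    """
    if not estimates:
        raise ValueError("at least one estimate is required")
    return sum(estimates) / len(estimates)


def file_message(mailbox, message, estimates, threshold=0.8):
    """Route a message; nothing is ever discarded outright."""
    if len(estimates) < 2:
        # Principle 3: no single method is relied upon alone.
        raise ValueError("need estimates from at least two methods")
    probability = combined_spam_probability(estimates)
    if probability >= threshold:
        # Principle 1: rejected mail goes to a bin that can be inspected.
        mailbox.spam_bin.append((probability, message))
    else:
        mailbox.inbox.append(message)
    # Principle 2: the verdict is a probability, not a certainty.
    return probability
```

Keeping the probability alongside each binned message means a later reviewer can look at the least certain rejections first.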
The AntEspam.co.uk approach to spam filtering
METHODOLOGY:- SPAM EQUALS PLUTONIUM
Like many, we started by using SpamAssassin on our server. The system examines email characteristics and assigns probabilities to them which are added together to provide a result. The system looks at all features of emails from technical aspects to specific content.
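The additive idea can be sketched in a few lines. This is not SpamAssassin's code or rule set: the rule names, tests and scores below are hypothetical, and only the principle (each matching rule contributes a score, and the scores are summed and compared with a threshold) is taken from the description above. The threshold of 5.0 mirrors SpamAssassin's default required score.

```python
# Hypothetical rule table: (name, test, score). These are NOT real SpamAssassin rules.
RULES = [
    ("SUBJECT_ALL_CAPS", lambda m: m["subject"].isupper(), 1.5),
    ("MENTIONS_FREE_MONEY", lambda m: "free money" in m["body"].lower(), 2.5),
    ("MISSING_FROM", lambda m: not m.get("from"), 2.0),
]

REQUIRED_SCORE = 5.0  # assumed threshold for this sketch


def additive_score(message):
    """Sum the scores of every rule that matches; return (total, rule names)."""
    hits = [(name, score) for name, test, score in RULES if test(message)]
    total = sum(score for _, score in hits)
    return total, [name for name, _ in hits]


def is_spam(message):
    """Additive verdict: spam once the summed score reaches the threshold."""
    total, _ = additive_score(message)
    return total >= REQUIRED_SCORE
```

A message that trips only one of these rules stays well below the threshold, which is why spammers who learn to avoid a few individual rules can slip through an additive system.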
We applied the standard system to protect a growing number of clients who share a common business and who have email addresses exposed on webpages. As the webmaster needed to monitor the effectiveness of the work he was doing for the clients, and those addresses were used not for private correspondence but for purposes relating only to the business, the addresses were copied to the webmaster. Our experience results from working for five years or so upon the hundreds of spam-traps which resulted.
After applying SpamAssassin we were able to examine both the spam traps for wanted and unwanted emails together with the spam bin for unwanted emails, which inevitably would include some wanted emails from time to time.
A normal configuration of clients and spam bins results in the webmaster seeing none of the false positives, and in each client wasting time re-inventing the wheel, trying to work out individual alterations to filters as solutions to the same problems, with varying levels of perception and success.
In contrast, we were in a uniquely privileged position, overseeing email to a global community of email boxes sharing a common spam-bin within which spam characteristics became obvious.
It soon transpired that up to 90% of spam
- comes from no more than around a dozen sources and
- utilises no more than half a dozen methodologies.
Once one has cracked the nature of these, identifying spam reliably is not simple but at least becomes easier.
In essence we analyse emails and then apply sets of filters on quantum physical principles. Having identified a methodology or linguistic theme, and assigned a probability to the email having arisen from it, one can then multiply that by the probability of the email having been sent from a known spam source or group of sources. Rather than being added together, these probabilities are effectively multiplied, and this probability multiplication approach enables us to be particularly certain about what is spam.
The resulting certainty of real spam identification on the basis of probability combination enables us to be confident. The combination of probabilities derived from different types and classes of test adds to the confidence level, akin to atomic structure in which electrons reside in defined shells according to their number and energy level. If, in the analysis, we see enough layers of probability as electron shells, we find a spam as massive as a dangerous Plutonium atom and know that we don't have to go near it. This means that we only need to inspect the lighter atoms in our spam bins so that we increase the probability of finding that tiny wanted hydrogen atom which needs to rise above the rest to be forwarded to the intended recipient.
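The article does not give a formula for the multiplication step, so the following is only one standard reading of it, offered as an assumption: each probability is converted to odds, the odds are multiplied (which presumes the two tests are independent and starts from an even prior), and the product is converted back to a probability. The function names are hypothetical.

```python
def to_odds(p):
    """Convert a probability strictly between 0 and 1 into odds."""
    if not 0.0 < p < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return p / (1.0 - p)


def multiply_evidence(p_method, p_source):
    """Combine two spam probabilities by multiplying their odds.

    Assumes the two tests are independent and an even (50/50) prior.
    This is an illustrative reading, not the authors' published method.
    """
    combined_odds = to_odds(p_method) * to_odds(p_source)
    return combined_odds / (1.0 + combined_odds)
```

Under these assumptions two tests that are each 90% sure combine to roughly 98.8% (odds of 9 times 9 is 81, and 81/82 is about 0.988), whereas a 90% test paired with an uninformative 50% test stays at 90%. Agreement between different classes of test is what drives the result towards certainty.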
HOW WE ACHIEVE RADIOACTIVE REPROCESSING OF SPAM
Many of our Spam diagnoses are given on our example spam email pages. If you are familiar with spams and conventional scoring systems, you will see how our filters pick up emails which pass through normal configurations.
Ordinary spam filters commonly block between 1% and 10% of wanted emails. These are the sort of filters crudely applied to "Hands up who isn't here!" systems. Although partially successful and academically sound, Bayesian heuristics often cannot work effectively. They are often applied only to single words rather than to groups of words, which are more easily and often more effectively spotted by human than by machine. They contain assumptions that the spammers have circumvented. We look at the ways in which spammers avoid the conventional filters. The commercial computer industry merely mass-markets mass production, and the mass involved becomes such a behemoth that little innovation appears to be desired. In contrast, the spammers are constantly innovating and so avoid the conventional filters, endlessly peddling their wares. Only when spamming ceases to be effective will spammers cease to bother to spam . . .
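The difference between testing single words and testing groups of words can be shown with a small sketch. The phrase library below is hypothetical, and the tokeniser is deliberately crude; the point is only that a phrase test fires on adjacent words and stays silent when the same words appear apart.

```python
import re

# Hypothetical phrase library, for illustration only.
PHRASES = {"act now", "risk free", "wire transfer"}


def tokens(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(words, n):
    """Return every run of n adjacent words, joined by single spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def phrase_hits(text, max_n=2):
    """Return the library phrases (2 to max_n words long) found in the text."""
    words = tokens(text)
    found = set()
    for n in range(2, max_n + 1):
        found.update(g for g in ngrams(words, n) if g in PHRASES)
    return sorted(found)
```

For example, "Please act now, this offer is risk free!" matches both "act now" and "risk free", while "Free advice: act on risk now" contains the same individual words but matches nothing.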
However, the failure of the industry at large to take on the level and speed of innovation of the spammers makes our task easier.
A training in physics drums in the maxim "Spot the Assumption", and like bloodhounds we follow the trail and publicise what we find to our SpamInsights subscribers . . .
You or your company could follow the trail . . . but without the advantage of our background of long experience you'll end up spending fortunes reinventing and duplicating the wheel. Over the years we have accumulated libraries of filters, words and phrases, together with safe and effective probability ratings to apply.
OUR CONCEPTS AND CLASSES OF FILTER
Looking through our resources of example spam emails you'll see banks of colourful and appetizing filters that feed our engines - APPLES, CHERRIES, ORANGES, PIPS and PRICKLYPEARS to name a few, which sometimes get combined into TARTS and a GLUTTON or two at a FEAST.
- APPLES are a library of measurements of up to 280,000 entropic features within the email.
- CHERRIES are nice things
- ORANGES are libraries of ambiguities which can be nice but are acidic or dry depending on the conditions. Cart loads of APPLES and ORANGES are strewn across the road if spammers are in hot pursuit behind us...
- PIPS are libraries of things we don't want, downright nuisances and
- PRICKLYPEARS are easily identified and are best handled with care.
- Further libraries and areas of linguistic analysis include items relating to Nigerian 419 scams (NIGER and WIDOW), viruses (DOOM) and phishing (BANKSCAM)
- Deliberate inexactitudes are included to take advantage of better accuracy, wider identification and control provided by Fuzzy Logic feeding into banks of multiplicational matrices of combinational rulesets.
If we don't like what we see we blast it into outer cyberspace with our PHAZOR - you'll see OUTPHAZORED - but these emails which end up in the spam bin have to be pretty bad to be blasted beyond the bin into oblivion. We cease to look at emails scoring beyond 800 but we have some control spam-traps to monitor example emails scoring beyond 4000.
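The score limits mentioned above lend themselves to a simple triage of the spam bin. In the sketch below the two limits (800 and 4000) come from the text; the representation of the bin as (score, subject) pairs and the function names are assumptions made for the example.

```python
INSPECTION_LIMIT = 800   # from the text: scores beyond this are not inspected
CONTROL_LIMIT = 4000     # from the text: control spam-traps monitor scores beyond this


def review_queue(spam_bin):
    """Binned emails worth a human look, lowest score first.

    spam_bin is assumed to be a list of (score, subject) pairs.
    """
    candidates = [entry for entry in spam_bin if entry[0] <= INSPECTION_LIMIT]
    return sorted(candidates, key=lambda entry: entry[0])


def control_samples(spam_bin):
    """Very high scorers, kept only as control examples."""
    return [entry for entry in spam_bin if entry[0] > CONTROL_LIMIT]
```

Sorting the queue by ascending score puts the lightest entries, where a wanted email is most likely to be hiding, at the front of the inspection.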
Beyond our gluttonous tarts and phazors, we identify a further layer of consideration for analysis by yet further sophisticated banks of filters which work on wholly new criteria.
Care and experience are necessary, together with knowledge of the assumptions underlying the application of each component test, analysis and ruleset. The fuzzy logic combinational multiple probabilistic approach is littered with checks and balances so that, as a matter of probability, no one series of factors leads to a series of unfortunate events leading to the loss of a wanted email.
David Pinnegar BSc ARCS
We see so much spam and have developed filters for so many years that we are able to supply expert consultancy advice on spam reduction issues. Please contact us to enquire.