qsf - quick spam filter
Filtering: qsf [-snrAtav] [-d DB] [-g DB]
[-L LVL] [-S SUBJ] [-H MARK] [-Q NUM]
Training: qsf -T SPAM NONSPAM [MAXROUNDS] [-d DB]
Retraining: qsf -[m|M] [-d DB] [-w WEIGHT] [-ayN]
Database: qsf -[p|D|R|O] [-d DB]
Database merge: qsf -E OTHERDB [-d DB]
Allowlist query: qsf -e EMAIL [-m|-M|-t] [-d DB] [-g DB]
Denylist query: qsf -y -e EMAIL [-m -m|-M -M|-t] [-d DB] [-g DB]
Help: qsf -[h|V]
qsf reads a single email on standard input, and by default outputs it on standard output. If the email is determined to be spam, an additional header ("X-Spam: YES") will be added, and optionally the subject line can have "[SPAM]" prepended to it.
qsf is intended to be used in a procmail(1) recipe, in a ruleset such as this:
For more examples, including sample procmail(1) recipes, see the EXAMPLES section below.
Before qsf can be used properly, it needs to be trained. A good way to train qsf is to collect a copy of all your email into two folders - one for spam, and one for non-spam. Once you have done this, you can use the training function, qsf -T SPAM NONSPAM, where SPAM and NONSPAM are the two folders.
This will generate a database that can be used by qsf to guess whether email received in the future is spam or not. Note that this initial training run may take a long time, but you should only need to do it once.
To mark a single message as spam, pipe it to qsf with the --mark-spam or -m ("mark as spam") option. This will update the database accordingly and discard the email.
To mark a single message as non-spam, pipe it to qsf with the --mark-nonspam or -M ("mark as non-spam") option. Again, this will discard the email.
If a message has been mis-tagged, simply send it to qsf as the opposite type, i.e. if it has been mistakenly tagged as spam, pipe it into qsf --mark-nonspam --weight=2 to add it to the non-spam side of the database with double the usual weighting.
The qsf options are listed below.
If you prefix the filename with a TYPE, of the form btree:$HOME/.qsfdb, then this will specify what kind of database FILE is, such as list, btree, gdbm, sqlite and so on. Check the output of qsf -V to see which database backends are available. The default is to auto-detect the type, or, if the file does not already exist, use list. Note that TYPE is not case-sensitive.
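The TYPE prefix rule above can be pictured with a small Python sketch. It is illustrative only and is not qsf's own parser: the function name split_db_spec and the KNOWN_TYPES set are assumptions, and the backends really available depend on how qsf was built (see qsf -V).

```python
# Backend names mentioned in this manual; treat the set as an assumption,
# since the real list depends on how qsf was compiled.
KNOWN_TYPES = {"list", "btree", "gdbm", "sqlite", "mysql"}


def split_db_spec(spec):
    """Split 'TYPE:FILE' into (type, file).

    The type is lowercased because TYPE is not case-sensitive.
    When no known prefix is present, the type is None, which stands
    for "auto-detect, or use list if the file does not exist".
    """
    prefix, sep, rest = spec.partition(":")
    if sep and prefix.lower() in KNOWN_TYPES:
        return prefix.lower(), rest
    return None, spec


print(split_db_spec("BTree:/home/user/.qsfdb"))
print(split_db_spec("/home/user/.qsfdb"))
```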
Normally you would not need to use the deny-list.
If EMAIL is just the word MSG on its own, then an email will be read from standard input, and the email addresses given in the "From:" and "Return-Path:" headers will be used.
Using -e automatically switches on -a.
If you also specify -y, then the deny-list will be operated on. Remember that -m and -M are reversed with the deny-list.
If you specify an email address of the form @domain (nothing before the @), then the whole domain will be allow or deny listed.
This can be used to decide which backend is best on your system. Use -d to select a backend, eg qsf -B spam nonspam -d GDBM - this will create a temporary database which is removed afterwards.
The exception to this is the MySQL backend, where a full database specification must be given (-d MySQL:database=db;host=localhost;...) and the database table given will not be wiped beforehand or dropped afterwards.
As with -T, if MAXROUNDS is specified, training will never be done for more than this number of rounds; the default is 200.
The following options are only for use with the old binary tree database backend or old databases that haven't been upgraded to the new format that came in with version 1.1.0.
Currently, you cannot use qsf to check for spam while the database is being updated. This means that while an update is in progress, all email is passed through as non-spam.
There is an upper size limit of 512Kb on incoming email; anything larger than this is just passed through as non-spam, to avoid tying up machine resources.
The plaintext token mapping maintained by --plain-map will never shrink, only grow. It is intended for use by housekeeping and user interface scripts that, for instance, the user can use to list all email addresses on their allow-list. These scripts should take care of weeding out entries for tokens that are no longer in the database. If you have no such scripts, there is probably no point in using --plain-map anyway.
Avoid using the deny-list (-y) in any automated retraining, as it can cause the filter to reject mail unnecessarily. In general the deny-list is probably best left unused unless explicitly required by your particular setup.
If both the allow-list and the deny-list are enabled, then email addresses will first be checked against the deny-list, then the allow-list, then the domain of the email address will be checked for matching "@domain" entries in the deny-list and then in the allow-list.
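The lookup order described above can be sketched in Python as follows. This is only an illustration: check_lists is a hypothetical name, the lists are plain sets of strings (qsf stores hashes in its database instead), and lowercasing the address is an assumption.

```python
def check_lists(address, deny, allow):
    """Return 'deny', 'allow', or None, using the documented order:
    full address in deny-list, full address in allow-list,
    then "@domain" in deny-list, then "@domain" in allow-list."""
    address = address.lower()  # assumption: matching ignores case
    domain = "@" + address.rpartition("@")[2]
    for candidate in (address, domain):
        if candidate in deny:
            return "deny"
        if candidate in allow:
            return "allow"
    return None


deny = {"@spam.example"}
allow = {"friend@spam.example"}
print(check_lists("friend@spam.example", deny, allow))
print(check_lists("other@spam.example", deny, allow))
```

In this sketch an allow-listed address wins over a deny-listed domain, because full addresses are checked before "@domain" entries.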
To filter all of your mail through qsf, with the allow-list enabled and the "spam rating" header being added, add this to your .procmailrc file:
If you want qsf to add "[SPAM]" to the subject line of any messages it thinks are spam, do this instead:
To automatically mark any email sent to email@example.com as spam (this is the "naive" version):
To do the same, but cleverly, so that only email to email@example.com which qsf does NOT already classify as spam gets marked as spam in the database (this stops the database getting too heavily weighted):
:0 H
* ^To:.*email@example\.com
{
    :0 wf
    | qsf -a
    # The above two lines can be skipped if you've
    # already piped the message through qsf.

    # If the qsf database says it's not spam,
    # mark it as spam!
    :0 H
    * ^X-Spam: NO
    | qsf -am
}
Remove the -a option in the above examples if you don't want to use the allow-list.
A more complicated filtering example - this will only run qsf on messages which don't have a subject line saying "your <something> is on fire" and which don't have a sender address ending in "@foobar.com", meaning that messages with that subject line OR that sender address will NEVER be marked as spam, no matter what:
For more on procmail(1) recipes, see the procmailrc(5) and procmailex(5) manual pages.
A couple of macros to add to your .muttrc file, if you use mutt(1) as a mail user agent:
Again, remove the -a option in the above examples if you don't want to use the allow-list.
Note, however, that the above macros won't work when operating on multiple tagged messages. For that, you'd need something like this:
If you use qmail(7), then to get procmail working with it you will need to put a line containing just DEFAULT=./Maildir/ at the top of your ~/.procmailrc file, so that procmail delivers to your Maildir folder instead of trying to deliver to /var/spool/mail/$USER, and you will need to put this in your ~/.qmail file:
This will cause all your mail to be delivered via procmail instead of being delivered directly into your mail directory.
See the qmail(7) documentation for more about mail delivery with qmail.
If you use postfix(1), you can set up a system-wide mail filter by creating a user account for the purpose of filtering mail, populating that account's .qsfdb, and then creating a shell script, to run as that user, which runs qsf on stdin and passes stdout to sendmail(8).
Doing this requires some knowledge of postfix configuration and care needs to be taken to avoid mail loops. One qsf user's full HOWTO is included in the doc/ directory with this package.
A feature called the "allow-list" can be switched on by specifying the --allowlist or -a option. This causes messages' "From:" and "Return-Path:" addresses to be checked against a list of people you have said to allow all messages from, and if a message's "From:" or "Return-Path:" address is in the list, it is never marked as spam. This means you can add all your friends to an "allow-list" and qsf will then never mis-file their messages - a quick way to do this is to use -a with -T (train); everyone in your non-spam folder who has sent you an email will be added to the allow-list automatically during training.
You can manually add and remove addresses to and from the allow-list using the -e (email) option. For instance, to add email@example.com to the allow-list, do this:
To remove email@example.com from the allow-list, do this:
And to see whether email@example.com is in the allow-list or not, just do this:
In general, you probably always want to enable the allow-list, so always specify the -a option when using qsf. This will automatically maintain the allow-list based on what you classify as spam or non-spam.
The only times you might want to turn it off are when people on your allow-list are prone to getting viruses or if a virus is causing email to be sent to you that is pretending to be from someone on your allow-list.
Because the database format is platform-specific, it is a good idea to periodically dump the database to a text file using qsf -D so that, if necessary, it can be transferred to another machine and restored with qsf -R later on.
Also note that since the actual contents of email messages are never stored in the database (see TECHNICAL DETAILS), you can safely share your qsf database with friends - simply dump your database to a file, like this:
Once you have sent your-database-dump.txt to another person, they can do this:
They will then have an identical database to yours.
When a message is passed to qsf, any attachments are decoded, all HTML elements are removed, and the message text is then broken up into "tokens", where a "token" is a single word or URL. Each token is hashed using the MD5 algorithm (see below for why), and that hash is then used to look up each token in the qsf database.
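As a rough illustration of the hashing step, the Python sketch below turns tokens into MD5 lookup keys. The function names are hypothetical, and the whitespace-based splitting and lowercasing are simplifying assumptions, not qsf's real tokeniser.

```python
import hashlib
import re


def simple_tokens(text):
    """Very rough tokeniser (assumption): lowercased runs of non-space characters."""
    return [t.lower() for t in re.findall(r"\S+", text)]


def token_key(token):
    """MD5 hex digest of a token, standing in for the database lookup key."""
    return hashlib.md5(token.encode("utf-8")).hexdigest()


for tok in simple_tokens("Hello http://example.com/offer"):
    print(tok, token_key(tok))
```

The same token always gives the same key, so counts can be looked up without the database ever holding the original word.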
For full details of which parts of an email (headers, body, attachments, etc) are used to calculate the spam rating, see the TOKENISATION section below.
Within the database, each token has two numbers associated with it: the number of times that token has been seen in spam, and the number of times it has been seen in non-spam. These two numbers, along with the total number of spam and non-spam messages seen, are then used to give a "spamminess" value for that particular token. This "spamminess" value ranges from "definitely not spammy" at one end of the scale, through "neutral" in the middle, up to "definitely spammy" at the other end.
Once a "spamminess" value has been calculated for all of the tokens in the message, a summary calculation is made to give an overall "is this spam?" probability rating for the message. If the overall probability is 0.9 or above, the message is flagged as spam.
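The text above does not give the formulas qsf uses, so the following Python sketch substitutes a simple, commonly used scheme purely for illustration: a frequency ratio per token and a naive Bayes style product for the summary. Only the 0.9 threshold comes from this manual; the function names, the clamping to 0.01 and 0.99, and both formulas are assumptions.

```python
def token_spamminess(spam_count, ham_count, total_spam, total_ham):
    """Return a value in [0, 1]; 0.5 means neutral or unseen (assumed formula)."""
    if spam_count + ham_count == 0 or total_spam == 0 or total_ham == 0:
        return 0.5
    spam_freq = spam_count / total_spam
    ham_freq = ham_count / total_ham
    p = spam_freq / (spam_freq + ham_freq)
    # Clamp so one token can never make the result absolutely certain.
    return min(0.99, max(0.01, p))


def message_probability(values):
    """Combine per-token values with a naive Bayes product (assumed rule)."""
    spam = ham = 1.0
    for p in values:
        spam *= p
        ham *= 1.0 - p
    if spam + ham == 0.0:  # both products underflowed
        return 0.5
    return spam / (spam + ham)


def is_spam(values, threshold=0.9):
    """Flag the message when the overall probability is 0.9 or above."""
    return message_probability(values) >= threshold


print(is_spam([token_spamminess(9, 1, 10, 10), token_spamminess(8, 0, 10, 10)]))
```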
In addition to the probability test, there is the "allow-list". If it is enabled (with the -a option) and the sender of the message is listed in the allow-list, the whole probability check is skipped and the message is not marked as spam.
When training the database, a message is split up into tokens as described above, and then the numbers in the database for each token are simply added to: if you tell qsf that a message is spam, it adds one to the "number of times seen in spam" counter for each token, and if you tell it a message is not spam, it adds one to the "number of times seen in non-spam" counter for each token. If you specify a weight, with -w, then the number you specify is added instead of one.
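The counter updates described above might look like this in Python. It is a sketch under stated assumptions: the Counts class is hypothetical, each distinct token is counted once per message, and the per-class message total goes up by one regardless of weight. The manual does not say how qsf handles either of those two details.

```python
from collections import defaultdict


class Counts:
    """In-memory stand-in for the qsf database (hypothetical)."""

    def __init__(self):
        self.tokens = defaultdict(lambda: [0, 0])  # token -> [spam, non-spam]
        self.messages = [0, 0]                     # [spam, non-spam] totals

    def train(self, message_tokens, is_spam, weight=1):
        """Add `weight` (default one) to the matching counter of each token."""
        idx = 0 if is_spam else 1
        for tok in set(message_tokens):  # assumption: once per distinct token
            self.tokens[tok][idx] += weight
        self.messages[idx] += 1          # assumption: weight not applied here


db = Counts()
db.train(["cheap", "pills"], is_spam=True)
db.train(["cheap", "lunch"], is_spam=False, weight=2)  # like -M with -w 2
print(dict(db.tokens), db.messages)
```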
To stop the database growing uncontrollably, the database keeps track of when a token was last used. Underused tokens are automatically removed from the database. (The old method was to "prune" every 500 updates).
Finally, the reason MD5 hashes were used is privacy. If the actual tokens from the messages, and the actual email addresses in the allow-list, were stored, you could not share a single qsf database between multiple users because bits of everyone's messages would be in the database - things like emailed passwords, keywords relating to personal gossip, and so on. So a hash is stored instead. A hash is a "one-way" function; it is easy to turn a token into a hash but very hard (some might say impossible) to turn a hash back into the token that created it. This means that you end up with a database with no personal information in it.
When a message is broken up into tokens, various parts of the message are treated in different ways.
First, all header fields are discarded, except for the important ones: From, Return-Path, Sender, To, Reply-To, and Subject.
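A minimal Python sketch of the header selection, using the standard email package, is shown below. The function name kept_headers is hypothetical, and matching header names without regard to case is an assumption.

```python
from email import message_from_string

# Header names listed in the paragraph above.
KEPT_HEADERS = {"from", "return-path", "sender", "to", "reply-to", "subject"}


def kept_headers(raw_message):
    """Return (name, value) pairs for the headers that would be tokenised."""
    msg = message_from_string(raw_message)
    return [(name, value) for name, value in msg.items()
            if name.lower() in KEPT_HEADERS]


raw = "From: a@example.com\nX-Mailer: foo\nSubject: hi\n\nbody\n"
print(kept_headers(raw))
```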
Next, any MIME-encoded attachments are decoded. Any attachments whose MIME type starts with "text/" (i.e. HTML and text) are tokenised, after having any HTML tags stripped. Any non-textual attachments are replaced with their MD5 hash (such that two identical attachments will have the same hash), and that hash is then used as a token.
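The attachment handling can be approximated in Python with the standard email package. This is a sketch, not qsf's code: body_tokens is a hypothetical name, the tag stripping is a crude regular expression, text is split on whitespace only, and character set handling is reduced to UTF-8 with replacement.

```python
import hashlib
import re
from email import message_from_string


def body_tokens(raw_message):
    """Tokenise text/* parts; replace other parts with the MD5 of their content."""
    msg = message_from_string(raw_message)
    out = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        payload = part.get_payload(decode=True) or b""  # undo base64 etc.
        if part.get_content_type().startswith("text/"):
            text = payload.decode("utf-8", errors="replace")
            text = re.sub(r"<[^>]*>", " ", text)  # crude HTML tag removal
            out.extend(text.split())
        else:
            # Identical attachments give identical hashes, hence one shared token.
            out.append(hashlib.md5(payload).hexdigest())
    return out
```

Given a multipart message with one HTML part and one binary part, this returns the words of the HTML part followed by a single 32-character hash for the binary part.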
In addition to single-word tokens from textual message parts, qsf adds doubled-up tokens so that word pairs get added to the database. This makes the database a bit bigger (although the automatic pruning tends to take care of that) but makes matching more exact.
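The doubled-up tokens can be pictured with a few lines of Python. The function name with_pairs is hypothetical, and joining the two words with a single space is an assumption about the pair format.

```python
def with_pairs(words):
    """Return the single-word tokens followed by adjacent word pairs."""
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return list(words) + pairs


print(with_pairs(["buy", "cheap", "pills"]))
```

For n words this adds n - 1 pair tokens, which is why the database grows somewhat.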
As well as using the textual content of email to detect spam, qsf also uses special filters which create "pseudo-tokens" based on various rules. This means that specific patterns, not just individual words, can be used to determine whether a message is spam or not.
For example, if a message contains lots of words with multiple consonants, like "ashjkbnxcsdjh", then each time a word like that is seen the special token ".GIBBERISH-CONSONANTS." is added to the list of tokens found in the message. If it turns out that most messages with words that trigger this filter rule are spam, then other messages with gibberish consonant strings will be more likely to be flagged as spam.
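A pseudo-token filter of this kind could be sketched in Python as follows. The rule used here, six or more consonants in a row, is an assumption chosen for illustration; the manual does not state the real threshold, and gibberish_tokens is a hypothetical name.

```python
import re

# Assumed rule: a run of six or more consonants marks a word as gibberish.
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxz]{6,}", re.IGNORECASE)


def gibberish_tokens(words):
    """Emit one pseudo-token for each word that trips the rule."""
    return [".GIBBERISH-CONSONANTS." for w in words if CONSONANT_RUN.search(w)]


print(gibberish_tokens(["hello", "ashjkbnxcsdjh", "strengths"]))
```

The pseudo-tokens are then counted like any other token, so the filter only matters if training shows that such words really do correlate with spam.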
Currently the special filters are:
Normally, filters will just cause a token to be added, and these tokens are processed by the normal weighting algorithm. However the GTUBE filter will immediately flag any matching message as spam, bypassing the token matching.
The inbuilt "list" database backend will not necessarily provide the best performance, but is provided because using it requires no external libraries.
If, when qsf was compiled, the correct libraries were available, then it will be possible to use qsf with alternative database backends. To find out which backends you have available, run qsf -V (capital V) and read the second line of output. To see how well a backend performs, collect some spam and non-spam and use qsf -d BACKEND -B SPAM NONSPAM (see the entry for -B above).
Some people find that they get the best performance out of the gdbm backend; this is a library that is widely available on many systems.
To efficiently share a qsf database across multiple machines, you may find the MySQL backend useful. However, using it is a little more complicated.
To use the MySQL backend you will need to create a table with the fields key1, key2, token, value1, value2 and value3. The token field must be VARCHAR(64), and the value1, value2 and value3 fields must each be BIGINT or INT; indexing on the token field is a good idea. The key1 and key2 fields can be anything, but they must be present.
The key1 and key2 fields allow you to have multiple qsf databases in one table, by specifying different key1 and key2 values on invocation.
Instead of specifying a database file with the --database / -d option, you must specify either a specification string as described below, or the name of a file containing such a string on its first line.
The specification string is as follows:
This string must be all on one line, with no spaces.
Since command lines can be seen in the process list, it is probably best to specify a filename (eg qsf -d mysql:qsfdb.spec) and put the specification string inside that file.
If you have problems with qsf, please check the list below; if this does not help, go to the qsf home page and investigate the mailing lists, or email the author.
Written by Andrew Wood, with patches submitted by various other people. Please see the package README for a complete list of contributors.
Report bugs in QSF using the contact form linked from the QSF home page: <http://www.ivarch.com/programs/qsf/>
procmail(1), procmailrc(5), procmailex(5)
Someone has written a guide to using qsf with KMail that can be found at:
This is free software, distributed under the ARTISTIC 2.0 license.