Contents
If you do not see an answer to your question on this page, please try checking the manual, or use the Contact Form.
- How is QSF different to other spam filters?
- How do I set up QSF to filter my email?
- Why does initial training take so long?
How is QSF different to other spam filters?
QSF's targets are speed, accuracy, and simplicity. So:
- It is small and is written in C so it starts up quickly, unlike filters written in Perl.
- It understands MIME and HTML, so it can intelligently deal with "modern" spam, unlike older Bayesian filters such as ifile.
- It runs as an inline filter rather than as a daemon, so it is simple to install.
- It is written to do only one job - decide whether an email is spam or not using the content of the message alone - so it is less complex than filters such as SpamAssassin. Less complexity means bugs and security problems are less likely.
- As well as words and word pairs, QSF also spots special patterns in email such as runs of gibberish, HTML comments embedded in text, and other common spam giveaways, and its flexible tokeniser allows more patterns to be added as spammers change their tactics.
- It can be compiled to depend on no external libraries other than the C library, or it can use GDBM, MySQL, or SQLite for storage.
If QSF does not meet your needs, try looking at the resources page for other spam filters, or make a suggestion using the Contact Form.
How do I set up QSF to filter my email?
First, determine where you are going to do your filtering.
- If you will be filtering email as it arrives in your server mailbox, you will need Unix shell access to the server it arrives on.
- If you will be filtering email as it arrives in your mailbox on your workstation, then you will need to make sure it is delivered normally to your system spool file (eg using fetchmail; see the sketch after this list).
- If you will be filtering email within your email client itself, QSF may or may not be able to help you. If you use KMail, for instance, try this guide: http://www.softwaredesign.co.uk/Information.SpamFilters.html
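If you do use fetchmail for this, a ~/.fetchmailrc along the following lines will hand each downloaded message to procmail for local delivery, so that the procmail recipe shown below can run QSF. This is only a sketch: the server name, protocol, and account details are placeholders, not anything specific to QSF.
poll pop.example.com protocol pop3
        user "remote-user" password "secret"
        mda "/usr/bin/procmail -d %T"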
Next, work out whether you have procmail installed on the relevant machine. Doing "man procmail" should work if you have it installed.
If you have procmail, then create / edit your ~/.procmailrc file so it contains the following lines:
:0 wf
| qsf -sra
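If you would also like tagged spam filed into its own mbox folder rather than left in your inbox, a second recipe such as the following can be added after the one above (the folder name Mail/spam is only an example):
:0:
* ^X-Spam: YES
Mail/spam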
If you do not have procmail, you may have other alternatives such as maildrop. Check with your server administrator.
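As an illustration, with maildrop a ~/.mailfilter containing something like the following would pass each message through QSF and file anything it tags into a spam mbox; the folder path here is just an example:
xfilter "qsf -sra"
if (/^X-Spam: YES/)
        to "$HOME/Mail/spam"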
Update - Jan 2020 - from Anthony Campbell:
For the fdm tool, add this to fdm.conf:
action "spamfilter" rewrite "/usr/bin/qsf -s"
match all action "spamfilter" continue
match "^Subject:.*SPAM" action mbox "%h/Mail/spam"
Next, you need to create a new database so QSF can classify your email. To do this, collect as much recent spam as you can into one mail folder (somewhere between 100 and 2000 messages should be enough). Then collect a similar amount of non-spam in another mail folder.
These mail folders should be in mbox format. Email clients such as Mutt use it; it is one of the standard Unix mailbox formats. If, instead, you have your messages as individual files inside a directory, you can use a command line such as the following to put all the messages in DIRECTORY into one mbox file:
find DIRECTORY -type f -exec sh -c 'formail < "$1"' sh '{}' ';' >> NEW-MBOX-FILE
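For example, if your spam is stored one message per file in a Maildir (the path below is only an illustration), you could collect it into an mbox file called spam-folder like this:
find ~/Maildir/.Spam/cur -type f -exec sh -c 'formail < "$1"' sh '{}' ';' >> spam-folder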
Next, run QSF in training mode on your two mbox folders:
qsf -T spam-folder non-spam-folder
From now on, any incoming mail that QSF thinks is spam should end up with an X-Spam: YES header and a subject line starting with [SPAM].
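In other words, a message QSF classifies as spam will arrive with headers looking something like this (the subject text is only an example):
X-Spam: YES
Subject: [SPAM] Original subject line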
Why does initial training take so long?
When training using the -T option, QSF does not just mark all of the messages in the "spam" folder as spam, and all in the "non-spam" folder as non-spam. Instead, it goes through each message in each folder and only changes its database if it "guesses" the message's classification wrongly. Having tried this on every message, it then restarts the process, and keeps doing it until the number of messages it gets wrong falls to an acceptable number.
The reason it is done this way is to avoid overtraining the database. If too many entries are added to the database at once, the database becomes large and inflexible - it becomes more difficult to teach it new things in future.
Although the database format was recently changed to "age" tokens so that overtraining is less of a problem, the initial training process will probably always be done this way to ensure a balanced data set.