Contents
If you do not see an answer to your question on this page, please try checking the manual, or use the Contact Form.
- How is QSF different to other spam filters?
- How do I set up QSF to filter my email?
- Why does initial training take so long?
How is QSF different to other spam filters?
QSF's targets are speed, accuracy, and simplicity. So:
- It is small and is written in C so it starts up quickly, unlike filters written in Perl.
- It understands MIME and HTML, so it can intelligently deal with "modern" spam, unlike older Bayesian filters such as ifile.
- It runs as an inline filter rather than as a daemon, so it is simple to install.
- It is written to do only one job - decide whether an email is spam or not using the content of the message alone - so it is less complex than filters such as SpamAssassin. Less complexity means bugs and security problems are less likely.
- As well as words and word pairs, QSF also spots special patterns in email such as runs of gibberish, HTML comments embedded in text, and other common spam giveaways, and its flexible tokeniser allows more patterns to be added as spammers change their tactics.
- It can be compiled to depend on no external libraries other than the C library, or it can use GDBM, MySQL, or SQLite for storage.
If QSF does not meet your needs, try looking at the resources page for other spam filters, or make a suggestion using the Contact Form.
How do I set up QSF to filter my email?
First, determine where you are going to do your filtering.
- If you will be filtering email as it arrives in your server mailbox, you will need Unix shell access to the server it arrives on.
- If you will be filtering email as it arrives in your mailbox on your workstation, then you will need to make sure it is delivered normally to your system spool file (eg using fetchmail; see the sketch after this list).
- If you will be filtering email within your email client itself, QSF may or may not be able to help you. If you use KMail, for instance, try this guide: http://www.softwaredesign.co.uk/Information.SpamFilters.html
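If you do use fetchmail for this, a ~/.fetchmailrc along the following lines will hand each downloaded message to procmail for local delivery, so that the procmail recipe shown below can run QSF. This is only a sketch: the server name, protocol, and account details are placeholders, not anything specific to QSF.
poll pop.example.com protocol pop3
        user "remote-user" password "secret"
        mda "/usr/bin/procmail -d %T"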
Next, work out whether you have procmail installed on the relevant machine. Doing "man procmail" should work if you have it installed.
If you have procmail, then create / edit your ~/.procmailrc file so it contains the following lines:
:0 wf
| qsf -sra
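If you would also like tagged spam filed into its own mbox folder rather than left in your inbox, a second recipe such as the following can be added after the one above (the folder name Mail/spam is only an example):
:0:
* ^X-Spam: YES
Mail/spam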
If you do not have procmail, you may have other alternatives such as maildrop. Check with your server administrator.
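As an illustration, with maildrop a ~/.mailfilter containing something like the following would pass each message through QSF and file anything it tags into a spam mbox; the folder path here is just an example:
xfilter "qsf -sra"
if (/^X-Spam: YES/)
        to "$HOME/Mail/spam"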
Update - Jan 2020 - from Anthony Campbell:
For the fdm tool, add this to fdm.conf:
action "spamfilter" rewrite "/usr/bin/qsf -s"
match all action "spamfilter" continue
match "^Subject:.*SPAM" action mbox "%h/Mail/spam"
Next, you need to create a new database so QSF can classify your email. To do this, collect as much recent spam as you can into one mail folder (somewhere between 100 and 2000 messages should be enough). Then collect a similar amount of non-spam in another mail folder.
These mail folders should be in mbox format. Email clients such as Mutt use it; it is one of the standard Unix mailbox formats. If, instead, you have your messages as individual files inside a directory, you can use a command line such as the following to put all the messages in DIRECTORY into one mbox file:
find DIRECTORY -type f -exec sh -c 'formail < "$1"' sh '{}' ';' >> NEW-MBOX-FILE
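For example, if your spam is stored one message per file in a Maildir (the path below is only an illustration), you could collect it into an mbox file called spam-folder like this:
find ~/Maildir/.Spam/cur -type f -exec sh -c 'formail < "$1"' sh '{}' ';' >> spam-folder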
Next, run QSF in training mode on your two mbox folders:
qsf -T spam-folder non-spam-folder
From now on, any incoming mail that QSF thinks is spam should end up with an X-Spam: YES header and a subject line starting with [SPAM].
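In other words, a message QSF classifies as spam will arrive with headers looking something like this (the subject text is only an example):
X-Spam: YES
Subject: [SPAM] Original subject line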
Why does initial training take so long?
When training using the -T option, QSF does not just mark all of the messages in the "spam" folder as spam, and all in the "non-spam" folder as non-spam. Instead, it goes through each message in each folder and only changes its database if it "guesses" the message's classification wrongly. Having tried this on every message, it then restarts the process, and keeps doing it until the number of messages it gets wrong falls to an acceptable number.
The reason it is done this way is to avoid overtraining the database. If too many entries are added to the database at once, the database becomes large and inflexible - it becomes more difficult to teach it new things in future.
Although the database format was recently changed to "age" tokens so that overtraining is less of a problem, the initial training process will probably always be done this way to ensure a balanced data set.