Bogotune FAQ

Official Versions: In bogotune-faq
Maintainer: David Relson <relson@osagesoftware.com>

This document is intended to answer frequently asked questions about bogotune.

Where did bogotune come from?
What's the message count format?
How does bogotune work?
How does bogotune ensure the messages it works with are numerous enough, and well enough classified, to deliver useful recommendations?
Can I tell bogotune to do its work even though it doesn't like the data?

Where did bogotune come from?

Greg Louis wrote the original Robinson geometric-mean and Robinson-Fisher algorithm code for bogofilter. To determine the optimal parameters for the Robinson-Fisher algorithm, he wrote bogotune. The initial implementation was written in the R programming language and was followed by a Perl implementation. Both were slow because bogofilter had to be run for each message being scored. David Relson translated bogotune from Perl to C to make it faster.

What's the message count format?

Parsing a message with bogofilter takes some time. After parsing, looking up the spam and non-spam counts for each token takes additional time. Having to repeat these steps every time bogotune needed a score was slow. Parsing and lookup could instead be done once, with the results saved in a special format. Initially this was called the bogolex format because the work was done by piping bogolexer output to bogoutil and formatting the result. Since each processed message begins with the .MSG_COUNT token, the format became known as the message count format. The convention is to use a .mc extension for these files.

How does bogotune work?

First it reads all the files into memory: the wordlist, the ham messages, and the spam messages. From the wordlist tokens, it computes an initial robx value, which is used in the initial scan of the messages to ensure they're usable.

Given the total number of messages in the test set, a target number of false positives is selected for use in determining spam cutoff values in the individual scans.

Then comes the coarse scan. It uses 225 combinations of values chosen to span the potentially useful ranges for robs, robx, and min_dev. For each combination, all the ham messages are scored and the target number of false positives is used to find a spam_cutoff score. Then the spam messages are scored and the false negatives are counted. The scan finishes with a listing of the ten best sets of parameters and their scores (false negative and false positive counts and percentages).
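
The cutoff step for a single parameter combination can be illustrated with a short Python sketch. This is not bogotune's C code: the function names are hypothetical, the scores are made up, and it assumes a message is classed as spam when its score is at or above the cutoff. Ties among ham scores are ignored, so the real false positive count could exceed the target if several ham messages share the cutoff score.

```python
def find_spam_cutoff(ham_scores, target_fp):
    """Return a cutoff that lets roughly target_fp ham messages score as spam.

    Hypothetical helper: the cutoff is the score of the target_fp-th
    highest-scoring ham message.
    """
    if target_fp < 1 or not ham_scores:
        raise ValueError("need at least one ham score and a positive target")
    ranked = sorted(ham_scores, reverse=True)  # highest-scoring ham first
    index = min(target_fp, len(ranked)) - 1
    return ranked[index]


def count_false_negatives(spam_scores, spam_cutoff):
    """Count spam messages that score below the cutoff (missed spam)."""
    return sum(1 for score in spam_scores if score < spam_cutoff)


# Made-up scores for one parameter combination.
ham = [0.01, 0.02, 0.10, 0.40, 0.70, 0.95]
spam = [0.99, 0.80, 0.65, 0.30]

cutoff = find_spam_cutoff(ham, target_fp=2)
misses = count_false_negatives(spam, cutoff)
print(cutoff, misses)
```

With these scores the cutoff lands on 0.70 and two spam messages fall below it. Repeating this for every parameter combination and ranking the combinations by false negative count gives the kind of listing the coarse scan reports.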

From the results, the best non-outlying result is picked and these parameters become the starting point for the fine scan.

The fine scan, as the name suggests, scans the region (range of values of robs, robx and min_dev) surrounding the optimum found in the coarse scan, with smaller intervals so as to determine the optimum values more precisely.
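
The idea of rescanning a smaller region at finer intervals can be sketched as a grid built around the coarse optimum. Everything specific here is an assumption for illustration: the function name, the starting values, the region widths, and the number of steps are not taken from bogotune, and the sketch does not clamp values to their valid ranges.

```python
from itertools import product


def fine_grid(best, half_width, steps):
    """Build evenly spaced parameter combinations centred on `best`.

    best:       dict of parameter name -> value from the coarse scan
    half_width: dict of parameter name -> distance to scan on each side
    steps:      number of values per parameter (odd, so the centre is included)
    """
    if steps < 3 or steps % 2 == 0:
        raise ValueError("steps must be an odd number >= 3")
    names = list(best)
    axes = []
    for name in names:
        low = best[name] - half_width[name]
        span = 2 * half_width[name]
        axes.append([low + span * i / (steps - 1) for i in range(steps)])
    # Every combination of one value per parameter.
    return [dict(zip(names, combo)) for combo in product(*axes)]


# Hypothetical coarse-scan optimum and region sizes.
coarse_best = {"robs": 0.1, "robx": 0.5, "min_dev": 0.2}
widths = {"robs": 0.05, "robx": 0.1, "min_dev": 0.05}

grid = fine_grid(coarse_best, widths, steps=5)
print(len(grid))
```

Each entry in the grid would then be scored the same way as in the coarse scan, only over a narrower range of robs, robx, and min_dev.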

How does bogotune ensure the messages it works with are numerous enough, and well enough classified, to deliver useful recommendations?

It has certain minimum requirements that it checks for as it starts up. It will complain (and halt) if there are fewer than 2,000 ham or 2,000 spam in the wordlist, or if there are fewer than 500 ham or 500 spam in the set of test messages. It will warn, but not halt, if there's too little scoring variation in the ham messages or the spam messages or if too many of the ham messages score as spam (or vice versa) on the initial pass. There are additional checks, but I'm sure you get the idea from these examples. For details, use the source :)
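
The two halting checks on message counts can be expressed as a small Python sketch. The thresholds come from the answer above; the function name, argument names, and message wording are hypothetical, and the warning-only checks on score variation are not covered.

```python
MIN_WORDLIST_MESSAGES = 2000  # ham and spam each, in the wordlist
MIN_TEST_MESSAGES = 500       # ham and spam each, in the test set


def count_problems(wordlist_ham, wordlist_spam, test_ham, test_spam):
    """Return a list of reasons the input is too small; empty means OK."""
    problems = []
    checks = [
        ("wordlist ham", wordlist_ham, MIN_WORDLIST_MESSAGES),
        ("wordlist spam", wordlist_spam, MIN_WORDLIST_MESSAGES),
        ("test ham", test_ham, MIN_TEST_MESSAGES),
        ("test spam", test_spam, MIN_TEST_MESSAGES),
    ]
    for label, count, minimum in checks:
        if count < minimum:
            problems.append(f"{label}: {count} messages, need at least {minimum}")
    return problems


# Made-up counts: the wordlist spam and test spam totals are too low.
problems = count_problems(2500, 1800, 600, 499)
for line in problems:
    print(line)
```

An empty list means the counts meet the minimums; any entry corresponds to a condition under which bogotune would complain and halt.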

Can I tell bogotune to do its work even though it doesn't like the data?

No. At one time we had a -F option to force bogotune to run with unsuitable message data, but it was realized that this could be misleading and had little chance of being helpful. Bogotune will warn the operator if its conclusions are untrustworthy due to marginal input, and will not run if its input data are detectably inadequate.