Subject: CVS commit: pkgsrc
To: None <pkgsrc-changes@NetBSD.org>
From: Hubert Feyrer <hubertf@netbsd.org>
List: pkgsrc-changes
Date: 11/18/2004 12:46:53
Module Name:	pkgsrc
Committed By:	hubertf
Date:		Thu Nov 18 12:46:53 UTC 2004

Modified Files:
	pkgsrc/doc: CHANGES
	pkgsrc/mail/spamprobe: Makefile distinfo

Log Message:
Update spamprobe to 1.0a.  Patch sent via IRC by the maintainer.

Changes:
	* MimeLineReader.cc: 1.0 branch - fixed MBX record header regex
	* spamprobe.cc (main): Added exec and exec-shared commands.
	  (import_words): modified import command to allow negative values
	  to be specified in the import file.
	* Applied patches for configure.in and aclocal.m4 contributed by
	  Siggy Brentrup for Debian compatibility.
	* FrequencyDBImpl_pbl.cc: Invokes new WordData methods to allow
	  storing data in big endian format.
	* WordData.h: Added optional support for storing counts/flags
	  in big endian order for data portability.
	* MimeLineReader.cc (readMBXFileHeader): UW IMAP MBX file format
	  is now auto detected from the first line of the mailbox file.
	* spamprobe.cc (process_extended_options): Removed -o imap-mbx
	  option.
	* spamprobe.cc (process_extended_options): Added -o imap-mbx
	  option to process files as UW IMAP MBX files rather than mbox
	  files.
	* MimeLineReader.cc (readLine): Added support for UW IMAP MBX file
	  format.
	* spamprobe.cc (process_stream): Added -o tokenized option
	  to allow people to use an external tokenizer with spamprobe.
	* SpamFilter.cc (scoreToken): Reduced sorting overhead by
	  pre-computing an integer sort value with sorting priorities
	  reflected in the value.  This eliminates several calculations
	  inside of the sort routine.
	* SpamFilter.cc (computeRatio): Capped ratios in calculations to
	  within MIN_PROB and MAX_PROB.  Widened that range.  This avoids
	  problems with div/0 and makes it easier to sort terms.
	* spamprobe.cc (dump_words): dump command can now optionally
	  accept a regular expression as an argument and will only dump
	  terms matching the regular expression.
	  (purge_terms): Added purge-terms command to purge from the
	  database all terms matching a regular expression.
	* spamprobe.cc (main): Fixed bug in command line processing.
	  Thanks to Jem for bug report.
	* spamprobe.cc (train_on_message): Code simplified.  Eliminated
	  redundant recalculation of scores.
	  (train_on_message): Timestamps are no longer updated by
	  train-spam and train-good commands.  They are still updated by
	  train command.
	  (main): Fixed assertion failure when -P option is specified in a
	  read-only operation.
	* spamprobe.cc (main): Added -C command line option to allow users
	  to specify their own min word count.
	* SpamFilter.cc (SpamFilter): Set default minimum word count back
	  to 5 (was 3).
	* spamprobe.cc (process_extended_options): Removed "alt-score"
	  from -o options list because it distributes scores poorly.  New
	  formula achieves the same end with better accuracy.  Added
	  "orig-score" option to allow people to continue using the old
	  formula.  Added "honor-xstatus-header" option for people whose
	  mail server uses X-Status: rather than Status: for the deleted
	  flag.
	  (main): Added -l command line option to allow people to set
	  their own spam threshold if they don't like the default value.
	* SpamFilter.cc (scoreMessage): Added a new scoring formula based
	  on Paul's but taking the nth root of spam and good probabilities
	  to produce more evenly distributed scores.  Lowered the spam
	  threshold to 0.6 to keep accuracy about the same as the original
	  formula.  Highest score seen for a ham so far in tests is 0.44
	  so 0.6 seems safe.  Made the new formula the default instead of
	  Paul's.
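
The scoreMessage and computeRatio entries above can be illustrated with a
rough sketch.  This is not the code from SpamFilter.cc: the function names
and the MIN_PROB/MAX_PROB values below are hypothetical, and the exact form
of the formula is an assumption based only on the wording of the log
(Paul Graham's combining formula, with the nth root taken of the spam and
good products).  Only the 0.6 threshold comes from the log itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical bounds: the log does not state the real MIN_PROB/MAX_PROB.
constexpr double kMinProb = 0.000001;
constexpr double kMaxProb = 0.999999;

// Cap a per-term probability so neither p nor (1 - p) can reach zero,
// which avoids division by zero and log(0) further down.
double clampProb(double p) {
    return std::min(kMaxProb, std::max(kMinProb, p));
}

// Assumed formula: S = (prod p_i)^(1/n), G = (prod (1 - p_i))^(1/n),
// score = S / (S + G).  Products are accumulated as sums of logs, and
// the nth root is taken by dividing each log sum by n.
double scoreTerms(const std::vector<double>& probs) {
    assert(!probs.empty());
    double logSpam = 0.0;
    double logGood = 0.0;
    for (double raw : probs) {
        const double p = clampProb(raw);
        logSpam += std::log(p);
        logGood += std::log(1.0 - p);
    }
    const double n = static_cast<double>(probs.size());
    const double spam = std::exp(logSpam / n);
    const double good = std::exp(logGood / n);
    return spam / (spam + good);
}

// 0.6 is the default threshold named in the log; treating a score equal
// to the threshold as spam is an assumption of this sketch.
bool isSpam(double score, double threshold = 0.6) {
    return score >= threshold;
}
```

With this sketch, a message whose terms all have probability 0.5 scores
about 0.5, and isSpam(0.44) is false under the default threshold, matching
the log's remark that the highest ham score seen in tests was 0.44.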


To generate a diff of this commit:
cvs rdiff -r1.7895 -r1.7896 pkgsrc/doc/CHANGES
cvs rdiff -r1.10 -r1.11 pkgsrc/mail/spamprobe/Makefile
cvs rdiff -r1.5 -r1.6 pkgsrc/mail/spamprobe/distinfo

Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.