The Entry represents one "document". What counts as a document depends on what you are trying to classify; in the canonical spam-filtering case, each Entry would be one email. Entry is an abstract class that implements the IEnumerable<string> interface, which lets the entry provide its tokens to the Index.

public abstract class Entry : IEnumerable<string>
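
As a rough illustration of what this declaration implies, the sketch below uses a stand-in Entry class and a hypothetical WhitespaceEntry subclass. The real class body is not shown on this page, so the members here are assumptions, as are the EntryDemo and CountTokens names. Because an Entry is an IEnumerable<string>, any consumer (such as the Index) can read its tokens with a plain foreach.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Stand-in for the library's Entry class; the real members may differ.
public abstract class Entry : IEnumerable<string>
{
    public abstract IEnumerator<string> GetEnumerator();

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

// Hypothetical subclass: yields whitespace-separated words as tokens.
public class WhitespaceEntry : Entry
{
    private readonly string text;

    public WhitespaceEntry(string text)
    {
        this.text = text;
    }

    public override IEnumerator<string> GetEnumerator()
    {
        // A null separator array makes Split break on any whitespace.
        foreach (string word in text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            yield return word;
        }
    }
}

public static class EntryDemo
{
    // Counts token occurrences, roughly as a consumer of an Entry might.
    public static Dictionary<string, int> CountTokens(Entry entry)
    {
        var counts = new Dictionary<string, int>();
        foreach (string token in entry)
        {
            int current;
            counts.TryGetValue(token, out current); // current stays 0 if the token is new
            counts[token] = current + 1;
        }
        return counts;
    }
}
```

Calling EntryDemo.CountTokens(new WhitespaceEntry("buy cheap  buy now")) would walk the entry once and tally each token, without the caller needing to know how the entry produced them.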

There is currently only one concrete implementation, the StringEntry. This implementation performs a simple tokenization: it takes the string, removes some unneeded characters, and then splits it so that each word becomes a separate token. You can easily create custom Entry classes that tokenize your data in whatever way suits it. Tokenization is very important and can have a large effect on the success of the categorization. Here is Paul Graham describing some thoughts on tokenization in his essay "Better Bayesian Filtering":

"Mostly I've been working on smarter tokenization ... Now I have a more complicated definition of a token ... Tokens that occur within the To, From, Subject, and Return-Path lines, or within urls, get marked accordingly. E.g. ``foo'' in the Subject line becomes ``Subject*foo''. (The asterisk could be any character you don't allow as a constituent.)"
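
The header marking described in the quote could be sketched as a custom Entry along the following lines. This is only an illustration under stated assumptions: the Entry class is a stand-in for the library's own, HeaderAwareEntry and MatchHeader are hypothetical names, the token pattern is a simplification chosen for this example, and URL marking and folded (multi-line) headers are not handled.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

// Stand-in for the library's Entry class; the real members may differ.
public abstract class Entry : IEnumerable<string>
{
    public abstract IEnumerator<string> GetEnumerator();

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

// Hypothetical Entry that marks tokens found in selected header lines.
public class HeaderAwareEntry : Entry
{
    private static readonly string[] MarkedHeaders = { "To", "From", "Subject", "Return-Path" };

    // Simplified token pattern for this example: letters, digits, ', $, ! and -.
    private static readonly Regex TokenPattern = new Regex(@"[A-Za-z0-9'$!-]+");

    private readonly string rawMessage;

    public HeaderAwareEntry(string rawMessage)
    {
        this.rawMessage = rawMessage;
    }

    public override IEnumerator<string> GetEnumerator()
    {
        bool inHeaders = true;
        using (var reader = new StringReader(rawMessage))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // The first blank line separates the headers from the body.
                if (inHeaders && line.Length == 0)
                {
                    inHeaders = false;
                    continue;
                }

                string prefix = "";
                string content = line;
                if (inHeaders)
                {
                    string header = MatchHeader(line);
                    if (header != null)
                    {
                        // "*" is not matched by TokenPattern, so it is a safe marker.
                        prefix = header + "*";
                        content = line.Substring(header.Length + 1);
                    }
                }

                // Body lines and unlisted headers are emitted without a prefix.
                foreach (Match match in TokenPattern.Matches(content))
                {
                    yield return prefix + match.Value;
                }
            }
        }
    }

    // Returns the canonical header name if the line starts with "Name:", else null.
    private static string MatchHeader(string line)
    {
        foreach (string name in MarkedHeaders)
        {
            if (line.StartsWith(name + ":", StringComparison.OrdinalIgnoreCase))
            {
                return name;
            }
        }
        return null;
    }
}
```

With this sketch, the word "Free" in a Subject line is emitted as "Subject*Free", while the same word in the body stays "Free", so the two are counted as distinct tokens.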

Of course, simple text classification isn't the only use of this technology. With a good enough tokenization process, one could imagine a system that uses this to classify user behavior to create a simple artificial learner.

Last edited May 8, 2009 at 5:41 AM by joelmartinez, version 1

