Bayesian filtering is based on the principle that most events are dependent and that the probability of an event occurring in the future can be inferred from the previous occurrences of that event. Each word or token is assigned a probability based on how often it occurs in one category versus another. The most common application of the filter is identifying words that appear in spam versus legitimate emails. A word by itself, however, is often useless without the context in which it was used.
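
As a rough illustration of how a per-token probability might be computed, here is a minimal Python sketch. The function name, the toy messages, the add-one smoothing, and the assumption of equal priors for both categories are all mine, chosen for illustration only; real filters tune these details differently.

```python
from collections import Counter

def token_spam_probabilities(spam_docs, ham_docs):
    """Estimate P(spam | token) for every token seen in the training data."""
    # Count each token once per message so long messages do not dominate.
    spam_counts = Counter(tok for doc in spam_docs for tok in set(doc.lower().split()))
    ham_counts = Counter(tok for doc in ham_docs for tok in set(doc.lower().split()))
    n_spam, n_ham = len(spam_docs), len(ham_docs)
    probs = {}
    for tok in set(spam_counts) | set(ham_counts):
        # Fraction of messages in each category containing the token,
        # with add-one smoothing so no token gets a probability of exactly 0 or 1.
        p_tok_spam = (spam_counts[tok] + 1) / (n_spam + 2)
        p_tok_ham = (ham_counts[tok] + 1) / (n_ham + 2)
        # Bayes' rule, assuming equal priors for the two categories.
        probs[tok] = p_tok_spam / (p_tok_spam + p_tok_ham)
    return probs

# Hypothetical toy training data.
spam = ["cheap pills online", "win money online now"]
ham = ["meeting notes attached", "lunch meeting now"]
probs = token_spam_probabilities(spam, ham)
for tok in ("online", "meeting", "now"):
    print(tok, probs[tok])
```

A token seen only in spam scores well above 0.5, a token seen only in legitimate mail scores well below it, and a token seen equally in both sits at 0.5, which is exactly the "useless without context" case.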

A whole suite of tools can break content down to improve the filter, supplementing its database of word-to-category mappings with sets of N-grams derived from the text. Several scripts are available to help with this extraction, which adds a few more layers of depth to Bayesian filtering. One such tool is the Ngram Statistics Package (NSP), which is easy to install and run.


I ran a very basic test against an older post to see how it does with bigram extraction.

# perl bin/count.pl --ngram 2 test.cnt test.txt
# perl statistic.pl --ngram 2 dice test.res test.cnt

Sample bigrams found:

cloud computing, master slave, groups online, Back Again, made absolutely, very costly, extensive development, hefty bill, start ups, distribution awareness
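
To give a feel for what a bigram ranking like this involves, here is a small Python sketch that counts adjacent word pairs and scores them with a Dice coefficient. It is my own simplified approximation, not NSP's implementation: the tokenizer, the function name, and the sample text are assumptions, and unlike a real tool it does not stop bigrams from crossing sentence boundaries.

```python
import re
from collections import Counter

def dice_bigrams(text):
    """Return {(w1, w2): Dice score} for adjacent word pairs in text."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    pairs = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(pairs)
    first_counts = Counter(w1 for w1, _ in pairs)    # times a word starts a bigram
    second_counts = Counter(w2 for _, w2 in pairs)   # times a word ends a bigram
    # Dice = 2 * (joint count) / (count of w1 as first + count of w2 as second)
    return {
        (w1, w2): 2 * n / (first_counts[w1] + second_counts[w2])
        for (w1, w2), n in pair_counts.items()
    }

# Hypothetical sample text.
sample = "Cloud computing is costly. Cloud computing needs master slave databases."
scores = dice_bigrams(sample)
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(score, 3))
```

Pairs whose words almost always appear together, such as "cloud computing" here, score close to 1, while incidental pairs score lower.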

Rather than relying only on the probability that the individual words fit into one category (in this case, "Technology"), we can now compound the score with the probability that these bigrams fall into the category as well. For further layers of scoring, trigrams, 4-grams, and so on can be extracted. In the financial sector the terminology is dense, and analysis is almost impossible without N-gram extraction. "Filed for bankruptcy" and "avoided bankruptcy" could not be further apart. With traditional filtering, the word "bankruptcy" is nearly meaningless: without context, it does not indicate whether the article is favorable or not. By extracting the phrases, the filter can recognize the difference between the two and score them appropriately.
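
The bankruptcy example can be sketched as a tiny naive Bayes classifier that treats bigrams as extra tokens alongside single words. Everything here is hypothetical and for illustration only: the function names, the four training sentences, the "favorable"/"unfavorable" labels, and the add-one smoothing are my assumptions, not a production design.

```python
import math
from collections import Counter, defaultdict

def features(text):
    """Unigrams plus adjacent-word bigrams, as plain strings."""
    words = text.lower().split()
    return words + [f"{a} {b}" for a, b in zip(words, words[1:])]

def train(labeled_docs):
    """Count how often each feature appears under each category."""
    counts = defaultdict(Counter)
    doc_totals = Counter()
    for text, label in labeled_docs:
        counts[label].update(features(text))
        doc_totals[label] += 1
    return counts, doc_totals

def classify(text, counts, doc_totals):
    """Return the category with the highest naive Bayes log score."""
    vocab = set().union(*counts.values())
    scores = {}
    for label, feat_counts in counts.items():
        total = sum(feat_counts.values())
        # Log prior for the category.
        score = math.log(doc_totals[label] / sum(doc_totals.values()))
        for feat in features(text):
            # Add-one smoothing keeps unseen features from zeroing the score.
            score += math.log((feat_counts[feat] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data: "bankruptcy" appears equally in both categories.
training = [
    ("company filed for bankruptcy", "unfavorable"),
    ("retailer filed for bankruptcy protection", "unfavorable"),
    ("company avoided bankruptcy", "favorable"),
    ("airline avoided bankruptcy after restructuring", "favorable"),
]
counts, doc_totals = train(training)
print(classify("lender filed for bankruptcy", counts, doc_totals))
print(classify("lender avoided bankruptcy", counts, doc_totals))
```

Because "bankruptcy" occurs equally often in both categories, it contributes almost nothing on its own; the surrounding words and the bigrams built from them ("filed for", "avoided bankruptcy") are what separate the two cases.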

Paul Graham has been working on improving the Bayesian filter to deal with spam by splitting the data into categories. Text is classified as legitimate or spam based not only on the context of the message, but also on the likelihood of tokens appearing in various parts of the message. N-gram filtering would not work as well for spam, because the number of grammar mistakes, misspellings, and word-order changes would erase most of the benefit. Spammers also adjust their content all the time to beat such filters. When the source data is reliable, however, adding N-grams to the filter can boost categorization accuracy.
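
One simple way to make a filter aware of where a token appeared is to prefix each token with the part of the message it came from, so that the same word in the subject and in the body is counted separately. The sketch below is only my illustration of that idea; the `part*word` format, the function name, and the dictionary layout of the message are assumptions rather than a description of any particular filter's internals.

```python
def tag_tokens(message):
    """Prefix each token with the part of the message it came from."""
    tagged = []
    for part, text in message.items():
        for word in text.lower().split():
            # "free" in the subject and "free" in the body become distinct tokens.
            tagged.append(f"{part}*{word}")
    return tagged

# Hypothetical message split into named parts.
msg = {"subject": "Free offer", "body": "Call now for a free quote"}
tokens = tag_tokens(msg)
print(tokens)
```

The tagged tokens can then be fed into the same per-token probability calculation as ordinary words.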

Integration with traditional Bayesian filtering is very easy, because each N-gram can be treated as just another token. Google has been using text processing for a while now. This is a huge area of study in linguistics, language processing, and machine learning. With so much data out there and more being collected daily, deriving context from text will allow applications to behave more intelligently.