Highs, Lows of Bayesian Spam Filters

A look at Google's new spam filters by Serge Thibodeau

Dec 08, 2003

On November 26, Stephen Lynch, a journalist with the New York Post, interviewed me by telephone about an article I had written the previous day. That article dealt with the current November Google update "dance," dubbed "Florida."

The following day, Lynch's article was published in the New York Post. Without being technical, it offered his comments and explained some of the negative effects such an update can have on the average site owner or webmaster.

As the latest monthly Google update "dance," Florida, has shown, a website that ranks highly on the Internet’s No. 1 search engine can suffer a devastating blow if its rankings drop as sharply as some did, and without warning. That can pose a serious risk to some online stores and certain commercial websites.

In the last 10 days, a lot of articles have also been written by some of my colleagues, some in the SEO field and some, like Seth Finkelstein, who are more in favor of the free flow of information that the Internet can provide.

In this article, I will attempt to describe some of the spam-filtering techniques that Google is reportedly using during this Florida dance. This spam-filtering technology is based on the Bayesian algorithm.

The inner workings of a spam filter for a search engine
For quite a long time now, Google’s search results have been under attack by search engine spammers who continually attempt to manipulate the results and, in the end, clutter the search engines' databases with irrelevant information.

With the ever-growing popularity of Google, which handles more and more of the searching done on the web, fouling the search results has become tempting to certain spammers, leading to a substantial degradation in the quality and relevance of Google’s search results. Since Google is mostly concerned with delivering relevant, high-quality search results, it is now cracking down on these unscrupulous spammers with new spam-filtering algorithms that use Bayesian filtering technology.

At the end of October 2003, Google deployed its new Bayesian anti-spam algorithm, which appeared to make a search crash whenever a previously identified spam site would normally have been displayed. In fact, the search results were cut off completely when such a spam site was encountered. See "Google Spam Filtering Gone Bad" by Seth Finkelstein for more technical information on how this spam elimination algorithm works at Google.

The first shoe that fell
On or around Nov. 5, the spam problem was in fact reduced significantly as these new Bayesian anti-spam filters kicked in. Although not perfect, the new Bayesian spam-filtering technology seemed to work, although some searches still crashed.

On or about Nov. 15, Google started dancing, as it does every month, performing its extensive monthly deep crawl of the web and indexing more than 3.5 billion pages. This update had some rather strange results, reminding some observers of a previous major algorithm change made in April 2003, dubbed "Dominick," which produced similarly unpredictable results across the web.

It was generally observed that many "old" and high-ranking sites, some of them highly regarded as authoritative and certainly not spammers in any way, fell sharply in the rankings or disappeared entirely from Google’s search results.

Since then, there have been many attempts, some not too scientific, to explain this event, which some have categorized as "serious." For example, some of the best of these explanations can be found in an article by Barry Lloyd: "Been Gazumped by Google? Trying to make Sense of the "Florida" Update!"

More on the Bayesian spam filter
Part of my research and the observations I have made in this matter point to the Bayesian spam filter that Google started to implement in late October. A Bayesian spam filter is a complex algorithm that estimates the probability that certain content or material detected by Google is in fact spam. In its most basic form, the Bayesian spam filter determines whether something "looks spammy" or whether, on the other hand, it is relevant content that will truly help the user.
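To make the idea of "estimating the probability that content is spam" more concrete, here is a minimal sketch of a naive Bayes text classifier, loosely in the spirit of the Paul Graham essay listed in the references. It is only an illustration under stated assumptions: the training pages, the function names and the add-one smoothing are all hypothetical, and nothing here is known to reflect what Google actually runs.

```python
import math
from collections import Counter

def train(docs):
    """Count, per class, how many documents contain each token.

    docs is a list of (text, is_spam) pairs. This toy data layout is an
    assumption made for illustration only.
    """
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in docs:
        counts[is_spam].update(set(text.lower().split()))
        totals[is_spam] += 1
    return counts, totals

def spam_probability(text, counts, totals):
    """Return P(spam | tokens) under a naive Bayes model with add-one smoothing.

    Simplification: only tokens present in the text contribute evidence.
    """
    # Prior log-odds of spam versus non-spam, smoothed.
    log_odds = math.log((totals[True] + 1) / (totals[False] + 1))
    for token in set(text.lower().split()):
        p_token_given_spam = (counts[True][token] + 1) / (totals[True] + 2)
        p_token_given_ham = (counts[False][token] + 1) / (totals[False] + 2)
        log_odds += math.log(p_token_given_spam / p_token_given_ham)
    # Convert log-odds back to a probability.
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical, hand-made training pages.
TRAINING = [
    ("cheap pills cheap pills buy now", True),
    ("buy cheap links cheap hosting now", True),
    ("reliability testing in home appliances", False),
    ("history of the montreal subway system", False),
]

counts, totals = train(TRAINING)
spammy = spam_probability("buy cheap pills", counts, totals)
clean = spam_probability("appliances testing history", counts, totals)
```

With this toy data, the first query scores about 0.95 and the second about 0.11. A real filter would be trained on far more pages and would need a carefully chosen threshold for deciding what "looks spammy."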

To a certain degree, the Bayesian algorithm has proven effective in the war against spam in the search engines. Having been bombarded by spam for the past couple of years, Google has no choice but to implement such anti-spam safeguards to protect the quality and relevancy of its search results.

However, the general feeling in the SEO community is that, unfortunately, the current implementation of the Bayesian algorithm seems to have extreme and unpredictable consequences that were practically impossible to foresee.

At the outset, one of the problems with estimating the probability that certain content is spam is that, given very large datasets (such as the entire Internet), many "false success stories" (false positives, in which legitimate content is flagged as spam) can and will occur. It is exactly these false positives that are at the center of the current problem.
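The "false success stories" described above, usually called false positives, become a problem of sheer scale. The short calculation below is a sketch of that arithmetic. The 3.5 billion page figure comes from this article; the spam share and the false positive rate are purely hypothetical numbers chosen for illustration, not measurements of Google's filter.

```python
def expected_false_positives(total_pages, spam_share, false_positive_rate):
    """Expected number of legitimate pages wrongly flagged as spam."""
    legitimate_pages = total_pages * (1 - spam_share)
    return legitimate_pages * false_positive_rate

INDEX_SIZE = 3_500_000_000   # pages in the index, as cited in this article
ASSUMED_SPAM_SHARE = 0.10    # hypothetical: 10% of indexed pages are spam
ASSUMED_FP_RATE = 0.001      # hypothetical: 0.1% of legitimate pages misjudged

wrongly_flagged = expected_false_positives(
    INDEX_SIZE, ASSUMED_SPAM_SHARE, ASSUMED_FP_RATE
)
```

Under these assumptions, a filter that is right about 99.9 percent of legitimate pages would still wrongly flag roughly 3.15 million of them, which is consistent with the kind of collateral damage many site owners reported.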

Since this whole event started, many have noted in tests and evaluations that making a search more selective, for example by excluding an irrelevant string from the query, tends to deactivate the new search results algorithm, which in turn effectively shuts down the newly implemented Bayesian anti-spam solution at Google.

One more observation
While we are still on the subject of the new filter, but getting away from the topic of spam-related issues, I noticed during testing of the Florida update that Google is now "stemming." To my knowledge, it’s the first time that Google has offered such an important search feature. How does stemming work? Well, for example, if you search for "reliability testing in appliances," Google would suggest "reliable testing in appliances."

To a certain degree, variants of your search terms will be highlighted in the snippet of text that Google provides with each result. The new stemming feature is something that will certainly help a lot of people with their search for information. Again, Google tries to make its searches as relevant as they can be, and this new stemming feature seems like a continuation of these efforts.
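As a rough illustration of what stemming means, the sketch below strips a few common English suffixes so that variants such as "reliability" and "reliable" reduce to the same stem. The suffix list and the function names are hypothetical and deliberately crude. Production stemmers, such as the Porter stemmer, use many more rules, and how Google stems queries is not publicly documented.

```python
# Hypothetical suffix list, checked in order so longer suffixes win.
SUFFIXES = ("ability", "able", "ing", "ed", "s")

def stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

def matches(query, document):
    """True if every stemmed query term appears among the document's stems."""
    document_stems = {stem(w) for w in document.split()}
    return all(stem(w) in document_stems for w in query.split())

same_stem = stem("reliability") == stem("reliable")
found = matches(
    "reliability testing in appliances",
    "reliable testing in appliances",
)
```

Here both "reliability" and "reliable" reduce to the stem "reli", so the query from the example above matches a page that uses the other variant.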

Conclusion
In retrospect, and in re-evaluating all the events that have happened in this major dance, it is clear that Google is still experimenting with its newly implemented algorithm and that many important adjustments will need to be made to it to make it more efficient.

With spam a growing problem day by day, modern search engines have no choice but to implement better and more “intelligent” spam-filtering algorithms that can tell the difference between what is spam and what isn’t.

The next 30 days can be viewed as critical to the proper fine-tuning and deployment of this new breed of application in the war against spam. How the major search engines handle it will be crucial for some commercial websites or online storefronts that rely solely on their Google rankings for the bulk of their sales.

In light of all this, perhaps some companies in this position would be well advised to evaluate other alternatives, such as PPC and paid inclusion marketing programs, to complement their SEO programs. At any rate, it is my guess that search will continue to be an important and growing part of online marketing, locally, nationally and globally.

References:

1) An anticensorware investigation by Seth Finkelstein
http://sethf.com/anticensorware/general/google-spam.php

2) Better Bayesian filtering by Paul Graham
http://www.paulgraham.com/better.html

About the Author:


Residing in the suburbs of Montreal, Quebec, Serge Thibodeau has been performing professional search engine optimization and priority positioning services since 1997. Serge optimizes commercial web sites of small businesses, medium-size companies as well as Fortune 500 enterprises. Serge serves as CEO for RankforSales.com.

Additionally, Serge has been largely involved as the project leader in the development of Pagina+ (tm), a powerful search engine optimization tool for SEO professionals. Pagina+ (tm) is offered by Rank for $ales's parent company: GCIS Inc.

You can reach Serge Thibodeau at: sthibodeau@rankforsales.com or toll free at: 1-800-631-3221.