Monday, November 17, 2008

Automated Bots: A Problem for Web Analytics

Anyone who has spent as much time analyzing web logs as I have understands that automated bots, or web robots, create an enormous challenge for anyone trying to make good business decisions based on traffic trends.

An automated web robot is simply a crawler: a program that reads web pages and follows the links on those pages to the next set of pages. These robots crawl the web from page to page to page gathering information (a bare-bones sketch of such a crawler appears after the list below). Robots have many different purposes for crawling the web:

* indexing web pages so that a search engine can find the pages relevant to the search being performed,
* gathering information from web pages (also known as screen scraping),
* learning business relationships between sites by understanding who is linking to whom,
* malicious gathering of email addresses for the purposes of sending unwanted email (spam), and
* serving many other applications that the developer of the crawler has designed.
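
To make the idea concrete, here is a bare-bones sketch of such a crawler in TypeScript. The starting URL is a placeholder, the link extraction is deliberately naive, and a real crawler would need far more care (politeness delays, robots.txt handling, a proper HTML parser):

```typescript
// Bare-bones crawler sketch: fetch a page, pull out its links, repeat.
// Relies on the global fetch available in modern Node.js runtimes.
async function crawl(startUrl: string, maxPages: number): Promise<void> {
  const queue: string[] = [startUrl];
  const visited = new Set<string>();

  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift()!;
    if (visited.has(url)) continue;
    visited.add(url);

    try {
      const html = await (await fetch(url)).text();
      console.log(`Fetched ${url} (${html.length} bytes)`);

      // Naive link extraction with a regex; real crawlers parse the HTML properly.
      for (const match of html.matchAll(/href="(https?:\/\/[^"]+)"/g)) {
        if (!visited.has(match[1])) queue.push(match[1]);
      }
    } catch {
      // In this sketch, pages that fail to load are simply skipped.
    }
  }
}

// "https://example.com/" is a placeholder, not a site you should actually crawl.
crawl("https://example.com/", 10);
```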

The problem that these robots create is that they leave "tracks" on your web site -- log entries created by your web server indicating that a visit to a page on your site has occurred. These entries fool your web analytics engine into treating the visit as one made by an actual human. If you try to make a business decision based on the behavioral patterns of visitors to your site, these robot tracks, mixed in with the tracks left by real visitors, have the potential to introduce enough bad data to lead you to the wrong decision and, consequently, the wrong actions.

These robots are very easy to develop, and as a result they are everywhere. Unlike human visitors, they are not going to make a purchase on your site, buy from your advertisers, or subscribe to your newsletter or RSS feed. You basically want to eliminate these robots from your web analytics data. However, this is not so simple.

The Internet community introduced an advisory standard in 1994 known as the Robots Exclusion Protocol (see Wikipedia definition), which defines rules that well-behaved web robots are expected to follow. One such rule is that a robot must read the "robots.txt" file on your server, which gives you the opportunity to specify which pages should never be read by robots (a sample file appears after the list below). Placing such an exclusion file on your server is, however, impractical for the purposes of improving web analytics for a few reasons:

* Excluding all robots prevents Google, Yahoo, and the other search engines' robots from crawling and indexing your site. This keeps your pages out of natural search listings, which is a problem far larger than the confused web analytics.
* Many robots, especially those with malicious intent, simply ignore the robots.txt exclusion file. Since the Robots Exclusion Protocol is advisory in nature, nothing forces a robot to honor it.
* If your site is on a public hosting service, such as a blog-hosting service, you may not have permission to create your own robots.txt file and place it there.
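
For reference, the exclusion file itself is just a plain-text file named robots.txt at the root of your site. A minimal, hypothetical example that asks all robots to stay out of a /private/ directory looks like this:

```
User-agent: *
Disallow: /private/
```

Well-behaved crawlers such as Googlebot honor these directives; the spoofed and malicious robots discussed below generally do not.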

Furthermore, robots take many steps to make their visits look just like visits from human visitors. While Google and other search crawlers identify themselves by sending an unambiguous User Agent string (see Wikipedia definition), which web analytics engines can filter out, many web robots identify themselves as a typical browser, such as Internet Explorer or Firefox. This spoofing makes it especially difficult for your analytics engine to differentiate between a human and a robot visit.
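
To illustrate the kind of filtering an analytics engine can do, here is a rough TypeScript sketch that drops log lines whose User Agent matches a few well-known crawler signatures. The log file name and the signature list are placeholders, and the log is assumed to be in the common combined format, where the User Agent is the last quoted field. The sketch catches crawlers that identify themselves honestly, but a robot that reports itself as Internet Explorer sails right through:

```typescript
import { readFileSync } from "fs";

// User-Agent fragments of robots that identify themselves honestly.
// A real filter list would be far longer; this is only a sketch.
const KNOWN_BOT_SIGNATURES = ["googlebot", "slurp", "msnbot", "crawler", "spider"];

function looksLikeBot(userAgent: string): boolean {
  const ua = userAgent.toLowerCase();
  return KNOWN_BOT_SIGNATURES.some((signature) => ua.includes(signature));
}

// "access.log" is a placeholder path to a combined-format web server log.
const lines = readFileSync("access.log", "utf8")
  .split("\n")
  .filter((line) => line.trim().length > 0);

const humanLines = lines.filter((line) => {
  const match = line.match(/"([^"]*)"\s*$/); // last quoted field is the User-Agent
  return match !== null && !looksLikeBot(match[1]);
});

console.log(`Kept ${humanLines.length} of ${lines.length} entries after filtering self-identified bots.`);
```
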
Despite being very clever in their attempts to be seen as human visitors, robots typically have one significant limitation: the vast majority of robots do not interpret the JavaScript present on your site. This limitation is largely due to the fact that developing a JavaScript interpreter is far more complex than simply developing a crawler that traverses HTML links. The ramification is that while a standard browser driven by a human visitor will load the HTML of the web page and then immediately load all other resources, including those requested by JavaScript, a robot will load the HTML of the page, perhaps the graphics the HTML references, and that's it.
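
This limitation is exactly what JavaScript-based (page-tag) analytics tools take advantage of. The sketch below, written in TypeScript, shows the general idea rather than any particular vendor's actual tag, and the collection endpoint is hypothetical. The script runs only in a real browser and requests a tiny tracking image; a robot that never executes JavaScript never makes that request and therefore never appears in the collected data:

```typescript
// Minimal page-tag sketch: runs in the visitor's browser once the page has loaded.
function recordPageView(): void {
  const data = [
    "page=" + encodeURIComponent(window.location.pathname), // which page was viewed
    "referrer=" + encodeURIComponent(document.referrer),    // where the visitor came from
    "t=" + Date.now(),                                      // cache-buster
  ].join("&");

  // Requesting a 1x1 image is the classic way to ship the data to the collector;
  // "https://stats.example.com/collect" is a hypothetical endpoint, not a real service.
  const beacon = new Image(1, 1);
  beacon.src = "https://stats.example.com/collect?" + data;
}

window.addEventListener("load", recordPageView);
```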

As a result, if you implement your web analytics using a JavaScript-based tool, you are likely to all but eliminate the problem of web robots polluting your analytics data. Analytics Wizard from AnalyticsWizard.com is a free JavaScript-based web analytics tool that I developed and suggest you try.

Keep in mind that even the JavaScript solution to robot pollution is short-lived. Browsers already have the technology to interpret JavaScript, so eventually robots will obtain it as well, and we will need to find better means of filtering them out. However, it is a good bet that there will always be far more robots that do not interpret JavaScript than ones that do.
