GET /robots.txt HTTP/1.1

I was thinking this morning about a search engine that indexes content it isn't supposed to.

A long-standing convention among web search engines is that they abide by a little file called 'robots.txt' (CNN publishes one, for example, and the format is laid out in the robots.txt specification). If a crawler finds such a file, it is supposed to use the rules in it to know what to ignore on a site. Without one, a web search engine will index anything it can find.
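For the curious, a robots.txt file is just a plain-text list of User-agent and Disallow rules. Here is a minimal, made-up example (the paths are hypothetical):

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /private/

Any crawler matching the User-agent line is asked to stay out of the listed paths.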

But what if someone were to build a rogue search engine that finds all these robots.txt files and indexes ONLY the content marked with a "Disallow" rule? Certainly unethical, but I don't think it would be illegal. Don't worry: I'm not about to build such an engine. One probably already exists and I just don't know about it.
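Just to make the idea concrete, here is a minimal sketch of that engine's first step: fetching a site's robots.txt and pulling out its Disallow paths. This is Python, the domain is a placeholder, and it ignores most of the spec's edge cases:

    # Fetch a site's robots.txt and list the paths it asks crawlers to skip.
    import urllib.request

    def disallowed_paths(host):
        url = "http://%s/robots.txt" % host
        with urllib.request.urlopen(url) as resp:
            text = resp.read().decode("utf-8", errors="replace")
        paths = []
        for line in text.splitlines():
            line = line.split("#", 1)[0].strip()  # strip comments
            if line.lower().startswith("disallow:"):
                path = line.split(":", 1)[1].strip()
                if path:  # an empty Disallow means "allow everything"
                    paths.append(path)
        return paths

    if __name__ == "__main__":
        for p in disallowed_paths("www.example.com"):
            print(p)

A rogue indexer would then crawl exactly those paths instead of avoiding them, which is the whole inversion of the idea.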

I wonder what could be found out there...
