GET /robots.txt HTTP/1.1
I was thinking this morning about a search engine that indexes content it isn't supposed to.
A long-standing convention among web search engines is that they abide by a little file called 'robots.txt' (CNN's robots.txt file is here as an example; the robots.txt specification is here). If such a file is found, a crawler is supposed to use the rules in that file to know what to ignore on a web site. Without such a file, a web search engine will index anything it can find.
But what if someone were to build a rogue search engine that finds all these robots.txt files and indexes ONLY the content that is marked with a "Disallow" rule? Certainly unethical, but I don't think it would be illegal. Don't worry: I'm not about to build such an engine. One probably exists already and I just don't know about it.
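For the curious, the core of such an "inverted" crawler would be trivially simple: fetch a site's robots.txt and collect the Disallow rules instead of obeying them. Here's a minimal sketch in Python; the sample robots.txt content is made up for illustration, and a real crawler would of course fetch the file over HTTP and handle per-user-agent groups more carefully.

```python
# Hypothetical sample robots.txt content for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
Allow: /public/

User-agent: Googlebot
Disallow: /no-google/
"""

def disallowed_paths(robots_txt: str) -> list:
    """Return every path marked Disallow, across all User-agent groups."""
    paths = []
    for line in robots_txt.splitlines():
        # Strip comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:  # an empty Disallow value means "allow everything"
                paths.append(path)
    return paths

print(disallowed_paths(SAMPLE_ROBOTS_TXT))
# → ['/cgi-bin/', '/private/', '/no-google/']
```

The rogue engine would then feed those paths to its crawler as the *only* URLs worth visiting, which is exactly the inversion of what the convention intends.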
I wonder what could be found out there...