Restricting Google on my terms

A while back, I posted a request to Google. I said, "Google, please give me a way to mark up my pages to let your GoogleBot know what to index and what not to index on my site." Point being that a lot of the stuff on my home page and on weblog entries should not be indexed. I want to decide what Google indexes-- not at the page level, but at the paragraph level.

Since Google ignored my plea, I set out to do it myself. After all, it's my server that is giving Google all this information, so can't I just limit what I expose to them?

Yes. You see, my site is produced using PHP, which lets me do different things each time a page is served. You can do the same thing with other dynamic solutions-- JSP, ASP, etc.

So here's what I did, in a nutshell.

1. Create a PHP function that determines if the remote user is a search bot or not. This goes in a file like "lib.php" or something that gets included on all pages of my site:

<?php
// lib.php

global $SEARCH_BOT;
# set to TRUE if UserAgent looks like a search bot
$SEARCH_BOT = search_bot();

function search_bot() {
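  // Some clients send no User-Agent header; fall back to an empty
  // string (treated as "not a bot") instead of raising a notice.
  $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';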

  $common_uas = array(
    '!^[Mm]ozilla!',
    '!^MSIE!',
    '!^Opera!',
    '!^Links!',
    '!^Lynx!',
    '!^Java!'
  );

  foreach ($common_uas as $patt) {
    if (preg_match($patt, $ua)) {
      return FALSE;
    }
  }

  $search_bots = array(
    '!^Googlebot/!',
    '!^Mediapartners-Google/!',
    '!^Scooter[-/]!',
    '!^Szukacz/!',
    '!^Atomz/!',
    '!^LexiBot/!',
    '!^Mercator-!',
    '!^Openbot/!',
    '!larbin!i',
    '!crawl!i',
    '!index!i',
    '!seek!i',
    '!spider!i',
    '|(?<!re)search|i',
    '|(?<!mac )find|i'
  );

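  // Strip the trailing "(...)" comment so the patterns below only
  // see the product name and version.
  $ua = preg_replace('/\s*\(.+$/','',$ua);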

  foreach ($search_bots as $patt) {
    if (preg_match($patt, $ua)) {
      return TRUE;
    }
  }

  return FALSE;
}
?>

2. Next, within any PHP page I want to restrict, I add an include "lib.php"; statement at the top and then surround anything I don't want indexed by Google and friends with this:

<?php if (!$SEARCH_BOT) { ?>
    (html content)
<?php } ?>
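A quick way to see how this kind of pattern matching classifies different visitors is to pull the User-Agent into a parameter and feed it sample strings. The sketch below is not the lib.php function itself: ua_looks_like_bot() is a hypothetical name, the bot pattern list is a shortened subset, and the sample strings are only illustrative.

```php
<?php
// Hypothetical helper: same idea as search_bot(), but the User-Agent is
// passed in so the patterns can be exercised without a real request.
function ua_looks_like_bot($ua) {
  // Browser-style prefixes are treated as human visitors right away.
  if (preg_match('!^([Mm]ozilla|MSIE|Opera|Links|Lynx|Java)!', $ua)) {
    return false;
  }

  // Drop the trailing "(...)" comment before matching bot patterns.
  $ua = preg_replace('/\s*\(.+$/', '', $ua);

  // Shortened subset of the bot patterns, for illustration only.
  $bot_patterns = array('!^Googlebot/!', '!crawl!i', '!spider!i');
  foreach ($bot_patterns as $patt) {
    if (preg_match($patt, $ua)) {
      return true;
    }
  }
  return false;
}

// Illustrative sample strings, not an authoritative list.
$samples = array(
  'Googlebot/2.1 (+http://www.google.com/bot.html)',
  'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
);
foreach ($samples as $ua) {
  echo (ua_looks_like_bot($ua) ? 'bot    ' : 'browser'), '  ', $ua, "\n";
}
```

With this ordering, the third sample is classified as a browser, because the Mozilla prefix check returns before any bot pattern is tried. A crawler that identifies itself with a Mozilla-style string would need its own pattern checked ahead of the browser list.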

That's it. Granted, the search_bot() function is not complete-- I'm sure many search engines are slipping through the cracks, but the detection it does can be refined and improved. You could go further and restrict by the originating IP address/subnet if you wished. There are lots of possibilities. (And I certainly welcome suggestions and improvements to this function.)
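For the IP address/subnet idea, here is a minimal sketch, assuming IPv4 only. The function names ip_in_subnet() and ip_is_known_bot() and the variable $FROM_BOT_SUBNET are hypothetical, and the two subnets are reserved documentation ranges standing in for real crawler addresses, which you would have to look up yourself.

```php
<?php
// Hypothetical helper: TRUE if an IPv4 address falls inside a CIDR block
// such as "192.0.2.0/24".
function ip_in_subnet($ip, $cidr) {
  $parts = explode('/', $cidr, 2);
  if (count($parts) !== 2 || !preg_match('/^\d{1,2}$/', $parts[1])) {
    return false; // malformed CIDR string
  }
  $bits = (int) $parts[1];
  $ip_long = ip2long($ip);
  $subnet_long = ip2long($parts[0]);
  if ($bits > 32 || $ip_long === false || $subnet_long === false) {
    return false; // not valid IPv4 input
  }
  // Mask that keeps the top $bits bits of a 32-bit address.
  $mask = -1 << (32 - $bits);
  return ($ip_long & $mask) === ($subnet_long & $mask);
}

// Hypothetical helper: TRUE if the address is in any of the given subnets.
function ip_is_known_bot($ip, $bot_subnets) {
  foreach ($bot_subnets as $cidr) {
    if (ip_in_subnet($ip, $cidr)) {
      return true;
    }
  }
  return false;
}

// Placeholder ranges (reserved for documentation), NOT real crawler subnets.
$bot_subnets = array('192.0.2.0/24', '198.51.100.0/24');

$remote_ip = isset($_SERVER['REMOTE_ADDR']) ? $_SERVER['REMOTE_ADDR'] : '';
$FROM_BOT_SUBNET = ip_is_known_bot($remote_ip, $bot_subnets);
```

A flag like this could be combined with $SEARCH_BOT, for example by hiding content only when either check says the visitor is a crawler. Whether to trust the address, the User-Agent, or both is a judgment call this sketch does not settle.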

How effective is it? Take a look at this page. Notice the following bits that aren't really relevant to the real content of that page:

  • The page navigation at the top
  • The link to the previous and next entries
  • The Google ad
  • The trackback content
  • The comment content (remember, this is an index of my content, not yours)
  • The list of backlinks at the bottom
  • The blogroll
  • The sideblog content (this was killing accurate search results)
  • The site navigation

There's probably more content I want to exclude from any given page than content I want to include. So... with the solution I gave above, this is how Google sees that page. Amazing, isn't it?

Let's look at my home page through this same lens. (Well, apparently I'm still allowing the Google ad to be indexed there-- I'll have to fix that.) See the difference? Imagine if every site were more precisely indexed. It would eliminate a lot of false positive search results. A lot.

But please act responsibly with this technique-- obviously, this would allow you to return content that is altogether different, perhaps as a way to manipulate your page rank results. So please don't abuse this capability-- use it to improve the search results for your weblog/site.

12 Comments

Are you worried about getting banned from Google? I've read of people abusing this sort of thing to get higher rankings, and that some search engines will occasionally request a page with a non-bot type UA to see if they get the same content...

Andy Baio said:

You might want to read Google's FAQ on cloaking. Your version is benign, but if it's determined automatically, you might be in trouble.

Brad said:

Good point-- no, I hadn't seen that. Although, I would argue that the spirit of that restriction is to prevent the kind of abuse I alluded to at the end of this post. If anything, I am purifying the content that is indexed. If Google offered a way to supply "hints" within the page to indicate what should really be indexed, I wouldn't have to resort to this. Sadly, they do not.

I've been using this technique for more than a year I believe-- so far, I haven't been contacted by Google about this practice. Nor have I been banned for doing it. But if this post raises their attention to the issue of indexing "cruft" within a weblog site, hopefully it will produce a Google-sanctioned solution that we can all use.

Who knows, maybe Ev could bring this problem to their attention...

Peter Winnberg said:

First of all, Google will not index the Google ad because it is included using javascript, right?

If you don't want Google to index your comments and trackbacks together with the content of a page, isn't the best solution to not put them there? Instead you could have a link at the bottom of each story to an index of comments and trackbacks for that page.

Brad said:

Peter-- duh, you're right. The ads won't be indexed, but they are displayed in the cache. I guess that's OK.

But as to putting the comments/trackbacks on a different page? No, I'd rather not. Besides, there's more than just comments and trackbacks that I want to exclude. I used to get a lot of hits from searches for "blah photos" matching pages all over my site. Even though the page mentions "blah", there were no photos of it/them. It was the site navigation link for "Photos" on the right that was causing these false positive search results. So-- should I put my site navigation on a separate page??? Perhaps I should start using frames or something?

Over my dead <body/>.

Mark J said:

Good article. I get some funky search queries that lead people to my site.

I doubt that the "cloak detection" process is completely automated, and I'm sure that if they manually reviewed your content to see what you were hiding, they'd understand. Heck, you could even leave an HTML comment for them:

I don't think it's a problem as long as you are just hiding cruft, and not introducing new content or doing something malicious.

On a different note, those using Mozilla Firefox can change their user agent on the fly. Sometimes I like to surf the web "as googlebot." The results can be interesting.

Ryan said:

Awesome script, Brad. Might I make a small suggestion? Perhaps a clear notice at the top of the page that only shows up in the Google cache and not on your regular site (something along the lines of "This is a cached version of this page with all navigation and cruft removed. If you want to visit the full and most recent version of this page, go to...").

I'd imagine this would be pretty easy to do since you already have the spider-detecting function.

Nixon said:

Does restricting Google's Mediapartners-Google put you in violation of the Ad-Words T&Cs?

erica said:

I used to get a lot of hits searching for “blah photos” matching for pages all over my site. Even though the page mentions “blah”, there were no photos of it/them.

And unfortunately it's problems like that that cause Google to serve up a lot of results that are nowhere near what a person is searching for and thus waste their time.

Brice said:

I too asked Google for a way to add hints for the bot. No reply. It doesn't seem to exist for general use.

You're definitely cloaking here. Though it's good to see that they haven't penalized you for it.

Perhaps there is a little human judgement still applied. That would be welcome info.

Thanks for sharing the script.

Marcos said:

Great script, and good discussion.

About

This article was published on July 2, 2004 10:57 AM.
