Restricting Google on my terms
A while back, I posted a request to Google. I said, "Google, please give me a way to mark up my pages to let your GoogleBot know what to index and what not to index on my site." Point being that a lot of the stuff on my home page and on weblog entries should not be indexed. I want to decide what Google indexes-- not at the page level, but at the paragraph level.
Since Google ignored my plea, I set out to do it myself. After all, it's my server that is giving Google all this information, so can't I just limit what I expose to them?
Yes. You see, my site is produced using PHP, which lets me do different things each time a page is served. You can do the same thing with other dynamic solutions-- JSP, ASP, etc.
So here's what I did, in a nutshell.
1. Create a PHP function that determines if the remote user is a search bot or not. This goes in a file like "lib.php" or something that gets included on all pages of my site:
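<?php
// lib.php
global $SEARCH_BOT;
# set to TRUE if UserAgent looks like a search bot
$SEARCH_BOT = search_bot();

function search_bot() {
    // A request with no User-Agent header is treated as a non-bot.
    $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

    // Common browsers: bail out early. Note that any UA beginning with
    // one of these prefixes is treated as a browser, even if it is a crawler.
    $common_uas = array(
        '!^[Mm]ozilla!',
        '!^MSIE!',
        '!^Opera!',
        '!^Links!',
        '!^Lynx!',
        '!^Java!'
    );
    foreach ($common_uas as $patt) {
        if (preg_match($patt, $ua)) {
            return FALSE;
        }
    }

    // Known crawler prefixes first, then generic keywords matched anywhere.
    // The lookbehinds keep "research" and "mac find" from matching.
    $search_bots = array(
        '!^Googlebot/!',
        '!^Mediapartners-Google/!',
        '!^Scooter[-/]!',
        '!^Szukacz/!',
        '!^Atomz/!',
        '!^LexiBot/!',
        '!^Mercator-!',
        '!^Openbot/!',
        '!larbin!i',
        '!crawl!i',
        '!index!i',
        '!seek!i',
        '!spider!i',
        '|(?<!re)search|i',
        '|(?<!mac )find|i'
    );

    // Drop the parenthesized part of the UA and everything after it
    // before matching the keywords.
    $ua = preg_replace('/\s*\(.+$/', '', $ua);
    foreach ($search_bots as $patt) {
        if (preg_match($patt, $ua)) {
            return TRUE;
        }
    }
    return FALSE;
}
?>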
2. Next, within any PHP pages I want to restrict, I add the PHP statement include "lib.php"; at the top and then surround anything I don't want to be indexed by Google and friends with this:
<?php if (!$SEARCH_BOT) { ?>
(html content)
<?php } ?>
That's it. Granted, the search_bot() function is not complete-- I'm sure many search engines are slipping through that crack, but the determination it does can be refined and improved. You could go further and restrict by the originating IP address/subnet if you wished. There are lots of possibilities. (And I certainly welcome suggestions and improvements to this function.)
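The IP address/subnet idea could be sketched roughly as follows. This is only an illustration, not part of my lib.php: the function names ip_in_subnet() and ip_search_bot() are made up for this example, the subnets listed are reserved documentation ranges rather than real crawler addresses, and the bit arithmetic assumes IPv4 on a 64-bit PHP build.

```php
<?php
// Hypothetical helper (not in the original lib.php).
// Returns TRUE if the IPv4 address $ip lies inside $cidr, e.g. "192.0.2.0/24".
function ip_in_subnet($ip, $cidr) {
    $parts = explode('/', $cidr, 2);
    if (count($parts) !== 2 || !preg_match('/^\d+$/', $parts[1])) {
        return FALSE; // not in "address/bits" form
    }
    $bits = (int) $parts[1];
    $ip_long = ip2long($ip);
    $subnet_long = ip2long($parts[0]);
    if ($ip_long === false || $subnet_long === false || $bits > 32) {
        return FALSE; // malformed address or prefix length
    }
    // Netmask with the top $bits bits set (assumes 64-bit integers).
    $mask = ($bits === 0) ? 0 : ((~0 << (32 - $bits)) & 0xFFFFFFFF);
    return ($ip_long & $mask) === ($subnet_long & $mask);
}

// Hypothetical helper: TRUE if $ip falls inside any subnet in $bot_subnets.
function ip_search_bot($ip, $bot_subnets) {
    foreach ($bot_subnets as $cidr) {
        if (ip_in_subnet($ip, $cidr)) {
            return TRUE;
        }
    }
    return FALSE;
}

// Placeholder ranges only (reserved for documentation); substitute the
// ranges you actually observe crawlers coming from.
$bot_subnets = array('192.0.2.0/24', '198.51.100.0/25');

// Possible use alongside the User-Agent check:
// $SEARCH_BOT = search_bot() || ip_search_bot($_SERVER['REMOTE_ADDR'], $bot_subnets);
```

Combining the two checks with "or" widens the net; requiring both would make it harder for someone to see the bot view just by changing their User-Agent.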
How effective is it? Take a look at this page. Notice the following bits that aren't really relevant to the real content of that page:
- The page navigation at the top
- The link to the previous and next entries
- The Google ad
- The trackback content
- The comment content (remember, this is an index of my content, not yours)
- The list of backlinks at the bottom
- The blogroll
- The sideblog content (this was killing accurate search results)
- The site navigation
There's probably more content I want to exclude from any given page than content I want to include. So... with the solution I gave above, this is how Google sees that page. Amazing, isn't it?
Let's look at my home page through this same lens. (Well, apparently I'm still allowing the Google ad to be indexed there-- I'll have to fix that.) See the difference? Imagine if everyone's site were more precisely indexed. It would eliminate a lot of false positive search results. A lot.
But please act responsibly with this technique-- obviously, this would allow you to return content that is altogether different, perhaps as a way to manipulate your page rank results. So please don't abuse this capability-- use it to improve the search results for your weblog/site.
Are you worried about getting banned from Google? I've read of people abusing this sort of thing to get higher rankings, and that some search engines will occasionally request a page with a non-bot type UA to see if they get the same content...
You might want to read Google's FAQ on cloaking. Your version is benign, but if it's determined automatically, you might be in trouble.
Good point-- no, I hadn't seen that. Although, I would argue that the spirit of that restriction is to prevent the kind of abuse I alluded to at the end of this post. If anything, I am purifying the content that is indexed. If Google offered a way to supply "hints" within the page to indicate what should really be indexed, I wouldn't have to resort to this. Sadly, they do not.
I've been using this technique for more than a year I believe-- so far, I haven't been contacted by Google about this practice. Nor have I been banned for doing it. But if this post raises their attention to the issue of indexing "cruft" within a weblog site, hopefully it will produce a Google-sanctioned solution that we can all use.
Who knows, maybe Ev could bring this problem to their attention...
First of all, Google will not index the Google ad because it is included using javascript, right?
If you don't want Google to index your comments and trackbacks together with the content of a page, isn't the best solution to not put them there? Instead you could have a link at the bottom of each story to an index of comments and trackbacks for that page.
Peter-- duh, you're right. The ads won't be indexed, but they are displayed in the cache. I guess that's OK.
But as to putting the comments/trackbacks on a different page? No, I'd rather not. Besides, there's more than just comments and trackbacks that I want to exclude. I used to get a lot of hits from searches for "blah photos" that matched pages all over my site. Even though the page mentions "blah", there were no photos of it/them. It was the site navigation link for "Photos" on the right that was causing these false positive search results. So-- should I put my site navigation on a separate page??? Perhaps I should start using frames or something?
Over my dead <body/>.
Good article. I get some funky search queries that lead people to my site.
I doubt that the "cloak detection" process is completely automated, and I'm sure that if they manually reviewed your content to see what you were hiding, they'd understand. Heck, you could even leave an HTML comment for them:
I don't think it's a problem as long as you are just hiding cruft, and not introducing new content or doing something malicious.
On a different note, those using Mozilla Firefox can change their user agent on the fly. Sometimes I like to surf the web "as googlebot." The results can be interesting.
Awesome script, Brad. Might I make a small suggestion? Perhaps a clear notice at the top of the page that only shows up in the Google cache and not on your regular site (something along the lines of "This is a cached version of this page with all navigation and cruft removed. If you want to visit the full and most recent version of this page, go to...").
I'd imagine this would be pretty easy to do since you already have the spider-detecting function.
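A minimal sketch of that suggestion, assuming the $SEARCH_BOT flag from lib.php is available on the page. The function name cache_notice(), the CSS class, and the example URL are all hypothetical, and the notice wording is adapted from the comment above:

```php
<?php
// Hypothetical helper: returns a notice for bots only, so it would show up
// in a search engine's cached copy but not for regular visitors.
function cache_notice($is_bot, $url) {
    if (!$is_bot) {
        return ''; // regular visitors see nothing extra
    }
    // Escape the URL before placing it in an attribute and in the text.
    $safe = htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
    return '<p class="cache-notice">This is a cached version of this page '
         . 'with all navigation and cruft removed. For the full and most '
         . 'recent version, go to <a href="' . $safe . '">' . $safe . '</a>.</p>';
}

// Possible use near the top of a page (example URL is a placeholder):
// echo cache_notice($SEARCH_BOT, 'http://example.com/some-entry');
```

Keep in mind that this adds content only bots see, which is a step closer to what the cloaking guidelines discussed above are worried about than simply hiding cruft.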
Does restricting Google's Mediapartners-Google put you in violation of the Ad-Words T&Cs?
And unfortunately it's problems like that that cause Google to serve up a lot of results that are nowhere near what a person is searching for and thus waste their time.
I too asked Google for a way to add hints for the bot. No reply. It doesn't seem to exist for general use.
You're definitely cloaking here. Though it's good to see that they haven't penalized you for it.
Perhaps there is a little human judgement still applied. That would be welcome info.
Thanks for sharing the script.
Great script, and good discussion.