2004-09-28
New Googlebot?
I've recently updated the logging system for the Nanobox to include detailed information from anything with bot in the User-Agent header, and I noticed something that I hadn't noticed before: when the Googlebot came to my site, it identified as Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html).
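For anyone who wants to do the same kind of filtering, here is a minimal Python sketch of the check described above. It is not the Nanobox's actual logging code (which isn't shown here); the function names are made up for illustration, and it assumes the User-Agent header is already available as a plain string.

```python
import re

# Hypothetical helper names; these are illustrative, not the Nanobox's real code.
BOT_PATTERN = re.compile(r"bot", re.IGNORECASE)
GOOGLEBOT_VERSION = re.compile(r"Googlebot/(\d+(?:\.\d+)*)")


def is_bot(user_agent):
    """Return True if the User-Agent header contains "bot" in any letter case."""
    return bool(BOT_PATTERN.search(user_agent or ""))


def googlebot_version(user_agent):
    """Return the Googlebot version string from the header, or None if absent."""
    match = GOOGLEBOT_VERSION.search(user_agent or "")
    return match.group(1) if match else None


def claims_mozilla(user_agent):
    """Return True if the header begins with a Mozilla/ compatibility token."""
    return (user_agent or "").startswith("Mozilla/")
```

Passing the identification string quoted above through all three helpers would flag it as a bot, report version 2.1, and note the Mozilla prefix. A plain substring match on "bot" is deliberately loose, so it will also catch any other agent that happens to contain those letters.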
I hadn't noticed the Mozilla/5.0 part before, but I just received an article from my google|gmail|blogger|orkut|picasa|puffin Google Alert talking about a new Googlebot: "Did Google Unleash Additional Googlebots".
This is quite interesting. Of course, it's too soon to draw any conclusions from this. The new User-Agent identification may have been adopted for the same reason that Internet Explorer started identifying as Mozilla/4.0 early on: a lot of sites filter content based on the user agent. By changing the identification, Google may be trying to thwart scammers who manipulate search results by serving different content to search engines. However, given that the name Googlebot/2.1 is still in the identification, that idea is questionable.
I've been having Google problems on the Nanobox for a while, due to a rather poor script I wrote that tries to act as if the web server on my own computer is hosted at that address. Google spent a long time requesting robots.txt and / over and over, presumably failing because of incorrect cache headers my script was sending, and then it eventually started to give up (which was just before I fixed it). Now this new Googlebot comes along, and instead of spidering my site like normal (starting with robots.txt, then /, then pages that / links to), it's just picking off pages that are already in the Google index. Just about all of those have moved since, and as I have recently learned, Google doesn't follow HTTP redirects.
I'm going to start paying particular attention to the actions of this new Googlebot, as well as the actions of the previous Googlebot, and post any interesting findings I make. Meanwhile, MSNBot is happily checking my broken/redirecting links over and over again with incredible persistence. Sorry, those pages ain't coming back there.
Update 2004.9.28: Today the new Googlebot dove into the Nanobox several directories deep in just a few minutes. Those pages aren't yet up on the Google search results. Looks like my cache problems are finally over. :)
Update 2004.9.29: I forgot to mention that the new Googlebot uses HTTP/1.1 instead of the HTTP/1.0 that the previous one uses. Also, it might just be a regular indexing delay, but I've noticed that pages crawled several days ago haven't yet made it to Google's search results. It's a bit too early to jump to any conclusions, but it sure would be interesting if Google were completely rebuilding their index from scratch (which could mean a drastic change in their relevance algorithm). Time will tell.
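To compare the protocol versions the two Googlebots use, something like the following Python sketch could tally them from logged requests. It assumes each log entry gives the User-Agent and the raw request line (for example "GET /robots.txt HTTP/1.1"); that entry shape and the function names are hypothetical and not taken from the Nanobox's logs.

```python
from collections import Counter


def http_version(request_line):
    """Return the protocol version from a request line, e.g. "1.1", or None."""
    parts = request_line.strip().split()
    if len(parts) == 3 and parts[2].startswith("HTTP/"):
        return parts[2][len("HTTP/"):]
    return None


def tally_versions(entries):
    """Count requests per (user_agent, http_version) pair.

    entries is an iterable of (user_agent, request_line) tuples,
    which is an assumed shape for illustration only.
    """
    counts = Counter()
    for user_agent, request_line in entries:
        counts[(user_agent, http_version(request_line))] += 1
    return counts
```

Grouping by the full User-Agent string keeps the two Googlebot identifications apart, so a split between HTTP/1.0 and HTTP/1.1 shows up directly in the counts.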