Sunday, December 18, 2005
I hope everyone is aware of the recent move by webmasterworld.com to make all postings private; people can view and read threads only after they log in.
They also banned all bots in their robots.txt file. This is what their robots.txt file says:
"#
# Please, we do NOT allow nonauthorized robots.
#
# http://www.webmasterworld.com/robots
# Actual robots can always be found here for: http://www.webmasterworld.com/robots2
# Old full robots.txt can be found here: http://www.webmasterworld.com/robots3
#
# Any unauthorized bot running will result in IP's being banned.
# Agent spoofing is considered a bot.
#
# Fair warning to the clueless - honey pots are - and have been - running.
# If you have been banned for bot running - please sticky an admin for a reinclusion request.
#
# http://www.searchengineworld.com/robots/
# This code found here: http://www.webmasterworld.com/robots.txt?view=rawcode
User-agent: *
Disallow: /
"
User-agent: *
Disallow: /
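To show what those directives mean to a compliant crawler, here is a minimal sketch using Python's standard urllib.robotparser module (purely illustrative; it simply reads the rules quoted above the way any well-behaved bot would):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Feed the blanket-disallow rules shown above directly, instead of fetching them.
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# Any user agent, any path: crawling is disallowed.
print(rp.can_fetch("Googlebot", "http://www.webmasterworld.com/forum9/"))  # False
print(rp.can_fetch("Slurp", "http://www.webmasterworld.com/"))             # False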
The above robots.txt syntax means that no bot, whether a search engine bot or a spam bot, is allowed to crawl webmasterworld.com. So it was a bit strange when Greg Boser mentioned this in his blog (http://www.webguerrilla.com/clueless/welcome-back-brett):
"I was doing some test surfing this morning using a new user agent/header checking tool Dax just built. Just for fun, I loaded up WebmasterWorld with a Slurp UA. Suprisingly, I was able to navigate through the site. I was also able to surf the site as Googlebot and MSNbot.
A quick check of the robots.txt with several different UA’s showed that MSN and Yahoo are now given a robots.txt that allows them to crawl. However, Google is still banned, and humans still must login in order to view content.
Apparently, it’s been this way for awhile because both engines already show a dramatic increase in page counts.
MSN 57,000
Yahoo 160,000
"
We were taken totally by surprise. So how does this work? Short of cloaking, there is no other way to do it. We thought we would do a bit of research, so we used a user-agent spoofer to navigate their site. As Greg mentioned, we tried the following user agents:
Yahoo-Slurp
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Googlebot/2.1 (+https://www.google.com/bot.html)
msnbot/1.0 (+http://search.msn.com/msnbot.htm)
With all of the above user agents we were able to browse webmasterworld.com without any trouble. A rough sketch of this kind of test is shown below.
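For the record, here is a rough sketch (Python, purely for illustration) of the kind of user-agent spoofing test we ran; the User-Agent strings are the ones listed above, and the fetch loop is just the simplest thing that works:

import urllib.request

SPOOFED_AGENTS = [
    "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)",
    "Googlebot/2.1 (+https://www.google.com/bot.html)",
    "msnbot/1.0 (+http://search.msn.com/msnbot.htm)",
]

for agent in SPOOFED_AGENTS:
    # Request the same URL while pretending to be each search engine bot.
    req = urllib.request.Request(
        "http://www.webmasterworld.com/robots.txt",
        headers={"User-Agent": agent},
    )
    with urllib.request.urlopen(req) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    print("---", agent, "---")
    print(body)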
Update to Greg's post:
Googlebot is now allowed to crawl webmasterworld.com via robots.txt cloaking, and Google shows about 250,000 pages now. Earlier, when webmasterworld.com did not cloak its robots.txt file and blocked all robots, Google removed every webmasterworld.com page from its index. That is most likely because the robots.txt URL was submitted directly to Google's automated URL removal system.
Google talks about this clearly here:
"Note: If you believe your request is urgent and cannot wait until the next time Google crawls your site, use our automatic URL removal system. In order for this automated process to work, the webmaster must first create and place a robots.txt file on the site in question.
Google will continue to exclude your site or directories from successive crawls if the robots.txt file exists in the web server root. If you do not have access to the root level of your server, you may place a robots.txt file at the same level as the files you want to remove. Doing this and submitting via the automatic URL removal system will cause a temporary, 180 day removal of the directories specified in your robots.txt file from the Google index, regardless of whether you remove the robots.txt file after processing your request. (Keeping the robots.txt file at the same level would require you to return to the URL removal system every 180 days to reissue the removal.)
"
https://www.google.com/webmasters/remove.html
This is the robots.txt file we saw using the Googlebot user-agent spoofer:
GET Header sent to the bot [Googlebot/2.1 (+https://www.google.com/bot.html)]:
HTTP/1.1 200 OK
Date: Sun, 18 Dec 2005 17:35:10 GMT
Server: Apache/2.0.52
Cache-Control: max-age=0
Pragma: no-cache
X-Powered-By: BestBBS v3.395
Connection: close
Transfer-Encoding: chunked
Content-Type: text/plain
326
#
# Please, we do NOT allow nonauthorized robots.
#
# http://www.webmasterworld.com/robots
# Actual robots can always be found here for: http://www.webmasterworld.com/robots2
# Old full robots.txt can be found here: http://www.webmasterworld.com/robots3
#
# Any unauthorized bot running will result in IP's being banned.
# Agent spoofing is considered a bot.
#
# Fair warning to the clueless - honey pots are - and have been - running.
# If you have been banned for bot running - please sticky an admin for a reinclusion request.
#
# http://www.searchengineworld.com/robots/
# This code found here: http://www.webmasterworld.com/robots.txt?view=rawcode
User-agent: *
Disallow: /gfx/
Disallow: /cgi-bin/
Disallow: /QuickSand/
Disallow: /pda/
Disallow: /zForumFFFFFF/
This is the header response seen by a normal browser:
HEAD Header sent to the browser [Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)]:
HTTP/1.1 200 OK
Date: Sun, 18 Dec 2005 17:35:10 GMT
Server: Apache/2.0.52
Cache-Control: max-age=0
Pragma: no-cache
X-Powered-By: BestBBS v3.395
Connection: close
Content-Type: text/plain
URI: www.webmasterworld.com/robots.txt
Source delivered to [Googlebot/2.1 (+https://www.google.com/bot.html)]:
"
User-agent: *
Disallow: /gfx/
Disallow: /cgi-bin/
Disallow: /QuickSand/
Disallow: /pda/
Disallow: /zForumFFFFFF/
"
From the above you can see that webmasterworld.com does not ban Googlebot or the other major bots from crawling its pages. This is nothing new for Brett: before webmasterworld.com went private, Googlebot already had access to the paid section of WebmasterWorld while normal users had to subscribe.
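We do not know how Brett's setup is actually built, but the behaviour we observed can be reproduced with a few lines of server-side code. The sketch below (Python's built-in http.server, with a purely hypothetical bot whitelist) serves the permissive rules to requests whose User-Agent looks like a major search engine bot, and the blanket Disallow to everyone else:

from http.server import BaseHTTPRequestHandler, HTTPServer

ALLOWED_BOTS = ("googlebot", "slurp", "msnbot")  # assumed whitelist, not WebmasterWorld's

OPEN_RULES = (
    "User-agent: *\n"
    "Disallow: /gfx/\n"
    "Disallow: /cgi-bin/\n"
    "Disallow: /QuickSand/\n"
    "Disallow: /pda/\n"
    "Disallow: /zForumFFFFFF/\n"
)
CLOSED_RULES = "User-agent: *\nDisallow: /\n"

class CloakingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/robots.txt":
            self.send_error(404)
            return
        ua = self.headers.get("User-Agent", "").lower()
        # Whitelisted bot UAs get the permissive file; browsers and
        # everything else get the blanket Disallow.
        body = OPEN_RULES if any(bot in ua for bot in ALLOWED_BOTS) else CLOSED_RULES
        data = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    HTTPServer(("", 8080), CloakingHandler).serve_forever()

Brett says further down that he actually cloaks by IP address rather than by user agent, in which case the check would compare the requesting IP (self.client_address[0] in this sketch) against known crawler ranges instead of looking at the User-Agent header.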
Now the question is: does Google endorse cloaking? Cloaking is bad as defined by the search engine guidelines, yet now we can see that selective cloaking by selected sites is apparently not bad. We don't blame Brett for doing it, because he has good reasons to disallow spam bots and very good reasons to allow the well-behaved ones.
Brett explains why he banned bots. He says:
"Seeing what effect it will have on unauthorized bots. We spend 5-8hrs a week here fighting them. It is the biggest problem we have ever faced.
We have pushed the limits of page delivery, banning, ip based, agent based, and down right cloaking to avoid the rogue bots - but it is becoming an increasingly difficult problem to control."
webmasterworld.com/forum9/9593-2-10.htm
So what is Brett's answer about cloaking?
A webmasterworld.com member asks:
"Brett - do you cloak your robots.txt depending on IP address that requests it? "
Brett's answer:
"only for hot/honey pot purposes. "
Webmasterworld.com is one of the best places on the internet; great webmasters and SEOs are born there. It feels pretty harsh to complain about them, but the truth cannot stay hidden for long, and if we hadn't, someone else would have blogged about it. Greg (webguerrilla) has already discussed this issue at length.
SEO Blog team.