
Blocking Search Engine Crawlers / Spiders Using a Robots.txt File

Posted on 19 August 2009 by Kittu

You need a robots.txt file only if you don’t want searchbots to crawl, and therefore index, certain pages; that is, you want to keep particular pages of your website out of search engine results. If you want your entire site to be indexed, you don’t need a robots.txt file in the first place.

Robots.txt is simply an ASCII text file placed at the root of your domain.

For example, http://www.domainname.com/robots.txt.
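Because the file always sits at the root, its address can be derived from any page URL on the same site. A minimal Python sketch of that rule follows; the function name `robots_url` and the sample page path are illustrative assumptions, not part of any standard tool.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page on the example domain used above.
print(robots_url("http://www.domainname.com/blog/post.html?id=7"))
```

Whatever page is passed in, the result points to the same root-level file.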

Now, what kinds of pages or areas of your website might you want to block from search engines?

  • Web pages under construction.
  • Directories containing scripts and CSS style sheets.
  • Images on your website that you don’t want indexed.
  • Any form (contact or inquiry) that visitors fill in.
  • Pages that are copyrighted.
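As a rough illustration, the list above could be turned into a robots.txt file with a few lines of Python. The helper `build_robots_txt` and every path in the example are hypothetical; substitute the folders and pages of your own site.

```python
def build_robots_txt(disallowed_paths, user_agent="*"):
    """Build robots.txt text asking user_agent to skip each listed path."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in disallowed_paths]
    return "\n".join(lines) + "\n"


# Hypothetical paths, one for each kind of area mentioned above.
paths = ["/under-construction/", "/scripts/", "/css/", "/images/", "/contact.html"]
print(build_robots_txt(paths))
```

The output is plain text that would be saved as robots.txt at the root of the domain.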

Some points to note about robots.txt file:

  • Anyone can view your robots.txt file by typing the above URL, and so learn which areas of your website you have blocked from searchbots.
  • Some robots ignore the robots.txt file; malware robots that scan the web for security vulnerabilities and robots run by spammers are among them. Googlebot is considered a very well-behaved spider that follows the robots.txt file religiously.

For those confused about the difference between a “crawler”, a “robot” and a “spider”:

  • Robot - Any program that goes out onto the web to do things. This includes search engine crawlers, but also many other programs, like email scrapers, site testers, and so on.
  • Crawler - The special kind of robot that search engines use.
  • Spider - A term used mainly by SEO Professionals. Refers to a crawler.

So, a crawler is a kind of robot: it crawls a website and helps the search engine index it.

Many robots refuse to comply with the robots.txt file, and the same robots also ignore its page-level counterpart, metadata robot instructions (robots meta tags). While the robots.txt file contains instructions for the complete website, metadata robot instructions tell searchbots what to do with one particular page only. As the name suggests, they are included in the metadata section of the web page.

The code to stop all spiders from indexing a particular web page is:

<META NAME="robots" CONTENT="NOINDEX">

The code to stop a particular spider, e.g. Googlebot, from indexing a particular web page is:

<META NAME="googlebot" CONTENT="NOINDEX">

The code to stop all spiders from following the links on your page is:

<META NAME="robots" CONTENT="NOFOLLOW">

The code to stop a particular spider, e.g. Googlebot, from following the links on your page is:

<META NAME="googlebot" CONTENT="NOFOLLOW">
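To check which of these instructions a page actually carries, the tags can be read back with Python’s standard `html.parser` module. This is only a sketch: the class `RobotsMetaParser` and the function `robots_meta` are names made up for this example, and it looks only at the two tag names used above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in robots and googlebot meta tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # tag name -> set of lower-case directives

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").strip().lower()
        if name in ("robots", "googlebot"):
            # CONTENT may hold several comma-separated directives.
            found = {d.strip().lower() for d in attr.get("content", "").split(",") if d.strip()}
            self.directives.setdefault(name, set()).update(found)


def robots_meta(html: str) -> dict:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><META NAME="robots" CONTENT="NOINDEX, NOFOLLOW"></head></html>'
print(robots_meta(page))
```

Directives are lower-cased because the tag and its values are case-insensitive.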

Whenever a search engine spider visits a website, it first requests the robots.txt file. Suppose it finds the following code:

User-agent: *
Disallow: /

  • “*” applies to all searchbots: Google’s crawler, Yahoo’s crawler, all of them.
  • “/” means don’t crawl or visit any pages of this site.

And if the code is

User-agent: Googlebot
Disallow: /images/
Disallow: /scripts/

then Googlebot is prevented from accessing any images, scripts or other data placed in the “images” and “scripts” folders.
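Both rule sets can be tried out without touching a live site by using Python’s standard `urllib.robotparser` module. The sketch below is an illustration, not how any search engine implements its crawler; the helper `can_crawl`, the bot name “OtherBot” and the sample file names are assumptions. The rules are written with the field name `User-agent` and with no blank lines inside a record, which is the form this parser expects.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_text: str, agent: str, url: str) -> bool:
    """Return True if robots_text lets the given agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, url)


BLOCK_ALL = "User-agent: *\nDisallow: /\n"
BLOCK_GOOGLEBOT_DIRS = "User-agent: Googlebot\nDisallow: /images/\nDisallow: /scripts/\n"

site = "http://www.domainname.com"
print(can_crawl(BLOCK_ALL, "Googlebot", site + "/index.html"))                  # False
print(can_crawl(BLOCK_GOOGLEBOT_DIRS, "Googlebot", site + "/images/logo.png"))  # False
print(can_crawl(BLOCK_GOOGLEBOT_DIRS, "Googlebot", site + "/index.html"))       # True
print(can_crawl(BLOCK_GOOGLEBOT_DIRS, "OtherBot", site + "/images/logo.png"))   # True
```

The last line shows that a record addressed to Googlebot places no restriction on other robots.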

In fact, you can also control Googlebot’s crawl rate. You need to sign up for Google Webmaster Tools for that. There are three settings: slower, normal and faster. Normal is the default and recommended choice. If you select the slower option, Google will not crawl your website as often as it otherwise would. If you select the faster option, your bandwidth is likely to be wasted unless you change your content very frequently.

How do you know if your site has been visited by a robot / crawler?

  • A visitor that repeatedly requests the file ‘/robots.txt’ might be a crawler.
  • Check your server logs for visitors that retrieve many documents, especially in a short time.
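Both checks can be roughly automated. The Python sketch below assumes the server writes its access log in Common Log Format, which may not match your setup; the function names and the `min_hits` threshold are arbitrary choices for illustration. For simplicity it counts requests over the whole log rather than within a time window.

```python
import re
from collections import Counter

# Assumed Common Log Format:
# host ident user [time] "METHOD path protocol" status size
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:\S+) (\S+)[^"]*"')


def summarize(log_lines):
    """Count requests per host and note which hosts asked for /robots.txt."""
    hits = Counter()
    asked_for_robots = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        host, path = match.groups()
        hits[host] += 1
        if path.split("?", 1)[0] == "/robots.txt":
            asked_for_robots.add(host)
    return hits, asked_for_robots


def likely_crawlers(log_lines, min_hits=100):
    """Hosts that requested /robots.txt or made at least min_hits requests."""
    hits, asked = summarize(log_lines)
    return sorted(host for host, count in hits.items() if host in asked or count >= min_hits)
```

Calling `likely_crawlers(open("access.log"))` on a log file (the file name is hypothetical) would list the hosts worth a closer look; a high request count alone does not prove that a visitor is a robot.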

Note of Caution:

  • Be very careful with the robots.txt file. Pay attention to the code you write and double-check it, or you may end up blocking all search engine crawlers from your entire site.
  • If you have a robots.txt file, check it weekly or monthly. There have been cases of websites being hacked, the hackers inserting the “Disallow: /” command, and the entire site being dropped from Google.
  • Google observes both NOINDEX and NOFOLLOW instructions, but to be on the safer side, use of a robots.txt file is recommended.
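The periodic check could look something like the Python sketch below. It compares the current file with a copy you know to be good and uses the standard `urllib.robotparser` module to see whether a crawler is shut out of the site root. The function `audit_robots` and its messages are hypothetical, and downloading the live file is left out.

```python
from urllib.robotparser import RobotFileParser


def audit_robots(current_text, known_good_text, agent="Googlebot"):
    """Return a list of problems found in the current robots.txt text."""
    problems = []
    if current_text.strip() != known_good_text.strip():
        problems.append("robots.txt differs from the known-good copy")
    parser = RobotFileParser()
    parser.parse(current_text.splitlines())
    # If even the root is off limits, the agent is effectively shut out.
    if not parser.can_fetch(agent, "/"):
        problems.append(f"{agent} is blocked from the site root")
    return problems


good = "User-agent: *\nDisallow: /scripts/\n"
tampered = "User-agent: *\nDisallow: /\n"
print(audit_robots(good, good))      # no problems
print(audit_robots(tampered, good))  # both problems reported
```

Run from a weekly scheduled job, a check like this would flag an unexpected “Disallow: /” before the site drops out of search results.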

I have covered the basics of using robots.txt here, but if you are keen to learn more and gain in-depth knowledge, visit robotstxt.org and read their FAQs.

1 Comments For This Post

  1. Inder@SeoNext Says:

    Great info…Like the post, Good to know information for business as well as SEO’s…Thanks for sharing.
