Posted to Ben Finklea's blog on October 5th, 2011

How to Optimize Your Robots.txt File

The robots.txt file is a plain-text file that sits at the root level of your website and asks spiders and bots to behave themselves when they’re on your site. You can take a look at it by pointing your browser to http://www.yoursite.com/robots.txt.

Think of it like an electronic No Trespassing sign that can easily tell the search engines not to crawl a certain directory or page of your site.

Having a search engine-optimized robots.txt file is very important for SEO. It clearly establishes rules and expectations for Google’s spiders, which directly affects which web pages rank and which don’t. Keep reading for more robots.txt knowledge than you thought you’d ever need, including how to search engine-optimize the Drupal robots.txt file.

The robots.txt file is required
On December 1, 2008, Google analyst John Mueller said that if Googlebot can’t access the robots.txt file (say, the server is unreachable or returns a 5xx error code), then it won’t crawl the website at all. In other words, Google must be able to retrieve your robots.txt file if you want the website to be crawled and indexed. Read his full comments here.

Drupal 6 Robots.txt
Drupal 6 provides a standard robots.txt file that does an adequate job. It looks like this:

This file carries instructions for robots and spiders that may crawl your site.

Robots.txt directives
Now that we’ve taken a glance at what the file looks like, let’s take a deeper look at each directive used in the Drupal robots.txt file. This is a bit tedious, but that’s why I’m here. It truly is worth it to understand exactly what you’re telling the search engines to do.

Pattern matching
Google (but not all search engines) understands some wildcard characters. The following table explains the usage of a few wildcard characters:
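In short, Google’s documentation describes two such wildcard characters: * matches any sequence of characters, and $ anchors a pattern to the end of the URL. A sketch (the paths here are made up for illustration):

```
User-agent: Googlebot
# * matches any sequence of characters, so this blocks /news/print/, /blog/print/, etc.
Disallow: /*/print/
# $ anchors the pattern to the end of the URL, so this blocks any URL ending in .pdf
Disallow: /*.pdf$
```

Remember that not all search engines understand these wildcards, so don't rely on them as your only line of defense.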

Editing your robots.txt file 
There may be a few times throughout your website’s SEO campaign that you’ll need to make changes to your robots.txt file. This section provides the necessary steps to make each change.

1. Check to see if your robots.txt file is there and available to visiting search bots. Open your browser and visit the following link: http://www.yourDrupalsite.com/robots.txt.

2. Using your FTP program or command line editor, navigate to the top level of your Drupal website and locate the robots.txt file.

3. Make a backup of the file.

4. Open the robots.txt file for editing. If necessary, download the file and open it in a local text editor tool.

5. Most directives in the robots.txt file are grouped under a User-agent: line. If you are going to give different instructions to different engines, be sure to place them above the User-agent: * section, as some search engines will only read the directives for * if their specific instructions come after that section.

6. Add the lines you want.

7. Save your robots.txt file, uploading it if necessary to replace the existing file. Point your browser to http://www.yourDrupalsite.com/robots.txt and double-check that your changes are in effect. You may need to refresh your browser to see the changes.
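The per-engine ordering described in step 5 can be sketched like this (the Googlebot-specific rules here are made up for illustration):

```
# Specific instructions for one engine go first...
User-agent: Googlebot
Disallow: /search/

# ...and the catch-all section for every other bot comes after
User-agent: *
Disallow: /admin/
Disallow: /search/
```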

Problems with the default Drupal robots.txt file
There are several problems with the default Drupal robots.txt file. If you use Google Webmaster Tools' robots.txt testing utility to test each line of the file, you'll find that many paths which look like they're being blocked will actually be crawled.

The reason is that Drupal does not require the trailing slash (/) after the path to show you the content. Because of the way robots.txt files are parsed, Googlebot will avoid the page with the slash but crawl the page without the slash.

For example, /admin/ is listed as disallowed. As you would expect, the testing utility shows that http://www.yourDrupalsite.com/admin/ is disallowed. But, put in http://www.yourDrupalsite.com/admin (without the trailing slash) and you'll see that it is allowed. Disaster!
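You can reproduce this behavior locally with Python's standard urllib.robotparser module, which applies the same kind of plain prefix matching that crawlers use for non-wildcard rules; a minimal sketch:

```python
from urllib.robotparser import RobotFileParser

# Parse a rule set equivalent to the default Drupal entry for /admin/
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
])

# The URL with the trailing slash is blocked...
print(rp.can_fetch("*", "http://www.yourDrupalsite.com/admin/"))  # False
# ...but the same page without the slash is allowed. Disaster!
print(rp.can_fetch("*", "http://www.yourDrupalsite.com/admin"))   # True
```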

Fortunately, this is relatively easy to fix.

Fixing the Drupal robots.txt file 
Carry out the following steps in order to fix the Drupal robots.txt file:

1. Make a backup of the robots.txt file.

2. Open the robots.txt file for editing. If necessary, download the file and open it in a local text editor.

3. Find the Paths (clean URLs) section and the Paths (no clean URLs) section. Note that both sections appear whether you've turned on clean URLs or not. Drupal covers you either way. They look like this:

# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/

4. Duplicate the two sections (simply copy and paste them) so that you have four sections—two of the # Paths (clean URLs) sections and two of # Paths (no clean URLs) sections.

5. Add 'fixed!' to the comment of the new sections so that you can tell them apart.

6. Delete the trailing / after each Disallow line in the fixed! sections. You should end up with four sections that look like this:  

# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/

# Paths (clean URLs) – fixed!
Disallow: /admin
Disallow: /comment/reply
Disallow: /contact
Disallow: /logout
Disallow: /node/add
Disallow: /search
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login
# Paths (no clean URLs) – fixed!
Disallow: /?q=admin
Disallow: /?q=comment/reply
Disallow: /?q=contact
Disallow: /?q=logout
Disallow: /?q=node/add
Disallow: /?q=search
Disallow: /?q=user/password
Disallow: /?q=user/register
Disallow: /?q=user/login

7. Save your robots.txt file, uploading it if necessary, replacing the existing file (you backed it up, didn't you?).

8. Go to http://www.yourDrupalsite.com/robots.txt and double-check that your changes are in effect. You may need to refresh your browser to see the changes. Now your robots.txt file is working as you would expect it to.
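As a final sanity check, Python's standard urllib.robotparser module shows both URL variants blocked under the fixed rules; a minimal sketch:

```python
from urllib.robotparser import RobotFileParser

# A slice of the fixed rule set: both the slashed and slash-less forms
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
    "Disallow: /admin",
])

# Both variants of the URL are now blocked
print(rp.can_fetch("*", "http://www.yourDrupalsite.com/admin"))   # False
print(rp.can_fetch("*", "http://www.yourDrupalsite.com/admin/"))  # False
```

One caveat: a slash-less Disallow: /admin is a pure prefix rule, so it would also block an unrelated path such as /administer if your site happened to have one.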

Additional changes to the robots.txt file
Using directives and pattern matching commands, the robots.txt file can exclude entire sections of the site from the crawlers like the admin pages, certain individual files like cron.php, and some directories like /scripts and /modules.

In many cases, though, you should tweak your robots.txt file for optimal SEO results. Here are several changes you can make to the file to meet your needs in certain situations:

• You are developing a new site and you don't want it to show up in any search engine until you're ready to launch it. Add Disallow: / (the standard "block everything" rule) just after the User-agent: * line.

• Say you're running a very slow server and you don't want the crawlers to slow your site down for other users. Adjust the Crawl-delay by changing it from 10 to 20.

• If you're on a super-fast server (and you should be, right?) you can tell the bots to bring it on! Change the Crawl-delay to 5 or even 1 second. Monitor your server closely for a few days to make sure it can handle the extra load.
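For reference, the Crawl-delay directive sits inside a user-agent section; a sketch with the default Drupal value:

```
User-agent: *
# Ask polite bots to wait 10 seconds between requests; raise this on a slow
# server, lower it on a fast one. (Not every crawler honors Crawl-delay.)
Crawl-delay: 10
```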

• Say you're running a site which allows people to upload their own images but you don't necessarily want those images to show up in Google. Add these lines at the bottom of your robots.txt file:

User-agent: Googlebot-Image
Disallow: /*.jpg$
Disallow: /*.gif$
Disallow: /*.png$

If all of the files were in the /files/users/images/ directory, you could do this:
User-agent: Googlebot-Image
Disallow: /files/users/images/

• Say you noticed in your server logs that there was a bad robot out there scraping all your content. You can try to prevent this by adding the following to the bottom of your robots.txt file (well-behaved crawlers will obey it; truly bad bots often won't):

User-agent: Bad-Robot
Disallow: /

• If you have installed the XML Sitemap module, then you've got a great tool that you should send out to all of the search engines. However, it's tedious to go to each engine's site and submit your sitemap URL. Instead, you can add a couple of simple lines to the robots.txt file.

Adding your XML Sitemap to the robots.txt file

Another way that the robots.txt file helps you search engine optimize your Drupal site is by allowing you to specify where your sitemaps are located. While you probably want to submit your sitemap directly to Google and Bing, it's a good idea to put a reference to it in the robots.txt file for all of those other search engines.

You can do this by carrying out the following steps:

1. Open the robots.txt file for editing.

2. Add a Sitemap: line pointing at your sitemap's location. With the XML Sitemap module's default settings, it typically looks like this (substitute your own domain):

Sitemap: http://www.yourDrupalsite.com/sitemap.xml

3. The sitemap directive is independent of the User-agent: line, so it doesn't matter where you place it in your robots.txt file.

4. Save your robots.txt file, uploading it if necessary, replacing the existing file (you backed it up, didn't you?). Go to http://www.yoursite.com/robots.txt and double-check that your changes are in effect. You may need to refresh your browser to see the changes.
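If you want to verify the line programmatically, Python's standard urllib.robotparser module can read the sitemap reference back out; a minimal sketch (the sitemap URL is made up for illustration):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "Sitemap: http://www.yourDrupalsite.com/sitemap.xml",
    "User-agent: *",
    "Disallow: /admin/",
])

# site_maps() (Python 3.8+) returns every Sitemap: URL found in the file
print(rp.site_maps())  # ['http://www.yourDrupalsite.com/sitemap.xml']
```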

For help with your Drupal XML sitemap, check out my other post about it: X(ML) Marks the Spot: Your Drupal SEO Guide to XML Sitemaps.

Thanks For Reading!