How to Fix the Problems with Drupal’s Default Robots.txt File
No one is perfect. And neither is the Drupal robots.txt file. In fact, there are several problems with the file. If you test out your default robots.txt file line by line using Google Webmaster Tools’ robots.txt testing utility, you will find that a lot of paths which look like they are being blocked will actually be crawled.
The reason is that Drupal does not require the trailing slash (/) after the path to show you the content. Robots.txt rules are matched as prefixes of the URL path, so a rule that ends in a slash blocks the URL with the slash but not the same URL without it. Googlebot will therefore avoid the page with the slash but crawl the page without the slash.
For example: /admin/ is listed as disallowed. As you would expect, the testing utility shows that http://www.yourDrupalsite.com/admin/ is disallowed. Not so fast. Put in http://www.yourDrupalsite.com/admin (without the slash) and you’ll see that it is allowed. “It’s a trap!” Not really, but fortunately it is relatively easy to fix.
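If you would rather see the trap for yourself without opening the testing utility, here is a minimal sketch using Python's standard-library robots.txt parser. It is not Google's parser, but it applies the same prefix matching to plain paths. The two-line rule set is a cut-down stand-in for the Drupal file, and the host name is the placeholder used in this article (no network request is made).

```python
from urllib.robotparser import RobotFileParser

# Cut-down stand-in for the default file: one rule, with the trailing slash.
rules = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Placeholder host from the article; nothing is fetched over the network.
base = "http://www.yourDrupalsite.com"
with_slash = parser.can_fetch("Googlebot", base + "/admin/")
without_slash = parser.can_fetch("Googlebot", base + "/admin")

print(with_slash)     # False: the path starts with "/admin/", so it is blocked
print(without_slash)  # True: "/admin" does not start with "/admin/"
```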
Do you want to know how to fix the problems with Drupal’s default robots.txt file in eight easy steps? Please read on.
What in Tarnation is a Googlebot?
Huh? Google what?! Googlebot! Google and other search engines use server systems–commonly referred to as spiders, crawlers, or robots–to travel the expanse of the Internet and find each and every website. Google’s system is also referred to as Googlebot to distinguish it from all the other search engine robots. While this is not a reported number, it is estimated that the Googlebot crawls 10 billion websites each week! I’d like to see it race R2D2!
Fixing the Drupal Robots.txt File
Like I said earlier, fixing Drupal’s default robots.txt file is relatively easy. Carry out the following steps in order to fix the file:
1. Make a backup of the robots.txt file.
2. Open the robots.txt file for editing. If necessary, download the file and open it in a local text editor.
3. Find the Paths (clean URLs) section and the Paths (no clean URLs) section. Note that both sections appear whether you've turned on clean URLs or not. Drupal covers you either way. They look like this:
# Paths (clean URLs)
Disallow: /contact/
Disallow: /logout/
# Paths (no clean URLs)
4. Duplicate the two sections (simply copy and paste them) so that you have four sections—two of the # Paths (clean URLs) sections and two of # Paths (no clean URLs) sections.
5. Add 'fixed!' to the comment of the new sections so that you can tell them apart.
6. Delete the trailing / after each Disallow line in the fixed! sections. You should end up with four sections that look like this:
# Paths (clean URLs)
# Paths (no clean URLs)
# Paths (clean URLs) – fixed!
# Paths (no clean URLs) – fixed!
7. Save your robots.txt file, uploading it if necessary, replacing the existing file (you backed it up, didn't you?).
8. Go to http://www.yourDrupalsite.com/robots.txt and double-check that your changes are in effect. You may need to do a refresh on your browser to see the changes.
Now your robots.txt file is working as you would expect it to.
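If you maintain more than one site, steps 4 through 6 can be scripted. The sketch below is one possible approach, not part of Drupal: the helper name `add_slashless_rules` is made up for illustration, and it assumes the file has a single `User-agent: *` group (as the default Drupal file does), because the new rules are simply appended to the end of the file. It then re-checks the result with Python's standard-library parser.

```python
from urllib.robotparser import RobotFileParser


def add_slashless_rules(robots_txt: str) -> str:
    """Append a slash-less copy of every Disallow rule that ends in "/".

    Hypothetical helper: assumes one User-agent group, so appended rules
    land in the same group as the originals.
    """
    fixed = []
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        path = value.strip()
        # Only directory-style rules need a twin; a bare "/" is left alone.
        if field.strip().lower() == "disallow" and path.endswith("/") and path != "/":
            fixed.append("Disallow: " + path.rstrip("/"))
    if not fixed:
        return robots_txt
    return robots_txt.rstrip("\n") + "\n# Paths - fixed!\n" + "\n".join(fixed) + "\n"


original = """\
User-agent: *
Crawl-delay: 10
# Paths (clean URLs)
Disallow: /contact/
Disallow: /logout/
"""

patched = add_slashless_rules(original)
print(patched)

# Re-check the patched rules (placeholder host, no network access).
parser = RobotFileParser()
parser.parse(patched.splitlines())
base = "http://www.yourDrupalsite.com"
logout_no_slash = parser.can_fetch("Googlebot", base + "/logout")
logout_slash = parser.can_fetch("Googlebot", base + "/logout/")
normal_page = parser.can_fetch("Googlebot", base + "/node/1")
```

One thing to keep in mind: because rules are prefix matches, a slash-less rule such as /contact also blocks any path that merely begins with those characters, for example /contact-us. Check your own paths before relying on it.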
Additional Changes You Can Make for SEO
Now that you have fixed your default robots.txt file, there are a few additional changes you can make. Using directives and pattern matching, the robots.txt file can keep crawlers out of entire sections of the site (such as the admin pages), individual files (such as cron.php), and whole directories (such as /scripts and /modules).
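To make the pattern matching concrete, here is a rough sketch of how a crawler that supports Google's documented wildcard extension decides whether a rule applies: `*` stands for any run of characters and a trailing `$` pins the match to the end of the URL. The function name `rule_matches` and the sample patterns are illustrative assumptions, and real crawlers may differ in the details.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern applies to a URL path.

    Illustrative only: plain patterns are prefix matches, "*" matches any
    run of characters, and a trailing "$" anchors the end of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with ".*" for each "*".
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    # re.match anchors at the start of the path, which gives prefix matching.
    return re.match(regex, path) is not None


print(rule_matches("/cron.php", "/cron.php"))               # True
print(rule_matches("/modules/", "/modules/node/node.css"))  # True
print(rule_matches("/admin/", "/admin"))                    # False
print(rule_matches("/*.jpg$", "/files/photo.jpg"))          # True
print(rule_matches("/*.jpg$", "/files/photo.jpg?size=big")) # False
```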
In many cases, though, you should tweak your robots.txt file for optimal SEO results. Here are several changes you can make to the file to meet your needs in certain situations:
• You are developing a new site and you don’t want it to show up in any search engine until you’re ready to launch it. Add Disallow: / just after the User-agent: * line.
• The server you are running is very slow and you don’t want the crawlers to slow your site down for visitors. Adjust the Crawl-delay by changing it from 10 to 20.
• If you're on a super-fast server (and you should be, right?) you can tell the bots to bring it on! Change the Crawl-delay to 5 or even 1 second. Monitor your server closely for a few days to make sure it can handle the extra load.
• You're running a site which allows people to upload their own images but you don't necessarily want those images to show up in Google. Add these lines at the bottom of your robots.txt file:
If all of the files were in the /files/users/images/ directory, you could do this:
• Say you noticed in your server logs that there was a bad robot out there that was scraping all your content. You can try to prevent this by adding this to the bottom of your robots.txt file:
User-agent: Bad-Robot
Disallow: /
• If you have installed the XML Sitemap module, then you've got a great tool that you should send out to all of the search engines. However, it's tedious to go to each engine's site and upload your URL. Instead, you can add a couple of simple lines to the robots.txt file.
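Here is a hedged sketch that pulls several of the tweaks above into one sample file and checks it with Python's standard-library parser. Everything in the sample is an assumption for illustration: `Googlebot-Image` as the crawler to keep away from uploaded images, `Bad-Robot` as the scraper's name, and `/sitemap.xml` as the sitemap path (use whatever URL your XML Sitemap module actually reports). Note also that Googlebot ignores Crawl-delay, so that setting only affects crawlers that honor it.

```python
from urllib.robotparser import RobotFileParser

# Sample file combining the tweaks; names and paths are placeholders.
robots_txt = """\
User-agent: *
Crawl-delay: 10
Disallow: /admin

User-agent: Googlebot-Image
Disallow: /files/users/images/

User-agent: Bad-Robot
Disallow: /

Sitemap: http://www.yourDrupalsite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
base = "http://www.yourDrupalsite.com"
image_url = base + "/files/users/images/cat.png"

delay = parser.crawl_delay("Googlebot")                      # from the "*" group
bad_robot_ok = parser.can_fetch("Bad-Robot", base + "/node/1")
image_bot_ok = parser.can_fetch("Googlebot-Image", image_url)
web_bot_ok = parser.can_fetch("Googlebot", image_url)
sitemaps = parser.site_maps()                                # Python 3.8+

print(delay)         # 10
print(bad_robot_ok)  # False: the scraper is shut out of the whole site
print(image_bot_ok)  # False: the image crawler is kept out of the uploads
print(web_bot_ok)    # True: the regular crawler is not affected by that group
print(sitemaps)      # ['http://www.yourDrupalsite.com/sitemap.xml']
```

Remember that robots.txt is only a request. A well-behaved crawler will follow it, but a truly bad robot can simply ignore the file, so treat the Bad-Robot block as a first attempt rather than a guarantee.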
For more information on robots.txt and Drupal SEO, check out my book: Drupal 8 SEO.