Do You Know What Robots.txt Files Are?

Posted 25th October, 2013 by Aliysa

Every day, robots and search engine crawlers scour the web, collecting data from public web pages. Different people and organisations use that data for different purposes; search engines, for example, use web crawlers to index the web and provide relevant search results.

Do I have any control over what they crawl on my site?

As a website owner, you have some control over what they crawl on your site. You can add a robots.txt file to your website that tells all bots, or specific bots, which pages they should not crawl. When a well-behaved bot visits your website, the first thing it does is look for robots.txt and follow the rules defined there. If no robots.txt file is found, the bot assumes the website owner is happy for the whole site to be crawled. It’s also possible to deny certain bots access to your entire site.
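To make that check concrete, here is a minimal sketch of how a cooperative bot might consult the rules before fetching a page, using Python's standard urllib.robotparser module. The rules, the bot names (FriendlyBot, BadBot) and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one named bot is denied the whole site,
# everyone else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse from text, no network access needed

print(parser.can_fetch("FriendlyBot", "https://example.com/index.html"))          # True
print(parser.can_fetch("FriendlyBot", "https://example.com/private/notes.html"))  # False
print(parser.can_fetch("BadBot", "https://example.com/index.html"))               # False
```

A real crawler would load the live file with set_url() and read() instead of parsing a string, but the decision logic is the same.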

Here’s our robots.txt page, and here’s the BBC's.

Why would I want to block bots from accessing areas of my site?

SEO. You can use robots.txt to prevent search engine bots from crawling less important content on your site, giving them more time to crawl the valuable parts, in line with your SEO strategy. It’s also useful for keeping search engine spiders away from pages with duplicate content, which could get you penalised and pushed down the search result rankings. Furthermore, you can include a ‘Sitemap’ directive in your robots.txt file giving the location of your sitemap; sitemaps help search engine crawlers to see how your pages are interlinked, making it easier for them to index your entire site.
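As a rough illustration, the sketch below combines both ideas: a couple of low-value sections are blocked and a sitemap is advertised. The paths (/search/, /print/) and the sitemap URL are assumptions chosen for the example, and site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep crawlers out of low-value pages
# and point them at the sitemap.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /print/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("Googlebot", "https://example.com/search/?q=hosting"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/blog/"))              # True
print(parser.site_maps())  # ['https://example.com/sitemap.xml']
```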

So I can securely hide information on my site?

A common misconception about robots.txt is that all crawlers follow the rules, and that you can safely keep data on a website private by uploading this file. This is not the case. Robots.txt relies on the cooperation of robots, and malicious bots are especially unlikely to cooperate and honour the standard. If you want to guarantee that content remains private or confidential, password protect it or keep it offline.

I don’t have anything to hide on my site - do I need a robots.txt file?

Some people say yes, some people say no. As search engine crawlers actively check for it when they first access a website, I say include one, even if you don’t have any pages or content you want to hide. When a bot looks for robots.txt and doesn’t find it, your server returns a 404 error, which is recorded in your logs. A simple robots.txt file allowing all bots to crawl all pages keeps your logs clean of lots of 404 errors and saves a few bytes of bandwidth.

So, how do I create a robots.txt file?

Create a regular text file called ‘robots.txt’ and upload it to the root directory of your site. The format for most commands is a ‘User-agent’ line to identify the bot in question (you can usually find the bot name on the corresponding website), followed by at least one ‘Disallow’ line. Here are a few examples:

User-agent: *
Disallow:

This says that all bots (represented by *) are free to crawl every page, because the empty ‘Disallow’ line blocks nothing.

User-agent: Googlebot
Disallow: /

This tells Googlebot not to access any path beginning with /, effectively blocking it from every part of the website.

User-agent: Googlebot-Image
Disallow: /images/

This instructs Google’s image crawler not to crawl the /images/ directory.

User-agent: *
Disallow: /images/image1.jpg
Disallow: /test/

This instructs all bots not to access a particular file (/images/image1.jpg) and a test directory (/test/).
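If you want to check rules like these before uploading them, one option is to run them through Python's urllib.robotparser. This is only a sketch: parse_rules is a helper defined here for the example, OtherBot and AnyBot are made-up bot names, and Python's parser may treat edge cases differently from a given search engine.

```python
from urllib.robotparser import RobotFileParser

def parse_rules(text: str) -> RobotFileParser:
    """Build a parser from robots.txt text (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

# Three of the examples above, copied as-is.
block_google = parse_rules("User-agent: Googlebot\nDisallow: /\n")
block_images = parse_rules("User-agent: Googlebot-Image\nDisallow: /images/\n")
mixed = parse_rules(
    "User-agent: *\nDisallow: /images/image1.jpg\nDisallow: /test/\n"
)

print(block_google.can_fetch("Googlebot", "/about.html"))             # False
print(block_google.can_fetch("OtherBot", "/about.html"))              # True
print(block_images.can_fetch("Googlebot-Image", "/images/logo.png"))  # False
print(mixed.can_fetch("AnyBot", "/images/image1.jpg"))                # False
print(mixed.can_fetch("AnyBot", "/images/image2.jpg"))                # True
print(mixed.can_fetch("AnyBot", "/test/page.html"))                   # False
```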

Some bots understand a few additional directives. Googlebot, for example, honours ‘Allow’, while ‘Crawl-delay’ is recognised by some other crawlers, such as Bingbot, but ignored by Googlebot.
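Here is a small sketch of how those extra directives look and how a parser might read them. The ten-second delay, the paths and the AnyBot name are assumptions for illustration. Note that Python's urllib.robotparser applies the first rule that matches, so the ‘Allow’ line is placed before the broader ‘Disallow’; other crawlers may resolve such conflicts differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: ask bots to wait 10 seconds between requests,
# and open up one page inside an otherwise blocked directory.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Allow: /test/public.html
Disallow: /test/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.crawl_delay("AnyBot"))                     # 10
print(parser.can_fetch("AnyBot", "/test/public.html"))  # True
print(parser.can_fetch("AnyBot", "/test/secret.html"))  # False
```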

What if I use WordPress?

WordPress automatically serves a virtual robots.txt file for your site, meaning there’s no physical copy on your server to edit. You can create one and upload it to your server yourself, in which case it will be used instead, or you can install a plugin, such as WP Robots Txt, that lets you alter your robots.txt file from the admin area.

What happens if I make a mistake with my robots.txt file?

You risk blocking bots from your site altogether and greatly damaging your search rankings! A single stray ‘Disallow: /’ under ‘User-agent: *’ is enough to do it. Be careful and ensure the file is correctly written.
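One way to catch that kind of slip is a quick sanity check before uploading. The sketch below is only an illustration: blocked_paths is a helper invented for this example, and the list of pages that must stay crawlable is an assumption you would replace with your own.

```python
from urllib.robotparser import RobotFileParser

def blocked_paths(robots_txt, must_crawl, agent="Googlebot"):
    """Return the paths in must_crawl that the rules would block for agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in must_crawl if not parser.can_fetch(agent, path)]

IMPORTANT_PAGES = ["/", "/blog/", "/contact/"]  # hypothetical key pages

# A one-character mistake: "Disallow: /" blocks everything...
broken = "User-agent: *\nDisallow: /\n"
# ...whereas an empty "Disallow:" blocks nothing.
intended = "User-agent: *\nDisallow:\n"

print(blocked_paths(broken, IMPORTANT_PAGES))    # ['/', '/blog/', '/contact/']
print(blocked_paths(intended, IMPORTANT_PAGES))  # []
```

An empty list means none of your key pages are blocked for that bot; anything else is worth a second look before the file goes live.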

Categories: Tips, SEO
