What is robots.txt?
The robots.txt file is a text file located in the root directory of a website. It is used to tell search engine robots which pages they should not crawl.
Why is robots.txt used?
There are several reasons why you should use robots.txt. The most common are:
- To prevent search engines from crawling pages that are not finished or not ready to be indexed.
- To protect sensitive areas of the website, such as the administrator control panel.
- To help search engines avoid crawling pages that are not important.
How to create a robots.txt file?
Creating a robots.txt file is very simple: you can do it with any text editor.
The robots.txt file is plain text, so you do not need any special programming language.
Its content is also very simple.
It is useful to “block” certain areas or files of your website from search engines, for example to avoid indexing duplicate content.
I always put “block” or “control” in quotation marks because it does not always work as we expect.
If we really need to keep a page out of the results, we must also use other methods, such as the meta robots tag in the page’s <head> (for example, <meta name="robots" content="noindex">).
Rules and syntax of robots.txt
The robots.txt file has some simple rules, but they can lead us into certain errors:
- The name of the robot (the user-agent) and the action must be specified.
- The action can be of two types: disallow and allow.
- What is really important is the disallow. The allow is an exception within a disallow. It makes no sense to say “allow everything”, because crawling everything is the default behavior of a search engine (good advice is to be minimalist).
- It is a text file (.txt), not HTML.
- The file name must always be lowercase: robots.txt, not Robots.txt.
- There may be empty lines between the blocks for different agents, but there should not be any between the directives within a block.
- We can add comments with a hash (#); they will be ignored by search engines.
- It is highly recommended to include the URL of your sitemap.xml (see the sketch right after this list).
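As a minimal sketch of that last recommendation (example.com is just a placeholder domain), the sitemap line is independent of the user-agent blocks and can be placed anywhere in the file:
# Point search engines to the sitemap
Sitemap: https://www.example.com/sitemap.xml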
Here are some examples; let’s read them line by line:
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow: /
# We block all robots from this directory
User-agent: *
Disallow: /prohibitedthings/
In the example above we do not block anything for Google (the disallow is empty), and we block the whole site for Bing by using the slash (/), which points to the root folder of the site.
In the third block, as the comment points out, we block all agents from the directory /prohibitedthings/.
When we want to address all robots, we use the asterisk (*).
There are other directives, such as Crawl-delay. You can see it in the following example:
User-agent: Bingbot
Crawl-delay: 5
Crawl-delay sets a delay in seconds between requests to avoid overloading the server. Note that Googlebot ignores this directive, although Bing does respect it.
It is unnecessary for most websites; it only makes sense for large sites or media outlets with a lot of traffic.
The robots.txt file accepts pattern matching with wildcards, something very useful if we want to block certain directories of our website.
Example: the asterisk (*) blocks directories starting with the same word: /folder*/ will block all the directories folder1, folder2, etc.
User-agent: Googlebot
Disallow: /folder*/
The dollar sign ($) at the end of the pattern is used if we want to block, for example, a file extension (such as .pdf or .gif): we put /*.pdf$.
User-agent: Bingbot
Disallow: /*.pdf$
What happens to the default robots.txt in WordPress?
The default WordPress robots.txt is good because:
- It does not block any frontend resource (the public part of the site).
- It blocks all backend resources (the administrative part of the site), with one exception:
- admin-ajax.php, which provides support for plugins and themes and can be used in the public part of the website.
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
The default WordPress robots.txt also teaches us an important point of syntax: because the “allow” overrides the previous directive, we block everything under /wp-admin/ except admin-ajax.php.
A curious detail: WordPress takes care of creating a virtual robots.txt.
The file does not really exist in the public_html directory where you usually upload all the site files. Until you upload a physical file, it remains virtual. Curious, isn’t it?
Should we use the default robots.txt?
Why not?
What should be added is the sitemap, and all SEO plugins add it automatically.
I believe that less is more. It is good to be minimalist in this file; search engines tend to check it frequently. Think carefully about whether you want to block something and why.
I usually block the feeds so that they do not appear in the Search Console reports, but I tend to add few rules. Here is my robots.txt as an example:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Allow: /wp-includes/*.js
Allow: /wp-includes/*.css
Disallow: *trackback
Disallow: /feed/
Disallow: */feed/
Sitemap: https://wajari.com/sitemap_index.xml
As you can see, I don’t tend to add a lot of frills.
I do not see it as necessary for the vast majority of websites, and we must not forget that Google will visit this file almost every day, so let’s think in terms of effectiveness.
Use it if you want to block something. Remember: with great power comes great responsibility.
Always check that it is working properly. Here are some tools for you to try. Don’t just copy and paste without rhyme or reason; analyze for yourself why you are doing it.
- SEO Tool to analyze a robots.txt
- Official Google documentation on robots.txt, which is very well explained.
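If you prefer to test your rules from a script, here is a minimal sketch using urllib.robotparser from Python’s standard library (the rules and URLs below are placeholders, not my real configuration). Keep in mind that this parser follows the original robots.txt specification, so it does not understand the * and $ wildcards that Google supports; for those cases, rely on Google’s own tools.
from urllib.robotparser import RobotFileParser

# Placeholder rules: paste the contents of your own robots.txt here.
rules = """
User-agent: *
Disallow: /wp-admin/
Disallow: /feed/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Ask whether a generic crawler ("*") may fetch each URL.
for url in ("https://example.com/",
            "https://example.com/wp-admin/",
            "https://example.com/feed/latest"):
    verdict = "allowed" if parser.can_fetch("*", url) else "blocked"
    print(url, "->", verdict)
With the rules above, the script reports the home page as allowed and the other two URLs as blocked.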
An important detail:
Remember that Google does whatever it wants.
So, as I told you in the meta robots video, if you want to keep a page or section out of the search engine results, you must also use noindex, because the search engine could reach the page directly (through an external link, for example) without going through the directives of this simple file.
You will see that it is such a simple file that you will have no problems, and it is part of the essence of SEO. If you have any questions, I will be glad to read your comments 😉
Live long and prosper!