The original Robots.txt syntax was pretty straightforward. You could only use the Disallow directive to exclude pages, and each Disallow directive acted as a prefix match, as if a wildcard sat at the end of the path. This seemed pretty intuitive to most people, and for a while the world was a happy place.
A few people with large and complicated sites discovered some exceptions that couldn’t be covered, so Robots.txt was extended with some new features to allow further control.
Unfortunately, these were interpreted differently by each search engine and the supporting documentation is pretty thin on the ground.
Google’s documentation on the Allow command extends to a single example combining all 3 features.
The Allow: /*?$ directive will allow any URL that ends in a ? (more specifically, it will allow any URL that begins with your domain name, followed by a string, followed by a ?, with no characters after the ?).
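To make the wildcard behaviour concrete, here is a minimal Python sketch of how a path pattern containing * and a trailing $ could be matched against a URL path. The function names (pattern_to_regex, matches) are hypothetical, and the translation to a regular expression is an assumption about the semantics described above, not Google's actual code.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex (illustrative only)."""
    # A trailing '$' anchors the pattern to the end of the URL path.
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    # Escape each literal chunk, then join the chunks with '.*' for every '*'.
    body = ".*".join(re.escape(chunk) for chunk in core.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def matches(pattern, path):
    """Return True if the URL path is covered by the pattern."""
    return pattern_to_regex(pattern).match(path) is not None

print(matches("/*?$", "/page?"))             # True: the path ends in '?'
print(matches("/*?$", "/page?id=1"))         # False: characters follow the '?'
print(matches("/example", "/example.html"))  # True: plain patterns are prefix matches
```

Without the $, a pattern behaves as a prefix match, which is why /example also covers /example.html.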
Wikipedia’s page on Robots.txt suggests that Google processes all Allow directives first and only then moves on to the Disallow directives.
From this limited information we interpreted the following 3 rules…
If you thought the same as us then prepare to be very surprised. After experimenting with the Robots.txt testing tool in Webmaster Tools, we found something completely different.
Disallow: /example.html
Allow: /example*
In example number 1, Disallow beats Allow, directly contradicting rule number 1.
Disallow: */example.html*
Allow: /example.html$
Example number 2 tears rule number 2 to pieces: despite the strong $, Allow loses against Disallow.
Disallow: /example.html*
Allow: /example.html

Disallow: /example.html
Allow: /example.html*
And the final examples come as no surprise, rebutting rule number 3: depending on the placement of the wildcard, a Disallow beats an Allow or an Allow beats a Disallow.
It took us a while to figure out and might take a minute to get your head around, but the answer is rather simple. The number of characters you use in the directive path is critical in the evaluation of an Allow against a Disallow. The rule to rule them all is as follows: the matching directive with the most characters in its path wins, and an Allow with equal or more characters always beats a Disallow.
Just to clarify, we’re talking about the number of characters in the matching directive path after the Allow: or Disallow: statement. This includes all the * and $ characters. For example:
Disallow: /example* (9 characters)
Allow: /example.htm$ (13 characters)
Allow: /*htm$ (6 characters)
In the following example, the URL /example.htm will be disallowed because the Disallow directive contains more characters (7) than the Allow directive (6).
Allow: /exam*
Disallow: /examp*
If you add a single character to the Allow directive, the number of characters is equal and the Allow wins. An Allow directive with equal or more characters always beats a Disallow.
Allow: /examp*
Disallow: /examp*
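The length comparison in these examples can be sketched in a few lines of Python. This is only a model of the behaviour observed in the testing tool, under the assumption that * matches any run of characters and a trailing $ anchors the end of the path; the names matches and verdict are hypothetical, and this is not Google's implementation.

```python
import re

def matches(pattern, path):
    """Illustrative matcher: '*' is a wildcard, a trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    body = ".*".join(re.escape(chunk) for chunk in core.split("*"))
    return re.match("^" + body + ("$" if anchored else ""), path) is not None

def verdict(rules, path):
    """rules is a list of (kind, pattern) pairs, kind being 'allow' or 'disallow'.

    Among the directives that match the path, the one with the longest
    pattern wins; on equal length an Allow beats a Disallow.
    """
    hits = [
        (len(pattern), kind == "allow", kind)
        for kind, pattern in rules
        if matches(pattern, path)
    ]
    if not hits:
        return "allow"  # no directive matches, so nothing blocks the URL
    # Tuples sort by length first, then True (allow) above False (disallow).
    return max(hits)[2]

print(verdict([("allow", "/exam*"), ("disallow", "/examp*")], "/example.htm"))   # disallow
print(verdict([("allow", "/examp*"), ("disallow", "/examp*")], "/example.htm"))  # allow
```

Every character of the pattern counts towards its length, including * and $, which is why adding one character to the Allow path flips the result.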
This even applies to exact matches using $. In the example below, the URL /example.htm will be disallowed because the matching Disallow directive contains more characters.
Allow: /example.htm$
Disallow: */*example*htm
Another interesting side effect is that a broad match using a wildcard at the end becomes more powerful than one without due to the additional character. In the following example, the URL /example.htm will be disallowed because the Disallow directive contains more characters than the Allow directive due to the additional * character.
Allow: /example
Disallow: /example*
Which of these would win? The directives are not lined up so it’s hard to see.
Allow: /example.htm
Disallow: /********htm
This is better. You can see they are the same length so the Allow would win.
Allow:    /example.htm
Disallow: /********htm
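Lining directives up by hand is easy to get wrong, so a small helper can do the padding and counting. The sketch below is a hypothetical convenience function (align is not part of any tool mentioned here); it pads the directive names so the paths start in the same column and appends each path's character count.

```python
def align(rules):
    """Return one line per (kind, pattern) pair with the paths lined up."""
    # Width of the longest directive name plus its colon.
    width = max(len(kind) for kind, _ in rules) + 1
    lines = []
    for kind, pattern in rules:
        label = (kind + ":").ljust(width)
        lines.append(label + " " + pattern + "  (" + str(len(pattern)) + ")")
    return lines

for line in align([("Allow", "/example.htm"), ("Disallow", "/********htm")]):
    print(line)
# Allow:    /example.htm  (12)
# Disallow: /********htm  (12)
```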
Use Robotto for free to monitor your robots.txt files for changes and to remind you to re-test a list of sample URLs in Webmaster Tools.
Or use DeepCrawl to crawl your site in full and show you exactly what’s indexable and what’s disallowed, noindexed or canonicalised.
After playing around with the robots.txt testing tool for a while, we found 2 other interesting anomalies. These don’t actually affect the way robots.txt works, because they only occur between competing Allow statements or competing Disallow statements, and only when the lengths of the directives are identical. We’ve included them here for completeness and because they might help explain something that hasn’t been discovered yet. They also suggest that Google’s solution might not have been as carefully planned as one would expect, and they could give a clue as to the underlying technology.
In this example, the second Disallow wins because it uses a * whereas the other uses a $. Both have identical numbers of characters in the directive.
Disallow: /example.htm$
Disallow: /example.htm*
In this example, the first Disallow wins because it has a greater number of non-wildcard characters. Both have identical numbers of characters in the directive.
Disallow: /*xample.htm
Disallow: /****ple.htm
Although the way Google handles robots.txt files allows very powerful combinations to cover any scenario, it’s neither intuitive nor sufficiently documented, which is likely to result in a number of sites being incorrectly indexed. What do you think?
on September 14, 2011 at 7:54 pm
Hi,
I have a question regarding the robots.txt file.
I saw in one of my client accounts that the robots.txt file had an Allow line placed between the Disallow lines. I was surprised to discover that the website was no longer indexed in Google, although it had been in the past.
Do you think Google needs to see the Allow and Disallow lines in a specific order?
The robots.txt looked like this:
User-agent: *
Disallow: /cache
Allow: /
Disallow: /images
Thanks for your answer
on September 16, 2011 at 9:35 am
To the best of our knowledge the order doesn’t matter for Google.
Even if it did, there is no reason why the robots.txt you listed should cause a site to be de-indexed so I think there must be another cause.