The robots.txt Protocol Explained: History, Syntax, Logic, and Real-World Traps

Jun 8, 2026 | 11 min read

robots.txt is a plain text file at the root of a domain that instructs web crawlers which paths they are permitted to fetch. Proposed by Dutch software engineer Martijn Koster in February 1994 and refined into an IETF standard 28 years later, it remains the primary way site owners tell major crawlers what they may fetch. It is a voluntary convention, not an access control system: compliant crawlers honor it, and nothing forces the rest to. This article covers the full syntax, the logic governing conflict resolution, the two wildcard characters Google supports, and the four real-world traps that produce the most SEO damage, including the most misunderstood fact about the entire protocol: robots.txt blocks crawling, but it cannot prevent a page from being indexed.


Why Does robots.txt Still Matter 30 Years Later?

robots.txt matters because Googlebot fetches it before crawling a single page on your site. Every instruction in that file shapes which of your pages are discovered, how quickly, and at what cost to your crawl budget. A single misplaced Disallow rule can make an entire site section invisible to search engines within hours. A missing sitemap directive delays URL discovery for sites that depend on it.

Google's own documentation introduces robots.txt plainly: "A robots.txt file tells search engine crawlers which URLs the crawler can access on your site. This is used mainly to avoid overloading your site with requests; it is not a mechanism for keeping a web page out of Google." That last clause is the most consequential thing to understand before writing a single line of robots.txt, and it is consistently misread.


A Brief History: From a 1994 Mailing List to RFC 9309

The problem Martijn Koster was solving in February 1994 was immediate and operational: the website he administered at UK computer security company Nexor had been inadvertently flooded by a badly behaved crawler, written by a user who was trying to learn Perl. (That user, later revealed to be science fiction novelist Charles Stross, also wrote the first robots.txt-compliant crawler, CharlieSpider/0.3.) Koster posted his proposal to the www-talk mailing list, the main coordination channel for early web development, and within months, major crawler authors had adopted it by informal consensus.[1][2]

The key word is informal. The original 1994 proposal covered only three concepts: User-agent, Disallow, and path matching. The Allow directive, the wildcard characters * and $, and the Sitemap directive were all later additions, popularized by search engines (Google chief among them) long before any formal standard described them. The specification was not formalized into an IETF standard until September 2022, when it became RFC 9309. During the 28-year gap, each search engine implemented its own interpretation, producing real inconsistencies. One of the most consequential: the Crawl-delay directive is honored by Bing and a number of other crawlers, but Googlebot ignores it entirely and has never supported it. Site owners who include Crawl-delay in their robots.txt to manage Googlebot's request rate are wasting a directive on the one bot it does not affect.


How Does robots.txt Work Mechanically?

When Googlebot is about to crawl a site for the first time or revisit it, it begins by fetching the robots.txt file with a standard HTTP GET request. It reads the rules, identifies which blocks apply to its user-agent, and then applies those rules to every subsequent URL decision during that crawl session. The rules in a robots.txt file apply only to the specific hostname, protocol, and port number the file was fetched from. A file at https://example.com/robots.txt does not govern https://www.example.com/ or https://shop.example.com/: those subdomains need their own robots.txt files.[4]
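The scoping rule can be sketched in a few lines of Python. This is a minimal illustration, not how any particular crawler is implemented; the function name robots_url_for is made up for this example. It keeps the scheme, host, and port of a page URL and swaps everything else for /robots.txt.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url.

    Hypothetical helper: scheme, host, and port are kept as-is,
    while the path, query string, and fragment are discarded.
    """
    parts = urlsplit(page_url)
    # netloc carries the hostname plus any explicit port.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


for url in (
    "https://example.com/blog/post?id=7",
    "https://www.example.com/",
    "https://shop.example.com:8443/cart",
):
    print(url, "->", robots_url_for(url))
```

Each of the three URLs resolves to a different robots.txt location, which is why a rule written for one host has no effect on the others.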

The File Location Rule

robots.txt must live at the root of the host it governs: yourdomain.com/robots.txt. Crawlers do not look for robots.txt in subdirectories. A file placed at yourdomain.com/blog/robots.txt is not a valid robots.txt file and will be ignored. The filename is case-sensitive on case-sensitive servers; Robots.txt and ROBOTS.TXT are not the same file.

The User-agent Directive: Targeting Specific Crawlers

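User-agent lines open a group and identify which crawler the rules that follow apply to. The wildcard * applies to all crawlers not otherwise specified. Named user-agents like Googlebot, Bingbot, and GPTBot apply only to those specific crawlers. A crawler obeys only the most specific group that matches it: once a named group exists for Googlebot, Googlebot ignores the * group entirely. Google merges groups only when several of them name the same user-agent.

User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /private/
Allow: /private/public/

Sitemap: https://example.com/sitemap.xml

In this example, all crawlers are blocked from /private/, and Googlebot is additionally allowed to crawl /private/public/. Because Googlebot reads only its own group, the Disallow rule has to be repeated there; without it, Googlebot would be free to crawl everything under /private/. The Sitemap line sits outside both groups because it applies to the file as a whole.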
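Group selection is easy to check with Python's standard-library parser, urllib.robotparser, which picks a named group over the * group in the same way. The sketch below is illustrative only: the /tmp/ rule and the OtherBot name are assumptions added to make the effect visible, and the Allow line comes first because this parser applies the first matching rule inside a group rather than Google's longest-match logic.

```python
from urllib.robotparser import RobotFileParser

# Illustrative file: /tmp/ is blocked only in the catch-all group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/

User-agent: Googlebot
Allow: /private/public/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Ask the stdlib parser whether agent may fetch path on example.com."""
    return parser.can_fetch(agent, "https://example.com" + path)


for agent in ("Googlebot", "OtherBot"):
    for path in ("/tmp/cache.html", "/private/public/page.html", "/private/report.html"):
        print(f"{agent:<10} {path:<28} allowed={allowed(agent, path)}")
```

Googlebot may fetch /tmp/cache.html here even though the * group blocks it, because the named group replaces the catch-all group instead of adding to it.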

Disallow and Allow: The Two Core Instructions

Disallow takes a path and tells the crawler not to fetch any URL whose path begins with that string. Allow carves exceptions out of a blocked directory. An empty Disallow value (no path) means allow everything. Disallow: / means block everything. The difference of a single forward slash is the difference between a fully crawlable site and a fully blocked one.

# Block all crawlers from admin and search result pages
User-agent: *
Disallow: /wp-admin/
Disallow: /search?

# Allow all other content
Allow: /

Path matching is case-sensitive and uses prefix logic. Disallow: /blog blocks /blog, /blog/, and /blog/category/article, but also unrelated paths that merely start with the same characters, such as /blog-archive. To block the directory and its contents without catching /blog itself or those look-alike paths, use a trailing slash: Disallow: /blog/.
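The prefix behavior comes down to a single case-sensitive string comparison. The following sketch covers wildcard-free paths only, and the function name prefix_blocks is an assumption for illustration.

```python
def prefix_blocks(disallow_path: str, url_path: str) -> bool:
    """True if a plain (wildcard-free) Disallow path blocks url_path."""
    if disallow_path == "":
        return False  # an empty Disallow value blocks nothing
    # Case-sensitive prefix test: "/blog" does not match "/Blog/".
    return url_path.startswith(disallow_path)


PATHS = ("/blog", "/blog/", "/blog/category/article", "/blog-archive", "/Blog/")

for rule in ("/blog", "/blog/"):
    for path in PATHS:
        print(f"Disallow: {rule:<7} {path:<24} blocked={prefix_blocks(rule, path)}")
```

Comparing the two runs shows what the trailing slash changes: /blog and /blog-archive are blocked by the first rule but not by the second.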


How Do Wildcard Characters Work in robots.txt?

Google and Bing support two wildcard characters in path values: * and $. These were not part of the original 1994 specification. They were Google extensions later formalized in RFC 9309.[2][3]

* matches zero or more of any character, functioning as a glob wildcard. $ anchors the match to the end of the URL string. Together they enable precise targeting of URL patterns without listing every affected URL individually.

| Pattern | Effect |
| --- | --- |
| Disallow: /*? | Blocks any URL that contains a query string |
| Disallow: /*.pdf$ | Blocks any URL ending exactly in .pdf |
| Disallow: /tag/*/page/ | Blocks any paginated tag archive URL |
| Disallow: /search?s=* | Blocks any search result URL (the trailing * is redundant, because matching is already by prefix) |
| Allow: /wp-content/uploads/ | Carves uploads out of a blocked wp-content directory |

The $ anchor matters more than most practitioners realize. Disallow: /admin blocks /admin, /admin/, /administrator, and /admin-tools. Disallow: /admin$ blocks only the exact path /admin, leaving /admin/ and every subdirectory accessible. Misreading this distinction is a frequent source of unintended blocking.
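One way to reason about * and $ is to translate a robots.txt path into a regular expression. The sketch below is an assumption-laden illustration, not Google's matcher: the helper names pattern_to_regex and matches are hypothetical, and percent-encoding normalization is left out.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    '*' becomes '.*'; a trailing '$' anchors the end of the URL;
    without it the pattern is an ordinary prefix match.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape each literal chunk, then rejoin the chunks with '.*'.
    regex = ".*".join(re.escape(chunk) for chunk in body.split("*"))
    return re.compile(regex + (r"\Z" if anchored else ""))


def matches(pattern: str, url_path: str) -> bool:
    """True if url_path (path plus query string) matches the pattern."""
    # re.match anchors at the start, which gives prefix semantics.
    return pattern_to_regex(pattern).match(url_path) is not None


CASES = [
    ("/admin", "/administrator"),
    ("/admin$", "/admin"),
    ("/admin$", "/admin/"),
    ("/*.pdf$", "/docs/report.pdf"),
    ("/*.pdf$", "/docs/report.pdf?download=1"),
    ("/*?", "/shop?color=red"),
]
for pattern, path in CASES:
    print(f"{pattern:<10} {path:<30} match={matches(pattern, path)}")
```

The /admin versus /admin$ rows reproduce the distinction described above: the unanchored rule catches /administrator, while the anchored one matches only the exact path.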


The Sitemap and Crawl-delay Directives

The Sitemap directive tells crawlers the location of your XML sitemap file. It should appear outside any User-agent block and uses the full URL:

Sitemap: https://example.com/sitemap.xml

This is a low-effort, high-value addition to any robots.txt file. When Googlebot fetches your robots.txt file, it reads the Sitemap directive and adds your sitemap URLs to its processing queue. For large sites or sites with weak internal linking, this accelerates URL discovery meaningfully.

Crawl-delay is a directive that requests a minimum number of seconds between successive requests from a crawler. As noted, Googlebot does not support it. If you need to reduce Googlebot's crawl rate specifically, robots.txt is the wrong lever. The crawl rate limiter that Search Console used to offer was retired in January 2024; Google's documented short-term remedy is to answer with 500, 503, or 429 status codes until the load subsides.
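For crawlers that do honor these directives, reading them is straightforward. The sketch below uses Python's urllib.robotparser (site_maps() requires Python 3.8 or later); the file contents and the 10-second delay are assumptions chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Illustrative file: only the Bingbot group sets a Crawl-delay.
ROBOTS_TXT = """\
User-agent: Bingbot
Crawl-delay: 10
Disallow: /search

User-agent: *
Disallow: /search

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Sitemap lines are file-wide, so no user-agent is needed to read them.
print("Sitemaps:", parser.site_maps())

# Crawl-delay is per group; agents without one get None back.
for agent in ("Bingbot", "SomeOtherBot"):
    print(agent, "crawl delay:", parser.crawl_delay(agent))
```

The parser only reports what the file says. Whether a crawler actually waits between requests is up to that crawler, which is exactly why the directive has no effect on Googlebot.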


How Does Google Resolve Conflicting Rules?

When multiple rules in a robots.txt file match the same URL, Google applies the most specific rule based on path length. The longer the matching path, the higher its precedence. If two matching rules are identical in length, Allow beats Disallow.[3]

According to Google's official robots.txt specification: "When matching robots.txt rules to URLs, crawlers use the most specific rule based on the length of the rule path. In case of conflicting rules, including those with wildcards, Google uses the least restrictive rule."

A practical example:

User-agent: *
Disallow: /products/           # Path length: 10 characters
Allow: /products/featured/     # Path length: 19 characters

A URL like /products/featured/new-arrival matches both rules. The Allow rule is longer (19 vs 10 characters) so it wins: the URL is crawlable. Reversing the intent requires reversing which directive is longer, not simply switching their order. Order does not determine precedence in Google's implementation. Path length does.

This is the detail most robots.txt guides omit. Developers who expect first-match-wins behavior (as some older crawlers use) will write rules that appear to work but produce different results in practice with Googlebot.
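The precedence rule can be written down as a short function. This is a simplified sketch of the longest-match logic for wildcard-free paths, not Google's actual implementation; the name is_allowed and the (directive, path) tuple format are assumptions.

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # (directive, path), e.g. ("Disallow", "/products/")


def is_allowed(url_path: str, rules: List[Rule]) -> bool:
    """Longest matching path wins; on a tie, Allow beats Disallow."""
    best_length = -1
    best_is_allow = True  # no matching rule means the URL is crawlable
    for directive, rule_path in rules:
        if not rule_path or not url_path.startswith(rule_path):
            continue  # empty or non-matching rules are skipped
        is_allow = directive.lower() == "allow"
        longer = len(rule_path) > best_length
        tie_won_by_allow = len(rule_path) == best_length and is_allow
        if longer or tie_won_by_allow:
            best_length, best_is_allow = len(rule_path), is_allow
    return best_is_allow


RULES = [("Disallow", "/products/"), ("Allow", "/products/featured/")]

for path in ("/products/featured/new-arrival", "/products/clearance", "/about"):
    print(f"{path:<34} allowed={is_allowed(path, RULES)}")

# Reversing the rule order changes nothing, because order is never consulted.
print(is_allowed("/products/featured/new-arrival", list(reversed(RULES))))
```

A first-match parser would give a different answer for /products/featured/new-arrival when the Disallow line comes first, which is the discrepancy developers run into.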


What Can robots.txt Not Do?

robots.txt cannot prevent a page from being indexed. This is the most consequential misunderstanding about the protocol, and Google's documentation states it directly: "If your web page is blocked with a robots.txt file, its URL can still appear in search results, but the search result won't have a description."[4]

If Google has previously crawled a page and indexed it, then you add a Disallow rule for that page, the URL remains in Google's index. Googlebot stops fetching the page, but the indexed version persists until Google's systems decide to remove it. If Google has never crawled the page but discovers its URL through a link from an external site, it can index that URL as a bare reference: no title, no description, no content assessment, but still present in the index.

To remove a page from Google's index, the correct tools are the noindex meta tag or the X-Robots-Tag HTTP header (both of which require the page to be crawlable), the URL Removal Tool in Search Console (for urgent temporary removals), or a 404 or 410 HTTP status code on the page itself (which signals Google to de-index and eventually stop crawling the URL).


What Are the Most Damaging Real-World robots.txt Traps?

The Rendering Trap: Blocking CSS and JavaScript

Google renders pages using a headless Chrome instance. To render a page correctly, it needs access to the same CSS and JavaScript files the browser uses. If your robots.txt blocks the /assets/, /js/, /themes/, or /wp-content/ directories, Googlebot cannot load your stylesheets or scripts. It crawls a degraded plain-HTML version of your pages that may be missing navigation elements, product descriptions loaded by JavaScript, layout signals, and structured data.

Google's own documentation is explicit on this: "If the absence of these resources make the page harder for Google's crawler to understand the page, don't block them, or else Google won't do a good job of indexing that page's content."

The practical symptoms are subtle and delayed: pages with lower-than-expected ranking performance, URL Inspection showing resources blocked in the rendering screenshot, and Search Console warnings about blocked resources. The fix is removing any Disallow rule covering directories that contain files required to render your pages correctly.
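A quick audit for this trap is to run the asset URLs a page loads through a robots.txt parser. The sketch below uses Python's urllib.robotparser, which understands plain prefix rules only, so rules containing * or $ would need a different matcher. The robots.txt contents, asset URLs, and the blocked_resources name are all illustrative assumptions.

```python
from typing import Iterable, List
from urllib.robotparser import RobotFileParser


def blocked_resources(
    robots_txt: str, resource_urls: Iterable[str], agent: str = "Googlebot"
) -> List[str]:
    """Return the resource URLs that agent is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in resource_urls if not parser.can_fetch(agent, url)]


ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-content/
Disallow: /js/
"""

# Hypothetical assets referenced by one page template.
ASSETS = [
    "https://example.com/wp-content/themes/site/style.css",
    "https://example.com/js/app.js",
    "https://example.com/images/logo.png",
]

for url in blocked_resources(ROBOTS_TXT, ASSETS):
    print("Blocked render resource:", url)
```

Any stylesheet or script that shows up in the output is a candidate for an Allow exception or for removing the Disallow rule that covers it.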

The Production Launch Trap

Staging environments routinely use User-agent: * Disallow: / to prevent a work-in-progress site from being indexed. This is correct practice. The trap occurs when the site goes live and the robots.txt file is copied as-is to the production server without removing the blanket Disallow rule. Because Googlebot fetches robots.txt before crawling anything, the entire site becomes invisible within hours of launch, with no error message visible to the site owner.

Notice how the damage is invisible from inside the site itself: the pages load, the navigation works, analytics fires normally. The only signal is a sudden collapse in organic impressions in Search Console and a Disallow flag on every URL checked through the URL Inspection tool. Adding "verify robots.txt does not block all crawlers" to every pre-launch checklist is the only reliable prevention.
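That checklist item can be automated as a launch gate. The following is a minimal sketch, assuming the robots.txt text is already in hand and that testing the homepage is a good enough proxy for "everything is blocked"; the function name agents_locked_out and the agent list are assumptions.

```python
from typing import List, Sequence
from urllib.robotparser import RobotFileParser


def agents_locked_out(
    robots_txt: str, agents: Sequence[str] = ("Googlebot", "Bingbot")
) -> List[str]:
    """Return the agents that may not even fetch the homepage."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, "/")]


STAGING = "User-agent: *\nDisallow: /\n"
PRODUCTION = "User-agent: *\nDisallow: /wp-admin/\n"

for label, text in (("staging", STAGING), ("production", PRODUCTION)):
    locked_out = agents_locked_out(text)
    status = "BLOCKS " + ", ".join(locked_out) if locked_out else "ok"
    print(f"{label:<11} {status}")
```

Wired into a deployment pipeline, a non-empty result for the production file would be a reason to stop the release.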

The Self-Defeating Trap: Blocking a Page You Also Want to noindex

A blocked URL cannot communicate a noindex directive. When Googlebot encounters a Disallow rule, it does not fetch the page. When it does not fetch the page, it cannot read the <meta name="robots" content="noindex"> tag inside the page's HTML. The result is that the blocked page may still appear in Google's index as a bare URL reference, exactly what the noindex was intended to prevent.

Google's Search Console documentation captures the correct model: to prevent a page from appearing in search results, Googlebot must be able to crawl the page (no Disallow rule) and must find a valid noindex directive on it. Blocking and noindexing simultaneously produces neither outcome reliably.

The correct fix: remove the Disallow rule for the page, allow Googlebot to crawl it, and let the noindex tag do its job.
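The conflict can be detected mechanically: a page is affected when its HTML carries a noindex robots meta tag and its URL is also disallowed. The sketch below relies on Python's standard html.parser and urllib.robotparser; the class and function names are hypothetical, and it checks the meta tag only, not the X-Robots-Tag header.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class NoindexFinder(HTMLParser):
    """Sets .noindex when a robots meta tag containing noindex is seen."""

    def __init__(self) -> None:
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None for bare attributes.
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        content = attr.get("content", "").lower()
        if name == "robots" and "noindex" in content:
            self.noindex = True


def noindex_is_hidden(robots_txt: str, url: str, html: str,
                      agent: str = "Googlebot") -> bool:
    """True if the page asks for noindex but the crawler cannot fetch it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    finder = NoindexFinder()
    finder.feed(html)
    return finder.noindex and not parser.can_fetch(agent, url)


PAGE = '<html><head><meta name="robots" content="noindex"></head><body>Thanks</body></html>'
URL = "https://example.com/thank-you/"

print(noindex_is_hidden("User-agent: *\nDisallow: /thank-you/\n", URL, PAGE))
print(noindex_is_hidden("User-agent: *\nDisallow:\n", URL, PAGE))
```

The first call reports the self-defeating combination; the second shows the same page after the Disallow rule is removed, where the noindex tag is reachable again.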

Sources

  1. Google Search Central Blog. "Formalizing the Robots Exclusion Protocol Specification." Google for Developers, July 2019.

  2. IETF RFC 9309: Robots Exclusion Protocol. Illyes, G., Zeller, H., Sassman, L., et al. September 2022.

  3. Google Crawling Infrastructure. "How Google Interprets the robots.txt Specification."

  4. Google Search Central. "Introduction to robots.txt."

Frequently Asked Questions (FAQs)

What does robots.txt actually do?

robots.txt tells web crawlers which URLs they are permitted to fetch on a site. It is a politeness protocol based on voluntary compliance: well-behaved crawlers like Googlebot follow the rules; malicious scrapers typically ignore them. robots.txt controls crawling behavior only. It does not control indexing, ranking, or whether a page appears in search results.

Can robots.txt stop Google from indexing a page?

No. Blocking a URL in robots.txt prevents Googlebot from fetching it, but a URL can still be indexed if Google discovers it through a link from another page. If you block a page with robots.txt and it was previously indexed, the existing index entry persists. To remove a page from Google's index, use a noindex meta tag (which requires the page to be crawlable), a 404 or 410 status code, or the URL Removal Tool in Search Console.

What is the difference between Disallow and noindex?

Disallow is a robots.txt directive that prevents Googlebot from fetching a URL. noindex is an HTML meta tag (or HTTP header) that prevents Googlebot from including a fetched page in its index. Disallow operates at the crawl stage; noindex operates at the indexing stage. Because Googlebot must crawl a page to read its noindex tag, using Disallow to block a page you also want to noindex is self-defeating.

Does robots.txt support wildcards?

Google and Bing support two wildcard characters in robots.txt path values: * (matches any sequence of characters) and $ (anchors the match to the end of the URL). These were originally Google extensions to the 1994 standard and were formally included in IETF RFC 9309 in 2022. They are not supported by all crawlers, but all major search engines now recognize them.

Why does Googlebot ignore Crawl-delay in robots.txt?

Googlebot has never supported the Crawl-delay directive and ignores it in robots.txt. The directive instructs crawlers to wait a minimum number of seconds between consecutive requests. To reduce Googlebot's crawl rate specifically, temporarily return 500, 503, or 429 status codes; the crawl rate limiter that Search Console once offered was retired in January 2024. Crawl-delay does function for Bingbot and several other crawlers.

How should I test my robots.txt file?

The robots.txt report in Google Search Console (under Settings) shows which robots.txt files Google found for your site, when each was last fetched, and any parsing warnings or errors. It replaced the older robots.txt Tester, which was retired in late 2023. The URL Inspection tool also reports when a specific URL is blocked by robots.txt. For bulk testing, third-party crawlers like Screaming Frog can identify all blocked URLs across a site in a single crawl.
