Crawl Budget Explained: Rate Limit, Crawl Demand, and What Wastes It

Jun 8, 2026 · 10 min read

Crawl budget is the number of URLs Googlebot can and wants to crawl on your site within a given period. It is not a fixed number, and it is not a setting you can directly configure. It emerges from the interaction of two forces: how fast your server allows Googlebot to fetch pages, and how much Google values crawling your site's content. When those forces are misaligned, important pages stop getting discovered on time. This article explains both forces, identifies the four most common budget wasters, addresses the most persistent misconception in crawl budget management, and walks through the diagnostic process for identifying crawl inefficiency in Search Console and server logs.


What Is Crawl Budget?

Crawl budget is the set of URLs Google can and wants to crawl on a site, determined by two components: crawl capacity (how fast Googlebot can fetch without overloading the server) and crawl demand (how much Google prioritizes crawling a site's pages based on quality, authority, and freshness). The interaction of these two components (not either one alone) determines how many pages get crawled and how often. [1][2]

Google's official crawl budget documentation defines the concept precisely: "Taking crawl capacity and crawl demand together, Google defines a site's crawl budget as the set of URLs that Google can and wants to crawl. Even if the crawl capacity limit isn't reached, if crawl demand is low, Google will crawl your site less."


The Two Forces That Determine Crawl Budget

Crawl Capacity Limit: How Fast Googlebot Can Crawl Your Server

Crawl capacity limit is the maximum number of simultaneous parallel connections Googlebot uses to crawl a site, combined with the time delay between fetches. Google's crawlers calculate this limit dynamically based on server health. If your server responds quickly and consistently, the capacity limit rises over time and Googlebot crawls more pages per day. If your server slows down, returns 5xx errors, or times out repeatedly, the limit drops and Googlebot backs off, sometimes significantly.

The crawl capacity limit is not something Google provides as a visible metric. You can observe it indirectly through the Crawl Stats report in Search Console, where a sudden drop in total daily crawl requests often correlates with degraded server response times. This makes server performance a direct SEO variable: response time above 200ms is the conventional threshold at which crawl efficiency degrades noticeably.

Crawl Demand: How Much Google Wants to Crawl Your Content

Crawl demand is how much Google's systems prioritize your site relative to everything else competing for crawl resources. According to Google's official documentation, the main factors are: perceived inventory (how many URLs Google thinks exist on your site), popularity (how much external attention each URL receives), and staleness (how overdue a page is for a refresh based on its last known change).

The perceived inventory signal is the one site owners control most directly. Without guidance from robots.txt or other signals, Google tries to crawl all or most URLs it knows about on your site. If your site generates thousands of low-value parameter URLs and Google has discovered them all, those URLs compete for the same crawl budget as your actual product pages, articles, and category pages. Reducing perceived inventory by eliminating or blocking low-value URLs is the most reliable way to improve crawl demand efficiency.


Does Crawl Budget Matter for Your Site?

Crawl budget is genuinely important only for a specific subset of sites. Google's official crawl budget documentation states the guide is intended for two categories: large sites with over one million unique pages whose content changes moderately often (roughly weekly), and medium or larger sites with over 10,000 unique pages whose content changes very rapidly (daily). Google explicitly says that for smaller sites, "merely keeping your sitemap up to date and checking your index coverage regularly is adequate."

The practical threshold is approximately 10,000 pages with frequent content changes. Below that, Googlebot typically crawls the entire site efficiently regardless of URL quality issues. For an e-commerce catalog, news publisher, or programmatic content platform with tens of thousands of URLs changing daily, crawl budget becomes a real constraint that directly determines how quickly new content appears in search results.
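The two categories Google describes can be written down as a quick self-check. This is a minimal sketch, not an official rule engine: the function name and the frequency labels ("daily", "weekly", "rarely") are illustrative assumptions, and the page counts are the ones quoted above.

```python
def crawl_budget_guide_applies(unique_pages, change_frequency):
    """Return True if a site falls into either category described above.

    change_frequency is a hypothetical label: "daily", "weekly", or "rarely".
    """
    # Large site: 1 million+ unique pages, content changing about weekly or faster.
    large_site = unique_pages >= 1_000_000 and change_frequency in ("daily", "weekly")
    # Medium or larger site: 10,000+ unique pages, content changing daily.
    fast_changing_site = unique_pages >= 10_000 and change_frequency == "daily"
    return large_site or fast_changing_site


print(crawl_budget_guide_applies(50_000, "daily"))   # news publisher or large catalog
print(crawl_budget_guide_applies(3_000, "daily"))    # small site
```

A site that fails this check can usually stop at keeping its sitemap current and watching index coverage.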

The diagnostic signal that points to a crawl budget problem is the "Discovered - currently not indexed" status in Search Console's Pages report. A large and growing population of URLs in this state means Google knows about those pages but is not prioritizing their crawl, typically because budget is being consumed elsewhere.


What Wastes Crawl Budget? The Four Main Culprits

Faceted Navigation and Parameter URL Explosion

Faceted navigation is the filtering system that lets users narrow product results by attributes: color, size, brand, price, material, rating. Every filter combination that generates a crawlable URL creates a new entry in Google's perceived inventory. On a catalog with 4 filter dimensions, each carrying 10 values, the combinatorial math produces 10,000 possible URL combinations from a single category page. A site with 200 categories and that filter structure has 2 million potential parameter URLs competing with actual product pages for crawl budget.
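The arithmetic behind those figures is easy to reproduce for your own catalog. The sketch below assumes, as the example above does, that every facet has exactly one value selected; the optional mode (each facet may also be left unset) is an added assumption that shows the count only grows. The function name is illustrative.

```python
from math import prod


def facet_url_count(values_per_facet, require_all=True):
    """Count filter-URL combinations for one category page.

    require_all=True  -> every facet has exactly one value selected
    require_all=False -> each facet may also be left unselected
    """
    if require_all:
        return prod(values_per_facet)
    # "+ 1" adds the "not selected" state per facet;
    # "- 1" drops the unfiltered category page itself.
    return prod(v + 1 for v in values_per_facet) - 1


per_category = facet_url_count([10, 10, 10, 10])   # 10 * 10 * 10 * 10
site_total = 200 * per_category                    # 200 categories
print(per_category, site_total)
```

With four facets of ten values each, this gives 10,000 combinations per category and 2,000,000 across 200 categories, matching the numbers in the text.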

A fashion retailer case study published in 2026 found that despite having 50,000 actual products, the site had 2.3 million indexed pages. Googlebot was spending approximately 95% of its crawl budget on filter combinations that carried no unique search value. New product pages were being recrawled infrequently because the parameter URLs consumed the available crawl allocation first.

The fix is to prevent parameter URLs from entering Google's frontier. The most effective approach is robots.txt disallow rules targeting the parameter patterns (for example, Disallow: /*?color= and Disallow: /*?sort=). This keeps those URLs out of Google's crawl queue entirely. Note that a pattern such as /*?color= only matches when color is the first query parameter; a companion rule such as Disallow: /*&color= covers the cases where it appears later in the query string. For faceted filters that have genuine search value (a filter for "women's size 10 running shoes" that matches actual search queries), give those filtered states clean, canonical URLs with unique content and strong internal links, and handle the rest with robots.txt.
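Before deploying disallow rules, it helps to test them against real URLs from your site. The sketch below is a simplified matcher for the two pattern features Google documents for robots.txt paths: the * wildcard and a trailing $ anchor. It deliberately ignores Allow rules and rule precedence, so treat it as a sanity check rather than a full parser. The function names and sample URLs are illustrative assumptions.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' wildcard, optional trailing '$')."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text and turn each '*' into "match anything".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(path_and_query, disallow_patterns):
    """True if any Disallow pattern matches from the start of the path."""
    return any(
        robots_pattern_to_regex(p).match(path_and_query) for p in disallow_patterns
    )


rules = ["/*?color=", "/*?sort="]
print(is_disallowed("/shoes?color=red", rules))                       # blocked
print(is_disallowed("/shoes?size=10&color=red", rules))               # slips through
print(is_disallowed("/shoes?size=10&color=red", rules + ["/*&color="]))  # blocked
```

The second call shows a common gap: a rule written as /*?color= does not catch the parameter when it is not first in the query string, so the rule set needs a /*&color= companion (or an equivalent pattern) to cover it.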

Redirect Chains

Each redirect in a chain consumes a separate crawl request. A three-hop chain (old-page redirects to temp-redirect, which redirects to another-redirect, which redirects to final-destination) costs Googlebot four requests to reach one page: three redirect responses plus the final destination. Google's crawl budget documentation lists "avoid long redirect chains" explicitly in its best practices.

The wasted budget compounds on large sites during migrations, where thousands of old URLs may have inherited multi-hop chains accumulated over multiple previous migrations. Every hop adds latency and consumes a crawl slot that could have gone to new content. The fix is to audit all redirects and flatten every chain to a single direct 301 pointing to the final destination. Screaming Frog's redirect chain report surfaces these automatically during a crawl.
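Flattening is mechanical once the redirect rules are exported as source-to-target pairs. The sketch below assumes the rules fit in a plain dictionary (a hypothetical export format) and rewrites every source to point at its final destination, reporting loops separately instead of guessing at a target.

```python
def flatten_redirects(redirects):
    """Map every source URL directly to its final destination.

    `redirects` is a dict of {source: target}. Sources caught in a loop are
    returned in a separate set instead of being rewritten.
    """
    flattened, loops = {}, set()
    for source in redirects:
        seen = {source}
        target = redirects[source]
        # Follow the chain until the target is no longer itself redirected.
        while target in redirects:
            if target in seen:
                loops.add(source)
                break
            seen.add(target)
            target = redirects[target]
        else:
            flattened[source] = target
    return flattened, loops


chain = {
    "/old-page": "/temp-redirect",
    "/temp-redirect": "/another-redirect",
    "/another-redirect": "/final-destination",
}
flat, loops = flatten_redirects(chain)
print(flat)
print(loops)
```

After flattening, all three sources point straight at /final-destination, so each old URL costs one redirect response instead of up to three.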

Soft 404 Errors

A soft 404 is a page that returns an HTTP 200 (success) status code while displaying empty or error-level content. A product page for an out-of-stock item that shows "No results found" with a 200 response code is a soft 404. An empty category page, a search results page with no matches, and a deleted blog post that redirects to a generic error message are all soft 404 candidates.

Google's documentation says soft 404s "will continue to be crawled and waste your budget." Because the server says the page exists (HTTP 200), Googlebot keeps returning to check for new content that never arrives. The fix is to return proper 404 or 410 status codes for pages that genuinely no longer exist. For temporarily unavailable products, return 200 with a "notify me" option and meaningful content rather than a blank page, which prevents the soft 404 signal without deleting ranking history.
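A crawl export can be screened for likely soft 404s with a simple heuristic. The sketch below is an assumption-heavy starting point, not Google's detection logic: the phrase list and the 200-character minimum are placeholders to tune against your own templates, and the input is assumed to be the page's visible text with markup already stripped.

```python
# Hypothetical phrases; replace with the wording your own templates use.
ERROR_PHRASES = ("no results found", "page not found", "no longer available")


def looks_like_soft_404(status_code, body_text, min_chars=200):
    """Flag pages that return HTTP 200 but read like an empty or error page."""
    if status_code != 200:
        return False  # a real 404/410 is the correct signal, not a soft 404
    text = " ".join(body_text.split()).lower()
    return len(text) < min_chars or any(phrase in text for phrase in ERROR_PHRASES)


print(looks_like_soft_404(200, "No results found"))
print(looks_like_soft_404(404, "No results found"))
```

Pages flagged this way still need a manual look: a short page is not always an error page, and the right fix (404, 410, or richer content) depends on whether the URL is gone for good.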

Low-Quality Paginated Archives and Thin Content Pages

Pagination is not inherently a crawl budget problem. Page 2 of a category containing 40 products is legitimate content. Page 47 of a tag archive containing three posts, or page 73 of a blog filtered by a rarely-used category, is a crawl budget drain. The same applies to author archive pages with one post, tag pages with two posts, and year-archive pages containing only older content that has been superseded.

These pages have no external search demand and minimal internal authority. Googlebot allocates crawl slots to them because they exist as crawlable URLs, but they return near-zero value to the index. Blocking them with robots.txt removes them from the perceived inventory that drives crawl demand. A noindex tag (for pages that should stay out of search but do not need crawl prevention) keeps them out of the index, but Googlebot still has to fetch each page to see the tag, so noindex alone does not recover the crawl slots.


Why noindex Does Not Save Crawl Budget

This is the most important misconception in crawl budget management, and it is explicitly addressed in Google's official documentation: "Don't use noindex, as Google will still request, but then drop the page when it sees a noindex meta tag or header in the HTTP response, wasting crawling time."

The noindex directive tells Google not to include a page in the index. It does not tell Google not to fetch the page. Googlebot must crawl the page, render it, read the HTML, find the noindex tag, and then discard the page from the index. Every step of that process consumes crawl capacity. For a site with 50,000 noindexed parameter URLs, that means Googlebot is spending 50,000 crawl slots per cycle on pages it immediately discards.

The correct split is: use robots.txt to block crawling of URLs you never want Google to touch at all. Use noindex for pages that need to be crawled (so Google can follow their links and update its understanding of canonical relationships) but should not appear in search results. Never use noindex as a substitute for robots.txt when the goal is to recover crawl budget.

Notice how this distinction reveals the correct tool for each scenario: a faceted navigation URL with no SEO value should be blocked in robots.txt because Google has no reason to crawl it. A staging version of a page that must be accessible for quality checks but should not rank should use noindex because the crawl itself is acceptable, just not the indexing.
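The split described above reduces to two questions per URL pattern. The helper below is a minimal sketch of that decision; the function name, parameter names, and return strings are illustrative assumptions.

```python
def choose_directive(needs_crawl, should_appear_in_search):
    """Pick the control that matches the goal for a URL pattern."""
    if should_appear_in_search:
        if not needs_crawl:
            raise ValueError("a page cannot appear in search without being crawled")
        return "no directive"
    # Not wanted in search: the remaining question is whether the fetch is acceptable.
    return "noindex" if needs_crawl else "robots.txt disallow"


print(choose_directive(needs_crawl=False, should_appear_in_search=False))  # facet URL
print(choose_directive(needs_crawl=True, should_appear_in_search=False))   # staging page
```

The helper never returns both controls together because a page blocked in robots.txt is never fetched, so a noindex tag on it would go unread.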


How Do You Diagnose Crawl Budget Problems?

Two sources of data provide complementary views of crawl efficiency: Google Search Console's Crawl Stats report and server log files. Neither is sufficient alone. [3]

Crawl Stats report. Access it in Search Console under Settings, then Crawl Stats. The report shows total crawl requests per day, average response time, download size, and a breakdown by purpose (Discovery vs. Refresh), response code, file type, and Googlebot type. The key diagnostic signals are: a high proportion of 4xx and 5xx response codes (indicating broken or error-prone pages consuming crawl slots), a response time consistently above 200ms (indicating server performance is constraining the capacity limit), a high proportion of Refresh crawls and low proportion of Discovery crawls (indicating Googlebot is cycling through old pages rather than reaching new ones), and a large crawl share going to file types other than HTML, such as JavaScript files and image variants.

Server log file analysis. Log files are ground truth. They record every URL Googlebot requested, the HTTP status code returned, the crawl timestamp, and the user agent. Search Console Crawl Stats shows aggregate patterns; log files show individual URL-level behavior. Tools such as Screaming Frog Log File Analyser, Botify, and JetOctopus can process these logs at scale and surface the specific URLs consuming the most budget. A common finding during log analysis: Googlebot is spending significant crawl time on URLs that do not appear in any sitemap and are not linked from any important page, because they were discovered months ago via a parameter link and have been in the crawl queue ever since.

The most diagnostic question a log file analysis answers: what percentage of Googlebot's visits to your site land on pages you actually want indexed? If that percentage is below 50%, you have a crawl budget problem. The budget is being spent, but not on the right pages.
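That percentage can be computed directly from access logs. The sketch below assumes the common "combined" log format and a caller-supplied rule for what counts as a wanted URL (here, simply any URL without a query string). The sample lines, IP addresses, and names are made up for illustration. It also trusts the user-agent string, which can be spoofed; Google recommends verifying Googlebot by reverse DNS lookup, which this sketch does not do.

```python
import re

# Matches the request, status, size, referrer, and user agent of a combined-format line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_wanted_share(log_lines, is_wanted):
    """Fraction of Googlebot requests that hit URLs you want indexed (None if no hits)."""
    hits = wanted = 0
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue
        hits += 1
        if is_wanted(match.group("path")):
            wanted += 1
    return wanted / hits if hits else None


GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SAMPLE_LOG = [
    f'192.0.2.10 - - [08/Jun/2026:10:00:00 +0000] "GET /products/blue-shoe HTTP/1.1" 200 5120 "-" "{GOOGLEBOT_UA}"',
    f'192.0.2.10 - - [08/Jun/2026:10:00:02 +0000] "GET /shoes?color=red&sort=price HTTP/1.1" 200 4800 "-" "{GOOGLEBOT_UA}"',
    f'192.0.2.10 - - [08/Jun/2026:10:00:04 +0000] "GET /shoes?color=blue HTTP/1.1" 200 4790 "-" "{GOOGLEBOT_UA}"',
    '192.0.2.55 - - [08/Jun/2026:10:00:05 +0000] "GET /products/blue-shoe HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

share = googlebot_wanted_share(SAMPLE_LOG, is_wanted=lambda path: "?" not in path)
print(f"{share:.0%} of Googlebot requests hit wanted URLs")
```

In this sample, one of the three Googlebot requests lands on a wanted page, which falls below the 50% mark discussed above. On a real site, replace the is_wanted rule with a lookup against your sitemap or canonical URL list.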

Sources

  1. Google Crawling Infrastructure. "Optimize Your Crawl Budget." Google for Developers.

  2. Google Search Central Blog. "What Crawl Budget Means for Googlebot." Google for Developers.

  3. Google Search Central. "Crawl Stats Report." Search Console Help.

Frequently Asked Questions (FAQs)

What is crawl budget in SEO?

Crawl budget is the number of URLs Googlebot can and wants to crawl on a site within a given timeframe. It is determined by two components: crawl capacity limit (how many parallel connections and how quickly Googlebot can fetch without overloading the server) and crawl demand (how much Google values crawling a site's pages based on authority, freshness, and perceived inventory size). The interaction of both components determines actual crawl behavior.

Does crawl budget affect small websites?

According to Google's official documentation, crawl budget is not a concern for most small sites. The guidance explicitly states that sites with fewer than a few thousand URLs that change infrequently will be crawled efficiently without any special optimization. Crawl budget becomes a practical SEO constraint for sites with more than 10,000 frequently updated pages, or for large sites with more than one million pages updated weekly.

What are the biggest crawl budget wasters?

The four major crawl budget wasters are: faceted navigation and URL parameter combinations that generate thousands of near-duplicate pages; redirect chains where multiple hops are required to reach the final destination; soft 404 errors (pages returning HTTP 200 with empty or error-level content); and low-quality paginated archives, tag pages, and author pages with minimal unique content. All four cause Googlebot to spend crawl slots on pages that return little indexable value.

Does noindex save crawl budget?

No. Google's documentation explicitly states that noindex does not save crawl budget because Googlebot still fetches and renders the page before reading the noindex directive and dropping the page. To prevent crawling of truly unwanted URLs, use robots.txt disallow rules. Reserve noindex for pages that should be crawled but not indexed, such as search results pages or filtered variants where canonical signals matter but the pages themselves should not appear in Google Search.

How do I check my crawl budget in Google Search Console?

Google Search Console does not display crawl budget as a single number, but the Crawl Stats report (Settings, then Crawl Stats) provides the data you need to assess crawl efficiency. Look at the breakdown of crawl requests by purpose (Discovery vs. Refresh), response code, and file type. A high proportion of Refresh crawls and 4xx response codes, combined with slow average response time, indicates budget is being consumed inefficiently. The volume of URLs marked "Discovered - currently not indexed" in the Pages report is the most direct signal of a crawl demand problem.

What is the difference between crawl rate limit and crawl demand?

Crawl rate limit is the server-side ceiling: how many pages per second Googlebot can fetch without harming your server's performance. Crawl demand is the content-side floor: how many of your pages Google actually considers worth crawling based on their quality, freshness, and link authority. A site can have high capacity (fast server) but low demand (lots of thin or duplicate pages), causing Googlebot to crawl less than capacity would allow. Fixing server speed alone never resolves a demand problem.
