
How Web Crawlers Work: Seeds, URL Frontiers & Crawl Rate

May 26, 2026 · 18 min read

A web crawler is a program that discovers pages on the web by fetching URLs, reading their HTML, extracting links, and adding those new links to a queue of pages to visit next. That sounds simple, but at search-engine scale, crawling becomes a prioritization problem.

A crawler cannot fetch every URL on the internet immediately. It has to decide which pages to crawl first, which pages to delay, which servers to slow down for, and which duplicate or low-value URLs to ignore. That decision-making system is what separates a basic crawler from a production search crawler.

Quick answer: A web crawler starts with seed URLs, downloads each page, extracts links, removes duplicates, adds new URLs to a URL frontier, and schedules the next fetch. The URL frontier decides crawl order, while politeness rules prevent the crawler from overloading any one server. Googlebot also uses an algorithmic process to decide which sites to crawl, how often, and how many pages to fetch [1].

This guide explains the full crawler loop, seed sets, URL frontiers, crawl ordering, front queues, back queues, politeness, crawl-delay, Googlebot crawl rate, crawl budget, and the SEO mistakes that stop important pages from being discovered.


How a web crawler works in 7 steps

At a high level, crawling follows this loop:

| Step | What happens | SEO impact |
| --- | --- | --- |
| 1. Start with seed URLs | The crawler begins from known URLs | Pages not linked or submitted may never be discovered |
| 2. Fetch the page | The crawler sends an HTTP request | Server speed and errors affect crawl efficiency |
| 3. Parse the HTML | The crawler reads links, text, canonicals, meta tags, and resources | Broken HTML or JS-only links can delay discovery |
| 4. Extract links | Internal and external links are collected | Good internal linking helps important pages enter the crawl path |
| 5. Normalize URLs | URLs are cleaned and standardized | Parameter chaos can create duplicate crawl paths |
| 6. Deduplicate URLs | Already-seen URLs are filtered out | Prevents wasting crawl resources |
| 7. Add URLs to the frontier | New URLs enter the crawl queue | Priority and politeness decide when they are fetched |

The loop then repeats. Every newly crawled page can reveal more URLs, and every new URL can reveal even more pages.

The hard part is not fetching one page. The hard part is deciding what to fetch next.
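
The seven steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: `FAKE_WEB` is a hypothetical in-memory stand-in for real HTTP fetching, the frontier is a plain first-in-first-out queue, and normalization is reduced to resolving relative links and dropping fragments.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

# Hypothetical in-memory "web" (URL -> HTML) so the sketch runs offline.
# A real crawler would send HTTP requests here instead.
FAKE_WEB = {
    "https://example.com/": '<a href="/blog/">Blog</a> <a href="/products/">Products</a>',
    "https://example.com/blog/": '<a href="/">Home</a> <a href="/blog/post-1/">Post</a>',
    "https://example.com/products/": '<a href="/blog/post-1/#top">Post</a>',
    "https://example.com/blog/post-1/": "<p>No links here.</p>",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag (steps 3 and 4)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch=FAKE_WEB.get, max_pages=100):
    frontier = deque(seeds)   # step 1: start from the seed URLs
    seen = set(seeds)         # every URL ever queued, for deduplication
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        html = fetch(url)     # step 2: fetch the page
        if html is None:      # unknown URL in the fake web: skip it
            continue
        crawled.append(url)
        parser = LinkExtractor()
        parser.feed(html)     # steps 3 and 4: parse and extract links
        for href in parser.links:
            # step 5: resolve relative links and drop #fragments
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:        # step 6: deduplicate
                seen.add(absolute)
                frontier.append(absolute)   # step 7: add to the frontier
    return crawled


order = crawl(["https://example.com/"])
print(order)
```

Starting from the homepage alone, the sketch reaches all four pages and visits each once, even though `/blog/post-1/` is linked twice and one of those links carries a fragment.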


What is a web crawler?

A web crawler, also called a spider, bot, or robot, is automated software that browses the web by following links from page to page. Search engines use crawlers to discover pages, understand site structure, and collect content for indexing.

Googlebot is Google’s main web crawler. Once Google discovers a URL, Googlebot may crawl the page to understand what is on it. Google says its crawling is algorithmic: its systems decide which sites to crawl, how often to crawl them, and how many pages to fetch from each site [1].

A basic crawler does four things:

  1. Fetches a URL
  2. Reads the page
  3. Extracts links
  4. Adds new links to a queue

A production crawler adds more layers:

  1. URL normalization
  2. Duplicate detection
  3. Priority scoring
  4. Host-level politeness
  5. Robots.txt checks
  6. Rendering for JavaScript-heavy pages
  7. Recrawl scheduling
  8. Error handling
  9. Crawl trap detection

That is where crawling becomes a search-engine architecture problem, not just a scripting problem.


What are seed URLs?

A seed URL is a starting URL a crawler knows before it begins crawling.

Examples of seed URLs include:

  • A homepage submitted in Google Search Console
  • A URL listed in an XML sitemap
  • A high-authority page already known to the crawler
  • A news publisher homepage in a news crawler
  • A product category page in an ecommerce crawler
  • A documentation homepage in a focused technical crawler

Seed URLs matter because crawlers discover new pages by following links outward from pages they already know.

If a page has no internal links pointing to it and is not listed in a sitemap, the crawler has no normal discovery path to it. That kind of page is usually called an orphan page.

For SEO, this means discovery is not automatic. A page needs a path into the crawler’s system.


Seed set example

A simple crawler might start with this seed set:

seeds = [
    "https://example.com/",
    "https://example.com/blog/",
    "https://example.com/products/"
]

From those three URLs, the crawler fetches the pages, extracts links, and discovers more URLs.

A larger crawler may begin with millions of known URLs. But the principle is the same: the seed set defines the crawler’s starting map.

Bad seed selection creates bad coverage. If a crawler starts from a narrow group of websites, it may discover pages that are close to that group faster than pages outside it. In SEO, the same principle applies at site level: if your homepage and category pages do not link to your important pages, those pages are farther away from the crawler’s starting points.


What is a URL frontier?

The URL frontier is the system that stores URLs waiting to be crawled.

Most beginners think of it as a simple queue:

URL 1 → URL 2 → URL 3 → URL 4

But at scale, a simple queue is not enough.

A production URL frontier must answer two questions at the same time:

  1. Priority: Which URL should be crawled first?
  2. Politeness: How fast can we crawl this host without overloading it?

That means the URL frontier is not just a list. It is a scheduling system.

The classic Mercator-style frontier separates these concerns into front queues and back queues. Stanford’s information retrieval material describes the URL frontier as a system where front queues manage prioritization and back queues enforce politeness [2].


URL frontier diagram

Seed URLs
    ↓
Fetcher
    ↓
HTML Parser
    ↓
Link Extractor
    ↓
URL Normalizer
    ↓
Duplicate URL Filter
    ↓
URL Frontier
    ↓
┌──────────────────────┐
│ Front Queues         │ → decide what is important
└──────────────────────┘
    ↓
┌──────────────────────┐
│ Back Queues          │ → decide when each host can be crawled
└──────────────────────┘
    ↓
Fetcher (the loop repeats)


Front queues vs back queues

A strong crawler separates priority from politeness.

| Queue type | Main job | What it controls |
| --- | --- | --- |
| Front queues | Prioritization | Which URLs deserve attention first |
| Back queues | Politeness | How quickly the crawler contacts each host |

Front queues decide importance

Front queues organize URLs by priority.

A URL might receive higher priority if:

  • It has many internal links
  • It is close to the homepage
  • It belongs to an important section
  • It changes frequently
  • It has strong external links
  • It is likely to be fresh or valuable
  • It belongs to a trusted host

For a search engine, priority is about maximizing value per fetch. For a site owner, priority is about making sure the pages that matter most are easy for crawlers to find.

Back queues protect servers

Back queues group URLs by host or domain. This prevents a crawler from sending too many requests to the same server in a short period.

For example, if a crawler discovers 500 URLs from example.com, it should not fetch all 500 immediately. It should space out the requests so the server remains stable.

That is crawl politeness.
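
To make the split between priority and politeness concrete, here is a simplified sketch. It is not Mercator's actual data layout, which uses a fixed set of front and back queues. Instead it keeps one priority heap per host and only hands out a URL when that host's delay has passed. The class name `Frontier`, the two-second delay, and the priority numbers are all assumptions chosen for illustration.

```python
import heapq
from urllib.parse import urlsplit


class Frontier:
    """Toy frontier: best priority first, but never faster than one
    request per `delay_seconds` to the same host."""

    def __init__(self, delay_seconds=2.0):
        self.delay = delay_seconds
        self.queues = {}        # host -> heap of (priority, seq, url)
        self.next_allowed = {}  # host -> earliest time of the next fetch
        self.seq = 0            # tie-breaker that keeps insertion order

    def add(self, url, priority):
        """Queue a URL; a lower priority number means 'crawl sooner'."""
        host = urlsplit(url).netloc
        heapq.heappush(self.queues.setdefault(host, []), (priority, self.seq, url))
        self.seq += 1

    def next_url(self, now):
        """Return the best URL whose host may be contacted at `now`, else None."""
        ready = [
            host for host, queue in self.queues.items()
            if queue and self.next_allowed.get(host, 0.0) <= now
        ]
        if not ready:
            return None
        # Priority role: among polite-to-contact hosts, take the best URL.
        host = min(ready, key=lambda h: self.queues[h][0])
        _priority, _seq, url = heapq.heappop(self.queues[host])
        # Politeness role: this host now has to wait before its next fetch.
        self.next_allowed[host] = now + self.delay
        return url


frontier = Frontier(delay_seconds=2.0)
frontier.add("https://a.example/1", priority=0)
frontier.add("https://a.example/2", priority=0)
frontier.add("https://b.example/1", priority=5)

first = frontier.next_url(now=0.0)   # best priority overall
second = frontier.next_url(now=0.0)  # a.example is cooling down, so b.example goes
third = frontier.next_url(now=0.0)   # nothing is allowed yet
fourth = frontier.next_url(now=2.0)  # a.example may be contacted again
print(first, second, third, fourth)
```

The second call is the interesting one: a higher-priority URL is still waiting, but the frontier serves a lower-priority URL on a different host rather than hitting the same server twice in a row.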


Why URL ordering matters

URL ordering is the decision of which discovered URL gets crawled next.

This matters because crawlers operate under limits. They have limited time, bandwidth, compute, and server tolerance. If a crawler wastes too many requests on low-value URLs, it may delay or miss important pages.

The classic paper “Efficient Crawling Through URL Ordering” by Junghoo Cho, Hector Garcia-Molina, and Lawrence Page studied how a crawler should order URLs to reach important pages faster. The paper’s core idea is that URL order affects how quickly a crawler finds valuable pages when it cannot crawl the entire web immediately [3].

This is why internal linking matters for SEO.

A page linked from the homepage is easier to discover than a page buried six clicks deep. A page linked from multiple category pages is easier to rediscover than a page linked from one old archive page. A page with no links pointing to it may not enter the crawler’s discovery path at all.


Crawl ordering strategies

Different crawlers use different ordering strategies.

| Strategy | How it works | Best use case | Weakness |
| --- | --- | --- | --- |
| Breadth-first crawling | Crawls pages in discovery order, layer by layer | Broad discovery | May not go deep quickly |
| Depth-first crawling | Follows one path deeply before returning | Small controlled crawls | Can get stuck deep in one site |
| PageRank-priority crawling | Prioritizes URLs estimated to be important | Search engines | Needs link/authority estimates |
| Freshness-based crawling | Prioritizes pages that change often | News, ecommerce, feeds | May ignore stable evergreen pages |
| Focused crawling | Prioritizes topic-relevant pages | Vertical search engines | Needs topic classification |
| Sitemap-assisted crawling | Uses submitted sitemaps as discovery hints | Site-level SEO | Sitemap inclusion does not guarantee crawling |

Breadth-first crawling is important because it tends to discover well-linked pages early. Research on large crawls has found breadth-first crawling can yield high-quality pages early when quality is measured with link-based metrics such as PageRank [4].

But no single strategy is perfect. Modern crawlers usually combine multiple signals.
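
The difference between the first two strategies in the table comes down to which end of the frontier the crawler takes from. The sketch below uses a small hypothetical link graph (`LINKS`) and switches between queue and stack behavior with one flag.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "/": ["/blog/", "/products/"],
    "/blog/": ["/blog/post-1/", "/blog/post-2/"],
    "/products/": ["/products/widget/"],
    "/blog/post-1/": [],
    "/blog/post-2/": [],
    "/products/widget/": [],
}


def crawl_order(start, breadth_first=True):
    """Return the visit order for a queue-based (breadth-first) or
    stack-based (depth-first-style) frontier."""
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # Queue: take the oldest URL. Stack: take the newest URL.
        page = frontier.popleft() if breadth_first else frontier.pop()
        order.append(page)
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


bfs = crawl_order("/", breadth_first=True)
dfs = crawl_order("/", breadth_first=False)
print(bfs)
print(dfs)
```

In the breadth-first order both top-level sections are visited before any post. In the stack-based order the crawler finishes everything under `/products/` before it touches `/blog/`.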


What is crawl politeness?

Crawl politeness is the set of rules a crawler follows to avoid overwhelming a website.

A crawler that sends too many requests too quickly can:

  • Slow down the website
  • Trigger rate limits
  • Cause 429 responses
  • Cause 5xx errors
  • Get blocked by firewalls
  • Waste its own crawl resources

Politeness is good for both sides. It protects the website, and it helps the crawler keep access to the content.

Common politeness mechanisms include:

  1. Waiting between requests to the same host
  2. Limiting concurrent requests per domain
  3. Respecting robots.txt access rules
  4. Backing off after server errors
  5. Reducing crawl activity when response times get worse

For SEO, politeness matters because server reliability affects how efficiently crawlers can access your pages.
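
Mechanisms 4 and 5 in the list above are often implemented as an adaptive delay. The function below is one possible sketch. Every number in it (the doubling factor, the 2-second slowness threshold, the 60-second cap) is an assumption chosen for illustration, not a value taken from any real crawler.

```python
def next_delay(current_delay, status_code, response_seconds,
               base_delay=1.0, max_delay=60.0, slow_threshold=2.0):
    """Return the wait before the next request to the same host."""
    # Overload signals: back off hard by doubling the delay, up to a cap.
    if status_code == 429 or 500 <= status_code < 600:
        return min(current_delay * 2, max_delay)
    # Successful but slow: back off gently.
    if response_seconds > slow_threshold:
        return min(current_delay * 1.5, max_delay)
    # Healthy response: ease back toward the base delay.
    return max(current_delay / 2, base_delay)


print(next_delay(1.0, 503, 0.2))   # server error
print(next_delay(2.0, 200, 3.0))   # slow response
print(next_delay(4.0, 200, 0.1))   # healthy response
```

The asymmetry is deliberate in this sketch: the delay grows quickly when the server struggles and shrinks back only while responses stay healthy.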


Does Google respect Crawl-delay in robots.txt?

No. Google Search does not use Crawl-delay as a supported robots.txt rule.

Google’s robots.txt documentation explains that robots.txt controls which URLs crawlers can access. It is not a reliable mechanism for keeping pages out of Google’s index, and unsupported robots.txt rules should not be treated as Google directives [5].

Google has also stated that rules like crawl-delay are not part of RFC 9309 and are not supported by Google Search, though some other search engines may support them [6].

Example:

User-agent: *
Crawl-delay: 10

This may affect some crawlers, but it does not directly control Googlebot.
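
For a custom crawler that chooses to honor the rule, reading it is straightforward. As a minimal sketch, Python's standard-library `urllib.robotparser` exposes the value through `crawl_delay()`. The user-agent name `MyCrawler` is a hypothetical placeholder, and nothing here reflects how Googlebot behaves.

```python
from urllib.robotparser import RobotFileParser

# The same two-line robots.txt shown above.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "MyCrawler" is a made-up user agent; it falls under the "*" group.
delay = parser.crawl_delay("MyCrawler")
allowed = parser.can_fetch("MyCrawler", "https://example.com/guides/")
print(delay, allowed)
```

The parser only reports the value. Whether a crawler actually waits that long between requests is up to the crawler's own scheduling code.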

For Googlebot, you should focus on:

  • Fast server response
  • Stable hosting
  • Proper HTTP status codes
  • Clean internal linking
  • Crawlable navigation
  • Sitemaps
  • Avoiding crawl traps
  • Avoiding mass low-value URL generation

How Googlebot adapts crawl rate

Googlebot’s crawl rate is not fixed. Google’s systems decide how much crawling a site can handle and how much crawling Google wants to do.

Google announced that the Search Console crawl rate limiter tool was deprecated on January 8, 2024, because improvements in Google’s crawling logic made the tool less useful [7].

That update matters because old SEO advice often says, “Use the crawl rate setting in Search Console.” That advice is outdated.

Today, the practical levers are server health, crawl demand, and site quality.

Google has explained crawl budget as the number of URLs Googlebot can and wants to crawl. In other words, crawl budget combines crawl capacity and crawl demand [8].


Crawl rate vs crawl budget

Crawl rate and crawl budget are related, but they are not the same thing.

| Concept | Meaning | Main lever |
| --- | --- | --- |
| Crawl rate | How fast Googlebot makes requests | Server capacity and response stability |
| Crawl demand | How much Google wants to crawl | Popularity, freshness, importance |
| Crawl budget | How many URLs Google can and wants to crawl | Crawl rate + crawl demand |

A fast server can support more crawling, but speed alone does not make every page worth crawling.

If a site has thousands of thin, duplicate, parameterized, or low-value URLs, higher crawl capacity can still be wasted.

Better crawl budget optimization means:

  • Improve internal linking
  • Remove crawl traps
  • Consolidate duplicates
  • Use canonical tags correctly
  • Keep XML sitemaps clean
  • Return correct status codes
  • Make important pages closer to the homepage
  • Improve content quality and freshness

How server responses affect crawling

Googlebot pays attention to how a server responds.

| Server signal | Likely crawler response |
| --- | --- |
| Fast 200 responses | Crawling can continue normally |
| Slow responses | Crawling may become more conservative |
| Repeated 500/503 errors | Crawling may slow down |
| 429 Too Many Requests | Signals rate limiting or overload |
| DNS failures | Crawling can be disrupted |
| Robots.txt unavailable | Crawling may be delayed or restricted depending on the situation |

Google’s crawling error guidance says availability issues can prevent Google from crawling as much as it might want to crawl.

That does not mean every slow page instantly loses rankings. It means poor availability can reduce crawl efficiency, especially on large sites where Googlebot must choose where to spend crawl resources [8][9].


Why internal linking is crawl architecture

Internal linking is not just UX. It is crawler routing.

A crawler follows links. So your internal links tell crawlers which pages matter, how sections connect, and how far each page is from the site’s strongest entry points.

Compare these two structures:

Weak structure

Homepage → Blog → Archive → Page 7 → Important Article

Strong structure

Homepage → Main Topic Hub → Important Article

The second structure is better because the important article is two clicks from the homepage instead of four, so a crawler that starts at the homepage reaches it sooner.
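
Click depth can be measured with a breadth-first search over the internal link graph. The sketch below encodes the two structures above as dictionaries; the function name `click_depths` is an assumption for illustration.

```python
from collections import deque


def click_depths(links, start):
    """Return the minimum number of clicks from `start` to every reachable page."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:   # first visit is the shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


weak = {
    "Homepage": ["Blog"],
    "Blog": ["Archive"],
    "Archive": ["Page 7"],
    "Page 7": ["Important Article"],
}
strong = {
    "Homepage": ["Main Topic Hub"],
    "Main Topic Hub": ["Important Article"],
}

weak_depth = click_depths(weak, "Homepage")["Important Article"]
strong_depth = click_depths(strong, "Homepage")["Important Article"]
print(weak_depth, strong_depth)
```

Run against a real crawl export, the same function would flag every page whose depth exceeds whatever threshold a site chooses to treat as too deep.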

For SEO, important pages should usually be:

  • Linked from the homepage when appropriate
  • Linked from relevant category or hub pages
  • Included in XML sitemaps
  • Linked from related articles
  • Not hidden behind forms, filters, or JavaScript-only navigation
  • Not buried deep in pagination

Internal links help crawlers discover, prioritize, and revisit important pages.


Orphan pages: why crawlers miss them

An orphan page is a page with no internal links pointing to it.

A page can exist, load correctly, and still be hard for crawlers to discover if nothing links to it.

Common orphan page causes:

  • Published pages not added to navigation
  • Old landing pages removed from menus
  • Product pages not linked from categories
  • Blog posts excluded from archives
  • Pages only accessible through site search
  • JavaScript routes without crawlable anchor links
  • Migration mistakes
  • Deleted category links

How to fix orphan pages:

  1. Crawl your own site with a crawler such as Screaming Frog, Sitebulb, or a custom crawler
  2. Export all indexable URLs from your CMS
  3. Export URLs from XML sitemaps
  4. Compare these lists
  5. Find URLs that exist but have no internal inlinks
  6. Add contextual internal links from relevant pages
  7. Keep the XML sitemap clean and updated

A sitemap can help discovery, but internal links still matter because they provide context and importance signals.
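
Steps 4 and 5 of the fix list reduce to a set difference: URLs you know exist, minus URLs that receive at least one internal link. The sketch below assumes the CMS and sitemap exports are already loaded as URL lists and the site crawl as `(source, target)` link pairs. The sample data is hypothetical.

```python
def find_orphans(known_urls, internal_links):
    """known_urls: URLs from the CMS export and XML sitemaps.
    internal_links: (source, target) pairs found by crawling the site."""
    # A page linking to itself does not count as an inlink.
    linked = {target for source, target in internal_links if source != target}
    return sorted(set(known_urls) - linked)


known = [
    "https://example.com/",
    "https://example.com/guides/web-crawlers/",
    "https://example.com/landing/old-campaign/",
]
links = [
    ("https://example.com/guides/web-crawlers/", "https://example.com/"),
    ("https://example.com/", "https://example.com/guides/web-crawlers/"),
    ("https://example.com/landing/old-campaign/", "https://example.com/landing/old-campaign/"),
]

orphans = find_orphans(known, links)
print(orphans)
```

The comparison only works if both lists use the same URL form, so normalize trailing slashes, protocol, and host before comparing.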


Crawl traps: how sites waste the frontier

A crawl trap is a URL pattern that creates too many low-value or duplicate URLs.

Common crawl traps include:

| Crawl trap type | Example |
| --- | --- |
| Faceted navigation | ?color=red&size=large&sort=price |
| Infinite calendars | /events/2099/12/31/ |
| Session IDs | ?sid=abc123 |
| Internal search results | /search?q=blue+shoes |
| Sort parameters | ?sort=price_asc |
| Filter combinations | ?brand=x&size=y&material=z |
| Duplicate trailing slash patterns | /page and /page/ |
| Mixed casing | /Product and /product |

Crawl traps hurt because they fill the URL frontier with pages that do not deserve crawl attention.

For ecommerce and large publishing sites, crawl traps can delay crawling of real content.

How to reduce crawl traps:

  • Use canonical tags for duplicates
  • Block low-value crawl paths carefully in robots.txt
  • Use noindex where indexing is the problem
  • Avoid linking to infinite URL combinations
  • Keep faceted navigation crawl rules clear
  • Normalize URL parameters
  • Make important category paths static and clean
  • Monitor crawl stats after major site changes
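
On the crawler side, parameter normalization usually means mapping every variant of a URL to one canonical string before deduplication. The sketch below shows one possible policy using `urllib.parse`. The ignored-parameter list and the decision to strip trailing slashes are assumptions that would differ from site to site. Path casing is left alone, because paths can be case-sensitive and that duplicate is better fixed with server-side redirects.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that never change page content on this example site.
IGNORED_PARAMS = {"sid", "utm_source", "utm_medium", "utm_campaign"}


def normalize(url):
    """Map equivalent URL variants to a single canonical string."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.netloc.lower()          # host names are case-insensitive
    path = parts.path or "/"
    if path != "/" and path.endswith("/"):
        path = path.rstrip("/")          # assumed policy: no trailing slash
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS     # drop session and tracking parameters
    ]
    query = urlencode(sorted(params))    # stable parameter order
    return urlunsplit((scheme, host, path, query, ""))  # drop the fragment


a = normalize("https://Example.com/page/?sid=abc123&b=2&a=1#top")
b = normalize("https://example.com/page?a=1&b=2")
print(a)
print(a == b)
```

Both inputs collapse to the same string, so a deduplication set would treat them as one URL instead of two crawl targets.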

Googlebot vs generic crawlers

Not every crawler behaves like Googlebot.

| Feature | Generic crawler | Googlebot |
| --- | --- | --- |
| Starts from seed URLs | Yes | Yes |
| Uses a URL frontier | Usually | Yes, at massive scale |
| Respects robots.txt access rules | Usually | Yes |
| Supports Crawl-delay | Some do | No, not for Google Search |
| Adapts crawl rate | Advanced crawlers do | Yes |
| Renders JavaScript | Sometimes | Google can render pages |
| Uses sitemaps | Sometimes | Yes, as discovery signals |
| Uses Search Console data | No | Google systems can use submitted data |
| Crawls for public search index | Usually no | Yes |

This is why SEO advice should not treat all crawlers the same.

A custom crawler, Bingbot, Googlebot, and a scraping bot may all fetch URLs, but their rules, goals, and scheduling systems can differ.


Worked example: one article entering the crawl system

Imagine you publish this page:

https://example.com/guides/web-crawlers/

Here is what can happen.

Step 1: Discovery

Google may discover the URL from:

  • Your XML sitemap
  • A link from your homepage
  • A link from a related guide
  • A backlink from another website
  • A URL submitted through Search Console

If the page is not linked from anywhere and is missing from the sitemap, Google has no normal path to discover it.

Step 2: Frontier entry

Once discovered, the URL can enter the crawl system.

It is now waiting to be fetched. But it is not automatically crawled instantly.

The frontier considers signals such as site importance, URL importance, freshness, and crawl capacity.

Step 3: Fetch

Googlebot requests the page.

If the server returns a fast 200 response, the crawl can continue normally.

If the server returns 500, 503, or repeated timeouts, crawling may slow down.

Step 4: Parse

The crawler reads the HTML.

It checks links, canonical tags, robots directives, structured data, headings, and visible content.

Step 5: Extract new links

The crawler finds links to related pages.

For example:

/guides/crawl-budget/
/guides/robots-txt/
/guides/indexing/
/guides/internal-linking/

Those URLs can then enter the frontier.
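
Before those paths can enter the frontier, they have to be resolved against the URL of the page they were found on. A short sketch with `urllib.parse.urljoin`, using the example page as the base. The last `href`, a relative `../indexing/` link, is a hypothetical addition to show that different link styles can resolve to the same URL.

```python
from urllib.parse import urljoin

base = "https://example.com/guides/web-crawlers/"
hrefs = [
    "/guides/crawl-budget/",
    "/guides/robots-txt/",
    "/guides/indexing/",
    "/guides/internal-linking/",
    "../indexing/",  # hypothetical relative link to the same page as the third href
]

absolute = [urljoin(base, href) for href in hrefs]
unique = sorted(set(absolute))
print(unique)
```

Five extracted links become four frontier entries, because the root-relative and the relative form of the indexing link resolve to the same absolute URL.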

Step 6: Scheduling

The crawler decides when to crawl those linked pages.

Pages linked from strong, relevant sections are more likely to be discovered and revisited efficiently.

Step 7: Recrawl

If the page changes often, earns links, or becomes important, it may be crawled again more frequently.

If it remains isolated, slow, duplicated, or low-value, recrawling may be less frequent.


Practical SEO checklist for crawlability

Use this checklist before publishing important pages.

| Check | Why it matters |
| --- | --- |
| Page is linked internally | Helps crawler discovery |
| Page is in XML sitemap | Adds a discovery path |
| Page returns 200 status | Confirms accessible content |
| Page is not blocked in robots.txt | Allows crawling |
| Page is not accidentally noindexed | Allows indexing |
| Canonical points to itself or correct preferred URL | Avoids duplicate confusion |
| Important links use <a href> | Makes links easier to discover |
| Page loads quickly | Supports crawl efficiency |
| Page is not buried too deep | Improves discovery priority |
| Related pages link back to it | Builds topical structure |
| Navigation works without requiring user actions | Helps crawler access |
| No infinite parameter links | Prevents crawl traps |

This checklist is especially important for large websites, ecommerce sites, marketplaces, publishers, and documentation sites.


Common crawler mistakes that hurt SEO

1. Publishing pages with no internal links

A page that can only be reached by entering its URL directly gives a link-following crawler no path to find it.

Fix: Add links from relevant hubs, categories, and related content.

2. Creating too many filter URLs

Faceted navigation can create thousands or millions of URL combinations.

Fix: Decide which filters deserve indexable static pages and control the rest.

3. Blocking the wrong URLs in robots.txt

Robots.txt controls crawling, not indexing by itself. Blocking a URL can prevent Google from seeing important page-level signals.

Fix: Use robots.txt for crawl control, and use noindex when the goal is to keep a page out of the index.

4. Relying on Crawl-delay for Googlebot

Google Search does not support Crawl-delay.

Fix: Improve server reliability and manage crawl paths instead.

5. Hiding links behind JavaScript actions

If important links only appear after clicks, filters, or client-side rendering, discovery can be delayed or weakened.

Fix: Use crawlable anchor links for important paths.

6. Leaving redirect chains

Long redirect chains waste fetches and slow crawling.

Fix: Link directly to final canonical URLs.
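
To find chains worth fixing, you can walk a redirect map exported from your server configuration or from a site crawl. The sketch below assumes that map is a plain dictionary. The paths in it and the function name `resolve_chain` are hypothetical.

```python
def resolve_chain(url, redirects, max_hops=10):
    """Follow a redirect map and return every URL visited, in order."""
    chain = [url]
    while url in redirects and len(chain) <= max_hops:
        url = redirects[url]
        chain.append(url)
        if chain.count(url) > 1:   # redirect loop: stop after recording it
            break
    return chain


# Hypothetical redirect map: old URL -> where it redirects to.
REDIRECTS = {
    "/old-guide/": "/guides/crawlers/",
    "/guides/crawlers/": "/guides/web-crawlers/",
}

chain = resolve_chain("/old-guide/", REDIRECTS)
hops = len(chain) - 1
final_url = chain[-1]
print(chain, hops)
```

Any internal link whose chain has more than one hop can be updated to point at `final_url` directly, which saves the crawler the intermediate fetches.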

7. Letting 404s and soft 404s accumulate

A few 404s are normal. Large numbers of broken internal links waste crawl attention.

Fix: Repair internal links and remove dead URLs from sitemaps.


How to read Crawl Stats in Google Search Console

For large sites, Google Search Console’s Crawl Stats report can help you understand how Googlebot interacts with your site.

Look for:

  • Total crawl requests
  • Average response time
  • Host status
  • Crawl responses by status code
  • File type crawled
  • Crawl purpose: discovery vs refresh
  • Sudden drops in crawl activity
  • Spikes in 5xx errors
  • Spikes in 404s
  • Increased crawling after migrations or launches

A sudden crawl drop can indicate server issues, robots.txt problems, DNS failures, or reduced crawl demand.

A crawl spike after launching many URLs can be normal, but it can also expose crawl traps if Googlebot starts fetching endless parameter URLs.


Key takeaways

 

  • A web crawler discovers pages by fetching URLs, parsing HTML, extracting links, and adding new URLs to a frontier.
  • Seed URLs are the crawler’s starting points.
  • The URL frontier decides what gets crawled next.
  • Front queues manage priority.
  • Back queues enforce politeness.
  • URL ordering affects which pages are discovered first.
  • Google Search does not support Crawl-delay.
  • Googlebot crawl rate is influenced by server capacity, server responses, and crawl demand.
  • Crawl budget combines what Googlebot can crawl and what it wants to crawl.
  • Internal linking is crawl architecture.
  • Orphan pages, crawl traps, server errors, and duplicate URLs can waste crawl resources.

Sources

  1. Google. How Search Works. Google for Developers.

  2. Stanford University. Web Crawling. CS276: Information Retrieval and Web Search.

  3. Cho, J., Garcia-Molina, H., and Page, L. Efficient Crawling through URL Ordering. Computer Networks and ISDN Systems 30, no. 1-7 (1998): 161-172.

  4. Najork, M., and Wiener, J. L. Breadth-First Search Crawling Yields High-Quality Pages. In Proceedings of the 10th International Conference on World Wide Web (WWW10), 114-118, 2001.

  5. Google. Introduction to Robots.txt. Google Search Central Documentation.

  6. Google. Robots Refresher: Future-proof Robots Exclusion Protocol. Google Search Central Blog.

  7. Google. Search Console Crawl Rate Limiter Tool Is Going Away. Google Search Central Blog.

  8. Google. What Crawl Budget Means for Googlebot. Google Search Central Blog.

  9. Google. Troubleshoot Crawling Errors. Google Search Central Documentation.

Frequently Asked Questions (FAQs)

What is a web crawler?

A web crawler is software that discovers web pages by fetching URLs, reading their content, extracting links, and adding new URLs to a crawl queue. Search engines use crawlers to discover pages for indexing.

What is a seed URL?

A seed URL is a starting URL a crawler already knows before crawling begins. Crawlers use seed URLs to begin discovery and then follow links outward to find more pages.

What is a URL frontier?

A URL frontier is the system that stores and schedules URLs waiting to be crawled. It decides which URL should be fetched next and how quickly each host should be contacted.

Why does crawl order matter?

Crawl order matters because crawlers cannot fetch every URL immediately. The order determines which pages are discovered, crawled, and refreshed first.

What is crawl politeness?

Crawl politeness is the practice of spacing crawler requests so one server is not overloaded. It includes host-level delays, concurrency limits, robots.txt checks, and adaptive backoff after errors.

Does Googlebot follow Crawl-delay?

No. Google Search does not support the Crawl-delay robots.txt rule. Googlebot uses its own crawl-rate systems instead.

What is the difference between crawl rate and crawl budget?

Crawl rate is how fast Googlebot can make requests to your site. Crawl budget is the number of URLs Googlebot can and wants to crawl. Google defines crawl budget as combining crawl rate and crawl demand.

How do orphan pages affect crawling?

Orphan pages have no internal links pointing to them. Crawlers may not discover them through normal link-following, which can delay or prevent crawling.

How do crawl traps affect SEO?

Crawl traps generate too many low-value URLs, such as endless filters, calendar pages, or session URLs. They waste crawl resources and can delay crawling of important pages.

How can I improve crawlability?

Improve crawlability by adding internal links, cleaning XML sitemaps, fixing server errors, avoiding crawl traps, using correct canonicals, reducing duplicate URLs, and making important pages easy to reach.
