A seed URL is a starting URL a crawler already knows before crawling begins. Crawlers use seed URLs to begin discovery and then follow links outward to find more pages.

Why does crawl order matter?

Crawl order matters because crawlers cannot fetch every URL immediately. The order determines which pages are discovered, crawled, and refreshed first.

Does Googlebot follow Crawl-delay?

No. Google Search does not support the Crawl-delay robots.txt rule. Googlebot uses its own crawl-rate systems instead.

What is the difference between crawl rate and crawl budget?

Crawl rate is how fast Googlebot can make requests to your site. Crawl budget is the number of URLs Googlebot can and wants to crawl. Google defines crawl budget as combining crawl rate and crawl demand.

How do orphan pages affect crawling?

Orphan pages have no internal links pointing to them. Crawlers may not discover them through normal link-following, which can delay or prevent crawling.

How do crawl traps affect SEO?

Crawl traps generate too many low-value URLs, such as endless filters, calendar pages, or session URLs. They waste crawl resources and can delay crawling of important pages.

How can I improve crawlability?

Improve crawlability by adding internal links, cleaning XML sitemaps, fixing server errors, avoiding crawl traps, using correct canonicals, reducing duplicate URLs, and making important pages easy to reach.

How Web Crawlers Work: Seeds, URL Frontiers & Crawl Rate

Q: What is a web crawler?

A web crawler is software that discovers web pages by fetching URLs, reading their content, extracting links, and adding new URLs to a crawl queue. Search engines use crawlers to discover pages for indexing.

Q: What is a URL frontier?

A URL frontier is the system that stores and schedules URLs waiting to be crawled. It decides which URL should be fetched next and how quickly each host should be contacted.

Q: What is crawl politeness?

Crawl politeness is the practice of spacing crawler requests so one server is not overloaded. It includes host-level delays, concurrency limits, robots.txt checks, and adaptive backoff after errors.

A web crawler is a program that discovers pages on the web by fetching URLs, reading their HTML, extracting links, and adding those new links to a queue of pages to visit next. That sounds simple, but at search-engine scale, crawling becomes a prioritization problem.

A crawler cannot fetch every URL on the internet immediately. It has to decide which pages to crawl first, which pages to delay, which servers to slow down for, and which duplicate or low-value URLs to ignore. That decision-making system is what separates a basic crawler from a production search crawler.

Quick answer: A web crawler starts with seed URLs, downloads each page, extracts links, removes duplicates, adds new URLs to a URL frontier, and schedules the next fetch. The URL frontier decides crawl order, while politeness rules prevent the crawler from overloading any one server. Googlebot also uses an algorithmic process to decide which sites to crawl, how often, and how many pages to fetch [1].

This guide explains the full crawler loop, seed sets, URL frontiers, crawl ordering, front queues, back queues, politeness, crawl-delay, Googlebot crawl rate, crawl budget, and the SEO mistakes that stop important pages from being discovered.

How a web crawler works in 7 steps

At a high level, crawling follows this loop:

Step	What happens	SEO impact
1. Start with seed URLs	The crawler begins from known URLs	Pages not linked or submitted may never be discovered
2. Fetch the page	The crawler sends an HTTP request	Server speed and errors affect crawl efficiency
3. Parse the HTML	The crawler reads links, text, canonicals, meta tags, and resources	Broken HTML or JS-only links can delay discovery
4. Extract links	Internal and external links are collected	Good internal linking helps important pages enter the crawl path
5. Normalize URLs	URLs are cleaned and standardized	Parameter chaos can create duplicate crawl paths
6. Deduplicate URLs	Already-seen URLs are filtered out	Prevents wasting crawl resources
7. Add URLs to the frontier	New URLs enter the crawl queue	Priority and politeness decide when they are fetched

The loop then repeats. Every newly crawled page can reveal more URLs, and every new URL can reveal even more pages.

The hard part is not fetching one page. The hard part is deciding what to fetch next.
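
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: `FAKE_WEB` is a hypothetical in-memory stand-in for real HTTP fetching, and the frontier is a plain first-in, first-out queue with no priority or politeness logic.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "https://example.com/": '<a href="/blog/">Blog</a> <a href="/products/">Products</a>',
    "https://example.com/blog/": '<a href="/">Home</a> <a href="/blog/post-1/#top">Post</a>',
    "https://example.com/products/": '<a href="/blog/">Blog</a>',
    "https://example.com/blog/post-1/": "<p>No links here.</p>",
}


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags (steps 3 and 4)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    frontier = deque(seeds)           # step 1: start with seed URLs
    seen = set(seeds)                 # step 6: remember already-seen URLs
    crawl_order = []
    while frontier and len(crawl_order) < max_pages:
        url = frontier.popleft()
        html = fetch(url)             # step 2: fetch the page
        if html is None:
            continue                  # unknown URL in the fake web
        crawl_order.append(url)
        parser = LinkExtractor()
        parser.feed(html)             # steps 3-4: parse and extract links
        for href in parser.links:
            # step 5: resolve relative links and drop #fragments
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)   # step 7: add to the frontier
    return crawl_order


order = crawl(["https://example.com/"], FAKE_WEB.get)
print(order)
```

Starting from the single seed, the sketch reaches the blog post only because the blog page links to it. A page that nothing links to would never enter `frontier`.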

What is a web crawler?

A web crawler, also called a spider, bot, or robot, is automated software that browses the web by following links from page to page. Search engines use crawlers to discover pages, understand site structure, and collect content for indexing.

Googlebot is Google’s main web crawler. Once Google discovers a URL, Googlebot may crawl the page to understand what is on it. Google says its crawling is algorithmic: its systems decide which sites to crawl, how often to crawl them, and how many pages to fetch from each site [1].

A basic crawler does four things:

Fetches a URL
Reads the page
Extracts links
Adds new links to a queue

A production crawler adds more layers:

URL normalization
Duplicate detection
Priority scoring
Host-level politeness
Robots.txt checks
Rendering for JavaScript-heavy pages
Recrawl scheduling
Error handling
Crawl trap detection

That is where crawling becomes a search-engine architecture problem, not just a scripting problem.

What are seed URLs?

A seed URL is a starting URL a crawler knows before it begins crawling.

Examples of seed URLs include:

A homepage submitted in Google Search Console
A URL listed in an XML sitemap
A high-authority page already known to the crawler
A news publisher homepage in a news crawler
A product category page in an ecommerce crawler
A documentation homepage in a focused technical crawler

Seed URLs matter because crawlers discover new pages by following links outward from pages they already know.

If a page has no internal links pointing to it and is not listed in a sitemap, the crawler has no normal discovery path to it. That kind of page is usually called an orphan page.

For SEO, this means discovery is not automatic. A page needs a path into the crawler’s system.

Seed set example

A simple crawler might start with this seed set:

seeds = [
    "https://example.com/",
    "https://example.com/blog/",
    "https://example.com/products/"
]

From those three URLs, the crawler fetches the pages, extracts links, and discovers more URLs.

A larger crawler may begin with millions of known URLs. But the principle is the same: the seed set defines the crawler’s starting map.

Bad seed selection creates bad coverage. If a crawler starts from a narrow group of websites, it may discover pages that are close to that group faster than pages outside it. In SEO, the same principle applies at site level: if your homepage and category pages do not link to your important pages, those pages are farther away from the crawler’s starting points.

What is a URL frontier?

The URL frontier is the system that stores URLs waiting to be crawled.

Most beginners think of it as a simple queue:

URL 1 → URL 2 → URL 3 → URL 4

But at scale, a simple queue is not enough.

A production URL frontier must answer two questions at the same time:

Priority: Which URL should be crawled first?
Politeness: How fast can we crawl this host without overloading it?

That means the URL frontier is not just a list. It is a scheduling system.

The classic Mercator-style frontier separates these concerns into front queues and back queues. Stanford’s information retrieval material describes the URL frontier as a system where front queues manage prioritization and back queues enforce politeness [2].

URL frontier diagram


Seed URLs
   ↓
Fetcher
   ↓
HTML Parser
   ↓
Link Extractor
   ↓
URL Normalizer
   ↓
Duplicate URL Filter
   ↓
URL Frontier
   ↓
┌──────────────────────┐
│ Front Queues          │  → decide what is important
└──────────────────────┘
   ↓
┌──────────────────────┐
│ Back Queues           │  → decide when each host can be crawled
└──────────────────────┘
   ↓
Fetcher


Front queues vs back queues

A strong crawler separates priority from politeness.

Queue type	Main job	What it controls
Front queues	Prioritization	Which URLs deserve attention first
Back queues	Politeness	How quickly the crawler contacts each host

Front queues decide importance

Front queues organize URLs by priority.

A URL might receive higher priority if:

It has many internal links
It is close to the homepage
It belongs to an important section
It changes frequently
It has strong external links
It is likely to be fresh or valuable
It belongs to a trusted host

For a search engine, priority is about maximizing value per fetch. For a site owner, priority is about making sure the pages that matter most are easy for crawlers to find.

Back queues protect servers

Back queues group URLs by host or domain. This prevents a crawler from sending too many requests to the same server in a short period.

For example, if a crawler discovers 500 URLs from example.com, it should not fetch all 500 immediately. It should space out the requests so the server remains stable.

That is crawl politeness.
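
A rough sketch of how front queues and back queues can work together is shown below. It is a simplification for illustration, not how Mercator or Googlebot is implemented. The class name `Frontier`, the three priority levels, and the two-second `host_delay` are all assumptions, and time is passed in as `now` so the example is deterministic (a real crawler would likely read a monotonic clock).

```python
import heapq
import itertools
from collections import deque
from urllib.parse import urlsplit


class Frontier:
    """Toy frontier: front queues rank URLs, back queues pace each host."""

    def __init__(self, num_priorities=3, host_delay=2.0):
        self.front = [deque() for _ in range(num_priorities)]  # 0 = highest priority
        self.back = {}            # host -> deque of URLs waiting for that host
        self.ready = []           # heap of (next_allowed_time, sequence, host)
        self.next_allowed = {}    # host -> earliest time of the next request
        self.host_delay = host_delay
        self._sequence = itertools.count()  # tie-breaker for equal times

    def add(self, url, priority):
        self.front[priority].append(url)

    def _refill(self, now):
        # Drain front queues into per-host back queues, highest priority first.
        for queue in self.front:
            while queue:
                url = queue.popleft()
                host = urlsplit(url).netloc
                if host not in self.back:
                    self.back[host] = deque()
                    start = max(now, self.next_allowed.get(host, now))
                    heapq.heappush(self.ready, (start, next(self._sequence), host))
                self.back[host].append(url)

    def next_url(self, now):
        self._refill(now)
        if not self.ready or self.ready[0][0] > now:
            return None           # nothing queued, or every host is cooling down
        _, _, host = heapq.heappop(self.ready)
        queue = self.back[host]
        url = queue.popleft()
        self.next_allowed[host] = now + self.host_delay
        if queue:
            heapq.heappush(self.ready, (self.next_allowed[host], next(self._sequence), host))
        else:
            del self.back[host]
        return url


frontier = Frontier(host_delay=2.0)
frontier.add("https://a.example/low", 2)
frontier.add("https://a.example/high", 0)
frontier.add("https://b.example/mid", 1)

first = frontier.next_url(now=0.0)    # highest-priority URL goes first
second = frontier.next_url(now=0.0)   # a.example is cooling down, so b.example is next
third = frontier.next_url(now=1.0)    # still inside a.example's delay window
fourth = frontier.next_url(now=2.0)   # delay elapsed, a.example can be fetched again
print(first, second, third, fourth)
```

The priority decides which URL a host serves first, while the per-host timestamp decides when that host may be contacted at all. Keeping those two decisions in separate structures is the point of the front/back split.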

Why URL ordering matters

URL ordering is the decision of which discovered URL gets crawled next.

This matters because crawlers operate under limits. They have limited time, bandwidth, compute, and server tolerance. If a crawler wastes too many requests on low-value URLs, it may delay or miss important pages.

The classic paper “Efficient Crawling Through URL Ordering” by Junghoo Cho, Hector Garcia-Molina, and Lawrence Page studied how a crawler should order URLs to reach important pages faster. The paper’s core idea is that URL order affects how quickly a crawler finds valuable pages when it cannot crawl the entire web immediately [3].

This is why internal linking matters for SEO.

A page linked from the homepage is easier to discover than a page buried six clicks deep. A page linked from multiple category pages is easier to rediscover than a page linked from one old archive page. A page with no links pointing to it may not enter the crawler’s discovery path at all.

Crawl ordering strategies

Different crawlers use different ordering strategies.

Strategy	How it works	Best use case	Weakness
Breadth-first crawling	Crawls pages in discovery order, layer by layer	Broad discovery	May not go deep quickly
Depth-first crawling	Follows one path deeply before returning	Small controlled crawls	Can get stuck deep in one site
PageRank-priority crawling	Prioritizes URLs estimated to be important	Search engines	Needs link/authority estimates
Freshness-based crawling	Prioritizes pages that change often	News, ecommerce, feeds	May ignore stable evergreen pages
Focused crawling	Prioritizes topic-relevant pages	Vertical search engines	Needs topic classification
Sitemap-assisted crawling	Uses submitted sitemaps as discovery hints	Site-level SEO	Sitemap inclusion does not guarantee crawling

Breadth-first crawling is important because it tends to discover well-linked pages early. Research on large crawls has found breadth-first crawling can yield high-quality pages early when quality is measured with link-based metrics such as PageRank [4].

But no single strategy is perfect. Modern crawlers usually combine multiple signals.
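
To make the difference between orderings concrete, the sketch below crawls the same tiny link graph twice: once breadth-first, and once by always picking the discovered page with the most in-links seen so far. The graph `LINKS` is invented for illustration, and in-link count is only a rough stand-in for the importance estimates a real crawler might use.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "home": ["about", "blog", "shop"],
    "about": ["shop"],
    "blog": ["post", "shop"],
    "shop": ["item"],
    "post": ["item"],
    "item": [],
}


def breadth_first_order(seed):
    """Crawl in discovery order, layer by layer."""
    order, seen, queue = [], {seed}, deque([seed])
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in LINKS[page]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return order


def inlink_priority_order(seed):
    """Always crawl the discovered page with the most in-links seen so far."""
    inlinks = {seed: 0}               # discovered page -> in-links counted so far
    order = []
    while len(order) < len(inlinks):
        candidates = [page for page in inlinks if page not in order]
        page = max(candidates, key=inlinks.get)   # ties go to the earliest discovered
        order.append(page)
        for target in LINKS[page]:
            inlinks[target] = inlinks.get(target, 0) + 1
    return order


bfs = breadth_first_order("home")
prioritized = inlink_priority_order("home")
print(bfs)
print(prioritized)
```

In this graph the `shop` page, which several pages link to, is fetched one position earlier under the in-link ordering than under breadth-first. On a graph this small the gap is trivial, but it shows how the ordering rule changes which pages are reached first.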

What is crawl politeness?

Crawl politeness is the set of rules a crawler follows to avoid overwhelming a website.

A crawler that sends too many requests too quickly can:

Slow down the website
Trigger rate limits
Cause 429 responses
Cause 5xx errors
Get blocked by firewalls
Waste its own crawl resources

Politeness is good for both sides. It protects the website, and it helps the crawler keep access to the content.

Common politeness mechanisms include:

Waiting between requests to the same host
Limiting concurrent requests per domain
Respecting robots.txt access rules
Backing off after server errors
Reducing crawl activity when response times get worse

For SEO, politeness matters because server reliability affects how efficiently crawlers can access your pages.
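
As an illustration of backing off after errors, here is a small sketch of a per-host throttle. The class name `HostThrottle`, the doubling and halving factors, and the one-second and sixty-second bounds are assumptions chosen for readability, not values any particular crawler is known to use.

```python
class HostThrottle:
    """Per-host delay that grows after overload signals and relaxes after successes."""

    def __init__(self, base_delay=1.0, max_delay=60.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.delay = base_delay

    def record(self, status_code):
        """Update the delay from one response and return the new delay in seconds."""
        if status_code == 429 or 500 <= status_code < 600:
            # Overload or server error: wait twice as long, up to the cap.
            self.delay = min(self.delay * 2, self.max_delay)
        elif 200 <= status_code < 300:
            # Success: relax gradually, never below the base delay.
            self.delay = max(self.delay / 2, self.base_delay)
        return self.delay


throttle = HostThrottle()
delays = [throttle.record(code) for code in (200, 503, 429, 200, 200)]
print(delays)
```

The delay rises quickly when the server struggles and falls back only after repeated successes, which matches the general idea of reducing crawl activity when a host shows signs of strain.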

Does Google respect Crawl-delay in robots.txt?

No. Google Search does not support the Crawl-delay rule in robots.txt.

Google’s robots.txt documentation explains that robots.txt controls which URLs crawlers can access. It is not a reliable mechanism for keeping pages out of Google’s index, and unsupported robots.txt rules should not be treated as Google directives [5].

Google has also stated that rules like crawl-delay are not part of RFC 9309 and are not supported by Google Search, though some other search engines may support them [6].

Example:

User-agent: *
Crawl-delay: 10

This may affect some crawlers, but it does not directly control Googlebot.
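
For a custom crawler that does choose to honor the rule, Python's standard library can read it. The sketch below parses the example above with `urllib.robotparser`. The user agent name `MyCrawler` is a placeholder, and nothing here reflects how Googlebot itself behaves.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "MyCrawler" is a hypothetical user agent; it falls back to the "*" group.
delay = parser.crawl_delay("MyCrawler")
allowed = parser.can_fetch("MyCrawler", "https://example.com/page")
print(delay, allowed)
```

A crawler built this way would wait `delay` seconds between requests to the host. Whether to honor the value is the crawler author's decision, which is why the rule affects some bots and not others.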

For Googlebot, you should focus on:

Fast server response
Stable hosting
Proper HTTP status codes
Clean internal linking
Crawlable navigation
Sitemaps
Avoiding crawl traps
Avoiding mass low-value URL generation

How Googlebot adapts crawl rate

Googlebot’s crawl rate is not fixed. Google’s systems decide how much crawling a site can handle and how much crawling Google wants to do.

Google announced that the Search Console crawl rate limiter tool was deprecated on January 8, 2024, because improvements in Google’s crawling logic made the tool less useful [7].

That update matters because old SEO advice often says, “Use the crawl rate setting in Search Console.” That advice is outdated.

Today, the practical levers are server health, crawl demand, and site quality.

Google has explained crawl budget as the number of URLs Googlebot can and wants to crawl. In other words, crawl budget combines crawl capacity and crawl demand [8].

Crawl rate vs crawl budget

Crawl rate and crawl budget are related, but they are not the same thing.

Concept	Meaning	Main lever
Crawl rate	How fast Googlebot makes requests	Server capacity and response stability
Crawl demand	How much Google wants to crawl	Popularity, freshness, importance
Crawl budget	How many URLs Google can and wants to crawl	Crawl rate + crawl demand

A fast server can support more crawling, but speed alone does not make every page worth crawling.

If a site has thousands of thin, duplicate, parameterized, or low-value URLs, higher crawl capacity can still be wasted.

Better crawl budget optimization means:

Improve internal linking
Remove crawl traps
Consolidate duplicates
Use canonical tags correctly
Keep XML sitemaps clean
Return correct status codes
Make important pages closer to the homepage
Improve content quality and freshness

How server responses affect crawling

Googlebot pays attention to how a server responds.

Server signal	Likely crawler response
Fast 200 responses	Crawling can continue normally
Slow responses	Crawling may become more conservative
Repeated 500/503 errors	Crawling may slow down
429 Too Many Requests	Treated as an overload signal; crawling may slow down
DNS failures	Crawling can be disrupted
Robots.txt unavailable	Crawling may be delayed or restricted depending on the situation

Google’s crawling error guidance says availability issues can prevent Google from crawling as much as it might want to crawl.

That does not mean every slow page instantly loses rankings. It means poor availability can reduce crawl efficiency, especially on large sites where Googlebot must choose where to spend crawl resources [8][9].

Why internal linking is crawl architecture

Internal linking is not just UX. It is crawler routing.

A crawler follows links. So your internal links tell crawlers which pages matter, how sections connect, and how far each page is from the site’s strongest entry points.

Compare these two structures:

Weak structure

Homepage
   ↓
Blog
   ↓
Archive
   ↓
Page 7
   ↓
Important Article

Strong structure

Homepage
   ↓
Main Topic Hub
   ↓
Important Article

The second structure is better because the important page is closer to the seed path.

For SEO, important pages should usually be:

Linked from the homepage when appropriate
Linked from relevant category or hub pages
Included in XML sitemaps
Linked from related articles
Not hidden behind forms, filters, or JavaScript-only navigation
Not buried deep in pagination

Internal links help crawlers discover, prioritize, and revisit important pages.
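
The distance between a page and the homepage can be measured as click depth. The sketch below computes it with a breadth-first search over the two structures shown earlier. The dictionaries simply restate those diagrams, and the function name `click_depth` is an illustrative choice.

```python
from collections import deque

# The two structures from the comparison, written as page -> linked pages.
WEAK = {
    "Homepage": ["Blog"],
    "Blog": ["Archive"],
    "Archive": ["Page 7"],
    "Page 7": ["Important Article"],
    "Important Article": [],
}
STRONG = {
    "Homepage": ["Main Topic Hub"],
    "Main Topic Hub": ["Important Article"],
    "Important Article": [],
}


def click_depth(links, start="Homepage"):
    """Return the minimum number of clicks from start to every reachable page."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depth:          # first visit is the shortest path
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


weak_depth = click_depth(WEAK)["Important Article"]
strong_depth = click_depth(STRONG)["Important Article"]
print(weak_depth, strong_depth)
```

The article sits four clicks from the homepage in the weak structure and two clicks away in the strong one. Pages missing from the returned dictionary are unreachable by links from the start page.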

Orphan pages: why crawlers miss them

An orphan page is a page with no internal links pointing to it.

A page can exist, load correctly, and still be hard for crawlers to discover if nothing links to it.

Common orphan page causes:

Published pages not added to navigation
Old landing pages removed from menus
Product pages not linked from categories
Blog posts excluded from archives
Pages only accessible through site search
JavaScript routes without crawlable anchor links
Migration mistakes
Deleted category links

How to fix orphan pages:

Crawl your own site with a crawler such as Screaming Frog, Sitebulb, or a custom crawler
Export all indexable URLs from your CMS
Export URLs from XML sitemaps
Compare these lists
Find URLs that exist but have no internal inlinks
Add contextual internal links from relevant pages
Keep the XML sitemap clean and updated

A sitemap can help discovery, but internal links still matter because they provide context and importance signals.
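
The list comparison in the steps above comes down to set arithmetic. In this sketch the three URL sets are made-up examples. In practice they would come from a CMS export, the XML sitemap, and the internal link targets found by a site crawl.

```python
# Hypothetical exports; real data would come from your CMS, sitemap, and crawler.
cms_urls = {
    "https://example.com/",
    "https://example.com/guides/web-crawlers/",
    "https://example.com/guides/crawl-budget/",
    "https://example.com/landing/old-campaign/",
}
sitemap_urls = {
    "https://example.com/",
    "https://example.com/guides/web-crawlers/",
    "https://example.com/guides/crawl-budget/",
}
# URLs that at least one internal link points to, according to a site crawl.
linked_urls = {
    "https://example.com/",
    "https://example.com/guides/web-crawlers/",
}

known_urls = cms_urls | sitemap_urls
orphans = sorted(known_urls - linked_urls)            # exist, but nothing links to them
missing_from_sitemap = sorted(cms_urls - sitemap_urls)

print(orphans)
print(missing_from_sitemap)
```

Here two URLs turn out to be orphans, and one of them is also missing from the sitemap, so it has no discovery path at all.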

Crawl traps: how sites waste the frontier

A crawl trap is a URL pattern that creates too many low-value or duplicate URLs.

Common crawl traps include:

Crawl trap type	Example
Faceted navigation	`?color=red&size=large&sort=price`
Infinite calendars	`/events/2099/12/31/`
Session IDs	`?sid=abc123`
Internal search results	`/search?q=blue+shoes`
Sort parameters	`?sort=price_asc`
Filter combinations	`?brand=x&size=y&material=z`
Duplicate trailing slash patterns	`/page` and `/page/`
Mixed casing	`/Product` and `/product`

Crawl traps hurt because they fill the URL frontier with pages that do not deserve crawl attention.

For ecommerce and large publishing sites, crawl traps can delay crawling of real content.

How to reduce crawl traps:

Use canonical tags for duplicates
Block low-value crawl paths carefully in robots.txt
Use noindex where indexing is the problem
Avoid linking to infinite URL combinations
Keep faceted navigation crawl rules clear
Normalize URL parameters
Make important category paths static and clean
Monitor crawl stats after major site changes
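
Parameter normalization can be sketched with Python's `urllib.parse`. The rules below are assumptions that each site would need to decide for itself: which parameters to ignore (`IGNORED_PARAMS`), and whether `/page` and `/page/` really serve the same content. Paths are deliberately not lowercased, because path casing can be significant on some servers.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed site-specific list of parameters that do not change page content.
IGNORED_PARAMS = {"sid", "sort"}


def normalize(url):
    """Map URL variants that serve the same content to one canonical form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.netloc.lower()
    path = parts.path or "/"
    # Assumed policy: treat /page/ and /page as the same resource.
    if path != "/" and path.endswith("/"):
        path = path.rstrip("/")
    # Drop ignored parameters and sort the rest so order does not matter.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key not in IGNORED_PARAMS
    )
    return urlunsplit((scheme, host, path, urlencode(query), ""))  # "" drops the fragment


variants = [
    "https://example.com/shoes/?color=red&size=large&sort=price_asc",
    "https://EXAMPLE.com/shoes?size=large&color=red&sid=abc123",
    "https://example.com/shoes?color=red&size=large#reviews",
]
unique = {normalize(url) for url in variants}
print(unique)
```

All three variants collapse to a single URL, so a frontier that deduplicates on the normalized form would fetch the page once instead of three times.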

Googlebot vs generic crawlers

Not every crawler behaves like Googlebot.

Feature	Generic crawler	Googlebot
Starts from seed URLs	Yes	Yes
Uses a URL frontier	Usually	Yes, at massive scale
Respects robots.txt access rules	Usually	Yes
Supports Crawl-delay	Some do	No, not for Google Search
Adapts crawl rate	Advanced crawlers do	Yes
Renders JavaScript	Sometimes	Google can render pages
Uses sitemaps	Sometimes	Yes, as discovery signals
Uses Search Console data	No	Google systems can use submitted data
Crawls for public search index	Usually no	Yes

This is why SEO advice should not treat all crawlers the same.

A custom crawler, Bingbot, Googlebot, and a scraping bot may all fetch URLs, but their rules, goals, and scheduling systems can differ.

Worked example: one article entering the crawl system

Imagine you publish this page:

https://example.com/guides/web-crawlers/

Here is what can happen.

Step 1: Discovery

Google may discover the URL from:

Your XML sitemap
A link from your homepage
A link from a related guide
A backlink from another website
A URL submitted through Search Console

If nothing links to the page and it is missing from the sitemap, Google has no reliable path to discover it.

Step 2: Frontier entry

Once discovered, the URL can enter the crawl system.

It is now waiting to be fetched. But it is not automatically crawled instantly.

The frontier considers signals such as site importance, URL importance, freshness, and crawl capacity.

Step 3: Fetch

Googlebot requests the page.

If the server returns a fast 200 response, the crawl can continue normally.

If the server returns 500, 503, or repeated timeouts, crawling may slow down.

Step 4: Parse

The crawler reads the HTML.

It checks links, canonical tags, robots directives, structured data, headings, and visible content.
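
A simplified version of this parsing step is sketched below with Python's built-in `html.parser`. It pulls out only three of the signals mentioned (anchor links, the canonical URL, and the meta robots directive). The class name `PageSignals` and the sample HTML are invented for the example, and real crawlers parse far more than this.

```python
from html.parser import HTMLParser


class PageSignals(HTMLParser):
    """Collects anchor links, the canonical URL, and the meta robots value."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.canonical = None
        self.robots = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")
        elif tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.robots = attrs.get("content")


sample_html = """
<html><head>
<link rel="canonical" href="https://example.com/guides/web-crawlers/">
<meta name="robots" content="index, follow">
</head><body>
<a href="/guides/crawl-budget/">Crawl budget</a>
<a href="/guides/robots-txt/">Robots.txt</a>
<button onclick="go('/guides/indexing/')">Indexing</button>
</body></html>
"""

signals = PageSignals()
signals.feed(sample_html)
print(signals.links, signals.canonical, signals.robots)
```

The two `<a href>` links are found, but the URL inside the button's `onclick` handler is not, because this parser only looks at anchor tags. That is the practical reason to use plain anchor links for important paths.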

Step 5: Extract new links

The crawler finds links to related pages.

For example:

/guides/crawl-budget/
/guides/robots-txt/
/guides/indexing/
/guides/internal-linking/

Those URLs can then enter the frontier.

Step 6: Scheduling

The crawler decides when to crawl those linked pages.

Pages linked from strong, relevant sections are more likely to be discovered and revisited efficiently.

Step 7: Recrawl

If the page changes often, earns links, or becomes important, it may be crawled again more frequently.

If it remains isolated, slow, duplicated, or low-value, recrawling may be less frequent.

Practical SEO checklist for crawlability

Use this checklist before publishing important pages.

Check	Why it matters
Page is linked internally	Helps crawler discovery
Page is in XML sitemap	Adds a discovery path
Page returns 200 status	Confirms accessible content
Page is not blocked in robots.txt	Allows crawling
Page is not accidentally noindexed	Allows indexing
Canonical points to itself or correct preferred URL	Avoids duplicate confusion
Important links use `<a href>`	Makes links easier to discover
Page loads quickly	Supports crawl efficiency
Page is not buried too deep	Improves discovery priority
Related pages link back to it	Builds topical structure
Navigation works without requiring user actions	Helps crawler access
No infinite parameter links	Prevents crawl traps

This checklist is especially important for large websites, ecommerce sites, marketplaces, publishers, and documentation sites.

Common crawler mistakes that hurt SEO

1. Publishing pages with no internal links

A page that can be reached only by entering its URL directly gives crawlers no link path to discover it.

Fix: Add links from relevant hubs, categories, and related content.

2. Creating too many filter URLs

Faceted navigation can create thousands or millions of URL combinations.

Fix: Decide which filters deserve indexable static pages and control the rest.

3. Blocking the wrong URLs in robots.txt

Robots.txt controls crawling, not indexing by itself. Blocking a URL can prevent Google from seeing important page-level signals.

Fix: Use robots.txt for crawl control, and use noindex when the goal is to keep a page out of the index.

4. Relying on Crawl-delay for Googlebot

Google Search does not support Crawl-delay.

Fix: Improve server reliability and manage crawl paths instead.

5. Hiding links behind JavaScript actions

If important links only appear after clicks, filters, or client-side rendering, discovery can be delayed or weakened.

Fix: Use crawlable anchor links for important paths.

6. Leaving redirect chains

Long redirect chains waste fetches and slow crawling.

Fix: Link directly to final canonical URLs.
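
One way to spot chains is to follow each redirect and count the hops. The sketch below works on a hypothetical redirect map rather than live HTTP responses. The function name `resolve` and the ten-hop limit are illustrative assumptions.

```python
def resolve(url, redirects, max_hops=10):
    """Follow a redirect map and return (final_url, hop_count)."""
    hops = 0
    seen = {url}
    while url in redirects:
        url = redirects[url]
        hops += 1
        if url in seen or hops >= max_hops:
            break                 # redirect loop, or chain too long to keep following
        seen.add(url)
    return url, hops


# Hypothetical redirect map: source URL -> redirect target.
redirects = {
    "http://example.com/old-guide": "https://example.com/old-guide",
    "https://example.com/old-guide": "https://example.com/guides/crawlers",
    "https://example.com/guides/crawlers": "https://example.com/guides/web-crawlers/",
}

final_url, hop_count = resolve("http://example.com/old-guide", redirects)
print(final_url, hop_count)
```

Any internal link whose hop count is above zero can be updated to point at `final_url`, which saves the crawler the intermediate fetches.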

7. Letting 404s and soft 404s accumulate

A few 404s are normal. Large numbers of broken internal links waste crawl attention.

Fix: Repair internal links and remove dead URLs from sitemaps.

How to read Crawl Stats in Google Search Console

For large sites, Google Search Console’s Crawl Stats report can help you understand how Googlebot interacts with your site.

Look for:

Total crawl requests
Average response time
Host status
Crawl responses by status code
File type crawled
Crawl purpose: discovery vs refresh
Sudden drops in crawl activity
Spikes in 5xx errors
Spikes in 404s
Increased crawling after migrations or launches

A sudden crawl drop can indicate server issues, robots.txt problems, DNS failures, or reduced crawl demand.

A crawl spike after launching many URLs can be normal, but it can also expose crawl traps if Googlebot starts fetching endless parameter URLs.

Key takeaways

A web crawler discovers pages by fetching URLs, parsing HTML, extracting links, and adding new URLs to a frontier.
Seed URLs are the crawler’s starting points.
The URL frontier decides what gets crawled next.
Front queues manage priority.
Back queues enforce politeness.
URL ordering affects which pages are discovered first.
Google Search does not support Crawl-delay.
Googlebot crawl rate is influenced by server capacity, server responses, and crawl demand.
Crawl budget combines what Googlebot can crawl and what it wants to crawl.
Internal linking is crawl architecture.
Orphan pages, crawl traps, server errors, and duplicate URLs can waste crawl resources.

Sources

[1] Google. How Search Works. Google for Developers.
[2] Stanford University. Web Crawling. CS276: Information Retrieval and Web Search.
[3] Cho, J., Garcia-Molina, H., and Page, L. Efficient Crawling through URL Ordering. Computer Networks and ISDN Systems 30, no. 1-7 (1998): 161-172.
[4] Najork, M., and Wiener, J. L. Breadth-First Search Crawling Yields High-Quality Pages. In Proceedings of the 10th International Conference on World Wide Web (WWW10), 114-118, 2001.
[5] Google. Introduction to Robots.txt. Google Search Central Documentation.
[6] Google. Robots Refresher: Future-proof Robots Exclusion Protocol. Google Search Central Blog.
[7] Google. Search Console Crawl Rate Limiter Tool Is Going Away. Google Search Central Blog.
[8] Google. What Crawl Budget Means for Googlebot. Google Search Central Blog.
[9] Google. Troubleshoot Crawling Errors. Google Search Central Documentation.