A web crawler's strategy determines which pages it discovers and in what order. The three foundational approaches (breadth-first, depth-first, and focused crawling) produce radically different outcomes from the exact same seed URL. This article explains how each strategy works mechanically, what trade-offs each makes in memory and quality, how distributed systems like Apache Nutch implement them at scale, and what all of this explains about why some pages appear in Google within hours while others wait months.
Why Crawl Strategy Determines Which Pages Get Discovered (and When)
The crawl strategy is simply the rule that determines which URL gets fetched next from the frontier queue. That single decision controls everything: which pages are found first, how deep the crawl goes, how much memory it consumes, and how well-suited the crawler is for broad versus narrow indexing goals.
Imagine a seed URL pointing to a news homepage with 50 links. A breadth-first crawler fetches all 50 linked pages before following any of their outbound links. A depth-first crawler follows the first link, then the first link on that page, continuing down one chain until it hits a dead end before backtracking. A focused crawler evaluates all 50 links for topic relevance and only fetches the ones that score above a threshold. These three decisions lead to completely different crawls within the first 100 fetches, even though each crawler starts from the same URL.
Understanding the mechanics behind each strategy is not purely academic. The difference between a page being indexed within hours of publication and being ignored for weeks is largely a function of how well your site's architecture aligns with the priority-weighted breadth-first approach that Googlebot uses.
What Is Breadth-First Crawling?
Breadth-first crawling (BFS) explores all pages at the current link depth before moving to the next level. Starting from a seed URL, the crawler fetches that page, adds every extracted link to the back of a queue, and then processes those links before following any links they contain. Because high-authority pages accumulate inbound links from many sites, they tend to sit at shallow depths and are usually discovered early.
The Queue Data Structure Behind BFS
BFS uses a FIFO (first-in, first-out) queue as its frontier. When the crawler fetches a page, it appends every discovered URL to the back of the queue. The next URL to fetch is always pulled from the front. This ensures every URL at depth 1 is fully processed before any depth-2 URL is touched, creating a wave pattern that expands outward from the seed in concentric layers.
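The FIFO behavior described above can be sketched in a few lines. This is a minimal illustration, not a production crawler: the `LINKS` dictionary is a made-up link graph standing in for real HTTP fetches, and `bfs_crawl` is a hypothetical name.

```python
from collections import deque

# Hypothetical link graph: page -> outbound links (stands in for real fetching).
LINKS = {
    "seed": ["a", "b"],
    "a": ["c", "d"],
    "b": ["d", "e"],
    "c": [],
    "d": ["seed"],
    "e": [],
}

def bfs_crawl(seed, links):
    """Return pages in the order a FIFO frontier would fetch them."""
    frontier = deque([seed])   # FIFO queue: append at the back, pop from the front
    seen = {seed}              # URLs already queued, so nothing is queued twice
    order = []
    while frontier:
        url = frontier.popleft()          # next URL always comes from the front
        order.append(url)
        for link in links.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)     # new URLs always go to the back
    return order

print(bfs_crawl("seed", LINKS))
```

Both depth-1 pages (`a`, `b`) are fetched before any depth-2 page (`c`, `d`, `e`), which is the layer-by-layer wave pattern in miniature.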
The cost of this breadth is memory. A BFS frontier for a large media site can hold hundreds of thousands of queued URLs simultaneously, because the crawler must hold the entire current depth layer before descending. Real-world implementations use disk-backed queues rather than in-memory structures because the frontier regularly exceeds available RAM.
Why Search Engines Choose BFS for General Indexing
Marc Najork and Janet Wiener established the empirical case for BFS in their 2001 WWW Conference paper "Breadth-First Crawling Yields High-Quality Pages." Their finding: pages discovered by BFS have significantly higher PageRank scores than randomly sampled pages, because important pages attract inbound links from many sites and therefore appear at shallow link depths across the web simultaneously. The crawler naturally follows the authority gradient without any explicit PageRank calculation.[1]
The practical implication is direct. Pages that many sites link to (your homepage, your pillar content, your highest-authority category pages) will be discovered by BFS crawlers early regardless of your site's total size. Pages buried four or five clicks deep from any well-linked entry point rely on the crawler committing significant budget to reach them.
What Is Depth-First Crawling?
Depth-first crawling (DFS) follows a single link path from the seed URL as deep as it goes before backtracking to explore alternative branches. If a page has three outbound links, DFS follows the first link, then the first link on that page, and continues recursively until hitting a dead end or a previously visited URL. Only then does it return to try the second branch.
The Memory Advantage of DFS
DFS keeps its frontier in a LIFO stack: the current path plus any sibling links still waiting to be tried, rather than every URL at the current depth level. For narrow, deep site structures, this is a meaningful practical advantage. A BFS crawler covering a site with a deep directory hierarchy must queue every page at level 5 before it can move to level 6. A DFS crawler's stack grows with the depth of the current path rather than the width of a level, which is a far smaller footprint on such sites. Both strategies still need a separate record of visited URLs to avoid refetching.
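For comparison with the FIFO queue, here is a minimal sketch of a stack-driven depth-first crawl. The `LINKS` graph and the name `dfs_crawl` are illustrative assumptions. Note that this iterative form also keeps not-yet-visited sibling links on the stack alongside the current path.

```python
# Hypothetical link graph: page -> outbound links (stands in for real fetching).
LINKS = {
    "seed": ["a", "b"],
    "a": ["c", "d"],
    "b": ["d", "e"],
    "c": [],
    "d": ["seed"],
    "e": [],
}

def dfs_crawl(seed, links):
    """Return pages in the order a LIFO (stack) frontier would fetch them."""
    stack = [seed]
    seen = set()
    order = []
    while stack:
        url = stack.pop()                 # LIFO: most recently discovered URL first
        if url in seen:
            continue
        seen.add(url)
        order.append(url)
        # Push in reverse so the first link on the page is followed first.
        for link in reversed(links.get(url, [])):
            if link not in seen:
                stack.append(link)
    return order

print(dfs_crawl("seed", LINKS))
```

The crawl runs down `seed -> a -> c` before it ever looks at `b`, even though `b` is only one click from the seed.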
Where DFS Falls Short for Web-Scale Crawling
The structural problem with DFS in general web crawling is the trap risk. Sites with paginated archives, tag pages that create overlapping link networks, or infinite URL parameter combinations can absorb an entire crawl budget before the crawler escapes. A DFS crawler following e-commerce product filters (?color=red, ?color=red&size=XL, and so on) can descend through thousands of near-duplicate pages without ever surfacing the category pages that contain the real content value.
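A little arithmetic shows how quickly filter parameters multiply. The sketch below enumerates every filtered variant of a single category page, assuming three made-up facets; the facet names, values, and the `shop.example` URL are hypothetical.

```python
from itertools import product

# Hypothetical facets for one category page.
FACETS = {
    "color": ["red", "blue", "green"],
    "size": ["S", "M", "L", "XL"],
    "sort": ["price", "newest"],
}

def filter_urls(base, facets):
    """Yield every URL formed by choosing at most one value per facet."""
    names = list(facets)
    # None means "this facet is not applied".
    for combo in product(*[[None] + facets[name] for name in names]):
        params = [f"{n}={v}" for n, v in zip(names, combo) if v is not None]
        if params:  # skip the unfiltered base page itself
            yield base + "?" + "&".join(params)

urls = list(filter_urls("https://shop.example/shirts", FACETS))
print(len(urls))  # (3+1) * (4+1) * (2+1) - 1 filtered variants
```

Three small facets already yield 59 near-duplicate URLs for one page; each additional facet multiplies the total, which is why a crawler that descends into them can exhaust its budget.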
DFS also encounters lower-quality pages early. Because it prioritizes depth over link popularity, it reaches low-authority pages within the first branch before ever encountering the well-linked pages that a search index actually needs.
What Is Focused Crawling?
Focused crawling assigns relevance scores to discovered URLs and fetches only those above a threshold. Rather than following all links on a page, a focused crawler evaluates whether each link is likely to lead to topically relevant content before adding it to its frontier. This converts BFS's flat FIFO queue into a priority queue sorted by estimated relevance.
How Focused Crawlers Score Links Before Fetching Them
Soumen Chakrabarti, Martin van den Berg, and Byron Dom formalized the focused crawling approach in their landmark 1999 Computer Networks paper "Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery." Their key insight was that anchor text and the content of the linking page are reliable predictors of the linked page's topic, even before the linked page is fetched. A link labeled "Premier League fixtures" on a sports page can be scored as highly relevant to a football-focused crawler without downloading the destination URL first.[2]
This pre-fetch scoring mechanism produces a priority queue where URLs with the highest predicted relevance are fetched before lower-scoring URLs. If a page's score falls below the relevance threshold, its links are not added to the frontier at all. Chakrabarti et al. called this "crawl boundary analysis." It is what gives focused crawlers their efficiency: they completely ignore entire regions of the web that do not match the target topic, rather than fetching pages and discarding them after the fact.
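The priority queue and threshold pruning can be illustrated with a toy scorer. This is a deliberately simplified sketch: it scores links by keyword overlap in the anchor text, which is an assumption made for illustration and not the classifier used by Chakrabarti et al. The `TOPIC` set, `THRESHOLD`, and the `PAGES` graph are all hypothetical.

```python
import heapq

TOPIC = {"premier", "league", "football", "fixtures"}  # hypothetical topic keywords
THRESHOLD = 0.25                                       # hypothetical relevance cutoff

# Hypothetical pages: url -> list of (anchor text, target url).
PAGES = {
    "sports-home": [
        ("Premier League fixtures", "fixtures"),
        ("Football transfer news", "transfers"),
        ("Celebrity gossip", "gossip"),
    ],
    "fixtures": [("Premier League table", "table")],
    "transfers": [],
    "gossip": [("Football stars on holiday", "holiday")],
    "table": [],
    "holiday": [],
}

def score(anchor):
    """Fraction of topic keywords that appear in the anchor text."""
    words = set(anchor.lower().split())
    return len(words & TOPIC) / len(TOPIC)

def focused_crawl(seed, pages):
    frontier = [(-1.0, seed)]   # heapq is a min-heap, so scores are stored negated
    seen = {seed}
    order = []
    while frontier:
        _, url = heapq.heappop(frontier)   # highest predicted relevance first
        order.append(url)
        for anchor, target in pages.get(url, []):
            s = score(anchor)
            # Links below the threshold never enter the frontier.
            if s >= THRESHOLD and target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-s, target))
    return order

print(focused_crawl("sports-home", PAGES))
```

The `gossip` link is pruned before it is fetched, so the `holiday` page behind it is never discovered, even though its anchor text would have scored above the threshold. That is the trade-off pruning makes: whole regions are skipped without ever being downloaded.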
Where Focused Crawlers Are Deployed
Focused crawlers power most vertical search engines. Wikipedia describes vertical search sites as using focused crawlers to build their indexes, since exhaustive web coverage would be both wasteful and counterproductive. A legal research database needs deep coverage of case law and statutes; it has no use for product listings or blog posts. A real estate aggregator needs property listings, not sports scores.
Google News can be understood as a real-time focused crawler for article discovery, scoring links by publication signals (structured markup, hostname patterns, publication timestamps) rather than static topic relevance. SEO audit tools including Screaming Frog and Sitebulb behave like focused crawlers whose relevance test is domain membership: they follow same-domain links and do not crawl onward into external sites, producing deep coverage of one site rather than shallow coverage of the broader web.
How Do BFS, DFS, and Focused Crawling Compare?
| Dimension | Breadth-First (BFS) | Depth-First (DFS) | Focused Crawling |
|---|---|---|---|
| Frontier data structure | FIFO queue | LIFO stack | Priority queue |
| Page quality encountered early | High (Najork & Wiener, 2001) | Low to mixed | High within target topic |
| Memory requirement | High (entire depth layer) | Low (single path stack) | Medium (scored queue) |
| Trap risk | Low | High (loops, infinite params) | Very low (threshold pruning) |
| Topic targeting | None | None | Explicit |
| Best use case | General search indexing | Site mirroring, narrow deep structures | Vertical search, news, SEO audits |
| Scales to web-wide crawling | Yes, with distributed systems | No | Yes, within a topic domain |
The dominant real-world approach is a hybrid: BFS as the structural backbone with priority-based URL ordering layered on top. Googlebot is not a pure BFS crawler. Its observable behavior is consistent with a priority-weighted queue that combines breadth-first expansion across many hosts with PageRank-like importance scores to determine which URLs advance to the front. The BFS pattern emerges from the distribution of authority across the web; the priority weighting corrects for cases where pure BFS order would waste budget on low-value pages at shallow depths.
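One way to picture the hybrid is a priority queue whose key mixes link depth with an importance score. The sketch below is an illustration under stated assumptions, not Googlebot's algorithm: the formula `depth - WEIGHT * importance`, the `WEIGHT` value, and the `IMPORTANCE` scores are all invented for the example.

```python
import heapq
from itertools import count

# Hypothetical site graph and per-URL importance scores in [0, 1].
LINKS = {
    "home": ["about", "blog", "pricing"],
    "about": [],
    "blog": ["post-1", "post-2"],
    "pricing": [],
    "post-1": [],
    "post-2": [],
}
IMPORTANCE = {"home": 1.0, "pricing": 0.9, "blog": 0.6,
              "post-1": 0.8, "about": 0.1, "post-2": 0.2}
WEIGHT = 1.5  # hypothetical: how strongly importance offsets depth

def hybrid_crawl(seed, links, importance):
    """Fetch order for a frontier keyed on depth minus weighted importance."""
    tie = count()                          # tie-breaker keeps insertion order stable
    frontier = [(0.0, next(tie), seed, 0)]
    seen = {seed}
    order = []
    while frontier:
        _, _, url, depth = heapq.heappop(frontier)   # lowest key is fetched first
        order.append(url)
        for link in links.get(url, []):
            if link not in seen:
                seen.add(link)
                # Shallow pages get low keys (BFS-like); importance lowers them further.
                priority = (depth + 1) - WEIGHT * importance.get(link, 0.0)
                heapq.heappush(frontier, (priority, next(tie), link, depth + 1))
    return order

print(hybrid_crawl("home", LINKS, IMPORTANCE))
```

With `WEIGHT = 0` the function reduces to plain BFS order. With the weighting on, the important depth-2 page `post-1` is fetched ahead of the low-value depth-1 page `about`.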
How Distributed Crawlers Scale These Strategies: Apache Nutch
A single machine can go surprisingly far, but only with heavy engineering. The academic crawler IRLbot reported a sustained average download rate of 1,789 pages per second in 2009 while running on a single server, a result that depended on purpose-built disk-based data structures for URL deduplication and scheduling. Most large crawls take the other route and spread the work across many machines. Apache Nutch, the open-source crawler from which the Hadoop project grew, addresses scale by distributing the URL frontier across multiple machines using MapReduce.[4]
Nutch partitions discovered URLs by domain before each fetch cycle. All URLs from the same domain are assigned to the same reducer process, which enforces per-domain politeness limits (one active connection at a time, respecting the Crawl-delay directive from robots.txt) while allowing hundreds of different domains to be fetched in parallel across the cluster.
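The partitioning idea can be sketched without any cluster. The code below is not Nutch's implementation: `partition_for`, `build_fetch_lists`, `schedule`, and `NUM_PARTITIONS` are hypothetical names, and a CRC32 hash of the hostname is just one simple way to send all URLs for a host to the same worker.

```python
import zlib
from collections import defaultdict
from urllib.parse import urlsplit

NUM_PARTITIONS = 4  # hypothetical number of workers

def partition_for(url, n=NUM_PARTITIONS):
    """Map a URL to a worker by hashing its hostname, so one host maps to one worker."""
    host = urlsplit(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % n

def build_fetch_lists(urls, n=NUM_PARTITIONS):
    """Group URLs into per-worker fetch lists."""
    lists = defaultdict(list)
    for url in urls:
        lists[partition_for(url, n)].append(url)
    return dict(lists)

def schedule(urls, crawl_delay=1.0):
    """Give each URL an earliest fetch time so a host is hit once per crawl_delay seconds."""
    next_slot = defaultdict(float)   # host -> next allowed fetch time (seconds)
    plan = []
    for url in urls:
        host = urlsplit(url).hostname or ""
        plan.append((next_slot[host], url))
        next_slot[host] += crawl_delay
    return plan

urls = ["https://a.example/1", "https://a.example/2", "https://b.example/1"]
print(build_fetch_lists(urls))
print(schedule(urls, crawl_delay=2.0))
```

Because every URL for a host lands on one worker, that worker alone can enforce the delay for the host, while different hosts proceed in parallel on other workers.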
The frontier itself is stored in a distributed NoSQL store where URLs are used as row keys, but with reversed host components: a URL like https://example.com/page becomes a row key along the lines of com.example/page (simplified here; the key Nutch actually stores also encodes the protocol and any port). This reversal ensures that all URLs from the same domain are grouped into adjacent rows in the sorted table. Scanning a contiguous row range is far cheaper than random-access lookups scattered across the full table.
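The effect of host reversal on sort order is easy to demonstrate. This sketch produces the simplified `com.example/page` form used in the text; Nutch's own keys carry more information (protocol and port), and `reversed_host_key` is a hypothetical helper name.

```python
from urllib.parse import urlsplit

def reversed_host_key(url):
    """Build a simplified row key: reversed host components followed by the path."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))
    key = reversed_host + parts.path
    if parts.query:
        key += "?" + parts.query
    return key

urls = [
    "https://example.com/page",
    "https://news.example.org/today",
    "https://blog.example.com/post",
    "https://example.com/about",
]
keys = sorted(reversed_host_key(u) for u in urls)
for key in keys:
    print(key)
```

After sorting, every `example.com` row (including its `blog` subdomain) sits in one contiguous run, with `example.org` after it. Sorting the original URLs would instead interleave the two domains, because `blog.`, `example.`, and `news.` would decide the order.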
Notice how the Nutch architecture makes the strategy choice inseparable from the infrastructure choice. The BFS-with-priority algorithm is conceptually simple. Implementing it correctly at web scale requires distributed file systems (HDFS), domain-partitioned queues, and per-host politeness scheduling that keeps thousands of servers from being overloaded simultaneously. Strategy and system are the same engineering decision.
What Crawl Strategy Means for SEO: Why Some Pages Get Indexed in Hours and Others Wait Months
Googlebot's behavior approximates a priority-weighted BFS. URLs with strong link signals (high internal PageRank, many external inbound links, recent sitemap submission) receive high crawl priority and get fetched within minutes or hours of being published or updated. URLs sitting many clicks from any well-linked page may wait in Googlebot's queue for days or weeks.[3]
The gap is not marginal. SEO practitioners commonly report that a high-priority URL can be crawled within minutes of publication, while a low-priority URL on a poorly linked site can go weeks without being fetched at all.
Three site architecture decisions follow directly from understanding BFS-like crawl strategy:
Link depth. Pages more than three clicks from the homepage sit at deep queue levels in a BFS system. Googlebot's crawl budget is not unlimited. Pages that matter (product pages, pillar content, high-converting landing pages) should be reachable within three clicks from a well-linked page.
Internal link clustering. Focused crawlers and priority-BFS systems alike score link neighborhoods. A cluster of topically related pages that all link to each other signals strong relevance and improves crawl priority for the entire group. This is why internal linking is a crawl architecture decision, not merely a UX decision.
XML sitemaps as a queue bypass. A sitemap submitted to Search Console tells Google about URLs directly, so discovery no longer depends on the link-following mechanism. Google treats sitemap entries as hints rather than commands, so submission does not guarantee a crawl. For new content with no established inbound links, though, it is the most dependable way to get a URL in front of the crawler before the normal link-following cycle would reach the page.
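The link-depth guideline above can be checked mechanically, because click depth is simply BFS distance from the homepage. The sketch below assumes a made-up internal link graph (`SITE`) and a hypothetical `MAX_CLICKS` limit of three; in practice the graph would come from a crawl of your own site.

```python
from collections import deque

# Hypothetical internal link graph: page -> pages it links to.
SITE = {
    "/": ["/products", "/blog"],
    "/products": ["/products/widgets"],
    "/blog": ["/blog/2024"],
    "/products/widgets": ["/products/widgets/blue"],
    "/blog/2024": ["/blog/2024/03"],
    "/products/widgets/blue": [],
    "/blog/2024/03": ["/blog/2024/03/launch-post"],
    "/blog/2024/03/launch-post": [],
}
MAX_CLICKS = 3  # hypothetical threshold taken from the three-click guideline

def click_depths(home, links):
    """Return the minimum number of clicks from the homepage to each reachable page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for link in links.get(page, []):
            if link not in depths:            # first visit is the shortest path in BFS
                depths[link] = depths[page] + 1
                queue.append(link)
    return depths

depths = click_depths("/", SITE)
too_deep = sorted(p for p, d in depths.items() if d > MAX_CLICKS)
print(too_deep)
```

Here the launch post sits four clicks from the homepage. Linking it from `/blog` or the homepage would bring it within the limit without changing its URL.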
Sources
[1] Najork, M., and Wiener, J.L. (2001). "Breadth-First Crawling Yields High-Quality Pages." Proceedings of the 10th International Conference on World Wide Web (WWW '01), pp. 114-118. ACM.
[2] Chakrabarti, S., van den Berg, M., and Dom, B. (1999). "Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery." Computer Networks, 31:1623-1640. Elsevier.
[3] Cho, J., Garcia-Molina, H., and Page, L. (1998). "Efficient Crawling Through URL Ordering." Computer Networks and ISDN Systems, 30(1-7):161-172.
[4] Apache Nutch Project. Nutch2Crawling documentation. Apache Software Foundation.





