
URL Discovery Explained: How Googlebot Finds Pages Through Links, Sitemaps, and Search Console

Jun 8, 2026. 10 min read
There is no central registry of web pages. Googlebot must continuously search for new and updated URLs on its own, using a process Google calls "URL discovery." There are three pathways into Googlebot's crawl frontier: following hyperlinks from already-crawled pages, reading XML sitemaps submitted through Search Console, and processing manual indexing requests. A page with no inbound links has no link pathway for the crawler to follow. It becomes an orphan: technically published but effectively invisible to Google unless a sitemap entry saves it. This article explains how each discovery pathway works, why they are not equivalent, and what the mechanics of orphan pages reveal about internal linking as a crawl architecture decision.


Why URL Discovery Is the Foundation of Indexing

URL discovery is the prerequisite for every subsequent step in the search pipeline. A page Google has never heard of cannot be crawled. A page that is not crawled cannot be indexed. A page that is not indexed cannot rank. The three discovery mechanisms described in this article are the only routes by which a URL enters Google's crawl queue.[1]

Google's scale shows why discovery cannot depend on manual submission. Google has publicly cited figures of more than 30 trillion unique URLs discovered across the web and roughly 20 billion pages crawled per day. That volume requires automated, systematic discovery. Understanding how that automation works is essential for anyone responsible for making pages reachable.


Link following is how Google discovers the vast majority of URLs. When Googlebot fetches a page, it parses the HTML to extract every crawlable link, adds those URLs to the frontier queue, and processes them in priority order. According to Google's own documentation on how search works, "Other pages are discovered when Google extracts a link from a known page to a new page: for example, a hub page, such as a category page, links to a new blog post."[1]

How Googlebot Extracts Links from Crawled Pages

Googlebot extracts links from the rendered HTML of a page, which includes both static anchor tags and links that JavaScript injects during rendering. In practice, Googlebot can only find pages that are connected by anchor elements with crawlable href attributes. Links hidden behind JavaScript interactions, form submissions, or session-based navigation may not be followed at all, because the crawler does not simulate user actions beyond the initial page render.

The crawl frontier functions as a priority queue. Discovered links are not queued in a flat list but ranked by signals including the authority of the linking page, how many pages link to the destination, and freshness indicators. A link from your homepage to a new article carries far more crawl priority weight than a link from a three-year-old post with no backlinks.
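The two steps above, extracting crawlable anchors and queueing them by priority, can be sketched in a few lines of Python. This is a toy model, not Googlebot's implementation: the `LinkExtractor`, `Frontier`, and `enqueue_links` names are hypothetical, and the single numeric score stands in for the mix of authority and freshness signals described above.

```python
import heapq
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return  # buttons, forms, and onclick handlers are ignored
        href = dict(attrs).get("href")
        if not href:
            return
        url = urljoin(self.base_url, href)
        # Keep only http(s) URLs; javascript: and mailto: are not crawlable.
        if urlparse(url).scheme in ("http", "https"):
            self.links.append(url.split("#")[0])


class Frontier:
    """Toy priority queue: the URL with the highest score is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def add(self, url, score):
        if url in self._seen:
            return  # already known, so this is not a new discovery
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, url))

    def pop(self):
        return heapq.heappop(self._heap)[1]


def enqueue_links(frontier, page_url, html, page_score):
    """Extract links from one page and queue them with the page's score."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    for link in parser.links:
        frontier.add(link, page_score)
    return parser.links
```

In this sketch a link found on a high-scoring page (a homepage, say) is popped before a link found on a low-scoring one, which mirrors the prioritization described above in a heavily simplified form.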

Why Internal Links Signal More Than Just Navigation

From Googlebot's perspective, an internal link is editorial input, not just a navigation convenience. When Page A links to Page B, two things happen simultaneously. First, Googlebot adds Page B's URL to its frontier if it has not seen it before. Second, some portion of Page A's PageRank flows to Page B through the link. This authority flow affects not just discoverability but ranking potential.

Google's John Mueller has called internal linking "super critical for SEO" and "one of the biggest things you can do on a website." The Ahrefs 2024 study of web pages found that 66.2% of pages have only one internal link pointing to them, which means most websites severely underuse this mechanism. A page with many strong internal links signals importance to Googlebot and tends to be recrawled more frequently. A page with one weak internal link sits near the bottom of the priority queue.

The practical implication is that internal linking is not a navigation decision made after content is published. It is a crawl architecture decision made before content is expected to rank.


How Do XML Sitemaps Feed URLs into Google's Crawl Frontier?

An XML sitemap is a file that lists URLs in a structured format and is submitted to Google through Search Console or referenced in robots.txt. Unlike link following, which requires Googlebot to traverse your site organically, a sitemap injects URLs directly into Google's awareness regardless of whether any crawled page links to them.[2]
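For the robots.txt route, the reference is a single `Sitemap:` line. The snippet below is a small illustration using Python's standard `urllib.robotparser`; the robots.txt content and the example.com URL are placeholders, and `sitemaps_from_robots` is a hypothetical helper name. It only shows how a crawler-side parser can read the reference, not how Google processes it.

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt content with a sitemap reference.
ROBOTS_TXT = """\
User-agent: *
Allow: /

Sitemap: https://www.example.com/sitemap.xml
"""


def sitemaps_from_robots(robots_txt):
    """Return the sitemap URLs declared in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() (Python 3.8+) returns None when no Sitemap line exists.
    return parser.site_maps() or []


print(sitemaps_from_robots(ROBOTS_TXT))
```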

What Sitemaps Tell Google vs. What They Cannot Guarantee

Google's official documentation on building and submitting sitemaps is explicit: "Keep in mind that submitting a sitemap is merely a hint: it doesn't guarantee that Google will download the sitemap or use the sitemap for crawling URLs on the site." This is not a minor caveat. It is the defining characteristic of the sitemap mechanism.

Sitemaps communicate which URLs exist and which ones you consider important. They do not communicate why those URLs matter, how authoritative they are, or how they relate to each other. Google treats sitemap entries as suggestions and still applies its own quality and relevance filters before adding those URLs to the active crawl queue. According to search advocate John Mueller, the difference between sitemap URL count and indexed URL count is completely normal. Google does not index everything in a sitemap, and not everything in a sitemap deserves to be indexed.

What sitemaps do well: they surface URLs that have no inbound links, including pages buried deep in site structure or published without any internal link connections. They are especially effective on large sites where link following alone would take weeks or months to discover all content. The lastmod element (indicating when a page was last modified) gives Google a freshness signal that can improve crawl prioritization for recently updated pages.
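As a concrete picture of what a sitemap entry with a lastmod value looks like, here is a minimal sketch that builds one with Python's standard library. The `build_sitemap` helper and the example.com URLs are assumptions for illustration; the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (url, lastmod) pairs.

    lastmod is expected as a W3C date string such as "2026-06-08".
    Returns UTF-8 encoded bytes, ready to write to sitemap.xml.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, lastmod in entries:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        ET.SubElement(node, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


xml_bytes = build_sitemap([
    ("https://www.example.com/", "2026-06-08"),
    ("https://www.example.com/blog/new-post", "2026-06-01"),
])
print(xml_bytes.decode("utf-8"))
```

The lastmod value is only useful as a freshness signal if it reflects real content changes, so a generator like this would normally read it from the CMS rather than stamp every URL with today's date.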

When Sitemaps Beat Link-Following and When They Don't

| Scenario | Better mechanism | Reason |
| --- | --- | --- |
| New page on a young site, no inbound links | Sitemap | No link path exists yet for Googlebot to follow |
| High-authority page with many internal links | Link following | Authority signal is stronger than a hint |
| Large site with 50,000+ pages | Sitemap | Ensures all URLs are visible without exhausting crawl budget on link traversal |
| Deep page 6+ clicks from homepage | Sitemap | Link path is too long for efficient BFS-like crawling |
| Page with thin content, no external backlinks | Neither alone | Sitemap exposes the URL; internal links provide the authority needed for indexing to stick |

Sitemaps are discovery tools, not authority tools. A sitemap can surface a URL; only inbound links can give it the PageRank it needs to stay indexed and rank.
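The "clicks from homepage" measure in the table can be computed with a breadth-first search over a site's internal link graph. The sketch below assumes you already have that graph as a dictionary (for example from a crawler export); `click_depths` and the sample paths are hypothetical.

```python
from collections import deque


def click_depths(links, start):
    """Breadth-first search from `start`.

    links maps each page to the list of pages it links to.
    Returns {page: minimum number of clicks from start}.
    Pages missing from the result have no link path from start.
    """
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


site = {
    "/": ["/blog", "/products"],
    "/blog": ["/blog/post-1"],
    "/blog/post-1": ["/blog/post-2"],
    "/orphan": ["/"],  # links out, but nothing links to it
}
depths = click_depths(site, "/")
deep_pages = [page for page, depth in depths.items() if depth >= 3]
print(depths, deep_pages)
```

Pages with a large depth are candidates for a sitemap entry plus a shorter internal link path; pages absent from the result entirely are unreachable by link following from the start page.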


How Does the Search Console URL Inspection Tool Submit Pages for Crawling?

The URL Inspection tool in Google Search Console allows site owners to manually request crawling of a specific URL. When you click "Request Indexing," Google runs a quick check to confirm the URL is technically indexable and then places it in a high-priority position within the crawl queue.[3]

How the Request Indexing Feature Works

Google's documentation states that the URL Inspection tool submits the URL to the indexing queue after passing a brief technical check. Indexing typically occurs within one day to several days, though the timeline depends on site authority, server availability, and content quality. Importantly, the documentation specifies that "there is a daily limit to how many index requests you can submit." In practice, community data places that limit at approximately 10 to 12 requests per property per day.

The Request Indexing feature is appropriate for individual high-priority pages: a new pillar article, a revised product page, or a corrected piece of time-sensitive content. It is not a scalable solution. Google's own guidance is explicit on this: "If you want many pages indexed, try submitting a sitemap to Google."

What the Inspection Report's Discovery Section Reveals

The URL Inspection report includes a "Discovery" section that shows exactly how Google first found the URL under inspection. It reports the referring page (the crawled page that linked to this URL) and any sitemaps that include it. If the "Referring page" field shows "None detected," the page is confirmed as an orphan: no link path has been recorded leading to it. This field is the most direct diagnostic for orphan page status available without a third-party crawl tool.


What Is an Orphan Page and Why Is It Invisible?

An orphan page is a URL that has no inbound internal links from any other page on the same site. Googlebot discovers most URLs by following links from pages it has already crawled. If no crawled page contains a link to a given URL, the crawler has no link pathway to follow to reach it. The page becomes invisible to the standard discovery process.[5]

Orphan pages can still appear in Google's index if they are listed in an XML sitemap or if external backlinks from other websites provide a crawl pathway. However, the absence of internal links creates two compounding problems that sitemap submission alone cannot resolve.


The Dual Invisibility Problem: Discovery Failure and PageRank Starvation

Most explanations of orphan pages focus on one problem: the crawler cannot find the page. The full problem is two-layered, and both layers must be addressed.

Layer 1: Discovery failure. Without an internal link or sitemap entry, Googlebot never reaches the page. The URL sits on the server, published and accessible, but absent from Google's frontier queue. This is the layer that sitemap submission addresses directly.

Layer 2: PageRank starvation. Even when Google discovers an orphan page through a sitemap, the page receives zero PageRank from the rest of the site. PageRank flows exclusively through links. A page with no inbound internal links has no authority flow, regardless of the site's overall link profile. As Ahrefs summarizes in their SEO glossary: "With no internal links, orphan pages aren't getting any PageRank from the pages of the website. And Google still uses PageRank as one of the most important ranking signals."

These two problems do not cancel each other. A sitemap entry solves discovery but leaves PageRank starvation untouched. Internal links solve both simultaneously: they give Googlebot a link pathway to follow and pass authority to the destination page.
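A toy calculation makes the starvation effect visible. The sketch below uses the classic textbook PageRank iteration with a damping factor of 0.85; it is not Google's production ranking system, and the four-page graph is invented for illustration. In this model the orphan page never receives more than the baseline share that every page gets regardless of links.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


site = {
    "home": ["a", "b"],
    "a": ["home"],
    "b": ["home"],
    "orphan": ["home"],  # listed in a sitemap, but no page links to it
}
ranks = pagerank(site)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {value:.4f}")
```

Adding a single link from "home" to "orphan" in the `site` dictionary and rerunning is a quick way to see how one internal link changes the picture.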

Notice how this dual structure exposes a common misconception: site owners who submit comprehensive sitemaps sometimes assume their orphan pages are "discovered" and therefore fine. Discovery and authority are separate signals. A page can be indexed (discovery solved) and still rank poorly for every query it targets (PageRank starvation unsolved). The fix for an orphan page that matters is always an internal link, not just a sitemap entry.

The scale of this problem is larger than most site owners realize. Semrush research found that orphan pages are one of the most common technical SEO issues identified during site audits. Site migrations, navigation redesigns, and CMS changes are the most frequent causes: pages get restructured, internal links get removed or broken, and the affected pages drift into orphan status without any individual change explicitly making them invisible.


How Do You Audit URL Discovery Health in Google Search Console?

The evidence for discovery problems is available directly in Google Search Console without third-party tools.[4]

Crawl Stats report. Navigate to Settings, then Crawl Stats, then examine the breakdown of crawl requests by "Purpose." Google distinguishes between Discovery crawls (first-time fetches of new URLs) and Refresh crawls (recrawls of already-known pages). If your site is publishing new content but the discovery crawl count remains flat, your link structure is not generating enough new pathways for Googlebot to follow.

Pages report. The Pages report (formerly Coverage) shows URLs in the "Discovered - currently not indexed" state. This status means Google knows about the URL but has not yet fetched it. It is commonly associated with orphan pages that were discovered via sitemap but have no link authority to justify prioritization.

URL Inspection tool. For any specific URL, the Discovery section in the inspection report shows whether Googlebot found the page via a referring page link, a sitemap, or another method. A "None detected" result in the referring page field is diagnostic confirmation of orphan status.

A pattern common in large e-commerce and content sites: the Crawl Stats report shows thousands of Refresh crawls (Googlebot cycling through already-known pages) while Discovery crawls stagnate. This means internal link structure is not generating new pathways. The crawler is rechecking old pages but failing to reach new content, even if that content is listed in a sitemap.

Sources

  1. Google Search Central. "In-Depth Guide to How Google Search Works." Google for Developers.

  2. Google Search Central. "Build and Submit a Sitemap." Google for Developers.

  3. Google Search Central. "URL Inspection Tool." Search Console Help.

  4. Google Search Central. "Crawl Stats Report." Search Console Help.

  5. Ahrefs SEO Glossary. "Orphan Page."

Frequently Asked Questions (FAQs)

What are the three ways Google discovers URLs?

Google discovers URLs through three primary mechanisms: following links from pages it has already crawled, reading XML sitemaps submitted through Search Console or referenced in robots.txt, and processing manual indexing requests submitted via the URL Inspection tool. Link following is the primary mechanism for most pages. Sitemaps are especially valuable for pages with no inbound links. Manual requests are limited to approximately 10-12 URLs per day.

What is an orphan page in SEO?

An orphan page is a URL with no inbound internal links from any other page on the same website. Because Googlebot discovers most URLs by following links from crawled pages, an orphan page has no link pathway leading to it. It may still be discovered via an XML sitemap or external backlink, but it receives no internal PageRank and typically ranks poorly even when indexed.

Does submitting a sitemap guarantee indexing?

No. Google's own documentation states that sitemap submission is "merely a hint" and does not guarantee that Google will download the sitemap or crawl the URLs it contains. Google applies its own quality and relevance filters before adding sitemap URLs to the active crawl queue. A sitemap makes URLs available to Google; it does not guarantee crawling, indexing, or ranking.

Why does the URL Inspection tool show "None detected" for referring pages?

The "None detected" result in the URL Inspection tool's Discovery section means Google has no record of a crawled page linking to the inspected URL. This is the diagnostic confirmation of orphan page status. The URL may have been discovered via a sitemap or external backlink rather than through internal link-following. The fix is to add an internal link from a relevant, already-indexed page to the orphaned URL.

Can an orphan page rank well in Google?

Rarely, and typically only with strong external backlinks compensating for the absence of internal link authority. Internal links serve two functions simultaneously: they give Googlebot a crawl pathway to reach the page, and they pass PageRank that raises the page's authority. A page discovered via sitemap but receiving no internal links has zero internal authority flow. Even with indexing secured, it starts ranking from a position of near-zero internal authority.

How do I find orphan pages on my site?

Compare your list of crawled URLs (from a link-following crawler such as Screaming Frog) against your complete URL inventory (from your XML sitemap or a server-side export). URLs that appear in your inventory but not in your crawl results are your orphan pages. In Search Console, the "Discovered - currently not indexed" status in the Pages report and a "None detected" referring page in the URL Inspection tool are both diagnostic signals worth investigating.
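The comparison itself is a set difference. The sketch below assumes both lists are already loaded as strings; `find_orphans` is a hypothetical helper, and the normalization (trimming whitespace and a trailing slash) is a deliberate simplification that a real audit would extend to match the site's canonical URL rules.

```python
def find_orphans(inventory_urls, crawled_urls):
    """Return inventory URLs that a link-following crawl never reached."""

    def normalize(url):
        # Simplified normalization: trim whitespace and one trailing slash.
        return url.strip().rstrip("/")

    crawled = {normalize(url) for url in crawled_urls}
    inventory = {normalize(url) for url in inventory_urls}
    return sorted(inventory - crawled)


inventory = [
    "https://www.example.com/",
    "https://www.example.com/blog/post-1",
    "https://www.example.com/landing/old-campaign",
]
crawled = [
    "https://www.example.com",
    "https://www.example.com/blog/post-1/",
]
print(find_orphans(inventory, crawled))
```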
