Crawl problems are invisible from the outside. A site can look functional to users while Googlebot is silently wasting budget on redirect chains, failing to reach important pages buried six clicks deep, or crawling JavaScript shells that return no indexable content. The only way to expose these failures is through systematic diagnostic work across three data sources: Google Search Console (population-level indexing states), server log files (ground-truth crawl behavior at URL level), and third-party crawlers (link-following simulation that reveals crawl traps and broken paths). Each source answers different questions. None answers all of them. This article walks through a complete crawl audit workflow that pulls all three together, explains what each tool reveals and what it misses, and provides the eight-step diagnostic sequence that moves from symptom to root cause.
Why Does No Single Crawl Tool Tell the Full Story?
The three main data sources for crawl diagnosis each have a fundamental limitation that no configuration or upgrade resolves.
Google Search Console is sampled and lagged. The Pages report shows which URLs Google knows about and their indexing state, but it aggregates data across time windows, omits many URL-level details, and reflects a processed view of crawl behavior rather than the raw crawl record. The Crawl Stats report shows aggregate patterns but cannot show you which specific URLs consumed your budget on a given day, nor whether a particular page was crawled for the first time or re-crawled.
Server log files are ground truth for every request Googlebot made to your server, timestamped to the second, with exact status codes, response sizes, and user-agent strings. They show precisely which URLs were crawled and when. They do not show whether those crawled pages were indexed or rejected, because indexing decisions happen downstream of the crawl in systems the log never touches.
Third-party crawlers (Screaming Frog, Sitebulb) simulate Googlebot's link-following behavior from a given entry point. They reveal crawl traps, redirect loops, broken internal links, and pages blocked by robots.txt, exactly what a spider following links would encounter. They do not represent actual Googlebot behavior, because they follow links from the crawl entry point in real time while Googlebot follows a priority queue built over weeks of prior crawl history.
The complete diagnostic requires all three in sequence: Search Console to identify which symptoms exist at scale, log files to confirm what Googlebot actually did at URL level, and third-party crawlers to expose the structural causes that produced those behaviors.
What Is the Three-Source Diagnostic Model?
The three-source model maps each data source to the diagnostic question it answers best:
| Source | Answers | Misses |
|---|---|---|
| Google Search Console | Which pages are indexed, excluded, or problematic? What are the crawl patterns over time? | Exact URL-level crawl detail, the logic behind indexing decisions, unsampled budget data |
| Server log files | Which URLs did Googlebot actually request? When? How fast? What was returned? | Whether crawled pages were indexed or rejected by the indexer |
| Third-party crawlers | What crawl paths, traps, loops, and blocked pages does the link structure produce? | Actual Googlebot behavior, priority queue decisions, history-dependent crawl patterns |
Run Search Console first to understand scope and prioritize investigation. Use log files to confirm whether Googlebot is crawling the right pages and to identify budget waste patterns. Use third-party crawlers to find the structural root causes: the broken links, redirect chains, robots.txt blocks, and crawl traps that explain the log file patterns you found.
Source 1: Google Search Console: Population-Level Signals
The Crawl Stats Report: Your First Diagnostic Dashboard
The Crawl Stats report (Settings, then Crawl Stats in Search Console) is the starting point for any crawl audit. It shows total crawl requests per day, average response time, download volume, and breakdowns by response code, file type, crawl purpose (Discovery vs. Refresh), and Googlebot type.[1]
Four signals warrant immediate investigation when found here:
High 4xx rate. A large proportion of Googlebot requests returning 404 errors means Googlebot is spending significant budget on non-existent pages. These are typically legacy URLs from site migrations, broken internal links, or parameter variants that were once crawlable and became dead ends. Every 404 request is a wasted crawl slot.
High 5xx rate. Server errors during Googlebot visits suppress the crawl rate limit over time. Google's crawl capacity algorithm reduces the crawl rate when it encounters repeated server errors, treating them as a signal of server instability. A sustained 5xx rate of more than 2 to 3 percent of crawl requests will visibly reduce total daily crawls within days.
Discovery crawls stagnating. The Crawl Stats report separates Discovery crawls (first-time fetches of new URLs) from Refresh crawls (recrawls of known pages). If your site publishes new content regularly but Discovery crawl volume is flat or declining while Refresh crawls dominate, Googlebot is cycling through existing pages rather than finding new ones. This signals an internal linking problem: new pages are not being connected to the crawl frontier through links from already-indexed pages.
Slow average response time. Response time above 200ms consistently constrains the crawl rate limit. The crawl capacity algorithm factors in server responsiveness when determining how aggressively to crawl. A slow server directly reduces pages crawled per day, compounding all other crawl problems.
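The four checks above can be sketched as a small script. This is a minimal illustration, not a Search Console integration: the dictionary keys (`requests`, `status_4xx`, `status_5xx`, `avg_response_ms`, `discovery`) are hypothetical field names you would map from your own export, and the thresholds simply reuse the 200ms and 3 percent figures quoted in this article.

```python
from statistics import mean

# Thresholds reuse the figures quoted in the article; the field names are
# hypothetical and do not correspond to any official Search Console export.
MAX_RESPONSE_MS = 200
MAX_ERROR_SHARE = 0.03


def flag_crawl_stats(days):
    """Return warning labels for a chronological list of daily crawl-stat dicts."""
    total = sum(d["requests"] for d in days)
    if total == 0:
        return ["no crawl requests recorded"]
    flags = []
    if sum(d["status_4xx"] for d in days) / total > MAX_ERROR_SHARE:
        flags.append("high 4xx rate")
    if sum(d["status_5xx"] for d in days) / total > MAX_ERROR_SHARE:
        flags.append("high 5xx rate")
    if mean(d["avg_response_ms"] for d in days) > MAX_RESPONSE_MS:
        flags.append("slow average response time")
    # Discovery stagnation: compare equal-sized first and last slices of the window.
    half = len(days) // 2
    if half:
        first = sum(d["discovery"] for d in days[:half])
        last = sum(d["discovery"] for d in days[-half:])
        if last <= first:
            flags.append("discovery crawls flat or declining")
    return flags
```

Treat the output as a prompt for investigation rather than a verdict: a single noisy day can trip a threshold, so a 30-day window is a more reasonable input than a week.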
The Pages Report: Decoding Every Indexing State
The Pages report (formerly Coverage) categorizes all URLs Google knows about into four primary states that form a crawl priority hierarchy. These are not merely status labels: they represent where a page sits in Google's crawl and indexing pipeline, and pages can move backward through them after algorithm updates.[2]
Submitted and indexed. The page is in the index and is being shown in search results. This is the target state for all intentionally published content.
Crawled: Currently not indexed. Google crawled and evaluated the page and made a deliberate decision not to index it. Based on research into indexing data, this status (despite Google's documentation describing it as pages that "may or may not be indexed in the future") frequently applies to pages that were previously indexed and have since been removed from the index. It is a medium-priority state where Google has enough information to make a quality judgment. Common causes: thin content, near-duplicate content, content that falls below Google's quality threshold for that competitive space, or pages with poor internal authority.
Discovered: Currently not indexed. Google is aware of the URL (from a sitemap, a link, or a previous crawl) but has not crawled it yet. This is a crawl queue management issue, not a content quality issue. The page has not been evaluated. Common causes: insufficient crawl budget, weak internal link coverage that gives the URL low crawl priority, or a large volume of competing low-priority URLs consuming the available budget. Do not apply content fixes to pages in this state; improve their crawl priority first.
Not indexed for other reasons. Pages blocked by robots.txt, carrying noindex directives, returning non-200 status codes, or experiencing canonical conflicts appear under specific sub-reasons in the Excluded section.
Understanding which state a URL is in before applying a fix prevents the common error of rewriting content for pages that are in a queue problem rather than a quality problem.
How to Diagnose "Discovered: Currently Not Indexed" at Scale
A large and growing population of pages in the Discovered state is the most reliable signal of a crawl budget problem. Googlebot knows the pages exist but is not prioritizing their crawl. The correct remediation sequence:
First, check whether those pages are included in your sitemap. If they are in the sitemap but still Discovered, the sitemap alone is insufficient: Google receives the hint but assigns the pages low crawl priority.
Second, examine the click depth of those pages in a third-party crawler. Pages at depth 5 or greater are structurally deprioritized regardless of sitemap status.
Third, look at what pages are consuming the Discovery crawl budget instead. If log files show Googlebot spending Discovery crawl slots on parameter URLs, paginated archives, or faceted navigation variants, those patterns are displacing the high-value pages from the front of the queue.
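The three-step sequence above lends itself to a simple triage pass. The sketch below assumes you have already exported the Discovered URLs, the sitemap URL set, and a click-depth mapping from a third-party crawler; the function and bucket names are illustrative, and each URL is assigned to the first check it fails, mirroring the order of the steps.

```python
def triage_discovered(urls, sitemap_urls, depth_by_url, max_depth=4):
    """Bucket 'Discovered: currently not indexed' URLs by likely structural cause.

    max_depth=4 reflects the article's statement that depth 5 or greater
    is structurally deprioritized.
    """
    buckets = {"missing_from_sitemap": [], "too_deep": [], "check_budget_waste": []}
    for url in urls:
        depth = depth_by_url.get(url)  # None means the crawler never reached the URL
        if url not in sitemap_urls:
            buckets["missing_from_sitemap"].append(url)
        elif depth is None or depth > max_depth:
            buckets["too_deep"].append(url)
        else:
            # In the sitemap and shallow enough: look at what else is consuming budget.
            buckets["check_budget_waste"].append(url)
    return buckets
```

URLs that land in `check_budget_waste` are the ones to carry into the log file analysis, since neither the sitemap nor click depth explains their status.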
How to Diagnose "Crawled: Currently Not Indexed" at Scale
Pages in this state have been evaluated and rejected. The diagnostics shift from architecture to content quality:
Sample the affected URLs and assess them honestly for thin content, near-duplication with stronger pages, or low unique value relative to what already ranks for the same queries. For e-commerce sites, product variant pages with identical descriptions are typical culprits. For content sites, paginated archive pages, author pages with few posts, and tag pages with one or two articles commonly accumulate here.
The fix is not to request indexing. Google's documentation explicitly states that there is no need to resubmit these URLs for crawling. The fix is one of three options: improve the page content until it meets the quality threshold, merge it into a stronger page with a 301 redirect, or apply noindex and remove it from the sitemap so it stops consuming crawl budget.
The Sitemaps Report as a Quality Gauge
The Sitemaps report shows submitted URLs versus indexed URLs for each submitted sitemap file. A submitted-to-indexed ratio below 60 percent indicates a structural problem: a large fraction of submitted pages are being rejected by the quality filter. The correct response is to audit the sitemap for non-canonical, noindexed, redirected, thin, or paginated URLs that should not be in the file at all. Improving the ratio by removing weak URLs is always more durable than adding more pages and hoping some index.
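As a rough illustration of the ratio check, the sketch below takes submitted and indexed counts per sitemap file and flags any file under the 60 percent mark. The counts are entered by hand from the Sitemaps report; the function name and the sitemap file names in the usage example are hypothetical.

```python
def sitemap_index_ratios(sitemaps, threshold=0.60):
    """Map each sitemap name to (ratio, needs_audit).

    `sitemaps` maps a sitemap file name to a (submitted, indexed) tuple.
    """
    report = {}
    for name, (submitted, indexed) in sitemaps.items():
        ratio = indexed / submitted if submitted else 0.0
        report[name] = (round(ratio, 3), ratio < threshold)
    return report


# Hypothetical counts for two sitemap files.
print(sitemap_index_ratios({"products.xml": (1000, 450), "blog.xml": (200, 180)}))
```

Splitting sitemaps by template (products, categories, articles) makes this check far more informative, because a single blended ratio can hide one badly performing section.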
Source 2: Server Log Files: Ground Truth at URL Level
What a Log Entry Contains
A standard server log entry for a Googlebot visit looks like this:
66.249.90.77 - - [15/Apr/2026:09:22:14 +0000] "GET /search-engine-basics/crawl-budget/ HTTP/1.1" 200 32417 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
This single line contains: the requesting IP address, the timestamp, the HTTP method and URL, the response code, the response size in bytes, and the user-agent string. For crawl diagnosis, the most useful fields are the URL, the response code, the response size, and the timestamp. Aggregated across thousands of entries, these fields answer: where does Googlebot spend its time, what does it receive, and how frequently does it return?
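A minimal parsing sketch for entries in this shape follows. It assumes the common "combined" access log format shown above; servers configured with a custom log format will need a different pattern, and the field names in the regular expression are this example's own choices.

```python
import re
from datetime import datetime

# Pattern for the combined log format; adjust if your server logs extra fields.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

SAMPLE = (
    '66.249.90.77 - - [15/Apr/2026:09:22:14 +0000] '
    '"GET /search-engine-basics/crawl-budget/ HTTP/1.1" 200 32417 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)


def parse_log_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # Some servers log "-" when no body was sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


entry = parse_log_line(SAMPLE)
print(entry["url"], entry["status"], entry["size"])
```

Returning `None` for unmatched lines, rather than raising, keeps a long log import running when it hits a malformed entry; count the misses so a wrong pattern does not go unnoticed.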
How to Verify That Entries Are Genuine Googlebot
User-agent strings are trivially easy to forge. Any bot can claim to be Googlebot in its user-agent header. Before using log data for crawl diagnosis, filter entries by user-agent string to isolate Googlebot candidates, then verify each candidate IP address using Google's recommended two-step reverse DNS and forward DNS process:[3]
- Perform a reverse DNS lookup on the IP address. The returned hostname should end in googlebot.com or google.com.
- Perform a forward DNS lookup on that hostname. The returned IP must match the original IP.
If both checks pass, the entry is a verified Googlebot hit. Screaming Frog's Log File Analyser automates this verification against Google's published IP ranges. Never build diagnostic conclusions on unverified log data: a significant fraction of "Googlebot" entries in raw logs are fake crawlers spoofing the user-agent string.
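The two-step check can be sketched with Python's standard `socket` module. This is an illustrative version, not Google's reference implementation: the function name and the injectable `reverse_lookup` and `forward_lookup` parameters are this example's own, added so the logic can be exercised without network access. Note that `socket.gethostbyname_ex` resolves IPv4 only, so IPv6 crawler addresses would need a different forward lookup.

```python
import socket

# Hostname suffixes named in the verification steps above.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the domain, then forward-resolve to confirm."""
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse_lookup(ip)
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False
        # The forward lookup must return the original IP, or the PTR record is spoofed.
        return ip in forward_lookup(hostname)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False
```

DNS lookups are slow at log scale, so in practice it is worth verifying each distinct IP once and caching the result rather than checking every log entry.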
The Seven Crawl Signals to Extract from Log Files
Once you have a verified set of Googlebot log entries, seven signals cover most diagnostic questions:
1. Crawl frequency by URL template. Group URLs by type (product pages, category pages, blog posts, parameter variants) and calculate average crawl frequency per template. Important pages should be crawled more frequently than low-value pages. If your product detail pages are crawled less often than your faceted navigation variants, the architecture is inverted.
2. Status code distribution by URL template. What percentage of requests to each template return 200, 301, 302, 404, and 5xx? High 4xx rates on a specific template indicate a structural problem (a URL pattern that generates invalid addresses). High 3xx rates indicate redirect chains or an improperly implemented URL normalization.
3. Response time by URL template. Slow server response times disproportionately affect crawl rate. Templates with consistent response times above 500ms are degrading the crawl capacity limit for the entire site.
4. Response size as a rendering signal. A page that should return 45 KB of content but consistently logs 8 KB in server responses is almost certainly returning only the HTML shell without JavaScript-rendered content. Googlebot is crawling the page but receiving an empty container. This diagnostic is unique to log files: it reveals JavaScript rendering failures without any other tool. Cross-reference with the URL Inspection tool's "View Crawled Page" screenshot to confirm.
5. Budget distribution by URL type. What percentage of verified Googlebot requests land on pages that are indexed and valuable versus pages that are excluded, noindexed, or redirect chains? If more than 30 to 40 percent of crawl budget lands on non-indexable pages, the crawl architecture has a budget waste problem that no amount of content improvement will fix.
6. Crawl frequency for newly published content. For each page published in the last 30 days, how quickly did the first Googlebot request appear in the log? Pages that are not visited within 72 hours of publication typically have a discovery problem: they lack strong internal links from already-crawled pages and are not being surfaced through the sitemap fast enough.
7. Post-deployment crawl behavior. After deploying robots.txt changes, redirect updates, or canonical restructuring, monitor logs to confirm Googlebot is behaving as expected. Robots.txt changes usually take effect within about a day, because Google generally caches the file for up to 24 hours; internal link changes propagate only when Googlebot next visits the linking page, which may be days later.
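Several of these signals (crawl frequency, status code distribution, and average response size per template) come out of a single grouping pass. The sketch below assumes entries have already been parsed into dictionaries with `url`, `status`, and `size` keys and verified as Googlebot; the URL prefixes in `TEMPLATES` are hypothetical and should be replaced with your own site's patterns.

```python
from collections import Counter, defaultdict

# Hypothetical URL templates; replace these prefixes with your own patterns.
TEMPLATES = [("/product/", "product"), ("/category/", "category"), ("/blog/", "blog")]


def classify(url):
    """Assign a URL to a template; any URL with a query string counts as 'parameter'."""
    if "?" in url:
        return "parameter"
    for prefix, name in TEMPLATES:
        if url.startswith(prefix):
            return name
    return "other"


def summarize_by_template(entries):
    """Return hits, average response size, and status-class shares per template."""
    stats = defaultdict(lambda: {"hits": 0, "bytes": 0, "status": Counter()})
    for e in entries:
        s = stats[classify(e["url"])]
        s["hits"] += 1
        s["bytes"] += e["size"]
        s["status"][f"{e['status'] // 100}xx"] += 1
    return {
        name: {
            "hits": s["hits"],
            "avg_size": s["bytes"] / s["hits"],
            "status_share": {k: v / s["hits"] for k, v in s["status"].items()},
        }
        for name, s in stats.items()
    }
```

Comparing `hits` across templates exposes an inverted architecture (signal 1), `status_share` covers signal 2, and an `avg_size` far below the expected rendered size for a JavaScript-heavy template points at signal 4.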
Log File Tools: From Spreadsheet to Enterprise Platform
For smaller sites with manageable log volumes: export your server logs, filter for Googlebot entries in Excel or Python, and analyze by URL pattern. Screaming Frog's Log File Analyser ($259/year alongside the SEO Spider) handles parsing, bot verification, and basic frequency analysis without code.[5]
For mid-size sites: tools including JetOctopus and SEOlyzer import logs continuously and provide dashboards showing crawl frequency by template, budget distribution, and anomaly detection.
For enterprise sites with millions of URLs: Botify and Lumar (formerly Deepcrawl) combine log file analysis with crawl data and Search Console integration, enabling correlation between crawl frequency, indexing state, and search performance at scale, revealing which crawled pages drive traffic and which are crawled at high frequency but never rank.
Source 3: Third-Party Crawlers and Simulating Googlebot's Link-Following
Screaming Frog: Exposing Crawl Traps, Redirect Loops, and Broken Links
Screaming Frog's SEO Spider crawls your site from a configurable starting URL, following internal links exactly as a search engine spider would. It records every URL visited, its response code, crawl depth, inbound link count, canonical URL, robots.txt status, and a full suite of on-page signals. The result is a complete map of what is crawlable through your site's link structure, independent of what Googlebot has actually done.[4]
The most diagnostically useful reports for crawl problem investigation:
Response Codes tab, filtered to 4xx. Every internal link to a 4xx URL is a broken internal link. Each one wastes crawl budget: when Googlebot follows it, the crawl slot is spent on a dead response. The Inlinks panel at the bottom shows which source pages contain the broken link, making repair straightforward.
Response Codes tab, filtered to 3xx. The Redirect Chains export (Bulk Export, Redirects, Redirect Chains) maps every multi-hop chain across the site, showing the full sequence of URLs and the status code at each step. Sort by chain length to prioritize the longest chains affecting high-authority pages first.
Directives tab, filtered to "Noindexed" with Inlinks. Pages that carry a noindex directive but still receive inbound internal links waste the crawl slots Googlebot spends following those links. If a page should not be indexed, it should also receive minimal internal links, because any authority passed to it is dissipated.
Site Structure visualization. The crawl depth and directory tree visualizations expose pages sitting at abnormal click depths. Any important page deeper than 4 clicks from the homepage should be connected with additional internal links from shallower pages.
Sitemap versus crawl comparison. Importing your XML sitemap alongside the crawl result allows Screaming Frog to identify three categories: URLs in the sitemap that are reachable by crawling (correct), URLs in the sitemap that are not reachable by crawling (potential orphans or incorrectly structured sitemaps), and URLs reachable by crawling that are not in the sitemap (potential discovery gaps).
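The sitemap versus crawl comparison is a plain set operation, so it can be reproduced on exported URL lists outside any tool. The sketch below is not Screaming Frog's own logic; it assumes both lists use the same URL normalization (scheme, host, trailing slash), and the category names are this example's own.

```python
def compare_sitemap_to_crawl(sitemap_urls, crawled_urls):
    """Split URLs into the three categories described above."""
    sitemap, crawled = set(sitemap_urls), set(crawled_urls)
    return {
        "in_both": sorted(sitemap & crawled),        # correct
        "sitemap_only": sorted(sitemap - crawled),   # potential orphans
        "crawl_only": sorted(crawled - sitemap),     # potential discovery gaps
    }


result = compare_sitemap_to_crawl(
    ["/a/", "/b/", "/orphan/"],
    ["/a/", "/b/", "/unlisted/"],
)
print(result["sitemap_only"], result["crawl_only"])
```

Inconsistent normalization is the usual source of false positives here: `/page` and `/page/` will land in opposite buckets unless both inputs are cleaned the same way first.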
Sitebulb: Visual Architecture and Prioritized Issue Reporting
Sitebulb produces the same underlying crawl data as Screaming Frog but presents it through a prioritized issue framework and richer visualizations. Each detected issue is categorized by severity (critical, warning, advisory) with an explanation of why it matters and what to do.
The Crawl Map is Sitebulb's most distinctive feature for crawl diagnosis. It renders the site's link structure as an interactive visual diagram where node size represents internal link count and node color represents crawl depth. Pages that appear as small, distant nodes with few connections are immediately visible as candidates for architectural improvement. Clusters of isolated nodes indicate topic silos that are not properly cross-linked; pages that should be hub-level but appear as peripheral spokes indicate internal authority misallocation.
Sitebulb also integrates Google Search Console data, allowing crawl findings (click depth, redirect chains, blocked URLs) to be correlated with indexing state (Crawled not indexed, Discovered not indexed) within a single interface. This correlation is operationally valuable: a page that is Discovered not indexed AND sits at depth 7 AND has two inbound internal links AND is excluded from the sitemap has a clear compound diagnosis that drives a clear compound fix.
When to Use Each Tool
Use Screaming Frog for raw data access, custom extraction, and situations where you need to export specific data sets for further analysis in spreadsheet or scripting environments. It is faster for initial crawls and more configurable for custom audit workflows.
Use Sitebulb when the output needs to be presentable to stakeholders who are not deeply technical, when the crawl map visualization would help communicate architectural problems, or when you want the tool to guide prioritization through its built-in issue severity scoring.
Use both together on complex sites: Screaming Frog for the technical data extraction and Sitebulb for the architectural visualization and client reporting layer.
What Does a Complete Crawl Audit Workflow Look Like?
The eight-step sequence moves from high-level symptom identification through root cause isolation to fix validation.
Step 1: Start in Crawl Stats. Open Search Console, navigate to Settings, then Crawl Stats. Note the 30-day averages for total requests, average response time, and status code distribution. Flag anything anomalous: response time above 200ms, 4xx or 5xx rates above 2 to 3 percent, flat Discovery crawl volume despite active publishing.
Step 2: Read the Pages report by status. Export each relevant indexing status (Discovered not indexed, Crawled not indexed, Valid with warnings) and note the URL patterns. Are the problems concentrated in a specific section of the site? A specific URL template? A recently migrated subfolder?
Step 3: Pull and verify log files. Export server logs for the past 30 days. Filter for Googlebot user-agent entries, verify each IP address against Google's published ranges or via reverse/forward DNS, and discard unverified entries. Build a frequency distribution: which URL templates receive the most crawl visits? What percentage of verified crawl requests return 200, 3xx, 4xx, and 5xx responses?
Step 4: Calculate budget distribution. From the verified log data, calculate what percentage of Googlebot's requests landed on pages that are indexed versus pages that are excluded, noindexed, or return errors. If more than 30 to 40 percent of budget is consumed by non-indexable pages, identify the top contributing URL patterns.
Step 5: Check response sizes for JavaScript rendering failures. For any URL template where JavaScript-rendered content should be present, compare average response size in logs against what the fully rendered page should return. Unexpectedly small response sizes confirm rendering failures before any other tool is needed.
Step 6: Run a third-party crawler. Crawl the site with Screaming Frog or Sitebulb starting from the homepage. Export the redirect chain report and sort by chain length. Export the 4xx internal inlinks report. Note click depth for every page that appeared in the Discovered not indexed or Crawled not indexed states from Step 2. Cross-reference: pages that are Discovered not indexed AND have high click depth AND few inbound links have a clear structural diagnosis.
Step 7: Match symptoms to root causes. Use the compound diagnosis framework:
| Symptom | Most likely root cause | Primary fix |
|---|---|---|
| Discovered not indexed, high click depth | Click depth exceeds crawl priority threshold | Add internal links from shallower pages; restructure navigation |
| Discovered not indexed, normal depth, sitemap submitted | Crawl budget consumed by low-value URL patterns | Block parameter/faceted URLs in robots.txt; fix crawl budget wasters |
| Crawled not indexed, thin content | Quality threshold not met | Improve content depth; merge into stronger page |
| Crawled not indexed, near-duplicate | Near-duplicate filter applied | Add canonical tag; improve differentiation |
| High 4xx in logs | Broken internal links pointing at dead URLs | Fix or redirect source links; return 410 for permanently removed pages |
| High 3xx in logs | Redirect chains across site | Flatten all chains to single-hop direct 301 redirects |
| High 5xx in logs | Server instability | Address server-side issues; use CDN; optimize response time |
| Low response size for JS pages | JavaScript not being rendered | Implement SSR or dynamic rendering for SEO-critical pages |
| Flat Discovery crawl volume | New pages not reaching the frontier | Improve internal linking to new pages from already-crawled high-authority pages |
Step 8: Validate fixes in log files and Search Console. After implementing fixes, monitor log files to confirm behavior changes: redirect chains should collapse to single hops within the next crawl cycle for affected URLs; 4xx rates should drop as broken links are repaired. Search Console status changes for Discovered URLs take longer: typically one to four weeks before pages move from Discovered to Crawled as the crawl priority adjusts. Set calendar reminders to re-check the Pages report 30 and 60 days after major structural fixes to confirm improvement.
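The budget calculation from Step 4 is also a useful before-and-after measure when validating fixes. The sketch below is one possible way to compute it: it assumes a list of URLs from verified Googlebot requests and a set of URLs you consider indexable (for example, from a Pages report export). The function name, the 30 percent default, and the grouping by first path segment are illustrative choices, not a standard method.

```python
from collections import Counter


def budget_distribution(requested_urls, indexable_urls, waste_threshold=0.30):
    """Share of verified Googlebot requests that hit non-indexable URLs."""
    indexable = set(indexable_urls)
    wasted = [u for u in requested_urls if u not in indexable]
    total = len(requested_urls)
    share = len(wasted) / total if total else 0.0
    # Group wasted requests by first path segment, ignoring any query string.
    patterns = Counter(
        "/" + u.split("?")[0].strip("/").split("/")[0] for u in wasted
    )
    return {
        "wasted_share": share,
        "over_threshold": share > waste_threshold,
        "top_patterns": patterns.most_common(3),
    }
```

Running the same function on log windows from before and after a fix gives a direct comparison: the wasted share should fall, and the patterns you blocked or repaired should drop out of `top_patterns`.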
Sources
[1] Google Search Central. "Crawl Stats Report." Search Console Help.
[2] Google Search Central. "Page Indexing Report." Search Console Help.
[3] Google Crawling Infrastructure. "Verifying Googlebot and Other Google Crawlers."
[4] Screaming Frog. "SEO Spider."
[5] Screaming Frog. "Log File Analyser."





