An XML sitemap is a structured file that lists the URLs you want search engines to consider for crawling and indexing. It is a declaration of intent, not a command. Google's own documentation is explicit on this point: submitting a sitemap is "merely a hint" and guarantees neither crawling nor indexing of the URLs it contains. Built correctly, a sitemap accelerates URL discovery for new content, helps Google find deep pages with weak internal link coverage, and provides the lastmod freshness signal that influences recrawl prioritization. Built incorrectly, it becomes noise that erodes Google's trust in your signals over time. This article covers the full sitemaps.org protocol schema, the inclusion and exclusion logic that makes a sitemap useful rather than misleading, and how the Sitemaps report in Search Console functions as a content quality audit tool, not just a submission tracker.
What Is an XML Sitemap?
An XML sitemap is a file in the sitemaps.org XML format that lists URLs on a website along with optional metadata about each URL: when it was last modified, how often it changes, and its relative priority on the site. Google, Bing, and other major search engines read this file to supplement URL discovery through link-following. A sitemap does not replace crawling; it accelerates it.
Google introduced the Sitemaps protocol in 2005. In November 2006, Google, Yahoo, and Microsoft announced joint support for it, standardizing a mechanism for webmasters to communicate their site's URL structure directly to search engines. Sitemaps.org remains the canonical specification. The schema namespace is http://www.sitemaps.org/schemas/sitemap/0.9.
What Does the Sitemaps.org Protocol Actually Contain?
A valid XML sitemap contains a required XML declaration, a <urlset> root element with the sitemaps.org namespace, and one <url> block per page. Within each <url> block, four child elements are defined by the protocol:[1]
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/search-engine-basics/</loc>
    <lastmod>2026-04-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
| Element | Required | What it does |
|---|---|---|
| <loc> | Yes | The full absolute URL of the page. Must use HTTPS if the site is HTTPS. |
| <lastmod> | No | The date the page was last meaningfully updated, in W3C Datetime format (YYYY-MM-DD or with timestamp). |
| <changefreq> | No | Suggested crawl frequency hint. Values: always, hourly, daily, weekly, monthly, yearly, never. |
| <priority> | No | Relative importance of this URL on your site (0.0 to 1.0). Default is 0.5. |
Two of those four elements are effectively useless for Google. One is conditionally useful but only when accurate. One is mandatory and the whole point. Understanding which is which saves significant time and prevents the credibility erosion that comes from feeding search engines false signals.
What Should You Include in an XML Sitemap?
An XML sitemap should contain only pages you genuinely want indexed in search results, that are currently accessible, and that represent the canonical version of their content.
Only Canonical URLs Returning HTTP 200
Every URL in your sitemap should satisfy all three of these conditions simultaneously: it returns an HTTP 200 status code; it is the canonical version of its content (self-referencing canonical tag or no conflicting canonical signals); and it has no noindex meta tag or X-Robots-Tag header.
Google's documentation states directly that it will use the sitemap as a signal about which URLs matter. Including URLs that fail any of those conditions sends contradictory signals. A URL in your sitemap that returns a 301 tells Google you're recommending a page that itself redirects elsewhere. A noindexed URL in your sitemap tells Google you both want it discovered and don't want it indexed, a contradiction that reduces confidence in your sitemap's accuracy over time.
The right approach: treat your sitemap as a curated, authoritative list of indexable pages. Every URL on it should be something you would actively want to appear in Google's search results tomorrow.
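The three inclusion conditions can be expressed as a single predicate. The sketch below is a minimal illustration, not a crawler: it assumes the status code, canonical URL, and noindex flag have already been collected by some other tool, and the `PageRecord` structure and `is_sitemap_eligible` function are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageRecord:
    """Crawl facts for one URL (hypothetical structure, gathered elsewhere)."""
    url: str
    status: int                      # HTTP status of the URL itself, redirects not followed
    canonical: Optional[str] = None  # href of rel="canonical", or None if absent
    noindex: bool = False            # True if a meta robots or X-Robots-Tag noindex is present


def is_sitemap_eligible(page: PageRecord) -> bool:
    """Return True only if the page meets all three inclusion conditions."""
    returns_200 = page.status == 200
    # No canonical tag, or a self-referencing one, counts as canonical here.
    is_canonical = page.canonical is None or page.canonical == page.url
    return returns_200 and is_canonical and not page.noindex


pages = [
    PageRecord("https://example.com/guide/", 200, "https://example.com/guide/"),
    PageRecord("https://example.com/old-guide/", 301),
    PageRecord("https://example.com/guide/?ref=nav", 200, "https://example.com/guide/"),
    PageRecord("https://example.com/thanks/", 200, noindex=True),
]
eligible = [p.url for p in pages if is_sitemap_eligible(p)]
print(eligible)
```

Only the first record passes: the second redirects, the third points its canonical at a different URL, and the fourth is noindexed.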
The lastmod Element: The One Metadata Field That Actually Matters
Google uses <lastmod> if it is "consistently and verifiably accurate." John Mueller confirmed in 2017 that the URL and last modification date are what Google cares about in sitemaps for web search. The key word is accurate. If your sitemap shows a recent lastmod date for a page that has not actually changed, Google will stop trusting your lastmod signals, effectively nullifying them across your entire sitemap.
Update lastmod only when meaningful content changes occur: a significant rewrite of the main text, new structured data, a change to key images. Do not update it for cosmetic layout adjustments, header redesigns, or copyright year changes. A credible lastmod signal accelerates recrawl prioritization for genuinely refreshed content. An inflated lastmod signal eliminates that advantage for every URL on the site.
Use the W3C date and time format. The date-only format YYYY-MM-DD is sufficient for most cases. Full timestamps with timezone offsets (2026-04-15T14:30:00+00:00) are appropriate for high-frequency content publishers who want to signal specific update times.
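Both forms can be produced with Python's standard library. This is a small sketch; the helper names `lastmod_date` and `lastmod_timestamp` are illustrative assumptions, and requiring an explicit timezone is a design choice made here so that a timestamp is never emitted without its offset.

```python
from datetime import date, datetime, timezone


def lastmod_date(d: date) -> str:
    """Format a date (or datetime) as the date-only form YYYY-MM-DD."""
    return d.strftime("%Y-%m-%d")


def lastmod_timestamp(dt: datetime) -> str:
    """Format a timezone-aware datetime as a full timestamp with offset."""
    if dt.tzinfo is None:
        raise ValueError("lastmod timestamps need an explicit timezone")
    # Drop microseconds so the output stays at second precision.
    return dt.replace(microsecond=0).isoformat()


print(lastmod_date(date(2026, 4, 15)))
print(lastmod_timestamp(datetime(2026, 4, 15, 14, 30, tzinfo=timezone.utc)))
```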
What Should You Exclude from an XML Sitemap?
The rule of exclusion is the inverse of the inclusion rule: any URL that fails the conditions of being canonical, 200-status, and indexable should be removed from your sitemap. In practice, this means the following categories should never appear:
| URL type | Why to exclude |
|---|---|
| Redirect URLs (301, 302) | Signal a recommended page that doesn't actually exist at that address |
| noindex pages | Contradicts the page's own directive; confuses Google about intent |
| Non-canonical URLs | Competing versions of the same content; include only the canonical URL |
| Paginated URLs beyond page 1 | /category/page/2 and beyond typically add no unique indexable value |
| Faceted navigation / filter URLs | Duplicate or near-duplicate content; wastes crawl budget if discovered |
| URLs blocked by robots.txt | Cannot be crawled even if discovered; creates conflicting signals |
| 404 and 410 URLs | Report dead content as available |
| Thin content pages | Invites crawling and assessment of pages that damage site quality signals |
| Session ID or tracking parameter URLs | Duplicate content with ephemeral identifiers |
The most common sitemap quality problem in the wild is not the absence of needed URLs but the presence of unneeded ones. A sitemap that includes noindexed tag pages, every paginated archive page, and expired product pages with a lastmod of today is not a useful discovery tool. It is a misleading document that trains Google to treat the site's signals as unreliable.
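Some of the exclusion categories above can be caught from the URL string alone, before any page is fetched. The following sketch assumes a `/page/N` pagination scheme and a short list of tracking parameter names; both are assumptions for illustration and would need adjusting to a real site's URL patterns. Status codes, noindex, and thin content still require crawl data.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Assumed parameter names; extend to match the site's own analytics setup.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "sid"}
# Assumed pagination scheme: paths ending in /page/<number>.
PAGINATION_PATH = re.compile(r"/page/(\d+)/?$")


def exclusion_reason(url: str) -> Optional[str]:
    """Return why a URL should be left out of the sitemap, or None to keep it."""
    parts = urlsplit(url)
    params = parse_qs(parts.query, keep_blank_values=True)
    if TRACKING_PARAMS & set(params):
        return "tracking or session parameter"
    if params:
        return "query parameters (possible facet or filter URL)"
    match = PAGINATION_PATH.search(parts.path)
    if match and int(match.group(1)) > 1:
        return "paginated beyond page 1"
    return None


for candidate in [
    "https://example.com/shoes/",
    "https://example.com/shoes/?color=red",
    "https://example.com/shoes/?utm_source=newsletter",
    "https://example.com/category/page/2",
]:
    print(candidate, "->", exclusion_reason(candidate))
```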
What Do <changefreq> and <priority> Actually Do?
For Google: nothing.[2]
John Mueller stated in 2015 that "priority and change frequency doesn't really play that much of a role with Sitemaps anymore." Google's own documentation confirms this directly: "Google ignores <priority> and <changefreq> values." Gary Illyes, analyst on the Google Search team, went further and called the <priority> tag "a bag of noise."
The reason is history. When these fields were first introduced, webmasters immediately gamed them: every URL was assigned priority=1.0 and changefreq=always. The signal became worthless because it carried no information. Google's systems observe actual crawl data, site authority, and link signals to determine crawl priority; they do not need (and no longer trust) self-reported hints.
Notice how this changes your sitemap production workflow: if you are spending time assigning <priority> values to your pages, that time is wasted for Google. The only fields worth populating are <loc> (required) and <lastmod> (if you can maintain it accurately). A lean sitemap with just those two fields is faster to parse, smaller in file size, and carries more credible signals than a bloated one with four fields filled with arbitrary values.
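A lean sitemap of this kind can be generated with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch: the `build_sitemap` function and the example URLs are assumptions, and a production generator would also enforce the per-file limits.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Serialize (loc, lastmod) pairs as a sitemap with only those two fields.

    lastmod may be None, in which case the element is omitted for that URL.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc  # ElementTree escapes & and < for us
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


xml_bytes = build_sitemap([
    ("https://example.com/search-engine-basics/", "2026-04-15"),
    ("https://example.com/about/", None),
])
print(xml_bytes.decode("utf-8"))
```

Omitting lastmod when no trustworthy value exists is deliberate: a missing field carries no signal, while an inaccurate one carries a misleading signal.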
<changefreq> retains some function for Bing, which has indicated it uses the value as a crawl scheduling hint. If Bing traffic matters to your site, including accurate <changefreq> values remains worthwhile. For Google-only optimization, omit it.
How Do You Structure Sitemaps for Large Sites?
A single sitemap file has two hard limits: 50,000 URLs maximum and 50 MB maximum file size (uncompressed). These are not soft guidelines. Google and Bing will truncate sitemap files that exceed either limit, and the truncation point is unpredictable, meaning some of your important URLs may simply not be read.[4]
Sites with more than 50,000 indexable URLs need a sitemap index file. This is a parent XML file that lists individual child sitemap files rather than individual page URLs:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-articles.xml</loc>
    <lastmod>2026-04-20</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2026-04-18</lastmod>
  </sitemap>
</sitemapindex>
The sitemap index file itself can reference up to 50,000 child sitemaps, creating a theoretical maximum of 2.5 billion URLs across a single sitemap index. In practice, splitting by content type (articles, products, categories, authors) rather than by page number makes the Sitemaps report far more diagnostic: you can see exactly which segments of your site are being indexed at what rate.
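When a single content type exceeds the per-file limit, it still has to be split into numbered files within that type. The sketch below shows one way to chunk a URL list and generate the matching index; the function names, the `sitemap-products-N.xml` naming scheme, and the sample data are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000  # per-file limit from the protocol


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into consecutive chunks of at most `size` items."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(children):
    """Serialize (loc, lastmod) pairs for child sitemaps as a sitemap index."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc, lastmod in children:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
        if lastmod:
            ET.SubElement(sitemap, "lastmod").text = lastmod
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)


# 120,001 product URLs need three child sitemaps.
urls = [f"https://example.com/products/{n}/" for n in range(120_001)]
chunks = chunk_urls(urls)
children = [
    (f"https://example.com/sitemap-products-{i}.xml", "2026-04-18")
    for i, _ in enumerate(chunks, start=1)
]
index_xml = build_sitemap_index(children)
print([len(c) for c in chunks])
```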
Compress all sitemap files with gzip (.xml.gz). This reduces file transfer size by 70 to 90 percent and is supported by all major search engines. Google accepts compressed sitemaps transparently. A 45 MB uncompressed sitemap becomes approximately 4 to 5 MB compressed, reducing the bandwidth hit on every crawl cycle.
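Because the 50 MB limit applies to the uncompressed file, the size check belongs before compression. A minimal sketch using the standard `gzip` module follows; the `compress_sitemap` helper and its `limit` parameter are hypothetical, and the actual compression ratio depends on the content.

```python
import gzip

MAX_UNCOMPRESSED_BYTES = 50 * 1024 * 1024  # 50 MB, i.e. 52,428,800 bytes


def compress_sitemap(xml_bytes: bytes, limit: int = MAX_UNCOMPRESSED_BYTES) -> bytes:
    """Gzip a serialized sitemap after checking its uncompressed size."""
    if len(xml_bytes) > limit:
        raise ValueError(
            f"sitemap is {len(xml_bytes)} bytes uncompressed; limit is {limit}"
        )
    return gzip.compress(xml_bytes)


sample = b"<url><loc>https://example.com/page/</loc></url>\n" * 1000
packed = compress_sitemap(sample)
print(len(sample), "bytes uncompressed,", len(packed), "bytes compressed")
```

The output would be written to a file with the .xml.gz extension and referenced by that name in the sitemap index.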
How Do You Submit a Sitemap to Google?
There are three submission methods, each appropriate for different use cases.[3]
The primary method is the Sitemaps report in Search Console, under Indexing. Enter the path to your sitemap or sitemap index file and submit. This gives you direct visibility into when Google last read the file, the number of discovered URLs, and any processing errors.
The secondary method is the Sitemap: directive in robots.txt. Crawlers fetch robots.txt before crawling a site, and Google generally caches it for up to 24 hours rather than refetching it for every request. Adding Sitemap: https://example.com/sitemap.xml to robots.txt ensures that any crawler visiting your site discovers your sitemap automatically, even if Search Console submission is delayed or unavailable.
User-agent: *
Disallow: /wp-admin/
Sitemap: https://example.com/sitemap_index.xml
The tertiary method is direct URL ping, where you send an HTTP GET request to https://www.google.com/ping?sitemap=https://example.com/sitemap.xml. Google announced the deprecation of this endpoint in June 2023, and it no longer accepts sitemap notifications, so treat it as a legacy method. It was aimed at high-frequency content publishers who updated their sitemap multiple times per day; for Google, an accurate lastmod value in a sitemap it already knows about now serves that purpose.
For most sites, the Search Console submission plus the robots.txt reference is sufficient. The robots.txt reference serves as a passive always-on notification; the Search Console submission provides the monitoring interface that makes sitemap health visible.
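To confirm that the robots.txt reference is actually discoverable, the file can be parsed the way a crawler would. This sketch uses `urllib.robotparser` from Python's standard library (the `site_maps()` method requires Python 3.8 or later) and parses an inline copy of the example file rather than fetching it over the network.

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /wp-admin/
Sitemap: https://example.com/sitemap_index.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns a list of declared sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```

If the result is None, the Sitemap: line is missing or malformed.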
How Does the Sitemaps Report Work as a Content Quality Audit?
After submission, the Sitemaps report shows two numbers: submitted URLs and indexed URLs. The ratio between them is the most direct measure of your sitemap's quality and, by extension, your site's indexable content quality.
According to practitioners monitoring this metric across client sites, a submitted-to-indexed ratio of 90 percent or above indicates a healthy sitemap. A ratio between 60 and 90 percent indicates content quality issues: thin pages, near-duplicate archives, or weakly linked content that Google discovered and chose not to index. A ratio below 60 percent indicates a structural problem, typically a large volume of low-value auto-generated pages (tag archives, author pages, parameter variants) that have been included in the sitemap but carry insufficient quality signals to earn indexing.
The important insight is that a low ratio is almost never a technical sitemap problem. The fix is not to adjust the sitemap XML. The fix is to either improve the quality of the excluded pages until they warrant indexing, or remove those pages from the sitemap (and optionally apply noindex) so the ratio improves by reducing the denominator rather than by improving the numerator. Submitting fewer, better pages is always preferable to submitting more pages and watching most of them get excluded.
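The thresholds and the denominator effect can be made concrete with a few lines of arithmetic. The sketch below encodes the practitioner bands described above (they are rules of thumb, not Google-defined values), and the function names and sample counts are assumptions.

```python
def index_ratio(submitted: int, indexed: int) -> float:
    """Share of submitted URLs that Google reports as indexed."""
    if submitted <= 0:
        raise ValueError("submitted must be a positive count")
    return indexed / submitted


def classify_ratio(ratio: float) -> str:
    """Map a ratio onto the practitioner bands (rules of thumb only)."""
    if ratio >= 0.90:
        return "healthy"
    if ratio >= 0.60:
        return "content quality issues"
    return "structural problem"


# Before cleanup: 1,000 URLs submitted, 550 indexed.
before = index_ratio(1000, 550)
# After removing 400 low-value URLs that were never indexed: 600 submitted, 550 indexed.
after = index_ratio(600, 550)

print(f"{before:.0%} -> {classify_ratio(before)}")
print(f"{after:.0%} -> {classify_ratio(after)}")
```

The indexed count does not change in this example; only the denominator shrinks, which is exactly the effect of pruning pages that were never going to be indexed.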
The Sitemaps report also shows the "Last read" date, which should be within the last 7 days for an active site. A stale last-read date indicates Googlebot has stopped checking your sitemap, usually because of repeated errors in the file (broken URLs, malformed XML, or a sitemap that returned a 5xx status when Googlebot tried to fetch it). Fixing sitemap fetch errors restores the regular read cycle.
Sources
[1] Sitemaps.org. "Sitemap Protocol."
[2] Sitemaps.org. "Frequently Asked Questions."
[3] Google Search Central. "Build and Submit a Sitemap."
[4] Google Search Central. "Manage Sitemaps with Sitemap Index Files."





