
XML Sitemaps Explained: Schema, What to Include, What to Exclude, and Submission

Jun 8, 2026 · 10 min read

An XML sitemap is a structured file that lists the URLs you want search engines to consider for crawling and indexing. It is a declaration of intent, not a command. Google's own documentation is explicit on this point: submitting a sitemap is "merely a hint" and guarantees neither crawling nor indexing of the URLs it contains. Built correctly, a sitemap accelerates URL discovery for new content, helps Google find deep pages with weak internal link coverage, and provides the lastmod freshness signal that influences recrawl prioritization. Built incorrectly, it becomes noise that erodes Google's trust in your signals over time. This article covers the full sitemaps.org protocol schema, the inclusion and exclusion logic that makes a sitemap useful rather than misleading, and how the Sitemaps report in Search Console functions as a content quality audit tool, not just a submission tracker.


What Is an XML Sitemap?

An XML sitemap is a file in the sitemaps.org XML format that lists URLs on a website along with optional metadata about each URL: when it was last modified, how often it changes, and its relative priority on the site. Google, Bing, and other major search engines read this file to supplement URL discovery through link-following. A sitemap does not replace crawling; it accelerates it.

The sitemaps.org protocol was created in 2005 as a joint initiative by Google, Yahoo, and Microsoft to standardize a mechanism for webmasters to communicate their site's URL structure directly to search engines. Sitemaps.org remains the canonical specification. The schema is defined at http://www.sitemaps.org/schemas/sitemap/0.9.


What Does the Sitemaps.org Protocol Actually Contain?

A valid XML sitemap contains a required XML declaration, a <urlset> root element with the sitemaps.org namespace, and one <url> block per page. Within each <url> block, four child elements are defined by the protocol:[1]

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/search-engine-basics/</loc>
    <lastmod>2026-04-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

| Element | Required | What it does |
|---|---|---|
| <loc> | Yes | The full absolute URL of the page. Must use HTTPS if the site is HTTPS. |
| <lastmod> | No | The date the page was last meaningfully updated, in W3C date format (YYYY-MM-DD or with timestamp). |
| <changefreq> | No | Suggested crawl frequency hint. Values: always, hourly, daily, weekly, monthly, yearly, never. |
| <priority> | No | Relative importance of this URL on your site (0.0 to 1.0). Default is 0.5. |

Two of those four elements are effectively useless for Google. One is conditionally useful but only when accurate. One is mandatory and the whole point. Understanding which is which saves significant time and prevents the credibility erosion that comes from feeding search engines false signals.
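To see which of the four elements an existing sitemap actually populates, a short parser helps. The following is a minimal sketch using Python's standard library; the function name read_sitemap and the SAMPLE document are illustrative, not part of the protocol.

```python
import xml.etree.ElementTree as ET

# Namespace from the protocol; the "sm" prefix is an arbitrary local alias.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Illustrative sample: only <loc> and <lastmod> are populated.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/search-engine-basics/</loc>
    <lastmod>2026-04-15</lastmod>
  </url>
</urlset>"""


def read_sitemap(xml_text):
    """Return one dict per <url> block; absent optional elements map to None."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        entry = {}
        for field in ("loc", "lastmod", "changefreq", "priority"):
            value = url.findtext(f"sm:{field}", default=None, namespaces=SITEMAP_NS)
            entry[field] = value.strip() if value else None
        entries.append(entry)
    return entries


if __name__ == "__main__":
    for entry in read_sitemap(SAMPLE):
        print(entry)
```

Every lookup has to be namespace-qualified: a plain findall("url") returns nothing because the elements live in the sitemaps.org namespace.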


What Should You Include in an XML Sitemap?

An XML sitemap should contain only pages you genuinely want indexed in search results, that are currently accessible, and that represent the canonical version of their content.

Only Canonical URLs Returning HTTP 200

Every URL in your sitemap should satisfy all three of these conditions simultaneously: it returns an HTTP 200 status code; it is the canonical version of its content (self-referencing canonical tag or no conflicting canonical signals); and it has no noindex meta tag or X-Robots-Tag header.

Google's documentation states directly that it will use the sitemap as a signal about which URLs matter. Including URLs that fail any of those conditions sends contradictory signals. A URL in your sitemap that returns a 301 tells Google you're recommending a page that itself redirects elsewhere. A noindexed URL in your sitemap tells Google you both want it discovered and don't want it indexed, a contradiction that reduces confidence in your sitemap's accuracy over time.

The right approach: treat your sitemap as a curated, authoritative list of indexable pages. Every URL on it should be something you would actively want to appear in Google's search results tomorrow.
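The three conditions can be expressed as a single filter applied to crawl data before the sitemap is generated. This is a sketch under assumptions: PageRecord and its fields are a hypothetical shape for data you would collect from your own crawler or CMS, not an existing API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageRecord:
    """Hypothetical crawl record; field names are assumptions for this sketch."""
    url: str
    status: int
    canonical: Optional[str] = None  # href of rel="canonical", None if absent
    robots_meta: str = ""            # content of <meta name="robots">
    x_robots_tag: str = ""           # value of the X-Robots-Tag header


def is_sitemap_eligible(page: PageRecord) -> bool:
    """True only if the page is 200, canonical, and not noindexed."""
    if page.status != 200:
        return False
    # A canonical pointing elsewhere means this URL is not the canonical version.
    if page.canonical is not None and page.canonical != page.url:
        return False
    directives = f"{page.robots_meta},{page.x_robots_tag}".lower()
    if "noindex" in directives:
        return False
    return True


if __name__ == "__main__":
    pages = [
        PageRecord("https://example.com/a/", 200, "https://example.com/a/"),
        PageRecord("https://example.com/old/", 301),
        PageRecord("https://example.com/tag/x/", 200, robots_meta="noindex, follow"),
    ]
    print([p.url for p in pages if is_sitemap_eligible(p)])
```

The canonical comparison here is an exact string match; a production version would likely normalize trailing slashes and case before comparing.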

The lastmod Element: The One Metadata Field That Actually Matters

Google uses <lastmod> if it is "consistently and verifiably accurate." John Mueller confirmed in 2017 that the URL and last modification date are what Google cares about in sitemaps for web search. The key word is accurate. If your sitemap shows a recent lastmod date for a page that has not actually changed, Google will stop trusting your lastmod signals, effectively nullifying them across your entire sitemap.

Update lastmod only when meaningful content changes occur: a significant rewrite of the main text, new structured data, a change to key images. Do not update it for cosmetic layout adjustments, header redesigns, or copyright year changes. A credible lastmod signal accelerates recrawl prioritization for genuinely refreshed content. An inflated lastmod signal eliminates that advantage for every URL on the site.
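One way to keep lastmod honest is to tie it to a fingerprint of the main content rather than to the template or the deploy date. The sketch below assumes you can extract the main body text separately from layout; the function names are hypothetical, and a hash treats any wording change as meaningful, so it is a stricter proxy than the editorial judgment described above.

```python
import hashlib
from datetime import date


def content_fingerprint(main_text: str) -> str:
    """Hash the main text with whitespace collapsed, so reflowed markup does not count."""
    normalized = " ".join(main_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def next_lastmod(previous_hash: str, previous_lastmod: date,
                 main_text: str, today: date):
    """Return (hash, lastmod); lastmod moves only when the main text changed."""
    new_hash = content_fingerprint(main_text)
    if new_hash == previous_hash:
        return previous_hash, previous_lastmod
    return new_hash, today


if __name__ == "__main__":
    old_text = "Sitemaps list the URLs you want crawled."
    stored_hash = content_fingerprint(old_text)
    stored_date = date(2026, 4, 15)
    # Whitespace-only change: lastmod stays at the stored date.
    print(next_lastmod(stored_hash, stored_date,
                       "Sitemaps  list the URLs\nyou want crawled.", date(2026, 6, 8)))
```

Because only the extracted main text is hashed, a footer copyright change or header redesign never touches the stored date.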

Use the W3C date and time format. The date-only format YYYY-MM-DD is sufficient for most cases. Full timestamps with timezone offsets (2026-04-15T14:30:00+00:00) are appropriate for high-frequency content publishers who want to signal specific update times.
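Both forms fall out of Python's ISO 8601 formatting. A small sketch follows; the helper names are illustrative. It rejects timestamps without a timezone, since the full form carries an offset.

```python
from datetime import date, datetime, timezone


def lastmod_date(d: date) -> str:
    """Date-only form: YYYY-MM-DD."""
    return d.isoformat()


def lastmod_timestamp(dt: datetime) -> str:
    """Full form with seconds and a timezone offset."""
    if dt.tzinfo is None:
        raise ValueError("lastmod timestamps need an explicit timezone")
    return dt.isoformat(timespec="seconds")


if __name__ == "__main__":
    print(lastmod_date(date(2026, 4, 15)))
    print(lastmod_timestamp(datetime(2026, 4, 15, 14, 30, tzinfo=timezone.utc)))
```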


What Should You Exclude from an XML Sitemap?

The rule of exclusion is the inverse of the inclusion rule: any URL that fails the conditions of being canonical, 200-status, and indexable should be removed from your sitemap. In practice, this means the following categories should never appear:

| URL type | Why to exclude |
|---|---|
| Redirect URLs (301, 302) | Signal a recommended page that doesn't actually exist at that address |
| noindex pages | Contradicts the page's own directive; confuses Google about intent |
| Non-canonical URLs | Competing versions of the same content; include only the canonical URL |
| Paginated URLs beyond page 1 | /category/page/2 and beyond typically add no unique indexable value |
| Faceted navigation / filter URLs | Duplicate or near-duplicate content; wastes crawl budget if discovered |
| URLs blocked by robots.txt | Cannot be crawled even if discovered; creates conflicting signals |
| 404 and 410 URLs | Report dead content as available |
| Thin content pages | Invites crawling and assessment of pages that damage site quality signals |
| Session ID or tracking parameter URLs | Duplicate content with ephemeral identifiers |

The most common sitemap quality problem in the wild is not the absence of needed URLs but the presence of unneeded ones. A sitemap that includes noindexed tag pages, every paginated archive page, and expired product pages with a lastmod of today is not a useful discovery tool. It is a misleading document that trains Google to treat the site's signals as unreliable.
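Several of these categories can be caught from the URL alone, before any page is fetched. The sketch below is a rough pre-filter under stated assumptions: the /page/N path pattern and the parameter names in TRACKING_PARAMS are illustrative guesses and need adapting to your own URL scheme.

```python
import re
from typing import Optional
from urllib.parse import urlsplit, parse_qs

# Illustrative parameter names; extend to match your own analytics and session setup.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "sessionid", "sid"}
# Assumes pagination appears as a trailing /page/N segment.
PAGINATION_PATH = re.compile(r"/page/(\d+)/?$")


def exclusion_reason(url: str) -> Optional[str]:
    """Return why a URL should be left out of the sitemap, or None if it passes."""
    parts = urlsplit(url)
    match = PAGINATION_PATH.search(parts.path)
    if match and int(match.group(1)) >= 2:
        return "paginated beyond page 1"
    params = parse_qs(parts.query, keep_blank_values=True)
    if TRACKING_PARAMS & {name.lower() for name in params}:
        return "session or tracking parameter"
    if params:
        return "query parameters (possible facet or filter URL)"
    return None


if __name__ == "__main__":
    for candidate in (
        "https://example.com/category/page/2/",
        "https://example.com/shoes?utm_source=newsletter",
        "https://example.com/shoes?color=red",
        "https://example.com/shoes/",
    ):
        print(candidate, "->", exclusion_reason(candidate))
```

URL patterns cannot detect redirects, noindex, or thin content; those still require status and content checks.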


What Do <changefreq> and <priority> Actually Do?

For Google: nothing.[2]

John Mueller stated in 2015 that "priority and change frequency doesn't really play that much of a role with Sitemaps anymore." Google's own documentation confirms this directly: "Google ignores <priority> and <changefreq> values." Gary Illyes, analyst on the Google Search team, went further and called the <priority> tag "a bag of noise."

The reason is history. When these fields were first introduced, webmasters immediately gamed them: every URL was assigned priority=1.0 and changefreq=always. The signal became worthless because it carried no information. Google's systems observe actual crawl data, site authority, and link signals to determine crawl priority; they do not need (and no longer trust) self-reported hints.

Notice how this changes your sitemap production workflow: if you are spending time assigning <priority> values to your pages, that time is wasted for Google. The only fields worth populating are <loc> (required) and <lastmod> (if you can maintain it accurately). A lean sitemap with just those two fields is faster to parse, smaller in file size, and carries more credible signals than a bloated one with four fields filled with arbitrary values.
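A lean two-field sitemap is simple to generate. The following sketch uses Python's standard library; build_sitemap is a hypothetical helper name, and the input is assumed to be a list of (loc, lastmod) pairs where lastmod may be None.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: List[Tuple[str, Optional[str]]]) -> bytes:
    """Serialize (loc, lastmod) pairs; <lastmod> is omitted when no date is known."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc  # special characters are escaped on output
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    xml_bytes = build_sitemap([
        ("https://example.com/search-engine-basics/", "2026-04-15"),
        ("https://example.com/about/", None),
    ])
    print(xml_bytes.decode("utf-8"))
```

Omitting <lastmod> when the date is unknown is deliberate: a missing value costs nothing, while a guessed one undermines the signal.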

<changefreq> retains some function for Bing, which has indicated it uses the value as a crawl scheduling hint. If Bing traffic matters to your site, including accurate <changefreq> values remains worthwhile. For Google-only optimization, omit it.


How Do You Structure Sitemaps for Large Sites?

A single sitemap file has two hard limits: 50,000 URLs maximum and 50 MB maximum file size (uncompressed). These are not soft guidelines. Google and Bing will truncate sitemap files that exceed either limit, and the truncation point is unpredictable, meaning some of your important URLs may simply not be read.[4]

Sites with more than 50,000 indexable URLs need a sitemap index file. This is a parent XML file that lists individual child sitemap files rather than individual page URLs:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-articles.xml</loc>
    <lastmod>2026-04-20</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2026-04-18</lastmod>
  </sitemap>
</sitemapindex>

The sitemap index file itself can reference up to 50,000 child sitemaps, creating a theoretical maximum of 2.5 billion URLs across a single sitemap index. In practice, splitting by content type (articles, products, categories, authors) rather than by page number makes the Sitemaps report far more diagnostic: you can see exactly which segments of your site are being indexed at what rate.
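Splitting by content type and then by the 50,000-URL limit can be sketched as follows. The sitemap-{type}-{n}.xml naming scheme and both function names are assumptions made for illustration; any consistent naming works.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000


def plan_child_sitemaps(urls_by_type: Dict[str, List[str]],
                        size: int = MAX_URLS_PER_SITEMAP) -> Dict[str, List[str]]:
    """Map child sitemap filenames to URL lists, one or more files per content type."""
    plan = {}
    for content_type, urls in urls_by_type.items():
        for n, start in enumerate(range(0, len(urls), size), start=1):
            plan[f"sitemap-{content_type}-{n}.xml"] = urls[start:start + size]
    return plan


def build_sitemap_index(base_url: str, filenames: List[str]) -> bytes:
    """Serialize a <sitemapindex> that points at each child file."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for name in filenames:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = f"{base_url.rstrip('/')}/{name}"
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    plan = plan_child_sitemaps({
        "articles": [f"https://example.com/a/{i}" for i in range(120_001)],
        "products": ["https://example.com/p/1"],
    })
    print({name: len(urls) for name, urls in plan.items()})
    print(build_sitemap_index("https://example.com", list(plan)).decode("utf-8"))
    # Theoretical ceiling: 50,000 child sitemaps x 50,000 URLs each.
    print(MAX_URLS_PER_SITEMAP * 50_000)
```

Keeping the content type in the filename is what makes the per-segment view in the Sitemaps report possible.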

Compress all sitemap files with gzip (.xml.gz). This reduces file transfer size by 70 to 90 percent and is supported by all major search engines. Google accepts compressed sitemaps transparently. At those rates, a 45 MB uncompressed sitemap becomes roughly 4.5 to 13.5 MB compressed, reducing the bandwidth hit on every crawl cycle. The 50 MB limit still applies to the uncompressed size.
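Compression and the size check fit in a few lines. This is a minimal sketch; compress_sitemap and its limit parameter are illustrative names, and the check runs against the uncompressed bytes because that is where the 50 MB limit applies.

```python
import gzip

# 50 MB expressed in bytes (50 * 1024 * 1024 = 52,428,800).
MAX_UNCOMPRESSED_BYTES = 50 * 1024 * 1024


def compress_sitemap(xml_bytes: bytes, limit: int = MAX_UNCOMPRESSED_BYTES) -> bytes:
    """Gzip a sitemap after checking the uncompressed size against the limit."""
    if len(xml_bytes) > limit:
        raise ValueError(
            f"sitemap is {len(xml_bytes)} bytes uncompressed; split it before compressing"
        )
    return gzip.compress(xml_bytes)


if __name__ == "__main__":
    sample = b"<url><loc>https://example.com/page</loc></url>\n" * 1000
    packed = compress_sitemap(sample)
    print(len(sample), "->", len(packed), "bytes")
```

The actual compression ratio depends on how repetitive the URLs are, so measure it on your own files rather than relying on a fixed percentage.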


How Do You Submit a Sitemap to Google?

There are three submission methods, each appropriate for different use cases.[3]

The primary method is the Sitemaps report in Search Console, under Indexing. Enter the path to your sitemap or sitemap index file and submit. This gives you direct visibility into when Google last read the file, the number of discovered URLs, and any processing errors.

The secondary method is the Sitemap: directive in robots.txt. Googlebot reads robots.txt before every crawl session. Adding Sitemap: https://example.com/sitemap.xml to robots.txt ensures that any crawler visiting your site discovers your sitemap automatically, even if Search Console submission is delayed or unavailable.

User-agent: *
Disallow: /wp-admin/

Sitemap: https://example.com/sitemap_index.xml
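To confirm the directive is picked up the way a crawler would read it, Python's standard robots.txt parser exposes the declared sitemaps. The sketch below parses the text directly rather than fetching it; sitemaps_in_robots is a hypothetical helper name.

```python
from typing import List
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """User-agent: *
Disallow: /wp-admin/

Sitemap: https://example.com/sitemap_index.xml
"""


def sitemaps_in_robots(robots_txt: str) -> List[str]:
    """Return every URL declared with a Sitemap: line, in file order."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    print(sitemaps_in_robots(ROBOTS_TXT))
```

The Sitemap: line is independent of any User-agent group, so its position in the file does not matter.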

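The tertiary method, direct URL ping, is no longer usable for Google. It worked by sending an HTTP GET request to https://www.google.com/ping?sitemap=https://example.com/sitemap.xml to notify Google that the sitemap had been updated. Google announced the deprecation of this endpoint in June 2023 and it has since stopped working, so do not build it into new publishing workflows. High-frequency publishers who previously relied on pings should instead keep <lastmod> accurate, since that is the signal Google uses to schedule re-processing.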

For most sites, the Search Console submission plus the robots.txt reference is sufficient. The robots.txt reference serves as a passive always-on notification; the Search Console submission provides the monitoring interface that makes sitemap health visible.


How Does the Sitemaps Report Work as a Content Quality Audit?

After submission, the Sitemaps report shows two numbers: submitted URLs and indexed URLs. The ratio between them is the most direct measure of your sitemap's quality and, by extension, your site's indexable content quality.

According to practitioners monitoring this metric across client sites, an indexed-to-submitted ratio of 90 percent or above indicates a healthy sitemap. A ratio between 60 and 90 percent indicates content quality issues: thin pages, near-duplicate archives, or weakly linked content that Google discovered and chose not to index. A ratio below 60 percent indicates a structural problem, typically a large volume of low-value auto-generated pages (tag archives, author pages, parameter variants) that have been included in the sitemap but carry insufficient quality signals to earn indexing.

The important insight is that a low ratio is almost never a technical sitemap problem. The fix is not to adjust the sitemap XML. The fix is to either improve the quality of the excluded pages until they warrant indexing, or remove those pages from the sitemap (and optionally apply noindex) so the ratio improves by reducing the denominator rather than by improving the numerator. Submitting fewer, better pages is always preferable to submitting more pages and watching most of them get excluded.
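The ratio and its bands are simple arithmetic, sketched below. The 90 and 60 percent thresholds are the practitioner rules of thumb quoted above, not values published by Google, and the function names are illustrative.

```python
def index_ratio(submitted: int, indexed: int) -> float:
    """Indexed URLs divided by submitted URLs."""
    if submitted <= 0:
        raise ValueError("submitted must be a positive count")
    return indexed / submitted


def classify_ratio(ratio: float) -> str:
    """Map a ratio to the rule-of-thumb bands from the article."""
    if ratio >= 0.90:
        return "healthy"
    if ratio >= 0.60:
        return "content quality issues"
    return "structural problem"


if __name__ == "__main__":
    # 10,000 submitted, 4,000 indexed: 40 percent.
    before = index_ratio(10_000, 4_000)
    # Removing 5,000 unindexed low-value URLs shrinks the denominator only.
    after = index_ratio(5_000, 4_000)
    print(before, classify_ratio(before))
    print(after, classify_ratio(after))
```

The same 4,000 indexed pages move from the lowest band to the middle one purely by trimming what is submitted, which is the denominator effect described above.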

The Sitemaps report also shows the "Last read" date, which should be within the last 7 days for an active site. A stale last-read date indicates Googlebot has stopped checking your sitemap, usually because of repeated errors in the file (broken URLs, malformed XML, or a sitemap that returned a 5xx status when Googlebot tried to fetch it). Fixing sitemap fetch errors restores the regular read cycle.

Sources

  1. Sitemaps.org. "Sitemap Protocol."

  2. Sitemaps.org. "Frequently Asked Questions."

  3. Google Search Central. "Build and Submit a Sitemap."

  4. Google Search Central. "Manage Sitemaps with Sitemap Index Files."


Frequently Asked Questions (FAQs)

What is the difference between an XML sitemap and an HTML sitemap?

An XML sitemap is a machine-readable file designed for search engine crawlers, listing page URLs and optional metadata. An HTML sitemap is a human-readable navigational page listing links to sections of a site. Both can assist with URL discovery, but XML sitemaps are the standard mechanism for communicating directly with search engine crawlers, while HTML sitemaps primarily serve users who want an overview of site structure.

Should I include every page of my site in the XML sitemap?

No. Include only canonical, indexable pages returning HTTP 200. Exclude redirects, noindex pages, paginated archive pages beyond page 1, faceted navigation URLs, and thin content pages. A smaller, accurate sitemap trains Google to trust your signals. A bloated sitemap that includes low-value pages reduces the indexing ratio and signals poor content architecture.

Does the URL position in a sitemap affect crawl order?

No. Google does not crawl URLs in the order they appear in a sitemap. The official documentation confirms that Google determines crawl priority based on its own signals (authority, freshness, PageRank) rather than sitemap position. URL order within a sitemap is irrelevant to crawl scheduling.

What is the lastmod element and how accurate does it need to be?

lastmod is the date a page was last meaningfully updated, in W3C date format. Google uses it to prioritize recrawling of recently changed content. Accuracy is critical: if your lastmod values are set to today's date for pages that have not changed, Google will stop trusting your lastmod signals entirely. Only update lastmod when substantive content changes occur, such as a rewrite of main body text or a significant update to structured data.

How large can a sitemap file be?

Each sitemap file can contain a maximum of 50,000 URLs and must not exceed 50 MB uncompressed. Sites exceeding either limit must use a sitemap index file to split URLs across multiple child sitemap files. Gzip compression reduces actual transfer size by 70 to 90 percent and is supported by all major search engines. The 50 MB limit applies to the uncompressed file size regardless of compression.

What does a low indexed-to-submitted ratio mean in Search Console?

A low ratio typically indicates content quality problems, not a technical sitemap error. If Google is discovering but not indexing a large fraction of your submitted URLs, those pages likely have thin content, weak internal linking, or near-duplicate signals. The solution is to either improve those pages or remove them from the sitemap. A sitemap with 1,000 high-quality URLs and a 95 percent index ratio is far more useful than one with 10,000 URLs and a 40 percent ratio.
