- Mar 12

XML sitemaps: Help Google discover your pages and improve your SEO

An image of author James Clark, accompanied by various search-related iconography.

Discovery is the very first step in SEO—if a search engine can’t discover your content, it will never crawl and index it, which means searchers won’t be able to access it. An XML sitemap is an optional, but powerful, tool to support the discovery process (and by extension, your technical SEO efforts).

But, what should your sitemap contain? How do you create it? And how do you tell search engines about it? Let’s go on a discovery process of our own—drawing from the official sitemaps protocol and Google’s documentation—to understand how to use sitemaps for better SEO.

Table of contents:

What is an XML sitemap?
How XML sitemaps help your SEO
Types of sitemap
What an XML sitemap should (and shouldn’t) contain
Static and dynamic sitemaps
Size limits for XML sitemaps and sitemap indexes
How to generate an XML sitemap
- Sitemaps on Wix
- Generate sitemaps with Screaming Frog
Submitting your XML sitemap to Google (and Bing)
Validating your XML sitemap
HTML vs. XML sitemaps

What is an XML sitemap?

Before Google and other search engines can crawl and index your pages, they must first discover them. A sitemap is a document that facilitates the discovery process by telling search engines about the pages on a website that are available to crawl.

Although sitemaps can come in different formats, the most common is XML (extensible markup language)—a language that uses tags to “mark up” and structure information (a bit like HTML). One of the benefits of XML is that both people and computer programs can easily read it.

XML sitemaps follow the sitemaps protocol, which Google, Microsoft, and Yahoo all support. This protocol defines what a sitemap can contain, how to format it, and even how to submit it to search engines.

The start of the sitemaps protocol at sitemaps.org, describing the XML scheme for sitemaps

How XML sitemaps help your SEO

The main way a search engine discovers web pages on a site is by following links. These could be links from your own site (internal links) or links from another site (external links, also known as backlinks).

Some pages can be difficult for search engines to find. An “orphan page” is one that doesn’t have any inbound links pointing to it, meaning that search engines will never discover it by following links. Your website can even have small groups of orphan pages that only link to each other.

Graphic showing orphan pages on a site with no internal links except to other orphan pages

Another challenge that search engines face is knowing when pages were updated. Although search engines will periodically revisit crawled pages to see whether the content has changed, this isn’t particularly efficient either for search engines or website owners.

Sitemaps solve both of those SEO problems: They tell search engines about pages that are available to crawl, even orphan pages. And, they can also tell search engines when a page was last significantly changed, making crawling more efficient.

That isn’t to say that a sitemap can replace an effective internal linking policy. Remember, links don’t just help discovery, they also tell search engines about the relationship between pages—something that sitemaps can’t do.

Types of sitemap

You might think that “sitemap” and “XML sitemap” are synonymous, but the sitemaps protocol defines three valid sitemap formats:

XML
Text file
Syndication feed

Google and other major search engines can work with any of these formats. In most cases, you’ll want an XML sitemap—but if your platform or CMS doesn’t provide you with this, consider the other two formats:

A text sitemap is a text file (with a .txt extension) that lists all your page URLs, one per line. It can’t contain any other information. Text sitemaps are simple to create, so this is a good option if you have a very small site and rarely add new pages—though in that situation, you may not need a sitemap at all.

A syndication feed is a way of distributing content, especially news content. Although feeds are less popular than they used to be, many platforms still provide them in either the RSS or Atom format.

A screenshot of Google Search Central’s web page on building and submitting a sitemap, showing the pros and cons of sitemap types, including XML, RSS/Atom, and text sitemaps. — Source: Google.

News sites often create feeds for individual categories (or “channels”)—here’s the start of The Guardian’s RSS feed for its culture category:

RSS feed for The Guardian’s Culture category, showing channel information and the start of the first item

One big drawback of using a feed as a sitemap is that it usually only contains the most recent content. Nonetheless, it can still help search engines discover that content (and, through internal links, other content on your site).

The rest of this article focuses on XML sitemaps as these are the most common, and versatile, type of sitemap.

What an XML sitemap should (and shouldn’t) contain

Your sitemap should contain the URLs of all the pages you want search engines to crawl (and subsequently show in search results). Each page has its own pair of opening and closing <url> tags, containing a <loc> element that specifies the page’s location—like this:

<url>
  <loc>https://example.com/myurl1/</loc>
</url>
<url>
  <loc>https://example.com/myurl2/</loc>
</url>

There are plenty of pages you shouldn’t include in your sitemap, for example:

Pages that aren’t the canonical version of the content
Pages that are blocked by robots.txt
Pages set as noindex

In other words, pages you wouldn’t want Google to attempt to crawl and index.
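As a rough sketch of that exclusion rule, assume each page is described by a small record. The `Page` class, its fields, and the example URLs below are hypothetical and not part of any CMS API; the point is only to show the three checks applied together.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    canonical_url: str       # URL named in the page's canonical tag
    noindex: bool            # True if the page carries a noindex directive
    blocked_by_robots: bool  # True if robots.txt disallows crawling it


def sitemap_urls(pages):
    """Return only the URLs that belong in the sitemap."""
    return [
        p.url
        for p in pages
        if p.url == p.canonical_url and not p.noindex and not p.blocked_by_robots
    ]


pages = [
    Page("https://example.com/shoes/", "https://example.com/shoes/", False, False),
    # Sorted view whose canonical tag points back at the main listing
    Page("https://example.com/shoes/?sort=price", "https://example.com/shoes/", False, False),
    Page("https://example.com/thank-you/", "https://example.com/thank-you/", True, False),
    Page("https://example.com/admin/", "https://example.com/admin/", False, True),
]

print(sitemap_urls(pages))  # only the /shoes/ listing qualifies
```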

The <loc> element is mandatory, but there are other optional elements you can include with each URL to add more guidance for crawlers:

<lastmod>: The date the page was last significantly modified
<changefreq>: How frequently the page is likely to change (e.g., “monthly”)
<priority>: “The priority of this URL relative to other URLs on your site”—the higher the value (from 0.0 to 1.0), the more important you want crawlers to perceive the page as

Google says it ignores <priority> and <changefreq> values, while Bing says it “largely disregards” them. The <lastmod> tag, then, is the most useful way of indicating to search engines that a previously discovered page needs to be recrawled.
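To see how the elements fit together in a complete file, here is a minimal sketch using Python's standard library. The `build_sitemap` function and the dates are illustrative assumptions. It writes only <loc> and <lastmod>, leaving out the two elements that Google says it ignores.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap from (url, last_modified) pairs; last_modified may be None."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, last_modified in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if last_modified is not None:
            # date.isoformat() gives YYYY-MM-DD, one of the W3C Datetime forms
            ET.SubElement(url, "lastmod").text = last_modified.isoformat()
    # Returns bytes, ready to be written to sitemap.xml
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


sitemap_bytes = build_sitemap([
    ("https://example.com/myurl1/", date(2024, 3, 12)),  # hypothetical date
    ("https://example.com/myurl2/", None),               # no known modification date
])
print(sitemap_bytes.decode("utf-8"))
```

The <url> entries sit inside a single <urlset> root element that declares the sitemaps namespace, which is what marks the file as a sitemap rather than arbitrary XML.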

Part of the sitemap for animal rescue center Battersea, showing the <lastmod>, <changefreq> and <priority> values

Sitemap extensions

The core sitemaps protocol only specifies how to include URLs in a sitemap. However, one of the most powerful features of the protocol is that you can extend it to include other types of content (the “X” in “XML” stands for “eXtensible”).

There are Google-supported extensions for the following content types:

Images
Video
News

You can create separate sitemaps for these content types or include them in your existing sitemap.

The extensions introduce many new mandatory and optional elements. For example, video content requires the <video:thumbnail_loc> tag pointing to the location of the video thumbnail.
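Here is a hedged sketch of what a video entry might look like when generated with Python's standard library. The `video_url_entry` function and all the URLs are made up for illustration. The title and description tags are included on the understanding that Google's video extension also expects them, so check Google's documentation for the current list of required elements.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
VIDEO_NS = "http://www.google.com/schemas/sitemap-video/1.1"

# Serialize the core namespace as the default and the extension with a "video:" prefix
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("video", VIDEO_NS)


def video_url_entry(page_url, thumbnail_url, title, description, player_url):
    """Build one <url> element for a page and the video embedded on it."""
    url = ET.Element(f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
    video = ET.SubElement(url, f"{{{VIDEO_NS}}}video")
    ET.SubElement(video, f"{{{VIDEO_NS}}}thumbnail_loc").text = thumbnail_url
    ET.SubElement(video, f"{{{VIDEO_NS}}}title").text = title
    ET.SubElement(video, f"{{{VIDEO_NS}}}description").text = description
    ET.SubElement(video, f"{{{VIDEO_NS}}}player_loc").text = player_url
    return url


urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
urlset.append(video_url_entry(
    "https://example.com/videos/demo/",        # hypothetical page
    "https://example.com/thumbs/demo.jpg",     # hypothetical thumbnail
    "Product demo",
    "A short walkthrough of the product.",
    "https://example.com/player?video=demo",   # hypothetical player URL
))
video_sitemap = ET.tostring(urlset, encoding="unicode")
print(video_sitemap)
```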

Video sitemap from Which.co.uk, showing video-specific elements such as <video:thumbnail_loc> and <video:player_loc>

Static and dynamic sitemaps

Dynamic sitemaps are generated each time they are requested from the server, so they will always be up to date. In other words, if you create a new page on your website, then load your dynamic XML sitemap in a browser tab, it should list your new page. Likewise, if you change an existing page, the sitemap should update the <lastmod> value for that page. (If your sitemap is supposed to be dynamic but isn’t updating, you might have a caching issue.)

Static sitemaps, on the other hand, aren’t generated on the fly and don’t automatically update. As the name suggests, they are just static files.

In almost all cases, a dynamic sitemap is a better option. After all, if one of the main roles of a sitemap is to tell search engines about new content, you want your sitemap to include that content as soon as it is published.
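To make the difference concrete, here is a minimal sketch of a dynamic sitemap written as a plain WSGI application. The `PAGES` list is a hypothetical stand-in for your database or CMS, and `sitemap_app` is an illustrative name. A real platform will do this differently, but the principle is the same: the XML is built at request time.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical stand-in for a database or CMS: (URL, last modified) pairs
PAGES = [
    ("https://example.com/", "2024-03-01"),
    ("https://example.com/about/", "2024-02-14"),
]


def sitemap_app(environ, start_response):
    """WSGI app that rebuilds the sitemap each time /sitemap.xml is requested."""
    if environ.get("PATH_INFO") != "/sitemap.xml":
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not found"]
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in PAGES:  # read at request time, so new pages appear immediately
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
    start_response("200 OK", [
        ("Content-Type", "application/xml"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

To try it locally, you could serve the app with `wsgiref.simple_server.make_server("", 8000, sitemap_app).serve_forever()` and load http://localhost:8000/sitemap.xml in a browser tab. Add an entry to `PAGES` and the next request will include it, with no file to regenerate.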

Size limits for XML sitemaps and sitemap indexes

The sitemaps protocol specifies size limits for XML sitemaps “to ensure that your web server does not get bogged down serving very large files” (but also to make the process more efficient for search engines).

Your XML sitemap should:

Be no larger than 50MB (52,428,800 bytes)
Contain a maximum of 50,000 URLs

The size limit refers to the size of the uncompressed file, so compressing the file won’t help you get around this requirement. Instead you should follow the advice given in the protocol:

“If your site contains more than 50,000 URLs or your Sitemap is bigger than 50MB, you must create multiple Sitemap files and use a Sitemap index file. You should use a Sitemap index file even if you have a small site but plan on growing beyond 50,000 URLs or a file size of 50MB.”

A sitemap index is an XML file that lists multiple XML sitemaps. You might have one sitemap for your posts, one for your pages, and another for your categories—all listed in your index.

Lifehacker’s sitemap index lists separate sitemaps for its opinions, how-to’s, explainers, and many other categories.

Sitemap index files have size limits, too. Similar to individual sitemaps, they should:

Not exceed 50MB (52,428,800 bytes)
Include up to 50,000 sitemaps
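The URL limit lends itself to a simple split. The sketch below assumes a hypothetical site with 120,000 URLs and made-up file names. It divides the URLs into groups of at most 50,000 and lists one child sitemap per group in an index. It does not check the 50MB limit, which you would need to verify separately.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into consecutive groups of at most `size`."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_locations):
    """Build a sitemap index that lists the given child sitemap URLs."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_locations:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


# Hypothetical site with 120,000 pages
all_urls = [f"https://example.com/page-{n}/" for n in range(120_000)]
chunks = chunk_urls(all_urls)

# One child sitemap per chunk; the file names are an assumption
child_sitemaps = [
    f"https://example.com/sitemap-{i}.xml" for i in range(1, len(chunks) + 1)
]
index_xml = build_sitemap_index(child_sitemaps)

print([len(chunk) for chunk in chunks])  # [50000, 50000, 20000]
```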

The sitemaps protocol also has restrictions around content. Some characters must be “escaped”: an ampersand (“&”) is written as “&amp;”, for example.
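If you ever do build a sitemap by hand or with string templates, a small helper can take care of the escaping. This is a sketch using Python's standard library; the `escape_for_sitemap` name and the example URL are illustrative.

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; the two quote characters are added explicitly
QUOTE_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def escape_for_sitemap(url):
    """Entity-escape a URL so it can be placed inside a <loc> element."""
    return escape(url, QUOTE_ENTITIES)


print(escape_for_sitemap("https://example.com/search?q=shoes&page=2"))
# https://example.com/search?q=shoes&amp;page=2
```

XML libraries such as Python's `xml.etree.ElementTree` apply this escaping for you when they write element text, so the helper is only needed when you assemble the XML as plain strings.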

If you’re using your web platform or CMS to generate your sitemap, it will likely follow the protocol, so you only need to worry about these restrictions if you’re creating your sitemap manually (which is rare).

How to generate an XML sitemap

How you generate your XML sitemap will depend on the CMS or platform you use for your website. Let’s look at how this works for Wix websites as well as how to do this with Screaming Frog, a popular SEO tool.

Sitemaps on Wix

Wix websites come with sitemaps automatically. I say “sitemaps” because the platform provides different sitemaps for different types of pages. The sitemap index lives at https://yoursite.com/sitemap.xml, but this could link to sitemaps for events, forum posts, or more, depending on the functionality your site uses.

For example, this London barbershop has a sitemap specifically for the products it sells on its Wix website:

Sitemap index for https://www.cutsandbruisesbarbershop.com, linking to a separate sitemap for store products

Also, when you complete your Wix SEO Setup Checklist, Wix automatically submits your XML sitemap to Google for you. You’ll need a Premium plan and your own domain to take advantage of this.

Generate sitemaps with Screaming Frog

Maybe your platform or CMS doesn’t generate an XML sitemap for you. Maybe you aren’t even using a platform or CMS and instead hand-coded your site from scratch! In these situations, you’ll have to get a little creative.

If your site is small, you could use the text sitemap format we looked at earlier, create the file manually, and host it on your server.

But, this isn’t very practical if you have more than a couple dozen pages. Instead, you could use an SEO tool called Screaming Frog to crawl your site and create an XML sitemap for you.

This is a powerful option as it will automatically exclude pages that are blocked by robots.txt, set as “noindex,” or have a canonical tag pointing to a different URL—in other words, all the pages you ordinarily wouldn’t want Google to attempt to crawl and index.

The free version of Screaming Frog will crawl up to 500 URLs, so if your site is bigger than this then you’ll need to pay for a license.

A completed Screaming Frog crawl, listing the pages crawled as well as a “Crawl Limit Reached” warning

Just as with a text sitemap, the next step is to host your new XML sitemap on your server (preferably in the root directory) then submit it to Google. If your site changes often, you could even look at scheduling an automated crawl.

One downside of the Screaming Frog approach is that it gives you a static sitemap. If you create a new page or update an existing one, the sitemap won’t automatically change to reflect that.

Submitting your XML sitemap to Google (and Bing)

Once you’ve generated your sitemap, the next step is to inform the major search engines so they can use it. There are two ways to do this.

The first is to specify the path to your sitemap or sitemap index in your robots.txt file, like this example from the Manchester United website:

The robots.txt file from https://www.manutd.com/, containing a link to the XML sitemap

This small change will enable Google and other search engines to find your sitemap the next time they crawl your robots.txt file. The downside here is that you don’t get any feedback: you won’t know when those search engines last read your sitemap, how many pages they discovered, and so on.
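In the file itself, this is a single line beginning with `Sitemap:` followed by the full URL of your sitemap or sitemap index. The sketch below uses a made-up robots.txt for example.com and Python's built-in robots.txt parser (the `site_maps()` method needs Python 3.8 or later) to confirm that the line is picked up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the Sitemap line is independent of the
# User-agent rules and can appear anywhere in the file
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.site_maps())  # ['https://example.com/sitemap.xml']
```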

For that, you’ll need special tools provided by the search engines themselves. These tools let you both submit your sitemap and see how it is being read:

For Google, the tool to use is Google Search Console. Our complete guide to Google Search Console walks you through the process of first verifying your site in Search Console and then submitting your sitemap (or sitemap index). If you manage a number of sites and want to submit your sitemaps to Google programmatically, use the Search Console API.
Bing has its own equivalent of Search Console, called Bing Webmaster Tools, and the submission process here is also straightforward.

You don’t have to choose one approach or the other. It’s definitely worth both specifying the path to your sitemap in your robots.txt file and submitting your sitemap to search engines individually.

Validating your XML sitemap

It may seem strange to talk about validating your XML sitemap after submitting it, but that’s because submitting your sitemap is actually the best way to validate it. When you submit your sitemap to Google using Search Console (or Bing using Bing Webmaster Tools), the tool will tell you whether your sitemap is valid.

In Search Console, you get a green “success” message if everything is OK:

Submitted sitemaps page in Google Search Console, listing one submitted sitemap and a green “Success” status

But if you get a red message instead, something has gone wrong. Just click on the error to find out more:

Error details page in Google Search Console saying “Your Sitemap appears to be an HTML page. Please use a supported sitemap format instead”

Once you’ve fixed any errors, resubmit the sitemap to prompt Google to fetch it again.

There are also free third-party tools you can use to validate your XML sitemap, either by pasting in a link or uploading an XML file.

However, even if your XML is valid, there might be another reason why Google can’t fetch your sitemap: Perhaps your robots.txt file is blocking Googlebot from accessing it. Unlike Google Search Console, a third-party validation tool wouldn’t pick up on this kind of issue.
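For a quick local check before you submit, something like the following sketch can catch the most basic problems: malformed XML, the wrong root element, missing <loc> values, and the size limits. The `check_sitemap` function is an illustrative assumption, not a replacement for Search Console. It covers URL sitemaps only (not sitemap indexes) and, like any offline tool, knows nothing about your robots.txt file.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_BYTES = 52_428_800  # 50MB, uncompressed
MAX_URLS = 50_000


def check_sitemap(xml_bytes):
    """Return a list of problems found; an empty list means the basic checks passed."""
    problems = []
    if len(xml_bytes) > MAX_BYTES:
        problems.append("file is larger than 50MB uncompressed")
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        return problems + [f"not well-formed XML: {exc}"]
    if root.tag != f"{{{SITEMAP_NS}}}urlset":
        return problems + [f"unexpected root element: {root.tag}"]
    urls = root.findall(f"{{{SITEMAP_NS}}}url")
    if len(urls) > MAX_URLS:
        problems.append("more than 50,000 URLs")
    for url in urls:
        loc = url.find(f"{{{SITEMAP_NS}}}loc")
        if loc is None or not (loc.text or "").strip():
            problems.append("a <url> entry has no <loc>")
    return problems


valid = (
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>https://example.com/</loc></url></urlset>"
)
html_page = b"<html><body>Not a sitemap</body></html>"

print(check_sitemap(valid))      # []
print(check_sitemap(html_page))  # ['unexpected root element: html']
```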

HTML vs. XML sitemaps

We’ve seen that XML sitemaps are intended for search engines, but there’s another type of sitemap aimed at human users: the HTML sitemap. This is a directory of the main pages or sections on a site, and can help users quickly understand the site’s structure and navigate around.

Your HTML sitemap may sit on a dedicated page, or perhaps in the footer—as with this example from Apple:

Large footer menu from Apple.com containing dozens of categorized links

So-called “mega menus” in the site header are, in effect, another kind of HTML sitemap:

Mega-menu from UK retailer Currys listing categories and subcategories of electrical appliances

HTML sitemaps do serve an SEO purpose, too: They are a collection of internal links, which Googlebot will happily use to discover new pages and understand the value of those pages. So in that sense they complement the work of your XML sitemaps.

That doesn’t mean you should use just an HTML sitemap. Thinking specifically about search engine discovery, they have some major drawbacks compared to XML sitemaps:

HTML sitemaps are limited by space on the page, so don’t usually include individual articles, blog posts, or product pages (likely to be the bulk of your new content).
HTML sitemaps don’t tell search engines when content was updated.
HTML sitemaps usually need to be updated manually, so they may not be completely up to date.

For those reasons, you should focus on providing an XML sitemap that is useful to search engines and an HTML sitemap that is useful to your users. If an HTML sitemap wouldn’t be useful to your users, simply don’t include one.

Take control over your discoverability with XML sitemaps

Now, you have an in-depth understanding of sitemaps. Put your knowledge into practice by working through the following questions:

Does my website have an XML sitemap?
Does it list (only) the pages I want to be crawled?
Does it contain all the detail I want it to (e.g., the <lastmod> time)?
Is it valid XML and within the size limit?
Does my sitemap update automatically (or do I have a way of updating it)?
Have I submitted it to Google and Bing?
Have I specified the path to my sitemap in my robots.txt file?
Would my users benefit from an HTML sitemap?

The exercise will help you come up with a plan for your site, revealing any actions you need to take to improve your sitemap coverage and boost page discoverability. And, if you’re in doubt about anything, refer to the sitemaps protocol and Google’s documentation!

James Clark - Web Analyst

James Clark is a web analyst from London, with a background in the publishing sector. When he isn't helping businesses with their analytics, he's usually writing how-to guides over on his website Technically Product.