Validating XML sitemaps in node.js

How to use XML Schema (XSD) to validate your sitemap.xml in node.js

March 12, 2017 - 3 minute read -
web seo xml node javascript

Introduction

Sitemaps are a way to tell search engines which pages on your site should be crawled and how often. They are written in XML, and if a sitemap is not well-formed and valid, search engines may not be able to use it to crawl your content. In this post I will demonstrate how to use XML schemas to validate sitemap XML files. For more information about sitemaps and sitemap index files see the sitemap documentation.

Walkthrough

We need to download the XML schema (XSD) to validate against. It’s published on sitemaps.org; we’ll save it in its own folder as sitemap.xsd:

mkdir -p test/schemas
curl https://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd > \
  test/schemas/sitemap.xsd

We’ll use libxmljs to perform the validation:

npm install libxmljs --save

The validation code will live in a file called sitemap.tests.js and we’ll assume our built sitemap is in the base of the project. Here’s the folder structure:

.
├── package.json
├── sitemap.xml
└── test
    ├── schemas
    │   └── sitemap.xsd
    └── sitemap.tests.js

In this example I’m reading the sitemap and schema from the local file system; however, the same approach could easily be used to validate a sitemap that’s generated dynamically. Just call the endpoint, then parse and validate the response with libxmljs in the same way. After that you can assert on and report the results however you like.
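For the dynamic case, a rough sketch might look like the following. It assumes libxmljs’s `parseXml`, `validate` and `validationErrors`; the helper names (`fetchText`, `validateSource`, `validateRemoteSitemap`) and the optional `parser` argument are hypothetical and only there for illustration.

```javascript
const fs = require('fs');
const http = require('http');
const https = require('https');

// Fetch a URL and resolve with the response body as a string.
function fetchText(url) {
  const client = url.startsWith('https:') ? https : http;
  return new Promise((resolve, reject) => {
    client.get(url, (res) => {
      if (res.statusCode !== 200) {
        res.resume(); // discard the body so the socket is released
        reject(new Error('Unexpected status ' + res.statusCode + ' for ' + url));
        return;
      }
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}

// Validate an XML string against an XSD string.
// `parser` defaults to libxmljs; it is a parameter only so that a stand-in
// can be supplied where the native module is not installed.
function validateSource(xmlSource, xsdSource, parser = require('libxmljs')) {
  const xsdDoc = parser.parseXml(xsdSource);
  const xmlDoc = parser.parseXml(xmlSource);
  const valid = xmlDoc.validate(xsdDoc);
  const errors = (xmlDoc.validationErrors || []).map((e) => String(e.message).trim());
  return { valid, errors };
}

// Fetch the sitemap from a running site and validate it against a local XSD.
async function validateRemoteSitemap(url, xsdPath, parser) {
  const xmlSource = await fetchText(url);
  return validateSource(xmlSource, fs.readFileSync(xsdPath, 'utf8'), parser);
}
```

Something like `validateRemoteSitemap('http://localhost:3000/sitemap.xml', 'test/schemas/sitemap.xsd')` (the URL is made up) would then resolve to an object with a `valid` flag and a list of error messages.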

Here’s sitemap.tests.js:
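What follows is a minimal sketch of such a file, not necessarily the exact code this post was written around. It assumes libxmljs’s `parseXml`, `validate` and `validationErrors`, and uses Node’s built-in `assert` rather than any particular test framework. The helper names `validateFile` and `testSitemap`, and the optional `parser` argument, are hypothetical.

```javascript
// test/sitemap.tests.js (sketch)
const assert = require('assert');
const fs = require('fs');
const path = require('path');

// Parse both files and validate the XML against the XSD.
// `parser` defaults to libxmljs; it is a parameter only so that a stand-in
// can be supplied where the native module is not installed.
function validateFile(xmlPath, xsdPath, parser = require('libxmljs')) {
  const xsdDoc = parser.parseXml(fs.readFileSync(xsdPath, 'utf8'));
  // parseXml throws here if the sitemap is not well-formed XML.
  const xmlDoc = parser.parseXml(fs.readFileSync(xmlPath, 'utf8'));
  const valid = xmlDoc.validate(xsdDoc);
  const errors = (xmlDoc.validationErrors || []).map((e) => String(e.message).trim());
  return { valid, errors };
}

// Assert that sitemap.xml in the project root is valid against the schema
// saved under test/schemas.
function testSitemap(root = process.cwd()) {
  const result = validateFile(
    path.join(root, 'sitemap.xml'),
    path.join(root, 'test', 'schemas', 'sitemap.xsd')
  );
  assert.ok(result.valid, 'sitemap.xml is invalid:\n' + result.errors.join('\n'));
}

module.exports = { validateFile, testSitemap };
```

Calling `testSitemap()` from the project root, for example from whichever test runner you use, throws an assertion error listing the validation messages if the sitemap does not validate.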

Sitemap index files

If you’re using a sitemap index file to link to multiple sitemap files then it’s a good idea to validate that too. There is a separate schema for index files, siteindex.xsd, published alongside sitemap.xsd on sitemaps.org; other than that, the process is exactly the same as for the sitemap files themselves.
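Since only the schema changes, one way to handle both kinds of file in a single test is to pick the XSD from the document’s root element: `urlset` for a sitemap and `sitemapindex` for an index. A small sketch, assuming the index schema has been saved as test/schemas/siteindex.xsd (that file name and the helper `schemaPathFor` are my own choices for illustration):

```javascript
const path = require('path');

// Root element name -> schema file. The siteindex.xsd file name is an
// assumption; use whatever name you saved the index schema under.
const SCHEMAS = {
  urlset: 'sitemap.xsd',
  sitemapindex: 'siteindex.xsd',
};

// Return the path of the XSD to validate against for a given root element.
function schemaPathFor(rootName, schemaDir = path.join('test', 'schemas')) {
  const known = Object.prototype.hasOwnProperty.call(SCHEMAS, rootName);
  if (!known) {
    throw new Error('No schema known for root element <' + rootName + '>');
  }
  return path.join(schemaDir, SCHEMAS[rootName]);
}
```

With libxmljs the root element name of a parsed document is available as `xmlDoc.root().name()`, so the schema for that document would be `schemaPathFor(xmlDoc.root().name())`.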

Google News sitemaps

Google have their own set of extensions to the sitemap format for Google News-specific information. The News-specific elements are in their own XML namespace, and Google publishes a schema for them. We’ll save it alongside the main schema.

curl http://www.google.com/schemas/sitemap-news/0.9/sitemap-news.xsd > \
  test/schemas/sitemap-news.xsd

When validating a sitemap that includes both the standard and the News-specific elements we need a combined XSD to validate against; we can do this by importing the News schema into the main schema.

For example, given the following file layout:

.
└── test
    ├── schemas
    │   ├── sitemap-news.xsd
    │   └── sitemap.xsd
    └── sitemap.tests.js

We can use an xsd:import element to import another XSD. The import should be a child of the xsd:schema element. Note that the schemaLocation path is relative to the working directory of the node process, not to the file importing it; here I am assuming that the tests are being run from the root of the project.
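You can add the element by hand, but as a sketch of what it looks like, here is one way to generate the combined schema from the two downloaded files. The namespace is the one from the News schema URL above; the helper names and the sitemap-combined.xsd output file are hypothetical, and the string replacement assumes the main schema’s root element is written with the `xsd:` prefix.

```javascript
const fs = require('fs');

const NEWS_NS = 'http://www.google.com/schemas/sitemap-news/0.9';

// Return xsdSource with an xsd:import for the News schema inserted as the
// first child of the root xsd:schema element.
function addNewsImport(xsdSource, schemaLocation) {
  const importTag =
    '<xsd:import namespace="' + NEWS_NS + '" schemaLocation="' + schemaLocation + '"/>';
  const openTag = /<xsd:schema\b[^>]*>/;
  if (!openTag.test(xsdSource)) {
    throw new Error('No <xsd:schema> element found');
  }
  // A replacer function avoids any special handling of "$" in the inserted text.
  return xsdSource.replace(openTag, (tag) => tag + '\n  ' + importTag);
}

// Read the main schema, add the import and write the result to outPath.
// newsLocation is resolved relative to the working directory of the node
// process when the combined schema is later used for validation.
function writeCombinedSchema(mainPath, newsLocation, outPath) {
  const combined = addNewsImport(fs.readFileSync(mainPath, 'utf8'), newsLocation);
  fs.writeFileSync(outPath, combined);
  return outPath;
}
```

For the layout above, that could be `writeCombinedSchema('test/schemas/sitemap.xsd', 'test/schemas/sitemap-news.xsd', 'test/schemas/sitemap-combined.xsd')`, run from the root of the project.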

Once you have a combined XSD you can validate the sitemap against the XSD as described above.

Conclusion

Sitemaps are an important part of SEO for many sites and we need to make sure they are well-formed and valid. It’s very straightforward to do this as part of the site’s test suite.