Sitemaps

What are Sitemaps?

Sitemaps are an easy way to inform search engines about the crawlable pages on your website. Technically, a Sitemap is an XML file that lists a site's URLs along with additional metadata about each one (the last modification date, how often the content changes, and how important the URL is relative to other URLs on the site). This allows search engines to crawl the site more intelligently.

Structure of a Sitemap

The basic structure of a Sitemap consists of the following:

  • XML declaration (optional): since a Sitemap is an XML file, it should begin with this declaration, although it is not mandatory.
  • <urlset> tag (required): contains a set of <url> tags and must include an “xmlns” attribute pointing to the namespace of the current Sitemap schema.
  • <url> tag (required): corresponds to one URL on the site.
  • <loc> tag (required): contains the URL that the parent <url> tag describes.
  • <lastmod> tag (optional): the last modification date of the URL. The date should follow the W3C Datetime format, “YYYY-MM-DDThh:mm:ssTZD”, for example 2004-12-23T18:00:15+00:00. The time portion may be omitted, leaving “YYYY-MM-DD”.
  • <changefreq> tag (optional): how often the content of this URL changes. The value must be one of “always”, “hourly”, “daily”, “weekly”, “monthly”, “yearly”, or “never”. Use “always” for any URL that serves different content each time it is accessed, and “never” for URLs whose content never changes, such as archived documents.
  • <priority> tag (optional): indicates how important this URL is relative to other URLs on the same site. Valid values range from 0.0 to 1.0, and the default is 0.5.
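
The tags listed above can be assembled into a complete file with a few lines of code. The following is a minimal sketch in Python using only the standard library. The function name build_sitemap and the dictionary-based input are assumptions made for illustration, not part of the Sitemap protocol.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace of the Sitemap 0.9 schema, used as the "xmlns" attribute.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML as UTF-8 bytes.

    `entries` is a list of dicts (hypothetical input shape) with a required
    "loc" key and optional "lastmod", "changefreq" and "priority" keys.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # Emit child tags in the order used by the protocol documentation.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    # xml_declaration=True requires Python 3.8 or later.
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    modified = datetime(2004, 12, 23, 18, 0, 15, tzinfo=timezone.utc)
    xml_bytes = build_sitemap([
        {
            "loc": "http://www.example.com/",
            "lastmod": modified.isoformat(),  # 2004-12-23T18:00:15+00:00
            "changefreq": "daily",
            "priority": 0.8,
        }
    ])
    print(xml_bytes.decode("utf-8"))
```

ElementTree escapes XML special characters in the text it writes, so the URLs passed in only need to be URL-encoded.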

Entity Escaping

Sitemap files must be UTF-8 encoded. Every URL in a sitemap file must be URL-encoded so that the web server can read it, and the characters &, ', ", < and > must be entity-escaped (for example, & becomes &amp;). A reference for URL encoding can be found at:

http://www.w3schools.com/tags/ref_urlencode.asp
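
As a rough illustration of the two steps (URL encoding, then XML entity escaping), here is a sketch in Python. The helper name sitemap_safe_url is hypothetical, and the choice to leave existing "%" sequences untouched is an assumption made to avoid double-encoding URLs that are already encoded.

```python
from urllib.parse import quote, urlsplit, urlunsplit
from xml.sax.saxutils import escape


def sitemap_safe_url(url):
    """URL-encode a URL, then entity-escape it for direct use in sitemap XML."""
    parts = urlsplit(url)
    # Percent-encode non-ASCII and unsafe characters in the path and query.
    # "%" is kept as-is on the assumption that it starts an existing escape.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    encoded = urlunsplit(
        (parts.scheme, parts.netloc, path, query, parts.fragment)
    )
    # escape() handles &, < and >; quotes are added through the extra mapping.
    return escape(encoded, {"'": "&apos;", '"': "&quot;"})


if __name__ == "__main__":
    print(sitemap_safe_url("http://www.example.com/ümlat.php?q=name&id=1"))
    # prints http://www.example.com/%C3%BCmlat.php?q=name&amp;id=1
```

The entity-escaping step applies when the XML is written by hand or through string templates. An XML library that escapes text itself should be given the URL-encoded value only, otherwise "&" would be escaped twice.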

Sitemaps for Large Websites

A sitemap file may contain at most 50K URLs and be at most 10MB in size. (The current sitemaps.org protocol has since raised the size limit to 50MB.) If a website has more than 50K URLs, or its sitemap file would exceed the size limit, sitemap index files can be used: a sitemap index file references multiple sitemap files, each of which stays within both limits.

The sitemap files in this setup can be compressed in “.gz” format, which is often a best practice. However, the size limit applies to the uncompressed file: a compressed sitemap file must still not exceed 10MB once it is uncompressed.
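
The splitting and compression described above might be sketched as follows in Python. The function names chunk_urls and compress_sitemap are illustrative assumptions, and the size constant uses the 10MB figure quoted in this article, so adjust it if you target a different limit.

```python
import gzip

MAX_URLS = 50_000              # maximum URLs per sitemap file
MAX_BYTES = 10 * 1024 * 1024   # uncompressed size limit quoted in the text


def chunk_urls(urls, max_urls=MAX_URLS):
    """Yield consecutive slices of `urls`, each holding at most `max_urls`."""
    for start in range(0, len(urls), max_urls):
        yield urls[start:start + max_urls]


def compress_sitemap(xml_bytes, max_bytes=MAX_BYTES):
    """Gzip a sitemap after checking its uncompressed size against the limit."""
    if len(xml_bytes) > max_bytes:
        raise ValueError(
            f"sitemap is {len(xml_bytes)} bytes uncompressed; "
            f"limit is {max_bytes}"
        )
    return gzip.compress(xml_bytes)


if __name__ == "__main__":
    urls = [f"http://www.example.com/page{i}" for i in range(120_000)]
    sizes = [len(chunk) for chunk in chunk_urls(urls)]
    print(sizes)  # [50000, 50000, 20000]
```

Each chunk would then be written to its own sitemap file and listed in a sitemap index file.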

Structure of Sitemap Index Files

The structure of a sitemap index file consists of the following:

  1. <sitemapindex> tag (required): contains information about the different sitemap files. This tag contains a set of <sitemap> tags which refer to individual sitemaps.
  2. <sitemap> tag (required): contains information about an individual sitemap.
  3. <loc> tag (required): the location at which the individual sitemap can be found.
  4. <lastmod> tag (optional): the last modification date of the individual sitemap file.
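
A sitemap index file with this structure could be produced along these lines. This is a sketch in Python rather than a reference implementation: build_sitemap_index and its list-of-tuples input are hypothetical names, and the file URLs in the example are placeholders.

```python
import xml.etree.ElementTree as ET

# Sitemap index files use the same namespace as sitemap files.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(sitemaps):
    """Return sitemap index XML as UTF-8 bytes.

    `sitemaps` is a list of (loc, lastmod) tuples; a <lastmod> tag is
    written only when a lastmod value is supplied.
    """
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc, lastmod in sitemaps:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
        if lastmod:
            ET.SubElement(sitemap, "lastmod").text = lastmod
    # xml_declaration=True requires Python 3.8 or later.
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    xml_bytes = build_sitemap_index([
        ("http://www.example.com/sitemap1.xml.gz", "2004-12-23T18:00:15+00:00"),
        ("http://www.example.com/sitemap2.xml.gz", "2005-01-01"),
    ])
    print(xml_bytes.decode("utf-8"))
```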

Important Notes Regarding Sitemap Index Files

  • A Sitemap index file must not list more than 50K sitemaps and must not be larger than 10MB (the same limits that apply to sitemap files).
  • One sitemap index file can reference more than one other sitemap index file.
  • A Sitemap index file can only specify Sitemaps that are found on the same site as the Sitemap index file. For example, http://www.yoursite.com/sitemap_index.xml can include Sitemaps on http://www.yoursite.com but not on http://www.example.com or http://yourhost.yoursite.com.
  • As with Sitemaps, your Sitemap index file must be UTF-8 encoded.
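
The same-site rule can be checked before a sitemap is added to an index. The sketch below, in Python, compares the host of the two URLs. The function name same_site is hypothetical, and also requiring the scheme to match is an assumption that goes slightly beyond what the note above states.

```python
from urllib.parse import urlsplit


def same_site(index_url, sitemap_url):
    """Return True when both URLs share the same scheme and host."""
    index_parts = urlsplit(index_url)
    sitemap_parts = urlsplit(sitemap_url)
    # Host names are case-insensitive, so compare them in lower case.
    return (index_parts.scheme, index_parts.netloc.lower()) == (
        sitemap_parts.scheme,
        sitemap_parts.netloc.lower(),
    )


if __name__ == "__main__":
    index = "http://www.yoursite.com/sitemap_index.xml"
    print(same_site(index, "http://www.yoursite.com/sitemap1.xml"))        # True
    print(same_site(index, "http://www.example.com/sitemap1.xml"))         # False
    print(same_site(index, "http://yourhost.yoursite.com/sitemap1.xml"))   # False
```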

Figure 1. Simple Sitemap File

Figure 2. Sitemap Index Files Referencing Two Sitemap Files

Figure 3. Cascaded Sitemap Index Files

Telling Search Engines About Your Sitemap

To tell search engines about the location of your Sitemap, include the following line in your robots.txt file (updating with your own path and filename):

sitemap: http://www.example.com/sitemap.xml
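
To confirm that the line is picked up, the robots.txt content can be read back with Python's standard urllib.robotparser module, as in the sketch below. The wrapper function sitemaps_from_robots is a hypothetical helper, and RobotFileParser.site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt):
    """Return the sitemap URLs declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    robots_txt = "sitemap: http://www.example.com/sitemap.xml\n"
    print(sitemaps_from_robots(robots_txt))
    # ['http://www.example.com/sitemap.xml']
```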