/etc

Originally Published: 2016-09-09

If you run a Jekyll blog (like this one!), you might be interested in having your blog posts saved in a web archive like the Internet Archive Wayback Machine. In this post, I’ll show you how you can use an auto-generated sitemap to get a list of all URLs on your Jekyll blog, then feed those URLs to a web archiving process.

Adding a sitemap to your Jekyll blog or website is easy. Assuming your configuration is relatively straightforward, using the jekyll-sitemap plugin can be as simple as adding a line to your site’s _config.yml if you’re using GitHub Pages. Once you’ve done that, test that the URLs you’re generating in the resulting sitemap.xml are valid and you should be good to go. Generating a sitemap also has SEO benefits, as it allows search engines to crawl your site more easily.[1]
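One quick sanity check is to look for sitemap entries that don't start with your site's public address (for example, leftover localhost URLs from a local build). The sketch below is only an illustration: the function name suspicious_locs is made up for this example, and it assumes each <loc> element sits on a single line of the XML.

```shell
# Print every <loc> entry from the sitemap XML on stdin that does NOT start
# with the expected site prefix (passed as the first argument).
# Assumption: each <loc>...</loc> element is on a single line.
suspicious_locs() {
  site_prefix=$1
  grep -o '<loc>[^<]*</loc>' | grep -v -F "<loc>$site_prefix"
}
```

Something like `curl -s https://mysite.github.io/sitemap.xml | suspicious_locs https://mysite.github.io/` should then print nothing if every entry points at your site.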

Once you have a working sitemap, you can get the URLs back out of it with sitemap-urls [2]:

curl https://mysite.github.io/sitemap.xml | sitemap-urls
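If you would rather not install a Node tool, a rough shell-only stand-in might look like the sketch below. It is not the sitemap-urls implementation: the function name extract_sitemap_urls is hypothetical, and it assumes a simple, well-formed sitemap with no CDATA sections and no XML entities other than &amp;.

```shell
# Print the contents of every <loc> element in the sitemap XML on stdin,
# one URL per line.
# Assumptions: well-formed XML, no CDATA, only &amp; needs decoding.
extract_sitemap_urls() {
  tr '\n' ' ' |                           # join lines so a split <loc> still matches
    grep -o '<loc>[^<]*</loc>' |          # one <loc>...</loc> element per output line
    sed -e 's/^<loc>[[:space:]]*//' \
        -e 's/[[:space:]]*<\/loc>$//' \
        -e 's/&amp;/\&/g'                 # strip the tags, decode &amp; to &
}
```

It can be dropped into the same pipeline, for example `curl -s https://mysite.github.io/sitemap.xml | extract_sitemap_urls`.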

Now we can use that list of URLs to drive our archiving process:

# Ask the Wayback Machine to save every URL listed in the sitemap.
curl -s https://mysite.github.io/sitemap.xml | sitemap-urls | while IFS= read -r url; do
  curl -g --fail --retry 3 -L -o /dev/null -s "http://web.archive.org/save/$url"
done

This should tell the Wayback Machine to save all the URLs from the sitemap. Wrap this up in a script you can put in a periodic cron job and you can rest easy knowing that your pages are being regularly archived. A similar process should work for archiving any (non-Jekyll) website that provides a sitemap.[3] You could also use the scripts from my web-archive-triage repository to do some more complicated things, such as only archiving pages that have no snapshot.
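To give a flavor of the "only archive pages that have no snapshot" idea, here is a minimal sketch built on the Wayback Machine availability API (https://archive.org/wayback/available). It is not taken from the web-archive-triage scripts. The function names has_snapshot and archive_if_missing are hypothetical, and the JSON check is a simple pattern match that assumes the response carries an "available": true field when a snapshot exists.

```shell
# Succeed if the availability-API JSON on stdin reports an existing snapshot.
# Assumption: such a response contains "available": true.
has_snapshot() {
  grep -Eq '"available"[[:space:]]*:[[:space:]]*true'
}

# Ask the Wayback Machine to save a URL, but only when no snapshot is reported.
# If the availability lookup itself fails, the URL is saved anyway.
archive_if_missing() {
  url=$1
  if curl -s -G --data-urlencode "url=$url" "https://archive.org/wayback/available" | has_snapshot; then
    return 0   # already archived, nothing to do
  fi
  curl -g --fail --retry 3 -L -o /dev/null -s "http://web.archive.org/save/$url"
}
```

In the loop above you would call `archive_if_missing "$url"` instead of the plain save request. If you save the whole thing as a script (say, a hypothetical /path/to/archive-sitemap.sh), a crontab entry along the lines of `0 4 * * 0 /path/to/archive-sitemap.sh` would run it once a week.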

Footnotes:

  1. You can also submit your sitemaps to Google and check up on them through the Google Search Console (formerly Webmaster Tools). 

  2. This more complicated bash process may work if you don’t have node/npm installed. 

  3. If you run a WordPress blog, you may also be interested in the Archiver plugin.