If you run a Jekyll blog (like this one!), you might be interested in having your blog posts saved in a web archive like the Internet Archive Wayback Machine. In this post, I’ll show you how you can use an auto-generated sitemap to get a list of all URLs on your Jekyll blog, then feed those URLs to a web archiving process.
Adding a sitemap to your Jekyll blog or website is easy. Assuming your configuration is relatively straightforward, using the jekyll-sitemap plugin can be as simple as adding a line to your site’s _config.yml if you’re using GitHub Pages. Once you’ve done that, test that the URLs you’re generating in the resulting sitemap.xml are valid and you should be good to go. Generating a sitemap also has SEO benefits, as it allows search engines to crawl your site more easily.[1]
Once you have a working sitemap, you can get the URLs back out of it with sitemap-urls:[2]
curl https://mysite.github.io/sitemap.xml | sitemap-urls
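If node/npm is not available, roughly the same list can be pulled out with standard tools. The following is a minimal sketch, not how sitemap-urls itself works: the function name sitemap_locs is made up for illustration, and it assumes a plain sitemap whose <loc> elements contain no CDATA sections, escaped entities, or nested sitemap indexes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the contents of every <loc> element read from stdin.
# Assumes a simple sitemap.xml; this is pattern matching, not real XML parsing.
sitemap_locs() {
  tr -d '\n' \
    | grep -o '<loc>[^<]*</loc>' \
    | sed -e 's/^<loc>//' -e 's/<\/loc>$//' \
          -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

# Usage (mysite.github.io is the placeholder host used in this post):
#   curl -s https://mysite.github.io/sitemap.xml | sitemap_locs
```

Joining the input into one line first means a <loc> element that is split across several lines is still matched; the final sed expressions trim any whitespace left around the URL.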
Now we can use that list of URLs to drive our archiving process:
curl -s https://mysite.github.io/sitemap.xml | sitemap-urls | while IFS= read -r url; do
  curl -g --fail --retry 3 -L -o /dev/null -s "https://web.archive.org/save/$url"
done
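The loop above throws away all output, so a failed save goes unnoticed. A slightly more defensive variant might look like the sketch below. The names archive_url, archive_all, and PAUSE are hypothetical, and the pause between requests is a courtesy guess on my part rather than a documented requirement.

```shell
#!/usr/bin/env bash
# Sketch only: archive_url, archive_all and PAUSE are illustrative names.
PAUSE="${PAUSE:-5}"   # seconds to wait between requests (an assumed value)

# Ask the Wayback Machine to save one URL; succeeds only if curl succeeds.
archive_url() {
  curl -g --fail --retry 3 -L -o /dev/null -s "https://web.archive.org/save/$1"
}

# Read URLs from stdin, try each one, and list failures on stderr.
# Returns non-zero if any URL could not be saved.
archive_all() {
  local url failed=0
  while IFS= read -r url; do
    [ -n "$url" ] || continue          # skip blank lines
    if ! archive_url "$url"; then
      echo "failed: $url" >&2
      failed=1
    fi
    sleep "$PAUSE"
  done
  return "$failed"
}

# Usage:
#   curl -s https://mysite.github.io/sitemap.xml | sitemap-urls | archive_all
```

Because failures go to stderr and the function returns a non-zero status when anything failed, a scheduled run can surface problems instead of silently skipping pages.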
This should tell the Wayback Machine to save all the URLs from the sitemap. Wrap this up in a script you can put in a periodic cron job and you can rest easy knowing that your pages are being regularly archived. A similar process should work for archiving any (non-Jekyll) website that provides a sitemap.[3] You could also use the scripts from my web-archive-triage repository to do some more complicated things, such as only archiving pages that have no snapshot.
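To give a feel for the "only archive pages that have no snapshot" idea, here is a rough sketch built on the Wayback Machine availability endpoint (https://archive.org/wayback/available). It is not taken from the web-archive-triage scripts: the function names are invented for illustration, and it assumes the JSON response contains "available": true when a snapshot exists and an empty "archived_snapshots" object otherwise.

```shell
#!/usr/bin/env bash
# Sketch only; not the web-archive-triage scripts. Function names are illustrative.

# Succeeds if an availability response (JSON on stdin) reports a snapshot.
# Crude text match rather than real JSON parsing.
response_has_snapshot() {
  grep -q '"available"[[:space:]]*:[[:space:]]*true'
}

# Succeeds if the Wayback Machine already reports a snapshot for the URL.
# A failed request yields no match, so the URL is treated as unarchived.
has_snapshot() {
  curl -s -G --data-urlencode "url=$1" "https://archive.org/wayback/available" \
    | response_has_snapshot
}

# Read URLs on stdin and print only those with no snapshot reported.
unarchived_only() {
  local url
  while IFS= read -r url; do
    has_snapshot "$url" || printf '%s\n' "$url"
  done
}

# Usage: filter the sitemap URLs before feeding them to the save loop:
#   curl -s https://mysite.github.io/sitemap.xml | sitemap-urls | unarchived_only
```

Treating a failed lookup as "no snapshot" errs on the side of archiving a page twice rather than never.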
Footnotes:
1. You can also submit your sitemaps to Google and check up on them through the Google Search Console (formerly Webmaster Tools).
2. This more complicated bash process may work if you don’t have node/npm installed.
3. If you run a WordPress blog, you may also be interested in the Archiver plugin.