My website is now part of the web archive of the Dutch Royal Library. It took some experimenting to get it in there. Blogs will be blogs, and the number of links in mine choked the harvester, it seems.
Since 2007 the Royal Library has been archiving websites, and it now stores some 25.000 of them. My blog, even though it is one of the oldest still maintained in the Netherlands, was never part of that effort, mostly because it isn't very visible as a Dutch blog: it is mostly written in English and resides on a .org domain (when I registered zylstra.org, private persons could not yet register .nl domains, only companies could). At an Internet Archive event organised by the Royal Library last September I asked about archiving, and they told me how to suggest my website for inclusion.
Late last January I received a message that my website would be included in their archives from now on.
What followed were several test runs with their harvester, Heritrix, which is also used by the Internet Archive. I wondered how the harvester would deal with some of my website's peculiarities. Not every posting is listed on my site, for instance, although each does have a direct URL. The years' worth of weekly notes are not listed on this site. Many postings also never appear on the front page, and if you page back through postings from the front page you will never encounter them. This is true for categories of posts like books, photos, and day-to-day topics. I discussed this with the web archivist, who ran some tests. My week notes did seem to be included, but the pagination of the day-to-day category stalled at 180 pages, even though there were more.
To my surprise they also ran into volume limits, apparently because of 'bycatch': things they archive from other sites because I reference or embed them. In the past few years I have stopped embedding things, like photos, except for my slides, which are hosted on a separate domain I have registered. While it is normal for a site's bycatch to be larger than the site itself, my site's bycatch was far larger than what they were used to.
First they limited bycatch to 20GB in a test run and ran out of space; then they set it at 40GB and still ran out. Raising the limits further did not help. In the end they decided to harvest just what is on my zylstra.org domain and not include any bycatch at all. That is completely fine by me, precisely because I've made the effort to bring all kinds of external content 'home' to this domain.
Nevertheless it did surprise me that bycatch turned out to be a problem, as they use a tool the Internet Archive itself uses. I asked for some examples of the bycatch. They told me it wasn't even possible to dump a full URL list of the bycatch into a spreadsheet, as it hit the maximum number of rows (around 65k, iirc). I did get some of the URLs that contributed the bigger volumes of bycatch. To my surprise I did not recognise the links, except one.
One was obvious: 2800 attempts to harvest pages on live.staticflickr.com, as I link a lot to my Flickr-hosted images, although I no longer embed them but keep local versions on this domain.
Others were not obvious to me at all: theguardian.tv, vp.nyt.com and various content delivery networks. I link to none of them on this site. I do link to The Guardian, about 100 times, and to the NYT, about 40 times. If the harvester follows those links and harvests the targets too, it will find additional material there, which would explain the bycatch more fully.
If that is the case, that it harvests everything I've linked to, then it is the long history of this blog that makes the harvester hit its limits.
There are some 20.000 external links in this blog's articles, as far as I can quickly estimate from a full content export I made this week.
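A quick estimate like that can be made with a short script over the export. A minimal sketch, assuming the export's post bodies are available as HTML strings and counting only links that point away from this domain (the post bodies below are made-up examples, not real content):

```python
# Hypothetical sketch: count external links in exported blog post bodies.
# Assumes links appear as double-quoted href attributes in anchor tags.
import re
from urllib.parse import urlparse

HOME_DOMAIN = "zylstra.org"  # links to this domain don't count as external

def external_links(html_fragments, home_domain=HOME_DOMAIN):
    """Collect href targets from anchor tags, keeping only off-domain ones."""
    links = []
    for fragment in html_fragments:
        for href in re.findall(r'<a\s[^>]*href="([^"]+)"', fragment):
            host = urlparse(href).netloc
            if host and not host.endswith(home_domain):
                links.append(href)
    return links

# Two invented post bodies, as they might appear in a content export:
posts = [
    '<p>See <a href="https://www.theguardian.com/some-article">this piece</a> '
    'and <a href="https://www.zylstra.org/blog/weeknotes/">my notes</a>.</p>',
    '<p><a href="https://www.flickr.com/photos/example/123">A photo</a></p>',
]
print(len(external_links(posts)))  # counts only the two off-domain links
```

Run over all exported posts, `len(external_links(...))` gives the rough external link total; a proper HTML parser would be more robust, but for a quick estimate a regex pass is enough.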
It basically means that if the harvester attempts to fetch all those links and the resources they include, it adds a number of pages to the archive roughly equivalent to the current archive itself.
A weblog embraces what the world wide web is: a bunch of links to other websites. The name says it. A web-log is a curation hub for web readers, pointing out other interesting stuff, not trying to keep you here too long. Over 23 years of blogging yielded some 20.000 links to other websites. Given enough time, a blog becomes the web itself in terms of its links, as much as it becomes its author's avatar in terms of its content.
From now on my site will be updated in the Royal Library’s archives every year on March 5th.

The facade of the Royal Library in The Hague, photo by Ferdi de Gier, license CC-BY-SA
