My website is now part of the web archive of the Dutch Royal Library. It took some experimenting to get it in there. Blogs will be blogs, and the number of links in mine choked the harvester, it seems.
Since 2007 the Royal Library has been archiving websites, and it now stores some 25.000 of them. My blog, even though it is one of the oldest still maintained in the Netherlands, was never part of that effort, mostly because it isn't very visible as a Dutch blog: it is mostly written in English and resides on a .org domain (when I registered zylstra.org, private persons could not yet register .nl domains, only companies could). At an Internet Archive event organised by the Royal Library last September I asked about archiving, and they told me how to suggest my website for inclusion.
Late last January I received a message that my website would be included in their archives from now on.
What followed were several test runs with their harvester, Heritrix, which is also used by the Internet Archive. I wondered how the harvester would deal with some of my website's peculiarities. Not every posting is listed on my site, for instance, although each does have a direct URL. The years' worth of weekly notes are not listed on this site. Many postings also never appear on the front page, and if you page back through postings from the front page you will never encounter them. This is true for categories of posts like books, photos, and day-to-day topics. I discussed this with the web archivist, who ran some tests. My week notes did seem to be included, but the pagination of the day-to-day category stalled at 180 pages, even though there were more.
To my surprise they also ran into volume limits, apparently because of 'bycatch': things they archive from other sites because I reference or embed them. In the past few years I have stopped embedding things, like photos, except for my slides, which are hosted on a separate domain I have registered. While it is normal for a site's bycatch to be larger than the site itself, my site's bycatch was far larger than what they were used to.
First they limited bycatch to 20GB in a test run and ran out of space; then they set it at 40GB and still ran out. Raising the limits further did not help. In the end they decided to harvest just what is on my zylstra.org domain and not include any bycatch at all. That is completely fine by me, precisely because I've made the effort to bring all kinds of external content 'home' to this domain.
Nevertheless it did surprise me that bycatch turned out to be a problem, as they use a tool the Internet Archive itself uses. I asked for some examples of the bycatch. They told me it wasn't even possible to dump a full URL list of the bycatch into a spreadsheet, as it hit the maximum number of rows (around 65k, iirc). I did get some of the URLs that contributed the bigger volumes of bycatch. To my surprise I did not recognise the links, except one.
One was obvious: 2800 attempts to harvest pages on live.staticflickr.com, as I link a lot to my Flickr-hosted images, although I no longer embed them but keep local versions on this domain.
Others were not obvious to me at all: theguardian.tv, vp.nyt.com and various content delivery networks. I link to none of them on this site. I do link to The Guardian, about 100 times, and to the NYT, about 40 times. If the harvester follows those links and harvests the targets too, it will find additional material there, which would explain the bycatch more fully.
If that is the case, that it harvests everything I've linked to, then it is the long history of this blog that makes the harvester hit its limits.
There are some 20.000 external links in this blog's articles, as far as I can quickly estimate from a full content export I made this week.
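A quick estimate like that can be made with a short script over the export. A minimal sketch, assuming the export's post bodies are available as HTML strings and counting only links that point away from this domain (the post bodies below are made-up examples, not real content):

```python
# Hypothetical sketch: count external links in exported blog post bodies.
# Assumes links appear as double-quoted href attributes in anchor tags.
import re
from urllib.parse import urlparse

HOME_DOMAIN = "zylstra.org"  # links to this domain don't count as external

def external_links(html_fragments, home_domain=HOME_DOMAIN):
    """Collect href targets from anchor tags, keeping only off-domain ones."""
    links = []
    for fragment in html_fragments:
        for href in re.findall(r'<a\s[^>]*href="([^"]+)"', fragment):
            host = urlparse(href).netloc
            if host and not host.endswith(home_domain):
                links.append(href)
    return links

# Two invented post bodies, as they might appear in a content export:
posts = [
    '<p>See <a href="https://www.theguardian.com/some-article">this piece</a> '
    'and <a href="https://www.zylstra.org/blog/weeknotes/">my notes</a>.</p>',
    '<p><a href="https://www.flickr.com/photos/example/123">A photo</a></p>',
]
print(len(external_links(posts)))  # counts only the two off-domain links
```

Run over all exported posts, `len(external_links(...))` gives the rough external link total; a proper HTML parser would be more robust, but for a quick estimate a regex pass is enough.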
It basically means that if the harvester attempts to fetch all those links and the resources they include, it adds a number of pages to the archive roughly equivalent to the current archive itself.
A weblog embraces what the world wide web is: a bunch of links to other websites. The name says it. A web-log is a curation hub for web readers, pointing out other interesting stuff, not trying to keep you here too long. Over 23 years of blogging yielded some 20.000 links to other websites. Given enough time, a blog becomes the web itself in terms of its links, as much as it becomes its author's avatar in terms of its content.
From now on my site will be updated in the Royal Library’s archives every year on March 5th.

The facade of the Royal Library in The Hague, photo by Ferdi de Gier, license CC-BY-SA
