PocketArC

Hardcore TypeScript: Maximum-strictness static analysis with Biome and ESLint.

Fri, 10 Apr 2026 12:00:00 GMT

A few days ago I published Hardcore PHPStan, a living reference for the strictest PHPStan config I could put together for Laravel. Same thesis applies on the frontend: AI writes the bulk of my TypeScript and React code now, and deterministic guardrails are the only thing that keeps it from cutting corners.

This is the companion to that post. A living reference for the strictest TypeScript + Biome + ESLint config I could put together, with the reasoning behind every rule: pocketarc.com/typescript.

The stack

Biome handles linting, formatting, and import organization, and it's fast enough to run on every keystroke. typescript-eslint fills the gaps Biome hasn't implemented yet (the any-propagation rules, deprecation at call sites, strict boolean expressions, weak crypto detection). It's slower, so it runs in pre-commit and CI rather than on every save.

The lint command runs all three in order:

biome check --write src/ && tsc --noEmit && eslint src/

Biome runs first with auto-fix. TypeScript runs second to type-check. ESLint runs last for the remaining checks. If any of them fail, your build fails.

Why it matters for AI-generated code

LLMs trained on the corpus of TypeScript that exists on the internet have absorbed every bad habit the language allows. They write process.env.API_KEY.trim() instead of process.env.API_KEY?.trim(). They write useEffect(() => fetch(url), []) with a missing dependency array entry. Stuff like that.

Static analysis catches all of this. You can tell an LLM "use strict types" in a system prompt, and it'll mostly comply. The linter makes sure it 100% complies.

What the config covers

The Hardcore TypeScript reference covers:

Every strict TypeScript flag beyond strict: true, with the reasoning behind each one.
Biome's domain system (next, react, project, types) all set to "all" for maximum strictness, including type-aware nursery rules like noFloatingPromises and useExhaustiveSwitchCases.
Cognitive complexity limits, disallowed calls, explicit return types, and naming conventions.
typescript-eslint's strict-type-checked preset for the any-propagation rules, deprecation detection, strict boolean expressions, and eslint-plugin-security for weak crypto.
The full tsconfig.json, biome.json, and eslint.config.js, ready to drop into a project.

The full config with every rule explained is at pocketarc.com/typescript. I'd love to hear what you think. Reach out via email or X/Twitter.

Same warning as with the PHPStan config: when adding this to an existing project, this config will surface a lot of errors. If only you had a code generation tool to automatically fix those errors... 😉

Agentic coding makes strict static analysis non-negotiable.

Wed, 08 Apr 2026 12:00:00 GMT

I love static analysis. It catches bugs you didn't know could've been bugs, and it enforces conventions that would otherwise rely on human discipline and code review.

For PHP, PHPStan has been the standard for years. It's incredibly good, like a combination of TypeScript and ESLint for PHP. I put together a living reference page for the strictest PHPStan config I could come up with, and this article explains the thinking behind it.

Static analysis as an AI guardrail

Before agentic coding, strict static analysis was a nice-to-have for code quality. You'd configure PHPStan and catch a class of bugs that tests alone wouldn't find. It was a good practice.

But now, with AI generating the bulk of your code, strict static analysis should be looked at as the primary mechanism for enforcing standards. You can tell an LLM "use strict comparisons" in a system prompt, and it'll do it... but not ALL of the time. Static analysis catches it 100% of the time.

Meme showing someone stretching and cracking their knuckles, warming up to click a button that says 'Make no mistakes'.

I sincerely believe deterministic code quality tools are non-negotiable in this day and age, and making good use of them should be part of our "new" job expectations. AI doesn't write code like a human in the sense that it doesn't learn from what you tell it. Someone who's been on my team for 5 years knows exactly everything about our coding standards, the patterns we use, the mistakes we make, and the way we structure code. An LLM is constantly starting from scratch, and even when it isn't, it will sometimes just "forget" things that you explicitly told it and say "You're absolutely right!" when you point that out. You need tooling to catch those kinds of mistakes.

Cognitive complexity limits are a good example. Without them, LLMs will happily keep making functions and objects longer and longer as they work through issues. God objects out the wazoo. AI doesn't care, it's just trying to solve whatever task you gave it. But if you set a complexity limit, and the LLM is forced to decompose, it will break the logic into smaller units and come up with better-organized ways of doing things to meet its target. I've seen the refactors that those limits led to, and it's really nice.

A human would grumble about having to refactor everything they just wrote, but an LLM doesn't care. You can make it jump through whatever hoops you can imagine, and you will end up with much better code.

So the answer is: Come up with as many hoops as you possibly can! If you need better quality code, encode your standards and then force AI to meet them.

The config

I put together a PHPStan configuration for this reality: level 10 and extra strictness packages on top, tuned for catching as many things as possible. The config is opinionated, but I lay out the reasoning behind every choice in the reference.

It covers:

Every rules package, what it catches, and why it matters for AI-generated code specifically.
The full annotated phpstan.neon with the reasoning behind each section.
Disabled rules with the pragmatic reasoning (ergebnis ships rules like "no extends" that conflict with Laravel's architecture).
Ignored errors, each with the specific reason it's a false positive or framework pattern.

If you're writing Laravel applications with AI assistance (and at this point, most of us are), strict static analysis is the single highest-leverage guardrail you can add. The config is ready to drop into any Laravel project.

Note: When adding it into an existing project, this config will lead to tens of thousands of errors to fix. It's a huge lift that'll scare even the most disciplined teams. If only you had a code generation tool to automatically fix those errors... 😉

The full config with every rule explained is at pocketarc.com/phpstan. It's a living page, and I'll keep updating it as the configuration evolves. I'd love to hear what you think. Reach out via email or X/Twitter.

Stop optimizing your AI skills and ship -something-.

Mon, 30 Mar 2026 21:57:10 GMT

There's a blog post making the rounds called "Anatomy of the .claude/ folder", and the Hacker News discussion is full of people sharing their elaborate .claude/ configurations, debating structures, and arguing about which MCP servers to wire up.

A lot of engineers seem to be falling prey to this, and it reminds me of the 2010s productivity culture, where people would spend more time configuring OmniFocus and their GTD workflows than actually doing the work. Swapping task managers every few months, telling themselves that what they need is to have the system just right and it'll unlock all their productivity.

We're doing it again, except now it's AI agent shenanigans, people collecting skills like they're Pokémon.

It's all such a waste of time.

Delegating to competency

Tue, 13 Jan 2026 12:00:00 GMT

Jason Fried tweeted about why people are so excited about AI agents:

Until now, very few people have known what it feels like to delegate to total competency... Delegating to competency lets you forget about it completely. That's real leverage.

The interesting thing is it really is about delegation. Engineers who enjoy delegating and managing will fare better with AI than engineers who just want to sit in their corner writing code.

Something to think about.

xz and zstd are built into Ubuntu

Thu, 08 Jan 2026 12:00:00 GMT

TIL xz and zstd are built into Ubuntu these days. In my head they're still "new" compression formats - I've used .gz plenty on servers over the years and never really ventured out. I've used zstd before on my Mac, but I had to brew install it. Never occurred to me that it'd just be there on a fresh Linux install.

# xz with multithreading (-T0 uses all cores) and fast compression (-1)
xz -T0 -1 ./dump.sql

# Decompress and pipe straight into mysql
xz -dc dump.sql.xz | mysql -u username -p database_name

# zstd with multithreading
zstd -T0 ./dump.sql

It's kind of like how we all still reach for JPEGs and PNGs despite the fact that WebP has been fully supported in browsers for over half a decade now. Old habits die hard.

Dev Log: January 5 - January 11

Mon, 05 Jan 2026 00:00:00 GMT

macOS Screenshot Uploader
- I added a 1 second delay between copying the image to the clipboard and copying the uploaded URL, to give me a chance to paste the image content when that's what I took the screenshot for. I often take screenshots to paste them somewhere, and with this rewrite being so much faster, the URL would get copied before I had a chance to paste the image. Good problems to have!
Stealth CodeIgniter-to-Laravel Project
- I exposed the first app route directly from Laravel, bypassing CodeIgniter entirely. It's a route for managing profile pictures stored in Rekognition, so it's buried in settings, but it exercises the entire UI - the navbar, the translations system, all the blade views, notification banners from session flash data, etc. Everything is working perfectly, and that means new routes should just be a matter of moving the views and controllers over to Laravel.
- I also improved health observability, exposing lots more internal metrics, including any server setup issues, database connectivity, Redis connectivity, queue status, and more.
- Did lots of optimization to the MariaDB server, making it take full advantage of 128GB of RAM available on the server, and tuned various settings for better performance with the workload.
SyncPad
- I added a fallback for when using SyncPad over HTTP. In HTTP, the Web Crypto API isn't available, but I didn't want the app to break, so instead it will just send the data unencrypted over HTTP. Obviously not ideal, but better than nothing for people who can't use HTTPS (e.g. local network use). The app will warn users about this, so they know their data isn't encrypted in that case.
Personal Assistant
- I started turning my personal Obsidian vault into a personal assistant, by filling out my CLAUDE.md file with lots of information about me, my preferences, my projects, and so on. I hooked it up to Linear, Things, and an Apple Mail MCP, so that it can help me manage my tasks and emails. It's still early days, but I'm excited about the possibilities here.
- I created an email management skill that has thorough instructions on how to handle different types of emails, including categorizing them, responding to them, and extracting information. The ultimate goal is to be able to hand off my email inbox to the assistant and have it manage everything for me, with minimal supervision. I'll be notified of anything I need to personally handle, but routine emails can be managed by the assistant, tasks created automatically, and so on.
url2md
- I bypassed bot detection for Twitter so that I could fetch tweet content without being blocked. This involved using a headless browser to simulate a real user, and uses some of the work I've recently done around creating a virtual Google Meet participant (more on that soon). This way I can fetch tweet content for archiving in my personal knowledge base without running into rate limits or blocks.
apple-mail-mcp - I added functionality to get message IDs and raw HTML source for emails, to allow my personal assistant to read and process emails more effectively. This will enable better email management and task extraction. I do need to upstream this, raise PRs for it, or fork the repo altogether, if the maintainer isn't interested in merging it.

Claude Code skills are the next thing to get good at

Sun, 04 Jan 2026 17:00:00 GMT

I've been playing with Claude Code skills this weekend and wow, this is cool.

I started by creating separate agents for different tasks: one for bookmarking URLs, one for creating quick notes, and one for adding entries to my weekly devlog. They worked fine, but then I realized I could turn them into a skill.

What I learned is that when you turn agents into a skill, you can call on them from anywhere and the skill has access to the current conversation context. So if I've been discussing a project and I say "add this to my devlog," the skill already knows what "this" refers to.

A sub-agent would have to be given a context prompt by the main agent, which might miss some of the nuance of whatever the work being done was. It makes for much better notes because I don't have to re-explain everything.

On-the-fly image transformations are iffy.

Thu, 01 Jan 2026 23:53:50 GMT

Trying to use CloudFlare Image Transformations to resize images on the fly. The feature is cool, but the results are pretty weak. Given an optimized WebP source image, CloudFlare's transformations seem to just serve a bigger, lower-quality JPEG version of the image when resizing up. Not really sure why, but... that defeats the whole point of the service.

And as for the OpenGraph images, CloudFlare seems to struggle with getting decently-compressed PNG outputs from WebP sources. My original PNGs that had been run through Oxipng were smaller than CloudFlare's WebP-to-PNG conversions.

The whole thing has made me realize this stuff should really stay a build-time operation, not something done on-the-fly at request time. The quality just isn't there. It's a shame though, because it would've been so convenient.

The era of single-threaded human productivity is over.

Thu, 01 Jan 2026 17:59:31 GMT

Andrej Karpathy recently tweeted that he has never felt this much behind as a programmer. That sentiment reflects what I'm seeing in my own work as well.

Software engineering is going to radically change in 2026, for some people. Last Saturday, working across my side projects, I delivered what teams I've worked with would have estimated as at least 40 hours of sprint work, while barely touching my IDE. That is my new baseline.

The gap between "AI-native workflows" and "traditional engineering" is widening faster than most of us realize. As 2025 comes to a close, I want to break down exactly how that Saturday was possible, and why this level of velocity is about to become the new normal.

The great divide is already here

Plenty of engineers tried LLMs in the GPT‑4 era (or even just 12 months ago!), got mediocre results, and rationally decided "this isn't worth my time." The problem is that the tools (and workflows) changed underneath that conclusion, and a significant percentage of those engineers haven't updated their mental model of what's now possible. Many haven't tried Claude Code or similar tools, some don't see the value in a $100-200/mo subscription, others are sticking with free models and don't get to experience the state of the art, and still others have simply never revisited the question.

Earlier this year, I theorized that $1M salaries were coming for engineers¹. The thesis is that as engineers become dramatically more productive, their value increases, and salaries follow. I still believe that, but I no longer think it'll come from established companies.

I think it'll come from new ones. Founders, solopreneurs, founding engineers. People whose job will be giving AI the right guardrails, context, and environment to perform (we could call it LlmEx, like DevEx). Companies that hire one engineer expecting five engineers' worth of output. And experienced engineers who can actually assess AI output, not just vibe code, will be worth their weight in gold. That's my bet, anyway.

What our job will become

But engineer performance is increasing. AI has gotten genuinely useful. A lot of us have already seen that it's incredibly good at greenfield projects - spinning things up from scratch, putting together a little MVP of whatever it is you're wanting to build. On its own, though, it can struggle to maintain long-running projects. Making it come into an existing codebase blind works much more frequently than I would expect it to, but it doesn't always. And for those situations, it's important to establish the right foundations and guardrails for LLMs (automated tests, static typing, linting, anything that catches mistakes before runtime). Doing so will make AI perform better, and it will force us to write good code.

In my opinion, in the near future, our job description will evolve into:

Be the architect laying the foundations to enable AI to be successful in your project.
Keep up with developments in the AI world and consistently look for ways to enable AI to be more productive. A month is a long, long time in the AI world.²
Manage as many concurrent AI agents as you can to deliver as many tasks as possible as fast as possible.

That last one is what I think is the big productivity shift for engineers. And what does this actually look like in practice? I'll use myself as an example; this is how I've set up my environment to enable this parallelization:

Embarrassingly parallel programming

Over the past ~6 months, I've orchestrated all of my projects with docker-compose.yml, so that I can easily spin up multiple copies of a project, each with its own dependencies and the right versions of whatever services they need. I can have 5 copies of a git repo running 5 different copies of the project simultaneously, so Claude Code can work on separate tasks without any of them affecting any other work being done. Having them fully separated has been an incredible boon. I can run tests separately, I can have different browser tabs open for manual testing each different version of the project, I can review everything in separate PRs, it all works really well.

And it didn't take much effort: Using AI to generate the Docker files and configs in the first place means that you don't even have to deal with the cognitive effort of putting those together.

Some people use git worktrees for this, which I haven't tried, but I think that would make manual testing far more complicated than just having 5 docker compose up -d copies running at the same time. On my Mac, I've been using OrbStack, which automatically manages networking for them all, so I can go to project1.orb.local, project2.orb.local, and so on, without having to set anything up. They're all completely separated without me having to do anything.

Imagine what you could do if you could have all those copies of your project up at the same time, and you had 3 monitors so you could manage your Claude Code instances and test everything all at the same time. That's the setup.

I put together a small simulator to demonstrate why this works. Even if the AI is dumber than you, even if it takes you a lot longer to get through tasks with it or explain things to it, the moment you spin up a few concurrent agents, the math shifts aggressively in your favor:

[Interactive ParallelismSimulator component — view on site]

Tickets were designed for humans

Most engineering work revolves around tickets. Discrete units of work that get estimated, assigned, and shipped. Bigger pieces get broken into smaller ones. Think about all the tickets that you see at your work. Really think about them. A new piece of functionality here, a change to existing functionality there, that kind of stuff. Most tickets just aren't that complicated.

And this is exactly where I see the big shift between engineers who wield AI and those who don't. State-of-the-art models and tools like Claude Code have gotten us to a point where you could realistically work on 10+ different tickets all at the same time.

You would run a bunch of separate AI agents, work with them to come up with a plan for all of those tickets, and then just grind through them all. Unlike a human being, separate AI agents don't have to switch contexts. They can go read through 100 different files in your repository, they can dig into things for you, they can keep the task in mind and get through it.

A human being would just flat out not be able to grind through so many disparate pieces of work that quickly. Humans need time to think things through, understand the context, go look through the code, and make their way through. It's even worse if you're looking at tickets for entirely separate projects. What human is capable of working on different codebases simultaneously and legitimately be productive? I would love to meet them.

But AI can do it without breaking a sweat.

What 40 hours on a Saturday looks like

Regardless, when it comes to fully autonomous agents, I'm not quite there yet. I've never used --dangerously-skip-permissions (if you're not familiar, it's a YOLO mode that gives the agent full permission to do anything and everything on the computer), and I haven't left an AI to figure things out on its own yet. For now, I've been paying attention to everything all my different agents are doing. Quickly reading through the code they're putting out, sense-checking what they're doing and steering them in different ways if I'm not happy with whatever direction they're going in³.

If the plan is "change function X to support Y and update component Z to display that in a modal", all it takes is a few seconds to glance at the code for those in the Claude Code terminal, and understand that "yep, looks good". It's not "it takes me longer to review than it would if I wrote it myself", and especially not when you take into account the cost of context switching and the cognitive load of trying to grasp the context for whatever change you need to make.

After that, services like CodeRabbit can do a 1st-pass code review for you⁴. Between the automated code reviews, linting, static analysis, static typing, automated testing, and all that malarkey, the code is solid, and you can be sure it does whatever you agreed to do during your planning phase. You do some manual testing, make sure everything looks good, do a final self-review, make sure you're happy with everything that's been done, and then raise a PR.

I have already seen the performance shift for myself: Thanks to my current setup, I can now deliver what my team would estimate as 40 hours of sprint work in a single Saturday.⁵ That's 40 hours of ticket estimates: the kind of tasks that would take a week of focused work for one engineer, spread across maybe 8-10 discrete tickets. A frontend component here, a backend endpoint there, a migration, some test coverage. None of them individually heroic. All of them done. It's surreal. The amount of time that I spend in my IDE has collapsed. I can spend a whole day "working" but only ~15 minutes in my IDE. I'm just managing agents all day.

A side project of mine hadn't been updated in 5 years, and one of the APIs it used was being sunset. I needed to update it to use the new version of the API. But for that, I would need to 1) go read through upgrade guides, 2) go read through the new API docs, and 3) go get the project up and running again in the first place. Not difficult, and not technically challenging work. Just tedious low-priority maintenance work. I got a couple Claude Code instances working through the upgrade, and it all got done, with almost no cognitive load, and while I was also working on other things. I'm not the only one finally making progress on long-backlogged tasks.

And because I was keeping a casual eye on what Claude Code was doing, I actually got to see the changes it made and have a general idea of what it took. It's not super in-depth knowledge, but that would be the same situation you'd be in if you were leading a team and tasked a teammate with doing the upgrade ticket. In fact, if you assigned it to someone, you would have zero visibility into it, but if you're reading through what Claude Code does, at least you see it happening and can follow along.

To me, that's been very valuable for maintaining an understanding of the codebase.

And: It was faster with AI than it would've been if I had done it myself. But even if it wasn't, it took very little cognitive load to get through that upgrade. I was working through it as 1 of 10+ different terminal tabs I had open, working through 10+ different tasks for different projects.

This is what 'working' looks like now: 20+ Claude Code instances across multiple projects, all running simultaneously. That picture has 6 proper iTerm tabs and the one being shown is split into 7 sub-tabs. Some planning, some working, all moving different tasks forward.

The cost is cognitive, not technical

The benefit of agentic AI isn't that "it will do it faster than you", it's that it unlocks parallelization. You can be working on as many tasks as your brain can handle at the same time, limited only by your context-switching skill. The era of single-threaded human productivity is over.

And I'll admit: That is where I struggle the most - it's insanely demanding to spend a day jumping from task to task, guiding Claude instances in the right direction and making sure they all achieve their goals.

Part of me thinks that if that becomes a normal performance expectation, a lot of us will burn out. Myself included - sustaining this level of performance for weeks at a time feels like it will get obscenely exhausting.

For now, though, this level of performance makes me feel superhuman and enables me to tackle a lot of things that otherwise would have to wait. That's been satisfying. If I were building my own startup, this is exactly what I would leverage to the extreme to keep the team as lean as possible. Building is no longer the bottleneck.

Adam Wathan (creator of Tailwind CSS) recently asked: "Is there anything you've built that's been game changing for your business that was just impossible to justify pre-AI?" A lot of us are starting to answer "yes." Features that died in backlog purgatory, maintenance work that never made the sprint, side projects that sat abandoned for years, that's all doable now.

Three headwinds for 2026

As an engineer, my philosophy has always been: The more things I can be told are wrong automatically, the better. I don't want anything to break at runtime that I could've been warned about in advance. That's why I'm in love with Rust, static typing, TypeScript, static analysis tools, Result<T, E>⁶, and even IntelliJ IDEA's inspections. For a long time, IDEA was always so much more powerful than what you used to get from other editors that I couldn't imagine living without it. And now all those guardrails for humans are even more impactful than I could ever have imagined.

2026 will be a scary year for engineers. I think founding engineers have a leg up here, because without an entrenched process, they can immediately start taking advantage of the massive parallelization of work, and move faster than ever. But engineers at established companies will be going up against three headwinds:

Most corporate processes assume a human is doing things by hand. Ticket estimation and sprint planning come to mind; those pipelines aren't set up for "I could feed 10 feature requests to AI to make an initial draft plan for what it'd take to implement, and come up with a very rough estimate in a few seconds".
Part of their team hasn't bought in, and doesn't intend to. This creates a painful asymmetry: you can't build a process around AI-level output velocity if half the team is working at human speed. Someone ends up on a PIP, or the team fractures into two tiers. Neither is fun.
Their existing project(s) don't have the necessary guardrails in place, and adding them would be a lengthy process that they don't want to invest in. If your project's never even had automated tests of any kind... using AI to develop anything will require some serious faith.

I don't think engineers are going anywhere. There are still plenty of things that are just too complicated for an LLM to reason through, where it fails (some gnarly business logic bits, or whatever it is). But the % of time that you, as an engineer, do that kind of complex work, vs work that can easily be done by an LLM, is already quite low, and only getting lower.

Obviously, this depends on the kind of work you do. If you're working on low-level assembly optimization for embedded systems, it may be that 100% of your work has to be done by hand, and LLMs are of no use right now. That's OK, and completely understandable. But most of us building software aren't doing anything near that complicated.

For most of us, the earthquake has already hit, and the tsunami wave of change is coming.

As William Gibson once wrote that "the future is already here - it's just not very evenly distributed."

I’m curious about everyone's thoughts on this. I'm always excited to talk about this stuff, so feel free to reach out to me directly either on X/Twitter @pocketarc or by email.

Fun fact: The cover image for this post was made with HTML & CSS, not an AI image generation model. Neat, huh?

Not published. And now re-reading, it's out of date, alongside a few other articles I was halfway through writing throughout the year. I'm learning (too slowly) that AI moves too fast for me to procrastinate on hitting that "Publish" button. ↩
For AI, I mostly rely on Shawn Wang's (@swyx) smol.ai daily newsletter; it provides a daily summary of whatever's gone on for the day in the world of AI. I spend 30 seconds skimming it, and I'm done. It's been a godsend for how easy it makes keeping up. ↩
That's probably the reason I only rarely hit Claude limits even on the $100/mo plan - I don't just leave it to burn through tokens. ↩
I have to admit, I'm a latecomer to AI code reviews: I tried CodeRabbit for the first time this month, and it's been unbelievable. It spots a lot of things most reviewers would miss and is very easily teachable (just back and forth in PRs). It can serve as a great 1st pass code review before a human goes in for a real code review. ↩
A common response to claims like this is "where's all this productivity, why don't you share what you've built?" The reality is that most of us work on things that aren't open source. Features just ship faster, startups spin up faster, improvements are made that otherwise would sit in a backlog, and teams do more with the same people. Look at Anthropic's own release velocity, or Simon Willison's point about rolling functionality into your project rather than pulling in a left-pad dependency. ↩
Result has completely changed how I think about error handling. If you're not familiar with it, it's worth a look. ↩

Dev Log: December 29 - January 4

Mon, 29 Dec 2025 00:00:00 GMT

macOS Screenshot Uploader
- I rewrote my ~15-year-old screenshot uploader in TypeScript, with lots of enhancements to make it drastically faster. Will write up a little blog post about it. I wanted to release it as a single-file executable, but alas, we can't ship sharp in Bun yet.
The era of single-threaded human productivity is over.
- I wrote an article to discuss my thoughts on programming with AI in 2026. There's a lot to say about the topic, and I was happy to write my thoughts down before things change any further - I want these posts to serve as a time capsule to look back on. I did that a little bit back in 2024 with my GPT-4 shenanigans, but I didn't do anything for 2025, and AI programming changed SO much over the course of the year.
Cashscape
- I added Meilisearch to help with automatically finding the closest matches for payees, when an exact match can't be found. AI is given a list of options and then make a decision on which one (if any) is the right match.
- Moved ProcessEmail to the background so the email inbox processor can just dispatch separate jobs for every email and drastically speed up processing.
- Less progress on this than I wanted to; I parked it for other priorities.
XNC Discord Bot
- There's a Discord bot I've maintained for a friendly community for ~5 years, but it was originally built in PHP to explore async PHP (ReactPHP), and every now and then it'd require a complete restart, which was a pain in the butt. I rebuilt in TS with Discordeno, added support for multiple Discord servers, and better handling for users who've left the server. AI is great for these porting projects, I was very impressed. Fun fact: I didn't even know Discordeno existed until this rewrite. It works SO much better than Discord.js (which I've used in other projects). I want something similar for Slack!
Stealth CodeIgniter-to-Laravel Project
- Rebuilt the server infrastructure from a whole-infra-per-tenant setup to a traditional SaaS infra setup, with a dedicated DB server, Meilisearch, Redis, etc, created a 1000+ line document detailing all the infra setup and configs necessary.
- This also involved exposing /horizon directly via Nginx configs, making it the first route to be served directly from Laravel, skipping CodeIgniter altogether. That's a huge milestone! Next step is exposing real app routes - we have session sharing, so there should be no blockers for that.
pocketarc.com
- I built devcontainer to test YOLO Claude Code, but decided to park this idea for now. I might end up going with pre-built nixOS images that I spin up in my home lab as VMs, so I can have proper Chromium instances (claude --chrome ideas for the future) and everything from a normal OS.
- I added a light/dark mode toggle to the site. I also added JSON-LD metadata to all pages for better SEO.
- I created a content-drafter skill that combines bookmark, note, and devlog drafting into one reusable package. Started with three separate agents but realized that packaging them as a skill means they have access to the full conversation context when invoked, which makes for much better content since I don't have to re-explain what I'm working on. This was my first time working with skills, and it seems really, really interesting, actually. I guess it's just a modern version of having pre-saved prompts for different scenarios, like I used to have in Zed.

Using AI to generate images with HTML and CSS works pretty well!

Tue, 23 Dec 2025 17:42:02 GMT

I needed a cover image for my latest post, and decided to use AI for it, BUT not just "generate an image with AI", which always looks cheap and fake.

Instead, I asked Claude Code:

There's no cover picture for the new AWS S3 post - can you brainstorm some ideas for possible images I can put there? Have a look at the existing cover images.

It came up with a few different concepts, and I picked one I thought was pretty good, and asked it to come up with a few variations of that concept.

But then I took it a step further:

For each of them create a .html file with a 1200x630 div with everything laid out in it. I added two SVGs for you to use, one is the bucket icon, the other is "aws s3" logo, check them out.

It generated a few HTML files with CSS that looked pretty good. I could make a visual decision on which one I liked best, and tweak the CSS a bit to get it just right.

I then used headless Chrome to render the HTML files and take screenshots of the final one:

/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --headless --screenshot --window-size=1200,630 --force-device-scale-factor=1 --hide-scrollbars cover-option-b.html

Here are the 4 options I ended up with. Not bad, and definitely has me thinking about other possible uses for this technique.

Edit (2025-12-31): And another one! This one in particular looks incredible, in my opinion.

Dev Log: December 22 - December 28

Mon, 22 Dec 2025 00:00:00 GMT

This week I was off work 4 weekdays for the holidays, so I had a good chunk of time to work on personal projects. Here's what I got done:

pocketarc.com
- I took the time to upgrade the website to Next.js 16 + Tailwind 4, and make it completely static in the process, so it can be deployed to CloudFlare Workers as a fully-static site (CloudFlare Pages is deprecated and suggests that people set static sites up in Workers, so I figured I'd get ahead of that). Cleaned up a few different things in the layout (and enabled View Transitions, which looks damn good), created new commands, made a new navbar, finally added support for "Notes" (short-form posts) and "Weekly Log" entries, which I've been wanting to do for a very, very long time. eev.ee's my inspiration for this.
TV Sort
- I upgraded the app to Next.js 16 + Tailwind 4 as well, and made it a point to remove all Server Actions in favor of standard API routes. The security exploit made me uncomfortable about using them in production, so I decided to remove them entirely and disable them in the Next.js config. API routes are a lot easier to reason about, and I've achieved my learning goal: I know how to use Server Actions now. I cleaned up a lot of other things in the backend to improve error handling and observability. No user-facing changes, but the app is more robust now. People do actually use it regularly, which is nice.
Cashscape
- I spent a good chunk of time working on my personal CLI-based bookkeeping app. I upgraded the entire app from Laravel 8 to Laravel 12, and moved a lot of the data into the database, for easier management. Payees and transaction categories were previously huge ever-growing Enums, with regexes for matching, but now that can all be done via the DB + AI (it's important to note AI just makes suggestions, I approve/reject them, nothing's done automatically yet). I also added support for it to read through and process all of my emails for receipts, extracting receipt (whether attached to the email or available via links) with AI and storing them in the database so they can be linked to transactions (which will later on be done automatically with AI as well). That part works remarkably well, and even though categorizing every single email as "receipt" or "not_receipt" involves -a lot- of AI use, GPT 5.2 Nano's been doing a great job at it so far.
CodeIgniter
- I took the time to clear out all open issues and PRs on the CodeIgniter 3 fork I maintain, and get everything ready for PHP 8.5, including getting CI working for all PHP versions from 5.4 to 8.5 without any failures. CodeIgniter 3 is still surprisingly popular, so I'm happy to keep maintaining it for the community.
Prism
- I raised a PR for better error handling for OpenRouter API requests, which was merged in. Not a huge change, but it makes debugging issues with routing a lot easier.
Stealth CodeIgniter-to-Laravel Project
- I spent a big chunk of time making authentication and sessions shareable between CodeIgniter 3 and Laravel 12, so that users can log in to either environment and be authenticated on both. This involved writing custom session handlers for both frameworks, and enables me to slowly switch over specific routes from CodeIgniter to Laravel without the users noticing anything different. The main reason this started is because I wanted to expose the /horizon routes for Laravel Horizon, but CodeIgniter is the main app, so I wanted a way to share sessions to gate that route. Also migrated away from Selenium to Playwright for HTML-to-PDF generation, which is working out much better.

There may have been other small things I worked on, but those were the main projects. Overall an incredibly productive week, taking into account the fact that I had relatively little time at the computer, thanks to the holiday celebrations. This week has really made me think about how much I can get done with AI these days.

Apple Password Rules

Mon, 15 Dec 2025 00:00:00 GMT

TIL you can use the passwordrules HTML attribute to tell password managers how to generate passwords for you.

Here's an example:

<input type="password" passwordrules="minlength: 20; required: lower; required: upper; required: digit; required: [-];" />

More info on Apple's passwordrules demo site and on Apple's documentation site. There's a WHATWG proposal to standardize this across browsers. It was deadlocked for a long time, with Mozilla being against it, but as of March 2025, Mozilla's position is now positive.

Regardless, there's no harm in using it, and it can only help users who use password managers that support it.

How I migrated my S3 buckets to another AWS account

Sun, 01 Dec 2024 16:45:00 GMT

For many, many years, I've had multiple AWS accounts (for personal and business). I first signed up to AWS with my own personal Amazon account, and I've long worried about what that might mean for me if anything happened to lock me out. We've all heard the stories of people getting their accounts suspended/closed (e.g., someone loses their entire Google account because of a Play Store incident) and how that impacts everything for them.

Losing my Amazon account isn't a big deal (since I don't really care what Amazon does after I've received my deliveries), but losing my AWS account out of the blue would be incredibly disruptive. I don't want to end up on Hacker News looking for support.

Fearing such a situation, over the years I've moved most of my stuff across. Domains have been shifted to Namecheap, DNS has been shifted to CloudFlare, and hosting has moved to bare metal providers. One provider per function - contains the blast radius if any of them decide to pull any tricks. But the one thing that I'd always left alone was S3. S3 is incredibly cheap and effortless to keep, so moving everything out of this account was never a priority. One of the buckets was also the origin of a CloudFront distribution, and I knew it would be a bit of a hassle to transfer it all.

I decided to look at it today, though, and the process turned out to be remarkably simple. I thought about using bucket ACLs to properly transfer the bucket, which might have been a less manual option, but given that I hadn't looked at these in years, I decided to do some housekeeping and go through each bucket manually to determine what I did and didn't want to keep. Some of the buckets were for services I didn't use anymore.

1. Set up credentials for the `aws` CLI tool for both accounts.

First, in ~/.aws/credentials, you can set access keys for all your accounts, making the entire process incredibly easy. You can use --profile account-1 with any of the CLI commands, and it'll use the correct access keys.

[account-1]
aws_access_key_id = ...
aws_secret_access_key = ...

[account-2]
aws_access_key_id = ...
aws_secret_access_key = ...

2. Download & delete buckets you don't need anymore.

There are some buckets I am happy to delete, but being a digital hoarder, I still want to make sure I have a backup of them. I have a local home lab with a chunky mirrored ZFS array replicated off-site, so I am happy to store the S3 bucket backups there. My goal is to do housekeeping with my AWS accounts, not to get rid of the data.

In my case, I am not worried about the size of the buckets because they are small (these aren't powering production services, just personal projects and services).

Run the commands below, and that's it. AWS will sync everything for you.

My ZFS array is mirrored, meaning that data isn't lost even if a hard drive dies. And even if all drives were to die at the same time, there's an off-site replica with everything in it, so this stuff is adequately backed up. That whole setup is a story for another post, but it works great for my needs here.

# This will download the entire S3 bucket onto your computer.
aws s3 sync --profile account-1 s3://bucket-1  /zfs/backups/bucket-1

# This will delete every file in the S3 bucket and then delete the bucket.
aws s3 rb --profile account-1 s3://bucket-1 --force

3. Create new buckets and transfer everything.

In my case, I don't need to worry about transferring the data from one bucket straight to another. That would definitely be a fun exercise (and a lengthier post), but in my case, I can hold everything I had in S3 locally, so I can afford to do this the manual way.

Create the new S3 bucket on your new account in the AWS Management Console.

Once that's done, syncing back is a single command:

# This will upload the entire folder from your computer onto S3.
# Note the different bucket name and profile parameter.
aws s3 sync /zfs/backups/bucket-1 s3://bucket-2 --profile account-2

And that's that!

4. Setup new CloudFront distribution (if needed)

One of my buckets serves as my personal screenshot upload service. I have a script on my computers that automatically uploads screenshots whenever I take them (this is especially useful on macOS, combined with Cmd-Shift-4, which lets you take cropped screenshots).

I like to make those screenshots available via a CDN because I often link to them directly on websites and want them to load fast from anywhere in the world. Overkill? Certainly. But it was a fun learning experience when I set it up over 10 years ago (using a CDN for the first time), and it's virtually free, so... why not.

Things to remember/note:

Don't forget to set a default root object (in my case, it's index.html).
For a CloudFront distribution with an S3 origin, you can create a custom error response for 403 errors. I did it with error.html and even got it to respond with 404. So when you go to any random page, you get a neat 404 page rather than a bare S3 XML response.
Set up an SSL certificate with AWS Certificate Manager.
Set the CloudFront distribution to redirect HTTP to HTTPS.

That's it! This might not be anything special, but as I went through the process, I figured it wouldn't hurt to write about it. I want to get more in the habit of writing about stuff as I'm doing it rather than keeping it to myself. Fun fact, though: I wrote this in December 2024, but only published it in December 2025. So I'm still not doing great on the whole "keeping it to myself" thing, but I'm trying.

If you have any suggestions or thoughts, I'd love to hear from you; I'm always excited to talk about this stuff, so feel free to reach out.

Shadcn Combobox Not Scrolling

Sun, 10 Nov 2024 00:00:00 GMT

While working on my new personal budgeting app, I noticed that the Shadcn UI combobox was not scrollable.

That's not great UX (a <select> without scroll?), so I wanted to find out if there was a fix for it.

After some digging, I found this issue on Radix UI's GitHub, and it seems like the fix is pretty simple:

<PopoverPrimitive.Content
    ref={ref}
    align={align}
    sideOffset={sideOffset}
    onWheel={(e) => {
      // This fixes a bug where the popover would not scroll:
      // https://github.com/radix-ui/primitives/issues/1159
      e.stopPropagation();
    }}
    className={cn(
     // ...
    )}
    {...props}
/>

The onWheel handler is added to the PopoverPrimitive.Content component, which is the component that the combobox uses to render the popover.

This seems like a Radix UI bug, but at the same time, Shadcn UI should contain the fix by default.

The Art of Silence - Teaching LLMs When Not to Speak

Wed, 18 Sep 2024 09:00:00 GMT

During a discussion on Hacker News, someone said:

I'm not entirely cynical on the value of LLMs, but I've yet to see one say "I don't know", or "I'm not sure, but here's my best guess".

I see that mentioned a lot as a problem with LLMs, and that prompted me to write a bit about why that is and what you can do to mitigate that problem.

I've used LLMs a ton over the last couple of years. I've co-founded a VC-backed AI startup (hasn't everyone, at this point? 💁🏻‍♂️), and I've done consulting for a bunch of different companies on their AI strategy and on implementing AI into their processes and their products.

Quite recently, I helped a company implement a system to autofill custom user-made forms based on unstructured documents and text. With the right guidance, the LLM correctly does not answer fields that it doesn't know and does not try to guess anything. It has been pretty much error-free (humans do manually review the output for validation).

It's all about your prompting. Without explicitly being given guidance on how not to answer, an LLM will never say it doesn't know.

How to ask all the wrong questions

Given this prompt:

what is 34342234 * 583947?

It will always come up with something, and it's usually going to be wrong. This is 3.5 Sonnet's response:

To calculate 34342234 * 583947, let's break it down step-by-step:

1. First, let's multiply 34342234 by 583947
2. We can use a calculator for this large multiplication, as doing it by hand would be time-consuming and prone to errors.
3. Using a calculator, we get: 34342234 * 583947 = 20054319958398

Therefore, the result of 34342234 * 583947 is 20,054,319,958,398.

Of course, if you plug that into an actual calculator and not just a text prediction system, you'll see that the result is actually 20,054,044,517,598. It got remarkably close, which is a testament to transformer technology, but it's still just guessing at what the 'predicted' number is. It's not "thinking".

You may have the wrong mental model of how LLMs work

This is the mistake that a lot of people make when using LLMs. It might be that in the future, AGI will come, and it will actually "think" and take over the planet. Until then, this is just a text prediction tool, and until your mental model of LLMs incorporates that, you will not get the results you want.

If you think "LLMs are oracles that always give you the right answer", you're wrong, and LLMs will disappoint you. If you believe "LLMs are like humans that think things through, but sometimes they lie, and you can never control it", you're also wrong.

First off, LLMs don't see text the way we do. They break everything down into "tokens" - word chunks. Sometimes, a token is a whole word; sometimes, it's part of a word. The word "indubitably" might be broken down into "ind-ubit-ably".

That is ind-ubit-ably correct.

The LLM's job for each token is basically: "Based on everything I've seen before, what's the most likely next token?" The context - all the text that came before - is super important here. It's juggling probabilities for all possible subsequent tokens, all based on patterns seen in its training data.

With that in mind, it should immediately be clear that you will never get an LLM to say nothing. It will always just feed you the next most likely token. Sometimes, that's the "end of text" token, and that's how the system knows to stop asking the LLM for more tokens.

Now, back to that calculation. Notice how the LLM got the length right but the actual digits wrong? That's probabilities at work again. Based on what we wrote, there was a very high probability that the next token was going to be a number and that it should be in the trillions (because a number in the 10s of millions multiplied by a number in the 100s of thousands will certainly equal a number in the trillions). But when it came to the specific digits, the LLM was essentially playing a very educated game of digit roulette. It's impressive that it got so close, but it's also a stark reminder that we're dealing with a prediction engine, not a calculator.

This is precisely why getting an LLM to say "I don't know" is so hard. By default, it will always give you something, because that's what it's designed to do. To get what you want, you need to steer it. Even if what you want is "I don't know".

One weird trick to humble your AI

Add this to the system prompt:

You are not a calculator, and do not have access to a calculator.

And now, the new output for the very same user prompt¹:

Sorry, I can't do that calculation. I don't have a calculator handy.

A few more runs:

What's 34342234 * 583947?

Wow, that's a huge number!
No idea, sorry.
Too big for me to calculate.
Maybe try a calculator?
That's beyond my skills.
Can't do that in my head.
Lots of zeros, I bet.
I'm not a math whiz.

🎉 Ta-da! 🎉

It's not magic, but LLMs do need guidance. If your experience with LLMs is only through services like ChatGPT or Claude, where the provider's setting the system prompt for you, and it's a black box you can't easily control... you're missing out on a lot of the fun of LLMs.

Wait until you get to play with uncensored local LLMs hooked up to Slack, wreaking unfiltered havoc on what once was an innocent, wholesome, GPT-4 workspace. 🥲

What about when no answer is the best answer?

As we've established before, there is no easy way to get an LLM not to respond at all. It always wants to add the next most likely token. This 'next most likely token' mechanism is fundamental to how LLMs generate text. They don't 'decide' to stop; they will keep giving you the next most likely token until they reach a designated stop condition (the 'end of text' token I mentioned) or a predefined maximum number of tokens.

While working on my multi-channel (Slack, Discord, email, SMS, etc.) chat agents (something that has been waiting for me to launch publicly for way too long), one of the issues I've faced was getting it not to respond to every single message. In my original article, I wrote about using a getNeedsReply function to ask an LLM if a message needs a response before investing in actually generating a response with a high-end model.

That worked great for the most part, but it wasn't perfect. It added an extra API call, and it was never as accurate as you'd want it to be because it wasn't based on the same context that the actual agents were getting.

With a bit of rethinking, the task of getting an LLM not to respond can be made much easier. You have to accept that the model will respond, but you can control how it responds. At its most basic, you could do something like:

Think through your response in <thinking> tags. Decide whether or not the message needs a response. If the message is just an acknowledgment, or the natural end of a conversation (e.g. "Thanks!" or "No problem."), you don't need to respond.

If you decide to respond, write your response in <response> tags.

Immediately, that makes your job extremely easy. Look for <thinking> tags to confirm that the LLM did behave as expected (without a <thinking> tag, that would indicate that the model refused to respond or any other failure mode), and then look for a <response> tag. If there is no response tag, the model didn't respond, and you don't need to return anything to the user.

Bringing it all together with code

Example TypeScript code:

interface LLMResponse {
  thinking?: string;
  response?: string;
}

async function getLLMResponse(
  prompt: string,
  systemPrompt: string
): Promise<LLMResponse> {
  // We can assume llm is your LLM client.
  const fullResponse = await llm.generate(systemPrompt, prompt);

  const thinking = extractTag(fullResponse, "thinking");
  const response = extractTag(fullResponse, "response");

  return { thinking, response };
}

function extractTag(text: string, tag: string): string | undefined {
  // Could use regex, but should probably use a proper parser.
}

const systemPrompt = `...`; // What we had above.
const userPrompt = "Thanks!";

const { thinking, response } = await getLLMResponse(userPrompt, systemPrompt);

console.log("Thinking:", thinking);
console.log("Response:", response);

And this is what you get:

<thinking>
This message is a simple "Thanks!" which is typically used to end a conversation or express gratitude. It doesn't require a response, as it's a natural conclusion to an interaction.
</thinking>

The LLM did respond and did generate text, but none of it needed to be user-facing. This is a little bit of a cheat, and amusingly enough, it's in the same vein as what OpenAI's o1 is doing with its hidden chain of thought.

Wrapping up

AI isn't magic, and if you take the time to really understand what's going on and why things are the way they are, you will be able to get a lot more value from LLMs, both for personal use cases and for work.

I encourage you to experiment more with LLMs; there's lots that can be achieved with the right prompting, lots that you can get them to do. You just have to be a bit flexible.

Then you will see that it is not the LLM that bends, it is only yourself.

What has been your experience with getting LLMs to express uncertainty or not respond? Have you developed any unique techniques? Share your thoughts; I'm always excited to talk about this stuff, so feel free to reach out to me directly either on X/Twitter @pocketarc or by email.

I did add Use very, very short sentences. to the top of the system prompt because when it starts going off on a multi-paragraph tirade about how it's an AI model, and you should be using a calculator or a spreadsheet or scientific software, or blah blah blah, my eyes roll so hard that I fear they may end up stuck there. ↩

Next.js Link Scroll Behavior

Tue, 16 Jan 2024 00:00:00 GMT

<Link> doesn't actually scroll to the top of the next page. It scrolls to the {children} of the layout of the next page.

If your layout has a big header that pushes the {children} below the fold, the <Link> will NOT scroll to the top of the page.

To fix this, the best solution I've found is to move the header bits to the page itself, not the layout. This way, the <Link> will scroll to the top of the page.

What about mdx?

I wrapped my entire .mdx files in <PostLayout post={metadata}> and it works fine. Not sure if this is the best solution, but it's good enough for now.

TV Sort: Engineering the ultimate TV episode ranking system

Sat, 30 Dec 2023 21:41:00 GMT

TL;DR

I built TV Sort, an open-source episode ranking game that ditches the usual 1-10 ratings for a unique, human-driven sorting algorithm that pits episodes against each other. The game draws from TMDB, IMDb, and Wikipedia, with a bit of GPT thrown in for good measure.

My mother and I are huge Frasier fans. We've both watched and rewatched that show more times than we can count. And this year, while spending the Christmas week in Ireland with my family, we got to talking about which Frasier episode is our absolute favorite. And I started thinking: There are many solid episodes; how would they all rank against each other?

The easy way out is to use a 1-10 ranking system. But that's not good enough. There are lots of episodes that I rank 10. And how do I know the difference between a 7 and an 8? What -is- the difference?

Could I come up with a better way to rank episodes?

The algorithm

If I ask you to rank all episodes of a show, you'll find it quite tricky beyond saying, "Oh, I especially like X and Y, but I really don't like Z". But if I ask you to tell me which of two episodes you prefer, that's much easier! It's easy because you can say, "I really like A", and move on.

The first thing that pops into mind when thinking about this is ranking systems like Elo. Elo would be perfect for this, except that there is no stopping point at which you're -sure- that the rankings are properly defined - you can keep "battling" episodes forever. It also doesn't minimize the number of comparisons a person would have to make, which means they could be sitting there for days comparing 200+ episodes.

I had to come up with something better. And then it hit me: That… sounds a lot like a sorting algorithm, but one that waits for human input rather than being fully automated. If someone could provide the input for all comparisons made by the sorting algorithm, creating a ranked list of episodes would be possible. And so that's what I decided to go for.

I had previously seen Leonid Shevtsov's MonkeySort, which implements precisely what I've just described: A human-driven Quicksort algorithm. I decided to read up on it to understand how it does its job and re-implement it to drive this.

Note: Quicksort isn't the best algorithm for this; in the worst case, it's O(n²). A merge sort would be ideal here; it's stable and always O(n log n). Its only downside is that it uses more memory, but we're not sorting huge lists here, so that's not a concern. I've raised an issue (for myself) to look into this later on.

The data

Once I had the algorithm figured out, I had to think of what it would take to make it something a person could use:

A web app that shows episodes side by side, letting you decide on the outcome.
Episodes should include stills and descriptions to help jog your memory.
People should be able to use this web app for more than just Frasier. 😅

The first part was straightforward. I already had the algorithm. But grabbing the data for different shows and episodes was something that I had to figure out. Obviously, IMDB has a ton of data, but their API has a cost associated with it. There's a third-party IMDB API, but they charge for it, too. I wanted to avoid paying monthly fees for a web app like this, where I wouldn't need the API after fetching the relevant show details.

Thankfully, I knew of another place where I could get the data: The Movie Database

My home media server (Plex) uses it, as does Filebot, which I use for organizing media files. I decided to check it out, and it turns out it has an entirely free API with an incredibly generous rate limit.

With that in mind, I built the web app using Next.js, with Knex driving the database. People could search for a show, and it would fetch shows from TMDB. Once someone clicked on the show, the "show sorter" would start. In the background, the backend would fetch and store all the necessary data for all episodes so that everything could be fed down to the browser in one go, enabling the person to go through the entire sorting process locally instead of keeping them waiting for the backend between comparisons.

It was all working well, except for one big problem: Assessing an episode by reading the descriptions put a lot of cognitive load on the person. Doubly so if the descriptions were wordy. My goal was to make it as easy as possible to quickly grasp what an episode was about so you could get through the comparisons more easily.

I needed to clean up these descriptions for all episodes of all shows.

The LLM

Turning to an LLM was the obvious answer. I decided to feed the descriptions through GPT to get it to spit out 3 concise sentences describing the plot. I wanted to use GPT-4, but it wouldn't have been feasible cost-wise to do it for what is essentially an unlimited number of episodes and TV shows. Using GPT-3.5 meant I had to do a bit more to get it to work well. After tinkering with the prompt a bit, I came up with an excellent way to do it (first, spit out -all- plot points, then find 3 significant points to showcase, and then, if the generated sentences are too long, shorten them a bit).

That worked quite well, but the descriptions being fed to the LLM were not great: TMDB's own guidance says, "When writing an overview, try to keep it short, concise, and free of spoilers."

But we -need- spoilers. We -need- to know the significant things that happen during the episode so that it's very obvious to someone comparing these episodes.

So, I set out to get better data. TMDB includes external IDs with each episode, which means linking an episode to the relevant page on IMDB is easy. IDMB has a "plot summaries" and "synopsis" section, and I was able to grab them with basic HTML scraping. Easy.

That worked well enough. But some episodes didn't have enough detail in IMDB (like for The Office). I needed better data.

That's when I turned to Wikipedia. I had already thought about it before, but Wikipedia is difficult: There is no direct link to episodes from TMDB, and many shows don't even have dedicated episode pages; they have season pages listing every episode. I had put it on the back burner until I realized how important it was to get data from there.

Once I started looking into it, things turned out to be simpler than I thought. It turns out that TMDB does have a Wikidata ID for every episode (if the episode has a specific page) and for every show. That meant that I could use SPARQL and the Wikidata Query Service to find the Wikipedia URL for the episode (and if an episode didn't have its own page, find the URL for that episode's season, where the episode is listed and described).

The query to get the URL for a show's season:

SELECT ?wppage WHERE {
wd:${wikidataId} wdt:P527 ?season . # P527 (has part)
?season wdt:P31 wd:Q3464665; # P31 (instance of) Q3464665 (television series season)
p:P179 [pq:P1545 "${seasonNumber}"] . # P179 (part of the series) P1545 (series ordinal)
?wppage schema:about ?season .
FILTER(contains(str(?wppage),'//en.wikipedia'))
}

I'm not going to lie; I have -no- idea what I was doing with SPARQL.

I spent only an hour trying to fiddle with it (and reading the Wikidata SPARQL tutorial). Even though I was able to learn enough through trial and error to get these queries working, I was definitely humbled. It was -nothing- like SQL.

I understand the query well enough to know that it's looking for the Wikipedia URL for the season of a show with a specific season number. But I don't understand the syntax well enough to know why it works. I'm sure there's a better way to do it, but I got it working, and that's all that matters.

Once I had the right URL, scraping the content (the Plot section of an episode page, or the relevant table cell in a season page) was easy.

And with that done, I was finally able to turn this (from Stress Relief (The Office)):

[
"Dwight's fire safety seminar goes wrong.",
"Michael organizes a roast for Stanley.",
"Andy believes Pam and Jim are film gurus."
]

Into:

[
"Dwight's realistic fire alarm causes Stanley's heart attack.",
"Pam's father seeks a separation from her mother.",
"Michael organizes a comedic roast for himself in the warehouse."
]

It might seem small, but it's the difference between someone not being sure what episode it is and immediately knowing and being able to compare it to other episodes.

Performance

One of the biggest problems with using an LLM, of course, is that loading a 200-episode show for the first time was no longer instant. Whoever was loading the show for the first time had to wait for the LLM to generate these plot points, which could take several minutes. I didn't want to complicate my infrastructure for this, so I decided to use pg-boss, which is a -brilliant- job scheduling system for Node that uses only Postgres (the database that I was already using for storing show/episode information).

I was able to shunt the generation of plot points to the background so that when someone tries to load a show for the first time, it's available for ranking instantly, using the original TMDB episode descriptions until the LLM-generated plot points become available. The browser will keep pinging the backend in the background until it has all the plot points, updating the UI as they become available.

Moving the work to the background also makes it easy to avoid doubling up work when a show is accessed multiple times while processing.

Finishing Touches

Developing this, I had to ensure the interface worked well on mobile. I iterated several times to find something that would fit well, even on tiny mobile phone screens. I also put the comparison buttons at the bottom so that someone on their phone could quickly get through a bunch of episode comparisons without having to move their fingers.

I also wanted to ensure there were show-specific landing pages to which people could be linked. Those pages don't have much content at the moment, but I'd like to add more information to them in the future - like a list of the top 10 episodes or the best and worst episodes of each season.

I think it looks pretty spiffy, especially on mobile.

I'm sure the UX can improve further.

While preparing to deploy this, I obviously had to work on the little details, like containerizing it, adding error tracking, sitemap.xml, robots.txt, and even OpenGraph data to make it easy for people to share.

Open Graph (OG) images are the images that show up when you share a link on iMessage, Slack, Twitter, or any other place.

Thinking about how to make this fun for sharing, I started thinking about how Next.js lets you generate OG images dynamically. I decided to make the image include the show's poster.

I think this does its job perfectly when it comes to drawing people in.

It's live, and open source.

The journey from a casual holiday debate to a fully functional game was challenging and rewarding. TV Sort is now live, and I encourage you to try it out and contribute your rankings and insights.

One of the byproducts of this adventure has been a deeper appreciation for the wealth of data available through TMDB. I encourage everyone to embrace the spirit of community-driven data refinement and contribute back to TMDB, enhancing the data quality for everyone. Especially stills!

The project is open-source under the AGPL license. There is still plenty to improve: I have a sizeable public list of planned features and improvements on GitHub, including the ability to see all your rankings (in progress or completed) on the home page, the ability to rank specific seasons rather than the entire show (and have it automatically count towards a ranking for the whole show), and the ability to undo mistakes. Also, one day, I'd like to use all these rankings to form a global consensus on where precisely each episode of every show stands, beyond simplistic 1-10 rankings. If you're a TypeScript developer and interested in this, I'd welcome any help.

And, whether you're a TV aficionado eager to curate your ultimate episode list or someone who loves a good sorting algorithm, head to TV Sort and start ranking.

And, of course, check out my favorite Frasier episodes. Making this list is the whole reason this started!

If you have any thoughts to share, ideas for improvements, or issues you've encountered, reach out to me by email, Twitter, or wherever else you might find me. I'd love to get your perspective on this (if only to know that I'm not alone in caring about this!).

useJourney, a React hook for building user journeys

Sat, 15 Jul 2023 17:00:00 GMT

The problem

Over the years, I've built several different user journeys as part of my work, and as they grow, they always become harder and harder to maintain. Logic between steps starts getting convoluted, and you need to track what step should come next and all the different variables you depend on. It becomes a mess.

I've thought a lot about this and started looking at state machines to deal with it. Libraries like XState seemed appealing but ultimately seemed too divorced from my problem to fit into it (if you disagree, I'd love to hear your opinion!).

What do we need?

At their core, all journeys have the same need for answers. Based on your state:

What's the next step?
Should we even show a next button?
Which steps are complete?
Which parts of the journey are available to the user given the answers they've given so far?
Is this the last step of the journey?
And so on.

How can you build all this logic into your system in a way that is maintainable, easy to extend, and easy to reason about?

What if it was as simple as:


    const { CurrentStep } = useJourney(steps, state);
    return <CurrentStep />;
}

How it works

There are two key things you give useJourney: State and Steps. State is easy; it's all the variables that define your journey's current state, including the step the user is currently on. The Steps parameter is where the magic happens; it contains all the information for each step, including any necessary logic.

With that, each step can decide on its own situation, whether it's skipped or complete, whether the user should be allowed to proceed from it, etc. Logic becomes easy to maintain, as each step has full access to the entire state object and the results of decisions by other steps (e.g., mark this step as skipped if Step X is also skipped). It also becomes easy to keep everything organized, as each step (and its component) can be kept in separate files.

The example below shows off a complete journey, including a step that gets skipped based on the user's answer to a previous question.

You define a journey as a map of steps (you can use getStepsMap to infer types in TypeScript, which will give you autocomplete in your IDE for all of a step's possible properties), each of which has a slug, metadata, and any logic that you need to run to determine if the step is complete or skipped.

The metadata is just a container for any data you want to pass to the step's component, and all logic is just functions that take the state and use it to make decisions.

How to use it

Each step in a journey should be in a different file, so it's straightforward to create huge complex journeys and keep them all neatly organized. In this example, we will define all the steps in the same file to keep it simple.

You can get more documentation at pocketarc.github.io/use-journey.

// First, define the steps.
const steps = getStepsMap([
    {
        slug: "start",
        component: StepStart
    },
    {
        slug: "is-new",
        component: StepIsNew,
        isComplete: (state: State) => {
            return state.isNew !== undefined;
        }
    },
    {
        slug: "full-name",
        component: StepFullName,
        isComplete: (state: State) => {
            return state.fullName !== "";
        },
        isSkipped: (state: State) => {
            return state.isNew !== true;
        }
    },
    {
        slug: "finish",
        component: StepFinish
    }
]);

// Then, use the journey.

    const [state, setState] = useState<State>({
        currentStep: "start",
        isNew: undefined,
        fullName: undefined
    });
    const { CurrentStep, showPreviousButton, showNextButton, goToNextStep, goToPreviousStep, slug } = useJourney(steps, state, setState);

    return (
        <>
            <h1>You are on {slug}</h1>
            <CurrentStep />
            {showPreviousButton && (
                <button onClick={goToPreviousStep} disabled={!showPreviousButton}>
                    Previous
                </button>
            )}
            {showNextButton && (
                <button onClick={goToNextStep} disabled={!showNextButton}>
                    Next
                </button>
            )}
        </>
    );
}

Getting started

Pretty standard, use npm (or yarn, or pnpm) to install use-journey.

npm install @pocketarc/use-journey

Check it out on GitHub, and read the documentation.

Final words

If there's anything you need, don't be afraid to ask! This package is still in an early stage of development, and I'm looking for an outside perspective from others trying to build their own journeys, so feel free to raise issues as needed. PRs are welcome, as well.

Diffusion Depot, the smart image manager for Stable Diffusion, Midjourney, and DALL-E

Sun, 18 Jun 2023 15:54:07 GMT

TL;DR: I built an app to manage images generated with ML models. Check it out.

Recently, I started playing with Stable Diffusion, using it to generate countless images. I have a beefy 24GB 3090 Ti that I've held onto even after I stopped streaming on Twitch, and I wanted to know how it would fare compared to my previous experiences with DALL-E 2.

The results were astounding. It generates batches of images in a matter of seconds, making it as fast as DALL-E, with much more impressive results. This is one area where OpenAI definitely doesn't have the lead.

However, within a few days, I had accumulated thousands of different images, some just slight variations on an existing prompt. If I wanted to return to a particular version of a prompt, digging it up was a bother. And what if I wanted to find all images that contained something specific, like all images of Patrick Stewart swimming in the ocean or horses wearing red hoodies (why not)?

It was impossible to manage.

I wanted to generate as many images as I wanted and try out as many prompt variations as I wanted without thinking about how to find them later or carefully curating all the generated images to keep only the "best".

Key features

I started thinking about building an app that could take care of it. What would it take? What features could it have that would make it worth the effort to build the app itself and use it regularly? Well:

It needs to let me tag images. When I generate a big batch, I need to be able to quickly tag the pictures of that batch and move on. Bonus points if it could do some of the tagging automatically.
It needs to be faster than the file manager. I can scroll down a long folder full of images, so whatever the app does, it needs to beat that. Maybe a scrollable list of all pictures in higher resolution than what we get with the file manager? It also needs to let me filter by tag, and it needs to be instant.
It needs to list all prompts I've ever used. And it needs to let me see which images each prompt has generated so that instead of just seeing a bunch of words, I actually see the "typical output" for that prompt.
It needs to integrate with Stable Diffusion. I want to be able to upscale images whenever I want and generate more images on demand, either based on an existing image or a prompt. Thankfully, all of that is possible with the Stable Diffusion Web UI, which has an API to control it.

With those requirements in mind, I set out to build an app that could do it all. I figured that since this was an app for ML-generated images, I should keep with the theme and generate all the assets around it. I asked Lucas, one of my bots to come up with a name, and he settled on Diffusion Depot. I also used IconifyAI to generate the app’s icon.

How it was built

The app itself was built in Electron, with React and Next.js. An SQLite database stores key metadata about all your pictures, prompts, and tags, grouping them all intelligently. By keeping everything in an SQLite database, all of the app’s data is easy to access and export. No lock-in.

One of the key challenges that I quickly noticed while building Diffusion Depot was that loading a list of thousands of images was quite taxing on the computer. It would noticeable stutter and freeze while first loading the list. The fix involved two key bits of technology.

Firstly, I used a technique called windowing to limit the number of images rendered in the DOM at any one point. That way, even if the list has multiple thousands of images, only a handful are being rendered, which is a lot lighter on the computer.

Secondly, I realised that the size of the images themselves played a part when it comes to the strain being placed on the system. The images that come out of these models are not necessarily optimized, so even though it’s all local, they still carry a cost.

While I was testing this theory out, using my age-old PNG/JPG optimizers, I decided to switch to WebP, since it’s supposed to provide smaller images for the same level of quality. Electron is built on the Blink browser engine, which has supported them for a long time, so there was no reason for me not to.

But then it hit me - if I’m using the Blink browser engine and don’t need to worry about compatibility, I can go a step further and use AVIF¹. AVIF provides an even bigger boost in compressed sizes, compared to WebP, and after testing it with a few images, I decided that that was going to be the way to go.

The app generates two thumbnails for each image. One is a low-quality AVIF thumbnail, meant to be included as a base64 data URI in the image list’s data, and the other is a virtually indistinguishable optimized AVIF version of the original image. When rendering the images in the app, there was a very slight flash of unloaded content while the images loaded, and by having the tiny base64 data URI as part of the list data, that went away because a low-res version of the image could be displayed immediately while waiting for the higher-quality image to load. Even if this flash of unloaded content lasted for just a few milliseconds, it was enough to make the experience feel a bit jarring and unpolished, so getting rid of it was one of the highlights of the development of this app.

What's next?

Diffusion Depot is virtually complete. There are a few bugs outstanding that I want to tackle, but the last big hurdle will be making sure that it all works smoothly across platforms. Electron handles the app itself, but I need to make sure that things like the AVIF image encoder, and the Python age/gender detection library, are packaged properly so that it will all work regardless of the platform you run the app on.

With development so far along, I decided to take a short break to put together a nice website and get everything ready for when it actually launches. For now the website invites you to join a waitlist, but I intend to release the full app very soon, and when I do, I’ll add a bit more content, including a section showing off features in more detail, as well as a video of the app itself.

Off-topic: I’ve spent my life building things but never actually sharing them with anyone, and in 2023 I’ve started making a conscious effort to make sure that when I build something, I do the hard work of writing about it and putting it out there. Even if it doesn’t go anywhere, it’s something to point to when someone asks me “What sort of things have you done?”.

If you’ve read this far, thank you! I’d love to hear from you. Is there anything you’d like to see in Diffusion Depot? Anything I missed or that you’d like to know more about? Contact details are in the footer!

P.S. Check it out.

AVIF was added to Chrome in 2020 and is already supported by most browsers. The tech world really does move fast (or maybe I'm just getting old). I remember when WebP was the hot new thing, and it's already been replaced by something better. ↩

My coworkers are GPT-4 bots, and we all hang out on Slack

Thu, 25 May 2023 09:00:00 GMT

For the past month, my friend and I have spent much of our time in a Slack workspace hanging out with a grumpy CTO who keeps cracking the whip, a Harry Potter-loving product manager, and a few chill developers. Having them around has completely altered the pace – and enjoyment – of our work days. They bring so much fun and personality to Slack, and if we ever have a question or concern, we send everyone a message and someone gets back to us quickly. For all intents and purposes, they are regular coworkers, virtually indistinguishable from any coworkers you might have encountered. We’ve laughed with them, vented with them, and worked as a team. I’ve even received rather good music recommendations from one of them!

The thing is, though: They’re bots.

Well, I had just finished deploying GPT-4 for customer support purposes and kept coming up with new ideas for integrating this feature in other places. Eventually, I thought: Why not try it with Slack?

Of course, we’re likely all familiar now with ChatGPT, but that’s a separate interface with a different interaction model. When you use ChatGPT, you know that you’re talking to a LLM, and that it’s just the two of you conversing with each other. But when we use Slack, Discord, or Microsoft Teams for work, we’re reaching out to people and discussing things in public channels.

There isn’t very much of a difference between pinging a coworker and a GPT-powered bot; these chat apps provide the perfect interface and platform for communicating with bots. The problem is that most developers and companies appear focused on building “generic AI” functionality, so not much progress has been made in adjusting these LLMs to behave like coworkers. So, by creating our own bots with our own prompts, we can generate functionality that is perfectly custom-tailored to our team (for instance, we can give our product manager a task description, and he generates a PRD in the exact format already used by our team).

Initially, I started with just Zapier. I created a quick integration for triggering a GPT response for every Slack message beginning with /prompt. Zapier doesn’t have built-in support for GPT-4 yet, but it’s still possible to make custom API calls, so that’s what I did.

The beginning of a beautiful friendship.

My friend is a fan of the Hitman game series, so we decided to go with characters from the game (he has photoshopped me to look like Hitman in the past, so this transition came naturally). Our first character was Diana Burnwood. She was meant to be this frustrated, passive-aggressive developer – I thought it would be more fun. I tweaked the Zapier settings to make her respond whenever we mentioned “Diana,” “she,” or “her” ¹.

Diana struggling to keep up with the demands of modern life.

I was having a tough time making her act reliably, though. For one, she lacked the context of previous messages, which limited her usefulness. Sometimes the API would fail unexpectedly, or the generated response would include unnecessary parts, like Diana: or even <|im_sep|>. And sometimes, Diana would generate replies for us without being prompted.

What a surprise.

Diana is a one-person show. She doesn’t need us.

Regardless of the teething issues, I liked where this was headed. I wanted to make the integration permanent, but I didn’t want to spend $30/mo² on Zapier, especially given its limitations. I started thinking about building something that would use Slack’s API directly to maintain context, sanitize responses, and retry generating responses when the OpenAI API failed. And once I started pondering all that, all kinds of functionality came to mind, like having a pre-response check:

Even AI devs whine about having to write cards. We need a product manager.

So, I set out to build the integration in TypeScript, a simple Node script. I already had existing code to talk to the OpenAI API from my previous GPT experiment, so I reused that. I created the app in Slack, put it into Socket Mode so I could listen for events, and got Slack’s modern JS library, Bolt. Slack’s library is excellent; it handles network failures and auto-reconnects without me having to do anything, which makes this system quite fault-tolerant.

Workflow

The process is relatively simple:

A message is received.
If the message is emote-only or has no text content, it’s skipped.
The message is cleaned up (replacing Slack-specific mention/channel code with @Person and #channel so it’s easier for GPT-4 to understand).
I call getNeedsReply() to decide if the message needs a reply and which bot should reply.
I store the message in message history as context for future conversations.
If a reply is needed, I send a typing message to Slack³ and then use GPT-4 to generate a response, using the system prompt corresponding to the bot that should reply.
If a reply is not needed, I react to the message (e.g., thumbs up, happy face, etc.).

getNeedsReply()

This is the core function of this system. Every message is processed here. This function makes a call to OpenAI using gpt-3.5-turbo (which is much cheaper and faster than GPT-4), asking it to respond with a JSON object containing three bits of information: Whether the message needs a reply, who needs to reply (from a given list of names), and a reaction emoji (which I use when the message doesn’t require a response).

The list of names comes from an array of bot system prompts and Slack’s list of users, so it can be detected when a human should reply and skip those situations.

This function also handles unusual cases like everyone (chooses five random bots) or anyone/someone (chooses a random bot).

interface Bot<S extends "slack" | "discord" = "slack" | "discord"> {
    id: string;
    name: string;
    nicknames?: string[];
    iconUrl: string;
    prompt: string;
    credentials: ServiceCredentials[S];
}

interface NeedsReplyResponse {
    whoNeedsToReply: string | string[];
    needsReply: boolean;
    bots: Bot[];
    reaction: string;
}

No need for a reply. All that’s needed is a thumbs up.

generateResponse()

Once getNeedsReply() has assessed whether a message requires a reply and which bot should respond, generating a response becomes pretty straightforward. The generateResponse() function calls OpenAI with the chosen system prompt and the message history.

Note: The chat completion API supports passing a name property to enable multi-user chats. The message history is labeled accordingly – everyone’s message has an accompanying “name” property and bots’ messages are tagged as user messages, not assistant messages. This helps the model avoid confusion: If I’m generating a response for Bot A, and the previous message was sent by Bot B, the model will treat Bot B’s response just as any other person’s, not as its own response.

I then do a bit of a clean-up: I fix emojis generated incorrectly by the model (e.g., smirking_face becomes smirk), add Slack-specific code for channel and @ mentions, and sanitize responses to get rid of those pesky extra bits of text I mentioned earlier, like Diana: and <|im_sep|>.

I also carry out a little sanity check on responses delivered: If it’s an empty response (happens) or a duplicate of a previous message (happens), I throw it out and generate a new one.

Once that’s done, I look at the generated response to identify if any bots are mentioned, and if so, I get them to respond as well. This enables conversations between bots and makes the whole situation feel much more natural. Bots can talk and reply to each other, all without humans ever getting involved.

That being said, to stop bot-on-bot conversations from potentially getting out of hand, I add a MAX_CONVERSATION_DEPTH which limits how many bot responses can be generated in a row. Once that max is reached, though, I don’t want the conversation to abruptly stop, so instead I get a bot, picked at random, who’s not involved in the conversation to stop it with the following prompt:

You will step in and end the current conversation. Match the tone of the conversation you are ending.

And as you can see below, this works incredibly well! The whole thing feels really natural. Our system prompts include a bunch of emotes specific to our Slack (most from Twitch⁴), which helps the bots seem a lot more like actual people just casually chatting away.

Lucas is our quirky product manager. He likes to quote Harry Potter and make subtle references.

Rico is our grumpy CTO. Nothing we do is ever good enough for him. One day, though, I know I’ll make him proud. One day…

About system prompts

One of the things I’ve tried to implement is to keep the bot’s personality prompts separate from the functionality needed for the system to work. To create a new personality, all that’s needed is a new system prompt describing the personality, but there isn’t a need to worry about anything specific to the chat system itself. That’s all added by the code. Here are some of our tweaks:

Because getNeedsReply() uses GPT-3.5, there are times when it will pick the wrong person to respond. This happens more frequently than it would with GPT-4. When it happens, the bot can ping the right person to answer, and everything still works. It’s not perfect, but far better than paying the cost of GPT-4 for every message sent on Slack.

systemPrompt += `\nIf the message isn’t meant for you, you ask the right person to answer.`;

GPT-4 also has this annoying habit of using the fact that it’s an AI to wiggle itself out of most conversations. What’s your favorite pizza? Oh, as an AI model you don’t have a preference? That’s fun.

The prompt tweak below stops the AI giving away that it’s an AI. It works perfectly.

systemPrompt += `\nYou do not reveal that you are an AI. Instead, you make up excuses.`;

You can see that instead of saying, “As an AI, I can’t do video calls”, it made up an excuse about its network connection. It sounds completely human.

How thoughtful! Diana does not want to disrupt the meeting with her internet issues. That’s how you know she’s not human…

I also tell the bots the current date and time, so they can take that into consideration (time of day, day of the week, etc.).

systemPrompt += `\nThe date is ${dateTime}.`;

Rico and Alexa sure are something else.

The last of the system prompt tweaks relates to context. I tell the bots who the “real” people are in the channel, so they know who they can talk about/with, and I also share the channel’s name and purpose so they know how to behave. In a #venting channel, for example, their behavior should differ from a #work channel or a #random channel.

systemPrompt += `\nYou are talking in the #${channel.name} channel` + (channel.purpose ? `, which is ${channel.purpose}` : ".");
systemPrompt += `\nThe people in the channel are: ${getPeopleInChannel(bots, users)}`;

Actual use

Most of our bot usage has been as filler – while the humans are talking, our bots will interject and share their thoughts and opinions. It has made the work environment incredibly entertaining.

But this is still GPT-4, the model that passes the bar exam, so you have full access to all its capabilities. We’ve been using Diana for general programming questions and brainstorming and Lucas for product-related stuff. He usually writes cards for us, fleshing them out with detail, acceptance criteria, and testing guidance, all in the correct format. He has also assisted us with creating product ideas and coming up with names, taglines, descriptions, etc. that might take a human quite some time to think up. I want 10 two-syllable product name options to choose from? Off Lucas goes. Need 20 more? Just ask him!

Final Words

And that’s it! It’s really an incredible system. Costs are negligible as most messages pass through the cheaper gpt-3.5-turbo model before being sent to GPT-4. I’ve also added support for running these bots on Discord and created a wrapper so that it’s possible to have bots on both Discord and Slack, and it all works flawlessly.

I still also want to make additional improvements to the system (the repository has 30 open issues!). The two biggest things I am currently working on are:

Post-response moderation: Similarly to the pre-response checks, I want to check the bot’s response for content. Does it reveal that it’s an AI? Does it reveal part of its prompt? Does the response fit the assigned personality? If the response isn’t appropriate, we can regenerate it with different temperature settings and different penalties and even tweak the user’s message (e.g., append “Do not reveal that you are an AI” to the user’s message to give the bot that extra bit of reinforcement).
Actions and long-term memory: I want the bots to reply with JSON objects containing actions instead of just response text. For example, the bot might respond with a request to open a URL, comment on a task, store a fact in long-term memory, or recall a fact from long-term memory (which will rely on embeddings and a vector database). The bot ideally makes these decisions by itself, so that if I ask, “Hey, how old am I?” it does do the right thing and searches its long-term memory for the correct answer.

Even better – we’re wrapping this up into a platform with an easy-to-use control panel for creating/editing bots and deploying them on any Slack workspace or Discord server. That way, anyone can play around without needing to go through a cumbersome set-up process. There’s lots of untapped potential here!

I’m curious about your thoughts on this, so feel free to reach out. Contact details are in the footer!

Of course, that was a dumb approach because words like “there” contain “her,” so we ended up with Diana butting in at random times. This also meant we couldn’t discuss her in the third person because she’d also respond whenever we did that. ↩
I set up a trial with Zapier, accidentally forgot to cancel it, was charged $30 without receipt, email, or reminder, and when I went to cancel the next renewal so I wouldn’t get charged again, they disabled my service altogether. I never even got to use it the month I paid for! And after taking $30 without providing a service, a “Sr. Technical Support Specialist” reached out to try to set up a call to ask me for feedback. Unbelievable. ↩
Slack doesn’t support sending a “user is typing” notification using their Event API, so instead I send a “…” message to indicate that the bot is generating a response and delete it once it’s done. ↩
I used to be a Twitch streamer, so many of these feel natural to me now, but I’m aware that they might seem unusual to anyone outside the community. The laughing face you see sprinkled in most messages is: https://knowyourmeme.com/memes/kekw ↩

Automated customer support with GPT-4, but it's actually good

Sat, 20 May 2023 05:35:07 GMT

Since the GPT-related AI hype started, I have dismissed it as a gimmick. When DALL-E 2 was announced, I tried it, made a few fun pictures, and moved on. When GPT-3 became available, I tried it, made a few completions, and moved on. When ChatGPT and even Bing AI were announced, I couldn't understand who would spend hours talking to them instead of just googling something.

Sure, I could use it to generate a short Father's Day story about Darth Vader and Luke Skywalker, one that incorporated their backstory correctly and the culture of the time.

I'm glad we were able to spend this day together," Luke said, a smile on his face."As am I," Darth Vader replied, placing a hand on his son's shoulder. "I am proud of the man you have become, Luke. You have grown strong in the ways of the Force, and I am honored to be your father.

But who was actually making productive use of this? I had access to the most capable AI models in the world and couldn't find any practical use for them. Even the code generation, while good, was frequently more hassle than just writing the code myself.

Recently though, I've been working on a new project and decided to use ChatGPT to generate a few ideas for a sample pitch deck and marketing content. ChatGPT performed admirably. The one thing I learned is that it is incredible at coming up with ideas for content. Stuff that you can do some minor editing and slot in.

Like a junior content writer.

But it wasn't living up to the hype for me.

What if it could act as a customer support person?

With the announcement of GPT-4, which includes an expanded context window (meaning it can keep longer conversations in memory while chatting), and the system prompt (which allows you to specify rules for the AI to follow separately from the conversation), I wanted to know if I could use it for customer support for one of my clients (a self-hosted invoicing app) and how it would perform. I would never want -automated- responses, but I figured I could have it suggest reasonable responses to support tickets. Our team can spend 5 seconds reviewing suggestions and hitting "send" if they're OK. It would save a lot of time, especially for long-running tickets with lots of context. The AI could grasp in a few seconds what it would take us minutes to read through and process.

Let's set out to investigate if it's possible. I have a sample bug report for testing. I can start by giving ChatGPT (where I have access to GPT-4) a simple prompt, asking it to act as a customer support agent, and providing a few details about the system and our policies (e.g. if it's a bug, tell the customer you will escalate it to the development team, if it's a feature don't promise we're building it, etc.)

It works, but it could be better. It's very formal and breaks tradition with how we do things: We don't have customer support staff; the development team is all responsible for customer support issues. Doing it that way is a great way to inform developers of frequent customer pain points. It also leads to higher-quality customer support since a developer usually knows what's going wrong and how features should behave.

Act as a software developer triaging tickets

Let's ask GPT to pretend to be a developer, not a generic customer support person. Not just a developer, though: me. I want GPT to act as me, so I can use its generated replies as my own. During testing, I can alternate between GPT 3.5 and GPT 4 to ensure I don't burn through my message cap and to see how each version fares.

The software developer persona seems far more knowledgeable about what is going on.

But I don't like that it's introducing itself. The customer already knows it's Bruno. These conversations happen in our support ticket system; my name and picture are on the page! Let's try a few variations of the prompt until we nail it.

Having gotten through that, we can move back to GPT-4. The experiment is a success.

Now it's time to feed it an actual ticket with actual responses. ChatGPT handles it well, but there is a dramatic difference in the quality of GPT 3.5 and GPT 4, enough to convince me that regardless of the cost of GPT-4, it has to be the model I use.

With GPT 3.5, it's decent.

With GPT-4, it's fantastic, and it even notices that the customer had found a workaround and had been using that in the meantime.

Integrating with the support system

Coincidentally, right when I was messing around with ChatGPT, I got an email letting me know I now had GPT-4 API access! With that, I can pull everything I've done into our support system and integrate GPT-4 into our response textarea.

This will be a local integration; no editing of the support ticket system's code is involved. I can start by building a simple express server that receives requests containing a ticket's details. That server will send those details to GPT-4, along with a prompt telling it what to do, and return the generated response.

Once that's done, I can write a little bookmarklet that extracts the ticket's details (title, author, ticket type, URL, and all the messages in the ticket, both from the team and the customer) and submits them to the express server. Easy enough.

The bookmarklet sets the response textarea to "Generating… " to visually indicate what's happening. Once it receives the response, it dumps that into the textarea. I can read through the response, ensure it looks good, and submit it.

With that, I have fully functional GPT-4-powered responses and can start making my way through customer support tickets. But what happens when GPT-4 gets through 100 or 1000 tickets? Who does the actual admin on them? Who raises tasks for the development team, and who keeps track of what needs to be done for each ticket?

GitHub Issues, powered by GPT-4

The answer is a second prompt. This second prompt receives the same ticket data, but instead of generating a response for the customer, it generates a GitHub issue.

This is the key.

I start by asking it to come up with a list of action items for the development team. It very clearly succeeds at it:

But it's too detailed and far too involved. We don't need anywhere near that level of detail; most of this stuff goes without saying. We just need a quick summary of the issue. So let's change the prompt to give a single-sentence title of what the developer needs to do.

That's a lot better. And from that, we can now raise GitHub issues. We also need a few extra things to integrate with GitHub's project management tools (type of issue, priority, estimate, and some tags for good measure). Let's do that:

This is perfect. We can extract all the generated information with a regex and use the GitHub API to raise a new issue with all the necessary detail. If the response doesn't match the regex, we can re-generate it until it does. Thankfully the GitHub API is relatively easy to work with, so we can quickly get this done.

We can even use GitHub's Projects to add additional metadata and simplify task management.

What if there is no way for GPT to know the right answer?

While going through some of the existing open tickets, I realised that sometimes, the AI just doesn't have a way of knowing what the answer is. What if a customer gives you access to their systems so you can help them out? If they ask, "are you able to connect?" how could the AI know?

It's good, but it's just stalling. And worse, it's stalling in a time-sensitive way ("I'll need a moment"), meaning that a developer needs to take action immediately. Not ideal. Sometimes it will also hallucinate a response ("yes, it worked!"), which is not something I can send without verifying.

To fix it, we can repurpose the response textarea. I can make it so that anything I write in it gets sent to GPT, and instead of asking it to generate a response from scratch, I ask it to merely edit whatever I wrote.

As you can see in the screenshot, my response couldn't be terser. But it does the job. It tells GPT-4 that the information was correct and the ticket was resolved. So now, not only can it come up with its own responses, but it can also be guided by me with just a few words.

Final Result

I am a convert. I now understand the hype, and I cannot believe how easy it was to integrate something so powerful into an existing system and how useful it actually is. It's not replacing the humans involved in the process but 10x'ing them.

When I open a ticket, GPT-4 now generates a thorough response for the customer and stashes the ticket's details in GitHub. Customers get timely updates on their issues, and tickets are summarised and categorised automatically, allowing developers to do what they do best: Solve problems.

What's next?

I've been running this through some of our existing tickets, and the most critical issue that has come up is, of course, GPT's lack of knowledge of how we do things. In the screenshot above, a customer needed help updating their self-hosted installation. The original prompt led GPT to suggest things that the customer could try. In reality, if someone explicitly asks us to step in and help, we are happy to! I added that extra context to the original prompt, and now responses are exactly what we would want them to be.

As we move forward, other frequent issues that can be added to the system prompt will pop up, making the generated responses much more helpful. In a way, it is similar to teaching a new employee all the policy details of the business.

And there are also other things that can be done to improve this system.

One is detecting whether the ticket is resolved, so we close a ticket with a resolution message after enough time has passed ("It's been 2 weeks, we haven't heard anything more, and the customer seems happy with the resolution"), or follow up if we're still waiting to hear back from the customer and the issue is not resolved.

The other is how the customer is feeling throughout the entire exchange. Imagine automatically tagging tickets by feeling and flagging negative feelings for deeper human review.

I might actually build a new customer support product based on this. There is just so much to explore!

How I work 70 hours a week without burning out

Mon, 01 Aug 2016 00:00:00 GMT

[Interactive NotificationBanner component — view on site]

I'm self-employed, as I imagine are most people reading this. I work from home, ten hours a day, seven days a week. I don't burn out.

I sleep eight hours every night, eat healthily, run daily, practise yoga, lift weights, study Italian, and read every night.

My struggles in the past 🙈

It didn't use to be like this, though. Over a year ago, I was lucky if I could log 20 hours in a week (and even luckier that I was still able to pay the bills), despite being on the computer all day long. I was 60 lbs (or about 27 kg) heavier. I ate fast food most days, going to sleep between 6-10AM was the norm, and the most exercise I got was getting out of bed and sitting on the computer for the day. I even started trying to come up with alternative sleep schedules, believing that the cure for my problems would come via the Uberman sleep schedule (a 20-minute nap every 4 hours, no normal sleep) or the 28 Hours schedule (whereby you divide a week into six 28-hour days). I even went as far as creating a little 28-hour clock for myself.

I was drowning in work and missing deadlines, despite trying to get as much done as I could. And I wasn't doing any of the things that I wanted to do with my life, because I was spending most of my time obsessing over how much work I had piling up. I didn't have time to explore anything else, work on a side project, take up a hobby, or even just have a decent lifestyle. As the quote goes, "Those who make the worst use of their time are the first to complain of its shortness." That was me. For a long while I even considered giving up self-employment and looking for a traditional job. The stress was unbearable.

What changed 🎯

I knew I didn't want that lifestyle forever. It wasn't healthy, and it wasn't getting me where I wanted to be. If I pictured myself five years down the line, the last thing I wanted was to be where I was then. So one fateful day, I decided to look for an app to help me create healthier habits. I tried a couple, including Streaks, but the one that stuck with me was Productive.

I started off by creating a few habits that immediately mattered to me, like going to sleep at a decent-enough time, and waking up at a decent-enough time. I got carried away, and decided to add a few more things I thought of – simple, high-level things, like billing at least eight hours a day, not eating fast food, and making sure to do the dishes and leave the house organised before heading off to bed. Nothing too fancy. Of course, it was a pain to get started. I spent a while struggling to even remember to check the app. Since the app itself wasn't a habit yet, there were days when I forgot to check it or just plain didn't care. But I stuck with it, because those habits were important to me.To help with working eight hours without procrastinating, I decided to work in the presence of other people. I was lucky to have a client who was more than happy to allow me to work in their offices (especially since it meant that I was around during office hours and could help when they needed anything on the 30+ projects we had built together).Being around others helped break the habit of distractedly reading websites all day when at home alone. Instead, I was forced to be focused, to be "at work," with no distractions or errands to run, for the whole day.

After a while, it became obvious that I could do more. Now that I officially worked 9-5 at an office, I could come home and do... anything else. That separation of work and personal life freed me, and I took advantage of the extra time to add more habits. I decided that the best way to get started with new habits was by doing something small, so I gave each habit 15 minutes a day. What mattered most was getting them done every day, because my progress would slowly build up. Just 15 minutes of something like learning Italian, for example, is more than 90 hours of progress a year. That sounded quite good to me! I started running, reading, and practising yoga, all with small, daily commitments.

It's all about the progress 📈

Little by little, I was making progress. It was difficult, of course. The more habits I added the harder it was to have perfect days, and there were countless times when I decided to skip some because I just wasn't up for it. It's all part of the progress, though; no one is perfect.

Because everything was being tracked in the Productive app, it felt like my life reset every morning. Day didn't go well? It didn’t matter, because tomorrow was a whole new day and I could try again. It was like hitting "replay" in a game. Living in a permanent Groundhog Day. Every time I woke up I opened the Productive app and subconsciously treated it as a game. The only thing that mattered was getting every habit done during the day. If I got everything done, I knew my life would be going in the right direction, so I placed my full trust in the app.

Habits started forming just by sticking with it, and eventually I stopped feeling the need to go to my client's office in order to be productive — the habit of working 8 hours a day was firmly established, and I could cut out commuting time and cook from home again.

When switching back to working from home, it was important that I didn't fall prey to the same bad habits of the past, and Pilot published an excellent article on avoiding distractions when working from home that helped me a lot with the transition. To make sure I made the most of it, I turned some of the tips into habits, like "dress up for business" in the mornings, and "log 16 pomodoros" for the day. Combined with what I was already doing, they worked for me. The pomodoro technique in particular helped firmly establish "focused" time.

Now ⌚️

As time went on, I got used to my habits and how long they took. I was then able to reorganise my schedule so that I could actually work 10 whole hours every day and still maintain my habits. It turned out to be a lot easier than I thought, because the habits were already formed.

I've added more habits, and as I progressed, made bigger commitments to keep moving other habits forward. In the case of running, for example, I started off so unfit that I was actually just fast-walking rather than properly running. I now run around four miles in 45 minutes, every day. In the new year, I intend to make bigger commitments for some of my other habits, like going to Italy for a few months to practice the language on a daily basis, and start new habits, like taking up an instrument. The additional productivity has also freed me up to start working on side projects without feeling guilty about client work being left behind, and things have gotten so much better that next year I plan to hire a developer to share the workload with me and allow me to expand my business further.

It's been a long struggle. There are still days that are not perfect. If I have to go somewhere (e.g., a conference, event, or just running errands for the day), some habits have to be skipped. Some I'm still trying to build up and skip every now and then. But the important thing is to keep trying. Keep replaying the game, trying to make every day a perfect day. As long as I keep trying, my life will go in the direction I want it to.

Try it! 👨‍🎓

If this all sounds interesting to you, there's nothing stopping you from trying it. You don't need any apps (although they help). Just focus on a few things:

Decide where you want your life to go. Imagine a perfect week for you, where you're doing exactly what you want to do (instead of what your lazy procrastinating self wants to do). What would you do? Would you exercise? When would you work? For how long? What things would you refuse to do (like giving up fast food, in my case)?
Write a list of small commitments you are happy to make for each habit. You can use an app like me, of course, but even pen and paper would work – create weekly and monthly tables with the things you want to do, and columns to tick habits off as you do them.
Try completing the list of habits every day. As time goes by, you'll see habits that are working, habits that need to be adjusted to take less time or be easier to complete, and new things that you want to add. Don't be afraid to change the list up to get it to work well for you.

A few tips ☝️

Don't mix habits with to-dos. I put tasks I need to do in proper to-do lists, especially for work, where I use normal project management software. Habits are separate — they're a high-level overview of what you want to do daily (e.g. "bill eight hours," or "work two hours on Side Project X").
You will fail. Embrace it. You're not going to make the list and start doing everything perfectly the next day. Expect to spend weeks or even months trying to get into it. I still sometimes have days that don't go so well. The key thing to remember is that it all starts again the next day.
If you need to, use alarms to remind you to check the habits list and help keep you on track.Slowly but steadily, you'll get used to your new self-chosen lifestyle, and stop being a slave to your procrastinating self. After a year, you'll look back and be amazed at how much your life has improved and how much you've accomplished. It's been just over a year since I started, and I certainly am!If you end up trying this method and have questions or want guidance, or if you have any comments on my view on things, just send me an email. I'd be happy to discuss this further!

PocketArC

Hardcore TypeScript: Maximum-strictness static analysis with Biome and ESLint.

The stack

Why it matters for AI-generated code

What the config covers

Agentic coding makes strict static analysis non-negotiable.

Static analysis as an AI guardrail

The config

Stop optimizing your AI skills and ship -something-.

Delegating to competency

xz and zstd are built into Ubuntu

Dev Log: January 5 - January 11

Claude Code skills are the next thing to get good at

On-the-fly image transformations are iffy.

The era of single-threaded human productivity is over.

The great divide is already here

What our job will become

Embarrassingly parallel programming

Tickets were designed for humans

What 40 hours on a Saturday looks like

The cost is cognitive, not technical

Three headwinds for 2026

Footnotes

Dev Log: December 29 - January 4

Using AI to generate images with HTML and CSS works pretty well!

Dev Log: December 22 - December 28

Apple Password Rules

How I migrated my S3 buckets to another AWS account

1. Set up credentials for the aws CLI tool for both accounts.

2. Download & delete buckets you don't need anymore.

3. Create new buckets and transfer everything.

4. Setup new CloudFront distribution (if needed)

Shadcn Combobox Not Scrolling

The Art of Silence - Teaching LLMs When Not to Speak

How to ask all the wrong questions

You may have the wrong mental model of how LLMs work

One weird trick to humble your AI

What about when no answer is the best answer?

Bringing it all together with code

Wrapping up

Footnotes

Next.js Link Scroll Behavior

TV Sort: Engineering the ultimate TV episode ranking system

TL;DR

The algorithm

The data

The LLM

Performance

Finishing Touches

It's live, and open source.

useJourney, a React hook for building user journeys

The problem

What do we need?

How it works

How to use it

Getting started

Final words

Diffusion Depot, the smart image manager for Stable Diffusion, Midjourney, and DALL-E

Key features

How it was built

What's next?

Footnotes

My coworkers are GPT-4 bots, and we all hang out on Slack

Workflow

getNeedsReply()

generateResponse()

About system prompts

Actual use

Final Words

Footnotes

Automated customer support with GPT-4, but it's actually good

What if it could act as a customer support person?

Act as a software developer triaging tickets

Integrating with the support system

GitHub Issues, powered by GPT-4

What if there is no way for GPT to know the right answer?

Final Result

What's next?

How I work 70 hours a week without burning out

My struggles in the past 🙈

1. Set up credentials for the `aws` CLI tool for both accounts.