Stories by George Violaris on Medium

The agentic development loop: how I ship features with Claude Code

George Violaris — Tue, 10 Mar 2026 08:44:14 GMT

The value of agentic coding isn’t speed. It’s that it forces you to be explicit about what you’d normally leave implicit.

Every engineering team has a dirty secret: the gap between “we know what we want to build” and “we have a plan detailed enough to build it” is where most projects quietly die. Not dramatically. Just a slow drift into scope creep, missed edge cases, and architectural decisions made by whoever happened to be writing code that afternoon.

I’ve been using Claude Code as the backbone of my development workflow, not just for writing code but for planning, review, testing, and deployment. The thing I didn’t expect: it doesn’t just speed things up. It changes the structure of the work. The process ends up more rigorous than what most teams pull off with purely human workflows, and I don’t think that’s an accident.

This isn’t a “look what AI can do” piece. It’s a working methodology I use to ship real products. Opinionated, specific, and it assumes you’re building things that actually need to work in production.

The workflow at a glance

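Before diving in, here’s the full loop: brainstorm the feature set, write a plan file for each feature, refine each plan, analyze the plans together into a master tracking file, implement plan by plan, test in four layers, and debug across the production boundary behind guardrails.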

Now let’s break each phase down.

Phase 1: Brainstorming the feature set

The first conversation with Claude Code isn’t about code at all. It’s about scope.

I describe the project (its purpose, its constraints, the user it serves, the infrastructure it runs on) and ask Claude to brainstorm the feature set. The instruction is to think in terms of deliverable capabilities, not implementation details. Features are described as outcomes, not as code.

For a simple task list app, this might yield five features. For a production AI platform, it might yield thirty. The size doesn’t matter. The discipline is the same.

What makes this work is that Claude Code has context on your codebase, your infrastructure, your existing architecture. It’s not brainstorming in a vacuum. It’s brainstorming against the reality of what you already have. That’s a different conversation than what you get from a blank-page planning session.

I tell Claude to be exhaustive but honest. If a feature is a “nice to have,” label it that way. If something is a prerequisite for something else, surface that dependency. The brainstorm should be a complete map, not a wish list.

Phase 2: Writing plan files

Every feature gets its own plan file. This is non-negotiable.

A plan file isn’t a user story. It isn’t a ticket. It’s a contract. A document detailed enough that an agent (or a developer unfamiliar with the project) could pick it up and implement the feature without ambiguity. Each plan file includes:

Objective: What does “done” look like? Described in terms of observable behavior, not implementation.
Constraints: What must this feature not do? What existing behavior must it preserve?
Dependencies: What other features, services, or infrastructure does this touch?
Acceptance criteria: Specific, testable conditions that must all pass.
Open questions: Anything unresolved. These get resolved in refinement, not swept under the rug.
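
As a rough sketch, the five sections above could be modelled as a typed record with a readiness check. The `PlanFile` shape, the `planBlockers` helper, and the sample plan are all hypothetical illustrations, not a format the workflow prescribes.

```typescript
// Hypothetical shape for a plan file; the fields mirror the five sections above.
interface PlanFile {
  feature: string;
  objective: string;            // what "done" looks like, as observable behavior
  constraints: string[];        // what the feature must not do or must preserve
  dependencies: string[];       // other features, services, or infrastructure touched
  acceptanceCriteria: string[]; // specific, testable conditions
  openQuestions: string[];      // unresolved items, to be cleared in refinement
}

// Lists the reasons a plan is not ready to implement; an empty array means ready.
function planBlockers(plan: PlanFile): string[] {
  const blockers: string[] = [];
  if (plan.objective.trim() === "") {
    blockers.push("objective is empty");
  }
  if (plan.acceptanceCriteria.length === 0) {
    blockers.push("no acceptance criteria");
  }
  for (const question of plan.openQuestions) {
    blockers.push(`open question: ${question}`);
  }
  return blockers;
}

// Illustrative plan for a task list app (contents invented for the example).
const draftPlan: PlanFile = {
  feature: "task-reminders",
  objective: "A user gets a reminder before a task is due.",
  constraints: ["Must not change how existing tasks are stored."],
  dependencies: ["notifications"],
  acceptanceCriteria: [],
  openQuestions: ["How long before the due time is the reminder sent?"],
};

console.log(planBlockers(draftPlan));
```

A check like this is only a gate on completeness. Whether the objective and criteria are any good is still a human call.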

Why files and not tickets? Plan files live in your repo. They’re versioned. They’re diffable. When the plan changes, you can see what changed and when. This is infrastructure-as-code thinking applied to project planning, and it holds up better than whatever lives in your project management tool.

Phase 3: Iterative refinement

This is where most agentic workflows stop too early.

I take each plan file back to Claude Code and run a dedicated refinement pass. The instruction is: “Read this plan. Now think about everything that could go wrong, everything that’s underspecified, and everything that’s implicitly assumed but not stated. Rewrite the plan with that level of detail.”

This second pass typically doubles the length of each plan. That’s the point. The detail that comes out is exactly the detail that would otherwise become a bug report three sprints from now.

Note the human review gate in this phase. The agent refines, but you approve. You’re the taste-maker and the context-holder. The agent surfaces the questions you should be asking. You decide the answers.

Phase 4: Cross-plan analysis and master tracking

This is the step that most people skip, and it’s the one that saves you the most pain later.

Once all plans are refined, I ask Claude Code to take all plans together and perform a cross-cutting analysis. Specifically:

Are two features planning to modify the same component? Do they introduce competing abstractions?
Does Feature A’s constraint contradict Feature B’s requirement?
What’s the optimal implementation order given inter-feature dependencies?
What common utilities or patterns should be extracted rather than duplicated?

This analysis feeds into a master tracking file, a single document that maps every plan’s status, its dependencies, known conflicts, and implementation order.

Without this step, you’re basically hoping that feature-level isolation holds up when everything comes together. In my experience, it never does. The cross-plan analysis catches integration issues at the cheapest possible moment, before any code exists.
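
The implementation-order question is the one part of this analysis that is mechanical enough to sketch in code. Below is a minimal topological sort, assuming the tracking file can be reduced to a map from each plan to the plans it depends on. The `DependencyMap` type, the function name, and the example plans are hypothetical.

```typescript
// Hypothetical input: plan name -> names of the plans it depends on.
type DependencyMap = Record<string, string[]>;

// Returns an order in which every plan comes after all of its dependencies.
// Throws on a dependency cycle or a reference to a plan that does not exist.
function implementationOrder(deps: DependencyMap): string[] {
  const plans = Object.keys(deps).sort(); // sorted for a deterministic result
  const needsOf = (plan: string): string[] => deps[plan] ?? [];

  for (const plan of plans) {
    for (const need of needsOf(plan)) {
      if (!plans.includes(need)) {
        throw new Error(`${plan} depends on unknown plan ${need}`);
      }
    }
  }

  const order: string[] = [];
  let pending = plans;
  while (pending.length > 0) {
    // A plan is ready once every one of its dependencies is already in the order.
    const ready = pending.filter((plan) =>
      needsOf(plan).every((need) => order.includes(need)),
    );
    if (ready.length === 0) {
      throw new Error(`dependency cycle among: ${pending.join(", ")}`);
    }
    order.push(...ready);
    pending = pending.filter((plan) => !ready.includes(plan));
  }
  return order;
}

// Invented example plans.
const exampleDeps: DependencyMap = {
  auth: [],
  "task-list": ["auth"],
  sharing: ["auth"],
  notifications: ["task-list", "auth"],
};

console.log(implementationOrder(exampleDeps));
```

A cycle error here is useful in itself: it means two plans each assume the other exists first, which is exactly the kind of conflict this phase is meant to surface.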

Phase 5: Implementation, plan by plan

With refined plans and a clear implementation order, the coding phase gets almost boring. That’s the goal.

Each plan gets implemented as a unit. Claude Code reads the plan, writes the implementation, and the code traces directly back to the spec. No interpretation, no creative license with requirements, no scope expansion.

The implementation contract is simple:

Claude Code implements exactly what’s in the plan.
Any deviation requires updating the plan first, then implementing.
The plan file becomes the commit message’s source of truth.

Claude Code’s awareness of your codebase matters here. It knows your patterns, your conventions, your existing abstractions. It doesn’t reinvent what already exists.

Phase 6: Testing, four layers deep

Each implemented plan goes through four testing layers before it’s considered complete. The order matters.

Layer 1: Unit tests

Claude Code writes the tests alongside (or immediately after) the implementation. The instruction is: test the contract, not the implementation. Tests should validate the acceptance criteria from the plan file, not the internal mechanics of how the code works.

I ask Claude to include:

Happy path tests for every acceptance criterion
Boundary condition tests for every constraint
Negative tests: what should not happen
Regression anchors: tests that would break if someone later violates the plan’s constraints

Layer 2: Code review (for agentic code)

This is where things get weird compared to traditional development, and it’s worth going into detail.

Traditional code review assumes a human wrote the code with intent, and you’re reviewing their reasoning. With agentic code, you’re not reviewing someone’s thought process. You’re reviewing whether an execution matched a specification. This changes what you look for.

What agentic code review actually looks like

Review against the plan, not against vibes. The plan file is your rubric. Every line of code should trace back to a requirement in the plan. If it doesn’t, that’s either scope creep (the agent added something unsolicited) or a plan gap (the plan was underspecified and the agent filled in the blank). Both need to be caught.

Check for silent assumptions. Agents don’t flag uncertainty the way a junior developer might. They don’t say “I wasn’t sure about this, so I went with X.” They just go with X. Your job is to find the X’s and determine if they’re correct. Common places where agents make silent assumptions:

Error handling strategies (retry? fail fast? degrade gracefully?)
Concurrency and ordering guarantees
Default values and fallback behaviors
Security boundaries and input validation depth
Resource cleanup and lifecycle management

Review the self-report. I ask Claude Code to generate a diff annotation alongside every implementation: a summary of what it did, why, and where it deviated from or interpreted the plan. This gives you a starting point for review instead of making you reverse-engineer intent from raw diffs.

Look for pattern violations. The agent knows your codebase, but it doesn’t have the tribal knowledge of your team. It might introduce a pattern that’s technically correct but inconsistent with how your team does things. Watch for:

Logging conventions and observability patterns
Error type hierarchies and exception handling strategies
Naming conventions that carry semantic meaning in your domain
Architectural boundaries (what’s allowed to call what)

Check for unintended coupling. Cross-reference the implementation against the master tracking file. Did this feature’s implementation touch something that belongs to another feature’s domain? The plan-level isolation should hold at the code level too.

The meta-review: reviewing plans as code

It took me a while to see this clearly: in an agentic workflow, plan files are code. They’re the source of truth that the agent compiles into implementation. A bug in the plan produces a bug in the code, and the code-level review might not catch it because the code faithfully implements the (flawed) plan.

This means plan review is a first-class activity. When you review a plan, you’re asking: if an agent implements this literally, will the result be correct? Will it be complete? Will it integrate cleanly with everything else?

Layer 3: Functional testing

After unit tests and code review, Claude Code runs functional tests that exercise the feature end-to-end. These aren’t unit tests with mocks. They’re tests against real (or near-real) environments that validate the feature works as a user would experience it.

The plan’s acceptance criteria are the test cases. Every criterion maps to at least one functional test.

Layer 4: Pipeline and deployment tests

Claude Code knows my CI/CD infrastructure. It can trigger pipeline builds and watch for failures, validate that deployment artifacts are correctly produced, run smoke tests against deployed environments, and verify that rollback procedures work.

This layer catches the class of bugs that only show up in the build/deploy process: dependency resolution issues, environment variable misconfigurations, containerization problems, and so on.

Phase 7: The production boundary, remote debugging and guardrails

This is probably the most polarizing part of this whole setup. Claude Code can SSH into preprod and production environments for debugging. Here’s how I do it without losing sleep.

The guardrails framework

I work with five categories of guardrails, and I don’t budge on any of them.

1. Access scoping

The agent operates with minimum privilege for the task at hand.

Production is read-only by default. The agent can read logs, query metrics, inspect running state, and examine configuration. It cannot modify anything without explicit human approval for that specific action.

Preprod has broader permissions, but still bounded. The agent can restart services, modify configuration, deploy new versions, but within a defined blast radius. It cannot touch data stores that mirror production data without approval.

The principle: treat the agent like a smart junior engineer who hasn’t built trust yet. You wouldn’t give them sudo on day one.

2. Session isolation and audit

Every SSH session the agent opens is:

Fully logged. Every command, every output, every timestamp. The audit trail is immutable and stored outside the agent’s access.
Time-bounded. Sessions have a hard timeout. If the agent hasn’t finished debugging in the allocated window, the session terminates and the agent reports what it found so far.
Attributable. The agent connects with a dedicated service account, not a shared credential. You can always distinguish agent actions from human actions in your audit logs.
Isolated where possible. Ideally the agent operates in a containerized or sandboxed session that limits its blast radius even within the allowed permission set.

3. Rollback contracts

Before the agent takes any action in preprod or production, it must state:

What it intends to do. The specific action, not a vague description.
Why it believes this will help, traced back to the diagnostic evidence.
What the rollback path is, how to undo the action if it makes things worse.
What it will check afterward, how it will verify the action had the intended effect.

This is the plan file pattern applied to operational actions. The same discipline that prevents scope creep in development prevents reckless debugging in production.
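
A minimal sketch of how the four statements above could be checked before anything runs. The `ActionProposal` shape, the field names, and the sample proposal are assumptions made for illustration.

```typescript
// Hypothetical record the agent fills in before acting; one field per statement above.
interface ActionProposal {
  action: string;       // the specific action it intends to take
  evidence: string;     // the diagnostic evidence behind it
  rollback: string;     // how to undo the action
  verification: string; // what it will check afterward
}

const REQUIRED_FIELDS: (keyof ActionProposal)[] = [
  "action",
  "evidence",
  "rollback",
  "verification",
];

// Returns the fields left blank; an empty array means the contract is complete.
function missingFields(proposal: ActionProposal): (keyof ActionProposal)[] {
  return REQUIRED_FIELDS.filter((field) => proposal[field].trim() === "");
}

// Invented example: the rollback path has not been stated yet.
const restartProposal: ActionProposal = {
  action: "Restart the api service in preprod",
  evidence: "Memory usage climbs steadily until the health check fails",
  rollback: "",
  verification: "Health check passes for ten minutes after the restart",
};

console.log(missingFields(restartProposal));
```

A complete contract is a precondition, not permission. The approval rules from access scoping still apply on top of it.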

4. Kill switches

Multiple layers of “stop everything”:

Command allowlists/denylists. The agent can run kubectl logs but not kubectl delete. It can run SELECT but not DROP. These are enforced at the session level, not by trusting the agent to self-restrict.
Rate limiting. The agent can’t rapid-fire commands. There’s a deliberate cooldown that forces sequential, observable actions.
Escalation triggers. Certain patterns in command output (error spikes, cascading failures, resource exhaustion) automatically pause the agent and alert the human operator.
The big red button. A single command that terminates all agent sessions across all environments, immediately. This should exist and it should be tested regularly.
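
The allowlist/denylist idea can be sketched as a default-deny check that lives in the session wrapper, outside the agent. This is a deliberately naive illustration: the policy entries come from the examples above, the names are hypothetical, and plain string matching is not a substitute for enforcing permissions at the database or cluster level.

```typescript
// Illustrative policy built from the examples above; a real list would be longer.
const ALLOWED_PREFIXES = ["kubectl logs", "select"];
const DENIED_WORDS = ["delete", "drop"];

interface Verdict {
  allowed: boolean;
  reason: string;
}

// Default-deny: a command runs only if it matches the allowlist and nothing else objects.
function checkCommand(command: string): Verdict {
  const normalized = command.trim().replace(/\s+/g, " ").toLowerCase();

  // Refuse chaining and substitution so an allowed prefix cannot carry a second command.
  if (/[;&|`$]/.test(normalized)) {
    return { allowed: false, reason: "chaining or substitution characters" };
  }

  const words = normalized.split(" ");
  const denied = DENIED_WORDS.find((word) => words.includes(word));
  if (denied !== undefined) {
    return { allowed: false, reason: `denylisted word: ${denied}` };
  }

  const prefix = ALLOWED_PREFIXES.find(
    (p) => normalized === p || normalized.startsWith(p + " "),
  );
  if (prefix === undefined) {
    return { allowed: false, reason: "not on the allowlist" };
  }
  return { allowed: true, reason: `matches allowlist entry: ${prefix}` };
}

console.log(checkCommand("kubectl logs api-7f9c"));
console.log(checkCommand("kubectl delete pod api-7f9c"));
```

The check errs toward refusing: a harmless query that happens to contain a denylisted word is rejected too, which is the right failure mode for production.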

5. The observe-hypothesize-propose-execute pattern

The agent doesn’t just SSH in and start running commands. It follows a structured debugging protocol: observe, hypothesize, propose, and only then execute.

The framing I keep coming back to: these aren’t restrictions on the agent’s capability. They’re the same operational discipline you’d expect from any engineer with production access. The agent is fast and tireless, but it doesn’t have operational judgment yet. You provide that judgment structurally through guardrails rather than hoping the agent will self-moderate.
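
One way to provide that judgment structurally is to make the observe-hypothesize-propose-execute order impossible to skip. The sketch below is a small state machine under two assumptions of mine: the four stages run strictly in sequence, and execute needs an explicit human approval flag. The class and its names are hypothetical.

```typescript
type Stage = "observe" | "hypothesize" | "propose" | "execute";

const STAGES: Stage[] = ["observe", "hypothesize", "propose", "execute"];

class DebugSession {
  private completed: Stage[] = [];

  // Records a stage. Throws if it is out of order, or if execute lacks approval.
  advance(stage: Stage, humanApproved: boolean = false): void {
    if (this.completed.length >= STAGES.length) {
      throw new Error("session already finished");
    }
    const expected = STAGES[this.completed.length];
    if (stage !== expected) {
      throw new Error(`expected ${expected}, got ${stage}`);
    }
    if (stage === "execute" && !humanApproved) {
      throw new Error("execute requires human approval");
    }
    this.completed.push(stage);
  }

  // Returns a copy so callers cannot rewrite the record.
  history(): Stage[] {
    return this.completed.slice();
  }
}

const debugSession = new DebugSession();
debugSession.advance("observe");
debugSession.advance("hypothesize");
debugSession.advance("propose");
debugSession.advance("execute", true); // approval granted by the operator
console.log(debugSession.history());
```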

Why this works

Stepping back from the mechanics, here’s why I think this actually works better than most purely human workflows.

Plans as forcing functions. Most teams skip the level of specification that plan files require. They rely on shared context, verbal agreements, and the assumption that experienced developers will “figure it out.” The agent can’t figure it out. It needs the spec. So you write the spec. And then you have it forever, as documentation, as test criteria, as onboarding material.

Cross-plan analysis as a free architecture review. The step where Claude examines all plans together for conflicts and overlaps is something most teams only do informally, if at all. Making it an explicit step catches integration bugs before any code exists.

The master tracking file as living documentation. Your project’s state is always visible, always current, always in version control. Nobody needs to ask “what’s the status of feature X?” The tracking file answers it.

Structured debugging as operational maturity. The guardrails framework isn’t just about safety. It’s about building a repeatable, auditable debugging process that works whether the operator is an agent or a human at 3 AM during an incident.

The thing I keep coming back to: the agent doesn’t replace engineering discipline. It demands it. You can’t be lazy about specs when your implementer is a literal machine that will do exactly what you said, including the parts you got wrong.

The plans, the tests, the guardrails, the tracking file. None of that is overhead imposed by working with an agent. It’s what good engineering looks like when you actually do it instead of just talking about it. The agent just makes you write it all down.

George Violaris is CTO at HatchworksVC, where he builds AI products and reviews AI startup deals. He writes about the intersection of engineering practice and AI at Merge Conflict and builds open-source tools including Vio Framework. Find him on X @atr0t0s.

violaris.org

AI existential dread and developer ego death

George Violaris — Tue, 10 Mar 2026 05:17:44 GMT

Last Tuesday, I reviewed fourteen pull requests before lunch. I didn’t write any of them. Three autonomous agents had been working overnight on a feature I’d specced out the evening before, and my job was to read their work, check their reasoning, and decide what to merge. I caught two subtle bugs and one architectural misstep. Solid morning.

On the drive home, I realized I hadn’t written a single line of code in over two weeks. Not one. And the strangest part wasn’t the realization itself. It was that I didn’t feel anything about it.

That numbness is what got me. Because three years ago, writing code wasn’t just what I did. It was who I was.

2023: The ego was bulletproof

When ChatGPT became unavoidable in early 2023, I treated it the way most experienced developers did: as a slightly smarter Stack Overflow. Paste in an error message, scan the response, fix the obvious hallucinations, and move on. Sometimes it saved me a tab or two of Googling. Sometimes it was flat-out wrong, and I’d feel a small, satisfying thrill catching it.

Looking back, that thrill should have worried me.

I was measuring my self-worth by how often I was better than it. Every hallucination was proof that I still mattered. The machine was impressive but flawed, and I was the human who knew the difference.

My ego was completely intact. If anything, AI made it stronger. Look at me, the veteran developer, wielding this tool with the discernment that only fifteen years of production bugs and 3 am incidents can provide.

I had no idea what was coming.

2024: The first crack

The shift happened in a meeting. I was designing architecture for a new service. I’d spent two days sketching diagrams, thinking through failure modes, going back and forth with myself about event-driven vs. request-response. On a whim, I described the problem to Claude and asked it to propose something.

It came back with an approach I hadn’t considered. Not a wild one. A boring, pragmatic one that was better than mine. It had spotted a bottleneck I’d missed and suggested a pattern I knew about but hadn’t thought to apply here. The kind of insight you usually get from a senior colleague who’s seen this exact problem before.

I sat with that for a while.

This wasn’t autocomplete. This was judgment. Taste. The ability to hold a complex system in your head and make good trade-offs. The thing I’d always told myself was the part machines couldn’t do.

I used its suggestion. It worked well. I told no one where it came from.

That was the first crack. Not because the AI was right, but because I hid it. If I admitted an AI had out-architected me, I’d have to deal with what that meant, and I wasn’t ready.

2025: The year I stopped typing

By 2025, I was what people started calling a “vibe coder.” Describe what you want in natural language, iterate on the output, assemble systems from generated components. On a good day, I’d write maybe forty lines of code by hand. The rest was generated, reviewed, and refined.

Then one week, the forty lines became zero.

I was building a data pipeline. In the past, this was my happy place. Thoughtful code, clean abstractions, that particular satisfaction of something well-built. But I found myself describing the pipeline to an AI, getting back something functional, adjusting two prompts, and shipping it. The whole thing took an afternoon. It would have taken me three days by hand.

That evening, I sat in my home office and felt something I can only describe as grief. Not for a person. For an identity. The developer I’d spent fifteen years becoming, the one who took pride in elegant solutions, who could debug by instinct, who knew the right pattern for the right problem. That person was becoming unnecessary. Not unemployed. Unnecessary.

The code still needed to be correct. The systems still needed to work. But the act of writing them, the craft I’d built my entire professional identity around, was no longer the bottleneck. I was.

For a while, I told myself AI code was fine for prototypes, but production systems needed a human touch. Then I told myself I’d use AI for the boring parts and keep the interesting work for myself. Then I got angry about everyone shipping sloppy code, generating garbage, and losing the fundamentals.

But mostly I was just sad. I missed building something line by line and knowing every decision was mine.

2026: What’s on the other side

I wish I could say there was a clean turning point. There wasn’t. It was more like slowly waking up and realizing the nightmare was actually just a different kind of morning.

Today, I run multi-agent workflows. I spec features, design systems, then dispatch AI agents to implement them in parallel. I review their code with the same rigor I once applied to junior developers’ pull requests, except there are more of them, they work faster, and they don’t get defensive when I request changes. I’m the one who decides what ships. The testing is largely automated, but I’m the one who decides when it’s enough.

It’s not writing code. But I’d argue it’s still engineering.

The ego death was real. “I am a person who writes excellent code” had to die. And the thing that replaced it is harder to romanticize. Nobody writes blog posts about the satisfaction of a well-crafted code review comment. There’s no flow state equivalent when you’re reading someone else’s output, even if that someone is a machine. The dopamine is quieter now. A system prompt that makes agents produce consistently good output. A testing workflow that catches a category of bugs before any human sees them. Shipping in a day what used to take a sprint.

I think about the younger version of me, the one who felt that thrill every time he caught a hallucination. He’d look at what I do now and think I’d given up. That I’d let the machines win.

He’d be wrong. But I get why he’d feel that way. What he was really afraid of wasn’t AI. It was irrelevance. And the only way through that fear was to let go of the version of myself that needed to be the one typing.

What I’d tell you if you asked

I still catch bugs. I still make architectural decisions. I still solve hard problems. They’re just different problems now, and honestly, some days I’m not sure if different is better or just different.

The old me would have hated who I’ve become. But the old me also mass-produced code that fed mass-produced bugs that required mass-produced fixes. It was a hamster wheel, and I’d convinced myself it was a craft.

Most days, I don’t miss it. Some days I do. But I had to grieve it before I could see what was actually in front of me.

AI existential dread and developer ego death was originally published in CodeToDeploy on Medium, where people are continuing the conversation by highlighting and responding to this story.

Google’s MCP Toolbox for Databases Deserves Your Attention

George Violaris — Wed, 04 Mar 2026 08:26:57 GMT

There’s a repo in the googleapis GitHub org that I think more people should know about. It’s called genai-toolbox, now officially MCP Toolbox for Databases. At first glance it looks like another Google cloud tool.

It’s more than that.

What it actually does

MCP Toolbox is an open-source MCP server written in Go. You point it at your databases (Postgres, MySQL, AlloyDB, Spanner, BigQuery, Firestore, and more), define your agent tools declaratively in a tools.yaml file, and it exposes those tools to any AI agent framework that speaks MCP. Which, at this point, is most of them.

A tool definition looks like this:

tools:
  search-hotels-by-name:
    kind: postgres-sql
    source: my-pg-source
    description: Search for hotels based on name.
    parameters:
      - name: name
        type: string
        description: The name of the hotel.
    statement: SELECT * FROM hotels WHERE name ILIKE '%' || $1 || '%';

And consuming it from TypeScript takes about four lines:

import { ToolboxClient } from '@toolbox-sdk/core';

const client = new ToolboxClient('http://127.0.0.1:5000');
const tools = await client.loadToolset('my-toolset');

Those tools work with LangChain, LlamaIndex, Genkit, Google’s ADK, or a raw API call. One definition, any framework.

The real shift: decoupling tool logic from agent code

Here’s what makes this actually interesting.

For the past two years, the standard way to build database-backed AI agents has been to shove tool logic directly into your agent code. You write a Python function that opens a connection, runs a query, handles errors, returns results — and that function lives right next to your agent. Schema changes? Redeploy your agent. Query needs tuning? Redeploy. Second agent needs the same data? Copy-paste or extract into a shared library.

MCP Toolbox inverts this. Tool definitions become a separately managed, separately deployed configuration layer. The Toolbox server runs on its own, hot-reloads by default (change tools.yaml, it picks it up, no restart), and your agent code just points at it.

We’ve done this before. We stopped embedding SQL in application code and moved to ORMs, then API layers, then service meshes. Same trajectory, new domain.

Why MCP matters here

MCP (Model Context Protocol), originally from Anthropic, is quickly becoming the standard wire protocol for agent-tool communication. Google took a project that predated MCP and retrofitted full compatibility onto it instead of building their own thing. That says something.

Claude speaks MCP natively. Cursor, Windsurf, and most modern AI IDEs support it. Google has committed to it. A tool definition you write today for Gemini will work with Claude or GPT-4o tomorrow, without changes. That kind of portability didn’t exist 18 months ago.

What it handles that you don’t want to write yourself

Beyond the declarations, Toolbox deals with a set of infrastructure problems that are genuinely annoying to get right.

Connection pooling. Database connections are expensive. Managing a pool across concurrent agent invocations is non-trivial, especially when agents are async and unpredictable in when they call what.

Authentication. Integrated auth support, so you’re not passing raw credentials through your orchestration layer.

Observability. OpenTelemetry out of the box. Metrics and traces flow into whatever you’re already running.

Tool reuse. Define a financial-data toolset once, load it into your research agent, portfolio agent, and alerting agent. No duplication.

Not flashy. But it’s the difference between a demo and something you’d actually run.

The fintech angle

I spend most of my time at the intersection of AI agents and financial data, so here’s why this caught my eye.

Financial data models are complex and always moving: market data, portfolio state, transaction history, news events, derived signals. Every new agent capability — a new analytical query, a new alert condition, a new join — normally means touching application code. Toolbox lets you treat database tools as a versioned API surface that agents consume, separate from the agent logic.

When you have multiple agents running concurrently against the same data, connection pooling and centralized tool management stop being optional. They’re structural.

The fit is even more natural for persistent memory. Memory retrieval is a parameterized query. Memory writing is an insert. Wrap those as Toolbox tools and any agent, in any framework, can talk to your memory layer through MCP without knowing your schema.

What to watch out for

Still v0.x. Google explicitly flags the API as unstable until v1.0, and minor versions can break things. I wouldn’t move production to it now, but I’d be building proofs of concept.

The config is SQL-centric. If your tools need complex business logic or multi-step operations, you’ll bump up against what YAML and SQL can express. Toolbox doesn’t replace your service layer. It speeds up the data access part.

And it’s Google’s take on what this should look like. As MCP matures, there will be competing implementations. Whether Toolbox becomes the default depends on v1.0 timing and community traction.

Why I care

What’s interesting to me isn’t the feature set per se. It’s what it signals about where agent architecture is going. We’re moving from AI bolted onto apps toward agents as real architectural components that consume managed tool interfaces. Same forces that gave us microservices and Infrastructure as Code, playing out again in a new domain.

MCP Toolbox is an early, concrete version of that.

The repo is at github.com/googleapis/genai-toolbox, Apache 2.0 licensed. Current version is v0.22.0.

If you’re building AI-powered financial platforms or working where databases meet LLMs, I’d like to hear how you’re handling tool architecture. Drop a comment or reach out.

Google’s MCP Toolbox for Databases Deserves Your Attention was originally published in Artificial Intelligence in Plain English on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Localhost Ego Trip

George Violaris — Fri, 27 Feb 2026 01:01:00 GMT

I self-hosted an LLM to avoid depending on APIs. I was already depending on three of them.

This one’s embarrassing to admit, but it’s honest.

Early on, I ran a local LLM for the orchestration layer of my AI financial analysis platform. The part that decides how to approach each user request — what data to fetch, what to analyze, how to structure the response — ran on a model hosted on my own machine. The source file’s header literally said “Llama-orchestrated financial analysis agent.” I felt good about it. Self-hosted. No API dependency. Full control.

There was just one problem. I was already calling an API for vision analysis to read chart images. I was already calling APIs for market data. I was already calling cloud services for everything that required external information. The local model was handling one step — the planning and routing — and nothing else.

So I was paying the cost of running a local LLM (memory pressure, build complexity, finicky compilation dependencies) to avoid one more API call. Meanwhile, every other part of the pipeline was hitting the cloud anyway.

The swap

The commit that fixed this changed two files. Swap one dependency for another, point the orchestrator at a cloud API instead of localhost. 17 lines changed. The file header went from “Llama-orchestrated” to “Claude-orchestrated.”

The orchestration got faster and the responses got better. I stopped pretending that running one model locally while depending on cloud APIs for everything else was some kind of principled stance.

Self-hosting made sense as a starting point. It let me prototype without worrying about API costs or rate limits. But I held onto it too long because it felt like the right kind of engineering. The honest reason I kept it wasn’t technical. It was identity. I wanted to be the kind of developer who self-hosts things.

The model per job thing

Once I let go of the idea that the whole system needed to be independent from external APIs, a different kind of architecture fell into place.

Different models are good at different things. Some are fast and cheap and great at structured decisions. Some are strong at understanding images. Some are better at nuanced language and synthesis. Instead of picking one and forcing it to do everything, I started routing different parts of the pipeline to different models based on what each step actually required.

One model handles the planning — looking at a user’s question and deciding what data to fetch and how deep to go. A different model handles visual analysis when chart images are involved. Another handles the synthesis, pulling everything together into a coherent response. Each one was chosen for the job, not for brand loyalty or architectural purity.
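
A rough sketch of what that routing can look like in code, assuming three pipeline steps and one model per step. The model identifiers are placeholders rather than real model names, and the types and function are hypothetical.

```typescript
type Step = "planning" | "vision" | "synthesis";

// Placeholder identifiers, not real model names: swap in whatever each step needs.
const MODEL_FOR_STEP: Record<Step, string> = {
  planning: "fast-structured-model",
  vision: "image-capable-model",
  synthesis: "strong-language-model",
};

interface AnalysisRequest {
  question: string;
  chartImages: string[]; // attached chart images, possibly none
}

interface RoutedStep {
  step: Step;
  model: string;
}

// Decides which steps a request needs and which model serves each one.
function routeRequest(request: AnalysisRequest): RoutedStep[] {
  const steps: Step[] = ["planning"];
  if (request.chartImages.length > 0) {
    steps.push("vision"); // visual analysis only when charts are attached
  }
  steps.push("synthesis");
  return steps.map((step) => ({ step, model: MODEL_FOR_STEP[step] }));
}

console.log(routeRequest({ question: "How did the index move today?", chartImages: [] }));
console.log(routeRequest({ question: "What does this chart show?", chartImages: ["chart.png"] }));
```

Keeping the step-to-model mapping in one table means swapping a model is a one-line change, which is roughly what the earlier two-file swap amounted to.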

This isn’t a novel idea. But it’s one I couldn’t have arrived at while I was committed to running everything on one local model. That commitment narrowed my thinking. Once I dropped it, the question changed from “how do I make this one model do everything?” to “what’s the best tool for each part of the job?”

What the models can’t do

The more I worked with this setup, the more I realized that the interesting engineering wasn’t in the models at all. It was in everything around them.

Making the system feel responsive even though it’s doing heavy work behind the scenes. Designing notifications that pull users back at the right moment without being annoying. Figuring out what data to show first and what to hold back. Getting the streaming right so words appear at a natural pace instead of in jarring chunks. These are UX and infrastructure problems. The models are an ingredient, not the product.

I think a lot of people building with LLMs right now are over-indexing on the model choice and under-investing in everything else. Which model you use matters less than how you orchestrate them, how you present their output, and how you handle the cases where they’re wrong. The engineering challenge has shifted from “build AI” to “build around AI.” The models are good enough. The question is whether the product around them is.

Where this landed

I’m about a month into building. 84 commits. The most important ones are the ones that removed things.

There’s a version of this story where I describe a clean, intentional process — where each pivot was a strategic decision made from a position of clarity. That’s not what happened. What happened was I built things, realized they weren’t working or weren’t worth their complexity, and deleted them. Sometimes it took longer to reach that conclusion than it should have.

The platform works. I use it every day. If you’re curious what financial analysis looks like when you stop trying to build AI from scratch and start giving capable models the right tools, you’ll hear about it soon enough.

1,500 Lines, One Commit

George Violaris — Wed, 25 Feb 2026 14:01:01 GMT

I built a custom charting engine from scratch. Then I deleted it all and accidentally created a better problem to solve.

I wrote a full custom charting system for my AI financial analysis platform. Drawing tools. Per-asset persistence so your annotations followed you between sessions. A keyboard shortcuts modal. Two context providers managing chart state and drawing state separately. I had a whole commit dedicated to getting the Bollinger Band colors right. Faded cyan accents. I was proud of those.

This took weeks. Multiple iterations on the drawing tools alone. I was building with LightweightCharts, which is a solid library, but turning it into a production charting experience meant writing a lot of glue code. Context providers, toolbar components, keyboard handling, state persistence.

Then I looked at TradingView’s embeddable widget.

The commit

It had everything I’d built. RSI, MACD, volume profiles, dozens of drawing tools, multi-pane layouts, and a UX that actual traders already knew how to navigate. It was free to embed.

The commit that replaced my charting system: 10 files changed, 155 insertions, 1,923 deletions. All my context providers, the toolbar, the keyboard shortcuts modal — gone. Weeks of work, replaced by a div and a script tag.

There’s a specific kind of deflation that comes from deleting something you worked hard on. You sit there looking at the diff and thinking about all the edge cases you solved, all the little details you got right. The cyan accents. The per-asset persistence. And then you look at the widget doing all of that plus more, and you know the decision is correct, and it still doesn’t feel great.

But I’d make it again without hesitation. Building a production-grade charting experience is a full-time job for a team. I was one person building a product, and the chart was one piece of it. Every hour I spent on drawing tool edge cases was an hour I didn’t spend on the part that actually differentiates the product — the AI analysis.

The better problem

Here’s the part I didn’t plan for. Deleting my custom charts created a more interesting problem than the one I’d been solving.

When I owned the charting library, the AI pipeline could render its own charts server-side. Same library, consistent output, full control over what the AI would see. When I switched to an embedded widget, that broke. The AI was generating its own charts that looked nothing like what users were actually staring at. Different rendering engine, different indicator styles, different timeframes visible. The AI and the user were literally looking at different pictures.

The fix required thinking about the problem differently. Instead of the AI generating its own view of the data, it needed to see what the user sees. The widget’s own API lets you capture exactly what’s on screen, so now the frontend grabs the user’s actual chart across multiple timeframes and sends those captures to a vision model. The AI analyzes the same image the user is looking at.

I couldn’t have designed this workflow while I was busy building my own charting library. I was too deep in the drawing tool weeds, solving problems that had nothing to do with the core product. The deletion forced me to step back and rethink what the AI actually needed, and the answer turned out to be simpler and better than what I’d been building toward.

Build vs. use

There’s a gravitational pull toward building things yourself, especially when you have the skills to do it. Custom code feels like control. Dependencies feel like risk. And sometimes that instinct is right — there are parts of my system that had to be built from scratch because nothing exists that does what I need.

But the charting wasn’t one of them. I was building a commodity component and treating it like a differentiator. The actual differentiator was what happened after the user looked at the chart — the analysis, the synthesis, the intelligence layer. Every hour I spent on chart rendering was borrowed from that.

I keep asking myself the same question now before I start building anything: is this the part that makes my product different? If the answer is no, I look for something that already exists. For me the differentiator is the AI pipeline and the analysis layer. The charting was never it. I just liked building it.

The 1,923-line deletion was the most productive commit I’ve made on this project. It stung for about a day. Then I started working on the part that actually matters, and I forgot about the cyan accents.

The Model That Never Made a Trade

George Violaris — Sun, 22 Feb 2026 21:49:32 GMT

I built a reinforcement learning trading agent. Then I shelved it — not because it was wrong, but because it wasn’t fast enough to ship first.

The first thing I built for my AI financial analysis platform was a reinforcement learning agent. PPO algorithm, gymnasium environment, three actions: hold, buy, sell. I wrote the environment class, set up the observation space, imported PyTorch and stable-baselines3. A separate module ran a continuous loop pulling news articles, tweets, and price data, storing everything in Redis. The idea was that the agent would learn from this stream of real-time information and figure out when to enter and exit positions.

The `step()` and `reset()` methods ended up commented out. I got far enough to define the environment but never far enough to train anything useful.

Training an RL agent on financial data is a real project. You need historical simulation infrastructure. You need careful reward shaping so the model doesn’t learn degenerate strategies — like never trading, which technically minimizes losses. You need compute to iterate. I was trying to build a product and a research project at the same time, and the research project was winning the fight for my attention while the product sat there unshipped.

The shift

While I was deep in reward functions, large language models were getting good enough that they didn’t need to be trained on domain data. They just needed access to it.

My RL agent needed thousands of episodes of simulated trading to learn anything at all. An LLM needed a function call to a market data API and it could tell you what it saw. Not perfectly, not with the precision a trained model could eventually reach, but well enough to be useful right now.

That’s the tension I kept coming back to. The RL approach would probably produce better results eventually. More refined decision-making, less susceptible to the kind of confident-but-shallow analysis that LLMs sometimes produce. But “eventually” doesn’t ship. And in a market where new AI products are launching every week, getting something useful into people’s hands matters more than getting something perfect into a research paper.

The decision

So I shelved the RL pipeline. Moved the files to an archive folder. Replaced the whole thing with a tool-calling agent — give an LLM access to market data tools and let it plan its own analysis for each request.
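The shape of a tool-calling loop can be sketched in a few lines of TypeScript. Everything here is an assumption made for illustration: the tool names, the stub data, and the `Planner` type, which stands in for a real LLM call so the example runs without a network.

```typescript
// Hypothetical tool names; real market data tools would call an API.
type ToolName = "getPriceHistory" | "getIndicators";

interface ToolCall {
  tool: ToolName;
  args: { symbol: string };
}

// The planner either asks for another tool call or returns a final answer.
type PlannerStep =
  | { kind: "call"; call: ToolCall }
  | { kind: "final"; answer: string };

// Stand-in for the model: sees the question and the tool results so far.
type Planner = (question: string, results: string[]) => PlannerStep;

const TOOLS: Record<ToolName, (args: { symbol: string }) => string> = {
  getPriceHistory: ({ symbol }) => `${symbol}: 30 daily closes (stub data)`,
  getIndicators: ({ symbol }) => `${symbol}: RSI and MACD values (stub data)`,
};

// Run tools until the planner produces an answer or the step budget runs out.
function runAgent(question: string, planner: Planner, maxSteps = 5): string {
  const results: string[] = [];
  for (let i = 0; i < maxSteps; i++) {
    const step = planner(question, results);
    if (step.kind === "final") return step.answer;
    results.push(TOOLS[step.call.tool](step.call.args));
  }
  return "Stopped: step budget exhausted before the planner produced an answer.";
}

// Scripted planner used in place of a model for this sketch.
const scriptedPlanner: Planner = (_question, results) => {
  if (results.length === 0) {
    return { kind: "call", call: { tool: "getPriceHistory", args: { symbol: "BTC" } } };
  }
  if (results.length === 1) {
    return { kind: "call", call: { tool: "getIndicators", args: { symbol: "BTC" } } };
  }
  return { kind: "final", answer: `Analysis based on ${results.length} tool results.` };
};
```

The step budget matters in practice: without it, a planner that never settles on an answer would keep calling tools indefinitely.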

The dependency cleanup alone told me I’d made the right call for now. Removing PyTorch, gymnasium, and stable-baselines3 cut about 2GB from the Docker image. The AI container went from sluggish multi-minute builds to something I could iterate on quickly. All that weight, and the code it supported never ran in production once.

The model that shipped has zero training data. It works by looking at the same data a human analyst would — price history, technical indicators, chart patterns — and synthesizing what it sees. It’s not going to catch subtle statistical edges that a trained model might find. But it can tell you that Bitcoin just broke below a key support level with declining volume, and here’s what that usually means, and here’s what to watch for next. That’s useful today, not in six months after training converges.

The plan

The RL work isn’t dead. It’s waiting.

The way I think about it now: the LLM-based pipeline is the fast path to market. It gets the product in front of users, generates real usage data, and proves that the core idea works. The RL layer is the refinement that comes later — trained on actual user interactions and real market outcomes, not simulated environments.

There’s an argument that this is the right order anyway. An RL model trained without real user data is guessing about what matters. An RL model trained on months of actual usage, actual market conditions, and actual outcomes from the LLM pipeline’s recommendations has something real to learn from. The LLM gets you to market. The RL makes you better once you’re there.

The commented-out `step()` function is still sitting in the archive. I’ll get back to it.

Shortcuts that ship

RL felt like the serious approach. Tool-calling felt like a shortcut. But the “shortcut” shipped and the “serious approach” is still commented out.

In a landscape where the models themselves improve every few months, waiting for the perfect architecture is its own kind of risk. You can spend six months training a model and find that a newer foundation model with tool access does the same thing out of the box. Or you can ship now, learn from real users, and build the RL layer on top of actual production data instead of simulated environments.

I’ll come back for it. But I’m glad I stopped waiting.

The EU Doesn’t Kill Innovation. The US Regulatory Mess Might.

George Violaris — Wed, 18 Feb 2026 11:18:10 GMT

Why the ‘Europe is bad for tech’ narrative is lazy, outdated, and costing builders real money

There’s a story Silicon Valley loves to tell. It goes like this: Europe is a graveyard for innovation. A continent drowning in red tape, regulatory overreach, and bureaucratic paralysis, where ambitious startups go to die and tech giants get fined into submission. The US is the land of move fast, break things. Where real builders thrive.

It’s a compelling story. It’s also mostly wrong.

I’ve built and shipped products across AI, fintech, crypto, and data platforms. I’ve navigated regulatory environments on both sides of the Atlantic. What I’ve found is that the reality is almost the inverse of the myth: the EU’s so-called heavy regulation often provides clearer, more predictable ground to build on, while the US, for all its libertarian swagger, has created one of the most fragmented, litigation-heavy, enforcement-by-surprise regulatory environments on the planet.

This isn’t a piece defending bureaucracy for its own sake. It’s a reality check for anyone actually trying to build something serious.

The ‘Regulatory Chaos’ Tax Is Real, and the US Charges It

Let’s start with something builders care about deeply: certainty. When you’re designing a system, writing a compliance policy, or raising a round, you need to know what the rules are. The EU gives you that. The US frequently doesn’t.

Take crypto. The EU passed MiCA (the Markets in Crypto-Assets regulation): a comprehensive, coherent licensing framework that tells you exactly what a crypto exchange, a stablecoin issuer, or a token project needs to do to operate legally across all 27 member states. One framework. One license pathway. Clear definitions. You can build your compliance stack on top of it.

Now look at the US. For years, and this is not hyperbole, the SEC and CFTC have been locked in a jurisdictional cold war over who gets to regulate crypto. The result is a landscape where you genuinely cannot know if your token is a security or a commodity until someone sues you. Where enforcement actions replace regulation as the primary mechanism of governance. Where projects that spent years and millions trying to do the right thing still got obliterated by a regulator that moved the goalposts after the game started.

“Regulate by lawsuit” is not a feature of the US system. It’s a bug that has killed real companies and real jobs.

For builders, this isn’t an abstract problem. It’s an existential one. When regulatory risk is unknowable, you either over-lawyer everything (expensive), move offshore (costly and complex), or fly blind and hope for the best (dangerous). None of these are good for building.

The EU at least tells you the rules before it penalises you for breaking them.

GDPR Was Painful. It Was Also the Best Thing to Happen to Data Infrastructure.

The conventional take on GDPR is that it was a nightmare: a compliance bomb that cost millions, generated cookie banners nobody reads, and made lawyers rich. That part is true. But the conventional take misses what GDPR actually did to the industry.

It forced data hygiene. Before GDPR, most tech companies had almost no idea what data they were collecting, where it was stored, who had access to it, or how long it was retained. GDPR forced teams to answer those questions. The companies that went through that exercise properly came out the other side with dramatically cleaner, more auditable, more trustworthy data infrastructure.

It created a moat. If you’re building any kind of B2B data product, and especially if you’re selling into enterprise or financial services, GDPR compliance is now table stakes for European customers and increasingly a strong differentiator for US enterprise buyers who’ve watched American companies get torched in breaches and FTC actions. “We’re GDPR compliant” has real commercial value.

It set a global standard. California’s CCPA, Brazil’s LGPD, India’s DPDP Act: every major data privacy framework that came after GDPR borrowed from it heavily. Love it or hate it, Europe wrote the playbook that the rest of the world is now following. Building to that standard from day one means you’re not scrambling to retrofit privacy into your architecture later.

Privacy by design isn’t a compliance cost. It’s what good engineering looks like in 2025.

The US, by contrast, still has no comprehensive federal data privacy law. It has a patchwork of state laws in California, Virginia, Colorado, Texas and elsewhere, each with different definitions, scopes, and enforcement mechanisms. If you’re operating at scale across US states, you’re not dealing with one GDPR. You’re dealing with well over a dozen overlapping frameworks, some of which are still being actively written. That is objectively more expensive to comply with than a single coherent standard.

The EU AI Act: Scary on the Surface, Sensible Underneath

Of all the EU’s recent regulatory moves, the AI Act generates the most fear among builders. The language is dense, the risk classifications seem arbitrary at first read, and the compliance timelines feel aggressive. But step back and look at what the Act actually does.

It creates a tiered system. Most AI applications, including recommendation engines, productivity tools, content generation, and analytics, fall into the minimal or limited risk categories. These face almost no regulatory requirements beyond some transparency obligations. The heavy requirements kick in for high-risk applications: AI systems making consequential decisions about people in areas like employment, credit, medical treatment, or law enforcement. That’s… reasonable? The idea that AI making credit decisions should face scrutiny is not exactly a radical proposition.

Contrast that with the US approach to AI regulation: a mix of executive orders, agency guidance letters, voluntary commitments from labs, and a patchwork of state-level proposals. This creates an environment that feels freer but is actually riskier. Without clear federal standards, you’re one viral incident or high-profile lawsuit away from emergency legislative action written in a panic and implemented badly. The AI Act was written slowly, with extensive industry consultation. That process is annoying. The output is more durable.

More practically: for anyone building AI into financial services, healthcare, or HR, the EU’s high-risk AI requirements largely overlap with what best practice looks like anyway. Model documentation, human oversight mechanisms, bias testing, audit trails. If you’re not doing those things, you should be, regardless of whether Brussels is watching.

The Single Market Is Genuinely Underrated

Here’s something that gets lost in the regulatory noise: the EU’s single market is one of the most powerful commercial facts on earth.

Roughly 450 million consumers. A single legal and contractual framework for commercial activity. Financial services passporting that lets you acquire a license in one member state and operate across the entire bloc. A unified digital infrastructure actively harmonising everything from e-signatures to payment systems.

Compare this to the US, where operating across state lines in regulated industries like financial services, healthcare, and insurance means navigating 50 different licensing regimes, 50 different consumer protection frameworks, and 50 different attorneys general who may or may not decide to make an example of you. The interstate commerce headaches in the US are genuinely severe for fintech companies, and they rarely get talked about because the narrative is always focused on federal-level US vs. EU comparisons.

A fintech with an EMI license in Lithuania can serve customers in Germany, France, Spain, and Italy. A fintech with a money transmitter license in New York still needs separate licenses in California, Texas, Florida, and 45 other states. Which market is actually more open?

The EU single market is roughly 450 million customers, one license pathway, and one legal framework. That’s not a burden. That’s a distribution advantage.

Startups Are Actually Thriving in Europe

If the EU kills innovation, someone forgot to tell European founders.

Fintech: Revolut, Klarna, Wise, Monzo, N26. Healthcare: Doctolib, Kry, Alan. AI and deep tech: Mistral, Aleph Alpha, Wayve, Helsing, Poolside. Crypto: several of the most sophisticated and compliant exchanges and infrastructure providers in the world are EU-based, precisely because MiCA gave them a framework to build within.

European tech VC hit record levels in recent years and the ecosystem maturation is real. The argument that EU regulation suppresses startup formation has not been borne out by the data. What it has done is push a certain kind of ‘move fast, ignore rules, figure it out later’ approach offshore. And honestly, given what that approach produced in US crypto, US social media, and US consumer data, maybe that’s not the worst outcome.

What the EU produces is startups that are built to last. That understand compliance as architecture rather than afterthought. That have institutional defensibility baked in from the start. This is what enterprise customers want. This is what acquirers value. This is what survives.

The Real Innovation Risk Isn’t Regulation. It’s Regulatory Uncertainty.

Here’s the actual thesis. The threat to innovation isn’t regulation. It’s unpredictability. Builders can handle rules. They cannot handle the absence of rules, or rules that change retroactively, or rules that exist but aren’t enforced until someone powerful decides to make an example of you.

The EU is slow and sometimes frustrating. But it is predictable. Consultations happen publicly. Timelines are published. Enforcement follows process. When the EU says here is a framework, you can build to that framework and have reasonable confidence that following it will protect you.

The US is fast and often exhilarating. But it is unpredictable in ways that materially harm builders. Regulatory agencies change priorities with administrations. Enforcement is selective and political. The gap between what is technically legal and what will get you sued or investigated is enormous and poorly mapped. For companies building in AI, crypto, or data, the three highest-velocity sectors of the current tech cycle, that uncertainty is a real tax on progress.

The scariest thing you can hear as a founder isn’t “here are the rules.” It’s “we’ll tell you the rules after we see what you built.”

You can plan around regulation. You cannot plan around regulatory ambiguity.

What This Means If You’re Building Right Now

Build GDPR compliance into your architecture from day one. Not as a legal checkbox but as a genuine engineering discipline. Data minimisation, purpose limitation, user control: these are good product principles, not just compliance obligations. They also make you immediately commercially viable in the world’s second-largest economy.

Take the EU AI Act seriously but proportionately. Spend time on the risk classification for your specific use case. Most AI products are minimal or limited risk. For the ones that aren’t, the high-risk requirements are a quality bar you should be hitting anyway.

If you’re building in fintech or crypto and you haven’t looked at MiCA, look at MiCA. The EU is building the most coherent regulated crypto market in the world. That is a commercial opportunity, not a threat.

And most importantly: don’t let the Silicon Valley narrative about European bureaucracy substitute for actual due diligence. The myth that Europe is closed for business is repeated most loudly by people who haven’t actually built there, or who built something that relied on regulatory arbitrage that was never going to last.

The Bottom Line

The EU has real problems. The AI Act’s implementation will have rough edges. MiCA has gaps. GDPR compliance is genuinely expensive to implement well. The single market is not as seamless as the theory suggests. These are all fair critiques.

But the claim that the EU kills innovation while the US enables it is a myth, and a costly one. The US’s regulatory chaos has destroyed companies, misallocated billions of venture capital into businesses built on sand, and produced outcomes like surveillance capitalism, crypto casino culture, and algorithmic harm at scale that are now generating a political backlash far more disruptive than any Brussels directive.

The EU asked builders to slow down, think carefully, and build things that could last. A lot of builders are realising that was actually good advice.

The myth is that regulation and innovation are in tension. The reality is that the right regulation, clear, predictable, and proportionate, is what makes serious, durable innovation possible.

Europe understood that. The US is still learning it the hard way.

Software Is Not Dead — It’s Evolving

George Violaris — Mon, 16 Feb 2026 17:09:54 GMT

A response to Mark Cuban’s AI claim and what it really means for engineers

A short clip of Mark Cuban has been circulating where he suggests that “software is dead” because AI will customize everything to individual businesses.

Here’s a longer discussion where he talks about AI reshaping work and business:

https://medium.com/media/8a9432fd1a4eedb4f0c5c6cc5c1b0a66/href

His broader thesis is this:

The future isn’t traditional SaaS. It’s AI-driven customization for every company.

That part is directionally correct.

But “software is dead” is the wrong conclusion.

As a CTO and working engineer, I want to explain why — in practical, architectural terms — software isn’t disappearing. It’s becoming more sophisticated.

Why “Software Is Dead” Is Technically Wrong

AI systems are not a replacement for software.

They are software.

Large language models run inside:

Cloud infrastructure
Containers
API gateways
Application servers
Logging pipelines
Databases

A typical AI-powered production system looks like this:

User → Frontend → API Gateway → Application Layer → AI Service → Database
                               ↘ Logging / Monitoring ↙

Every component in that diagram is traditional software engineering.

AI adds capability.

It does not remove architecture.

What AI Actually Changes

AI reduces friction in:

Writing boilerplate code
Drafting documentation
Generating SQL queries
Producing UI scaffolding
Prototyping workflows

That is meaningful.

But production systems are not prototypes.

Real systems require:

Idempotency guarantees
Concurrency control
Security boundaries
Observability
Deployment pipelines
Performance optimization
Failure recovery

AI doesn’t remove those constraints.

If anything, probabilistic outputs introduce new ones.

The Real Shift: From Static SaaS to Adaptive Systems

Cuban’s strongest insight is this:

Businesses want customization.

Traditional SaaS enforced standardized workflows. AI allows systems to adapt dynamically.

We are moving toward:

AI-augmented internal tools
Intelligent workflow orchestration
Domain-specific copilots
Automated decision support

That’s not the death of software.

It’s the next layer of it.

Example: AI Still Needs Engineering

An LLM can generate something like:

export async function createInvoice(orderId: string) {
  const order = await getOrder(orderId);
  const total = order.items.reduce((sum, i) => sum + i.price, 0);
  return saveInvoice({ orderId, total });
}

Looks fine.

In reality, production requires:

Currency precision handling
Tax calculation logic
Transactional safety
Idempotent retries
Audit logging
Multi-tenant isolation
Error compensation flows

The complexity doesn’t vanish.

It compounds.
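To make two items from that list concrete, here is a hedged sketch of currency precision and idempotent retries. The `LineItem` shape (including `quantity`, which the generated snippet did not have), the in-memory `Map`, and the idempotency key are assumptions for illustration; a real system would use a database with a unique constraint and a transaction.

```typescript
// Prices are stored as integer cents to avoid floating-point drift.
interface LineItem {
  priceCents: number;
  quantity: number;
}

interface Invoice {
  orderId: string;
  totalCents: number;
}

// Stand-in for persistent storage keyed by idempotency key.
const invoicesByKey = new Map<string, Invoice>();

function createInvoiceOnce(
  idempotencyKey: string,
  orderId: string,
  items: LineItem[],
): Invoice {
  // A retry with the same key returns the earlier result instead of a duplicate.
  const existing = invoicesByKey.get(idempotencyKey);
  if (existing) return existing;

  // Reject anything that is not a whole, sensible number before doing arithmetic.
  for (const item of items) {
    const valid =
      Number.isInteger(item.priceCents) &&
      Number.isInteger(item.quantity) &&
      item.priceCents >= 0 &&
      item.quantity > 0;
    if (!valid) throw new Error("Invalid line item");
  }

  const totalCents = items.reduce(
    (sum, item) => sum + item.priceCents * item.quantity,
    0,
  );
  const invoice: Invoice = { orderId, totalCents };
  invoicesByKey.set(idempotencyKey, invoice);
  return invoice;
}
```

Even this covers only two of the seven items. Tax, audit logging, tenant isolation, and compensation flows each add their own layer on top.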

Where Software Is Expanding

1. AI Integration Engineering

Companies now need:

Secure LLM integrations
Prompt evaluation frameworks
Guardrails and validation layers
Rate limiting and cost controls
Vector databases
Retrieval pipelines

That’s software engineering.
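As one example of a validation layer, the sketch below refuses to trust model output until it has been parsed and checked. The `ModelDecision` shape and its allowed actions are hypothetical, chosen only to show the pattern.

```typescript
// Hypothetical set of actions a model is allowed to return.
const ACTIONS = ["approve", "reject", "escalate"] as const;
type Action = (typeof ACTIONS)[number];

interface ModelDecision {
  action: Action;
  confidence: number; // expected to be between 0 and 1
}

type Validation<T> = { ok: true; value: T } | { ok: false; error: string };

function isAction(value: unknown): value is Action {
  return typeof value === "string" && (ACTIONS as readonly string[]).includes(value);
}

// Parse raw model text and accept it only if it matches the expected shape.
function parseDecision(raw: string): Validation<ModelDecision> {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { ok: false, error: "output is not valid JSON" };
  }
  if (typeof data !== "object" || data === null) {
    return { ok: false, error: "expected a JSON object" };
  }
  const { action, confidence } = data as Record<string, unknown>;
  if (!isAction(action)) {
    return { ok: false, error: "unknown action" };
  }
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) {
    return { ok: false, error: "confidence must be a number between 0 and 1" };
  }
  return { ok: true, value: { action, confidence } };
}
```

A caller can then retry the request, fall back to a default, or escalate to a human whenever `ok` is false, rather than letting malformed output flow downstream.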

2. Platform Engineering

AI workloads increase demand for:

Distributed systems expertise
Observability tooling
Infrastructure as code
GPU orchestration
Performance tuning

Not less engineering. More.

3. Regulated Systems

In fintech, healthcare, and enterprise SaaS:

Outputs must be traceable
Decisions must be auditable
Data must remain secure

AI adds risk surfaces that engineers must manage.

What Might Actually Be Dying

If something is fading, it’s:

Undifferentiated CRUD SaaS
Static dashboards with no intelligence
Tools that don’t automate meaningfully

But every technological shift has eliminated weak products.

It has never eliminated engineering.

The Strategic Reality for Developers

If you’re early in your career, here’s the correct takeaway:

Don’t abandon software fundamentals.

Double down on them.

Learn:

Systems design
Distributed systems
Data modeling
Security
Performance engineering

Then layer AI on top.

The engineers who win in the next decade will not be prompt typists.

They will be system builders who understand how AI fits inside reliable architecture.

Final Take

“Software is dead” is a viral soundbite.

But software is not dying.

It’s becoming:

More adaptive
More composable
More AI-integrated
More systemically complex

AI is not the end of software engineering.

It’s the beginning of a harder, more interesting version of it.

ObsiTUI: Why I Built a Terminal UI for Obsidian

George Violaris — Sun, 15 Feb 2026 16:00:54 GMT

How a CTO’s workflow frustrations led to building a vim-powered, AI-enhanced terminal client for Obsidian

As someone who spends most of my day in the terminal — managing infrastructure, writing code, debugging systems — there’s something deeply frustrating about having to break my flow to open a GUI just to check my notes.

I’m the CTO at Zeig.ai, a financial analysis platform, and group CTO at Hatchworks VC where we’re building everything from crypto exchanges to AI agent platforms. My work lives in the terminal: TypeScript services, Python scripts, blockchain tooling, Go microservices. But my knowledge base? That lived in Obsidian’s Electron app, an entirely different context that required alt-tabbing, mouse navigation, and a complete shift in muscle memory.

This context switching wasn’t just annoying — it was expensive. Every time I needed to reference a technical note, check a decision log, or look up an API pattern, I had to leave the mental model of the terminal and enter a completely different interaction paradigm.

So I built ObsiTUI: a terminal UI for Obsidian that brings your entire vault into the CLI, with vim keybindings, AI-powered search and chat, graph visualization, and zero compromises on functionality.

A terminal UI for browsing and managing Obsidian vaults — featuring vim keybindings, rich markdown rendering, fuzzy search, graph view, and AI-powered chat/search/suggestions. Built with Ink (React for CLI) and TypeScript.

The Pain Points

1. Context Switching Tax

When you’re deep in a debugging session at 2 AM, hunting down a race condition in your Ethereum yield aggregator, the last thing you want is to:

Alt-tab to Obsidian
Click the search bar
Type your query (in a different UI paradigm)
Navigate results with a mouse
Alt-tab back
Relocate your mental context

This workflow interruption can easily add 30–60 seconds per note lookup. Over a day of deep work, that’s not just lost time — it’s dozens of broken flow states.

2. Vim Users Exiled from Their Editor

I live in vim. My hands know hjkl better than arrow keys. Operators and motions aren't shortcuts—they're how I think about text manipulation. But Obsidian's editor, while good, doesn't give me that modal editing power. There are vim plugins, but they're approximations at best.

With ObsiTUI, I wanted full vim modal editing: normal, insert, and visual modes; operators that compose with motions (d3w, ci(, yap); dot repeat; marks; and a command mode that actually feels like vim. Not a plugin that tries—a real implementation.

3. AI Features That Required Integration Dance

The rise of AI assistants is transforming how we work with knowledge bases. But integrating Claude or GPT into your note-taking workflow typically means:

Copy note text
Switch to ChatGPT/Claude
Paste
Get response
Copy back
Format
Save

Or you could use an Obsidian plugin, but then you’re locked into whatever LLM provider they chose, with whatever API they wrapped, and whatever features they thought to implement.

I wanted AI that worked my way: RAG chat with source citations, semantic search across my vault, intelligent summarization, and tag/link suggestions — all with the flexibility to use Anthropic’s Claude, OpenAI’s GPT, or run completely locally with Ollama.

4. Speed and Performance

Electron apps are… not fast. Obsidian does a remarkable job, but there’s no getting around the overhead of a full browser runtime. When you have thousands of notes and complex queries, you feel it.

Terminal UIs are fast. Ink (React for the CLI) gives me the component model I love from React, but renders directly to the terminal with minimal overhead. No Chromium, no GPU acceleration, just ANSI escape codes and raw speed.

The Solution: ObsiTUI

I built ObsiTUI with a few core principles:

1. Terminal-Native, Not Terminal-Adapted

This isn’t a web app that happens to run in the terminal. It’s built for the terminal from the ground up:

Vim keybindings are first-class, not an afterthought
Fuzzy search uses fzf-style scoring
Graph view uses ASCII art with Bresenham line drawing
Everything is keyboard-driven with mnemonics that make sense

2. AI Without SDK Bloat

I didn’t want to depend on heavyweight AI SDKs that abstract away control. Instead, ObsiTUI uses raw fetch() calls to LLM APIs with SSE streaming. This means:

Support for multiple providers (Anthropic, OpenAI, Ollama) with the same codebase
Zero dependency hell
Full control over prompt engineering
Easy to add new providers

The RAG (Retrieval-Augmented Generation) implementation searches your vault, builds context, and streams responses with [[wiki-link]] citations back to the source notes. Semantic search uses embeddings with cosine similarity, cached locally for speed.

3. Vim Editor That Doesn’t Compromise

The editor is a pure functional implementation with immutable state. Every operation (d, c, y, motion, etc.) returns a new EditorState object. This makes it:

Easy to reason about
Trivial to implement undo/redo
Naturally supports dot repeat
Composable: 3dw works because the count, the operator, and the motion are independent pieces that chain together

It supports:

All standard motions: h/j/k/l, w/b/e, 0/$, gg/G, f/F, %
Operators with counts: 5j, 3dw, c2w
Visual mode with selection
Command mode: :w, :q, :wq, :q!, :
Text objects would be next if there’s demand

4. Features That Scale

ObsiTUI isn’t a minimal viewer — it’s a full vault manager:

Navigation:

Tree sidebar with expand/collapse
Fuzzy finder with name/content/semantic search
Tag browser with drilldown
Backlinks with context
Graph view (global and local)
Navigation history (vim-style jumplist)

Organization:

Bookmarks with reorder
Note pinning (always visible)
Daily notes with auto-creation
Templates with variable replacement
Quick capture to Inbox

Productivity:

Kanban board (scans checkboxes across notes)
Pomodoro timer
Outline/TOC sidebar
Content search with context lines
File watcher with auto-refresh

AI-Powered:

RAG chat with citations
Semantic search
Similar notes by embedding
Note summarization (cached)
Tag/link suggestions
Daily AI brief generation
Note creation from description

Technical Deep Dive

Architecture

ObsiTUI is built with TypeScript and Ink (React for terminals). The entire state lives in App.tsx, with 22 presentational components that render terminal UI.

Key architectural decisions:

Single Source of Truth: All state in one place. No Redux, no context chaos — just React hooks and clear data flow.

Pure Functional Editor: The vim editor is implemented as pure functions that take EditorState and return new EditorState. This makes testing trivial and reasoning straightforward.

VaultIndex Cache: A single-pass cache of all notes builds tags, backlinks, and content search indices. Incremental updates keep it fast even with thousands of notes.

File Watcher: Monitors the vault directory with a 500ms debounce, so changes from Obsidian (or any editor) are picked up as soon as writes settle.

The AI Stack (Without an AI Stack)

Here’s the core of the RAG implementation:

// No SDK, just fetch with SSE
const response = await fetch("https://api.anthropic.com/v1/messages", {
  method: "POST",
  headers: {
    "x-api-key": apiKey,
    "content-type": "application/json",
    "anthropic-version": "2023-06-01"
  },
  body: JSON.stringify({
    model: "claude-sonnet-4-20250514",
    max_tokens: 4096,
    stream: true,
    messages: [{ role: "user", content: prompt }]
  })
});

// Parse SSE stream
const reader = response.body.getReader();
// ... streaming logic

For embeddings, the same pattern with OpenAI or Ollama. The embedding index stores vectors in .obsitui-index/embeddings.json and uses cosine similarity for semantic search:

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  // A zero vector has no direction; return 0 instead of dividing by zero (NaN).
  if (magA === 0 || magB === 0) return 0;
  return dot / (magA * magB);
}

No ML frameworks, no tensor libraries — just vanilla math.

Obsidian Compatibility

ObsiTUI reads your existing Obsidian configuration from .obsidian/:

Daily notes settings (date format, folder)
Templates folder
Periodic notes plugin config
Templater plugin config

Your vault is never modified unless you explicitly create, edit, rename, or delete something. ObsiTUI is a consumer of your Obsidian vault, not a competing format.

The Testing Philosophy

I’m a big believer in comprehensive testing, especially for tools I depend on daily. ObsiTUI has 301 tests across 20 test files covering:

Vault scanning and note loading
Markdown parsing with inline formatting
Fuzzy search scoring
Vim editor (61 tests for operators, motions, modes)
Graph building and layout
Daily notes navigation
File operations
AI configuration and embeddings
Cache invalidation
And more

When you’re building developer tools, stability matters. Tests give me confidence to ship quickly and refactor aggressively.

What’s Next

ObsiTUI is just getting started. Some areas I’m exploring:

Enhanced AI Features:

Multi-turn conversations with context from multiple notes
Automatic knowledge graph generation
Smart templates that adapt based on note type
Collaborative AI (multiple agents working together)

Better Integration:

MCP (Model Context Protocol) support for Claude Code
Sync with AI agents via Spectre.ai (our memory platform)
Real-time collaboration (though that’s tricky in a TUI)

Performance:

Incremental parsing for huge vaults
Smarter caching strategies
Parallel processing for embedding generation

Mobile/Remote:

Running over SSH with full functionality
Mobile terminal apps (Termux, iSH)

Try It Yourself

ObsiTUI is open source and available now:

git clone https://github.com/atr0t0s/obsitui.git
cd obsitui
npm install
npm run build
node dist/cli.js /path/to/your/vault

Or with Ollama for fully local AI:

{
  "ai": {
    "chatProvider": "ollama",
    "embeddingProvider": "ollama",
    "ollamaChatModel": "llama3.2",
    "ollamaEmbeddingModel": "nomic-embed-text"
  }
}

Why This Matters

We’re in an era where our tools should adapt to our workflows, not the other way around. If you live in the terminal, your notes should too. If you think in vim motions, your editor should speak that language. If you want AI assistance, you should be able to choose your provider and run it your way.

ObsiTUI is my answer to those needs. It’s not for everyone — if you love Obsidian’s GUI and don’t mind context switching, stick with it. It’s an excellent tool.

But if you’re a terminal person, if you value vim keybindings, if you want AI integration on your terms, and if you appreciate software that’s fast, tested, and built with a clear philosophy — give ObsiTUI a try.

I built it because I needed it. Maybe you do too.

Questions? Feedback? Find me on X @atr0t0s or open an issue on GitHub. Star the repo if this resonates with you — it helps others discover the project.

Want to contribute? PRs welcome. The codebase is clean TypeScript with comprehensive tests, so it’s easy to jump in.

ObsiTUI: Why I Built a Terminal UI for Obsidian was originally published in Artificial Intelligence in Plain English on Medium, where people are continuing the conversation by highlighting and responding to this story.

What we learned building agent memory at scale

George Violaris — Sun, 15 Feb 2026 13:17:16 GMT

We run AI agents that orchestrate complex workflows. Early on, we hit a problem that seemed simple but turned out to be fundamental: the agents couldn’t remember anything useful.

Not just “what did the user say last message” — that’s easy. But things like: what decisions did we make three conversations ago? What are this user’s preferences? What did another agent already try that didn’t work?

Every existing solution we tried had problems. Here’s what we learned building persistent memory that actually works in production.

Why agent memory matters

Our agents need to maintain context across sessions. A user might start a task on Monday, continue it Wednesday, then come back Friday with a follow-up question. The agent needs to know what happened, what was decided, and why.

Traditional approaches break down fast:

Session storage works for single conversations but forgets everything when the session ends. The agent starts fresh every time.

Conversation history hits token limits. You can’t just dump 50 previous conversations into every prompt. And even if you could, most of that context isn’t relevant.

Simple database lookups don’t solve the relevance problem. You can store everything, but how do you know what to retrieve? “Load the last 5 conversations” misses important context from older sessions.

The real problem isn’t storage — it’s knowing what to remember, when to remember it, and how to retrieve the right context quickly.

What we tried first

Vector databases alone

We started simple: embed every message, store in Qdrant, retrieve via similarity search when needed.

This broke immediately in production.

The agent would get a query like “What did we decide about the API rate limits?” and retrieve 10 conversations that mentioned “API” and “rate” but none of them had the actual decision. Or worse, it would retrieve an old conversation where we were still discussing options, not the final decision.

Semantic similarity doesn’t understand importance or recency. A message saying “we should probably increase the rate limit” scores just as high as “we increased the rate limit to 1000/hour” — but one is speculation and one is a decision.

We needed structure on top of similarity.

LangChain memory modules

We tried the built-in memory abstractions. ConversationBufferMemory, ConversationSummaryMemory, the whole set.

These are fine for demos but don’t scale. ConversationBufferMemory just grows until you hit token limits. ConversationSummaryMemory compresses old messages into summaries, but summaries lose the details you actually need later.

The bigger problem: these are designed for single conversations. They don’t handle “user returns 3 days later” or “agent needs to remember what happened in a different conversation thread.”

RAG over conversation history

We tried treating all past conversations as documents and using a RAG pipeline to retrieve relevant context.

This was better but still had problems:

Chunking conversations loses coherence. Where do you split? Mid-conversation breaks context.
You still hit token limits on retrieval. Even if you find the relevant conversations, you can’t include all of them.
The agent can’t distinguish between “we discussed X” and “we decided X and implemented it.”

The conversation format itself isn’t the right unit for memory. Sometimes you need a single fact from a long conversation. Sometimes you need the whole reasoning chain.

Session summaries with metadata

We tried storing session summaries in Postgres with metadata (user_id, topic, decisions made, date) and loading relevant sessions on new conversations.

The metadata helped, but we ran into new problems:

How do you decide what makes a session “relevant” to the current query? We tried keyword matching on topics — too brittle. We tried embedding the summaries — back to the similarity problem.

And summaries still lose nuance. A summary might say “discussed rate limit changes” but not capture that we tried 500/hour first and it wasn’t enough.

The insight that changed our approach

The breakthrough came when we realized we were solving the wrong problem.

We weren’t just building memory for individual agents helping individual users. We had orchestrator agents that needed to coordinate with each other. Multiple agents working on the same task needed to share context.

An agent helping with data analysis shouldn’t have to rediscover that the user prefers tables over charts. An agent debugging code shouldn’t have to relearn that this user’s codebase uses TypeScript strict mode.

We needed three types of memory:

User-specific memory: Preferences, context, decisions for a specific user
Session memory: What happened in this conversation thread
Shared memory: Knowledge that’s useful across users and agents

That third one was the key. Most memory systems assume memory is per-user or per-conversation. But in a multi-agent system, agents need to learn from each other.

What we built

Our architecture combines Postgres, Qdrant, and Redis:

Postgres stores structured memory entries with metadata:

Memory type (preference, decision, fact, task)
Scope (user-specific, session-specific, shared pool)
Source (which agent created it, from which conversation)
Timestamp and importance score
The actual content

Qdrant stores vector embeddings for semantic search:

Every memory entry gets embedded
We search for semantically similar memories
But we filter and rank based on Postgres metadata

Redis caches recent memory lookups:

Same query pattern? Return cached results
Invalidate on new memories in scope
Keeps retrieval fast

Memory types and scopes

When an agent stores a memory, it tags it:

memory = {
    "content": "User prefers concise responses without extra explanation",
    "type": "preference",
    "scope": "user",
    "user_id": "user_123",
    "importance": 0.8,
    "created_by": "agent_orchestrator_1",
    "session_id": "session_456"
}

The scope determines who can access it:

user scope: Only agents working with this user
session scope: Only within this conversation thread
shared scope: Any agent can access (pooled memory)

Shared memories are things like “trying to import pandas without installing it causes ModuleNotFoundError” — facts that are useful across users.
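The access rules above can be sketched as a single predicate. This is a minimal illustration that assumes memories are plain dicts shaped like the example entry; the function name can_access is hypothetical and not part of our codebase.

```python
from typing import Optional


def can_access(memory: dict, user_id: Optional[str], session_id: Optional[str]) -> bool:
    """Return True if an agent acting for (user_id, session_id) may read `memory`."""
    scope = memory["scope"]
    if scope == "shared":
        return True  # pooled memory: visible to every agent
    if scope == "user":
        return user_id is not None and memory.get("user_id") == user_id
    if scope == "session":
        return session_id is not None and memory.get("session_id") == session_id
    return False  # unknown scope: fail closed


pref = {"scope": "user", "user_id": "user_123", "content": "Prefers concise responses"}
fact = {"scope": "shared", "content": "pandas must be installed before it can be imported"}

print(can_access(pref, "user_123", "session_456"))  # True
print(can_access(pref, "user_999", "session_456"))  # False
print(can_access(fact, None, None))                 # True
```

Failing closed on an unrecognized scope means a mistagged memory stays hidden rather than leaking to the wrong user.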

Retrieval strategy

When an agent needs context, we don’t just do vector similarity search. We combine multiple signals:

Semantic similarity (Qdrant vector search)
Recency (newer memories weighted higher)
Importance (manually or automatically scored)
Scope matching (user memories for this user, shared memories for everyone)
Type filtering (looking for preferences vs facts vs decisions)

Here’s simplified retrieval code:

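import hashlib
import json

def retrieve_memories(query: str, user_id: str, session_id: str,
                      memory_types: list = None, limit: int = 10):
    # Check cache first. The key covers every argument that changes the result,
    # and uses a stable digest: the built-in hash() is salted per process for
    # strings, so it would produce different keys on different workers.
    key_material = json.dumps([query, session_id, memory_types, limit])
    digest = hashlib.sha256(key_material.encode()).hexdigest()
    cache_key = f"mem:{user_id}:{digest}"
    cached = redis.get(cache_key)
    if cached is not None:
        return json.loads(cached)

    # Vector search in Qdrant
    query_embedding = embed(query)
    similar = qdrant.search(
        query_embedding,
        limit=limit * 3  # Get more candidates
    )

    # Filter and rank in Postgres
    memory_ids = [m.id for m in similar]
    if not memory_ids:
        return []  # Nothing to rank; skip the database round trip
    memories = db.query("""
        SELECT *,
               calculate_relevance_score(
                   importance,
                   created_at,
                   scope,
                   type
               ) as score
        FROM memories
        WHERE id = ANY(%s)
          AND (scope = 'shared'
               OR (scope = 'user' AND user_id = %s)
               OR (scope = 'session' AND session_id = %s))
          AND (%s IS NULL OR type = ANY(%s))
        ORDER BY score DESC
        LIMIT %s
    """, (memory_ids, user_id, session_id, memory_types, memory_types, limit))

    # Cache result as JSON (assumes rows are dicts; Redis stores strings, not objects)
    redis.setex(cache_key, 300, json.dumps(memories, default=str))  # 5 min TTL

    return memories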

The scoring function combines similarity (from Qdrant rank), recency (exponential decay), importance (stored score), and type match.
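To make the blend concrete, here is a rough sketch of how those four signals could be combined in Python. The weights, the 30-day half-life, and the linear rank-to-similarity mapping are all assumptions chosen for illustration; they are not the values behind calculate_relevance_score.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical weights and half-life, not values from the production system.
WEIGHTS = {"similarity": 0.5, "recency": 0.2, "importance": 0.2, "type": 0.1}
HALF_LIFE_DAYS = 30.0


def relevance_score(rank, n_candidates, created_at, importance,
                    mem_type, wanted_types, now):
    """Blend the four signals into one score (the weights sum to 1)."""
    # Similarity from Qdrant rank: 1.0 for the top hit, falling linearly.
    similarity = 1.0 - rank / max(n_candidates, 1)
    # Recency: exponential decay that halves every HALF_LIFE_DAYS.
    age_days = (now - created_at).total_seconds() / 86400
    recency = 0.5 ** (age_days / HALF_LIFE_DAYS)
    # Type match: 1.0 when no filter was given or the type was requested.
    type_match = 1.0 if not wanted_types or mem_type in wanted_types else 0.0
    return (WEIGHTS["similarity"] * similarity
            + WEIGHTS["recency"] * recency
            + WEIGHTS["importance"] * importance
            + WEIGHTS["type"] * type_match)


now = datetime(2026, 2, 15, tzinfo=timezone.utc)
# A day-old decision ranked fourth by similarity...
fresh = relevance_score(3, 10, now - timedelta(days=1), 0.9,
                        "decision", ["decision"], now)
# ...against a six-month-old discussion that Qdrant ranked first.
stale = relevance_score(0, 10, now - timedelta(days=180), 0.3,
                        "discussion", ["decision"], now)
print(fresh > stale)  # True
```

With these example weights, a recent, important decision outranks an old discussion even though the discussion was the closest semantic match, which is the behavior pure vector search could not give us.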

Memory consolidation

We don’t store every message as a separate memory. That would be noise.

Instead, agents decide what’s worth remembering:

Preferences: Explicitly stated or inferred from behavior
Decisions: “We decided to use approach X”
Facts: Information that’s likely to be needed later
Task state: What’s been tried, what worked, what failed

After each conversation, an agent reviews the session and extracts memories:

# Simplified version
def extract_memories(conversation: list[Message]) -> list[Memory]:
    prompt = f"""
    Review this conversation and extract:
    1. User preferences (if any stated or strongly implied)
    2. Important decisions made
    3. Facts that should be remembered
    4. Task outcomes (what worked, what failed)
    
    Conversation:
    {format_conversation(conversation)}
    
    Return as JSON with type, content, importance (0-1), scope.
    """
    
    response = llm.complete(prompt)
    memories = parse_memories(response)
    
    for memory in memories:
        store_memory(memory)
    
    return memories

This runs asynchronously after the conversation, so it doesn’t slow down responses.
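Model output is not guaranteed to be valid JSON, so the parsing step deserves some defensiveness. Below is a minimal sketch of what parse_memories might look like, assuming the model replies with a JSON array of objects and that memories are handled as dicts rather than Memory instances. The allowed type and scope values mirror the lists earlier in this post; the validation rules themselves are illustrative.

```python
import json

ALLOWED_TYPES = ("preference", "decision", "fact", "task")
ALLOWED_SCOPES = ("user", "session", "shared")


def parse_memories(response: str) -> list[dict]:
    """Parse the model's JSON reply, dropping entries that fail validation."""
    try:
        items = json.loads(response)
    except json.JSONDecodeError:
        return []  # malformed reply: store nothing rather than guess
    if not isinstance(items, list):
        return []

    valid = []
    for item in items:
        if not isinstance(item, dict):
            continue
        content = item.get("content")
        importance = item.get("importance")
        if not isinstance(content, str) or not content.strip():
            continue
        if item.get("type") not in ALLOWED_TYPES or item.get("scope") not in ALLOWED_SCOPES:
            continue
        if isinstance(importance, bool) or not isinstance(importance, (int, float)):
            continue
        # Clamp importance into the 0-1 range the prompt asks for.
        valid.append({**item, "importance": min(1.0, max(0.0, float(importance)))})
    return valid


reply = """[
  {"type": "decision", "content": "Rate limit raised to 1000/hour", "importance": 1.4, "scope": "user"},
  {"type": "rumor", "content": "Unknown type, dropped", "importance": 0.5, "scope": "user"}
]"""
parsed = parse_memories(reply)
print(len(parsed))              # 1
print(parsed[0]["importance"])  # 1.0
```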

Shared memory pooling

This is where things get interesting.

When an agent encounters something useful — a pattern, a solution, a common error — it can flag it for the shared pool:

memory = {
    "content": "When using the trading API, rate limit is 100 req/min not 1000",
    "type": "fact",
    "scope": "shared",
    "importance": 0.9,
    "source_user": None,  # Applies to everyone
    "verified": True
}

Now any agent helping any user can access this knowledge. We’re building an organizational memory, not just per-user memory.

The tricky part is quality control. We don’t want agents polluting the shared pool with wrong information. Right now we:

Require high importance scores for shared memories
Let agents mark memories as “verified” after they’ve proven useful
Periodically review and prune shared memories that aren’t being accessed
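Those three rules could be expressed roughly as follows. This is a sketch under stated assumptions: the thresholds, the useful_hits counter, and the days_since_access field are hypothetical stand-ins, since the real cutoffs are not given here.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only to make the example concrete.
MIN_SHARED_IMPORTANCE = 0.8
MIN_USEFUL_HITS = 3
MAX_IDLE_DAYS = 90


@dataclass
class SharedCandidate:
    content: str
    importance: float
    useful_hits: int = 0        # times an agent reported the memory helped
    days_since_access: int = 0
    verified: bool = False


def admit_to_shared_pool(m: SharedCandidate) -> bool:
    """Rule 1: only high-importance memories enter the shared pool."""
    return m.importance >= MIN_SHARED_IMPORTANCE


def update_verified(m: SharedCandidate) -> None:
    """Rule 2: mark a memory verified once it has proven useful enough times."""
    if m.useful_hits >= MIN_USEFUL_HITS:
        m.verified = True


def should_prune(m: SharedCandidate) -> bool:
    """Rule 3: prune shared memories nobody has accessed recently."""
    return m.days_since_access > MAX_IDLE_DAYS


m = SharedCandidate("Trading API rate limit is 100 req/min", importance=0.9, useful_hits=3)
update_verified(m)
print(admit_to_shared_pool(m), m.verified, should_prune(m))  # True True False
```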

What we learned at scale

The retrieval accuracy problem

Our biggest issue was agents retrieving wrong context.

A user would ask “What rate limit did we set?” and the agent would retrieve a conversation about rate limits from a different project, or an old discussion before we made the final decision.

We fixed this with better metadata and explicit decision marking:

When a decision is made, we create a memory with type="decision" holding the final outcome
We suppress old “discussion” type memories when a decision exists
We added session_context field linking related memories together
We weight exact user+session matches much higher than similar shared memories
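The suppression rule is easy to show in isolation. A minimal sketch, assuming each memory is a dict and that session_context holds a shared key for related memories (the key values below are made up):

```python
def suppress_superseded_discussions(memories: list[dict]) -> list[dict]:
    """Drop 'discussion' memories whose session_context already has a decision."""
    decided = {m.get("session_context") for m in memories if m["type"] == "decision"}
    decided.discard(None)  # memories without a context never suppress anything
    return [
        m for m in memories
        if not (m["type"] == "discussion" and m.get("session_context") in decided)
    ]


memories = [
    {"id": 1, "type": "discussion", "session_context": "api-rate-limit",
     "content": "We should probably increase the rate limit"},
    {"id": 2, "type": "decision", "session_context": "api-rate-limit",
     "content": "We increased the rate limit to 1000/hour"},
    {"id": 3, "type": "discussion", "session_context": "logging",
     "content": "Considering structured logs"},
]

kept = suppress_superseded_discussions(memories)
print([m["id"] for m in kept])  # [2, 3]
```

The logging discussion survives because no decision exists for it yet; only the speculation that a decision has replaced is hidden.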

Now retrieval looks like:

# Wrapper name is illustrative; the three steps are the layered lookup
def retrieve_context(query, user_id, session_id):
    # First: check for explicit decisions in this user's history
    decisions = retrieve_memories(
        query, user_id, session_id,
        memory_types=["decision"],
        limit=5
    )

    if decisions:
        return decisions

    # Second: check preferences and facts
    context = retrieve_memories(
        query, user_id, session_id,
        memory_types=["preference", "fact"],
        limit=10
    )

    # Third: check shared pool as fallback
    # (shared facts already returned in step two can appear again here)
    if len(context) < 5:
        shared = retrieve_memories(
            query, user_id=None, session_id=None,
            memory_types=["fact"],
            limit=5
        )
        context.extend(shared)

    return context

This layered approach dramatically reduced wrong context retrievals.

Performance and cost

Retrieval latency was acceptable from the start — Qdrant is fast, Postgres queries with proper indexes are fast, Redis caching helps.

Typical query: 20–40ms for cache hit, 80–150ms for cache miss.

The real cost is embeddings. Every memory needs to be embedded, and embeddings aren’t free.

We optimized by:

Embedding asynchronously (don’t block the conversation)
Batching embeddings (embed 10–50 memories at once)
Only re-embedding if content changes significantly
Using smaller embedding models for less critical memories
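Batching is the simplest of these to illustrate. The sketch below assumes a provider function that accepts a list of texts and returns one vector per text; embed_batch is a placeholder for whatever client you use, and the fake implementation exists only so the example runs on its own.

```python
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(texts: Sequence[str],
              embed_batch: Callable[[List[str]], List[List[float]]],
              batch_size: int = 32) -> List[List[float]]:
    """Embed every text with one provider call per batch instead of per text."""
    vectors: List[List[float]] = []
    for chunk in batches(texts, batch_size):
        vectors.extend(embed_batch(list(chunk)))
    return vectors


calls = []

def fake_embed_batch(chunk: List[str]) -> List[List[float]]:
    # Stand-in for a real embedding API: records the batch size it received.
    calls.append(len(chunk))
    return [[float(len(text))] for text in chunk]


vectors = embed_all([f"memory {i}" for i in range(70)], fake_embed_batch)
print(calls)         # [32, 32, 6]
print(len(vectors))  # 70
```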

At our current scale (thousands of memories per user, millions in shared pool), storage cost in Postgres and Qdrant is negligible. Embedding cost is the main expense.

Memory growth over time

Users generate memories faster than we expected. After a few months, active users had hundreds or thousands of memory entries.

Most of those memories were no longer relevant.

Early memories like “User asked how to get started” aren’t useful once the user is experienced. Temporary task state like “working on debugging feature X” becomes irrelevant once the task is done.

We added automatic importance decay:

# Weekly job: apply one 5% decay step to each old memory.
# (Multiplying by 0.95 ** age_weeks on every run would compound the
# decay and shrink importance far faster than 5% per week.)
WEEKLY_DECAY = 0.95

for memory in old_memories:
    memory.importance *= WEEKLY_DECAY

    if memory.importance < 0.1:
        archive_memory(memory)  # Move to cold storage

We don’t delete memories, we archive them. They’re still accessible if explicitly searched, but they don’t show up in normal retrieval.

This keeps the active memory set manageable.
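To get a feel for what a 5% weekly decay and a 0.1 archive threshold mean in practice, this small calculation counts how many weekly steps a memory survives. The helper name is made up for illustration.

```python
def weeks_until_archived(importance: float,
                         weekly_decay: float = 0.95,
                         threshold: float = 0.1) -> int:
    """Count weekly decay steps until importance drops below the threshold."""
    if not 0.0 < weekly_decay < 1.0:
        raise ValueError("weekly_decay must be between 0 and 1")
    weeks = 0
    while importance >= threshold:
        importance *= weekly_decay
        weeks += 1
    return weeks


print(weeks_until_archived(0.8))  # 41
print(weeks_until_archived(0.3))  # 22
```

So a memory stored at importance 0.8 stays in normal retrieval for roughly nine months, while one stored at 0.3 is archived after about five.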

Debugging memory issues

When an agent gives a wrong answer based on wrong context, debugging is hard.

We built memory inspection tools:

UI showing what memories the agent retrieved for a query
Ability to see why a memory was ranked high (similarity? recency? importance?)
Manual memory editing and deletion
Memory provenance (which conversation created this?)

This revealed issues like:

Memories with ambiguous content that matched too many queries
Over-important memories that dominated every retrieval
Stale memories that hadn’t decayed properly
Duplicate memories from different phrasings

We’re still improving the tooling. Memory is invisible to users but critical to agent performance, so observability matters.
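As one example of what such tooling can check, near-duplicate memories can be surfaced by comparing embeddings pairwise. This is a sketch, not our inspection tool: the 0.95 threshold is an assumption, the vectors are toy 2D values, and the pairwise loop is only practical for a single user's memory set.

```python
import math
from itertools import combinations


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def near_duplicates(embedded: dict[str, list[float]],
                    threshold: float = 0.95) -> list[tuple[str, str]]:
    """Return pairs of memory ids whose embeddings are almost identical."""
    return [
        (id_a, id_b)
        for (id_a, vec_a), (id_b, vec_b) in combinations(embedded.items(), 2)
        if cosine(vec_a, vec_b) >= threshold
    ]


embedded = {
    "m1": [1.0, 0.0],    # "User prefers concise responses"
    "m2": [0.99, 0.05],  # "User wants short answers"
    "m3": [0.0, 1.0],    # unrelated memory
}
print(near_duplicates(embedded))  # [('m1', 'm2')]
```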

The hard parts we’re still figuring out

Memory pruning and consolidation

Our biggest unsolved problem.

We can decay importance over time and archive low-importance memories. But that’s crude. What we really need is smart consolidation.

If an agent has 20 memories about a user’s API preferences that evolved over time, we should consolidate them into one current preference memory, not keep all 20 versions.

The challenge: how do you do this automatically without losing important context?

Summarizing loses details. But keeping everything creates clutter.

We’re experimenting with:

Periodic consolidation where an LLM reviews related memories and merges them
Explicit memory superseding (“this memory replaces memories X, Y, Z”)
Memory chains that link related memories together chronologically

But we haven’t found a solution that works reliably yet.
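Of the three experiments, explicit superseding is the simplest to reason about. A minimal sketch, assuming each memory dict may carry a supersedes list of older memory ids (a hypothetical field name):

```python
def active_memories(memories: list[dict]) -> list[dict]:
    """Keep only memories that no other memory claims to supersede."""
    superseded = {old_id for m in memories for old_id in m.get("supersedes", [])}
    return [m for m in memories if m["id"] not in superseded]


memories = [
    {"id": "m1", "content": "User prefers REST"},
    {"id": "m2", "content": "User is evaluating GraphQL"},
    {"id": "m3", "content": "User standardized on GraphQL", "supersedes": ["m1", "m2"]},
]

print([m["id"] for m in active_memories(memories)])  # ['m3']
```

The filter itself is trivial. The unsolved part is deciding, automatically and reliably, when a new memory should list older ones in supersedes.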

Measuring memory quality

How do you know if your memory system is working?

We track basic metrics:

Retrieval latency (how fast)
Cache hit rate (how often we can skip retrieval)
Memory growth rate (how many memories per user per day)
Archived memory percentage (how much gets pruned)

But these don’t measure quality.

We’re considering:

Agent self-reporting: did the retrieved memories help?
User feedback: was the agent’s response contextually appropriate?
Manual spot-checking: review a sample of retrievals daily
A/B testing: different retrieval strategies on similar queries

Right now we rely on production debugging — when something goes wrong, we investigate. But that’s reactive, not proactive.

Cross-agent memory conflicts

Multiple agents accessing shared memory can create conflicts.

Agent A stores “User prefers detailed technical explanations”
Agent B stores “User wants concise summaries only”

Which is correct? Both might be right in different contexts.

We added context fields to preferences:

{
    "content": "Prefers detailed explanations",
    "type": "preference",
    "context": "technical documentation",
    "user_id": "user_123"
}

But context matching adds complexity to retrieval. And sometimes preferences genuinely change — how do you detect that vs conflicting observations?

We don’t have a good answer yet.
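For reference, the naive version of context matching looks something like this. It is a sketch only: it assumes exact string equality on context, an ISO-formatted created_at field, and a fallback to context-free preferences, none of which handles the genuinely hard cases described above. The "status updates" context is invented for the example.

```python
from typing import Optional


def pick_preference(preferences: list[dict], task_context: str) -> Optional[dict]:
    """Prefer the newest preference for this context, else the newest context-free one."""
    matching = [p for p in preferences if p.get("context") == task_context]
    pool = matching or [p for p in preferences if p.get("context") is None]
    # ISO date strings sort chronologically, so max() picks the newest entry.
    return max(pool, key=lambda p: p["created_at"], default=None)


preferences = [
    {"content": "Prefers detailed explanations", "context": "technical documentation",
     "created_at": "2026-01-10"},
    {"content": "Wants concise summaries only", "context": "status updates",
     "created_at": "2026-02-01"},
]

print(pick_preference(preferences, "technical documentation")["content"])
# Prefers detailed explanations
print(pick_preference(preferences, "code review"))
# None
```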

Privacy and data retention

Users expect agents to remember things. But they also expect privacy.

Questions we’re navigating:

How long should we keep memories?
Should users be able to see all their stored memories?
What happens when a user deletes their account?
Can users selectively forget specific memories?
Are shared pool memories anonymized enough?

We built memory export and deletion, but the UX is still rough. Users don’t think in terms of “memory entries”, they think in terms of conversations and context.

Cost at scale

Right now our costs are manageable:

Postgres: cheap (text and metadata storage)
Qdrant: moderate (vector storage is larger but still reasonable)
Redis: cheap (caching)
Embeddings: main cost, scales with memory creation rate

But we’re not at millions of users yet. At scale, embedding costs could become significant.

We’re watching:

New embedding models that are cheaper or more efficient
Quantization and compression for stored embeddings
Smarter decisions about what to embed vs store as text only
Batch processing to get better embedding API rates

What works in production

Some practical advice from what we’ve learned:

Start with a simple hybrid approach. Vector search for semantic similarity, Postgres for metadata and filtering. Don’t overcomplicate early.

Tag everything with metadata from the start. You can’t add structure retroactively to thousands of memories. Type, importance, scope, timestamp — capture it upfront.

Build memory inspection tools early. You need to debug retrieval issues. Being able to see “the agent retrieved these 5 memories, here’s why” is essential.

Let users see what agents remember. We added a “memory” section in our UI showing the agent’s top memories about this user. Transparency helps trust.

Don’t store everything. Every message doesn’t need to be a memory. Be selective. Extract what’s actually useful.

Decay importance over time. Old memories rarely stay important. Reduce their weight gradually.

Separate concerns. We use different storage for different purposes — Redis for speed, Postgres for structure, Qdrant for similarity. Don’t try to make one database do everything.

Make scope explicit. User, session, or shared. Don’t leave it ambiguous.

Test with real agents and users. Memory problems only show up in production. You can’t unit test “did the agent retrieve the right context?”

Where this is going

Agent memory is still immature as a field. Most frameworks treat it as an afterthought — add a memory module and you’re done.

But memory is fundamental to agent intelligence. An agent that forgets everything is barely more useful than a stateless API call.

What’s still missing:

Better consolidation strategies. We need automatic ways to merge, summarize, and compress memories without losing critical information.

Temporal reasoning. Agents need to understand “this information was true then but not now” vs “this is a permanent fact.”

Multi-modal memory. Right now we mostly store text. But agents need to remember images, code, documents, user interactions. Relating these different modalities is unsolved.

Provenance and confidence. “I remember X” should come with “because user said X in conversation Y” and a confidence level.

Better evaluation. We need standard benchmarks for memory quality, not just retrieval speed.

The research is moving fast, but production systems are still figuring out basics. The gap between “memory works in a demo” and “memory works with 1000 users over 6 months” is large.

Conclusion

Building agent memory that works at scale is harder than it looks.

Vector databases alone aren’t enough. Memory frameworks are too simple. RAG over conversations hits limits quickly.

What worked for us: hybrid architecture (Postgres + Qdrant + Redis), explicit memory types and scopes, shared memory pools across agents, smart retrieval that combines similarity with metadata, and automatic importance decay.

What we’re still figuring out: memory consolidation, quality measurement, conflict resolution, and cost optimization at larger scale.

If you’re building agent systems, you’ll hit these problems too. Start simple, instrument everything, and expect to iterate. There’s no perfect solution yet — just tradeoffs.

What are you seeing in your agent systems? What memory strategies have worked for you? I’d be interested to hear what others are doing.