<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[ITNEXT - Medium]]></title>
        <description><![CDATA[ITNEXT is a platform for IT developers &amp; software engineers to share knowledge, connect, collaborate, learn and experience next-gen technologies. - Medium]]></description>
        <link>https://itnext.io?source=rss----5b301f10ddcd---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>ITNEXT - Medium</title>
            <link>https://itnext.io?source=rss----5b301f10ddcd---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 17 May 2026 06:19:41 GMT</lastBuildDate>
        <atom:link href="https://itnext.io/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Building MCP Apps with Angular]]></title>
            <link>https://itnext.io/building-mcp-apps-with-angular-8721e83572ab?source=rss----5b301f10ddcd---4</link>
            <guid isPermaLink="false">https://medium.com/p/8721e83572ab</guid>
            <category><![CDATA[typescript]]></category>
            <category><![CDATA[mcp-app]]></category>
            <category><![CDATA[mcp-server]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[angular]]></category>
            <dc:creator><![CDATA[Dale Nguyen]]></dc:creator>
            <pubDate>Sat, 16 May 2026 19:05:20 GMT</pubDate>
            <atom:updated>2026-05-16T19:05:18.368Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*b97rz8ll2jiWy2va.png" /><figcaption>Building MCP Apps with Angular</figcaption></figure><p>If you’ve been building MCP servers, you know the drill: your tool returns JSON, the host renders it as text, and the user squints at a timestamp string. <a href="https://github.com/modelcontextprotocol/ext-apps">MCP Apps</a> change that — they let your server ship an interactive UI that the host renders in an iframe, right inside the conversation.</p><p>MCP Apps are built on the <strong>Model Context Protocol</strong> — an open standard. They’re not tied to Claude or any specific AI provider. Any host that implements the MCP Apps specification (Claude Desktop, custom chat clients, or other AI assistants that adopt MCP) can render your UI. You build it once, and it works everywhere MCP is supported.</p><p>This post walks through building MCP Apps with Angular. We’ll start with a single tool, add a second tool with its own UI, and then show how to share code between them without bloating either bundle.</p><h3>How MCP Apps Work (Quick Recap)</h3><pre>View (Angular App) &lt;--PostMessageTransport--&gt; Host (AppBridge) &lt;--MCP Client--&gt; MCP Server</pre><ul><li><strong>Server</strong> registers tools and resources. Each tool can point to a resource URI containing the UI.</li><li><strong>Host</strong> (the chat client) fetches that resource and renders it in a sandboxed iframe.</li><li><strong>View</strong> is your Angular app running inside that iframe. It uses the App class from @modelcontextprotocol/ext-apps to communicate with the host.</li></ul><p>The key insight: your UI is bundled into a <strong>single self-contained HTML file</strong> using Vite and vite-plugin-singlefile. The host doesn&#39;t need to know it&#39;s Angular — it just loads HTML.</p><h3>Project Structure</h3><pre>basic-server-angular/<br>├── mcp-app.html              # HTML entry point for UI #1<br>├── greeting-app.html         # HTML entry point for UI #2<br>├── src/<br>│   ├── main.ts               # Angular bootstrap for UI #1<br>│   ├── app.component.ts      # Get Time component<br>│   ├── greeting-main.ts      # Angular bootstrap for UI #2<br>│   ├── greeting.component.ts # Greeting component<br>│   ├── shared/<br>│   │   └── mcp-app-setup.ts  # Shared App + theming setup<br>│   └── global.css            # Host-aware CSS variables<br>├── server.ts                 # MCP server (registers tools + resources)<br>├── main.ts                   # Server entry point (HTTP + stdio)<br>└── vite.config.ts            # Builds each HTML into a single file</pre><h3>Step 1: The Server</h3><p>Every MCP App starts on the server side. You register a <strong>tool</strong> (what the LLM calls) and a <strong>resource</strong> (the HTML that gets rendered). They’re linked by a resource URI.</p><pre>// server.ts<br>import { McpServer } from &quot;@modelcontextprotocol/sdk/server/mcp.js&quot;;<br>import type { CallToolResult, ReadResourceResult } from &quot;@modelcontextprotocol/sdk/types.js&quot;;<br>import fs from &quot;node:fs/promises&quot;;<br>import path from &quot;node:path&quot;;<br>import {<br>  registerAppTool,<br>  registerAppResource,<br>  RESOURCE_MIME_TYPE,<br>} from &quot;@modelcontextprotocol/ext-apps/server&quot;;const DIST_DIR = import.meta.filename.endsWith(&quot;.ts&quot;)<br>  ? 
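// running from TS source (dev): the bundled UI files live in ./dist; once compiled, they sit next to this file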
path.join(import.meta.dirname, &quot;dist&quot;)<br>  : import.meta.dirname;<br><br>export function createServer(): McpServer {<br>  const server = new McpServer({<br>    name: &quot;Basic MCP App Server (Angular)&quot;,<br>    version: &quot;1.0.0&quot;,<br>  });<br>  const resourceUri = &quot;ui://get-time/mcp-app.html&quot;;<br>  // Register the tool - this is what the LLM calls<br>  registerAppTool(server, &quot;get-time&quot;, {<br>    title: &quot;Get Time&quot;,<br>    description: &quot;Returns the current server time as an ISO 8601 string.&quot;,<br>    inputSchema: {},<br>    _meta: { ui: { resourceUri } }, // Links this tool to its UI<br>  }, async (): Promise&lt;CallToolResult&gt; =&gt; {<br>    const time = new Date().toISOString();<br>    return { content: [{ type: &quot;text&quot;, text: time }] };<br>  });<br>  // Register the resource - the bundled HTML for this tool&#39;s UI<br>  registerAppResource(server, resourceUri, resourceUri, {<br>    mimeType: RESOURCE_MIME_TYPE,<br>  }, async (): Promise&lt;ReadResourceResult&gt; =&gt; {<br>    const html = await fs.readFile(<br>      path.join(DIST_DIR, &quot;mcp-app.html&quot;),<br>      &quot;utf-8&quot;,<br>    );<br>    return {<br>      contents: [{ uri: resourceUri, mimeType: RESOURCE_MIME_TYPE, text: html }],<br>    };<br>  });<br>  return server;<br>}</pre><p>The _meta.ui.resourceUri is the glue. When the host calls this tool, it reads that field to know which resource to fetch and render.</p><h3>Step 2: The HTML Entry Point</h3><p>Each UI needs an HTML file at the project root. This is the Vite entry point that gets bundled into a single self-contained file.</p><pre>&lt;!-- mcp-app.html --&gt;<br>&lt;!DOCTYPE html&gt;<br>&lt;html lang=&quot;en&quot;&gt;<br>&lt;head&gt;<br>  &lt;meta charset=&quot;UTF-8&quot;&gt;<br>  &lt;meta name=&quot;viewport&quot; content=&quot;width=device-width, initial-scale=1.0&quot;&gt;<br>  &lt;meta name=&quot;color-scheme&quot; content=&quot;light dark&quot;&gt;<br>  &lt;title&gt;Get Time App&lt;/title&gt;<br>  &lt;link rel=&quot;stylesheet&quot; href=&quot;/src/global.css&quot;&gt;<br>&lt;/head&gt;<br>&lt;body&gt;<br>  &lt;app-root&gt;&lt;/app-root&gt;<br>  &lt;script type=&quot;module&quot; src=&quot;/src/main.ts&quot;&gt;&lt;/script&gt;<br>&lt;/body&gt;<br>&lt;/html&gt;</pre><h3>Step 3: The Angular App</h3><p>The bootstrap is minimal — Angular 19+ with zoneless change detection:</p><pre>// src/main.ts<br>import &quot;@angular/compiler&quot;;<br>import { bootstrapApplication } from &quot;@angular/platform-browser&quot;;<br>import { provideZonelessChangeDetection } from &quot;@angular/core&quot;;<br>import { AppComponent } from &quot;./app.component&quot;;<br>import &quot;./global.css&quot;;<br><br>bootstrapApplication(AppComponent, {<br>  providers: [provideZonelessChangeDetection()],<br>}).catch((err) =&gt; console.error(err));</pre><p>Now the component itself. 
The App class from @modelcontextprotocol/ext-apps is the bridge between your Angular code and the host:</p><pre>// src/app.component.ts<br>import { Component, type OnInit, signal } from &quot;@angular/core&quot;;<br>import {<br>  App,<br>  applyDocumentTheme,<br>  applyHostStyleVariables,<br>  applyHostFonts,<br>  type McpUiHostContext,<br>} from &quot;@modelcontextprotocol/ext-apps&quot;;<br>import type { CallToolResult } from &quot;@modelcontextprotocol/sdk/types.js&quot;;<br><br>function extractText(result: CallToolResult): string {<br>  return result.content?.find((c) =&gt; c.type === &quot;text&quot;)!.text;<br>}<br>@Component({<br>  selector: &quot;app-root&quot;,<br>  template: `<br>    &lt;main<br>      [style.padding-top.px]=&quot;hostContext()?.safeAreaInsets?.top&quot;<br>      [style.padding-right.px]=&quot;hostContext()?.safeAreaInsets?.right&quot;<br>      [style.padding-bottom.px]=&quot;hostContext()?.safeAreaInsets?.bottom&quot;<br>      [style.padding-left.px]=&quot;hostContext()?.safeAreaInsets?.left&quot;<br>    &gt;<br>      &lt;p&gt;&lt;strong&gt;Server Time:&lt;/strong&gt; &lt;code&gt;{{ serverTime() }}&lt;/code&gt;&lt;/p&gt;<br>      &lt;button (click)=&quot;handleGetTime()&quot;&gt;Get Server Time&lt;/button&gt;<br>    &lt;/main&gt;<br>  `,<br>})<br>export class AppComponent implements OnInit {<br>  private app: App | null = null;<br>  hostContext = signal&lt;McpUiHostContext | undefined&gt;(undefined);<br>  serverTime = signal(&quot;Loading...&quot;);<br>  async ngOnInit() {<br>    const instance = new App({ name: &quot;Get Time App&quot;, version: &quot;1.0.0&quot; });<br>    // Called when the host sends tool results back to the UI<br>    instance.ontoolresult = (result) =&gt; {<br>      this.serverTime.set(extractText(result));<br>    };<br>    // Respond to theme and style changes from the host<br>    instance.onhostcontextchanged = (params) =&gt; {<br>      const ctx = { ...this.hostContext(), ...params };<br>      this.hostContext.set(ctx);<br>      if (ctx.theme) applyDocumentTheme(ctx.theme);<br>      if (ctx.styles?.variables) applyHostStyleVariables(ctx.styles.variables);<br>      if (ctx.styles?.css?.fonts) applyHostFonts(ctx.styles.css.fonts);<br>    };<br>    // Connect to the host via PostMessageTransport<br>    await instance.connect();<br>    this.app = instance;<br>    // Apply initial host context<br>    const ctx = instance.getHostContext();<br>    this.hostContext.set(ctx);<br>    if (ctx?.theme) applyDocumentTheme(ctx.theme);<br>    if (ctx?.styles?.variables) applyHostStyleVariables(ctx.styles.variables);<br>    if (ctx?.styles?.css?.fonts) applyHostFonts(ctx.styles.css.fonts);<br>  }<br>  async handleGetTime() {<br>    if (!this.app) return;<br>    try {<br>      const result = await this.app.callServerTool({<br>        name: &quot;get-time&quot;,<br>        arguments: {},<br>      });<br>      this.serverTime.set(extractText(result));<br>    } catch {<br>      this.serverTime.set(&quot;[ERROR]&quot;);<br>    }<br>  }<br>}</pre><p>A few things to note:</p><ul><li><strong>App class</strong> — this is the SDK&#39;s main entry point. You create one, wire up callbacks, and call connect(). That&#39;s it.</li><li><strong>ontoolresult</strong> — fires when the host sends a tool result. This is how data flows from the server to your UI.</li><li><strong>callServerTool()</strong> — lets the UI call tools on the server. The host proxies this through the MCP client.</li><li><strong>onhostcontextchanged</strong> — the host pushes theme and style updates. 
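These arrive as partial updates, which is why the handler merges them into the previous context before applying them.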
The helper functions (applyDocumentTheme, etc.) apply them as CSS variables on document, so your component styles just work.</li><li><strong>safeAreaInsets</strong> — the host tells you how much padding to leave for its chrome. Use it on your root container.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*G1UOknAjICsfzJIR.png" /><figcaption>Get time UI</figcaption></figure><h3>Step 4: Adding a Second UI</h3><p>Here’s where it gets interesting. Say you want a “Greet” tool with its own UI. Each tool gets its own HTML entry point, its own Angular app, and its own resource registration.</p><h3>Server: Register Both Tools</h3><pre>// server.ts — inside createServer()<br><br>// Tool #1: Get Time<br>const timeResourceUri = &quot;ui://get-time/mcp-app.html&quot;;<br>registerAppTool(server, &quot;get-time&quot;, {<br>  title: &quot;Get Time&quot;,<br>  description: &quot;Returns the current server time.&quot;,<br>  inputSchema: {},<br>  _meta: { ui: { resourceUri: timeResourceUri } },<br>}, async (): Promise&lt;CallToolResult&gt; =&gt; {<br>  return { content: [{ type: &quot;text&quot;, text: new Date().toISOString() }] };<br>});<br>registerAppResource(server, timeResourceUri, timeResourceUri, {<br>  mimeType: RESOURCE_MIME_TYPE,<br>}, async (): Promise&lt;ReadResourceResult&gt; =&gt; {<br>  const html = await fs.readFile(path.join(DIST_DIR, &quot;mcp-app.html&quot;), &quot;utf-8&quot;);<br>  return { contents: [{ uri: timeResourceUri, mimeType: RESOURCE_MIME_TYPE, text: html }] };<br>});<br>// Tool #2: Greet<br>const greetResourceUri = &quot;ui://greet/greeting-app.html&quot;;<br>registerAppTool(server, &quot;greet&quot;, {<br>  title: &quot;Greet&quot;,<br>  description: &quot;Returns a personalised greeting.&quot;,<br>  inputSchema: {<br>    name: z.string().optional().default(&quot;World&quot;).describe(&quot;Name to greet&quot;),<br>  },<br>  _meta: { ui: { resourceUri: greetResourceUri } },<br>}, async ({ name }: { name?: string }): Promise&lt;CallToolResult&gt; =&gt; {<br>  const greeting = `Hello, ${name || &quot;World&quot;}! 
Welcome to the MCP Apps SDK.`;<br>  return { content: [{ type: &quot;text&quot;, text: greeting }] };<br>});<br>registerAppResource(server, greetResourceUri, greetResourceUri, {<br>  mimeType: RESOURCE_MIME_TYPE,<br>}, async (): Promise&lt;ReadResourceResult&gt; =&gt; {<br>  const html = await fs.readFile(path.join(DIST_DIR, &quot;greeting-app.html&quot;), &quot;utf-8&quot;);<br>  return { contents: [{ uri: greetResourceUri, mimeType: RESOURCE_MIME_TYPE, text: html }] };<br>});</pre><h3>Greeting Component</h3><p>The greeting UI is a completely separate Angular app:</p><pre>&lt;!-- greeting-app.html --&gt;<br>&lt;!DOCTYPE html&gt;<br>&lt;html lang=&quot;en&quot;&gt;<br>&lt;head&gt;<br>  &lt;meta charset=&quot;UTF-8&quot;&gt;<br>  &lt;meta name=&quot;viewport&quot; content=&quot;width=device-width, initial-scale=1.0&quot;&gt;<br>  &lt;meta name=&quot;color-scheme&quot; content=&quot;light dark&quot;&gt;<br>  &lt;title&gt;Greeting App&lt;/title&gt;<br>  &lt;link rel=&quot;stylesheet&quot; href=&quot;/src/global.css&quot;&gt;<br>&lt;/head&gt;<br>&lt;body&gt;<br>  &lt;greeting-root&gt;&lt;/greeting-root&gt;<br>  &lt;script type=&quot;module&quot; src=&quot;/src/greeting-main.ts&quot;&gt;&lt;/script&gt;<br>&lt;/body&gt;<br>&lt;/html&gt;</pre><pre>// src/greeting-main.ts<br>import &quot;@angular/compiler&quot;;<br>import { bootstrapApplication } from &quot;@angular/platform-browser&quot;;<br>import { provideZonelessChangeDetection } from &quot;@angular/core&quot;;<br>import { GreetingComponent } from &quot;./greeting.component&quot;;<br>import &quot;./global.css&quot;;<br><br>bootstrapApplication(GreetingComponent, {<br>  providers: [provideZonelessChangeDetection()],<br>}).catch((err) =&gt; console.error(err));</pre><pre>// src/greeting.component.ts<br>import { Component, type OnInit, signal } from &quot;@angular/core&quot;;<br>import { FormsModule } from &quot;@angular/forms&quot;;<br>import {<br>  App,<br>  applyDocumentTheme,<br>  applyHostStyleVariables,<br>  applyHostFonts,<br>  type McpUiHostContext,<br>} from &quot;@modelcontextprotocol/ext-apps&quot;;<br>import type { CallToolResult } from &quot;@modelcontextprotocol/sdk/types.js&quot;;<br><br>function extractText(result: CallToolResult): string {<br>  return result.content?.find((c) =&gt; c.type === &quot;text&quot;)!.text;<br>}<br>@Component({<br>  selector: &quot;greeting-root&quot;,<br>  imports: [FormsModule],<br>  template: `<br>    &lt;main<br>      [style.padding-top.px]=&quot;hostContext()?.safeAreaInsets?.top&quot;<br>      [style.padding-right.px]=&quot;hostContext()?.safeAreaInsets?.right&quot;<br>      [style.padding-bottom.px]=&quot;hostContext()?.safeAreaInsets?.bottom&quot;<br>      [style.padding-left.px]=&quot;hostContext()?.safeAreaInsets?.left&quot;<br>    &gt;<br>      &lt;div&gt;<br>        &lt;label&gt;&lt;strong&gt;Your name:&lt;/strong&gt;&lt;/label&gt;<br>        &lt;input type=&quot;text&quot; [(ngModel)]=&quot;nameText&quot; placeholder=&quot;Enter your name&quot;&gt;<br>        &lt;button (click)=&quot;handleGreet()&quot;&gt;Get Greeting&lt;/button&gt;<br>      &lt;/div&gt;<br>      @if (greeting()) {<br>        &lt;div class=&quot;greeting-display&quot;&gt;{{ greeting() }}&lt;/div&gt;<br>      }<br>    &lt;/main&gt;<br>  `,<br>})<br>export class GreetingComponent implements OnInit {<br>  private app: App | null = null;<br>  hostContext = signal&lt;McpUiHostContext | undefined&gt;(undefined);<br>  greeting = signal(&quot;&quot;);<br>  nameText = &quot;&quot;;<br>  async ngOnInit() {<br>    const instance = 
new App({ name: &quot;Greeting App&quot;, version: &quot;1.0.0&quot; });<br>    instance.ontoolresult = (result) =&gt; {<br>      this.greeting.set(extractText(result));<br>    };<br>    instance.onhostcontextchanged = (params) =&gt; {<br>      const ctx = { ...this.hostContext(), ...params };<br>      this.hostContext.set(ctx);<br>      if (ctx.theme) applyDocumentTheme(ctx.theme);<br>      if (ctx.styles?.variables) applyHostStyleVariables(ctx.styles.variables);<br>      if (ctx.styles?.css?.fonts) applyHostFonts(ctx.styles.css.fonts);<br>    };<br>    await instance.connect();<br>    this.app = instance;<br>    const ctx = instance.getHostContext();<br>    this.hostContext.set(ctx);<br>    if (ctx?.theme) applyDocumentTheme(ctx.theme);<br>    if (ctx?.styles?.variables) applyHostStyleVariables(ctx.styles.variables);<br>    if (ctx?.styles?.css?.fonts) applyHostFonts(ctx.styles.css.fonts);<br>  }<br>  async handleGreet() {<br>    if (!this.app) return;<br>    try {<br>      const name = this.nameText.trim() || &quot;World&quot;;<br>      const result = await this.app.callServerTool({<br>        name: &quot;greet&quot;,<br>        arguments: { name },<br>      });<br>      this.greeting.set(extractText(result));<br>    } catch {<br>      this.greeting.set(&quot;[ERROR]&quot;);<br>    }<br>  }<br>}</pre><h3>Step 5: Sharing Code Between UIs</h3><p>You probably noticed that both components have identical App setup and theming boilerplate. That&#39;s a great candidate for extraction — and since each HTML is a separate Vite entry point, <strong>Vite&#39;s tree-shaking ensures each bundle only includes what it actually imports</strong>.</p><p>Create a shared setup module:</p><pre>// src/shared/mcp-app-setup.ts<br>import {<br>  App,<br>  applyDocumentTheme,<br>  applyHostStyleVariables,<br>  applyHostFonts,<br>  type McpUiHostContext,<br>} from &quot;@modelcontextprotocol/ext-apps&quot;;<br>import type { WritableSignal } from &quot;@angular/core&quot;;<br>import type { CallToolResult } from &quot;@modelcontextprotocol/sdk/types.js&quot;;<br><br>/**<br> * Extract text content from a tool result.<br> */<br>export function extractText(result: CallToolResult): string {<br>  return result.content?.find((c) =&gt; c.type === &quot;text&quot;)!.text;<br>}<br>/**<br> * Apply host context (theme, styles, fonts) to the document.<br> */<br>function applyContext(ctx: McpUiHostContext): void {<br>  if (ctx.theme) applyDocumentTheme(ctx.theme);<br>  if (ctx.styles?.variables) applyHostStyleVariables(ctx.styles.variables);<br>  if (ctx.styles?.css?.fonts) applyHostFonts(ctx.styles.css.fonts);<br>}<br>/**<br> * Create and connect an MCP App instance with standard host-context<br> * handling wired up. 
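Tool input and cancellation notifications are logged via console.info, and errors are routed to console.error.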
Both UIs call this instead of duplicating setup.<br> */<br>export async function createMcpApp(<br>  name: string,<br>  hostContext: WritableSignal&lt;McpUiHostContext | undefined&gt;,<br>  onToolResult?: (result: CallToolResult) =&gt; void,<br>): Promise&lt;App&gt; {<br>  const app = new App({ name, version: &quot;1.0.0&quot; });<br>  app.ontoolinput = (params) =&gt; console.info(&quot;Received tool input:&quot;, params);<br>  app.ontoolcancelled = (params) =&gt; console.info(&quot;Tool cancelled:&quot;, params.reason);<br>  app.onerror = console.error;<br>  if (onToolResult) {<br>    app.ontoolresult = onToolResult;<br>  }<br>  app.onhostcontextchanged = (params) =&gt; {<br>    const ctx = { ...hostContext(), ...params } as McpUiHostContext;<br>    hostContext.set(ctx);<br>    applyContext(ctx);<br>  };<br>  await app.connect();<br>  const ctx = app.getHostContext();<br>  hostContext.set(ctx);<br>  if (ctx) applyContext(ctx);<br>  return app;<br>}</pre><p>Now both components become much simpler:</p><pre>// src/app.component.ts — simplified<br>import { Component, type OnInit, signal } from &quot;@angular/core&quot;;<br>import type { McpUiHostContext } from &quot;@modelcontextprotocol/ext-apps&quot;;<br>import type { App } from &quot;@modelcontextprotocol/ext-apps&quot;;<br>import { createMcpApp, extractText } from &quot;./shared/mcp-app-setup&quot;;<br><br>@Component({<br>  selector: &quot;app-root&quot;,<br>  template: `<br>    &lt;main [style.padding-top.px]=&quot;hostContext()?.safeAreaInsets?.top&quot;&gt;<br>      &lt;p&gt;&lt;strong&gt;Server Time:&lt;/strong&gt; &lt;code&gt;{{ serverTime() }}&lt;/code&gt;&lt;/p&gt;<br>      &lt;button (click)=&quot;handleGetTime()&quot;&gt;Get Server Time&lt;/button&gt;<br>    &lt;/main&gt;<br>  `,<br>})<br>export class AppComponent implements OnInit {<br>  private app: App | null = null;<br>  hostContext = signal&lt;McpUiHostContext | undefined&gt;(undefined);<br>  serverTime = signal(&quot;Loading...&quot;);<br>  async ngOnInit() {<br>    this.app = await createMcpApp(<br>      &quot;Get Time App&quot;,<br>      this.hostContext,<br>      (result) =&gt; this.serverTime.set(extractText(result)),<br>    );<br>  }<br>  async handleGetTime() {<br>    if (!this.app) return;<br>    try {<br>      const result = await this.app.callServerTool({<br>        name: &quot;get-time&quot;,<br>        arguments: {},<br>      });<br>      this.serverTime.set(extractText(result));<br>    } catch {<br>      this.serverTime.set(&quot;[ERROR]&quot;);<br>    }<br>  }<br>}</pre><pre>// src/greeting.component.ts — simplified<br>import { Component, type OnInit, signal } from &quot;@angular/core&quot;;<br>import { FormsModule } from &quot;@angular/forms&quot;;<br>import type { McpUiHostContext } from &quot;@modelcontextprotocol/ext-apps&quot;;<br>import type { App } from &quot;@modelcontextprotocol/ext-apps&quot;;<br>import { createMcpApp, extractText } from &quot;./shared/mcp-app-setup&quot;;<br><br>@Component({<br>  selector: &quot;greeting-root&quot;,<br>  imports: [FormsModule],<br>  template: `<br>    &lt;main [style.padding-top.px]=&quot;hostContext()?.safeAreaInsets?.top&quot;&gt;<br>      &lt;label&gt;&lt;strong&gt;Your name:&lt;/strong&gt;&lt;/label&gt;<br>      &lt;input type=&quot;text&quot; [(ngModel)]=&quot;nameText&quot; placeholder=&quot;Enter your name&quot;&gt;<br>      &lt;button (click)=&quot;handleGreet()&quot;&gt;Get Greeting&lt;/button&gt;<br>      @if (greeting()) {<br>        &lt;div class=&quot;greeting-display&quot;&gt;{{ greeting() 
}}&lt;/div&gt;<br>      }<br>    &lt;/main&gt;<br>  `,<br>})<br>export class GreetingComponent implements OnInit {<br>  private app: App | null = null;<br>  hostContext = signal&lt;McpUiHostContext | undefined&gt;(undefined);<br>  greeting = signal(&quot;&quot;);<br>  nameText = &quot;&quot;;<br>  async ngOnInit() {<br>    this.app = await createMcpApp(<br>      &quot;Greeting App&quot;,<br>      this.hostContext,<br>      (result) =&gt; this.greeting.set(extractText(result)),<br>    );<br>  }<br>  async handleGreet() {<br>    if (!this.app) return;<br>    try {<br>      const name = this.nameText.trim() || &quot;World&quot;;<br>      const result = await this.app.callServerTool({<br>        name: &quot;greet&quot;,<br>        arguments: { name },<br>      });<br>      this.greeting.set(extractText(result));<br>    } catch {<br>      this.greeting.set(&quot;[ERROR]&quot;);<br>    }<br>  }<br>}</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*UrhYUXIMyX0IEto-.png" /><figcaption>Greeting UI</figcaption></figure><p>Both components import createMcpApp and extractText from the shared module. Since they&#39;re in separate Vite builds, tree-shaking still applies — if you add more shared utilities later, each bundle only pulls in what it calls.</p><h3>Sharing Models and Types</h3><p>The same principle works for shared data models. If both UIs work with common types — say a user profile that comes from the server:</p><pre>// src/shared/models.ts<br>import type { CallToolResult } from &quot;@modelcontextprotocol/sdk/types.js&quot;;<br><br>export interface UserProfile {<br>  name: string;<br>  email: string;<br>  role: &quot;admin&quot; | &quot;viewer&quot;;<br>}<br>export function parseUserProfile(result: CallToolResult): UserProfile {<br>  const text = result.content?.find((c) =&gt; c.type === &quot;text&quot;)!.text;<br>  return JSON.parse(text) as UserProfile;<br>}</pre><p>Both components can import { UserProfile, parseUserProfile } from &quot;./shared/models&quot; — the types are erased at build time (zero cost), and the parser function is only included in bundles that call it. This is a natural place to put validation logic, formatters, or any domain code that multiple UIs need.</p><h3>Step 6: The Build</h3><p>The Vite config uses an INPUT environment variable to select which HTML file to build:</p><pre>// vite.config.ts<br>import { defineConfig } from &quot;vite&quot;;<br>import { viteSingleFile } from &quot;vite-plugin-singlefile&quot;;<br><br>const INPUT = process.env.INPUT;<br>if (!INPUT) throw new Error(&quot;INPUT environment variable is not set&quot;);<br>export default defineConfig({<br>  plugins: [viteSingleFile()],<br>  build: {<br>    rollupOptions: { input: INPUT },<br>    outDir: &quot;dist&quot;,<br>    emptyOutDir: false, // Key: don&#39;t wipe previous builds<br>  },<br>});</pre><p>The emptyOutDir: false is important — it lets you run Vite multiple times, once per HTML file, into the same dist/ directory.</p><p>The build script chains them:</p><pre>{<br>  &quot;scripts&quot;: {<br>    &quot;build&quot;: &quot;tsc --noEmit &amp;&amp; cross-env INPUT=mcp-app.html vite build &amp;&amp; cross-env INPUT=greeting-app.html vite build &amp;&amp; tsc -p tsconfig.server.json &amp;&amp; bun build server.ts --outdir dist --target node&quot;<br>  }<br>}</pre><p>Each HTML file produces a fully self-contained output (all JS, CSS, and Angular runtime inlined). 
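</p><p>If you want to convince yourself of that, a quick check (a sketch that assumes the dist/ layout produced by the build script above) is to look for external references in the bundle:</p><pre># prints &quot;fully inlined&quot; when no external script or stylesheet references remain<br>grep -E &#39;src=&quot;http|href=&quot;http&#39; dist/mcp-app.html || echo &quot;fully inlined&quot;</pre><p>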
The two bundles are completely independent.</p><h3>Theming: Looking Native in Any Host</h3><p>MCP Apps can look native in any host (Claude Desktop, a custom chat client, etc.) by using CSS variables that the host provides. The global.css file defines sensible fallbacks:</p><pre>:root {<br>  color-scheme: light dark;<br><br>  --color-text-primary: light-dark(#1f2937, #f3f4f6);<br>  --color-background-primary: light-dark(#ffffff, #1a1a1a);<br>  --color-accent: #2563eb;<br>  --color-text-on-accent: #ffffff;<br>  --border-radius-md: 6px;<br>  --spacing-unit: var(--font-text-md-size);<br>  --spacing-sm: calc(var(--spacing-unit) * 0.5);<br>  --spacing-md: var(--spacing-unit);<br>  --spacing-lg: calc(var(--spacing-unit) * 1.5);<br>  /* ... more variables */<br>}</pre><p>When the host sends style updates via onhostcontextchanged, the helper functions overwrite these variables on the document root. Your Angular component styles reference the variables (var(--color-accent), var(--spacing-md)), so they adapt automatically — no theme prop drilling needed.</p><h3>Recap</h3><p>The pattern for building Angular MCP Apps:</p><ol><li><strong>Server</strong>: register a tool + resource pair per UI, linked by a resource URI</li><li><strong>HTML</strong>: one entry point per UI, each bootstrapping its own Angular app</li><li><strong>Component</strong>: create an App instance, wire up callbacks, call connect()</li><li><strong>Shared code</strong>: extract common setup into a shared module — Vite tree-shakes per entry point</li><li><strong>Build</strong>: run Vite once per HTML file into the same dist/ directory</li><li><strong>Theming</strong>: use host CSS variables with fallbacks, apply updates via onhostcontextchanged</li></ol><p>Each UI is a self-contained Angular application. They share a server, they can share code, but their bundles are independent. Add a third tool? Same pattern — new HTML, new component, new registration, one more vite build in the chain.</p><p>The full source is available in the <a href="https://github.com/dalenguyen/ext-apps/tree/main/examples/basic-server-angular">ext-apps examples</a>.</p><hr><p><a href="https://itnext.io/building-mcp-apps-with-angular-8721e83572ab">Building MCP Apps with Angular</a> was originally published in <a href="https://itnext.io">ITNEXT</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Securing Your MCP Server with Firebase Auth: A Production Walkthrough]]></title>
            <link>https://itnext.io/securing-your-mcp-server-with-firebase-auth-a-production-walkthrough-651bf398d797?source=rss----5b301f10ddcd---4</link>
            <guid isPermaLink="false">https://medium.com/p/651bf398d797</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[firebase]]></category>
            <category><![CDATA[personal-finance]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[mcp-server]]></category>
            <dc:creator><![CDATA[Dale Nguyen]]></dc:creator>
            <pubDate>Sat, 16 May 2026 19:04:48 GMT</pubDate>
            <atom:updated>2026-05-16T19:04:45.915Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*tK9TMrsZJOEN2Tvo.png" /><figcaption>Add Authentication to your MCP</figcaption></figure><p>Model Context Protocol (MCP) servers let AI assistants interact with real user data. That means auth isn’t optional — it’s the difference between a useful tool and a data breach. This post walks through exactly how <a href="https://cantax.fyi/">Can Tax Pro</a> secures its Python MCP server with Firebase Authentication, supporting both Firebase ID tokens (for direct access) and a custom OAuth 2.0 flow (for third-party clients like Claude.ai).</p><h3>Architecture Overview</h3><p>The system has three moving parts:</p><pre>Browser / Claude.ai Client<br>       │<br>       │  Authorization: Bearer &lt;token&gt;<br>       ▼<br>  MCP Server (Python/FastMCP on Cloud Run)<br>       │<br>       │  Firebase Admin SDK<br>       ▼<br>  Firestore (data isolated by userId)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*1cPMrKfra9__psTh.gif" /></figure><p>The MCP server accepts <strong>two token types</strong>:</p><ol><li><strong>Firebase ID tokens</strong> — issued by Firebase Authentication, verified cryptographically</li><li><strong>Custom OAuth tokens</strong> (ctpo_*) — issued by the web app&#39;s OAuth server, stored as hashes in Firestore</li></ol><p>The web app itself acts as the OAuth authorization server for third-party integrations.</p><h3>Step 1: Initialize Firebase Admin SDK</h3><p>The server initializes Firebase Admin once at startup, with environment-aware credential resolution:</p><pre># main.py<br>import firebase_admin<br>from firebase_admin import credentials, auth as firebase_auth, firestore</pre><pre>if not firebase_admin._apps:<br>    sa_json = os.environ.get(&quot;FIREBASE_SERVICE_ACCOUNT&quot;)<br>    project_id = os.environ.get(&quot;FIREBASE_PROJECT_ID&quot;)<br>    if sa_json:<br>        cred = credentials.Certificate(json.loads(sa_json))<br>        firebase_admin.initialize_app(cred, {&quot;projectId&quot;: project_id})<br>    else:<br>        firebase_admin.initialize_app(options={&quot;projectId&quot;: project_id})</pre><p><strong>Locally</strong>: set FIREBASE_SERVICE_ACCOUNT to your service account JSON. <strong>On Cloud Run</strong>: omit it entirely — the SDK picks up Application Default Credentials (ADC) automatically via Workload Identity.</p><p>This means no secrets in production. Your Cloud Run service account just needs the Firebase Admin SDK Administrator Service Agent IAM role.</p><h3>Step 2: Token Resolution</h3><p>Two resolver functions handle each token type:</p><h3>OAuth Tokens (ctpo_*)</h3><p>Custom tokens are never stored in plaintext. The server hashes them with SHA-256 and looks up the hash in Firestore:</p><pre>def resolve_oauth_token(bearer_token: str) -&gt; str:<br>    token_hash = hashlib.sha256(bearer_token.encode()).hexdigest()<br>    doc = db.collection(&quot;oauthTokens&quot;).document(token_hash).get()<br>    if not doc.exists:<br>        raise ValueError(&quot;Invalid or revoked OAuth token&quot;)<br>    data = doc.to_dict()<br>    expires_at = data.get(&quot;expiresAt&quot;)<br>    if expires_at and expires_at &lt; datetime.now(timezone.utc):<br>        raise ValueError(&quot;OAuth token expired&quot;)<br>    return data[&quot;userId&quot;]</pre><p>The Firestore document stores userId, expiresAt, clientId, and a refreshTokenHash. 
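</p><p>For illustration, a stored document might look like this (field names as described above; the values are invented placeholders):</p><pre>// oauthTokens/&lt;sha256 hex of the access token&gt;<br>{<br>  &quot;userId&quot;: &quot;firebase-uid-of-the-owner&quot;,<br>  &quot;clientId&quot;: &quot;third-party-client-id&quot;,<br>  &quot;expiresAt&quot;: &quot;2026-05-16T20:04:48Z&quot;,<br>  &quot;refreshTokenHash&quot;: &quot;&lt;sha256 hex of the refresh token&gt;&quot;<br>}</pre><p>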
Revocation is instant — delete the document and the token stops working on the next request.</p><h3>Firebase ID Tokens</h3><p>Firebase handles the hard part:</p><pre>def resolve_id_token(id_token: str) -&gt; str:<br>    decoded = firebase_auth.verify_id_token(id_token)<br>    return decoded[&quot;uid&quot;]</pre><p>verify_id_token checks the signature against Google&#39;s public keys and validates claims (expiry, issuer, audience). It caches the public keys locally so it doesn&#39;t make a network call on every request.</p><h3>Step 3: The Auth Middleware</h3><p>A Starlette BaseHTTPMiddleware wraps every request. It tries the OAuth path first, then falls back to Firebase ID tokens:</p><pre>class AuthMiddleware(BaseHTTPMiddleware):<br>    async def dispatch(self, request: Request, call_next):<br>        # Skip auth for public endpoints<br>        if request.url.path in _PUBLIC_PATHS:<br>            return await call_next(request)<br>        <br>        auth_header = request.headers.get(&quot;Authorization&quot;, &quot;&quot;)<br>        <br>        if not auth_header.startswith(&quot;Bearer &quot;):<br>            return Response(<br>                status_code=401,<br>                headers={&quot;WWW-Authenticate&quot;: &#39;Bearer realm=&quot;tax-mcp&quot;&#39;},<br>            )<br>        token = auth_header.removeprefix(&quot;Bearer &quot;).strip()<br>        token_set = _user_id_var.set(&quot;&quot;)<br>        try:<br>            # Try OAuth token first<br>            if token.startswith(&quot;ctpo_&quot;):<br>                user_id = resolve_oauth_token(token)<br>            else:<br>                user_id = resolve_id_token(token)<br>            _user_id_var.set(user_id)<br>            return await call_next(request)<br>        except Exception as e:<br>            return Response(<br>                status_code=401,<br>                content=str(e),<br>                headers={&quot;WWW-Authenticate&quot;: &#39;Bearer realm=&quot;tax-mcp&quot;&#39;},<br>            )<br>        finally:<br>            _user_id_var.reset(token_set)</pre><p>Public paths (/health, /.well-known/oauth-protected-resource) bypass auth — necessary for health checks and OAuth discovery.</p><h3>Step 4: Thread-Safe User Context</h3><p>Tools shouldn’t take a user_id parameter — that would pollute every signature and make testing awkward. 
Instead, use a ContextVar to propagate identity through the async call stack:</p><pre># _context.py<br>from contextvars import ContextVar<br><br>_user_id_var: ContextVar[str] = ContextVar(&quot;user_id&quot;)<br>def get_user_id() -&gt; str:<br>    try:<br>        return _user_id_var.get()<br>    except LookupError:<br>        raise RuntimeError(&quot;User context not set - request did not pass through auth middleware.&quot;)</pre><p>Every tool calls get_user_id() to scope its Firestore queries:</p><pre># tools/income.py<br>from .._context import get_user_id<br><br>@mcp.tool()<br>def list_income(tax_year_id: str) -&gt; list[dict]:<br>    docs = (<br>        db.collection(&quot;users&quot;)<br>          .document(get_user_id())<br>          .collection(&quot;taxYears&quot;)<br>          .document(tax_year_id)<br>          .collection(&quot;incomeEntries&quot;)<br>          .stream()<br>    )<br>    return [{&quot;id&quot;: d.id, **d.to_dict()} for d in docs]</pre><p>ContextVar is async-safe — each concurrent request gets its own context, so there&#39;s no cross-contamination between users even under high concurrency.</p><p>The middleware resets the var in a finally block, which is critical: without cleanup, the context leaks to the next request on a reused coroutine.</p><h3>Step 5: The OAuth 2.0 Server (Web App Side)</h3><p>For third-party clients (Claude.ai, etc.), the web app implements an OAuth 2.0 authorization server. Here’s the flow:</p><pre>1. Client → GET /oauth/authorize?client_id=...&amp;code_challenge=...<br>2. User logs in (Firebase Auth)<br>3. Server → redirect with authorization_code<br>4. Client → POST /oauth/token with code + code_verifier<br>5. Server → { access_token: &quot;ctpo_...&quot;, refresh_token: &quot;ctpr_...&quot; }</pre><p><strong>Token generation</strong> (TypeScript, server-side):</p><pre>// server/routes/oauth/token.post.ts<br>import { createHash, randomBytes } from &quot;crypto&quot;;<br><br>const accessToken = &quot;ctpo_&quot; + randomBytes(32).toString(&quot;hex&quot;);<br>const tokenHash = createHash(&quot;sha256&quot;).update(accessToken).digest(&quot;hex&quot;);<br>await db.collection(&quot;oauthTokens&quot;).doc(tokenHash).set({<br>  userId: session.userId,<br>  clientId: session.clientId,<br>  expiresAt: new Date(Date.now() + 60 * 60 * 1000), // 1 hour<br>  createdAt: new Date(),<br>});<br>return { access_token: accessToken, token_type: &quot;bearer&quot;, expires_in: 3600 };</pre><p>The ctpo_ prefix lets the MCP server route to the right resolver without trying Firebase first. Only the hash ever touches the database.</p><p><strong>PKCE</strong> (Proof Key for Code Exchange) protects the authorization code flow. The client sends a code_challenge (SHA-256 of a random code_verifier) in step 1, then proves ownership by sending the raw code_verifier in step 4. The server hashes it and compares:</p><pre>const verifierHash = createHash(&quot;sha256&quot;)<br>  .update(body.code_verifier)<br>  .digest(&quot;base64url&quot;);<br><br>if (verifierHash !== session.codeChallenge) {<br>  throw createError({ statusCode: 400, message: &quot;Invalid code_verifier&quot; });<br>}</pre><h3>Step 6: Client-Side (Browser)</h3><p>The Angular/Analog web app attaches Firebase ID tokens to outbound API requests via an HTTP interceptor:</p><pre>// auth.interceptor.ts<br>export const authInterceptor: HttpInterceptorFn = (req, next) =&gt; {<br>  const auth = inject(Auth);<br>  return from(auth.currentUser?.getIdToken() ?? 
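// no signed-in user: resolve to null and forward the request without an Authorization header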
Promise.resolve(null)).pipe(<br>    switchMap((token) =&gt; {<br>      if (!token) return next(req);<br>      return next(req.clone({<br>        setHeaders: { Authorization: `Bearer ${token}` },<br>      }));<br>    }),<br>  );<br>};</pre><p>getIdToken() auto-refreshes the token when it&#39;s close to expiry, so you never send a stale JWT.</p><h3>Step 7: Firestore Security Rules</h3><p>The MCP server uses the <strong>Admin SDK</strong>, which bypasses all Firestore security rules. The get_user_id() context is your authorization layer. But rules provide defense-in-depth for direct client SDK access:</p><pre>rules_version = &#39;2&#39;;<br>service cloud.firestore {<br>  match /databases/{database}/documents {<br>    match /users/{userId}/{document=**} {<br>      allow read, write: if request.auth != null &amp;&amp; request.auth.uid == userId;<br>    }<br>    match /oauthTokens/{tokenId} {<br>      allow read, write: if false; // Server-only<br>    }<br>  }<br>}</pre><p>OAuth token documents are locked to server-only access. Users can never read or write them directly.</p><h3>Step 8: OAuth Discovery Endpoint</h3><p>Well-behaved OAuth clients (including Claude.ai) auto-discover server capabilities. Expose the RFC 9728 metadata endpoint:</p><pre>@app.get(&quot;/.well-known/oauth-protected-resource&quot;)<br>async def oauth_protected_resource():<br>    return JSONResponse({<br>        &quot;resource&quot;: &quot;https://your-mcp-server.run.app&quot;,<br>        &quot;authorization_servers&quot;: [&quot;https://your-web-app.com&quot;],<br>        &quot;bearer_methods_supported&quot;: [&quot;header&quot;],<br>    })</pre><p>This tells clients where to find the authorization server and how to present tokens. It’s a public endpoint (no auth required) — make sure it’s in _PUBLIC_PATHS.</p><h3>Step 9: Local Testing</h3><p>Don’t rely on a real Firebase project for unit tests. Seed a test token directly:</p><pre># test_auth.py<br>import hashlib<br>from firebase_admin import firestore<br>from datetime import datetime, timezone, timedelta<br><br>TEST_TOKEN = &quot;ctpo_localtest_&quot; + &quot;a&quot; * 48<br>TEST_TOKEN_HASH = hashlib.sha256(TEST_TOKEN.encode()).hexdigest()<br>def seed_test_token(db, user_id: str):<br>    db.collection(&quot;oauthTokens&quot;).document(TEST_TOKEN_HASH).set({<br>        &quot;userId&quot;: user_id,<br>        &quot;clientId&quot;: &quot;test-client&quot;,<br>        &quot;expiresAt&quot;: datetime.now(timezone.utc) + timedelta(hours=1),<br>    })<br>def cleanup(db):<br>    db.collection(&quot;oauthTokens&quot;).document(TEST_TOKEN_HASH).delete()</pre><p>Then test the three critical paths: no token → 401, invalid token → 401, valid token → 200.</p><h3>Security Properties</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FnXXqXRmiD7FlEDnn4B0bg.png" /></figure><h3>Summary</h3><p>Securing an MCP server with Firebase Auth comes down to four things:</p><ol><li><strong>Middleware</strong> that validates tokens before any tool runs</li><li><strong>ContextVar</strong> to propagate user identity without polluting tool signatures</li><li><strong>Hash-only storage</strong> for custom OAuth tokens</li><li><strong>Workload Identity</strong> to eliminate service account key management in production</li></ol><p>The dual-token design (Firebase ID tokens for direct use, custom OAuth tokens for third-party clients) keeps the server flexible while maintaining a single, auditable auth path. 
Every request either has a valid, unexpired token mapping to a real user, or it gets a 401.</p><hr><p><a href="https://itnext.io/securing-your-mcp-server-with-firebase-auth-a-production-walkthrough-651bf398d797">Securing Your MCP Server with Firebase Auth: A Production Walkthrough</a> was originally published in <a href="https://itnext.io">ITNEXT</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Microsoft Security Copilot Agents: Inside the Agentic SOC]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://itnext.io/microsoft-security-copilot-agents-inside-the-agentic-soc-fcf7bd081afd?source=rss----5b301f10ddcd---4"><img src="https://cdn-images-1.medium.com/max/1448/1*iwVQvfwrdxfw4JKRO3qacg.png" width="1448"></a></p><p class="medium-feed-snippet">A deep technical walkthrough of the Sentinel data lake, the MCP server, and the agents now hunting threats inside Microsoft Defender&#x2026;</p><p class="medium-feed-link"><a href="https://itnext.io/microsoft-security-copilot-agents-inside-the-agentic-soc-fcf7bd081afd?source=rss----5b301f10ddcd---4">Continue reading on ITNEXT »</a></p></div>]]></description>
            <link>https://itnext.io/microsoft-security-copilot-agents-inside-the-agentic-soc-fcf7bd081afd?source=rss----5b301f10ddcd---4</link>
            <guid isPermaLink="false">https://medium.com/p/fcf7bd081afd</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[programming]]></category>
            <dc:creator><![CDATA[Dave R - Microsoft Azure & AI MVP☁️]]></dc:creator>
            <pubDate>Sat, 16 May 2026 19:02:49 GMT</pubDate>
            <atom:updated>2026-05-16T19:02:46.299Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[Tokensparsamkeit for coding assistants]]></title>
            <link>https://itnext.io/https-blog-frankel-ch-tokensparsamkeit-coding-assistants-275d4ddd4faa?source=rss----5b301f10ddcd---4</link>
            <guid isPermaLink="false">https://medium.com/p/275d4ddd4faa</guid>
            <category><![CDATA[sparsamkeit]]></category>
            <category><![CDATA[ai-coding-assistant]]></category>
            <category><![CDATA[coding-assistant]]></category>
            <category><![CDATA[budget-management]]></category>
            <category><![CDATA[token]]></category>
            <dc:creator><![CDATA[Nicolas Fränkel]]></dc:creator>
            <pubDate>Sat, 16 May 2026 08:31:33 GMT</pubDate>
            <atom:updated>2026-05-16T08:31:31.050Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*N7rdsV_z0pCi0IKNgHbXYg.jpeg" /></figure><p>You make decisions with data. Most businesses assumed that the more data, the better the decision. Then several factors put a halt to the hoarding of ever more data: GDPR and its localized counterparts, and the cost of storage. However, even before that happened, the Datensparsamkeit approach already existed.</p><blockquote><em>Datensparsamkeit is a German word that’s difficult to translate properly into English. It’s an attitude to how we capture and store data, saying that we should only handle data that we really need.</em></blockquote><blockquote><em>— </em><a href="https://martinfowler.com/bliki/Datensparsamkeit.html"><em>Datensparsamkeit</em></a></blockquote><p>I don’t agree with Martin Fowler’s claim that it’s difficult to translate. The translation of Sparsamkeit is <em>frugality</em>. In the context of coding assistants, token frugality is a good thing.</p><blockquote><em>Today, critical resources aren’t CPU, RAM, or storage, but tokens. Tokens are a finite and expensive resource. My opinion is that soon, developers will be measured on their token usage: the better one will be the one using the fewest tokens to achieve similar results.</em></blockquote><blockquote><em>— </em><a href="https://blog.frankel.ch/writing-agent-skill/"><em>Writing an agent skill</em></a></blockquote><p>Imagine two engineers finishing the same job with the same quality in the same timeframe. If the organization needs to let go of one, it will be the one that costs more. In the era of AI, that means the one who consumes more tokens.</p><p>In this post, I want to show a couple of methods to keep token usage small.</p><h3>Compression</h3><p>One of the first steps toward Tokensparsamkeit is to compress the tokens sent to the underlying LLM <strong>while keeping the same data</strong>. But what are tokens? It’s a gross oversimplification, but for the sake of explanation, let’s consider that a word is a token. Read this <a href="https://letsdatascience.com/blog/tokenization-deep-dive-why-it-matters-more-than-you-think">deep dive</a> if you want more details.</p><p>If we treat tokens as words, we can remove articles and similar words from the payload to decrease the token count. “Find the distance between the Earth and the moon” becomes “Find distance between Earth and moon”. For all intents and purposes, the data received is the same, with fewer words.</p><p>The trick is to set up a proxy between the client and the LLM backend. I’m using rtk myself:</p><blockquote><em>CLI proxy that reduces LLM token consumption by 60–90% on common dev commands. Single Rust binary, zero dependencies</em></blockquote><blockquote><em>— </em><a href="https://github.com/rtk-ai/rtk"><em>rtk project on GitHub</em></a></blockquote><p>The tool works across file commands, git, gh, test runners, build/lint commands, aws, docker, kubectl, etc. Note that it&#39;s not a magical recipe, as rtk itself mentions:</p><blockquote><em>This only applies to Bash tool calls. Claude Code built-in tools such as Read, Grep, and Glob bypass the hook, so use shell commands or explicit rtk commands when you want RTK filtering there.</em></blockquote><h3>Context optimization</h3><p>The second step toward Tokensparsamkeit is to avoid stuffing the context with irrelevant data.</p><p>Most people who start using coding assistants assume the context only consists of the system prompt and user prompts. 
There actually is a lot more. Anthropic’s <a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents">Effective context engineering for AI agents</a> article mentions:</p><ul><li>System prompt</li><li>User prompt</li><li>Message history</li><li>Tool definitions</li><li>Tool results</li><li>MCP servers</li><li>RAG</li><li>Agent memory if applicable</li></ul><p>Claude Code introduced the option to compact (or clear?) the context before each interaction. It explicitly asked with each interaction whether to do it. I liked it, but they removed it a week or so later. Perhaps too many people didn’t understand what it entailed? In any case, make good use of the /compact command that most assistants provide: it summarizes the conversation history to cut its token usage, while trying to keep the relevant bits.</p><p>Also notice that tools and MCP servers use tokens; the more you configure, the more tokens are used. Some MCP servers are so easy to set up, it’s tempting to stuff your assistant with them. Don’t, or at least enable them only on a case-by-case basis or at the project level. Why enable the Vaadin MCP on a Rust project?</p><p>The same goes for tools, although I don’t think many people use them much in comparison to MCP servers.</p><h3>Local models</h3><p>Token usage only matters for cloud-based billing. We don’t care about it if we use a local model. There are several ways to do it, including AI gateways. In the scope of this article, I’ll keep it simple.</p><p>I want to keep Claude Code as the client, because it’s really good. At the same time, I want to use my own hardware with a local model: the cost is upfront, but afterwards the only recurring cost is power.</p><p>If you want to just do it, <a href="https://unsloth.ai/docs/basics/claude-code">How to Run Local LLMs with Claude Code</a> is where I found the solution. Keep reading this section if you want to learn about the issues I faced.</p><p>I initially tried to run Qwen3 32B via Ollama in Docker. Docker containers cannot access Apple’s Metal GPU framework, so the model ran entirely on CPU. It loaded successfully but crashed during inference with a 500 error; CPU-only inference on a 32B model is simply too slow to be usable.</p><p>I had been using Ollama as the default, because others did. Then I stumbled upon <a href="https://sleepingrobots.com/dreams/stop-using-ollama">Friends Don’t Let Friends Use Ollama</a>. I switched from Ollama to llama.cpp, which enabled low-level configuration.</p><p>The biggest hurdle was the context window size. Claude Code sends <strong>lots</strong> of tokens to the backend. On the <a href="https://github.com/nfrankel/opentelemetry-tracing/">OpenTelemetry tracing demo</a>, it’s around 35k on each request.</p><p>I started with Qwen3 models. The default context size wasn’t big enough. When a model receives more tokens than its maximum, llama-server immediately rejects the request. I tried to increase the limit with the --ctx-size option, to no avail. Qwen3 models are trained with a 32,768-token context. It&#39;s a hard limit baked into the GGUF file metadata. llama-server abides by it.</p><p>llama-server is meant to serve multiple requests simultaneously. It turns out that the pool of available tokens is shared equally across all possible requests. If the maximum number of tokens is T and the server can handle x requests in parallel, each request only has T/x tokens available. 
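With --ctx-size 65536 and --parallel 4, for example, each slot would get only 16,384 tokens, not even half of the ~35k tokens a single Claude Code request sends.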
For this reason, I set the parallelism with --parallel 1.</p><p>Despite all of the above, it still didn’t work.</p><h3>Mixture of Experts vs. dense models</h3><p>I was using a <em>dense model</em>, which is what we use regularly. Dense models load everything in memory at once. The alternative is to use a Mixture of Experts model.</p><blockquote><em>In the context of transformer models, a MoE consists of two main elements:</em></blockquote><ul><li><strong>Sparse MoE layers</strong> are used instead of dense feed-forward network (FFN) layers. MoE layers have a certain number of “experts” (e.g. 8), where each expert is a neural network. In practice, the experts are FFNs, but they can also be more complex networks or even a MoE itself, leading to hierarchical MoEs!</li><li>A <strong>gate network or router</strong>, that determines which tokens are sent to which expert. For example, in the image below, the token “More” is sent to the second expert, and the token “Parameters” is sent to the first network. As we’ll explore later, we can send a token to more than one expert. How to route a token to an expert is one of the big decisions when working with MoEs — the router is composed of learned parameters and is pretrained at the same time as the rest of the network.</li></ul><blockquote><em>— </em><a href="https://huggingface.co/blog/moe"><em>What is a Mixture of Experts</em></a></blockquote><p>In layman’s terms, a MoE segments its weights/parameters into separate specialized submodels called experts. A routing layer activates only the necessary experts depending on the request. Compared to a regular dense model, instead of computing across the entire set of weights, only a small subset of experts is activated per request. The combined size of the activated experts is much smaller than the full model, even though the total of all experts together can be larger than a comparable dense model.</p><p>The Qwen3.5-35B-A3B model is a MoE that works perfectly on my machine.</p><h3>Putting it all together</h3><p>We’re still missing a couple of elements to reach the goal.</p><p>To better interact with Claude Code, the model should return structured content. That’s what the --jinja flag is for. For better performance, you should also use <a href="https://bentoml.com/llm/kernel-optimization/flashattention">Flash Attention</a>. It&#39;s an optimized algorithm for computing the attention mechanism in Transformer models. It&#39;s faster, more memory-efficient, and more scalable than standard attention. Activate it via --flash-attn on. The last configuration parameter is to offload as many layers as possible to the GPU with --n-gpu-layers 99.</p><p>The final server command line is:</p><pre>llama-server \<br>  --model ~/models/Qwen3.5-35B-A3B-Q4_K_M.gguf \<br>  --n-gpu-layers 99 \<br>  --ctx-size 65536 \<br>  --parallel 1 \<br>  --flash-attn on \<br>  --jinja \<br>  --port 8080</pre><p>On the Claude Code side, we need to set several environment variables:</p><table><thead><tr><th>Environment variable</th><th>Meaning</th><th>Example</th></tr></thead><tbody><tr><td>ANTHROPIC_BASE_URL</td><td>URL to the llama-server instance</td><td>http://127.0.0.1:8080</td></tr><tr><td>ANTHROPIC_API_KEY</td><td>Anything</td><td>dummy</td></tr><tr><td>ANTHROPIC_AUTH_TOKEN</td><td>Anything</td><td>dummy</td></tr><tr><td>CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC</td><td>Self-explicit</td><td>1</td></tr></tbody></table><pre>export ANTHROPIC_BASE_URL=http://127.0.0.1:8080<br>export ANTHROPIC_API_KEY=dummy<br>export ANTHROPIC_AUTH_TOKEN=dummy<br>export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1<br>claude</pre><p>At this point, you can use Claude Code, which will query your local model. 
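</p><p>Before launching it, a quick sanity check (optional; /health is llama-server&#39;s readiness endpoint) confirms the model finished loading:</p><pre># returns an OK status once the model is fully loaded<br>curl http://127.0.0.1:8080/health</pre><p>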
Here’s a sample server output for a query, for information.</p><pre>srv  params_from_: Chat format: peg-native<br>slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.788 (&gt; 0.100 thold), f_keep = 0.789<br>slot launch_slot_: id  0 | task -1 | sampler chain: logits -&gt; ?penalties -&gt; ?dry -&gt; ?top-n-sigma -&gt; top-k -&gt; ?typical -&gt; top-p -&gt; min-p -&gt; ?xtc -&gt; temp-ext -&gt; dist <br>slot launch_slot_: id  0 | task 1464 | processing task, is_child = 0<br>slot update_slots: id  0 | task 1464 | new prompt, n_ctx_slot = 65536, n_keep = 0, task.n_tokens = 56401<br>slot update_slots: id  0 | task 1464 | n_past = 44456, slot.prompt.tokens.size() = 56378, seq_id = 0, pos_min = 56377, n_swa = 0<br>slot update_slots: id  0 | task 1464 | Checking checkpoint with [56141, 56141] against 44456...<br>slot update_slots: id  0 | task 1464 | Checking checkpoint with [55629, 55629] against 44456...<br>slot update_slots: id  0 | task 1464 | Checking checkpoint with [49151, 49151] against 44456...<br>slot update_slots: id  0 | task 1464 | Checking checkpoint with [40959, 40959] against 44456...<br>slot update_slots: id  0 | task 1464 | restored context checkpoint (pos_min = 40959, pos_max = 40959, n_tokens = 40960, n_past = 40960, size = 62.813 MiB)<br>slot update_slots: id  0 | task 1464 | erased invalidated context checkpoint (pos_min = 49151, pos_max = 49151, n_tokens = 49152, n_swa = 0, pos_next = 40960, size = 62.813 MiB)<br>slot update_slots: id  0 | task 1464 | erased invalidated context checkpoint (pos_min = 55629, pos_max = 55629, n_tokens = 55630, n_swa = 0, pos_next = 40960, size = 62.813 MiB)<br>slot update_slots: id  0 | task 1464 | erased invalidated context checkpoint (pos_min = 56141, pos_max = 56141, n_tokens = 56142, n_swa = 0, pos_next = 40960, size = 62.813 MiB)<br>slot update_slots: id  0 | task 1464 | n_tokens = 40960, memory_seq_rm [40960, end)<br>slot update_slots: id  0 | task 1464 | prompt processing progress, n_tokens = 43008, batch.n_tokens = 2048, progress = 0.762540<br>slot update_slots: id  0 | task 1464 | n_tokens = 43008, memory_seq_rm [43008, end)<br>slot update_slots: id  0 | task 1464 | prompt processing progress, n_tokens = 45056, batch.n_tokens = 2048, progress = 0.798851<br>slot update_slots: id  0 | task 1464 | n_tokens = 45056, memory_seq_rm [45056, end)<br>slot update_slots: id  0 | task 1464 | prompt processing progress, n_tokens = 47104, batch.n_tokens = 2048, progress = 0.835163<br>slot update_slots: id  0 | task 1464 | n_tokens = 47104, memory_seq_rm [47104, end)<br>slot update_slots: id  0 | task 1464 | prompt processing progress, n_tokens = 49152, batch.n_tokens = 2048, progress = 0.871474<br>slot update_slots: id  0 | task 1464 | n_tokens = 49152, memory_seq_rm [49152, end)<br>slot update_slots: id  0 | task 1464 | 8192 tokens since last checkpoint at 40960, creating new checkpoint during processing at position 51200<br>slot update_slots: id  0 | task 1464 | prompt processing progress, n_tokens = 51200, batch.n_tokens = 2048, progress = 0.907785<br>slot update_slots: id  0 | task 1464 | created context checkpoint 6 of 32 (pos_min = 49151, pos_max = 49151, n_tokens = 49152, size = 62.813 MiB)<br>slot update_slots: id  0 | task 1464 | n_tokens = 51200, memory_seq_rm [51200, end)<br>slot update_slots: id  0 | task 1464 | prompt processing progress, n_tokens = 53248, batch.n_tokens = 2048, progress = 0.944097<br>slot update_slots: id  0 | task 1464 | n_tokens = 53248, memory_seq_rm [53248, end)<br>slot update_slots: 
id  0 | task 1464 | prompt processing progress, n_tokens = 55296, batch.n_tokens = 2048, progress = 0.980408<br>slot update_slots: id  0 | task 1464 | n_tokens = 55296, memory_seq_rm [55296, end)<br>slot update_slots: id  0 | task 1464 | prompt processing progress, n_tokens = 55885, batch.n_tokens = 589, progress = 0.990851<br>slot update_slots: id  0 | task 1464 | n_tokens = 55885, memory_seq_rm [55885, end)<br>slot update_slots: id  0 | task 1464 | prompt processing progress, n_tokens = 56397, batch.n_tokens = 512, progress = 0.999929<br>slot update_slots: id  0 | task 1464 | created context checkpoint 7 of 32 (pos_min = 55884, pos_max = 55884, n_tokens = 55885, size = 62.813 MiB)<br>slot update_slots: id  0 | task 1464 | n_tokens = 56397, memory_seq_rm [56397, end)<br>reasoning-budget: activated, budget=2147483647 tokens<br>slot init_sampler: id  0 | task 1464 | init sampler, took 4.37 ms, tokens: text = 56401, total = 56401<br>slot update_slots: id  0 | task 1464 | prompt processing done, n_tokens = 56401, batch.n_tokens = 4<br>slot update_slots: id  0 | task 1464 | created context checkpoint 8 of 32 (pos_min = 56396, pos_max = 56396, n_tokens = 56397, size = 62.813 MiB)<br>srv  log_server_r: done request: POST /v1/messages 127.0.0.1 200<br>reasoning-budget: deactivated (natural end)<br>slot print_timing: id  0 | task 1464 | <br>prompt eval time =   65949.79 ms / 15441 tokens (    4.27 ms per token,   234.13 tokens per second)<br>       eval time =    3639.91 ms /    87 tokens (   41.84 ms per token,    23.90 tokens per second)<br>      total time =   69589.71 ms / 15528 tokens<br>slot      release: id  0 | task 1464 | stop processing: n_tokens = 56487, truncated = 0</pre><h3>Discussion</h3><p>While the underlying model is important, most people undervalue the client. I used both Claude Code and Copilot CLI with the same underlying model, Claude Sonnet 4.6. I found Claude Code superior by far across several sessions.</p><p>The move of most vendors toward subscriptions to benefit from recurring revenues makes sense for them. For the customer, however, it’s another question: once you stop paying, you lose access to the service.</p><p>In the context of coding assistants, vendors justify subscriptions by their cloud usage costs. Unfortunately, the per-token metering is quite opaque. If the vendor doesn’t size their service properly, users get charged more. I don’t think that’s fair.</p><p>Keeping Claude Code while hosting the model locally is a great cost-saving alternative. You only need to pay for the hardware once. Granted, it’s slower, but it’s a business model I prefer. 
If you have well-designed autonomous agents that work reliably, you can run them overnight anyway.</p><p><strong>To go further:</strong></p><ul><li><a href="https://martinfowler.com/bliki/Datensparsamkeit.html">Datensparsamkeit</a></li><li><a href="https://www.tokenoptimize.dev/guides/llm-token-optimization-strategies">LLM Token Optimization Strategies: The Complete Guide for 2026</a></li><li><a href="https://letsdatascience.com/blog/tokenization-deep-dive-why-it-matters-more-than-you-think">Tokenization Deep Dive: Why It Matters More Than You Think</a></li><li><a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents">Effective context engineering for AI agents</a></li><li><a href="https://unsloth.ai/docs/basics/claude-code">How to Run Local LLMs with Claude Code</a></li><li><a href="https://huggingface.co/blog/moe">Mixture of Experts Explained</a></li><li><a href="https://openrouter.ai/qwen/qwen3-coder-next">Qwen3 Coder Next</a></li></ul><p><em>Originally published at </em><a href="https://blog.frankel.ch/tokensparsamkeit-coding-assistants/"><em>A Java Geek</em></a><em> on May 10th, 2026.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=275d4ddd4faa" width="1" height="1" alt=""><hr><p><a href="https://itnext.io/https-blog-frankel-ch-tokensparsamkeit-coding-assistants-275d4ddd4faa">Tokensparsamkeit for coding assistants</a> was originally published in <a href="https://itnext.io">ITNEXT</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[YOLO Is a Terrible Strategy for Validating Production Changes]]></title>
            <link>https://itnext.io/yolo-is-a-terrible-strategy-for-validating-production-changes-a157369a0382?source=rss----5b301f10ddcd---4</link>
            <guid isPermaLink="false">https://medium.com/p/a157369a0382</guid>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[coding]]></category>
            <category><![CDATA[programming]]></category>
            <dc:creator><![CDATA[Benjamin Cane]]></dc:creator>
            <pubDate>Fri, 15 May 2026 22:19:16 GMT</pubDate>
            <atom:updated>2026-05-15T22:06:58.775Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*OESeq5QAldRa4xuf" /><figcaption>Photo by <a href="https://unsplash.com/@bijesh33?utm_source=medium&amp;utm_medium=referral">bijesh regmi</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>YOLO is a terrible strategy for validating production changes.</p><p>How many times have you seen it?</p><p>Your platform is running smoothly. No alerts, no issues. Then suddenly, something breaks.</p><p>After digging in, you discover the cause: another system you depend on made a change, and that change broke your platform.</p><p>They didn’t notice it broke. You did, much too late…</p><p>How many times have you been the cause of another platform breaking?</p><h3>🥶 Cold Reality</h3><p>I wish the above scenario were rare, but it happens constantly across the technology industry.</p><p>It happens between internal teams, third-party integrations, and shared infrastructure teams.</p><p>These scenarios make you wonder, “How was that change validated?”</p><p>Maybe they tested it, and their validation had gaps. Maybe they did little validation at all. If any.</p><p>Either way, the result is the same: <strong>they validated their change with 100% of production traffic.</strong> Bad plan.</p><h3>💡 Better Ways to Validate Changes</h3><p>There are many ways teams can reduce production risk when rolling out changes, and the best teams combine the following approaches.</p><h3>Canary Releases 🐤</h3><p>I talk about canary deployments often.</p><p>Instead of moving 100% of traffic at once, move small percentages gradually and observe behavior closely.</p><p><strong>That “observe closely” part matters.</strong> Look at error rates, latency changes (beyond normal platform warmup), resource spikes, and unexpected retries. All of these indicate customer impact.</p><p>Canary deployments are one of the best ways to reduce the blast radius of changes, identify problems quickly, and self-correct.</p><h3>Shadow Traffic 🪞</h3><p>Traffic mirroring sends a copy of production traffic to the new version before any live traffic is routed there.</p><p>The mirrored responses are discarded, but you can observe behavior and monitor the same signals you would with a canary release, without sacrificing a single customer request.</p><h3>Synthetic Traffic 🤖</h3><p>Synthetic traffic simulates user behavior continuously. It’s best known for monitoring customer experience, but it’s also an effective way to validate new deployments.</p><p>Route synthetic traffic to upgraded instances first and verify behavior before moving real traffic. If it fails with synthetic traffic, it likely won’t survive real traffic.</p><h3>Smoke Tests 😶‍🌫️</h3><p>The classic approach. After deployment, run a small set of fast tests to confirm the platform is fundamentally working.</p><p>Smoke tests don’t need to be fancy; they can be shell scripts, API calls, read-only requests, a test file, or full end-to-end validation.</p><p>Their purpose is simple: to quickly catch obvious breakage. A minimal sketch follows below.</p><h3>🧠 Final Thoughts</h3><p>Don’t think of the above methods as mutually exclusive choices. Combine them.</p><p>Some platforms I work on combine canary releases, shadow traffic, and synthetic traffic. Others use smoke tests plus canary releases.</p><p>The more layers of validation you have, the more likely you are to catch issues before your customers do. 
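</p><p>To make the smoke-test idea concrete, here’s a minimal sketch in TypeScript. The endpoints, expected status codes, and BASE_URL are placeholders; adapt them to whatever “fundamentally working” means on your platform:</p><pre>// smoke-test.ts: run immediately after a deploy; exit non-zero on breakage.<br>const BASE = process.env.BASE_URL ?? &quot;https://staging.example.com&quot;;<br><br>const checks: Array&lt;[string, number]&gt; = [<br>  [&quot;/health&quot;, 200],             // basic liveness<br>  [&quot;/api/orders?limit=1&quot;, 200], // one read-only business endpoint<br>];<br><br>let failed = false;<br>for (const [path, expected] of checks) {<br>  const res = await fetch(BASE + path);<br>  if (res.status !== expected) {<br>    console.error(`FAIL ${path}: got ${res.status}, expected ${expected}`);<br>    failed = true;<br>  } else {<br>    console.log(`OK   ${path}`);<br>  }<br>}<br>process.exit(failed ? 1 : 0);</pre><p>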
Because having your customers validate changes for you is a poor strategy.</p><p><em>Originally published at </em><a href="https://bencane.com/posts/2026-05-07/"><em>https://bencane.com</em></a><em> on May 7, 2026.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a157369a0382" width="1" height="1" alt=""><hr><p><a href="https://itnext.io/yolo-is-a-terrible-strategy-for-validating-production-changes-a157369a0382">YOLO Is a Terrible Strategy for Validating Production Changes</a> was originally published in <a href="https://itnext.io">ITNEXT</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Terminal Driven Project Scaffolding with Kikplate]]></title>
            <link>https://itnext.io/terminal-driven-project-scaffolding-with-kikplate-695735d0604b?source=rss----5b301f10ddcd---4</link>
            <guid isPermaLink="false">https://medium.com/p/695735d0604b</guid>
            <category><![CDATA[terminal-commands]]></category>
            <category><![CDATA[javascript]]></category>
            <category><![CDATA[programming]]></category>
            <category><![CDATA[java]]></category>
            <category><![CDATA[boilerplate]]></category>
            <dc:creator><![CDATA[Moeid Heidari]]></dc:creator>
            <pubDate>Fri, 15 May 2026 20:49:06 GMT</pubDate>
            <atom:updated>2026-05-15T20:49:05.160Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fIGieOUx61zlU8L8ChgYoA.png" /></figure><p>Kikplate CLI is built for developers who want to discover, install, and manage project templates directly from the terminal. Instead of browsing repositories manually or copying starter projects between machines, Kikplate gives you a workflow that stays entirely inside the command line.</p><p>The CLI connects to a Kikplate server, lets you search for plates, clone them into repositories, manage local templates, and authenticate with your account when needed.</p><p>The entire experience is designed around fast terminal workflows.</p><h3>Getting Started</h3><p>Once the CLI is installed, you can view all available commands.</p><pre>kikplate</pre><p>The CLI includes commands for searching, scaffolding, authentication, local management, submissions, and account operations.</p><pre>Usage:<br>  kikplate [command]<br><br>Available Commands:<br>  completion<br>  config<br>  describe<br>  login<br>  logout<br>  my<br>  plates<br>  scaffold<br>  search<br>  submit<br>  verify<br>  whoami</pre><p>Before using the CLI, you can initialize the default configuration.</p><pre>kikplate config init</pre><p>By default, the CLI stores its configuration here.</p><pre>~/.kikplate/config.yaml</pre><p>You can also provide a custom configuration path.</p><pre>kikplate --config ./config.yaml</pre><h3>Searching Plates</h3><p>One of the most useful parts of the CLI is plate discovery.</p><p>You can search by name.</p><pre>kikplate search --name node</pre><p>Example output:</p><pre>┌─────────────────────┬─────────────────────┬──────────┬────────┬──────────┐<br>│ SLUG                │ NAME                │ CATEGORY │ RATING │ VERIFIED │<br>├─────────────────────┼─────────────────────┼──────────┼────────┼──────────┤<br>│ test-node-js-boiler │ Test Node JS Boiler │ other    │ 4.5    │ yes      │<br>└─────────────────────┴─────────────────────┴──────────┴────────┴──────────┘</pre><p>Search is not limited to names. You can filter using categories, tags, limits, and pagination.</p><pre>kikplate search --help<br>Search plates on the server<br><br>Usage:<br>  kikplate search [flags]<br><br>Flags:<br>      --category string   Filter by category<br>      --limit int         Results per page (default 20)<br>      --name string       Search by name<br>      --page int          Page number (default 1)<br>      --tag string        Filter by tag</pre><p>Searching by tags makes it easy to find templates built around specific architectures or technologies.</p><pre>kikplate search --tag DDD<br><br>┌──────────────────────────┬──────────────────────────┬──────────┬────────┬──────────┐<br>│ SLUG                     │ NAME                     │ CATEGORY │ RATING │ VERIFIED │<br>├──────────────────────────┼──────────────────────────┼──────────┼────────┼──────────┤<br>│ react-clean-architecture │ React Clean Architecture │ other    │ 5.0    │ yes      │<br>└──────────────────────────┴──────────────────────────┴──────────┴────────┴──────────┘<br><br>Total: 1</pre><p>The CLI output is intentionally compact and readable. 
You can quickly scan results without opening a browser.</p><h3>Managing Local Plates</h3><p>Kikplate also keeps track of plates available on your machine.</p><p>To view local plates:</p><pre>kikplate plates list</pre><p>Example output:</p><pre>┌───────────────────────────────┬───────────────────────────────┬────────────────────────────────────────────────────┬──────────────────────────┐<br>│ SLUG                          │ NAME                          │ DESCRIPTION                                        │ SERVER                   │<br>├───────────────────────────────┼───────────────────────────────┼────────────────────────────────────────────────────┼──────────────────────────┤<br>│ go-clean-architecture-starter │ Go Clean Architecture Starter │ HTTP API starter with Chi, PostgreSQL, layered ... │ https://kikplate.dev/api │<br>│ rust-axum-api-starter         │ Rust Axum API Starter         │ HTTP API starter with Axum, Tokio, SQLx, Postgr... │ https://kikplate.dev/api │<br>└───────────────────────────────┴───────────────────────────────┴────────────────────────────────────────────────────┴──────────────────────────┘</pre><p>This makes the CLI useful even when you already know which templates you use frequently.</p><h3>Scaffolding Projects</h3><p>The scaffolding command is where Kikplate becomes part of the actual development workflow.</p><p>You can scaffold a plate directly into a repository.</p><pre>kikplate scaffold rust-axum-api-starter https://github.com/MoeidHeidari/test-kikplate-clone.git/</pre><p>This command clones the target repository and applies the selected plate automatically.</p><p>Example repository created with Kikplate:</p><p>https://github.com/MoeidHeidari/test-kikplate-clone</p><p>You can also scaffold locally using the --local flag.</p><pre>kikplate scaffold rust-axum-api-starter --local</pre><p>This workflow is especially useful when bootstrapping backend services, frontend applications, APIs, or internal tools.</p><h3>Authentication</h3><p>Kikplate supports SSO authentication providers directly from the terminal.</p><p>Log in with GitHub, Google, or GitLab:</p><pre>kikplate login sso github<br>kikplate login sso google<br>kikplate login sso gitlab</pre><p>To remove the active session:</p><pre>kikplate logout</pre><p>After authentication you can verify the current account.</p><pre>kikplate whoami</pre><p>Example output:</p><pre>Username:  moeidheidari<br>Name:      Moeid Heidari<br>Email:     moeidheidari73@gmail.com<br>Account:   405bed9f-ab1e-4bbf-a5d1-d5bd27462592</pre><p>The authentication flow feels native to the terminal and avoids unnecessary setup.</p><h3>Plate Details</h3><p>You can inspect a specific plate using the describe command.</p><pre>kikplate describe rust-axum-api-starter</pre><p>This is useful when checking metadata, tags, descriptions, repository information, and verification status before scaffolding.</p><h3>Submitting Plates</h3><p>Kikplate is not only for consuming templates. 
It also allows developers to publish and share their own plates.</p><p>To submit a repository as a plate:</p><pre>kikplate submit</pre><p>To verify a submitted plate:</p><pre>kikplate verify</pre><p>This creates a workflow where teams can maintain internal starter templates while also exposing public templates to the community.</p><h3>Personal Workspace Commands</h3><p>The CLI also includes commands focused on the authenticated user.</p><pre>kikplate my</pre><p>This command can be used to access personal plates, organizations, and bookmarks.</p><h3>Why the CLI Matters</h3><p>Most template systems stop at downloading repositories. Kikplate goes further by turning templates into a complete terminal workflow.</p><ul><li>You can search templates from the command line.</li><li>You can scaffold projects directly into repositories.</li><li>You can authenticate using SSO.</li><li>You can manage local plates.</li><li>You can publish and verify templates.</li></ul><p>All of this happens without leaving the terminal. That is what makes the CLI feel practical in daily development.</p><h3>CLI Documentation</h3><p>Full CLI documentation is available here: <a href="https://github.com/kikplate/kikplate/blob/main/docs/cli.md">https://github.com/kikplate/kikplate/blob/main/docs/cli.md</a></p><h3>Server Configuration</h3><p>By default, the Kikplate CLI connects to the official Kikplate server at <a href="https://kikplate.dev/api">https://kikplate.dev/api</a>.</p><p>This means all search, authentication, and plate operations are executed against the hosted Kikplate platform without any additional setup.</p><p>If you are running your own Kikplate instance, you can point the CLI to a different API server. This allows teams and organizations to host private plate registries while still using the same CLI workflow.</p><p>The server address can be changed through the CLI configuration or by editing the configuration file directly at ~/.kikplate/config.yaml.</p><p>Once configured, all CLI commands such as search, scaffold, and plates will use your custom server instead of the default one.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=695735d0604b" width="1" height="1" alt=""><hr><p><a href="https://itnext.io/terminal-driven-project-scaffolding-with-kikplate-695735d0604b">Terminal Driven Project Scaffolding with Kikplate</a> was originally published in <a href="https://itnext.io">ITNEXT</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Bypassing User Isolation on Android with a Screen Reader]]></title>
            <link>https://itnext.io/bypassing-user-isolation-on-android-with-a-screen-reader-8784558d7b66?source=rss----5b301f10ddcd---4</link>
            <guid isPermaLink="false">https://medium.com/p/8784558d7b66</guid>
            <category><![CDATA[talkback]]></category>
            <category><![CDATA[accessibility]]></category>
            <category><![CDATA[android]]></category>
            <category><![CDATA[cybersecurity]]></category>
            <dc:creator><![CDATA[Karol Wrótniak]]></dc:creator>
            <pubDate>Fri, 15 May 2026 11:08:31 GMT</pubDate>
            <atom:updated>2026-05-15T11:08:30.450Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*MhKxjuRD7Dz2dpY4.png" /></figure><p>A single missing check in Android lets one user’s screen reader leak another user’s private notifications. Here’s how it happened.</p><h3>Multi-user &amp; accessibility on Android</h3><p>Android’s <a href="https://source.android.com/docs/devices/admin/multi-user">multi-user support</a> lets several people share one device. Each user gets their own space, apps, and data. This feature is common on tablets. But not all smartphones have it. Even so, the code is there. The problem is that <a href="https://developer.android.com/reference/android/accessibilityservice/AccessibilityService">accessibility services</a> run with high privileges. They need to see everything to help users. Sometimes, this power breaks the walls between users.</p><h3>Screen readers &amp; TalkBack</h3><p><a href="https://en.wikipedia.org/wiki/Screen_reader">Screen readers</a> turn text into speech. They allow people who are blind or have low vision to use apps. The screen may even be completely off, but the user can still interact with the device. <a href="https://play.google.com/store/apps/details?id=com.google.android.marvin.talkback"><strong>TalkBack</strong></a> is Google’s screen reader for Android. Normally, TalkBack only reads the currently focused UI elements. But there are ways to make it speak programmatically. One is <a href="https://developer.android.com/reference/android/view/View#announceForAccessibility(java.lang.CharSequence)">announceForAccessibility()</a> (now deprecated) - a method that forces the screen reader to read arbitrary text. Another is <a href="https://appt.org/en/docs/android/samples/accessibility-live-region">live regions</a> - parts of the UI that update without user interaction. When something changes, the system fires an <a href="https://developer.android.com/reference/android/view/accessibility/AccessibilityEvent">accessibility event</a> (a system-level broadcast) that carries the updated text. A screen reader picks it up and reads the new value aloud. Status bar notifications are one example of live regions.</p><h3>The bug: CVE-2022-20448</h3><p>The bug was simple: NotificationManagerService didn&#39;t check if a notification belonged to the current foreground user before dispatching the <a href="https://www.thedroidsonroids.com/blog/what-is-accessibility-in-mobile-apps">accessibility</a> event. This is what caused screen readers to read it out loud. Imagine a phone with two users: <strong>Alice</strong> (using the phone right now) and <strong>Bob</strong> (a background user).</p><ul><li>Bob receives a text message: <em>“Your verification code is 3291”</em>.</li><li>The system posts the notification and fires an <a href="https://www.thedroidsonroids.com/blog/provide-accessibility-in-mobile-app-guide">accessibility</a> event containing that text.</li><li>TalkBack on Alice’s active session picks up the event and reads it aloud.</li><li>Alice hears Bob’s private 2FA code.</li></ul><p>Screen readers weren’t the only apps that could intercept this data. <a href="https://www.thedroidsonroids.com/blog/mobile-app-accessibility-android-guide">Android dispatches accessibility events</a> to <strong>all</strong> registered accessibility services — not just TalkBack. 
Apps like <a href="https://play.google.com/store/apps/details?id=net.dinglisch.android.taskerm">Tasker</a>, which <a href="https://tasker.joaoapps.com/userguide/en/faqs/faq-problem.html">registers as an accessibility service</a> for UI automation, or notification-logging apps would also receive Bob’s notification content.</p><h3>The fix</h3><p>The entire <a href="https://android.googlesource.com/platform/frameworks/base/+/7b9ea7a75ed2de51e883f450b701c8d0d82e6e9c%5E%21/#F0">fix</a> was a single added condition — checking whether the notification actually belongs to the current user — plus a unit test to prevent regression:</p><pre>// frameworks/base/services/core/java/com/android/server/notification/NotificationManagerService.java<br><br>-                &amp;&amp; !suppressedByDnd) {<br>+                &amp;&amp; !suppressedByDnd<br>+                &amp;&amp; isNotificationForCurrentUser(record)) {</pre><p>isNotificationForCurrentUser() returns true only when the notification&#39;s owner matches the foreground user - so background users&#39; notifications are no longer broadcast as <a href="https://www.thedroidsonroids.com/blog/accessibility-standards-mobile-apps">accessibility</a> events.</p><p>I reported this issue on June 29, 2022. Google awarded a $5,000 bounty for the finding. They marked the bug as High severity in the <a href="https://source.android.com/docs/security/bulletin/2022-11-01">November 2022 Android Security Bulletin</a> and released patches for Android 10, 11, 12, 12L, and 13. The vulnerability is tracked as <a href="https://nvd.nist.gov/vuln/detail/CVE-2022-20448">CVE-2022-20448</a>.</p><h3>Takeaway</h3><p>It really makes you wonder just how many security bugs are hiding behind assistive technologies.</p><p><em>Originally published at </em><a href="https://www.thedroidsonroids.com/blog/bypassing-user-isolation-on-android-with-a-screen-reader"><em>https://www.thedroidsonroids.com</em></a><em> on May 11, 2026.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8784558d7b66" width="1" height="1" alt=""><hr><p><a href="https://itnext.io/bypassing-user-isolation-on-android-with-a-screen-reader-8784558d7b66">Bypassing User Isolation on Android with a Screen Reader</a> was originally published in <a href="https://itnext.io">ITNEXT</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Is Step Functions Still Necessary? The Case for Lambda Durable Functions in 2026]]></title>
            <description><![CDATA[<div class="medium-feed-item"><p class="medium-feed-image"><a href="https://itnext.io/is-step-functions-still-necessary-the-case-for-lambda-durable-functions-in-2026-22f5a5f4a1a3?source=rss----5b301f10ddcd---4"><img src="https://cdn-images-1.medium.com/max/1376/1*M0tX8bVBD54obR7HRDCvuw.png" width="1376"></a></p><p class="medium-feed-snippet">AWS just erased the &#x201C;15-minute wall.&#x201D; And it may completely change how TypeScript developers build workflows.</p><p class="medium-feed-link"><a href="https://itnext.io/is-step-functions-still-necessary-the-case-for-lambda-durable-functions-in-2026-22f5a5f4a1a3?source=rss----5b301f10ddcd---4">Continue reading on ITNEXT »</a></p></div>]]></description>
            <link>https://itnext.io/is-step-functions-still-necessary-the-case-for-lambda-durable-functions-in-2026-22f5a5f4a1a3?source=rss----5b301f10ddcd---4</link>
            <guid isPermaLink="false">https://medium.com/p/22f5a5f4a1a3</guid>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[lambda-durable-functions]]></category>
            <category><![CDATA[workflow]]></category>
            <category><![CDATA[aws-lambda]]></category>
            <category><![CDATA[serverless]]></category>
            <dc:creator><![CDATA[Hoang Dinh]]></dc:creator>
            <pubDate>Fri, 15 May 2026 07:26:11 GMT</pubDate>
            <atom:updated>2026-05-15T07:26:10.103Z</atom:updated>
        </item>
        <item>
            <title><![CDATA[From Prototype to Production — Developer Abstractions that Accelerate (Part 7)]]></title>
            <link>https://itnext.io/from-prototype-to-production-developer-abstractions-that-accelerate-part-7-548fed473201?source=rss----5b301f10ddcd---4</link>
            <guid isPermaLink="false">https://medium.com/p/548fed473201</guid>
            <category><![CDATA[aiops]]></category>
            <category><![CDATA[control-plane]]></category>
            <category><![CDATA[kubernetes]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[ai-control-plane]]></category>
            <dc:creator><![CDATA[Santosh Pai]]></dc:creator>
            <pubDate>Fri, 15 May 2026 07:24:39 GMT</pubDate>
            <atom:updated>2026-05-15T07:24:38.335Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DgLUHwsMEBVL735axJnpaQ.png" /></figure><p>By this stage, the system has all the necessary layers in place. Requests are <strong>validated</strong> before they leave, <strong>routed intelligently,</strong> executed within defined <strong>cost</strong> boundaries, adapted across <strong>environments</strong>, and made fully <strong>observable</strong>. Each of these capabilities addresses a specific aspect of running AI systems in production, but together they introduce a level of complexity that can be difficult for teams to work with directly.</p><h3>Why do developers need abstractions?</h3><p>The challenge shifts from building the system to making it usable. Without clear abstractions, developers are required to understand and manage multiple concerns simultaneously — policies, routing logic, cost constraints, and environment configurations. This often leads to <strong>duplication of effort</strong>, <strong>inconsistencies</strong> across services, and an <strong>increasing reliance on implicit knowledge</strong> rather than shared structure. The result is <strong>cognitive overload</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*qpCXw7oiqhRrrwn1.png" /></figure><p>Abstractions address this by introducing <strong>a unified interface</strong> through which <strong>developers interact</strong> with the system. Instead of handling each layer independently, requests flow through a single entry point that applies guardrails, selects models, enforces limits, and captures observability by default. This <strong>simplifies integration while preserving control</strong>, allowing teams to focus on building features rather than orchestrating infrastructure.</p><p>Over time, these abstractions extend beyond APIs into structured configurations, reusable templates, and shared workflows. Policies become explicit and versioned, routing strategies can be defined once and applied consistently, and common patterns are reused across teams. This reduces fragmentation and makes system behaviour predictable, even as complexity increases.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*KggYcwB6tUYBn-LA.png" /></figure><p>What emerges is a set of tools and a platform that balances flexibility with consistency. Developers can move quickly without bypassing safeguards, and teams can scale systems without introducing instability.</p><h3>The Need for an AI Control Plane</h3><p>The control plane enforces rules, enabling a way of building that remains reliable as systems evolve.</p><p>Looking across the layers — <strong>guardrails</strong>, <strong>routing</strong>, <strong>cost</strong> control, <strong>environment</strong> awareness, and <strong>observability</strong> — it becomes clear that the real value lies in how these capabilities are brought together and made accessible. This is what allows AI systems to transition from isolated prototypes to structured, production-ready systems that can be trusted to operate at scale.</p><h3>Shared Platform Capabilities</h3><p>Over time, the control plane becomes more than a gateway. It evolves into a shared platform capability used across teams and products.</p><p>New applications inherit observability automatically. Environment-aware policies are applied consistently. Routing strategies become reusable. Audit trails exist by default. 
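</p><p>From a product team’s perspective, the entire integration surface can shrink to a single call. The sketch below is illustrative only; the client shape, field names, and policy identifiers are assumptions, not an actual platform API:</p><pre>// Hypothetical control-plane client in TypeScript; every name is illustrative.<br>interface CompleteRequest {<br>  task: string;                // routing class; the platform picks the model<br>  input: string;<br>  budget: { maxUsd: number };  // cost boundary enforced centrally<br>  policies: string[];          // guardrails applied before dispatch<br>}<br><br>declare const controlPlane: {<br>  complete(req: CompleteRequest): Promise&lt;string&gt;;<br>};<br><br>// A product team&#39;s whole integration:<br>const summary = await controlPlane.complete({<br>  task: &quot;summarize-ticket&quot;,<br>  input: &quot;Customer reports intermittent sync failures since the last release.&quot;,<br>  budget: { maxUsd: 0.05 },<br>  policies: [&quot;pii-redaction&quot;],<br>});<br>// Tracing, audit logging, and environment-aware config are inherited by default.</pre><p>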
Instead of each team solving operational concerns independently, the platform provides these capabilities as standardized building blocks.</p><p>This creates a significant shift in how AI systems are developed. Teams spend less time assembling infrastructure layers and more time focusing on product behaviour, user workflows, and business logic.</p><h3>Reducing Operational Fragmentation</h3><p>One of the less visible challenges in production AI systems is operational fragmentation. Different teams often introduce different SDKs, different routing strategies, different observability tools, and different governance models. Over time, this creates systems that behave inconsistently despite serving similar purposes.</p><p><strong>Developer abstractions reduce this fragmentation</strong> by creating a common operational language. Requests follow similar patterns regardless of the application. Policies are enforced consistently. Observability data becomes comparable across teams and environments.</p><p>Consistency at this layer is what allows organizations to scale AI adoption without losing operational control.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*f9GkSka5xtV-vKD6.png" /></figure><h3>The Shift from Integrations to Systems</h3><p>Early AI adoption focused on integrating models into applications. Production AI shifts the focus toward operating systems of behaviour — where governance, routing, observability, and cost management become shared infrastructure concerns rather than isolated implementation details.</p><h3>Series Summary</h3><p>This series explored the operational layers required to <strong>move AI systems from isolated prototypes to reliable production infrastructure</strong>.</p><p>We examined how <strong>guardrails</strong> define what is allowed, how <strong>routing</strong> determines where requests should go, how <strong>cost control</strong> keeps systems sustainable at scale, how <strong>environment-aware behaviour</strong> introduces operational context, and how <strong>observability and auditability</strong> make AI systems understandable over time.</p><p>The final layer focused on <strong>developer abstractions</strong> and the emergence of the <strong>AI Control Plane</strong> — a shared operational layer that brings governance, routing, cost management, environment policies, and observability together into a consistent system that teams can build on reliably.</p><p>Together, these layers represent a shift from simply integrating models into applications toward <strong>building structured, governable, and scalable AI systems</strong>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=548fed473201" width="1" height="1" alt=""><hr><p><a href="https://itnext.io/from-prototype-to-production-developer-abstractions-that-accelerate-part-7-548fed473201">From Prototype to Production — Developer Abstractions that Accelerate (Part 7)</a> was originally published in <a href="https://itnext.io">ITNEXT</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Semantic Search in OutSystems Developer Cloud]]></title>
            <link>https://itnext.io/semantic-search-in-outsystems-developer-cloud-e621bc185e42?source=rss----5b301f10ddcd---4</link>
            <guid isPermaLink="false">https://medium.com/p/e621bc185e42</guid>
            <category><![CDATA[low-code]]></category>
            <category><![CDATA[retrieval-augmented-gen]]></category>
            <category><![CDATA[semantic-search]]></category>
            <category><![CDATA[outsystems]]></category>
            <dc:creator><![CDATA[Stefan Weber]]></dc:creator>
            <pubDate>Thu, 14 May 2026 14:37:32 GMT</pubDate>
            <atom:updated>2026-05-15T11:04:25.711Z</atom:updated>
            <cc:license>http://creativecommons.org/licenses/by/4.0/</cc:license>
            <content:encoded><![CDATA[<p>ODC now lets you add semantic search to your entities with a few clicks. But if you stop there, your retrieval quality will likely disappoint you. Here’s why — and what to do about it.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_7jADTi_ZpFF_7-rwr4PVQ.jpeg" /></figure><p>OutSystems Developer Cloud (ODC) has a built-in semantic search mechanism that works directly on top of your entities. You select which entities and text attributes should be searchable, and ODC takes care of the rest. No third-party search service is required.</p><p>That convenience comes with trade-offs. In this article, I’ll explain how semantic search works, walk you through what ODC supports and where it falls short, and then show how you can dramatically improve retrieval quality.</p><h3>Why Semantic Search Matters</h3><p>Instead of matching keywords, semantic search captures the meaning behind a query and returns results based on conceptual relevance. This is essential for chatbots, recommendation engines, and any scenario where users express their intent in natural language.</p><h3>How Semantic Search Works</h3><p>Traditional keyword search matches exact words. If you search for “restart device”, it finds documents containing those words. Semantic search works differently — it converts text into numerical vectors (embeddings) that represent meaning, and then finds other vectors that are close in that meaning space.</p><p>This is powerful. A search for “How do I restart the device?” will match a document titled “Reset procedure” even though the two share no words. The embedding model understands that “restart” and “reset” express a similar intent.</p><p>But this is also where things get tricky. Let me show you some examples.</p><h3>When Similarity Works Well</h3><p>These pairs are semantically different in wording but close in meaning — exactly what you want semantic search to find:</p><pre>Query<br>&quot;How do I restart the device?&quot;<br><br>Matches With<br>&quot;Reset procedure for your equipment&quot;<br><br>Why it Works<br>Different words, same intent</pre><pre>Query<br>&quot;The screen is black&quot;<br><br>Matches With<br>&quot;Display not showing any output&quot;<br><br>Why it Works<br>Symptom described differently</pre><pre>Query<br>&quot;Cancel my subscription&quot;<br><br>Matches With<br>&quot;How to end my membership&quot;<br><br>Why it Works<br>Synonyms and paraphrases</pre><h3>When Similarity Misleads</h3><p>Here’s where it gets dangerous. These pairs have high cosine similarity — they look close in vector space — but their meaning is fundamentally different:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/9a56b3490fb3349c33509f0414ea94fd/href">https://medium.com/media/9a56b3490fb3349c33509f0414ea94fd/href</a></iframe><p>These examples illustrate a fundamental limitation of dense vector embeddings. The models are trained on natural language patterns, and they’re excellent at capturing general meaning. But they compress fine-grained differences — negations, specific numbers, error codes, version identifiers — into nearly identical regions of the vector space.</p><blockquote><strong><em>Why this matters:</em></strong><em> This is one of the main reasons why production RAG systems use hybrid search (dense + sparse vectors). Sparse vectors excel at exact term matching and would easily distinguish “Error 503” from “Error 404”. 
ODC doesn’t support hybrid search today, which makes the optimization techniques later in this article even more important.</em></blockquote><h3>Cosine Similarity in a Nutshell</h3><p>When the system compares two embeddings, it uses cosine similarity — a score between -1 and 1, where 1 means identical direction in vector space. In practice, similarity scores for related text pairs typically fall between 0.3 and 0.95, depending on the embedding model and content type. A difference of 0.02 in similarity score can be the difference between a relevant and an irrelevant result, yet the misleading pairs above often score within that margin of the correct result.</p><p>This is why retrieval alone isn’t enough. You need additional mechanisms — query rewriting and reranking — to catch what vector similarity misses. We’ll get to those later.</p><h3>How ODC Implements Semantic Search</h3><p>ODC’s semantic search is tightly integrated with its entity model. Here are the key components:</p><ul><li><strong>Index Ingestion</strong> — You configure one or more text attributes of an entity for indexing. ODC chunks the content, generates embeddings via an embedding model, and stores them in PostgreSQL using the pgvector extension.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/925/0*luzXeIexN4GnDenT.png" /></figure><ul><li><strong>Search Index</strong> — The vector database holds the chunked, embedded data.</li><li><strong>Retrieve</strong> — At query time, the user’s input is embedded and compared against the index using dense vector similarity.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/847/0*nMBYV1UUF6e9id_T.png" /></figure><ul><li><strong>Augment &amp; Generate</strong> — Retrieved chunks can be used in RAG pipelines (e.g., via Agent Workbench or custom applications) to ground LLM responses.</li></ul><p><strong>A note on embedding models:</strong> The quality of your embeddings depends heavily on the underlying model. ODC does not currently disclose which embedding model is used or allow you to bring your own. This limits your ability to evaluate its strengths and weaknesses for your specific domain — particularly for specialized terminology or non-English content.</p><p>At a high level, ODC’s semantic search covers three dimensions: understanding the <strong>intent</strong> behind a query, recognizing <strong>contextual</strong> relationships between words, and grasping <strong>meaning</strong> through synonyms, paraphrases, and linguistic associations. These are the core strengths of any embedding-based retrieval system — and also where its limitations begin, as we’ve seen above.</p><h3>Chunking Strategies in OutSystems</h3><p>Chunking is the process of splitting large documents into smaller, meaningful sections so they can be efficiently embedded, searched, and retrieved. 
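</p><p>To make the mechanics concrete, here is roughly what the simplest of the strategies covered below, fixed-size splitting with overlap, boils down to. This is a minimal illustrative sketch in TypeScript; ODC’s internal chunker is not exposed, so the parameters are assumptions:</p><pre>// Fixed-size chunking with overlap; illustrative only.<br>function chunkFixed(text: string, size = 500, overlap = 50): string[] {<br>  if (overlap &gt;= size) throw new Error(&quot;overlap must be smaller than size&quot;);<br>  const chunks: string[] = [];<br>  for (let start = 0; start &lt; text.length; start += size - overlap) {<br>    chunks.push(text.slice(start, start + size));<br>  }<br>  return chunks;<br>}<br>// A 1,200-character record with size=500/overlap=50 yields chunks starting at<br>// 0, 450, and 900. Nothing stops a cut from landing mid-word or mid-sentence.</pre><p>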
Without chunking, you’d embed entire records as a single vector — burying the meaning of individual sections in noise.</p><p>Good chunking ensures that retrieval is accurate and focused, irrelevant text doesn’t overwhelm the model, hallucinations are reduced, and the system can handle large, mixed-topic content.</p><p>Instead of sending a 200-page product manual to the LLM, chunking retrieves only the section about “Resetting device settings” — so the model gives a precise answer, not noise from the entire document.</p><p>Let’s look at the four chunking strategies available in ODC.</p><h3>Fixed-Size Chunking</h3><p>This one splits text into equally sized pieces based on a maximum character count, with a configurable overlap between chunks.</p><p>It’s very simple and extremely fast, and it works even with unstructured or messy documents. The problem is that it splits sentences and concepts mid-word or mid-thought. A sentence like <em>“hold the On/Off button for 10 seconds until the Bosch logo appears”</em> can be cut right in the middle. This produces meaningless or noisy chunks that result in very poor semantic retrieval quality.</p><blockquote><strong><em>Best practice:</em></strong><em> Fixed-size chunking is the weakest strategy for semantic search. It’s acceptable only as a baseline or when content is completely unstructured and no better option is feasible. For anything with natural sentence or paragraph boundaries, avoid it.</em></blockquote><h3>Sentence-Based Chunking</h3><p>You define how many sentences a chunk may contain, along with a maximum character count and overlap. This respects natural sentence boundaries and produces more coherent chunks than fixed-size splitting.</p><p>There are some things to be aware of though. Sentence detection depends on language. ODC has dictionaries for 15 languages (English, German, French, etc.) that handle nuances like acronyms and abbreviations. For unsupported languages, only punctuation is used — leading to incorrect sentence splits. If the system doesn’t identify the language, it defaults to English.</p><p>More importantly, sentence-based chunking does not respect paragraph or section boundaries. A chunk may contain the last sentence of one topic and the first sentence of the next. Headings, lists, and tables are not treated differently from body text.</p><blockquote><strong><em>Best practice:</em></strong><em> Sentence-based chunking is a step up from fixed-size, but it only works well for homogeneous, well-punctuated prose in supported languages. Structured documents with headings, tables, or mixed content types will suffer.</em></blockquote><h3>Recursive Chunking</h3><p>This approach defines character limits and overlaps while prioritizing a hierarchy of specific characters as delimiters (e.g., headings → paragraphs → sentences). The splitter tries the highest-level separator first and only falls back to smaller ones when chunks exceed the size limit.</p><p>It aligns chunks with natural document structure, which makes it great for manuals, structured PDFs, and HTML content. The downside is that it fails on documents with broken or inconsistent structure. If headings are missing or formatting is irregular, the chunker degrades to something close to fixed-size behavior. It also doesn’t merge semantically related content across sections. 
If “Causes” and “Resolution” live in different sections, they end up in different chunks — even though they belong to the same concept.</p><blockquote><strong><em>Best practice:</em></strong><em> Recursive chunking is the best general-purpose strategy available in ODC. It works well for structured content but cannot handle cross-section reasoning or poorly formatted input.</em></blockquote><h3>Smart Chunking (Default)</h3><p>This is ODC’s default method. It combines recursive chunking with default separators, automatically adapting to the content found in searchable fields. No configuration required — it just works.</p><p>That said, it is still fundamentally recursive chunking with automated separator selection, so it inherits all the same limitations. It has no semantic understanding and cannot detect that “causes”, “symptoms”, and “reset procedure” belong to the same conceptual unit. Since the separators are chosen automatically, it can also be difficult to predict or debug how content is being split.</p><blockquote><strong><em>Best practice:</em></strong><em> Smart chunking is a sensible default, but don’t assume it’s optimal. For critical RAG applications, always evaluate whether recursive chunking with custom separators gives you better results.</em></blockquote><h3>Alternative Chunking Strategies</h3><p>In addition to the built-in chunking strategies, several other approaches have emerged in the RAG space that are deemed more advanced for production systems. Understanding these methods can highlight the limitations of ODC’s built-in options and guide you in implementing your own custom chunking solutions.</p><h3>Semantic Chunking</h3><p>Instead of splitting text by characters, sentences, or structural markers, semantic chunking groups text by meaning. It uses an embedding model to measure how similar consecutive sentences or paragraphs are to each other. When the similarity drops significantly, it creates a chunk boundary.</p><p>This means a troubleshooting article where “symptoms”, “causes”, and “resolution” flow naturally into each other would stay in one chunk — because the meaning is connected. Recursive chunking would split them into separate chunks based on their headings, losing that connection.</p><p>Semantic chunking is especially valuable for poorly structured documents where headings are missing or inconsistent. It doesn’t rely on formatting at all — only on what the text actually means.</p><p><em>Implementation complexity: Moderate — requires an embedding model call for each sentence pair during ingestion. Can be built in ODC by calling an external embedding API in your ingestion pipeline.</em></p><h3>Adaptive Chunking</h3><p>Adaptive chunking dynamically adjusts the chunk size based on content complexity. Dense, technical paragraphs get smaller chunks so that each embedding captures a focused idea. Simple, straightforward passages get larger chunks to avoid fragmenting content unnecessarily.</p><p>Think of a product manual where one section is a simple feature overview and the next is a detailed troubleshooting flow with multiple conditions. Fixed or recursive chunking would use the same granularity for both. Adaptive chunking would produce larger chunks for the overview and smaller, more focused chunks for the troubleshooting steps.</p><p>This prevents information loss in complex sections and reduces noise in simple ones.</p><p><em>Implementation complexity: Moderate to high — requires heuristics or a model to assess content density and adjust chunk sizes dynamically. 
Can be implemented with rule-based logic or a lightweight LLM call per section.</em></p><h3>Context-Enriched Chunking</h3><p>With all the strategies above, each chunk stands on its own. It has no awareness of what came before or after it. Context-enriched chunking solves this by adding a brief summary of neighboring chunks to each chunk.</p><p>For example, if chunk 3 contains a resolution step, context-enriched chunking would prepend something like: <em>“The previous section described error 503 occurring when the device loses network connectivity during a firmware update.”</em> This way, the embedding of chunk 3 captures not just the resolution itself but also the problem it relates to.</p><p>This is critical for multi-step reasoning, where the answer to a user’s question spans multiple sections. Without context enrichment, the retriever might find the resolution chunk but miss the connection to the specific error that caused it.</p><p><em>Implementation complexity: Moderate — requires a post-processing step after initial chunking that generates summaries of neighboring chunks (typically via an LLM call) and prepends them. Straightforward to implement but adds ingestion latency and cost.</em></p><h3>AI-Driven Chunking</h3><p>AI-driven chunking uses an LLM to read the entire document and decide where the most meaningful breakpoints are. Instead of following rules (split at headings, split every N sentences), the LLM identifies conceptual units the way a human reader would.</p><p>This is the most expensive strategy — it requires an LLM call during ingestion for every document — but it produces the most intuitive, human-like chunks. It’s particularly useful for mixed-source documents where structure, formatting, and content types vary wildly and no single rule-based strategy fits.</p><p><em>Implementation complexity: High — requires a full LLM processing call for every document during ingestion. Significantly increases ingestion time and cost. Best reserved for high-value content where retrieval quality is critical.</em></p><blockquote><strong><em>Key takeaway:</em></strong><em> The absence of these strategies means that out of the box, the quality of your ODC semantic search is limited by how well the four built-in chunking methods fit your content. 
For heterogeneous data — a mix of FAQs, troubleshooting guides, product specs, and legal text — no single built-in method will perform well across all content types.</em></blockquote><h3>Custom Chunks in OutSystems</h3><p>ODC does allow you to disable the built-in chunking on semantic search attributes and implement your own chunking logic.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/925/0*PMSjFZcJiSVHH8sn.png" /></figure><p>This means you can:</p><ul><li>Build a custom chunking pipeline in your application that preprocesses text before it is written to the entity.</li><li>Apply different chunking strategies to different content types — e.g., recursive chunking for structured manuals, sentence-based chunking for FAQs, and a custom semantic grouping for troubleshooting guides.</li><li>Implement any of the advanced strategies listed above by calling an LLM during ingestion to determine optimal chunk boundaries.</li><li>Store the pre-chunked text in your entity attributes so that ODC only handles the embedding and indexing — not the splitting.</li></ul><p>This shifts the chunking responsibility from ODC’s built-in mechanism to your application, giving you full control over chunk quality at the cost of additional development effort and ingestion complexity.</p><blockquote><strong><em>Best practice:</em></strong><em> For production RAG applications with diverse or complex content, seriously consider disabling the default chunkers and implementing a custom ingestion pipeline. The built-in methods are convenient for prototyping and simple use cases, but a tailored chunking strategy will consistently deliver better retrieval quality.</em></blockquote><p>In advanced RAG systems, the highest quality comes from a hybrid ingestion pipeline that applies different chunking methods to different content types. With custom chunking in ODC, this is achievable — it just requires you to build and maintain that pipeline yourself.</p><h3>Dense-Only Vectors: A Significant Limitation</h3><p>Chunking determines what goes into your vectors. But the type of vector itself also has a major impact on retrieval quality.</p><p>ODC semantic search uses only dense vector embeddings — compact numerical vectors that capture semantic meaning.</p><p>Dense vectors are great at understanding paraphrases and synonyms. “How do I restart the device?” matches “Reset procedure” — that kind of thing. They work well for natural language queries.</p><p>Where they struggle is exact term matching. Searching for error code “503” or product name “Kiox 300” may return semantically similar but factually wrong results. Domain-specific terms like “HANA”, “SAP”, or “OML” may not be well represented in the embedding model’s training data. Dense embeddings compress numbers and codes into a meaning space that doesn’t distinguish “Error 503” from “Error 510”.</p><p>In production RAG systems, the best practice is Hybrid Search — combining dense vector search (semantic similarity) with sparse vector search (keyword and exact-match relevance). A typical starting point uses a weighted formula:</p><pre>Final Score = 0.7 × Dense Score + 0.3 × Sparse Score</pre><p>Note that this weighting is use-case-dependent and should be tuned for your specific data and query patterns. The 0.7/0.3 split is a commonly cited baseline, not a universal rule.</p><p>This ensures that both semantic relevance and exact matching influence the ranking. 
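</p><p>As a tiny sketch of that merge over two candidates (TypeScript, with scores invented for illustration; since ODC has no sparse side today, those scores would have to come from an index you run yourself):</p><pre>// Weighted hybrid scoring, per the formula above; all numbers are made up.<br>const DENSE_W = 0.7;<br>const SPARSE_W = 0.3;<br><br>const candidates = [<br>  { id: &quot;kiox-300-error-503-fix&quot;, dense: 0.82, sparse: 0.91 },<br>  { id: &quot;generic-error-404-guide&quot;, dense: 0.84, sparse: 0.12 }, // look-alike, wrong code<br>];<br><br>const ranked = candidates<br>  .map((c) =&gt; ({ ...c, score: DENSE_W * c.dense + SPARSE_W * c.sparse }))<br>  .sort((a, b) =&gt; b.score - a.score);<br><br>// kiox-300-error-503-fix wins, 0.847 vs 0.624: the exact-term evidence on<br>// &quot;503&quot; and &quot;Kiox 300&quot; outweighs the slightly higher dense similarity.</pre><p>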
A query like <em>“How do I fix error 503 on the Kiox 300?”</em> benefits from dense search understanding the intent (“fixing an issue”, “troubleshooting”) and sparse search matching the exact terms “503” and “Kiox 300”.</p><p>ODC does not support sparse vectors or hybrid search. This means that queries containing specific identifiers, codes, or product names may return less precise results than a hybrid system would.</p><h3>What’s Available and What’s Not</h3><p>Now that you’ve seen how chunking works and where dense-only vectors fall short, here’s a summary of what ODC semantic search supports today:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/ce0585a32257186e4b0b2c2ab9736592/href">https://medium.com/media/ce0585a32257186e4b0b2c2ab9736592/href</a></iframe><h3>Optimizing Retrieval: Query Rewriting and Reranking</h3><p>Given the information above, the question becomes: how can you improve retrieval quality within ODC? The answer lies in two techniques that you can implement in your application logic.</p><h3>Query Rewriting</h3><p>Query rewriting is the process of transforming a user’s original query into a better, clearer, or more complete version before sending it to the retrieval system.</p><p>In RAG, the answer quality heavily depends on what you retrieve. If the query is unclear or incomplete, the retriever may miss relevant documents, retrieve irrelevant content, or show overly broad results. This is especially critical with dense-only search, where the embedding of a vague query lands in a broad, non-specific region of the vector space.</p><p>Consider a chatbot conversation:</p><p><em>User: “How do I reset it?”</em></p><p>The system cannot know what “it” refers to. Sending this raw query to semantic search will produce poor results because the embedding of “How do I reset it?” is far too generic.</p><p>An LLM rewrites the query using the conversation history to produce a self-contained, specific query:</p><p><em>Rewritten Query: “How do I reset the KIOX 300 display when it is stuck in loading state?”</em></p><p>This rewritten query produces a far more precise embedding that lands much closer to the relevant chunks in the vector space.</p><p>Query rewriting can fix ambiguity, expand missing context, add synonyms, convert conversational questions into standalone ones, and turn fragments into full queries.</p><h3>How to Implement in ODC</h3><p>Since ODC doesn’t provide built-in query rewriting, you implement it as a pre-processing step in your application:</p><ol><li>Capture the conversation history.</li><li>Before calling the semantic search action, send the user’s latest message along with the conversation history to an LLM with a prompt like: <em>“Rewrite the following user question as a standalone, self-contained search query. Use the conversation history to resolve any ambiguous references. Return only the rewritten query.”</em></li><li>Use the rewritten query as the input to ODC’s semantic search.</li></ol><p>This is a lightweight LLM call (a few tokens in, a few tokens out) that can dramatically improve retrieval relevance at minimal cost.</p><blockquote><strong><em>Best practice:</em></strong><em> Query rewriting is the single highest-impact, lowest-cost optimization you can make for your RAG pipeline. 
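</em></blockquote><p>In practice, the rewrite step is only a few lines. Here’s a minimal TypeScript sketch; llm() stands in for whatever chat-completion endpoint you call from your application logic and is not an ODC built-in:</p><pre>// Pre-retrieval query rewriting; llm() is a placeholder helper.<br>declare function llm(prompt: string): Promise&lt;string&gt;;<br><br>async function rewriteQuery(history: string[], latest: string): Promise&lt;string&gt; {<br>  const prompt = [<br>    &quot;Rewrite the following user question as a standalone, self-contained&quot;,<br>    &quot;search query. Use the conversation history to resolve any ambiguous&quot;,<br>    &quot;references. Return only the rewritten query.&quot;,<br>    &quot;&quot;,<br>    &quot;History:&quot;,<br>    ...history,<br>    &quot;&quot;,<br>    &quot;Question: &quot; + latest,<br>  ].join(&quot;\n&quot;);<br>  return llm(prompt);<br>}<br><br>// rewriteQuery([&quot;User: the Kiox 300 hangs on the loading screen&quot;], &quot;How do I reset it?&quot;)<br>// could yield: &quot;How do I reset the Kiox 300 display when it is stuck loading?&quot;</pre><blockquote><em>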
<h3>Reranking</h3><p>Reranking is a post-retrieval optimization step where an additional model — often a cross-encoder or LLM-powered relevance scorer — evaluates and reorders the initially retrieved documents to ensure the most relevant items appear at the top.</p><p>Semantic search retrieves candidates based on embedding similarity, which is a fast but rough approximation. The initial ranking often contains results that are topically related but don’t answer the question, results that are semantically similar but factually irrelevant, and truly relevant results buried below mediocre ones.</p><p>This problem is amplified in ODC because there’s no hybrid search, the chunking strategies are limited, and dense-only retrieval can surface conceptually similar but wrong content.</p><h3>How It Works (Two-Stage Retrieval)</h3><p><strong>Stage 1 — Retrieval (fast, broad):</strong> ODC semantic search retrieves a broad set of candidate chunks (e.g., top 10–20 results).</p><p><strong>Stage 2 — Rerank (precise, slower):</strong> A more powerful model (cross-encoder or LLM) evaluates each candidate together with the user query and assigns a refined relevance score.</p><p>Here’s an example. User query: <em>“How do I fix error 503 on the Kiox 300?”</em> The example results table is embedded in the original story: <a href="https://medium.com/media/479c72d14dab08bf012990f8a27ccd01/href">view it on Medium</a>.</p><p>The reranker pushes the most relevant result to the top and filters out noise — something the initial dense vector similarity alone could not accomplish.</p><p>Reranking greatly improves precision, lowers token costs (you pass fewer, better chunks to the LLM for generation), reduces hallucination, and is especially valuable for technical and repetitive domains where many chunks are topically similar but only a few are actually useful.</p><h3>How to Implement in ODC</h3><ol><li><strong>Over-retrieve:</strong> Configure your semantic search to return more results than you ultimately need (e.g., retrieve 15, use 3–5).</li><li><strong>Call a reranking model:</strong> After retrieval, send the query and the retrieved chunks to a reranking API. Options include:<ul><li><strong>Cohere Rerank API</strong> — Purpose-built reranking model. Fast, cost-effective, and consistent. Recommended as a first choice.</li><li><strong>Cross-encoder models</strong> (e.g., via Azure AI or a custom endpoint) — High accuracy, good for domain-specific tuning.</li><li><strong>LLM-as-reranker</strong> — Use a prompt that asks the LLM to score each chunk’s relevance to the query on a scale of 1–10. This works but is slower, more expensive per call, and less deterministic than dedicated reranking models. Use it only when a purpose-built reranker isn’t available.</li></ul></li><li><strong>Sort and filter:</strong> Reorder the results by the new relevance score and take only the top N.</li><li><strong>Pass to generation:</strong> Use the reranked chunks as context for the LLM response.</li></ol><p>A simple LLM-based reranking prompt: <em>“Given the following user query and a list of text passages, rate each passage’s relevance to the query on a scale of 0 to 10. Return only the passage IDs and their scores.”</em></p>
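<p>Here is a minimal LLM-as-reranker sketch in Python, scoring one passage per call for simplicity (the batch prompt above would score all passages in a single call). The call_llm helper is again a hypothetical stand-in for your LLM connector; a dedicated reranker such as Cohere Rerank would replace the scoring loop entirely.</p><pre># Minimal sketch of stage 2: rerank over-retrieved chunks with LLM scores.<br># call_llm is a hypothetical helper for your LLM connector, not an ODC API.<br>RERANK_PROMPT = (<br>    &quot;Given the following user query and a text passage, rate how relevant &quot;<br>    &quot;the passage is to the query on a scale of 0 to 10. Return only the number.\n\n&quot;<br>    &quot;Query: {query}\n\nPassage: {passage}&quot;<br>)<br><br>def rerank(query, chunks, call_llm, top_n=3):<br>    &quot;&quot;&quot;Reorder over-retrieved chunks by refined relevance; keep the best top_n.&quot;&quot;&quot;<br>    scored = []<br>    for chunk in chunks:  # e.g., 15 over-retrieved candidates from ODC<br>        raw = call_llm(RERANK_PROMPT.format(query=query, passage=chunk))<br>        try:<br>            score = float(raw.strip())<br>        except ValueError:<br>            score = 0.0  # unparseable reply: treat the passage as irrelevant<br>        scored.append((score, chunk))<br>    scored.sort(key=lambda pair: pair[0], reverse=True)<br>    return [chunk for _, chunk in scored[:top_n]]</pre>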
<h3>Putting It All Together</h3><p>Here’s the recommended architecture for high-quality semantic search in ODC:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/0*fxwtUETqJqJx_9H5.png" /></figure><p>This pipeline compensates for most of the limitations. The accompanying table is embedded in the original story: <a href="https://medium.com/media/b3838cf9049c555663b9f5d034171479/href">view it on Medium</a>.</p><blockquote><strong><em>Important caveat:</em></strong><em> Query rewriting and reranking serve as mitigations, not as comprehensive solutions to these limitations. For highly accurate retrieval, a more advanced approach incorporating both sparse and dense vectors, along with hybrid search, is necessary.</em></blockquote>
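<p>Assuming the rewrite_query and rerank helpers sketched earlier, plus a hypothetical odc_semantic_search wrapper around the semantic search action, the whole pipeline reads roughly like this:</p><pre># End-to-end sketch: rewrite, over-retrieve, rerank, then generate.<br># odc_semantic_search is a hypothetical wrapper around the ODC search action;<br># rewrite_query and rerank are the helpers sketched in the sections above.<br>def answer(question, history, call_llm, odc_semantic_search):<br>    query = rewrite_query(question, history, call_llm)       # pre-retrieval<br>    candidates = odc_semantic_search(query, max_results=15)  # over-retrieve<br>    context = rerank(query, candidates, call_llm, top_n=3)   # post-retrieval<br>    prompt = (&quot;Answer the question using only the context below.\n\n&quot;<br>              + &quot;\n\n&quot;.join(context) + &quot;\n\nQuestion: &quot; + question)<br>    return call_llm(prompt)                                  # generation</pre><p>The over-retrieval count and top_n are exactly the kind of reranking thresholds the summary below suggests iterating on.</p>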
<h3>Summary</h3><p>ODC’s built-in semantic search is a significant step forward — it can remove the need for third-party vector databases and simplifies the developer experience. For high-accuracy RAG applications, though, be aware of its constraints.</p><p><strong>Know your content.</strong> Understand the structure and diversity of the data you’re indexing. Choose the chunking strategy that fits — don’t blindly accept the default.</p><p><strong>Use recursive or smart chunking for structured content.</strong> If your entity data contains well-structured text with natural section boundaries, recursive chunking will outperform fixed-size and sentence-based approaches.</p><p><strong>Avoid fixed-size chunking</strong> unless you’re dealing with completely unstructured, messy text and have no better option.</p><p><strong>Consider custom chunking.</strong> While built-in chunking strategies are convenient, custom chunking can significantly enhance accuracy and quality — especially for heterogeneous content.</p><p><strong>Implement query rewriting.</strong> A simple LLM call before retrieval can transform a vague conversational query into a precise search input. This is your highest-impact optimization.</p><p><strong>Implement reranking.</strong> Over-retrieve and then rerank. This compensates for the lack of hybrid search and the limitations of dense-only retrieval. It’s especially critical for technical domains with specific terminology. Prefer dedicated reranking models over LLM-as-reranker for cost, speed, and consistency.</p><p><strong>Acknowledge the limitations you can’t change.</strong> Currently, ODC lacks support for sparse vectors, hybrid search, and custom embedding models. Navigate these constraints by employing the pre- and post-retrieval optimizations mentioned earlier. For high-accuracy requirements, an external tech stack remains necessary.</p><p><strong>Monitor and iterate.</strong> Collect user feedback on search quality. The gap between “good enough” and “production-grade” is almost always closed through iterative refinement of chunking parameters, query rewriting prompts, and reranking thresholds.</p><p>If you’ve implemented any of these optimizations in your ODC projects, I’d love to hear about your results and experiences. Let’s connect on <a href="https://www.linkedin.com/stefanweber1">LinkedIn</a>.</p><hr><p><a href="https://itnext.io/semantic-search-in-outsystems-developer-cloud-e621bc185e42">Semantic Search in OutSystems Developer Cloud</a> was originally published in <a href="https://itnext.io">ITNEXT</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>