Why “Classic RAG” Breaks on Android

On paper, retrieval-augmented generation is straightforward: embed the query, retrieve the top chunks, stuff them into a prompt, and generate an answer with citations. On Android, that “classic” flow runs into real constraints:

- Latency budgets are tight. Users feel delays instantly, especially inside chat-like UIs.
- Networks are unreliable. RAG becomes brittle when your retrieval depends on a perfect connection.
- Privacy expectations are higher. Users assume mobile experiences are local-first, especially for enterprise or personal data.
- Resources are limited. Battery, memory, and storage don’t tolerate “just cache everything.”
- Cold start is unforgiving. If the first answer is slow or wrong, you lose trust quickly.

So the goal isn’t “RAG everywhere.” The goal is first to find a helpful answer quickly, then to upgrade the grounding when the cloud is available. That’s exactly what a two-tier system provides.

The Reference Architecture

The most reliable mobile RAG setup uses two retrieval tiers and treats the cloud like an upgrade, not a dependency.

Client (Android)

- Query Orchestrator: Runs local + cloud retrieval concurrently and merges results.
- Local Vector Cache (Room/SQLite): A small “hot set” of chunk embeddings and text.
- Lightweight Similarity Search: Exact cosine similarity over a small N (cheap and good enough).
- Prompt Builder: Strict schema, citations, and token budgeting.
- Gateway Client: Calls your server gateway (avoid direct model calls from the app).
- Background Sync (WorkManager): Keeps the cache warm using pinned/popular content.

Server

- AI Gateway: Auth, redaction, rate limits, model routing, and trace logging.
- Vector Search + Chunk Store: Canonical chunks with versions, enforced tenant isolation.
- Optional Reranker: Improves quality for top candidates (cross-encoder or LLM rerank).

This architecture gives you speed and resilience locally, plus freshness and recall from the cloud.

The Hybrid Retrieval Flow (Local-First, Cloud Upgrade)

A practical request flow looks like this:

1. Normalize the query (trim, de-noise, remove obvious UI fluff).
2. Start local retrieval immediately to get the top 3–5 chunks fast.
3. In parallel, attempt cloud retrieval if the device is online.
4. Merge + de-duplicate results using a stable chunkId.
5. Rerank using cheap heuristics; optionally, rerank top N with a stronger model.
6. Build a strict prompt with short excerpts and forced citations.
7. Generate via the gateway and stream to the UI.
8. Warm the local cache with winning chunks for the next similar query.

The “secret” is concurrency: local retrieval gives you speed; cloud retrieval improves accuracy when available. Your UI can show a grounded answer quickly, then refine it if the cloud finds better sources.

Local Vector Cache: Keep It Small and Versioned

The local cache is not your full knowledge base. It’s a curated hot set.

What to Cache Locally

- Essential FAQs/product guides/help center snippets
- Recently used or recently retrieved “winning” chunks (semantic warming)
- Pinned documents per user/org (enterprise-friendly)

Size Guidance

Most apps do great with 500–2,000 chunks locally. With that scale, exact cosine similarity is cheap enough and avoids pulling in heavyweight on-device vector databases.
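The Lightweight Similarity Search piece is small enough to hand-roll. Below is a minimal sketch of a Similarity helper that the retriever skeleton in the next section can call; the little-endian float32 blob encoding and the decode step are assumptions for illustration, not something the article prescribes.

Kotlin

import java.nio.ByteBuffer
import java.nio.ByteOrder

object Similarity {

    // Decode a float32 embedding stored as a BLOB column.
    // Assumes little-endian float32 serialization; adjust to match your own encoding.
    fun decode(blob: ByteArray, dim: Int): FloatArray {
        val buf = ByteBuffer.wrap(blob).order(ByteOrder.LITTLE_ENDIAN)
        return FloatArray(dim) { buf.getFloat(it * 4) }
    }

    // Plain cosine similarity; fine for a hot set of roughly 500 to 2,000 vectors.
    fun cosine(a: FloatArray, b: FloatArray): Float {
        var dot = 0f; var normA = 0f; var normB = 0f
        for (i in a.indices) {
            dot += a[i] * b[i]
            normA += a[i] * a[i]
            normB += b[i] * b[i]
        }
        if (normA == 0f || normB == 0f) return 0f
        return dot / (kotlin.math.sqrt(normA) * kotlin.math.sqrt(normB))
    }

    // Convenience overload matching the retriever sketch below, which passes the Room entity.
    fun cosine(query: FloatArray, entity: RagChunkEntity): Float =
        cosine(query, decode(entity.embeddingBlob, entity.embeddingDim))
}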
Fields That Matter

Store enough metadata to prevent silent staleness:

- chunkId, docId, title, chunkText
- embedding, embeddingDim
- namespace (tenant/org/user scope)
- docVersion
- embeddingModelVersion
- expiresAt (TTL)

Invalidation Rules (Non-Negotiable)

Invalidate cached chunks if:

- docVersion changes
- embeddingModelVersion changes
- TTL expires
- namespace/tenant scope changes

Stale caches are the fastest path to “confidently wrong” outputs.

Kotlin Skeleton (Room + Hybrid Retriever)

Below is a minimal pattern you can ship. It’s intentionally simple: keep the cache small, do exact similarity, and merge with cloud results.

Kotlin

@Entity(tableName = "rag_chunks")
data class RagChunkEntity(
    @PrimaryKey val chunkId: String,
    val namespace: String,
    val docId: String,
    val title: String,
    val chunkText: String,
    val embeddingBlob: ByteArray,
    val embeddingDim: Int,
    val docVersion: Long,
    val embeddingModelVersion: String,
    val expiresAtEpochMs: Long
)

class RagRetriever(
    private val dao: RagChunkDao,
    private val cloud: CloudRagApi,
    private val embedder: QueryEmbedder,
    private val network: NetworkChecker
) {
    suspend fun retrieve(ns: String, query: String, modelVer: String): List<RagChunk> = coroutineScope {
        val now = System.currentTimeMillis()
        val q = QueryNorm.normalize(query)

        val embDeferred = async { embedder.embed(q) }

        val localDeferred = async {
            val emb = embDeferred.await()
            dao.loadValid(ns, modelVer, now)
                .asSequence()
                .map { it.toDomain(score = Similarity.cosine(emb, it), source = "local") }
                .sortedByDescending { it.score }
                .take(5)
                .toList()
        }

        val cloudDeferred = async {
            if (!network.isOnline()) emptyList()
            else cloud.search(ns, q, embDeferred.await(), topK = 8)
        }

        (localDeferred.await() + cloudDeferred.await())
            .groupBy { it.chunkId }
            .map { (_, items) -> items.maxBy { it.score } }
            .sortedByDescending { it.score }
            .take(8)
    }
}

This gives you:

- fast local hits under typical conditions
- cloud “upgrade” when online
- deterministic merge behavior via stable IDs

Prompting Rules That Keep RAG Honest

RAG fails less because of “bad models” and more because of loose prompting. A few rules make a huge difference:

- Cap retrieved chunks (usually 6–10 max).
- Include short excerpts (not full pages).
- Always include chunkId and title and require the model to cite them (e.g., [chunkId]).
- Add explicit refusal behavior: if the sources don’t contain the answer, say you can’t confirm.

This prevents the model from “filling gaps” when retrieval is weak (see the prompt-builder sketch at the end of this article).

Production Guardrails

Security/Privacy

- Enforce namespace isolation in cloud vector search (tenant-safe by design).
- Allowlist which sources a given feature is allowed to retrieve from.
- Redact common sensitive fields before cloud calls (emails, phones, IDs).
- Log chunk IDs and versions, not raw chunk text.

Observability

Track:

- latency breakdown (local retrieval, cloud retrieval, generation)
- local hit rate vs. cloud upgrade rate
- empty retrieval rate
- citation coverage rate
- docVersion mismatch/staleness incidents

If you can’t answer “which chunk led to this output?”, debugging becomes guesswork.

Takeaways

RAG feels “native” on Android when you stop treating retrieval as a single cloud dependency and instead build a two-tier system:

- Local vector cache for speed, resilience, and privacy
- Cloud retrieval for freshness and recall
- Versioned caching + TTL to prevent stale answers
- Strict citations + refusal behavior to keep outputs grounded
- Basic observability to iterate confidently

That’s the architecture that turns RAG from a demo into a feature users trust.
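As a concrete illustration of the prompting rules above, here is a minimal prompt-builder sketch. The RagChunk shape and the character budget stand in for real token budgeting and are assumptions for illustration only.

Kotlin

// Minimal prompt builder: capped chunks, short excerpts, forced [chunkId] citations,
// and explicit refusal behavior. Field names and limits are illustrative assumptions.
data class RagChunk(
    val chunkId: String,
    val title: String,
    val chunkText: String,
    val score: Double,
    val source: String
)

object PromptBuilder {
    private const val MAX_CHUNKS = 8
    private const val MAX_EXCERPT_CHARS = 480 // rough stand-in for token budgeting

    fun build(question: String, chunks: List<RagChunk>): String {
        val sources = chunks.take(MAX_CHUNKS).joinToString("\n\n") { c ->
            "[${c.chunkId}] ${c.title}\n${c.chunkText.take(MAX_EXCERPT_CHARS)}"
        }
        return buildString {
            appendLine("Answer the question using ONLY the sources below.")
            appendLine("Cite every claim with the matching [chunkId].")
            appendLine("If the sources do not contain the answer, say you can't confirm it.")
            appendLine()
            appendLine("Sources:")
            appendLine(sources)
            appendLine()
            append("Question: $question")
        }
    }
}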
The Problem: The Limits of a Static Chatbot

Most developers have mastered the basic LLM API call: send a prompt, get a completion. This works perfectly for summarization, sentiment analysis, or creative writing. However, this architecture fails in real-world engineering scenarios where the application needs accurate, real-time information or needs to perform actions.

If you ask a standard GPT-4 implementation: "What is the current stock price of Datadog multiplied by 1.5?", it will fail. It fails because:

- Knowledge cutoffs: The model doesn't know today's stock price.
- Lack of math reliability: LLMs are probabilistic text generators, not calculators. They often hallucinate math.

To solve this, we need to move from a chatbot architecture to an agentic architecture.

The Solution: The ReAct Paradigm

ReAct (Reason and Act) is a prompt engineering technique that guides LLMs to generate both verbal reasoning traces and specific actions. Instead of immediately trying to answer, the model is instructed to "think out loud" about what it needs to do, execute a tool, observe the output, and repeat the process. We are essentially moving the application flow from a linear request-response to an iterative loop.

Visualizing the Architecture Shift

Unlike a standard retrieval-augmented generation (RAG) pipeline, where data is fetched before the LLM call, an agent decides during execution what data it needs.

Standard LLM call vs. ReAct agent

Step 1: The Baseline Failure

Let's define the problem. We want an AI to answer questions that require real-time data and math. Here is standard Python code using OpenAI's client. It will fail to provide an accurate answer.

Python

import os
from openai import OpenAI

# Setup OpenAI client (ensure OPENAI_API_KEY is set in environment)
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def query_llm(prompt):
    response = client.chat.completions.create(
        model="gpt-4o",  # Or gpt-3.5-turbo
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ],
        temperature=0
    )
    return response.choices[0].message.content

# The difficult query
user_query = "What is the current stock price of Datadog (DDOG) multiplied by 1.5?"
print(f"User Query: {user_query}")
print("-" * 20)
print(query_llm(user_query))

# Output varies, but usually something like:
# "I cannot provide real-time stock prices as my data is not current..."
# Or worse, it hallucinates a number.

Step 2: Building the Agent Under the Hood

To turn this chatbot into an agent, we need three components:

1. Tools: Python functions the LLM can "call".
2. The ReAct prompt: Instructions that force the specific Thought/Action/Observation format.
3. The execution loop: The engine that parses the LLM's response and executes the actions.

While frameworks like LangChain or Semantic Kernel abstract this, building it raw in Python is crucial for understanding the mechanics of AI engineering.

1. Define the Tools

We will define two simple tools: one to simulate fetching stock data, and one to perform accurate math.

Python

# In a real app, this would hit an external API (e.g., Alpha Vantage or Yahoo Finance)
def get_stock_price(ticker: str) -> str:
    """Useful for when you need to find the current price of a stock."""
    print(f"[TOOL LOG] Fetching price for {ticker}...")
    # Simulating mock data for stability
    mock_data = {
        "DDOG": "120.50",
        "GOOGL": "175.20"
    }
    return mock_data.get(ticker.upper(), "Error: Ticker not found")

def calculator(expression: str) -> str:
    """Useful for performing math calculations.
    Input must be a valid python expression."""
    print(f"[TOOL LOG] Calculating: {expression}...")
    try:
        # WARNING: eval() is dangerous in production without strict sandboxing.
        # Using here for simplicity of tutorial.
        return str(eval(expression))
    except Exception as e:
        return f"Error calculating: {e}"

# Registry of available tools
tools_registry = {
    "get_stock_price": get_stock_price,
    "calculator": calculator
}

2. The ReAct Prompt

This is the most critical part. We must explicitly tell the LLM the available tools and the exact format it must follow.

Python

REACT_SYSTEM_PROMPT = """
You are an agent designed to answer questions requiring external information and math.
You have access to the following tools:

1. get_stock_price(ticker: str): Useful for when you need to find the current price of a stock.
2. calculator(expression: str): Useful for performing math calculations.

Use the following format rigorously:

Question: the input question you must answer
Thought: you should always think about what to do next.
Action: the action to take, should be one of [get_stock_price, calculator]
Action Input: the input to the action, e.g., DDOG or 100 * 1.5
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!
"""

3. The Execution Loop (The Engine)

This loop manages the state. It sends the conversation history to the LLM, uses regex to find the Action: the LLM wants to take, executes the corresponding Python function, appends the output as an Observation:, and runs the loop again.

Python

import re

def agent_execution_loop(user_query):
    # Initialize conversation history with the instructions
    history = [
        {"role": "system", "content": REACT_SYSTEM_PROMPT},
        {"role": "user", "content": f"Question: {user_query}"}
    ]

    max_steps = 5  # Safety mechanism to prevent infinite loops
    step_count = 0

    while step_count < max_steps:
        step_count += 1

        # 1. Call LLM with current history
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=history,
            temperature=0,  # Keep temp low for deterministic actions
            stop=["Observation:"]  # Stop generating before it tries to fake an observation
        )
        llm_output = response.choices[0].message.content

        print(f"--- Step {step_count} LLM Output ---")
        print(llm_output)

        # Append LLM's thought process to history
        history.append({"role": "assistant", "content": llm_output})

        # 2. Check for Final Answer
        if "Final Answer:" in llm_output:
            return llm_output.split("Final Answer:")[1].strip()

        # 3. Parse Action and Input using Regex
        # Looking for patterns like: Action: get_stock_price\nAction Input: DDOG
        action_match = re.search(r"Action: (\w+)", llm_output)
        input_match = re.search(r"Action Input: (.+)", llm_output)

        if action_match and input_match:
            tool_name = action_match.group(1)
            tool_input = input_match.group(1).strip()

            # 4. Execute Tool
            if tool_name in tools_registry:
                tool_function = tools_registry[tool_name]
                observation = tool_function(tool_input)
                observation_message = f"Observation: {observation}"

                print(f"--- Step {step_count} Output ---")
                print(observation_message)

                # Append observation to history so the LLM sees it next turn
                history.append({"role": "user", "content": observation_message})
            else:
                history.append({"role": "user", "content": f"Observation: Error: Tool {tool_name} not found."})

    return "Error: Maximum steps reached without final answer."

Running the Agent

Let's run the same query that failed earlier through our new agent loop.
Python

# The difficult query again
user_query = "What is the current stock price of Datadog (DDOG) multiplied by 1.5?"

print(f"Starting ReAct Agent for: {user_query}\n")
final_result = agent_execution_loop(user_query)

print("\n" + "="*30)
print(f"FINAL RESULT: {final_result}")

The Output Trace

When you run this, you will see the agent "thinking" and executing step by step:

Python

Starting ReAct Agent for: What is the current stock price of Datadog (DDOG) multiplied by 1.5?

--- Step 1 LLM Output ---
Thought: The user wants to know the current stock price of Datadog (DDOG) multiplied by 1.5. First, I need to find the current stock price of DDOG.
Action: get_stock_price
Action Input: DDOG
--- Step 1 Output ---
[TOOL LOG] Fetching price for DDOG...
Observation: 120.50
--- Step 2 LLM Output ---
Thought: I have the current stock price of DDOG, which is 120.50. Now I need to multiply this value by 1.5 to get the final answer.
Action: calculator
Action Input: 120.50 * 1.5
--- Step 2 Output ---
[TOOL LOG] Calculating: 120.50 * 1.5...
Observation: 180.75
--- Step 3 LLM Output ---
Thought: I have performed the calculation. The current stock price of DDOG (120.50) multiplied by 1.5 is 180.75.
Final Answer: The current stock price of Datadog (DDOG) multiplied by 1.5 is 180.75.

==============================
FINAL RESULT: The current stock price of Datadog (DDOG) multiplied by 1.5 is 180.75.
==============================

Key Engineering Considerations

Moving from this tutorial code to production requires addressing several challenges:

- Robust parsing: The regex used here (r"Action: (\w+)") is brittle. In production, use more robust parsing or, better yet, OpenAI's native "Function Calling" (Tools API), which returns structured JSON instead of unstructured text that needs parsing (see the sketch after the conclusion).
- Safety and sandboxing: Never use eval() in production. Math tools should use libraries like numexpr. Other tools intended to perform actions (like database writes or API posts) require strict permission layers.
- Context window management: The history list grows with every step. For long-running tasks, you need strategies to summarize previous steps or eject older observations to stay within token limits.
- Loop prevention: Always include a max_steps counter to prevent a confused agent from burning through your API credits in an infinite loop.

Conclusion

The ReAct pattern is a fundamental building block of agentic AI systems. By forcing the model to verbalize its reasoning and grounding its answers in external tool outputs, developers can overcome the inherent limitations of static LLMs. While modern frameworks handle the heavy lifting, understanding the raw "prompt-and-loop" architecture is essential for debugging and optimizing complex AI behaviors.
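As noted under "Robust parsing," the production-grade alternative is the native tools API. The sketch below shows roughly what that swap could look like; it reuses the client and tools_registry from earlier, the tool schemas follow OpenAI's Chat Completions tools format, and the loop structure mirrors this tutorial rather than any official recipe.

Python

# Minimal sketch: the same agent loop using native tool calling instead of regex parsing.
# Reuses tools_registry and client from earlier in the article.
import json

TOOL_SPECS = [
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Find the current price of a stock.",
            "parameters": {
                "type": "object",
                "properties": {"ticker": {"type": "string"}},
                "required": ["ticker"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "calculator",
            "description": "Evaluate a math expression.",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"],
            },
        },
    },
]

def tool_calling_loop(user_query, max_steps=5):
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOL_SPECS, temperature=0
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content  # No more tools requested: treat as the final answer
        messages.append(msg)  # Keep the assistant's tool-call turn in the history
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            # Dispatch to the plain Python functions registered earlier
            result = tools_registry[call.function.name](*args.values())
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": str(result)}
            )
    return "Error: Maximum steps reached without final answer."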
Enterprise networks are often noisy. VPNs, WAFs, proxies, mobile hotspots, and transient gateway hiccups can cause timeouts, packet loss, throttling, and abrupt connection resets. Designing resilient clients minimizes checkout/MACD (move/add/change/disconnect) friction, prevents duplicate actions, and keeps the UI responsive even when backends or the network are unstable.

We have a strong toolkit for making API calls, but how do we make them safe for users and painless for developers? Which stack should we choose? How do we cut duplication and keep code maintainable at enterprise scale? These questions matter when you have hundreds of endpoints: some triggered by CTAs, some on page load, others quietly prefetching data in the background, and a few that need streaming. There’s no one-size-fits-all — each job has a best-fit approach.

In this article, we’ll compare three common strategies — Fetch, Axios, and RTK Query — then help you choose the right toolset for your context. We’ll also share practical examples that save time and solve everyday problems developers face.

Goals

- Resilience: Tolerate flaky networks (WAFs, proxies, mobile, VPN).
- Correctness: Avoid duplicate writes, handle concurrency, and enforce contracts.
- Simplicity: Centralize configuration, reduce boilerplate.
- Observability: Trace requests, measure latency/errors/retries.
- Security: Protect tokens, validate inputs/outputs.

What to Cover for Every API Call

Contract and Types

- Generate types from OpenAPI/Swagger or validate with schemas at runtime.
- Tools: openapi-typescript, openapi-generator, zod, or valibot.

Authentication and Headers

- Attach tokens/cookies securely; rotate/refresh as needed.
- Include correlation headers (e.g., X-Request-Id).

Timeouts

- Explicit per-endpoint timeouts; keep below gateway/WAF limits.
- Keep defaults tight (e.g., 8s); use shorter timeouts for read endpoints and slightly longer for write paths that touch several services.

TypeScript

// api.ts
import axios from 'axios';

export const api = axios.create({
  baseURL: '/api',
  timeout: 8000, // sensible default; override per request where needed
  withCredentials: true,
  headers: { 'Content-Type': 'application/json' }
});

// Attach a requestId and simple timing metadata
api.interceptors.request.use((config) => {
  (config as any).metadata = { start: Date.now() };
  config.headers = { ...config.headers, 'X-Request-Id': crypto.randomUUID() };
  return config;
});

api.interceptors.response.use(
  (res) => {
    const meta = (res.config as any).metadata;
    if (meta) res.headers['x-client-latency-ms'] = String(Date.now() - meta.start);
    return res;
  },
  (error) => Promise.reject(error)
);

Configure per-request timeouts and provide a friendly timeout error message.

TypeScript

// timeouts.ts
import { api } from './api';

export async function fetchCatalog(signal?: AbortSignal) {
  // Read path: lower timeout
  return api.get('/catalog', { timeout: 5000, signal, timeoutErrorMessage: 'Catalog timed out' });
}

export async function submitOrder(payload: unknown, signal?: AbortSignal) {
  // Write path: a bit higher, but still below gateway/WAF timeouts
  return api.post('/orders', payload, { timeout: 12000, signal, timeoutErrorMessage: 'Order submission timed out' });
}

Retries and Backoff

- Only retry idempotent GET/PUT/DELETE or POST with an idempotency key.
- Back off with jitter to avoid thundering herds; cap max attempts.
TypeScript

// retry.ts
import axios, { AxiosError } from 'axios';
import { api } from './api';

const MAX_RETRIES = 3;
const BASE_DELAY_MS = 300; // or use a dedicated backoff/retry helper library instead of hand-rolling this

function sleep(ms: number) {
  return new Promise((r) => setTimeout(r, ms));
}

function isIdempotent(method?: string) {
  return ['get', 'head', 'options', 'put', 'delete'].includes((method || '').toLowerCase());
}

function hasIdempotencyKey(headers?: any) {
  const key = headers?.['Idempotency-Key'] || headers?.['idempotency-key'];
  return Boolean(key);
}

function shouldRetry(error: AxiosError) {
  if (!error.config) return false;
  const status = error.response?.status;
  const transient = !status || [408, 425, 429, 500, 502, 503, 504].includes(status) || (error.code === 'ECONNABORTED');
  const safeMethod = isIdempotent(error.config.method || '');
  const postIsSafe = (error.config.method || '').toLowerCase() === 'post' && hasIdempotencyKey(error.config.headers);
  return transient && (safeMethod || postIsSafe);
}

api.interceptors.response.use(undefined, async (error: AxiosError) => {
  const config: any = error.config || {};
  if (!shouldRetry(error)) return Promise.reject(error);

  config._retryCount = (config._retryCount ?? 0) + 1;
  if (config._retryCount > MAX_RETRIES) return Promise.reject(error);

  const jitter = Math.random() * 100;
  const delay = BASE_DELAY_MS * 2 ** (config._retryCount - 1) + jitter; // 300, 600, 1200 (+ jitter)
  await sleep(delay);
  return api.request(config);
});

Circuit Breaker

Prevents hammering a failing downstream service. Trips open after N consecutive failures, cools down, then goes half-open to probe recovery.

TypeScript

// circuitBreaker.ts
import type { AxiosInstance, AxiosRequestConfig, AxiosResponse } from 'axios';
import axios from 'axios';

type State = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

export class AxiosCircuitBreaker {
  private state: State = 'CLOSED';
  private failures = 0;
  private nextAttemptAt = 0;

  constructor(
    private readonly client: AxiosInstance,
    private readonly opts = { failureThreshold: 5, cooldownMs: 15000 }
  ) {}

  async request<T = any>(config: AxiosRequestConfig): Promise<AxiosResponse<T>> {
    const now = Date.now();
    if (this.state === 'OPEN') {
      if (now < this.nextAttemptAt) {
        const err = new axios.AxiosError('Circuit open', 'ECIRCUITOPEN', config);
        return Promise.reject(err);
      }
      this.state = 'HALF_OPEN';
    }
    try {
      const res = await this.client.request<T>(config);
      this.onSuccess();
      return res;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }

  private onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED';
  }

  private onFailure() {
    if (this.state === 'HALF_OPEN') {
      this.trip();
      return;
    }
    this.failures += 1;
    if (this.failures >= this.opts.failureThreshold) this.trip();
  }

  private trip() {
    this.state = 'OPEN';
    this.nextAttemptAt = Date.now() + this.opts.cooldownMs;
  }
}

// usage
import { api } from './api';
export const breaker = new AxiosCircuitBreaker(api, { failureThreshold: 4, cooldownMs: 10000 });
// Example call
// await breaker.request({ method: 'get', url: '/inventory' });

Cancellation

- Use AbortController (Axios supports the standard AbortSignal).
- With Redux Toolkit createAsyncThunk, the abort signal is provided for you.
TypeScript

// products.slice.ts
import { createAsyncThunk, createSlice } from '@reduxjs/toolkit';
import { api } from '../api';

export const fetchProducts = createAsyncThunk('products/fetch', async (_, { signal }) => {
  const res = await api.get('/products', { signal, timeout: 6000 });
  return res.data as { id: string; name: string }[];
});

const productsSlice = createSlice({
  name: 'products',
  initialState: {
    items: [] as { id: string; name: string }[],
    status: 'idle' as 'idle'|'loading'|'succeeded'|'failed'
  },
  reducers: {},
  extraReducers: (b) => {
    b.addCase(fetchProducts.pending, (s) => { s.status = 'loading'; })
     .addCase(fetchProducts.fulfilled, (s, a) => { s.items = a.payload; s.status = 'succeeded'; })
     .addCase(fetchProducts.rejected, (s) => { s.status = 'failed'; });
  }
});

export default productsSlice.reducer;

TypeScript-JSX

// ProductsList.tsx
import { useEffect } from 'react';
import { useAppDispatch, useAppSelector } from './store';
import { fetchProducts } from './products.slice';

export function ProductsList() {
  const dispatch = useAppDispatch();
  const items = useAppSelector((s) => s.products.items);

  useEffect(() => {
    const promise = dispatch(fetchProducts());
    return () => {
      // Abort when the component unmounts or route changes
      promise.abort();
    };
  }, [dispatch]);

  return <ul>{items.map((p) => <li key={p.id}>{p.name}</li>)}</ul>;
}

Idempotency and Concurrency

- Choose PUT/PATCH for idempotent updates when possible. For POSTs that create or change server state, include an idempotency key header to ensure the server applies the operation at most once, even under retries.
- Use optimistic UI to immediately reflect intended changes, then reconcile or rollback on failure.
- For conflicting updates, consider ETags + If-Match to avoid lost updates.

Optimistic update with idempotency key:

TypeScript

// checkout.slice.ts
import { createSlice, createAsyncThunk, nanoid } from '@reduxjs/toolkit';
import { api } from '../api';

type CartItem = { id: string; qty: number };
type State = { items: CartItem[]; pendingOps: Record<string, { prev?: CartItem }> };

const initialState: State = { items: [], pendingOps: {} };

export const updateCartItem = createAsyncThunk(
  'checkout/updateCartItem',
  async ({ id, qty, idempotencyKey }: { id: string; qty: number; idempotencyKey: string }, { signal, rejectWithValue }) => {
    try {
      const res = await api.put(`/cart/items/${id}`, { qty }, {
        signal,
        timeout: 10000,
        headers: { 'Idempotency-Key': idempotencyKey }
      });
      return res.data as CartItem;
    } catch (e: any) {
      return rejectWithValue({ id, qty, message: e?.message ?? 'update failed' });
    }
  }
);

const slice = createSlice({
  name: 'checkout',
  initialState,
  reducers: {
    startOptimisticUpdate(s, a: { payload: { id: string; qty: number; opId: string } }) {
      const { id, qty, opId } = a.payload;
      const idx = s.items.findIndex((it) => it.id === id);
      const prev = idx >= 0 ? { ...s.items[idx] } : undefined;
      if (idx >= 0) s.items[idx].qty = qty;
      s.pendingOps[opId] = { prev };
    },
    rollback(s, a: { payload: { opId: string } }) {
      const { prev } = s.pendingOps[a.payload.opId] || {};
      if (prev) {
        const idx = s.items.findIndex((it) => it.id === prev.id);
        if (idx >= 0) s.items[idx] = prev;
      }
      delete s.pendingOps[a.payload.opId];
    }
  },
  extraReducers: (b) => {
    b.addCase(updateCartItem.fulfilled, (s, a) => {
      // Confirm final state from server
      const updated = a.payload;
      const idx = s.items.findIndex((it) => it.id === updated.id);
      if (idx >= 0) s.items[idx] = updated;
      // Clean up any matching pending op if you track by id
    })
    .addCase(updateCartItem.rejected, (s, a) => {
      // Rollback is performed from UI where opId is known
    });
  }
});

export const { startOptimisticUpdate, rollback } = slice.actions;
export default slice.reducer;

TypeScript-JSX

// CartItem.tsx
import { useAppDispatch } from './store';
import { startOptimisticUpdate, updateCartItem, rollback } from './checkout.slice';

export function CartItem({ id, qty }: { id: string; qty: number }) {
  const dispatch = useAppDispatch();

  const onChangeQty = (nextQty: number) => {
    const opId = crypto.randomUUID();
    const idempotencyKey = opId; // reuse opId for Idempotency-Key

    // 1) optimistic update
    dispatch(startOptimisticUpdate({ id, qty: nextQty, opId }));

    // 2) server update with retry/cancellation baked via api config
    const thunk = dispatch(updateCartItem({ id, qty: nextQty, idempotencyKey }));

    // 3) rollback on failure
    thunk.unwrap().catch(() => dispatch(rollback({ opId })));
  };

  return (
    <div>
      <span>Qty: {qty}</span>
      <button onClick={() => onChangeQty(qty + 1)}>+</button>
      <button onClick={() => onChangeQty(qty - 1)}>-</button>
    </div>
  );
}

Notes for checkout/MACD:

- Use the same idempotency key when retrying POST/PUT requests to prevent duplicate order lines or duplicated MACD changes.
- For MACD, consider representing each change as a deterministic resource (PUT /services/{id}/config), so retries remain safe.
- If the server supports ETag, add If-Match headers to ensure you aren’t overwriting concurrent updates.

Error Handling

Normalize errors to a single shape for UI; map known status codes to friendly messages.

- One error shape – consistent fields your UI can rely on (message, status, retryable, requestId, details).
- Deterministic mapping – known status codes and transport errors become clear, human-readable messages.
- Separation of concerns – keep low-level errors in the transport; normalize at the boundary (thunks/RTK Query baseQuery), so retries/circuit breakers still work.

TypeScript example: normalizer and usage (Axios and RTK Query):

TypeScript

// error.ts
import type { AxiosError } from 'axios';

export type AppError = {
  code: string;
  status?: number;
  message: string;
  requestId?: string;
  retryable: boolean;
  details?: unknown;
};

const FRIENDLY: Record<number, { code: string; message: string; retryable: boolean }> = {
  400: { code: 'BAD_REQUEST', message: 'Request is invalid. Please check inputs.', retryable: false },
  401: { code: 'UNAUTHENTICATED', message: 'You’re signed out. Please sign in and try again.', retryable: false },
  403: { code: 'FORBIDDEN', message: 'You don’t have permission to do this.', retryable: false },
  404: { code: 'NOT_FOUND', message: 'The resource was not found.', retryable: false },
  409: { code: 'CONFLICT', message: 'Your changes conflict with another update.', retryable: true },
  412: { code: 'PRECONDITION', message: 'Version mismatch. Refresh and try again.', retryable: true },
  422: { code: 'VALIDATION', message: 'Some fields need attention.', retryable: false },
  425: { code: 'TOO_EARLY', message: 'Service not ready. Please try again shortly.', retryable: true },
  429: { code: 'RATE_LIMITED', message: 'Too many requests. Please wait and retry.', retryable: true },
  500: { code: 'SERVER_ERROR', message: 'We hit a snag. Please try again.', retryable: true },
  502: { code: 'BAD_GATEWAY', message: 'Upstream gateway error. Try again.', retryable: true },
  503: { code: 'UNAVAILABLE', message: 'Service is temporarily unavailable.', retryable: true },
  504: { code: 'GATEWAY_TIMEOUT', message: 'Service timed out. Please try again.', retryable: true }
};

export function normalizeAxiosError(e: unknown): AppError {
  const ax = e as AxiosError<any>;
  const status = ax?.response?.status;
  const requestId =
    (ax?.response?.headers?.['x-request-id'] as string) ||
    (ax?.response?.headers?.['x-correlation-id'] as string);

  // Transport-level signals
  if ((ax as any)?.code === 'ECONNABORTED') {
    return { code: 'TIMEOUT', message: 'Request timed out. Please try again.', retryable: true, status, requestId, details: ax?.response?.data };
  }
  if ((ax as any)?.code === 'ERR_NETWORK' || !status) {
    return { code: 'NETWORK', message: 'Network issue detected. Check connection and retry.', retryable: true, status, requestId, details: ax?.message };
  }
  if ((ax as any)?.code === 'ECIRCUITOPEN') {
    return { code: 'CIRCUIT_OPEN', message: 'Service is recovering. Please try again shortly.', retryable: true, status, requestId };
  }

  // HTTP mapping
  const preset = status ? (FRIENDLY[status] ?? defaultFor(status)) : defaultFor(undefined);
  const serverMessage = ax?.response?.data?.message || ax?.response?.data?.error || ax?.message;

  return {
    code: preset.code,
    status,
    message: serverMessage || preset.message,
    retryable: preset.retryable,
    requestId,
    details: ax?.response?.data
  };
}

function defaultFor(status?: number) {
  if (!status) return { code: 'UNKNOWN', message: 'Unexpected error. Please try again.', retryable: true };
  if (status >= 500) return { code: 'SERVER_ERROR', message: 'We hit a snag. Please try again.', retryable: true };
  return { code: 'CLIENT_ERROR', message: 'Something went wrong with the request.', retryable: false };
}

Use with Axios thunks (keep retries/circuit breakers in transport; normalize at boundary):

TypeScript

// orders.thunk.ts
import { createAsyncThunk } from '@reduxjs/toolkit';
import { api } from './api';
import { normalizeAxiosError, type AppError } from './error';

export const fetchOrder = createAsyncThunk<any, string, { rejectValue: AppError }>(
  'orders/fetch',
  async (id, { signal, rejectWithValue }) => {
    try {
      const res = await api.get(`/orders/${id}`, { signal, timeout: 8000 });
      return res.data;
    } catch (err) {
      return rejectWithValue(normalizeAxiosError(err));
    }
  }
);

Use with RTK Query baseQuery (all errors become AppError):

TypeScript

// rtkAxiosBaseQuery.ts
import type { BaseQueryFn } from '@reduxjs/toolkit/query';
import type { AxiosRequestConfig } from 'axios';
import { api } from './api';
import { normalizeAxiosError, type AppError } from './error';

export const axiosBaseQuery = (): BaseQueryFn<AxiosRequestConfig, unknown, AppError> =>
  async (config) => {
    try {
      const res = await api.request(config);
      return { data: res.data };
    } catch (e) {
      return { error: normalizeAxiosError(e) };
    }
  };

Simple UI usage example:

TypeScript-JSX

// ErrorBanner.tsx
import type { AppError } from './error';

export function ErrorBanner({ error }: { error: AppError }) {
  return (
    <div role="alert">
      <strong>{error.message}</strong>
      {error.requestId && <small> · Ref: {error.requestId}</small>}
      {error.retryable && <button onClick={() => window.location.reload()}>Try again</button>}
    </div>
  );
}

Observability

Log requestId, latency, retry count, and breaker state; integrate with your APM.

Performance and Caching

- Deduplicate in-flight requests
- Cache read data (with TTL)
- Invalidate with tags (RTK Query)

Deduplicate in-flight requests (Axios). Avoid duplicate GETs fired concurrently from multiple components by reusing the same Promise.

TypeScript

// inflight.ts
import { api } from './api';

const inflight = new Map<string, Promise<any>>();

function stable(params: Record<string, any> = {}) {
  return JSON.stringify(Object.keys(params).sort().reduce((a, k) => (a[k] = params[k], a), {} as any));
}

export function getWithDedupe<T>(url: string, params: Record<string, any> = {}, timeout = 6000) {
  const key = `GET ${url}?${stable(params)}`;
  const existing = inflight.get(key);
  if (existing) return existing as Promise<{ data: T }>;

  const p = api.get<T>(url, { params, timeout })
    .finally(() => inflight.delete(key));

  inflight.set(key, p as Promise<any>);
  return p;
}

// usage
// const { data } = await getWithDedupe<Product[]>('/products', { q: 'router' });

Simple read cache with TTL (Axios). Cache successful GET responses for a time window to reduce network load.

TypeScript

// ttlCache.ts
import { api } from './api';

type Entry<T> = { expires: number; data: T };
const cache = new Map<string, Entry<any>>();

function now() { return Date.now(); }

function key(url: string, params?: Record<string, any>) {
  return `GET ${url}:${JSON.stringify(params ?? {})}`;
}
export async function getCached<T>(url: string, params?: Record<string, any>, ttlMs = 30_000) {
  const k = key(url, params);
  const hit = cache.get(k);
  if (hit && hit.expires > now()) {
    return { data: hit.data as T, fromCache: true as const };
  }
  const res = await api.get<T>(url, { params, timeout: 6000 });
  cache.set(k, { data: res.data, expires: now() + ttlMs });
  return { data: res.data, fromCache: false as const };
}

// usage
// const res = await getCached<Product[]>('/products', { q: '5g' }, 60000);

RTK Query: caching, de-duplication, and tag-based invalidation. RTK Query caches responses, dedupes in-flight requests, and lets you invalidate specific data via tags.

TypeScript

// services/products.api.ts
import { createApi } from '@reduxjs/toolkit/query/react';
import { axiosBaseQuery } from './rtkAxiosBaseQuery'; // wraps your Axios instance

type Product = { id: string; name: string; price: number };

export const productsApi = createApi({
  reducerPath: 'productsApi',
  baseQuery: axiosBaseQuery(),
  tagTypes: ['Product'],
  keepUnusedDataFor: 60, // seconds
  endpoints: (build) => ({
    getProducts: build.query<Product[], void>({
      query: () => ({ url: '/products', method: 'GET', timeout: 5000 }),
      providesTags: (result) =>
        result
          ? [
              ...result.map((p) => ({ type: 'Product' as const, id: p.id })),
              { type: 'Product' as const, id: 'LIST' }
            ]
          : [{ type: 'Product', id: 'LIST' }]
    }),
    updateProduct: build.mutation<Product, Partial<Product> & Pick<Product, 'id'>>({
      query: ({ id, ...patch }) => ({
        url: `/products/${id}`,
        method: 'PATCH',
        data: patch,
        headers: { 'Idempotency-Key': crypto.randomUUID() }
      }),
      // Invalidate the specific product and the list
      invalidatesTags: (result, error, { id }) => [{ type: 'Product', id }, { type: 'Product', id: 'LIST' }]
    }),
    createProduct: build.mutation<Product, Omit<Product, 'id'>>({
      query: (body) => ({ url: '/products', method: 'POST', data: body }),
      invalidatesTags: [{ type: 'Product', id: 'LIST' }]
    })
  })
});

export const { useGetProductsQuery, useUpdateProductMutation, useCreateProductMutation } = productsApi;

// usage in a component
// const { data, isFetching } = useGetProductsQuery();
// const [updateProduct] = useUpdateProductMutation();

Security and Compliance

- Don’t log PII.
- Redact sensitive fields.
- Follow CSP/CORS.
- Protect tokens in memory only.

Quick Checklist for Enterprise Networks

- Timeouts are defined per request, always below gateway/WAF limits.
- Retries limited, exponential backoff, and jitter.
- Circuit breaker enabled per service domain.
- All requests are cancellable; navigation aborts in-flight calls.
- Optimistic UI with rollback and server-side idempotency keys.
- Observability: track retry counts, breaker state, and timeout rates.

Simple end-to-end example in one place:

TypeScript

// endToEnd.ts
import { breaker } from './circuitBreaker';

export async function safeGet<T>(url: string, signal?: AbortSignal, timeout = 6000) {
  // GET with timeout, retries (via interceptor), and circuit breaker
  return breaker.request<T>({ method: 'GET', url, signal, timeout }).then((r) => r.data);
}

export async function safeIdempotentPost<T>(url: string, body: unknown, signal?: AbortSignal, timeout = 10000) {
  const idempotencyKey = crypto.randomUUID();
  return breaker
    .request<T>({ method: 'POST', url, data: body, signal, timeout, headers: { 'Idempotency-Key': idempotencyKey } })
    .then((r) => r.data);
}

Decision Guide (Quick Comparison)

Variants: Fetch vs. Axios vs. RTK Query (where each fits)
Fetch (Native)

- Pros: built-in, standards-based, streaming support, Request/AbortController.
- Cons: no interceptors, manual timeouts (via AbortController), manual JSON/error parsing, no built-in retries.
- Use when: you want minimal dependencies or need streaming, and you're prepared to build wrappers.
- Doc: https://developer.mozilla.org/docs/Web/API/Fetch_API

Axios (Library)

- Pros: interceptors, concise API, per-request timeout, JSON by default, good error objects, and upload/download progress.
- Cons: no built-in caching/dedupe; you manage retries and invalidation yourself.
- Use when: you need centralized control (headers, auth, tracing) and enterprise behaviors (retries, circuit breaker).
- Doc: https://axios-http.com/docs/intro

RTK Query (Redux Toolkit Query)

- Pros: cache/dedupe, polling, refetch on focus/reconnect, optimistic updates, auto-cancel, generated hooks, integration with Redux DevTools.
- Cons: learning curve for tags/invalidation; you’ll still choose fetch or axios as the underlying transport.
- Use when: you want to standardize data fetching with first-class caching and request lifecycle management.
- Doc: https://redux-toolkit.js.org/rtk-query/overview

Decision Guide

Choose Axios if:

- You need fine-grained control of interceptors, custom circuit breakers, corporate proxy/WAF nuances, and non-standard auth headers.
- You already have significant middleware, logging, and observability built around Axios.
- You prefer explicit ownership of caching, dedupe, and invalidation logic.

Choose RTK Query (with Axios under the hood) if:

- You want built-in caching, de-duplication, polling, refetch on focus/reconnect, and request lifecycle management.
- You prefer generated hooks and minimal boilerplate for data fetching.
- You want first-class optimistic updates and cancellation via onQueryStarted and updateQueryData.
- You can standardize on a baseQuery (fetch or axios) and tags for cache invalidation.

Choose plain Fetch if:

- You want zero dependencies and are comfortable building wrappers for timeouts, retries, and error handling (a minimal wrapper sketch appears at the end of this article).

Good compromise: Keep Axios as the transport and use RTK Query’s axios-compatible baseQuery so you retain interceptors, timeouts, retries, and circuit breaker logic while gaining RTK Query’s caching and lifecycle.
Using RTK Query with Axios:

TypeScript

// rtkAxiosBaseQuery.ts
import type { BaseQueryFn } from '@reduxjs/toolkit/query';
import type { AxiosRequestConfig, AxiosError } from 'axios';
import { api } from './api'; // your axios instance with interceptors

export const axiosBaseQuery = (): BaseQueryFn<AxiosRequestConfig, unknown, unknown> =>
  async (config) => {
    try {
      const result = await api.request(config);
      return { data: result.data };
    } catch (axiosError) {
      const err = axiosError as AxiosError;
      return { error: { status: err.response?.status, data: err.response?.data || err.message } };
    }
  };

Feature | Fetch (native) | Axios (library) | RTK Query (Redux Toolkit Query)
Built into the platform | ✓ | — | —
Streaming response (browser) | ✓ | — | —
Interceptors | — | ✓ | ◐ (via baseQuery/transport)
Per-request timeout | — | ✓ | ◐ (via transport)
Abort/cancel requests | ✓ (AbortController) | ✓ (AbortSignal) | ✓ (auto-cancel)
Built-in retries | — | — | ◐ (retry wrapper)
Caching and de-duplication | — | — | ✓
Refetch on focus/reconnect | — | — | ✓
Polling | — | — | ✓
Optimistic updates | — | — | ✓
Generated React hooks | — | — | ✓
Redux DevTools integration | — | — | ✓
Upload/download progress (browser) | — | ✓ | —
Rich error objects | — | ✓ | ◐ (standardized shape)
Automatic JSON parsing | — (manual res.json()) | ✓ | ✓
Cache invalidation by tags | — | — | ✓
Request de-duplication | — | — | ✓
Works with custom auth headers | ✓ (manual) | ✓ | ✓ (via baseQuery)

Key documents:

- Axios: https://axios-http.com/docs/intro
- Axios timeouts: https://axios-http.com/docs/req_config
- Axios cancellation (AbortController): https://axios-http.com/docs/cancellation
- Redux Toolkit (RTK): https://redux-toolkit.js.org/
- Redux Toolkit createAsyncThunk: https://redux-toolkit.js.org/api/createAsyncThunk
- RTK Query: https://redux-toolkit.js.org/rtk-query/overview
- MDN AbortController: https://developer.mozilla.org/docs/Web/API/AbortController
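For teams that pick plain Fetch, the wrapper work mentioned in the comparison (explicit timeouts, manual JSON and error handling) looks roughly like the sketch below; the helper name and defaults are illustrative assumptions, not part of the comparison above.

TypeScript

// fetchJson.ts: a minimal sketch of the wrapper plain Fetch requires,
// with an AbortController-based timeout, manual JSON parsing, and basic error surfacing.
export async function fetchJson<T>(
  url: string,
  init: RequestInit = {},
  timeoutMs = 8000
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, {
      ...init,
      signal: controller.signal,
      headers: { 'Content-Type': 'application/json', ...(init.headers || {}) }
    });
    if (!res.ok) {
      // Fetch only rejects on network errors, so HTTP errors must be surfaced manually
      throw new Error(`HTTP ${res.status} for ${url}`);
    }
    return (await res.json()) as T;
  } finally {
    clearTimeout(timer);
  }
}

// usage
// const products = await fetchJson<{ id: string; name: string }[]>('/api/products');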
I still remember the Slack message that arrived at 2:47 AM last March. A machine learning engineer at a healthcare AI startup, someone I'd interviewed six months prior about their ambitious diagnostic model, was having what could only be described as an existential crisis.

"Our fraud detection model just started flagging every transaction from zip codes beginning with '9' as high-risk," he wrote. "We can't figure out why. It wasn't doing this yesterday. We've rolled back twice. Same behavior. We think someone poisoned our training pipeline but we have no audit trail. No signatures. Nothing. We don't even know when the data changed."

Their model processed 40,000 transactions daily. It had been making bad decisions for eleven hours before anyone noticed. By the time they took it offline, they'd falsely blocked $1.3 million in legitimate purchases and let through at least $80,000 in confirmed fraud.

The post-mortem revealed that an attacker had compromised a data preprocessing script three weeks earlier, subtly biasing the training data. Not dramatically — just enough that the model learned a spurious correlation that didn't trigger their basic validation checks. No one caught it because no one was looking. Their DevSecOps pipeline, rigorous for traditional software, had a blind spot the size of their entire ML stack.

That conversation taught me something I should have understood years earlier: we've been building AI systems with the security mindset of 2015, and reality is coming to collect.

The Threat Model Nobody Built

Let's establish the stakes clearly. Traditional software has vulnerabilities — buffer overflows, SQL injection, privilege escalation. ML systems inherit all of those, then add an entirely new attack surface that most security teams don't understand and can't defend against.

Data poisoning. Model inversion. Membership inference. Backdoor attacks. Extraction attacks. These aren't theoretical academic concerns. They're documented in the wild, and they're escalating.

In July 2024, the Cloud Security Alliance published their updated ML Top 10 threats list, and the language was uncharacteristically blunt: "Traditional cybersecurity approaches fall woefully short when applied to machine learning systems." They weren't being dramatic. They were warning us.

Consider what happened at a financial services company I consulted for last October. They'd deployed a credit risk model trained on five years of historical data. An attacker gained brief access to their feature engineering pipeline — not production, just the preprocessing stage — and injected carefully crafted synthetic records. Not random noise. Mathematically precise data points designed to shift the decision boundary in specific ways.

Three months later, the model started approving loans for a particular demographic profile that correlated with higher default rates. Not because the model was biased in a traditional sense, but because it had been deliberately taught to be. The poisoned data represented less than 0.3 percent of the training set. That was enough.

By the time they detected the issue — and detection only happened because a skeptical analyst manually reviewed approval patterns — they'd issued 127 loans totaling $4.2 million that their original, unpoisoned model would have rejected. Projected losses: $890,000 assuming industry-average default rates for that risk tier.

Here's the part that haunts me: their DevSecOps pipeline was impressive. They had SAST, DAST, dependency scanning, container image verification, the works. Their code deployments were locked down tighter than most banks I've audited. But none of that protected them against an attack that targeted their training data, not their code.

The OWASP Machine Learning Security Top 10, published in updated form in early 2024, makes the threat taxonomy explicit. Model inversion attacks that reconstruct training data from model outputs. Membership inference that reveals whether specific individuals' data was used for training. Transfer learning attacks that smuggle backdoors through pre-trained models downloaded from public repositories.

I've watched data scientists download pre-trained models from Hugging Face, fine-tune them on proprietary data, and deploy them to production without once asking: "Who trained the base model? What's in those weights? Could there be a backdoor trigger we'd never detect?" The answer is almost always: we don't know, we can't tell, and yes, there could be.

MLSecOps: Because "Secure by Default" Doesn't Apply to Gradients

The term "MLSecOps" sounds like consultant-speak. I was skeptical too. But after auditing ML pipelines at eleven companies over the past eighteen months, I've concluded we need the term because we need the concept — extending DevSecOps practices across the full machine learning lifecycle in ways that account for ML-specific threats.

The Cloud Security Alliance's framework is useful here. Securing ML systems means protecting "the confidentiality, integrity, availability, and traceability of data, software, and models." That last word — traceability — is where most teams fail catastrophically.

In traditional software, you can trace a deployed binary back to source code, commit hash, build pipeline, and even the engineer who approved the merge. In ML, can you trace a deployed model back to the exact dataset version, preprocessing parameters, hyperparameter choices, random seed, and framework version that produced it? Can you cryptographically verify none of those inputs were tampered with? At most companies, the honest answer is no.

I spoke with a senior ML engineer at a logistics company in November. They were deploying models trained on terabytes of delivery route data. When I asked about their model provenance tracking, he pulled up their MLflow instance. Lots of metadata. Lots of logged metrics. But when I asked, "If a regulator asked you to prove this exact model was trained on only authorized data with no tampering, what would you show them?" he went quiet. "We'd show them our access logs and... hope that was convincing?" That's not an isolated case. That's the industry standard.

The shift from DevSecOps to MLSecOps requires thinking about three distinct attack surfaces simultaneously: the code (training scripts, deployment infrastructure), the data (datasets, feature stores, preprocessing pipelines), and the models themselves (weights, architectures, exported artifacts). Miss any one of those and you're compromised. Secure all three and you're... better positioned than 95 percent of ML teams currently operating.

Data Pipeline Hardening: The Unsexy Foundation

Data is the new oil, they kept saying, right up until someone poisoned the oil. Securing ML data pipelines requires adopting practices that feel tedious until the day they save you. I'm talking about data validation frameworks, dataset versioning, anomaly detection at ingestion, and schema enforcement like your business depends on it — because it does.
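To make the validation-gate idea concrete before the war stories, here is a simplified sketch of a CI check on a training batch. The team described below used Great Expectations for this; the sketch uses plain pandas so the mechanics are visible, and the column names, thresholds, and file name are invented for illustration.

Python

# validate_training_batch.py: a simplified stand-in for a data validation gate in CI.
# The column names and thresholds below are illustrative assumptions.
import sys
import pandas as pd

REQUIRED_COLUMNS = {"user_id", "amount", "country", "label"}

def validate(df: pd.DataFrame) -> list[str]:
    failures = []
    # Schema check: training should stop if expected features are missing
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        failures.append(f"missing columns: {sorted(missing)}")
        return failures
    # Null-rate check: catches the "vendor API silently returning nulls" failure mode
    if df["amount"].isna().mean() > 0.01:
        failures.append("amount null rate above 1%")
    # Range check: zero is treated as a sentinel here, not a legitimate value
    if not df.loc[df["amount"].notna(), "amount"].between(0.01, 100_000).all():
        failures.append("amount outside expected range [0.01, 100000]")
    # Distribution sanity check: label prevalence should stay within historical bounds
    positive_rate = df["label"].mean()
    if not 0.001 <= positive_rate <= 0.10:
        failures.append(f"label rate {positive_rate:.4f} outside expected band")
    return failures

if __name__ == "__main__":
    batch = pd.read_parquet(sys.argv[1])
    problems = validate(batch)
    for p in problems:
        print(f"[VALIDATION FAILED] {p}")
    # Non-zero exit blocks the training job in CI, the same way a failed unit test blocks a deploy
    sys.exit(1 if problems else 0)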
Last September, I worked with an e-commerce company deploying a recommendation model. Their data pipeline pulled from fifteen different sources — user behavior logs, inventory databases, third-party demographic data. Zero validation beyond basic type checking.

We implemented Great Expectations — an open-source data validation framework — as a mandatory CI check. Every new batch of training data had to pass a suite of expectation tests before it could be used. Expected value ranges. Expected distributions. Expected correlations between features.

First week: twelve failed jobs. The data science team was annoyed. "This is slowing us down." Second week: we caught a data integrity issue where a vendor API had started returning null values for a key feature, but defaulting them to zero instead of missing. The model would have learned that zero meant "premium user" when it actually meant "data unavailable." That bug would have cost them, conservatively, $200,000 in misallocated ad spend over the next quarter. After that, nobody complained about the validation gates.

Dataset versioning is the other non-negotiable. Tools like DVC (Data Version Control) let you treat datasets like code — versioned, immutable, traceable. When you train a model, you should be able to point to the exact commit hash of the data that produced it.

One insurance company I advised had been retraining their actuarial models monthly using "the latest data dump from the warehouse." No versioning. No audit trail. When their regulators asked them to reproduce a model from eight months prior, they couldn't. The data had been overwritten. The compliance fine was $750,000. DVC would have cost them maybe forty hours of engineering time to implement.

The integration point is CI/CD. Your data validation tests should run automatically whenever new training data is staged. Failed validation should block model training the same way failed unit tests block code deployment. This isn't revolutionary — it's just applying basic software engineering discipline to the most critical component of your ML system.

Model Integrity: Cryptographic Trust for Statistical Artifacts

Here's a question that should keep ML teams awake: how do you know the model you're deploying to production is the model your training pipeline actually produced?

Most teams can't answer that. Their deployment process is something like: training finishes, model gets saved to S3 or GCS, deployment script pulls it down, serves it. At no point is there cryptographic verification that the model artifact is authentic, unmodified, and traceable to a known-good training run.

Sigstore — a project from the Open Source Security Foundation — solves this. It provides cryptographic signing for arbitrary artifacts, including ML models and container images. The value proposition is simple: sign your models when they're produced, verify signatures before deployment.

I saw this implemented elegantly at a medical imaging startup in August 2024. Their training pipeline, after producing a new diagnostic model, automatically signed the model file using Sigstore's keyless signing (which uses OIDC identity, not manually managed keys — one less secret to leak). Their deployment pipeline, before serving any model, verified the signature against a list of approved signing identities. The workflow was: train → sign → version → deploy → verify.

What this prevented: an attacker who compromised their model storage couldn't simply swap in a backdoored model.
The deployment pipeline would reject it because the signature wouldn't match. An insider who wanted to deploy an unapproved model would need to compromise both the storage and the signing identity. Layered defenses. Not perfect, but dramatically better than trusting that whatever's in the bucket is legitimate.

SLSA — Supply Chain Levels for Software Artifacts — extends this concept to the entire build provenance. SLSA Level 3, for example, requires that you can cryptographically prove the entire chain from source to artifact: which code was used, which data, which build environment, which engineer triggered it. For ML, this means being able to attest: "This model was trained using dataset version X, code commit Y, on infrastructure Z, by pipeline W, and here's the cryptographic proof."

The OpenSSF documentation is explicit: "Sigstore enables cryptographic signing of ML models, protecting against model-related supply chain attacks." It's not theoretical. It's operational, today, if you bother to implement it.

Most teams don't bother. They'll spend weeks optimizing model accuracy by 0.3 percent, then deploy that model through a pipeline with zero integrity verification. Priorities.

Code and Dependencies: The Familiar Threat in Unfamiliar Territory

ML codebases inherit all the traditional software vulnerabilities, but they're often maintained by data scientists who weren't trained in secure coding practices and don't think of themselves as building production systems.

I've reviewed ML repositories where training scripts executed arbitrary code via pickle.load() on untrusted model files. Where data processing pipelines used eval() on user-provided formulas. Where container images pulled base layers from random Docker Hub accounts with no verification.

The solution is to apply the same tools you'd use for any other codebase: SAST to catch code-level vulnerabilities, SCA to flag known CVEs in dependencies, and container image scanning to verify runtime environments. But there's an ML-specific twist. ML projects have dependency hell on steroids. TensorFlow, PyTorch, NumPy, SciPy, scikit-learn, and a dozen specialized libraries, all with complex version interdependencies. And most ML teams pin versions loosely if at all, because they're optimizing for "it works on my laptop," not "it's defensible in production."

OpenSSF Scorecard is useful here. It's an automated tool that analyzes repository health across multiple dimensions: Are dependencies pinned? Are there branch protection rules? Is there evidence of code review? Have there been recent security updates?

I ran Scorecard against twenty ML repositories from mid-sized companies last November. Average score: 3.2 out of 10. For comparison, well-maintained open-source infrastructure projects typically score 7-9.

The lowest-scoring repos had unlocked main branches (anyone could push directly), no required reviews, dependencies specified as package>=1.0 (meaning "whatever the latest version is, I guess"), and hadn't been updated in over a year despite multiple CVEs in their transitive dependencies. Those are production ML systems. Processing real data. Making real decisions. With security posture that would embarrass a college hackathon project.

Container image scanning catches some of this. Tools like Trivy or Grype scan your runtime images for known vulnerabilities. But they only help if you actually fail the build when they find critical CVEs, and if you're rebuilding images regularly enough to pick up patches.
One financial services company I worked with in December had a model running in production for fourteen months on a container image that hadn't been rebuilt in over a year. Their base image had thirty-seven known vulnerabilities, including four critical remote code execution bugs. Why hadn't they rebuilt? "The model's working fine, we didn't want to risk breaking it." That's the organizational culture problem. When data scientists see security updates as risks rather than necessities, your MLSecOps tooling won't save you. Runtime Monitoring: Detecting the Attack You Missed Preventing Even perfect pipeline security won't catch everything. That's why runtime monitoring for ML systems is critical — and fundamentally different from traditional application monitoring. Model drift detection is the obvious starting point. Your model was trained on data from one distribution. Production data will drift — sometimes naturally, sometimes because an attacker is deliberately feeding adversarial inputs to degrade performance or trigger specific behaviors. I advised a fraud detection team last June that had deployed sophisticated model performance monitoring. They tracked prediction latency, throughput, error rates — standard stuff. But they weren't tracking data drift or prediction distribution shift. Three weeks into deployment, their model's precision dropped from 94 percent to 78 percent. They noticed because customer complaints spiked. In retrospect, the input feature distributions had shifted significantly starting five days prior — visible in their logs, but no one was watching that metric. Had they been monitoring for drift using something like Evidently AI or Fiddler, they would have caught it immediately. Instead, they caught it when the business impact became undeniable. Anomalous output detection is the other critical component. Your model should have a statistical profile of normal behavior — typical prediction distributions, typical confidence scores, typical feature importance. Deviations from that profile might indicate adversarial inputs, corrupted data, or a model that's been tampered with. One e-commerce recommendation system I audited had no output monitoring at all. When I tested it with deliberately adversarial inputs — edge cases designed to trigger unusual behavior — it started recommending products that made no semantic sense. Not obviously broken, just subtly wrong in ways that would erode user trust over time. Their response: "We have an A/B testing framework, we'd catch that in a test before full rollout." Except they wouldn't. A/B tests measure aggregate metrics like click-through rate. They don't catch that your model is vulnerable to targeted adversarial inputs that could be exploited by a malicious vendor gaming your recommendation system. Logging and traceability complete the picture. Every inference request should be logged with enough context to reproduce it — input features, model version, prediction, timestamp. Not just for debugging, but for security forensics. If you discover your model was compromised six weeks ago, can you identify every prediction it made during that window? Can you notify affected users? Can you quantify the business impact? Most teams can't. Their inference logs are either too sparse (just predictions, no inputs) or non-existent (inference is stateless, nothing is saved). That's not an ML system — it's a black box that makes decisions you can't audit or defend. 
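To ground the drift-monitoring point above in something runnable, here is a deliberately simple sketch, not a substitute for tools like Evidently AI or Fiddler: it compares live feature values against a training-time reference sample using a two-sample Kolmogorov-Smirnov test. The feature name, sample sizes, and p-value threshold are illustrative assumptions.
Python
import numpy as np
from scipy.stats import ks_2samp

DRIFT_P_VALUE = 0.01  # illustrative threshold; tune per feature and traffic volume

def detect_drift(reference: dict, current: dict) -> list:
    """Return the names of features whose live distribution has drifted from the training reference."""
    drifted = []
    for feature, ref_values in reference.items():
        cur_values = current.get(feature)
        if cur_values is None or len(cur_values) == 0:
            drifted.append(feature)  # a missing feature is itself a red flag
            continue
        _, p_value = ks_2samp(ref_values, cur_values)
        if p_value < DRIFT_P_VALUE:
            drifted.append(feature)
    return drifted

# Example: compare a day of live transaction amounts against the training sample.
rng = np.random.default_rng(0)
reference = {"txn_amount": rng.lognormal(3.0, 1.0, 5000)}
current = {"txn_amount": rng.lognormal(3.4, 1.0, 5000)}  # shifted distribution
print(detect_drift(reference, current))  # -> ['txn_amount']

Run a check like this on a schedule against rolling windows of inference logs, and page a human when the same feature keeps showing up across consecutive windows.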
Governance: The Boring Part That Actually Matters Technical controls are necessary but insufficient. Without organizational governance — policies, processes, and culture that prioritize ML security — your Sigstore implementation and data validation gates will gradually erode. NIST's AI Risk Management Framework, released in its current form in early 2023 and updated through 2024, provides a useful structure. It's not prescriptive tooling — it's a set of principles for identifying, assessing, and mitigating AI risks across the lifecycle. But here's the disconnect: I've talked to dozens of ML teams over the past year who've read the NIST framework, nodded appreciatively, and then done absolutely nothing to implement it. Because frameworks are abstract, and tickets in Jira are concrete, and nobody's OKRs reward "implemented governance." The teams that succeed do a few things consistently. First, they enforce repository access controls as code. Tools like Allstar — another OpenSSF project — let you programmatically enforce rules across all your repos. Require branch protection. Require signed commits. Require code review for any PR that touches model training or data pipelines. Make these policies mandatory, not suggestions. Second, they break down silos between data scientists and security teams. At one company I advised, the security team had no visibility into ML deployments because they happened through a separate pipeline that the data science team managed. The security team didn't understand ML, the data scientists didn't prioritize security, and nobody talked to each other. We forced collaboration by making the security team part of the ML deployment approval process. Not as gatekeepers who could arbitrarily block things, but as consultants who reviewed threat models and verified that appropriate controls were in place. The data scientists hated it initially. Six months later, after we caught three serious issues in pre-production that would have caused incidents in prod, they became the biggest advocates for the process. Third, they treat models and datasets as production artifacts deserving the same rigor as code. That means: versioned, tested, signed, deployed through controlled pipelines, monitored in production, and decommissioned deliberately when they're no longer needed. I spoke with an ML platform lead in December who'd implemented this philosophy across her organization. Every model had an owner, a risk assessment, a deployment checklist, and a monitoring plan. Models that didn't meet minimum security standards didn't get deployed, full stop. Even if the business wanted them. Her team initially pushed back. "This bureaucracy is slowing down innovation." Her response: "We'll innovate slower and ship things that work, or we'll move fast and break prod in ways that get us sued. Choose." They chose the former. Model deployment velocity dropped 30 percent in the first quarter. Incidents dropped 70 percent. Business impact of ML-related bugs dropped by over 90 percent. After two quarters, deployment velocity recovered as teams internalized the new standards. Incidents stayed low. That's what mature MLSecOps looks like. Not fast and reckless. Fast and controlled. The Reckoning We're Walking Into I'll make another prediction I desperately hope I'm wrong about. In the next eighteen months, a major company — publicly traded, household name — will suffer a catastrophic ML security incident. Not a model accuracy issue. Not an embarrassing bias case that goes viral on Twitter. 
A deliberate attack that exploits ML-specific vulnerabilities to cause material harm at scale. Maybe it'll be a poisoned training dataset that causes a lending model to systematically discriminate. Maybe it'll be a model extraction attack that steals a proprietary model worth millions in R&D investment. Maybe it'll be a backdoored pre-trained model that gets fine-tuned and deployed across hundreds of products before anyone realizes it's compromised. The technical details don't matter. What matters is that when it happens, we won't be able to claim ignorance. The threats are documented. The mitigations exist. We just haven't bothered to implement them because ML security is hard, unglamorous, and doesn't improve your leaderboard metrics. I've spent fifteen years covering cybersecurity. I've watched industries ignore obvious risks until the inevitable disaster forces change. Equifax. SolarWinds. Log4Shell. Every single time, the post-incident analysis reveals that the vulnerabilities were known, the fixes were available, and organizations chose not to act because security is expensive and breaches are merely probable. ML security is following the same trajectory. We know the risks. We have the tools. We're choosing not to use them. The companies that will survive the coming reckoning are the ones implementing MLSecOps now — boring, tedious, foundational work that doesn't generate hype or conference talks but does make your ML systems defensible. Sign your models. Version your data. Validate your inputs. Monitor your outputs. Treat your ML pipeline as hostile infrastructure that requires defense in depth. Or don't. And explain to your board, your customers, and your regulators why you deployed unverifiable models trained on unaudited data through ungoverned pipelines. I know which conversation I'd rather have.
TL;DR: The A3 Framework by AI4Agile Without a decision system, every task you delegate to AI is a gamble on your credibility and your place in your organization’s product model. AI4Agile’s A3 Framework addresses this with three categories: what to delegate, what to supervise, and what to keep human. The Future of Agile in the Era of AI It's January 2026. The AI hype phase is over. We've all seen the party tricks: ChatGPT writing limericks about Scrum, Claude drafting generic Retrospective agendas. Nobody's impressed anymore. Yet in many agile teams, there's a strange silence. While we see tools being used, often quietly, sometimes secretly, we rarely discuss what this means for our roles, for our work, for the principles that make Agile viable. There is a tension between two extremes: the enthusiastic "automate everything with agents" crowd, and the quiet, gnawing fear of obsolescence. For twenty years, I've watched organizations struggle with agile transformations. The patterns of failure are consistent: they treat Agile as a process to be installed rather than a culture to be cultivated. They value tools over individuals and interactions. Today, I see the exact same pattern repeating with AI. Organizations go shopping for tokens and expect magic, while practitioners wonder whether their expertise is about to be automated away. We need a different conversation. The Work That Made You Visible Is Now Commodity Work Let us name some uncomfortable things: Drafting user stories, synthesizing stakeholder notes, summarizing workshops, turning a messy Retro into themes, organizing super-sticky post-its, because procurement refused to buy them — these were never the point of your job. But they were visible proof that you were doing something. AI changes that visibility. If you are a Scrum Master or Agile Coach who spends 20 hours a week chasing status updates and drafting emails, you are in danger. Not because AI will take your job, but because those tasks are commodity work. When drafting and summarizing became cheap—10 years ago, transcribing a minute of recording cost about $1—the only thing of value remaining is judgment, trust-building, and accountability. Let's also name what many practitioners fear: you are worried AI will replace you. Not because you think you are unskilled, but because you have seen organizations reduce roles to checklists before, demanding verifiable proof that your contribution is moving the ROI needle in the right direction. If your company once replaced "agile coaching" with a rollout plan and a set of events, why wouldn't it replace an agile practitioner with a customized AI that generates agendas and action items by simply prompting it? It's a rational fear. It's also incomplete. Harvard Business School researchers ran a field experiment with 776 professionals. They found that people working with AI produced work comparable to two-person teams. The researchers called AI a "cybernetic teammate." Unsurprisingly, people actually felt better working with AI than working alone: more positive emotions, fewer negative ones. This effect wasn't just about getting more done. It was also about how AI changes the work experience. Which brings us to an important insight I have pointed to for a long time in my writing: If you have deep knowledge of Agile, AI lets you apply it faster and more broadly. AI is the most critical lever you will likely encounter in your professional career.If you do not know about Agile, AI simply amplifies your incompetence. 
A fool with an LLM is still a fool, but now they are spreading their nonsense more confidently. (Dunning-Kruger as a service, so to speak.) The tool is neutral. Your expertise is not. The AI4Agile Educational Path: Building Judgment, Not Dependency Over the past 12 months, I have been developing what I call the AI4Agile Educational Path: a structured learning concept for practitioners who want to work with AI, not be replaced by it. The philosophy is simple: never outsource your thinking. AI should amplify your expertise, not substitute for your judgment. The goal is not to teach you how to prompt a chatbot to do your work. The goal is to build career resilience by mastering the reality of the cybernetic teammate. If you have been following my work, you may recognize some of these concepts. What is new is how they connect to structured learning paths grounded in research, role-specific guidance for Scrum Masters, Product Owners, and Coaches, and measurable outcomes that go beyond "I used ChatGPT today." And here is what that research implies: you don't "roll out" teammates. You introduce them with norms, boundaries, and feedback loops. You decide what the teammate is allowed to do, what must be reviewed, and what stays human. Accountability doesn't disappear when work becomes faster and supported by a machine that we do not fully understand. The A3 Framework: A Decision System for AI Delegation The primary struggle I see among practitioners isn't access to tools. It is judgment about when to use them. We see Product Owners and Managers pasting sensitive customer data into public models. Scrum Masters using AI to write delicate feedback emails that sound robotic and insincere. Coaches delegating analysis that they should have done themselves. Ad-hoc delegation produces ad-hoc results and often unnecessary harm to people, careers, and organizations. This is why I built the Educational Path around what I call the A3 Framework: Assist, Automate, Avoid. Before you type a single prompt, you categorize the task. Each category has distinct rules for AI involvement, human responsibilities, and failure modes. Once you know the category, the prompting decisions become obvious, and so does the decision about which tasks are even candidates for agent-based automation: Assist is where AI drafts, and you decide. Automate is execution under constraints, with checkpoints and audits. Avoid is where mature practitioners earn their keep: tasks too risky, too sensitive, or too context-dependent for AI at any level. I will unpack the full A3 Framework in a dedicated article, complete with role-specific examples for Scrum Masters, Product Owners, and Coaches, as well as a downloadable Decision Card you can keep at your desk. For now, the core principle is that the framework makes AI delegation discussable. Instead of suspicious questions — "Who used AI on this? Did you actually think about it?" — your team asks productive questions: "Which category is this work in? What guardrails do we need?" That shift, from secrecy to shared vocabulary, is how you prevent AI use from becoming clandestine and keep thinking visible across your team. What This Path Will Not Do This path won't do your job for you. It won't teach you to automate everything. Some things should stay human precisely because they're slow, contextual, and relational. It won't promise productivity gains without addressing governance, adoption, and human factors. 
AI transformation will fail for the same reasons Agile transformation did: governance theater, proportionality failures, and treating workers as optimization targets rather than co-designers. "AI theater" looks exactly like "agile theater": impressive demos, vanity metrics, yet no actual change in how decisions get made. And it won't replace the Agile Manifesto values with tool worship. Individuals and interactions still matter more than processes and tools. AI is the ultimate tool. Our challenge is to use it to strengthen individuals and improve our interactions, not let it become a process that manages us. Conclusion: The Road Ahead Over the coming weeks, I will publish detailed explorations into this new reality: the full A3 Framework with practical examples, how to position yourself as an AI thought leader, why AI transformation fails for the same reasons Agile transformation did, how to address "Shadow AI" before it becomes a governance crisis, and practical multi-model workflows. Still, there remains an interesting question: when AI makes the artifacts cheap, will your judgment become more visible, or will it turn out you were hiding behind the artifacts? The elephant is in the room. It's time to say "hello."
DDoS and other bot-driven cyberattacks don’t seem to be going away. If anything, the rise of AI is making them harder to thwart, turning bot protection into a new challenge for security-minded software development teams. Recent industry studies indicate that AI bot traffic surged over 300% last year, and 37% of all internet traffic was attributed to malicious bots. Stopping AI-powered bot attacks is hardly a straightforward undertaking. The simplest move is just to block all AI-driven requests, but that’s not an option, as many have legitimate use cases. AI-driven bot attacks are soaring across the digital ecosystem, with APIs and app flows as a prime target. Even benign AI bot requests drain server resources and expose data because of their scale, but malicious bot attacks are obviously far more dangerous. App developers need to change the dev rulebook in order to build apps that can stop malicious AI-driven bot attacks. Here are nine tips for app dev teams to strengthen app protections and frustrate AI-driven bot attacks, without harming user experience. 1. Build a Bot-Aware Client Foundation First and foremost, bake bot defenses into your app security instead of bolting them on as an afterthought. That means creating a shared SDK owned by the app team to centralize app identity, device/app attestation, request signing, replay protection, and telemetry. Among other benefits, SDKs for internal use serve to extend consistent protection to every feature and endpoint, preventing attackers from finding weaker client paths. 2. Avoid Binary Pass/Fail Decisions Rather than treating trust as a binary issue, treat it as a graded set of decisions, because AI bots quickly switch devices to bypass hard controls. Tier trust instead: throttle, slow, or limit sensitive actions for lower-trust clients, so bots pay steadily increasing costs without being pushed to rotate devices, while real users never notice. 3. Scan for Human Behavior in the App Bots can perfectly mimic headers, TLS fingerprints, and even device models, enabling them to successfully trick defenses that only rely on network signals. But they still struggle to emulate human micro-behavior inside the app, including gesture entropy, typing and edit rhythms, navigation timing, focus/blur patterns, and realistic task completion speeds. These signals should be collected via lightweight in-app telemetry and used as risk indicators. 4. Bind High-Value Actions to Short-Lived Sessions Any action that creates value or risk, such as login, signup, one-time password (OTP) generation, payments, inventory, and messaging, should be cryptographically bound to a short-lived app session. Ephemeral session keys, canonical request signing, timestamps, and nonces prevent requests from being replayed or automated at scale (see the verification sketch at the end of this article). Rotate keys frequently and invalidate sessions when conditions drift, such as device changes, ASN/IP changes, or clock skew. This turns API abuse into a stateful problem for attackers and breaks the “stateless replay” model that AI bots rely on. 5. Focus Defenses on Abuse Goals AI-powered bot attacks aim to achieve outcomes like creating accounts, scraping data, abusing promos, testing cards, or sending spam. They quickly adapt to endpoint rules, so to outsmart them, you need to match their strategies. Map their goals and design flow-level defenses, instead of scattering rules across individual APIs. 
When every abuse goal has its own friction, limits, and validation logic, attackers are forced to retool for each objective instead of reusing one automation pipeline. 6. Replace Blanket CAPTCHA With Adaptive Friction CAPTCHA has evolved into the ultimate lose-lose defense: it doesn’t stop AI bots, while genuine human users hate it. Replace these defenses with adaptive friction such as short interaction sequences, lightweight proof-of-work, or flow-specific delays, triggered by spikes in user risk signals. This keeps UX fast for real users while raising the costs for AI-powered bots. 7. Rate-Limit by Behavior Similarity AI bot attacks rarely come from one IP, so targeting single IPs isn’t usually effective. Instead, program your protections to spot large, coordinated AI bot fleets by looking for user, device, session, ASN, and behavioral similarity clustering, which are signs of AI-powered bot swarms rather than a mass of human users. Apply rate limits across these clusters to slow or isolate entire bot swarms without punishing legitimate users. 8. Build an App that Bots Hate (But Humans Like) AI bots thrive on stateless, repeatable API calls. They don’t like encountering benign statefulness in critical flows, like invalidating state on backgrounding, long pauses, or abnormal jumps; subtly varying required parameters or ordering per session; or tying progress to UI events like screen views or transitions. Meanwhile, human users naturally move through app state linearly because they follow the easiest path. 9. Design for Change AI bots learn fast and remember everything. If your signals, headers, telemetry fields, or flow order stay static, they will eventually be modeled and bypassed. Use server-driven flags to change thresholds, reorder flows, rotate telemetry schemas, and deprecate old app versions aggressively and constantly. Don’t wait for the next release, or you’ll give AI bots too long to break in. Force attackers into constant rework while your app keeps moving. Build Apps that Keep AI Bot Attacks at Bay AI-powered bot attacks are sneaky, persistent, and forceful, but smart tactics at the app dev level can stop them from disrupting your app or dragging down real user experience. Dynamic friction, flows, and barriers that stymie AI but are natural for humans, and constant iteration, help keep AI-powered bot attacks out of your apps.
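Returning to tip 4, binding high-value actions to short-lived sessions: the sketch below shows, in Python for brevity, how a backend might verify a signed request coming from the client SDK described in tip 1. It is a minimal illustration under stated assumptions; the TTLs, canonical format, and in-memory stores are hypothetical, and a real deployment would keep session keys and seen nonces in a shared cache and provision keys through the attestation flow.
Python
import hashlib
import hmac
import time

# Illustrative in-memory stores; production systems would use a shared cache.
SESSION_KEYS = {}    # session_id -> (secret_key_bytes, issued_at)
SEEN_NONCES = set()  # nonces already accepted within the validity window

SESSION_TTL = 300    # seconds a session key stays valid
MAX_CLOCK_SKEW = 30  # tolerated difference between client and server clocks

def canonical_request(method: str, path: str, body: str, timestamp: int, nonce: str) -> bytes:
    # A fixed, unambiguous serialization so client and server sign exactly the same bytes.
    body_hash = hashlib.sha256(body.encode()).hexdigest()
    return f"{method}\n{path}\n{body_hash}\n{timestamp}\n{nonce}".encode()

def verify_request(session_id: str, method: str, path: str, body: str,
                   timestamp: int, nonce: str, signature_hex: str) -> bool:
    entry = SESSION_KEYS.get(session_id)
    if entry is None:
        return False
    secret, issued_at = entry
    now = int(time.time())
    if now - issued_at > SESSION_TTL:           # expired session forces re-attestation
        SESSION_KEYS.pop(session_id, None)
        return False
    if abs(now - timestamp) > MAX_CLOCK_SKEW:   # stale or future-dated request
        return False
    if nonce in SEEN_NONCES:                    # replayed request
        return False
    expected = hmac.new(secret, canonical_request(method, path, body, timestamp, nonce),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature_hex):
        return False
    SEEN_NONCES.add(nonce)
    return True

The client-side SDK computes the same canonical string and HMAC before each high-value call, which is what turns stateless replay into the stateful problem described above.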
Large language models (LLMs) have changed how we think about automation and managing knowledge. They show strong skills in synthesis tasks. However, using them in crucial business areas like FinTech and healthcare reveals their underlying limitations. It is clear that while LLMs can generate language well, they lack the structural strength needed to serve as reliable knowledge systems or to act as independent, responsible decision-makers in real-world situations. Enterprises don’t just want chatbots; they want intelligent agents that can: Interpret domain-specific dataMake decisions aligned with business rulesMaintain context across multi-step workflowsProduce accurate, traceable, and compliant outputs Plain LLMs cannot meet these expectations. They hallucinate. They don’t “know” your enterprise. And they lack long-term memory. Agentic AI — LLM-powered agents that plan, reason, and act — depend heavily on trustworthy knowledge and persistent state. This is exactly where retrieval-augmented generation (RAG) and Elasticsearch-based vector databases intersect. RAG grounds model responses in real enterprise data. Elasticsearch provides scalable, low-latency vector search and hybrid retrieval. Agentic AI orchestrates everything into autonomous behavior. This article presents a clear, practical blueprint for integrating RAG with agentic AI using Elasticsearch vector databases, complete with architectural patterns, a Python implementation, and actionable design guidance for real-world enterprise environments. The Enterprise AI Gap: Problem Statement Hallucination Is a First-Class Risk LLMs generate text by predicting the next token rather than verifying facts. This leads to hallucinations, outputs that appear plausible but are objectively incorrect. In a consumer Q&A setting, such errors may be merely inconvenient. In an enterprise environment, however, they can be harmful: Incorrect regulatory or compliance guidanceMisinterpretation of policies or proceduresInaccurate financial or healthcare recommendationsMisleading analysis for internal stakeholders It is not feasible to build reliable, production-grade AI systems on a model that confidently produces information without underlying verification. No Native Access to Enterprise Knowledge Out of the box, an LLM: Doesn’t know your products or servicesCan’t see your internal documentation, playbooks, or policiesCan’t query your databases, APIs, or knowledge basesCan’t automatically incorporate daily changes in the business Fine-tuning helps only partially and is expensive, slow, and brittle. Enterprises need a way for LLMs to retrieve the latest truth from their own systems. No Long-Term Memory for Multi-Step Tasks Agentic workflows, like onboarding, troubleshooting, or case resolution, require: Remembering prior steps and decisionsReusing context across multiple interactionsBuilding a “picture” of the user or case over time LLMs have a context window, not true memory. Once the token limit is reached or the session ends, the model “forgets” everything. Lack of Explainability and Traceability In regulated and high-stakes environments, leaders ask: Where did this answer come from?Which policy or document supports this recommendation? Plain LLMs cannot show their work. Without retrieval, there are no citations, no links to documents, no audit-friendly trails. Scaling Retrieval Across Millions of Documents Even if you attach a search layer, traditional keyword search (BM25, full-text) is not enough. 
Enterprises need: Semantic search to understand meaning, not just keywordsLow-latency vector search at scaleHybrid retrieval that combines dense and sparse signalsRobust indexing pipelines that can ingest varied content This is where vector databases and Elasticsearch’s modern vector capabilities become essential. What is Retrieval-Augmented Generation (RAG) and Why Does It Matter? RAG addresses the main weaknesses of LLMs by injecting fresh, relevant, and authoritative context into every response. RAG operates as an intermediary layer between organizational data and a language model. The process typically involves: Encode documents as vector embeddings.At query time, embed the user question.Retrieve the most relevant chunks from a vector store (e.g., Elasticsearch).Pass the retrieved context + question into the LLM.The LLM becomes a reasoning engine over your data, instead of a hallucinating storyteller. RAG enables: Hallucination reduction through fact-groundingImmediate updates, no model retraining neededExplainable answers with citations and traceabilityDomain-specific accuracy using internal knowledgeEnterprise safety and compliance controlsLong-term memory when prior decisions are stored as embeddings RAG is the backbone of trustworthy, production-ready enterprise AI. Why Elasticsearch as a Vector Database for Agentic AI? Elasticsearch has evolved from a search engine into a powerful vector search and hybrid retrieval platform. For enterprise RAG and agents, it offers many advantages. Vector Search at Scale Elasticsearch supports: Dense vector fieldsApproximate Nearest Neighbor (ANN) algorithmsSimilarity metrics like cosine and dot product This enables fast, scalable semantic retrieval across millions of documents. Hybrid Retrieval (Dense + Sparse) Best-in-class RAG often uses hybrid search: BM25 / keyword signals → precision for explicit terms (IDs, codes, field names)Vector similarity → semantic understanding of meaning Combining the two improves both precision and recall: exact identifiers still match, while semantically related content isn't missed (a short hybrid query sketch follows the architecture overview below). Enterprise Security and Governance For real-world deployments, Elasticsearch offers: Role-based access controlEncryption and TLSAudit loggingMulti-tenant clusters This is critical for FinTech, healthcare, and other regulated domains. Operational Maturity Elasticsearch is already in use by many enterprises for log analytics, observability, or search. Extending that investment to RAG and Agentic AI is a natural and cost-effective path. Architecture Design: RAG + Agentic AI + Elasticsearch High-Level Architecture Components User Input Layer: Receives commands or queries.Embedding Generation: Converts input into semantic vectors using LLM embeddings.Vector Retrieval Layer (Elasticsearch): Searches for relevant embeddings from knowledge or memory.Agent Reasoning Layer: LLM uses retrieved context to generate responses or actions.Action Execution Layer: Executes tasks via APIs, microservices, or internal logic.Memory Update Layer: Stores embeddings of new interactions for future retrieval. 
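Here is the hybrid query sketch referenced above: a minimal illustration that assumes Elasticsearch 8.x, an index named rag_docs with a 384-dimensional dense_vector field called embedding (matching the implementation later in this article), and simple additive score combination; the boosts are placeholders you would tune, or replace with reciprocal rank fusion where available.
Python
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200", basic_auth=("elastic", "your_password"))
model = SentenceTransformer("all-MiniLM-L6-v2")

def hybrid_search(question: str, k: int = 5):
    query_vec = model.encode(question).tolist()
    # One request combines BM25 keyword matching (precise on IDs, codes, field names)
    # with kNN vector similarity (semantic meaning); Elasticsearch adds the weighted scores.
    return es.search(
        index="rag_docs",
        size=k,
        query={"match": {"content": {"query": question, "boost": 0.3}}},
        knn={
            "field": "embedding",
            "query_vector": query_vec,
            "k": k,
            "num_candidates": 50,
            "boost": 0.7,
        },
    )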
Key Roles of Integrated Technologies 

Technology | Core Function in Architecture
Elasticsearch Vector Store | Serves as the knowledge base and long-term agent memory, storing embeddings and enabling high-speed vector similarity search.
RAG Layer | Orchestrates the retrieval process: fetching vectors, reconstructing text chunks, and assembling the final context sent to the LLM.
LLM | The core computational engine that interprets the question and synthesizes the answer only from the provided context.
Agentic Layer | The control plane that plans the multi-step workflow, determines when to invoke tools (including RAG), and manages memory updates.

Design Best Practices 
Chunk your documents wisely (by sections, headings, or semantic units).
Index rich metadata (source, department, tags, data sensitivity).
Use hybrid search to combine keyword and vector retrieval.
Add guardrails: if context is weak, the agent should abstain or escalate.
Evaluate regularly with synthetic and real test cases (hallucinations, relevance, latency).
Start narrow and expand: begin with one domain (e.g., onboarding) and scale out.

Implementation Walkthrough in Python 
Below is a simplified but realistic implementation to help you go from concept to code. 

Install Dependencies 
Python 
pip install elasticsearch sentence-transformers openai numpy

You can swap OpenAI with any LLM provider; the RAG pattern stays the same. 

Connect to Elasticsearch 
Python 
from elasticsearch import Elasticsearch

es = Elasticsearch(
    "http://localhost:9200",
    basic_auth=("elastic", "your_password")
)

Create a Vector-Enabled Index 
Python 
index_name = "rag_docs"

index_body = {
    "mappings": {
        "properties": {
            "content": {"type": "text"},
            "embedding": {
                "type": "dense_vector",
                "dims": 384,  # all-MiniLM-L6-v2 produces 384-dimensional embeddings
                "similarity": "cosine"
            },
            "source": {"type": "keyword"}
        }
    }
}

if not es.indices.exists(index=index_name):
    es.indices.create(index=index_name, body=index_body)

Generate Embeddings and Index Documents 
Python 
from sentence_transformers import SentenceTransformer
import uuid

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    {
        "content": "RAG reduces hallucinations by grounding LLM responses in retrieved enterprise knowledge.",
        "source": "architecture-notes"
    },
    {
        "content": "Agentic AI enables multi-step reasoning and tool usage, turning LLMs into autonomous agents.",
        "source": "design-doc"
    },
    {
        "content": "Elasticsearch provides scalable vector search and hybrid retrieval for enterprise AI workloads.",
        "source": "platform-doc"
    }
]

for doc in documents:
    embedding = model.encode(doc["content"]).tolist()
    es.index(
        index=index_name,
        id=str(uuid.uuid4()),
        document={
            "content": doc["content"],
            "embedding": embedding,
            "source": doc["source"]
        }
    )

Build a Retrieval Function 
Python 
def retrieve_context(question: str, k: int = 3):
    query_vec = model.encode(question).tolist()
    search_body = {
        "size": k,
        "knn": {
            "field": "embedding",
            "query_vector": query_vec,
            "k": k,
            "num_candidates": 50
        }
    }
    results = es.search(index=index_name, body=search_body)
    chunks = []
    for hit in results["hits"]["hits"]:
        source = hit["_source"]
        chunks.append(source["content"])
    return "\n".join(chunks)

Construct a RAG Prompt 
Python 
def build_rag_prompt(question: str) -> str:
    context = retrieve_context(question)
    return f"""
You are an enterprise AI assistant.
Use ONLY the context below to answer the question accurately.
If the context is insufficient, say you do not have enough information.
Context:
{context}

Question:
{question}
"""

Call the LLM 
Python 
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

def ask_rag(question: str) -> str:
    prompt = build_rag_prompt(question)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a precise, compliant enterprise assistant."},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content

print(ask_rag("How does RAG help reduce hallucinations in enterprise AI?"))

From RAG to Agentic AI To evolve from “assistant” to agent, you add: Planning The agent decides what to do next: Retrieve more contextCall an external APIWrite new data back into ElasticsearchAsk the user for clarification Tool Use You expose tools to the agent: search_docs (RAG retrieval)call_api (microservices, SaaS, internal APIs)write_memory (store embeddings, notes, decisions) Memory You can treat Elasticsearch itself as a memory layer: Store decisions and summaries as embeddingsStore user preferences or case state as documentsRetrieve them later as part of context Simple Agent Loop (Conceptual) 
Python 
def agent(query: str):
    # Step 1: Retrieve context via RAG
    context = retrieve_context(query)

    # Step 2: Ask the LLM to propose a plan
    plan_prompt = f"""
    You are an enterprise AI agent.
    Given the user query and the context below, decide the next step.

    Context:
    {context}

    User query:
    {query}

    Decide whether to:
    - answer_directly
    - refine_and_search
    - ask_clarifying_question

    Explain your reasoning briefly.
    """
    plan = ask_llm(plan_prompt)  # wrapper around LLM call

    # Step 3: Act based on plan (simplified)
    if "refine_and_search" in plan:
        refined_query = extract_refined_query(plan)  # parse from LLM output
        return agent(refined_query)
    elif "ask_clarifying_question" in plan:
        question_to_user = extract_question(plan)
        return f"CLARIFY: {question_to_user}"
    else:
        # answer directly using current context
        return ask_rag(query)

Real-World Use Cases and Design Tips Use Cases FinTech & Wealth Management Advisor onboarding assistantsProduct and services recommendationsCompliance-checking agentsPolicy and product knowledge assistants Healthcare Clinical guidelines retrievalSummarizing patient history from notes (with proper governance) Cybersecurity Incident triage agents retrieving logs and playbooksGuided response workflows based on runbooks Internal Enterprise AI Developer knowledge assistantsArchitecture and design documentation copilotsSupport agents for internal tools and platforms Real-World FinTech Example Scenario: An AI agent advising clients on retirement portfolios. User input: “Recommend a moderate-risk strategy for 2025.”Embedding generation: Convert the query into a vector.Vector search: Retrieve client history, recent market analysis, and regulatory guidelines.RAG-based reasoning: LLM combines context to provide an informed recommendation.Action: Suggest portfolio allocation via dashboard or notification.Memory update: Store embeddings for future personalized recommendations. Benefits Dynamic, accurate, and personalized adviceReduced hallucinationsScalable knowledge retrieval Conclusion Enterprises today demand AI systems that go beyond generating text; they must interpret complex domain data, make informed decisions, retain long-term context, and deliver accurate outputs traceable to authoritative sources. Traditional LLMs alone cannot meet these expectations due to hallucinations, a lack of enterprise grounding, and limited reasoning over extended tasks. 
Integration of RAG and Agentic AI, powered by Elasticsearch vector databases, enables organizations to gain a scalable and reliable foundation for autonomous enterprise intelligence. This unified architecture provides factual, domain-grounded answers, transparent reasoning, high-performance semantic retrieval, and persistent memory that supports complex multi-step agent workflows. As enterprises move toward autonomous and self-improving systems, the combined RAG + Agentic AI + Elasticsearch architecture offers a clear blueprint for modern AI design. It enables agents to reliably retrieve, reason, remember, and act — elevating enterprise AI from basic assistance to true autonomy.
In 2025, many teams tried new things in software development and DevOps — AI copilots, new platforms, more automation, and more security checks. Some of it worked great, some of it created new mess (tool sprawl, unclear ownership, higher cloud bills, and “we ship faster but break more”). Heading into 2026, the focus is shifting from experimentation to ensuring reliability and repeatability. Leaders and practitioners are asking the same questions: How do we move fast without losing quality? How do we keep systems secure without slowing teams down? How do we reduce toil, control costs, and still deliver features that matter? This article breaks down six trends shaping the next year: agentic AI across the SDLC, semantic layers/ontologies that give AI real business context, platform engineering with internal developer platforms, software supply chain security, observability built on standard telemetry, and FinOps becoming part of daily engineering decisions. Together, these trends solve one big problem: they help teams scale delivery — with less chaos, fewer surprises, and more confidence. Trend 1: Agentic AI Across the SDLC The SDLC is the software development life cycle — the end-to-end process of planning, building, testing, deploying, and operating software. It matters because most delays don’t occur only in coding but also in handoffs and “glue work” between steps. Agentic AI is AI that can work toward a goal with limited supervision, by planning steps and using tools (not just generating text). Think: “take this issue, make a change, run checks, and prepare a pull request for review.” Why Does It Matter in 2026? Teams are overloaded by repetitive tasks around delivery — triage, updating configs, chasing flaky tests, fixing CI, writing PR summaries, and digging through logs. Agents can reduce that toil and shorten feedback loops, so engineers spend more time on decisions and design (and less time on copy-and-paste work). For example, GitHub documents workflows in which Copilot can be asked to create a pull request, with the developer approving it before it proceeds. But there’s a catch: AI tends to amplify what already exists in your engineering system. If your foundations are strong (good tests, clear standards, solid CI), you get faster. If things are messy, you might ship faster… into more problems. That’s why 2026 is about agents plus guardrails, not agents alone. If GitHub Copilot feels weak for our use case, there are solid open-source alternatives: Continue (open-source assistant for VS Code/JetBrains; we can connect different models and contexts, and it supports agent-style workflows)Tabby (open-source, self-hosted coding assistant, often positioned as an on-prem alternative to Copilot) And if we want “more agent, less IDE autocomplete,” these are worth a look: OpenHands (agentic developer assistant project)Aider (terminal-first coding agent that works via git changes) Trend 2: Ontologies/Semantic Layer for AI context (Semantic grounding for real business meaning) A semantic layer is a part of data architecture that translates complex data into business-friendly terms, so “revenue,” “active customer,” or “incident severity” mean the same thing everywhere. An ontology is a more formal version of that idea: a shared domain model with clear definitions and relationships (for example: Customer owns Contract, Contract relates to Product, Product has Region rules). OWL is a common standard for representing ontologies. 
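As a tiny illustration of the Customer, Contract, and Product example above, here is a sketch of such a domain model expressed as graph triples with the Python rdflib library; the namespace, class names, and instances are made up for the example.
Python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("https://example.com/ontology/")  # hypothetical namespace

g = Graph()
g.bind("ex", EX)

# Classes: the shared vocabulary every team and AI assistant refers to.
for cls in (EX.Customer, EX.Contract, EX.Product, EX.Region):
    g.add((cls, RDF.type, RDFS.Class))

# Relationships: Customer owns Contract, Contract relates to Product, Product has Region rules.
g.add((EX.owns, RDFS.domain, EX.Customer))
g.add((EX.owns, RDFS.range, EX.Contract))
g.add((EX.relatesTo, RDFS.domain, EX.Contract))
g.add((EX.relatesTo, RDFS.range, EX.Product))
g.add((EX.hasRegionRule, RDFS.domain, EX.Product))
g.add((EX.hasRegionRule, RDFS.range, EX.Region))

# Instance data, so tools and AI agents resolve "customer 42" to the same thing.
g.add((EX.customer42, RDF.type, EX.Customer))
g.add((EX.contract7, RDF.type, EX.Contract))
g.add((EX.customer42, EX.owns, EX.contract7))
g.add((EX.customer42, RDFS.label, Literal("ACME Industries")))

print(g.serialize(format="turtle"))

Even a model this small gives retrieval pipelines and AI assistants one agreed-upon definition of who owns what.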
Under the hood, many ontology/knowledge-graph approaches build on RDF, which represents facts as simple graph statements. What problem does this solve? Data quality issues are real (missing values, inconsistent records, stale data). But even when the data is “good enough,” teams still hit a second issue: meaning and consistency. The same metric name can mean different things across teams, dashboards, and services. When AI systems learn from conflicting definitions, they can sound confident but still be wrong, and it’s hard to explain why. A semantic layer and ontology provide AI with a reliable map of the domain, so answers are grounded in shared definitions and relationships rather than guesswork. We can see it in Figure 1. Figure 1. Ontology Flow Why Does It Matter in 2026? As we use more AI assistants and agents in engineering and ops, they need trusted context to make safe decisions. Graph-based RAG approaches are gaining attention because they can combine text with relationships, not only similarity search. GraphRAG is one example of this direction. And to keep this domain model clean over time, we can validate graph data with constraint rules like SHACL, so “domain truth” doesn’t drift into chaos. Trend 3: Platform Engineering 2.0/AI-ready Internal Developer Platforms Platform engineering is about building internal developer platforms (IDPs) — shared, self-service infrastructure and tools that help teams build, test, deploy, and operate software more consistently. Rather than each team reinventing its own pipeline, platform teams create golden paths (pre-approved, repeatable ways to get things done). As we move into 2026, these platforms are evolving beyond CI/CD automation into AI-ready platforms that embed intelligence, security, and observability into the developer experience. Why Does It Matter in 2026? Many teams experimented with DIY automation in 2024–2025 and now face “integration tax”: dozens of custom scripts, inconsistent standards, unclear ownership, and slow onboarding for new developers. AI-ready IDPs aim to solve these issues by providing patterns, guardrails, and intelligent defaults that scale across teams. They can offer context-aware recommendations (e.g., which tests to run and which security rules apply), enforce policy-as-code, generate environment previews, and integrate AI assistants directly into workflows. This reduces cognitive load on developers and accelerates delivery without compromising quality or governance. What problem it solves: Traditional DevOps pipelines often lack standardization and visibility at scale. Platform engineering creates a shared foundation so teams don’t spend time on plumbing, maintain consistency across services, and adopt new practices (such as AI-augmented workflows) more safely. In 2026, these platforms will also help balance productivity with compliance, cost, and reliability by baking in best practices rather than leaving them as optional add-ons. Links and trend signals: Gartner highlights the strategic shift toward platform engineering and embedded intelligence as a key trend for software teams.Industry discussions increasingly position IDPs as the backbone of scalable DevOps practices.Patterns like policy-as-code and standardized pipelines are growing as larger organizations prioritize compliance and auditability. Trend 4: Supply-Chain Security as the New DevSecOps Baseline What it is: Traditionally, DevSecOps has focused on finding and fixing vulnerabilities in code or containers. 
In 2026, the emphasis is expanding to software supply-chain security — meaning we protect not just our code, but every piece that goes into building, packaging, and delivering software: dependencies, build systems, artifacts, and deployment pipelines. Practices like Software Bill of Materials (SBOMs), artifact signing, provenance tracking, and attestation frameworks (e.g., SLSA) are becoming baseline expectations, not optional add-ons. [source: https://www.cisa.gov/resources-tools/resources/2025-minimum-elements-software-bill-materials-sbom] Why Does It Matter in 2026? High-profile incidents in recent years have shown that attackers often exploit vulnerabilities outside the app’s codebase — for example, compromised open-source libraries or malicious updates in CI/CD pipelines. As teams move faster with AI-augmented workflows, it’s even easier for risky components to creep into releases. Strengthening the supply chain means verifying the origin of each artifact, who signed it, and which policies it complies with before deployment. This reduces unexpected properties and limits blast radius. [source: https://www.itpro.com/software/enterprises-need-to-sharpen-up-on-software-supply-chain-security] What problem it solves: It tackles two significant issues at once: preventing untrusted code from entering production, and making compliance and auditability part of everyday workflows. In 2026, supply-chain security won’t be something done “if we have time” — it will be part of the delivery pipeline itself, giving teams confidence they can ship fast and safe. Links and trend signals: CISA guidance on baseline SBOM elements for software supply chains. [source: https://www.cisa.gov/resources-tools/resources/2025-minimum-elements-software-bill-materials-sbom]Enterprise pressure to mature supply-chain practices. [source: https://www.itpro.com/software/enterprises-need-to-sharpen-up-on-software-supply-chain-security] Trend 5: Observability and Telemetry Engineering What it is: Observability is the practice of understanding how systems behave in production by collecting signals like logs, metrics, and traces. In 2026, this is evolving into telemetry engineering — a more intentional, standardized approach to how we define, collect, store, and use observability data across services and teams. Rather than ad-hoc dashboards and random logs sprinkled everywhere, telemetry engineering treats signals as first-class artifacts that are designed, reviewed, and governed much like code or APIs. Why does it matter in 2026? As architectures become more distributed and AI-driven automation touches more parts of the stack, blind spots can quickly turn into outages or degraded user experience. Teams can’t afford to guess what’s happening anymore; they need reliable, consistent signals that can drive automated insights and even feed AI assistants for problem diagnosis. Standardization efforts (like OpenTelemetry) are unifying how data is collected and transmitted, making it easier to correlate traces with metrics and logs, and to automate alerting, root cause analysis, and cost optimization. [source: https://opentelemetry.io/docs/] What problem it solves: Traditional logging or monitoring often results in signal silos — each tool with its own format and blind spots. Telemetry engineering breaks these silos by aligning on shared schemas, sampling strategies, tagging conventions, retention policies, and cost controls. 
This gives engineering teams a consistent lens into their systems, reduces noise, and supports AI-assisted debugging and predictive analysis. Links and trend signals: OpenTelemetry growing in adoption as the de facto standard for traces, metrics, and logsIndustry focus on treating observability as a platform concern, not a team hack Trend 6: FinOps Meets DevOps (Cost as a First-Class Engineering Signal) What it is: FinOps is the practice of managing and optimizing cloud spend through shared ownership between engineering, finance, and product teams. When FinOps meets DevOps, cost stops being something reviewed after deployment and becomes part of everyday engineering decisions — alongside performance, reliability, and security. In practice, this means teams see cost impact early and often, not only in monthly reports. Why it matters in 2026: Cloud and AI costs are no longer predictable or linear. Ephemeral environments, GPU workloads, managed services, and AI inference can change spend dramatically in days, not months. In 2026, teams that treat cost as “someone else’s problem” will struggle. Instead, DevOps pipelines increasingly include cost guardrails: budget alerts, environment TTLs, right-sizing checks, and cost regression detection before changes hit production (a small sketch of such a check appears at the end of this article). What problem it solves: It closes the gap between speed and sustainability. By integrating cost visibility directly into DevOps workflows, teams can move fast without accidentally burning budget, and leaders can make trade-offs explicit instead of reactive. Links and trend signals: The FinOps Foundation reports growing adoption of engineering-led cost ownership as cloud maturity increases. Conclusion Looking ahead to 2026, all these trends point to the same idea: teams need to scale software delivery with more structure, not more tools. AI, platforms, security, observability, and cost control only help when they are built into the way we work, not added at the end. Teams that connect these areas will move faster, with less stress and fewer surprises. Simple next steps to start now: Pilot one AI workflow, like helping with issue handling or pull requests, with clear rules and human review.Invest in IDP golden paths, so security, observability, and AI tools are the default, not optional.Set a basic supply chain security baseline, including SBOMs and artifact signing.Create a small semantic “thin slice” for one business domain to give AI shared context.Standardize telemetry and cost guardrails, so teams see reliability and cost early, not too late. These steps don’t require a big redesign on day one. But together, they help teams build software that is faster, safer, and more sustainable in 2026.
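To make the Trend 6 cost-regression guardrail mentioned above concrete, here is a minimal sketch of a pipeline step that compares a proposed change's estimated monthly cost against the current baseline and fails the build when it exceeds agreed limits. The file names, the monthly_cost_usd field, and the thresholds are assumptions about whatever your cost-estimation step (for example, post-processed Infracost output) actually emits.
Python
import json
import sys

# Illustrative guardrails; real thresholds would be set per service and environment.
MAX_MONTHLY_INCREASE_USD = 200.0
MAX_INCREASE_PERCENT = 10.0

def read_monthly_cost(path: str) -> float:
    # Assumes the cost-estimation step writes a JSON file with a "monthly_cost_usd" field.
    with open(path) as f:
        return float(json.load(f)["monthly_cost_usd"])

def check_cost_regression(baseline_path: str, proposed_path: str) -> None:
    baseline = read_monthly_cost(baseline_path)
    proposed = read_monthly_cost(proposed_path)
    delta = proposed - baseline
    percent = (delta / baseline * 100) if baseline else float("inf")
    print(f"Estimated monthly cost: {baseline:.2f} -> {proposed:.2f} USD ({percent:+.1f}%)")
    if delta > MAX_MONTHLY_INCREASE_USD or percent > MAX_INCREASE_PERCENT:
        sys.exit("Cost guardrail exceeded: review the change or raise the budget explicitly.")

if __name__ == "__main__":
    check_cost_regression("cost-baseline.json", "cost-proposed.json")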
During my eight years working in agile product development, I have watched sprints move quickly while real understanding of user problems lagged. Backlogs fill with paraphrased feedback. Interview notes sit in shared folders collecting dust. Teams make decisions based on partial memories of what users actually said. Even when the code is clean, those habits slow delivery and make it harder to build software that genuinely helps people. AI is becoming part of the everyday toolkit for developers and UX researchers alike. As stated in an analysis by McKinsey, UX research with AI can improve both speed (by 57%) and quality (by 79%) when teams redesign their product development lifecycles around it, unlocking more user value. In this article, I describe how you can turn user studies into clearer user stories, better agile AI product development cycles, and more trustworthy agentic AI workflows. Why UX Research Matters for AI Products and Experiences For AI products, especially LLM-powered agents, a single-sentence user story is rarely enough. Software developers and product managers need insight into intent, context, edge cases, and what "good" looks like in real conversations. When UX research is integrated into agile rhythms rather than treated as a separate track, it gives engineering teams richer input without freezing the sprint. In most projects, I find three useful touchpoints: Discovery is where I observe how people work todayTranslation is where those observations become scenario-based stories with clear acceptance criteriaRefinement is where telemetry from live agents flows back into research and shapes the next set of experiments A Practical UX Research Framework for Agile AI Teams To keep this integration lightweight, I rely on a framework that fits within normal sprint cadences. I begin by framing one concrete workflow rather than a broad feature; for example "appointment reminder calls nurses make at the start of each shift." I then run focused research that can be completed in one or two sprints, combining contextual interviews, sample call listening, and a review of existing scripts. The goal is to understand decisions, pain points, and workarounds. Next, I synthesize findings into design constraints that developers can implement directly. Examples include "Never leave sensitive information in voicemail" or "Escalate to a human when callers sound confused." Working with software developers, product managers, and UX designers, I map each constraint to tests and telemetry so the team can see when the AI agent behaves as intended and when it drifts. UX Research Framework for Agile AI Product Development Technical Implementation: From Research to Rapid Prototyping One advantage of modern AI development is how quickly engineering can move from research findings to working prototypes. The gap between understanding the problem and having something testable has shrunk dramatically. Gartner projects that by 2028, 33% of enterprise software will embed agentic AI capabilities, driving automation and productivity gains. When building AI agents, I have worked with teams using LLMs or LLM SDKs to stand up functional prototypes within a single sprint. The pattern typically looks like this: UX research identifies a workflow and its constraints, then developers configure the agent using the SDK's conversation flow tools, prompt templates, and webhook integrations. Within days, I have a working prototype that real users can evaluate. 
This is where UX research adds the most value to rapid prototyping. SDKs handle the technical heavy lifting, such as speech recognition, text-to-speech, and turn-taking logic. But without solid research, developers and PMs end up guessing business rules and conversation flows. When I bring real user language, observed pain points, and documented edge cases into sprint planning, the engineering team can focus on what matters: building an agent that fits how people work. The same holds true for text-based agents. LLM SDKs let developers wire up conversational agents quickly, but prompt engineering goes faster when you have actual user phrases to work from. Guardrails become obvious when you have already seen where conversations go sideways. How UX Research Changes Agile AI Development Incorporating UX research into agile AI work changes how teams plan and ship software. Deloitte's 2025 State of Generative AI in the Enterprise series notes that organizations moving from proofs of concept into integrated agentic systems are already seeing promising ROI. In my experience, the shift happens in two key areas. The first change is in how I discuss the backlog with engineering and product teams. Instead of starting from a list of features, I start from observed workflows and pain points. Software developers and PMs begin to ask better questions: How often does this workflow occur? What happens when it fails? Where would automation genuinely help rather than just look impressive in a demo? The second change is in how I judge success. Rather than looking only at LLM performance metrics or deployment counts, I pay attention to human-centric signals. Did the AI agent reduce manual calls for nurses that week? Did fewer financial operations staff report errors in their end-of-day checks? Those questions anchor agile AI decisions in users' lived experience. Use Case: Voice AI Agent for Routine Calls I built a voice AI agent to support routine inbound and outbound calls in healthcare and financial services. In my user research, I found that clinical staff and operations analysts spent large parts of their shifts making scripted reminder and confirmation calls. Staff jumped between systems, copied standard phrases, and often skipped documentation when queues spiked. I ran contextual interviews with nurses and operations staff over two sprints. I sat with them during actual call sessions, noted where they hesitated, and asked why certain calls took longer than others. One nurse told me she dreaded callbacks for no-shows because patients often got defensive. That single comment shaped how we designed the escalation logic. Based on these observations, I scoped an AI agent with clear boundaries. It would dial numbers, read approved scripts, capture simple responses like "confirm" or "reschedule," log outcomes in the primary system, and escalate to a human when callers sounded confused or emotional. Each constraint came directly from something I observed or heard in research. The "escalate when confused" rule, for example, came from watching a staff member spend four minutes trying to calm a patient who misunderstood an automated message. We treated the research findings as acceptance criteria in the backlog. Developers could point to a specific user quote or observed behavior behind every rule. When questions came up during sprint reviews, I could pull up the interview notes rather than guess. 
The AI agent cut manual call time, reduced documentation errors by more than 50%, and made collaboration between teams and end users more consistent. Because I started from real workflow observations and built in human escalation paths, adoption was smoother than in previous automation attempts and increased by 35% in one quarter.

Voice AI Agent Case Study

Why This Approach Works
UX research gives agile AI development a focused user perspective that directly supports developer cycles. When teams work from real workflows and constraints, they write less speculative code, reduce rework, and catch potential failures earlier. McKinsey's work on AI-enabled product development points out that teams that redesign their agile AI product development around UX research expertise tend to make more user-centric decisions, which leads to better product experiences. In my experience, you do not have to trade delivery speed for user understanding. Agile AI teams that work this way stay closer to their users without slowing down.

Key Takeaways
If you are beginning to build or refine LLM-powered agents, here is a realistic next step. Pick one narrow workflow. Study how work happens today. Run a small research-driven experiment. Use telemetry and follow-up conversations to refine each iteration. AI delivers lasting value only when it is integrated thoughtfully into how people and teams already operate. By treating UX research as a first-class part of agile AI development, you bring the user's perspective into every sprint and make your development lifecycle more responsive to real needs.

UX research helps agile AI teams start from real workflows instead of abstract features, leading to more focused and effective agentic workflows
Integrating research into each agile AI product development sprint gives teams clearer constraints, reduces rework, and supports higher-quality releases
Modern LLMs accelerate prototyping, but the quality of your agentic AI workflows depends on how well you understand the underlying workflows before you define requirements and write code
Instead of building custom integrations for each of the AI assistants or large language models (LLMs) you interact with, such as ChatGPT, Claude, or any custom LLM, you can now, thanks to the Model Context Protocol (MCP), develop a server once and use it everywhere. This is much like what we used to say about Java applications: thanks to the Java Virtual Machine (JVM), they are WORA (Write Once, Run Anywhere), built on one system and expected to run on any other Java-enabled system without further adjustments. In How to Build an MCP Server and Client With Spring AI MCP, I described in detail how to leverage MCP to enrich the context of these LLMs, making their responses more precise. That article also showed how to implement an end-to-end use case that integrates an MCP server and a peer MCP client into an AI assistant. Nevertheless, it did not cover securing such integrations, which raised legitimate concerns about deploying them into production.

Therefore, this article focuses on the MCP server part. I'll describe how to implement a simple server using Spring AI and how to test that it works correctly using a very useful tool called MCP Inspector. The emphasis, though, is on how its security can be configured. According to the MCP specification, MCP servers that use HTTP as their transport layer shall be secured with OAuth 2.0 access tokens. There are situations, however, in which an infrastructure supporting all the needed entities (authorization servers, etc.) is not available, and yet the communication still needs to be secured. The experiment in this article assumes such a case and demonstrates how an MCP server can be configured to authorize only requests from MCP clients that include the proper API key in a designated HTTP header.

Developing the MCP Server
As the purpose here is to secure it with an API key, the example MCP server is designed to expose a single tool that enriches the context with additional information about a specific Ninja character, whose name is passed as a parameter. Normally, such data would be read from a designated private data source (database, file, etc.) and delivered back to the peer MCP client and, consequently, to the LLM to use in its endeavors. The server project setup is the following:

Java 21
Maven 3.9.9
Spring Boot 3.5.7
Spring AI 1.1.0

The project is named mcp-server-api-key. To make sure the recommended Spring dependency versions are used, the spring-ai-bom is configured in the pom.xml file.

XML

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-bom</artifactId>
            <version>${spring-ai.version}</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

The main dependency is the Spring AI MCP Server Boot Starter, which provides the well-known, convenient capability of automatic component configuration, making it easy to set up an MCP server in Spring Boot applications.

XML

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-mcp-server-webmvc</artifactId>
</dependency>

As the communication is over HTTP, the WebMVC server transport is used. The starter activates McpWebMvcServerAutoConfiguration and provides HTTP-based transport using Spring MVC, together with automatically configured streamable endpoints. Additionally, the spring-boot-starter-web dependency is included.
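For completeness, the application entry point is a plain Spring Boot class; a minimal sketch (the class name is my assumption, not taken from the project) could look like this:

Java

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class McpServerApiKeyApplication {

    public static void main(String[] args) {
        // The MCP endpoints are auto-configured by the
        // spring-ai-starter-mcp-server-webmvc starter; nothing else is needed here.
        SpringApplication.run(McpServerApiKeyApplication.class, args);
    }
}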
To configure the MCP server, a few properties prefixed by spring.ai.mcp.server are added to the application.properties file. Let's take them in order.

Properties files

spring.ai.mcp.server.name = mcp-server-api-key
spring.ai.mcp.server.version = 1.0.0
spring.ai.mcp.server.instructions = Instructions - endpoint: /mcp, type: sync, protocol: streamable
spring.ai.mcp.server.type = sync
spring.ai.mcp.server.protocol = streamable
spring.ai.mcp.server.streamable-http.mcp-endpoint = /mcp
spring.ai.mcp.server.capabilities.tool = true
spring.ai.mcp.server.capabilities.completion = false
spring.ai.mcp.server.capabilities.prompt = false
spring.ai.mcp.server.capabilities.resource = false

In addition to the server's name and type, which are self-explanatory, the properties that designate the version and the instructions are quite important. The instance's version is sent to clients and used for compatibility checks, while the instructions property provides guidance upon initialization and allows clients to get hints on how to utilize the server. spring.ai.mcp.server.streamable-http.mcp-endpoint is the endpoint path that an MCP client will use when communicating with the server. As already mentioned, the server exposes only a single tool; thus, the last four properties in the snippet above are set accordingly.

Once these are configured, the application itself is straightforward. In previous versions of Spring AI, one would have needed to register a ToolCallbackProvider first and then proceed with the actual tools. Starting with version 1.1.0, however, this is no longer needed, and the tool configuration can be done directly, as shown in the component below.

Java

@Component
public class NinjaTools {

    private final NinjaService ninjaService;

    public NinjaTools(NinjaService ninjaService) {
        this.ninjaService = ninjaService;
    }

    @McpTool(name = "get-ninja-character-strengths",
            description = "Provides the strength of the Ninja character with the indicated name")
    public NinjaStrengths ninjaStrengths(@McpToolParam(description = "The Ninja character name") String name) {
        return ninjaService.strengthsByName(name);
    }
}

Via the @McpTool annotation, we specify the name and description of the tool; via @McpToolParam, the description of its name parameter. The tool exposed by this MCP server is very simple: it returns the strengths of a Ninja character, based on the provided name. The purpose is to ultimately provide such a list to the underlying LLM so that it has a better view of the concrete context. Related to this, it's worth mentioning that the description attribute of the @McpTool annotation plays a significant role, because it gives the MCP client a hint that invoking this tool may yield additional related details.

My preference when implementing MCP servers is to clearly separate the MCP-specific part (which is fairly similar regardless of the particular tools' details) from the part containing the actual functionality, which is, by all means, unrelated to MCP. This leads to a less coupled, more cohesive solution. In the simple case here, the separation may look exaggerated, yet it is kept because it clearly outlines the idea. Now, to complete the implementation, the simple NinjaService below is constructed (the focus of this article is on securing the MCP server).
Java

@Service
public class NinjaService {

    public NinjaStrengths strengthsByName(String name) {
        return switch (name) {
            case "lloyd" -> new NinjaStrengths("Lloyd Garmadon – Green Ninja (Life)",
                    List.of("Leadership", "Adaptability", "Courage"));
            case "kai" -> new NinjaStrengths("Kai – Fire Ninja (Fire)",
                    List.of("Determination", "Fearlessness", "Loyalty"));
            case "jay" -> new NinjaStrengths("Jay Walker – Lightning Ninja (Lightning)",
                    List.of("Creativity", "Agility", "Humor"));
            case "cole" -> new NinjaStrengths("Cole – Earth Ninja (Earth)",
                    List.of("Balance", "Stability", "Resilience"));
            case "zane" -> new NinjaStrengths("Zane – Ice Ninja (Ice)",
                    List.of("Intelligence", "Compassion", "Generosity"));
            case "nya" -> new NinjaStrengths("Nya – Water Ninja (Water)",
                    List.of("Independence", "Adaptability", "Curiosity"));
            default -> new NinjaStrengths();
        };
    }

    public record NinjaStrengths(String name, List<String> strengths) {

        public NinjaStrengths() {
            this("Unknown", Collections.emptyList());
        }
    }
}

The results represent the character's full name and a list of three strengths, packed as NinjaStrengths instances. Normally, once this point is reached, MCP server development is complete; it can be tested and then used. Nevertheless, this step is postponed for now: the server's security is configured and applied first, and everything is then checked with the MCP Inspector tool.

Securing the MCP Server
For the security concern, the mcp-security project is used. At the time of this writing, its maintainers specify that the module is still under development and compatible with Spring AI 1.1.0 and up, so for the experiment here, we are good. The mcp-server-security dependency provides both options, OAuth 2.0 and API key-based security for MCP, out of which the latter is used here. The dependencies below are added to the pom.xml file.

XML

<dependency>
    <groupId>org.springaicommunity</groupId>
    <artifactId>mcp-server-security</artifactId>
    <version>0.0.3</version>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-security</artifactId>
</dependency>

To secure the MCP server with an API key, in addition to the key identifier itself, a key secret is required. For convenience, they are declared as environment variables and used in the application.properties file.

Properties files

api.key.id = ${API_KEY_ID}
api.key.secret = ${API_KEY_SECRET}

All requests to the MCP server are intercepted, and only those that are compliant (that possess the authorization header carrying the expected API key) are authorized further. To keep things simple and intuitive, a single ApiKeyEntity instance is constructed and stored as part of a simple InMemoryApiKeyEntityRepository. Then, when the SecurityFilterChain is built, a SecurityConfigurerAdapter is applied and an McpApiKeyConfigurer is used, which allows two concerns to be addressed:

Set the expected security header name – ninja-x-api-key
Set the repository that stores the server API key(s)

The security configuration class is shown below.
Java

@EnableWebSecurity
@Configuration
public class SecurityConfig {

    @Value("${api.key.id}")
    private String apiKeyId;

    @Value("${api.key.secret}")
    private String apiKeySecret;

    @Bean
    ApiKeyEntity apiKey() {
        return ApiKeyEntityImpl.builder()
                .name("API key")
                .id(apiKeyId)
                .secret(apiKeySecret)
                .build();
    }

    @Bean
    ApiKeyEntityRepository<ApiKeyEntity> apiKeyRepository() {
        return new InMemoryApiKeyEntityRepository<>(List.of(apiKey()));
    }

    @Bean
    SecurityFilterChain securityFilterChain(HttpSecurity http) throws Exception {
        return http.authorizeHttpRequests(auth -> auth.anyRequest().authenticated())
                .with(McpApiKeyConfigurer.mcpServerApiKey(),
                        apiKeyConfig -> apiKeyConfig.apiKeyRepository(apiKeyRepository())
                                .headerName("ninja-x-api-key"))
                .build();
    }
}

Taking a more general approach, if the in-memory API key storage is insufficient, one can implement the interface below and provide a specific mechanism that suits the particular needs.

Java

public interface ApiKeyEntityRepository<T extends ApiKeyEntity> {

    @Nullable
    T findByKeyId(String keyId);
}

At this point, the MCP server is secured. In order to communicate successfully, an MCP client shall send HTTP requests that contain the required header:

Plain Text

"ninja-x-api-key": [api-key-id].[api-key-secret]

where api-key-id and api-key-secret are replaced with the values configured above.

Testing the MCP Server
For testing and debugging MCP servers in general (and the one in this article in particular), the MCP Inspector tool is used. Its documentation clearly describes the prerequisites for running it and provides details on the available configurations. It can be started with the following command.

PowerShell

C:\Users\horatiu.dan>npx @modelcontextprotocol/inspector
Starting MCP inspector...
Proxy server listening on localhost:6277
Session token: 6ae623f67a00875097263bb02ecbd7c3949fdb41e741e9d6650765b282673733
Use this token to authenticate requests or set DANGEROUSLY_OMIT_AUTH=true to disable auth
MCP Inspector is up and running at:
http://localhost:6274/?MCP_PROXY_AUTH_TOKEN=6ae623f67a00875097263bb02ecbd7c3949fdb41e741e9d6650765b282673733
Opening browser...

Once it's up and running, MCP Inspector can be accessed via the link above. Prior to connecting to the developed MCP server, though, a few settings are needed:

Transport Type: Streamable HTTP
URL: http://localhost:8080/mcp
The required authentication header – ninja-x-api-key – set to the value id.secret

Once successfully connected, one can observe the following in the mcp-server-api-key logs.

Plain Text

[mcp-server-api-key] [nio-8080-exec-1] i.m.server.McpAsyncServer : Client initialize request - Protocol: 2025-06-18, Capabilities: ClientCapabilities[experimental=null, roots=RootCapabilities[listChanged=true], sampling=Sampling[], elicitation=Elicitation[]], Info: Implementation[name=inspector-client, title=null, version=0.17.2]

On the other hand, the response to the MCP Inspector initialize request is:

JSON

{
  "capabilities": {
    "logging": {},
    "tools": {
      "listChanged": true
    }
  },
  "serverInfo": {
    "name": "mcp-server-api-key",
    "version": "1.0.0"
  },
  "instructions": "Instructions - endpoint: /mcp, type: sync, protocol: streamable"
}

With the communication established, the exposed tool(s) can now be listed and invoked. The screenshot illustrates the execution of get-ninja-character-strengths with 'jay' as input, returning the available character strengths.
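As a complementary quick check outside MCP Inspector, a plain HTTP probe can confirm that the API key filter is wired correctly. The sketch below is only an assumption-based example using the JDK's HttpClient: it posts a dummy body to http://localhost:8080/mcp once without and once with the ninja-x-api-key header. The exact success status depends on the body and Accept headers a real MCP client would send, but the unauthorized call should be rejected with 401, while the authorized one should get past the security filter.

Java

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical smoke test: verifies that the MCP endpoint rejects requests
// lacking the configured API key header. Values are placeholders.
public class ApiKeySmokeTest {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Header value format: [api-key-id].[api-key-secret], taken here from the
        // same environment variables the server uses.
        String apiKeyHeaderValue = System.getenv("API_KEY_ID") + "." + System.getenv("API_KEY_SECRET");

        HttpRequest withoutKey = HttpRequest.newBuilder(URI.create("http://localhost:8080/mcp"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString("{}"))
                .build();

        HttpRequest withKey = HttpRequest.newBuilder(URI.create("http://localhost:8080/mcp"))
                .header("Content-Type", "application/json")
                .header("ninja-x-api-key", apiKeyHeaderValue)
                .POST(HttpRequest.BodyPublishers.ofString("{}"))
                .build();

        // Without the header, the security filter should reject the call (401);
        // with it, the request reaches the MCP endpoint (any non-401 status).
        System.out.println("No key   -> " + client.send(withoutKey, HttpResponse.BodyHandlers.ofString()).statusCode());
        System.out.println("With key -> " + client.send(withKey, HttpResponse.BodyHandlers.ofString()).statusCode());
    }
}

MCP Inspector remains the right tool for end-to-end verification; this probe only distinguishes a missing key from a present one.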
Conclusion
Thanks to Spring AI and Spring Security, the MCP server and its API key-based security configuration are implemented in only a few lines of code. Nevertheless, there is conceptually quite a lot to cover and configure, and thus, once the initial enthusiasm passes, a thorough understanding of the concepts is needed so that the developed applications are robust enough and ready for production.