<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Red Buffer - Medium]]></title>
        <description><![CDATA[Turning your AI vision into a success story. Our medium page is a premium selection of articles about developments and practices in artificial intelligence and machine learning, written by our engineers, who love to code and talk tech. - Medium]]></description>
        <link>https://medium.com/red-buffer?source=rss----8d86c116b0ef---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>Red Buffer - Medium</title>
            <link>https://medium.com/red-buffer?source=rss----8d86c116b0ef---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 17 May 2026 07:03:18 GMT</lastBuildDate>
        <atom:link href="https://medium.com/feed/red-buffer" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Private, Scalable LLM Inference for Data Compliance: vLLM & SGLang]]></title>
            <link>https://medium.com/red-buffer/private-scalable-llm-inference-for-data-compliance-vllm-sglang-539ad81af04f?source=rss----8d86c116b0ef---4</link>
            <guid isPermaLink="false">https://medium.com/p/539ad81af04f</guid>
            <category><![CDATA[llm-deployment]]></category>
            <category><![CDATA[vllm]]></category>
            <category><![CDATA[data-privacy]]></category>
            <category><![CDATA[enterprise-ai]]></category>
            <category><![CDATA[sglang]]></category>
            <dc:creator><![CDATA[Ahmad Faraz Sheikh]]></dc:creator>
            <pubDate>Mon, 27 Apr 2026 06:31:31 GMT</pubDate>
            <atom:updated>2026-04-27T06:31:29.572Z</atom:updated>
<content:encoded><![CDATA[<p><em>Open source inference engines give you what commercial APIs cannot: complete data sovereignty, production-scale throughput, and zero vendor lock-in for enterprise LLM workloads.</em></p><p>A developer on your team pastes a customer contract into an LLM prompt. The model returns a clean summary in two seconds. The problem is that the contract just left your network, crossed an international border, landed on a third-party server, and may have been logged for up to 30 days. For companies in regulated industries, that single API call can trigger a compliance violation under GDPR, HIPAA, or sector-specific data residency laws.</p><p>This is not a hypothetical edge case. According to Kong’s 2025 Enterprise AI report, 44 percent of organizations identify data privacy and security as the top barrier to LLM adoption. Gartner predicts that by 2027, more than 40 percent of AI-related data breaches will stem from improper cross-border use of generative AI.</p><p>Enterprise API tiers from OpenAI, Anthropic, and Google offer contractual protections: data processing agreements, zero data retention policies, SOC 2 certifications. These protections are real. But for a growing number of organizations, they are not enough. The European Data Protection Board’s Opinion 28/2024 establishes that AI models trained on personal data are subject to GDPR in most cases, specifically because of memorization capabilities. The EU AI Act becomes fully applicable on August 2, 2026, adding obligations for high-risk AI systems. India’s Digital Personal Data Protection Act, in its enforcement phase since late 2025, mandates encryption, masking, and one-year activity log retention.</p><p>For organizations in finance, healthcare, defense, or legal services, the fundamental architectural problem remains: sensitive data leaves the enterprise perimeter on every API call. Data residency laws in the EU, India, Saudi Arabia, and Brazil require certain data categories to stay within national boundaries. Every prompt sent to a third-party LLM is a processing event that may require a Data Protection Impact Assessment, a legal basis under GDPR, and a transfer mechanism like Standard Contractual Clauses. At enterprise volume, thousands or millions of prompts daily, this creates a governance surface most organizations cannot manage.</p><p>This is where open source inference engines change the equation. Engines like vLLM and SGLang let you run large language models entirely on your own infrastructure, at production throughput, with zero data leaving your network, and zero ambiguity about who controls it.</p><h3>How Self-Hosted Inference Solves the Compliance Equation</h3><p>Self-hosting an LLM with vLLM or SGLang changes the compliance architecture in four ways:</p><ul><li><strong>Data never leaves your network.</strong> Prompts, completions, and intermediate computations stay on infrastructure you control. No third-party processor, no cross-border transfer, no retention ambiguity.</li><li><strong>No training on your data.</strong> Open weight models like Llama, Mistral, Qwen, and DeepSeek are downloaded once and served locally.
The question of whether your inputs improve someone else’s model simply does not arise.</li><li><strong>Audit trails you own.</strong> You decide what to log, where to store it, and how long to retain it, which is essential for GDPR data subject requests and sector-specific audits.</li><li><strong>Jurisdictional flexibility.</strong> Deploy inference instances wherever your infrastructure lives, keeping data processing inside the boundaries your regulators require. Gartner expects 35 percent of countries to be locked into region-specific AI platforms by 2027, making this increasingly non-optional.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/856/1*XXo2pEBOQuF1IJ2q3bHHpA.png" /><figcaption>vLLM vs SGLang: Performance, Architecture, and When to Use Each</figcaption></figure><h3>Enter vLLM and SGLang: What They Are and Why They Exist</h3><p>vLLM and SGLang are open source inference engines purpose-built for serving large language models at production scale. Both originated from research at UC Berkeley. vLLM was introduced in a 2023 paper centered on the PagedAttention algorithm, which applies the concept of virtual memory paging from operating systems to manage GPU memory for LLM inference. SGLang followed with the paper “Efficient Execution of Structured Language Model Programs,” introducing RadixAttention, a mechanism that stores and reuses cached computation across requests sharing common prefixes.</p><p>These are not research prototypes. vLLM powers production inference at Meta, Mistral AI, Cohere, IBM, and Red Hat. SGLang is deployed across over 400,000 GPUs at organizations including xAI, AMD, NVIDIA, Intel, LinkedIn, Cursor, and major cloud providers, including Oracle Cloud, Google Cloud, Microsoft Azure, and AWS. Both expose <strong>OpenAI-compatible API endpoints</strong>, which means existing applications built against the OpenAI SDK can migrate to self-hosted infrastructure by changing a single URL, with no code modifications required.</p><p>The practical result is that any organization with access to GPU infrastructure, whether on premises, in a private cloud, or through a dedicated cloud instance, can run a state-of-the-art language model without sending a single byte of data to a third party.</p><p>With the compliance case established, the practical question becomes how to actually run these engines. Both vLLM and SGLang are designed to get you from zero to a working inference server in minutes, not days, though each takes a different architectural path to production performance.</p><h4>vLLM</h4><p>vLLM’s core innovation is <strong>PagedAttention</strong>. Traditional LLM inference allocates contiguous GPU memory for each request’s key-value cache, which wastes 60 to 80 percent of memory through fragmentation. PagedAttention breaks the KV cache into fixed-size non-contiguous blocks and uses an indirection layer, similar to virtual memory in operating systems.
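To make the paging idea concrete, here is a toy sketch of block-based allocation; this illustrates the concept only and is not vLLM internals:</p><pre># Toy sketch of paged KV-cache allocation (concept only, not vLLM internals)<br>BLOCK_SIZE = 16                    # tokens per physical cache block<br>free_blocks = list(range(1024))    # the GPU cache, pre-divided into blocks<br><br>class BlockTable:<br>    &quot;&quot;&quot;Maps a request&#39;s logical token positions to physical blocks.&quot;&quot;&quot;<br>    def __init__(self):<br>        self.blocks = []<br><br>    def append_token(self, position):<br>        # A new block is claimed only every BLOCK_SIZE tokens, so no<br>        # large contiguous region is reserved up front per request<br>        if position % BLOCK_SIZE == 0:<br>            self.blocks.append(free_blocks.pop())<br><br>req = BlockTable()<br>for pos in range(40):              # a 40-token request<br>    req.append_token(pos)<br>print(req.blocks)                  # 3 blocks cover 40 tokens</pre><p>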
Combined with continuous batching, which lets new requests join an in-progress batch as soon as a slot opens, early benchmarks showed throughput 14 to 24 times higher than HuggingFace Transformers on the same hardware.</p><h4>vLLM Setup</h4><p>Getting a vLLM server running takes three commands:</p><pre># Install<br>pip install vllm<br><br># Launch an OpenAI-compatible server<br>vllm serve meta-llama/Llama-3.1-8B-Instruct \<br>  --dtype bfloat16 \<br>  --gpu-memory-utilization 0.9 \<br>  --port 8000</pre><p>For multi-GPU deployment, add --tensor-parallel-size 2 (or 4, 8) to distribute the model across GPUs. For Docker-based production, vLLM publishes an official image:</p><pre>docker run --gpus all -p 8000:8000 \<br>  -v ~/.cache/huggingface:/root/.cache/huggingface \<br>  vllm/vllm-openai:latest \<br>  --model meta-llama/Llama-3.1-8B-Instruct</pre><p>Once running, any OpenAI SDK client works by pointing to your endpoint:</p><pre>from openai import OpenAI<br>client = OpenAI(base_url=&quot;http://your-server:8000/v1&quot;, api_key=&quot;your-key&quot;)<br>response = client.chat.completions.create(<br>    model=&quot;meta-llama/Llama-3.1-8B-Instruct&quot;,<br>    messages=[{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;Summarize this contract...&quot;}]<br>)</pre><p>That is it. No cloud account, no per-token billing, no data leaving your infrastructure.</p><h4>SGLang</h4><p>SGLang’s core innovation is <strong>RadixAttention</strong>. Where vLLM manages memory efficiently, SGLang manages content efficiently. Traditional engines discard the KV cache after each request completes. SGLang stores cached prefixes in a radix tree, a data structure that represents shared token sequences. When a new request arrives with a matching prefix, such as the same system prompt, conversation history, or few-shot examples, SGLang reuses the cached computation instead of recalculating it. This produces cache hit rates of 85 to 95 percent for few-shot learning and 75 to 90 percent for multi-turn chat.</p><p>On NVIDIA H100 GPUs running Llama 3.1 8B with bfloat16 precision, SGLang delivers approximately 16,200 tokens per second versus vLLM’s 12,500 with FlashInfer, a 29 percent throughput gap. For multi-turn workloads, SGLang gains another 10 to 20 percent from cache hits. Its compressed finite state machine for constrained decoding makes JSON and XML generation up to 3 times faster than unconstrained generation with post-processing.</p><h4><strong>SGLang Setup</strong></h4><p>Setup is comparably simple:</p><pre># Install<br>pip install sglang[all]</pre><pre># Launch an OpenAI-compatible server<br>python -m sglang.launch_server \<br>  --model-path meta-llama/Llama-3.1-8B-Instruct \<br>  --port 30000 \<br>  --mem-fraction-static 0.9</pre><p>For multi-GPU, add --tp 2 (tensor parallel size).
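As a concept-only aside, the prefix reuse that RadixAttention formalizes can be reduced to a toy lookup (a simplified stand-in, not SGLang’s actual radix tree):</p><pre># Toy sketch of prefix reuse (concept only, not SGLang&#39;s radix tree)<br>def shared_prefix_len(cached, incoming):<br>    n = 0<br>    for a, b in zip(cached, incoming):<br>        if a != b:<br>            break<br>        n += 1<br>    return n<br><br>system = [&quot;You&quot;, &quot;are&quot;, &quot;a&quot;, &quot;helpful&quot;, &quot;assistant&quot;, &quot;.&quot;]<br>turn_1 = system + [&quot;Summarize&quot;, &quot;this&quot;, &quot;contract&quot;, &quot;.&quot;]<br>turn_2 = system + [&quot;List&quot;, &quot;the&quot;, &quot;termination&quot;, &quot;clauses&quot;, &quot;.&quot;]<br><br>reused = shared_prefix_len(turn_1, turn_2)<br>print(f&quot;{reused} of {len(turn_2)} tokens can be served from cache&quot;)</pre><p>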
Docker deployment uses the official image:</p><pre>docker run --gpus all -p 30000:30000 \<br>  -v ~/.cache/huggingface:/root/.cache/huggingface \<br>  lmsysorg/sglang:latest \<br>  python -m sglang.launch_server \<br>  --model-path meta-llama/Llama-3.1-8B-Instruct \<br>  --host 0.0.0.0</pre><p>The server exposes the same OpenAI-compatible API, so the same client code works by changing the base URL to port 30000.</p><h4>Choosing Between Them</h4><p>With both engines running the same models behind the same API surface, the decision comes down to workload pattern:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/900/1*NFOBltQxPSmIvTbJEGmjKg.png" /></figure><p><em>PagedAttention manages memory in blocks. RadixAttention manages content in trees.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/967/1*nPJbcj-sb0W9bCAC7oyGXQ.png" /></figure><p>Start with vLLM if you need the broadest model compatibility, the largest community for troubleshooting, or if your workload is primarily single-turn requests across heterogeneous GPU hardware. Choose SGLang if your workload involves multi-turn conversations, agent-based reasoning with tool calling, heavy use of structured outputs, or if you run dedicated GPU clusters where throughput per dollar is the primary metric.</p><h4>Scaling From Proof of Concept to Production</h4><p>Both engines ship with Docker images and support Kubernetes deployment behind standard GPU operators. Horizontal scaling uses request queue depth or GPU utilization as the autoscaling signal, the same pattern any containerized service uses.</p><p>The cost case strengthens as volume grows. Enterprise LLM API spending doubled to $8.4 billion in 2025, and per-token pricing scales linearly with usage. Self-hosted infrastructure has higher fixed costs but sublinear scaling. For multi-turn RAG applications, SGLang’s RadixAttention cache reuse can cut GPU footprint by nearly 30 percent versus vLLM, which translates to roughly $15,000 in monthly savings at one million requests per day.</p><p>Both engines handle multi-GPU deployment natively. vLLM has been deployed at Stripe-scale (50 million daily requests on a third of their original GPU fleet), and SGLang has been documented running DeepSeek across 96 H100s.</p><h3>What Self-Hosting Does Not Give You</h3><p>Self-hosting is not free of tradeoffs. Running inference infrastructure requires GPU procurement or cloud GPU allocation, driver and CUDA management, model version updates, performance monitoring, and security hardening. Stripe reportedly achieved 73 percent inference cost reduction by migrating to vLLM, but they also had a dedicated ML platform team to manage the transition.</p><p>Model quality is another consideration. Open weight models such as Llama 3.3 70B, Qwen 3.5, DeepSeek V3, and Mistral Large 3 are competitive with proprietary models on many tasks, but frontier capabilities from GPT-4o or Claude Opus may still exceed what is available in the open weight ecosystem for certain specialized benchmarks. 
The gap has narrowed substantially, and for the vast majority of enterprise use cases, including summarization, classification, extraction, code generation, and conversational AI, open weight models deliver production-quality results.</p><p>Organizations without existing GPU infrastructure or ML engineering capacity should consider managed self-hosting platforms that deploy on your cloud instances while handling operational complexity, or start with a proof of concept on a single GPU before committing to full-scale deployment.</p><h3>Key Takeaways</h3><ol><li>Commercial LLM APIs create an inherent compliance exposure for regulated industries. Self-hosted inference eliminates data leaving your network entirely.</li><li>vLLM (PagedAttention) and SGLang (RadixAttention) are production-grade engines used by Meta, xAI, NVIDIA, LinkedIn, and hundreds of other organizations at scale.</li><li>SGLang leads on raw throughput (~16,200 tok/s on H100) and excels at multi-turn, agentic, and structured output workloads. vLLM offers the broadest model and hardware compatibility.</li><li>Both engines expose OpenAI-compatible APIs, making migration from commercial APIs a configuration change, not a rewrite.</li><li>Start with vLLM for breadth and stability; move to SGLang when multi-turn performance and prefix caching drive measurable cost savings.</li></ol><h3>Where to Go Next</h3><p>If this article moved you from curious to committed, these are the resources worth bookmarking.</p><p><strong>vLLM</strong></p><ul><li>Official documentation: <a href="https://docs.vllm.ai">docs.vllm.ai</a></li><li>GitHub repository: <a href="https://github.com/vllm-project/vllm">github.com/vllm-project/vllm</a></li><li>Supported models list: <a href="https://docs.vllm.ai/en/latest/models/supported_models.html">docs.vllm.ai/en/latest/models/supported_models.html</a></li></ul><p><strong>SGLang</strong></p><ul><li>Official documentation: <a href="https://docs.sglang.ai">docs.sglang.ai</a></li><li>GitHub repository: <a href="https://github.com/sgl-project/sglang">github.com/sgl-project/sglang</a></li><li>LMSYS blog (release notes and benchmarks): <a href="https://lmsys.org/blog">lmsys.org/blog</a></li></ul><p><strong>Compliance references cited in this article</strong></p><ul><li>EDPB Opinion 28/2024 on AI models and personal data: <a href="https://www.edpb.europa.eu">edpb.europa.eu</a></li><li>EU AI Act full text: <a href="https://artificialintelligenceact.eu">artificialintelligenceact.eu</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=539ad81af04f" width="1" height="1" alt=""><hr><p><a href="https://medium.com/red-buffer/private-scalable-llm-inference-for-data-compliance-vllm-sglang-539ad81af04f">Private, Scalable LLM Inference for Data Compliance: vLLM &amp; SGLang</a> was originally published in <a href="https://medium.com/red-buffer">Red Buffer</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Harder Than It Sounds: An Introduction to Voice Agents]]></title>
            <link>https://medium.com/red-buffer/harder-than-it-sounds-an-introduction-to-voice-agents-ca8312cdf190?source=rss----8d86c116b0ef---4</link>
            <guid isPermaLink="false">https://medium.com/p/ca8312cdf190</guid>
            <category><![CDATA[conversational-ai]]></category>
            <category><![CDATA[voice-agent]]></category>
            <dc:creator><![CDATA[Aamina Binte Khurram]]></dc:creator>
            <pubDate>Tue, 21 Apr 2026 06:21:32 GMT</pubDate>
            <atom:updated>2026-04-21T07:10:30.624Z</atom:updated>
            <content:encoded><![CDATA[<p>Voice agents are a kind of conversational AI that use generative models to power human-like conversations. In this article, we’ll unpack what voice agents are, and take a look under the hood to see the systems that make this technology possible.</p><p>Before diving in, it helps to understand the two building blocks behind voice agents: agentic AI and conversational AI.</p><h3>Agentic AI</h3><p>An agent can be loosely defined as an entity (human or otherwise) that pursues goals and makes decisions to achieve them. In doing so, it operates autonomously, is capable of understanding its environment, reasoning and using the right tools to act effectively. (Think less chatbot, more undercover operative — but without the sunglasses and trench coat of course.)</p><p>An AI agent is simply a software system that applies these principles: it uses artificial intelligence to reason, take actions and pursue goals.</p><h3>Conversational AI</h3><p>If agentic AI is about doing, conversational AI is about talking. It refers to technologies that enable machines to understand and respond to human language in a natural, dialogue-based way. At the core of conversational AI are techniques from Natural Language Processing (NLP) and Natural Language Understanding (NLU), which help systems interpret intent, context, and sentiment in human speech or text.</p><h3>Voice Agents</h3><p>Voice agents are the best of both worlds. They are the voice-modality branch of conversational AI, but they also have agentic capabilities.</p><p>For example, imagine calling a restaurant to make a reservation. Instead of waiting for a staff member or navigating multiple menus, you speak naturally: “I’d like to make a reservation for tomorrow evening.” The voice agent at the other end understands your request, checks availability, asks follow-up questions, and books the slot, all within a fluid conversation.</p><p>In other words, voice agents don’t just talk; they can take actions, make decisions, and help complete tasks, all through natural spoken interaction.</p><h3>Architectural Overview</h3><p>To understand the architecture of a voice agent, we can follow along with the conceptual model of an actual human conversation. Imagine two people trying to decide on a movie to watch. In the interest of brevity (a virtue we’ll immediately abandon), let’s call them Person A and Person B.</p><p>Person A is a sci-fi fan, Person B is not. Person A makes several passionate arguments about why watching a sci-fi film will be an incredible experience, Person B makes several equally passionate arguments about why watching a sci-fi film will be an incredibly boring experience, and eventually manages to convince Person A that a new bank robbery movie is the better option.</p><p>During this exchange, Person A and B both thought of things to say, verbalised them, listened to what the other person said, understood the other person’s speech, adjusted their responses based on the content and style of the other person’s speech, and so on. Because choosing the right movie mattered to them both very much, Person A and Person B got a little impassioned at times, even cutting each other off at moments. But despite all the pauses and interruptions, they were able to continue conversing.</p><p>This cycle repeats every time we speak with others, and it all happens so fast we don’t even realise it. 
Asking a machine to perform all these tasks is certainly a tall order, and as we’ll discover shortly, there’s a lot going on behind the scenes to make a machine “conversant”.</p><p>When a user speaks to a voice agent, the first challenge is that LLMs (the AI programs at the heart of voice agents, capable of understanding and generating text) don’t actually understand speech. So before anything else, we need to convert the user’s speech into text. This is the job of a speech-to-text, or STT, model. Think of it as a transcriber, listening and converting spoken words into text the LLM can work with.</p><p>Once the LLM receives that text, it acts as the “brain” of the operation: understanding what the user said, deciding what to respond, and calling any relevant external tools where needed. It then produces a textual response. But of course, a voice agent doesn’t hand you a written note, it speaks. For this step, we use a text-to-speech, or TTS, model. Think of it as a narrator bringing the LLM’s response to life.</p><p>One shortcut worth noting is real-time LLMs. Unlike standard LLMs, they are multi-modal, capable of taking speech as input and producing speech as output. In some cases, this reduces the need for a strict STT-LLM-TTS pipeline, and can lower overall latency by removing intermediate steps. However, in most production voice systems today, the traditional pipeline is still widely used. It offers more control, modularity, and flexibility, especially when different components need to be swapped, tuned, or optimized independently.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/778/1*f3u4satiXsP8pyIpP0psFQ.png" /><figcaption>From <a href="https://livekit.com/voice-agents">https://livekit.com/voice-agents</a></figcaption></figure><p>That’s a lot of steps. To feel human-like, all this has to happen very, very fast. But speed alone doesn’t solve everything. Pauses and interruptions are central to natural conversation, and handling them gracefully is one of the trickier engineering challenges in voice AI. This brings us to perhaps the most underappreciated challenge in voice AI, and why getting conversation to feel natural is harder than it sounds.</p><h3>Why Voice Is Hard</h3><p>Human conversation is characterised by rapid turn-taking, with silent gaps between speakers typically around 200 milliseconds. Although producing language takes longer than this, we maintain the pace by predicting when the other person is about to finish speaking and preparing a response in advance.</p><p>For a voice agent to feel natural, it must do something similar: keep the delay between a user finishing their sentence and the agent beginning its response as short as possible, while still ensuring that the response is accurate and relevant.</p><p>This is harder than it sounds. Every interaction has to pass through multiple steps (speech-to-text, language understanding, and text-to-speech) and each of these introduces latency. Even small delays compound into noticeable conversational lag.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/779/1*ksYNHfepXowU4pG5aSlmYg.png" /><figcaption>From: <a href="https://livekit.com/blog/voice-agent-architecture-stt-llm-tts-pipelines-explained">https://livekit.com/blog/voice-agent-architecture-stt-llm-tts-pipelines-explained</a></figcaption></figure><p>Latency, however, is only part of the problem. The more subtle challenge is turn-taking.</p><p>Let’s return to our earlier example.
Imagine Person A enthusiastically explaining the plot of <em>Star Trek</em> in an attempt to convert Person B into a sci-fi fan (an effort we already know doesn’t succeed). This explanation won’t be a continuous monologue; Person A will pause frequently: to breathe, to emphasise a point, or to check that Person B is following along. Person B, in turn, will respond during these pauses, asking questions, adding comments, or occasionally interrupting altogether.</p><p>Now consider what happens if the timing is off:</p><ul><li>If Person B asks a question and Person A takes too long to respond, the delay feels awkward.</li><li>If Person A pauses briefly mid-sentence and Person B immediately jumps in, it feels like an interruption.</li></ul><p>In natural conversation, we handle this seamlessly. We instinctively know when to speak, when to wait, and when to stop if interrupted. Even when overlaps happen, they are brief and one person quickly yields the floor to the other.</p><p>For a machine however, this is much harder to get right.</p><p>Voice agents rely on Voice Activity Detection (VAD) to determine when a user has started or stopped speaking. These models are trained to detect patterns in audio that indicate speech boundaries. But real conversations are messy. A short pause to catch a breath can be mistaken for the end of a sentence, causing the agent to respond too early and cut the user off. On the other hand, waiting too long to be sure the user has finished speaking increases latency and makes the system feel sluggish.</p><p>On top of this, well-designed voice agents support barge-in, allowing users to interrupt the agent while it is speaking. This requires the system to immediately stop its output, process the new input, and shift context, all in real time.</p><p>Balancing these factors — latency, turn-taking, interruptions, and continuous audio processing — is what makes voice systems fundamentally more complex than their text-based counterparts.</p><h3>A Deeper Dive: What’s Under the Hood</h3><h3>STT</h3><p>Speech-to-text (STT) models, also known as Automatic Speech Recognition (ASR) models, convert a user’s spoken input into text, which can then be processed by the next stage of the voice agent pipeline: the LLM.</p><p>When you speak to a voice agent, the system doesn’t wait for you to finish your sentence before it starts processing. Instead, it continuously ingests audio and incrementally converts a stream of sound into text that the rest of the system can work with.</p><p><strong>How it works</strong></p><p>At a high level, an STT model takes a raw audio signal as input and transforms it into text through a series of steps. The audio waveform is first processed into acoustic features, which capture patterns in speech such as frequency and timing. These features are then passed through a neural model that predicts the most likely sequence of words corresponding to the input audio.</p><p>Modern STT systems combine acoustic modeling (understanding how speech sounds) with language modeling (understanding which word sequences are likely), allowing them to remain robust to noise and ambiguous speech.</p><p><strong>Streaming</strong></p><p>In voice agents, STT systems typically operate in streaming mode, meaning they transcribe audio in real time as it is being spoken. Rather than waiting for a complete utterance, the model produces interim (partial) transcripts that are continuously updated as more audio arrives.</p><p>This streaming behaviour is critical for reducing latency. 
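A toy sketch of that interim-then-final pattern (an invented event stream, not any specific vendor’s API):</p><pre># Toy interim-vs-final transcript stream (invented events, not a real STT API)<br>def fake_stt_stream():<br>    yield (&quot;interim&quot;, &quot;I&#39;d like&quot;)<br>    yield (&quot;interim&quot;, &quot;I&#39;d like to make a&quot;)<br>    yield (&quot;final&quot;, &quot;I&#39;d like to make a reservation for tomorrow evening.&quot;)<br><br>committed = None<br>for kind, text in fake_stt_stream():<br>    print(f&quot;[{kind}] {text}&quot;)<br>    if kind == &quot;final&quot;:<br>        committed = text  # only final text is safe to hand to the LLM<br><br>print(&quot;Sent to LLM:&quot;, committed)</pre><p>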
By producing partial transcripts early, the system can begin processing a user’s intent before they have finished speaking. However, it also introduces new challenges. Interim transcripts can change as more audio arrives, and deciding when a transcription is “final” becomes a non-trivial problem closely tied to turn-taking.</p><h3>LLM</h3><p>Once a user’s speech has been converted into text, the next step is to understand what was said and come up with an appropriate response. This is where the large language model (LLM) comes in.</p><p>The LLM acts as the core reasoning engine of a voice agent. It takes in the transcribed user input, along with any relevant conversation history, and generates an appropriate response. It can also decide to call external tools or APIs to perform actions such as retrieving information or completing tasks.</p><p><strong>How it works</strong></p><p>At a high level, an LLM is trained to predict the next token in a sequence of text, given the tokens that came before it. By repeatedly applying this process, it generates coherent responses one token at a time.</p><p>In a conversational setting, the input to the model typically includes:</p><ul><li>the current user utterance (from the STT)</li><li>prior conversation turns</li><li>system-level instructions (e.g. how the agent should behave)</li></ul><p>The model processes this context and produces a response incrementally, token by token, which can then be passed downstream for speech generation.</p><p><strong>What matters for voice agents</strong></p><p>While LLMs are often used in text-based systems, deploying them in a real-time voice setting introduces additional constraints.</p><p><em>1. Latency and streaming</em></p><p>Voice agents cannot wait for a full response before speaking. Instead, LLMs stream tokens as they are generated, allowing downstream components like the TTS model to begin producing audio early. Even small delays in token generation can introduce noticeable pauses, making latency a critical factor.</p><p><em>2. Response style</em></p><p>Voice interactions favour concise, natural responses. Outputs that are too long or overly structured may be acceptable in text, but can feel unnatural when spoken aloud. Prompts and system design in voice agents therefore aim to keep responses short and conversational.</p><p><em>3. Decision-making and context</em></p><p>The LLM is not just generating text, it is deciding what to do next. This may involve calling external tools or APIs and incorporating the results into its response. At the same time, it must maintain conversational context across turns, balancing relevance with the latency and cost of passing large conversational histories.</p><h3>TTS</h3><p>A text-to-speech (TTS) model is used to convert the LLM’s textual output into audio, allowing the agent to respond in speech. If the STT model is how the system listens, the TTS model is how it speaks, and it is often the most user-visible part of a voice agent’s pipeline.</p><p><strong>How it works</strong></p><p>At a high level, a TTS model takes text as input and generates a corresponding audio waveform. Modern neural TTS models first convert text into intermediate representations (such as phonemes or spectrograms), and then generate speech that captures not just the words, but also aspects like timing, pitch, and intonation.</p><p><strong>What matters for voice agents</strong></p><p><em>1. Time to first audio and streaming</em></p><p>Voice agents must begin speaking quickly. 
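The sketch below shows the chunked, interruptible playback pattern described next (invented interfaces, concept only):</p><pre># Toy chunked TTS playback with barge-in (invented interfaces, concept only)<br>import threading<br><br>barge_in = threading.Event()  # set by VAD when the user starts speaking<br><br>def synthesize_chunks(text):<br>    for word in text.split():  # each word stands in for an audio chunk<br>        yield word<br><br>def speak(text):<br>    for chunk in synthesize_chunks(text):<br>        if barge_in.is_set():  # user interrupted: stop playback at once<br>            print(&quot;\n[playback stopped]&quot;)<br>            return<br>        print(chunk, end=&quot; &quot;, flush=True)<br>    print()<br><br>speak(&quot;Your table for two is booked for tomorrow at seven.&quot;)</pre><p>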
Rather than waiting for the full response, TTS systems generate audio in chunks, allowing playback to start while the rest of the response is still being synthesized. This significantly reduces perceived latency.</p><p><em>2. Naturalness and prosody</em></p><p>The quality of the generated voice — including its tone, pacing, and expressiveness — has a direct impact on user experience. Even correct responses can feel unnatural if the speech sounds robotic or poorly timed.</p><p><em>3. Interruptibility</em></p><p>In real conversations, speakers can be interrupted. Voice agents must support this as well, meaning TTS output needs to be stoppable mid-generation so the system can respond to new user input without delay.</p><h3>Conclusion</h3><p>Building a voice agent quickly reveals that most of the hard problems are not about generating correct outputs, but about doing so under strict latency constraints while maintaining a natural conversational flow. Small issues in transcription delay, response length, or speech timing can significantly affect how “human” the system feels, even when each individual component is functioning correctly. In practice, the difference between a working system and a genuinely good voice experience is not just model quality, but how well the pieces are orchestrated under real-time pressure.</p><p>The stakes of getting this right become especially clear in high-sensitivity contexts. Consider a voice agent collecting medical information before a patient consultation. In such a use case, every component’s limitations carry real weight. A transcription error means the clinician receives inaccurate information. A hallucinated LLM response could mislead a patient about their condition. A robotic or poorly timed TTS voice can feel cold in a moment that calls for warmth and reassurance.</p><p>These issues are not unique to healthcare; they appear in almost any real-world voice interface. Unlike traditional GUIs, where the designer controls the interaction pathways, voice agents operate in open-ended territory. You cannot predict every way a user will phrase a request, and you cannot fully constrain what the model will say in response. Building well means accepting that uncertainty and designing for it. In practice, this requires careful prompting, continuous iteration, and a clear understanding of what each component in the pipeline can and cannot do reliably.</p><p>Have anything to add or improve? Please feel free to reach out on <a href="https://www.linkedin.com/in/aamina-binte-khurram/">LinkedIn</a>!</p><p>References:</p><ul><li><a href="https://livekit.com/blog/voice-agent-architecture-stt-llm-tts-pipelines-explained">Voice Agent Architecture: STT, LLM, and TTS Pipelines Explained</a></li><li><a href="https://www.freshworks.com/freshdesk/ai-agent/vs-conversational-ai/">Agentic AI vs Conversational AI: Key Differences Explained</a></li><li><a href="https://www.ibm.com/think/topics/ai-agents">What Are AI Agents? | IBM</a></li><li><a href="https://cloud.google.com/discover/what-are-ai-agents">What are AI agents? Definition, examples, and types</a></li><li><a href="https://docs.langchain.com/oss/python/langchain/voice-agent">Build a voice agent with LangChain - Docs by LangChain</a></li><li><a href="https://vapi.ai/blog/what-is-speech-to-text">Speech-to-Text: What It Is, How It Works, &amp; Why It Matters - Vapi AI Blog</a></li><li><a href="https://www.ibm.com/think/topics/text-to-speech">What is Text to Speech? 
| IBM</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ca8312cdf190" width="1" height="1" alt=""><hr><p><a href="https://medium.com/red-buffer/harder-than-it-sounds-an-introduction-to-voice-agents-ca8312cdf190">Harder Than It Sounds: An Introduction to Voice Agents</a> was originally published in <a href="https://medium.com/red-buffer">Red Buffer</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[LangChain vs LangGraph: Choosing the Right AI Workflow Framework]]></title>
            <link>https://medium.com/red-buffer/langchain-vs-langgraph-choosing-the-right-ai-workflow-framework-1718e21199f0?source=rss----8d86c116b0ef---4</link>
            <guid isPermaLink="false">https://medium.com/p/1718e21199f0</guid>
            <dc:creator><![CDATA[Tauseefahmed Tam]]></dc:creator>
            <pubDate>Tue, 10 Feb 2026 06:40:00 GMT</pubDate>
            <atom:updated>2026-02-10T06:39:59.412Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*6pTZp0HvU0ZBBUUSRsVSBw.png" /></figure><p>Large language model (LLM) frameworks now offer multiple ways to structure your pipelines. LangChain and LangGraph come from the same ecosystem but serve different needs.</p><p><strong>LangChain</strong> is a linear, chain-of-operations framework ideal for straight-line workflows you already understand.</p><p><strong>LangGraph</strong> is a low-level, graph-based orchestrator for complex, stateful agent systems. The choice isn’t automatic: for clear sequential tasks, LangChain’s simplicity is often faster and more efficient.</p><h4>Architecture &amp; Primitives</h4><blockquote><em>LangChain is primarily optimized for linear workflows; LangGraph routes state through a branching graph of nodes.</em></blockquote><p>LangChain is designed around linear pipelines, where prompts, models, retrievers, and tools are executed in a fixed sequence. This makes it well-suited for simple, well-defined workflows like retrieval followed by summarization and answering. LangGraph, on the other hand, represents workflows as stateful graphs, allowing nodes to branch, loop, and revisit earlier steps. By passing a shared state through the graph, LangGraph natively supports conditional logic, retries, checkpoints, and human-in-the-loop interactions, capabilities that are difficult to model cleanly with linear chains.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*hAKmUsoyVMkJSAC8m8fawg.png" /><figcaption>Figure: Simplified architectural patterns. (Left) A LangChain pipeline executes each component in order. (Right) A LangGraph workflow is cyclic/graph-based, with an LLM agent node, conditional “Need tool?” logic, and the possibility of looping.</figcaption></figure><p>Under the hood, LangChain provides building blocks like prompt templates, memory modules, vector retrievers and tool interfaces, all composed into chains or agents.</p><blockquote>The mental model is like Lego blocks: small operations fit together in a fixed sequence.</blockquote><p>LangGraph, by contrast, is very low-level and <strong>stateful. </strong>You define <em>state schemas</em> and <em>graph nodes</em> in code. The framework guarantees <em>durable execution</em>: on failure or pause, you can checkpoint and resume from the last state.
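A minimal sketch of that model, assuming langgraph’s current StateGraph API, with plain Python functions standing in for LLM calls:</p><pre># Minimal LangGraph sketch: a node that loops until a check passes<br># (assumes the langgraph package; plain functions stand in for LLM calls)<br>from typing import TypedDict<br>from langgraph.graph import StateGraph, END<br><br>class State(TypedDict):<br>    question: str<br>    draft: str<br>    attempts: int<br><br>def generate(state: State) -&gt; dict:<br>    # Stand-in for an LLM call; the returned dict merges into shared state<br>    return {&quot;draft&quot;: f&quot;answer: {state[&#39;question&#39;]}&quot;, &quot;attempts&quot;: state[&quot;attempts&quot;] + 1}<br><br>def should_continue(state: State) -&gt; str:<br>    # Conditional edge: loop back until the draft passes (or we give up)<br>    return END if &quot;answer&quot; in state[&quot;draft&quot;] or state[&quot;attempts&quot;] &gt;= 3 else &quot;generate&quot;<br><br>graph = StateGraph(State)<br>graph.add_node(&quot;generate&quot;, generate)<br>graph.set_entry_point(&quot;generate&quot;)<br>graph.add_conditional_edges(&quot;generate&quot;, should_continue)<br>app = graph.compile()<br>print(app.invoke({&quot;question&quot;: &quot;What changed?&quot;, &quot;draft&quot;: &quot;&quot;, &quot;attempts&quot;: 0}))</pre><p>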
It even offers streaming outputs, long-term memory (persistence and checkpointing are built in, but memory storage still needs to be designed explicitly), and visual debugging tools for production agents.</p><p>In short, looping or revisiting steps is possible in LangChain’s DAG but awkward, whereas LangGraph’s directed graph explicitly allows revisiting prior steps or branching off: a powerful but heavier abstraction.</p><h4>When to Stick with LangChain</h4><ul><li><strong>Linear Pipelines:</strong> LangChain is ideal for straightforward, sequential workflows like text transformation or RAG, where each step feeds cleanly into the next and no mid-pipeline decision-making is required.</li><li><strong>Simple RAG and Q&amp;A Systems:</strong> Static knowledge-base applications such as FAQ bots or document summarizers fit naturally into LangChain’s model, requiring only retrieval, prompting, and optional short-term memory.</li><li><strong>Fast Prototyping:</strong> LangChain minimizes boilerplate and cognitive overhead, making it easier to iterate quickly compared to heavier agent or graph-based frameworks.</li><li><strong>Lower Operational Overhead:</strong> For short-lived or single-shot requests, LangChain avoids the additional orchestration complexity introduced by stateful graph execution.</li><li><strong>Latency-Sensitive Use Cases:</strong> Simpler execution paths generally result in lower end-to-end latency, making LangChain better suited for real-time or “quick hit” queries.</li><li><strong>Stateless or Lightly Stateful Systems:</strong> When requests are independent and long-term state or branching logic isn’t required, LangChain is typically the more efficient and pragmatic choice.</li></ul><h4>When LangGraph Shines</h4><ul><li><strong>Stateful, Long-Running Workflows:</strong> LangGraph is well-suited for systems that require persistent state across steps or sessions, such as virtual assistants or multi-turn reasoning workflows, with built-in checkpointing and recovery.</li><li><strong>Conditional Logic and Loops:</strong> Native support for branching, retries, and looping allows workflows to adapt dynamically based on outcomes, without embedding complex control logic throughout the codebase.</li><li><strong>Multi-Agent Coordination:</strong> Graph-based execution enables multiple specialized agents or tools to operate in parallel or sequence, making complex task orchestration natural and maintainable.</li><li><strong>Human-in-the-Loop Workflows:</strong> LangGraph supports pausing execution for review, approval, or intervention, with full state visibility and resumability built in.</li><li><strong>Production-Grade Resilience:</strong> Designed for durable, mission-critical systems, LangGraph offers crash recovery, detailed tracing, and scalability features that reduce operational risk in complex AI applications.</li></ul><h4>Performance and Complexity Trade-offs</h4><p>LangChain has a lighter execution and cognitive footprint, making it well-suited for short-lived, low-latency, and retrieval-heavy workloads where additional orchestration would add unnecessary overhead. For simple queries or linear pipelines, LangChain — or even direct LLM calls — often delivers better performance and faster iteration. LangGraph introduces extra per-step overhead through state management and scheduling, but this cost pays off when workflows require branching, retries, persistence, or re-planning.
In those cases, LangGraph outperforms ad-hoc chain-based solutions both in clarity and execution. From a productivity standpoint, LangChain is easier to learn and reason about, while LangGraph’s graph and state-machine model carries higher upfront complexity, making it a poor fit for simple tasks where it would slow teams down rather than accelerate them. As one experienced engineer puts it:</p><blockquote><em>“Avoid implementing a full graph when no loops or persistent state are needed. Over-engineering reduces clarity and increases maintenance cost.”</em></blockquote><p>On the other hand, when an application truly needs a multi-turn agent or fine-grained control, trying to hack it with LangChain often leads to fragile code and hidden bugs.</p><h3>Decision Checklist</h3><p>A quick heuristic often helps teams decide:</p><ul><li><strong>Single-pass vs Multi-pass:</strong> If your workflow is a one-shot pipeline with known steps (e.g. RAG answering or document Q&amp;A), choose <strong>LangChain</strong>. If you foresee multiple decision points or a loop-until-done pattern, lean <strong>LangGraph</strong>.</li><li><strong>Statefulness:</strong> Do you need to remember information across user interactions or iterations? If yes (long chat memory, task lists, session state), LangGraph is preferable. If each call is independent or only uses short-term memory, LangChain (with a simple memory module) is sufficient.</li><li><strong>Error Handling:</strong> For pipelines that must retry or continue on errors, LangGraph has built‑in support (checkpoints, retry edges). In LangChain you’d have to manually catch exceptions and re-call chains.</li><li><strong>Complexity vs Speed:</strong> If rapid prototyping or iteration speed is critical, start with LangChain; it gets you un-stuck faster. Only graduate to LangGraph once your requirements demand it (branching logic, long sessions, human steps).</li><li><strong>Operational Concerns:</strong> Consider deployment. LangChain’s stateless design is easy to deploy and fits well in simple serverless or containerized setups. LangGraph (with its persistent threads) may need additional infrastructure (like a managed agent platform). If your environment is constrained, factor that in.</li><li><strong>Trends:</strong> The community feedback is clear: <strong>there’s no one-size-fits-all standard</strong> anymore. Use LangChain when its abstraction cuts your work; use LangGraph when you’re tackling long-lived, agentic workflows that really need it.</li></ul><h4>Conclusion</h4><p>Avoid over-engineering. If a plain chain does the job, run with it. Only graphify your architecture once your requirements grow beyond the capabilities of a simple chain. By aligning the framework to the problem complexity, teams can move faster and stay focused on building value, not wrestling with unnecessary abstraction.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1718e21199f0" width="1" height="1" alt=""><hr><p><a href="https://medium.com/red-buffer/langchain-vs-langgraph-choosing-the-right-ai-workflow-framework-1718e21199f0">LangChain vs LangGraph: Choosing the Right AI Workflow Framework</a> was originally published in <a href="https://medium.com/red-buffer">Red Buffer</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Agentic AI Systems with Integrated XAI: Building Smarter, Transparent, and Trustworthy AI]]></title>
            <link>https://medium.com/red-buffer/agentic-ai-systems-with-integrated-xai-building-smarter-transparent-and-trustworthy-ai-087502181fd0?source=rss----8d86c116b0ef---4</link>
            <guid isPermaLink="false">https://medium.com/p/087502181fd0</guid>
            <dc:creator><![CDATA[Tauseefahmed Tam]]></dc:creator>
            <pubDate>Fri, 06 Feb 2026 07:35:34 GMT</pubDate>
            <atom:updated>2026-02-06T07:35:33.972Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*BThywYG3Nts04Q27twvnww.png" /></figure><p>Artificial Intelligence (AI) has evolved from simple predictive models into agentic systems. Agentic systems are AI entities that can reason, plan, and take autonomous actions. But as AI becomes more capable, one critical question arises:</p><blockquote>How do we trust its decisions?</blockquote><p>This is where Explainable AI (XAI) enters the stage. By integrating XAI into agentic AI systems, we can build agents that not only act but also explain why they act the way they do.</p><h4>🧠 What are Agentic AI Systems?</h4><p>An agentic AI system is more than just a machine learning model. Unlike traditional ML models that passively output predictions (e.g., “spam” vs. “not spam”), an agentic system can:</p><p>Perceive its environment (through inputs like text, images, or sensor data)</p><p><em>Reason about goals and constraints</em></p><p><em>Plan a sequence of actions</em></p><p><em>Act autonomously to achieve objectives</em></p><p><em>Learn and Adapt from new data and outcomes</em></p><p>Think of ChatGPT with internet access, or a trading bot that adjusts strategies dynamically. These are agentic because they operate like intelligent decision-makers.</p><h4>🔎 Why Do We Need XAI (Explainable AI)?</h4><p>The more autonomy we give to AI, the more important <strong>trust and accountability</strong> become.</p><p>Imagine:</p><ul><li>A healthcare agent suggests a treatment plan.</li><li>A financial agent executes trades on your behalf.</li><li>A legal agent reviews contracts.</li></ul><p>Would you follow their advice blindly? Probably not.</p><p>This is why <strong>Explainable AI (XAI)</strong> is critical: it makes decisions <strong>transparent</strong> by showing reasoning, evidence, and trade-offs.</p><p>Benefits of integrating XAI into agents:</p><ul><li><strong>Trust</strong> → Users can see <em>why</em> a decision was made</li><li><strong>Debugging</strong> → Engineers can identify biases or flaws</li><li><strong>Compliance</strong> → Regulations (like GDPR) require explanation of automated decisions</li><li><strong>Adoption</strong> → People are more likely to use AI they understand</li></ul><h4>⚙️ Architecture of Agentic AI with XAI</h4><p>An <strong>agentic AI system with XAI</strong> typically has these 5 layers:</p><ol><li><strong>Perception Layer</strong> → Collects input (text, data, sensors)</li><li><strong>Cognition Layer (Reasoning Engine)</strong> → Uses ML + symbolic reasoning</li><li><strong>Decision Layer</strong> → Plans and selects best action</li><li><strong>Action Layer</strong> → Executes the decision</li><li><strong>XAI Layer (Transparency Module)</strong> → Generates human-readable explanations</li></ol><pre>[Input Data] → [Perception] → [Reasoning/Planning] → [Decision] → [Action]<br>                                                    ↘ [XAI Explanation]</pre><h4>💻 Coding a Simple Agent with XAI</h4><p>Let’s build a <strong>toy AI agent</strong> that:</p><ul><li>Takes a <strong>goal</strong> (e.g., recommend a movie)</li><li>Makes a <strong>decision</strong> using ML</li><li>Provides a <strong>human-friendly explanation</strong> with SHAP (SHapley Additive exPlanations)</li></ul><p><strong>Step 1: Build a simple recommendation agent</strong></p><pre>import shap<br>import numpy as np<br>from sklearn.datasets import load_iris<br>from sklearn.ensemble import RandomForestClassifier<br><br># Step 1: Environment perception (load 
dataset)<br>data = load_iris()<br>X, y = data.data, data.target<br><br># Step 2: Cognition (train ML model)<br>model = RandomForestClassifier(n_estimators=50)<br>model.fit(X, y)<br><br># Step 3: Agent makes a decision<br>sample = X[42].reshape(1, -1)<br>prediction = model.predict(sample)[0]<br>decision = data.target_names[prediction]<br><br>print(f&quot;Agent&#39;s decision: Recommend &#39;{decision}&#39;&quot;)</pre><h4>Step 2: Add XAI explanation with SHAP</h4><pre># Explain WHY the decision was made<br>feature_names = data.feature_names  # feature metadata used below<br>n_features = len(feature_names)<br>explainer = shap.TreeExplainer(model)<br>shap_values = explainer.shap_values(sample)<br>shap_values_for_class = shap_values[prediction][0]<br><br># Handle case where number of SHAP values doesn&#39;t match number of features<br>if len(shap_values_for_class) != n_features:<br>    # Pad SHAP values with zeros to match number of features<br>    padded_shap = np.zeros(n_features)<br>    padded_shap[:len(shap_values_for_class)] = shap_values_for_class<br>    shap_values_for_class = padded_shap<br><br>print(&quot;\nExplanation (Top features influencing decision):&quot;)<br>for i, feature_name in enumerate(feature_names):<br>    if i &lt; len(shap_values_for_class):<br>        shap_value = shap_values_for_class[i]<br>    else:<br>        shap_value = 0.0  # Default if SHAP value is missing<br>    print(f&quot;{feature_name}: {shap_value:.3f}&quot;)</pre><p><strong>Output Example:</strong></p><pre>Agent&#39;s decision: Recommend &#39;setosa&#39;<br><br>Explanation (Top features influencing decision):<br>sepal length (cm): 0.066<br>sepal width (cm): -0.039<br>petal length (cm): -0.028<br>petal width (cm): 0.000</pre><p>This means the agent didn’t just pick a flower; it also reported how strongly each feature influenced that choice (in this run, sepal length and sepal width carry the most weight).</p><h4>🌍 Real-World Applications</h4><p>Healthcare Agents → AI diagnosis systems explaining risk factors for diseases</p><p>Financial Agents → Trading bots that justify investment moves</p><p>Legal Agents → Contract analysis with traceable reasoning</p><p>Customer Service Agents → Transparent recommendations, not “black box” replies</p><p>Autonomous Vehicles → Explaining navigation decisions in case of accidents</p><h4>🚀 Future of Agentic AI with XAI</h4><p>The future is moving towards self-reflective agents: systems that:</p><p>Explain their decisions in natural language</p><p>Audit themselves for bias</p><p>Provide confidence levels for their outputs</p><p>Learn user preferences to give personalized explanations</p><p>Imagine a world where your AI assistant doesn’t just tell you what to do but also explains why it’s the best option for you specifically.</p><blockquote>With more AI comes more explanation — ML Engineer</blockquote><h4>Conclusion</h4><p>Agentic AI is shaping the next generation of intelligent systems that are autonomous, adaptive, and action-driven. But without explainability, they risk becoming black boxes we can’t trust.
The combination of Agentic AI + XAI paves the way for responsible, ethical, and human-aligned AI.</p><p>Code notebook: <a href="https://colab.research.google.com/drive/1gV-Ca5Lzt7mWPm_zMkdgPU58ZS9Xqm3_?usp=sharing">https://colab.research.google.com/drive/1gV-Ca5Lzt7mWPm_zMkdgPU58ZS9Xqm3_?usp=sharing</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=087502181fd0" width="1" height="1" alt=""><hr><p><a href="https://medium.com/red-buffer/agentic-ai-systems-with-integrated-xai-building-smarter-transparent-and-trustworthy-ai-087502181fd0">Agentic AI Systems with Integrated XAI: Building Smarter, Transparent, and Trustworthy AI</a> was originally published in <a href="https://medium.com/red-buffer">Red Buffer</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Deploying Python Apps on Google Cloud Run using Gitlab pipelines(CI/CD)]]></title>
            <link>https://medium.com/red-buffer/deploying-python-apps-on-google-cloud-run-using-gitlab-pipelines-ci-cd-702508b3d1eb?source=rss----8d86c116b0ef---4</link>
            <guid isPermaLink="false">https://medium.com/p/702508b3d1eb</guid>
            <dc:creator><![CDATA[Abdul Rehman Raja]]></dc:creator>
            <pubDate>Tue, 03 Feb 2026 09:34:18 GMT</pubDate>
            <atom:updated>2026-02-03T09:34:17.151Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*d7rLIlvjr-c29LR2.png" /></figure><p>If you are a software developer, you must have heard the CI/CD buzzwords. Today let’s break this down and understand how to implement a CI/CD pipeline for deploying your Python applications on Google Cloud Run’s serverless architecture using GitLab pipelines.</p><h3>CI/CD</h3><p>CI/CD stands for Continuous Integration and Continuous Delivery/Deployment. Continuous integration (CI) refers to the practice of automatically and frequently integrating code changes into a shared source code repository. Continuous delivery and/or deployment (CD) is a two-part process that refers to the integration, testing, and delivery of code changes. Continuous delivery stops short of automatic production deployment, while continuous deployment automatically releases the updates into the production environment.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/826/0*lYAhJjTq8mTSX1uv" /></figure><h3>Google Cloud Run</h3><p>Google Cloud Run is a fully managed serverless platform that allows developers to deploy and scale containerized applications easily. As Cloud Run is serverless, you do not have to worry about managing and scaling the compute resources; Google handles that for you.</p><h3>GitLab</h3><p>GitLab is a web-based platform built on Git (a distributed version control system) that helps development and engineering teams manage their code. It provides the following features:</p><ol><li>Source Code Management</li><li>Continuous integration and delivery (CI/CD)</li><li>Code Review</li><li>Security</li><li>AI-powered workflows</li><li>Data Analytics</li></ol><h3>Prerequisites</h3><p>Before getting started with our CI/CD journey, here are the prerequisites for our exercise:<br>1. GitLab repository<br>2. Google Cloud Platform (GCP) account with billing set up<br>3. Docker</p><p>Now let’s dive into implementing our first CI/CD pipeline.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2KitHU87siUgQwzLaAOilg.png" /></figure><h3>1. Install and Authenticate Google Cloud SDK</h3><p>Run this command:</p><pre>snap install google-cloud-cli</pre><p>This command downloads the Google Cloud CLI on your Ubuntu system; the steps differ for macOS and Windows. Now authenticate your GCP account using:</p><pre>gcloud auth login</pre><p>This should redirect you to Google’s login page, where you can choose or add your account for logging into Google Cloud.</p><h3>2. Setup Artifact Registry and Cloud Build API</h3><p>On your GCP Console, click on API &amp; Services</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/480/1*jObMmuD-JpTZzsOSpqbgvw.png" /></figure><p>Search for the Artifact Registry API and enable it if it is not already enabled</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/763/1*hYcogMX6QGu1xIJImkj6KQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/959/1*0Fg7h0gjCsAgkv7Rq0TT8g.png" /></figure><p>Now search for Artifact Registry in the search tab and create a new repository as shown:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yxTIDu5wFFVRM7CKefwJ2A.png" /></figure><p>Similarly, search for the Cloud Build API and enable it from the API &amp; Services section</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/959/1*eGer-vmgpcGERuEJDT5-fg.png" /></figure><h3>3. 
Set Up a Service Account</h3><p>Go to IAM &amp; Admin and click Service Accounts.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/247/1*CjmHk23lKUsTJLKNgq2N3A.png" /></figure><p>Click on + Create Service Account and then follow these steps:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/601/1*1ALRh6KNpfdQ3Na1HMQdRg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/624/1*W4FB6uBJ7Uml4SG24qQFVw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/615/1*Z940R9XEX2GDpIXwRlk-Lg.png" /></figure><p>Now let’s generate a security key to access the service account.</p><p>Click on the service account that you have created and then click on the “KEYS” tab.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/595/1*w2WVSrsF4mQHsVr2MvH2dA.png" /></figure><p>Create a new JSON key (it will be downloaded automatically) and save it securely.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/595/1*dvPvcjW4ZV2OB5Dqd8HjDQ.png" /></figure><h3>4. Set Up the Project Locally</h3><p>Run this command to point your local gcloud at the correct GCP project:</p><pre>gcloud config set project &lt;PROJECT_ID&gt;</pre><h3>5. Set Up CI/CD Variables on GitLab</h3><p>Open your GitLab repository and go to Settings -&gt; CI/CD.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/508/1*al7yXBbpDg_Nn8bHpuyhrA.png" /></figure><p>Add your variables as shown:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OrT9ztAeruLgfsaj5S1dUg.png" /></figure><p>Here’s a short description of these variables:</p><ul><li>The <strong>.env</strong> file is copied as-is and pasted into the <strong>STAGING_ENV</strong> variable.</li><li><strong>GCP_PROJECT_ID</strong> can be copied from the Google Cloud Console home page once you select the project that you are working on. Remember, this is not the service account ID.</li><li><strong>GCP_CLOUD_BUILD_SERVICE_KEY</strong> is simply the JSON key that was downloaded at the end of step 3. Copy the JSON from that file and paste it into this variable.</li></ul><h3>6. Writing the Dockerfile</h3><p>Google Cloud Run only runs applications packaged as Docker containers. Containers are created from images, which are built using a Dockerfile, so let’s write one now:</p><pre>FROM python:3.11-slim<br><br>WORKDIR /app<br><br># Install system dependencies<br>RUN apt-get update -y &amp;&amp; apt-get install -y default-jre &amp;&amp; apt-get clean &amp;&amp; rm -rf /var/lib/apt/lists/*<br><br># Copy code to the working directory<br>COPY . .<br><br># Install Python dependencies<br>RUN pip install --no-cache-dir -r requirements.txt<br><br># Expose the application port<br>EXPOSE 8080<br><br># Run the application<br>CMD [&quot;chainlit&quot;, &quot;run&quot;, &quot;chainlit_app.py&quot;, &quot;--port&quot;, &quot;8080&quot;, &quot;-h&quot;]</pre><p>Remember, to make our lives easy we run the application on port 8080, which Cloud Run expects by default; otherwise, you can look into how to bind other ports on Cloud Run as well.</p><h3>7. Writing the Deployment Job</h3><p>Now that we have set our variables, we can proceed to writing the GitLab CI job for deployment. 
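As a quick aside, the Dockerfile’s CMD above launches chainlit_app.py; if you just want something minimal to test the pipeline with, here is a purely illustrative sketch (substitute your own application logic):</p><pre>import chainlit as cl<br><br># Minimal Chainlit app matching the Dockerfile&#39;s CMD; purely illustrative.<br># Replace the echo logic with your own chains or model calls.<br>@cl.on_message<br>async def main(message: cl.Message):<br>    # Echo the user&#39;s message back so the deployment can be smoke-tested<br>    await cl.Message(content=f&quot;You said: {message.content}&quot;).send()</pre><p>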
Here’s the .gitlab-ci.yml file:</p><pre>image: google/cloud-sdk:alpine<br><br>deploy_staging:<br>  stage: deploy<br>  environment: staging<br>  only:<br>    - staging<br>  script:<br>    - echo $ARTIFACT_REPO<br>    - echo $GCP_PROJECT_ID<br>    - cp $GCP_CLOUD_BUILD_SERVICE_KEY /tmp/gcloud-service-key.json<br>    - cp $STAGING_ENV /tmp/.env<br>    - cp /tmp/.env .env<br>    - gcloud auth activate-service-account --key-file /tmp/gcloud-service-key.json<br>    - gcloud config set project $GCP_PROJECT_ID<br>    - gcloud builds submit --config=cloudbuild.yaml --substitutions=_PROJECT_ID=$GCP_PROJECT_ID,_ARTIFACT_REPO=$ARTIFACT_REPO,_ENV_IMG_NAME=$ENV_IMG_NAME,_ENV=$ENV,_SERVICE_NAME=$SERVICE_NAME .<br>  variables:<br>    ENV_IMG_NAME: &lt;image_name&gt;<br>    ENV: &quot;staging&quot;<br>    ARTIFACT_REPO: &lt;artifact_repository_name&gt;<br>    SERVICE_NAME: &lt;service_name&gt;</pre><p>Your cloudbuild.yaml file should look something like this:</p><pre>steps:<br>  # build the container image<br>  - name: &quot;gcr.io/cloud-builders/docker&quot;<br>    args:<br>      [<br>        &quot;build&quot;,<br>        &quot;-t&quot;,<br>        &quot;us-central1-docker.pkg.dev/$_PROJECT_ID/$_ARTIFACT_REPO/$_ENV_IMG_NAME:$_ENV&quot;,<br>        &quot;.&quot;,<br>      ]<br>    # push the container image<br>  - name: &quot;gcr.io/cloud-builders/docker&quot;<br>    args:<br>      [<br>        &quot;push&quot;,<br>        &quot;us-central1-docker.pkg.dev/$_PROJECT_ID/$_ARTIFACT_REPO/$_ENV_IMG_NAME:$_ENV&quot;,<br>      ]<br>    # deploy to Cloud Run<br>  - name: &quot;gcr.io/cloud-builders/gcloud&quot;<br>    args:<br>      [<br>        &quot;run&quot;,<br>        &quot;deploy&quot;,<br>        &quot;$_SERVICE_NAME&quot;,<br>        &quot;--image&quot;,<br>        &quot;us-central1-docker.pkg.dev/$_PROJECT_ID/$_ARTIFACT_REPO/$_ENV_IMG_NAME:$_ENV&quot;,<br>        &quot;--region&quot;,<br>        &quot;us-central1&quot;,<br>        &quot;--platform&quot;,<br>        &quot;managed&quot;,<br>        &quot;--allow-unauthenticated&quot;,<br>        &quot;--memory&quot;,<br>        &quot;2Gi&quot;,<br>      ]<br>options:<br>  logging: CLOUD_LOGGING_ONLY</pre><h3>8. Push the Code</h3><p>Before pushing the code to the staging branch of your GitLab repository (yours may differ if you are following this tutorial on another branch), verify this directory structure:</p><pre>📂 Project Root<br>│── 📄 chainlit_app.py        # Main application file<br>│── 📂 src/                   # Source files directory<br>│   ├── &lt;src files&gt;.py        # Python source files<br>│── 📄 cloudbuild.yaml        # Google Cloud Build configuration<br>│── 📄 .gitlab-ci.yml         # GitLab CI/CD pipeline configuration<br>│── 📄 Dockerfile             # Docker container configuration<br>│── 📄 requirements.txt       # Dependencies list</pre><p>Finally, we have completed our CI/CD pipeline, and the way to test it is by pushing the code to the staging branch (change .gitlab-ci.yml for other branches). After pushing the code, you can log in to GitLab &gt; Your Repository &gt; Pipelines and see the pipeline running in action. 
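Once the service is live, you can also smoke-test it with a few lines of Python; the URL below is a placeholder, so use your own Cloud Run service URL:</p><pre>import requests<br><br># Placeholder URL; copy the real one from the Cloud Run console<br>SERVICE_URL = &quot;https://your-service-xxxxx-uc.a.run.app&quot;<br><br>resp = requests.get(SERVICE_URL, timeout=30)<br>print(resp.status_code)  # 200 means the container is up and serving</pre><p>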
The deployment logs are available there as well.</p><p>After the pipeline has run successfully, open Google Cloud Console &gt; Cloud Run, where you can see your service in action and copy its URI and public URL.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=702508b3d1eb" width="1" height="1" alt=""><hr><p><a href="https://medium.com/red-buffer/deploying-python-apps-on-google-cloud-run-using-gitlab-pipelines-ci-cd-702508b3d1eb">Deploying Python Apps on Google Cloud Run using Gitlab pipelines(CI/CD)</a> was originally published in <a href="https://medium.com/red-buffer">Red Buffer</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[🛡️ Create a Personal VPN on AWS EC2 Using OpenVPN — Easy & Complete Walkthrough]]></title>
            <link>https://medium.com/red-buffer/%EF%B8%8F-create-a-personal-vpn-on-aws-ec2-using-openvpn-easy-complete-walkthrough-1322e68b70c0?source=rss----8d86c116b0ef---4</link>
            <guid isPermaLink="false">https://medium.com/p/1322e68b70c0</guid>
            <category><![CDATA[openvpn-setup]]></category>
            <category><![CDATA[free-vpn]]></category>
            <category><![CDATA[openvpn-access-server]]></category>
            <category><![CDATA[aws-ec2]]></category>
            <category><![CDATA[openvpn]]></category>
            <dc:creator><![CDATA[Mirzaafshal]]></dc:creator>
            <pubDate>Mon, 02 Feb 2026 06:19:56 GMT</pubDate>
            <atom:updated>2026-02-02T06:19:56.651Z</atom:updated>
            <content:encoded><![CDATA[<h3>🛡️ Create a Personal VPN on AWS EC2 Using OpenVPN — Easy &amp; Complete Walkthrough</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cqo7bPzVydIDIbkrGCfewg.png" /></figure><p>A Virtual Private Network (VPN) is a service that allows users to access websites privately through another network’s servers. A VPN is one of the most effective solutions for protecting internet privacy and security.</p><p>In this article, you’ll learn how to spin up your own VPN server in under 30 minutes using an EC2 instance and OpenVPN.</p><p><strong>🔍 Why Create Your Own VPN?</strong></p><p>Before we dive in, let’s look at the benefits of hosting your own VPN:</p><ul><li><strong>Total control</strong>: No data logging unless you configure it.</li><li><strong>Bypass geo-restrictions</strong>: Access region-specific content.</li><li><strong>Enhanced security</strong>: Safely browse on public Wi-Fi.</li><li><strong>Save money</strong>: One-time setup, no monthly fee.</li><li><strong>Learning experience</strong>: Great way to improve your cloud and Linux skills.</li></ul><h3>🛠️ What You’ll Need</h3><ul><li>An <strong>AWS account</strong> (free tier eligible)</li><li>Basic command-line knowledge</li><li>A terminal or SSH client</li><li>~30 minutes of your time</li></ul><p><strong>☁️ Step 1: Launch an EC2 Instance</strong></p><ul><li>Go to the <a href="https://console.aws.amazon.com/ec2">AWS EC2 Console</a>.</li><li>Click <strong>Launch Instance</strong>.</li><li>Choose an AMI:<br> 👉 <em>Ubuntu Server 22.04 LTS (recommended)</em></li><li>Select <strong>t2.micro</strong> (free tier eligible).</li><li>Configure storage (the default is fine).</li><li>Add a new <strong>key pair</strong> (save the .pem file securely).</li><li>Set up a <strong>Security Group</strong> with the following rules:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/825/1*WUTnM11u54x-g0p4DID7mg.png" /></figure><ul><li>Launch the instance and note the <strong>Public IPv4 address</strong>.</li></ul><p><strong>🔐 Step 2: Connect to the EC2 Instance</strong></p><p>Use your terminal or SSH client to connect:</p><pre>chmod 400 your-key.pem<br>ssh -i your-key.pem ubuntu@your-ec2-public-ip</pre><p><strong>📦 Step 3: Install OpenVPN and Easy-RSA</strong></p><p>Once connected, install the necessary packages:</p><pre>sudo apt update &amp;&amp; sudo apt install openvpn easy-rsa -y</pre><p><strong>📁 Step 4: Set Up the CA and Server Certificates</strong></p><ul><li>Create the PKI directory:</li></ul><pre>make-cadir ~/openvpn-ca<br>cd ~/openvpn-ca</pre><ul><li>Build the CA:</li></ul><pre>./easyrsa init-pki<br>./easyrsa build-ca</pre><ul><li>Generate the server certificate and key:</li></ul><pre>./easyrsa gen-req server nopass<br>./easyrsa sign-req server server</pre><ul><li>Create Diffie-Hellman parameters and the HMAC key:</li></ul><pre>./easyrsa gen-dh<br>openvpn --genkey --secret ta.key</pre><p><strong>🔄 Step 5: Move Files to the OpenVPN Directory</strong></p><pre>sudo cp pki/ca.crt pki/private/server.key pki/issued/server.crt ta.key pki/dh.pem /etc/openvpn/</pre><p><strong>⚙️ Step 6: Configure the OpenVPN Server</strong></p><p>Get a default config file:</p><pre>gunzip -c /usr/share/doc/openvpn/examples/sample-config-files/server.conf.gz | sudo tee /etc/openvpn/server.conf</pre><p>Open it for editing:</p><pre>sudo nano /etc/openvpn/server.conf</pre><p>Make sure these lines are present/uncommented:</p><pre>port 1194<br>proto udp<br>dev tun<br>ca ca.crt<br>cert server.crt<br>key server.key<br>dh dh.pem
tls-auth ta.key 0<br>cipher AES-256-CBC<br>auth SHA256<br>user nobody<br>group nogroup<br>persist-key<br>persist-tun<br>status openvpn-status.log<br>verb 3</pre><p><strong>🔧 Step 7: Enable IP Forwarding</strong></p><p>Edit the sysctl config:</p><pre>sudo nano /etc/sysctl.conf</pre><p>Uncomment or add:</p><pre>net.ipv4.ip_forward=1</pre><p>Apply it:</p><pre>sudo sysctl -p</pre><p><strong>🔥 Step 8: Configure the UFW Firewall</strong></p><p>Allow SSH and VPN traffic:</p><pre>sudo ufw allow ssh<br>sudo ufw allow 1194/udp</pre><p>Edit the UFW before rules:</p><pre>sudo nano /etc/ufw/before.rules</pre><p>Add this section at the top, before *filter:</p><pre>*nat<br>:POSTROUTING ACCEPT [0:0]<br>-A POSTROUTING -s 10.8.0.0/8 -o eth0 -j MASQUERADE<br>COMMIT</pre><p>Then set the default forwarding policy:</p><pre>sudo nano /etc/default/ufw</pre><p>Change:</p><pre>DEFAULT_FORWARD_POLICY=&quot;ACCEPT&quot;</pre><p>Enable the firewall:</p><pre>sudo ufw disable &amp;&amp; sudo ufw enable</pre><p><strong>▶️ Step 9: Start and Enable OpenVPN</strong></p><p>Start the server:</p><pre>sudo systemctl start openvpn@server</pre><p>Enable it on boot:</p><pre>sudo systemctl enable openvpn@server</pre><p>Check its status:</p><pre>sudo systemctl status openvpn@server</pre><p><strong>📲 Step 10: Create a VPN Client Configuration</strong></p><ul><li>Generate the client certificate and key:</li></ul><pre>cd ~/openvpn-ca<br>./easyrsa gen-req client1 nopass<br>./easyrsa sign-req client client1</pre><ul><li>Copy the client files to your local machine:</li></ul><pre>scp -i your-key.pem ubuntu@your-ec2-ip:/home/ubuntu/openvpn-ca/pki/issued/client1.crt .<br>scp -i your-key.pem ubuntu@your-ec2-ip:/home/ubuntu/openvpn-ca/pki/private/client1.key .<br>scp -i your-key.pem ubuntu@your-ec2-ip:/etc/openvpn/ca.crt .<br>scp -i your-key.pem ubuntu@your-ec2-ip:/etc/openvpn/ta.key .</pre><p><strong>📝 Step 11: Create a .ovpn Profile</strong></p><p>Create a client1.ovpn file locally:</p><pre>client<br>dev tun<br>proto udp<br>remote your-ec2-public-ip 1194<br>resolv-retry infinite<br>nobind<br>persist-key<br>persist-tun<br>remote-cert-tls server<br>cipher AES-256-CBC<br>auth SHA256<br>key-direction 1<br>verb 3<br><br>&lt;ca&gt;<br># Paste content of ca.crt<br>&lt;/ca&gt;<br>&lt;cert&gt;<br># Paste content of client1.crt<br>&lt;/cert&gt;<br>&lt;key&gt;<br># Paste content of client1.key<br>&lt;/key&gt;<br>&lt;tls-auth&gt;<br># Paste content of ta.key<br>&lt;/tls-auth&gt;</pre><p>Now import the .ovpn file into your favorite VPN client (OpenVPN Connect, Tunnelblick, etc.).</p><h3><strong>💻 How to Use </strong><strong>.ovpn on Different Platforms</strong></h3><h4>Windows and Mac</h4><ol><li>Download OpenVPN from <a href="https://openvpn.net/client/">here</a> and install it.</li><li>Open the OpenVPN client and import the .ovpn file.</li><li>Click <strong>Connect</strong>.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/720/0*eoZv7KVOmWuICGZG.png" /></figure><h4>Ubuntu/Linux</h4><p>Install OpenVPN:</p><pre>sudo apt update &amp;&amp; sudo apt install openvpn -y</pre><p>Run:</p><pre>sudo openvpn --config client1.ovpn</pre><h4>Android</h4><ul><li>Open the <strong>Google Play Store</strong> on your Android device.</li><li>Search for <strong>OpenVPN Connect</strong>.</li><li>Tap <strong>Install</strong> to download and install the app.</li><li>Once the installation is complete, open the <strong>OpenVPN Connect</strong> app.</li><li>Import the .ovpn file.</li><li>After importing the profile, tap <strong>Connect</strong> to establish a connection to the VPN 
server.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/154/0*ODkIT1oD8fevbRz-" /></figure><h4>iOS (iPhone/iPad)</h4><ul><li>Open the <strong>App Store</strong> on your iOS device.</li><li>Search for <strong>OpenVPN Connect</strong>.</li><li>Tap the <strong>Get</strong> button to download and install the app.</li><li>Once the installation is complete, open the <strong>OpenVPN Connect</strong> app.</li><li>Import the .ovpn file.</li><li>After importing the profile, you can connect to the VPN server by selecting the profile and tapping <strong>Connect</strong>.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/414/0*xSnN1SODRmL1BRIi" /></figure><p><strong>🎉 You’re Done!</strong></p><p>You’ve successfully deployed a private VPN server on AWS EC2 using OpenVPN! You can now:</p><p>✅ Browse securely on public networks</p><p>✅ Access region-locked content</p><p>✅ Enjoy peace of mind with total data privacy</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1322e68b70c0" width="1" height="1" alt=""><hr><p><a href="https://medium.com/red-buffer/%EF%B8%8F-create-a-personal-vpn-on-aws-ec2-using-openvpn-easy-complete-walkthrough-1322e68b70c0">🛡️ Create a Personal VPN on AWS EC2 Using OpenVPN — Easy &amp; Complete Walkthrough</a> was originally published in <a href="https://medium.com/red-buffer">Red Buffer</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Zero-Shot Object Detection with CLIP Models]]></title>
            <link>https://medium.com/red-buffer/zero-shot-object-detection-with-clip-models-8dc69bc8834b?source=rss----8d86c116b0ef---4</link>
            <guid isPermaLink="false">https://medium.com/p/8dc69bc8834b</guid>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[clip-model]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[object-detection]]></category>
            <dc:creator><![CDATA[Ahmad Faraz Sheikh]]></dc:creator>
            <pubDate>Wed, 28 Jan 2026 08:10:38 GMT</pubDate>
            <atom:updated>2026-01-28T08:10:38.086Z</atom:updated>
            <content:encoded><![CDATA[<p>Imagine you’re at a party, and your friend shows you a photo asking, ‘Can you spot the lychee in this picture?’ You might have never seen a lychee before, but if someone describes it as ‘a small, red fruit with a rough texture,’ you can use that information to make an educated guess. This is exactly what we want computers to do: recognize objects they’ve never been specifically trained to find by using prior knowledge from textual descriptions or related concepts. This is where <strong>Zero-Shot Object Detection</strong> comes in!</p><p>Traditional object detection systems work like rote learners. They need to see countless labeled examples: “cat,” “car,” “bottle”, before they can accurately identify objects. Imagine trying to teach a child what a spoon is by showing them hundreds of spoon photos instead of simply describing it. Not the most efficient way to learn, is it?</p><p>But what if we could create a system that learns more like humans? A system that doesn’t need to see every single example but can reason about objects from descriptions alone. That’s where CLIP (Contrastive Language-Image Pre-training), an AI model developed by OpenAI, comes into the picture. CLIP is like giving a computer both eyes to see and a brain that understands language. Let’s dive deeper into how it works, what makes it special, and how you can use it for your projects!</p><h3>Understanding the Basics</h3><h4>What Makes CLIP Special?</h4><p>Traditional computer vision models are like robots that can only recognize items from a predefined catalog. CLIP, on the other hand, is like having a smart assistant who understands both pictures and words and can connect them intelligently.</p><p>Here’s an analogy:</p><ul><li><strong>Traditional Model</strong>: “I can only recognize cats if you’ve shown me thousands of cat pictures during training.”</li><li><strong>CLIP</strong>: “I understand what cats look like AND I understand what the word ‘cat’ means, so I can identify cats even in situations I haven’t seen before!”</li></ul><h4>How CLIP Connects Images and Text</h4><p>Think of CLIP as a translator between two “languages”, the language of images and the language of text. Here’s how it works:</p><ul><li>When CLIP sees an image, it encodes it into a unique numeric representation, or embedding, that captures its essence.</li><li>Similarly, when CLIP reads a text description, it converts that text into the same type of embedding.</li></ul><p>These embeddings exist in a shared “space” where similar visual and textual concepts end up close to each other. For instance, a photo of a cat and the text “a cat” would be neighbors in this embedding space, enabling CLIP to make meaningful connections.</p><h4>The Concept of “Zero-Shot” Learning</h4><p>“<em>Zero-shot learning</em>” may sound technical, but it’s a simple idea. Let’s say someone tells you: “<em>A peacock is a large bird with vibrant, colorful feathers and a fan-shaped tail.</em>” Even if you’ve never seen one in person, you could identify a peacock in a picture using this description alone. That’s <strong>zero-shot learning, </strong>the ability to recognize something new without prior specific training.</p><p>This skill is incredibly valuable in the real world, where it’s impossible to train AI on every possible object or scenario. CLIP’s zero-shot capabilities allow it to scale effortlessly to new tasks. 
</p><h4>Real-World Applications and Benefits</h4><p>CLIP’s abilities open up exciting possibilities in the real world:</p><ol><li>Flexible Search Systems: Want to find “a person wearing a red hat sitting on a beach” in your photo library? CLIP can help!</li><li>Accessibility Tools: Helping visually impaired people understand images through natural descriptions</li><li>Creative Tools: Helping artists and designers find specific visual references based on text descriptions</li></ol><p>The beauty of CLIP is that it’s incredibly versatile; you don’t need to retrain it for each new type of object you want to detect. It’s like having a universal translator between the visual world and human language!</p><h3>How CLIP Works</h3><h4>Architecture Overview</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/816/1*4_gWyuWYea7KfYo-r_yWKA.png" /><figcaption>Encoder Component of CLIP</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/816/1*RYo0wifLVNotoAixY9qMOA.png" /><figcaption>Classifier Component of CLIP</figcaption></figure><p>Imagine CLIP as two parallel pathways that eventually meet, like two rivers joining together:</p><p><strong>Image Encoder</strong> (Vision Pathway):</p><ul><li>This is like a smart camera that takes in images</li><li>Uses a modified Vision Transformer (ViT) or ResNet</li><li>Converts images into a standardized 512-dimensional vector (think of it as a special code)</li></ul><p><strong>Text Encoder</strong> (Language Pathway):</p><ul><li>This is like a language expert that processes text</li><li>Uses a Transformer architecture similar to GPT</li><li>Converts text descriptions into the same type of 512-dimensional vector</li></ul><p>These two pathways are trained together using contrastive learning; imagine teaching a child to match pictures with their descriptions by showing them both correct and incorrect pairs.</p><h4>The Contrastive Learning Approach</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ckZIZ-kQdlXyBvColkTeyA.png" /><figcaption>CLIP’s Training Pipeline: Converting images and text into a shared space for matching, with parallel encoders processing inputs and optimizing for correct image-text pairs through contrastive learning.</figcaption></figure><p>CLIP (Contrastive Language-Image Pre-training) works like a sophisticated matching game where it learns to connect images with their corresponding text descriptions. The model has two main components: a vision encoder for processing images and a text encoder for handling language, both converting their inputs into comparable mathematical representations. During training, CLIP looks at multiple image-text pairs and learns to recognize correct matches while distinguishing incorrect ones. 
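To make the matching game concrete, here is a compact sketch of the symmetric contrastive objective, closely following the pseudocode in the CLIP paper (the variable names are illustrative):</p><pre>import torch<br>import torch.nn.functional as F<br><br>def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):<br>    # image_emb, text_emb: (N, d) embeddings of N matching image-text pairs<br>    image_emb = F.normalize(image_emb, dim=-1)<br>    text_emb = F.normalize(text_emb, dim=-1)<br><br>    # Pairwise cosine similarities, scaled by a temperature<br>    logits = image_emb @ text_emb.T / temperature<br><br>    # The i-th image matches the i-th text, so the targets are the diagonal<br>    targets = torch.arange(len(image_emb))<br><br>    # Symmetric cross-entropy: images-to-texts and texts-to-images<br>    loss_images = F.cross_entropy(logits, targets)<br>    loss_texts = F.cross_entropy(logits.T, targets)<br>    return (loss_images + loss_texts) / 2</pre><p>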
This simple but powerful approach allows CLIP to learn from millions of image-text pairs found on the internet, helping it develop a robust understanding of how visual content relates to natural language descriptions.</p><h3>Hands-on Implementation</h3><p>Let’s walk through a complete example of using CLIP for object detection:</p><p><strong>Loading and Preparing CLIP</strong></p><pre>import clip<br>import torch<br>from PIL import Image<br><br># Pick a device (GPU if available, otherwise CPU)<br>device = &quot;cuda&quot; if torch.cuda.is_available() else &quot;cpu&quot;<br><br># Load the model<br>model, preprocess = clip.load(&quot;ViT-B/32&quot;, device=device)<br><br># Prepare your image<br>image = Image.open(&quot;path_to_your_image.jpg&quot;)<br>image_input = preprocess(image).unsqueeze(0).to(device)</pre><p><strong>Writing Text Prompts</strong></p><pre># Define what you&#39;re looking for<br>text_descriptions = [&quot;a dog&quot;, &quot;a cat&quot;, &quot;a person&quot;, &quot;a car&quot;, &quot;a tree&quot;]<br><br># Encode text descriptions<br>text_tokens = clip.tokenize(text_descriptions).to(device)</pre><p><strong>Processing Images and Text</strong></p><pre>with torch.no_grad():<br>    # Get image features<br>    image_features = model.encode_image(image_input)<br>    <br>    # Get text features<br>    text_features = model.encode_text(text_tokens)<br>    <br>    # Calculate similarity<br>    similarity = torch.nn.functional.cosine_similarity(<br>        image_features, text_features<br>    )<br><br># Get the most likely match<br>most_likely = text_descriptions[similarity.argmax().item()]<br>print(f&quot;This image most likely contains {most_likely}&quot;)</pre><h4>Testing CLIP with a Real-World Example</h4><p>To demonstrate CLIP’s capabilities, let’s consider the following test case. We provide the model with a list of textual prompts to compare with an image of cookies. Our prompts include:</p><ul><li>“chocolate chip cookies”</li><li>“cookies”</li><li>“a cat”</li><li>“a dog”</li><li>“a cow”</li><li>“a person standing”</li></ul><p>After running the code on the following image:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/183/1*AEIQZnGebbbc6ppGkt0VKQ.jpeg" /><figcaption>Image of cookies</figcaption></figure><p>through CLIP with these textual descriptions, the model outputs the following results:</p><ul><li>Detected chocolate chip cookies with <strong>31.745%</strong> confidence</li></ul><p>As we can see, the object “chocolate chip cookies” with the highest confidence score (31.745%) is returned, which is the most specific and accurate match for the given image. This highlights how CLIP effectively aligns visual content with textual descriptions, even when given highly contextual or detailed prompts.</p><h3>From Images to Videos: Detecting Objects in Motion</h3><p>Imagine this: You’re watching a wildlife documentary, and you spot a rare bird. You snap a photo, wondering, “Can I find every moment this bird appears in the video?” But here’s the catch: you don’t know the bird’s name, and you can’t find any information about it online. Based on the earlier parts of this article, you might think you need a name or description to search for it in the video. But with CLIP, you don’t need any of that.</p><p>Using just the image of the bird, CLIP can scan the entire video and pinpoint every frame where the bird appears. No labels, no descriptions, just the power of AI connecting images and context.</p><p>In this section, we’ll take CLIP’s zero-shot object detection capabilities to the next level. Not only can CLIP identify objects in static images, but it can also track them across videos. 
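Before that, a brief note on the confidence numbers reported above: they come from a softmax over the scaled image-text similarities. A minimal sketch, reusing model, image_input, text_tokens, and text_descriptions from the snippets above (the 100.0 factor stands in for CLIP’s learned logit scale), looks like this:</p><pre>with torch.no_grad():<br>    image_features = model.encode_image(image_input)<br>    text_features = model.encode_text(text_tokens)<br><br>    # Normalize, scale, and convert similarities into probabilities<br>    image_features /= image_features.norm(dim=-1, keepdim=True)<br>    text_features /= text_features.norm(dim=-1, keepdim=True)<br>    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)<br><br># Print one confidence percentage per text prompt<br>for desc, p in zip(text_descriptions, probs[0]):<br>    print(f&quot;{desc}: {100 * p.item():.3f}%&quot;)</pre><p>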
Let’s dive into how you can detect an object using its image and retrieve all the frames where it appears.</p><p><strong>Generate Embeddings for the Target Object:</strong></p><pre>import cv2<br><br># Load and preprocess the target object image<br>target_image = Image.open(&quot;path_to_your_image&quot;)<br>target_image_preprocessed = preprocess(target_image).unsqueeze(0).to(device)<br><br># Generate embedding for the target object<br>with torch.no_grad():<br>    target_embedding = model.encode_image(target_image_preprocessed)<br>    target_embedding /= target_embedding.norm(dim=-1, keepdim=True)</pre><p><strong>Scan Video Frames for the Desired Object</strong></p><pre># Open video file<br>video_path = &quot;/content/test (1).mp4&quot;<br>cap = cv2.VideoCapture(video_path)<br><br>frame_rate = int(cap.get(cv2.CAP_PROP_FPS))  # Frame rate (useful for mapping frames to timestamps)<br>detected_frames = []<br><br>while cap.isOpened():<br>    ret, frame = cap.read()<br>    if not ret:<br>        break<br><br>    # Convert frame to PIL Image and preprocess<br>    frame_pil = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))<br>    frame_preprocessed = preprocess(frame_pil).unsqueeze(0).to(device)<br><br>    # Compute embedding for current frame<br>    with torch.no_grad():<br>        frame_embedding = model.encode_image(frame_preprocessed)<br>        frame_embedding /= frame_embedding.norm(dim=-1, keepdim=True)<br><br>    # Compare embeddings (cosine similarity)<br>    similarity = (frame_embedding @ target_embedding.T).item()<br><br>    # If similarity exceeds a threshold, save the frame<br>    if similarity &gt; 0.2:  # Adjust threshold as needed<br>        detected_frames.append(frame)  # Store detected frame<br><br>cap.release()</pre><h3>Conclusion</h3><p>Imagine a world where AI can recognize any object, in any image or video, without ever having seen it before. This is no longer just a futuristic dream; it’s becoming a reality with CLIP’s zero-shot object detection.</p><p>With traditional object detection models, we were stuck in a cycle of collecting massive labeled datasets, training models for hours, and constantly updating them with new objects. But CLIP changes the game. Instead of memorizing countless images, it understands objects through language, much like how we humans do. This breakthrough opens up endless possibilities:</p><ul><li><strong>Smarter search systems</strong> that let you find photos using natural descriptions.</li><li><strong>Enhanced accessibility tools</strong> that help visually impaired users understand their surroundings.</li><li><strong>Creative AI applications</strong> that let artists and designers retrieve inspiration through text-based queries.</li><li><strong>Video tracking</strong> that identifies objects across footage, even if their names are unknown.</li></ul><p>The best part? <strong>No specialized retraining is required</strong>. Whether you’re a developer, a researcher, or an AI enthusiast, <strong>CLIP offers a powerful way to explore the connection between vision and language</strong>.</p><p>If you’re excited to try it out, check out my <strong>GitHub repository</strong>, where I’ve implemented CLIP for zero-shot object detection. Let’s push the boundaries of AI together! 
🚀</p><h3>GitHub Repository</h3><p>📌 <strong>Explore my CLIP-based object detection implementation:</strong><br>🔗 <a href="https://github.com/FarazSheikh16/zero-shot-object-detection-using-clip">My GitHub Repository</a></p><h3>Further Reading &amp; Resources</h3><p>📄 <strong>CLIP Research Paper:</strong> <a href="https://arxiv.org/abs/2103.00020">Learning Transferable Visual Models From Natural Language Supervision</a><br>🖥️ <strong>Official OpenAI CLIP Repository:</strong> <a href="https://github.com/openai/CLIP">CLIP on GitHub</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8dc69bc8834b" width="1" height="1" alt=""><hr><p><a href="https://medium.com/red-buffer/zero-shot-object-detection-with-clip-models-8dc69bc8834b">Zero-Shot Object Detection with CLIP Models</a> was originally published in <a href="https://medium.com/red-buffer">Red Buffer</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Olostep: Web Data API for AI and Research Automation]]></title>
            <link>https://medium.com/red-buffer/olostep-web-data-api-for-ai-and-research-automation-be8c93c28ef1?source=rss----8d86c116b0ef---4</link>
            <guid isPermaLink="false">https://medium.com/p/be8c93c28ef1</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[web-scraping]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[genai]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Abdul Rehman Raja]]></dc:creator>
            <pubDate>Wed, 21 Jan 2026 08:23:03 GMT</pubDate>
            <atom:updated>2026-01-21T08:23:02.563Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bqo1b4kXa35tYiX8SEjMHA.png" /></figure><p>Are you building agents that access Stack Overflow like your friend in a hoodie sitting in a dark room, staring at a glowing screen, wearing fancy headphones, and somehow always knowing the right answer?</p><p>Except… instead of a human, it’s your AI.</p><p>If yes, then you already know the problem: <strong>AI is only as good as the data it can access. </strong>And the web is messy, dynamic, JavaScript-heavy, and bot-protected, which is not exactly AI-friendly.</p><p>That’s where <strong>Olostep</strong> comes in.</p><h3>What is Olostep (in plain English)?</h3><p><strong>Olostep is a Web Data API that enables your AI to effectively utilize the internet.</strong></p><p>Not “trained-on-the-web-in-2023” internet but <strong>live, real, structured, up-to-date web data</strong>.</p><p>Instead of fighting with:</p><ul><li>Headless browsers</li><li>Proxy rotation</li><li>CAPTCHAs</li><li>JavaScript rendering</li><li>Brittle scrapers that break every two weeks</li></ul><p>You send Olostep a URL (or a task), and it gives you back <strong>clean, usable data </strong>ready for:</p><ul><li>AI agents</li><li>RAG pipelines</li><li>Research automation</li><li>Dashboards</li><li>Lead enrichment</li><li>Competitor tracking</li></ul><p>Think of Olostep as:</p><blockquote>“The data intern your AI deserves but one that never sleeps.”</blockquote><h3>What can Olostep do?</h3><p>At a high level, Olostep offers APIs for:</p><ul><li><strong>Scraping</strong> individual pages</li><li><strong>Crawling</strong> entire websites</li><li><strong>Mapping</strong> all URLs on a domain</li><li><strong>Batch processing</strong> thousands of URLs</li><li><strong>AI-powered web answers</strong> with sources</li><li><strong>Parsing unstructured content into JSON</strong></li><li><strong>Agent-based automation</strong> using natural language</li></ul><p>Basically:</p><blockquote>If the data exists on the public web, Olostep can probably get it.</blockquote><h3>Core Concepts (Quick Tour)</h3><h3>Scrapes (“Give me this page”):</h3><p>You pass a URL.<br> Olostep returns the content in HTML, Markdown, or text format.</p><p>Perfect for:</p><ul><li>Blog posts</li><li>Documentation</li><li>Product pages</li><li>Landing pages</li></ul><h3>Crawls (“Give me this whole site”)</h3><p>You give a starting URL.<br> Olostep recursively follows internal links and collects pages.</p><p>Great for:</p><ul><li>Docs ingestion</li><li>Knowledge bases</li><li>RAG pipelines</li><li>Internal search engines</li></ul><h3>Batches (“Do this at scale”)</h3><p>Have 1,000 to 10,000 URLs?<br>Send them in one job and let Olostep handle concurrency.</p><p>Used for:</p><ul><li>Lead enrichment</li><li>SEO audits</li><li>Price monitoring</li><li>Market research</li></ul><h3>Answers (“Search the web and explain it to me.”)</h3><p>Instead of scraping first and prompting later, Olostep can:</p><ul><li>Search the web</li><li>Read multiple sources</li><li>Generate an AI answer</li><li>Attach references</li></ul><p>Perfect for:</p><ul><li>Research agents</li><li>Analyst copilots</li><li>Internal Q&amp;A tools</li></ul><h3>Hands-On Activity (Python): Scrape a Web Page</h3><pre>import requests<br>API_KEY = &quot;&lt;YOUR_API_KEY&gt;&quot;<br>API_URL = &quot;https://api.olostep.com/v1/scrapes&quot;<br>headers = {<br>    &quot;Authorization&quot;: f&quot;Bearer {API_KEY}&quot;,<br>    &quot;Content-Type&quot;: 
&quot;application/json&quot;<br>}<br>payload = {<br>    &quot;url_to_scrape&quot;: &quot;https://example.com&quot;<br>}<br>response = requests.post(API_URL, headers=headers, json=payload)<br>data = response.json()<br>print(data[&quot;markdown_content&quot;])</pre><h4>What’s happening here?</h4><ul><li>Olostep loads the page (JS included)</li><li>Extracts the content</li><li>Returns it in a <strong>clean, AI-friendly format</strong></li></ul><p><strong><em>Pros:</em></strong></p><ul><li>No manual retries</li><li>No blocked-IP issues (it scales)</li><li>No Selenium</li></ul><h3>Hands-On Activity (Node.js): Ask the Web a Question (AI-Powered)</h3><pre>const API_KEY = &quot;YOUR_API_KEY&quot;;<br>fetch(&quot;https://api.olostep.com/v1/answers&quot;, {<br>  method: &quot;POST&quot;,<br>  headers: {<br>    &quot;Authorization&quot;: `Bearer ${API_KEY}`,<br>    &quot;Content-Type&quot;: &quot;application/json&quot;<br>  },<br>  body: JSON.stringify({<br>    task: &quot;What are the biggest AI trends in 2026?&quot;,<br>    json: {<br>      trend: &quot;&quot;,<br>      explanation: &quot;&quot;<br>    }<br>  })<br>})<br>  .then(res =&gt; res.json())<br>  .then(data =&gt; console.log(data))<br>  .catch(err =&gt; console.error(err));</pre><h3>Python SDK (Cleaner, Less Boilerplate)</h3><p>If you don’t want to deal with raw HTTP calls, Olostep’s <strong>Python SDK</strong> makes life easier.</p><h3>Installation</h3><pre>pip install olostep</pre><h3>Example: Simple Scrape</h3><pre>from olostep import Olostep<br>client = Olostep(api_key=&quot;YOUR_API_KEY&quot;)<br>result = client.scrapes.create(<br>    url_to_scrape=&quot;https://docs.olostep.com&quot;<br>)<br>print(result.markdown_content)</pre><h3>Example: Crawl a Website</h3><pre>crawl = client.crawls.create(<br>    start_url=&quot;https://docs.olostep.com&quot;<br>)<br>for page in crawl.pages():<br>    print(page.url)</pre><h3>When to use the SDK</h3><ul><li>You’re building pipelines</li><li>You want pagination handled automatically</li></ul><h3>Node SDK (Agent-Friendly &amp; Async)</h3><p>The <strong>Node SDK</strong> is ideal if you’re building:</p><ul><li>AI agents</li><li>Backend services</li><li>Serverless workflows</li></ul><h3>Installation</h3><pre>npm install olostep</pre><h3>Example: Scrape a Page</h3><pre>import { Olostep } from &quot;olostep&quot;;<br>const client = new Olostep({<br>  apiKey: &quot;YOUR_API_KEY&quot;<br>});<br>const result = await client.scrapes.create({<br>  url_to_scrape: &quot;https://example.com&quot;<br>});<br>console.log(result.markdown_content);</pre><h3>Example: Batch URLs</h3><pre>const batch = await client.batches.create({<br>  items: [<br>    { custom_id: &quot;1&quot;, url: &quot;https://site1.com&quot; },<br>    { custom_id: &quot;2&quot;, url: &quot;https://site2.com&quot; }<br>  ]<br>});<br>console.log(batch);</pre><h3>Why SDKs matter</h3><ul><li>Less error-prone</li><li>Easier retries</li><li>Cleaner agent integration</li><li>Faster prototyping</li></ul><h3>Supported Platforms</h3><p>Olostep doesn’t care where your code lives: local machine, cloud, CI pipeline, or some mysterious server you SSH into once and never touch again.</p><p>If it can make HTTP requests, <strong>Olostep works there</strong>.</p><h3>Programming Languages</h3><p>Out of the box, Olostep supports:</p><ul><li><strong>Python</strong> (for data pipelines, ML workflows, and RAG systems)</li><li><strong>Node.js / JavaScript</strong> (for backend services, agents, and serverless functions)</li></ul><p>And if you’re using something else?<br> No 
problem, Olostep is a <strong>plain HTTP API</strong>, so you can integrate it with:</p><ul><li>Go</li><li>Java</li><li>C#</li><li>PHP</li><li>Ruby</li><li>Bash (yes, really)</li></ul><h3>Deployment Environments</h3><p>Olostep works seamlessly across:</p><ul><li><strong>Local development</strong> (Mac, Linux, Windows)</li><li><strong>Cloud servers</strong> (AWS, GCP, Azure, DigitalOcean)</li><li><strong>Serverless platforms</strong> (AWS Lambda, Vercel, Cloudflare Workers)</li><li><strong>Docker &amp; Kubernetes</strong> workloads</li><li><strong>CI/CD pipelines</strong></li></ul><p>If your app can reach the internet, it can reach Olostep.</p><h3>AI &amp; Agent Frameworks</h3><p>Olostep fits naturally into modern AI stacks and agentic workflows, including:</p><ul><li><strong>LangChain</strong></li><li><strong>LlamaIndex</strong></li><li><strong>Custom RAG pipelines</strong></li><li><strong>Agent-based architectures</strong></li><li><strong>Internal research copilots</strong></li></ul><p>It acts as the <strong>“web access layer”</strong>, the part that actually fetches reality before your LLM starts hallucinating.</p><h3>Data Formats</h3><p>Olostep speaks the formats your systems already understand:</p><ul><li><strong>HTML</strong> (raw page content)</li><li><strong>Markdown</strong> (perfect for RAG ingestion)</li><li><strong>Plain text</strong></li><li><strong>Structured JSON</strong> (via parsers or AI extraction)</li></ul><h3>Conclusion</h3><p>Most AI systems today don’t fail because the models are bad; they fail because <strong>they’re blind to the real, live web</strong>.</p><ul><li>They hallucinate.</li><li>They rely on stale knowledge.</li><li>They guess instead of verifying.</li></ul><p>Olostep fixes that by giving your AI what it’s been missing all along: <strong>reliable, structured, up-to-date access to the internet.</strong></p><p>Whether you’re building:</p><ul><li>Agentic RAG systems,</li><li>Research automation,</li><li>Internal copilots,</li><li>Lead enrichment pipelines,</li><li>or large-scale web intelligence tools,</li></ul><p>Olostep removes the painful parts of web data extraction, letting you focus on <strong>building intelligence instead of infrastructure</strong>.</p><ul><li>No brittle scrapers.</li><li>No proxy chaos.</li><li>No JavaScript nightmares.</li></ul><p>Just clean data, delivered at scale exactly when your AI needs it. So if you want your AI to stop <em>pretending</em> it knows the web and actually <strong>use it</strong>, Olostep might just be the hoodie-wearing genius sitting quietly behind the scenes, only faster, scalable, and always online.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=be8c93c28ef1" width="1" height="1" alt=""><hr><p><a href="https://medium.com/red-buffer/olostep-web-data-api-for-ai-and-research-automation-be8c93c28ef1">Olostep: Web Data API for AI and Research Automation</a> was originally published in <a href="https://medium.com/red-buffer">Red Buffer</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Agentic MCP: Dynamic MCP Integration for AI Agents]]></title>
            <link>https://medium.com/red-buffer/agentic-mcp-dynamic-mcp-integration-for-ai-agents-5c1daf5b950c?source=rss----8d86c116b0ef---4</link>
            <guid isPermaLink="false">https://medium.com/p/5c1daf5b950c</guid>
            <category><![CDATA[mcp-server]]></category>
            <category><![CDATA[ai-agent]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[agentic-ai]]></category>
            <category><![CDATA[agentic-mcp]]></category>
            <dc:creator><![CDATA[Ahmad Faraz Sheikh]]></dc:creator>
            <pubDate>Mon, 17 Nov 2025 06:32:43 GMT</pubDate>
            <atom:updated>2025-11-17T06:32:42.162Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jkHnPR7zRhQpt7TBBiKQLQ.png" /></figure><p>Integrating the Model Context Protocol (MCP) into AI agents still feels like heavy lifting. Each time we want an agent to access a data source or a tool, we dive into config files, register endpoints, reload services, and keep track of what’s connected. It might work, but it’s slow, brittle, and doesn’t scale well. Additionally, when we load every possible MCP into the agent’s configuration ahead of time, we can encounter the <strong>tool overload</strong> issue. Too many available options clutter the agent’s context, increase token usage, slow reasoning, and reduce accuracy.</p><p>Now, imagine an agent that already knows which MCPs are available to it. When a user submits a query, the agent can determine on its own whether a specific MCP is needed, automatically load it into its context, and continue the conversation without interruption. There’s no need for a manual agent restart after updating the configuration. This approach maintains a seamless and efficient workflow while preserving context and minimizing computational overhead. Since the agent only activates the MCPs required for a given query, it also minimizes token usage and operational cost, avoiding the inefficiency of loading every MCP by default.</p><h3>Architecture Overview</h3><p>The architecture for dynamic MCP integration is built around two main layers inside the agent: <strong>Initial Processing</strong> and <strong>Dynamic MCP Handling</strong>. Together, they enable the agent to determine whether a user query can be resolved internally or if it requires an external MCP, and then seamlessly bring that MCP into use.</p><h4>1. Initial Processing</h4><p>When a user submits a query, the agent first evaluates whether it can handle the request directly. If the logic or data needed already exists within the agent’s internal context, it generates an immediate response. This keeps lightweight tasks fast and efficient.</p><h4>2. Dynamic MCP Handling</h4><p>For queries that require external information or specialized processing, the agent shifts to its dynamic integration layer:</p><ul><li><strong>MCP Discovery:</strong> The agent fetches a list of available MCPs from its directory or registry.</li><li><strong>MCP Selection:</strong> It evaluates which MCP can help cater to the user’s query based on domain relevance, available tools, and description.</li><li><strong>MCP Integration:</strong> The selected MCP is dynamically attached to the agent’s runtime environment.</li><li><strong>Execution:</strong> The agent, now equipped with the right MCP, processes the query, retrieves results, and delivers the response all within the same conversational flow.</li></ul><p>This design eliminates the need for manual configuration updates, prevents tool overload, and optimizes both performance and cost. The agent remains context-efficient, loading only the MCPs needed for each interaction rather than keeping all integrations active at once.</p><p>You can visualize the complete flow below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/666/1*TVbzHh4okw9edlgVa9TlJg.png" /></figure><h3>Implementation Approach</h3><p>To bring this design into practice, I implemented a two-stage mechanism that enables the agent to dynamically discover, select, and integrate MCPs while maintaining the session context intact.</p><h4>1. 
MCP Discovery Tool</h4><p>The process begins with a dedicated tool that queries the database for all registered MCP configurations. Each configuration includes the MCP’s name, description, and other settings. The tool compiles this information into a structured list (containing only the names and descriptions) and passes it to the agent as part of the runtime context. This provides the agent with real-time awareness of which MCPs are currently available, without hard-coding them into configuration files.</p><h4>2. Agent-Driven Selection</h4><p>When a user submits a query, the agent evaluates it against the list of available MCPs. Using the descriptions as semantic hints, the agent determines which MCP can best fulfill the user’s request. This decision is made autonomously at runtime, without requiring any manual input.</p><h4>3. Dynamic Registration and Context Preservation</h4><p>Once the agent selects the most relevant MCP, it calls a second tool that handles the registration process. This tool spawns a new agent instance that inherits all configurations, state, and previous MCPs from the existing session, and then integrates the newly required MCP. The result is a refreshed agent that can immediately proceed with the same conversation, now equipped with the right capabilities to address the query.</p><h4>4. Continuous Flow and Efficiency</h4><p>Because the transition occurs seamlessly within the same conversation, the user never experiences a context break. This method avoids preloading every MCP, saving both token usage and compute cost, while maintaining continuity in reasoning and dialogue.</p><h3>Design Considerations</h3><p>Building a dynamic MCP integration system requires more than just connecting agents and protocols. It also requires reliable coordination, state management, and careful control of how agents and tools interact. The following considerations form the backbone of a scalable and resilient setup.</p><h4>1. Centralized MCP Manager</h4><p>A centralized MCP Manager acts as the source of truth for all available MCPs. It stores configurations, maintains metadata (names, endpoints, descriptions, status), and exposes a discovery API for agents to query. This setup ensures that updates to MCPs are reflected instantly across all agents, avoiding configuration drift. The MCP Manager can also implement health checks to track the availability and responsiveness of each MCP.</p><h4>2. Centralized Agent Manager</h4><p>A Centralized Agent Manager complements the MCP Manager by overseeing the lifecycle of active agents.</p><ul><li><strong>Synchronization with the MCP Manager:</strong> It remains in sync with the MCP registry, ensuring that each agent always has access to the latest MCP configurations.</li><li><strong>Smooth Restarts:</strong> It handles agent restarts gracefully, automatically reinitializing previous contexts and reloading relevant MCPs without losing session history.</li><li><strong>Seamless Query Transfer:</strong> It manages transitions when a query needs to move from one agent instance to another, maintaining the user conversation flow without interruptions or context resets.</li></ul><h4>3. Prompt and Tool Usage Guidelines</h4><p>Agents should be guided with well-structured prompt templates that instruct them how to decide between <em>getting</em> or <em>adding</em> an MCP. 
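To make this concrete, here is a small, hypothetical sketch of how the two tools described below might be exposed to an agent; every name and signature is illustrative rather than taken from any specific framework:</p><pre>from dataclasses import dataclass<br><br>@dataclass<br>class MCPConfig:<br>    # Hypothetical registry entry; in practice this comes from the MCP Manager<br>    name: str<br>    description: str<br>    endpoint: str<br><br>REGISTRY = [<br>    MCPConfig(&quot;github-mcp&quot;, &quot;Read repositories and issues&quot;, &quot;https://mcp.internal/github&quot;),<br>    MCPConfig(&quot;crm-mcp&quot;, &quot;Customer account lookups and updates&quot;, &quot;https://mcp.internal/crm&quot;),<br>]<br><br>def get_available_mcps() -&gt; list[dict]:<br>    &quot;&quot;&quot;Discovery tool: expose only names and descriptions to keep the context small.&quot;&quot;&quot;<br>    return [{&quot;name&quot;: m.name, &quot;description&quot;: m.description} for m in REGISTRY]<br><br>def add_mcp(session, name: str):<br>    &quot;&quot;&quot;Registration tool: attach the selected MCP and respawn the agent with the same state.&quot;&quot;&quot;<br>    config = next(m for m in REGISTRY if m.name == name)<br>    return session.spawn_agent(      # hypothetical session API<br>        inherit_state=True,          # preserve history and previously added MCPs<br>        extra_mcps=[config],         # attach the newly required MCP<br>    )</pre><p>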
Clear prompting reduces ambiguity in tool invocation and ensures the agent behaves deterministically when selecting or attaching new MCPs.</p><ul><li><strong>MCP Discovery Tool:</strong> Used when the agent needs to query the MCP Manager for the list of available MCPs.</li><li><strong>Add MCP Tool:</strong> Triggered when the agent identifies a new MCP required for the current query and needs to integrate it into the runtime context.<br> Both tools should include lightweight validation logic to confirm successful MCP addition and maintain traceability in logs.</li></ul><h3>Possible Enhancements and Alternate Approaches</h3><p>Depending on system needs and use-case complexity, several refinements can be introduced within the existing architecture:</p><ul><li><strong>Preload MCP List at Session Start</strong><br>Instead of relying on a dedicated tool to fetch available MCPs each time, the agent can load the MCP registry directly into its context when a new session begins, avoiding an extra tool call.</li><li><strong>Temporary MCP Handling via Sub-Agent</strong><br>For setups that do not want to retain MCPs permanently, the logic can be adjusted so that instead of restarting the entire agent, a lightweight sub-agent is spawned. This sub-agent carries only the required MCP, processes the specific user query, and passes the response back to the main agent before being terminated.</li><li><strong>Controlled MCP Removal</strong><br>A dedicated tool can be introduced to remove a specific MCP from the agent’s context, freeing up memory and reducing token usage. This action would only trigger on explicit user request and should include a confirmation prompt before deletion to prevent accidental removals.</li></ul><h3>Conclusion</h3><p>Dynamic MCP integration represents a practical step toward truly adaptive AI agents. By allowing agents to discover, attach, and manage MCPs on demand, we remove the rigidity of manual configurations and unlock smoother, context-preserving workflows. The architecture not only simplifies development but also optimizes cost, resource use, and scalability.</p><p>As agent ecosystems grow, such mechanisms will become essential for balancing autonomy with control. Whether through centralized managers, sub-agents, or contextual reloading, the goal remains the same: to enable agents that can understand what they need, find it, and use it responsibly within the workflow. This shift moves AI agents from being pre-wired systems to becoming dynamic collaborators that configure themselves intelligently. If you’ve been working on similar adaptive agent architectures or have thoughts on improving MCP discovery and management, feel free to share your perspective in the comments.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5c1daf5b950c" width="1" height="1" alt=""><hr><p><a href="https://medium.com/red-buffer/agentic-mcp-dynamic-mcp-integration-for-ai-agents-5c1daf5b950c">Agentic MCP: Dynamic MCP Integration for AI Agents</a> was originally published in <a href="https://medium.com/red-buffer">Red Buffer</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Kolmogorov-Arnold Networks: A Game Changer or Just Another AI Experiment?]]></title>
            <link>https://medium.com/red-buffer/kolmogorov-arnold-networks-a-game-changer-or-just-another-ai-experiment-4736189f59b6?source=rss----8d86c116b0ef---4</link>
            <guid isPermaLink="false">https://medium.com/p/4736189f59b6</guid>
            <category><![CDATA[ai-revolution]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[kolmogorov-arnold-network]]></category>
            <category><![CDATA[neural-networks]]></category>
            <category><![CDATA[ai]]></category>
            <dc:creator><![CDATA[Mubashir Iqbal]]></dc:creator>
            <pubDate>Fri, 11 Jul 2025 11:40:32 GMT</pubDate>
            <atom:updated>2025-07-11T11:40:32.583Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Q8wb6-BpCcWBdvRL.png" /><figcaption><a href="https://www.dailydoseofds.com/content/images/size/w1000/2024/05/image-156.png">Source</a></figcaption></figure><p>Kolmogorov-Arnold Networks (KANs) are an exciting new development in deep learning, pushing the boundaries of traditional models like Multi-Layer Perceptrons (MLPs). Inspired by the principles of the Kolmogorov-Arnold Representation Theorem, KANs introduce a novel approach to building neural networks. Unlike MLPs, which use fixed activation functions at each node (commonly referred to as “neurons”), KANs feature learnable activation functions on the connections between nodes (known as “edges”): each weight parameter is replaced by a univariate function, specifically a spline, which can easily switch between broad and detailed adjustments. This shift not only simplifies the handling of complex functions but also enhances the flexibility and interpretability of neural network architectures, marking a significant leap forward in AI technology.</p><h3>Historical Background</h3><p>Let’s delve into history to understand the KAN architecture. At the end of the 19th century, mathematicians and physicists were racing to break higher-degree equations down into simpler equations that were easier to work with and understand, since such equations were widely needed across industries. Equations of some specific degrees were solved by mathematicians, but others were not. This led to the thirteenth of <strong>Hilbert’s 23 problems</strong>: whether it is possible to break 7th-degree polynomials down into simpler functions of two variables.</p><p>The question mattered because, if this could be done, working with and estimating complex functions would become much easier. In simple words, we could describe multiple dimensions by breaking them into simple dimensions that are easy to understand individually.</p><p>To address this problem, two main theorems emerged:</p><ol><li><strong>Universal Approximation Theorem</strong></li><li><strong>Kolmogorov-Arnold Representation theorem</strong></li></ol><h3><strong>Universal Approximation Theorem:</strong></h3><p>It is the foundation of the <strong>Multi-Layer Perceptron (MLP)</strong>, also known as the traditional neural network architecture. It aims to approximate complex functions with the desired accuracy using a feed-forward network.</p><blockquote>“It states that the complex relationship between input and output (a complex function) can be estimated using a hidden layer with <em>n</em> neurons (<em>n</em> may vary depending on the problem), where a sigmoid or any other nonlinear function should be used as the activation function for nonlinear mapping.”</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/500/0*c_itpDWnyQBoE_Eq" /><figcaption><a href="https://www.nicolamanzini.com/single-hidden-layer-neural-network/">Hidden layer is approximating the function</a></figcaption></figure><h3><strong>Kolmogorov–Arnold representation theorem:</strong></h3><p>Consider that your colleague has written a complex piece of code that performs multiple tasks. You have been assigned to review and understand it. However, the problem is that all these tasks are implemented in a non-standard way, without modules, classes, or functions. 
<h3><strong>Kolmogorov–Arnold representation theorem:</strong></h3><p>Consider that a colleague has written a complex piece of code that performs multiple tasks, and you have been assigned to review and understand it. The problem is that all these tasks are implemented in a non-standard way: without modules, classes, or functions. Would it be easier for you to analyze and understand this code, or would a well-structured version that follows the separation of concerns (i.e., one function per task) be more manageable?</p><p>You guessed it: the code that is broken down into functions and classes is much easier to understand and modify. This is exactly what the <strong>Kolmogorov-Arnold theorem</strong> does for functions.</p><blockquote>Any complex function, no matter how many different inputs it has, can be broken down into a combination of simpler, one-variable functions.</blockquote><p><strong>Mathematically</strong>, the theorem can be stated as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/580/1*lwpjz2CEg10EjRTPEeYubA.png" /></figure><p>that is, f(x_1, …, x_n) = Σ_{q=1..2n+1} Φ_q( Σ_{p=1..n} φ_{q,p}(x_p) ), where Φ_q and φ_{q,p} are continuous functions of one variable.</p><h3><strong>Kolmogorov-Arnold Network</strong></h3><p>With this overview of the Kolmogorov-Arnold theorem, we can understand why the research community was so eager to find a way to apply the representation theorem to AI problems: it could help train models more efficiently, using significantly less data. Kolmogorov-Arnold Networks (KANs) address this challenge by adapting the theorem for computational efficiency.</p><p>Now, let’s dive into how KANs adapt the Kolmogorov–Arnold theorem in their architecture.</p><h3>Mathematical Form of KAN</h3><p>Before going deeper, let’s understand the mathematical form of a KAN. Consider a housing price dataset:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/323/1*zF8tlJ42c2BPiN19p92PvQ.png" /></figure><p>The relation between the price and the features can be represented by a multivariate function, say <strong>f</strong>:</p><blockquote>Price = f(φ_b(bedrooms) + φ_sq(area) + φ_a(age))</blockquote><p>or, with several outer functions as in the theorem (each outer function having its own set of inner functions):</p><blockquote>Price = f₁(φ_{1,b}(bedrooms) + φ_{1,sq}(area) + φ_{1,a}(age)) + … + fₙ(φ_{n,b}(bedrooms) + φ_{n,sq}(area) + φ_{n,a}(age))</blockquote><p>Instead of treating the price as a single formula, we break it into multiple layers or steps, making it easier to interpret how each small function contributes to the final prediction. Each of these smaller functions captures the effect of bedrooms, area, and age separately, so we can understand their individual impact before combining them into the overall price estimate.</p><p>Now look back at the formula of the Kolmogorov–Arnold theorem; hopefully, it makes sense now. Take your time to grasp this concept.</p><p>This approach not only simplifies the representation of complex functions but also lets networks like KANs achieve better computational efficiency and interpretability by breaking high-dimensional problems into manageable components.</p><p><strong>Differences between KANs and Multi-Layer Perceptrons (MLPs):</strong></p><ul><li>MLPs have linear weight matrices and apply fixed activation functions at the nodes to learn nonlinearity.</li><li>KANs place learnable activation functions on the edges (where the weights would be), so they contain no linear weight matrices at all.</li><li>KAN nodes simply sum their incoming signals, unlike MLP nodes, which also apply an activation function. The sketch after the figure below makes this split concrete.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*I6x10pxX2twNlEEW" /></figure>
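<p>Here is a minimal, hypothetical sketch of such a layer: each edge carries its own learnable univariate function, and the nodes only sum. For simplicity, each edge function is a learnable mix of fixed Gaussian bumps; that choice, the class name, and the shapes are illustrative assumptions standing in for the B-spline parameterization discussed next, not a reference implementation.</p><pre>import torch
import torch.nn as nn

class KANEdgeLayer(nn.Module):
    """KAN-style layer sketch: every input-to-output edge has its own
    learnable univariate function; output nodes merely sum their edges."""

    def __init__(self, in_dim, out_dim, num_basis=8):
        super().__init__()
        # Fixed bump centers on [-1, 1]; only the mixing coefficients are learned.
        self.register_buffer("centers", torch.linspace(-1.0, 1.0, num_basis))
        self.coef = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, num_basis))

    def forward(self, x):  # x: (batch, in_dim)
        # Evaluate every bump at every input coordinate: (batch, in_dim, num_basis).
        bumps = torch.exp(-((x.unsqueeze(-1) - self.centers) ** 2) / 0.1)
        # phi[b, o, i] is the edge function for edge (i, o) evaluated at x[b, i].
        phi = torch.einsum("bik,oik->boi", bumps, self.coef)
        # A KAN node applies no activation of its own; it just sums incoming edges.
        return phi.sum(dim=-1)

layer = KANEdgeLayer(in_dim=3, out_dim=4)
out = layer(torch.rand(16, 3) * 2 - 1)  # shape (16, 4)</pre><p>Contrast this with an MLP layer, where each edge is a plain multiplication and the nonlinearity lives at the node: here the nonlinearity lives entirely on the edges.</p>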
<h3><strong>B-splines:</strong></h3><blockquote>A B-spline (basis spline) is a piecewise-defined polynomial function that produces a smooth curve while allowing local control over its shape. B-splines are widely used in computer graphics, numerical analysis, and curve fitting.</blockquote><p>To understand why B-splines are used in KANs, consider the case of a single composition, i.e., focus only on the inner sum of the main equation given above.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/954/0*bc5VjYeIBBMZMNBJ" /></figure><p>Remember, these <strong>ϕ</strong> functions are not weights but univariate functions whose coefficients are trained through backpropagation. However, instead of simple polynomial equations, the researchers used B-splines because of their adaptive nature and efficiency.</p><p>Since data points represent positions in space, fitting a single curve through n points requires a polynomial of degree n−1. Its coefficients can be found by solving a system of equations over an (n × n) matrix, an expensive operation, or learned with backpropagation. Worse, as the number of points grows, the resulting curve does still pass through every point, but its extremes start to go wild: just to hit the data points, the curve swings far more than the underlying pattern requires, as the graph below shows.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1011/1*3p1_9-kLApvoZ98g9vdb6w.png" /></figure><p>B-splines solve this by using piecewise functions, each defined from one point to the next so that the curve flows smoothly rather than turning pointy. They give us more control over the points, yielding the best smooth curve, while keeping the function differentiable, which is the core of learning through backpropagation.<br>Another of B-splines’ many advantages is that they give control over the curve between two points without affecting the rest of the curve. This is useful when we need to fine-tune a model for a specific scenario: unlike neural networks, which start to forget previously learned behavior on fine-tuning, a B-spline only improves the relevant area instead of disturbing the whole of its knowledge.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*Ch2sbdSiui3X2T7uMRnSAQ.gif" /></figure><p>Fine-tuning a model built on B-splines adjusts only the stretch of the curve between the points relevant to that specific knowledge area, instead of the whole graph. An MLP, by contrast, adjusts the complete polynomial curve just to move it at one point, losing previous learning on other parts of the graph. The sketch below demonstrates this locality.</p>
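<p>A small sketch of that local-control property, using SciPy’s B-spline routines (the library choice, knot vector, and coefficients are assumptions made for illustration): nudging a single coefficient reshapes the curve only over that coefficient’s knot span and leaves the rest exactly unchanged.</p><pre>import numpy as np
from scipy.interpolate import BSpline

# A cubic B-spline (degree 3) with 8 coefficients on a clamped knot vector.
k = 3
t = np.concatenate([np.zeros(4), [0.2, 0.4, 0.6, 0.8], np.ones(4)])
c_base = np.sin(np.linspace(0.0, np.pi, 8))  # the "learned" coefficients

# "Fine-tune" exactly one coefficient, as a local KAN update would.
c_tuned = c_base.copy()
c_tuned[4] += 1.0

xs = np.linspace(0.0, 1.0, 400)
before = BSpline(t, c_base, k)(xs)
after = BSpline(t, c_tuned, k)(xs)

# The curve moves only where basis function 4 has support; elsewhere the
# difference is exactly zero, so previously learned regions are untouched.
moved = np.abs(after - before) > 0
print("curve changed only for x in [%.2f, %.2f]" % (xs[moved].min(), xs[moved].max()))</pre><p>Run as written, the change is confined to roughly x in [0.2, 1.0], the support of the tweaked basis function, while the curve on the rest of the domain is bit-for-bit identical.</p>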
<h3>Conclusion</h3><p>Kolmogorov-Arnold Networks (KANs), while still in their early stages compared to MLPs, which have matured over time through many refinements, introduce a revolutionary approach to deep learning by replacing traditional fixed activation functions with learnable functions on edges. Rooted in the Kolmogorov-Arnold representation theorem, KANs aim to break complex functions down into simpler, more interpretable components, much as modular programming improves code readability.</p><p>However, the true potential of KANs remains to be seen. Will they match the versatility of MLPs in deep learning? Can they learn efficiently from any kind of data? Only time will tell whether this innovation aligns with industry needs.</p><p><strong>What are your thoughts?</strong> Do you see KANs shaping the future of AI, or will they remain a niche concept? <strong>Comment below!</strong></p><hr><p><a href="https://medium.com/red-buffer/kolmogorov-arnold-networks-a-game-changer-or-just-another-ai-experiment-4736189f59b6">Kolmogorov-Arnold Networks: A Game Changer or Just Another AI Experiment?</a> was originally published in <a href="https://medium.com/red-buffer">Red Buffer</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>