
A Practical Guide to Training AI Agents with Microsoft Agent Lightning

Vishnu Sivan — Sun, 22 Mar 2026 13:43:01 GMT

Artificial intelligence agents are rapidly transforming the way we interact with software. Powered by large language models (LLMs), these agents can answer questions, automate workflows, and integrate with external tools and data sources. However, training and improving AI agents — especially for complex, multi-step tasks — has traditionally been difficult. Developers often need to modify large portions of code, design custom training loops, and manage complex reinforcement learning pipelines.

Microsoft’s Agent Lightning framework aims to simplify this process. It introduces a new approach that separates how an agent operates from how it learns, allowing existing AI agents to be trained with reinforcement learning with minimal or even zero code changes. Instead of redesigning the agent architecture, developers can plug Agent Lightning into their current systems and allow the agent to improve through real-world interactions.

In this article, we explore how to build and train an AI agent using the Agent Lightning framework.

Getting Started

What is Agent Lightning
Why Agent Lightning Matters
Three-Component Architecture of Agent Lightning
How Agent Lightning Works
Hands-on 1: Manual Prompt Search with AgentLightning and OpenAI
Hands-on 2: Building a Trainable LLM Agent with AgentLightning
Hands-on 3: Sentiment Analysis Agent with AgentLightning
Hands-on 4: LangGraph SQL Agent with AgentLightning

What is Agent Lightning?

Agent Lightning is an open-source framework developed by Microsoft that enables AI agents to be trained and optimized using reinforcement learning (RL) based on their real-world execution behavior.

Traditionally, improving an AI agent requires modifying the agent’s internal logic or redesigning parts of its architecture. Agent Lightning solves this problem by introducing an external training layer that observes how an agent behaves during execution. It captures the agent’s actions, states, and outcomes, and uses this data to improve the agent’s performance over time. This approach allows existing agents to become self-learning systems without rewriting their core logic, which is especially valuable in production environments.

At a conceptual level, Agent Lightning treats an agent’s execution as a Markov Decision Process (MDP). During each step of a task, the agent is in a specific state, generates an action (typically an LLM output), and receives a reward depending on whether the action helps achieve the task goal. These rewards become learning signals that guide the reinforcement learning process.
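To make the MDP framing concrete, the sketch below models one agent run as a list of (state, action, reward) steps and computes the discounted return for each step. The `Step` class, the `discounted_returns` helper, and the discount factor `gamma` are illustrative names chosen for this example; they are not part of the Agent Lightning API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One decision point in an agent run (hypothetical structure)."""
    state: str      # what the agent saw, e.g. the prompt plus tool results so far
    action: str     # what the LLM produced at this step
    reward: float   # feedback for this step (often 0.0 until the final step)


def discounted_returns(steps: List[Step], gamma: float = 0.9) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every step, computed back to front."""
    returns = [0.0] * len(steps)
    running = 0.0
    for i in range(len(steps) - 1, -1, -1):
        running = steps[i].reward + gamma * running
        returns[i] = running
    return returns


episode = [
    Step(state="question", action="write SQL query", reward=0.0),
    Step(state="query result", action="fix query", reward=0.0),
    Step(state="fixed result", action="final answer", reward=1.0),
]
print(discounted_returns(episode, gamma=0.5))  # [0.25, 0.5, 1.0]
```

Only the last step earns a reward here, yet the earlier steps still receive credit through the discounted return, which is how a single end-of-task reward can guide every LLM call in the run.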

One of the biggest advantages of Agent Lightning is its framework-agnostic design. It can be integrated with agents built using popular frameworks such as LangChain, OpenAI Agents SDK, AutoGen, CrewAI, LangGraph, or even custom Python-based agents. In most cases, integration requires minimal or near-zero code changes.

The framework typically consists of a Python SDK and a training server. Developers wrap their existing agent logic in a lightweight interface (such as a LitAgent class), define how to evaluate the agent’s output using a reward function, and start the training process. Agent Lightning then collects execution traces, processes them through its hierarchical reinforcement learning algorithm (LightningRL), and updates the model or prompt configuration to improve the agent’s performance.


Key Features

Framework Agnostic — Compatible with major agent frameworks such as LangChain, OpenAI Agents SDK, AutoGen, CrewAI, and custom Python-based agents.
Reinforcement Learning Optimization — Improves agent performance over time by applying reinforcement learning to learn from successes, failures, and feedback signals.
Multiple Training Methods — Supports reinforcement learning, automatic prompt optimization, and supervised fine-tuning.
Multi-Agent Coordination — Enables optimization and collaboration across multiple agents within complex multi-agent systems.
Execution Monitoring & Error Tracking — Uses the Lightning Server to monitor agent execution, detect errors, and track performance.
Selective Optimization — Allows developers to optimize specific agents or components within a larger system.
Flexible and Extensible Architecture — Provides open interfaces for customizing algorithms, reward strategies, and training workflows.

Core Components

LightningStore — Central system that stores tasks, resources, and execution traces.
Tracer — Captures structured data such as prompts, tool calls, and reward signals.
Algorithm Engine — Processes traces and learns improved strategies or prompts.
Trainer — Orchestrates the training workflow and updates the inference system.

Why Agent Lightning Matters

Many popular agent frameworks such as LangChain, LangGraph, CrewAI, and AutoGen enable developers to build powerful AI agents capable of reasoning step-by-step and interacting with tools. However, most of these systems operate using static prompts, fixed workflows, and unchanged model parameters. As a result, the agents do not learn from their past interactions or improve automatically over time.

This limitation becomes a major challenge in real-world applications where tasks are complex and environments constantly change. Developers often need to manually refine prompts, adjust logic, or redesign workflows, which becomes difficult to maintain as systems scale.

Agent Lightning addresses this gap by introducing an automated learning pipeline powered by reinforcement learning. By enabling agents to learn continuously from real-world usage, Agent Lightning transforms traditional static agents into adaptive systems. This capability is particularly valuable for enterprise workflows, long-running automation systems, and multi-step processes, where reliability and accuracy improve only through repeated execution and feedback.

Three-Component Architecture of Agent Lightning

From a deployment perspective, Agent Lightning is split into two parts: the Lightning Server and the Lightning Client. Together, they act as a lightweight intermediate layer that connects agent frameworks with LLM training systems. The framework exposes an OpenAI-compatible LLM API within the training infrastructure, enabling existing agents to integrate with the training system without modifying their original code.

Agent Lightning — Microsoft Research

The architecture of Agent Lightning is built around three core components that work together to create a continuous training loop for AI agents. These components manage learning, execution, and coordination, allowing agents to improve automatically through real-world interactions.

1. Algorithm — The Learning Engine
The Algorithm component is responsible for training and optimizing the agent. It analyzes how agents perform during task execution and uses that information to improve their behavior over time.

Its main responsibilities include:

Assigning tasks (rollouts) for agents to execute
Analyzing execution traces collected from completed tasks
Learning from agent actions and outcomes
Updating resources such as model parameters or prompt templates

Agent Lightning supports pluggable algorithms, allowing developers to choose or implement different optimization strategies. Built-in options include:

APO (Automatic Prompt Optimization) — Improves prompt templates using textual gradients and search strategies
VERL-based Reinforcement Learning — Uses RL techniques such as PPO or GRPO to train agent policies
Custom Algorithms — Developers can implement their own optimization logic through the Algorithm interface.

2. Runner — The Execution Worker
The Runner is responsible for executing agent tasks and collecting telemetry data. It retrieves tasks from the system, runs the agent using the latest resources, and records detailed execution traces.

Key responsibilities include:

Fetching tasks from the queue
Loading the latest models or prompt configurations
Running the agent to complete tasks
Capturing execution traces through a tracer system
Sending collected data back for training

The default LitAgentRunner works with agents wrapped as LitAgent instances, but developers can create custom runners for specialized execution environments.

3. LightningStore — The Coordination Hub
The LightningStore acts as the central system that coordinates all components. It manages tasks, stores execution traces, and keeps track of resource versions used during training.

Its main functions include:

Managing the task queue for agent rollouts
Storing detailed execution traces (spans)
Tracking versions of models, prompts, and resources
Handling retries and failure management during training

Agent Lightning provides multiple storage implementations such as InMemoryLightningStore, SqliteLightningStore, and MongoLightningStore, enabling the framework to scale from local experiments to distributed production environments.
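The store's responsibilities listed above (task queue, spans, resource versions, retries) can be pictured with a toy in-memory class. This is a rough sketch for intuition only: `ToyStore` and all of its method names are hypothetical and do not mirror the real LightningStore interface.

```python
from collections import deque
from typing import Any, Deque, Dict, List, Optional


class ToyStore:
    """Hypothetical in-memory stand-in for the store's coordination role."""

    def __init__(self, max_attempts: int = 2) -> None:
        self.queue: Deque[Dict[str, Any]] = deque()
        self.spans: Dict[int, List[Dict[str, Any]]] = {}
        self.resource_versions: List[Dict[str, Any]] = []
        self.max_attempts = max_attempts
        self._next_id = 0

    def enqueue(self, task: Any) -> int:
        """Add a task to the queue and return its rollout id."""
        rollout_id = self._next_id
        self._next_id += 1
        self.queue.append({"id": rollout_id, "task": task, "attempt": 1})
        return rollout_id

    def dequeue(self) -> Optional[Dict[str, Any]]:
        """Hand the oldest pending rollout to a runner, or None if idle."""
        return self.queue.popleft() if self.queue else None

    def add_span(self, rollout_id: int, span: Dict[str, Any]) -> None:
        """Record one trace span for a rollout."""
        self.spans.setdefault(rollout_id, []).append(span)

    def update_resources(self, resources: Dict[str, Any]) -> int:
        """Store a new resource snapshot and return its version number."""
        self.resource_versions.append(resources)
        return len(self.resource_versions) - 1

    def latest_resources(self) -> Dict[str, Any]:
        """Return the most recent resource snapshot."""
        return self.resource_versions[-1]

    def report_failure(self, rollout: Dict[str, Any]) -> bool:
        """Requeue a failed rollout until max_attempts is reached."""
        if rollout["attempt"] >= self.max_attempts:
            return False
        self.queue.append({**rollout, "attempt": rollout["attempt"] + 1})
        return True


store = ToyStore()
store.update_resources({"system_prompt": "Be concise."})
rid = store.enqueue({"prompt": "Capital of France?"})
rollout = store.dequeue()
store.add_span(rid, {"name": "llm_call", "reward": 1.0})
print(store.latest_resources(), store.spans[rid])
```

Keeping resource snapshots in an append-only list is what lets a trace be tied back to the exact prompt or model version that produced it.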

How Agent Lightning Works

Agent Lightning acts as middleware between reinforcement learning (RL) algorithms and agent environments, providing standardized interfaces that allow scalable training and coordination across different system components.

The Agent Runner manages agents as they execute tasks. It distributes work, monitors progress, and collects execution results and traces.
The Algorithm component handles the training process and hosts the LLMs used for both inference and optimization.
The LightningStore serves as the system’s central data repository.

Agent Lightning: Adding reinforcement learning to AI agents without code rewrites

Execution flow

1. Task Execution
The Lightning Server retrieves tasks from a task pool and sends them to the agent. The agent then attempts to complete the task using its native workflow, which may include tool usage, multi-turn interactions, or coordination with other agents.

2. Trace Collection
Agent Lightning uses a sidecar-based monitoring approach to capture execution data without interfering with the agent’s internal logic. During each run, the system collects structured telemetry such as execution traces, agent actions, errors and reward signals. These traces are converted into state–action–reward–next state transitions, which represent the agent’s decision-making steps during the task.

3. Training and Optimization Loop
The collected traces are organized into training data and processed by a reinforcement learning framework (such as VERL). RL algorithms like GRPO analyze the agent’s behavior and update resources such as prompt templates or model weights.

The updated configuration is then used in the next task cycle, creating a continuous feedback loop where the agent improves its performance through repeated execution and learning. Agent Lightning also supports intermediate reward signals, allowing smaller rewards for successful steps within a task to accelerate training.
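GRPO, mentioned above, scores each rollout relative to other rollouts of the same task instead of relying on a learned value function. The sketch below shows only that normalization step, assuming a population standard deviation and a small epsilon for numerical safety (implementations differ on both details). The function name is illustrative; this is not code from VERL or Agent Lightning.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of the same task, each scored by a reward function.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

When every rollout in a group earns the same reward, all advantages are zero and that group contributes no learning signal, which is one reason a reward function that separates good runs from bad ones matters.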

Basic Integration

Integrating Agent Lightning into an existing agent is simple and requires minimal changes to the original code. The core idea is to add lightweight hooks that capture the agent’s actions and observations during execution, without modifying its internal logic.

By using Agent Lightning’s helper functions, the agent can report key events such as inputs and outputs, enabling the training system to collect data for optimization.

import agentlightning as agl

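# Existing agent logic remains unchanged.
# Note: emit_action / emit_observation illustrate the hook pattern; confirm which
# emit_* helpers exist in your installed agentlightning version before using them.
# `your_agent` stands for whatever agent object you already have.
def your_existing_agent_function(query):
    # Capture the input action
    agl.emit_action("user_query", query)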
    
    # Execute agent logic
    response = your_agent.process(query)
    
    # Capture the output observation
    agl.emit_observation("agent_response", response)
    
    return response

Hands-on 1: Manual Prompt Search with AgentLightning and OpenAI

In this hands-on tutorial, we build a simple QA agent with Agent Lightning in Google Colab and compare several candidate system prompts by their average reward. No training algorithm runs yet; the goal is to show how tasks, resources, and rewards flow through the framework, which is the same loop that reinforcement learning later builds on.

We create an in-memory LightningStore, a LitAgentRunner, and a basic QA agent, then run every task under each candidate prompt. As the agent executes tasks, it emits a reward for each answer, and the prompt with the highest average reward wins.

Agent Lightning supports multiple learning algorithms, including:

APO (Automatic Prompt Optimization) — requires additional libraries such as POML
VERL (Reinforcement Learning) — integrates with frameworks like PyTorch and vLLM, and requires GPU support

Note that VERL setup may take 20–40 minutes, depending on dependencies and environment configuration.

Setting Up the Environment

For this tutorial, we will use Google Colab with GPU support to ensure efficient training.

1. Open Google Colab and sign in with your Google account
2. Create a new notebook
3. Go to Runtime → Change runtime type

   Set Hardware Accelerator to GPU
   Select T4 GPU (recommended)

4. Click Save

Installing dependencies

Install the required libraries using the following command.

!pip install agentlightning

Build a simple QA agent and enable training using Agent Lightning.

import os, asyncio, nest_asyncio, logging
from getpass import getpass

os.environ["AGENTOPS_DISABLE_AUTO_INSTRUMENTATION"] = "true"
# Never hard-code API keys in a notebook; prompt for the key if it is not set.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")

logging.getLogger("agentlightning.tracer.otel").setLevel(logging.ERROR)
logging.getLogger("opentelemetry.trace").setLevel(logging.ERROR)

from agentlightning import (
    LitAgent, LitAgentRunner, OtelTracer,
    emit_reward, NamedResources, Rollout,
    InMemoryLightningStore,
)
from agentlightning.types import PromptTemplate
from typing import Any, Dict, Optional
import openai

nest_asyncio.apply()
MODEL = os.getenv("MODEL", "gpt-4o-mini")

_reward_log: list[float] = []

class QAAgent(LitAgent):
    def rollout(self, task: Dict[str, Any], resources: NamedResources, rollout: Rollout) -> Optional[float]:
        sys_prompt = resources["system_prompt"].template
        user = task["prompt"]
        gold = task.get("answer", "").strip().lower()

        try:
            r = openai.chat.completions.create(
                model=MODEL,
                messages=[
                    {"role": "system", "content": sys_prompt},
                    {"role": "user",   "content": user},
                ],
                temperature=0.2,
            )
            pred = r.choices[0].message.content.strip()
        except Exception as e:
            pred = f"[error] {e}"

        def score(pred: str, gold: str) -> float:
            P       = pred.lower()
            base    = 1.0 if gold and gold in P else 0.0
            gt      = set(gold.split()); pr = set(P.split())
            inter   = len(gt & pr); denom = (len(gt) + len(pr)) or 1
            overlap = 2 * inter / denom
            brevity = 0.2 if base == 1.0 and len(P.split()) <= 8 else 0.0
            return max(0.0, min(1.0, 0.7 * base + 0.25 * overlap + brevity))

        reward = float(score(pred, gold))
        emit_reward(reward)
        _reward_log.append(reward)
        print(f"  Q: {user!r:45s} | Pred: {pred!r:20s} | Gold: {gold!r:12s} | R: {reward:.3f}")
        return reward

TASKS = [
    {"prompt": "Capital of France?",             "answer": "Paris"},
    {"prompt": "Who wrote Pride and Prejudice?", "answer": "Jane Austen"},
    {"prompt": "2+2 = ?",                        "answer": "4"},
]

PROMPTS = [
    "You are a terse expert. Answer with only the final fact, no sentences.",
    "You are a helpful, knowledgeable AI. Prefer concise, correct answers.",
    "Answer as a rigorous evaluator; return only the canonical fact.",
    "Be a friendly tutor. Give the one-word answer if obvious.",
]

async def run_prompt_search():
    store  = InMemoryLightningStore()
    agent  = QAAgent()
    tracer = OtelTracer()
    runner = LitAgentRunner(tracer=tracer)

    results = []

    for sp in PROMPTS:
        print(f"\n{'='*60}")
        print(f"Prompt: {sp}")
        print('='*60)

        await store.update_resources(
            resources_id="default",
            resources={"system_prompt": PromptTemplate(template=sp, engine="f-string")}
        )
        _reward_log.clear()

        with runner.run_context(agent=agent, store=store):
            for t in TASKS:
                await runner.step(t)

        avg = sum(_reward_log) / len(_reward_log) if _reward_log else 0.0
        print(f"\n  → Prompt avg: {avg:.3f}")
        results.append((sp, avg))

    best = max(results, key=lambda x: x[1])
    print(f"\n{'='*60}")
    print(f"BEST PROMPT : {best[0]}")
    print(f"BEST SCORE  : {best[1]:.3f}")

asyncio.run(run_prompt_search())

Here’s a step-by-step walkthrough of what the code does:

1. Setup & Imports: Configures API keys, suppresses noisy logs, applies nest_asyncio (needed to run asyncio inside Jupyter), and imports AgentLightning components.

2. QAAgent (the core agent): Subclasses LitAgent and implements rollout(), which is called once per task:

Pulls the current system prompt from resources["system_prompt"]
Calls OpenAI with that prompt + the task’s question
Scores the prediction against the gold answer using a custom score() function that combines exact match, token overlap, and a brevity bonus
Emits the reward back to AgentLightning via emit_reward() and logs it to _reward_log

3. TASKS and PROMPTS: TASKS is a small 3-question QA dataset. PROMPTS is a list of 4 candidate system prompts with different styles (terse, helpful, rigorous, friendly); these are the prompts being compared.

4. run_prompt_search() (the search loop): This is the main async function that runs a manual prompt search:

Creates an InMemoryLightningStore, a QAAgent, and a LitAgentRunner
Loops over each candidate prompt, updates it in the store, then runs all 3 tasks through the agent
Collects the average reward for each prompt
At the end, picks and prints the best-scoring prompt

5. Entry point: asyncio.run(run_prompt_search()) kicks everything off. The output shows the average reward for each system prompt across the 3 QA tasks, followed by the best-scoring prompt.

6. Run the cell to display the output.
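To see how the three parts of the reward interact, the score() logic from QAAgent can be lifted out and run on a few hand-written predictions. The function below reproduces the same arithmetic with longer variable names; the sample predictions are made up for illustration.

```python
def score(pred: str, gold: str) -> float:
    """Same scoring logic as in QAAgent.rollout, lifted out for inspection."""
    p = pred.lower()
    base = 1.0 if gold and gold in p else 0.0            # substring match
    gold_tokens, pred_tokens = set(gold.split()), set(p.split())
    shared = len(gold_tokens & pred_tokens)
    denom = (len(gold_tokens) + len(pred_tokens)) or 1
    overlap = 2 * shared / denom                          # Dice-style token overlap
    brevity = 0.2 if base == 1.0 and len(p.split()) <= 8 else 0.0
    return max(0.0, min(1.0, 0.7 * base + 0.25 * overlap + brevity))


for pred in ["Paris", "The capital of France is Paris.", "London"]:
    print(f"{pred!r}: {score(pred, 'paris'):.2f}")
# 'Paris': 1.00
# 'The capital of France is Paris.': 0.90
# 'London': 0.00
```

The full-sentence answer still passes the substring check, but its token "paris." (with the trailing period) does not equal the gold token "paris", so the overlap term contributes nothing. This is why the terse prompts tend to score higher under this reward.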

Hands-on 2: Building a Trainable LLM Agent with AgentLightning

This hands-on walks you through building a minimal but complete trainable LLM agent using the AgentLightning framework. You’ll learn how agents receive tasks, call an LLM, score their own outputs via a reward function, and report results back to a training loop. This is the core cycle behind reinforcement learning from human or environment feedback (RLHF-style training).

Part 1 — compute_reward()

def compute_reward(output: str, expected: str) -> float:
    """
    Simple reward logic. +1.0 for correct answer, -1.0 for wrong answer.
    """
    return 1.0 if output.strip().lower() == expected.strip().lower() else -1.0

The simplest possible reward function. It does an exact string match between the agent’s output and the expected answer, returning +1.0 for a correct answer and -1.0 for a wrong one. In real-world agents this could be a more nuanced scorer (e.g. F1, BLEU, SQL execution match).
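As one example of a more nuanced scorer, the sketch below computes a token-level F1 between the output and the expected answer. It is an illustrative alternative, not part of AgentLightning: `token_f1` and `_tokens` are hypothetical helper names, and the result lies in [0.0, 1.0] rather than the -1.0/+1.0 range used by compute_reward().

```python
import re
from collections import Counter
from typing import List


def _tokens(text: str) -> List[str]:
    """Lowercase the text and keep only runs of letters or digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(output: str, expected: str) -> float:
    """Token-level F1 between output and expected, in the range [0.0, 1.0]."""
    out, exp = _tokens(output), _tokens(expected)
    if not out or not exp:
        return 0.0
    shared = sum((Counter(out) & Counter(exp)).values())  # multiset intersection
    if shared == 0:
        return 0.0
    precision = shared / len(out)
    recall = shared / len(exp)
    return 2 * precision * recall / (precision + recall)


print(token_f1("2 + 2 equals 4", "2 + 2 equals 4."))   # 1.0
print(token_f1("The answer is 4", "2 + 2 equals 4."))  # 0.25
```

Unlike exact matching, this gives partial credit when the model phrases a correct answer differently, which produces a smoother signal for an optimizer to follow.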

Part 2 — SimpleAgent

import agentlightning as agl
from agentlightning import LitAgent, emit_reward, NamedResources, Rollout
from openai import OpenAI
from typing import Any, Dict, Optional

class SimpleAgent(LitAgent):
    """
    A LitAgent that uses an LLM resource (like LitSQLAgent) instead of
    hardcoded rule-based logic.
    """

    def __init__(self, system_prompt: str = "You are a helpful assistant."):
        super().__init__()
        self.system_prompt = system_prompt

    def rollout(self, task: Dict[str, Any], resources: NamedResources, rollout: Rollout) -> Optional[float]:
        # 1. Extract the LLM resource (same as LitSQLAgent)
        llm: agl.LLM = resources["main_llm"]

        # 2. Build the OpenAI client using the LLM resource endpoint
        client = OpenAI(
            base_url=llm.endpoint,
            api_key=llm.api_key,
        )

        # 3. Call the LLM (instead of hardcoded if/elif logic)
        question = task.get("input", "")
        response = client.chat.completions.create(
            model=llm.model,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user",   "content": question},
            ],
            **(llm.sampling_parameters or {}),
        )
        output = response.choices[0].message.content.strip()

        # 4. Compute reward and emit it (same as LitSQLAgent)
        expected = task.get("expected_output", "")
        reward_value = compute_reward(output, expected)
        emit_reward(reward_value)

        print(f"[Agent] Input:    {question}")
        print(f"[Agent] Output:   {output}")
        print(f"[Agent] Expected: {expected}")
        print(f"[Agent] Reward:   {reward_value}")
        print("-" * 40)

        return float(reward_value)

The agent subclasses LitAgent and implements one required method: rollout(). This is called once per task during training or evaluation. Inside it:

Extracts the LLM resource from resources["main_llm"] — this is injected by the runner, not hardcoded
Builds an OpenAI client using the endpoint and API key from that resource
Calls the LLM with a system prompt + the task’s question
Scores the response via compute_reward() and calls emit_reward() to report it back to AgentLightning's training loop
Returns the reward as a float for immediate use

Part 3 — Training Data and LLM Resource

import os
os.environ["AGENTOPS_DISABLE_AUTO_INSTRUMENTATION"] = "true"

import asyncio
import nest_asyncio
nest_asyncio.apply()

import agentlightning as agl
from agentlightning import (
    LitAgentRunner,
    InMemoryLightningStore,
    OtelTracer,
    LLM,
    Trainer,
)

# Training data: list of task inputs your agent will be tested on
TRAINING_DATA = [
    {"input": "What is 2+2?",   "expected_output": "2 + 2 equals 4."},
    {"input": "Who is Newton?", "expected_output": "Newton is the father of classical mechanics."},
    {"input": "Calculate 2+2",  "expected_output": "2 + 2 equals 4."},
]

# LLM resource: tells the agent which endpoint, model, and API key to use.
# The key is read from the OPENAI_API_KEY environment variable; set it before
# running this cell and never hard-code it in the notebook.

def get_llm_resource(temperature: float = 0.7) -> agl.LLM:
    if not os.environ.get("OPENAI_API_KEY"):
        raise RuntimeError("Set OPENAI_API_KEY before creating the LLM resource")

    return agl.LLM(
        endpoint="https://api.openai.com/v1",
        model="gpt-4o-mini",
        api_key=os.environ.get("OPENAI_API_KEY", ""),
        sampling_parameters={"temperature": temperature},
    )

TRAINING_DATA is a list of task dictionaries — each has an input (what the agent sees) and expected_output (what it should produce, used by the reward function).

get_llm_resource() creates an agl.LLM object that wraps the model config (endpoint, model name, API key, sampling parameters). This is passed into the store so the agent can access it at runtime without hardcoding credentials inside the agent class.
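One way to keep credentials out of the notebook source is to read the key from the environment and prompt for it only when it is missing. The helper below is a small sketch of that pattern; `load_api_key` is a hypothetical name, not an AgentLightning function.

```python
import os
from getpass import getpass


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Read the key from the environment; prompt once if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        key = getpass(f"Enter {var_name}: ").strip()
        os.environ[var_name] = key  # make it visible to libraries that read the env var
    return key


# Usage (prompts only when the variable is not already set):
# api_key = load_api_key()
```

The returned value can then be passed as the api_key argument when building the agl.LLM resource, so the key never appears in a saved or shared notebook.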

Part 4 — run_step_test() and run_trainer()

def run_step_test():
    """
    Directly tests the agent by calling runner.step() for each training sample.
    This bypasses the full Trainer loop and is the easiest way to verify
    that the agent and the reward function are connected and working correctly.
    """
    print("=" * 40)
    print("Running step-by-step test...")
    print("=" * 40)

    tracer = OtelTracer()
    store  = InMemoryLightningStore()
    agent  = SimpleAgent()
    runner = LitAgentRunner(tracer=tracer)

    async def _run():
        # Seed the store with resources before running - the runner requires this
        await store.update_resources(
            resources_id="default", 
            resources={"main_llm": get_llm_resource(temperature=0.7)}
        )

        with runner.run_context(agent=agent, store=store):
            for task in TRAINING_DATA:
                rollout = await runner.step(task)
                print(f"[Rollout status] {rollout.status}\n")

    asyncio.run(_run())
    print("Step test completed successfully!")

def run_trainer():
    """
    Full Trainer-based training loop.
    Uses dev=True for a dry-run. Switch to dev=False and trainer.fit()
    for real training with an algorithm.
    """
    print("=" * 40)
    print("Running Trainer (dev mode)...")
    print("=" * 40)

    trainer = Trainer(
        runner=LitAgentRunner,
        dev=True,
        initial_resources={"main_llm": get_llm_resource(temperature=0.7)},
    )
    # Instantiate the agent
    agent = SimpleAgent()
    trainer.dev(agent=agent, train_dataset=TRAINING_DATA, val_dataset=TRAINING_DATA)
    print("Trainer dev run completed successfully!")

if __name__ == "__main__":
    # Step 1: Test agent + reward wiring directly (recommended first)
    run_step_test()

    # Step 2: Dry-run the full Trainer loop (comment this out to skip it)
    run_trainer()

run_step_test is the recommended first test — it bypasses the full Trainer and directly calls runner.step() for each task. This lets you verify that the agent, LLM, and reward function are all wired correctly before introducing the complexity of a training algorithm. The flow is:

Seed the store with the LLM resource
Open a run_context that binds the agent and store to the runner
Step through each task one by one, printing the rollout status

run_trainer is the full Trainer-based loop. dev=True runs a dry-run — it exercises the full pipeline without running a real optimization algorithm, which is useful for integration testing. Switching to dev=False and calling trainer.fit() with a real algorithm (like APO) would enable actual prompt or model optimization.

Part 5 — Execute the code

Run the cells sequentially to display the output.

Hands-on 3: Sentiment Analysis Agent with AgentLightning

This hands-on demonstrates how to build a prompt-optimizing sentiment analysis agent using AgentLightning. The agent classifies text as “positive” or “negative” using an LLM, then automatically tests multiple prompt styles to find which one performs best — all without writing a training loop from scratch.

Step 1 — Install & Setup

!pip install poml

Installs poml, a dependency required by AgentLightning's APO (Automatic Prompt Optimization) module.

Step 2 — Imports & Environment

Sets the OpenAI key, disables AgentOps auto-instrumentation (avoids TracerProvider conflicts), and silences noisy but harmless OpenTelemetry warnings.

import os
import asyncio
import logging
from typing import Any, Dict, Optional

# Set environment variables before importing agentlightning, as in the earlier examples
os.environ["OPENAI_API_KEY"] = "..."
os.environ["AGENTOPS_DISABLE_AUTO_INSTRUMENTATION"] = "true"

import nest_asyncio
import agentlightning as agl
from openai import OpenAI
from agentlightning import LitAgent, LitAgentRunner, OtelTracer, emit_reward, NamedResources, Rollout, InMemoryLightningStore
from agentlightning.types import PromptTemplate  # used by make_resources()

nest_asyncio.apply()
logging.getLogger("agentlightning.tracer.otel").setLevel(logging.ERROR)

Step 3 — Create agents

The agent subclasses LitAgent and implements rollout(), which runs once per task. It pulls the LLM and prompt template from resources, calls OpenAI, compares the prediction to the gold label, and reports a binary reward (1.0 correct, 0.0 wrong).

# ── Agent ─────────────
class SentimentAgent(LitAgent):
    def rollout(self, task: Dict[str, Any], resources: NamedResources, rollout: Rollout) -> Optional[float]:
        # Pull resources exactly like QAAgent pulls "system_prompt"
        llm         = resources["main_llm"]
        prompt_tmpl = resources["prompt_template"].template

        # Format the prompt with the task text
        prompt = prompt_tmpl.format(**task)

        # Build client from resource
        client = OpenAI(
            base_url=llm.endpoint,
            api_key=llm.api_key,
        )

        response = client.chat.completions.create(
            model=llm.model,
            messages=[{"role": "user", "content": prompt}],
            **(llm.sampling_parameters or {}),
        )

        pred = response.choices[0].message.content.strip().lower().rstrip(".")
        gold = task["expected_label"].lower()
        reward = 1.0 if pred == gold else 0.0

        emit_reward(reward)
        print(f"  Text: {task['text']!r:40s} | Pred: {pred!r:12s} | Gold: {gold!r:12s} | R: {reward:.1f}")
        return reward

# ── Data ──────────────────────────
TRAIN_TASKS = [
    {"text": "I love this product!",                "expected_label": "positive"},
    {"text": "This is terrible and disappointing.", "expected_label": "negative"},
    {"text": "An excellent experience overall.",    "expected_label": "positive"},
    {"text": "I will never buy this again.",        "expected_label": "negative"},
]

PROMPTS = [
    'Classify the sentiment. Reply with exactly one word: positive or negative.\nText: "{text}"\nSentiment:',
    'Is the following text positive or negative? Reply with one word only.\nText: "{text}"',
    'Sentiment analysis: respond only with "positive" or "negative".\nInput: "{text}"',
]

# ── Resources ────────────────────────────────
def make_resources(prompt_template: str) -> dict:
    return {
        "main_llm": agl.LLM(
            model="gpt-4o-mini",
            endpoint="https://api.openai.com/v1",
            api_key=os.environ["OPENAI_API_KEY"],
            sampling_parameters={"temperature": 0.2},
        ),
        "prompt_template": PromptTemplate(template=prompt_template, engine="f-string"),
    }

# ── Prompt search loop ──────────────────
async def run_sentiment_search():
    store  = InMemoryLightningStore()
    agent  = SentimentAgent()
    tracer = OtelTracer()
    runner = LitAgentRunner(tracer=tracer)

    results = []

    for prompt_tmpl in PROMPTS:
        print(f"\n{'='*60}")
        print(f"Prompt: {prompt_tmpl[:60]}...")
        print('='*60)

        await store.update_resources(
            resources_id="default",
            resources=make_resources(prompt_tmpl)
        )

        reward_log = []
        with runner.run_context(agent=agent, store=store):
            for task in TRAIN_TASKS:
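                result = await runner.step(task)
                # Assumes the returned rollout exposes a `reward` attribute. If it
                # does not, reward_log stays empty and every average prints as 0.000;
                # in that case append rewards inside rollout(), as in Hands-on 1.
                if result and hasattr(result, "reward"):
                    reward_log.append(result.reward)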

        avg = sum(reward_log) / len(reward_log) if reward_log else 0.0
        print(f"\n  → Prompt avg: {avg:.3f}")
        results.append((prompt_tmpl, avg))

    best = max(results, key=lambda x: x[1])
    print(f"\n{'='*60}")
    print(f"BEST PROMPT : {best[0]}")
    print(f"BEST SCORE  : {best[1]:.3f}")

asyncio.run(run_sentiment_search())

Step 4 — Execute the code

Run the cells sequentially to display the output.
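The exact-match comparison in rollout() marks replies such as "Sentiment: positive" as wrong even though the label is correct. A slightly more forgiving parser is sketched below; `extract_label` is a hypothetical helper, and returning None for ambiguous replies is one possible design choice among several.

```python
import re
from typing import Optional


def extract_label(raw: str) -> Optional[str]:
    """Return the single sentiment label found in raw, or None if ambiguous."""
    found = set(re.findall(r"\b(positive|negative)\b", raw.lower()))
    return found.pop() if len(found) == 1 else None


print(extract_label("Positive."))                      # positive
print(extract_label("Sentiment: negative"))            # negative
print(extract_label("Could be positive or negative"))  # None
```

Inside rollout(), the prediction could be passed through such a helper before comparing it with the gold label, so that the reward reflects the classification rather than the formatting of the reply.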

Hands-on 4: LangGraph SQL Agent with AgentLightning

A notable example from Microsoft’s research highlights the use of Agent Lightning to train AI agents that can generate and iteratively refine SQL queries.

This hands-on builds a Text-to-SQL agent that takes natural language questions, generates SQLite queries using an LLM, executes them, and automatically retries on errors. It uses LangGraph for the query-generate-execute-fix loop and AgentLightning for training orchestration.

Step 1 — Downloading the Dataset

Download the Spider dataset from Google Drive and extract it into the working directory to prepare it for use in the tutorial.

!gdown --fuzzy https://drive.google.com/file/d/1oi9J1jZP9TyM35L85CL3qeGWl2jqlnL6/view
!unzip -q spider-data.zip -d data && rm spider-data.zip

Step 2 — Install dependencies

Install all the dependencies, including agentlightning[verl], torch, flash-attn, langchain, langchain-openai, langchain-community, langchain-text-splitters, and faiss-cpu.

!pip install agentlightning[verl]

# For Colab/Kaggle (CUDA 12.1):
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# For CPU-only environments:
# !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

!pip install flash-attn --no-build-isolation
!pip install pyairports --break-system-packages

!pip install langchain
!pip install langchain-openai
!pip install langchain-community
!pip install langchain-text-splitters
!pip install faiss-cpu

Step 3— Import libraries

import os
import sqlite3
from typing import TypedDict
from langgraph.graph import StateGraph
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
import pandas as pd
import agentlightning as agl
import nest_asyncio
nest_asyncio.apply()

os.environ["AGENTOPS_DISABLE_AUTO_INSTRUMENTATION"] = "true"
os.environ["OPENAI_API_BASE"] = "https://api.openai.com/v1"
os.environ["OPENAI_API_KEY"] = "..."

# Silence noisy but harmless warnings
import logging
logging.getLogger("agentlightning.tracer.otel").setLevel(logging.ERROR)
logging.getLogger("agentlightning.tracer.agentops").setLevel(logging.ERROR)
logging.getLogger("opentelemetry.trace").setLevel(logging.ERROR)

Step 4— Define state schema

This step defines the data that flows between nodes in the LangGraph graph: every node reads from and writes to the shared SQLState dict. get_schema extracts the table definitions so the LLM knows which tables and columns exist. run_sql executes a query and returns either the result rows or an error string; the error string is what triggers the retry loop.

# ── State schema (required by StateGraph) ──────────────────────
class SQLState(TypedDict):
    question:    str
    schema:      str
    query:       str
    result:      str
    error:       str
    attempts:    int

# ── SQL helpers ───────────────────
def get_schema(database_path: str) -> str:
    conn = sqlite3.connect(database_path.replace("sqlite:///", ""))
    cursor = conn.cursor()
    cursor.execute("SELECT sql FROM sqlite_master WHERE type='table'")
    schema = "\n".join(row[0] for row in cursor.fetchall() if row[0])
    conn.close()
    return schema

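def run_sql(query: str, db_path: str):
    """Run a query and return the rows, or an "ERROR: ..." string on failure."""
    conn = None
    try:
        conn = sqlite3.connect(db_path.replace("sqlite:///", ""))
        cursor = conn.cursor()
        cursor.execute(query)
        return cursor.fetchall()
    except Exception as e:
        return f"ERROR: {e}"
    finally:
        # Close the connection on both the success and the error path
        if conn is not None:
            conn.close()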

Step 5— LangGraph Agent

Builds a 4-node graph: write_query → execute_query → check_query → conditionally rewrite_query or end. The key routing logic is in should_rewrite — if there's an error and attempts remain, it loops back to fix the query; otherwise it exits.

# ── LangGraph agent builder ───────────────
def build_langgraph_sql_agent(
    database_path: str,
    openai_base_url: str,
    model: str,
    sampling_parameters: dict,
    max_turns: int,
    truncate_length: int,
):
    llm = ChatOpenAI(
        base_url=openai_base_url,
        model=model,
        **sampling_parameters,
    )
    schema = get_schema(database_path)

    def write_query(state: SQLState) -> SQLState:
        response = llm.invoke([
            SystemMessage("You are a SQL expert. Write a SQLite query to answer the question. Return ONLY the SQL query."),
            HumanMessage(f"Schema:\n{schema}\n\nQuestion: {state['question']}"),
        ])
        return {**state, "query": response.content.strip(), "attempts": 0}

    def execute_query(state: SQLState) -> SQLState:
        result = run_sql(state["query"], database_path)
        if isinstance(result, str) and result.startswith("ERROR"):
            return {**state, "error": result, "result": ""}
        return {**state, "result": str(result)[:truncate_length], "error": ""}

    def check_query(state: SQLState) -> SQLState:
        # Pass through; routing logic handled by conditional edge below
        return state

    def rewrite_query(state: SQLState) -> SQLState:
        response = llm.invoke([
            SystemMessage("You are a SQL expert. Fix the SQL query based on the error. Return ONLY the corrected SQL query."),
            HumanMessage(
                f"Schema:\n{schema}\n\nOriginal query: {state['query']}\n"
                f"Error: {state['error']}\nQuestion: {state['question']}"
            ),
        ])
        return {**state, "query": response.content.strip(), "attempts": state["attempts"] + 1}

    def should_rewrite(state: SQLState) -> str:
        if state.get("error") and state["attempts"] < max_turns:
            return "rewrite_query"
        return "__end__"

    builder = StateGraph(SQLState)

    builder.add_node("write_query",   write_query)
    builder.add_node("execute_query", execute_query)
    builder.add_node("check_query",   check_query)
    builder.add_node("rewrite_query", rewrite_query)

    builder.add_edge("__start__",    "write_query")
    builder.add_edge("write_query",  "execute_query")
    builder.add_edge("execute_query","check_query")

    builder.add_conditional_edges("check_query", should_rewrite)
    builder.add_edge("rewrite_query", "execute_query")

    return builder.compile()
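The control flow is easier to follow without the graph machinery. Below is a plain-Python sketch of the same execute, check, and rewrite loop. It is not LangGraph code: fake_llm_fix is a hypothetical stand-in for the LLM call in rewrite_query, and the singer table exists only for this demo.

```python
import sqlite3

def run_sql(query: str, conn: sqlite3.Connection):
    """Return rows on success, or an 'ERROR: ...' string on failure."""
    try:
        return conn.execute(query).fetchall()
    except Exception as e:
        return f"ERROR: {e}"

def fake_llm_fix(query: str, error: str) -> str:
    # Hypothetical stand-in for the rewrite_query LLM call: repairs one known typo.
    return query.replace("nmae", "name")

def answer(first_draft: str, conn: sqlite3.Connection, max_turns: int = 3) -> dict:
    state = {"query": first_draft, "result": "", "error": "", "attempts": 0}
    while True:
        # execute_query: store either the rows or the error string
        outcome = run_sql(state["query"], conn)
        if isinstance(outcome, str) and outcome.startswith("ERROR"):
            state.update(error=outcome, result="")
        else:
            state.update(result=str(outcome), error="")
        # should_rewrite: continue only while there is an error and attempts remain
        if not (state["error"] and state["attempts"] < max_turns):
            return state
        # rewrite_query: ask the (fake) LLM for a corrected query
        state["query"] = fake_llm_fix(state["query"], state["error"])
        state["attempts"] += 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT)")
conn.execute("INSERT INTO singer VALUES ('Ana')")

final = answer("SELECT nmae FROM singer", conn)
stuck = answer("SELECT zzz FROM singer", conn, max_turns=2)
print(final["query"], final["result"], final["attempts"])
print(stuck["attempts"], stuck["error"])
```

The first call recovers after one rewrite. The second never succeeds and stops once attempts reaches max_turns, which is the same exit condition that should_rewrite encodes.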

Step 6— Reward function

Rewards execution equivalence, not exact string match — two different SQL queries that return the same rows both get 1.0. This is more robust than comparing SQL strings directly.

# ── Reward function ────────────────────────
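def evaluate_query(predicted_query: str, ground_truth_query: str, db_path: str, raise_on_error: bool = False) -> float:
    result_pred = run_sql(predicted_query, db_path)
    result_true = run_sql(ground_truth_query, db_path)
    # run_sql reports failures as "ERROR: ..." strings instead of raising,
    # so check for them explicitly; otherwise two identical errors would score 1.0.
    for result in (result_pred, result_true):
        if isinstance(result, str) and result.startswith("ERROR"):
            if raise_on_error:
                raise RuntimeError(result)
            return 0.0
    return 1.0 if result_pred == result_true else 0.0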
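To make execution equivalence concrete, here is a small self-contained sketch on an in-memory database. The table and queries are invented for the demo. Note that comparing fetchall() results compares lists, so row order matters; the unordered variant is a hypothetical alternative, not something the tutorial code does.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ana", 30), ("Bo", 25), ("Cy", 41)])

def rows(query: str):
    return conn.execute(query).fetchall()

def same_rows_ordered(q1: str, q2: str) -> float:
    # Mirrors a plain list comparison: row order matters.
    return 1.0 if rows(q1) == rows(q2) else 0.0

def same_rows_unordered(q1: str, q2: str) -> float:
    # Hypothetical variant: compare rows as multisets so ordering is ignored.
    return 1.0 if Counter(rows(q1)) == Counter(rows(q2)) else 0.0

gold = "SELECT name FROM singer WHERE age > 28 ORDER BY name"
pred_equivalent = "SELECT name FROM singer WHERE NOT age <= 28 ORDER BY name"
pred_reordered = "SELECT name FROM singer WHERE age > 28 ORDER BY name DESC"

print(same_rows_ordered(gold, pred_equivalent))    # 1.0
print(same_rows_ordered(gold, pred_reordered))     # 0.0
print(same_rows_unordered(gold, pred_reordered))   # 1.0
```

Whether ordering should count depends on the question: it matters when the question asks for a sorted list and is noise otherwise.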

Step 7— The Agent

LitSQLAgent subclasses LitAgent and wires everything together in rollout(): it builds the LangGraph agent from the injected LLM resource, runs it on the task, computes the reward, and emits it back to AgentLightning.

# ── Agent ─────────────────────────────────────────────────────────────────────
class LitSQLAgent(agl.LitAgent):
    def __init__(self, max_turns: int, truncate_length: int):
        super().__init__()
        self.max_turns = max_turns
        self.truncate_length = truncate_length

    def rollout(
        self,
        task: dict,
        resources: agl.NamedResources,
        rollout: agl.Rollout,
    ) -> float:
        llm: agl.LLM = resources["main_llm"]

        agent = build_langgraph_sql_agent(
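            # Point the agent at the same SQLite file that the reward function uses
            database_path="sqlite:///" + task.get("db_path", f"data/database/{task['db_id']}/{task['db_id']}.sqlite"),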
            openai_base_url=llm.endpoint,    # use endpoint directly
            model=llm.model,
            sampling_parameters=llm.sampling_parameters or {},
            max_turns=self.max_turns,
            truncate_length=self.truncate_length,
        )

        # Safely get langchain handler only if available
        callbacks = []
        tracer = self.get_tracer()
        if hasattr(tracer, "get_langchain_handler"):
            callbacks.append(tracer.get_langchain_handler())

        result = agent.invoke(
            {"question": task["question"], "schema": "", "query": "",
             "result": "", "error": "", "attempts": 0},
            {"callbacks": callbacks, "recursion_limit": 100},
        )

        reward = evaluate_query(
            result.get("query", ""),
            task.get("query", task.get("sql", "")),   # ground truth SQL
            task.get("db_path", f"data/database/{task['db_id']}/{task['db_id']}.sqlite"),
            raise_on_error=False,
        )
        print(f"[Rollout] Q: {task['question'][:50]} | Reward: {reward:.3f}")
        agl.emit_reward(reward)
        return reward

Step 8— VERL Config

This step configures the VERL reinforcement learning algorithm: it uses GRPO as the advantage estimator, the hf (Hugging Face) rollout engine, and fine-tunes Qwen/Qwen2.5-Coder-1.5B-Instruct. Running it requires a GPU node.

# ── VERL config ───────────────────────────────────────────────────────────────
verl_config = {
    "algorithm": {"adv_estimator": "grpo", "use_kl_in_reward": False},
    "data": {
        "train_batch_size": 8,
        "max_prompt_length": 4096,
        "max_response_length": 2048,
    },
    "actor_rollout_ref": {
        "rollout": {
            "name": "hf",
            "n": 4,
            "tensor_model_parallel_size": 1,
            "multi_turn": {"format": "hermes"},
        },
        "actor": {"ppo_mini_batch_size": 8, "optim": {"lr": 1e-6}},
        "model": {"path": "Qwen/Qwen2.5-Coder-1.5B-Instruct"},
    },
    "trainer": {
        "n_gpus_per_node": 1,
        "val_before_train": True,
        "test_freq": 32,
        "save_freq": 64,
        "total_epochs": 1,
    },
}
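For intuition about what "adv_estimator": "grpo" and "n": 4 imply, the sketch below shows the group-relative idea behind GRPO: several rollouts are sampled for the same question, and each reward is normalized against the mean and standard deviation of its own group. This is a simplified illustration under those assumptions, not VERL's implementation, whose normalization details may differ.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one question (n = 4 in the config), two of them correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])

# If every rollout gets the same reward, there is no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(flat)
```

Correct rollouts receive a positive advantage and incorrect ones a negative advantage, while a group with identical rewards contributes nothing. That is one reason a binary reward works best when the model is right only some of the time.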

Step 9— Run Modes

run_dev() is for local testing — it uses the OpenAI API and a small 10-sample subset. run_training() is for full RL fine-tuning on the Spider dataset and requires a Ray cluster with GPUs. Always test with run_dev() first.

# ── Full training ───────────────────────
def run_training():
    agent     = LitSQLAgent(max_turns=3, truncate_length=1024)
    algorithm = agl.VERL(verl_config)

    trainer = agl.Trainer(
        n_runners=10,
        algorithm=algorithm,
        adapter=agl.TracerTraceToTriplet(agent_match="write|rewrite"),
    )

    train_data = pd.read_parquet("data/train_spider.parquet").to_dict("records")
    val_data   = pd.read_parquet("data/test_dev_500.parquet").to_dict("records")

    trainer.fit(agent, train_data, val_dataset=val_data)


# ── Dev / dry-run ─────────────────────────
def run_dev():
    agent = LitSQLAgent(max_turns=3, truncate_length=1024)

    trainer = agl.Trainer(
        n_runners=1,  # same keyword that run_training() uses
        initial_resources={
            "main_llm": agl.LLM(
                endpoint=os.environ["OPENAI_API_BASE"],
                model="gpt-4o-mini",
                api_key=os.environ["OPENAI_API_KEY"],
                sampling_parameters={"temperature": 0.7},
            )
        },
    )

    # Load a small subset — make sure the parquet file path is correct
    dev_data = pd.read_parquet("data/test_dev_500.parquet").to_dict("records")[:10]

    trainer.dev(agent, dev_data)

if __name__ == "__main__":
    run_dev()
    # run_training()

Step 10 — Execute the code

Call run_dev() to run the agent on the 10-sample subset and print the reward for each rollout.

Thanks for reading this article !!

If you enjoyed this article, please click on the clap button 👏 and share to help others find it!


Zvec: Reimagining Vector Databases with SQLite-Style Simplicity

Vishnu Sivan — Sat, 28 Feb 2026 13:28:08 GMT

The research team at Alibaba Tongyi Lab has introduced Zvec, an open-source in-process vector database purpose-built for edge and on-device retrieval workloads. Described as “the SQLite of vector databases,” it runs as a lightweight embedded library inside your application. It requires no external services, no background daemons, and no network communication layer.

As Retrieval-Augmented Generation (RAG), semantic search, and AI agents increasingly move toward local-first and privacy-preserving deployments, Zvec addresses a growing infrastructure gap: how to deliver high-performance vector search with full persistence and metadata support — without the operational complexity of a server-based system.

This article explores why Zvec matters, how it is architected, where it fits in the vector database ecosystem, and why embedded vector search may become a core building block for the next generation of AI applications.

Getting Started

What is Zvec
How Zvec Works
Why Zvec Matters
Performance Benchmarks
RAG-Focused Features
Hands-On 1: Creating Your First Embedded Vector Database with Zvec
Setting up the environment
Creating and querying a Zvec collection
Hands-On 2: Building a Zvec-based FAQ Agent for customer queries
Preparing sample FAQ data
Creating Zvec collection
Building retrieval function
Connecting to LLM for answer generation
Testing FAQ agent

What is Zvec

Zvec is an open-source, in-process vector database developed by Alibaba that makes high-performance similarity search easy to embed directly into applications without running a separate server or service. It is designed to be lightweight, production-ready, and useful for AI workflows like semantic search, retrieval-augmented generation (RAG), recommendation systems, and other similarity-based tasks — all within the same process as your application.

At its core, Zvec runs as a simple library that you use just as you would an embedded relational database such as SQLite, but for vectors instead of tables. Because it runs “in-process,” there is no network layer, no external daemon to manage, and minimal configuration required. This zero-ops design makes it especially well-suited for local development, edge devices, desktop tools, command-line utilities, and other environments where deploying a separate vector database service would be impractical.

How Zvec Works

Zvec is built on Proxima, Alibaba Group’s high-performance vector search engine that has been battle-tested in large-scale production environments. It exposes a simple API that lets developers define collections, insert documents with vectors, and run similarity queries — all in a few lines of code.

Zvec supports:

Dense and sparse vector types
Multi-vector queries
Hybrid search with scalar filters
Fast approximate nearest neighbor (ANN) search
Hybrid search optimizations
Resource control for CPU and memory, making it stable in constrained environments like mobile or edge devices

Because Zvec runs inside your application process, it can seamlessly integrate into Python projects, notebooks, CLI tools, or edge applications without additional infrastructure.

Why Zvec Matters

RAG and semantic search systems require more than just a similarity index — they need vector storage, metadata (scalar fields), full CRUD operations, and reliable persistence as local knowledge bases constantly evolve.

Zvec: The SQLite of Vector Databases | Zvec

Libraries like Faiss offer fast nearest neighbor search but lack built-in storage, crash recovery, and hybrid query capabilities, forcing developers to build additional infrastructure. Extensions such as DuckDB with VSS support add vector search but provide limited indexing flexibility and weaker resource control for edge environments. Service-based platforms like Milvus or managed vector cloud solutions require separate deployment and network communication, which can be excessive for on-device applications.

Zvec is designed specifically for these local use cases, offering a vector-native engine with built-in persistence, resource governance, and RAG-focused features — all packaged as a lightweight embedded library.

Zvec’s embedded approach enables:

Local semantic search and RAG without external services
Fast prototyping and development
Edge and IoT usage with resource governance
Simplified integration comparable to SQLite’s impact on relational storage

This positions Zvec as a key infrastructure component for applications that need high-quality vector retrieval but require a lightweight, zero-ops operational model.

Performance Benchmarks

Zvec is designed for CPU-bound, high-throughput similarity search workloads. The engine leverages multithreading, cache-efficient memory layouts, SIMD optimizations, and prefetching techniques to maximize performance on modern processors.

According to reported results from VectorDBBench (Cohere 10M dataset), Zvec achieves over 8,000 queries per second (QPS) while maintaining matched recall. In the same benchmark configuration, it reportedly delivered more than twice the throughput of the previous top-performing system, while also reducing index build time.


RAG-Focused Features

Zvec is specifically optimized for retrieval-augmented generation (RAG) and AI agent workflows. It provides capabilities that go beyond basic vector indexing, making it suitable for dynamic, production-ready knowledge systems.

Full CRUD support for managing mutable local knowledge bases.
Schema evolution, allowing developers to modify indexing strategies and fields as application requirements evolve.
Multi-vector retrieval, enabling the combination of multiple embedding channels for richer semantic matching.
Built-in reranking mechanisms, including weighted fusion and Reciprocal Rank Fusion (RRF), to improve retrieval relevance.
Scalar-vector hybrid search, with scalar filters pushed down into the index path for efficient execution, along with optional inverted indexes for attribute-based filtering.

These capabilities make Zvec well-suited for applications that require flexible indexing, evolving schemas, and high-quality retrieval — all within an embedded, zero-ops deployment model.
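To illustrate the reranking idea, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. It implements the standard RRF formula, where each appearance of a document adds 1 / (k + rank) to its score. It is not Zvec's internal code, and k = 60 and the FAQ IDs are illustrative assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

dense_hits = ["faq_2", "faq_1", "faq_3"]    # e.g. ranking from a dense embedding
sparse_hits = ["faq_1", "faq_4", "faq_2"]   # e.g. ranking from a sparse channel

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print([doc_id for doc_id, _ in fused])   # ['faq_1', 'faq_2', 'faq_4', 'faq_3']
```

Documents that rank well in both channels rise to the top, and because only ranks are used, the two channels do not need comparable score scales.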

Hands-On 1: Creating Your First Embedded Vector Database with Zvec

In this section, you will create your first Zvec collection, insert vector documents, and perform a similarity search — all within a few lines of Python code.

For this hands-on guide, we will be using Google Colab with a T4 GPU.

Setting up the environment

Step 1: Create and Configure Your Colab Notebook

Open Google Colaboratory and sign in with your Google account.
Create a new notebook by clicking on + New Notebook.
Navigate to Runtime → Change runtime type.

Set Hardware Accelerator to GPU.
Choose T4 GPU (recommended for this tutorial). It is recommended to use a GPU, as the CPU runtime may crash while creating the Zvec collection due to resource constraints.
Click Save.

Step 2: Add Hugging Face Access Token (Optional but Recommended)

If you’re pulling models from Hugging Face, you’ll need an access token:

In the left sidebar, select the 🔑 Secrets tab.
Add a new secret:

Key: HF_TOKEN
Value: Your Hugging Face access token

Generate a write token from your Hugging Face profile settings → Create new token → Select write → Provide token name → Click on Create token.


Step 3: Install Dependencies

Use the following script to install all necessary packages:

!pip install zvec

Creating and querying a Zvec collection

import zvec
 
# Define collection schema
schema = zvec.CollectionSchema(
    name="example",
    vectors=zvec.VectorSchema("embedding", zvec.DataType.VECTOR_FP32, 4),
)
 
# Create collection
collection = zvec.create_and_open(path="./zvec_example", schema=schema,)
 
# Insert documents
collection.insert([
    zvec.Doc(id="doc_1", vectors={"embedding": [0.1, 0.2, 0.3, 0.4]}),
    zvec.Doc(id="doc_2", vectors={"embedding": [0.2, 0.3, 0.4, 0.1]}),
])
 
# Search by vector similarity
results = collection.query(
    zvec.VectorQuery("embedding", vector=[0.4, 0.3, 0.3, 0.1]),
    topk=10
)
 
# Results: list of {'id': str, 'score': float, ...}, sorted by relevance 
print(results)

Define a CollectionSchema: Specify vector fields (and optional scalar fields) that describe how your data will be stored.
Create or open a collection: Use create_and_open() to initialize a new collection or load an existing one from disk.
Insert documents: Add Doc objects containing unique IDs, embedding vectors, and optional metadata attributes.
Build indexes and query: Run VectorQuery operations to retrieve the nearest neighbors based on vector similarity.
Consume results: Results are returned as dictionaries containing document IDs and similarity scores, sorted by relevance. These can directly power a local semantic search engine or serve as the retrieval layer for a RAG pipeline.

With this simple workflow, you now have a fully functional embedded vector database running inside your application — ready to scale from toy examples to production-ready local AI systems.
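To build intuition for what the query returns, the sketch below ranks the same two example vectors by exact cosine similarity in plain Python. This is a brute-force illustration only: the metric Zvec applies by default is not stated here, so cosine similarity is an assumption, and a real index uses approximate search rather than a full scan.

```python
import math

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

docs = {
    "doc_1": [0.1, 0.2, 0.3, 0.4],
    "doc_2": [0.2, 0.3, 0.4, 0.1],
}
query = [0.4, 0.3, 0.3, 0.1]

# Score every document against the query and sort from most to least similar.
ranked = sorted(
    ((doc_id, cosine(query, vec)) for doc_id, vec in docs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

Under this metric doc_2 comes first, because its largest components line up with the query better than those of doc_1.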

Hands-On 2: Building a Zvec-based FAQ Agent for customer queries

This section provides a comprehensive, step-by-step guide to building a Zvec-based FAQ agent capable of handling large volumes of customer queries — similar to the support systems used by major e-commerce platforms.

Installing dependencies

In this hands-on tutorial, we will use Sentence Transformers to generate high-quality semantic embeddings and OpenAI’s language models to generate natural, context-aware responses for users.

!pip install zvec sentence-transformers openai

Preparing sample FAQ data

Create a structured FAQ dataset that will serve as the knowledge base for your Zvec-powered retrieval system, containing common customer questions and their corresponding answers for semantic search.

faq_data = [
    {
        "id": "faq_1",
        "question": "How do I return a product?",
        "answer": "You can return a product within 7 days by going to Orders → Select item → Click Return."
    },
    {
        "id": "faq_2",
        "question": "How long does delivery take?",
        "answer": "Standard delivery takes 3-5 business days."
    },
    {
        "id": "faq_3",
        "question": "How do I track my order?",
        "answer": "Go to My Orders and click on Track to see real-time tracking updates."
    },
    {
        "id": "faq_4",
        "question": "How do I cancel my order?",
        "answer": "Orders can be cancelled before they are shipped from the Orders page."
    }
]

Creating Zvec collection

Initialize and configure a persistent Zvec collection by defining the schema and preparing it to store and index embedded FAQ documents.

import zvec
from sentence_transformers import SentenceTransformer
import shutil
import os
import time # Import time module

# Load embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Define schema
schema = zvec.CollectionSchema(
    name="faq_collection",
    vectors=zvec.VectorSchema(
        "embedding",
        zvec.DataType.VECTOR_FP32,
        384   # dimension of MiniLM model
    )
    # fields=[
    #     zvec.FieldSchema("question", zvec.DataType.STRING),
    #     zvec.FieldSchema("answer", zvec.DataType.STRING)
    # ]
)

# Remove any existing collection so each run starts from a clean state
collection_path = "./faq_zvec"
if os.path.exists(collection_path):
    shutil.rmtree(collection_path)
    time.sleep(1)  # brief pause before the directory is recreated

# Create or open collection
collection = zvec.create_and_open(
    path=collection_path,
    schema=schema,
)

docs = []

for item in faq_data:
    text = item["question"] + " " + item["answer"]
    embedding = model.encode(text).tolist()

    docs.append(
        zvec.Doc(
            id=item["id"],
            vectors={"embedding": embedding}
            # fields={"question": item["question"], "answer": item["answer"]}
        )
    )

collection.insert(docs)

print("FAQ data indexed successfully!")

Building retrieval function

Implement the retrieval layer that converts user queries into embeddings and fetches the most relevant FAQ entries from Zvec using vector similarity search.

def retrieve(query, topk=3):
    embedding = model.encode(query).tolist()
    results = collection.query(
        zvec.VectorQuery("embedding", vector=embedding),
        topk=topk
    )
    return results

Connecting to LLM for answer generation

Integrate the retrieval layer with an LLM to generate clear, natural responses by grounding answers in the relevant FAQs retrieved from Zvec.

from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

# Build a lookup dict from faq_data to retrieve answers by ID
faq_lookup = {item["id"]: item for item in faq_data}

def answer_query(user_query):
    retrieved_docs = retrieve(user_query, topk=3)

    # Look up answers by doc ID (since fields aren't stored in zvec)
    context = "\n".join(
        [faq_lookup[doc.id]["answer"] for doc in retrieved_docs]
    )

    prompt = f"""
You are a customer support agent.

Use only the information below to answer.

Context:
{context}

Customer Question:
{user_query}

Answer:
"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )

    return response.choices[0].message.content
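One way to make the grounding step easier to check is to separate prompt construction from the API call, so it can be inspected without spending tokens. The helper below is a hypothetical sketch, not part of the tutorial code: build_prompt and its fallback text are assumptions, and it takes answer strings that were already looked up by ID.

```python
def build_prompt(answers: list, user_query: str) -> str:
    """Assemble a grounded prompt; use an explicit note when retrieval is empty."""
    context = "\n".join(f"- {a}" for a in answers) if answers else "(no matching FAQ entries)"
    return (
        "You are a customer support agent.\n"
        "Use only the information below to answer. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Customer Question:\n{user_query}\n\n"
        "Answer:"
    )

prompt = build_prompt(["Standard delivery takes 3-5 business days."], "When will my order arrive?")
empty_prompt = build_prompt([], "Do you ship overseas?")
print(prompt)
```

Printing the prompt for a few test questions is a quick way to confirm that the retrieved answers really are the ones the model sees.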

Testing FAQ agent

Run and interact with the FAQ agent to validate the end-to-end flow, from query embedding and retrieval in Zvec to natural response generation by the LLM.

test_questions = [
    "I received the wrong item, how do I return it?",
    "When will my order arrive?",
    "How do I cancel my order",
]

for question in test_questions:
    print(f"\n{'='*60}")
    print(f"❓ Question: {question}")
    print(f"{'='*60}")
    answer = answer_query(question)
    print(f"\n🤖 Answer:\n{answer}")

Execute the script to see the agent's answer for each test question.

If you are interested in more hands-on examples, check out how Zvec can be used to build a car manual RAG agent with OpenAI for efficient and accurate data retrieval.

Google Colab


The full source code for this tutorial can be found here,

GitHub - codemaker2015/zvec-experiments: (Retrieval-Augmented Generation) using PDF data and simple text data.


OmniDaemon: The Universal Event-Driven Runtime for Production Ready AI Agents

Vishnu Sivan — Tue, 30 Dec 2025 16:07:38 GMT

Moving beyond the “Chatbot” era to autonomous, scalable, and resilient AI infrastructure.

The AI landscape is shifting. We have spent the last two years perfecting the “Chatbot” — a synchronous, request-response interface where an LLM waits for a user to type. But the future of AI isn’t a text box; it’s an autonomous service that listens to events, reasons, and acts in the background.

Enter OmniDaemon, a universal event-driven runtime built by OmniRexFlora Labs. It is designed to turn AI agents into robust, production-ready infrastructure.

Getting Started

The “Monolithic Trap” in Agentic AI
What is OmniDaemon?
Key Pillars
The Architecture: How it Works
The “Omni Stack” Ecosystem
The Core Problem OmniDaemon Solves
Why Event-Driven Architecture for AI Agents?
Why Traditional Architectures Fail for Agents
Getting Started: A Technical Quickstart
Step 1: Set Up the Environment
Step 2: Create your first agent
Step 3: Create the Event Producer (The Trigger)
Hands-On Project: Intelligent Log Insights Generator
Step 1: Set Up the Environment
Step 2: Create Sample Log Generator
Step 3: Build the Log Ingestion Agent
Step 4: Build the LLM-Powered Analysis Agent
Step 5: Build the Reporting Agent
Step 6: (Optional) Advanced Agent Chaining for Specialized Analysis
Advanced Features and Patterns

The “Monolithic Trap” in Agentic AI

Most developers start by building multi-agent systems as single, monolithic applications. While this works for a local demo, it fails in production for three reasons:

The Blast Radius: If your “Data Analyst” agent crashes on a corrupted CSV, your entire system — Research Agent, Writer Agent, and API — goes down with it.
The Scaling Wall: You cannot scale individual agents. If your “Research Agent” is I/O intensive, you’re forced to scale the whole monolith, wasting resources.
Synchronous Bottlenecks: AI is slow. Making a user wait for a 30-second chain of LLM calls over an HTTP request leads to timeouts and poor UX.

What is OmniDaemon?

OmniDaemon is a Universal Event-Driven Runtime for AI agents. It acts as the “Central Nervous System” for your AI stack. Think of it as “Kubernetes for AI Agents” — it provides the orchestration, scalability, and reliability needed to run AI agents in production. It allows agents to operate asynchronously, reacting to events (like a new file upload, a CRM update, or a message from another agent) rather than waiting for direct API calls.

Key Pillars:

Framework Agnostic: It doesn’t care if your agent is built with OpenAI’s SDK, LangGraph, or CrewAI. If it’s Python code, OmniDaemon can daemonize it.
Production Ready: It handles the “boring but hard” stuff: retries, Dead Letter Queues (DLQ), message persistence, and horizontal scaling.
Event-Driven (EDA): Built on high-performance backends like Redis Streams, it ensures that agents collaborate across distributed systems without being tightly coupled.
Asynchronous Execution: Agents listen to a “Topic” (like a Redis stream). When a task arrives, the daemon triggers a callback, processes it, and emits the result.
Pluggable Backends: Swap event buses (Redis Streams, Kafka, RabbitMQ) and storage (Redis, PostgreSQL, MongoDB) via environment variables.
Horizontal Scaling: Deploy multiple agent instances for load balancing without code changes.

The Architecture: How it Works

OmniDaemon sits between your event source (like Redis or a JSON stream) and your AI logic.

The Listener: OmniDaemon monitors an event stream.
The Callback: When a message arrives, it triggers your agent logic via a simple callback function.
The Manager: It manages the lifecycle. If your agent fails, OmniDaemon handles the retry logic. If it succeeds, it persists the result and can even trigger the next agent in the chain.
Storage Backends: It currently supports JSON (for local development) and Redis (for production). S3 and PostgreSQL support are on the roadmap.
The Callback Pattern: You wrap your agent logic in a single function. OmniDaemon handles the “plumbing” — fetching the message, tracking metadata, and persisting the output.
Metadata & Context: Every event carries a rich metadata object including message_id, correlation_id (to track a task across 10 different agents), and tenant_id for multi-tenant SaaS apps.
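The listener, callback, and manager roles can be sketched in a few lines of plain asyncio. This is an illustration of the pattern only and does not use OmniDaemon's API: run_with_retries, the message layout, and the in-memory dead-letter list are all assumptions made for the demo.

```python
import asyncio
import uuid
from typing import Optional

async def run_with_retries(callback, message: dict, max_retries: int = 3,
                           dead_letters: Optional[list] = None) -> dict:
    """Call callback(message); retry on failure, then park the message in a dead-letter list."""
    last_error = ""
    for attempt in range(1, max_retries + 1):
        try:
            result = await callback(message)
            return {"status": "ok", "attempts": attempt, "result": result}
        except Exception as exc:
            last_error = str(exc)
    if dead_letters is not None:
        dead_letters.append({**message, "error": last_error})
    return {"status": "dead_letter", "attempts": max_retries, "result": None}

calls = {"count": 0}

async def flaky_agent(message: dict) -> dict:
    # Fails twice, then succeeds: simulates a transient LLM or network error.
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return {"reply": f"processed {message['content']}"}

async def broken_agent(message: dict) -> dict:
    raise ValueError("corrupted input")

def make_message(content: str) -> dict:
    # Field names follow the metadata described above; the exact layout is an assumption.
    return {"message_id": str(uuid.uuid4()), "correlation_id": "task-42", "content": content}

dlq = []
ok = asyncio.run(run_with_retries(flaky_agent, make_message("log batch"), dead_letters=dlq))
failed = asyncio.run(run_with_retries(broken_agent, make_message("bad csv"), dead_letters=dlq))
print(ok["status"], ok["attempts"])    # ok 3
print(failed["status"], len(dlq))      # dead_letter 1
```

The agent callbacks contain no retry or bookkeeping code at all; that separation is what lets a runtime add reliability around existing agent logic.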

The “Omni Stack” Ecosystem

OmniDaemon is part of a larger vision by OmniRexFlora Labs to provide a complete “Linux-like” ecosystem for AI:

OmniCoreAgent: The framework for building the “brains.”
OmniMemory: Persistent, semantic memory for agents.
OmniDaemon: The runtime for distributed execution.
OmniCloud: (Upcoming) The deployment and orchestration layer.

The Core Problem OmniDaemon Solves

Traditional AI systems are request-driven: a user asks a question, the AI responds, and that’s it. But modern enterprises need AI that operates continuously in the background, reacting to events, coordinating with other agents, and integrating seamlessly with existing infrastructure.

Why Event-Driven Architecture for AI Agents?

The future of AI lies in autonomous agents that operate as distributed microservices. Just like microservices revolutionized application architecture, event-driven AI agents are transforming how we build intelligent systems.

Wave 1: Predictive Models (Traditional ML)

Domain-specific, rigid systems
Required ML expertise for each use case
Difficult to repurpose or scale

Wave 2: Generative Models (LLMs)

Revolutionary generalization capabilities
But: fixed in time, expensive to fine-tune, no access to private data
RAG (Retrieval-Augmented Generation) helped, but workflows remained rigid

Wave 3: Agentic AI (Current)

Dynamic workflows that adapt on the fly
Autonomous decision-making with tool use
Context-driven processing with memory
Collaborative multi-agent systems

Industry leaders agree: “Agents are the new apps” (Dharmesh Shah, HubSpot CTO). But building scalable agent systems requires proper infrastructure.

Why Traditional Architectures Fail for Agents

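Connecting agents via REST APIs creates tightly coupled systems, because every caller has to know which agent to call and has to wait for its response.

Event-driven architecture solves this through loose coupling: agents publish events to topics and subscribe to the topics they care about, without referencing each other directly.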
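A toy example makes the difference visible. In the sketch below, a hypothetical in-memory bus stands in for Redis Streams or Kafka; InMemoryBus, the topic names, and the two agents are invented for illustration and are not OmniDaemon classes.

```python
import asyncio
from collections import defaultdict

class InMemoryBus:
    """Toy topic-based event bus; a stand-in for Redis Streams, Kafka, etc."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler) -> None:
        self._subscribers[topic].append(handler)

    async def publish(self, topic: str, event: dict) -> None:
        # The publisher never references a concrete agent, only a topic name.
        await asyncio.gather(*(handler(event) for handler in self._subscribers[topic]))

bus = InMemoryBus()
report = []

async def analyst(event: dict) -> None:
    # Reacts to raw logs and emits a follow-up event for whoever is listening.
    await bus.publish("logs.analyzed", {"summary": f"{len(event['lines'])} lines analyzed"})

async def reporter(event: dict) -> None:
    report.append(event["summary"])

bus.subscribe("logs.ingested", analyst)
bus.subscribe("logs.analyzed", reporter)

asyncio.run(bus.publish("logs.ingested", {"lines": ["a", "b", "c"]}))
print(report)   # ['3 lines analyzed']
```

Neither agent imports or calls the other. Adding a second consumer of logs.analyzed is one more subscribe call, with no change to the analyst.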

Getting Started: A Technical Quickstart

OmniDaemon is designed for simplicity. You can move an existing agent into a daemonized state in minutes.

Prerequisites

Python 3.9+
Redis (Installed and running)
OpenAI API Key (or any LLM provider)

Step 1: Set Up the Environment

To use OmniDaemon in a development or production environment, running Redis via Docker is a common and convenient choice. It keeps your event stream isolated, portable, and easy to manage.

Initializing Redis

Before running Redis, you must have the Docker Engine installed.

For Windows & Mac:
Download Docker Desktop from the official website.

For Linux (Ubuntu):
Open your terminal and run these commands to install the Docker Engine:

sudo apt update
sudo apt install ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg

echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

Once Docker is ready, you can pull and start Redis with a single command. For OmniDaemon, we want persistence so that your AI tasks aren’t lost if the container restarts.

docker run -d --name omni-redis -p 6379:6379 -v redis-data:/data redis:latest redis-server --appendonly yes
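
Before moving on, it can help to confirm that the container is actually answering. The following is a small standard-library sketch, not part of OmniDaemon: the redis_ping helper is a hypothetical name, and it assumes a Redis instance without authentication on localhost:6379, as started by the command above. A server that requires a password would reply with an error, and the function would return False.

```python
import socket


def redis_ping(host: str = "localhost", port: int = 6379, timeout: float = 2.0) -> bool:
    """Return True if a Redis server answers PING with +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")  # Redis accepts simple inline commands
            return conn.recv(64).startswith(b"+PONG")
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False


if __name__ == "__main__":
    print("Redis reachable:", redis_ping())
```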

Installing Dependencies

First, install the necessary libraries using pip.

pip install omnidaemon

Step 2: Create your first agent

Below is the implementation of a distributed “Greeter” agent. The agent_runner.py acts as our persistent service, while producer.py demonstrates how external systems interact with the agent through structured message passing.

Create a file named agent_runner.py and add the following code to it.

import asyncio
from omnidaemon import OmniDaemonSDK
from omnidaemon import AgentConfig

sdk = OmniDaemonSDK()

async def greeter(message: dict):
    """Your AI agent runs here!"""
    name = message.get("content", {}).get("name", "stranger")
    return {"reply": f"Hello, {name}! 👋"}

async def main():
    await sdk.register_agent(
        agent_config=AgentConfig(
            topic="greet.user",
            callback=greeter,
        )
    )
    await sdk.start()
    print("🎧 Agent running. Press Ctrl+C to stop.")
    
    try:
        while True:
            await asyncio.sleep(1)
    except (KeyboardInterrupt, asyncio.CancelledError):  # Under asyncio.run, Ctrl+C usually arrives as a cancellation
        pass
    finally:
        await sdk.shutdown()

if __name__ == "__main__":
    asyncio.run(main())
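
Because the callback is a plain async function, its logic can be exercised without Redis or the SDK. The sketch below copies the greeter callback and calls it directly. The message shape (a dict with a "content" key) is taken from the callback itself; which other fields OmniDaemon passes at runtime is not shown here, and run_samples is a hypothetical helper name.

```python
import asyncio


async def greeter(message: dict) -> dict:
    """Same logic as the callback in agent_runner.py."""
    name = message.get("content", {}).get("name", "stranger")
    return {"reply": f"Hello, {name}! 👋"}


async def run_samples() -> list:
    samples = [
        {"content": {"name": "World"}},
        {"content": {}},  # Missing name falls back to the default
        {},               # Missing content is tolerated as well
    ]
    return [await greeter(sample) for sample in samples]


if __name__ == "__main__":
    for reply in asyncio.run(run_samples()):
        print(reply)
```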

Step 3: Create the Event Producer (The Trigger)

In an event-driven system, the producer doesn’t wait for the AI to finish. It just “drops” the job into the queue and moves on.

Create a file named producer.py and add the following code to it.

import asyncio
import json

async def test_messages():
    """Send test messages to the agent."""
    
    # Test cases
    tests = [
        ("greet.user", {"content": {"name": "World"}}),
        ("greet.user", {"content": {"name": "Alice"}}),
        ("greet.user", {"content": {}}),
        ("health.check", {"timestamp": "now"}),
    ]
    
    print("Testing agent messages...\n")
    
    for topic, message in tests:
        print(f"Topic: {topic}")
        print(f"Message: {json.dumps(message)}")
        
        # Simulate sending
        await asyncio.sleep(0.3)
        
        # Simulate response
        if topic == "greet.user":
            name = message.get("content", {}).get("name", "stranger")
            print(f"Response: Hello, {name}!\n")
        else:
            print("Response: {'status': 'healthy'}\n")
        
        await asyncio.sleep(0.5)
    
    print("Done!")

if __name__ == "__main__":
    asyncio.run(test_messages())

Executing the code

Open your first terminal and execute the agent runner. This process will initialize the OmniDaemonSDK, register the greeter callback to the greet.user topic, and enter an idle listening state.

python agent_runner.py

You should see the output: 🎧 Agent running. Press Ctrl+C to stop.

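Open a second terminal to run the producer script. Note that producer.py, as written, only simulates the exchange: it prints each topic and payload, waits briefly, and prints a mocked response, without publishing anything to Redis. To send real events to the waiting agent, publish them with sdk.publish_task, as the log ingestion agent later in this article does.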

python producer.py

Hands-On Project: Intelligent Log Insights Generator

Now let’s build a real-world application: an AI-powered log analysis system that monitors authentication and database transaction logs, detects anomalies, identifies vulnerabilities, and provides actionable recommendations.

Project Architecture

Our system will have three main components:

Log Ingestion Agent: Watches log files and publishes events
Analysis Agent: Uses an LLM to analyze logs for security issues, anomalies, and performance concerns
Reporting Agent: Aggregates insights and generates actionable reports

Step 1: Set Up the Environment

First, install the necessary libraries. We will use omnidaemon for the runtime, openai for the LLM calls, and python-decouple to read credentials from the .env file (the agents below import config from decouple).

pip install omnidaemon openai python-decouple

Ensure your .env file has your credentials:

OPENAI_API_KEY=your_key_here
STORAGE_BACKEND=redis
REDIS_URL=redis://localhost:6379
EVENT_BUS_TYPE=redis_stream
OMNIDAEMON_API_ENABLED=true
OMNIDAEMON_API_PORT=8765
LOG_LEVEL=INFO
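
A quick preflight check can catch a missing or placeholder credential before any agent starts. This is a standard-library sketch: the setting names come from the .env example above, but treating OPENAI_API_KEY and REDIS_URL as the required ones is an assumption made for illustration, and missing_settings is a hypothetical helper.

```python
import os
from typing import List, Mapping

# Setting names taken from the .env example; which ones are strictly
# required by OmniDaemon is an assumption made for this sketch.
REQUIRED_SETTINGS = ("OPENAI_API_KEY", "REDIS_URL")


def missing_settings(env: Mapping[str, str]) -> List[str]:
    """Return required settings that are absent, blank, or still placeholders."""
    missing = []
    for key in REQUIRED_SETTINGS:
        value = env.get(key, "").strip()
        if not value or value == "your_key_here":
            missing.append(key)
    return missing


if __name__ == "__main__":
    # os.environ only sees values already exported to the process; a .env
    # loader is still needed to read the file itself.
    problems = missing_settings(os.environ)
    print("Missing settings:", problems or "none")
```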

Step 2: Create Sample Log Generator

First, let’s create realistic sample logs for testing:

# generate_logs.py
import json
import random
import time
from datetime import datetime
from pathlib import Path

# Create logs directory
log_dir = Path("./sample_logs")
log_dir.mkdir(exist_ok=True)

# Sample data for realistic logs
users = ["alice", "bob", "charlie", "david", "eve", "mallory"]
ips = ["192.168.1.10", "192.168.1.20", "192.168.1.30", "10.0.0.5", "172.16.0.100"]
suspicious_ips = ["45.133.1.12", "89.248.172.45"]  # Potential attack IPs
db_operations = ["SELECT", "INSERT", "UPDATE", "DELETE"]
tables = ["users", "transactions", "orders", "products", "audit_logs"]

def generate_auth_log():
    """Generate authentication log entry"""
    is_suspicious = random.random() < 0.1  # 10% suspicious activity
    
    if is_suspicious:
        # Generate suspicious patterns
        event_type = random.choice(["failed_login", "brute_force_attempt", "impossible_travel"])
        ip = random.choice(suspicious_ips)
        success = False
    else:
        event_type = random.choice(["login", "logout", "password_change", "session_refresh"])
        ip = random.choice(ips)
        success = random.random() < 0.95  # 95% success rate normally
    
    return {
        "timestamp": datetime.now().isoformat(),
        "log_type": "authentication",
        "event": event_type,
        "user": random.choice(users),
        "ip_address": ip,
        "success": success,
        "user_agent": "Mozilla/5.0" if random.random() < 0.8 else "curl/7.68.0",
        "session_id": f"sess_{random.randint(10000, 99999)}"
    }

def generate_db_log():
    """Generate database transaction log entry"""
    is_anomaly = random.random() < 0.15  # 15% anomalies
    
    operation = random.choice(db_operations)
    execution_time = random.uniform(0.01, 0.5)
    
    if is_anomaly:
        # Generate anomalous patterns
        if random.random() < 0.5:
            execution_time = random.uniform(5.0, 30.0)  # Slow query
        
        if operation == "DELETE" and random.random() < 0.3:
            rows_affected = random.randint(1000, 10000)  # Mass deletion
        else:
            rows_affected = random.randint(1, 100)
    else:
        rows_affected = random.randint(1, 10)
    
    return {
        "timestamp": datetime.now().isoformat(),
        "log_type": "database",
        "operation": operation,
        "table": random.choice(tables),
        "user": random.choice(users),
        "execution_time_ms": round(execution_time * 1000, 2),
        "rows_affected": rows_affected,
        "query_hash": f"qh_{random.randint(100000, 999999)}",
        "connection_id": random.randint(1, 50)
    }

def generate_logs(num_logs=100, filename="application.log"):
    """Generate mixed authentication and database logs"""
    log_file = log_dir / filename
    
    with open(log_file, "a") as f:  # Append so later batches do not truncate entries the ingestion agent has already counted
        for _ in range(num_logs):
            # Mix of auth and db logs (60% auth, 40% db)
            if random.random() < 0.6:
                log_entry = generate_auth_log()
            else:
                log_entry = generate_db_log()
            
            f.write(json.dumps(log_entry) + "\n")
            time.sleep(0.01)  # Simulate real-time generation
    
    print(f"Generated {num_logs} log entries in {log_file}")

if __name__ == "__main__":
    # Generate initial logs
    generate_logs(num_logs=100, filename="application.log")
    
    # Optionally: Continuous generation
    print("\nGenerating continuous logs... (Ctrl+C to stop)")
    try:
        while True:
            generate_logs(num_logs=10, filename="application.log")
            time.sleep(5)  # Generate new batch every 5 seconds
    except KeyboardInterrupt:
        print("\nLog generation stopped.")
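
To sanity-check what the generator produced, a short script can count entries by type. This is a sketch under the assumption that the file is in the JSON Lines format written above; summarize_logs is a hypothetical helper, and "failed_auth" is simply a label chosen here for authentication entries whose success field is false.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_logs(path: Path) -> Counter:
    """Count JSON Lines entries by log type, plus failed authentications."""
    counts: Counter = Counter()
    with open(path, "r") as f:
        for line in f:
            if not line.strip():
                continue  # Skip blank lines
            entry = json.loads(line)
            counts[entry.get("log_type", "unknown")] += 1
            if entry.get("log_type") == "authentication" and not entry.get("success", True):
                counts["failed_auth"] += 1
    return counts


if __name__ == "__main__":
    log_path = Path("./sample_logs/application.log")
    if log_path.exists():
        print(dict(summarize_logs(log_path)))
    else:
        print(f"{log_path} not found; run generate_logs.py first")
```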

Step 3: Build the Log Ingestion Agent

The log ingestion agent is the entry point of our system. It acts as a bridge between your log files and the OmniDaemon event-driven architecture. Instead of having your analysis agents directly read files (which creates tight coupling and scaling issues), the ingestion agent watches for new log entries and publishes them as events to the event bus.

# log_ingestion_agent.py
import asyncio
import json
from pathlib import Path
from omnidaemon import OmniDaemonSDK, EventEnvelope, PayloadBase

sdk = OmniDaemonSDK()

class LogIngestionAgent:
    def __init__(self, log_dir="./sample_logs"):
        self.log_dir = Path(log_dir)
        self.processed_lines = {}  # Track processed lines per file
        
    async def watch_logs(self):
        """Watch log files and publish new entries"""
        print(f"🔍 Watching logs in {self.log_dir}")
        
        while True:
            try:
                # Find all log files
                log_files = list(self.log_dir.glob("*.log"))
                
                for log_file in log_files:
                    await self.process_log_file(log_file)
                
                await asyncio.sleep(2)  # Check every 2 seconds
                
            except Exception as e:
                print(f"❌ Error watching logs: {e}")
                await asyncio.sleep(5)
    
    async def process_log_file(self, log_file: Path):
        """Process new lines from a log file"""
        try:
            with open(log_file, "r") as f:
                lines = f.readlines()
            
            # Get last processed line number
            last_line = self.processed_lines.get(str(log_file), 0)
            
            # Process only new lines
            new_lines = lines[last_line:]
            
            for line in new_lines:
                if line.strip():
                    await self.publish_log_entry(line, str(log_file))
            
            # Update processed line count
            self.processed_lines[str(log_file)] = len(lines)
            
        except Exception as e:
            print(f"❌ Error processing {log_file}: {e}")
    
    async def publish_log_entry(self, log_line: str, source_file: str):
        """Publish a log entry to the event bus"""
        try:
            log_entry = json.loads(log_line)
            
            # Create event envelope
            event = EventEnvelope(
                topic="logs.raw",
                payload=PayloadBase(
                    content={
                        "log_entry": log_entry,
                        "source_file": source_file
                    },
                    reply_to="logs.analyzed"  # Results go to analyzed topic
                ),
                source="log_ingestion_agent"
            )
            
            # Publish to event bus
            task_id = await sdk.publish_task(event_envelope=event)
            print(f"📤 Published log: {log_entry.get('log_type')} - {task_id}")
            
        except json.JSONDecodeError:
            print(f"⚠️  Invalid JSON in log line")
        except Exception as e:
            print(f"❌ Error publishing log: {e}")

async def main():
    try:
        # Create ingestion agent
        ingestion_agent = LogIngestionAgent()
        
        # Start watching logs
        print("🚀 Log Ingestion Agent started")
        await ingestion_agent.watch_logs()
        
    except (KeyboardInterrupt, asyncio.CancelledError):  # Under asyncio.run, Ctrl+C usually arrives as a cancellation
        print("\n👋 Shutting down...")
    finally:
        await sdk.shutdown()

if __name__ == "__main__":
    asyncio.run(main())
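
The agent above re-reads the whole file on every pass and tracks a line count. An alternative worth considering is to track a byte offset and read only what was appended. The sketch below is one way to do that with the standard library; read_new_lines is a hypothetical helper, not part of OmniDaemon. It treats a shrinking file as a sign of truncation or rotation, which is a heuristic: a file rewritten to a larger size than the saved offset would go undetected.

```python
from pathlib import Path
from typing import List, Tuple


def read_new_lines(path: Path, offset: int) -> Tuple[List[str], int]:
    """Return complete lines added after `offset` and the offset to resume from."""
    size = path.stat().st_size
    if size < offset:
        offset = 0  # File shrank: assume it was truncated or rotated

    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read()

    # Hold back a trailing partial line until its newline has been written
    last_newline = chunk.rfind(b"\n")
    if last_newline == -1:
        return [], offset
    complete = chunk[: last_newline + 1]
    lines = [line for line in complete.decode("utf-8").split("\n") if line.strip()]
    return lines, offset + len(complete)
```

A caller would keep one offset per file, in the same way the agent keeps processed_lines, and pass each returned line to the publishing step.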

Step 4: Build the LLM-Powered Analysis Agent

This is the intelligent core of our system. The analysis agent subscribes to the logs.raw topic and uses OpenAI's GPT-4o model to analyze each log entry for security vulnerabilities, performance anomalies, and operational risks. Unlike fixed rules, the LLM can reason about context and flag patterns that were not anticipated in advance. For each entry it assigns a severity level, generates actionable recommendations, and returns a result that is routed to the logs.analyzed topic (the reply_to set by the ingestion agent) for downstream processing. The subscription is configured with 3 parallel consumers for higher throughput.

# log_analysis_agent.py
import asyncio
import json
from typing import Dict, Any
from openai import AsyncOpenAI
from omnidaemon import OmniDaemonSDK, AgentConfig, SubscriptionConfig
from decouple import config

sdk = OmniDaemonSDK()
client = AsyncOpenAI(api_key=config("OPENAI_API_KEY"))

ANALYSIS_SYSTEM_PROMPT = """You are an expert security and database analyst. 
Analyze log entries and provide detailed insights on:

1. **Security Vulnerabilities**: Identify potential security threats, unusual access patterns, 
   brute force attempts, suspicious IP addresses, or unauthorized access attempts.

2. **Anomalies**: Detect unusual patterns such as:
   - Failed login attempts from the same IP
   - Impossible travel (logins from distant locations in short time)
   - Unusual database operations (mass deletions, slow queries)
   - Off-hours access patterns
   - Spike in certain operations

3. **Critical Actions**: Flag actions that require immediate attention:
   - Successful logins after multiple failures
   - Large-scale data modifications
   - Administrative privilege escalation
   - Database performance degradation

4. **Recommendations**: Provide specific, actionable recommendations:
   - Implement IP blocking for suspicious addresses
   - Add monitoring for specific patterns
   - Optimize slow queries
   - Review user permissions
   - Enable additional security measures

Return your analysis as a JSON object with this structure:
{
    "severity": "critical|high|medium|low|info",
    "category": "security|performance|anomaly|normal",
    "vulnerabilities": ["list of identified vulnerabilities"],
    "anomalies": ["list of detected anomalies"],
    "critical_actions": ["list of critical actions taken"],
    "recommendations": ["list of specific recommendations"],
    "summary": "brief summary of findings"
}

Be concise but thorough. Focus on actionable insights."""

async def analyze_log_with_llm(log_entry: Dict[str, Any]) -> Dict[str, Any]:
    """Analyze a log entry using LLM"""
    try:
        # Prepare log context for LLM
        log_context = json.dumps(log_entry, indent=2)
        
        # Call OpenAI API
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": ANALYSIS_SYSTEM_PROMPT},
                {"role": "user", "content": f"Analyze this log entry:\n\n{log_context}"}
            ],
            temperature=0.3,
            max_tokens=1000
        )
        
        # Parse LLM response
        analysis_text = response.choices[0].message.content
        
        # Extract JSON from response (handle markdown code blocks)
        if "```json" in analysis_text:
            analysis_text = analysis_text.split("```json")[1].split("```")[0].strip()
        elif "```" in analysis_text:
            analysis_text = analysis_text.split("```")[1].split("```")[0].strip()
        
        analysis = json.loads(analysis_text)
        
        return {
            "status": "success",
            "analysis": analysis,
            "log_entry": log_entry
        }
        
    except Exception as e:
        print(f"❌ LLM Analysis Error: {e}")
        return {
            "status": "error",
            "error": str(e),
            "log_entry": log_entry
        }

async def log_analysis_callback(message: dict):
    """OmniDaemon callback for log analysis"""
    content = message.get("content", {})
    log_entry = content.get("log_entry", {})
    source_file = content.get("source_file", "unknown")
    
    print(f"\n🔍 Analyzing {log_entry.get('log_type')} log from {source_file}")
    
    # Analyze log with LLM
    result = await analyze_log_with_llm(log_entry)
    
    if result["status"] == "success":
        analysis = result["analysis"]
        severity = analysis.get("severity", "info")
        category = analysis.get("category", "normal")
        
        # Print severity-based alerts
        if severity == "critical":
            print(f"🚨 CRITICAL: {analysis.get('summary')}")
        elif severity == "high":
            print(f"⚠️  HIGH: {analysis.get('summary')}")
        elif severity == "medium":
            print(f"⚡ MEDIUM: {analysis.get('summary')}")
        else:
            print(f"ℹ️  {analysis.get('summary')}")
        
        # Print recommendations if any
        if analysis.get("recommendations"):
            print(f"💡 Recommendations:")
            for rec in analysis["recommendations"][:2]:  # Show top 2
                print(f"   - {rec}")
    
    return result

async def main():
    try:
        print("🤖 Starting Log Analysis Agent...")
        
        # Register analysis agent
        await sdk.register_agent(
            agent_config=AgentConfig(
                name="LOG_ANALYSIS_AGENT",
                topic="logs.raw",
                callback=log_analysis_callback,
                description="LLM-powered log analysis agent",
                tools=["openai", "security_analysis"],
                config=SubscriptionConfig(
                    reclaim_idle_ms=30000,
                    dlq_retry_limit=2,
                    consumer_count=3  # Parallel processing
                )
            )
        )
        
        # Start agent runner
        await sdk.start()
        print("✅ Log Analysis Agent is now running")
        print("🎧 Listening for logs on 'logs.raw' topic...")
        
        # Keep running
        try:
            while True:
                await asyncio.sleep(1)
        except (KeyboardInterrupt, asyncio.CancelledError):  # Under asyncio.run, Ctrl+C usually arrives as a cancellation
            print("\n👋 Received shutdown signal...")
    
    except Exception as e:
        print(f"❌ Error: {e}")
        raise
    
    finally:
        print("Shutting down...")
        await sdk.shutdown()
        print("✅ Shutdown complete")

if __name__ == "__main__":
    asyncio.run(main())
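
The fence-stripping logic in analyze_log_with_llm covers the common cases, but model replies sometimes wrap the JSON in prose instead. The helper below is a slightly more defensive sketch of the same idea. extract_json is a hypothetical name, and the fallback to the outermost braces is a heuristic rather than a guarantee.

```python
import json
from typing import Any, Dict

FENCE = "`" * 3  # Markdown code fence marker


def extract_json(text: str) -> Dict[str, Any]:
    """Parse a JSON object from an LLM reply, with or without a code fence."""
    candidate = text.strip()
    if FENCE in candidate:
        # Take the body of the first fenced block and drop a "json" language tag
        candidate = candidate.split(FENCE)[1]
        if candidate.lstrip().lower().startswith("json"):
            candidate = candidate.lstrip()[4:]
    else:
        # Fall back to the outermost braces when prose surrounds the object
        start, end = candidate.find("{"), candidate.rfind("}")
        if start != -1 and end > start:
            candidate = candidate[start : end + 1]

    parsed = json.loads(candidate)  # Raises json.JSONDecodeError on bad input
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object")
    return parsed
```

If the model you use supports it, the Chat Completions response_format={"type": "json_object"} option can reduce how often this kind of cleanup is needed.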

Step 5: Build the Reporting Agent

The reporting agent is the final stage of our pipeline that transforms individual log analyses into actionable intelligence. It subscribes to the logs.analyzed topic, aggregates insights by severity level (critical, high, medium, low), and automatically generates comprehensive reports every 20 insights. This agent produces JSON reports that can feed into dashboards, alerting systems, or business intelligence tools.

# reporting_agent.py
import asyncio
import json
from datetime import datetime
from collections import defaultdict
from pathlib import Path
from omnidaemon import OmniDaemonSDK, AgentConfig, SubscriptionConfig

sdk = OmniDaemonSDK()

class ReportingAgent:
    def __init__(self):
        self.insights = defaultdict(list)
        self.reports_dir = Path("./reports")
        self.reports_dir.mkdir(exist_ok=True)
        
    async def process_analysis(self, message: dict):
        """Process analyzed log and aggregate insights"""
        content = message.get("content", {})
        
        if content.get("status") != "success":
            return {"status": "skipped", "reason": "analysis failed"}
        
        analysis = content.get("analysis", {})
        log_entry = content.get("log_entry", {})
        
        # Aggregate insights by severity
        severity = analysis.get("severity", "info")
        self.insights[severity].append({
            "timestamp": log_entry.get("timestamp"),
            "log_type": log_entry.get("log_type"),
            "category": analysis.get("category"),
            "summary": analysis.get("summary"),
            "vulnerabilities": analysis.get("vulnerabilities", []),
            "anomalies": analysis.get("anomalies", []),
            "critical_actions": analysis.get("critical_actions", []),
            "recommendations": analysis.get("recommendations", [])
        })
        
        # Generate report every 20 insights
        total_insights = sum(len(v) for v in self.insights.values())
        if total_insights % 20 == 0:
            await self.generate_report()
        
        return {
            "status": "processed",
            "severity": severity,
            "total_insights": total_insights
        }
    
    async def generate_report(self):
        """Generate comprehensive security and performance report"""
        try:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            report_file = self.reports_dir / f"log_insights_report_{timestamp}.json"
            
            # Compile statistics
            stats = {
                "generated_at": datetime.now().isoformat(),
                "total_insights": sum(len(v) for v in self.insights.values()),
                "by_severity": {k: len(v) for k, v in self.insights.items()},
                "insights": dict(self.insights)
            }
            
            # Extract top recommendations
            all_recommendations = []
            for severity_insights in self.insights.values():
                for insight in severity_insights:
                    all_recommendations.extend(insight.get("recommendations", []))
            
            # Count recommendation frequency
            rec_counts = defaultdict(int)
            for rec in all_recommendations:
                rec_counts[rec] += 1
            
            top_recommendations = sorted(
                rec_counts.items(), 
                key=lambda x: x[1], 
                reverse=True
            )[:10]
            
            stats["top_recommendations"] = [
                {"recommendation": rec, "frequency": count}
                for rec, count in top_recommendations
            ]
            
            # Save report
            with open(report_file, "w") as f:
                json.dump(stats, f, indent=2)
            
            print(f"\n📊 Report generated: {report_file}")
            print(f"   Total Insights: {stats['total_insights']}")
            print(f"   Critical: {stats['by_severity'].get('critical', 0)}")
            print(f"   High: {stats['by_severity'].get('high', 0)}")
            print(f"   Medium: {stats['by_severity'].get('medium', 0)}")
            
            if top_recommendations:
                print(f"\n🎯 Top Recommendations:")
                for rec, count in top_recommendations[:3]:
                    print(f"   - {rec} (mentioned {count} times)")
            
        except Exception as e:
            print(f"❌ Error generating report: {e}")

async def main():
    try:
        print("📊 Starting Reporting Agent...")
        
        reporting_agent = ReportingAgent()
        
        # Register reporting agent
        await sdk.register_agent(
            agent_config=AgentConfig(
                name="REPORTING_AGENT",
                topic="logs.analyzed",
                callback=reporting_agent.process_analysis,
                description="Aggregates analysis results and generates reports",
                config=SubscriptionConfig(
                    consumer_count=1  # Single consumer for aggregation
                )
            )
        )
        
        # Start agent runner
        await sdk.start()
        print("✅ Reporting Agent is now running")
        print("🎧 Listening for analyzed logs on 'logs.analyzed' topic...")
        
        # Keep running
        try:
            while True:
                await asyncio.sleep(1)
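        except (KeyboardInterrupt, asyncio.CancelledError):  # Under asyncio.run, Ctrl+C usually arrives as a cancellation, so catch both to write the final report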
            print("\n👋 Generating final report...")
            await reporting_agent.generate_report()
    
    except Exception as e:
        print(f"❌ Error: {e}")
        raise
    
    finally:
        print("Shutting down...")
        await sdk.shutdown()
        print("✅ Shutdown complete")

if __name__ == "__main__":
    asyncio.run(main())
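
The recommendation ranking in generate_report can also be written with collections.Counter, which may be easier to read and test in isolation. This is an equivalent sketch, not a change to the agent; top_recommendations is a hypothetical helper that takes the same severity-to-insights mapping the agent builds.

```python
from collections import Counter
from typing import Dict, List, Tuple


def top_recommendations(
    insights: Dict[str, List[dict]], limit: int = 10
) -> List[Tuple[str, int]]:
    """Rank recommendations across all severities by how often they appear."""
    counts = Counter(
        rec
        for severity_insights in insights.values()
        for insight in severity_insights
        for rec in insight.get("recommendations", [])
    )
    # most_common sorts by count, keeping first-seen order among ties
    return counts.most_common(limit)
```

Note that counting exact strings will undercount when the LLM phrases the same advice in different ways, so some normalization of the recommendation text may be worth adding.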

Executing the app

You can now run the agents in different terminals to see how they work together as one system.

python generate_logs.py
python log_ingestion_agent.py
python log_analysis_agent.py
python reporting_agent.py

Step 6: (Optional) Advanced Agent Chaining for Specialized Analysis

For more sophisticated log analysis, we can create a multi-stage pipeline with specialized agents. This demonstrates OmniDaemon’s powerful agent chaining capabilities.
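
The routing decision at the head of such a pipeline can be sketched as a pure function, which makes the rules easy to test before any event bus is involved. In the sketch below, route_log is a hypothetical helper; the topic names, thresholds, and IP prefixes mirror the ones used by the classifier agent in this step, and the prefix check is a demo heuristic rather than real threat intelligence.

```python
from typing import Tuple

SUSPICIOUS_EVENTS = {"failed_login", "brute_force_attempt", "impossible_travel"}
SUSPICIOUS_IP_PREFIXES = ("45.", "89.")  # Demo heuristic only


def route_log(log_entry: dict) -> Tuple[str, str, str]:
    """Return (classification, topic, priority) for a parsed log entry."""
    log_type = log_entry.get("log_type", "unknown")

    if log_type == "authentication":
        risky = (
            not log_entry.get("success", True)
            or log_entry.get("event", "") in SUSPICIOUS_EVENTS
            or log_entry.get("ip_address", "").startswith(SUSPICIOUS_IP_PREFIXES)
        )
        if risky:
            return ("security_threat", "logs.security", "high")
        return ("normal_auth", "logs.security", "low")

    if log_type == "database":
        slow = log_entry.get("execution_time_ms", 0) > 1000  # Slower than 1 second
        rows = log_entry.get("rows_affected", 0)
        mass_change = rows > 5000 or (log_entry.get("operation") == "DELETE" and rows > 500)
        if slow or mass_change:
            return ("performance_issue", "logs.performance", "high")
        return ("normal_db", "logs.performance", "low")

    return ("unknown", "logs.general", "normal")
```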

Create a new file for the chained agent system:

# chained_analysis_agents.py
import asyncio
import json
from typing import Dict, Any
from openai import AsyncOpenAI
from omnidaemon import OmniDaemonSDK, AgentConfig, SubscriptionConfig, EventEnvelope, PayloadBase
from decouple import config

sdk = OmniDaemonSDK()
client = AsyncOpenAI(api_key=config("OPENAI_API_KEY"))

# ============================================================================
# AGENT 1: Parse and Classify Logs
# ============================================================================

async def parse_and_classify(message: dict):
    """
    First agent in the chain: Parse log entries and classify them
    Routes logs to specialized agents based on type and content
    """
    content = message.get("content", {})
    log_entry = content.get("log_entry", {})
    source_file = content.get("source_file", "unknown")
    
    print(f"\n🔍 [CLASSIFIER] Processing log from {source_file}")
    
    # Classify log type and determine routing
    log_type = log_entry.get("log_type", "unknown")
    classification = {
        "log_entry": log_entry,
        "source_file": source_file,
        "classification": None,
        "route_to": None,
        "priority": "normal"
    }
    
    if log_type == "authentication":
        # Check for security concerns
        success = log_entry.get("success", True)
        event = log_entry.get("event", "")
        ip = log_entry.get("ip_address", "")
        
        # Detect security-related patterns
        is_security_concern = (
            not success or
            event in ["failed_login", "brute_force_attempt", "impossible_travel"] or
            ip.startswith(("45.", "89."))  # Suspicious IP ranges
        )
        
        if is_security_concern:
            classification["classification"] = "security_threat"
            classification["route_to"] = "logs.security"
            classification["priority"] = "high"
            print(f"   ⚠️  Classified as: SECURITY THREAT → Routing to security agent")
        else:
            classification["classification"] = "normal_auth"
            classification["route_to"] = "logs.security"
            classification["priority"] = "low"
            print(f"   ✅ Classified as: Normal Authentication → Security agent")
    
    elif log_type == "database":
        # Check for performance concerns
        execution_time = log_entry.get("execution_time_ms", 0)
        rows_affected = log_entry.get("rows_affected", 0)
        operation = log_entry.get("operation", "")
        
        # Detect performance issues
        is_performance_issue = (
            execution_time > 1000 or  # Slow query (> 1 second)
            (operation == "DELETE" and rows_affected > 500) or
            rows_affected > 5000
        )
        
        if is_performance_issue:
            classification["classification"] = "performance_issue"
            classification["route_to"] = "logs.performance"
            classification["priority"] = "high"
            print(f"   🐌 Classified as: PERFORMANCE ISSUE → Routing to performance agent")
        else:
            classification["classification"] = "normal_db"
            classification["route_to"] = "logs.performance"
            classification["priority"] = "low"
            print(f"   ✅ Classified as: Normal Database Op → Performance agent")
    
    else:
        classification["classification"] = "unknown"
        classification["route_to"] = "logs.general"
        print(f"   ❓ Unknown log type → General analysis")
    
    # Publish to appropriate specialized agent with reply_to for final reporting
    await sdk.publish_task(
        event_envelope=EventEnvelope(
            topic=classification["route_to"],
            payload=PayloadBase(
                content=classification,
                reply_to="logs.final_report"  # All results go to final reporting
            ),
            source="classifier_agent",
            correlation_id=message.get("correlation_id", f"corr_{log_entry.get('timestamp')}")
        )
    )
    
    return {
        "status": "classified",
        "classification": classification["classification"],
        "routed_to": classification["route_to"],
        "priority": classification["priority"]
    }

# ============================================================================
# AGENT 2: Security-Specific Analysis
# ============================================================================

SECURITY_PROMPT = """You are a cybersecurity expert specializing in authentication and access control.
Analyze this log entry for security threats:

1. Identify specific attack patterns (brute force, credential stuffing, account takeover)
2. Assess threat severity (critical, high, medium, low)
3. Check for IOCs (Indicators of Compromise): suspicious IPs, user agents, access patterns
4. Recommend immediate security actions

Return JSON:
{
    "threat_level": "critical|high|medium|low",
    "attack_type": "description of attack pattern",
    "iocs": ["list of indicators of compromise"],
    "immediate_actions": ["list of urgent actions needed"],
    "recommendations": ["security recommendations"]
}"""

async def security_analysis(message: dict):
    """
    Specialized agent for security analysis
    Deep dive into authentication logs and security threats
    """
    content = message.get("content", {})
    log_entry = content.get("log_entry", {})
    classification = content.get("classification", "unknown")
    priority = content.get("priority", "normal")
    
    print(f"\n🛡️  [SECURITY] Analyzing {classification} (Priority: {priority})")
    
    try:
        # Prepare security context
        log_context = json.dumps(log_entry, indent=2)
        
        # Call LLM for security analysis
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": SECURITY_PROMPT},
                {"role": "user", "content": f"Analyze this authentication log:\n\n{log_context}"}
            ],
            temperature=0.2,  # Lower temperature for security analysis
            max_tokens=800
        )
        
        analysis_text = response.choices[0].message.content
        
        # Parse JSON from response
        if "```json" in analysis_text:
            analysis_text = analysis_text.split("```json")[1].split("```")[0].strip()
        elif "```" in analysis_text:
            analysis_text = analysis_text.split("```")[1].split("```")[0].strip()
        
        analysis = json.loads(analysis_text)  # Named so it does not shadow this function
        
        # Log findings
        threat_level = analysis.get("threat_level", "low")
        if threat_level in ["critical", "high"]:
            print(f"   🚨 THREAT DETECTED: {analysis.get('attack_type')}")
            print(f"   📍 IOCs: {', '.join(analysis.get('iocs', [])[:2])}")
        else:
            print(f"   ✅ No significant threats detected")
        
        return {
            "status": "analyzed",
            "agent_type": "security",
            "log_entry": log_entry,
            "classification": classification,
            "analysis": analysis,
            "correlation_id": message.get("correlation_id")
        }
        
    except Exception as e:
        print(f"   ❌ Security analysis error: {e}")
        return {
            "status": "error",
            "agent_type": "security",
            "error": str(e),
            "log_entry": log_entry
        }

# ============================================================================
# AGENT 3: Performance Analysis
# ============================================================================

PERFORMANCE_PROMPT = """You are a database performance expert.
Analyze this database transaction log for performance issues:

1. Identify slow queries and resource bottlenecks
2. Detect unusual data access patterns
3. Flag potential database design issues
4. Recommend query optimizations and indexing strategies

Return JSON:
{
    "performance_score": "excellent|good|degraded|critical",
    "issues_found": ["list of performance issues"],
    "bottlenecks": ["identified bottlenecks"],
    "optimization_recommendations": ["specific optimization steps"],
    "estimated_impact": "description of performance impact"
}"""

async def performance_analysis(message: dict):
    """
    Specialized agent for database performance analysis
    Focuses on query optimization and resource utilization
    """
    content = message.get("content", {})
    log_entry = content.get("log_entry", {})
    classification = content.get("classification", "unknown")
    priority = content.get("priority", "normal")
    
    print(f"\n⚡ [PERFORMANCE] Analyzing {classification} (Priority: {priority})")
    
    try:
        # Prepare performance context
        log_context = json.dumps(log_entry, indent=2)
        
        # Call LLM for performance analysis
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": PERFORMANCE_PROMPT},
                {"role": "user", "content": f"Analyze this database log:\n\n{log_context}"}
            ],
            temperature=0.2,
            max_tokens=800
        )
        
        analysis_text = response.choices[0].message.content
        
        # Parse JSON from response
        if "```json" in analysis_text:
            analysis_text = analysis_text.split("```json")[1].split("```")[0].strip()
        elif "```" in analysis_text:
            analysis_text = analysis_text.split("```")[1].split("```")[0].strip()
        
        perf_analysis = json.loads(analysis_text)
        
        # Log findings
        perf_score = perf_analysis.get("performance_score", "good")
        if perf_score in ["degraded", "critical"]:
            print(f"   🐌 PERFORMANCE ISSUE: {perf_score.upper()}")
            issues = perf_analysis.get("issues_found", [])
            if issues:
                print(f"   📊 Issues: {', '.join(issues[:2])}")
        else:
            print(f"   ✅ Performance acceptable: {perf_score}")
        
        return {
            "status": "analyzed",
            "agent_type": "performance",
            "log_entry": log_entry,
            "classification": classification,
            "analysis": perf_analysis,
            "correlation_id": message.get("correlation_id")
        }
        
    except Exception as e:
        print(f"   ❌ Performance analysis error: {e}")
        return {
            "status": "error",
            "agent_type": "performance",
            "error": str(e),
            "log_entry": log_entry
        }

# ============================================================================
# AGENT 4: Final Report Aggregator
# ============================================================================

async def final_report_aggregator(message: dict):
    """
    Final agent in the chain: Aggregates specialized analyses
    Combines security and performance insights into unified report
    """
    content = message.get("content", {})
    agent_type = content.get("agent_type", "unknown")
    analysis = content.get("analysis", {})
    log_entry = content.get("log_entry", {})
    correlation_id = content.get("correlation_id", "unknown")
    
    print(f"\n📋 [FINAL REPORT] Aggregating {agent_type} analysis (ID: {correlation_id})")
    
    # Create unified report
    report = {
        "correlation_id": correlation_id,
        "timestamp": log_entry.get("timestamp"),
        "log_type": log_entry.get("log_type"),
        "agent_type": agent_type,
        "analysis": analysis
    }
    
    # Determine overall severity
    if agent_type == "security":
        threat_level = analysis.get("threat_level", "low")
        if threat_level in ["critical", "high"]:
            print(f"   🚨 SECURITY ALERT: {threat_level.upper()} threat detected")
            print(f"   🎯 Actions: {', '.join(analysis.get('immediate_actions', [])[:2])}")
    
    elif agent_type == "performance":
        perf_score = analysis.get("performance_score", "good")
        if perf_score in ["degraded", "critical"]:
            print(f"   ⚠️  PERFORMANCE ALERT: {perf_score.upper()} detected")
            print(f"   💡 Recommendations: {', '.join(analysis.get('optimization_recommendations', [])[:2])}")
    
    return {
        "status": "reported",
        "report": report
    }

# ============================================================================
# Main: Register All Agents in the Chain
# ============================================================================

async def main():
    try:
        print("🚀 Starting Chained Analysis System...")
        print("=" * 70)
        print("Pipeline: Classifier → Security/Performance → Final Report")
        print("=" * 70)
        
        # Register Agent 1: Classifier (entry point)
        await sdk.register_agent(
            agent_config=AgentConfig(
                name="CLASSIFIER_AGENT",
                topic="logs.raw",
                callback=parse_and_classify,
                description="Parses and classifies logs for routing",
                config=SubscriptionConfig(
                    consumer_count=2  # Parallel classification
                )
            )
        )
        print("✅ Registered: Classifier Agent (logs.raw)")
        
        # Register Agent 2: Security Analyzer
        await sdk.register_agent(
            agent_config=AgentConfig(
                name="SECURITY_ANALYSIS_AGENT",
                topic="logs.security",
                callback=security_analysis,
                description="Specialized security threat analysis",
                tools=["openai", "security_analysis"],
                config=SubscriptionConfig(
                    consumer_count=3  # More consumers for security
                )
            )
        )
        print("✅ Registered: Security Analysis Agent (logs.security)")
        
        # Register Agent 3: Performance Analyzer
        await sdk.register_agent(
            agent_config=AgentConfig(
                name="PERFORMANCE_ANALYSIS_AGENT",
                topic="logs.performance",
                callback=performance_analysis,
                description="Specialized database performance analysis",
                tools=["openai", "performance_analysis"],
                config=SubscriptionConfig(
                    consumer_count=3
                )
            )
        )
        print("✅ Registered: Performance Analysis Agent (logs.performance)")
        
        # Register Agent 4: Final Report Aggregator
        await sdk.register_agent(
            agent_config=AgentConfig(
                name="FINAL_REPORT_AGENT",
                topic="logs.final_report",
                callback=final_report_aggregator,
                description="Aggregates specialized analysis into final reports",
                config=SubscriptionConfig(
                    consumer_count=1  # Single consumer for aggregation
                )
            )
        )
        print("✅ Registered: Final Report Agent (logs.final_report)")
        
        # Start all agents
        await sdk.start()
        print("\n" + "=" * 70)
        print("🎧 Chained Analysis System is LIVE!")
        print("=" * 70)
        print("\nAgent Chain Flow:")
        print("  1️⃣  logs.raw → Classifier")
        print("  2️⃣  Classifier → logs.security OR logs.performance")
        print("  3️⃣  Specialized Agent → logs.final_report")
        print("  4️⃣  Final Report → Aggregated Insights")
        print("\nPress Ctrl+C to stop...")
        
        # Keep running
        try:
            while True:
                await asyncio.sleep(1)
        except KeyboardInterrupt:
            print("\n\n👋 Received shutdown signal...")
    
    except Exception as e:
        print(f"❌ Error: {e}")
        raise
    
    finally:
        print("Shutting down all agents...")
        await sdk.shutdown()
        print("✅ Shutdown complete")

if __name__ == "__main__":
    asyncio.run(main())

Run the chained agents to see the combined results.

python chained_analysis_agents.py
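
Both specialized agents repeat the same fence-stripping logic before calling json.loads. As a minimal sketch, that step could be pulled into one helper. The name extract_json is hypothetical and not part of the original script, and the helper assumes the model returns a single JSON object, optionally wrapped in a Markdown code fence.

```python
import json


def extract_json(text: str) -> dict:
    """Parse a JSON object from an LLM reply, tolerating Markdown code fences."""
    fence = "`" * 3  # three backticks, built at runtime to keep this listing readable
    cleaned = text.strip()
    if fence + "json" in cleaned:
        # Keep only what sits between the json-tagged fence and the next fence
        cleaned = cleaned.split(fence + "json", 1)[1].split(fence, 1)[0]
    elif fence in cleaned:
        # Untagged fence: keep what sits between the first pair of fences
        cleaned = cleaned.split(fence, 1)[1].split(fence, 1)[0]
    return json.loads(cleaned.strip())
```

With a helper like this, each agent could call extract_json(response.choices[0].message.content) and keep its existing except block to handle malformed replies.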

Advanced Features and Patterns

1. Agent Chaining for Complex Workflows

The chained system we built demonstrates powerful agent orchestration patterns:

Pattern 1: Intelligent Classification and Routing

# Classifier examines log and routes to specialist
async def parse_and_classify(message: dict):
    log_entry = message.get("content", {}).get("log_entry", {})
    
    # Determine routing based on content analysis
    if is_security_concern(log_entry):
        route_to = "logs.security"
        priority = "high"
    elif is_performance_issue(log_entry):
        route_to = "logs.performance"
        priority = "high"
    else:
        route_to = "logs.general"
        priority = "normal"
    
    # Publish to specialized agent with correlation tracking
    await sdk.publish_task(
        event_envelope=EventEnvelope(
            topic=route_to,
            payload=PayloadBase(
                content=classification_data,
                reply_to="logs.final_report"  # All results aggregated here
            ),
            correlation_id=generate_correlation_id(log_entry)
        )
    )
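
The routing snippet above calls is_security_concern, is_performance_issue, and generate_correlation_id without showing them. The sketch below is one possible rule-based version, not the tutorial's actual implementation. The keyword lists, the message and duration_ms field names, the one-second threshold, and the choose_route helper are all assumptions made for illustration.

```python
import hashlib
import json

# Hypothetical keyword rules; a real deployment would tune or replace these.
SECURITY_HINTS = ("failed login", "unauthorized", "brute force", "invalid token")
PERFORMANCE_HINTS = ("slow query", "timeout", "deadlock", "lock wait")


def _message_text(log_entry: dict) -> str:
    """Lower-cased free-text part of the log (assumes a 'message' field)."""
    return str(log_entry.get("message", "")).lower()


def is_security_concern(log_entry: dict) -> bool:
    text = _message_text(log_entry)
    return any(hint in text for hint in SECURITY_HINTS)


def is_performance_issue(log_entry: dict) -> bool:
    text = _message_text(log_entry)
    # Assumed field: 'duration_ms'; anything slower than one second counts as slow
    slow = float(log_entry.get("duration_ms", 0) or 0) > 1000
    return slow or any(hint in text for hint in PERFORMANCE_HINTS)


def generate_correlation_id(log_entry: dict) -> str:
    """Deterministic ID: the same log entry always maps to the same value."""
    canonical = json.dumps(log_entry, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def choose_route(log_entry: dict) -> tuple:
    """Return (topic, priority) following the same branching as the classifier."""
    if is_security_concern(log_entry):
        return "logs.security", "high"
    if is_performance_issue(log_entry):
        return "logs.performance", "high"
    return "logs.general", "normal"
```

Keeping the routing decision in a pure function such as choose_route makes it easy to unit test without a running event bus.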

Pattern 2: Parallel Specialized Processing

# Multiple specialized agents process different aspects simultaneously
# Security agent handles authentication threats
await sdk.register_agent(
    agent_config=AgentConfig(
        topic="logs.security",
        callback=security_analysis,
        config=SubscriptionConfig(consumer_count=3)  # 3 parallel security analysts
    )
)

# Performance agent handles database optimization
await sdk.register_agent(
    agent_config=AgentConfig(
        topic="logs.performance",
        callback=performance_analysis,
        config=SubscriptionConfig(consumer_count=3)  # 3 parallel performance analysts
    )
)

Pattern 3: Correlation-Based Aggregation

# Final agent aggregates related analyses using correlation_id
async def final_report_aggregator(message: dict):
    correlation_id = message.get("correlation_id")
    
    # Track related events across pipeline
    # All events with same correlation_id are part of same analysis flow
    report = {
        "correlation_id": correlation_id,
        "security_analysis": ...,
        "performance_analysis": ...,
        "unified_recommendations": ...
    }
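
To make the aggregation idea concrete, here is a minimal in-memory sketch that groups analyses by correlation_id and merges their recommendations. The ReportStore class is hypothetical. It assumes a single aggregator process (matching consumer_count=1 above) and reuses the recommendations and optimization_recommendations keys from the two prompts. A production system would likely keep this state in a shared store instead.

```python
from collections import defaultdict


class ReportStore:
    """Collects per-agent analyses that share a correlation_id (in memory only)."""

    def __init__(self) -> None:
        self._by_id = defaultdict(dict)

    def add(self, correlation_id: str, agent_type: str, analysis: dict) -> dict:
        """Record one analysis and return the updated unified report."""
        self._by_id[correlation_id][agent_type] = analysis
        return self.report(correlation_id)

    def report(self, correlation_id: str) -> dict:
        """Build the unified report from whatever has arrived so far."""
        analyses = self._by_id.get(correlation_id, {})
        security = analyses.get("security")
        performance = analyses.get("performance")
        recommendations = []
        recommendations += (security or {}).get("recommendations", [])
        recommendations += (performance or {}).get("optimization_recommendations", [])
        return {
            "correlation_id": correlation_id,
            "security_analysis": security,
            "performance_analysis": performance,
            "unified_recommendations": recommendations,
        }
```

The aggregator callback could then call store.add(correlation_id, agent_type, analysis) for each incoming message and return the resulting report.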

2. Multi-Tenant Log Analysis

Extend the system to handle multiple tenants:

async def multi_tenant_analysis_callback(message: dict):
    tenant_id = message.get("tenant_id")
    content = message.get("content", {})
    
    # Load tenant-specific security policies
    policies = await load_tenant_policies(tenant_id)
    
    # Analyze with tenant context
    result = await analyze_log_with_llm(content, policies)
    
    return {
        "status": "success",
        "tenant_id": tenant_id,
        "analysis": result
    }

3. Real-Time Alerting with Webhooks

Send immediate alerts for critical issues:

event = EventEnvelope(
    topic="logs.raw",
    payload=PayloadBase(
        content=log_data,
        webhook="https://your-api.com/alerts"  # Receive instant notifications
    )
)

4. Horizontal Scaling

Handle high log volumes by scaling agents:

await sdk.register_agent(
    agent_config=AgentConfig(
        name="LOG_ANALYSIS_AGENT",
        topic="logs.raw",
        callback=analyze_log,
        config=SubscriptionConfig(
            consumer_count=10  # Run 10 parallel consumers
        )
    )
)

Monitoring and Observability

# Check system health
health = await sdk.health()
print(f"Status: {health['status']}")
print(f"Active Agents: {health['registered_agents_count']}")

# Check dead letter queue (shell command, not Python)
omnidaemon bus dlq --topic logs.raw
# View metrics (shell command, not Python)
omnidaemon metrics list

Error Handling and Retry Logic

async def robust_analysis_callback(message: dict):
    try:
        result = await analyze_log_with_llm(message.get("content"))
        return {"status": "success", "data": result}
    except RateLimitError as e:
        # Retriable error - OmniDaemon will retry
        raise
    except InvalidLogFormat as e:
        # Non-retriable error - return error status
        return {"status": "error", "error": str(e)}
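
The callback above refers to RateLimitError and InvalidLogFormat without defining them, and relies on the runtime to retry when an exception propagates. The sketch below defines both as plain exception classes and adds a stand-in run_with_retries loop with exponential backoff, so the retriable versus non-retriable split can be tested locally. All three names are illustrative assumptions; this is not OmniDaemon's actual retry mechanism.

```python
import asyncio


class RateLimitError(Exception):
    """Transient failure: worth retrying after a pause."""


class InvalidLogFormat(Exception):
    """Permanent failure: retrying the same payload cannot help."""


async def run_with_retries(callback, message: dict, max_attempts: int = 3,
                           base_delay: float = 0.5):
    """Call an async callback, retrying only on RateLimitError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await callback(message)
        except RateLimitError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the error to the caller
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
```

Any other exception, including InvalidLogFormat, passes straight through on the first attempt, which mirrors the intent of returning an error status instead of retrying.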

Performance Optimization

# Configuration for high-throughput scenarios
STORAGE_BACKEND=redis  # Use Redis for distributed systems
EVENT_BUS_TYPE=redis_stream  # Or Kafka for massive scale
REDIS_URL=redis://prod-redis.example.com:6379

# Increase consumer count for parallel processing
consumer_count=10
# Optimize LLM calls with batching
async def batch_analyze_logs(log_batch: list):
    # Analyze multiple logs in single LLM call
    pass
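
The batching stub above is left empty, so here is a hedged sketch of the two pieces it would probably need: a helper that splits logs into fixed-size batches and one that packs a batch into a single prompt. The names chunked and build_batch_prompt and the prompt wording are assumptions. The actual LLM call is omitted on purpose.

```python
import json
from typing import Iterator


def chunked(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_batch_prompt(log_batch: list) -> str:
    """Number each log so the model can return one result per index."""
    numbered = [{"index": i, "log_entry": entry} for i, entry in enumerate(log_batch)]
    return (
        "Analyze each log entry and return a JSON list with one result per index:\n\n"
        + json.dumps(numbered, indent=2)
    )
```

Larger batches reduce the number of API calls but increase prompt size and the cost of a single failed request, so the batch size is worth tuning against the model's context window.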


Exploring TimesFM: The Foundation Model That Understands the Language of Time

Vishnu Sivan — Thu, 23 Oct 2025 17:44:09 GMT

Forecasting the future has always been a cornerstone of human progress — from predicting sales for the next quarter to anticipating global health trends. Traditionally, time-series forecasting has relied on statistical models like ARIMA and Exponential Smoothing, and later, deep learning-based methods. While these models have proven effective, they often demand significant domain expertise and extensive fine-tuning for each new dataset.

In today’s data-driven world, accurate forecasting powers industries such as retail, finance, energy, and healthcare, enabling smarter, evidence-based decision-making. With recent advances in deep learning, the paradigm is shifting — traditional models are being surpassed by large-scale neural architectures capable of understanding complex temporal patterns.

Enter TimesFM (Time-series Foundation Model) — a groundbreaking decoder-only foundation model from Google Research, purpose-built for time-series forecasting. Unlike conventional models that require dataset-specific training, TimesFM achieves remarkable zero-shot forecasting performance, making it a true leap forward in how we approach predictive modeling.

This article explores what makes TimesFM unique: how it understands the “language of time” and captures trends, rhythms, and seasonality across diverse domains. It covers the core architecture, highlights the model’s innovative features, and walks through practical examples of how to use it effectively in real-world applications.

Getting Started

The Evolution of Time-Series Forecasting
What is a Decoder?
What is TimesFM?
Architecture Overview
Experimenting with TimesFM model
Hands-on 1: Zero-Shot Forecasting
Hands-on 2: Zero-Shot Forecasting with stats
Hands-on 3: Finetuning the model
Hands-on 4: Model Comparison
Building NextGen Forecasting app using TimesFM

The Evolution of Time-Series Forecasting

Time-series forecasting has evolved significantly over the years. Traditional models like ARIMA and Exponential Smoothing performed well for simple, univariate data but struggled with the complexities of modern, multivariate, and high-dimensional datasets.

The rise of deep learning models such as DeepAR and N-BEATS addressed these challenges by using neural networks capable of capturing long-term dependencies and intricate temporal patterns.

Now, a new generation of models is redefining the field. TimesFM, a foundation model for time-series forecasting, is pre-trained on diverse datasets and achieves high-accuracy, zero-shot predictions — eliminating the need for task-specific fine-tuning and marking a major leap forward in forecasting technology.

What is a Decoder?

At the core of TimesFM lies the decoder architecture, a key component also used in modern Large Language Models (LLMs) like GPT. A decoder is a neural network designed for generative tasks, where it produces new sequences based on a given context.

In language models, the decoder takes a sequence of words and predicts the next one in an autoregressive manner — generating text step by step. TimesFM applies this same concept to time-series data. Instead of predicting the next word, it predicts the next value or segment in a numerical sequence. Essentially, the model learns the “grammar of time” — understanding patterns, rhythms, and trends — to generate accurate forecasts.
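
The autoregressive idea can be sketched with a toy loop: predict a segment, append it to the context, and repeat until the horizon is covered. The predictor here is a deliberately naive stand-in (it repeats the mean of the last seven values), not the TimesFM network, and both function names are hypothetical.

```python
import numpy as np


def naive_patch_predictor(context: np.ndarray, output_patch_len: int) -> np.ndarray:
    """Stand-in for a trained model: repeats the mean of the last 7 values."""
    return np.full(output_patch_len, context[-7:].mean())


def autoregressive_forecast(history, horizon: int, output_patch_len: int,
                            predictor=naive_patch_predictor) -> np.ndarray:
    """Generate `horizon` values by feeding each predicted segment back as context."""
    context = np.asarray(history, dtype=np.float64)
    forecast = np.empty(0)
    while len(forecast) < horizon:
        patch = predictor(context, output_patch_len)
        forecast = np.concatenate([forecast, patch])
        context = np.concatenate([context, patch])  # Predictions become new context
    return forecast[:horizon]  # Trim any overshoot from the last segment
```

Swapping naive_patch_predictor for a real model changes the quality of each segment, but the outer loop stays the same.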

What is TimesFM?

TimesFM is a decoder-only foundation model for time-series forecasting, inspired by the architecture of large language models like GPT-3. Unlike traditional models, it predicts future values directly from past data without requiring an encoder.

The model processes time-series data in patches — segments of sequential values — a technique that improves efficiency and helps capture long-term temporal patterns.
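
As a rough illustration of patching, the snippet below splits a 1-D series into non-overlapping segments with NumPy. The to_patches helper is hypothetical, and dropping the oldest leftover values is a simplification chosen for this sketch. It is not necessarily how TimesFM pads or truncates its inputs.

```python
import numpy as np


def to_patches(series, patch_len: int) -> np.ndarray:
    """Split a 1-D series into non-overlapping rows of length `patch_len`."""
    series = np.asarray(series, dtype=np.float32)
    usable = (len(series) // patch_len) * patch_len
    if usable == 0:
        raise ValueError("series is shorter than one patch")
    # Keep the most recent values; drop the oldest ones that do not fill a patch
    return series[-usable:].reshape(-1, patch_len)
```

For example, a series of 10 values with patch_len=4 yields 2 patches built from the last 8 values.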

Designed as a general-purpose, zero-shot forecaster, TimesFM can deliver accurate predictions for entirely new datasets without any fine-tuning, marking a major shift from traditional, task-specific forecasting approaches.

Architecture Overview

TimesFM 1.0 is a decoder-only transformer model consisting of approximately 200 million parameters, trained on a massive pretraining dataset containing over 100 billion real-world time points. This large-scale training enables the model to deliver highly accurate forecasts on unseen datasets without any additional fine-tuning.

The model is designed for univariate time-series forecasting, where it predicts the future values of a single variable using only its past observations. TimesFM handles context lengths up to 512 time points and supports any forecast horizon, with an optional frequency indicator input to incorporate time granularity information.

TimesFM’s architecture combines a scalable transformer backbone, efficient patch-based processing, and robust masking techniques to achieve zero-shot generalization across diverse time-series datasets — setting a new standard for foundation models in forecasting.

Image source: A decoder-only foundation model for time-series forecasting

In TimesFM, the input time-series is first divided into fixed-length input patches. Each patch is then transformed into a vector using a residual block, aligning it with the model’s transformer dimensions. Positional encodings are added to these vectors to preserve the temporal order before they are passed through the stacked transformer layers.

Within the transformer, SA (self-attention) represents multi-head causal attention, which ensures the model only attends to past and present data, while FFN refers to the feed-forward (fully connected) layers that refine the learned representations.

Finally, the output tokens are processed through another residual block to produce an output of length output_patch_len, representing the model’s forecast for the next time window beyond the last observed patch.

Below is a concise overview of its key architectural elements:

Decoder-Only Design:
TimesFM utilizes a decoder-only setup, making it inherently capable of generating future sequences from past context. This structure enables it to manage input sequences of varying lengths with flexibility and efficiency.
Input Patching:
The time-series data is divided into contiguous, non-overlapping patches (tokens), which pass through residual blocks that transform each patch into a vector representation. These vectors are then processed by the transformer layers for deeper temporal understanding.
Stacked Transformer Layers:
Multiple transformer layers equipped with multi-head self-attention allow each token to reference others in the sequence. TimesFM employs causal attention, ensuring that predictions are influenced only by past and present information — never by future data.
Extended Output Patches:
A key innovation in TimesFM is its ability to generate longer output patches than the input patches. Unlike traditional LLMs that produce outputs token by token, TimesFM predicts entire segments of future data at once, enabling faster inference and more accurate long-term forecasts.
Patch Masking:
To prevent overfitting and enhance generalization, TimesFM applies a patch masking strategy during training. This allows the model to adapt seamlessly to different context lengths during inference, ensuring strong and consistent performance across diverse time-series datasets.
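
Causal attention, mentioned in the list above, is easy to see in a small NumPy sketch. The mask is lower-triangular, so token i can only weight tokens 0 through i. This is a generic illustration of the technique with hypothetical helper names, not code taken from the TimesFM repository.

```python
import numpy as np


def causal_mask(num_tokens: int) -> np.ndarray:
    """Boolean matrix where entry [i, j] is True when token i may attend to token j."""
    return np.tril(np.ones((num_tokens, num_tokens), dtype=bool))


def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Softmax over the last axis with disallowed positions forced to zero weight."""
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # Numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)
```

With equal scores, the first token puts all of its weight on itself, while the third token spreads its weight evenly over the first three positions.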

Model Parameters (Hyperparameters)
These are the key tunable settings that define the model’s behavior and impact its forecasting performance:

model_dim: Dimensionality of the input and output vectors.
input_patch_len (p): Length of each input patch.
output_patch_len (h): Number of future time steps predicted per generation step.
num_heads: Number of attention heads in the multi-head attention mechanism.
num_layers (nl): Number of stacked transformer layers.
context length (L): Length of historical data used for forecasting.
horizon length (H): Length of the forecast horizon.
Number of input tokens (N): The number of patches derived from the input sequence, N = ⌊L / p⌋. Each token is then processed through the transformer layers for contextual learning.
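
The relationship between these settings comes down to two small pieces of arithmetic, sketched below. The function names are hypothetical, and the numbers in the usage note are illustrative values rather than settings read from a specific checkpoint.

```python
import math


def num_input_tokens(context_len: int, input_patch_len: int) -> int:
    """N: how many full input patches fit into a context of length L."""
    return context_len // input_patch_len


def num_decode_steps(horizon_len: int, output_patch_len: int) -> int:
    """How many generation steps are needed to cover a horizon H with patches of length h."""
    return math.ceil(horizon_len / output_patch_len)
```

With L = 512 and p = 32, the model sees 16 input tokens. With h = 128, a horizon of 256 needs 2 generation steps, while a horizon of 90 fits in a single step. This is why longer output patches reduce the number of autoregressive steps.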

Model Components
The core architecture of TimesFM is composed of the following fundamental components:

Residual Blocks: Used to preprocess input and output patches, ensuring stable gradient flow and efficient feature extraction.
Stacked Transformer Layers: The central component of the model, enabling rich temporal representation learning through self-attention.
Input Tokens (t_j): Derived from processed patches, these tokens are generated using residual blocks and positional encodings.
Output Tokens (o_j): Generated by the stacked transformer layers, these tokens are used to predict the corresponding output patches.
Patch Mask (m_{1:L}): Applied to ignore certain portions of the input sequence during processing, helping the model generalize to varying context lengths.
Loss Function: During training, TimesFM minimizes the Mean Squared Error (MSE) between predicted and actual future values.
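
For reference, the MSE objective named in the last item can be written in a few lines of NumPy. This is a generic sketch of the metric under the assumption that predictions and targets have the same shape. The mse_loss helper is hypothetical and is not the training code used for TimesFM.

```python
import numpy as np


def mse_loss(predicted, actual) -> float:
    """Mean of the squared differences between predicted and actual values."""
    predicted = np.asarray(predicted, dtype=np.float64)
    actual = np.asarray(actual, dtype=np.float64)
    if predicted.shape != actual.shape:
        raise ValueError("predicted and actual must have the same shape")
    return float(np.mean((predicted - actual) ** 2))
```

Because errors are squared, a single large miss contributes far more to the loss than several small ones.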

Experimenting with TimesFM model

This section guides the setup of the TimesFM model for time-series forecasting, starting with zero-shot forecasting to demonstrate predictions on unseen datasets without prior training. It also covers fine-tuning for specific datasets to enhance performance.

TimesFM’s performance is compared with other approaches, including statistical models (e.g., AutoETS), machine learning models (e.g., Random Forest, XGBoost, LGBM), and other foundational models such as TimeGPT, highlighting its unique capabilities and advantages.

We will also use uv, a modern and fast Python package manager (instead of pip), to set up our environment and handle dependencies.

Installing uv

uv simplifies dependency management, virtual environments, and running scripts.

For Windows:

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
set Path=%USERPROFILE%\.local\bin;%Path%

For Linux / Mac:

curl -LsSf https://astral.sh/uv/install.sh | sh

Refer to the official website for detailed installation instructions.

Installation | uv

Installing the dependencies

Initialize a uv project by executing the following command.

uv init timesfm_demo
cd timesfm_demo

Create and activate a virtual environment by executing the following command.

uv venv
source .venv/bin/activate # for linux
.venv\Scripts\activate    # for windows

The official PyPI package for TimesFM provides older versions (e.g., timesfm-2.0-500m and timesfm-1.0-200m). To access the latest version (timesfm-2.5-200m-pytorch), it is recommended to install the library directly from GitHub. This ensures access to the most recent features, improvements, and bug fixes.

Clone the github repository and move to the folder using the following command.

git clone https://github.com/google-research/timesfm.git
cd timesfm

Install timesfm by executing the following command.

uv pip install -e .[torch]

Hands-on 1: Zero-Shot Forecasting

Zero-shot forecasting allows TimesFM to generate accurate predictions on unseen datasets without any prior training or fine-tuning. Leveraging its pre-trained knowledge on diverse time-series data, the model can identify trends, seasonality, and temporal patterns directly from historical inputs.

In this demo, using a synthetic bike rental dataset with trend, seasonality, and noise, TimesFM was able to accurately forecast the next 90 days. Despite having no prior exposure to this dataset, the model effectively captured the underlying temporal patterns, demonstrating the power of zero-shot forecasting.

Install the matplotlib library for graph-based visualization using uv.

uv add matplotlib

Create a file named bike_rental_forecast.py and add the following code to it.

import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import timesfm

torch.set_float32_matmul_precision("high")

# --- 1. Generate Synthetic Bike Rental Data ---
def create_bike_rental_data():
    dates = pd.date_range(start="2020-01-01", end="2023-12-31", freq="D")
    n_days = len(dates)

    trend = np.linspace(start=150, stop=350, num=n_days)
    yearly_seasonality = 180 * (1 + np.sin(2 * np.pi * dates.dayofyear / 365.25 - np.pi/2))
    weekly_seasonality = 120 * (dates.dayofweek >= 5)
    noise = np.random.normal(0, 40, n_days)
    rentals = trend + yearly_seasonality + weekly_seasonality + noise
    rentals = np.maximum(0, rentals).astype(int)  # Ensure no negative rentals
    temp = 15 + 10 * np.sin(2 * np.pi * dates.dayofyear / 365.25 - np.pi/2) + np.random.randn(n_days) * 2
    is_weekend = (dates.dayofweek >= 5).astype(int)

    df = pd.DataFrame({
        'rentals': rentals,
        'temperature': temp,
        'is_weekend': is_weekend
    }, index=dates)

    return df

# --- 2. Prepare Data ---
rental_df = create_bike_rental_data()
time_series_data = rental_df['rentals'].values
horizon_len = 90  # Forecast the next 90 days
historical_data = time_series_data[:-horizon_len]
true_future_values = time_series_data[-horizon_len:]

# --- 3. Initialize TimesFM 2.5 PyTorch Model ---
model = timesfm.TimesFM_2p5_200M_torch.from_pretrained("google/timesfm-2.5-200m-pytorch")
model.compile(
    timesfm.ForecastConfig(
        max_context=1024,
        max_horizon=256,
        normalize_inputs=True,
        use_continuous_quantile_head=True,
        force_flip_invariance=True,
        infer_is_positive=True,
        fix_quantile_crossing=True,
    )
)

# --- 4. Generate Forecast ---
point_forecast, quantile_forecast = model.forecast(
    horizon=horizon_len,
    inputs=[historical_data],  # Single time series as input
)
forecast_values = point_forecast[0]  # Extract the forecast for our single series

# --- 5. Visualize the Results ---
dates = rental_df.index
plt.figure(figsize=(15, 7))
plt.plot(dates[:-horizon_len], historical_data, label="Historical Data", color="black")
plt.plot(dates[-horizon_len:], true_future_values, label="True Future Values", color="blue", linestyle='--')
plt.plot(dates[-horizon_len:], forecast_values, label="TimesFM 2.5 Forecast", color="red")
plt.fill_between(dates[-horizon_len:],quantile_forecast[0, :, 1], quantile_forecast[0, :, 9], alpha=0.2, color='red', label='80% Prediction Interval')
plt.legend()
plt.title("TimesFM 2.5 Zero-Shot Forecast for Daily Bike Rentals")
plt.xlabel("Date")
plt.ylabel("Number of Rentals")
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

# --- 6. Calculate Metrics ---
rmse = np.sqrt(np.mean((true_future_values - forecast_values)**2))
mae = np.mean(np.abs(true_future_values - forecast_values))

print(f"Zero-Shot Forecast RMSE: {rmse:.2f}")
print(f"Zero-Shot Forecast MAE: {mae:.2f}")
print(f"\nPoint forecast shape: {point_forecast.shape}")
print(f"Quantile forecast shape: {quantile_forecast.shape}")

Code Summary

Generates synthetic daily bike rental data (2020–2023) using the custom function create_bike_rental_data().
Splits the dataset into historical and future segments for model evaluation.
Loads the TimesFM 2.5 model using timesfm.TimesFM_2p5_200M_torch.from_pretrained("google/timesfm-2.5-200m-pytorch") to fetch the latest pre-trained TimesFM model.
Configures the model with model.compile() using ForecastConfig to define forecast parameters (e.g., max_horizon, normalization, quantile options).
Performs forecasting via model.forecast(horizon, inputs=[historical_data]) which generates both point and quantile forecasts.
Visualizes trends and uncertainty using matplotlib.
Evaluates model performance using RMSE (Root Mean Square Error) and MAE (Mean Absolute Error).

Executing the code
To visualize trends and obtain forecasted results with TimesFM, run the code using your Python environment.

python bike_rental_forecast.py
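
RMSE and MAE only score the point forecast. To check the shaded prediction interval as well, one option is to measure its empirical coverage, meaning the share of true values that fall inside the band. The interval_coverage helper below is a hypothetical addition and is not part of bike_rental_forecast.py.

```python
import numpy as np


def interval_coverage(actual, lower, upper) -> float:
    """Fraction of actual values lying within [lower, upper], element by element."""
    actual, lower, upper = (
        np.asarray(a, dtype=np.float64) for a in (actual, lower, upper)
    )
    return float(np.mean((actual >= lower) & (actual <= upper)))
```

In the script above it could be called as interval_coverage(true_future_values, quantile_forecast[0, :, 1], quantile_forecast[0, :, 9]). A well-calibrated 80% interval should cover roughly 80% of the held-out values, although 90 points is a small sample.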

Hands-on 2: Zero-Shot Forecasting with stats

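This is an extended version of the bike rental forecast that wraps the forecasting step in a reusable function and prints summary statistics for the historical data, the forecast, the overall trend, and the prediction intervals.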

import torch
import numpy as np
from datetime import datetime, timedelta
import timesfm
import matplotlib.pyplot as plt

# Initialize TimesFM model
print("Loading TimesFM model...")
torch.set_float32_matmul_precision("high")
model = timesfm.TimesFM_2p5_200M_torch.from_pretrained("google/timesfm-2.5-200m-pytorch")
model.compile(
    timesfm.ForecastConfig(
        max_context=1024,
        max_horizon=256,
        normalize_inputs=True,
        use_continuous_quantile_head=True,
        force_flip_invariance=True,
        infer_is_positive=True,
        fix_quantile_crossing=True,
    )
)

def generate_forecast(historical_data, horizon=90, start_date=None, target_column="Bike Rentals"):

    if len(historical_data) < 30:
        raise ValueError("Historical data must contain at least 30 data points")
    
    if horizon > 256 or horizon < 1:
        raise ValueError("Horizon must be between 1 and 256 days")
    
    historical_array = np.array(historical_data, dtype=np.float32)
    
    print(f"Generating {horizon}-day forecast...")
    point_forecast, quantile_forecast = model.forecast(
        horizon=horizon,
        inputs=[historical_array],
    )
    
    # Extract forecasts
    forecast_values = point_forecast[0]
    quantiles = quantile_forecast[0]  # Shape: [horizon, 10] (mean, then 10th to 90th percentiles)
    
    # Generate dates
    if start_date:
        if isinstance(start_date, str):
            start = datetime.strptime(start_date, "%Y-%m-%d")
        else:
            start = start_date
    else:
        start = datetime.now() - timedelta(days=len(historical_data))
    
    historical_dates = [start + timedelta(days=i) for i in range(len(historical_data))]
    forecast_start = start + timedelta(days=len(historical_data))
    forecast_dates = [forecast_start + timedelta(days=i) for i in range(horizon)]
    
    # Calculate summary statistics
    historical_mean = np.mean(historical_data)
    historical_std = np.std(historical_data)
    forecast_mean = np.mean(forecast_values)
    forecast_std = np.std(forecast_values)
    
    # Print summary
    print("\n" + "="*70)
    print(f"FORECAST SUMMARY: {target_column}")
    print("="*70)
    
    print("\nHistorical Data Statistics:")
    print(f"  Period: {historical_dates[0].strftime('%Y-%m-%d')} to {historical_dates[-1].strftime('%Y-%m-%d')}")
    print(f"  Count: {len(historical_data)} days")
    print(f"  Mean: {historical_mean:.2f}")
    print(f"  Std Dev: {historical_std:.2f}")
    print(f"  Min: {np.min(historical_data):.2f}")
    print(f"  Max: {np.max(historical_data):.2f}")
    
    print("\nForecast Statistics:")
    print(f"  Period: {forecast_dates[0].strftime('%Y-%m-%d')} to {forecast_dates[-1].strftime('%Y-%m-%d')}")
    print(f"  Count: {horizon} days")
    print(f"  Mean: {forecast_mean:.2f}")
    print(f"  Std Dev: {forecast_std:.2f}")
    print(f"  Min: {np.min(forecast_values):.2f}")
    print(f"  Max: {np.max(forecast_values):.2f}")
    
    print("\nTrend Analysis:")
    change_percent = ((forecast_mean - historical_mean) / historical_mean) * 100
    trend_direction = "increasing" if change_percent > 0 else "decreasing"
    print(f"  Direction: {trend_direction.upper()}")
    print(f"  Change: {change_percent:+.2f}%")
    
    print("\nConfidence Intervals (Average):")
    print(f"  80% Interval: [{np.mean(quantiles[:, 1]):.2f}, {np.mean(quantiles[:, 9]):.2f}]")
    print(f"  40% Interval (30th-70th percentile): [{np.mean(quantiles[:, 3]):.2f}, {np.mean(quantiles[:, 7]):.2f}]")
    
    print("\nFirst 10 Forecast Values:")
    for i in range(min(10, len(forecast_values))):
        date_str = forecast_dates[i].strftime('%Y-%m-%d')
        val = forecast_values[i]
        q10 = quantiles[i, 1]
        q90 = quantiles[i, 9]
        print(f"  {date_str}: {val:.2f} (80% CI: [{q10:.2f}, {q90:.2f}])")
    
    if len(forecast_values) > 10:
        print(f"  ... and {len(forecast_values) - 10} more days")
    
    print("="*70 + "\n")
    
    # Generate plot for historical and forecasted data
    plt.figure(figsize=(15, 7))
    plt.plot(historical_dates, historical_data, 
             label="Historical Data", color="black", linewidth=2, marker='o', 
             markersize=3, markevery=max(1, len(historical_data)//50))
    plt.plot(forecast_dates, forecast_values, 
             label="Forecast", color="red", linewidth=2, marker='s',
             markersize=3, markevery=max(1, horizon//50))
    
    # Add confidence intervals
    plt.fill_between(
        forecast_dates,
        quantiles[:, 1],  # 10th percentile
        quantiles[:, 9],  # 90th percentile
        alpha=0.2,
        color='red',
        label='80% Prediction Interval'
    )
    
    plt.fill_between(
        forecast_dates,
        quantiles[:, 3],  # 30th percentile
        quantiles[:, 7],  # 70th percentile
        alpha=0.3,
        color='red',
        label='40% Prediction Interval'
    )
    
    plt.legend(loc='best', fontsize=11)
    plt.title(f"TimesFM Forecast: {target_column}", fontsize=16, fontweight='bold')
    plt.xlabel("Date", fontsize=13)
    plt.ylabel(target_column, fontsize=13)
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    
    # Return detailed results
    return {
        "forecast_values": forecast_values.tolist(),
        "forecast_dates": [d.strftime('%Y-%m-%d') for d in forecast_dates],
        "quantiles": {
            "q10": quantiles[:, 1].tolist(),
            "q30": quantiles[:, 3].tolist(),
            "q50": quantiles[:, 5].tolist(),
            "q70": quantiles[:, 7].tolist(),
            "q90": quantiles[:, 9].tolist()
        },
        "historical_mean": float(historical_mean),
        "forecast_mean": float(forecast_mean),
        "trend": trend_direction,
        "change_percent": float(change_percent)
    }


if __name__ == "__main__":
    # Create synthetic data with trend and seasonality
    np.random.seed(42)
    days = 365
    trend = np.linspace(150, 300, days)
    seasonality = 50 * np.sin(np.linspace(0, 4*np.pi, days))
    noise = np.random.normal(0, 20, days)
    historical_data = trend + seasonality + noise
    historical_data = np.maximum(historical_data, 0)  # Ensure non-negative
    
    # Generate forecast
    results = generate_forecast(
        historical_data=historical_data.tolist(),
        horizon=90,
        start_date="2023-01-01",
        target_column="Bike Rentals"
    )
    print(results)

Code Summary

Model Initialization:
The latest version of TimesFM (timesfm-2.5-200m-pytorch) is loaded and compiled with forecasting configurations such as context length, horizon, normalization, and quantile settings.
Input Preparation:
Historical time-series data is converted into a numerical array. The model supports flexible context lengths, but a minimum of 30 data points is recommended for robust forecasts.
Forecast Generation:
The model predicts the next values for the specified forecast horizon (up to 256 steps), producing both point forecasts and quantile estimates for uncertainty intervals.
Analysis and Visualization:
Summary statistics for historical and forecasted data are computed.
Trend direction and percent change are calculated to understand expected growth or decline.
Forecasts are visualized with confidence intervals to highlight prediction uncertainty.
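
The trend calculation described above can be isolated into a small helper. This is a minimal sketch, not part of the TimesFM library: the function name summarize_trend is hypothetical, and the guard against a zero historical mean is an addition that the script itself does not include.

```python
import numpy as np

def summarize_trend(historical, forecast):
    """Compare the forecast mean with the historical mean.

    Returns (direction, change_percent), using the same rule as the script:
    a positive change is "increasing", anything else is "decreasing".
    """
    hist_mean = float(np.mean(historical))
    fc_mean = float(np.mean(forecast))
    if hist_mean == 0:
        # Added guard (assumption): percent change is undefined for a zero baseline.
        raise ValueError("Historical mean is zero; percent change is undefined")
    change_percent = (fc_mean - hist_mean) / hist_mean * 100.0
    direction = "increasing" if change_percent > 0 else "decreasing"
    return direction, change_percent

# Toy values: historical mean is 100, forecast mean is 120.
direction, change = summarize_trend([100, 110, 90], [120, 130, 110])
print(direction, f"{change:+.2f}%")
```

Keeping this logic in one function makes it easy to reuse the same definition of "trend" when comparing forecasts from different models.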

Executing the code
To visualize trends and obtain forecasted results with TimesFM, run the code using your Python environment.

python bike_rental_forecast.py

Hands-on 3: Fine-tuning the model

Although TimesFM’s zero-shot capabilities are already remarkable, fine-tuning the model on a specific dataset can further enhance forecasting accuracy.

Research suggests that fine-tuning on as little as 10% of a dataset can reach state-of-the-art performance, often outperforming models trained from scratch. The current TimesFM library does not provide a high-level .finetune() API, so the script below only outlines the workflow in PyTorch: build sliding-window batches, compute a loss against the true horizon, and attempt to update the weights. Treat it as a conceptual template. Because model.forecast() returns NumPy arrays, a working setup needs a differentiable forward pass through the underlying network instead.

Create a file named bike_rental_forecast_finetuning.py and add the following code to it.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
import timesfm
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
from pathlib import Path

# Initialize TimesFM model
print("Loading TimesFM model...")
torch.set_float32_matmul_precision("high")
model = timesfm.TimesFM_2p5_200M_torch.from_pretrained("google/timesfm-2.5-200m-pytorch")

model.compile(
    timesfm.ForecastConfig(
        max_context=1024,
        max_horizon=256,
        normalize_inputs=True,
        use_continuous_quantile_head=True,
        force_flip_invariance=True,
        infer_is_positive=True,
        fix_quantile_crossing=True,
    )
)
print("Model loaded successfully!\n")


# Custom Dataset for fine-tuning
class TimeSeriesDataset(Dataset):
    def __init__(self, data, context_length=365, horizon=90):
        self.data = np.array(data, dtype=np.float32)
        self.context_length = context_length
        self.horizon = horizon
        
        # Create sliding windows
        self.samples = []
        for i in range(len(self.data) - context_length - horizon + 1):
            context = self.data[i:i + context_length]
            target = self.data[i + context_length:i + context_length + horizon]
            self.samples.append((context, target))
    
    def __len__(self):
        return len(self.samples)
    
    def __getitem__(self, idx):
        context, target = self.samples[idx]
        return torch.tensor(context), torch.tensor(target)


def prepare_finetuning_data(historical_data, batch_size=8, context_length=365, horizon=90):
    dataset = TimeSeriesDataset(historical_data, context_length, horizon)
    data_loader = DataLoader(
        dataset, 
        batch_size=batch_size, 
        shuffle=True,
        num_workers=0  # Set to 0 for compatibility
    )
    return data_loader


def finetune_model(model, train_data, num_epochs=10, learning_rate=1e-4, 
                   context_length=365, horizon=90, batch_size=8, 
                   save_path="finetuned_timesfm.pt"):
    print("="*70)
    print("STARTING MODEL FINE-TUNING")
    print("="*70)
    print(f"Training data points: {len(train_data)}")
    print(f"Context length: {context_length}")
    print(f"Forecast horizon: {horizon}")
    print(f"Batch size: {batch_size}")
    print(f"Learning rate: {learning_rate}")
    print(f"Epochs: {num_epochs}\n")
    
    # Prepare data loader
    train_loader = prepare_finetuning_data(
        train_data, 
        batch_size=batch_size,
        context_length=context_length,
        horizon=horizon
    )
    
    if len(train_loader) == 0:
        raise ValueError(f"Not enough data for training. Need at least {context_length + horizon} points.")
    
    print(f"Created {len(train_loader)} training batches\n")
    
    # Access the underlying PyTorch model held by the TimesFM wrapper
    try:
        # Try to access the internal model
        if hasattr(model, 'model'):
            pytorch_model = model.model
        elif hasattr(model, '_model'):
            pytorch_model = model._model
        else:
            # If we can't find the internal model, use the wrapper directly
            pytorch_model = model
        
        # Get trainable parameters
        trainable_params = [p for p in pytorch_model.parameters() if p.requires_grad]
        
        if len(trainable_params) == 0:
            print("Warning: No trainable parameters found. Attempting to unfreeze all parameters...")
            for param in pytorch_model.parameters():
                param.requires_grad = True
            trainable_params = list(pytorch_model.parameters())
        
        print(f"Found {len(trainable_params)} trainable parameter groups\n")
        
    except Exception as e:
        print(f"Warning: Could not access model parameters directly: {e}")
        print("TimesFM may not support direct fine-tuning through standard PyTorch methods.")
        print("Using the model in inference mode only.\n")
        return model, []
    
    # Create optimizer and loss function
    optimizer = optim.Adam(trainable_params, lr=learning_rate)
    criterion = nn.MSELoss()
    training_losses = []  # Average loss per epoch, filled in by the training loop below
    
    for epoch in range(num_epochs):
        epoch_losses = []
        
        for batch_idx, (context, ground_truth_horizon) in enumerate(train_loader):
            try:
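                # Note: TimesFM expects inputs as a list of arrays
                predicted_horizon, _ = model.forecast(
                    horizon=horizon,
                    inputs=[context[i].numpy() for i in range(context.shape[0])]
                )
                # Convert predictions to a tensor.
                # CAUTION: model.forecast() returns NumPy arrays, so this tensor is a new
                # leaf that is detached from the model's parameters.
                predicted_tensor = torch.tensor(predicted_horizon, dtype=torch.float32, requires_grad=True)
                # Calculate loss
                loss = criterion(predicted_tensor, ground_truth_horizon)
                # Backward pass: gradients reach predicted_tensor only, not the model weights,
                # so optimizer.step() leaves the weights unchanged. Real fine-tuning needs a
                # differentiable forward pass through the underlying PyTorch module.
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()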
                
                epoch_losses.append(loss.item())
                
            except Exception as e:
                print(f"Error during training batch {batch_idx}: {e}")
                print("Note: TimesFM may not support gradient-based fine-tuning.")
                print("The model will be used in inference-only mode.\n")
                return model, []
        
        # Store a plain Python float so the checkpoint holds no NumPy scalars
        avg_loss = float(np.mean(epoch_losses))
        training_losses.append(avg_loss)
        
        print(f"Epoch {epoch+1}/{num_epochs}, Average Loss: {avg_loss:.6f}")
    
    print(f"\nFine-tuning complete!")
    
    if len(training_losses) > 0:
        print(f"Final Loss: {training_losses[-1]:.6f}")
        print(f"Saving fine-tuned model to: {save_path}")
        try:
            # Save the state dict of the internal model if possible
            if hasattr(model, 'model'):
                model_state = model.model.state_dict()
            elif hasattr(model, '_model'):
                model_state = model._model.state_dict()
            else:
                model_state = {}
                
            torch.save({
                'model_state_dict': model_state,
                'optimizer_state_dict': optimizer.state_dict() if optimizer else None,
                'training_losses': training_losses,
                'config': {
                    'context_length': context_length,
                    'horizon': horizon,
                    'num_epochs': num_epochs,
                    'learning_rate': learning_rate
                }
            }, save_path)
            print("Model saved successfully!\n")
        except Exception as e:
            print(f"Warning: Could not save model: {e}\n")
        
        # Plot training loss
        plt.figure(figsize=(10, 5))
        plt.plot(range(1, num_epochs + 1), training_losses, marker='o', linewidth=2)
        plt.title('Training Loss Over Epochs', fontsize=14, fontweight='bold')
        plt.xlabel('Epoch', fontsize=12)
        plt.ylabel('MSE Loss', fontsize=12)
        plt.grid(True, linestyle='--', alpha=0.6)
        plt.tight_layout()
        plt.show()
    else:
        print("Fine-tuning was not performed due to model limitations.")
        print("Using pre-trained model for inference.\n")
    
    return model, training_losses


def load_finetuned_model(model, checkpoint_path):
    print(f"Loading fine-tuned model from: {checkpoint_path}")
    try:
        checkpoint = torch.load(checkpoint_path)
        
        # Try to load state dict into internal model
        if 'model_state_dict' in checkpoint and checkpoint['model_state_dict']:
            if hasattr(model, 'model'):
                model.model.load_state_dict(checkpoint['model_state_dict'])
            elif hasattr(model, '_model'):
                model._model.load_state_dict(checkpoint['model_state_dict'])
            else:
                print("Warning: Could not access internal model structure")
        
        print("Model loaded successfully!")
        
        if 'config' in checkpoint:
            print("\nModel configuration:")
            for key, value in checkpoint['config'].items():
                print(f"  {key}: {value}")
    except Exception as e:
        print(f"Warning: Could not load fine-tuned model: {e}")
        print("Using pre-trained base model instead.")
    
    return model


def generate_forecast(historical_data, horizon=90, start_date=None, target_column="Bike Rentals", 
                     use_finetuned=False, finetuned_path=None):
    global model
    
    # Load fine-tuned model if requested
    if use_finetuned and finetuned_path:
        if Path(finetuned_path).exists():
            model = load_finetuned_model(model, finetuned_path)
        else:
            print(f"Warning: Fine-tuned model not found at {finetuned_path}. Using base model.")
    
    # Validate input
    if len(historical_data) < 30:
        raise ValueError("Historical data must contain at least 30 data points")
    
    if horizon > 256 or horizon < 1:
        raise ValueError("Horizon must be between 1 and 256 days")
    
    # Convert to numpy array
    historical_array = np.array(historical_data, dtype=np.float32)
    
    print(f"Generating {horizon}-day forecast...")
    
    # Generate forecast (TimesFM handles inference mode internally)
    point_forecast, quantile_forecast = model.forecast(
        horizon=horizon,
        inputs=[historical_array],
    )
    
    # Extract forecasts
    forecast_values = point_forecast[0]
    quantiles = quantile_forecast[0]  # Shape: [horizon, 10]: mean, then the 10th to 90th percentiles
    
    # Generate dates
    if start_date:
        if isinstance(start_date, str):
            start = datetime.strptime(start_date, "%Y-%m-%d")
        else:
            start = start_date
    else:
        start = datetime.now() - timedelta(days=len(historical_data))
    
    historical_dates = [start + timedelta(days=i) for i in range(len(historical_data))]
    forecast_start = start + timedelta(days=len(historical_data))
    forecast_dates = [forecast_start + timedelta(days=i) for i in range(horizon)]
    
    # Calculate summary statistics
    historical_mean = np.mean(historical_data)
    historical_std = np.std(historical_data)
    forecast_mean = np.mean(forecast_values)
    forecast_std = np.std(forecast_values)
    
    # Print summary
    print("\n" + "="*70)
    print(f"FORECAST SUMMARY: {target_column}")
    print("="*70)
    
    print("\nHistorical Data Statistics:")
    print(f"  Period: {historical_dates[0].strftime('%Y-%m-%d')} to {historical_dates[-1].strftime('%Y-%m-%d')}")
    print(f"  Count: {len(historical_data)} days")
    print(f"  Mean: {historical_mean:.2f}")
    print(f"  Std Dev: {historical_std:.2f}")
    print(f"  Min: {np.min(historical_data):.2f}")
    print(f"  Max: {np.max(historical_data):.2f}")
    
    print("\nForecast Statistics:")
    print(f"  Period: {forecast_dates[0].strftime('%Y-%m-%d')} to {forecast_dates[-1].strftime('%Y-%m-%d')}")
    print(f"  Count: {horizon} days")
    print(f"  Mean: {forecast_mean:.2f}")
    print(f"  Std Dev: {forecast_std:.2f}")
    print(f"  Min: {np.min(forecast_values):.2f}")
    print(f"  Max: {np.max(forecast_values):.2f}")
    
    print("\nTrend Analysis:")
    change_percent = ((forecast_mean - historical_mean) / historical_mean) * 100
    trend_direction = "increasing" if change_percent > 0 else "decreasing"
    print(f"  Direction: {trend_direction.upper()}")
    print(f"  Change: {change_percent:+.2f}%")
    
    print("\nConfidence Intervals (Average):")
    print(f"  80% Interval: [{np.mean(quantiles[:, 1]):.2f}, {np.mean(quantiles[:, 9]):.2f}]")
    print(f"  40% Interval (30th-70th percentile): [{np.mean(quantiles[:, 3]):.2f}, {np.mean(quantiles[:, 7]):.2f}]")
    
    print("\nFirst 10 Forecast Values:")
    for i in range(min(10, len(forecast_values))):
        date_str = forecast_dates[i].strftime('%Y-%m-%d')
        val = forecast_values[i]
        q10 = quantiles[i, 1]
        q90 = quantiles[i, 9]
        print(f"  {date_str}: {val:.2f} (80% CI: [{q10:.2f}, {q90:.2f}])")
    
    if len(forecast_values) > 10:
        print(f"  ... and {len(forecast_values) - 10} more days")
    
    print("="*70 + "\n")
    
    # Generate plot
    plt.figure(figsize=(15, 7))
    plt.plot(historical_dates, historical_data, 
             label="Historical Data", color="black", linewidth=2, marker='o', 
             markersize=3, markevery=max(1, len(historical_data)//50))
    plt.plot(forecast_dates, forecast_values, 
             label="Forecast", color="red", linewidth=2, marker='s',
             markersize=3, markevery=max(1, horizon//50))
    
    # Add confidence intervals
    plt.fill_between(
        forecast_dates,
        quantiles[:, 1],  # 10th percentile
        quantiles[:, 9],  # 90th percentile
        alpha=0.2,
        color='red',
        label='80% Prediction Interval'
    )
    
    plt.fill_between(
        forecast_dates,
        quantiles[:, 3],  # 30th percentile
        quantiles[:, 7],  # 70th percentile
        alpha=0.3,
        color='red',
        label='40% Prediction Interval'
    )
    
    plt.legend(loc='best', fontsize=11)
    model_type = "Fine-tuned" if use_finetuned else "Base"
    plt.title(f"TimesFM Forecast ({model_type}): {target_column}", fontsize=16, fontweight='bold')
    plt.xlabel("Date", fontsize=13)
    plt.ylabel(target_column, fontsize=13)
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    
    # Return detailed results
    return {
        "forecast_values": forecast_values.tolist(),
        "forecast_dates": [d.strftime('%Y-%m-%d') for d in forecast_dates],
        "quantiles": {
            "q10": quantiles[:, 1].tolist(),
            "q30": quantiles[:, 3].tolist(),
            "q50": quantiles[:, 5].tolist(),
            "q70": quantiles[:, 7].tolist(),
            "q90": quantiles[:, 9].tolist()
        },
        "historical_mean": float(historical_mean),
        "forecast_mean": float(forecast_mean),
        "trend": trend_direction,
        "change_percent": float(change_percent)
    }


if __name__ == "__main__":
    # Create synthetic data with trend and seasonality (2 years of data)
    np.random.seed(42)
    days = 730
    trend = np.linspace(150, 350, days)
    seasonality = 50 * np.sin(np.linspace(0, 8*np.pi, days))
    noise = np.random.normal(0, 20, days)
    historical_data = trend + seasonality + noise
    historical_data = np.maximum(historical_data, 0)  # Ensure non-negative
    
    print("OPTION 1: Use base pre-trained model (skip fine-tuning)")
    print("OPTION 2: Fine-tune model on your data first")
    print("\nChoosing OPTION 1 for this example...\n")
    
    # OPTION 1: Direct forecasting with base model
    print("Generating forecast with BASE model...")
    results_base = generate_forecast(
        historical_data=historical_data[-365:].tolist(),  # Use last year
        horizon=90,
        start_date="2023-01-01",
        target_column="Bike Rentals"
    )
    
    # OPTION 2: Fine-tune model then forecast
    print("\n" + "="*70)
    print("FINE-TUNING MODEL ON TRAINING DATA")
    print("="*70 + "\n")
    
    # Fine-tune on first 18 months of data
    finetuned_model, losses = finetune_model(
        model=model,
        train_data=historical_data[:540].tolist(),
        num_epochs=5,
        learning_rate=1e-4,
        context_length=180,
        horizon=30,
        batch_size=4,
        save_path="finetuned_timesfm.pt"
    )
    
    # Generate forecast with fine-tuned model
    print("\nGenerating forecast with FINE-TUNED model...")
    results_finetuned = generate_forecast(
        historical_data=historical_data[-365:].tolist(),
        horizon=90,
        start_date="2023-01-01",
        target_column="Bike Rentals",
        use_finetuned=True,
        finetuned_path="finetuned_timesfm.pt"
    )
    
    print("\n" + "="*70)
    print("FORECAST COMPLETE!")
    print("="*70)

Code Summary

Model Initialization:
Loads the pre-trained TimesFM model (timesfm-2.5-200m-pytorch) using PyTorch.
Configures forecasting parameters like max_context, max_horizon, normalization, quantile settings, and invariance options.
Custom Dataset Creation:
Defines TimeSeriesDataset class to create sliding windows of context and target sequences from historical data.
Converts data into PyTorch tensors for training.
Data Preparation:
prepare_finetuning_data() generates a DataLoader for batching historical data for fine-tuning.
Fine-Tuning Function (finetune_model):
Prepares training batches and sets up optimizer (Adam) and loss function (MSELoss).
Attempts to access and unfreeze internal model parameters for gradient updates.
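Runs multiple epochs and records the average prediction error (MSE) per epoch. As written, the loss is computed on the NumPy outputs of model.forecast(), so it is not connected to the model's parameters and the weights are not actually updated; the loop documents the intended workflow.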
Saves fine-tuned model state and plots training loss over epochs.
Handles exceptions if the model does not support gradient-based fine-tuning (fallback to inference mode).
Loading Fine-Tuned Model:
load_finetuned_model() loads saved model checkpoints and updates internal model weights for inference.
Forecast Generation (generate_forecast):
Generates point forecasts and quantile-based confidence intervals for a specified horizon.
Computes summary statistics (mean, std, min, max) and trend direction.
Produces a plot showing historical data, forecasts, and confidence intervals.
Supports using either the base pre-trained model or fine-tuned model.
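
The sliding-window step performed by TimeSeriesDataset can be sketched with NumPy alone. The function name sliding_windows is hypothetical; it mirrors the indexing used in the class above and is meant only to show how many (context, target) pairs a series produces.

```python
import numpy as np

def sliding_windows(series, context_length, horizon):
    """Split a 1-D series into overlapping (context, target) pairs.

    Window i uses series[i : i + context_length] as context and the next
    `horizon` values as target, the same indexing as TimeSeriesDataset.
    """
    series = np.asarray(series, dtype=np.float32)
    n_samples = len(series) - context_length - horizon + 1
    if n_samples <= 0:
        raise ValueError(
            f"Need at least {context_length + horizon} points, got {len(series)}"
        )
    contexts = np.stack([series[i:i + context_length] for i in range(n_samples)])
    targets = np.stack(
        [series[i + context_length:i + context_length + horizon] for i in range(n_samples)]
    )
    return contexts, targets

# Toy series 0..9 with a context of 4 and a horizon of 2 gives 5 windows.
contexts, targets = sliding_windows(np.arange(10), context_length=4, horizon=2)
print(contexts.shape, targets.shape)
print(contexts[0], targets[0])
```

With the settings used in the script (540 training points, context_length=180, horizon=30), this yields 331 windows, which a batch size of 4 groups into 83 batches.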

Executing the code
To visualize trends and obtain forecasted results using the fine-tuned model, run the code using your Python environment.

python bike_rental_forecast_finetuning.py

Hands-on 4: Model Comparison

This section compares the performance of TimesFM with other approaches, including statistical models (e.g., AutoARIMA and AutoETS), machine learning models (e.g., Random Forest, XGBoost, and LightGBM), and foundation models such as TimeGPT.

The dataset used for this analysis is sourced from Kaggle: Monthly Gold Prices (1979–2021), which provides historic gold prices across 18 different countries.

Setting up the environment

For this hands-on guide, we will be using Google Colab with a T4 GPU.

Open Google Colaboratory and sign in with your Google account.
Create a new notebook by clicking on + New Notebook.
Navigate to Runtime → Change runtime type.

Set Hardware Accelerator to GPU.
Choose T4 GPU (recommended for this tutorial).
Click Save.

Reading the data

Let's begin by reading the dataset from the CSV file containing monthly gold price records.

import pandas as pd
df = pd.read_csv("/content/1979-2021.csv")
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date').resample('MS').mean()
df = df.reset_index() # Reset index to have 'Date' as a column again
print(df.head())

Visualizing the dataset

Visualize the dataset using seaborn.

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
plt.figure(figsize=(10, 6))
sns.lineplot(x="Date", y='India(INR)', data=df, color='green')
plt.title('Monthly Gold Prices Over Time')
plt.xlabel('Date')
plt.ylabel('Gold Price in INR')
plt.show()

Time-series decomposition visualization — trend, seasonality and residuals

The following code performs a time-series decomposition on India’s monthly gold prices and plots the trend (overall direction), seasonality (repeating pattern), and residuals (random noise), which helps reveal the underlying patterns in gold price movements over time.

from statsmodels.tsa.seasonal import seasonal_decompose

df.set_index("Date", inplace=True)
result = seasonal_decompose(df['India(INR)'])
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(10, 12))
result.observed.plot(ax=ax1, color='green')
ax1.set_ylabel('Observed')
result.trend.plot(ax=ax2, color='green')
ax2.set_ylabel('Trend')
result.seasonal.plot(ax=ax3, color='green')
ax3.set_ylabel('Seasonal')
result.resid.plot(ax=ax4, color='green')
ax4.set_ylabel('Residual')

plt.tight_layout()
plt.show()
df.reset_index(inplace=True)
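
To make the three components concrete, here is a minimal NumPy sketch of a classical additive decomposition with a centered moving average. It is a simplified illustration of the idea, not the statsmodels implementation; the function name additive_decompose and the toy series are assumptions made for this example.

```python
import numpy as np

def additive_decompose(y, period=12):
    """Classical additive decomposition: y = trend + seasonal + resid."""
    y = np.asarray(y, dtype=float)
    # Centered moving average; for an even period the two end weights are halved.
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    trend = np.full(len(y), np.nan)  # edges stay NaN, as there is no full window
    trend[half:len(y) - half] = np.convolve(y, weights, mode="valid")
    detrended = y - trend
    # Average the detrended values at each position in the cycle, then center them.
    seasonal_means = np.array(
        [np.nanmean(detrended[i::period]) for i in range(period)]
    )
    seasonal_means -= seasonal_means.mean()
    seasonal = np.tile(seasonal_means, len(y) // period + 1)[:len(y)]
    resid = y - trend - seasonal
    return trend, seasonal, resid

# Toy monthly series: a linear trend plus a fixed yearly pattern, no noise.
t = np.arange(48)
pattern = 10 * np.sin(2 * np.pi * np.arange(12) / 12)
y = 2.0 * t + pattern[t % 12]
trend, seasonal, resid = additive_decompose(y, period=12)
print(np.round(trend[6:10], 2))
print(np.round(seasonal[:4], 2))
```

On this noise-free series the recovered trend matches the straight line and the residuals are zero wherever the moving average is defined, which is a quick way to check the logic before applying it to real prices.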

Arranging the data in the format required by the models

df = pd.DataFrame({'unique_id':[1]*len(df),'ds': df["Date"], "y":df['India(INR)']})
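# Use an unambiguous ISO date for the cutoff and copy the slices so later
# column assignments do not trigger pandas chained-assignment warnings
train_df = df[df['ds'] <= '2019-07-31'].copy()
test_df = df[df['ds'] > '2019-07-31'].copy()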
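
Before fitting any model, it helps to confirm that the hold-out period has the length you plan to forecast. The sketch below uses NumPy dates and placeholder values; the helper name split_by_cutoff is hypothetical, and the monthly range (January 1979 to July 2021) is an assumption that should be checked against the actual CSV.

```python
import numpy as np

def split_by_cutoff(dates, values, cutoff):
    """Return (train_values, test_values); dates <= cutoff go to train."""
    dates = np.asarray(dates, dtype="datetime64[M]")
    values = np.asarray(values, dtype=float)
    train_mask = dates <= np.datetime64(cutoff, "M")
    return values[train_mask], values[~train_mask]

# Assumed monthly range (January 1979 to July 2021); verify against your data.
dates = np.arange("1979-01", "2021-08", dtype="datetime64[M]")
values = np.arange(len(dates), dtype=float)  # placeholder values, not real prices

train_values, test_values = split_by_cutoff(dates, values, "2019-07")
print(len(train_values), len(test_values))
```

Under that assumed range the test set holds 24 months, which matches the 24-step horizon used for the forecasts in the following sections.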

1. Statistical Modeling

StatsForecast is an open-source library designed for fast and scalable statistical time-series forecasting. It provides a wide range of classical models such as ARIMA, ETS (Exponential Smoothing), Theta, and Seasonal Naive, with performance-critical routines compiled using Numba.

import pandas as pd
from statsforecast import StatsForecast
from statsforecast.models import AutoARIMA, AutoETS

# Define the AutoARIMA model
autoarima = AutoARIMA(season_length=12)
# Define the AutoETS model
autoets = AutoETS(season_length=12)

# Create StatsForecast object with AutoARIMA and AutoETS
statforecast = StatsForecast(
    models=[autoarima, autoets],
    freq='MS',
    n_jobs=-1)
statforecast.fit(train_df)

# Generate forecasts for 24 periods ahead
sf_forecast = statforecast.forecast(df=train_df, h=24, fitted=True)
sf_forecast = sf_forecast.reset_index()
print("StatsForecast:", sf_forecast)
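
As a point of reference for the statistical models, the Seasonal Naive method mentioned above is simple enough to write by hand: each forecast repeats the value observed one season earlier. This is a minimal NumPy sketch with a hypothetical function name, not the StatsForecast implementation.

```python
import numpy as np

def seasonal_naive(history, horizon, season_length=12):
    """Forecast by repeating the last observed season."""
    history = np.asarray(history, dtype=float)
    if len(history) < season_length:
        raise ValueError("Need at least one full season of history")
    last_season = history[-season_length:]
    repeats = -(-horizon // season_length)  # ceiling division
    return np.tile(last_season, repeats)[:horizon]

# Two years of toy monthly data: 1..24. The last season is 13..24.
history = np.arange(1, 25)
forecast = seasonal_naive(history, horizon=14, season_length=12)
print(forecast)
```

A baseline like this is useful when reading the comparison later: a model that cannot beat it on the test set is adding little value.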

2. MLForecast

MLForecast is a machine learning-based framework for time-series forecasting that leverages models such as Random Forest, XGBoost, and LightGBM.

from mlforecast import MLForecast
from mlforecast.target_transforms import AutoDifferences
from mlforecast.lag_transforms import (
    RollingMean, RollingStd, RollingMin, RollingMax, RollingQuantile,
    SeasonalRollingMean, SeasonalRollingStd, SeasonalRollingMin,
    SeasonalRollingMax, SeasonalRollingQuantile,
    ExpandingMean
)
import lightgbm as lgb
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor

models = [lgb.LGBMRegressor(verbosity=-1),
 xgb.XGBRegressor(),
 RandomForestRegressor(random_state=0),
]
fcst = MLForecast(
    models=models, # List of models to be used for forecasting
    freq='MS', # Monthly frequency, starting at the beginning of each month
    lags=[1,3,5,7,12], # Lag features: values from 1, 3, 5, 7, and 12 time steps ago
    lag_transforms={
        1: [ # Transformations applied to lag 1
            RollingMean(window_size=3),
            RollingStd(window_size=3),
            RollingMin(window_size=3),
            RollingMax(window_size=3),
            RollingQuantile(p=0.5, window_size=3),
            ExpandingMean()
        ],
        6:[ # Transformations applied to lag 6
            RollingMean(window_size=6),
            RollingStd(window_size=6),
            RollingMin(window_size=6),
            RollingMax(window_size=6),
            RollingQuantile(p=0.5, window_size=6),
        ],
        12: [ # Transformations applied to lag 12 (likely for yearly seasonality)
            SeasonalRollingMean(season_length=12, window_size=3),
            SeasonalRollingStd(season_length=12, window_size=3),
            SeasonalRollingMin(season_length=12, window_size=3),
            SeasonalRollingMax(season_length=12, window_size=3),
            SeasonalRollingQuantile(p=0.5, season_length=12, window_size=3)
        ]
    },
    date_features=['year', 'month', 'quarter'], # Extract year, month, and quarter from the date as features
    target_transforms=[AutoDifferences(max_diffs=3)]
)

fcst.fit(train_df)
ml_forecast = fcst.predict(len(test_df))
print("MLForecast:", ml_forecast)
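
The lags and lag_transforms arguments turn the series into tabular features. The NumPy sketch below shows what a lag-1 feature and a rolling mean of window 3 applied to that lag look like; the helper names lagged and rolling_mean are hypothetical and only illustrate the idea, not MLForecast's internal code.

```python
import numpy as np

def lagged(y, k):
    """Shift the series forward by k >= 1 steps; the first k entries are NaN."""
    y = np.asarray(y, dtype=float)
    out = np.full(len(y), np.nan)
    out[k:] = y[:-k]
    return out

def rolling_mean(x, window):
    """Trailing mean over `window` values; NaN until a full window of data exists."""
    x = np.asarray(x, dtype=float)
    out = np.full(len(x), np.nan)
    for t in range(window - 1, len(x)):
        out[t] = x[t - window + 1:t + 1].mean()  # NaN inputs propagate
    return out

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
lag1 = lagged(y, 1)                # value one step earlier
lag1_roll3 = rolling_mean(lag1, 3)  # mean of the three values before time t
print(lag1)
print(lag1_roll3)
```

Because every feature at time t is built only from earlier values, the model never sees the target it is asked to predict, which is what keeps this kind of feature engineering free of leakage.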

3. TimeGPT

TimeGPT is a large foundation model for time-series forecasting developed by Nixtla, inspired by the architecture and scaling principles of large language models. Unlike traditional forecasting methods that require dataset-specific training, TimeGPT can perform zero-shot forecasting, generating predictions on unseen time-series data without any fine-tuning.

import os
from nixtla import NixtlaClient

# Read the API key from an environment variable instead of hard-coding it in the notebook
nixtla_client = NixtlaClient(api_key=os.environ["NIXTLA_API_KEY"])
timegpt_forecast = nixtla_client.forecast(df=train_df, h=24, freq='MS')
print("TimeGPT: ", timegpt_forecast)

4. TimesFM

Now let's use TimesFM to generate zero-shot forecasts for the same dataset.

import torch
import numpy as np
import pandas as pd
import timesfm

torch.set_float32_matmul_precision("high")
model = timesfm.TimesFM_2p5_200M_torch.from_pretrained("google/timesfm-2.5-200m-pytorch")

model.compile(
    timesfm.ForecastConfig(
        max_context=1024,                # Maximum context length
        max_horizon=256,                 # Maximum forecast horizon
        normalize_inputs=True,           # Normalize time series before forecasting
        use_continuous_quantile_head=True,
        force_flip_invariance=True,
        infer_is_positive=True,
        fix_quantile_crossing=True,
    )
)
H = 24  # Forecast horizon

series_list = []
for _, g in train_df.groupby("unique_id"):
    series_list.append(g["y"].values.astype(np.float32))

point_forecast, quantile_forecast = model.forecast(
    horizon=H,
    inputs=series_list,
)

forecasts = []
for (uid, group), preds in zip(train_df.groupby("unique_id"), point_forecast):
    # Get last date and extend future timestamps
    last_date = group["ds"].iloc[-1]
    future_dates = pd.date_range(start=last_date, periods=H + 1, freq="MS")[1:]
    df_pred = pd.DataFrame({
        "unique_id": uid,
        "ds": future_dates,
        "timesfm": preds
    })
    forecasts.append(df_pred)

timesfm_forecast = pd.concat(forecasts, ignore_index=True)
print("TimesFM: ", timesfm_forecast)

Convert ‘ds’ to datetime in all DataFrames if necessary

# Assuming the DataFrames have a common column 'ds' for the dates
sf_forecast['ds'] = pd.to_datetime(sf_forecast['ds'])
ml_forecast['ds'] = pd.to_datetime(ml_forecast['ds'])
timegpt_forecast['ds'] = pd.to_datetime(timegpt_forecast['ds'])
timesfm_forecast['ds'] = pd.to_datetime(timesfm_forecast['ds'])
test_df['ds'] = pd.to_datetime(test_df['ds'])

# Print shapes to debug
print("sf_forecast shape:", sf_forecast.shape)
print("ml_forecast shape:", ml_forecast.shape)
print("timegpt_forecast shape:", timegpt_forecast.shape)
print("timesfm_forecast shape:", timesfm_forecast.shape)
print("test_df shape:", test_df.shape)

# Check the first few dates
print("\nFirst dates:")
print("sf_forecast:", sf_forecast['ds'].head(3).tolist())
print("ml_forecast:", ml_forecast['ds'].head(3).tolist())
print("test_df:", test_df['ds'].head(3).tolist())

Perform the merges

Start with test_df to ensure we keep all test dates.

merged_fcst = test_df[['ds', 'y', 'unique_id']].copy()
merged_fcst = pd.merge(merged_fcst, sf_forecast[['ds', 'AutoARIMA', 'AutoETS']], on='ds', how='left')
merged_fcst = pd.merge(merged_fcst, ml_forecast[['ds', 'LGBMRegressor', 'XGBRegressor', 'RandomForestRegressor']], on='ds', how='left')
merged_fcst = pd.merge(merged_fcst, timegpt_forecast[['ds', 'TimeGPT']], on='ds', how='left')
merged_fcst = pd.merge(merged_fcst, timesfm_forecast[['ds', 'timesfm']], on='ds', how='left')

print("\nMerged forecast shape:", merged_fcst.shape)
print("\nMerged forecast columns:", merged_fcst.columns.tolist())
print("\nFirst few rows of merged_fcst:")
print(merged_fcst.head())
print("\nNull counts:")
print(merged_fcst.isnull().sum())
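
The left merges above join on ds alone, which assumes every forecast frame holds exactly one row per date. A minimal sketch of how that assumption could be checked is shown below, using the validate parameter of pd.merge and joining on both unique_id and ds. The toy frames and the model_a column are made up for illustration; whether your forecast frames expose unique_id as a column depends on the library version.

```python
import pandas as pd

dates = pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"])

# Toy stand-ins for test_df and one forecast frame (hypothetical values)
test = pd.DataFrame({"unique_id": ["gold"] * 3, "ds": dates, "y": [100.0, 102.0, 101.0]})
forecast = pd.DataFrame({"unique_id": ["gold"] * 3, "ds": dates, "model_a": [99.0, 103.0, 100.5]})

# validate="one_to_one" raises if either side has duplicate keys
merged = pd.merge(test, forecast, on=["unique_id", "ds"], how="left", validate="one_to_one")
print(merged)

# A duplicated forecast row would otherwise silently duplicate test rows
duplicated = pd.concat([forecast, forecast.iloc[[0]]], ignore_index=True)
rejected = False
try:
    pd.merge(test, duplicated, on=["unique_id", "ds"], how="left", validate="one_to_one")
except pd.errors.MergeError as exc:
    rejected = True
    print("Merge rejected:", exc)
```

Failing early here is cheaper than discovering later that the error metrics were computed over duplicated rows.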

Model Comparison

Finally, we can calculate and compare error metrics for the forecasting models (AutoARIMA, AutoETS, LGBMRegressor, XGBRegressor, RandomForestRegressor, TimeGPT, and timesfm) on the test set. The code below first defines a function, calculate_error_metrics, that computes MAE, RMSE, and MAPE between actual and predicted values, then applies it to each model column in merged_fcst.

import numpy as np

def calculate_error_metrics(actual_values, predicted_values):
    actual_values = np.array(actual_values)
    predicted_values = np.array(predicted_values)
    
    # Remove any NaN values
    mask = ~(np.isnan(actual_values) | np.isnan(predicted_values))
    actual_values = actual_values[mask]
    predicted_values = predicted_values[mask]
    
    if len(actual_values) == 0:
        print("Warning: No valid data points after removing NaNs")
        return pd.DataFrame({'Metric': ['MAE', 'RMSE', 'MAPE'], 'Value': [np.nan, np.nan, np.nan]})
    
    metrics_dict = {
        'MAE': np.mean(np.abs(actual_values - predicted_values)),
        'RMSE': np.sqrt(np.mean((actual_values - predicted_values)**2)),
        'MAPE': np.mean(np.abs((actual_values - predicted_values) / actual_values)) * 100
    }
    
    result_df = pd.DataFrame(list(metrics_dict.items()), columns=['Metric', 'Value'])
    return result_df

# Use actual gold prices from merged dataframe
actuals = merged_fcst['y'].values
error_metrics_dict = {}

# Model columns to evaluate
model_columns = ['AutoARIMA', 'AutoETS', 'LGBMRegressor', 'XGBRegressor', 'RandomForestRegressor', 'TimeGPT', 'timesfm']

for col in model_columns:
    if col in merged_fcst.columns:
        print(f"\nEvaluating {col}...")
        predicted_values = merged_fcst[col].values
        print(f"  Actuals shape: {actuals.shape}, Predictions shape: {predicted_values.shape}")
        print(f"  Non-null predictions: {(~np.isnan(predicted_values)).sum()}")
        error_metrics_dict[col] = calculate_error_metrics(actuals, predicted_values)['Value'].values
    else:
        print(f"\nWarning: {col} not found in merged_fcst")

error_metrics_df = pd.DataFrame(error_metrics_dict)
error_metrics_df.insert(0, 'Metric', ['MAE', 'RMSE', 'MAPE'])

print("\n" + "="*80)
print("FINAL ERROR METRICS:")
print("="*80)
print(error_metrics_df)
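
As a quick sanity check on the three metrics, the sketch below recomputes them on a tiny hand-checkable example. It mirrors the NaN masking used in calculate_error_metrics but returns a plain dictionary; the function name error_metrics and the sample numbers are made up for illustration.

```python
import numpy as np

def error_metrics(actual, predicted):
    """Return MAE, RMSE, and MAPE (in percent), skipping pairs that contain NaN."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mask = ~(np.isnan(actual) | np.isnan(predicted))
    actual, predicted = actual[mask], predicted[mask]
    errors = actual - predicted
    return {
        "MAE": float(np.mean(np.abs(errors))),
        "RMSE": float(np.sqrt(np.mean(errors ** 2))),
        "MAPE": float(np.mean(np.abs(errors / actual)) * 100),
    }

# The third pair is dropped because its prediction is missing,
# as can happen after a left merge
metrics = error_metrics([100, 200, 300], [110, 180, np.nan])
print(metrics)
# Errors are -10 and +20, so MAE = 15, RMSE = sqrt(250) ≈ 15.81, MAPE ≈ 10%
```

Note that MAPE divides by the actual values, so it is undefined when an actual value is zero.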

Executing the code step by step prints the final error metrics table, with one column per model and one row each for MAE, RMSE, and MAPE.

The complete code for the model comparison is available at the link below.

Google Colab

Building NextGen Forecasting app using TimesFM

In the previous section, various forecasting models were explored, including zero-shot prompting and fine-tuning of the TimesFM model to generate forecasts. In this section, a Streamlit-based application is developed for time series forecasting. The app allows users to upload CSV files, configure forecast settings, generate forecasts, visualize results, and interactively query the forecasts. It supports multiple aggregation methods, customizable forecast horizons, and provides confidence intervals along with a Q&A interface to interpret forecast insights.

Installing the dependencies

Initialize a uv project by executing the following command.

uv init timesfm_forecasting_tool
cd timesfm_forecasting_tool

Create and activate a virtual environment by executing the following command.

uv venv
source .venv/bin/activate # for linux
.venv\Scripts\activate    # for windows

Clone the GitHub repository and move to the folder using the following command.

git clone https://github.com/google-research/timesfm.git
cd timesfm

Install timesfm by executing the following command.

uv pip install -e .[torch]

Install streamlit and matplotlib using uv.

uv add streamlit matplotlib

Navigate back to the root directory, create a file named app.py, and add the following code to it.

import streamlit as st
import pandas as pd
import numpy as np
import torch
import timesfm
import matplotlib.pyplot as plt

# Page configuration
st.set_page_config(
    page_title="TimesFM Forecasting Tool",
    page_icon="📈",
    layout="wide"
)

# Initialize TimesFM model
@st.cache_resource
def load_model():
    torch.set_float32_matmul_precision("high")
    LOCAL_MODEL_PATH = "./timesfm_model"
    
    try:
        model = timesfm.TimesFM_2p5_200M_torch.from_pretrained(LOCAL_MODEL_PATH)
        st.success(f"✓ Model loaded from local path: {LOCAL_MODEL_PATH}")
    except Exception as e:
        st.warning(f"Failed to load from local path ({e}), trying HuggingFace...")
        model = timesfm.TimesFM_2p5_200M_torch.from_pretrained("google/timesfm-2.5-200m-pytorch")
        st.success("✓ Model loaded from HuggingFace")
    
    model.compile(
        timesfm.ForecastConfig(
            max_context=1024,
            max_horizon=256,
            normalize_inputs=True,
            use_continuous_quantile_head=True,
            force_flip_invariance=True,
            infer_is_positive=True,
            fix_quantile_crossing=True,
        )
    )
    return model

def load_and_prepare_data(df, target_column, date_column, aggregation, time_period):
    """Prepare time series data from dataframe"""
    # Make a copy to avoid modifying original
    df = df.copy()
    
    # Convert date column to datetime with multiple format attempts
    date_formats = [
        '%d-%m-%Y',  # 01-01-1985
        '%m-%d-%Y',  # 01-01-1985 (alternative interpretation)
        '%Y-%m-%d',  # 1985-01-01
        '%m/%d/%Y %H:%M',  # 2/24/2003 0:00
        '%d/%m/%Y %H:%M',  # 24/2/2003 0:00
        '%Y-%m-%d %H:%M:%S',
        '%Y-%m-%d %H:%M',
        '%d-%m-%Y %H:%M:%S',
        '%d-%m-%Y %H:%M',
        '%m/%d/%Y',
        '%d/%m/%Y',
        '%Y/%m/%d',
        '%d.%m.%Y',  # 01.01.1985
        '%Y.%m.%d',  # 1985.01.01
    ]
    
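    parsed_successfully = False
    successful_format = None
    best_count = 0
    
    # Keep the format that parses the most rows; stopping at the first partial
    # match would silently turn every other date into NaT
    for fmt in date_formats:
        try:
            test_parse = pd.to_datetime(df[date_column], format=fmt, errors='coerce')
        except (ValueError, TypeError):
            continue
        parsed_count = int(test_parse.notna().sum())
        if parsed_count > best_count:
            best_count = parsed_count
            best_parse = test_parse
            successful_format = fmt
    
    if best_count > 0:
        df[date_column] = best_parse
        parsed_successfully = True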
    
    # If no format worked, try automatic parsing
    if not parsed_successfully:
        try:
            df[date_column] = pd.to_datetime(df[date_column], errors='coerce')
            if df[date_column].notna().sum() > 0:
                parsed_successfully = True
                successful_format = "auto-detected"
        except:
            pass
    
    if not parsed_successfully:
        # Show first few non-null values to help debug
        sample_values = df[date_column].dropna().head(5).tolist()
        raise ValueError(f"Could not parse dates in column '{date_column}'. Sample values: {sample_values}")
    
    # Remove rows with invalid dates or missing target values
    initial_count = len(df)
    df = df.dropna(subset=[date_column, target_column])
    removed_count = initial_count - len(df)
    
    if len(df) == 0:
        raise ValueError(f"No valid data after removing {removed_count} rows with invalid dates or missing target values.")
    
    # Sort by date
    df = df.sort_values(date_column)
    
    # Create period column based on time_period
    if time_period == "Day":
        df['period'] = df[date_column].dt.strftime('%d-%b-%Y')
        period_format = "day"
    elif time_period == "Week":
        df['period'] = df[date_column].dt.strftime('Week %U, %Y')
        period_format = "week"
    elif time_period == "Month":
        df['period'] = df[date_column].dt.strftime('%B %Y')
        period_format = "month"
    elif time_period == "Year":
        df['period'] = df[date_column].dt.year.astype(str)
        period_format = "year"
    else:
        df['period'] = df[date_column].dt.strftime('%B %Y')
        period_format = "month"
    
    # Group by period and aggregate. sort=False keeps the chronological order
    # from the sort above; by default groupby would sort the string labels
    # alphabetically (e.g. "April 2024" before "August 2023").
    grouped_target = df.groupby('period', sort=False)[target_column]
    if aggregation == "Mean":
        grouped = grouped_target.mean()
    elif aggregation == "Median":
        grouped = grouped_target.median()
    elif aggregation == "Count":
        grouped = grouped_target.count()
    else:
        # "Sum" and any unrecognized option fall back to the sum
        grouped = grouped_target.sum()
    
    periods = grouped.index.tolist()
    values = grouped.values.tolist()
    
    return periods, values, period_format

def generate_forecast_labels(last_period, horizon, period_format):
    """Generate forecast period labels"""
    forecast_labels = []
    
    if period_format == "month":
        try:
            last_date = pd.to_datetime(last_period, format='%B %Y')
            for i in range(1, horizon + 1):
                next_date = last_date + pd.DateOffset(months=i)
                forecast_labels.append(next_date.strftime('%B %Y'))
        except:
            forecast_labels = [f"Forecast {i+1}" for i in range(horizon)]
    elif period_format == "day":
        try:
            last_date = pd.to_datetime(last_period, format='%d-%b-%Y')
            for i in range(1, horizon + 1):
                next_date = last_date + pd.Timedelta(days=i)
                forecast_labels.append(next_date.strftime('%d-%b-%Y'))
        except:
            forecast_labels = [f"Forecast {i+1}" for i in range(horizon)]
    elif period_format == "year":
        try:
            last_year = int(last_period)
            forecast_labels = [str(last_year + i) for i in range(1, horizon + 1)]
        except:
            forecast_labels = [f"Forecast {i+1}" for i in range(horizon)]
    else:
        forecast_labels = [f"Forecast {i+1}" for i in range(horizon)]
    
    return forecast_labels

def generate_chat_response(query, forecast_data, summary):
    """Generate responses to user queries about the forecast"""
    query_lower = query.lower()
    
    if "trend" in query_lower or "direction" in query_lower:
        direction = summary['trend']['direction']
        change = summary['trend']['change_percent']
        return f"The forecast shows a **{direction}** trend with a {abs(change):.2f}% change compared to historical data."
    
    elif "highest" in query_lower or "maximum" in query_lower or "peak" in query_lower:
        max_val = summary['forecast_stats']['max']
        max_idx = forecast_data['point_forecast'].index(max_val)
        max_period = forecast_data['periods'][max_idx]
        return f"The highest forecasted value is **{max_val:,.2f}** in **{max_period}**."
    
    elif "lowest" in query_lower or "minimum" in query_lower:
        min_val = summary['forecast_stats']['min']
        min_idx = forecast_data['point_forecast'].index(min_val)
        min_period = forecast_data['periods'][min_idx]
        return f"The lowest forecasted value is **{min_val:,.2f}** in **{min_period}**."
    
    elif "average" in query_lower or "mean" in query_lower:
        avg = summary['forecast_stats']['mean']
        return f"The average forecasted value is **{avg:,.2f}**."
    
    elif "total" in query_lower or "sum" in query_lower:
        total = summary['forecast_stats']['total']
        return f"The total forecasted value across all periods is **{total:,.2f}**."
    
    elif "confidence" in query_lower or "interval" in query_lower:
        ci_80 = summary['confidence_intervals']['80_percent']
        return f"The 80% confidence interval ranges from **{ci_80['lower']:,.2f}** to **{ci_80['upper']:,.2f}**."
    
    elif "compare" in query_lower or "historical" in query_lower:
        hist_mean = summary['historical_stats']['mean']
        fore_mean = summary['forecast_stats']['mean']
        diff = fore_mean - hist_mean
        pct = ((fore_mean - hist_mean) / hist_mean * 100) if hist_mean != 0 else 0
        return f"Historical average: **{hist_mean:,.2f}**\nForecast average: **{fore_mean:,.2f}**\nDifference: **{diff:+,.2f}** ({pct:+.2f}%)"
    
    else:
        return "I can help you with questions about:\n- Trend and direction\n- Highest/lowest values\n- Average and totals\n- Confidence intervals\n- Comparing historical vs forecast data\n\nPlease ask a specific question!"

# Main UI
st.title("📈 TimesFM Forecasting Tool")
st.markdown("Upload your data and generate time series forecasts using Google's TimesFM model")

# Load model
with st.spinner("Loading TimesFM model..."):
    model = load_model()

# Sidebar for controls
st.sidebar.header("⚙️ Forecast Configuration")

# Data input method
input_method = st.sidebar.radio("Data Input Method", ["Upload CSV", "Paste Text Data"])

df = None

if input_method == "Upload CSV":
    uploaded_file = st.sidebar.file_uploader("Upload CSV File", type=['csv'])
    if uploaded_file:
        # Try different encodings
        encodings = ['utf-8', 'latin-1', 'iso-8859-1', 'cp1252', 'utf-16']
        df = None
        
        for encoding in encodings:
            try:
                uploaded_file.seek(0)  # Reset file pointer
                df = pd.read_csv(uploaded_file, encoding=encoding, on_bad_lines='skip')
                st.sidebar.success(f"✓ Loaded {len(df)} rows (encoding: {encoding})")
                break
            except Exception:
                continue
        
        if df is None:
            st.sidebar.error("❌ Could not read file. Please check the file encoding.")
else:
    text_data = st.sidebar.text_area("Paste CSV Data (with headers)", height=200)
    if text_data:
        from io import StringIO
        df = pd.read_csv(StringIO(text_data))
        st.sidebar.success(f"✓ Loaded {len(df)} rows")

# Main content area
if df is not None:
    # Show data preview
    with st.expander("📊 Data Preview", expanded=True):
        st.dataframe(df.head(10), use_container_width=True)
        st.info(f"Total rows: {len(df)} | Columns: {len(df.columns)}")
    
    # Configuration
    col1, col2 = st.sidebar.columns(2)
    
    # Get numeric and date columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    all_cols = df.columns.tolist()
    
    with col1:
        date_column = st.selectbox("Date Column", all_cols)
    
    with col2:
        target_column = st.selectbox("Target Column", numeric_cols)
    
    col3, col4 = st.sidebar.columns(2)
    
    with col3:
        time_period = st.selectbox("Time Period", ["Day", "Week", "Month", "Year"])
    
    with col4:
        aggregation = st.selectbox("Aggregation", ["Sum", "Mean", "Median", "Count"])
    
    horizon = st.sidebar.slider("Forecast Horizon", min_value=1, max_value=24, value=6)
    
    # Forecast button
    if st.sidebar.button("🚀 Generate Forecast", type="primary", use_container_width=True):
        with st.spinner("Generating forecast..."):
            try:
                # Prepare data
                periods, values, period_format = load_and_prepare_data(
                    df.copy(), target_column, date_column, aggregation, time_period
                )
                
                if len(values) < 3:
                    st.error(f"Not enough data points after aggregation. Got {len(values)}, need at least 3")
                else:
                    # Generate forecast
                    historical_array = np.array(values, dtype=np.float32)
                    point_forecast, quantile_forecast = model.forecast(
                        horizon=horizon,
                        inputs=[historical_array],
                    )
                    
                    forecast_values = point_forecast[0]
                    quantiles = quantile_forecast[0]
                    
                    # Generate labels
                    forecast_labels = generate_forecast_labels(periods[-1], horizon, period_format)
                    
                    # Convert to native types
                    forecast_values_list = [float(x) for x in forecast_values]
                    
                    # Store in session state
                    st.session_state['forecast_data'] = {
                        'periods': forecast_labels,
                        'point_forecast': forecast_values_list,
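                        # TimesFM 2.5 returns 10 columns per step: the mean, then the
                        # 10th to 90th percentiles in steps of 10. Columns 3 and 7 are
                        # therefore the 30th and 70th percentiles, used here as
                        # approximate quartiles.
                        'quantiles': {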
                            'q10': [float(x) for x in quantiles[:, 1]],
                            'q25': [float(x) for x in quantiles[:, 3]],
                            'q50': [float(x) for x in quantiles[:, 5]],
                            'q75': [float(x) for x in quantiles[:, 7]],
                            'q90': [float(x) for x in quantiles[:, 9]]
                        },
                        'horizon': horizon
                    }
                    
                    st.session_state['summary'] = {
                        'historical_stats': {
                            'count': len(values),
                            'mean': float(np.mean(values)),
                            'std': float(np.std(values)),
                            'min': float(np.min(values)),
                            'max': float(np.max(values)),
                            'total': float(np.sum(values))
                        },
                        'forecast_stats': {
                            'count': len(forecast_values),
                            'mean': float(np.mean(forecast_values)),
                            'std': float(np.std(forecast_values)),
                            'min': float(np.min(forecast_values)),
                            'max': float(np.max(forecast_values)),
                            'total': float(np.sum(forecast_values))
                        },
                        'trend': {
                            'direction': 'increasing' if np.mean(forecast_values) > np.mean(values) else 'decreasing',
                            'change_percent': ((np.mean(forecast_values) - np.mean(values)) / np.mean(values) * 100) if np.mean(values) != 0 else 0
                        },
                        'confidence_intervals': {
                            '80_percent': {
                                'lower': float(np.mean(quantiles[:, 1])),
                                'upper': float(np.mean(quantiles[:, 9]))
                            },
                            '50_percent': {
                                'lower': float(np.mean(quantiles[:, 3])),
                                'upper': float(np.mean(quantiles[:, 7]))
                            }
                        }
                    }
                    
                    st.session_state['periods'] = periods
                    st.session_state['values'] = values
                    st.session_state['forecast_labels'] = forecast_labels
                    st.session_state['quantiles'] = quantiles
                    st.session_state['target_column'] = target_column
                    
                    st.success("✓ Forecast generated successfully!")
                    st.rerun()
                    
            except Exception as e:
                st.error(f"Error generating forecast: {str(e)}")
    
    # Display results if forecast exists
    if 'forecast_data' in st.session_state:
        st.markdown("---")
        st.header("📊 Forecast Results")
        
        # Summary metrics
        col1, col2, col3, col4 = st.columns(4)
        
        with col1:
            st.metric(
                "Historical Mean",
                f"{st.session_state['summary']['historical_stats']['mean']:,.2f}"
            )
        
        with col2:
            st.metric(
                "Forecast Mean",
                f"{st.session_state['summary']['forecast_stats']['mean']:,.2f}",
                delta=f"{st.session_state['summary']['trend']['change_percent']:.2f}%"
            )
        
        with col3:
            st.metric(
                "Trend",
                st.session_state['summary']['trend']['direction'].title()
            )
        
        with col4:
            st.metric(
                "Forecast Periods",
                st.session_state['forecast_data']['horizon']
            )
        
        # Visualization
        st.subheader("📈 Forecast Visualization")
        
        fig, ax = plt.subplots(figsize=(14, 6))
        
        periods = st.session_state['periods']
        values = st.session_state['values']
        forecast_labels = st.session_state['forecast_labels']
        forecast_values = st.session_state['forecast_data']['point_forecast']
        quantiles = st.session_state['quantiles']
        
        # Create x-axis
        total_points = len(values) + len(forecast_values)
        all_x = list(range(total_points))
        hist_x = all_x[:len(values)]
        forecast_x = all_x[len(values)-1:]
        
        # Plot
        ax.plot(hist_x, values, label="Historical Data", color="#3498db", 
                linewidth=2, marker='o', markersize=4)
        
        forecast_with_connection = [values[-1]] + forecast_values
        ax.plot(forecast_x, forecast_with_connection, label="Forecast", 
                color="#2ecc71", linewidth=2, marker='s', markersize=4)
        
        # Confidence intervals
        quantiles_with_connection = np.vstack([[values[-1]] * quantiles.shape[1], quantiles])
        
        ax.fill_between(forecast_x, quantiles_with_connection[:, 1], 
                        quantiles_with_connection[:, 9], alpha=0.2, 
                        color='#2ecc71', label='80% Prediction Interval')
        
        ax.fill_between(forecast_x, quantiles_with_connection[:, 3], 
                        quantiles_with_connection[:, 7], alpha=0.3, 
                        color='#2ecc71', label='50% Prediction Interval')
        
        # Labels
        all_labels = periods + forecast_labels
        step = max(1, len(all_labels) // 12)
        tick_positions = list(range(0, len(all_labels), step))
        tick_labels = [all_labels[i] for i in tick_positions]
        
        ax.set_xticks(tick_positions)
        ax.set_xticklabels(tick_labels, rotation=45, ha='right')
        ax.legend(loc='best')
        ax.set_title(f"TimesFM Forecast: {st.session_state['target_column']}", 
                     fontsize=14, fontweight='bold')
        ax.set_xlabel("Period")
        ax.set_ylabel(st.session_state['target_column'])
        ax.grid(True, linestyle='--', alpha=0.6)
        
        plt.tight_layout()
        st.pyplot(fig)
        
        # Forecast table
        st.subheader("📋 Forecast Values")
        
        forecast_df = pd.DataFrame({
            'Period': st.session_state['forecast_data']['periods'],
            'Forecast': st.session_state['forecast_data']['point_forecast'],
            'Lower (10%)': st.session_state['forecast_data']['quantiles']['q10'],
            'Lower (25%)': st.session_state['forecast_data']['quantiles']['q25'],
            'Median': st.session_state['forecast_data']['quantiles']['q50'],
            'Upper (75%)': st.session_state['forecast_data']['quantiles']['q75'],
            'Upper (90%)': st.session_state['forecast_data']['quantiles']['q90']
        })
        
        st.dataframe(forecast_df.style.format({
            'Forecast': '{:,.2f}',
            'Lower (10%)': '{:,.2f}',
            'Lower (25%)': '{:,.2f}',
            'Median': '{:,.2f}',
            'Upper (75%)': '{:,.2f}',
            'Upper (90%)': '{:,.2f}'
        }), use_container_width=True)
        
        # Chat interface
        st.markdown("---")
        st.subheader("💬 Ask Questions About Your Forecast")
        
        user_query = st.text_input(
            "Ask me anything about the forecast:",
            placeholder="e.g., What is the trend? What is the highest forecasted value?"
        )
        
        if user_query:
            response = generate_chat_response(
                user_query, 
                st.session_state['forecast_data'],
                st.session_state['summary']
            )
            st.info(response)
        
        # Example questions
        with st.expander("💡 Example Questions"):
            st.markdown("""
            - What is the trend?
            - What is the highest forecasted value?
            - What is the average forecast?
            - Show me the confidence interval
            - Compare historical and forecast data
            - What is the total forecasted amount?
            """)

else:
    # Welcome screen
    st.info("👈 Please upload a CSV file or paste your data to get started")
    
    st.markdown("""
    ### How to use this tool:
    
    1. **Upload Data**: Choose to upload a CSV file or paste text data
    2. **Configure**: Select your date column, target column, and forecast settings
    3. **Generate**: Click the "Generate Forecast" button
    4. **Analyze**: View the forecast visualization and ask questions
    
    ### Supported Features:
    
    - ✅ Multiple time periods (Day, Week, Month, Year)
    - ✅ Various aggregation methods (Sum, Mean, Median, Count)
    - ✅ Customizable forecast horizon (1-24 periods)
    - ✅ Confidence intervals (50% and 80%)
    - ✅ Interactive Q&A about forecasts
    """)

Run the Streamlit application using the following command to view the results.

streamlit run app.py
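
One detail of the aggregation step in load_and_prepare_data is worth a closer look. pandas sorts group keys by default, so formatted labels such as "April 2024" sort alphabetically rather than by date unless sort=False is passed. The sketch below shows an alternative that groups on monthly Periods and formats the labels afterwards. The column names and values are made up for illustration, and the month names assume an English locale.

```python
import pandas as pd

raw = pd.DataFrame({
    "date": pd.to_datetime(["2023-08-05", "2023-08-20", "2023-09-10", "2024-04-02"]),
    "sales": [10.0, 5.0, 7.0, 3.0],
})

# Grouping on formatted strings: keys come back in alphabetical order
by_label = raw.groupby(raw["date"].dt.strftime("%B %Y"))["sales"].sum()
print(by_label.index.tolist())

# Grouping on monthly Periods: keys come back in chronological order,
# and the display labels are produced only at the end
by_period = raw.groupby(raw["date"].dt.to_period("M"))["sales"].sum()
labels = by_period.index.strftime("%B %Y").tolist()
values = by_period.tolist()
print(labels, values)
```

Since TimesFM treats the input as an ordered sequence, the values need to reach model.forecast in chronological order, whichever grouping approach is used.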

Thanks for reading this article!

The full source code for this tutorial can be found here:

GitHub - codemaker2015/timesfm-experiments: Forecasting time series data using timesfm

Understanding DeepEval: A Practical Guide for Evaluating Large Language Models

Vishnu Sivan — Tue, 09 Sep 2025 04:43:13 GMT

The rapid evolution of Large Language Models (LLMs) has transformed the way we build intelligent applications, but with this growth comes an equally important challenge — how do we measure their true effectiveness? Traditional evaluation methods often fall short in capturing the diverse capabilities and limitations of LLMs. From reasoning and accuracy to coherence, bias, and ethical alignment, a robust evaluation framework is essential to ensure that these models are reliable and suitable for real-world use.

This is where DeepEval comes in. DeepEval is an open-source framework built to streamline LLM testing by offering a comprehensive suite of metrics, synthetic dataset generation, real-time evaluation, and seamless integration with popular testing frameworks like Pytest. By enabling easy customization, DeepEval empowers researchers and developers to benchmark models against tasks like MMLU, apply advanced metrics such as G-Eval, and rigorously validate outputs for relevance and reliability.

In this tutorial, you’ll learn how to set up DeepEval, create a relevance test inspired by Pytest, evaluate LLM outputs using the G-Eval metric, and run MMLU benchmarking on the TinyLlama model. By the end, you’ll have a clear workflow to systematically test and improve the performance of your LLM-powered applications.

Getting Started

What is DeepEval
Key Features
Getting started with evaluation of LLM models using DeepEval
Installing the dependencies
Querying the Model & Measuring Different Metrics
Example 1: Answer Relevancy Metric
Example 2: G-Eval Metric
Example 3: Prompt Alignment Metric
Example 4: Json Correctness Metric
Example 5: Summarization Metric
Example 6: LLM Integration
Example 7: Hallucinations
Example 8: Faithfulness Metric
Example 9: Chatbot Evaluation
Example 10: LLM Tracing
Example 11: MCP Interactions
MMLU benchmarking with DeepEval for custom LLMs

What is DeepEval

DeepEval is an open-source evaluation framework designed for testing Large Language Models (LLMs) across multiple dimensions such as reasoning, accuracy, coherence, relevance, and ethical alignment. Unlike simple benchmarks, DeepEval goes beyond by offering custom metrics, real-time evaluation, synthetic dataset generation, and seamless integration with testing pipelines. It allows researchers and developers to systematically measure and monitor LLM performance both in experimentation and production.

Key Features

Extensive Metric Suite
Provides 14+ research-backed metrics for LLM evaluation.
Includes advanced metrics like G-Eval (chain-of-thought reasoning), Faithfulness (accuracy & reliability), Toxicity, Answer Relevancy, and Conversational Metrics such as knowledge retention and conversation completeness.

Custom Metric Development
Allows users to define their own evaluation metrics tailored to specific use cases.

Integration with LLMs
Compatible with any LLM (including OpenAI models).
Supports benchmarking against popular datasets like MMLU and HumanEval.

Real-Time Monitoring & Benchmarking
Enables continuous monitoring of LLMs in production.
Provides robust benchmarking capabilities to assess models efficiently.

Simplified Testing with Pytest Integration
Built with a Pytest-like architecture, making it easy to write unit tests for LLM outputs in just a few lines of code.

Batch Evaluation Support
Supports large-scale evaluations with batch processing, saving time when benchmarking custom LLMs.

Getting started with evaluation of LLM models using DeepEval

In this section, we will explore how to evaluate Large Language Models (LLMs) with DeepEval. By default, DeepEval supports OpenAI, and we’ll be using the OpenAI GPT-4o-mini model. However, you can use any other LLM of your choice.

We will also use uv, a modern and fast Python package manager (instead of pip), to set up our environment and handle dependencies.

Installing uv

uv simplifies dependency management, virtual environments, and running scripts.

For Windows:

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
set Path=%USERPROFILE%\.local\bin;%Path%

For Linux / Mac:

curl -LsSf https://astral.sh/uv/install.sh | sh

Refer to the official website for detailed installation instructions.

Installation | uv

Installing the dependencies

Initialize a uv project by executing the following command.

uv init deepeval_demo
cd deepeval_demo

Create and activate a virtual environment by executing the following command.

uv venv
source .venv/bin/activate # for linux
.venv\Scripts\activate    # for windows

Install deepeval, langchain-openai, fastmcp, transformers, accelerate, bitsandbytes, datasets, pandas and python-dotenv using uv.

uv add deepeval langchain-openai fastmcp transformers accelerate bitsandbytes datasets pandas python-dotenv

Setting up the credentials

Create a file named .env. This file will store your environment variables, including the OpenAI key.
Open the .env file and add the following line to specify your OpenAI API key.

OPENAI_API_KEY=sk-proj-C1K1hKug99wXxtj...
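
Before running any metric, it can help to confirm that the key is actually visible to the Python process. The following is a minimal sketch using only the standard library; require_env is a hypothetical helper name, not part of DeepEval or python-dotenv.

```python
import os


def require_env(name: str, env=None) -> str:
    """Return the value of an environment variable or raise a clear error."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env or export it first")
    return value


# Example with an explicit mapping so the check can be tried without a real key
print(require_env("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-example"}))
```

In a real script you would call require_env("OPENAI_API_KEY") right after load_dotenv(), so a missing key fails early with a readable message.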

Querying the Model & Measuring Different Metrics

Now that the environment is set up, let’s start querying our LLM and measure the quality of its responses using different metrics.

Example 1: Answer Relevancy Metric

The Answer Relevancy Metric evaluates how relevant the model’s response (actual_output) is to the user’s query (input). It is commonly used in RAG (Retrieval-Augmented Generation) systems to check that the generator answers the question that was asked. Whether the answer agrees with the retrieved facts is the job of the Faithfulness Metric covered in Example 8.

Create a file named test_relevancy.py and add the following code to it.

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from dotenv import load_dotenv
load_dotenv()

def test_relevancy():
    # Define the metric with a threshold
    relevancy_metric = AnswerRelevancyMetric(threshold=0.7, model="gpt-4o-mini")
    
    # Case 1: Partially relevant answer
    test_case_1 = LLMTestCase(
        input="Can I return these shoes after 30 days?",
        actual_output="Yes, you can return them. We offer a 30-day full refund. Do you have your original receipt?",
        retrieval_context=[
            "All customers are eligible for a 30-day full refund at no extra cost.",
            "Returns are only accepted within 30 days of purchase.",
        ],
    )
    
    # Case 2: Fully relevant answer
    test_case_2 = LLMTestCase(
        input="Can I return these shoes after 30 days?",
        actual_output="Unfortunately, returns are only accepted within 30 days of purchase.",
        retrieval_context=[
            "All customers are eligible for a 30-day full refund at no extra cost.",
            "Returns are only accepted within 30 days of purchase.",
        ],
    )
    
    # Run evaluation
    assert_test(test_case_1, [relevancy_metric])
    assert_test(test_case_2, [relevancy_metric])

Code explanation:

Metric Defined → AnswerRelevancyMetric(threshold=0.7) sets the relevancy cutoff.
Test Cases Created → Each LLMTestCase includes the user query (input), the LLM output (actual_output), and the retrieval context.
Assertion → assert_test() checks whether the relevancy score meets the threshold and fails the test otherwise.
In Test Case 1, the answer contradicts the context (“Yes, you can return them” vs. the rule that returns are only accepted within 30 days) and adds a follow-up question that does not address the query.
In Test Case 2, the answer addresses the question directly and agrees with the context.
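
To make the threshold concrete, here is a minimal sketch of the scoring arithmetic DeepEval documents for this metric: the share of statements in the answer that are relevant to the input. The verdicts below are hand-labelled assumptions; in DeepEval they come from the judge model, so real scores will differ.

```python
def relevancy_score(verdicts: list[bool]) -> float:
    """Fraction of answer statements judged relevant to the input."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


def passes(score: float, threshold: float = 0.7) -> bool:
    """A test case passes when its score reaches the threshold."""
    return score >= threshold


# Hypothetical verdicts for test case 1: two statements address the question,
# the follow-up question about the receipt does not.
score = relevancy_score([True, True, False])
print(round(score, 2), passes(score))
```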

Executing the test

Use the following command to run the test.

deepeval test run test_relevancy.py

Example 2: G-Eval Metric

G-Eval is an LLM evaluation framework that leverages chain-of-thought (CoT) reasoning to assess model outputs based on custom criteria. Unlike fixed metrics, G-Eval is highly flexible and can evaluate nearly any aspect of a response — such as factual accuracy, omissions, clarity, or adherence to instructions.

Image source: [2303.16634] G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

It works in two steps:

Generate Evaluation Steps — Uses CoT reasoning to break down the evaluation based on the given criteria.
Determine Final Score — Applies those steps to score the LLM’s output.

If evaluation steps are manually provided, G-Eval skips step one and directly uses them to calculate the score.
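
The scoring step can be illustrated with a small sketch. In the G-Eval paper, the final score is a probability-weighted average of the candidate scores rather than the single most likely one. The probabilities below are made-up values for illustration, and weighted_score is a hypothetical helper, not a DeepEval API.

```python
def weighted_score(score_probs: dict[int, float]) -> float:
    """Expected score: sum of score * probability, after normalising."""
    total = sum(score_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    return sum(s * p for s, p in score_probs.items()) / total


# Made-up judge probabilities for the scores 1 to 5
probs = {1: 0.0, 2: 0.05, 3: 0.15, 4: 0.5, 5: 0.3}
print(weighted_score(probs))
```

Weighting by probability gives finer-grained scores than picking a single integer, which makes it easier to tell apart outputs of similar quality.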

Create a file named test_geval_example.py and add the following code to it.

from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

from dotenv import load_dotenv
load_dotenv()

correctness_metric = GEval(
    name="Correctness",
    model="gpt-4o-mini",
    evaluation_params=[
        LLMTestCaseParams.EXPECTED_OUTPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT],
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
        "You should also lightly penalize omission of detail, and focus on the main idea",
        "Vague language, or contradicting OPINIONS, are OK"
    ],
)

first_test_case = LLMTestCase(input="What are the main causes of deforestation?",
                              actual_output="The main causes of deforestation include agricultural expansion, logging, infrastructure development, and urbanization.",
                              expected_output="The main causes of deforestation include agricultural expansion, logging, infrastructure development, and urbanization.")


second_test_case = LLMTestCase(input="Define the term 'artificial intelligence'.",
                               actual_output="Artificial intelligence is the simulation of human intelligence by machines.",
                               expected_output="Artificial intelligence refers to the simulation of human intelligence in machines that are programmed to think and learn like humans, including tasks such as problem-solving, decision-making, and language understanding.")


third_test_case = LLMTestCase(input="List the primary colors.",
                              actual_output="The primary colors are green, orange, and purple.",
                              expected_output="The primary colors are red, blue, and yellow.")

test_cases = [first_test_case, second_test_case, third_test_case]
for test_case in test_cases:
    assert_test(test_case, [correctness_metric])

Code explanation

Define a G-Eval metric named “Correctness” using gpt-4o-mini.
Specify evaluation parameters: compare expected vs actual output.
Add custom evaluation steps: check contradictions, penalize omissions, allow vague language/opinions.
Create three LLM test cases with queries, actual outputs, and expected outputs.
Run all test cases using assert_test() with the correctness metric.

Executing the test

Use the following command to run the test.

deepeval test run test_geval_example.py

Example 3: Prompt Alignment Metric

The prompt alignment metric evaluates whether an LLM’s generated output aligns with the instructions defined in the prompt template. It ensures that the model not only provides a relevant response to the query but also adheres to any specified formatting, style, or structural requirements.
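
As a rough mental model, the score is the share of prompt instructions that the output follows. The sketch below replaces the LLM judge with plain Python checks, which only works for mechanically verifiable instructions such as casing or length. The function names and the length limit are illustrative assumptions.

```python
from typing import Callable


def is_uppercase(text: str) -> bool:
    """True when the text contains letters and all of them are uppercase."""
    letters = [c for c in text if c.isalpha()]
    return bool(letters) and all(c.isupper() for c in letters)


def alignment_score(output: str, checks: list[Callable[[str], bool]]) -> float:
    """Fraction of instruction checks that the output satisfies."""
    if not checks:
        return 1.0
    return sum(check(output) for check in checks) / len(checks)


# Two instructions: reply in uppercase, and keep the reply to 40 characters
checks = [is_uppercase, lambda t: len(t) <= 40]
print(alignment_score("NEW DELHI", checks))
print(alignment_score("New Delhi is the capital.", checks))
```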

Create a file named test_prompt_alignment.py and add the following code to it.

from deepeval import evaluate
from deepeval.metrics import PromptAlignmentMetric
from deepeval.test_case import LLMTestCase
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

from dotenv import load_dotenv
load_dotenv()

template = """Question: {question}
Answer: Answer in Upper case."""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI(model="gpt-4o-mini")
chain = prompt | model
query = "What is capital of India?"
input_data = {"question": query}
# Invoke the chain with input data and display the response
actual_output = chain.invoke(input_data).content
print("actual_output:", actual_output)

# Measuring prompt alignment
metric = PromptAlignmentMetric(
    prompt_instructions=["Reply in all uppercase"],
    model="gpt-4o-mini",
    include_reason=True
)
test_case = LLMTestCase(
    input=query,
    actual_output=actual_output
)

metric.measure(test_case)
print("metric.score:", metric.score)
print("metric.reason:", metric.reason)

# or evaluate test cases in bulk
evaluate([test_case], [metric])

Run the test using the deepeval test run test_prompt_alignment.py command to check whether the model output aligns with the prompt instructions.

Example 4: JSON Correctness Metric

The JSON Correctness Metric evaluates whether an LLM’s generated output follows the correct JSON schema. Unlike other metrics that rely on an LLM for assessment, this metric simply checks the provided expected_schema and verifies if the actual_output can be successfully validated against it.
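
The check itself is ordinary schema validation. Below is a minimal standard-library sketch of the idea, with a flat dict of required keys standing in for the Pydantic model. That substitution is an assumption made to keep the example dependency-free; matches_schema is a hypothetical helper.

```python
import json


def matches_schema(raw: str, required: dict[str, type]) -> bool:
    """Return True if raw parses as a JSON object with the required typed keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), typ) for key, typ in required.items())


schema = {"name": str}
print(matches_schema('{"name": "Asha"}', schema))
print(matches_schema('Sure! Here is the JSON: {"name": "Asha"}', schema))
```

Output that wraps the JSON in explanatory prose does not parse, so prompts evaluated with this metric should ask the model to return raw JSON only.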

Create a file named test_json_correctness.py and add the following code to it.

from deepeval import evaluate
from deepeval.metrics import JsonCorrectnessMetric
from deepeval.test_case import LLMTestCase
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel

from dotenv import load_dotenv
load_dotenv()

class ExampleSchema(BaseModel):
    name: str

# Querying the model
template = """Question: {question}
Answer:  Let's think step by step."""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI(model="gpt-4o-mini")
chain = prompt | model
query ="Output me a random Json with the 'name' key"
input_data = {"question": query}
# Invoke the chain with input data and display the response
actual_output = chain.invoke(input_data).content
print("actual_output:", actual_output)

# Measuring Json correctness
metric = JsonCorrectnessMetric(
    expected_schema=ExampleSchema,
    model="gpt-4o-mini",
    include_reason=True
)
test_case = LLMTestCase(
    input=query,
    actual_output=actual_output
)

metric.measure(test_case)
print("metric.score:", metric.score)
print("metric.reason:", metric.reason)

# or evaluate test cases in bulk
evaluate([test_case], [metric])

Run the test using the deepeval test run test_json_correctness.py command to check whether the model output matches the expected schema.

Example 5: Summarization Metric

The Summarization Metric evaluates whether an LLM generates factually accurate summaries that include the essential details from the original text. Its score is calculated using two components: the alignment_score, which checks if the summary avoids hallucinations or contradictions, and the coverage_score, which measures whether the summary captures all the necessary information from the source text.
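
DeepEval's documentation describes the final score as the minimum of the two components, so a summary cannot compensate for hallucinations with good coverage or the other way round. A minimal sketch of that combination step, with made-up component values:

```python
def summarization_score(alignment_score: float, coverage_score: float) -> float:
    """Overall score is limited by the weaker of the two components."""
    for value in (alignment_score, coverage_score):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores must be between 0 and 1")
    return min(alignment_score, coverage_score)


# Accurate summary that leaves out part of the source text
print(summarization_score(0.9, 0.6))
```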

Create a file named test_summarization.py and add the following code to it.

from deepeval import evaluate
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

from dotenv import load_dotenv
load_dotenv()

# This is the original text to be summarized
text = """
Rice is the staple food of Bengal. Bhortas (lit-"mashed") are a really common type of food used as an additive too rice. there are several types of Bhortas such as Ilish bhorta shutki bhorta, begoon bhorta and more. Fish and other seafood are also important because Bengal is a reverrine region.
Some fishes like puti (Puntius species) are fermented. Fish curry is prepared with fish alone or in combination with vegetables.Shutki maach is made using the age-old method of preservation where the food item is dried in the sun and air, thus removing the water content. This allows for preservation that can make the fish last for months, even years in Bangladesh
"""

template = """Question: {question}
Answer:  Let's think step by step."""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI(model="gpt-4o-mini")
chain = prompt | model
query = f"Summarize the text for me {text}"
input_data = {"question": query}
# Invoke the chain with input data and print the response
actual_output = chain.invoke(input_data).content
print("actual_output:", actual_output)

test_case = LLMTestCase(input=text, actual_output=actual_output)
metric = SummarizationMetric(
    threshold=0.7,
    model="gpt-4o-mini",
)

metric.measure(test_case)
print("metric.score:", metric.score)
print("metric.reason:", metric.reason)

# or evaluate test cases in bulk
evaluate([test_case], [metric])

Run the test using the deepeval test run test_summarization.py command to check the quality of the generated summary.

Example 6: LLM Integration

DeepEval is capable of integrating with any LLM and evaluating its performance across different metrics. In this example, we use LangChain’s ChatOpenAI with DeepEval to test the relevancy of a response. The model (gpt-4o-mini) is queried with “What is the capital of India?”, and the output is evaluated using the Answer Relevancy Metric.

Create a file named test_openai_llm.py and add the following code to it.

from langchain_openai import ChatOpenAI
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

from dotenv import load_dotenv
load_dotenv()

# Initialize the model
chat = ChatOpenAI(model="gpt-4o-mini",temperature=0.7)

# Get response
query = "What is the capital of India?"
response = chat.invoke(query).content
print(f"User: {query} \nAssistant: {response}")

metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)
test_case = LLMTestCase(
    input=query,
    actual_output=response
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

Run the test using the deepeval test run test_openai_llm.py command to check the evaluation metrics.

Example 7: Hallucinations

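A hallucination occurs when a model produces information that is not supported by, or contradicts, the context it was given. DeepEval provides a Hallucination Metric that detects and scores these cases. It compares the model’s actual output against the given context to check whether the response sticks to the facts provided. Note that this score measures the amount of hallucination, so lower is better: the more context entries the output contradicts, the higher the score, and the test passes only when the score stays at or below the threshold. This way, DeepEval helps ensure your LLM outputs remain accurate and grounded in the input context.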
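
The direction of this score is easy to get wrong, so here is a minimal sketch of the arithmetic as DeepEval documents it: the fraction of context entries that the output contradicts, with the threshold acting as a maximum. The verdicts are hand-labelled assumptions; in DeepEval they come from the judge model.

```python
def hallucination_score(contradicted: list[bool]) -> float:
    """Fraction of context entries that the output contradicts (lower is better)."""
    if not contradicted:
        return 0.0
    return sum(contradicted) / len(contradicted)


def passes(score: float, threshold: float = 0.5) -> bool:
    """This metric passes when the score stays at or below the threshold."""
    return score <= threshold


# One context entry, not contradicted by the output
print(hallucination_score([False]), passes(hallucination_score([False])))
# Three context entries, two of them contradicted
print(passes(hallucination_score([True, True, False])))
```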

Create a file named test_hallucinations.py and add the following code to it.

from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

from dotenv import load_dotenv
load_dotenv()

# Replace this with the actual documents that you are passing as input to your LLM.
context=["A man with blond-hair, and a brown shirt drinking out of a public water fountain."]

# Replace this with the actual output from your LLM application
actual_output="A blond drinking water in public."

test_case = LLMTestCase(
    input="What was the blond doing?",
    actual_output=actual_output,
    context=context
)
metric = HallucinationMetric(threshold=0.5)

# To run metric as a standalone
# metric.measure(test_case)
# print(metric.score, metric.reason)

evaluate(test_cases=[test_case], metrics=[metric])

Run the test using the deepeval test run test_hallucinations.py command to check the evaluation metrics.

Example 8: Faithfulness Metric

The FaithfulnessMetric in DeepEval evaluates whether an LLM’s output is accurately grounded in the provided context or source material. It measures if the response remains consistent with the facts without introducing fabricated or contradictory information. This metric is particularly useful for applications like question-answering or retrieval-augmented generation, where maintaining factual consistency is critical. A higher faithfulness score indicates that the model’s output reliably reflects the input context.

Create a file named test_faithfulness.py and add the following code to it.

from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric

from dotenv import load_dotenv
load_dotenv()

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["All customers are eligible for a 30 day full refund at no extra cost."]

metric = FaithfulnessMetric(
    threshold=0.7,
    model="gpt-4o-mini",
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    retrieval_context=retrieval_context
)

# To run metric as a standalone
# metric.measure(test_case)
# print(metric.score, metric.reason)

evaluate(test_cases=[test_case], metrics=[metric])

Run the test using the deepeval test run test_faithfulness.py command to check the evaluation metrics.

Example 9: Chatbot Evaluation

Chatbot Evaluation differs from standard single-turn evaluations because conversations occur over multiple turns. This requires the chatbot to maintain context awareness throughout the interaction, rather than simply providing accurate responses in isolation.
In DeepEval, chatbots are assessed through multi-turn interactions, which must be structured as test cases following OpenAI’s message format. Evaluating multi-turn conversations is challenging, as each AI response depends on the preceding user input and all prior turns in the conversation, making the evaluation inherently context-dependent.
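
Because each turn is judged against the history before it, a mislabelled role or an empty message can skew every later verdict. A small pre-flight check on the raw message list can catch that. In this sketch, validate_turns is a hypothetical helper and the strict user/assistant alternation rule is a simplifying assumption (real conversations may also contain system or tool messages).

```python
VALID_ROLES = {"user", "assistant"}


def validate_turns(messages: list[dict]) -> list[str]:
    """Return a list of problems found in an OpenAI-style message list."""
    problems = []
    previous_role = None
    for index, message in enumerate(messages):
        role = message.get("role")
        if role not in VALID_ROLES:
            problems.append(f"turn {index}: unexpected role {role!r}")
        if not str(message.get("content", "")).strip():
            problems.append(f"turn {index}: empty content")
        if role == previous_role:
            problems.append(f"turn {index}: two consecutive {role!r} turns")
        previous_role = role
    return problems


conversation = [
    {"role": "assistant", "content": "How can I help you today?"},
    {"role": "user", "content": "I'd like to buy a ticket to a Coldplay concert."},
]
print(validate_turns(conversation))
```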

Create a file named test_chatbots.py and add the following code to it.

from deepeval.test_case import ConversationalTestCase, Turn
from deepeval.metrics import TurnRelevancyMetric, KnowledgeRetentionMetric
from deepeval import evaluate

from dotenv import load_dotenv
load_dotenv()

test_case = ConversationalTestCase(
    turns=[
        Turn(role="assistant", content="Hello, how are you?"),
        Turn(role="user", content="I'm doing well, thank you!"),
        Turn(role="assistant", content="How can I help you today?"),
        Turn(role="user", content="I'd like to buy a ticket to a Coldplay concert."),
    ]
)

evaluate(test_cases=[test_case], metrics=[TurnRelevancyMetric(), KnowledgeRetentionMetric()])

Run the test using the deepeval test run test_chatbots.py command to check the evaluation results.

Example 10: LLM Tracing

LLM Tracing allows you to monitor the full execution of your application from start to finish. In DeepEval, the @observe decorator enables tracing and evaluation of any LLM interaction, regardless of the application’s complexity. By identifying individual components of your LLM workflow—such as functions that perform specific tasks or are invoked selectively—you can apply the @observe decorator to track their behavior. This provides detailed insights into how each part of your LLM application operates, making it easier to debug, optimize, and evaluate performance.
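
Conceptually, a tracing decorator wraps a function and records what went in, what came out, and how long it took. The sketch below is a standard-library stand-in for that idea, not DeepEval's implementation: observe_sketch and TRACES are hypothetical names, and the records stay in memory instead of being sent to a tracing backend.

```python
import functools
import time

TRACES: list[dict] = []  # in-memory stand-in for a tracing backend


def observe_sketch(func):
    """Record name, arguments, result and duration of each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        TRACES.append({
            "name": func.__name__,
            "args": args,
            "kwargs": kwargs,
            "output": result,
            "seconds": time.perf_counter() - start,
        })
        return result
    return wrapper


@observe_sketch
def retrieve(query: str) -> list[str]:
    # Placeholder retrieval step
    return [f"document about {query}"]


@observe_sketch
def answer(query: str) -> str:
    # Placeholder generation step that calls the traced retriever
    docs = retrieve(query)
    return f"Based on {len(docs)} document(s): ..."


answer("refund policy")
print([t["name"] for t in TRACES])
```

Decorating each component separately is what makes it possible to see which step of a multi-step application is slow or produces a poor intermediate result.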

Traces are viewed in Confident AI, so tracing requires Confident AI credentials. Create a basic account on Confident AI and add the API key it provides to the .env file.

Confident AI is a comprehensive platform designed to evaluate and enhance the performance of large language models (LLMs). It leverages its open-source evaluation framework, DeepEval, to provide robust testing, benchmarking, and monitoring capabilities for LLM applications. Confident AI emphasizes observability, allowing teams to trace LLM interactions, conduct A/B testing, and gather real-time performance insights.

Confident AI - The DeepEval LLM Evaluation Platform

Create a file named test_llm_tracing.py and add the following code to it.

from openai import OpenAI
from deepeval.tracing import observe

from dotenv import load_dotenv
load_dotenv()

client = OpenAI()

@observe()
def llm_app(query: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": query}
        ]
    ).choices[0].message.content

# Call app to send trace to Confident AI
llm_app("Write me a poem.")

Run the test using the deepeval test run test_llm_tracing.py command to see the tracing results.

Example 11: MCP Interactions

The MCP Use Metric evaluates how effectively an MCP-based LLM agent utilizes the MCP servers it has access to. It leverages an LLM-as-a-judge approach to assess both the MCP primitives invoked and the arguments generated by the LLM application. This metric can be applied to a single-turn LLMTestCase containing MCP parameters, providing insights into the agent’s efficiency and correctness in interacting with the MCP environment.

Create a basic MCP server using FastMCP. For that, create a file named mcp_echo_server.py and add the following code to it.

import asyncio
from fastmcp import FastMCP

# Initialize FastMCP server
mcp = FastMCP("Simple Echo Server")

@mcp.tool()
def echo(message: str) -> str:
    """Echo back the provided message."""
    return message

def main():
    """Run the server."""
    mcp.run()

if __name__ == "__main__":
    main()

Run the server using the command

python mcp_echo_server.py

Create a test case to evaluate the MCP usage. For that, create a file named test_mcp.py and add the following code to it.

from deepeval import evaluate
from deepeval.metrics import MCPUseMetric
from deepeval.test_case import LLMTestCase, MCPServer

from dotenv import load_dotenv
load_dotenv()

test_case = LLMTestCase(
    input="Hello", # Your input here
    actual_output="Hello", # Your LLM app's final output here
    mcp_servers=[MCPServer(server_name="Simple Echo Server")] # Your MCP server's data
    # MCP primitives used (if any)
)

metric = MCPUseMetric()

# To run metric as a standalone
# metric.measure(test_case)
# print(metric.score, metric.reason)

evaluate([test_case], [metric])

Run the test using the deepeval test run test_mcp.py command to see the MCP evaluation results.

MMLU benchmarking with DeepEval for custom LLMs

MMLU (Massive Multitask Language Understanding) is a benchmark commonly used to evaluate large language models through multiple-choice questions. Covering 57 subjects ranging from math and history to law and ethics, it provides a thorough assessment of an LLM’s knowledge and reasoning skills. Its wide subject coverage and carefully designed questions have made MMLU a gold standard for measuring model performance.

In this guide, we will evaluate our custom LLM (TinyLlama-1.1B) on the MMLU dataset. Each entry in the dataset consists of an input prompt and multiple-choice answers (A, B, C, D). Model performance is measured by calculating the percentage of questions answered correctly.
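
The accuracy calculation can be sketched in a few lines. The records below are made-up examples and accuracy_by_task is a hypothetical helper; DeepEval computes these numbers for you when the benchmark runs.

```python
from collections import defaultdict


def accuracy_by_task(records: list[tuple[str, str, str]]) -> tuple[dict[str, float], float]:
    """records holds (task, predicted_letter, correct_letter) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, expected in records:
        total[task] += 1
        correct[task] += int(predicted == expected)
    per_task = {task: correct[task] / total[task] for task in total}
    overall = sum(correct.values()) / sum(total.values()) if total else 0.0
    return per_task, overall


records = [
    ("high_school_computer_science", "A", "A"),
    ("high_school_computer_science", "B", "C"),
    ("abstract_algebra", "D", "D"),
]
per_task, overall = accuracy_by_task(records)
print(per_task)
print(round(overall, 3))
```

With four answer options, random guessing lands near 25% accuracy, which is a useful baseline when reading the scores of a small model.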

Creating the Custom LLM Model Class

Define a custom class called TinyLlamaModel that extends DeepEvalBaseLLM to generate responses using the language model and tokenizer. The objective is to produce very short outputs (at most three new tokens) for a given prompt and reduce them to a single answer letter, while handling device placement and the pre- and post-processing of the prompt and the generated output.

Create a file named test_custom_model.py and add the following code to it.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from deepeval.models.base_model import DeepEvalBaseLLM
from deepeval.benchmarks import MMLU
from deepeval.benchmarks.tasks import MMLUTask
from typing import List

class TinyLlamaModel(DeepEvalBaseLLM):
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        # Clean the prompt for multiple choice
        prompt = prompt.replace("Output 'A', 'B', 'C', or 'D'. Full answer not needed.", "")
        
        # Format the prompt for TinyLlama
        formatted_prompt = f"### Instruction: Answer with just the letter (A, B, C, or D)\n\n### Question: {prompt}\n\n### Answer:"
        
        model_inputs = self.tokenizer([formatted_prompt], return_tensors="pt").to(self.device)
        
        generated_ids = self.model.generate(
            **model_inputs,
            max_new_tokens=3,
            do_sample=False,  # greedy decoding; temperature only applies when sampling
            pad_token_id=self.tokenizer.eos_token_id,
            repetition_penalty=1.1
        )
        
        # Extract only the new tokens
        generated_tokens = generated_ids[0][model_inputs['input_ids'].shape[1]:]
        clean_output = self.tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()
        
        # Extract just the letter (A, B, C, or D)
        for char in clean_output:
            if char in ['A', 'B', 'C', 'D']:
                return char
        
        return clean_output[:1]  # Fallback: return first character

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def batch_generate(self, prompts: List[str]) -> List[str]:
        """Batch generate method required by MMLU benchmark"""
        results = []
        for prompt in prompts:
            try:
                result = self.generate(prompt)
                results.append(result)
            except Exception as e:
                print(f"Error generating for prompt: {e}")
                results.append("")  # Fallback empty response
        return results

    async def a_batch_generate(self, prompts: List[str]) -> List[str]:
        return self.batch_generate(prompts)

    def get_model_name(self):
        return "TinyLlama-1.1B-Chat"
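
One caveat in the letter-extraction loop above: scanning character by character returns the first capital A to D anywhere in the text, so an output such as "Answer: B" yields "A". A possible refinement, shown here as a standalone sketch rather than a change to the class, is to match only standalone letters. The helper name extract_choice is an assumption.

```python
import re


def extract_choice(text: str) -> str:
    """Return the first standalone A-D letter, or an empty string if none is found."""
    match = re.search(r"\b([ABCD])\b", text)
    return match.group(1) if match else ""


for sample in ["B", " C.", "Answer: D", "(A) 0", "I am not sure"]:
    print(repr(sample), "->", repr(extract_choice(sample)))
```

This still misreads a capitalised article, as in "A dog", so keeping the generation very short remains the main safeguard.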

Loading the Model and Tokenizer

Create two functions to load the model and tokenizer from Hugging Face (the files are cached locally after the first download). The model is loaded with 4-bit NF4 quantization, falling back to unquantized float16 if that fails, and the tokenizer is configured to use the EOS token for padding, with padding applied on the left.
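
Quantization mainly reduces the memory taken by the weights. The following is a back-of-the-envelope sketch that ignores activations, the KV cache, and quantization metadata, so real usage will be somewhat higher. The parameter count is TinyLlama's nominal 1.1 billion.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


PARAMS = 1.1e9  # nominal parameter count of TinyLlama-1.1B
for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(PARAMS, bits):.2f} GiB")
```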

def load_model(model_name: str):
    # Use light quantization for TinyLlama
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    
    try:
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=quant_config,
            device_map="auto",
            dtype=torch.float16,
            trust_remote_code=True
        )
        return model
    except Exception as e:
        print(f"Error loading quantized model, trying without quantization: {e}")
        # Fallback without quantization
        return AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map="auto",
            dtype=torch.float16,
            trust_remote_code=True
        )

def load_tokenizer(model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"
    return tokenizer

Building the custom LLM

We will use the Hugging Face model TinyLlama/TinyLlama-1.1B-Chat-v1.0 to directly load both the model and tokenizer. These will then be passed into the custom LLM class to create an LLM response generator.

After loading the TinyLlama 1.1B model and tokenizer from Hugging Face and wrapping them in our custom class, we can test the response generation. The code first runs a single test prompt related to abstract algebra to verify that the model produces an output. It then performs batch generation with multiple prompts, such as arithmetic and geography questions, to validate that the model can handle multiple inputs efficiently. The results are printed for both single and batch generations, ensuring that our custom model class works as expected.

# Load TinyLlama 1.1B model from Hugging Face
tinyllama_model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

print("Loading tokenizer...")
tokenizer = load_tokenizer(tinyllama_model_name)

print("Loading model...")
model = load_model(tinyllama_model_name)

print("Creating custom model...")
custom_model = TinyLlamaModel(model, tokenizer)

# Test model generation
print("\nTesting model generation:")
prompt = """
The following are multiple choice questions (with answers) about abstract algebra.

Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.
A. 0
B. 1
C. 2
D. 3
Answer:"""

test_output = custom_model.generate(prompt)
print(f"Generated output: '{test_output}'")

# Test batch generation
print("\nTesting batch generation...")
test_prompts = [
    prompt,
    "What is 2+2? A. 1 B. 2 C. 3 D. 4 Answer:",
    "Capital of France? A. London B. Berlin C. Paris D. Rome Answer:"
]

batch_outputs = custom_model.batch_generate(test_prompts)
for i, output in enumerate(batch_outputs):
    print(f"Batch output {i+1}: '{output}'")

Running the MMLU Benchmark

Finally, we will load the MMLU benchmark, define the tasks, and run the evaluation on the custom model. The results of each task can be reviewed using benchmark.task_scores, while benchmark.predictions provides detailed outputs showing which samples were answered correctly and which were not. This allows for a more granular analysis of the model’s performance.

# Run MMLU benchmark with very light settings
print("\nRunning MMLU benchmark...")
benchmark = MMLU(
    tasks=[MMLUTask.HIGH_SCHOOL_COMPUTER_SCIENCE],  # Only one task
    n_shots=2  # Reduced shots for smaller model
)

try:
    benchmark.evaluate(model=custom_model, batch_size=1)  # Batch size 1
    
    print("\nBenchmark Results:")
    print(f"Task Scores: {benchmark.task_scores}")
    print(f"Overall Score: {benchmark.overall_score}")
    
    # Print detailed predictions
    print("\nSample Predictions:")
    for i, (input_text, prediction) in enumerate(list(benchmark.predictions.items())[:3]):
        print(f"Prediction {i+1}:")
        print(f"Input: {input_text[:100]}...")
        print(f"Prediction: {prediction}")
        print("---")
        
except Exception as e:
    print(f"Error during benchmark evaluation: {e}")
    print("Trying with even smaller settings...")
    
    # Fallback: try with minimal settings
    benchmark = MMLU(
        tasks=[MMLUTask.HIGH_SCHOOL_COMPUTER_SCIENCE],
        n_shots=1
    )
    benchmark.evaluate(model=custom_model, batch_size=1)
    print(f"Fallback overall score: {benchmark.overall_score}")

Final Code

The complete implementation will look as follows:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from deepeval.models.base_model import DeepEvalBaseLLM
from deepeval.benchmarks import MMLU
from deepeval.benchmarks.tasks import MMLUTask
from typing import List

class TinyLlamaModel(DeepEvalBaseLLM):
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        # Clean the prompt for multiple choice
        prompt = prompt.replace("Output 'A', 'B', 'C', or 'D'. Full answer not needed.", "")
        
        # Format the prompt for TinyLlama
        formatted_prompt = f"### Instruction: Answer with just the letter (A, B, C, or D)\n\n### Question: {prompt}\n\n### Answer:"
        
        model_inputs = self.tokenizer([formatted_prompt], return_tensors="pt").to(self.device)
        
        generated_ids = self.model.generate(
            **model_inputs,
            max_new_tokens=3,
            do_sample=False,  # greedy decoding; temperature only applies when sampling
            pad_token_id=self.tokenizer.eos_token_id,
            repetition_penalty=1.1
        )
        
        # Extract only the new tokens
        generated_tokens = generated_ids[0][model_inputs['input_ids'].shape[1]:]
        clean_output = self.tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()
        
        # Extract just the letter (A, B, C, or D)
        for char in clean_output:
            if char in ['A', 'B', 'C', 'D']:
                return char
        
        return clean_output[:1]  # Fallback: return first character

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def batch_generate(self, prompts: List[str]) -> List[str]:
        """Batch generate method required by MMLU benchmark"""
        results = []
        for prompt in prompts:
            try:
                result = self.generate(prompt)
                results.append(result)
            except Exception as e:
                print(f"Error generating for prompt: {e}")
                results.append("")  # Fallback empty response
        return results

    async def a_batch_generate(self, prompts: List[str]) -> List[str]:
        return self.batch_generate(prompts)

    def get_model_name(self):
        return "TinyLlama-1.1B-Chat"
    
def load_model(model_name: str):
    # Use light quantization for TinyLlama
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )
    
    try:
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=quant_config,
            device_map="auto",
            dtype=torch.float16,
            trust_remote_code=True
        )
        return model
    except Exception as e:
        print(f"Error loading quantized model, trying without quantization: {e}")
        # Fallback without quantization
        return AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map="auto",
            dtype=torch.float16,
            trust_remote_code=True
        )

def load_tokenizer(model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"
    return tokenizer

# Load TinyLlama 1.1B model from Hugging Face
tinyllama_model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

print("Loading tokenizer...")
tokenizer = load_tokenizer(tinyllama_model_name)

print("Loading model...")
model = load_model(tinyllama_model_name)

print("Creating custom model...")
custom_model = TinyLlamaModel(model, tokenizer)

# Test model generation
print("\nTesting model generation:")
prompt = """
The following are multiple choice questions (with answers) about abstract algebra.

Find all c in Z_3 such that Z_3[x]/(x^2 + c) is a field.
A. 0
B. 1
C. 2
D. 3
Answer:"""

test_output = custom_model.generate(prompt)
print(f"Generated output: '{test_output}'")

# Test batch generation
print("\nTesting batch generation...")
test_prompts = [
    prompt,
    "What is 2+2? A. 1 B. 2 C. 3 D. 4 Answer:",
    "Capital of France? A. London B. Berlin C. Paris D. Rome Answer:"
]

batch_outputs = custom_model.batch_generate(test_prompts)
for i, output in enumerate(batch_outputs):
    print(f"Batch output {i+1}: '{output}'")

# Run MMLU benchmark with very light settings
print("\nRunning MMLU benchmark...")
benchmark = MMLU(
    tasks=[MMLUTask.HIGH_SCHOOL_COMPUTER_SCIENCE],  # Only one task
    n_shots=2  # Reduced shots for smaller model
)

try:
    benchmark.evaluate(model=custom_model, batch_size=1)  # Batch size 1
    
    print("\nBenchmark Results:")
    print(f"Task Scores: {benchmark.task_scores}")
    print(f"Overall Score: {benchmark.overall_score}")
    
    # Print detailed predictions
    print("\nSample Predictions:")
    for i, (input_text, prediction) in enumerate(list(benchmark.predictions.items())[:3]):
        print(f"Prediction {i+1}:")
        print(f"Input: {input_text[:100]}...")
        print(f"Prediction: {prediction}")
        print("---")
        
except Exception as e:
    print(f"Error during benchmark evaluation: {e}")
    print("Trying with even smaller settings...")
    
    # Fallback: try with minimal settings
    benchmark = MMLU(
        tasks=[MMLUTask.HIGH_SCHOOL_COMPUTER_SCIENCE],
        n_shots=1
    )
    benchmark.evaluate(model=custom_model, batch_size=1)
    print(f"Fallback overall score: {benchmark.overall_score}")
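
One detail of the generate method above is worth a closer look: the extraction loop returns the first uppercase A to D character it encounters, so a reply such as "Answer: C" would yield "A". A possible hardening, shown here as a standalone sketch (the helper name extract_choice is hypothetical and is not part of DeepEval or the original script), is to accept a letter only when it stands alone as a word:

```python
import re

# Hypothetical helper, not part of DeepEval or the original script.
# \b on both sides means the letter must not be embedded in a longer word.
_CHOICE = re.compile(r"\b([ABCD])\b")

def extract_choice(text: str) -> str:
    """Return the first standalone A/B/C/D in text, or '' if none is found."""
    match = _CHOICE.search(text)
    return match.group(1) if match else ""

print(extract_choice("Answer: C"))  # C (the 'A' in 'Answer' is skipped)
print(extract_choice("B. 1"))       # B
```

This still treats a leading article such as "A good guess" as choice A, so it is a heuristic rather than a complete parser.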

Executing the test

Execute the custom LLM benchmark test by running the following command.

deepeval test run test_custom_model.py

Thanks for reading this article!

Thanks Gowri M Bhatt for reviewing the content.


The full source code for this tutorial can be found here:

GitHub - codemaker2015/deepeval-experiments: This repository contains hands-on experiments with DeepEval, an open-source evaluation framework for testing Large Language Models (LLMs)


The Ultimate A2A Handbook: Rulebook for Agent Conversations

Vishnu Sivan — Sun, 27 Jul 2025 15:42:28 GMT

As artificial intelligence continues to evolve, the need for AI systems to communicate and collaborate has become increasingly important. From document summarization to image generation and intelligent decision-making, today’s AI agents are no longer working in isolation — they must interact to handle complex tasks effectively. This is where the Agent-to-Agent (A2A) Protocol comes into play.

A2A is a communication framework designed to enable seamless interaction between autonomous agents. In an era dominated by distributed systems and multi-agent architectures, A2A offers a structured way for intelligent agents to share information, delegate tasks, and make coordinated decisions without human intervention.

This article explores the A2A protocol, from foundational concepts to advanced implementations, and walks through building a travel planner app using A2A.

Getting Started

What is A2A
Core concepts of A2A
A2A vs MCP
Experimenting with A2A
1. Implementing A2A from scratch using FastAPI
2. Implementing A2A using Google A2A SDK
A2A Client-Server Interaction Flow
3. Implementing A2A using Python-A2A library
Example 1: Echo Agent
Example 2: Basic A2A Agent
Example 3: LLM-Based Agent
Example 4: Converting LangChain to A2A Servers
Example 5: Converting MCP Tools to LangChain Tools
Building a travel planner app using A2A

What is A2A

Image Source: A Visual Guide to Agent2Agent (A2A) Protocol

Imagine assembling a team of exceptional AI assistants — one masters data analysis, another crafts insightful reports, and a third flawlessly manages your schedule. Individually, they’re outstanding. But there’s a hitch: each speaks a different language. One uses Python, another JSON, and the third relies on obscure API calls. Getting them to collaborate would be like reviving the digital Tower of Babel. This is the challenge that the Agent-to-Agent (A2A) Protocol is designed to solve.

The Agent-to-Agent (A2A) Protocol, introduced by Google Cloud, is an open standard that enables seamless communication and collaboration between AI agents — regardless of the frameworks or vendors they originate from. Like a universal translator, A2A solves the interoperability challenge by providing a common language for agents to share information, delegate tasks, and coordinate actions effectively.

A2A complements Anthropic’s Model Context Protocol (MCP) by focusing on inter-agent communication, leveraging Google’s expertise in deploying large-scale agent systems. Together, these protocols lay the foundation for the future of multi-agent collaboration in enterprise environments. The protocol is supported by over 50 major technology and consulting partners, reflecting a shared vision for scalable, interoperable agent ecosystems.

Core concepts of A2A

The Agent-to-Agent (A2A) Protocol is designed around a set of foundational concepts that enable intelligent agents to collaborate efficiently and reliably. These core building blocks define how agents communicate, manage tasks, and exchange data.

AgentCard
A standardized JSON document that describes an agent’s identity, capabilities, and supported protocols. Typically hosted at the /.well-known/agent.json endpoint, the AgentCard allows other agents or clients to easily discover and understand how to interact with the agent.
Task
A Task represents a stateful collaboration between a client and an agent, tracking progress toward a specific goal. It includes task status, execution history, and references to outputs (artifacts). Tasks enable agents to maintain context across multi-step processes.
Artifact
An Artifact is the final, immutable result produced by an agent during a task. It may include one or more Parts, which are discrete pieces of structured or unstructured content (e.g., text, files, or forms). Artifacts are useful for recording outcomes or sharing final deliverables.
Message
Used to exchange non-artifact content between agents or with clients. Messages may contain instructions, intermediate thoughts, context updates, or task status information. They support dynamic interaction throughout the lifecycle of a task.
Part
A Part is the smallest unit of content within a Message or Artifact. Each Part has a specific content type (e.g., text/plain, application/json, or a file reference), allowing agents to structure complex content with clarity and modularity.
Transport mechanism
The transport mechanism in the A2A protocol determines how agents exchange messages, using technologies like HTTP/HTTPS for simplicity, gRPC for low-latency communication, and MQTT or message buses for asynchronous, event-driven interactions — based on system requirements.
Discovery
Agents need a way to find and connect with each other. A2A supports discovery through mechanisms similar to DNS (e.g., static URLs) or through centralized agent registries — especially useful in enterprise or multi-agent platform settings.
Security and authentication
The A2A protocol ensures secure agent communication through authentication (using API keys, OAuth, or identity assertions), authorization for access control, and encryption (like TLS) to protect sensitive data during transmission.
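
To make the relationship between these building blocks more concrete, here is a minimal sketch that models Part, Message, Artifact, and Task as plain Python dataclasses. The field names and the status strings are simplified assumptions chosen for illustration; they are not the official A2A schema.

```python
from dataclasses import dataclass, field
from typing import List
import uuid

# Simplified, illustrative models; field names are assumptions, not the official schema.

@dataclass
class Part:
    type: str       # content type of this piece, e.g. "text"
    text: str = ""

@dataclass
class Message:
    role: str       # "user" or "agent"
    parts: List[Part] = field(default_factory=list)

@dataclass
class Artifact:
    parts: List[Part] = field(default_factory=list)

@dataclass
class Task:
    task_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    status: str = "submitted"   # illustrative status value
    messages: List[Message] = field(default_factory=list)
    artifacts: List[Artifact] = field(default_factory=list)

# A task collects the conversation (messages) and the final output (artifacts).
task = Task()
task.messages.append(Message(role="user", parts=[Part(type="text", text="Summarize this report")]))
task.artifacts.append(Artifact(parts=[Part(type="text", text="Summary: ...")]))
task.status = "completed"
print(task.status, len(task.messages), len(task.artifacts))
```

The nesting mirrors the definitions above: a Task holds Messages and Artifacts, and both of those are made of Parts.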

A2A vs MCP

Source: A Visual Guide to Agent2Agent (A2A) Protocol

The Agent-to-Agent (A2A) protocol is designed to facilitate collaboration among multiple AI agents. It enables agents to interact securely through tasks, message exchanges, and artifact sharing. These interactions are stateful, allowing complex workflows to unfold over time. A2A supports discovery via JSON-based AgentCards, letting agents or clients locate and communicate with other agents based on capabilities. This protocol encourages modular, decentralized design where agents from different frameworks (like CrewAI or LlamaIndex) can work together as part of a team.

In contrast, the Model Context Protocol (MCP) is focused on enabling AI agents to access tools, plugins, or APIs. It allows a single model to invoke external capabilities like calculators, data retrievers, or code execution environments. MCP is less about inter-agent collaboration and more about augmenting an individual agent’s power through tool access. Importantly, A2A and MCP aren’t competitors — they’re complementary. A2A agents can be exposed as MCP tools, allowing models using MCP to discover and communicate with other agents via A2A. This layered integration bridges tool invocation with multi-agent orchestration, enabling richer, more intelligent systems.

Experimenting with A2A

In this section, we will explore how to build A2A (Agent-to-Agent) applications. There are three main approaches we can follow:

Implementing A2A from scratch using FastAPI,
Using Google’s A2A SDK (a2a-sdk),
Leveraging the Python-A2A library, a comprehensive implementation of Google's A2A protocol.

We will begin by building an A2A implementation from scratch using FastAPI to understand the core concepts. Then, we will explore how to create A2A agents using the official Google A2A SDK. Finally, we will use the Python A2A library, which simplifies the development process and provides a robust interface for enabling seamless communication and collaboration between AI agents.

1. Implementing A2A from scratch using FastAPI

Let’s begin with a basic echo agent, which serves as the “Hello World” of A2A and helps you learn the core concepts by returning whatever input it receives.

Installing uv

We will use uv, a fast and modern Python project manager, to set up and manage our environment. It simplifies tasks like handling dependencies, creating virtual environments, and running scripts.

To install uv, run this in your terminal:

# For Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
set Path=%USERPROFILE%\.local\bin;%Path%

# For linux / Mac
curl -LsSf https://astral.sh/uv/install.sh | sh

Refer to the official website for detailed installation instructions.

Installation | uv

Installing the dependencies

Initialize a uv project by executing the following command.

uv init basic_a2a_demo
cd basic_a2a_demo

Create and activate a virtual environment by executing the following command.

uv venv
source .venv/bin/activate # for linux
.venv\Scripts\activate    # for windows

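Install fastapi, uvicorn, requests, sse-starlette and sseclient-py using uv (the last two provide server-sent event support for streaming). The uuid module ships with the Python standard library, so it does not need to be installed.

uv add fastapi uvicorn requests sse-starlette sseclient-py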

Creating an agent card

The agent.json file acts as your agent’s identity card, describing its purpose and how others can interact with it.

Create a file named agent.json and add the following code to it.

{
  "schema_version": "1.0.0",
  "name": "Echo Agent",
  "description": "I repeat what you say, like a friendly cave.",
  "contact_email": "you@example.com",
  "capabilities": [
    "a2a.text-chat"
  ],
  "versions": [
    {
      "version": "1.0.0",
      "endpoint": "http://localhost:8000/a2a",
      "supports_streaming": true,
      "auth": {
        "type": "none"
      }
    }
  ]
}
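
Before serving the card, it can help to sanity-check it. The sketch below checks a card with the same layout as the one above for the fields this tutorial relies on. The list of required fields is an assumption made for this example (the official A2A AgentCard schema may differ), and validate_agent_card is a hypothetical helper name.

```python
import json

# Assumption: these are the fields this tutorial's card relies on;
# the official A2A AgentCard schema may require a different set.
REQUIRED_FIELDS = ("schema_version", "name", "description", "capabilities", "versions")

def validate_agent_card(card: dict) -> list:
    """Return a list of problems found in the card (an empty list means none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in card]
    for index, version in enumerate(card.get("versions", [])):
        if "endpoint" not in version:
            problems.append(f"versions[{index}] has no endpoint")
    return problems

sample_card = json.loads("""
{
  "schema_version": "1.0.0",
  "name": "Echo Agent",
  "description": "I repeat what you say, like a friendly cave.",
  "capabilities": ["a2a.text-chat"],
  "versions": [{"version": "1.0.0", "endpoint": "http://localhost:8000/a2a"}]
}
""")
print(validate_agent_card(sample_card))  # []
```

To check the real file instead of the inline sample, load agent.json with json.load and pass the result to the same function.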

Building A2A server

Build a basic server that listens for messages and sends them back exactly as received, like a digital echo.

Create a file named echo_server.py and add the following code to it.

from datetime import datetime, timezone
import json
import uuid

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

# Serve the agent card at the standard discovery location
@app.get("/.well-known/agent.json")
async def get_agent_card():
    with open("agent.json") as f:
        return json.load(f)

# Handle regular (non-streaming) requests
@app.post("/a2a/tasks/send")
async def tasks_send(request: Request):
    data = await request.json()
    task_id = data.get("task_id", str(uuid.uuid4()))
    # Pick the first message that was sent by the user
    user_message = next((m for m in data.get("messages", [])
                        if m.get("role") == "user"), None)

    if not user_message:
        return JSONResponse(status_code=400, content={"error": "No user message found"})

    # Join all text parts and echo them back
    parts = user_message.get("parts", [])
    text_parts = [p.get("text", "") for p in parts if p.get("type") == "text"]
    echo_text = f"Echo: {' '.join(text_parts)}"

    # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
    now = datetime.now(timezone.utc).isoformat()

    return {
        "task_id": task_id,
        "status": "completed",
        "created_time": now,
        "updated_time": now,
        "messages": [
            {
                "role": "agent",
                "parts": [{"type": "text", "text": echo_text}]
            }
        ]
    }

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
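
The message-handling logic in the server above does not depend on FastAPI, so it can be pulled into a plain function and checked without starting a server. The following is a sketch of that idea; build_echo_text is a hypothetical helper name, and the payload layout is the simplified one used in this tutorial.

```python
from typing import Optional

def build_echo_text(payload: dict) -> Optional[str]:
    """Return the echo text for a tasks/send payload, or None if no user message exists."""
    # Find the first message whose role is "user"
    user_message = next(
        (m for m in payload.get("messages", []) if m.get("role") == "user"), None
    )
    if user_message is None:
        return None
    # Collect only the text parts and join them with spaces
    texts = [
        p.get("text", "")
        for p in user_message.get("parts", [])
        if p.get("type") == "text"
    ]
    return f"Echo: {' '.join(texts)}"

payload = {
    "messages": [
        {"role": "user", "parts": [{"type": "text", "text": "hello"},
                                   {"type": "text", "text": "world"}]}
    ]
}
print(build_echo_text(payload))  # Echo: hello world
```

Keeping the endpoint as a thin wrapper around a function like this makes the echo behavior easy to unit test.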

Creating A2A client

Create a client to communicate with the server, much like crafting a remote control for your freshly built device. Create a file named echo_client.py and add the following code to it.

import requests
import uuid

class SimpleA2AClient:
    def __init__(self, server_url):
        self.server_url = server_url
        
    def discover_agent(self):
        response = requests.get(f"{self.server_url}/.well-known/agent.json")
        response.raise_for_status()
        return response.json()
        
    def send_message(self, text):
        task_id = str(uuid.uuid4())
        
        # Prepare our request in A2A format
        payload = {
            "task_id": task_id,
            "messages": [
                {
                    "role": "user",
                    "parts": [
                        {
                            "type": "text",
                            "text": text
                        }
                    ]
                }
            ]
        }
        
        # Send the request to the agent
        endpoint = f"{self.server_url}/a2a/tasks/send"
        response = requests.post(endpoint, json=payload)
        response.raise_for_status()
        return response.json()

if __name__ == "__main__":
    # Create our client
    client = SimpleA2AClient("http://localhost:8000")
    
    # Check out the agent's capabilities
    agent_card = client.discover_agent()
    print(f"Found agent: {agent_card['name']}")
    
    # Send a message
    message = input("Type a message to send: ")
    response = client.send_message(message)
    
    # Extract and display the agent's response
    agent_message = response.get("messages", [{}])[0]
    agent_parts = agent_message.get("parts", [{}])
    response_text = agent_parts[0].get("text", "No response") if agent_parts else "No parts"
    
    print(f"\nAgent response: {response_text}")

Executing the app

Open two terminal windows.
In the first terminal, start the Echo Server:

python echo_server.py

In the second terminal, run the client:


python echo_client.py

The output will look like this,

2. Implementing A2A using Google A2A SDK

The A2A Protocol provides a standardized framework that allows agents to work together intelligently and efficiently. The Echo example from the previous section highlights the core concepts of A2A. Real-world applications extend this by connecting agents to language models, databases, and APIs; enabling streaming for real-time updates; adding authentication for secure communication; and coordinating multiple agents to solve complex tasks. Since building these features from scratch is challenging, developers often rely on libraries such as Google's Agent Development Kit (ADK), the Google A2A SDK, and Python A2A to simplify the development process.

Installing the dependencies

Initialize a uv project by executing the following command.

uv init google_a2a_demo
cd google_a2a_demo

Create and activate a virtual environment by executing the following command.

uv venv
source .venv/bin/activate # for linux
.venv\Scripts\activate    # for windows

Install a2a-sdk and uvicorn using uv.

uv add a2a-sdk uvicorn

Creating Agent executor

To handle tasks, we need to create an Agent Executor. In a real-world application, this would involve connecting to an LLM or executing other complex logic. For our “Hello World” example, we’ll implement a minimal handler: whenever the agent receives a request, it simply responds with “Hello World”.

The A2A SDK provides an AgentExecutor base class. You subclass it and implement execute (and cancel) to define how the agent handles a request. For this example, execute simply calls a function that returns the string "Hello World" and places the reply on the event queue.

Create a file named agent_executor.py and add the following code to it.

from a2a.server.agent_execution import AgentExecutor, RequestContext
from a2a.server.events import EventQueue
from a2a.utils import new_agent_text_message

class HelloWorldAgent:
    async def invoke(self) -> str:
        return 'Hello World'

class HelloWorldAgentExecutor(AgentExecutor):
    def __init__(self):
        self.agent = HelloWorldAgent()

    async def execute(
        self,
        context: RequestContext,
        event_queue: EventQueue,
    ) -> None:
        result = await self.agent.invoke()
        await event_queue.enqueue_event(new_agent_text_message(result))

    async def cancel(
        self, context: RequestContext, event_queue: EventQueue
    ) -> None:
        raise Exception('cancel not supported')
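
To see the executor pattern in isolation, the sketch below reproduces the same flow with only the standard library: the executor awaits the agent and puts the reply on a queue, and the caller reads it back. Here asyncio.Queue and the SketchExecutor class are stand-ins chosen for illustration; they are not the SDK's EventQueue or AgentExecutor.

```python
import asyncio

class HelloWorldAgent:
    async def invoke(self) -> str:
        return "Hello World"

class SketchExecutor:
    """Illustrative stand-in for an executor: runs the agent and enqueues its reply."""

    def __init__(self) -> None:
        self.agent = HelloWorldAgent()

    async def execute(self, event_queue: asyncio.Queue) -> None:
        result = await self.agent.invoke()
        # The real SDK wraps the text in a message object; a dict is used here instead.
        await event_queue.put({"role": "agent", "text": result})

async def run_once() -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    await SketchExecutor().execute(queue)
    return await queue.get()

event = asyncio.run(run_once())
print(event)
```

Decoupling the executor from the transport through a queue is what lets the same agent logic serve both regular and streaming responses.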

Setting up the A2A server

Let's create a simple “Hello World” A2A (Agent-to-Agent) server using the A2A SDK. It defines basic skills, configures the agent’s public and extended profiles, and launches the server to handle requests via HTTP.

Add the following content to the main.py file.

import uvicorn

from a2a.server.apps import A2AStarletteApplication
from a2a.server.request_handlers import DefaultRequestHandler
from a2a.server.tasks import InMemoryTaskStore
from a2a.types import AgentCapabilities, AgentCard, AgentSkill
from agent_executor import HelloWorldAgentExecutor

if __name__ == '__main__':
    skill = AgentSkill(
        id='hello_world',
        name='Returns hello world',
        description='just returns hello world',
        tags=['hello world'],
        examples=['hi', 'hello world'],
    )

    extended_skill = AgentSkill(
        id='super_hello_world',
        name='Returns a SUPER Hello World',
        description='A more enthusiastic greeting, only for authenticated users.',
        tags=['hello world', 'super', 'extended'],
        examples=['super hi', 'give me a super hello'],
    )

    # This will be the public-facing agent card
    public_agent_card = AgentCard(
        name='Hello World Agent',
        description='Just a hello world agent',
        url='http://localhost:9999/',
        version='1.0.0',
        default_input_modes=['text'],
        default_output_modes=['text'],
        capabilities=AgentCapabilities(streaming=True),
        skills=[skill],
        supports_authenticated_extended_card=True,
    )

    # This will be the authenticated extended agent card
    specific_extended_agent_card = public_agent_card.model_copy(
        update={
            'name': 'Hello World Agent - Extended Edition',  # Different name for clarity
            'description': 'The full-featured hello world agent for authenticated users.',
            'version': '1.0.1',  # Could even be a different version
            'skills': [
                skill,
                extended_skill,
            ],
        }
    )

    request_handler = DefaultRequestHandler(
        agent_executor=HelloWorldAgentExecutor(),
        task_store=InMemoryTaskStore(),
    )

    server = A2AStarletteApplication(
        agent_card=public_agent_card,
        http_handler=request_handler,
        extended_agent_card=specific_extended_agent_card,
    )

    uvicorn.run(server.build(), host='0.0.0.0', port=9999)

In this code,

We define a skill (AgentSkill) that returns a simple "Hello World" message, along with an extended version for authenticated users.
An agent card (AgentCard) describes the agent’s capabilities, including its skills and supported input/output modes. A separate extended agent card provides enhanced functionality for authenticated clients.
The request handler connects the logic (HelloWorldAgentExecutor) to an in-memory task store.
Finally, we use A2AStarletteApplication (built on Starlette) to expose the agent as an HTTP service and run it using Uvicorn on port 9999.

Setting up the client

Let's set up an HTTP client, fetch the agent’s card (including an optional extended version for authenticated access), initialize the A2A client, and send a message query ("How much is 10 USD in INR?") as both a regular and a streaming message. Note that the Hello World agent ignores the question and always replies with "Hello World".

Create a file named test_client.py and add the following code to it.

import logging
from typing import Any
from uuid import uuid4
import httpx

from a2a.client import A2ACardResolver, A2AClient
from a2a.types import (
    AgentCard,
    MessageSendParams,
    SendMessageRequest,
    SendStreamingMessageRequest,
)


async def main() -> None:
    PUBLIC_AGENT_CARD_PATH = '/.well-known/agent.json'
    EXTENDED_AGENT_CARD_PATH = '/agent/authenticatedExtendedCard'

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)  

    base_url = 'http://localhost:9999'

    async with httpx.AsyncClient() as httpx_client:
        # Initialize A2ACardResolver
        resolver = A2ACardResolver(
            httpx_client=httpx_client,
            base_url=base_url,
        )
        # Fetch Public Agent Card and Initialize Client
        final_agent_card_to_use: AgentCard | None = None

        try:
            logger.info(f'Attempting to fetch public agent card from: {base_url}{PUBLIC_AGENT_CARD_PATH}')
            _public_card = await resolver.get_agent_card()
            logger.info('Successfully fetched public agent card:')
            logger.info(_public_card.model_dump_json(indent=2, exclude_none=True))
            final_agent_card_to_use = _public_card
            logger.info('\nUsing PUBLIC agent card for client initialization (default).')

            if _public_card.supports_authenticated_extended_card:
                try:
                    logger.info(f'\nPublic card supports authenticated extended card. Attempting to fetch from: {base_url}{EXTENDED_AGENT_CARD_PATH}')
                    auth_headers_dict = {
                        'Authorization': 'Bearer dummy-token-for-extended-card'
                    }
                    _extended_card = await resolver.get_agent_card(
                        relative_card_path=EXTENDED_AGENT_CARD_PATH,
                        http_kwargs={'headers': auth_headers_dict},
                    )
                    logger.info('Successfully fetched authenticated extended agent card:')
                    logger.info(
                        _extended_card.model_dump_json(
                            indent=2, exclude_none=True
                        )
                    )
                    final_agent_card_to_use = _extended_card
                    logger.info('\nUsing AUTHENTICATED EXTENDED agent card for client initialization.')
                except Exception as e_extended:
                    logger.warning(
                        f'Failed to fetch extended agent card: {e_extended}. Will proceed with public card.',
                        exc_info=True,
                    )
            elif _public_card:  # supports_authenticated_extended_card is False or None
                logger.info('\nPublic card does not indicate support for an extended card. Using public card.')

        except Exception as e:
            logger.error(f'Critical error fetching public agent card: {e}', exc_info=True)
            raise RuntimeError(
                'Failed to fetch the public agent card. Cannot continue.'
            ) from e

        client = A2AClient(
            httpx_client=httpx_client, agent_card=final_agent_card_to_use
        )
        logger.info('A2AClient initialized.')

        send_message_payload: dict[str, Any] = {
            'message': {
                'role': 'user',
                'parts': [
                    {'kind': 'text', 'text': 'How much is 10 USD in INR?'}
                ],
                'messageId': uuid4().hex,
            },
        }
        request = SendMessageRequest(
            id=str(uuid4()), params=MessageSendParams(**send_message_payload)
        )
        response = await client.send_message(request)
        print(response.model_dump(mode='json', exclude_none=True))

        streaming_request = SendStreamingMessageRequest(
            id=str(uuid4()), params=MessageSendParams(**send_message_payload)
        )
        stream_response = client.send_message_streaming(streaming_request)

        async for chunk in stream_response:
            print(chunk.model_dump(mode='json', exclude_none=True))

if __name__ == '__main__':
    import asyncio

    asyncio.run(main())

In this code,

An asynchronous HTTP client (httpx.AsyncClient) is used to fetch a public agent card from a locally running A2A server. It then attempts to retrieve an extended, authenticated version of the card using a dummy bearer token.
With the obtained card (either public or extended), the script initializes an A2AClient, which is then used to send a currency conversion query ("How much is 10 USD in INR?").
The response is printed using both the standard send_message method and the streaming send_message_streaming method, showcasing how to handle real-time agent replies.
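
It can be useful to picture what travels over the wire. The sketch below builds a request envelope by hand, assuming the JSON-RPC 2.0 framing and the message/send method name described in the A2A specification at the time of writing. It approximates what the SDK's SendMessageRequest serializes to and is not output captured from the SDK; build_send_message_request is a hypothetical helper name.

```python
import json
from uuid import uuid4

def build_send_message_request(text: str) -> dict:
    """Build an approximate JSON-RPC envelope for sending one user text message."""
    return {
        "jsonrpc": "2.0",
        "id": str(uuid4()),            # request id, used to match the response
        "method": "message/send",      # assumption: method name from the A2A spec
        "params": {
            "message": {
                "role": "user",
                "parts": [{"kind": "text", "text": text}],
                "messageId": uuid4().hex,
            }
        },
    }

request = build_send_message_request("How much is 10 USD in INR?")
print(json.dumps(request, indent=2))
```

The params section has the same shape as the send_message_payload dictionary in the client code above.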

A2A Client-Server Interaction Flow

Image source: A2A Samples: Hello World Agent | A2A Protocol

Executing the app

Open two terminal windows.
In the first terminal, start the A2A Server:

uv run main.py

In the second terminal, run the client:

python test_client.py

The output will look like this,

3. Implementing A2A using Python-A2A library

The Python-A2A library is a powerful and developer-friendly library built on top of Google’s Agent-to-Agent (A2A) protocol. While Google’s native A2A SDK provides foundational tools to build interoperable agents and define standardized communication formats, Python-A2A simplifies the development process with a higher-level interface. It is ideal for developers looking to quickly prototype, deploy, and scale intelligent agents that can autonomously communicate and collaborate using the A2A standard. It bridges the gap between low-level protocol details and high-level use cases.

Installing the dependencies

Initialize a uv project by executing the following command.

uv init python_a2a_demo
cd python_a2a_demo

Create and activate a virtual environment by executing the following command.

uv venv
source .venv/bin/activate # for linux
.venv\Scripts\activate    # for windows

Install python-a2a and python-dotenv using uv.

uv add "python-a2a[all]" python-dotenv

Example 1: Echo Agent

Let's begin by building a simple A2A-compatible agent using the python-a2a library. The agent, named "Echo Agent", is designed to echo the user's messages.

Creating echo agent server
Create a file named echo_agent.py and add the following code to create a basic A2A agent server.

from python_a2a import A2AServer, Message, TextContent, MessageRole, run_server

class EchoAgent(A2AServer):
    def handle_message(self, message):
        if message.content.type == "text":
            return Message(
                content=TextContent(text=f"Echo: {message.content.text}"),
                role=MessageRole.AGENT,
                parent_message_id=message.message_id,
                conversation_id=message.conversation_id
            )

if __name__ == "__main__":
    agent = EchoAgent()
    run_server(agent, host="0.0.0.0", port=5000)

In this code,

Defines an EchoAgent class by extending A2AServer to handle incoming A2A protocol messages.
When a text message is received, it responds with an echo by prepending "Echo:" to the input.
The response maintains conversation context using message and conversation IDs.
Launches the agent using run_server on host 0.0.0.0 and port 5000, making it accessible for incoming agent-to-agent communication.

Creating echo client
Create a file named echo_client.py and add the following code to create a client for the echo agent.

from python_a2a import A2AClient, Message, TextContent, MessageRole

client = A2AClient("http://localhost:5000/a2a")
message = Message(
    content=TextContent(text="Hello, Good morning!"),
    role=MessageRole.USER
)
response = client.send_message(message)
print(f"Agent says: {response.content.text}")

In this code,

Creates an A2AClient that connects to an agent running at http://localhost:5000/a2a.
Constructs a user message with the content "Hello, Good morning!" using the A2A message format.
Sends the message to the agent using client.send_message(...).
Receives and prints the agent’s response, showing the echoed reply from the server.

Executing the code
Open separate terminals in the project folder, activate the virtual environment in each, and run the agent server and client separately to execute the code.

# terminal 1
python echo_agent.py
# terminal 2
python echo_client.py

The output will look like the following:

Example 2: Basic A2A Agent

Next, let's build another simple A2A-compatible agent using the python-a2a library. The agent, named "Greeting Agent", is designed to detect greetings in a user's message and respond accordingly.

Creating greeting agent server
Create a file named greeting_agent.py and add the following code to create a basic A2A agent server.

from python_a2a import A2AServer, skill, agent, run_server
from python_a2a import TaskStatus, TaskState

@agent(
    name="Greeting Agent",
    description="A simple agent that responds to greetings",
    version="1.0.0"
)
class GreetingAgent(A2AServer):

    @skill(
        name="Greet",
        description="Respond to a greeting",
        tags=["greeting", "hello"]
    )
    def greet(self, name=None):
        if name:
            return f"Hello, {name}! How can I help you today?"
        else:
            return "Hello there! How can I help you today?"

    def handle_task(self, task):
        message_data = task.message or {}
        content = message_data.get("content", {})
        text = content.get("text", "") if isinstance(content, dict) else ""

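        greeting_words = ["hello", "hi", "hey", "greetings"]
        # Note: this is a substring check, so "hi" also matches words like "this"
        is_greeting = any(word in text.lower() for word in greeting_words)

        if is_greeting:
            name = None
            marker = "my name is"
            idx = text.lower().find(marker)
            if idx != -1:
                # Slice the original text to keep the name's capitalization,
                # then trim whitespace and trailing punctuation
                name = text[idx + len(marker):].strip(" .!?,")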

            greeting = self.greet(name)
            task.artifacts = [{
                "parts": [{"type": "text", "text": greeting}]
            }]
            task.status = TaskStatus(state=TaskState.COMPLETED)
        else:
            task.artifacts = [{
                "parts": [{"type": "text", "text": "I'm a greeting agent. Try saying hello!"}]
            }]
            task.status = TaskStatus(state=TaskState.COMPLETED)

        return task

# Run the server
if __name__ == "__main__":
    agent = GreetingAgent()
    run_server(agent, port=5000)

In this code,

The agent is defined using the @agent decorator, and a skill named Greet is added with the @skill decorator to handle greeting responses.
The handle_task method identifies whether the incoming message is a greeting and responds accordingly, using the skill defined.
Finally, the server runs the agent on port 5000 using run_server.
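
The message-parsing logic inside handle_task can be tried out without starting a server. The following is a minimal sketch, assuming the incoming message is a plain dict shaped like the one handle_task reads. The helper names extract_text and build_greeting are hypothetical and are not part of python-a2a.

```python
GREETING_WORDS = ["hello", "hi", "hey", "greetings"]
NAME_MARKER = "my name is"


def extract_text(message):
    """Pull the text out of a message dict shaped like {"content": {"text": ...}}."""
    content = (message or {}).get("content", {})
    return content.get("text", "") if isinstance(content, dict) else ""


def build_greeting(text):
    """Return a greeting reply, or a fallback hint when no greeting word is found."""
    lowered = text.lower()
    # Substring matching mirrors the agent code, so "hi" also matches inside "this".
    if not any(word in lowered for word in GREETING_WORDS):
        return "I'm a greeting agent. Try saying hello!"
    if NAME_MARKER in lowered:
        # Slice the original text so the name keeps its capitalization.
        start = lowered.index(NAME_MARKER) + len(NAME_MARKER)
        name = text[start:].strip().rstrip(".!?")
        if name:
            return f"Hello, {name}! How can I help you today?"
    return "Hello there! How can I help you today?"


message = {"content": {"type": "text", "text": "Hello there! My name is Vishnu."}}
print(build_greeting(extract_text(message)))
```

Because the check is a plain substring match, short words such as "hi" can trigger false positives; a word-boundary match would be stricter if that matters for your use case.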

Creating greeting agent client
Create a file named greeting_client.py and add the following code to create a client for the greeting agent.

from python_a2a import A2AClient

# Create a client
client = A2AClient("http://localhost:5000")

# Print agent information
print(f"Connected to: {client.agent_card.name}")
print(f"Description: {client.agent_card.description}")
print(f"Skills: {[skill.name for skill in client.agent_card.skills]}")

# Send a greeting
response = client.ask("Hello there! My name is Vishnu.")
print(f"Response: {response}")

# Send another message
response = client.ask("What can you do?")
print(f"Response: {response}")

Executing the code
Open separate terminals in the project folder, activate the virtual environment in each, and run the agent server and client separately to execute the code.

# terminal 1
python greeting_agent.py
# terminal 2
python greeting_client.py

The output will look like the following:

Example 3: LLM-Based Agent

Let's build an OpenAI-powered Agent-to-Agent (A2A) server using the python-a2a library. The agent uses OpenAI's GPT model to respond to queries.

Create .env File
In your project root directory, create a file named .env with the following content:

OPENAI_API_KEY=your_openai_api_key_here

Creating LLM Server
Create a file named llm_agent.py and add the following code to it.

from python_a2a import OpenAIA2AServer, run_server

import os
from dotenv import load_dotenv
load_dotenv()
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise ValueError("Please set the OPENAI_API_KEY environment variable.")

# Create an OpenAI-based A2A agent
agent = OpenAIA2AServer(
    api_key=api_key,
    model="gpt-4",
    system_prompt="You are a helpful assistant that specializes in explaining complex concepts simply."
)

# Run the server
if __name__ == "__main__":
    print("Starting OpenAI-based A2A agent...")
    run_server(agent, host="0.0.0.0", port=5000)

Creating LLM Client
Create a file named llm_client.py and add the following code to it.

from python_a2a import A2AClient

# Connect to the OpenAI-based A2A agent
client = A2AClient("http://localhost:5000")

# Print agent metadata
print(f"Connected to: {client.agent_card.name}")
print(f"Description: {client.agent_card.description}")
print("Skills:")
for skill in client.agent_card.skills:
    print(f" - {skill.name}: {skill.description}")

# Send some example questions to test
messages = [
    "In short, why is the speed of light constant?",
    "Explain the Bell inequality experiment in simple terms.",
]

# Send messages and print responses
for msg in messages:
    response = client.ask(msg)
    print(f"\nUser: {msg}")
    print(f"Agent: {response}")

Executing the code
Open separate terminals in the project folder, activate the virtual environment in each, and run the agent server and client separately to execute the code.

# terminal 1
python llm_agent.py
# terminal 2
python llm_client.py

The output will look like the following:

Example 4: Converting LangChain to A2A Servers

You can turn any LangChain agent or chain into an A2A-compatible server.

LangChain requires a few additional libraries for proper integration. Run the following command to install the necessary dependencies into the project.

uv add langchain-community langchain-openai numexpr

Create a file named langchain_to_a2a_server.py and add the following code to it.

from langchain.chains import LLMMathChain
from langchain_openai import OpenAI
from python_a2a.langchain import to_a2a_server
from python_a2a import run_server

import os
from dotenv import load_dotenv
load_dotenv()
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise ValueError("Please set the OPENAI_API_KEY environment variable.")

# Create a LangChain chain
llm = OpenAI(temperature=0)
math_chain = LLMMathChain.from_llm(llm)  # from_llm replaces the deprecated direct constructor

# Convert to A2A server
a2a_server = to_a2a_server(math_chain)

# Run the server
if __name__ == "__main__":
    run_server(a2a_server, port=5000)

You can reuse the previously created llm_client.py with this agent server. Run both the server and the client to see the result.

The output will look like the following:

Example 5: Converting MCP Tools to LangChain Tools

You can convert any MCP tools into LangChain-compatible tools, enabling them to be used directly with LangChain agents.

from python_a2a.mcp import FastMCP
from python_a2a.langchain import to_langchain_tool
from python_a2a import run_server
from langchain.agents import initialize_agent, AgentType
from langchain_openai import OpenAI

# Create an MCP server with tools
calculator = FastMCP(name="Calculator MCP")

@calculator.tool()
def add(a: float, b: float) -> float:
    """Add two numbers together."""
    return a + b

@calculator.tool()
def subtract(a: float, b: float) -> float:
    """Subtract b from a."""
    return a - b

# Run the MCP server in a background thread
import threading
import time

server_thread = threading.Thread(
    target=run_server,
    args=(calculator,),
    kwargs={"port": 8000},
    daemon=True
)
server_thread.start()
time.sleep(2)  # Give the server a moment to start before connecting to it

# Convert MCP tools to LangChain tools
add_tool = to_langchain_tool("http://localhost:8000", "add")
subtract_tool = to_langchain_tool("http://localhost:8000", "subtract")

# Use in a LangChain agent
llm = OpenAI(temperature=0)
tools = [add_tool, subtract_tool]

agent = initialize_agent(
    tools, 
    llm, 
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# Run the agent
result = agent.run("Add 15 and 27, then subtract 5 from the result")
print(result)

Building a travel planner app using A2A

In this section, we will build a multi-agent Travel Planner powered by the A2A (Agent-to-Agent) protocol. This system leverages LangChain, the Ollama LLM, and multiple A2A-compatible agents that work collaboratively to retrieve weather information, perform web searches, and generate a comprehensive travel recommendation.

Application flow

The user requests a travel plan (e.g., “Plan a trip to Kerala”).
The Travel Planner asks the Weather Agent for the forecast.
Based on the weather (clear or rainy), it decides the activity type.
It then queries the Tavily Search Agent for suitable activities.
A local LLM (e.g., via Ollama) compiles this into a final itinerary — ensuring user privacy without third-party data sharing.
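
The flow above can be sketched as a plain function before any agent servers exist. This is a simplified stand-in, not the final implementation: the three agents are passed in as ordinary callables, and fake_weather, fake_search, and fake_llm are hypothetical stubs with made-up responses.

```python
def plan_trip(destination, ask_weather, ask_search, ask_llm):
    """Run the weather -> search -> LLM pipeline using injected agent callables."""
    forecast = ask_weather(f"What's the weather in {destination}?")
    # Pick the activity type from keywords in the forecast text.
    is_clear = any(word in forecast.lower() for word in ("sunny", "clear"))
    activity_type = "outdoor" if is_clear else "indoor"
    activities = ask_search(f"Recommend {activity_type} activities in {destination}")
    plan = ask_llm(f"Forecast: {forecast}\nActivities: {activities}\nSuggest an itinerary.")
    return activity_type, plan


# Stub agents standing in for the real A2A servers (hypothetical responses).
def fake_weather(question):
    return "The weather in Kerala is light rain with a temperature of 79°F."


def fake_search(query):
    return f"Results for: {query}"


def fake_llm(prompt):
    return f"Itinerary based on -> {prompt.splitlines()[1]}"


activity_type, plan = plan_trip("Kerala", fake_weather, fake_search, fake_llm)
print(activity_type)
print(plan)
```

Passing the agents in as callables keeps the decision logic testable on its own; the real application swaps the stubs for A2A clients.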

Prerequisites

This hands-on requires the following tools to be installed on your machine:

1. Ollama: Ollama is a platform for running large language models locally on your computer.

Download Ollama on Windows

Run the following command in your terminal to pull the model using Ollama:

ollama pull llama3.2

2. Python: Python is the core language used in this hands-on for scripting and backend logic.

Download Python

3. uv: uv is a fast, modern Python package and project manager that we use to set up and manage our environment.

To install uv, run this in your terminal:

# For Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
set Path=%USERPROFILE%\.local\bin;%Path%

# For Linux / macOS
curl -LsSf https://astral.sh/uv/install.sh | sh

Refer to the official website for detailed installation instructions.

Installation | uv

Installing the dependencies

Initialize a uv project by executing the following command.

uv init travel_planner
cd travel_planner

Create and activate a virtual environment by executing the following command.

uv venv
source .venv/bin/activate # for linux
.venv\Scripts\activate    # for windows

Install python-a2a, langchain-ollama, python-dotenv, tavily-python and streamlit using uv.

uv add python-a2a langchain-ollama python-dotenv tavily-python streamlit

Setting up the environment

This hands-on project uses Tavily and OpenWeather API keys.

Visit the Tavily official website and sign in to obtain your API key.
Go to the OpenWeather website and create a new account. Once you complete the sign-up, you will receive your API key by email.
In the root directory of your project, create a .env file and add the following content:

OPENWEATHER_API_KEY=fedc837bbb7477...
TAVILY_API_KEY=tvly-dev-h5xKcqcytBeJQ...

Creating Weather Agent

Create a file named WeatherAgent.py and add the following code to it.

from python_a2a import A2AServer, skill, agent, run_server, TaskStatus, TaskState

import os
import requests
import logging

from dotenv import load_dotenv
load_dotenv()
api_key = os.environ.get("OPENWEATHER_API_KEY")


@agent(
    name="Weather Agent",
    description="Provides weather information",
    version="1.0.0",
    url="https://zzz.example.com"
)
class WeatherAgent(A2AServer):
    
    @skill(
        name="Get Weather",
        description="Get current weather for a location",
        tags=["weather", "forecast"],
        examples="I am a weather agent for getting weather forecast from Open weather"
    )
    def get_weather(self, location):
        if not api_key:
            return "Weather service not available (missing API key)."
        
        try:
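            # Pass query parameters separately so requests URL-encodes the location
            url = "https://api.openweathermap.org/data/2.5/weather"
            params = {"q": location, "units": "imperial", "appid": api_key}
            logging.debug(f"Requesting weather for: {location}")  # Avoid logging the API key

            response = requests.get(url, params=params, timeout=5)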
            response.raise_for_status()
            logging.debug(f"Response Status Code: {response.status_code}")  # Log status code
            logging.debug(f"Response Text: {response.text}")  # Log raw response text
            
            data = response.json()
            
            temp = data["main"]["temp"]
            description = data["weather"][0]["description"]
            city_name = data["name"]

            logging.debug(f"Parsed Data: Temp = {temp}, Description = {description}, City = {city_name}")
            
            return f"The weather in {city_name} is {description} with a temperature of {temp}°F."
        
        except requests.RequestException as e:
            return f"Error fetching weather: {e}"
        except (KeyError, IndexError, TypeError):
            return "Could not parse weather data."
    
    def handle_task(self, task):
        # Extract location from message
        message_data = task.message or {}
        content = message_data.get("content", {})
        text = content.get("text", "") if isinstance(content, dict) else ""
        
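        lowered = text.lower()
        if "weather" in lowered and " in " in lowered:
            # Take everything after the first standalone " in " as the location
            start = lowered.index(" in ") + len(" in ")
            location = text[start:].strip().rstrip("?.")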
            
            # Get weather and create response
            weather_text = self.get_weather(location)
            task.artifacts = [{
                "parts": [{"type": "text", "text": weather_text}]
            }]
            task.status = TaskStatus(state=TaskState.COMPLETED)
        else:
            task.status = TaskStatus(
                state=TaskState.INPUT_REQUIRED,
                message={"role": "agent", "content": {"type": "text", 
                         "text": "Please ask about weather in a specific location."}}
            )
        return task

# Run the server
if __name__ == "__main__":
    agent = WeatherAgent(google_a2a_compatible=True)
    run_server(agent, port=8001, debug=True)

In this code,

Loads the OpenWeather API key from .env and defines a weather agent using A2AServer.
Registers a skill to fetch and return weather data for a given location using OpenWeatherMap API.
Handles user tasks by extracting location from the message and invoking the weather skill.
Responds with the weather info or prompts the user if the location is unclear.
Runs the agent server on port 8001 with debug mode enabled.
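
The parsing step can be checked offline against a sample payload. The following is a minimal sketch that reads the same three fields as the agent (main.temp, weather[0].description, and name). The sample values are made up, and summarize_weather is a hypothetical helper name.

```python
def summarize_weather(data):
    """Format the fields the agent reads from an OpenWeather-style response."""
    try:
        temp = data["main"]["temp"]
        description = data["weather"][0]["description"]
        city_name = data["name"]
    except (KeyError, IndexError, TypeError):
        # IndexError covers an empty "weather" list.
        return "Could not parse weather data."
    return f"The weather in {city_name} is {description} with a temperature of {temp}°F."


sample = {
    "name": "Kochi",
    "main": {"temp": 84.2},
    "weather": [{"description": "scattered clouds"}],
}
print(summarize_weather(sample))
print(summarize_weather({"cod": "404", "message": "city not found"}))
```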

Creating Tavily Search Agent

Create a file named TavilySearchAgent.py and add the following code to it.

from python_a2a import A2AServer, skill, agent, run_server, TaskStatus, TaskState
from tavily import TavilyClient
import os
import logging

from dotenv import load_dotenv
load_dotenv()
api_key = os.environ.get("TAVILY_API_KEY")

@agent(
    name="Tavily Search Agent",
    description="Performs internet search using Tavily API",
    version="1.0.0",
    url="https://yourdomain.com"
)
class TavilySearchAgent(A2AServer):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.client = TavilyClient(api_key)

    @skill(
        name="Search Internet",
        description="Perform a web search using Tavily API",
        tags=["search", "internet", "tavily"],
        examples="Search 'must visit places in utah in may'"
    )
    def search(self, query: str):
        """Perform search using Tavily Search API"""
        try:
            response = self.client.search(query=query)

            results = response.get("results", [])
            if not results:
                return "No search results found."

            summary = "\n".join(
                [f"- {r.get('title')}: {r.get('url')}" for r in results]
            )
            return f"Top results for '{query}':\n{summary}"

        except Exception as e:
            logging.error(f"Error during Tavily search: {e}")
            return f"Search failed: {e}"

    def handle_task(self, task):
        message_data = task.message or {}
        content = message_data.get("content", {})
        text = content.get("text", "") if isinstance(content, dict) else ""

        if text.strip():
            query = text.strip()
            result = self.search(query)
            task.artifacts = [{
                "parts": [{"type": "text", "text": result}]
            }]
            task.status = TaskStatus(state=TaskState.COMPLETED)
        else:
            task.status = TaskStatus(
                state=TaskState.INPUT_REQUIRED,
                message={"role": "agent", "content": {"type": "text", 
                         "text": "Please provide a search query."}}
            )
        return task


if __name__ == "__main__":
    agent = TavilySearchAgent(google_a2a_compatible=True)
    run_server(agent, port=8002, debug=True)

Creating Local LLM Agent

Create a file named LocalLLMAgent.py and add the following code to it.

from python_a2a import run_server
from python_a2a.langchain import to_a2a_server
from langchain_ollama.llms import OllamaLLM

# Create a LangChain LLM
#llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
llm = OllamaLLM(model="llama3.2:latest")

# Convert LLM to A2A server
llm_server = to_a2a_server(llm)

if __name__ == "__main__":
    print("Starting LLM A2A server on port 5001...")
    run_server(llm_server, port=5001)

Creating Travel Planner Agent

Create a file named TravelPlannerApp.py and add the following code to it.

import streamlit as st
from python_a2a import AgentNetwork, A2AClient
import asyncio

# Function to run async logic inside Streamlit
def run_async(coro):
    return asyncio.run(coro)

async def plan_trip(destination, travel_dates):
    # Create an agent network
    network = AgentNetwork(name="Travel Assistant Network")
    network.add("weather", "http://localhost:8001")
    network.add("search", "http://localhost:8002")

    # Get agents
    weather_agent = network.get_agent("weather")
    search_agent = network.get_agent("search")
    llm_client = A2AClient("http://localhost:5001")

    # Get weather forecast
    forecast = weather_agent.ask(f"What's the weather in {destination}?")

    # Search based on weather
    if "sunny" in forecast.lower() or "clear" in forecast.lower():
        activities = search_agent.ask(f"Recommend outdoor activities in {destination}")
    else:
        activities = search_agent.ask(f"Recommend indoor activities in {destination}")

    # Summarize using LLM
    prompt = (
        f"You are a travel assistant. Based on the weather forecast result '{forecast}' "
        f"and the recommendations [{activities}], suggest me a few must-see attractions "
        f"on date {travel_dates}."
    )

    llm_result = llm_client.ask(prompt)

    return forecast, activities, llm_result

# Streamlit UI
st.set_page_config(page_title="🧳 Travel Planner Assistant")

st.title("🧭 Travel Planner Assistant")
st.write("Get personalized trip suggestions based on real-time weather and recommendations.")

destination = st.text_input("Enter destination", value="Kerala, India")
travel_dates = st.text_input("Enter travel dates", value="August 1-5")

if st.button("Plan My Trip"):
    with st.spinner("Planning your trip..."):
        try:
            forecast, activities, llm_result = run_async(plan_trip(destination, travel_dates))

            st.subheader("📍 Weather Forecast")
            st.success(forecast)

            st.subheader("🎯 Recommended Activities")
            st.info(activities)

            st.subheader("🗺️ Suggested Travel Plan")
            st.markdown(llm_result)
        except Exception as e:
            st.error(f"Something went wrong: {e}")

In this code,

Initializes a Streamlit web UI for a travel planner that collects user inputs like destination and travel dates.
Defines an asynchronous function to coordinate multiple agents via the AgentNetwork class.
Connects to weather, search, and LLM agents running locally on different ports.
Based on the weather forecast for the destination, selects indoor or outdoor activity recommendations.
Summarizes the final travel plan using a language model (LLM) agent and displays the results on the UI.
Uses asyncio.run() to integrate asynchronous agent responses within Streamlit’s synchronous workflow.

Creating the executor

You can either run each agent separately or create an executor script that uses Python’s subprocess module to launch all agents sequentially.

In this hands-on exercise, we will follow the subprocess approach to simplify execution.

Open your main.py file and replace its contents with the following code:

import subprocess
import time
import sys

def main():
    print("Hello from travel-planner!")

    # Scripts to launch before the Streamlit app
    scripts = ["WeatherAgent.py", "TavilySearchAgent.py", "LocalLLMAgent.py"]
    streamlit_app = "TravelPlannerApp.py"

    processes = []

    # Launch agent scripts
    for script in scripts:
        print(f"Launching {script}...")
        p = subprocess.Popen([sys.executable, script])
        processes.append(p)
        print(f"{script} started. Waiting 2 seconds before next...\n")
        time.sleep(2)

    # Launch Streamlit app
    print(f"Launching Streamlit app: {streamlit_app}...")
    p = subprocess.Popen(["streamlit", "run", streamlit_app])
    processes.append(p)

    # Keep the main process alive
    try:
        print("All agents (and UI) are running. Press Ctrl+C to stop.")
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        print("\nShutting down all agents...")
        for p in processes:
            p.terminate()
        print("All agents stopped.")

if __name__ == "__main__":
    main()
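
The fixed two-second pause in the launcher is a guess about startup time. One alternative, sketched here using only the standard library, is to poll each agent's port until it accepts connections. The helper name wait_for_port is hypothetical, and the ports are the ones used by the agents in this tutorial.

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.2):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Report which agent ports are currently accepting connections.
for name, port in [("weather", 8001), ("search", 8002), ("llm", 5001)]:
    ready = wait_for_port("localhost", port, timeout=0.5)
    print(f"{name} agent on port {port}: {'ready' if ready else 'not reachable'}")
```

In the launcher, a call like this could replace time.sleep(2) after each subprocess.Popen, with a longer timeout to allow for slow startups.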

Executing the app

To launch the entire application using subprocess, simply run the following command:

uv run main.py

If you would prefer to run each agent separately, open four terminal windows and execute the following commands in each terminal:

python WeatherAgent.py
python TavilySearchAgent.py
python LocalLLMAgent.py
streamlit run TravelPlannerApp.py

🎉 Awesome Work! You’ve successfully built a Travel Planner powered by the A2A protocol.

Thanks for reading this article!

Thanks Gowri M Bhatt for reviewing the content.


The full source code for this tutorial can be found here,


Level Up Your RAG Workflow: Building a GraphRAG-Powered KYC Agent

Vishnu Sivan — Sat, 12 Jul 2025 16:16:05 GMT

As developers, we often face the challenge of making sense of messy, scattered data — especially when it lives across dozens of PDFs, reports, and databases. Traditional RAG systems are a great start, but they often fall short when it comes to complex queries that span multiple documents or require a deeper understanding of context.

That’s where GraphRAG steps in.

GraphRAG brings structure to chaos by building a knowledge graph from raw text and using it as the backbone for LLM-powered responses. It doesn’t just retrieve text chunks — it understands how entities relate, clusters them into meaningful communities, and provides a graph-structured lens into your data. GraphRAG combines the structured clarity of knowledge graphs with the generative power of large language models (LLMs) to bring meaning and connection to fragmented information.

Unlike conventional RAG systems that rely on flat semantic search, GraphRAG extracts key entities, maps relationships, and organizes information into intuitive, navigable hierarchies. This makes it especially valuable in domains like compliance, risk management, and enterprise intelligence, where answers often lie at the intersection of multiple documents and hidden connections. By turning unstructured data into a structured web of knowledge, GraphRAG delivers more accurate, contextual, and actionable results — empowering users to make better decisions, faster.

In this article, we will explore how to build a prototype Know-Your-Customer (KYC) agent using OpenAI’s Agent SDK, illustrating how GraphRAG can streamline investigations and surface hidden risk signals.

Getting Started

How RAG works
What is a Knowledge Graph
What is GraphRAG
How GraphRAG works
Where traditional RAG falls short
VectorRAG vs GraphRAG
Experimenting with GraphRAG: Build a KYC agent with OpenAI, MCP, Ollama, and Neo4j
Prerequisites
Installing the dependencies
Setting up the credentials
Synthetic dataset preparation
Creating schemas
Building KYC agent
Importing required libraries
Setting up the neo4j connection
Building tools
Tool 1: Get customer and accounts
Tool 2: Find customer rings
Tool 3: Neo4j MCP server toolset
Tool 4: Generate Cypher
Tool 5: Create memory
Main function
Agent reasoning and execution flow
Executing the agent
References

How RAG works

RAG enhances large language models by allowing them to fetch relevant information from an external knowledge source (like documents or databases) before generating a response.

Instead of relying solely on the LLM’s internal training data, RAG introduces a two-step pipeline:

1. Retrieval Phase

The input query is used to search a vector database (e.g., FAISS, Pinecone, Chroma).
This database contains embedded representations (vectors) of your documents or chunks of data.
Using semantic similarity, the system retrieves the top-K most relevant text chunks for the query.

2. Generation Phase

These retrieved documents are fed into the LLM along with the original query.
The model then generates an answer that is grounded in the retrieved context, improving factual accuracy and relevance.
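
The two phases can be illustrated with a toy example. This is a minimal sketch, assuming a bag-of-words term count as a stand-in for a real embedding model and an in-memory list as a stand-in for a vector database. The function names and sample chunks are hypothetical, and the final LLM call is left out.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words term-count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Retrieval phase: return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query, context):
    """Generation phase input: the retrieved context plus the original query."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


chunks = [
    "KYC checks verify the identity of a customer",
    "Neo4j is a graph database",
    "A vector database stores embeddings of document chunks",
]
top = retrieve("what does a vector database store", chunks, k=1)
print(build_prompt("what does a vector database store", top))
```

The prompt returned by build_prompt is what would be sent to the LLM, which is how the generated answer stays grounded in the retrieved text.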

What is a Knowledge Graph

A knowledge graph is a structured and visual way of organizing information that captures real-world entities, their attributes, and the relationships between them. It helps model complex data in a way that makes hidden connections more obvious and queryable — ideal for domains like customer intelligence, fraud detection, or KYC (Know Your Customer).

Let’s break it down using an example customer data graph:

Key Components:

Entities (Nodes): These are the main objects represented in the graph. In this schema, entities include:
Customer, Account, Transaction, Device, Address, Company, Payment_Method, IP_Address
Attributes: These are properties of each entity, like a Customer having a name, ID, or email, though they’re not directly shown in the graph view. Attributes are usually stored as metadata inside each node.
Relationships (Edges): These define how entities are connected. For example:
A Customer OWNS an Account
A Customer USES_DEVICE (e.g., a phone or laptop)
A Customer LIVES_AT an Address
A Customer HAS_METHOD linked to a Payment_Method
An Account is linked to a Transaction by a TO or FROM relationship
A Device is ASSOCIATED WITH an IP_Address
A Customer is EMPLOYED_BY a Company

It might look something like this:

With this structure, it becomes easy to run intelligent queries like:

“Which accounts are associated with customers living at the same address?”
“Show all transactions made using devices linked to the same IP address.”
“Which customers share payment methods or devices?”

A knowledge graph like this turns scattered data into a connected web of insights, making it a foundational component for systems like GraphRAG, where reasoning over relationships is key.
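
To make the first query concrete, here is a minimal sketch of a tiny in-memory graph stored as (source, relationship, target) triples. The customers, accounts, and addresses are made up, and a real deployment would run this as a query in a graph database rather than in Python.

```python
from collections import defaultdict

# Each edge is (source node, relationship, target node); the data is made up.
edges = [
    ("Customer:alice", "OWNS", "Account:A1"),
    ("Customer:bob", "OWNS", "Account:B1"),
    ("Customer:carol", "OWNS", "Account:C1"),
    ("Customer:alice", "LIVES_AT", "Address:12 Main St"),
    ("Customer:bob", "LIVES_AT", "Address:12 Main St"),
    ("Customer:carol", "LIVES_AT", "Address:9 Oak Ave"),
]


def targets(node, relationship):
    """Follow one relationship type outward from a node."""
    return [t for s, r, t in edges if s == node and r == relationship]


def accounts_at_shared_addresses():
    """Accounts owned by customers who share an address with another customer."""
    residents = defaultdict(list)
    for source, relationship, target in edges:
        if relationship == "LIVES_AT":
            residents[target].append(source)
    shared = {}
    for address, customers in residents.items():
        if len(customers) > 1:
            shared[address] = sorted(a for c in customers for a in targets(c, "OWNS"))
    return shared


print(accounts_at_shared_addresses())
```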

What is GraphRAG


Graph-based Retrieval-Augmented Generation (GraphRAG) is an advanced AI technique that enhances traditional RAG systems by integrating knowledge graphs into the information retrieval and generation process. While conventional RAG pipelines retrieve top-matching text chunks from unstructured documents and pass them to a language model for generation, GraphRAG introduces structure and semantics to this flow. It does this by first transforming raw textual data into a knowledge graph, where key entities (like people, organizations, or events) and their relationships are identified and connected.

This graph-based representation allows for deeper reasoning and more context-aware generation, as the model can now leverage structured relationships, not just surface-level text similarity. For instance, instead of just finding documents that mention a customer, GraphRAG can traverse the graph to understand which accounts they own, which devices they’ve used, or what addresses they’re associated with — all before generating a response.

By combining the strengths of symbolic reasoning (via graphs) and generative AI (via LLMs), GraphRAG delivers more accurate, explainable, and scalable results. This makes it particularly useful in complex domains like fraud detection, compliance, legal discovery, and enterprise search, where understanding entity relationships is crucial for meaningful answers.

How GraphRAG works

GraphRAG operates in two main phases: Indexing (organizing information) and Querying (retrieving meaningful answers).

Indexing: Building the Knowledge Graph

Chunking the Text: GraphRAG starts by breaking down your documents into smaller segments called TextUnits. These are manageable chunks that make analysis easier.
Entity and Relationship Extraction: From each TextUnit, GraphRAG identifies key entities (like people, places, or organizations), claims, and how these elements are connected.
Graph Construction: All this information is structured into a knowledge graph — a visual network of entities (nodes) and their relationships (edges). Important entities appear as larger nodes, and related ones are grouped into clusters.
Community Summarization: For each cluster, GraphRAG generates a concise summary capturing the key themes and topics — offering a high-level view of your entire dataset.
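
The four indexing steps can be sketched end to end on a toy input. This is a heavily simplified illustration, not how the GraphRAG library is implemented: sentences stand in for TextUnits, a fixed entity list stands in for LLM-based extraction, connected components stand in for community detection, and a member list stands in for an LLM-written summary. All names and data are hypothetical.

```python
import itertools

KNOWN_ENTITIES = {"alice", "bob", "acme", "carol", "globex"}  # hypothetical entity list


def chunk_text(text):
    """Split text into sentences (a simple stand-in for TextUnits)."""
    return [s.strip() for s in text.split(".") if s.strip()]


def extract_entities(chunk):
    """Toy extractor: keep words found in a known-entity list."""
    return sorted({w.strip(".,").lower() for w in chunk.split()} & KNOWN_ENTITIES)


def build_graph(chunks):
    """Link entities that co-occur in the same chunk."""
    graph = {}
    for chunk in chunks:
        entities = extract_entities(chunk)
        for entity in entities:
            graph.setdefault(entity, set())
        for a, b in itertools.combinations(entities, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def communities(graph):
    """Group entities into connected components (a simple stand-in for clustering)."""
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(graph[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


text = "Alice pays Bob every month. Bob works at Acme. Carol works at Globex."
groups = communities(build_graph(chunk_text(text)))
for i, group in enumerate(groups):
    # In practice an LLM writes each community summary; here we only list members.
    print(f"Community {i}: {', '.join(group)}")
```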

Querying: Asking Questions and Getting Answers

Once the knowledge graph is built, GraphRAG uses it to answer questions in three ways:

Global Search: Ideal for understanding overarching themes or trends. The system uses community-level summaries to provide broad insights.
Local Search: Best for specific queries. If you ask about a person or company, GraphRAG explores their direct connections in the graph to provide a targeted, factual response.
DRIFT Search: A hybrid approach that combines Local Search with community-level insights — useful when deeper, contextual understanding is needed.
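
The difference between the three modes can be shown on a toy index. This is a rough sketch under strong simplifications: the graph, community assignments, and summaries are made up, and the drift_search function here only combines neighbors with a community summary, whereas the real DRIFT search is considerably more involved.

```python
# Toy index: entity neighbors plus one summary per community (all data is made up).
graph = {
    "alice": {"bob"},
    "bob": {"alice", "acme"},
    "acme": {"bob"},
    "carol": {"globex"},
    "globex": {"carol"},
}
community_of = {"alice": 0, "bob": 0, "acme": 0, "carol": 1, "globex": 1}
summaries = {
    0: "Alice, Bob, and Acme are linked through payments and employment.",
    1: "Carol is employed by Globex.",
}


def local_search(entity):
    """Local search: gather the direct neighbors of one entity."""
    return sorted(graph.get(entity, set()))


def global_search():
    """Global search: answer from community-level summaries only."""
    return [summaries[c] for c in sorted(summaries)]


def drift_search(entity):
    """Hybrid: local neighbors plus the summary of the entity's community."""
    community = community_of.get(entity)
    return {"neighbors": local_search(entity), "context": summaries.get(community)}


print(local_search("bob"))
print(global_search())
print(drift_search("carol"))
```

In each case the returned material would be passed to the LLM as context; only the way that context is selected differs.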

Where traditional RAG falls short

Traditional Retrieval-Augmented Generation (RAG) excels at finding and returning semantically similar text snippets. But when data becomes more complex or scattered, traditional RAG begins to struggle — especially in real-world enterprise applications.

One major limitation is its inability to synthesize information spread across multiple sources. If a question requires connecting subtle, indirect relationships — such as tracing a customer’s activity across different devices, transactions, or locations — traditional RAG often fails to assemble those connections. It lacks an understanding of how different pieces of data relate within a larger context, leading to incomplete or inaccurate responses.

Another weakness lies in capturing broader context or summarizing nuanced datasets. Traditional RAG models aren’t built for higher-level semantic reasoning. So when faced with a query like “What are the main themes in the dataset?”, the system flounders — unless those themes are explicitly written out. That’s because this is not a simple retrieval problem; it’s a query-focused summarization task, which requires abstracting insights across the dataset — something traditional RAG isn’t inherently equipped to handle.

In short, while traditional RAG works well for localized lookups and direct Q&A, it falls short when dealing with interconnected, abstract, or multi-hop reasoning — the exact gaps that GraphRAG is designed to fill.
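
Multi-hop reasoning is easiest to see with a small traversal. In this sketch, two customers never appear in the same record, yet a breadth-first search over made-up links connects them through a device and a shared IP address. A similarity search over isolated text chunks has no equivalent of following these hops.

```python
from collections import deque

# Made-up links: two customers connected only indirectly via devices and an IP.
links = {
    "Customer:alice": ["Device:phone-1"],
    "Device:phone-1": ["Customer:alice", "IP:10.0.0.7"],
    "IP:10.0.0.7": ["Device:phone-1", "Device:laptop-9"],
    "Device:laptop-9": ["IP:10.0.0.7", "Customer:bob"],
    "Customer:bob": ["Device:laptop-9"],
}


def shortest_path(start, goal):
    """Breadth-first search returning the shortest chain of hops, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in links.get(path[-1], []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(" -> ".join(shortest_path("Customer:alice", "Customer:bob")))
```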

VectorRAG vs GraphRAG

While both VectorRAG and GraphRAG extend the power of language models by incorporating external knowledge, they approach the task very differently in how they retrieve, structure, and reason over information. VectorRAG excels at fast, precise look‑ups, while GraphRAG provides deeper, more explainable reasoning over structured relationships — ideal for complex investigative and analytical tasks.

Experimenting with GraphRAG: Build a KYC agent with OpenAI, MCP, Ollama, and Neo4j

Know-Your-Customer (KYC) processes involve navigating vast networks of intricately linked entities: customers, accounts, devices, IPs, and transactions. Traditional RAG falls short in these scenarios, where uncovering fraud demands tracing indirect, multi-hop relationships across data points. This is where GraphRAG shines. By grounding retrieval in a structured knowledge graph, it enables investigators to reason across complex connections and uncover hidden patterns, making it a powerful tool for detecting money laundering, sanctions violations, and other financial crimes.

In the next section, we will walk through how to build a prototype GraphRAG-powered KYC agent.

Prerequisites

This hands-on requires the following tools to be installed on your machine:

1. Ollama: Ollama is a platform for running large language models locally on your computer.

Download Ollama on Windows

To convert natural language questions into Cypher queries, we’ll use the Text-to-Cypher model provided by Neo4j.

Run the following command in your terminal to pull the model using Ollama:

ollama pull ed-neo4j/t2c-gemma3-4b-it-q8_0-35k

2. Neo4j: Neo4j is a graph database used to model and query relationships between entities like customers, accounts, transactions, and more.

You have two options to set up a free Neo4j database:

Option 1: Local Neo4j Instance (Neo4j Desktop or Docker)

Download the Neo4j installer from the official site: https://neo4j.com/download
Follow the installation instructions based on your operating system.
Once installed, start the Neo4j Desktop application or run it via Docker to create and manage local databases.

Option 2: Neo4j AuraDB Free (Cloud-Based Managed Instance)

Visit the Neo4j Aura Console: https://console.neo4j.io
Sign in or create a Neo4j account.
Click “Create Database” and select AuraDB Free.
Once the database is created, download the connection credentials bundle. You’ll need these to connect to the database programmatically.

In this tutorial, we will proceed with Neo4j AuraDB Free as it is lightweight, cloud-hosted, and easily accessible from anywhere.

3. Python: Python is the core language used in this hands-on for scripting and backend logic.

Download Python

4. uv: uv is a fast, modern Python package and project manager that we will use to set up and manage our environment. It simplifies tasks like handling dependencies, creating virtual environments, and running scripts.

To install uv on Windows, run this in your terminal:

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
set Path=%USERPROFILE%\.local\bin;%Path%

Refer to the official website for detailed installation instructions.

Installation | uv

Installing the dependencies

Initialize a uv project by executing the following command.

uv init kyc_agent
cd kyc_agent

Create and activate a virtual environment by executing the following command.

uv venv
source .venv/bin/activate # for linux
.venv\Scripts\activate    # for windows

Install the neo4j, neo4j-rust-ext, numpy, ollama, openai-agents, and python-dotenv libraries using uv.

uv add neo4j neo4j-rust-ext numpy ollama openai-agents python-dotenv
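A quick way to confirm the environment is ready is to check that each library can be located before running the scripts. The following is a minimal sketch; the helper missing_modules is hypothetical, and the import names are the ones used by the tutorial's own import statements (python-dotenv is imported as dotenv, openai-agents as agents).

```python
import importlib.util

# Import names, not package names: python-dotenv -> dotenv, openai-agents -> agents.
REQUIRED_MODULES = ["neo4j", "numpy", "ollama", "agents", "dotenv"]

def missing_modules(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Run it inside the activated virtual environment; an empty result suggests the uv add step completed as expected.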

Setting up the credentials

Create a file named .env. This file will store your environment variables, including the OpenAI key and Neo4j credentials.
Open the .env file and add the following code to specify your OpenAI API key and Neo4j credentials. Copy the Neo4j credentials from the file you downloaded while setting up the Neo4j AuraDB instance.

OPENAI_API_KEY=sk-proj-C1K1hKug99wXxtj...
NEO4J_URI=neo4j+s://b3383662.databases.neo4j.io
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=fa_BBW1s6kjvOjSLnTvkKht...
NEO4J_DATABASE=neo4j
AURA_INSTANCEID=b3383662
AURA_INSTANCENAME=Free instance
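Because both scripts fall back to local defaults when a variable is missing, a typo in the .env file can go unnoticed until a connection fails. The sketch below shows one possible sanity check; the helper missing_settings and the choice of which variables count as required are assumptions, not part of the original project.

```python
import os

# Variable names taken from the .env file above.
REQUIRED_SETTINGS = ["OPENAI_API_KEY", "NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD"]

def missing_settings(env, required=REQUIRED_SETTINGS):
    """Return the required variable names that are absent or blank in `env`."""
    return [name for name in required if not str(env.get(name, "")).strip()]

if __name__ == "__main__":
    # In the real scripts this would run after load_dotenv(); here we only inspect os.environ.
    missing = missing_settings(os.environ)
    if missing:
        print("Missing settings:", ", ".join(missing))
```

Passing a plain dictionary instead of os.environ makes the check easy to try out without touching real credentials.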

Synthetic dataset preparation

For the purpose of this blog, we have generated a synthetic dataset comprising 8,000 customers along with their associated accounts, transactions, registered addresses, devices, and IP addresses.

Dataset generation script

Use the following script to generate the dataset.

import numpy as np
import os
import random
import uuid
import time
from datetime import datetime, timedelta
from neo4j import GraphDatabase
from dotenv import load_dotenv
load_dotenv()  

random.seed(42)

NEO4J_URI = os.getenv("NEO4J_URI", "bolt://localhost:7687")
NEO4J_USER = os.getenv("NEO4J_USERNAME", "neo4j")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD", "password")
NEO4J_DATABASE = os.getenv("NEO4J_DATABASE", "neo4j")

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

def get_session():
    return driver.session(database=NEO4J_DATABASE)

# Create uniqueness constraints (once)
with get_session() as sess:
    
    for label in ('Customer','Account','Company','Address',
                  'Device','IP_Address','Payment_Method','Transaction'):
        sess.execute_write(
            lambda tx, L=label: tx.run(
                f"CREATE CONSTRAINT IF NOT EXISTS FOR (n:{L}) REQUIRE n.id IS UNIQUE"
            )
        )

# Configuration & ID lists
random.seed(42)
np.random.seed(42)

n_customers = 8_000
mean_accounts_per_customer   = 1.5
mean_devices_per_customer    = 2
mean_addresses_per_customer  = 1.2
mean_payment_methods_per_customer = 1
mean_transactions_per_account     = 10
p_pep       = 0.01
p_watchlist = 0.02

customers   = [f"CUST_{i:05d}" for i in range(1, n_customers+1)]
n_companies = int(n_customers * 0.2)
companies   = [f"COMP_{i:05d}" for i in range(1, n_companies+1)]


# Prepare payloads
customer_rows = [
    {"id": cust,
     "pep": (random.random() < p_pep),
     "wl":  (random.random() < p_watchlist),
     "name": cust
    }
    for cust in customers
]

company_rows = [
    {"id": comp,
     "ind": random.choice(['Finance','Tech','Manufacturing','Retail']),
     "name": comp
     }
    for comp in companies
]

# 3. Push Customers & Companies
print(f"loading start...")
start_time = time.perf_counter()
batch_size=50

with get_session() as sess:
    # Customers in implicit transactions of 50 rows each
    sess.run(
        """
        UNWIND $rows AS row
        CALL (row) {
          MERGE (c:Customer {id: row.id})
          SET c.is_pep       = row.pep,
              c.on_watchlist = row.wl,
              c.name = row.name
        } IN TRANSACTIONS OF $batch_size ROWS
        """,
        rows=customer_rows,
        batch_size=batch_size
    )

    # Companies in implicit transactions of 50 rows each
    sess.run(
        """
        UNWIND $rows AS row
        CALL (row) {
          MERGE (c:Company {id: row.id})
          SET c.industry = row.ind,
              c.name = row.name 
        } IN TRANSACTIONS OF $batch_size ROWS
        """,
        rows=company_rows,
        batch_size=batch_size
    )

    
end_time = time.perf_counter()
elapsed = end_time - start_time
print(f"⌛ Loading Customers & Companies took {elapsed:.2f} seconds")


# 3. Push Accounts, Addresses, Devices, IP addresses, Payment Methods and Transactions

# Build payloads
acct_counter = addr_counter = dev_counter = ip_counter = pm_counter = txn_counter = 0

account_rows     = []
employed_rows    = []
address_rows     = []
device_rows      = []
ip_rows          = []
payment_rows     = []
transaction_rows = []


# 1.1 Accounts & OWNS
for cust in customers:
    for _ in range(np.random.poisson(mean_accounts_per_customer)):
        acct_counter += 1
        aid = f"ACCT_{acct_counter:05d}"
        account_rows.append({"cust": cust, "acct": aid,"name":aid})

# 1.2 EMPLOYED_BY
for cust in customers:
    if random.random() < 0.8:
        comp = random.choice(companies)
        employed_rows.append({"cust": cust, "co": comp})

# 1.3 Addresses & LIVES_AT
for cust in customers:
    for _ in range(max(1, np.random.poisson(mean_addresses_per_customer))):
        addr_counter += 1
        aid = f"ADDR_{addr_counter:05d}"
        city = random.choice(['London','Manchester','Birmingham','Leeds'])
        address_rows.append({"cust": cust, "addr": aid, "city": city,"name":aid})

# 1.4 Devices & USES_DEVICE → ASSOCIATED_WITH IP_Address
for cust in customers:
    for _ in range(np.random.poisson(mean_devices_per_customer)):
        dev_counter += 1
        did = f"DEV_{dev_counter:05d}"
        osys = random.choice(['Android','iOS','Windows','MacOS'])
        device_rows.append({"cust": cust, "dev": did, "os": osys,"name":did})

        ip_counter += 1
        iid = f"IP_{ip_counter:05d}"
        ip_rows.append({"dev": did, "ip": iid,"name":iid})

# 1.5 Payment Methods & HAS_METHOD
for cust in customers:
    for _ in range(np.random.poisson(mean_payment_methods_per_customer)):
        pm_counter += 1
        pid = f"PM_{pm_counter:05d}"
        ptype = random.choice(['Credit_Card','Debit_Card','EWallet'])
        cnum = ''.join(random.choice('0123456789') for _ in range(16)) \
               if ptype in ('Credit_Card','Debit_Card') \
               else uuid.uuid4().hex[:16]
        payment_rows.append({
            "cust": cust,
            "pid": pid,
            "ptype": ptype,
            "cnum": cnum,
            "name": pid
        })


# 1.6 Transactions & FROM/TO
all_accts = [r["acct"] for r in account_rows]
for src in all_accts:
    for _ in range(np.random.poisson(mean_transactions_per_account)):
        txn_counter += 1
        tid = f"TXN_{txn_counter:06d}"
        amt = round(np.random.lognormal(mean=3, sigma=1), 2)
        ts  = (datetime(2025,1,1) + timedelta(days=random.randint(0,120))).isoformat()
        dst = random.choice(all_accts)
        transaction_rows.append({
            "src": src, "tid": tid, "amt": amt, "ts": ts, "dst": dst, "name":tid
        })


# 2. Push in batches
start_time = time.perf_counter()
with get_session() as sess:
    # 2.1 Accounts
    sess.run(
        """
        UNWIND $rows AS row
        CALL (row) {
          MERGE (a:Account {id: row.acct})
          SET a.name = row.name
          WITH a, row
          MATCH (c:Customer {id: row.cust})
          MERGE (c)-[:OWNS]->(a)
        } IN TRANSACTIONS OF $batch_size ROWS
        """,
        rows=account_rows, batch_size=batch_size
    )

    # 2.2 Employed
    sess.run(
        """
        UNWIND $rows AS row
        CALL (row) {
          MATCH (c:Customer {id: row.cust})
          MATCH (co:Company  {id: row.co})
          MERGE (c)-[:EMPLOYED_BY]->(co)
        } IN TRANSACTIONS OF $batch_size ROWS
        """,
        rows=employed_rows, batch_size=batch_size
    )

    # 2.3 Addresses
    sess.run(
        """
        UNWIND $rows AS row
        CALL (row) {
          MERGE (a:Address {id: row.addr})
          SET a.city = row.city,
              a.name = row.name
          WITH a, row
          MATCH (c:Customer {id: row.cust})
          MERGE (c)-[:LIVES_AT]->(a)
        } IN TRANSACTIONS OF $batch_size ROWS
        """,
        rows=address_rows, batch_size=batch_size
    )

    # 2.4 Devices
    sess.run(
        """
        UNWIND $rows AS row
        CALL (row) {
          MERGE (d:Device {id: row.dev})
          SET d.os = row.os,
            d.name = row.name
          WITH d, row
          MATCH (c:Customer {id: row.cust})
          MERGE (c)-[:USES_DEVICE]->(d)
        } IN TRANSACTIONS OF $batch_size ROWS
        """,
        rows=device_rows, batch_size=batch_size
    )
    # 2.5 IPs
    sess.run(
        """
        UNWIND $rows AS row
        CALL (row) {
          MERGE (i:IP_Address {id: row.ip})
          SET i.name = row.name
          WITH i, row
          MATCH (d:Device {id: row.dev})
          MERGE (d)-[:ASSOCIATED_WITH]->(i)
        } IN TRANSACTIONS OF $batch_size ROWS
        """,
        rows=ip_rows, batch_size=batch_size
    )

    # 2.6 Payment Methods
    sess.run(
        """
        UNWIND $rows AS row
        CALL (row) {
          MERGE (p:Payment_Method {id: row.pid})
          SET p.pm_type     = row.ptype,
              p.card_number = row.cnum,
              p.name = row.name
          WITH p, row
          MATCH (c:Customer {id: row.cust})
          MERGE (c)-[:HAS_METHOD]->(p)
        } IN TRANSACTIONS OF $batch_size ROWS
        """,
        rows=payment_rows, batch_size=batch_size
    )

    # 2.7 Transactions
    sess.run(
        """
        UNWIND $rows AS row
        CALL (row) {
          MERGE (t:Transaction {id: row.tid})
          SET t.amount    = row.amt,
              t.timestamp = row.ts,
              t.name = row.name
          WITH t, row
          MATCH (a1:Account {id: row.src})
          MATCH (a2:Account {id: row.dst})
          MERGE (a1)-[:FROM]->(t)-[:TO]->(a2)
        } IN TRANSACTIONS OF $batch_size ROWS
        """,
        rows=transaction_rows, batch_size=batch_size
    )

end_time = time.perf_counter()
elapsed = end_time - start_time
print(f"⌛ Loading Account, Employed, Owns, Addresses, Devices, Payment Methods & Transactions took {elapsed:.2f} seconds")


# 5. Select anomalies
n_anomalies        = int(0.05 * len(customers))
anoms             = random.sample(customers, n_anomalies)
chunk             = n_anomalies // 5

# Prepare payload lists
super_rows        = []
ring_acct_rows    = []
ring_txn_rows     = []
bridge_rows       = []
isolate_rows      = []
dense_addr_rows   = []
dense_pm_rows     = []


# Super-hubs: 50 new accounts per customer
for cust in anoms[0:chunk]:
    for _ in range(50):
        acct_counter += 1
        aid = f"ACCT_{acct_counter:05d}"
        super_rows.append({"cust": cust, "acct": aid,"name":aid})

#Upload
with get_session() as sess:
    # 4.1 Super-hubs
    start_time = time.perf_counter()
    sess.run(
        """
        UNWIND $rows AS row
        CALL (row) {
          MERGE (a:Account {id: row.acct})
          SET a.name = row.name
          WITH a, row
          MATCH (c:Customer {id: row.cust})
          MERGE (c)-[:OWNS]->(a)
        } IN TRANSACTIONS OF $batch_size ROWS
        """,
        rows=super_rows, batch_size=50
    )
    end_time = time.perf_counter()
    elapsed = end_time - start_time
    print(f"⌛ Loading Anomalies: Super Hubs took {elapsed:.2f} seconds")

# 2.2 Circular rings: 3-customer cycles
for i in range(chunk, 2*chunk, 3):
    trio = anoms[i : i+3]
    if len(trio) == 3:
        accts = []
        for c in trio:
            acct_counter += 1
            aid = f"ACCT_{acct_counter:05d}"
            ring_acct_rows.append({"cust": c, "acct": aid})
            accts.append(aid)
        for j in range(3):
            txn_counter += 1
            tid = f"TXN_{txn_counter:06d}"
            ring_txn_rows.append({
                "src":      accts[j],
                "dst":      accts[(j+1) % 3],
                "tid":      tid,
                "amount":   1000,
                "ts":       datetime(2025, 2, 1).isoformat()
            })

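with get_session() as sess:
    start_time = time.perf_counter()
    # 4.2 Circular rings - ring accounts
    # These accounts must exist before the ring transactions below can MATCH them.
    sess.run(
        """
        UNWIND $rows AS row
        CALL (row) {
          MERGE (a:Account {id: row.acct})
          SET a.name = row.acct
          WITH a, row
          MATCH (c:Customer {id: row.cust})
          MERGE (c)-[:OWNS]->(a)
        } IN TRANSACTIONS OF $batch_size ROWS
        """,
        rows=ring_acct_rows, batch_size=50
    )
    # 4.2 Circular rings - ring transactions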
    sess.run(
        """
        UNWIND $rows AS row
        CALL (row) {
          MERGE (t:Transaction {id: row.tid})
          SET t.amount = row.amount, t.timestamp = row.ts
          WITH t, row
          MATCH (a1:Account {id: row.src}), (a2:Account {id: row.dst})
          MERGE (a1)-[:FROM]->(t)-[:TO]->(a2)
        } IN TRANSACTIONS OF $batch_size ROWS
        """,
        rows=ring_txn_rows, batch_size=50
    )
    end_time = time.perf_counter()
    elapsed = end_time - start_time
    print(f"⌛ Loading Anomalies: Circular Rings took {elapsed:.2f} seconds")


# 2.3 Bridges: employed by two companies
for cust in anoms[2*chunk : 3*chunk]:
    c1, c2 = random.sample(companies, 2)
    bridge_rows.append({"cust": cust, "co": c1})
    bridge_rows.append({"cust": cust, "co": c2})
with get_session() as sess:
    # 4.3 Bridges
    start_time = time.perf_counter()
    sess.run(
        """
        UNWIND $rows AS row
        CALL (row) {
          MATCH (c:Customer {id: row.cust}), (co:Company {id: row.co})
          MERGE (c)-[:EMPLOYED_BY]->(co)
        } IN TRANSACTIONS OF $batch_size ROWS
        """,
        rows=bridge_rows, batch_size=50
    )
    end_time = time.perf_counter()
    elapsed = end_time - start_time
    print(f"⌛ Loading Anomalies: Bridges - Customers employed by 2 companies took {elapsed:.2f} seconds")

#  Isolates: 5 device→IP pairs per customer, no link to customers
for cust in anoms[3*chunk : 4*chunk]:
    for _ in range(5):
        dev_counter += 1
        ip_counter  += 1
        isolate_rows.append({
            "dev": f"DEV_{dev_counter:05d}",
            "ip":  f"IP_{ip_counter:05d}"
        })

with get_session() as sess:
    # 4.4 Isolates
    start_time = time.perf_counter()
    sess.run(
        """
        UNWIND $rows AS row
        CALL (row) {
          MERGE (d:Device {id: row.dev})
          SET d.os = 'Unknown'
          MERGE (i:IP_Address {id: row.ip})
          MERGE (d)-[:ASSOCIATED_WITH]->(i)
        } IN TRANSACTIONS OF $batch_size ROWS
        """,
        rows=isolate_rows, batch_size=50
    )
    end_time = time.perf_counter()
    elapsed = end_time - start_time
    print(f"⌛ Loading Anomalies: Isolated Devices and IP Addresses with no Customers took {elapsed:.2f} seconds")

# Dense watchlist cluster: shared address & payment method
shared_addr = f"ADDR_{addr_counter+1:05d}"
shared_pm   = f"PM_{pm_counter+1:05d}"
dense_addr_rows = [{"cust": cust, "addr": shared_addr}
                   for cust in anoms[4*chunk : ]]
dense_pm_rows   = [{"cust": cust, "pm":   shared_pm}
                   for cust in anoms[4*chunk : ]]

# 3. Create the two shared nodes up front
with get_session() as sess:
    sess.run(
        "MERGE (a:Address {id:$addr}) SET a.city='London', a.name=$addr",
        addr=shared_addr
    )
    sess.run(
        "MERGE (p:Payment_Method {id:$pm}) SET p.pm_type='Credit_Card', p.name=$pm",
        pm=shared_pm
    )
with get_session() as sess:
    start_time = time.perf_counter()
    # 4.5 Dense cluster – shared address
    sess.run(
        """
        UNWIND $rows AS row
        CALL (row) {
          MATCH (c:Customer {id: row.cust}), (a:Address {id: row.addr})
          MERGE (c)-[:LIVES_AT]->(a)
        } IN TRANSACTIONS OF $batch_size ROWS
        """,
        rows=dense_addr_rows, batch_size=50
    )
    # 4.5 Dense cluster – shared payment method + watchlist flag
    sess.run(
        """
        UNWIND $rows AS row
        CALL (row) {
          MATCH (c:Customer {id: row.cust}), (p:Payment_Method {id: row.pm})
          MERGE (c)-[:HAS_METHOD]->(p)
          SET c.on_watchlist = true
        } IN TRANSACTIONS OF $batch_size ROWS
        """,
        rows=dense_pm_rows, batch_size=50
    )
    end_time = time.perf_counter()
    elapsed = end_time - start_time
    print(f"⌛ Loading Anomalies: Dense clusters - Around shared address & payment method took {elapsed:.2f} seconds")
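As a rough check on what the script above should produce, the expected graph size can be worked out from its parameters. This is a back-of-the-envelope sketch: the Poisson-driven counts are expectations, not exact values, and the helper names are hypothetical.

```python
# Parameters copied from the generation script above.
n_customers = 8_000
mean_accounts_per_customer = 1.5
mean_devices_per_customer = 2
mean_transactions_per_account = 10

def expected_sizes():
    """Expected (not exact) node counts implied by the Poisson means."""
    accounts = n_customers * mean_accounts_per_customer
    return {
        "companies": int(n_customers * 0.2),
        "accounts": accounts,
        "devices": n_customers * mean_devices_per_customer,
        "transactions": accounts * mean_transactions_per_account,
    }

def anomaly_plan():
    """Counts for the injected anomalies, which do not depend on random draws."""
    n_anomalies = int(0.05 * n_customers)
    chunk = n_anomalies // 5
    return {
        "anomalous_customers": n_anomalies,
        "customers_per_pattern": chunk,
        "super_hub_accounts": chunk * 50,
        "rings": len(range(chunk, 2 * chunk, 3)),
        "isolated_device_ip_pairs": chunk * 5,
        "dense_cluster_customers": n_anomalies - 4 * chunk,
    }

print(expected_sizes())
print(anomaly_plan())
```

With these settings the base graph should hold roughly 12,000 accounts and 120,000 transactions, plus 4,000 super-hub accounts and 27 three-account rings. Node counts in the Aura console that differ wildly from these figures would suggest that a load step did not complete.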

You can now navigate to the Neo4j Aura Console and click on the “Query” section. Connect to your current database instance, then execute the following query to view the schema:

CALL db.schema.visualization();

Creating schemas

Next, we will define schemas for each of the core entities in our graph model — Customer, Account, Transaction, etc.

# schemas.py
from pydantic import BaseModel
from typing import List, Optional

# Tool 1: Get Customer and Accounts
class CustomerAccountsInput(BaseModel):
    customer_id: str

class TransactionModel(BaseModel):
    id: Optional[str] = None
    amount: Optional[float] = None
    timestamp: Optional[str] = None

class AccountModel(BaseModel):
    id: Optional[str] = None
    name: Optional[str] = None
    transactions: List[TransactionModel] = []

class CustomerModel(BaseModel):
    id: Optional[str] = None
    name: Optional[str] = None
    on_watchlist: Optional[bool] = False
    is_pep: Optional[bool] = False

class CustomerAccountsOutput(BaseModel):
    customer: CustomerModel
    accounts: List[AccountModel]

# Tool 2: Identify watchlisted customers in suspicious rings
from typing import Dict, Any

class RingModel(BaseModel):
    ring_path: List[Dict[str, Any]]  # List of node dicts
    watched_customers: List[Dict[str, Any]]  # List of customer dicts
    watch_relationships: List[Dict[str, Any]]  # List of relationship dicts

class CustomerRingsInput(BaseModel):
    max_number_rings: int = 10
    customer_in_watchlist: Optional[bool] = True
    customer_is_pep: Optional[bool] = False

class CustomerRingsOutput(BaseModel):
    customer_rings: List[RingModel]


class GenerateCypherRequest(BaseModel):
    question: str
    database_schema: str

Building KYC agent

Let’s begin with the agent creation process.

Create a file named agent.py and add the following code to it.

Importing required libraries

import os
from agents import Agent, Runner, function_tool
from agents.mcp import MCPServerStdio
from neo4j import GraphDatabase
from schemas import CustomerAccountsInput, CustomerAccountsOutput, CustomerModel, AccountModel, TransactionModel, GenerateCypherRequest
import asyncio
from ollama import chat
from dotenv import load_dotenv
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logging.getLogger("httpx").setLevel(logging.ERROR)
logger = logging.getLogger("KYC_AGENT")

# Load environment variables
load_dotenv()

Setting up the neo4j connection

# Read Neo4j environment variables into variables
NEO4J_URI = os.getenv("NEO4J_URI", "bolt://localhost:7687")
NEO4J_USER = os.getenv("NEO4J_USERNAME", "neo4j")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD", "password")
NEO4J_DATABASE = os.getenv("NEO4J_DATABASE", "neo4j")

# Neo4j connection setup
def get_neo4j_driver():
    return GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD))

# Neo4j driver
driver = get_neo4j_driver()

Building tools

An agent’s effectiveness depends on its tools. Here we give the KYC agent five key tools, including optimized Cypher queries wrapped in Python functions with the @function_tool decorator from the OpenAI Agents SDK.

Tool 1: Get customer and accounts

This tool retrieves a customer’s profile, including their accounts and recent transactions — an essential part of any investigation. It uses a function that takes a customer ID and runs a simple Cypher query.

@function_tool
def get_customer_and_accounts(input: CustomerAccountsInput, tx_limit: int = 5) -> CustomerAccountsOutput:
    logger.info(f"TOOL: GET_CUSTOMER_AND_ACCOUNTS - {input.customer_id}")
    with driver.session() as session:
        result = session.run(
            """
            MATCH (c:Customer {id: $customer_id})-[o:OWNS]->(a:Account)
            WITH c, a
            CALL (c,a) {
                // Undirected match: covers outgoing (a)-[:FROM]->(t) and incoming (t)-[:TO]->(a)
                MATCH (a)-[b:TO|FROM]-(t:Transaction)
                ORDER BY t.timestamp DESC
                LIMIT $tx_limit
                RETURN collect(t) as transactions
            }
            RETURN c as customer, a as account, transactions
            """,
            customer_id=input.customer_id,
            tx_limit=tx_limit
        )
        # Get the records from the result
        records = result.data()
        # Start with an empty customer so a query with no rows still returns a valid result
        customer = {}
        accounts = []
        for record in records:
            customer = dict(record["customer"])
            account = dict(record["account"])
            account["transactions"] = [dict(t) for t in record["transactions"]]
            accounts.append(account)

        return CustomerAccountsOutput(
            customer=CustomerModel(**customer),
            accounts=[AccountModel(**a) for a in accounts]
        )
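The reshaping step in this tool can be tried without a database. The sketch below applies the same grouping to hand-written rows; the sample rows and the helper group_accounts are hypothetical, and it also shows one way to handle a query that returns no rows.

```python
def group_accounts(records):
    """Reshape flat (customer, account, transactions) rows into one nested result."""
    customer = {}
    accounts = []
    for record in records:
        # Every row repeats the same customer, so keeping the latest copy is enough.
        customer = dict(record["customer"])
        account = dict(record["account"])
        account["transactions"] = [dict(t) for t in record["transactions"]]
        accounts.append(account)
    return {"customer": customer, "accounts": accounts}

# Hypothetical rows shaped like the output of result.data() in the tool above.
rows = [
    {"customer": {"id": "CUST_00001"}, "account": {"id": "ACCT_00001"},
     "transactions": [{"id": "TXN_000001", "amount": 12.5}]},
    {"customer": {"id": "CUST_00001"}, "account": {"id": "ACCT_00002"},
     "transactions": []},
]

profile = group_accounts(rows)
print(profile["customer"]["id"], [a["id"] for a in profile["accounts"]])
print(group_accounts([]))  # {'customer': {}, 'accounts': []}
```

An unknown customer ID or a customer without accounts yields an empty result here rather than an error, which is usually easier for the agent to explain to the user.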

Tool 2: Find customer rings

This tool detects circular transaction patterns — commonly linked to money laundering — by identifying cycles in the KYC graph where funds return to their origin. It uses a find_customer_rings function to run a Cypher query that returns up to 10 potential rings, including the involved customers, accounts, and transactions.

@function_tool 
def find_customer_rings(max_number_rings: int = 10, customer_in_watchlist: bool = True, customer_is_pep: bool = False, customer_id: str = None):
    logger.info(f"TOOL: FIND_CUSTOMER_RINGS - {max_number_rings} - {customer_in_watchlist} - {customer_is_pep}")
    with driver.session() as session:
        result = session.run(
            f"""
            MATCH p=(a:Account)-[:FROM|TO*6]->(a:Account)
            WITH p, [n IN nodes(p) WHERE n:Account] AS accounts
            UNWIND accounts AS acct
            MATCH (cust:Customer)-[r:OWNS]->(acct)
            WHERE cust.on_watchlist = $customer_in_watchlist AND cust.is_pep = $customer_is_pep
            WITH 
              p, 
              COLLECT(DISTINCT cust)   AS watchedCustomers,
              COLLECT(DISTINCT r)      AS watchRels
            RETURN 
              p, 
              watchedCustomers,
              watchRels
            LIMIT $max_number_rings
            """,
            max_number_rings=max_number_rings,
            customer_in_watchlist=customer_in_watchlist,
            customer_is_pep=customer_is_pep
        )
        rings = []
        for record in result:
            # Convert path to a list of node dictionaries for easier consumption
            path_nodes = [dict(node) for node in record["p"].nodes]
            watched_customers = [dict(cust) for cust in record["watchedCustomers"]]
            watch_rels = [dict(rel) for rel in record["watchRels"]]
            rings.append({
                "ring_path": path_nodes,
                "watched_customers": watched_customers,
                "watch_relationships": watch_rels,
            })
        
        return {"customer_rings": rings}
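In the graph, each transfer between two accounts is stored as two relationships (Account, FROM, Transaction, TO, Account), so a ring of three accounts corresponds to the six-hop pattern used in the query. The sketch below reproduces the idea in plain Python on account-to-account transfers; the sample data and the helper find_rings are hypothetical, and unlike the Cypher pattern it only reports rings whose accounts are all distinct.

```python
# Hypothetical transfers as (source account, destination account).
TRANSFERS = [
    ("ACCT_1", "ACCT_2"),
    ("ACCT_2", "ACCT_3"),
    ("ACCT_3", "ACCT_1"),  # closes a three-account ring
    ("ACCT_3", "ACCT_4"),
]

def find_rings(transfers, ring_size=3):
    """Return cycles of exactly `ring_size` distinct accounts, each reported once."""
    outgoing = {}
    for src, dst in transfers:
        outgoing.setdefault(src, set()).add(dst)

    rings = set()

    def walk(path):
        for nxt in outgoing.get(path[-1], ()):
            if len(path) == ring_size:
                if nxt == path[0]:
                    # Rotate so the smallest account comes first; this removes
                    # duplicates that differ only in their starting point.
                    pivot = path.index(min(path))
                    rings.add(tuple(path[pivot:] + path[:pivot]))
            elif nxt not in path:
                walk(path + [nxt])

    for start in outgoing:
        walk([start])
    return sorted(rings)

print(find_rings(TRANSFERS))  # [('ACCT_1', 'ACCT_2', 'ACCT_3')]
```

The depth-first walk is fine for a toy example; on the full dataset the database's own path matching is the better fit.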

Tool 3: Neo4j MCP server toolset

This section outlines a common architecture for enabling agents to interact with a knowledge graph. It combines a Text-to-Cypher model (Gemma3-4B) with the Neo4j MCP Server to translate natural language into Cypher and execute dynamic queries. Key tools include get-neo4j-schema, read-neo4j-cypher, and write-neo4j-cypher, allowing the agent to understand the graph structure and perform read/write operations.

neo4j_mcp_server = MCPServerStdio(
    params={
        "command": "uvx",
        "args": ["mcp-neo4j-cypher@0.2.1"],
        "env": {
            "NEO4J_URI": NEO4J_URI,
            "NEO4J_USERNAME": NEO4J_USER,
            "NEO4J_PASSWORD": NEO4J_PASSWORD,
            "NEO4J_DATABASE": NEO4J_DATABASE,
        },
    },
    cache_tools_list=True,
    name="Neo4j MCP Server",
    client_session_timeout_seconds=20
)

Tool 4: Generate Cypher

Translating natural language into Cypher queries relies on schema-aware LLMs fine-tuned for this task. Open-source models like neo4j/text-to-cypher-Gemma-3-4B-Instruct-2025.04.0 from Hugging Face are built for this kind of query generation. For example, given a question about shared addresses, the agent can generate an appropriate Cypher query on the fly instead of relying on a predefined tool.

@function_tool
def generate_cypher(request: GenerateCypherRequest) -> str:
    USER_INSTRUCTION = """Generate a Cypher query for the Question below:
    Use the information about the nodes, relationships, and properties from the Schema section below to generate the best possible Cypher query. 
    Return only the Cypher query as your final output, without any additional text or explanation.
    ####Schema:
    {schema}
    ####Question:
    {question}"""

    logger.info(f"TOOL: GENERATE_CYPHER - INPUT - {request.question}")
    user_message = USER_INSTRUCTION.format(
        schema=request.database_schema, 
        question=request.question
    )
    # Generate Cypher query using the text2cypher model
    model: str = "ed-neo4j/t2c-gemma3-4b-it-q8_0-35k"
    response = chat(
        model=model,
        messages=[{"role": "user", "content": user_message}]
    )
    generated_cypher = response['message']['content']
    # Replace \n with new line
    generated_cypher = generated_cypher.replace("\\n", "\n")

    print(f"GENERATED CYPHER: - OUTPUT - {generated_cypher}")
    
    return generated_cypher
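The prompt assembly can be exercised without Ollama running. The sketch below reuses the instruction text from the tool and adds a small clean-up step for model output wrapped in a Markdown code fence. Whether the model emits such fences is an assumption, and build_prompt and clean_cypher are hypothetical helpers, not part of the original code.

```python
PROMPT_TEMPLATE = """Generate a Cypher query for the Question below:
Use the information about the nodes, relationships, and properties from the Schema section below to generate the best possible Cypher query.
Return only the Cypher query as your final output, without any additional text or explanation.
####Schema:
{schema}
####Question:
{question}"""

FENCE = "`" * 3  # Markdown code-fence marker

def build_prompt(schema, question):
    """Fill the instruction template with the graph schema and the user's question."""
    return PROMPT_TEMPLATE.format(schema=schema, question=question)

def clean_cypher(raw):
    """Strip an optional Markdown code fence and surrounding whitespace from model output."""
    text = raw.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines).strip()
    return text

prompt = build_prompt("(:Customer)-[:LIVES_AT]->(:Address)", "Which addresses are shared?")
print(prompt.splitlines()[-1])
print(clean_cypher(FENCE + "cypher\nMATCH (n) RETURN n\n" + FENCE))
```

Keeping the clean-up separate from the model call makes it easy to adjust if the model's output format turns out to differ.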

Tool 5: Create memory

While agents handle short-term memory through conversation history, complex tasks like financial investigations require persistent long-term memory. This memory acts as a dynamic knowledge base, tracking insights and context across sessions. The create_memory tool enables this by storing investigation summaries as nodes linked to relevant entities in the knowledge graph.

@function_tool
def create_memory(content: str, customer_ids: list[str] = [], account_ids: list[str] = [], transaction_ids: list[str] = []) -> str:
    logger.info(f"TOOL: CREATE_MEMORY - {content} - {customer_ids} - {account_ids} - {transaction_ids}")
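    with driver.session() as session:
        # Each CALL block links one entity type. Keeping the UNWINDs in separate
        # subqueries means an empty ID list skips only its own links instead of
        # discarding every row (and every later link) in the query.
        result = session.run(
            """
            CREATE (m:Memory {content: $content, created_at: datetime()})
            WITH m
            CALL (m) {
              UNWIND $customer_ids AS cid
              MATCH (c:Customer {id: cid})
              MERGE (m)-[:FOR_CUSTOMER]->(c)
            }
            CALL (m) {
              UNWIND $account_ids AS aid
              MATCH (a:Account {id: aid})
              MERGE (m)-[:FOR_ACCOUNT]->(a)
            }
            CALL (m) {
              UNWIND $transaction_ids AS tid
              MATCH (t:Transaction {id: tid})
              MERGE (m)-[:FOR_TRANSACTION]->(t)
            }
            RETURN m.content AS content
            """,
            content=content,
            customer_ids=customer_ids,
            account_ids=account_ids,
            transaction_ids=transaction_ids
        )
        # Read the stored content from the single returned row
        record = result.single()
        return f"Created memory: {record['content']}"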

Main function

The main() function sets up and runs an interactive KYC agent using the Neo4j MCP server and OpenAI's Agents SDK. It connects to the MCP server, defines the agent's instructions and tools, and maintains a conversation history for context. In a loop, it accepts user queries, passes them to the agent for processing, and displays the results. If needed, the agent dynamically generates and executes Cypher queries. When the user types quit, the function shuts down the MCP server connection, and the script closes the Neo4j driver on exit.

async def main():
    await neo4j_mcp_server.connect()  # Connect the MCP server before using it

    # Define the instructions for the agent
    instructions = """You are a KYC analyst with access to a knowledge graph. Use the tools to answer questions about customers, accounts, and suspicious patterns.
    You are also a Neo4j expert and can use the Neo4j MCP server to query the graph.
    If you get a question about the KYC database that you can not answer with GraphRAG tools, you should
    - use the Neo4j MCP server to get the schema of the graph (if needed)
    - use the generate_cypher tool to generate a Cypher query from question and the schema
    - use the Neo4j MCP server to query the graph to answer the question
    """

    kyc_agent = Agent(
        name="KYC Analyst",
        instructions=instructions,
        tools=[get_customer_and_accounts, find_customer_rings, create_memory, generate_cypher],
        mcp_servers=[neo4j_mcp_server]
    )
    
    # Initialize conversation history
    conversation_history = []
    
    while True:
        query = input("Enter your KYC query (or 'quit' to exit): ")
        if query.lower() == 'quit':
            break
            
        # Run the agent with conversation history
        result = await Runner.run(
            kyc_agent, 
            conversation_history + [{"role": "user", "content": query}]
        )
        
        # Add the new interaction to conversation history
        conversation_history.extend([
            {"role": "user", "content": query},
            {"role": "assistant", "content": result.final_output}
        ])
        
        print(result.final_output)

    # Clean up
    await neo4j_mcp_server.cleanup()

if __name__ == "__main__":
    try:
        asyncio.run(main())
    finally:
        # Clean up any remaining resources
        driver.close()
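The loop above appends every exchange to conversation_history, so a long session sends an ever-growing prompt to the model. One possible refinement is to keep only the most recent exchanges. The sketch below is an assumption about how that could look, not part of the original agent; the helper append_turn and the default of 10 turns are hypothetical.

```python
def append_turn(history, user_text, assistant_text, max_turns=10):
    """Return a new history with the latest exchange added, keeping the last `max_turns` exchanges."""
    if max_turns <= 0:
        return []
    updated = history + [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": assistant_text},
    ]
    # Each exchange contributes two messages (user + assistant).
    return updated[-2 * max_turns:]

history = []
for i in range(3):
    history = append_turn(history, f"question {i}", f"answer {i}", max_turns=2)

print(len(history))           # 4
print(history[0]["content"])  # question 1
```

Trimming trades context for cost: facts from dropped turns are no longer visible to the agent, which is one reason the tutorial stores investigation summaries in the graph with create_memory.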

Agent reasoning and execution flow

When the agent receives a user query, it follows a structured decision-making process to determine the appropriate tools to use:

1. Schema Discovery
Example Query: “Get me the schema of the database.”
Action: The agent recognizes a schema request and uses the neo4j-mcp-server.get-neo4j-schema tool to retrieve the graph schema.

2. Use of Custom GraphRAG Tool
Example Query: “Show me 5 watchlisted customers involved in suspicious transaction rings.”
Action: This aligns with a predefined tool. The agent calls find_customer_rings with the parameter customer_in_watchlist=True.

3. Dynamic Cypher Generation & Execution
Example Query: “For each of these customers, find their addresses and check if they’re shared with others.”
Action: Since no GraphRAG tool directly addresses this, the agent reuses the previously fetched schema, passes the question and schema to generate_cypher to produce a Cypher query, and executes that query via neo4j-mcp-server.read-neo4j-cypher.

4. Entity-Specific Data Retrieval
Example Query: “Get details for the customer whose address is shared.”
Action: The agent identifies this as a profile lookup and calls get_customer_and_accounts with the customer ID.

5. Memory Creation for Long-Term Context
Example Query: “Write a 300-word summary of this investigation and link it to all related accounts and transactions.”
Action: The agent generates the summary using its internal LLM and then stores it using create_memory, linking it to the relevant entities.

Executing the agent

The entire workflow is now fully integrated and operational. We can proceed to execute the agent and test it using the sample queries outlined above to observe its behavior and capabilities in handling real-world KYC scenarios.

python agent.py

Thanks for reading this article !!

Thanks Gowri M Bhatt for reviewing the content.


The full source code for this tutorial can be found here,

GitHub - codemaker2015/kyc-agent: A smart, tool‑augmented AI agent for Know Your Customer (KYC) investigations.


From Pretrained to Purposeful: Fine-Tuning LLaMA 3.2 Made Easy with Unsloth

Vishnu Sivan — Mon, 23 Jun 2025 15:25:05 GMT

Large language models like LLaMA and Mistral have accelerated open-source innovation, but their size often makes them difficult to fine-tune or deploy on everyday hardware. To address this, smaller models such as TinyLlama-1B, Microsoft Phi-2, and Alibaba Qwen-3B have emerged — offering strong performance in a much smaller footprint.

Still, fine-tuning is essential to adapt any base model to specific tasks like support chat, summarization, or domain-specific Q&A. Traditionally, this requires high memory and compute, putting it out of reach for many developers.

Unsloth solves this challenge by enabling efficient fine-tuning on modest hardware. It uses LoRA (Low-Rank Adaptation) to reduce the number of trainable parameters, lowering memory consumption significantly. Key to Unsloth’s performance is the integration of BitsAndBytes, a library that enables 4-bit and 8-bit quantization. Combined with LoRA, it drastically reduces memory usage and accelerates training — making it possible to fine-tune models on GPUs with limited VRAM.

In this guide, we’ll fine-tune LLaMA 3.2 (3B) using Maxime Labonne’s FineTome-100k dataset in ShareGPT format, demonstrating a practical and efficient setup for real-world fine-tuning without the need for expensive infrastructure.

Getting Started

What is LLM fine-tuning
What is Unsloth
Getting Started with Fine-Tuning LLaMA 3.2
Setting up the environment
Create and Configure Your Colab Notebook
Add Hugging Face Access Token (Optional but Recommended)
Install Unsloth and Required Dependencies
Loading the Model and Tokenizer
Applying LoRA Adapters for Efficient Fine-Tuning
Preparing the Training Dataset
Standardizing the Dataset Format
Loading the dataset
Formatting Prompts
Setting Up and Configuring the Trainer
Training Only on Assistant Responses
Inference — Generating responses
Saving and Loading the Fine-Tuned Model
Load the LoRA Adapters for Inference

What is LLM Fine-Tuning

Fine-tuning is the process of adapting a pre-trained large language model (LLM) to perform better on a specific task or domain. While pre-trained models are trained on massive amounts of general-purpose data, they often fall short in specialized use cases. Fine-tuning bridges this gap by training the model further on curated, domain-specific datasets.

For instance, while a base LLM might perform well on single-turn question-answering tasks, it may struggle with multi-turn conversations typically expected from chatbot systems. To handle such scenarios, the model needs exposure to dialogue-format datasets — something achieved through fine-tuning.

Fine-tuning allows developers to mold general LLMs into custom “avatars” suited for various tasks such as legal document summarization, healthcare Q&A, or multilingual support. The effectiveness of fine-tuning largely depends on the quality of the dataset, the capabilities of the base model, and the method of fine-tuning used.

Common Fine-Tuning Techniques

Full Fine-Tuning: This traditional method updates all the parameters of the model. While effective, it requires significant computational resources and memory, making it less feasible for large models or limited hardware setups.
LoRA (Low-Rank Adaptation): LoRA introduces small trainable matrices (adapters) into the model and only updates them, freezing the rest of the model weights. This reduces compute requirements and speeds up training — ideal for fine-tuning large models on consumer GPUs.
QLoRA (Quantized LoRA): QLoRA goes a step further by applying LoRA to a quantized version of the model. The frozen base weights are first quantized to 4-bit precision using libraries like BitsAndBytes, while the small LoRA adapters are trained in higher precision. This drastically lowers memory consumption while retaining near-original performance.
Adapter Tuning: Adapter tuning inserts additional layers (adapters) into the network without modifying the original model weights. Like LoRA, it allows task-specific tuning with low resource usage and easy parameter sharing.
Prompt-Tuning / Prefix-Tuning: Instead of changing model parameters, this method learns a small prompt or prefix that conditions the model to perform a specific task. It’s lightweight and especially useful when storage or compute resources are constrained.
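
To make the difference between full fine-tuning and LoRA concrete, here is a small back-of-the-envelope calculation. It is a sketch under assumptions: the hidden size of 3072 is only an illustrative value, and a single square weight matrix stands in for a whole layer.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in

def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters for LoRA: B is d_out x r and A is r x d_in."""
    return r * (d_out + d_in)

d = 3072  # illustrative hidden size, assumed for this example
full = full_params(d, d)
lora = lora_params(d, d, r=16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

For this matrix, a rank of 16 trains roughly 1% of the parameters that full fine-tuning would update, which is where most of LoRA's memory savings come from.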

What is Unsloth

Unsloth is an open-source framework purpose-built for fast and efficient fine-tuning of large language models (LLMs). It provides an optimized training backend, making fine-tuning possible even on limited hardware setups by drastically improving training speed and memory efficiency.

At its core, Unsloth integrates custom Triton kernels and a manual backpropagation engine to accelerate training. This results in significant speedups — up to 2x faster than traditional fine-tuning pipelines — without compromising on performance.

Unsloth’s compatibility with QLoRA and BitsAndBytes further enhances its resource efficiency, making it one of the best frameworks for developers looking to fine-tune LLMs quickly and affordably.

Unsloth supports a wide range of popular models, including the latest LLaMA 3.2, Mistral, Phi, and Gemma variants. Most of these models are available in 4-bit quantized format (bnb-4bit), making them ideal for fine-tuning on consumer GPUs with limited VRAM.

Currently Supported Models (4-bit):

LLaMA 3.1, 3.2 & 3.3: Meta-Llama-3.1-8B-bnb-4bit, Meta-Llama-3.1-8B-Instruct-bnb-4bit, Meta-Llama-3.1-70B-bnb-4bit, Meta-Llama-3.1-405B-bnb-4bit, Llama-3.2-1B-bnb-4bit, Llama-3.2-1B-Instruct-bnb-4bit, Llama-3.2-3B-bnb-4bit, Llama-3.2-3B-Instruct-bnb-4bit, Llama-3.3-70B-Instruct-bnb-4bit
Mistral: Mistral-Small-Instruct-2409, mistral-7b-instruct-v0.3-bnb-4bit
Phi: Phi-3.5-mini-instruct, Phi-3-medium-4k-instruct
Gemma: gemma-2-9b-bnb-4bit, gemma-2-27b-bnb-4bit

Getting Started with Fine-Tuning LLaMA 3.2

Fine-tuning large language models, even smaller variants, is a compute-intensive task. It typically requires a machine with at least 10–15 GB of VRAM. Fortunately, free cloud platforms like Google Colab and Kaggle Notebooks offer accessible environments equipped with GPUs — ideal for getting started without local setup.

For this hands-on guide, we will be using Google Colab with a T4 GPU.
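
As a rough guide to why quantization matters on a GPU of this size, the memory needed for the model weights alone can be estimated from the parameter count and the bits per parameter. This is only a sketch: the figure of 3.2 billion parameters is an assumed round number for a "3B" model, and gradients, optimizer state, and activations come on top of these values.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params = 3.2e9  # rough parameter count for a "3B" model, assumed
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(params, bits):.1f} GB")
```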

Setting up the environment

Step 1: Create and Configure Your Colab Notebook

Open Google Colaboratory and sign in with your Google account.
Create a new notebook by clicking on + New Notebook.
Navigate to Runtime → Change runtime type.

Set Hardware Accelerator to GPU.
Choose T4 GPU (recommended for this tutorial).
Click Save.

💡 Tip: You can also run this setup on Kaggle by enabling GPU under “Accelerator” in notebook settings.

Step 2: Add Hugging Face Access Token (Optional but Recommended)

If you’re pulling models from Hugging Face, you’ll need an access token:

In the left sidebar, select the 🔑 Secrets tab.
Add a new secret:

Key: HF_TOKEN
Value: Your Hugging Face access token

To generate the token, open your Hugging Face profile settings, choose Create new token, select the write permission, provide a token name, and click Create token.

Step 3: Install Unsloth and Required Dependencies

Use the following script to install all necessary packages:

!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
!pip install --no-deps unsloth

Loading the Model and Tokenizer

Load the LLaMA 3.2 model using Unsloth’s optimized loading utilities. For this tutorial, we will be working with the Llama-3.2-3B-Instruct-bnb-4bit variant, which is quantized for efficient fine-tuning on limited hardware.

from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

💡 Tip: If you’d like to fine-tune a different model, simply update the model_name variable with the desired model's name from Unsloth’s supported list.

Applying LoRA Adapters for Efficient Fine-Tuning

Low-Rank Adaptation (LoRA) enables efficient fine-tuning by updating only a small subset of the model’s parameters. This significantly reduces memory usage and accelerates training, making it ideal for resource-constrained environments.

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # Unsloth supports rank-stabilized LoRA
    loftq_config = None, # And LoftQ
)
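
To show what the r and lora_alpha settings control, here is a minimal pure-Python sketch of a LoRA layer's forward pass, y = W x + (alpha / r) * B (A x). It is a toy illustration, not Unsloth's implementation: the 2x2 matrices and the "trained" values of B are made up.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]

def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector,
                 alpha: float, r: int) -> Vector:
    """Compute W x + (alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy 2x2 layer with rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
A = [[0.5, 0.5]]               # r x d_in
B_zero = [[0.0], [0.0]]        # d_out x r, zero at initialisation
B_trained = [[1.0], [-1.0]]    # made-up values standing in for a trained B
x = [2.0, 4.0]

out_init = lora_forward(W, A, B_zero, x, alpha=1, r=1)        # same as the base layer
out_trained = lora_forward(W, A, B_trained, x, alpha=1, r=1)  # base output plus update
print(out_init, out_trained)
```

With r = 16 and lora_alpha = 16, as in the configuration above, the scale factor alpha / r is 1, so the adapter update is added to the base output unscaled.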

Preparing the Training Dataset

Before starting the training process, we need to load and preprocess the dataset. In this guide, we will use Maxime Labonne’s FineTome-100k, a high-quality dataset formatted in ShareGPT-style multi-turn conversations.

You are free to use any dataset, but it must be structured correctly for the model to interpret the inputs properly. If your dataset isn’t already in the expected format, you will need to preprocess it accordingly. The Hugging Face Datasets documentation is a helpful resource for transforming and preparing datasets for fine-tuning.

Standardizing the Dataset Format

For LLaMA 3.x models, Unsloth expects conversations to follow a specific format — similar to:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>
Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Hey there! How are you?<|eot_id|>

To ensure compatibility with the LLaMA 3.x training pipeline, we use the standardize_sharegpt utility to convert the dataset from ShareGPT style to Hugging Face’s generic multi-turn conversation format, which uses the fields "role" and "content" instead of "from" and "value".

For example, the original format:

{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}

Is transformed into the standardized format:

{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}

This standardized structure ensures compatibility with Unsloth’s get_chat_template() function and avoids tokenization or formatting issues during training. It’s a crucial preprocessing step for models that expect role-based dialogue formatting.
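
The conversion itself is a simple field and role renaming. The function below is a minimal stand-alone sketch of the idea, not the actual standardize_sharegpt code; the function name and the role mapping are assumptions based on the example shown above.

```python
from typing import Dict, List

# Role names used by ShareGPT mapped to the role/content convention.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_messages(turns: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Convert [{"from": ..., "value": ...}] turns into role/content messages."""
    messages = []
    for turn in turns:
        source = turn["from"]
        if source not in ROLE_MAP:
            raise ValueError(f"unexpected speaker: {source!r}")
        messages.append({"role": ROLE_MAP[source], "content": turn["value"]})
    return messages

conversation = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]
messages = sharegpt_to_messages(conversation)
for message in messages:
    print(message)
```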

Loading the dataset

To begin fine-tuning, we first need to load the dataset into our environment. In this tutorial, we use the FineTome-100k dataset curated by Maxime Labonne, which contains high-quality multi-turn conversations in ShareGPT format.

from datasets import load_dataset
dataset = load_dataset("mlabonne/FineTome-100k", split = "train")

Formatting Prompts

After preparing the dataset, the next step is to structure the data using the appropriate chat format expected by the model. In this case, we apply the LLaMA 3.1 chat template using Unsloth’s get_chat_template() function. This function configures the tokenizer to format prompts in the LLaMA-style conversational structure, ensuring the model can effectively process and learn from multi-turn dialogues during fine-tuning.

from unsloth.chat_templates import get_chat_template
from unsloth.chat_templates import standardize_sharegpt

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)

To verify that the dataset is correctly structured for fine-tuning with the LLaMA 3.1 format, it’s useful to inspect both the original conversation format and the formatted text version.

# View the original conversation format
print(dataset[5]["conversations"])
# View the same item in the formatted text format
print(dataset[5]["text"])
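
To get a feel for what the formatted text contains, the following sketch renders a conversation by hand using the header and end-of-turn tokens shown earlier. It is a simplified approximation written for illustration: the real chat template is applied by the tokenizer and may add extra content, such as a default system block.

```python
from typing import Dict, List

def render_llama3(messages: List[Dict[str, str]],
                  add_generation_prompt: bool = False) -> str:
    """Render messages with Llama 3 style header and end-of-turn tokens."""
    text = "<|begin_of_text|>"
    for msg in messages:
        text += f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
        text += f"{msg['content']}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

chat = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hey there! How are you?"},
]
training_text = render_llama3(chat)
inference_prompt = render_llama3(chat[:1], add_generation_prompt=True)
print(training_text)
```

During training the full conversation is rendered without a generation prompt, while at inference time the prompt ends with an open assistant header.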

Setting Up and Configuring the Trainer

With the dataset and model prepared, the next step is to configure the fine-tuning process using Hugging Face’s SFTTrainer. This trainer simplifies fine-tuning by handling essential tasks such as tokenization, batching, gradient accumulation, and optimization. It is fully compatible with Unsloth, enabling efficient training with reduced VRAM consumption and improved speed.

For this tutorial, training is limited to 60 steps for demonstration purposes. For a complete fine-tuning run, set num_train_epochs=1 and remove the max_steps limit (in Hugging Face’s TrainingArguments, the default max_steps=-1 means no step limit) to train over the entire dataset.

from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60, # Limit training steps to 60 (for quick testing)
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs", # Directory to save model checkpoints
        report_to = "none", # Use this for WandB etc
    ),
)
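
The numbers in this configuration are easier to judge with a little arithmetic. The sketch below computes the effective batch size, how many examples 60 steps actually cover, and the learning rate under a linear schedule with warmup. The dataset size of 100,000 examples is an assumption taken from the dataset name, and linear_lr is a hand-written approximation of a linear warmup and decay schedule, not the trainer's own code.

```python
import math

def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Examples contributing to each optimizer step."""
    return per_device * grad_accum * num_devices

def steps_per_epoch(num_examples: int, batch: int) -> int:
    """Optimizer steps needed to see every example once."""
    return math.ceil(num_examples / batch)

def linear_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

batch = effective_batch_size(per_device=2, grad_accum=4)
print("effective batch size:", batch)
print("examples seen in 60 steps:", batch * 60)
print("steps for one epoch (assumed 100,000 examples):", steps_per_epoch(100_000, batch))
for step in (0, 5, 30, 60):
    print(step, linear_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=60))
```

With these settings, 60 steps cover only a small fraction of the dataset, which is why the run is suitable for a quick test rather than a full fine-tune.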

Training Only on Assistant Responses

To make the training process more efficient and focused, we configure the model to learn only from the assistant’s responses, while ignoring the user’s inputs during loss computation. This approach helps the model better understand how to generate high-quality replies without being penalized for user input patterns.

Unsloth provides a convenient utility for this purpose: train_on_responses_only from unsloth.chat_templates.

from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",     # Marks user input
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",   # Marks assistant response
)
# Begin training
trainer_stats = trainer.train()

This setup ensures that the model is optimized solely on the assistant’s outputs. While training loss may decrease gradually, that’s expected — especially when using a small number of training steps. In this example, we limited training to 60 steps for quick experimentation.
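
Conceptually, training on responses only means masking the labels of every token that does not belong to an assistant turn. The sketch below illustrates the idea on made-up token IDs; it is not how train_on_responses_only is implemented internally, and the segment structure is a simplification. The value -100 is the label that PyTorch's cross-entropy loss ignores by default.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def mask_labels(segments: List[Tuple[str, List[int]]]) -> Tuple[List[int], List[int]]:
    """Return (input_ids, labels) where only assistant tokens keep their label."""
    input_ids: List[int] = []
    labels: List[int] = []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels

# Made-up token IDs for a two-turn conversation.
segments = [
    ("user", [11, 12, 13]),
    ("assistant", [21, 22]),
    ("user", [14]),
    ("assistant", [23]),
]
input_ids, labels = mask_labels(segments)
print(input_ids)
print(labels)
```

The model still reads the user tokens as context, but they contribute nothing to the loss.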

💡 Tip: For better performance, it’s recommended to fine-tune the model for 2–3 epochs on large datasets or 3–5 epochs on smaller ones. Aim for 500+ training steps at minimum, and ideally 1000+ steps if hardware resources permit.

Inference — Generating responses

Once fine-tuning is complete, the trained model is ready for inference, that is, generating responses to new inputs. To run inference, pass the conversation as a list of messages and apply the chat template with add_generation_prompt=True, so that the prompt ends with an open assistant turn for the model to complete.

For this example, we use the following decoding parameters:

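min_p = 0.1: discards tokens whose probability is below 10% of the most likely token's probability, trimming unlikely continuations while keeping diversity
temperature = 1.5: flattens the token distribution (values above 1 increase randomness in the output)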

Feel free to adjust these values to fine-tune response creativity and coherence based on your use case.
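
To see how these two parameters interact, here is a small pure-Python sketch that applies temperature scaling and then min_p filtering to a toy set of logits. The logit values are made up, and the order of operations (temperature first, then min_p) is an assumption about how the sampler is typically configured.

```python
import math
from typing import List

def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def min_p_filter(probs: List[float], min_p: float) -> List[float]:
    """Zero out tokens below min_p times the top probability, then renormalise."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

logits = [2.0, 1.0, 0.0, -3.0]  # made-up logits for four candidate tokens
probs = softmax(logits, temperature=1.5)
filtered = min_p_filter(probs, min_p=0.1)
print([round(p, 3) for p in probs])
print([round(p, 3) for p in filtered])
```

In this example the least likely token falls below the threshold and is removed, while the remaining three are renormalised and stay available for sampling.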

from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)

# Decode the generated tokens into human-readable text
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

Saving and Loading the Fine-Tuned Model

Once training is complete, you can save your fine-tuned model and tokenizer either locally or push them to the Hugging Face Hub.

Save Locally

model_name = "Llama32_fine_tuned"
model.save_pretrained(model_name)
tokenizer.save_pretrained(model_name)

This will store only the LoRA adapter weights, not the full base model.

Push to Hugging Face Hub

To make your fine-tuned adapters publicly or privately accessible:

model.push_to_hub("your_name/your_model_name")
tokenizer.push_to_hub("your_name/your_model_name")

Save the Full Model in GGUF Format (Optional)

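To export the complete model (base + LoRA adapters) in the efficient GGUF format, which is suitable for CPU inference, use the following method. Note that push_to_hub_gguf uploads the result to the Hugging Face Hub, so a valid access token is required; to keep the export local, Unsloth also provides save_pretrained_gguf, which takes the same arguments: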

model.push_to_hub_gguf(model_name, tokenizer, quantization_method="q4_k_m")

This compresses the model using the q4_k_m quantization method, which helps reduce model size and boosts inference performance.
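
Exporting "base + LoRA adapters" as one model implies folding the adapter into the base weights. One way to picture that merge is W' = W + (alpha / r) * B A, sketched below on toy matrices. This is an illustration of the arithmetic only, with made-up values; it does not reproduce Unsloth's export code or the GGUF quantization step.

```python
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def merge_lora(W: Matrix, A: Matrix, B: Matrix, alpha: float, r: int) -> Matrix:
    """Fold the adapter into the base weight: W + (alpha / r) * B A."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]  # toy base weight
A = [[0.5, 0.5]]              # r x d_in with r = 1
B = [[1.0], [-1.0]]           # d_out x r, made-up trained values
merged = merge_lora(W, A, B, alpha=1, r=1)
print(merged)
```

After merging, the model no longer needs the separate adapter matrices at inference time.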

Load the LoRA Adapters for Inference

To use your saved LoRA adapters for inference:

from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Llama32_fine_tuned",  # Name of your fine-tuned model
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model)  # Enable optimized inference

Now generate responses using the tokenizer and the trained model:

from transformers import TextStreamer

messages = [
    {"role": "user", "content": "Describe a tall tower in the capital of France."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

# Stream tokens to stdout as they are generated, skipping the prompt
text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(
    input_ids=inputs,
    streamer=text_streamer,
    max_new_tokens=128,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)

This will generate a response in real-time using your fine-tuned model.

Thanks for reading this article !!

Thanks Gowri M Bhatt for reviewing the content.

If you enjoyed this article, please click on the clap button 👏 and share to help others find it!


The Ultimate MCP Handbook: From Basics to Advanced LLM Integration

Vishnu Sivan — Thu, 15 May 2025 06:02:10 GMT

Large Language Models (LLMs) like GPT or Claude are incredibly powerful at generating natural language text — but at their core, they’re just really good at predicting the next token in a sequence. Out of the box, they can’t fetch local files, run custom code, or interact with external APIs. So, how do we bridge the gap between this language intelligence and the real world?

That’s where the Model Context Protocol (MCP) comes in.

MCP is a fast-emerging standard designed to extend LLM capabilities beyond static conversations. It acts as a universal connector between AI systems and the outside world, enabling seamless interaction with tools, databases, APIs, and other services — much like how USB-C connects different devices in a plug-and-play fashion.

In this tutorial, we will explore MCP through a practical, beginner-friendly project. You will learn how to build a custom MCP server that connects an AI model to the Yahoo Finance API, empowering the model to fetch real-time stock prices, compare multiple stocks, and even perform historical trend analysis. By the end of this article, you will have a working MCP server and a solid foundation to build more advanced, real-world AI integrations.

Getting Started

What is MCP
How the MCP server works
MCP Core functionalities
Building your first MCP server
Example 1: Creating Your First MCP Server
Example 2: Interacting with SQLite database
Example 3: Using Pre-Built MCP Servers
Example 4: Using a Python MCP Client
Stock Price Comparison using MCP Server
Building MCP Server for Stock Market Analysis

What is MCP

Model Context Protocol (MCP) is an open standard, introduced by Anthropic, for organizing and delivering context (data, tools, and prompt templates) to large language models (LLMs). It is designed to help models better understand and use the information and capabilities provided to them.

The key components of Model Context Protocol include:

Structured formatting: Using a consistent format with clear section delineations (often XML-style tags) to organize different types of information.
Information hierarchies: Arranging information by importance and relevance to help the model prioritize what matters most.
Metadata tagging: Providing additional information about the context, such as source, reliability, or timestamp.
Processing instructions: Explicit guidance on how the model should handle, interpret, or use specific pieces of information.

MCP is particularly useful for:

Complex applications where models need to process multiple types of information
Situations requiring specific handling of different context elements
Improving model performance by reducing ambiguity in how context should be used

By standardizing how context is presented to models, MCP aims to make AI interactions more reliable, consistent, and effective across different use cases and implementations.

How the MCP server works

Basically, the host handles the user interface, the MCP client routes requests, and the MCP server performs the actual tasks — acting as the operational backbone that enables LLMs to interact with real-world systems.

User Input: A user makes a request through the host application.
LLM Interpretation: The LLM processes the request and identifies if a corresponding MCP tool is available to fulfill it.
Client Routing: If a suitable tool is configured, the MCP client packages the request and forwards it using the MCP protocol to the MCP server.
Task Execution by Server:
The MCP server receives the request.
It then triggers the appropriate action — this could be:
Querying a local resource (e.g., a SQLite database)
Calling an external service (e.g., an email API, stock data API, or internal system)
Response Handling: The server processes the response from the tool and sends the result back to the MCP client.
LLM Response Generation: The LLM takes the returned result, integrates it into a natural language response, and sends it back to the user via the host application.
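
The request and response steps above can be sketched as a pair of JSON messages. The example below simulates the client and server in a single process using only the standard library. It is a simplified illustration: the message shapes are modeled on MCP's JSON-RPC 2.0 tools/call request, while the transport, initialization handshake, and error handling are left out, and the function names are hypothetical.

```python
import json
from typing import Any, Callable, Dict

# Tools the toy server exposes, keyed by name.
TOOLS: Dict[str, Callable[..., Any]] = {"add": lambda a, b: a + b}

def client_build_request(request_id: int, tool: str, arguments: Dict[str, Any]) -> str:
    """Client side: package a tool call as a JSON-RPC 2.0 request."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

def server_handle(raw: str) -> str:
    """Server side: run the requested tool and wrap the result as text content."""
    request = json.loads(raw)
    params = request["params"]
    result = TOOLS[params["name"]](**params["arguments"])
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request["id"],
        "result": {"content": [{"type": "text", "text": str(result)}]},
    })

request = client_build_request(1, "add", {"a": 2, "b": 3})
response = json.loads(server_handle(request))
print(request)
print(response)
```

The id field lets the client match each response to the request that produced it, which matters once several calls are in flight.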

MCP Core functionalities

MCP servers provide three core functionalities.

Resources: Storing and Serving File-like Data

MCP resources function as read-only data sources that provide structured information to a Large Language Model (LLM). They are similar to REST API GET requests—exposing data without performing any computations.

These resources can be accessed by the LLM on demand.

@mcp.resource("greeting://{name}")
def get_greeting(name: str) -> str:
    """Get a personalized greeting"""
    return f"Hello, {name}!"

This example defines a resource at greeting://{name} that returns a simple greeting string when accessed.
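
To illustrate how a templated URI such as greeting://{name} can be resolved to function arguments, here is a small stand-alone sketch using regular expressions. The match_template helper is hypothetical and written for this example; it is not the matching logic used by the MCP SDK.

```python
import re
from typing import Dict, Optional

def match_template(template: str, uri: str) -> Optional[Dict[str, str]]:
    """Match a URI against a template such as 'greeting://{name}'."""
    pattern = ""
    # Split on {placeholders}, keeping them, and escape the literal parts.
    for part in re.split(r"(\{\w+\})", template):
        if part.startswith("{") and part.endswith("}"):
            pattern += f"(?P<{part[1:-1]}>[^/]+)"
        else:
            pattern += re.escape(part)
    match = re.fullmatch(pattern, uri)
    return match.groupdict() if match else None

def get_greeting(name: str) -> str:
    """Same body as the resource function above."""
    return f"Hello, {name}!"

params = match_template("greeting://{name}", "greeting://Alice")
greeting = get_greeting(**params) if params else None
print(params, greeting)
```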

Tools: Functions Executed by the LLM

MCP tools allow the AI to perform specific tasks, similar to API POST requests. These functions can carry out computations, interact with databases, or call external APIs, enabling the LLM to go beyond static data and take meaningful actions.

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers and return the result."""
    return a + b

This tool adds two integers and returns their sum.

Prompts: Predefined Instruction Templates

MCP prompts are reusable templates that help the LLM carry out structured tasks consistently. These templates guide the model’s responses for common or complex request types.

@mcp.prompt()
def review_code(code: str) -> str:
    return f"Please review this code:\n\n{code}"

This prompt helps the LLM respond in a structured manner when asked to perform a code review.

Building your first MCP server

Let’s build a local MCP server in Python that connects to an SQLite database to retrieve the top chatters in a community. You will interact with your LLM through tools like Cursor or Claude Desktop, while the MCP server handles all the backend database operations.

Installing uv

We will use uv, a fast and modern Python project manager, to set up and manage our environment. It simplifies tasks like handling dependencies, creating virtual environments, and running scripts.

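To install uv on Windows, run the following commands in your terminal. The second command adds uv to the PATH for the current session; replace the path with your own user directory: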

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

set Path=C:\Users\Codem\.local\bin;%Path%

Refer to the official website for detailed installation instructions.

Installation | uv

Installing the dependencies

Initialize a uv project by executing the following command.

uv init mcp_demo
cd mcp_demo

Create and activate a virtual environment by executing the following command.

uv venv
source .venv/bin/activate # for linux
.venv\Scripts\activate    # for windows

Install mcp SDK using uv. The mcp package includes both the server framework and optional CLI utilities. Installing it with the [cli] extra gives you access to helpful command-line tools.

uv add "mcp[cli]"

To confirm that the installation was successful, run:

mcp version

Example 1: Creating Your First MCP Server

Let’s begin by creating a simple calculator tool that adds two numbers.

Create a file named calculator.py inside the mcp_demo directory and insert the following code into it:

from mcp.server.fastmcp import FastMCP  # Import FastMCP, the quickstart server base

mcp = FastMCP("Calculator Server")  # Initialize an MCP server instance with a descriptive name

@mcp.tool()  # Register a function as a callable tool for the model
def add(a: int, b: int) -> int:
    """Add two numbers and return the result."""
    return a + b

# Add a dynamic greeting resource
@mcp.resource("greeting://{name}")
def get_greeting(name: str) -> str:
    """Get a personalized greeting"""
    return f"Hello, {name}!"

if __name__ == "__main__":
    mcp.run(transport="stdio")  # Run the server, using standard input/output for communication

This script sets up a basic MCP server with a tool named add and a greeting resource. The @mcp.tool() decorator registers the function with the MCP framework, making it accessible to connected LLMs.

You can test the MCP server using the following command.

mcp dev calculator.py

This command launches the MCP Inspector, which you can open in your browser at http://127.0.0.1:6274. The Inspector provides a user-friendly interface to view available tools and resources. It also allows you to interact with these tools directly using built-in UI controls.

Integrating your MCP server with Claude Desktop

To connect your MCP server to Claude, you’ll need to have Claude for Desktop installed.
You can download it from the official site: https://claude.ai/download

Follow the installation instructions provided for your operating system.

Adding Your MCP Server to Claude Desktop
Once Claude Desktop is installed, you can add your MCP server using the following command:

mcp install calculator.py

This registers your MCP server with Claude so it can be accessed from the desktop app.

Manual Configuration (Alternative Method)
If you would prefer to configure it manually, open the Claude configuration file by clicking Claude Desktop -> File -> Settings -> Developer -> Edit Config button.

Windows: %APPDATA%\Claude\claude_desktop_config.json
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json

Then, add the following entry to the mcpServers section:

{
 "mcpServers": {
   "Calculator Server": {
     "command": "C:\\Users\\Codem\\.local\\bin\\uv.EXE",
     "args": [
       "run",
       "--with",
       "mcp[cli]",
       "mcp",
       "run",
       "Absolute path to calculator.py"
     ]
   }
 }
}

Replace “Absolute path to calculator.py” with the full path to your actual calculator.py MCP server script.

Testing MCP tool with Claude desktop

Restart the Claude Desktop application to see the MCP tool appear inside the app. Once it’s loaded, you will be able to use the tool directly within the interface.

Example 2: Interacting with SQLite database

In the previous example, we created an add tool and integrated it with Claude Desktop. However, it's important to note that such a basic arithmetic function may not be triggered by Claude, as simple calculations are already handled natively by the model and don’t require an external tool.

Now, let’s move on to a more practical use case that demonstrates the true potential of MCP — retrieving data from a local source and making it accessible to the LLM for dynamic and context-aware interactions.

Get the Sample Database

Download the community.db file, which contains a chatters table with sample data. Once downloaded, move the database file into your project directory.

https://doimages.nyc3.cdn.digitaloceanspaces.com/006Community/MCP-server-python/community.db

Create SQLite MCP server

Create a new file called sqlite_server.py and add the following code to it.

# sqlite_server.py
from mcp.server.fastmcp import FastMCP
import sqlite3

# Initialize the MCP server with a name
mcp = FastMCP("Community Chatters")

# Define a tool to fetch the top chatters from the SQLite database
@mcp.tool()
def get_top_chatters():
    """Retrieve the top chatters sorted by number of messages."""
    # Connect to the SQLite database
    conn = sqlite3.connect('E:\\Experiments\\GenerativeAI\\MCP\\mcp_demo\\community.db')
    cursor = conn.cursor()
    
    # Execute the query to fetch chatters sorted by messages
    cursor.execute("SELECT name, messages FROM chatters ORDER BY messages DESC")
    results = cursor.fetchall()
    conn.close()
    
    # Format the results as a list of dictionaries
    chatters = [{"name": name, "messages": messages} for name, messages in results]
    return chatters

# Run the MCP server locally
if __name__ == '__main__':
    mcp.run()

Make sure to provide the absolute path to the database in the sqlite3.connect() method. If you use a relative path, Claude Desktop may not be able to locate or access the database correctly.
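
If you want to check the query logic without the downloaded file, the same SELECT can be run against an in-memory database. The sketch below is an assumption-laden stand-in: the table layout is inferred from the query above, the rows are made up, and the LIMIT parameter is an addition for illustration.

```python
import sqlite3
from typing import Dict, List, Union

def top_chatters(conn: sqlite3.Connection, limit: int = 3) -> List[Dict[str, Union[str, int]]]:
    """Return the most active chatters as a list of dictionaries."""
    cursor = conn.execute(
        "SELECT name, messages FROM chatters ORDER BY messages DESC LIMIT ?",
        (limit,),
    )
    return [{"name": name, "messages": messages} for name, messages in cursor.fetchall()]

# In-memory database with made-up rows standing in for community.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chatters (name TEXT, messages INTEGER)")
conn.executemany(
    "INSERT INTO chatters VALUES (?, ?)",
    [("Ann", 12), ("Bo", 40), ("Cy", 7), ("Di", 25)],
)
result = top_chatters(conn, limit=2)
conn.close()
print(result)
```

Passing the limit as a bound parameter rather than formatting it into the SQL string keeps the query safe if the value ever comes from the model.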

Test the sqlite server using the following command.

mcp dev sqlite_server.py

Testing with Claude Desktop

First, integrate the SQLite MCP server with Claude by running the following command:

mcp install sqlite_server.py

Next, restart the Claude Desktop application. Once it’s running, you can begin asking questions related to the local database directly within the interface. When executing the SQLite server MCP, a prompt will appear requesting permission to run the MCP tool. Please approve the prompt to proceed.

Example 3: Using Pre-Built MCP Servers

Anthropic and the MCP community provide a set of pre-built MCP servers that can be integrated directly with Claude Desktop or Cursor, giving those applications capabilities such as file system and Git access without writing a server yourself.

GitHub - modelcontextprotocol/servers: Model Context Protocol Servers

In this section, we will implement the File System and Git MCP servers.

File System

To enable filesystem functionality, we will install a pre-built Filesystem MCP server in Claude for Desktop.

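Open the configuration file using any text editor and add the following filesystem entry to the mcpServers object. If the file already defines other servers, merge the entry into the existing mcpServers object rather than pasting a second JSON object at the end of the file. Replace /path/to/directory with the folder Claude should be able to access.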

{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-filesystem",
        "/path/to/directory"
      ]
    }
  }
}

For instance, a complete configuration file with all three servers looks like this.

{
  "mcpServers": {
    "Calculator Server": {
      "command": "C:\\Users\\Codem\\.local\\bin\\uv.EXE",
      "args": [
        "run",
        "--with",
        "mcp[cli]",
        "mcp",
        "run",
        "E:\\Experiments\\GenerativeAI\\MCP\\mcp_demo\\calculator.py"
      ]
    },
    "Community Chatters": {
      "command": "C:\\Users\\Codem\\.local\\bin\\uv.EXE",
      "args": [
        "run",
        "--with",
        "mcp[cli]",
        "mcp",
        "run",
        "E:\\Experiments\\GenerativeAI\\MCP\\mcp_demo\\sqlite_server.py"
      ]
    },
    "filesystem": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-filesystem",
        "E:\\Experiments\\GenerativeAI\\MCP\\mcp_demo"
      ]
    }
  }
}
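
Because the configuration file is a single JSON object, a new server has to be merged into the existing mcpServers object. The sketch below shows that merge in Python. The helper add_mcp_server and the sample existing configuration are hypothetical and only illustrate the shape of the data; editing the file by hand works just as well.

```python
import json


def add_mcp_server(config, name, command, args):
    """Return a copy of config with one server entry added under mcpServers."""
    merged = dict(config)
    # Copy the server map so the caller's configuration is left untouched
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = {"command": command, "args": list(args)}
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing configuration with one server already registered
existing = {
    "mcpServers": {
        "Calculator Server": {"command": "uv", "args": ["run", "calculator.py"]}
    }
}

updated = add_mcp_server(
    existing,
    "filesystem",
    "npx",
    ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/directory"],
)
print(json.dumps(updated, indent=2))
```

The printed result contains both servers under one mcpServers key, which is the structure the sample file above uses.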

After updating the configuration file, restart Claude for Desktop to apply the changes. Once it’s running, you can begin asking questions related to the specified folder directly within the interface.

Git

Anthropic provides the mcp-server-git MCP server, which includes tools for reading, searching, and manipulating Git repositories using Large Language Models.

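To enable Git functionality, open the configuration file using any text editor and add the following git entry to the mcpServers object, replacing path/to/git/repo with the path to your repository. If the file already defines other servers, add only the "git" entry inside the existing mcpServers object.

{
  "mcpServers": {
    "git": {
      "command": "uvx",
      "args": ["mcp-server-git", "--repository", "path/to/git/repo"]
    }
  }
}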

Your application now has Git support enabled, allowing you to execute Git commands through Large Language Models (LLMs).

Example 4: Using a Python MCP Client

To perform a specific task programmatically, you can use the MCP Python SDK to create a client. In the earlier section, we created a simple tool (calculator.py) that adds two numbers. Now, let's use the MCP client to invoke the add tool and perform an addition operation.

Create a file named calculator_client.py and add the following code to it.

from mcp import ClientSession, StdioServerParameters, types
from mcp.client.stdio import stdio_client

# Create server parameters for stdio connection
server_params = StdioServerParameters(
    command="python",  # Executable
    args=["calculator.py"],  # Optional command line arguments
    env=None,  # Optional environment variables
)

async def run():
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            # Initialize the connection
            await session.initialize()

            # Call a tool
            result = await session.call_tool("add", arguments={"a": 3, "b": 4})
            print(f"Result of add tool: {result}")


if __name__ == "__main__":
    import asyncio

    asyncio.run(run())

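Because the client uses the stdio transport, stdio_client launches calculator.py as a subprocess using the command and arguments in server_params, so the server does not have to be started separately. If you want to confirm that the server starts without errors on its own, you can run it in a terminal first:

python calculator.py

Then run the client from the directory that contains calculator.py:

python calculator_client.py

The client prints the result returned by the add tool.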

Stock Price Comparison using MCP Server

In this section, we will build a custom MCP server using the Yahoo Finance Python API. The server will be capable of fetching real-time stock prices, performing comparisons, and providing historical data analysis.

Installing dependencies

This application is created inside the uv project (mcp_demo) which we created earlier. You can also create it as a separate project by initializing uv if required.

Install the yfinance Python package using pip.

pip install yfinance

Create a file named stock_price_server.py and add the following code to it.

from mcp.server.fastmcp import FastMCP
import yfinance as yf

# Create an MCP server with a custom name
mcp = FastMCP("Stock Price Server")

@mcp.tool()
def get_stock_price(symbol: str) -> float:
    """
    Retrieve the current stock price for the given ticker symbol.
    Returns the latest closing price as a float.
    """
    try:
        ticker = yf.Ticker(symbol)
        # Get today's historical data; may return empty if market is closed or symbol is invalid.
        data = ticker.history(period="1d")
        if not data.empty:
            # Use the last closing price from today's data
            price = data['Close'].iloc[-1]
            return float(price)
        else:
            # As a fallback, try using the regular market price from the ticker info
            info = ticker.info
            price = info.get("regularMarketPrice", None)
            if price is not None:
                return float(price)
            else:
                return -1.0  # Indicate failure
    except Exception:
        # Return -1.0 to indicate an error occurred when fetching the stock price
        return -1.0

@mcp.resource("stock://{symbol}")
def stock_resource(symbol: str) -> str:
    """
    Expose stock price data as a resource.
    Returns a formatted string with the current stock price for the given symbol.
    """
    price = get_stock_price(symbol)
    if price < 0:
        return f"Error: Could not retrieve price for symbol '{symbol}'."
    return f"The current price of '{symbol}' is ${price:.2f}."

@mcp.tool()
def get_stock_history(symbol: str, period: str = "1mo") -> str:
    """
    Retrieve historical data for a stock given a ticker symbol and a period.
    Returns the historical data as a CSV formatted string.
    
    Parameters:
        symbol: The stock ticker symbol.
        period: The period over which to retrieve historical data (e.g., '1mo', '3mo', '1y').
    """
    try:
        ticker = yf.Ticker(symbol)
        data = ticker.history(period=period)
        if data.empty:
            return f"No historical data found for symbol '{symbol}' with period '{period}'."
        # Convert the DataFrame to a CSV formatted string
        csv_data = data.to_csv()
        return csv_data
    except Exception as e:
        return f"Error fetching historical data: {str(e)}"

@mcp.tool()
def compare_stocks(symbol1: str, symbol2: str) -> str:
    """
    Compare the current stock prices of two ticker symbols.
    Returns a formatted message comparing the two stock prices.
    
    Parameters:
        symbol1: The first stock ticker symbol.
        symbol2: The second stock ticker symbol.
    """
    price1 = get_stock_price(symbol1)
    price2 = get_stock_price(symbol2)
    if price1 < 0 or price2 < 0:
        return f"Error: Could not retrieve data for comparison of '{symbol1}' and '{symbol2}'."
    if price1 > price2:
        result = f"{symbol1} (${price1:.2f}) is higher than {symbol2} (${price2:.2f})."
    elif price1 < price2:
        result = f"{symbol1} (${price1:.2f}) is lower than {symbol2} (${price2:.2f})."
    else:
        result = f"Both {symbol1} and {symbol2} have the same price (${price1:.2f})."
    return result

if __name__ == "__main__":
    mcp.run()

Test the MCP server using the MCP Inspector by running the following command in your terminal.

mcp dev stock_price_server.py

The command starts the MCP Inspector, where you can list the server's tools and call them interactively.

Integrate the stock price server with Claude for Desktop by running the following command:

mcp install stock_price_server.py --with yfinance

If your server depends on additional packages, specify them with the --with option, as shown above for yfinance. This ensures that the required libraries are installed before the server runs.

After the integration, restart Claude for Desktop to enable the new MCP server for stock price-related queries.

Building MCP Server for Stock Market Analysis

Manually performing stock market predictions and analysis can be a tedious and time-consuming task. Instead, imagine being able to simply ask: “What’s the RSI for MSFT right now?”

An MCP server can instantly fetch the latest stock data, calculate the RSI, and return the result — making it significantly easier to make informed trading decisions without switching between multiple apps and websites.

In this section, we will use the Alpha Vantage API (free tier) to pull real-time stock data and integrate it into an MCP server. This integration allows us to analyze stocks using custom-built AI tools.

Installing the dependencies

This application is created inside the uv project (mcp_demo) which we created earlier. You can also create it as a separate project by initializing uv if required.

For a separate project, create a new uv project and add the dependencies using the following commands.

# Create a new directory for our project
uv init finance
cd finance

# Create virtual environment and activate it
uv venv
.venv\Scripts\activate

# Install dependencies
uv add mcp[cli] requests pandas tabulate

For the existing project (mcp_demo), install the requests, pandas, and tabulate packages using pip.

pip install requests pandas tabulate

Fetching Stock Market Data Using the Alpha Vantage API

Alpha Vantage is a widely used service that provides both real-time and historical financial market data. It offers a range of APIs for accessing information on equities, currencies, cryptocurrencies, and more.

To begin using Alpha Vantage, you’ll need to sign up on their official website and obtain a free API key. The free tier allows up to 25 API requests per day. Once you have your API key, you can retrieve intraday stock price data using the TIME_SERIES_INTRADAY endpoint. This API returns time series data with key metrics such as open, high, low, close, and volume, updated in real time.

To make a request, you’ll need to specify:

The stock symbol (e.g., MSFT)
The interval between data points (1min, 5min, 15min, 30min, or 60min)
Your API key

Example API Call (5-minute interval for MSFT):

https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol=MSFT&interval=5min&apikey=YOUR_API_KEY

This call returns the latest 5-minute interval data for the Microsoft stock, which you can then parse and use in your application or analysis workflow.
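
To make the request and response shape concrete, here is a minimal sketch that builds the query URL and flattens a response into rows. It assumes the response uses a "Time Series (5min)" key with numbered field names such as "1. open", which is the layout the server code later in this article parses. The sample payload contains made-up values, and build_intraday_url and parse_intraday are illustrative helper names.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.alphavantage.co/query"


def build_intraday_url(symbol, interval, api_key):
    """Assemble the TIME_SERIES_INTRADAY query URL."""
    params = {
        "function": "TIME_SERIES_INTRADAY",
        "symbol": symbol,
        "interval": interval,
        "apikey": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def parse_intraday(payload, interval):
    """Turn the nested time series into a list of rows, oldest first."""
    series = payload[f"Time Series ({interval})"]
    rows = []
    # Timestamps are formatted so that string order equals time order
    for timestamp in sorted(series):
        row = {"timestamp": timestamp}
        for key, value in series[timestamp].items():
            # "4. close" -> "close"; values arrive as strings
            row[key.split(". ")[1]] = float(value)
        rows.append(row)
    return rows


# Illustrative payload with made-up values
sample = {
    "Time Series (5min)": {
        "2024-01-02 10:05:00": {
            "1. open": "101.0", "2. high": "102.0", "3. low": "100.5",
            "4. close": "101.5", "5. volume": "1200",
        },
        "2024-01-02 10:00:00": {
            "1. open": "100.0", "2. high": "101.2", "3. low": "99.8",
            "4. close": "101.0", "5. volume": "900",
        },
    }
}

print(build_intraday_url("MSFT", "5min", "YOUR_API_KEY"))
print(parse_intraday(sample, "5min")[-1]["close"])
```

The last row in the parsed list is the most recent bar, which is the value the analysis tools treat as the current price.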

Implementing MCP Tools for Stock Analysis

Let's implement the MCP tools for stock analysis. This MCP server exposes tools for moving averages, the Relative Strength Index, and a trade recommendation.

Tool 1: Moving Averages

Moving averages are used in stock analysis to smooth price data and identify trends.

Short-term averages (5–10 days) react quickly and highlight recent market shifts.
Long-term averages (50–200 days) change slowly and reveal broader, sustained trends.
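
As a small worked example, the sketch below computes a short and a long simple moving average on made-up closing prices and checks whether the short average has just crossed the long one. It uses plain Python so it runs without pandas; the tool that follows gets the same averages from rolling(window=...).mean(). The names sma and detect_crossover are illustrative, and this version only inspects the most recent step rather than the last five periods.

```python
def sma(values, window):
    """Simple moving average; None until `window` values are available."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]) / window)
    return out


def detect_crossover(short_ma, long_ma):
    """Classify the most recent step as a golden cross, death cross, or neither."""
    prev_s, curr_s = short_ma[-2], short_ma[-1]
    prev_l, curr_l = long_ma[-2], long_ma[-1]
    if None in (prev_s, curr_s, prev_l, curr_l):
        return "none"  # not enough data for both averages
    if prev_s <= prev_l and curr_s > curr_l:
        return "golden cross"  # short average moved above the long average
    if prev_s >= prev_l and curr_s < curr_l:
        return "death cross"  # short average moved below the long average
    return "none"


closes = [10, 9, 8, 7, 8, 11]  # made-up closing prices
short_ma = sma(closes, 2)
long_ma = sma(closes, 4)

print(short_ma)  # [None, 9.5, 8.5, 7.5, 7.5, 9.5]
print(long_ma)   # [None, None, None, 8.5, 8.0, 8.5]
print(detect_crossover(short_ma, long_ma))  # golden cross
```

The short average reacts to the jump to 11 immediately, while the long average barely moves, which is what produces the crossover.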

Implement the tool for calculating moving averages.

@mcp.tool()
def calculate_moving_averages(symbol: str, short_period: int = 20, long_period: int = 50) -> Dict[str, Any]:
    """
    Calculate short and long moving averages for a symbol
    
    Args:
        symbol: The ticker symbol to analyze
        short_period: Short moving average period in minutes
        long_period: Long moving average period in minutes
        
    Returns:
        Dictionary with moving average data and analysis
    """
    cache_key = f"{symbol}_1min"
    
    if cache_key not in market_data_cache:
        df = AlphaVantageAPI.get_intraday_data(symbol, "1min", outputsize="full")
        market_data_cache[cache_key] = MarketData(
            symbol=symbol,
            interval="1min",
            data=df,
            last_updated=datetime.now()
        )
    
    # Work on a copy so the SMA columns are not written into the cached DataFrame
    data = market_data_cache[cache_key].data.copy()
    
    # Calculate moving averages
    data[f'SMA{short_period}'] = data['close'].rolling(window=short_period).mean()
    data[f'SMA{long_period}'] = data['close'].rolling(window=long_period).mean()
    
    # Get latest values
    latest = data.iloc[-1]
    current_price = latest['close']
    short_ma = latest[f'SMA{short_period}']
    long_ma = latest[f'SMA{long_period}']
    
    # Determine signal
    if short_ma > long_ma:
        signal = "BULLISH (Short MA above Long MA)"
    elif short_ma < long_ma:
        signal = "BEARISH (Short MA below Long MA)"
    else:
        signal = "NEUTRAL (MAs are equal)"
    
    # Check for crossover in the last 5 periods
    last_5 = data.iloc[-5:]
    crossover = False
    crossover_type = ""
    
    for i in range(1, len(last_5)):
        prev = last_5.iloc[i-1]
        curr = last_5.iloc[i]
        
        # Golden Cross (short crosses above long)
        if prev[f'SMA{short_period}'] <= prev[f'SMA{long_period}'] and curr[f'SMA{short_period}'] > curr[f'SMA{long_period}']:
            crossover = True
            crossover_type = "GOLDEN CROSS (Bullish)"
            break
            
        # Death Cross (short crosses below long)
        if prev[f'SMA{short_period}'] >= prev[f'SMA{long_period}'] and curr[f'SMA{short_period}'] < curr[f'SMA{long_period}']:
            crossover = True
            crossover_type = "DEATH CROSS (Bearish)"
            break
    
    return {
        "symbol": symbol,
        "current_price": current_price,
        f"SMA{short_period}": short_ma,
        f"SMA{long_period}": long_ma,
        "signal": signal,
        "crossover_detected": crossover,
        "crossover_type": crossover_type if crossover else "None",
        "analysis": f"""Moving Average Analysis for {symbol}:
Current Price: ${current_price:.2f}
{short_period}-period SMA: ${short_ma:.2f}
{long_period}-period SMA: ${long_ma:.2f}
Signal: {signal}
Recent Crossover: {"Yes - " + crossover_type if crossover else "No"}

Recommendation: {
    "STRONG BUY" if crossover and crossover_type == "GOLDEN CROSS (Bullish)" else
    "BUY" if signal == "BULLISH (Short MA above Long MA)" else
    "STRONG SELL" if crossover and crossover_type == "DEATH CROSS (Bearish)" else
    "SELL" if signal == "BEARISH (Short MA below Long MA)" else
    "HOLD"
}"""
    }

Tool 2: Relative Strength Index (RSI)

The Relative Strength Index (RSI) is a momentum indicator that helps identify overbought (RSI > 70) or oversold (RSI < 30) conditions in an asset. It is typically calculated over 14 periods (14 days on a daily chart; the tool below applies the same window to 1-minute bars) and uses the ratio of average gains to average losses to measure the speed and magnitude of recent price movements.
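
The arithmetic can be followed on a handful of made-up prices. The sketch below uses the simple-average form, RSI = 100 - 100 / (1 + RS) with RS = average gain / average loss, which matches the rolling-mean approach of the tool in this section; Wilder's original definition smooths the averages instead. The function name simple_rsi is illustrative.

```python
def simple_rsi(closes, period=14):
    """RSI from simple averages of gains and losses over the last `period` changes."""
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closing prices")
    recent = closes[-(period + 1):]
    changes = [b - a for a, b in zip(recent, recent[1:])]
    avg_gain = sum(c for c in changes if c > 0) / period
    avg_loss = sum(-c for c in changes if c < 0) / period
    if avg_loss == 0:
        # No losses in the window, so RS is unbounded; report the upper limit
        return 100.0
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)


# Made-up prices: changes are +1, +1, -1, +1, +1
closes = [10, 11, 12, 11, 12, 13]
# avg gain = 4 / 5 = 0.8, avg loss = 1 / 5 = 0.2, RS = 4, RSI = 100 - 100 / 5
print(simple_rsi(closes, period=5))  # 80.0
```

A value of 80 is above the 70 threshold, so this series would be flagged as overbought.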

Implement the tool for calculating RSI.

@mcp.tool()
def calculate_rsi(symbol: str, period: int = 14) -> Dict[str, Any]:
    """
    Calculate Relative Strength Index (RSI) for a symbol
    
    Args:
        symbol: The ticker symbol to analyze
        period: RSI calculation period in minutes
        
    Returns:
        Dictionary with RSI data and analysis
    """
    cache_key = f"{symbol}_1min"
    
    if cache_key not in market_data_cache:
        df = AlphaVantageAPI.get_intraday_data(symbol, "1min", outputsize="full")
        market_data_cache[cache_key] = MarketData(
            symbol=symbol,
            interval="1min",
            data=df,
            last_updated=datetime.now()
        )
    
    data = market_data_cache[cache_key].data.copy()
    
    # Calculate price changes
    delta = data['close'].diff()
    
    # Create gain and loss series
    gain = delta.copy()
    loss = delta.copy()
    gain[gain < 0] = 0
    loss[loss > 0] = 0
    loss = abs(loss)
    
    # Calculate average gain and loss
    avg_gain = gain.rolling(window=period).mean()
    avg_loss = loss.rolling(window=period).mean()
    
    # Calculate RS and RSI
    rs = avg_gain / avg_loss
    rsi = 100 - (100 / (1 + rs))
    
    # Get latest RSI
    latest_rsi = rsi.iloc[-1]
    
    # Determine signal
    if latest_rsi < 30:
        signal = "OVERSOLD (Potential buy opportunity)"
    elif latest_rsi > 70:
        signal = "OVERBOUGHT (Potential sell opportunity)"
    else:
        signal = "NEUTRAL"
    
    return {
        "symbol": symbol,
        "period": period,
        "rsi": latest_rsi,
        "signal": signal,
        "analysis": f"""RSI Analysis for {symbol}:
            {period}-period RSI: {latest_rsi:.2f}
            Signal: {signal}

            Recommendation: {
                "BUY" if latest_rsi < 30 else
                "SELL" if latest_rsi > 70 else
                "HOLD"
            }"""
    }

Tool 3: Trade Recommendation

This tool aggregates insights from both the moving average and RSI indicators to provide a clear recommendation on whether to buy, hold, or sell an asset.
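
The scoring can be checked by hand with a stripped-down version of the same weights: the moving-average trend counts for 1 point, a recent crossover for 2, and an RSI extreme for 1.5, giving a range of -4.5 to 4.5. The sketch below is a simplified stand-in for the tool that follows, with an illustrative function name and plain inputs instead of the tool's signal strings.

```python
def combine_signals(ma_trend, crossover, rsi):
    """
    ma_trend: "bullish", "bearish", or "neutral"
    crossover: "golden", "death", or None
    rsi: latest RSI value (0 to 100)
    """
    strength = 0.0
    # Moving-average trend contributes 1 point in either direction
    if ma_trend == "bullish":
        strength += 1
    elif ma_trend == "bearish":
        strength -= 1
    # A recent crossover contributes 2 points
    if crossover == "golden":
        strength += 2
    elif crossover == "death":
        strength -= 2
    # RSI extremes contribute 1.5 points
    if rsi < 30:
        strength += 1.5
    elif rsi > 70:
        strength -= 1.5

    if strength >= 2:
        label = "STRONG BUY"
    elif strength > 0:
        label = "BUY"
    elif strength <= -2:
        label = "STRONG SELL"
    elif strength < 0:
        label = "SELL"
    else:
        label = "HOLD"
    return strength, label


print(combine_signals("bullish", "golden", 25))  # (4.5, 'STRONG BUY')
print(combine_signals("bullish", None, 75))      # (-0.5, 'SELL')
print(combine_signals("neutral", None, 50))      # (0.0, 'HOLD')
```

The second case shows how the indicators can disagree: a bullish trend is outweighed by an overbought RSI, so the combined result leans toward selling.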

@mcp.tool()
def trade_recommendation(symbol: str) -> Dict[str, Any]:
    """
    Provide a comprehensive trade recommendation based on multiple indicators
    
    Args:
        symbol: The ticker symbol to analyze
        
    Returns:
        Dictionary with trading recommendation and supporting data
    """
    # Calculate individual indicators
    ma_data = calculate_moving_averages(symbol)
    rsi_data = calculate_rsi(symbol)
    
    # Extract signals
    ma_signal = ma_data["signal"]
    ma_crossover = ma_data["crossover_detected"]
    ma_crossover_type = ma_data["crossover_type"]
    rsi_value = rsi_data["rsi"]
    rsi_signal = rsi_data["signal"]
    
    # Determine overall signal strength
    signal_strength = 0
    
    # MA contribution
    if "BULLISH" in ma_signal:
        signal_strength += 1
    elif "BEARISH" in ma_signal:
        signal_strength -= 1
        
    # Crossover contribution
    if ma_crossover:
        if "GOLDEN" in ma_crossover_type:
            signal_strength += 2
        elif "DEATH" in ma_crossover_type:
            signal_strength -= 2
            
    # RSI contribution
    if "OVERSOLD" in rsi_signal:
        signal_strength += 1.5
    elif "OVERBOUGHT" in rsi_signal:
        signal_strength -= 1.5
    
    # Determine final recommendation
    if signal_strength >= 2:
        recommendation = "STRONG BUY"
    elif signal_strength > 0:
        recommendation = "BUY"
    elif signal_strength <= -2:
        recommendation = "STRONG SELL"
    elif signal_strength < 0:
        recommendation = "SELL"
    else:
        recommendation = "HOLD"
    
    # Calculate risk level (simple version)
    risk_level = "MEDIUM"
    if abs(signal_strength) > 3:
        risk_level = "LOW"  # Strong signal, lower risk
    elif abs(signal_strength) < 1:
        risk_level = "HIGH"  # Weak signal, higher risk
    
    analysis = f"""# Trading Recommendation for {symbol}

        ## Summary
        Recommendation: {recommendation}
        Risk Level: {risk_level}
        Signal Strength: {signal_strength:.1f} / 4.5

        ## Technical Indicators
        Moving Averages: {ma_signal}
        Recent Crossover: {"Yes - " + ma_crossover_type if ma_crossover else "No"}
        RSI ({rsi_data["period"]}): {rsi_value:.2f} - {rsi_signal}

        ## Reasoning
        This recommendation is based on a combination of Moving Average analysis and RSI indicators.
        {
            f"The {ma_crossover_type} provides a strong directional signal. " if ma_crossover else ""
        }{
            f"The RSI indicates the stock is {rsi_signal.split(' ')[0].lower()}. " if "NEUTRAL" not in rsi_signal else ""
        }

        ## Action Plan
        {
            "Consider immediate entry with a stop loss at the recent low. Target the next resistance level." if recommendation == "STRONG BUY" else
            "Look for a good entry point on small dips. Set reasonable stop loss." if recommendation == "BUY" else
            "Consider immediate exit or setting tight stop losses to protect gains." if recommendation == "STRONG SELL" else
            "Start reducing position on strength or set trailing stop losses." if recommendation == "SELL" else
            "Monitor the position but no immediate action needed."
        }
        """
    
    return {
        "symbol": symbol,
        "recommendation": recommendation,
        "risk_level": risk_level,
        "signal_strength": signal_strength,
        "ma_signal": ma_signal,
        "rsi_signal": rsi_signal,
        "current_price": ma_data["current_price"],
        "analysis": analysis
    }

Prompt 1: Analyze a Single Ticker

@mcp.prompt()
def analyze_ticker(symbol: str) -> str:
    """
    Analyze a ticker symbol for trading opportunities
    """
    return f"""You are a professional stock market analyst. I would like you to analyze the stock {symbol} and provide trading insights.

        Start by examining the current market data and technical indicators. Here are the specific tasks:

        1. First, check the current market data for {symbol}
        2. Calculate the moving averages using the calculate_moving_averages tool
        3. Calculate the RSI using the calculate_rsi tool
        4. Generate a comprehensive trade recommendation using the trade_recommendation tool
        5. Based on all this information, provide your professional analysis, highlighting:
        - The current market position
        - Key technical indicators and what they suggest
        - Potential trading opportunities and risks
        - Your recommended action (buy, sell, or hold) with a brief explanation

        Please organize your response in a clear, structured format suitable for a professional trader.
        """

Prompt 2: Compare Multiple Tickers

@mcp.prompt()
def compare_tickers(symbols: str) -> str:
    """
    Compare multiple ticker symbols for the best trading opportunity
    
    Args:
        symbols: Comma-separated list of ticker symbols
    """
    symbol_list = [s.strip() for s in symbols.split(",")]
    symbol_section = "\n".join([f"- {s}" for s in symbol_list])
    
    return f"""You are a professional stock market analyst. I would like you to compare these stocks and identify the best trading opportunity:

        {symbol_section}

        For each stock in the list, please:

        1. Check the current market data using the appropriate resource
        2. Generate a comprehensive trade recommendation using the trade_recommendation tool
        3. Compare all stocks based on:
        - Current trend direction and strength
        - Technical indicator signals
        - Risk/reward profile
        - Trading recommendation strength

        After analyzing each stock, rank them from most promising to least promising trading opportunity. Explain your ranking criteria and why you believe the top-ranked stock represents the best current trading opportunity.

        Conclude with a specific recommendation on which stock to trade and what action to take (buy, sell, or hold).
        """

Prompt 3: Build an Intraday Trading Strategy

@mcp.prompt()
def intraday_strategy_builder(symbol: str) -> str:
    """
    Build a custom intraday trading strategy for a specific ticker
    """
    return f"""You are an expert algorithmic trader specializing in intraday strategies. I want you to develop a custom intraday trading strategy for {symbol}.

        Please follow these steps:

        1. First, analyze the current market data for {symbol} using the market-data resource
        2. Calculate relevant technical indicators:
        - Moving averages (short and long periods)
        - RSI
        3. Based on your analysis, design an intraday trading strategy that includes:
        - Specific entry conditions (technical setups that would trigger a buy/sell)
        - Exit conditions (both take-profit and stop-loss levels)
        - Position sizing recommendations
        - Optimal trading times during the day
        - Risk management rules

        Make your strategy specific to the current market conditions for {symbol}, not just generic advice. Include exact indicator values and price levels where possible.

        Conclude with a summary of the strategy and how a trader should implement it for today's trading session.
        """

Complete Code for Stock Market Analysis

Create a file named stock_analysis_server.py to implement the MCP server. Add the following code to it.

# stock_analysis_server.py
from mcp.server.fastmcp import FastMCP
import requests
import pandas as pd
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Any

# Create the MCP server
mcp = FastMCP("Stock Analysis Server", dependencies=["requests", "pandas", "tabulate"])

# Constants and configurations
API_KEY = "YOUR_API_KEY"  # Replace with your actual Alpha Vantage API key

@dataclass
class MarketData:
    symbol: str
    interval: str
    data: pd.DataFrame
    last_updated: datetime
    
class AlphaVantageAPI:
    @staticmethod
    def get_intraday_data(symbol: str, interval: str = "1min", outputsize: str = "compact") -> pd.DataFrame:
        """Fetch intraday data from AlphaVantage API"""
        url = f"https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY&symbol={symbol}&interval={interval}&outputsize={outputsize}&apikey={API_KEY}"
        
        # Use a timeout so a stalled request cannot hang the tool call
        response = requests.get(url, timeout=30)
        data = response.json()
        
        # Check for error responses
        if "Error Message" in data:
            raise ValueError(f"API Error: {data['Error Message']}")
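        # Rate-limit notices arrive without time series data. Raise instead of
        # printing, because stdout carries the MCP messages when running over stdio.
        if "Note" in data:
            raise ValueError(f"API Note: {data['Note']}")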
            
        # Extract time series data
        time_series_key = f"Time Series ({interval})"
        if time_series_key not in data:
            raise ValueError(f"No time series data found for {symbol} with interval {interval}")
            
        time_series = data[time_series_key]
        
        # Convert to DataFrame
        df = pd.DataFrame.from_dict(time_series, orient="index")
        df.index = pd.to_datetime(df.index)
        df = df.sort_index()
        
        # Rename columns and convert to numeric
        df.columns = [col.split(". ")[1] for col in df.columns]
        for col in df.columns:
            df[col] = pd.to_numeric(df[col])
            
        return df

# In-memory cache for market data
market_data_cache: Dict[str, MarketData] = {}

# Resources
@mcp.resource("config://app")
def get_config() -> str:
    """Static configuration data"""
    return "App configuration here"

# Technical Analysis Tools
@mcp.tool()
def calculate_moving_averages(symbol: str, short_period: int = 20, long_period: int = 50) -> Dict[str, Any]:
    """
    Calculate short and long moving averages for a symbol
    
    Args:
        symbol: The ticker symbol to analyze
        short_period: Short moving average period in minutes
        long_period: Long moving average period in minutes
        
    Returns:
        Dictionary with moving average data and analysis
    """
    cache_key = f"{symbol}_1min"
    
    if cache_key not in market_data_cache:
        df = AlphaVantageAPI.get_intraday_data(symbol, "1min", outputsize="full")
        market_data_cache[cache_key] = MarketData(
            symbol=symbol,
            interval="1min",
            data=df,
            last_updated=datetime.now()
        )
    
    # Work on a copy so the SMA columns are not written into the cached DataFrame
    data = market_data_cache[cache_key].data.copy()
    
    # Calculate moving averages
    data[f'SMA{short_period}'] = data['close'].rolling(window=short_period).mean()
    data[f'SMA{long_period}'] = data['close'].rolling(window=long_period).mean()
    
    # Get latest values
    latest = data.iloc[-1]
    current_price = latest['close']
    short_ma = latest[f'SMA{short_period}']
    long_ma = latest[f'SMA{long_period}']
    
    # Determine signal
    if short_ma > long_ma:
        signal = "BULLISH (Short MA above Long MA)"
    elif short_ma < long_ma:
        signal = "BEARISH (Short MA below Long MA)"
    else:
        signal = "NEUTRAL (MAs are equal)"
    
    # Check for crossover in the last 5 periods
    last_5 = data.iloc[-5:]
    crossover = False
    crossover_type = ""
    
    for i in range(1, len(last_5)):
        prev = last_5.iloc[i-1]
        curr = last_5.iloc[i]
        
        # Golden Cross (short crosses above long)
        if prev[f'SMA{short_period}'] <= prev[f'SMA{long_period}'] and curr[f'SMA{short_period}'] > curr[f'SMA{long_period}']:
            crossover = True
            crossover_type = "GOLDEN CROSS (Bullish)"
            break
            
        # Death Cross (short crosses below long)
        if prev[f'SMA{short_period}'] >= prev[f'SMA{long_period}'] and curr[f'SMA{short_period}'] < curr[f'SMA{long_period}']:
            crossover = True
            crossover_type = "DEATH CROSS (Bearish)"
            break
    
    return {
        "symbol": symbol,
        "current_price": current_price,
        f"SMA{short_period}": short_ma,
        f"SMA{long_period}": long_ma,
        "signal": signal,
        "crossover_detected": crossover,
        "crossover_type": crossover_type if crossover else "None",
        "analysis": f"""Moving Average Analysis for {symbol}:
            Current Price: ${current_price:.2f}
            {short_period}-period SMA: ${short_ma:.2f}
            {long_period}-period SMA: ${long_ma:.2f}
            Signal: {signal}
            Recent Crossover: {"Yes - " + crossover_type if crossover else "No"}

            Recommendation: {
                "STRONG BUY" if crossover and crossover_type == "GOLDEN CROSS (Bullish)" else
                "BUY" if signal == "BULLISH (Short MA above Long MA)" else
                "STRONG SELL" if crossover and crossover_type == "DEATH CROSS (Bearish)" else
                "SELL" if signal == "BEARISH (Short MA below Long MA)" else
                "HOLD"
            }"""
    }

@mcp.tool()
def calculate_rsi(symbol: str, period: int = 14) -> Dict[str, Any]:
    """
    Calculate Relative Strength Index (RSI) for a symbol
    
    Args:
        symbol: The ticker symbol to analyze
        period: RSI calculation period in minutes
        
    Returns:
        Dictionary with RSI data and analysis
    """
    cache_key = f"{symbol}_1min"
    
    if cache_key not in market_data_cache:
        df = AlphaVantageAPI.get_intraday_data(symbol, "1min", outputsize="full")
        market_data_cache[cache_key] = MarketData(
            symbol=symbol,
            interval="1min",
            data=df,
            last_updated=datetime.now()
        )
    
    data = market_data_cache[cache_key].data.copy()
    
    # Calculate price changes
    delta = data['close'].diff()
    
    # Create gain and loss series
    gain = delta.copy()
    loss = delta.copy()
    gain[gain < 0] = 0
    loss[loss > 0] = 0
    loss = abs(loss)
    
    # Calculate average gain and loss
    avg_gain = gain.rolling(window=period).mean()
    avg_loss = loss.rolling(window=period).mean()
    
    # Calculate RS and RSI
    rs = avg_gain / avg_loss
    rsi = 100 - (100 / (1 + rs))
    
    # Get latest RSI
    latest_rsi = rsi.iloc[-1]
    
    # Determine signal
    if latest_rsi < 30:
        signal = "OVERSOLD (Potential buy opportunity)"
    elif latest_rsi > 70:
        signal = "OVERBOUGHT (Potential sell opportunity)"
    else:
        signal = "NEUTRAL"
    
    return {
        "symbol": symbol,
        "period": period,
        "rsi": latest_rsi,
        "signal": signal,
        "analysis": f"""RSI Analysis for {symbol}:
            {period}-period RSI: {latest_rsi:.2f}
            Signal: {signal}

            Recommendation: {
                "BUY" if latest_rsi < 30 else
                "SELL" if latest_rsi > 70 else
                "HOLD"
            }"""
    }

@mcp.tool()
def trade_recommendation(symbol: str) -> Dict[str, Any]:
    """
    Provide a comprehensive trade recommendation based on multiple indicators
    
    Args:
        symbol: The ticker symbol to analyze
        
    Returns:
        Dictionary with trading recommendation and supporting data
    """
    # Calculate individual indicators
    ma_data = calculate_moving_averages(symbol)
    rsi_data = calculate_rsi(symbol)
    
    # Extract signals
    ma_signal = ma_data["signal"]
    ma_crossover = ma_data["crossover_detected"]
    ma_crossover_type = ma_data["crossover_type"]
    rsi_value = rsi_data["rsi"]
    rsi_signal = rsi_data["signal"]
    
    # Determine overall signal strength
    signal_strength = 0
    
    # MA contribution
    if "BULLISH" in ma_signal:
        signal_strength += 1
    elif "BEARISH" in ma_signal:
        signal_strength -= 1
        
    # Crossover contribution
    if ma_crossover:
        if "GOLDEN" in ma_crossover_type:
            signal_strength += 2
        elif "DEATH" in ma_crossover_type:
            signal_strength -= 2
            
    # RSI contribution
    if "OVERSOLD" in rsi_signal:
        signal_strength += 1.5
    elif "OVERBOUGHT" in rsi_signal:
        signal_strength -= 1.5
    
    # Determine final recommendation
    if signal_strength >= 2:
        recommendation = "STRONG BUY"
    elif signal_strength > 0:
        recommendation = "BUY"
    elif signal_strength <= -2:
        recommendation = "STRONG SELL"
    elif signal_strength < 0:
        recommendation = "SELL"
    else:
        recommendation = "HOLD"
    
    # Calculate risk level (simple version)
    risk_level = "MEDIUM"
    if abs(signal_strength) > 3:
        risk_level = "LOW"  # Strong signal, lower risk
    elif abs(signal_strength) < 1:
        risk_level = "HIGH"  # Weak signal, higher risk
    
    analysis = f"""# Trading Recommendation for {symbol}

        ## Summary
        Recommendation: {recommendation}
        Risk Level: {risk_level}
        Signal Strength: {signal_strength:.1f} / 4.5

        ## Technical Indicators
        Moving Averages: {ma_signal}
        Recent Crossover: {"Yes - " + ma_crossover_type if ma_crossover else "No"}
        RSI ({rsi_data["period"]}): {rsi_value:.2f} - {rsi_signal}

        ## Reasoning
        This recommendation is based on a combination of Moving Average analysis and RSI indicators.
        {
            f"The {ma_crossover_type} provides a strong directional signal. " if ma_crossover else ""
        }{
            f"The RSI indicates the stock is {rsi_signal.split(' ')[0].lower()}. " if "NEUTRAL" not in rsi_signal else ""
        }

        ## Action Plan
        {
            "Consider immediate entry with a stop loss at the recent low. Target the next resistance level." if recommendation == "STRONG BUY" else
            "Look for a good entry point on small dips. Set reasonable stop loss." if recommendation == "BUY" else
            "Consider immediate exit or setting tight stop losses to protect gains." if recommendation == "STRONG SELL" else
            "Start reducing position on strength or set trailing stop losses." if recommendation == "SELL" else
            "Monitor the position but no immediate action needed."
        }
        """
    
    return {
        "symbol": symbol,
        "recommendation": recommendation,
        "risk_level": risk_level,
        "signal_strength": signal_strength,
        "ma_signal": ma_signal,
        "rsi_signal": rsi_signal,
        "current_price": ma_data["current_price"],
        "analysis": analysis
    }

# Prompts
@mcp.prompt()
def analyze_ticker(symbol: str) -> str:
    """
    Analyze a ticker symbol for trading opportunities
    """
    return f"""You are a professional stock market analyst. I would like you to analyze the stock {symbol} and provide trading insights.

        Start by examining the current market data and technical indicators. Here are the specific tasks:

        1. First, check the current market data for {symbol}
        2. Calculate the moving averages using the calculate_moving_averages tool
        3. Calculate the RSI using the calculate_rsi tool
        4. Generate a comprehensive trade recommendation using the trade_recommendation tool
        5. Based on all this information, provide your professional analysis, highlighting:
        - The current market position
        - Key technical indicators and what they suggest
        - Potential trading opportunities and risks
        - Your recommended action (buy, sell, or hold) with a brief explanation

        Please organize your response in a clear, structured format suitable for a professional trader.
        """

@mcp.prompt()
def compare_tickers(symbols: str) -> str:
    """
    Compare multiple ticker symbols for the best trading opportunity
    
    Args:
        symbols: Comma-separated list of ticker symbols
    """
    symbol_list = [s.strip() for s in symbols.split(",")]
    symbol_section = "\n".join([f"- {s}" for s in symbol_list])
    
    return f"""You are a professional stock market analyst. I would like you to compare these stocks and identify the best trading opportunity:

        {symbol_section}

        For each stock in the list, please:

        1. Check the current market data using the appropriate resource
        2. Generate a comprehensive trade recommendation using the trade_recommendation tool
        3. Compare all stocks based on:
        - Current trend direction and strength
        - Technical indicator signals
        - Risk/reward profile
        - Trading recommendation strength

        After analyzing each stock, rank them from most promising to least promising trading opportunity. Explain your ranking criteria and why you believe the top-ranked stock represents the best current trading opportunity.

        Conclude with a specific recommendation on which stock to trade and what action to take (buy, sell, or hold).
        """

@mcp.prompt()
def intraday_strategy_builder(symbol: str) -> str:
    """
    Build a custom intraday trading strategy for a specific ticker
    """
    return f"""You are an expert algorithmic trader specializing in intraday strategies. I want you to develop a custom intraday trading strategy for {symbol}.

        Please follow these steps:

        1. First, analyze the current market data for {symbol} using the market-data resource
        2. Calculate relevant technical indicators:
        - Moving averages (short and long periods)
        - RSI
        3. Based on your analysis, design an intraday trading strategy that includes:
        - Specific entry conditions (technical setups that would trigger a buy/sell)
        - Exit conditions (both take-profit and stop-loss levels)
        - Position sizing recommendations
        - Optimal trading times during the day
        - Risk management rules

        Make your strategy specific to the current market conditions for {symbol}, not just generic advice. Include exact indicator values and price levels where possible.

        Conclude with a summary of the strategy and how a trader should implement it for today's trading session.
        """

Integrating the MCP server with Claude Desktop

Integrate the stock price server with Claude for Desktop by running the following command:

mcp install stock_analysis_server.py --with requests --with pandas --with tabulate

After the integration, restart Claude for Desktop to enable the new MCP server for stock analysis related queries.
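To make the arithmetic inside calculate_rsi concrete, here is a minimal pure-Python sketch of the same calculation on a short list of closing prices. It mirrors the simple (unsmoothed) rolling averages used in the server code rather than Wilder's smoothed variant. The function names simple_rsi and rsi_signal and the default period of 14 are assumptions made for this illustration, not part of the server.

```python
from typing import List, Optional


def simple_rsi(closes: List[float], period: int = 14) -> Optional[float]:
    """RSI over the last `period` price changes, using plain averages."""
    if len(closes) < period + 1:
        return None  # not enough data for one full window
    # Price changes between consecutive closes, keeping the latest window
    deltas = [b - a for a, b in zip(closes[:-1], closes[1:])][-period:]
    avg_gain = sum(d for d in deltas if d > 0) / period
    avg_loss = sum(-d for d in deltas if d < 0) / period
    if avg_loss == 0:
        # No losing bars in the window: RSI saturates at 100,
        # and is undefined when the window is completely flat
        return 100.0 if avg_gain > 0 else None
    rs = avg_gain / avg_loss
    return 100 - (100 / (1 + rs))


def rsi_signal(rsi: float) -> str:
    """Map an RSI value to the thresholds used in the article (30 / 70)."""
    if rsi < 30:
        return "OVERSOLD"
    if rsi > 70:
        return "OVERBOUGHT"
    return "NEUTRAL"


closes = [100, 102, 101, 104, 103]
latest = simple_rsi(closes, period=4)
print(round(latest, 2), rsi_signal(latest))
```

For these five closes the changes are +2, -1, +3, -1, so the average gain is 1.25, the average loss is 0.5, RS is 2.5 and the RSI comes out at about 71.43, which falls in the overbought band.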

Thanks for reading this article!

Thanks Gowri M Bhatt for reviewing the content.


The full source code for this tutorial can be found here:

GitHub - codemaker2015/mcp-server-experiments

Resources

Getting Started with CI/CD in Machine Learning

Vishnu Sivan — Fri, 11 Apr 2025 12:40:47 GMT

Continuous Integration (CI) and Continuous Deployment (CD) have long been essential practices in modern software development, enabling teams to integrate changes frequently, run automated tests, and deploy applications efficiently. While these methodologies were originally designed for traditional software, their value is increasingly evident in the world of machine learning (ML), where reproducibility, automation, and rapid iteration are just as critical.

In this article, we’ll explore how CI/CD can streamline your machine learning workflows — from training and evaluating models to deploying them seamlessly. Instead of relying on complex tools or platforms, we’ll keep things simple and accessible by using GitHub Actions, Makefile, CML (Continuous Machine Learning), and the Hugging Face CLI to build a fully automated ML pipeline.

Getting Started

Setting up the project
Step 1: GitHub Repository
Step 2: Hugging Face Spaces
Step 3: Project Structure
Training and Evaluating Drug Classification Model
Installing the dependencies
Loading the Dataset
Splitting train and test data
Building the Training Pipeline
Evaluating the model
Saving the model and results
Building Your Machine Learning CI Pipeline
Creating update branch
Makefile
GitHub Actions
Building Your Machine Learning CD Pipeline
Building the Gradio App
CD Workflow
Setting up repository secrets
Project Resources

Setting up the project

In this section, we will set up the environment, build a CI/CD pipeline, and optimize the entire workflow. We will train a drug classifier using a scikit-learn pipeline with a Random Forest model, automate evaluation with CML, and deploy everything to the Hugging Face Hub. Once everything is set up, every push to GitHub will automatically retrain the model, evaluate it, and update the app, model, and results on Hugging Face.

Step 1: GitHub Repository

To begin, create a new GitHub repository for your machine learning project. This repo will host your code, datasets and configuration files for automation.

Go to GitHub, click the “+” icon in the top right, and select “New repository”.
Enter a repository name and optional description.
Check “Add a README file”.
Set .gitignore to Python.
Click “Create repository”.

Copy the repository URL and run the following commands in your terminal to clone it:

git clone your-github-repo-url

Example:

git clone https://github.com/codemaker2015/CICD-for-Machine-Learning.git 
cd CICD-for-Machine-Learning

Step 2: Hugging Face Spaces

Next, create a new Hugging Face Space for your machine learning project. This Space will host your web application and model files, and serve as the deployment endpoint for your CI/CD pipeline.

Go to Hugging Face and click on your profile picture in the top right corner. Select “New Space” from the dropdown.
Fill in the required details:
Space name: Choose a unique name for your Space
License: Select an appropriate license
SDK type: Choose Gradio or Streamlit depending on your app
Click “Create Space” to finish setup.

Step 3: Project Structure

Let’s set up the required folders and files before experimenting and building the pipeline.

Create app, data, model and results folders in your cloned GitHub repository.

3.1 App folder

The app folder is used to store all files related to the Hugging Face Space. It contains the web application script (app.py), a README.md file with metadata for the Space, and a requirements.txt file listing the necessary Python packages.

Create the following files inside the app folder:

app.py: The main script for your classifier web app.
README.md: Contains metadata and a description for your Hugging Face Space. You can either download the README.md file directly from the Hugging Face Space you created earlier or use the sample content provided below.

---
title: Drug Classification
emoji: 💻
colorFrom: pink
colorTo: red
sdk: gradio
sdk_version: 5.23.3
app_file: app.py
pinned: false
license: apache-2.0
---

requirements.txt: Specifies the dependencies needed to run your app. Add the following packages to the requirements.txt file.

scikit-learn
skops

3.2 Data folder

Download the Drug Classification dataset from Kaggle, extract the contents, and move the CSV file into the data folder.

Drug Classification

3.3 Model and Results folders

Both the model and results folders will initially be empty. They will be automatically populated by the Python scripts during training and evaluation.

3.4 Repository files

Makefile: Defines command shortcuts for running scripts, making it easier to trigger processes in the GitHub Actions workflow.
requirements.txt: Lists all the dependencies required to set up the environment for CI workflow jobs. Add the following dependencies to the requirements.txt.

pandas
scikit-learn
numpy
matplotlib
skops

train.py: Contains the core Python logic to load and preprocess data, train and evaluate the model, and save both the trained model and performance metrics.

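Your project folder should now look like this (using the example repository name, with the dataset saved as drug.csv):

CICD-for-Machine-Learning/
├── app/
│   ├── app.py
│   ├── README.md
│   └── requirements.txt
├── data/
│   └── drug.csv
├── model/
├── results/
├── .gitignore
├── Makefile
├── README.md
├── requirements.txt
└── train.py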

Training and Evaluating Drug Classification Model

In this section, we will experiment with Python code to process the data and train a model using a scikit-learn pipeline. After training, evaluate the model performance and save both the results and the trained model for later use.

Installing the dependencies

Create and activate a virtual environment by executing the following commands.

python -m venv venv
source venv/bin/activate   # Linux/macOS
venv\Scripts\activate      # Windows

Install pandas, scikit-learn, numpy, matplotlib, skops and black libraries using pip.

pip install pandas scikit-learn numpy matplotlib skops black

Loading the Dataset

Load the CSV file using Pandas, shuffle the rows using the sample() function to randomize the data, and then display the first three rows.

import pandas as pd

drug_df = pd.read_csv("data/drug.csv")
drug_df = drug_df.sample(frac=1)
print(drug_df.head(3))

Splitting train and test data

Define the independent variables (X) and dependent variable (y) then split the dataset into training and testing sets. This is essential for evaluating the model performance on unseen data.

from sklearn.model_selection import train_test_split

X = drug_df.drop("Drug", axis=1).values
y = drug_df.Drug.values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=125
)

Building the Training Pipeline

Construct a data processing pipeline using ColumnTransformer, which performs the following operations:

Encodes categorical columns using OrdinalEncoder
Fills missing values in numerical columns using SimpleImputer
Scales the numerical columns using StandardScaler

After preprocessing, build a training pipeline that feeds the transformed data into a RandomForestClassifier.

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Define categorical and numerical column indices
cat_col = [1, 2, 3]
num_col = [0, 4]

# Create a column transformer for preprocessing
transform = ColumnTransformer(
    transformers=[
        ("encoder", OrdinalEncoder(), cat_col),
        ("num_imputer", SimpleImputer(strategy="median"), num_col),
        ("num_scaler", StandardScaler(), num_col),
    ]
)

# Build the complete pipeline
pipe = Pipeline(
    steps=[
        ("preprocessing", transform),
        ("model", RandomForestClassifier(n_estimators=100, random_state=125)),
    ]
)

# Train the model
pipe.fit(X_train, y_train)
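One detail worth knowing about the transformer above: ColumnTransformer applies each entry independently to the original columns and concatenates the outputs. Listing the imputer and the scaler as separate entries therefore produces two copies of the numeric columns (one imputed, one scaled) rather than columns that are imputed and then scaled. A Random Forest still trains fine on that, but if you want the two steps chained, a nested Pipeline is one option. The sketch below illustrates the difference on made-up rows; X_demo, parallel, numeric_steps and chained are names introduced here and are not part of the original project.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy rows in the same column order as the drug dataset:
# Age, Sex, BP, Cholesterol, Na_to_K (values are made up for illustration)
X_demo = np.array(
    [
        [23, "F", "HIGH", "HIGH", 25.4],
        [47, "M", "LOW", "HIGH", 13.1],
        [61, "F", "NORMAL", "NORMAL", 9.4],
        [35, "M", "HIGH", "NORMAL", 18.0],
    ],
    dtype=object,
)
cat_col = [1, 2, 3]
num_col = [0, 4]

# As in the article: imputer and scaler are separate entries, so each one
# emits its own copy of the numeric columns
parallel = ColumnTransformer(
    transformers=[
        ("encoder", OrdinalEncoder(), cat_col),
        ("num_imputer", SimpleImputer(strategy="median"), num_col),
        ("num_scaler", StandardScaler(), num_col),
    ]
)

# Alternative: chain imputation and scaling in a nested Pipeline
numeric_steps = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)
chained = ColumnTransformer(
    transformers=[
        ("encoder", OrdinalEncoder(), cat_col),
        ("numeric", numeric_steps, num_col),
    ]
)

parallel_shape = parallel.fit_transform(X_demo).shape
chained_shape = chained.fit_transform(X_demo).shape
print(parallel_shape, chained_shape)
```

The first transformer yields seven columns (three encoded, two imputed, two scaled), while the chained version yields five. Either can be dropped into the "preprocessing" step of the training pipeline.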

Evaluating the model

After training the model, evaluate its performance using two common metrics: accuracy and F1 score.

from sklearn.metrics import accuracy_score, f1_score

predictions = pipe.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
f1 = f1_score(y_test, predictions, average="macro")

print("Accuracy:", str(round(accuracy * 100, 2)) + "%", "F1 Score:", round(f1, 2))
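To see what average="macro" computes, here is a hand-rolled sketch of accuracy and macro F1 on a handful of made-up labels. The function names and the sample labels are illustrative only; in the pipeline itself, keep using the scikit-learn functions.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); treat an empty class as 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["drugA", "drugA", "drugB", "drugB", "drugY", "drugY"]
y_pred = ["drugA", "drugB", "drugB", "drugB", "drugY", "drugY"]
print(round(accuracy(y_true, y_pred), 3), round(macro_f1(y_true, y_pred), 3))
```

Here the per-class F1 scores are 2/3 for drugA, 0.8 for drugB and 1.0 for drugY, so the macro average is about 0.822 while accuracy is 5/6. Because every class counts equally, macro F1 drops more sharply than accuracy when a rare class is predicted poorly.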

Saving the model and results

We will store the evaluation metrics and confusion matrix in the results/ folder. This helps in tracking performance over time, especially in CI/CD pipelines.

1. Save Accuracy and F1 Score to a Text File

with open("results/metrics.txt", "w") as outfile:
    outfile.write(f"Accuracy = {round(accuracy, 2)}, F1 Score = {round(f1, 2)}")

2. Save Confusion Matrix as an Image

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
# Generate confusion matrix
cm = confusion_matrix(y_test, predictions, labels=pipe.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=pipe.classes_)
# Plot and save the confusion matrix
disp.plot()
plt.savefig("results/model_results.png", dpi=120)

3. Save and Load model using skops

We will use the skops Python package to save our entire pipeline including both the preprocessing steps and the trained model. With skops, model versioning and reproducibility become much easier in a CI/CD workflow.


import skops.io as sio

# Save the trained pipeline to a file
sio.dump(pipe, "model/drug_pipeline.skops")
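Load the model pipeline:

# Load the saved pipeline. Newer skops releases expect an explicit list of
# trusted types instead of trusted=True, so inspect the types stored in the
# file and pass them only after reviewing them.
unknown_types = sio.get_untrusted_types(file="model/drug_pipeline.skops")
loaded_pipe = sio.load("model/drug_pipeline.skops", trusted=unknown_types)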

Creating train.py file

Here’s how you can structure your train.py file using the code snippets you've worked on. This script handles data loading, training, evaluation, and saving of the model and results in a single run.

import os

import pandas as pd
import matplotlib.pyplot as plt
import skops.io as sio

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, ConfusionMatrixDisplay

# Make sure the output folders exist; Git does not track empty folders,
# so they may be missing on a fresh checkout
os.makedirs("model", exist_ok=True)
os.makedirs("results", exist_ok=True)

# Load and shuffle dataset
drug_df = pd.read_csv("data/drug.csv")
drug_df = drug_df.sample(frac=1)

# Train-test split
X = drug_df.drop("Drug", axis=1).values
y = drug_df["Drug"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=125)

# Define preprocessing and model pipeline
cat_col = [1, 2, 3]
num_col = [0, 4]

transform = ColumnTransformer([
    ("encoder", OrdinalEncoder(), cat_col),
    ("num_imputer", SimpleImputer(strategy="median"), num_col),
    ("num_scaler", StandardScaler(), num_col),
])

pipe = Pipeline(steps=[
    ("preprocessing", transform),
    ("model", RandomForestClassifier(n_estimators=100, random_state=125)),
])

# Train the model
pipe.fit(X_train, y_train)

# Make predictions and evaluate
predictions = pipe.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
f1 = f1_score(y_test, predictions, average="macro")

# Save metrics
with open("results/metrics.txt", "w") as outfile:
    outfile.write(f"Accuracy = {round(accuracy, 2)}, F1 Score = {round(f1, 2)}")

# Save confusion matrix
cm = confusion_matrix(y_test, predictions, labels=pipe.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=pipe.classes_)
disp.plot()
plt.savefig("results/model_results.png", dpi=120)

# Save model
sio.dump(pipe, "model/drug_pipeline.skops")

Building Your Machine Learning CI Pipeline

In this section, we will explore how to use CML, Makefile, and GitHub Actions to automate model training, evaluation, and version control for our machine learning project.

CML

Continuous Machine Learning (CML) is an open-source tool for bringing CI practices to ML projects. We will use the iterative/setup-cml GitHub Action to automate model evaluation reporting. On every push, it posts a report with the performance metrics and the confusion matrix as a comment on the commit, which also triggers an email notification.

Creating update branch

We’re generating the evaluation report, but currently, the model and results aren’t being versioned. To track these changes properly, we’ll create a new branch called “update” and push the updated model and results to it.

To create the “update” branch:

Click on the branch dropdown (where it says main)
Type “update” in the search box
Select “Create branch: update from main” to finalize the creation.

Makefile

A Makefile contains command sets that can automate tasks like preprocessing, training, testing, and deploying. It helps simplify the CI workflow by bundling related commands, keeping the GitHub Actions file clean and modular.

Add the following content to the Makefile. Make requires every recipe line to be indented with a tab character, not spaces.

install:
	pip install --upgrade pip &&\
		pip install -r requirements.txt

format:
	black *.py

train:
	python train.py

eval:
	echo "## Model Metrics" > report.md
	cat ./results/metrics.txt >> report.md

	echo '\n## Confusion Matrix Plot' >> report.md
	echo '![Confusion Matrix](./results/model_results.png)' >> report.md

	cml comment create report.md

update-branch:
	git config --global user.name $(USER_NAME)
	git config --global user.email $(USER_EMAIL)
	git commit -am "Update with new results"
	git push --force origin HEAD:update

hf-login:
	pip install -U "huggingface_hub[cli]"
	git pull origin update
	git switch update
	huggingface-cli login --token $(HF) --add-to-git-credential

push-hub:
	huggingface-cli upload codemaker2015/Drug-Classification ./app --repo-type=space --commit-message="Sync App files"
	huggingface-cli upload codemaker2015/Drug-Classification ./model model --repo-type=space --commit-message="Sync Model"
	huggingface-cli upload codemaker2015/Drug-Classification ./results metrics --repo-type=space --commit-message="Sync Metrics"

deploy: hf-login push-hub

all: install format train eval update-branch deploy

After making the necessary changes, commit them and push the updates to the remote GitHub repository.

git add .
git commit -m "code integration"
git push origin main

GitHub Actions

To automate training and evaluation, we will create a GitHub Actions workflow.

Go to the “Actions” tab in your GitHub repository.
Click on “set up a workflow yourself.”
Rename the default main.yml file to ci.yml.
Start by defining the name of the workflow.
Set the trigger so that it runs on every push or pull request to the main branch or through manual dispatch.
Define the environment by using the latest Ubuntu runner.
Set up and activate the GitHub Actions we need for this CI workflow.
Use make commands to add the individual steps: installing dependencies, formatting, training, and evaluating.
Pass the GITHUB_TOKEN that GitHub Actions creates automatically for each run (secrets.GITHUB_TOKEN) to the CML step as REPO_TOKEN.
Add the following code to your ci.yml file, then commit it to trigger the workflow. GitHub Actions will execute each step sequentially:

name: Continuous Integration
on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]
  workflow_dispatch:
  
permissions: write-all
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: iterative/setup-cml@v2
      - name: Install Packages
        run: make install
      - name: Format
        run: make format
      - name: Train
        run: make train
      - name: Evaluation
        env:
          REPO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: make eval
      - name: Update Branch
        env:
          NAME: ${{ secrets.USER_NAME }}
          EMAIL: ${{ secrets.USER_EMAIL }}
        run: make update-branch USER_NAME=$NAME USER_EMAIL=$EMAIL

Building Your Machine Learning CD Pipeline

In this section, we will explore how to automate the deployment of both the model and the application. This includes pulling the updated model and app files from the update branch, logging into the Hugging Face CLI using a token, pushing the necessary files, and ultimately deploying the application.

Building the Gradio App

To deploy our model and make it accessible, we’ll create a Gradio app with the following components:

Load the scikit-learn pipeline and trained model.
Define a Python function to predict drug labels based on user input.
Design the input interface using sliders for numerical values and radio buttons for categorical inputs.
Add sample inputs to quickly test the model’s functionality.
Provide metadata such as a title for the application and a brief description highlighting its features and purpose.

Add the following code to the app.py file inside the app folder.

import gradio as gr
import skops.io as sio
import warnings
from sklearn.exceptions import InconsistentVersionWarning

# Suppress the version warnings
warnings.filterwarnings("ignore", category=InconsistentVersionWarning)

# Explicitly specify trusted types
trusted_types = [
    "sklearn.pipeline.Pipeline",
    "sklearn.preprocessing.OneHotEncoder",
    "sklearn.preprocessing.StandardScaler",
    "sklearn.compose.ColumnTransformer",
    "sklearn.preprocessing.OrdinalEncoder",
    "sklearn.impute.SimpleImputer",
    "sklearn.tree.DecisionTreeClassifier",
    "sklearn.ensemble.RandomForestClassifier",
    "numpy.dtype",
]
pipe = sio.load("./model/drug_pipeline.skops", trusted=trusted_types)


def predict_drug(age, sex, blood_pressure, cholesterol, na_to_k_ratio):
    """Predict drugs based on patient features.

    Args:
        age (int): Age of patient
        sex (str): Sex of patient
        blood_pressure (str): Blood pressure level
        cholesterol (str): Cholesterol level
        na_to_k_ratio (float): Ratio of sodium to potassium in blood

    Returns:
        str: Predicted drug label
    """
    features = [age, sex, blood_pressure, cholesterol, na_to_k_ratio]
    predicted_drug = pipe.predict([features])[0]

    label = f"Predicted Drug: {predicted_drug}"
    return label


inputs = [
    gr.Slider(15, 74, step=1, label="Age"),
    gr.Radio(["M", "F"], label="Sex"),
    gr.Radio(["HIGH", "LOW", "NORMAL"], label="Blood Pressure"),
    gr.Radio(["HIGH", "NORMAL"], label="Cholesterol"),
    gr.Slider(6.2, 38.2, step=0.1, label="Na_to_K"),
]
outputs = [gr.Label(num_top_classes=5)]

examples = [
    [30, "M", "HIGH", "NORMAL", 15.4],
    [35, "F", "LOW", "NORMAL", 8],
    [50, "M", "HIGH", "HIGH", 34],
]


title = "Drug Classification"
description = "Enter the details to correctly identify Drug type?"
article = "A Beginners Guide to CI/CD for Machine Learning. It teaches how to automate training, evaluation, and deployment of models to Hugging Face using GitHub Actions."


gr.Interface(
    fn=predict_drug,
    inputs=inputs,
    outputs=outputs,
    examples=examples,
    title=title,
    description=description,
    article=article,
    theme=gr.themes.Soft(),
).launch()

Add the following dependencies to the requirements.txt inside the app folder.

scikit-learn
skops
gradio

CD Workflow

To complete the CI/CD setup, we need to create another workflow file named cd.yml, similar to the existing ci.yml file. Once the CI pipeline completes successfully, it triggers the cd.yml workflow through the on.workflow_run event. This deployment workflow sets up the environment and executes the make deploy command, using the Hugging Face token to push the latest model and application updates to the Hugging Face Hub.

Go to GitHub Actions and create a workflow named cd.yml.
Add the following code to it.

name: Continuous Deployment
on:
  workflow_run:
    workflows: ["Continuous Integration"]
    types:
      - completed

  workflow_dispatch:

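jobs:
  build:
    # Deploy only after a successful CI run, or when started manually
    if: ${{ github.event_name == 'workflow_dispatch' || github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest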
    steps:
      - uses: actions/checkout@v3

      - name: Deployment To Hugging Face
        env:
          HF: ${{ secrets.HF }}
        run: make deploy HF=$HF

Setting up repository secrets

To commit and push changes using Git, you need to configure a username and email. While you can set these directly, it’s recommended to use GitHub Secrets for better security. Additionally, the CD pipeline requires a Hugging Face token to deploy the application to the Hugging Face Hub.

Follow these steps to securely add the necessary credentials to your GitHub repository using Secrets.

Go to your repository Settings and click on “Secrets and variables” under the Security section.
Select “Actions”, then click the green “New repository secret” button.
Add a name and value — this works like setting an environment variable on your local machine.
To generate a Hugging Face token, click on your profile picture and select “Settings”.
Navigate to “Access Tokens”, then click “New Token” and ensure it has write permissions.
Copy the token and add it as a repository secret in the same way as you did for the Git username and email.

After making the necessary changes, commit them and push the updates to the remote GitHub repository. Note that the workflow files were added directly on GitHub, so make sure to pull the latest changes from the remote repository before pushing your local updates.

git add .
git commit -m "gradio code added"
git pull origin main
git push origin main

Once you push the changes, GitHub Actions will be triggered automatically to run the CI/CD pipelines. Alternatively, you can run them manually by clicking the Re-run all jobs button.

You can monitor live logs for each step by selecting the run option in the workflow build. Once the files are successfully uploaded to the Hugging Face server, the corresponding Space will begin setting up the environment. Shortly after, the application will launch and start running.

Project Resources

GitHub Repository: codemaker2015/CI-CD-for-Machine-Learning
Hugging Face Space: Drug Classification — a Hugging Face Space by codemaker2015
Kaggle Dataset: Drug Classification

A Practical Guide to Training AI Agents with Microsoft Agent Lightning

Getting Started

Table of contents

What is Agent Lightning?

Why Agent Lightning Matters

Three-Component Architecture of Agent Lightning

How Agent Lightning Works

Basic Integration

Hands-on 1: Manual Prompt Search with AgentLightning and OpenAI

Setting Up the Environment

Installing dependencies

Hands-on 2: Building a Trainable LLM Agent with AgentLightning

Hands-on 3: Sentiment Analysis Agent with AgentLightning

Hands-on 4: LangGraph SQL Agent with AgentLightning

Resources

Zvec: Reimagining Vector Databases with SQLite-Style Simplicity

Getting Started

Table of contents

What is Zvec

How Zvec Works

Why Zvec Matters

Performance Benchmarks

RAG-Focused Features

Hands-On 1: Creating Your First Embedded Vector Database with Zvec

Setting up the environment

Creating and querying a Zvec collection

Hands-On 2: Building a Zvec-based FAQ Agent for customer queries

Installing dependencies

Preparing sample FAQ data

Creating Zvec collection

Building retrieval function

Connecting to LLM for answer generation

Testing FAQ agent

Resources

OmniDaemon: The Universal Event-Driven Runtime for Production Ready AI Agents

Getting Started

Table of contents

The “Monolithic Trap” in Agentic AI

What is OmniDaemon?

Key Pillars:

The Architecture: How it Works

The “Omni Stack” Ecosystem

The Core Problem OmniDaemon Solves

Why Event-Driven Architecture for AI Agents?

Why Traditional Architectures Fail for Agents

Getting Started: A Technical Quickstart

Prerequisites

Step 1: Set Up the Environment

Initializing Redis

Installing Dependencies

Step 2: Create your first agent

Step 3: Create the Event Producer (The Trigger)

Executing the code

Hands-On Project: Intelligent Log Insights Generator

Project Architecture

Step 1: Set Up the Environment

Step 2: Create Sample Log Generator

Step 3: Build the Log Ingestion Agent

Step 4: Build the LLM-Powered Analysis Agent

Step 5: Build the Reporting Agent

Executing the app

Step 6: (Optional) Advanced Agent Chaining for Specialized Analysis

Advanced Features and Patterns

1. Agent Chaining for Complex Workflows

2. Multi-Tenant Log Analysis

3. Real-Time Alerting with Webhooks

4. Horizontal Scaling

Performance Optimization

Resources

Exploring TimesFM: The Foundation Model That Understands the Language of Time

Getting Started

Table of contents

The Evolution of Time-Series Forecasting

What is a Decoder?

What is TimesFM?

Architecture Overview

Experimenting with TimesFM model

Installing uv