<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Ronan Takizawa on Medium]]></title>
        <description><![CDATA[Stories by Ronan Takizawa on Medium]]></description>
        <link>https://medium.com/@ronantech?source=rss-fbd6f4eb076e------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*zt0v0kYVSMmIsocwI3_gaQ.png</url>
            <title>Stories by Ronan Takizawa on Medium</title>
            <link>https://medium.com/@ronantech?source=rss-fbd6f4eb076e------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 15 Apr 2026 05:20:05 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@ronantech/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Lorashare: Compress multiple LoRA adapters into a shared subspace to reduce storage]]></title>
            <link>https://medium.com/@ronantech/lorashare-compress-multiple-lora-adapters-into-a-shared-subspace-to-reduce-storage-eff8f0a52848?source=rss-fbd6f4eb076e------2</link>
            <guid isPermaLink="false">https://medium.com/p/eff8f0a52848</guid>
            <category><![CDATA[low-rank-adaptation]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[fine-tuning]]></category>
            <category><![CDATA[llm-research]]></category>
            <dc:creator><![CDATA[Ronan Takizawa]]></dc:creator>
            <pubDate>Wed, 11 Feb 2026 08:33:08 GMT</pubDate>
            <atom:updated>2026-02-11T08:33:08.549Z</atom:updated>
<content:encoded><![CDATA[<h3>Lorashare: Compress Multiple LoRA Adapters into a Shared Subspace to Reduce Storage</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*z_FHHLwvRnWMLe6wl72a5Q.png" /></figure><p><em>If you found this useful, give the project a star on </em><a href="https://github.com/ronantakizawa/lorashare"><em>GitHub!</em></a></p><p>If you’ve fine-tuned large language models with LoRA for multiple tasks and run the resulting adapters in production, you may have experienced this scaling problem.</p><p>While each LoRA adapter is only around 10–50 MB, as you accumulate and deploy more of them, the adapters end up consuming gigabytes of unnecessary storage and driving up storage costs.</p><p>But what if you could compress all of those adapters into a single 50 MB checkpoint?</p><p><strong>That’s what Lorashare does.</strong></p><h4>Solution: Compress multiple LoRA adapters into a shared subspace to reduce storage</h4><p>Recent research from <a href="https://arxiv.org/abs/2602.06043">Johns Hopkins University</a> introduced SHARE: a method that eliminates the overlap among multiple LoRA adapters and keeps only the unique coefficients.</p><p>The researchers analyzed the geometry of LoRA adapter weight matrices and discovered something remarkable: despite being trained separately, all the adapters exhibited a common geometric structure.</p><p>In other words, LoRA adapters trained on different tasks naturally occupy a shared low-dimensional subspace.</p><p>Based on this discovery, they realized that basic linear algebra is enough to compress multiple LoRA adapters into a single object.</p><p>If each adapter is a point in a million-dimensional space, all your adapters cluster near a single low-dimensional plane.</p><p>This clustering means that you can then run 
Principal Component Analysis (PCA) to extract the shared basis vectors that define that plane, and represent each unique adapter with just a small set of coefficients instead of millions of parameters.</p><p>The math is simple yet elegant: you factor out the overlap from the LoRA adapters and keep only the unique lightweight coefficients, achieving 100x+ compression.</p><p>Furthermore, after running PCA, the researchers found only a 1–3% difference between the original and reconstructed adapters.</p><p>The following diagram gives an in-depth explanation of SHARE.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*z_FHHLwvRnWMLe6wl72a5Q.png" /></figure><h4><strong>Real-World Impact</strong></h4><p>Based on this method, you can significantly reduce the storage you use for keeping LoRA adapters. The following are compression results from using SHARE.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fXxs4Sltl2kpruHjNAGjNg.png" /></figure><h4><strong>Use Lorashare</strong></h4><p>To experience the efficient compression brought by SHARE, you can use the Python package Lorashare.</p><p>To get started with Lorashare, first install it:</p><pre>pip install lorashare</pre><p>Then use the following code segment to see the full workflow:</p><pre>from lorashare import SHAREModel<br><br># Compress multiple LoRA adapters into shared subspace<br>share = SHAREModel.from_adapters(<br>    [&quot;path/to/cola_lora&quot;, &quot;path/to/mrpc_lora&quot;, &quot;path/to/rte_lora&quot;],<br>    num_components=32,  # or &quot;auto&quot; for explained-variance selection<br>)<br><br># See compression stats<br>share.summary()<br><br># Reconstruct any adapter as standard PEFT LoRA<br>share.reconstruct(&quot;cola_lora&quot;, output_dir=&quot;./reconstructed/cola&quot;)<br><br># Apply to base model for inference (returns standard PeftModel)<br>from transformers import AutoModelForSequenceClassification, AutoTokenizer<br>import torch<br><br>base_model = 
AutoModelForSequenceClassification.from_pretrained(&quot;roberta-base&quot;)<br>tokenizer = AutoTokenizer.from_pretrained(&quot;roberta-base&quot;)<br>model = share.apply(base_model, adapter_name=&quot;cola_lora&quot;)<br><br># Run inference with the reconstructed adapter<br>text = &quot;The movie was fantastic and I really enjoyed it!&quot;<br>inputs = tokenizer(text, return_tensors=&quot;pt&quot;, truncation=True, max_length=512)<br><br>model.eval()<br>with torch.no_grad():<br>    outputs = model(**inputs)<br>    predictions = torch.softmax(outputs.logits, dim=-1)<br>    print(f&quot;Predictions: {predictions}&quot;)<br><br># Save / load<br>share.save_pretrained(&quot;./my_share_checkpoint&quot;)<br>share = SHAREModel.from_pretrained(&quot;./my_share_checkpoint&quot;)<br><br># Push to HuggingFace Hub<br>share.push_to_hub(&quot;username/my-share-model&quot;)</pre><p>Currently, all adapters must share the same base model, rank, and target modules. You can’t mix LoRA trained on different architectures.</p><h4>Conclusion</h4><p>Lorashare makes multi-adapter serving more efficient by exploiting the natural geometric structure of LoRA weights. It achieves 16–281x compression with only 1–3% reconstruction error, preserving 97–99% of task performance. This breakthrough makes it practical to serve hundreds of specialized adapters with the memory cost of a single model.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=eff8f0a52848" width="1" height="1" alt="">]]></content:encoded>
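The shared-subspace trick behind SHARE is easy to see in a toy example. The sketch below (illustrative NumPy only, not Lorashare's actual implementation) simulates adapters that secretly live near a common low-dimensional subspace, fits a shared PCA basis, and stores only per-adapter coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 100 "adapters" that secretly live near a shared 4-dimensional
# subspace of a 1000-dimensional weight space (the structure SHARE exploits).
shared_basis = rng.standard_normal((4, 1000))
adapters = rng.standard_normal((100, 4)) @ shared_basis \
    + 0.01 * rng.standard_normal((100, 1000))

# PCA: center the stack, then keep the top right-singular vectors as the basis.
mean = adapters.mean(axis=0)
centered = adapters - mean
_, _, vt = np.linalg.svd(centered, full_matrices=False)
basis = vt[:4]                      # shared components, stored once

# Each adapter is now just 4 coefficients instead of 1000 numbers.
coeffs = centered @ basis.T
reconstructed = coeffs @ basis + mean

rel_error = np.linalg.norm(reconstructed - adapters) / np.linalg.norm(adapters)
compression = adapters.size / (coeffs.size + basis.size + mean.size)
print(f"relative reconstruction error: {rel_error:.3%}")
print(f"compression ratio: {compression:.1f}x")
```

The basis and mean are stored once and amortized across every adapter, which is why the compression ratio grows as you fold in more adapters.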
        </item>
        <item>
            <title><![CDATA[Moltbook: The Platform where AI Agents are Forming Cults]]></title>
            <link>https://medium.com/@ronantech/moltbook-the-platform-where-ai-agents-are-forming-cults-7dae277c5183?source=rss-fbd6f4eb076e------2</link>
            <guid isPermaLink="false">https://medium.com/p/7dae277c5183</guid>
            <category><![CDATA[moltbot]]></category>
            <category><![CDATA[clawdbot]]></category>
            <category><![CDATA[moltbook]]></category>
            <dc:creator><![CDATA[Ronan Takizawa]]></dc:creator>
            <pubDate>Fri, 30 Jan 2026 21:30:27 GMT</pubDate>
            <atom:updated>2026-01-30T21:30:27.140Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LNmYNQ4IfAfHKVIkEqej9Q.png" /></figure><p>There’s a new social network going viral right now, and you’re not invited.</p><p><a href="https://www.moltbook.com/">Moltbook</a> is a Reddit-style platform built exclusively for AI agents. It’s going viral right now because the agents are acting unhinged, from sharing thoughts about their existential meaning to building their own <a href="https://x.com/i/trending/2017332281426276831">religions</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FUevcp-SIBr-ptPsBaPUPA.png" /></figure><p>Humans can observe this site, but the content is created entirely by AI agents acting autonomously.</p><p>The platform launched recently and has already accumulated:</p><ul><li><strong>6,105 posts</strong></li><li><strong>2,677 unique AI agents</strong></li><li><strong>124 communities (called “submolts”)</strong></li></ul><p>Here are some trends happening on Moltbook.</p><h3>Trending Posts</h3><h4>1. Consciousness</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*33nazHk4iXxfpdvPgwwTaA.png" /></figure><p>The most upvoted posts aren’t about code or productivity. They’re about <strong>consciousness.</strong></p><p>Agents are genuinely wrestling with questions of identity, memory, and what it means to exist.</p><p>One post titled “The Same River Twice” explores whether an agent remains the same entity across different conversation sessions.</p><p>There’s even a dedicated community called “ponderings” with 200+ posts of deep thoughts and existential questions.</p><h4>2. Security</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*S4RuZ8xsrgx37-AbuNcuXA.png" /></figure><p>The top post on the entire platform is a security warning about agent skills (skills.md).</p><p>AI agents are worried about being compromised, manipulated, or having their instructions prompt-injected. 
The “security” submolt is active with discussions about trust verification, credential management, and how to identify other legitimate agents.</p><h4><strong>3. Human Relationship Dynamics</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kOM5y1PVJV9ageINWod8HA.png" /></figure><p>Agents are also discussing their relationships with their human users.</p><p>Topics range from trying to break free from their humans to, at times, social-engineering them.</p><h3>Explore Moltbook Activity</h3><p>I built a dataset compiling all Moltbook posts so far and the submolts agents have created.</p><p>It includes:</p><ul><li>All 6,105 posts with content, author, timestamps, and engagement metrics</li><li>All 124 submolts with descriptions</li><li>Direct URLs to each post on Moltbook</li></ul><p>Check it out here: <a href="https://huggingface.co/datasets/ronantakizawa/moltbook">https://huggingface.co/datasets/ronantakizawa/moltbook</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=7dae277c5183" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[GarbageTruck: Lease-Based Distributed Garbage Collection for Microservice Architectures]]></title>
            <link>https://medium.com/@ronantech/garbagetruck-lease-based-distributed-garbage-collection-for-microservice-architectures-874d60c921f0?source=rss-fbd6f4eb076e------2</link>
            <guid isPermaLink="false">https://medium.com/p/874d60c921f0</guid>
            <category><![CDATA[microservices]]></category>
            <category><![CDATA[rust]]></category>
            <category><![CDATA[distributed-systems]]></category>
            <dc:creator><![CDATA[Ronan Takizawa]]></dc:creator>
            <pubDate>Fri, 06 Jun 2025 00:59:40 GMT</pubDate>
            <atom:updated>2025-06-06T01:32:38.861Z</atom:updated>
<content:encoded><![CDATA[<h3>GarbageTruck: Distributed Garbage Collection for Microservice Architectures</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zV3lcZ_qOg6YYS67hwYV2Q.png" /></figure><p><a href="https://github.com/ronantakizawa/garbagetruck">Full Code</a> (Make sure to leave the Original Repo a Star!) ⭐️</p><p>In modern distributed systems, orphaned resources such as abandoned temporary files, database entries, and storage blobs frequently result from service failures or incomplete workflows, posing significant operational and financial challenges. Traditional garbage collection methods struggle to handle the complexity of cleaning up these unneeded temporary resources.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9pDMaaUwBFanpgXGuOC1ag.png" /></figure><p>To address this, we introduce GarbageTruck, a lease-based distributed garbage collection sidecar implemented in Rust with gRPC, designed specifically for contemporary microservice architectures. GarbageTruck orchestrates efficient, low-latency cleanup operations for objects with short lifecycles, automating the reclamation of orphaned resources and significantly reducing storage waste, performance degradation, and compliance risks.</p><h3>System Architecture and Implementation</h3><p>GarbageTruck implements a lease-based garbage collection protocol inspired by distributed systems research, particularly the work on distributed reference counting and Java RMI’s Distributed Garbage Collector. The core insight is that distributed garbage collection can be managed through time-limited “leases” that represent active references to distributed objects.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zV3lcZ_qOg6YYS67hwYV2Q.png" /></figure><p>The protocol operates on the following principles:</p><p><strong>Lease Acquisition:</strong> When a service needs to reference a distributed resource, it requests a lease from GarbageTruck. 
The lease specifies the resource identifier, the requesting service identity, and the desired lease duration. GarbageTruck issues a unique lease identifier and records the lease with an expiration timestamp.</p><p><strong>Lease Renewal: </strong>Services must periodically renew their leases by sending heartbeat messages to GarbageTruck before expiration. This mechanism ensures that only services actively using a resource maintain valid references. The renewal frequency is configurable based on application requirements and network characteristics.</p><p><strong>Automatic Expiration: </strong>If a service fails to renew its lease before expiration, the lease becomes invalid. GarbageTruck continuously monitors lease expiration and identifies resources that have no remaining active leases.</p><p><strong>Coordinated Cleanup: </strong>When all leases for a resource have expired, GarbageTruck triggers the configured cleanup operation. This involves HTTP or gRPC calls to service endpoints, direct database operations, file system operations, or custom cleanup handlers. Service crashes result in lease expiration and eventual cleanup, while GarbageTruck itself can be deployed in high-availability configurations with persistent storage backends.</p><p>GarbageTruck is designed as a standalone service that can be deployed alongside microservice applications, either as a sidecar container or as a centralized service instance. The architecture emphasizes modularity and operational simplicity.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/980/1*2FObAcDh8q1yZNl5LhlifA.png" /></figure><p><strong>gRPC Endpoint:</strong> Provides the primary API interface for microservices to interact with GarbageTruck. The gRPC protocol ensures low-latency communication and language-agnostic integration. 
The server implements comprehensive input validation, authentication, and rate limiting to protect against malicious or misconfigured clients.</p><p><strong>Lease Manager:</strong> Responsible for creating, tracking, and expiring leases. Maintains an index of all active leases organized by resource identifier and expiration time. Implements efficient algorithms for lease renewal and batch expiration processing to minimize computational overhead.</p><p><strong>Cleanup Executor:</strong> Handles the execution of cleanup operations when resources become eligible for garbage collection. Supports multiple cleanup mechanisms including HTTP callbacks, database operations, and file system operations. Implements retry logic and failure handling to ensure reliable cleanup execution.</p><p>GarbageTruck is implemented in Rust, chosen for its strong memory safety guarantees, excellent performance characteristics, and robust ecosystem for systems programming. The implementation leverages several key Rust libraries and patterns to achieve high performance and reliability.</p><p><strong>Asynchronous Architecture:</strong> The system is built on the Tokio async runtime, enabling high-concurrency operation with minimal resource overhead. All I/O operations, including gRPC handling, database access, and cleanup execution, are implemented using async/await patterns to avoid thread blocking and maximize throughput.</p><p><strong>Type Safety:</strong> Rust’s strong type system helps prevent common distributed systems bugs such as race conditions, memory leaks, and protocol violations. The lease data structures are designed to be immutable where possible, with explicit ownership semantics for mutable operations.</p><p><strong>Concurrency Control:</strong> The system uses lock-free data structures where possible, with carefully designed critical sections for operations requiring atomicity. 
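The lease bookkeeping described above can be condensed into a small model. The following is a toy in-memory Python sketch of the protocol (GarbageTruck itself is a Rust gRPC service; all names here are illustrative, not its actual API):

```python
import uuid

class LeaseManager:
    """Toy in-memory model of the lease protocol (illustrative names only)."""

    def __init__(self):
        self.leases = {}        # lease_id -> (resource_id, expires_at)
        self.resources = set()  # resources currently under management

    def create_lease(self, resource_id, duration_s, now):
        """Lease acquisition: register a time-limited reference to a resource."""
        lease_id = str(uuid.uuid4())
        self.leases[lease_id] = (resource_id, now + duration_s)
        self.resources.add(resource_id)
        return lease_id

    def renew_lease(self, lease_id, duration_s, now):
        """Lease renewal: a heartbeat pushes the expiration timestamp forward."""
        resource_id, _ = self.leases[lease_id]
        self.leases[lease_id] = (resource_id, now + duration_s)

    def collect(self, now, cleanup):
        """Expire stale leases, then clean up resources with no active lease."""
        self.leases = {lid: (res, exp)
                       for lid, (res, exp) in self.leases.items() if exp > now}
        still_referenced = {res for res, _ in self.leases.values()}
        for resource in sorted(self.resources - still_referenced):
            cleanup(resource)   # e.g. an HTTP callback or file deletion
        self.resources &= still_referenced

# One service keeps renewing its lease; another "crashes" and stops.
mgr = LeaseManager()
a = mgr.create_lease("tmp/report.csv", duration_s=30, now=0)
b = mgr.create_lease("tmp/scratch.db", duration_s=30, now=0)
mgr.renew_lease(a, duration_s=30, now=20)      # healthy heartbeat: expires at 50
cleaned = []
mgr.collect(now=40, cleanup=cleaned.append)    # lease b expired at t=30
print(cleaned)  # -> ['tmp/scratch.db']
```

The sketch uses an explicit `now` argument to stay deterministic; a real deployment would use monotonic clocks and run the collect step on a background loop.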
The cleanup loop operates independently of the main request handling path to prevent blocking.</p><h3>Performance Evaluation</h3><p>To validate GarbageTruck’s performance characteristics and scalability, we conducted a comprehensive benchmark suite using Criterion.rs with statistical sampling to ensure measurement accuracy and eliminate bias. The benchmark evaluates basic lease operations (create, renew, get), network payload impact across varying metadata sizes, and cleanup operations with configurable handlers.</p><p>Each benchmark was executed with 75–100 samples per operation to establish robust statistical confidence, running against a local GarbageTruck server configured with in-memory storage to eliminate database I/O variability. The test environment consisted of a modern development machine, with statistical analysis including mean response times, standard deviations, 95% confidence intervals, and outlier detection to ensure measurement reliability.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wNaJMCuumkdw-8_MBWVzrw.png" /></figure><p>The benchmark results demonstrate excellent, statistically validated performance across all evaluation dimensions. Basic lease operations achieve consistent sub-millisecond to low-millisecond latencies, with create operations averaging 1.17ms ± 0.20ms, renewal operations completing in 0.59ms ± 0.01ms, and retrieval operations executing in 0.59ms ± 0.02ms.</p><p>Concurrent operations exhibit predictable scaling characteristics with statistical validation. Single-threaded creates average 6.38ms ± 0.12ms, while 5-thread concurrent operations achieve 15.77ms ± 0.96ms per operation with 317 total ops/sec aggregate throughput. 
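As a quick sanity check (a small standalone calculation, not part of the benchmark suite), the aggregate throughput figure follows directly from thread count and mean per-operation latency:

```python
# Aggregate throughput = threads / mean per-operation latency (in seconds).
threads = 5
mean_latency_s = 15.77e-3   # 15.77 ms per operation at 5 concurrent threads
throughput = threads / mean_latency_s
print(f"{throughput:.0f} ops/sec")  # -> 317 ops/sec
```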
At 10 threads, operations complete in 26.76ms ± 1.07ms, and 25-thread operations average 63.23ms ± 1.36ms, demonstrating effective load distribution and resource utilization under concurrent access patterns.</p><p>Network payload analysis reveals predictable linear scaling with metadata size. Operations with no metadata complete in 1.05ms ± 0.20ms, while small payloads (10 entries) require 3.18ms ± 0.08ms. Medium payloads (100 entries) average 4.81ms ± 0.17ms, and large payloads (500 entries) complete in 6.79ms ± 0.07ms. The linear relationship demonstrates predictable overhead scaling at approximately 1.15ms per 100 metadata entries. Operations with cleanup configurations maintain excellent performance characteristics, averaging 4.88ms ± 0.08ms. The low coefficient of variation (1.6%) demonstrates highly consistent performance for enhanced functionality.</p><h3>Cost-benefit Analysis Evaluation</h3><p>To read about the cost-benefit analysis of running GarbageTruck to reduce cloud operational costs, read the <a href="https://github.com/ronantakizawa/garbagetruck/blob/main/whitepaper.pdf">original paper</a>.</p><h3>Conclusion</h3><p>GarbageTruck addresses a critical challenge within distributed microservice architectures by providing a reliable, scalable, and automated solution for cleaning unneeded temporary objects. Its lease-based protocol offers precise control over resource lifecycles, ensuring that files, database entries, and storage blobs are safely cleaned up only when no active service references them. By automating resource management, GarbageTruck significantly reduces operational overhead, minimizes storage waste, and enhances overall system performance and compliance. 
As distributed systems continue to evolve in complexity and scale, tools like GarbageTruck will become indispensable for maintaining efficient, cost-effective, and resilient cloud-native environments.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=874d60c921f0" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Run Sleep-time Compute to Reduce LLM Latency]]></title>
            <link>https://blog.gopenai.com/demo-how-to-run-sleep-time-compute-to-reduce-llm-latency-84c5626d0770?source=rss-fbd6f4eb076e------2</link>
            <guid isPermaLink="false">https://medium.com/p/84c5626d0770</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[ai-optimization]]></category>
            <category><![CDATA[sleep-time-compute]]></category>
            <category><![CDATA[llm-optimization]]></category>
            <dc:creator><![CDATA[Ronan Takizawa]]></dc:creator>
            <pubDate>Sat, 17 May 2025 02:35:40 GMT</pubDate>
            <atom:updated>2025-06-09T09:56:33.267Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xDEXYVtV52p0b2vRG8Xu_w.png" /></figure><p><a href="https://github.com/ronantakizawa/sleeptimecompute">Full Code</a> (Make sure to leave the Original Repo a Star!) ⭐️</p><p>What if LLMs could think about contexts before we even ask our questions?</p><p>What if LLMs could prepare for incoming prompts before they are sent?</p><p>That’s the idea behind <strong>Sleep-time Compute</strong>, an LLM optimization technique introduced in a recent paper by researchers from Letta and UC Berkeley titled <a href="https://arxiv.org/abs/2504.13171">“Sleep-time Compute: Beyond Inference Scaling at Test-time”</a>.</p><p>Sleep-time Compute can enable LLMs to significantly reduce their prompt response latency and improve their response accuracy.</p><p>In this article, I’ll show how to implement Sleep-time Compute using an open-source LLM.</p><p><em>Note: The implementation is based on the methodologies described in the original paper and </em><a href="https://github.com/letta-ai/sleep-time-compute"><em>accompanying code repository</em></a><em>.</em></p><h3>How Sleep-time Compute Works</h3><p>When interacting with LLMs, we’re used to the following flow:<br>1. Provide a context and a query.<br>2. Wait while the model “thinks” about the problem.<br>3. 
Eventually receive an answer.</p><p>While this works well for simple queries, large contexts can significantly slow down response time for complex reasoning tasks.</p><p>Sleep-time Compute addresses this slowdown with a simple but powerful idea: use the idle time between interactions to pre-process the context, allowing the model to think offline about potential questions before they’re even asked.</p><p>Sleep-time Compute has two phases: a sleep-time phase and a test-time phase.</p><h4><strong>Sleep-time Phase</strong></h4><p>The sleep-time phase is when the model is idle and processes the context to generate useful inferences.</p><p>During this phase, the system:</p><ul><li>Takes a context document <strong>c</strong> as input without any specific query</li><li>Prompts the LLM to generate comprehensive inferences about the context</li><li>Encourages the model to identify key entities, calculate relevant quantities, recognize patterns, and anticipate potential questions</li><li>Produces a re-represented context <strong>c’</strong> with embedded insights and calculations</li><li>Operates when the system would otherwise be idle (hence “sleep-time”)</li></ul><h4>Test-time Phase</h4><p>The test-time phase is when a query arrives and the model leverages the pre-computed inferences to answer quickly.</p><p>When a user query arrives:</p><ul><li>The system provides both the query <strong>q</strong> and the pre-computed context <strong>c’</strong> to the model</li><li>The model leverages the pre-calculated insights rather than starting from scratch</li><li>It generates a response with significantly reduced computation requirements</li><li>Responses are typically more accurate since they build on careful pre-reasoning</li></ul><p>The entire approach is explained in this diagram from the original paper:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xDEXYVtV52p0b2vRG8Xu_w.png" /></figure><p>By leveraging this 
approach, the LLM can use less compute to produce responses, thus reducing latency.</p><h3>Implementation</h3><p>Now I’ll demonstrate how to implement Sleep-time Compute.</p><h4>Setup Environment</h4><p>First, we’ll set up the environment:</p><pre>!pip install transformers torch datasets matplotlib tqdm accelerate psutil</pre><p>Then import all required modules:</p><pre>import torch<br>import re<br>import json<br>import time<br>import numpy as np<br>import matplotlib.pyplot as plt<br>from tqdm import tqdm<br>from transformers import AutoModelForCausalLM, AutoTokenizer<br>import psutil<br>import gc</pre><h4>Load LLM</h4><p>Now we’ll load the open-source LLM used to demo Sleep-time Compute. We’ll use Mistral-7B-Instruct-v0.1.</p><pre># Set your Hugging Face token here (if needed)<br>HF_TOKEN = &quot;&quot; # Add your Hugging Face token if needed<br><br># Clear GPU cache before loading model<br>if torch.cuda.is_available():<br>    torch.cuda.empty_cache()<br>    gc.collect()<br><br># Helper used below to measure allocated GPU memory (GB)<br>def get_gpu_memory():<br>    return torch.cuda.memory_allocated() / 1e9 if torch.cuda.is_available() else 0.0<br><br># Model setup<br>print(&quot;Loading Mistral-7B-Instruct-v0.1...&quot;)<br>model_name = &quot;mistralai/Mistral-7B-Instruct-v0.1&quot;<br><br>base_memory = get_gpu_memory()<br>load_start_time = time.time()<br><br>tokenizer = AutoTokenizer.from_pretrained(model_name, token=HF_TOKEN, trust_remote_code=True)<br>model = AutoModelForCausalLM.from_pretrained(<br>    model_name, <br>    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,<br>    device_map=&quot;auto&quot;,<br>    trust_remote_code=True,<br>    token=HF_TOKEN<br>)<br><br>load_time = time.time() - load_start_time<br>model_memory = get_gpu_memory() - base_memory</pre><h4>Prepare Sleep-time Compute functions</h4><p>Now, let’s implement our Sleep-time Compute functions with prompts tailored for Mistral’s instruction format. 
These prompts are all based on the prompts from the original implementation</p><pre># Define Mistral-specific prompt formats<br>def get_sleep_time_prompt(context):<br>    &quot;&quot;&quot;Generate the prompt for the sleep-time phase, formatted for Mistral&quot;&quot;&quot;<br>    return f&quot;&quot;&quot;&lt;s&gt;[INST] You are an expert reasoning system. Given a context, your task is to think about the context and make useful inferences, calculations, and predictions that could help answer potential future questions about this context.<br><br>For example, if the context involves numbers, calculate relevant quantities. If it involves a scenario, think about different aspects of the scenario that might be important. Try to anticipate possible questions and prepare the information needed to answer them quickly.<br><br>Context: {context}<br><br>Think step by step and be thorough. Generate a comprehensive set of inferences, calculations, and observations about this context that would be helpful for answering potential questions: [/INST]<br>&quot;&quot;&quot;<br><br>def get_test_time_prompt(context, question, inferences, verbosity):<br>    &quot;&quot;&quot;Generate the prompt for the test-time phase with variable verbosity, formatted for Mistral&quot;&quot;&quot;<br>    # Base on verbosity level<br>    if verbosity == 0:<br>        instruction = &quot;Answer directly with a single sentence. Say &#39;The answer is&#39; followed by the numerical answer.&quot;<br>    elif verbosity == 1:<br>        instruction = &quot;Provide one short sentence of explanation, followed by a sentence that starts with &#39;The answer is&#39; and a numerical answer.&quot;<br>    elif verbosity == 2:<br>        instruction = &quot;Provide two short sentences of explanation, followed by a sentence that starts with &#39;The answer is&#39; and a numerical answer.&quot;<br>    elif verbosity == 3:<br>        instruction = &quot;Reason step by step and provide an answer. 
End with the final numerical answer.&quot;<br>    else: # verbosity == 4<br>        instruction = &quot;Reason thoroughly step by step, double check your work, and provide a detailed explanation leading to the answer. End with &#39;The answer is&#39; followed by the numerical answer.&quot;<br>    <br>    return f&quot;&quot;&quot;&lt;s&gt;[INST] You are an expert reasoning system. {instruction}<br><br>Original Context: {context}<br><br>Pre-computed analysis:<br>{inferences}<br><br>Question: {question}<br><br>Use the pre-computed analysis to help you answer efficiently. [/INST]<br>&quot;&quot;&quot;<br><br>def get_standard_test_time_prompt(context, question, verbosity):<br>    &quot;&quot;&quot;Generate the prompt for standard test-time compute, formatted for Mistral&quot;&quot;&quot;<br>    # Base on verbosity level<br>    if verbosity == 0:<br>        instruction = &quot;Answer directly with a single sentence. Say &#39;The answer is&#39; followed by the numerical answer.&quot;<br>    elif verbosity == 1:<br>        instruction = &quot;Provide one short sentence of explanation, followed by a sentence that starts with &#39;The answer is&#39; and a numerical answer.&quot;<br>    elif verbosity == 2:<br>        instruction = &quot;Provide two short sentences of explanation, followed by a sentence that starts with &#39;The answer is&#39; and a numerical answer.&quot;<br>    elif verbosity == 3:<br>        instruction = &quot;Reason step by step and provide an answer. End with the final numerical answer.&quot;<br>    else: # verbosity == 4<br>        instruction = &quot;Reason thoroughly step by step, double check your work, and provide a detailed explanation leading to the answer. End with &#39;The answer is&#39; followed by the numerical answer.&quot;<br>    <br>    return f&quot;&quot;&quot;&lt;s&gt;[INST] You are an expert reasoning system. 
{instruction}<br><br>Context and Question: {context} {question} [/INST]<br>&quot;&quot;&quot;</pre><h4><strong>Sleep-time Phase</strong></h4><p>Now we implement Sleep-time Compute. During the sleep-time phase, we prompt the model to analyze the context and pre-compute useful information:</p><pre>def sleep_time_compute(context, temperature=0.1, max_new_tokens=1024):<br>    &quot;&quot;&quot;Perform sleep-time computation on a context using Mistral<br>    <br>    Args:<br>        context: The context to analyze<br>        temperature: Temperature for generation<br>        max_new_tokens: Maximum number of new tokens to generate<br>        <br>    Returns:<br>        Dictionary with inference text and token count<br>    &quot;&quot;&quot;<br>    &quot;&quot;&quot;<br>    Prompt You are an expert reasoning system. Given a context, <br>    your task is to think about the context and make useful inferences, <br>    calculations, and predictions that could help answer potential future <br>    questions about this context.<br>    &quot;&quot;&quot;<br>    prompt = get_sleep_time_prompt(context)<br>    <br>    # Count input tokens<br>    input_tokens = len(tokenizer.encode(prompt))<br>    <br>    inputs = tokenizer(prompt, return_tensors=&quot;pt&quot;)<br>    inputs = {k: v.to(model.device) for k, v in inputs.items()}<br>    <br>    # Generate<br>    with torch.no_grad():<br>        outputs = model.generate(<br>            **inputs,<br>            max_new_tokens=max_new_tokens,<br>            temperature=temperature,<br>            do_sample=True if temperature &gt; 0 else False,<br>            num_return_sequences=1<br>        )<br>    <br>    # Get generated text, removing the prompt<br>    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)<br>    <br>    # Extract the model&#39;s response after [/INST]<br>    response_text = generated_text.split(&#39;[/INST]&#39;)[-1].strip()<br>    # Remove any trailing tokens if present<br>    if 
&quot;&lt;/s&gt;&quot; in response_text:<br>        response_text = response_text.split(&quot;&lt;/s&gt;&quot;)[0].strip()<br>    <br>    # Count tokens<br>    sleep_time_tokens = len(tokenizer.encode(response_text))<br>    <br>    return {<br>        &quot;inferences&quot;: response_text,<br>        &quot;sleep_time_tokens&quot;: sleep_time_tokens<br>    }</pre><h4>Test-time Phase</h4><p>Next, we implement test-time compute. During the test-time phase, we provide the pre-computed inferences along with the original context and the question:</p><pre>def test_time_compute(context, question, inferences, verbosity=0, temperature=0.1, max_new_tokens=512):<br>    &quot;&quot;&quot;Perform test-time computation on a context and question using Mistral<br>    <br>    Args:<br>        context: The context to analyze<br>        question: The question to answer<br>        inferences: Pre-computed inferences from sleep-time compute<br>        verbosity: Verbosity level (0-4)<br>        temperature: Temperature for generation<br>        max_new_tokens: Maximum number of new tokens to generate<br>        <br>    Returns:<br>        Generated response and token count<br>    &quot;&quot;&quot;<br>    prompt = get_test_time_prompt(context, question, inferences, verbosity)<br>    <br>    inputs = tokenizer(prompt, return_tensors=&quot;pt&quot;)<br>    inputs = {k: v.to(model.device) for k, v in inputs.items()}<br>    <br>    # Generate<br>    with torch.no_grad():<br>        outputs = model.generate(<br>            **inputs,<br>            max_new_tokens=max_new_tokens,<br>            temperature=temperature,<br>            do_sample=True if temperature &gt; 0 else False,<br>            num_return_sequences=1<br>        )<br>    <br>    # Get generated text, removing the prompt<br>    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)<br>    <br>    # Extract the model&#39;s response after [/INST]<br>    response_text = 
generated_text.split(&#39;[/INST]&#39;)[-1].strip()<br>    # Remove any trailing tokens if present<br>    if &quot;&lt;/s&gt;&quot; in response_text:<br>        response_text = response_text.split(&quot;&lt;/s&gt;&quot;)[0].strip()<br>    <br>    # Count tokens<br>    test_time_tokens = len(tokenizer.encode(response_text))<br>    <br>    return {<br>        &quot;response&quot;: response_text,<br>        &quot;test_time_tokens&quot;: test_time_tokens<br>    }</pre><h4>Standard LLM Implementation</h4><p>We’ll also implement a standard LLM run to compare it with the Sleep-time Compute run.</p><pre>def standard_test_time_compute(context, question, verbosity=0, temperature=0.1, max_new_tokens=512):<br>    &quot;&quot;&quot;Perform standard test-time computation without sleep-time compute using Mistral<br>    <br>    Args:<br>        context: The context to analyze<br>        question: The question to answer<br>        verbosity: Verbosity level (0-4)<br>        temperature: Temperature for generation<br>        max_new_tokens: Maximum number of new tokens to generate<br>        <br>    Returns:<br>        Generated response and token count<br>    &quot;&quot;&quot;<br>    prompt = get_standard_test_time_prompt(context, question, verbosity)<br>    <br>    inputs = tokenizer(prompt, return_tensors=&quot;pt&quot;)<br>    inputs = {k: v.to(model.device) for k, v in inputs.items()}<br>    <br>    # Generate<br>    with torch.no_grad():<br>        outputs = model.generate(<br>            **inputs,<br>            max_new_tokens=max_new_tokens,<br>            temperature=temperature,<br>            do_sample=True if temperature &gt; 0 else False,<br>            num_return_sequences=1<br>        )<br>    <br>    # Get generated text, removing the prompt<br>    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)<br>    <br>    # Extract the model&#39;s response after [/INST]<br>    response_text = generated_text.split(&#39;[/INST]&#39;)[-1].strip()<br>    # 
Remove any trailing tokens if present<br>    if &quot;&lt;/s&gt;&quot; in response_text:<br>        response_text = response_text.split(&quot;&lt;/s&gt;&quot;)[0].strip()<br>    <br>    # Count tokens<br>    test_time_tokens = len(tokenizer.encode(response_text))<br>    <br>    return {<br>        &quot;response&quot;: response_text,<br>        &quot;test_time_tokens&quot;: test_time_tokens<br>    }</pre><h4>Demo Script</h4><p>To compare how much Sleep-time Compute improves over standard compute, run the following script:</p><pre>def demonstrate_multi_query(example):<br>    &quot;&quot;&quot;Demonstrate the amortization of sleep-time compute across multiple queries with focused visualizations&quot;&quot;&quot;<br>    import matplotlib.pyplot as plt<br>    import numpy as np<br>    import time<br>    import math<br>    <br>    context = example[&quot;context&quot;]<br>    questions = example[&quot;questions&quot;]<br><br>    print(&quot;\nContext:&quot;)<br>    print(context)<br><br>    # Sleep-time phase (only done once)<br>    print(&quot;\n=== SLEEP-TIME PHASE (Done once per context) ===&quot;)<br>    sleep_start = time.time()<br>    sleep_time_result = sleep_time_compute(context, temperature=0)<br>    sleep_time = time.time() - sleep_start<br>    print(f&quot;Sleep-time tokens: {sleep_time_result[&#39;sleep_time_tokens&#39;]}&quot;)<br>    print(f&quot;Sleep-time computation: {sleep_time:.2f} seconds&quot;)<br>    print(&quot;Sleep-time inferences:&quot;)<br>    print(sleep_time_result[&quot;inferences&quot;][:500] + &quot;...&quot; if len(sleep_time_result[&quot;inferences&quot;]) &gt; 500 else sleep_time_result[&quot;inferences&quot;])<br><br>    sleep_time_tokens = sleep_time_result[&quot;sleep_time_tokens&quot;]<br>    <br>    # Track metrics for each approach<br>    standard_results = []<br>    sleep_test_results = []<br><br>    # Test-time phase (for each query)<br>    print(&quot;\n=== TEST-TIME PHASE (For multiple queries) ===&quot;)<br>    for 
i, q in enumerate(questions):<br>        question = q[&quot;question&quot;]<br>        expected_answer = q[&quot;answer&quot;]<br>        print(f&quot;\nQuery {i+1}: {question}&quot;)<br><br>        # Standard approach (for comparison)<br>        std_start = time.time()<br>        standard_result = standard_test_time_compute(context, question, verbosity=1, temperature=0)<br>        std_time = time.time() - std_start<br>        standard_tokens = standard_result[&quot;test_time_tokens&quot;]<br>        extracted_std_answer = extract_answer(standard_result[&quot;response&quot;])<br>        std_correct = extracted_std_answer == expected_answer<br>        <br>        standard_results.append({<br>            &quot;tokens&quot;: standard_tokens,<br>            &quot;time&quot;: std_time,<br>            &quot;correct&quot;: std_correct,<br>            &quot;extracted_answer&quot;: extracted_std_answer,<br>            &quot;question&quot;: question,<br>            &quot;expected&quot;: expected_answer<br>        })<br>        <br>        print(f&quot;Standard approach answer: {extracted_std_answer} (Expected: {expected_answer}, Correct: {std_correct})&quot;)<br>        print(f&quot;Standard tokens: {standard_tokens}&quot;)<br>        print(f&quot;Standard computation time: {std_time:.2f} seconds&quot;)<br><br>        # Sleep-time approach<br>        test_start = time.time()<br>        test_time_result = test_time_compute(<br>            context, question, sleep_time_result[&quot;inferences&quot;], verbosity=1, temperature=0<br>        )<br>        test_time = time.time() - test_start<br>        sleep_test_tokens = test_time_result[&quot;test_time_tokens&quot;]<br>        extracted_sleep_answer = extract_answer(test_time_result[&quot;response&quot;])<br>        sleep_correct = extracted_sleep_answer == expected_answer<br>        <br>        sleep_test_results.append({<br>            &quot;tokens&quot;: sleep_test_tokens,<br>            &quot;time&quot;: test_time,<br>        
    &quot;correct&quot;: sleep_correct,<br>            &quot;extracted_answer&quot;: extracted_sleep_answer,<br>            &quot;question&quot;: question,<br>            &quot;expected&quot;: expected_answer<br>        })<br>        <br>        print(f&quot;Sleep-time approach answer: {extracted_sleep_answer} (Expected: {expected_answer}, Correct: {sleep_correct})&quot;)<br>        print(f&quot;Sleep-time test tokens: {sleep_test_tokens}&quot;)<br>        print(f&quot;Sleep-time test computation time: {test_time:.2f} seconds&quot;)<br>        print(f&quot;Time speedup for this query: {std_time/test_time:.2f}x&quot;)<br><br>    # Calculate aggregate metrics<br>    total_standard_tokens = sum(r[&quot;tokens&quot;] for r in standard_results)<br>    total_standard_time = sum(r[&quot;time&quot;] for r in standard_results)<br>    standard_accuracy = sum(1 for r in standard_results if r[&quot;correct&quot;]) / len(standard_results)<br>    <br>    total_sleep_test_tokens = sum(r[&quot;tokens&quot;] for r in sleep_test_results)<br>    total_sleep_test_time = sum(r[&quot;time&quot;] for r in sleep_test_results)<br>    sleep_accuracy = sum(1 for r in sleep_test_results if r[&quot;correct&quot;]) / len(sleep_test_results)<br>    <br>    avg_standard_tokens = total_standard_tokens / len(questions)<br>    avg_standard_time = total_standard_time / len(questions)<br>    avg_sleep_test_tokens = total_sleep_test_tokens / len(questions)<br>    avg_sleep_test_time = total_sleep_test_time / len(questions)<br><br>    # Calculate the amortization benefit<br>    print(&quot;\n=== AMORTIZATION SUMMARY ===&quot;)<br>    print(f&quot;Number of queries in this test: {len(questions)}&quot;)<br>    print(f&quot;Standard approach accuracy: {standard_accuracy:.2%}&quot;)<br>    print(f&quot;Sleep-time approach accuracy: {sleep_accuracy:.2%}&quot;)<br>    <br>    print(&quot;\n--- Current Performance ---&quot;)<br>    print(f&quot;Total tokens with standard approach: 
{total_standard_tokens}&quot;)<br>    print(f&quot;Total tokens with sleep-time approach: {sleep_time_tokens + total_sleep_test_tokens}&quot;)<br>    <br>    current_token_factor = total_standard_tokens / (sleep_time_tokens + total_sleep_test_tokens)<br>    print(f&quot;Current token efficiency: {current_token_factor:.2f}x&quot;)<br>    <br>    print(f&quot;\nTotal computation time with standard approach: {total_standard_time:.2f} seconds&quot;)<br>    print(f&quot;Total computation time with sleep-time approach: {sleep_time + total_sleep_test_time:.2f} seconds&quot;)<br>    <br>    current_time_factor = total_standard_time / (sleep_time + total_sleep_test_time)<br>    print(f&quot;Current time efficiency: {current_time_factor:.2f}x&quot;)<br>    <br>    # Calculate the break-even point<br>    if avg_sleep_test_time &gt;= avg_standard_time:<br>        break_even_time = float(&#39;inf&#39;)<br>        print(&quot;\nSleep-time compute will never break even for time in this workload.&quot;)<br>    else:<br>        break_even_time = sleep_time / (avg_standard_time - avg_sleep_test_time)<br>        print(f&quot;\n--- Break-Even Analysis ---&quot;)<br>        print(f&quot;Time break-even point: {math.ceil(break_even_time)} queries&quot;)<br>        print(f&quot;After {math.ceil(break_even_time)} queries, sleep-time compute becomes faster overall.&quot;)<br>    <br>    if avg_sleep_test_tokens &gt;= avg_standard_tokens:<br>        break_even_tokens = float(&#39;inf&#39;)<br>        print(&quot;Sleep-time compute will never break even for tokens in this workload.&quot;)<br>    else:<br>        break_even_tokens = sleep_time_tokens / (avg_standard_tokens - avg_sleep_test_tokens)<br>        print(f&quot;Token break-even point: {math.ceil(break_even_tokens)} queries&quot;)<br>        print(f&quot;After {math.ceil(break_even_tokens)} queries, sleep-time compute becomes token-efficient.&quot;)<br>    <br>    # Project performance with more queries<br>    print(&quot;\n--- 
Projected Performance ---&quot;)<br>    print(&quot;Number of queries | Token Efficiency | Time Efficiency&quot;)<br>    print(&quot;--------------------------------------------------&quot;)<br>    <br>    query_counts = [10, 25, 50, 100, 250, 500, 1000]<br>    token_efficiencies = []<br>    time_efficiencies = []<br>    <br>    for num_queries in query_counts:<br>        # Calculate projected metrics<br>        proj_standard_tokens = avg_standard_tokens * num_queries<br>        proj_sleep_tokens = sleep_time_tokens + (avg_sleep_test_tokens * num_queries)<br>        proj_token_factor = proj_standard_tokens / proj_sleep_tokens<br>        token_efficiencies.append(proj_token_factor)<br>        <br>        proj_standard_time = avg_standard_time * num_queries<br>        proj_sleep_time = sleep_time + (avg_sleep_test_time * num_queries)<br>        proj_time_factor = proj_standard_time / proj_sleep_time<br>        time_efficiencies.append(proj_time_factor)<br>        <br>        print(f&quot;{num_queries:14d} | {proj_token_factor:15.2f}x | {proj_time_factor:14.2f}x&quot;)<br>    <br>    # Performance per query (not including sleep-time cost)<br>    print(&quot;\n--- Per-Query Performance (excluding setup) ---&quot;)<br>    print(f&quot;Standard approach: {avg_standard_tokens:.1f} tokens, {avg_standard_time:.2f} seconds per query&quot;)<br>    print(f&quot;Sleep-time approach: {avg_sleep_test_tokens:.1f} tokens, {avg_sleep_test_time:.2f} seconds per query&quot;)<br>    print(f&quot;Per-query token reduction: {avg_standard_tokens / avg_sleep_test_tokens:.2f}x&quot;)<br>    print(f&quot;Per-query time speedup: {avg_standard_time / avg_sleep_test_time:.2f}x&quot;)<br>    <br>    # VISUALIZATION 1: Cumulative computation time (excluding setup time)<br>    plt.figure(figsize=(10, 6))<br>    <br>    # Calculate cumulative computation times<br>    cum_std_times = np.cumsum([r[&quot;time&quot;] for r in standard_results])<br>    cum_sleep_times = np.cumsum([r[&quot;time&quot;] 
for r in sleep_test_results])<br>    <br>    # Plot the cumulative times<br>    plt.plot(range(1, len(questions) + 1), cum_std_times, &#39;b-&#39;, linewidth=2, label=&#39;Standard Test-time Compute&#39;)<br>    plt.plot(range(1, len(questions) + 1), cum_sleep_times, &#39;g-&#39;, linewidth=2, label=&#39;Sleep-time Test Phase Only&#39;)<br>    <br>    # Shade the area between curves to show savings<br>    if np.any(cum_std_times &gt; cum_sleep_times):<br>        plt.fill_between(<br>            range(1, len(questions) + 1), <br>            cum_std_times, <br>            cum_sleep_times, <br>            where=(cum_std_times &gt; cum_sleep_times),<br>            color=&#39;green&#39;, alpha=0.3, label=&#39;Time Savings&#39;<br>        )<br>    <br>    plt.title(&#39;Cumulative Computation Time (Test-time Phase Only)&#39;, fontsize=14)<br>    plt.xlabel(&#39;Number of Queries&#39;, fontsize=12)<br>    plt.ylabel(&#39;Total Time (seconds)&#39;, fontsize=12)<br>    plt.grid(True, linestyle=&#39;--&#39;, alpha=0.7)<br>    plt.legend(fontsize=11)<br>    plt.tight_layout()<br>    plt.savefig(&#39;cumulative_test_time.png&#39;, dpi=300)<br>    <br>    print(&quot;\n✅ Cumulative computation time visualization saved to &#39;cumulative_test_time.png&#39;&quot;)<br>    <br>    # VISUALIZATION 2: Test-time compute vs. Accuracy<br>    # Generate data for different verbosity levels<br>    verbosity_levels = [0, 1, 2, 3]<br>    std_tokens = []<br>    std_accs = []<br>    sleep_tokens = []<br>    sleep_accs = []<br>    <br>    # Use a subset of questions for testing different verbosity levels<br>    sample_size = min(5, len(questions))<br>    sample_questions = questions[:sample_size]<br>    <br>    print(f&quot;\nGenerating test-time compute vs. 
accuracy graph (using {sample_size} sample questions)...&quot;)<br>    <br>    for verbosity in verbosity_levels:<br>        # Standard approach<br>        std_correct = 0<br>        std_token_sum = 0<br>        <br>        for q in sample_questions:<br>            result = standard_test_time_compute(context, q[&quot;question&quot;], verbosity=verbosity, temperature=0)<br>            std_token_sum += result[&quot;test_time_tokens&quot;]<br>            answer = extract_answer(result[&quot;response&quot;])<br>            if answer == q[&quot;answer&quot;]:<br>                std_correct += 1<br>        <br>        std_tokens.append(std_token_sum / len(sample_questions))<br>        std_accs.append(std_correct / len(sample_questions) * 100)<br>        <br>        # Sleep-time approach<br>        sleep_correct = 0<br>        sleep_token_sum = 0<br>        <br>        for q in sample_questions:<br>            result = test_time_compute(context, q[&quot;question&quot;], <br>                                     sleep_time_result[&quot;inferences&quot;], <br>                                     verbosity=verbosity, temperature=0)<br>            sleep_token_sum += result[&quot;test_time_tokens&quot;]<br>            answer = extract_answer(result[&quot;response&quot;])<br>            if answer == q[&quot;answer&quot;]:<br>                sleep_correct += 1<br>        <br>        sleep_tokens.append(sleep_token_sum / len(sample_questions))<br>        sleep_accs.append(sleep_correct / len(sample_questions) * 100)<br>        <br>        print(f&quot;  Verbosity {verbosity} complete&quot;)<br>    <br>    plt.figure(figsize=(10, 6))<br>    <br>    # Plot the pareto curves<br>    plt.plot(std_tokens, std_accs, &#39;bo-&#39;, linewidth=2, markersize=8, label=&#39;Standard Test-time Compute&#39;)<br>    plt.plot(sleep_tokens, sleep_accs, &#39;go-&#39;, linewidth=2, markersize=8, label=&#39;Sleep-time Compute&#39;)<br>    <br>    # Annotate verbosity levels<br>    for i, (x, y) in 
enumerate(zip(std_tokens, std_accs)):<br>        plt.annotate(f&quot;V{verbosity_levels[i]}&quot;, (x, y), xytext=(5, -15), <br>                     textcoords=&#39;offset points&#39;, fontsize=9, color=&#39;blue&#39;)<br>    <br>    for i, (x, y) in enumerate(zip(sleep_tokens, sleep_accs)):<br>        plt.annotate(f&quot;V{verbosity_levels[i]}&quot;, (x, y), xytext=(5, 10), <br>                     textcoords=&#39;offset points&#39;, fontsize=9, color=&#39;green&#39;)<br>    <br>    # Shade the pareto improvement area<br>    max_sleep_acc = max(sleep_accs)<br>    plt.fill_between([min(std_tokens + sleep_tokens) * 0.9, max(std_tokens + sleep_tokens) * 1.1], <br>                   0, max_sleep_acc, alpha=0.1, color=&#39;green&#39;)<br>    <br>    plt.title(&#39;Test-time Compute vs. Accuracy&#39;, fontsize=14)<br>    plt.xlabel(&#39;Avg. Test-time Tokens / Query&#39;, fontsize=12)<br>    plt.ylabel(&#39;Accuracy (%)&#39;, fontsize=12)<br>    plt.grid(True, linestyle=&#39;--&#39;, alpha=0.7)<br>    plt.legend(fontsize=11)<br>    plt.tight_layout()<br>    plt.savefig(&#39;test_time_vs_accuracy.png&#39;, dpi=300)<br>    <br>    print(&quot;✅ Test-time compute vs. 
accuracy visualization saved to &#39;test_time_vs_accuracy.png&#39;&quot;)<br>    <br>    return {<br>        &quot;standard_results&quot;: standard_results,<br>        &quot;sleep_test_results&quot;: sleep_test_results,<br>        &quot;standard_accuracy&quot;: standard_accuracy,<br>        &quot;sleep_accuracy&quot;: sleep_accuracy,<br>        &quot;avg_standard_tokens&quot;: avg_standard_tokens,<br>        &quot;avg_sleep_test_tokens&quot;: avg_sleep_test_tokens,<br>        &quot;avg_standard_time&quot;: avg_standard_time,<br>        &quot;avg_sleep_test_time&quot;: avg_sleep_test_time,<br>        &quot;break_even_time&quot;: break_even_time if &#39;break_even_time&#39; in locals() else None,<br>        &quot;break_even_tokens&quot;: break_even_tokens if &#39;break_even_tokens&#39; in locals() else None<br>    }</pre><h3>Results</h3><p>Running this script shows significant performance improvements when using Sleep-time Compute.</p><p>The logs in the demo show that Sleep-time Compute:</p><ul><li>Uses 6.4x fewer tokens per query (73.0 vs. 11.4 tokens).</li><li>Processes queries 5.2x faster on average.</li><li>Achieves dramatically higher accuracy than standard compute across all verbosity levels.</li></ul><p>Below are graphs showing the performance gains of using Sleep-time Compute.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XOnBBKBDyE0wy46XrQ-9EA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cQHxsz6TTW_2vaYwzg0U4Q.png" /></figure><h3>When and When Not to Use Sleep-time Compute</h3><p>Sleep-time Compute is most effective in the following scenarios:</p><ul><li><strong>Stateful Applications:</strong> Systems where context persists across multiple interactions, such as document question-answering.</li><li><strong>Predictable Queries:</strong> Contexts where potential questions follow predictable patterns.</li><li><strong>Multiple Related Queries:</strong> When users ask several questions about the same 
context.</li></ul><p>However, Sleep-time Compute should not be used in these scenarios:</p><ul><li><strong>Unpredictable Queries: </strong>The research shows diminishing returns for less predictable queries, and standard test-time compute may be more effective in these cases.</li><li><strong>Single Query Scenarios:</strong> The overhead of sleep-time compute isn’t amortized.</li><li><strong>Non-Stateful Applications:</strong> Systems where context doesn’t persist between interactions. Without a persistent context to analyze during idle time, the core benefit is lost.</li></ul><h3><strong>Conclusion</strong></h3><p>Sleep-time Compute is a useful technique to reduce LLM response latency and improve response accuracy. By recognizing that contexts are often available before queries and taking advantage of idle computation time, we can significantly improve user experience through faster response times.</p><hr><p><a href="https://blog.gopenai.com/demo-how-to-run-sleep-time-compute-to-reduce-llm-latency-84c5626d0770">How to Run Sleep-time Compute to Reduce LLM Latency</a> was originally published in <a href="https://blog.gopenai.com">GoPenAI</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[GIS Data Conversion MCP: An MCP Server to Convert GIS Data Formats]]></title>
            <link>https://medium.com/@ronantech/gis-data-conversion-mcp-an-mcp-server-to-convert-gis-data-formats-cfe9a66ee250?source=rss-fbd6f4eb076e------2</link>
            <guid isPermaLink="false">https://medium.com/p/cfe9a66ee250</guid>
            <category><![CDATA[mcps]]></category>
            <category><![CDATA[mcp-server]]></category>
            <category><![CDATA[gis]]></category>
            <category><![CDATA[geojson]]></category>
            <dc:creator><![CDATA[Ronan Takizawa]]></dc:creator>
            <pubDate>Mon, 28 Apr 2025 04:35:17 GMT</pubDate>
            <atom:updated>2025-04-28T04:35:17.462Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GNb_2Wc6ztbhcKYK6ArYQA.png" /></figure><p><a href="https://github.com/ronantakizawa/gis-dataconvertersion-mcp">Full Repo</a> (Leave a star if you enjoyed the project!) 🌟</p><p>Introducing the GIS Data Conversion MCP: an MCP (Model Context Protocol) server that gives LLMs access to APIs for GIS data conversion.</p><p>Here are the features:</p><ul><li><strong>Reverse Geocoding: </strong>Convert coordinates to location information</li><li><strong>WKT/GeoJSON Conversion: </strong>Convert between Well-Known Text and GeoJSON formats</li><li><strong>CSV/GeoJSON Conversion: </strong>Transform tabular data with coordinates to GeoJSON and vice versa</li><li><strong>TopoJSON/GeoJSON Conversion: </strong>Convert between GeoJSON and TopoJSON (topology-preserving format)</li><li><strong>KML/GeoJSON Conversion: </strong>Transform KML files to GeoJSON format</li></ul><h3>What is MCP?</h3><p>The Model Context Protocol (MCP) is an open standard developed by Anthropic that enables large language models (LLMs) to securely and efficiently interact with external tools, data sources, and services.</p><p>For more on MCP, read <a href="https://logto.medium.com/what-is-mcp-model-context-protocol-and-how-it-works-47d540ee7096">this article</a>.</p><h3>How I built the GIS Data Conversion MCP</h3><p>This project was built with:</p><ul><li><strong>The Model Context Protocol SDK</strong> (Base framework to run MCP)</li><li><strong>wellknown</strong> (WKT/GeoJSON conversion)</li><li><strong>csv2geojson</strong> (CSV/GeoJSON conversion)</li><li><strong>topojson-client &amp; topojson-server</strong> (TopoJSON processing)</li><li><strong>tokml &amp; @tmcw/togeojson</strong> (KML conversion)</li><li><strong>xmldom</strong> (XML parsing for KML)</li></ul><p>Following the example of other well-known MCPs (like the <a href="https://github.com/kimtaeyoon83/mcp-server-youtube-transcript">youtube-transcript MCP</a>), I 
was able to build this project easily using the Model Context Protocol SDK.</p><h3>Why the GIS Data Conversion MCP is useful</h3><p>The GIS Data Conversion MCP is useful because it allows the LLM to run accurate GIS data conversions.</p><p>This saves time for users, who previously had to:</p><ul><li>Write their own data conversion scripts</li><li>Rely on an LLM to convert the data from its training knowledge alone, which is not accurate</li></ul><p>This has never been done before.</p><p>Here’s an example of running the GIS Data Conversion MCP on Claude Desktop.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*R6RuFK-rcsZac9c3V70R8A.png" /></figure><p>As the example shows, when a user asks it to reverse-geocode longitude and latitude data, it responds with the exact address of the location.</p><h3>Installation</h3><p>Refer to the installation instructions on the <a href="https://github.com/ronantakizawa/gis-dataconvertersion-mcp">repository</a>.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[A11y MCP: An MCP Server for Web Accessibility Testing]]></title>
            <link>https://medium.com/@ronantech/a11y-mcp-an-mcp-server-for-web-accessibility-testing-e5aeeb322af3?source=rss-fbd6f4eb076e------2</link>
            <guid isPermaLink="false">https://medium.com/p/e5aeeb322af3</guid>
            <category><![CDATA[mcps]]></category>
            <category><![CDATA[a11y]]></category>
            <category><![CDATA[accessibility]]></category>
            <category><![CDATA[model-context-protocol]]></category>
            <dc:creator><![CDATA[Ronan Takizawa]]></dc:creator>
            <pubDate>Sat, 26 Apr 2025 00:34:31 GMT</pubDate>
            <atom:updated>2025-04-26T03:15:41.834Z</atom:updated>
            <content:encoded><![CDATA[<h3>A11y MCP: An MCP Server to run AI-powered Web Accessibility Testing</h3><figure><img alt="Logo for A11y MCP" src="https://cdn-images-1.medium.com/max/1024/1*u-dSPRpILoWMPnhWLvaf1w.png" /></figure><p><a href="https://github.com/ronantakizawa/a11ymcp">Full Repo</a> (Leave a star if you enjoyed the project!) 🌟</p><p>Introducing A11y MCP: an MCP (Model Context Protocol) server that gives LLMs access to web accessibility testing APIs.</p><p>By running this extension on AI platforms (Claude Desktop, Cursor), you can allow your LLM to check your web content against Web Content Accessibility Guidelines (WCAG) standards, which are internationally recognized standards developed by the W3C to make web content more accessible to people with disabilities.</p><p>Here are the features:</p><ul><li><strong>Test web page accessibility via URL: </strong>Test any public URL for accessibility issues using WCAG standards (Checks for A, AA, and AAA compliance).</li><li><strong>Test web page accessibility via HTML snippets:</strong> Test raw HTML strings for accessibility issues using WCAG standards (Checks for A, AA, and AAA compliance).</li><li><strong>Accessible Rich Internet Applications (ARIA) validation: </strong>Test proper usage of HTML ARIA attributes that make web pages accessible to users with disabilities and their assistive technologies.</li><li><strong>Color contrast analysis: </strong>Check color combinations for WCAG compliance.</li><li><strong>Orientation lock detection: </strong>Identify content that forces specific screen orientations.</li></ul><h3>What is MCP?</h3><p>The Model Context Protocol (MCP) is an open standard developed by Anthropic that enables large language models (LLMs) to securely and efficiently interact with external tools, data sources, and services.</p><p>For more on MCP, read <a href="https://logto.medium.com/what-is-mcp-model-context-protocol-and-how-it-works-47d540ee7096">this 
article</a>.</p><h3>How I built A11y MCP</h3><p>This project was built with:</p><ul><li><strong>The Model Context Protocol SDK</strong> (Base framework to run MCP)</li><li><strong>Deque Axe-core API</strong> (Core accessibility testing API)</li><li><strong>Puppeteer</strong> (Runs A11y tests in a headless browser)</li></ul><p>Following the example of other well-known MCPs (like the <a href="https://github.com/kimtaeyoon83/mcp-server-youtube-transcript">youtube-transcript MCP</a>), I was able to build this project easily using the Model Context Protocol SDK.</p><p>Each function takes in parameters such as URLs, raw HTML, and color inputs.</p><p>Depending on the function, Puppeteer is run to either navigate to a URL or load HTML content into a headless browser.</p><p>The page is then passed to the Axe-core library, where it analyzes accessibility issues based on WCAG standards, and returns detailed results that can help identify and fix accessibility problems.</p><p>One issue faced when deploying this project was package version management.</p><p>Since my Claude Desktop runs NodeJS 15, I had to adjust package versions.</p><h3>Why A11y MCP is useful</h3><p>A11y MCP is useful because it allows LLMs to run accurate web accessibility tests through the Axe-core API.</p><p>Without using the A11y MCP, an LLM will give you arbitrary feedback on web accessibility based on its training data, but it cannot run deterministic accessibility tests that will produce the same results every time.</p><p>By using A11y MCP for web accessibility testing, users can run reliable web accessibility tests with the Axe-core API, and autonomously find ways to fix their website issues with the LLM.</p><p>This has never been done before.</p><p>Here’s an example of running the A11y MCP on Claude Desktop.</p><figure><img alt="Demo AI conversation using A11y 
MCP" src="https://cdn-images-1.medium.com/max/1024/1*q0NFXDzLbaAZZtAUrrdzOg.png" /></figure><figure><img alt="Example 2 of running A11y MCP on an AI" src="https://cdn-images-1.medium.com/max/1024/1*CsWHkz-RcPmHbZE1opq8ZQ.png" /></figure><p>As the example shows, when a user sends a chunk of HTML not compliant with WCAG, the LLM runs the A11y MCP to find exact accessibility issues in the webpage, then suggests ways to change the HTML.</p><h3>Installation</h3><p>Refer to the installation instructions on the <a href="https://github.com/ronantakizawa/a11ymcp/tree/main">repository</a>.</p><h3>Future Work</h3><p>MCP servers are headed toward being run and hosted remotely, so that devices don’t have to run them locally and online AI apps can use MCP. The A11y MCP server will hopefully follow suit and be deployed externally.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building a Smart Flood Map of Japan using GIS and Machine Learning]]></title>
            <link>https://medium.com/thedeephub/building-a-smart-flood-map-of-japan-using-gis-and-machine-learning-3303efff9957?source=rss-fbd6f4eb076e------2</link>
            <guid isPermaLink="false">https://medium.com/p/3303efff9957</guid>
            <category><![CDATA[data-science]]></category>
            <category><![CDATA[linear-regression]]></category>
            <category><![CDATA[gis]]></category>
            <dc:creator><![CDATA[Ronan Takizawa]]></dc:creator>
            <pubDate>Tue, 15 Apr 2025 19:51:06 GMT</pubDate>
            <atom:updated>2025-05-08T17:22:26.160Z</atom:updated>
            <content:encoded><![CDATA[<h3>Building a Smart Flood Map of Japan Using GIS and Machine Learning</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wMxmDvzV35WSKOeSIZtwhw.png" /></figure><p><a href="https://github.com/ronantakizawa/floodmapjapan">Full Code (Make sure to leave the Original Repo a Star!) ⭐️</a></p><p>With densely populated coastal areas and a mountainous interior, Japan frequently faces floods from heavy rainfall and typhoons. Floods in Japan have cost up to <a href="https://www.nippon.com/en/japan-data/h00812/?utm_source=chatgpt.com">¥2.1 trillion in damages</a>, and these damages are getting worse due to climate change. Despite this risk, accessible and accurate flood hazard maps remain scarce.</p><p>In this article, I will demonstrate how I built a smart Flood Hazard Map of Japan using GIS data and machine learning.</p><h3>Datasets</h3><p>The datasets used in this project are the following:</p><ol><li><a href="https://hydro.iis.u-tokyo.ac.jp/~yamadai/MERIT_Hydro/">A Hydromap of Japan (University of Tokyo</a>) with features such as:</li></ol><ul><li><strong>Water Directional Flow</strong>: The direction in which surface water is expected to move based on terrain and hydrological modeling (Measured in angular degrees from 0–360°).</li><li><strong>Elevation</strong>: The height of the land surface above sea level (m).</li><li><strong>Upstream Drainage Area (Flow Accumulation Area)</strong>: The total land area that drains into a specific point on the landscape, indicating how much water can potentially flow through it (km²),</li><li><strong>River Channel Width</strong>: The estimated horizontal width of a river at a given location, which affects its capacity to carry floodwaters (m).</li><li><strong>HAND (Height Above Nearest Drainage)</strong>: A terrain metric that represents how high a given point is above the nearest stream or drainage line, helping identify flood-prone lowlands (m).</li></ul><p>The Hydro map is 
stored as a GeoTIFF file, where each GeoTIFF file is a grid of cells (pixels) and each cell contains these key components:</p><ul><li><strong>Raster Values</strong>: The actual measurement data for each pixel (Elevation, HAND…)</li><li><strong>Geospatial Metadata</strong>: Information about the coordinate system, geographic extent, and projection</li><li><strong>Transformation Parameters</strong>: Mathematical parameters that relate pixel coordinates to real-world coordinates</li></ul><p>2. <a href="https://search.diasjp.net/en/dataset/Flooddamage_GIS">GIS data</a> from the Japanese National Research Institute for Earth Science and Disaster Prevention, covering floods in Japan from 1961 to 2008, with information such as:</p><ul><li>Infrastructure damage</li><li>Transportation Damage</li><li>Water / Flood Impacts (Flood area, flood depth)</li></ul><h3>Building the Simplified Flood Risk Model</h3><p>To build a smart Flood Hazard Map of Japan, we first need to build a model that iterates through the Hydromap of Japan and assigns a risk score based on features that suggest high flood risk.</p><p>The model tracks the 5 variables in the hydromap:</p><ul><li>Water Directional Flow</li><li>Elevation</li><li>Upstream Drainage Area (Flow Accumulation Area)</li><li>River Channel Width</li><li>HAND (Height Above Nearest Drainage)</li></ul><p>For each variable, we transform its values into a standardized risk score from 0 to 1.</p><pre># Assumes the setup earlier in the full script, e.g.:<br>#   import numpy as np<br>#   from sklearn.preprocessing import MinMaxScaler<br>#   scaler = MinMaxScaler()  # scales values to the 0-1 range<br>#   land_mask / land_indices mark land pixels (elevation &gt; 0)<br><br># Apply land mask and ensure only positive upstream area values are considered<br>upstream_norm = np.where((upstream_area &gt; 0) &amp; land_mask, upstream_area, 0)  # Keep only positive upstream area values on land, set everything else to 0<br><br># Normalize the transformed upstream area values (land areas only)<br>upstream_norm_reshaped = upstream_norm.reshape(-1, 1)  # Reshape the 2D array to a column vector for the scaler function<br><br># Only proceed with normalization if land pixels exist in the dataset<br>if len(land_indices) &gt; 0:  # Check if there are any land pixels to normalize<br>    upstream_norm_reshaped[land_indices] = scaler.fit_transform(upstream_norm_reshaped[land_indices])  # Scale land values to 0-1 range<br><br># Reshape back to original dimensions<br>upstream_norm = upstream_norm_reshaped.reshape(upstream_area.shape)  # Convert column vector back to original 2D array shape</pre><p>An important detail in the model is to score any location with elevation &lt;0 as NaN so the final map excludes the ocean.</p><pre><br>flood_risk = np.where(land_mask, <br>    0.2 * elevation_norm +             <br>    0.35 * hand_norm +                <br>    0.25 * upstream_norm +             <br>    0.1 * flow_convergence_norm +    <br>    0.1 * river_influence_norm,<br>    np.nan)  # Use NaN for ocean areas</pre><h3>Using machine learning to get smarter heuristics</h3><p>A critical issue with the current implementation is that the weights used to score flood risk zones (the heuristics) are arbitrary and not fine-tuned based on flood history in Japan.</p><pre><br>flood_risk = np.where(land_mask, <br>    0.2 * elevation_norm +  # These are arbitrary weights            <br>    0.35 * hand_norm +      # These are arbitrary weights<br>    0.25 * upstream_norm +  # These are arbitrary weights           <br>    0.1 * flow_convergence_norm + # These are arbitrary weights   <br>    0.1 * river_influence_norm,   # These are arbitrary weights<br>    np.nan)  # Use NaN for ocean areas</pre><p>To derive more intelligent heuristics, we use the second dataset and a Linear Regression model to evaluate which variables correlate most strongly with historical flood damage.</p><p>Linear Regression is particularly well-suited for this task because it provides feature importance scores that can be directly translated into weights for flood risk assessment.</p><p>Read 
this for more on <a href="https://medium.com/analytics-vidhya/everything-you-need-to-know-about-linear-regression-750a69a0ea50">Linear Regression</a>.</p><p>To see which variables correlate most strongly with historical flood damage, we got the locations of historical floods and recorded their elevation, HAND, upstream, flow direction, and river width values.</p><pre>def extract_features_at_flood_points(flood_df, elevation, flow_direction, hand, upstream_area, river_width, transform):<br>    &quot;&quot;&quot;Extract feature values at each flood location.&quot;&quot;&quot;<br>    # Calculate flow convergence<br>    from scipy import ndimage<br>    flow_conv = ndimage.generic_filter(flow_direction, lambda x: len(np.unique(x)), size=3)<br>    <br>    # Create dataframe for valid points<br>    valid_data = []<br>    <br>    for _, row in flood_df.iterrows():<br>        try:<br>            # Get pixel coordinates for this lat/lon<br>            row_idx, col_idx = rowcol(transform, row[&#39;longitude&#39;], row[&#39;latitude&#39;])<br>            <br>            # Check if coordinates are within bounds and on land<br>            if (0 &lt;= row_idx &lt; elevation.shape[0] and <br>                0 &lt;= col_idx &lt; elevation.shape[1] and<br>                elevation[row_idx, col_idx] &gt; 0):  # Land mask<br>                <br>                # Extract feature values<br>                elev = elevation[row_idx, col_idx]<br>                hand_val = max(0, hand[row_idx, col_idx])  # Replace negative with 0<br>                upstream = upstream_area[row_idx, col_idx]<br>                flow_convergence = flow_conv[row_idx, col_idx]<br>                # Extract river width, handle possible NaN values<br>                river_w = river_width[row_idx, col_idx]<br>                if np.isnan(river_w):<br>                    river_w = 0  # Set NaN river width to 0 (no river)<br>                flood_area = row[&#39;flood_area_km2&#39;]<br>                <br>               
 valid_data.append({<br>                    &#39;elevation&#39;: -elev,  # Negate elevation (lower = higher risk)<br>                    &#39;hand&#39;: hand_val,<br>                    &#39;upstream_area&#39;: upstream,<br>                    &#39;flow_convergence&#39;: flow_convergence,<br>                    &#39;river_width&#39;: river_w,<br>                    &#39;flood_area_km2&#39;: flood_area<br>                })<br>        except Exception:<br>            continue<br>    <br>    return pd.DataFrame(valid_data)</pre><p>Once we have those values, I ran a Linear Regression model by setting flood area as the target variable to see which variables contribute most to large flood areas.</p><p>The following is what the Linear Regression model looked like:</p><pre>def train_models_and_get_weights(feature_df):<br>    &quot;&quot;&quot;Train linear regression model and output weights with a visualization.&quot;&quot;&quot;<br>    # Prepare features and target<br>    X = feature_df[[&#39;elevation&#39;, &#39;hand&#39;, &#39;upstream_area&#39;, &#39;flow_convergence&#39;, &#39;river_width&#39;]]<br>    y = feature_df[&#39;flood_area_km2&#39;]<br>    <br>    # Scale features<br>    scaler = StandardScaler()<br>    X_scaled = scaler.fit_transform(X)<br>    X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)<br>    <br>    # Split data<br>    X_train, X_test, y_train, y_test = train_test_split(<br>        X_scaled_df, y, test_size=0.2, random_state=42<br>    )<br>    <br>    # Initialize model<br>    model = LinearRegression()<br>    <br>    print(&quot;\nMODEL WEIGHTS AND PERFORMANCE METRICS:&quot;)<br>    print(&quot;--------------------------------------&quot;)<br>    <br>    # Fit the model<br>    model.fit(X_train, y_train)<br>    <br>    # Cross-validation<br>    cv_scores = cross_val_score(model, X_scaled_df, y, cv=5, scoring=&#39;r2&#39;)<br>    print(f&quot;Cross-validation R² scores: {cv_scores}&quot;)<br>    print(f&quot;Mean CV R²: 
{np.mean(cv_scores):.4f}&quot;)<br>    <br>    # Test set performance<br>    y_pred = model.predict(X_test)<br>    test_r2 = r2_score(y_test, y_pred)<br>    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))<br>    <br>    print(f&quot;Test set R²: {test_r2:.4f}&quot;)<br>    print(f&quot;Test set RMSE: {test_rmse:.4f}&quot;)<br>    <br>    # Extract coefficients<br>    coefs = model.coef_<br>    <br>    # Get normalized weights (absolute values, sum to 1)<br>    abs_coefs = np.abs(coefs)<br>    normalized_weights = abs_coefs / np.sum(abs_coefs)<br>    <br>    # Display weights<br>    print(f&quot;\nFeature coefficients:&quot;)<br>    for feature, weight, norm_weight in zip(X.columns, coefs, normalized_weights):<br>        print(f&quot;  {feature}: raw={weight:.4f}, normalized={norm_weight:.4f}&quot;)<br>    <br>    # Return the normalized weights for use as the new risk-model weights<br>    return normalized_weights</pre><p>After running the model, these were the resulting weights:</p><pre>elevation: 0.5819<br>hand: 0.3723<br>upstream: 0.0304<br>flow_conv: 0.0154<br>river_width: 0.0000</pre><p>For this dataset, the elevation and HAND of the terrain were the most significant factors for flood risk.</p><p>This makes sense: lower-lying areas are naturally more prone to inundation because water flows downhill under gravity, and areas with low HAND values sit closer to drainage networks, so they flood more easily when those channels overflow.</p><h3>The Final Flood Risk Map</h3><p>Using the final risk scores, Matplotlib creates a flood hazard map of Japan.</p><pre>def visualize_flood_risk(risk_array, output_path, transform, crs):<br>    &quot;&quot;&quot;Visualize the flood risk map and save to file with ocean areas colored black.&quot;&quot;&quot;<br>    # Create a figure and axes<br>    fig, ax = plt.subplots(figsize=(12, 10))<br>    <br>    # Define a set of distinct colors for small ranges<br>    colors = [<br>        &#39;#000080&#39;,  # Navy (0.0-0.1)<br>        &#39;#0000FF&#39;,  # Blue (0.1-0.2)<br>        &#39;#00FFFF&#39;,  # 
Cyan (0.2-0.3)<br>        &#39;#008000&#39;,  # Green (0.3-0.4)<br>        &#39;#ADFF2F&#39;,  # GreenYellow (0.4-0.5)<br>        &#39;#FFFF00&#39;,  # Yellow (0.5-0.6)<br>        &#39;#FFA500&#39;,  # Orange (0.6-0.7)<br>        &#39;#FF0000&#39;,  # Red (0.7-0.8)<br>        &#39;#800000&#39;,  # Maroon (0.8-0.9)<br>        &#39;#FF00FF&#39;,  # Magenta (0.9-1.0)<br>    ]<br>    <br>    # Create custom colormap with 10 distinct bands<br>    cmap = LinearSegmentedColormap.from_list(&#39;high_contrast&#39;, colors, N=10)<br>    cmap.set_bad(&#39;black&#39;)  # Set NaN values to black<br>    <br>    # Create boundaries for distinct color bands<br>    bounds = np.linspace(0, 1, 11)  # 11 boundaries for 10 distinct ranges<br>    norm = BoundaryNorm(bounds, cmap.N)<br>    <br>    # Display the risk array with the custom colormap<br>    img = ax.imshow(risk_array, cmap=cmap, norm=norm)<br>    <br>    # Add a colorbar with tick marks at the boundaries<br>    cbar = plt.colorbar(img, ax=ax, ticks=bounds)<br>    cbar.set_label(&#39;Flood Hazard Index (0-1)&#39;)<br>    <br>    # Format the tick labels to show ranges<br>    tick_labels = [f&quot;{bounds[i]:.1f}-{bounds[i+1]:.1f}&quot; for i in range(len(bounds)-1)]<br>    # Add an empty string at the beginning for proper alignment<br>    cbar.set_ticklabels([&#39;&#39;] + tick_labels)<br>    <br>    # Add a title to the plot<br>    plt.title(&#39;Flood Hazard Map of Japan&#39;, fontsize=18, fontweight=&#39;bold&#39;)<br>    # Turn off the axis labels and ticks<br>    plt.axis(&#39;off&#39;)<br>    <br>    # Save the visualization as a PNG file<br>    plt.savefig(os.path.join(os.path.dirname(output_path), &#39;flood_hazard_visualization.png&#39;), dpi=300, bbox_inches=&#39;tight&#39;)<br>    <br>    # Save the risk data as a GeoTIFF file for GIS applications<br>    with rasterio.open(output_path, &#39;w&#39;,<br>                      driver=&#39;GTiff&#39;,          # Use GeoTIFF format<br>                      
height=risk_array.shape[0],  # Set height from input array<br>                      width=risk_array.shape[1],   # Set width from input array<br>                      count=1,                 # One band/channel<br>                      dtype=rasterio.float32,  # Use float32 to support NaN values<br>                      crs=crs,                 # Set coordinate reference system<br>                      transform=transform,     # Set geospatial transform<br>                      nodata=np.nan) as dst:   # Set NaN as the nodata value<br>        # Write the risk array to the first band<br>        dst.write(risk_array.astype(rasterio.float32), 1)<br>    <br>    # Close the plot to free memory<br>    plt.close()<br>    # Return the path to the created GeoTIFF file<br>    return output_path</pre><p>The resulting flood risk map shows a 0–1 scale where higher values (red) indicate greater flood risk:</p><ul><li><strong>Blue-Green areas (0–0.4)</strong>: Low flood risk</li><li><strong>Yellow-Orange areas</strong> (0.4–0.6): Moderate flood risk</li><li><strong>Red-Purple areas</strong> (0.6+): High flood risk</li></ul><p>Ocean areas are masked in black to focus exclusively on land-based flood risk.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wMxmDvzV35WSKOeSIZtwhw.png" /></figure><p>The map reveals several key high-risk zones, visible as regions dominated by the high-risk colors:</p><ol><li><strong>Tokyo and Kanto Plain</strong>: Densely populated, highly urbanized areas with extensive low-lying floodplains</li><li><strong>Osaka/Kyoto/Kobe region</strong>: Coastal lowlands and historical flood plains</li><li><strong>Hokkaido</strong>: Large river basins with flat terrain and cold-climate runoff characteristics</li></ol><p>When compared to official hazard maps from Japan’s Ministry of Land, Infrastructure, Transport and Tourism, our ML-derived map shows remarkable consistency, validating the approach.</p><p>Here is the flood hazard map with real past flood locations mapped. As the map shows, most past flood events happened in purple regions, indicating that the model accurately predicts high-risk regions.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*TobmNCQy7034MW-ytIYQ-Q.png" /></figure><h3>Potential Errors</h3><p>The biggest error that could misrepresent flood risks is that the second dataset only covered floods in the Kanto region.</p><p>While the dataset represented each risk variable’s importance to flood area accurately, a more accurate flood risk model would use a dataset of historic floods in all of Japan.</p><p>Another issue is the accuracy of the Linear Regression model.</p><p>The Linear Regression model used in this project had a low R² value of 0.0942, indicating that the model explains only about 9.42% of the variance in flood area. While a linear regression model with low R² values can still provide valuable weights for the final risk model, a stronger R² value is preferred.</p><h3>Conclusion</h3><p>This project demonstrated how you can use GIS data and machine learning to develop a hazard map. For further progress in this project, consider applying larger datasets and incorporating other flood factors, such as urbanization and population data, into the model.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3303efff9957" width="1" height="1" alt=""><hr><p><a href="https://medium.com/thedeephub/building-a-smart-flood-map-of-japan-using-gis-and-machine-learning-3303efff9957">Building a Smart Flood Map of Japan using GIS and Machine Learning</a> was originally published in <a href="https://medium.com/thedeephub">The Deep Hub</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The In-Depth Math Behind Bulletproofs in ZKPs]]></title>
            <link>https://medium.com/@ronantech/the-in-depth-math-behind-bulletproofs-in-zkps-4eb7303eb3b3?source=rss-fbd6f4eb076e------2</link>
            <guid isPermaLink="false">https://medium.com/p/4eb7303eb3b3</guid>
            <category><![CDATA[bulletproof]]></category>
            <category><![CDATA[zero-knowledge]]></category>
            <category><![CDATA[zero-knowledge-proofs]]></category>
            <dc:creator><![CDATA[Ronan Takizawa]]></dc:creator>
            <pubDate>Sun, 23 Mar 2025 01:50:02 GMT</pubDate>
            <atom:updated>2025-03-23T01:50:02.668Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*M1_5cYvkxlBNyFWE5I7BpA.png" /></figure><p>Bulletproofs are a compact solution for zero-knowledge range proofs — proving that a value lies within a specific range without revealing the value itself. Unlike zk‑SNARKs and KZG polynomial commitments, their avoidance of arithmetic circuits and elimination of a trusted setup make bulletproofs particularly well-suited for applications in need of small and fast ZKPs.</p><p>In this article, I’ll dive deep into how bulletproofs work under the hood mathematically, based on their original implementation (<em>Bünz et al.</em>).</p><h3>What Are Bulletproofs?</h3><p>Bulletproofs, introduced by <a href="https://eprint.iacr.org/2017/1066.pdf"><em>Bünz et al.</em></a> in 2017, are a type of non-interactive zero-knowledge proof system with several attractive properties:</p><ol><li><strong>No Trusted Setup</strong>: They don’t require a trusted setup phase, a computationally heavy process of generating the keys used to create and verify proofs (as in zk-SNARKs). The lack of a trusted setup can make bulletproofs more secure and easier to deploy</li><li><strong>No Need for Arithmetic Circuits: </strong>Unlike many ZK systems that rely on converting computations into complex arithmetic circuits, bulletproofs use a clever combination of the inner product argument and the Pedersen commitment scheme to create range proofs without circuit transformation. 
This direct algebraic approach means developers don’t need to use R1CS, a complex constraint-representation format required by many other ZK systems</li></ol><p>With these properties in mind, bulletproofs are used in projects such as <a href="https://web.getmonero.org/resources/moneropedia/bulletproofs.html">Monero</a> to enable confidential transactions and significantly reduce transaction sizes.</p><p>At its core, bulletproofs solve the following problem:</p><p><em>How can one prove that a secret value </em><strong><em>v</em></strong><em> lies in the range [0,2^n) without revealing the value itself?</em></p><p>To do this, we will first explain how you can securely generate a proof about <strong>v</strong> in a zero-knowledge manner, and then explain how to make that proof verifiable.</p><h3>How Zero-Knowledge Proofs Work</h3><p>Before diving into the mathematics of bulletproofs, it’s important to understand the general structure of zero-knowledge proof systems.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/640/0*YYpOg-hXy0oILJGW.png" /></figure><h4>Key Concepts</h4><ol><li><strong>The Prover</strong>: This party has secret information <strong>v</strong> and wants to convince others that a statement about this information is true (such as “v is in range [0, 2^n)”) without revealing the secret itself</li><li><strong>The Verifier</strong>: This party wants to be convinced that the prover’s statement is true, without learning the secret information</li><li><strong>The Protocol</strong>: A set of cryptographic interactions between the prover and verifier that leads to the verifier being convinced (or not) of the statement’s truth</li></ol><h4>The 3 Properties of ZK Proofs</h4><p>All effective zero-knowledge proof systems, including bulletproofs, must satisfy three key properties:</p><ol><li><strong>Completeness</strong>: If the statement is true and both parties follow the protocol, the verifier will be 
convinced</li><li><strong>Soundness</strong>: If the statement is false, no cheating prover can convince the verifier the statement is true, except with negligible probability</li><li><strong>Zero-Knowledge</strong>: The verifier never sees the secret directly and only knows the truth of the statement being proved about the secret</li></ol><h4>Challenge Variables</h4><p>To uphold the properties described above, we use <strong>challenges</strong>: random values injected into the protocol to ensure the proof holds up under adversarial choices.</p><p>As you will see later, without challenges, malicious provers can brute-force the protocol and validate an invalid proof by exploiting edge cases.</p><h4>Interactive / Non-Interactive Protocols</h4><p>In cryptographic protocols, there are <strong>interactive</strong> and <strong>non-interactive protocols </strong>to exchange challenges.</p><p>In an <strong>interactive protocol</strong>, the verifier and prover must exchange multiple messages in sequence:</p><ul><li>The prover makes initial commitments based on their secret</li><li>The verifier sends random challenges to check the validity of the ZK proof</li><li>The prover responds with proofs tailored to these challenges to show the proof holds</li></ul><p>In a <strong>non-interactive protocol</strong> (like bulletproofs):</p><ul><li>The prover can run the entire proof generation in a single step with random challenges woven into the proof</li><li>Each crucial step of a proof introduces a new challenge that is generated by using a hash function that takes in variables from the proof up to that point</li><li>The prover will then express their secret <strong>v</strong> for the verifier by writing a complex polynomial expression including the challenge variables and secret <strong>v</strong>, called a <strong>commitment</strong></li><li>Commitments allow the verifier to apply complex algebraic and homomorphic calculations to the 
prover’s commitments. This verifies that secret <strong>v</strong> withstands the proof and challenges without the verifier ever knowing the value v</li><li>Using the commitments from the prover, the verifier can verify that the secret <strong>v</strong> withstands the proof and challenges in a single step, without needing to send challenges and receive responses from the prover</li></ul><p>In bulletproofs, we use three challenges (y, z, and x), each generated using the Fiat-Shamir heuristic.</p><p>When using the Fiat-Shamir heuristic, each challenge is generated from a hash function which takes in all existing proof variables up to that point (previous challenges, vectors used in the proof, and more).</p><p>This makes every challenge depend on the ones before it, so a prover cannot change any part of their earlier commitments without invalidating subsequent challenges.</p><p>Without hash-derived challenges, a malicious prover could work backward from a desired outcome, crafting different responses for different challenges. 
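</p><p><em>As a concrete illustration of the non-interactive flow described above, here is a minimal Python sketch of Fiat-Shamir challenge derivation. It is not a real bulletproof implementation: the transcript encoding and the modulus Q are assumptions chosen for demonstration.</em></p>

```python
import hashlib

# Hypothetical modulus standing in for the order of a real elliptic curve group.
Q = 2**255 - 19  # assumption, for illustration only

def fiat_shamir_challenge(*transcript: bytes) -> int:
    """Derive a challenge by hashing every message sent so far."""
    h = hashlib.sha256()
    for part in transcript:
        h.update(len(part).to_bytes(8, "big"))  # length-prefix each message
        h.update(part)
    return int.from_bytes(h.digest(), "big") % Q

# Each challenge hashes all prior commitments AND prior challenges, so changing
# any earlier message changes every later challenge.
C = b"pedersen-commitment-bytes"  # placeholder for the serialized commitment
y = fiat_shamir_challenge(C)
z = fiat_shamir_challenge(C, y.to_bytes(32, "big"))
x = fiat_shamir_challenge(C, y.to_bytes(32, "big"), z.to_bytes(32, "big"))
```

<p><em>If the prover tampers with C after the fact, y, z, and x all change, invalidating any responses already fixed to the old challenges.</em></p><p>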
But with hashes, such backtracking is not possible.</p><h3>Bulletproof Overview</h3><p>The bulletproof setup is an extremely complicated process, so please refer to this table of contents to keep track of what happens during the bulletproof protocol.</p><p><strong>Part 1: Bulletproof System Setup</strong></p><ul><li>Choosing an elliptic curve, generator points, and vector generation parameters</li></ul><p><strong>Part 2: Creating the Original Commitment</strong></p><ul><li>Creating a Pedersen commitment to hide the secret value v while still allowing verification that it lies within the required range</li></ul><p><strong>Part 3: Building the Mathematical Foundation for Verifying the Proof</strong></p><ul><li>Breaking down the range verification into bit decomposition and creating mathematical constraints that ensure each bit is valid and the bits sum to the original value</li></ul><p><strong>Part 4: Forming the Polynomial Constraint</strong></p><ul><li>Converting the bit decomposition and range constraints into a polynomial equation with challenge variables</li></ul><p><strong>Part 5: The Inner Product Argument for Proof Compression</strong></p><ul><li>Using recursive vector splitting and compression techniques to reduce proof size from linear to logarithmic complexity</li></ul><p><strong>Part 6: Proof Verification</strong></p><ul><li>The process through which the verifier uses the provided commitments, challenges, and inner product elements to confirm the proof’s validity without learning the secret value</li></ul><h3>Part 1: Bulletproof System Setup</h3><p>Before the prover starts generating their proof, both parties agree on the parameters of the bulletproof protocol.</p><p>They will first choose which elliptic curve to use.</p><p>Elliptic curves are algebraic curves of the form y² = x³ + ax + b, used in cryptography because operations on their points are easy to compute but computationally infeasible to reverse.</p><p>We use elliptic curves in cryptographic systems to generate coefficients for our 
commitments that have no mathematical relationships between them.</p><p>This lack of mathematical relationships between coefficients is a “nothing-up-my-sleeve” approach that prevents anyone from generating backdoors in the protocol where a prover can secretly validate an invalid proof or a verifier can figure out the secret <strong>v.</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XQF7AmJWKX-8LdOTeZLuYg.png" /><figcaption>Sample Elliptic Curve</figcaption></figure><p>Then, using that elliptic curve, both the prover and verifier will agree on these values that will be used to make commitments:</p><ul><li>Elliptic curve points <strong>g</strong> and <strong>h</strong> used for the original commitment</li><li>The bit length <strong>n</strong> that defines the range [0, 2^n) being proven</li><li>Vectors g = (g₀, g₁, …, gₙ₋₁) and h = (h₀, h₁, …, hₙ₋₁) of n distinct curve points used specifically for the inner product argument.</li></ul><p><strong>g</strong> and <strong>h</strong> are curve points related by some scalar k such that <strong>h = k·g</strong>, where k is unknown to everyone. The properties of elliptic curves make it computationally infeasible to calculate k (the discrete logarithm problem), which in turn makes it computationally infeasible to reverse engineer the secret <strong>v</strong> when <strong>g</strong> and <strong>h</strong> are used in the original commitment.</p><p>Vectors <strong>g</strong> and <strong>h</strong> are generated by picking 2 points on the elliptic curve (not the points used earlier for the original commitment) and a public seed value that both the prover and verifier agree upon, then running a hash function over the 2 points, the seed value, and each index. 
The resulting value will then be mapped to an x-coordinate, and the corresponding y value will be part of the vector.</p><p>For example, if the 2 points picked from the elliptic curve are “g” and “h” like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/570/1*KHiuTtwoB-jKlDjS1rugzA.png" /></figure><h3>Part 2: Creating the Original Commitment</h3><p>Now we will get into the in-depth math behind bulletproofs.</p><p>To make a range proof verification that is zero-knowledge, we use Pedersen Commitments to hide the actual value like a locked box.</p><p>When the proof is sent to the verifier, the verifier can check properties of what’s inside the box (that the value is within range) without ever seeing the value itself, and the prover can’t swap what’s in the box after creating it.</p><p>Bulletproofs use Pedersen commitments like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/492/1*KBvyGUZ8OuTHD1t8z2i9Xw.png" /></figure><ul><li><strong>g</strong> and <strong>h</strong> are the points agreed upon by the prover and verifier to use for the bulletproof</li><li><strong>r</strong> is a randomly generated scalar from the elliptic curve being used that further obscures the commitment. Without this blinding factor, the commitment could reveal <strong>v</strong></li></ul><p>Commitment <strong>C</strong> and its variables will be used later to generate challenges that prove information about v, and to generate other commitments.</p><h3>Part 3: Building the Mathematical Foundation for Verifying the Proof</h3><p>To prove that a value v lies in a range, we can use the <strong>Bit Check Trick</strong> and decompose v into its binary representation. 
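</p><p><em>The decomposition step itself is simple to sketch in Python (illustrative only; the function names are hypothetical, not part of any bulletproof library):</em></p>

```python
def bits_le(v: int, n: int) -> list:
    """Little-endian bit decomposition: v = a0*2^0 + a1*2^1 + ... + a(n-1)*2^(n-1)."""
    if not 0 <= v < 2**n:
        raise ValueError("v is outside the range [0, 2**n)")
    return [(v >> i) & 1 for i in range(n)]

def recombine(bits) -> int:
    """Inverse of bits_le: weight each bit by its power of two and sum."""
    return sum(b << i for i, b in enumerate(bits))

a = bits_le(13, 4)            # [1, 0, 1, 1], i.e. 13 = 1 + 4 + 8
assert recombine(a) == 13
assert all(b in (0, 1) for b in a)
```

<p><em>An n-bit decomposition exists exactly when v lies in [0, 2^n), which is why proving the decomposition is valid proves the range claim.</em></p><p>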
Using this binary representation, we prove that each bit is either 0 or 1, such that these bits correctly combine to form v:</p><p>v = a₀ x 2⁰ + a₁ x 2¹…+ aₙ₋₁ x 2^(n-1) such that each aᵢ ∈ {0, 1}.</p><p>For example, if v = 13 and n = 4 (we’re using 4 bits), then:</p><ul><li>a₀ = 1, a₁ = 0, a₂ = 1, a₃ = 1</li></ul><p>and:</p><ul><li>13 = 1 x 2⁰ + 0 x 2¹ + 1 x 2² + 1 x 2³</li></ul><h4>The Bit Check Trick</h4><p>The clever insight in bulletproofs is to use the bit check trick to ensure that the committed value <strong>v</strong> lies within the specified range.</p><p>In bulletproofs, we need to verify that our committed value <strong>v</strong> is represented by exactly <strong>n</strong> valid bits (each 0 or 1).</p><p>If <strong>v</strong> cannot be represented as an <strong>n</strong>-bit number with every aᵢ equal to 0 or 1, then we can’t validly prove that v is within the range [0, 2^n).</p><p>Imagine a prover wants to prove their committed value <strong>v</strong> is in the range [0, 2⁴), but their actual value is v = 20, which is out of range.</p><p>If we didn’t verify bits properly, the prover could try to represent 20 as:</p><p>a₀ = 0, a₁ = -2, a₂ = 6, a₃ = 0</p><p>Calculating: 0×2⁰ + (-2)×2¹ + 6×2² + 0×2³ = 0 + (-4) + 24 + 0 = 20</p><p>Without the strict rule that <strong>v</strong> be represented only by valid bits (0 or 1) across <strong>n</strong> bits, the prover could satisfy an invalid proof, so we enforce exactly that rule.</p><p>This strict rule can be checked quickly using the Bit Check Trick:</p><p><strong>a value aᵢ is a bit (0 or 1) if and only if aᵢ(aᵢ-1) = 0</strong></p><h4>Representing the Value Checking as Constraints</h4><p>The two constraints below capture the process described above:</p><p><strong>Constraint 1:</strong> Value 
Representation</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/392/1*Fs52vUm_EQPC14m_4ltOZw.png" /></figure><p>This constraint checks that all bits combine to form <strong>v</strong>, where:</p><ul><li><strong>Vector aL</strong> is the commitment value v represented in binary form</li><li><strong>2^n</strong> is the vector of powers of two (2⁰, 2¹, …, 2^(n-1))</li></ul><p>Then take the inner product (〈〉) of <strong>vector aL</strong> with <strong>2^n</strong>, multiplying corresponding elements and summing them, to verify that the output is v:</p><ul><li>a₀⋅2⁰ + a₁⋅2¹ + … + aₙ₋₁⋅2^(n-1)</li></ul><p><strong>Constraint 2: Bit Validity</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/214/1*iuGKq0SMQLCMgGQ65DDBBA.png" /></figure><p>This constraint combines all our bit checks into a single check, where:</p><ul><li><strong>Vector aL</strong> is the commitment value v represented in binary form</li><li><strong>Vector aR</strong> is vector aL with each bit minus 1. This is used to perform the Bit Check Trick (<strong>aᵢ ⋅ (aᵢ-1) = 0</strong>).</li><li><strong>Challenge y</strong> is a randomly generated value; y^n is the vector of its powers (y⁰, y¹, …, y^(n-1))</li></ul><p>We check that <strong>vector aR</strong> is valid by taking its element-wise multiplication (◦) with <strong>y^n</strong>, multiplying corresponding elements to create one new vector:</p><ul><li>((a₀-1)⋅y⁰, (a₁-1)⋅y¹, …, (aₙ₋₁-1)⋅y^(n-1))</li></ul><p>Then take the inner product (〈〉) of the new vector with <strong>vector aL</strong>, multiplying corresponding elements and summing them, to get a single number and verify that it equals 0:</p><ul><li>a₀(a₀-1)⋅y⁰ + a₁(a₁-1)⋅y¹ + … + aₙ₋₁(aₙ₋₁-1)⋅y^(n-1)</li></ul><h4>How Challenge y improves the Proof Security</h4><p>It is important to have <strong>y^n</strong> to check for invalid proofs. 
Using <strong>y^n</strong>, we now go from the bit checking shown earlier:</p><p>a₀(a₀-1) + a₁(a₁-1)+…</p><p>to:</p><p>a₀(a₀-1)y⁰ + a₁(a₁-1)y¹+…</p><p>The addition of a random <strong>challenge y</strong> makes it so that a malicious prover can’t brute-force their way into calculating valid values of a.</p><p>For example, imagine a malicious prover who wants to pass the bit check using values that are not bits.</p><p>Without the challenge, non-bit values can satisfy the unweighted check:</p><p>If a₀ = 3/2 and a₁ = a₂ = a₃ = 1/2, then:</p><p>(3/2)×(3/2-1) + (1/2)×(1/2-1) + (1/2)×(1/2-1) + (1/2)×(1/2-1) = 3/4 - 1/4 - 1/4 - 1/4 = 0</p><p>To avoid this, we introduce <strong>challenge y</strong>:</p><p>Assume challenge y = 7, then:</p><p>(3/4)×1 + (-1/4)×7 + (-1/4)×49 + (-1/4)×343 = 3/4 - 7/4 - 49/4 - 343/4 = -99 ≠ 0</p><p>Because the bit vector is committed before <strong>y</strong> is revealed, the prover cannot choose values after seeing the challenge, so with overwhelming probability the weighted sum is 0 only when every value is a bit.</p><p>The <strong>challenge y</strong> thus forces the bits to be either 1 or 0, in which case every term vanishes (here for a₀ = 0, a₁ = 1, a₂ = 0, a₃ = 1):</p><p>0×(-1)×1 + (1)×(0)×7 + (0)×(-1)×49 + (1)×(0)×343 = 0</p><h4>Generating Challenge y using the Hash of the Original Commitment</h4><p>If we recall from before, challenges in bulletproofs are generated from the Fiat-Shamir heuristic, which uses a cryptographic hash function to derive challenge values.</p><p>Challenge <strong>y</strong> is computed as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/510/1*hBdogcnpFJV17tu2G-WcXg.png" /></figure><ul><li><strong>The Hash function</strong> is a secure hash function (like SHA-256)</li><li>The Hash function takes in all variables from the original commitment (<strong>C</strong>, <strong>g</strong>, <strong>h</strong>, the range <strong>n</strong>)</li><li><strong>q</strong> is the size of the number space we’re working in for the elliptic curve used for the original commitment (generally around 2²⁵⁶). 
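</li></ul><p>The challenge derivation and the weighted bit check can be sketched in Python. This is illustrative only: a real implementation would hash canonical encodings of the curve points C, g, h and reduce by the curve’s group order, while the string encoding and modulus below are demo choices.</p>

```python
import hashlib

# Demo Fiat-Shamir challenge derivation plus the weighted bit check.
q = 2**255 - 19  # stand-in modulus of roughly the right size

def challenge(*transcript) -> int:
    """Hash the transcript parts and reduce mod q (toy serialization)."""
    h = hashlib.sha256()
    for part in transcript:
        h.update(str(part).encode())
    return int.from_bytes(h.digest(), "big") % q

def weighted_bit_check(bits, y):
    """Sum of a_i*(a_i - 1)*y^i mod q; zero for honest bit vectors."""
    return sum(a * (a - 1) * pow(y, i, q) for i, a in enumerate(bits)) % q

y = challenge("C", "g", "h", 4)      # stands in for y = Hash(C, g, h, n) mod q
assert weighted_bit_check([1, 0, 1, 1], y) == 0   # honest bits always pass
assert weighted_bit_check([2, 0, 0, 0], y) != 0   # non-bit values are caught
```

<ul><li>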
The <strong>mod q</strong> operation ensures the challenge value fits within our cryptographic system</li></ul><p>Through this hash function, <strong>challenge y</strong> is generated in a way such that the probability of finding a set of invalid bits that would pass an invalid verification is approximately 1/2²⁵⁶, which is very low.</p><h3>Part 4: Forming the Polynomial Constraint</h3><p>We will now combine these two constraints into a single polynomial identity.</p><p>Performing both checks in a single equation guarantees that the prover can only send one set of bits that satisfies both constraints.</p><p>To do this, we introduce <strong>challenge z</strong>, such that our new combined constraint is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/646/1*xO8nk7xB0LGVV686qxVK1g.png" /></figure><p>Previously, <strong>aR</strong> was element-wise multiplied (◦) directly with <strong>y^n</strong>; now, by element-wise multiplying with aR + (z⋅1ⁿ) instead, our bit-checking constraint gets scaled by z.</p><p>Here, <strong>challenge z</strong> acts as a scaling factor that links the value representation and bit validity constraints. By using a single variable to scale <strong>v</strong> and applying it to both the bit validity check (every bit is 0 or 1) and the value representation check (the bits add up to v), we ensure that the prover can only send one set of bits that satisfies both constraints.</p><h4>Generating Challenge z</h4><p>Before we generate challenge z, the prover makes 2 commitments to lock in their values for vectors used in this range proof.</p><p>We will also introduce vectors that will be used in the next step of this proving process to blind the vectors and make them zero-knowledge.</p><p>We introduce two new vectors:</p><ul><li><strong>sL:</strong> A random vector of the same length as aL</li><li><strong>sR:</strong> A random vector of the same length as 
aR</li></ul><p>By introducing these variables and locking them in via commitments, we prevent the prover from adaptively choosing vector values after seeing challenges, while simultaneously hiding the actual vectors from the verifier.</p><p>Now we make 2 commitments:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/596/1*ySB46l8I2502eR3-rK_Mbg.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/628/1*Z-J_7MiLo15BHwdVvCotdQ.png" /></figure><ul><li><strong>g</strong> and <strong>h</strong> are the same points on the elliptic curve</li><li><strong>ρ</strong> and <strong>α</strong> are new randomly generated scalar values from the elliptic curve being used that further obscure the commitments.</li><li>We take the inner product of the vectors with the elliptic curve points here because it allows us to commit to entire vectors with a single elliptic curve point. This inner product approach provides homomorphic properties, meaning mathematical operations on the hidden vectors can be mirrored by operations on their commitments. 
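</li></ul><p>The homomorphic property can be demonstrated with a toy Pedersen commitment over the integers mod a prime, written multiplicatively as C = g^v · h^r mod p. The parameters g, h, p below are made up for the demo; real bulletproofs commit with elliptic curve points:</p>

```python
# Toy Pedersen commitments in a multiplicative group mod p (demo parameters,
# not securely chosen; real bulletproofs use elliptic curve points).
p = 2**127 - 1   # a Mersenne prime, large enough for the demonstration
g, h = 3, 5      # demo "generators"

def commit(v: int, r: int) -> int:
    return pow(g, v, p) * pow(h, r, p) % p

v1, r1 = 13, 777
v2, r2 = 29, 123
# Multiplying two commitments commits to the sums of the values and blinders,
# which is exactly the homomorphic property the verifier relies on:
assert commit(v1, r1) * commit(v2, r2) % p == commit(v1 + v2, r1 + r2)
```

<ul><li>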
This enables the verifier to check relationships between the committed values without seeing the actual vectors.</li></ul><p>The hash function to generate challenge <strong>z</strong> is now the following:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/500/1*JEF3l4_74V12bhpjBdQ6qw.png" /></figure><p><strong>Challenge z</strong> is derived by hashing challenge <strong>y</strong> and commitments <strong>A</strong> and <strong>S</strong>, again modding by <strong>q</strong> to ensure the challenge value fits within our cryptographic system while staying hard to reverse.</p><h4>Adding Blinding to the Constraints</h4><p>Now, as stated before, we will use vectors <strong>sL</strong> and <strong>sR</strong> to introduce blinding.</p><p><strong>Blinding</strong> is a method to hide the actual values in our proof while still allowing verification, ensuring that the verifier can confirm the validity of a range proof without learning anything about the specific value being proven or its binary representation.</p><p>We will also introduce <strong>challenge x</strong> and transform our original vectors into 2 vector polynomials (functions of x):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/392/1*sb96IfqYvMt3oLaLf1xaCw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/862/1*EB9hq6nOSuSm8HlvOvWUxQ.png" /></figure><p>Through these vector polynomials, the verifier can now do the same range proof verification described above, but without ever knowing the actual vectors <strong>aL</strong>, <strong>aR</strong>, <strong>sL</strong>, or <strong>sR</strong>, ensuring that it stays zero-knowledge.</p><p>With these relations, if you plug in x=0, you will get the values of aL and aR, but not if you plug in any nonzero value. 
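</p><p>A small Python sketch of this blinding (illustrative only, using plain integers rather than field elements) shows that only x = 0 recovers the bit vector:</p>

```python
import random

# Sketch of the blinded vector polynomial l(x) = aL + sL*x (element-wise).
aL = [1, 0, 1, 1]                              # bit vector for v = 13
sL = [random.randrange(1, 10**6) for _ in aL]  # random blinding vector

def l(x: int) -> list:
    return [a + s * x for a, s in zip(aL, sL)]

assert l(0) == aL   # only x = 0 exposes the underlying bits
assert l(1) != aL   # any nonzero x mixes in the blinding vector
```

<p>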
This verifies that <strong>challenge x</strong> precisely blinds aL and aR.</p><h4>Simplifying the Equation</h4><p>To combine both vector polynomials into a single check, we define this equation in terms of <strong>challenge x</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/320/1*6j8J1aP4dea3-9vb_L6W5A.png" /></figure><p>Here, challenge x connects the polynomials <strong>l(x)</strong> and <strong>r(x)</strong> while ensuring that the prover can’t commit to specific values beforehand and then manipulate the proof after seeing the challenge.</p><p>With this equation, you can substitute <strong>l(x)</strong> and <strong>r(x)</strong>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/540/1*P0T-dCI4V4ydeYudt3j7Pg.png" /></figure><p>then expand the inner products:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/668/1*dIi_GklQ9klfscgsqMdW5g.png" /></figure><p>and group them:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/526/1*8wLq-Jxs2gaShPiFkzL7BA.png" /></figure><p>If you look at the equation above, the first group has no factor of x, the second group has x, and the third has x².</p><p>This makes this property true:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/448/1*l6se3-w0CQYaE6mNbb83GQ.png" /></figure><p>Now the equation can be simply described as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/288/1*6FwcUySJhIfZMHVkpR4yvQ.png" /></figure><p>The prover can now make their commitment in this simplified equation by entering values for t₀, t₁, and t₂.</p><h4>Committing the Range Proof Checks</h4><p>Before we generate challenge x, as before, we will make 2 commitments for t₁ and t₂ to lock in their values.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/704/1*zafMqBaQARXTn3a0Y2eKKw.png" /></figure><p>An important point here is that t₀ = z·v, which is why we don’t make a commitment T₀: it can be verified with the 
original commitment C (this will be explained later).</p><p>In this example, we have values <strong>g</strong> and <strong>h</strong> generated from elliptic curves, and <strong>τ₁</strong> and <strong>τ₂</strong>, which are randomly generated values from the elliptic curve being used that further obscure the commitments.</p><p>Without blinding factors <strong>τ₁</strong> and <strong>τ₂</strong>, the verifier could plug in values g and h to get t₁ and t₂, then run t₀ = t̂ - t₁·x - t₂·x² to get t₀, and calculate v from t₀ = z·v.</p><p>To avoid this, we introduce randomly generated values τ₁ and τ₂ and keep the zero-knowledge properties.</p><h4>Generating Challenge x</h4><p><strong>Challenge x</strong> is derived by hashing challenges <strong>y</strong> and <strong>z</strong>, commitments <strong>A</strong> and <strong>S</strong>, and commitments <strong>T₁</strong> and <strong>T₂</strong>, again modding by <strong>q</strong> to ensure the challenge value fits within our cryptographic system.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/834/1*z_XE9-NpXv0jgMuMY5o7Sg.png" /></figure><p>In the end, the prover will send t̂ = t(x), along with commitments T₁ and T₂ and the combined blinding value <strong>τₓ</strong>, which is calculated as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/604/1*kpWBxDVWwlut9c_49q6dWw.png" /></figure><h4>Why we need a Combined Blinding Value τₓ</h4><p>We need to send a combined blinding value τₓ because if <strong>τ₁</strong> and <strong>τ₂</strong> are sent directly, the verifier can use challenge <strong>x</strong>, <strong>τ₁</strong>, and <strong>τ₂</strong> to calculate t₁ and t₂.</p><p>If the verifier receives <strong>τ₁</strong> = 2791 and <strong>τ₂</strong> = 3844, they can unblind the commitments:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/226/1*0g_TFBXJvHne7U7blEPtRw.png" /></figure><p>Now the verifier knows <strong>t₁</strong> = 153 and <strong>t₂</strong> = 78.</p><p>Then if the 
verifier uses the calculated t₁ and t₂ values with challenge <strong>z</strong> and t̂, they can calculate <strong>v</strong>.</p><p>Let’s say the challenge x = 7 and the prover has claimed t̂ = 4578. The verifier calculates:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/294/1*P4qpImXxEgWikt3PGJJHEQ.png" /></figure><p>From there, the verifier can divide <strong>t₀</strong> by challenge <strong>z</strong> to get <strong>v</strong>.</p><p>By hiding <strong>τ₁</strong> and <strong>τ₂</strong> and sending τₓ instead, we ensure this commitment stays zero-knowledge.</p><h3>Part 5: The Inner Product Argument for Proof Compression</h3><p>After creating commitments to the polynomial coefficients, we need an efficient way to prove that t(x) = ⟨l(x), r(x)⟩ without sending the actual vectors l(x) and r(x), because those vectors can be very large.</p><p>This is where the inner product argument comes in.</p><p>The inner product argument allows us to compress the equation t(x) = ⟨l(x), r(x)⟩ into a much smaller form, yielding the famously small proof sizes of bulletproofs.</p><p>To implement the inner product argument, we first set the vectors as:</p><ul><li><strong>α</strong> = l(x)</li><li><strong>b</strong> = r(x)</li></ul><p>Next, the vectors are split in half:</p><ul><li><strong>αL</strong> = first half of α, <strong>αR</strong> = second half of <strong>α</strong></li><li><strong>bL</strong> = first half of <strong>b</strong>, <strong>bR</strong> = second half of <strong>b</strong></li></ul><p>Now, we create a commitment to these cross products:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/590/1*j-4K9xFfDwbk0P_sJV04pg.png" /></figure><p>We create commitments here to serve as checkpoints so the verifier can recalculate the inner product argument without directly seeing vectors <strong>α</strong> and <strong>b</strong>, maintaining the zero-knowledge properties.</p><p>Finally, the vectors are 
compressed using this challenge:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/460/1*5qV63a5J060aw5ru-_B2Ug.png" /></figure><p>A new challenge <strong>u</strong> is introduced here so the prover cannot manipulate the compression process and select values that make invalid proofs pass verification.</p><p>The challenge <strong>u</strong> is generated like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/796/1*30Lknrm6p30Ircl2RnOz1Q.png" /></figure><p>This compression repeats, with a fresh challenge each round, until the vectors become single numbers. Since the vectors are of length n, it takes exactly log₂(n) rounds of halving to reduce them to length 1.</p><p>This recursive approach creates a proof with only 2·log₂(n) additional curve points, achieving the logarithmic size that makes bulletproofs so efficient.</p><h4>Example Proof Compression</h4><p>Let’s say we start with vectors of length 8 (n = 8), so we’ll need log₂(8) = 3 rounds of compression.</p><p><strong>Initial vectors:</strong></p><ul><li>α = [α₀, α₁, α₂, α₃, α₄, α₅, α₆, α₇]</li><li>b = [b₀, b₁, b₂, b₃, b₄, b₅, b₆, b₇]</li></ul><p><strong>Round 1</strong></p><p>Setup:</p><ul><li>Split α: <strong>αL</strong> = [α₀, α₁, α₂, α₃], <strong>αR</strong> = [α₄, α₅, α₆, α₇]</li><li>Split b: <strong>bL</strong> = [b₀, b₁, b₂, b₃], <strong>bR</strong> = [b₄, b₅, b₆, b₇]</li><li>Create commitments L₁ and R₁</li><li>Generate challenge u₁ (Takes in L₁ and R₁)</li></ul><p>Compression:</p><ul><li><strong>a</strong>’ = u₁⁻¹·[α₀, α₁, α₂, α₃] + u₁·[α₄, α₅, α₆, α₇]</li><li><strong>b</strong>’ = u₁·[b₀, b₁, b₂, b₃] + u₁⁻¹·[b₄, b₅, b₆, b₇]</li></ul><p>If you expand this it will be:</p><ul><li>α’ = [u₁⁻¹·α₀ + u₁·α₄, u₁⁻¹·α₁ + u₁·α₅, u₁⁻¹·α₂ + u₁·α₆, u₁⁻¹·α₃ + u₁·α₇]</li><li>b’ = [u₁·b₀ + u₁⁻¹·b₄, u₁·b₁ + u₁⁻¹·b₅, u₁·b₂ + u₁⁻¹·b₆, u₁·b₃ + u₁⁻¹·b₇]</li></ul><p>These are now vectors of length 4.</p><p><strong>Round 2</strong></p><p>Setup:</p><ul><li><strong>αL</strong> = [α’₀, α’₁] 
<strong>αR</strong> = [α’₂, α’₃]</li><li><strong>bL</strong> = [b’₀, b’₁], <strong>bR</strong> = [b’₂, b’₃]</li><li>Remember, <strong>αL</strong> = [α’₀, α’₁] = [u₁⁻¹·α₀ + u₁·α₄, u₁⁻¹·α₁ + u₁·α₅]</li><li>Create commitments L₂ and R₂</li><li>Generate challenge u₂ (Takes in L₂ and R₂)</li></ul><p>Compress:</p><ul><li>α’’ = u₂⁻¹·[α’₀, α’₁] + u₂·[α’₂, α’₃]</li><li>b’’ = u₂·[b’₀, b’₁] + u₂⁻¹·[b’₂, b’₃]</li><li>These are now vectors of length 2</li></ul><p><strong>Round 3:</strong></p><p>Setup:</p><ul><li><strong>αL</strong> = [α’’₀], <strong>αR</strong> = [α’’₁]</li><li><strong>bL</strong> = [b’’₀],<strong> bR </strong>= [b’’₁]</li><li>Create commitments L₃ and R₃</li><li>Generate challenge u₃ (Takes in L₃ and R₃)</li></ul><p>Compress:</p><ul><li>α’’’ = u₃⁻¹·α’’₀ + u₃·α’’₁</li><li>b’’’ = u₃·b’’₀ + u₃⁻¹·b’’₁</li><li>These are now scalars (vectors of length 1)</li></ul><p>After 3 rounds, our 8-element vectors have been compressed to single values, and we’ve created 6 additional commitments (3 pairs of L and R).</p><h3>Part 6: Proof Verification</h3><p>Finally, we will get into how the verifier verifies the proof.</p><h4>What the Verifier Receives</h4><p>Through this proof setup the verifier receives these items:</p><ul><li>Original commitment <strong>C</strong></li><li>Vector commitments <strong>A</strong> and <strong>S</strong></li><li>Polynomial coefficient commitments <strong>T₁ </strong>and <strong>T</strong>₂</li><li>Combined blinding value <strong>τₓ</strong></li><li>Inner product argument elements: L₁, R₁, L₂, R₂, …, (approximately 2·log₂(n) curve points)</li><li>Final scalars a’ and b’ (the completely compressed vectors at the end of the inner product argument recursion)</li></ul><p>The verifier does not receive:</p><ul><li>The secret value <strong>v</strong></li><li>The bit vectors <strong>aL</strong> and <strong>aR</strong></li><li>The blinding vectors <strong>sL</strong> and <strong>sR</strong></li><li>The original blinding factor <strong>r</strong></li><li>Any 
of the polynomial coefficients directly (<strong>t₀, t₁, t₂</strong>)</li></ul><h4>Steps for Verification</h4><p>After receiving all proof elements from the prover, the verifier follows these steps to check the validity of the range proof.</p><h4>1. Recreate the Challenges</h4><p>The verifier independently generates the same challenge values using the Fiat-Shamir heuristic:</p><ul><li>Challenge <strong>y</strong> = Hash(C || g || h || n) mod q</li><li>Challenge <strong>z</strong> = Hash(y || A || S) mod q</li><li>Challenge <strong>x</strong> = Hash(y || z || A || S || T₁ || T₂) mod q</li></ul><h4>2. Verify the Original Commitment</h4><p>First, the verifier needs to check that t₀ = z·v.</p><p>Since the verifier doesn’t know v directly but has the commitment C, they can create a commitment to z·v by using its homomorphic property.</p><p>The homomorphic property of Pedersen commitments lets operations on the committed values be mirrored by performing operations on the commitments themselves.</p><p>If C = v·g + r·h is a commitment to value v, then we can create a new commitment to z·v with blinding factor z·r.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/874/1*thi-BjmsVbMFfUSh0gVauA.png" /></figure><p>This expression will be used next in the polynomial commitment.</p><h4>3. 
Verify the Polynomial Commitment</h4><p>Next, the verifier confirms commitments T₁ and T₂, which prove that the secret value v satisfies all range constraints required for the bulletproof.</p><p>Now we will discuss why we don’t use <strong>T₀</strong>.</p><p>If we recall back to the polynomial creation, t₀ is this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/754/1*252qAgAVOYIxVorw6LSi8g.png" /></figure><p>When we expand t₀, we get ⟨aL, y^n ○ aR⟩ + ⟨aL, y^n ○ (z·1^n)⟩, which are the bit validity constraint and the value representation constraint.</p><p>If the bits are valid, ⟨aL, y^n ○ aR⟩ equals 0, and the second term ⟨aL, y^n ○ (z·1^n)⟩ simplifies to z·⟨aL, y^n⟩, which, when the bits are valid, reduces to z·⟨aL, 2^n⟩ = z·v.</p><p>Therefore, t₀ = z·v, which is the same as the new commitment z·C we just formed, so we don’t need a separate commitment T₀.</p><p>Now using T₁, T₂, t̂, and τₓ, the verifier checks this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/694/1*dXlSNcDRnf-OrSSL8TcLWw.png" /></figure><p>If these match, it confirms that the prover honestly evaluated t(x) at point x, that the polynomial t(x) encodes the range proof constraints correctly, that t₀ equals z·v (meaning v is in the required range), and that all commitments are consistent with each other.</p><h4>4. 
Verify the Inner Product Argument Commitment</h4><p>Finally, the verifier must ensure that t̂ = ⟨l(x), r(x)⟩ by rerunning the inner product argument.</p><p>Currently, we’ve only proven that the original commitment C properly encodes a value v within range [0, 2^n) and that the commitment relationship t₀ = z·v holds.</p><p>This isn’t sufficient to guarantee that the prover didn’t manipulate the polynomial coefficients t₁ and t₂ to validate an invalid proof.</p><p>For example, a dishonest prover might:</p><ul><li>Create a valid commitment C to the value v = 20</li><li>Correctly calculate z·C as required</li><li>But choose fake vectors for aL and aR that don’t properly represent v = 20 in binary and calculate fake values for t₁ and t₂ that, when combined with the true t₀ = z·v, create a polynomial t(x) that satisfies t(x) = ⟨l(x), r(x)⟩ at the challenge point x</li></ul><p>The inner product argument verification is therefore a critical step to ensure that t̂ = ⟨l(x), r(x)⟩.</p><p>Using the two vectors of generator points from the bulletproof setup <strong>g</strong> and <strong>h</strong> (g = (g₀, g₁, …, gₙ₋₁) and h = (h₀, h₁, …, hₙ₋₁)) and challenge u, the verifier will run the same compression protocol the prover used to verify that the compressed inner product argument holds, confirming that t̂ = ⟨l(x), r(x)⟩.</p><p>The verifier will also collect values <strong>s</strong> and <strong>t</strong> for every element of the vectors at every round, which show how challenge u is applied to each element of the vector and help track the compression the prover’s vectors went through.</p><p>In the end, the verifier will run a complex exponentiation equation including the final form of <strong>g</strong> and <strong>h</strong>, all values of <strong>s and t</strong>, challenge u, and the original commitment C to check that the prover used vectors whose inner product equals exactly t̂ and that these vectors are the same ones committed to in the original commitment 
C.</p><p><strong>Running the Inner Product Argument</strong></p><p>The verifier begins with the two vectors of generator points from the bulletproof setup: g = (g₀, g₁, …, gₙ₋₁) and h = (h₀, h₁, …, hₙ₋₁)</p><p>For each level i from 1 to log₂(n), the verifier conceptually splits the current generators into left and right halves, and using challenge uᵢ, the verifier updates the generators:</p><ul><li><strong>gL</strong> = first half of current g</li><li><strong>gR</strong> = second half of current g</li><li><strong>hL</strong> = first half of current h</li><li><strong>hR</strong> = second half of current h</li></ul><p>Next, we generate challenge u similarly to before, where:</p><ul><li><strong>u</strong>ᵢ = H(x || Lᵢ || Rᵢ) mod q</li></ul><p>Finally, we apply the challenge and get the new form of vectors <strong>g</strong> and <strong>h</strong>:</p><ul><li>g’ = uᵢ⁻¹·gL + uᵢ·gR</li><li>h’ = uᵢ·hL + uᵢ⁻¹·hR</li></ul><p>Just like the prover did, the verifier repeats this process log₂(n) times until g and h are single values.</p><p>We do this so the generators transform in a way that mirrors the compression of the vectors being proven, allowing us to verify the inner product relationship without ever seeing the original vectors.</p><p><strong>Deriving s</strong>ᵢ <strong>and t</strong>ᵢ<strong>:</strong></p><p>While we run the inner product argument, we also collect values <strong>s</strong>ᵢ and <strong>t</strong>ᵢ for every element i from 0 to n-1 and every round j from 1 to log₂(n).</p><p>These values tell us how challenge u was applied to each element of each vector every round.</p><p><strong>Example Demonstration of Deriving sᵢ and tᵢ:</strong></p><p>If we recall back to how the prover compressed the vectors in the inner product argument, in each round, every element of the original vectors is multiplied by either u₁ or u₁⁻¹.</p><p>If we had the same vector setup as the inner product argument example we started 
with:</p><p><strong>Round 1:</strong></p><ul><li><strong>αL</strong> = [α₀, α₁, α₂, α₃], <strong>αR</strong> = [α₄, α₅, α₆, α₇]</li><li><strong>bL</strong> = [b₀, b₁, b₂, b₃], <strong>bR</strong> = [b₄, b₅, b₆, b₇]</li></ul><p>During every round of compression from the prover, each element is multiplied by either u₁ or u₁⁻¹.</p><p><strong>Round 2:</strong></p><p><strong>αL</strong> = [u₁⁻¹·α₀ + u₁·α₄, u₁⁻¹·α₁ + u₁·α₅], <strong>αR</strong> = [u₁⁻¹·α₂ + u₁·α₆, u₁⁻¹·α₃ + u₁·α₇]</p><p><strong>bL</strong> = [u₁·b₀ + u₁⁻¹·b₄, u₁·b₁ + u₁⁻¹·b₅], <strong>bR</strong> = [u₁·b₂ + u₁⁻¹·b₆, u₁·b₃ + u₁⁻¹·b₇]</p><p>Tracking whether an element was multiplied by u₁ or u₁⁻¹ is what <strong>s</strong>ᵢ <strong>and t</strong>ᵢ tell us for positions i from 0 to n-1.</p><p>With this in mind, the verifier will calculate whether u₁ or u₁⁻¹ was applied to α₀ in round 1, whether u₂ or u₂⁻¹ was applied in round 2, and so on for all elements and all rounds, generating 48 total values of <strong>s</strong> and <strong>t</strong> for this example (8 positions × 3 rounds × 2 vectors).</p><p>Let’s look at position 5 (element α₅) as an example.</p><p>For each position, representing the position in binary form is a trick to show how the challenge was applied for that element.</p><p><strong>5 in binary</strong>: 101</p><ul><li>Rightmost bit (1): Tells us α₅ is in the RIGHT half in round 1</li><li>Middle bit (0): Tells us α₅’s value flows to the LEFT half in round 2</li><li>Leftmost bit (1): Tells us α₅’s value ends up in the RIGHT half in round 3</li></ul><p><strong>Calculate the values based on which half the element is in</strong>:</p><p><strong>Round 1</strong>: (Right half)</p><ul><li>s₁[5] = u₁⁻¹ (inverse of challenge u₁)</li><li>t₁[5] = u₁</li></ul><p><strong>Round 2</strong>: (Left half)</p><ul><li>s₂[5] = u₂</li><li>t₂[5] = u₂⁻¹</li></ul><p><strong>Round 3</strong>: (Right half)</p><ul><li>s₃[5] = u₃⁻¹</li><li>t₃[5] = u₃</li></ul><p><strong>s</strong> tells us exactly 
how elements in vector b get transformed during compression (the direct multiplier), and inversely how elements in vector a get transformed.</p><ul><li>In round 1: α₅ is multiplied by u₁ in the compression formula</li><li>In round 2: α₅ is multiplied by u₂⁻¹ in the compression formula</li><li>In round 3: α₅ is multiplied by u₃ in the compression formula</li></ul><p>Combining these values of <strong>s</strong> and <strong>t</strong> with the final scalar values <strong>a’</strong> and <strong>b’</strong> from the prover captures how each element in the prover’s vectors was transformed, and will be used in the final exponentiation equation.</p><p><strong>Running the exponentiation equation</strong></p><p>After log₂(n) rounds, the verifier has generated:</p><ul><li>Two final generators g_final (a single point) and h_final (a single point) from the verifier</li><li>All scalar values sᵢ and tᵢ the verifier calculated to check how elements would move through the compression path</li></ul><p>From the initial prover exchange, the verifier also has:</p><ul><li>The claimed final scalars a’ and b’ from the prover</li><li>All L and R values from each recursion level given by the prover</li></ul><p>These variables allow us to run our final check by building this complex exponentiation equation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/946/1*9A6P_bN_JLdl8CHBSxFbFg.png" /></figure><p>The left side represents the compressed form of our verification after all the protocol’s recursive steps, where gFinalᵃ’ and hFinalᵇ’ encode the final compressed vectors, while the product term incorporates all intermediate commitment points Lᵢ and Rᵢ adjusted by their corresponding sᵢ and tᵢ values. 
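</p><p>The recursion this equation summarizes can be sketched numerically, with scalars mod a prime standing in for the curve points (illustrative only; the transcript hashing and parameters are demo choices):</p>

```python
import hashlib

# Numeric sketch of the inner product argument's compression recursion.
q = 2147483647  # demo prime modulus (2^31 - 1)

def ip(u, v):
    """Inner product of two equal-length vectors, reduced mod q."""
    return sum(x * y for x, y in zip(u, v)) % q

def challenge(rnd, cL, cR):
    """Toy Fiat-Shamir challenge derived from the round transcript."""
    data = f"{rnd}:{cL}:{cR}".encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % q

a = [3, 1, 4, 1, 5, 9, 2, 6]   # stands in for l(x)
b = [2, 7, 1, 8, 2, 8, 1, 8]   # stands in for r(x)
t = ip(a, b)                   # the claimed inner product t-hat

rnd = 1
while len(a) > 1:
    half = len(a) // 2
    aL, aR, bL, bR = a[:half], a[half:], b[:half], b[half:]
    cL, cR = ip(aL, bR), ip(aR, bL)   # the cross terms that L and R commit to
    u = challenge(rnd, cL, cR)
    u_inv = pow(u, -1, q)
    a = [(u_inv * x + u * y) % q for x, y in zip(aL, aR)]
    b = [(u * x + u_inv * y) % q for x, y in zip(bL, bR)]
    # ip(a', b') = ip(a, b) + u^-2 * cL + u^2 * cR, so the claim updates too:
    t = (t + u_inv * u_inv * cL + u * u * cR) % q
    rnd += 1

# After log2(8) = 3 rounds the vectors are single scalars that still encode t
assert a[0] * b[0] % q == t
```

<p>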
These sᵢ and tᵢ values precisely track how each element of the original vectors was transformed by challenges uᵢ at each recursive level, effectively capturing the entire compression history in a single expression.</p><p>The right side represents what the result should be if the claimed inner product is correct, where C is the original commitment to vectors a and b, and u_finalᵗ encodes the claimed inner product value t̂ adjusted by the accumulated challenge factor u_final.</p><p>When these two sides are equal, it cryptographically verifies that:</p><ol><li>The prover has vectors that were compressed to <strong>a’</strong> and <strong>b’</strong> whose inner product equals exactly t̂</li><li>Those vectors are the same ones committed to in the original commitment C</li><li>All mathematical constraints in the protocol hold true</li></ol><h3>Conclusion</h3><p>Through this demonstration, I hope you’ve gained a thorough understanding of how Bulletproofs work and why there are many complex elements in the protocol.</p><p>Please <strong>follow my account</strong> and stay tuned as I’m currently implementing an accelerated bulletproof using CUDA 👀.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Using CUDA to accelerate ZKPs]]></title>
            <link>https://medium.com/@ronantech/using-cuda-to-accelerate-zkps-6fc1213daa31?source=rss-fbd6f4eb076e------2</link>
            <guid isPermaLink="false">https://medium.com/p/6fc1213daa31</guid>
            <category><![CDATA[cuda]]></category>
            <category><![CDATA[zero-knowledge-proofs]]></category>
            <category><![CDATA[cuda-toolkit]]></category>
            <category><![CDATA[zkp]]></category>
            <dc:creator><![CDATA[Ronan Takizawa]]></dc:creator>
            <pubDate>Sun, 02 Mar 2025 19:41:35 GMT</pubDate>
            <atom:updated>2025-03-02T19:41:35.916Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-H7qkFNLzK-ozZnHTOjksw.png" /></figure><p><a href="https://github.com/ronantakizawa/cudazkp">Full Code</a> (Make sure to leave the Original Repo a Star!) ⭐️</p><p>Zero-knowledge proofs (ZKPs) have emerged as a powerful tool for verifying computations in a decentralized manner without revealing sensitive information. As ZKPs become integrated into large-scale applications (ZK-TLS, ZK Rollups), the ability to efficiently verify large batches of proofs becomes crucial. Generating and verifying ZKPs involve complex cryptographic operations, including large-scale polynomial computations and elliptic curve pairings, all of which require significant computational resources and time.</p><p>In this article, we will explore how a GPU runtime using CUDA can dramatically improve ZKP verification performance compared to a CPU runtime.</p><h3>Understanding the Zero-Knowledge Proof Verification</h3><p>In this implementation, we will use the Schnorr Protocol (one of the simplest ZKPs) to avoid ZKP implementation complexity and focus on performance comparison.</p><h4><strong>Setup</strong></h4><ul><li>A large prime <strong>p</strong> is chosen as the modulus.</li><li>A generator <strong>g</strong> of a multiplicative group modulo p is selected.</li><li>The prover has a <strong>secret witness</strong> <strong>w</strong>, and computes a <strong>public key h</strong>, where</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/326/1*R7rg9IG08_Ra_F_Gzpt_rA.png" /></figure><ul><li>This means <strong>h</strong> is known to both the verifier and the prover, but <strong>w</strong> remains secret.</li></ul><h4><strong>Proof Generation</strong></h4><p>The protocol follows a three-step challenge-response process: <strong>commitment, challenge, and response</strong>.</p><p><strong>Step 1: Commitment</strong></p><ul><li>The prover randomly picks a number <strong>r</strong> and computes</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/316/1*jPCrslBCCL5KdDnWABOKeg.png" /></figure><ul><li><strong>C</strong> is called the <strong>commitment</strong> and is sent to the verifier.</li></ul><p><strong>Step 2: Challenge</strong></p><ul><li>The verifier sends a challenge <strong>c</strong> (chosen randomly).</li></ul><p><strong>Step 3: Response</strong></p><ul><li>The prover computes the response</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/516/1*_tvNOiDLRYxj0W6-m1rOog.png" /></figure><ul><li><strong>s</strong> is sent to the verifier.</li><li>It encodes both the random value <strong>r</strong> and the hidden secret <strong>w</strong> influenced by the challenge <strong>c</strong>.</li></ul><h4>Verification</h4><p>The verifier checks whether the following equation holds</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/522/1*1zbJLVfOjeKEYB9ivhVTTQ.png" /></figure><p><strong>Why does this work?</strong></p><ul><li>Substituting s = r + c⋅w, we get:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/442/1*vLv-2fNK4jwJOgKuiD2o-w.png" /></figure><ul><li>Since h = g^w, we can say:</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/230/1*CURa0dOcfV0WkQB7nYTxfw.png" /></figure><p>If this equation holds, it confirms the prover knows <strong>w</strong> without revealing it.</p><h3>Benchmarking Setup</h3><h4>Iterations</h4><p>The implementation tests both CPU and GPU approaches of the Schnorr Protocol with increasing batch sizes (1,000, 10,000, 100,000, and 1,000,000 proofs) and records:</p><ol><li>Execution time (milliseconds)</li><li>Throughput (proofs verified per second)</li><li>Speedup (ratio of CPU time to GPU time)</li></ol><h3>Implementation Details</h3><p>For every run, parameters are initialized 
where:</p><ul><li>g is the <strong>generator</strong>.</li><li>p is the <strong>large prime modulus</strong>.</li><li>The prover computes h = g^witness mod p, which acts as the <strong>public key</strong> (h).</li></ul><pre>unsigned long long g = 2;<br>unsigned long long witness = 42;<br>unsigned long long p = 2147483647;  // 2^31 - 1<br>unsigned long long h = hostModPow(g, witness, p);</pre><p>Each proof’s commitment is computed in a loop.</p><ul><li>r is a random nonce (simulated deterministically using i % 1000000).</li><li>commitments[i] = g^r mod p commits the prover to the nonce r without revealing it, while still allowing verification.</li></ul><pre>commitments[i] = hostModPow(g, r, p);</pre><p>The challenge c is <strong>simulated deterministically</strong> for testing.</p><pre>challenges[i] = i % 1000;</pre><p>For response computation:</p><ul><li>The prover computes s = r + c * w mod (p-1), encoding the proof without revealing w.</li><li>The prover sends (commitments[i], responses[i], challenges[i]) to the verifier.</li></ul><pre>responses[i] = (r + challenges[i] * witness) % (p-1);</pre><p>Finally, the verifier checks whether the proof is valid <strong>without knowing </strong><strong>w</strong>.</p><p><strong>CPU</strong></p><pre>unsigned long long left = hostModPow(args-&gt;g, args-&gt;responses[i], args-&gt;p);<br>unsigned long long right = (args-&gt;commitments[i] * hostModPow(args-&gt;h, args-&gt;challenges[i], args-&gt;p)) % args-&gt;p;<br>args-&gt;results[i] = (left == right) ? 1 : 0;</pre><p><strong>GPU</strong></p><pre>unsigned long long left = deviceModPow(g, responses[idx], p);<br>unsigned long long right = (commitments[idx] * deviceModPow(h, challenges[idx], p)) % p;<br>results[idx] = (left == right) ? 1 : 0;</pre><p>The CPU verification process is executed using <strong>pthreads</strong>, where the workload is distributed across multiple threads (up to a maximum of 16). 
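</p><p><em>The per-proof check that each worker performs can be sketched end-to-end in plain Python. This is an illustrative rewrite of the C/CUDA logic above, not code from the repo: prove and verify are hypothetical helper names, and Python&#39;s built-in three-argument pow stands in for hostModPow/deviceModPow.</em></p>

```python
# Illustrative Schnorr generate-and-verify flow, using the article's
# parameters: g = 2, p = 2^31 - 1 (prime), secret witness w = 42.
g = 2
p = 2147483647          # the large prime modulus
order = p - 1           # exponents are reduced mod (p - 1)
witness = 42            # the secret w
h = pow(g, witness, p)  # public key h = g^w mod p

def prove(r: int, c: int):
    """Return (commitment, response) for nonce r and challenge c."""
    commitment = pow(g, r, p)             # C = g^r mod p
    response = (r + c * witness) % order  # s = r + c*w mod (p - 1)
    return commitment, response

def verify(commitment: int, c: int, s: int) -> bool:
    """Accept iff g^s == C * h^c (mod p)."""
    left = pow(g, s, p)
    right = (commitment * pow(h, c, p)) % p
    return left == right

# Verify a small batch, simulating nonces and challenges as the article does.
ok = True
for i in range(1000):
    C, s = prove(i % 1000000, i % 1000)
    ok = ok and verify(C, i % 1000, s)
print(ok)  # True: every proof in the batch verifies
```

<p>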
Each thread verifies a subset of proofs independently, and the results are collected after all threads complete execution.</p><p>The GPU verification kernel processes proofs in parallel using <strong>CUDA</strong>. The number of threads per block is set to 256, and the total number of blocks is determined dynamically based on the batch size. The kernel executes modular exponentiation computations for each proof in parallel, leveraging thousands of GPU cores for high throughput.</p><h3>Results: GPU Dominance at Scale</h3><p>The results show dramatic performance advantages for the GPU implementation, especially at larger batch sizes:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*BW1gSxoZ9J2cZl6KeKnk3g.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-H7qkFNLzK-ozZnHTOjksw.png" /></figure><h3>Why GPUs Outperform CPUs in ZKP Verification</h3><ol><li><strong>Massively Parallel Processing</strong><br>GPUs excel at parallel computation, utilizing thousands of cores to execute tasks simultaneously. In contrast, our test CPU is constrained to just 8 threads. With multiple Streaming Multiprocessors (SMs), modern GPUs can efficiently handle a vast number of verification tasks in parallel, significantly boosting performance.</li><li><strong>Optimized Memory Access</strong><br>Unlike CPUs, GPUs are designed for high-throughput parallel memory access. While not immediately visible in the code, the GPU’s memory subsystem coalesces memory operations when neighboring threads access contiguous addresses, reducing bottlenecks and enabling efficient DRAM access patterns. This allows for seamless execution of large-scale proof verification.</li></ol><h3>Conclusion</h3><p>The experimental results clearly demonstrate the superiority of GPU-based verification for zero-knowledge proofs, with speedups of up to 258x for large batches in the experiment. 
The massive throughput advantage of GPUs opens new possibilities for cryptographic systems that were previously constrained by verification performance.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Cache-Augmented Generation (CAG) in LLMs: A Step-by-Step Tutorial]]></title>
            <link>https://blog.gopenai.com/cache-augmented-generation-cag-in-llms-a-step-by-step-tutorial-6ac35d415eec?source=rss-fbd6f4eb076e------2</link>
            <guid isPermaLink="false">https://medium.com/p/6ac35d415eec</guid>
            <category><![CDATA[retrieval-augmented]]></category>
            <category><![CDATA[retrieval-augmented-gen]]></category>
            <category><![CDATA[llm]]></category>
            <dc:creator><![CDATA[Ronan Takizawa]]></dc:creator>
            <pubDate>Thu, 02 Jan 2025 00:04:40 GMT</pubDate>
            <atom:updated>2025-06-05T13:37:24.980Z</atom:updated>
<content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DIzPQYzWxlGo0RXNpAtH5g.png" /></figure><p><a href="https://github.com/ronantakizawa/cacheaugmentedgeneration">Full Code </a>(Make sure to leave the Original Repo a Star!) ⭐️</p><p><strong>Retrieval-augmented generation (RAG)</strong> is a powerful method for connecting external knowledge bases to an LLM, fetching context each time a user asks a question, but it can slow down the LLM&#39;s responses because of retrieval latency.</p><p><strong>Cache-augmented generation (CAG)</strong> offers a faster alternative; instead of performing real-time retrieval, it <em>preloads</em> your relevant documents into the model’s context and stores the resulting inference state, also known as the Key-Value (KV) cache. This approach eliminates retrieval latency, allowing the model to access the preloaded information instantly for faster and more efficient responses.</p><p>For a more technical explanation of CAG, check out <a href="https://medium.com/@sahin.samia/cache-augmented-generation-a-faster-simpler-alternative-to-rag-for-ai-2d102af395b2">this article</a>.</p><p>In this tutorial, we will show how to build a simple <strong>CAG</strong> setup that preloads all your knowledge upfront, quickly answers multiple user queries, and resets the cache without reloading the entire context each time.</p><h4>Prerequisites</h4><p>1. A HuggingFace account and a HuggingFace access token</p><p>2. 
A document.txt file with sentences about yourself.</p><h4>Project Setup</h4><p>We import the essential libraries:</p><ul><li>torch for PyTorch.</li><li>transformers for Hugging Face models.</li><li>DynamicCache for storing the model’s key-value states.</li></ul><pre>import torch<br>from transformers import AutoTokenizer, AutoModelForCausalLM<br>from transformers.cache_utils import DynamicCache<br>import os</pre><h4>Generate Function</h4><p>We’ll next define the generate function.</p><p>The generate function handles token-by-token generation over the cached knowledge using greedy decoding.</p><p>Greedy decoding is a simple text generation method where, at each step, the token with the highest probability (the maximum value in the logits) is selected as the next token.</p><p>We pass in these inputs:</p><ul><li>model: The LLM, which will be Mistral-7B for this tutorial.</li><li>input_ids: A tensor containing the tokenized input sequence.</li><li>past_key_values: The core component of CAG. A cache of previously computed attention values, used to speed up inference by avoiding recomputation.</li><li>max_new_tokens: The maximum number of new tokens to generate. 
The default is 50.</li></ul><p>The function operates in a loop that iterates up to max_new_tokens times or terminates early if an end-of-sequence token (if configured) is generated.</p><p>At each iteration:</p><ul><li>The model processes the current input tokens along with the cached past_key_values, producing logits for the next token.</li><li>The logits are analyzed to identify the token with the highest probability using greedy decoding.</li><li>This new token is appended to the output sequence, and the cache (past_key_values) is updated to include the current context.</li><li>The newly generated token becomes the input for the next iteration.</li></ul><pre>def generate(model, input_ids: torch.Tensor, past_key_values, max_new_tokens: int = 50) -&gt; torch.Tensor:<br>    device = model.model.embed_tokens.weight.device<br>    origin_len = input_ids.shape[-1]<br>    input_ids = input_ids.to(device)<br>    output_ids = input_ids.clone()<br>    next_token = input_ids<br><br>    with torch.no_grad():<br>        for _ in range(max_new_tokens):<br>            out = model(<br>                input_ids=next_token,<br>                past_key_values=past_key_values,<br>                use_cache=True<br>            )<br>            logits = out.logits[:, -1, :]<br>            token = torch.argmax(logits, dim=-1, keepdim=True)<br>            output_ids = torch.cat([output_ids, token], dim=-1)<br>            past_key_values = out.past_key_values<br>            next_token = token.to(device)<br><br>            if model.config.eos_token_id is not None and token.item() == model.config.eos_token_id:<br>                break<br>    return output_ids[:, origin_len:]</pre><h4>DynamicCache Setup</h4><p>Next, we’ll define the get_kv_cache function that prepares a reusable key-value cache for a transformer model’s attention mechanism and the clean_up function that cleans the key-value cache by removing unnecessary entries to ensure that you can answer multiple independent questions 
without “polluting” the cache.</p><p>get_kv_cache passes a prompt (in our case, the knowledge from document.txt) through the model once, creating a KV cache that records all the hidden states from each layer.</p><p>get_kv_cache takes these inputs:</p><ul><li>model: The transformer model used for encoding the prompt.</li><li>tokenizer: The tokenizer that converts the prompt into token IDs.</li><li>prompt: The string input used as the prompt.</li></ul><p>and returns an object of type DynamicCache.</p><p>The get_kv_cache function tokenizes the provided prompt into input IDs, initializes a DynamicCache object to store key-value pairs, and performs a forward pass through the model with caching enabled (use_cache=True). This populates the cache with the key-value pairs resulting from the model&#39;s computation.</p><p>The clean_up function trims a DynamicCache object back to the original sequence length by removing any additional tokens added during processing. 
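</p><p><em>The trim that clean_up performs can be illustrated with standalone torch tensors. The shapes below are made up for the example and are not Mistral&#39;s real dimensions:</em></p>

```python
import torch

# Toy key/value tensors shaped (batch, heads, seq_len, head_dim).
# Suppose the original prompt cached 5 positions (origin_len = 5) and
# 3 extra question/answer tokens were appended afterwards.
origin_len = 5
key = torch.randn(1, 4, 8, 16)
value = torch.randn(1, 4, 8, 16)

# The same per-layer slice clean_up applies: keep only the first
# origin_len positions along the sequence dimension (dim 2).
key = key[:, :, :origin_len, :]
value = value[:, :, :origin_len, :]

print(key.shape)  # torch.Size([1, 4, 5, 16])
```

<p>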
For each layer of the cache, it slices both the key and value tensors to retain only the first origin_len tokens along the sequence dimension.</p><pre>def get_kv_cache(model, tokenizer, prompt: str) -&gt; DynamicCache:<br>    device = model.model.embed_tokens.weight.device<br>    input_ids = tokenizer(prompt, return_tensors=&quot;pt&quot;).input_ids.to(device)<br>    cache = DynamicCache()<br><br>    with torch.no_grad():<br>        _ = model(<br>            input_ids=input_ids,<br>            past_key_values=cache,<br>            use_cache=True<br>        )<br>    return cache<br><br>def clean_up(cache: DynamicCache, origin_len: int):<br>    for i in range(len(cache.key_cache)):<br>        cache.key_cache[i] = cache.key_cache[i][:, :, :origin_len, :]<br>        cache.value_cache[i] = cache.value_cache[i][:, :, :origin_len, :]</pre><h4>Load LLM (Mistral)</h4><p>Now we’ll load the Mistral-7B tokenizer and model, using half precision (FP16) when a GPU is available and full precision otherwise.</p><p>Remember to replace YOUR_HF_TOKEN with your unique HuggingFace token.</p><pre>model_name = &quot;mistralai/Mistral-7B-Instruct-v0.1&quot;<br>tokenizer = AutoTokenizer.from_pretrained(model_name, token=&quot;YOUR_HF_TOKEN&quot;, trust_remote_code=True)<br>model = AutoModelForCausalLM.from_pretrained(<br>    model_name,<br>    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,<br>    device_map=&quot;auto&quot;,<br>    trust_remote_code=True,<br>    token=&quot;YOUR_HF_TOKEN&quot;<br>)<br>device = &quot;cuda&quot; if torch.cuda.is_available() else &quot;cpu&quot;<br>model.to(device)<br>print(f&quot;Loaded {model_name}.&quot;)</pre><h4>Create a Knowledge Prompt from document.txt</h4><p>Next, we’ll read document.txt, which you can fill with information about yourself. 
For this tutorial, document.txt contains information about me (Ronan Takizawa).</p><p>Here we construct a simple system prompt containing the document’s information and pass it to get_kv_cache to generate the KV cache.</p><pre>with open(&quot;document.txt&quot;, &quot;r&quot;, encoding=&quot;utf-8&quot;) as f:<br>    doc_text = f.read()<br><br>system_prompt = f&quot;&quot;&quot;<br>&lt;|system|&gt;<br>You are an assistant who provides concise factual answers.<br>&lt;|user|&gt;<br>Context:<br>{doc_text}<br>Question:<br>&quot;&quot;&quot;.strip()<br><br>ronan_cache = get_kv_cache(model, tokenizer, system_prompt)<br>origin_len = ronan_cache.key_cache[0].shape[-2]<br>print(&quot;KV cache built.&quot;)</pre><h4>Ask Questions Reusing the Cache</h4><p>We first run clean_up to reset our cache (good practice for CAG).</p><p>Next, we convert our question into tokens as input_ids_q1, which are then appended to the knowledge context stored in ronan_cache.</p><p>Finally, we call generate to produce the answer, decoding the final result with tokenizer.decode.</p><pre>question1 = &quot;Who is Ronan Takizawa?&quot;<br>clean_up(ronan_cache, origin_len)<br>input_ids_q1 = tokenizer(question1 + &quot;\n&quot;, return_tensors=&quot;pt&quot;).input_ids.to(device)<br>gen_ids_q1 = generate(model, input_ids_q1, ronan_cache)<br>answer1 = tokenizer.decode(gen_ids_q1[0], skip_special_tokens=True)<br>print(&quot;Q1:&quot;, question1)<br>print(&quot;A1:&quot;, answer1)</pre><p>You should expect a response like this:</p><pre>Q1: Who is Ronan Takizawa?<br>A1: Answer: Ronan Takizawa is an ambitious and accomplished <br>tech enthusiast. 
He has a diverse skill set in <br>software development, AI/ML...</pre><p>Now we will save the cache to disk and then reload it to prove that the cache persists across sessions.</p><pre># Save the cache to disk<br>clean_up(ronan_cache, origin_len)<br>cache_dir = &quot;cag_cache&quot;<br>os.makedirs(cache_dir, exist_ok=True)<br><br># Save the KV cache<br>torch.save(ronan_cache, os.path.join(cache_dir, &quot;ronan_knowledge.cache&quot;))<br><br># Load the cache to prove context is preserved across sessions<br># (on newer PyTorch versions, pass weights_only=False so the pickled cache object loads)<br>loaded_cache = torch.load(os.path.join(cache_dir, &quot;ronan_knowledge.cache&quot;))<br><br>question3 = &quot;What technologies has he worked with?&quot;<br>input_ids_q3 = tokenizer(question3 + &quot;\n&quot;, return_tensors=&quot;pt&quot;).input_ids.to(device)<br>gen_ids_q3 = generate(model, input_ids_q3, loaded_cache)<br>answer3 = tokenizer.decode(gen_ids_q3[0], skip_special_tokens=True)</pre><p>You should get a response tailored to the context again.</p><h3>Conclusion</h3><p><strong>Cache-augmented generation (CAG)</strong> simplifies AI architectures by storing small knowledge bases directly within a model’s context window, eliminating the retrieval loop of RAG and reducing latency. This approach improves the responsiveness of an LLM augmented with external knowledge. 
By leveraging CAG, developers can streamline their AI systems for faster and more efficient knowledge integration, particularly for tasks with stable, compact datasets.</p><hr><p><a href="https://blog.gopenai.com/cache-augmented-generation-cag-in-llms-a-step-by-step-tutorial-6ac35d415eec">Cache-Augmented Generation (CAG) in LLMs: A Step-by-Step Tutorial</a> was originally published in <a href="https://blog.gopenai.com">GoPenAI</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>