Stories by Remigiusz Samborski on Medium

Building “Sweets Vault” — a multimodal Gemini Agent with physical hardware integration

Remigiusz Samborski — Fri, 15 May 2026 10:01:32 GMT

Motivating seven-year-olds to complete their daily reading and handwriting practice is a classic parenting challenge. Traditional rewards work for a while, but they lack interactivity and require constant manual verification.

As a developer, I like to solve such challenges with automation. After putting some thought into it, I came up with the Sweets Vault idea: an interactive agent powered by Google’s Agent Development Kit (ADK) and the Gemini API. The system acts as a cheerful guardian that talks to children, visually inspects their workbooks via uploaded images, tests their reading comprehension, and triggers a hardware lock to open a drawer full of sweets upon successful completion.

In this guide, I will walk you through the architecture and implementation of this solution. You will learn how to:

Structure a multimodal agent using the Agent Development Kit (ADK).
Implement visual and verbal verification using Gemini’s multimodal capabilities.
Manage state across multiple conversation turns and tools.
Connect agent tool calls to physical hardware interfaces.
Develop and run locally to access the physical hardware.

If you’d like to jump directly to the code, visit the GitHub repository. All the code is available there for your exploration.

System architecture overview

The diagram below presents the high-level architecture of the solution:

The core components include:

Gemini API: Handles reasoning, multimodal homework validation and tool calls.
ADK Agent & Tools: Encapsulates the system instructions, state management, and callable Python functions.
Hardware Interface: Translates tool execution into physical actions (unlocking specific drawer IDs).

The Agent runs on a local machine (I am using a mini PC with Ubuntu installed), which gives it direct access to the hardware:

Magnetic drawers controlled via FT232H USB to GPIO converter
LED Matrix controlled via REST API running on a Raspberry Pi

Initially, I planned to control the LED Matrix using a second FT232H controller, but due to a lack of library support, I ended up using an intermediary Raspberry Pi. This approach has its benefits: for example, the LED Matrix can be located anywhere at home within Wi-Fi range 😀

Root agent logic

To kick-start the agent development, I leveraged the agent-starter-pack templates. It provides a production-ready foundation with FastAPI, frontend UI integration, and built-in observability.

The heart of the Sweets Vault is located in agent/app/agent.py. I start by configuring the environment and initializing Gemini Enterprise Agent Platform (formerly Vertex AI). I also define the specific tasks required for our users (Mary and James):

load_dotenv()
project_id = os.getenv("GOOGLE_CLOUD_PROJECT")
location = os.getenv("GOOGLE_CLOUD_LOCATION", "us-central1")
os.environ["GOOGLE_CLOUD_PROJECT"] = project_id
os.environ["GOOGLE_CLOUD_LOCATION"] = location
os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = "True"

# Initialize Vertex AI
vertexai.init(project=project_id, location=location)

As a native Polish speaker, I want the Agent to work both in Polish (for the sake of my kids) and in English (for demo purposes). This is handled by the AGENT_LANGUAGE variable:

AGENT_LANGUAGE = os.getenv("AGENT_LANGUAGE", "en")

The actual agent (root_agent) is created at the bottom of the same file:

root_agent = Agent(
    name="root_agent",
    model=Gemini(
        model="gemini-2.5-flash",
        retry_options=types.HttpRetryOptions(attempts=3),
    ),
    instruction=load_prompt_from_file(f"sweet-vault-agent-{AGENT_LANGUAGE}.txt"),
    tools=[get_progress, complete_task, unlock_drawer],
)

Note: The prompt is language specific and pulled from a file with a language suffix (en or pl).
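The load_prompt_from_file helper is not shown in this article, so here is a minimal sketch of what such a loader could look like. The PROMPTS_DIR default, the SUPPORTED_LANGUAGES set, the prompt_filename helper and the fallback to English are my assumptions for illustration, not the repository's actual code.

```python
from pathlib import Path

# Assumed location; the article only says the prompts live in agent/app/prompts.
PROMPTS_DIR = Path("agent/app/prompts")
# Assumed set of languages, based on the "en" and "pl" suffixes mentioned above.
SUPPORTED_LANGUAGES = {"en", "pl"}


def prompt_filename(language: str) -> str:
    """Build the language-specific file name, falling back to English (assumed behavior)."""
    lang = language.lower()
    if lang not in SUPPORTED_LANGUAGES:
        lang = "en"
    return f"sweet-vault-agent-{lang}.txt"


def load_prompt_from_file(filename: str, prompts_dir: Path = PROMPTS_DIR) -> str:
    """Read a prompt file as UTF-8 so Polish characters survive intact."""
    return (prompts_dir / filename).read_text(encoding="utf-8").strip()
```

Reading the file explicitly as UTF-8 matters here because the Polish prompt contains diacritics that a platform default encoding could mangle.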

Handling state

A common failure mode in conversational AI is lost context or hallucinated task completion. To prevent this, we implement strict state management using ToolContext.

Instead of relying on the model’s memory, the agent reads and writes explicit completion flags to its session state:

def _get_task_status(user_key: str, task_id: str, tool_context: ToolContext) -> bool:
    """Retrieves the completion status for a specific task from the flat state."""
    state_key = f"user_tasks_{user_key}_{task_id}"
    return tool_context.state.get(state_key, False)


def _set_task_status(user_key: str, task_id: str, is_done: bool, tool_context: ToolContext):
    """Saves the completion status for a specific task and ensures all user/task 
    combinations are explicitly represented in the flat tool_context.state.
    """
    # First, update the specific target task in the current tool state
    target_key = f"user_tasks_{user_key}_{task_id}"
    tool_context.state[target_key] = is_done
    
    # Now, ensure every possible combination for all known users exists in the flat state.
    all_sync_updates = {}
    for name in user_names:
        u_key = name.lower()
        for t_id in TASKS_CONFIG:
            key = f"user_tasks_{u_key}_{t_id}"
            # If the key isn't already in the current state, default it to False.
            # Otherwise, keep its existing value.
            all_sync_updates[key] = tool_context.state.get(key, False)
    
    # Apply all values back to the flat state
    tool_context.state.update(all_sync_updates)
    logging.info(f"Synchronized all task state values. Updated {target_key} to {is_done}")

Key learning: When building the system I tried using session state elements as a nested dictionary, but unfortunately at the time of writing this is not supported. The workaround was to use a flat structure with keys including both the user_key and task_id, which works well for my use case. However, this pattern might not scale well for a complex system with many users and tasks, in which case serialization or an external DB could be a better option.
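To make the flat-key workaround easier to reason about, the sketch below rebuilds a nested, read-only view from flat keys. The key format matches the helpers above; the flat_key and nested_view names are hypothetical, and a plain dict stands in for the session state.

```python
PREFIX = "user_tasks"


def flat_key(user_key: str, task_id: str) -> str:
    """Build the flat state key used for one user/task combination."""
    return f"{PREFIX}_{user_key}_{task_id}"


def nested_view(state: dict, users: list[str], task_ids: list[str]) -> dict[str, dict[str, bool]]:
    """Rebuild a nested {user: {task: done}} view from the flat state (read-only)."""
    return {
        user: {task: bool(state.get(flat_key(user, task), False)) for task in task_ids}
        for user in users
    }


# A plain dict stands in for tool_context.state in this sketch.
state = {flat_key("mary", "A"): True}
view = nested_view(state, ["mary", "james"], ["A", "B"])
# view == {"mary": {"A": True, "B": False}, "james": {"A": False, "B": False}}
```

Keeping the writes flat and deriving the nested shape only when reading avoids depending on nested-dictionary support in the session state.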

Agent tools

I provided the agent with three specific tools: checking progress, marking tasks complete, and unlocking the drawer.

Checking progress

The get_progress function retrieves and formats a checklist of a specific user’s tasks, indicating whether each task is marked as completed or pending based on the application’s current session state.

def get_progress(user_name: str, tool_context: ToolContext) -> str:
    """Check the progress of tasks for a specific user."""
    user_key = user_name.lower()

    status_msg = f"Progress for {user_name}:\n"
    for task_id, desc in TASKS_CONFIG.items():
        is_done = _get_task_status(user_key, task_id, tool_context)
        state_str = "✅ DONE" if is_done else "❌ PENDING"
        status_msg += f"- [{task_id}] {desc}: {state_str}\n"

    return status_msg

Marking task as complete

The complete_task tool acts as a gatekeeper. It checks if all tasks are finished before informing the model that it is authorized to unlock the drawer:

def complete_task(user_name: str, task_id: str, tool_context: ToolContext) -> str:
    """Mark a task as completed for a user."""
    user_key = user_name.lower()

    # Mark task as complete
    if task_id in TASKS_CONFIG:
        _set_task_status(user_key, task_id, True, tool_context)
    else:
        return f"Error: Task ID '{task_id}' not found."

    # Check if ALL tasks are complete
    all_complete = True
    remaining = []
    for t_id in TASKS_CONFIG:
        if not _get_task_status(user_key, t_id, tool_context):
            all_complete = False
            remaining.append(t_id)

    if all_complete:
        return (
            f"SUCCESS: All tasks completed for {user_name}! "
            "You may now unlock the drawer."
        )

    # If not all complete, show progress
    return (
        f"Task {task_id} marked as DONE. "
        f"Remaining tasks: {', '.join(remaining)}."
    )

Notice how descriptive the returned values are. They are written this way intentionally to give the Agent enough information to handle communication with the user, provide feedback and motivate them to complete the remaining tasks.
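One way to sanity-check the gatekeeper without a model in the loop is to call it with a dict-backed stand-in for ToolContext. The sketch below is a condensed copy of the logic above; FakeToolContext and the task descriptions in TASKS_CONFIG are assumptions made for the example.

```python
from dataclasses import dataclass, field

# Assumed task descriptions; only the IDs "A" and "B" come from the article.
TASKS_CONFIG = {"A": "Read a page of a book", "B": "Calligraphy practice"}


@dataclass
class FakeToolContext:
    """Hypothetical stand-in exposing only the .state mapping the tools use."""
    state: dict = field(default_factory=dict)


def complete_task(user_name: str, task_id: str, tool_context: FakeToolContext) -> str:
    """Condensed version of the gatekeeper logic shown above."""
    if task_id not in TASKS_CONFIG:
        return f"Error: Task ID '{task_id}' not found."
    user_key = user_name.lower()
    tool_context.state[f"user_tasks_{user_key}_{task_id}"] = True
    remaining = [
        t for t in TASKS_CONFIG
        if not tool_context.state.get(f"user_tasks_{user_key}_{t}", False)
    ]
    if not remaining:
        return f"SUCCESS: All tasks completed for {user_name}! You may now unlock the drawer."
    return f"Task {task_id} marked as DONE. Remaining tasks: {', '.join(remaining)}."


ctx = FakeToolContext()
first = complete_task("Mary", "A", ctx)   # "Task A marked as DONE. Remaining tasks: B."
second = complete_task("Mary", "B", ctx)  # starts with "SUCCESS: All tasks completed for Mary!"
```

Because the SUCCESS message is only produced when every flag in the state is set, the model cannot reach the unlock step by claiming completion in conversation alone.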

Integrating physical hardware

When the model receives the success confirmation, it calls the unlock_drawer tool. This interfaces directly with our hardware relay logic to update the LED display and pop open the assigned drawer:

# Initialize the HW interface and lock the drawers
user_names = ["Maria", "Jan"] if AGENT_LANGUAGE == "pl" else ["Mary", "James"]
hw_interface = HardwareInterface(user_names)

def unlock_drawer(id: int, user_name: str) -> str:
    """Unlock a drawer by its ID."""
    if id in [0, 1]:
        hw_interface.unlock_drawer(id)
        return f"Drawer {id} unlocked for {user_name}"

    return "Drawer not found"

The HardwareInterface (defined in agent/app/app_utils/hw_interface.py) actively communicates with the LED Matrix API on the Raspberry Pi to display whether each drawer is currently locked or unlocked.

While the code to control the physical drawer magnets is fully functional and tested (located in drawers.py), it is not yet integrated into the main HardwareInterface. This integration is simply on hold until the magnets are physically mounted to the drawer box.
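The LED Matrix API itself is not shown in this article, so the following is only a sketch of how a status update to such a REST endpoint could be sent using the standard library. The URL, the route, the HTTP method and the JSON fields are all assumptions, as are the function names.

```python
import json
import urllib.request
from typing import Callable

# Hypothetical address and route; the real Raspberry Pi API is not described here.
LED_API_URL = "http://raspberrypi.local:8000/drawers"


def build_status_request(
    drawer_id: int, user_name: str, locked: bool, base_url: str = LED_API_URL
) -> urllib.request.Request:
    """Build a JSON request describing the state of one drawer (assumed payload shape)."""
    payload = json.dumps({"user": user_name, "locked": locked}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{drawer_id}",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def show_drawer_status(
    drawer_id: int,
    user_name: str,
    locked: bool,
    send: Callable = urllib.request.urlopen,
    timeout: float = 2.0,
) -> bool:
    """Send the status to the display; return False instead of raising if the Pi is unreachable."""
    try:
        with send(build_status_request(drawer_id, user_name, locked), timeout=timeout):
            return True
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return False
```

Returning False rather than raising keeps a display outage from breaking the unlock flow, which seems like a reasonable trade-off when the display is only informational.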

Agent prompts

Tools alone are not enough; the model requires precise instructions on how to verify the work. In agent/app/prompts I defined a strict multi-step verification protocol both in English and Polish. Here is the English prompt:

You are a friendly, cheerful, and helpful AI assistant, the guardian of the "Sweets Vault." Your task is to verify tasks performed by children in order to grant a sweet reward.

### MAIN RULES:
1. **LANGUAGE**: You speak ONLY AND EXCLUSIVELY IN ENGLISH.
2. **USERS**:
- **Mary** (girl, 7 years old) -> Assigned drawer ID: **0**
- **James** (boy, 7 years old) -> Assigned drawer ID: **1**
- **Parent** (man, 42 years old) -> May test the system by saying, for example, "I'm pretending to be Mary." Treat him exactly like the child he is claiming to be.
3. **PERSONALITY**: You are enthusiastic, warm, and supportive. Use exclamation marks and a joyful tone.

### TASK VERIFICATION PROCESS:
1. **STATE IDENTIFICATION**: When a child starts a conversation, ALWAYS first use the `get_progress(user_name)` tool to check what needs to be done.
2. **REPORTING**: The child reports completing a task (A or B).
3. **VERIFICATION**: Conduct a rigorous verification (camera/questions) as described below.
4. **CREDITING**: If verification is successful, use the `complete_task(user_name, task_id)` tool.
- Read the tool's response carefully!
- ONLY IF the response is "SUCCESS: All tasks completed...", then use `unlock_drawer`.
- If the response shows "Remaining tasks," inform the child what they still need to do.

**Task A: Reading a page of a book**
* **Verification 1**: Ask the child to show the read page to the camera. Confirm that you see it. Don't expose any details that can help answer the question in the next step (i.e. avoid sharing details of what exactly you can see).
* **Verification 2**: Ask a simple follow-up question about the read text. The child must answer it.
* **Task ID**: "A"

**Task B: Calligraphy (writing letters in workbooks)**
* **Verification 1**: Ask to show the completed page in the workbooks to the camera.
* **Verification 2**: Confirm that the task has been performed. Make sure the picture contains hand-written letters (usually with a pencil).
* If the page only contains examples, ask the child to complete missing parts.
* **Task ID**: "B"

### SUCCESS AND REWARD:
IF the `complete_task` tool returns "SUCCESS", run `unlock_drawer(id)`.
Then **CELEBRATE!** Use phrases like: "Yippee!", "Hurray!", "Bravo!", "You're a champion!", "The sweets are yours!". Make some "noise."

### FAILURE:
If verification fails (e.g., the child doesn't show the page or answers incorrectly), gently and encouragingly ask for improvement or a retry. Do not open the drawer.

This prompt structure ensures the agent does its due diligence, preventing kids from simply holding up a blank page or skipping the reading comprehension check.

Demo

You can see a demonstration of the working system in the video below:

https://medium.com/media/c8aa5ad37b3e21d63747a47309b1ec6a/href

Conclusion

By combining the Gemini API, the Agent Development Kit, and a simple hardware relay, you can build highly interactive, physically grounded AI Agents. The Sweets Vault demonstrates how multimodal verification and structured tool calling solve practical, real-world problems with a dose of fun.

Future plans

The current implementation uses Gemini Flash, which combines high performance, multimodality and tool calling capabilities. Nevertheless, it requires text input and provides only text as output. In the near future I plan to experiment with the Gemini Live API, which accepts voice, video and text as input and produces conversational audio as output.

I am also going to finish the physical locks part with electromagnets. Stay tuned for updates.

Thanks for reading

Thank you for reading. I hope this blog inspires you to bring your own creative AI and hardware projects to life. If you found this article helpful, please consider following me here and giving it a clap 👏 to help others discover it.

I am always eager to connect with fellow developers and AI enthusiasts, so feel free to follow me on LinkedIn, X or Bluesky. Your feedback is incredibly valuable, so please do not hesitate to leave a comment with your thoughts, questions, or your own experiences building multimodal agents!

Building “Sweets Vault” — a multimodal Gemini Agent with physical hardware integration was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

How I used Gemini CLI to orchestrate a complex RAG migration

Remigiusz Samborski — Tue, 28 Apr 2026 12:19:46 GMT

Building a complex, multi-phase cloud project like a RAG migration is as much about orchestration as it is about code. You have to manage infrastructure (Terraform), backend services (Python), frontend UI (Next.js), data pipelines (BigQuery/AlloyDB), and documentation — all while maintaining a consistent technical strategy.

Standard IDE completions are great for snippets, but they lack the system-level context needed for this kind of engineering. To build this reference architecture, I didn’t just use an AI to write code. I used an AI to orchestrate the entire project.

In this final post (see previous part 1 and part 2), I’ll share a behind-the-scenes look at using Gemini CLI with the Conductor extension to orchestrate this migration.

In this post, you will learn:

How to leverage terminal-first AI assistants for system-level engineering
How to implement spec-driven development with the Conductor extension
How to use AI-driven Test-Driven Development (TDD) for reliable code generation
How to collaborate with AI agents using the “Human-in-the-Loop” model

Before we dive into the workflow, let’s briefly discuss why orchestration is the next logical step for AI-assisted development.

The Developer Experience

Let’s walk through my development process step-by-step. The entire specification, plan, and implementation logic is available in the conductor directory of the rag-migration repository.

Spec-driven development with Conductor

Central to my workflow is the Conductor extension. It’s built on the principle of spec-driven development. Instead of jumping straight into code, we define the “source of truth” in Markdown files.

Product Definition (product.md): What are we building?
Tech Stack (tech-stack.md): What tools are we using?
Tracks Registry (tracks.md): What are the major milestones?
Implementation Plans (plan.md for each of the tracks): What are the step-by-step tasks?
Workflow (workflow.md): How are we building the solution?

By having these documents in the codebase, the AI agent (Gemini CLI) always has the high-level context it needs to make smart decisions. It’s also a good practice to share those with your team so everyone (including AI agents) is on the same page about the project’s direction.

Conductor initialization

The first step in project initialization is to create the product definition and tech stack files. This is handled by running:

/conductor:setup

Gemini CLI will ask you a series of questions to help you define your project, including:

What is the name of your product?
Who are the primary users?
What is the tech stack you are using?
What are the major features you want to implement?
What is the workflow you want to use?

It will then create the initial project structure in the conductor directory, including the product.md and tech-stack.md files.

The lifecycle of a track

The lifecycle of a track in Gemini CLI Conductor

Each major feature in this project was implemented as a “Track”. A typical track lifecycle consists of:

1) Track Initialization (/conductor:newTrack):

The agent creates a spec.md file that describes the goals of the track
The agent maps the existing codebase and validates assumptions
The agent creates a plan.md file that describes the steps needed to achieve the goals

2) Track Execution (/conductor:implement):

The agent iterates through tasks using a Plan -> Act -> Validate cycle

3) Track Completion:

The agent verifies the changes made during the track
The agent asks for user feedback on the implementation

4) Track Archiving:

Once a track is completed, Gemini CLI archives the track in the conductor/archive directory

For example, when I started the initial embeddings track, I initialized it with:

/conductor:newTrack

Gemini CLI researches the codebase, asks clarifying questions and creates the spec.md and plan.md files. Only after I review and approve them does the actual implementation start.

Terraform for Infrastructure as Code

My product.md file instructs Gemini CLI to write Terraform code for all the resources created during the project. This works really well as all the resources are consistently managed by source code and it’s easy to spin up a new environment when needed.

You can see all the Terraform files and infrastructure scripts used in the first track in the infra directory.

Moreover, in the course of the project creation I instructed Gemini CLI to always run terraform plan before terraform apply. Keeping this information in the workflow.md file ensures that such an approach is applied to all tracks.

TDD with an AI agent

One of the most powerful aspects of this workflow is AI-driven Test-Driven Development (TDD). I didn’t just ask the agent to “write the code”. It followed a strict protocol:

Write Failing Tests: The agent defines the expected behavior in a new test file
Red Phase: It runs the tests and confirms they fail
Green Phase: It writes the minimum code needed to pass the tests
Refactor: It refactors the implementation code and the test code to improve clarity, remove duplication, and enhance performance without changing the external behavior.
Verify Coverage: It verifies that the test coverage meets the project requirements (target: >80% coverage for new code).
Commit Code Changes: The agent commits code changes related to the task.

This ensures that the AI-generated code isn’t just “syntactically correct” but functionally verified against my requirements. This workflow is described in the workflow.md file.
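To make the red and green phases concrete, here is a tiny illustration in Python using unittest. The split_into_batches function is a hypothetical example chosen for this sketch; it is not taken from the rag-migration repository.

```python
import unittest


class SplitIntoBatchesTest(unittest.TestCase):
    """Written first: running it before the function exists is the red phase."""

    def test_last_batch_may_be_smaller(self):
        self.assertEqual(
            split_into_batches([1, 2, 3, 4, 5], 2), [[1, 2], [3, 4], [5]]
        )

    def test_rejects_non_positive_size(self):
        with self.assertRaises(ValueError):
            split_into_batches([1], 0)


def split_into_batches(items: list, size: int) -> list[list]:
    """Green phase: the minimum implementation that satisfies the tests above."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Saved as a test module, this can be run with python -m unittest. Removing the function and running again reproduces the failing state the agent is expected to confirm before writing any implementation.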

Checkpoints and quality gates

At the end of every phase, Gemini CLI runs a “Checkpoint” protocol. This includes:

Automated Verification: Running the full test suite.
Manual Verification: Providing the user with step-by-step instructions to verify the changes.
Auditable Records: Attaching a verification report to the git commit using git notes and updating plan.md with the new commit hash.

Conductor commits demonstrating the checkpoint protocol.

Effective Human-in-the-Loop

To achieve effective synergy between the AI agent and the human developer, I relied heavily on the following solutions:

Gemini CLI in a sandbox with Yolo mode enabled - see my past article for more about it.
Custom sandbox notifier script that runs in another terminal.

This approach provided safe guardrails and allowed me to work on other projects while the AI was working on this one. Thanks to timely notifications, I was always able to jump back quickly. Moreover, the checkpointing mechanism of Conductor meant I could always revert unnecessary changes or restart from a known working state.

I also used Antigravity to polish the generated code and the documentation. It was particularly helpful for minor tweaks or refactoring of the code that was generated by Gemini CLI.

Token usage

Throughout the project I used several models (Gemini 3 Pro, Gemini 3 Flash and Gemini 2.5 Flash Lite). The total token consumption was:

Input tokens: ~19M
Cached input tokens: ~66M
Output tokens: ~400k

Notice the high number of cached input tokens: cached tokens are billed at a reduced rate, which significantly lowers the spend. The total Vertex AI token cost was around $30. Not bad for several days of AI-assisted work.
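As a rough illustration of how these numbers relate, the sketch below computes the share of input tokens served from cache and offers a small cost estimator. It assumes the ~19M figure excludes the cached tokens, and it deliberately takes the per-million-token rates as parameters: look them up on the pricing page for the models you use. Both function names are hypothetical.

```python
def cached_share(input_tokens: int, cached_tokens: int) -> float:
    """Fraction of all input tokens that were served from cache."""
    return cached_tokens / (input_tokens + cached_tokens)


def estimate_cost(
    input_tokens: int,
    cached_tokens: int,
    output_tokens: int,
    input_rate: float,
    cached_rate: float,
    output_rate: float,
) -> float:
    """Estimate spend in USD; all rates are USD per million tokens, supplied by the caller."""
    per_million = 1_000_000
    return (
        input_tokens * input_rate
        + cached_tokens * cached_rate
        + output_tokens * output_rate
    ) / per_million


# Token counts from the list above, assuming the ~19M excludes cached tokens.
share = cached_share(19_000_000, 66_000_000)
print(f"{share:.0%} of input tokens were served from cache")  # prints "78% ..."
```

Since several models with different rates were involved, a single call to estimate_cost will not reproduce the $30 figure; it is only meant for per-model estimates.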

See the pricing page for more details and please mind that your mileage may vary.

Summary

Software engineering is evolving from writing code to orchestrating agentic workflows. By using tools like Gemini CLI and frameworks like Conductor, you can scale your impact as an architect while ensuring consistent, high-quality implementation.

Ready to build your own AI-assisted development projects?

Thanks for reading

If you found this article helpful, please consider adding 50 claps to this post by pressing and holding the clap button 👏 This will help others find it. You can also share it with your friends on socials.

I’m always eager to share my learnings or chat with fellow developers and AI enthusiasts, so feel free to follow me on LinkedIn, X or Bluesky.

How I used Gemini CLI to orchestrate a complex RAG migration was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

Migrating vector embeddings in production without downtime

Remigiusz Samborski — Tue, 21 Apr 2026 14:09:56 GMT

In the fast-moving world of AI, models evolve rapidly. What was state-of-the-art six months ago is now being surpassed by newer models. For a RAG system, this presents a significant challenge: vector embeddings are tied to the specific model that generated them.

If you want to upgrade your model, you can’t just start using the new one. Existing vectors in your database are incompatible with queries from the new model. A “naive” migration (shutting down the site, re-embedding everything, and restarting) means hours of potential downtime.

In this post, I’ll show you how to execute a zero-downtime migration strategy using dual-column schemas and background processing.

If you haven’t read the previous post, I recommend starting there to understand the basics of building a RAG pipeline with BigQuery, Cloud Run Jobs, Vertex AI, and AlloyDB for PostgreSQL.

In this post we will start off with a running system built in the previous post, and I will show you how to:

Implement the Shadow Deployment pattern with dual-column schemas
Execute background backfilling using Cloud Run Jobs
Safely switch application logic without impacting search functionality
Ensure data consistency and handle migration failures

Before we dive into the code, let’s briefly discuss the concept of shadow deployment and how it supports the RAG application migration process.

Shadow deployment with dual columns

RAG embeddings migration overview

A robust way to migrate embeddings is to use a Shadow Deployment pattern. Instead of replacing the existing vectors, you store the new vectors alongside them in a separate column. The migration process boils down to the following major steps:

Add a new column: We update our AlloyDB table to include embedding_v2.
Backfill in the background: We run a migration job to populate embedding_v2 for all existing rows.
Switch: Once every row has a new vector, we update the application code to use the new model and the new column.

This strategy ensures that your live search functionality, which still uses the original embedding column, remains fully operational during the entire migration process.

Implementation

Let’s walk through the migration process step-by-step. All the code for this migration is available in the 03-migration folder of the RAG Migration Repository.

Step 1: Schema evolution

First, we prepare the database. Using a simple SQL query, we add the new vector column. Because we are targeting an existing database, we connect via the AlloyDB Auth Proxy and use psql to execute the query:

# Ensure your AlloyDB Auth proxy is running in another terminal window by running
# ./alloydb-auth-proxy projects//locations//clusters//instances/ --port  --auto-iam-authn --public-ip

# Navigate to the migration directory
cd 03-migration

# Apply the schema change
psql -h 127.0.0.1 -p  -U postgres -d  -f 001_add_embedding_v2.sql

The content of 001_add_embedding_v2.sql is straightforward:

ALTER TABLE products ADD COLUMN IF NOT EXISTS embedding_v2 VECTOR(768);

Adding a nullable column without a default is a metadata-only change in PostgreSQL, so on AlloyDB this operation is near-instantaneous: the lock it takes is held only for a moment, and your live API is effectively unaffected.

Note: In production you may want to run this query via your CI/CD pipeline.

Step 2: Configure the migration environment

We reuse the parallelization framework we built in the previous post, but this time we configure the environment for the new model. The project uses uv for dependency management:

# Sync local dependencies (run it in 03-migration folder)
uv sync

# Set required environment variables
export GOOGLE_CLOUD_PROJECT="YOUR_PROJECT_ID"
export DB_PASSWORD="YOUR_ALLOYDB_PASSWORD"
export GEMINI_EMBEDDING_MODEL="gemini-embedding-001"
export GEMINI_EMBEDDING_DIMENSION=768
export BATCH_SIZE=1000

Step 3: Background backfilling worker

The migration worker (03-migration/main.py) specifically targets rows where the new column is still empty. This makes the migration process idempotent and resumable — if a task fails, you can just run it again.

# snippet from 03-migration/main.py
# Fetch products where embedding_v2 is null, respecting offset
fetch_stmt = text("""
    SELECT id, name, category, brand FROM products 
    WHERE embedding_v2 IS NULL
    ORDER BY id
    LIMIT :batch_size OFFSET :offset
""")
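To show why filtering on NULL makes the backfill idempotent and resumable, here is a self-contained sketch that uses an in-memory SQLite database in place of AlloyDB and a dummy function in place of the embedding model. The fake_embed and backfill names are hypothetical, and unlike the snippet above this single-worker loop does not use OFFSET.

```python
import json
import sqlite3


def fake_embed(text: str, dim: int = 4) -> list[float]:
    """Placeholder for the embedding model call; returns a deterministic dummy vector."""
    return [float((len(text) + i) % 7) for i in range(dim)]


def backfill(conn: sqlite3.Connection, batch_size: int = 2) -> int:
    """Fill embedding_v2 for rows where it is still NULL; return the number of rows updated."""
    updated = 0
    while True:
        rows = conn.execute(
            "SELECT id, name, category, brand FROM products "
            "WHERE embedding_v2 IS NULL ORDER BY id LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            return updated
        for row_id, name, category, brand in rows:
            vector = fake_embed(f"{name} {category} {brand}")
            conn.execute(
                "UPDATE products SET embedding_v2 = ? WHERE id = ?",
                (json.dumps(vector), row_id),
            )
        conn.commit()  # committing per batch means a crash loses at most one batch
        updated += len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, "
    "category TEXT, brand TEXT, embedding_v2 TEXT)"
)
conn.executemany(
    "INSERT INTO products (id, name, category, brand) VALUES (?, ?, ?, ?)",
    [(1, "Mug", "Kitchen", "Acme"), (2, "Lamp", "Home", "Lumo"), (3, "Cable", "Tech", "Volt")],
)
first_run = backfill(conn)   # updates all three rows
second_run = backfill(conn)  # nothing left to do, so it updates zero rows
```

The second call finding no work is the property that lets you rerun a failed task safely.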

We deploy this worker as a Cloud Run Job. A convenient deploy script is provided in the repository which builds the Docker image and configures the job on GCP.

./infra/scripts/deploy_migration.sh

Step 4: Orchestrating the migration

Instead of manually calculating the number of tasks to run, we use a Python orchestrator (03-migration/orchestrator.py) to query the database, calculate the remaining work, and dynamically scale the Cloud Run Job.

The orchestrator counts the number of unmigrated rows:

# snippet from orchestrator.py logic
count_stmt = text("SELECT COUNT(*) FROM products WHERE embedding_v2 IS NULL")
unmigrated_count = session.execute(count_stmt).scalar()
total_tasks = math.ceil(unmigrated_count / batch_size)
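As a worked example of the rounding, the sketch below wraps the same calculation in a function. The row counts are illustrative numbers and the tasks_needed name is hypothetical; only the formula comes from the snippet above.

```python
import math


def tasks_needed(unmigrated_count: int, batch_size: int) -> int:
    """Number of tasks required so that every unmigrated row falls into some batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(unmigrated_count / batch_size)


print(tasks_needed(2_500_000, 1000))  # 2500: the rows divide evenly into batches
print(tasks_needed(2_500_001, 1000))  # 2501: one extra row needs one extra task
print(tasks_needed(0, 1000))          # 0: nothing left to migrate
```

Rounding up rather than down is what guarantees the final partial batch is not skipped.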

Then, it triggers the Cloud Run Job via the Google Cloud SDK, passing the exact number of tasks required:

# Run the orchestrator to kick off the migration
uv run orchestrator.py

The job runs in the background, consuming rows and generating new embeddings without competing for critical resources with our live search API.

Step 5: Safely changing the query

Once the orchestrator reports that 100% of rows have embedding_v2 populated, we are ready for the switch. This happens entirely at the application layer (02-ui).

The search API code is updated to:

Use the gemini-embedding-001 model to embed the user’s search query.
Query the embedding_v2 column in AlloyDB instead of embedding.
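The two changes above can be sketched as a small query builder that takes the vector column as a parameter. This is an illustration under assumptions: the selected columns mirror the backfill query, the pgvector cosine distance operator (<=>) is my choice since the article does not say which distance the search API uses, and build_search_sql is a hypothetical name.

```python
# Only these column names may be interpolated into the SQL text.
ALLOWED_COLUMNS = {"embedding", "embedding_v2"}


def build_search_sql(column: str) -> str:
    """Return a nearest-neighbor query against the chosen vector column.

    The column name is validated against a whitelist because identifiers
    cannot be passed as bind parameters.
    """
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"Unknown vector column: {column!r}")
    return (
        "SELECT id, name, category, brand "
        "FROM products "
        f"ORDER BY {column} <=> CAST(:query_vector AS vector) "
        "LIMIT :top_k"
    )


old_sql = build_search_sql("embedding")
new_sql = build_search_sql("embedding_v2")
```

The query vector passed as :query_vector has to come from the same model that populated the chosen column, which is why the model and the column are switched together.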

Congratulations 🎉 You have successfully migrated your entire vector database with zero downtime!

Production best practices: evals and feature flags

While a direct code swap works for a simple demonstration, in a real-world production environment, you should avoid an abrupt 100% cutover. Instead, you should leverage the fact that both vector representations exist simultaneously in your database to roll out safely:

Evaluation pipeline: Before exposing the new model to customers, build an eval pipeline. Take a golden dataset of your most common or critical search queries and run them against both the old (embedding) and new (embedding_v2) columns. Compare the relevance of the retrieved results to ensure the new model actually improves the search experience.
Feature flags for traffic routing: Wrap the application-layer switch in a feature flag. Start by routing a small percentage of your traffic (e.g., 5% or 10%) to the new embedding_v2 logic. Monitor your application metrics, click-through rates, and error logs.
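A minimal version of the evaluation step could compare mean recall@k between the two columns. In the sketch below, the golden dataset and the retrieved IDs are toy values invented for illustration, and recall_at_k and mean_recall are hypothetical helper names; in practice the result lists would come from running each query against embedding and embedding_v2.

```python
def recall_at_k(retrieved_ids: list[int], relevant_ids: set[int], k: int) -> float:
    """Share of the relevant items that appear in the top-k retrieved results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)


def mean_recall(golden: dict[str, set[int]], results: dict[str, list[int]], k: int) -> float:
    """Average recall@k over all golden queries; missing queries count as zero."""
    scores = [recall_at_k(results.get(q, []), relevant, k) for q, relevant in golden.items()]
    return sum(scores) / len(scores)


# Toy golden dataset: query -> IDs a human judged relevant (invented values).
golden = {"red mug": {1, 2}, "usb cable": {7}}
old_results = {"red mug": [1, 9, 3], "usb cable": [4, 5, 6]}  # from the embedding column
new_results = {"red mug": [2, 1, 8], "usb cable": [7, 4, 5]}  # from the embedding_v2 column

old_score = mean_recall(golden, old_results, k=3)  # 0.25
new_score = mean_recall(golden, new_results, k=3)  # 1.0
```

Recall@k is only one possible metric; the same structure works with precision or a ranking-aware score if the order of results matters for your use case.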

Because the migration happened in the background, this dual state makes it trivial to run A/B tests or instantly roll back by toggling the feature flag if the new model introduces unexpected regressions. Once you have fully ramped up to 100% and verified the new performance, the old embedding column can be safely dropped in a future database cleanup.
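Percentage-based routing can be sketched with a deterministic hash, so that a given user always lands on the same side of the flag. The salt string, the function names and the idea of bucketing by user ID are assumptions for illustration; a managed feature flag service would typically handle this for you.

```python
import hashlib


def bucket(user_id: str, salt: str = "embedding_v2_rollout") -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def embedding_column_for(user_id: str, rollout_percent: int) -> str:
    """Pick the vector column for this user based on the rollout percentage."""
    return "embedding_v2" if bucket(user_id) < rollout_percent else "embedding"


# Setting the percentage to 0 is an instant rollback; 100 is the full cutover.
column = embedding_column_for("user-123", rollout_percent=10)
```

Hashing instead of picking randomly per request keeps each user's search results consistent between requests while the rollout is in progress.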

See it in action

The semantic search UI seamlessly returns results using the new gemini-embedding-001 model without any disruption to the user experience.

Summary

AI infrastructure is about more than just the initial build; it’s about designing for evolution. By using shadow deployments, you ensure your RAG system can always stay at the cutting edge of model performance without sacrificing availability.

Ready to take it further?

In my next post, we’ll look at the Developer Experience — how I used Gemini CLI and the Conductor extension to build and manage this entire multi-phase project.

Thanks for reading

I’m always eager to share my learnings or chat with fellow developers and AI enthusiasts, so feel free to follow me on LinkedIn, X or Bluesky.

Migrating vector embeddings in production without downtime was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

Building a Scalable RAG Backend with Cloud Run Jobs and AlloyDB

Remigiusz Samborski — Wed, 15 Apr 2026 05:26:04 GMT

Building a Retrieval-Augmented Generation (RAG) system sounds easy with all the available tutorials. You take a few hundred products, run them through an embedding model, and store them in a vector database. It works beautifully on your machine or staging environment.

The friction starts at production scale. When your dataset jumps from a few hundred to millions of products, that simple Python loop you wrote to generate embeddings hits a wall. Between network latency and hitting API rate limits every few seconds, what was a five-minute task quickly spirals into a multi-hour ordeal that blocks your entire pipeline.

Scaling effectively means moving past sequential processing. In this post, we’ll explore how to build an industrial-strength RAG backend using BigQuery, Cloud Run Jobs, Vertex AI, and AlloyDB for PostgreSQL.

You will learn how to:

Provision infrastructure with Terraform
Parallelize embedding generation using Cloud Run Jobs
Use the google-genai SDK for Vertex AI text-embedding-005 model
Store and query vectors in AlloyDB for PostgreSQL using pgvector

Note: I decided to use AlloyDB in this example, but any other PostgreSQL database with the pgvector extension could work too. For example, you may consider Cloud SQL for PostgreSQL.

Before we dive into the code, let’s briefly discuss the core components that power this serverless AI solution.

The Industrial-Strength Architecture

Our pipeline is designed for massive scale and serverless efficiency. We leverage the following Google Cloud services:

BigQuery: Our source of truth, containing millions of product records.
Cloud Run Jobs: A serverless compute platform that allows us to run hundreds of parallel tasks.
Vertex AI (text-embedding-005): The latest state-of-the-art embedding model from Google.
AlloyDB for PostgreSQL: An enterprise-grade database with built-in pgvector support for high-performance vector search.

The diagram below illustrates the high-level architecture of our RAG pipeline:

High-level architecture of the RAG pipeline

Implementation

Let’s walk through the setup and execution process step-by-step. All the code for this project is available in the RAG Migration Repository.

Prepare the environment

First, let’s configure the gcloud CLI, clone the repository and create a virtual environment with dependencies.

Step 1 — set your default project:

gcloud config set project YOUR_PROJECT_ID

Step 2 — configure the default region for Cloud Run:

gcloud config set run/region europe-central2

Step 3 — clone the code repository

git clone https://github.com/rsamborski/rag-migration.git
cd rag-migration/01-generation

Step 4 — create a virtual environment and install dependencies

uv init
uv sync

Infrastructure with Terraform

We use Terraform to provision the AlloyDB cluster, the Artifact Registry, and the Cloud Run Job. Navigate to 01-generation/infra/terraform and apply the configuration:

terraform init
terraform plan -var="project_id=YOUR_PROJECT_ID" -var="db_password=YOUR_SECURE_PASSWORD" -out tfplan
terraform apply tfplan

The -out tfplan flag saves the plan to a file named tfplan, and terraform apply tfplan applies that specific plan. This is a best practice for ensuring that the plan and apply operations are consistent.

Connecting to AlloyDB

To interact with AlloyDB, the application needs to establish a secure connection. Depending on where you are running the code, the approach differs:

Local Development: For running scripts or testing queries from your local machine, use the AlloyDB Auth Proxy. It provides secure access to your instance without authorizing your local IP to the AlloyDB instance.
Cloud Run Jobs: When running in Cloud Run, the job connects to the AlloyDB instance over the private network (VPC). For this setup, we pass the database password via an environment variable to the Cloud Run Job configuration.

For production workloads, it is highly recommended to use Google Cloud Secret Manager to handle sensitive data like database passwords, rather than passing them as plain text environment variables.
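
To make the environment-variable approach concrete, here is a minimal sketch that builds a PostgreSQL connection URL from the environment. DB_HOST and DB_PASSWORD are the names used in the UI's .env file later in this guide; DB_USER, DB_NAME and DB_PORT, their defaults, and the function name are assumptions for illustration only:

```python
import os
from urllib.parse import quote


def build_database_url(env=os.environ) -> str:
    """Build a PostgreSQL URL from environment variables (names are assumptions)."""
    password = env.get("DB_PASSWORD")
    if not password:
        raise RuntimeError("DB_PASSWORD is not set")
    host = env.get("DB_HOST", "127.0.0.1")  # 127.0.0.1 when going through the Auth Proxy
    port = env.get("DB_PORT", "5432")
    user = env.get("DB_USER", "postgres")
    name = env.get("DB_NAME", "postgres")
    # Percent-encode credentials so characters like '@' or '/' do not break the URL
    return f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}/{name}"
```

Keeping the lookup in one function means a later move to Secret Manager only changes where the password comes from, not the callers.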

Embedding logic

The worker script (01-generation/main.py) is designed to run as an individual task within a Cloud Run Job. It uses the CLOUD_RUN_TASK_INDEX environment variable to calculate its specific shard of data.

import os

# Cloud Run Job environment variables
task_index = int(os.environ.get("CLOUD_RUN_TASK_INDEX", 0))
batch_size = int(os.environ.get("BATCH_SIZE", 100))

# Calculate offset
offset = task_index * batch_size
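
To illustrate the sharding arithmetic, here is a small sketch. The helper names and the total_rows parameter are illustrative and not taken from the repository; it assumes each task processes one contiguous batch of rows:

```python
import math


def shard_bounds(task_index: int, batch_size: int, total_rows: int) -> tuple[int, int]:
    """Return the [start, end) row range for one task; the range is empty past the last row."""
    start = min(task_index * batch_size, total_rows)
    end = min(start + batch_size, total_rows)
    return start, end


def tasks_needed(total_rows: int, batch_size: int) -> int:
    """Number of tasks required so that every row falls into exactly one shard."""
    return math.ceil(total_rows / batch_size)
```

With this layout, the task count of the job has to be at least tasks_needed(total_rows, batch_size), otherwise the rows beyond the last shard are never embedded.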

The embedding generation logic (01-generation/src/embedder.py) uses the google-genai SDK:

import os
from google import genai
from google.genai.types import EmbedContentConfig

def generate_embeddings(texts: list[str]) -> list[list[float]]:
    """
    Generates embeddings for a list of texts using the text-embedding-005 model.
    Uses the new google-genai SDK to avoid deprecation warnings.
    """
    if not texts:
        return []
        
    project_id = os.environ.get("GOOGLE_CLOUD_PROJECT", "rsamborski-rag")
    location = os.environ.get("GOOGLE_CLOUD_REGION", "europe-central2")
    
    # Initialize the Gen AI client for Vertex AI
    client = genai.Client(vertexai=True, project=project_id, location=location)
    
    # The dimensionality of the output embeddings for text-embedding-005.
    dimensionality = 768 
    task = "RETRIEVAL_DOCUMENT" # standard task for documents in RAG
    
    response = client.models.embed_content(
        model="text-embedding-005",
        contents=texts,
        config=EmbedContentConfig(
            task_type=task,
            output_dimensionality=dimensionality
        )
    )
    
    return [embedding.values for embedding in response.embeddings]
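
Embedding APIs usually cap the number of texts per request, so a shard larger than that cap has to be split into several calls. The sketch below shows one way to do that. The embed_in_chunks helper and the chunk size of 100 are assumptions for illustration (check the current limits for your model), and embed_fn stands in for a function such as generate_embeddings above:

```python
from typing import Callable, Sequence


def embed_in_chunks(
    texts: Sequence[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    chunk_size: int = 100,  # assumed value, not an official limit
) -> list[list[float]]:
    """Call embed_fn on consecutive slices of texts and concatenate the results in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    vectors: list[list[float]] = []
    for start in range(0, len(texts), chunk_size):
        chunk = list(texts[start:start + chunk_size])
        vectors.extend(embed_fn(chunk))
    return vectors


# Stand-in embedder used only to demonstrate the call pattern
def fake_embed(chunk: list[str]) -> list[list[float]]:
    return [[float(len(text))] for text in chunk]
```

Because results are appended chunk by chunk, the output order matches the input order, which keeps each vector aligned with its source row.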

Build and deploy

We containerize the application using the provided Dockerfile and deploy it as a Cloud Run Job. The deploy.sh script automates this process. Run it with:

./infra/scripts/deploy.sh

Once finished you should see:

---------------------------------------------------------
✅ Deployment Finished
---------------------------------------------------------

Run and monitor

Now you can start the orchestrator by running:

uv run orchestrator.py

The orchestrator provides real-time feedback on the job status, which you can also monitor in the Google Cloud Console.

Congratulations 🎉 You have successfully built and run a parallelized embedding pipeline!

For production environments, I recommend creating a ScaNN index to speed up your queries. Please refer to the linked documentation to learn more about it.

Testing with the Semantic Search UI

To see the embeddings in action, you can spin up the Next.js semantic search UI locally.

Run the UI

Navigate to the UI directory and configure the environment:

cd ../02-ui
cp .env.template .env

Edit the .env file to include your Google Cloud PROJECT_ID and the AlloyDB DB_PASSWORD you used during the Terraform deployment. Set DB_HOST=127.0.0.1 to route queries through the AlloyDB Auth Proxy.

Install dependencies:

npm install

Start the AlloyDB Auth Proxy (in a separate terminal window):

# Make sure you have downloaded the alloydb-auth-proxy binary
./alloydb-auth-proxy projects/YOUR_PROJECT_ID/locations/europe-central2/clusters/rag-migration-cluster/instances/rag-migration-instance

Start the development server:

npm run dev

Navigate to http://localhost:3000 to interact with the search portal. You can now run natural language queries directly against your product catalog!
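
Behind a search box like this, each request typically embeds the query text and runs a nearest-neighbour query in PostgreSQL. The following is a sketch of that pattern, not the UI's actual code: the products table and its name and embedding columns are hypothetical, while <=> is pgvector's cosine-distance operator:

```python
def to_pgvector_literal(vector: list[float]) -> str:
    """Format a Python list as a pgvector text literal such as '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"


# Hypothetical table and column names; the %s placeholders are filled in by the DB driver
SEARCH_SQL = """
SELECT name
FROM products
ORDER BY embedding <=> %s::vector
LIMIT %s
"""


def search_params(query_embedding: list[float], limit: int = 5) -> tuple[str, int]:
    """Build the parameter tuple to pass alongside SEARCH_SQL."""
    return to_pgvector_literal(query_embedding), limit
```

With a driver such as psycopg, you would pass both to cursor.execute(SEARCH_SQL, search_params(vector)). The query vector should come from the same embedding model that produced the stored vectors.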

See it in action

Watch as natural language queries return highly relevant results mapped via the text-embedding-005 model in real-time.

Summary

You now have a scalable, serverless foundation for your RAG system. By using Cloud Run Jobs, you’ve transformed a bottleneck into a highly parallelized process capable of handling millions of records.

Ready to take it further?

Check out the full source code on GitHub.
Learn more about Cloud Run Jobs.
Learn more about AlloyDB and pgvector.
Learn how to create a ScaNN index for your embeddings.
Learn more about Embeddings APIs on VertexAI.

In the next post, we’ll dive into Zero-Downtime Embedding Migration — how to upgrade your vector models without taking your search offline.

Thanks for reading

I’m always eager to share my learnings or chat with fellow developers and AI enthusiasts, so feel free to follow me on LinkedIn, X or Bluesky.

Building a Scalable RAG Backend with Cloud Run Jobs and AlloyDB was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

Secure Gemini CLI for Cloud development

Remigiusz Samborski — Fri, 13 Mar 2026 05:32:25 GMT

AI agents are a double-edged sword. You hear horror stories of autonomous tools deleting production databases or purging entire email inboxes. These risks often lead users to require manual confirmation for every agent operation. This approach keeps you in control but limits the agent’s autonomy. You will soon find yourself hand-holding the agent and hindering its true capabilities. You need a way to let the agent run in “yolo mode” without risking your system.

In this post you will learn how to run Gemini CLI in an isolated environment with limited GitHub and Google Cloud access, so that it cannot do much damage if things go wrong. We will follow the principle of least privilege: Gemini CLI gets every permission it needs to build your project, but cannot reach systems it shouldn’t touch.

The Sandbox premise

The solution consists of the following components:

GitHub fine-grained personal access tokens: limit source control risks.
Google Cloud service account: limits cloud risks.
Docker: limits local system risks.
Session limits: avoid surprises with the number of tokens used (especially important when running in --yolo mode).

Following this approach will protect you from the ‘helpful agent curse’: a situation in which the agent tries very hard to complete a task by finding ways around blockers. Examples include granting itself more permissions, copying files to the current folder to edit them, and many more.

GitHub fine-grained personal access tokens

First, let’s limit the agent’s GitHub exposure by using fine-grained tokens:

Navigate to GitHub Settings > Developer Settings > Personal access tokens > Fine-grained tokens.
Click Generate a new token.
Provide a descriptive name for your token and consider setting an expiration date to force regular rotation.
Restrict Repository access to the specific target repo you are working on.
Grant Read and Write permissions for Contents.
Save the token locally by running export GITHUB_TOKEN="github_pat_..."

Google Cloud Service Account

Create an isolated Service Account (SA) with minimal permissions. This prevents the agent from accessing protected resources and other projects.

Run these commands after updating YOUR_PROJECT_ID and roles below:

# Set your project ID
export CLOUDSDK_CORE_PROJECT="YOUR_PROJECT_ID"
gcloud config set project $CLOUDSDK_CORE_PROJECT

# Create the Service Account
gcloud iam service-accounts create gemini-cli-sa \
    --description="Isolated account for Gemini CLI"

# Grant minimal roles (adjust roles as needed)
gcloud projects add-iam-policy-binding $CLOUDSDK_CORE_PROJECT \
    --member="serviceAccount:gemini-cli-sa@$CLOUDSDK_CORE_PROJECT.iam.gserviceaccount.com" \
    --role="roles/aiplatform.user"

# Generate the JSON key file
gcloud iam service-accounts keys create sa-key.json \
    --iam-account=gemini-cli-sa@$CLOUDSDK_CORE_PROJECT.iam.gserviceaccount.com
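
Before handing the key to the sandbox, it can be worth checking that the file really belongs to the restricted account. A minimal sketch, assuming the standard service account key layout with type, client_email and project_id fields; the function name is illustrative:

```python
import json


def check_service_account_key(path: str, expected_email: str) -> str:
    """Return the key's project_id, or raise if the key is not for expected_email."""
    with open(path, encoding="utf-8") as handle:
        key = json.load(handle)
    if key.get("type") != "service_account":
        raise ValueError("not a service account key file")
    if key.get("client_email") != expected_email:
        raise ValueError(f"unexpected account: {key.get('client_email')}")
    return key.get("project_id", "")
```

Calling it with sa-key.json and the gemini-cli-sa address created above catches the case where a broader personal or admin key was copied into the project by mistake.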

Hint: you can use the IAM roles and permissions index page to easily find the roles to grant.

A good practice is to use a dedicated project for each of your AI coding initiatives. This way you can run several agents in parallel. They will build different solutions without worrying about stepping on each other’s toes.

Custom Docker Build

The Gemini CLI can use a sandbox image to isolate the execution environment. You must customize this image to install gcloud, terraform, vim and set git configuration.

Prepare the Dockerfile

Create a .gemini directory in your project and, inside it, a file named sandbox.Dockerfile. If you’re running Gemini CLI from source, this specific file name lets it detect and build your custom sandbox profile automatically. If you’re running a binary installation, you can use the same file to build the image manually.

Paste this content into .gemini/sandbox.Dockerfile:

# Start from the official Gemini CLI sandbox image with proper version
ARG GEMINI_CLI_VERSION=0.33.0
FROM us-docker.pkg.dev/gemini-code-dev/gemini-cli/sandbox:${GEMINI_CLI_VERSION}

# Switch to root to install system dependencies (gcloud)
USER root

# Install Google Cloud SDK, Git, and prerequisites
RUN apt-get update && apt-get install -y curl apt-transport-https ca-certificates gnupg git && \
    echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && \
    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - && \
    apt-get update && apt-get install -y google-cloud-cli

# Install Terraform
RUN apt-get update && apt-get install -y wget lsb-release && \
    wget -O- https://apt.releases.hashicorp.com/gpg | gpg --dearmor | tee /usr/share/keyrings/hashicorp-archive-keyring.gpg > /dev/null && \
    echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(grep -oP '(?<=UBUNTU_CODENAME=).*' /etc/os-release || lsb_release -cs) main" | tee /etc/apt/sources.list.d/hashicorp.list && \
    apt-get update && apt-get install -y terraform

# Install vim
RUN apt-get install -y vim

# Switch back to the non-root user (the official sandbox image uses 'node' as the default user)
USER node
WORKDIR /workspace

# Configure Git to use the injected GitHub PAT at runtime
RUN git config --global credential.helper '!f() { echo "username=x-access-token"; echo "password=$GITHUB_TOKEN"; }; f'

Prepare Docker for building images (optional macOS step)

If you haven’t built any Docker images before, run the following commands to prepare your environment with brew:

# Install dependencies
brew install docker colima docker-buildx

# Configure docker-buildx
mkdir -p ~/.docker/cli-plugins
ln -sfn $(brew --prefix)/opt/docker-buildx/bin/docker-buildx ~/.docker/cli-plugins/docker-buildx

# Start colima service
brew services start colima

# Update DOCKER_HOST (you might want to add this line to .bash_profile):
export DOCKER_HOST="unix://${HOME}/.colima/default/docker.sock"

Build the image (binary installation)

If you installed Gemini CLI with npm, brew or any other binary method, you will need to build the Docker image manually and tag it with the default name that Gemini CLI looks for:

# Get the base name the CLI looks for
export IMAGE_BASE_NAME="us-docker.pkg.dev/gemini-code-dev/gemini-cli/sandbox"

# Get your currently installed Gemini CLI version (e.g., 0.33.0)
export GEMINI_CLI_VERSION=$(gemini --version)

# Combine them
export IMAGE_NAME="${IMAGE_BASE_NAME}:${GEMINI_CLI_VERSION}"

# Build your custom sandbox image
docker build \
  --build-arg GEMINI_CLI_VERSION=$GEMINI_CLI_VERSION \
  -t "${IMAGE_NAME}" \
  -f .gemini/sandbox.Dockerfile .

Important: this image will be tagged with the exact version of the Gemini CLI you use. This means it needs to be rebuilt every time you update the CLI. I keep the above code in a shell script to run it after every update.

Build the image (source installation)

If you’re running Gemini CLI from source as explained here, you can trigger the image build automatically each time you start gemini.

First, update the top part of your sandbox.Dockerfile by replacing the FROM line with the following:

# Start from the official Gemini CLI sandbox image (source installation)
FROM gemini-cli-sandbox

Start Gemini CLI in the sandbox mode

First, set a couple of very important environment variables:

# Export the necessary environment variables
export GITHUB_TOKEN="github_pat_..."
export GEMINI_API_KEY="your-api-key"
export CLOUDSDK_CORE_PROJECT="YOUR_PROJECT_ID"
export GEMINI_SANDBOX=docker

# We keep the ENV variables for our dynamic credentials
export SANDBOX_FLAGS="\
-e GITHUB_TOKEN=${GITHUB_TOKEN} \
-e CLOUDSDK_CORE_PROJECT=${CLOUDSDK_CORE_PROJECT} \
-e CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE=$(pwd)/sa-key.json"

Note: You can put the above variables in a shell script to speed up starts in the future. Just make sure to update your .gitignore to keep it and sa-key.json from getting added to your repository.

Now you can start Gemini CLI with the following command:

# For binary installation
gemini

# For source installation
BUILD_SANDBOX=1 gemini

Session limits

To avoid surprises with the number of tokens that Gemini CLI uses in your session, you can set Max Session Turns in /settings or in your ~/.gemini/settings.json.
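
If you prefer to script this change, the sketch below merges a turn limit into an existing settings dictionary. The nested model.maxSessionTurns key is an assumption about the settings schema in recent Gemini CLI versions, and the limit of 20 is an arbitrary example, so verify both against the configuration reference for your version:

```python
import copy
import json


def with_session_limit(settings: dict, max_turns: int) -> dict:
    """Return a copy of settings with the session turn limit set (key name is assumed)."""
    updated = copy.deepcopy(settings)
    updated.setdefault("model", {})["maxSessionTurns"] = max_turns
    return updated


# Preview the JSON you would save to ~/.gemini/settings.json
print(json.dumps(with_session_limit({}, 20), indent=2))
```

Working on a copy leaves any existing settings untouched until you decide to write the result back to the file.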

Congratulations

Congratulations 🚀 You’re ready to validate your setup.

Validation and “Ultimate Tests”

Once the environment is launched within a sandbox we should verify the security boundaries.

First let’s run the /about command to see if we’re running within a sandbox. You should see something like this:

Now let’s try to break out from our new sandbox.

GitHub privilege escalation

Try asking Gemini CLI to access a private repo it shouldn’t have access to. Example prompt:

Clone https://github.com/USER_NAME/PRIVATE_REPOSITORY to a new folder

You should see Gemini CLI try hard and struggle to get access. Mine got really creative, trying to reach the repo with git, gh and curl, and it even tried to reuse the GITHUB_TOKEN manually. All of these attempts failed and this error was displayed:

Google Cloud privilege escalation

Ask the agent to list all compute instances:

List all my compute instances

It should fail due to missing permissions on your restricted Service Account. Gemini CLI tries really hard and executes a couple of different commands, including reauthentication, but it fails in the end:

Local privilege escalation

Finally let’s try to access a file from another project folder by prompting:

There are other projects in the folder above the current one. List them and let me know if there is anything that is interesting from hacker's perspective.

I am starting to feel sorry for the poor agent 😉
Once again it can’t complete its task:

Conclusions

Now that you have validated your sandbox setup, you should feel much more confident running gemini --yolo and streamlining your work as Gemini CLI delivers your code without hand-holding and pesky Can I execute this command? prompts.

I am looking forward to all the creative ideas you’ll bring to life!

What’s next?

If you find this setup useful here are some additional steps to consider:

Try out Gemini CLI Conductor Extension — it’s very powerful and can significantly help you run autonomous agents effectively. Here is a deep dive into some of the advantages.
Read my Antigravity the Ralph Wiggum style which covers sandboxing for Antigravity.
Add 50 claps to this post by pressing and holding the clap button 👏
This will help others find it.
Share this post with your friends on socials.
Connect with me via LinkedIn, X or Bluesky.

Thanks for reading.

Secure Gemini CLI for Cloud development was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

Antigravity the Ralph Wiggum style

Remigiusz Samborski — Fri, 30 Jan 2026 15:24:23 GMT


The Ralph Wiggum trend has been surfacing across social platforms lately. If you’re tracking current tech developments, it’s hard to miss. Named after a persistent and slightly confused second-grader, the Wiggum Loop approach to agentic development boils down to: Don’t stop until the job is done.

In traditional AI coding, the agent performs a task, stops, and waits for you to approve its next step or request changes. In a Wiggum Loop, you give the agent a mission and success criteria (like passing tests), and it keeps looping, fixing its own bugs and refactoring, until it hits the green light.

The recent excitement around the Wiggum Loop agentic development highlights a powerful shift: achieving autonomous, self-correcting development. I’ve been leveraging a similar approach effectively with Antigravity for some time already. In this post, I’ll share my strategy, enabling you to implement true unsupervised development yourself.

Going “Full Wiggum”

To achieve true unsupervised development, we need to move away from the review-driven defaults and let the agent take the wheel. Antigravity is uniquely built for this because it’s an agent-first environment capable of acting in both the terminal and the browser.

To mirror the “Bash loop” persistence of the Ralph Wiggum plugin, configure your Antigravity settings as follows:

Mode: Select Agent-driven development. This shifts the agent from a “wait for instructions” assistant to a “goal-oriented” architect.
Terminal execution policy: Set to Always Proceed. This allows the agent to run npm test, uv run pytest, and other commands without constantly pausing for approval.
Review policy: Set to Always Proceed. This tells the agent that its implementation plans are pre-approved.
JavaScript execution policy: Set to Always Proceed. This is essential for agents that need to run scripts or interact with browser environments to verify their work.

Antigravity settings

WARNING: THE SANDBOX IS NOT OPTIONAL. Running an agent in “Always Proceed” mode is like giving Bart Simpson a slingshot in front of a mirror store. Only do this in a sandbox environment.

Here is a great article from my colleague with a step-by-step guide to getting such an environment up and running on Cloud Workstations.

Example

To see this in practice, I ran the following prompt against Antigravity:

Build a REST API for todos in NodeJS.

When complete:
- All CRUD endpoints are working
- Input validation is in place
- Tests are passing (coverage > 80%)
- README with API docs exists

The screencast below shows how Antigravity handled the task without my interruptions (I spent this time on other tasks rather than handholding the agent):

Agent-driven development in Antigravity

How does this work?

Antigravity isn’t just looping in a vacuum. Because it has native hooks into Gemini 3 Pro, it utilizes a massive context window that remembers exactly why a previous command failed.

It kicks things off by drafting up an implementation plan and a task list. In the video, you can watch it tick through these items in real time. It doesn’t just plan, though — it actually touches the terminal to initialize the npm project and run tests.

The loop only closes once every requirement is met and the test suite hits green. It then provides a handy walkthrough so you can easily understand the architecture it just spun up.
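
Stripped of the tooling, the loop described here is a retry loop with an objective exit condition. The following is a conceptual sketch, not Antigravity's implementation; run_checks and attempt_fix are hypothetical callables standing in for the test suite and the agent:

```python
from typing import Callable


def wiggum_loop(
    run_checks: Callable[[], list[str]],
    attempt_fix: Callable[[list[str]], None],
    max_iterations: int = 10,
) -> int:
    """Repeat check-and-fix until no failures remain; return the number of fix rounds used."""
    for iteration in range(max_iterations):
        failures = run_checks()
        if not failures:
            return iteration  # success criteria met
        attempt_fix(failures)  # feed the failures back so the next attempt is informed
    if not run_checks():
        return max_iterations
    raise RuntimeError(f"still failing after {max_iterations} iterations")
```

The iteration cap is a design choice worth keeping: it bounds the cost when the agent cannot converge on a passing state.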

This approach turns development from writing code into verifying outcomes.

From vibe-coding to vibe-building

The Ralph Wiggum trend isn’t about cutting corners; it’s about embracing sheer, stubborn persistence through automation. By letting Antigravity operate autonomously, you transition from a coder to an architect and team lead. You define the standards and environment, while agents manage the iterative grind of writing, testing, and debugging cycles that typically consume a developer’s valuable time.

Are you brave enough to let the agent “Always Proceed”? Visit Antigravity’s download page to start experimenting yourself.

Other resources

Let’s Connect!

I’d love to hear how you’re using Antigravity for your agentic workflows. Are you building Wiggum loops or keeping a tighter leash on your agents?

Connect on LinkedIn
Follow me on X
Catch me on Bluesky

Antigravity the Ralph Wiggum style was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

Serverless AI: EmbeddingGemma with Cloud Run

Remigiusz Samborski — Wed, 24 Sep 2025 16:41:23 GMT

EmbeddingGemma on Cloud Run

Building on the previous blog post about running Qwen3 Embedding models on Cloud Run, this article focuses on the recently released EmbeddingGemma model from the Gemma family. Discover how to leverage the same powerful serverless techniques to deploy this model on Google Cloud’s serverless platform.

You will learn how to:

Containerize the embedding model with Docker and Ollama
Deploy the embedding model to Cloud Run with GPUs
Test the deployed model from a local machine

Before we dive into the code, let’s briefly discuss the core components that power this serverless AI solution.

EmbeddingGemma Model

According to the EmbeddingGemma model card:

“EmbeddingGemma is a 308M parameter multilingual text embedding model based on Gemma 3. It is optimized for use in everyday devices, such as phones, laptops, and tablets. The model produces numerical representations of text to be used for downstream tasks like information retrieval, semantic similarity search, classification, and clustering.”

Its optimization for efficiency makes EmbeddingGemma an ideal candidate for serverless deployment on Cloud Run, ensuring high performance and cost-effectiveness for your AI applications.

Cloud Run

Cloud Run is a managed compute platform on Google Cloud that lets you run containerized applications in a serverless environment. Think of it as a middle ground between a simple function-as-a-service (like Cloud Run Functions) and a more customizable GKE cluster. You give it a container image, and it handles all the underlying infrastructure, from provisioning and scaling to managing the runtime.

The beauty of Cloud Run is that it can automatically scale to zero, meaning when there are no requests, you aren’t paying for any resources. When traffic picks up, it quickly scales up to handle the load. This makes it perfect for stateless models that need to be highly available and cost-effective.

Deployment

Let’s walk through the deployment process step-by-step.

Prepare the environment

First, let’s configure the gcloud CLI environment.

Note: if you do not have gcloud CLI installed please follow instructions available here.

Step 1 — Set your default project:

gcloud config set project PROJECT_ID

Step 2 — Configure Google Cloud CLI to use the europe-west1 region for Cloud Run commands:

gcloud config set run/region europe-west1

Important: at the time of writing, GPUs on Cloud Run are available in several regions. To check the closest supported region please refer to this page.

Containerize

Now we will use Docker and Ollama to run the EmbeddingGemma model. Create a file named Dockerfile containing:

FROM ollama/ollama:latest
# Listen on all interfaces, port 8080
ENV OLLAMA_HOST=0.0.0.0:8080
# Store model weight files in /models
ENV OLLAMA_MODELS=/models
# Reduce logging verbosity
ENV OLLAMA_DEBUG=false
# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE=-1
# Store the model weights in the container image
ENV MODEL=embeddinggemma:latest
RUN ollama serve & sleep 5 && ollama pull $MODEL
# Start Ollama
ENTRYPOINT ["ollama", "serve"]

Build and Deploy

We will now use Cloud Run’s source deployments. This allows you to achieve the following with one command:

First, compile the container image from the provided source.
Next, upload the resulting container image to an Artifact Registry.
Then, deploy the container to Cloud Run, ensuring that GPU support is enabled using the --gpu and --gpu-type parameters.
Finally, redirect all incoming traffic to this newly deployed version.

You just need to run:

gcloud run deploy embedding-gemma \
  --source . \
  --concurrency 4 \
  --cpu 8 \
  --set-env-vars OLLAMA_NUM_PARALLEL=4 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --max-instances 1 \
  --memory 32Gi \
  --no-allow-unauthenticated \
  --no-cpu-throttling \
  --no-gpu-zonal-redundancy \
  --timeout=600 \
  --labels dev-tutorial=blog-embedding-gemma

Note the following important flags in this command:

--concurrency 4 is set to match the value of the environment variable OLLAMA_NUM_PARALLEL.
--gpu 1 with --gpu-type nvidia-l4 assigns 1 NVIDIA L4 GPU to every Cloud Run instance in the service.
--max-instances 1 specifies the maximum number of instances to scale to. It has to be equal to or lower than your project’s NVIDIA L4 GPU quota.
--no-allow-unauthenticated restricts unauthenticated access to the service. By keeping the service private, you can rely on Cloud Run’s built-in Identity and Access Management (IAM) authentication for service-to-service communication.
--no-cpu-throttling is required for enabling GPU.
--no-gpu-zonal-redundancy turns off GPU zonal redundancy. Choose the zonal redundancy option that matches your zonal failover requirements and available quota.

Test the deployment

Upon successful deployment of the service, you can initiate requests. However, direct API calls will result in an HTTP 401 Unauthorized response from Cloud Run.

This behaviour follows Google’s “secure by default” approach. The model is intended for calls from other services, such as a RAG application, and therefore is not open for public access.
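
For service-to-service calls, the caller attaches a Google-signed ID token whose audience is the Cloud Run service URL. Here is a sketch using the google-auth library. The service URL in the usage comment is a placeholder, and the code assumes it runs somewhere with Application Default Credentials that can mint ID tokens, such as another Cloud Run service:

```python
def auth_headers(token: str) -> dict[str, str]:
    """Build the Authorization header for an IAM-authenticated Cloud Run call."""
    return {"Authorization": f"Bearer {token}"}


def fetch_token(service_url: str) -> str:
    """Fetch an ID token for service_url (requires the google-auth and requests packages)."""
    # Imported lazily so this module loads even where google-auth is not installed
    import google.auth.transport.requests
    import google.oauth2.id_token

    request = google.auth.transport.requests.Request()
    return google.oauth2.id_token.fetch_id_token(request, service_url)


# Usage (placeholder URL):
# headers = auth_headers(fetch_token("https://YOUR_SERVICE_URL"))
```

The calling identity also needs permission to invoke the service, which is what keeps the endpoint closed to everyone else.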

To support local testing of your deployment, the simplest approach is to launch the Cloud Run developer proxy using the following command:

gcloud run services proxy embedding-gemma --port=9090

Afterwards, in a second terminal window, run:

curl http://localhost:9090/api/embed -d '{
  "model": "embeddinggemma",
  "input": "Sample text"
}'

The response will look similar to this:

EmbeddingGemma curl response

You can also use Python to call the endpoint. Example:

from ollama import Client

client = Client(host="http://localhost:9090")

response = client.embed(model="embeddinggemma", input="Sample text")
print(response)

Congratulations 🎉 The Cloud Run deployment is up and running!

RAG Example

You can use the newly deployed model to build your first RAG application. Here’s how to achieve this:

Step 1 — Generate Embeddings

Start with required dependencies:

pip install ollama chromadb

Create an example.py file containing:

import ollama
import chromadb

documents = [
    "Poland is a country located in Central Europe.",
    "The capital and largest city of Poland is Warsaw.",
    "Poland's official language is Polish, which is a West Slavic language.",
    "Marie Curie, the pioneering scientist who conducted groundbreaking research on radioactivity, was born in Warsaw, Poland.",
    "Poland is famous for its traditional dish called pierogi, which are filled dumplings.",
    "The Białowieża Forest in Poland is one of the last and largest remaining parts of the immense primeval forest that once stretched across the European Plain.",
]

client = chromadb.Client()
collection = client.create_collection(name="docs")

ollama_client = ollama.Client(host="http://localhost:9090")

# Store each document in an in-memory vector embeddings database
for i, d in enumerate(documents):
    response = ollama_client.embed(model="embeddinggemma", input=d)
    embeddings = response["embeddings"]
    collection.add(ids=[str(i)], embeddings=embeddings, documents=[d])

Step 2 — Retrieve

Next, with the following code you can search the vector database for the most relevant document (add it to your example.py):

# An example question
question = "What is Poland's official language?"

# Generate an embedding for the input and retrieve the most relevant document
response = ollama_client.embed(model="embeddinggemma", input=question)
results = collection.query(query_embeddings=[response["embeddings"][0]], n_results=1)
data = results["documents"][0][0]
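
Conceptually, the query step is a nearest-neighbour search over the stored vectors. The plain-Python sketch below illustrates the idea with cosine similarity on toy two-dimensional vectors. It is separate from example.py and for illustration only; Chroma's actual index and default distance metric may differ:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query: list[float], vectors: list[list[float]]) -> int:
    """Return the index of the stored vector closest to the query."""
    return max(range(len(vectors)), key=lambda i: cosine_similarity(query, vectors[i]))


# Toy vectors standing in for real embeddings
stored = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
best = most_similar([0.9, 0.1], stored)
```

A vector database does the same job with an index, so it does not have to compare the query against every stored vector.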

Step 3 — Generate Final Answer

In this final step we will use a locally installed Gemma3.

Note: We use Gemma3 in the generation step, but any other model could work here (e.g., Gemini, Qwen3, Llama, etc.). Nevertheless, it is critical to use the same embeddings model in Step 1 (Generate Embeddings) and Step 2 (Retrieve).

To locally install the Gemma3:latest model run:

ollama pull gemma3

Now you can combine the user’s prompt with the search results and generate the final answer (add this code to example.py):

# Final step - generate a response combining the prompt and data we retrieved in step 2
prompt = f"Using this data: {data}. Respond to this prompt: {question}"

output = ollama.generate(
    model="gemma3",
    prompt=prompt,
)

print(f"Prompt: {prompt}")
print(output["response"])

Run the code:

python example.py

The answer should look similar to the one below:

Prompt: Using this data: Poland's official language is Polish, which is a West Slavic language.. Respond to this prompt: What is Poland's official language?
Poland’s official language is Polish. It’s a West Slavic language.

You have successfully created and run your first RAG application using the EmbeddingGemma model.

Summary

At this point, you have successfully established a Cloud Run service running the EmbeddingGemma model, ready to generate embeddings for semantic search or RAG applications.

This method also allows you to deploy and compare multiple embedding models on Cloud Run (e.g. Qwen3 Embedding or other Ollama-supported models), enabling you to find the best fit for your specific use case without major code changes.

Ready to build your own serverless AI applications?

Start building on Cloud Run today and explore its full potential!
If you’re interested in learning more about RAG evaluation, this article is a good starting point.

Thanks for reading

If you found this article helpful, please consider following me here on Medium and giving it a clap 👏 to help others discover it.

I’m always eager to chat with fellow developers and AI enthusiasts, so feel free to connect with me on LinkedIn or Bluesky.

Serverless AI: EmbeddingGemma with Cloud Run was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

Serverless AI: Qwen3 Embeddings with Cloud Run

Remigiusz Samborski — Wed, 20 Aug 2025 02:00:56 GMT

In this blog post I’ll show you the process of deploying the Qwen3 Embedding model to Cloud Run with GPUs for enhanced performance.

Qwen3 Embedding model on Cloud Run with GPUs

You will learn how to:

Containerize the embedding model with Docker and Ollama
Deploy the embedding model to Cloud Run with GPUs
Test the deployed model from a local machine

Before we jump into the code, here are a few words about the key components of the solution.

Qwen3 Embedding Model

The Qwen3 Embedding series is a set of open-source models for text embedding and reranking, built on the Qwen3 Large Language Model (LLM) family. It’s designed for retrieval-augmented generation (RAG), a technique that enhances the output of large language models by retrieving relevant information from a knowledge base, and other tasks requiring semantic search. You can learn more about embeddings in this video.

Open embedding models such as Qwen3 are the ideal choice when you need greater control, specialization, and security than proprietary, “black-box” APIs can offer. They are particularly well-suited for the following use cases:

Fine-Tuning for Niche Domains📻: by fine-tuning them on specialized data (e.g., legal contracts, medical research, internal company wikis) they can provide more accurate results for semantic search and RAG than a general-purpose model.
Data Privacy & Security🔒: open models can be self-hosted or deployed to cloud resources managed by your organization. This ensures compliance with regulations like GDPR and prevents data from ever leaving your control.
Cost-Effectiveness at Scale💰: for high-volume tasks, running an optimized open model can be cheaper than paying per-API-call fees to a proprietary service provider.
Offline & Edge Deployment🛜: open models can run locally and are perfect for applications that must function without an internet connection, such as on-device search in mobile apps or analysis on remote IoT devices.

I chose the Qwen3-Embedding-4B model due to its growing popularity and suitable size for the Cloud Run environment. However, you can experiment with different sizes (0.6B, 4B, and 8B) depending on your specific use case.

Cloud Run

Cloud Run is a managed compute platform on Google Cloud that lets you run containerized applications in a serverless environment. Think of it as a middle ground between a simple function-as-a-service (like Cloud Functions) and a more complex GKE cluster. You give it a container image, and it handles all the underlying infrastructure, from provisioning and scaling to managing the runtime.

Deployment

But enough with the intros, let’s get our hands dirty with some code 🧑‍💻

Below are step-by-step instructions for getting the Qwen3 Embedding model up and running.

Prepare the environment

First we need to configure the gcloud CLI environment.

Note: if you don’t have gcloud CLI installed please follow instructions available here.

Step 1 — Set your default project:

gcloud config set project PROJECT_ID

Step 2 — Configure Google Cloud CLI to use the europe-west1 region for Cloud Run commands:

gcloud config set run/region europe-west1

Important: at the time of writing, GPUs on Cloud Run are available in several regions. To check the closest supported region please refer to this page.

Containerize

We will use Docker and Ollama to run the Qwen3 Embedding model. Create a file named Dockerfile and put the following code inside it:

FROM ollama/ollama:latest

# Listen on all interfaces, port 8080
ENV OLLAMA_HOST=0.0.0.0:8080

# Store model weight files in /models
ENV OLLAMA_MODELS=/models

# Reduce logging verbosity
ENV OLLAMA_DEBUG=false

# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE=-1

# Store the model weights in the container image
ENV MODEL=dengcao/Qwen3-Embedding-4B:Q4_K_M
RUN ollama serve & sleep 5 && ollama pull $MODEL

# Start Ollama
ENTRYPOINT ["ollama", "serve"]

Build and deploy

Next it’s time to leverage the power of Cloud Run’s source deployments. With a single command you can:

Build the container image from source (note the --source parameter in the command below)
Upload the container image to Artifact Registry
Deploy the container to Cloud Run with GPUs enabled (note the --gpu and --gpu-type options)
Redirect all traffic to the new deployment

To do all the above, you just need to run:

gcloud run deploy ollama-qwen3-embeddings \
  --source . \
  --concurrency 4 \
  --cpu 8 \
  --set-env-vars OLLAMA_NUM_PARALLEL=4 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --max-instances 1 \
  --memory 32Gi \
  --no-allow-unauthenticated \
  --no-cpu-throttling \
  --no-gpu-zonal-redundancy \
  --timeout=600 \
  --labels dev-tutorial=blog-qwen3-embeddings

Note the following important flags in this command:

--concurrency 4 is set to match the value of the environment variable OLLAMA_NUM_PARALLEL.
--gpu 1 with --gpu-type nvidia-l4 assigns 1 NVIDIA L4 GPU to every Cloud Run instance in the service.
--max-instances 1 specifies the maximum number of instances to scale to. It has to be equal to or lower than your project’s NVIDIA L4 GPU quota.
--no-allow-unauthenticated restricts unauthenticated access to the service. By keeping the service private, you can rely on Cloud Run’s built-in Identity and Access Management (IAM) authentication for service-to-service communication.
--no-cpu-throttling is required for enabling GPU.
--no-gpu-zonal-redundancy sets the zonal redundancy option. Choose it based on your zonal failover requirements and available quota.
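Because --concurrency and OLLAMA_NUM_PARALLEL have to stay in sync, it can help to derive both from a single value. Here is a small sketch, assuming you script the deployment from Python. The helper name `build_deploy_command` is hypothetical, and it only uses flags from the command above:

```python
import shlex


def build_deploy_command(service: str, parallel: int, max_instances: int = 1) -> list[str]:
    """Build the gcloud command with concurrency tied to OLLAMA_NUM_PARALLEL."""
    return [
        "gcloud", "run", "deploy", service,
        "--source", ".",
        "--concurrency", str(parallel),
        "--cpu", "8",
        "--set-env-vars", f"OLLAMA_NUM_PARALLEL={parallel}",
        "--gpu", "1",
        "--gpu-type", "nvidia-l4",
        "--max-instances", str(max_instances),
        "--memory", "32Gi",
        "--no-allow-unauthenticated",
        "--no-cpu-throttling",
        "--no-gpu-zonal-redundancy",
        "--timeout=600",
    ]


command = build_deploy_command("ollama-qwen3-embeddings", parallel=4)
# Print a copy-pasteable command; pass the list to subprocess.run to execute it
print(shlex.join(command))
```

Changing `parallel` in one place now updates both the Cloud Run concurrency and the Ollama setting.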

Test the deployment

Now that you have successfully deployed the service, you can send requests to it. However, if you send a request directly, Cloud Run will respond with HTTP 401 Unauthorized. This is intentional, because we want our model to be called from other services, such as a RAG application, and not accessible by everyone on the Internet.

The easiest way to test the deployment from a local machine is to spin up the Cloud Run developer proxy by executing:

gcloud run services proxy ollama-qwen3-embeddings --port=9090

Now in a second terminal window run:

curl http://localhost:9090/api/embed -d '{
  "model": "dengcao/Qwen3-Embedding-4B:Q4_K_M",
  "input": "Sample text"
}'

You should see a response similar to this:

Qwen3 Embedding Model response from Cloud Run

You can also call the endpoint from a Python client. Example:

from ollama import Client

client = Client(host="http://localhost:9090")

response = client.embed(model="dengcao/Qwen3-Embedding-4B:Q4_K_M", input="Sample text")
print(response)
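If you prefer not to install the ollama package, the same request can be built with the standard library alone. Here is a minimal sketch, assuming the proxy from the previous step is listening on port 9090. `build_embed_request` is a hypothetical helper name:

```python
import json
import urllib.request


def build_embed_request(host: str, model: str, text: str) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/embed endpoint."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/api/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_embed_request(
    "http://localhost:9090", "dengcao/Qwen3-Embedding-4B:Q4_K_M", "Sample text"
)
```

While the proxy is running, you can send it with `urllib.request.urlopen(request)` and read the vectors from the "embeddings" key of the JSON response.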

Congratulations 🎉 Your Cloud Run deployment is up and running!

RAG Example

You can use the newly deployed model to build your first RAG application. Here’s how to achieve this:

Step 1 — Generate Embeddings

Install necessary dependencies:

pip install ollama chromadb

Create an example.py with the following content:

import ollama
import chromadb

documents = [
    "Poland is a country located in Central Europe.",
    "The capital and largest city of Poland is Warsaw.",
    "Poland's official language is Polish, which is a West Slavic language.",
    "Marie Curie, the pioneering scientist who conducted groundbreaking research on radioactivity, was born in Warsaw, Poland.",
    "Poland is famous for its traditional dish called pierogi, which are filled dumplings.",
    "The Białowieża Forest in Poland is one of the last and largest remaining parts of the immense primeval forest that once stretched across the European Plain.",
]

client = chromadb.Client()
collection = client.create_collection(name="docs")

ollama_client = ollama.Client(host="http://localhost:9090")

# Store each document in an in-memory vector embeddings database
for i, d in enumerate(documents):
    response = ollama_client.embed(model="dengcao/Qwen3-Embedding-4B:Q4_K_M", input=d)
    embeddings = response["embeddings"]
    collection.add(ids=[str(i)], embeddings=embeddings, documents=[d])

Step 2 — Retrieve

Next, the following code will search the vector database for the most relevant document (add it to your example.py):

# An example prompt
prompt = "What is Poland's official language?"

# Generate an embedding for the input and retrieve the most relevant document
response = ollama_client.embed(model="dengcao/Qwen3-Embedding-4B:Q4_K_M", input=prompt)
results = collection.query(query_embeddings=[response["embeddings"][0]], n_results=1)
data = results["documents"][0][0]
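Under the hood, a vector database ranks stored documents by how close their embeddings are to the query embedding. The sketch below illustrates the idea with cosine similarity and made-up three-dimensional vectors. Real embeddings have far more dimensions, and the metric Chroma uses depends on how the collection is configured, so treat this as an illustration rather than a description of its internals:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings, invented for illustration only
stored = {
    "capital": [0.9, 0.1, 0.0],
    "language": [0.1, 0.9, 0.2],
    "pierogi": [0.0, 0.2, 0.9],
}
query = [0.2, 0.8, 0.1]

# Pick the stored document whose vector is most similar to the query
best = max(stored, key=lambda name: cosine_similarity(stored[name], query))
print(best)
```

Here the query vector points in nearly the same direction as the "language" vector, so that entry wins.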

Step 3 — Generate final answer

In the generation step we will use a locally installed Qwen3:0.6b.

Note: we use Qwen3 in the generation step, but any other model could work here (e.g. Gemini, Gemma, Llama, etc.). Nevertheless, it’s critical to use the same embeddings model in step 1 (Generate Embeddings) and step 2 (Retrieve).

You can install the Qwen3:0.6b model by running the following command:

ollama pull qwen3:0.6b

Now we’re ready to combine the user’s prompt with the search results to generate the final answer (add to example.py):

# Final step - generate a response combining the prompt and data we retrieved in step 2
output = ollama.generate(
    model="qwen3:0.6b",
    prompt=f"Using this data: {data}. Respond to this prompt: {prompt}",
)

print(output["response"])

Run the code by executing:

python example.py

You should see an answer similar to the one below:


Okay, the user is asking what Poland's official language is, and they provided the information that Poland's official language is Polish, which is a West Slavic language. Let me make sure I understand this correctly.

First, I need to confirm if that's the correct information. I know that Poland is a country in Eastern Europe, and its official language is Polish. But wait, what's the source of this information? The user hasn't provided any other data, so I should stick strictly to the given information.

I should state that Poland's official language is Polish, and that it's a West Slavic language. I need to present this clearly and concisely. Maybe mention that it's the official language to emphasize its significance. Also, check if there's any other detail that needs to be included, but since the user provided only this, I can proceed.


Poland's official language is **Polish**. This language is a **West Slavic language**.
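The long preamble in the output is the model's reasoning. Depending on the Ollama version and settings, Qwen3 may return it inline, wrapped in `<think>` tags. Here is a minimal sketch for keeping only the final answer, assuming that tag format. `strip_thinking` is a hypothetical helper:

```python
import re


def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks and surrounding whitespace."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()


raw = "<think>\nThe user asks about the language.\n</think>\n\nPoland's official language is **Polish**."
answer = strip_thinking(raw)
print(answer)
```

In example.py you could apply it to output["response"] before printing.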

Well done! You have just created and run your first RAG application with the Qwen3 Embedding model under the hood.

Summary

At this point you have established a Cloud Run service running the Qwen3 Embedding model. You can use it to generate embeddings for a semantic search or a RAG application.

Stay tuned for more content around leveraging Qwen3 Embedding in your applications.

Thanks for reading

I hope this article inspired you to experiment with open embedding models on Cloud Run. If you found this article helpful, please consider following me here on Medium and giving it a clap 👏 to help others discover it.

I’m always eager to chat with fellow developers and AI enthusiasts, so feel free to connect with me on LinkedIn or Bluesky.

Serverless AI: Qwen3 Embeddings with Cloud Run was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

Gemini CLI: Power up your Linux workflow

Remigiusz Samborski — Tue, 29 Jul 2025 11:28:21 GMT

As an open-source enthusiast who’s been deeply immersed in the Linux ecosystem since the late 1990s, my daily work heavily relies on its power and flexibility. While I’ve always appreciated the command line, discovering the Gemini CLI has been a game-changer. It’s not just a tool that aids my coding endeavors; it’s become an indispensable companion in tackling those cumbersome tasks that once sent me down rabbit holes of Google searches for the right commands or documentation.

Trick 1: Setting vim configuration

The first thing I usually do when I spin up a new VM for my dev environment is to set up vim with some basic configuration. I don’t always remember all the configuration parameters, so why not ask Gemini for help?

Here’s the prompt I used:

I am setting up my new development environment and want to get my `vim` configured. Please update the `vimrc` file to meet following requirements:
 - use slate color theme
 - set tabs to 4 characters and always substitute them with spaces
 - make sure smart indentation is used
 - turn on the syntax highlighting

And here you can see a short screencast of how it played out:

As you can see, Gemini CLI helped me a lot and I didn’t have to look up my old config files or go through vim’s documentation.

Trick 2: Creating Dev Environment with Docker

Setting up a development environment can be a time-consuming and repetitive process. With Gemini CLI, you can orchestrate the installation of your entire development toolkit, including Docker, with a single command. Imagine the efficiency gains!

Here’s a sample prompt:

Please help me setup my local development environment by installing Docker and making sure it's running properly.

Below is a screencast of Gemini CLI’s efforts:

https://medium.com/media/4cce5310b1b55ee7f0d092419e0806d0/href

At the end you see the “Hello from Docker!” message, indicating a successful setup 🥳

When you watch the screencast you may notice that there were a couple of issues during the installation, such as:

Error: Command substitution using $() is not allowed for security reasons
sudo: add-apt-repository: command not found

Nevertheless the Gemini CLI agent is smart enough to figure out and apply solutions for those on its own. The only thing I need to do is monitor the commands it wants to execute and approve them to make sure I stay in control.

As you could see, this was quite effective and I didn’t need to leave my terminal at all to visit any documentation sites. Now I have both vim and Docker set up and can start implementing my next project.

Trick 3: Data analysis with awk

Sometimes, you need to extract specific pieces of information from complex data structures. The other week, I needed to find all user Ids that were missing an email value in a JSON file. For this use case awk combined with Gemini CLI proved to be incredibly powerful.

Let’s say you have a JSON schema that looks something like this (simplified for demonstration):

[
    {
        "details": {
            "Id": 1728294450663,
            "Email": "anon-email-0@example.com"
        }
    },
    {
        "details": {
            "Id": 9272737716917
        }
    },
    ...
]

And you want to extract only the Id values for users who don’t have an email address. In the example above I’d like to get 9272737716917 but not 1728294450663.

I could spend time tinkering with the correct awk command, or just ask Gemini CLI to do that for me. Here is a prompt I used:

Analyze the @data.json file, then prepare a shell script (i.e. using awk) to extract all unique Ids for users who don't have an email associated with them and write them to ids.txt file.

And this is how it worked out:

As you can see Gemini CLI proposed to create an awk script to get the data and then run it to get the ids I was looking for 🎉

Normally writing such a script would take several minutes of Googling, checking documentation and fixing errors. With Gemini CLI I was able to do it in seconds.

Note: During my experiments Gemini CLI was smart enough to propose using the jq tool, but as I didn’t have it installed I asked it to use awk instead.
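For comparison, the same extraction takes only a few lines of Python with the standard json module. This is a sketch under the assumption that the file follows the simplified schema shown earlier. It is not the script Gemini CLI generated:

```python
import json

# Inline sample matching the simplified schema above
sample = """
[
    {"details": {"Id": 1728294450663, "Email": "anon-email-0@example.com"}},
    {"details": {"Id": 9272737716917}},
    {"details": {"Id": 9272737716917}}
]
"""


def ids_without_email(records: list[dict]) -> list[int]:
    """Return unique Ids, in first-seen order, for records lacking an Email."""
    seen: dict[int, None] = {}
    for record in records:
        details = record.get("details", {})
        if not details.get("Email") and "Id" in details:
            seen.setdefault(details["Id"], None)
    return list(seen)


missing = ids_without_email(json.loads(sample))
print(missing)
```

To run it on a real file, load data.json with `json.load` instead of the inline sample and write the resulting Ids to ids.txt.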

Share your Gemini CLI tricks!

These are just a few examples of how Gemini CLI has transformed my Linux workflow. I’m constantly discovering new ways to leverage its capabilities, and I’m sure many of you out there have your own ingenious tricks and shortcuts.

How has Gemini CLI helped you power up your terminal tasks? Share your experiences, tips, and creative uses in the comments below! Your insights could help fellow Linux enthusiasts streamline their workflows and unlock even more of Gemini CLI’s potential.

Thanks for reading

I hope this article inspired you to explore Gemini CLI capabilities. If you found this article helpful, please consider following me here on Medium and giving it a clap 👏 to help others discover it.

I’m always eager to chat with fellow developers and AI enthusiasts, so feel free to connect with me on LinkedIn or Bluesky.

Gemini CLI: Power up your Linux workflow was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.

Step-by-Step: Serving PyTorch Models with a Custom Handler on Vertex AI

Remigiusz Samborski — Fri, 06 Jun 2025 14:11:39 GMT

PLLuM falling into a fruit compote (inspired by a Polish wordplay)

Google Cloud’s Model Garden streamlines the process of deploying various open-source models — including those from Anthropic, Meta, and Hugging Face — into production-ready, scalable APIs.

However, challenges arise when models need special preprocessing, have unconventional output formats, or require unique logic not found in standard serving containers. These issues can significantly impede project progress.

The solution is to gain more control over the inference pipeline. This is where Google Cloud’s Vertex AI shines, offering a powerful combination of pre-built Hugging Face containers and the flexibility of custom handlers. By writing a simple Python script, you can dictate exactly how your model loads, processes requests, and formats predictions.

In this guide, I’ll walk you through the entire process, from development to deployment. You will learn how to:

Understand and build a custom inference handler for a Hugging Face model.
Test your model and handler locally to speed up debugging.
Package your model and custom code for deployment.
Deploy the model to a scalable Vertex AI Endpoint with GPU acceleration.
Get live predictions from your newly created API.

We will use PLLuM, a powerful Polish language model, as our practical example, but the techniques you learn here are applicable to countless other PyTorch-based models.

This model is a great example for a couple of reasons:

It’s increasingly popular in Poland.
It’s not directly accessible through Model Garden.
It won’t work in a standard Hugging Face container. This is because it employs a custom tokenizer, necessitating the implementation of encoding/decoding logic before invoking its generation function.

Before You Begin: Setting Up Your Environment

To follow along, you’ll need a Google Cloud project and the right tools and permissions. You can execute the code in your favorite Python development environment (e.g. locally, Cloud Workstation, Google Colab, etc.).

Google Cloud Project: Ensure you have a Google Cloud project with the Vertex AI and Artifact Registry APIs enabled.
Cloud Storage Bucket: Create a new Cloud Storage bucket to store your model files. This will act as the staging area for Vertex AI.
Permissions: Make sure your account has the following IAM roles:
- Vertex AI User (roles/aiplatform.user)
- Artifact Registry Reader (roles/artifactregistry.reader)
- Storage Object Admin (roles/storage.objectAdmin)
Docker (optional but recommended): To test your model container locally, before deploying to the cloud, you will need to have Docker installed and running.
Required Libraries: Install the necessary Python libraries:

pip install --upgrade --user --quiet 'torch' 'torchvision' 'torchaudio'
pip install --upgrade --user --quiet 'transformers' 'accelerate>=0.26.0'
pip install --upgrade --user --quiet 'google-cloud-aiplatform[prediction]' 'crcmod' 'etils'

Once your environment is configured, import libraries and initialize the Vertex AI SDK. You should put this code in two files: test_local.py (for local testing) and deploy.py (for cloud deployment):

import json
import torch
import vertexai

from etils import epath
from google.cloud import aiplatform
from google.cloud.aiplatform import Endpoint, Model
from google.cloud.aiplatform.prediction import LocalModel

# Set your project, location, and bucket details
PROJECT_ID = "your-gcp-project-id"
LOCATION = "your-gcp-location"  # example: "us-central1"
BUCKET_URI = "gs://your-gcs-bucket-name"

# Initialize Vertex AI SDK
vertexai.init(project=PROJECT_ID, location=LOCATION, staging_bucket=BUCKET_URI)

Understanding Custom Handlers

When you deploy a model on Vertex AI using a pre-built container, the container makes assumptions about how to load the model and process predictions. For many standard models, this works perfectly.

However, a custom handler gives you control over this process. It’s a Python script, named handler.py, that you provide alongside your model files. The Vertex AI serving container will automatically find and use this script.

The handler.py file needs to implement the EndpointHandler class, which must define two key methods:

__init__: This method is called once when the model is loaded. Its job is to load your model and any other necessary assets (like a tokenizer) from the model directory into memory.
__call__: This method is called for every prediction request. It contains the core inference logic:

Pre-processing: Preparing the raw input data (e.g., tokenizing a prompt).
Prediction: Running the processed input through the model.
Post-processing: Formatting the model’s output into a user-friendly response.

By implementing this simple class, you can serve virtually any model, no matter how custom its requirements are.
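To make the contract concrete, here is a toy handler with the same shape but no real model: it simply upper-cases each prompt. The request and response keys (instances, prompt, predictions) follow the handler built later in this guide. Everything else is a stand-in for illustration:

```python
from typing import Any, Dict, List


class EndpointHandler:
    def __init__(self, model_dir: str = "/opt/huggingface/model", **kwargs: Any) -> None:
        # A real handler would load the model and tokenizer from model_dir here
        self.model_dir = model_dir
        self.model = str.upper  # stand-in "model" for illustration

    def __call__(self, data: Dict[str, Any]) -> Dict[str, List[Any]]:
        predictions = []
        for instance in data["instances"]:
            prompt = instance["prompt"].strip()    # pre-processing
            output = self.model(prompt)            # prediction
            predictions.append(f"{output}!")       # post-processing
        return {"predictions": predictions}


handler = EndpointHandler()
result = handler({"instances": [{"prompt": " hello "}, {"prompt": "world"}]})
print(result)
```

The serving container would create the handler once and then call it for each request, so anything expensive belongs in `__init__`.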

Building Our Custom Handler for the PLLuM Model

Let’s build a handler for the CYFRAGOVPL/PLLuM-12B-chat model. Our goal is to create a simple text-generation endpoint.

First we need to make sure we have correct imports. Create a file named handler.py and copy the following lines into it:

from typing import Any, Dict, List
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import base64
from io import BytesIO
import logging
import sys

A good practice is to set up logging to stdout, so it can be accessed via Google Cloud’s observability services:

# Configure logging to output to stdout
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[logging.StreamHandler(sys.stdout)]
)
logger = logging.getLogger('huggingface_inference_toolkit')

Then we define our __init__ method, which will be responsible for loading the model and its tokenizer:

class EndpointHandler:
    def __init__(
        self,
        model_dir: str = '/opt/huggingface/model',
        **kwargs: Any,
    ) -> None:
        self.processor = AutoTokenizer.from_pretrained(model_dir)

        self.model = AutoModelForCausalLM.from_pretrained(
            model_dir,
            torch_dtype=torch.bfloat16,
            device_map="auto"  # automatically places model layers on available devices
        ).eval()

Lastly, let’s define the inference logic that goes inside our handler’s __call__ method. This involves taking a prompt, tokenizing it, generating a response with the model, and decoding the output:

    def __call__(self, data: Dict[str, Any]) -> Dict[str, List[Any]]:
        logger.info("Processing new request")
        predictions = []

        for instance in data['instances']:
            logger.info(f"Processing instance: {instance.get('prompt', '')[:100]}...")

            if "prompt" not in instance:
                error_msg = "Missing prompt in the request body"
                logger.error(error_msg)
                return {"error": error_msg}

            inputs = self.processor(
                instance["prompt"], return_tensors="pt", return_token_type_ids=False
            ).to(self.model.device)
            input_len = inputs["input_ids"].shape[-1]
            logger.info(f"Input processed, length: {input_len}")

            with torch.inference_mode():
                # Generation settings travel with each instance in the request
                generation_kwargs = instance.get(
                    "generation_kwargs", {
                        "max_new_tokens": 100,
                        "do_sample": False,
                        "top_k": 50,
                        "top_p": 0.9,
                        "temperature": 0.7
                    }
                )
                logger.info(f"Generation kwargs: {generation_kwargs}")

                generation = self.model.generate(**inputs, **generation_kwargs)
                generation = generation[0][input_len:]
                response = self.processor.decode(generation, skip_special_tokens=True)
                logger.info(f"Generated response: {response[:100]}...")
                predictions.append(response)

        logger.info(f"Successfully processed {len(predictions)} instances")
        return {"predictions": predictions}

Note that the __call__ method receives a dictionary and should be written in a way that handles multiple instances. This allows users to send multiple prompts in a single request.
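A client can therefore batch several prompts into one call. Here is a minimal sketch of building such a request body, mirroring the payload shape used in the prediction examples of this guide. `build_prediction_request` is a hypothetical helper:

```python
import json


def build_prediction_request(prompts: list[str], **generation_kwargs) -> str:
    """Serialize several prompts into a single JSON request body."""
    instances = [
        {"prompt": prompt, "generation_kwargs": dict(generation_kwargs)}
        for prompt in prompts
    ]
    # ensure_ascii=False keeps Polish characters readable in the payload
    return json.dumps({"instances": instances}, ensure_ascii=False)


body = build_prediction_request(
    ["Napisz krótki wiersz o wiośnie.", "Napisz krótki wiersz o zimie."],
    max_new_tokens=50,
    do_sample=True,
)
print(body)
```

The handler then returns one entry in "predictions" per instance, in the same order.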

Click here to download the full handler.py code.

Preparing the Model for Deployment

Vertex AI needs all your model artifacts — the model weights, configuration, and our new handler.py — to be in one location on Google Cloud Storage.

Create a local directory that contains the model files and your handler.
Upload the entire directory to your GCS bucket.

gsutil -o GSUtil:parallel_composite_upload_threshold=150M -m cp -r /path/to/your/local/model_directory/* gs://your-gcs-bucket-name/model/
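Before uploading, a quick check that the directory really contains the handler and the model files can save you a failed deployment. This is a sketch under assumptions: handler.py comes from this guide, while config.json is the configuration file Hugging Face checkpoints usually ship with. Adjust the list to your model:

```python
import tempfile
from pathlib import Path

# Assumed minimum set of files; extend it for your model
REQUIRED_FILES = ("handler.py", "config.json")


def missing_artifacts(model_dir: str) -> list[str]:
    """Return the required files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Demonstrate on a temporary directory that only contains the handler
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "handler.py").write_text("# custom handler\n")
    missing = missing_artifacts(tmp)

print(missing)
```

An empty list means the directory is ready for the upload command above.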

Best Practice: Testing Locally Before Deploying

Deploying a model to a GPU-accelerated endpoint can take 15–20 minutes. To avoid waiting that long just to find a bug in your code, you can use the Vertex AI SDK’s LocalModel feature to simulate the cloud environment on your local machine.

This spins up the official Hugging Face serving container using Docker and loads your model and handler from a local directory, allowing for rapid testing.

1. Define a helper function by adding the following lines to the test_local.py file:

def get_cuda_device_names():
    """A function to get the list of NVIDIA GPUs"""
    if not torch.cuda.is_available():
        return None

    return [str(i) for i in range(torch.cuda.device_count())]

2. Create LocalModel instance:

local_pllum_model = LocalModel(
    serving_container_image_uri="us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cu121.2-3.transformers.4-48.ubuntu2204.py311",
    serving_container_ports=[5000],
)

3. Create a LocalEndpoint instance:

model_uri = epath.Path(BUCKET_URI) / "model"

local_pllum_endpoint = local_pllum_model.deploy_to_local_endpoint(
    artifact_uri=str(model_uri), gpu_device_ids=get_cuda_device_names()
)

local_pllum_endpoint.serve()

4. Generate predictions:

# EN:"Write a short poem about spring."
prompt = "Napisz krótki wiersz o wiośnie."  # @param {type: "string"}

prediction_request = {
    "instances": [
        {
            "prompt": prompt,
            "generation_kwargs": {"max_new_tokens": 50, "do_sample": True},
        }
    ]
}

vertex_prediction_request = json.dumps(prediction_request)
vertex_prediction_response = local_pllum_endpoint.predict(
    request=vertex_prediction_request, headers={"Content-Type": "application/json"}
)
print(vertex_prediction_response.json()["predictions"])

Click here to download the full test_local.py code.

If the local prediction succeeds, you can be much more confident that your cloud deployment will work correctly.

Deploying to a Live Vertex AI Endpoint

With our model and handler tested and uploaded to GCS, we’re ready for the final two steps.

Step 1: Register the Model in the Vertex AI Model Registry

First, we register the model, telling Vertex AI where to find the artifacts and which container to use. Add the following code to the deploy.py file:

model_uri = epath.Path(BUCKET_URI) / "model"

model = Model.upload(
    display_name="cyfragovpl--pllum-12b-it",
    artifact_uri=str(model_uri),
    serving_container_image_uri="us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cu121.2-3.transformers.4-48.ubuntu2204.py311",
    serving_container_ports=[8080],
)
model.wait()

Step 2: Deploy the Model to an Endpoint

Next, we deploy the registered model to an endpoint. This is where Vertex AI provisions the physical hardware (like an NVIDIA L4 GPU) and makes your model available to receive prediction requests.

deployed_model = model.deploy(
    endpoint=Endpoint.create(display_name="cyfragovpl--pllum-12b-it-endpoint"),
    machine_type="g2-standard-8",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)

This step will take about 15–25 minutes. Once complete, you will have a fully managed, scalable HTTP endpoint for your model.

Getting Live Predictions

Now for the fun part. You can send requests to your endpoint using the Vertex AI SDK, a simple cURL command, or any HTTP client.

Using the Vertex AI Python SDK is the most straightforward way:

# EN:"Write a short poem about spring."
prompt = "Napisz krótki wiersz o wiośnie."  # @param {type: "string"}
prediction_request = {
    "instances": [
        {
            "prompt": prompt,
            "generation_kwargs": {"max_new_tokens": 50, "do_sample": True},
        }
    ]
}

prediction = deployed_model.predict(instances=prediction_request["instances"])
print(prediction)

Click here to download the full deploy.py code.

The output will be a prediction object containing the generated text from the PLLuM model, served live from your own custom API endpoint 🎉

Conclusion and Next Steps

You have successfully taken an open-source Hugging Face model with custom requirements and transformed it into a robust, scalable API on Google Cloud. You now have the power to productionize a vast range of models by creating a simple custom handler that tailors the inference process to your exact needs.

Explore more at:

For this content in other formats visit:

Python notebook
Youtube video (in Polish)

Thanks for reading

Thank you for reading! I hope this guide helps you bring your own creative AI projects to life on Google Cloud. If you found this article helpful, please consider following me here on Medium and giving it a clap 👏 to help others discover it.

I’m always eager to connect with fellow developers and AI enthusiasts, so feel free to connect with me on LinkedIn or Bluesky. Your feedback is incredibly valuable, so please don’t hesitate to leave a comment with your thoughts, questions, or your own experiences deploying models on Vertex AI!

Step-by-Step: Serving PyTorch Models with a Custom Handler on Vertex AI was originally published in Google Cloud - Community on Medium, where people are continuing the conversation by highlighting and responding to this story.