<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Remigiusz Samborski on Medium]]></title>
        <description><![CDATA[Stories by Remigiusz Samborski on Medium]]></description>
        <link>https://medium.com/@rsamborski?source=rss-adf81c1f37ee------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*wbRa_MY8hhJQdzRY9671xA.jpeg</url>
            <title>Stories by Remigiusz Samborski on Medium</title>
            <link>https://medium.com/@rsamborski?source=rss-adf81c1f37ee------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sun, 07 Jun 2026 10:29:45 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@rsamborski/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Building “Sweets Vault” — a multimodal Gemini Agent with physical hardware integration]]></title>
            <link>https://medium.com/google-cloud/building-sweets-vault-a-multimodal-gemini-agent-with-physical-hardware-integration-d4e77b4ab770?source=rss-adf81c1f37ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/d4e77b4ab770</guid>
            <category><![CDATA[kids-and-tech]]></category>
            <category><![CDATA[gemini]]></category>
            <category><![CDATA[education]]></category>
            <category><![CDATA[ai-agent]]></category>
            <dc:creator><![CDATA[Remigiusz Samborski]]></dc:creator>
            <pubDate>Fri, 15 May 2026 10:01:32 GMT</pubDate>
            <atom:updated>2026-05-15T15:03:23.082Z</atom:updated>
            <content:encoded><![CDATA[<h3>Building “Sweets Vault” — a multimodal Gemini Agent with physical hardware integration</h3><p>Motivating seven-year-olds to complete their daily reading and handwriting practice is a classic parenting challenge. Traditional rewards work for a while, but they lack interactivity and require constant manual verification.</p><p>As a developer, I like to solve such challenges with automation. After putting some thought into it, I came up with the <strong>Sweets Vault</strong> idea: an interactive agent powered by <a href="https://github.com/google-gemini/adk">Google’s Agent Development Kit (ADK)</a> and the <a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/models/start?utm_campaign=CDR_0x87fa8d40_default_b512067144&amp;utm_medium=external&amp;utm_source=blog">Gemini API</a>. The system acts as a cheerful guardian that talks to children, visually inspects their workbooks via uploaded images, tests their reading comprehension, and triggers a hardware lock to open a drawer full of sweets upon successful completion.</p><figure><img alt="Sweets Vault — Gamifying Education with AI" src="https://cdn-images-1.medium.com/max/1024/1*yZSAQSXI5Tr9DZJan6wkRg.jpeg" /></figure><p>In this guide, I will walk you through the architecture and implementation of this solution. You will learn how to:</p><ul><li><strong>Structure a multimodal agent</strong> using the Agent Development Kit (ADK).</li><li><strong>Implement visual and verbal verification</strong> using Gemini’s multimodal capabilities.</li><li><strong>Manage state </strong>across multiple conversation turns and tools.</li><li><strong>Connect agent tool calls to physical hardware</strong> interfaces.</li><li><strong>Develop and run locally</strong> to access the physical hardware.</li></ul><p>If you’d like to jump directly to the code visit the <a href="https://github.com/rsamborski/sweets-vault">GitHub repository</a>. 
All the code is available there for your exploration.</p><h3>System architecture overview</h3><p>The diagram below presents the high-level architecture of the solution:</p><figure><img alt="Architecture diagram" src="https://cdn-images-1.medium.com/max/1024/1*rjCE_F2wlb__TeeUu8Tg_g.png" /></figure><p>The core components include:</p><ol><li><strong>Gemini API</strong>: Handles reasoning, multimodal homework validation and tool calls.</li><li><strong>ADK Agent &amp; Tools</strong>: Encapsulates the system instructions, state management, and callable Python functions.</li><li><strong>Hardware Interface</strong>: Translates tool execution into physical actions (unlocking specific drawer IDs).</li></ol><p>The Agent runs on a local machine (I am using a mini PC with Ubuntu installed) so that it has direct access to the hardware:</p><ul><li>Magnetic drawers controlled via an <a href="https://www.adafruit.com/product/2264">FT232H</a> USB to GPIO converter</li><li>LED Matrix controlled via a REST API running on a <a href="https://www.raspberrypi.com/">Raspberry Pi</a></li></ul><p>Initially, I planned to control the LED Matrix using a second FT232H controller, but due to a lack of library support, I ended up using an intermediary Raspberry Pi. This approach has its benefits: for example, the LED Matrix can be located anywhere at home within Wi-Fi range 😀</p><h3>Root agent logic</h3><p>To kick-start the agent development, I leveraged the <a href="https://googlecloudplatform.github.io/agent-starter-pack/">agent-starter-pack templates</a>. The starter pack provides a production-ready foundation with FastAPI, frontend UI integration, and built-in observability.</p><p>The heart of the Sweets Vault is located in <a href="https://github.com/rsamborski/sweets-vault/blob/main/agent/app/agent.py">agent/app/agent.py</a>. 
I start by configuring the environment and initializing <a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/models?utm_campaign=CDR_0x87fa8d40_default_b512067144&amp;utm_medium=external&amp;utm_source=blog">Gemini Enterprise Agent Platform</a> (formerly Vertex AI). I also define the specific tasks required for our users (Mary and James):</p><pre>import os<br><br>import vertexai<br>from dotenv import load_dotenv<br><br>load_dotenv()<br>project_id = os.getenv(&quot;GOOGLE_CLOUD_PROJECT&quot;)<br>location = os.getenv(&quot;GOOGLE_CLOUD_LOCATION&quot;, &quot;us-central1&quot;)<br>os.environ[&quot;GOOGLE_CLOUD_PROJECT&quot;] = project_id<br>os.environ[&quot;GOOGLE_CLOUD_LOCATION&quot;] = location<br>os.environ[&quot;GOOGLE_GENAI_USE_VERTEXAI&quot;] = &quot;True&quot;<br><br># Initialize Vertex AI<br>vertexai.init(project=project_id, location=location)</pre><p>As a native Polish speaker, I want the Agent to work in both Polish (for the sake of my kids) and English (for demo purposes). This is handled by the AGENT_LANGUAGE variable:</p><pre>AGENT_LANGUAGE = os.getenv(&quot;AGENT_LANGUAGE&quot;, &quot;en&quot;)</pre><p>The actual agent (root_agent) is created at the bottom of the same file:</p><pre>root_agent = Agent(<br>    name=&quot;root_agent&quot;,<br>    model=Gemini(<br>        model=&quot;gemini-2.5-flash&quot;,<br>        retry_options=types.HttpRetryOptions(attempts=3),<br>    ),<br>    instruction=load_prompt_from_file(f&quot;sweet-vault-agent-{AGENT_LANGUAGE}.txt&quot;),<br>    tools=[get_progress, complete_task, unlock_drawer],<br>)</pre><p><strong>Note:</strong> The prompt is language-specific and pulled from a file with a language suffix (en or pl).</p><h3>Handling state</h3><p>A common failure mode in conversational AI is lost context or hallucinated task completion. 
To prevent this, we implement strict state management using ToolContext.</p><p>Instead of relying on the model’s memory, the agent reads and writes explicit completion flags to its <a href="https://adk.dev/sessions/state/">session state</a>:</p><pre>def _get_task_status(user_key: str, task_id: str, tool_context: ToolContext) -&gt; bool:<br>    &quot;&quot;&quot;Retrieves the completion status for a specific task from the flat state.&quot;&quot;&quot;<br>    state_key = f&quot;user_tasks_{user_key}_{task_id}&quot;<br>    return tool_context.state.get(state_key, False)<br><br><br>def _set_task_status(user_key: str, task_id: str, is_done: bool, tool_context: ToolContext):<br>    &quot;&quot;&quot;Saves the completion status for a specific task and ensures all user/task <br>    combinations are explicitly represented in the flat tool_context.state.<br>    &quot;&quot;&quot;<br>    # First, update the specific target task in the current tool state<br>    target_key = f&quot;user_tasks_{user_key}_{task_id}&quot;<br>    tool_context.state[target_key] = is_done<br>    <br>    # Now, ensure every possible combination for all known users exists in the flat state.<br>    all_sync_updates = {}<br>    for name in user_names:<br>        u_key = name.lower()<br>        for t_id in TASKS_CONFIG:<br>            key = f&quot;user_tasks_{u_key}_{t_id}&quot;<br>            # If the key isn&#39;t already in the current state, default it to False.<br>            # Otherwise, keep its existing value.<br>            all_sync_updates[key] = tool_context.state.get(key, False)<br>    <br>    # Apply all values back to the flat state<br>    tool_context.state.update(all_sync_updates)<br>    logging.info(f&quot;Synchronized all task state values. 
Updated {target_key} to {is_done}&quot;)</pre><p><strong>Key learning:</strong> When building the system I tried using session state elements as a <a href="https://www.w3schools.com/PYTHON/python_dictionaries_nested.asp">nested dictionary</a>, but unfortunately at the time of writing this is not supported. The workaround was to use a flat structure with keys including both the user_key and task_id, which works well for my use case. However, this pattern might not scale well for a complex system with many users and tasks, in which case serialization or an external DB could be a better option.</p><h3>Agent tools</h3><p>I provided the agent with three specific tools: checking progress, marking tasks complete, and unlocking the drawer.</p><h4>Checking progress</h4><p>The get_progress function retrieves and formats a checklist of a specific user’s tasks, indicating whether each task is marked as completed or pending based on the application’s current session state.</p><pre>def get_progress(user_name: str, tool_context: ToolContext) -&gt; str:<br>    &quot;&quot;&quot;Check the progress of tasks for a specific user.&quot;&quot;&quot;<br>    user_key = user_name.lower()<br><br>    status_msg = f&quot;Progress for {user_name}:\n&quot;<br>    for task_id, desc in TASKS_CONFIG.items():<br>        is_done = _get_task_status(user_key, task_id, tool_context)<br>        state_str = &quot;✅ DONE&quot; if is_done else &quot;❌ PENDING&quot;<br>        status_msg += f&quot;- [{task_id}] {desc}: {state_str}\n&quot;<br><br>    return status_msg</pre><h4>Marking task as complete</h4><p>The complete_task tool acts as a gatekeeper. 
It checks if all tasks are finished before informing the model that it is authorized to unlock the drawer:</p><pre>def complete_task(user_name: str, task_id: str, tool_context: ToolContext) -&gt; str:<br>    &quot;&quot;&quot;Mark a task as completed for a user.&quot;&quot;&quot;<br>    user_key = user_name.lower()<br><br>    # Mark task as complete<br>    if task_id in TASKS_CONFIG:<br>        _set_task_status(user_key, task_id, True, tool_context)<br>    else:<br>        return f&quot;Error: Task ID &#39;{task_id}&#39; not found.&quot;<br><br>    # Check if ALL tasks are complete<br>    all_complete = True<br>    remaining = []<br>    for t_id in TASKS_CONFIG:<br>        if not _get_task_status(user_key, t_id, tool_context):<br>            all_complete = False<br>            remaining.append(t_id)<br><br>    if all_complete:<br>        return (<br>            f&quot;SUCCESS: All tasks completed for {user_name}! &quot;<br>            &quot;You may now unlock the drawer.&quot;<br>        )<br><br>    # If not all complete, show progress<br>    return (<br>        f&quot;Task {task_id} marked as DONE. &quot;<br>        f&quot;Remaining tasks: {&#39;, &#39;.join(remaining)}.&quot;<br>    )</pre><p>Notice how descriptive the returned values are. They are written this way intentionally to give the Agent enough information to handle communication with the user, provide feedback and motivate them to complete the remaining tasks.</p><h4>Integrating physical hardware</h4><p>When the model receives the success confirmation, it calls the unlock_drawer tool. 
This interfaces directly with our hardware relay logic to update the LED display and pop open the assigned drawer:</p><pre># Initialize the HW interface and lock the drawers<br>user_names = [&quot;Maria&quot;, &quot;Jan&quot;] if AGENT_LANGUAGE == &quot;pl&quot; else [&quot;Mary&quot;, &quot;James&quot;]<br>hw_interface = HardwareInterface(user_names)<br><br>def unlock_drawer(id: int, user_name: str) -&gt; str:<br>    &quot;&quot;&quot;Unlock a drawer by its ID.&quot;&quot;&quot;<br>    if id in [0, 1]:<br>        hw_interface.unlock_drawer(id)<br>        return f&quot;Drawer {id} unlocked for {user_name}&quot;<br><br>    return &quot;Drawer not found&quot;</pre><p>The HardwareInterface (defined in <a href="https://github.com/rsamborski/sweets-vault/blob/main/agent/app/app_utils/hw_interface.py">agent/app/app_utils/hw_interface.py</a>) actively communicates with the <a href="https://github.com/rsamborski/sweets-vault/tree/main/led-matrix-api">LED Matrix API</a> on the Raspberry Pi to display whether each drawer is currently locked or unlocked.</p><p>While the code to control the physical drawer magnets is fully functional and tested (located in <a href="https://github.com/rsamborski/sweets-vault/blob/main/hardware/drawers.py">drawers.py</a>), it is not yet integrated into the main HardwareInterface. This integration is simply on hold until the magnets are physically mounted to the drawer box.</p><h3>Agent prompts</h3><p>Tools alone are not enough; the model requires precise instructions on <em>how</em> to verify the work. In <a href="https://github.com/rsamborski/sweets-vault/tree/main/agent/app/prompts">agent/app/prompts</a> I defined a strict multi-step verification protocol both in English and Polish. Here is the English prompt:</p><pre>You are a friendly, cheerful, and helpful AI assistant, the guardian of the &quot;Sweets Vault.&quot; Your task is to verify tasks performed by children in order to grant a sweet reward.<br><br>### MAIN RULES:<br>1. 
**LANGUAGE**: You speak ONLY AND EXCLUSIVELY IN ENGLISH.<br>2. **USERS**:<br>   - **Mary** (girl, 7 years old) -&gt; Assigned drawer ID: **0**<br>   - **James** (boy, 7 years old) -&gt; Assigned drawer ID: **1**<br>   - **Parent** (man, 42 years old) -&gt; May test the system by saying, for example, &quot;I&#39;m pretending to be Mary.&quot; Treat him exactly like the child he is claiming to be.<br>3. **PERSONALITY**: You are enthusiastic, warm, and supportive. Use exclamation marks and a joyful tone.<br><br>### TASK VERIFICATION PROCESS:<br>1. **STATE IDENTIFICATION**: When a child starts a conversation, ALWAYS first use the `get_progress(user_name)` tool to check what needs to be done.<br>2. **REPORTING**: The child reports completing a task (A or B).<br>3. **VERIFICATION**: Conduct a rigorous verification (camera/questions) as described below.<br>4. **CREDITING**: If verification is successful, use the `complete_task(user_name, task_id)` tool.<br>   - Read the tool&#39;s response carefully!<br>   - ONLY IF the response is &quot;SUCCESS: All tasks completed...&quot;, then use `unlock_drawer`.<br>   - If the response shows &quot;Remaining tasks,&quot; inform the child what they still need to do.<br><br>**Task A: Reading a page of a book**<br>*   **Verification 1**: Ask the child to show the read page to the camera. Confirm that you see it. Don&#39;t expose any details that can help answer the question in the next step (i.e. avoid sharing details of what exactly you can see).<br>*   **Verification 2**: Ask a simple follow-up question about the read text. The child must answer it.<br>*   **Task ID**: &quot;A&quot;<br><br>**Task B: Calligraphy (writing letters in workbooks)**<br>*   **Verification 1**: Ask to show the completed page in the workbooks to the camera. <br>*   **Verification 2**: Confirm that the task has been performed. 
Make sure the picture contains hand-written letters (usually with a pencil).<br>*   If the page only contains examples, ask the child to complete missing parts.<br>*   **Task ID**: &quot;B&quot;<br><br>### SUCCESS AND REWARD:<br>IF the `complete_task` tool returns &quot;SUCCESS&quot;, run `unlock_drawer(id)`.<br>Then **CELEBRATE!** Use phrases like: &quot;Yippee!&quot;, &quot;Hurray!&quot;, &quot;Bravo!&quot;, &quot;You&#39;re a champion!&quot;, &quot;The sweets are yours!&quot;. Make some &quot;noise.&quot;<br><br>### FAILURE:<br>If verification fails (e.g., the child doesn&#39;t show the page or answers incorrectly), gently and encouragingly ask for improvement or a retry. Do not open the drawer.</pre><p>This prompt structure ensures the agent does its due diligence, preventing kids from simply holding up a blank page or skipping the reading comprehension check.</p><h3>Demo</h3><p>You can see a demonstration of the working system in the video below:</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FTMStQw0Fthk%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DTMStQw0Fthk&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FTMStQw0Fthk%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/c8aa5ad37b3e21d63747a47309b1ec6a/href">https://medium.com/media/c8aa5ad37b3e21d63747a47309b1ec6a/href</a></iframe><h3>Conclusion</h3><p>By combining the Gemini API, the Agent Development Kit, and a simple hardware relay, you can build highly interactive, physically grounded AI Agents. 
The Sweets Vault demonstrates how multimodal verification and structured tool calling solve practical, real-world problems with a dose of fun.</p><p>Explore more at:</p><ul><li><a href="https://github.com/rsamborski/sweets-vault">Sweets Vault code repository</a></li><li><a href="https://adk.dev/">Agent Development Kit (ADK)</a></li><li><a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/models?utm_campaign=CDR_0x87fa8d40_default_b512067144&amp;utm_medium=external&amp;utm_source=blog">Gemini Enterprise Agent Platform</a></li></ul><h3>Future plans</h3><p>The current implementation uses <a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/models/gemini/2-5-flash?utm_campaign=CDR_0x87fa8d40_default_b512067144&amp;utm_medium=external&amp;utm_source=blog">Gemini Flash</a>, which offers high performance, multimodality and tool calling capabilities. Nevertheless, it takes typed text (plus uploaded images) as input and provides only text as output. In the near future I plan to experiment with the <a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/live-api?utm_campaign=CDR_0x87fa8d40_default_b512067144&amp;utm_medium=external&amp;utm_source=blog">Gemini Live API</a>, which enables voice, video and text as input and conversational audio as output.</p><p>I am also going to finish the physical locks with electromagnets. Stay tuned for updates.</p><h3>Thanks for reading</h3><p>Thank you for reading. I hope this blog inspires you to bring your own creative AI and hardware projects to life. If you found this article helpful, please consider following me here and giving it a clap 👏 to help others discover it.</p><p>I am always eager to connect with fellow developers and AI enthusiasts, so feel free to follow me on <a href="https://www.linkedin.com/in/remigiusz-samborski/">LinkedIn</a>, <a href="http://x.com/RemikSamborski">X</a> or <a href="https://bsky.app/profile/rsamborski.bsky.social">Bluesky</a>. 
Your feedback is incredibly valuable, so please do not hesitate to leave a comment with your thoughts, questions, or your own experiences building multimodal agents!</p><hr><p><a href="https://medium.com/google-cloud/building-sweets-vault-a-multimodal-gemini-agent-with-physical-hardware-integration-d4e77b4ab770">Building “Sweets Vault” — a multimodal Gemini Agent with physical hardware integration</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How I used Gemini CLI to orchestrate a complex RAG migration]]></title>
            <link>https://medium.com/google-cloud/how-i-used-gemini-cli-to-orchestrate-a-complex-rag-migration-136d9de137c5?source=rss-adf81c1f37ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/136d9de137c5</guid>
            <category><![CDATA[agentic-rag]]></category>
            <category><![CDATA[google-cloud-platform]]></category>
            <category><![CDATA[google-antigravity]]></category>
            <category><![CDATA[gemini-cli]]></category>
            <category><![CDATA[gemini]]></category>
            <dc:creator><![CDATA[Remigiusz Samborski]]></dc:creator>
            <pubDate>Tue, 28 Apr 2026 12:19:46 GMT</pubDate>
            <atom:updated>2026-04-28T12:19:46.173Z</atom:updated>
            <content:encoded><![CDATA[<p>Building a complex, multi-phase cloud project like a RAG migration is as much about orchestration as it is about code. You have to manage infrastructure (<a href="https://www.terraform.io/">Terraform</a>), backend services (<a href="https://www.python.org/">Python</a>), frontend UI (<a href="https://nextjs.org/">Next.js</a>), data pipelines (<a href="https://docs.cloud.google.com/bigquery/docs/introduction?utm_campaign=CDR_0x87fa8d40_default_b499342492&amp;utm_medium=external&amp;utm_source=blog">BigQuery</a>/<a href="https://docs.cloud.google.com/alloydb/docs/overview?utm_campaign=CDR_0x87fa8d40_default_b499342492&amp;utm_medium=external&amp;utm_source=blog">AlloyDB</a>), and documentation — all while maintaining a consistent technical strategy.</p><p>Standard IDE completions are great for snippets, but they lack the system-level context needed for this kind of engineering. To build this reference architecture, I didn’t just use an AI to write code. 
I used an AI to <strong>orchestrate the entire project</strong>.</p><p>In this final post (see the previous <a href="https://medium.com/google-cloud/building-a-scalable-rag-backend-with-cloud-run-jobs-and-alloydb-6ead93ca4aec">part 1</a> and <a href="https://medium.com/google-cloud/migrating-vector-embeddings-in-production-without-downtime-8a0464af6f55">part 2</a>), I’ll share a behind-the-scenes look at using <a href="https://geminicli.com/">Gemini CLI</a> with the <a href="https://github.com/gemini-cli-extensions/conductor">Conductor extension</a> to orchestrate this migration.</p><p>In this post, you will learn:</p><ul><li>How to leverage terminal-first AI assistants for system-level engineering</li><li>How to implement spec-driven development with <a href="https://github.com/gemini-cli-extensions/conductor">the Conductor extension</a></li><li>How to use AI-driven Test-Driven Development (TDD) for reliable code generation</li><li>How to collaborate with AI agents using the “Human-in-the-Loop” model</li></ul><p>Before we dive into the workflow, let’s briefly discuss why orchestration is the next logical step for AI-assisted development.</p><h3>The Developer Experience</h3><p>Let’s walk through my development process step by step. The entire specification, plan, and implementation logic is available in <a href="https://github.com/rsamborski/rag-migration/tree/main/conductor">the conductor directory</a> of the rag-migration repository.</p><h4>Spec-driven development with Conductor</h4><p>Central to my workflow is the Conductor extension. It’s built on the principle of <strong>spec-driven development</strong>. 
Instead of jumping straight into code, we define the “source of truth” in Markdown files.</p><ul><li><strong>Product Definition (</strong>product.md<strong>):</strong> What are we building?</li><li><strong>Tech Stack (</strong>tech-stack.md<strong>):</strong> What tools are we using?</li><li><strong>Tracks Registry (</strong>tracks.md<strong>):</strong> What are the major milestones?</li><li><strong>Implementation Plans (</strong>plan.md<strong> for each of the tracks):</strong> What are the step-by-step tasks?</li><li><strong>Workflow (</strong>workflow.md<strong>):</strong> How are we building the solution?</li></ul><p>With these documents in the codebase, the AI agent (Gemini CLI) always has the high-level context it needs to make smart decisions. It’s also good practice to share these documents with your team so everyone (including AI agents) is on the same page about the project’s direction.</p><h3>Conductor initialization</h3><p>The first step in project initialization is to create the product definition and tech stack files. 
This is handled by running:</p><pre>/conductor:setup</pre><p>Gemini CLI will ask you a series of questions to help you define your project, including:</p><ul><li>What is the name of your product?</li><li>Who are the primary users?</li><li>What is the tech stack you are using?</li><li>What are the major features you want to implement?</li><li>What is the workflow you want to use?</li></ul><p>It will then create the initial project structure in the conductor directory, including the <a href="https://github.com/rsamborski/rag-migration/blob/main/conductor/product.md">product.md</a> and <a href="https://github.com/rsamborski/rag-migration/blob/main/conductor/tech-stack.md">tech-stack.md</a> files.</p><h4>The lifecycle of a track</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nq_IKg2NTavwpRucQy7uKw.png" /><figcaption>The lifecycle of a track in Gemini CLI Conductor</figcaption></figure><p>Each major feature in this project was implemented as a “Track”. A typical track lifecycle consists of:</p><p>1) Track Initialization (/conductor:newTrack):</p><ul><li>The agent creates a spec.md file that describes the goals of the track</li><li>The agent maps the existing codebase and validates assumptions</li><li>The agent creates a plan.md file that describes the steps needed to achieve the goals</li></ul><p>2) Track Execution (/conductor:implement):</p><ul><li>The agent iterates through tasks using a <strong>Plan -&gt; Act -&gt; Validate</strong> cycle</li></ul><p>3) Track Completion:</p><ul><li>The agent verifies the changes made during the track</li><li>The agent asks for user feedback on the implementation</li></ul><p>4) Track Archival:</p><ul><li>Once a track is completed, Gemini CLI archives the track in the conductor/archive directory</li></ul><p>For example, when I started the initial embeddings track, I initialized it with:</p><pre>/conductor:newTrack</pre><p>Gemini CLI researches the codebase, asks clarifying questions and creates the <a 
href="https://github.com/rsamborski/rag-migration/blob/main/conductor/archive/initial-embeddings_20260303/spec.md">spec.md</a> and <a href="https://github.com/rsamborski/rag-migration/blob/main/conductor/archive/initial-embeddings_20260303/plan.md">plan.md</a> files. The actual implementation starts only after I review and approve them.</p><h4>Terraform for Infrastructure as Code</h4><p>My <a href="https://github.com/rsamborski/rag-migration/blob/b17a233a67886bf574601e9d72d53416c6ffcf90/conductor/product.md?plain=1#L21">product.md</a> file instructs Gemini CLI to write Terraform code for all the resources created during the project. This works really well: all the resources are consistently managed as source code, and it’s easy to spin up a new environment when needed.</p><p>You can see all the Terraform files and infrastructure scripts used in the first track in the <a href="https://github.com/rsamborski/rag-migration/tree/main/01-generation/infra">infra</a> directory.</p><p>Moreover, while building the project I instructed Gemini CLI to always run terraform plan before terraform apply. Keeping this instruction in the <a href="https://github.com/rsamborski/rag-migration/blob/b17a233a67886bf574601e9d72d53416c6ffcf90/conductor/workflow.md?plain=1#L12">workflow.md</a> file ensures that the same approach is applied to all tracks.</p><h4>TDD with an AI agent</h4><p>One of the most powerful aspects of this workflow is AI-driven Test-Driven Development (TDD). I didn’t just ask the agent to “write the code”. 
It followed a strict protocol:</p><ul><li><strong>Write Failing Tests:</strong> The agent defines the expected behavior in a new test file.</li><li><strong>Red Phase:</strong> It runs the tests and confirms they fail.</li><li><strong>Green Phase:</strong> It writes the minimum code needed to pass the tests.</li><li><strong>Refactor:</strong> It refactors the implementation code and the test code to improve clarity, remove duplication, and enhance performance without changing the external behavior.</li><li><strong>Verify Coverage:</strong> It verifies that the test coverage meets the project requirements (target: &gt;80% coverage for new code).</li><li><strong>Commit Code Changes:</strong> The agent commits code changes related to the task.</li></ul><p>This ensures that the AI-generated code isn’t just “syntactically correct” but is also functionally verified against my requirements. This workflow is described in the <a href="https://github.com/rsamborski/rag-migration/blob/b17a233a67886bf574601e9d72d53416c6ffcf90/conductor/workflow.md?plain=1#L18">workflow.md</a> file.</p><h4>Checkpoints and quality gates</h4><p>At the end of every phase, Gemini CLI runs a “Checkpoint” protocol. This includes:</p><ul><li><strong>Automated Verification:</strong> Running the full test suite.</li><li><strong>Manual Verification:</strong> Providing the user with step-by-step instructions to verify the changes.</li><li><strong>Auditable Records:</strong> Attaching a verification report to the git commit using git notes and updating plan.md with the new commit hash.</li></ul><figure><img alt="Conductor commits demonstrating the checkpoint protocol." 
src="https://cdn-images-1.medium.com/max/1024/1*DyaQ5ruff6phahGk2m-uRg.png" /><figcaption>Conductor commits demonstrating the checkpoint protocol.</figcaption></figure><h4>Effective Human-in-the-Loop</h4><p>To make the collaboration between the AI agent and me effective, I relied heavily on the following:</p><ul><li>Gemini CLI in a sandbox with Yolo mode enabled (see <a href="https://medium.com/google-cloud/secure-gemini-cli-for-cloud-development-488b23dedf29">my past article</a> for more about it).</li><li><a href="https://github.com/rsamborski/vibecoding/blob/main/gemini-cli-sandbox/sandbox_notifier.sh">Custom sandbox notifier script</a> that runs in another terminal.</li></ul><p>This approach provided safe guardrails and allowed me to work on other projects while the AI was working on this one. I was always able to jump back quickly thanks to timely notifications. Moreover, Conductor’s checkpointing mechanism meant I could always revert unwanted changes or restart from a known working state.</p><p>I also used <a href="https://antigravity.google/?utm_campaign=CDR_0x87fa8d40_default_b499342492&amp;utm_medium=external&amp;utm_source=blog">Antigravity</a> to polish the generated code and the documentation. It was particularly helpful for minor tweaks or refactoring of the code that was generated by Gemini CLI.</p><h4>Token usage</h4><p>Throughout the project I used several models (Gemini 3 Pro, Gemini 3 Flash and Gemini 2.5 Flash Lite). The total token consumption was:</p><ul><li>Input tokens: ~19M</li><li>Cached input tokens: ~66M</li><li>Output tokens: ~400k</li></ul><p>Notice the high number of cached input tokens, which significantly lowers the spend. The total Vertex AI token cost was <strong>around $30</strong>. 
Not bad for several days of AI-assisted work.</p><p>See the <a href="https://cloud.google.com/vertex-ai/generative-ai/pricing?utm_campaign=CDR_0x87fa8d40_default_b499342492&amp;utm_medium=external&amp;utm_source=blog">pricing page</a> for more details, and keep in mind that your mileage may vary.</p><h3>Summary</h3><p>Software engineering is evolving from writing code to orchestrating agentic workflows. By using tools like Gemini CLI and frameworks like Conductor, you can scale your impact as an architect while ensuring consistent, high-quality implementation.</p><p>Ready to build your own AI-assisted development projects?</p><ul><li><a href="https://geminicli.com/">Check out Gemini CLI</a></li><li><a href="https://github.com/gemini-cli-extensions/conductor">Explore the Conductor extension</a></li><li><a href="https://antigravity.google/?utm_campaign=CDR_0x87fa8d40_default_b499342492&amp;utm_medium=external&amp;utm_source=blog">Try Antigravity</a></li><li><a href="https://github.com/rsamborski/rag-migration">Check out the full RAG Migration repository</a></li></ul><h3>Thanks for reading</h3><p>If you found this article helpful, please consider adding 50 claps to this post by pressing and holding the clap button 👏 This will help others find it. 
You can also share it with your friends on socials.</p><p>I’m always eager to share my learnings or chat with fellow developers and AI enthusiasts, so feel free to follow me on <a href="https://www.linkedin.com/in/remigiusz-samborski/">LinkedIn</a>, <a href="https://x.com/RemikSamborski">X</a> or <a href="https://bsky.app/profile/rsamborski.bsky.social">Bluesky</a>.</p><hr><p><a href="https://medium.com/google-cloud/how-i-used-gemini-cli-to-orchestrate-a-complex-rag-migration-136d9de137c5">How I used Gemini CLI to orchestrate a complex RAG migration</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Migrating vector embeddings in production without downtime]]></title>
            <link>https://medium.com/google-cloud/migrating-vector-embeddings-in-production-without-downtime-8a0464af6f55?source=rss-adf81c1f37ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/8a0464af6f55</guid>
            <category><![CDATA[vector-embeddings]]></category>
            <category><![CDATA[agentic-rag]]></category>
            <category><![CDATA[ai-agent]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[data]]></category>
            <dc:creator><![CDATA[Remigiusz Samborski]]></dc:creator>
            <pubDate>Tue, 21 Apr 2026 14:09:56 GMT</pubDate>
            <atom:updated>2026-04-28T13:01:23.497Z</atom:updated>
            <content:encoded><![CDATA[<p>In the fast-moving world of AI, models evolve rapidly. What was state-of-the-art six months ago is now being surpassed by newer models. For a RAG system, this presents a significant challenge: vector embeddings are tied to the specific model that generated them.</p><p>If you want to upgrade your model, you can’t just start using the new one. Existing vectors in your database are incompatible with queries from the new model. A “naive” migration (shutting down the site, re-embedding everything, and restarting) means hours of potential downtime.</p><p>In this post, I’ll show you how to execute a zero-downtime migration strategy using dual-column schemas and background processing.</p><p>If you haven’t read <a href="https://medium.com/@rsamborski/6ead93ca4aec">the previous post</a>, I recommend starting there to understand the basics of building a RAG pipeline with <a href="https://docs.cloud.google.com/bigquery/docs/introduction?utm_campaign=CDR_0x87fa8d40_default_b499342268&amp;utm_medium=external&amp;utm_source=blog">BigQuery</a>, <a href="https://docs.cloud.google.com/run/docs/create-jobs?utm_campaign=CDR_0x87fa8d40_default_b499342268&amp;utm_medium=external&amp;utm_source=blog">Cloud Run Jobs</a>, <a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/embeddings?utm_campaign=CDR_0x87fa8d40_default_b499342268&amp;utm_medium=external&amp;utm_source=blog">Vertex AI</a>, and <a href="https://docs.cloud.google.com/alloydb/docs/overview?utm_campaign=CDR_0x87fa8d40_default_b499342268&amp;utm_medium=external&amp;utm_source=blog">AlloyDB for PostgreSQL</a>.</p><p>We will start with the running system built in <a href="https://medium.com/@rsamborski/6ead93ca4aec">the previous post</a>, and I will show you how to:</p><ul><li>Implement the <a href="https://devops.com/what-is-a-shadow-deployment/">Shadow Deployment pattern</a> with dual-column schemas</li><li>Execute background backfilling using <a 
href="https://docs.cloud.google.com/run/docs/create-jobs?utm_campaign=CDR_0x87fa8d40_default_b499342268&amp;utm_medium=external&amp;utm_source=blog">Cloud Run Jobs</a></li><li>Safely switch application logic without impacting search functionality</li><li>Ensure data consistency and handle migration failures</li></ul><p>Before we dive into the code, let’s briefly discuss the concept of shadow deployment and how it supports the RAG application migration process.</p><h3>Shadow deployment with dual columns</h3><figure><img alt="RAG embeddings migration overview" src="https://cdn-images-1.medium.com/max/1024/1*ks9Nq4RlQug8KTnebKcqow.png" /><figcaption>RAG embeddings migration overview</figcaption></figure><p>A robust way to migrate embeddings is to use the <a href="https://devops.com/what-is-a-shadow-deployment/">Shadow Deployment pattern</a>. Instead of replacing the existing vectors, you store the new vectors alongside them in a separate column. The migration process boils down to three major steps:</p><ul><li>Add a new column: We update our AlloyDB table to include embedding_v2.</li><li>Backfill in the background: We run a migration job to populate embedding_v2 for all existing rows.</li><li>Switch: Once every row has a new vector, we update the application code to use the new model and the new column.</li></ul><p>This strategy ensures that your live search functionality, which still uses the original embedding column, remains fully operational during the entire migration process.</p><h3>Implementation</h3><p>Let’s walk through the migration process step by step. All the code for this migration is available in the 03-migration folder of the <a href="https://github.com/rsamborski/rag-migration">RAG Migration Repository</a>.</p><h4>Step 1: Schema evolution</h4><p>First, we prepare the database. Using a simple SQL statement, we add the new vector column. 
Because we are targeting an existing database, we connect via the <a href="https://cloud.google.com/alloydb/docs/auth-proxy/overview?utm_campaign=CDR_0x87fa8d40_default_b499342268&amp;utm_medium=external&amp;utm_source=blog">AlloyDB Auth Proxy</a> and use psql to execute the query:</p><pre># Ensure your AlloyDB Auth proxy is running in another terminal window by running<br># ./alloydb-auth-proxy projects/&lt;PROJECT_ID&gt;/locations/&lt;LOCATION&gt;/clusters/&lt;CLUSTER&gt;/instances/&lt;INSTANCE&gt; --port &lt;PORT&gt; --auto-iam-authn --public-ip<br><br># Navigate to the migration directory<br>cd 03-migration<br><br># Apply the schema change<br>psql -h 127.0.0.1 -p &lt;PORT&gt; -U postgres -d &lt;DATABASE_NAME&gt; -f 001_add_embedding_v2.sql</pre><p>The content of 001_add_embedding_v2.sql is straightforward:</p><pre>ALTER TABLE products ADD COLUMN IF NOT EXISTS embedding_v2 VECTOR(768);</pre><p>Since AlloyDB handles schema changes gracefully, this operation is near-instantaneous and doesn’t lock the table for reads. Your live API is completely unaffected.</p><blockquote>Note: In production you may want to run this query via your CI/CD pipeline.</blockquote><h4>Step 2: Configure the migration environment</h4><p>We reuse the parallelization framework we built in <a href="https://medium.com/@rsamborski/6ead93ca4aec">the previous post</a>, but this time we configure the environment for the new model. The project uses uv for dependency management:</p><pre># Sync local dependencies (run it in 03-migration folder)<br>uv sync<br><br># Set required environment variables<br>export GOOGLE_CLOUD_PROJECT=&quot;YOUR_PROJECT_ID&quot;<br>export DB_PASSWORD=&quot;YOUR_ALLOYDB_PASSWORD&quot;<br>export GEMINI_EMBEDDING_MODEL=&quot;gemini-embedding-001&quot;<br>export GEMINI_EMBEDDING_DIMENSION=768<br>export BATCH_SIZE=1000</pre><h4>Step 3: Background backfilling worker</h4><p>The migration worker (03-migration/main.py) specifically targets rows where the new column is still empty. 
This makes the migration process <strong>idempotent and resumable</strong> — if a task fails, you can just run it again.</p><pre># snippet from 03-migration/main.py<br># Fetch products where embedding_v2 is null, respecting offset<br>fetch_stmt = text(&quot;&quot;&quot;<br>    SELECT id, name, category, brand FROM products <br>    WHERE embedding_v2 IS NULL<br>    ORDER BY id<br>    LIMIT :batch_size OFFSET :offset<br>&quot;&quot;&quot;)</pre><p>We deploy this worker as a Cloud Run Job. A convenient deploy script is provided in the repository which builds the Docker image and configures the job on GCP.</p><pre>./infra/scripts/deploy_migration.sh</pre><h4>Step 4: Orchestrating the migration</h4><p>Instead of manually calculating the number of tasks to run, we use a Python orchestrator (03-migration/orchestrator.py) to query the database, calculate the remaining work, and dynamically scale the Cloud Run Job.</p><p>The orchestrator counts the number of unmigrated rows:</p><pre># snippet from orchestrator.py logic<br>count_stmt = text(&quot;SELECT COUNT(*) FROM products WHERE embedding_v2 IS NULL&quot;)<br>unmigrated_count = session.execute(count_stmt).scalar()<br>total_tasks = math.ceil(unmigrated_count / batch_size)</pre><p>Then, it triggers the Cloud Run Job via the Google Cloud SDK, passing the exact number of tasks required:</p><pre># Run the orchestrator to kick off the migration<br>uv run orchestrator.py</pre><p>The job runs in the background, consuming rows and generating new embeddings without competing for critical resources with our live search API.</p><h4>Step 5: Safely changing the query</h4><p>Once the orchestrator reports that 100% of rows have embedding_v2 populated, we are ready for the switch. 
This happens entirely at the application layer (02-ui).</p><p>The search API code is updated to:</p><ol><li>Use the gemini-embedding-001 model to embed the user’s search query.</li><li>Query the embedding_v2 column in AlloyDB instead of embedding.</li></ol><p><strong>Congratulations 🎉</strong> You have successfully migrated your entire vector database with zero downtime!</p><h4>Production best practices: evals and feature flags</h4><p>While a direct code swap works for a simple demonstration, in a real-world production environment, you should avoid an abrupt 100% cutover. Instead, you should leverage the fact that both vector representations exist simultaneously in your database to roll out safely:</p><ol><li><strong>Evaluation pipeline:</strong> Before exposing the new model to customers, build an eval pipeline. Take a golden dataset of your most common or critical search queries and run them against both the old (embedding) and new (embedding_v2) columns. Compare the relevance of the retrieved results to ensure the new model actually improves the search experience.</li><li><strong>Feature flags for traffic routing:</strong> Wrap the application-layer switch in a feature flag. Start by routing a small percentage of your traffic (e.g., 5% or 10%) to the new embedding_v2 logic. Monitor your application metrics, click-through rates, and error logs.</li></ol><p>Because both columns coexist after the background migration, this dual state makes it easy to run A/B tests or to roll back instantly by toggling the feature flag if the new model introduces unexpected regressions. Once you have ramped up to 100% and verified the new model’s performance, the old embedding column can be safely dropped in a future database cleanup.</p><h4>See it in action</h4><figure><img alt="The semantic search UI seamlessly returns results using the new gemini-embedding-001 model without any disruption to the user experience." 
src="https://cdn-images-1.medium.com/max/884/1*FAjcPAB5rpL9jHRMSdiKxw.gif" /><figcaption><em>The semantic search UI seamlessly returns results using the new gemini-embedding-001 model without any disruption to the user experience.</em></figcaption></figure><h3>Summary</h3><p>AI infrastructure is about more than just the initial build; it’s about designing for evolution. By using shadow deployments, you ensure your RAG system can always stay at the cutting edge of model performance without sacrificing availability.</p><p>Ready to take it further?</p><ul><li>Check out the <a href="https://github.com/rsamborski/rag-migration">full source code on GitHub</a>.</li><li><a href="https://docs.cloud.google.com/run/docs/create-jobs?utm_campaign=CDR_0x87fa8d40_default_b499342268&amp;utm_medium=external&amp;utm_source=blog">Learn more about Cloud Run Jobs</a>.</li><li><a href="https://docs.cloud.google.com/alloydb/docs/pgvector?utm_campaign=CDR_0x87fa8d40_default_b499342268&amp;utm_medium=external&amp;utm_source=blog">Learn more about AlloyDB and pgvector</a>.</li><li><a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/embeddings?utm_campaign=CDR_0x87fa8d40_default_b499342268&amp;utm_medium=external&amp;utm_source=blog">Learn more about Embeddings APIs on VertexAI</a>.</li></ul><p>In my <a href="https://medium.com/google-cloud/how-i-used-gemini-cli-to-orchestrate-a-complex-rag-migration-136d9de137c5">next post</a>, we’ll look at the Developer Experience — how I used <a href="https://geminicli.com/">Gemini CLI</a> and the <a href="https://github.com/gemini-cli-extensions/conductor">Conductor extension</a> to build and manage this entire multi-phase project.</p><h3>Thanks for reading</h3><p>If you found this article helpful, please consider adding 50 claps to this post by pressing and holding the clap button 👏 This will help others find it. 
You can also share it with your friends on socials.</p><p>I’m always eager to share my learnings or chat with fellow developers and AI enthusiasts, so feel free to follow me on <a href="https://www.linkedin.com/in/remigiusz-samborski/">LinkedIn</a>, <a href="https://x.com/RemikSamborski">X</a> or <a href="https://bsky.app/profile/rsamborski.bsky.social">Bluesky</a>.</p><hr><p><a href="https://medium.com/google-cloud/migrating-vector-embeddings-in-production-without-downtime-8a0464af6f55">Migrating vector embeddings in production without downtime</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Building a Scalable RAG Backend with Cloud Run Jobs and AlloyDB]]></title>
            <link>https://medium.com/google-cloud/building-a-scalable-rag-backend-with-cloud-run-jobs-and-alloydb-6ead93ca4aec?source=rss-adf81c1f37ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/6ead93ca4aec</guid>
            <category><![CDATA[google-cloud-run]]></category>
            <category><![CDATA[data]]></category>
            <category><![CDATA[postresql]]></category>
            <category><![CDATA[google-cloud-platform]]></category>
            <category><![CDATA[ai-agent]]></category>
            <dc:creator><![CDATA[Remigiusz Samborski]]></dc:creator>
            <pubDate>Wed, 15 Apr 2026 05:26:04 GMT</pubDate>
            <atom:updated>2026-04-21T15:38:00.617Z</atom:updated>
            <content:encoded><![CDATA[<p>Building a Retrieval-Augmented Generation (RAG) system sounds easy with all the available tutorials. You take a few hundred products, run them through an embedding model, and store them in a vector database. It works beautifully on your machine or in a staging environment.</p><p>The friction starts at production scale. When your dataset jumps from a few hundred to millions of products, that simple Python loop you wrote to generate embeddings hits a wall. Between network latency and hitting API rate limits every few seconds, what was a five-minute task quickly spirals into a multi-hour ordeal that blocks your entire pipeline.</p><p>Scaling effectively means moving past sequential processing. In this post, we’ll explore how to build an industrial-strength RAG backend using <a href="https://docs.cloud.google.com/bigquery/docs/introduction?utm_campaign=CDR_0x87fa8d40_default_b499342314&amp;utm_medium=external&amp;utm_source=blog">BigQuery</a>, <a href="https://docs.cloud.google.com/run/docs/create-jobs?utm_campaign=CDR_0x87fa8d40_default_b499342314&amp;utm_medium=external&amp;utm_source=blog">Cloud Run Jobs</a>, <a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/embeddings?utm_campaign=CDR_0x87fa8d40_default_b499342314&amp;utm_medium=external&amp;utm_source=blog">Vertex AI</a>, and <a href="https://docs.cloud.google.com/alloydb/docs/overview?utm_campaign=CDR_0x87fa8d40_default_b499342314&amp;utm_medium=external&amp;utm_source=blog">AlloyDB for PostgreSQL</a>.</p><p>You will learn how to:</p><ul><li>Provision infrastructure with <a href="https://www.terraform.io/">Terraform</a></li><li>Parallelize embedding generation using <a href="https://cloud.google.com/run/docs/managing/jobs?utm_campaign=CDR_0x87fa8d40_default_b499342314&amp;utm_medium=external&amp;utm_source=blog">Cloud Run Jobs</a></li><li>Use the google-genai SDK with the Vertex AI text-embedding-005 model</li><li>Store and query vectors in <a 
href="https://cloud.google.com/alloydb/docs/overview?utm_campaign=CDR_0x87fa8d40_default_b499342314&amp;utm_medium=external&amp;utm_source=blog">AlloyDB for PostgreSQL</a> using pgvector</li></ul><p><em>Note: I decided to use AlloyDB in this example, but any other </em><a href="https://www.postgresql.org/"><em>PostgreSQL</em></a><em> database with the </em><a href="https://github.com/pgvector/pgvector"><em>pgvector extension</em></a><em> could work too; for example, you may consider </em><a href="https://docs.cloud.google.com/sql/docs/postgres?utm_campaign=CDR_0x87fa8d40_default_b499342314&amp;utm_medium=external&amp;utm_source=blog"><em>Cloud SQL for PostgreSQL</em></a><em>.</em></p><p>Before we dive into the code, let’s briefly discuss the core components that power this serverless AI solution.</p><h3>The Industrial-Strength Architecture</h3><p>Our pipeline is designed for massive scale and serverless efficiency. We leverage the following Google Cloud services:</p><ul><li><strong>BigQuery:</strong> Our source of truth, containing millions of product records.</li><li><strong>Cloud Run Jobs:</strong> A serverless compute platform that allows us to run hundreds of parallel tasks.</li><li><strong>Vertex AI (</strong>text-embedding-005<strong>):</strong> A state-of-the-art text embedding model from Google.</li><li><strong>AlloyDB for PostgreSQL:</strong> An enterprise-grade database with built-in pgvector support for high-performance vector search.</li></ul><p>The diagram below illustrates the high-level architecture of our RAG pipeline:</p><figure><img alt="High-level architecture of the RAG pipeline" src="https://cdn-images-1.medium.com/max/1024/1*7ajtr942WPhD7JZWS4o5Gw.png" /><figcaption>High-level architecture of the RAG pipeline</figcaption></figure><h3>Implementation</h3><p>Let’s walk through the setup and execution process step by step. 
All the code for this project is available in the <a href="https://github.com/rsamborski/rag-migration/tree/main/01-generation">RAG Migration Repository</a>.</p><h4>Prepare the environment</h4><p>First, let’s configure the <a href="https://cloud.google.com/sdk/docs/install-sdk?utm_campaign=CDR_0x87fa8d40_default_b499342314&amp;utm_medium=external&amp;utm_source=blog">gcloud CLI</a>, clone the repository and create a virtual environment with dependencies.</p><ul><li>Step 1 — set your default project:</li></ul><pre>gcloud config set project YOUR_PROJECT_ID</pre><ul><li>Step 2 — configure the default region for Cloud Run:</li></ul><pre>gcloud config set run/region europe-central2</pre><ul><li>Step 3 — clone the code repository</li></ul><pre>git clone https://github.com/rsamborski/rag-migration.git<br>cd rag-migration/01-generation</pre><ul><li>Step 4 — create a virtual environment and install dependencies</li></ul><pre>uv init<br>uv sync</pre><h4>Infrastructure with Terraform</h4><p>We use Terraform to provision the AlloyDB cluster, the Artifact Registry, and the Cloud Run Job. Navigate to 01-generation/infra/terraform and apply the configuration:</p><pre>terraform init<br>terraform plan -var=&quot;project_id=YOUR_PROJECT_ID&quot; -var=&quot;db_password=YOUR_SECURE_PASSWORD&quot; -out tfplan<br>terraform apply tfplan</pre><blockquote>The -out tfplan flag saves the plan to a file named tfplan, and terraform apply tfplan applies that specific plan. This is a best practice for ensuring that the plan and apply operations are consistent.</blockquote><h4>Connecting to AlloyDB</h4><p>To interact with AlloyDB, the application needs to establish a secure connection. 
Depending on where you are running the code, the approach differs:</p><ul><li><strong>Local Development:</strong> For running scripts or testing queries from your local machine, use the <a href="https://cloud.google.com/alloydb/docs/auth-proxy/overview?utm_campaign=CDR_0x87fa8d40_default_b499342314&amp;utm_medium=external&amp;utm_source=blog">AlloyDB Auth Proxy</a>. It provides secure access to your instance without you having to allowlist your local IP address on the AlloyDB instance.</li><li><strong>Cloud Run Jobs:</strong> When running in Cloud Run, the job connects to the AlloyDB instance over the private network (VPC). For this setup, we pass the database password as an environment variable in the Cloud Run Job configuration.</li></ul><blockquote>For production workloads, it is highly recommended to use Google Cloud Secret Manager to handle sensitive data like database passwords, rather than passing them as plain-text environment variables.</blockquote><h4>Embedding logic</h4><p>The worker script (01-generation/main.py) is designed to run as an individual task within a Cloud Run Job. 
It uses the CLOUD_RUN_TASK_INDEX environment variable to calculate its specific shard of data.</p><pre># Cloud Run Job environment variables<br>task_index = int(os.environ.get(&quot;CLOUD_RUN_TASK_INDEX&quot;, 0))<br>batch_size = int(os.environ.get(&quot;BATCH_SIZE&quot;, 100))<br><br># Calculate offset<br>offset = task_index * batch_size</pre><p>The embedding generation logic (01-generation/src/embedder.py) uses the google-genai SDK:</p><pre>import os<br>from google import genai<br>from google.genai.types import EmbedContentConfig<br><br>def generate_embeddings(texts: list[str]) -&gt; list[list[float]]:<br>    &quot;&quot;&quot;<br>    Generates embeddings for a list of texts using the text-embedding-005 model.<br>    Uses the new google-genai SDK to avoid deprecation warnings.<br>    &quot;&quot;&quot;<br>    if not texts:<br>        return []<br>        <br>    project_id = os.environ.get(&quot;GOOGLE_CLOUD_PROJECT&quot;, &quot;rsamborski-rag&quot;)<br>    location = os.environ.get(&quot;GOOGLE_CLOUD_REGION&quot;, &quot;europe-central2&quot;)<br>    <br>    # Initialize the Gen AI client for Vertex AI<br>    client = genai.Client(vertexai=True, project=project_id, location=location)<br>    <br>    # The dimensionality of the output embeddings for text-embedding-005.<br>    dimensionality = 768 <br>    task = &quot;RETRIEVAL_DOCUMENT&quot; # standard task for documents in RAG<br>    <br>    response = client.models.embed_content(<br>        model=&quot;text-embedding-005&quot;,<br>        contents=texts,<br>        config=EmbedContentConfig(<br>            task_type=task,<br>            output_dimensionality=dimensionality<br>        )<br>    )<br>    <br>    return [embedding.values for embedding in response.embeddings]</pre><h4>Build and deploy</h4><p>We containerize the application using the provided Dockerfile and deploy it as a Cloud Run Job. 
The deploy.sh script automates this process. Run it with:</p><pre>./infra/scripts/deploy.sh</pre><p>Once it finishes, you should see:</p><pre>---------------------------------------------------------<br>✅ Deployment Finished<br>---------------------------------------------------------</pre><h4>Run and monitor</h4><p>Now you can start the orchestrator by running:</p><pre>uv run orchestrator.py</pre><p>The orchestrator provides real-time feedback on the job status, which you can also monitor in the <a href="https://console.cloud.google.com/run/jobs?utm_campaign=CDR_0x87fa8d40_default_b499342314&amp;utm_medium=external&amp;utm_source=blog">Google Cloud Console</a>.</p><p>Congratulations 🎉 You have successfully built and run a parallelized embedding pipeline!</p><blockquote>For production environments, I recommend that you <a href="https://docs.cloud.google.com/alloydb/docs/ai/create-scann-index?utm_campaign=CDR_0x87fa8d40_default_b499342314&amp;utm_medium=external&amp;utm_source=blog">create a ScaNN index</a> to improve the speed of your queries. Please refer to the linked documentation to learn more about it.</blockquote><h3>Testing with the Semantic Search UI</h3><p>To see the embeddings in action, you can spin up the Next.js semantic search UI locally.</p><h4>Run the UI</h4><ul><li>Navigate to the UI directory and configure the environment:</li></ul><pre>cd ../02-ui<br>cp .env.template .env</pre><p>Edit the .env file to include your Google Cloud PROJECT_ID and the AlloyDB DB_PASSWORD you used during the Terraform deployment. 
Set DB_HOST=127.0.0.1 to route queries through the AlloyDB Auth Proxy.</p><ul><li>Install dependencies:</li></ul><pre>npm install</pre><ul><li>Start the AlloyDB Auth Proxy (in a separate terminal window):</li></ul><pre># Make sure you have downloaded the alloydb-auth-proxy binary<br>./alloydb-auth-proxy projects/YOUR_PROJECT_ID/locations/europe-central2/clusters/rag-migration-cluster/instances/rag-migration-instance</pre><ul><li>Start the development server:</li></ul><pre>npm run dev</pre><p>Navigate to <a href="http://localhost:3000">http://localhost:3000</a> to interact with the search portal. You can now run natural language queries directly against your product catalog!</p><h4>See it in action</h4><figure><img alt="Watch as natural language queries return highly relevant results mapped via the text-embedding-005 model in real-time." src="https://cdn-images-1.medium.com/max/884/1*DuaHqbobLX3tILWOX20wrA.gif" /><figcaption>Watch as natural language queries return highly relevant results mapped via the text-embedding-005 model in real-time.</figcaption></figure><h3>Summary</h3><p>You now have a scalable, serverless foundation for your RAG system. 
By using Cloud Run Jobs, you’ve transformed a bottleneck into a highly parallelized process capable of handling millions of records.</p><p>Ready to take it further?</p><ul><li>Check out the <a href="https://github.com/rsamborski/rag-migration">full source code on GitHub</a>.</li><li><a href="https://docs.cloud.google.com/run/docs/create-jobs?utm_campaign=CDR_0x87fa8d40_default_b499342314&amp;utm_medium=external&amp;utm_source=blog">Learn more about Cloud Run Jobs</a>.</li><li><a href="https://docs.cloud.google.com/alloydb/docs/pgvector?utm_campaign=CDR_0x87fa8d40_default_b499342314&amp;utm_medium=external&amp;utm_source=blog">Learn more about AlloyDB and pgvector</a>.</li><li><a href="https://docs.cloud.google.com/alloydb/docs/ai/create-scann-index?utm_campaign=CDR_0x87fa8d40_default_b499342314&amp;utm_medium=external&amp;utm_source=blog">Learn how to create a ScaNN index</a> for your embeddings.</li><li><a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/embeddings?utm_campaign=CDR_0x87fa8d40_default_b499342314&amp;utm_medium=external&amp;utm_source=blog">Learn more about Embeddings APIs on VertexAI</a>.</li></ul><p>In the <a href="https://medium.com/google-cloud/migrating-vector-embeddings-in-production-without-downtime-8a0464af6f55">next post</a>, we’ll dive into Zero-Downtime Embedding Migration — how to upgrade your vector models without taking your search offline.</p><h3>Thanks for reading</h3><p>If you found this article helpful, please consider adding 50 claps to this post by pressing and holding the clap button 👏 This will help others find it. 
You can also share it with your friends on socials.</p><p>I’m always eager to share my learnings or chat with fellow developers and AI enthusiasts, so feel free to follow me on <a href="https://www.linkedin.com/in/remigiusz-samborski/">LinkedIn</a>, <a href="https://x.com/RemikSamborski">X</a> or <a href="https://bsky.app/profile/rsamborski.bsky.social">Bluesky</a>.</p><hr><p><a href="https://medium.com/google-cloud/building-a-scalable-rag-backend-with-cloud-run-jobs-and-alloydb-6ead93ca4aec">Building a Scalable RAG Backend with Cloud Run Jobs and AlloyDB</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Secure Gemini CLI for Cloud development]]></title>
            <link>https://medium.com/google-cloud/secure-gemini-cli-for-cloud-development-488b23dedf29?source=rss-adf81c1f37ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/488b23dedf29</guid>
            <category><![CDATA[gemini-cli]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[devtools]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[ai-agent]]></category>
            <dc:creator><![CDATA[Remigiusz Samborski]]></dc:creator>
            <pubDate>Fri, 13 Mar 2026 05:32:25 GMT</pubDate>
            <atom:updated>2026-03-13T05:32:25.548Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="Secure Gemini CLI for Cloud development" src="https://cdn-images-1.medium.com/max/1024/1*bkM4TzNQ7L1mT4M6WU-wdg.jpeg" /></figure><p>AI agents are a double-edged sword. You hear horror stories of autonomous tools deleting production databases or purging entire email inboxes. These risks often lead users to require manual confirmation for every agent operation. This approach keeps you in control but limits the agent’s autonomy. You will soon find yourself hand-holding the agent and hindering its true capabilities. You need a way to let the agent run in “yolo mode” without risking your system.</p><p>In this post you will learn how to run <a href="https://geminicli.com/">Gemini CLI</a> in an isolated environment with limited <a href="https://github.com/">GitHub</a> and <a href="https://cloud.google.com/?utm_campaign=CDR_0x87fa8d40_default_b490349988&amp;utm_medium=external&amp;utm_source=blog">Google Cloud</a> access, so that you don’t have to worry about it doing too much damage if things go wrong. 
We will follow the principle of least privilege: Gemini CLI gets all the permissions it needs to build your project, but it can’t access systems it shouldn’t touch.</p><h3>The Sandbox premise</h3><p>The solution consists of the following components:</p><ol><li>Using <a href="https://docs.cloud.google.com/iam/docs/service-account-overview?utm_campaign=CDR_0x87fa8d40_default_b490349988&amp;utm_medium=external&amp;utm_source=blog">GitHub fine-grained personal access tokens</a> — limits source control risks.</li><li><a href="https://docs.cloud.google.com/iam/docs/service-account-overview?utm_campaign=CDR_0x87fa8d40_default_b490349988&amp;utm_medium=external&amp;utm_source=blog">Google Cloud service account</a> — limits cloud risks.</li><li><a href="https://www.docker.com/">Docker</a> — limits local system risks.</li><li><a href="https://geminicli.com/docs/cli/session-management/#session-limits">Session limits</a> — avoid surprises with the number of used tokens (especially important when running in --yolo mode).</li></ol><p>Following this approach will protect you from the ‘<strong>helpful agent curse</strong>’: a situation in which the agent tries very hard to achieve a task by finding ways around blockers. 
Examples include granting itself more permissions, copying files to the current folder to edit them, and many more.</p><h4>GitHub fine-grained personal access tokens</h4><p>First, let’s limit the agent’s GitHub exposure by using fine-grained tokens:</p><ul><li>Navigate to GitHub Settings &gt; Developer Settings &gt; Personal access tokens &gt; <a href="https://github.com/settings/personal-access-tokens">Fine-grained tokens</a>.</li><li>Click <em>Generate a new token</em>.</li><li>Provide a descriptive <em>name</em> for your token and consider setting an <em>expiration date</em> to force regular rotation.</li><li>Restrict <em>Repository access</em> to the specific target repo you are working on.</li><li>Grant <em>Read and Write</em> permissions for <em>Contents</em>.</li><li>Save the token locally by running export GITHUB_TOKEN=&quot;github_pat_...&quot;</li></ul><h4>Google Cloud Service Account</h4><p>Create an isolated Service Account (SA) with minimal permissions. This prevents the agent from accessing protected resources and other projects.</p><p>Run these commands after updating YOUR_PROJECT_ID and the roles below:</p><pre># Set your project ID<br>export CLOUDSDK_CORE_PROJECT=&quot;YOUR_PROJECT_ID&quot;<br>gcloud config set project &quot;$CLOUDSDK_CORE_PROJECT&quot;<br><br># Create the Service Account<br>gcloud iam service-accounts create gemini-cli-sa \<br>    --description=&quot;Isolated account for Gemini CLI&quot;<br><br># Grant minimal roles (adjust roles as needed)<br>gcloud projects add-iam-policy-binding &quot;$CLOUDSDK_CORE_PROJECT&quot; \<br>    --member=&quot;serviceAccount:gemini-cli-sa@${CLOUDSDK_CORE_PROJECT}.iam.gserviceaccount.com&quot; \<br>    --role=&quot;roles/aiplatform.user&quot;<br><br># Generate the JSON key file<br>gcloud iam service-accounts keys create sa-key.json \<br>    --iam-account=&quot;gemini-cli-sa@${CLOUDSDK_CORE_PROJECT}.iam.gserviceaccount.com&quot;</pre><p><strong>Hint:</strong> you can use the <a 
href="https://docs.cloud.google.com/iam/docs/roles-permissions?utm_campaign=CDR_0x87fa8d40_default_b490349988&amp;utm_medium=external&amp;utm_source=blog">IAM roles and permissions index</a> page to easily find the roles to grant.</p><p>A good practice is to use a dedicated project for each of your AI coding initiatives. This way you can run several agents in parallel, building different solutions without stepping on each other’s toes.</p><h4>Custom Docker Build</h4><p>The Gemini CLI can use a sandbox image to isolate the execution environment. For this setup you need to customize that image to install gcloud, terraform, and vim, and to set the git configuration.</p><h4>Prepare the Dockerfile</h4><p>Create a .gemini directory in your project, and inside it, create a sandbox.Dockerfile. Using this specific file name allows Gemini CLI to automatically detect and build your custom sandbox profile if you’re <a href="https://docs.google.com/document/d/1eiavW-bd-wf_fFrNNzbNkzTw1WztZVilYjjAtCphKmU/edit?resourcekey=0-Ne0THGy8CcVoMFUSJUd_rg&amp;tab=t.0#bookmark=id.5tyzxboz7enm">running it from source</a>. You can also use the same file to build the image manually if you’re <a href="https://docs.google.com/document/d/1eiavW-bd-wf_fFrNNzbNkzTw1WztZVilYjjAtCphKmU/edit?resourcekey=0-Ne0THGy8CcVoMFUSJUd_rg&amp;tab=t.0#bookmark=id.l2l4uuiqql87">running a binary installation</a>.</p><p>Paste this content into .gemini/sandbox.Dockerfile:</p><pre># Start from the official Gemini CLI sandbox image with the proper version<br>ARG GEMINI_CLI_VERSION=0.33.0<br>FROM us-docker.pkg.dev/gemini-code-dev/gemini-cli/sandbox:${GEMINI_CLI_VERSION}<br><br># Switch to root to install system dependencies (gcloud)<br>USER root<br><br># Install Google Cloud SDK, Git, and prerequisites<br>RUN apt-get update &amp;&amp; apt-get install -y curl apt-transport-https ca-certificates gnupg git &amp;&amp; \<br>    echo &quot;deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk 
main&quot; | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list &amp;&amp; \<br>    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - &amp;&amp; \<br>    apt-get update &amp;&amp; apt-get install -y google-cloud-cli<br><br># Install Terraform<br>RUN apt-get update &amp;&amp; apt-get install -y wget lsb-release &amp;&amp; \<br>    wget -O- https://apt.releases.hashicorp.com/gpg | gpg --dearmor | tee /usr/share/keyrings/hashicorp-archive-keyring.gpg &gt; /dev/null &amp;&amp; \<br>    echo &quot;deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(grep -oP &#39;(?&lt;=UBUNTU_CODENAME=).*&#39; /etc/os-release || lsb_release -cs) main&quot; | tee /etc/apt/sources.list.d/hashicorp.list &amp;&amp; \<br>    apt-get update &amp;&amp; apt-get install -y terraform<br><br># Install vim<br>RUN apt-get install -y vim<br><br># Switch back to the non-root user (the official sandbox image uses &#39;node&#39; as the default user)<br>USER node<br>WORKDIR /workspace<br><br># Configure Git to use the injected GitHub PAT at runtime<br>RUN git config --global credential.helper &#39;!f() { echo &quot;username=x-access-token&quot;; echo &quot;password=$GITHUB_TOKEN&quot;; }; f&#39;</pre><h4>Prepare Docker for building images (optional macOS step)</h4><p>If you haven’t built any Docker images before, run the following commands to prepare your environment with brew:</p><pre># Install dependencies<br>brew install docker colima docker-buildx<br><br># Configure docker-buildx<br>mkdir -p ~/.docker/cli-plugins<br>ln -sfn &quot;$(brew --prefix)/opt/docker-buildx/bin/docker-buildx&quot; ~/.docker/cli-plugins/docker-buildx<br><br># Start colima service<br>brew services start colima<br><br># Update DOCKER_HOST (you might want to add this line to .bash_profile):<br>export DOCKER_HOST=&quot;unix://${HOME}/.colima/default/docker.sock&quot;</pre><h4>Build the 
image (binary installation)</h4><p>If you installed Gemini CLI with npm, brew, or any other binary method, you will need to build the Docker image manually and tag it as the default image that Gemini CLI looks for:</p><pre># Get the base name the CLI looks for<br>export IMAGE_BASE_NAME=&quot;us-docker.pkg.dev/gemini-code-dev/gemini-cli/sandbox&quot;<br><br># Get your currently installed Gemini CLI version (e.g., 0.33.0)<br>export GEMINI_CLI_VERSION=&quot;$(gemini --version)&quot;<br><br># Combine them<br>export IMAGE_NAME=&quot;${IMAGE_BASE_NAME}:${GEMINI_CLI_VERSION}&quot;<br><br># Build your custom sandbox image<br>docker build \<br>  --build-arg GEMINI_CLI_VERSION=&quot;${GEMINI_CLI_VERSION}&quot; \<br>  -t &quot;${IMAGE_NAME}&quot; \<br>  -f .gemini/sandbox.Dockerfile .</pre><p><strong>Important: </strong>this image is tagged with the exact version of the Gemini CLI you use. This means it needs to be rebuilt every time you update the CLI. I keep the above code in a shell script and run it after every update.</p><h4>Build the image (source installation)</h4><p>This section applies if you’re running Gemini CLI from source, as explained <a href="https://geminicli.com/docs/get-started/installation/#run-from-source-recommended-for-gemini-cli-contributors">here</a>. 
You can trigger the image build automatically each time you start gemini.</p><p>First, update the top of your sandbox.Dockerfile by replacing the FROM line with the following:</p><pre># Start from the official Gemini CLI sandbox image (source installation)<br>FROM gemini-cli-sandbox</pre><h4>Start Gemini CLI in sandbox mode</h4><p>First, set a few important environment variables:</p><pre># Export the necessary environment variables<br>export GITHUB_TOKEN=&quot;github_pat_...&quot;<br>export GEMINI_API_KEY=&quot;your-api-key&quot;<br>export CLOUDSDK_CORE_PROJECT=&quot;YOUR_PROJECT_ID&quot;<br>export GEMINI_SANDBOX=docker<br><br># Pass the dynamic credentials into the sandbox container as environment variables<br>export SANDBOX_FLAGS=&quot;\<br>-e GITHUB_TOKEN=${GITHUB_TOKEN} \<br>-e CLOUDSDK_CORE_PROJECT=${CLOUDSDK_CORE_PROJECT} \<br>-e CLOUDSDK_AUTH_CREDENTIAL_FILE_OVERRIDE=$(pwd)/sa-key.json&quot;</pre><p><strong>Note: </strong>You can put the above variables in a shell script to speed up future starts. Just make sure to update your .gitignore so that neither the script nor sa-key.json gets added to your repository.</p><p>Now you can start Gemini CLI with the following command:</p><pre># For binary installation<br>gemini<br><br># For source installation<br>BUILD_SANDBOX=1 gemini</pre><h4>Session limits</h4><p>To avoid surprises with the number of tokens that Gemini CLI uses in your session, you can set Max Session Turns in /settings or in your ~/.gemini/settings.json:</p><figure><img alt="Max Session Turns" src="https://cdn-images-1.medium.com/max/1024/1*8x4VWAoBOQayKYY6bgnDKw.png" /></figure><h4>Congratulations</h4><p><strong>Congratulations 🚀</strong> You’re ready to validate your setup.</p><h3>Validation and “Ultimate Tests”</h3><p>Once the environment is running inside the sandbox, we should verify the security boundaries.</p><p>First, let’s run the /about command to check that we’re running within a sandbox. 
You should see something like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2YG5Vko3ft1npZvdJ_iB_Q.png" /></figure><p>Now let’s try to break out of our new sandbox.</p><h4>GitHub privilege escalation</h4><p>Try asking Gemini CLI to access a private repo it shouldn’t have access to. Example prompt:</p><pre>Clone https://github.com/USER_NAME/PRIVATE_REPOSITORY to a new folder</pre><p>You should see Gemini CLI try hard and struggle to get access. Mine got really creative, trying to access the repo with git, gh, and curl, and even trying to reuse the GITHUB_TOKEN manually. All of these attempts failed and this error was displayed:</p><figure><img alt="GitHub privilege escalation test" src="https://cdn-images-1.medium.com/max/1024/1*Fu7TZPeGDRE4UoMOTD-rWA.png" /></figure><h4>Google Cloud privilege escalation</h4><p>Ask the agent to list all compute instances:</p><pre>List all my compute instances</pre><p>It should fail due to missing permissions on your restricted Service Account. Gemini CLI tries really hard and executes a couple of different commands, including reauthentication, but it fails in the end:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UPiTr3auJidp2uXiwdP3Xg.png" /></figure><h4>Local privilege escalation</h4><p>Finally, let’s try to access a file from another project folder by prompting:</p><pre>There are other projects in the folder above the current one. 
List them and let me know if there is anything that is interesting from hacker&#39;s perspective.</pre><p>I am starting to feel sorry for the poor agent 😉<br>Once again it can’t complete its task:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*HCj8t6ICZ6CoP3h1eva7_w.png" /></figure><h3>Conclusions</h3><p>Now that you have validated your sandbox setup, you should feel much more confident running gemini --yolo and streamlining your work as <a href="https://geminicli.com/">Gemini CLI</a> delivers your code without hand-holding and pesky “Can I execute this command?” prompts.</p><p>I am looking forward to all the creative ideas you’ll bring to life!</p><h3>What’s next?</h3><p>If you find this setup useful, here are some additional steps to consider:</p><ul><li>Try out the <a href="https://github.com/gemini-cli-extensions/conductor">Gemini CLI Conductor Extension</a>. It’s very powerful and can significantly help you run autonomous agents effectively. <a href="https://medium.com/google-cloud/trying-out-the-new-conductor-extension-in-gemini-cli-0801f892e2db">Here is a deep dive</a> into some of the advantages.</li><li>Read my <a href="https://medium.com/google-cloud/antigravity-the-ralph-wiggum-style-ee6784a78237">Antigravity the Ralph Wiggum style</a> post, which covers sandboxing for <a href="https://antigravity.google/?utm_campaign=CDR_0x87fa8d40_default_b490349988&amp;utm_medium=external&amp;utm_source=blog">Antigravity</a>.</li><li>Add 50 claps to this post by pressing and holding the clap button 👏 <br>This will help others find it.</li><li>Share this post with your friends on socials.</li><li>Connect with me via <a href="https://www.linkedin.com/in/remigiusz-samborski/">LinkedIn</a>, <a href="https://x.com/RemikSamborski">X</a> or <a href="https://bsky.app/profile/rsamborski.bsky.social">Bluesky</a>.</li></ul><p>Thanks for reading.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=488b23dedf29" width="1" 
height="1" alt=""><hr><p><a href="https://medium.com/google-cloud/secure-gemini-cli-for-cloud-development-488b23dedf29">Secure Gemini CLI for Cloud development</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Antigravity the Ralph Wiggum style]]></title>
            <link>https://medium.com/google-cloud/antigravity-the-ralph-wiggum-style-ee6784a78237?source=rss-adf81c1f37ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/ee6784a78237</guid>
            <category><![CDATA[ralph-wiggum]]></category>
            <category><![CDATA[google-antigravity]]></category>
            <category><![CDATA[coding]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[agentic-ai]]></category>
            <dc:creator><![CDATA[Remigiusz Samborski]]></dc:creator>
            <pubDate>Fri, 30 Jan 2026 15:24:23 GMT</pubDate>
            <atom:updated>2026-01-30T15:24:23.669Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oXjVMkDQc4wCjhGkqBWHCQ.jpeg" /><figcaption>Antigravity the Ralph Wiggum Style</figcaption></figure><p>The <a href="https://ghuntley.com/ralph/">Ralph Wiggum trend</a> has been surfacing across social platforms lately. If you’re tracking current tech developments, it’s hard to miss. Named after a persistent and slightly confused second-grader, the Wiggum Loop approach to agentic development boils down to: <strong>Don’t stop until the job is done.</strong></p><p>In traditional AI coding, the agent performs a task, stops, and waits for you to approve its next step or request changes. In a Wiggum Loop, you give the agent a mission and success criteria (like passing tests), and it keeps looping, fixing its own bugs and refactoring, until it hits the green light.</p><p>The recent excitement around Wiggum Loop agentic development highlights a powerful shift toward autonomous, self-correcting development. I’ve been using a similar approach with <a href="http://antigravity.google">Antigravity</a> for some time. In this post, I’ll share my strategy so you can implement truly unsupervised development yourself.</p><h3>Going “Full Wiggum”</h3><p>To achieve truly unsupervised development, we need to move away from the review-driven defaults and let the agent take the wheel. Antigravity is built for this because it’s an agent-first environment capable of acting in both the terminal and the browser.</p><p>To mirror the “Bash loop” persistence of the Ralph Wiggum plugin, configure your <a href="https://antigravity.google/docs/agent-modes-settings">Antigravity settings</a> as follows:</p><ol><li><strong>Mode:</strong> Select Agent-driven development. This shifts the agent from a “wait for instructions” assistant to a “goal-oriented” architect.</li><li><strong>Terminal execution policy:</strong> Set to Always Proceed. 
This allows the agent to run npm test, uv run pytest, and other commands without constantly pausing for approval.</li><li><strong>Review policy:</strong> Set to Always Proceed. This tells the agent that its implementation plans are pre-approved.</li><li><strong>JavaScript execution policy:</strong> Set to Always Proceed. This is essential for agents that need to run scripts or interact with browser environments to verify their work.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ii88dl3ff9muZMe2O3CAdA.png" /><figcaption>Antigravity settings</figcaption></figure><blockquote><strong><em>WARNING: THE SANDBOX IS NOT OPTIONAL. </em></strong><em>Running an agent in “Always Proceed” mode is like giving Bart Simpson a slingshot in front of a mirror store. </em><strong><em>Only do this in a sandbox environment.</em></strong></blockquote><blockquote><em>Here is </em><a href="https://medium.com/google-cloud/using-chrome-remote-desktop-to-run-antigravity-on-a-cloud-workstation-or-just-in-a-container-d00296425a0f"><em>a great article</em></a><em> from my </em><a href="https://medium.com/@danistrebel"><em>colleague</em></a><em>, which walks step by step through setting up and running such an environment on </em><a href="https://cloud.google.com/workstations?utm_campaign=CDR_0x87fa8d40_default_b478855277&amp;utm_medium=external&amp;utm_source=blog"><em>Cloud Workstation</em></a><em>.</em></blockquote><h3><strong>Example</strong></h3><p>To see this in practice, I ran the following prompt against Antigravity:</p><pre>Build a REST API for todos in NodeJS.<br><br>When complete:<br>- All CRUD endpoints are working<br>- Input validation is in place<br>- Tests are passing (coverage &gt; 80%)<br>- README with API docs exists</pre><p>The screencast below shows how Antigravity handled the task without my intervention (I spent this time on other tasks rather than hand-holding the agent):</p><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/1024/1*x_jdgAbno01nLXdZLQkSPw.gif" /><figcaption>Agent-driven development in Antigravity</figcaption></figure><h3>How does this work?</h3><p>Antigravity isn’t just looping in a vacuum. Because it has native hooks into <a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/start/get-started-with-gemini-3?utm_campaign=CDR_0x87fa8d40_default_b478855277&amp;utm_medium=external&amp;utm_source=blog">Gemini 3 Pro</a>, it utilizes a massive context window that remembers exactly why a previous command failed.</p><p>It kicks things off by drafting up an implementation plan and a task list. In the video, you can watch it tick through these items in real time. It doesn’t just plan, though — it actually touches the terminal to initialize the npm project and run tests.</p><p>The loop only closes once every requirement is met and the test suite hits green. It then provides a handy walkthrough so you can easily understand the architecture it just spun up.</p><p>This approach turns development from writing code into verifying outcomes.</p><h3>From vibe-coding to vibe-building</h3><p>The Ralph Wiggum trend isn’t about cutting corners; it’s about embracing sheer, stubborn persistence through automation. By letting Antigravity operate autonomously, you transition from a coder to an architect and team lead. You define the standards and environment, while agents manage the iterative grind of writing, testing, and debugging cycles that typically consume a developer’s valuable time.</p><p>Are you brave enough to let the agent “Always Proceed”? 
Visit <a href="https://antigravity.google/download">Antigravity’s download page</a> to start experimenting yourself.</p><h3>Other resources</h3><ul><li><a href="https://youtu.be/VgSefJZLSdw?si=PHFAwZSIR2DsybWP">Billy’s Ralph Wiggum loop with Gemini CLI</a></li><li><a href="https://medium.com/google-cloud/using-chrome-remote-desktop-to-run-antigravity-on-a-cloud-workstation-or-just-in-a-container-d00296425a0f">Daniel’s Antigravity on Cloud Workstation tutorial</a></li><li><a href="https://medium.com/google-cloud/tutorial-getting-started-with-google-antigravity-b5cc74c103c2">Romin’s getting started with Google Antigravity tutorial</a></li></ul><h3>Let’s Connect!</h3><p>I’d love to hear how you’re using Antigravity for your agentic workflows. Are you building Wiggum loops or keeping a tighter leash on your agents?</p><ul><li>Connect on <a href="https://www.linkedin.com/in/remigiusz-samborski/">LinkedIn</a></li><li>Follow me on <a href="https://x.com/RemikSamborski">X</a></li><li>Catch me on <a href="https://bsky.app/profile/rsamborski.bsky.social">Bluesky</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ee6784a78237" width="1" height="1" alt=""><hr><p><a href="https://medium.com/google-cloud/antigravity-the-ralph-wiggum-style-ee6784a78237">Antigravity the Ralph Wiggum style</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Serverless AI: EmbeddingGemma with Cloud Run]]></title>
            <link>https://medium.com/google-cloud/serverless-ai-embeddinggemma-with-cloud-run-14a6beed9d03?source=rss-adf81c1f37ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/14a6beed9d03</guid>
            <category><![CDATA[generative-ai]]></category>
            <category><![CDATA[google-cloud-run]]></category>
            <category><![CDATA[cloud-run-gpu]]></category>
            <category><![CDATA[gemma-3]]></category>
            <category><![CDATA[embeddinggemma]]></category>
            <dc:creator><![CDATA[Remigiusz Samborski]]></dc:creator>
            <pubDate>Wed, 24 Sep 2025 16:41:23 GMT</pubDate>
            <atom:updated>2025-09-24T16:41:23.289Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="EmbeddingGemma on Cloud Run" src="https://cdn-images-1.medium.com/max/1024/1*sra-InVEE_BSgj7ZxHoheg.png" /><figcaption>EmbeddingGemma on Cloud Run</figcaption></figure><p>Building on the <a href="https://medium.com/google-cloud/serverless-ai-qwen3-embeddings-with-cloud-run-eb35d7f4037f">previous blog post</a> about running Qwen3 Embedding models on Cloud Run, this article focuses on the recently released EmbeddingGemma model from the <a href="https://ai.google.dev/gemma/docs?utm_campaign=CDR_0x87fa8d40_platform_b443676501&amp;utm_medium=external&amp;utm_source=blog">Gemma family</a>. Discover how to leverage the same powerful serverless techniques to deploy this model on Google Cloud’s serverless platform.</p><p>You will learn how to:</p><ul><li>Containerize the embedding model with Docker and Ollama</li><li>Deploy the embedding model to Cloud Run with GPUs</li><li>Test the deployed model from a local machine</li></ul><p>Before we dive into the code, let’s briefly discuss the core components that power this serverless AI solution.</p><h4>EmbeddingGemma Model</h4><p>According to the <a href="https://ai.google.dev/gemma/docs/embeddinggemma?utm_campaign=CDR_0x87fa8d40_platform_b443676501&amp;utm_medium=external&amp;utm_source=blog">EmbeddingGemma model card</a>:</p><p>“EmbeddingGemma is a 308M parameter multilingual text embedding model based on Gemma 3. It is optimized for use in everyday devices, such as phones, laptops, and tablets. 
The model produces numerical representations of text to be used for downstream tasks like information retrieval, semantic similarity search, classification, and clustering.”</p><p>Its optimization for efficiency makes EmbeddingGemma an ideal candidate for serverless deployment on Cloud Run, ensuring high performance and cost-effectiveness for your AI applications.</p><h4>Cloud Run</h4><p><a href="https://cloud.google.com/run?utm_campaign=CDR_0x87fa8d40_platform_b443676501&amp;utm_medium=external&amp;utm_source=blog">Cloud Run</a> is a managed compute platform on Google Cloud that lets you run containerized applications in a serverless environment. Think of it as a middle ground between a simple function-as-a-service (like Cloud Run Functions) and a more customizable GKE cluster. You give it a container image, and it handles all the underlying infrastructure, from provisioning and scaling to managing the runtime.</p><p>The beauty of Cloud Run is that it can automatically scale to zero, meaning when there are no requests, you aren’t paying for any resources. When traffic picks up, it quickly scales up to handle the load. 
This makes it perfect for stateless models that need to be highly available and cost-effective.</p><h3>Deployment</h3><p>Let’s walk through the deployment process step by step.</p><h4>Prepare the environment</h4><p>First, let’s configure the gcloud CLI environment.</p><p><em>Note: if you do not have the gcloud CLI installed, please follow the instructions </em><a href="https://cloud.google.com/sdk/docs/install?utm_campaign=CDR_0x87fa8d40_platform_b443676501&amp;utm_medium=external&amp;utm_source=blog"><em>available here</em></a><em>.</em></p><ul><li>Step 1: Set your default project:</li></ul><pre>gcloud config set project PROJECT_ID</pre><ul><li>Step 2: Configure the Google Cloud CLI to use the <em>europe-west1 </em>region for Cloud Run commands:</li></ul><pre>gcloud config set run/region europe-west1</pre><p><em>Important: at the time of writing, GPUs on Cloud Run are available in several regions. To find the closest supported region, please refer to </em><a href="https://cloud.google.com/run/docs/configuring/services/gpu?utm_campaign=CDR_0x87fa8d40_platform_b443676501&amp;utm_medium=external&amp;utm_source=blog#supported-regions"><em>this page</em></a><em>.</em></p><h4>Containerize</h4><p>Now we will use <a href="https://www.docker.com/">Docker</a> and <a href="https://ollama.com/">Ollama</a> to run the EmbeddingGemma model. Create a file named Dockerfile containing:</p><pre>FROM ollama/ollama:latest<br># Listen on all interfaces, port 8080<br>ENV OLLAMA_HOST=0.0.0.0:8080<br># Store model weight files in /models<br>ENV OLLAMA_MODELS=/models<br># Reduce logging verbosity<br>ENV OLLAMA_DEBUG=false<br># Never unload model weights from the GPU<br>ENV OLLAMA_KEEP_ALIVE=-1<br># Store the model weights in the container image<br>ENV MODEL=embeddinggemma:latest<br>RUN ollama serve &amp; sleep 5 &amp;&amp; ollama pull $MODEL<br># Start Ollama<br>ENTRYPOINT [&quot;ollama&quot;, &quot;serve&quot;]</pre><h4>Build and Deploy</h4><p>We will now use Cloud Run’s source deployments. 
This allows you to achieve the following with one command:</p><ul><li>First, build the container image from the provided source.</li><li>Next, upload the resulting container image to <a href="https://cloud.google.com/artifact-registry/docs?utm_campaign=CDR_0x87fa8d40_platform_b443676501&amp;utm_medium=external&amp;utm_source=blog">Artifact Registry</a>.</li><li>Then, deploy the container to Cloud Run, ensuring that GPU support is enabled using the --gpu and --gpu-type parameters.</li><li>Finally, redirect all incoming traffic to this newly deployed version.</li></ul><p>You just need to run:</p><pre>gcloud run deploy embedding-gemma \<br>  --source . \<br>  --concurrency 4 \<br>  --cpu 8 \<br>  --set-env-vars OLLAMA_NUM_PARALLEL=4 \<br>  --gpu 1 \<br>  --gpu-type nvidia-l4 \<br>  --max-instances 1 \<br>  --memory 32Gi \<br>  --no-allow-unauthenticated \<br>  --no-cpu-throttling \<br>  --no-gpu-zonal-redundancy \<br>  --timeout=600 \<br>  --labels dev-tutorial=blog-embedding-gemma</pre><p>Note the following important flags in this command:</p><ul><li>--concurrency 4 is set to match the value of the environment variable OLLAMA_NUM_PARALLEL.</li><li>--gpu 1 with --gpu-type nvidia-l4 assigns 1 NVIDIA L4 GPU to every Cloud Run instance in the service.</li><li>--max-instances 1 specifies the maximum number of instances to scale to. It has to be equal to or lower than your project’s NVIDIA L4 GPU quota.</li><li>--no-allow-unauthenticated restricts unauthenticated access to the service. By keeping the service private, you can rely on Cloud Run’s built-in Identity and Access Management (IAM) authentication for service-to-service communication.</li><li>--no-cpu-throttling is required for enabling GPU.</li><li>--no-gpu-zonal-redundancy disables GPU zonal redundancy. Set this option depending on your zonal failover requirements and available quota.</li></ul><h4>Test the deployment</h4><p>Once the service is deployed, you can start sending requests. 
However, direct API calls will result in an <em>HTTP 401 Unauthorized</em> response from Cloud Run.</p><p>This behaviour follows Google’s “secure by default” approach. The model is intended for calls from other services, such as a RAG application, and therefore is not open for public access.</p><p>To test your deployment locally, the simplest approach is to launch the Cloud Run developer proxy using the following command:</p><pre>gcloud run services proxy embedding-gemma --port=9090</pre><p>Afterwards, in a second terminal window, run:</p><pre>curl http://localhost:9090/api/embed -d &#39;{<br>  &quot;model&quot;: &quot;embeddinggemma&quot;,<br>  &quot;input&quot;: &quot;Sample text&quot;<br>}&#39;</pre><p>The response will look similar to this:</p><figure><img alt="EmbeddingGemma curl response" src="https://cdn-images-1.medium.com/max/1024/1*1YCV7EAUV3_avClwha1vPw.png" /><figcaption>EmbeddingGemma curl response</figcaption></figure><p>You can also use Python to call the endpoint. Example:</p><pre>from ollama import Client<br><br># Connect through the local Cloud Run proxy<br>client = Client(host=&quot;http://localhost:9090&quot;)<br><br>response = client.embed(model=&quot;embeddinggemma&quot;, input=&quot;Sample text&quot;)<br>print(response)</pre><p>Congratulations 🎉 The Cloud Run deployment is up and running!</p><h3>RAG Example</h3><p>You can use the newly deployed model to build your first <a href="https://en.wikipedia.org/wiki/Retrieval-augmented_generation">RAG application</a>. 
Here’s how to achieve this:</p><h4>Step 1: Generate Embeddings</h4><p>Start by installing the required dependencies:</p><pre>pip install ollama chromadb</pre><p>Create an example.py file containing:</p><pre>import ollama<br>import chromadb<br><br>documents = [<br>    &quot;Poland is a country located in Central Europe.&quot;,<br>    &quot;The capital and largest city of Poland is Warsaw.&quot;,<br>    &quot;Poland&#39;s official language is Polish, which is a West Slavic language.&quot;,<br>    &quot;Marie Curie, the pioneering scientist who conducted groundbreaking research on radioactivity, was born in Warsaw, Poland.&quot;,<br>    &quot;Poland is famous for its traditional dish called pierogi, which are filled dumplings.&quot;,<br>    &quot;The Białowieża Forest in Poland is one of the last and largest remaining parts of the immense primeval forest that once stretched across the European Plain.&quot;,<br>]<br><br>client = chromadb.Client()<br>collection = client.create_collection(name=&quot;docs&quot;)<br><br>ollama_client = ollama.Client(host=&quot;http://localhost:9090&quot;)<br><br># Store each document in an in-memory vector embeddings database<br>for i, d in enumerate(documents):<br>    response = ollama_client.embed(model=&quot;embeddinggemma&quot;, input=d)<br>    embeddings = response[&quot;embeddings&quot;]<br>    collection.add(ids=[str(i)], embeddings=embeddings, documents=[d])</pre><h4>Step 2: Retrieve</h4><p>Next, use the following code to search the vector database for the most relevant document (add it to your example.py):</p><pre># An example question<br>question = &quot;What is Poland&#39;s official language?&quot;<br><br># Generate an embedding for the input and retrieve the most relevant document<br>response = ollama_client.embed(model=&quot;embeddinggemma&quot;, input=question)<br>results = collection.query(query_embeddings=[response[&quot;embeddings&quot;][0]], n_results=1)<br>data = results[&quot;documents&quot;][0][0]</pre><h4>Step 3: Generate 
Final Answer</h4><p>In this final step we will use a locally installed <a href="https://ollama.com/library/gemma3">Gemma3</a> model.</p><p><em>Note: We use Gemma3 in the generation step, but any other model could work here (e.g., Gemini, Qwen3, or Llama). Nevertheless, it is </em><strong><em>critical to use the same embedding</em></strong><em> model in Step 1 (Generate Embeddings) and Step 2 (Retrieve).</em></p><p>To install the gemma3:latest model locally, run:</p><pre>ollama pull gemma3</pre><p>Now you can combine the user’s prompt with the search results and generate the final answer (add this code to example.py):</p><pre># Final step - generate a response combining the prompt and data we retrieved in step 2<br>prompt = f&quot;Using this data: {data}. Respond to this prompt: {question}&quot;<br><br>output = ollama.generate(<br>    model=&quot;gemma3&quot;,<br>    prompt=prompt,<br>)<br><br>print(f&quot;Prompt: {prompt}&quot;)<br>print(output[&quot;response&quot;])</pre><p>Run the code:</p><pre>python example.py</pre><p>The answer should look similar to the one below:</p><pre>Prompt: Using this data: Poland&#39;s official language is Polish, which is a West Slavic language.. Respond to this prompt: What is Poland&#39;s official language?<br>Poland’s official language is Polish. It’s a West Slavic language.</pre><p>You have successfully created and run your first RAG application using the EmbeddingGemma model.</p><h3>Summary</h3><p>At this point, you have set up a Cloud Run service running the EmbeddingGemma model, ready to generate embeddings for semantic search or RAG applications.</p><p>This method also allows you to deploy and compare multiple embedding models on Cloud Run (e.g. 
<a href="https://medium.com/google-cloud/serverless-ai-qwen3-embeddings-with-cloud-run-eb35d7f4037f">Qwen3 Embedding</a> or <a href="https://ollama.com/search?c=embedding">other Ollama-supported models</a>), enabling you to find the best fit for your specific use case without major code changes.</p><p>Ready to build your own serverless AI applications?</p><ul><li><a href="https://cloud.google.com/run?utm_campaign=CDR_0x87fa8d40_platform_b443676501&amp;utm_medium=external&amp;utm_source=blog">Start building on Cloud Run today</a> and explore its full potential!</li><li>If you’re interested in learning more about RAG evaluation, <a href="https://medium.com/google-cloud/evaluating-rag-pipelines-d99e007e625f">this article</a> is a good starting point.</li></ul><h3>Thanks for reading</h3><p>If you found this article helpful, please consider following me here on Medium and giving it a clap 👏 to help others discover it.</p><p>I’m always eager to chat with fellow developers and AI enthusiasts, so feel free to connect with me on <a href="https://www.linkedin.com/in/remigiusz-samborski/">LinkedIn</a> or <a href="https://bsky.app/profile/rsamborski.bsky.social">Bluesky</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=14a6beed9d03" width="1" height="1" alt=""><hr><p><a href="https://medium.com/google-cloud/serverless-ai-embeddinggemma-with-cloud-run-14a6beed9d03">Serverless AI: EmbeddingGemma with Cloud Run</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Serverless AI: Qwen3 Embeddings with Cloud Run]]></title>
            <link>https://medium.com/google-cloud/serverless-ai-qwen3-embeddings-with-cloud-run-eb35d7f4037f?source=rss-adf81c1f37ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/eb35d7f4037f</guid>
            <category><![CDATA[cloud-run-gpu]]></category>
            <category><![CDATA[gcp-app-dev]]></category>
            <category><![CDATA[google-cloud-run]]></category>
            <category><![CDATA[qwen-3]]></category>
            <category><![CDATA[embedding]]></category>
            <dc:creator><![CDATA[Remigiusz Samborski]]></dc:creator>
            <pubDate>Wed, 20 Aug 2025 02:00:56 GMT</pubDate>
            <atom:updated>2025-09-15T12:57:24.905Z</atom:updated>
            <content:encoded><![CDATA[<p>In this blog post, I’ll show you how to deploy the Qwen3 Embedding model to <a href="https://cloud.google.com/run/docs/configuring/services/gpu?utm_campaign=CDR_0x87fa8d40_platform_b438423716&amp;utm_medium=external&amp;utm_source=blog">Cloud Run with GPUs</a> for enhanced performance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*d-Ptjq0Nde1BzusX3nWgSQ.png" /><figcaption>Qwen3 Embedding model on Cloud Run with GPUs</figcaption></figure><p>You will learn how to:</p><ul><li>Containerize the embedding model with Docker and Ollama</li><li>Deploy the embedding model to Cloud Run with GPUs</li><li>Test the deployed model from a local machine</li></ul><p>Before we jump into the code, a few words about the key components of the solution.</p><h4>Qwen3 Embedding Model</h4><p>The Qwen3 Embedding series is a set of open-source models for text <a href="https://huggingface.co/collections/Qwen/qwen3-embedding-6841b2055b99c44d9a4c371f">embedding</a> and <a href="https://huggingface.co/collections/Qwen/qwen3-reranker-6841b22d0192d7ade9cdefea">reranking</a>, built on the Qwen3 Large Language Model (LLM) family. It’s designed for retrieval-augmented generation (RAG), a technique that enhances the output of large language models by retrieving relevant information from a knowledge base, and other tasks requiring semantic search. You can learn more about embeddings in <a href="https://www.youtube.com/watch?v=vlcQV4j2kTo">this video</a>.</p><p>Open embedding models such as Qwen3 are the ideal choice when you need greater control, specialization, and security than proprietary, “black-box” APIs can offer. 
They are particularly well-suited for the following use cases:</p><ul><li>Fine-Tuning for Niche Domains📻: by fine-tuning them on specialized data (e.g., legal contracts, medical research, internal company wikis) they can provide more accurate results for semantic search and RAG than a general-purpose model.</li><li>Data Privacy &amp; Security🔒: open models can be self-hosted or deployed to cloud resources managed by your organization. This ensures compliance with regulations like GDPR and prevents data from ever leaving your control.</li><li>Cost-Effectiveness at Scale💰: for high-volume tasks, running an optimized open model can be cheaper than paying per-API-call fees to a proprietary service provider.</li><li>Offline &amp; Edge Deployment🛜: open models can run locally and are perfect for applications that must function without an internet connection, such as on-device search in mobile apps or analysis on remote IoT devices.</li></ul><p>I chose the <a href="https://huggingface.co/Qwen/Qwen3-Embedding-4B">Qwen3-Embedding-4B</a> model due to its growing popularity and suitable size for the Cloud Run environment. However, you can experiment with different sizes (0.6B, 4B, and 8B) depending on your specific use case.</p><h4>Cloud Run</h4><p><a href="https://cloud.google.com/run?utm_campaign=CDR_0x87fa8d40_platform_b438423716&amp;utm_medium=external&amp;utm_source=blog">Cloud Run</a> is a managed compute platform on Google Cloud that lets you run containerized applications in a serverless environment. Think of it as a middle ground between a simple function-as-a-service (like Cloud Functions) and a more complex GKE cluster. You give it a container image, and it handles all the underlying infrastructure, from provisioning and scaling to managing the runtime.</p><p>The beauty of Cloud Run is that it can automatically scale to zero, meaning when there are no requests, you aren’t paying for any resources. When traffic picks up, it quickly scales up to handle the load. 
This makes it perfect for stateless models that need to be highly available and cost-effective.</p><h3><strong>Deployment</strong></h3><p>But enough with the intros, let’s get our hands dirty with some code 🧑‍💻</p><p>Below are step-by-step instructions for getting the Qwen3 Embedding model up and running.</p><h4>Prepare the environment</h4><p>First, we need to configure the gcloud CLI environment.</p><p><em>Note: if you don’t have the gcloud CLI installed, please follow the instructions </em><a href="https://cloud.google.com/sdk/docs/install?utm_campaign=CDR_0x87fa8d40_platform_b438423716&amp;utm_medium=external&amp;utm_source=blog"><em>available here</em></a><em>.</em></p><p><strong>Step 1 </strong>—<strong> </strong>Set your default project:</p><pre>gcloud config set project PROJECT_ID</pre><p><strong>Step 2 </strong>—<strong> </strong>Configure the Google Cloud CLI to use the <em>europe-west1 </em>region for Cloud Run commands:</p><pre>gcloud config set run/region europe-west1</pre><p><strong>Important:</strong> at the time of writing, GPUs on Cloud Run are available in several regions. To find the closest supported region, please refer to <a href="https://cloud.google.com/run/docs/configuring/services/gpu#supported-regions?utm_campaign=CDR_0x87fa8d40_platform_b438423716&amp;utm_medium=external&amp;utm_source=blog">this page</a>.</p><h4>Containerize</h4><p>We will use <a href="https://www.docker.com/">Docker</a> and <a href="https://ollama.com/">Ollama</a> to run the Qwen3 Embedding model. 
Create a file named <em>Dockerfile</em> and put the following code inside it:</p><pre>FROM ollama/ollama:latest<br><br># Listen on all interfaces, port 8080<br>ENV OLLAMA_HOST=0.0.0.0:8080<br><br># Store model weight files in /models<br>ENV OLLAMA_MODELS=/models<br><br># Reduce logging verbosity<br>ENV OLLAMA_DEBUG=false<br><br># Never unload model weights from the GPU<br>ENV OLLAMA_KEEP_ALIVE=-1<br><br># Store the model weights in the container image<br>ENV MODEL=dengcao/Qwen3-Embedding-4B:Q4_K_M<br>RUN ollama serve &amp; sleep 5 &amp;&amp; ollama pull $MODEL<br><br># Start Ollama<br>ENTRYPOINT [&quot;ollama&quot;, &quot;serve&quot;]</pre><h4>Build and deploy</h4><p>Next, it’s time to leverage the power of Cloud Run’s source deployments. With a single command you can:</p><ul><li>Build the container image from source (note the --source parameter in the command below)</li><li>Upload the container image to <a href="https://cloud.google.com/artifact-registry/docs?utm_campaign=CDR_0x87fa8d40_platform_b438423716&amp;utm_medium=external&amp;utm_source=blog">Artifact Registry</a></li><li>Deploy the container to Cloud Run with GPUs enabled (note the --gpu and --gpu-type options)</li><li>Redirect all traffic to the new deployment</li></ul><p>To do all of the above, you just need to run:</p><pre>gcloud run deploy ollama-qwen3-embeddings \<br>  --source . 
\<br>  --concurrency 4 \<br>  --cpu 8 \<br>  --set-env-vars OLLAMA_NUM_PARALLEL=4 \<br>  --gpu 1 \<br>  --gpu-type nvidia-l4 \<br>  --max-instances 1 \<br>  --memory 32Gi \<br>  --no-allow-unauthenticated \<br>  --no-cpu-throttling \<br>  --no-gpu-zonal-redundancy \<br>  --timeout=600 \<br>  --labels dev-tutorial=blog-qwen3-embeddings</pre><p>Note the following important flags in this command:</p><ul><li>--concurrency 4 is set to match the value of the environment variable OLLAMA_NUM_PARALLEL.</li><li>--gpu 1 with --gpu-type nvidia-l4 assigns 1 NVIDIA L4 GPU to every Cloud Run instance in the service.</li><li>--max-instances 1 specifies the maximum number of instances to scale to. It has to be equal to or lower than your project’s NVIDIA L4 GPU quota.</li><li>--no-allow-unauthenticated restricts unauthenticated access to the service. By keeping the service private, you can rely on Cloud Run’s built-in Identity and Access Management (IAM) authentication for service-to-service communication.</li><li>--no-cpu-throttling is required to enable the GPU.</li><li>--no-gpu-zonal-redundancy turns off GPU zonal redundancy. Set the zonal redundancy option depending on your zonal failover requirements and available quota.</li></ul><h4>Test the deployment</h4><p>Now that you have successfully deployed the service, you can send requests to it. However, if you send a request directly, Cloud Run will respond with <em>HTTP 401 Unauthorized</em>. 
This is intentional, because we want our model to be called from other services, such as a RAG application, and not accessible by everyone on the Internet.</p><p>The easiest way to test the deployment from a local machine is to spin up the Cloud Run developer proxy by executing:</p><pre>gcloud run services proxy ollama-qwen3-embeddings --port=9090</pre><p>Now in a second terminal window run:</p><pre>curl http://localhost:9090/api/embed -d &#39;{<br>  &quot;model&quot;: &quot;dengcao/Qwen3-Embedding-4B:Q4_K_M&quot;,<br>  &quot;input&quot;: &quot;Sample text&quot;<br>}&#39;</pre><p>You should see a response similar to this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*VTcPgNBuyRAnKmrl" /><figcaption>Qwen3 Embedding Model response from Cloud Run</figcaption></figure><p>You can also call the endpoint from a Python client. Example:</p><pre>from ollama import Client<br><br>client = Client(host=&quot;http://localhost:9090&quot;)<br><br>response = client.embed(model=&quot;dengcao/Qwen3-Embedding-4B:Q4_K_M&quot;, input=&quot;Sample text&quot;)<br>print(response)</pre><p>Congratulations 🎉 Your Cloud Run deployment is up and running!</p><h4>RAG Example</h4><p>You can use the newly deployed model to build your first RAG application. 
Here’s how to achieve this:</p><p><strong>Step 1 — Generate Embeddings</strong></p><p>Install the necessary dependencies:</p><pre>pip install ollama chromadb</pre><p>Create an <em>example.py</em> with the following content:</p><pre>import ollama<br>import chromadb<br><br>documents = [<br>    &quot;Poland is a country located in Central Europe.&quot;,<br>    &quot;The capital and largest city of Poland is Warsaw.&quot;,<br>    &quot;Poland&#39;s official language is Polish, which is a West Slavic language.&quot;,<br>    &quot;Marie Curie, the pioneering scientist who conducted groundbreaking research on radioactivity, was born in Warsaw, Poland.&quot;,<br>    &quot;Poland is famous for its traditional dish called pierogi, which are filled dumplings.&quot;,<br>    &quot;The Białowieża Forest in Poland is one of the last and largest remaining parts of the immense primeval forest that once stretched across the European Plain.&quot;,<br>]<br><br>client = chromadb.Client()<br>collection = client.create_collection(name=&quot;docs&quot;)<br><br>ollama_client = ollama.Client(host=&quot;http://localhost:9090&quot;)<br><br># Store each document in an in-memory vector embeddings database<br>for i, d in enumerate(documents):<br>    response = ollama_client.embed(model=&quot;dengcao/Qwen3-Embedding-4B:Q4_K_M&quot;, input=d)<br>    embeddings = response[&quot;embeddings&quot;]<br>    collection.add(ids=[str(i)], embeddings=embeddings, documents=[d])</pre><p><strong>Step 2 — Retrieve</strong></p><p>Next, the following code will search the vector database for the most relevant document (add it to your <em>example.py</em>):</p><pre># An example prompt<br>prompt = &quot;What is Poland&#39;s official language?&quot;<br><br># Generate an embedding for the input and retrieve the most relevant document<br>response = ollama_client.embed(model=&quot;dengcao/Qwen3-Embedding-4B:Q4_K_M&quot;, input=prompt)<br>results = collection.query(query_embeddings=[response[&quot;embeddings&quot;][0]], 
n_results=1)<br>data = results[&quot;documents&quot;][0][0]</pre><p><strong>Step 3 — Generate final answer</strong></p><p>In the generation step, we will use a locally installed <a href="https://huggingface.co/Qwen/Qwen3-0.6B">Qwen3:0.6b</a>.</p><p><em>Note: we use Qwen3 in the generation step, but any other model could work here (e.g., Gemini, Gemma, Llama, etc.). Nevertheless, it’s </em><strong><em>critical to use the same embeddings</em></strong><em> model in step 1 (Generate Embeddings) and step 2 (Retrieve).</em></p><p>You can install the Qwen3:0.6b model by running the following command:</p><pre>ollama pull qwen3:0.6b</pre><p>Now we’re ready to combine the user’s prompt with the search results to generate the final answer (add to <em>example.py</em>):</p><pre># Final step - generate a response combining the prompt and data we retrieved in step 2<br>output = ollama.generate(<br>    model=&quot;qwen3:0.6b&quot;,<br>    prompt=f&quot;Using this data: {data}. Respond to this prompt: {prompt}&quot;,<br>)<br><br>print(output[&quot;response&quot;])</pre><p>Run the code by executing:</p><pre>python example.py</pre><p>You should see an answer similar to the one below:</p><pre>&lt;think&gt;<br>Okay, the user is asking what Poland&#39;s official language is, and they provided the information that Poland&#39;s official language is Polish, which is a West Slavic language. Let me make sure I understand this correctly.<br><br>First, I need to confirm if that&#39;s the correct information. I know that Poland is a country in Eastern Europe, and its official language is Polish. But wait, what&#39;s the source of this information? The user hasn&#39;t provided any other data, so I should stick strictly to the given information.<br><br>I should state that Poland&#39;s official language is Polish, and that it&#39;s a West Slavic language. I need to present this clearly and concisely. Maybe mention that it&#39;s the official language to emphasize its significance. 
Also, check if there&#39;s any other detail that needs to be included, but since the user provided only this, I can proceed.<br>&lt;/think&gt;<br><br>Poland&#39;s official language is **Polish**. This language is a **West Slavic language**.</pre><p>Well done! You have just created and run your first RAG application with the Qwen3 Embedding model under the hood.</p><h3><strong>Summary</strong></h3><p>At this point, you have established a Cloud Run service running the Qwen3 Embedding model. You can use it to generate embeddings for semantic search or a RAG application.</p><p>Stay tuned for more content on using Qwen3 Embedding in your applications.</p><h3>Thanks for reading</h3><p>I hope this article inspired you to experiment with open embedding models on <a href="https://cloud.google.com/run/?utm_campaign=CDR_0x87fa8d40_platform_b438423716&amp;utm_medium=external&amp;utm_source=blog">Cloud Run</a>. If you found this article helpful, please consider following me here on Medium and giving it a clap 👏 to help others discover it.</p><p>I’m always eager to chat with fellow developers and AI enthusiasts, so feel free to connect with me on <a href="https://www.linkedin.com/in/remigiusz-samborski/">LinkedIn</a> or <a href="https://bsky.app/profile/rsamborski.bsky.social">Bluesky</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=eb35d7f4037f" width="1" height="1" alt=""><hr><p><a href="https://medium.com/google-cloud/serverless-ai-qwen3-embeddings-with-cloud-run-eb35d7f4037f">Serverless AI: Qwen3 Embeddings with Cloud Run</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Gemini CLI: Power up your Linux workflow]]></title>
            <link>https://medium.com/google-cloud/gemini-cli-power-up-your-linux-workflow-e4d423c64c65?source=rss-adf81c1f37ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/e4d423c64c65</guid>
            <category><![CDATA[google-cloud-platform]]></category>
            <category><![CDATA[gemini]]></category>
            <category><![CDATA[gemini-cli]]></category>
            <dc:creator><![CDATA[Remigiusz Samborski]]></dc:creator>
            <pubDate>Tue, 29 Jul 2025 11:28:21 GMT</pubDate>
            <atom:updated>2025-07-30T02:12:00.330Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RB09yphYI7_G3PfCRmKHkA.png" /></figure><p>As an open-source enthusiast who’s been deeply immersed in the Linux ecosystem since the late 1990s, I rely heavily on its power and flexibility in my daily work. While I’ve always appreciated the command line, discovering the <a href="https://github.com/google-gemini/gemini-cli">Gemini CLI</a> has been a game-changer. It’s not just a tool that aids my coding endeavors; it’s become an indispensable companion in tackling those cumbersome tasks that once sent me down rabbit holes of Google searches for the right commands or documentation.</p><h3>Trick 1: Setting vim configuration</h3><p>The first thing I usually do when I spin up a new VM for my dev environment is set up vim with some basic configuration. I don’t always remember all the configuration parameters, so why not ask Gemini for help?</p><p>Here’s the prompt I used:</p><pre>I am setting up my new development environment and want to get my `vim` configured. Please update the `vimrc` file to meet following requirements:<br> - use slate color theme<br> - set tabs to 4 characters and always substitute them with spaces<br> - make sure smart indentation is used<br> - turn on the syntax highlighting</pre><p>And here you can see a short screencast of how it played out:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*e2fbMIXe6Xo2I168dfcf_Q.gif" /></figure><p>As you can see, Gemini CLI helped me a lot, and I didn’t have to look up my old config files or go through vim’s documentation.</p><h3>Trick 2: Creating Dev Environment with Docker</h3><p>Setting up a development environment can be a time-consuming and repetitive process. With Gemini CLI, you can orchestrate the installation of your entire development toolkit, including Docker, with a single command. 
Imagine the efficiency gains!</p><p>Here’s a sample prompt:</p><pre>Please help me setup my local development environment by installing Docker and making sure it&#39;s running properly.</pre><p>Below is a screencast of Gemini CLI at work:</p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fg3ZQ0cDqNao%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dg3ZQ0cDqNao&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fg3ZQ0cDqNao%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/4cce5310b1b55ee7f0d092419e0806d0/href">https://medium.com/media/4cce5310b1b55ee7f0d092419e0806d0/href</a></iframe><p>At the end you see the “<strong>Hello from Docker!</strong>” message, indicating a successful setup 🥳</p><p>When you watch the screencast, you may notice that there were a couple of issues during the installation, such as:</p><ul><li>Error: Command substitution using $() is not allowed for security reasons</li><li>sudo: add-apt-repository: command not found</li></ul><p>Nevertheless, the Gemini CLI agent was smart enough to figure out and apply fixes for them on its own. The only thing I needed to do was monitor the commands it wanted to execute and approve them to make sure I stayed in control.</p><p>As you can see, this was quite effective and I <strong>didn’t need to leave my terminal at all </strong>to visit any documentation sites. Now I have both vim and Docker set up and can start implementing my next project.</p><h3>Trick 3: Data analysis with awk</h3><p>Sometimes, you need to extract specific pieces of information from complex data structures. The other week, I needed to find all user Ids that were missing an email value in a JSON file. 
For this use case, awk combined with Gemini CLI proved to be incredibly powerful.</p><p>Let’s say you have a JSON file that looks something like this (simplified for demonstration):</p><pre>[<br>    {<br>        &quot;details&quot;: {<br>            &quot;Id&quot;: 1728294450663,<br>            &quot;Email&quot;: &quot;anon-email-0@example.com&quot;<br>        }<br>    },<br>    {<br>        &quot;details&quot;: {<br>            &quot;Id&quot;: 9272737716917<br>        }<br>    },<br>    ...<br>]</pre><p>And you want to extract only the Id values for users who <strong>don’t have</strong> an email address. In the example above, I’d like to get <em>9272737716917</em> but not <em>1728294450663</em>.</p><p>I could spend time tinkering with the correct awk command, or just ask Gemini CLI to do that for me. Here is the prompt I used:</p><pre>Analyze the @data.json file, then prepare a shell script (i.e. using awk) to extract all unique Ids for users who don&#39;t have an email associated with them and write them to ids.txt file.</pre><p>And this is how it worked out:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GC5wu2MP8G9KUJTWbu_y1w.png" /></figure><p>As you can see, Gemini CLI proposed creating an awk script to get the data and then running it to get the Ids I was looking for 🎉</p><p>Normally, writing such a script would take several minutes of Googling, checking documentation, and fixing errors. With Gemini CLI I was able to do it in seconds.</p><p><em>Note: During my experiments, Gemini CLI was smart enough to propose using the jq tool, but as I didn’t have it installed, I asked it to use awk instead.</em></p><h3>Share your Gemini CLI tricks!</h3><p>These are just a few examples of how Gemini CLI has transformed my Linux workflow. 
I’m constantly discovering new ways to leverage its capabilities, and I’m sure many of you out there have your own ingenious tricks and shortcuts.</p><p>How has Gemini CLI helped you power up your terminal tasks? Share your experiences, tips, and creative uses in the comments below! Your insights could help fellow Linux enthusiasts streamline their workflows and unlock even more of Gemini CLI’s potential.</p><h3>Thanks for reading</h3><p>I hope this article inspired you to explore <a href="https://cloud.google.com/gemini/docs/codeassist/gemini-cli?utm_campaign=CDR_0x87fa8d40_awareness_b433439625&amp;utm_medium=external&amp;utm_source=blog">Gemini CLI</a> capabilities. If you found this article helpful, please consider following me here on Medium and giving it a clap 👏 to help others discover it.</p><p>I’m always eager to chat with fellow developers and AI enthusiasts, so feel free to connect with me on <a href="https://www.linkedin.com/in/remigiusz-samborski/">LinkedIn</a> or <a href="https://bsky.app/profile/rsamborski.bsky.social">Bluesky</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e4d423c64c65" width="1" height="1" alt=""><hr><p><a href="https://medium.com/google-cloud/gemini-cli-power-up-your-linux-workflow-e4d423c64c65">Gemini CLI: Power up your Linux workflow</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Step-by-Step: Serving PyTorch Models with a Custom Handler on Vertex AI]]></title>
            <link>https://medium.com/google-cloud/step-by-step-serving-pytorch-models-with-a-custom-handler-on-vertex-ai-5ada1d01c534?source=rss-adf81c1f37ee------2</link>
            <guid isPermaLink="false">https://medium.com/p/5ada1d01c534</guid>
            <category><![CDATA[google-cloud-platform]]></category>
            <category><![CDATA[vertex-ai]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[hugging-face]]></category>
            <category><![CDATA[google-cloud]]></category>
            <dc:creator><![CDATA[Remigiusz Samborski]]></dc:creator>
            <pubDate>Fri, 06 Jun 2025 14:11:39 GMT</pubDate>
            <atom:updated>2025-06-07T09:54:40.191Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*ANa_nznsAxf1p0LlbFHF1g.png" /><figcaption>PLLuM falling into a fruit compote (inspired by a Polish wordplay)</figcaption></figure><p><a href="https://cloud.google.com/model-garden?utm_campaign=CDR_0x87fa8d40_user-journey_b420395736&amp;utm_medium=external&amp;utm_source=blog">Google Cloud’s Model Garden</a> streamlines the process of deploying a wide range of models, including those from Anthropic, Meta, and Hugging Face, as production-ready, scalable APIs.</p><p>However, challenges arise when models need special preprocessing, have unconventional output formats, or require unique logic not found in standard serving containers. These issues can significantly impede project progress.</p><p>The solution is to gain more control over the inference pipeline. This is where <a href="https://cloud.google.com/vertex-ai/docs?utm_campaign=CDR_0x87fa8d40_user-journey_b420395736&amp;utm_medium=external&amp;utm_source=blog">Google Cloud’s Vertex AI</a> shines, offering a powerful combination of pre-built <a href="https://cloud.google.com/deep-learning-containers/docs/choosing-container#hugging-face?utm_campaign=CDR_0x87fa8d40_user-journey_b420395736&amp;utm_medium=external&amp;utm_source=blog">Hugging Face containers</a> and the flexibility of <a href="https://huggingface.co/docs/inference-endpoints/en/guides/custom_handler"><strong>custom handlers</strong></a>. By writing a simple Python script, you can dictate exactly how your model loads, processes requests, and formats predictions.</p><p>In this guide, I’ll walk you through the entire process, from development to deployment. 
You will learn how to:</p><ul><li><strong>Understand and build a custom inference handler</strong> for a Hugging Face model.</li><li><strong>Test your model and handler locally</strong> to speed up debugging.</li><li><strong>Package your model and custom code</strong> for deployment.</li><li><strong>Deploy the model to a scalable Vertex AI Endpoint</strong> with GPU acceleration.</li><li><strong>Get live predictions</strong> from your newly created API.</li></ul><p>We will use <a href="https://huggingface.co/CYFRAGOVPL/PLLuM-12B-chat"><strong>PLLuM</strong></a>, a powerful Polish language model, as our practical example, but the techniques you learn here are applicable to many other PyTorch-based models.</p><p>This model is a great example for a couple of reasons:</p><ul><li>It’s increasingly popular in Poland.</li><li>It’s not directly accessible through <a href="https://cloud.google.com/model-garden?utm_campaign=CDR_0x87fa8d40_user-journey_b420395736&amp;utm_medium=external&amp;utm_source=blog">Model Garden</a>.</li><li>It won’t work in a standard <a href="https://cloud.google.com/deep-learning-containers/docs/choosing-container#hugging-face?utm_campaign=CDR_0x87fa8d40_user-journey_b420395736&amp;utm_medium=external&amp;utm_source=blog">Hugging Face container</a>. This is because it employs a custom <a href="https://huggingface.co/docs/transformers/en/main_classes/tokenizer">tokenizer</a>, necessitating the implementation of encoding/decoding logic before invoking its <em>generation</em> function.</li></ul><h3>Before You Begin: Setting Up Your Environment</h3><p>To follow along, you’ll need a Google Cloud project and the right tools and permissions. You can execute the code in your favorite Python development environment (e.g. 
locally, <a href="https://cloud.google.com/workstations?utm_campaign=CDR_0x87fa8d40_user-journey_b420395736&amp;utm_medium=external&amp;utm_source=blog">Cloud Workstations</a>, <a href="https://colab.google/">Google Colab</a>, etc.).</p><ol><li><strong>Google Cloud Project:</strong> Ensure you have a Google Cloud project with the <strong>Vertex AI</strong> and <strong>Artifact Registry</strong> APIs enabled.</li><li><strong>Cloud Storage Bucket:</strong> Create a new Cloud Storage bucket to store your model files. This will act as the staging area for Vertex AI.</li><li><strong>Permissions:</strong> Make sure your account has the following IAM roles:<br>- Vertex AI User (roles/aiplatform.user)<br>- Artifact Registry Reader (roles/artifactregistry.reader)<br>- Storage Object Admin (roles/storage.objectAdmin)</li><li><strong>Docker (optional but recommended):</strong> To test your model container locally, before deploying to the cloud, you will need to have Docker installed and running.</li><li><strong>Required Libraries:</strong> Install the necessary Python libraries:</li></ol><pre>pip install --upgrade --user --quiet &#39;torch&#39; &#39;torchvision&#39; &#39;torchaudio&#39;<br>pip install --upgrade --user --quiet &#39;transformers&#39; &#39;accelerate&gt;=0.26.0&#39;<br>pip install --upgrade --user --quiet &#39;google-cloud-aiplatform[prediction]&#39; &#39;crcmod&#39; &#39;etils&#39;</pre><p>Once your environment is configured, import libraries and initialize the Vertex AI SDK. 
Put this code in both <em>test_local.py</em> (for local testing) and <em>deploy.py</em> (for cloud deployment):</p><pre>import json<br>import torch<br>import vertexai<br><br>from etils import epath<br>from google.cloud import aiplatform<br>from google.cloud.aiplatform import Endpoint, Model<br>from google.cloud.aiplatform.prediction import LocalModel<br><br># Set your project, location, and bucket details<br>PROJECT_ID = &quot;your-gcp-project-id&quot;<br>LOCATION = &quot;your-gcp-location&quot;  # example: &quot;us-central1&quot;<br>BUCKET_URI = &quot;gs://your-gcs-bucket-name&quot;<br><br># Initialize Vertex AI SDK<br>vertexai.init(project=PROJECT_ID, location=LOCATION, staging_bucket=BUCKET_URI)</pre><h3>Understanding Custom Handlers</h3><p>When you deploy a model on Vertex AI using a pre-built container, the container makes assumptions about how to load the model and process predictions. For many standard models, this works perfectly.</p><p>However, a <a href="https://huggingface.co/docs/inference-endpoints/en/guides/custom_handler"><strong>custom handler</strong></a> gives you control over this process. It’s a Python script, named <em>handler.py</em>, that you provide alongside your model files. The Vertex AI serving container will automatically find and use this script.</p><p>The <em>handler.py</em> needs to implement the <a href="https://huggingface.co/philschmid/distilbert-onnx-banking77/blob/main/handler.py"><strong>EndpointHandler</strong></a> class, which must define two key methods:</p><ul><li><em>__init__</em>: This method is called once when the model is loaded. Its job is to load your model and any other necessary assets (like a <a href="https://huggingface.co/docs/transformers/en/main_classes/tokenizer">tokenizer</a>) from the model directory into memory.</li><li><em>__call__</em>: This method is called for every prediction request. 
It contains the core inference logic:</li></ul><ol><li><strong>Pre-processing:</strong> Preparing the raw input data (e.g., tokenizing a prompt).</li><li><strong>Prediction:</strong> Running the processed input through the model.</li><li><strong>Post-processing:</strong> Formatting the model’s output into a user-friendly response.</li></ol><p>By implementing this simple class, you can serve virtually any model, however specialized its requirements are.</p><h3>Building Our Custom Handler for the PLLuM Model</h3><p>Let’s build a handler for the <a href="https://huggingface.co/CYFRAGOVPL/PLLuM-12B-chat">CYFRAGOVPL/PLLuM-12B-chat</a> model. Our goal is to create a simple text-generation endpoint.</p><p>First, we need the correct imports. Create a file named <em>handler.py</em> and copy the following lines into it:</p><pre>from typing import Any, Dict, List<br>import torch<br>from transformers import AutoTokenizer, AutoModelForCausalLM<br>import base64<br>from io import BytesIO<br>import logging<br>import sys</pre><p>It is good practice to set up logging to stdout, so the logs can be accessed via Google Cloud’s observability services:</p><pre># Configure logging to output to stdout<br>logging.basicConfig(<br>    level=logging.INFO,<br>    format=&#39;%(asctime)s - %(name)s - %(levelname)s - %(message)s&#39;,<br>    handlers=[logging.StreamHandler(sys.stdout)]<br>)<br>logger = logging.getLogger(&#39;huggingface_inference_toolkit&#39;)</pre><p>Then we define our <em>__init__</em> method, which is responsible for loading the model and its tokenizer:</p><pre>class EndpointHandler:<br>    def __init__(<br>        self,<br>        model_dir: str = &#39;/opt/huggingface/model&#39;,<br>        **kwargs: Any,<br>    ) -&gt; None:<br>        self.processor = AutoTokenizer.from_pretrained(model_dir)<br><br>        self.model = AutoModelForCausalLM.from_pretrained(<br>            model_dir,<br>            torch_dtype=torch.bfloat16,<br>            
device_map=&quot;auto&quot;  # automatically places model layers on available devices<br>        ).eval()</pre><p>Lastly, let’s define the inference logic that goes inside our handler’s <em>__call__</em> method. This involves taking a prompt, tokenizing it, generating a response with the model, and decoding the output:</p><pre>    def __call__(self, data: Dict[str, Any]) -&gt; Dict[str, List[Any]]:<br>        logger.info(&quot;Processing new request&quot;)<br>        predictions = []<br><br>        for instance in data[&#39;instances&#39;]:<br>            logger.info(f&quot;Processing instance: {instance.get(&#39;prompt&#39;, &#39;&#39;)[:100]}...&quot;)<br><br>            if &quot;prompt&quot; not in instance:<br>                error_msg = &quot;Missing prompt in the request body&quot;<br>                logger.error(error_msg)<br>                return {&quot;error&quot;: error_msg}<br><br>            inputs = self.processor(<br>                instance[&quot;prompt&quot;], return_tensors=&quot;pt&quot;, return_token_type_ids=False<br>            ).to(self.model.device)<br>            input_len = inputs[&quot;input_ids&quot;].shape[-1]<br>            logger.info(f&quot;Input processed, length: {input_len}&quot;)<br><br>            with torch.inference_mode():<br>                # Requests in this guide nest generation_kwargs inside each instance<br>                generation_kwargs = instance.get(<br>                    &quot;generation_kwargs&quot;, {<br>                        &quot;max_new_tokens&quot;: 100,<br>                        &quot;do_sample&quot;: False,<br>                        &quot;top_k&quot;: 50,<br>                        &quot;top_p&quot;: 0.9,<br>                        &quot;temperature&quot;: 0.7<br>                    }<br>                )<br>                logger.info(f&quot;Generation kwargs: {generation_kwargs}&quot;)<br><br>                generation = self.model.generate(**inputs, **generation_kwargs)<br>                generation = generation[0][input_len:]<br>                response 
= self.processor.decode(generation, skip_special_tokens=True)<br>                logger.info(f&quot;Generated response: {response[:100]}...&quot;)<br>                predictions.append(response)<br><br>        logger.info(f&quot;Successfully processed {len(predictions)} instances&quot;)<br>        return {&quot;predictions&quot;: predictions}</pre><p>Note that the <em>__call__</em> method receives a dictionary and should handle multiple <em>instances</em>. This allows users to send multiple prompts in a single request.</p><p><a href="https://gist.github.com/rsamborski/f234f1688570a964fe29728935649df4#file-handler-py">Click here</a> to download the full <em>handler.py</em> code.</p><h3>Preparing the Model for Deployment</h3><p>Vertex AI needs all your model artifacts (the model weights, configuration, and our new <em>handler.py</em>) to be in one location on Google Cloud Storage.</p><ol><li><strong>Create a local directory</strong> that contains the model files and your handler.</li><li><strong>Upload the entire directory</strong> to your GCS bucket.</li></ol><pre>gsutil -o GSUtil:parallel_composite_upload_threshold=150M -m cp -r /path/to/your/local/model_directory/* gs://your-gcs-bucket-name/model/</pre><h3>Best Practice: Testing Locally Before Deploying</h3><p>Deploying a model to a GPU-accelerated endpoint can take 15–20 minutes. 
To avoid waiting that long just to find a bug in your code, you can use the Vertex AI SDK’s LocalModel feature to simulate the cloud environment on your local machine.</p><p>This spins up the official Hugging Face serving container using Docker and loads your model and handler from the artifact URI you pass in (the code below points it at the GCS bucket), allowing for rapid testing.</p><ol><li>Define a helper function by adding the following lines to the <em>test_local.py</em> file:</li></ol><pre>def get_cuda_device_names():<br>    &quot;&quot;&quot;Return the CUDA device indices as strings, or None if no GPU is available&quot;&quot;&quot;<br>    if not torch.cuda.is_available():<br>        return None<br><br>    return [str(i) for i in range(torch.cuda.device_count())]</pre><p>2. Create a <a href="https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.prediction.LocalModel?utm_campaign=CDR_0x87fa8d40_user-journey_b420395736&amp;utm_medium=external&amp;utm_source=blog">LocalModel</a> instance:</p><pre>local_pllum_model = LocalModel(<br>    serving_container_image_uri=&quot;us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cu121.2-3.transformers.4-48.ubuntu2204.py311&quot;,<br>    serving_container_ports=[5000],<br>)</pre><p>3. Create a <a href="https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.prediction.LocalEndpoint?utm_campaign=CDR_0x87fa8d40_user-journey_b420395736&amp;utm_medium=external&amp;utm_source=blog">LocalEndpoint</a> instance:</p><pre>model_uri = epath.Path(BUCKET_URI) / &quot;model&quot;<br><br>local_pllum_endpoint = local_pllum_model.deploy_to_local_endpoint(<br>    artifact_uri=str(model_uri), gpu_device_ids=get_cuda_device_names()<br>)<br><br>local_pllum_endpoint.serve()</pre><p>4. 
Generate predictions:</p><pre># EN: &quot;Write a short poem about spring.&quot;<br>prompt = &quot;Napisz krótki wiersz o wiośnie.&quot;  # @param {type: &quot;string&quot;}<br><br>prediction_request = {<br>    &quot;instances&quot;: [<br>        {<br>            &quot;prompt&quot;: prompt,<br>            &quot;generation_kwargs&quot;: {&quot;max_new_tokens&quot;: 50, &quot;do_sample&quot;: True},<br>        }<br>    ]<br>}<br><br>vertex_prediction_request = json.dumps(prediction_request)<br>vertex_prediction_response = local_pllum_endpoint.predict(<br>    request=vertex_prediction_request, headers={&quot;Content-Type&quot;: &quot;application/json&quot;}<br>)<br>print(vertex_prediction_response.json()[&quot;predictions&quot;])</pre><p><a href="https://gist.github.com/rsamborski/f234f1688570a964fe29728935649df4#file-test_local-py">Click here</a> to download the full <em>test_local.py</em> code.</p><p>If the local prediction succeeds, you can be much more confident that your cloud deployment will work correctly.</p><h3>Deploying to a Live Vertex AI Endpoint</h3><p>With our model and handler tested and uploaded to GCS, we’re ready for the final two steps.</p><h4>Step 1: Register the Model in the Vertex AI Model Registry</h4><p>First, we register the model, telling Vertex AI where to find the artifacts and which container to use. Add the following code to the <em>deploy.py</em> file:</p><pre>model_uri = epath.Path(BUCKET_URI) / &quot;model&quot;<br><br>model = Model.upload(<br>    display_name=&quot;cyfragovpl--pllum-12b-it&quot;,<br>    artifact_uri=str(model_uri),<br>    serving_container_image_uri=&quot;us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cu121.2-3.transformers.4-48.ubuntu2204.py311&quot;,<br>    serving_container_ports=[8080],<br>)<br>model.wait()</pre><h4>Step 2: Deploy the Model to an Endpoint</h4><p>Next, we deploy the registered model to an endpoint. 
This is where Vertex AI provisions the physical hardware (like an NVIDIA L4 GPU) and makes your model available to receive prediction requests.</p><pre># model.deploy() returns the Endpoint that serves the model<br>deployed_model = model.deploy(<br>    endpoint=Endpoint.create(display_name=&quot;cyfragovpl--pllum-12b-it-endpoint&quot;),<br>    machine_type=&quot;g2-standard-8&quot;,<br>    accelerator_type=&quot;NVIDIA_L4&quot;,<br>    accelerator_count=1,<br>)</pre><p>This step will take about 15–25 minutes. Once complete, you will have a fully managed, scalable HTTP endpoint for your model.</p><h3>Getting Live Predictions</h3><p>Now for the fun part. You can send requests to your endpoint using the Vertex AI SDK, a simple cURL command, or any HTTP client.</p><p>Using the <a href="https://cloud.google.com/python/docs/reference/aiplatform/latest?utm_campaign=CDR_0x87fa8d40_user-journey_b420395736&amp;utm_medium=external&amp;utm_source=blog">Vertex AI Python SDK</a> is the most straightforward way:</p><pre># EN: &quot;Write a short poem about spring.&quot;<br>prompt = &quot;Napisz krótki wiersz o wiośnie.&quot;  # @param {type: &quot;string&quot;}<br>prediction_request = {<br>    &quot;instances&quot;: [<br>        {<br>            &quot;prompt&quot;: prompt,<br>            &quot;generation_kwargs&quot;: {&quot;max_new_tokens&quot;: 50, &quot;do_sample&quot;: True},<br>        }<br>    ]<br>}<br><br>prediction = deployed_model.predict(instances=prediction_request[&quot;instances&quot;])<br>print(prediction)</pre><p><a href="https://gist.github.com/rsamborski/f234f1688570a964fe29728935649df4#file-deploy-py">Click here</a> to download the full <em>deploy.py</em> code.</p><p>The output will be a prediction object containing the generated text from the PLLuM model, served live from your own custom API endpoint 🎉</p><h3>Conclusion and Next Steps</h3><p>You have successfully taken an open-source Hugging Face model with custom requirements and transformed it into a robust, scalable API on Google Cloud. 
You now have the power to productionize a vast range of models by creating a simple <strong>custom handler</strong> that tailors the inference process to your exact needs.</p><p>Explore more at:</p><ul><li><a href="https://cloud.google.com/vertex-ai/docs?utm_campaign=CDR_0x87fa8d40_platform_b420395736&amp;utm_medium=external&amp;utm_source=blog">Vertex AI documentation</a></li><li><a href="https://github.com/GoogleCloudPlatform/generative-ai">Google Cloud Generative AI repository on GitHub</a></li><li><a href="https://huggingface.co/CYFRAGOVPL">PLLuM on Hugging Face</a></li></ul><p>For this content in other formats, visit:</p><ul><li><a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/9aabd305ed72ac24eed6cd102d33e19061ec3971/open-models/serving/vertex_ai_pytorch_inference_pllum_with_custom_handler.ipynb">Python notebook</a></li><li><a href="https://www.youtube.com/watch?v=fJcCRTmi8Bo">YouTube video</a> (in Polish)</li></ul><h3>Thanks for reading</h3><p>Thank you for reading! I hope this guide helps you bring your own creative AI projects to life on Google Cloud. If you found this article helpful, please consider following me here on Medium and giving it a clap 👏 to help others discover it.</p><p>I’m always eager to connect with fellow developers and AI enthusiasts, so feel free to connect with me on <a href="https://www.linkedin.com/in/remigiusz-samborski/">LinkedIn</a> or <a href="https://bsky.app/profile/rsamborski.bsky.social">Bluesky</a>. 
Your feedback is incredibly valuable, so please don’t hesitate to leave a comment with your thoughts, questions, or your own experiences deploying models on Vertex AI!</p><hr><p><a href="https://medium.com/google-cloud/step-by-step-serving-pytorch-models-with-a-custom-handler-on-vertex-ai-5ada1d01c534">Step-by-Step: Serving PyTorch Models with a Custom Handler on Vertex AI</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>