<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Laurent Picard on Medium]]></title>
        <description><![CDATA[Stories by Laurent Picard on Medium]]></description>
        <link>https://medium.com/@picardparis?source=rss-6be63961431c------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*OMNmXavx5hd7qUUZxNYLhA.jpeg</url>
            <title>Stories by Laurent Picard on Medium</title>
            <link>https://medium.com/@picardparis?source=rss-6be63961431c------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 08 Jun 2026 23:18:24 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@picardparis/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Detecting and Editing Visual Objects with Gemini]]></title>
            <link>https://medium.com/google-cloud/detecting-and-editing-visual-objects-with-gemini-214bde9d5792?source=rss-6be63961431c------2</link>
            <guid isPermaLink="false">https://medium.com/p/214bde9d5792</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[image-generation]]></category>
            <category><![CDATA[gemini]]></category>
            <category><![CDATA[object-detection]]></category>
            <category><![CDATA[llm]]></category>
            <dc:creator><![CDATA[Laurent Picard]]></dc:creator>
            <pubDate>Thu, 26 Feb 2026 13:30:17 GMT</pubDate>
            <atom:updated>2026-03-02T16:10:22.814Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*bqFCQq-Zj12lRrIE93u6vw.png" /><figcaption>A practical guide to identifying, restoring, and transforming elements within your images</figcaption></figure><p><em>A few notes before we start:</em></p><ul><li><em>The complete source code for this article, including future updates, is available in </em><a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/spatial-understanding/object_detection_and_editing.ipynb"><em>this notebook</em></a><em> under the Apache 2.0 license.</em></li><li><em>All new images in this article were generated with Gemini Nano Banana using the explored proof-of-concept. All source images are either in the public domain or free to use (reference links are provided in the code output).</em></li><li><em>You can experiment with Gemini models for free in </em><a href="https://aistudio.google.com"><em>Google AI Studio</em></a><em>. For programmatic API access, please note that while a free tier is available for some models (i.e., you can perform object detection), image generation is a pay-as-you-go service.</em></li></ul><h3>✨ Overview</h3><p>Traditional computer vision models are typically trained to detect a fixed set of object classes, like “person”, “cat”, or “car”. If you want to detect something specific that wasn’t in the training set, such as an “illustration” in a book photograph, you usually have to gather a dataset, label it manually, and train a custom model, which can take hours or even days.</p><p>In this exploration, we’ll test a different approach using Gemini. We will leverage its spatial understanding capabilities to perform open-vocabulary object detection. 
This allows us to find objects based solely on a natural language description, without any training.</p><p>Once the visual objects are detected, we’ll extract them and then use Gemini’s image editing capabilities (specifically the Nano Banana models) to restore and creatively transform them.</p><h3>🔥 Challenge</h3><p>We are dealing with unstructured data: photos of books, magazines, and objects in the wild. These images present several difficulties for traditional computer vision:</p><ul><li>Variety: The objects we want to find (illustrations, engravings, and any visuals in general) vary wildly in style and content.</li><li>Distortion: Pages are curved, photos are taken at angles, and lighting is uneven.</li><li>Noise: Old books have stains, paper grain, and text bleeding through from the other side.</li></ul><p>Our challenge is to build a robust pipeline that can detect these objects despite the distortions, extract them cleanly, and edit them to look like high-quality digital assets… all using simple text prompts.</p><h3>🏁 Setup</h3><h4>🐍 Python packages</h4><p>We’ll use the following packages:</p><ul><li>google-genai: the <a href="https://pypi.org/project/google-genai">Google Gen AI Python SDK</a> lets us call Gemini with a few lines of code</li><li>pillow for image management</li><li>matplotlib for result visualization</li></ul><p>We’ll also use these packages (dependencies of google-genai):</p><ul><li>pydantic for data management</li><li>tenacity for request management</li></ul><pre>pip install --quiet &quot;google-genai&gt;=1.64.0&quot; &quot;pillow&gt;=11.3.0&quot; &quot;matplotlib&gt;=3.10.0&quot;</pre><h4>🤖 Gen AI SDK</h4><p>To send Gemini requests, create a google.genai client:</p><pre>from google import genai<br><br>check_environment()<br><br>client = genai.Client()<br><br>check_configuration(client)</pre><pre>Using the Vertex AI API with project &quot;...&quot; in location &quot;europe-west9&quot;</pre><h4>🖼️ Image test suite</h4><p>Let’s define a list 
of images for our tests:</p><pre>from dataclasses import dataclass<br>from enum import StrEnum<br><br>Url = str<br><br><br>class Source(StrEnum):<br>    incunable = &quot;https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2014:2014rosen0487:0165/full/pct:25/0/default.jpg&quot;<br>    engravings = &quot;https://tile.loc.gov/image-services/iiif/service:gdc:gdcscd:00:34:07:66:92:1:00340766921:0121/full/pct:50/0/default.jpg&quot;<br>    museum_guidebook = &quot;https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2014:2014gen34181:0033/full/pct:75/0/default.jpg&quot;<br>    denver_illustrated = &quot;https://tile.loc.gov/image-services/iiif/service:gdc:gdclccn:rc:01:00:04:94:rc01000494:0051/full/pct:50/0/default.jpg&quot;<br>    physics_textbook = &quot;https://tile.loc.gov/image-services/iiif/service:gdc:gdcscd:00:03:64:87:31:8:00036487318:0103/full/pct:50/0/default.jpg&quot;<br>    portrait_miniatures = &quot;https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2024:2024rosen013592v02:0249/full/pct:50/0/default.jpg&quot;<br>    wizard_of_oz_drawings = &quot;https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2006:2006gen32405:0048/full/pct:25/0/default.jpg&quot;<br>    paintings = &quot;https://images.unsplash.com/photo-1714146681164-f26fed839692?h=1440&quot;<br>    alice_drawing = &quot;https://images.unsplash.com/photo-1630595011903-689853b04ee2?h=800&quot;<br>    book = &quot;https://images.unsplash.com/photo-1643451533573-ee364ba6e330?h=800&quot;<br>    manual = &quot;https://images.unsplash.com/photo-1623666936367-a100f62ba9b7?h=800&quot;<br>    electronics = &quot;https://images.unsplash.com/photo-1757397584789-8b2c5bfcdbc3?h=1440&quot;<br><br><br>@dataclass<br>class SourceMetadata:<br>    title: str<br>    webpage_url: Url<br>    credit_line: str<br><br><br>LOC = &quot;Library of Congress&quot;<br>LOC_RARE_BOOKS = &quot;Library of Congress, Rare Book and Special Collections Division&quot;<br>LOC_MEETING_FRONTIERS = 
&quot;Library of Congress, Meeting of Frontiers&quot;<br><br>metadata_by_source: dict[Source, SourceMetadata] = {<br>    Source.incunable: SourceMetadata(<br>        &quot;Vergaderinge der historien van Troy (1485)&quot;,<br>        &quot;https://www.loc.gov/resource/rbc0001.2014rosen0487/?sp=165&quot;,<br>        LOC_RARE_BOOKS,<br>    ),<br>    Source.engravings: SourceMetadata(<br>        &quot;Harper&#39;s illustrated catalogue (1847)&quot;,<br>        &quot;https://www.loc.gov/resource/gdcscd.00340766921/?sp=121&quot;,<br>        LOC,<br>    ),<br>    Source.museum_guidebook: SourceMetadata(<br>        &quot;Barnum&#39;s American Museum illustrated (1850)&quot;,<br>        &quot;https://www.loc.gov/resource/rbc0001.2014gen34181/?sp=33&quot;,<br>        LOC_RARE_BOOKS,<br>    ),<br>    Source.denver_illustrated: SourceMetadata(<br>        &quot;Denver illustrated (1893)&quot;,<br>        &quot;https://www.loc.gov/resource/gdclccn.rc01000494/?sp=51&quot;,<br>        LOC_MEETING_FRONTIERS,<br>    ),<br>    Source.physics_textbook: SourceMetadata(<br>        &quot;Lessons in physics (1916)&quot;,<br>        &quot;https://www.loc.gov/resource/gdcscd.00036487318/?sp=103&quot;,<br>        LOC,<br>    ),<br>    Source.portrait_miniatures: SourceMetadata(<br>        &quot;The history of portrait miniatures (1904)&quot;,<br>        &quot;https://www.loc.gov/resource/rbc0001.2024rosen013592v02/?sp=249&quot;,<br>        LOC_RARE_BOOKS,<br>    ),<br>    Source.wizard_of_oz_drawings: SourceMetadata(<br>        &quot;The wonderful Wizard of Oz (1899)&quot;,<br>        &quot;https://www.loc.gov/resource/rbc0001.2006gen32405/?sp=48&quot;,<br>        LOC_RARE_BOOKS,<br>    ),<br>    Source.paintings: SourceMetadata(<br>        &quot;Open book showing paintings by Vincent van Gogh&quot;,<br>        &quot;https://unsplash.com/photos/9hD7qrxICag&quot;,<br>        &quot;Photo by Trung Manh cong on Unsplash&quot;,<br>    ),<br>    Source.alice_drawing: SourceMetadata(<br>        
&quot;Open book showing an illustration and text from Alice&#39;s Adventures in Wonderland&quot;,<br>        &quot;https://unsplash.com/photos/bewzr_Q9u2o&quot;,<br>        &quot;Photo by Brett Jordan on Unsplash&quot;,<br>    ),<br>    Source.book: SourceMetadata(<br>        &quot;Open book showing two botanical illustrations&quot;,<br>        &quot;https://unsplash.com/photos/4IDqcNj827I&quot;,<br>        &quot;Photo by Ranurte on Unsplash&quot;,<br>    ),<br>    Source.manual: SourceMetadata(<br>        &quot;Open user manual for vintage camera&quot;,<br>        &quot;https://unsplash.com/photos/aaFU96eYASk&quot;,<br>        &quot;Photo by Annie Spratt on Unsplash&quot;,<br>    ),<br>    Source.electronics: SourceMetadata(<br>        &quot;Circuit board with electronic components&quot;,<br>        &quot;https://unsplash.com/photos/Aqa1pHQ57pw&quot;,<br>        &quot;Photo by Albert Stoynov on Unsplash&quot;,<br>    ),<br>}<br><br>print(&quot;✅ Test images defined&quot;)</pre><h4>🧠 Gemini models</h4><p>Gemini comes in different <a href="https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models#gemini-models">versions</a>. 
We can currently use the following models:</p><ul><li>For object detection: Gemini 2.5 or Gemini 3, each available in Flash or Pro versions.</li><li>For object editing: Gemini 2.5 Flash Image, Gemini 3 Pro Image, or Gemini 3.1 Flash Image (also known as Nano Banana 🍌, Nano Banana Pro 🍌, and Nano Banana 2 🍌).</li></ul><h4>🛠️ Helpers</h4><p>Now, let’s add core helper classes and functions:</p><pre>from enum import auto<br>from pathlib import Path<br>from typing import Any, cast<br><br>import IPython.display<br>import matplotlib.pyplot as plt<br>import pydantic<br>import tenacity<br>from google.genai.errors import ClientError<br>from google.genai.types import (<br>    FinishReason,<br>    GenerateContentConfig,<br>    GenerateContentResponse,<br>    PIL_Image,<br>    ThinkingConfig,<br>    ThinkingLevel,<br>)<br><br><br># Multimodal models with spatial understanding and structured outputs<br>class MultimodalModel(StrEnum):<br>    # Generally Available (GA)<br>    GEMINI_2_5_FLASH = &quot;gemini-2.5-flash&quot;<br>    GEMINI_2_5_PRO = &quot;gemini-2.5-pro&quot;<br>    # Preview<br>    GEMINI_3_FLASH_PREVIEW = &quot;gemini-3-flash-preview&quot;<br>    GEMINI_3_1_PRO_PREVIEW = &quot;gemini-3.1-pro-preview&quot;<br>    # Default model used for object detection<br>    DEFAULT = GEMINI_3_FLASH_PREVIEW<br><br><br># Image generation and editing models (Nano Banana 🍌 models)<br>class ImageModel(StrEnum):<br>    # Generally Available (GA)<br>    GEMINI_2_5_FLASH_IMAGE = &quot;gemini-2.5-flash-image&quot;  # Nano Banana<br>    # Preview<br>    GEMINI_3_PRO_IMAGE_PREVIEW = &quot;gemini-3-pro-image-preview&quot;  # Nano Banana Pro<br>    GEMINI_3_1_FLASH_IMAGE_PREVIEW = &quot;gemini-3.1-flash-image-preview&quot;  # Nano Banana 2<br>    # Default model used for image editing<br>    DEFAULT = GEMINI_2_5_FLASH_IMAGE<br><br><br>Model = MultimodalModel | ImageModel<br><br><br>def generate_content(<br>    contents: list[Any],<br>    model: Model,<br>    config: GenerateContentConfig | 
None,<br>    should_display_response_info: bool = False,<br>) -&gt; GenerateContentResponse | None:<br>    response = None<br>    client = check_client_for_model(model)<br><br>    for attempt in get_retrier():<br>        with attempt:<br>            response = client.models.generate_content(<br>                model=model.value,<br>                contents=contents,<br>                config=config,<br>            )<br>    if should_display_response_info:<br>        display_response_info(response, config)<br><br>    return response<br><br><br>def check_client_for_model(model: Model) -&gt; genai.Client:<br>    if (<br>        model.value.endswith(&quot;-preview&quot;)<br>        and client.vertexai<br>        and client._api_client.location != &quot;global&quot;<br>    ):<br>        # Preview models are only available on the &quot;global&quot; location<br>        return genai.Client(location=&quot;global&quot;)<br><br>    return client<br><br><br>def display_response_info(<br>    response: GenerateContentResponse | None,<br>    config: GenerateContentConfig | None,<br>) -&gt; None:<br>    if response is None:<br>        print(&quot;❌ No response&quot;)<br>        return<br><br>    if usage_metadata := response.usage_metadata:<br>        if usage_metadata.prompt_token_count:<br>            print(f&quot;Input tokens   : {usage_metadata.prompt_token_count:9,d}&quot;)<br>        if usage_metadata.candidates_token_count:<br>            print(f&quot;Output tokens  : {usage_metadata.candidates_token_count:9,d}&quot;)<br>        if usage_metadata.thoughts_token_count:<br>            print(f&quot;Thoughts tokens: {usage_metadata.thoughts_token_count:9,d}&quot;)<br><br>    if (<br>        config is not None<br>        and config.response_mime_type == &quot;application/json&quot;<br>        and response.parsed is None<br>    ):<br>        print(&quot;❌ Could not parse the JSON response&quot;)<br>        return<br>    if not response.candidates:<br>        print(&quot;❌ No 
`response.candidates`&quot;)<br>        return<br>    if (finish_reason := response.candidates[0].finish_reason) != FinishReason.STOP:<br>        print(f&quot;❌ {finish_reason = }&quot;)<br>    if not response.text:<br>        print(&quot;❌ No `response.text`&quot;)<br>        return<br><br><br>def generate_image(<br>    sources: list[PIL_Image],<br>    prompt: str,<br>    model: ImageModel,<br>    config: GenerateContentConfig | None = None,<br>) -&gt; PIL_Image | None:<br>    contents = [*sources, prompt.strip()]<br><br>    response = generate_content(contents, model, config)<br><br>    return check_get_output_image_from_response(response)<br><br><br>def check_get_output_image_from_response(<br>    response: GenerateContentResponse | None,<br>) -&gt; PIL_Image | None:<br>    if response is None:<br>        print(&quot;❌ No `response`&quot;)<br>        return None<br>    if not response.candidates:<br>        print(&quot;❌ No `response.candidates`&quot;)<br>        if response.prompt_feedback:<br>            if block_reason := response.prompt_feedback.block_reason:<br>                print(f&quot;{block_reason = :s}&quot;)<br>            if block_reason_message := response.prompt_feedback.block_reason_message:<br>                print(f&quot;{block_reason_message = }&quot;)<br>        return None<br>    if not (content := response.candidates[0].content):<br>        print(&quot;❌ No `response.candidates[0].content`&quot;)<br>        return None<br>    if not (parts := content.parts):<br>        print(&quot;❌ No `response.candidates[0].content.parts`&quot;)<br>        return None<br><br>    output_image: PIL_Image | None = None<br>    for part in parts:<br>        if part.text:<br>            display_markdown(part.text)<br>            continue<br>        sdk_image = part.as_image()<br>        assert sdk_image is not None<br>        output_image = sdk_image._pil_image<br>        assert output_image is not None<br>        break  # There should be a single 
image<br><br>    return output_image<br><br><br>def get_thinking_config(model: Model) -&gt; ThinkingConfig | None:<br>    match model:<br>        case MultimodalModel.GEMINI_2_5_FLASH:<br>            return ThinkingConfig(thinking_budget=0)<br>        case MultimodalModel.GEMINI_2_5_PRO:<br>            return ThinkingConfig(thinking_budget=128, include_thoughts=False)<br>        case MultimodalModel.GEMINI_3_FLASH_PREVIEW:<br>            return ThinkingConfig(thinking_level=ThinkingLevel.MINIMAL)<br>        case MultimodalModel.GEMINI_3_1_PRO_PREVIEW:<br>            return ThinkingConfig(thinking_level=ThinkingLevel.LOW)<br>        case _:<br>            return None  # Default<br><br><br>def display_markdown(markdown: str) -&gt; None:<br>    IPython.display.display(IPython.display.Markdown(markdown))<br><br><br>def display_image(image: PIL_Image) -&gt; None:<br>    IPython.display.display(image)<br><br><br>def get_retrier() -&gt; tenacity.Retrying:<br>    return tenacity.Retrying(<br>        stop=tenacity.stop_after_attempt(7),<br>        wait=tenacity.wait_incrementing(start=10, increment=1),<br>        retry=should_retry_request,<br>        reraise=True,<br>    )<br><br><br>def should_retry_request(retry_state: tenacity.RetryCallState) -&gt; bool:<br>    if not retry_state.outcome:<br>        return False<br>    err = retry_state.outcome.exception()<br>    if not isinstance(err, ClientError):<br>        return False<br>    print(f&quot;❌ ClientError {err.code}: {err.message}&quot;)<br><br>    retry = False<br>    match err.code:<br>        case 400 if err.message is not None and &quot; try again &quot; in err.message:<br>            # Workshop: first time access to Cloud Storage (service agent provisioning)<br>            retry = True<br>        case 429:<br>            # Workshop: temporary project with 1 QPM quota<br>            retry = True<br>    print(f&quot;🔄 Retry: {retry}&quot;)<br><br>    return retry<br><br><br>print(&quot;✅ Helpers 
defined&quot;)</pre><h3>🔍 Detecting visual objects</h3><p>To perform visual object detection, craft the prompt to indicate what you’d like to detect and how results should be returned. In the same request, it’s possible to also extract additional information about each detected object. This can be virtually anything, from labels such as “furniture”, “table”, or “chair”, to more precise classifications like “mammals” or “reptiles”, or to contextual data such as captions, colors, shapes, etc.</p><p>For the next tests, we’ll experiment with detecting illustrations within book photos. Here’s a possible prompt:</p><pre>OBJECT_DETECTION_PROMPT = &quot;&quot;&quot;<br>Detect every illustration within the book photo and extract the following data for each:<br>- `box_2d`: Bounding box coordinates of the illustration only (ignoring any caption).<br>- `caption`: Verbatim caption or legend such as &quot;Figure 1&quot;. Use &quot;&quot; if not found.<br>- `label`: Single-word label describing the illustration. 
Use &quot;&quot; if not found.<br>&quot;&quot;&quot;</pre><p>Notes:</p><ul><li>Bounding boxes are very useful for locating or extracting the detected objects.</li><li>For Gemini models, a box_2d bounding box is typically returned as [y1, x1, y2, x2], with coordinates normalized to a (0, 0, 1000, 1000) space for a (0, 0, width, height) input image.</li><li>We’re also asking for captions (metadata often present in reference books) and labels (dynamic metadata).</li></ul><p>To automate response processing, it’s convenient to define a Pydantic class that matches the prompt, such as:</p><pre>class DetectedObject(pydantic.BaseModel):<br>    box_2d: list[int]<br>    caption: str<br>    label: str<br><br>DetectedObjects = list[DetectedObject]</pre><p>Then, request a structured output with config fields response_mime_type and response_schema:</p><pre>config = GenerateContentConfig(<br>    # …,<br>    response_mime_type=&quot;application/json&quot;,<br>    response_schema=DetectedObjects,<br>    # …,<br>)</pre><p>This generates a JSON response that the SDK parses automatically, letting us use object instances directly:</p><pre>detected_objects = cast(DetectedObjects, response.parsed)</pre><p>Let’s add a few object-detection-specific classes and functions:</p><pre>import io<br>import urllib.request<br>from collections.abc import Iterator<br>from dataclasses import field<br>from datetime import datetime<br><br>import PIL.Image<br>from google.genai.types import Part, PartMediaResolutionLevel<br>from PIL.PngImagePlugin import PngInfo<br><br>OBJECT_DETECTION_PROMPT = &quot;&quot;&quot;<br>Detect every illustration within the book photo and extract the following data for each:<br>- `box_2d`: Bounding box coordinates of the illustration only (ignoring any caption).<br>- `caption`: Verbatim caption or legend such as &quot;Figure 1&quot;. Use &quot;&quot; if not found.<br>- `label`: Single-word label describing the illustration. 
Use &quot;&quot; if not found.<br>&quot;&quot;&quot;<br><br># Margin added to detected/cropped objects, giving more context for a better understanding of spatial distortions<br>CROP_MARGIN_PX = 10<br><br># Set to True to save each generated image<br>SAVE_GENERATED_IMAGES = False<br>OUTPUT_IMAGES_PATH = Path(&quot;./object_detection_and_editing&quot;)<br><br><br># Matching class for structured output generation<br>class DetectedObject(pydantic.BaseModel):<br>    box_2d: list[int]<br>    caption: str<br>    label: str<br><br><br># Misc data classes<br>InputImage = Path | Url<br>DetectedObjects = list[DetectedObject]<br>WorkflowStepImages = list[PIL_Image]<br><br><br>class WorkflowStep(StrEnum):<br>    SOURCE = auto()<br>    CROPPED = auto()<br>    RESTORED = auto()<br>    COLORIZED = auto()<br>    CINEMATIZED = auto()<br><br><br>@dataclass<br>class VisualObjectWorkflow:<br>    source_image: PIL_Image<br>    detected_objects: DetectedObjects<br>    images_by_step: dict[WorkflowStep, WorkflowStepImages] = field(default_factory=dict)<br><br>    def __post_init__(self) -&gt; None:<br>        denormalize_bounding_boxes(self)<br><br><br>workflow_by_image: dict[InputImage, VisualObjectWorkflow] = {}<br><br><br>def denormalize_bounding_boxes(self: VisualObjectWorkflow) -&gt; None:<br>    &quot;&quot;&quot;Convert the box_2d coordinates.<br>    - Before: [y1, x1, y2, x2] normalized to 0-1000, as returned by Gemini<br>    - After:  [x1, y1, x2, y2] in source_image coordinates, as used in Pillow<br>    &quot;&quot;&quot;<br><br>    def to_image_coord(coord: int, dim: int) -&gt; int:<br>        return int(coord * dim / 1000 + 0.5)<br><br>    w, h = self.source_image.size<br>    for obj in self.detected_objects:<br>        y1, x1, y2, x2 = obj.box_2d<br>        x1, x2 = to_image_coord(x1, w), to_image_coord(x2, w)<br>        y1, y2 = to_image_coord(y1, h), to_image_coord(y2, h)<br>        obj.box_2d = [x1, y1, x2, y2]<br><br><br>def detect_objects(<br>    image: InputImage,<br>   
 prompt: str = OBJECT_DETECTION_PROMPT,<br>    model: MultimodalModel = MultimodalModel.DEFAULT,<br>    config: GenerateContentConfig | None = None,<br>    media_resolution: PartMediaResolutionLevel | None = None,<br>    display_results: bool = True,<br>) -&gt; None:<br>    display_image_source_info(image)<br>    pil_image, content_part = get_pil_image_and_part(image, model, media_resolution)<br>    prompt = prompt.strip()<br>    contents = [content_part, prompt]<br>    config = config or get_object_detection_config(model)<br><br>    response = generate_content(contents, model, config)<br><br>    if response is not None and response.parsed is not None:<br>        detected_objects = cast(DetectedObjects, response.parsed)<br>    else:<br>        detected_objects = DetectedObjects()<br><br>    workflow = VisualObjectWorkflow(pil_image, detected_objects)<br>    workflow_by_image[image] = workflow<br>    add_cropped_objects(workflow, image, prompt)<br><br>    if display_results:<br>        display_detected_objects(workflow)<br><br><br>def get_pil_image_and_part(<br>    image: InputImage,<br>    model: MultimodalModel,<br>    media_resolution: PartMediaResolutionLevel | None,<br>) -&gt; tuple[PIL_Image, Part]:<br>    if isinstance(image, Path):<br>        image_bytes = image.read_bytes()<br>    else:<br>        headers = {&quot;User-Agent&quot;: &quot;Mozilla/5.0&quot;}<br>        req = urllib.request.Request(image, headers=headers)<br>        with urllib.request.urlopen(req, timeout=10) as response:<br>            image_bytes = response.read()<br><br>    pil_image = PIL.Image.open(io.BytesIO(image_bytes))<br>    mime_type = f&quot;image/{pil_image.format.lower()}&quot; if pil_image.format else &quot;image/*&quot;<br>    content_part = Part.from_bytes(<br>        data=image_bytes,<br>        mime_type=mime_type,<br>        media_resolution=media_resolution,<br>    )<br><br>    return pil_image, content_part<br><br><br>def get_object_detection_config(model: Model) -&gt; 
GenerateContentConfig:<br>    # Low randomness for more determinism<br>    return GenerateContentConfig(<br>        temperature=0.0,<br>        top_p=0.0,<br>        seed=42,<br>        response_mime_type=&quot;application/json&quot;,<br>        response_schema=DetectedObjects,<br>        thinking_config=get_thinking_config(model),<br>    )<br><br><br>def add_cropped_objects(<br>    workflow: VisualObjectWorkflow,<br>    input: InputImage,<br>    prompt: str,<br>    crop_margin: int = CROP_MARGIN_PX,<br>) -&gt; None:<br>    cropped_images: list[PIL_Image] = []<br>    obj_count = len(workflow.detected_objects)<br>    for obj_order, obj in enumerate(workflow.detected_objects, 1):<br>        cropped_image, _ = extract_object_image(workflow.source_image, obj, crop_margin)<br>        cropped_images.append(cropped_image)<br>        save_workflow_image(<br>            WorkflowStep.SOURCE,<br>            WorkflowStep.CROPPED,<br>            input,<br>            obj_order,<br>            obj_count,<br>            cropped_image,<br>            dict(prompt=prompt, crop_margin=str(crop_margin)),<br>        )<br>    workflow.images_by_step[WorkflowStep.CROPPED] = cropped_images<br><br><br>def extract_object_image(<br>    image: PIL_Image,<br>    obj: DetectedObject,<br>    margin: int = 0,<br>) -&gt; tuple[PIL_Image, tuple[int, int, int, int]]:<br>    def clamp(coord: int, dim: int) -&gt; int:<br>        return min(max(coord, 0), dim)<br><br>    x1, y1, x2, y2 = obj.box_2d<br>    w, h = image.size<br>    if margin != 0:<br>        x1, x2 = clamp(x1 - margin, w), clamp(x2 + margin, w)<br>        y1, y2 = clamp(y1 - margin, h), clamp(y2 + margin, h)<br><br>    box = (x1, y1, x2, y2)<br>    object_image = image.crop(box)<br><br>    return object_image, box<br><br><br>def save_workflow_image(<br>    source_step: WorkflowStep,<br>    target_step: WorkflowStep,<br>    input_image: InputImage,<br>    obj_order: int,<br>    obj_count: int,<br>    target_image: PIL_Image | None,<br>    
image_info: dict[str, str] | None = None,<br>) -&gt; None:<br>    if not SAVE_GENERATED_IMAGES or target_image is None:<br>        return<br>    OUTPUT_IMAGES_PATH.mkdir(parents=True, exist_ok=True)<br>    time_str = datetime.now().strftime(&quot;%Y-%m-%d_%H-%M-%S&quot;)<br>    try:<br>        filename = f&quot;{Source(input_image).name}_&quot;<br>    except ValueError:<br>        filename = &quot;&quot;<br>    filename += f&quot;{obj_order}o{obj_count}_{source_step}_{target_step}_{time_str}.png&quot;<br>    image_path = OUTPUT_IMAGES_PATH.joinpath(filename)<br>    params = {}<br>    if image_info:<br>        png_info = PngInfo()<br>        for k, v in image_info.items():<br>            png_info.add_text(k, v)<br>        params.update(pnginfo=png_info)<br>    target_image.save(image_path, **params)<br><br><br># Matplotlib<br>FIGURE_FG_COLOR = &quot;#F1F3F4&quot;<br>FIGURE_BG_COLOR = &quot;#202124&quot;<br>EDGE_COLOR = &quot;#80868B&quot;<br>rcParams = {<br>    &quot;figure.dpi&quot;: 300,<br>    &quot;text.color&quot;: FIGURE_FG_COLOR,<br>    &quot;figure.facecolor&quot;: FIGURE_BG_COLOR,<br>    &quot;figure.edgecolor&quot;: FIGURE_FG_COLOR,<br>    &quot;axes.titlecolor&quot;: FIGURE_FG_COLOR,<br>    &quot;axes.edgecolor&quot;: EDGE_COLOR,<br>    &quot;xtick.color&quot;: FIGURE_FG_COLOR,<br>    &quot;ytick.color&quot;: FIGURE_FG_COLOR,<br>    &quot;xtick.bottom&quot;: False,<br>    &quot;xtick.top&quot;: False,<br>    &quot;ytick.left&quot;: False,<br>    &quot;ytick.right&quot;: False,<br>    &quot;xtick.labelbottom&quot;: False,<br>    &quot;ytick.labelleft&quot;: False,<br>}<br>plt.rcParams.update(rcParams)<br><br><br>def display_image_source_info(image: InputImage) -&gt; None:<br>    def get_image_info_md() -&gt; str:<br>        if image not in Source:<br>            return f&quot;[[Source Image]({image})]&quot;<br>        source = Source(image)<br>        metadata = metadata_by_source.get(source)<br>        if not metadata:<br>            return 
f&quot;[[Source Image]({source.value})]&quot;<br>        parts = [<br>            f&quot;[Source Image]({source.value})&quot;,<br>            f&quot;[Source Page]({metadata.webpage_url})&quot;,<br>            metadata.title,<br>            metadata.credit_line,<br>        ]<br>        separator = &quot;•&quot;<br>        inner_info = f&quot; {separator} &quot;.join(parts)<br>        return f&quot;{separator} {inner_info} {separator}&quot;<br><br>    def yield_md_rows() -&gt; Iterator[str]:<br>        horizontal_line = &quot;---&quot;<br>        image_info = get_image_info_md()<br>        yield horizontal_line<br>        yield f&quot;_{image_info}_&quot;<br>        yield horizontal_line<br><br>    display_markdown(f&quot;{chr(10)}{chr(10)}&quot;.join(yield_md_rows()))<br><br><br>def display_detected_objects(workflow: VisualObjectWorkflow) -&gt; None:<br>    source_image = workflow.source_image<br>    detected_objects = PIL.Image.new(&quot;RGB&quot;, source_image.size, &quot;white&quot;)<br>    for obj in workflow.detected_objects:<br>        obj_image, box = extract_object_image(source_image, obj)<br>        detected_objects.paste(obj_image, (box[0], box[1]))<br><br>    _, (ax1, ax2) = plt.subplots(1, 2, layout=&quot;compressed&quot;)<br>    ax1.imshow(source_image)<br>    ax2.imshow(detected_objects)<br><br>    disable_colab_cell_scrollbar()<br>    plt.show()<br><br><br>print(&quot;✅ Object detection helpers defined&quot;)</pre><p>🧪 Let’s start simple: can we detect the single illustration in this incunable from 1485?</p><pre>detect_objects(Source.incunable)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UbIz3l0BxzpyVr5En6ubKw.png" /><figcaption><em>• </em><a href="https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2014:2014rosen0487:0165/full/pct:25/0/default.jpg"><em>Source Image</em></a><em> • </em><a href="https://www.loc.gov/resource/rbc0001.2014rosen0487/?sp=165"><em>Source Page</em></a><em> • Vergaderinge der historien van 
Troy (1485) • Library of Congress, Rare Book and Special Collections Division •</em></figcaption></figure><blockquote><em>💡 This works nicely. The bounding box is precise and tightly encloses the hand-colored woodcut illustration.</em></blockquote><p>🧪 Now, let’s check the detection of multiple visuals in this museum guidebook:</p><pre>detect_objects(Source.museum_guidebook)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xwEEPJHm7jEhnwAMD2AVXw.png" /><figcaption><em>• </em><a href="https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2014:2014gen34181:0033/full/pct:75/0/default.jpg"><em>Source Image</em></a><em> • </em><a href="https://www.loc.gov/resource/rbc0001.2014gen34181/?sp=33"><em>Source Page</em></a><em> • Barnum’s American Museum illustrated (1850) • Library of Congress, Rare Book and Special Collections Division •</em></figcaption></figure><p>💡 Remarks:</p><ul><li>The bounding boxes are again very precise.</li><li>The results are perfect for this page: there are no false positives and no false negatives.</li><li>The captions below the visuals are left outside the bounding boxes, as the prompt specifically requested. The bounding box granularity can be controlled by changing the prompt.</li></ul><p>🧪 What about slightly warped visuals?</p><pre>detect_objects(Source.paintings)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Z9m0NCshi_2x0v4D98lyMg.png" /><figcaption><em>• </em><a href="https://images.unsplash.com/photo-1714146681164-f26fed839692?h=1440"><em>Source Image</em></a><em> • </em><a href="https://unsplash.com/photos/9hD7qrxICag"><em>Source Page</em></a><em> • Open book showing paintings by Vincent van Gogh • Photo by Trung Manh cong on Unsplash •</em></figcaption></figure><blockquote><em>💡 The slight warping doesn’t affect detection. Notice how the bottom-right painting is partially covered by the orange bookmark. 
We’ll try to fix that in the restoration step.</em></blockquote><p>🧪 What about the tilted visuals in this book about the architecture in Denver?</p><pre>detect_objects(Source.denver_illustrated)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*6UKNrbH6ArojaCKm.png" /><figcaption><em>• </em><a href="https://tile.loc.gov/image-services/iiif/service:gdc:gdclccn:rc:01:00:04:94:rc01000494:0051/full/pct:50/0/default.jpg"><em>Source Image</em></a><em> • </em><a href="https://www.loc.gov/resource/gdclccn.rc01000494/?sp=51"><em>Source Page</em></a><em> • Denver illustrated (1893) • Library of Congress, Meeting of Frontiers •</em></figcaption></figure><blockquote><em>💡 Each visual is perfectly detected: spatial understanding covers tilted objects.</em></blockquote><p>🧪 Finally, let’s check the detection on this significantly warped book page from Alice’s Adventures in Wonderland:</p><pre>detect_objects(Source.alice_drawing)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*abXzXUShi_eXSrfP.png" /><figcaption><em>• </em><a href="https://images.unsplash.com/photo-1630595011903-689853b04ee2?h=800"><em>Source Image</em></a><em> • </em><a href="https://unsplash.com/photos/bewzr_Q9u2o"><em>Source Page</em></a><em> • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •</em></figcaption></figure><blockquote><em>💡 Page curvature and other distortions don’t prevent non-rectangular objects from being detected. In fact, spatial understanding works at the pixel level, which explains this precision for warped objects. If you’d like to work at a lower level, you can also ask for a “segmentation mask” in the prompt and you’ll get a base64-encoded PNG (each pixel giving the 0–255 probability it belongs to the object within the bounding box). 
See the </em><a href="https://ai.google.dev/gemini-api/docs/image-understanding#segmentation"><em>segmentation doc</em></a><em> for more details.</em></blockquote><h3>🏷️ Text extraction and dynamic labeling</h3><p>On top of localizing each object with its bounding box, our prompt requested to extract a verbatim caption and to assign a single-word label, when possible.</p><p>Let’s add a simple function to display the detection data in a table:</p><pre>from collections import defaultdict<br><br><br>def display_detection_data(source: Source, show_consolidated: bool = False) -&gt; None:<br>    def string_with_visible_linebreaks(s: str) -&gt; str:<br>        return f&#39;&#39;&#39;&quot;{s.replace(chr(10), &quot;↩️&quot;)}&quot;&#39;&#39;&#39;<br><br>    def yield_md_rows_consolidated(workflow: VisualObjectWorkflow) -&gt; Iterator[str]:<br>        yield &quot;| label | count | captions |&quot;<br>        yield &quot;| :--- | ---: | :--- |&quot;<br>        stats = defaultdict(list)<br>        for obj in workflow.detected_objects:<br>            stats[obj.label].append(string_with_visible_linebreaks(obj.caption))<br>        for label, captions in stats.items():<br>            count = len(captions)<br>            label_captions = &quot; • &quot;.join(sorted(captions))<br>            yield f&quot;| {label} | {count} | {label_captions} |&quot;<br><br>    def yield_md_rows_with_bbox(workflow: VisualObjectWorkflow) -&gt; Iterator[str]:<br>        yield &quot;| box_2d | label | caption |&quot;<br>        yield &quot;| :--- | :--- | :--- |&quot;<br>        for obj in workflow.detected_objects:<br>            yield f&quot;| {obj.box_2d} | {obj.label} | {string_with_visible_linebreaks(obj.caption)} |&quot;<br><br>    workflow = workflow_by_image.get(source)<br>    if workflow is None:<br>        print(f&#39;❌ No detection for source &quot;{source.name}&quot;&#39;)<br>        return<br>    md_rows = list(<br>        yield_md_rows_consolidated(workflow)<br>        if 
show_consolidated<br>        else yield_md_rows_with_bbox(workflow)<br>    )<br>    display_image_source_info(source)<br>    display_markdown(chr(10).join(md_rows))</pre><p>In the museum guidebook, the dynamic labeling is precise according to the context, and the captions below each illustration are perfectly extracted:</p><pre>display_detection_data(Source.museum_guidebook)</pre><pre>| box_2d                   | label     | caption                   |<br>| :----------------------- | :-------- | :------------------------ |<br>| [954, 629, 1338, 1166]   | beetle    | &quot;The Horned Beetle.&quot;      |<br>| [265, 984, 464, 1504]    | armor     | &quot;Armor of a Man.&quot;         |<br>| [737, 984, 915, 1328]    | armor     | &quot;Horse Armor.&quot;            |<br>| [1225, 1244, 1589, 1685] | beetle    | &quot;The Goliath Beetle.&quot;     |<br>| [264, 1766, 431, 2006]   | mask      | &quot;The Mask.&quot;               |<br>| [937, 1769, 1260, 2087]  | butterfly | &quot;Painted Lady Butterfly.&quot; |<br>| [1325, 2170, 1581, 2468] | butterfly | &quot;The Lady Butterfly.&quot;     |</pre><p>In the book photo showing four paintings, this is perfect too:</p><pre>display_detection_data(Source.paintings)</pre><pre>| box_2d                | label    | caption                                                                                                                                            |<br>| :-------------------- | :------- | :------------------------------------------------------------------------------------------------------------------------------------------------- |<br>| [378, 203, 837, 575]  | painting | &quot;Hái Ô-liu (Olive Picking), tháng 12 năm 1889, sơn dầu trên toan, 28 3/4 x 35 in. [73 x 89 cm]&quot;                                                    |<br>| [913, 207, 1380, 563] | painting | &quot;Hẻm núi Les Peiroulets (Les Peiroulets Ravine), tháng 10 năm 1889, sơn dầu trên toan, 28 3/4 x 36 1/4 in. 
[73 x 92 cm]&quot;                           |<br>| [387, 596, 845, 978]  | painting | &quot;Trưa: Nghỉ ngơi (phỏng theo Millet) (Noon: Rest from Work [after Millet]), tháng 1 năm 1890, sơn dầu trên toan, 28 3/4 x 35 7/8 in. [73 x 91 cm]&quot; |<br>| [921, 611, 1397, 982] | painting | &quot;Hoa hạnh đào (Almond Blossom), tháng 2 năm 1890, sơn dầu trên toan, 28 3/8 x 36 1/4 in. [73 x 92 cm]&quot;                                             |</pre><p>In the Denver architecture book, the four captions are assigned to the correct illustrations, which was not an obvious task:</p><pre>display_detection_data(Source.denver_illustrated)</pre><pre>| box_2d                 | label    | caption                        |<br>| :--------------------- | :------- | :----------------------------- |<br>| [203, 224, 741, 839]   | building | &quot;ERNEST AND CRANMER BUILDING.&quot; |<br>| [743, 73, 1192, 758]   | building | &quot;PEOPLE&#39;S BANK BUILDING.&quot;      |<br>| [1185, 211, 1787, 865] | building | &quot;BOSTON BUILDING.&quot;             |<br>| [699, 754, 1238, 1203] | building | &quot;COOPER BUILDING.&quot;             |</pre><blockquote><em>💡 If you have a closer look at the input image, it’s hard to tell which caption belongs to which illustration at a glance. Most of us would need to think about it (and might be wrong). Asking Gemini reveals that the results are intentional and not pure luck: </em>Deciphering vintage layouts can feel a bit like a puzzle, but there is usually a “reading-order” logic at play. In this specific case, the captions are arranged to correspond with the images in a clockwise or Z-pattern starting from the top left.</blockquote><p>In the “Alice’s Adventures in Wonderland” book page, there was a single illustration accompanying the story text. 
As expected, the caption is empty (i.e., no false positive):</p><pre>display_detection_data(Source.alice_drawing)</pre><pre>| box_2d                | label        | caption |<br>| :-------------------- | :----------- | :------ |<br>| [111, 146, 1008, 593] | illustration | &quot;&quot;      |</pre><h3>🔭 Generalizing object detection</h3><p>We can use the same principles for other object types. We’ll generally keep requesting bounding boxes to identify object positions within images. Without changing our current output structure (i.e., no code change), we can use captions and labels to extract different object metadata depending on the input type.</p><p>🧪 See how we can detect electronic components by adapting the prompt while keeping the exact same code and output structure:</p><pre>ELECTRONIC_COMPONENT_DETECTION_PROMPT = &quot;&quot;&quot;<br>Exhaustively detect all the individual electronic components in the image and provide the following data for each:<br>- `box_2d`: bounding box coordinates.<br>- `caption`: Verbatim alphanumeric text visible on the component (including original line breaks), or &quot;&quot; if no text is present.<br>- `label`: Specific type of component.<br>&quot;&quot;&quot;<br><br>detect_objects(<br>    Source.electronics,<br>    ELECTRONIC_COMPONENT_DETECTION_PROMPT,<br>    media_resolution=PartMediaResolutionLevel.MEDIA_RESOLUTION_ULTRA_HIGH,<br>)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*GJdmctVO_RuPrPzx.png" /><figcaption><em>• </em><a href="https://images.unsplash.com/photo-1757397584789-8b2c5bfcdbc3?h=1440"><em>Source Image</em></a><em> • </em><a href="https://unsplash.com/photos/Aqa1pHQ57pw"><em>Source Page</em></a><em> • Circuit board with electronic components • Photo by Albert Stoynov on Unsplash •</em></figcaption></figure><p>💡 Remarks:</p><ul><li>Large and tiny components are detected, thanks to the specific instruction “exhaustively detect…”.</li><li>By using the ultra-high media resolution, we 
ensure more details are tokenized and the “P” component (a visual outlier) gets detected.</li></ul><p>Here’s a consolidated view of the detected components:</p><pre>display_detection_data(Source.electronics, show_consolidated=True)</pre><pre>| label              | count | captions                                        |<br>| :----------------- | ----: | :---------------------------------------------- |<br>| integrated circuit |     3 | &quot;49240↩️020S6K&quot; • &quot;8105↩️0:35&quot; • &quot;P4010↩️9NA0&quot; |<br>| resistor           |     4 | &quot;&quot; • &quot;&quot; • &quot;105&quot; • &quot;R020&quot;                        |<br>| inductor           |     1 | &quot;n1W&quot;                                           |<br>| diode              |     3 | &quot;K&quot; • &quot;L&quot; • &quot;P&quot;                                 |<br>| capacitor          |     6 | &quot;&quot; • &quot;&quot; • &quot;&quot; • &quot;&quot; • &quot;&quot; • &quot;&quot;                     |<br>| transistor         |     1 | &quot;41&quot;                                            |<br>| connector          |     1 | &quot;&quot;                                              |</pre><p>💡 Remarks:</p><ul><li>Components are detected along with their text markings, despite the three different text orientations (upright, sideways, and upside down), the blur, and the photo noise.</li><li>We removed the degree of freedom for multi-line text by specifying the inclusion of “original line breaks” in the prompt: responses now consistently include the line breaks for the three integrated circuits (displayed with the ↩️ emoji for better visibility).</li><li>The last degree of freedom lies in the labeling. While most components have been properly labeled, it is unclear whether the “P” component is a diode, a resistor, or a fuse. 
Making the instructions more specific (e.g., listing the possible labels, using an enum for the label field in the Pydantic class, or providing guidelines and more details about the expected circuit boards) should make the prompt more &quot;closed&quot; and the results more deterministic and accurate.<br>It&#39;s also possible to adjust the thinking_config setting, which will trigger a chain of thought before generating the final answer. In all the detections performed, our code used ThinkingLevel.MINIMAL, which didn&#39;t consume any thought tokens (with Gemini 3 Flash). Updating the parameter to ThinkingLevel.LOW, ThinkingLevel.MEDIUM, or ThinkingLevel.HIGH will use thought tokens and can lead to better outputs in complex cases.</li></ul><p>This demonstrates the versatility of the approach. Without training a model, we switched from detecting 15th-century woodcuts and illustrations with vintage layouts to identifying modern electronics just by changing the prompt. Such detections, including caption and label metadata, could be used to auto-crop components for a parts catalog, verify assembly lines, or create interactive schematics… all without a single labeled training image.</p><h3>🪄 Editing visual objects</h3><p>Now that we can detect visual objects, we can envision an automation workflow to extract and reuse them. For this, our default model will be Gemini 2.5 Flash Image (also known as Nano Banana 🍌), a state-of-the-art image generation and editing model.</p><p>Our object editing functions will follow the same template, taking the images from one workflow step as input and generating the edited images for the output step. 
Let’s define core helpers for this:</p><pre>from typing import Protocol<br><br><br>class ObjectEditingFunction(Protocol):<br>    def __call__(<br>        self,<br>        image: InputImage,<br>        prompt: str | None = None,<br>        model: ImageModel | None = None,<br>        config: GenerateContentConfig | None = None,<br>        display_results: bool = True,<br>    ) -&gt; None: ...<br><br><br>SourceTargetSteps = tuple[WorkflowStep, WorkflowStep]<br>registered_functions: dict[SourceTargetSteps, ObjectEditingFunction] = {}<br><br>DEFAULT_EDITING_CONFIG = GenerateContentConfig(response_modalities=[&quot;IMAGE&quot;])<br>EMPTY_IMAGE = PIL.Image.new(&quot;1&quot;, (1, 1), &quot;white&quot;)<br><br><br>def object_editing_function(<br>    default_prompt: str,<br>    source_step: WorkflowStep,<br>    target_step: WorkflowStep,<br>    default_model: ImageModel = ImageModel.DEFAULT,<br>    default_config: GenerateContentConfig = DEFAULT_EDITING_CONFIG,<br>) -&gt; ObjectEditingFunction:<br>    def editing_function(<br>        image: InputImage,<br>        prompt: str | None = default_prompt,<br>        model: ImageModel | None = default_model,<br>        config: GenerateContentConfig | None = default_config,<br>        display_results: bool = True,<br>    ) -&gt; None:<br>        workflow, source_images = get_workflow_and_step_images(image, source_step)<br>        if prompt is None:<br>            prompt = default_prompt<br>        prompt = prompt.strip()<br>        if model is None:<br>            model = default_model<br>        # Note: &quot;config is None&quot; is valid and will use the model endpoint default config<br><br>        target_images: list[PIL_Image] = []<br>        display_image_source_info(image)<br>        obj_count = len(source_images)<br>        for obj_order, source_image in enumerate(source_images, 1):<br>            target_image = generate_image([source_image], prompt, model, config)<br>            save_workflow_image(<br>                
source_step,<br>                target_step,<br>                image,<br>                obj_order,<br>                obj_count,<br>                target_image,<br>                dict(prompt=prompt),<br>            )<br>            target_images.append(target_image if target_image else EMPTY_IMAGE)<br><br>        workflow.images_by_step[target_step] = target_images<br>        if display_results:<br>            display_sources_and_targets(workflow, source_step, target_step)<br><br>    registered_functions[(source_step, target_step)] = editing_function<br><br>    return editing_function<br><br><br>def get_workflow_and_step_images(<br>    image: InputImage,<br>    step: WorkflowStep,<br>) -&gt; tuple[VisualObjectWorkflow, list[PIL_Image]]:<br>    # Objects detected?<br>    if image not in workflow_by_image:<br>        detect_objects(image, display_results=False)<br>    workflow = workflow_by_image.get(image)<br>    assert workflow is not None<br><br>    # Workflow step objects? (single level, could be extended to a dynamical graph)<br>    operation = (WorkflowStep.CROPPED, step)<br>    if step not in workflow.images_by_step and operation in registered_functions:<br>        source_function = registered_functions[operation]<br>        source_function(image, display_results=False)<br><br>    # Source images<br>    source_images = workflow.images_by_step.get(step)<br>    assert source_images is not None<br><br>    return workflow, source_images<br><br><br>def display_sources_and_targets(<br>    workflow: VisualObjectWorkflow,<br>    source_step: WorkflowStep,<br>    target_step: WorkflowStep,<br>) -&gt; None:<br>    source_images = workflow.images_by_step[source_step]<br>    target_images = workflow.images_by_step[target_step]<br>    if not source_images:<br>        print(&quot;❌ No images to display&quot;)<br>        return<br><br>    fig = plt.figure(layout=&quot;compressed&quot;)<br>    if horizontal := (len(source_images) &gt;= 2):<br>        rows, cols = 2, 
len(source_images)<br>    else:<br>        rows, cols = len(source_images), 2<br>    gs = fig.add_gridspec(rows, cols)<br><br>    for i, (source_image, target_image) in enumerate(<br>        zip(source_images, target_images, strict=True)<br>    ):<br>        for dim, image in enumerate([source_image, target_image]):<br>            grid_spec = gs[dim, i] if horizontal else gs[i, dim]<br>            ax = fig.add_subplot(grid_spec)<br>            ax.set_axis_off()<br>            ax.imshow(image)<br><br>    disable_colab_cell_scrollbar()<br>    plt.show()<br><br><br>print(&quot;✅ Object editing helpers defined&quot;)</pre><p>Now, let’s define a first editing step to restore the detected objects, which can contain many real-life artifacts…</p><h3>✨ Restoring visual objects</h3><p>For this restoration step, we need to craft a prompt that is generic enough (to cover most use cases) but also specific enough (to take into account restoration needs).</p><p>An image editing prompt is based on natural language, typically using imperative or descriptive instructions. With an imperative prompt, you describe the actions to perform on the input, while with a descriptive prompt, you describe the expected output. Both are possible and can provide equivalent results. Your choice is largely a matter of preference, as long as the prompt makes sense.</p><p>Our test suite is mostly composed of book photos, which can contain various photographic and paper artifacts. 
The Nano Banana models understand these subtleties and can edit images accordingly, which simplifies the prompt.</p><p>Here is a possible restoration function using an imperative prompt:</p><pre>RESTORATION_PROMPT = &quot;&quot;&quot;<br>- Isolate and straighten the visual on a pure white background, excluding any surrounding text.<br>- Clean up all physical artifacts and noise while preserving every original detail.<br>- Center the result and scale it to fit the canvas with minimal, symmetrical margins, ensuring no distortion or cropping.<br>&quot;&quot;&quot;<br><br># Default config with low randomness for more deterministic restoration outputs<br>RESTORATION_CONFIG = GenerateContentConfig(<br>    temperature=0.0,<br>    top_p=0.0,<br>    seed=42,<br>    response_modalities=[&quot;IMAGE&quot;],<br>)<br><br>restore_objects = object_editing_function(<br>    RESTORATION_PROMPT,<br>    WorkflowStep.CROPPED,<br>    WorkflowStep.RESTORED,<br>    default_config=RESTORATION_CONFIG,<br>)<br><br>print(&quot;✅ Restoration function defined&quot;)</pre><p>🧪 Let’s try to restore the illustration from the 1485 incunable:</p><pre>restore_objects(Source.incunable)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*_9ExeIkXZgr_DfJ5.png" /><figcaption><em>• </em><a href="https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2014:2014rosen0487:0165/full/pct:25/0/default.jpg"><em>Source Image</em></a><em> • </em><a href="https://www.loc.gov/resource/rbc0001.2014rosen0487/?sp=165"><em>Source Page</em></a><em> • Vergaderinge der historien van Troy (1485) • Library of Congress, Rare Book and Special Collections Division •</em></figcaption></figure><blockquote><em>💡 We now have a nice restoration of the hand-colored woodcut illustration. Note that our prompt is generic ( </em>“clean up all physical artifacts”<em>) and could be made more specific to remove more or fewer artifacts. 
In this example, some artifacts remain, such as the paper discoloration in the sword or the bleeding ink in the armor. We’ll see if we can fix these in the colorization step.</em></blockquote><p>🧪 What about the illustrations from the museum guidebook?</p><pre>restore_objects(Source.museum_guidebook)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*k5GNDcointglk0mC.png" /><figcaption><em>• </em><a href="https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2014:2014gen34181:0033/full/pct:75/0/default.jpg"><em>Source Image</em></a><em> • </em><a href="https://www.loc.gov/resource/rbc0001.2014gen34181/?sp=33"><em>Source Page</em></a><em> • Barnum’s American Museum illustrated (1850) • Library of Congress, Rare Book and Special Collections Division •</em></figcaption></figure><blockquote><em>💡 All good!</em></blockquote><p>🧪 What about the slightly warped visuals?</p><pre>restore_objects(Source.paintings)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*BGYF2NWWTPmUZ0wb.png" /><figcaption><em>• </em><a href="https://images.unsplash.com/photo-1714146681164-f26fed839692?h=1440"><em>Source Image</em></a><em> • </em><a href="https://unsplash.com/photos/9hD7qrxICag"><em>Source Page</em></a><em> • Open book showing paintings by Vincent van Gogh • Photo by Trung Manh cong on Unsplash •</em></figcaption></figure><p>💡 Remarks:</p><ul><li>Notice how, in the last painting, the orange bookmark is properly removed and the hidden part inpainted to complete the painting.</li><li>Our prompt asked to “fit the canvas with minimal, symmetrical margins, ensuring no distortion or cropping”. Depending on the aspect ratio and type of the visual, this degree of freedom can result in different white margins.</li><li>This example shows famous paintings by Vincent van Gogh. Nano Banana does not fetch any reference images and only uses the provided input. 
If these were photos of private paintings, they would be restored in the same way.</li></ul><p>In the Denver architecture book, the illustrations can be tilted, which our generic prompt does not fully take into account. When several geometric transformations are involved, it can be challenging to craft an imperative prompt that details all the operations to perform. Instead, a descriptive prompt can be more straightforward by directly describing the expected output.</p><p>🧪 Here’s an example of a descriptive prompt focusing on the restoration of tilted visuals:</p><pre>tilted_visual_prompt = &quot;&quot;&quot;<br>An upright, high-fidelity rendition of the visual isolated against a pure white background, filling the canvas with minimal uniform margins. The output is clean, sharp, and free of physical artifacts.<br>&quot;&quot;&quot;<br><br>restore_objects(Source.denver_illustrated, tilted_visual_prompt)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*bVFJzGevDrubPKfn.png" /><figcaption><em>• </em><a href="https://tile.loc.gov/image-services/iiif/service:gdc:gdclccn:rc:01:00:04:94:rc01000494:0051/full/pct:50/0/default.jpg"><em>Source Image</em></a><em> • </em><a href="https://www.loc.gov/resource/gdclccn.rc01000494/?sp=51"><em>Source Page</em></a><em> • Denver illustrated (1893) • Library of Congress, Meeting of Frontiers •</em></figcaption></figure><p>💡 Remarks:</p><ul><li>To get these results, the prompt focuses on requesting an “upright” visual “filling the canvas”, which proves more straightforward to write than trying to account for all possible geometric corrections.</li><li>The native visual understanding automatically identifies the content type (photo, illustration, etc.) 
and the different artifacts (photographic, paper, printing, scanning…), allowing for precise restorations out of the box.</li><li>Notice how the consistency is preserved: the last visual is restored as an illustration, while the first visuals maintain their photographic style.</li><li>The results, with this rather generic prompt, are impressive. It is, of course, possible to be more specific and request particular lighting, styles, colors…</li></ul><p>In this last test, the input visual has distortions not only from the page curvature but also from the photo perspective.</p><p>🧪 Here’s an example of a descriptive prompt focusing on restoring warped illustrations:</p><pre>warped_visual_prompt = &quot;&quot;&quot;<br>An edge-to-edge digital extraction of the illustration from the provided book photo, excluding any peripheral text. All page curvature and perspective distortions are corrected, resulting in an image framed in a perfect rectangle, on a pure white canvas with minimal margins.<br>&quot;&quot;&quot;<br><br>restore_objects(Source.alice_drawing, warped_visual_prompt)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*SYYWKJ9KvvCPvcqH.png" /><figcaption><em>• </em><a href="https://images.unsplash.com/photo-1630595011903-689853b04ee2?h=800"><em>Source Image</em></a><em> • </em><a href="https://unsplash.com/photos/bewzr_Q9u2o"><em>Source Page</em></a><em> • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •</em></figcaption></figure><blockquote><em>💡 It is really impressive that such a restoration can be performed in a single step. Note that this prompt is not stable and can generate less optimal results (it would benefit from being more precise). If you have complex transformations, test descriptive prompts iteratively, using precise and concise instructions, and you might be pleasantly surprised. 
In the worst case, it’s also possible to process the transformations in successive, easier steps.</em></blockquote><p>Now, let’s add a colorization step…</p><h3>🎨 Colorization</h3><p>Our restoration step respected the original styles of the input images. Recent image editing models excel at transforming image styles, starting with colors. This can generally be performed directly with a simple, precise instruction.</p><p>Here is a possible colorization function using an imperative prompt:</p><pre>COLORIZATION_PROMPT = &quot;&quot;&quot;<br>Colorize this image in a modern book illustration style, maintaining all original details without any additions.<br>&quot;&quot;&quot;<br><br>colorize = object_editing_function(<br>    COLORIZATION_PROMPT,<br>    WorkflowStep.RESTORED,<br>    WorkflowStep.COLORIZED,<br>)<br><br>print(&quot;✅ Colorization function defined&quot;)</pre><p>🧪 Let’s modernize our 1485 illustration:</p><pre>colorize(Source.incunable)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ZwoEmD05EJmpzqnn.png" /><figcaption><em>• </em><a href="https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2014:2014rosen0487:0165/full/pct:25/0/default.jpg"><em>Source Image</em></a><em> • </em><a href="https://www.loc.gov/resource/rbc0001.2014rosen0487/?sp=165"><em>Source Page</em></a><em> • Vergaderinge der historien van Troy (1485) • Library of Congress, Rare Book and Special Collections Division •</em></figcaption></figure><blockquote><em>💡 All details are preserved, as requested in the prompt. 
Notice how the colorization can naturally fix some remaining artifacts (e.g., the paper discoloration in the sword or the bleeding ink in the armor).</em></blockquote><p>🧪 Let’s colorize our museum guidebook illustrations:</p><pre>colorize(Source.museum_guidebook)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*X6xeg4tmNuHZQcRw.png" /><figcaption><em>• </em><a href="https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2014:2014gen34181:0033/full/pct:75/0/default.jpg"><em>Source Image</em></a><em> • </em><a href="https://www.loc.gov/resource/rbc0001.2014gen34181/?sp=33"><em>Source Page</em></a><em> • Barnum’s American Museum illustrated (1850) • Library of Congress, Rare Book and Special Collections Division •</em></figcaption></figure><blockquote><em>💡 Our prompt is very open as it only specifies “modern book illustration style”. This can generate very creative colorizations, but they all seem to make perfect sense.</em></blockquote><p>🧪 What about our Denver buildings?</p><pre>colorize(Source.denver_illustrated)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*EOL3lUsdrOkmdS-8.png" /><figcaption><em>• </em><a href="https://tile.loc.gov/image-services/iiif/service:gdc:gdclccn:rc:01:00:04:94:rc01000494:0051/full/pct:50/0/default.jpg"><em>Source Image</em></a><em> • </em><a href="https://www.loc.gov/resource/gdclccn.rc01000494/?sp=51"><em>Source Page</em></a><em> • Denver illustrated (1893) • Library of Congress, Meeting of Frontiers •</em></figcaption></figure><blockquote><em>💡 As requested, they all look like modern illustrations, including the first visuals (originating from noisy photos).</em></blockquote><p>It’s possible to go further by not only “colorizing” but also “transforming” the image into a significantly different one.</p><p>🧪 Let’s make our “Alice’s Adventures in Wonderland” drawing into a watercolor painting:</p><pre>watercolor_prompt = &quot;&quot;&quot;<br>Transform this visual into a warm, watercolor painting.<br>&quot;&quot;&quot;<br><br>colorize(Source.alice_drawing, watercolor_prompt)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*DlDTWEKBjzlP81L0.png" /><figcaption><em>• </em><a href="https://images.unsplash.com/photo-1630595011903-689853b04ee2?h=800"><em>Source Image</em></a><em> • </em><a href="https://unsplash.com/photos/bewzr_Q9u2o"><em>Source Page</em></a><em> • Open book showing an illustration and text from Alice’s Adventures in 
Wonderland • Photo by Brett Jordan on Unsplash •</em></figcaption></figure><p>🧪 What about making it a traditional painting?</p><pre>painting_prompt = &quot;&quot;&quot;<br>Transform this visual into a traditional painting.<br>&quot;&quot;&quot;<br><br>colorize(Source.alice_drawing, painting_prompt)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ivTTl09qMrxj-4AY.png" /><figcaption><em>• </em><a href="https://images.unsplash.com/photo-1630595011903-689853b04ee2?h=800"><em>Source Image</em></a><em> • </em><a href="https://unsplash.com/photos/bewzr_Q9u2o"><em>Source Page</em></a><em> • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •</em></figcaption></figure><p>We can also change image compositions. Depending on the context, some compositions are more or less implied by default. For example, illustrations often have margins, while photos generally have edge-to-edge (full-bleed in the printing world) compositions. 
When possible, it’s interesting to refer to a type of visual (which intrinsically brings a lot of semantics to the context) and adjust the instructions accordingly.</p><p>🧪 Let’s see how we can detect engravings in this 1847 book, restore them, and transform them into modern digital graphics:</p><pre>detect_objects(Source.engravings)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ny4apKIZqjX9InOS.png" /><figcaption><em>• </em><a href="https://tile.loc.gov/image-services/iiif/service:gdc:gdcscd:00:34:07:66:92:1:00340766921:0121/full/pct:50/0/default.jpg"><em>Source Image</em></a><em> • </em><a href="https://www.loc.gov/resource/gdcscd.00340766921/?sp=121"><em>Source Page</em></a><em> • Harper’s illustrated catalogue (1847) • Library of Congress •</em></figcaption></figure><pre>restore_objects(Source.engravings)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*07bIPH3J_Irc3rEC.png" /><figcaption><em>• </em><a href="https://tile.loc.gov/image-services/iiif/service:gdc:gdcscd:00:34:07:66:92:1:00340766921:0121/full/pct:50/0/default.jpg"><em>Source Image</em></a><em> • </em><a href="https://www.loc.gov/resource/gdcscd.00340766921/?sp=121"><em>Source Page</em></a><em> • Harper’s illustrated catalogue (1847) • Library of Congress •</em></figcaption></figure><pre>visual_to_digital_graphic_prompt = &quot;&quot;&quot;<br>Transform this visual into a full-color, flat digital graphic, extending the content for a full-bleed effect.<br>&quot;&quot;&quot;<br><br>colorize(Source.engravings, visual_to_digital_graphic_prompt)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Af0GbnsMv_-MDXiv.png" /><figcaption><em>• </em><a href="https://tile.loc.gov/image-services/iiif/service:gdc:gdcscd:00:34:07:66:92:1:00340766921:0121/full/pct:50/0/default.jpg"><em>Source Image</em></a><em> • </em><a href="https://www.loc.gov/resource/gdcscd.00340766921/?sp=121"><em>Source Page</em></a><em> • Harper’s illustrated catalogue 
(1847) • Library of Congress •</em></figcaption></figure><p>🧪 We can also transform the same engravings into photos with a very simple prompt:</p><pre>visual_to_photo_prompt = &quot;&quot;&quot;<br>Transform this visual into a high-end, modern camera photograph.<br>&quot;&quot;&quot;<br><br>colorize(Source.engravings, visual_to_photo_prompt)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Ss__KSvDxQ3MDKw2.png" /><figcaption><em>• </em><a href="https://tile.loc.gov/image-services/iiif/service:gdc:gdcscd:00:34:07:66:92:1:00340766921:0121/full/pct:50/0/default.jpg"><em>Source Image</em></a><em> • </em><a href="https://www.loc.gov/resource/gdcscd.00340766921/?sp=121"><em>Source Page</em></a><em> • Harper’s illustrated catalogue (1847) • Library of Congress •</em></figcaption></figure><blockquote><em>💡 As photos are generally full-bleed, the prompt does not need to specify a composition.</em></blockquote><p>It’s really up to our imagination, as Nano Banana seems to grasp every aspect of the visual semantics.</p><p>Let’s add a final step to see how far we can go, reimagining images as cinematic movie stills…</p><h3>🎞️ Cinematization</h3><p>We’ve used rather “closed” prompts so far, crafting specific instructions and constraints to control the outputs. It’s possible to go even further with “open” prompts and generate images in full creative mode. 
Notably, it can be interesting to refer to photographic or cinematographic terminology as it encompasses many visual techniques.</p><p>Here is a possible generic cinematization function to reimagine images as movie stills:</p><pre>CINEMATIZATION_PROMPT = &quot;&quot;&quot;<br>Reimagine this image as a joyful, modern live-action cinematic movie still featuring professional lighting and composition.<br>&quot;&quot;&quot;<br><br>cinematize = object_editing_function(<br>    CINEMATIZATION_PROMPT,<br>    WorkflowStep.RESTORED,<br>    WorkflowStep.CINEMATIZED,<br>)</pre><p>🧪 Let’s cinematize the “Alice’s Adventures in Wonderland” drawing:</p><pre>cinematize(Source.alice_drawing)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*eFqfiTlmOw8f19h5.png" /><figcaption><em>• </em><a href="https://images.unsplash.com/photo-1630595011903-689853b04ee2?h=800"><em>Source Image</em></a><em> • </em><a href="https://unsplash.com/photos/bewzr_Q9u2o"><em>Source Page</em></a><em> • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •</em></figcaption></figure><blockquote><em>💡 This looks like a high-budget movie still. There are lots of degrees of freedom in the prompt, but you’re likely to get foreground figures in sharp focus, a gradual background blur, “golden hour” lighting (a magical ingredient for many cinematographers), and detailed textures. 
Such compositions really evoke different atmospheres compared to the photos generated in the previous test.</em></blockquote><p>🧪 Let’s test the workflow on a page from the Wonderful Wizard of Oz containing three drawings:</p><pre>detect_objects(Source.wizard_of_oz_drawings)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*mtzj1fkaZlfB1vZO.png" /><figcaption><em>• </em><a href="https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2006:2006gen32405:0048/full/pct:25/0/default.jpg"><em>Source Image</em></a><em> • </em><a href="https://www.loc.gov/resource/rbc0001.2006gen32405/?sp=48"><em>Source Page</em></a><em> • The wonderful Wizard of Oz (1899) • Library of Congress, Rare Book and Special Collections Division •</em></figcaption></figure><pre>restore_objects(Source.wizard_of_oz_drawings)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*vtE_-NZAoVTf_1Om.png" /><figcaption><em>• </em><a href="https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2006:2006gen32405:0048/full/pct:25/0/default.jpg"><em>Source Image</em></a><em> • </em><a href="https://www.loc.gov/resource/rbc0001.2006gen32405/?sp=48"><em>Source Page</em></a><em> • The wonderful Wizard of Oz (1899) • Library of Congress, Rare Book and Special Collections Division •</em></figcaption></figure><pre>cinematize(Source.wizard_of_oz_drawings)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*eJI-8iXM2S-jG8pI.png" /><figcaption><em>• </em><a href="https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2006:2006gen32405:0048/full/pct:25/0/default.jpg"><em>Source Image</em></a><em> • </em><a href="https://www.loc.gov/resource/rbc0001.2006gen32405/?sp=48"><em>Source Page</em></a><em> • The wonderful Wizard of Oz (1899) • Library of Congress, Rare Book and Special Collections Division •</em></figcaption></figure><blockquote><em>💡 The cast for a new movie is ready 😉</em></blockquote><p>Cinematic images have various use 
cases:</p><ul><li>These cinematized stills can be perfect “reference images” for video generation models like Veo. See <a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/video/use-reference-images-to-guide-video-generation#googlegenaisdk_videogen_with_image-python_genai_sdk">Generate Veo videos from reference images</a>.</li><li>As they are photorealistic representations, they can also be a source for generating 2D or 3D visuals, in any style, with realistic figures, perfect proportions, advanced lighting, enhanced compositions…</li><li>You can use them in many professional contexts or for high-end products: presentations, magazines, posters, storyboards, brainstorming sessions…</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*t6g72E-DxbelSixzGWg1GA.png" /></figure><h3>🏁 Conclusion</h3><ul><li>Gemini’s native spatial understanding enables the detection of specific visual objects based on a single prompt in natural language.</li><li>We tested the detection of illustrations in book photos, which traditional machine learning (ML) models usually miss, as they are typically trained to detect people, animals, vehicles, food, and a finite set of physical object classes.</li><li>We tested the detection of straight, tilted, and even significantly warped illustrations, and they were always precisely identified.</li><li>The core implementation was straightforward, requiring minimal code using the Python SDK and customized prompts. 
By comparison, fine-tuning a traditional object detection model is time-consuming: it involves assembling an image dataset, labeling objects, and managing training jobs.</li><li>This solution is very flexible: we could switch from detecting illustrations to electronic components, by adapting the prompt, while keeping the code unchanged.</li><li>Using structured outputs (with a JSON schema or Pydantic classes, and the Python SDK) makes the code both easy to implement and ready to deploy to production.</li><li>Then, Nano Banana allows editing these visual objects in virtually any way imaginable.</li><li>We tested a workflow with restoration, colorization, and even cinematization steps, using imperative and descriptive prompts.</li><li>The possibilities seem really endless, and the principles in this exploration can be reused in different contexts.</li></ul><h3>➕ More!</h3><ul><li><strong>Try it yourself:</strong> Use the <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/spatial-understanding/object_detection_and_editing.ipynb">companion notebook</a> to reproduce the results in this article.</li><li><strong>Read more:</strong> Read <a href="https://medium.com/google-cloud/generating-consistent-imagery-with-gemini-nano-banana-6e807b4d1f77?source=friends_link&amp;sk=717e6077e70aad45d24df4d2e0d09780">Generating Consistent Imagery</a> to explore another image generation use case.</li><li><strong>Get inspired:</strong> Check out the <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/nano-banana/nano_banana_recipes.ipynb">Nano Banana recipes notebook</a> for more practical examples.</li><li><strong>Follow me:</strong> Connect with me (@PicardParis) on <a href="https://www.linkedin.com/in/picardparis">LinkedIn</a> or <a href="https://x.com/PicardParis">Twitter-X</a> for more cloud, applied AI, and Python explorations…</li></ul><p>Thanks for reading. 
Let me know if you create something cool!</p><p><em>Originally published at </em><a href="https://towardsdatascience.com/detecting-and-editing-visual-objects-with-gemini/"><em>https://towardsdatascience.com</em></a><em> on February 26, 2026.</em></p><hr><p><a href="https://medium.com/google-cloud/detecting-and-editing-visual-objects-with-gemini-214bde9d5792">Detecting and Editing Visual Objects with Gemini</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[🍌 Testing Gemini 3 Pro Image]]></title>
            <link>https://medium.com/google-cloud/testing-gemini-3-pro-image-f585236ae411?source=rss-6be63961431c------2</link>
            <guid isPermaLink="false">https://medium.com/p/f585236ae411</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[generative-ai]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[image-generation]]></category>
            <category><![CDATA[gemini]]></category>
            <dc:creator><![CDATA[Laurent Picard]]></dc:creator>
            <pubDate>Thu, 20 Nov 2025 15:57:50 GMT</pubDate>
            <atom:updated>2025-11-22T13:59:41.190Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*61lWNXm0UiBD7l61HTp-QA.jpeg" /><figcaption>4K image generated with Nano Banana Pro 🍌, using 2 reference images and 1 prompt</figcaption></figure><p>“Gemini 3 Pro Image” (aka Nano Banana Pro 🍌) just launched (in preview) and is the new state-of-the-art image generation/editing model.</p><p>What’s new:</p><ul><li><strong>Advanced reasoning</strong></li><li><strong>Advanced text rendering</strong></li><li>Real-world <strong>grounding with Google Search</strong></li><li>High-resolution outputs: 1K, <strong>2K, and 4K</strong></li><li>Up to <strong>14 reference input images</strong></li></ul><p>Let’s switch our code directly from 1K to 4K, and check how it behaves compared to the <a href="https://medium.com/google-cloud/generating-consistent-imagery-with-gemini-nano-banana-6e807b4d1f77">previous test</a>…</p><h3>🪄 Character sheet</h3><pre>source_ids = [AssetId.ARCHIVE]<br>prompt = &quot;&quot;&quot;<br>- Scene: Robot character sheet.<br>- Left half: Direct flat front view of the robot.<br>- Right half: Direct flat back view of the robot (seamless back).<br>- In both views, the robot wears a same small, brown-felt backpack, with a tiny polished-brass buckle and simple straps.<br>- Background: Pure white.<br>- Top text: Caption the image &quot;ROBOT CHARACTER SHEET&quot;.<br>- Bottom text: Caption the views &quot;FRONT VIEW&quot; and &quot;BACK VIEW&quot;.<br>&quot;&quot;&quot;<br>new_id = AssetId.ROBOT<br><br>generate_image(source_ids, prompt, new_id)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pvRfskVtyBYwkHRR7xlV5Q.png" /></figure><p>🔼️💡 Perfect, every time, not missing the backpack straps, in both views, and in 4K!</p><h3>✨ First scene</h3><pre>source_ids = [AssetId.ROBOT]<br>prompt = &quot;&quot;&quot;<br>- Image 1: Robot character sheet.<br>- Scene: Macro photography of a beautifully crafted miniature diorama.<br>- Foreground: A 
medium-gray felt cliff is visible in the bottom-left corner. The robot stands on the edge of the cliff, viewed from a 3/4 back angle, looking out over a sea of clouds (made of white cotton).<br>- Background: A range of infinite, interspersed, large dome-like felt mountains, in various shades of medium blue/green, with curvy white snowcaps.<br>- Angle: Eye level.<br>- Lighting: Studio, clean and soft.<br>&quot;&quot;&quot;<br>new_id = AssetId.MOUNTAINS<br><br>generate_image(source_ids, prompt, new_id)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kqOwchZf1PLyM-_GZvYbkw.png" /></figure><p>🔼️💡 More mountains can be generated. The “infinite” concept is understood.</p><h3>✨ Successive scenes</h3><pre>source_ids = [AssetId.ROBOT, AssetId.MOUNTAINS]<br>prompt = &quot;&quot;&quot;<br>- Image 1: Robot character sheet.<br>- Image 2: Previous scene.<br>- The robot has descended from the cliff to a gray felt valley. It stands in the center, seen directly from the back. It is holding/reading a felt map with outstretched arms.<br>- Large smooth, round, felt rocks in various beige/gray shades are visible on the sides.<br>- Background: The distant mountain range. 
A thin layer of clouds obscures its base and the end of the valley.<br>- Lighting: Golden hour light, soft and diffused.<br>&quot;&quot;&quot;<br>new_id = AssetId.VALLEY<br><br>generate_image(source_ids, prompt, new_id)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DWE1y-5Laq3HVZwpuv2ryA.png" /></figure><p>🔼️💡 Beautiful mix and lighting!</p><pre>source_ids = [AssetId.ROBOT, AssetId.VALLEY]<br>prompt = &quot;&quot;&quot;<br>- Image 1: Robot character sheet.<br>- Image 2: Previous scene.<br>- The robot walks to the end of the valley and reaches an infinite forest wall filling the entire background.<br>- The forest is a dense cluster of tall, slender, simple-upright-elongated-cone-with-rounded-tip-and-no-trunk trees.<br>- The trees are made from various shades of light/medium/dark green felt.<br>- The robot is on the right, viewed from a 3/4 rear angle, no longer holding the map, with both hands clasped to its ears in despair.<br>- On the left &amp; right bottom sides, rocks (similar to image 2) are partially visible.<br>&quot;&quot;&quot;<br>new_id = AssetId.FOREST<br><br>generate_image(source_ids, prompt, new_id)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FBLMcIC84NpIwkmDO1qPRA.png" /></figure><p>🔼️💡 It was not possible to generate so many trees before. Nano Banana Pro can “manage many more objects” (if that means anything).</p><pre>source_ids = [AssetId.ROBOT, AssetId.FOREST]<br>prompt = &quot;&quot;&quot;<br>- Image 1: Robot character sheet.<br>- Image 2: Previous scene.<br>- The robot goes through the dense forest and emerges into a clearing.<br>- The robot is in the center, front-facing the camera.<br>- The ground is made of green felt, with flat patches of white felt snow. 
Rocks are no longer visible.<br>&quot;&quot;&quot;<br>new_id = AssetId.CLEARING<br><br>generate_image(source_ids, prompt, new_id)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_owXyL-ufxGnDr37E0JVQQ.png" /></figure><p>🔼️💡 Magnificent clearing! Many consistent trees. Footsteps were not specified but are a nice addition to the scene indeed!</p><pre>source_ids = [AssetId.ROBOT, AssetId.MOUNTAINS]<br>prompt = &quot;&quot;&quot;<br>- Image 1: Robot character sheet.<br>- Image 2: Previous scene.<br>- The robot is now climbing the peak of a medium-green mountain, reaching its summit.<br>- The mountain is in the center of the image, with the robot on its left, viewed from a 3/4 rear angle.<br>- The robot has both feet on the mountain and is using two felt ice axes (brown handles, gray heads), reaching the snowcap.<br>- Horizon: Filled by the distant mountain range.<br>&quot;&quot;&quot;<br>new_id = AssetId.ASCENSION<br><br>generate_image(source_ids, prompt, new_id)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ucDWW3holQBO9f135pLcNw.png" /></figure><p>🔼️💡 Excellent! Nice focus on the foreground and soft-focus on the background.</p><pre>source_ids = [AssetId.ROBOT, AssetId.ASCENSION]<br>prompt = &quot;&quot;&quot;<br>- Image 1: Robot character sheet.<br>- Image 2: Previous scene.<br>- The robot reaches the top and stands on the summit (facing the camera).<br>- It is no longer holding the ice axes, which are planted upright in the snow on each side.<br>- It has both arms raised in sign of victory.<br>&quot;&quot;&quot;<br>new_id = AssetId.SUMMIT<br><br>generate_image(source_ids, prompt, new_id)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AHx_J0Mb61Dy3OnIZcyI4g.png" /></figure><p>🔼️💡 Perfect! 
Even nicer when displayed in 4K.</p><pre>source_ids = [AssetId.ROBOT, AssetId.SUMMIT]<br>prompt = &quot;&quot;&quot;<br>- Image 1: Robot character sheet.<br>- Image 2: Previous scene.<br>- Remove the ice axes.<br>- Move the center mountain to the left side of the image and add a slightly taller medium-blue mountain to the right side.<br>- Suspend a stylized felt bridge between the two mountains: Its deck is made of thick felt planks in various uniform wood shades.<br>- Place the robot on the center of the bridge with one arm pointing toward the blue mountain.<br>- Keep the horizon filled by the distant mountain range.<br>&quot;&quot;&quot;<br>new_id = AssetId.BRIDGE<br><br>generate_image(source_ids, prompt, new_id)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*uVPalX8ulQJ1vmK80WKzlQ.png" /></figure><p>🔼️💡 Excellent prompt adherence! You can even specify that “the robot walks toward the blue mountain” or request more complex compositions.</p><pre>source_ids = [AssetId.ROBOT, AssetId.BRIDGE]<br>prompt = &quot;&quot;&quot;<br>- Image 1: Robot character sheet.<br>- Image 2: Previous scene.<br>- The robot is sleeping peacefully (both eyes changed into a &quot;closed&quot; state) in a large comfortable brown-and-tan tartan hammock that has replaced the bridge.<br>- The backpack lays at the robot&#39;s feet.<br>&quot;&quot;&quot;<br>new_id = AssetId.HAMMOCK<br><br>generate_image(source_ids, prompt, new_id)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vJNKjm7AcM0DKa0fJoWWVA.png" /></figure><p>🔼️💡 This is one of the most complex prompts: Nano Banana Pro replaces the bridge with a hammock, and never fails to render “both eyes changed into a ‘closed’ state”. 
You can even move the “backpack at the robot’s feet”, even though the character sheet never shows the robot without its backpack!</p><h3>🚀 Conclusion</h3><p>In this quick, limited test, we observed the following:</p><ul><li><strong>Advanced reasoning</strong>: Results are consistent on complex prompts, which can even be enriched with more constraints.</li><li><strong>Advanced composition</strong>: Many more objects can be generated, which can lead to even more breathtaking images.</li><li><strong>Advanced text rendering</strong>: The cover image for this article was generated with the following additional instruction: <em>The robot is holding a light-beige crocheted sign in its hands that says, over 3 lines, “I love” “Nano Banana” “Pro 🍌” with navy blue letters, and the emoji in yellow.</em></li><li>It takes <strong>fewer iterations when trying new scenes, thanks to the advanced prompt understanding and adherence</strong>!</li><li>With everything in <strong>4K resolution</strong>!</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*IvX4hL6czop6rOwX9IcdmA.png" /><figcaption>The robot is holding a light-beige crocheted sign in its hands that says, over 3 lines, “I love” “Nano Banana” “Pro 🍌” with navy blue letters, and the emoji in yellow.</figcaption></figure><p>We now have two state-of-the-art image generation/editing models in our toolbox:</p><ul><li><strong>Gemini 2.5 Flash Image</strong> for speed and cost-effectiveness</li><li><strong>Gemini 3 Pro Image</strong> for advanced reasoning capabilities + 2K + 4K, and a lot more (we just scratched the surface)</li></ul><h3>➕ More!</h3><ul><li><strong>Dive in:</strong> Check out the <a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/3-pro-image">documentation</a>.</li><li><strong>Experiment</strong>: Run the <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/getting-started/intro_gemini_3_image_gen.ipynb">getting-started 
notebook</a>.</li><li><strong>Get inspired:</strong> Check out the <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/nano-banana/nano_banana_recipes.ipynb">Nano Banana recipes notebook</a> for more practical examples.</li><li><strong>Explore</strong>: See additional use cases in the <a href="https://console.cloud.google.com/vertex-ai/studio/prompt-gallery?utm_source=blog&amp;utm_medium=external&amp;utm_campaign=CDR_0x8c87a0bc_default_b447103558">Vertex AI Prompt Gallery</a>.</li><li><strong>Stay updated</strong>: Follow the <a href="https://cloud.google.com/vertex-ai/generative-ai/docs/release-notes?utm_source=blog&amp;utm_medium=external&amp;utm_campaign=CDR_0x8c87a0bc_default_b447103558">Vertex AI Release Notes</a>.</li><li><strong>Follow me:</strong> Connect with me (@PicardParis) on <a href="https://www.linkedin.com/in/picardparis">LinkedIn</a> or <a href="https://x.com/PicardParis">Twitter-X</a> for more cloud, applied AI, and Python explorations…</li></ul><hr><p><a href="https://medium.com/google-cloud/testing-gemini-3-pro-image-f585236ae411">🍌 Testing Gemini 3 Pro Image</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Generating Consistent Imagery with Gemini Nano Banana]]></title>
            <link>https://medium.com/google-cloud/generating-consistent-imagery-with-gemini-nano-banana-6e807b4d1f77?source=rss-6be63961431c------2</link>
            <guid isPermaLink="false">https://medium.com/p/6e807b4d1f77</guid>
            <category><![CDATA[gemini]]></category>
            <category><![CDATA[generative-ai]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[image-generation]]></category>
            <dc:creator><![CDATA[Laurent Picard]]></dc:creator>
            <pubDate>Tue, 23 Sep 2025 21:49:57 GMT</pubDate>
            <atom:updated>2025-10-03T11:35:04.326Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pBSET5blJVMwCuxk6cIi_A.gif" /><figcaption>A practical guide to building a prompt-based generation pipeline for your image library</figcaption></figure><p><em>A few notes before we dive in:</em></p><ul><li><em>The complete source code for this article, including future updates, is available in </em><a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/media-generation/consistent_imagery_generation.ipynb"><em>this notebook</em></a><em> under the Apache 2.0 license.</em></li><li><em>All new images in this article were generated with Gemini Nano Banana using the proof-of-concept generation pipeline explored here.</em></li><li><em>This article was originally published at </em><a href="https://towardsdatascience.com/generating-consistent-imagery-with-gemini/"><em>towardsdatascience.com</em></a><em>.</em></li><li><em>You can experiment with Gemini for free in </em><a href="https://aistudio.google.com?utm_source=blog&amp;utm_medium=external&amp;utm_campaign=CDR_0x8c87a0bc_default_b447103558"><em>Google AI Studio</em></a><em>. Please note that programmatic API access to Nano Banana is a pay-as-you-go service.</em></li><li><em>2025-10-02: Nano Banana became generally available for production and added support for ten aspect ratios. The notebook and article were updated accordingly.</em></li></ul><h3>🔥 Challenge</h3><p>We all have existing images worth reusing in different contexts. This would generally imply modifying the images, a complex (if not impossible) task requiring very specific skills and tools. This explains why our archives are full of forgotten or unused treasures. 
State-of-the-art vision models have evolved so much that we can reconsider this problem.</p><p>So, can we breathe new life into our visual archives?</p><p>Let’s try to complete this challenge with the following steps:</p><ul><li>1️⃣ Start from an archive image we’d like to reuse</li><li>2️⃣ Extract a character to create a brand-new reference image</li><li>3️⃣ Generate a series of images to illustrate the character’s journey, using only prompts and the new assets</li></ul><p>For this, we’ll explore the capabilities of “Gemini 2.5 Flash Image”, also known as “Nano Banana” 🍌.</p><h3>🏁 Setup</h3><h4>🐍 Python packages</h4><p>We’ll use the following packages:</p><ul><li>google-genai: The <a href="https://pypi.org/project/google-genai">Google Gen AI Python SDK</a> lets us call Gemini with a few lines of code</li><li>networkx for graph management</li></ul><p>We’ll also use the following dependencies:</p><ul><li>pillow and matplotlib for data visualization</li><li>tenacity for request management</li></ul><pre>%pip install --quiet &quot;google-genai&gt;=1.40.0&quot; &quot;networkx[default]&quot;</pre><h4>🤖 Gen AI SDK</h4><p>Create a google.genai client:</p><pre>from google import genai<br><br>check_environment()<br><br>client = genai.Client()</pre><p>Check your configuration:</p><pre>check_configuration(client)</pre><pre>Using the Vertex AI API with project &quot;...&quot; in location &quot;europe-west1&quot;</pre><h3>🧠 Gemini model</h3><p>For this challenge, we’ll select the latest Gemini 2.5 Flash Image model:</p><pre>GEMINI_2_5_FLASH_IMAGE = &quot;gemini-2.5-flash-image&quot;</pre><h3>🛠️ Helpers</h3><p>Define some helper functions to generate and display images:</p><pre>import IPython.display<br>import tenacity<br>from google.genai.errors import ClientError<br>from google.genai.types import GenerateContentConfig, ImageConfig, PIL_Image<br><br>GEMINI_2_5_FLASH_IMAGE = &quot;gemini-2.5-flash-image&quot;<br><br># You can add the &quot;TEXT&quot; modality for potential textual 
feedback (or in iterative chat mode)<br>RESPONSE_MODALITIES = [&quot;IMAGE&quot;]<br><br># Supported aspect ratios: &quot;1:1&quot;, &quot;2:3&quot;, &quot;3:2&quot;, &quot;3:4&quot;, &quot;4:3&quot;, &quot;4:5&quot;, &quot;5:4&quot;, &quot;9:16&quot;, &quot;16:9&quot;, and &quot;21:9&quot;<br>ASPECT_RATIO = &quot;16:9&quot;<br><br>GENERATION_CONFIG = GenerateContentConfig(<br>    response_modalities=RESPONSE_MODALITIES,<br>    image_config=ImageConfig(aspect_ratio=ASPECT_RATIO),<br>)<br><br><br>def generate_content(sources: list[PIL_Image], prompt: str) -&gt; PIL_Image | None:<br>    prompt = prompt.strip()<br>    contents = [*sources, prompt] if sources else prompt<br><br>    response = None<br>    for attempt in get_retrier():<br>        with attempt:<br>            response = client.models.generate_content(<br>                model=GEMINI_2_5_FLASH_IMAGE,<br>                contents=contents,<br>                config=GENERATION_CONFIG,<br>            )<br><br>    if not response or not response.candidates:<br>        return None<br>    if not (content := response.candidates[0].content):<br>        return None<br>    if not (parts := content.parts):<br>        return None<br><br>    image: PIL_Image | None = None<br>    for part in parts:<br>        if part.text:<br>            display_markdown(part.text)<br>            continue<br>        assert (sdk_image := part.as_image())<br>        assert (image := sdk_image._pil_image)<br>        display_image(image)<br><br>    return image<br><br><br>def get_retrier() -&gt; tenacity.Retrying:<br>    return tenacity.Retrying(<br>        stop=tenacity.stop_after_attempt(7),<br>        wait=tenacity.wait_incrementing(start=10, increment=1),<br>        retry=should_retry_request,<br>        reraise=True,<br>    )<br><br><br>def should_retry_request(retry_state: tenacity.RetryCallState) -&gt; bool:<br>    if not retry_state.outcome:<br>        return False<br>    err = retry_state.outcome.exception()<br>    if not 
isinstance(err, ClientError):<br>        return False<br>    print(f&quot;❌ ClientError {err.code}: {err.message}&quot;)<br><br>    retry = False<br>    match err.code:<br>        case 400 if err.message is not None and &quot; try again &quot; in err.message:<br>            # Workshop: Cloud Storage accessed for the first time (service agent provisioning)<br>            retry = True<br>        case 429:<br>            # Workshop: temporary project with 1 QPM quota<br>            retry = True<br>    print(f&quot;🔄 Retry: {retry}&quot;)<br><br>    return retry<br><br><br>def display_markdown(markdown: str) -&gt; None:<br>    IPython.display.display(IPython.display.Markdown(markdown))<br><br><br>def display_image(image: PIL_Image) -&gt; None:<br>    IPython.display.display(image)</pre><h3>🖼️ Assets</h3><p>Let’s define the assets for our character’s journey and the functions to manage them:</p><pre>import enum<br>from collections.abc import Sequence<br>from dataclasses import dataclass<br><br><br>class AssetId(enum.StrEnum):<br>    ARCHIVE = &quot;0_archive&quot;<br>    ROBOT = &quot;1_robot&quot;<br>    MOUNTAINS = &quot;2_mountains&quot;<br>    VALLEY = &quot;3_valley&quot;<br>    FOREST = &quot;4_forest&quot;<br>    CLEARING = &quot;5_clearing&quot;<br>    ASCENSION = &quot;6_ascension&quot;<br>    SUMMIT = &quot;7_summit&quot;<br>    BRIDGE = &quot;8_bridge&quot;<br>    HAMMOCK = &quot;9_hammock&quot;<br><br><br>@dataclass<br>class Asset:<br>    id: str<br>    source_ids: Sequence[str]<br>    prompt: str<br>    pil_image: PIL_Image<br><br><br>class Assets(dict[str, Asset]):<br>    def set_asset(self, asset: Asset) -&gt; None:<br>        # Note: This replaces any existing asset (if needed, add guardrails to auto-save or keep all versions)<br>        self[asset.id] = asset<br><br><br>def generate_image(source_ids: Sequence[str], prompt: str, new_id: str = &quot;&quot;) -&gt; None:<br>    sources = [assets[source_id].pil_image for source_id in source_ids]<br>    
prompt = prompt.strip()<br>    image = generate_content(sources, prompt)<br>    if image and new_id:<br>        assets.set_asset(Asset(new_id, source_ids, prompt, image))<br><br><br>assets = Assets()</pre><h3>📦 Reference archive</h3><p>We can now fetch our reference archive and make it our first asset:</p><pre>import urllib.request<br><br>import PIL.Image<br><br>ARCHIVE_URL = &quot;https://storage.googleapis.com/github-repo/generative-ai/gemini/use-cases/media-generation/consistent_imagery_generation/0_archive.png&quot;<br><br><br>def load_archive() -&gt; None:<br>    image = get_image_from_url(ARCHIVE_URL)<br>    assets.set_asset(Asset(AssetId.ARCHIVE, [], &quot;&quot;, image))<br>    display_image(image)<br><br><br>def get_image_from_url(image_url: str) -&gt; PIL_Image:<br>    with urllib.request.urlopen(image_url) as response:<br>        return PIL.Image.open(response)</pre><pre>load_archive()</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZcUmZYD1zenNXyebe64CQg.png" /></figure><p>This archive image was generated in July 2024 with a beta version of Imagen 3, prompted with <em>“On white background, a small hand-felted toy of blue robot. The felt is soft and cuddly…”</em>. The result looked really good but, at the time, there was absolutely no determinism and no consistency. As a result, <strong>this was a nice one-shot image generation but the cute little robot seemed gone forever…</strong></p><h3>⛏️ Asset extraction</h3><p>Let’s try to extract our little robot:</p><pre>source_ids = [AssetId.ARCHIVE]<br>prompt = &quot;Extract the robot in a clean cutout on a solid white fill.&quot;<br><br>generate_image(source_ids, prompt)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5WMyfsii5jNbwymKHl7Ekg.png" /></figure><p>⚠️ The robot is perfectly extracted, but this is essentially a good background removal, which many models can perform. 
This prompt uses terms from graphics software, whereas we can now reason in terms of image composition. It’s also not necessarily a good idea to try to use traditional binary masks, as object edges and shadows convey significant details about shapes, textures, positions, and lighting.</p><p>Let’s go back to our archive to perform an advanced extraction instead, and directly generate a character sheet…</p><h3>🪄 Character sheet</h3><p>Gemini has spatial understanding, so it’s able to provide different views while preserving visual features. Let’s generate a front/back character sheet and, as our little robot will go on a journey, also add a backpack at the same time:</p><pre>source_ids = [AssetId.ARCHIVE]<br>prompt = &quot;&quot;&quot;<br>- Scene: Robot character sheet.<br>- Left: Front view of the extracted robot.<br>- Right: Back view of the extracted robot (seamless back).<br>- The robot wears a same small, brown-felt backpack, with a tiny polished-brass buckle and simple straps in both views. The backpack straps are visible in both views.<br>- Background: Pure white.<br>- Text: On the top, caption the image &quot;ROBOT CHARACTER SHEET&quot; and, on the bottom, caption the views &quot;FRONT VIEW&quot; and &quot;BACK VIEW&quot;.<br>&quot;&quot;&quot;<br>new_id = AssetId.ROBOT<br><br>generate_image(source_ids, prompt, new_id)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ofqNjW55eu-FlU3pC-GYMA.png" /></figure><p>💡 A few remarks:</p><ul><li>Our prompt focuses on the composition of the scene, a common practice in media studios.</li><li>Successive generated images will be consistent, preserving all robot features visible in the provided image. However, since we only specified some features of the backpack (e.g., a single buckle) and left others unspecified, we’ll get slightly different backpacks.</li><li>For simplicity, we directly included the backpack in the character sheet. 
In a real production pipeline, we would likely make it part of a separate accessory sheet.</li><li>To control the backpack’s exact shape and design, we could also use a reference photo of a real backpack and instruct Gemini to “transform the backpack into a stylized felt version”.</li><li>Gemini can generate 1024 × 1024 images (1:1 aspect ratio) or equivalent resolutions (token-wise) for the other supported aspect ratios (2:3, 3:2, 3:4, 4:3, 4:5, 5:4, 9:16, 16:9, and 21:9).</li><li>In the request configuration, we specified aspect_ratio=&quot;16:9&quot;, which generates images at 1344 × 768 pixels. If this parameter is omitted, Gemini uses the aspect ratio of the input image (the last one if multiple are provided) to select the closest supported aspect ratio.</li></ul><p>This new asset can now serve as a design reference in our future image generations.</p><h3>✨ First scene</h3><p>Let’s get started with some mountain scenery:</p><pre>source_ids = [AssetId.ROBOT]<br>prompt = &quot;&quot;&quot;<br>- Image 1: Robot character sheet.<br>- Scene: Macro photography of a beautifully crafted miniature diorama.<br>- Background: Soft-focus of a panoramic range of interspersed, dome-like felt mountains, in various shades of medium blue/green, with curvy white snowcaps, extending over the entire horizon.<br>- Foreground: In the bottom-left, the robot stands on the edge of a medium-gray felt cliff, viewed from a 3/4 back angle, looking out over a sea of clouds (made of white cotton).<br>- Lighting: Studio, clean and soft.<br>&quot;&quot;&quot;<br>new_id = AssetId.MOUNTAINS<br><br>generate_image(source_ids, prompt, new_id)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*6yWh3vVojiBvOZKLRbjV0g.png" /></figure><blockquote><em>💡 The mountain shape is specified as “dome-like” so our character can stand on one of the summits later on.</em></blockquote><p>It’s important to spend some time on this first scene as, in a cascading effect, it will define the overall 
look of our story. Take some time to refine the prompt or try a couple of times to get the best variation.</p><p>From now on, our generation inputs will be both the character sheet and a reference scene…</p><h3>✨ Successive scenes</h3><p>Let’s get the robot down a valley:</p><pre>source_ids = [AssetId.ROBOT, AssetId.MOUNTAINS]<br>prompt = &quot;&quot;&quot;<br>- Image 1: Robot character sheet.<br>- Image 2: Previous scene.<br>- The robot has descended from the cliff to a gray felt valley. It stands in the center, seen directly from the back. It is holding/reading a felt map with outstretched arms.<br>- Large smooth, round, felt rocks in various beige/gray shades are visible on the sides.<br>- Background: The distant mountain range. A thin layer of clouds obscures its base and the end of the valley.<br>- Lighting: Golden hour light, soft and diffused.<br>&quot;&quot;&quot;<br>new_id = AssetId.VALLEY<br><br>generate_image(source_ids, prompt, new_id)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*JhqfAj9MtFj9E79kynKM4A.png" /></figure><p>💡 A few notes:</p><ul><li>The provided specifications about our input images (&quot;Image 1:...&quot;, &quot;Image 2:...&quot;) are important. Without them, &quot;the robot&quot; could refer to any of the 3 robots in the input images (2 in the character sheet, 1 in the previous scene). With them, we indicate that it&#39;s the same robot. 
In case of confusion, we can be more specific with &quot;the [entity] from image [number]&quot;.</li><li>On the other hand, since we didn’t provide a precise description of the valley, successive requests will give different, interesting, and creative results (we can pick our favorite or make the prompt more precise for more determinism).</li><li>Here, we also tested different lighting, which significantly changes the whole scene.</li></ul><p>Then, we can move forward into this scene:</p><pre>source_ids = [AssetId.ROBOT, AssetId.VALLEY]<br>prompt = &quot;&quot;&quot;<br>- Image 1: Robot character sheet.<br>- Image 2: Previous scene.<br>- The robot goes on and faces a dense, infinite forest of simple, giant, thin trees, that fills the entire background.<br>- The trees are made from various shades of light/medium/dark green felt.<br>- The robot is on the right, viewed from a 3/4 rear angle, no longer holding the map, with both hands clasped to its ears in despair.<br>- On the left &amp; right bottom sides, rocks (similar to image 2) are partially visible.<br>&quot;&quot;&quot;<br>new_id = AssetId.FOREST<br><br>generate_image(source_ids, prompt, new_id)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*NWM4unROweJkObR8oItbEw.png" /></figure><p>💡 Of interest:</p><ul><li>We could position the character, change its point of view, and even “animate” its arms for more expressivity.</li><li>The “no longer holding the map” clarification prevents the model from trying to keep the map from the previous scene in some meaningful way (e.g., the robot dropped the map on the floor).</li><li>We didn’t provide lighting details: The lighting source, quality, and direction have been kept from the previous scene.</li></ul><p>Let’s go through the forest:</p><pre>source_ids = [AssetId.ROBOT, AssetId.FOREST]<br>prompt = &quot;&quot;&quot;<br>- Image 1: Robot character sheet.<br>- Image 2: Previous scene.<br>- The robot goes through the dense forest and emerges into a clearing, 
pushing aside two tree trunks.<br>- The robot is in the center, now seen from the front view.<br>- The ground is made of green felt, with flat patches of white felt snow. Rocks are no longer visible.<br>&quot;&quot;&quot;<br>new_id = AssetId.CLEARING<br><br>generate_image(source_ids, prompt, new_id)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RNtRZrVs-e3j1o20WfFdmg.png" /></figure><blockquote><em>💡 We changed the ground but didn’t provide additional details for the view and the forest: The model will generally preserve most of the trees.</em></blockquote><p>Now that the valley-forest sequence is over, we can journey up to the mountains, using the original mountain scene as our reference to return to that environment:</p><pre>source_ids = [AssetId.ROBOT, AssetId.MOUNTAINS]<br>prompt = &quot;&quot;&quot;<br>- Image 1: Robot character sheet.<br>- Image 2: Previous scene.<br>- Close-up of the robot now climbing the peak of a medium-green mountain and reaching its summit.<br>- The mountain is right in the center, with the robot on its left slope, viewed from a 3/4 rear angle.<br>- The robot has both feet on the mountain and is using two felt ice axes (brown handles, gray heads), reaching the snowcap.<br>- Horizon: The distant mountain range.<br>&quot;&quot;&quot;<br>new_id = AssetId.ASCENSION<br><br>generate_image(source_ids, prompt, new_id)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4SjJNJRlIb8RqP7LZRwwlA.png" /></figure><blockquote><em>💡 The mountain close-up, inferred from the blurred background, is pretty impressive.</em></blockquote><p>Let’s climb to the summit:</p><pre>source_ids = [AssetId.ROBOT, AssetId.ASCENSION]<br>prompt = &quot;&quot;&quot;<br>- Image 1: Robot character sheet.<br>- Image 2: Previous scene.<br>- The robot reaches the top and stands on the summit, seen in the front view, in close-up.<br>- It is no longer holding the ice axes, which are planted upright in the snow on each side.<br>- It 
has both arms raised in sign of victory.<br>&quot;&quot;&quot;<br>new_id = AssetId.SUMMIT<br><br>generate_image(source_ids, prompt, new_id)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qpKSy1Ru1o3H35hINRpZyg.png" /></figure><blockquote><em>💡 This is a logical follow-up but also a nice, different view.</em></blockquote><p>Now, let’s try something different to significantly recompose the scene:</p><pre>source_ids = [AssetId.ROBOT, AssetId.SUMMIT]<br>prompt = &quot;&quot;&quot;<br>- Image 1: Robot character sheet.<br>- Image 2: Previous scene.<br>- Remove the ice axes.<br>- Move the center mountain to the left edge of the image and add a slightly taller medium-blue mountain to the right edge.<br>- Suspend a stylized felt bridge between the two mountains: Its deck is made of thick felt planks in various wood shades.<br>- Place the robot on the center of the bridge with one arm pointing toward the blue mountain.<br>- View: Close-up.<br>&quot;&quot;&quot;<br>new_id = AssetId.BRIDGE<br><br>generate_image(source_ids, prompt, new_id)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*RibnnPGMqUf4xaNxIxbFuw.png" /></figure><p>💡 Of interest:</p><ul><li>This imperative prompt composes the scene in terms of actions. It’s sometimes easier than descriptions.</li><li>A new mountain is added as instructed, and it is both different and consistent.</li><li>The bridge attaches to the summits in very plausible ways and seems to obey the laws of physics.</li><li>The “Remove the ice axes” instruction is here for a reason. Without it, it’s as if we were prompting “do whatever you can with the ice axes from the previous scene: leave them where they are, don’t let the robot leave without them, or anything else”, leading to random results.</li><li>It’s also possible to get the robot to walk on the bridge, seen from the side (which we never generated before), but it’s hard to have it consistently walk from left to right. 
Adding left and right views in the character sheet should fix this.</li></ul><p>Let’s generate a final scene and let the robot get some well-deserved rest:</p><pre>source_ids = [AssetId.ROBOT, AssetId.BRIDGE]<br>prompt = &quot;&quot;&quot;<br>- Image 1: Robot character sheet.<br>- Image 2: Previous scene.<br>- The robot is sleeping peacefully (both eyes changed into a &quot;closed&quot; state), in a comfortable brown-and-tan tartan hammock that has replaced the bridge.<br>&quot;&quot;&quot;<br>new_id = AssetId.HAMMOCK<br><br>generate_image(source_ids, prompt, new_id)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Zc1HBjnV_ELw8hVqh6ouUg.png" /></figure><p>💡 Of interest:</p><ul><li>This time, the prompt is descriptive, and it works as well as the previous imperative prompt.</li><li>The bridge-hammock transformation is really nice and preserves the attachments on the mountain summits.</li><li>The robot transformation is also impressive, as it hasn’t been seen in this position before.</li><li>The closed eyes are the most difficult detail to get consistently (may require a couple of attempts), probably because we’re accumulating many different transformations at once (and diluting the model’s attention). For full control and more deterministic results, we can focus on significant changes over iterative steps, or create various character sheets upfront.</li></ul><p>We have illustrated our story with 9 new consistent images! Let’s take a step back to understand what we’ve built…</p><h3>🗺️ Graph visualization</h3><p>We now have a collection of image assets, from archives to brand-new generated assets.</p><p>Let’s add some data visualization to get a better sense of the steps completed…</p><h3>🔗 Directed graph</h3><p>Our new assets are all related, connected by one or more “generated from” links. 
From a data structure point of view, this is a directed graph.</p><p>We can build the corresponding directed graph using the networkx library:</p><pre>import networkx as nx<br><br><br>def build_graph(assets: Assets) -&gt; nx.DiGraph:<br>    graph = nx.DiGraph(assets=assets)<br>    # Nodes<br>    for asset in assets.values():<br>        graph.add_node(asset.id, asset=asset)<br>    # Edges<br>    for asset in assets.values():<br>        for source_id in asset.source_ids:<br>            graph.add_edge(source_id, asset.id)<br>    return graph<br><br><br>asset_graph = build_graph(assets)<br>print(asset_graph)</pre><pre>DiGraph with 10 nodes and 16 edges</pre><p>Let’s place the most used asset in the center and display the other assets around:</p><pre>import matplotlib.pyplot as plt<br><br><br>def display_basic_graph(graph: nx.Graph) -&gt; None:<br>    pos = compute_node_positions(graph)<br>    color = &quot;#4285F4&quot;<br>    options = dict(<br>        node_color=color,<br>        edge_color=color,<br>        arrowstyle=&quot;wedge&quot;,<br>        with_labels=True,<br>        font_size=&quot;small&quot;,<br>        bbox=dict(ec=&quot;black&quot;, fc=&quot;white&quot;, alpha=0.7),<br>    )<br>    nx.draw(graph, pos, **options)<br>    plt.show()<br><br><br>def compute_node_positions(graph: nx.Graph) -&gt; dict[str, tuple[float, float]]:<br>    # Put the most connected node in the center<br>    center_node = most_connected_node(graph)<br>    edge_nodes = set(graph) - {center_node}<br>    pos = nx.circular_layout(graph.subgraph(edge_nodes))<br>    pos[center_node] = (0.0, 0.0)<br>    return pos<br><br><br>def most_connected_node(graph: nx.Graph) -&gt; str:<br>    if not graph.nodes():<br>        return &quot;&quot;<br>    centrality_by_id = nx.degree_centrality(graph)<br>    return max(centrality_by_id, key=lambda s: centrality_by_id.get(s, 0.0))</pre><pre>display_basic_graph(asset_graph)</pre><figure><img alt="" 
src="https://cdn-images-1.medium.com/max/660/1*rADnAJBG8Fd3lHZS-6hOyg.png" /></figure><p>That’s a correct summary of our different steps. It’d be nice if we could also visualize our assets…</p><h3>🌟 Asset graph</h3><p>Let’s add custom matplotlib functions to render the graph nodes with the assets in a more visually appealing way:</p><pre>import typing<br>from collections.abc import Iterator<br>from io import BytesIO<br>from pathlib import Path<br><br>import PIL.Image<br>import PIL.ImageDraw<br>from google.genai.types import PIL_Image<br>from matplotlib.axes import Axes<br>from matplotlib.backends.backend_agg import FigureCanvasAgg<br>from matplotlib.figure import Figure<br>from matplotlib.image import AxesImage<br>from matplotlib.patches import Patch<br>from matplotlib.text import Annotation<br>from matplotlib.transforms import Bbox, TransformedBbox<br><br><br>@enum.unique<br>class ImageFormat(enum.StrEnum):<br>    # Matches PIL.Image.Image.format<br>    WEBP = enum.auto()<br>    PNG = enum.auto()<br>    GIF = enum.auto()<br><br><br>def yield_generation_graph_frames(<br>    graph: nx.DiGraph,<br>    animated: bool,<br>) -&gt; Iterator[PIL_Image]:<br>    def get_fig_ax() -&gt; tuple[Figure, Axes]:<br>        factor = 1.0<br>        figsize = (16 * factor, 9 * factor)<br>        fig, ax = plt.subplots(figsize=figsize)<br>        fig.tight_layout(pad=3)<br>        handles = [<br>            Patch(color=COL_OLD, label=&quot;Archive&quot;),<br>            Patch(color=COL_NEW, label=&quot;Generated&quot;),<br>        ]<br>        ax.legend(handles=handles, loc=&quot;lower right&quot;)<br>        ax.set_axis_off()<br>        return fig, ax<br><br>    def prepare_graph() -&gt; None:<br>        arrows = nx.draw_networkx_edges(graph, pos, ax=ax)<br>        for arrow in arrows:<br>            arrow.set_visible(False)<br><br>    def get_box_size() -&gt; tuple[float, float]:<br>        xlim_l, xlim_r = ax.get_xlim()<br>        ylim_t, ylim_b = ax.get_ylim()<br>        factor = 
0.08<br>        box_w = (xlim_r - xlim_l) * factor<br>        box_h = (ylim_b - ylim_t) * factor<br>        return box_w, box_h<br><br>    def add_axes() -&gt; Axes:<br>        xf, yf = tr_figure(pos[node])<br>        xa, ya = tr_axes([xf, yf])<br>        x_y_w_h = (xa - box_w / 2.0, ya - box_h / 2.0, box_w, box_h)<br>        a = plt.axes(x_y_w_h)<br>        a.set_title(<br>            asset.id,<br>            loc=&quot;center&quot;,<br>            backgroundcolor=&quot;#FFF8&quot;,<br>            fontfamily=&quot;monospace&quot;,<br>            fontsize=&quot;small&quot;,<br>        )<br>        a.set_axis_off()<br>        return a<br><br>    def draw_box(color: str, image: bool) -&gt; AxesImage:<br>        if image:<br>            result = pil_image.copy()<br>        else:<br>            result = PIL.Image.new(&quot;RGB&quot;, image_size, color=&quot;white&quot;)<br>        xy = ((0, 0), image_size)<br>        # Draw box outline<br>        draw = PIL.ImageDraw.Draw(result)<br>        draw.rounded_rectangle(xy, box_r, outline=color, width=outline_w)<br>        # Make everything outside the box outline transparent<br>        mask = PIL.Image.new(&quot;L&quot;, image_size, 0)<br>        draw = PIL.ImageDraw.Draw(mask)<br>        draw.rounded_rectangle(xy, box_r, fill=0xFF)<br>        result.putalpha(mask)<br>        return a.imshow(result)<br><br>    def draw_prompt() -&gt; Annotation:<br>        text = f&quot;Prompt:\n{asset.prompt}&quot;<br>        margin = 2 * outline_w<br>        image_w, image_h = image_size<br>        bbox = Bbox([[0, margin], [image_w - margin, image_h - margin]])<br>        clip_box = TransformedBbox(bbox, a.transData)<br>        return a.annotate(<br>            text,<br>            xy=(0, 0),<br>            xytext=(0.06, 0.5),<br>            xycoords=&quot;axes fraction&quot;,<br>            textcoords=&quot;axes fraction&quot;,<br>            verticalalignment=&quot;center&quot;,<br>            fontfamily=&quot;monospace&quot;,<br>        
    fontsize=&quot;small&quot;,<br>            linespacing=1.3,<br>            annotation_clip=True,<br>            clip_box=clip_box,<br>        )<br><br>    def draw_edges() -&gt; None:<br>        STYLE_STRAIGHT = &quot;arc3&quot;<br>        STYLE_CURVED = &quot;arc3,rad=0.15&quot;<br>        for parent in graph.predecessors(node):<br>            edge = (parent, node)<br>            color = COL_NEW if assets[parent].prompt else COL_OLD<br>            style = STYLE_STRAIGHT if center_node in edge else STYLE_CURVED<br>            nx.draw_networkx_edges(<br>                graph,<br>                pos,<br>                [edge],<br>                width=2,<br>                edge_color=color,<br>                style=&quot;dotted&quot;,<br>                ax=ax,<br>                connectionstyle=style,<br>            )<br><br>    def get_frame() -&gt; PIL_Image:<br>        canvas = typing.cast(FigureCanvasAgg, fig.canvas)<br>        canvas.draw()<br>        image_size = canvas.get_width_height()<br>        image_bytes = canvas.buffer_rgba()<br>        return PIL.Image.frombytes(&quot;RGBA&quot;, image_size, image_bytes).convert(&quot;RGB&quot;)<br><br>    COL_OLD = &quot;#34A853&quot;<br>    COL_NEW = &quot;#4285F4&quot;<br>    assets = graph.graph[&quot;assets&quot;]<br>    center_node = most_connected_node(graph)<br>    pos = compute_node_positions(graph)<br>    fig, ax = get_fig_ax()<br>    prepare_graph()<br>    box_w, box_h = get_box_size()<br>    tr_figure = ax.transData.transform  # Data → display coords<br>    tr_axes = fig.transFigure.inverted().transform  # Display → figure coords<br><br>    for node, data in graph.nodes(data=True):<br>        if animated:<br>            yield get_frame()<br>        # Edges and sub-plot<br>        asset = data[&quot;asset&quot;]<br>        pil_image = asset.pil_image<br>        image_size = pil_image.size<br>        box_r = min(image_size) * 25 / 100  # Radius for rounded rect<br>        outline_w = min(image_size) * 5 
// 100<br>        draw_edges()<br>        a = add_axes()  # a is used in sub-functions<br>        # Prompt<br>        if animated and asset.prompt:<br>            box = draw_box(COL_NEW, image=False)<br>            prompt = draw_prompt()<br>            yield get_frame()<br>            box.set_visible(False)<br>            prompt.set_visible(False)<br>        # Generated image<br>        color = COL_NEW if asset.prompt else COL_OLD<br>        draw_box(color, image=True)<br><br>    plt.close()<br>    yield get_frame()<br><br><br>def draw_generation_graph(<br>    graph: nx.DiGraph,<br>    format: ImageFormat,<br>) -&gt; BytesIO:<br>    frames = list(yield_generation_graph_frames(graph, animated=False))<br>    assert len(frames) == 1<br>    frame = frames[0]<br><br>    params: dict[str, typing.Any] = dict()<br>    match format:<br>        case ImageFormat.WEBP:<br>            params.update(lossless=True)<br><br>    image_io = BytesIO()<br>    frame.save(image_io, format, **params)<br><br>    return image_io<br><br><br>def draw_generation_graph_animation(<br>    graph: nx.DiGraph,<br>    format: ImageFormat,<br>) -&gt; BytesIO:<br>    frames = list(yield_generation_graph_frames(graph, animated=True))<br>    assert 1 &lt;= len(frames)<br><br>    if format == ImageFormat.GIF:<br>        # Dither all frames with the same palette to optimize the animation<br>        # The animation is cumulative, so most colors are in the last frame<br>        method = PIL.Image.Quantize.MEDIANCUT<br>        palettized = frames[-1].quantize(method=method)<br>        frames = [frame.quantize(method=method, palette=palettized) for frame in frames]<br><br>    # The animation will be played in a loop: start cycling with the most complete frame<br>    first_frame = frames[-1]<br>    next_frames = frames[:-1]<br>    INTRO_DURATION = 3000<br>    FRAME_DURATION = 1000<br>    durations = [INTRO_DURATION] + [FRAME_DURATION] * len(next_frames)<br>    params: dict[str, typing.Any] = dict(<br>        
save_all=True,<br>        append_images=next_frames,<br>        duration=durations,<br>        loop=0,<br>    )<br>    match format:<br>        case ImageFormat.GIF:<br>            params.update(optimize=False)<br>        case ImageFormat.WEBP:<br>            params.update(lossless=True)<br><br>    image_io = BytesIO()<br>    first_frame.save(image_io, format, **params)<br><br>    return image_io<br><br><br>def display_generation_graph(<br>    graph: nx.DiGraph,<br>    format: ImageFormat | None = None,<br>    animated: bool = False,<br>    save_image: bool = False,<br>) -&gt; None:<br>    if format is None:<br>        format = ImageFormat.WEBP if running_in_colab_env else ImageFormat.PNG<br>    if animated:<br>        image_io = draw_generation_graph_animation(graph, format)<br>    else:<br>        image_io = draw_generation_graph(graph, format)<br><br>    image_bytes = image_io.getvalue()<br>    IPython.display.display(IPython.display.Image(image_bytes))<br><br>    if save_image:<br>        stem = &quot;graph_animated&quot; if animated else &quot;graph&quot;<br>        Path(f&quot;./{stem}.{format.value}&quot;).write_bytes(image_bytes)</pre><p>We can now display our generation graph:</p><pre>display_generation_graph(asset_graph)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*W5EyOHCCEuTjOAOkfx5M2g.png" /></figure><h3>🚀 Challenge completed</h3><p>We managed to generate a full set of new consistent images with Nano Banana and learned a few things along the way:</p><ul><li><strong>Images prove again that they are worth a thousand words</strong>: It’s now a lot easier to generate new images from existing ones and simple instructions.</li><li>We can <strong>create or edit images just in terms of composition</strong> (letting us all become artistic directors).</li><li>We can use <strong>descriptive or imperative instructions</strong>.</li><li>The model’s <strong>spatial understanding allows 3D manipulations</strong>.</li><li>We can <strong>add 
text in our outputs</strong> (character sheet) and <strong>also refer to text in our inputs</strong> (front/back views).</li><li>Consistency can be <strong>preserved at very different levels</strong>: character, scene, texture, lighting, camera angle/type…</li><li>The generation process can still be iterative, but it <strong>feels 10x-100x faster</strong> for reaching <strong>better-than-hoped-for results</strong>.</li><li>It’s now possible to <strong>breathe new life into our archives</strong>!</li></ul><p>Possible next steps:</p><ul><li>The process we followed is essentially a generation pipeline. It can be industrialized for automation (e.g., changing a node regenerates its descendants) or for the generation of different variations in parallel (e.g., the same set of images could be generated for different aesthetics, audiences, or simulations).</li><li>For the sake of simplicity and exploration, the prompts are intentionally simple. In a production environment, they could have a fixed structure with a systematic set of parameters.</li><li>We described scenes as if in a photo studio. Virtually any other imaginable artistic style is possible (photorealistic, abstract, 2D…).</li><li>Our assets could be made self-sufficient by saving prompts and ancestors in the image metadata (e.g., in PNG chunks), allowing for full local storage and retrieval (no database needed and no more lost prompts!). 
For details, see the “asset metadata” section in the notebook (link below).</li></ul><p>As a bonus, let’s end with an <strong>animated version of our journey</strong>, with the generation graph also showing a glimpse of our instructions:</p><pre>display_generation_graph(asset_graph, animated=True)</pre><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pBSET5blJVMwCuxk6cIi_A.gif" /></figure><h3>➕ More!</h3><p><strong>Want to go deeper?</strong></p><ul><li><strong>Try it yourself:</strong> Use the <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/media-generation/consistent_imagery_generation.ipynb">article’s accompanying notebook</a> to reproduce the results or generate your own images. If you “run all” the notebook cells, this will generate 10 images for a total cost of $0.39.</li><li><strong>Get inspired:</strong> Check out the <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/nano-banana/nano_banana_recipes.ipynb">Nano Banana recipes notebook</a> for more practical examples.</li><li><strong>Explore</strong>: See additional use cases in the <a href="https://console.cloud.google.com/vertex-ai/studio/prompt-gallery?utm_source=blog&amp;utm_medium=external&amp;utm_campaign=CDR_0x8c87a0bc_default_b447103558">Vertex AI Prompt Gallery</a>.</li><li><strong>Stay updated</strong>: Follow the <a href="https://cloud.google.com/vertex-ai/generative-ai/docs/release-notes?utm_source=blog&amp;utm_medium=external&amp;utm_campaign=CDR_0x8c87a0bc_default_b447103558">Vertex AI Release Notes</a>.</li><li><strong>Follow me:</strong> Connect with me (@PicardParis) on <a href="https://www.linkedin.com/in/picardparis">LinkedIn</a> or <a href="https://x.com/PicardParis">Twitter-X</a> for more cloud, applied AI, and Python explorations…</li></ul><p>Thanks for reading. 
I look forward to seeing what you create!</p><hr><p><a href="https://medium.com/google-cloud/generating-consistent-imagery-with-gemini-nano-banana-6e807b4d1f77">Generating Consistent Imagery with Gemini Nano Banana 🍌</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Unlocking Multimodal Video Transcription with Gemini — Part 8: 🏁 Conclusion]]></title>
            <link>https://medium.com/google-cloud/unlocking-multimodal-video-transcription-with-gemini-part8-eee478ba7eb0?source=rss-6be63961431c------2</link>
            <guid isPermaLink="false">https://medium.com/p/eee478ba7eb0</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[video-transcription]]></category>
            <category><![CDATA[generative-ai]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[gemini]]></category>
            <dc:creator><![CDATA[Laurent Picard]]></dc:creator>
            <pubDate>Sat, 06 Sep 2025 17:28:35 GMT</pubDate>
            <atom:updated>2025-09-30T12:29:39.170Z</atom:updated>
            <content:encoded><![CDATA[<h3>Unlocking Multimodal Video Transcription with Gemini — Part 8: 🏁 Conclusion</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*MeP1-uBqyxwzGtfd444RAw.png" /><figcaption>Generated with Gemini 2.5 Flash Image Preview (aka Nano Banana)</figcaption></figure><h4>Unlocking Multimodal Video Transcription with Gemini</h4><ol><li>🔥 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part1-02dc32118f41">Challenge</a></li><li>🛠️ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part2-43c491a0c4f1">Setup</a></li><li>🧪 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part3-404c6c4b986c">Prototyping</a></li><li>🏗️ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part4-3381b61aaaec">Prompt Crafting</a></li><li>🚀 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part5-488b357b53b1">Finalization</a></li><li>✅ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part6-b1fc52729e4f">Challenge Complete</a></li><li>⚖️ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part7-74ee997d2096">Analysis, Tips &amp; Optimizations</a></li><li><strong>🏁 Conclusion ◀️</strong></li></ol><h3>🏁 Conclusion</h3><p>Multimodal video transcription, which requires the complex synthesis of audio and visual data, is a true challenge for ML practitioners, without mainstream solutions. A traditional approach, involving an elaborate pipeline of specialized models, would be engineering-intensive without any guarantee of success. 
In contrast, Gemini proved to be a versatile toolbox for reaching a powerful and straightforward solution based on a single prompt:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*pZe3i52g8et3Bu98.gif" /></figure><p>We managed to address this complex problem with the following techniques:</p><ul><li>Prototyping with open prompts to develop intuition about Gemini’s natural strengths</li><li>Taking into account how LLMs work under the hood</li><li>Crafting increasingly specific prompts using a tabular extraction strategy</li><li>Generating structured outputs to move towards production-ready code</li><li>Adding data visualization for easier interpretation of responses and smoother iterations</li><li>Adapting default parameters to optimize the results</li><li>Conducting more tests, iterating, and even enriching the extracted data</li></ul><p>These principles should apply to many other data extraction domains and allow you to solve your own complex problems. Have fun and happy solving!</p><h3>➕ More!</h3><ul><li>Run <a href="https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/use-cases/video-analysis/multimodal_video_transcription.ipynb">this notebook</a> to reproduce the results from this article and transcribe your own videos</li><li>Experiment for free in <a href="https://aistudio.google.com?utm_campaign=CDR_0x8c87a0bc_default_b436460646&amp;utm_medium=external&amp;utm_source=blog">Google AI Studio</a> and get an API key to call Gemini programmatically</li><li>Explore additional use cases in the <a href="https://console.cloud.google.com/vertex-ai/studio/prompt-gallery?utm_campaign=CDR_0x8c87a0bc_default_b436460646&amp;utm_medium=external&amp;utm_source=blog">Vertex AI Prompt Gallery</a></li><li>Stay updated by following the <a href="https://cloud.google.com/vertex-ai/generative-ai/docs/release-notes?utm_campaign=CDR_0x8c87a0bc_default_b436460646&amp;utm_medium=external&amp;utm_source=blog">Vertex AI Release 
Notes</a></li><li>Follow me on <a href="https://www.linkedin.com/in/picardparis">LinkedIn</a> or <a href="https://x.com/PicardParis">Twitter / X</a> for more cloud, applied AI, and Python explorations…</li><li><strong><em>UPDATE</em></strong><em>: It was so much fun to generate the intro images for this eight-part series that I had to share it with you. See my next article: </em><a href="https://medium.com/google-cloud/generating-consistent-imagery-with-gemini-nano-banana-6e807b4d1f77"><em>Generating Consistent Imagery with Gemini Nano Banana 🍌</em></a></li></ul><hr><p><a href="https://medium.com/google-cloud/unlocking-multimodal-video-transcription-with-gemini-part8-eee478ba7eb0">Unlocking Multimodal Video Transcription with Gemini — Part 8: 🏁 Conclusion</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Unlocking Multimodal Video Transcription with Gemini — Part 7: ⚖️ Analysis, Tips & Optimizations]]></title>
            <link>https://medium.com/google-cloud/unlocking-multimodal-video-transcription-with-gemini-part7-74ee997d2096?source=rss-6be63961431c------2</link>
            <guid isPermaLink="false">https://medium.com/p/74ee997d2096</guid>
            <category><![CDATA[gemini]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[video-transcription]]></category>
            <category><![CDATA[generative-ai]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Laurent Picard]]></dc:creator>
            <pubDate>Fri, 05 Sep 2025 12:01:31 GMT</pubDate>
            <atom:updated>2025-11-18T11:02:50.246Z</atom:updated>
            <content:encoded><![CDATA[<h3>Unlocking Multimodal Video Transcription with Gemini — Part 7: ⚖️ Analysis, Tips &amp; Optimizations</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*z8u0Z_nkzoIc16BGlFtnBg.png" /><figcaption>Generated with Gemini 2.5 Flash Image Preview (aka Nano Banana)</figcaption></figure><h4>Unlocking Multimodal Video Transcription with Gemini</h4><ol><li>🔥 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part1-02dc32118f41">Challenge</a></li><li>🛠️ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part2-43c491a0c4f1">Setup</a></li><li>🧪 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part3-404c6c4b986c">Prototyping</a></li><li>🏗️ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part4-3381b61aaaec">Prompt Crafting</a></li><li>🚀 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part5-488b357b53b1">Finalization</a></li><li>✅ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part6-b1fc52729e4f">Challenge Complete</a></li><li><strong>⚖️ Analysis, Tips &amp; Optimizations ◀️</strong></li><li>🏁 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part8-eee478ba7eb0">Conclusion</a></li></ol><h3>⚖️ Strengths &amp; weaknesses</h3><h4>👍 Strengths</h4><p>Overall, Gemini is capable of generating excellent transcriptions that surpass human-generated ones in these aspects:</p><ul><li>Consistency of the transcription</li><li>Impressive semantic understanding</li><li>Highly accurate grammar and punctuation</li><li>No typos or transcription system mistakes</li><li>Exhaustivity (every audible word is transcribed)</li></ul><blockquote><em>💡 As you know, a single incorrect/missing word (or even letter) can completely change the 
meaning. These strengths help ensure high-quality transcriptions and reduce the risk of misunderstandings.</em></blockquote><p>If we compare YouTube’s user-provided transcriptions (sometimes produced by professional caption vendors) to our auto-generated ones, we can observe some significant differences. Here are some examples from the last test:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/762/1*pAgJYGFVW1pENpQ1V5xE1A.png" /></figure><h4>👎 Weaknesses</h4><p>The current prompt isn’t perfect, though. It focuses first on the audio for transcription and then on all cues for speaker data extraction. Although Gemini natively consolidates information from the context very well, the prompt can lead to these side effects:</p><ul><li>Sensitivity to speakers’ pronunciation or accent</li><li>Misspellings of proper nouns</li><li>Inconsistencies between the transcription and a perfectly identified speaker’s name</li></ul><p>Here are examples from the same test:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/435/1*qjLmIjICQj7IIw1MyunrKA.png" /></figure><p>We’ll stop our exploration here and leave it as an exercise, but here are possible ways to fix these errors, ordered from the simplest and cheapest to the most involved:</p><ul><li>Update the prompt to use visual cues for proper nouns, such as <em>“Ensure all proper nouns (people, companies, products, etc.) are spelled correctly and consistently. 
Prioritize on-screen text for reference.”</em></li><li>Enrich the prompt with an additional preliminary table to extract the proper nouns and use them explicitly in the context</li><li>Add available video context metadata in the prompt</li><li>Split the prompt into two successive requests</li></ul><h3>📈 Tips &amp; optimizations</h3><h4>🔧 Model selection</h4><p>Each model can differ in terms of performance, speed, and cost.</p><p>Here’s a practical summary based on the model specifications, our video test suite, and the current prompt:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fidDEJzMTH5hQrxUMUY4tg.png" /></figure><h4>🔧 Video segment</h4><p>You don’t always need to analyze videos from start to finish. You can indicate a video segment with start and/or end offsets in the <a href="https://cloud.google.com/vertex-ai/generative-ai/docs/reference/rpc/google.cloud.aiplatform.v1?utm_campaign=CDR_0x8c87a0bc_default_b436460646&amp;utm_medium=external&amp;utm_source=blog#videometadata">VideoMetadata</a> structure.</p><p>In this example, Gemini will only analyze the 30:00–50:00 segment of the video:</p><pre>video_metadata = VideoMetadata(<br>    start_offset=&quot;1800.0s&quot;,<br>    end_offset=&quot;3000.0s&quot;,<br>    …<br>)</pre><h4>🔧 Media resolution</h4><p>In our test suite, the videos are fairly standard. 
We got excellent results by using a “low” media resolution (“medium” being the default), specified with the GenerateContentConfig.media_resolution parameter.</p><blockquote><em>💡 This provides faster and cheaper inferences, while also enabling the analysis of videos that are three times as long.</em></blockquote><p>We used a simple heuristic based on video duration, but you might want to make it dynamic on a per-video basis:</p><pre>def get_media_resolution_for_video(video: Video) -&gt; MediaResolution | None:<br>    if not (video_duration := get_video_duration(video)):<br>        return None  # Default<br><br>    # For testing purposes, this is based on video duration, as our short videos tend to be more detailed<br>    less_than_five_minutes = video_duration &lt; timedelta(minutes=5)<br>    if less_than_five_minutes:<br>        media_resolution = MediaResolution.MEDIA_RESOLUTION_MEDIUM<br>    else:<br>        media_resolution = MediaResolution.MEDIA_RESOLUTION_LOW<br><br>    return media_resolution</pre><blockquote><em>⚠️ If you select a “low” media resolution and experience an apparent loss of understanding, you might be losing important details in the sampled video frames. This is easy to fix: switch back to the default media resolution.</em></blockquote><h4>🔧 Sampling frame rate</h4><p>The default sampling frame rate of 1 FPS worked fine in our tests. You might want to customize it for each video:</p><pre>SamplingFrameRate = float<br><br>def get_sampling_frame_rate_for_video(video: Video) -&gt; SamplingFrameRate | None:<br>    sampling_frame_rate = None  # Default (1 FPS for current models)<br><br>    # [Optional] Define a custom FPS: 0.0 &lt; sampling_frame_rate &lt;= 24.0<br><br>    return sampling_frame_rate</pre><blockquote><em>💡 You can mix the parameters. 
In this extreme example, assuming the input video has a 24fps frame rate, all frames will be sampled for a 10s segment:</em></blockquote><pre>video_metadata = VideoMetadata(<br>    start_offset=&quot;42.0s&quot;,<br>    end_offset=&quot;52.0s&quot;,<br>    fps=24.0,<br>)</pre><blockquote><em>⚠️ If you use a higher sampling rate, this multiplies the number of frames (and tokens) accordingly, increasing latency and cost. As </em><em>10s × 24fps = 240 frames = 4×60s × 1fps, this 10-second analysis at 24 FPS is equivalent to a 4-minute default analysis at 1 FPS.</em></blockquote><h4>🎯 Precision vs recall</h4><p>The prompt can influence the precision and recall of our data extractions, especially when using explicit versus implicit wording. If you want more qualitative results, favor precision using explicit wording; if you want more quantitative results, favor recall using implicit wording:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/933/1*XYSj2wO6frv-J_rRQ2Bj8w.png" /></figure><p>Here are examples that can lead to subtly different results:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/581/1*Qx2_RFs1tBn6vCrWEuVW4Q.png" /></figure><blockquote><em>💡 Different models can also behave differently for the same prompt. In particular, more performant models might seem more “confident” and make more implicit inferences or consolidations.</em></blockquote><blockquote><em>💡 As an example, in this </em><a href="https://youtu.be/gg7WjuFs8F4?t=297"><em>AlphaFold video</em></a><em>, at the 04:57 timecode, “Spring 2020” is first displayed as context. Then, a short declaration from “The Prime Minister” is heard in the background (“You must stay at home”) without any other hints. When asked to “identify” (rather than “extract”) the speaker, Gemini is likely to infer more and attribute the voice to “Boris Johnson”. 
There’s absolutely no explicit mention of Boris Johnson; his identity is correctly inferred from the context (“UK”, “Spring 2020”, and “The Prime Minister”).</em></blockquote><h4>🏷️ Metadata</h4><p>In our current tests, Gemini only uses audio and frame tokens, tokenized from sources on Google Cloud Storage or YouTube. If you have additional video metadata, this can be a goldmine; try to add it to your prompt and enrich the video context for better results upfront.</p><p>Potentially helpful metadata:</p><ul><li>Video description: This can provide a better understanding of where and when the video was shot.</li><li>Speaker info: This can help auto-correct names that are only heard and not obvious to spell.</li><li>Entity info: Overall, this can help get better transcriptions for custom or private data.</li></ul><blockquote><em>💡 For YouTube videos, no additional metadata or transcript is fetched. Gemini only receives the raw audio and video streams. You can check this yourself by comparing your results with YouTube’s automatic captioning (no punctuation, audio only) or user-provided transcripts (cleaned up), when available.</em></blockquote><blockquote><em>💡 If you know your video concerns a team or a company, adding internal data in the context can help correct or complete the requested speaker names (provided there are no homonyms in the same context), companies, and job titles.</em></blockquote><blockquote><em>💡 In this </em><a href="https://youtu.be/U_yYkb-ureI?t=376"><em>French reportage</em></a><em>, in the 06:16–06:31 segment, there are two dogs: Arnold and Rio. “Arnold” is clearly audible, repeated three times, and correctly transcribed. “Rio” is called only once, audible for a fraction of a second in a noisy environment, and the audio transcription can vary. 
Providing the names of the whole team (owners &amp; dogs, even if they are not all in the video) can help in transcribing this short name consistently.</em></blockquote><blockquote><em>💡 It should also be possible to ground the results with Google Search, Google Maps, or your own RAG system. See </em><a href="https://cloud.google.com/vertex-ai/generative-ai/docs/grounding/overview?utm_campaign=CDR_0x8c87a0bc_default_b436460646&amp;utm_medium=external&amp;utm_source=blog"><em>Grounding overview</em></a><em>.</em></blockquote><h4>🔬 Debugging &amp; evidence</h4><p>Iterating through successive prompts and debugging LLM outputs can be challenging, especially when trying to understand the reasons for the results.</p><p>It’s possible to ask Gemini to provide evidence in the response. In our video transcription solution, we could request a timecoded “evidence” for each speaker’s identified name, company, or role. This enables linking results to their sources, discovering and understanding unexpected insights, checking potential false positives…</p><blockquote><em>💡 In the tested videos, when trying to understand where the insights came from, requesting evidence yielded very insightful explanations, for example:</em></blockquote><blockquote><em>- Person names could be extracted from various sources (video conference captions, badges, unseen participants introducing themselves when asking questions during a conference panel…)</em></blockquote><blockquote><em>- Company names could be found from text on uniforms, backpacks, vehicles…</em></blockquote><blockquote><em>💡 In a document data extraction solution, we could request to provide an “excerpt” as evidence, including page number, chapter number, or any other relevant location information.</em></blockquote><h4>🐘 Verbose JSON</h4><p>The JSON format is currently the most common way to generate structured outputs with LLMs. However, JSON is a rather verbose data format, as field names are repeated for each object. 
For example, an output can look like the following, with many repeated underlying tokens:</p><pre>{<br>  &quot;task1_transcripts&quot;: [<br>    { &quot;start&quot;: &quot;00:02&quot;, &quot;text&quot;: &quot;We&#39;ve…&quot;, &quot;voice&quot;: 1 },<br>    { &quot;start&quot;: &quot;00:07&quot;, &quot;text&quot;: &quot;But we…&quot;, &quot;voice&quot;: 1 }<br>    // …<br>  ],<br>  &quot;task2_speakers&quot;: [<br>    {<br>      &quot;voice&quot;: 1,<br>      &quot;name&quot;: &quot;John Moult&quot;,<br>      &quot;company&quot;: &quot;University of Maryland&quot;,<br>      &quot;position&quot;: &quot;Co-Founder, CASP&quot;,<br>      &quot;role_in_video&quot;: &quot;Expert&quot;<br>    },<br>    // …<br>    {<br>      &quot;voice&quot;: 3,<br>      &quot;name&quot;: &quot;Demis Hassabis&quot;,<br>      &quot;company&quot;: &quot;DeepMind&quot;,<br>      &quot;position&quot;: &quot;Founder and CEO&quot;,<br>      &quot;role_in_video&quot;: &quot;Team Leader&quot;<br>    }<br>    // …<br>  ]<br>}</pre><p>To optimize output size, an interesting possibility is to ask Gemini to generate an XML block containing a CSV for each of your tabular extractions. The field names are specified once in the header, and by using tab separators, for example, we can achieve more compact outputs like the following:</p><pre>&lt;TASK1_TRANSCRIPT_CSV&gt;<br>start  text     voice<br>00:02  We&#39;ve…   1<br>00:07  But we…  1<br>…<br>&lt;/TASK1_TRANSCRIPT_CSV&gt;<br>&lt;TASK2_SPEAKER_CSV&gt;<br>voice  name            company                 position          role_in_video<br>1      John Moult      University of Maryland  Co-Founder, CASP  Expert<br>…<br>3      Demis Hassabis  DeepMind                Founder and CEO   Team Leader<br>…<br>&lt;/TASK2_SPEAKER_CSV&gt;</pre><blockquote><em>💡 Gemini excels at patterns and formats. Depending on your needs, feel free to experiment with JSON, XML, CSV, YAML, and any custom structured formats. 
It’s likely that the industry will evolve to allow even more elaborate structured outputs.</em></blockquote><h4>🐿️ Context caching</h4><p>Context caching optimizes the cost and the latency of repeated requests using the same base inputs.</p><p>There are two ways requests can benefit from context caching:</p><ul><li><strong>Implicit caching</strong>: By default, upon the first request, input tokens are cached to accelerate responses for subsequent requests with the same base inputs. This is fully automated and no code change is required.</li><li><strong>Explicit caching</strong>: You place specific inputs into the cache and reuse this cached content as a base for your requests. This provides full control but requires managing the cache manually.</li></ul><p>Example of implicit caching:</p><pre>from google.genai.types import FileData, Part<br><br># Assumes `client` is an initialized google-genai client (see the Setup part)<br>model_id = &quot;gemini-2.0-flash&quot;<br>video_file_data = FileData(<br>    file_uri=&quot;gs://bucket/path/to/my-video.mp4&quot;,<br>    mime_type=&quot;video/mp4&quot;,<br>)<br>video = Part(file_data=video_file_data)<br>prompt_1 = &quot;List the people visible in the video.&quot;<br>prompt_2 = &quot;Summarize what happens to John Smith.&quot;<br><br># ✅ Request A1: static data (video) placed first<br>response = client.models.generate_content(<br>    model=model_id,<br>    contents=[video, prompt_1],<br>)<br><br># ✅ Request A2: likely cache hit for the video tokens<br>response = client.models.generate_content(<br>    model=model_id,<br>    contents=[video, prompt_2],<br>)</pre><blockquote><em>💡 Implicit caching can be disabled at the project level (see </em><a href="https://cloud.google.com/vertex-ai/generative-ai/docs/data-governance?utm_campaign=CDR_0x8c87a0bc_default_b436460646&amp;utm_medium=external&amp;utm_source=blog#customer_data_retention_and_achieving_zero_data_retention"><em>data governance</em></a><em>).</em></blockquote><p>Implicit caching is prefix-based, meaning it only works if you put static data first and variable data last.</p><p>Example of requests 
preventing implicit caching:</p><pre># ❌ Request B1: variable input placed first<br>response = client.models.generate_content(<br>    model=model_id,<br>    contents=[prompt_1, video],<br>)<br><br># ❌ Request B2: no cache hit<br>response = client.models.generate_content(<br>    model=model_id,<br>    contents=[prompt_2, video],<br>)</pre><blockquote><em>💡 This explains why the data-plus-instructions input order is preferred, for performance (not LLM-related) reasons.</em></blockquote><p>Cost-wise, the input tokens retrieved with a cache hit benefit from a 90% discount in the following cases:</p><ul><li><strong>Implicit caching</strong>: With all Gemini models, cache hits are automatically discounted (without any control on the cache or cache-hit guarantee).</li><li><strong>Explicit caching</strong>: With all Gemini models and supported models in Model Garden, you control your cached inputs and their lifespans to ensure cache hits.</li></ul><p>Example of explicit caching:</p><pre>from google.genai.types import (<br>    Content,<br>    CreateCachedContentConfig,<br>    FileData,<br>    GenerateContentConfig,<br>    Part,<br>)<br><br>model_id = &quot;gemini-2.0-flash-001&quot;<br><br># Input video<br>video_file_data = FileData(<br>    file_uri=&quot;gs://cloud-samples-data/video/JaneGoodall.mp4&quot;,<br>    mime_type=&quot;video/mp4&quot;,<br>)<br>video_part = Part(file_data=video_file_data)<br>video_contents = [Content(role=&quot;user&quot;, parts=[video_part])]<br><br># Video explicitly put in cache, with time-to-live (TTL) before automatic deletion<br>cached_content = client.caches.create(<br>    model=model_id,<br>    config=CreateCachedContentConfig(<br>        ttl=&quot;1800s&quot;,<br>        display_name=&quot;video-cache&quot;,<br>        contents=video_contents,<br>    ),<br>)<br>if cached_content.usage_metadata:<br>    print(f&quot;Cached tokens: {cached_content.usage_metadata.total_token_count or 0:,}&quot;)<br>    # Cached tokens: 46,171<br>    # ✅ Video 
tokens are cached (standard tokenization rate + storage cost for TTL duration)<br><br>cache_config = GenerateContentConfig(cached_content=cached_content.name)<br><br># Request #1<br>response = client.models.generate_content(<br>    model=model_id,<br>    contents=&quot;List the people mentioned in the video.&quot;,<br>    config=cache_config,<br>)<br>if response.usage_metadata:<br>    print(f&quot;Input tokens : {response.usage_metadata.prompt_token_count or 0:,}&quot;)<br>    print(f&quot;Cached tokens: {response.usage_metadata.cached_content_token_count or 0:,}&quot;)<br>    # Input tokens : 46,178<br>    # Cached tokens: 46,171<br>    # ✅ Cache hit (90% discount)<br><br># Request #i (within the TTL period)<br># …<br><br># Request #n (within the TTL period)<br>response = client.models.generate_content(<br>    model=model_id,<br>    contents=&quot;List all the timecodes when Jane Goodall is mentioned.&quot;,<br>    config=cache_config,<br>)<br>if response.usage_metadata:<br>    print(f&quot;Input tokens : {response.usage_metadata.prompt_token_count or 0:,}&quot;)<br>    print(f&quot;Cached tokens: {response.usage_metadata.cached_content_token_count or 0:,}&quot;)<br>    # Input tokens : 46,182<br>    # Cached tokens: 46,171<br>    # ✅ Cache hit (90% discount)</pre><blockquote><em>💡 Explicit caching needs a specific model version (like </em><em>...-001 in this example) to ensure the cache remains valid and is not affected by a model update.</em></blockquote><blockquote><em>ℹ️ Learn more about </em><a href="https://cloud.google.com/vertex-ai/generative-ai/docs/context-cache/context-cache-overview?utm_campaign=CDR_0x8c87a0bc_default_b436460646&amp;utm_medium=external&amp;utm_source=blog"><em>Context caching</em></a><em>.</em></blockquote><h4>⏳ Batch prediction</h4><p>If you need to process a large volume of videos and don’t need synchronous responses, you can use a single batch request and reduce your cost.</p><blockquote><em>💡 Batch requests for Gemini models get a 
50% discount compared to standard requests.</em></blockquote><blockquote><em>ℹ️ Learn more about </em><a href="https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/batch-prediction-gemini?utm_campaign=CDR_0x8c87a0bc_default_b436460646&amp;utm_medium=external&amp;utm_source=blog#generative-ai-batch-text-python_genai_sdk"><em>Batch prediction</em></a><em>.</em></blockquote><h4>♾️ To production… and beyond</h4><p>A few additional notes:</p><ul><li>The current prompt is not perfect and can be improved. It has been preserved in its current state to illustrate its development starting with Gemini 2.0 Flash and a simple video test suite.</li><li>The Gemini 2.5 models are more capable and intrinsically provide a better video understanding. However, the current prompt has not been optimized for them. Writing optimal prompts for different models is another challenge.</li><li>If you test transcribing your own videos, especially different types of videos, you may run into new or specific issues. They can probably be addressed by enriching the prompt.</li><li>Future models will likely support more output features. This should allow for richer structured outputs and simpler prompts.</li><li>As models keep learning, it’s also possible that multimodal video transcription will become a one-liner prompt.</li><li>Gemini’s image and audio tokenizers are truly impressive and enable many other use cases. To fully grasp the extent of the possibilities, you can run unit tests on images or audio files.</li><li>We constrained our challenge to using a single request, which optimizes the solution both for speed and cost.</li><li>For applications demanding the absolute highest transcription accuracy, we could isolate the audio-only transcription in a first request before performing speaker identification on the video frames in a second request. It might produce many more voice identifiers than actual speakers, but it should minimize false positives. 
In the second step, we’d reinject the transcription to focus on extracting and consolidating speaker data from the video frames. This two-step approach would also be a viable strategy to process very long videos, even those several hours in duration.</li></ul><p>▶️ Next up: 🏁 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part8-eee478ba7eb0">Conclusion</a></p><hr><p><a href="https://medium.com/google-cloud/unlocking-multimodal-video-transcription-with-gemini-part7-74ee997d2096">Unlocking Multimodal Video Transcription with Gemini — Part 7: ⚖️ Analysis, Tips &amp; Optimizations</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Unlocking Multimodal Video Transcription with Gemini — Part 6: ✅ Challenge Complete]]></title>
            <link>https://medium.com/google-cloud/unlocking-multimodal-video-transcription-with-gemini-part6-b1fc52729e4f?source=rss-6be63961431c------2</link>
            <guid isPermaLink="false">https://medium.com/p/b1fc52729e4f</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[video-transcription]]></category>
            <category><![CDATA[generative-ai]]></category>
            <category><![CDATA[gemini]]></category>
            <dc:creator><![CDATA[Laurent Picard]]></dc:creator>
            <pubDate>Fri, 05 Sep 2025 11:50:33 GMT</pubDate>
            <atom:updated>2025-11-18T10:57:11.869Z</atom:updated>
            <content:encoded><![CDATA[<h3>Unlocking Multimodal Video Transcription with Gemini — Part 6: ✅ Challenge Complete</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*MFSNNle0cipz1udDW1FBrQ.png" /><figcaption>Generated with Gemini 2.5 Flash Image Preview (aka Nano Banana)</figcaption></figure><h4>Unlocking Multimodal Video Transcription with Gemini</h4><ol><li>🔥 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part1-02dc32118f41">Challenge</a></li><li>🛠️ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part2-43c491a0c4f1">Setup</a></li><li>🧪 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part3-404c6c4b986c">Prototyping</a></li><li>🏗️ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part4-3381b61aaaec">Prompt Crafting</a></li><li>🚀 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part5-488b357b53b1">Finalization</a></li><li><strong>✅ Challenge Complete ◀️</strong></li><li>⚖️ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part7-74ee997d2096">Analysis, Tips &amp; Optimizations</a></li><li>🏁 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part8-eee478ba7eb0">Conclusion</a></li></ol><h3>✅ Challenge completed</h3><h4>🎬 Short video</h4><p>This video is a trailer for the Google DeepMind podcast. It features a fast-paced montage of 6 interviews. 
The multimodal transcription is excellent:</p><pre>transcribe_video(TestVideo.GDM_PODCAST_TRAILER_PT59S)</pre><p><strong>Video (</strong><a href="https://www.youtube.com/watch?v=0pJn3g8dfwk"><strong>source</strong></a><strong>)</strong></p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2F0pJn3g8dfwk%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D0pJn3g8dfwk&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2F0pJn3g8dfwk%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/211c006355aff3c68f32fc891b34b5fa/href">https://medium.com/media/211c006355aff3c68f32fc891b34b5fa/href</a></iframe><pre>----------------- GDM_PODCAST_TRAILER_PT59S / gemini-2.0-flash -----------------<br>Input tokens   :    16,917<br>Output tokens  :       989</pre><p><strong>Speakers (6)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*zgIJF_d0RiVDnik0.png" /></figure><p><strong>Transcripts (13)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*WDaDg1cxSYHlUFSt.png" /></figure><h4>🎬 Narrator-only video</h4><p>This video is a documentary that takes viewers on a virtual tour of the Gombe National Park in Tanzania. There’s no visible speaker. 
Jane Goodall is correctly detected as the narrator, and her name is extracted from the credits:</p><pre>transcribe_video(TestVideo.JANE_GOODALL_PT2M42S)</pre><p><strong>Video (</strong><a href="https://storage.googleapis.com/cloud-samples-data/video/JaneGoodall.mp4"><strong>source</strong></a><strong>)</strong></p><pre>------------------- JANE_GOODALL_PT2M42S / gemini-2.0-flash --------------------<br>Input tokens   :    46,324<br>Output tokens  :       717</pre><p><strong>Speakers (1)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*53uK7KzG5ZL3tHfv.png" /></figure><p><strong>Transcripts (14)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*FHjIMHsl7VtnVsan.png" /></figure><blockquote><em>💡 Over the past few years, I have regularly used this video to test specialized ML models, and these tests consistently resulted in various types of errors. Gemini’s transcription, including punctuation, is perfect.</em></blockquote><h4>🎬 French video</h4><p>This French reportage shows on-the-ground footage of a specialized team using trained dogs to detect leaks in underground drinking water pipes. The recording takes place entirely outdoors in a rural setting. The interviewed workers are introduced with on-screen text overlays. The audio, captured live on location, includes ambient noise. There are also some off-screen or unidentified speakers. This video is rather complex. 
The multimodal transcription provides excellent results with no false positives:</p><pre>transcribe_video(TestVideo.BRUT_FR_DOGS_WATER_LEAK_PT8M28S)</pre><p><strong>Video (</strong><a href="https://www.youtube.com/watch?v=U_yYkb-ureI"><strong>source</strong></a><strong>)</strong></p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FU_yYkb-ureI%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DU_yYkb-ureI&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FU_yYkb-ureI%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/8e58b07f2f3d407c80a7a5b6ba9c9cf0/href">https://medium.com/media/8e58b07f2f3d407c80a7a5b6ba9c9cf0/href</a></iframe><pre>-------------- BRUT_FR_DOGS_WATER_LEAK_PT8M28S / gemini-2.0-flash --------------<br>Input tokens   :    46,514<br>Output tokens  :     4,924</pre><p><strong>Speakers (14)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*v_1dIqUD9jQp-gSz.png" /></figure><p><strong>Transcripts (61)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*fiJZ5wCc7PDcQ9Rz.gif" /></figure><blockquote><em>💡 Our prompt was crafted and tested with English videos, but it works without modification with this French video. It should also work for videos in these </em><a href="https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models?utm_campaign=CDR_0x8c87a0bc_default_b436460646&amp;utm_medium=external&amp;utm_source=blog#languages-gemini"><em>100+ different languages</em></a><em>.</em></blockquote><blockquote><em>💡 In a multilingual solution, we might ask to translate our transcriptions into any of those 100+ languages and even perform text cleanup. 
This can be done in a second request, as the multimodal transcription is complex enough by itself.</em></blockquote><blockquote><em>💡 Gemini’s audio tokenizer detects more than speech. If you try to list non-speech sounds on audio tracks only (to ensure the response doesn’t benefit from any visual cues), you’ll see it can detect sounds such as “dog bark”, “music”, “sound effect”, “footsteps”, “laughter”, “applause”…</em></blockquote><blockquote><em>💡 In our data visualization tables, colored rows are inference positives (speakers identified by the model), while gray rows correspond to negatives (unidentified speakers). This makes it easier to understand the results. As the prompt we crafted favors precision over recall, colored rows are generally correct, and gray rows correspond either to unnamed/unidentifiable speakers (true negatives) or to speakers that should have been identified (false negatives).</em></blockquote><h4>🎬 Complex video</h4><p>This Google DeepMind video is quite complex:</p><ul><li>It is highly edited and very dynamic</li><li>Speakers are often off-screen and other people can be visible instead</li><li>The researchers are often in groups and it’s not always obvious who’s speaking</li><li>Some video shots were taken 2 years apart: the same speakers can sound and look different!</li></ul><p>Gemini 2.0 Flash generates an excellent transcription. However, the complexity of the video can lead to some missed consolidations. 
Gemini 2.5 Pro shows a deeper inference and manages to consolidate the differently-looking-and-sounding speakers:</p><pre>transcribe_video(<br>    TestVideo.GDM_ALPHAFOLD_PT7M54S,<br>    model=Model.GEMINI_2_5_PRO,<br>)</pre><p><strong>Video (</strong><a href="https://www.youtube.com/watch?v=gg7WjuFs8F4"><strong>source</strong></a><strong>)</strong></p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2Fgg7WjuFs8F4%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3Dgg7WjuFs8F4&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2Fgg7WjuFs8F4%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/7849bf15726fd652ed9ca331bdb90629/href">https://medium.com/media/7849bf15726fd652ed9ca331bdb90629/href</a></iframe><pre>-------------------- GDM_ALPHAFOLD_PT7M54S / gemini-2.5-pro --------------------<br>Input tokens   :    43,354<br>Output tokens  :     4,861<br>Thoughts tokens:        80</pre><p><strong>Speakers (11)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*7PSGxFirex2UG45A.png" /></figure><p><strong>Transcripts (81)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*bbKLNA8KRv5QHpBk.gif" /></figure><h4>🎬 Long transcription</h4><p>The total length of the transcribed text can quickly reach the maximum number of output tokens. With our current JSON response schema, we can reach 8,192 output tokens (supported by Gemini 2.0) with transcriptions of ~25min videos. 
Gemini 2.5 models support up to 65,536 output tokens (8x more) and let us transcribe longer videos.</p><p>For this 54-minute panel discussion, Gemini 2.5 Pro uses only ~30–35% of the input/output token limits:</p><pre>transcribe_video(<br>    TestVideo.GDM_AI_FOR_SCIENCE_FRONTIER_PT54M23S,<br>    model=Model.GEMINI_2_5_PRO,<br>)</pre><p><strong>Video (</strong><a href="https://www.youtube.com/watch?v=nQKmVhLIGcs"><strong>source</strong></a><strong>)</strong></p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FnQKmVhLIGcs%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DnQKmVhLIGcs&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FnQKmVhLIGcs%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/800da993a63d344d052d1cfe8ef90068/href">https://medium.com/media/800da993a63d344d052d1cfe8ef90068/href</a></iframe><pre>------------ GDM_AI_FOR_SCIENCE_FRONTIER_PT54M23S / gemini-2.5-pro -------------<br>Input tokens   :   297,153<br>Output tokens  :    22,896<br>Thoughts tokens:        65</pre><p><strong>Speakers (14)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*YJDJ1283fB1Icdem.png" /></figure><p><strong>Transcripts (593)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Nk6xBW8H_AHzeMws.gif" /></figure><blockquote><em>💡 In this long video, the five panelists are correctly transcribed, diarized, and identified. In the second half of the video, unseen attendees ask the panel questions. 
They are correctly identified as audience members and, though their names and companies are never written on the screen, Gemini correctly extracts and even consolidates the information from the audio cues.</em></blockquote><h4>🎬 1h+ video</h4><p>In the latest Google I/O keynote video (1h 10min):</p><ul><li>~30–35% of the token limit is used (383k/1M in, 20/64k out)</li><li>The dozen speakers are nicely identified, including the demo “AI Voices” (esp. “Casey”)</li><li>Speaker names are extracted from slanted text on the background screen for the live keynote speakers (e.g., Josh Woodward at 0:07) and from lower-third on-screen text in the DolphinGemma reportage (e.g., Dr. Denise Herzing at 1:05:28)</li></ul><pre>transcribe_video(<br>    TestVideo.GOOGLE_IO_DEV_KEYNOTE_PT1H10M03S,<br>    model=Model.GEMINI_2_5_PRO,<br>)</pre><p><strong>Video (</strong><a href="https://www.youtube.com/watch?v=GjvgtwSOCao"><strong>source</strong></a><strong>)</strong></p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FGjvgtwSOCao%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DGjvgtwSOCao&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FGjvgtwSOCao%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/88de8c03f2122aa714192cc75673058b/href">https://medium.com/media/88de8c03f2122aa714192cc75673058b/href</a></iframe><pre>-------------- GOOGLE_IO_DEV_KEYNOTE_PT1H10M03S / gemini-2.5-pro ---------------<br>Input tokens   :   382,699<br>Output tokens  :    19,772<br>Thoughts tokens:        75</pre><p><strong>Speakers (14)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*-0nacbKOsiELzaC4.png" /></figure><p><strong>Transcripts (201)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*EES7C7pNO4kBcmAm.gif" /></figure><h4>🎬 40-speaker video</h4><p>In 
this 1h 40min Google Cloud Next keynote video:</p><ul><li>~50–70% of the token limit is used (547k/1M in, 45/64k out)</li><li>40 distinct voices are diarized</li><li>29 speakers are identified, connected to their 21 respective companies or divisions</li><li>The transcription takes up to 8 minutes (approximately 4 minutes with video tokens cached), which is 13 to 23 times faster than watching the entire video without pauses.</li></ul><pre>transcribe_video(<br>    TestVideo.GOOGLE_CLOUD_NEXT_PT1H40M03S,<br>    model=Model.GEMINI_2_5_PRO,<br>)</pre><p><strong>Video (</strong><a href="https://www.youtube.com/watch?v=Md4Fs-Zc3tg"><strong>source</strong></a><strong>)</strong></p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FMd4Fs-Zc3tg%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DMd4Fs-Zc3tg&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FMd4Fs-Zc3tg%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/04d109a7b2102aa5f61c9d974bd32d9c/href">https://medium.com/media/04d109a7b2102aa5f61c9d974bd32d9c/href</a></iframe><pre>---------------- GOOGLE_CLOUD_NEXT_PT1H40M03S / gemini-2.5-pro -----------------<br>Input tokens   :   546,590<br>Output tokens  :    45,398<br>Thoughts tokens:        74</pre><p><strong>Speakers (40)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*V3drWotARGqQBDcp.gif" /></figure><p><strong>Transcripts (853)</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*jDf08VK24Fn37RHT.gif" /></figure><p>▶️ Next up: ⚖️ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part7-74ee997d2096">Analysis, Tips &amp; Optimizations</a></p><hr><p><a 
href="https://medium.com/google-cloud/unlocking-multimodal-video-transcription-with-gemini-part6-b1fc52729e4f">Unlocking Multimodal Video Transcription with Gemini — Part 6: ✅ Challenge Complete</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Unlocking Multimodal Video Transcription with Gemini — Part 5: 🚀 Finalization]]></title>
            <link>https://medium.com/google-cloud/unlocking-multimodal-video-transcription-with-gemini-part5-488b357b53b1?source=rss-6be63961431c------2</link>
            <guid isPermaLink="false">https://medium.com/p/488b357b53b1</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[gemini]]></category>
            <category><![CDATA[generative-ai]]></category>
            <category><![CDATA[video-transcription]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Laurent Picard]]></dc:creator>
            <pubDate>Fri, 05 Sep 2025 11:45:34 GMT</pubDate>
            <atom:updated>2025-11-18T10:51:33.467Z</atom:updated>
            <content:encoded><![CDATA[<h3>Unlocking Multimodal Video Transcription with Gemini — Part 5: 🚀 Finalization</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XHvDuHRuYptPJacPGw3wMg.png" /><figcaption>Generated with Gemini 2.5 Flash Image Preview (aka Nano Banana)</figcaption></figure><h4>Unlocking Multimodal Video Transcription with Gemini</h4><ol><li>🔥 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part1-02dc32118f41">Challenge</a></li><li>🛠️ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part2-43c491a0c4f1">Setup</a></li><li>🧪 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part3-404c6c4b986c">Prototyping</a></li><li>🏗️ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part4-3381b61aaaec">Prompt Crafting</a></li><li><strong>🚀 Finalization ◀️</strong></li><li>✅ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part6-b1fc52729e4f">Challenge Complete</a></li><li>⚖️ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part7-74ee997d2096">Analysis, Tips &amp; Optimizations</a></li><li>🏁 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part8-eee478ba7eb0">Conclusion</a></li></ol><h3>🚀 Finalization</h3><h4>🧩 Structured output</h4><p>We’ve iterated towards a precise and concise prompt. Now, we can focus on Gemini’s response:</p><ul><li>The response is plain text containing fenced code blocks</li><li>Instead, we’d like a structured output, to receive consistently formatted responses</li><li>Ideally, we’d also like to avoid having to parse the response, which can be a maintenance burden</li></ul><p>Getting structured outputs is an LLM feature also called “controlled generation”. 
Since we’ve already crafted our prompt in terms of data tables and JSON fields, this is now a formality.</p><p>In our request, we can add the following parameters:</p><ul><li>response_mime_type=&quot;application/json&quot;</li><li>response_schema=&quot;YOUR_JSON_SCHEMA&quot; (<a href="https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output?utm_campaign=CDR_0x8c87a0bc_default_b436460646&amp;utm_medium=external&amp;utm_source=blog#fields">docs</a>)</li></ul><p>In Python, this gets even easier:</p><ul><li>Use the pydantic library</li><li>Reflect your output structure with classes derived from pydantic.BaseModel</li></ul><p>We can simplify the prompt by removing the output specification parts:</p><pre>Generate a JSON object with keys `task1_transcripts` and `task2_speakers` for the following tasks.<br>…<br>- The `task1_transcripts` value is a JSON array where each object has the following fields:<br>  - `start`<br>  - `text`<br>  - `voice`<br>…<br>- The `task2_speakers` value is a JSON array where each object has the following fields:<br>  - `voice`<br>  - `name`</pre><p>… to move them to matching Python classes instead:</p><pre>import pydantic<br><br>class Transcript(pydantic.BaseModel):<br>    start: str<br>    text: str<br>    voice: int<br><br>class Speaker(pydantic.BaseModel):<br>    voice: int<br>    name: str<br><br>class VideoTranscription(pydantic.BaseModel):<br>    task1_transcripts: list[Transcript] = pydantic.Field(default_factory=list)<br>    task2_speakers: list[Speaker] = pydantic.Field(default_factory=list)</pre><p>… and request a structured response:</p><pre>response = client.models.generate_content(<br>    # …<br>    config=GenerateContentConfig(<br>        # …<br>        response_mime_type=&quot;application/json&quot;,<br>        response_schema=VideoTranscription,<br>        # …<br>    ),<br>)</pre><p>Finally, retrieving the objects from the response is also direct:</p><pre>if isinstance(response.parsed, 
VideoTranscription):<br>    video_transcription = response.parsed<br>else:<br>    video_transcription = VideoTranscription()  # Empty transcription</pre><p>The interesting aspects of this approach are the following:</p><ul><li>The prompt focuses on the logic and the classes focus on the output</li><li>It’s easier to update and maintain typed classes</li><li>The JSON schema is automatically generated by the Gen AI SDK from the class provided in response_schema and dispatched to Gemini</li><li>The response is automatically parsed by the Gen AI SDK and deserialized into the corresponding Python objects</li></ul><blockquote><em>⚠️ If you keep output specifications in your prompt, ensure there are no contradictions between the prompt and the schema (e.g., same field names and order), as this can negatively impact the quality of the responses.</em></blockquote><blockquote><em>💡 It’s possible to have more structural information directly in the schema (e.g., detailed field definitions). See </em><a href="https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/control-generated-output?utm_campaign=CDR_0x8c87a0bc_default_b436460646&amp;utm_medium=external&amp;utm_source=blog"><em>Controlled generation</em></a><em>.</em></blockquote><h4>✨ Implementation</h4><p>Let’s finalize our code. 
In addition, now that we have a stable prompt, we can even enrich our solution to extract each speaker’s company, position, and role_in_video:</p><pre>import re<br><br>import pydantic<br>from google.genai.types import MediaResolution, ThinkingConfig<br><br>SamplingFrameRate = float<br>NOT_FOUND = &quot;?&quot;<br>VIDEO_TRANSCRIPTION_PROMPT = f&quot;&quot;&quot;<br>**Task 1 - Transcripts**<br><br>- Watch the video and listen carefully to the audio.<br>- Identify the distinct voices using a `voice` ID (1, 2, 3, etc.).<br>- Transcribe the video&#39;s audio verbatim with voice diarization.<br>- Include the `start` timecode ({{timecode_spec}}) for each speech segment.<br><br>**Task 2 - Speakers**<br><br>- For each `voice` ID from Task 1, extract information about the corresponding speaker.<br>- Use visual and audio cues.<br>- If a piece of information cannot be found, use `{NOT_FOUND}` as the value.<br>&quot;&quot;&quot;<br><br><br>class Transcript(pydantic.BaseModel):<br>    start: str<br>    text: str<br>    voice: int<br><br><br>class Speaker(pydantic.BaseModel):<br>    voice: int<br>    name: str<br>    company: str<br>    position: str<br>    role_in_video: str<br><br><br>class VideoTranscription(pydantic.BaseModel):<br>    task1_transcripts: list[Transcript] = pydantic.Field(default_factory=list)<br>    task2_speakers: list[Speaker] = pydantic.Field(default_factory=list)<br><br><br>def get_generate_content_config(model: Model, video: Video) -&gt; GenerateContentConfig:<br>    media_resolution = get_media_resolution_for_video(video)<br>    thinking_config = get_thinking_config(model)<br><br>    return GenerateContentConfig(<br>        temperature=DEFAULT_CONFIG.temperature,<br>        top_p=DEFAULT_CONFIG.top_p,<br>        seed=DEFAULT_CONFIG.seed,<br>        response_mime_type=&quot;application/json&quot;,<br>        response_schema=VideoTranscription,<br>        media_resolution=media_resolution,<br>        thinking_config=thinking_config,<br>    )<br><br><br>def 
get_video_duration(video: Video) -&gt; timedelta | None:<br>    # For testing purposes, video duration is statically specified in the enum name<br>    # Suffix (ISO 8601 based): _PT[&lt;h&gt;H][&lt;m&gt;M][&lt;s&gt;S]<br>    # For production,<br>    # - fetch durations dynamically or store them separately<br>    # - take into account video VideoMetadata.start_offset &amp; VideoMetadata.end_offset<br>    regex = r&quot;_PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$&quot;<br>    if not (match := re.search(regex, video.name)):<br>        print(f&quot;⚠️ No duration info in {video.name}. Will use defaults.&quot;)<br>        return None<br><br>    h_str, m_str, s_str = match.groups()<br>    return timedelta(<br>        hours=int(h_str or 0), minutes=int(m_str or 0), seconds=int(s_str or 0)<br>    )<br><br>def get_media_resolution_for_video(video: Video) -&gt; MediaResolution | None:<br>    if not (video_duration := get_video_duration(video)):<br>        return None  # Default<br><br>    # For testing purposes, this is based on video duration, as our short videos tend to be more detailed<br>    less_than_five_minutes = video_duration &lt; timedelta(minutes=5)<br>    if less_than_five_minutes:<br>        media_resolution = MediaResolution.MEDIA_RESOLUTION_MEDIUM<br>    else:<br>        media_resolution = MediaResolution.MEDIA_RESOLUTION_LOW<br><br>    return media_resolution<br><br><br>def get_sampling_frame_rate_for_video(video: Video) -&gt; SamplingFrameRate | None:<br>    sampling_frame_rate = None  # Default (1 FPS for current models)<br><br>    # [Optional] Define a custom FPS: 0.0 &lt; sampling_frame_rate &lt;= 24.0<br><br>    return sampling_frame_rate<br><br><br>def get_timecode_spec_for_model_and_video(model: Model, video: Video) -&gt; str:<br>    timecode_spec = &quot;MM:SS&quot;  # Default<br><br>    match model:<br>        case Model.GEMINI_2_0_FLASH:  # Supports MM:SS<br>            pass<br>        case Model.GEMINI_2_5_FLASH | Model.GEMINI_2_5_PRO:  # Support MM:SS 
and H:MM:SS<br>            duration = get_video_duration(video)<br>            one_hour_or_more = duration is not None and timedelta(hours=1) &lt;= duration<br>            if one_hour_or_more:<br>                timecode_spec = &quot;MM:SS or H:MM:SS&quot;<br>        case _:<br>            raise NotImplementedError(f&quot;Undefined timecode spec for {model.name}.&quot;)<br><br>    return timecode_spec<br><br><br>def get_thinking_config(model: Model) -&gt; ThinkingConfig | None:<br>    # Examples of thinking configurations (Gemini 2.5 models)<br>    match model:<br>        case Model.GEMINI_2_5_FLASH:  # Thinking disabled<br>            return ThinkingConfig(thinking_budget=0, include_thoughts=False)<br>        case Model.GEMINI_2_5_PRO:  # Minimum thinking budget and no summarized thoughts<br>            return ThinkingConfig(thinking_budget=128, include_thoughts=False)<br>        case _:<br>            return None  # Default<br><br><br>def get_video_transcription_from_response(<br>    response: GenerateContentResponse,<br>) -&gt; VideoTranscription:<br>    if isinstance(response.parsed, VideoTranscription):<br>        return response.parsed<br>    <br>    print(&quot;❌ Could not parse the JSON response&quot;)<br>    return VideoTranscription()  # Empty transcription<br><br><br>def get_video_transcription(<br>    video: Video,<br>    video_segment: VideoSegment | None = None,<br>    fps: float | None = None,<br>    prompt: str | None = None,<br>    model: Model | None = None,<br>) -&gt; VideoTranscription:<br>    model = model or Model.DEFAULT<br>    model_id = model.value<br><br>    fps = fps or get_sampling_frame_rate_for_video(video)<br>    video_part = get_video_part(video, video_segment, fps)<br>    if not video_part:  # Unsupported source, return an empty transcription<br>        return VideoTranscription()<br>    if prompt is None:<br>        timecode_spec = get_timecode_spec_for_model_and_video(model, video)<br>        prompt = 
VIDEO_TRANSCRIPTION_PROMPT.format(timecode_spec=timecode_spec)<br>    contents = [video_part, prompt.strip()]<br><br>    config = get_generate_content_config(model, video)<br><br>    print(f&quot; {video.name} / {model_id} &quot;.center(80, &quot;-&quot;))<br>    response = None<br>    for attempt in get_retrier():<br>        with attempt:<br>            response = client.models.generate_content(<br>                model=model_id,<br>                contents=contents,<br>                config=config,<br>            )<br>            display_response_info(response)<br><br>    assert isinstance(response, GenerateContentResponse)<br>    return get_video_transcription_from_response(response)</pre><p>Test it:</p><pre>def test_structured_video_transcription(video: Video) -&gt; None:<br>    transcription = get_video_transcription(video)<br><br>    print(&quot;-&quot; * 80)<br>    print(f&quot;Transcripts : {len(transcription.task1_transcripts):3d}&quot;)<br>    print(f&quot;Speakers    : {len(transcription.task2_speakers):3d}&quot;)<br>    for speaker in transcription.task2_speakers:<br>        print(f&quot;- {speaker}&quot;)<br><br><br>test_structured_video_transcription(TestVideo.GDM_PODCAST_TRAILER_PT59S)</pre><pre>----------------- GDM_PODCAST_TRAILER_PT59S / gemini-2.0-flash -----------------<br>Input tokens   :    16,917<br>Output tokens  :       989<br>--------------------------------------------------------------------------------<br>Transcripts :  13<br>Speakers    :   6<br>- voice=1 name=&#39;Professor Hannah Fry&#39; company=&#39;Google DeepMind&#39; position=&#39;Host&#39; role_in_video=&#39;Host&#39;<br>- voice=2 name=&#39;Demis Hassabis&#39; company=&#39;Google DeepMind&#39; position=&#39;Co-Founder &amp; CEO&#39; role_in_video=&#39;Interviewee&#39;<br>- voice=3 name=&#39;Anca Dragan&#39; company=&#39;?&#39; position=&#39;Director, AI Safety &amp; Alignment&#39; role_in_video=&#39;Interviewee&#39;<br>- voice=4 name=&#39;Pushmeet Kohli&#39; 
company=&#39;?&#39; position=&#39;VP Science &amp; Strategic Initiatives&#39; role_in_video=&#39;Interviewee&#39;<br>- voice=5 name=&#39;Jeff Dean&#39; company=&#39;?&#39; position=&#39;Chief Scientist&#39; role_in_video=&#39;Interviewee&#39;<br>- voice=6 name=&#39;Douglas Eck&#39; company=&#39;?&#39; position=&#39;Senior Research Director&#39; role_in_video=&#39;Interviewee&#39;</pre><h4>📊 Data visualization</h4><p>We started prototyping in natural language, crafted a prompt, and generated a structured output. Since reading raw data can be cumbersome, we can now present video transcriptions in a more visually appealing way.</p><p>Here’s a possible orchestrator function:</p><pre>def transcribe_video(video: Video, …) -&gt; None:<br>    display_video(video)<br>    transcription = get_video_transcription(video, …)<br>    display_speakers(transcription)<br>    display_transcripts(transcription)</pre><p><strong>Let’s add some data visualization functions…</strong></p><pre>import itertools<br>from collections.abc import Callable, Iterator<br><br>from pandas import DataFrame, Series<br>from pandas.io.formats.style import Styler<br>from pandas.io.formats.style_render import CSSDict<br><br>BGCOLOR_COLUMN = &quot;bg_color&quot;  # Hidden column to store row background colors<br><br><br>def yield_known_speaker_color() -&gt; Iterator[str]:<br>    PAL_40 = (&quot;#669DF6&quot;, &quot;#EE675C&quot;, &quot;#FCC934&quot;, &quot;#5BB974&quot;)<br>    PAL_30 = (&quot;#8AB4F8&quot;, &quot;#F28B82&quot;, &quot;#FDD663&quot;, &quot;#81C995&quot;)<br>    PAL_20 = (&quot;#AECBFA&quot;, &quot;#F6AEA9&quot;, &quot;#FDE293&quot;, &quot;#A8DAB5&quot;)<br>    PAL_10 = (&quot;#D2E3FC&quot;, &quot;#FAD2CF&quot;, &quot;#FEEFC3&quot;, &quot;#CEEAD6&quot;)<br>    PAL_05 = (&quot;#E8F0FE&quot;, &quot;#FCE8E6&quot;, &quot;#FEF7E0&quot;, &quot;#E6F4EA&quot;)<br>    return itertools.cycle([*PAL_40, *PAL_30, *PAL_20, *PAL_10, *PAL_05])<br><br><br>def yield_unknown_speaker_color() -&gt; 
Iterator[str]:<br>    GRAYS = [&quot;#80868B&quot;, &quot;#9AA0A6&quot;, &quot;#BDC1C6&quot;, &quot;#DADCE0&quot;, &quot;#E8EAED&quot;, &quot;#F1F3F4&quot;]<br>    return itertools.cycle(GRAYS)<br><br><br>def get_color_for_voice_mapping(speakers: list[Speaker]) -&gt; dict[int, str]:<br>    known_speaker_color = yield_known_speaker_color()<br>    unknown_speaker_color = yield_unknown_speaker_color()<br><br>    mapping: dict[int, str] = {}<br>    for speaker in speakers:<br>        if speaker.name != NOT_FOUND:<br>            color = next(known_speaker_color)<br>        else:<br>            color = next(unknown_speaker_color)<br>        mapping[speaker.voice] = color<br><br>    return mapping<br><br><br>def get_table_styler(df: DataFrame) -&gt; Styler:<br>    def join_styles(styles: list[str]) -&gt; str:<br>        return &quot;;&quot;.join(styles)<br><br>    table_css = [<br>        &quot;color: #202124&quot;,<br>        &quot;background-color: #BDC1C6&quot;,<br>        &quot;border: 0&quot;,<br>        &quot;border-radius: 0.5rem&quot;,<br>        &quot;border-spacing: 0px&quot;,<br>        &quot;outline: 0.5rem solid #BDC1C6&quot;,<br>        &quot;margin: 1rem 0.5rem&quot;,<br>    ]<br>    th_css = [&quot;background-color: #E8EAED&quot;]<br>    th_td_css = [&quot;text-align:left&quot;, &quot;padding: 0.25rem 1rem&quot;]<br>    table_styles = [<br>        CSSDict(selector=&quot;&quot;, props=join_styles(table_css)),<br>        CSSDict(selector=&quot;th&quot;, props=join_styles(th_css)),<br>        CSSDict(selector=&quot;th,td&quot;, props=join_styles(th_td_css)),<br>    ]<br><br>    return df.style.set_table_styles(table_styles).hide()<br><br><br>def change_row_bgcolor(row: Series) -&gt; list[str]:<br>    style = f&quot;background-color:{row[BGCOLOR_COLUMN]}&quot;<br>    return [style] * len(row)<br><br><br>def display_table(yield_rows: Callable[[], Iterator[list[str]]]) -&gt; None:<br>    data = yield_rows()<br>    df = DataFrame(columns=next(data), 
data=data)<br>    styler = get_table_styler(df)<br>    styler.apply(change_row_bgcolor, axis=1)<br>    styler.hide([BGCOLOR_COLUMN], axis=&quot;columns&quot;)<br><br>    html = styler.to_html()<br>    IPython.display.display(IPython.display.HTML(html))<br><br><br>def display_speakers(transcription: VideoTranscription) -&gt; None:<br>    def sanitize_field(s: str, symbol_if_unknown: str) -&gt; str:<br>        return symbol_if_unknown if s == NOT_FOUND else s<br><br>    def yield_rows() -&gt; Iterator[list[str]]:<br>        yield [&quot;voice&quot;, &quot;name&quot;, &quot;company&quot;, &quot;position&quot;, &quot;role_in_video&quot;, BGCOLOR_COLUMN]<br><br>        color_for_voice = get_color_for_voice_mapping(transcription.task2_speakers)<br>        for speaker in transcription.task2_speakers:<br>            yield [<br>                str(speaker.voice),<br>                sanitize_field(speaker.name, NOT_FOUND),<br>                sanitize_field(speaker.company, NOT_FOUND),<br>                sanitize_field(speaker.position, NOT_FOUND),<br>                sanitize_field(speaker.role_in_video, NOT_FOUND),<br>                color_for_voice.get(speaker.voice, &quot;red&quot;),<br>            ]<br><br>    display_markdown(f&quot;### Speakers ({len(transcription.task2_speakers)})&quot;)<br>    display_table(yield_rows)<br><br><br>def display_transcripts(transcription: VideoTranscription) -&gt; None:<br>    def yield_rows() -&gt; Iterator[list[str]]:<br>        yield [&quot;start&quot;, &quot;speaker&quot;, &quot;transcript&quot;, BGCOLOR_COLUMN]<br><br>        color_for_voice = get_color_for_voice_mapping(transcription.task2_speakers)<br>        speaker_for_voice = {<br>            speaker.voice: speaker for speaker in transcription.task2_speakers<br>        }<br>        previous_voice = None<br>        for transcript in transcription.task1_transcripts:<br>            current_voice = transcript.voice<br>            speaker_label = &quot;&quot;<br>            if 
speaker := speaker_for_voice.get(current_voice, None):<br>                if speaker.name != NOT_FOUND:<br>                    speaker_label = speaker.name<br>                elif speaker.position != NOT_FOUND:<br>                    speaker_label = f&quot;[voice {current_voice}][{speaker.position}]&quot;<br>                elif speaker.role_in_video != NOT_FOUND:<br>                    speaker_label = f&quot;[voice {current_voice}][{speaker.role_in_video}]&quot;<br>            if not speaker_label:<br>                speaker_label = f&quot;[voice {current_voice}]&quot;<br>            yield [<br>                transcript.start,<br>                speaker_label if current_voice != previous_voice else &#39;&quot;&#39;,<br>                transcript.text,<br>                color_for_voice.get(current_voice, &quot;red&quot;),<br>            ]<br>            previous_voice = current_voice<br><br>    display_markdown(f&quot;### Transcripts ({len(transcription.task1_transcripts)})&quot;)<br>    display_table(yield_rows)<br><br><br>def transcribe_video(<br>    video: Video,<br>    video_segment: VideoSegment | None = None,<br>    fps: float | None = None,<br>    prompt: str | None = None,<br>    model: Model | None = None,<br>) -&gt; None:<br>    display_video(video)<br>    transcription = get_video_transcription(video, video_segment, fps, prompt, model)<br>    display_speakers(transcription)<br>    display_transcripts(transcription)</pre><p>▶️ Next up: ✅ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part6-b1fc52729e4f">Challenge Complete</a></p><hr><p><a href="https://medium.com/google-cloud/unlocking-multimodal-video-transcription-with-gemini-part5-488b357b53b1">Unlocking Multimodal Video Transcription with Gemini — Part 5: 🚀 Finalization</a> was originally published in <a 
href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Unlocking Multimodal Video Transcription with Gemini — Part 4: 🏗️ Prompt Crafting]]></title>
            <link>https://medium.com/google-cloud/unlocking-multimodal-video-transcription-with-gemini-part4-3381b61aaaec?source=rss-6be63961431c------2</link>
            <guid isPermaLink="false">https://medium.com/p/3381b61aaaec</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[video-transcription]]></category>
            <category><![CDATA[gemini]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[generative-ai]]></category>
            <dc:creator><![CDATA[Laurent Picard]]></dc:creator>
            <pubDate>Thu, 04 Sep 2025 11:28:27 GMT</pubDate>
            <atom:updated>2025-11-18T10:46:07.964Z</atom:updated>
            <content:encoded><![CDATA[<h3>Unlocking Multimodal Video Transcription with Gemini — Part 4: 🏗️ Prompt Crafting</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*t8mhR11Z-QU8gANtLQhVSg.png" /><figcaption>Generated with Gemini 2.5 Flash Image Preview (aka Nano Banana)</figcaption></figure><h4>Unlocking Multimodal Video Transcription with Gemini</h4><ol><li>🔥 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part1-02dc32118f41">Challenge</a></li><li>🛠️ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part2-43c491a0c4f1">Setup</a></li><li>🧪 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part3-404c6c4b986c">Prototyping</a></li><li><strong>🏗️ Prompt Crafting ◀️</strong></li><li>🚀 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part5-488b357b53b1">Finalization</a></li><li>✅ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part6-b1fc52729e4f">Challenge Complete</a></li><li>⚖️ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part7-74ee997d2096">Analysis, Tips &amp; Optimizations</a></li><li>🏁 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part8-eee478ba7eb0">Conclusion</a></li></ol><h3>🏗️ Prompt crafting</h3><h4>🪜 Methodology</h4><p>Prompt crafting, also called prompt engineering, is a relatively new field. It involves designing and refining text instructions to guide LLMs towards generating desired outputs. Like writing, it is both an art and a science, a skill that everyone can develop with practice.</p><p>We can find countless reference materials about prompt crafting. Some prompts can be very long, complex, and even scary. Crafting prompts with a high-performing LLM like Gemini is much simpler. 
Here are three key adjectives to keep in mind:</p><ul><li>iterative</li><li>precise</li><li>concise</li></ul><p><strong>Iterative</strong></p><p>Prompt crafting is typically an iterative process. Here are some recommendations:</p><ul><li>Craft your prompt step by step</li><li>Keep track of your successive iterations</li><li>At every iteration, make sure to measure what’s working versus what’s not</li><li>If you reach a regression, backtrack to a successful iteration</li></ul><p><strong>Precise</strong></p><p>Precision is key:</p><ul><li>Use words as specific as possible</li><li>Words with multiple meanings can introduce variability, so use precise expressions</li><li>Precision will influence probabilities in your favor</li></ul><p><strong>Concise</strong></p><p>Concision has additional advantages:</p><ul><li>A short prompt is easier for us developers to understand (and maintain!)</li><li>The longer your prompt is, the more likely you are to introduce inconsistencies or even contradictions, which results in variable interpretations of your instructions</li><li>Test and trust the LLM’s knowledge: this knowledge acts as an implicit context and can make your prompt shorter and clearer</li></ul><p>Overall, though this may seem contradictory, if you take the time to be iterative, precise, and concise, you are likely to save a lot of time.</p><blockquote><em>💡 If you want to explore this topic, check out </em><a href="https://cloud.google.com/vertex-ai/generative-ai/docs/learn/prompts/prompt-design-strategies?utm_campaign=CDR_0x8c87a0bc_default_b436460646&amp;utm_medium=external&amp;utm_source=blog"><em>Prompting strategies</em></a><em> (Google Cloud reference) and </em><a href="https://www.kaggle.com/whitepaper-prompt-engineering"><em>Prompt engineering</em></a><em> (68-page PDF by Lee Boonstra).</em></blockquote><h4>📚 Terminology</h4><p>We’re not experts in video transcription (yet!) but we want Gemini to behave as one. 
Consequently, we’d like to write prompts as specific as possible for this use case. While LLMs process instructions based on their training knowledge, they can also share this knowledge with us.</p><p>We can learn a lot by directly asking Gemini:</p><pre>prompt = &quot;&quot;&quot;<br>What is the terminology used for video transcriptions?<br>Please show a typical output example.<br>&quot;&quot;&quot;<br>generate_content(prompt, show_as=ShowAs.MARKDOWN)</pre><pre>------------------------------- gemini-2.0-flash -------------------------------<br>Input tokens   :        18<br>Output tokens  :     1,064<br>------------------------------ start of response -------------------------------</pre><pre>## Terminology Used for Video Transcriptions<br><br>Here&#39;s a breakdown of common terminology used when discussing video transcriptions:<br><br>*   **Transcription:** The general term for converting audio or video content into written text.<br>*   **Verbatim Transcription:** A transcription that captures every spoken word, including filler words (&quot;um,&quot; &quot;ah&quot;), false starts, stutters, and background noises.  It aims for a highly accurate representation of the audio.<br>*   **Clean Verbatim Transcription (or Edited Transcription):** A transcription that removes filler words, false starts, and stutters to create a more readable and polished text. It focuses on conveying the core message clearly.<br>*   **Intelligent Verbatim Transcription:** Similar to clean verbatim, but may also correct minor grammatical errors or rephrase sentences slightly for clarity, while still preserving the speaker&#39;s intent.<br>*   **Timecoding (or Timestamping):** Adding timestamps to the transcription to indicate when specific words or phrases were spoken in the video.  
This is often used for indexing, subtitling, or finding specific sections of the video.<br>*   **Speaker Identification:** Identifying and labeling each speaker in the transcription (e.g., Speaker 1, Speaker 2, or using names if known).<br>*   **Subtitles/Captions:** Textual versions of the video&#39;s dialogue displayed on the screen, often used for accessibility or translation.  While related to transcription, they are specifically formatted for on-screen display.<br>*   **Closed Captions:** Captions that can be turned on or off by the viewer.<br>*   **Open Captions:** Captions that are permanently embedded in the video and cannot be turned off.<br>*   **SRT (SubRip Subtitle) file:** A common file format for subtitles and captions, containing the text and timestamps.<br>*   **VTT (Video Text Tracks) file:** A more advanced subtitle format that supports styling and other features.<br>*   **Burned-in Captions:** Captions that are permanently embedded into the video file.<br>*   **Transcription Software:** Software used to assist in the transcription process, often including features like automatic speech recognition (ASR), playback controls, and text editing tools.<br>*   **Automatic Speech Recognition (ASR):** Technology that automatically converts speech to text.  While improving, ASR often requires human editing for accuracy.<br>*   **Human Transcription:** Transcription performed by a human transcriber, generally considered more accurate than ASR alone.<br><br>## Typical Output Examples<br><br>Here are examples of different transcription styles:<br><br>**1. Verbatim Transcription:**<br><br>```<br>Speaker 1: Um, so, like, I think the, uh, the main point is, ah, we need to, um, focus on, you know, the, the marketing strategy.<br>Speaker 2: Yeah, I, I agree.  It&#39;s, it&#39;s, uh, really important.  And, um, we should, like, consider, um, different, uh, channels.<br>```<br><br>**2. 
Clean Verbatim Transcription:**<br><br>```<br>Speaker 1: So, I think the main point is we need to focus on the marketing strategy.<br>Speaker 2: Yeah, I agree. It&#39;s really important. And we should consider different channels.<br>```<br><br>**3. Intelligent Verbatim Transcription:**<br><br>```<br>Speaker 1: I think the main point is that we need to focus on our marketing strategy.<br>Speaker 2: Yes, I agree. It&#39;s very important. We should consider different marketing channels.<br>```<br><br>**4. Transcription with Timecoding (Clean Verbatim):**<br><br>```<br>00:00:05 Speaker 1: So, I think the main point is we need to focus on the marketing strategy.<br>00:00:12 Speaker 2: Yeah, I agree. It&#39;s really important. And we should consider different channels.<br>```<br><br>**5. Transcription with Speaker Identification and Timecoding (Clean Verbatim):**<br><br>```<br>00:00:05 John: So, I think the main point is we need to focus on the marketing strategy.<br>00:00:12 Mary: Yeah, I agree. It&#39;s really important. And we should consider different channels.<br>```<br><br>**6. Example of an SRT file:**<br><br>```<br>1<br>00:00:05,000 --&gt; 00:00:12,000<br>So, I think the main point is<br>we need to focus on the marketing strategy.<br><br>2<br>00:00:12,000 --&gt; 00:00:18,000<br>Yeah, I agree. It&#39;s really important.<br>And we should consider different channels.<br>```<br><br>The best type of transcription depends on the specific needs of the project.  For example, legal proceedings often require verbatim transcriptions, while marketing videos might benefit from clean verbatim or intelligent verbatim transcriptions for clarity and readability.  
Subtitles and captions require specific formatting and timing to synchronize with the video.</pre><pre>------------------------------- end of response --------------------------------</pre><h4>📝 Tabular extraction</h4><p>So far, we’ve seen the following:</p><ul><li>We didn’t manage to get the full transcription with identified speakers all at once</li><li>Order matters (because a generated token influences the probabilities for subsequent tokens)</li></ul><p>To tackle our challenge, we need Gemini to infer from the following multimodal information:</p><ul><li>text (our instructions + what may be written in the video)</li><li>audio cues (everything said or audible in the video’s audio)</li><li>visual cues (everything visible in the video)</li><li>time (when things happen)</li></ul><p>That is quite a mixture of information types!</p><p>As video transcription is a data extraction use case, if we think about the final result as a database, our final goal can be seen as the generation of two related tables (transcripts and speakers). If we write it down, our initial three sub-problems now look decoupled:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*8PIctLGdNnivE4S0.png" /></figure><blockquote><em>💡 In computer science, data decoupling enhances data locality, often yielding improved performance across areas such as cache utilization, data access, semantic understanding, or system maintenance. Within the LLM Transformer architecture, core performance relies heavily on the attention mechanism. Nonetheless, the attention pool is finite and tokens compete for attention. Researchers sometimes refer to “attention dilution” for long-context, million-token-scale benchmarks. While we cannot directly debug LLMs as users, intuitively, data decoupling may improve the model’s focus, leading to a better attention span.</em></blockquote><p>Since Gemini is extremely good with patterns, it can automatically generate identifiers to link our tables. 
In addition, since we eventually want an automated workflow, we can start reasoning in terms of data and fields:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*yS9jMl7PY7xKl1gF.png" /></figure><p>Let’s call this approach “tabular extraction”, split our instructions into two tasks (tables), still in a single request, and arrange them in a meaningful order…</p><h4>💬 Transcripts</h4><p>First of all, let’s focus on getting the audio transcripts:</p><ul><li>Gemini has proven to be natively good at audio transcription</li><li>This requires less inference than image analysis</li><li>It is central and independent information</li></ul><blockquote><em>💡 Generating an output that starts with correct answers should help to achieve an overall correct output.</em></blockquote><p>We’ve also seen what a typical transcription entry can look like:</p><pre>00:02 speaker_1: Welcome!</pre><p>But, right away, there can be some ambiguities in our multimodal use case:</p><ul><li>What is a speaker?</li><li>Is it someone we see/hear?</li><li>What if the person visible in the video is not the one speaking?</li><li>What if the person speaking is never seen in the video?</li></ul><p>How do we unconsciously identify who is speaking in a video?</p><ul><li>First, probably by identifying the different voices on the fly?</li><li>Then, probably by consolidating additional audio and visual cues?</li></ul><p>Can Gemini understand voice characteristics?</p><pre>prompt = &quot;&quot;&quot;<br>Using only the video&#39;s audio, list the following audible characteristics:<br>- Voice tones<br>- Voice pitches<br>- Languages<br>- Accents<br>- Speaking styles<br>&quot;&quot;&quot;<br>video = TestVideo.GDM_PODCAST_TRAILER_PT59S<br><br>generate_content(prompt, video, show_as=ShowAs.MARKDOWN)</pre><pre>----------------- GDM_PODCAST_TRAILER_PT59S / gemini-2.0-flash -----------------<br>Input tokens   :    16,730<br>Output tokens  :       168<br>------------------------------ start of 
response -------------------------------</pre><pre>Okay, here&#39;s a breakdown of the audible characteristics in the video&#39;s audio:<br><br>- **Voice Tones:** The tones range from conversational and friendly to more serious and thoughtful, depending on the speaker and the topic. There are also moments of excitement and humor.<br>- **Voice Pitches:** There&#39;s a mix of high and low pitches, reflecting the different speakers (both male and female).<br>- **Languages:** The primary language is English.<br>- **Accents:** There are a variety of accents, including British and American.<br>- **Speaking Styles:** The speaking styles vary from casual and conversational to more formal and academic, depending on the speaker and the context.</pre><pre>------------------------------- end of response --------------------------------</pre><p>What about a French video?</p><pre>video = TestVideo.BRUT_FR_DOGS_WATER_LEAK_PT8M28S<br><br>generate_content(prompt, video, show_as=ShowAs.MARKDOWN)</pre><pre>-------------- BRUT_FR_DOGS_WATER_LEAK_PT8M28S / gemini-2.0-flash --------------<br>Input tokens   :   144,055<br>Output tokens  :       147<br>------------------------------ start of response -------------------------------</pre><pre>Here is a list of the audible characteristics of the video&#39;s audio:<br><br>- **Voice tones:** Conversational, informative, enthusiastic, serious, humorous<br>- **Voice pitches:** Varying, from low to high<br>- **Languages:** French<br>- **Accents:** Standard French<br>- **Speaking styles:** Clear, articulate, professional, casual</pre><pre>------------------------------- end of response --------------------------------</pre><blockquote><em>⚠️ We have to be cautious here: responses can consolidate multimodal information or even general knowledge. For example, if a person is famous, their name is most likely part of the LLM’s knowledge. If they are known to be from the UK, a possible inference is that they have a British accent. 
This is why we made our prompt more specific by including “using only the video’s audio”.</em></blockquote><blockquote><em>💡 If you conduct more tests, for example on private audio files (i.e., not part of common knowledge and with no additional visual cues), you’ll see that Gemini’s audio tokenizer performs exceptionally well and extracts semantic speech information!</em></blockquote><p>After a few iterations, we can arrive at a transcription prompt focusing on the audio and voices:</p><pre>prompt = &quot;&quot;&quot;<br>Task:<br>- Watch the video and listen carefully to the audio.<br>- Identify the distinct voices using a `voice` ID (1, 2, 3, etc.).<br>- Transcribe the video&#39;s audio verbatim with voice diarization.<br>- Include the `start` timecode (MM:SS) for each speech segment.<br>- Output a JSON array where each object has the following fields:<br>  - `start`<br>  - `text`<br>  - `voice`<br>&quot;&quot;&quot;<br>video = TestVideo.GDM_PODCAST_TRAILER_PT59S<br><br>generate_content(prompt, video, show_as=ShowAs.MARKDOWN)</pre><pre>----------------- GDM_PODCAST_TRAILER_PT59S / gemini-2.0-flash -----------------<br>Input tokens   :    16,800<br>Output tokens  :       635<br>------------------------------ start of response -------------------------------</pre><pre>[<br>  {<br>    &quot;start&quot;: &quot;00:00&quot;,<br>    &quot;text&quot;: &quot;Do I have to call you Sir Demis now?&quot;,<br>    &quot;voice&quot;: 1<br>  },<br>  {<br>    &quot;start&quot;: &quot;00:01&quot;,<br>    &quot;text&quot;: &quot;Oh, you don&#39;t. Absolutely not.&quot;,<br>    &quot;voice&quot;: 2<br>  },<br>  {<br>    &quot;start&quot;: &quot;00:03&quot;,<br>    &quot;text&quot;: &quot;Welcome to Google Deep Mind the podcast with me, your host Professor Hannah Fry.&quot;,<br>    &quot;voice&quot;: 1<br>  },<br>  {<br>    &quot;start&quot;: &quot;00:06&quot;,<br>    &quot;text&quot;: &quot;We want to take you to the heart of where these ideas are coming from. 
We want to introduce you to the people who are leading the design of our collective future.&quot;,<br>    &quot;voice&quot;: 1<br>  },<br>  {<br>    &quot;start&quot;: &quot;00:19&quot;,<br>    &quot;text&quot;: &quot;Getting the safety right is probably, I&#39;d say, one of the most important challenges of our time. I want safe and capable.&quot;,<br>    &quot;voice&quot;: 3<br>  },<br>  {<br>    &quot;start&quot;: &quot;00:26&quot;,<br>    &quot;text&quot;: &quot;I want a bridge that will not collapse.&quot;,<br>    &quot;voice&quot;: 3<br>  },<br>  {<br>    &quot;start&quot;: &quot;00:30&quot;,<br>    &quot;text&quot;: &quot;just give these scientists a superpower that they had not imagined earlier.&quot;,<br>    &quot;voice&quot;: 4<br>  },<br>  {<br>    &quot;start&quot;: &quot;00:34&quot;,<br>    &quot;text&quot;: &quot;autonomous vehicles. It&#39;s hard to fathom that when you&#39;re working on a search engine.&quot;,<br>    &quot;voice&quot;: 5<br>  },<br>  {<br>    &quot;start&quot;: &quot;00:38&quot;,<br>    &quot;text&quot;: &quot;We may see entirely new genre or entirely new forms of art come up. 
There may be a new word that is not music, painting, photography, movie making, and that AI will have helped us create it.&quot;,<br>    &quot;voice&quot;: 6<br>  },<br>  {<br>    &quot;start&quot;: &quot;00:48&quot;,<br>    &quot;text&quot;: &quot;You really want AGI to be able to peer into the mysteries of the universe.&quot;,<br>    &quot;voice&quot;: 1<br>  },<br>  {<br>    &quot;start&quot;: &quot;00:51&quot;,<br>    &quot;text&quot;: &quot;Yes, quantum mechanics, string theory, well, and the nature of reality.&quot;,<br>    &quot;voice&quot;: 2<br>  },<br>  {<br>    &quot;start&quot;: &quot;00:55&quot;,<br>    &quot;text&quot;: &quot;Ow.&quot;,<br>    &quot;voice&quot;: 1<br>  },<br>  {<br>    &quot;start&quot;: &quot;00:56&quot;,<br>    &quot;text&quot;: &quot;the magic of AI.&quot;,<br>    &quot;voice&quot;: 6<br>  }<br>]</pre><pre>------------------------------- end of response --------------------------------</pre><p>This is looking good! And if you test these instructions on more complex videos, you’ll get similarly promising results.</p><p>Notice how the prompt reuses cherry-picked terms from the terminology previously provided by Gemini, while aiming for precision and concision:</p><ul><li>verbatim is unambiguous (unlike &quot;spoken words&quot;)</li><li>1, 2, 3, etc. is an ellipsis (Gemini can infer the pattern)</li><li>timecode is specific (timestamp has more meanings)</li><li>MM:SS clarifies the timecode format</li></ul><blockquote><em>💡 Gemini 2.0 was trained to understand the specific </em><em>MM:SS timecode format. Gemini 2.5 also supports the </em><em>H:MM:SS format for longer videos. For the latest updates, refer to the </em><a href="https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/video-understanding?utm_campaign=CDR_0x8c87a0bc_default_b436460646&amp;utm_medium=external&amp;utm_source=blog"><em>video understanding documentation</em></a><em>.</em></blockquote><p>We’re halfway there. 
Let’s complete our database generation with a second task…</p><h4>🧑 Speakers</h4><p>The second task is pretty straightforward: we want to extract speaker information into a second table. The two tables are logically linked by the voice ID.</p><p>After a few iterations, we can reach a two-task prompt like the following:</p><pre>prompt = &quot;&quot;&quot;<br>Generate a JSON object with keys `task1_transcripts` and `task2_speakers` for the following tasks.<br><br>**Task 1 - Transcripts**<br><br>- Watch the video and listen carefully to the audio.<br>- Identify the distinct voices using a `voice` ID (1, 2, 3, etc.).<br>- Transcribe the video&#39;s audio verbatim with voice diarization.<br>- Include the `start` timecode (MM:SS) for each speech segment.<br>- The `task1_transcripts` value is a JSON array where each object has the following fields:<br>  - `start`<br>  - `text`<br>  - `voice`<br><br>**Task 2 - Speakers**<br><br>- For each `voice` ID from Task 1, extract the name of the corresponding speaker.<br>- Use visual and audio cues.<br>- If a speaker&#39;s name cannot be found, use `?` as the value.<br>- The `task2_speakers` value is a JSON array where each object has the following fields:<br>  - `voice`<br>  - `name`<br><br>JSON:<br>&quot;&quot;&quot;<br>video = TestVideo.GDM_PODCAST_TRAILER_PT59S<br><br>generate_content(prompt, video, show_as=ShowAs.MARKDOWN)</pre><pre>----------------- GDM_PODCAST_TRAILER_PT59S / gemini-2.0-flash -----------------<br>Input tokens   :    16,920<br>Output tokens  :       806<br>------------------------------ start of response -------------------------------</pre><pre>{<br>  &quot;task1_transcripts&quot;: [<br>    {<br>      &quot;start&quot;: &quot;00:00&quot;,<br>      &quot;text&quot;: &quot;Do I have to call you Sir Demis now?&quot;,<br>      &quot;voice&quot;: 1<br>    },<br>    {<br>      &quot;start&quot;: &quot;00:01&quot;,<br>      &quot;text&quot;: &quot;Oh, you don&#39;t. 
Absolutely not.&quot;,<br>      &quot;voice&quot;: 2<br>    },<br>    {<br>      &quot;start&quot;: &quot;00:04&quot;,<br>      &quot;text&quot;: &quot;Welcome to Google Deep Mind the podcast with me, your host Professor Hannah Fry.&quot;,<br>      &quot;voice&quot;: 1<br>    },<br>    {<br>      &quot;start&quot;: &quot;00:06&quot;,<br>      &quot;text&quot;: &quot;We want to take you to the heart of where these ideas are coming from. We want to introduce you to the people who are leading the design of our collective future.&quot;,<br>      &quot;voice&quot;: 1<br>    },<br>    {<br>      &quot;start&quot;: &quot;00:19&quot;,<br>      &quot;text&quot;: &quot;Getting the safety right is probably, I&#39;d say, one of the most important challenges of our time. I want safe and capable.&quot;,<br>      &quot;voice&quot;: 3<br>    },<br>    {<br>      &quot;start&quot;: &quot;00:26&quot;,<br>      &quot;text&quot;: &quot;I want a bridge that will not collapse.&quot;,<br>      &quot;voice&quot;: 3<br>    },<br>    {<br>      &quot;start&quot;: &quot;00:30&quot;,<br>      &quot;text&quot;: &quot;That just give these scientists a superpower that they had not imagined earlier.&quot;,<br>      &quot;voice&quot;: 4<br>    },<br>    {<br>      &quot;start&quot;: &quot;00:34&quot;,<br>      &quot;text&quot;: &quot;autonomous vehicles. It&#39;s hard to fathom that when you&#39;re working on a search engine.&quot;,<br>      &quot;voice&quot;: 5<br>    },<br>    {<br>      &quot;start&quot;: &quot;00:38&quot;,<br>      &quot;text&quot;: &quot;We may see entirely new genre or entirely new forms of art come up. 
There may be a new word that is not music, painting, photography, movie making, and that AI will have helped us create it.&quot;,<br>      &quot;voice&quot;: 6<br>    },<br>    {<br>      &quot;start&quot;: &quot;00:48&quot;,<br>      &quot;text&quot;: &quot;You really want AGI to be able to peer into the mysteries of the universe.&quot;,<br>      &quot;voice&quot;: 1<br>    },<br>    {<br>      &quot;start&quot;: &quot;00:51&quot;,<br>      &quot;text&quot;: &quot;Yes, quantum mechanics, string theory, well, and the nature of reality.&quot;,<br>      &quot;voice&quot;: 2<br>    },<br>    {<br>      &quot;start&quot;: &quot;00:55&quot;,<br>      &quot;text&quot;: &quot;Ow.&quot;,<br>      &quot;voice&quot;: 1<br>    },<br>    {<br>      &quot;start&quot;: &quot;00:56&quot;,<br>      &quot;text&quot;: &quot;the magic of AI.&quot;,<br>      &quot;voice&quot;: 6<br>    }<br>  ],<br>  &quot;task2_speakers&quot;: [<br>    {<br>      &quot;voice&quot;: 1,<br>      &quot;name&quot;: &quot;Professor Hannah Fry&quot;<br>    },<br>    {<br>      &quot;voice&quot;: 2,<br>      &quot;name&quot;: &quot;Demis Hassabis&quot;<br>    },<br>    {<br>      &quot;voice&quot;: 3,<br>      &quot;name&quot;: &quot;Anca Dragan&quot;<br>    },<br>    {<br>      &quot;voice&quot;: 4,<br>      &quot;name&quot;: &quot;Pushmeet Kohli&quot;<br>    },<br>    {<br>      &quot;voice&quot;: 5,<br>      &quot;name&quot;: &quot;Jeff Dean&quot;<br>    },<br>    {<br>      &quot;voice&quot;: 6,<br>      &quot;name&quot;: &quot;Douglas Eck&quot;<br>    }<br>  ]<br>}</pre><pre>------------------------------- end of response --------------------------------</pre><p>Test this prompt on more complex videos: it’s still looking good!</p><p>▶️ Next up: 🚀 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part5-488b357b53b1">Finalization</a></p>
<hr><p><a href="https://medium.com/google-cloud/unlocking-multimodal-video-transcription-with-gemini-part4-3381b61aaaec">Unlocking Multimodal Video Transcription with Gemini — Part 4: 🏗️ Prompt Crafting</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Unlocking Multimodal Video Transcription with Gemini — Part 3: 🧪 Prototyping]]></title>
            <link>https://medium.com/google-cloud/unlocking-multimodal-video-transcription-with-gemini-part3-404c6c4b986c?source=rss-6be63961431c------2</link>
            <guid isPermaLink="false">https://medium.com/p/404c6c4b986c</guid>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[generative-ai]]></category>
            <category><![CDATA[video-transcription]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[gemini]]></category>
            <dc:creator><![CDATA[Laurent Picard]]></dc:creator>
            <pubDate>Thu, 04 Sep 2025 11:28:16 GMT</pubDate>
            <atom:updated>2025-11-18T10:44:07.145Z</atom:updated>
            <content:encoded><![CDATA[<h3>Unlocking Multimodal Video Transcription with Gemini — Part 3: 🧪 Prototyping</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KTMzyAd0b6Fjzr7Q63-tWA.png" /><figcaption>Generated with Gemini 2.5 Flash Image Preview (aka Nano Banana)</figcaption></figure><h4>Unlocking Multimodal Video Transcription with Gemini</h4><ol><li>🔥 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part1-02dc32118f41">Challenge</a></li><li>🛠️ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part2-43c491a0c4f1">Setup</a></li><li><strong>🧪 Prototyping ◀️</strong></li><li>🏗️ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part4-3381b61aaaec">Prompt Crafting</a></li><li>🚀 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part5-488b357b53b1">Finalization</a></li><li>✅ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part6-b1fc52729e4f">Challenge Complete</a></li><li>⚖️ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part7-74ee997d2096">Analysis, Tips &amp; Optimizations</a></li><li>🏁 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part8-eee478ba7eb0">Conclusion</a></li></ol><h3>🧪 Prototyping</h3><h4>🌱 Natural behavior</h4><p>Before diving any deeper, it’s interesting to see how Gemini responds to simple instructions, to develop some intuition about its natural behavior.</p><p>Let’s first see what we get with minimalistic prompts and a short English video:</p><pre>video = TestVideo.GDM_PODCAST_TRAILER_PT59S<br>display_video(video)<br><br>prompt = &quot;Transcribe the video&#39;s audio with time information.&quot;<br>generate_content(prompt, video)</pre><p><strong>Video (</strong><a 
href="https://www.youtube.com/watch?v=0pJn3g8dfwk"><strong>source</strong></a><strong>)</strong></p><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2F0pJn3g8dfwk%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D0pJn3g8dfwk&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2F0pJn3g8dfwk%2Fhqdefault.jpg&amp;type=text%2Fhtml&amp;schema=youtube" width="854" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/211c006355aff3c68f32fc891b34b5fa/href">https://medium.com/media/211c006355aff3c68f32fc891b34b5fa/href</a></iframe><pre>----------------- GDM_PODCAST_TRAILER_PT59S / gemini-2.0-flash -----------------<br>Input tokens   :    16,708<br>Output tokens  :       421<br>------------------------------ start of response -------------------------------<br>[00:00:00] Do I have to call you Sir Demis now?<br>[00:00:01] Oh, you don&#39;t.<br>[00:00:02] Absolutely not.<br>[00:00:04] Welcome to Google DeepMind the podcast with me, your host Professor Hannah Fry.<br>[00:00:06] We want to take you to the heart of where these ideas are coming from.<br>[00:00:12] We want to introduce you to the people who are leading the design of our collective future.<br>[00:00:19] Getting the safety right is probably, I&#39;d say, one of the most important challenges of our time.<br>[00:00:25] I want safe and capable.<br>[00:00:27] I want a bridge that will not collapse.<br>[00:00:30] just give these scientists a superpower that they had not imagined earlier.<br>[00:00:34] autonomous vehicles.<br>[00:00:35] It&#39;s hard to fathom that when you&#39;re working on a search engine.<br>[00:00:38] We may see entirely new genre or entirely new forms of art come up.<br>[00:00:42] There may be a new word that is not music, painting, photography, movie making, and that AI will have helped us create it.<br>[00:00:48] You really want AGI to be able to peer into the mysteries of the 
universe.<br>[00:00:51] Yes, quantum mechanics, string theory, well, and the nature of reality.<br>[00:00:55] Ow.<br>[00:00:57] the magic of AI.<br>------------------------------- end of response --------------------------------</pre><p>Results:</p><ul><li>Gemini naturally outputs a list of [time] transcript lines.</li><li>That’s Speech-to-Text in one line!</li><li>It looks like we can answer “1️⃣ What was said and when?”.</li></ul><p>Now, what about “2️⃣ Who are the speakers?”</p><pre>prompt = &quot;List the speakers identifiable in the video.&quot;<br>generate_content(prompt, video)</pre><pre>----------------- GDM_PODCAST_TRAILER_PT59S / gemini-2.0-flash -----------------<br>Input tokens   :    16,705<br>Output tokens  :        46<br>------------------------------ start of response -------------------------------<br>Here are the speakers identifiable in the video:<br><br>*   Professor Hannah Fry<br>*   Demis Hassabis<br>*   Anca Dragan<br>*   Pushmeet Kohli<br>*   Jeff Dean<br>*   Douglas Eck<br>------------------------------- end of response --------------------------------</pre><p>Results:</p><ul><li>Gemini can consolidate the names visible on title cards during the video.</li><li>That’s OCR + entity extraction in one line!</li><li>“2️⃣ Who are the speakers?” looks solved too!</li></ul><h4>⏩ Not so fast!</h4><p>The natural next step is to jump to the final instructions, to solve our problem once and for all.</p><pre>prompt = &quot;&quot;&quot;<br>Transcribe the video&#39;s audio including speaker names (use `?` if not found).<br><br>Format example:<br>[00:02] John Doe - Hello Alice!<br>&quot;&quot;&quot;<br>generate_content(prompt, video)</pre><pre>----------------- GDM_PODCAST_TRAILER_PT59S / gemini-2.0-flash -----------------<br>Input tokens   :    16,732<br>Output tokens  :       378<br>------------------------------ start of response -------------------------------<br>Here is the audio transcription of the video:<br><br>[00:00] ? 
- Do I have to call you Sir Demis now?<br>[00:01] Demis Hassabis - Oh, you don&#39;t. Absolutely not.<br>[00:03] Professor Hannah Fry - Welcome to Google DeepMind the podcast with me, your host, Professor Hannah Fry.<br>[00:06] Professor Hannah Fry - We want to take you to the heart of where these ideas are coming from. We want to introduce you to the people who are leading the design of our collective future.<br>[00:19] Anca Dragan - Getting the safety right is probably, I&#39;d say, one of the most important challenges of our time. I want safe and capable. I want a bridge that will not collapse.<br>[00:29] Pushmeet Kohli - Just give these scientists a superpower that they had not imagined earlier.<br>[00:34] Jeff Dean - Autonomous vehicles. It&#39;s hard to fathom that when you&#39;re working on a search engine.<br>[00:38] Douglas Eck - We may see entirely new genre or entirely new forms of art come up. There may be a new word that is not music, painting, photography, movie making, and that AI will have helped us create it.<br>[00:48] Professor Hannah Fry - You really want AGI to be able to peer into the mysteries of the universe.<br>[00:51] Demis Hassabis - Yes, quantum mechanics, string theory, well, and the nature of reality.<br>[00:55] Professor Hannah Fry - Ow!<br>[00:56] Douglas Eck - The magic of AI.<br>------------------------------- end of response --------------------------------</pre><p>This is almost correct. 
The first segment is not attributed to the host (who is only introduced a bit later), but everything else looks correct.</p><p>Nonetheless, these are not real-world conditions:</p><ul><li>The video is very short (less than a minute)</li><li>The video is also rather simple (speakers are clearly introduced with on-screen title cards)</li></ul><p>Let’s try with this 8-minute (and more complex) video:</p><pre>generate_content(prompt, TestVideo.GDM_ALPHAFOLD_PT7M54S)</pre><pre>------------------- GDM_ALPHAFOLD_PT7M54S / gemini-2.0-flash -------------------<br>Input tokens   :   134,177<br>Output tokens  :     2,689<br>------------------------------ start of response -------------------------------<br>[00:02] ? - We&#39;ve discovered more about the world than any other civilization before us.<br>[00:08] ? - But we have been stuck on this one problem.<br>[00:11] ? - How do proteins fold up?<br>[00:13] ? - How do proteins go from a string of amino acids to a compact shape that acts as a machine and drives life?<br>[00:22] ? - When you find out about proteins, it is very exciting.<br>[00:25] ? - You could think of them as little biological nano machines.<br>[00:28] ? - They are essentially the fundamental building blocks that power everything living on this planet.<br>[00:34] ? - If we can reliably predict protein structures using AI, that could change the way we understand the natural world.<br>[00:46] ? - Protein folding is one of these holy grail type problems in biology.<br>[00:50] Demis Hassabis - We&#39;ve always hypothesized that AI should be helpful to make these kinds of big scientific breakthroughs more quickly.<br>[00:58] ? - And then I&#39;ll probably be looking at little tunings that might make a difference.<br>[01:02] ? - It should be creating a histogram on and a background skill.<br>[01:04] ? - We&#39;ve been working on our system AlphaFold really hard now for over two years.<br>[01:08] ? 
- Rather than having to do painstaking experiments, in the future biologists might be able to instead rely on AI methods to directly predict structures quickly and efficiently.<br>[01:17] Kathryn Tunyasuvunakool - Generally speaking, biologists tend to be quite skeptical of computational work, and I think that skepticism is healthy and I respect it, but I feel very excited about what AlphaFold can achieve.<br>[01:28] Andrew Senior - CASP is when we, we say, look, DeepMind is doing protein folding.<br>[01:31] Andrew Senior - This is how good we are, and maybe it&#39;s better than everybody else, maybe it isn&#39;t.<br>[01:37] ? - We decided to enter CASP competition because it represented the Olympics of protein folding.<br>[01:44] John Moult - CASP, we started to try and speed up the solution to the protein folding problem.<br>[01:50] John Moult - When we started CASP in 1994, I certainly was naive about how hard this was going to be.<br>[01:58] ? - It was very cumbersome to do that because it took a long time.<br>[02:01] ? - Let&#39;s see what, what, what are we doing still to improve?<br>[02:03] ? - Typically 100 different groups from around the world participate in CASP, and we take a set of 100 proteins and we ask the groups to send us what they think the structures look like.<br>[02:15] ? - We can reach 57.9 GDT on CASP 12 ground truth.<br>[02:19] John Jumper - CASP has a metric on which you will be scored, which is this GDT metric.<br>[02:25] John Jumper - On a scale of zero to 100, you would expect a GDT over 90 to be a solution to the problem.<br>[02:33] ? - If we do achieve this, this has incredible medical relevance.<br>[02:37] Pushmeet Kohli - The implications are immense, from how diseases progress, how you can discover new drugs.<br>[02:45] Pushmeet Kohli - It&#39;s endless.<br>[02:46] ? - I wanted to make a, a really simple system and the results have been surprisingly good.<br>[02:50] ? 
- The team got some results with a new technique, not only is it more accurate, but it&#39;s much faster than the old system.<br>[02:56] ? - I think we&#39;ll substantially exceed what we&#39;re doing right now.<br>[02:59] ? - This is a game, game changer, I think.<br>[03:01] John Moult - In CASP 13, something very significant had happened.<br>[03:06] John Moult - For the first time, we saw the effective application of artificial intelligence.<br>[03:11] ? - We&#39;ve advanced the state of the art in the field, so that&#39;s fantastic, but we still got a long way to go before we&#39;ve solved it.<br>[03:18] ? - The shapes were now approximately correct for many of the proteins, but the details, exactly where each atom sits, which is really what we would call a solution, we&#39;re not yet there.<br>[03:29] ? - It doesn&#39;t help if you have the tallest ladder when you&#39;re going to the moon.<br>[03:33] ? - We hit a little bit of a brick wall, um, since we won CASP, then it was back to the drawing board and like what are our new ideas?<br>[03:41] ? - Um, and then it&#39;s taken a little while, I would say, for them to get back to where they were, but with the new ideas.<br>[03:51] ? - They can go further, right?<br>[03:52] ? - So, um, so that&#39;s a really important moment.<br>[03:55] ? - I&#39;ve seen that moment so many times now, but I know what that means now, and I know this is the time now to press.<br>[04:02] ? - We need to double down and go as fast as possible from here.<br>[04:05] ? - I think we&#39;ve got no time to lose.<br>[04:07] ? - So the intention is to enter CASP again.<br>[04:09] ? - CASP is deeply stressful.<br>[04:12] ? - There&#39;s something weird going on with, um, the learning because it is learning something that&#39;s correlated with GDT, but it&#39;s not calibrated.<br>[04:18] ? - I feel slightly uncomfortable.<br>[04:20] ? - We should be learning this, you know, in the blink of an eye.<br>[04:23] ? 
- The technology advancing outside DeepMind is also doing incredible work.<br>[04:27] Richard Evans - And there&#39;s always the possibility another team has come somewhere out there field that we don&#39;t even know about.<br>[04:32] ? - Someone asked me, well, should we panic now?<br>[04:33] ? - Of course, we should have been panicking before.<br>[04:35] ? - It does seem to do better, but still doesn&#39;t do quite as well as the best model.<br>[04:39] ? - Um, so it looks like there&#39;s room for improvement.<br>[04:42] ? - There&#39;s always a risk that you&#39;ve missed something, and that&#39;s why blind assessments like CASP are so important to validate whether our results are real.<br>[04:49] ? - Obviously, I&#39;m excited to see how CASP 14 goes.<br>[04:51] ? - My expectation is we get our heads down, we focus on the full goal, which is to solve the whole problem.<br>[05:14] ? - We were prepared for CASP to start on April 15th because that&#39;s when it was originally scheduled to start, and it&#39;s been delayed by a month due to coronavirus.<br>[05:24] ? - I really miss everyone.<br>[05:25] ? - No, I struggled a little bit just kind of getting into a routine, especially, uh, my wife, she came down with the, the virus.<br>[05:32] ? - I mean, luckily it didn&#39;t turn out too serious.<br>[05:34] ? - CASP started on Monday.<br>[05:37] Demis Hassabis - Can I just check this diagram you&#39;ve got here, John, this one where we ask ground truth.<br>[05:40] Demis Hassabis - Is this one we&#39;ve done badly on?<br>[05:42] ? - We&#39;re actually quite good on this region.<br>[05:43] ? - If you imagine that we hadn&#39;t have said it came around this way, but had put it in.<br>[05:47] ? - Yeah, and that instead.<br>[05:48] ? - Yeah.<br>[05:49] ? - One of the hardest proteins we&#39;ve gotten in CASP thus far is a SARS-CoV-2 protein, uh, called Orf8.<br>[05:55] ? - Orf8 is a coronavirus protein.<br>[05:57] ? 
- We tried really hard to improve our prediction, like really, really hard, probably the most time that we have ever spent on a single target.<br>[06:05] ? - So we&#39;re about two-thirds of the way through CASP, and we&#39;ve gotten three answers back.<br>[06:11] ? - We now have a ground truth for Orf8, which is one of the coronavirus proteins.<br>[06:17] ? - And it turns out we did really well in predicting that.<br>[06:20] Demis Hassabis - Amazing job, everyone, the whole team.<br>[06:23] Demis Hassabis - It&#39;s been an incredible effort.<br>[06:24] John Moult - Here what we saw in CASP 14 was a group delivering atomic accuracy off the bat, essentially solving what in our world is two problems.<br>[06:34] John Moult - How do you look to find the right solution, and then how do you recognize you&#39;ve got the right solution when you&#39;re there?<br>[06:41] ? - All right, are we, are we mostly here?<br>[06:46] ? - I&#39;m going to read an email.<br>[06:48] ? - Uh, I got this from John Moult.<br>[06:50] ? - Now I&#39;ll just read it.<br>[06:51] ? - It says, John, as I expect you know, your group has performed amazingly well in CASP 14, both relative to other groups and in absolute model accuracy.<br>[07:02] ? - Congratulations on this work.<br>[07:05] ? 
- It is really outstanding.<br>[07:07] Demis Hassabis - AlphaFold represents a huge leap forward that I hope will really accelerate drug discovery and help us to better understand disease.<br>[07:13] John Moult - It&#39;s pretty mind-blowing.<br>[07:16] John Moult - You know, these results were, for me, having worked on this problem so long, after many, many stops and starts and will this ever get there, suddenly this is a solution.<br>[07:28] John Moult - We&#39;ve solved the problem.<br>[07:29] John Moult - This gives you such excitement about the way science works, about how you can never see exactly or even approximately what&#39;s going to happen next.<br>[07:37] John Moult - There are always these surprises, and that really, as a scientist, is what keeps you going.<br>[07:41] John Moult - What&#39;s going to be the next surprise?<br>------------------------------- end of response --------------------------------</pre><p>This falls apart: Most segments have no identified speaker!</p><p>As we are trying to solve a new complex problem, LLMs haven’t been trained on any known solution. This is likely why direct instructions do not yield the expected answer.</p><p>At this stage:</p><ul><li>We might conclude that we can’t solve the problem with real-world videos.</li><li>Persevering by trying more and more elaborate prompts for this unsolved problem might result in a waste of time.</li></ul><p>Let’s take a step back and think about what happens under the hood…</p><h3>⚛️ Under the hood</h3><p>Modern LLMs are mostly based on the Transformer architecture, a new neural network design detailed in a 2017 paper by Google researchers titled <a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a>. The paper introduced the self-attention mechanism, a key innovation that fundamentally changed the way machines process language.</p><h4>🪙 Tokens</h4><p>Tokens are the LLM building blocks. 
We can consider a token to represent a piece of information.</p><p>Examples of Gemini multimodal tokens (with default parameters):</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/770/1*pfL7dxGNV-OuCUx2XbuTpw.png" /></figure><h4>🎞️ Sampling frame rate</h4><p>By default, video frames are sampled at 1 frame per second (1 FPS). These frames are included in the context with their corresponding timecodes.</p><p>You can use a custom sampling frame rate with the Part.video_metadata.fps parameter:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/539/1*jTHgdDo0RCaZOxJE1rXp-A.png" /></figure><blockquote><em>💡 For fps &gt; 1.0, Gemini was trained to understand MM:SS.sss and H:MM:SS.sss timecodes.</em></blockquote><h4>🔍 Media resolution</h4><p>By default, each sampled frame is represented with 258 tokens.</p><p>You can specify a medium or low media resolution with the GenerateContentConfig.media_resolution parameter:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/908/1*f0xZB5z0u0ciEiGWCuxzjQ.png" /></figure><blockquote><em>💡 The “media resolution” can be seen as the “image token resolution”: the number of tokens used to represent an image.</em></blockquote><h4>🧮 Probabilities all the way down</h4><p>The ability of LLMs to communicate in flawless natural language is very impressive, but it’s easy to get carried away and make incorrect assumptions.</p><p>Keep in mind how LLMs work:</p><ul><li>An LLM is trained on a massive tokenized dataset, which represents its knowledge (its long-term memory)</li><li>During training, its neural network learns token patterns</li><li>When you send a request to an LLM, your inputs are transformed into tokens (tokenization)</li><li>To answer your request, the LLM predicts, token by token, the most likely next tokens</li><li>Overall, LLMs are exceptional statistical token prediction machines that seem to mimic how some parts of our brain work</li></ul><p>This has a few 
consequences:</p><ul><li>LLM outputs are just statistically likely follow-ups to your inputs</li><li>LLMs show some forms of reasoning: they can match complex patterns but have no actual deep understanding</li><li>LLMs have no consciousness: they are designed to generate tokens and will do so based on your instructions</li><li>Order matters: Tokens that are generated first will influence tokens that are generated next</li></ul><p>For the next step, some methodical prompt crafting might help…</p><p>▶️ Next up: 🏗️ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part4-3381b61aaaec">Prompt Crafting</a></p><hr><p><a href="https://medium.com/google-cloud/unlocking-multimodal-video-transcription-with-gemini-part3-404c6c4b986c">Unlocking Multimodal Video Transcription with Gemini — Part 3: 🧪 Prototyping</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Unlocking Multimodal Video Transcription with Gemini — Part 2: 🛠️ Setup]]></title>
            <link>https://medium.com/google-cloud/unlocking-multimodal-video-transcription-with-gemini-part2-43c491a0c4f1?source=rss-6be63961431c------2</link>
            <guid isPermaLink="false">https://medium.com/p/43c491a0c4f1</guid>
            <category><![CDATA[gemini]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[llm]]></category>
            <category><![CDATA[generative-ai]]></category>
            <category><![CDATA[video-transcription]]></category>
            <dc:creator><![CDATA[Laurent Picard]]></dc:creator>
            <pubDate>Thu, 04 Sep 2025 11:27:59 GMT</pubDate>
            <atom:updated>2025-11-18T10:41:47.224Z</atom:updated>
            <content:encoded><![CDATA[<h3>Unlocking Multimodal Video Transcription with Gemini — Part 2: 🛠️ Setup</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*L8IflMlq9VtZVMKcN0E0kA.png" /><figcaption>Generated with Gemini 2.5 Flash Image Preview (aka Nano Banana)</figcaption></figure><h4>Unlocking Multimodal Video Transcription with Gemini</h4><ol><li>🔥 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part1-02dc32118f41">Challenge</a></li><li><strong>🛠️ Setup ◀️</strong></li><li>🧪 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part3-404c6c4b986c">Prototyping</a></li><li>🏗️ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part4-3381b61aaaec">Prompt Crafting</a></li><li>🚀 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part5-488b357b53b1">Finalization</a></li><li>✅ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part6-b1fc52729e4f">Challenge Complete</a></li><li>⚖️ <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part7-74ee997d2096">Analysis, Tips &amp; Optimizations</a></li><li>🏁 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part8-eee478ba7eb0">Conclusion</a></li></ol><h3>🏁 Setup</h3><h4>🐍 Python packages</h4><p>We’ll use the following packages:</p><ul><li>google-genai: the <a href="https://pypi.org/project/google-genai">Google Gen AI Python SDK</a> lets us call Gemini with a few lines of code</li><li>pandas for data visualization</li></ul><p>We’ll also use these packages (dependencies of google-genai):</p><ul><li>pydantic for data management</li><li>tenacity for request management</li></ul><pre>pip install --quiet &quot;google-genai&gt;=1.49.0&quot; &quot;pandas[output-formatting]&quot;</pre><h4>🔗 Gemini API</h4><p>We 
have two main options to send requests to Gemini:</p><ul><li><a href="https://cloud.google.com/vertex-ai/generative-ai/docs?utm_campaign=CDR_0x8c87a0bc_default_b436460646&amp;utm_medium=external&amp;utm_source=blog">Vertex AI</a>: Build enterprise-ready projects on Google Cloud</li><li><a href="https://aistudio.google.com?utm_campaign=CDR_0x8c87a0bc_default_b436460646&amp;utm_medium=external&amp;utm_source=blog">Google AI Studio</a>: Experiment, prototype, and deploy small projects</li></ul><p>The Google Gen AI SDK provides a unified interface to these APIs and we can use environment variables for the configuration.</p><p><strong>Option A — Gemini API via Vertex AI</strong></p><p>Requirements:</p><ul><li>A Google Cloud project</li><li>The <a href="https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com&amp;utm_campaign=CDR_0x8c87a0bc_default_b436460646&amp;utm_medium=external&amp;utm_source=blog">Vertex AI API</a> must be enabled for this project</li></ul><p>Gen AI SDK environment variables:</p><ul><li>GOOGLE_GENAI_USE_VERTEXAI=&quot;True&quot;</li><li>GOOGLE_CLOUD_PROJECT=&quot;&lt;PROJECT_ID&gt;&quot;</li><li>GOOGLE_CLOUD_LOCATION=&quot;&lt;LOCATION&gt;&quot; (see <a href="https://cloud.google.com/vertex-ai/generative-ai/docs/learn/locations?utm_campaign=CDR_0x8c87a0bc_default_b436460646&amp;utm_medium=external&amp;utm_source=blog#google_model_endpoint_locations">Google model endpoint locations</a>)</li></ul><p>Learn more about <a href="https://cloud.google.com/vertex-ai/docs/start/cloud-environment?utm_campaign=CDR_0x8c87a0bc_default_b436460646&amp;utm_medium=external&amp;utm_source=blog">setting up a project and a development environment</a>.</p><p><strong>Option B — Gemini API via Google AI Studio</strong></p><p>Requirement:</p><ul><li>A Gemini API key</li></ul><p>Gen AI SDK environment variables:</p><ul><li>GOOGLE_GENAI_USE_VERTEXAI=&quot;False&quot;</li><li>GOOGLE_API_KEY=&quot;&lt;API_KEY&gt;&quot;</li></ul><p>Learn more about <a 
href="https://aistudio.google.com/app/apikey?utm_campaign=CDR_0x8c87a0bc_default_b436460646&amp;utm_medium=external&amp;utm_source=blog">getting a Gemini API key from Google AI Studio</a>.</p><p>💡 You can store your environment configuration outside of the source code:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/680/1*4jEnJYbVrNTwUkl0sI5yag.png" /></figure><p>Define the following environment detection functions. You can also define your configuration manually if needed:</p><pre># @title {display-mode: &quot;form&quot;}<br><br>import os<br>import sys<br>from collections.abc import Callable<br><br>from google import genai<br><br># Manual setup (leave unchanged if setup is environment-defined)<br><br># @markdown **Which API: Vertex AI or Google AI Studio?**<br>GOOGLE_GENAI_USE_VERTEXAI = True  # @param {type: &quot;boolean&quot;}<br><br># @markdown **Option A - Google Cloud project [+location]**<br>GOOGLE_CLOUD_PROJECT = &quot;&quot;  # @param {type: &quot;string&quot;}<br>GOOGLE_CLOUD_LOCATION = &quot;global&quot;  # @param {type: &quot;string&quot;}<br><br># @markdown **Option B - Google AI Studio API key**<br>GOOGLE_API_KEY = &quot;&quot;  # @param {type: &quot;string&quot;}<br><br><br>def check_environment() -&gt; bool:<br>    check_colab_user_authentication()<br>    return check_manual_setup() or check_vertex_ai() or check_colab() or check_local()<br><br><br>def check_manual_setup() -&gt; bool:<br>    return check_define_env_vars(<br>        GOOGLE_GENAI_USE_VERTEXAI,<br>        GOOGLE_CLOUD_PROJECT.strip(),  # Might have been pasted with line return<br>        GOOGLE_CLOUD_LOCATION,<br>        GOOGLE_API_KEY,<br>    )<br><br><br>def check_vertex_ai() -&gt; bool:<br>    # Workbench and Colab Enterprise<br>    match os.getenv(&quot;VERTEX_PRODUCT&quot;, &quot;&quot;):<br>        case &quot;WORKBENCH_INSTANCE&quot;:<br>            pass<br>        case &quot;COLAB_ENTERPRISE&quot;:<br>            if not running_in_colab_env():<br>                
return False<br>        case _:<br>            return False<br><br>    return check_define_env_vars(<br>        True,<br>        os.getenv(&quot;GOOGLE_CLOUD_PROJECT&quot;, &quot;&quot;),<br>        os.getenv(&quot;GOOGLE_CLOUD_REGION&quot;, &quot;&quot;),<br>        &quot;&quot;,<br>    )<br><br><br>def check_colab() -&gt; bool:<br>    if not running_in_colab_env():<br>        return False<br><br>    # Colab Enterprise was checked before, so this is Colab only<br>    from google.colab import auth as colab_auth  # type: ignore<br><br>    colab_auth.authenticate_user()<br><br>    # Use Colab Secrets (🗝️ icon in left panel) to store the environment variables<br>    # Secrets are private, visible only to you and the notebooks that you select<br>    # - Vertex AI: Store your settings as secrets<br>    # - Google AI: Directly import your Gemini API key from the UI<br>    vertexai, project, location, api_key = get_vars(get_colab_secret)<br><br>    return check_define_env_vars(vertexai, project, location, api_key)<br><br><br>def check_local() -&gt; bool:<br>    vertexai, project, location, api_key = get_vars(os.getenv)<br><br>    return check_define_env_vars(vertexai, project, location, api_key)<br><br><br>def running_in_colab_env() -&gt; bool:<br>    # Colab or Colab Enterprise<br>    return &quot;google.colab&quot; in sys.modules<br><br><br>def check_colab_user_authentication() -&gt; None:<br>    if running_in_colab_env():<br>        from google.colab import auth as colab_auth  # type: ignore<br><br>        colab_auth.authenticate_user()<br><br><br>def get_colab_secret(secret_name: str, default: str) -&gt; str:<br>    from google.colab import errors, userdata  # type: ignore<br><br>    try:<br>        return userdata.get(secret_name)<br>    except errors.SecretNotFoundError:<br>        return default<br><br><br>def get_vars(getenv: Callable[[str, str], str]) -&gt; tuple[bool, str, str, str]:<br>    # Limit getenv calls to the minimum (may trigger UI confirmation for 
secret access)<br>    vertexai_str = getenv(&quot;GOOGLE_GENAI_USE_VERTEXAI&quot;, &quot;&quot;)<br>    if vertexai_str:<br>        vertexai = vertexai_str.lower() in [&quot;true&quot;, &quot;1&quot;]<br>    else:<br>        vertexai = bool(getenv(&quot;GOOGLE_CLOUD_PROJECT&quot;, &quot;&quot;))<br><br>    project = getenv(&quot;GOOGLE_CLOUD_PROJECT&quot;, &quot;&quot;) if vertexai else &quot;&quot;<br>    location = getenv(&quot;GOOGLE_CLOUD_LOCATION&quot;, &quot;&quot;) if project else &quot;&quot;<br>    api_key = getenv(&quot;GOOGLE_API_KEY&quot;, &quot;&quot;) if not project else &quot;&quot;<br><br>    return vertexai, project, location, api_key<br><br><br>def check_define_env_vars(<br>    vertexai: bool,<br>    project: str,<br>    location: str,<br>    api_key: str,<br>) -&gt; bool:<br>    match (vertexai, bool(project), bool(location), bool(api_key)):<br>        case (True, True, _, _):<br>            # Vertex AI - Google Cloud project [+location]<br>            location = location or &quot;global&quot;<br>            define_env_vars(vertexai, project, location, &quot;&quot;)<br>        case (True, False, _, True):<br>            # Vertex AI - API key<br>            define_env_vars(vertexai, &quot;&quot;, &quot;&quot;, api_key)<br>        case (False, _, _, True):<br>            # Google AI Studio - API key<br>            define_env_vars(vertexai, &quot;&quot;, &quot;&quot;, api_key)<br>        case _:<br>            return False<br><br>    return True<br><br><br>def define_env_vars(vertexai: bool, project: str, location: str, api_key: str) -&gt; None:<br>    os.environ[&quot;GOOGLE_GENAI_USE_VERTEXAI&quot;] = str(vertexai)<br>    os.environ[&quot;GOOGLE_CLOUD_PROJECT&quot;] = project<br>    os.environ[&quot;GOOGLE_CLOUD_LOCATION&quot;] = location<br>    os.environ[&quot;GOOGLE_API_KEY&quot;] = api_key<br><br><br>def check_configuration(client: genai.Client) -&gt; None:<br>    service = &quot;Vertex AI&quot; if client.vertexai else &quot;Google AI 
Studio&quot;<br>    print(f&quot;Using the {service} API&quot;, end=&quot;&quot;)<br><br>    if client._api_client.project:<br>        print(f&#39; with project &quot;{client._api_client.project[:7]}…&quot;&#39;, end=&quot;&quot;)<br>        print(f&#39; in location &quot;{client._api_client.location}&quot;&#39;)<br>    elif client._api_client.api_key:<br>        api_key = client._api_client.api_key<br>        print(f&#39; with API key &quot;{api_key[:5]}…{api_key[-5:]}&quot;&#39;, end=&quot;&quot;)<br>        print(f&quot; (in case of error, make sure it was created for {service})&quot;)</pre><h4>🤖 Gen AI SDK</h4><p>To send Gemini requests, create a google.genai client:</p><pre>from google import genai<br><br>check_environment()<br><br>client = genai.Client()</pre><p>Check your configuration:</p><pre>check_configuration(client)</pre><pre>Using the Vertex AI API with project &quot;lpdemo-...&quot; in location &quot;europe-west9&quot;</pre><h4>🧠 Gemini model</h4><p>Gemini comes in different <a href="https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models?utm_campaign=CDR_0x8c87a0bc_default_b436460646&amp;utm_medium=external&amp;utm_source=blog#gemini-models">versions</a>.</p><p>Let’s get started with Gemini 2.0 Flash, as it offers both high performance and low latency:</p><ul><li>GEMINI_2_0_FLASH = &quot;gemini-2.0-flash&quot;</li></ul><blockquote><em>💡 We select Gemini 2.0 Flash intentionally. The Gemini 2.5 model family is generally available and even more capable, but we want to experiment and understand Gemini’s core multimodal behavior. If we complete our challenge with 2.0, this should also work with newer models.</em></blockquote><h4>⚙️ Gemini configuration</h4><p>Gemini can be used in different ways, ranging from factual to creative mode. The problem we’re trying to solve is a <strong>data extraction</strong> use case. We want results as factual and deterministic as possible. 
For this, we can change the <a href="https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/content-generation-parameters?utm_campaign=CDR_0x8c87a0bc_default_b436460646&amp;utm_medium=external&amp;utm_source=blog">content generation parameters</a>.</p><p>We’ll set the temperature, top_p, and seed parameters to minimize randomness:</p><ul><li>temperature=0.0</li><li>top_p=0.0</li><li>seed=42 (arbitrary fixed value)</li></ul><h4>🎞️ Video sources</h4><p>Here are the main video sources that Gemini can analyze:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/908/1*Z1EgdhKLAUBPZuK6CaZqmQ.png" /></figure><p>⚠️ Important notes</p><ul><li>Our video test suite primarily uses public YouTube videos. This is for simplicity.</li><li>When analyzing YouTube sources, Gemini receives raw audio/video streams without any additional metadata, exactly as if processing the corresponding video files from Cloud Storage.</li><li>YouTube does offer caption/subtitle/transcript features (user-provided or auto-generated). However, these features focus on word-level speech-to-text and are limited to 40+ languages. 
Gemini does not receive any of this data and you’ll see that a multimodal transcription with Gemini provides additional benefits.</li><li>Furthermore, our challenge also involves identifying speakers and extracting speaker data, a unique new capability.</li></ul><h4>🛠️ Helpers</h4><p>Define our helper functions and data:</p><pre>import enum<br>from dataclasses import dataclass<br>from datetime import timedelta<br>from typing import Any<br><br>import IPython.display<br>import tenacity<br>from google.genai.errors import ClientError<br>from google.genai.types import (<br>    FileData,<br>    FinishReason,<br>    GenerateContentConfig,<br>    GenerateContentResponse,<br>    Part,<br>    VideoMetadata,<br>)<br><br><br>class Model(enum.Enum):<br>    # Generally Available (GA)<br>    GEMINI_2_0_FLASH = &quot;gemini-2.0-flash&quot;<br>    GEMINI_2_5_FLASH = &quot;gemini-2.5-flash&quot;<br>    GEMINI_2_5_PRO = &quot;gemini-2.5-pro&quot;<br>    # Default model<br>    DEFAULT = GEMINI_2_0_FLASH<br><br><br># Default configuration for more deterministic outputs<br>DEFAULT_CONFIG = GenerateContentConfig(<br>    temperature=0.0,<br>    top_p=0.0,<br>    seed=42,  # Arbitrary fixed value<br>)<br><br>YOUTUBE_URL_PREFIX = &quot;https://www.youtube.com/watch?v=&quot;<br>CLOUD_STORAGE_URI_PREFIX = &quot;gs://&quot;<br><br><br>def url_for_youtube_id(youtube_id: str) -&gt; str:<br>    return f&quot;{YOUTUBE_URL_PREFIX}{youtube_id}&quot;<br><br><br>class Video(enum.Enum):<br>    pass<br><br><br>class TestVideo(Video):<br>    # For testing purposes, video duration is statically specified in the enum name<br>    # Suffix (ISO 8601 based): _PT[&lt;h&gt;H][&lt;m&gt;M][&lt;s&gt;S]<br><br>    # Google DeepMind | The Podcast | Season 3 Trailer | 59s<br>    GDM_PODCAST_TRAILER_PT59S = url_for_youtube_id(&quot;0pJn3g8dfwk&quot;)<br>    # Google Maps | Walk in the footsteps of Jane Goodall | 2min 42s<br>    JANE_GOODALL_PT2M42S = &quot;gs://cloud-samples-data/video/JaneGoodall.mp4&quot;<br>    # 
Google DeepMind | AlphaFold | The making of a scientific breakthrough | 7min 54s<br>    GDM_ALPHAFOLD_PT7M54S = url_for_youtube_id(&quot;gg7WjuFs8F4&quot;)<br>    # Brut | French reportage | 8min 28s<br>    BRUT_FR_DOGS_WATER_LEAK_PT8M28S = url_for_youtube_id(&quot;U_yYkb-ureI&quot;)<br>    # Google DeepMind | The Podcast | AI for science | 54min 23s<br>    GDM_AI_FOR_SCIENCE_FRONTIER_PT54M23S = url_for_youtube_id(&quot;nQKmVhLIGcs&quot;)<br>    # Google I/O 2025 | Developer Keynote | 1h 10min 03s<br>    GOOGLE_IO_DEV_KEYNOTE_PT1H10M03S = url_for_youtube_id(&quot;GjvgtwSOCao&quot;)<br>    # Google Cloud | Next 2025 | Opening Keynote | 1h 40min 03s<br>    GOOGLE_CLOUD_NEXT_PT1H40M03S = url_for_youtube_id(&quot;Md4Fs-Zc3tg&quot;)<br>    # Google I/O 2025 | Keynote | 1h 56min 35s<br>    GOOGLE_IO_KEYNOTE_PT1H56M35S = url_for_youtube_id(&quot;o8NiE3XMPrM&quot;)<br><br><br>class ShowAs(enum.Enum):<br>    DONT_SHOW = enum.auto()<br>    TEXT = enum.auto()<br>    MARKDOWN = enum.auto()<br><br><br>@dataclass<br>class VideoSegment:<br>    start: timedelta<br>    end: timedelta<br><br><br>def generate_content(<br>    prompt: str,<br>    video: Video | None = None,<br>    video_segment: VideoSegment | None = None,<br>    model: Model | None = None,<br>    config: GenerateContentConfig | None = None,<br>    show_as: ShowAs = ShowAs.TEXT,<br>) -&gt; None:<br>    prompt = prompt.strip()<br>    model = model or Model.DEFAULT<br>    config = config or DEFAULT_CONFIG<br><br>    model_id = model.value<br>    if video:<br>        if not (video_part := get_video_part(video, video_segment)):<br>            return<br>        contents = [video_part, prompt]<br>        caption = f&quot;{video.name} / {model_id}&quot;<br>    else:<br>        contents = prompt<br>        caption = f&quot;{model_id}&quot;<br>    print(f&quot; {caption} &quot;.center(80, &quot;-&quot;))<br><br>    for attempt in get_retrier():<br>        with attempt:<br>            response = 
client.models.generate_content(<br>                model=model_id,<br>                contents=contents,<br>                config=config,<br>            )<br>            display_response_info(response)<br>            display_response(response, show_as)<br><br><br>def get_video_part(<br>    video: Video,<br>    video_segment: VideoSegment | None = None,<br>    fps: float | None = None,<br>) -&gt; Part | None:<br>    video_uri: str = video.value<br><br>    if not client.vertexai:<br>        video_uri = convert_to_https_url_if_cloud_storage_uri(video_uri)<br>        if not video_uri.startswith(YOUTUBE_URL_PREFIX):<br>            print(&quot;Google AI Studio API: Only YouTube URLs are currently supported&quot;)<br>            return None<br><br>    file_data = FileData(file_uri=video_uri, mime_type=&quot;video/*&quot;)<br>    video_metadata = get_video_part_metadata(video_segment, fps)<br><br>    return Part(file_data=file_data, video_metadata=video_metadata)<br><br><br>def get_video_part_metadata(<br>    video_segment: VideoSegment | None = None,<br>    fps: float | None = None,<br>) -&gt; VideoMetadata:<br>    def offset_as_str(offset: timedelta) -&gt; str:<br>        return f&quot;{offset.total_seconds()}s&quot;<br><br>    if video_segment:<br>        start_offset = offset_as_str(video_segment.start)<br>        end_offset = offset_as_str(video_segment.end)<br>    else:<br>        start_offset = None<br>        end_offset = None<br><br>    return VideoMetadata(start_offset=start_offset, end_offset=end_offset, fps=fps)<br><br><br>def convert_to_https_url_if_cloud_storage_uri(uri: str) -&gt; str:<br>    if uri.startswith(CLOUD_STORAGE_URI_PREFIX):<br>        return f&quot;https://storage.googleapis.com/{uri.removeprefix(CLOUD_STORAGE_URI_PREFIX)}&quot;<br><br>    return uri<br><br><br>def get_retrier() -&gt; tenacity.Retrying:<br>    return tenacity.Retrying(<br>        stop=tenacity.stop_after_attempt(7),<br>        wait=tenacity.wait_incrementing(start=10, 
increment=1),<br>        retry=should_retry_request,<br>        reraise=True,<br>    )<br><br><br>def should_retry_request(retry_state: tenacity.RetryCallState) -&gt; bool:<br>    if not retry_state.outcome:<br>        return False<br>    err = retry_state.outcome.exception()<br>    if not isinstance(err, ClientError):<br>        return False<br>    print(f&quot;❌ ClientError {err.code}: {err.message}&quot;)<br><br>    retry = False<br>    match err.code:<br>        case 400 if err.message is not None and &quot; try again &quot; in err.message:<br>            # Workshop: project accessing Cloud Storage for the first time (service agent provisioning)<br>            retry = True<br>        case 429:<br>            # Workshop: temporary project with 1 QPM quota<br>            retry = True<br>    print(f&quot;🔄 Retry: {retry}&quot;)<br><br>    return retry<br><br><br>def display_response_info(response: GenerateContentResponse) -&gt; None:<br>    if usage_metadata := response.usage_metadata:<br>        if usage_metadata.prompt_token_count:<br>            print(f&quot;Input tokens   : {usage_metadata.prompt_token_count:9,d}&quot;)<br>        if usage_metadata.candidates_token_count:<br>            print(f&quot;Output tokens  : {usage_metadata.candidates_token_count:9,d}&quot;)<br>        if usage_metadata.thoughts_token_count:<br>            print(f&quot;Thoughts tokens: {usage_metadata.thoughts_token_count:9,d}&quot;)<br>    if not response.candidates:<br>        print(&quot;❌ No `response.candidates`&quot;)<br>        return<br>    if (finish_reason := response.candidates[0].finish_reason) != FinishReason.STOP:<br>        print(f&quot;❌ {finish_reason = }&quot;)<br>    if not response.text:<br>        print(&quot;❌ No `response.text`&quot;)<br>        return<br><br><br>def display_response(<br>    response: GenerateContentResponse,<br>    show_as: ShowAs,<br>) -&gt; None:<br>    if show_as == ShowAs.DONT_SHOW:<br>        return<br>    if not (response_text := 
response.text):<br>        return<br>    response_text = response_text.strip()<br><br>    print(&quot; start of response &quot;.center(80, &quot;-&quot;))<br>    match show_as:<br>        case ShowAs.TEXT:<br>            print(response_text)<br>        case ShowAs.MARKDOWN:<br>            display_markdown(response_text)<br>    print(&quot; end of response &quot;.center(80, &quot;-&quot;))<br><br><br>def display_markdown(markdown: str) -&gt; None:<br>    IPython.display.display(IPython.display.Markdown(markdown))<br><br><br>def display_video(video: Video) -&gt; None:<br>    video_url = convert_to_https_url_if_cloud_storage_uri(video.value)<br>    assert video_url.startswith(&quot;https://&quot;)<br><br>    video_width = 600<br>    if video_url.startswith(YOUTUBE_URL_PREFIX):<br>        youtube_id = video_url.removeprefix(YOUTUBE_URL_PREFIX)<br>        # Add referrerpolicy to fix video player configuration error 153<br>        extras = [&#39;referrerpolicy=&quot;strict-origin-when-cross-origin&quot;&#39;]<br>        ipython_video = IPython.display.YouTubeVideo(<br>            youtube_id, width=video_width, extras=extras<br>        )<br>    else:<br>        ipython_video = IPython.display.Video(video_url, width=video_width)<br><br>    display_markdown(f&quot;### Video ([source]({video_url}))&quot;)<br>    IPython.display.display(ipython_video)</pre><p>▶️ Next up: 🧪 <a href="https://medium.com/@PicardParis/unlocking-multimodal-video-transcription-with-gemini-part3-404c6c4b986c">Prototyping</a></p><hr><p><a href="https://medium.com/google-cloud/unlocking-multimodal-video-transcription-with-gemini-part2-43c491a0c4f1">Unlocking Multimodal Video Transcription with Gemini — Part 2: 🛠️ Setup</a> was originally published in <a href="https://medium.com/google-cloud">Google Cloud - Community</a> on Medium, where people are continuing the
conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>