
ESCA: Grounding embodied agents with scene graphs — Accelerated by JAX

Monday, December 15, 2025

Introduction

Multimodal Large Language Models (MLLMs) increasingly form the core of the brain for general-purpose embodied agents, that is, AI that can navigate and act in the physical world as robots. While MLLMs are making rapid progress, they often stumble on a critical hurdle: precise visual perception. They struggle to reliably capture the fine-grained links between low-level visual features and high-level textual semantics.

Today, we are highlighting the work of Prof. Mayur Naik's research team at the University of Pennsylvania. To bridge the gap between high-level language and low-level visual features, they developed ESCA (Embodied and Scene-Graph Contextualized Agent). By porting their neurosymbolic pipeline to JAX, they achieved the real-time performance necessary for high-throughput decision-making. This work also demonstrates that JAX drives performance gains across a wide range of hardware, including standard CPUs and NVIDIA GPUs, and not just on Google TPUs.

In this blog, the UPenn team explains how they combined structured scene graphs with JAX's functional design to reduce perception errors by over 50% and achieve a 25% speedup in inference.


The "Grounding" Problem in Embodied AI

Existing MLLMs are powerful, but they can be surprisingly "blind" when tasked with interacting with the physical world. In our empirical analysis of 60 navigation tasks from EmbodiedBench, we found that 69% of agent failures stemmed from perception errors. See the figure below.

The three top-level error types are Perception, Reasoning, and Planning. The second-level error types are Hallucination, Wrong Recognition, Spatial Understanding, Spatial Reasoning, Reflection Error, Inaccurate Action, and Collision. For clarity, the figure labels the different error types with abbreviations.

[Figure: distribution of error types across 60 EmbodiedBench navigation tasks]

The models struggle to capture fine-grained links between visual features and textual semantics. They might recognize a "kitchen," but fail to identify the specific spatial relationship between a knife and a cutting board required to complete a task.

Enter ESCA: The Anglerfish of AI

To solve this, we introduced ESCA, a framework designed to contextualize MLLMs through open-domain scene graph generation.

[Figure: overview of the ESCA framework]

Think of ESCA like the bioluminescent lure of a deep-sea anglerfish. Just as the fish illuminates its dark surroundings to reveal prey, ESCA "illuminates" the agent's environment by generating a structured Scene Graph—a map of objects, attributes, and relationships (e.g., Cup [Red] ON Table).

A key innovation here is Selective Grounding. Injecting a massive scene graph of everything in the room can overwhelm the model. Instead, ESCA identifies only the subset of objects and relations pertinent to the current instruction. It performs probabilistic reasoning to construct prompts enriched with exactly the contextual details the agent needs to act.

The Engine: LASER and Scallop

At the core of ESCA is LASER, a CLIP-based foundation model trained on 87k video-caption pairs. LASER uses Scallop—our neurosymbolic programming language that supports JAX backends—to align predicted scene graphs with logical specifications. This pipeline allows us to train low-level perception models to produce detailed graphs without needing tedious frame-level annotations.

JAX User Experience

1. The Power of Statelessness

JAX's design encouraged a fully functional, stateless architecture. Every component, from feature extraction to similarity computation, was made into a pure modular function. This structure enabled effective use of jit (Just-In-Time) compilation. The XLA compiler could fuse sequences—like normalization, matrix multiplication, and softmax—into fewer kernels, reducing intermediate buffers and lowering GPU overhead.
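
As a rough illustration of that style, here is a hedged sketch (not code from LASER; the function and shapes are illustrative) of a normalization, matrix-multiplication, and softmax sequence written as a single pure function and handed to jax.jit so XLA can fuse it:

```python
import jax
import jax.numpy as jnp

@jax.jit
def similarity_scores(image_features, text_features):
    """Normalize both feature sets, compare them, and turn the scores into probabilities."""
    img = image_features / jnp.linalg.norm(image_features, axis=-1, keepdims=True)
    txt = text_features / jnp.linalg.norm(text_features, axis=-1, keepdims=True)
    logits = img @ txt.T                    # cosine-similarity matrix
    return jax.nn.softmax(logits, axis=-1)  # one distribution per image feature

scores = similarity_scores(jnp.ones((4, 512)), jnp.ones((8, 512)))
```

Because the whole function is pure, jit can trace it once and XLA is free to fuse the steps rather than materializing each intermediate array.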

2. Handling Complex Control Flow

Our pipeline requires selecting the "top-k" most relevant objects from a probabilistic scene graph. This introduces complex control flow. JAX provided the primitives we needed to handle this efficiently:

  • We used jax.lax.cond to manage control flow inside the probabilistic graph, as sketched after this list.
  • We leveraged jax.nn and jax.numpy for all activation functions and batched math in a JIT-friendly way.
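
The sketch below illustrates the pattern under stated assumptions: it uses jax.lax.top_k and jax.lax.cond on a made-up relevance vector, not ESCA's actual grounding logic.

```python
from functools import partial

import jax
import jax.numpy as jnp

@partial(jax.jit, static_argnames=("k",))
def select_relevant(relevance, k=3, threshold=0.5):
    """Pick the k highest object-relevance scores; suppress them all if none is confident."""
    top_vals, top_idx = jax.lax.top_k(relevance, k)
    # jax.lax.cond traces both branches and selects one at runtime, staying JIT-friendly.
    top_vals = jax.lax.cond(
        jnp.max(top_vals) > threshold,
        lambda v: v,                   # at least one confident object: keep the scores
        lambda v: jnp.zeros_like(v),   # nothing confident: zero everything out
        top_vals,
    )
    return top_vals, top_idx

vals, idx = select_relevant(jnp.array([0.1, 0.9, 0.3, 0.7, 0.2]))
```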

3. Debugging and Transparency

Migrating to JAX was also a learning experience. Tools like jax.debug.print() and jax.debug.callback() allowed us to inspect values inside jit-compiled functions, while jax.disable_jit() let us switch to eager execution and step through the program to see intermediate values.
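
For instance, a minimal sketch of both tools (the function is illustrative, not from our pipeline) looks like this:

```python
import jax
import jax.numpy as jnp

@jax.jit
def normalize(x):
    norms = jnp.linalg.norm(x, axis=-1, keepdims=True)
    # Runtime printing from inside a jit-compiled function.
    jax.debug.print("smallest norm = {n}", n=jnp.min(norms))
    return x / norms

normalize(jnp.arange(6.0).reshape(2, 3))   # compiled execution, still prints

with jax.disable_jit():                    # eager execution for step-by-step debugging
    normalize(jnp.arange(6.0).reshape(2, 3))
```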

Furthermore, the transparency of the open-source system was impressive. Being able to read the annotated source code and see how Python functions trace into jaxpr (JAX expression) gave us deep insight into how to design inference logic that scales.

4. Seamless Integration with Flax

Flax NNX fit into our workflow perfectly. We used nnx.Module to structure the model and FrozenDict to keep parameters organized and immutable. The TrainState object made managing model parameters and optimizer states straightforward, without adding the complexity often found in other frameworks.
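
As a minimal sketch of the nnx.Module pattern (the module, layer sizes, and names below are illustrative rather than LASER's actual architecture):

```python
import jax
import jax.numpy as jnp
from flax import nnx

class Scorer(nnx.Module):
    """A tiny nnx.Module: one linear projection over pooled features."""
    def __init__(self, din: int, dout: int, *, rngs: nnx.Rngs):
        self.proj = nnx.Linear(din, dout, rngs=rngs)

    def __call__(self, x):
        return jax.nn.softmax(self.proj(x), axis=-1)

model = Scorer(512, 10, rngs=nnx.Rngs(0))
probs = model(jnp.ones((4, 512)))   # shape (4, 10): class probabilities per example
```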

JAX Performance: A 25% Speedup

Embodied agents operate in a continuous loop: planning, acting, and updating their understanding of a dynamic world. High latency here is a dealbreaker. We ported LASER from PyTorch to JAX to improve real-time performance, and the benefits were substantial once our core similarity computations and feature pipelines were rewritten as pure functions wrapped in jax.jit.
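
For readers who want to reproduce this kind of measurement, here is a hedged benchmarking sketch (the workload is a stand-in, not the LASER pipeline): warm up once so compilation time is excluded, and call block_until_ready() so JAX's asynchronous dispatch does not skew the numbers.

```python
import time

import jax
import jax.numpy as jnp

@jax.jit
def per_frame(frame_features, text_bank):
    """Stand-in per-frame workload: normalize, compare against a text bank, softmax."""
    f = frame_features / jnp.linalg.norm(frame_features, axis=-1, keepdims=True)
    return jax.nn.softmax(f @ text_bank.T, axis=-1)

frames = jnp.ones((100, 32, 512))   # 100 frames, 32 region features each
bank = jnp.ones((80, 512))          # 80 candidate labels

per_frame(frames[0], bank).block_until_ready()      # warm-up: compile outside the timed loop

start = time.perf_counter()
for i in range(frames.shape[0]):
    per_frame(frames[i], bank).block_until_ready()  # block so async dispatch is not hidden
print(f"avg time per frame: {(time.perf_counter() - start) / frames.shape[0] * 1e3:.2f} ms")
```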

On an NVIDIA H100 GPU, JAX reduced the average time per frame from 18.15 ms to 14.55 ms, a roughly 25% speedup in throughput over PyTorch.

Framework   Hardware   Avg Time Per Frame (ms) ↓   FPS ↑
PyTorch     H100 GPU   18.15 ± 0.73                55.15 ± 2.31
JAX         H100 GPU   14.55 ± 0.64                68.82 ± 3.13

Conclusion

ESCA demonstrates that better data—structured, grounded scene graphs—can solve the perception bottleneck in Embodied AI. But it also demonstrates that better infrastructure is required to run these systems in the real world. JAX provided the speed, transparency, and modularity needed to turn our research into a real-time agent capable of reliable reasoning.

Acknowledgements

This research was made possible through support from a Google Research Award to the University of Pennsylvania and from the ARPA-H program on Safe and Explainable AI under award D24AC00253-00.

Get Started

You can explore the LASER code, the ESCA framework and documentation for JAX and Flax at:

Empowering app developers: Fine-tuning Gemma 3 for mobile with Tunix in Google Colab

Thursday, December 11, 2025


In the rapidly evolving world of AI models for mobile devices, a persistent challenge is how to bring state-of-the-art (SOTA) LLMs to smartphones without compromising on privacy or requiring app developers to be machine learning engineers.

Today, we are excited to talk about how Cactus, a startup building a next-gen inference engine for mobile devices, fine-tunes the open-source Gemma 3 model. By leveraging Tunix, the LLM post-training library in the JAX ML ecosystem, they achieved this entirely on Google Colab's Free Tier.

The Challenge: Making Small Models "Expert"

For app developers, running Large Language Models (LLMs) in the cloud isn't always an option due to privacy concerns (such as GDPR) and latency requirements. The solution lies in running models locally on the device. However, most smartphones globally lack specialized NPUs (Neural Processing Units) for AI acceleration, meaning developers need highly efficient, smaller models.

While compact models like Gemma (270M or 1B parameters) are incredibly efficient, they are often "generalists." To be useful for specific mobile applications—such as a medical imaging assistant or a legal document analyzer—they need to be fine-tuned to become domain experts.

The problem? Most app developers are not ML infrastructure experts. Setting up complex training pipelines, managing dependencies, and navigating steep learning curves creates too much friction.

The Solution: SFT via Tunix on Google Colab

To solve this, Cactus created a simplified, low-friction workflow: a Python script that uses Tunix's Supervised Fine-Tuning (SFT) APIs inside a Colab notebook.

1. The Engine: Tunix

Cactus utilized Tunix, Google's lightweight and modular LLM post-training library, which supports both SFT and leading RL algorithms and executes natively on TPUs. Tunix strips away the complexity of heavy frameworks, offering a simplified path to SFT.

2. The Access: Google Colab Free Tier

Accessibility was a key requirement. Instead of requiring developers to set up complex cloud billing and project IDs immediately, the workflow operates entirely within a Google Colab Notebook. By utilizing the free tier of Colab, developers can:

  • Load the Gemma 3 model.
  • Upload their specific dataset (e.g., medical data or customer service logs); see the format sketch after this list.
  • Run an SFT job using Tunix.
  • Export the weights for conversion.
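
As a rough illustration of what such a dataset might look like, here is a minimal sketch of prompt/response pairs written out as JSONL. The field names and file name are assumptions for illustration only; the exact schema Tunix expects may differ, so check the Tunix SFT notebook for the real format.

```python
import json

# Illustrative fine-tuning examples: prompt/response pairs (field names are assumptions).
examples = [
    {"prompt": "Summarize this customer ticket: ...", "response": "The customer reports ..."},
    {"prompt": "Extract the dosage from: ...", "response": "500 mg twice daily"},
]

# Write one JSON object per line (JSONL), a common format for SFT datasets.
with open("sft_dataset.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```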

3. The Deployment: Cactus

Once tuned, the model is converted into the Cactus graph format. This allows the now-specialized Gemma 3 model to be deployed directly into a Flutter or native mobile app with just a few lines of code, running efficiently on a wide range of smartphone hardware.

Why This Matters

"Our users are app developers, not ML engineers," explains Henry Ndubuaku, co-founder of Cactus. "They want to pick a model, upload data, and click 'tune.' By using Tunix and Colab, we can give them a 'clone-and-run' experience that removes the intimidation factor from fine-tuning."

This workflow represents the "lowest hanging fruit" in democratizing AI:

  • No complex local environment setup.
  • No upfront infrastructure costs.
  • A high-performance, JAX-native library (Tunix) to tune a leading OSS model (Gemma).

What's Next?

While the Colab notebook provides an immediate, accessible solution, Cactus is exploring a full GUI-based portal for fine-tuning and quantization of LLMs, backed by Google Cloud TPUs for compute. This would allow scalable training of larger models and even more seamless integration into the mobile development lifecycle.

Get Started

Ready to turn your mobile app into an AI powerhouse? Check out the Tunix SFT Notebook for Cactus and start fine-tuning Gemma 3 for your device today:

You can explore Tunix sample scripts, documentation and repo at:

Shape the future with Google Summer of Code 2026!

Wednesday, December 3, 2025


Are you a passionate beginner ready to make your mark in open source? Now is your chance to make an impact in the 2026 Google Summer of Code (GSoC) program!

For over two decades, GSoC has been a critical launchpad, consistently introducing fresh, enthusiastic talent into the open source ecosystem. The numbers speak for themselves: GSoC has connected more than 22,000 contributors with over 20,000 mentors from 1,000+ open source organizations, all collaborating to keep this vital community thriving.


Google, like everyone else, is built on open source. We depend on a healthy, vibrant open source ecosystem and want to lower the barrier to entry for people who want to contribute to the open community. Join in, learn great skills, and make an impact on people around the world.
Richard Seroter, Chief Evangelist, Google Cloud

Over 3+ months, contributors spend their time collaborating on real-world projects right alongside experienced mentors and their communities. This isn't just about coding; this deep immersion in open source does more than build valuable technical skills. It cultivates a strong understanding of community dynamics, best practices, and the soft skills needed to become a truly impactful open source contributor.

Check out the trailer below to learn more about the massive impact GSoC has made over the last two-plus decades.

Google Summer of Code trailer

Join GSoC as a Mentoring Organization

Application Period: January 19 – February 3

If you're interested in your organization participating, now is the time to start! We welcome around 30 new organizations to GSoC annually—and yours could be one of them.

  • Visit our website where you'll find supportive materials to get started.
  • The Mentor Guide is a must-read, offering an introduction to GSoC, community engagement tips, project idea suggestions, and guidance on applying.
  • Pro Tip: For 2026, we'll have an expanded focus on the AI, Security, and Machine Learning domains.

Want to be a GSoC Contributor?

Application Period: March 16 – March 31

Ready to get started? The official website is your first stop!


Spread the Word!

Please help us amplify the message about GSoC 2026! Share this post with your peers, family members, colleagues, universities, and anyone interested in making a real difference in the open source community.

Join us and help shape the future of open source!

AI Conformant Clusters in GKE

Wednesday, November 26, 2025


We are excited to announce that Google Kubernetes Engine (GKE) is now a CNCF-certified Kubernetes AI conformant platform, designed to provide a stable and optimized environment for your AI/ML applications. This initiative, which culminated in the announcement of the Kubernetes AI Conformance program by CNCF CTO Chris Aniszczyk at KubeCon NA 2025, is set to simplify AI/ML on Kubernetes for everyone. You can check out the opening keynote here.

During the keynote, Janet Kuo, author of this blog and Staff Software Engineer at Google, performed a live demo showing the practical power of an AI-conformant cluster. If you are interested in the technical specifics, you can learn more about the demo here.

Why AI Conformance Matters

The primary goal of the Kubernetes AI Conformance program is to simplify AI/ML on Kubernetes, guarantee interoperability and portability for AI workloads, and enable a growing ecosystem of AI tools on a standard foundation.

Setting up a Kubernetes cluster for AI/ML can be a complex undertaking. An AI-conformant platform like GKE handles these underlying complexities for you, ensuring that your environment is optimized for scalability, performance, portability, and interoperability.

For a detailed look at all the requirements and step-by-step instructions on how to create an AI-conformant GKE cluster, we encourage you to read the GKE AI Conformance user guide.

What Makes GKE an AI-Conformant Platform?

A Kubernetes AI-conformant platform like GKE handles the underlying complexities for you, providing a verified set of capabilities to run AI/ML workloads reliably and efficiently. Here are some of the key requirements that GKE manages for you:

  • Dynamic Resource Allocation (DRA): GKE enables more flexible and fine-grained resource requests for accelerators, going beyond simple counts. This is crucial for workloads that need specific hardware configurations.
  • Intelligent Autoscaling for Accelerators: GKE implements autoscaling at both the cluster and pod level to ensure your AI workloads are both cost-effective and performant.
    • Cluster Autoscaling works at the infrastructure level. It automatically resizes node pools with accelerators, adding nodes when it detects pending Pods that require them and removing nodes to save costs when they are underutilized.
    • Horizontal Pod Autoscaling (HPA) works at the workload level. HPA can automatically scale the number of your pods up or down based on real-time demand. For AI workloads, this is especially powerful, as you can configure it to make scaling decisions based on custom metrics like GPU/TPU utilization.
  • Rich Accelerator Performance Metrics: GKE exposes detailed, fine-grained performance metrics for accelerators. This allows for deep insights into workload performance and is essential for effective monitoring and autoscaling.
  • Robust AI Operator Support: GKE ensures that complex AI operators, such as Kubeflow or Ray, can be installed and function reliably, enabling you to build and manage sophisticated ML platforms with CRDs.
  • All-or-Nothing Scheduling for Distributed Workloads: GKE supports gang scheduling solutions like Kueue, which ensure that distributed AI jobs only start when all of their required resources are available, preventing deadlocks and resource wastage.

A Unified and Evolving Standard

The Kubernetes AI Conformance program is designed as a single, unified standard for a platform to support all AI/ML workloads. This reflects the reality that modern AI processes, from training to inference, increasingly rely on the same underlying high-performance infrastructure.

What's Next?

We invite you to explore the benefits of running your AI/ML workloads on an AI-conformant GKE cluster.

The launch of the AI Conformance program is a significant milestone, but it is only the first step. We are eager to continue this conversation and work alongside the community to evolve and improve this industry standard as we head into 2026.

Secure-by-design firmware development with Wasefire

Tuesday, November 18, 2025

Improving firmware development

Building firmware for embedded devices, like microcontrollers and IoT hardware, is hard. It is often complex, requires deep expertise, and, most importantly, is prone to security bugs. One of the key challenges is the limited resources available on these devices: constrained processing power, memory, and storage capacity. These constraints put robust security measures at odds with performance and functionality. Insecure IoT devices are then recruited by cybercriminals into botnets to perform DDoS attacks, steal information, and act as proxies to evade detection (e.g., the Mirai botnet).

Today, we introduce a new framework that makes it easier to build and maintain safer embedded systems: Wasefire.

Wasefire simplifies the development process and incorporates security best practices by default. This enables developers to create secure firmware without extensive security expertise, letting them focus only on the business logic they want to implement. To this end, Wasefire provides, for each supported device, a platform on which device-agnostic sandboxed applets can run. Wasefire currently supports the nRF52840 DK, nRF52840 Dongle, nRF52840 MDK Dongle, and OpenTitan Earlgrey. There is also a Host platform for testing without embedded devices.

A Wasefire platform abstracts the hardware so Wasefire applets are portable

The platform is written in Rust for its performance and built-in memory safety. Embedded devices are one of the four target domains of the Rust 2018 roadmap. So today, it is quite simple to write embedded code in Rust, or even integrate Rust in existing embedded code.

The platform expects the applets to be written in—or more realistically, compiled to—WebAssembly for its simplicity, portability, and security. WebAssembly is a binary instruction format for a stack-based virtual machine. It is designed for high-performance applications on the web (hence its name) but it also supports non-web environments. Fun fact: Wasefire uses WebAssembly in both environments: the main usage is non-web for the virtual machine to run applets, but the web interface of the Host platform also relies on WebAssembly.

Incidentally, WebAssembly is another one of the four target domains of the Rust 2018 roadmap. This means that writing applets in Rust and compiling them to WebAssembly is very simple. For this reason, Rust is the primary language to write applets for Wasefire. Starting a new project is as simple as the following steps:

[Figure: terminal steps for creating a new Wasefire applet project]

WebAssembly on microcontrollers

Running WebAssembly on microcontrollers may seem like overkill if it were only for sandboxing. But using a virtual machine also provides binary-level portability like Java Cards. In particular, the same WebAssembly applet can be distributed in binary form and run on multiple platforms.

On a microcontroller, every byte matters. To cater to a variety of needs, Wasefire provides multiple alternatives to balance between security, performance, footprint, and portability:

  • WebAssembly applets: Platforms may embed the Wasefire interpreter. This is a custom in-place interpreter for WebAssembly in the style of "A fast in-place interpreter for WebAssembly" with a very small footprint. The main drawback is that it doesn't support computation-heavy applets.
  • Pulley applets: Platforms may embed Wasmtime and its Pulley interpreter. WebAssembly was not designed for interpretation, but for compilation, so WebAssembly interpreters face a tradeoff in either performance or footprint. By contrast, Pulley was designed for fast interpretation and can be compiled from WebAssembly. The main drawback is the larger footprint of this solution and the need for applets to be signed (which is not yet implemented), since Pulley cannot be validated like WebAssembly.
  • Native applets: Platforms may link with an applet compiled as a static library for the target architecture. This solution is only provided as a last resort when no other existing alternative works. The main drawback is that almost all security benefits are nullified and binary-level portability is lost.
  • CHERI applets: This alternative is planned (but not yet started) and would provide the performance and footprint advantage of Native applets while retaining the sandboxing advantage of WebAssembly and Pulley applets. The main drawback is that the target device needs to support CHERI and binary-level portability is lost.

To illustrate this tradeoff, let's look at a few examples from the Wasefire repository:

  • The first example is a button-controlled blinking LED. This applet can run as a WebAssembly applet without problem.
  • The second example is a FIDO2 security key implemented using the OpenSK library. This applet currently reaches the limits of the WebAssembly in-place interpreter in terms of performance. Using a Pulley applet instead improves performance at the cost of larger applet size and memory footprint.
  • The third example is a BLE sniffer, for which performance is critical. With the in-place interpreter, the applet is too slow and many packets are dropped. Compiled to Pulley, it drops no packets even in a noisy BLE environment.

We can summarize the tradeoff in the table below. The platform size differs between examples because the second and third examples need optional drivers that are disabled by default. The platform is the nRF52840 DK. For the security key, applet performance is measured as the time between a FIDO2 GetInfo request and the last packet of its response. For the BLE sniffer, applet performance is measured as the number of processed packets per second. This metric saturates for Pulley and Native applets, so we only get a lower bound on performance in those cases.

Blinking LED            WebAssembly   Pulley   Native
Platform size (KiB)     98            299      49
Applet size (KiB)       3.3           12       5.6
Platform memory (KiB)   10            80       5

Security key              WebAssembly   Pulley   Native
Platform size (KiB)       133           334      80
Applet size (KiB)         125           247      73
Platform memory (KiB)     20            104      9
Applet performance (ms)   1191          60       23

BLE sniffer                     WebAssembly               Pulley                 Native
Platform size (KiB)             102                       303                    53
Applet size (KiB)               7.2                       18                     7.6
Platform memory (KiB)           16                        82                     8.8
Applet performance (packet/s)   = 55 (dropping packets)   > 195 (not dropping)   > 195 (not dropping)

Looking forward

Wasefire is still an experimental project. Many features are missing (including security features) and many improvements are planned. For example, the platform currently runs a single applet and provides all the resources this applet asks for. Ultimately, applets would come with a manifest describing which resources they are permitted to use, and those resources would be isolated to that single applet. It would also be possible to run multiple applets concurrently.

The project is open source, so bug reports, feature requests, and pull requests are welcome. The project is licensed under Apache-2.0, so commercial use is permitted.

Feel free to give it a try (no hardware needed) and spread the word!
