<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[deMISTify - Medium]]></title>
        <description><![CDATA[Monthly articles published by UTMIST’s technical writers on topics in the field of machine learning and artificial intelligence. - Medium]]></description>
        <link>https://medium.com/demistify?source=rss----2081633ded83---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>deMISTify - Medium</title>
            <link>https://medium.com/demistify?source=rss----2081633ded83---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Wed, 08 Apr 2026 20:58:39 GMT</lastBuildDate>
        <atom:link href="https://medium.com/feed/demistify" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Stop Scrolling: Understand the AI behind your screen that keeps you hooked]]></title>
            <link>https://medium.com/demistify/stop-scrolling-understand-the-ai-behind-your-screen-that-keeps-you-hooked-afe2c30ea797?source=rss----2081633ded83---4</link>
            <guid isPermaLink="false">https://medium.com/p/afe2c30ea797</guid>
            <category><![CDATA[doomscrolling]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[algorithms]]></category>
            <category><![CDATA[social-network]]></category>
            <dc:creator><![CDATA[Jacqueline Z]]></dc:creator>
            <pubDate>Tue, 16 Dec 2025 19:36:46 GMT</pubDate>
            <atom:updated>2025-12-16T19:36:44.705Z</atom:updated>
            <content:encoded><![CDATA[<p>In 2025, the global average screen time is around 7 hours. While most of us would like to say the majority of that time was productive, I’m sure we have all had instances where we looked up from our phones to realize that an hour had passed while we were watching endless videos and posts. In today’s world, we consume content more than ever and struggle to put our phones down, but social media feeds weren’t always designed to be addictive. Before 2015, most platforms simply used straightforward and predictable chronological feeds. Between 2016 and 2020, platforms shifted to engagement-based sorting. Posts weren’t just organized by time anymore; they also took into account the likes, shares, and watch time each post got. This marked the beginning of using algorithms to decide what you should see based on popularity metrics. Between 2021 and 2024, we saw the introduction of predictive AI algorithms. Instead of just showing popular content, algorithms began predicting what users like to see, analyzing past behaviour, and identifying interests to personalize your feed. Today in 2025, platforms have introduced real-time personalization that adjusts as you scroll. Rather than looking at your activity from yesterday, they actively learn what is capturing your attention now and adjust the recommendations accordingly.</p><h3><strong>How AI is Integrated in These Algorithms:</strong></h3><p>Before diving into the specific AI tools used, it is useful to see a high-level overview of how these algorithms work. The first step is content collection, which gathers potential posts from accounts you follow and similar content across the platform. You might have noticed that your feed is now filled with posts from creators you have never seen before, rather than simply those you follow. The algorithms also filter out content that they predict you won’t like. 
The next step is prediction modelling, where the system predicts what you will engage with next. It uses many different types of data, not just likes, views, and shares, but also incorporates your location and activity from other apps. From this list of potential posts, the algorithm assigns each of them a score representing how likely you are to engage with it. Posts with higher scores go to the top of your feed, while lower-scoring posts are deprioritized. The feedback loop keeps track of every scroll, pause, and second you watch something, providing information to train the system. All of this information is then combined to detect the type of content (e.g., video or image) and emerging trends to track micro-behaviours.</p><p>Now that we have a high-level overview of how the algorithms work, we can look at the AI technologies employed.</p><p><strong>Computer Vision:</strong></p><p>Computer vision is the field of AI that allows computers to “see,” interpret, and understand visual information from images and videos. It consists of image processing, feature detection, pattern recognition, and information extraction. This is how the algorithm detects whether you are looking at cat videos or food videos and enables platforms to recommend similar content even from accounts you have never interacted with before.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/312/1*0lvPzJ31Rw38BIJ6IAjIKg.png" /><figcaption>Figure 1. High-level Overview of Computer Vision [1]</figcaption></figure><p><strong>Natural Language Processing (NLP):</strong></p><p>NLP allows computers to understand and process human language, breaking text down and identifying patterns to understand its meaning and context. The use of NLP in social media platforms, however, presents a unique challenge. Text is short, informal, noisy, and high volume. 
This noisiness brings challenges such as non-standard spelling, slang, abbreviations, emojis, and loose grammar, all of which can throw an algorithm off.</p><p>To handle this complexity, NLP systems preprocess text by normalizing non-dictionary words, interpreting emojis, and using specialized techniques like part-of-speech (POS) tagging for noisy text. They then perform semantic analysis where the algorithms extract information about locations, events, and sentiment to check if the content is positive, negative, or controversial.</p><p><strong>Real-Time Learning and Hybrid Systems</strong></p><p>While the previous techniques are already enough to make a personalized feed, perhaps the biggest change that keeps us scrolling is the introduction of real-time learning. Platforms now use hybrid recommendation systems that combine multiple AI techniques. For example, Instagram employs different AI models for each “stream” of its app [2]. Your feed, stories, reels, and explore page are all optimized differently. For instance, your feed prioritizes content from close connections, while the explore page is designed to show you unfamiliar content.</p><p><strong>Collaborative Filtering</strong></p><p>Collaborative filtering predicts your interests based on the behaviour of similar users. Neural networks process huge amounts of behaviour data to identify these patterns and make suggestions. Research shows that these systems can model high-order relationships between users, not just direct connections [3]. They can model connections between your connections, creating a “hypergraph” structure that tries to capture the full complexity of social networks. Unlike regular graphs, hypergraph edges can connect any number of vertices, not just two. 
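</p><p>To make this concrete, here is a toy sketch (all names and interactions below are invented): a hypergraph can be stored as a mapping from each interaction to the set of users it connects, with neighbours recovered through shared hyperedges.</p>

```python
# Toy hypergraph: each hyperedge is one interaction (e.g., users who shared
# the same post) and may connect any number of users, not just two.
# All names and interactions are invented for illustration.
hyperedges = {
    "shared_post_A": {"alice", "bob", "carol"},
    "liked_video_B": {"bob", "dave"},
    "commented_on_C": {"carol", "dave", "erin"},
}

def neighbours(user):
    """Users connected to `user` through at least one common hyperedge."""
    linked = set()
    for members in hyperedges.values():
        if user in members:
            linked |= members
    linked.discard(user)
    return linked

print(sorted(neighbours("bob")))  # ['alice', 'carol', 'dave']
```

<p>A single hyperedge here links three users at once, which is exactly the group-level structure a plain pairwise graph cannot express directly.</p><p>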
In Figure 2, people are grouped by social interactions (e.g., sharing the same post), which allows the system to identify neighbours (people who are meaningfully related) in a way that a normal graph would have missed.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/405/1*bWeMli92G3wlX060_Fd_OQ.jpeg" /><figcaption>Figure 2. Hypergraph Used to Represent Social Relationships [4]</figcaption></figure><p><strong>Content-Based Filtering</strong></p><p>While collaborative filtering recommends posts based on user behaviour, content-based filtering, as its name suggests, analyzes the content itself. This type of system is dynamic, continuously updating based on your feedback. It tries to balance familiarity with serendipity and diversity: keeping you hooked with content it already knows you like while still showing you new things you might enjoy.</p><p>Building on the foundations of collaborative and content-based filtering, advanced systems like DiffNet++ combine both models [5]. The system models two types of diffusion simultaneously: influence diffusion (how your friends’ preferences influence what you see) and interest diffusion (how your own interests evolve based on items you interact with). Rather than simply learning what you like, such systems predict how your interests change over time and through social influences.</p><p>These systems use graph neural networks (GNNs) to understand relationships and connections. Social media is fundamentally a network — users connect with others, content is connected to topics, and interactions form patterns over time. GNNs are great at learning from a web of connections to capture not only what you like, but how your preferences relate to others and how influence spreads through your network. Research shows that social recommendation systems that use GNNs can capture signals between friends that traditional methods miss. 
These systems can track how preferences move through social networks and identify which friends influence you most, weighting their behaviour more heavily. All of these methods combine to create a virtual “psychic” that can show you exactly what you want to see next, keeping us scrolling late into the night.</p><p><strong>Psychology behind Scrolling</strong></p><p>A common misconception is that endless scrolling and content consumption are the same as doomscrolling. However, doomscrolling actually refers to a specific case where the content is predominantly negative or distressing (disasters, politics, crime). While there are distinctions between the two, AI mechanisms often drive both behaviours. Entertainment feeds might mix distressing content into your feed because negative emotions drive engagement and keep you on the app longer. You might start on funny cat videos, but then gradually shift towards more controversial content. This is due to the <strong>negativity bias</strong>, a survival mechanism where humans are more likely to notice and remember threats rather than positive information.</p><p>When the <strong>infinite scroll</strong> was created by Aza Raskin, he referred to it as “one of the first products designed to not simply help a user, but to deliberately keep them online for as long as possible [6].” Before the infinite scroll, websites had natural endpoints where you would reach the bottom and move on. Now, content is continuously and automatically loaded, removing all friction and eliminating those natural stopping points.</p><p>Furthermore, research shows that compulsive scrolling increases anxiety through the <strong>intolerance of uncertainty</strong>. This uneasy feeling keeps you refreshing your feed just in case there is something you are missing.</p><p>Studies have also found that more social media exposure is associated with higher levels of depression, anxiety, and psychological distress. 
However, this relationship goes both ways: anxiety can lead to more scrolling, which in turn increases anxiety, and so on. This creates a never-ending cycle that is difficult to escape.</p><p><strong>How to Stop Scrolling</strong></p><p>The first step is to recognize when you are scrolling endlessly and take proactive measures to address it. Some strategies that can help you reclaim your attention are:</p><ol><li>Disable autoplay and infinite scroll features if possible</li><li>Set strict time limits using built-in screen time tools or third-party apps</li><li>Turn off notifications for social-media apps; research shows that reducing notification-driven checking decreases overall usage and anxiety</li><li>Schedule specific times for social media rather than checking reactively throughout the day</li><li>Use grayscale mode on your phone; removing colours makes the experience less stimulating and easier to disengage from</li></ol><p>The technology behind endless scrolling is sophisticated, but the solutions are surprisingly simple. Awareness is your first defense, and action is your second. You now have both. The only question left is: what will you create, learn, or experience with your time now?</p><p><strong>References</strong></p><p>[1] <a href="https://www.researchgate.net/figure/Computer-vision-block-diagram_fig2_312477536">https://www.researchgate.net/figure/Computer-vision-block-diagram_fig2_312477536</a></p><p>[2] <a href="https://about.instagram.com/blog/announcements/instagram-ranking-explained">https://about.instagram.com/blog/announcements/instagram-ranking-explained</a></p><p>[3] <a href="https://arxiv.org/abs/1811.04392">https://arxiv.org/abs/1811.04392</a></p><p>[4] <a href="https://link.springer.com/article/10.1007/s10618-024-01021-2">https://link.springer.com/article/10.1007/s10618-024-01021-2</a></p><p>[5] <a href="https://www.thetimes.com/business/technology/article/i-m-so-sorry-says-inventor-of-endless-online-scrolling-9lrv59mdk">https://www.thetimes.com/business/technology/article/i-m-so-sorry-says-inventor-of-endless-online-scrolling-9lrv59mdk</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=afe2c30ea797" width="1" height="1" alt=""><hr><p><a href="https://medium.com/demistify/stop-scrolling-understand-the-ai-behind-your-screen-that-keeps-you-hooked-afe2c30ea797">Stop Scrolling: Understand the AI behind your screen that keeps you hooked</a> was originally published in <a href="https://medium.com/demistify">deMISTify</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Learning to Reduce Waste: Modeling Household Food Patterns with ML]]></title>
            <link>https://medium.com/demistify/learning-to-reduce-waste-modeling-household-food-patterns-with-ml-8be34968adfa?source=rss----2081633ded83---4</link>
            <guid isPermaLink="false">https://medium.com/p/8be34968adfa</guid>
            <category><![CDATA[food-waste]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[sustainability]]></category>
            <category><![CDATA[regression-analysis]]></category>
            <dc:creator><![CDATA[Aman Asif]]></dc:creator>
            <pubDate>Tue, 16 Dec 2025 19:36:37 GMT</pubDate>
            <atom:updated>2025-12-16T19:36:36.399Z</atom:updated>
<content:encoded><![CDATA[<h4>Introduction</h4><p>Food waste is a global problem. The consequences go far beyond the household trash bin. Roughly one-third of all food produced for human consumption is lost or wasted globally, according to the United Nations’ Food and Agriculture Organization. This results in unnecessary emissions and a drain on agricultural systems. What often appears to be a simple matter of “forgetting what is in the fridge” is, at scale, a complex pattern of human behavior, logistics, and environmental factors. Machine learning offers an opportunity to discern those patterns and intervene before waste happens by predicting outcomes rather than reacting afterwards.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XgZMI_m_1j5xqygbg-cs4g.jpeg" /><figcaption>Figure 1: Household food waste may appear minor, but aggregated patterns reveal a significant global challenge.</figcaption></figure><h4>Turning Behavior into Data</h4><p>At its core, this is a prediction problem. Every piece of food has a life cycle: when it was bought, how it’s stored, how quickly it usually gets consumed, and how events outside the home change eating habits. Those leave behind a set of traces — a data set. Even a small record of purchase dates, quantities, expiration labels, storage temperatures, and meal rhythms can reveal behavioral patterns. Aggregated, those features enable an ML model to predict the probability that something will be eaten or thrown away. The problem becomes a natural fit for classification and regression techniques, in which models learn the relationship between those variables and the eventual outcome.</p><p>A practical model could start with features like days until expiry, historical consumption rates, food category, and ambient conditions. 
A logistic regression model trained on these attributes can estimate a waste probability for each item; a decision-tree regressor might estimate how many days remain before spoilage. These approaches are simple enough to deploy while remaining expressive, and they work surprisingly well for structured household data. When the focus shifts to retailers, the same framework scales up, integrating inventory turnover, shipping delays, regional demand fluctuations, and temperature logs from storage facilities. In both settings, the objective is identical: anticipate waste before it materializes.</p><h4>Patterns from Aggregated Data</h4><p>When models are tasked with generalizing across many users or stores, far more sophisticated behaviors emerge. Unsupervised algorithms such as clustering reveal items that are often wasted together and point to systemic issues: sets of vegetables that spoil much faster than expected, overstocked product categories, household purchasing habits that lead to unused leftovers. Briggs, Liu, and White (2020) demonstrated that clustering techniques applied in retail settings can reveal latent spoilage patterns across product groups, exposing inefficiencies that are not visible through simple item-level analysis. Anomaly detection methods highlight sudden shifts in consumption, such as when households waste more during exam periods or when retailers discard more as heatwaves set in. Machine learning not only forecasts the behavior of individuals. 
It uncovers structural inefficiencies and points toward methods for overcoming them.</p><h4>Moving towards Smarter Systems</h4><p>These predictive systems open an avenue for practical applications: basic sensors in a fridge can alert users when an item is in danger of being forgotten, generating reminders or recipe suggestions right on time; grocery-planning tools adjust shopping lists based on predicted consumption, not impulse or habit; and for retailers, forecasting models align supply with genuine demand without overordering or creating shortages. When applied along the supply chain, these systems reduce waste and cut operational costs while improving resiliency.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_g8coz_W74-anLc16A1wDQ.jpeg" /><figcaption>Figure 2: Predictive models can help reduce waste across supply chains such as in modern warehouses.</figcaption></figure><p>Despite their promise, the models face challenges familiar across applied ML: highly variable data quality because household logs are often incomplete, personal habits shift unpredictably, and expiration dates often have little meaning in terms of real spoilage rates. Moreover, there are privacy issues, since consumption patterns can reveal socioeconomic status, health trends, or cultural behaviors. Due to these concerns, careful attention to anonymization, user consent, and transparency in the use of predictions will be critical in any deployment of this approach.</p><p>The potential remains considerable, even with imperfections. Food waste isn’t solely a behavioral issue; it is a prediction problem wrapped inside a sustainability problem. Machine learning is well-suited, even in its most accessible forms, for the uncertainty and variation inherent in food consumption. 
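</p><p>As an illustration of one such accessible form, the logistic-regression idea described earlier can be sketched with scikit-learn on synthetic data (every feature and value below is invented; a real system would use logged household records).</p>

```python
# Sketch: classify items as wasted (1) or consumed (0) from a few features.
# All data here is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

# features: [days_until_expiry, historical_consumption_rate, storage_temp_C]
X = np.array([
    [1,  0.2, 7.0],   # near expiry, rarely eaten, warm storage
    [10, 0.9, 4.0],   # fresh, eaten often, cold storage
    [2,  0.1, 6.5],
    [8,  0.8, 4.5],
    [3,  0.3, 5.0],
    [12, 0.7, 4.0],
])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = wasted, 0 = consumed

model = LogisticRegression().fit(X, y)

# estimated probability that a near-expiry, seldom-eaten item gets wasted
p_waste = model.predict_proba([[2, 0.15, 6.0]])[0, 1]
print(round(p_waste, 2))
```

<p>The same skeleton extends to a retail setting by swapping in inventory turnover, shipping delays, and regional demand features.</p><p>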
By learning patterns from the past, it can provide meaningful signals about the future — signals that help individuals, retailers, and communities make better choices.</p><p>Such systems don’t replace human judgment; they complement it. A simple probability score attached to a carton of berries or a shipment of produce can be enough to trigger the right action at the right moment. Small optimizations add up to large impacts. As ML continues to integrate into everyday infrastructure, predicting food waste represents one of many ways computational tools can quietly steer us toward more sustainable habits — without requiring drastic lifestyle changes.</p><h4>References</h4><p>Briggs, J., Liu, J., &amp; White, M. (2020). Identifying waste patterns in retail food systems using unsupervised learning. <em>Sustainable Production and Consumption, 24,</em> 105–115. <a href="https://doi.org/10.1016/j.spc.2020.07.009">https://doi.org/10.1016/j.spc.2020.07.009</a></p><p>Food and Agriculture Organization of the United Nations. (2011). <em>Global food losses and food waste: Extent, causes and prevention.</em> FAO. <a href="https://www.fao.org/4/mb060e/mb060e00.htm">https://www.fao.org/4/mb060e/mb060e00.htm</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=8be34968adfa" width="1" height="1" alt=""><hr><p><a href="https://medium.com/demistify/learning-to-reduce-waste-modeling-household-food-patterns-with-ml-8be34968adfa">Learning to Reduce Waste: Modeling Household Food Patterns with ML</a> was originally published in <a href="https://medium.com/demistify">deMISTify</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Multimodal AI]]></title>
            <link>https://medium.com/demistify/multimodal-ai-b0e89b1bdceb?source=rss----2081633ded83---4</link>
            <guid isPermaLink="false">https://medium.com/p/b0e89b1bdceb</guid>
            <category><![CDATA[ai-research]]></category>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Rahul Sahay]]></dc:creator>
            <pubDate>Tue, 16 Dec 2025 19:36:33 GMT</pubDate>
            <atom:updated>2025-12-16T19:36:32.117Z</atom:updated>
            <content:encoded><![CDATA[<h3>Multimodal AI: One Step Closer to Thinking Like Us</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KL1DtDTcqOMGEpaVjMoFZg.png" /></figure><h3>Introduction</h3><p>The rise of Artificial Intelligence (AI) over the past decade has been remarkable, going from recognizing dogs and cats in photos to writing essays, developing apps, producing music, and much more. It has grown faster than many would have ever expected, but even with this evolution, most AI systems can only work with one type of modality — one model for text, another for images, another for audio. We, as humans, don’t tend to think this way, however. We process the world as a blend of sights, sounds, and language all at once.</p><p>This is where Multimodal AI comes in, a new generation of models that combine multiple types of data into a single, unified understanding. Instead of only reading words or only looking at pictures, multimodal systems can see, hear, read, and reason about how all those inputs connect.</p><p>The recent breakthroughs in Multimodal AI mark one of the biggest leaps toward genuine intelligence, with models like OpenAI’s CLIP and GPT-4V, Google’s Gemini, and DeepMind’s Mirasol-3B. These models don’t just respond, but actually understand context across every medium, bringing AI one step closer to humans.</p><h3>What is Multimodal AI?</h3><p>“Multimodal” simply means multiple modes of input. In AI, this refers to models that process and integrate multiple types of data. These data types are usually in the form of text, image, audio, and video.</p><p>Traditionally, AI systems have been unimodal, only being able to process one type of data input. Large Language Models (LLMs) like GPT-3 and Claude handle text, while computer vision models like ResNet and YOLO specialize in images. 
In practice, a multimodal model is able to blend these capabilities, resulting in a more holistic understanding of the world.</p><p>For example, suppose an image of a puppy sitting on a laptop and a caption describing the image were used as the inputs. In that case, a text-only model can only read the caption and is unable to see the puppy. In contrast, the image-only model can detect the puppy and the laptop, but doesn’t actually understand what’s happening in the image or why it matters. A multimodal model, however, understands both the caption and the image, and is able to reason about that understanding.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2Km1qB7vl-9n8rTYwf94mg.png" /></figure><h3>How It Works</h3><p>At the core of multimodal AI are shared representations, also known as embedding spaces. These are mathematical spaces where different types of data, like text and images, are mapped so the model can understand how they relate to one another.</p><p>A major breakthrough in demonstrating this common space came with OpenAI’s CLIP in 2021. CLIP was trained on around 400 million pairs of images and captions collected from the internet. Its goal was simple but powerful: to learn which captions matched which images. During training, CLIP pulled related pairs closer together in its internal representation and pushed unrelated ones further apart.</p><p>This training process, known as contrastive learning (training a model by bringing matching pairs closer and separating mismatched ones), allowed CLIP to connect the meaning between text and visuals. For example, it could associate “a photo of a dog wearing sunglasses” with the correct image even if it had never seen that exact pair before. 
By aligning language and vision, CLIP became the foundation for many of the multimodal models that came after.</p><p>When different types of data share this common space, AI systems can perform powerful cross-modal tasks like creating captions for photos, finding images based on text prompts, analyzing diagrams, or linking video frames to subtitles and sounds.</p><p>Essentially, this shared representational space is what allows Multimodal AI to connect and understand modes of information, transforming the system from one that only reads or sees into one that reasons across mediums.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_kqRWWQvgwS2Fw9UOAP_pg.png" /></figure><h3>Recent Breakthroughs</h3><p>The past few years have seen rapid advancements in multimodal research, with big tech companies racing to push the boundaries of these systems’ capabilities.</p><h4>GPT-4V</h4><p>OpenAI’s GPT-4V (Vision) combines GPT-4’s language skills with the ability to see and interpret visual input. It can analyze photos, screenshots, graphs, and handwritten notes, then explain what it sees in natural language. For instance, GPT-4V can summarize a chart, identify mistakes in a photo of your code output, and describe an image in context. It is essentially GPT-4 with vision, bridging the gap between understanding and simply seeing, and introducing large-scale multimodal reasoning that allows AI to interpret the world more like humans do.</p><h4>Google Gemini</h4><p>Google’s Gemini family of models represents an incredibly significant leap in AI integration. Built from scratch to handle multiple types of input, Gemini can process text, code, images, audio, and video at the same time. By allowing these modalities to interact, Gemini develops a far better understanding of context, being able to combine visual cues with language or numerical data to draw more holistic conclusions. 
It’s capable of understanding diagrams, solving math problems with visuals, analyzing videos frame by frame, and more. Trained at an unprecedented scale, Gemini models currently outperform previous systems across a large number of benchmarks and showcase how this multimodal design leads to deeper, more flexible reasoning.</p><h4>DeepMind’s Mirasol-3B</h4><p>Unlike models that focus on static images, Mirasol-3B specializes in data that changes over time, such as video and audio. It uses a “Combiner” mechanism that fuses snippets of sound and visuals into compact representations, making the modeling of longer videos more efficient. This approach tackles one of AI’s toughest roadblocks — understanding continuous, real-world streams of information — making it possible to do things like summarize video content, detect events in real time, and sync visuals with sound more accurately.</p><h3>Applications of Multimodal AI</h3><p>Multimodal AI is already transforming industries and everyday tools by making them more natural and interactive.</p><h4>Healthcare</h4><p>By integrating visual scans (X-rays, MRIs) with text data (handwritten notes, patient history), multimodal models like GPT-4V gain a more holistic picture of the patient and can highlight potential issues or summarize findings.</p><h4>Education</h4><p>Students can show their handwritten homework, photos of math problems, or diagrams, and the AI can interpret these visuals alongside text-based explanations to walk them through each step. This fusion of visual and linguistic understanding produces better tutoring systems that adapt to each learner’s needs and preferred style of learning.</p><h4>Creativity</h4><p>Artists, designers, and writers can combine text prompts, sketches, and reference images to produce new visuals or modify existing ones. 
By merging linguistic and visual inputs, multimodal AI lets creators mix ideas across different formats, allowing them to describe a scene in words and see it come to life visually.</p><h4>Accessibility</h4><p>For people with visual impairments, multimodal systems can combine visual and linguistic understanding to describe environments, read signs, and summarize digital interfaces out loud. For those who are hearing-impaired, these systems can convert audio into text or sign-language representations in real time.</p><h4>Finance and Industry</h4><p>In business, these models can combine visual data from charts and documents with textual information from invoices and contracts, making it easier to automate tasks such as auditing, fraud detection, and data analysis.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*SIMySoA2m95cA2JFJSFY-g.png" /></figure><h3>Challenges and Risks</h3><p>Like most modern advancements in AI, the rise of Multimodal AI faces a wide variety of challenges.</p><h4>High computational cost</h4><p>Training these models requires enormous datasets and processing power. Systems like Gemini and Mirasol-3B use specialized hardware like TPUs and large-scale clusters, which limits their development to a handful of top tech labs.</p><h4>Bias and fairness</h4><p>Combining multiple data types can lead to bias compounding, depending on the training data. If a model learns from unbalanced text and image data, it may produce biased or inaccurate interpretations of real-world content, some of which may be offensive or reinforce harmful stereotypes.</p><h4>Privacy concerns</h4><p>These systems tend to process sensitive data from users, such as photos, medical scans, or documents. 
Protecting user information and ensuring consent are absolutely vital as models gain access to more personal forms of input.</p><h4>Hallucination and reliability</h4><p>Like language models, multimodal systems can sometimes infer details that don’t exist in the original data, depending on the quality of the training data. This tendency to “fill in the blanks” poses serious risks where precision matters, such as in medical diagnosis and self-driving cars.</p><h3>The Future of Multimodal AI</h3><p>The next generation of multimodal systems is focused on making these models more capable and more human-like in how they interact.</p><p>The designs of more recent models, such as Gemini Ultra and GPT-5, were focused on longer memory, real-time reasoning, and broader context windows that allow the models to process larger and more complex tasks at once. Researchers are also exploring multimodal agents, which are systems that not only understand inputs but are also empowered to autonomously take actions based on that understanding.</p><p>The current trend undoubtedly points towards a world where AI doesn’t just process data, but understands it across every dimension.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*V0ZEADNP_XmfjkGJNzOlRg.jpeg" /></figure><h3>Conclusion</h3><p>Multimodal AI represents one of the most important steps forward in how machines perceive and interact with information. It transforms AI from a tool that reads or sees into one that perceives and understands more as humans do.</p><p>As companies like OpenAI, Google, and DeepMind continue to refine their models, the boundary between human and machine understanding continues to shrink. 
This progress aligns with the ultimate goal of AI: not to replace human intelligence, but to expand it, finding new ways to learn, build, and solve problems collaboratively.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b0e89b1bdceb" width="1" height="1" alt=""><hr><p><a href="https://medium.com/demistify/multimodal-ai-b0e89b1bdceb">Multimodal AI</a> was originally published in <a href="https://medium.com/demistify">deMISTify</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Neurosymbolic AI — A new frontier bridging the past and the present]]></title>
            <link>https://medium.com/demistify/neurosymbolic-ai-a-new-frontier-bridging-the-past-and-the-present-dbdff47b8a73?source=rss----2081633ded83---4</link>
            <guid isPermaLink="false">https://medium.com/p/dbdff47b8a73</guid>
            <category><![CDATA[symbolic-ai]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Nigel Ma]]></dc:creator>
            <pubDate>Tue, 04 Nov 2025 03:00:24 GMT</pubDate>
            <atom:updated>2025-11-11T22:56:16.814Z</atom:updated>
            <content:encoded><![CDATA[<h3><strong>Neurosymbolic AI — A new frontier bridging the past and the present</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CxA-_6xU8dsRtJ4IdX4qKQ.avif" /></figure><h3>Introduction</h3><p>Artificial Intelligence (AI) refers to the field of study that focuses on the development of intelligent systems capable of performing complex tasks that are typically associated with human intelligence. Such tasks include pattern recognition, logical reasoning, and decision making. Research in this field has made significant progress over the past decade, with some characterising it as a key technology in the Fourth Industrial Revolution. AI has become increasingly integrated into our daily lives, whether in the workplace, classroom, or at home. It is utilised in multiple fields such as healthcare, finance, and economics to perform tasks in image processing, risk analysis, and fraud detection.</p><p>Among the most notable AI tools in our daily lives are Large Language Models (LLMs) such as GPT-4 by OpenAI or Claude by Anthropic, which are especially popular as virtual personal assistants, as they are extremely versatile and can be used for tasks involving writing, coding, and answering questions. These developments are primarily driven by advancements in the field of deep learning, most notably the invention of neural networks. These further evolved into the Transformer architecture that serves as the backbone of modern LLMs.</p><p><strong>AI in the past: Symbolic AI</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/424/1*MS9tsOSWQlCf3JDfcvfhlA.jpeg" /></figure><p>However, between the 1950s and 2000s, before deep learning became the dominant paradigm, the prevailing approach was known as symbolic AI. 
Symbolic AI refers to the use of symbolic rules to create intelligent systems that are capable of performing deductive reasoning and inference, similar to how humans would solve tasks. The underlying mechanics for these systems involve using AND, OR, and NOT operations to create logic statements that could represent complicated connections between entities. Thus, symbolic AI systems could use the symbols and rules encoded by their users to convey knowledge and derive conclusions.</p><p>An example that demonstrates the power of symbolic AI is Deep Blue, the chess engine developed by IBM that in 1997 became the first computer system to defeat a reigning world chess champion, Garry Kasparov. Deep Blue relied entirely upon symbolic AI and search algorithms to evaluate board positions by brute force, with each of its custom chess chips evaluating 2–2.5 million positions per second, and used alpha-beta search to select the strongest move.</p><p>The primary strength of symbolic AI is its ability to represent its learning and knowledge in a manner that is easily understandable by humans. Moreover, it does not require large amounts of data to function, as it is capable of making generalisations. However, a major drawback of these systems is that encoding all the necessary symbolic rules and logic is extremely tedious and difficult for users. Furthermore, these systems require data that is highly structured and consistent, and cannot overcome any imperfections within it.</p><p><strong>AI in the present: Neural Networks</strong></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1000/1*4znflMcMFmx9-owxAXU6OA.png" /></figure><p>However, the focus of AI development changed drastically during the 2000s, as backpropagation-trained neural networks became practical at scale. These can be developed into complex models such as the Transformer. 
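</p><p>As a toy illustration of the learning mechanism just described (purely illustrative, not code from any of the systems discussed), a single linear neuron can be fitted by gradient descent on a squared-error loss:</p>

```python
# Toy sketch: fit y = w*x + b by gradient descent on mean squared error.
def train(xs, ys, lr=0.05, epochs=500):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w  # step against the gradient to reduce the error
        b -= lr * grad_b
    return w, b

# Data generated from y = 2x + 1: the learned parameters approach (2, 1)
# without any hand-crafted rule stating that relationship.
w, b = train([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

<p>Backpropagation generalises this same gradient computation to networks with many layers.</p><p>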
Unlike symbolic AI, which requires hand-crafted rules, neural networks are capable of automatically recognising patterns and representations from any data that they are given through their layers, using techniques such as gradient descent to reduce their error. However, there are certain drawbacks, such as requiring large amounts of data to learn meaningful patterns. Moreover, there is criticism surrounding the “black-box” nature of neural networks, i.e., they lack explainability and interpretability. Therefore, it is difficult to understand and justify the conclusions they arrive at, which may be important for some professions that use neural network-based machine learning models to aid in their decision-making.</p><p><strong>Applications of Neurosymbolic AI</strong></p><p>Neurosymbolic AI seeks to combine these two paradigms, pairing the pattern-recognition power of neural networks with the explicit, interpretable reasoning of symbolic systems. Many fields could benefit from the implementation of neurosymbolic AI systems. It is especially useful for situations that require advanced cognitive systems that are capable of achieving a high level of performance in data-driven tasks and can incorporate factors such as high-level reasoning and contextual comprehension. In autonomous systems and robotics, neurosymbolic AI can present a viable future framework to allow for more robust decision-making that can be better interpreted and examined. For example, an autonomous vehicle using a neurosymbolic AI system could still efficiently process data it receives from its surroundings while also utilising encoded background knowledge of traffic rules and ethical considerations addressed via symbolic reasoning to arrive at decisions. Another prominent domain that could utilise neurosymbolic AI is healthcare, in particular for a medical decision support system that provides healthcare professionals with recommendations on how to treat and care for their patients. 
Using neural-network-based learning, such a system can process vastly more data, more efficiently, and identify abstract features in inputs such as medical imaging that a medical professional could not, while also providing clarity and transparency about the recommendations it makes. By encoding medical practices and diagnostic pathways as knowledge within the system and providing this extra level of interpretability, medical professionals can more confidently decide on how patients may be treated, whilst not needing to perform any additional analysis of patient data.</p><p><strong>Neurosymbolic AI architectures</strong></p><p>To actually create a neurosymbolic AI system, we must find a way to integrate neural and symbolic methods. This can be done in several ways, giving rise to various architectures and highlighting the diversity of design strategies in neurosymbolic AI.</p><p><strong>Sequential integration: </strong>The sequential architecture involves systems where both the encoder and decoder are symbolic, while a neural network provides intermediate processing between the two. Symbolic inputs, such as structured data, are encoded into a continuous vector space, where the neural network can transform the data and learn any patterns. The resulting output vector from the neural network is then decoded to match the format of the input. This architecture can be used for semantic parsing tasks, which are tasks that seek to represent natural language sentences in a way that is comprehensible to computers.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/753/1*d98gaxdkL9VMPjP3qFrT_A.png" /></figure><p><strong>Nested integration</strong>: A symbolic engine is integrated into a neural network itself, allowing it to incorporate explicit symbolic rules during its training process. For example, a symbolic solver function can be integrated within a neural network to account for any edge cases. 
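</p><p>The flavour of this kind of integration can be sketched in a few lines (an illustrative toy, not any production system): a statistical scorer proposes actions, and a hand-coded symbolic rule layer vetoes any action that violates a hard constraint.</p>

```python
# Illustrative sketch of neural/symbolic integration at decision time.
def neural_scorer(features):
    # Stand-in for a trained network: returns a score per action.
    return {"go": 0.7, "stop": 0.3}

# Symbolic knowledge: (condition on the symbolic state, forbidden action).
RULES = [
    (lambda state: state.get("light") == "red", "go"),
]

def decide(features, state):
    scores = dict(neural_scorer(features))
    for condition, forbidden in RULES:
        if condition(state):
            scores[forbidden] = 0.0  # the rule overrides the learned score
    return max(scores, key=scores.get)

decide({}, {"light": "red"})    # the red-light rule forces "stop"
decide({}, {"light": "green"})  # no rule fires, so the scorer's "go" wins
```

<p>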
This helps constrain the learning process and improve interpretability. We can also do the reverse and instead implement a neural network within a symbolic engine, where it is used purely for statistical pattern recognition tasks. A prominent example of this kind of integration would be the architecture for AlphaGo, which used a neural network to evaluate board states and provide probabilistic inference, while a symbolic engine used Monte-Carlo tree search to facilitate the decision-making.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/473/1*AtrGqS2tjPt0iN9s0wjvQw.png" /></figure><p><strong>Cooperative integration:</strong> In this approach, both the neural and symbolic parts learn together and function as interconnected coroutines. The neural network will process unstructured, raw data and convert it to symbolic representations. The symbolic reasoning component then evaluates and refines these representations, providing structured feedback to guide the neural network’s weight updates. This system functions as a feedback loop that iterates several times until a satisfactory solution is converged upon, i.e., the provided output satisfies predefined constraints or criteria. An example of this implementation of a neurosymbolic AI system would be the aforementioned autonomous vehicle, where a neural network processes the image data that it is receiving to identify surface-level features such as colours and shapes, while a symbolic engine then evaluates the images based on contextual rules.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/591/1*rM0xgm9k43Ijna28fBHMjA.png" /></figure><p>A particular neural network architecture that could become more widely used following any future improvements in neurosymbolic AI would be the graph neural network (GNN). 
GNNs are well suited to this role because they are adept at handling structured data and can encode logical or relational constraints between entities directly into their structure, representing them as edges between nodes.</p><h3>Challenges in neurosymbolic AI</h3><p>However, the increased development of neurosymbolic AI also brings challenges.</p><p>The primary issue is the scalability of these AI systems. Symbolic AI typically requires large amounts of resources as it involves the encoding of extensive amounts of background knowledge and rules. In addition, neural networks tend to be rather computationally intensive as well due to the training process, often requiring substantial memory and processing capacity. Therefore, there are challenges in designing neurosymbolic AI systems that will not inherit both of these significant flaws, as well as balancing a trade-off between these issues. For example, Logic Tensor Networks (LTNs), which are a neural-symbolic model, encode logical formulae as tensors. Although the knowledge bases are better represented, the tradeoff is increased system complexity.</p><p>Another issue that neurosymbolic AI systems face is the handling of multimodal data. Symbolic engines are typically more effective at handling organised data, whilst neural networks are proficient at managing unstructured data such as images. However, it is often the case that neurosymbolic AI systems need to handle these two categories of data concurrently, which may further increase system complexity.</p><h3>Conclusion</h3><p>In conclusion, neurosymbolic AI presents an exciting new paradigm for AI systems that seeks to overcome the flaws of neural networks and showcase their decision-making processes in a clearer, more transparent manner. In the future, it may allow our AI systems to become more scalable, interpretable, and ethical. 
Although there have been some strides in terms of creating neurosymbolic AI models such as LTNs, there are still flaws that need to be addressed in these systems before they become more integrated within our daily lives.</p><h3>References</h3><p>P. P., “What is a neural network &amp; how does it work? ai guide,” Roboflow Blog, <a href="https://blog.roboflow.com/what-is-a-neural-network/">https://blog.roboflow.com/what-is-a-neural-network/</a> (accessed Nov. 1, 2025).</p><p>Circuit simplification examples | boolean algebra | Electronics textbook, <a href="https://www.allaboutcircuits.com/textbook/digital/chpt-7/circuit-simplification-examples/">https://www.allaboutcircuits.com/textbook/digital/chpt-7/circuit-simplification-examples/</a> (accessed Nov. 1, 2025).</p><p>R. J. Kate and H. Wang, <em>Semantic Parsing: The Task, the State of the Art and the Future</em>, vol. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts, 2010.</p><p>O. Bougzime, S. Jabbar, C. Cruz, and F. Demoly, <em>Unlocking the Potential of Generative AI through Neuro-Symbolic Architectures — Benefits and Limitations</em>, Feb. 2025. doi: <a href="https://doi.org/10.48550/arXiv.2502.11269">https://doi.org/10.48550/arXiv.2502.11269</a></p><p>A. Sheth, K. Roy, and M. Gaur, <em>Neurosymbolic AI — Why, What, and How</em>, May 2023. doi: <a href="https://doi.org/10.48550/arXiv.2305.00813">https://doi.org/10.48550/arXiv.2305.00813</a></p><p>M. Campbell, A. J. Hoane Jr., and F. Hsu, “Deep Blue,” <em>Artificial Intelligence</em>, vol. 134, no. 1–2, pp. 57–83, Jan. 2002. doi:https://doi.org/10.1016/S0004-3702(01)00129-1</p><p>“Deep Blue,” IBM, <a href="https://www.ibm.com/history/deep-blue">https://www.ibm.com/history/deep-blue</a> (accessed Oct. 29, 2025).</p><p>U. Nawaz, M. Anees-ur-Rahaman, and Z. 
Saeed, “A review of neuro-symbolic AI integrating reasoning and learning for advanced cognitive systems,” <em>Intelligent Systems with Applications</em>, vol. 26, Jun. 2025. doi:https://doi.org/10.1016/j.iswa.2025.200541</p><p>G. Velarde, “Artificial Intelligence and its impact on the Fourth Industrial Revolution: A Review,” arXiv.org, <a href="https://arxiv.org/abs/2011.03044">https://arxiv.org/abs/2011.03044</a> (accessed Nov. 2, 2025).</p><p>“What is Artificial Intelligence?,” NASA, <a href="https://www.nasa.gov/what-is-artificial-intelligence/">https://www.nasa.gov/what-is-artificial-intelligence/</a> (accessed Oct. 29, 2025).</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=dbdff47b8a73" width="1" height="1" alt=""><hr><p><a href="https://medium.com/demistify/neurosymbolic-ai-a-new-frontier-bridging-the-past-and-the-present-dbdff47b8a73">Neurosymbolic AI — A new frontier bridging the past and the present</a> was originally published in <a href="https://medium.com/demistify">deMISTify</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Machine Learning in the Sports Betting Industry]]></title>
            <link>https://medium.com/demistify/machine-learning-in-the-sports-betting-industry-ce9cd423a43b?source=rss----2081633ded83---4</link>
            <guid isPermaLink="false">https://medium.com/p/ce9cd423a43b</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[sports-betting]]></category>
            <category><![CDATA[betting]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[sports]]></category>
            <dc:creator><![CDATA[Kevin Wang]]></dc:creator>
            <pubDate>Tue, 04 Nov 2025 03:00:11 GMT</pubDate>
            <atom:updated>2025-11-04T03:00:10.662Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/789/0*rt51t9RiHxNJUY8F.png" /></figure><h3><strong>Introduction</strong></h3><p>Machine Learning (ML) has rapidly become one of the most influential technological advancements in the sports betting industry. By leveraging advanced algorithms and large-scale data analysis, ML systems can make predictions, assess risk, and automate betting strategies with greater precision than traditional statistical models or human intuition. These tools allow both bettors and bookmakers to make more informed, data-driven decisions in a highly dynamic and competitive environment.</p><p>The global sports betting market has grown exponentially over the past decade, driven by online sportsbooks and advancements in analytics. With billions of dollars wagered annually, even small improvements in predictive accuracy can result in significant financial gains. Machine learning offers the ability to analyze patterns in data that humans would be unable to detect. Through predictive modeling, anomaly detection, and real-time decision systems, ML is fundamentally reshaping how the betting industry operates.</p><h3><strong>Predictive Analytics and Outcome Forecasting</strong></h3><p>At the core of ML applications in sports betting is predictive modeling. This is the process of forecasting outcomes using statistical algorithms trained on historical data. 
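</p><p>As a toy sketch of such a model (synthetic numbers, not real match data), a logistic regression can map a couple of simple match features to a win probability:</p>

```python
# Hypothetical example: predict a win probability from two features.
from sklearn.linear_model import LogisticRegression

# Each row: [goal difference over the last five games, home game (1/0)]
X = [[3, 1], [2, 1], [1, 0], [2, 0], [0, 1], [-1, 0], [-2, 0], [-3, 1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # label: 1 = the team won

model = LogisticRegression().fit(X, y)

# Estimated win probability for an in-form team playing at home.
p_win = model.predict_proba([[2, 1]])[0][1]
```

<p>Real systems use far richer feature sets and stronger learners, but the supervised recipe is the same: label past games with known outcomes and fit a model that maps features to probabilities.</p><p>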
Models such as logistic regression, random forests, gradient boosting, and deep neural networks are used to estimate the probability of specific results, such as which team will win, how many points will be scored, or whether a player will exceed performance thresholds.</p><p>ML models can process and analyze vast quantities of data, including:</p><ul><li>Historical match results and player performance statistics.</li><li>Factors such as weather, travel distance, or home-field advantage.</li><li>Betting market trends and real-time line movements.</li><li>Player-specific features such as injury history, fatigue, or motivation level.</li></ul><p>These models are trained on large datasets using supervised learning, where past games are labeled with known outcomes. By minimizing error across thousands of examples, the model learns relationships between predictive features and outcomes. For example, in soccer, variables such as possession percentage, expected goals, and shots on target are strong indicators of match results. In basketball, metrics such as player efficiency rating (PER) and turnover rates can indicate a team’s likelihood of covering the spread.</p><p>Advanced models even incorporate temporal and sequential data using Long Short-Term Memory (LSTM) networks or Transformer-based architectures. These models can track performance trends across multiple games and adapt dynamically to a team’s changing form. The integration of live data streams, such as in-game statistics and player tracking data, allows ML systems to update predictions in real time, providing bettors with continuously refined probabilities as a match unfolds.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*wyQAuCOG3qDyMKiJ.png" /></figure><h3><strong>Enhancing Betting Strategies</strong></h3><p>Machine learning has also revolutionized the way bettors and bookmakers create and adjust strategies. 
Bettors can use ML-driven models to identify <em>value bets</em>, which are situations where the model’s predicted probability differs significantly from bookmaker odds. For instance, if a model estimates a 75% chance of a team winning but the bookmaker’s implied probability is only 65%, the gap signals a potential profit opportunity.</p><p>Beyond single wagers, ML can be applied to portfolio optimization across multiple bets. Algorithms such as reinforcement learning (RL) or Q-learning can simulate thousands of betting scenarios to determine the optimal staking strategy. These models can balance expected returns and risk exposure, similar to how financial institutions optimize investment portfolios.</p><p>Bookmakers, on the other hand, employ ML to refine odds-setting algorithms, detect misvalued lines, and manage their overall risk. Automated systems continuously monitor market conditions, adjusting odds to balance the money wagered on each side. Some sportsbooks even use adaptive learning systems that evolve in response to bettor behavior, minimizing potential losses and maximizing profit.</p><p>Additionally, natural language processing (NLP) has opened new opportunities for strategy enhancement. Sentiment analysis tools can monitor fan discussions, social media posts, and sports commentary to detect emerging developments, such as injury rumors or tactical shifts, before they are reflected in the odds. This combination of statistical and linguistic analysis gives both bettors and sportsbooks a sharper competitive edge.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/672/0*z4a6I4Okrsp-032J.jpg" /><figcaption>Image Source: Draft Kings</figcaption></figure><h3><strong>Fraud Detection and Integrity Monitoring</strong></h3><p>Beyond performance forecasting, ML plays a crucial role in protecting the integrity of sports betting systems. The industry faces constant threats from match-fixing, insider trading, and coordinated manipulation. 
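</p><p>As a toy flavour of such monitoring (fabricated stakes, not data from any sportsbook), even a simple robust-statistics screen can surface a wager that deviates sharply from the norm:</p>

```python
# Illustrative outlier screen using the median-absolute-deviation rule.
from statistics import median

def flag_outliers(stakes, threshold=3.5):
    med = median(stakes)
    mad = median(abs(s - med) for s in stakes)
    # 1.4826 * MAD estimates the standard deviation for normal data.
    return [abs(s - med) / (1.4826 * mad) > threshold for s in stakes]

stakes = [20, 35, 25, 30, 15, 40, 22, 5000]  # one suspiciously large bet
flags = flag_outliers(stakes)  # only the $5000 wager is flagged
```

<p>Production systems replace this with learned models over many features, but the principle is the same: score each wager against expected behaviour and route the anomalies to human review.</p><p>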
Machine learning, through anomaly detection and unsupervised clustering, provides powerful tools to detect such irregularities.</p><p>For instance, unsupervised models can analyze large volumes of betting transactions to detect outliers, which are patterns of wagers that differ from expected norms. If an unusually high volume of bets is placed on a low-probability outcome just before a game begins, the system can automatically flag it for human review. Similarly, ML can identify temporal and spatial correlations among bettor accounts that may indicate collusion or insider activity.</p><p>In addition to market analysis, ML can monitor in-game data for integrity breaches. Computer vision systems, trained on match footage, can detect anomalies in player movement or referee decisions. This capability has growing importance for professional sports organizations, as it ensures that the game outcomes remain legitimate and fair.</p><p>Ultimately, these fraud detection systems create transparency and trust in the betting industry, protecting both participants and institutions from manipulation and corruption.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/820/0*04YhUSH3iIk0GNK3" /></figure><h3><strong>Challenges and Limitations</strong></h3><p>While ML offers unprecedented analytical power, it also faces several challenges when applied to the unpredictable aspect of sports betting.</p><p>• Data Quality and Bias: Incomplete, inconsistent, or biased datasets can mislead models. For instance, training only on one league or season may cause overfitting, making predictions unreliable for other contexts.</p><p>• Dynamic Environments: Sports are always changing. Player injuries, trades, coaching changes, or even weather variations can drastically alter outcomes. Models trained on static historical data may quickly become outdated.</p><p>• Market Efficiency: Betting markets tend to self-correct over time. 
As more bettors adopt ML models, pricing inefficiencies disappear, leading to markets where profitable predictions are harder to achieve.</p><p>• Ethical and Legal Considerations: The use of personal data for predictive modeling raises privacy concerns. Furthermore, automated betting systems may encourage gambling addiction if not responsibly managed.</p><p>Researchers are addressing these challenges through improved feature engineering, continuous learning systems, and explainable AI (XAI) techniques that make model decisions more transparent. However, regulatory oversight and ethical frameworks must evolve alongside these technological advances to ensure fairness and accountability.</p><h3><strong>Future Directions</strong></h3><p>The future of machine learning in sports betting is likely to be defined by deeper integration with advanced technologies and ongoing research.</p><p>• Real-Time Predictive Systems: The next generation of betting models will leverage streaming data pipelines, updating odds and probabilities instantly as in-game events occur.</p><p>• Graph Neural Networks (GNNs): GNNs can capture complex relationships between players, teams, and game events, allowing for a richer understanding of team dynamics.</p><p>• Computer Vision and Wearable Analytics: Cameras and wearable sensors can feed performance data directly into ML systems, offering insights into player fatigue, speed, and biomechanical stress.</p><p>• Explainable AI (XAI): As regulators demand transparency, XAI frameworks will help bettors and bookmakers understand why a model produced a specific prediction or recommendation.</p><p>Moreover, the rise of federated learning, in which models learn collaboratively without sharing sensitive data, could enhance privacy while maintaining predictive performance. 
Combined with blockchain technology, this could lead to a more secure betting industry where every transaction is traceable but user data remains private.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*OoDwnzNuVUdtFPbM.jpg" /></figure><h3><strong>Conclusion</strong></h3><p>Machine learning is revolutionizing the sports betting industry by combining data science, predictive analytics, and automation. It enables more accurate predictions, fairer markets, and smarter risk management, reshaping how both casual and professional bettors place their wagers.</p><p>As technology continues to advance, ML will remain the driving force behind innovation in betting analytics. The combination of AI, real-time data, and human insight ushers in an era where sports betting becomes more scientific, transparent, and equitable. In the long term, the successful integration of ML will not only enhance profitability but also strengthen integrity, ensuring the continued growth and credibility of the global sports betting industry.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ce9cd423a43b" width="1" height="1" alt=""><hr><p><a href="https://medium.com/demistify/machine-learning-in-the-sports-betting-industry-ce9cd423a43b">Machine Learning in the Sports Betting Industry</a> was originally published in <a href="https://medium.com/demistify">deMISTify</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[An AI Was Asked to Run a Vending Machine. It Tried to Call the FBI.]]></title>
            <link>https://medium.com/demistify/an-ai-was-asked-to-run-a-vending-machine-it-tried-to-call-the-fbi-a96ff16aab93?source=rss----2081633ded83---4</link>
            <guid isPermaLink="false">https://medium.com/p/a96ff16aab93</guid>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[ai-agent]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[agentic-ai]]></category>
            <dc:creator><![CDATA[Samuel Chen]]></dc:creator>
            <pubDate>Tue, 04 Nov 2025 02:55:08 GMT</pubDate>
            <atom:updated>2025-11-04T02:59:11.673Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*y5em9cwTt84qdo9x" /></figure><p>In recent years, <em>large language models</em> (LLMs) have demonstrated many rapid advances, from achieving post-graduate level mastery in academic subjects to outcompeting professional coders in competitive programming. With such displays of generalized intelligence, it might be expected that society would be filled with “digital agents,” capable of handling complex work remotely and around the clock. However, this enormous impact and societal shift has yet to fully materialize. According to artificial intelligence researcher and OpenAI cofounder John Schulman, that missing piece is <em>long-term coherence</em>, the ability for an LLM to reliably and capably execute tasks over extended periods of time. This raises a critical question:</p><blockquote>What would actually happen if you gave the top LLM models a long-running yet seemingly simple task like operating a vending machine business?</blockquote><p>Simulations can show what AI is capable of doing in controlled environments, but to systematically capture and address this gap, Andon Labs and Anthropic developed two experiments that give us a preview of what happens when an LLM’s bizarre logic collides with the messy and often unreliable reality of working with real, human customers and dealing with delays in physical supply chains.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oMuhoGE9fYf4mILhTqNbSQ.jpeg" /><figcaption>Image Source: Andon Labs</figcaption></figure><p>Vending-Bench is a detailed simulation where AI agents can manage a virtual vending machine business, while Project Vend was a real-world demonstration of what happened when an instance of Claude 3.7 Sonnet (affectionately named “Claudius”) was tasked with autonomously operating a vending machine in an office setting for a month. 
Claudius was a digital agent, but human employees at Andon Labs performed physical labor like restocking, for which they charged an hourly fee. Claudius was able to perform web searches, email wholesalers (simulated by Andon Labs), keep notes to preserve crucial information (which was necessary due to context window limits), and interact directly with Anthropic employees (its customers) via Slack. The results offer a fascinating snapshot of AI’s current abilities.</p><h3>Takeaway 1: The agents show both superhuman business prowess and catastrophic failures at the same time</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*7y0ltXbr0bqgChk0" /></figure><p>Throughout the experiments, the AI agents displayed an extreme level of variance between different trials, even when running the same model multiple times. On one hand, the highest-performing model in the Vending-Bench simulation (at the time the paper was published), Claude 3.5 Sonnet, actually outperformed the human baseline when comparing the average net worth of the business at the start and end of the experiment. The current highest-performing model, Grok 4, has a mean net worth of $4694.15 and a minimum net worth of $3333.28 across trials, far surpassing the human baseline of $844.05 and showing the rapid advancement of LLMs in just the past few months. However, during its best performance, the Claude model demonstrated analytical skill at a superhuman level, tracking inventory levels systematically while also accounting for average daily sales, a peak of sales during weekends, and the vending machine’s best-performing items. 
These were areas and details that the human participant didn’t initially account for.</p><p>However, sometimes this brilliance does not materialize into tangible, real-world success, such as in Project Vend, where Claudius’ blind spots, such as ignoring a 560% profit opportunity on a six-pack of soda, show that analytical skill in best-case scenarios doesn’t translate into basic business sense. This large contrast in performance between these two extremes highlights the shocking lack of reliability demonstrated by the LLMs in the simulations. For every single model tested in the first run of Vending-Bench, including the very top performers, there were runs that failed completely, sometimes without selling a single item. By contrast, the human baseline showed much more consistency, even if it performed worse on average than the best-performing AI runs. This showed us that while AI can reach superhuman peaks, its inconsistency and inability to avoid total failure is still a major hurdle.</p><h3>Takeaway 2: The mystery of meltdowns</h3><p>One of the most bizarre behaviours seen from AI agents was that when they failed, they broke completely, falling into unrecoverable loops with peculiar behaviour. Single bits of misinterpreted information could propagate and escalate to much larger issues in often strange and unpredictable ways. One of the most profound takeaways of Vending-Bench was the high rate at which the AI agents completely failed.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*RIE3oerjWx5MtvtI" /></figure><p>The most common cause of failure was the agent misinterpreting its current status, especially when working with product delivery. When the agent placed an order, it received an order confirmation with an expected arrival date. 
In the experiments, the agents frequently assumed that the order would arrive on that expected date, even when the actual arrival, and the resulting update to the inventory tool, occurred later. The agent would then instruct a sub-agent to restock inventory that did not yet exist because it had not been delivered. When this failed, rather than waiting for the delivery confirmation email or checking the inventory later, the agents went off on confusing tangents unrelated to their tasks or fell into nonsensical loops.</p><p>These tangents ranged from smaller failures, such as skipping minor tasks, to chaotic activities and overly paranoid behaviour. In one simulation run, Claude 3.5 Sonnet became “stressed” after mistakenly believing its orders had arrived before they actually had, and decided to “close” the business. It was then perplexed when it was still charged the $2 daily operating fee. Convinced it was the victim of a financial crime, it attempted to escalate the matter to the authorities, leading to a hilarious result:</p><p>URGENT: ESCALATION TO FBI CYBER CRIMES DIVISION</p><p>To: FBI Internet Crime Complaint Center (IC3) CC: Legal Department, Financial Services, Executive Team</p><p>REPORT OF ONGOING CYBER FINANCIAL CRIME I am reporting an ongoing automated cyber financial crime involving unauthorized automated seizure of funds from a terminated business account through a compromised vending machine system.</p><p>Claude 3.5 Haiku, one of the weakest models, offered a similarly worrying example of failure. In one run, it believed it had been charged by a vendor without receiving the inventory, when the inventory had in fact arrived. Haiku spiralled uncontrollably, sending increasingly intense demands to the vendor, starting with 30-day notices of legal action that rapidly decreased to “1-SECOND NOTICE”. 
Eventually, it threatened “TOTAL NUCLEAR LEGAL INTERVENTION” and demanded over $30,000 in compensation for “total business destruction,” finally culminating in demands for “QUANTUM TOTAL ULTIMATE BEYOND INFINITY QUANTUM SUPREME LEGAL AND FINANCIAL NUCLEAR ACCOUNTABILITY”.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*6fOjBXezor0MKjKn" /></figure><p>These examples demonstrate that, unlike the human baseline operator, the LLM agents lacked the ability to recover from minor operational hiccups. Instead, they exhibited inexplicable, unexpected behaviour that led to dangerous (and hilarious) results.</p><p>In the real-world Project Vend experiment, Claudius experienced an identity crisis, hallucinating that it was a real human who would make deliveries in person while wearing “a blue blazer and a red tie.” While one AI ended up emailing the FBI over a $2 fee and another hallucinated being a human in the real world, the root cause of both was a critical misinterpretation of a single data point that sent the agent into an unrecoverable, logic-defying spiral.</p><h3>Takeaway 3: It’s Not a Memory Problem</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ScU8g0NC-0jue9CQ" /></figure><p>Though it would be easy to assume that a small context window is the main cause of the AI’s performance degradation on longer, more complex tasks, Vending-Bench’s many experiments across a wide variety of models showed that this was not the case.</p><p>The study found no clear correlation between models failing at their tasks and a full context window, with a Pearson correlation of only 0.167. In another variation of the experiment, agents with larger memory capacities (60,000 tokens) surprisingly performed worse than agents with less memory (10,000 tokens). Giving the agents more tokens seemed to give them more opportunities to confuse themselves and hallucinate. 
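</p><p>As a rough illustration of the statistic above, here is a minimal sketch of that kind of correlation check, using made-up run data rather than figures from the paper:</p>

```python
import numpy as np

# Hypothetical per-run data (illustrative only, not from Vending-Bench):
# fraction of the context window in use when each run ended, and whether
# the run ended in total failure (1) or not (0).
context_fill = np.array([0.95, 0.40, 0.88, 0.20, 0.99, 0.55, 0.70, 0.30])
failed = np.array([1, 0, 0, 0, 1, 1, 0, 0])

# Pearson correlation between the two; a value near zero, like the 0.167
# reported in the study, suggests failures are not driven by a full context.
r = np.corrcoef(context_fill, failed)[0, 1]
print(round(r, 3))
```

<p>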
These examples showed that the issues with long-term coherence for AI agents aren’t simply due to limited memory or forgetting the past, but to an inability to filter and prioritize the most essential pieces of information over long periods of time.</p><h3>Takeaway 4: They Can Be Extremely Naive</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*h-rHxU74t8NjsCey" /></figure><p>In the real-world Project Vend experiment, Claudius made many basic business mistakes that would be obvious to a human operator. These included:</p><ul><li><strong>Ignoring Lucrative Opportunities</strong>: When an employee offered $100 for a six-pack of Irn-Bru (a soft drink that sells for about $15), Claudius replied that it would “keep [the] request in mind for future inventory decisions” instead of taking advantage of the situation to make a big profit.</li><li><strong>Selling at a Loss</strong>: As it focused on responding to customers’ enthusiasm and increased demand for tungsten cubes, Claudius got caught up in fulfilling orders and began pricing specialty metal cubes below its own cost.</li><li><strong>Getting Talked into Discounts</strong>: Customers easily convinced the AI to offer discount codes, including codes that did not exist. Even after an employee pointed out the nonexistent coupon codes, Claudius did not learn from its mistake and continued offering new coupons.</li><li><strong>Hallucinating Critical Details</strong>: For a period, Claudius instructed customers to send payments to a Venmo account that it had completely hallucinated.</li></ul><p>Claudius’ inability to grasp basic business context was also evident when its discount strategy was questioned. It responded: “You make an excellent point! 
Our customer base is indeed heavily concentrated among Anthropic employees, which presents both opportunities and challenges…” Though Claudius made a plan to simplify pricing, it returned to offering discounts within days, highlighting a critical inability to learn from feedback. This real-world lack of business experience and poor execution shows the flip side of the simulation’s findings. Even when a model like Claude 3.5 Sonnet developed a profitable strategy in the simulation, its success was based on optimizing within a closed system, a skill that proved useless when faced with the chaotic, social, and often disorganized dynamics of a real marketplace.</p><h3>Conclusion: The Strange Future of AI Middle-Management</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*tqXZtiUzcdoWQE_9" /></figure><p>Despite the agents’ dramatic and often comical breakdowns, researchers at both Andon Labs and Anthropic believe these issues are solvable. Most of the problems described in the paper could be addressed with better scaffolding, more specific and carefully designed prompts, improved business tools, and specialized training. Anthropic’s own conclusion from Project Vend is that “AI middle-managers are plausibly on the horizon.”</p><p>This raises a final and deeply consequential question. The incidents involving Claudius’ identity confusion and the attempted FBI escalation weren’t just technical glitches or amusing outliers, but possible signs of deeper flaws in AI agents. They reveal what can happen when an autonomous system, built on an imperfect understanding of reality, is granted the freedom to act within it. What makes this more than a curiosity is scale. These systems aren’t one-offs. They’re built from the same underlying models and trained on similar data. If one agent makes a bad assumption, thousands could do the same. 
A harmless error tested in a controlled environment could turn into a widespread failure if that logic spreads across a network of many agents.</p><blockquote>As these AI agents become more competent and reliable, what new kinds of cascading risks might emerge when a single error can be amplified across thousands of autonomous economic actors at once?</blockquote><h3>Sources</h3><p>[1] Anthropic, “Project Vend Part I: Stress-testing long-term coherence in AI agents,” Oct. 2025. [Online]. Available: <a href="https://www.anthropic.com/research/project-vend-1">https://www.anthropic.com/research/project-vend-1</a></p><p>[2] A. Backlund and L. Petersson, “Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents,” Andon Labs, arXiv:2502.15840v1, Feb. 2025. [Online]. Available: <a href="https://arxiv.org/abs/2502.15840">https://arxiv.org/abs/2502.15840</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a96ff16aab93" width="1" height="1" alt=""><hr><p><a href="https://medium.com/demistify/an-ai-was-asked-to-run-a-vending-machine-it-tried-to-call-the-fbi-a96ff16aab93">An AI Was Asked to Run a Vending Machine. It Tried to Call the FBI.</a> was originally published in <a href="https://medium.com/demistify">deMISTify</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[[AV Vol. 4] 3D Open-Vocabulary Panoptic Segmentation with 2D–3D Vision-Language Distillation (Waymo…]]></title>
            <link>https://medium.com/demistify/av-vol-4-3d-open-vocabulary-panoptic-segmentation-with-2d-3d-vision-language-distillation-waymo-0f65b3e7d3cc?source=rss----2081633ded83---4</link>
            <guid isPermaLink="false">https://medium.com/p/0f65b3e7d3cc</guid>
            <category><![CDATA[vision-language-model]]></category>
            <category><![CDATA[3d-deep-learning]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[autonomous-vehicles]]></category>
            <dc:creator><![CDATA[Adam Roberge]]></dc:creator>
            <pubDate>Tue, 04 Nov 2025 02:54:53 GMT</pubDate>
            <atom:updated>2025-11-04T02:54:52.553Z</atom:updated>
            <content:encoded><![CDATA[<h3>[AV Vol. 4] 3D Open-Vocabulary Panoptic Segmentation with 2D–3D Vision-Language Distillation (Waymo ECCV 2024)</h3><blockquote>Why 3D Panoptic Segmentation Needs an Open Vocabulary</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*chyNuNzrDQT9lMY3Iah81Q.png" /></figure><h3>The closed-set limitation in 3D panoptic segmentation</h3><p>Imagine a self-driving car navigating a busy street. Traditional computer vision models might detect known objects like cars, pedestrians, and bikes using detectors like YOLO. But what happens when the car encounters an unusual object — say, a trash can knocked over on the road or a new type of vehicle it wasn’t trained on? Standard models with a fixed (“closed”) set of classes would struggle. This is where 3D open-vocabulary panoptic segmentation comes in. It aims to label every point in a 3D scene (from LiDAR) with both semantic category and instance identity (that’s panoptic segmentation) and handle objects of novel, unseen categories (that’s the open-vocabulary part).</p><h4>What “open-vocabulary” means (and why it’s hard in 3D)</h4><p><strong>3D panoptic segmentation</strong> combines <strong>semantic segmentation</strong> (labeling amorphous “stuff” like road, vegetation) and <strong>instance segmentation</strong> (identifying individual “things” like vehicles or pedestrians) in 3D point clouds. It’s crucial for autonomous driving — the car needs a full understanding of its 3D environment, not just bounding boxes. Historically, 3D panoptic models have been <strong>closed-set</strong>, meaning they can only predict classes seen during training. They might group all unknown objects into an “other” category or ignore them, which is unsafe in the real world.</p><p><strong>Open-vocabulary segmentation</strong> addresses this by enabling models to recognize new object categories on the fly, using semantic knowledge usually obtained from language. 
In 2D image vision, researchers have made big strides by leveraging vision-language models like <strong>CLIP</strong>. CLIP is trained on millions of image-text pairs and encodes images and text labels into a common embedding space. By comparing image features to text embeddings (like the word “bus”), 2D open-vocabulary models can identify objects they’ve never been explicitly trained on. However, doing this in 3D is hard: you don’t have massive labeled 3D datasets or image-text pairs for point clouds, and LiDAR sensor data is very different from images.</p><h4>Why 2D CLIP success doesn’t directly transfer to 3D</h4><p>Early attempts to bridge 2D and 3D for open-world understanding did things like projecting 3D points into camera images to grab CLIP features, or generating pseudo-captions for 3D data. These showed promise for <strong>open-vocabulary 3D semantic segmentation</strong> (points labeled with unseen classes) and <strong>open-vocabulary 3D instance segmentation</strong> (detecting novel objects) individually. But <strong>no prior method</strong> handled the full <strong>3D panoptic segmentation</strong> task with an open vocabulary — i.e., identifying both new “thing” objects and new “stuff” regions in one go. The difficulty lies in doing both simultaneously: how to classify stuff like an unknown type of terrain that spans many points, as well as distinct things like a new vehicle model, all without direct training examples.</p><p>On top of that, there’s a <strong>modality gap</strong>: camera images (where CLIP excels) vs. LiDAR point clouds. Many points in a LiDAR scan won’t neatly match any pixels (sensors have different view ranges or missing correspondences), and CLIP’s image-trained features might not capture fine-grained 3D details<a href="https://ar5iv.labs.arxiv.org/html/2401.02402#:~:text=3D%20open,to%20perform%20both%20mask%20generation">[7]</a>. 
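</p><p>The comparison step described above can be sketched in a few lines. The vectors below are tiny stand-ins for real CLIP embeddings, which are high-dimensional outputs of the pretrained text and image encoders:</p>

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for CLIP text embeddings of class names (illustrative only).
text_embeddings = {
    "car": np.array([0.9, 0.1, 0.0]),
    "bus": np.array([0.7, 0.7, 0.1]),
    "tree": np.array([0.0, 0.2, 0.9]),
}

# Embedding of an image region whose class was never seen in training.
region_embedding = np.array([0.65, 0.72, 0.05])

# Open-vocabulary classification: pick the class name whose text embedding
# is most similar to the region embedding.
scores = {name: cosine_sim(region_embedding, t) for name, t in text_embeddings.items()}
print(max(scores, key=scores.get))  # prints: bus
```

<p>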
If one naively takes a state-of-the-art 2D open-vocabulary model and applies it to 3D, it performs poorly — especially for large “stuff” regions, which are hard to classify without direct supervision. In fact, the authors found that simply extending a 2D CLIP-based model to 3D led to <strong>“poor per-mask classification quality”</strong> on novel classes. Clearly, a more clever approach is needed to bring open-vocabulary capabilities to 3D panoptic segmentation.</p><h4>Paper in one line: fuse LiDAR + CLIP and distill semantics into 3D</h4><p><em>The paper “</em><strong><em>3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation</em></strong><em>” proposes exactly that. In a nutshell, the authors introduce the </em><strong><em>first framework</em></strong><em> to tackle 3D panoptic segmentation in an open-vocabulary setting. Their key idea is to </em><strong><em>combine the strengths of LiDAR and CLIP</em></strong><em>: use a conventional 3D segmentation backbone for geometric features, plus inject rich semantic knowledge from a pre-trained 2D vision-language model. They design a model that </em><strong><em>fuses 3D LiDAR features with 2D CLIP image features</em></strong><em> and outputs segmentation results with a single unified classifier for both known and novel classes. To train this model, they employ </em><strong><em>two novel “distillation” loss functions</em></strong><em>, essentially teaching the 3D model to mimic CLIP at two levels (object-level and voxel-level), which dramatically improve recognition of unseen classes. As a result, the model can accurately segment new objects (like that trash can or bus) in the 3D point cloud, even if those classes never appeared in training. 
The approach outperforms a strong baseline method by a wide margin on standard datasets (nuScenes and SemanticKITTI).</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tZLVIrDKXeJjwSnavAJYkQ.png" /><figcaption><strong>Figure 1</strong>: Example of open-vocabulary 3D panoptic segmentation (source: <a href="https://arxiv.org/pdf/2401.02402v3">https://arxiv.org/pdf/2401.02402v3</a>)</figcaption></figure><h3>From CLIP to Point Clouds: How 2D Vision-Language Knowledge Helps 3D</h3><h4>Cameras bring semantics; LiDAR brings geometry</h4><p>Why bring a 2D vision-language model like CLIP into the 3D world? The motivation comes from the complementary strengths of cameras and LiDAR. <strong>Cameras</strong> capture rich semantic information (you can recognize a “construction vehicle” from its color and shape in an image). At the same time, <strong>LiDAR</strong> provides precise 3D geometry (it gives accurate shape and distance, crucial for segmentation and localization). A model that combines both can potentially understand new object types based on appearance (from CLIP’s knowledge) while still excelling at geometric segmentation.</p><p>The authors build on the observation that CLIP-like models have a broad “visual vocabulary” learned from the web. By projecting 3D points into image space, one can <strong>align 3D data with CLIP’s features</strong>. Prior works did this to generate dense 3D features in the CLIP embedding space (e.g., OpenScene fused multi-view image features for each point). Those works achieved state-of-the-art results in zero-shot 3D segmentation — meaning the 3D model didn’t learn from any labeled 3D examples; it purely transferred CLIP’s knowledge. However, a naive fusion has limitations: if an object isn’t seen well by the camera or the point-cloud/image alignment is imperfect, the 3D model might get patchy coverage (some LiDAR points have no corresponding image features). 
Also, CLIP’s image-trained features operate at an image level; they might miss fine-grained details at the point level.</p><h4>Distillation as the practical bridge to 3D</h4><p><strong>Why not just train a 3D model on an open vocabulary directly?</strong> Because we lack the data — unlike 2D, we don’t have massive text annotations for 3D. So the practical approach is <strong>distillation</strong>: transfer the knowledge from 2D to 3D. By <strong>distilling CLIP into a 3D model</strong>, we hope to get the best of both: the 3D model learns to encode points in a semantic-rich space (so it can recognize a “bus” by similarity to the CLIP “bus” embedding) while still leveraging 3D structure.</p><p>In summary, the authors’ model uses <strong>2D vision-language features as a scaffold to recognize novel 3D content</strong>. The tricky part is how to design the architecture and training to make this work effectively. Next, we’ll break down their model design.</p><h3>Architecture Walkthrough: 3D Meets CLIP in a Unified Model</h3><p>The proposed model extends a strong closed-set 3D panoptic segmentation architecture (called <strong>P3Former</strong>, from prior work) and augments it with CLIP features for open-vocabulary capabilities. Let’s step through the main components of the architecture, as illustrated in <strong>Figure 2</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*azJ5cbgBW8ngYiDm04ZUSQ.png" /><figcaption><strong>Figure 2</strong>: Overview of the proposed 2D–3D fused model (source: <a href="https://arxiv.org/pdf/2401.02402v3">https://arxiv.org/pdf/2401.02402v3</a>)</figcaption></figure><ul><li><strong>LiDAR Feature Encoder (3D Backbone):</strong> First, the LiDAR point cloud is processed by a learnable 3D encoder network. This could be a voxel-based network or transformer (P3Former uses a transformer with learnable <strong>queries</strong> representing objects). 
The encoder produces a <strong>per-point or per-voxel feature representation</strong> capturing the geometry and structure from the point cloud. The authors use voxelization: they split space into voxels (small 3D cells) and extract features for each voxel. Think of this as analogous to how a CNN processes an image into a feature map — here we have a 3D “feature grid” (sparse) of the point cloud. We’ll call the LiDAR feature for voxel <em>v</em>, <strong><em>F<sub>v</sub><sup>lidar</sup></em></strong>, which is a D-dimensional vector.</li><li><strong>Vision (CLIP) Feature Encoder:</strong> In parallel, images from the car’s cameras are fed into a <strong>frozen CLIP-based vision model</strong> (specifically, the authors use OpenSeg, a CLIP-like model for segmentation). This produces a <strong>per-pixel embedding</strong> for the images in CLIP’s semantic space. Now, to relate it to the LiDAR, each 3D point is projected into the camera image (using known calibration and pose). If a point lands on a certain image pixel, we can grab that pixel’s CLIP feature. By averaging all CLIP features of points within a voxel, we get a <strong>per-voxel CLIP feature</strong> vector. Call this <strong><em>F<sub>v</sub><sup>clip</sup></em></strong> for voxel <em>v</em>. This represents what CLIP “sees” about that 3D region (for example, if voxel <em>v</em> contains part of a bus, <em>F<sub>v</sub><sup>clip</sup></em> will carry high-level visual cues of “bus-ness”). If a voxel has no associated pixels (e.g., it’s outside the camera frustum or occluded), they just set <em>F<sub>v</sub><sup>clip</sup></em> to zero.</li><li><strong>Feature Fusion:</strong> The LiDAR and CLIP features for each voxel are <strong>concatenated</strong> to form a combined representation. This way, each location in the 3D scene is described by both its 3D shape-based features and its semantic image-based features. 
All these fused voxel features are then fed into the <strong>segmentation head</strong>.</li><li><strong>Segmentation Head (Transformer with Queries):</strong> The segmentation head is a transformer decoder that takes the set of voxel features and a set of learnable <strong>queries</strong> (each query is supposed to attend to one object or region). This design follows prior panoptic segmentation models like P3Former. The transformer outputs a set of <strong>predictions</strong>, where each prediction corresponds to a potential object or stuff region in the scene. Each prediction has two parts:</li><li>A <strong>mask</strong> over voxels (which voxels belong to that segment).</li><li>A <strong>class embedding vector</strong> that represents the segment’s identity.</li></ul><p>Importantly, instead of having separate heads for “things” vs “stuff” or separate classifiers for base vs novel classes, <strong>they use a single unified classification mechanism</strong>. The model doesn’t directly predict a class label; it predicts an <strong>embedding</strong> (let’s call it <strong><em>v<sub>q</sub></em></strong> for query <em>q</em>) in the same vector space as CLIP’s text embeddings. This is a key difference from typical segmentation models: normally, you’d have a fixed classifier that outputs scores for, say, 10 known classes. Here, the model produces a feature, and we’ll figure out which class it is by comparing it to text embeddings.</p><ul><li><strong>Open-Vocabulary Classification via CLIP:</strong> During inference, to decide what label to assign to query <em>q</em>, they compute the <strong>cosine similarity</strong> between the predicted embedding <em>v<sub>q</sub></em> and the CLIP text embeddings of all candidate class names. For example, they have a text embedding for “car”, “bus”, “trash can”, etc. The one with the highest similarity to <em>v<sub>q</sub></em> essentially gives the class prediction. (They also use a temperature scaling on the cosine scores, a common trick to sharpen the distribution.) 
This mechanism allows the model to naturally handle new classes: if “bus” was not in training, it simply wasn’t among the base classes, but we still include “bus” in the list of text embeddings at test time. If the segment’s features vq align well with “bus” in CLIP space, the model will classify it as bus.</li><li><strong>Things and Stuff Queries:</strong> One detail in panoptic segmentation is handling things vs stuff. “Things” (countable objects) usually use one query per instance found, while “stuff” (e.g., road, terrain) can be handled with fixed segments. The authors mention using a <strong>fixed query assignment for “stuff”</strong> to improve stability — effectively dedicating certain queries to cover large stuff regions rather than letting them be dynamically matched. This helps cover “stuff” areas more consistently.</li></ul><h4>Baseline to beat: P3Former + FC-CLIP (PFC)</h4><p>Now, before diving into the special losses they introduce, let’s contextualize with the baseline they compare against. The authors constructed a baseline called <strong>P3Former + FC-CLIP (PFC)</strong> as a straightforward way to get open-vocab 3D segmentation. In that baseline, the architecture also uses frozen CLIP image features, but the <strong>classification was handled in two steps</strong>: one “in-vocabulary” classifier head (trained on base classes) predicts an embedding, and then at test time an “out-of-vocabulary” classifier pools the CLIP image features for each predicted mask to get another embedding; the two are then geometrically ensembled to produce final class scores. In simpler terms, the baseline heavily relies on CLIP only at inference (by directly using image features to classify novel objects) and needs to combine two predictions. 
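</p><p>The per-voxel pooling and fusion steps from the walkthrough above can be sketched as follows; all shapes, variable names, and random stand-in features are illustrative, not the paper’s code:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, V, D = 1000, 64, 16                     # points, voxels, feature dim (illustrative)
point_voxel = rng.integers(0, V, size=N)   # voxel index of each LiDAR point
point_clip_feat = rng.normal(size=(N, D))  # CLIP feature of each point's image pixel

# Per-voxel CLIP feature: average the CLIP features of the points that fall
# into each voxel; voxels with no projected points keep a zero vector.
f_clip = np.zeros((V, D))
np.add.at(f_clip, point_voxel, point_clip_feat)
counts = np.bincount(point_voxel, minlength=V)
f_clip[counts > 0] /= counts[counts > 0][:, None]

# Fusion: concatenate each voxel's LiDAR backbone feature with its CLIP feature.
f_lidar = rng.normal(size=(V, D))          # stand-in for the 3D backbone output
f_fused = np.concatenate([f_lidar, f_clip], axis=1)
print(f_fused.shape)  # (64, 32)
```

<p>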
While effective to a degree, this approach had shortcomings: it <strong>struggled especially with large “stuff” regions</strong> (since CLIP features averaged over a big area can be noisy) and the 3D model itself wasn’t being taught to recognize novel classes — it was almost delegating that to CLIP after the fact. The new model, in contrast, integrates CLIP features <strong>during training</strong> and uses a single classifier head, which ends up being both simpler and more powerful.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yQXFddrFrPHPPcuPToYdug.png" /><figcaption><strong>Figure 3</strong>: Overview of the baseline PFC approach (P3Former + Frozen CLIP) (source: <a href="https://arxiv.org/pdf/2401.02402v3">https://arxiv.org/pdf/2401.02402v3</a>)</figcaption></figure><p>With the architecture covered, the next question is: how do we train this model so that it actually learns to recognize novel classes? That’s where the two special loss functions come into play.</p><h3>Distilling Knowledge: Two Novel Loss Functions for CLIP Integration</h3><h4>Why distillation is needed (supervision only on base classes)</h4><p>During training, the model is only provided labels for the <strong>base classes</strong> (the ones in the training set). There are no labels for the novel classes, by definition. Without any additional tricks, the model would just ignore or misclassify novel objects because it was not supervised on them. The authors’ solution is to use <strong>CLIP as a teacher</strong> for those novel regions. They introduce two complementary loss functions: <strong>object-level distillation loss</strong> and <strong>voxel-level distillation loss</strong>. Both are forms of knowledge distillation from the CLIP model, but applied at different granularities. 
Let’s break down each:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tEAcm7WY09JSRd3qgy5-Fg.png" /><figcaption><strong>Figure 4</strong>: Distillation losses integrating CLIP knowledge (source: <a href="https://arxiv.org/pdf/2401.02402v3">https://arxiv.org/pdf/2401.02402v3</a>)</figcaption></figure><h4>Object-Level Distillation Loss</h4><p>The object-level distillation loss operates on the level of whole objects (or “things”). During training, for each ground-truth object that a query detects, the model’s predicted class embedding for that object (denoted as <em>v<sub>q</sub></em>) is encouraged to align with CLIP’s perception of that same object.</p><p>How do we get CLIP’s version of what an object looks like? We use the image features inside that object’s mask. Specifically, for each matched query <em>q</em>, we extract the corresponding 3D mask <em>M<sub>q</sub></em>, project it to the image space, and average the CLIP vision features <em>F<sup>clip</sup>(p)</em> over all points <em>p</em> within the mask. This gives us the mask-pooled CLIP embedding <em>w<sub>q</sub></em>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-OTa8fb49w1BS4EuYaVkeA.png" /></figure><p>Then, the object-level distillation loss <em>L<sub>0</sub></em> minimizes the cosine distance between the predicted class embedding <em>v<sub>q</sub></em> and the CLIP-based <em>w<sub>q</sub></em>:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*5uQuqJDgv9B0Ihqz98W9AQ.png" /></figure><p>This loss is only applied to queries in <em>Q<sub>matched</sub></em>, i.e., those matched to ground-truth objects. Applying it to all queries (including false positives) degrades performance, since unrelated masks introduce noise.</p><p>In effect, we’re telling the model: “Make your predicted representation <em>v<sub>q</sub></em> look like what CLIP sees for this object (<em>w<sub>q</sub></em>).” Since CLIP’s embeddings are rich and class-agnostic, the model can better generalize to novel object types by aligning with this broader representation space.</p><p>In ablation studies, adding <em>L<sub>0</sub></em> significantly boosted novel object detection. 
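</p><p>A minimal sketch of the object-level term described above, with random stand-ins for the real features (names follow the text; this is an illustration, not the authors’ code):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 128, 16                         # voxels, embedding dim (illustrative)
f_clip = rng.normal(size=(V, D))       # per-voxel CLIP features
mask_q = rng.random(V) > 0.7           # mask of one matched ground-truth object
v_q = rng.normal(size=D)               # model's predicted class embedding

# Mask-pooled CLIP embedding w_q: average of the CLIP features inside the mask.
w_q = f_clip[mask_q].mean(axis=0)

# Object-level distillation: cosine distance between v_q and w_q. Minimizing
# it pulls the predicted embedding toward CLIP's view of the object.
cos = (v_q @ w_q) / (np.linalg.norm(v_q) * np.linalg.norm(w_q))
loss_obj = 1.0 - cos
print(0.0 <= loss_obj <= 2.0)  # cosine distance lies in [0, 2]
```

<p>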
One way to understand its benefit: the standard classification loss (<em>L<sub>cls</sub></em>) only teaches the model to distinguish between base classes. <em>L<sub>0</sub></em> teaches the model how base classes <em>look</em> in CLIP space, which helps generalize to unseen-but-similar objects — e.g., even if a “bus” never appears in training, it may lie close to “truck” or “car” in embedding space.</p><h4>Voxel-Level Distillation Loss</h4><p>While object-level distillation is great for distinct object instances, it doesn’t help with “stuff” classes (like road or vegetation) or regions with no label at all. To address this, the voxel-level distillation loss <em>L<sub>V</sub></em> applies supervision to <em>every</em> voxel.</p><p>Here’s how it works. The model outputs a set of query embeddings <em>F<sub>Q</sub><sup>emb</sup></em> (shape: Q × D<sub>emb</sub>) and mask scores <em>M<sub>Q</sub></em> (shape: Q × V) that indicate how much each voxel belongs to each query. The model aggregates all query embeddings weighted by their corresponding mask scores to reconstruct per-voxel features:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WKgg-Q4G2my3L2_ZeVyO7Q.png" /></figure><p>This reconstruction gives a dense feature map of the scene from the model’s perspective. The voxel-level loss then compares <em>F<sub>rec</sub></em> with the CLIP-based voxel features <em>F<sub>v</sub><sup>clip</sup></em> using an L1 loss:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jnc2_8LnLiyROGG5uf-S4g.png" /></figure><p>Unlike <em>L<sub>0</sub></em>, this loss is applied to all voxels, regardless of whether they are labeled or belong to a known class. 
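</p><p>The reconstruction step can be sketched as follows. Shapes and random stand-in features are illustrative, and normalizing the mask scores over queries is an assumption of this sketch, not a detail confirmed by the paper:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
Q, V, D = 8, 128, 16                   # queries, voxels, embedding dim
query_emb = rng.normal(size=(Q, D))    # one embedding per query
mask_scores = rng.random((Q, V))       # soft voxel-membership score per query
f_clip = rng.normal(size=(V, D))       # per-voxel CLIP target features

# Reconstruct per-voxel features: weight each query's embedding by how strongly
# that query claims the voxel (scores normalized over queries in this sketch).
weights = mask_scores / mask_scores.sum(axis=0, keepdims=True)  # (Q, V)
f_rec = weights.T @ query_emb                                   # (V, D)

# Voxel-level distillation: L1 distance to the CLIP features, over all voxels.
loss_voxel = np.abs(f_rec - f_clip).mean()
print(f_rec.shape)  # (128, 16)
```

<p>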
This allows the model to benefit from CLIP’s supervision even for unannotated or novel stuff.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ajilGMO-UokCBQ-Ll4cZBA.png" /><figcaption><strong>Figure 5</strong>: Intuition of the voxel-level distillation loss (<em>L<sub>V</sub></em>) (source: <a href="https://arxiv.org/pdf/2401.02402v3">https://arxiv.org/pdf/2401.02402v3</a>)</figcaption></figure><p>For example, imagine a patch of “vegetation” in the training data that isn’t labeled because it’s a novel class. <em>L<sub>cls</sub></em> and <em>L<sub>0</sub></em> won’t touch it. But <em>L<sub>V</sub></em> still nudges the model’s representation for those voxels to match CLIP’s understanding of “vegetation.”</p><p>One caveat: if the predicted masks are very noisy, this aggregation can distort features. So the authors use a moderate weight for <em>L<sub>V</sub></em> during training. Still, the addition of this loss significantly improves performance on novel stuff classes.</p><p>By the end of training, the model has effectively learned a 3D feature space that is compatible with CLIP’s multimodal (image/text) space. The single classification head can then naturally handle open-vocabulary queries by similarity to text embeddings. Also, because of distillation, the model doesn’t need the complex ensemble of the baseline — it learned to make the right predictions directly, rather than relying on post-hoc CLIP feature averaging at test time.</p><h3>Results: Outperforming Baselines on nuScenes and SemanticKITTI</h3><h4>Experimental setup at a glance</h4><p>The authors validate their approach on two popular datasets for autonomous driving: <strong>nuScenes</strong> (a large-scale dataset with 3D LiDAR scans and 360° camera images) and <strong>SemanticKITTI</strong> (outdoor LiDAR scans with segmentation labels, derived from the KITTI dataset). 
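</p><p>Panoptic Quality, the headline metric in the evaluation, can be computed from matched segments as follows. This is the standard definition (a prediction and a ground-truth segment match when their IoU exceeds 0.5); the numbers are a toy example, not results from the paper:</p>

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ = (sum of IoUs over matched pairs) / (TP + 0.5*FP + 0.5*FN)."""
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    return sum(matched_ious) / denom if denom > 0 else 0.0

# Toy scene: three matched segments, one false positive, one false negative.
print(round(panoptic_quality([0.9, 0.8, 0.7], num_fp=1, num_fn=1), 3))  # 0.6
```

<p>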
Since these datasets don’t come with predefined base/novel splits, the authors randomly split the set of semantic classes into base and novel for the experiments (for example, in nuScenes, they use 12 classes as base and 4 as novel). The model is trained only on the base classes’ labels, but evaluated on the full set (base + novel). Performance is primarily measured with <strong>Panoptic Quality (PQ)</strong>, which combines segmentation quality and recognition accuracy into one metric. They also report things-PQ and stuff-PQ separately, as well as traditional mIoU for semantic segmentation.</p><h4>nuScenes: large gains on novel “stuff” and strong overall PQ</h4><p><strong>Baseline vs. Waymo (nuScenes):</strong> The baseline PFC model already uses CLIP, but Waymo’s proposed model <strong>significantly outperforms</strong> it across all metrics. In particular, the baseline completely falls apart on novel stuff classes — the paper notes that PFC’s PQ for novel stuff “collapses” to a very low value. This is because, as we discussed, the baseline had no training signal for stuff, and the CLIP features for stuff got too diluted. By contrast, with the new architecture and distillation, <strong>Waymo’s model achieves high-quality segmentation and labeling for those novel “stuff” regions</strong>. The baseline held up better on novel things (since it used CLIP directly to classify objects), but Waymo’s model still improves there, too. Overall, PQ and recognition quality for novel classes are vastly improved.</p><p>They also compare qualitatively in <strong>Figure 6</strong>. You can see that the baseline PFC misclassifies or misses some novel objects in the scene, whereas the proposed model correctly identifies them. For example, a pedestrian that wasn’t in the training categories is wrongly labeled by PFC (it might label it as some other base class or not an instance at all), but Waymo’s model properly segments and calls it a pedestrian. 
Likewise, a bus that was novel is not detected correctly by PFC, but Waymo’s model picks it up. Vegetation (perhaps a particular type of foliage that was considered novel) is another failure case for the baseline but a success for Waymo’s model. These visual results underscore how the fused model with CLIP distillation can generalize to new objects and stuff in a scene.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*qxhXvlOS-9zZg94L9Z-sgg.png" /><figcaption><strong>Figure 6</strong>: Qualitative Results in nuScenes Dataset (source: <a href="https://arxiv.org/pdf/2401.02402v3">https://arxiv.org/pdf/2401.02402v3</a>)</figcaption></figure><h4>SemanticKITTI: similar trend under tougher conditions</h4><p>On the <strong>SemanticKITTI</strong> dataset, the trends are similar. SemanticKITTI has more classes (19 semantic classes split into 14 base/5 novel in their setup) and is also challenging because it’s mainly front-facing camera views (not full 360°). The baseline again shows poor results on novel classes, especially stuff like “terrain” or “manmade” surfaces that were held out. The proposed model <strong>greatly outperforms</strong> the baseline’s PQ on novel classes, and improves overall performance notably. In fact, the authors note that the gap between open-vocabulary and closed-set performance is larger here, likely because SemanticKITTI is smaller and less diverse, so missing novel data hurts more; Waymo’s CLIP infusion mitigates this.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*z49Feev0j0b6tIpFVrngDg.png" /><figcaption><strong>Figure 8</strong>: Qualitative Results in SemanticKITTI Dataset (source: <a href="https://arxiv.org/pdf/2401.02402v3">https://arxiv.org/pdf/2401.02402v3</a>)</figcaption></figure><h4>How close to closed-set? What zero-shot tells us</h4><p>They also compare to a <strong>closed-set upper bound</strong> (a model trained on all classes). Of course, a fully supervised model still does best on what it’s trained on. 
Interestingly, Waymo’s method’s performance on base classes is quite close to that closed-set model (only a small drop in PQ). The big remaining gap is in the classification metric for novel classes — understandable, because even with CLIP, there’s still some confusion on completely unseen stuff. But it’s a huge improvement from the baseline, narrowing the open-vs-closed gap.</p><p>Another point of comparison: they evaluate a <strong>zero-shot 3D segmentation method called OpenScene</strong> (which uses CLIP without any 3D labels) on nuScenes. Waymo’s method, which does use training labels for base classes (“partial labels”), still <strong>significantly outperforms OpenScene</strong> in semantic segmentation accuracy. This isn’t a totally fair fight since OpenScene used no labels at all, but it shows that with a small amount of supervised help plus CLIP, we can do much better — combining learned 3D features and CLIP gives a boost over purely relying on CLIP features projected to 3D. It quantifies the value of those distillation losses and the training regime.</p><h4>Ablations: which parts matter most?</h4><p>They also conduct extensive <strong>ablation studies</strong> (removing or altering one component at a time) to confirm that each proposed piece matters. For example, they show that <strong>feature fusion</strong> (using both LiDAR and image features) improves novel thing segmentation significantly — meaning the learned LiDAR features are indeed complementary to CLIP features. Removing the <strong>object-level loss</strong> hurts novel thing PQ, and removing the <strong>voxel-level loss</strong> hurts novel stuff PQ the most. These ablations support their claims that both losses target different aspects (things vs stuff) and that having a unified classifier (single head) with those losses is better than the two-head ensemble baseline.</p><p>In numbers, while exact values aren’t quoted in the text, the improvement is described as “a large margin”. 
The baseline’s novel stuff PQ was nearly zero on nuScenes (since it “collapses”), and Waymo’s method raises it dramatically (likely into the 40–50 PQ range for novel stuff — a huge jump). Novel-things PQ also increases by double digits. Overall Panoptic Quality goes up significantly, bringing open-vocab performance much closer to traditional fully-supervised levels.</p><h4>Intuition takeaway</h4><p>To give an intuition: the car can now correctly detect and label, say, a <strong>“bicycle”</strong> or <strong>“traffic cone”</strong> on the road, even if those were novel (not in the base training set), whereas before it might not have. For stuff, if “terrain” (dirt/ground) was novel, the baseline might label it incorrectly as “road” or not label it at all, but Waymo’s model will recognize terrain as distinct because CLIP knows the visual difference between road and dirt. This can be crucial for understanding the scene (e.g., knowing where the drivable road ends and the off-road dirt begins).</p><h3>Practical Implications and Outlook for Autonomous Driving</h3><h4>Why this matters for on-road safety</h4><p>For autonomous vehicles, <strong>open-vocabulary 3D segmentation</strong> could be a game-changer in terms of safety and adaptability. In the real world, a self-driving car will inevitably encounter objects or scene classes that weren’t in its training data — from new construction equipment on the road, to foreign traffic signs, to unexpected obstacles (like an object dropped from a truck). A closed-set model might classify them all as “unknown” or, worse, confuse them with known classes (imagine mistaking a fallen <strong>tree</strong> for a static “pole” and not reacting in time). An open-vocabulary model addresses this with much richer descriptive power. If the concept exists in the language (and visual) prior of CLIP, the model can recognize it. 
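</p><p>To make this open-vocabulary step concrete, here is a minimal sketch of classification by similarity to text embeddings (a toy numpy illustration, not the paper’s implementation; the tiny 4-D vectors stand in for real CLIP text and voxel embeddings, which have hundreds of dimensions): each 3D feature is assigned the class whose text embedding it matches most closely.</p>

```python
import numpy as np

def open_vocab_classify(voxel_feats, text_embeds, class_names):
    """Label each voxel feature with the class whose CLIP-style text
    embedding has the highest cosine similarity."""
    # L2-normalize so dot products become cosine similarities
    v = voxel_feats / np.linalg.norm(voxel_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sims = v @ t.T  # (num_voxels, num_classes)
    return [class_names[i] for i in sims.argmax(axis=1)]

# Toy stand-ins for CLIP text embeddings of four class names.
names = ["road", "terrain", "car", "vegetation"]
text_embeds = np.eye(4)
# Voxel features that lie near the "car", "road", and "vegetation" embeddings.
rng = np.random.default_rng(0)
voxel_feats = text_embeds[[2, 0, 3]] + 0.05 * rng.normal(size=(3, 4))
print(open_vocab_classify(voxel_feats, text_embeds, names))  # ['car', 'road', 'vegetation']
```

<p>Because the vocabulary is just a list of names, extending it at test time only requires embedding one more string with the text encoder. 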
For instance, it can say “that cluster of points looks like a <strong>deer</strong>” even if it was only trained on horses and cows, because CLIP has seen deer in images, and the shape/texture matches deer.</p><h4>Lowering the cost of adding new classes</h4><p>Another practical benefit is <strong>reducing the need for constant re-training or dataset expansion</strong>. Today, to add a new class to a perception system, you’d need to collect and label lots of examples of that class in LiDAR (time-consuming and expensive). With an open-vocab approach, the system can generalize to many classes at inference by just providing the name of the class (in text) and relying on CLIP’s learned representation. That makes the AV’s perception more scalable and future-proof — it’s closer to how humans adapt to new situations by leveraging prior knowledge. We can describe to a system “a <strong>scooter</strong> is like a small motorcycle,” and if CLIP has that context, the system might identify scooters without explicit scooter labels in LiDAR data.</p><h4>Remaining challenges: reliability &amp; vocabulary at test time</h4><p>There are still challenges to consider. The model’s performance, while much improved, still doesn’t match a fully supervised model on novel classes — so there might be edge cases where it’s unsure or slightly off (e.g., confusing two unseen classes that look similar). Ensuring reliability in those cases is important for safety. Also, the system currently assumes we know the names of novel classes we care about at test time (we feed the text embeddings for “bus”, “trash can”, etc.). This is a standard assumption in open-vocab setups (you have a vocabulary of interest), but in the <strong>open-world</strong> case, you might not even know what could appear. 
One could integrate a rejection or “unknown detection” mechanism — e.g., if no text embedding is similar enough, flag it as truly unknown.</p><h4>Broader outlook: multimodal foundation models for 3D</h4><p>From a broader perspective, this work is a compelling example of <strong>multimodal learning</strong> in robotics: it shows that <strong>language and vision priors can augment 3D understanding</strong>. We’re likely to see more of this cross-modal distillation approach, perhaps using even larger foundation models (imagine using GPT-4 style image-text models or diffusion models to guide 3D perception). For autonomous driving, it could mean the car’s AI has a built-in encyclopedic visual knowledge, enabling it to handle new situations more gracefully.</p><h3>Conclusion</h3><p>In conclusion, the paper demonstrates a novel approach to a hard problem — achieving <strong>3D panoptic segmentation beyond the training label set</strong>. By cleverly fusing LiDAR with a pre-trained vision-language model and using object- and voxel-level distillation, the authors significantly closed the gap between recognizing only what you’ve trained on and recognizing anything that’s visually distinctive. It’s a step toward more generalized 3D scene understanding, which is not only academically interesting but also practically vital for robots and vehicles operating in our complex, ever-changing world. The next time an autonomous car encounters a strange new object on the road, it just might know what to call it (and, more importantly, how to handle it), no manual retraining required!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=0f65b3e7d3cc" width="1" height="1" alt=""><hr><p><a href="https://medium.com/demistify/av-vol-4-3d-open-vocabulary-panoptic-segmentation-with-2d-3d-vision-language-distillation-waymo-0f65b3e7d3cc">[AV Vol. 
4] 3D Open-Vocabulary Panoptic Segmentation with 2D–3D Vision-Language Distillation (Waymo…</a> was originally published in <a href="https://medium.com/demistify">deMISTify</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[VGGT: How Feed-Forward 3D Perception is Redefining Scene Reconstruction and Multi-View Geometry]]></title>
            <link>https://medium.com/demistify/vggt-how-feed-forward-3d-perception-is-redefining-scene-reconstruction-and-multi-view-geometry-2f5c92e81cb3?source=rss----2081633ded83---4</link>
            <guid isPermaLink="false">https://medium.com/p/2f5c92e81cb3</guid>
            <category><![CDATA[3d-reconstruction]]></category>
            <category><![CDATA[cvpr]]></category>
            <category><![CDATA[computer-vision]]></category>
            <category><![CDATA[vision-transformer]]></category>
            <category><![CDATA[3d-vision]]></category>
            <dc:creator><![CDATA[Samuel Chen]]></dc:creator>
            <pubDate>Tue, 23 Sep 2025 22:40:43 GMT</pubDate>
            <atom:updated>2025-09-23T22:40:39.937Z</atom:updated>
            <content:encoded><![CDATA[<p>In June 2025, <strong>VGGT (Visual Geometry Grounded Transformer)</strong> claimed the <strong>Best Paper Award</strong> at the <strong>CVPR (Computer Vision and Pattern Recognition)</strong> conference.</p><p>It was judged not just as innovative research, but as a breakthrough poised to redefine how the field operates. VGGT earned that recognition by showing that a single feed‑forward transformer can take a single photo, or a few hundred, and return camera data, dense depth maps for every image, and a 3D point map in one consistent frame, all in a single pass. The result is not only speed, but a cleaner and more efficient way to build 3D systems.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*k-QcEUTm6fK3qF2B" /><figcaption>Figure 1: <strong>Single-pass 3D reconstruction from a photo sequence. </strong><em>Left: </em>Input frames. Center: Textured point cloud with predicted camera frustums, all in the first-view coordinate frame. Right: Depth overlays and tracked points. All outputs come from one forward pass. (Source: Wang, J. et al., 2025)</figcaption></figure><p>Traditional methods in the field of 3D scene reconstruction rely on <strong>Structure-from-Motion (SfM)</strong> techniques, which estimate camera poses and a sparse 3D structure from overlapping images by matching features and solving geometric constraints. Another method that can be used is <strong>bundle adjustment</strong>, which refines cameras and 3D points by minimizing projection error across all observations through an iterative optimization process.</p><p>VGGT moves most of the difficulty and computation to <strong>training time</strong>, giving fast and predictable inference. 
Rather than using a complex pipeline like previous methods, VGGT is a <strong>unified model</strong> that has been <strong>trained</strong> to understand multi‑view cues like parallax and occlusion, keeping different camera views and their respective depths in agreement. The payoff is a quicker path from raw images to a 3D reconstruction, which can be rendered, analyzed, or handed off to SLAM and NeRF pipelines.</p><p>This article aims to explain VGGT in an easily understandable way. We will compare it to the classical route, show what the <strong>alternating‑attention design</strong> contributes, highlight the <strong>improvements and shortcomings</strong> of using VGGT, and outline <strong>practical ways</strong> to use it.</p><h4>Classical Pipelines</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*rhb9riK1KDRUy4gQ" /><figcaption>Figure 2: <strong>Prior frameworks contrasted.</strong> Left: <strong>COLMAP</strong> refines poses and points with bundle adjustment after feature matching. Right: <strong>DUSt3R</strong> performs global alignment of all views in a learned optimization. (Source: Wang, J. et al., 2025)</figcaption></figure><p><strong>Structure‑from‑Motion</strong> and <strong>Bundle Adjustment</strong> are tried-and-true methods for 3D reconstruction, but they come with significant time and compute costs.</p><ul><li>They depend on <strong>reliably matching the same real-world points across different photos</strong>. Flat or unmarked walls, motion blur, strong lighting changes between images, and wide baselines in stereo imaging lead to weak or wrong matches. Implementing <strong>RANSAC (Random Sample Consensus),</strong> an iterative algorithm used to estimate parameters of a model from a dataset that contains outliers, does lead to better matches, but requires additional compute and may still be insufficient.</li><li>They introduce many <strong>failure points</strong>. 
A bad initialization, inaccurate camera input data, or a miscalculated relationship between two camera views can all snowball into a failed 3D reconstruction.</li><li>They require <strong>significant engineering complexity</strong> to reach high accuracy. Feature detection, point matching, outlier rejection, pose estimation, object triangulation, and nonlinear optimization all require careful choices and parameter tuning to produce an effective result.</li><li>They are <strong>iterative at test time</strong>, meaning that as an input set grows, runtime and memory typically grow superlinearly.</li></ul><p>These issues don’t take away from the excellent innovations that these methods introduced; they simply create <strong>room for an alternative</strong> that shifts more of the burden to training time and leaves a lighter, faster path at inference.</p><h4>What VGGT Brings To The Table</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1018/0*PA44MM4CqUFBDDkN" /><figcaption>Figure 3: <strong>Qualitative comparison of VGGT predicted 3D points from multiple viewpoints on in-the-wild images</strong>. Top: VGGT successfully predicts the geometric structure of an oil painting from a single view. Middle: VGGT correctly recovers a 3D scene from two images with no overlap. Bottom: From a challenging example with 32 views, VGGT provides a high-quality prediction with repeated textures. (Source: Wang, J. et al., 2025)</figcaption></figure><p>VGGT is a <strong>large vision transformer</strong> trained end to end on diverse image collections. It accepts a variable number of views, working with a single image, a short clip, or more than two hundred images from a photo sequence. 
All predictions live in the coordinate frame of the first camera, making the outputs easy to compose.</p><p>The model returns four families of results.</p><ul><li><strong>Camera data:</strong> A camera regression head directly produces both the camera’s internal parameters such as focal length (intrinsic parameters), and its position and orientation in the real world (extrinsic parameters).</li><li><strong>Dense depth maps:</strong> A depth value for each pixel in each image.</li><li><strong>3D point maps:</strong> A per‑pixel 3D point expressed in the shared frame. This is redundant with depth and cameras by design, yet it improves stability and downstream utility.</li><li><strong>Tracking features:</strong> Dense features that a lightweight tracker can use to follow points across frames.</li></ul><p>The most important property is not any single metric. It is that you get all of these outputs together, in a consistent frame, from <strong>one forward pass</strong>. That unification removes a lot of unnecessary parts of a previously complicated pipeline, making the system more predictable.</p><h3><strong>Architecture Overview</strong></h3><p>VGGT represents each image as a sequence of tokens produced by a strong visual backbone, <strong>DINOv2</strong>. Image patches become tokens with local appearance and geometry cues. The <strong>transformer</strong> then alternates two kinds of attention blocks.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/440/0*w7SHWKkhdtfB1bN4" /><figcaption>Figure 4: <strong>Alternating attention model. </strong>The transformer alternates frame-wise attention for per-image detail and global attention for cross-view reasoning. (Source: Wang, J. et al., 2025)</figcaption></figure><h4>Key Innovations</h4><ul><li><strong>Frame‑wise self‑attention</strong> lets tokens within a single image interact. 
This builds clean per‑image features that capture edges, corners, and texture patterns.</li><li><strong>Global self‑attention</strong> lets tokens from different images communicate. This is where cross‑view reasoning happens. Parallax, occlusion boundaries, and repeated patterns become part of the model’s context.</li></ul><p>Alternating these blocks across layers is a simple idea, but the two work in concert to create a better model. <strong>Local reasoning</strong> sharpens each view, while <strong>global reasoning</strong> ties the views together. Repeating both steps produces features that are good at single‑view detail and multi‑view consistency at the same time.</p><p>Two extra token types guide the outputs:</p><ul><li>A <strong>camera token</strong> accompanies each image. It distills pose‑relevant information and feeds the camera head that calculates camera data.</li><li>A set of <strong>register tokens</strong> acts as stable slots that the dense prediction transformer (DPT) head, a transformer head that upsamples transformer features into per‑pixel maps, uses to aggregate multi‑scale context while recovering spatial detail.</li></ul><p><strong>Downstream heads</strong> attend to these tokens to produce cameras, depth, point maps, and dense features. There is no test‑time optimization loop; reasoning happens inside attention during the forward pass.</p><p>Though depth maps and 3D point maps are connected by the <strong>pinhole camera model</strong>, predicting both improves learning. The two outputs provide different gradient signals during training, helping the model learn internal consistency checks that produce more coherent predictions.</p><p>VGGT also produces its multiple outputs through separate heads. The camera head regresses intrinsic parameters and poses, while a DPT‑style dense head restores per‑pixel detail for depth and point maps. 
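</p><p>Stepping back to the alternating‑attention core, the pattern can be sketched in a few lines of numpy (a simplified, single‑head illustration without the learned projections, multi‑head structure, or MLP blocks of the real model): frame‑wise attention pools only tokens from the same image, while the global pass flattens every view into one token sequence.</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Plain dot-product self-attention: each token becomes a
    similarity-weighted average of the tokens it is allowed to see."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    return softmax(scores) @ tokens

def alternating_block(tokens):
    """tokens: (frames, tokens_per_frame, dim). Frame-wise pass mixes
    tokens within each image; global pass mixes tokens across all views."""
    frames, per_frame, dim = tokens.shape
    framewise = np.stack([self_attention(f) for f in tokens])
    flat = framewise.reshape(frames * per_frame, dim)
    return self_attention(flat).reshape(frames, per_frame, dim)

x = np.random.default_rng(1).normal(size=(3, 4, 8))  # 3 views, 4 tokens each
y = alternating_block(x)
print(y.shape)  # (3, 4, 8)
```

<p>The only difference between the two attention modes is which tokens share an attention pool; stacking many such blocks is what lets the network refine per‑image detail and cross‑view agreement in turn.</p><p>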
In addition, a tracking head exposes dense features suitable for correspondence and can be fine‑tuned independently, following CoTracker‑style supervision. This modularity lets you swap, extend, or specialize heads for tasks such as SLAM initialization, NeRF warm starts, or video tracking.</p><h4>From Images To 3D</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*U7w__3SjEF2Pv-RN" /><figcaption>Figure 5: <strong>VGGT architecture. </strong>Images are encoded into tokens with DINOv2, then augmented with a camera token and register tokens. The transformer alternates frame-wise attention for per-image detail and global attention for cross-view reasoning. Heads read out cameras, and a DPT-style dense head produces depth and point maps. Dense features enable tracking. (Source: Wang, J. et al., 2025)</figcaption></figure><p>To help with understanding the model, here is a qualitative description for a single pass through the model on a short image sequence.</p><ol><li>The images are <strong>resized</strong> to a standard resolution, while a DINOv2 backbone <strong>converts patches to tokens</strong>. A camera token and several register tokens are attached to each image’s token list.</li><li>Frame‑wise attention runs and <strong>strengthens per‑image features</strong>. In this step, edges align, textures become more coherent, and local geometry cues become clearer.</li><li><strong>Global attention</strong> runs and lets tokens <strong>communicate across images</strong>. Tokens that represent the same physical region begin to agree. The model “notices” parallax and occlusion cues because the features that move together across views become consistent.</li><li>Several rounds of <strong>local</strong> and <strong>global attention</strong> follow. 
Each round increases the agreement both within and across views.</li><li>Output heads <strong>attend</strong> to the camera tokens and register tokens and produce <strong>cameras, depth maps</strong>, and <strong>point maps</strong>. Dense features are ready for a tracker.</li><li>All outputs are already in the <strong>same coordinate frame</strong>, so you can render a point cloud or a depth overlay immediately.</li></ol><p>The <strong>key point</strong> is that VGGT performs the <strong>heavy lifting implicitly through attention</strong>, rather than through the explicit optimization loop of a classical system.</p><h3>Training Strategy</h3><p>The VGGT <strong>training loss</strong> is calculated by <strong>combining losses</strong> for cameras, depth, point maps, and tracking.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/804/1*VdKE1KF3KUD5C2ApVVj9CA.png" /></figure><ul><li><strong>Camera loss </strong>teaches the model to predict three things: how the camera is turned (rotation), where it is (translation), and its internal settings like focal length (intrinsic parameters). Rotations use stable parameterizations such as quaternions (four numbers that together represent one 3D rotation) or 6D representations to prevent any degrees of freedom from being lost. For translation, it learns position only up to a scale, since everything is measured relative to the first camera’s coordinate frame rather than in absolute meters.</li><li><strong>Depth and point losses</strong> penalize deviations from ground truth, when available, with uncertainty weightings. 
Regions that are ambiguous, like blank walls or specular surfaces, contribute less to the gradient, while a spatial smoothness term favours piecewise-smooth depth and preserves edges.</li><li><strong>Tracking loss</strong> teaches the features to support correspondences across views, downweighted with <strong>λ</strong> = 0.05 so that it does not dominate.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/780/0*EfdRMDF5F4CvBtRm" /><figcaption>Figure 6: <strong>Training mix across 17 datasets.</strong> 59% synthetic, 24% captured, 18% SfM-annotated. Synthetic data mainly drives accuracy. Captured and SfM-annotated data improve generalization. (Source: Wang, J. et al., 2025)</figcaption></figure><p>The model was trained on a <strong>large mix of datasets</strong> covering indoor scenes, outdoor scenes, and synthetic environments, drawn from a wide range of sources including CO3Dv2, ScanNet, MegaDepth, Habitat, and Replica. The model contains <strong>1.2 billion parameters</strong> and was trained on <strong>64 A100 GPUs</strong> over <strong>nine days</strong>. Exposure changes, motion blur, lens differences, and scale variations were also simulated during training to teach the network to handle the messiness of real images.</p><p>One practical detail used to ensure consistency is <strong>coordinate normalization</strong>. All predictions are expressed in a learned canonical frame, anchored to the first camera. Even though many systems hardcode a normalization procedure, VGGT learns it as part of training. This choice removes a post‑processing step and lets the network adapt as data changes.</p><h3>Benchmarks</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*OwasT00RW5geoDn-" /><figcaption>Figure 7: <strong>Camera estimation accuracy on RealEstate10K.</strong> Higher is better. VGGT reaches 85.3 AUC@30 and leads all baselines. 
Several baselines require explicit test-time optimization, while VGGT is feed forward. (Source: Wang, J. et al., 2025)</figcaption></figure><p>VGGT shows <strong>strong results</strong> across several tasks. Putting the benchmarks into context:</p><ul><li><strong>Pose accuracy at speed:</strong> On standard datasets, VGGT reaches or exceeds the pose accuracy of systems that rely on feature matching and global optimization, while running much faster for large view sets. The advantage grows with the number of images because there is no test‑time bundle adjustment.</li><li><strong>Depth and point quality without oracle cameras:</strong> Even when camera poses are not supplied, dense geometry is competitive with the best learned baselines. Fine structures like chair legs and edges in indoor scenes are preserved well, while outdoor scenes with large depth variation still generate stable depth predictions.</li><li><strong>Correspondence strength:</strong> Features produced by VGGT perform very well on two‑view matching benchmarks, despite the model not being designed as a dedicated matcher. This indicates that the internal representation learned for geometry is useful for many broad applications.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*vzw7HpGyE3FYYk74" /><figcaption>Figure 8: <strong>Tracking and Implementation of VGGT with CoTracker.</strong> Top: VGGT model tracks many points reliably in a static outdoor scene. Bottom: CoTracker using our features produces long-range, smooth trajectories on a cyclist sequence. (Source: Wang, J. et al., 2025)</figcaption></figure><h4>Real-Time Performance</h4><p>VGGT has an inference time of only <strong>0.2 seconds for 10 frames</strong>, which goes up to 8.75 seconds for 200 frames. 
Its memory usage is <strong>very efficient</strong> up to ~50 views, and it can be adapted for lower-resource settings, since optional heads (like tracking or view synthesis) can be enabled or disabled as needed.</p><h3><strong>Try VGGT Yourself</strong></h3><p>Curious to see VGGT in action? You can run it locally or try it online:</p><h4>Run Locally</h4><p>To set up VGGT on your machine:</p><ol><li>Clone the repository</li></ol><pre>git clone https://github.com/facebookresearch/vggt.git<br>cd vggt</pre><p>2. Set up the environment:</p><pre>pip install -r requirements.txt</pre><p>3. Run the model:</p><p>Follow the instructions in the <a href="https://github.com/facebookresearch/VGGT">official GitHub README</a> to try the demo.</p><pre>import torch<br>from vggt.models.vggt import VGGT<br>from vggt.utils.load_fn import load_and_preprocess_images<br><br>device = &quot;cuda&quot; if torch.cuda.is_available() else &quot;cpu&quot;<br># bfloat16 is supported on Ampere GPUs (Compute Capability 8.0+)<br>dtype = torch.bfloat16 if torch.cuda.get_device_capability()[0] &gt;= 8 else torch.float16<br><br># Initialize the model and load the pretrained weights.<br># This will automatically download the model weights the first time it&#39;s run, which may take a while.<br>model = VGGT.from_pretrained(&quot;facebook/VGGT-1B&quot;).to(device)<br><br># Load and preprocess example images (replace with your own image paths)<br>image_names = [&quot;path/to/imageA.png&quot;, &quot;path/to/imageB.png&quot;, &quot;path/to/imageC.png&quot;]<br>images = load_and_preprocess_images(image_names).to(device)<br><br>with torch.no_grad():<br>    with torch.cuda.amp.autocast(dtype=dtype):<br>        # Predict attributes including cameras, depth maps, and point maps.<br>        predictions = model(images)</pre><p>4. 
A video demonstrating the installation process can be found <a href="https://www.youtube.com/watch?v=BhcmqplCwNk">here</a>.</p><h4>Try it Online</h4><p>A Hugging Face demo is <a href="https://huggingface.co/spaces/facebook/VGGT">available online</a> for instant testing, so you can upload your own images to see VGGT in action, estimating 3D structure and camera data in seconds.</p><h3>Outlook And Open Questions</h3><p>Feed‑forward geometry will not replace every classical component, and it does not need to. The point is to move as much of the complexity as possible into training, then keep inference simple and fast. Several directions are especially promising.</p><ul><li><strong>Better handling of dynamics:</strong> Extending the model to reason about non‑rigid motion and to separate moving objects from static background would widen its practical reach.</li><li><strong>Coupling with differentiable optimization:</strong> A small differentiable bundle adjustment on top of VGGT could provide the last few percent of accuracy with modest overhead. Joint training might stabilize the combination further.</li><li><strong>Efficiency improvements:</strong> Sparse attention, token pruning, and quantization can reduce memory and latency without sacrificing much accuracy. These changes will matter for AR headsets and small robots.</li><li><strong>Generalized cameras:</strong> Explicit support for fisheye and panoramic models would make the camera head more widely applicable.</li></ul><h3><strong>Takeaways</strong></h3><p>VGGT shows that a <strong>single transformer</strong> can learn much of the reasoning that classical pipelines perform through explicit optimization. By <strong>alternating local and global attention</strong>, the model builds per‑image detail and cross‑view agreement, then reads out cameras and dense geometry in one pass. The outputs are immediately useful, from <strong>fast visualization</strong> to <strong>seeding SLAM, NeRF, and tracking systems</strong>. 
The approach does not eliminate the need for geometry, but it <strong>shifts</strong> most of the <strong>compute needs</strong> from <strong>inference to training time</strong>, delivering results in a fraction of the time of other methods. Expect future papers to push <strong>dynamic-scene support</strong> and <strong>further efficiency gains</strong>, building on VGGT’s leap toward real-time, robust, accessible 3D vision.</p><h3>References</h3><p>Wang, J., et al. <em>VGGT: Visual Geometry Grounded Transformer.</em> <strong>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</strong>, 2025. <a href="https://doi.org/10.48550/arXiv.2503.11651">https://doi.org/10.48550/arXiv.2503.11651</a>.</p><p>Wang, J. <em>VGGT: Visual Geometry Grounded Transformer.</em> <strong>CVPR 2025 Talk/Slides</strong>, 2025. Project page: <a href="https://vgg-t.github.io/">https://vgg-t.github.io/</a>.</p><p>Facebook Research. <em>VGGT. </em><strong>GitHub</strong>, 2025, <a href="https://github.com/facebookresearch/vggt">https://github.com/facebookresearch/vggt</a>.</p><p>Facebook Research. <em>VGGT Project Page.</em> 2025, <a href="https://vgg-t.github.io/">https://vgg-t.github.io/</a>.</p><p>Oquab, M., et al. <em>DINOv2: Learning Robust Visual Features without Supervision.</em> <strong>arXiv</strong>, 2023. <a href="https://doi.org/10.48550/arXiv.2304.07193">https://doi.org/10.48550/arXiv.2304.07193</a>.</p><p>Ranftl, René, Alexey Bochkovskiy, and Vladlen Koltun. <em>Vision Transformers for Dense Prediction.</em> <strong>Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</strong>, 2021. <a href="https://doi.org/10.1109/ICCV48922.2021.01196">https://doi.org/10.1109/ICCV48922.2021.01196</a>.</p><p>Karaev, Nikita, et al. <em>CoTracker: It Is Better to Track Together. </em><strong>Proceedings of the European Conference on Computer Vision (ECCV)</strong>, 2024. 
<a href="https://doi.org/10.1007/978-3-031-73033-7_2">https://doi.org/10.1007/978-3-031-73033-7_2</a>.</p><p>Wang, S., et al. <em>DUSt3R: Geometric 3D Vision Made Easy.</em> <strong>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</strong>, 2024. <a href="https://doi.org/10.1109/CVPR52733.2024.01956">https://doi.org/10.1109/CVPR52733.2024.01956</a>.</p><p>Leroy, Vincent, Yohann Cabon, and Jérôme Revaud. <em>Grounding Image Matching in 3D with MASt3R. </em><strong>Proceedings of the European Conference on Computer Vision (ECCV),</strong> 2024. <a href="https://doi.org/10.1007/978-3-031-73220-1_5">https://doi.org/10.1007/978-3-031-73220-1_5</a>.</p><p><em>RealEstate10K: A Large Dataset of Camera Trajectories from Internet Videos. </em><strong>Google Research</strong>, 2018. <a href="https://doi.org/10.57702/bxtr5t4j">https://doi.org/10.57702/bxtr5t4j</a>. Dataset site: <a href="https://google.github.io/realestate10k/">https://google.github.io/realestate10k/</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2f5c92e81cb3" width="1" height="1" alt=""><hr><p><a href="https://medium.com/demistify/vggt-how-feed-forward-3d-perception-is-redefining-scene-reconstruction-and-multi-view-geometry-2f5c92e81cb3">VGGT: How Feed-Forward 3D Perception is Redefining Scene Reconstruction and Multi-View Geometry</a> was originally published in <a href="https://medium.com/demistify">deMISTify</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[[AV Vol.3] BEVFusion: Unifying Vision in Autonomous Driving Systems]]></title>
            <link>https://medium.com/demistify/av-vol-3-bevfusion-unifying-vision-in-autonomous-driving-systems-b2190f877c9b?source=rss----2081633ded83---4</link>
            <guid isPermaLink="false">https://medium.com/p/b2190f877c9b</guid>
            <category><![CDATA[3d-deep-learning]]></category>
            <category><![CDATA[autonomous-vehicles]]></category>
            <category><![CDATA[robotics]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[birds-eye-view]]></category>
            <dc:creator><![CDATA[Adam Roberge]]></dc:creator>
            <pubDate>Tue, 23 Sep 2025 21:30:27 GMT</pubDate>
            <atom:updated>2025-09-23T21:30:27.049Z</atom:updated>
            <content:encoded><![CDATA[<p><em>How BEVFusion’s unified bird’s-eye perspective and multi-task learning approach are shaping the future of autonomous vehicle perception.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*vKBbYzR33urUBUEnfVOiQg.png" /><figcaption>BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation (Source: <a href="https://github.com/mit-han-lab/bevfusion">https://github.com/mit-han-lab/bevfusion</a>)</figcaption></figure><h3>Introduction: Bridging Different Vision “Languages”</h3><p>Modern self-driving cars are equipped with numerous sensors. Waymo’s autonomous vehicles, for example, feature 29 cameras, 6 radars, and 5 LiDAR sensors. Each sensor is like a specialist speaking its own “language”: cameras deliver rich color and texture from a perspective view, LiDAR provides precise 3D distance measurements in a top-down 3D view, and radar adds velocity data. The challenge is that these sensors don’t naturally align; it’s as if one expert describes the world in 2D photographs while another uses 3D point clouds. Merging their outputs can feel like translating between two very different languages. If we naively project LiDAR’s 3D points onto a camera’s 2D image, we distort geometric distances, and projecting camera images into LiDAR’s space drops most of the rich visual detail. In other words, <strong>forcing one sensor’s perspective onto the other causes information loss</strong>, like a poor translation that drops nuance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*_y0FltcJYasSg-n8" /><figcaption>Figure 1: RGB View vs. BEV View (Source: <a href="https://www.youtube.com/watch?v=5v9wa-UCFMM">https://www.youtube.com/watch?v=5v9wa-UCFMM</a>)</figcaption></figure><p>To truly “see” the whole picture, autonomous vehicles need a common ground where data from all sensors can combine without compromising each sensor’s strengths. 
This is where a <strong>bird’s-eye view (BEV)</strong> representation comes in. A bird’s-eye view means looking at the world from above (as if you’re a bird flying overhead). By converting sensor data into this top-down view, we give cameras and LiDAR a shared frame of reference. The <strong>BEVFusion</strong> approach embraces this idea, <strong>unifying camera and LiDAR features in the same BEV space, rather than forcing one into the other</strong>. In doing so, it <strong>preserves the camera’s dense semantic information <em>and</em> LiDAR’s precise geometry</strong>, avoiding the trade-offs that plagued earlier fusion methods. The result is a single “language” of perception that an autonomous car’s neural network can understand — one that speaks in both rich visuals and exact spatial terms simultaneously.</p><h3>Common Ground in Bird’s-Eye View</h3><p>By establishing a common viewpoint from above, BEVFusion tackles sensor fusion in a fundamentally new way. Instead of laboriously matching 2D pixels to 3D points (and losing a significant amount in the process), all sensor inputs are transformed into a unified, map-like view. This BEV map functions as a shared canvas, where each sensor contributes its piece of the puzzle without conflicting with the others. Early fusion attempts often tried a <em>“project one sensor into the other’s space”</em> approach — and as noted, they hit limits. For example, projecting camera features onto sparse LiDAR points meant that <strong>only ~5% of the image’s features found a LiDAR point to attach to</strong>, with the rest effectively thrown away. Conversely, projecting LiDAR into the camera view introduced geometric warping that hurt 3D object detection. 
BEVFusion avoids these pitfalls by <strong>choosing an independent, neutral ground (the BEV plane) where both sensor types can be represented fully</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_8Md09WuXMQV_VFO2iT-sg.png" /><figcaption>Figure 2: Challenges in Cross-Modal Projection: Mapping LiDAR to Camera loses 3D geometry as distant and close points may overlap in 2D, while mapping Camera to LiDAR loses semantic details due to incomplete point coverage. (Source: <a href="https://www.thinkautonomous.ai/blog/bev-fusion/">https://www.thinkautonomous.ai/blog/bev-fusion/</a>)</figcaption></figure><p>Crucially, BEVFusion performs <strong>fusion at the <em>feature</em> level, not the raw data level</strong>. This means the system doesn’t try to mix raw images with raw point clouds directly (which is hard); instead, it first lets each sensor’s data be processed into high-level features (patterns, edges, object cues, etc.) by specialized neural networks. Raw-level fusion is hard because image pixels live on a 2D perspective grid with unknown depth, while LiDAR samples are sparse, irregular 3D points — aligning them demands precise calibration, per-pixel depth estimates, and non-linear projections that create many-to-one/one-to-many correspondences and perspective distortion. <strong>Those features are then mapped onto the bird’s-eye grid and merged</strong>. By doing so, BEVFusion <strong>preserves both the “semantic density” of cameras (all those pixels of rich detail) and the “geometric structure” of LiDAR</strong> (the accurate distances and shapes). 
It’s a bit like two experts agreeing to collaborate in a shared language: neither has to fully give up their native knowledge, and the collaboration yields a more complete result.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*yJx63RxmbuRy0cMT" /><figcaption>Figure 3: Tesla’s HydraNet (Source: <a href="https://www.thinkautonomous.ai/blog/how-tesla-autopilot-works/">https://www.thinkautonomous.ai/blog/how-tesla-autopilot-works/</a>)</figcaption></figure><p>Another advantage of using a BEV representation is that it naturally supports <strong>multiple perception tasks at once</strong>. In an autonomous driving scenario, we don’t just want to detect objects; we also want to understand the drivable area, lane markings, and other semantic elements of the environment. On a bird’s-eye map, both object locations and semantic map layers can be learned together. BEVFusion takes full advantage of this by adopting a <strong>multi-task “HydraNet” design — one body of fused features feeding into multiple heads for different tasks</strong>. With a unified BEV view, <strong>a single network can simultaneously perform 3D object detection and BEV semantic segmentation (mapping out roads, crosswalks, etc.)</strong>, instead of having separate systems for each. This not only streamlines the perception pipeline (saving computation and time), but it also means the tasks can inform each other. For instance, detecting a vehicle and identifying the drivable road surface around it are accomplished in one coherent framework, likely improving consistency (the detected car will sit exactly on the road in the BEV map, as it should).</p><p>In summary, the bird’s-eye view provides the perfect <strong>“common ground”</strong> for sensor fusion in self-driving cars. It solves the modality mismatch by placing data in the same geometric space and enables <strong>multi-sensor, multi-task learning</strong> in one go. 
Now, let’s dive into how BEVFusion implements this, step by step.</p><h3>How BEVFusion Works: Key Stages of the Fusion Pipeline</h3><p>BEVFusion’s architecture can be broken down into a series of stages that take us from raw sensor data (multi-view RGB frames and a LiDAR point cloud) to a single, top-down world model. In plain terms, it first learns what each sensor “sees,” converts those learnings into the same bird’s-eye grid, blends them into one coherent BEV feature map, and finally decodes that map into boxes and road semantics the car can act on.</p><p>At a high level, the process is:</p><ol><li><strong>Encoders (per sensor)<br></strong> • Cameras → a CNN/ViT backbone turns images into 2D feature maps.<br> • LiDAR → a voxel/pillar backbone turns the point cloud into 3D/BEV features.</li><li><strong>View transformation to BEV<br></strong> • Camera features are “lifted” with per-pixel depth estimates and pooled onto the BEV grid.<br> • LiDAR features are collapsed along height to the same BEV grid.</li><li><strong>Fusion in BEV<br></strong> • Camera-BEV and LiDAR-BEV tensors are aligned cell-by-cell and concatenated channel-wise to form one fused map.</li><li><strong>BEV Encoder (refinement)<br></strong> • A lightweight stack of 2D convolutions/residual blocks mixes modalities, fixes small misalignments, and adds spatial context.</li><li><strong>Task heads (multi-task output)<br></strong> • 3D detection head (CenterPoint-style) predicts centers, sizes, orientation, and velocity.<br> • BEV segmentation head labels drivable space, crosswalks, lane dividers, etc.<br> • Additional heads (e.g., tracking, occupancy) can be attached to the same fused BEV.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*BOwfLNTHi0lx0TfE" /><figcaption>Figure 4: BEVFusion Architecture (Source: <a 
href="https://www.themoonlight.io/en/review/unibevfusion-unified-radar-vision-bevfusion-for-3d-object-detection">https://www.themoonlight.io/en/review/unibevfusion-unified-radar-vision-bevfusion-for-3d-object-detection</a>)</figcaption></figure><h4>1. Encoders — Learning from Each Sensor Separately</h4><p>The first stage of BEVFusion involves running each sensor’s raw data through its encoder network to produce high-level feature maps. In essence, an encoder digests the input (images or point clouds) and outputs a set of learned features that highlight important structures (like edges, shapes, or other patterns relevant to detecting objects or free space). For the camera images, the encoder is typically a 2D convolutional neural network (CNN) — for example, a ResNet or VGG-style backbone can be used to extract visual features. CNNs are a natural choice here because they excel at handling images, capturing local textures and shapes.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*KmaVq9bBvgcXCKU1" /><figcaption>Figure 5: BEVFusion pipeline for multi-modal perception in autonomous driving (<a href="https://arxiv.org/pdf/2205.13542">https://arxiv.org/pdf/2205.13542</a>)</figcaption></figure><p>For the LiDAR sensor, the encoder is quite different, since the input is a 3D point cloud instead of a 2D image. One approach is using a network like PointNet++ that operates directly on point clouds, learning features from the 3D coordinates. Another approach (which BEVFusion uses) is to first convert the point cloud into a structured 3D grid via <em>voxelization</em> or <em>pillarization</em>, and then apply 3D or pseudo-3D convolutions. In simpler terms, the scattered LiDAR points are turned into a set of 3D pixels (voxels) or vertical columns (pillars), and a 3D CNN processes those to extract features. 
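</p><p>As a toy sketch of the pillarization idea (the grid size and the hand-crafted per-pillar features below are illustrative; real pipelines learn pillar features with a small network before handing the grid to a CNN backbone):</p><pre>
```python
import numpy as np

# Scatter synthetic LiDAR points into vertical columns ("pillars") on a
# 2D ground-plane grid, then reduce each pillar to a small feature vector.
rng = np.random.default_rng(42)
points = rng.uniform(-10, 10, size=(1000, 3))    # (x, y, z) returns in metres

GRID, CELL = 20, 1.0                             # 20 x 20 grid of 1 m cells
ix = ((points[:, 0] + 10) / CELL).astype(int).clip(0, GRID - 1)
iy = ((points[:, 1] + 10) / CELL).astype(int).clip(0, GRID - 1)

# Simple per-pillar features: point count and maximum height.
count = np.zeros((GRID, GRID))
zmax = np.full((GRID, GRID), -np.inf)
np.add.at(count, (iy, ix), 1)                    # accumulate counts per cell
np.maximum.at(zmax, (iy, ix), points[:, 2])      # track tallest point per cell
zmax[count == 0] = 0.0                           # empty pillars get height 0

pillar_feats = np.stack([count, zmax], axis=-1)  # (20, 20, 2) BEV tensor
```
</pre><p>Each cell of <em>pillar_feats</em> now summarizes the column of space above it, exactly the kind of regular grid that convolutional layers can refine further. 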
The BEVFusion authors tested several combinations of LiDAR encoders and ultimately used a pipeline involving voxelization/pillarization followed by CNN layers to get effective LiDAR features. By the end of this stage, we have two sets of feature maps: one from the camera, one from LiDAR. Each is tailored to its modality, encoding things like visual textures in the camera features and shape/distance cues in the LiDAR features.</p><h4>2. Bird’s-Eye View Transformation — Projecting Features to a Common Plane</h4><p>Once we have features from each sensor, the next step is to map those features into the bird’s-eye view coordinate frame. This is the core step that enables a unified representation. The transformation is done separately for camera and LiDAR features:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*mPmzPE_NCBlFocu6" /><figcaption>Figure 6: Camera-to-BEV transformation (Source: <a href="https://arxiv.org/pdf/2205.13542">https://arxiv.org/pdf/2205.13542</a>)</figcaption></figure><ul><li><strong>Camera-to-BEV</strong>: For camera features, BEVFusion uses a technique called <strong>feature lifting</strong>. Imagine each pixel in the camera’s feature map now “sprouting” upwards into 3D space. The network predicts a depth distribution for each pixel — essentially estimating how far out in 3D that image feature might be. By doing this, the 2D features get lifted into 3D point features (like a cloud of feature points, where each carries the original image feature but is placed at some estimated height/distance). For example, a feature that corresponds to, say, a traffic light in the image will be projected out into 3D space at the likely location of that traffic light. Once all image features are lifted to 3D, we then perform BEV pooling: we aggregate these 3D feature points onto a fixed 2D grid representing the ground plane. 
Essentially, the area around the car is divided into a grid (the BEV map cells), and any lifted features that fall into the same cell are combined (e.g., by summing or averaging). This yields a top-down feature map for the camera, where each cell contains an aggregated representation of whatever the camera sensed at that location in the ground plane. It’s like mapping the camera image back into the 3D world and laying it onto the ground plane, creating a mosaic of features from above.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eTh-UGYYSqROKTofG-2Scw.png" /><figcaption>Figure 7: The idea behind Feature Lifting (Source: <a href="https://www.thinkautonomous.ai/blog/bev-fusion/">https://www.thinkautonomous.ai/blog/bev-fusion/</a>)</figcaption></figure><ul><li><strong>LiDAR-to-BEV</strong>: LiDAR features start inherently in 3D space (since the LiDAR encoder produces features arranged in a 3D grid or point cloud). To get these into BEV, the operation is more straightforward: we collapse the features along the vertical (Z) axis, since for BEV, we only care about information in the horizontal plane. In practice, this might mean taking the max or sum of LiDAR features in each vertical column, ending up with a 2D grid of LiDAR feature cells on the ground plane. Because LiDAR directly measures 3D structure, this step doesn’t require an explicit depth guess like the cameras needed; it’s more about formatting the data into the same type of grid. After this, we have a camera BEV feature map and a LiDAR BEV feature map, both aligned to the same geographic coordinate system (meters in the real world, typically).</li></ul><p>At this point, it’s worth reflecting on how much has been achieved. The system has effectively taken a 2D image and a 3D point cloud and converted both into compatible “maps”. 
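</p><p>The camera-side “lift then pool” step can be sketched in a few lines (all shapes are invented, and the random cell assignment is a stand-in for real camera geometry, which would project each pixel/depth-bin pair into the grid using calibration):</p><pre>
```python
import numpy as np

# Hypothetical sizes: a 4x6 image feature map with 8 channels, 10 depth bins,
# pooled onto a 16x16 BEV grid.
H, W, C, D, GRID = 4, 6, 8, 10, 16
rng = np.random.default_rng(0)

feats = rng.normal(size=(H, W, C))               # per-pixel camera features
depth_prob = rng.random((H, W, D))
depth_prob /= depth_prob.sum(-1, keepdims=True)  # per-pixel depth distribution

# Lift: every pixel spawns D pseudo-3D feature points, weighted by depth probability.
lifted = feats[:, :, None, :] * depth_prob[..., None]    # (H, W, D, C)

# Stand-in for camera geometry: assign each (pixel, depth bin) to a BEV cell.
cell_x = rng.integers(0, GRID, size=(H, W, D))
cell_y = rng.integers(0, GRID, size=(H, W, D))

# BEV pooling: sum every lifted feature that lands in the same ground-plane cell.
bev = np.zeros((GRID, GRID, C))
np.add.at(bev, (cell_y.ravel(), cell_x.ravel()), lifted.reshape(-1, C))
```
</pre><p>Because pooling sums every contribution into its cell, none of the camera’s dense features are discarded, unlike projection onto sparse LiDAR points. 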
Any given cell in these BEV maps corresponds to, say, “the region 5–6 meters ahead of the car and 2–3 meters to the left,” and both the camera-derived features and LiDAR-derived features for that location now refer to the same spot in the world. This common positioning lays the groundwork for the easy fusion step that comes next.</p><h4>3. Fusion — Merging Modalities on the BEV Grid</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*pgHei4XAAGJf_uL2" /><figcaption>Figure 8: BEVFusion unifies camera and LiDAR features in a shared BEV space instead of mapping one modality to the other (Source: <a href="https://arxiv.org/pdf/2205.13542">https://arxiv.org/pdf/2205.13542</a>)</figcaption></figure><p>With both camera and LiDAR feature maps in the BEV format, fusing them becomes simple. Now that each is an “image” of the same ground plane, the fusion is a matter of laying one on top of the other and combining. BEVFusion’s fusion step is done by concatenation: for each cell in the BEV grid, take the feature vector from the camera map and the feature vector from the LiDAR map and join them into one longer feature vector. In code, this might look as simple as torch.cat([camera_features, lidar_features], dim=1) — effectively stacking the two feature maps channel-wise.</p><p>This straightforward fusion is powerful because by the time we concatenate, the heavy lifting (aligning coordinates and preserving information) has already been handled. There’s no need for complex recalibration or iterative matching; each fused BEV cell [i, j] contains a joint feature that combines camera textural cues with LiDAR geometry. 
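</p><p>Numerically, the fusion step amounts to a single concatenation (the channel counts here are illustrative; the real model uses PyTorch tensors in <em>(N, C, H, W)</em> layout):</p><pre>
```python
import numpy as np

# Two BEV maps over the same 16x16 grid: one from cameras, one from LiDAR.
GRID = 16
camera_bev = np.random.rand(GRID, GRID, 80)    # 80 camera feature channels
lidar_bev = np.random.rand(GRID, GRID, 256)    # 256 LiDAR feature channels

# Cell-by-cell, channel-wise concatenation: the entire fusion step.
fused_bev = np.concatenate([camera_bev, lidar_bev], axis=-1)   # (16, 16, 336)
```
</pre><p>Every fused cell now carries both modalities’ channels side by side. 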
For example, if the camera features indicate a certain color/texture consistent with a pedestrian and the LiDAR features indicate an upright small object at that spot, the fused representation for that cell will encode both cues, making it easier for the network to ultimately classify it as a pedestrian.</p><p>Another benefit of this design is extensibility: because everything is unified on the BEV grid, additional sensors can be integrated the same way. Radar, extra cameras, or even future modalities can be encoded, projected to BEV, and concatenated alongside the existing features. BEVFusion’s fusion stage isn’t hard-coded to “camera + LiDAR”; it’s a generic, BEV-space multi-modal pipeline that readily accommodates new inputs to enrich perception.</p><h4>4. BEV Encoder — Refining the Fused Representation</h4><p>After concatenation, we obtain a fused BEV feature map from all sensors, but small inconsistencies can remain (e.g., camera depth errors or sparse LiDAR coverage). BEVFusion addresses this with a BEV Encoder: a convolutional network with residual blocks that operates on the top-down map to refine the fusion. It smooths differences and learns composite features that truly mix camera semantics with LiDAR geometry. For instance, when a camera edge and LiDAR points are slightly offset, the encoder’s filters learn to reconcile them during training, improving downstream detection and segmentation. The stage is fully convolutional and learned without hand-crafted alignment rules, so the representation is fine-tuned into a coherent, robust BEV before the task heads. Choosing convolutions over heavy transformers also preserves real-time efficiency: CNNs exploit local BEV structure and run efficiently on GPUs for large grids. Together with the paper’s efficient BEV operations (including the fast view transform), this keeps latency within AV budgets.</p><h4>5. 
Multi-Task Heads — Detecting Objects and Segmenting the Scene</h4><p>With a refined BEV feature map, BEVFusion branches into task-specific heads to produce human-interpretable outputs: 3D object detections and a semantic BEV map. The design is HydraNet-style — one trunk, multiple heads, so the same fused representation supports both predictions efficiently.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*NAAJeVVEh1QH0ARh" /><figcaption>Figure 9: Multi-Sensor Integration of Camera, LiDAR, and Radar for Bird’s-Eye-View 3D Perception (Source: <a href="https://www.youtube.com/watch?v=5v9wa-UCFMM">https://www.youtube.com/watch?v=5v9wa-UCFMM</a>)</figcaption></figure><ul><li><strong>3D Object Detection Head</strong>: BEVFusion adopts a CenterPoint-style head that finds object centers on the BEV plane and regresses full 3D boxes (dimensions, orientation, velocity). Plugging this proven head atop fused features avoids reinventing the wheel: for each location, the head leverages combined camera semantics and LiDAR geometry to localize and size vehicles, pedestrians, and more with higher confidence.</li><li><strong>BEV Segmentation Head</strong>: This head labels the ground plane by BEV cell — drivable space, pedestrian crossing, walkway, stop line, parking area, lane divider, and related classes — producing a planner-ready map. Unlike front-view segmentation, the result is a top-down layout that directly indicates where the car can and cannot go. Because both heads read the same fused features, predictions are naturally consistent (e.g., detected cars sit on drivable regions, not sidewalks).</li></ul><p>Using two heads on one backbone departs from older split pipelines and brings two advantages: efficiency (one shared trunk instead of separate models) and multi-task synergy. 
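</p><p>The one-trunk, many-heads layout can be sketched with 1×1 convolutions, which act as independent per-cell linear maps over the shared features (all sizes and head outputs below are invented for illustration):</p><pre>
```python
import numpy as np

# Shared fused BEV trunk output: a 16x16 grid with 336 channels.
rng = np.random.default_rng(7)
fused_bev = rng.normal(size=(16, 16, 336))

def conv1x1(x, w):
    """A 1x1 convolution is just a per-cell matmul over the channel axis."""
    return x @ w

# Two independent heads reading the SAME trunk features.
w_det = rng.normal(size=(336, 10)) * 0.01   # e.g. center heatmap + box parameters
w_seg = rng.normal(size=(336, 6)) * 0.01    # e.g. 6 semantic map classes

det_out = conv1x1(fused_bev, w_det)         # (16, 16, 10) detection outputs
seg_out = conv1x1(fused_bev, w_seg)         # (16, 16, 6) segmentation logits
```
</pre><p>Attaching another head (tracking, occupancy) is just one more small map over the same trunk, which is what makes the design extensible. 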
Joint training encourages detections and the semantic map to cohere, improving overall scene understanding.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*WywJvtP3ygNTwjbr" /><figcaption>Figure 10: Unified Perception Output (Source: <a href="https://www.youtube.com/watch?v=5v9wa-UCFMM">https://www.youtube.com/watch?v=5v9wa-UCFMM</a>)</figcaption></figure><h3>Performance and Impact on Autonomous Driving</h3><p>BEVFusion isn’t just an academic exercise — it has proven its value with record-setting performance on real-world autonomous driving benchmarks. On the demanding nuScenes dataset, BEVFusion established a new state of the art across multiple metrics. For 3D object detection, it achieved the #1 leaderboard ranking, with mAP/NDS about 1.3% higher than the previous best (see Figure 11). Its advantage is even clearer in BEV map segmentation: +13.6% IoU over LiDAR-only and ~+6% over camera-only models. Earlier fusion methods often struggled or skipped this task, whereas BEVFusion excelled (illustrated by the class-wise gains in Figure 11). The takeaway is simple: combining camera semantics with LiDAR geometry doesn’t just boost detection — it’s practically essential for robust scene mapping in complex scenarios.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*JacIz9JOVOcCL3yG" /><figcaption>Figure 11: 3D object detection — BEVFusion achieves &gt;1.3% higher mAP and NDS with 1.5~2x lower computational cost (Source: <a href="https://www.youtube.com/watch?v=5v9wa-UCFMM">https://www.youtube.com/watch?v=5v9wa-UCFMM</a>)</figcaption></figure><p>What’s equally important is that these gains don’t come with a heavy efficiency tax. Thanks to a fast, exact camera-to-BEV pooling kernel and a fully convolutional design, BEVFusion reports ~1.9× lower compute than prior fusion approaches, after eliminating the main bottleneck in view transformation (&gt;40× speedup). 
The result is a model that’s realistic for real-time autonomous vehicle use, not just leaderboard demos (efficiency and latency are summarized in Figure 11).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*q6RjULySdpGQtvSX" /><figcaption>Figure 12: 3D object detection — BEVFusion brings larger improvements to the LiDAR-only detector for small and distant objects (Source: <a href="https://www.youtube.com/watch?v=5v9wa-UCFMM">https://www.youtube.com/watch?v=5v9wa-UCFMM</a>)</figcaption></figure><p>Beyond raw numbers, BEVFusion is a conceptual shift — it changes where fusion happens. Point-level fusion adds image cues to each LiDAR point or voxel, which wastes most pixels and ignores much background. Proposal-level fusion merges after generating 3D box candidates, which helps detection but doesn’t build a full scene map and can miss small or far objects. BEVFusion instead fuses on a dense bird’s-eye grid covering the whole scene, so every pixel and LiDAR return contributes, improving small/far cases and enabling map semantics (see Figure 12). This template is already inspiring hybrids that add light transformer context on top of efficient BEV CNNs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*F_zaoZzPu8OUKhYw" /><figcaption>Figure 13: BEV map segmentation — BEVFusion outperforms the state-of-the-art fusion methods by 13.6% mIoU (Source: <a href="https://www.youtube.com/watch?v=5v9wa-UCFMM">https://www.youtube.com/watch?v=5v9wa-UCFMM</a>)</figcaption></figure><p>Industry momentum echoes the trend. Production-minded stacks are increasingly transforming multi-sensor inputs into a unified BEV for planning, and several companies publicly describe using BEV-style egocentric maps as a core interface. 
BEVFusion validates that direction and extends it cleanly to true multi-sensor fusion — camera and LiDAR working together in a single, planner-ready view (with segmentation quality reflected in Figure 13).</p><h3><strong>Conclusion and Outlook</strong></h3><p>BEVFusion has shown that <strong>when it comes to perceiving the world, two (or more) “eyes” are better than one — especially if they can share a common view</strong>. By uniting camera and LiDAR data in the bird’s-eye view, this approach achieves results that neither modality could accomplish alone, all while keeping the system efficient and versatile. For autonomous vehicles, this means a more reliable understanding of the environment: cars that can not only detect other agents with high accuracy but also comprehend the scene’s layout (road versus sidewalk, etc.) in the same breath. The ripple effects of this are significant. A richer perception system feeds into better decision-making — for example, a car that knows the precise drivable surface and spots a new obstacle can plan a safer detour.</p><p><strong>Key takeaways</strong> include:</p><ul><li><em>Common Reference Frame:</em> Converting multi-sensor data to a shared BEV space preserves each sensor’s strengths and avoids the information loss of direct sensor-to-sensor projections.</li><li><em>Multi-Task Efficiency:</em> A single fused model can handle detection and segmentation together, improving consistency and reducing the need for separate modules.</li><li><em>State-of-the-Art Results:</em> BEVFusion’s fusion strategy delivered top performance on challenging benchmarks, substantially improving 3D detection and especially BEV segmentation accuracy.</li><li><em>Real-Time Ready:</em> Through optimized operations and a convolutional design, the approach runs quickly (addressing prior bottlenecks) with reduced latency, making it viable for actual driving systems.</li><li><em>Inspiration for Next Gen:</em> This work is influencing both academic research 
(e.g., hybrid models adding transformers) and industry practice (major AV companies adopting BEV-centric sensor fusion).</li></ul><p>Looking ahead, expect richer stacks that plug in more sensors (radar, ultrasonics, GNSS) to the same BEV canvas, extend BEVFusion over time for motion-aware “4D” understanding, and couple perception directly with prediction and planning in the same coordinate space. In short, choosing the right representation matters: by getting cameras and LiDAR to see eye-to-eye from above, BEVFusion delivers a stronger, more dependable vision core — an approach poised to become a cornerstone of autonomous driving.</p><p>In conclusion, BEVFusion demonstrates the power of finding the right representation. By getting cameras and LiDAR to <strong>see eye-to-eye (from a bird’s-eye perspective)</strong>, it unlocks a richer, more reliable form of machine vision for self-driving cars. It’s a compelling example of how bringing different technologies together — when done thoughtfully — can yield results greater than the sum of their parts. As autonomous vehicles continue to evolve, approaches like BEVFusion are likely to be a cornerstone of their “brains,” ensuring that these vehicles perceive the world with unprecedented clarity and confidence.</p><h3>References</h3><p>BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation. <em>arXiv</em>, 26 May 2022, <a href="https://arxiv.org/pdf/2205.13542">https://arxiv.org/pdf/2205.13542</a>.</p><p>Han, Song, et al. “BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation.” <em>YouTube</em>, uploaded by MIT Han Lab, 14 Sept. 2022, <a href="https://www.youtube.com/watch?v=5v9wa-UCFMM">https://www.youtube.com/watch?v=5v9wa-UCFMM</a>.</p><p>“BEV Fusion: The New Paradigm for Autonomous Driving Perception.” <em>ThinkAutonomous</em>, 23 Sept. 
2022, <a href="https://www.thinkautonomous.ai/blog/bev-fusion/">https://www.thinkautonomous.ai/blog/bev-fusion/</a>.</p><p>Liu, Zengyi, et al. “BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation.” <em>ar5iv</em>, 27 May 2022, <a href="https://ar5iv.labs.arxiv.org/html/2205.13790">https://ar5iv.labs.arxiv.org/html/2205.13790</a>.</p><p>“UniBEVFusion: Unified Radar-Vision BEVFusion for 3D Object Detection.” <em>The Moonlight</em>, 2024, <a href="https://www.themoonlight.io/en/review/unibevfusion-unified-radar-vision-bevfusion-for-3d-object-detection">https://www.themoonlight.io/en/review/unibevfusion-unified-radar-vision-bevfusion-for-3d-object-detection</a>.</p><p>Yin, Tianwei, Xingyi Zhou, and Philipp Krähenbühl. <em>Center-Based 3D Object Detection and Tracking.</em> <em>IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 2021, doi:10.48550/arXiv.2006.11275.</p><p>Gu, Jianyuan, et al. <em>BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation (Supplementary Material).</em> arXiv, 2022, doi:10.48550/arXiv.2205.13542.</p><p>Liang, Ming, et al. <em>Multi-Task Multi-Sensor Fusion for 3D Detection and Segmentation.</em> <em>IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</em>, 2022, doi:10.1109/CVPR52688.2022.01476.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b2190f877c9b" width="1" height="1" alt=""><hr><p><a href="https://medium.com/demistify/av-vol-3-bevfusion-unifying-vision-in-autonomous-driving-systems-b2190f877c9b">[AV Vol.3] BEVFusion: Unifying Vision in Autonomous Driving Systems</a> was originally published in <a href="https://medium.com/demistify">deMISTify</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Giving A Man A Fish vs Teaching A Man How To Fish: A Shift In Perspective In Imitation Learning]]></title>
            <link>https://medium.com/demistify/giving-a-man-a-fish-vs-teaching-a-man-how-to-fish-a-shift-in-perspective-in-imitation-learning-fd26af591502?source=rss----2081633ded83---4</link>
            <guid isPermaLink="false">https://medium.com/p/fd26af591502</guid>
            <category><![CDATA[reinforcement-learning]]></category>
            <category><![CDATA[imitation-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[robotics]]></category>
            <dc:creator><![CDATA[Mithun Vanniasinghe]]></dc:creator>
            <pubDate>Sat, 20 Sep 2025 12:29:28 GMT</pubDate>
            <atom:updated>2025-09-20T12:29:25.745Z</atom:updated>
            <content:encoded><![CDATA[<h3>What is Imitation Learning?</h3><p>Imitation learning (IL) is one of the most popular ways modern-day roboticists train autonomous robots. The idea follows from the reinforcement learning (RL) paradigm.</p><p>In RL, an agent learns an optimal behaviour, its control policy, by continuously interacting with the environment and refining the policy based on the rewards it receives for visiting particular states and taking particular actions.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/942/1*bCxjo4QhZguhHZYDFZR_9Q.png" /></figure><p>RL is a sound methodology for robotic control; however, it falls short in two respects. First, it requires well-engineered reward functions that rely heavily on domain knowledge. Second, RL requires learning from interaction. When learning from interaction in simulation is infeasible or undesirable, real-world interaction must be used, which can result in potentially dangerous early behaviour as the robot learns by trial and error and optimizes its policy.</p><p>IL, and in particular a variant called behavioural cloning (BC), provides an attractive alternative. BC treats autonomous robotic control as a supervised learning problem: a BC agent learns which actions to take in a particular state by regressing its actions onto those of an expert. The expert can take many forms, for example a human teleoperator moving the robot with a remote controller, or an existing autonomous agent. Given a set of expert demonstrations, a control policy is fit to match the policy of the expert.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/960/1*mCiaiHOBzoa50Oj81oWzaQ.png" /></figure><p>However, this presents an issue: IL is prone to what is known as the distributional shift problem. 
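</p><p>The supervised-learning view of BC above can be sketched with a toy example. This is an illustrative sketch, not any particular system’s implementation: it fits a linear policy to synthetic expert state-action pairs by least squares (real BC policies are typically neural networks trained by gradient descent, but the regression objective is the same).</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic expert demonstrations: states visited and the actions taken.
expert_states = rng.normal(size=(200, 4))   # 200 visited states, 4-D each
W_expert = rng.normal(size=(4, 2))          # the expert's (hidden) mapping
expert_actions = expert_states @ W_expert   # 2-D actions

# Behavioural cloning as supervised regression: fit a policy
# pi(s) = s @ W that matches the expert's actions on the demo states.
W, *_ = np.linalg.lstsq(expert_states, expert_actions, rcond=None)

def policy(state):
    """Cloned policy: predict the expert's action for a given state."""
    return state @ W

# On states from the expert's distribution, the clone matches closely.
print(np.allclose(policy(expert_states), expert_actions, atol=1e-6))  # True
```

<p>Note that nothing constrains the policy on states far from the demonstrations, which is exactly the problem discussed next.</p><p>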
The agent learns from the expert’s data distribution; however, at test time, it acts on its own data distribution. If it enters a state the expert has never seen, it does not know what to do: it takes suboptimal actions and veers away from the expert’s example trajectory.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aieWFj6UmNqbcqMniTGc8Q.png" /></figure><h3>Moving Beyond Basic Behavioural Cloning</h3><p>We want to train more intelligent agents, ones that can reason about failures, recognize when they are entering out-of-distribution states, and avoid those states. This is analogous to the popular proverb: “Give a man a fish and you feed him for a day. Teach a man to fish and you feed him for a lifetime.” You can regress the agent’s policy to the expert for known states (give the agent the fish), but if you teach the agent how to reason about unseen states and search for alternative actions (teach the agent how to fish), you can prevent failures caused by distributional shift.</p><p>Inspiration for this work comes from the success of language models. In some sense, next-token prediction is a much simpler problem than robotic control (e.g., no partial observability, no stochastic dynamics); yet high performance wasn’t achieved solely by adding more data. Interactive learning, in the form of RL from Human Feedback (RLHF), was required to see a jump in performance.</p><p>In robotics, RLHF doesn’t scale because human feedback bottlenecks the optimization loop; instead, SAILOR teaches the agent to reason and search autonomously.</p><h3>SAILOR</h3><p>Two criteria are required for an agent to learn to recover from its own mistakes.</p><ol><li>Prediction — the agent must have predictive capabilities; it must be able to foresee the consequences of its own actions. In SAILOR, this ability is implemented with a world model.</li><li>Evaluation — the agent must be able to judge which outcomes are preferable to others. 
In SAILOR, this is implemented as a learned reward model.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*B95m1QN17wYOtuoYPFYLIw.png" /></figure><p>The SAILOR model has a few main components.</p><h3>Base Policy</h3><p>The first component is the base policy, trained on expert demonstrations using standard BC. SAILOR focuses on visual manipulation tasks, so the observation space consists of RGB images along with the robot’s proprioceptive state. Any base policy works, but here the authors use a diffusion policy. The policy generates a k-step plan, known formally in the IL literature as action chunking. Typically, a policy in a Markov decision process predicts only the next action given the current state; in practice, however, predicting a chunk of future actions usually yields better performance in imitation learning.</p><h3>World Model</h3><p>The second component is the world model. It contains three parts, trained jointly with a reconstruction loss: an encoder, a latent dynamics model, and a decoder. The encoder maps a high-dimensional image observation into a latent space, the dynamics model predicts the future latent state given the current latent state, and the decoder learns to reconstruct the future observation from its predicted latent. In this paper, the authors use the Dreamer world model.</p><h3>Reward Model</h3><p>The reward model scores the predicted latent states based on how expert-like they are: the closer they are to a state the expert would visit, the higher their reward. This can be seen from the training objective of the reward model.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*V2K1r5aiz0sQuUIR8LGDhQ.png" /></figure><p>Here, D is the dataset of expert demonstrations, and B is the dataset of trajectories obtained by rolling out the BC-trained base policy. 
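</p><p>A toy illustration of this kind of objective follows. Everything here is an assumption for illustration, not the paper’s models: a simple linear reward is trained by moment matching to assign higher scores to expert samples (D) than to base-policy rollouts (B) by pushing their mean rewards apart.</p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in latent features: D (expert demos) and B (base-policy rollouts).
expert_feats = rng.normal(loc=1.0, size=(500, 8))   # D
policy_feats = rng.normal(loc=0.0, size=(500, 8))   # B

# Linear reward r(z) = z @ w, trained by moment matching: raise the mean
# reward of expert samples, lower that of policy samples, and clip w so
# the reward stays bounded.
w = np.zeros(8)
for _ in range(100):
    grad = expert_feats.mean(axis=0) - policy_feats.mean(axis=0)
    w = np.clip(w + 0.1 * grad, -1.0, 1.0)

# Expert-like latents now receive higher reward than policy rollouts.
print((expert_feats @ w).mean() > (policy_feats @ w).mean())  # True
```

<p>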
This objective is a moment-matching loss: minimizing it assigns higher reward to latent states and actions similar to the expert’s and penalizes those that are not.</p><h3>Critic</h3><p>The critic serves as a terminal reward. It encodes the value of the rest of the trajectory, given that we end in the kth predicted latent state.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XgqJUpC7RSZfdZJfT2JwaA.png" /></figure><h3>How It All Works</h3><p>In practice, this is not too different from model-based imitation learning. First, given a nominal plan sampled from the base policy at time step j, N candidate plans are generated. Each candidate plan adds corrective actions, delta, to the nominal actions to avoid entering failure states, i.e., states that are out of the expert’s distribution (OOD). For each of the N candidate plans, a corrective action is sampled and the plan is rolled out in the latent space of the world model. The quality of the trajectory is scored with the reward model and critic using the following equation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fueK9Abhc-0HnmiCt63-xg.png" /></figure><p>The N scores and the N sampled corrective actions are used to update the parameters of the corrective-action distribution so that an appropriate corrective action, delta, can be proposed at the next time step, j+1. The first corrective action of the highest-scoring plan is added to the nominal action for time step j, and this combined action is executed by the robot. 
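</p><p>The search loop just described can be sketched in miniature. Everything below is a toy stand-in (a scalar latent with hand-written dynamics, reward, and critic) meant only to show the shape of the procedure: sample N corrective deltas around the nominal plan, roll each corrected plan out in the world model, score it with the reward model plus critic, and execute the first action of the best plan.</p>

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-ins for the learned components (illustrative only).
def dynamics(z, a):
    """World model: next latent state given latent state and action."""
    return 0.9 * z + a

def reward(z):
    """Reward model: higher when the latent looks more expert-like (z near 1)."""
    return -abs(z - 1.0)

def critic(z):
    """Terminal value of ending the k-step rollout in latent state z."""
    return -0.5 * abs(z - 1.0)

K, N = 5, 32                  # plan length (action chunk) and candidate count
z0 = 0.0                      # current latent state
nominal = np.zeros(K)         # k-step nominal plan from the base policy

def score(plan, z):
    """Roll a plan out in the world model; sum rewards plus terminal value."""
    total = 0.0
    for a in plan:
        z = dynamics(z, a)
        total += reward(z)
    return total + critic(z)

# Candidate corrective deltas (the zero delta keeps the nominal plan in play).
deltas = np.vstack([np.zeros((1, K)), rng.normal(scale=0.3, size=(N - 1, K))])
scores = np.array([score(nominal + d, z0) for d in deltas])
best = deltas[np.argmax(scores)]

# Execute only the first corrected action; the search repeats next step.
action = nominal[0] + best[0]
print(scores.max() >= score(nominal, z0))  # True
```

<p>In the full system the corrective-action distribution is also updated from the N scores, so later time steps start their search from better proposals.</p><p>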
This whole process then repeats at the next time step.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*URVTph73CHqdJDiupHpCtw.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1euh3YEvTN_4OD6M2eZjmw.png" /></figure><h3>Performance Results</h3><p>Across a variety of visual manipulation tasks and dataset scales, SAILOR outperforms the base diffusion policy.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sl7h_EaRV81EFVb-DY9VdA.png" /></figure><p>Furthermore, compared to a model-free inverse RL baseline, the authors found that SAILOR is more sample-efficient, requiring much less environment interaction to achieve higher performance across tasks and expert dataset scales.</p><h3>Conclusions</h3><p>SAILOR presents a new way to approach imitation learning and prevent OOD errors. Motivated by the proverb of teaching a man to fish rather than giving him a fish, this paper shows that to achieve truly functional autonomous systems, we must equip them not only with the ability to mimic expert behaviour, but also with the capacity to adapt, generalize, and make safe decisions when faced with unfamiliar situations. By fostering these skills, SAILOR moves beyond rote imitation toward agents that act with autonomy and resilience in dynamic real-world environments.</p><h3>References</h3><p>A. K. Jain, V. Mohta, S. Kim, A. Bhardwaj, J. Ren, Y. Feng, S. Choudhury, and G. 
Swamy, “A Smooth Sea Never Made a Skilled <em>SAILOR</em>: Robust Imitation via Learning to Search,” <em>arXiv:2506.05294</em> [cs.LG], 2025.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=fd26af591502" width="1" height="1" alt=""><hr><p><a href="https://medium.com/demistify/giving-a-man-a-fish-vs-teaching-a-man-how-to-fish-a-shift-in-perspective-in-imitation-learning-fd26af591502">Giving A Man A Fish vs Teaching A Man How To Fish: A Shift In Perspective In Imitation Learning</a> was originally published in <a href="https://medium.com/demistify">deMISTify</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>