Inspiration

The inspiration behind AudibleEyes stems from a deep-rooted desire to empower visually impaired individuals to navigate the world freely. Rather than sketching out a proof of concept, we wanted to build a concrete system that is practical, human-centric, and immediately usable for both wearers and their caretakers. We also wanted to ground the navigation system in physics rather than relying on AI pipelines alone, reducing the risk of hallucination, which can be genuinely dangerous for the visually impaired people who depend on AudibleEyes.

Our vision is a world where visual impairment never keeps anyone from experiencing and navigating their surroundings.

What It Does + How We Built It

AudibleEyes Smart Glasses help visually impaired people navigate real-world scenes in real time using fine-tuned depth estimation (Intel ISL's MiDaS), object detection and segmentation (YOLOv5), and fine-tuned LLMs (Gemini), combined with Newtonian analysis of the scene as a physical dynamical system. All of these components are integrated through a LangChain graph that generates voice-based navigation guidance delivered directly to the wearer's ears. Additionally, our app lets users and their caretakers view an annotated visual map and receive audio navigation assistance.
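As a rough illustration, the perception step can be sketched as follows. This is a minimal sketch using the stock torch.hub models and default thresholds, not our fine-tuned weights or production code:

```python
# Sketch of the perception step: YOLOv5 for object detection and MiDaS for
# monocular depth estimation on one captured photo. Illustrative only.
import cv2
import torch

# Pretrained hub models stand in for our fine-tuned weights
yolo = torch.hub.load("ultralytics/yolov5", "yolov5s")
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas_transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform
midas.eval()

def analyze_frame(image_path: str):
    """Detect obstacles and attach a coarse relative depth to each one."""
    img = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)

    # Object detection: bounding boxes, confidences, class labels
    detections = yolo(img).pandas().xyxy[0]

    # Relative inverse-depth map (higher value = closer in MiDaS output)
    with torch.no_grad():
        depth = midas(midas_transform(img))
        depth = torch.nn.functional.interpolate(
            depth.unsqueeze(1), size=img.shape[:2],
            mode="bicubic", align_corners=False,
        ).squeeze().numpy()

    results = []
    for _, det in detections.iterrows():
        x1, y1, x2, y2 = map(int, (det.xmin, det.ymin, det.xmax, det.ymax))
        results.append({
            "label": det["name"],
            "confidence": float(det.confidence),
            "relative_depth": float(depth[y1:y2, x1:x2].mean()),
        })
    return results
```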

To our knowledge, we are the first to build an AI navigation system for Ray-Ban Meta glasses, which ship without developer tools. We bootstrapped a custom PHP server to bridge the glasses and our pipeline. When the user prompts the Meta glasses with a navigation request, the glasses snap a photo and send it to a WhatsApp developer bot, which forwards it to the PHP server and on to our LangChain pipeline.
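A minimal sketch of the Python service that the PHP relay could call once it has pulled the photo from the WhatsApp webhook; the route, field names, and `run_navigation_pipeline` stub are placeholders for illustration:

```python
# Sketch of the Flask endpoint that receives the forwarded photo and kicks off
# the pipeline. Route and field names are assumptions, not the exact contract.
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_navigation_pipeline(image_path: str) -> str:
    """Placeholder for the LangChain graph (detection -> depth -> path -> Gemini summary)."""
    return "placeholder summary"

@app.route("/navigate", methods=["POST"])  # hypothetical route
def navigate():
    # The PHP relay forwards the image bytes plus the WhatsApp sender ID
    image = request.files["image"]
    sender = request.form.get("sender_id", "")

    image_path = f"/tmp/{sender}_frame.jpg"
    image.save(image_path)

    # Run detection, depth estimation, and path planning
    summary = run_navigation_pipeline(image_path)

    # The relay sends this text back to the glasses through WhatsApp
    return jsonify({"summary": summary})

if __name__ == "__main__":
    app.run(port=5000)
```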

Our LangChain pipeline uses multimodal agents to analyze the photo: fine-tuned YOLOv5 handles object detection and segmentation, Intel ISL's MiDaS generates a depth map, and a semantic scene analysis step interprets the result. The outputs are then passed to the Gemini API, which synthesizes the optimal path. To ensure safety, we combine machine-learning-based path prediction with a physical dynamical-systems model: differential equations predict collisions, and repulsive forces are computed with potential fields and Newtonian cost-based optimization.
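The potential-field layer can be sketched as follows; the gains, influence radius, and 2D ground-plane coordinates are illustrative stand-ins for our tuned values:

```python
# Classic potential-field planning: the goal attracts, detected obstacles repel,
# and the path follows the combined force (negative gradient of the potential).
import numpy as np

K_ATT, K_REP, INFLUENCE_RADIUS = 1.0, 100.0, 2.0  # tuning constants (assumed)

def attractive_force(pos, goal):
    """F_att = -k_att * (pos - goal): pulls the wearer toward the goal."""
    return -K_ATT * (pos - goal)

def repulsive_force(pos, obstacles):
    """Sum of repulsive forces; each obstacle pushes back within its influence radius."""
    total = np.zeros(2)
    for obs in obstacles:
        diff = pos - obs
        dist = np.linalg.norm(diff)
        if 1e-6 < dist < INFLUENCE_RADIUS:
            # Repulsion grows sharply as the obstacle gets closer
            total += K_REP * (1.0 / dist - 1.0 / INFLUENCE_RADIUS) / dist**2 * (diff / dist)
    return total

def plan_path(start, goal, obstacles, step=0.1, max_steps=500):
    """Follow the combined force field; returns a list of 2D waypoints."""
    pos = np.asarray(start, float)
    path = [pos.copy()]
    for _ in range(max_steps):
        force = attractive_force(pos, goal) + repulsive_force(pos, obstacles)
        pos = pos + step * force / (np.linalg.norm(force) + 1e-9)
        path.append(pos.copy())
        if np.linalg.norm(pos - goal) < step:
            break
    return path

# Example: goal 5 m ahead, one obstacle 2 m ahead and slightly to the right
waypoints = plan_path(start=(0.0, 0.0), goal=np.array([0.0, 5.0]),
                      obstacles=[np.array([0.3, 2.0])])
```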

The final path is sent back to the PHP server and relayed to the Meta glasses over WhatsApp, where Meta AI reads the Gemini-generated summary aloud to the user.
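For illustration, returning the summary over WhatsApp looks roughly like this (shown in Python for consistency even though our relay is written in PHP; we assume the standard WhatsApp Cloud API messages endpoint, and the token and phone-number ID are placeholders):

```python
# Sketch of sending the Gemini-generated summary back to the wearer's WhatsApp
# thread via the WhatsApp Cloud API. Credentials come from placeholder env vars.
import os
import requests

def send_summary(recipient: str, summary: str) -> None:
    phone_number_id = os.environ["WHATSAPP_PHONE_NUMBER_ID"]  # placeholder
    token = os.environ["WHATSAPP_TOKEN"]                      # placeholder
    url = f"https://graph.facebook.com/v19.0/{phone_number_id}/messages"
    payload = {
        "messaging_product": "whatsapp",
        "to": recipient,               # wearer's WhatsApp number
        "type": "text",
        "text": {"body": summary},     # Meta AI reads this aloud on the glasses
    }
    resp = requests.post(url, json=payload,
                         headers={"Authorization": f"Bearer {token}"})
    resp.raise_for_status()
```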

The path and the original image are also sent to a web app built with Streamlit, where caretakers can monitor the wearer and users can review their route (the app is fully compatible with screen readers). We used Cartesia's Voice AI to convert the text descriptions into speech.
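A minimal sketch of the caretaker view, assuming the pipeline writes its outputs to local files (the file paths here are hypothetical):

```python
# Sketch of the Streamlit page: annotated path image, readable text summary,
# and Cartesia-generated audio of that summary.
from pathlib import Path
import streamlit as st

st.title("AudibleEyes Caretaker View")

st.image("latest_annotated_path.jpg", caption="Annotated navigation path")
summary = Path("latest_summary.txt").read_text()
st.write(summary)                # plain text, works with screen readers
st.audio("latest_summary.wav")   # Cartesia text-to-speech output
```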

Our project consolidates data from multiple sources into a single interface, offering real-time updates and comprehensive situational awareness: it acts as the wearer's eyes. This isn't just navigation; it's independence for millions.

Tools & Building Blocks

  • Machine Learning & Computer Vision:
    • LangChain
    • YOLOv5
    • Intel ISL MiDaS
    • Google Gemini
    • PyTorch
  • Custom Pipeline:
    • Flask
    • PHP
    • WhatsApp API
  • Frontend Integration:
    • Streamlit.io
  • Text-to-Speech:
    • Cartesia

Challenges We Ran Into

We faced significant challenges working with Ray-Ban Meta smart glasses, which lack a software development kit (SDK). With no way to register custom commands or retrieve inputs, we worked around the limitation by routing data through WhatsApp: a custom bot forwards each request to a PHP server, letting us integrate real-time data analysis into our navigation system.

Accomplishments We’re Proud Of

  • Pioneering Innovation: The first AI navigation system built specifically for Ray-Ban Meta glasses, which lack a development kit.
  • Custom PHP Server and WhatsApp Bot: Built a custom server and bot to bridge the gap between hardware and software, overcoming the lack of an SDK.
  • Meaningful Impact: Improving the quality of life of visually impaired individuals with a seamless, practical solution.
  • Advanced Multimodal Agents with Dynamical-System Analysis: Fine-tuned LangChain agents that incorporate physical dynamical analysis beyond basic object detection.
  • Seamless User Experience: Completely automated system requiring no manual effort from the user.

What’s Next for AudibleEyes

  • Incorporating GPS location tracking and outdoor navigation guidance.
  • Live-streamed video input feed with synced instructions.
  • Adding personal safety features like voice-activated alerts and automatic crime detection to contact emergency services.
