Inspiration

At our school's computer science club, we were building Pointr, an AR solution to help students navigate indoors. It worked well, but it relied heavily on visual input. We realized that for those who are visually impaired, our app was useless.

According to research by the Royal National Institute of Blind People (RNIB), 61% of visually impaired people say they cannot make all the journeys they want to, and over 40% leave their homes less often due to anxiety about unfamiliar environments. Instead of going out to experience the world, they are effectively stuck inside.

Some of us have relatives or know someone living in partial or complete darkness. To understand their reality, imagine walking into a room with your eyes closed. It is difficult, if not impossible, to navigate around furniture, let alone find doors, exit gates, or read crucial text on signs. And that is the reality for millions of people who have been through far worse than bumping into a wall or stubbing a toe. Every year there are accidents in which visually impaired individuals try to navigate busy streets in remote areas with poorly designed traffic lights, faded crosswalks, or silent electric vehicles that cannot be heard approaching. A cane can detect the ground immediately in front, but it cannot see the hanging sign at head level, the silent cyclist, or the emergency exit 5 meters away.

Visionr was born to bridge this gap. We wanted to provide a new way to perceive the world by turning unseen surroundings into an audible map.

What it does

Visionr is a computer vision engine that detects distinct objects in the user's field of view and translates them into intuitive 3D spatial audio. Designed for smart glasses or mobile devices, it identifies obstacles and critical environmental features (such as exit signs or sidewalk borders) using real-time object detection.

It then projects this information as subtle spatial sound cues anchored to their exact physical locations. By simulating realistic acoustic physics, Visionr allows the user to pinpoint and differentiate multiple targets simultaneously using standard headphones.
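To make "anchored to their exact physical locations" concrete: given where an object sits in the camera frame and the camera's field of view, simple pinhole geometry yields the direction (azimuth and elevation) that the audio engine needs. The sketch below is purely illustrative; the field-of-view value and function name are assumptions, not our production code.

```python
# Illustrative only: map a detection's pixel position to the (azimuth, elevation)
# direction a spatial-audio engine expects. The FOV and names are assumed values.
import math

def pixel_to_direction(cx, cy, frame_w, frame_h, h_fov_deg=60.0):
    """Convert a pixel centre to (azimuth, elevation) in degrees.

    Assumes a simple pinhole camera with the given horizontal field of view;
    the vertical field of view is derived from the aspect ratio.
    """
    # Normalise to [-1, 1] with (0, 0) at the image centre.
    nx = (cx / frame_w) * 2.0 - 1.0
    ny = 1.0 - (cy / frame_h) * 2.0            # flip so that +y means "up"

    half_h = math.radians(h_fov_deg) / 2.0
    half_v = half_h * (frame_h / frame_w)      # approximate vertical half-FOV

    azimuth = math.degrees(math.atan(nx * math.tan(half_h)))
    elevation = math.degrees(math.atan(ny * math.tan(half_v)))
    return azimuth, elevation

# Example: a tag detected right of centre and slightly above it.
print(pixel_to_direction(cx=960, cy=300, frame_w=1280, frame_h=720))
```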

How we built it

Given the hardware constraints, we developed a proof of concept that makes compromises on the vision side and focuses on the precision of the audio experience.

We used the AprilTag 36h11 family as a reliable placeholder for real-world objects, simulating the output of an object detection model. A lightweight Python OpenCV script detects the relative position and ID of these tags via webcam and streams the data over WebSockets. On the receiving end, our JavaScript client converts this data into 3D spatial audio in real time.
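In rough strokes, the tag-to-browser bridge looks like the sketch below. It assumes OpenCV ≥ 4.7 (where cv2.aruco ships with the AprilTag 36h11 dictionary) and the websockets package; the port number and JSON fields are simplified placeholders rather than our exact protocol.

```python
# Sketch of the detection-and-streaming loop: detect AprilTag 36h11 markers with
# OpenCV's cv2.aruco module (>= 4.7) and push their positions to a client as JSON
# over a WebSocket (websockets >= 10.1, where the handler takes only the connection).
import asyncio
import json

import cv2
import websockets

DICT = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_APRILTAG_36h11)
DETECTOR = cv2.aruco.ArucoDetector(DICT, cv2.aruco.DetectorParameters())

async def stream_tags(websocket):
    cap = cv2.VideoCapture(0)                     # default laptop webcam
    try:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            corners, ids, _ = DETECTOR.detectMarkers(gray)
            detections = []
            if ids is not None:
                h, w = gray.shape
                for tag_id, quad in zip(ids.flatten(), corners):
                    cx, cy = quad[0].mean(axis=0)     # tag centre in pixels
                    detections.append({
                        "id": int(tag_id),
                        # Normalised image coordinates; the JS client maps these
                        # to azimuth/elevation for the spatial audio engine.
                        "x": float(cx / w * 2 - 1),
                        "y": float(cy / h * 2 - 1),
                        # Apparent size as a rough stand-in for distance.
                        "size": float(cv2.contourArea(quad[0]) / (w * h)),
                    })
            await websocket.send(json.dumps(detections))
            await asyncio.sleep(1 / 30)               # ~30 updates per second
    finally:
        cap.release()

async def main():
    async with websockets.serve(stream_tags, "localhost", 8765):
        await asyncio.Future()                        # run until interrupted

if __name__ == "__main__":
    asyncio.run(main())
```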

We focused heavily on UX to prevent sensory overload. We developed a distinct acoustic language where different objects are represented by subtle sine wave variations, with two modes (the switching logic is sketched after the list):

  • Stationary Mode: When the user is standing still, objects "ping" like a metronome. This reduces noise and ensures the user isn't overwhelmed.
  • Active Mode: When the user moves faster than a set threshold, the sound transforms into a continuous tone. This allows for smoother tracking of objects during motion, preventing the audio from "glitching" in and out of the spatial field.
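The mode switch itself is simple threshold logic. Our actual client is written in JavaScript; the Python snippet below only illustrates the idea, and the threshold value, names, and the assumption that user movement is inferred from how fast a tracked point drifts across the frame are all placeholders.

```python
# Illustrative mode-switching logic; the values and names are assumptions,
# and the real implementation lives in our JavaScript audio client.
import time

class ModeSelector:
    """Pick the audio mode from how fast a tracked point moves across the frame."""

    def __init__(self, speed_threshold=0.15):
        # Threshold in normalised image units per second (assumed value).
        self.speed_threshold = speed_threshold
        self._last = None                 # (x, y, timestamp) of the previous update

    def update(self, x, y):
        """Return 'stationary' (metronome pings) or 'active' (continuous tone)."""
        now = time.monotonic()
        mode = "stationary"
        if self._last is not None:
            lx, ly, lt = self._last
            dt = max(now - lt, 1e-6)
            speed = ((x - lx) ** 2 + (y - ly) ** 2) ** 0.5 / dt
            if speed > self.speed_threshold:
                mode = "active"
        self._last = (x, y, now)
        return mode
```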

Challenges we ran into

  1. Camera Calibration & Latency Trade-offs: Our biggest hurdle was obtaining accurate depth data. We lacked the specific parameters (FOV, focal length) of our laptop webcam, making it difficult to calculate the precise relative position of the tags. We attempted to use a smartphone as an IP camera to get better resolution for the demo, but the network latency was too high. As a future fix, we plan to build a native application to access raw hardware camera data and implement a dedicated calibration step (see the calibration sketch after this list).

  2. Web Audio API Limitations: While the standard Web Audio API is powerful, its default PannerNode implements only basic physics. We struggled specifically with vertical distinction (elevation). It was hard to tell if a sound was coming from above or below. In the future, we would migrate to a dedicated spatial audio SDK (like Google Resonance or Steam Audio) that utilizes high-fidelity HRTF (Head-Related Transfer Functions) to solve the elevation ambiguity.
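For the calibration step we have in mind, the standard OpenCV chessboard workflow is enough to recover the intrinsics we were missing. The sketch below is a generic version of that workflow; the board size and image paths are assumptions.

```python
# Standard OpenCV chessboard calibration; board size and paths are assumptions.
import glob

import cv2
import numpy as np

BOARD = (9, 6)                     # inner corners per row and column of the board
objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2)   # unit squares

obj_points, img_points = [], []
gray = None
for path in glob.glob("calib/*.jpg"):          # photos of the printed board
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, BOARD)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# K holds the focal length and principal point; dist holds lens distortion.
err, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None
)
print("reprojection error:", err)
print("camera matrix:\n", K)
```

With the intrinsic matrix known, the detected tag corners could be passed to cv2.solvePnP to recover a metric 3D position, rather than the imprecise relative positions we had to settle for.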

Accomplishments that we're proud of

  • Designing an Acoustic Language: We are proud of our dynamic audio communication system that switches between the stationary and active modes to prevent sensory overload.
  • Real-Time Pipeline: We successfully built a low-latency feedback loop that pipes video data from Python to a web client via WebSockets in milliseconds, proving that a real-time audio guide is technically feasible.
  • Translating Visual to Audio: We successfully translated complex vision data into a non-visual medium that has the potential to help those who need it most.

What we learned

  • Audio is 3D, not just Stereo: We learned that true spatial navigation requires more than just panning left and right. We discovered the complexity of Head-Related Transfer Functions (HRTF) and realized how difficult it is to simulate "elevation" (up/down) without specialized physics engines.
  • The "Sensory Bandwidth" Limit: We learned that for visually impaired users, more information isn't always better. Our initial tests were too noisy, forcing us to strip away data and focus on a "less is more" approach to prevent cognitive overload.
  • Hardware Realities: We gained a crash course in computer vision optics. Specifically, you cannot get accurate 3D positioning without precise camera calibration (intrinsic parameters).
  • A World Without Light: The research from the RNIB opened our eyes to a silent crisis. We learned that the world of the visually impaired is often defined by the "known route." A staggering number of individuals cannot make the journeys they want, even in highly accessible parts of the city. We realized this doesn't just limit them physically; it strips away their confidence, turning the simple act of exploration into a source of fear.

What's next for Visionr

  • Implement a detailed text detection system that alerts the user to important signage. By simply focusing their attention on a target, the user can trigger a modern Text-to-Speech (TTS) engine to read the content aloud in real time (a first sketch of this flow follows the list).
  • Upgrade the spatial audio engine to simulate complex environmental acoustics. We want to move beyond basic positioning and account for factors like occlusion and reverberation to create a truly realistic soundscape.
  • Refine our acoustic language design to make object identification intuitive and effortless, ensuring the user receives critical context without experiencing cognitive overload.
  • Prioritize audio transparency and safety. Since visually impaired individuals often rely heavily on environmental sound cues, we must ensure our system augments their reality without blocking or masking critical ambient noise.
  • Integrate the software directly into emerging smart glasses hardware and dedicated spatial audio wearables for a completely hands-free experience.
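As a first step toward the sign-reading feature, the flow could be as simple as the sketch below. pytesseract (OCR) and pyttsx3 (offline TTS) are placeholder choices for illustration and are not part of the current prototype.

```python
# Minimal sketch of the planned sign-reading step; pytesseract and pyttsx3
# are assumed, illustrative choices rather than part of the current prototype.
import cv2
import pytesseract
import pyttsx3

def read_sign_aloud(frame_bgr):
    """OCR the current camera frame and speak any text that was found."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    text = pytesseract.image_to_string(gray).strip()
    if text:
        engine = pyttsx3.init()
        engine.say(text)
        engine.runAndWait()
    return text
```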