Inspiration
Vision is something many of us take for granted; the ability to move freely, connect, and experience the world shapes who we are and how we belong.
One of our teammates has a close family member who is visually impaired. He recalls childhood moments his relative missed, days spent watching others play cricket in the courtyard, because the cane, while vital, was both a tool and a reminder of limitation.
While adaptive technology has advanced rapidly, much of it remains stigmatizing and out of reach for those who need it most.
That’s why our team set out to create a discreet wristband powered by an intelligent vision pipeline: a device that feels the world for you. By translating depth and object recognition into gentle haptic feedback and natural voice cues, it restores confidence and pushes our communities to become more accessible as we grow together.
What it does
Our product is a haptic wristband designed to assist visually impaired individuals in navigating their surroundings safely and intuitively. Using an advanced AI vision pipeline running on a Raspberry Pi, the system detects and identifies nearby obstacles, determines their distance, and communicates this information through directional haptic feedback and natural voice cues. A buzz on the left indicates an obstacle to the left, while closer objects trigger stronger vibrations. For more complex environments, the onboard speaker provides spoken guidance, like “door ahead” or “step down.” The result is a discreet, wearable companion that transforms visual information into touch and sound, empowering users with greater confidence and independence.
The haptic wrist sleeve system is designed to receive haptic intensity signals from the cloud via AWS and deliver localized vibrotactile feedback to the wearer through three haptic buzzers. The system architecture consists of several key components that work together across multiple layers. At the cloud layer, AWS hosts the control logic and computes haptic patterns based on either real-time input or preconfigured data, enabling remote programmability and cloud-triggered haptic events. At the edge layer, a Raspberry Pi 4B acts as the local gateway, receiving cloud data and transmitting it to the wearable via Bluetooth Low Energy (BLE). The Raspberry Pi was chosen for its strong computing capabilities and native support for Python, which simplifies BLE prototyping and communication.
How we built it
- This project was complex because it spanned both hardware and software. To carry it out effectively, we split the work into three subcomponents: hardware and firmware, object segmentation, and the vision-language model.
Hardware/Firmware Spec: The wearable itself is powered by an Adafruit Feather nRF52840 Sense microcontroller, which serves as the BLE receiver and haptic controller. Its compact size, onboard sensors, and low-power design make it ideal for wearable applications. Because the DRV2605L haptic driver boards all share the same fixed I²C address (0x5A), a TCA9548A I²C multiplexer is included to manage communication with multiple driver boards. Each of the four DRV2605L drivers controls one haptic buzzer, defining the waveform and intensity of the vibration feedback. These drivers offer a library of prebuilt haptic effects and support PWM-based control for custom patterns. The system’s haptic buzzers are ERMs (eccentric rotating mass motors). Power is supplied by a 3.7V, 350mAh lithium-ion battery, chosen for its compact size and sufficient capacity to support multiple operating cycles.
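To make the wiring concrete, here is a minimal sketch of how the multiplexer and drivers can be addressed from CircuitPython on the Feather, assuming the Adafruit adafruit_tca9548a and adafruit_drv2605 libraries; the channel numbers and the effect ID are illustrative rather than our exact firmware.

```python
# Minimal sketch: addressing several DRV2605L drivers behind a TCA9548A I2C mux
# from CircuitPython on the Feather nRF52840 Sense. Assumes the Adafruit
# adafruit_tca9548a and adafruit_drv2605 libraries; channel numbers and the
# effect ID are illustrative.
import board
import adafruit_tca9548a
import adafruit_drv2605

i2c = board.I2C()                      # Feather's default I2C bus (SCL/SDA)
mux = adafruit_tca9548a.TCA9548A(i2c)  # multiplexer at its default address

# Each DRV2605L sits on its own mux channel, so the shared 0x5A address never collides.
CHANNELS = (0, 1, 2, 3)
drivers = [adafruit_drv2605.DRV2605(mux[ch]) for ch in CHANNELS]

def buzz(index, effect_id=47):
    """Play one of the prebuilt waveforms (e.g. effect 47, a strong buzz) on a single motor."""
    drv = drivers[index]
    drv.sequence[0] = adafruit_drv2605.Effect(effect_id)
    drv.play()
```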
In terms of system flow, haptic intensity data originates from the AWS cloud, is sent to the Raspberry Pi 4B, and then relayed over BLE to the Feather nRF52840 Sense. The microcontroller processes this data and communicates through the I²C multiplexer to the four DRV2605L drivers, each of which activates one of the four buzzers to produce the intended haptic response.
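As a minimal sketch of the relay step, the Raspberry Pi side could forward one frame of intensity values over BLE with the bleak library; the device address, characteristic UUID, and one-byte-per-motor packet format below are assumptions rather than our exact protocol.

```python
# Minimal sketch of the Pi -> wearable relay, assuming the bleak BLE library.
# DEVICE_ADDRESS and CHAR_UUID are placeholders for the Feather's address and the
# characteristic its firmware listens on; the 4-byte packet (one intensity per
# motor, 0-255) is an assumed wire format.
import asyncio
from bleak import BleakClient

DEVICE_ADDRESS = "XX:XX:XX:XX:XX:XX"                # placeholder
CHAR_UUID = "6e400002-b5a3-f393-e0a9-e50e24dcca9e"  # Nordic UART RX, if the firmware uses it

async def send_intensities(intensities):
    """Forward one frame of haptic intensities (values 0-255) to the wearable."""
    packet = bytes(intensities)
    async with BleakClient(DEVICE_ADDRESS) as client:
        await client.write_gatt_char(CHAR_UUID, packet)

# Example: buzz the leftmost motor hard, leave the others off.
asyncio.run(send_intensities([255, 0, 0, 0]))
```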
The firmware architecture on the Feather nRF52840 Sense is organized into three main modules. The BLE receiver module listens for incoming data packets from the Raspberry Pi and parses them into an array of four motor intensity values, each ranging from 0 to 255 (or 0–100%). The multiplexer driver module manages the selection of I²C channels on the TCA9548A to ensure proper communication with each haptic driver. Finally, the haptic driver control module configures each DRV2605L with the desired waveform or PWM settings, adjusts the strength according to the received intensity values, and triggers the execution of each vibration effect. Together, these components and software modules enable the haptic wrist sleeve to deliver precise, cloud-controlled tactile feedback in a compact and efficient wearable system.
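A condensed sketch of the BLE receiver module is shown below, assuming CircuitPython's adafruit_ble library and the Nordic UART service as the transport; set_motor() is a hypothetical stand-in for the haptic driver control module described above.

```python
# Condensed sketch of the BLE receiver module on the Feather, assuming CircuitPython's
# adafruit_ble library with the Nordic UART service as transport. set_motor() stands in
# for the haptic driver control module (see the driver sketch above) and is hypothetical.
from adafruit_ble import BLERadio
from adafruit_ble.advertising.standard import ProvideServicesAdvertisement
from adafruit_ble.services.nordic import UARTService

ble = BLERadio()
uart = UARTService()
advertisement = ProvideServicesAdvertisement(uart)

def set_motor(index, intensity):
    """Placeholder: hand a 0-255 intensity to the haptic driver control module."""
    pass

while True:
    ble.start_advertising(advertisement)
    while not ble.connected:
        pass
    ble.stop_advertising()
    while ble.connected:
        if uart.in_waiting >= 4:            # one byte per motor
            packet = uart.read(4)
            for i, intensity in enumerate(packet):
                set_motor(i, intensity)     # 0 = off, 255 = strongest
```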
Software Spec
The vision system is structured into three main components: the vision pipeline, the DINO/Depth modules (low-level models), and the Visual Language Model (high-level model). At the core is a general pipeline class, visionPipeline, which orchestrates data flow from the Raspberry Pi client to AWS and back, ensuring consistency and modularity across all vision components. The pipeline begins with the DINO Module, which processes RGB input and segments the scene into bounding boxes based on predefined object labels such as “table,” “chair,” or “person.” These bounding boxes are then converted into a binary mask corresponding to the image dimensions, identifying which pixels belong to detected objects. This mask is passed into the Depth Module, where it is overlaid onto the camera’s depth map to estimate object proximity. The depth map is divided into three vertical regions, each corresponding to one of the haptic buzzers on the wearable device. Average depth values in each region are calculated and scaled based on object distance, generating a three-element output array representing buzzer intensities. This data is transmitted to the Raspberry Pi, which triggers the corresponding haptic feedback. Together, this pipeline enables real-time, context-aware translation of visual input into tactile cues.
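The region-scaling step of the Depth Module can be sketched roughly as follows, assuming a metric depth map, a binary object mask of the same shape, and a simple near/far range for scaling; the exact constants in our pipeline may differ.

```python
# Minimal sketch of the Depth Module's region scaling, assuming a depth map in metres
# and a binary object mask from the DINO module. The NEAR/FAR range and three vertical
# regions (one per buzzer) follow the description above; constants are illustrative.
import numpy as np

NEAR, FAR = 0.3, 3.0   # assumed distance range (metres) mapped to full/zero intensity

def buzzer_intensities(depth_map: np.ndarray, object_mask: np.ndarray) -> np.ndarray:
    """Return one 0-255 intensity per vertical region (left, centre, right)."""
    h, w = depth_map.shape
    intensities = np.zeros(3, dtype=np.uint8)
    for i, region in enumerate(np.array_split(np.arange(w), 3)):
        depths = depth_map[:, region][object_mask[:, region] > 0]
        if depths.size == 0:
            continue                        # no detected object in this region
        closeness = np.clip((FAR - float(depths.mean())) / (FAR - NEAR), 0.0, 1.0)
        intensities[i] = int(round(closeness * 255))
    return intensities
```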
Another part of the software stack is the Vision Language Model (VLM), which runs on AWS concurrently with the DINO and Depth modules so that it does not affect the latency of the low-level models. The VLM stage contains two modules: the Gemini module and the ElevenLabs module. At a 1/2 Hz rate, the most recent camera frame from the Raspberry Pi is passed to the Gemini API with a custom system prompt; the model returns an “Alert” if it sees anything important for the user to know while navigating, and “NA” otherwise. If Gemini returns an “Alert,” the alert description is sent to the ElevenLabs module, where text-to-speech produces an audio array that is played on the speaker connected to the Pi.
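A rough sketch of that alert loop is below, assuming the google-generativeai Python client and a paraphrased system prompt; speak() and latest_frame() are hypothetical stand-ins for the ElevenLabs module and the camera feed, and the model name is illustrative.

```python
# Rough sketch of the VLM alert loop, assuming the google-generativeai client and a
# PIL frame grabbed from the camera. The system prompt is paraphrased; speak() and
# latest_frame() are placeholders for the ElevenLabs module and the camera feed.
import time
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="...")  # credential placeholder
model = genai.GenerativeModel(
    "gemini-1.5-flash",  # illustrative model name
    system_instruction='If the frame shows something a visually impaired pedestrian '
                       'must know (steps, doors, obstacles), reply "Alert: <short '
                       'description>". Otherwise reply "NA".',
)

def speak(text):
    """Placeholder for the ElevenLabs module: synthesize speech and play it on the Pi's speaker."""
    pass

def latest_frame() -> Image.Image:
    """Placeholder: return the most recent camera frame."""
    ...

while True:
    response = model.generate_content([latest_frame()])
    if response.text.strip().startswith("Alert"):
        speak(response.text.strip())
    time.sleep(2)  # 1/2 Hz polling rate
```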
Challenges we ran into
A key technical focus of this project was reducing system latency to ensure real-time responsiveness between visual input and haptic or auditory feedback. We conducted detailed latency profiling across the entire pipeline to identify bottlenecks, then optimized them through targeted improvements: compressing camera input images to reduce transmission time, using a least-recently-used (LRU) cache to avoid redundant computation, vectorizing the depth-filtering algorithms for faster processing, and experimenting with lightweight segmentation models to balance accuracy with speed. Additionally, we developed custom firmware for the first time, integrating Bluetooth Low Energy (BLE) communication and an I²C multiplexer (MUX) to enable seamless, low-latency interaction between the Raspberry Pi, microcontroller, and multiple haptic actuators. Together, these efforts significantly improved system efficiency and responsiveness, resulting in a more fluid and natural user experience.
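Two of those optimizations in miniature, with illustrative names and constants: JPEG-compressing frames before they leave the Pi, and memoizing a repeated pure computation with an LRU cache.

```python
# Illustrative sketches of two latency optimizations: compressing frames before
# transmission, and caching a repeated (quantized) depth-to-intensity computation.
# Function names, the JPEG quality, and the cache size are illustrative.
import io
from functools import lru_cache
from PIL import Image

def compress_frame(frame: Image.Image, quality: int = 70) -> bytes:
    """Shrink an RGB frame to a JPEG byte string before sending it to AWS."""
    buf = io.BytesIO()
    frame.save(buf, format="JPEG", quality=quality)
    return buf.getvalue()

@lru_cache(maxsize=128)
def scaled_intensity(mean_depth_cm: int) -> int:
    """Cache the depth-to-intensity mapping for depth values rounded to the nearest cm."""
    near, far = 30, 300  # centimetres
    closeness = max(0.0, min(1.0, (far - mean_depth_cm) / (far - near)))
    return int(round(closeness * 255))
```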
Accomplishments that we're proud of
One of our proudest accomplishments was finding the right balance between latency optimization and system robustness, ensuring the device remained both fast and reliable in challenging real-world conditions. Through extensive testing and iteration, we developed a two-phase perception pipeline that combines a Vision-Language Model (VLM) for high-level scene understanding with a Segmentation and Depth model for precise spatial awareness. By fine-tuning and restructuring parts of the stack, such as removing the SAM model to reduce overhead, we achieved significant latency improvements without compromising detection accuracy or stability. This balance between efficiency and dependability marked a major milestone in making the system truly practical for real-time use.
What we learned
Throughout this project, we learned firsthand the complexity of integrating hardware and software into a cohesive, real-time system. Every component, whether in firmware, communication protocols, or model inference, introduced its own challenges and required careful coordination to maintain system stability. One of the biggest takeaways was understanding how deeply latency impacts user experience; even small, seemingly insignificant design choices could introduce noticeable delays. This taught us the importance of being intentional and data-driven in every decision, from model selection to hardware configuration, to achieve seamless and responsive performance.
