AudioNova

Inspiration

Our motivation for AudioNova came from the need for a local-first, high-performance speech solution that guarantees privacy and low latency. Many existing text-to-speech (TTS) and voice-changing tools rely heavily on cloud services, raising concerns about data ownership, speed, and cost. By optimizing Whisper models through Qualcomm AI Hub for Snapdragon X processors, we saw an opportunity to build a versatile, on-device platform that can help individuals with speech impairments, content creators, and businesses alike.

What it does

AudioNova is a Windows-based application that provides:

Voice Generation:
- Generate natural-sounding voices from text.
- Users can pick from five built-in voices or clone a custom voice using Qualcomm Whisper for transcription and Vall-E-X for generation.
Voice Changing:
- Transform any uploaded or recorded audio into a new voice.
- Qualcomm Whisper transcribes the file locally, then Vall-E-X regenerates the speech in a different voice.

All processing occurs on-device, leveraging Snapdragon X hardware optimizations for fast and secure computation.
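
The two modes described above share one synthesis step, and Voice Changing simply adds a transcription step in front of it. The following is a minimal sketch of that flow, not the project's actual code: `VoicePipeline`, `fake_transcribe`, and `fake_synthesize` are hypothetical names, and the stand-in functions only take the place of the Whisper and Vall-E-X models so the example runs without any model weights.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical signatures; the real Whisper and Vall-E-X wrappers are not shown in this writeup.
Transcriber = Callable[[bytes], str]        # audio bytes -> transcript
Synthesizer = Callable[[str, str], bytes]   # (text, voice id) -> audio bytes


@dataclass
class VoicePipeline:
    transcribe: Transcriber
    synthesize: Synthesizer

    def generate(self, text: str, voice: str) -> bytes:
        """Voice Generation: text in, speech out."""
        return self.synthesize(text, voice)

    def change_voice(self, audio: bytes, voice: str) -> bytes:
        """Voice Changing: transcribe locally, then re-synthesize in the target voice."""
        text = self.transcribe(audio)
        return self.synthesize(text, voice)


# Stand-ins for the real models (assumptions for illustration only).
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_synthesize(text: str, voice: str) -> bytes:
    return f"[{voice}] {text}".encode("utf-8")


pipeline = VoicePipeline(fake_transcribe, fake_synthesize)
result = pipeline.change_voice(b"hello world", "voice_2")
```

Passing the models in as callables keeps the pipeline independent of any particular model runtime, so an optimized Whisper build could presumably be swapped in without touching the flow itself.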

How we built it

Technology Stack:
- Backend: Python-based FastAPI pipeline integrating the Qualcomm Whisper model for speech-to-text and Vall-E-X for text-to-speech and voice cloning.
- Qualcomm AI Hub: Used to optimize and quantize the Whisper model, exploiting hardware acceleration on Snapdragon X.
- Frontend: Cross-platform React UI for Windows, featuring an intuitive dashboard with two main modes: Voice Generation and Voice Changing.
Key Implementation Details:
- Local-First Architecture: All transcription and synthesis happen on the user’s machine, keeping data secure.
- Model Optimization: Integrated Qualcomm’s specialized kernels for matrix multiplication and pruned the model for higher inference speed and reduced memory footprint.
- Simple Installation: With a single click, users can have the application running in a few minutes.
- Error Handling & Logs: Robust exception management and logging for each step of the pipeline ensure reliability and transparency.
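
One way to get per-step exception handling and logging like that described in the last bullet is to wrap each pipeline step in a small helper. This is a sketch under assumptions rather than the project's implementation: `run_step`, `PipelineStepError`, and the logger name `audionova.pipeline` are all hypothetical, and only the Python standard library is used.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("audionova.pipeline")  # logger name is an assumption


class PipelineStepError(RuntimeError):
    """Raised when a named pipeline step fails (hypothetical exception type)."""

    def __init__(self, step: str, cause: Exception):
        super().__init__(f"step '{step}' failed: {cause}")
        self.step = step


def run_step(step: str, fn: Callable[..., T], *args, **kwargs) -> T:
    """Run one pipeline step, logging its start, success, or failure."""
    logger.info("starting %s", step)
    try:
        result = fn(*args, **kwargs)
    except Exception as exc:
        # logger.exception records the traceback alongside the message.
        logger.exception("%s failed", step)
        raise PipelineStepError(step, exc) from exc
    logger.info("finished %s", step)
    return result


# Usage with a stand-in transcription function.
text = run_step("transcribe", lambda audio: audio.decode("utf-8"), b"hello")
```

Re-raising as a single exception type that carries the step name would let an API layer report which stage failed without leaking model internals.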

Challenges we ran into

Hardware Integration:
Adapting Whisper to fully leverage Snapdragon X’s DSP and GPU resources required in-depth familiarity with Qualcomm AI Hub’s optimization pipelines.
Model Compression Trade-offs:
Balancing model size reduction (quantization and pruning) against transcription accuracy was a delicate process.
User Interface Simplicity:
We had to design an interface that remains user-friendly yet is powerful enough to handle advanced operations like custom voice cloning.
Real-Time Performance:
Achieving near real-time voice transformation for streaming or live content creation required iterative testing and benchmarking.
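
A common way to quantify "near real-time" in benchmarking is the real-time factor (RTF): processing time divided by the duration of the audio, where values below 1.0 mean the pipeline keeps up with live input. The sketch below shows one way such a measurement could be taken. The function name `real_time_factor` and its parameters are assumptions, and the lambda is only a stand-in for the real voice transformation.

```python
import statistics
import time
from typing import Callable


def real_time_factor(
    process: Callable[[bytes], object],
    audio: bytes,
    audio_seconds: float,
    runs: int = 5,
) -> float:
    """Return median processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        process(audio)
        timings.append(time.perf_counter() - start)
    # The median is less sensitive than the mean to one slow warm-up run.
    return statistics.median(timings) / audio_seconds


# Stand-in workload: reversing one second of 16 kHz, 8-bit silence.
rtf = real_time_factor(lambda a: a[::-1], b"\x00" * 16000, audio_seconds=1.0)
```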

Accomplishments that we're proud of

Local, Privacy-Preserving TTS:
Achieved on-device voice generation and transformation that match or exceed the speed and quality of many cloud-based solutions.
Enhanced Performance:
By leveraging Qualcomm AI Hub optimizations, we saw up to a 2x speedup in transcription and a 35% reduction in power usage on Snapdragon X devices.
Inclusive Design:
The ability to clone custom voices helps users with speech challenges or content creators who need a distinct voice identity.
Scalability & Extensibility:
Our modular codebase allows adding more voices, languages, or advanced audio effects without overhauling the entire system.

What we learned

Model Optimization Techniques:
Proper use of quantization, pruning, and hardware-specific kernels can dramatically improve on-device performance.
Seamless UX Design:
Balancing advanced AI features with a clean interface is vital for broad user adoption.
Importance of Edge Computing:
Our project underscores how local solutions can yield faster response times, better privacy, and lower costs than cloud-based alternatives.
Collaboration with Qualcomm AI Hub:
Integrating hardware-accelerated libraries required close attention to documentation and iterative testing to harness the full potential of Snapdragon X.
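
To illustrate the quantization trade-off mentioned in this section, here is a small, self-contained sketch of symmetric int8 quantization in plain Python. It is not the method Qualcomm AI Hub applies to Whisper. It is a generic textbook scheme with made-up weights, meant only to show where the accuracy loss comes from: each weight is rounded to one of 255 levels, so the round-trip error is bounded by half the scale.

```python
from __future__ import annotations


def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization to the int8 range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Map int8 values back to approximate floating-point weights."""
    return [v * scale for v in quantized]


# Illustrative weights only; not taken from any real model.
weights = [0.02, -0.75, 0.31, 1.20, -0.004]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Worst-case round-trip error, bounded by scale / 2 for this scheme.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Small weights suffer the largest relative error in this scheme, which is one reason transcription accuracy has to be re-checked after compression.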

What's next for AudioNova

Advanced Voice Controls:
Introduce features like emotional intonation, prosody adjustments, and custom pitch/speed to enhance expressiveness.
Mobile Platform Support:
Port AudioNova to other Snapdragon-powered devices, enabling offline voice solutions on smartphones or tablets.
Integration with Assistive Technology:
Partner with healthcare organizations to streamline voice tools for individuals with communication challenges.