
Inspiration

Our motivation for AudioNova stemmed from the need for a local-first, high-performance speech solution that guarantees privacy and low latency. Many existing text-to-speech (TTS) and voice-changing tools rely heavily on cloud services, raising concerns about data ownership, speed, and cost. Optimizing Whisper models through Qualcomm AI Hub for Snapdragon X processors gave us an opportunity to build a versatile, on-device platform that can help individuals with speech impairments, content creators, and businesses alike.

What it does

AudioNova is a Windows-based application that provides:

  1. Voice Generation:

    • Generate natural-sounding voices from text.
    • Users can pick from five built-in voices or clone a custom voice using Qualcomm Whisper for transcription and Vall-E-X for generation.
  2. Voice Changing:

    • Transform any uploaded or recorded audio into a new voice.
    • Qualcomm Whisper transcribes the file locally, then Vall-E-X regenerates the speech in a different voice.

All processing occurs on-device, leveraging Snapdragon X hardware optimizations for fast and secure computation.
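
The two-stage flow behind Voice Changing (transcribe, then regenerate) can be sketched as below. This is a minimal illustration, not AudioNova's actual code: the `VoiceChanger` class, the `Transcriber` and `Synthesizer` aliases, and the two `fake_*` stand-ins are hypothetical names, and the real Whisper and Vall-E-X calls would replace the stand-ins.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical aliases: the real Whisper and Vall-E-X wrappers are not shown
# in this write-up, so each stage is modeled as a plain callable.
Transcriber = Callable[[bytes], str]        # audio bytes -> transcript
Synthesizer = Callable[[str, str], bytes]   # (text, voice id) -> audio bytes


@dataclass
class VoiceChanger:
    transcribe: Transcriber
    synthesize: Synthesizer

    def change_voice(self, audio: bytes, target_voice: str) -> bytes:
        """Transcribe the input, then re-synthesize the text in target_voice."""
        if not audio:
            raise ValueError("audio is empty")
        text = self.transcribe(audio)
        return self.synthesize(text, target_voice)


# Stand-in stages so the sketch runs without any model installed.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_synthesize(text: str, voice: str) -> bytes:
    return f"[{voice}] {text}".encode("utf-8")


changer = VoiceChanger(fake_transcribe, fake_synthesize)
result = changer.change_voice(b"hello world", "voice_3")
print(result)  # b'[voice_3] hello world'
```

Keeping each stage behind a callable means either model could be swapped without touching the rest of the flow.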

How we built it

  • Technology Stack:

    • Backend: Python-based FastAPI pipeline integrating the Qualcomm Whisper model for speech-to-text and Vall-E-X for text-to-speech and voice cloning.
    • Qualcomm AI Hub: Used to optimize and quantize the Whisper model, exploiting hardware acceleration on Snapdragon X.
    • Frontend: Cross-platform React UI for Windows, featuring an intuitive dashboard with two main modes: Voice Generation and Voice Changing.
  • Key Implementation Details:

    • Local-First Architecture: All transcription and synthesis happen on the user’s machine, keeping data secure.
    • Model Optimization: Integrated Qualcomm’s specialized kernels for matrix multiplication and pruned the model for higher inference speed and reduced memory footprint.
    • Simple Installation: A single click gets the application installed and running within a few minutes.
    • Error Handling & Logs: Robust exception management and logging for each step of the pipeline ensure reliability and transparency.
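
One way per-step error handling and logging could look is sketched below, using only the Python standard library. The logger name, `PipelineError`, `pipeline_step`, and `run_pipeline` are hypothetical names chosen for illustration, and the two step bodies are stand-ins for the real speech-to-text and text-to-speech calls.

```python
import logging
from contextlib import contextmanager

logger = logging.getLogger("audionova.pipeline")  # hypothetical logger name


class PipelineError(RuntimeError):
    """Raised when a pipeline step fails; records which step it was."""

    def __init__(self, step: str, cause: Exception):
        super().__init__(f"{step} failed: {cause}")
        self.step = step


@contextmanager
def pipeline_step(step: str):
    """Log the start and end of a step and wrap any failure in PipelineError."""
    logger.info("starting %s", step)
    try:
        yield
    except Exception as exc:
        logger.exception("%s failed", step)
        raise PipelineError(step, exc) from exc
    logger.info("finished %s", step)


def run_pipeline(audio: bytes) -> str:
    with pipeline_step("transcription"):
        if not audio:
            raise ValueError("no audio provided")
        text = audio.decode("utf-8")  # stand-in for the speech-to-text call
    with pipeline_step("synthesis"):
        return text.upper()  # stand-in for the text-to-speech call
```

A caller can then catch `PipelineError` and read `.step` to tell the user which stage failed, while the original exception stays available as `__cause__` for the logs.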

Challenges we ran into

  1. Hardware Integration:
    Adapting Whisper to fully leverage Snapdragon X’s DSP and GPU resources required in-depth familiarity with Qualcomm AI Hub’s optimization pipelines.

  2. Model Compression Trade-offs:
    Balancing model size reduction (quantization and pruning) with maintaining high accuracy in transcription was a delicate process.

  3. User Interface Simplicity:
    We had to design an interface that remains user-friendly yet is powerful enough to handle advanced operations like custom voice cloning.

  4. Real-Time Performance:
    Achieving near real-time voice transformation for streaming or live content creation required iterative testing and benchmarking.
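
For the real-time performance challenge, a common yardstick is the real-time factor (RTF): processing time divided by audio duration, where values below 1.0 mean the pipeline runs faster than the audio plays. The helper below is a minimal sketch of such a benchmark, not the harness we actually used; `real_time_factor` and `benchmark` are hypothetical names, and `process` stands in for whatever pipeline stage is being measured.

```python
import statistics
import time
from typing import Callable


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration (below 1.0 is faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def benchmark(process: Callable[[bytes], object], audio: bytes,
              audio_seconds: float, runs: int = 5) -> float:
    """Return the RTF based on the median wall-clock time over several runs."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        process(audio)
        timings.append(time.perf_counter() - start)
    return real_time_factor(statistics.median(timings), audio_seconds)
```

Taking the median over several runs reduces the influence of a slow first run, for example one that includes model loading.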

Accomplishments that we're proud of

  • Local, Privacy-Preserving TTS:
    Achieved on-device voice generation and transformation that match or exceed the speed and quality of many cloud-based solutions.

  • Enhanced Performance:
    By leveraging Qualcomm AI Hub optimizations, we saw up to a 2x speedup in transcription and a 35% reduction in power usage on Snapdragon X devices.

  • Inclusive Design:
    The ability to clone custom voices helps users with speech challenges or content creators who need a distinct voice identity.

  • Scalability & Extensibility:
    Our modular codebase allows adding more voices, languages, or advanced audio effects without overhauling the entire system.

What we learned

  • Model Optimization Techniques:
    Proper use of quantization, pruning, and hardware-specific kernels can dramatically improve on-device performance.

  • Seamless UX Design:
    Balancing advanced AI features with a clean interface is vital for broad user adoption.

  • Importance of Edge Computing:
    Our project underscores how local solutions can yield faster response times, better privacy, and lower costs than cloud-based alternatives.

  • Collaboration with Qualcomm AI Hub:
    Integrating hardware-accelerated libraries required close attention to documentation and iterative testing to harness the full potential of Snapdragon X.
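
To illustrate why quantization trades accuracy for size, here is a minimal sketch of symmetric int8 quantization in plain Python. It is a textbook scheme shown for intuition only and is not the method Qualcomm AI Hub applies; the function names and the sample weights are made up for the example, and pruning is not covered.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the quantized integers."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.003, 1.0]  # illustrative values, not real model weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per weight is at most half a quantization step (scale / 2).
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q)  # [50, -127, 0, 100]
```

Each weight now fits in one byte instead of four, but the small weight 0.003 rounds to zero, which is the kind of accuracy loss that has to be weighed against the size and speed gains.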

What's next for AudioNova

  • Advanced Voice Controls:
    Introduce features like emotional intonation, prosody adjustments, and custom pitch/speed to enhance expressiveness.

  • Mobile Platform Support:
    Port AudioNova to other Snapdragon-powered devices, enabling offline voice solutions on smartphones or tablets.

  • Integration with Assistive Technology:
    Partner with healthcare organizations to streamline voice tools for individuals with communication challenges.
