Inspiration

I was inspired by a simple yet profound idea: the human voice is one of the most powerful and data-rich biomarkers available. It holds subtle clues not just about what a person is saying, but about their underlying emotional and cognitive health. I envisioned a tool that could non-invasively analyze these vocal patterns to help in the early-stage detection of neurological conditions, monitor mental wellness, and provide a deeper understanding of human communication. This led me to create CogniSpeech, an AI-powered platform designed to unlock the hidden health insights in the human voice.

What it does

CogniSpeech is a production-ready, AI-driven platform that provides deep insights into a person's cognitive and emotional state by analyzing their voice. It goes beyond simple transcription by performing a multi-layered analysis of both how a person speaks and what they say.

  • Comprehensive Vocal Analysis: Extracts over 20 distinct vocal biomarkers, including pitch, jitter, shimmer, and formants, using industry-standard tools for clinical-grade accuracy.
  • Multi-Layered Linguistic Analysis: Employs a suite of five specialized AI models to perform speech-to-text, sentiment analysis, emotion classification, and dialogue act recognition.
  • Advanced Analytics: Provides trend analysis and weekly summaries, offering longitudinal insights into a user's vocal and emotional patterns.

How I built it

CogniSpeech is built on a sophisticated, multi-layered architecture designed for accuracy and scalability.

The Vocal Analyzer

My vocal analysis service uses a "best of both worlds" approach to extract over 20 distinct acoustic features:

  • Layer 1: praat-parselmouth: For clinical-grade precision, I use Praat's algorithms to measure core metrics like pitch, jitter, shimmer, HNR (Harmonics-to-Noise Ratio), and formants.
  • Layer 2: librosa: For a broad spectral profile, I use Librosa to extract features like MFCCs, spectral contrast, and chroma.
  • Layer 3: Custom Rhythm Analysis: I developed custom logic to calculate speech and articulation rates, providing insights into the temporal dynamics of speech.
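For illustration, here is a minimal sketch of how the first two layers could be combined in a single helper. The function name, the exact Praat parameters, and the small feature set shown are assumptions for the example; the actual analyzer extracts a wider set of features.

```python
import librosa
import numpy as np
import parselmouth
from parselmouth.praat import call


def extract_vocal_features(audio_path: str) -> dict:
    """Illustrative extraction of a handful of the 20+ acoustic features."""
    # Layer 1: Praat (via parselmouth) for clinical-grade voice metrics
    snd = parselmouth.Sound(audio_path)
    pitch = snd.to_pitch()
    f0 = pitch.selected_array["frequency"]
    f0 = f0[f0 > 0]  # keep voiced frames only

    point_process = call(snd, "To PointProcess (periodic, cc)", 75, 500)
    jitter = call(point_process, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
    shimmer = call([snd, point_process], "Get shimmer (local)",
                   0, 0, 0.0001, 0.02, 1.3, 1.6)
    hnr = call(snd.to_harmonicity_cc(), "Get mean", 0, 0)

    # Layer 2: librosa for the broad spectral profile
    y, sr = librosa.load(audio_path, sr=None)
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)

    return {
        "mean_pitch_hz": float(np.mean(f0)) if f0.size else 0.0,
        "jitter_local": jitter,
        "shimmer_local": shimmer,
        "hnr_db": hnr,
        "mfcc_means": mfccs.mean(axis=1).tolist(),
        "spectral_contrast_means": contrast.mean(axis=1).tolist(),
        "chroma_means": chroma.mean(axis=1).tolist(),
    }
```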

The Linguistic Analyzer

My linguistic analysis service is a five-layer pipeline that deconstructs the content of the speech:

  1. Speech-to-Text: I use OpenAI's Whisper model for highly accurate transcription.
  2. Sentiment Analysis: A RoBERTa-based model provides a high-level classification of the text as positive, negative, or neutral.
  3. Emotion Classification: A DistilRoBERTa model identifies a range of nuanced emotions, including joy, sadness, anger, and fear.
  4. Dialogue Act Recognition: A BART model analyzes the conversational intent of each sentence, determining if it's a statement, a question, etc.
  5. Summarization: Another BART model generates a concise, AI-powered summary of the entire conversation.
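As a rough sketch of how such a five-stage pipeline can be wired together with Hugging Face pipelines (the checkpoint names and candidate dialogue-act labels below are plausible placeholders, not necessarily the exact models CogniSpeech uses):

```python
from transformers import pipeline

# Placeholder checkpoints for each stage; the production models may differ.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
sentiment = pipeline("text-classification",
                     model="cardiffnlp/twitter-roberta-base-sentiment-latest")
emotion = pipeline("text-classification",
                   model="j-hartmann/emotion-english-distilroberta-base")
dialogue_act = pipeline("zero-shot-classification",
                        model="facebook/bart-large-mnli")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")


def analyze_linguistics(audio_path: str) -> dict:
    transcript = asr(audio_path)["text"]
    sentences = [s.strip() for s in transcript.split(".") if s.strip()]
    return {
        "transcript": transcript,
        # Character-level truncation keeps the inputs within model limits.
        "sentiment": sentiment(transcript[:512]),
        "emotions": emotion(transcript[:512]),
        "dialogue_acts": [
            dialogue_act(s, candidate_labels=["statement", "question",
                                              "command", "exclamation"])
            for s in sentences
        ],
        "summary": summarizer(transcript[:2000], max_length=60,
                              min_length=15)[0]["summary_text"],
    }
```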

All of this is served through a FastAPI backend, with data stored in a SQLite database managed by SQLAlchemy and Alembic for migrations. The entire application is containerized with Docker for easy and reliable deployment.
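A simplified sketch of how the upload endpoint and database wiring might look. The endpoint path, file handling, and the run_full_analysis helper are illustrative; the real project also defines ORM models and Alembic migrations.

```python
from fastapi import BackgroundTasks, FastAPI, UploadFile
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# SQLite database managed through SQLAlchemy; schema changes handled by Alembic.
engine = create_engine("sqlite:///./cognispeech.db",
                       connect_args={"check_same_thread": False})
SessionLocal = sessionmaker(bind=engine, autoflush=False)

app = FastAPI(title="CogniSpeech API")


def run_full_analysis(audio_path: str) -> None:
    """Placeholder for the vocal + linguistic pipelines described above."""
    ...


@app.post("/analyses")
async def create_analysis(file: UploadFile, background_tasks: BackgroundTasks):
    # Persist the upload to disk, then queue the heavy analysis so the
    # request can return immediately.
    audio_path = f"/tmp/{file.filename}"
    with open(audio_path, "wb") as out:
        out.write(await file.read())
    background_tasks.add_task(run_full_analysis, audio_path)
    return {"status": "queued", "filename": file.filename}
```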

Challenges I ran into

Building a system this complex as a solo developer was a formidable challenge, and the journey was filled with technical hurdles that required deep problem-solving and perseverance.

  • The asyncio.CancelledError Server Crash: My biggest hurdle was a recurring server crash. I discovered that loading all the AI models at startup was blocking the main server thread, causing it to time out. To solve this, I re-architected the services to use a "lazy loading" approach and moved the entire analysis pipeline into a separate process using concurrent.futures.ProcessPoolExecutor. This isolated the heavy AI work from the web server and made the application far more stable (see the sketch after this list).
  • Dependency Hell: I ran into numerous issues with library mismatches and incompatibilities, especially with parselmouth and its system-level audio dependencies. This taught me the importance of meticulous environment management, leading me to create a clean, reproducible setup inside a Dockerfile that works every time.
  • Inaccurate AI Models: Initially, the sentiment analysis was returning inaccurate results. I realized that a single, general-purpose model wasn't enough to capture the nuances of spoken language. This led me to develop the multi-layered linguistic analysis pipeline, which combines five specialized models to achieve a much higher degree of accuracy and insight.
  • Frontend-Backend Integration: Connecting the frontend was a major challenge, with CORS errors, configuration mismatches, and path issues. This required a deep dive into both frontend and backend configurations to ensure seamless communication between the two.
  • Handling Audio: I faced a wide range of audio-related problems, from format errors and uploading issues to size limitations. I had to build a robust validation and processing pipeline that could handle various audio formats and sizes gracefully, ensuring a smooth user experience.
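To illustrate the fix described in the first bullet, here is a rough sketch of the pattern: lazy model loading inside a worker process plus loop.run_in_executor. The function and endpoint names are illustrative, not the project's actual module layout.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor
from functools import lru_cache

from fastapi import FastAPI

app = FastAPI()
executor = ProcessPoolExecutor(max_workers=1)


@lru_cache(maxsize=1)
def get_models() -> dict:
    """Load the heavy AI models on first use, inside the worker process,
    instead of at web-server startup."""
    return {}  # placeholder for the loaded model objects


def run_analysis(audio_path: str) -> dict:
    models = get_models()  # lazily loaded once per worker process
    # ... run the vocal and linguistic pipelines with `models` here ...
    return {"status": "complete", "path": audio_path}


@app.post("/analyze")
async def analyze(audio_path: str):
    loop = asyncio.get_running_loop()
    # The CPU-heavy work runs in a separate process, so the event loop
    # (and therefore the server) stays responsive.
    return await loop.run_in_executor(executor, run_analysis, audio_path)
```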

Accomplishments that I'm proud of

I am incredibly proud of building a system that is not only technologically advanced but also has the potential to make a real-world impact. As a solo developer, some of my key accomplishments include:

  • A Truly Comprehensive Analysis: I successfully integrated two distinct, multi-layered analysis pipelines (vocal and linguistic) into a single, cohesive system. This provides a 360-degree view of the user's speech that is far more insightful than any single analysis.
  • Clinical-Grade Accuracy: By integrating Praat via parselmouth, I've ensured that my vocal biomarker analysis is up to the standards of clinical and academic research.
  • A Production-Ready, Scalable Backend: I built a robust, scalable, and well-tested backend using modern best practices, including a fully containerized deployment with Docker.

What I learned

Building CogniSpeech was an incredible learning journey. I went from a simple concept to a production-ready application, gaining deep expertise in several key areas:

  • Multi-Model AI Integration: I learned that a single AI model is rarely enough. My biggest takeaway was how to create a multi-layered analysis pipeline, combining five specialized models for linguistic analysis and a hybrid approach for vocal analysis.
  • Advanced Audio Processing: I moved beyond basic audio libraries and learned to use praat-parselmouth, the Python interface to the gold-standard in phonetic science.
  • Robust Backend Architecture: I built the entire system on a modern, scalable architecture using FastAPI, learning how to manage background tasks, handle database sessions correctly, and build a powerful and easy-to-use RESTful API.

What's next for CogniSpeech

I'm just getting started. The CogniSpeech platform is designed to be extensible, and I have a clear roadmap for the future:

  • Real-time Analysis: I plan to integrate WebSocket support to enable live, real-time analysis of audio streams.
  • Advanced Clinical Metrics: I will continue to expand the vocal analysis capabilities, adding more specialized metrics for specific clinical applications.
  • Machine Learning for Prediction: With the rich data being collected, I plan to build predictive models that can identify the early signs of specific conditions based on vocal and linguistic trends.
  • Frontend Dashboard: I will be developing a user-friendly frontend dashboard to visualize the analysis results and provide actionable insights to users and clinicians.
