Inspiration
I was inspired by a simple yet profound idea: the human voice is one of the most powerful and data-rich biomarkers available. It holds subtle clues not just about what I'm thinking, but about my underlying emotional and cognitive health. I envisioned a tool that could non-invasively analyze these vocal patterns to help in the early-stage detection of neurological conditions, monitor mental wellness, and provide a deeper understanding of human communication. This led me to create CogniSpeech, an AI-powered platform designed to unlock the hidden health insights in the human voice.
What it does
CogniSpeech is a production-ready, AI-driven platform that provides deep insights into a person's cognitive and emotional state by analyzing their voice. It goes beyond simple transcription by performing a multi-layered analysis of both how a person speaks and what they say.
- Comprehensive Vocal Analysis: Extracts over 20 distinct vocal biomarkers, including pitch, jitter, shimmer, and formants, using industry-standard tools for clinical-grade accuracy.
- Multi-Layered Linguistic Analysis: Employs a suite of five specialized AI models to perform speech-to-text, sentiment analysis, emotion classification, and dialogue act recognition.
- Advanced Analytics: Provides trend analysis and weekly summaries, offering longitudinal insights into a user's vocal and emotional patterns.
How I built it
CogniSpeech is built on a sophisticated, multi-layered architecture designed for accuracy and scalability.
The Vocal Analyzer
My vocal analysis service uses a "best of both worlds" approach to extract over 20 distinct acoustic features:
- Layer 1: praat-parselmouth: For clinical-grade precision, I use Praat's algorithms to measure core metrics like pitch, jitter, shimmer, HNR (Harmonics-to-Noise Ratio), and formants.
- Layer 2: librosa: For a broad spectral profile, I use librosa to extract features like MFCCs, spectral contrast, and chroma.
- Layer 3: Custom Rhythm Analysis: I developed custom logic to calculate speech and articulation rates, providing insights into the temporal dynamics of speech. (A rough sketch of all three layers follows this list.)
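To make the three layers concrete, here is a rough sketch of how they might fit together in one extraction function. The Praat parameter values, feature names, and the rhythm heuristic at the end are illustrative assumptions, not the exact CogniSpeech implementation:

```python
# Illustrative sketch of the three-layer vocal analysis described above.
import numpy as np
import librosa
import parselmouth
from parselmouth.praat import call

def extract_vocal_features(path: str) -> dict:
    features = {}

    # Layer 1: Praat (via parselmouth) for clinical-grade voice metrics
    snd = parselmouth.Sound(path)
    pitch = snd.to_pitch()
    features["mean_pitch_hz"] = call(pitch, "Get mean", 0, 0, "Hertz")
    points = call(snd, "To PointProcess (periodic, cc)", 75, 600)
    features["jitter_local"] = call(points, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
    features["shimmer_local"] = call([snd, points], "Get shimmer (local)", 0, 0, 0.0001, 0.02, 1.3, 1.6)
    harmonicity = call(snd, "To Harmonicity (cc)", 0.01, 75, 0.1, 1.0)
    features["hnr_db"] = call(harmonicity, "Get mean", 0, 0)
    formants = snd.to_formant_burg()
    features["f1_mid_hz"] = call(formants, "Get value at time", 1, snd.duration / 2, "Hertz", "Linear")

    # Layer 2: librosa for a broad spectral profile
    y, sr = librosa.load(path, sr=None)
    features["mfcc_means"] = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13), axis=1).tolist()
    features["spectral_contrast_mean"] = float(np.mean(librosa.feature.spectral_contrast(y=y, sr=sr)))
    features["chroma_mean"] = float(np.mean(librosa.feature.chroma_stft(y=y, sr=sr)))

    # Layer 3: simplified rhythm proxy -- ratio of voiced time to total time
    # (the real rhythm layer computes speech and articulation rates)
    intervals = librosa.effects.split(y, top_db=30)
    voiced_seconds = sum(end - start for start, end in intervals) / sr
    features["voiced_ratio"] = voiced_seconds / (len(y) / sr)

    return features
```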
The Linguistic Analyzer
My linguistic analysis service is a five-layer pipeline that deconstructs the content of the speech (sketched in code after the list):
- Speech-to-Text: I use OpenAI's Whisper model for highly accurate transcription.
- Sentiment Analysis: A RoBERTa-based model provides a high-level classification of the text as positive, negative, or neutral.
- Emotion Classification: A DistilRoBERTa model identifies a range of nuanced emotions, including joy, sadness, anger, and fear.
- Dialogue Act Recognition: A BART model analyzes the conversational intent of each sentence, determining if it's a statement, a question, etc.
- Summarization: Another BART model generates a concise, AI-powered summary of the entire conversation.
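As a rough sketch of how these five layers could be wired together, assuming common off-the-shelf Hugging Face checkpoints (the model names, labels, and function names below are illustrative; the actual CogniSpeech pipeline may differ):

```python
# Illustrative sketch of the five-layer linguistic pipeline described above.
import whisper
from transformers import pipeline

# Layer 1: speech-to-text with Whisper
asr = whisper.load_model("base")

# Layers 2-5: specialized transformer models (example checkpoints)
sentiment = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment-latest")
emotion = pipeline("text-classification", model="j-hartmann/emotion-english-distilroberta-base")
dialogue_act = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def analyze_speech(audio_path: str) -> dict:
    transcript = asr.transcribe(audio_path)["text"]
    return {
        "transcript": transcript,
        "sentiment": sentiment(transcript)[0],   # positive / negative / neutral
        "top_emotion": emotion(transcript)[0],   # joy, sadness, anger, fear, ...
        # In the full pipeline, dialogue acts are classified per sentence.
        "dialogue_act": dialogue_act(transcript, candidate_labels=["statement", "question", "request"])["labels"][0],
        "summary": summarizer(transcript, max_length=60, min_length=10)[0]["summary_text"],
    }
```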
All of this is served through a FastAPI backend, with data stored in a SQLite database managed by SQLAlchemy and Alembic for migrations. The entire application is containerized with Docker for easy and reliable deployment.
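As a minimal sketch of what that backend could look like (the route, table, and helper names here are illustrative, not the real CogniSpeech schema; in production Alembic manages the schema rather than `create_all`):

```python
# Illustrative FastAPI + SQLAlchemy sketch of an upload-and-analyze endpoint.
from fastapi import BackgroundTasks, FastAPI, UploadFile
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

engine = create_engine("sqlite:///./cognispeech.db", connect_args={"check_same_thread": False})
SessionLocal = sessionmaker(bind=engine)
Base = declarative_base()

class Analysis(Base):
    __tablename__ = "analyses"
    id = Column(Integer, primary_key=True)
    filename = Column(String)
    status = Column(String, default="pending")

Base.metadata.create_all(engine)
app = FastAPI()

def run_analysis(analysis_id: int, path: str) -> None:
    # Placeholder: call the vocal and linguistic analyzers, then persist results.
    with SessionLocal() as db:
        record = db.get(Analysis, analysis_id)
        record.status = "complete"
        db.commit()

@app.post("/analyses")
async def create_analysis(file: UploadFile, background_tasks: BackgroundTasks):
    path = f"/tmp/{file.filename}"
    with open(path, "wb") as f:
        f.write(await file.read())
    with SessionLocal() as db:
        record = Analysis(filename=file.filename)
        db.add(record)
        db.commit()
        db.refresh(record)
    # Heavy analysis runs outside the request/response cycle.
    background_tasks.add_task(run_analysis, record.id, path)
    return {"id": record.id, "status": "pending"}
```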
Challenges I ran into
Building a system this complex as a solo developer was a formidable challenge, and the journey was filled with technical hurdles that required deep problem-solving and perseverance.
- The asyncio.CancelledError Server Crash: My biggest hurdle was a recurring server crash. I discovered that loading all the AI models at startup was blocking the main server thread, causing it to time out. To solve this, I re-architected the services to use a "lazy loading" approach and moved the entire analysis pipeline into a separate process using concurrent.futures.ProcessPoolExecutor. This isolated the heavy AI work from the web server and made the application far more stable (see the sketch after this list).
- Dependency Hell: I ran into numerous issues with library mismatches and incompatibilities, especially with parselmouth and its system-level audio dependencies. This taught me the importance of meticulous environment management, leading me to create a clean, reproducible setup inside a Dockerfile that works every time.
- Inaccurate AI Models: Initially, the sentiment analysis was giving inaccurate answers. I realized that a single, general-purpose model wasn't enough for the nuances of spoken language. This led me to develop the multi-layered linguistic analysis pipeline, which combines five specialized models to achieve a much higher degree of accuracy and insight.
- Frontend-Backend Integration: Connecting the frontend was a major challenge, with CORS errors, configuration mismatches, and path issues. This required a deep dive into both frontend and backend configurations to ensure seamless communication between the two.
- Handling Audio: I faced a wide range of audio-related problems, from format errors and uploading issues to size limitations. I had to build a robust validation and processing pipeline that could handle various audio formats and sizes gracefully, ensuring a smooth user experience.
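For the first challenge above, here is a minimal sketch of the lazy-loading plus ProcessPoolExecutor pattern, with illustrative names rather than the actual CogniSpeech code:

```python
# Illustrative sketch: keep model loading out of server startup and off the event loop.
import asyncio
from concurrent.futures import ProcessPoolExecutor
from functools import lru_cache

executor = ProcessPoolExecutor(max_workers=1)

@lru_cache(maxsize=1)
def get_models() -> dict:
    # Lazy loading: models are created on first use inside the worker process,
    # so the web server starts instantly and never blocks while models load.
    from transformers import pipeline
    return {"sentiment": pipeline("sentiment-analysis")}

def heavy_analysis(audio_path: str) -> dict:
    models = get_models()
    # ... run the vocal and linguistic analysis with `models` here ...
    return {"status": "complete"}

async def analyze_in_background(audio_path: str) -> dict:
    # The CPU/GPU-heavy work runs in a separate process, so the asyncio event
    # loop serving HTTP requests stays responsive.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, heavy_analysis, audio_path)
```

Because the worker is a separate process, the models are loaded once per worker, and a long-running or crashing analysis job can no longer take down the web server.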
Accomplishments that I'm proud of
I am incredibly proud of building a system that is not only technologically advanced but also has the potential to make a real-world impact. As a solo developer, some of my key accomplishments include:
- A Truly Comprehensive Analysis: I successfully integrated two distinct, multi-layered analysis pipelines (vocal and linguistic) into a single, cohesive system. This provides a 360-degree view of the user's speech that is far more insightful than any single analysis.
- Clinical-Grade Accuracy: By integrating Praat via parselmouth, I've ensured that my vocal biomarker analysis is up to the standards of clinical and academic research.
- A Production-Ready, Scalable Backend: I built a robust, scalable, and well-tested backend using modern best practices, including a fully containerized deployment with Docker.
What I learned
Building CogniSpeech was an incredible learning journey. I went from a simple concept to a production-ready application, gaining deep expertise in several key areas:
- Multi-Model AI Integration: I learned that a single AI model is rarely enough. My biggest takeaway was how to create a multi-layered analysis pipeline, combining five specialized models for linguistic analysis and a hybrid approach for vocal analysis.
- Advanced Audio Processing: I moved beyond basic audio libraries and learned to use praat-parselmouth, the Python interface to Praat, the gold standard in phonetic science.
- Robust Backend Architecture: I built the entire system on a modern, scalable architecture using FastAPI, learning how to manage background tasks, handle database sessions correctly, and build a powerful and easy-to-use RESTful API.
What's next for CogniSpeech
I'm just getting started. The CogniSpeech platform is designed to be extensible, and I have a clear roadmap for the future:
- Real-time Analysis: I plan to integrate WebSocket support to enable live, real-time analysis of audio streams.
- Advanced Clinical Metrics: I will continue to expand the vocal analysis capabilities, adding more specialized metrics for specific clinical applications.
- Machine Learning for Prediction: With the rich data being collected, I plan to build predictive models that can identify the early signs of specific conditions based on vocal and linguistic trends.
- Frontend Dashboard: I will be developing a user-friendly frontend dashboard to visualize the analysis results and provide actionable insights to users and clinicians.
Built With
- alembic
- bart
- bert
- distilroberta
- docker
- fastapi
- librosa
- openai-whisper
- postgresql
- praat-parselmouth
- python
- react
- roberta
- sqlalchemy
- sqlite