Inspiration
I took my first Waymo ride earlier in the week, and it was fun and exciting. But after a while the excitement died down, and the ride started to feel pretty robotic and boring.
We believe in a future where robotaxis are the majority, working in harmony with the other AI agents in our army of AI assistants. It will be extremely boring if all of them have the same robotic characteristics and, more importantly, if none of them know how to read the room!
What it does
Our CharacterAV detects a passenger's facial expression, pose, and voice tone. It then uses its social-cues skill to interpret those behaviours and reason about whether it should initiate a conversation with the passenger and, if so, when and what to say.
And of course, it speaks as a character of the user's choice. In our demo, we picked Harry Potter.
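As a rough illustration of the decision step described above, here is a minimal sketch of how the detected cues could be turned into a prompt for the LLM and how its reply could be parsed. The type names, the prompt wording, and the JSON reply shape are all assumptions made for this example, not the actual CharacterAV code. The sketch defaults to staying silent whenever the reply is malformed.

```typescript
// Hypothetical shapes: these names are illustrative, not from the project.
interface SocialCues {
  expression: string; // e.g. "smiling", as reported by the vision model
  pose: string;       // e.g. "looking at phone"
  voiceTone: string;  // e.g. "tired"
}

interface Decision {
  speak: boolean;       // should the agent start talking at all?
  delaySeconds: number; // how long to wait before speaking
  utterance: string;    // what to say, in character
}

// Build a text prompt that asks the LLM for a structured decision.
function buildDecisionPrompt(character: string, cues: SocialCues): string {
  return [
    `You are ${character}, the voice of an autonomous vehicle.`,
    `Passenger facial expression: ${cues.expression}`,
    `Passenger pose: ${cues.pose}`,
    `Passenger voice tone: ${cues.voiceTone}`,
    `Decide whether to start a conversation. Reply with JSON only:`,
    `{"speak": boolean, "delaySeconds": number, "utterance": string}`,
  ].join("\n");
}

// Parse the LLM reply; any malformed or empty reply means "stay silent".
function parseDecision(raw: string): Decision {
  const silent: Decision = { speak: false, delaySeconds: 0, utterance: "" };
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== "object" || parsed === null) return silent;
    const d = parsed as Record<string, unknown>;
    if (typeof d.speak !== "boolean" || !d.speak) return silent;
    if (typeof d.utterance !== "string" || d.utterance.trim() === "") return silent;
    const delay =
      typeof d.delaySeconds === "number" && d.delaySeconds >= 0 ? d.delaySeconds : 0;
    return { speak: true, delaySeconds: delay, utterance: d.utterance };
  } catch {
    return silent;
  }
}
```

Failing closed (silence on bad output) seems like the safer default for an agent whose whole point is knowing when not to talk.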
How we built it
We use an open-source Voice Activity Detection model running via ONNX in combination with a local Whisper model running via WebGPU. This combination allows our agent to understand what people are saying, when to speak, when not to speak, and more.
We also use LLaVA (a Llama-based language model with a ViT vision encoder attached) to detect the rider's emotions, actions, and more to inform those decisions.
We also use an RVC (Retrieval-based Voice Conversion) model so the agent can speak in multiple voices.
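To make the Voice Activity Detection and Whisper combination more concrete, here is a minimal sketch of one way to gate transcription: group per-frame speech probabilities (the kind of output a VAD model typically produces) into speech segments, and only hand those segments to the transcriber. The function name, the 0.5 threshold, and the three-frame silence window are assumptions chosen for illustration; the real pipeline may segment audio differently.

```typescript
// Hypothetical helper: not part of any VAD or Whisper library.
interface Segment {
  startFrame: number; // first frame of speech (inclusive)
  endFrame: number;   // one past the last frame of speech (exclusive)
}

// Group frames whose speech probability meets the threshold into segments.
// A segment closes once `minSilenceFrames` consecutive quiet frames are seen,
// so short pauses inside a sentence do not split it.
function findSpeechSegments(
  probs: number[],
  threshold = 0.5,
  minSilenceFrames = 3,
): Segment[] {
  const segments: Segment[] = [];
  let start = -1;     // start of the currently open segment, -1 if none
  let silenceRun = 0; // consecutive quiet frames inside the open segment

  probs.forEach((p, i) => {
    if (p >= threshold) {
      if (start === -1) start = i;
      silenceRun = 0;
    } else if (start !== -1) {
      silenceRun++;
      if (silenceRun >= minSilenceFrames) {
        // Close the segment at the first quiet frame of this run.
        segments.push({ startFrame: start, endFrame: i - silenceRun + 1 });
        start = -1;
        silenceRun = 0;
      }
    }
  });

  // Close a segment still open at the end, trimming trailing quiet frames.
  if (start !== -1) {
    segments.push({ startFrame: start, endFrame: probs.length - silenceRun });
  }
  return segments;
}

// Example: one utterance with a short pause, a long pause, then a new one.
const example = findSpeechSegments([0.1, 0.9, 0.8, 0.2, 0.9, 0.1, 0.1, 0.1, 0.7]);
// example is [{ startFrame: 1, endFrame: 5 }, { startFrame: 8, endFrame: 9 }]
```

Each segment would then be sliced out of the audio buffer and passed to the transcription model, and the end of a segment is also a natural signal that the passenger has stopped talking.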
Challenges we ran into
Transcription is hard, especially in the browser. Getting Whisper to run via WebGPU took some novel bug fixes for Next.js.
Accomplishments that we're proud of
We're proud of how much of the pipeline runs on device. In the future, the entire pipeline could run on device for privacy reasons.
What we learned
LLMs are excellent social agents when given the correct inputs. Our demo is capable of extremely nuanced interactions.
What's next for CharacterAV
I work at Zoox, Amazon's self-driving car company, where I specifically work on Human Machine Interfaces. I'm going to surface our work with the team and see if it can be developed further.
Built With
- llava
- llm
- next.js
- onnx
- rvc
- webgpu
- whisper