Inspiration
Iron Man's Jarvis is the inspiration. Everyone who has seen the movie wants something like it, and with the tools available today it finally seems to be on the edge of possibility.
What it does
The user interacts naturally with the computer using voice, hands, and head. The computer understands the user's needs, brings up tools when needed, and helps the user complete tasks faster. I started with creating Word documents, and it can expand further into other areas such as data analytics, 3D modeling, and more.
How we built it
See the system architecture in the photos: Meta Quest 3 (hardware) -> Unity (game engine) <-> cloud container (API server) <-> AI inference APIs. I built an API server to act as the intermediary between the game engine, which uses C#, and the AI world, which uses Python. There are multiple AI inference endpoints, each created for a different purpose: Azure OpenAI GPT-3.5 is cheap, fast, and reliable for function calls; Gemini 1.5 is multimodal with a 1-million-token context window for a specialized knowledge agent; Groq with Llama 3 is blazing fast. Each model is used for its strengths. All of the AI inference goes through LangChain for easy tracking and data collection. Finally, voice-to-text uses OpenAI Whisper (small) on device (no API call!), and text-to-voice uses ElevenLabs, which is expensive but sounds very good.
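To make the data flow concrete, here is a minimal sketch of what such an API server could look like: a FastAPI app that receives transcribed speech from Unity and routes it to one of the LangChain model wrappers. The route, request fields, model identifiers, and deployment names are illustrative assumptions (credentials are assumed to come from environment variables), not the project's exact code.

```python
# Minimal sketch of the cloud API server between Unity (C#) and the Python AI stack.
# Model/deployment names and the request schema are assumptions for illustration.
from fastapi import FastAPI
from pydantic import BaseModel
from langchain_openai import AzureChatOpenAI        # cheap, fast, reliable function calls
from langchain_google_vertexai import ChatVertexAI  # multimodal, 1M-token context
from langchain_groq import ChatGroq                 # very low latency

app = FastAPI()

# Each endpoint/model is kept for the job it is best at.
models = {
    "function_calls": AzureChatOpenAI(azure_deployment="gpt-35-turbo"),
    "knowledge": ChatVertexAI(model_name="gemini-1.5-pro"),
    "fast_chat": ChatGroq(model="llama3-70b-8192"),
}

class ChatRequest(BaseModel):
    task: str    # which model to route to
    prompt: str  # user speech already transcribed on the headset by Whisper

@app.post("/chat")
def chat(req: ChatRequest):
    llm = models.get(req.task, models["fast_chat"])
    reply = llm.invoke(req.prompt)  # LangChain records the call for tracking
    return {"text": reply.content}
```

Unity only ever talks HTTP/JSON to this server, so the C# side never needs a Python AI library of its own.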
Challenges we ran into
Building an API server in the cloud was something I had never done before, but it was necessary to tie the game-engine functions to the AI world. Waiting for others to build libraries for the latest AI tools was not an option for this hackathon.
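For context, tying engine functions to the AI world roughly means exposing them to the model as callable tools. The sketch below uses LangChain tool binding with Azure OpenAI; the tool names and arguments are hypothetical, not the project's actual function set.

```python
# Hedged sketch: expose game-engine actions as tools so the LLM's function call
# can be relayed back to Unity. Tool names and arguments are hypothetical.
from langchain_core.tools import tool
from langchain_openai import AzureChatOpenAI

@tool
def create_word_document(title: str) -> str:
    """Open a new document panel in the headset with the given title."""
    return f"create_word_document:{title}"  # Unity parses this command string

@tool
def insert_text(text: str) -> str:
    """Dictate text into the currently focused document."""
    return f"insert_text:{text}"

# Deployment name is an assumption; Azure credentials come from environment variables.
llm = AzureChatOpenAI(azure_deployment="gpt-35-turbo")
llm_with_tools = llm.bind_tools([create_word_document, insert_text])

def route_user_speech(transcript: str) -> dict:
    """Ask the model which engine function matches the user's spoken request."""
    msg = llm_with_tools.invoke(transcript)
    if msg.tool_calls:  # e.g. "make a new document called Meeting Notes"
        call = msg.tool_calls[0]
        return {"function": call["name"], "args": call["args"]}
    return {"function": None, "text": msg.content}
```

The server only returns the chosen function name and arguments; executing the action stays inside the game engine.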
Accomplishments that we're proud of
A proof of concept that we can really build something like Iron Man's Jarvis: from talking naturally to the computer to it responding by calling the right function. Multimodal image-to-text transcription works on Vertex AI / Gemini 1.5. I feel that AI has opened so many possibilities, and I'm happy to bring some of them to a real demo.
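As an illustration, the multimodal call looks roughly like the sketch below (LangChain's Vertex AI wrapper with a base64-encoded image). The prompt and image handling are assumptions, not the exact code from the demo.

```python
# Rough sketch of image-to-text transcription on Vertex AI / Gemini 1.5.
import base64
from langchain_core.messages import HumanMessage
from langchain_google_vertexai import ChatVertexAI

llm = ChatVertexAI(model_name="gemini-1.5-pro")  # project/credentials from environment

def transcribe_image(image_path: str) -> str:
    """Send an image (e.g. a headset capture) to Gemini and return the transcription."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    message = HumanMessage(content=[
        {"type": "text", "text": "Transcribe any text visible in this image."},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
    ])
    return llm.invoke([message]).content
```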
What we learned
Spatial computing as a human interface is so much better than a mouse, keyboard, and multiple giant screens. It is possible with a mixed-reality headset right now, but it could also come through other hardware such as AR glasses. Designing around natural human interaction makes technology much more accessible. AI acting as the brain is what made this possible: a general understanding of the user's words, translated into a programming function call, is what makes this feel magical.
What's next for Project Ada
Continue building more capabilities. Build proper cloud infrastructure so more people can use it. Refine the audio-to-text pipeline so that open-ended dictation is possible without a manual trigger from the user. Add memory to the agent. Add eye tracking for even better interactions.