Inspiration

Arguments are very common in our everyday relationships! What if there were a funny way to defuse arguments so that nobody gets hurt? Introducing... Snarkify! 😡⚡

What it does

Snarkify is an innovative augmented reality application built for Snap's Spectacles platform that combines real-time AI conversation, 3D object generation, and face tracking. The project demonstrates how to integrate multiple AI services within an AR environment, creating an immersive experience where users can talk to an AI assistant by voice and see 3D objects generated in their physical space.

Technologies used

  1. Real-time AI Conversation: Connects to Google's Gemini Live API for streaming audio conversations
  2. 3D Object Generation: Uses Snap's Snap3D service to generate interactive 3D models from text prompts
  3. Voice Recognition: Implements automatic speech recognition (ASR) for hands-free interaction
  4. Spatial UI: Features a floating orb interface that can be positioned in 3D space or screen space
  5. Multimodal Input: Supports both audio and camera input for rich AI interactions
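
As a rough sketch of how these capabilities connect, the interfaces below describe the flow from voice input to generated 3D object. They are purely illustrative TypeScript shapes written for this writeup, not the project's actual Lens Studio types.

```typescript
// Illustrative-only interfaces summarizing the Snarkify data flow.
// None of these names come from Lens Studio or the project source.

interface VoiceInput {
  // ASR turns microphone audio into text queries.
  onQuery(callback: (transcript: string) => void): void;
}

interface ConversationalAI {
  // Gemini Live streams audio in (16 kHz) and out (24 kHz),
  // and can request a 3D object via function calling.
  sendAudioChunk(pcm16Base64: string): void;
  onTextResponse(callback: (text: string) => void): void;
  onFunctionCall(callback: (name: string, args: Record<string, unknown>) => void): void;
}

interface ObjectGenerator {
  // Snap3D resolves a text prompt into a placeable 3D model.
  generate(prompt: string): Promise<void>;
}

interface SpatialUI {
  // The floating orb shows captions in world space or screen space.
  showCaption(text: string): void;
}
```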

How We Built It

Architecture & Technology Stack

The project is built on several key technologies:

  • Lens Studio 5.12.1: Snap's AR development platform
  • TypeScript: Primary programming language for logic and components
  • Remote Service Gateway: Snap's cloud service integration system
  • Spectacles Interaction Kit (SIK): UI and interaction framework

Key Components

GeminiAssistant.ts - AI Brain

  • Establishes WebSocket connection to Gemini Live API
  • Handles real-time audio streaming (16kHz input, 24kHz output)
  • Processes function calls for 3D generation
  • Manages conversation flow with custom system instructions
  • Supports both audio and text-only modes
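
A minimal sketch of how such a session could be opened over a plain WebSocket. The endpoint, model name, and message fields are assumptions based on our understanding of the Gemini Live bidirectional protocol and may differ by API version; how Remote Service Gateway injects credentials is omitted, and handleFunctionCall is a hypothetical placeholder.

```typescript
// Sketch only: opens a Gemini Live session with a custom system
// instruction and a declared 3D-generation tool. GEMINI_WS_URL is a
// placeholder, not the real endpoint.
const GEMINI_WS_URL = "wss://example-gemini-live-endpoint";

const ws = new WebSocket(GEMINI_WS_URL);

ws.onopen = () => {
  // The first message configures the session: response modality,
  // persona, and the single tool the model is allowed to call.
  ws.send(JSON.stringify({
    setup: {
      model: "models/gemini-2.0-flash-live-001", // assumed model name
      generationConfig: { responseModalities: ["AUDIO"] },
      systemInstruction: {
        parts: [{ text: "Listen to the conversation and defuse arguments with snarky 3D props." }],
      },
      tools: [{
        functionDeclarations: [{
          name: "generate_3d_object",
          description: "Create a 3D object from a short text prompt",
          parameters: {
            type: "OBJECT",
            properties: { prompt: { type: "STRING" } },
            required: ["prompt"],
          },
        }],
      }],
    },
  }));
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data as string);
  // Audio responses arrive as Base64 PCM at 24 kHz (not handled here);
  // tool calls are forwarded to the 3D generation pipeline.
  if (msg.toolCall) {
    for (const call of msg.toolCall.functionCalls ?? []) {
      handleFunctionCall(call.name, call.args);
    }
  }
};

// Stream a chunk of microphone audio (16 kHz PCM16, Base64-encoded).
function sendAudioChunk(base64Pcm: string): void {
  ws.send(JSON.stringify({
    realtimeInput: {
      mediaChunks: [{ mimeType: "audio/pcm;rate=16000", data: base64Pcm }],
    },
  }));
}

// Hypothetical placeholder: the real project routes this to Snap3D.
function handleFunctionCall(name: string, args: Record<string, unknown>): void {
  console.log(`function call: ${name} ${JSON.stringify(args)}`);
}
```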

AIAssistantUIBridge.ts - Integration Hub

  • Connects the AI assistant to the user interface
  • Coordinates between voice input, AI processing, and 3D generation
  • Auto-starts the AI session on app launch
  • Manages the flow between different system components
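
In simplified form, the bridge is mostly event wiring. The hook names below (onQuery, onTextResponse, onFunctionCall, showCaption, createInteractable) are assumptions made for this sketch rather than the project's real signatures.

```typescript
// Sketch: the bridge connects voice input, the Gemini session, the
// orb UI, and the 3D factory, then auto-starts the session.
class AIAssistantUIBridge {
  constructor(
    private asr: { onQuery(cb: (text: string) => void): void },
    private gemini: {
      start(): void;
      sendText(text: string): void;
      onTextResponse(cb: (text: string) => void): void;
      onFunctionCall(cb: (name: string, args: { prompt?: string }) => void): void;
    },
    private orb: { showCaption(text: string): void },
    private factory: { createInteractable(prompt: string): Promise<void> },
  ) {}

  initialize(): void {
    // Voice queries go straight to the AI.
    this.asr.onQuery((text) => this.gemini.sendText(text));

    // AI text is mirrored as captions on the floating orb.
    this.gemini.onTextResponse((text) => this.orb.showCaption(text));

    // Function calls from the AI trigger 3D generation.
    this.gemini.onFunctionCall((name, args) => {
      if (name === "generate_3d_object" && args.prompt) {
        this.factory.createInteractable(args.prompt);
      }
    });

    // Auto-start the session on app launch.
    this.gemini.start();
  }
}
```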

SphereController.ts - Spatial UI

  • Creates a floating orb interface that follows the user
  • Supports both world-space and screen-space positioning
  • Displays AI responses and user speech captions
  • Handles hand tracking and spatial interactions
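
A small sketch of the follow behavior, assuming a per-frame update that receives the camera pose. The Vec3 type and lerp helper stand in for Lens Studio's transform math, which the real component uses instead.

```typescript
// Sketch: smooth "follow the user" behavior with a world-space /
// screen-space toggle.
type Vec3 = { x: number; y: number; z: number };

const lerp = (a: Vec3, b: Vec3, t: number): Vec3 => ({
  x: a.x + (b.x - a.x) * t,
  y: a.y + (b.y - a.y) * t,
  z: a.z + (b.z - a.z) * t,
});

class SphereController {
  private worldSpace = true;
  private position: Vec3 = { x: 0, y: 0, z: 0 };

  // Called every frame with the camera pose (assumed to be supplied
  // by the host engine's update loop).
  update(cameraPos: Vec3, cameraForward: Vec3, dt: number): void {
    if (!this.worldSpace) return; // screen-space mode: orb is parented to the camera instead

    // Target a point ~0.5 m in front of the user and ease toward it,
    // so the orb trails head motion instead of snapping.
    const target: Vec3 = {
      x: cameraPos.x + cameraForward.x * 0.5,
      y: cameraPos.y + cameraForward.y * 0.5,
      z: cameraPos.z + cameraForward.z * 0.5,
    };
    this.position = lerp(this.position, target, Math.min(1, dt * 5));
  }

  setWorldSpace(enabled: boolean): void {
    this.worldSpace = enabled;
  }
}
```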

Snap3DInteractableFactory.ts - 3D Generation

  • Interfaces with Snap's 3D generation service
  • Creates interactive 3D objects that users can manipulate
  • Manages the generation pipeline from text prompt to 3D model
  • Supports mesh refinement and vertex coloring options
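
Conceptually, the factory looks something like the sketch below. Snap3DClient and submitPrompt are hypothetical stand-ins for the Remote Service Gateway call; the actual Snap3D API and its option names differ.

```typescript
// Sketch: text prompt -> generated model -> interactable object.
interface Snap3DClient {
  submitPrompt(options: {
    prompt: string;
    refineMesh: boolean;     // optional mesh refinement pass
    useVertexColor: boolean; // vertex coloring instead of textures
  }): Promise<{ modelUrl: string }>;
}

class Snap3DInteractableFactory {
  constructor(private client: Snap3DClient) {}

  async createInteractable(prompt: string): Promise<void> {
    // 1. Kick off generation (this can take several seconds, so the
    //    real lens shows a placeholder and progress indicator).
    const { modelUrl } = await this.client.submitPrompt({
      prompt,
      refineMesh: true,
      useVertexColor: false,
    });

    // 2. Load the returned model and attach manipulation behavior
    //    (grab / move / scale) so the user can place it in the room.
    //    Loading and interaction setup are engine-specific and omitted.
    console.log(`Loaded model from ${modelUrl}, attaching interactions`);
  }
}
```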

ASRQueryController.ts - Voice Input

  • Implements speech-to-text functionality
  • Provides visual feedback during voice recording
  • Handles voice query processing with configurable accuracy modes
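
A push-to-talk style sketch of that flow, with a hypothetical SpeechRecognizer interface standing in for the platform's ASR module.

```typescript
// Sketch: record -> transcribe -> notify listeners, with a visual
// recording indicator. Only the control flow reflects the description
// above; the SpeechRecognizer interface is assumed for illustration.
interface SpeechRecognizer {
  start(options: { highAccuracy: boolean }): void;
  stop(): Promise<string>; // resolves with the final transcript
}

class ASRQueryController {
  private onQueryCallbacks: Array<(text: string) => void> = [];

  constructor(
    private recognizer: SpeechRecognizer,
    private setRecordingIndicator: (visible: boolean) => void,
    private highAccuracy = true, // configurable accuracy mode
  ) {}

  onQuery(cb: (text: string) => void): void {
    this.onQueryCallbacks.push(cb);
  }

  startListening(): void {
    this.setRecordingIndicator(true); // visual feedback while recording
    this.recognizer.start({ highAccuracy: this.highAccuracy });
  }

  async stopListening(): Promise<void> {
    const transcript = await this.recognizer.stop();
    this.setRecordingIndicator(false);
    if (transcript.length > 0) {
      this.onQueryCallbacks.forEach((cb) => cb(transcript));
    }
  }
}
```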

Challenges We Ran Into

Technical Integration Challenges

  1. Real-time Audio Processing: Synchronizing 16kHz microphone input with 24kHz audio output while maintaining low latency for natural conversation flow
  2. WebSocket State Management: Managing complex WebSocket connections with proper error handling, reconnection logic, and state synchronization between multiple AI services
  3. Spatial UI Positioning: Creating a responsive UI that transitions smoothly between hand-tracked, world-space, and screen-space modes while maintaining user context
  4. Cross-Service Communication: Coordinating between Gemini Live's function calling system and Snap3D's asynchronous generation pipeline

Platform-Specific Constraints

  1. Spectacles Hardware Limitations: Working within the computational and memory constraints of AR glasses
  2. Network Dependency: Ensuring graceful degradation when internet connectivity is poor or intermittent (see the reconnection sketch after this list)
  3. Audio Feedback Prevention: Preventing audio loops in development while maintaining natural conversation flow
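
For the network-dependency point above, one common mitigation (shown here as an illustrative pattern, not the project's actual code) is to reconnect the live socket with exponential backoff and let the UI degrade gracefully while offline.

```typescript
// Sketch: reconnect a live WebSocket with exponential backoff and
// tell the UI to show a degraded state while offline. SOCKET_URL and
// the UI hooks are placeholders.
const SOCKET_URL = "wss://example-live-endpoint";

class ResilientSocket {
  private attempts = 0;
  private socket?: WebSocket;

  constructor(
    private onMessage: (data: string) => void,
    private onOffline: () => void, // e.g. show a "reconnecting..." caption
  ) {}

  connect(): void {
    this.socket = new WebSocket(SOCKET_URL);

    this.socket.onopen = () => {
      this.attempts = 0; // connection healthy again, reset backoff
    };

    this.socket.onmessage = (event) => this.onMessage(event.data as string);

    this.socket.onclose = () => {
      this.onOffline();
      // Exponential backoff capped at 30 s to avoid hammering the network.
      const delayMs = Math.min(30_000, 1_000 * 2 ** this.attempts);
      this.attempts += 1;
      setTimeout(() => this.connect(), delayMs);
    };
  }
}
```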

Accomplishments that we're proud of

Innovation in AR-AI Integration

  1. Seamless Multimodal Experience: Successfully created a natural conversation flow where users can speak to AI and see their ideas materialize as 3D objects in real space
  2. Real-time Function Calling: Implemented sophisticated function calling between Gemini Live and Snap3D, allowing the AI to generate 3D content based on conversation context (the round trip is sketched after this list)
  3. Adaptive UI System: Built a spatial interface that intelligently switches between different interaction modes based on user context and device capabilities
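
The function-calling round trip from item 2 looks roughly like the sketch below: run the requested generation, then report success or failure back so the model can narrate the outcome. Message shapes follow our reading of Gemini Live's tool-call protocol and may differ by API version; generateObject is a hypothetical wrapper around the Snap3D call.

```typescript
// Sketch of the AI -> Snap3D -> AI round trip.
async function onToolCall(
  ws: WebSocket,
  call: { id: string; name: string; args: { prompt?: string } },
  generateObject: (prompt: string) => Promise<void>,
): Promise<void> {
  let result: Record<string, unknown>;
  try {
    await generateObject(call.args.prompt ?? "a question mark");
    result = { status: "ok", detail: `Generated: ${call.args.prompt}` };
  } catch (err) {
    // Tell the model it failed so it can apologize or retry verbally.
    result = { status: "error", detail: String(err) };
  }

  // Report the outcome back to the model as a tool response.
  ws.send(JSON.stringify({
    toolResponse: {
      functionResponses: [{ id: call.id, name: call.name, response: result }],
    },
  }));
}
```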

Technical Achievements

  1. Custom System Instructions: Developed creative AI prompts that make the assistant listen to conversations and generate contextual 3D objects (like generating a clown wig when someone is called a clown)
  2. Robust Audio Pipeline: Implemented a complete audio processing system with Base64 encoding, PCM16 conversion, and dynamic audio output management (a conversion sketch appears at the end of this section)
  3. Interactive 3D Objects: Created a complete pipeline from text prompt to manipulatable 3D objects that users can move, scale, and interact with in AR space

User Experience Design

  1. Intuitive Spatial Interaction: Designed natural hand-based interactions that feel native to AR environments
  2. Visual Feedback Systems: Implemented comprehensive visual indicators for AI processing states, voice recording, and 3D generation progress
  3. Graceful Error Handling: Built resilient systems that provide clear feedback when services fail or connectivity issues occur
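
The core of the audio pipeline mentioned above is converting Float32 microphone samples to 16-bit PCM and Base64 for transport. The conversion below is standard; how the buffer is captured from the Spectacles microphone is platform-specific and omitted.

```typescript
// Sketch: Float32 samples (-1..1) -> 16-bit PCM -> Base64 string.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp, then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}

function pcm16ToBase64(pcm: Int16Array): string {
  // Walk the underlying bytes and Base64-encode them manually so the
  // sketch does not depend on Buffer or btoa being available.
  const bytes = new Uint8Array(pcm.buffer);
  const table = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const [b0, b1, b2] = [bytes[i], bytes[i + 1] ?? 0, bytes[i + 2] ?? 0];
    out += table[b0 >> 2];
    out += table[((b0 & 0x03) << 4) | (b1 >> 4)];
    out += i + 1 < bytes.length ? table[((b1 & 0x0f) << 2) | (b2 >> 6)] : "=";
    out += i + 2 < bytes.length ? table[b2 & 0x3f] : "=";
  }
  return out;
}
```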

What We Learned

AR Development Insights

  1. Spatial Computing Paradigms: Gained deep understanding of how UI/UX principles translate to 3D space and the importance of maintaining spatial context
  2. Performance Optimization: Learned to balance rich AI functionality with the constraints of mobile AR hardware
  3. Cross-Platform Considerations: Understanding how to develop for both Lens Studio preview and actual Spectacles hardware

AI Integration Patterns

  1. Real-time AI Conversations: Mastered the complexities of maintaining natural conversation flow with streaming AI models
  2. Function Calling Architecture: Developed patterns for reliable AI-to-service communication with proper error handling and status reporting
  3. Multimodal AI Design: Learned to coordinate multiple input modalities (voice, camera, text) for rich AI interactions

What's next?

Enhanced AI Capabilities

  1. Multi-AI Support: Integrate additional AI models (OpenAI, Claude, etc.) with seamless switching
  2. Persistent Conversations: Add conversation history and context retention across sessions
  3. AI Vision Integration: Enable the AI to see and comment on the user's environment through camera feed
  4. Emotional Intelligence: Add sentiment analysis and emotional responses to conversations

Advanced 3D Features

  1. Physics Integration: Add realistic physics simulation to generated 3D objects
  2. Animation Generation: Allow AI to create animated 3D content, not just static models
  3. Gesture Recognition: Add hand gesture controls for manipulating objects and controlling the AI
  4. Eye Tracking Integration: Use gaze for more natural UI interactions and AI attention
