
Elite Conversational AI

Inspiration

I was inspired by the futuristic, intelligent interfaces we often see in science fiction, where AI isn't just a text box but a seamless, proactive partner. My goal was to move beyond the limitations of traditional chatbots and create an experience that felt truly dynamic and intelligent. I wanted to build an AI that could not only understand and speak but could also visualize information and create its own interactive tools in real-time, adapting to the user's needs on the fly. This project is my attempt to build that next-generation human-computer interface using the full multi-modal power of Google's Gemini API.

What it does

Elite Conversational AI is a voice-first, multi-modal interface that generates its own UI procedurally based on your conversation.

  • Real-time Voice Interaction: It actively listens for voice commands and responds with low-latency, streaming audio, creating a natural and fluid conversation.
  • Procedural UI Generation: The core feature. The AI builds and updates a user interface in real-time to display information. If you ask it to generate an image, it creates an <img> element; if you search for local restaurants, it generates a list of interactive cards with map links (a hypothetical example of the UI spec behind this appears after this list).
  • Multi-modal Capabilities: It seamlessly integrates several powerful AI models to handle a wide range of tasks:
    • Image Generation: Creates high-quality images from text prompts.
    • Video Generation: Produces short videos from text or image inputs.
    • Grounded Search: Answers questions about recent events or local places by using Google Search and Google Maps, providing citations and links.
    • File Understanding: Can analyze images and other files that you upload through the generated UI.
  • Adaptive Interaction: The system is designed to understand context and adapt its communication style, making the interaction feel more personal and intuitive.
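
To make the procedural UI concrete, here is a hypothetical example of the kind of spec the AI might emit after a restaurant search. The field names (layout, theme, elements) and the sample data are purely illustrative and are not the project's actual schema.

    // Hypothetical UI spec for "find sushi places near me".
    // Field names and values are illustrative, not the project's real schema.
    const exampleSpec = {
      layout: "card-grid",
      theme: "dark",
      elements: [
        {
          type: "card",
          title: "Sushi Zen",
          subtitle: "4.7 ★ · 0.4 mi",
          actions: [{ label: "Open in Maps", href: "https://maps.google.com/?q=Sushi+Zen" }],
        },
        {
          type: "image",
          src: "https://example.com/generated-image.png",
          title: "Generated illustration",
        },
      ],
    };

The frontend can treat a spec like this purely as data, mapping known element types onto React components.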

How we built it

The application is a modern web app that orchestrates several Gemini models on the client side.

  • Frontend: Built with React and TypeScript, using Tailwind CSS for rapid and responsive styling.
  • Core AI Logic: The entire application is powered by the Google Gemini API via the @google/genai JavaScript SDK.
  • Real-time Voice: The heart of the application is the Gemini Live API (ai.live.connect). This enables the low-latency, bidirectional streaming of audio that makes the conversation feel so responsive. I used the browser's MediaStream and Web Audio API to capture microphone input, process it into the required PCM format, and play back the AI's audio response.
  • Procedural UI: The magic of the auto-generating UI comes from meticulous system prompt engineering. The AI is instructed to use a generate_procedural_ui tool whenever it needs to display visual information. It constructs a JSON object that defines the layout, theme, and elements of the UI, which the React frontend then parses and renders into components (see the sketch after this list).
  • Multi-modality: I integrated a suite of different Gemini models, each specialized for a specific task: gemini-2.5-flash-native-audio-preview-09-2025 for the live session, imagen-4.0-generate-001 for images, veo-3.1-fast-generate-preview for video, and other gemini-2.5-flash and gemini-2.5-pro variants for search, reasoning, and chat.
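
As a rough, hedged sketch of how the pieces above might be wired together with the @google/genai SDK: the model name and ai.live.connect come from this project, but the config fields, the tool's parameter schema, and the callback shape shown here are assumptions based on the public SDK, not the project's actual code.

    import { GoogleGenAI, Modality, Type } from "@google/genai";

    const ai = new GoogleGenAI({ apiKey: process.env.API_KEY });

    // Hypothetical declaration of the procedural UI tool; the real schema may differ.
    const generateProceduralUi = {
      name: "generate_procedural_ui",
      description: "Render or replace the on-screen UI from a complete JSON spec.",
      parameters: {
        type: Type.OBJECT,
        properties: {
          layout: { type: Type.STRING },
          theme: { type: Type.STRING },
          elements: { type: Type.ARRAY, items: { type: Type.OBJECT } },
        },
        required: ["layout", "elements"],
      },
    };

    const session = await ai.live.connect({
      model: "gemini-2.5-flash-native-audio-preview-09-2025",
      config: {
        responseModalities: [Modality.AUDIO],
        systemInstruction: "You are Elite Conversational AI ...", // full system prompt elided
        tools: [{ functionDeclarations: [generateProceduralUi] }],
      },
      callbacks: {
        onmessage: (message) => {
          // Audio chunks and tool calls both arrive here; see the data flow below.
        },
        onerror: (err) => console.error("Live session error:", err),
      },
    });

In the real app this sits inside eliteConversationalAIService, which also pipes microphone audio into the session and routes tool calls back to the UI layer.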

Architecture

The application is built on a client-centric architecture that directly communicates with the Google Gemini API, leveraging the browser's native capabilities for real-time interaction.

      +-----------------------------+
      |       User (Voice)          |
      +--------------+--------------+
                     |
                     v
      +-----------------------------+
      | Browser (React, Web Audio)  |
      +--------------+--------------+
                     |
+--------------------v--------------------+
|       Elite AI Service (TypeScript)     |
| - Manages session                       |
| - Handles audio processing & playback   |
| - Executes tool calls                   |
+--------------------+--------------------+
                     | (SDK)
                     v
      +-----------------------------+
      |     Google Gemini API       |
      +--------------+--------------+
                     |
       +-------------+-------------+
       |                           |
       v                           v
+--------------+       +--------------------------+
| Gemini Live  | <----> | Other Gemini Models     |
| - Audio I/O  |       | - Imagen 4 (Images)      |
| - Transcribe |       | - Veo (Video)            |
| - Tool Calls |       | - Search/Maps Grounding  |
+--------------+       | - Pro (Reasoning)        |
                       +--------------------------+

Data Flow

  1. User Interaction: The user speaks to the interface.
  2. Audio Capture: The VoiceController component uses getUserMedia and the Web Audio API to capture microphone input at 16 kHz (a capture sketch follows this list).
  3. Streaming to Gemini: The raw audio data is sent to the eliteConversationalAIService, which streams it in real-time to the Gemini Live API.
  4. AI Processing: The Gemini Live session transcribes the user's speech, understands the intent based on the System Prompt, and orchestrates a response. It may call other specialized Gemini models like Imagen 4 or Veo via tool use.
  5. Streaming Response: The Gemini Live API streams back two things simultaneously:
    • Audio Output: The AI's spoken response as raw audio data.
    • Function Calls: JSON objects representing actions to be taken, like generating a UI.
  6. Client-Side Handling:
    • The eliteConversationalAIService receives the audio and queues it for playback using the Web Audio API.
    • It executes the function calls. For a generate_procedural_ui call, it passes the JSON specification to the EliteInterface component.
  7. UI Rendering: The ProceduralUIRenderer React component takes the JSON spec and dynamically renders the corresponding UI elements, applying themes and animations.
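
A minimal sketch of what a ProceduralUIRenderer-style component could look like for step 7. The spec types below are hypothetical (matching the example spec shown earlier), not the project's real schema.

    import React from "react";

    // Hypothetical spec types; the project's real schema may differ.
    interface UIElementSpec {
      type: "heading" | "text" | "image" | "card";
      title?: string;
      subtitle?: string;
      text?: string;
      src?: string;
    }
    interface UISpec {
      layout: string;
      theme?: string;
      elements: UIElementSpec[];
    }

    // Map each element of the JSON spec onto a React element.
    export function ProceduralUIRendererSketch({ spec }: { spec: UISpec }) {
      return (
        <div className={`layout-${spec.layout} theme-${spec.theme ?? "default"}`}>
          {spec.elements.map((el, i) => {
            switch (el.type) {
              case "heading":
                return <h2 key={i}>{el.title}</h2>;
              case "text":
                return <p key={i}>{el.text}</p>;
              case "image":
                return <img key={i} src={el.src} alt={el.title ?? ""} />;
              case "card":
                return (
                  <div key={i} className="card">
                    <h3>{el.title}</h3>
                    {el.subtitle && <p>{el.subtitle}</p>}
                  </div>
                );
              default:
                return null;
            }
          })}
        </div>
      );
    }

Because the AI always sends a complete spec rather than a partial patch (see the prompt-engineering challenge below), a renderer like this can simply re-render from scratch on every generate_procedural_ui call.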
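
Steps 2 and 3 describe the capture path. As a sketch, this is one common way to grab microphone audio and convert it to 16 kHz, 16-bit PCM in the browser; the real VoiceController may differ (for example, it may use an AudioWorklet instead of the deprecated ScriptProcessorNode, or resample manually instead of requesting a 16 kHz AudioContext).

    // Capture mic input and hand 16 kHz, 16-bit PCM chunks to a callback.
    async function captureMicrophone(onPcmChunk: (chunk: Int16Array) => void) {
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
      // Ask the context for 16 kHz so no manual resampling is needed here.
      const audioCtx = new AudioContext({ sampleRate: 16000 });
      const source = audioCtx.createMediaStreamSource(stream);

      // ScriptProcessorNode is deprecated but keeps the sketch short.
      const processor = audioCtx.createScriptProcessor(4096, 1, 1);
      processor.onaudioprocess = (event) => {
        const float32 = event.inputBuffer.getChannelData(0); // samples in [-1, 1]
        const int16 = new Int16Array(float32.length);
        for (let i = 0; i < float32.length; i++) {
          const s = Math.max(-1, Math.min(1, float32[i]));
          int16[i] = s < 0 ? s * 0x8000 : s * 0x7fff; // scale to 16-bit range
        }
        // The service base64-encodes chunks like this and streams them to Gemini Live.
        onPcmChunk(int16);
      };

      source.connect(processor);
      processor.connect(audioCtx.destination);
    }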

Challenges we ran into

  • Real-time Audio Handling: Processing raw PCM audio streams in the browser was a significant hurdle. The native AudioContext.decodeAudioData method expects complete, encoded audio files rather than raw PCM chunks, so I had to write a custom decoder and a carefully timed playback queue to keep the AI's voice smooth and gap-free (a condensed sketch appears after this list).
  • Complex Prompt Engineering: Designing a system prompt that could reliably teach the AI to control its own UI was an iterative process. It needed to understand when to use the UI, how to format the JSON specification correctly, and—most importantly—how to update the UI by sending a complete new spec rather than a partial patch.
  • State Synchronization: With so many asynchronous events happening—user voice input, streaming AI transcription, streaming AI audio output, and UI updates—keeping the application state consistent and avoiding race conditions was a major focus.
  • Robust Error Handling: The various APIs can fail for different reasons (e.g., an invalid API key, or a specific advanced API like Veo not being enabled for a user's project). I had to build specific error handling to catch these cases and provide clear, helpful feedback to the user.
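
For the audio-handling challenge above, here is a condensed sketch of the kind of playback queue described: each raw PCM chunk is decoded by hand into an AudioBuffer and scheduled to start exactly when the previous chunk ends. The 24 kHz sample rate and the scheduling policy shown here are assumptions, not details taken from the project.

    // Gapless playback of streamed raw PCM chunks (sketch).
    class PcmPlaybackQueue {
      private ctx = new AudioContext();
      private nextStartTime = 0;

      enqueue(pcm: Int16Array, sampleRate = 24000) {
        // decodeAudioData expects a complete encoded file, so decode raw PCM manually.
        const float32 = new Float32Array(pcm.length);
        for (let i = 0; i < pcm.length; i++) {
          float32[i] = pcm[i] / 0x8000; // back to [-1, 1]
        }
        const buffer = this.ctx.createBuffer(1, float32.length, sampleRate);
        buffer.copyToChannel(float32, 0);

        const source = this.ctx.createBufferSource();
        source.buffer = buffer;
        source.connect(this.ctx.destination);

        // Schedule each chunk to start exactly when the previous one ends.
        const startAt = Math.max(this.nextStartTime, this.ctx.currentTime);
        source.start(startAt);
        this.nextStartTime = startAt + buffer.duration;
      }
    }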

Accomplishments that we're proud of

  • The Procedural UI Engine: This is the project's cornerstone. Watching the interface assemble itself based on a simple voice command is incredibly satisfying and feels like a genuine step forward for user interfaces.
  • Fluid Voice Conversation: The latency is low enough that you can have a natural, back-and-forth conversation without awkward pauses. The streaming transcriptions that appear as you and the AI speak add to this feeling of immediacy.
  • Seamless Tool Integration: The AI doesn't just "use tools"; it integrates them into the conversational and visual flow. Generating a video or performing a search feels like a natural extension of the dialogue, with the results appearing directly in the UI.

What we learned

  • Modern multi-modal LLMs are incredibly powerful, but their true potential is unlocked by sophisticated client-side applications that fully leverage capabilities like function calling and streaming.
  • A great system prompt is not just a set of instructions; it's the constitution of your AI application. It defines its personality, capabilities, and constraints.
  • Deep diving into browser APIs like the Web Audio API is essential for building rich, real-time media experiences. You can't always rely on high-level libraries for cutting-edge features.

What's next for Elite Conversational AI

  • Deeper Personalization: I plan to enhance the AI's ability to learn user preferences across sessions, automatically tailoring UI themes, layouts, and even the tone of its responses.
  • More Advanced UI Components: I want to empower the AI to generate more complex components like interactive charts, data grids, and editable forms.
  • Direct UI Manipulation: The next frontier is allowing users to modify the UI with voice commands, such as "Move that image to the left," or "Make this title bigger."
  • Collaborative Workflows: I envision features where the AI can act as a collaborative partner, for example, by helping a developer write and display code snippets in real-time or helping a designer brainstorm layouts.

About

Just one of those with a touch of uniqueness
