Inspiration

Like many other computer science students, we often found ourselves struggling with two things: hair loss from stress and maintaining meaningful relationships. This inspired us to build Zelta, which aims to address at least one of those issues: an AI companion that understands you, learns from you, and grows with you. (We're still working on a solution for the hair loss!)

What it does

Zelta harnesses the power of Azure AI to connect users with their ideal AI companions, offering far more than just emotionally intelligent conversations. Users start by customizing their ideal AI boyfriend or girlfriend, choosing their character, voice, and gender. Once set up, they can chat with their AI companion. Every conversation is analyzed for keywords and fed back into our Azure database, where Retrieval-Augmented Generation (RAG) makes Zelta smarter and more personalized over time. Note that the AI's knowledge base is preloaded with information from the user's profile; any profile updates override previous data and are incorporated into the RAG pipeline for improved context.

Beyond being the best companion, Zelta can transform into a personal mentor. For example, in its unique ‘Interviewer Mode’ (activated simply by saying the keyword), it adopts the tone and style of a real interviewer, helping users practice coding challenges by guiding them through the problem-solving process without giving away the answers (a life-saver for us CS students).

Zelta also acts as a personal assistant: it integrates with Google Calendar to manage schedules, automates customer service calls (no more 40-minute hold times!), and even finds restaurants based on your preferences. Craving Italian or planning a special dinner? Zelta's got it covered, from reservations to food orders.

How we built it

We began this project with only a rough concept of what Zelta could become, focusing on a flexible and scalable solution from the ground up. To translate this idea into actionable steps, we leveraged GitHub Copilot in Visual Studio Code (VS Code) to kickstart the development process by generating skeleton code. By providing boilerplate code and structural suggestions, it allowed us to set up the basic architecture without having to start from scratch. This included initializing essential files, creating function templates, and scaffolding core modules.

By analyzing our initial ideas, Copilot suggested logical first steps for approaching complex problems. It helped identify the sequence of tasks and created outlines for functionality like Autogen multi-agent interactions and integrations with other Azure services.

One of our project's core components was designing effective system prompts and agent instructions. Copilot helped craft these initial prompts by analyzing our intent and generating clear, structured, and context-aware text. Beyond generating code, Copilot suggested refinements for system messages, improving their clarity and ensuring they were tailored for optimal performance when passed to various agents. This included rewording instructions to be concise yet comprehensive and ensuring compatibility with AI models. Lastly, Copilot contributed to writing better agent-specific instructions by aligning them with the project's objectives. It helped ensure that agents like the Intent Classifier or Web Search Agent received well-constructed and precise guidance, which improved their effectiveness.

Agentic AI Pipeline

https://github.com/user-attachments/assets/42673008-f479-42a9-b8e6-789424a79ab8

Our backbone is the Agentic AI Pipeline, built on the Autogen architecture. This multi-agent system employs a network of specialized agents, each designed for a specific task: enhancing clarity, processing user intents, retrieving information, and refining responses. These agents work together to deliver seamless interactions that adapt to the context of each conversation. A typical exchange flows through the following steps (a code sketch of this wiring follows the list):

1. The user's message is sent to the User Proxy, which represents the user within the system.

2. The message is used to query Cosmos DB, where chat history is stored.

  • The query is passed into the Azure OpenAI Embedding Model to generate embeddings.

  • These embeddings are used for vector searches to find similar past messages.

3. The Reformulate Agent uses recent and similar messages to transform the user’s query into a standalone message that can be understood without prior chat context.

4. The original and reformulated queries are passed to the Intent Classifier Agent, which identifies the user's intent. Having both versions as context increases the chance of classifying the intent correctly. Possible intents:

  • Web Search: For real-time data retrieval.

  • Reading Uploaded Documents (RAG): For document retrieval.

  • General Query: For standard responses.

5. If the intent is reading uploaded documents, no user-specific information is needed, so the Document Reading Agent processes the query against the “uploaded docs” index on Azure AI Search. For all other intents, it processes the query against the “user information” index to retrieve relevant user information (preferences, persona, etc.).

  • The reformulated query is embedded into vectors.

  • Azure AI Search executes a semantic search to retrieve user-relevant information to aid in formulating the response.

6. If the intent is not web search, the search result is passed to the Conversation Agent (the virtual partner). If the intent is web search, the query is passed to the Web Search Agent for real-time internet data retrieval, and the results are appended to the response prompt for the virtual partner.

7. The Conversation Agent synthesizes an initial response for the User Proxy using:

  • Recent and similar messages.

  • Extracted user information.

  • Uploaded document data.

  • Internet search results (if applicable).

8. The Relationship Consulting Agent reviews the response to ensure it is concise, relevant to the user's query, and human-like in tone and style.

9. The Conversation Agent refines the response based on this feedback.

10. The final response is synthesized into voice format.

11. The system delivers the voice response to the user.
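To make the wiring above concrete, here is a minimal sketch of how such an agent network can be assembled with pyautogen. The system messages, Azure OpenAI deployment name, and credentials are illustrative placeholders, not our production prompts or configuration:

```python
# A minimal sketch of the agent network using pyautogen. System messages and
# config values below are placeholders, not our production setup.
import autogen

llm_config = {"config_list": [{
    "model": "gpt-4o",                                   # placeholder deployment name
    "api_type": "azure",
    "base_url": "https://<resource>.openai.azure.com/",  # placeholder endpoint
    "api_key": "<AZURE_OPENAI_KEY>",                     # placeholder key
    "api_version": "2024-02-15-preview",
}]}

# Step 1: represents the user inside the multi-agent system.
user_proxy = autogen.UserProxyAgent(
    name="user_proxy", human_input_mode="NEVER", code_execution_config=False,
)

# Step 3: rewrites the latest message into a standalone query.
reformulate_agent = autogen.AssistantAgent(
    name="reformulate_agent",
    system_message="Rewrite the latest user message as a standalone question, "
                   "using the recent and similar messages provided as context.",
    llm_config=llm_config,
)

# Step 4: labels the query as web_search, document_rag, or general.
intent_classifier = autogen.AssistantAgent(
    name="intent_classifier",
    system_message="Classify the user's intent as exactly one of: "
                   "web_search, document_rag, general. Reply with the label only.",
    llm_config=llm_config,
)

# Steps 7 and 9: drafts the companion's reply, then refines it after review.
conversation_agent = autogen.AssistantAgent(
    name="conversation_agent",
    system_message="You are the user's virtual partner. Reply warmly and concisely, "
                   "grounding your answer in the retrieved context you are given.",
    llm_config=llm_config,
)

# Step 8: critiques drafts for concision, relevance, and human-like tone.
relationship_consultant = autogen.AssistantAgent(
    name="relationship_consulting_agent",
    system_message="Review the draft reply. Is it concise, relevant to the query, "
                   "and human-like in tone? Suggest concrete edits.",
    llm_config=llm_config,
)

# A group chat lets the agents hand work to each other in sequence.
group_chat = autogen.GroupChat(
    agents=[user_proxy, reformulate_agent, intent_classifier,
            conversation_agent, relationship_consultant],
    messages=[], max_round=8,
)
manager = autogen.GroupChatManager(groupchat=group_chat, llm_config=llm_config)

user_proxy.initiate_chat(manager, message="Remember that pasta place you suggested?")
```

The retrieval steps (2, 5, and 6) run against Cosmos DB, Azure AI Search, and the web outside this chat loop; sketches of those pieces appear in the sections below.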

Emotionally Intelligent Conversations

We integrated memory and feedback mechanisms to craft emotionally resonant responses. This feature ensures that each AI companion feels more human and relatable, capable of understanding and reacting to the user’s emotional state.

Time-Aware Responses

Real-time awareness is crucial for maintaining the authenticity of interactions. Our system ensures that responses are not only contextually appropriate but also timely, enhancing the conversational experience to be as natural as possible.
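As a simple illustration of one way to achieve this (our production mechanism may differ), the current timestamp can be injected into the system prompt before each turn; time_aware_system_prompt is a hypothetical helper:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def time_aware_system_prompt(base_prompt: str, tz: str = "America/New_York") -> str:
    """Prepend the current local time so the model can reason about 'now'."""
    now = datetime.now(ZoneInfo(tz))
    stamp = now.strftime("%A, %B %d, %Y at %I:%M %p %Z")
    return f"Current date and time: {stamp}.\n{base_prompt}"

# Example: the companion now knows whether it's morning or a late-night chat.
print(time_aware_system_prompt("You are Zelta, the user's virtual partner."))
```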

Speech-to-Speech Pipeline

We significantly improved the user experience by implementing a sophisticated speech-to-speech pipeline, which enables seamless, natural, and interactive conversations between users and the AI. When the user speaks, their voice input is captured and processed by Azure Speech Services. The service transcribes the spoken words in real-time, dynamically detecting pauses or silence to determine the end of the input. This ensures that partial or incomplete inputs are avoided, allowing for accurate and context-aware processing.

The response is passed through Azure Speech Services for text-to-speech synthesis, producing a natural and emotionally expressive voice output. The system supports nuanced linguistic variations, ensuring the synthesized voice sounds human-like, with tone and inflection that match the conversational context.
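A minimal sketch of this transcribe-then-synthesize loop using the Azure Speech SDK (the azure-cognitiveservices-speech package); the key, region, voice name, and generate_reply function are placeholders:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="<SPEECH_KEY>", region="<SPEECH_REGION>"  # placeholder credentials
)
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"  # example voice

# Speech-to-text: listen on the default microphone; recognition ends when the
# service detects silence, which avoids processing partial inputs.
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
result = recognizer.recognize_once_async().get()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    user_text = result.text
    reply_text = generate_reply(user_text)  # hypothetical call into the agent pipeline

    # Text-to-speech: speak the pipeline's reply through the default speaker.
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
    synthesizer.speak_text_async(reply_text).get()
```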

To further enhance immersion, we integrated Rhubarb Lip Sync technology. This tool synchronizes the synthesized speech with realistic lip movements, enabling visually accurate articulation of the AI’s spoken responses.
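Rhubarb is a command-line tool, so invoking it from Python looks roughly like the following sketch (file names are illustrative; -f json and -o are Rhubarb's documented flags):

```python
import json
import subprocess

# Rhubarb analyzes a WAV file and emits timed mouth-shape (viseme) cues.
subprocess.run(
    ["rhubarb", "-f", "json", "-o", "visemes.json", "reply.wav"],
    check=True,
)

with open("visemes.json") as f:
    cues = json.load(f)["mouthCues"]
# Each cue looks like {"start": 0.00, "end": 0.05, "value": "X"} and drives
# the avatar's mouth shape for that time span.
```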

Chat History Management

To maintain context without requiring continuous user input, we store chat history in Azure Cosmos DB for PostgreSQL with the pgvector extension. This setup lets the Reformulate Agent refer to previous messages and reformulate questions into standalone queries the AI can understand, enhancing the relevance of responses. Similar past messages are also retrieved from the same database to give the Conversation Agent context, improving its ability to generate accurate and relevant responses.
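A minimal sketch of this store-and-retrieve pattern, assuming psycopg2 and the pgvector Python adapter; the table schema, column names, and embedding deployment are illustrative:

```python
import numpy as np
import psycopg2
from openai import AzureOpenAI
from pgvector.psycopg2 import register_vector

openai_client = AzureOpenAI(
    azure_endpoint="https://<resource>.openai.azure.com/",  # placeholder endpoint
    api_key="<AZURE_OPENAI_KEY>",                           # placeholder key
    api_version="2024-02-15-preview",
)

def embed(text: str) -> np.ndarray:
    """Embed text with an Azure OpenAI embedding deployment (name is a placeholder)."""
    resp = openai_client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

conn = psycopg2.connect("<COSMOS_POSTGRES_CONNECTION_STRING>")  # placeholder
register_vector(conn)  # teach psycopg2 to send/receive pgvector values
cur = conn.cursor()

# Store a new chat message together with its embedding.
message = "Let's plan a dinner date for Friday"
cur.execute(
    "INSERT INTO chat_history (role, content, embedding) VALUES (%s, %s, %s)",
    ("user", message, embed(message)),
)
conn.commit()

# Retrieve the 5 most similar past messages for the Reformulate and
# Conversation Agents (<=> is pgvector's cosine-distance operator).
cur.execute(
    "SELECT content FROM chat_history ORDER BY embedding <=> %s LIMIT 5",
    (embed("dinner plans"),),
)
similar_messages = [row[0] for row in cur.fetchall()]
```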

Continuous Learning of User Information with RAG

Our AI continuously learns from each interaction: it remembers user preferences, milestones, and habits, personalizing future interactions and making each conversation more relevant and tailored to the individual. We use Retrieval-Augmented Generation (RAG) for this continuous learning. Memory agents extract useful information (such as the user's hobbies, preferences, and key events) from conversation messages, which is then uploaded to Azure AI Search and Blob Storage for future reference. During conversations, the Document Reading Agent executes semantic searches over this user data, ensuring personalized responses.
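A sketch of this upload-and-search cycle with the azure-search-documents SDK; the index name mirrors the “user information” index described earlier, while document fields and keys are illustrative (embed reuses the helper from the previous sketch):

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint="https://<search-resource>.search.windows.net",  # placeholder
    index_name="user-information",                            # per the pipeline above
    credential=AzureKeyCredential("<SEARCH_KEY>"),            # placeholder key
)

# A memory agent extracted this fact from conversation; persist it for later turns.
fact = "User prefers Italian food and is preparing for coding interviews."
search_client.upload_documents(documents=[{
    "id": "fact-001",
    "content": fact,
    "content_vector": list(embed(fact)),  # embed() from the previous sketch
}])

# At response time, the Document Reading Agent runs a vector search over user data.
results = search_client.search(
    search_text=None,
    vector_queries=[VectorizedQuery(
        vector=list(embed("Where should we eat tonight?")),
        k_nearest_neighbors=3,
        fields="content_vector",
    )],
)
retrieved_context = [doc["content"] for doc in results]
```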

Challenges we ran into & Accomplishments that we're proud of

  • Real-time responsiveness was difficult, integrating multiple agents was complex, and at times we doubted the emotional capabilities of our AI. Throughout our journey, we learned a great deal about Azure's AI and cloud services, which empowered us to integrate advanced features like the speech-to-speech pipeline and continuous learning systems. Azure's robust infrastructure made it possible to create a seamless and efficient platform that powers Zelta's dynamic interactions.

  • Designing a pipeline in which multiple agents continuously learn and refine responses, and monitoring those responses, was also challenging. We tested different packages from Autogen and compared their outputs, but integrating them remained a challenge, so we built new tools to help agents better retrieve the relevant information and context for building responses.

  • We also leveraged GitHub Copilot during development, which significantly accelerated our coding process and simplified complex tasks, allowing us to focus on innovation. Paired with VS Code, it gave us an intuitive and efficient environment for collaborative development, streamlining our workflow.

  • When working with Three.js to create interactive 3D visuals, Copilot was particularly helpful in debugging complex issues. These 3D visuals enable users to customize their AI companion in real-time, adding a visually immersive and engaging layer to the Zelta experience.

What's next for Zelta

We are now working on integrating real-time mood detection based on voice analysis. Going beyond voice by incorporating facial recognition (via video input) and text sentiment analysis will enhance Zelta's ability to respond to users' emotions in a more nuanced and empathetic way. We also plan to let users shape Zelta's personality traits (e.g., humor, formality, empathy levels) to suit individual preferences.

We also aim to improve system latency to ensure smoother, faster, real-time interactions. Optimizing backend processes and refining our infrastructure are key priorities.

On the frontend, we plan to add more features to enhance user experience. This includes improved customization options for AI companions, interactive 3D environments, and a more intuitive interface. These updates will make Zelta more engaging and user-friendly. In the near future, we would like to introduce gesture recognition to allow users to interact with Zelta via body language or hand movements.

Our goal is to continuously innovate, bringing Zelta ever closer to being the ultimate AI companion.

Built With

Azure OpenAI, Azure AI Search, Azure Speech Services, Azure Cosmos DB for PostgreSQL (pgvector), Azure Blob Storage, Autogen, GitHub Copilot, VS Code, Three.js, Rhubarb Lip Sync, Google Calendar API
