💡 Inspiration
We are living in the golden age of LLMs, yet most of them are trapped inside a browser tab or a chatbox. We asked ourselves: "Why can our AI write Shakespeare, yet it can't see what's on my screen or organize my files without a complex integration?"
The inspiration for Jarvis_FINAL came from the desire to break the "fourth wall" between the AI and the Operating System. We wanted to move from Generative AI (which creates text) to Agentic AI (which executes actions). We envisioned an entity that lives natively in the Linux environment, sees what we see, hears what we hear, and acts as a true co-pilot, not just a consultant.
🤖 What it does
Jarvis is a local, autonomous agent built on the military OODA Loop doctrine (Observe, Orient, Decide, Act). Unlike standard assistants that wait for a prompt, Jarvis runs in a continuous asynchronous loop (sketched in code after the list):
- Observe: It constantly ingests multimodal data (Screenshots of the desktop, Microphone audio, System Logs) using local hardware hooks.
- Orient: It uses Gemini 3's Multimodal capabilities to understand the context. It doesn't just read text; it sees error messages on the terminal or hears a command.
- Decide: Leveraging Gemini 3's reasoning capabilities, it formulates a plan. It decides if it needs to act or just keep watching.
- Act: It executes code, runs terminal commands, or manipulates files directly on the host machine.
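A bare-bones sketch of that control flow, with placeholder coroutines standing in for the real phases (none of these names come from the actual codebase):

```python
import asyncio

async def observe() -> dict:
    """Gather multimodal context: screenshots, microphone audio, system logs (placeholder)."""
    return {"screen": None, "audio": None, "logs": ""}

async def orient_and_decide(context: dict) -> str | None:
    """Ask the model for a plan; returning None means 'keep watching' (placeholder)."""
    return None

async def act(plan: str) -> None:
    """Execute the plan: shell commands, file edits, etc. (placeholder)."""

async def ooda_loop() -> None:
    while True:
        context = await observe()                 # Observe
        plan = await orient_and_decide(context)   # Orient + Decide
        if plan is not None:                      # the agent may decide not to act
            await act(plan)                       # Act
        await asyncio.sleep(1)                    # pacing between iterations

if __name__ == "__main__":
    asyncio.run(ooda_loop())
```

The `if plan is not None` branch captures the "act or just keep watching" decision described above.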
⚙️ How we built it
The project is a hybrid of high-performance local computing and cloud-based reasoning, running on an Alienware M18 (i9 + RTX 4070) under native Ubuntu Linux.
- The Core: Written in Python using `asyncio` to ensure the "sensory" threads (Vision/Audio) never block the "cognitive" thread (Gemini).
- The Brain: We utilized the Google GenAI SDK to interface with Gemini 3. We specifically leverage the model's large context window to maintain a "Short-Term Memory" of the session logs.
- The Body: We used standard Linux libraries (`cv2` for vision, `sounddevice` for audio, `subprocess` for action) to give the AI hands and eyes; a minimal sketch of these hooks follows the list.
- The Interface: A rich terminal UI (built with the `rich` library) that visualizes the agent's thought process in real time.
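As a rough illustration of those hooks, here is a minimal sketch assuming the vision hook reads a camera frame via `cv2` (a desktop screenshot grabber could slot in the same way), a short microphone recording via `sounddevice`, and shell execution via `subprocess`; the helper names are ours, not the project's:

```python
import subprocess

import cv2
import sounddevice as sd

def capture_frame():
    """Vision hook: grab a single frame from the default camera (hypothetical helper)."""
    cap = cv2.VideoCapture(0)
    ok, frame = cap.read()
    cap.release()
    return frame if ok else None

def capture_audio(seconds: float = 2.0, samplerate: int = 16000):
    """Hearing hook: record a short mono clip from the microphone (hypothetical helper)."""
    audio = sd.rec(int(seconds * samplerate), samplerate=samplerate, channels=1)
    sd.wait()  # block until the recording finishes
    return audio

def run_command(cmd: str) -> subprocess.CompletedProcess:
    """Action hook: execute a shell command on the host and capture its output."""
    return subprocess.run(cmd, shell=True, capture_output=True, text=True)
```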
🧠 Gemini 3 Integration
Gemini 3 is not just a backend; it is the Cognitive Core of the architecture.
- We use Gemini 3 Pro for complex reasoning (The "Decide" phase), passing it a structured context of the system's state.
- We rely on its Multimodal nature to interpret screenshots and logs simultaneously, reducing the need for complex OCR or pre-processing code. The model simply "looks" at the problem (a sketch of such a call is shown below).
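A hedged sketch of what such a multimodal request can look like with the Google GenAI SDK; the model ID, prompt, and helper name are placeholders, not the project's actual code:

```python
from google import genai
from google.genai import types

client = genai.Client()  # expects GOOGLE_API_KEY (or GEMINI_API_KEY) in the environment

def decide(screenshot_png: bytes, recent_logs: str) -> str:
    """Send the current screen plus recent logs in one request and return the model's plan."""
    response = client.models.generate_content(
        model="gemini-3-pro-preview",  # placeholder model ID
        contents=[
            types.Part.from_bytes(data=screenshot_png, mime_type="image/png"),
            f"Recent system logs:\n{recent_logs}\n\n"
            "Decide whether to act. If so, reply with a single bash command.",
        ],
    )
    return response.text
```

Passing the raw PNG bytes as an image part lets the model read the screen directly, which is what removes the need for a separate OCR step.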
🚧 Challenges we ran into
The biggest challenge was Latency vs. Concurrency.
- Problem: When the agent sends a request to the LLM, the local script typically "freezes," leaving the agent blind and deaf for a few seconds.
- Solution: We engineered a robust asynchronous architecture. The `SensoryInputManager` runs on a separate loop from the `CognitiveCore`. This allows Jarvis to continue "recording" the environment even while waiting for Gemini's response, ensuring no context is lost. A simplified version of this pattern is sketched below.
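A simplified, self-contained sketch of that pattern, with stub coroutines simulating capture and API latency (the real `SensoryInputManager`/`CognitiveCore` are more involved):

```python
import asyncio

sensor_buffer: list[str] = []        # shared short-term memory of observations

async def observe_once() -> str:
    """Placeholder sensor read (screenshot/audio capture would happen here)."""
    await asyncio.sleep(0.1)         # a real capture also yields control like this
    return "observation"

async def ask_gemini(context: list[str]) -> str:
    """Placeholder for the slow network call to the model."""
    await asyncio.sleep(3)           # simulate API latency
    return f"plan based on {len(context)} observations"

async def sensory_loop() -> None:
    # Keeps sampling the environment no matter what the cognitive core is doing.
    while True:
        sensor_buffer.append(await observe_once())
        await asyncio.sleep(0.5)

async def cognitive_loop() -> None:
    # Awaiting the model here does NOT stop the sensory loop above.
    while True:
        context = list(sensor_buffer)
        sensor_buffer.clear()
        print(await ask_gemini(context))
        await asyncio.sleep(1)

async def main() -> None:
    # Both coroutines share one event loop and run concurrently.
    await asyncio.gather(sensory_loop(), cognitive_loop())

if __name__ == "__main__":
    asyncio.run(main())
```

Because both loops only `await` and never block, the slow model call never stops the sensors from filling the buffer.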
🏆 Accomplishments that we're proud of
- Building a truly autonomous loop that doesn't crash on the first error.
- Implementing a Self-Correction mechanism: if Jarvis tries a bash command that fails, it feeds the `stderr` output back into Gemini 3, which analyzes the error and suggests a fix in the next iteration (a sketch follows the list).
- Running this natively on Linux, showing that high-end consumer hardware plus the Gemini API is a viable path for AGI development.
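A minimal sketch of that feedback loop, using a stub `ask_gemini` helper in place of the real model call (the stub and prompt wording are illustrative assumptions):

```python
import subprocess

def ask_gemini(prompt: str) -> str:
    """Stub for the real Gemini 3 call; should return a single bash command."""
    return "echo 'stub command'"     # stand-in so the sketch runs end to end

def run_with_self_correction(task: str, max_attempts: int = 3) -> str:
    prompt = f"Task: {task}\nReply with a single bash command."
    for _ in range(max_attempts):
        command = ask_gemini(prompt)
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout                         # success: return the output
        # Failure: feed stderr back so the next iteration can propose a fix.
        prompt += (
            f"\nThe command `{command}` failed with stderr:\n{result.stderr}\n"
            "Suggest a corrected command."
        )
    return "Gave up after repeated failures."

print(run_with_self_correction("list the largest files in /var/log"))
```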
🚀 What's next for Jarvis_FINAL
- Long-Term Memory: Integrating a Vector Database (ChromaDB) so Jarvis remembers preferences across reboots (one possible shape of this is sketched after the list).
- Active Computer Vision: Giving Jarvis control over the mouse and keyboard to interact with GUI applications, not just the terminal.
- Voice Synthesis: Making Jarvis speak back with low-latency TTS.
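One possible shape of that long-term memory layer, assuming ChromaDB's `PersistentClient` with its default embedding function; the path, collection name, and helpers are placeholders:

```python
import chromadb

# A persistent store survives reboots, unlike the in-session context window.
client = chromadb.PersistentClient(path="./jarvis_memory")   # placeholder path
memories = client.get_or_create_collection(name="preferences")

def remember(text: str, memory_id: str) -> None:
    """Store a preference or observation for future sessions."""
    memories.add(documents=[text], ids=[memory_id])

def recall(query: str, k: int = 3) -> list[str]:
    """Retrieve the memories most relevant to the current situation."""
    results = memories.query(query_texts=[query], n_results=k)
    return results["documents"][0]

remember("The user prefers dark mode and zsh.", "pref-001")
print(recall("which shell does the user like?", k=1))
```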