AI Browser - Human-in-the-loop Navigator

Inspiration

We've all been there: trying to help a friend or family member navigate a complex website over the phone, or struggling to remember the exact steps to complete an online task we only do once a year. Traditional screen recorders capture what happens but don't teach how to do it. Automation tools complete tasks but leave users dependent and unable to learn.

We were inspired to create something different: an AI-powered browser that doesn't just do things for you, but teaches you how to do them. By combining the intelligence of AI with the irreplaceable human element of learning and decision-making, we built a tool that bridges the gap between automation and education.

What it does

AI Browser is an intelligent, human-in-the-loop navigation assistant that transforms complex web workflows into guided learning experiences. Here's what makes it special:

🎯 Intelligent Task Planning

  • Users describe what they want to accomplish in natural language
  • AI analyzes the task and breaks it down into clear, manageable steps
  • Creates a visual roadmap of the entire workflow

🧭 Multi-Mode Learning System

  1. Plan Mode: AI generates and refines step-by-step instructions through a conversational interface
  2. Explore Mode: Users discover interface elements with AI assistance, building a shared understanding
  3. Tutorial Mode: AI guides users step-by-step with real-time element highlighting and contextual instructions
  4. Practice Mode: Users attempt tasks independently while the system records their actions
  5. Reflection Mode: Side-by-side comparison of correct workflow vs. user attempts, highlighting areas for improvement
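
As a rough illustration of the Reflection Mode comparison, the sketch below aligns a reference workflow with a recorded attempt using Python's standard difflib. The step strings and the compare_attempt helper are hypothetical; the project's actual recording format is not described here.

```python
from difflib import SequenceMatcher


def compare_attempt(reference, attempt):
    """Align two step lists and return (tag, expected, actual) rows.

    tag is one of difflib's opcodes: "equal", "delete" (step skipped),
    "insert" (extra step), or "replace" (different step taken).
    """
    matcher = SequenceMatcher(a=reference, b=attempt, autojunk=False)
    rows = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        rows.append((tag, reference[i1:i2], attempt[j1:j2]))
    return rows


# Hypothetical data: the user skipped the "click reserve" step.
reference = ["open maps", "search restaurant", "click reserve", "pick time", "confirm"]
attempt = ["open maps", "search restaurant", "pick time", "confirm"]

for tag, expected, actual in compare_attempt(reference, attempt):
    print(f"{tag:8} expected={expected} actual={actual}")
```

Rows tagged anything other than "equal" are the candidates to highlight as areas for improvement.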

🔍 Smart Element Detection

  • HTML Mode: Analyzes DOM structure to identify interactive elements
  • Vision Mode: Uses computer vision to locate elements when HTML matching fails
  • Fallback system ensures guidance continues even when exact elements can't be highlighted
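
The fallback order described above can be sketched as a chain of detectors tried in sequence. This is a minimal sketch under assumptions: the Guidance record, the locate function, and both stub detectors are hypothetical stand-ins, not the project's real HTML or vision code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Guidance:
    kind: str               # "highlight" when an element was found, else "card"
    source: str             # name of the detector that found the element
    target: Optional[str]   # selector or bounding box; None for card guidance
    text: str               # the instruction shown to the user


def locate(step_text, detectors):
    """Try each (name, detector) pair in order; fall back to a text card."""
    for name, detect in detectors:
        target = detect(step_text)
        if target is not None:
            return Guidance("highlight", name, target, step_text)
    return Guidance("card", "none", None, step_text)


# Stub detectors standing in for DOM analysis and screenshot analysis.
def html_detector(step: str) -> Optional[str]:
    return "#reserve-button" if "reserve" in step.lower() else None


def vision_detector(step: str) -> Optional[str]:
    return "bbox(120, 340, 80, 32)" if "time" in step.lower() else None


detectors = [("html", html_detector), ("vision", vision_detector)]
print(locate("Click Reserve", detectors))
print(locate("Pick a time", detectors))
print(locate("Confirm the booking", detectors))
```

Keeping the card as the final branch means every step yields some guidance, even when nothing can be highlighted.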

🎓 Adaptive Learning

  • Records user progress and mistakes
  • Provides contextual help when users get stuck
  • Builds confidence through progressive independence

How we built it

Technology Stack

Frontend

  • Electron: Desktop application framework with webview integration
  • Vanilla JavaScript: Fast, lightweight rendering without framework overhead
  • Custom CSS: Sophisticated grayscale theme with gradient accents and lotus branding

Backend

  • Python + Flask: RESTful API server handling AI requests
  • LLM Integration: OpenAI GPT for natural language understanding and task planning
  • Computer Vision: Image analysis for element detection fallback
  • DOM Analysis: Intelligent parsing and element matching algorithms

Key Architecture Decisions

  • Preload Scripts: Secure communication between renderer and webview
  • Session Management: Stateful agent tracking across multiple page interactions
  • IPC Communication: Efficient message passing for real-time element highlighting
  • Modular Design: Separate modules for planning, navigation, observation, and translation

Development Workflow

  1. Built core Electron shell with webview integration
  2. Developed DOM scanning and element highlighting system
  3. Integrated LLM for task planning and step generation
  4. Created multi-mode learning workflow (Plan → Tutorial → Practice → Reflect)
  5. Added computer vision fallback for robust element detection
  6. Designed and implemented grayscale UI theme with lotus branding
  7. Implemented session management for continuous task tracking

Challenges we ran into

1. Dynamic Web Content

Modern websites use complex JavaScript frameworks that constantly modify the DOM. Our initial approach of static element selection failed when pages updated. We solved this by:

  • Implementing real-time DOM observation
  • Creating resilient CSS selectors that survive page mutations
  • Adding retry logic with exponential backoff
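
A minimal sketch of retry logic with exponential backoff, assuming a hypothetical find callable that returns None until the element appears. The attempt count, base delay, and growth factor are illustrative values, not the project's settings.

```python
import time


def retry_with_backoff(find, attempts=5, base_delay=0.1, factor=2.0, sleep=time.sleep):
    """Call find() until it returns a non-None value or attempts run out.

    The wait between attempts grows geometrically: base_delay, then
    base_delay * factor, and so on. The sleep function is injectable so
    the behaviour can be checked without real waiting.
    """
    delay = base_delay
    for attempt in range(attempts):
        result = find()
        if result is not None:
            return result
        if attempt < attempts - 1:  # no point sleeping after the last try
            sleep(delay)
            delay *= factor
    return None


# Hypothetical lookup that succeeds on the third call, as if the page
# finished rendering in the meantime.
calls = {"count": 0}

def find_submit_button():
    calls["count"] += 1
    return "#submit" if calls["count"] >= 3 else None


waits = []
print(retry_with_backoff(find_submit_button, sleep=waits.append), waits)
```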

2. Element Identification Ambiguity

Matching AI-generated instructions to actual HTML elements proved surprisingly difficult. A button described as "Submit" might have aria-label="Complete form" or no text at all. Our solution:

  • Multi-criteria matching (text content, ARIA labels, placeholders, role attributes)
  • Confidence scoring system to detect low-quality matches
  • Computer vision fallback using screenshot analysis
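
One way to picture multi-criteria matching with a confidence score is the sketch below. The attribute weights, the 0.6 threshold, and the dictionary representation of elements are assumptions made for illustration, and fuzzy string similarity from difflib stands in for whatever scoring the project actually uses.

```python
from difflib import SequenceMatcher

# Hypothetical weights: visible text is trusted most, role least.
WEIGHTS = {"text": 1.0, "aria_label": 0.9, "placeholder": 0.7, "role": 0.4}


def similarity(a, b):
    """Case-insensitive similarity between two strings, in the range 0..1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score_element(description, element):
    """Best weighted similarity across all attributes the element has."""
    best = 0.0
    for attr, weight in WEIGHTS.items():
        value = element.get(attr)
        if value:
            best = max(best, weight * similarity(description, value))
    return best


def best_match(description, elements, threshold=0.6):
    """Return (element, score); element is None when confidence is too low."""
    if not elements:
        return None, 0.0
    score, element = max(
        ((score_element(description, el), el) for el in elements),
        key=lambda pair: pair[0],
    )
    return (element if score >= threshold else None), score


elements = [
    {"id": "a", "text": "Cancel", "role": "button"},
    {"id": "b", "text": "", "aria_label": "Submit form", "role": "button"},
]
print(best_match("Submit", elements))

# The case from the text: the label shares almost nothing with "Submit",
# so the match is rejected and a caller would move on to the vision fallback.
print(best_match("Submit", [{"id": "c", "aria_label": "Complete form", "role": "button"}]))
```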

3. Cross-Domain Communication

Electron's security model made it challenging to inject code into arbitrary websites. We overcame this with:

  • Careful preload script architecture
  • IPC message passing between isolated contexts
  • Sandbox permissions for network and git operations

4. State Management Complexity

Coordinating state between frontend UI, backend AI, and webview content required careful design:

  • Session-based tracking with unique IDs
  • Event-driven architecture for asynchronous updates
  • Proper cleanup and reset mechanisms
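
The session-based tracking above might look roughly like the following in the Python backend. The Session fields and SessionStore methods are hypothetical names chosen for this sketch; a real server would also need persistence and expiry.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Session:
    task: str
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    step_index: int = 0
    visited_urls: list = field(default_factory=list)


class SessionStore:
    """In-memory registry keyed by unique session ID."""

    def __init__(self):
        self._sessions = {}

    def create(self, task):
        session = Session(task=task)
        self._sessions[session.session_id] = session
        return session

    def get(self, session_id):
        return self._sessions[session_id]

    def record_navigation(self, session_id, url):
        # Keeps context across page loads within one task.
        self.get(session_id).visited_urls.append(url)

    def advance(self, session_id):
        session = self.get(session_id)
        session.step_index += 1
        return session.step_index

    def reset(self, session_id):
        # Restart the task without discarding the session itself.
        session = self.get(session_id)
        session.step_index = 0
        session.visited_urls.clear()

    def close(self, session_id):
        # Cleanup: forget the session entirely.
        self._sessions.pop(session_id, None)


store = SessionStore()
session = store.create("Reserve a table")
store.record_navigation(session.session_id, "https://www.google.com/maps")
store.advance(session.session_id)
print(store.get(session.session_id))
```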

5. UI/UX for Learning

Striking a balance between giving enough guidance and overwhelming users was tricky:

  • Created collapsible sidebar to maximize browser space
  • Designed non-intrusive element highlighting
  • Implemented progressive disclosure of information
  • Added help button for easy access during practice mode

Accomplishments that we're proud of

✨ Hybrid AI-Human Approach: We didn't just build another automation tool; we created a learning system that respects human agency while providing intelligent assistance.

🎨 Sophisticated UI Design: Our grayscale theme with subtle gradients and the custom lotus logo creates a professional, calming interface that doesn't distract from the learning experience.

🧠 Robust Fallback System: When exact element matching fails, the system gracefully degrades to card-based instructions, ensuring users never get stuck.

🔄 Complete Learning Loop: From planning to practice to reflection, we built a comprehensive system that mirrors effective educational psychology principles.

🚀 Real-World Applicability: Successfully demonstrated with complex workflows like restaurant reservations on Google Maps, a genuinely useful application.

📊 Session Continuity: The agent maintains context across multiple page navigations, understanding the full workflow rather than treating each step in isolation.

What we learned

Technical Insights

  • LLM Prompt Engineering: Crafting prompts that consistently produce structured, actionable outputs required extensive iteration
  • DOM Manipulation at Scale: Working with arbitrary websites taught us resilience patterns for unpredictable environments
  • Electron Architecture: Deep understanding of Electron's security model, preload scripts, and IPC communication
  • Computer Vision Integration: Learned to combine traditional CV with modern AI for hybrid element detection
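
For the structured outputs mentioned in the prompt engineering item above, one common safeguard is to validate the model's reply before using it. The sketch below assumes a hypothetical JSON schema with a "steps" list whose entries carry an "instruction" and an optional "target"; the project's real plan format is not documented here.

```python
import json


def parse_plan(raw: str) -> list[dict]:
    """Validate an LLM reply against the assumed plan schema.

    Raises ValueError when the reply is not usable, so the caller can
    re-prompt instead of passing malformed steps to the UI.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"plan is not valid JSON: {exc}") from exc

    steps = data.get("steps") if isinstance(data, dict) else None
    if not isinstance(steps, list) or not steps:
        raise ValueError("plan must contain a non-empty 'steps' list")

    cleaned = []
    for number, step in enumerate(steps, start=1):
        instruction = step.get("instruction") if isinstance(step, dict) else None
        if not isinstance(instruction, str) or not instruction.strip():
            raise ValueError(f"step {number} has no instruction")
        cleaned.append({
            "number": number,
            "instruction": instruction.strip(),
            "target": step.get("target"),  # optional element description
        })
    return cleaned


raw_reply = (
    '{"steps": [{"instruction": " Open Google Maps "},'
    ' {"instruction": "Click Reserve", "target": "Reserve a table"}]}'
)
for step in parse_plan(raw_reply):
    print(step)
```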

Design Insights

  • Progressive Learning: Users need different levels of guidance at different stages; one size doesn't fit all
  • Confidence Matters: Showing confidence scores helps users know when to trust AI vs. verify manually
  • Minimalism Wins: A clean, grayscale interface helps users focus on learning rather than fighting the tool

Product Insights

  • Human-in-the-Loop is Powerful: The best AI tools augment rather than replace human decision-making
  • Context is Everything: Task completion isn't just about clicking buttons; it's about understanding why
  • Learning Sticks: Users who complete the tutorial-practice-reflect cycle retain knowledge better than those who just watch automation

What's next for AI Browser

Near-Term Enhancements

🔊 Voice Guidance: Audio narration for hands-free learning during tutorials

📱 Mobile Companion: Sync learned workflows to mobile devices for on-the-go reference

🌐 Workflow Sharing: Community marketplace for common tasks (tax filing, travel booking, etc.)

🎥 Video Generation: Automatic creation of polished tutorial videos from recorded sessions

Advanced Features

🤖 Adaptive Difficulty: AI adjusts guidance level based on user proficiency

🧩 Workflow Decomposition: Break complex multi-site workflows into manageable chunks

🔗 Integration Hub: Connect with password managers, form fillers, and productivity tools

📈 Analytics Dashboard: Track learning progress and identify areas needing more practice

Enterprise Vision

👥 Team Collaboration: Share and co-edit workflows within organizations

🔐 Compliance Mode: Ensure workflows follow company policies and regulations

📊 Training Analytics: Track employee onboarding and process adoption

🏢 Custom Deployment: Self-hosted versions for security-sensitive environments

Research Directions

🧪 Reinforcement Learning: AI learns from user corrections to improve future guidance

🔮 Predictive Assistance: Anticipate user needs based on context and history

🌍 Multi-Language Support: Automatic translation of workflows and instructions

♿ Accessibility Features: Screen reader integration, keyboard-only navigation support

The future of web navigation isn't full automation; it's intelligent assistance that empowers humans to learn, adapt, and succeed. AI Browser is just the beginning of this journey.

Try It Out

Visit our GitHub repository to:

  • Download the latest release
  • View the source code
  • Contribute to development
  • Report issues or request features

Let's make the web accessible through learning, not just automation. 🌸
