Vision Agent 👁️

A conversational, screen-aware web agent optimized for blind and low-vision users

Vision Agent is a Chrome extension that helps visually impaired users navigate the internet through natural conversation, screen understanding, and web automation.

✨ Features

🎤 Voice Interaction

- Push-to-talk voice input using Deepgram's Nova-2 model
- Natural conversational TTS using Deepgram Aura voices (Thalia recommended)
- Multi-language support: 14+ languages, including English, Spanish, French, German, and Japanese
- Conversational filler: natural "thinking" responses while a request is processing
- Natural language conversation with context awareness
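
As a rough illustration of the voice round trip, the sketch below builds requests for Deepgram's REST endpoints for speech-to-text (`/v1/listen`) and text-to-speech (`/v1/speak`). The helper names, the default voice id `aura-2-thalia-en`, and the `audio/webm` fallback are assumptions for illustration, not taken from this repository. Check the Deepgram docs for the exact voice ids available to your account.

```javascript
// Sketch only: request builders for Deepgram's REST API.
// Helper names and the default voice id are assumptions, not the extension's code.
const DEEPGRAM_BASE = "https://api.deepgram.com/v1";

// Build a speech-to-text request for a recorded audio blob.
function buildListenRequest(apiKey, audioBlob, language = "en") {
  const params = new URLSearchParams({ model: "nova-2", language });
  return {
    url: `${DEEPGRAM_BASE}/listen?${params}`,
    options: {
      method: "POST",
      headers: {
        Authorization: `Token ${apiKey}`,
        "Content-Type": audioBlob.type || "audio/webm",
      },
      body: audioBlob,
    },
  };
}

// Build a text-to-speech request; the response body is audio.
function buildSpeakRequest(apiKey, text, voiceModel = "aura-2-thalia-en") {
  const params = new URLSearchParams({ model: voiceModel });
  return {
    url: `${DEEPGRAM_BASE}/speak?${params}`,
    options: {
      method: "POST",
      headers: {
        Authorization: `Token ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ text }),
    },
  };
}

// Send the audio and pull the first transcript out of the JSON response.
async function transcribe(apiKey, audioBlob, language) {
  const { url, options } = buildListenRequest(apiKey, audioBlob, language);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`Deepgram STT failed: ${res.status}`);
  const data = await res.json();
  return data.results?.channels?.[0]?.alternatives?.[0]?.transcript ?? "";
}
```

Keeping the request construction separate from `fetch` makes the URL and headers easy to inspect without touching the network.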

👁️ Screen Understanding

- Describe any webpage: get a clear understanding of what's on screen
- Analyze content: summarize articles, identify key information
- Detect scams: check if websites are trustworthy, find hidden fees
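
One plausible way to implement screen understanding is to capture the visible tab and pass the image to Gemini alongside a question. The sketch below assumes `model` is an object returned by the Gemini SDK's `getGenerativeModel()` and that the extension has permission to capture the tab. The function names and the default prompt are hypothetical.

```javascript
// Sketch only: function names and the default prompt are hypothetical.

// Convert a base64 data URL (the format chrome.tabs.captureVisibleTab returns)
// into the inline image part shape that generateContent() accepts.
function dataUrlToInlinePart(dataUrl) {
  const match = /^data:([^;,]+);base64,(.+)$/.exec(dataUrl);
  if (!match) throw new Error("Expected a base64 data URL");
  return { inlineData: { mimeType: match[1], data: match[2] } };
}

// Capture the visible area of the active tab and ask the model about it.
// Assumes `model` comes from the Gemini SDK's getGenerativeModel().
async function describeVisibleTab(model, question = "Describe this page for a blind user.") {
  const dataUrl = await chrome.tabs.captureVisibleTab(null, { format: "jpeg", quality: 70 });
  const result = await model.generateContent([question, dataUrlToInlinePart(dataUrl)]);
  return result.response.text();
}
```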

🤖 Web Automation (Agent)

- Navigate to websites by speaking URLs or site names
- Click buttons and links by describing them
- Type into search boxes and forms
- Scroll through pages
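
Turning a spoken URL or site name into something the browser can open might look like the sketch below. This is an illustrative heuristic, not the extension's actual logic: `resolveSpokenTarget` is a hypothetical helper, and falling back to a Google search for bare site names is an assumption.

```javascript
// Sketch only: a hypothetical heuristic for resolving spoken navigation targets.
function resolveSpokenTarget(spoken) {
  const raw = spoken.trim();
  // Speech recognizers often transcribe "." as the word "dot".
  const normalized = raw.toLowerCase().replace(/\s+dot\s+/g, ".");

  // Already a full URL: use it as is.
  if (/^https?:\/\//.test(normalized)) return normalized;

  // Looks like a bare domain (optionally with a path): assume https.
  if (/^[a-z0-9-]+(\.[a-z0-9-]+)+(\/\S*)?$/.test(normalized)) {
    return `https://${normalized}`;
  }

  // Otherwise treat it as a site name and fall back to a web search (assumption).
  return `https://www.google.com/search?q=${encodeURIComponent(raw)}`;
}
```

For example, `resolveSpokenTarget("Google dot com")` yields `https://google.com`, while a phrase with no recognizable domain becomes a search query.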

🚀 Getting Started

Prerequisites

- Google Chrome browser
- Gemini API Key (required)
- Deepgram API Key (required for voice)

Installation

1. Download the extension
   ```
   git clone <your-repo-url>
   cd sb_hacks
   ```
2. Load in Chrome
   - Open chrome://extensions/
   - Enable "Developer mode" (top right)
   - Click "Load unpacked"
   - Select the sb_hacks folder
3. Configure API Keys
   - Click the Vision Agent icon in Chrome
   - Click "⚙️ Settings" in the side panel
   - Enter your Gemini and Deepgram API keys
   - (Optional) Select your preferred language and voice
   - Click "Save Settings"
4. Start using
   - Click the Vision Agent icon to open the side panel
   - Hold the microphone button and speak, or type a message
   - Try "describe this page" to get started!
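
The key requirements above (Gemini required, Deepgram required only for voice) could be enforced when settings are saved, roughly as sketched here. The storage key names `geminiApiKey` and `deepgramApiKey` and both function names are hypothetical. The real options page may store settings differently.

```javascript
// Sketch only: storage key names and function names are hypothetical.

// Report blocking errors and non-blocking warnings for a settings object.
function checkSettings(settings) {
  const errors = [];
  const warnings = [];
  if (!String(settings.geminiApiKey ?? "").trim()) {
    errors.push("A Gemini API key is required.");
  }
  if (!String(settings.deepgramApiKey ?? "").trim()) {
    warnings.push("No Deepgram API key: voice input and speech output will be unavailable.");
  }
  return { errors, warnings };
}

// Persist settings with the extension storage API once validation passes.
async function saveSettings(settings) {
  const { errors, warnings } = checkSettings(settings);
  if (errors.length > 0) throw new Error(errors.join(" "));
  await chrome.storage.local.set(settings);
  return warnings;
}
```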

💬 Example Commands

| What to say | What happens |
| --- | --- |
| "Describe this page" | Get an overview of the current webpage |
| "What are the main takeaways here?" | Summarize the important content |
| "Is this website safe?" | Analyze for scam indicators |
| "Go to google.com" | Navigate to Google |
| "Click the login button" | Click the login button |
| "Type hello in the search box" | Enter text in search field |
| "Scroll down" | Scroll the page down |
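
To make the mapping from phrases to actions concrete, here is a minimal keyword-based sketch. The extension itself presumably relies on Gemini to interpret requests, so treat this only as an illustration. `parseCommand` and the action object shapes are hypothetical.

```javascript
// Sketch only: a simple pattern matcher for the example commands.
// The function name and the returned action shapes are hypothetical.
function parseCommand(utterance) {
  // Drop surrounding whitespace and trailing sentence punctuation.
  const text = utterance.trim().replace(/[.!?]+$/, "");
  let m;

  if ((m = /^(?:go to|open|navigate to)\s+(.+)$/i.exec(text))) {
    return { action: "navigate", target: m[1] };
  }
  if ((m = /^click(?: on)?\s+(?:the\s+)?(.+)$/i.exec(text))) {
    return { action: "click", target: m[1] };
  }
  if ((m = /^type\s+(.+?)\s+(?:in|into)\s+(?:the\s+)?(.+)$/i.exec(text))) {
    return { action: "type", text: m[1], target: m[2] };
  }
  if ((m = /^scroll\s+(up|down)$/i.exec(text))) {
    return { action: "scroll", direction: m[1].toLowerCase() };
  }

  // Anything else is treated as a question for the conversational model.
  return { action: "chat", text };
}
```

Anything that does not match an automation pattern, such as "Describe this page", falls through to the `chat` action so the model can answer it.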

🏗️ Project Structure

sb_hacks/
├── manifest.json        # Extension configuration
├── background.js        # Service worker (Gemini integration)
├── sidepanel.html/css/js  # Main UI with voice input
├── content.js           # DOM manipulation
├── options.html/js      # Settings page
├── lib/
│   └── generative-ai.js # Bundled Gemini SDK
└── icons/               # Extension icons

🔧 Tech Stack

- AI: Google Gemini 2.0 Flash (vision + chat)
- Speech-to-Text: Deepgram Nova-2
- Text-to-Speech: Deepgram Aura (Thalia voice & 12 other voices)
- Platform: Chrome Extension Manifest V3

🎯 Use Cases

- Scam Detection: "Are there hidden fees on this page?"
- Article Summaries: "What are the key points of this article?"
- Form Navigation: "What form fields are on this page?"
- Shopping: "What are the product details and price?"
- General Browsing: "What links can I click on this page?"

⚙️ API Keys

Gemini API

1. Go to Google AI Studio
2. Create a new API key
3. Copy and paste into Vision Agent settings

Deepgram API

1. Go to Deepgram Console
2. Create a new project
3. Generate an API key
4. Copy and paste into Vision Agent settings

📄 License

MIT License - feel free to modify and distribute!

Built with ❤️ for accessibility

blind.CUM

bcum.ai

blind computer use model

ryunzz/sb_hacks

Folders and files

Latest commit

History

Repository files navigation

Vision Agent 👁️

✨ Features

🎤 Voice Interaction

👁️ Screen Understanding

🤖 Web Automation (Agent)

🚀 Getting Started

Prerequisites

Installation

💬 Example Commands

🏗️ Project Structure

🔧 Tech Stack

🎯 Use Cases

⚙️ API Keys

Gemini API

Deepgram API

📄 License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Uh oh!

Languages

Packages