A conversational screen-aware web agent optimized for blind and low-vision users
Vision Agent is a Chrome extension that helps visually impaired users navigate the internet through natural conversation, screen understanding, and web automation.
- Push-to-talk voice input using Deepgram's Nova-2 model
- Natural conversational TTS using Deepgram Aura voices (Thalia recommended)
- Multi-language support - 14+ languages, including English, Spanish, French, German, and Japanese
- Conversational filler - Natural "thinking" responses while processing
- Natural language conversation with context awareness
- Describe any webpage - Get a clear understanding of what's on screen
- Analyze content - Summarize articles, identify key information
- Detect scams - Check if websites are trustworthy, find hidden fees
- Navigate to websites by speaking URLs or site names
- Click buttons and links by describing them
- Type into search boxes and forms
- Scroll through pages
- Google Chrome browser
- Gemini API Key (required)
- Deepgram API Key (required for voice)
- Download the extension

      git clone <your-repo-url>
      cd sb_hacks

- Load in Chrome
  - Open chrome://extensions/
  - Enable "Developer mode" (top right)
  - Click "Load unpacked"
  - Select the sb_hacks folder
- Configure API Keys
  - Click the Vision Agent icon in Chrome
  - Click "⚙️ Settings" in the side panel
  - Enter your Gemini and Deepgram API keys
  - (Optional) Select your preferred language and voice
  - Click "Save Settings"
- Start using
  - Click the Vision Agent icon to open the side panel
  - Hold the microphone button and speak, or type a message
  - Try "describe this page" to get started!
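For contributors curious how the "Save Settings" step could work, here is a minimal sketch of validating and persisting the two API keys. The field names (`geminiApiKey`, `deepgramApiKey`, `language`, `voice`) and both function names are hypothetical and may not match what options.js actually uses. The `storage` argument is assumed to be any object with a Promise-returning `set()`, such as `chrome.storage.local` under Manifest V3.

```javascript
// Hypothetical settings shape; these keys are assumptions, not the extension's real ones.
const DEFAULT_SETTINGS = {
  geminiApiKey: "",
  deepgramApiKey: "",
  language: "en",
  voice: "thalia",
};

// Merge user input over the defaults, trim the keys, and collect human-readable errors.
function validateSettings(input) {
  const settings = { ...DEFAULT_SETTINGS, ...input };
  settings.geminiApiKey = String(settings.geminiApiKey ?? "").trim();
  settings.deepgramApiKey = String(settings.deepgramApiKey ?? "").trim();

  const errors = [];
  if (!settings.geminiApiKey) errors.push("Gemini API key is required");
  if (!settings.deepgramApiKey) errors.push("Deepgram API key is required for voice");
  return { settings, errors };
}

// Persist only valid settings. `storage` could be chrome.storage.local in the extension.
async function saveSettings(storage, input) {
  const { settings, errors } = validateSettings(input);
  if (errors.length > 0) return { ok: false, errors };
  await storage.set(settings);
  return { ok: true, errors: [] };
}
```

In an options page this might be called as `saveSettings(chrome.storage.local, formValues)`, with any returned errors read aloud or shown next to the form.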
| What to say | What happens |
|---|---|
| "Describe this page" | Get an overview of the current webpage |
| "What are the main takeaways here?" | Summarize the important content |
| "Is this website safe?" | Analyze for scam indicators |
| "Go to google.com" | Navigate to Google |
| "Click the login button" | Find the login button on the page and click it |
| "Type hello in the search box" | Enter text in search field |
| "Scroll down" | Scroll the page down |
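Each phrase in the table resolves to a structured action. The extension presumably leaves that interpretation to Gemini; the sketch below is only a hypothetical rule-based stand-in that illustrates the kind of action object each phrase could map to. The function name `parseCommand` and the action shapes are assumptions for illustration.

```javascript
// Map a spoken or typed phrase to a simple action object.
// Anything that is not a recognized command falls through to a free-form question.
function parseCommand(utterance) {
  const text = utterance.trim().replace(/[.!?]+$/, "");
  let m;

  if ((m = text.match(/^(?:go to|open|navigate to)\s+(.+)$/i))) {
    const target = m[1].trim();
    // Only build a URL when the target already looks like a hostname.
    const looksLikeHost = /^[\w-]+(\.[\w-]+)+(\/\S*)?$/.test(target);
    return looksLikeHost
      ? { type: "navigate", url: `https://${target}` }
      : { type: "navigate", siteName: target };
  }
  if ((m = text.match(/^click (?:on )?(?:the )?(.+)$/i))) {
    return { type: "click", target: m[1] };
  }
  if ((m = text.match(/^type (.+?) (?:in|into) (?:the )?(.+)$/i))) {
    return { type: "type", text: m[1], target: m[2] };
  }
  if ((m = text.match(/^scroll (up|down)$/i))) {
    return { type: "scroll", direction: m[1].toLowerCase() };
  }
  return { type: "ask", question: text };
}
```

For example, `parseCommand("Type hello in the search box")` yields a `type` action with `text: "hello"` and `target: "search box"`, while "Describe this page" falls through to an `ask` action that would be forwarded to the model.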
sb_hacks/
├── manifest.json # Extension configuration
├── background.js # Service worker (Gemini integration)
├── sidepanel.html/css/js # Main UI with voice input
├── content.js # DOM manipulation
├── options.html/js # Settings page
├── lib/
│ └── generative-ai.js # Bundled Gemini SDK
└── icons/ # Extension icons
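To illustrate the "DOM manipulation" role of content.js, here is a minimal sketch of how a spoken description such as "login button" might be matched against clickable elements. It uses plain word overlap, which is an assumption made for illustration; the real content script may rely on Gemini or a different heuristic. The names `scoreMatch` and `pickBestMatch` and the `{ label }` candidate shape are hypothetical.

```javascript
// Fraction of the description's words that also appear in the element's label.
function scoreMatch(description, label) {
  const words = (s) => s.toLowerCase().match(/[a-z0-9]+/g) || [];
  const wanted = new Set(words(description));
  const have = new Set(words(label));
  if (wanted.size === 0) return 0;

  let hits = 0;
  for (const w of wanted) {
    if (have.has(w)) hits += 1;
  }
  return hits / wanted.size;
}

// Return the candidate with the highest non-zero score, or null when nothing matches.
function pickBestMatch(description, candidates) {
  let best = null;
  let bestScore = 0;
  for (const candidate of candidates) {
    const score = scoreMatch(description, candidate.label);
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}
```

In a content script, the candidates could be built from `document.querySelectorAll("a, button, [role=button]")`, taking each element's `aria-label` or visible text as its label, and the winning element could then be clicked with `element.click()`.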
- AI: Google Gemini 2.0 Flash (vision + chat)
- Speech-to-Text: Deepgram Nova-2
- Text-to-Speech: Deepgram Aura (Thalia voice & 12 other voices)
- Platform: Chrome Extension Manifest V3
- Scam Detection: "Are there hidden fees on this page?"
- Article Summaries: "What are the key points of this article?"
- Form Navigation: "What form fields are on this page?"
- Shopping: "What are the product details and price?"
- General Browsing: "What links can I click on this page?"
- For the Gemini API key, go to Google AI Studio
- Create a new API key
- Copy and paste into Vision Agent settings
- For the Deepgram API key, go to Deepgram Console
- Create a new project
- Generate an API key
- Copy and paste into Vision Agent settings
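Once both keys exist, a quick way to see where each one goes is to build the raw HTTP requests by hand. The sketch below assembles request descriptions for Deepgram's `/v1/listen` endpoint and Gemini's `generateContent` REST endpoint. The extension itself uses a bundled Gemini SDK rather than raw REST calls, so treat this as an illustration only. The function names are hypothetical, and the `audio/webm` content type is an assumption about how the microphone audio is recorded.

```javascript
// Describe a Deepgram speech-to-text request. The key goes in a "Token" Authorization header.
function buildDeepgramListenRequest(apiKey, { model = "nova-2", language = "en" } = {}) {
  const url = new URL("https://api.deepgram.com/v1/listen");
  url.searchParams.set("model", model);
  url.searchParams.set("language", language);
  return {
    url: url.toString(),
    method: "POST",
    headers: {
      Authorization: `Token ${apiKey}`,
      "Content-Type": "audio/webm", // assumption: audio captured via MediaRecorder
    },
  };
}

// Describe a Gemini text request. The key goes in the x-goog-api-key header.
function buildGeminiRequest(apiKey, prompt, model = "gemini-2.0-flash") {
  return {
    url: `https://generativelanguage.googleapis.com/v1beta/models/${model}:generateContent`,
    method: "POST",
    headers: {
      "x-goog-api-key": apiKey,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ contents: [{ parts: [{ text: prompt }] }] }),
  };
}
```

Either description can be passed to `fetch(req.url, req)` (adding the recorded audio as `body` for Deepgram) to check that a key is accepted before saving it in the settings.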
MIT License - feel free to modify and distribute!
Built with ❤️ for accessibility
blind.CUM
bcum.ai
blind computer use model