Building a Context-Aware Voice Agent for the Web
Most voice assistants on the web are blind. They hear what you say but have no idea what page you're looking at. That gap makes every interaction feel stilted — you have to narrate context that should be obvious.
We built an agent that reads the room.
The Challenge
Voice assistants typically operate in a vacuum: open mic, hear speech, return a generic answer. On content-rich websites, that's a product failure. A user on a product documentation page doesn't want to describe the feature they're asking about — the assistant should already know.
Three hard requirements emerged:
- Context awareness — the assistant must extract and understand page content at load time and on every navigation
- Framework agnosticism — it had to work on React SPAs, Svelte sites, Next.js apps, and plain HTML pages without breaking
- Non-intrusive delivery — users shouldn't have to install heavy SDKs or restructure their markup
Our Approach
We built a self-contained widget with a Chrome extension variant. The architecture breaks into three layers:
Page intelligence layer. On load, the widget scans the DOM for structured content: headings, paragraphs, code blocks, and metadata. It patches the History API and attaches a MutationObserver to detect route changes and dynamic content swaps. When a modal or dialog opens, the agent prioritizes that content: it knows what the user is looking at right now.
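A condensed sketch of how this layer can work. The function names (`extractPageContext`, `watchForContextChanges`) and the modal selectors are illustrative, not the shipped API:

```js
// Illustrative sketch of the page-intelligence layer.
function extractPageContext() {
  // Prefer an open modal/dialog: that's what the user is looking at right now.
  const modal = document.querySelector('dialog[open], [role="dialog"], [aria-modal="true"]');
  const root = modal ?? document.body;
  const parts = [];
  for (const el of root.querySelectorAll('h1, h2, h3, p, pre code')) {
    const text = el.textContent.trim();
    if (text) parts.push(text);
  }
  return {
    title: document.title,
    url: location.href,
    scope: modal ? 'modal' : 'page',
    content: parts.join('\n').slice(0, 8000), // keep the prompt bounded
  };
}

function watchForContextChanges(onContextChange) {
  // SPAs swap content without full reloads; observe DOM mutations...
  const observer = new MutationObserver(
    debounce(() => onContextChange(extractPageContext()), 250)
  );
  observer.observe(document.body, { childList: true, subtree: true });

  // ...and patch the History API to catch client-side route changes.
  for (const method of ['pushState', 'replaceState']) {
    const original = history[method];
    history[method] = function (...args) {
      const result = original.apply(this, args);
      onContextChange(extractPageContext());
      return result;
    };
  }
  window.addEventListener('popstate', () => onContextChange(extractPageContext()));
}

function debounce(fn, ms) {
  let t;
  return (...args) => { clearTimeout(t); t = setTimeout(() => fn(...args), ms); };
}
```

Debouncing the observer callback matters in practice: a single route change on a React or Svelte site can fire hundreds of mutations, and re-extracting context on each one would be wasteful.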
Conversational pipeline. The ElevenLabs SDK handles the full speech loop: streaming STT captures speech in sub-200ms chunks, the agent injects the current page context as a system prompt, and the LLM response streams back through ElevenLabs TTS before the model has finished generating. The result is a natural back-and-forth with little perceptible delay.
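A minimal sketch of the context injection, assuming the `@elevenlabs/client` SDK's `Conversation.startSession` entry point. The exact `overrides` shape shown for the system prompt is our reading of the SDK's overrides mechanism and worth verifying against current docs:

```js
// Sketch: wiring page context into the speech loop. The overrides shape
// below is an assumption about @elevenlabs/client; verify before relying on it.
import { Conversation } from '@elevenlabs/client';

async function startVoiceSession(agentId) {
  const ctx = extractPageContext(); // from the page-intelligence layer above

  // startSession opens the mic, streams STT upstream, and plays TTS audio
  // as it arrives, so speech starts before the LLM finishes generating.
  const conversation = await Conversation.startSession({
    agentId,
    overrides: {
      agent: {
        prompt: {
          prompt:
            `You are a voice assistant embedded in a web page.\n` +
            `Current page: ${ctx.title} (${ctx.url})\n` +
            `Visible content (${ctx.scope}):\n${ctx.content}`,
        },
      },
    },
  });
  return conversation;
}
```

On navigation or modal open, the context watcher can end the stale session and start a fresh one with updated context, or feed the new context into the next turn, depending on how disruptive a restart is mid-conversation.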
Delivery layer. The core widget is vanilla JavaScript — no framework bundle, no dependency conflicts. The Chrome extension wraps the same engine in a content script for cross-site use. Both variants share a common state machine for mic permissions, conversation history, and session lifecycle.
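A sketch of what a shared session state machine can look like; the state names and events here are illustrative rather than the shipped implementation:

```js
// Illustrative shared state machine: both the widget and the extension
// drive the same core, so lifecycle logic lives in one place.
const TRANSITIONS = {
  idle:           { REQUEST_MIC: 'requesting_mic' },
  requesting_mic: { MIC_GRANTED: 'listening', MIC_DENIED: 'error' },
  listening:      { SPEECH_END: 'responding', STOP: 'idle' },
  responding:     { RESPONSE_DONE: 'listening', STOP: 'idle' },
  error:          { RESET: 'idle' },
};

class SessionStateMachine {
  constructor() {
    this.state = 'idle';
    this.history = []; // conversation turns, shared across form factors
    this.listeners = new Set();
  }
  dispatch(event) {
    const next = TRANSITIONS[this.state]?.[event];
    if (!next) return this.state; // ignore events invalid in the current state
    this.state = next;
    for (const fn of this.listeners) fn(this.state, event);
    return next;
  }
  onChange(fn) { this.listeners.add(fn); }
}
```

Making invalid transitions a no-op rather than an error keeps the UI layers simple: a second click on the mic button while permissions are pending just gets ignored.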
Results
The agent ships in three form factors — embedded widget, Chrome extension, and standalone page — from a single codebase. The modal-aware context prioritization cut irrelevant responses by roughly 60% in testing because the agent no longer tries to answer based on background content when a dialog is open. Perceived latency stays under 300ms thanks to streaming TTS, making the interaction feel like a real conversation rather than a query-response cycle.
Tech Stack
- ElevenLabs SDK — conversational AI pipeline (STT, LLM, TTS)
- WebSocket streaming — low-latency audio transport
- Chrome Extension APIs — content scripts, background workers, cross-origin injection
- Vanilla JS widget core — framework-agnostic, self-contained
- MutationObserver + History API — SPA-aware context extraction