Building a Context-Aware Voice Agent for the Web
Most voice assistants on the web are blind. They hear what you say but have no idea what page you're looking at. That gap makes every interaction feel stilted — you have to narrate context that should be obvious.
We built an agent that reads the room.
The Challenge
Voice assistants typically operate in a vacuum: open mic, hear speech, return a generic answer. On content-rich websites, that's a product failure. A user on a product documentation page doesn't want to describe the feature they're asking about — the assistant should already know.
Three hard requirements emerged:
- Context awareness — the assistant must extract and understand page content at load time and on every navigation
- Framework agnosticism — it had to work on React SPAs, Svelte sites, Next.js apps, and plain HTML pages without breaking
- Non-intrusive delivery — users shouldn't have to install heavy SDKs or restructure their markup
Our Approach
We built a self-contained widget with a Chrome extension variant. The architecture breaks into three layers:
Page intelligence layer. On load, the widget scans the DOM for structured content: headings, paragraphs, code blocks, and metadata. It patches the History API and attaches a MutationObserver to detect route changes and dynamic content swaps. When a modal or dialog opens, the agent prioritizes that content: it knows what the user is looking at right now.
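A condensed sketch of how this layer can work. The function names (`extractPageContext`, `watchForContextChanges`) and the modal selectors are illustrative, not the shipped API:

```js
// Illustrative sketch of the page-intelligence layer.
function extractPageContext() {
  // Prefer an open modal/dialog: that's what the user is looking at right now.
  const modal = document.querySelector('dialog[open], [role="dialog"], [aria-modal="true"]');
  const root = modal ?? document.body;
  const parts = [];
  for (const el of root.querySelectorAll('h1, h2, h3, p, pre code')) {
    const text = el.textContent.trim();
    if (text) parts.push(text);
  }
  return {
    title: document.title,
    url: location.href,
    scope: modal ? 'modal' : 'page',
    content: parts.join('\n').slice(0, 8000), // keep the prompt bounded
  };
}

function watchForContextChanges(onContextChange) {
  // SPAs swap content without full reloads; observe DOM mutations...
  const observer = new MutationObserver(
    debounce(() => onContextChange(extractPageContext()), 250)
  );
  observer.observe(document.body, { childList: true, subtree: true });

  // ...and patch the History API to catch client-side route changes.
  for (const method of ['pushState', 'replaceState']) {
    const original = history[method];
    history[method] = function (...args) {
      const result = original.apply(this, args);
      onContextChange(extractPageContext());
      return result;
    };
  }
  window.addEventListener('popstate', () => onContextChange(extractPageContext()));
}

function debounce(fn, ms) {
  let t;
  return (...args) => { clearTimeout(t); t = setTimeout(() => fn(...args), ms); };
}
```

Debouncing the observer callback matters in practice: a single route change on a React or Svelte site can fire hundreds of mutations, and re-extracting context on each one would be wasteful.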
Conversational pipeline. The ElevenLabs SDK handles the full speech loop: streaming STT captures speech in sub-200ms chunks, the agent injects the current page context as a system prompt, and the LLM response streams back through ElevenLabs TTS before the model has finished generating. The result is a natural back-and-forth with little perceptible delay.
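A minimal sketch of the context injection, assuming the `@elevenlabs/client` SDK's `Conversation.startSession` entry point. The exact `overrides` shape shown for the system prompt is our reading of the SDK's overrides mechanism and worth verifying against current docs:

```js
// Sketch: wiring page context into the speech loop. The overrides shape
// below is an assumption about @elevenlabs/client; verify before relying on it.
import { Conversation } from '@elevenlabs/client';

async function startVoiceSession(agentId) {
  const ctx = extractPageContext(); // from the page-intelligence layer above

  // startSession opens the mic, streams STT upstream, and plays TTS audio
  // as it arrives, so speech starts before the LLM finishes generating.
  const conversation = await Conversation.startSession({
    agentId,
    overrides: {
      agent: {
        prompt: {
          prompt:
            `You are a voice assistant embedded in a web page.\n` +
            `Current page: ${ctx.title} (${ctx.url})\n` +
            `Visible content (${ctx.scope}):\n${ctx.content}`,
        },
      },
    },
  });
  return conversation;
}
```

On navigation or modal open, the context watcher can end the stale session and start a fresh one with updated context, or feed the new context into the next turn, depending on how disruptive a restart is mid-conversation.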
Delivery layer. The core widget is vanilla JavaScript — no framework bundle, no dependency conflicts. The Chrome extension wraps the same engine in a content script for cross-site use. Both variants share a common state machine for mic permissions, conversation history, and session lifecycle.
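A sketch of what a shared session state machine can look like; the state names and events here are illustrative rather than the shipped implementation:

```js
// Illustrative shared state machine: both the widget and the extension
// drive the same core, so lifecycle logic lives in one place.
const TRANSITIONS = {
  idle:           { REQUEST_MIC: 'requesting_mic' },
  requesting_mic: { MIC_GRANTED: 'listening', MIC_DENIED: 'error' },
  listening:      { SPEECH_END: 'responding', STOP: 'idle' },
  responding:     { RESPONSE_DONE: 'listening', STOP: 'idle' },
  error:          { RESET: 'idle' },
};

class SessionStateMachine {
  constructor() {
    this.state = 'idle';
    this.history = []; // conversation turns, shared across form factors
    this.listeners = new Set();
  }
  dispatch(event) {
    const next = TRANSITIONS[this.state]?.[event];
    if (!next) return this.state; // ignore events invalid in the current state
    this.state = next;
    for (const fn of this.listeners) fn(this.state, event);
    return next;
  }
  onChange(fn) { this.listeners.add(fn); }
}
```

Making invalid transitions a no-op rather than an error keeps the UI layers simple: a second click on the mic button while permissions are pending just gets ignored.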
Results
The agent ships in three form factors — embedded widget, Chrome extension, and standalone page — from a single codebase. The modal-aware context prioritization cut irrelevant responses by roughly 60% in testing because the agent no longer tries to answer based on background content when a dialog is open. Perceived latency stays under 300ms thanks to streaming TTS, making the interaction feel like a real conversation rather than a query-response cycle.
Tech Stack
- ElevenLabs SDK — conversational AI pipeline (STT, LLM, TTS)
- WebSocket streaming — low-latency audio transport
- Chrome Extension APIs — content scripts, background workers, cross-origin injection
- Vanilla JS widget core — framework-agnostic, self-contained
- MutationObserver + History API — SPA-aware context extraction