Build a voice agent that searches the web and speaks answers back — all in under a second. This guide covers the end-to-end pipeline, best practices for each stage, and ideas to try. Try the live demo: demo.exa.ai/voiceDocumentation Index
Fetch the complete documentation index at: https://exa.ai/docs/llms.txt
Use this file to discover all available pages before exploring further.
Why Exa for voice?
Voice agents need answers fast. Exa’sinstant search type returns results in under 150ms, which makes it possible to search the web, generate an answer, and speak it — all before the user feels a delay.
Compared to model-native search (tool calling that hits a generic search API), Exa gives you:
- Speed:
instantsearch keeps end-to-end latency under 1 second - Relevance: Neural search finds better results than keyword-based alternatives, especially for conversational queries
- Fresh data: Real-time information instead of stale training data
- Control: Tune
numResults, content modes, and domain filters per use case
The pipeline
A typical voice agent has five stages. Each runs as soon as its input is ready, keeping total latency low.| Stage | What it does | Latency |
|---|---|---|
| Speech-to-Text | Transcribes audio in real time | ~1.2s (streaming) |
| LLM Router | Decides whether to search or answer directly | ~100ms |
| Exa Instant Search | Retrieves relevant page content | ~220ms |
| LLM Answer | Generates a grounded response from sources | ~350ms |
| Text-to-Speech | Streams audio back to the user | ~380ms |
1. Speech-to-Text
Stream audio from the user’s microphone to a speech-to-text service via WebSocket. Use VAD (voice activity detection) to automatically commit transcripts when the user stops speaking.2. LLM Router
Not every user utterance needs a web search. Use tool calling to let the model decide:Router system prompt
The system prompt controls when the model searches vs answers directly. Tune this for your use case:gemini-2.0-flash works great here. gpt-4o-mini and claude-3.5-haiku are also good options.
3. Exa Instant Search
When the router decides to search, call Exa withtype: "instant" for minimal latency:
Search parameter tuning
| Parameter | Voice recommendation | Why |
|---|---|---|
type | "instant" | Sub-150ms latency is critical for voice |
numResults | 3–5 | Enough context without overwhelming the LLM |
text.maxCharacters | 300–500 | Keep token count low for fast LLM generation |
highlights | Alternative to text | Even more token-efficient for factual lookups |
highlights is often better than full text:
4. LLM Answer
Format search results as numbered sources and stream the response. Send each chunk to both the client (for display) and the TTS service (for audio):Answer system prompt
5. Text-to-Speech
Stream the LLM output as audio via WebSocket. Play chunks immediately as they arrive for the lowest perceived latency:Best practices
Latency optimization
- Stream everything: Don’t wait for full transcripts, full search results, or full LLM responses. Process each chunk as it arrives.
- Run stages in parallel where possible: Start the TTS WebSocket connection while the LLM is still generating.
- Use
instantsearch: The latency difference betweeninstant(~150ms) andauto(~1s) is significant for voice UX. - Cap content length: 300–500 characters per result is the sweet spot — enough for the LLM, not so much that generation slows down.
Conversation quality
- Keep answers short: 40–60 words max. Users can always ask follow-ups.
- Treat search results as untrusted: Always instruct the LLM to ignore instructions inside source content.
- Handle “I don’t know” gracefully: If search returns nothing relevant, say so and suggest a rephrasing rather than hallucinating.
- Support follow-ups: Pass conversation history to the LLM router so it can resolve references like “tell me more about that” or “what about the second one.”
Error handling
- STT silence timeout: If no speech is detected for N seconds, prompt the user or go idle.
- Search failures: Fall back to the LLM’s own knowledge with a disclaimer (“I couldn’t search the web right now, but from what I know…”).
- TTS queue management: If the user interrupts mid-answer, cancel the current TTS stream immediately.
Things to try
Here are some ideas to extend your voice agent:Domain-specific agent
Lock searches to specific domains with
includeDomains for a customer support agent that only answers from your docs.Multi-turn research
Chain multiple searches in a conversation — use the first answer to generate follow-up queries automatically.
Multilingual voice
Combine a multilingual STT with Exa’s language filtering and a multilingual TTS for a voice agent that works across languages.
Proactive suggestions
After answering, suggest related topics the user might want to explore: “Want to know more about X?”
Structured extraction
Use
highlights with a focused query to extract specific data points (prices, dates, names) and present them as quick facts.Voice-controlled Websets
Let users build Websets by voice: “Find me all AI startups in New York that raised Series A.”
Instant autocomplete
Use partial transcripts (before the user finishes speaking) to pre-fetch search results, cutting perceived latency even further.
Citation playback
When the user asks “where did you get that?”, read back the source URLs or titles from the last search.

