Add real-time voice to any Glove agent. The voice pipeline handles microphone capture, speech-to-text, agent processing, and text-to-speech playback — while all your existing tools, display stack, and context management continue to work unchanged.
The voice system is split across three packages, each with a specific responsibility:
- glove-voice — GloveVoice, adapter contracts (STT, TTS, VAD), built-in implementations (ElevenLabs adapters, energy-based VAD), audio capture, and audio playback.
- glove-react/voice — useGloveVoice (low-level), useGlovePTT (push-to-talk), and VoicePTTButton (headless mic button) with proper lifecycle management.
- glove-next — createVoiceTokenHandler for creating Next.js API routes that generate short-lived provider tokens, keeping your API keys on the server.

GloveVoice supports two turn detection modes that control how the pipeline decides when the user has finished speaking:
- "vad" (default) — the VAD detects when the user starts and stops speaking; turns are committed and barge-in is triggered automatically. Hands-free operation.
- "manual" — the consumer controls turn boundaries by calling commitTurn(). No automatic barge-in — call interrupt() explicitly when needed. Ideal for noisy environments or when precise control is required.

The pipeline transitions through four states during operation:
| Mode | State | Description |
|---|---|---|
| idle | Pipeline off | Not started or stopped. No mic access, no connections. |
| listening | Mic active | Capturing audio, sending to STT. Waiting for the user to speak. |
| thinking | Agent processing | User utterance committed. Glove is processing the request (model call, tool execution). |
| speaking | TTS playback | Audio chunks streaming from TTS to the speaker. Barge-in returns to listening. |
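These transitions can be sketched as a pure function over the four modes. The event names below are illustrative (the real pipeline drives transitions internally from VAD, STT, and TTS events), but the shape of the state machine matches the table above:

```typescript
type VoiceMode = "idle" | "listening" | "thinking" | "speaking";
// Hypothetical event names, for illustration only.
type VoiceEvent = "start" | "stop" | "turn_committed" | "tts_started" | "tts_finished" | "barge_in";

function nextMode(mode: VoiceMode, event: VoiceEvent): VoiceMode {
  switch (event) {
    case "start":          return mode === "idle" ? "listening" : mode;
    case "stop":           return "idle";
    case "turn_committed": return mode === "listening" ? "thinking" : mode;
    case "tts_started":    return mode === "thinking" ? "speaking" : mode;
    case "tts_finished":   return mode === "speaking" ? "listening" : mode;
    case "barge_in":       return mode === "idle" ? mode : "listening"; // barge-in always lands in listening
  }
}
```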
Get voice working in five minutes with ElevenLabs. This assumes you already have a Glove agent running with glove-react and glove-next.
pnpm add glove-voice

glove-react and glove-next are already part of a typical Glove project. The voice subpaths (glove-react/voice and createVoiceTokenHandler from glove-next) are included in those packages.
Create two API routes that generate short-lived ElevenLabs tokens. Your API key stays on the server — the browser only receives single-use tokens.
// app/api/voice/stt-token/route.ts
import { createVoiceTokenHandler } from "glove-next";
export const GET = createVoiceTokenHandler({
provider: "elevenlabs",
type: "stt",
});

// app/api/voice/tts-token/route.ts
import { createVoiceTokenHandler } from "glove-next";
export const GET = createVoiceTokenHandler({
provider: "elevenlabs",
type: "tts",
});

Set your ElevenLabs API key in .env.local:
ELEVENLABS_API_KEY=your_api_key_here

Create a voice configuration file that sets up the ElevenLabs STT adapter and TTS factory. The token fetchers point to the routes you just created.
import { createElevenLabsAdapters } from "glove-voice";
async function fetchToken(path: string): Promise<string> {
const res = await fetch(path);
const data = await res.json();
return data.token;
}
export const { stt, createTTS } = createElevenLabsAdapters({
getSTTToken: () => fetchToken("/api/voice/stt-token"),
getTTSToken: () => fetchToken("/api/voice/tts-token"),
voiceId: "JBFqnCBsd6RMkjVDRZzb",
});

The voiceId is an ElevenLabs voice identifier. Browse the ElevenLabs Voice Library to find a voice and copy its ID.
Use useGloveVoice alongside useGlove to wire the voice pipeline into your component.
import { useGlove } from "glove-react";
import { useGloveVoice } from "glove-react/voice";
import { stt, createTTS } from "@/lib/voice";
function App() {
const { runnable } = useGlove({ tools, sessionId });
const voice = useGloveVoice({
runnable,
voice: { stt, createTTS },
});
return (
<button onClick={voice.isActive ? voice.stop : voice.start}>
{voice.mode}
</button>
);
}

That's it. Clicking the button starts the mic, connects STT, and begins listening. Speak naturally and the pipeline handles the rest: your speech is transcribed, sent to Glove, and the response is spoken back.
The voice pipeline is built around three adapter contracts. Each adapter is an EventEmitter with a specific set of events and methods. You can swap implementations freely — the pipeline does not care which provider you use, only that the contract is satisfied.
Streaming speech-to-text. Receives raw PCM audio and emits transcripts.
| Event | Payload | Description |
|---|---|---|
| partial | string | Streaming partial transcript. Changes as more speech arrives. |
| final | string | Stable, finalized transcript for the completed utterance. |
| error | Error | Connection or transcription error. |
| close | (none) | WebSocket connection closed. |
| Method | Signature | Description |
|---|---|---|
| connect() | () => Promise<void> | Open the connection. Adapter fetches credentials internally via its getToken function. |
| sendAudio(pcm) | (pcm: Int16Array) => void | Send a raw PCM chunk (16kHz mono Int16Array). |
| flushUtterance() | () => void | Signal end of utterance. Adapter should finalize the current transcript. Called by VAD on speech_end. |
| disconnect() | () => void | Close the connection and release resources. |
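A minimal in-memory mock satisfying this contract (the "transcription" here is illustrative — real adapters stream audio over a WebSocket and emit actual text):

```typescript
import { EventEmitter } from "node:events";

// Illustrative mock STT adapter: buffers audio and "transcribes" by
// counting samples, emitting the same events the contract requires.
class MockSTTAdapter extends EventEmitter {
  private samples = 0;

  async connect(): Promise<void> {
    // A real adapter would open its WebSocket here using getToken().
  }
  sendAudio(pcm: Int16Array): void {
    this.samples += pcm.length;
    this.emit("partial", `heard ${this.samples} samples`);
  }
  flushUtterance(): void {
    this.emit("final", `utterance of ${this.samples} samples`);
    this.samples = 0;
  }
  disconnect(): void {
    this.emit("close");
  }
}
```

Because the pipeline only depends on the contract, a mock like this is enough to exercise turn logic in tests without any provider credentials.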
Streaming text-to-speech. Receives text chunks and emits audio.
| Event | Payload | Description |
|---|---|---|
| audio_chunk | Uint8Array | Raw PCM audio chunk (16kHz mono), ready for the AudioPlayer. |
| done | (none) | All audio for the current turn has been received. |
| error | Error | Connection or synthesis error. |
| Method | Signature | Description |
|---|---|---|
| open() | () => Promise<void> | Open the connection. Resolves once the adapter is ready to accept text. |
| sendText(text) | (text: string) => void | Send a text chunk for synthesis. Safe to call before open() resolves; adapters queue internally. |
| flush() | () => void | Signal end of text stream. Flushes remaining audio. Must be called once after all text is sent. |
| destroy() | () => void | Immediately close the connection, dropping any pending audio. |
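The queue-before-open behavior of sendText() can be sketched like this (internal field names are illustrative, not the adapter's real internals):

```typescript
// Sketch of the internal text queue: chunks sent before open() resolves
// are buffered, then flushed to the connection once it is ready.
class TextQueue {
  private queue: string[] = [];
  private ready = false;
  private sent: string[] = []; // stands in for writing to the socket

  onOpen(): void {
    this.ready = true;
    for (const t of this.queue) this.sent.push(t); // drain in order
    this.queue.length = 0;
  }
  sendText(text: string): void {
    if (this.ready) this.sent.push(text);
    else this.queue.push(text); // queued until open() resolves
  }
  get delivered(): string[] {
    return this.sent;
  }
}
```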
Voice activity detection. Processes audio frames and signals speech boundaries.
| Event | Payload | Description |
|---|---|---|
| speech_start | (none) | User started speaking. |
| speech_end | (none) | User stopped speaking. Triggers STT flush in VAD mode. |
| Method | Signature | Description |
|---|---|---|
| process(pcm) | (pcm: Int16Array) => void | Process a PCM frame. Call on every AudioCapture chunk event. |
| reset() | () => void | Force reset internal state. Called when interrupting a turn. |
| Adapter | Provider | Description |
|---|---|---|
| ElevenLabsSTTAdapter | ElevenLabs Scribe Realtime | WebSocket-based streaming STT using ElevenLabs Scribe v2. Supports partial and committed transcripts with auto-reconnect. |
| ElevenLabsTTSAdapter | ElevenLabs Input Streaming | WebSocket-based streaming TTS using ElevenLabs Turbo v2.5. Streams text in, receives PCM audio chunks out. |
| VAD (energy-based) | Built-in | Zero-dependency energy-based voice activity detector. Uses RMS energy thresholds. Good for quiet environments. |
| SileroVADAdapter | Silero VAD (WASM) | ML-based voice activity detection using ONNX Runtime. Much more accurate in noisy environments. Loaded from glove-voice/silero-vad subpath. |
Voice providers like ElevenLabs, Deepgram, and Cartesia authenticate via API keys. These keys must never be exposed to the browser. The token pattern solves this:
Factory function from glove-next that creates a Next.js App Router GET handler for generating provider tokens.
function createVoiceTokenHandler(
config: VoiceTokenHandlerConfig
): (req: Request) => Promise<Response>

VoiceTokenHandlerConfig is a discriminated union based on the provider field:
| Provider | Fields | Description |
|---|---|---|
| elevenlabs | type: "stt" \| "tts" | ElevenLabs requires separate tokens for STT (realtime_scribe) and TTS (tts_websocket). Create one route for each. Reads ELEVENLABS_API_KEY from env. |
| deepgram | ttlSeconds?: number | Deepgram uses a single token for all operations. ttlSeconds controls token lifetime (default: 30). Reads DEEPGRAM_API_KEY from env. |
| cartesia | (none) | Cartesia uses a single JWT token. Reads CARTESIA_API_KEY from env. |
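The shape this table implies can be sketched as a TypeScript union, along with the env var each provider falls back to. The field names come straight from the table above; nothing else is assumed:

```typescript
// Sketch of the discriminated union implied by the config table.
type VoiceTokenHandlerConfig =
  | { provider: "elevenlabs"; type: "stt" | "tts"; apiKey?: string }
  | { provider: "deepgram"; ttlSeconds?: number; apiKey?: string }
  | { provider: "cartesia"; apiKey?: string };

// Which env var each provider reads when apiKey is not passed directly.
function envVarFor(config: VoiceTokenHandlerConfig): string {
  switch (config.provider) {
    case "elevenlabs": return "ELEVENLABS_API_KEY";
    case "deepgram":   return "DEEPGRAM_API_KEY";
    case "cartesia":   return "CARTESIA_API_KEY";
  }
}
```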
All providers accept an optional apiKey field to pass the key directly instead of reading from environment variables.
// Override the env var with a direct key
export const GET = createVoiceTokenHandler({
provider: "elevenlabs",
type: "stt",
apiKey: "sk-...",
});

VAD determines when the user starts and stops speaking. This controls when to flush the STT buffer and commit a turn, and when to trigger barge-in during playback.
The default VAD uses RMS energy thresholds. It has zero dependencies, works everywhere, and is effective in quiet environments. When no custom vad is passed to GloveVoice, the built-in VAD is used automatically.
| Parameter | Default | Description |
|---|---|---|
| threshold | 0.01 | RMS energy level to consider as speech. Higher values require louder speech. |
| silentFrames | 15 (~600ms) | Consecutive silent frames before speech_end fires. Increase for longer natural pauses. GloveVoice defaults to 40 (~1600ms). |
| speechFrames | 3 | Consecutive speech frames before speech_start fires. Avoids false triggers from brief noises. |
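A minimal sketch of how these three parameters interact — RMS per frame, with consecutive-frame hysteresis. This is not the actual implementation, but the defaults match the table above:

```typescript
// Illustrative energy VAD: computes RMS per frame and uses
// consecutive-frame counters to debounce speech_start / speech_end.
class EnergyVAD {
  speaking = false;
  events: string[] = [];
  private speech = 0;
  private silence = 0;
  private threshold: number;
  private silentFrames: number;
  private speechFrames: number;

  constructor(threshold = 0.01, silentFrames = 15, speechFrames = 3) {
    this.threshold = threshold;
    this.silentFrames = silentFrames;
    this.speechFrames = speechFrames;
  }

  process(pcm: Int16Array): void {
    let sum = 0;
    for (let i = 0; i < pcm.length; i++) sum += pcm[i] * pcm[i];
    const rms = Math.sqrt(sum / pcm.length) / 32768; // normalize to 0..1
    if (rms >= this.threshold) { this.speech++; this.silence = 0; }
    else { this.silence++; this.speech = 0; }
    if (!this.speaking && this.speech >= this.speechFrames) {
      this.speaking = true; this.events.push("speech_start");
    } else if (this.speaking && this.silence >= this.silentFrames) {
      this.speaking = false; this.events.push("speech_end");
    }
  }
}
```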
import { useGloveVoice } from "glove-react/voice";
// Override VAD sensitivity via vadConfig
const voice = useGloveVoice({
runnable,
voice: {
stt,
createTTS,
vadConfig: { silentFrames: 60, threshold: 0.02 },
},
});

For noisy environments or higher accuracy, use SileroVADAdapter. It runs a neural network (Silero VAD v5) via ONNX Runtime in the browser using WebAssembly. The ML model produces a speech probability score for each audio frame, making it far more accurate than energy-based detection at distinguishing speech from background noise.
SileroVAD depends on @ricky0123/vad-web and onnxruntime-web, which load WASM files in the browser. If you import this from the main glove-voice barrel, bundlers (Next.js, Vite) try to resolve WASM files at build time and may attempt to bundle them for SSR, causing errors.
The solution is a separate entry point at glove-voice/silero-vad combined with a dynamic import:
export async function createSileroVAD() {
const { SileroVADAdapter } = await import("glove-voice/silero-vad");
const vad = new SileroVADAdapter({
positiveSpeechThreshold: 0.5,
negativeSpeechThreshold: 0.35,
wasm: { type: "cdn" },
});
await vad.init();
return vad;
}

Pass the created VAD to the voice config:
const vad = await createSileroVAD();
const voice = useGloveVoice({
runnable,
voice: { stt, createTTS, vad },
});

When using SileroVAD with Next.js, you need transpilePackages so Next.js processes the glove-voice package correctly. The dynamic import ensures the WASM-dependent code only loads in the browser.
/** @type {import('next').NextConfig} */
const config = {
transpilePackages: ["glove-voice"],
serverExternalPackages: ["better-sqlite3"], // if using SqliteStore
};
export default config;

SileroVAD supports two WASM loading modes, set via the wasm option:

| Mode | Config | Description |
|---|---|---|
| CDN (recommended) | { type: "cdn" } | Loads ONNX Runtime WASM files from jsDelivr CDN. Zero configuration required. Best for most deployments. |
| Local | { type: "local", path: "/onnx/" } | Loads WASM files from your public/ directory. For offline or air-gapped environments. Copy files from node_modules/onnxruntime-web/dist/ to public/onnx/. |
When building with SileroVAD, you will see warnings like:
⚠ Critical dependency: require function is used in a way
in which dependencies cannot be statically extracted

These come from onnxruntime-web's internal dynamic require and are harmless. The WASM loading works correctly at runtime.
| Parameter | Default | Description |
|---|---|---|
| positiveSpeechThreshold | 0.3 | Speech probability score (0-1) above which a frame is considered speech. Higher values mean less sensitivity and fewer false triggers. |
| negativeSpeechThreshold | 0.25 | Speech probability score (0-1) below which a frame is considered silence. Lower values require more definitive silence to end speech detection. |
| redemptionMs | 1400 | Milliseconds of silence allowed within speech before triggering speech_end. Acts as a debounce for brief pauses mid-sentence. |
| preSpeechPadMs | 800 | Milliseconds of audio to include before the detected speech start. Ensures the beginning of utterances is not clipped. |
| minSpeechMs | 100 | Minimum duration of speech in milliseconds. Utterances shorter than this are treated as misfires. |
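The redemptionMs behavior can be sketched as a pure function over timestamped frames (a simplification of the real frame-by-frame detector): silence must persist for the full redemption window before speech_end fires, and any resumed speech cancels the pending window.

```typescript
// Returns the timestamp (ms) at which speech_end would fire, or null
// if every silent stretch was shorter than the redemption window.
function detectSpeechEnd(
  frames: { t: number; isSpeech: boolean }[],
  redemptionMs = 1400,
): number | null {
  let silenceStart: number | null = null;
  for (const f of frames) {
    if (f.isSpeech) silenceStart = null;               // speech resumed: cancel pending end
    else if (silenceStart === null) silenceStart = f.t; // silence begins
    else if (f.t - silenceStart >= redemptionMs) return f.t; // window elapsed
  }
  return null;
}
```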
In VAD mode, the pipeline operates hands-free. The VAD continuously analyzes audio frames and automatically detects when the user starts and stops speaking.
- When the VAD fires speech_end, the STT adapter flushes its buffer, emits a final transcript, and the pipeline transitions to thinking.
- If the user speaks during the speaking or thinking modes, the pipeline calls interrupt() automatically. This aborts the in-flight Glove request, stops TTS playback, clears display slots, and returns to listening.
- When a blocking pushAndWait slot is active (for example, a checkout form), barge-in is suppressed at the voice layer. The pipeline checks displayManager.resolverStore.size and skips the interrupt if there are pending resolvers. For full protection, combine this with unAbortable: true on the tool so it survives abort signals from any source, not just voice.

const voice = useGloveVoice({
runnable,
voice: { stt, createTTS, turnMode: "vad" }, // "vad" is the default
});In manual mode, the consumer controls turn boundaries. No VAD is created. The mic captures audio and sends it to STT continuously, but nothing commits the utterance until you call commitTurn().
- Call voice.commitTurn() to signal the end of the user's utterance. This flushes the STT buffer and starts agent processing.
- To barge in, call voice.interrupt() explicitly.

For most push-to-talk use cases, useGlovePTT handles all of this automatically — see the Push-to-Talk section below. The following is the low-level alternative for reference:
const voice = useGloveVoice({
runnable,
voice: { stt, createTTS, turnMode: "manual" },
});
// Low-level push-to-talk button
<button
onPointerDown={() => voice.start()}
onPointerUp={() => voice.commitTurn()}
>
Hold to talk
</button>

useGlovePTT is a high-level hook that replaces approximately 80 lines of push-to-talk boilerplate with around 5 lines. It wraps useGloveVoice and handles the hotkey binding, click-vs-hold discrimination, minimum recording duration, and mute management for you:
import { useGlove, Render } from "glove-react";
import { useGlovePTT, VoicePTTButton } from "glove-react/voice";
import { stt, createTTS } from "@/lib/voice";
function ChatPanel() {
const glove = useGlove({ endpoint: "/api/chat", tools });
const ptt = useGlovePTT({
runnable: glove.runnable,
voice: { stt, createTTS },
hotkey: "Space",
});
return (
<>
<Render glove={glove} voice={ptt} renderInput={() => null} />
<VoicePTTButton ptt={ptt}>
{({ enabled, recording, mode }) => (
<button className={recording ? "recording" : enabled ? "active" : ""}>
<MicIcon />
</button>
)}
</VoicePTTButton>
</>
);
}

| Property | Type | Description |
|---|---|---|
| runnable | IGloveRunnable \| null | The Glove runnable instance. Pass useGlove().runnable. |
| voice | Omit<GloveVoiceConfig, "turnMode"> | Voice pipeline config. turnMode is forced to "manual" and startMuted to true internally. |
| hotkey? | string \| false | Keyboard hotkey code (default: "Space"). Uses KeyboardEvent.code values. Auto-ignores when focused on INPUT, TEXTAREA, or SELECT. Set to false to disable. |
| holdThreshold? | number | Hold duration in ms for click-vs-hold discrimination (default: 300). A quick click toggles voice on/off; a hold triggers PTT recording. |
| minRecordingMs? | number | Minimum recording duration in ms before committing a turn (default: 350). If the user releases early, the mic stays hot until the minimum is reached. |
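The holdThreshold / minRecordingMs interplay can be sketched as a pure classification function (a simplification — the real hook also manages pointer events and async commits):

```typescript
// Classify a press from its down/up timestamps (ms): a release before
// holdThreshold is a click (toggle); a longer press is a PTT recording
// whose commit is delayed until minRecordingMs has elapsed.
function classifyPress(
  downMs: number,
  upMs: number,
  holdThreshold = 300,
  minRecordingMs = 350,
): { kind: "toggle" } | { kind: "ptt"; commitAtMs: number } {
  const held = upMs - downMs;
  if (held < holdThreshold) return { kind: "toggle" };
  // Released early? Keep the mic hot until the minimum is reached.
  return { kind: "ptt", commitAtMs: Math.max(upMs, downMs + minRecordingMs) };
}
```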
| Property | Type | Description |
|---|---|---|
| enabled | boolean | Whether the voice pipeline is active (user toggled voice on). |
| recording | boolean | Whether the user is currently holding to record. |
| processing | boolean | Whether STT is finalizing after a short recording. |
| mode | VoiceMode | Current voice pipeline state: idle, listening, thinking, speaking. |
| transcript | string | Current partial transcript while user is speaking. |
| error | Error \| null | Last error from the voice pipeline. |
| toggle() | () => Promise<void> | Toggle the voice pipeline on/off. |
| interrupt() | () => void | Barge-in: abort in-flight request and stop TTS. |
| bind | { onPointerDown, onPointerUp, onPointerLeave } | Pointer event handlers to spread onto a mic button. Includes click-vs-hold discrimination. |
| voice | UseGloveVoiceReturn | The underlying voice hook return for advanced use cases. |
Headless (unstyled) component with a render prop pattern. Wraps ptt.bind with role="button", tabIndex, aria-label, aria-pressed, and touch safety (prevents context menu on long press, disables text selection during hold).
import { VoicePTTButton } from "glove-react/voice";
<VoicePTTButton ptt={ptt} className="mic-button">
{({ enabled, recording, processing, mode }) => (
<button className={recording ? "active" : ""}>
{processing ? <Spinner /> : <MicIcon />}
{enabled && <StatusDot />}
</button>
)}
</VoicePTTButton>

| Property | Type | Description |
|---|---|---|
| ptt | UseGlovePTTReturn | The return value of useGlovePTT(). |
| children | (props: VoicePTTButtonRenderProps) => ReactNode | Render prop for full styling control. Receives enabled, recording, processing, and mode. |
| className? | string | Additional className on the wrapper span. |
| style? | React.CSSProperties | Additional style on the wrapper span. |
The <Render> component accepts an optional voice prop to auto-render transcript and voice status. This works with both useGlovePTT and useGloveVoice return values:
<Render
glove={glove}
voice={ptt} // or useGloveVoice() return
renderTranscript={({ transcript }) => ( // optional custom renderer
<p className="transcript">{transcript}</p>
)}
renderVoiceStatus={({ mode }) => ( // optional custom renderer
<span className="status">{mode}</span>
)}
renderInput={() => null}
/>The voice prop accepts a VoiceRenderHandle, which is any object with transcript, mode, and enabled fields. Both UseGlovePTTReturn and UseGloveVoiceReturn satisfy this interface. The optional renderTranscript receives TranscriptRenderProps (with a transcript string), and renderVoiceStatus receives VoiceStatusRenderProps (with a mode value).
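Since VoiceRenderHandle is structural, any object with those three fields works. A hand-rolled handle (for tests or a custom pipeline) is a valid voice prop too — a sketch, with illustrative field values:

```typescript
// The structural interface described above: transcript, mode, enabled.
interface VoiceRenderHandle {
  transcript: string;
  mode: "idle" | "listening" | "thinking" | "speaking";
  enabled: boolean;
}

// A hand-rolled handle satisfies the interface without any hook.
const customHandle: VoiceRenderHandle = {
  transcript: "add a latte to my order",
  mode: "listening",
  enabled: true,
};
```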
function useGloveVoice(config: UseGloveVoiceConfig): UseGloveVoiceReturn

| Property | Type | Description |
|---|---|---|
| runnable | IGloveRunnable \| null | The Glove runnable instance. Pass useGlove().runnable. When null, start() will throw. |
| voice | GloveVoiceConfig | Voice pipeline configuration. Contains the STT adapter, TTS factory, turn mode, optional VAD override, and sample rate. |
| Property | Type | Description |
|---|---|---|
| stt | STTAdapter | Speech-to-text adapter instance. Any implementation of the STTAdapter contract. |
| createTTS | () => TTSAdapter | Factory function that returns a fresh TTS adapter per turn. Must be a factory, not a single instance, because GloveVoice creates a new TTS session for each model response. |
| turnMode? | "vad" \| "manual" | Turn detection mode. Default: "vad". In "manual" mode, no VAD is used and the consumer calls commitTurn(). |
| vad? | VADAdapter | Override the VAD implementation. Only used when turnMode is "vad". Pass a SileroVADAdapter for ML-based detection. |
| vadConfig? | VADConfig | Configuration for the built-in energy-based VAD. Only used when turnMode is "vad" and no custom vad is provided. Default silentFrames: 40 (~1600ms). |
| sampleRate? | number | Audio sample rate in Hz. Default: 16000. Must match STT and TTS adapter expectations. |
| startMuted? | boolean | Start the pipeline with mic muted. Defaults to true when turnMode is "manual", false otherwise. Eliminates the race condition between start() resolving and calling mute(). |
| Property | Type | Description |
|---|---|---|
| mode | VoiceMode | Current voice pipeline state: "idle", "listening", "thinking", or "speaking". |
| transcript | string | Current partial transcript while the user is speaking. Cleared when a turn is committed or the pipeline stops. |
| isActive | boolean | Whether the voice pipeline is active (mode is not "idle"). |
| enabled | boolean | Whether the user intended the pipeline to be active. True after start(), false after stop() or pipeline death (WebSocket drop, permission revoked). Unlike isActive, this tracks user intent and auto-resets — no manual sync useEffect needed. |
| error | Error \| null | Last error from the voice pipeline. Cleared on the next start() call. |
| start() | () => Promise<void> | Start the voice pipeline. Requests microphone permission, connects STT, and begins listening. Throws if runnable is null or mic permission is denied. |
| stop() | () => Promise<void> | Stop the voice pipeline. Interrupts any in-progress response, disconnects STT, releases the microphone, and returns to idle. |
| interrupt() | () => void | Barge-in. Aborts the in-flight Glove request, stops TTS playback, clears non-blocking display slots, and returns to listening. |
| commitTurn() | () => void | Manual turn commit. Flushes the current utterance to STT for finalization. Primary control mechanism in manual turn mode. Also works in VAD mode as an explicit override. |
| isMuted | boolean | Whether mic audio is currently muted (not forwarded to STT/VAD). The audio_chunk event still fires when muted. |
| mute() | () => void | Stop forwarding mic audio to STT/VAD. The mic stays active and audio_chunk events continue to fire (for visualization). No transcription or VAD detection occurs while muted. |
| unmute() | () => void | Resume forwarding mic audio to STT/VAD. Restores normal transcription and voice activity detection. |
| narrate(text) | (text: string) => Promise<void> | Speak arbitrary text through TTS without involving the model. Auto-mutes mic during playback. Resolves when all audio finishes playing. Safe to call from pushAndWait tool handlers. |
Use voice.narrate(text) to speak arbitrary text through TTS without sending it to the model. This is useful for reading display slot content aloud — order summaries, confirmation details, or any text you want the user to hear.
narrate() returns a promise that resolves when all audio finishes playing. It creates a fresh TTS adapter per call (same pattern as model turns) and auto-mutes the mic during playback to prevent TTS audio from feeding back into STT.
const checkout = defineTool({
name: "checkout",
unAbortable: true,
displayStrategy: "hide-on-complete",
async do(input, display) {
const cart = getCart();
// Narrate the cart summary before showing the form
await voice.narrate(
`Your order has ${cart.length} items totaling ${formatPrice(total)}.`
);
const result = await display.pushAndWait({ items: cart });
if (!result) return "Cancelled";
// Narrate the confirmation
await voice.narrate("Order placed! You'll receive a confirmation email shortly.");
cartOps.clear();
return "Order placed!";
},
});

Key detail: narrate() is safe to call from pushAndWait tool handlers. When a tool uses pushAndWait, the model is paused waiting for the tool result, so there is no concurrent model TTS to conflict with.
voice.mute() and voice.unmute() gate mic audio forwarding to STT and VAD. When muted, the mic stays active but no transcription or speech detection occurs. This is useful for temporarily disabling voice input without tearing down the pipeline.
<button onClick={voice.isMuted ? voice.unmute : voice.mute}>
{voice.isMuted ? "Unmute" : "Mute"}
</button>

The audio_chunk event on the underlying GloveVoice instance emits raw Int16Array PCM data from the mic, even when muted. Use this for waveform or audio level visualization:
// Listen to audio_chunk on the GloveVoice instance for visualization
voice.on("audio_chunk", (pcm: Int16Array) => {
// Compute RMS level for a simple meter
let sum = 0;
for (let i = 0; i < pcm.length; i++) sum += pcm[i] * pcm[i];
const level = Math.sqrt(sum / pcm.length) / 32768;
updateMeter(level);
});

Tools built for voice agents have different design considerations than text-based tools. Voice users cannot click buttons or fill forms while speaking, and the model's response text gets spoken aloud.
In voice-first apps, use pushAndForget instead of pushAndWait for tools that display information. Voice users see the visual result while hearing the narration, but they do not need to interact with it to continue the conversation.
const showMenuTool: ToolConfig<{ items: MenuItem[] }> = {
name: "show_menu",
description: "Display menu items to the user.",
inputSchema: z.object({
items: z.array(z.object({
name: z.string(),
price: z.number(),
description: z.string(),
})),
}),
async do(input, display) {
await display.pushAndForget({ items: input.items });
// Return concise text for the model to narrate
return {
status: "success",
data: input.items
.map(i => `${i.name} for $${i.price.toFixed(2)}`)
.join(", "),
};
},
render({ data }) {
return <MenuCard items={data.items} />;
},
};

The data field in your tool result is what the model sees and narrates. Keep it short and descriptive. Avoid returning raw JSON or lengthy details — the model will try to speak all of it.
Append voice-specific instructions to your system prompt when voice is active. This tells the model to keep responses short and conversational:
const basePrompt = "You are a helpful barista assistant.";
const voiceInstructions = `
Voice mode is active. The user is speaking to you.
- Keep responses under 2 sentences
- Describe tool results concisely
- Use natural conversational language
- Do not use markdown, lists, or formatting
`;
function ChatApp() {
const voice = useGloveVoice({ runnable, voice: voiceConfig });
const systemPrompt = voice.isActive
? basePrompt + voiceInstructions
: basePrompt;
const glove = useGlove({ systemPrompt, tools, sessionId });
// ...
}

Full barge-in protection for mutation-critical tools (like a checkout form) requires two layers:
1. Voice-layer suppression — while a pushAndWait resolver is pending, GloveVoice checks displayManager.resolverStore.size > 0 and skips interrupt() entirely. The barge-in never fires.
2. Core-layer protection — unAbortable: true on the tool makes glove-core run it to completion even if the abort signal fires. This protects against programmatic interrupts, not just voice.

Important: pushAndWait alone does not make a tool survive an abort signal. It only suppresses the voice barge-in trigger. If interrupt() is called by other means, only unAbortable: true guarantees the tool runs to completion. Use both together for tools that perform mutations.
const checkout = defineTool({
name: "checkout",
unAbortable: true, // Layer 2: survives abort signals
displayStrategy: "hide-on-complete",
async do(_input, display) {
const result = await display.pushAndWait({ items }); // Layer 1: suppresses voice barge-in
if (!result) return "Cancelled";
cartOps.clear(); // Safe — tool guaranteed to complete
return "Order placed!";
},
});

Use pushAndWait sparingly in voice-first apps — only for actions that genuinely require explicit user confirmation. For display-only tools, always prefer pushAndForget so barge-in works naturally.
Never import glove-voice/silero-vad at module level in a Next.js or SSR environment. The WASM dependencies will fail during server-side rendering. Always use await import("glove-voice/silero-vad") inside a function that only runs in the browser.
ElevenLabs Scribe sometimes returns an empty committed transcript for very short utterances like “No” or “Hi”. The ElevenLabsSTTAdapter handles this automatically by falling back to the last partial transcript. You do not need to handle this case yourself.
ElevenLabs WebSocket connections disconnect after approximately 20 seconds of inactivity. This can happen during tool execution when no text is being sent. GloveVoice handles this by closing the TTS session after each model_response_complete event and opening a fresh one when the next text_delta arrives.
A pending pushAndWait resolver suppresses voice barge-in at the trigger level, but does not protect the tool from abort signals. For mutation-critical tools, always set unAbortable: true alongside pushAndWait to guarantee the tool runs to completion. See the pushAndWait and unAbortable section above for the full two-layer explanation.
voice.start() requests microphone permission. If the user denies it, the call throws an error. Handle this and show an appropriate message:
async function handleVoiceToggle() {
try {
if (voice.isActive) {
await voice.stop();
} else {
await voice.start();
}
} catch (err) {
if (err instanceof Error && err.message.includes("Permission")) {
alert("Microphone access is required for voice mode.");
}
}
}

Voice responses should be short and conversational. Instruct the LLM in the system prompt: “Keep voice responses under 2 sentences. Describe results concisely.” Without this guidance, the model may generate long, formatted responses that sound unnatural when spoken aloud.
GloveVoice calls createTTS() to get a fresh TTS adapter for each model response within a turn. Do not pass a single adapter instance — it will fail on the second response because the WebSocket connection is already closed. Always pass a factory function:
// Correct: factory function
voice: { stt, createTTS: () => new ElevenLabsTTSAdapter({ getToken, voiceId }) }
// Wrong: single instance
voice: { stt, createTTS: new ElevenLabsTTSAdapter({ getToken, voiceId }) }

All adapters must agree on the audio format. The default is 16kHz mono PCM (Int16Array for capture/STT, Uint8Array for TTS playback). Do not change the sample rate unless your provider requires something different, and if you do, set sampleRate in GloveVoiceConfig to match.
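For reference, the standard conversion from Web Audio Float32 samples (−1..1) to the 16-bit PCM this pipeline expects looks like this — generic audio code, not a Glove-specific API:

```typescript
// Convert Float32 samples in [-1, 1] to 16-bit signed PCM.
function floatTo16BitPCM(input: Float32Array): Int16Array {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i])); // clamp out-of-range samples
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;      // asymmetric int16 range
  }
  return out;
}
```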
voice.narrate() automatically mutes the mic during playback to prevent TTS audio from feeding back into STT/VAD. It restores the previous mute state when done. If you were already muted before calling narrate(), you will remain muted afterward.
Calling narrate() before voice.start() throws an error because the TTS factory and AudioPlayer are not yet initialized. Always ensure the voice pipeline is active before narrating.
If you see WASM loading errors when using SileroVAD, check that your onnxruntime-web version matches what @ricky0123/vad-web expects. The Glove monorepo pins onnxruntime-web@^1.22.0 alongside @ricky0123/vad-web@^0.0.30. Version mismatches between the ONNX Runtime WASM files and the JavaScript API will cause cryptic loading failures.
When context compaction is triggered, the core emits compaction_start and compaction_end observer events. The voice pipeline listens for these and ignores all text_delta events while compaction is in progress. This means the compaction summary is never narrated through TTS. No action is needed on your part — this is handled automatically by GloveVoice.
When using turnMode: "manual" (push-to-talk), you do not need to import SileroVAD or set up any VAD at all. VAD is only used in turnMode: "vad". Skip the WASM overhead for PTT-only apps.
The <Render> component includes a built-in text input. If you have your own input form, always pass renderInput={() => null} to suppress the built-in one — otherwise you get duplicate inputs.
Tool do() functions run outside the React component tree. To access React context (for example, a wallet hook or theme), use a mutable singleton ref synced from a React component (bridge pattern):
// bridge.ts
export const voiceBridge = { current: null as UseGloveVoiceReturn | null };
// In your component:
useEffect(() => {
voiceBridge.current = voice;
}, [voice]);
// In your tool:
async do(input, display) {
await voiceBridge.current?.narrate("Processing...");
}