Voice Integration

Add real-time voice to any Glove agent. The voice pipeline handles microphone capture, speech-to-text, agent processing, and text-to-speech playback — while all your existing tools, display stack, and context management continue to work unchanged.

Overview

Pipeline Architecture

```
Mic --> VAD --> STT --> Glove --> TTS --> Speaker
         |               |
  speech boundary    processRequest()
     detection       (tools, display stack,
                      context, compaction)
```

The voice system is split across three packages, each with a specific responsibility:

- glove-voice: the framework-agnostic pipeline core, including the STT/TTS/VAD adapter contracts, built-in implementations, audio capture, and playback.
- glove-react/voice: React bindings, including the useGloveVoice and useGlovePTT hooks and the VoicePTTButton component.
- glove-next: server-side helpers, including the createVoiceTokenHandler factory for provider token routes.

Turn Modes

GloveVoice supports two turn detection modes that control how the pipeline decides when the user has finished speaking:

- "vad" (default): hands-free. A VAD watches the audio stream and commits the turn when it detects the end of speech.
- "manual": push-to-talk. No VAD runs; the consumer commits each turn explicitly with commitTurn().

Voice Modes

The pipeline transitions through four states during operation:

| Mode | State | Description |
|---|---|---|
| `idle` | Pipeline off | Not started or stopped. No mic access, no connections. |
| `listening` | Mic active | Capturing audio, sending to STT. Waiting for the user to speak. |
| `thinking` | Agent processing | User utterance committed. Glove is processing the request (model call, tool execution). |
| `speaking` | TTS playback | Audio chunks streaming from TTS to the speaker. Barge-in returns to listening. |

Quick Start

Get voice working in five minutes with ElevenLabs. This assumes you already have a Glove agent running with glove-react and glove-next.

Install

```bash
pnpm add glove-voice
```

glove-react and glove-next are already part of a typical Glove project. The voice subpaths (glove-react/voice and createVoiceTokenHandler from glove-next) are included in those packages.

Step 1: Token Routes

Create two API routes that generate short-lived ElevenLabs tokens. Your API key stays on the server — the browser only receives single-use tokens.

app/api/voice/stt-token/route.ts:

```typescript
import { createVoiceTokenHandler } from "glove-next";

export const GET = createVoiceTokenHandler({
  provider: "elevenlabs",
  type: "stt",
});
```

app/api/voice/tts-token/route.ts:

```typescript
import { createVoiceTokenHandler } from "glove-next";

export const GET = createVoiceTokenHandler({
  provider: "elevenlabs",
  type: "tts",
});
```

Set your ElevenLabs API key in .env.local:

```bash
ELEVENLABS_API_KEY=your_api_key_here
```

Step 2: Client Voice Config

Create a voice configuration file that sets up the ElevenLabs STT adapter and TTS factory. The token fetchers point to the routes you just created.

lib/voice.ts:

```typescript
import { createElevenLabsAdapters } from "glove-voice";

async function fetchToken(path: string): Promise<string> {
  const res = await fetch(path);
  const data = await res.json();
  return data.token;
}

export const { stt, createTTS } = createElevenLabsAdapters({
  getSTTToken: () => fetchToken("/api/voice/stt-token"),
  getTTSToken: () => fetchToken("/api/voice/tts-token"),
  voiceId: "JBFqnCBsd6RMkjVDRZzb",
});
```

The voiceId is an ElevenLabs voice identifier. Browse the ElevenLabs Voice Library to find a voice and copy its ID.

Step 3: React Hook

Use useGloveVoice alongside useGlove to wire the voice pipeline into your component.

```tsx
import { useGlove } from "glove-react";
import { useGloveVoice } from "glove-react/voice";
import { stt, createTTS } from "@/lib/voice";

function App() {
  const { runnable } = useGlove({ tools, sessionId });
  const voice = useGloveVoice({
    runnable,
    voice: { stt, createTTS },
  });

  return (
    <button onClick={voice.isActive ? voice.stop : voice.start}>
      {voice.mode}
    </button>
  );
}
```

That is it. Clicking the button starts the mic, connects STT, and begins listening. Speak naturally and the pipeline handles the rest: your speech is transcribed, sent to Glove, and the response is spoken back.

Voice Adapters

The voice pipeline is built around three adapter contracts. Each adapter is an EventEmitter with a specific set of events and methods. You can swap implementations freely — the pipeline does not care which provider you use, only that the contract is satisfied.

STTAdapter

Streaming speech-to-text. Receives raw PCM audio and emits transcripts.

| Event | Payload | Description |
|---|---|---|
| `partial` | `string` | Streaming partial transcript. Changes as more speech arrives. |
| `final` | `string` | Stable, finalized transcript for the completed utterance. |
| `error` | `Error` | Connection or transcription error. |
| `close` | (none) | WebSocket connection closed. |

| Method | Signature | Description |
|---|---|---|
| `connect()` | `() => Promise<void>` | Open the connection. Adapter fetches credentials internally via its getToken function. |
| `sendAudio(pcm)` | `(pcm: Int16Array) => void` | Send a raw PCM chunk (16kHz mono Int16Array). |
| `flushUtterance()` | `() => void` | Signal end of utterance. Adapter should finalize the current transcript. Called by VAD on speech_end. |
| `disconnect()` | `() => void` | Close the connection and release resources. |
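To make the contract concrete, here is a minimal in-memory adapter that satisfies it. This is a hedged sketch, not the ElevenLabs implementation; the class name and transcript strings are invented for illustration.

```typescript
import { EventEmitter } from "node:events";

// Illustrative mock of the STTAdapter contract. A real adapter would open a
// provider WebSocket in connect() and emit transcripts as audio streams in.
class MockSTTAdapter extends EventEmitter {
  private buffered = 0; // samples received for the current utterance

  async connect(): Promise<void> {
    // A real adapter fetches credentials via its getToken function and
    // opens the connection here.
  }

  sendAudio(pcm: Int16Array): void {
    this.buffered += pcm.length;
    this.emit("partial", `heard ${this.buffered} samples`);
  }

  flushUtterance(): void {
    // Called by VAD on speech_end: finalize the current transcript.
    this.emit("final", `utterance of ${this.buffered} samples`);
    this.buffered = 0;
  }

  disconnect(): void {
    this.emit("close");
  }
}
```

Because the pipeline only depends on these events and methods, a mock like this is also a convenient way to test voice UI without a provider account.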

TTSAdapter

Streaming text-to-speech. Receives text chunks and emits audio.

| Event | Payload | Description |
|---|---|---|
| `audio_chunk` | `Uint8Array` | Raw PCM audio chunk (16kHz mono), ready for the AudioPlayer. |
| `done` | (none) | All audio for the current turn has been received. |
| `error` | `Error` | Connection or synthesis error. |

| Method | Signature | Description |
|---|---|---|
| `open()` | `() => Promise<void>` | Open the connection. Resolves once the adapter is ready to accept text. |
| `sendText(text)` | `(text: string) => void` | Send a text chunk for synthesis. Safe to call before open() resolves; adapters queue internally. |
| `flush()` | `() => void` | Signal end of text stream. Flushes remaining audio. Must be called once after all text is sent. |
| `destroy()` | `() => void` | Immediately close the connection, dropping any pending audio. |
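The queue-before-open behavior can be sketched with a similar mock (again illustrative, not the ElevenLabs adapter; the dummy audio chunks simply mirror the text length):

```typescript
import { EventEmitter } from "node:events";

// Illustrative mock of the TTSAdapter contract, showing the "text sent before
// open() resolves is queued internally" behavior noted in the table above.
class MockTTSAdapter extends EventEmitter {
  private queue: string[] = [];
  private ready = false;

  async open(): Promise<void> {
    this.ready = true;
    // Drain anything queued before the connection was ready
    for (const text of this.queue.splice(0)) this.synthesize(text);
  }

  sendText(text: string): void {
    if (!this.ready) { this.queue.push(text); return; }
    this.synthesize(text);
  }

  flush(): void {
    this.emit("done");
  }

  destroy(): void {
    this.ready = false;
    this.queue.length = 0; // drop pending text/audio
  }

  private synthesize(text: string): void {
    // A real adapter receives PCM from the provider; emit a dummy chunk here
    this.emit("audio_chunk", new Uint8Array(text.length));
  }
}
```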

VADAdapter

Voice activity detection. Processes audio frames and signals speech boundaries.

| Event | Payload | Description |
|---|---|---|
| `speech_start` | (none) | User started speaking. |
| `speech_end` | (none) | User stopped speaking. Triggers STT flush in VAD mode. |

| Method | Signature | Description |
|---|---|---|
| `process(pcm)` | `(pcm: Int16Array) => void` | Process a PCM frame. Call on every AudioCapture chunk event. |
| `reset()` | `() => void` | Force reset internal state. Called when interrupting a turn. |

Built-in Implementations

| Adapter | Provider | Description |
|---|---|---|
| `ElevenLabsSTTAdapter` | ElevenLabs Scribe Realtime | WebSocket-based streaming STT using ElevenLabs Scribe v2. Supports partial and committed transcripts with auto-reconnect. |
| `ElevenLabsTTSAdapter` | ElevenLabs Input Streaming | WebSocket-based streaming TTS using ElevenLabs Turbo v2.5. Streams text in, receives PCM audio chunks out. |
| VAD (energy-based) | Built-in | Zero-dependency energy-based voice activity detector. Uses RMS energy thresholds. Good for quiet environments. |
| `SileroVADAdapter` | Silero VAD (WASM) | ML-based voice activity detection using ONNX Runtime. Much more accurate in noisy environments. Loaded from the glove-voice/silero-vad subpath. |

Security: Token-based Auth

Voice providers like ElevenLabs, Deepgram, and Cartesia authenticate via API keys. These keys must never be exposed to the browser. The token pattern solves this:

```
Browser                      Your Server                  Provider API
   |                              |                            |
   |-- GET /api/voice/token ----->|                            |
   |                              |-- POST /token ------------>|
   |                              |   (with API key)           |
   |                              |<-- { token } --------------|
   |<-- { token } ----------------|                            |
   |                              |                            |
   |-- WebSocket (with token) -------- direct connection ----->|
```

createVoiceTokenHandler

Factory function from glove-next that creates a Next.js App Router GET handler for generating provider tokens.

```typescript
function createVoiceTokenHandler(
  config: VoiceTokenHandlerConfig
): (req: Request) => Promise<Response>
```

VoiceTokenHandlerConfig

A discriminated union based on the provider field:

| Provider | Fields | Description |
|---|---|---|
| `elevenlabs` | `type: "stt" \| "tts"` | ElevenLabs requires separate tokens for STT (realtime_scribe) and TTS (tts_websocket). Create one route for each. Reads ELEVENLABS_API_KEY from env. |
| `deepgram` | `ttlSeconds?: number` | Deepgram uses a single token for all operations. ttlSeconds controls token lifetime (default: 30). Reads DEEPGRAM_API_KEY from env. |
| `cartesia` | (none) | Cartesia uses a single JWT token. Reads CARTESIA_API_KEY from env. |
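As a concrete sketch, a Deepgram setup needs only one route, since Deepgram does not split STT and TTS tokens. The route path here is hypothetical, and this assumes DEEPGRAM_API_KEY is set in the environment:

```typescript
// app/api/voice/token/route.ts (hypothetical path)
import { createVoiceTokenHandler } from "glove-next";

export const GET = createVoiceTokenHandler({
  provider: "deepgram",
  ttlSeconds: 60, // short-lived token; the default is 30
});
```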

All providers accept an optional apiKey field to pass the key directly instead of reading from environment variables.

```typescript
// Override the env var with a direct key
export const GET = createVoiceTokenHandler({
  provider: "elevenlabs",
  type: "stt",
  apiKey: "sk-...",
});
```

VAD: Voice Activity Detection

VAD determines when the user starts and stops speaking. This controls when to flush the STT buffer and commit a turn, and when to trigger barge-in during playback.

Built-in VAD (Energy-based)

The default VAD uses RMS energy thresholds. It has zero dependencies, works everywhere, and is effective in quiet environments. When no custom vad is passed to GloveVoice, the built-in VAD is used automatically.

| Parameter | Default | Description |
|---|---|---|
| `threshold` | 0.01 | RMS energy level to consider as speech. Higher values require louder speech. |
| `silentFrames` | 15 (~600ms) | Consecutive silent frames before speech_end fires. Increase for longer natural pauses. GloveVoice defaults to 40 (~1600ms). |
| `speechFrames` | 3 | Consecutive speech frames before speech_start fires. Avoids false triggers from brief noises. |
```typescript
import { useGloveVoice } from "glove-react/voice";

// Override VAD sensitivity via vadConfig
const voice = useGloveVoice({
  runnable,
  voice: {
    stt,
    createTTS,
    vadConfig: { silentFrames: 60, threshold: 0.02 },
  },
});
```
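For intuition, the three parameters imply a small per-frame state machine. The following is a hedged re-implementation sketch of energy-based detection; EnergyVAD and VADEvents are illustrative names, not the glove-voice internals.

```typescript
// Sketch of an energy-based VAD loop: RMS per frame, with run-length
// counters implementing speechFrames / silentFrames hysteresis.
type VADEvents = { onSpeechStart(): void; onSpeechEnd(): void };

class EnergyVAD {
  private speechRun = 0;
  private silentRun = 0;
  private speaking = false;

  constructor(
    private events: VADEvents,
    private threshold = 0.01,    // RMS level counted as speech
    private silentFrames = 15,   // frames of silence before speech_end
    private speechFrames = 3,    // frames of speech before speech_start
  ) {}

  process(pcm: Int16Array): void {
    // RMS energy normalized to 0..1
    let sum = 0;
    for (let i = 0; i < pcm.length; i++) sum += pcm[i] * pcm[i];
    const rms = Math.sqrt(sum / pcm.length) / 32768;

    if (rms >= this.threshold) {
      this.speechRun++;
      this.silentRun = 0;
      if (!this.speaking && this.speechRun >= this.speechFrames) {
        this.speaking = true;
        this.events.onSpeechStart();
      }
    } else {
      this.silentRun++;
      this.speechRun = 0;
      if (this.speaking && this.silentRun >= this.silentFrames) {
        this.speaking = false;
        this.events.onSpeechEnd();
      }
    }
  }
}
```

The hysteresis is the key design point: requiring several consecutive frames on each side prevents a single noisy frame from toggling the state.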

SileroVAD (ML-based)

For noisy environments or higher accuracy, use SileroVADAdapter. It runs a neural network (Silero VAD v5) via ONNX Runtime in the browser using WebAssembly. The ML model produces a speech probability score for each audio frame, making it far more accurate than energy-based detection at distinguishing speech from background noise.

The WASM Challenge

SileroVAD depends on @ricky0123/vad-web and onnxruntime-web, which load WASM files in the browser. If you import this from the main glove-voice barrel, bundlers (Next.js, Vite) try to resolve WASM files at build time and may attempt to bundle them for SSR, causing errors.

The solution is a separate entry point at glove-voice/silero-vad combined with a dynamic import:

```typescript
export async function createSileroVAD() {
  const { SileroVADAdapter } = await import("glove-voice/silero-vad");
  const vad = new SileroVADAdapter({
    positiveSpeechThreshold: 0.5,
    negativeSpeechThreshold: 0.35,
    wasm: { type: "cdn" },
  });
  await vad.init();
  return vad;
}
```

Pass the created VAD to the voice config:

```typescript
const vad = await createSileroVAD();

const voice = useGloveVoice({
  runnable,
  voice: { stt, createTTS, vad },
});
```

Next.js Configuration

When using SileroVAD with Next.js, you need transpilePackages so Next.js processes the glove-voice package correctly. The dynamic import ensures the WASM-dependent code only loads in the browser.

next.config.ts:

```typescript
/** @type {import('next').NextConfig} */
const config = {
  transpilePackages: ["glove-voice"],
  serverExternalPackages: ["better-sqlite3"], // if using SqliteStore
};

export default config;
```

WASM Loading Modes

| Mode | Config | Description |
|---|---|---|
| CDN (recommended) | `{ type: "cdn" }` | Loads ONNX Runtime WASM files from the jsDelivr CDN. Zero configuration required. Best for most deployments. |
| Local | `{ type: "local", path: "/onnx/" }` | Loads WASM files from your public/ directory. For offline or air-gapped environments. Copy the files from node_modules/onnxruntime-web/dist/ to public/onnx/. |

Build Warnings

When building with SileroVAD, you will see warnings like:

```
Critical dependency: require function is used in a way
in which dependencies cannot be statically extracted
```

These come from onnxruntime-web's internal dynamic require and are harmless. The WASM loading works correctly at runtime.

Tuning SileroVAD Parameters

| Parameter | Default | Description |
|---|---|---|
| `positiveSpeechThreshold` | 0.3 | Speech probability score (0-1) above which a frame is considered speech. Higher values mean less sensitivity and fewer false triggers. |
| `negativeSpeechThreshold` | 0.25 | Speech probability score (0-1) below which a frame is considered silence. Lower values require more definitive silence to end speech detection. |
| `redemptionMs` | 1400 | Milliseconds of silence allowed within speech before triggering speech_end. Acts as a debounce for brief pauses mid-sentence. |
| `preSpeechPadMs` | 800 | Milliseconds of audio to include before the detected speech start. Ensures the beginning of utterances is not clipped. |
| `minSpeechMs` | 100 | Minimum duration of speech in milliseconds. Utterances shorter than this are treated as misfires. |

Turn Modes

VAD Mode (Default)

In VAD mode, the pipeline operates hands-free. The VAD continuously analyzes audio frames and automatically detects when the user starts and stops speaking.

```typescript
const voice = useGloveVoice({
  runnable,
  voice: { stt, createTTS, turnMode: "vad" }, // "vad" is the default
});
```

Manual Mode (Push-to-Talk)

In manual mode, the consumer controls turn boundaries. No VAD is created. The mic captures audio and sends it to STT continuously, but nothing commits the utterance until you call commitTurn().

For most push-to-talk use cases, useGlovePTT handles all of this automatically — see the Push-to-Talk section below. The following is the low-level alternative for reference:

```tsx
const voice = useGloveVoice({
  runnable,
  voice: { stt, createTTS, turnMode: "manual" },
});

// Low-level push-to-talk button
<button
  onPointerDown={() => voice.start()}
  onPointerUp={() => voice.commitTurn()}
>
  Hold to talk
</button>
```

Push-to-Talk (useGlovePTT)

useGlovePTT is a high-level hook that replaces roughly 80 lines of push-to-talk boilerplate with around 5. It wraps useGloveVoice and handles:

- forcing turnMode: "manual" and startMuted: true on the voice config
- pointer handlers (bind) with click-vs-hold discrimination: a quick click toggles voice on/off, a hold records
- an optional keyboard hotkey that is ignored while a form field has focus
- a minimum recording duration, so releasing early still commits a usable turn

Quick Example

```tsx
import { useGlove, Render } from "glove-react";
import { useGlovePTT, VoicePTTButton } from "glove-react/voice";
import { stt, createTTS } from "@/lib/voice";

function ChatPanel() {
  const glove = useGlove({ endpoint: "/api/chat", tools });
  const ptt = useGlovePTT({
    runnable: glove.runnable,
    voice: { stt, createTTS },
    hotkey: "Space",
  });

  return (
    <>
      <Render glove={glove} voice={ptt} renderInput={() => null} />
      <VoicePTTButton ptt={ptt}>
        {({ enabled, recording, mode }) => (
          <button className={recording ? "recording" : enabled ? "active" : ""}>
            <MicIcon />
          </button>
        )}
      </VoicePTTButton>
    </>
  );
}
```

UseGlovePTTConfig

| Property | Type | Description |
|---|---|---|
| `runnable` | `IGloveRunnable \| null` | The Glove runnable instance. Pass useGlove().runnable. |
| `voice` | `Omit<GloveVoiceConfig, "turnMode">` | Voice pipeline config. turnMode is forced to "manual" and startMuted to true internally. |
| `hotkey?` | `string \| false` | Keyboard hotkey code (default: "Space"). Uses KeyboardEvent.code values. Auto-ignores when focus is on an INPUT, TEXTAREA, or SELECT. Set to false to disable. |
| `holdThreshold?` | `number` | Hold duration in ms for click-vs-hold discrimination (default: 300). A quick click toggles voice on/off; a hold triggers PTT recording. |
| `minRecordingMs?` | `number` | Minimum recording duration in ms before committing a turn (default: 350). If the user releases early, the mic stays hot until the minimum is reached. |

UseGlovePTTReturn

| Property | Type | Description |
|---|---|---|
| `enabled` | `boolean` | Whether the voice pipeline is active (user toggled voice on). |
| `recording` | `boolean` | Whether the user is currently holding to record. |
| `processing` | `boolean` | Whether STT is finalizing after a short recording. |
| `mode` | `VoiceMode` | Current voice pipeline state: idle, listening, thinking, speaking. |
| `transcript` | `string` | Current partial transcript while the user is speaking. |
| `error` | `Error \| null` | Last error from the voice pipeline. |
| `toggle()` | `() => Promise<void>` | Toggle the voice pipeline on/off. |
| `interrupt()` | `() => void` | Barge-in: abort the in-flight request and stop TTS. |
| `bind` | `{ onPointerDown, onPointerUp, onPointerLeave }` | Pointer event handlers to spread onto a mic button. Includes click-vs-hold discrimination. |
| `voice` | `UseGloveVoiceReturn` | The underlying voice hook return, for advanced use cases. |

VoicePTTButton

Headless (unstyled) component with a render prop pattern. Wraps ptt.bind with role="button", tabIndex, aria-label, aria-pressed, and touch safety (prevents context menu on long press, disables text selection during hold).

```tsx
import { VoicePTTButton } from "glove-react/voice";

<VoicePTTButton ptt={ptt} className="mic-button">
  {({ enabled, recording, processing, mode }) => (
    <button className={recording ? "active" : ""}>
      {processing ? <Spinner /> : <MicIcon />}
      {enabled && <StatusDot />}
    </button>
  )}
</VoicePTTButton>
```

VoicePTTButtonProps

| Property | Type | Description |
|---|---|---|
| `ptt` | `UseGlovePTTReturn` | The return value of useGlovePTT(). |
| `children` | `(props: VoicePTTButtonRenderProps) => ReactNode` | Render prop for full styling control. Receives enabled, recording, processing, and mode. |
| `className?` | `string` | Additional className on the wrapper span. |
| `style?` | `React.CSSProperties` | Additional style on the wrapper span. |

Render Voice Integration

The <Render> component accepts an optional voice prop to auto-render transcript and voice status. This works with both useGlovePTT and useGloveVoice return values:

```tsx
<Render
  glove={glove}
  voice={ptt}                              // or useGloveVoice() return
  renderTranscript={({ transcript }) => (  // optional custom renderer
    <p className="transcript">{transcript}</p>
  )}
  renderVoiceStatus={({ mode }) => (       // optional custom renderer
    <span className="status">{mode}</span>
  )}
  renderInput={() => null}
/>
```

The voice prop accepts a VoiceRenderHandle, which is any object with transcript, mode, and enabled fields. Both UseGlovePTTReturn and UseGloveVoiceReturn satisfy this interface. The optional renderTranscript receives TranscriptRenderProps (with a transcript string), and renderVoiceStatus receives VoiceStatusRenderProps (with a mode value).
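As a type sketch, the structural contract described above can be written out as follows. The field names come from the text; the inline VoiceMode union mirrors the four documented states.

```typescript
// Sketch of the VoiceRenderHandle shape; any object with these three
// fields (including UseGlovePTTReturn and UseGloveVoiceReturn) satisfies it.
type VoiceMode = "idle" | "listening" | "thinking" | "speaking";

interface VoiceRenderHandle {
  transcript: string;
  mode: VoiceMode;
  enabled: boolean;
}

// Because the prop is structural, a plain stub also works (e.g. in tests):
const stub: VoiceRenderHandle = { transcript: "", mode: "idle", enabled: false };
```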

useGloveVoice API Reference

Signature

```typescript
function useGloveVoice(config: UseGloveVoiceConfig): UseGloveVoiceReturn
```

UseGloveVoiceConfig

| Property | Type | Description |
|---|---|---|
| `runnable` | `IGloveRunnable \| null` | The Glove runnable instance. Pass useGlove().runnable. When null, start() will throw. |
| `voice` | `GloveVoiceConfig` | Voice pipeline configuration. Contains the STT adapter, TTS factory, turn mode, optional VAD override, and sample rate. |

GloveVoiceConfig

| Property | Type | Description |
|---|---|---|
| `stt` | `STTAdapter` | Speech-to-text adapter instance. Any implementation of the STTAdapter contract. |
| `createTTS` | `() => TTSAdapter` | Factory function that returns a fresh TTS adapter per turn. Must be a factory, not a single instance, because GloveVoice creates a new TTS session for each model response. |
| `turnMode?` | `"vad" \| "manual"` | Turn detection mode. Default: "vad". In "manual" mode, no VAD is used and the consumer calls commitTurn(). |
| `vad?` | `VADAdapter` | Override the VAD implementation. Only used when turnMode is "vad". Pass a SileroVADAdapter for ML-based detection. |
| `vadConfig?` | `VADConfig` | Configuration for the built-in energy-based VAD. Only used when turnMode is "vad" and no custom vad is provided. Default silentFrames: 40 (~1600ms). |
| `sampleRate?` | `number` | Audio sample rate in Hz. Default: 16000. Must match STT and TTS adapter expectations. |
| `startMuted?` | `boolean` | Start the pipeline with the mic muted. Defaults to true when turnMode is "manual", false otherwise. Eliminates the race condition between start() resolving and calling mute(). |

UseGloveVoiceReturn

| Property | Type | Description |
|---|---|---|
| `mode` | `VoiceMode` | Current voice pipeline state: "idle", "listening", "thinking", or "speaking". |
| `transcript` | `string` | Current partial transcript while the user is speaking. Cleared when a turn is committed or the pipeline stops. |
| `isActive` | `boolean` | Whether the voice pipeline is active (mode is not "idle"). |
| `enabled` | `boolean` | Whether the user intended the pipeline to be active. True after start(), false after stop() or pipeline death (WebSocket drop, permission revoked). Unlike isActive, this tracks user intent and auto-resets — no manual sync useEffect needed. |
| `error` | `Error \| null` | Last error from the voice pipeline. Cleared on the next start() call. |
| `start()` | `() => Promise<void>` | Start the voice pipeline. Requests microphone permission, connects STT, and begins listening. Throws if runnable is null or mic permission is denied. |
| `stop()` | `() => Promise<void>` | Stop the voice pipeline. Interrupts any in-progress response, disconnects STT, releases the microphone, and returns to idle. |
| `interrupt()` | `() => void` | Barge-in. Aborts the in-flight Glove request, stops TTS playback, clears non-blocking display slots, and returns to listening. |
| `commitTurn()` | `() => void` | Manual turn commit. Flushes the current utterance to STT for finalization. Primary control mechanism in manual turn mode. Also works in VAD mode as an explicit override. |
| `isMuted` | `boolean` | Whether mic audio is currently muted (not forwarded to STT/VAD). The audio_chunk event still fires when muted. |
| `mute()` | `() => void` | Stop forwarding mic audio to STT/VAD. The mic stays active and audio_chunk events continue to fire (for visualization). No transcription or VAD detection occurs while muted. |
| `unmute()` | `() => void` | Resume forwarding mic audio to STT/VAD. Restores normal transcription and voice activity detection. |
| `narrate(text)` | `(text: string) => Promise<void>` | Speak arbitrary text through TTS without involving the model. Auto-mutes the mic during playback. Resolves when all audio finishes playing. Safe to call from pushAndWait tool handlers. |

Narration & Mic Control

Narrating Display Slots

Use voice.narrate(text) to speak arbitrary text through TTS without sending it to the model. This is useful for reading display slot content aloud — order summaries, confirmation details, or any text you want the user to hear.

narrate() returns a promise that resolves when all audio finishes playing. It creates a fresh TTS adapter per call (same pattern as model turns) and auto-mutes the mic during playback to prevent TTS audio from feeding back into STT.

```tsx
const checkout = defineTool({
  name: "checkout",
  unAbortable: true,
  displayStrategy: "hide-on-complete",
  async do(input, display) {
    const cart = getCart();

    // Narrate the cart summary before showing the form
    await voice.narrate(
      `Your order has ${cart.length} items totaling ${formatPrice(total)}.`
    );

    const result = await display.pushAndWait({ items: cart });
    if (!result) return "Cancelled";

    // Narrate the confirmation
    await voice.narrate("Order placed! You'll receive a confirmation email shortly.");

    cartOps.clear();
    return "Order placed!";
  },
});
```

Key detail: narrate() is safe to call from pushAndWait tool handlers. When a tool uses pushAndWait, the model is paused waiting for the tool result, so there is no concurrent model TTS to conflict with.

Mute / Unmute

voice.mute() and voice.unmute() gate mic audio forwarding to STT and VAD. When muted, the mic stays active but no transcription or speech detection occurs. This is useful for temporarily disabling voice input without tearing down the pipeline.

```tsx
<button onClick={voice.isMuted ? voice.unmute : voice.mute}>
  {voice.isMuted ? "Unmute" : "Mute"}
</button>
```

Audio Visualization

The audio_chunk event on the underlying GloveVoice instance emits raw Int16Array PCM data from the mic, even when muted. Use this for waveform or audio level visualization:

```typescript
// Listen to audio_chunk on the GloveVoice instance for visualization
voice.on("audio_chunk", (pcm: Int16Array) => {
  // Compute RMS level for a simple meter
  let sum = 0;
  for (let i = 0; i < pcm.length; i++) sum += pcm[i] * pcm[i];
  const level = Math.sqrt(sum / pcm.length) / 32768;
  updateMeter(level);
});
```

Voice-First Tool Design

Tools built for voice agents have different design considerations than text-based tools. Voice users cannot click buttons or fill forms while speaking, and the model's response text gets spoken aloud.

Use pushAndForget for Information Display

In voice-first apps, use pushAndForget instead of pushAndWait for tools that display information. Voice users see the visual result while hearing the narration, but they do not need to interact with it to continue the conversation.

```tsx
const showMenuTool: ToolConfig<{ items: MenuItem[] }> = {
  name: "show_menu",
  description: "Display menu items to the user.",
  inputSchema: z.object({
    items: z.array(z.object({
      name: z.string(),
      price: z.number(),
      description: z.string(),
    })),
  }),
  async do(input, display) {
    await display.pushAndForget({ items: input.items });
    // Return concise text for the model to narrate
    return {
      status: "success",
      data: input.items
        .map(i => `${i.name} for $${i.price.toFixed(2)}`)
        .join(", "),
    };
  },
  render({ data }) {
    return <MenuCard items={data.items} />;
  },
};
```

Return Concise Data for Narration

The data field in your tool result is what the model sees and narrates. Keep it short and descriptive. Avoid returning raw JSON or lengthy details — the model will try to speak all of it.
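For example, a hypothetical order tool can map its result to one speakable sentence instead of raw JSON. Order and conciseResult are invented names for illustration:

```typescript
// Hypothetical order shape, for illustration only
interface Order { id: number; eta: string; items: string[] }

// Short, speakable summary for the model to narrate.
// Compare with returning JSON.stringify(order), which the model would read aloud.
function conciseResult(order: Order): string {
  return `Order ${order.id} confirmed with ${order.items.length} items, arriving ${order.eta}.`;
}
```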

Dynamic System Prompt for Voice

Append voice-specific instructions to your system prompt when voice is active. This tells the model to keep responses short and conversational:

```tsx
const basePrompt = "You are a helpful barista assistant.";

const voiceInstructions = `
Voice mode is active. The user is speaking to you.
- Keep responses under 2 sentences
- Describe tool results concisely
- Use natural conversational language
- Do not use markdown, lists, or formatting
`;

function ChatApp() {
  const voice = useGloveVoice({ runnable, voice: voiceConfig });

  const systemPrompt = voice.isActive
    ? basePrompt + voiceInstructions
    : basePrompt;

  const glove = useGlove({ systemPrompt, tools, sessionId });
  // ...
}
```

pushAndWait and unAbortable in Voice Apps

Full barge-in protection for mutation-critical tools (like a checkout form) requires two layers:

  1. Voice layer (barge-in suppression): When a pushAndWait resolver is pending, GloveVoice checks displayManager.resolverStore.size > 0 and skips interrupt() entirely. The barge-in never fires.
  2. Core layer (abort resistance): Setting unAbortable: true on the tool makes glove-core run it to completion even if the abort signal fires. This protects against programmatic interrupts, not just voice.

Important: pushAndWait alone does not make a tool survive an abort signal. It only suppresses the voice barge-in trigger. If interrupt() is called by other means, only unAbortable: true guarantees the tool runs to completion. Use both together for tools that perform mutations.

```tsx
const checkout = defineTool({
  name: "checkout",
  unAbortable: true,              // Layer 2: survives abort signals
  displayStrategy: "hide-on-complete",
  async do(_input, display) {
    const result = await display.pushAndWait({ items });  // Layer 1: suppresses voice barge-in
    if (!result) return "Cancelled";
    cartOps.clear();              // Safe — tool guaranteed to complete
    return "Order placed!";
  },
});
```

Use pushAndWait sparingly in voice-first apps — only for actions that genuinely require explicit user confirmation. For display-only tools, always prefer pushAndForget so barge-in works naturally.

Common Gotchas

1. SileroVAD Must Be Dynamically Imported

Never import glove-voice/silero-vad at module level in a Next.js or SSR environment. The WASM dependencies will fail during server-side rendering. Always use await import("glove-voice/silero-vad") inside a function that only runs in the browser.

2. Empty Committed Transcripts

ElevenLabs Scribe sometimes returns an empty committed transcript for very short utterances like “No” or “Hi”. The ElevenLabsSTTAdapter handles this automatically by falling back to the last partial transcript. You do not need to handle this case yourself.

3. TTS Idle Timeout

ElevenLabs WebSocket connections disconnect after approximately 20 seconds of inactivity. This can happen during tool execution when no text is being sent. GloveVoice handles this by closing the TTS session after each model_response_complete event and opening a fresh one when the next text_delta arrives.

4. Barge-in Protection Requires unAbortable

A pending pushAndWait resolver suppresses voice barge-in at the trigger level, but does not protect the tool from abort signals. For mutation-critical tools, always set unAbortable: true alongside pushAndWait to guarantee the tool runs to completion. See the pushAndWait and unAbortable section above for the full two-layer explanation.

5. Microphone Permission

voice.start() requests microphone permission. If the user denies it, the call throws an error. Handle this and show an appropriate message:

```typescript
async function handleVoiceToggle() {
  try {
    if (voice.isActive) {
      await voice.stop();
    } else {
      await voice.start();
    }
  } catch (err) {
    if (err instanceof Error && err.message.includes("Permission")) {
      alert("Microphone access is required for voice mode.");
    }
  }
}
```

6. Prompt for Concise Voice Responses

Voice responses should be short and conversational. Instruct the LLM in the system prompt: “Keep voice responses under 2 sentences. Describe results concisely.” Without this guidance, the model may generate long, formatted responses that sound unnatural when spoken aloud.

7. createTTS Must Be a Factory

GloveVoice calls createTTS() to get a fresh TTS adapter for each model response within a turn. Do not pass a single adapter instance — it will fail on the second response because the WebSocket connection is already closed. Always pass a factory function:

```typescript
// Correct: factory function
voice: { stt, createTTS: () => new ElevenLabsTTSAdapter({ getToken, voiceId }) }

// Wrong: single instance
voice: { stt, createTTS: new ElevenLabsTTSAdapter({ getToken, voiceId }) }
```

8. Audio Sample Rate

All adapters must agree on the audio format. The default is 16kHz mono PCM (Int16Array for capture/STT, Uint8Array for TTS playback). Do not change the sample rate unless your provider requires something different, and if you do, set sampleRate in GloveVoiceConfig to match.

9. narrate() Auto-Mutes the Mic

voice.narrate() automatically mutes the mic during playback to prevent TTS audio from feeding back into STT/VAD. It restores the previous mute state when done. If you were already muted before calling narrate(), you will remain muted afterward.

10. narrate() Requires a Started Pipeline

Calling narrate() before voice.start() throws an error because the TTS factory and AudioPlayer are not yet initialized. Always ensure the voice pipeline is active before narrating.
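A small guard makes this explicit. The following safeNarrate helper is hypothetical (not part of glove-voice); it checks isActive, which the API table documents, before narrating:

```typescript
// Hypothetical helper, not part of glove-voice. Skips narration when the
// pipeline is not started, avoiding the "narrate before start()" error.
type NarrateHandle = { isActive: boolean; narrate(text: string): Promise<void> };

async function safeNarrate(voice: NarrateHandle, text: string): Promise<void> {
  if (!voice.isActive) return; // pipeline not started: TTS factory and AudioPlayer unavailable
  await voice.narrate(text);
}
```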

11. onnxruntime-web Version Pinning

If you see WASM loading errors when using SileroVAD, check that your onnxruntime-web version matches what @ricky0123/vad-web expects. The Glove monorepo pins onnxruntime-web@^1.22.0 alongside @ricky0123/vad-web@^0.0.30. Version mismatches between the ONNX Runtime WASM files and the JavaScript API will cause cryptic loading failures.

12. Voice Auto-Silences During Compaction

When context compaction is triggered, the core emits compaction_start and compaction_end observer events. The voice pipeline listens for these and ignores all text_delta events while compaction is in progress. This means the compaction summary is never narrated through TTS. No action is needed on your part — this is handled automatically by GloveVoice.

13. SileroVAD Not Needed for Manual Mode

When using turnMode: "manual" (push-to-talk), you do not need to import SileroVAD or set up any VAD at all. VAD is only used in turnMode: "vad". Skip the WASM overhead for PTT-only apps.

14. Render Ships a Default Input

The <Render> component includes a built-in text input. If you have your own input form, always pass renderInput={() => null} to suppress the built-in one — otherwise you get duplicate inputs.

15. Tools Execute Outside React

Tool do() functions run outside the React component tree. To access React context (for example, a wallet hook or theme), use a mutable singleton ref synced from a React component (bridge pattern):

```typescript
// bridge.ts
// Mutable singleton ref — tools import this module directly
export const voiceBridge = { current: null as UseGloveVoiceReturn | null };

// In your component: keep the ref in sync with the hook value
useEffect(() => {
  voiceBridge.current = voice;
}, [voice]);

// In your tool: read the bridge at call time
async do(input, display) {
  await voiceBridge.current?.narrate("Processing...");
}
```