Voice Integration

Add real-time voice to any Glove agent. The voice pipeline handles microphone capture, speech-to-text, agent processing, and text-to-speech playback — while all your existing tools, display stack, and context management continue to work unchanged.

Overview

Pipeline Architecture

```
Mic --> VAD --> STT --> Glove --> TTS --> Speaker
         |               |
  speech boundary    processRequest()
     detection       (tools, display stack,
                      context, compaction)
```

The voice system is split across three packages, each with a specific responsibility:

- glove-voice: the framework-agnostic pipeline core, including the STT/TTS/VAD adapter contracts, built-in implementations, audio capture, and playback.
- glove-react/voice: React bindings, including the useGloveVoice and useGlovePTT hooks and the VoicePTTButton component.
- glove-next: server-side helpers, including the createVoiceTokenHandler factory for provider token routes.

Turn Modes

GloveVoice supports two turn detection modes that control how the pipeline decides when the user has finished speaking:

- "vad" (default): hands-free. A VAD watches the audio stream and commits the turn when it detects the end of speech.
- "manual": push-to-talk. No VAD runs; the consumer commits each turn explicitly with commitTurn().

Voice Modes

The pipeline transitions through four states during operation:

| Mode | State | Description |
|---|---|---|
| `idle` | Pipeline off | Not started or stopped. No mic access, no connections. |
| `listening` | Mic active | Capturing audio, sending to STT. Waiting for the user to speak. |
| `thinking` | Agent processing | User utterance committed. Glove is processing the request (model call, tool execution). |
| `speaking` | TTS playback | Audio chunks streaming from TTS to the speaker. Barge-in returns to listening. |

Quick Start

Get voice working in five minutes with ElevenLabs. This assumes you already have a Glove agent running with glove-react and glove-next.

Install

```bash
pnpm add glove-voice
```

glove-react and glove-next are already part of a typical Glove project. The voice subpaths (glove-react/voice and createVoiceTokenHandler from glove-next) are included in those packages.

Step 1: Token Routes

Create two API routes that generate short-lived ElevenLabs tokens. Your API key stays on the server — the browser only receives single-use tokens.

app/api/voice/stt-token/route.ts:

```typescript
import { createVoiceTokenHandler } from "glove-next";

export const GET = createVoiceTokenHandler({
  provider: "elevenlabs",
  type: "stt",
});
```

app/api/voice/tts-token/route.ts:

```typescript
import { createVoiceTokenHandler } from "glove-next";

export const GET = createVoiceTokenHandler({
  provider: "elevenlabs",
  type: "tts",
});
```

Set your ElevenLabs API key in .env.local:

```bash
ELEVENLABS_API_KEY=your_api_key_here
```

Step 2: Client Voice Config

Create a voice configuration file that sets up the ElevenLabs STT adapter and TTS factory. The token fetchers point to the routes you just created.

lib/voice.ts:

```typescript
import { createElevenLabsAdapters } from "glove-voice";

async function fetchToken(path: string): Promise<string> {
  const res = await fetch(path);
  const data = await res.json();
  return data.token;
}

export const { stt, createTTS } = createElevenLabsAdapters({
  getSTTToken: () => fetchToken("/api/voice/stt-token"),
  getTTSToken: () => fetchToken("/api/voice/tts-token"),
  voiceId: "JBFqnCBsd6RMkjVDRZzb",
});
```

The voiceId is an ElevenLabs voice identifier. Browse the ElevenLabs Voice Library to find a voice and copy its ID.

Step 3: React Hook

Use useGloveVoice alongside useGlove to wire the voice pipeline into your component.

```tsx
import { useGlove } from "glove-react";
import { useGloveVoice } from "glove-react/voice";
import { stt, createTTS } from "@/lib/voice";

function App() {
  const { runnable } = useGlove({ tools, sessionId });
  const voice = useGloveVoice({
    runnable,
    voice: { stt, createTTS },
  });

  return (
    <button onClick={voice.isActive ? voice.stop : voice.start}>
      {voice.mode}
    </button>
  );
}
```

That is it. Clicking the button starts the mic, connects STT, and begins listening. Speak naturally and the pipeline handles the rest: your speech is transcribed, sent to Glove, and the response is spoken back.

Voice Adapters

The voice pipeline is built around three adapter contracts. Each adapter is an EventEmitter with a specific set of events and methods. You can swap implementations freely — the pipeline does not care which provider you use, only that the contract is satisfied.

STTAdapter

Streaming speech-to-text. Receives raw PCM audio and emits transcripts.

| Event | Payload | Description |
|---|---|---|
| `partial` | `string` | Streaming partial transcript. Changes as more speech arrives. |
| `final` | `string` | Stable, finalized transcript for the completed utterance. |
| `error` | `Error` | Connection or transcription error. |
| `close` | (none) | WebSocket connection closed. |

| Method | Signature | Description |
|---|---|---|
| `connect()` | `() => Promise<void>` | Open the connection. Adapter fetches credentials internally via its getToken function. |
| `sendAudio(pcm)` | `(pcm: Int16Array) => void` | Send a raw PCM chunk (16kHz mono Int16Array). |
| `flushUtterance()` | `() => void` | Signal end of utterance. Adapter should finalize the current transcript. Called by VAD on speech_end. |
| `disconnect()` | `() => void` | Close the connection and release resources. |
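To make the contract concrete, here is a minimal in-memory adapter that satisfies it. This is a hedged sketch, not the ElevenLabs implementation; the class name and transcript strings are invented for illustration.

```typescript
import { EventEmitter } from "node:events";

// Illustrative mock of the STTAdapter contract. A real adapter would open a
// provider WebSocket in connect() and emit transcripts as audio streams in.
class MockSTTAdapter extends EventEmitter {
  private buffered = 0; // samples received for the current utterance

  async connect(): Promise<void> {
    // A real adapter fetches credentials via its getToken function and
    // opens the connection here.
  }

  sendAudio(pcm: Int16Array): void {
    this.buffered += pcm.length;
    this.emit("partial", `heard ${this.buffered} samples`);
  }

  flushUtterance(): void {
    // Called by VAD on speech_end: finalize the current transcript.
    this.emit("final", `utterance of ${this.buffered} samples`);
    this.buffered = 0;
  }

  disconnect(): void {
    this.emit("close");
  }
}
```

Because the pipeline only depends on these events and methods, a mock like this is also a convenient way to test voice UI without a provider account.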

TTSAdapter

Streaming text-to-speech. Receives text chunks and emits audio.

| Event | Payload | Description |
|---|---|---|
| `audio_chunk` | `Uint8Array` | Raw PCM audio chunk (16kHz mono), ready for the AudioPlayer. |
| `done` | (none) | All audio for the current turn has been received. |
| `error` | `Error` | Connection or synthesis error. |

| Method | Signature | Description |
|---|---|---|
| `open()` | `() => Promise<void>` | Open the connection. Resolves once the adapter is ready to accept text. |
| `sendText(text)` | `(text: string) => void` | Send a text chunk for synthesis. Safe to call before open() resolves; adapters queue internally. |
| `flush()` | `() => void` | Signal end of text stream. Flushes remaining audio. Must be called once after all text is sent. |
| `destroy()` | `() => void` | Immediately close the connection, dropping any pending audio. |
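The queue-before-open behavior can be sketched with a similar mock (again illustrative, not the ElevenLabs adapter; the dummy audio chunks simply mirror the text length):

```typescript
import { EventEmitter } from "node:events";

// Illustrative mock of the TTSAdapter contract, showing the "text sent before
// open() resolves is queued internally" behavior noted in the table above.
class MockTTSAdapter extends EventEmitter {
  private queue: string[] = [];
  private ready = false;

  async open(): Promise<void> {
    this.ready = true;
    // Drain anything queued before the connection was ready
    for (const text of this.queue.splice(0)) this.synthesize(text);
  }

  sendText(text: string): void {
    if (!this.ready) { this.queue.push(text); return; }
    this.synthesize(text);
  }

  flush(): void {
    this.emit("done");
  }

  destroy(): void {
    this.ready = false;
    this.queue.length = 0; // drop pending text/audio
  }

  private synthesize(text: string): void {
    // A real adapter receives PCM from the provider; emit a dummy chunk here
    this.emit("audio_chunk", new Uint8Array(text.length));
  }
}
```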

VADAdapter

Voice activity detection. Processes audio frames and signals speech boundaries.

| Event | Payload | Description |
|---|---|---|
| `speech_start` | (none) | User started speaking. |
| `speech_end` | (none) | User stopped speaking. Triggers STT flush in VAD mode. |

| Method | Signature | Description |
|---|---|---|
| `process(pcm)` | `(pcm: Int16Array) => void` | Process a PCM frame. Call on every AudioCapture chunk event. |
| `reset()` | `() => void` | Force reset internal state. Called when interrupting a turn. |

Built-in Implementations

| Adapter | Provider | Description |
|---|---|---|
| `ElevenLabsSTTAdapter` | ElevenLabs Scribe Realtime | WebSocket-based streaming STT using ElevenLabs Scribe v2. Supports partial and committed transcripts with auto-reconnect. |
| `ElevenLabsTTSAdapter` | ElevenLabs Input Streaming | WebSocket-based streaming TTS using ElevenLabs Turbo v2.5. Streams text in, receives PCM audio chunks out. |
| VAD (energy-based) | Built-in | Zero-dependency energy-based voice activity detector. Uses RMS energy thresholds. Good for quiet environments. |
| `SileroVADAdapter` | Silero VAD (WASM) | ML-based voice activity detection using ONNX Runtime. Much more accurate in noisy environments. Loaded from the glove-voice/silero-vad subpath. |

Security: Token-based Auth

Voice providers like ElevenLabs, Deepgram, and Cartesia authenticate via API keys. These keys must never be exposed to the browser. The token pattern solves this:

```
Browser                      Your Server                  Provider API
   |                              |                            |
   |-- GET /api/voice/token ----->|                            |
   |                              |-- POST /token ------------>|
   |                              |   (with API key)           |
   |                              |<-- { token } --------------|
   |<-- { token } ----------------|                            |
   |                              |                            |
   |-- WebSocket (with token) -------- direct connection ----->|
```

createVoiceTokenHandler

Factory function from glove-next that creates a Next.js App Router GET handler for generating provider tokens.

```typescript
function createVoiceTokenHandler(
  config: VoiceTokenHandlerConfig
): (req: Request) => Promise<Response>
```

VoiceTokenHandlerConfig

A discriminated union based on the provider field:

| Provider | Fields | Description |
|---|---|---|
| `elevenlabs` | `type: "stt" \| "tts"` | ElevenLabs requires separate tokens for STT (realtime_scribe) and TTS (tts_websocket). Create one route for each. Reads ELEVENLABS_API_KEY from env. |
| `deepgram` | `ttlSeconds?: number` | Deepgram uses a single token for all operations. ttlSeconds controls token lifetime (default: 30). Reads DEEPGRAM_API_KEY from env. |
| `cartesia` | (none) | Cartesia uses a single JWT token. Reads CARTESIA_API_KEY from env. |
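As a concrete sketch, a Deepgram setup needs only one route, since Deepgram does not split STT and TTS tokens. The route path here is hypothetical, and this assumes DEEPGRAM_API_KEY is set in the environment:

```typescript
// app/api/voice/token/route.ts (hypothetical path)
import { createVoiceTokenHandler } from "glove-next";

export const GET = createVoiceTokenHandler({
  provider: "deepgram",
  ttlSeconds: 60, // short-lived token; the default is 30
});
```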

All providers accept an optional apiKey field to pass the key directly instead of reading from environment variables.

```typescript
// Override the env var with a direct key
export const GET = createVoiceTokenHandler({
  provider: "elevenlabs",
  type: "stt",
  apiKey: "sk-...",
});
```

VAD: Voice Activity Detection

VAD determines when the user starts and stops speaking. This controls when to flush the STT buffer and commit a turn, and when to trigger barge-in during playback.

Built-in VAD (Energy-based)

The default VAD uses RMS energy thresholds. It has zero dependencies, works everywhere, and is effective in quiet environments. When no custom vad is passed to GloveVoice, the built-in VAD is used automatically.

| Parameter | Default | Description |
|---|---|---|
| `threshold` | 0.01 | RMS energy level to consider as speech. Higher values require louder speech. |
| `silentFrames` | 15 (~600ms) | Consecutive silent frames before speech_end fires. Increase for longer natural pauses. GloveVoice defaults to 40 (~1600ms). |
| `speechFrames` | 3 | Consecutive speech frames before speech_start fires. Avoids false triggers from brief noises. |
```typescript
import { useGloveVoice } from "glove-react/voice";

// Override VAD sensitivity via vadConfig
const voice = useGloveVoice({
  runnable,
  voice: {
    stt,
    createTTS,
    vadConfig: { silentFrames: 60, threshold: 0.02 },
  },
});
```
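For intuition, the three parameters imply a small per-frame state machine. The following is a hedged re-implementation sketch of energy-based detection; EnergyVAD and VADEvents are illustrative names, not the glove-voice internals.

```typescript
// Sketch of an energy-based VAD loop: RMS per frame, with run-length
// counters implementing speechFrames / silentFrames hysteresis.
type VADEvents = { onSpeechStart(): void; onSpeechEnd(): void };

class EnergyVAD {
  private speechRun = 0;
  private silentRun = 0;
  private speaking = false;

  constructor(
    private events: VADEvents,
    private threshold = 0.01,    // RMS level counted as speech
    private silentFrames = 15,   // frames of silence before speech_end
    private speechFrames = 3,    // frames of speech before speech_start
  ) {}

  process(pcm: Int16Array): void {
    // RMS energy normalized to 0..1
    let sum = 0;
    for (let i = 0; i < pcm.length; i++) sum += pcm[i] * pcm[i];
    const rms = Math.sqrt(sum / pcm.length) / 32768;

    if (rms >= this.threshold) {
      this.speechRun++;
      this.silentRun = 0;
      if (!this.speaking && this.speechRun >= this.speechFrames) {
        this.speaking = true;
        this.events.onSpeechStart();
      }
    } else {
      this.silentRun++;
      this.speechRun = 0;
      if (this.speaking && this.silentRun >= this.silentFrames) {
        this.speaking = false;
        this.events.onSpeechEnd();
      }
    }
  }
}
```

The hysteresis is the key design point: requiring several consecutive frames on each side prevents a single noisy frame from toggling the state.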

SileroVAD (ML-based)

For noisy environments or higher accuracy, use SileroVADAdapter. It runs a neural network (Silero VAD v5) via ONNX Runtime in the browser using WebAssembly. The ML model produces a speech probability score for each audio frame, making it far more accurate than energy-based detection at distinguishing speech from background noise.

The WASM Challenge

SileroVAD depends on @ricky0123/vad-web and onnxruntime-web, which load WASM files in the browser. If you import this from the main glove-voice barrel, bundlers (Next.js, Vite) try to resolve WASM files at build time and may attempt to bundle them for SSR, causing errors.

The solution is a separate entry point at glove-voice/silero-vad combined with a dynamic import:

```typescript
export async function createSileroVAD() {
  const { SileroVADAdapter } = await import("glove-voice/silero-vad");
  const vad = new SileroVADAdapter({
    positiveSpeechThreshold: 0.5,
    negativeSpeechThreshold: 0.35,
    wasm: { type: "cdn" },
  });
  await vad.init();
  return vad;
}
```

Pass the created VAD to the voice config:

```typescript
const vad = await createSileroVAD();

const voice = useGloveVoice({
  runnable,
  voice: { stt, createTTS, vad },
});
```

Next.js Configuration

When using SileroVAD with Next.js, you need transpilePackages so Next.js processes the glove-voice package correctly. The dynamic import ensures the WASM-dependent code only loads in the browser.

next.config.ts:

```typescript
/** @type {import('next').NextConfig} */
const config = {
  transpilePackages: ["glove-voice"],
  serverExternalPackages: ["better-sqlite3"], // if using SqliteStore
};

export default config;
```

WASM Loading Modes

| Mode | Config | Description |
|---|---|---|
| CDN (recommended) | `{ type: "cdn" }` | Loads ONNX Runtime WASM files from the jsDelivr CDN. Zero configuration required. Best for most deployments. |
| Local | `{ type: "local", path: "/onnx/" }` | Loads WASM files from your public/ directory. For offline or air-gapped environments. Copy the files from node_modules/onnxruntime-web/dist/ to public/onnx/. |

Build Warnings

When building with SileroVAD, you will see warnings like:

```
Critical dependency: require function is used in a way
in which dependencies cannot be statically extracted
```

These come from onnxruntime-web's internal dynamic require and are harmless. The WASM loading works correctly at runtime.

Tuning SileroVAD Parameters

| Parameter | Default | Description |
|---|---|---|
| `positiveSpeechThreshold` | 0.3 | Speech probability score (0-1) above which a frame is considered speech. Higher values mean less sensitivity and fewer false triggers. |
| `negativeSpeechThreshold` | 0.25 | Speech probability score (0-1) below which a frame is considered silence. Lower values require more definitive silence to end speech detection. |
| `redemptionMs` | 1400 | Milliseconds of silence allowed within speech before triggering speech_end. Acts as a debounce for brief pauses mid-sentence. |
| `preSpeechPadMs` | 800 | Milliseconds of audio to include before the detected speech start. Ensures the beginning of utterances is not clipped. |
| `minSpeechMs` | 100 | Minimum duration of speech in milliseconds. Utterances shorter than this are treated as misfires. |

Turn Modes

VAD Mode (Default)

In VAD mode, the pipeline operates hands-free. The VAD continuously analyzes audio frames and automatically detects when the user starts and stops speaking.

```typescript
const voice = useGloveVoice({
  runnable,
  voice: { stt, createTTS, turnMode: "vad" }, // "vad" is the default
});
```

Manual Mode (Push-to-Talk)

In manual mode, the consumer controls turn boundaries. No VAD is created. The mic captures audio and sends it to STT continuously, but nothing commits the utterance until you call commitTurn().

For most push-to-talk use cases, useGlovePTT handles all of this automatically — see the Push-to-Talk section below. The following is the low-level alternative for reference:

```tsx
const voice = useGloveVoice({
  runnable,
  voice: { stt, createTTS, turnMode: "manual" },
});

// Low-level push-to-talk button
<button
  onPointerDown={() => voice.start()}
  onPointerUp={() => voice.commitTurn()}
>
  Hold to talk
</button>
```

Push-to-Talk (useGlovePTT)

useGlovePTT is a high-level hook that replaces roughly 80 lines of push-to-talk boilerplate with around 5. It wraps useGloveVoice and handles:

- forcing turnMode: "manual" and startMuted: true on the voice config
- pointer handlers (bind) with click-vs-hold discrimination: a quick click toggles voice on/off, a hold records
- an optional keyboard hotkey that is ignored while a form field has focus
- a minimum recording duration, so releasing early still commits a usable turn

Quick Example

```tsx
import { useGlove, Render } from "glove-react";
import { useGlovePTT, VoicePTTButton } from "glove-react/voice";
import { stt, createTTS } from "@/lib/voice";

function ChatPanel() {
  const glove = useGlove({ endpoint: "/api/chat", tools });
  const ptt = useGlovePTT({
    runnable: glove.runnable,
    voice: { stt, createTTS },
    hotkey: "Space",
  });

  return (
    <>
      <Render glove={glove} voice={ptt} renderInput={() => null} />
      <VoicePTTButton ptt={ptt}>
        {({ enabled, recording, mode }) => (
          <button className={recording ? "recording" : enabled ? "active" : ""}>
            <MicIcon />
          </button>
        )}
      </VoicePTTButton>
    </>
  );
}
```

UseGlovePTTConfig

| Property | Type | Description |
|---|---|---|
| `runnable` | `IGloveRunnable \| null` | The Glove runnable instance. Pass useGlove().runnable. |
| `voice` | `Omit<GloveVoiceConfig, "turnMode">` | Voice pipeline config. turnMode is forced to "manual" and startMuted to true internally. |
| `hotkey?` | `string \| false` | Keyboard hotkey code (default: "Space"). Uses KeyboardEvent.code values. Auto-ignores when focus is on an INPUT, TEXTAREA, or SELECT. Set to false to disable. |
| `holdThreshold?` | `number` | Hold duration in ms for click-vs-hold discrimination (default: 300). A quick click toggles voice on/off; a hold triggers PTT recording. |
| `minRecordingMs?` | `number` | Minimum recording duration in ms before committing a turn (default: 350). If the user releases early, the mic stays hot until the minimum is reached. |

UseGlovePTTReturn

| Property | Type | Description |
|---|---|---|
| `enabled` | `boolean` | Whether the voice pipeline is active (user toggled voice on). |
| `recording` | `boolean` | Whether the user is currently holding to record. |
| `processing` | `boolean` | Whether STT is finalizing after a short recording. |
| `mode` | `VoiceMode` | Current voice pipeline state: idle, listening, thinking, speaking. |
| `transcript` | `string` | Current partial transcript while the user is speaking. |
| `error` | `Error \| null` | Last error from the voice pipeline. |
| `toggle()` | `() => Promise<void>` | Toggle the voice pipeline on/off. |
| `interrupt()` | `() => void` | Barge-in: abort the in-flight request and stop TTS. |
| `bind` | `{ onPointerDown, onPointerUp, onPointerLeave }` | Pointer event handlers to spread onto a mic button. Includes click-vs-hold discrimination. |
| `voice` | `UseGloveVoiceReturn` | The underlying voice hook return, for advanced use cases. |

VoicePTTButton

Headless (unstyled) component with a render prop pattern. Wraps ptt.bind with role="button", tabIndex, aria-label, aria-pressed, and touch safety (prevents context menu on long press, disables text selection during hold).

```tsx
import { VoicePTTButton } from "glove-react/voice";

<VoicePTTButton ptt={ptt} className="mic-button">
  {({ enabled, recording, processing, mode }) => (
    <button className={recording ? "active" : ""}>
      {processing ? <Spinner /> : <MicIcon />}
      {enabled && <StatusDot />}
    </button>
  )}
</VoicePTTButton>
```

VoicePTTButtonProps

| Property | Type | Description |
|---|---|---|
| `ptt` | `UseGlovePTTReturn` | The return value of useGlovePTT(). |
| `children` | `(props: VoicePTTButtonRenderProps) => ReactNode` | Render prop for full styling control. Receives enabled, recording, processing, and mode. |
| `className?` | `string` | Additional className on the wrapper span. |
| `style?` | `React.CSSProperties` | Additional style on the wrapper span. |

Render Voice Integration

The <Render> component accepts an optional voice prop to auto-render transcript and voice status. This works with both useGlovePTT and useGloveVoice return values:

```tsx
<Render
  glove={glove}
  voice={ptt}                              // or useGloveVoice() return
  renderTranscript={({ transcript }) => (  // optional custom renderer
    <p className="transcript">{transcript}</p>
  )}
  renderVoiceStatus={({ mode }) => (       // optional custom renderer
    <span className="status">{mode}</span>
  )}
  renderInput={() => null}
/>
```

The voice prop accepts a VoiceRenderHandle, which is any object with transcript, mode, and enabled fields. Both UseGlovePTTReturn and UseGloveVoiceReturn satisfy this interface. The optional renderTranscript receives TranscriptRenderProps (with a transcript string), and renderVoiceStatus receives VoiceStatusRenderProps (with a mode value).
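As a type sketch, the structural contract described above can be written out as follows. The field names come from the text; the inline VoiceMode union mirrors the four documented states.

```typescript
// Sketch of the VoiceRenderHandle shape; any object with these three
// fields (including UseGlovePTTReturn and UseGloveVoiceReturn) satisfies it.
type VoiceMode = "idle" | "listening" | "thinking" | "speaking";

interface VoiceRenderHandle {
  transcript: string;
  mode: VoiceMode;
  enabled: boolean;
}

// Because the prop is structural, a plain stub also works (e.g. in tests):
const stub: VoiceRenderHandle = { transcript: "", mode: "idle", enabled: false };
```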

useGloveVoice API Reference

Signature

```typescript
function useGloveVoice(config: UseGloveVoiceConfig): UseGloveVoiceReturn
```

UseGloveVoiceConfig

| Property | Type | Description |
|---|---|---|
| `runnable` | `IGloveRunnable \| null` | The Glove runnable instance. Pass useGlove().runnable. When null, start() will throw. |
| `voice` | `GloveVoiceConfig` | Voice pipeline configuration. Contains the STT adapter, TTS factory, turn mode, optional VAD override, and sample rate. |

GloveVoiceConfig

| Property | Type | Description |
|---|---|---|
| `stt` | `STTAdapter` | Speech-to-text adapter instance. Any implementation of the STTAdapter contract. |
| `createTTS` | `() => TTSAdapter` | Factory function that returns a fresh TTS adapter per turn. Must be a factory, not a single instance, because GloveVoice creates a new TTS session for each model response. |
| `turnMode?` | `"vad" \| "manual"` | Turn detection mode. Default: "vad". In "manual" mode, no VAD is used and the consumer calls commitTurn(). |
| `vad?` | `VADAdapter` | Override the VAD implementation. Only used when turnMode is "vad". Pass a SileroVADAdapter for ML-based detection. |
| `vadConfig?` | `VADConfig` | Configuration for the built-in energy-based VAD. Only used when turnMode is "vad" and no custom vad is provided. Default silentFrames: 40 (~1600ms). |
| `sampleRate?` | `number` | Audio sample rate in Hz. Default: 16000. Must match STT and TTS adapter expectations. |
| `startMuted?` | `boolean` | Start the pipeline with the mic muted. Defaults to true when turnMode is "manual", false otherwise. Eliminates the race condition between start() resolving and calling mute(). |

UseGloveVoiceReturn

| Property | Type | Description |
|---|---|---|
| `mode` | `VoiceMode` | Current voice pipeline state: "idle", "listening", "thinking", or "speaking". |
| `transcript` | `string` | Current partial transcript while the user is speaking. Cleared when a turn is committed or the pipeline stops. |
| `isActive` | `boolean` | Whether the voice pipeline is active (mode is not "idle"). |
| `enabled` | `boolean` | Whether the user intended the pipeline to be active. True after start(), false after stop() or pipeline death (WebSocket drop, permission revoked). Unlike isActive, this tracks user intent and auto-resets — no manual sync useEffect needed. |
| `error` | `Error \| null` | Last error from the voice pipeline. Cleared on the next start() call. |
| `start()` | `() => Promise<void>` | Start the voice pipeline. Requests microphone permission, connects STT, and begins listening. Throws if runnable is null or mic permission is denied. |
| `stop()` | `() => Promise<void>` | Stop the voice pipeline. Interrupts any in-progress response, disconnects STT, releases the microphone, and returns to idle. |
| `interrupt()` | `() => void` | Barge-in. Aborts the in-flight Glove request, stops TTS playback, clears non-blocking display slots, and returns to listening. |
| `commitTurn()` | `() => void` | Manual turn commit. Flushes the current utterance to STT for finalization. Primary control mechanism in manual turn mode. Also works in VAD mode as an explicit override. |
| `isMuted` | `boolean` | Whether mic audio is currently muted (not forwarded to STT/VAD). The audio_chunk event still fires when muted. |
| `mute()` | `() => void` | Stop forwarding mic audio to STT/VAD. The mic stays active and audio_chunk events continue to fire (for visualization). No transcription or VAD detection occurs while muted. |
| `unmute()` | `() => void` | Resume forwarding mic audio to STT/VAD. Restores normal transcription and voice activity detection. |
| `narrate(text)` | `(text: string) => Promise<void>` | Speak arbitrary text through TTS without involving the model. Auto-mutes the mic during playback. Resolves when all audio finishes playing. Safe to call from pushAndWait tool handlers. |

Narration & Mic Control

Narrating Display Slots

Use voice.narrate(text) to speak arbitrary text through TTS without sending it to the model. This is useful for reading display slot content aloud — order summaries, confirmation details, or any text you want the user to hear.

narrate() returns a promise that resolves when all audio finishes playing. It creates a fresh TTS adapter per call (same pattern as model turns) and auto-mutes the mic during playback to prevent TTS audio from feeding back into STT.

```tsx
const checkout = defineTool({
  name: "checkout",
  unAbortable: true,
  displayStrategy: "hide-on-complete",
  async do(input, display) {
    const cart = getCart();

    // Narrate the cart summary before showing the form
    await voice.narrate(
      `Your order has ${cart.length} items totaling ${formatPrice(total)}.`
    );

    const result = await display.pushAndWait({ items: cart });
    if (!result) return "Cancelled";

    // Narrate the confirmation
    await voice.narrate("Order placed! You'll receive a confirmation email shortly.");

    cartOps.clear();
    return "Order placed!";
  },
});
```

Key detail: narrate() is safe to call from pushAndWait tool handlers. When a tool uses pushAndWait, the model is paused waiting for the tool result, so there is no concurrent model TTS to conflict with.

Mute / Unmute

voice.mute() and voice.unmute() gate mic audio forwarding to STT and VAD. When muted, the mic stays active but no transcription or speech detection occurs. This is useful for temporarily disabling voice input without tearing down the pipeline.

```tsx
<button onClick={voice.isMuted ? voice.unmute : voice.mute}>
  {voice.isMuted ? "Unmute" : "Mute"}
</button>
```

Audio Visualization

The audio_chunk event on the underlying GloveVoice instance emits raw Int16Array PCM data from the mic, even when muted. Use this for waveform or audio level visualization:

```typescript
// Listen to audio_chunk on the GloveVoice instance for visualization
voice.on("audio_chunk", (pcm: Int16Array) => {
  // Compute RMS level for a simple meter
  let sum = 0;
  for (let i = 0; i < pcm.length; i++) sum += pcm[i] * pcm[i];
  const level = Math.sqrt(sum / pcm.length) / 32768;
  updateMeter(level);
});
```

Voice-First Tool Design

Tools built for voice agents have different design considerations than text-based tools. Voice users cannot click buttons or fill forms while speaking, and the model's response text gets spoken aloud.

Use pushAndForget for Information Display

In voice-first apps, use pushAndForget instead of pushAndWait for tools that display information. Voice users see the visual result while hearing the narration, but they do not need to interact with it to continue the conversation.

```tsx
const showMenuTool: ToolConfig<{ items: MenuItem[] }> = {
  name: "show_menu",
  description: "Display menu items to the user.",
  inputSchema: z.object({
    items: z.array(z.object({
      name: z.string(),
      price: z.number(),
      description: z.string(),
    })),
  }),
  async do(input, display) {
    await display.pushAndForget({ items: input.items });
    // Return concise text for the model to narrate
    return {
      status: "success",
      data: input.items
        .map(i => `${i.name} for $${i.price.toFixed(2)}`)
        .join(", "),
    };
  },
  render({ data }) {
    return <MenuCard items={data.items} />;
  },
};
```

Return Concise Data for Narration

The data field in your tool result is what the model sees and narrates. Keep it short and descriptive. Avoid returning raw JSON or lengthy details — the model will try to speak all of it.
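For example, a hypothetical order tool can map its result to one speakable sentence instead of raw JSON. Order and conciseResult are invented names for illustration:

```typescript
// Hypothetical order shape, for illustration only
interface Order { id: number; eta: string; items: string[] }

// Short, speakable summary for the model to narrate.
// Compare with returning JSON.stringify(order), which the model would read aloud.
function conciseResult(order: Order): string {
  return `Order ${order.id} confirmed with ${order.items.length} items, arriving ${order.eta}.`;
}
```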

Dynamic System Prompt for Voice

Append voice-specific instructions to your system prompt when voice is active. This tells the model to keep responses short and conversational:

```tsx
const basePrompt = "You are a helpful barista assistant.";

const voiceInstructions = `
Voice mode is active. The user is speaking to you.
- Keep responses under 2 sentences
- Describe tool results concisely
- Use natural conversational language
- Do not use markdown, lists, or formatting
`;

function ChatApp() {
  const voice = useGloveVoice({ runnable, voice: voiceConfig });

  const systemPrompt = voice.isActive
    ? basePrompt + voiceInstructions
    : basePrompt;

  const glove = useGlove({ systemPrompt, tools, sessionId });
  // ...
}
```

pushAndWait and unAbortable in Voice Apps

Full barge-in protection for mutation-critical tools (like a checkout form) requires two layers:

  1. Voice layer (barge-in suppression): When a pushAndWait resolver is pending, GloveVoice checks displayManager.resolverStore.size > 0 and skips interrupt() entirely. The barge-in never fires.
  2. Core layer (abort resistance): Setting unAbortable: true on the tool makes glove-core run it to completion even if the abort signal fires. This protects against programmatic interrupts, not just voice.

Important: pushAndWait alone does not make a tool survive an abort signal. It only suppresses the voice barge-in trigger. If interrupt() is called by other means, only unAbortable: true guarantees the tool runs to completion. Use both together for tools that perform mutations.

```tsx
const checkout = defineTool({
  name: "checkout",
  unAbortable: true,              // Layer 2: survives abort signals
  displayStrategy: "hide-on-complete",
  async do(_input, display) {
    const result = await display.pushAndWait({ items });  // Layer 1: suppresses voice barge-in
    if (!result) return "Cancelled";
    cartOps.clear();              // Safe — tool guaranteed to complete
    return "Order placed!";
  },
});
```

Use pushAndWait sparingly in voice-first apps — only for actions that genuinely require explicit user confirmation. For display-only tools, always prefer pushAndForget so barge-in works naturally.

Common Gotchas

1. SileroVAD Must Be Dynamically Imported

Never import glove-voice/silero-vad at module level in a Next.js or SSR environment. The WASM dependencies will fail during server-side rendering. Always use await import("glove-voice/silero-vad") inside a function that only runs in the browser.

2. Empty Committed Transcripts

ElevenLabs Scribe sometimes returns an empty committed transcript for very short utterances like “No” or “Hi”. The ElevenLabsSTTAdapter handles this automatically by falling back to the last partial transcript. You do not need to handle this case yourself.

3. TTS Idle Timeout

ElevenLabs WebSocket connections disconnect after approximately 20 seconds of inactivity. This can happen during tool execution when no text is being sent. GloveVoice handles this by closing the TTS session after each model_response_complete event and opening a fresh one when the next text_delta arrives.

4. Barge-in Protection Requires unAbortable

A pending pushAndWait resolver suppresses voice barge-in at the trigger level, but does not protect the tool from abort signals. For mutation-critical tools, always set unAbortable: true alongside pushAndWait to guarantee the tool runs to completion. See the pushAndWait and unAbortable section above for the full two-layer explanation.

5. Microphone Permission

voice.start() requests microphone permission. If the user denies it, the call throws an error. Handle this and show an appropriate message:

```typescript
async function handleVoiceToggle() {
  try {
    if (voice.isActive) {
      await voice.stop();
    } else {
      await voice.start();
    }
  } catch (err) {
    if (err instanceof Error && err.message.includes("Permission")) {
      alert("Microphone access is required for voice mode.");
    }
  }
}
```

6. Prompt for Concise Voice Responses

Voice responses should be short and conversational. Instruct the LLM in the system prompt: “Keep voice responses under 2 sentences. Describe results concisely.” Without this guidance, the model may generate long, formatted responses that sound unnatural when spoken aloud.

7. createTTS Must Be a Factory

GloveVoice calls createTTS() to get a fresh TTS adapter for each model response within a turn. Do not pass a single adapter instance — it will fail on the second response because the WebSocket connection is already closed. Always pass a factory function:

```typescript
// Correct: factory function
voice: { stt, createTTS: () => new ElevenLabsTTSAdapter({ getToken, voiceId }) }

// Wrong: single instance
voice: { stt, createTTS: new ElevenLabsTTSAdapter({ getToken, voiceId }) }
```

8. Audio Sample Rate

All adapters must agree on the audio format. The default is 16kHz mono PCM (Int16Array for capture/STT, Uint8Array for TTS playback). Do not change the sample rate unless your provider requires something different, and if you do, set sampleRate in GloveVoiceConfig to match.

9. narrate() Auto-Mutes the Mic

voice.narrate() automatically mutes the mic during playback to prevent TTS audio from feeding back into STT/VAD. It restores the previous mute state when done. If you were already muted before calling narrate(), you will remain muted afterward.

10. narrate() Requires a Started Pipeline

Calling narrate() before voice.start() throws an error because the TTS factory and AudioPlayer are not yet initialized. Always ensure the voice pipeline is active before narrating.
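A small guard makes this explicit. The following safeNarrate helper is hypothetical (not part of glove-voice); it checks isActive, which the API table documents, before narrating:

```typescript
// Hypothetical helper, not part of glove-voice. Skips narration when the
// pipeline is not started, avoiding the "narrate before start()" error.
type NarrateHandle = { isActive: boolean; narrate(text: string): Promise<void> };

async function safeNarrate(voice: NarrateHandle, text: string): Promise<void> {
  if (!voice.isActive) return; // pipeline not started: TTS factory and AudioPlayer unavailable
  await voice.narrate(text);
}
```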

11. onnxruntime-web Version Pinning

If you see WASM loading errors when using SileroVAD, check that your onnxruntime-web version matches what @ricky0123/vad-web expects. The Glove monorepo pins onnxruntime-web@^1.22.0 alongside @ricky0123/vad-web@^0.0.30. Version mismatches between the ONNX Runtime WASM files and the JavaScript API will cause cryptic loading failures.

12. Voice Auto-Silences During Compaction

When context compaction is triggered, the core emits compaction_start and compaction_end observer events. The voice pipeline listens for these and ignores all text_delta events while compaction is in progress. This means the compaction summary is never narrated through TTS. No action is needed on your part — this is handled automatically by GloveVoice.

13. SileroVAD Not Needed for Manual Mode

When using turnMode: "manual" (push-to-talk), you do not need to import SileroVAD or set up any VAD at all. VAD is only used in turnMode: "vad". Skip the WASM overhead for PTT-only apps.

14. Render Ships a Default Input

The <Render> component includes a built-in text input. If you have your own input form, always pass renderInput={() => null} to suppress the built-in one — otherwise you get duplicate inputs.

15. Tools Execute Outside React

Tool do() functions run outside the React component tree. To access React context (for example, a wallet hook or theme), use a mutable singleton ref synced from a React component (bridge pattern):

```typescript
// bridge.ts
// Mutable singleton ref — tools import this module directly
export const voiceBridge = { current: null as UseGloveVoiceReturn | null };

// In your component: keep the ref in sync with the hook value
useEffect(() => {
  voiceBridge.current = voice;
}, [voice]);

// In your tool: read the bridge at call time
async do(input, display) {
  await voiceBridge.current?.narrate("Processing...");
}
```