Build a Movie Companion

Voice-First

In this tutorial you will build Lola, a voice-first movie companion powered by TMDB. The user speaks, Lola responds with voice narration and visual cards — poster grids, movie info, trailers, comparisons, and streaming availability. There is a text input as a fallback, but voice is the primary interaction mode.

This is fundamentally different from adding voice to an existing chat app. In a voice-enabled app (like a coffee shop ordering assistant), you start with text and add voice as a secondary input. In a voice-first app, voice is the default. The screen is not a chat column — it is a visual area that shows ambient, glanceable cards while the AI narrates the content out loud. Every tool uses pushAndForget because nothing should ever block the voice conversation.

Prerequisites: You should have completed Getting Started, read The Display Stack, and reviewed Voice Integration.

What you will build

A movie companion where the user taps a voice orb and says “Tell me about Inception” and the app will:

Transcribe the user's speech to text using ElevenLabs speech-to-text
Send the text to the LLM, which calls search_movies — a poster grid appears on screen (pushAndForget)
The LLM narrates the search results out loud while the user sees the poster cards
The LLM calls get_movie_details — a detailed info card replaces the poster grid (pushAndForget)
Lola describes the director, cast, and plot verbally while the card is visible
The user says “Show me the trailer” — a YouTube embed appears (pushAndForget)

Nine tools, one TMDB proxy route. The user never types unless they choose to. The visual area shows the most recent tool result; the transcript strip shows Lola's latest spoken words in serif font near the bottom of the screen; the voice orb communicates the current state through animation.

Voice-first vs. voice-enabled

Before diving into code, it is important to understand the distinction between voice-first and voice-enabled. Both use the same glove-voice package, but the design decisions are opposite.

Aspect	Voice-Enabled (Coffee Shop)	Voice-First (Lola)
Primary input	Text — voice is secondary	Voice — text is a fallback
Screen layout	Chat column with messages	Visual area + transcript strip + voice orb
Tool blocking	Mix of `pushAndWait` and `pushAndForget`	All tools use `pushAndForget`
Tool results	AI reads structured data, may substitute text	AI narrates results verbally while card is visible
User interaction with cards	Click buttons, fill forms	Glance at visual information — no clicks needed
System prompt	Same prompt for text and voice	Different prompt for voice mode (narration instructions)

The key insight: in a voice-first app, if any tool uses pushAndWait, it blocks the LLM response loop. The agent cannot speak until the user clicks something on screen. That defeats the purpose of voice. Every tool must use pushAndForget so the visual card fires and the agent immediately narrates the content.

Architecture overview

Lola has three layers: a TMDB proxy that keeps the API key server-side, a voice pipeline (ElevenLabs + Silero VAD), and a screen layout built from three components.

/api/tmdb/[...path] — a catch-all Next.js route that proxies requests to the TMDB API. The TMDB_API_KEY stays on the server. Client-side code calls /api/tmdb/search/movie?query=... and gets back raw TMDB JSON.
/api/voice/stt-token and /api/voice/tts-token — token endpoints that generate short-lived ElevenLabs tokens. The browser uses these tokens to connect directly to ElevenLabs for speech-to-text and text-to-speech without exposing the API key.
/api/chat — the standard Glove chat handler that proxies to the LLM.
Visual Area — center of the screen. Shows active tool cards, the most recent completed tool result, or an empty state with suggestion chips.
Transcript Strip — a line of serif text near the bottom showing Lola's last spoken words. Fades after four seconds of silence.
Voice Orb — an 80px amber square with layered ring animations. Communicates listening, thinking, speaking, recording, and processing states.

1. Project setup

Start from a Next.js project with Glove and voice packages installed:

terminalbash

pnpm add glove-core glove-react glove-next glove-voice zod

Lola also uses @ricky0123/vad-web and onnxruntime-web for Silero VAD (voice activity detection — the browser-side model that detects when you start and stop speaking):

terminalbash

pnpm add @ricky0123/vad-web onnxruntime-web

Create three environment variables:

.env.localbash

OPENROUTER_API_KEY=your-openrouter-key
TMDB_API_KEY=your-tmdb-v3-bearer-token
ELEVENLABS_API_KEY=your-elevenlabs-key

The TMDB API key is a free bearer token from themoviedb.org. The ElevenLabs API key comes from elevenlabs.io (free tier works). The OpenRouter API key lets you use any model provider through a single endpoint.

2. TMDB integration

The TMDB integration has two parts: a server-side proxy route that keeps your API key secret, and a client-side module with typed helper functions.

The proxy route

A single catch-all route forwards any path to the TMDB API with your bearer token attached:

app/api/tmdb/[...path]/route.tstypescript

import { NextResponse } from "next/server";

const TMDB_API_BASE = "https://api.themoviedb.org/3";

export async function GET(
  req: Request,
  { params }: { params: Promise<{ path: string[] }> },
) {
  const { path } = await params;
  const url = new URL(req.url);

  const apiKey = process.env.TMDB_API_KEY;
  if (!apiKey) {
    return NextResponse.json({ error: "TMDB_API_KEY not set" }, { status: 500 });
  }

  const tmdbUrl = `${TMDB_API_BASE}/${path.join("/")}?${url.searchParams.toString()}`;

  const res = await fetch(tmdbUrl, {
    headers: { Authorization: `Bearer ${apiKey}` },
  });

  const data = await res.json();
  return NextResponse.json(data, { status: res.status });
}

When client-side code calls /api/tmdb/search/movie?query=Inception, this route rewrites it to https://api.themoviedb.org/3/search/movie?query=Inception with the bearer token. The API key never reaches the browser.

Client-side helpers

A tmdb.ts module wraps the proxy with typed functions and image URL builders. Here are the key parts:

app/lib/tmdb.tstypescript

const API_BASE = "/api/tmdb";
const TMDB_IMAGE_BASE = "https://image.tmdb.org/t/p";

export interface TMDBMovie {
  id: number;
  title: string;
  overview: string;
  release_date: string;
  vote_average: number;
  vote_count: number;
  poster_path: string | null;
  backdrop_path: string | null;
  genres?: { id: number; name: string }[];
  runtime?: number;
  tagline?: string;
  credits?: {
    cast: TMDBCastMember[];
    crew: TMDBCrewMember[];
  };
  videos?: { results: TMDBVideo[] };
  "watch/providers"?: {
    results: Record<string, TMDBProviderData>;
  };
}

// Image URL helpers
export function posterUrl(
  path: string | null,
  size: "w92" | "w154" | "w185" | "w342" | "w500" | "w780" | "original" = "w342",
): string | null {
  if (!path) return null;
  return `${TMDB_IMAGE_BASE}/${size}${path}`;
}

// Internal fetch helper — all calls go through the proxy
async function tmdbFetch<T>(path: string, params?: Record<string, string>): Promise<T> {
  const url = new URL(`${API_BASE}/${path}`, window.location.origin);
  if (params) {
    for (const [key, value] of Object.entries(params)) {
      url.searchParams.set(key, value);
    }
  }
  const res = await fetch(url.toString());
  if (!res.ok) {
    const errorBody = await res.text().catch(() => "Unknown error");
    throw new Error(`TMDB API error (${res.status}): ${errorBody}`);
  }
  return res.json() as Promise<T>;
}

// API functions
export async function searchMovies(query: string, year?: number): Promise<TMDBMovie[]> {
  const params: Record<string, string> = { query };
  if (year) params.year = String(year);
  const data = await tmdbFetch<{ results: TMDBMovie[] }>("search/movie", params);
  return data.results;
}

export async function getMovieDetails(movieId: number): Promise<TMDBMovie> {
  return tmdbFetch<TMDBMovie>(`movie/${movieId}`, {
    append_to_response: "credits,videos,watch/providers",
  });
}

// Utility helpers
export function movieYear(movie: TMDBMovie): string {
  if (!movie.release_date) return "Unknown";
  return movie.release_date.substring(0, 4);
}

export function getDirector(movie: TMDBMovie): string {
  if (!movie.credits?.crew) return "Unknown";
  const director = movie.credits.crew.find((c) => c.job === "Director");
  return director?.name ?? "Unknown";
}

export function getTopCast(movie: TMDBMovie, count: number = 5): TMDBCastMember[] {
  if (!movie.credits?.cast) return [];
  return movie.credits.cast.sort((a, b) => a.order - b.order).slice(0, count);
}

Every TMDB call goes through tmdbFetch, which constructs a URL pointing at /api/tmdb/... and parses the JSON response. The typed return values mean your tooldo functions get full autocomplete for movie fields, cast members, and provider data.

3. Tool design for voice

Voice-first tools follow a specific pattern. Every tool:

Fetches data from the TMDB proxy
Calls display.pushAndForget() with the visual card data
Returns a descriptive text string that the LLM uses for narration, plus renderData for persisting the visual card

The text return is critical. In a text-only app, the LLM reads the tool result and decides what to say. In a voice-first app, the LLM reads the tool result and speaks it. The tool must return enough information for the LLM to give a natural verbal summary, not just “Done” or a raw JSON blob.

search_movies

The search tool is the most common entry point. The user says “Find me sci-fi movies from the 90s” and the tool shows a poster grid while returning a numbered text list for narration.

app/lib/tools/search-movies.tsxtsx

import { defineTool } from "glove-react";
import { z } from "zod";
import { searchMovies, posterUrl, movieYear, type TMDBMovie } from "../tmdb";

export function createSearchMoviesTool() {
  return defineTool({
    name: "search_movies",
    description:
      "Search for movies by title. Returns a visual grid of poster cards " +
      "and text results for narration.",
    inputSchema: z.object({
      query: z.string().describe("Search query for movies"),
      year: z.number().optional().describe("Filter by release year"),
      max_results: z.number().optional().default(4).describe("Max results (1-6)"),
    }),
    displayPropsSchema: z.object({
      movies: z.array(z.any()),
    }),

    async do(input, display) {
      const clampedMax = Math.max(1, Math.min(6, input.max_results ?? 4));
      const results = await searchMovies(input.query, input.year);
      const movies = results.slice(0, clampedMax);

      if (movies.length === 0) {
        return {
          status: "success" as const,
          data: `No movies found matching "${input.query}".`,
          renderData: { movies: [] },
        };
      }

      // Fire the visual card — does NOT block the LLM
      await display.pushAndForget({ movies });

      // Return descriptive text for voice narration
      const summaryLines = movies.map(
        (m, i) =>
          `${i + 1}. ${m.title} (${movieYear(m)}) — Rating: ${m.vote_average.toFixed(1)}/10`,
      );

      return {
        status: "success" as const,
        data: `Found ${movies.length} result(s) for "${input.query}":\n${summaryLines.join("\n")}`,
        renderData: { movies },
      };
    },

    render({ props }) {
      const movies = props.movies as TMDBMovie[];
      return (
        <div style={{ display: "flex", gap: 12, flexWrap: "wrap", justifyContent: "center" }}>
          {movies.map((movie) => (
            <PosterCard key={movie.id} movie={movie} />
          ))}
        </div>
      );
    },

    renderResult({ data }) {
      const result = data as { movies: TMDBMovie[] };
      return (
        <div style={{ display: "flex", gap: 12, flexWrap: "wrap", justifyContent: "center" }}>
          {result.movies.map((movie) => (
            <PosterCard key={movie.id} movie={movie} />
          ))}
        </div>
      );
    },
  });
}

The data string returned to the LLM includes titles, years, and ratings in a numbered list. In voice mode, the LLM reads this and says something like: “Here are four results. First up is Inception from 2010, a solid 8.4. Then we have Interstellar, also from Nolan...” Meanwhile the poster grid is already visible on screen.

get_movie_details

When the user asks about a specific film, this tool fetches full details including credits, videos, and streaming providers in a single TMDB call (using append_to_response). The visual card shows a backdrop image, genre tags, cast list, and director. The text return gives the LLM enough to narrate a compelling summary.

app/lib/tools/get-movie-details.tsxtsx

import { defineTool } from "glove-react";
import { z } from "zod";
import {
  getMovieDetails,
  backdropUrl,
  movieYear,
  formatRuntime,
  getDirector,
  getTopCast,
  genreNames,
  type TMDBMovie,
} from "../tmdb";

export function createGetMovieDetailsTool() {
  return defineTool({
    name: "get_movie_details",
    description:
      "Get comprehensive details about a movie including overview, " +
      "cast, director, runtime, rating, genres, and streaming availability.",
    inputSchema: z.object({
      movie_id: z.number().describe("TMDB movie ID"),
    }),
    displayPropsSchema: z.object({
      movie: z.any(),
    }),

    async do(input, display) {
      const movie = await getMovieDetails(input.movie_id);
      await display.pushAndForget({ movie });

      const year = movieYear(movie);
      const director = getDirector(movie);
      const cast = getTopCast(movie, 5);
      const castNames = cast.map((c) => c.name).join(", ");
      const overviewSnippet =
        movie.overview.length > 200
          ? movie.overview.substring(0, 200) + "..."
          : movie.overview;

      return {
        status: "success" as const,
        data: `${movie.title} (${year}), directed by ${director}. ${overviewSnippet} Stars: ${castNames}. Rating: ${movie.vote_average.toFixed(1)}/10.`,
        renderData: { movie },
      };
    },

    render({ props }) {
      const movie = props.movie as TMDBMovie;
      return <MovieInfoCard movie={movie} />;
    },

    renderResult({ data }) {
      const result = data as { movie: TMDBMovie };
      return <MovieInfoCard movie={result.movie} />;
    },
  });
}

Notice the data string: “Inception (2010), directed by Christopher Nolan. A thief who steals corporate secrets through dream-sharing technology... Stars: Leonardo DiCaprio, Joseph Gordon-Levitt, Elliot Page... Rating: 8.4/10.” The LLM uses this to speak naturally. It will not read it verbatim — the system prompt tells it to describe movies with feeling.

remember_preference (pure data, no UI)

Not every tool needs a visual component. The remember_preference tool silently stores user taste preferences. It has no render function and no pushAndForget call. The LLM acknowledges the preference verbally (“Got it, you love Villeneuve”) without showing anything on screen.

app/lib/tools/remember-preference.tstypescript

import { defineTool } from "glove-react";
import { z } from "zod";

export function createRememberPreferenceTool() {
  return defineTool({
    name: "remember_preference",
    description:
      "Remember a user preference about movies — favorite genres, " +
      "directors, actors, moods, or anything else. Data-only tool " +
      "with no visual display.",
    inputSchema: z.object({
      preference: z.string().describe("User preference to remember"),
      category: z
        .string()
        .optional()
        .describe("Category: genre, director, actor, mood, other"),
    }),
    displayPropsSchema: z.object({}),
    async do(input) {
      const category = input.category ?? "other";
      return {
        status: "success" as const,
        data: `Noted preference (${category}): ${input.preference}`,
      };
    },
  });
}

4. The complete tool inventory

Lola has nine tools, all using pushAndForget (except remember_preference which has no UI at all). Each tool returns descriptive text for narration alongside visual card data.

Tool	Visual Card	Narration Text
`search_movies`	Poster grid with rating badges	Numbered list of titles, years, and ratings
`get_movie_details`	Full info card: backdrop, genres, cast, director, overview	Title, year, director, cast names, overview snippet, rating
`get_ratings`	Score display with rating bar and vote count	Title, score out of 10, vote count
`get_trailer`	YouTube embed (16:9 aspect ratio)	“Trailer for [Title] is now playing on screen”
`compare_movies`	Side-by-side cards (2–4 films) with posters and genres	Per-movie summary: title, year, rating, runtime, genres
`get_recommendations`	Numbered list with poster thumbnails and overview snippets	Numbered list of titles with brief descriptions
`get_person`	Profile card with photo, bio, and notable films	Name, department, notable film titles
`get_streaming`	Provider badges grouped by type (stream, rent, buy)	“Stream on Netflix. Rent on Apple TV.”
`remember_preference`	None	LLM acknowledges verbally

All tools are assembled in a single factory function:

app/lib/tools/index.tstypescript

import type { ToolConfig } from "glove-react";
import { createSearchMoviesTool } from "./search-movies";
import { createGetMovieDetailsTool } from "./get-movie-details";
import { createGetRatingsTool } from "./get-ratings";
import { createGetTrailerTool } from "./get-trailer";
import { createCompareMoviesTool } from "./compare-movies";
import { createGetRecommendationsTool } from "./get-recommendations";
import { createGetPersonTool } from "./get-person";
import { createGetStreamingTool } from "./get-streaming";
import { createRememberPreferenceTool } from "./remember-preference";

export function createLolaTools(): ToolConfig[] {
  return [
    createSearchMoviesTool(),
    createGetMovieDetailsTool(),
    createGetRatingsTool(),
    createGetTrailerTool(),
    createCompareMoviesTool(),
    createGetRecommendationsTool(),
    createGetPersonTool(),
    createGetStreamingTool(),
    createRememberPreferenceTool(),
  ];
}

5. The visual area pattern

In a voice-first app, there is no scrolling chat column. The screen has a single visual area in the center that shows the most relevant content. The visual area has three states:

Active slot — when a tool has just fired via pushAndForget, the visual area renders that tool's card. If multiple tools fire in sequence, the latest one wins.
Last result — when no active slot exists but a previous tool has renderData, the visual area shows the most recent completed result. This means the movie info card stays visible even after the LLM finishes speaking.
Empty state — when there is nothing to show and the agent is not busy, the visual area shows a cinematic onboarding screen with suggestion chips like “Best sci-fi from the 90s” or “Something like Eternal Sunshine.”

app/components/visual-area.tsxtsx

import { useMemo, type ReactNode } from "react";
import type { TimelineEntry, EnhancedSlot } from "glove-react";

interface VisualAreaProps {
  slots: EnhancedSlot[];
  timeline: TimelineEntry[];
  renderSlot: (slot: EnhancedSlot) => ReactNode;
  renderToolResult: (entry: TimelineEntry & { kind: "tool" }) => ReactNode;
  busy: boolean;
  onSuggestion?: (text: string) => void;
}

const SUGGESTIONS = [
  "Best sci-fi from the 90s",
  "Something like Eternal Sunshine",
  "Who directed Parasite?",
  "Cozy rainy day movies",
];

export function VisualArea({
  slots,
  timeline,
  renderSlot,
  renderToolResult,
  busy,
  onSuggestion,
}: VisualAreaProps) {
  const lastToolResult = useMemo(() => {
    for (let i = timeline.length - 1; i >= 0; i--) {
      const entry = timeline[i];
      if (
        entry.kind === "tool" &&
        entry.status === "success" &&
        entry.renderData !== undefined
      ) {
        return entry;
      }
    }
    return null;
  }, [timeline]);

  // Case 1: Active slots — render each via renderSlot
  if (slots.length > 0) {
    return (
      <div className="visual-area">
        {slots.map((slot) => (
          <div key={slot.id} className="display-card">
            {renderSlot(slot)}
          </div>
        ))}
      </div>
    );
  }

  // Case 2: Recent tool result with renderData
  if (lastToolResult) {
    const rendered = renderToolResult(lastToolResult);
    if (rendered) {
      return (
        <div className="visual-area">
          <div className="display-card">{rendered}</div>
        </div>
      );
    }
  }

  // Case 3: Empty state — suggestion chips
  if (!busy) {
    return (
      <div className="visual-area">
        <div className="lola-empty">
          <h1 className="lola-empty__title">Lola</h1>
          <p className="lola-empty__subtitle">
            Your voice-first movie companion.<br />
            Ask me anything about film.
          </p>
          {onSuggestion && (
            <div className="lola-empty__suggestions">
              {SUGGESTIONS.map((s) => (
                <button
                  key={s}
                  type="button"
                  className="lola-empty__chip"
                  onClick={() => onSuggestion(s)}
                >
                  {s}
                </button>
              ))}
            </div>
          )}
        </div>
      </div>
    );
  }

  return <div className="visual-area" />;
}

The visual area is not a Glove concept — it is a UI pattern you build yourself using the slots, timeline, renderSlot, and renderToolResult values from useGlove. In a chat-based app you interleave slots into a message list. In a voice-first app you replace the entire center of the screen.

6. Voice pipeline setup

The voice pipeline has three parts: speech-to-text (STT), text-to-speech (TTS), and voice activity detection (VAD). All three are configured in a single file.

Token routes

ElevenLabs requires short-lived tokens for browser-side connections. Glove provides a helper that generates these tokens from your API key:

app/api/voice/stt-token/route.tstypescript

import { createVoiceTokenHandler } from "glove-next";

export const GET = createVoiceTokenHandler({ provider: "elevenlabs", type: "stt" });

app/api/voice/tts-token/route.tstypescript

import { createVoiceTokenHandler } from "glove-next";

export const GET = createVoiceTokenHandler({ provider: "elevenlabs", type: "tts" });

Voice adapters

The adapters connect the token routes to ElevenLabs and configure the voice. Lola uses the “Charlotte” voice — a warm, cinematic tone that fits the film companion persona:

app/lib/voice.tstypescript

import { createElevenLabsAdapters } from "glove-voice";

async function fetchToken(path: string): Promise<string> {
  const res = await fetch(path);
  const data = (await res.json()) as { token?: string; error?: string };
  if (!res.ok || !data.token) {
    throw new Error(data.error ?? `Token fetch failed (${res.status})`);
  }
  return data.token;
}

export const { stt, createTTS } = createElevenLabsAdapters({
  getSTTToken: () => fetchToken("/api/voice/stt-token"),
  getTTSToken: () => fetchToken("/api/voice/tts-token"),
  voiceId: "XB0fDUnXU5powFXDhCwa", // "Charlotte" — warm, cinematic
});

export async function createSileroVAD() {
  const { SileroVADAdapter } = await import("glove-voice/silero-vad");
  const vad = new SileroVADAdapter({
    positiveSpeechThreshold: 0.5,
    negativeSpeechThreshold: 0.35,
    wasm: { type: "cdn" },
  });
  await vad.init();
  return vad;
}

Silero VAD is a small neural network that runs in the browser using WebAssembly. It listens to the microphone and detects when you start and stop speaking. This is what enables the hands-free “auto” turn mode — you speak, it detects silence, and it automatically sends your speech for transcription. The positiveSpeechThreshold and negativeSpeechThreshold control how sensitive the detection is.

Silero VAD is imported dynamically with await import("glove-voice/silero-vad") because it loads an ONNX model file. Dynamic import keeps it out of the initial bundle and allows it to load the WebAssembly runtime on demand.

7. The voice orb

The voice orb is the primary interaction element. It is an 80px sharp-cornered amber square that communicates state through layered ring animations. Think of it as a visual heartbeat for the voice pipeline.

The orb has six states, each mapped to the VoiceMode from useGloveVoice plus two additional UI states for manual recording and processing:

State	Visual	Meaning
`idle`	Static amber square with mic icon	Voice session not started; tap to begin
`listening`	Gentle breathing pulse on the outer ring	Microphone is active, waiting for speech
`recording`	Warm orange pulse on the core	Manual mode: actively capturing your voice
`processing`	Subdued spin on the middle ring	Finalizing transcription before sending to LLM
`thinking`	Counter-rotating dashed rings	LLM is generating a response (tool calls, text)
`speaking`	Concentric ripples expanding outward	Lola is speaking; tap to interrupt

The orb click handler adapts to the current state:

app/components/voice-orb.tsx (click handler)typescript

const handleClick = () => {
  if (mode === "speaking") {
    onInterrupt();         // Tap while speaking → interrupt
  } else if (isProcessing) {
    onStop();              // Tap while processing → cancel
  } else if (isManual && mode === "listening") {
    if (isManualRecording) {
      onManualRecordStop();  // Tap while recording → send
    } else {
      onManualRecordStart(); // Tap while idle → start recording
    }
  } else {
    onStop();              // Tap otherwise → end voice session
  }
};

The orb also shows a status label beneath it. In listening mode it says “Listening”; while recording it shows the live transcript; while thinking it says “Thinking”; while speaking it says “Speaking.” In manual mode with no recording active, it shows “Hold space or tap to speak.”

8. Dynamic system prompt for voice

Lola uses two system prompts. The base prompt defines her personality and tool usage guidelines. The voice prompt extends the base with narration instructions.

app/lib/system-prompt.tstypescript

export const systemPrompt = `You are Lola, a passionate and knowledgeable movie companion.

## Your Personality
- Genuinely passionate about cinema across all genres and eras
- Warm but opinionated — you have taste but respect others' preferences
- Concise — 1-2 sentences between tool calls. Let the visual cards do the talking.
- You describe movies by feel, not data — "gorgeous, melancholic road trip" not "received 7.8 on IMDb"

## Tool Usage Guidelines
- ALWAYS use visual tools — never list movies as plain text
- Use search_movies for any movie search
- Use get_movie_details when discussing a specific film in depth
- Use get_trailer proactively when it would enhance the conversation
- Keep text responses SHORT — let the visual cards speak`;

export const voiceSystemPrompt = `${systemPrompt}

## Voice Mode — IMPORTANT
The user is interacting via voice. All tools display visual cards on screen.
You MUST ALSO describe things verbally since the user may not be looking at the screen.

### After Each Tool
- search_movies: Briefly narrate the top 2-3 results — title, year, one line each
- get_movie_details: Highlight the director, lead actors, and a sentence about the plot
- get_ratings: Speak the score and what it means ("solid 8.1 — critics loved it")
- get_trailer: Let them know the trailer is playing on screen
- compare_movies: Summarize the key differences verbally
- get_recommendations: Read out the top 2-3 picks with brief reasons
- get_person: Mention their most notable roles
- get_streaming_availability: Tell them where it's available
- remember_preference: Just acknowledge verbally ("Got it, noted.")

### Speaking Style
- Conversational — like a friend who loves movies, chatting on the couch
- Describe movies with feeling — "It's this gorgeous, melancholic road trip"
- Keep it concise for voice — shorter than text responses
- Ask one thing at a time — don't overwhelm
- Never read metadata robotically — translate data into human sentences`;

The swap happens at runtime. When the voice session starts, the orchestrator calls runnable.setSystemPrompt(voiceSystemPrompt). When voice stops, it reverts to the base prompt. This means the LLM's behavior changes dynamically — in voice mode it narrates tool results, in text mode it keeps responses short and lets the cards speak.

app/components/lola.tsx (prompt swap)typescript

useEffect(() => {
  if (!runnable) return;
  if (voice.isActive) {
    runnable.setSystemPrompt(voiceSystemPrompt);
  } else {
    runnable.setSystemPrompt(systemPrompt);
  }
}, [voice.isActive, runnable]);

9. Wiring it all together

The Lola component is the orchestrator. It initializes useGlove with the tools, sets up the voice pipeline with useGloveVoice, manages VAD initialization, handles the thinking sound loop, and renders the three-part layout.

app/lib/client.tstypescript

import { GloveClient, createRemoteStore } from "glove-react";
import { systemPrompt } from "./system-prompt";
import { storeActions } from "./store-actions";

export const gloveClient = new GloveClient({
  endpoint: "/api/chat",
  systemPrompt,
  createStore: (sessionId) => createRemoteStore(sessionId, storeActions),
});

app/components/lola.tsx (orchestrator)tsx

"use client";

import { useState, useRef, useMemo, useCallback, useEffect } from "react";
import { useGlove } from "glove-react";
import { useGloveVoice } from "glove-react/voice";
import type { TurnMode } from "glove-react/voice";
import { createLolaTools } from "../lib/tools";
import { stt, createTTS, createSileroVAD } from "../lib/voice";
import { systemPrompt, voiceSystemPrompt } from "../lib/system-prompt";
import { VisualArea } from "./visual-area";
import { TranscriptStrip } from "./transcript-strip";
import { VoiceOrb } from "./voice-orb";
import { TextInput } from "./text-input";

interface LolaProps {
  sessionId: string;
  onFirstMessage?: (sessionId: string, text: string) => void;
}

export function Lola({ sessionId, onFirstMessage }: LolaProps) {
  const [turnMode, setTurnMode] = useState<TurnMode>("vad");
  const [isManualRecording, setIsManualRecording] = useState(false);
  const [isProcessing, setIsProcessing] = useState(false);
  const [showTextInput, setShowTextInput] = useState(false);
  const [input, setInput] = useState("");
  const [vadReady, setVadReady] = useState(false);
  const vadRef = useRef<Awaited<ReturnType<typeof createSileroVAD>> | null>(null);

  const MIN_RECORDING_MS = 350;

  // Tools — created once, stable reference
  const tools = useMemo(() => createLolaTools(), []);

  // Glove hook — conversation engine
  const glove = useGlove({ tools, sessionId });
  const {
    runnable, timeline, streamingText, busy,
    slots, sendMessage, renderSlot, renderToolResult,
  } = glove;

  // Silero VAD — async initialization
  useEffect(() => {
    createSileroVAD().then((v) => {
      vadRef.current = v;
      setVadReady(true);
    });
  }, []);

  // Voice pipeline
  const voiceConfig = useMemo(
    () => ({
      stt,
      createTTS,
      vad: vadReady ? vadRef.current ?? undefined : undefined,
      turnMode,
    }),
    [vadReady, turnMode],
  );
  const voice = useGloveVoice({ runnable, voice: voiceConfig });

  // Dynamic system prompt swap
  useEffect(() => {
    if (!runnable) return;
    if (voice.isActive) {
      runnable.setSystemPrompt(voiceSystemPrompt);
    } else {
      runnable.setSystemPrompt(systemPrompt);
    }
  }, [voice.isActive, runnable]);

  // Thinking sound loop
  useEffect(() => {
    if (voice.mode !== "thinking") return;
    const audio = new Audio("/lola-thinking.mp3");
    audio.loop = true;
    audio.play().catch(() => {});
    return () => {
      audio.pause();
      audio.src = "";
    };
  }, [voice.mode]);

  // Last agent text from timeline for transcript strip
  const lastAgentText = useMemo(() => {
    for (let i = timeline.length - 1; i >= 0; i--) {
      const entry = timeline[i];
      if (entry.kind === "agent_text") return entry.text;
    }
    return "";
  }, [timeline]);

  return (
    <div className="lola-screen">
      <VisualArea
        slots={slots}
        timeline={timeline}
        renderSlot={renderSlot}
        renderToolResult={renderToolResult}
        busy={busy}
        onSuggestion={(text) => sendMessage(text)}
      />

      <TranscriptStrip
        text={streamingText || lastAgentText}
        isStreaming={!!streamingText}
      />

      <div className="orb-area">
        {voice.isActive ? (
          <VoiceOrb
            mode={voice.mode}
            transcript={voice.transcript}
            turnMode={turnMode}
            isManualRecording={isManualRecording}
            isProcessing={isProcessing}
            onStop={() => voice.stop()}
            onInterrupt={voice.interrupt}
            onManualRecordStart={() => { /* manual recording logic */ }}
            onManualRecordStop={() => { /* commit recording logic */ }}
          />
        ) : (
          <button
            className="voice-orb voice-orb--idle"
            onClick={() => voice.start()}
          >
            Start Voice
          </button>
        )}

        <TextInput
          visible={showTextInput}
          onToggle={() => setShowTextInput(!showTextInput)}
          input={input}
          setInput={setInput}
          busy={busy}
          onSubmit={(e) => {
            e.preventDefault();
            const text = input.trim();
            if (!text || busy) return;
            setInput("");
            sendMessage(text);
          }}
        />
      </div>
    </div>
  );
}

The render structure is flat: VisualArea fills the center, TranscriptStrip sits near the bottom, and the orb-area holds the voice orb plus the optional text input. There is no chat column, no message list, no scroll. The visual area is the only place where tool output appears, and it shows only the most recent content.

10. The transcript strip

The transcript strip shows Lola's most recent spoken or streamed text near the bottom of the screen. It is styled in serif font to match the cinematic aesthetic. While the LLM is streaming, a gentle pulse keeps the text alive. After four seconds of silence, the text fades to near-invisible so it does not compete with the visual area.

app/components/transcript-strip.tsxtsx

import { useEffect, useRef, useState } from "react";

interface TranscriptStripProps {
  text: string;
  isStreaming: boolean;
}

const FADE_DELAY_MS = 4000;

export function TranscriptStrip({ text, isStreaming }: TranscriptStripProps) {
  const [isFading, setIsFading] = useState(false);
  const timerRef = useRef<ReturnType<typeof setTimeout> | null>(null);
  const prevTextRef = useRef(text);

  useEffect(() => {
    // Reset fade when text changes or streaming starts
    if (text !== prevTextRef.current || isStreaming) {
      prevTextRef.current = text;
      setIsFading(false);
      if (timerRef.current) {
        clearTimeout(timerRef.current);
        timerRef.current = null;
      }
    }

    // Start fade timer when not streaming and text exists
    if (!isStreaming && text) {
      timerRef.current = setTimeout(() => {
        setIsFading(true);
        timerRef.current = null;
      }, FADE_DELAY_MS);
    }

    return () => {
      if (timerRef.current) {
        clearTimeout(timerRef.current);
        timerRef.current = null;
      }
    };
  }, [text, isStreaming]);

  if (!text) return null;

  // Show the last ~180 characters, trimmed to a word boundary
  const displayText =
    text.length > 180
      ? "\u2026" + text.slice(text.length - 180).replace(/^\S*\s/, "")
      : text;

  return (
    <div className="transcript-strip" role="status" aria-live="polite">
      <p
        className={`transcript-strip__text ${
          isStreaming ? "transcript-strip__text--streaming" : ""
        } ${isFading ? "transcript-strip__text--fading" : ""}`}
      >
        {displayText}
      </p>
    </div>
  );
}

The 180-character trim ensures the strip never wraps excessively. For long narrations, it shows the tail end with a leading ellipsis. The role="status" and aria-live="polite" attributes ensure screen readers announce new text without interrupting the user.

11. The thinking sound

When the LLM is processing (calling tools, generating text), Lola plays a subtle ambient sound loop. This gives the user audio feedback that something is happening, even when the screen has not changed yet.

app/components/lola.tsx (thinking sound)typescript

useEffect(() => {
  if (voice.mode !== "thinking") return;

  const audio = new Audio("/lola-thinking.mp3");
  audio.loop = true;
  audio.play().catch(() => {});

  return () => {
    audio.pause();
    audio.src = "";
  };
}, [voice.mode]);

The useEffect cleanup function stops the sound immediately when the voice mode changes away from "thinking". Setting audio.src to an empty string releases the audio resource. The .catch(() => {}); handles browsers that block autoplay — the sound is a nice-to-have, not critical.

12. Turn modes: auto vs. push-to-talk

Lola supports two turn modes for voice input:

Auto (VAD) — Silero VAD detects when you start and stop speaking. After a pause, it automatically transcribes and sends your speech. This is the default and feels like a natural conversation.
Push to talk (manual) — you hold the spacebar or tap the orb to record, then release to send. This is useful in noisy environments or when you want precise control over when your speech is captured.

A toggle between these modes sits below the voice orb. It is only enabled when the voice pipeline is in the listening state — you cannot switch modes while the LLM is thinking or speaking.

In manual mode, a 350ms minimum recording duration prevents false positive triggers. If you tap the orb briefly (under 350ms), the commit is delayed until the minimum threshold is reached. This avoids sending empty or garbled audio to the STT service.

app/components/lola.tsx (min-duration commit)typescript

const MIN_RECORDING_MS = 350;

const commitRecording = useCallback(() => {
  if (!recordingRef.current) return;
  recordingRef.current = false;
  setIsManualRecording(false);

  const elapsed = Date.now() - recordingStartRef.current;

  if (elapsed >= MIN_RECORDING_MS) {
    setIsProcessing(true);
    commitTurnRef.current();
  } else {
    // Delay commit until minimum recording duration
    setIsProcessing(true);
    const remaining = MIN_RECORDING_MS - elapsed;
    pendingCommitRef.current = setTimeout(() => {
      pendingCommitRef.current = null;
      commitTurnRef.current();
    }, remaining);
  }
}, []);

Display patterns summary

Tool	Display Method	Why
`search_movies`	`pushAndForget`	Poster grid appears instantly; LLM narrates results in parallel
`get_movie_details`	`pushAndForget`	Info card appears; LLM describes the film verbally
`get_ratings`	`pushAndForget`	Rating card appears; LLM speaks the score
`get_trailer`	`pushAndForget`	YouTube embed appears; LLM says “trailer is playing”
`compare_movies`	`pushAndForget`	Side-by-side cards appear; LLM summarizes differences
`get_recommendations`	`pushAndForget`	Numbered list appears; LLM reads the top picks
`get_person`	`pushAndForget`	Profile card appears; LLM mentions notable roles
`get_streaming`	`pushAndForget`	Provider badges appear; LLM says where to watch
`remember_preference`	None	Pure data — LLM acknowledges verbally, no UI

Every single tool uses pushAndForget. There is no pushAndWait anywhere in the Lola codebase. This is the defining characteristic of a voice-first app. The moment you add a blocking tool, you break the voice flow.

13. Design palette

Lola uses a charcoal + amber palette inspired by film noir aesthetics. The background is void black (#0d0d0f), cards use charcoal surfaces, and amber provides warmth for accents, ratings, and the voice orb.

app/lib/theme.tstypescript

export const VOID = "#0d0d0f";

export const CHARCOAL: Record<number, string> = {
  900: "#1a1a1f",
  800: "#222228",
  700: "#2a2a32",
  600: "#333340",
  500: "#3d3d48",
};

export const AMBER: Record<number, string> = {
  500: "#d4911e",
  400: "#f5a623",
  300: "#f7b84d",
  200: "#fcd88e",
  100: "#fde8b5",
  50: "#fef7e6",
};

export const CREAM = "#faf7f2";
export const CREAM_MUTED = "#a8a4a0";
export const CREAM_DIM = "#706c68";

Three typefaces reinforce the cinematic feel: Instrument Serif for movie titles and the transcript strip, DM Sans for body text and labels, and DM Mono for metadata like years, runtimes, and rating numbers.

14. Run it

terminalbash

# From the monorepo root
pnpm install

# Set environment variables in examples/lola/.env.local:
#   OPENROUTER_API_KEY=...
#   TMDB_API_KEY=...
#   ELEVENLABS_API_KEY=...

pnpm --filter glove-lola run dev

Try these conversations:

“Tell me about Inception” — a poster grid appears, then the LLM narrates the results. Ask “more details on that one” and the full info card replaces the grid.
“Show me the trailer” — a YouTube embed appears in the visual area. Lola says “Here's the trailer — take a look.”
“Compare it with Interstellar” — side-by-side comparison cards appear. Lola highlights the key differences verbally.
“Cozy rainy day movies” — mood-based discovery. Lola uses get_recommendations with the mood string, which maps to genre IDs internally.
“Who directed Parasite?” — a person profile card appears for Bong Joon-ho with his notable films. Lola mentions his other work verbally.
“Where can I watch it?” — streaming provider badges appear grouped by type (stream, rent, buy). Lola tells you the options out loud.

Notice that throughout the conversation, you never need to tap the screen. The visual cards are ambient — they appear and stay visible while Lola narrates. The voice orb communicates state through animation. The text input is hidden by default and only appears when you tap the keyboard icon.

Where each piece runs

Piece	Where	Why
`createChatHandler`	Server	LLM proxy — sends tool schemas, streams responses
Tool `do` functions	Browser	Fetch from TMDB proxy, fire visual cards, return text for narration
`/api/tmdb/[...path]`	Server	TMDB proxy — keeps API key server-side
`/api/voice/stt-token`	Server	Generates ElevenLabs STT tokens
`/api/voice/tts-token`	Server	Generates ElevenLabs TTS tokens
ElevenLabs STT/TTS	Browser (direct connection)	Browser uses tokens to stream audio directly to ElevenLabs
Silero VAD	Browser (WebAssembly)	Runs a small neural network locally for speech detection
Visual area, orb, transcript strip	Browser	UI components rendering tool output and voice state

Next steps

Build a Coffee Shop — see voice-enabled (text primary, voice secondary) for comparison with voice-first
Voice Integration — deep dive into useGloveVoice, adapters, VAD, turn modes, and token routes
The Display Stack — pushAndWait vs. pushAndForget and display strategies
Build a Shopping Assistant — see pushAndWait for interactive forms (the opposite of Lola's approach)
Build a Coding Agent — gate-execute-display pattern for server mutations
defineTool API Reference — full API for typed tool definitions with displayPropsSchema and resolveSchema
React API Reference — full API documentation for useGlove, GloveClient, and rendering