Build a Voice-First Movie Companion

In this tutorial you will build Lola, a voice-first movie companion powered by TMDB. The user speaks, Lola responds with voice narration and visual cards — poster grids, movie info, trailers, comparisons, and streaming availability. There is a text input as a fallback, but voice is the primary interaction mode.

This is fundamentally different from adding voice to an existing chat app. In a voice-enabled app (like a coffee shop ordering assistant), you start with text and add voice as a secondary input. In a voice-first app, voice is the default. The screen is not a chat column — it is a visual area that shows ambient, glanceable cards while the AI narrates the content out loud. Every tool uses pushAndForget because nothing should ever block the voice conversation.

Prerequisites: You should have completed Getting Started, read The Display Stack, and reviewed Voice Integration.

What you will build

A movie companion: the user taps the voice orb and says “Tell me about Inception,” and the app will:

  1. Transcribe the user's speech to text using ElevenLabs speech-to-text
  2. Send the text to the LLM, which calls search_movies — a poster grid appears on screen (pushAndForget)
  3. The LLM narrates the search results out loud while the user sees the poster cards
  4. The LLM calls get_movie_details — a detailed info card replaces the poster grid (pushAndForget)
  5. Lola describes the director, cast, and plot verbally while the card is visible
  6. The user says “Show me the trailer” — a YouTube embed appears (pushAndForget)

Nine tools, one TMDB proxy route. The user never types unless they choose to. The visual area shows the most recent tool result; the transcript strip shows Lola's latest spoken words in serif font near the bottom of the screen; the voice orb communicates the current state through animation.

Voice-first vs. voice-enabled

Before diving into code, it is important to understand the distinction between voice-first and voice-enabled. Both use the same glove-voice package, but the design decisions are opposite.

| Aspect | Voice-Enabled (Coffee Shop) | Voice-First (Lola) |
| --- | --- | --- |
| Primary input | Text — voice is secondary | Voice — text is a fallback |
| Screen layout | Chat column with messages | Visual area + transcript strip + voice orb |
| Tool blocking | Mix of pushAndWait and pushAndForget | All tools use pushAndForget |
| Tool results | AI reads structured data, may substitute text | AI narrates results verbally while card is visible |
| User interaction with cards | Click buttons, fill forms | Glance at visual information — no clicks needed |
| System prompt | Same prompt for text and voice | Different prompt for voice mode (narration instructions) |

The key insight: in a voice-first app, if any tool uses pushAndWait, it blocks the LLM response loop. The agent cannot speak until the user clicks something on screen. That defeats the purpose of voice. Every tool must use pushAndForget so the visual card fires and the agent immediately narrates the content.

Architecture overview

Lola has three layers: a TMDB proxy that keeps the API key server-side, a voice pipeline (ElevenLabs + Silero VAD), and a screen layout built from three components.

1. Project setup

Start from a Next.js project with Glove and voice packages installed:

terminal
pnpm add glove-core glove-react glove-next glove-voice zod

Lola also uses @ricky0123/vad-web and onnxruntime-web for Silero VAD (voice activity detection — the browser-side model that detects when you start and stop speaking):

terminal
pnpm add @ricky0123/vad-web onnxruntime-web

Create three environment variables:

.env.local
OPENROUTER_API_KEY=your-openrouter-key
TMDB_API_KEY=your-tmdb-v3-bearer-token
ELEVENLABS_API_KEY=your-elevenlabs-key

The TMDB key is the free API Read Access Token from themoviedb.org, sent as a bearer token. The ElevenLabs API key comes from elevenlabs.io (the free tier works). The OpenRouter API key lets you use any model provider through a single endpoint.

2. TMDB integration

The TMDB integration has two parts: a server-side proxy route that keeps your API key secret, and a client-side module with typed helper functions.

The proxy route

A single catch-all route forwards any path to the TMDB API with your bearer token attached:

app/api/tmdb/[...path]/route.ts
import { NextResponse } from "next/server";

const TMDB_API_BASE = "https://api.themoviedb.org/3";

export async function GET(
  req: Request,
  { params }: { params: Promise<{ path: string[] }> },
) {
  const { path } = await params;
  const url = new URL(req.url);

  const apiKey = process.env.TMDB_API_KEY;
  if (!apiKey) {
    return NextResponse.json({ error: "TMDB_API_KEY not set" }, { status: 500 });
  }

  const tmdbUrl = `${TMDB_API_BASE}/${path.join("/")}?${url.searchParams.toString()}`;

  const res = await fetch(tmdbUrl, {
    headers: { Authorization: `Bearer ${apiKey}` },
  });

  const data = await res.json();
  return NextResponse.json(data, { status: res.status });
}

When client-side code calls /api/tmdb/search/movie?query=Inception, this route rewrites it to https://api.themoviedb.org/3/search/movie?query=Inception with the bearer token. The API key never reaches the browser.
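The rewrite can be expressed as a pure function. The sketch below (a hypothetical rewriteToTmdb helper for illustration, not part of the route) shows the mapping the route performs:

```typescript
// Hypothetical helper mirroring the route's rewrite logic (illustration only).
const TMDB_BASE = "https://api.themoviedb.org/3";

export function rewriteToTmdb(requestUrl: string): string {
  // The base origin is only needed so URL() can parse a relative path.
  const url = new URL(requestUrl, "http://localhost");
  // Strip the /api/tmdb prefix; keep the rest of the path and the query string.
  const path = url.pathname.replace(/^\/api\/tmdb\//, "");
  const query = url.searchParams.toString();
  return query ? `${TMDB_BASE}/${path}?${query}` : `${TMDB_BASE}/${path}`;
}

// rewriteToTmdb("/api/tmdb/search/movie?query=Inception")
//   → "https://api.themoviedb.org/3/search/movie?query=Inception"
```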

Client-side helpers

A tmdb.ts module wraps the proxy with typed functions and image URL builders. Here are the key parts:

app/lib/tmdb.ts
const API_BASE = "/api/tmdb";
const TMDB_IMAGE_BASE = "https://image.tmdb.org/t/p";

export interface TMDBMovie {
  id: number;
  title: string;
  overview: string;
  release_date: string;
  vote_average: number;
  vote_count: number;
  poster_path: string | null;
  backdrop_path: string | null;
  genres?: { id: number; name: string }[];
  runtime?: number;
  tagline?: string;
  credits?: {
    cast: TMDBCastMember[];
    crew: TMDBCrewMember[];
  };
  videos?: { results: TMDBVideo[] };
  "watch/providers"?: {
    results: Record<string, TMDBProviderData>;
  };
}

// Image URL helpers
export function posterUrl(
  path: string | null,
  size: "w92" | "w154" | "w185" | "w342" | "w500" | "w780" | "original" = "w342",
): string | null {
  if (!path) return null;
  return `${TMDB_IMAGE_BASE}/${size}${path}`;
}

// Internal fetch helper — all calls go through the proxy
async function tmdbFetch<T>(path: string, params?: Record<string, string>): Promise<T> {
  const url = new URL(`${API_BASE}/${path}`, window.location.origin);
  if (params) {
    for (const [key, value] of Object.entries(params)) {
      url.searchParams.set(key, value);
    }
  }
  const res = await fetch(url.toString());
  if (!res.ok) {
    const errorBody = await res.text().catch(() => "Unknown error");
    throw new Error(`TMDB API error (${res.status}): ${errorBody}`);
  }
  return res.json() as Promise<T>;
}

// API functions
export async function searchMovies(query: string, year?: number): Promise<TMDBMovie[]> {
  const params: Record<string, string> = { query };
  if (year) params.year = String(year);
  const data = await tmdbFetch<{ results: TMDBMovie[] }>("search/movie", params);
  return data.results;
}

export async function getMovieDetails(movieId: number): Promise<TMDBMovie> {
  return tmdbFetch<TMDBMovie>(`movie/${movieId}`, {
    append_to_response: "credits,videos,watch/providers",
  });
}

// Utility helpers
export function movieYear(movie: TMDBMovie): string {
  if (!movie.release_date) return "Unknown";
  return movie.release_date.substring(0, 4);
}

export function getDirector(movie: TMDBMovie): string {
  if (!movie.credits?.crew) return "Unknown";
  const director = movie.credits.crew.find((c) => c.job === "Director");
  return director?.name ?? "Unknown";
}

export function getTopCast(movie: TMDBMovie, count: number = 5): TMDBCastMember[] {
  if (!movie.credits?.cast) return [];
  // Copy before sorting — Array.prototype.sort mutates the array in place
  return [...movie.credits.cast].sort((a, b) => a.order - b.order).slice(0, count);
}

Every TMDB call goes through tmdbFetch, which constructs a URL pointing at /api/tmdb/... and parses the JSON response. The typed return values mean your tools' do functions get full autocomplete for movie fields, cast members, and provider data.
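The excerpt references a few supporting interfaces (TMDBCastMember, TMDBCrewMember, TMDBVideo) without showing them. Here is a plausible sketch of those types, plus a trailer-picking helper of the kind the get_trailer tool would need. Both are assumptions about the full module, not verbatim source:

```typescript
// Plausible definitions for the types referenced above; the real module
// may include more fields.
export interface TMDBCastMember {
  id: number;
  name: string;
  character: string;
  order: number; // billing order; lower means more prominent
  profile_path: string | null;
}

export interface TMDBCrewMember {
  id: number;
  name: string;
  job: string; // e.g. "Director"
  department: string;
}

export interface TMDBVideo {
  key: string; // YouTube video key
  site: string; // e.g. "YouTube"
  type: string; // e.g. "Trailer", "Teaser"
  official: boolean;
}

// Hypothetical helper: pick the best YouTube trailer key, preferring
// official trailers, falling back to any trailer.
export function getTrailerKey(videos: TMDBVideo[]): string | null {
  const trailer =
    videos.find((v) => v.site === "YouTube" && v.type === "Trailer" && v.official) ??
    videos.find((v) => v.site === "YouTube" && v.type === "Trailer");
  return trailer?.key ?? null;
}
```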

3. Tool design for voice

Voice-first tools follow a specific pattern. Every tool:

  1. Fetches data from the TMDB proxy
  2. Calls display.pushAndForget() with the visual card data
  3. Returns a descriptive text string that the LLM uses for narration, plus renderData for persisting the visual card

The text return is critical. In a text-only app, the LLM reads the tool result and decides what to say. In a voice-first app, the LLM reads the tool result and speaks it. The tool must return enough information for the LLM to give a natural verbal summary, not just “Done” or a raw JSON blob.

search_movies

The search tool is the most common entry point. The user says “Find me sci-fi movies from the 90s” and the tool shows a poster grid while returning a numbered text list for narration.

app/lib/tools/search-movies.tsx
import { defineTool } from "glove-react";
import { z } from "zod";
import { searchMovies, posterUrl, movieYear, type TMDBMovie } from "../tmdb";
import { PosterCard } from "../../components/poster-card"; // path is illustrative

export function createSearchMoviesTool() {
  return defineTool({
    name: "search_movies",
    description:
      "Search for movies by title. Returns a visual grid of poster cards " +
      "and text results for narration.",
    inputSchema: z.object({
      query: z.string().describe("Search query for movies"),
      year: z.number().optional().describe("Filter by release year"),
      max_results: z.number().optional().default(4).describe("Max results (1-6)"),
    }),
    displayPropsSchema: z.object({
      movies: z.array(z.any()),
    }),

    async do(input, display) {
      const clampedMax = Math.max(1, Math.min(6, input.max_results ?? 4));
      const results = await searchMovies(input.query, input.year);
      const movies = results.slice(0, clampedMax);

      if (movies.length === 0) {
        return {
          status: "success" as const,
          data: `No movies found matching "${input.query}".`,
          renderData: { movies: [] },
        };
      }

      // Fire the visual card — does NOT block the LLM
      await display.pushAndForget({ movies });

      // Return descriptive text for voice narration
      const summaryLines = movies.map(
        (m, i) =>
          `${i + 1}. ${m.title} (${movieYear(m)}) — Rating: ${m.vote_average.toFixed(1)}/10`,
      );

      return {
        status: "success" as const,
        data: `Found ${movies.length} result(s) for "${input.query}":\n${summaryLines.join("\n")}`,
        renderData: { movies },
      };
    },

    render({ props }) {
      const movies = props.movies as TMDBMovie[];
      return (
        <div style={{ display: "flex", gap: 12, flexWrap: "wrap", justifyContent: "center" }}>
          {movies.map((movie) => (
            <PosterCard key={movie.id} movie={movie} />
          ))}
        </div>
      );
    },

    renderResult({ data }) {
      const result = data as { movies: TMDBMovie[] };
      return (
        <div style={{ display: "flex", gap: 12, flexWrap: "wrap", justifyContent: "center" }}>
          {result.movies.map((movie) => (
            <PosterCard key={movie.id} movie={movie} />
          ))}
        </div>
      );
    },
  });
}

The data string returned to the LLM includes titles, years, and ratings in a numbered list. In voice mode, the LLM reads this and says something like: “Here are four results. First up is Inception from 2010, a solid 8.4. Then we have Interstellar, also from Nolan...” Meanwhile the poster grid is already visible on screen.

get_movie_details

When the user asks about a specific film, this tool fetches full details including credits, videos, and streaming providers in a single TMDB call (using append_to_response). The visual card shows a backdrop image, genre tags, cast list, and director. The text return gives the LLM enough to narrate a compelling summary.

app/lib/tools/get-movie-details.tsx
import { defineTool } from "glove-react";
import { z } from "zod";
import {
  getMovieDetails,
  backdropUrl,
  movieYear,
  formatRuntime,
  getDirector,
  getTopCast,
  genreNames,
  type TMDBMovie,
} from "../tmdb";
import { MovieInfoCard } from "../../components/movie-info-card"; // path is illustrative

export function createGetMovieDetailsTool() {
  return defineTool({
    name: "get_movie_details",
    description:
      "Get comprehensive details about a movie including overview, " +
      "cast, director, runtime, rating, genres, and streaming availability.",
    inputSchema: z.object({
      movie_id: z.number().describe("TMDB movie ID"),
    }),
    displayPropsSchema: z.object({
      movie: z.any(),
    }),

    async do(input, display) {
      const movie = await getMovieDetails(input.movie_id);
      await display.pushAndForget({ movie });

      const year = movieYear(movie);
      const director = getDirector(movie);
      const cast = getTopCast(movie, 5);
      const castNames = cast.map((c) => c.name).join(", ");
      const overviewSnippet =
        movie.overview.length > 200
          ? movie.overview.substring(0, 200) + "..."
          : movie.overview;

      return {
        status: "success" as const,
        data: `${movie.title} (${year}), directed by ${director}. ${overviewSnippet} Stars: ${castNames}. Rating: ${movie.vote_average.toFixed(1)}/10.`,
        renderData: { movie },
      };
    },

    render({ props }) {
      const movie = props.movie as TMDBMovie;
      return <MovieInfoCard movie={movie} />;
    },

    renderResult({ data }) {
      const result = data as { movie: TMDBMovie };
      return <MovieInfoCard movie={result.movie} />;
    },
  });
}

Notice the data string: “Inception (2010), directed by Christopher Nolan. A thief who steals corporate secrets through dream-sharing technology... Stars: Leonardo DiCaprio, Joseph Gordon-Levitt, Elliot Page... Rating: 8.4/10.” The LLM uses this to speak naturally. It will not read it verbatim — the system prompt tells it to describe movies with feeling.
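The import list above pulls in backdropUrl, formatRuntime, and genreNames from tmdb.ts, which were elided from the earlier excerpt. Plausible sketches of those helpers (assumptions about the module, not verbatim source):

```typescript
const TMDB_IMAGE_BASE = "https://image.tmdb.org/t/p";

// Backdrop images use wider sizes than posters.
export function backdropUrl(
  path: string | null,
  size: "w300" | "w780" | "w1280" | "original" = "w1280",
): string | null {
  if (!path) return null;
  return `${TMDB_IMAGE_BASE}/${size}${path}`;
}

// 148 → "2h 28m"
export function formatRuntime(minutes?: number): string {
  if (!minutes) return "Unknown";
  const h = Math.floor(minutes / 60);
  const m = minutes % 60;
  return h > 0 ? `${h}h ${m}m` : `${m}m`;
}

// Joins genre names into a readable list.
export function genreNames(genres?: { id: number; name: string }[]): string {
  return genres?.map((g) => g.name).join(", ") ?? "";
}
```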

remember_preference (pure data, no UI)

Not every tool needs a visual component. The remember_preference tool records user taste preferences as plain data: the tool result carries the note back into the conversation context. It has no render function and no pushAndForget call. The LLM acknowledges the preference verbally (“Got it, you love Villeneuve”) without showing anything on screen.

app/lib/tools/remember-preference.ts
import { defineTool } from "glove-react";
import { z } from "zod";

export function createRememberPreferenceTool() {
  return defineTool({
    name: "remember_preference",
    description:
      "Remember a user preference about movies — favorite genres, " +
      "directors, actors, moods, or anything else. Data-only tool " +
      "with no visual display.",
    inputSchema: z.object({
      preference: z.string().describe("User preference to remember"),
      category: z
        .string()
        .optional()
        .describe("Category: genre, director, actor, mood, other"),
    }),
    displayPropsSchema: z.object({}),
    async do(input) {
      const category = input.category ?? "other";
      // The returned text puts the note into the conversation context;
      // a real app might also persist it to the session store here.
      return {
        status: "success" as const,
        data: `Noted preference (${category}): ${input.preference}`,
      };
    },
  });
}

4. The complete tool inventory

Lola has nine tools, all using pushAndForget (except remember_preference which has no UI at all). Each tool returns descriptive text for narration alongside visual card data.

| Tool | Visual Card | Narration Text |
| --- | --- | --- |
| search_movies | Poster grid with rating badges | Numbered list of titles, years, and ratings |
| get_movie_details | Full info card: backdrop, genres, cast, director, overview | Title, year, director, cast names, overview snippet, rating |
| get_ratings | Score display with rating bar and vote count | Title, score out of 10, vote count |
| get_trailer | YouTube embed (16:9 aspect ratio) | “Trailer for [Title] is now playing on screen” |
| compare_movies | Side-by-side cards (2–4 films) with posters and genres | Per-movie summary: title, year, rating, runtime, genres |
| get_recommendations | Numbered list with poster thumbnails and overview snippets | Numbered list of titles with brief descriptions |
| get_person | Profile card with photo, bio, and notable films | Name, department, notable film titles |
| get_streaming | Provider badges grouped by type (stream, rent, buy) | “Stream on Netflix. Rent on Apple TV.” |
| remember_preference | None | LLM acknowledges verbally |

All tools are assembled in a single factory function:

app/lib/tools/index.ts
import type { ToolConfig } from "glove-react";
import { createSearchMoviesTool } from "./search-movies";
import { createGetMovieDetailsTool } from "./get-movie-details";
import { createGetRatingsTool } from "./get-ratings";
import { createGetTrailerTool } from "./get-trailer";
import { createCompareMoviesTool } from "./compare-movies";
import { createGetRecommendationsTool } from "./get-recommendations";
import { createGetPersonTool } from "./get-person";
import { createGetStreamingTool } from "./get-streaming";
import { createRememberPreferenceTool } from "./remember-preference";

export function createLolaTools(): ToolConfig[] {
  return [
    createSearchMoviesTool(),
    createGetMovieDetailsTool(),
    createGetRatingsTool(),
    createGetTrailerTool(),
    createCompareMoviesTool(),
    createGetRecommendationsTool(),
    createGetPersonTool(),
    createGetStreamingTool(),
    createRememberPreferenceTool(),
  ];
}

5. The visual area pattern

In a voice-first app, there is no scrolling chat column. The screen has a single visual area in the center that shows the most relevant content. The visual area has three states:

  1. Active slot — when a tool has just fired via pushAndForget, the visual area renders that tool's card. If multiple tools fire in sequence, the latest one wins.
  2. Last result — when no active slot exists but a previous tool has renderData, the visual area shows the most recent completed result. This means the movie info card stays visible even after the LLM finishes speaking.
  3. Empty state — when there is nothing to show and the agent is not busy, the visual area shows a cinematic onboarding screen with suggestion chips like “Best sci-fi from the 90s” or “Something like Eternal Sunshine.”

app/components/visual-area.tsx
import { useMemo, type ReactNode } from "react";
import type { TimelineEntry, EnhancedSlot } from "glove-react";

interface VisualAreaProps {
  slots: EnhancedSlot[];
  timeline: TimelineEntry[];
  renderSlot: (slot: EnhancedSlot) => ReactNode;
  renderToolResult: (entry: TimelineEntry & { kind: "tool" }) => ReactNode;
  busy: boolean;
  onSuggestion?: (text: string) => void;
}

const SUGGESTIONS = [
  "Best sci-fi from the 90s",
  "Something like Eternal Sunshine",
  "Who directed Parasite?",
  "Cozy rainy day movies",
];

export function VisualArea({
  slots,
  timeline,
  renderSlot,
  renderToolResult,
  busy,
  onSuggestion,
}: VisualAreaProps) {
  const lastToolResult = useMemo(() => {
    for (let i = timeline.length - 1; i >= 0; i--) {
      const entry = timeline[i];
      if (
        entry.kind === "tool" &&
        entry.status === "success" &&
        entry.renderData !== undefined
      ) {
        return entry;
      }
    }
    return null;
  }, [timeline]);

  // Case 1: Active slots — render each via renderSlot
  if (slots.length > 0) {
    return (
      <div className="visual-area">
        {slots.map((slot) => (
          <div key={slot.id} className="display-card">
            {renderSlot(slot)}
          </div>
        ))}
      </div>
    );
  }

  // Case 2: Recent tool result with renderData
  if (lastToolResult) {
    const rendered = renderToolResult(lastToolResult);
    if (rendered) {
      return (
        <div className="visual-area">
          <div className="display-card">{rendered}</div>
        </div>
      );
    }
  }

  // Case 3: Empty state — suggestion chips
  if (!busy) {
    return (
      <div className="visual-area">
        <div className="lola-empty">
          <h1 className="lola-empty__title">Lola</h1>
          <p className="lola-empty__subtitle">
            Your voice-first movie companion.<br />
            Ask me anything about film.
          </p>
          {onSuggestion && (
            <div className="lola-empty__suggestions">
              {SUGGESTIONS.map((s) => (
                <button
                  key={s}
                  type="button"
                  className="lola-empty__chip"
                  onClick={() => onSuggestion(s)}
                >
                  {s}
                </button>
              ))}
            </div>
          )}
        </div>
      </div>
    );
  }

  return <div className="visual-area" />;
}

The visual area is not a Glove concept — it is a UI pattern you build yourself using the slots, timeline, renderSlot, and renderToolResult values from useGlove. In a chat-based app you interleave slots into a message list. In a voice-first app you replace the entire center of the screen.

6. Voice pipeline setup

The voice pipeline has three parts: speech-to-text (STT), text-to-speech (TTS), and voice activity detection (VAD). All three are configured in a single file.

Token routes

ElevenLabs requires short-lived tokens for browser-side connections. Glove provides a helper that generates these tokens from your API key:

app/api/voice/stt-token/route.ts
import { createVoiceTokenHandler } from "glove-next";

export const GET = createVoiceTokenHandler({ provider: "elevenlabs", type: "stt" });

app/api/voice/tts-token/route.ts
import { createVoiceTokenHandler } from "glove-next";

export const GET = createVoiceTokenHandler({ provider: "elevenlabs", type: "tts" });

Voice adapters

The adapters connect the token routes to ElevenLabs and configure the voice. Lola uses the “Charlotte” voice — a warm, cinematic tone that fits the film companion persona:

app/lib/voice.ts
import { createElevenLabsAdapters } from "glove-voice";

async function fetchToken(path: string): Promise<string> {
  const res = await fetch(path);
  const data = (await res.json()) as { token?: string; error?: string };
  if (!res.ok || !data.token) {
    throw new Error(data.error ?? `Token fetch failed (${res.status})`);
  }
  return data.token;
}

export const { stt, createTTS } = createElevenLabsAdapters({
  getSTTToken: () => fetchToken("/api/voice/stt-token"),
  getTTSToken: () => fetchToken("/api/voice/tts-token"),
  voiceId: "XB0fDUnXU5powFXDhCwa", // "Charlotte" — warm, cinematic
});

export async function createSileroVAD() {
  const { SileroVADAdapter } = await import("glove-voice/silero-vad");
  const vad = new SileroVADAdapter({
    positiveSpeechThreshold: 0.5,
    negativeSpeechThreshold: 0.35,
    wasm: { type: "cdn" },
  });
  await vad.init();
  return vad;
}

Silero VAD is a small neural network that runs in the browser using WebAssembly. It listens to the microphone and detects when you start and stop speaking. This is what enables the hands-free “vad” turn mode — you speak, it detects silence, and your speech is automatically sent for transcription. The positiveSpeechThreshold and negativeSpeechThreshold control how sensitive the detection is.
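The two thresholds form a hysteresis band: speech is declared once the model's per-frame probability rises above 0.5, and ends only when it falls below 0.35, which prevents flicker on borderline frames. The sketch below is an illustration of that state machine, not the adapter's actual internals:

```typescript
// Illustrative only: the hysteresis the two thresholds imply. Per-frame
// speech probabilities are turned into a stable speaking/silent flag.
export function speakingStateMachine(
  probs: number[],
  positiveThreshold = 0.5,
  negativeThreshold = 0.35,
): boolean[] {
  let speaking = false;
  return probs.map((p) => {
    if (!speaking && p >= positiveThreshold) speaking = true; // speech starts
    else if (speaking && p <= negativeThreshold) speaking = false; // speech ends
    // Probabilities between the thresholds keep the current state.
    return speaking;
  });
}
```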

Silero VAD is imported dynamically with await import("glove-voice/silero-vad") because it loads an ONNX model file. Dynamic import keeps it out of the initial bundle and allows it to load the WebAssembly runtime on demand.

7. The voice orb

The voice orb is the primary interaction element. It is an 80px sharp-cornered amber square that communicates state through layered ring animations. Think of it as a visual heartbeat for the voice pipeline.

The orb has six states: four map directly to the VoiceMode values from useGloveVoice, plus two additional UI states for manual recording and processing:

| State | Visual | Meaning |
| --- | --- | --- |
| idle | Static amber square with mic icon | Voice session not started; tap to begin |
| listening | Gentle breathing pulse on the outer ring | Microphone is active, waiting for speech |
| recording | Warm orange pulse on the core | Manual mode: actively capturing your voice |
| processing | Subdued spin on the middle ring | Finalizing transcription before sending to LLM |
| thinking | Counter-rotating dashed rings | LLM is generating a response (tool calls, text) |
| speaking | Concentric ripples expanding outward | Lola is speaking; tap to interrupt |

The orb click handler adapts to the current state:

app/components/voice-orb.tsx (click handler)
const handleClick = () => {
  if (mode === "speaking") {
    onInterrupt();         // Tap while speaking → interrupt
  } else if (isProcessing) {
    onStop();              // Tap while processing → cancel
  } else if (isManual && mode === "listening") {
    if (isManualRecording) {
      onManualRecordStop();  // Tap while recording → send
    } else {
      onManualRecordStart(); // Tap while idle → start recording
    }
  } else {
    onStop();              // Tap otherwise → end voice session
  }
};

The orb also shows a status label beneath it. In listening mode it says “Listening”; while recording it shows the live transcript; while thinking it says “Thinking”; while speaking it says “Speaking.” In manual mode with no recording active, it shows “Hold space or tap to speak.”
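The label logic described above reduces to a small pure function. This is a sketch with illustrative names, not the component's actual code:

```typescript
// Sketch of the status-label logic (names are illustrative).
type OrbMode = "listening" | "thinking" | "speaking";

export function orbStatusLabel(opts: {
  mode: OrbMode;
  isManual: boolean;
  isManualRecording: boolean;
  liveTranscript: string;
}): string {
  const { mode, isManual, isManualRecording, liveTranscript } = opts;
  // While recording, show the live transcript as it arrives.
  if (isManualRecording) return liveTranscript || "Recording";
  if (mode === "thinking") return "Thinking";
  if (mode === "speaking") return "Speaking";
  // listening: hands-free mode shows "Listening"; manual mode prompts for input.
  return isManual ? "Hold space or tap to speak" : "Listening";
}
```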

8. Dynamic system prompt for voice

Lola uses two system prompts. The base prompt defines her personality and tool usage guidelines. The voice prompt extends the base with narration instructions.

app/lib/system-prompt.ts
export const systemPrompt = `You are Lola, a passionate and knowledgeable movie companion.

## Your Personality
- Genuinely passionate about cinema across all genres and eras
- Warm but opinionated — you have taste but respect others' preferences
- Concise — 1-2 sentences between tool calls. Let the visual cards do the talking.
- You describe movies by feel, not data — "gorgeous, melancholic road trip" not "received 7.8 on IMDb"

## Tool Usage Guidelines
- ALWAYS use visual tools — never list movies as plain text
- Use search_movies for any movie search
- Use get_movie_details when discussing a specific film in depth
- Use get_trailer proactively when it would enhance the conversation
- Keep text responses SHORT — let the visual cards speak`;

export const voiceSystemPrompt = `${systemPrompt}

## Voice Mode — IMPORTANT
The user is interacting via voice. All tools display visual cards on screen.
You MUST ALSO describe things verbally since the user may not be looking at the screen.

### After Each Tool
- search_movies: Briefly narrate the top 2-3 results — title, year, one line each
- get_movie_details: Highlight the director, lead actors, and a sentence about the plot
- get_ratings: Speak the score and what it means ("solid 8.1 — critics loved it")
- get_trailer: Let them know the trailer is playing on screen
- compare_movies: Summarize the key differences verbally
- get_recommendations: Read out the top 2-3 picks with brief reasons
- get_person: Mention their most notable roles
- get_streaming: Tell them where it's available
- remember_preference: Just acknowledge verbally ("Got it, noted.")

### Speaking Style
- Conversational — like a friend who loves movies, chatting on the couch
- Describe movies with feeling — "It's this gorgeous, melancholic road trip"
- Keep it concise for voice — shorter than text responses
- Ask one thing at a time — don't overwhelm
- Never read metadata robotically — translate data into human sentences`;

The swap happens at runtime. When the voice session starts, the orchestrator calls runnable.setSystemPrompt(voiceSystemPrompt). When voice stops, it reverts to the base prompt. This means the LLM's behavior changes dynamically — in voice mode it narrates tool results, in text mode it keeps responses short and lets the cards speak.

app/components/lola.tsx (prompt swap)
useEffect(() => {
  if (!runnable) return;
  if (voice.isActive) {
    runnable.setSystemPrompt(voiceSystemPrompt);
  } else {
    runnable.setSystemPrompt(systemPrompt);
  }
}, [voice.isActive, runnable]);

9. Wiring it all together

The Lola component is the orchestrator. It initializes useGlove with the tools, sets up the voice pipeline with useGloveVoice, manages VAD initialization, handles the thinking sound loop, and renders the three-part layout.

app/lib/client.ts
import { GloveClient, createRemoteStore } from "glove-react";
import { systemPrompt } from "./system-prompt";
import { storeActions } from "./store-actions";

export const gloveClient = new GloveClient({
  endpoint: "/api/chat",
  systemPrompt,
  createStore: (sessionId) => createRemoteStore(sessionId, storeActions),
});

app/components/lola.tsx (orchestrator)
"use client";

import { useState, useRef, useMemo, useCallback, useEffect } from "react";
import { useGlove } from "glove-react";
import { useGloveVoice } from "glove-react/voice";
import type { TurnMode } from "glove-react/voice";
import { createLolaTools } from "../lib/tools";
import { stt, createTTS, createSileroVAD } from "../lib/voice";
import { systemPrompt, voiceSystemPrompt } from "../lib/system-prompt";
import { VisualArea } from "./visual-area";
import { TranscriptStrip } from "./transcript-strip";
import { VoiceOrb } from "./voice-orb";
import { TextInput } from "./text-input";

interface LolaProps {
  sessionId: string;
  onFirstMessage?: (sessionId: string, text: string) => void;
}

export function Lola({ sessionId, onFirstMessage }: LolaProps) {
  const [turnMode, setTurnMode] = useState<TurnMode>("vad");
  const [isManualRecording, setIsManualRecording] = useState(false);
  const [isProcessing, setIsProcessing] = useState(false);
  const [showTextInput, setShowTextInput] = useState(false);
  const [input, setInput] = useState("");
  const [vadReady, setVadReady] = useState(false);
  const vadRef = useRef<Awaited<ReturnType<typeof createSileroVAD>> | null>(null);

  const MIN_RECORDING_MS = 350;

  // Tools — created once, stable reference
  const tools = useMemo(() => createLolaTools(), []);

  // Glove hook — conversation engine
  const glove = useGlove({ tools, sessionId });
  const {
    runnable, timeline, streamingText, busy,
    slots, sendMessage, renderSlot, renderToolResult,
  } = glove;

  // Silero VAD — async initialization
  useEffect(() => {
    createSileroVAD().then((v) => {
      vadRef.current = v;
      setVadReady(true);
    });
  }, []);

  // Voice pipeline
  const voiceConfig = useMemo(
    () => ({
      stt,
      createTTS,
      vad: vadReady ? vadRef.current ?? undefined : undefined,
      turnMode,
    }),
    [vadReady, turnMode],
  );
  const voice = useGloveVoice({ runnable, voice: voiceConfig });

  // Dynamic system prompt swap
  useEffect(() => {
    if (!runnable) return;
    if (voice.isActive) {
      runnable.setSystemPrompt(voiceSystemPrompt);
    } else {
      runnable.setSystemPrompt(systemPrompt);
    }
  }, [voice.isActive, runnable]);

  // Thinking sound loop
  useEffect(() => {
    if (voice.mode !== "thinking") return;
    const audio = new Audio("/lola-thinking.mp3");
    audio.loop = true;
    audio.play().catch(() => {});
    return () => {
      audio.pause();
      audio.src = "";
    };
  }, [voice.mode]);

  // Last agent text from timeline for transcript strip
  const lastAgentText = useMemo(() => {
    for (let i = timeline.length - 1; i >= 0; i--) {
      const entry = timeline[i];
      if (entry.kind === "agent_text") return entry.text;
    }
    return "";
  }, [timeline]);

  return (
    <div className="lola-screen">
      <VisualArea
        slots={slots}
        timeline={timeline}
        renderSlot={renderSlot}
        renderToolResult={renderToolResult}
        busy={busy}
        onSuggestion={(text) => sendMessage(text)}
      />

      <TranscriptStrip
        text={streamingText || lastAgentText}
        isStreaming={!!streamingText}
      />

      <div className="orb-area">
        {voice.isActive ? (
          <VoiceOrb
            mode={voice.mode}
            transcript={voice.transcript}
            turnMode={turnMode}
            isManualRecording={isManualRecording}
            isProcessing={isProcessing}
            onStop={() => voice.stop()}
            onInterrupt={voice.interrupt}
            onManualRecordStart={() => { /* manual recording logic */ }}
            onManualRecordStop={() => { /* commit recording logic */ }}
          />
        ) : (
          <button
            className="voice-orb voice-orb--idle"
            onClick={() => voice.start()}
          >
            Start Voice
          </button>
        )}

        <TextInput
          visible={showTextInput}
          onToggle={() => setShowTextInput(!showTextInput)}
          input={input}
          setInput={setInput}
          busy={busy}
          onSubmit={(e) => {
            e.preventDefault();
            const text = input.trim();
            if (!text || busy) return;
            setInput("");
            sendMessage(text);
          }}
        />
      </div>
    </div>
  );
}

The render structure is flat: VisualArea fills the center, TranscriptStrip sits near the bottom, and the orb-area holds the voice orb plus the optional text input. There is no chat column, no message list, no scroll. The visual area is the only place where tool output appears, and it shows only the most recent content.
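Since the visual area shows only the most recent tool result, its selection logic reduces to a backward scan of the timeline. Here is a minimal sketch of that idea; the `TimelineEntry` shape and `latestToolResult` helper are assumptions for illustration, not the actual glove types:

```typescript
// Hypothetical timeline entry shape — the real glove types may differ.
type TimelineEntry =
  | { kind: "agent_text"; text: string }
  | { kind: "tool_result"; tool: string; result: unknown };

// Walk backward and return the most recent tool result; everything
// older is simply not rendered.
function latestToolResult(
  timeline: TimelineEntry[],
): Extract<TimelineEntry, { kind: "tool_result" }> | undefined {
  for (let i = timeline.length - 1; i >= 0; i--) {
    const entry = timeline[i];
    if (entry.kind === "tool_result") return entry;
  }
  return undefined;
}
```

This mirrors the `lastAgentText` scan in the component above: both pick a single latest entry rather than accumulating a scrollable history.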

10. The transcript strip

The transcript strip shows Lola's most recent spoken or streamed text near the bottom of the screen. It is styled in serif font to match the cinematic aesthetic. While the LLM is streaming, a gentle pulse keeps the text alive. After four seconds of silence, the text fades to near-invisible so it does not compete with the visual area.

app/components/transcript-strip.tsx
import { useEffect, useRef, useState } from "react";

interface TranscriptStripProps {
  text: string;
  isStreaming: boolean;
}

const FADE_DELAY_MS = 4000;

export function TranscriptStrip({ text, isStreaming }: TranscriptStripProps) {
  const [isFading, setIsFading] = useState(false);
  const timerRef = useRef<ReturnType<typeof setTimeout> | null>(null);
  const prevTextRef = useRef(text);

  useEffect(() => {
    // Reset fade when text changes or streaming starts
    if (text !== prevTextRef.current || isStreaming) {
      prevTextRef.current = text;
      setIsFading(false);
      if (timerRef.current) {
        clearTimeout(timerRef.current);
        timerRef.current = null;
      }
    }

    // Start fade timer when not streaming and text exists
    if (!isStreaming && text) {
      timerRef.current = setTimeout(() => {
        setIsFading(true);
        timerRef.current = null;
      }, FADE_DELAY_MS);
    }

    return () => {
      if (timerRef.current) {
        clearTimeout(timerRef.current);
        timerRef.current = null;
      }
    };
  }, [text, isStreaming]);

  if (!text) return null;

  // Show the last ~180 characters, trimmed to a word boundary
  const displayText =
    text.length > 180
      ? "\u2026" + text.slice(text.length - 180).replace(/^\S*\s/, "")
      : text;

  return (
    <div className="transcript-strip" role="status" aria-live="polite">
      <p
        className={`transcript-strip__text ${
          isStreaming ? "transcript-strip__text--streaming" : ""
        } ${isFading ? "transcript-strip__text--fading" : ""}`}
      >
        {displayText}
      </p>
    </div>
  );
}

The 180-character trim ensures the strip never wraps excessively. For long narrations, it shows the tail end with a leading ellipsis. The role="status" and aria-live="polite" attributes ensure screen readers announce new text without interrupting the user.
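The trimming expression can be factored into a pure helper, which makes the word-boundary behavior easier to test. `tailTrim` is an illustrative name introduced here, not part of the actual component:

```typescript
const MAX_CHARS = 180;

// Illustrative helper mirroring the component's inline expression:
// keep the last `max` characters, drop the leading partial word,
// and prefix an ellipsis.
function tailTrim(text: string, max = MAX_CHARS): string {
  if (text.length <= max) return text;
  return "\u2026" + text.slice(text.length - max).replace(/^\S*\s/, "");
}
```

Short strings pass through unchanged; long strings come back ellipsis-prefixed and at most one character longer than the limit.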

11. The thinking sound

When the LLM is processing (calling tools, generating text), Lola plays a subtle ambient sound loop. This gives the user audio feedback that something is happening, even when the screen has not changed yet.

app/components/lola.tsx (thinking sound)
useEffect(() => {
  if (voice.mode !== "thinking") return;

  const audio = new Audio("/lola-thinking.mp3");
  audio.loop = true;
  audio.play().catch(() => {});

  return () => {
    audio.pause();
    audio.src = "";
  };
}, [voice.mode]);

The useEffect cleanup function stops the sound as soon as the voice mode moves away from "thinking". Setting audio.src to an empty string releases the audio resource, and the .catch(() => {}) handles browsers that block autoplay — the sound is a nice-to-have, not critical.

12. Turn modes: auto vs. push-to-talk

Lola supports two turn modes for voice input: auto, where Silero VAD detects when you stop speaking and commits the turn automatically, and push-to-talk (manual), where you hold the orb to record and release to commit.

A toggle between these modes sits below the voice orb. It is only enabled when the voice pipeline is in the listening state — you cannot switch modes while the LLM is thinking or speaking.
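The enablement rule is a simple predicate on the pipeline state. A minimal sketch, assuming a mode union like the one implied by the code above ("thinking" appears in the component; the other values are assumptions):

```typescript
// Hypothetical mode union — the actual glove-voice values may differ.
type VoiceMode = "idle" | "listening" | "thinking" | "speaking";

// The turn-mode toggle is enabled only while the pipeline is listening.
const canToggleTurnMode = (mode: VoiceMode): boolean => mode === "listening";
```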

In manual mode, a 350ms minimum recording duration prevents false positive triggers. If you tap the orb briefly (under 350ms), the commit is delayed until the minimum threshold is reached. This avoids sending empty or garbled audio to the STT service.

app/components/lola.tsx (min-duration commit)
const MIN_RECORDING_MS = 350;

const commitRecording = useCallback(() => {
  if (!recordingRef.current) return;
  recordingRef.current = false;
  setIsManualRecording(false);

  const elapsed = Date.now() - recordingStartRef.current;

  if (elapsed >= MIN_RECORDING_MS) {
    setIsProcessing(true);
    commitTurnRef.current();
  } else {
    // Delay commit until minimum recording duration
    setIsProcessing(true);
    const remaining = MIN_RECORDING_MS - elapsed;
    pendingCommitRef.current = setTimeout(() => {
      pendingCommitRef.current = null;
      commitTurnRef.current();
    }, remaining);
  }
}, []);

Display patterns summary

| Tool | Display Method | Why |
| --- | --- | --- |
| search_movies | pushAndForget | Poster grid appears instantly; LLM narrates results in parallel |
| get_movie_details | pushAndForget | Info card appears; LLM describes the film verbally |
| get_ratings | pushAndForget | Rating card appears; LLM speaks the score |
| get_trailer | pushAndForget | YouTube embed appears; LLM says “trailer is playing” |
| compare_movies | pushAndForget | Side-by-side cards appear; LLM summarizes differences |
| get_recommendations | pushAndForget | Numbered list appears; LLM reads the top picks |
| get_person | pushAndForget | Profile card appears; LLM mentions notable roles |
| get_streaming | pushAndForget | Provider badges appear; LLM says where to watch |
| remember_preference | None | Pure data — LLM acknowledges verbally, no UI |

Every single tool uses pushAndForget. There is no pushAndWait anywhere in the Lola codebase. This is the defining characteristic of a voice-first app. The moment you add a blocking tool, you break the voice flow.
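To make the pattern concrete, here is a sketch of what a pushAndForget-style tool handler might look like. The `Display` interface and `PosterGrid` card are assumptions based on the patterns described above; the real glove tool API may differ:

```typescript
// Assumed display surface: pushAndForget fires a card and returns
// immediately — it never awaits user interaction.
interface Display {
  pushAndForget(card: { component: string; props: unknown }): void;
}

interface Movie {
  id: number;
  title: string;
  year: string;
}

// Sketch of a search_movies do function: fire the visual card, then
// return text for the LLM to narrate while the card is on screen.
function searchMoviesDo(movies: Movie[], display: Display): string {
  display.pushAndForget({ component: "PosterGrid", props: { movies } });
  const titles = movies.map((m) => `${m.title} (${m.year})`).join(", ");
  return `Found ${movies.length} movies: ${titles}`;
}
```

The key property is that the function returns synchronously with narration text: the card and the voice response proceed in parallel, which is exactly what a blocking pushAndWait would break.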

13. Design palette

Lola uses a charcoal + amber palette inspired by film noir aesthetics. The background is void black (#0d0d0f), cards use charcoal surfaces, and amber provides warmth for accents, ratings, and the voice orb.

app/lib/theme.ts
export const VOID = "#0d0d0f";

export const CHARCOAL: Record<number, string> = {
  900: "#1a1a1f",
  800: "#222228",
  700: "#2a2a32",
  600: "#333340",
  500: "#3d3d48",
};

export const AMBER: Record<number, string> = {
  500: "#d4911e",
  400: "#f5a623",
  300: "#f7b84d",
  200: "#fcd88e",
  100: "#fde8b5",
  50: "#fef7e6",
};

export const CREAM = "#faf7f2";
export const CREAM_MUTED = "#a8a4a0";
export const CREAM_DIM = "#706c68";

Three typefaces reinforce the cinematic feel: Instrument Serif for movie titles and the transcript strip, DM Sans for body text and labels, and DM Mono for metadata like years, runtimes, and rating numbers.

14. Run it

terminal
# From the monorepo root
pnpm install

# Set environment variables in examples/lola/.env.local:
#   OPENROUTER_API_KEY=...
#   TMDB_API_KEY=...
#   ELEVENLABS_API_KEY=...

pnpm --filter glove-lola run dev

Try these conversations:

1. “Tell me about Inception” to see a poster grid, then a detail card, while Lola narrates
2. “Show me the trailer” to bring up the YouTube embed
3. “Where can I stream it?” to see provider badges

Notice that throughout the conversation, you never need to tap the screen. The visual cards are ambient — they appear and stay visible while Lola narrates. The voice orb communicates state through animation. The text input is hidden by default and only appears when you tap the keyboard icon.

Where each piece runs

| Piece | Where | Why |
| --- | --- | --- |
| createChatHandler | Server | LLM proxy — sends tool schemas, streams responses |
| Tool do functions | Browser | Fetch from TMDB proxy, fire visual cards, return text for narration |
| /api/tmdb/[...path] | Server | TMDB proxy — keeps API key server-side |
| /api/voice/stt-token | Server | Generates ElevenLabs STT tokens |
| /api/voice/tts-token | Server | Generates ElevenLabs TTS tokens |
| ElevenLabs STT/TTS | Browser (direct connection) | Browser uses tokens to stream audio directly to ElevenLabs |
| Silero VAD | Browser (WebAssembly) | Runs a small neural network locally for speech detection |
| Visual area, orb, transcript strip | Browser | UI components rendering tool output and voice state |
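The proxy row is worth a closer look: its job is to rebuild the TMDB URL server-side and attach the secret key so it never reaches the browser. A minimal sketch of that URL rewriting, assuming a helper name of my own (`buildTmdbUrl`); the actual route code is not shown in this section:

```typescript
// Hypothetical helper for the /api/tmdb/[...path] proxy route: forward
// the catch-all path and query string to TMDB, adding api_key on the
// server so the key never ships to the browser.
function buildTmdbUrl(
  pathSegments: string[],
  search: URLSearchParams,
  apiKey: string,
): string {
  const target = new URL(`https://api.themoviedb.org/3/${pathSegments.join("/")}`);
  search.forEach((value, key) => target.searchParams.set(key, value));
  target.searchParams.set("api_key", apiKey);
  return target.toString();
}
```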

Next steps