In this tutorial you will build Lola, a voice-first movie companion powered by TMDB. The user speaks, Lola responds with voice narration and visual cards — poster grids, movie info, trailers, comparisons, and streaming availability. There is a text input as a fallback, but voice is the primary interaction mode.
This is fundamentally different from adding voice to an existing chat app. In a voice-enabled app (like a coffee shop ordering assistant), you start with text and add voice as a secondary input. In a voice-first app, voice is the default. The screen is not a chat column — it is a visual area that shows ambient, glanceable cards while the AI narrates the content out loud. Every tool uses pushAndForget because nothing should ever block the voice conversation.
Prerequisites: You should have completed Getting Started, read The Display Stack, and reviewed Voice Integration.
A movie companion where the user taps a voice orb and says “Tell me about Inception” and the app will:

- search_movies — a poster grid appears on screen (pushAndForget)
- get_movie_details — a detailed info card replaces the poster grid (pushAndForget)

Nine tools, one TMDB proxy route. The user never types unless they choose to. The visual area shows the most recent tool result; the transcript strip shows Lola's latest spoken words in serif font near the bottom of the screen; the voice orb communicates the current state through animation.
Before diving into code, it is important to understand the distinction between voice-first and voice-enabled. Both use the same glove-voice package, but the design decisions are opposite.
| Aspect | Voice-Enabled (Coffee Shop) | Voice-First (Lola) |
|---|---|---|
| Primary input | Text — voice is secondary | Voice — text is a fallback |
| Screen layout | Chat column with messages | Visual area + transcript strip + voice orb |
| Tool blocking | Mix of pushAndWait and pushAndForget | All tools use pushAndForget |
| Tool results | AI reads structured data, may substitute text | AI narrates results verbally while card is visible |
| User interaction with cards | Click buttons, fill forms | Glance at visual information — no clicks needed |
| System prompt | Same prompt for text and voice | Different prompt for voice mode (narration instructions) |
The key insight: in a voice-first app, if any tool uses pushAndWait, it blocks the LLM response loop. The agent cannot speak until the user clicks something on screen. That defeats the purpose of voice. Every tool must use pushAndForget so the visual card fires and the agent immediately narrates the content.
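To make the blocking contrast concrete, here is a toy model. This is not the real glove-react display API; only the pushAndWait/pushAndForget names come from the text above, everything else is illustrative:

```typescript
// Toy model only: the real glove-react display API differs. This shows why a
// blocking tool stalls narration while a fire-and-forget tool does not.
async function blockingTool(userClick: Promise<void>): Promise<string> {
  // pushAndWait-style: the tool (and therefore the agent loop) cannot
  // return until the user interacts with the card on screen.
  await userClick;
  return "result, only after the user clicked";
}

function nonBlockingTool(shown: string[]): string {
  // pushAndForget-style: fire the card and return immediately so the
  // agent can start narrating while the card is on screen.
  shown.push("poster grid");
  return "result, available to narrate right away";
}
```

In the blocking version the agent's next sentence is hostage to a click; in the non-blocking version the card and the narration happen in parallel.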
Lola has three layers: a TMDB proxy that keeps the API key server-side, a voice pipeline (ElevenLabs + Silero VAD), and a screen layout built from three components.
- /api/tmdb/[...path] — a catch-all Next.js route that proxies requests to the TMDB API. The TMDB_API_KEY stays on the server. Client-side code calls /api/tmdb/search/movie?query=... and gets back raw TMDB JSON.
- /api/voice/stt-token and /api/voice/tts-token — token endpoints that generate short-lived ElevenLabs tokens. The browser uses these tokens to connect directly to ElevenLabs for speech-to-text and text-to-speech without exposing the API key.
- /api/chat — the standard Glove chat handler that proxies to the LLM.

Start from a Next.js project with Glove and voice packages installed:
pnpm add glove-core glove-react glove-next glove-voice zod

Lola also uses @ricky0123/vad-web and onnxruntime-web for Silero VAD (voice activity detection — the browser-side model that detects when you start and stop speaking):
pnpm add @ricky0123/vad-web onnxruntime-web

Create three environment variables:
OPENROUTER_API_KEY=your-openrouter-key
TMDB_API_KEY=your-tmdb-v3-bearer-token
ELEVENLABS_API_KEY=your-elevenlabs-key

The TMDB API key is a free bearer token from themoviedb.org. The ElevenLabs API key comes from elevenlabs.io (free tier works). The OpenRouter API key lets you use any model provider through a single endpoint.
The TMDB integration has two parts: a server-side proxy route that keeps your API key secret, and a client-side module with typed helper functions.
A single catch-all route forwards any path to the TMDB API with your bearer token attached:
import { NextResponse } from "next/server";
const TMDB_API_BASE = "https://api.themoviedb.org/3";
export async function GET(
req: Request,
{ params }: { params: Promise<{ path: string[] }> },
) {
const { path } = await params;
const url = new URL(req.url);
const apiKey = process.env.TMDB_API_KEY;
if (!apiKey) {
return NextResponse.json({ error: "TMDB_API_KEY not set" }, { status: 500 });
}
const tmdbUrl = `${TMDB_API_BASE}/${path.join("/")}?${url.searchParams.toString()}`;
const res = await fetch(tmdbUrl, {
headers: { Authorization: `Bearer ${apiKey}` },
});
const data = await res.json();
return NextResponse.json(data, { status: res.status });
}

When client-side code calls /api/tmdb/search/movie?query=Inception, this route rewrites it to https://api.themoviedb.org/3/search/movie?query=Inception with the bearer token. The API key never reaches the browser.
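The rewrite itself is simple string assembly. The helper below is a hypothetical pure extraction of what the route does, useful for seeing the mapping at a glance:

```typescript
// rewriteToTmdb is a hypothetical pure helper, not part of the route above.
// It maps the proxy's path segments and query string to the real TMDB URL.
const TMDB_API_BASE = "https://api.themoviedb.org/3";

function rewriteToTmdb(pathSegments: string[], queryString: string): string {
  const qs = queryString ? `?${queryString}` : "";
  return `${TMDB_API_BASE}/${pathSegments.join("/")}${qs}`;
}
```

For example, `["search", "movie"]` with `query=Inception` yields `https://api.themoviedb.org/3/search/movie?query=Inception`.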
A tmdb.ts module wraps the proxy with typed functions and image URL builders. Here are the key parts:
const API_BASE = "/api/tmdb";
const TMDB_IMAGE_BASE = "https://image.tmdb.org/t/p";
export interface TMDBMovie {
id: number;
title: string;
overview: string;
release_date: string;
vote_average: number;
vote_count: number;
poster_path: string | null;
backdrop_path: string | null;
genres?: { id: number; name: string }[];
runtime?: number;
tagline?: string;
credits?: {
cast: TMDBCastMember[];
crew: TMDBCrewMember[];
};
videos?: { results: TMDBVideo[] };
"watch/providers"?: {
results: Record<string, TMDBProviderData>;
};
}
// Image URL helpers
export function posterUrl(
path: string | null,
size: "w92" | "w154" | "w185" | "w342" | "w500" | "w780" | "original" = "w342",
): string | null {
if (!path) return null;
return `${TMDB_IMAGE_BASE}/${size}${path}`;
}
// Internal fetch helper — all calls go through the proxy
async function tmdbFetch<T>(path: string, params?: Record<string, string>): Promise<T> {
const url = new URL(`${API_BASE}/${path}`, window.location.origin);
if (params) {
for (const [key, value] of Object.entries(params)) {
url.searchParams.set(key, value);
}
}
const res = await fetch(url.toString());
if (!res.ok) {
const errorBody = await res.text().catch(() => "Unknown error");
throw new Error(`TMDB API error (${res.status}): ${errorBody}`);
}
return res.json() as Promise<T>;
}
// API functions
export async function searchMovies(query: string, year?: number): Promise<TMDBMovie[]> {
const params: Record<string, string> = { query };
if (year) params.year = String(year);
const data = await tmdbFetch<{ results: TMDBMovie[] }>("search/movie", params);
return data.results;
}
export async function getMovieDetails(movieId: number): Promise<TMDBMovie> {
return tmdbFetch<TMDBMovie>(`movie/${movieId}`, {
append_to_response: "credits,videos,watch/providers",
});
}
// Utility helpers
export function movieYear(movie: TMDBMovie): string {
if (!movie.release_date) return "Unknown";
return movie.release_date.substring(0, 4);
}
export function getDirector(movie: TMDBMovie): string {
if (!movie.credits?.crew) return "Unknown";
const director = movie.credits.crew.find((c) => c.job === "Director");
return director?.name ?? "Unknown";
}
export function getTopCast(movie: TMDBMovie, count: number = 5): TMDBCastMember[] {
if (!movie.credits?.cast) return [];
return movie.credits.cast.sort((a, b) => a.order - b.order).slice(0, count);
}

Every TMDB call goes through tmdbFetch, which constructs a URL pointing at /api/tmdb/... and parses the JSON response. The typed return values mean your tools' do functions get full autocomplete for movie fields, cast members, and provider data.
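The utility helpers are easy to sanity-check in isolation. Here they are restated with a minimal local type so the example is self-contained (the sample values are made up):

```typescript
// movieYear and getDirector, restated from tmdb.ts with a trimmed-down
// local type so this snippet runs standalone.
interface MiniMovie {
  release_date: string;
  credits?: { crew: { job: string; name: string }[] };
}

function movieYear(movie: MiniMovie): string {
  if (!movie.release_date) return "Unknown";
  return movie.release_date.substring(0, 4);
}

function getDirector(movie: MiniMovie): string {
  if (!movie.credits?.crew) return "Unknown";
  const director = movie.credits.crew.find((c) => c.job === "Director");
  return director?.name ?? "Unknown";
}

// Hand-built sample in the TMDBMovie shape.
const sample: MiniMovie = {
  release_date: "2010-07-16",
  credits: { crew: [{ job: "Director", name: "Christopher Nolan" }] },
};
```

Here `movieYear(sample)` returns `"2010"` and `getDirector(sample)` returns `"Christopher Nolan"`; a movie with no credits falls back to `"Unknown"`.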
Voice-first tools follow a specific pattern. Every tool:

- fires display.pushAndForget() with the visual card data
- returns descriptive text the LLM can narrate
- includes renderData for persisting the visual card

The text return is critical. In a text-only app, the LLM reads the tool result and decides what to say. In a voice-first app, the LLM reads the tool result and speaks it. The tool must return enough information for the LLM to give a natural verbal summary, not just “Done” or a raw JSON blob.
The search tool is the most common entry point. The user says “Find me sci-fi movies from the 90s” and the tool shows a poster grid while returning a numbered text list for narration.
import { defineTool } from "glove-react";
import { z } from "zod";
import { searchMovies, posterUrl, movieYear, type TMDBMovie } from "../tmdb";
export function createSearchMoviesTool() {
return defineTool({
name: "search_movies",
description:
"Search for movies by title. Returns a visual grid of poster cards " +
"and text results for narration.",
inputSchema: z.object({
query: z.string().describe("Search query for movies"),
year: z.number().optional().describe("Filter by release year"),
max_results: z.number().optional().default(4).describe("Max results (1-6)"),
}),
displayPropsSchema: z.object({
movies: z.array(z.any()),
}),
async do(input, display) {
const clampedMax = Math.max(1, Math.min(6, input.max_results ?? 4));
const results = await searchMovies(input.query, input.year);
const movies = results.slice(0, clampedMax);
if (movies.length === 0) {
return {
status: "success" as const,
data: `No movies found matching "${input.query}".`,
renderData: { movies: [] },
};
}
// Fire the visual card — does NOT block the LLM
await display.pushAndForget({ movies });
// Return descriptive text for voice narration
const summaryLines = movies.map(
(m, i) =>
`${i + 1}. ${m.title} (${movieYear(m)}) — Rating: ${m.vote_average.toFixed(1)}/10`,
);
return {
status: "success" as const,
data: `Found ${movies.length} result(s) for "${input.query}":\n${summaryLines.join("\n")}`,
renderData: { movies },
};
},
render({ props }) {
const movies = props.movies as TMDBMovie[];
return (
<div style={{ display: "flex", gap: 12, flexWrap: "wrap", justifyContent: "center" }}>
{movies.map((movie) => (
<PosterCard key={movie.id} movie={movie} />
))}
</div>
);
},
renderResult({ data }) {
const result = data as { movies: TMDBMovie[] };
return (
<div style={{ display: "flex", gap: 12, flexWrap: "wrap", justifyContent: "center" }}>
{result.movies.map((movie) => (
<PosterCard key={movie.id} movie={movie} />
))}
</div>
);
},
});
}

The data string returned to the LLM includes titles, years, and ratings in a numbered list. In voice mode, the LLM reads this and says something like: “Here are four results. First up is Inception from 2010, a solid 8.4. Then we have Interstellar, also from Nolan...” Meanwhile the poster grid is already visible on screen.
When the user asks about a specific film, this tool fetches full details including credits, videos, and streaming providers in a single TMDB call (using append_to_response). The visual card shows a backdrop image, genre tags, cast list, and director. The text return gives the LLM enough to narrate a compelling summary.
import { defineTool } from "glove-react";
import { z } from "zod";
import {
getMovieDetails,
backdropUrl,
movieYear,
formatRuntime,
getDirector,
getTopCast,
genreNames,
type TMDBMovie,
} from "../tmdb";
export function createGetMovieDetailsTool() {
return defineTool({
name: "get_movie_details",
description:
"Get comprehensive details about a movie including overview, " +
"cast, director, runtime, rating, genres, and streaming availability.",
inputSchema: z.object({
movie_id: z.number().describe("TMDB movie ID"),
}),
displayPropsSchema: z.object({
movie: z.any(),
}),
async do(input, display) {
const movie = await getMovieDetails(input.movie_id);
await display.pushAndForget({ movie });
const year = movieYear(movie);
const director = getDirector(movie);
const cast = getTopCast(movie, 5);
const castNames = cast.map((c) => c.name).join(", ");
const overviewSnippet =
movie.overview.length > 200
? movie.overview.substring(0, 200) + "..."
: movie.overview;
return {
status: "success" as const,
data: `${movie.title} (${year}), directed by ${director}. ${overviewSnippet} Stars: ${castNames}. Rating: ${movie.vote_average.toFixed(1)}/10.`,
renderData: { movie },
};
},
render({ props }) {
const movie = props.movie as TMDBMovie;
return <MovieInfoCard movie={movie} />;
},
renderResult({ data }) {
const result = data as { movie: TMDBMovie };
return <MovieInfoCard movie={result.movie} />;
},
});
}

Notice the data string: “Inception (2010), directed by Christopher Nolan. A thief who steals corporate secrets through dream-sharing technology... Stars: Leonardo DiCaprio, Joseph Gordon-Levitt, Elliot Page... Rating: 8.4/10.” The LLM uses this to speak naturally. It will not read it verbatim — the system prompt tells it to describe movies with feeling.
Not every tool needs a visual component. The remember_preference tool silently stores user taste preferences. It has no render function and no pushAndForget call. The LLM acknowledges the preference verbally (“Got it, you love Villeneuve”) without showing anything on screen.
import { defineTool } from "glove-react";
import { z } from "zod";
export function createRememberPreferenceTool() {
return defineTool({
name: "remember_preference",
description:
"Remember a user preference about movies — favorite genres, " +
"directors, actors, moods, or anything else. Data-only tool " +
"with no visual display.",
inputSchema: z.object({
preference: z.string().describe("User preference to remember"),
category: z
.string()
.optional()
.describe("Category: genre, director, actor, mood, other"),
}),
displayPropsSchema: z.object({}),
async do(input) {
const category = input.category ?? "other";
return {
status: "success" as const,
data: `Noted preference (${category}): ${input.preference}`,
};
},
});
}

Lola has nine tools, all using pushAndForget (except remember_preference, which has no UI at all). Each tool returns descriptive text for narration alongside visual card data.
| Tool | Visual Card | Narration Text |
|---|---|---|
| search_movies | Poster grid with rating badges | Numbered list of titles, years, and ratings |
| get_movie_details | Full info card: backdrop, genres, cast, director, overview | Title, year, director, cast names, overview snippet, rating |
| get_ratings | Score display with rating bar and vote count | Title, score out of 10, vote count |
| get_trailer | YouTube embed (16:9 aspect ratio) | “Trailer for [Title] is now playing on screen” |
| compare_movies | Side-by-side cards (2–4 films) with posters and genres | Per-movie summary: title, year, rating, runtime, genres |
| get_recommendations | Numbered list with poster thumbnails and overview snippets | Numbered list of titles with brief descriptions |
| get_person | Profile card with photo, bio, and notable films | Name, department, notable film titles |
| get_streaming | Provider badges grouped by type (stream, rent, buy) | “Stream on Netflix. Rent on Apple TV.” |
| remember_preference | None | LLM acknowledges verbally |
All tools are assembled in a single factory function:
import type { ToolConfig } from "glove-react";
import { createSearchMoviesTool } from "./search-movies";
import { createGetMovieDetailsTool } from "./get-movie-details";
import { createGetRatingsTool } from "./get-ratings";
import { createGetTrailerTool } from "./get-trailer";
import { createCompareMoviesTool } from "./compare-movies";
import { createGetRecommendationsTool } from "./get-recommendations";
import { createGetPersonTool } from "./get-person";
import { createGetStreamingTool } from "./get-streaming";
import { createRememberPreferenceTool } from "./remember-preference";
export function createLolaTools(): ToolConfig[] {
return [
createSearchMoviesTool(),
createGetMovieDetailsTool(),
createGetRatingsTool(),
createGetTrailerTool(),
createCompareMoviesTool(),
createGetRecommendationsTool(),
createGetPersonTool(),
createGetStreamingTool(),
createRememberPreferenceTool(),
];
}

In a voice-first app, there is no scrolling chat column. The screen has a single visual area in the center that shows the most relevant content. The visual area has three states:

- Active slots — when a tool fires pushAndForget, the visual area renders that tool's card. If multiple tools fire in sequence, the latest one wins.
- Recent result — when no slot is active but a completed tool call has renderData, the visual area shows the most recent completed result. This means the movie info card stays visible even after the LLM finishes speaking.
- Empty — with no slots and no results, the visual area shows a welcome screen with suggestion chips.

import { useMemo, type ReactNode } from "react";
import type { TimelineEntry, EnhancedSlot } from "glove-react";
interface VisualAreaProps {
slots: EnhancedSlot[];
timeline: TimelineEntry[];
renderSlot: (slot: EnhancedSlot) => ReactNode;
renderToolResult: (entry: TimelineEntry & { kind: "tool" }) => ReactNode;
busy: boolean;
onSuggestion?: (text: string) => void;
}
const SUGGESTIONS = [
"Best sci-fi from the 90s",
"Something like Eternal Sunshine",
"Who directed Parasite?",
"Cozy rainy day movies",
];
export function VisualArea({
slots,
timeline,
renderSlot,
renderToolResult,
busy,
onSuggestion,
}: VisualAreaProps) {
const lastToolResult = useMemo(() => {
for (let i = timeline.length - 1; i >= 0; i--) {
const entry = timeline[i];
if (
entry.kind === "tool" &&
entry.status === "success" &&
entry.renderData !== undefined
) {
return entry;
}
}
return null;
}, [timeline]);
// Case 1: Active slots — render each via renderSlot
if (slots.length > 0) {
return (
<div className="visual-area">
{slots.map((slot) => (
<div key={slot.id} className="display-card">
{renderSlot(slot)}
</div>
))}
</div>
);
}
// Case 2: Recent tool result with renderData
if (lastToolResult) {
const rendered = renderToolResult(lastToolResult);
if (rendered) {
return (
<div className="visual-area">
<div className="display-card">{rendered}</div>
</div>
);
}
}
// Case 3: Empty state — suggestion chips
if (!busy) {
return (
<div className="visual-area">
<div className="lola-empty">
<h1 className="lola-empty__title">Lola</h1>
<p className="lola-empty__subtitle">
Your voice-first movie companion.<br />
Ask me anything about film.
</p>
{onSuggestion && (
<div className="lola-empty__suggestions">
{SUGGESTIONS.map((s) => (
<button
key={s}
type="button"
className="lola-empty__chip"
onClick={() => onSuggestion(s)}
>
{s}
</button>
))}
</div>
)}
</div>
</div>
);
}
return <div className="visual-area" />;
}

The visual area is not a Glove concept — it is a UI pattern you build yourself using the slots, timeline, renderSlot, and renderToolResult values from useGlove. In a chat-based app you interleave slots into a message list. In a voice-first app you replace the entire center of the screen.
The voice pipeline has three parts: speech-to-text (STT), text-to-speech (TTS), and voice activity detection (VAD). All three are configured in a single file.
ElevenLabs requires short-lived tokens for browser-side connections. Glove provides a helper that generates these tokens from your API key:
import { createVoiceTokenHandler } from "glove-next";
export const GET = createVoiceTokenHandler({ provider: "elevenlabs", type: "stt" });

import { createVoiceTokenHandler } from "glove-next";
export const GET = createVoiceTokenHandler({ provider: "elevenlabs", type: "tts" });

The adapters connect the token routes to ElevenLabs and configure the voice. Lola uses the “Charlotte” voice — a warm, cinematic tone that fits the film companion persona:
import { createElevenLabsAdapters } from "glove-voice";
async function fetchToken(path: string): Promise<string> {
const res = await fetch(path);
const data = (await res.json()) as { token?: string; error?: string };
if (!res.ok || !data.token) {
throw new Error(data.error ?? `Token fetch failed (${res.status})`);
}
return data.token;
}
export const { stt, createTTS } = createElevenLabsAdapters({
getSTTToken: () => fetchToken("/api/voice/stt-token"),
getTTSToken: () => fetchToken("/api/voice/tts-token"),
voiceId: "XB0fDUnXU5powFXDhCwa", // "Charlotte" — warm, cinematic
});
export async function createSileroVAD() {
const { SileroVADAdapter } = await import("glove-voice/silero-vad");
const vad = new SileroVADAdapter({
positiveSpeechThreshold: 0.5,
negativeSpeechThreshold: 0.35,
wasm: { type: "cdn" },
});
await vad.init();
return vad;
}

Silero VAD is a small neural network that runs in the browser using WebAssembly. It listens to the microphone and detects when you start and stop speaking. This is what enables the hands-free “auto” turn mode — you speak, it detects silence, and it automatically sends your speech for transcription. The positiveSpeechThreshold and negativeSpeechThreshold control how sensitive the detection is.
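The reason there are two thresholds is hysteresis. The sketch below is a toy per-frame state machine, an assumption about how the thresholds interact; the real adapter works on audio frames and is more involved:

```typescript
// Toy hysteresis over per-frame speech probabilities. Simplified model of
// how the two thresholds interact, not the real Silero adapter.
function detectSpeech(
  probs: number[],
  positive = 0.5,
  negative = 0.35,
): boolean[] {
  let speaking = false;
  return probs.map((p) => {
    if (!speaking && p >= positive) speaking = true; // must clear the higher bar to start
    else if (speaking && p < negative) speaking = false; // must fall below the lower bar to stop
    return speaking;
  });
}
```

The gap between the two thresholds prevents flicker: a probability hovering around 0.4 neither starts nor stops a speech segment.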
Silero VAD is imported dynamically with await import("glove-voice/silero-vad") because it loads an ONNX model file. Dynamic import keeps it out of the initial bundle and allows it to load the WebAssembly runtime on demand.
The voice orb is the primary interaction element. It is an 80px sharp-cornered amber square that communicates state through layered ring animations. Think of it as a visual heartbeat for the voice pipeline.
The orb has six states, each mapped to the VoiceMode from useGloveVoice plus two additional UI states for manual recording and processing:
| State | Visual | Meaning |
|---|---|---|
| idle | Static amber square with mic icon | Voice session not started; tap to begin |
| listening | Gentle breathing pulse on the outer ring | Microphone is active, waiting for speech |
| recording | Warm orange pulse on the core | Manual mode: actively capturing your voice |
| processing | Subdued spin on the middle ring | Finalizing transcription before sending to LLM |
| thinking | Counter-rotating dashed rings | LLM is generating a response (tool calls, text) |
| speaking | Concentric ripples expanding outward | Lola is speaking; tap to interrupt |
The orb click handler adapts to the current state:
const handleClick = () => {
if (mode === "speaking") {
onInterrupt(); // Tap while speaking → interrupt
} else if (isProcessing) {
onStop(); // Tap while processing → cancel
} else if (isManual && mode === "listening") {
if (isManualRecording) {
onManualRecordStop(); // Tap while recording → send
} else {
onManualRecordStart(); // Tap while idle → start recording
}
} else {
onStop(); // Tap otherwise → end voice session
}
};

The orb also shows a status label beneath it. In listening mode it says “Listening”; while recording it shows the live transcript; while thinking it says “Thinking”; while speaking it says “Speaking.” In manual mode with no recording active, it shows “Hold space or tap to speak.”
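The label logic can be sketched as a pure mapping. Both orbStatusLabel and its signature are hypothetical; the text above only describes the behavior:

```typescript
// Hypothetical pure helper mirroring the status labels described above.
type OrbState = "listening" | "recording" | "processing" | "thinking" | "speaking";

function orbStatusLabel(
  state: OrbState,
  opts: { isManual?: boolean; liveTranscript?: string } = {},
): string {
  switch (state) {
    case "listening":
      // Manual mode with no active recording invites the user to speak.
      return opts.isManual ? "Hold space or tap to speak" : "Listening";
    case "recording":
      // While recording, prefer the live transcript if one exists.
      return opts.liveTranscript || "Recording";
    case "processing":
      return "Processing";
    case "thinking":
      return "Thinking";
    case "speaking":
      return "Speaking";
  }
}
```

Keeping this as a pure function makes the label trivially testable, separate from the ring animations.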
Lola uses two system prompts. The base prompt defines her personality and tool usage guidelines. The voice prompt extends the base with narration instructions.
export const systemPrompt = `You are Lola, a passionate and knowledgeable movie companion.
## Your Personality
- Genuinely passionate about cinema across all genres and eras
- Warm but opinionated — you have taste but respect others' preferences
- Concise — 1-2 sentences between tool calls. Let the visual cards do the talking.
- You describe movies by feel, not data — "gorgeous, melancholic road trip" not "received 7.8 on IMDb"
## Tool Usage Guidelines
- ALWAYS use visual tools — never list movies as plain text
- Use search_movies for any movie search
- Use get_movie_details when discussing a specific film in depth
- Use get_trailer proactively when it would enhance the conversation
- Keep text responses SHORT — let the visual cards speak`;
export const voiceSystemPrompt = `${systemPrompt}
## Voice Mode — IMPORTANT
The user is interacting via voice. All tools display visual cards on screen.
You MUST ALSO describe things verbally since the user may not be looking at the screen.
### After Each Tool
- search_movies: Briefly narrate the top 2-3 results — title, year, one line each
- get_movie_details: Highlight the director, lead actors, and a sentence about the plot
- get_ratings: Speak the score and what it means ("solid 8.1 — critics loved it")
- get_trailer: Let them know the trailer is playing on screen
- compare_movies: Summarize the key differences verbally
- get_recommendations: Read out the top 2-3 picks with brief reasons
- get_person: Mention their most notable roles
- get_streaming: Tell them where it's available
- remember_preference: Just acknowledge verbally ("Got it, noted.")
### Speaking Style
- Conversational — like a friend who loves movies, chatting on the couch
- Describe movies with feeling — "It's this gorgeous, melancholic road trip"
- Keep it concise for voice — shorter than text responses
- Ask one thing at a time — don't overwhelm
- Never read metadata robotically — translate data into human sentences`;

The swap happens at runtime. When the voice session starts, the orchestrator calls runnable.setSystemPrompt(voiceSystemPrompt). When voice stops, it reverts to the base prompt. This means the LLM's behavior changes dynamically — in voice mode it narrates tool results, in text mode it keeps responses short and lets the cards speak.
useEffect(() => {
if (!runnable) return;
if (voice.isActive) {
runnable.setSystemPrompt(voiceSystemPrompt);
} else {
runnable.setSystemPrompt(systemPrompt);
}
}, [voice.isActive, runnable]);

The Lola component is the orchestrator. It initializes useGlove with the tools, sets up the voice pipeline with useGloveVoice, manages VAD initialization, handles the thinking sound loop, and renders the three-part layout.
import { GloveClient, createRemoteStore } from "glove-react";
import { systemPrompt } from "./system-prompt";
import { storeActions } from "./store-actions";
export const gloveClient = new GloveClient({
endpoint: "/api/chat",
systemPrompt,
createStore: (sessionId) => createRemoteStore(sessionId, storeActions),
});

"use client";
import { useState, useRef, useMemo, useCallback, useEffect } from "react";
import { useGlove } from "glove-react";
import { useGloveVoice } from "glove-react/voice";
import type { TurnMode } from "glove-react/voice";
import { createLolaTools } from "../lib/tools";
import { stt, createTTS, createSileroVAD } from "../lib/voice";
import { systemPrompt, voiceSystemPrompt } from "../lib/system-prompt";
import { VisualArea } from "./visual-area";
import { TranscriptStrip } from "./transcript-strip";
import { VoiceOrb } from "./voice-orb";
import { TextInput } from "./text-input";
interface LolaProps {
sessionId: string;
onFirstMessage?: (sessionId: string, text: string) => void;
}
export function Lola({ sessionId, onFirstMessage }: LolaProps) {
const [turnMode, setTurnMode] = useState<TurnMode>("vad");
const [isManualRecording, setIsManualRecording] = useState(false);
const [isProcessing, setIsProcessing] = useState(false);
const [showTextInput, setShowTextInput] = useState(false);
const [input, setInput] = useState("");
const [vadReady, setVadReady] = useState(false);
const vadRef = useRef<Awaited<ReturnType<typeof createSileroVAD>> | null>(null);
const MIN_RECORDING_MS = 350;
// Tools — created once, stable reference
const tools = useMemo(() => createLolaTools(), []);
// Glove hook — conversation engine
const glove = useGlove({ tools, sessionId });
const {
runnable, timeline, streamingText, busy,
slots, sendMessage, renderSlot, renderToolResult,
} = glove;
// Silero VAD — async initialization
useEffect(() => {
createSileroVAD().then((v) => {
vadRef.current = v;
setVadReady(true);
});
}, []);
// Voice pipeline
const voiceConfig = useMemo(
() => ({
stt,
createTTS,
vad: vadReady ? vadRef.current ?? undefined : undefined,
turnMode,
}),
[vadReady, turnMode],
);
const voice = useGloveVoice({ runnable, voice: voiceConfig });
// Dynamic system prompt swap
useEffect(() => {
if (!runnable) return;
if (voice.isActive) {
runnable.setSystemPrompt(voiceSystemPrompt);
} else {
runnable.setSystemPrompt(systemPrompt);
}
}, [voice.isActive, runnable]);
// Thinking sound loop
useEffect(() => {
if (voice.mode !== "thinking") return;
const audio = new Audio("/lola-thinking.mp3");
audio.loop = true;
audio.play().catch(() => {});
return () => {
audio.pause();
audio.src = "";
};
}, [voice.mode]);
// Last agent text from timeline for transcript strip
const lastAgentText = useMemo(() => {
for (let i = timeline.length - 1; i >= 0; i--) {
const entry = timeline[i];
if (entry.kind === "agent_text") return entry.text;
}
return "";
}, [timeline]);
return (
<div className="lola-screen">
<VisualArea
slots={slots}
timeline={timeline}
renderSlot={renderSlot}
renderToolResult={renderToolResult}
busy={busy}
onSuggestion={(text) => sendMessage(text)}
/>
<TranscriptStrip
text={streamingText || lastAgentText}
isStreaming={!!streamingText}
/>
<div className="orb-area">
{voice.isActive ? (
<VoiceOrb
mode={voice.mode}
transcript={voice.transcript}
turnMode={turnMode}
isManualRecording={isManualRecording}
isProcessing={isProcessing}
onStop={() => voice.stop()}
onInterrupt={voice.interrupt}
onManualRecordStart={() => { /* manual recording logic */ }}
onManualRecordStop={() => { /* commit recording logic */ }}
/>
) : (
<button
className="voice-orb voice-orb--idle"
onClick={() => voice.start()}
>
Start Voice
</button>
)}
<TextInput
visible={showTextInput}
onToggle={() => setShowTextInput(!showTextInput)}
input={input}
setInput={setInput}
busy={busy}
onSubmit={(e) => {
e.preventDefault();
const text = input.trim();
if (!text || busy) return;
setInput("");
sendMessage(text);
}}
/>
</div>
</div>
);
}

The render structure is flat: VisualArea fills the center, TranscriptStrip sits near the bottom, and the orb-area holds the voice orb plus the optional text input. There is no chat column, no message list, no scroll. The visual area is the only place where tool output appears, and it shows only the most recent content.
The transcript strip shows Lola's most recent spoken or streamed text near the bottom of the screen. It is styled in serif font to match the cinematic aesthetic. While the LLM is streaming, a gentle pulse keeps the text alive. After four seconds of silence, the text fades to near-invisible so it does not compete with the visual area.
import { useEffect, useRef, useState } from "react";
interface TranscriptStripProps {
text: string;
isStreaming: boolean;
}
const FADE_DELAY_MS = 4000;
export function TranscriptStrip({ text, isStreaming }: TranscriptStripProps) {
const [isFading, setIsFading] = useState(false);
const timerRef = useRef<ReturnType<typeof setTimeout> | null>(null);
const prevTextRef = useRef(text);
useEffect(() => {
// Reset fade when text changes or streaming starts
if (text !== prevTextRef.current || isStreaming) {
prevTextRef.current = text;
setIsFading(false);
if (timerRef.current) {
clearTimeout(timerRef.current);
timerRef.current = null;
}
}
// Start fade timer when not streaming and text exists
if (!isStreaming && text) {
timerRef.current = setTimeout(() => {
setIsFading(true);
timerRef.current = null;
}, FADE_DELAY_MS);
}
return () => {
if (timerRef.current) {
clearTimeout(timerRef.current);
timerRef.current = null;
}
};
}, [text, isStreaming]);
if (!text) return null;
// Show the last ~180 characters, trimmed to a word boundary
const displayText =
text.length > 180
? "\u2026" + text.slice(text.length - 180).replace(/^\S*\s/, "")
: text;
return (
<div className="transcript-strip" role="status" aria-live="polite">
<p
className={`transcript-strip__text ${
isStreaming ? "transcript-strip__text--streaming" : ""
} ${isFading ? "transcript-strip__text--fading" : ""}`}
>
{displayText}
</p>
</div>
);
}

The 180-character trim ensures the strip never wraps excessively. For long narrations, it shows the tail end with a leading ellipsis. The role="status" and aria-live="polite" attributes ensure screen readers announce new text without interrupting the user.
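The trim expression can be pulled out as a pure function to see exactly what it does (trimTranscript is a hypothetical extraction of the inline logic above):

```typescript
// Hypothetical pure extraction of the strip's tail-trim logic: keep the
// last `max` characters, drop the leading partial word so the text starts
// on a word boundary, and prefix an ellipsis to make the cut visible.
function trimTranscript(text: string, max = 180): string {
  if (text.length <= max) return text;
  return "\u2026" + text.slice(text.length - max).replace(/^\S*\s/, "");
}
```

Short strings pass through untouched; a long narration keeps only its tail, starting at the first complete word.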
When the LLM is processing (calling tools, generating text), Lola plays a subtle ambient sound loop. This gives the user audio feedback that something is happening, even when the screen has not changed yet.
useEffect(() => {
if (voice.mode !== "thinking") return;
const audio = new Audio("/lola-thinking.mp3");
audio.loop = true;
audio.play().catch(() => {});
return () => {
audio.pause();
audio.src = "";
};
}, [voice.mode]);

The useEffect cleanup function stops the sound immediately when the voice mode changes away from "thinking". Setting audio.src to an empty string releases the audio resource. The .catch(() => {}) handles browsers that block autoplay — the sound is a nice-to-have, not critical.
Lola supports two turn modes for voice input:

- **Automatic**: the Silero VAD detects when you stop speaking and commits the turn for you.
- **Manual**: you start and stop recording yourself via the voice orb.
A toggle between these modes sits below the voice orb. It is only enabled when the voice pipeline is in the listening state — you cannot switch modes while the LLM is thinking or speaking.
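That guard reduces to a one-line predicate over the pipeline's current mode. A sketch (the mode names mirror the states this tutorial mentions, and `canSwitchTurnMode` is an illustrative name, not glove's API):

```typescript
// Sketch: the turn-mode toggle is enabled only while listening.
// Mode names and function name are illustrative, not glove's exact API.
type VoiceMode = "listening" | "thinking" | "speaking";

function canSwitchTurnMode(mode: VoiceMode): boolean {
  return mode === "listening";
}
```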
In manual mode, a 350ms minimum recording duration prevents false positive triggers. If you tap the orb briefly (under 350ms), the commit is delayed until the minimum threshold is reached. This avoids sending empty or garbled audio to the STT service.
```tsx
const MIN_RECORDING_MS = 350;

const commitRecording = useCallback(() => {
  if (!recordingRef.current) return;
  recordingRef.current = false;
  setIsManualRecording(false);
  const elapsed = Date.now() - recordingStartRef.current;
  if (elapsed >= MIN_RECORDING_MS) {
    setIsProcessing(true);
    commitTurnRef.current();
  } else {
    // Delay commit until minimum recording duration
    setIsProcessing(true);
    const remaining = MIN_RECORDING_MS - elapsed;
    pendingCommitRef.current = setTimeout(() => {
      pendingCommitRef.current = null;
      commitTurnRef.current();
    }, remaining);
  }
}, []);
```

| Tool | Display Method | Why |
|---|---|---|
| `search_movies` | `pushAndForget` | Poster grid appears instantly; LLM narrates results in parallel |
| `get_movie_details` | `pushAndForget` | Info card appears; LLM describes the film verbally |
| `get_ratings` | `pushAndForget` | Rating card appears; LLM speaks the score |
| `get_trailer` | `pushAndForget` | YouTube embed appears; LLM says “trailer is playing” |
| `compare_movies` | `pushAndForget` | Side-by-side cards appear; LLM summarizes differences |
| `get_recommendations` | `pushAndForget` | Numbered list appears; LLM reads the top picks |
| `get_person` | `pushAndForget` | Profile card appears; LLM mentions notable roles |
| `get_streaming` | `pushAndForget` | Provider badges appear; LLM says where to watch |
| `remember_preference` | None | Pure data — LLM acknowledges verbally, no UI |
Every single tool uses `pushAndForget`. There is no `pushAndWait` anywhere in the Lola codebase. This is the defining characteristic of a voice-first app. The moment you add a blocking tool, you break the voice flow.
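To make the fire-and-forget contract concrete, here is a minimal self-contained sketch. Everything in it (`display`, `searchMoviesDo`, the card shape) is an illustrative stand-in rather than glove's real API; the point is that the card is pushed and the function returns narration text immediately, with nothing awaiting user interaction:

```typescript
// Illustrative stand-ins, not glove's actual API.
type Card = { kind: string; props: Record<string, unknown> };

const shown: Card[] = [];
const display = {
  // Fire-and-forget: record the card for rendering and return at once.
  pushAndForget(card: Card): void {
    shown.push(card);
  },
};

// A voice-first tool body: fire the visual, return text for the LLM to narrate.
async function searchMoviesDo(query: string): Promise<string> {
  // In Lola this would hit the TMDB proxy; stubbed with fixed data here.
  const results = [{ title: "Inception", year: 2010 }];
  display.pushAndForget({ kind: "poster-grid", props: { query, results } });
  // Returned immediately: the voice conversation is never blocked on the UI.
  return `Found ${results.length} result(s): ${results.map((r) => r.title).join(", ")}.`;
}
```

A blocking variant would instead return a promise that resolves only after the user interacts with the card, which is exactly what a voice-first app must avoid.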
Lola uses a charcoal + amber palette inspired by film noir aesthetics. The background is void black (#0d0d0f), cards use charcoal surfaces, and amber provides warmth for accents, ratings, and the voice orb.
```ts
export const VOID = "#0d0d0f";

export const CHARCOAL: Record<number, string> = {
  900: "#1a1a1f",
  800: "#222228",
  700: "#2a2a32",
  600: "#333340",
  500: "#3d3d48",
};

export const AMBER: Record<number, string> = {
  500: "#d4911e",
  400: "#f5a623",
  300: "#f7b84d",
  200: "#fcd88e",
  100: "#fde8b5",
  50: "#fef7e6",
};

export const CREAM = "#faf7f2";
export const CREAM_MUTED = "#a8a4a0";
export const CREAM_DIM = "#706c68";
```

Three typefaces reinforce the cinematic feel: Instrument Serif for movie titles and the transcript strip, DM Sans for body text and labels, and DM Mono for metadata like years, runtimes, and rating numbers.
```bash
# From the monorepo root
pnpm install

# Set environment variables in examples/lola/.env.local:
# OPENROUTER_API_KEY=...
# TMDB_API_KEY=...
# ELEVENLABS_API_KEY=...

pnpm --filter glove-lola run dev
```

Try these conversations:
- `get_recommendations` with the mood string, which maps to genre IDs internally.

Notice that throughout the conversation, you never need to tap the screen. The visual cards are ambient — they appear and stay visible while Lola narrates. The voice orb communicates state through animation. The text input is hidden by default and only appears when you tap the keyboard icon.
| Piece | Where | Why |
|---|---|---|
| `createChatHandler` | Server | LLM proxy — sends tool schemas, streams responses |
| Tool `do` functions | Browser | Fetch from TMDB proxy, fire visual cards, return text for narration |
| `/api/tmdb/[...path]` | Server | TMDB proxy — keeps API key server-side |
| `/api/voice/stt-token` | Server | Generates ElevenLabs STT tokens |
| `/api/voice/tts-token` | Server | Generates ElevenLabs TTS tokens |
| ElevenLabs STT/TTS | Browser (direct connection) | Browser uses tokens to stream audio directly to ElevenLabs |
| Silero VAD | Browser (WebAssembly) | Runs a small neural network locally for speech detection |
| Visual area, orb, transcript strip | Browser | UI components rendering tool output and voice state |
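The proxy route's job boils down to URL rewriting plus server-side key injection. Here is a sketch of that core step as a pure helper (the name `buildTmdbUrl` and the exact handler wiring are ours, not code from the tutorial repo; the `TMDB_API_KEY` it expects is the variable set in `.env.local` above):

```typescript
// Sketch: turn a proxied request path + query into the upstream TMDB URL.
// `buildTmdbUrl` is an illustrative helper, not code from the tutorial repo.
function buildTmdbUrl(
  pathSegments: string[],
  search: URLSearchParams,
  apiKey: string
): string {
  const target = new URL(`https://api.themoviedb.org/3/${pathSegments.join("/")}`);
  // Forward the client's query parameters unchanged.
  search.forEach((value, key) => target.searchParams.set(key, value));
  // The API key is appended server-side only; the browser never sees it.
  target.searchParams.set("api_key", apiKey);
  return target.toString();
}
```

A route handler would call this with the catch-all path segments and the incoming request's query string, then `fetch` the resulting URL and stream the JSON back to the browser.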
- `useGloveVoice`, adapters, VAD, turn modes, and token routes
- `pushAndWait` vs. `pushAndForget` and display strategies
- `pushAndWait` for interactive forms (the opposite of Lola's approach)
- `defineTool` API Reference — full API for typed tool definitions with `displayPropsSchema` and `resolveSchema`
- `useGlove`, `GloveClient`, and rendering