Database Emulator

glove-scratchpad exposes an agent's capabilities as a relational database it queries with a single execute_sql tool. Instead of loading dozens of tool definitions into the context window, the model writes SQL — which it knows fluently, at every model size — to discover, invoke, and compose capabilities.

The idea: resources become tables. A resource is an entity / data type — github_pr, linear_issue, emails, time, images — and its CRUD verbs map to (possibly different) underlying tools. The model never sees the plumbing; it sees a schema and writes queries against it.

sql

-- discover what you can do
SELECT table_name FROM information_schema.tables;
SELECT column_name, data_type FROM information_schema.columns WHERE table_name = 'tasks';

-- invoke a tool by querying its table; push arguments through WHERE
SELECT id, name FROM tasks WHERE due_date = (SELECT tomorrow FROM time);

-- compose across services in one statement — no intermediate rows in context
INSERT INTO notion_page (title, body)
SELECT title, body FROM github_pr WHERE merged = true AND base = 'main';

-- stage an outbound effect, preview it, then commit (or roll back — a dry run)
BEGIN;
INSERT INTO emails (to_addr, subject, body) VALUES ('a@b.com', 'hi', 'yo');
COMMIT;

It is, at heart, a SQL interpreter: every statement is parsed and inspected before any tool runs. That buys discovery (information_schema), composition (joins / INSERT … SELECT), preview (EXPLAIN, transactions), and a real security surface (a syntax tree you can reject) — for free, because the database solved them decades ago.

Runnable, no-API-key tour of every property below: pnpm scratchpad:db (drives Database.execute directly so the transcript is deterministic).

Benchmark: tools-as-SQL, measured

The obvious question is whether this actually pays off, or just moves the complexity around. So we built an agentic A/B benchmark: ten in-process MCP servers mirroring a real product team's stack — GitHub, Linear, Email, Slack, Notion, Jira, Sentry, PagerDuty, Calendar, Filesystem, 32 tools over one deterministic, cross-linked seed world — run through the real glove-core agent loop twice per model: once with all 32 tools folded directly (baseline), once with a single execute_sql over the same capabilities as tables (scratchpad). Runs are graded deterministically; writes are graded on the real side-effect outbox, which can't be faked.

21/21

scratchpad vs 16/21 baseline on the OSS frontier — even frontier models miscount long tool-result lists

74% → 100%

weak-model pass rate across five rounds of hardening — spirals 6 → 0, median tool calls 6 → 2

up to 1731×

less context than paging results back verbatim — the scratchpad holds a flat ~22 KB while the naive payload grows linearly

Three findings fell out. (1) The scratchpad always cut peak context 2–4× — 32 schemas never enter the window and only selected rows do — but the first pass was a tradeoff, not a blowout: weak models spiralled in the open SQL surface (74% pass, six 30-turn runaways). (2) Nearly every weak-model failure traced to a platform gap, not a model limit — most damningly, places where the engine silently mis-answered where Postgres would error. Five rounds of fixes (engine bugs, prompt discipline, SQL-discoverable metadata, read-your-writes, Postgres parity) took the same five budget models from 74% → 100% with zero spirals. (3) At production scale — 40 servers, 367 tools, ~95% noise — the tool-folding baseline inverts: it becomes the least accurate arm at 12× the context and ~6× the cost, while the scratchpad's footprint is statistically unchanged.

The context win is a property of the data, not the model. Measured deterministically (no API key) — asking one cross-service question as the world grows:

rows / service	naive (est. tokens)	scratchpad (est. tokens)	reduction
100	20.3k	5.6k	`3.6×`
1,000	202.8k	5.8k	`35.0×`
5,000	1.01M	5.8k	`174.5×`
20,000	4.06M	5.8k	`696.1×`
50,000	10.14M	5.9k	`1731.2×`

The transferable lesson is a design stance: the scratchpad only works if it behaves like the database the model already knows. Every place it silently deviated from Postgres muscle-memory was a place a weak model failed; every fix that made truth cheaper to see — command tags, read-your-writes, loud errors, in-band discovery — bought more capability than any prompt instruction. The full study, with all transcripts, figures, and the v1→v5 hardening arc, is in the repo: The Scratchpad Is a Database.

Install

pnpm add glove-scratchpad
# zero runtime dependencies — the query engine (glove-sql) is bundled.

# OPTIONAL — a full Postgres dialect (WASM) instead of the bundled subset:
pnpm add @electric-sql/pglite
# OPTIONAL — bridge MCP servers in as tables:
pnpm add glove-mcp

Quick start

import { Database, resourceFromTool, defineResource, mountDatabase } from "glove-scratchpad";

const db = await Database.create({ policy: { writes: true } });

// A read-only tool → a one-row `time` table.
db.register(resourceFromTool(getTimeTool, {
  name: "time", volatility: "stable",
  columns: [{ name: "now", type: "timestamptz" }, { name: "tomorrow", type: "text" }],
}));

// A search tool → a `web` table whose required `query` column is a pushed-down argument.
db.register(resourceFromTool(searchTool, {
  name: "web", volatility: "volatile",
  columns: [{ name: "title", type: "text" }, { name: "url", type: "text" }],
}));

// Fold the single tool + prime the model to discover → invoke → act → stage.
mountDatabase(agent, { db });

Now the agent works entirely in SQL through execute_sql (and explain_sql).

Resources as tables

A resource is an entity with columns and any subset of CRUD verbs, each wired independently. defineResource is the explicit contract; resourceFromTool is the convenience for the single-verb case.

const githubPr = defineResource({
  name: "github_pr",
  volatility: "stable",
  columns: [
    { name: "number", type: "bigint", requiredKey: true },  // an API argument
    { name: "title", type: "text" },
    { name: "merged", type: "boolean" },
  ],
  select: (b) => listPrs({ number: b.one("number") }),       // SELECT  → a list/get tool
  insert: (rows) => createPr(rows[0]),                       // INSERT  → a create tool
  update: (set, b) => updatePr(b.one("number"), set),        // UPDATE  → an update tool
  delete: (b) => closePr(b.one("number")),                   // DELETE  → a close tool
});

A read-only time has only select; an emails (send) is insert-only; an images generator is a select-shaped but volatile function-as-relation (SELECT url FROM images WHERE prompt = '…' — prompt is an argument). Verb presence is the capability gate: SELECTing a write-only resource, or writing one with no writer, is a clear error.

Volatility

Every resource declares Postgres's immutable | stable | volatile. It governs caching and protects effectful tools from being called the wrong number of times.

Volatility	Cached	For
immutable	database lifetime	Pure lookups.
stable	within one execute	Turn-stable reads (e.g. time).
volatile	never	Effectful / nondeterministic reads & all writes. Invoked exactly once per statement.

How a query runs

The query engine (glove-sql) is synchronous; resources are async and effectful. So Database.execute can't hook resolution inside the engine — it pre-resolves:

Parse the SQL (the same parser the engine executes — one grammar).
Gate it: a statement-kind whitelist, read-only by default, CREATE/DROP refused, multi-statement only as a BEGIN … COMMIT/ROLLBACK script.
Collect every relation referenced (FROM, JOINs, subqueries, CTE bodies, INSERT … SELECT source) and classify each as a resource or not.
Push down the WHERE / JOIN-ON equalities scoped to each resource — these are arguments, not just filters; a missing required key is a clear error.
Resolve each resource exactly once, materialize its rows into the engine, run the now-synchronous query, then tear down the ephemeral tables.

Resolving once, up front, is what makes the volatility guarantee hold: the engine evaluates FROM-resolution lazily and repeatedly (once per correlated-subquery row), so an inline async hook would invoke an effectful tool N times. Pre-resolution invokes it once.

Discovery is information_schema

There is no separate discovery step. Resources are advertised in information_schema.tables / .columns (engine-agnostically, via a catalog callback), so the agent lands in an unfamiliar database, lists its tables, inspects the relevant ones, and figures out its own capabilities — exactly how SQL has always done progressive disclosure.

Transactions = preview & staging

A write against a resource is a side-effecting tool call. Inside a transaction it is staged, not fired — recorded with the exact resolver + arguments it will invoke. db.preview() (and the staged field on the result) is the approval surface; COMMIT fires the staged writes in order; ROLLBACK discards them — a true dry run. Writes are off unless the database is created with policy: { writes: true }.

await db.execute(`BEGIN`);
const staged = await db.execute(
  `INSERT INTO notion_page (title) SELECT title FROM github_pr WHERE merged = true`,
);
staged.staged;            // → the writes about to fire (preview them)
await db.execute(`COMMIT`);   // fires in order   (or ROLLBACK to discard)

EXPLAIN

db.explain(sql) (and the explain_sql tool, and EXPLAIN <stmt> through execute_sql) runs the pre-pass only — no resolver calls — and reports which resources a statement will hit, each one's volatility, read/write access, and the arguments it resolved. Explaining a generate_image query costs nothing.

MCP servers → tables

Most MCP tools are CRUD over some resource type, so decompose a server into resources and give each a table. glove-mcp is an optional peer dependency.

import { connectMcp } from "glove-mcp";
import { mountMcpDatabase } from "glove-scratchpad/mcp";

const conn = await connectMcp({ namespace: "github", url });
await mountMcpDatabase(db, conn, {
  table: (t) => t.name === "list_pull_requests"
    ? { name: "github_pr", op: "select", volatility: "stable",
        columns: [{ name: "title", type: "text" }, { name: "merged", type: "boolean" }],
        rows: (d) => JSON.parse(d as string) }
    : null,                       // skip the rest, or map them too
});
// → INSERT INTO linear_issue SELECT … FROM github_pr WHERE merged = true
//   composes two servers in one statement.

A read tool (readOnlyHint) defaults to a select resource; others default to a volatile insert. Declare columns (and a rows extractor) via table(tool) to make a server's data genuinely queryable.

Backends

The manipulation surface is a defined Postgres subset; the backend behind it is swappable (ScratchpadBackend).

Backend	Import	When
glove-sql (default)	bundled	Zero-dependency pure-JS Postgres subset. Covers the SQL agents write.
PgliteBackend	glove-scratchpad/pglite	Embedded Postgres (WASM) for a full dialect. Optional peer.
Bring your own	ScratchpadBackend	Real Postgres / SQLite / a remote service.

API

const db = await Database.create({ policy?: { writes }, backend?, actor? });
db.register(resource);                  // or registerAll([...])
await db.execute(sql, { params?, limit?, allowWrites?, signal? });
//   → { rows, truncated, touched, staged?, committed?, message? }
await db.explain(sql, { params? });     // → { statementKind, readOnly, relations, staged? }
db.preview();                           // staged writes in the open transaction
mountDatabase(glove, { db, prime?, explain?, allowWrites? });

What this is not

Effectful relations are volatile. The interpreter carries a volatility model so the engine can't call them the wrong number of times — but you own declaring volatility correctly.
Atomic conditional composition doesn't reduce. Branching where the next tool depends on a prior tool's output, inside one statement, is imperative-vs-declarative — punt it to the agent loop (query, look, query again).
Tables are live views, not stored data. Rate limits, pagination, and partial failure when one service times out mid-JOIN are yours to handle in the resolver.

The query engine lives in glove-sql — a standalone, zero-dependency Postgres-subset engine you can use on its own.