The problem
From the beginning, Ora's agent loop injected every registered tool schema into the system prompt at session start. That meant the LLM received the full parameter documentation for all 41 tools — calendar creation, mail search, Shortcuts runner, file search, Notes editing — whether the user needed them or not.
At roughly 90–110 characters per tool in compact schema format, the tools block alone consumed ~3,700–4,500 characters of every system prompt. At the 0.3 tokens/char estimate used in ConversationManager, that's around 1,200 tokens per generation step, just for the tool list.
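The arithmetic behind that estimate, using the midpoint of the per-tool range, can be checked in a few lines (constant names here are illustrative):

```swift
let toolCount = 41
let avgCharsPerTool = 100   // midpoint of the 90–110 range
let tokensPerChar = 0.3     // ConversationManager's chars-to-tokens estimate

let chars = toolCount * avgCharsPerTool           // 4,100 chars
let tokens = Int(Double(chars) * tokensPerChar)   // ≈ 1,230 tokens
```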
This matters for three reasons: cost per inference step (even on local MLX it affects GPU memory pressure), the model's attention budget competing with actual conversation context, and the signal-to-noise ratio of the prompt — a model that sees 35 tools it doesn't need is more likely to confuse them.
The design
The fix introduces a load policy on every tool: .core or .deferred. Six tools are marked core — the ones used on nearly every turn:
```swift
enum ToolLoadPolicy: Sendable, Equatable {
    case core      // full schema always in prompt
    case deferred  // compact catalog row; schema on demand
}

// Core tools — high-frequency, always visible to the model:
// calendar.query, contacts.search, reminders.list,
// system.open_app, mail.recent, tools.discover
```
Every other tool defaults to .deferred. The protocol extension supplies the default, so existing tools required no changes beyond marking the six core ones explicitly.
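The defaulting mechanism is a standard Swift protocol extension. A minimal sketch — the protocol name `OraTool` and the property name `loadPolicy` are illustrative, not Ora's actual identifiers:

```swift
enum ToolLoadPolicy: Sendable, Equatable {
    case core
    case deferred
}

// Hypothetical protocol; Ora's real tool protocol may differ.
protocol OraTool {
    var name: String { get }
    var loadPolicy: ToolLoadPolicy { get }
}

// The extension supplies .deferred, so existing tools need
// no change unless they opt in to .core explicitly.
extension OraTool {
    var loadPolicy: ToolLoadPolicy { .deferred }
}

struct MessagesSendTool: OraTool {
    let name = "messages.send"            // inherits .deferred
}

struct CalendarQueryTool: OraTool {
    let name = "calendar.query"
    let loadPolicy = ToolLoadPolicy.core  // explicit opt-in
}
```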
"The model carries a compact one-line catalog of all deferred tools, grouped by domain. When it needs one, it calls tools.discover — and the schema appears in the next step."
Three-section prompt architecture
SystemPromptBuilder now emits three distinct sections instead of one flat tool list: the full schemas of the core tools, the compact catalog of deferred tools, and any schemas discovered during the session. The structure is established at session start and refreshed before each generation step.
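A sketch of the three-section layout — type and function names here are illustrative, and the section headings are assumptions; only the core/catalog/discovered split is from the design itself:

```swift
struct ToolInfo {
    let name: String      // e.g. "messages.send"
    let domain: String    // e.g. "messages"
    let summary: String   // one-line description for the catalog row
    let schema: String    // full parameter documentation
}

func buildToolSections(core: [ToolInfo],
                       deferred: [ToolInfo],
                       discovered: [ToolInfo]) -> String {
    // Section 1: full schemas for core tools, always present.
    var out = "## Core tools\n"
    out += core.map(\.schema).joined(separator: "\n")

    // Section 2: one compact catalog row per deferred tool, grouped by domain.
    out += "\n\n## Tool catalog (call tools.discover for a schema)\n"
    let byDomain = Dictionary(grouping: deferred, by: \.domain)
    for (domain, tools) in byDomain.sorted(by: { $0.key < $1.key }) {
        out += "[\(domain)] "
        out += tools.map { "\($0.name): \($0.summary)" }.joined(separator: "; ")
        out += "\n"
    }

    // Section 3: schemas surfaced by tools.discover during this session.
    if !discovered.isEmpty {
        out += "\n## Discovered tools\n"
        out += discovered.map(\.schema).joined(separator: "\n")
    }
    return out
}
```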
The token reduction, visualized
The chart below compares the three scenarios. "Before" is the old single flat block. "After (initial)" is what the model sees at session start — core schemas plus the compact catalog. "After (with one discovery)" shows what happens after a single tools.discover call that returns, say, three deferred schemas.
[Chart: approx. tool-block size by scenario, in characters]
The discovery index
For tools.discover to be useful, it needs to reliably surface the right tool even when the user's voice input goes through ASR — which means typos, homophones, and phonetic approximations. The ToolDiscoveryIndex uses a two-tier strategy:
Deterministic pass first. An exact tool-name match scores 1.0; a substring match scores 0.97. Keyword overlap and domain match follow. Jaro-Winkler similarity (the same fuzzy matching Ora uses elsewhere for contacts search) serves as a tiebreaker, applied at a ≥0.93 similarity threshold.
BM25 fallback. If the deterministic pass yields no results, standard BM25 ranking runs over name + description + parameter names. k1=1.5, b=0.75. A Jaro-Winkler boost (0.35×) and a domain match boost (0.1) are layered on top. Scores are normalized to [0.05, 0.79] so they stay below the deterministic tier's floor.
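The deterministic tier can be sketched as a scoring function that returns nil to signal fallthrough to BM25. The 1.0 and 0.97 tiers are from the design above; the exact values for the keyword-overlap and domain tiers are assumptions (the design only states they "follow"), and the Jaro-Winkler tiebreaker is omitted here for brevity. Note every deterministic score stays above 0.79, the ceiling of the normalized BM25 range:

```swift
struct IndexedTool {
    let name: String          // e.g. "messages.send"
    let domain: String        // e.g. "messages"
    let keywords: Set<String>
}

func deterministicScore(query: String, tool: IndexedTool) -> Double? {
    let q = query.lowercased()
    if q == tool.name { return 1.0 }                                 // exact name match
    if tool.name.contains(q) || q.contains(tool.name) { return 0.97 } // substring match

    // Keyword overlap: base score plus a small bonus for the fraction
    // of query words found in the keyword set (values assumed).
    let words = Set(q.split(separator: " ").map(String.init))
    let overlap = words.intersection(tool.keywords)
    if !overlap.isEmpty {
        return 0.80 + 0.1 * Double(overlap.count) / Double(words.count)
    }

    if q.contains(tool.domain) { return 0.80 }  // domain match (value assumed)
    return nil                                  // fall through to the BM25 tier
}
```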
The tools.discover call itself is exempt from the business tool-call budget — AgentLoop increments the turn counter only for actual business tools, not for meta-tools like discovery. This prevents a discovery call from consuming one of the three allowed tool calls per turn.
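The exemption amounts to a guard in the budget accounting. A minimal sketch — the type and method names are illustrative, not AgentLoop's actual API:

```swift
struct TurnBudget {
    static let maxBusinessCalls = 3   // allowed business tool calls per turn
    private(set) var businessCalls = 0

    /// Returns true if the call may proceed. Meta-tools like
    /// tools.discover never consume the business budget.
    mutating func record(toolName: String) -> Bool {
        if toolName == "tools.discover" { return true }
        guard businessCalls < Self.maxBusinessCalls else { return false }
        businessCalls += 1
        return true
    }
}
```

Any number of discovery calls can therefore precede the three business calls without shrinking the turn's real budget.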
Prompt refresh per generation step
The subtlest part of the design is the prompt refresh mechanism. When the model discovers a tool on step N, that schema needs to be visible on step N+1. AgentLoop rebuilds the system prompt before every generation step using the current session's discovered tool set:
```swift
private func refreshSystemPromptIfNeeded() async {
    let prompt = await buildSystemPrompt()  // core + catalog + discovered
    guard promptHash(prompt) != lastPromptHash else { return }
    await conversationManager.updateSystemPrompt(prompt)
    lastPromptHash = promptHash(prompt)
}
```
The hash comparison means the refresh is free when nothing has changed — no re-generation, no actor contention. And ConversationManager.updateSystemPrompt(_:) was updated (in the codex review cycle) to call trimContextIfNeeded() immediately after, preventing a larger system prompt from silently pushing total context over the 32K token budget.
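A sketch of that update-then-trim behavior. The real ConversationManager is an actor; a plain class keeps this sketch synchronous, and the internals (history as an array of strings, oldest-first trimming) are assumptions. The token estimate reuses the 0.3 tokens/char figure from earlier in the post:

```swift
final class ConversationManagerSketch {
    private(set) var systemPrompt = ""
    private(set) var history: [String] = []
    private let tokenBudget = 32_000

    init(history: [String]) { self.history = history }

    private func estimatedTokens(_ s: String) -> Int {
        Int(Double(s.count) * 0.3)
    }

    private var totalTokens: Int {
        estimatedTokens(systemPrompt) + history.map(estimatedTokens).reduce(0, +)
    }

    func updateSystemPrompt(_ prompt: String) {
        systemPrompt = prompt
        // Added in the review cycle: a larger system prompt must not
        // silently push total context over the token budget.
        trimContextIfNeeded()
    }

    private func trimContextIfNeeded() {
        while totalTokens > tokenBudget, !history.isEmpty {
            history.removeFirst()   // drop oldest turns first
        }
    }
}
```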
What it feels like in practice
For most requests — calendar queries, opening an app, looking up a contact — nothing changes. Those tools are core and their schemas are always present. The model behaves identically to before.
For a first-time deferred-tool request in a session ("send a message to Mom"), the loop adds one extra step: tools.discover → schema cached → next step executes messages.send. After that, within the same session, any repeated request for Messages goes directly to execution. The cache accumulates across turns but clears when the session ends.
A dedicated test enforces the size reduction automatically: the initial tool block must stay at or below 45% of the full-all-tools baseline. It currently passes with roughly a 5% margin.
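The core of such a test reduces to a size comparison. A minimal sketch — the helper names are illustrative, not the actual test code:

```swift
// Size of the tool block the model sees at session start:
// full core schemas plus the compact catalog rows.
func initialToolBlockSize(coreSchemas: [String], catalogRows: [String]) -> Int {
    coreSchemas.map(\.count).reduce(0, +) + catalogRows.map(\.count).reduce(0, +)
}

// Old behavior: every schema in full, for the baseline.
func fullBaselineSize(allSchemas: [String]) -> Int {
    allSchemas.map(\.count).reduce(0, +)
}

// The guard the test enforces: initial block ≤ 45% of the baseline.
func withinBudget(initial: Int, baseline: Int) -> Bool {
    Double(initial) <= 0.45 * Double(baseline)
}
```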