Engineering Notes
Ora · February 25, 2026

Dynamic Tool Discovery: How We Reduced Prompt Tokens by 60%

Ora registers 41 tools. Sending every schema to the LLM on every turn was expensive and noisy. Dynamic tool discovery introduced client-side lazy loading — the model now carries six core schemas and discovers the rest on demand.

The problem

From the beginning, Ora's agent loop injected every registered tool schema into the system prompt at session start. That meant the LLM received the full parameter documentation for all 41 tools — calendar creation, mail search, Shortcuts runner, file search, Notes editing — whether the user needed them or not.

At roughly 90–110 characters per tool in compact schema format, the tools block alone consumed ~3,700–4,500 characters of every system prompt. At the 0.3 tokens/char estimate used in ConversationManager, that's around 1,200 tokens per generation step, just for the tool list.
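That arithmetic is small enough to check inline. A sketch of the back-of-envelope numbers (the 90–110 chars/tool range and the 0.3 tokens/char heuristic are from the text; nothing here is Ora code):

```swift
// Back-of-envelope cost of the flat tool block.
let toolCount = 41
let charsPerTool = 90...110                         // compact schema format
let minChars = toolCount * charsPerTool.lowerBound  // 3,690
let maxChars = toolCount * charsPerTool.upperBound  // 4,510
let tokensPerChar = 0.3                             // ConversationManager's estimate
let midTokens = Double(minChars + maxChars) / 2.0 * tokensPerChar
// midTokens ≈ 1,230 tokens per generation step, just for tool schemas
```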

This matters for three reasons: cost per inference step (even on local MLX it affects GPU memory pressure), the model's attention budget competing with actual conversation context, and the signal-to-noise ratio of the prompt — a model that sees 35 tools it doesn't need is more likely to confuse them.

Total tools: 41
Tokens before (per step): ~1,200
Tokens after (session start): ~450

The design

Every tool now declares a load policy: .core or .deferred. Six tools are marked core, the ones used on nearly every turn:

ToolProtocol.swift
enum ToolLoadPolicy: Sendable, Equatable {
    case core     // full schema always in prompt
    case deferred // compact catalog row; schema on demand
}

// Core tools — high-frequency, always visible to the model
// calendar.query, contacts.search, reminders.list,
// system.open_app, mail.recent, tools.discover

Every other tool defaults to .deferred. The protocol extension supplies the default, so existing tools required no changes beyond marking the six core ones explicitly.
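A minimal sketch of how a protocol extension can supply that default, assuming a stripped-down ToolProtocol (the real protocol has more requirements; the two example conformances are illustrative):

```swift
enum ToolLoadPolicy: Sendable, Equatable {
    case core     // full schema always in prompt
    case deferred // compact catalog row; schema on demand
}

// Hypothetical minimal protocol; Ora's real ToolProtocol carries more.
protocol ToolProtocol {
    var name: String { get }
    var loadPolicy: ToolLoadPolicy { get }
}

extension ToolProtocol {
    // Default: every tool is deferred unless it explicitly opts in to .core.
    var loadPolicy: ToolLoadPolicy { .deferred }
}

struct CalendarQueryTool: ToolProtocol {
    let name = "calendar.query"
    let loadPolicy: ToolLoadPolicy = .core  // one of the six explicit core tools
}

struct FileSearchTool: ToolProtocol {
    let name = "system.search_files"        // inherits .deferred; no change needed
}
```

Because the default lives in the extension, conforming types that say nothing get .deferred for free, which is why only the six core tools needed edits.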

"The model carries a compact one-line catalog of all deferred tools, grouped by domain. When it needs one, it calls tools.discover — and the schema appears in the next step."

Three-section prompt architecture

SystemPromptBuilder now emits three distinct sections instead of one flat tool list. The structure is established at session start and refreshed before each generation step:

CORE TOOLS · 6 tools, full schemas: calendar.query, contacts.search, reminders.list, system.open_app, mail.recent, tools.discover. ~600 chars · ~180 tokens.

DEFERRED CATALOG · 35 tools, name-only rows grouped by domain. For example:

calendar:
- calendar.create_event [confirm]
- calendar.delete_event [confirm]
- calendar.find_slots
messages:
- messages.send [confirm]
- messages.open_chat
system:
- system.search_files
+ 26 more...

~900 chars · ~270 tokens.

DISCOVERED · empty at session start; grows as the model calls tools.discover. 0 tokens initially.

Initial total: ~1,500 chars · ~450 tokens · saves 62% vs. the flat all-tools block.
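The assembly of those three sections can be sketched as a pure function (section headers match the structure above; the function name and signature are assumptions, not Ora's actual SystemPromptBuilder):

```swift
// Hypothetical assembly of the three-section tool block.
func buildToolBlock(coreSchemas: [String],
                    catalog: [String: [String]],   // domain -> tool names
                    discoveredSchemas: [String]) -> String {
    var out = "CORE TOOLS\n" + coreSchemas.joined(separator: "\n")
    out += "\n\nDEFERRED CATALOG\n"
    for (domain, names) in catalog.sorted(by: { $0.key < $1.key }) {
        out += "\(domain):\n" + names.map { "- \($0)" }.joined(separator: "\n") + "\n"
    }
    // The discovered section is omitted entirely while empty,
    // so it costs zero tokens at session start.
    if !discoveredSchemas.isEmpty {
        out += "\nDISCOVERED\n" + discoveredSchemas.joined(separator: "\n")
    }
    return out
}
```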

The token reduction, visualized

The chart below compares the three scenarios. "Before" is the old single flat block. "After (initial)" is what the model sees at session start — core schemas plus the compact catalog. "After (with one discovery)" shows what happens after a single tools.discover call that returns, say, three deferred schemas.

Approx. tool-block size by scenario (chars)

Before (flat all-tools): ~4,000 chars
After (initial, core + catalog): ~1,500 chars (−62%)
After one discovery (+3 discovered schemas): ~1,800 chars

The discovery index

For tools.discover to be useful, it needs to reliably surface the right tool even when the user's voice input goes through ASR — which means typos, homophones, and phonetic approximations. The ToolDiscoveryIndex uses a two-tier strategy:

Deterministic pass first. Exact tool-name match scores 1.0. Substring match scores 0.97. Keyword overlap and domain match follow. Jaro-Winkler similarity (the same fuzzy matching used elsewhere in Ora for contacts search) is applied as a tiebreaker at ≥0.93 similarity threshold.
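In sketch form, the deterministic tier might look like this (the 1.0 and 0.97 scores are from the text; the 0.9 keyword-overlap score and the signature are illustrative, and the Jaro-Winkler tiebreak is elided):

```swift
// Illustrative deterministic pass. Returns nil so the caller can
// fall through to the BM25 tier when nothing matches.
func deterministicScore(query: String,
                        toolName: String,
                        keywords: Set<String>) -> Double? {
    let q = query.lowercased()
    let name = toolName.lowercased()
    if q == name { return 1.0 }                              // exact tool-name match
    if name.contains(q) || q.contains(name) { return 0.97 }  // substring match
    let queryTerms = Set(q.split(separator: " ").map(String.init))
    if !keywords.isDisjoint(with: queryTerms) { return 0.9 } // illustrative score
    return nil  // no deterministic hit; BM25 fallback takes over
}
```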

BM25 fallback. If the deterministic pass yields no results, standard BM25 ranking runs over name + description + parameter names. k1=1.5, b=0.75. A Jaro-Winkler boost (0.35×) and a domain match boost (0.1) are layered on top. Scores are normalized to [0.05, 0.79] so they stay below the deterministic tier's floor.
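The normalization step is what keeps the tiers ordered: whatever BM25 produces, the fallback must rank below any deterministic hit. A sketch, assuming simple min-max rescaling into the stated band:

```swift
// Rescale raw BM25 scores into [0.05, 0.79] so the fallback tier
// can never outrank a deterministic match.
func normalizeScores(_ raw: [Double],
                     into range: ClosedRange<Double> = 0.05...0.79) -> [Double] {
    guard let lo = raw.min(), let hi = raw.max(), hi > lo else {
        // Degenerate case: all scores equal; pin to the top of the band.
        return raw.map { _ in range.upperBound }
    }
    let span = range.upperBound - range.lowerBound
    return raw.map { range.lowerBound + ($0 - lo) / (hi - lo) * span }
}
```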

User says "send a message" → tools.discover runs with query "send message" → ToolDiscoveryIndex tries the deterministic pass, then the BM25 fallback → the matching schema is cached for the session and returned → next step: the tool is available. (The discovery call itself is exempt from the tool budget.)

The tools.discover call itself is exempt from the business tool-call budget: AgentLoop increments the turn counter only for actual business tools, not for meta-tools like discovery. This prevents a discovery call from consuming one of the three allowed tool calls per turn.
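In sketch form, the exemption is just a guard on the counter (the type and method names here are hypothetical):

```swift
// Hypothetical turn accounting: meta-tools never consume the budget.
struct ToolBudget {
    let maxBusinessCalls = 3  // three business tool calls per turn
    private(set) var used = 0
    private let metaTools: Set<String> = ["tools.discover"]

    // Returns true if the call is allowed to proceed.
    mutating func recordCall(_ toolName: String) -> Bool {
        if metaTools.contains(toolName) { return true }  // exempt, always allowed
        guard used < maxBusinessCalls else { return false }
        used += 1
        return true
    }
}
```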

Prompt refresh per generation step

The subtlest part of the design is the prompt refresh mechanism. When the model discovers a tool on step N, that schema needs to be visible on step N+1. AgentLoop rebuilds the system prompt before every generation step using the current session's discovered tool set:

AgentLoop.swift
private func refreshSystemPromptIfNeeded() async {
    let prompt = await buildSystemPrompt()  // core + catalog + discovered
    guard promptHash(prompt) != lastPromptHash else { return }
    await conversationManager.updateSystemPrompt(prompt)
    lastPromptHash = promptHash(prompt)
}

The hash comparison means the refresh is free when nothing has changed — no re-generation, no actor contention. And ConversationManager.updateSystemPrompt(_:) was updated (in the codex review cycle) to call trimContextIfNeeded() immediately after, preventing a larger system prompt from silently pushing total context over the 32K token budget.
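The ordering that review fix enforces can be sketched as follows. The real ConversationManager is an actor; a plain class keeps the sketch synchronous, and every name inside is illustrative apart from updateSystemPrompt(_:), trimContextIfNeeded(), the 32K budget, and the 0.3 tokens/char estimate:

```swift
// Simplified sketch: growing the system prompt re-checks the budget at once.
final class ConversationManager {
    private(set) var systemPrompt = ""
    private(set) var turns: [String] = []
    let tokenBudget = 32_000

    private func estimatedTokens(_ s: String) -> Int {
        Int(Double(s.count) * 0.3)  // the 0.3 tokens/char heuristic
    }

    private var totalTokens: Int {
        estimatedTokens(systemPrompt) + turns.map(estimatedTokens).reduce(0, +)
    }

    func append(turn: String) {
        turns.append(turn)
        trimContextIfNeeded()
    }

    func updateSystemPrompt(_ prompt: String) {
        systemPrompt = prompt
        trimContextIfNeeded()  // the review fix: trim immediately on growth
    }

    private func trimContextIfNeeded() {
        // Drop the oldest turns until the estimate fits the budget.
        while totalTokens > tokenBudget, !turns.isEmpty {
            turns.removeFirst()
        }
    }
}
```

Without that call inside updateSystemPrompt(_:), a grown prompt would only be accounted for on the next append, leaving one step where total context silently exceeds the budget.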

What it feels like in practice

For most requests — calendar queries, opening an app, looking up a contact — nothing changes. Those tools are core and their schemas are always present. The model behaves identically to before.

For a first-time deferred-tool request in a session ("send a message to Mom"), the loop adds one extra step: tools.discover → schema cached → next step executes messages.send. After that, within the same session, any repeated request for Messages goes directly to execution. The cache accumulates across turns but clears when the session ends.
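The cache itself can be as simple as a set on the session object (names hypothetical):

```swift
// Hypothetical per-session discovery cache.
final class Session {
    private(set) var discoveredTools: Set<String> = []

    func markDiscovered(_ names: [String]) {
        discoveredTools.formUnion(names)  // accumulates across turns
    }

    func end() {
        discoveredTools.removeAll()       // cleared when the session ends
    }
}
```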

A dedicated test guards this budget automatically: the initial tool block must stay at or below 45% of the full all-tools baseline. It currently passes with ~5% margin.
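That guardrail amounts to an assertion like the following (the character counts are the approximations from earlier in the post; this is not the actual test):

```swift
// Hypothetical guardrail: initial tool block must stay ≤ 45% of baseline.
let flatAllToolsChars = 4_000   // full 41-schema block (approx.)
let initialBlockChars = 1_500   // core schemas + compact catalog (approx.)
let ratio = Double(initialBlockChars) / Double(flatAllToolsChars)
assert(ratio <= 0.45, "initial tool block exceeds 45% of the flat baseline")
```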

Numbers at a glance

Core tools: 6
Deferred tools: 35
Token reduction at session start: −62%