Cost and Latency

    Agents that are too slow or too expensive don't ship. These constraints shape everything.


    Your Mental Model

    Frame cost as investment, not expense. The question isn't "is this expensive?" but "what's the cost of NOT using it?" When agents ship 10× faster, the API bill becomes a productivity multiplier, not a line item.


    Connections

    • To Model: How does model choice affect cost/latency?
    • To Context: How does context size affect cost/latency?
    • To Evaluation: How do you include cost/latency in your eval metrics?
    • To Orchestrator Pattern (Capability Minimization): restricting subagent tools reduces context size → lower token cost per agent
    • To Scaling Tool Use: Dynamic tool discovery (85% token reduction) and programmatic orchestration (37% token reduction) are production-tested optimization patterns

    The Economics of Agent-Assisted Development

    Real-World ROI: What $12K/Month Buys

    [2025-12-10]: Concrete numbers from a 3-person engineering team running Claude Code in production:

    The Investment:

    • ~$12,000/month in API costs
    • ~$4,000 per engineer per month

    The Return:

    • Week's worth of work shipped daily per engineer
    • 10× productivity multiplier on feature delivery
    • Tasks estimated at 3-5 days completed in 7 hours
    • 35,000 lines of code generated and integrated in a single session

    The Reframing:

    Traditional question: "Can we afford $12K/month in API costs?"

    Correct question: "Can we afford to ship 10× slower?"

    For a team with loaded costs of ~$150K/year per engineer ($12.5K/month), spending $4K/month to 10× their output means:

    • Effective cost per unit of work drops by 90%
    • Each engineer delivers the output of 10 engineers
    • The $4K API cost buys you $120K worth of equivalent engineering capacity
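
    A rough worked version of this arithmetic, using only the figures quoted above (loaded cost, per-engineer API spend, and the claimed 10× multiplier). The rounding differs slightly from the ~90% and ~$120K figures, but it lands in the same range; the variable names and the "unit of work" framing are illustrative:

    ```python
    # Back-of-the-envelope ROI arithmetic using the figures quoted above.
    loaded_cost_per_engineer = 12_500   # $/month, from ~$150K/year loaded cost
    api_cost_per_engineer = 4_000       # $/month, from the $12K / 3-engineer example
    productivity_multiplier = 10        # claimed output multiplier

    # One unit of work = one engineer-month of pre-agent output.
    baseline_cost_per_unit = loaded_cost_per_engineer
    agent_cost_per_unit = (loaded_cost_per_engineer + api_cost_per_engineer) / productivity_multiplier
    reduction = 1 - agent_cost_per_unit / baseline_cost_per_unit

    # Equivalent engineering capacity purchased by the API spend.
    equivalent_capacity = productivity_multiplier * loaded_cost_per_engineer

    print(f"Cost per unit of work: ${baseline_cost_per_unit:,.0f} -> ${agent_cost_per_unit:,.0f} "
          f"({reduction:.0%} reduction)")
    print(f"${api_cost_per_engineer:,}/month of API spend ≈ ${equivalent_capacity:,}/month of output")
    ```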

    Productivity Benchmarks:

    From actual development sessions:

    • Code generation: 35K LOC in 7 hours (previously estimated at 3-5 days)
    • Feature velocity: Daily shipping cadence vs. weekly estimates
    • Context switching: Agents handle tedious implementation while engineers focus on architecture

    The Hidden Costs of NOT Using Agents

    Opportunity cost compounds:

    • Shipping 10× slower means 10× fewer customer deployments
    • 10× fewer A/B tests run
    • 10× fewer bugs found and fixed
    • 10× less market feedback incorporated

    Engineering team economics:

    • Hiring additional engineers: ~6 months to recruit, onboard, and ramp
    • Agent productivity: Available immediately, scales instantly
    • Knowledge retention: Agents don't leave; patterns persist in prompts

    When Cost Actually Matters

    Cost becomes the constraint when:

    • You're running batch processing at massive scale (millions of documents)
    • Latency requirements force you to over-provision for peak load
    • You're building consumer products with thin margins

    Cost is rarely the constraint when:

    • Building internal tooling (developer time >> API costs)
    • Shipping customer features (revenue impact >> API costs)
    • Prototyping and validation (speed to learning >> API costs)

    Measuring What Matters

    Poor metrics:

    • Total API spend (no context on value delivered)
    • Cost per token (optimizes the wrong thing)
    • Agent invocation count (ignores quality)

    Better metrics:

    • Cost per feature shipped
    • Cost per bug fixed
    • API cost as % of engineer loaded cost
    • Time-to-delivery improvement vs. baseline

    Best metric:

    • Revenue or value delivered per dollar of API cost

    For a SaaS product, if agents help ship a feature that generates $50K/year in revenue, the $1K in API costs to build it is a 50× return.
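
    A minimal sketch of tracking the better metrics, assuming features are logged with the API spend and annual value attributed to them (the dataclass, field names, and sample feature are hypothetical):

    ```python
    from dataclasses import dataclass

    @dataclass
    class ShippedFeature:
        name: str
        api_cost: float      # dollars of API spend attributed to building it
        annual_value: float  # revenue or internal value attributed per year

    def value_per_api_dollar(features: list[ShippedFeature]) -> float:
        """The 'best metric' above: value delivered per dollar of API cost."""
        total_cost = sum(f.api_cost for f in features)
        total_value = sum(f.annual_value for f in features)
        return total_value / total_cost if total_cost else float("inf")

    # Mirrors the $50K/year feature built for $1K of API cost.
    features = [ShippedFeature("usage-based billing", api_cost=1_000, annual_value=50_000)]
    print(f"{value_per_api_dollar(features):.0f}x return per API dollar")  # -> 50x
    ```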


    Optimization Techniques You've Used

    Document specific optimizations: what worked and what the tradeoffs were.

    Multi-Agent: Trading Tokens for Quality

    [2025-12-09]: Multi-agent architectures trade tokens for deterministic quality. The numbers from academic research are striking:

    What You Get:

    • 80× improvement in action specificity
    • 100% actionable recommendation rate (vs. 1.7% for single-agent)
    • 140× improvement in solution correctness in some domains
    • Zero quality variance across trials—deterministic outcomes

    What You Pay:

    • ~15× more tokens than single-agent approaches
    • Token usage explains 80% of performance variance

    The Surprising Finding: Architectural value lies in deterministic quality, not speed. Both single and multi-agent achieve similar latency (~40s for complex research tasks). You're not parallelizing for speed—you're parallelizing for quality and reliability.

    Decision Framework: Does your task benefit from parallel analysis by specialized experts AND require near-zero quality variance? Then multi-agent is worth the token cost. For simple tasks, it's overkill. The breakeven point is somewhere around "complex enough that a single agent would need multiple passes anyway."
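
    One way to make that breakeven concrete is to compare expected token spend: if a single agent needs several passes to reach acceptable quality, the headline ~15× premium shrinks. A sketch under that assumption; the 15× multiplier comes from the figures above, while the pass counts and base token figure are illustrative:

    ```python
    def expected_tokens(single_pass_tokens: int, passes: int = 1,
                        multi_agent: bool = False, multi_agent_multiplier: float = 15.0) -> float:
        """Rough estimate: multi-agent pays ~15x once; single-agent pays once per pass."""
        if multi_agent:
            return single_pass_tokens * multi_agent_multiplier
        return single_pass_tokens * passes

    base = 20_000  # illustrative tokens for one single-agent pass on a complex task
    multi = expected_tokens(base, multi_agent=True)
    for passes in (1, 3, 5):
        single = expected_tokens(base, passes=passes)
        print(f"{passes} single-agent pass(es): {single:,.0f} tokens "
              f"vs multi-agent {multi:,.0f} ({multi / single:.1f}x premium)")
    ```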

    Anthropic's Production Numbers: Their multi-agent research system showed 90.2% improvement over single-agent Claude Opus 4 on internal research evals, with 90% reduction in research time for complex queries. Lead agent spawns 3-5 subagents in parallel; each subagent uses 3+ concurrent tools.

    Sources: Multi-Agent LLM Orchestration Achieves Deterministic Decision Support, Anthropic: How we built our multi-agent research system

    Token Cost Models by Feature Type

    [2025-12-09]: Different agent feature types—tools, Skills, subagents, and MCP servers—have radically different token cost profiles. Understanding these profiles shapes architectural decisions.

    Cost Profiles by Feature Type:

    Feature Type      | Tokens per Invocation     | Primary Cost Driver
    ------------------|---------------------------|---------------------------------------------
    Traditional Tools | ~100 tokens               | Call overhead (parameters + results)
    Skills            | ~1,500+ tokens            | Discovery metadata + execution context
    Subagents         | Full conversation history | Isolated context per subagent
    MCP Servers       | 10,000+ tokens            | Rich integration schemas + persistent state

    Frequency-Depth Trade-offs:

    The right choice depends on usage frequency and task complexity:

    • One-time actions: Tools minimize cost. For a single database query or file read, ~100 tokens is the floor.
    • Weekly+ repeatable workflows: Skills' higher activation cost (~1,500 tokens) pays for itself across repeated uses. If you invoke something 3+ times per week, Skills' progressive disclosure becomes cost-effective.
    • Parallel work streams: Subagents' context isolation prevents main agent bloat. Instead of one agent with 100k tokens of mixed concerns, you get multiple 20k-token focused contexts.
    • Continuous data access: MCP's upfront cost (tens of thousands of tokens for schema exchange) enables persistent connectivity. For integrations like GitHub that need sustained interaction, this beats repeated tool calls.
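
    A quick way to see these trade-offs in token terms is to project monthly spend from invocation frequency. The per-invocation figures come from the cost-profile table above; the frequencies, the subagent context size, and treating MCP schema exchange as a per-session cost are assumptions for illustration:

    ```python
    # Rough monthly token projection per feature type, by invocation frequency.
    TOKENS_PER_INVOCATION = {
        "tool": 100,         # from the cost-profile table above
        "skill": 1_500,
        "subagent": 20_000,  # assumed size of one isolated, focused subagent context
        "mcp": 10_000,       # schema exchange, assumed to be paid once per session
    }

    def monthly_tokens(feature: str, invocations_per_week: float, weeks_per_month: float = 4.3) -> float:
        return TOKENS_PER_INVOCATION[feature] * invocations_per_week * weeks_per_month

    # Illustrative frequencies: frequent tool calls, a weekly Skill, a few subagents, one MCP session/week.
    for feature, per_week in [("tool", 50), ("skill", 3), ("subagent", 5), ("mcp", 1)]:
        print(f"{feature:8s} @ {per_week:>2}/week ≈ {monthly_tokens(feature, per_week):,.0f} tokens/month")
    ```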

    Discoverability vs. Cost:

    There's a token cost hierarchy for discoverability:

    • No-overhead (Tools): Minimal cost but requires explicit invocation. The agent must know to call the tool.
    • Progressive disclosure (Skills): Metadata-based discovery costs tokens upfront (~1,500 per activation) but enables autonomous activation. The agent can discover "I should use this Skill" from context.
    • Eager loading (MCP): High upfront cost (full schema exchange) but complete discoverability. The agent sees all available operations at once.

    Decision Framework:

    Ask these questions when choosing feature types:

    1. How often will this be invoked? One-time → Tool. Weekly+ → Skill. Continuous → MCP.
    2. Does the agent need to discover it autonomously? No → Tool. Yes → Skill or MCP.
    3. Is context isolation valuable? Yes → Subagent. No → Tool/Skill.
    4. How rich is the integration schema? Simple → Tool. Complex → MCP.
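
    The four questions translate directly into a small helper. A sketch only; the priority order and boolean flags are one way to encode the questions, not a canonical rule:

    ```python
    def choose_feature_type(needs_context_isolation: bool,
                            continuous_data_access: bool,
                            rich_integration_schema: bool,
                            weekly_repeatable_workflow: bool,
                            needs_autonomous_discovery: bool) -> str:
        """Questions 1-4 above, encoded as a simple priority order (illustrative)."""
        if needs_context_isolation:                      # Q3: isolation valuable -> Subagent
            return "subagent"
        if continuous_data_access or rich_integration_schema:  # Q1 "continuous" / Q4 "complex" -> MCP
            return "mcp"
        if weekly_repeatable_workflow or needs_autonomous_discovery:  # Q1 "weekly+" / Q2 -> Skill
            return "skill"
        return "tool"                                    # one-time, simple, explicitly invoked

    # The same answers the code-analysis example below arrives at:
    print(choose_feature_type(False, False, False, False, False))  # file reading          -> tool
    print(choose_feature_type(False, False, False, True,  True))   # security audit        -> skill
    print(choose_feature_type(True,  False, False, False, False))  # background research   -> subagent
    print(choose_feature_type(False, True,  True,  False, False))  # GitHub issue tracking -> mcp
    ```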

    Real-World Example:

    For a code analysis workflow:

    • File reading: Tool (~100 tokens/call, frequent, simple)
    • Security audit: Skill (~1,500 tokens, weekly, needs discovery)
    • Background research: Subagent (isolated context, parallel to main work)
    • GitHub issue tracking: MCP (persistent connection, rich schema)

    The token cost model directly reflects the architectural value provided. Tools are cheap because they're simple. Skills cost more because they're autonomous. Subagents burn tokens for isolation. MCP pays upfront for integration depth.

    Sources: Simon Willison: Claude Skills, Claude Skills Deep Dive, Claude Code: Skills Documentation

    See Also: