Why Agent Costs Spiral Out of Control
A single agent run is not a single API call. Every step of an agentic loop sends the full conversation history — system prompt, all previous messages, all tool results — back to the model. A research agent that runs 10 iterations makes 10 API calls, each one larger than the last.
The Three Cost Multipliers
1. Long system prompts. A 3,000-token system prompt gets sent with every API call in the loop. At 10 iterations, that is 30,000 tokens just for the system prompt — before the agent has done anything. Prompt caching (Section 3) eliminates most of this cost.
2. Accumulating context window. Tool results get appended to the conversation history. A web search returning 2,000 tokens of results, run 5 times, adds 10,000 tokens to every subsequent call. Context trimming (Section 5) prevents this from spiraling.
3. Using the wrong model. If you use Claude Opus for routing tasks that could run on Haiku, you are spending 18x more than you need to. Tiered model selection (Section 6) is the single biggest cost lever available.
Worked Example: Realistic Agent Cost Without Optimization
| Iteration | System Prompt | Conversation History | Tool Results | Total Input Tokens | Output Tokens |
|---|---|---|---|---|---|
| 1 | 3,000 | 200 | 0 | 3,200 | 350 |
| 2 | 3,000 | 750 | 1,500 | 5,250 | 300 |
| 3 | 3,000 | 1,500 | 3,000 | 7,500 | 400 |
| 4 | 3,000 | 2,400 | 4,500 | 9,900 | 350 |
| 5 | 3,000 | 3,500 | 6,000 | 12,500 | 450 |
| Total | 15,000 | 8,350 | 15,000 | 38,350 | 1,850 |
At Sonnet pricing ($3 input / $15 output per million tokens): 38,350 input tokens = $0.115 and 1,850 output tokens = $0.028 — roughly $0.14 per agent run. That sounds small. At 1,000 runs/day it is $140/day, $4,200/month. With optimization, the same workload can cost under $500/month.
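The arithmetic can be reproduced in a few lines, using the per-run totals from the table above:

```python
# Reproduce the worked example: 38,350 input / 1,850 output tokens per run
# at Sonnet pricing ($3 input, $15 output per million tokens).
INPUT_PRICE = 3.00 / 1_000_000
OUTPUT_PRICE = 15.00 / 1_000_000

input_tokens, output_tokens = 38_350, 1_850
cost_per_run = input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

print(f"Per run:   ${cost_per_run:.4f}")          # ~ $0.14
print(f"Per day:   ${cost_per_run * 1_000:.2f}")  # at 1,000 runs/day
print(f"Per month: ${cost_per_run * 30_000:.2f}")
```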
Model Selection Strategy
The single most impactful decision you make is which model you use for each task. There is an 18x price difference between Haiku and Opus. For tasks where Haiku performs equally well — classification, formatting, simple Q&A, routing — using Opus is pure waste.
| Model | Input (per M tokens) | Output (per M tokens) | Speed | Best For |
|---|---|---|---|---|
| claude-opus-4-6 | $15.00 | $75.00 | Slower | Complex reasoning, multi-step analysis, high-stakes decisions, creative tasks requiring depth |
| claude-sonnet-4-6 | $3.00 | $15.00 | Fast | Most production workloads, coding, structured output, tool use, general agents |
| claude-haiku-4-5 | $0.80 | $4.00 | Fastest | Classification, intent routing, simple extraction, summarization, high-volume tasks |
Model Selection Decision Tree
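In code form, the decision tree amounts to a small routing function. A sketch, with illustrative task-type names (only the model IDs come from the pricing table above):

```python
def choose_model(task_type: str, high_stakes: bool = False) -> str:
    """Sketch of the selection logic; task-type names are illustrative."""
    simple = {"classification", "routing", "extraction", "summarization", "faq"}
    complex_tasks = {"deep_research", "strategy", "multi_step_reasoning"}

    if high_stakes or task_type in complex_tasks:
        return "claude-opus-4-6"      # complex reasoning, high-stakes decisions
    if task_type in simple:
        return "claude-haiku-4-5"     # high-volume, simple tasks
    return "claude-sonnet-4-6"        # default for most production workloads

print(choose_model("routing"))   # claude-haiku-4-5
print(choose_model("coding"))    # claude-sonnet-4-6
print(choose_model("strategy"))  # claude-opus-4-6
```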
Prompt Caching — Save Up to 90%
Prompt caching is the highest-leverage optimization for agents with long system prompts. You mark a portion of your system prompt with cache_control. The first call pays full price (plus a small cache-write premium) to write the cache. Every subsequent call within the TTL (Time To Live) window pays only 10% of the normal input price to read from cache — a 90% discount.
For a 3,000-token system prompt sent 100 times per day at Sonnet pricing: without caching, that is 300,000 tokens × $3/M = $0.90/day just for the system prompt. With caching, it is 3,000 tokens at full price (the cache write) + 99 × 3,000 tokens at $0.30/M (cache reads at 10%) = ~$0.10/day — roughly a 9x reduction on that portion of cost.
import anthropic
client = anthropic.Anthropic()
# With prompt caching — mark the system prompt with cache_control
# The system prompt costs 10% on cache hits (within the TTL window)
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=[
{
"type": "text",
"text": """You are an expert financial analyst with deep knowledge of
public company earnings, balance sheets, and market dynamics.
When analyzing companies, always consider:
- Revenue growth trajectory (YoY and QoQ)
- Gross margin trends and drivers
- Operating leverage and EBITDA margin expansion
- Free cash flow conversion from net income
- Balance sheet health: net debt/EBITDA, current ratio
- Competitive positioning: market share, moat characteristics
- Management guidance credibility vs historical accuracy
- Key risks: regulatory, competitive, macro sensitivity
Format your analysis in structured sections with clear headers.
Support every claim with specific numbers from the provided data.
Always state your assumptions explicitly.
[... the rest of your 2,000+ token system prompt ...]
""",
"cache_control": {"type": "ephemeral"} # Cache this!
}
],
messages=[{"role": "user", "content": "Analyze Apple's Q4 2024 earnings: ..."}]
)
# Check cache performance in the usage metadata
usage = response.usage
print(f"Input tokens: {usage.input_tokens:,}")
print(f"Cache write tokens: {getattr(usage, 'cache_creation_input_tokens', 0):,} (full price, first call)")
print(f"Cache read tokens: {getattr(usage, 'cache_read_input_tokens', 0):,} (10% price, cache hit)")
print(f"Output tokens: {usage.output_tokens:,}")
# Calculate actual cost for this call
SONNET_INPUT_PRICE = 3.0 / 1_000_000 # $3 per million
SONNET_CACHE_PRICE = 0.30 / 1_000_000 # $0.30 per million (10% of input)
SONNET_OUTPUT_PRICE = 15.0 / 1_000_000 # $15 per million
cache_write = getattr(usage, 'cache_creation_input_tokens', 0)
cache_read = getattr(usage, 'cache_read_input_tokens', 0)
# usage.input_tokens reports only the uncached input tokens; cache writes
# and reads are reported separately, so no subtraction is needed.
regular_input = usage.input_tokens
cost = (
regular_input * SONNET_INPUT_PRICE +
cache_write * SONNET_INPUT_PRICE + # Write costs full price
cache_read * SONNET_CACHE_PRICE + # Read costs 10%
usage.output_tokens * SONNET_OUTPUT_PRICE
)
print(f"Estimated cost: ${cost:.6f}")
Cache Cost Comparison: 100 Agent Runs Per Day
| Scenario | System Prompt Tokens | Token Cost | Daily Cost (100 runs) | Monthly Cost |
|---|---|---|---|---|
| No caching | 3,000 per call × 100 | 300,000 × $3/M | $0.90 | $27.00 |
| With caching (5 min TTL) | 1 write + 99 reads at 10% | 3,000 full + 297,000 × $0.30/M | $0.098 | $2.94 |
| Extended caching (1 hr TTL) | ~1 write per hour | Fewer cache misses | ~$0.05 | ~$1.50 |
Good candidates for caching: long system prompts that define agent behavior and don't change between runs; RAG documents — shared context like product documentation, internal knowledge bases, or policy documents; and few-shot examples that appear at the top of every prompt. The cache_control marker can appear multiple times in a prompt to cache different segments at different depths.
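Multiple cache breakpoints can be sketched as two system blocks: stable instructions in one, a large shared document in another, so updating one does not invalidate the other. The block contents below are placeholders.

```python
# Two cache breakpoints in the system parameter (contents are placeholders).
system_blocks = [
    {
        "type": "text",
        "text": "You are a support agent for Acme Corp. <stable instructions>",
        "cache_control": {"type": "ephemeral"},  # breakpoint 1: agent behavior
    },
    {
        "type": "text",
        "text": "<shared product documentation, several thousand tokens>",
        "cache_control": {"type": "ephemeral"},  # breakpoint 2: shared context
    },
]

# Passed exactly like the single-block example earlier:
# client.messages.create(model="claude-sonnet-4-6", max_tokens=1024,
#                        system=system_blocks, messages=[...])
```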
Batching for Non-Real-Time Tasks
Anthropic's Message Batches API gives you a 50% discount on all requests submitted as a batch. The tradeoff: results are returned asynchronously, potentially up to 24 hours later. This is the right tool for any work that does not need an immediate response: daily data processing, bulk document analysis, overnight report generation, scheduled research tasks.
If you run 10,000 classification tasks per day with Haiku, that is roughly $8/day normally. With the Batch API, it is $4/day — saving $120/month with zero code change beyond the batching wrapper.
import anthropic
import time
import json
client = anthropic.Anthropic()
# Your documents to process
documents_to_analyze = [
"Customer complaint about shipping delay on order #4821...",
"Positive review: product exceeded expectations, would recommend...",
"Refund request for defective item received on Jan 15...",
# ... up to 10,000 items per batch
]
# 1. Create the batch — all requests submitted at once
print(f"Submitting batch of {len(documents_to_analyze)} requests...")
batch = client.messages.batches.create(
requests=[
{
"custom_id": f"doc-{i}", # Your ID to match results to inputs
"params": {
"model": "claude-haiku-4-5-20251001",
"max_tokens": 100,
"messages": [{
"role": "user",
"content": (
f"Classify this customer message. Reply with ONLY one of: "
f"complaint, compliment, refund_request, question, other\n\n"
f"Message: {doc}"
)
}]
}
}
for i, doc in enumerate(documents_to_analyze)
]
)
print(f"Batch created: {batch.id}")
print(f"Status: {batch.processing_status}")
# 2. Poll until complete (in production, use a webhook or scheduled job instead)
while True:
status = client.messages.batches.retrieve(batch.id)
print(f"Status: {status.processing_status} | "
f"Succeeded: {status.request_counts.succeeded} | "
f"Errored: {status.request_counts.errored}")
if status.processing_status == "ended":
break
time.sleep(60) # Check every minute
# 3. Retrieve and process results
results = {}
for result in client.messages.batches.results(batch.id):
if result.result.type == "succeeded":
classification = result.result.message.content[0].text.strip()
results[result.custom_id] = classification
else:
# Handle errors
results[result.custom_id] = f"ERROR: {result.result.error.type}"
# 4. Print summary
from collections import Counter
counts = Counter(results.values())
print("\nClassification results:")
for label, count in counts.most_common():
print(f" {label}: {count}")
print(f"\nTotal processed: {len(results)}")
print(f"Estimated cost vs real-time: ~50% savings")
Do not use batching for real-time user interactions, anything that requires a response within a few seconds, or tasks where you need to react to early results before processing all items. Batching is for scheduled, bulk, or background workloads only.
Token Budgeting
Every token you send or receive costs money. Most developers leave significant savings on the table through two habits: setting max_tokens to a default large number regardless of task, and letting conversation history grow unbounded across iterations.
Setting max_tokens by Task Type
| Task Type | Recommended max_tokens | Reasoning |
|---|---|---|
| Classification / routing | 10–50 | You need one word or a short phrase |
| Simple Q&A / extraction | 200–500 | A few sentences, not a full essay |
| Tool call reasoning (in agentic loop) | 500–1024 | Reasoning + tool call, not final output |
| Standard agent response | 1024–2048 | Detailed but focused output |
| Long-form writing / analysis | 2048–4096 | Only when you actually need a long response |
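The table above can be encoded as a simple lookup. The task-type names here are illustrative, not an API concept:

```python
# Per-task-type output budgets, following the table above.
MAX_TOKENS_BY_TASK = {
    "classification": 50,
    "extraction": 500,
    "tool_reasoning": 1024,
    "agent_response": 2048,
    "long_form": 4096,
}

def max_tokens_for(task_type: str) -> int:
    # Unknown task types fall back to a moderate budget.
    return MAX_TOKENS_BY_TASK.get(task_type, 1024)

print(max_tokens_for("classification"))  # 50
print(max_tokens_for("unknown_task"))    # 1024
```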
Trimming Conversation History
Without trimming, your agent's context window grows with every tool call. A 10-step agent can easily accumulate 30,000+ tokens of history — most of it earlier reasoning steps the model no longer needs. The trimmer below keeps recent messages within a token budget while always preserving the first user message (the original task).
def estimate_tokens(msg: dict) -> int:
"""
Rough token estimate: 4 characters ≈ 1 token.
Use tiktoken or Anthropic's token counting endpoint for accuracy.
"""
content = msg.get('content', '')
if isinstance(content, list):
# Handle structured content (tool results, etc.)
text = ' '.join(
item.get('text', '') if isinstance(item, dict) else str(item)
for item in content
)
else:
text = str(content)
return max(1, len(text) // 4)
def trim_conversation_history(
messages: list,
max_tokens: int = 50_000,
always_keep_first: bool = True
) -> list:
"""
Trim conversation history to stay within a token budget.
Strategy:
- Always keep the first user message (original task context)
- Always keep the most recent N messages
- Remove messages from the middle when over budget
Args:
messages: Full conversation history
max_tokens: Maximum total tokens to allow (rough estimate)
always_keep_first: If True, never remove messages[0]
Returns:
Trimmed message list
"""
if not messages:
return messages
total = sum(estimate_tokens(m) for m in messages)
if total <= max_tokens:
return messages # No trimming needed
print(f"[trim] Conversation at ~{total:,} tokens, trimming to {max_tokens:,}...")
if not always_keep_first:
# Simple approach: keep most recent messages within budget
result = []
budget = max_tokens
for msg in reversed(messages):
tokens = estimate_tokens(msg)
if budget - tokens > 0:
result.insert(0, msg)
budget -= tokens
else:
break
return result
# Keep first message, fill from the end
first_msg = messages[0]
remaining_messages = messages[1:]
first_tokens = estimate_tokens(first_msg)
budget = max_tokens - first_tokens
result = []
for msg in reversed(remaining_messages):
tokens = estimate_tokens(msg)
if budget - tokens > 0:
result.insert(0, msg)
budget -= tokens
else:
break
trimmed = [first_msg] + result
trimmed_total = sum(estimate_tokens(m) for m in trimmed)
removed = len(messages) - len(trimmed)
if removed > 0:
print(f"[trim] Removed {removed} messages. New size: ~{trimmed_total:,} tokens")
return trimmed
# Usage in your agent loop:
messages = trim_conversation_history(messages, max_tokens=50_000)
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
messages=messages
)
Tiered Systems — Cheap Models Route, Expensive Models Reason
The most powerful cost optimization is architectural: do not use one model for everything. Use Haiku to classify and route requests, Sonnet for the bulk of work, and Opus only for the tasks that genuinely require its capabilities.
In most production agent systems, roughly 50–70% of requests can be handled by Haiku once properly classified — because many user requests are actually simple lookups, yes/no questions, or straightforward transformations, not complex reasoning tasks.
Architecture: Intent Classification Pipeline
| Tier | Model | % of Requests | Task Types | Cost vs All-Opus |
|---|---|---|---|---|
| Routing | Haiku | 100% (all requests classified here) | Intent classification only, tiny output | -98% |
| Simple | Haiku | ~50% | FAQ, extraction, formatting, simple Q&A | -95% |
| Standard | Sonnet | ~45% | Tool use, coding, analysis, general agent work | -80% |
| Complex | Opus | ~5% | Deep research, strategy, multi-step reasoning | Baseline |
import anthropic
client = anthropic.Anthropic()
# Pricing per million tokens (approximate, check Anthropic console for current rates)
MODEL_PRICING = {
'claude-haiku-4-5-20251001': {'input': 0.80, 'output': 4.00},
'claude-sonnet-4-6': {'input': 3.00, 'output': 15.00},
'claude-opus-4-6': {'input': 15.00, 'output': 75.00},
}
def classify_complexity(task: str) -> str:
"""
Use cheap Haiku to classify task complexity.
    Costs a fraction of a cent per classification (on the order of $0.0001).
"""
response = client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=20, # We only need one word
messages=[{
"role": "user",
"content": f"""Classify this task complexity. Reply with ONLY one word: simple, moderate, or complex.
Rules:
- simple: factual lookup, yes/no, formatting, basic calculation, FAQ
- moderate: multi-step reasoning, code generation, analysis, tool use
- complex: deep research, strategic planning, multi-document synthesis, novel problem-solving
Task: {task}
Classification:"""
}]
)
result = response.content[0].text.strip().lower()
# Validate — default to moderate if unexpected response
return result if result in ('simple', 'moderate', 'complex') else 'moderate'
def select_model(complexity: str) -> str:
"""Map complexity to the appropriate model."""
return {
'simple': 'claude-haiku-4-5-20251001',
'moderate': 'claude-sonnet-4-6',
'complex': 'claude-opus-4-6',
}[complexity]
def run_tiered_agent(task: str, system: str = "", verbose: bool = True) -> dict:
"""
Route a task to the appropriate model based on complexity.
Returns result dict with model used, response, and estimated cost.
"""
    # Step 1: Classify the task with Haiku (costs a fraction of a cent)
complexity = classify_complexity(task)
model = select_model(complexity)
if verbose:
print(f"[tiered] complexity={complexity} → model={model}")
# Step 2: Run on the appropriate model
messages = [{"role": "user", "content": task}]
kwargs = {
"model": model,
"max_tokens": 2048,
"messages": messages
}
if system:
kwargs["system"] = system
response = client.messages.create(**kwargs)
# Step 3: Calculate cost
usage = response.usage
pricing = MODEL_PRICING[model]
cost = (
usage.input_tokens / 1_000_000 * pricing['input'] +
usage.output_tokens / 1_000_000 * pricing['output']
)
return {
"complexity": complexity,
"model": model,
"response": response.content[0].text,
"input_tokens": usage.input_tokens,
"output_tokens": usage.output_tokens,
"estimated_cost_usd": round(cost, 6)
}
# Example usage
tasks = [
"What is the capital of France?", # → Haiku (simple)
"Write a Python function to parse ISO 8601 timestamps", # → Sonnet (moderate)
"Develop a comprehensive go-to-market strategy for a B2B SaaS product targeting mid-market healthcare companies, including competitive positioning, pricing model analysis, and channel strategy" # → Opus (complex)
]
total_cost = 0
for task in tasks:
    preview = task if len(task) <= 80 else task[:80] + "..."
    print(f"Task: {preview}")
    result = run_tiered_agent(task)  # pass the full task; truncate only for display
total_cost += result['estimated_cost_usd']
print(f" Cost: ${result['estimated_cost_usd']:.6f} | Model: {result['model']}\n")
print(f"Total: ${total_cost:.6f}")
Real Cost Comparison Tables
Table 1: Cost per 1,000 Tasks
Assumes average 2,000 input tokens + 500 output tokens per task
| Strategy | Input Cost | Output Cost | Total / 1K Tasks | vs All-Opus |
|---|---|---|---|---|
| All Opus | 2M × $15/M = $30.00 | 500K × $75/M = $37.50 | $67.50 | Baseline |
| All Sonnet | 2M × $3/M = $6.00 | 500K × $15/M = $7.50 | $13.50 | -80% |
| All Haiku | 2M × $0.80/M = $1.60 | 500K × $4/M = $2.00 | $3.60 | -95% |
| Tiered (5% Opus / 45% Sonnet / 50% Haiku) | Blended | Blended | $11.25 | -83% |
| Tiered + Prompt Caching | System prompt at 10% | Same | $5.80 | -91% |
| Tiered + Caching + Batch API | 50% batch discount | 50% batch discount | $2.90 | -96% |
Table 2: Monthly Costs at Scale
Using "Tiered + Caching + Batching" strategy vs "All Sonnet" baseline
| Usage Level | Tasks / Month | All Sonnet Cost | Optimized Cost | Monthly Savings |
|---|---|---|---|---|
| Small (100/day) | 3,000 | $40.50 | $8.70 | $31.80 saved |
| Medium (1,000/day) | 30,000 | $405.00 | $87.00 | $318.00 saved |
| Large (10,000/day) | 300,000 | $4,050.00 | $870.00 | $3,180.00 saved |
| Enterprise (100K/day) | 3,000,000 | $40,500.00 | $8,700.00 | $31,800.00 saved |
Table 3: Cost per Run for Common Agent Types
Typical token usage per complete agent run (all iterations combined)
| Agent Type | Avg Input Tokens | Avg Output Tokens | Cost on Sonnet | Cost on Haiku | Cost Optimized |
|---|---|---|---|---|---|
| Customer support Q&A (3 turns) | 4,500 | 800 | $0.026 | $0.007 | $0.007 |
| Code review (single file) | 8,000 | 1,500 | $0.047 | $0.012 | $0.025 |
| Research agent (5 web searches) | 18,000 | 2,000 | $0.084 | $0.022 | $0.040 |
| Data analysis (DB + charts) | 12,000 | 3,000 | $0.081 | $0.022 | $0.038 |
| Long research report (10+ tools) | 45,000 | 6,000 | $0.225 | Not recommended | $0.095 |
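The per-run figures follow directly from the pricing table; for example, the research-agent row:

```python
# Spot-check one row of Table 3: research agent, 18,000 input / 2,000 output.
PRICING = {
    "sonnet": {"input": 3.00, "output": 15.00},  # $ per million tokens
    "haiku": {"input": 0.80, "output": 4.00},
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICING[model]
    return input_tokens / 1e6 * p["input"] + output_tokens / 1e6 * p["output"]

print(f"Sonnet: ${run_cost('sonnet', 18_000, 2_000):.3f}")  # $0.084
print(f"Haiku:  ${run_cost('haiku', 18_000, 2_000):.3f}")   # $0.022
```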
Monitoring and Alerting
Cost optimization is not a one-time setup — it is an ongoing practice. You need visibility into where your tokens are actually going. Often the most expensive parts of a system are not where you expect: a verbose tool result that adds 3,000 tokens to every iteration, a system prompt that grew to 5,000 tokens over six months of edits, a single user running 500 agent calls in a day.
import anthropic
from dataclasses import dataclass, field
from datetime import datetime
from collections import defaultdict
import json
@dataclass
class UsageTracker:
"""
Track token usage and costs per model, per user, per day.
Alerts when approaching budget limits.
"""
daily_budget_usd: float = 10.0 # Total daily budget
per_user_daily_limit: float = 1.0 # Per-user daily limit
# Pricing per million tokens (approximate, early 2026)
PRICING = {
'claude-opus-4-6': {'input': 15.0, 'output': 75.0},
'claude-sonnet-4-6': {'input': 3.0, 'output': 15.0},
'claude-haiku-4-5-20251001': {'input': 0.80, 'output': 4.0},
}
_log: list = field(default_factory=list)
def track(self, model: str, usage: anthropic.types.Usage,
user_id: str = "default", task_type: str = "unknown") -> float:
"""Record usage and return the cost of this call."""
prices = self.PRICING.get(model, {'input': 3.0, 'output': 15.0})
# Account for prompt caching if available
        cache_read = getattr(usage, 'cache_read_input_tokens', 0)
        cache_write = getattr(usage, 'cache_creation_input_tokens', 0)
        regular_input = usage.input_tokens  # input_tokens excludes cached tokens
cache_read_price = prices['input'] * 0.10 # Cache reads at 10%
cost = (
regular_input / 1_000_000 * prices['input'] +
cache_write / 1_000_000 * prices['input'] +
cache_read / 1_000_000 * cache_read_price +
usage.output_tokens / 1_000_000 * prices['output']
)
entry = {
'timestamp': datetime.now().isoformat(),
'user_id': user_id,
'model': model,
'task_type': task_type,
'input_tokens': usage.input_tokens,
'output_tokens': usage.output_tokens,
'cache_read_tokens': cache_read,
'cost_usd': cost
}
self._log.append(entry)
# Check budget after each call
self._check_alerts(user_id, cost)
return cost
def today_total(self) -> float:
"""Total spend today across all users."""
today = datetime.now().date().isoformat()
return sum(e['cost_usd'] for e in self._log if e['timestamp'][:10] == today)
def today_by_user(self) -> dict:
"""Today's spend grouped by user."""
today = datetime.now().date().isoformat()
totals = defaultdict(float)
for e in self._log:
if e['timestamp'][:10] == today:
totals[e['user_id']] += e['cost_usd']
return dict(totals)
def today_by_model(self) -> dict:
"""Today's spend grouped by model."""
today = datetime.now().date().isoformat()
totals = defaultdict(float)
for e in self._log:
if e['timestamp'][:10] == today:
totals[e['model']] += e['cost_usd']
return dict(totals)
def _check_alerts(self, user_id: str, new_cost: float):
"""Fire alerts when approaching budget limits."""
daily = self.today_total()
user_daily = self.today_by_user().get(user_id, 0)
if daily > self.daily_budget_usd * 0.80:
print(f"[BUDGET WARNING] Daily spend at ${daily:.2f} / "
f"${self.daily_budget_usd:.2f} ({daily/self.daily_budget_usd:.0%})")
if daily > self.daily_budget_usd:
raise RuntimeError(
f"[BUDGET EXCEEDED] Daily budget of ${self.daily_budget_usd:.2f} reached. "
f"Current: ${daily:.2f}. Set a higher limit in UsageTracker or "
f"add a hard spending limit at console.anthropic.com"
)
if user_daily > self.per_user_daily_limit:
print(f"[USER LIMIT] User '{user_id}' has spent ${user_daily:.4f} today "
f"(limit: ${self.per_user_daily_limit:.2f})")
def report(self) -> str:
"""Print a summary of today's usage."""
lines = [
f"=== Usage Report ({datetime.now().date()}) ===",
f"Total spend: ${self.today_total():.4f} / ${self.daily_budget_usd:.2f}",
"",
"By model:",
]
for model, cost in sorted(self.today_by_model().items(), key=lambda x: -x[1]):
short_model = model.split('-')[1] if '-' in model else model
lines.append(f" {short_model:12} ${cost:.4f}")
lines.extend(["", "Top users:"])
for user, cost in sorted(self.today_by_user().items(), key=lambda x: -x[1])[:5]:
lines.append(f" {user:20} ${cost:.4f}")
return "\n".join(lines)
# Usage in your agent:
tracker = UsageTracker(daily_budget_usd=50.0)
response = client.messages.create(model="claude-sonnet-4-6", ...)
cost = tracker.track(
model="claude-sonnet-4-6",
usage=response.usage,
user_id="user_12345",
task_type="research_agent"
)
print(f"This call cost: ${cost:.6f}")
print(tracker.report())
Your usage tracker is a soft limit in code. Always set a hard spending limit in the Anthropic console (console.anthropic.com → Settings → Billing → Spend Limits). This is your failsafe — it stops spending even if your code has a bug or an agent goes rogue. Set it 20–30% above your expected spend to avoid unexpected cutoffs.
Cost Optimization Checklist
Quick-reference list of optimizations ranked roughly by impact. Implement the top items first — they give the most savings for the least effort.
| # | Optimization | How | Typical Impact |
|---|---|---|---|
| 1 | Use Haiku for routing and classification | Route every incoming request through a Haiku classifier before deciding which model handles it. Classification prompts use <50 output tokens. | -80–95% |
| 2 | Enable prompt caching for long system prompts | Add cache_control: {"type": "ephemeral"} to any system prompt over 1,000 tokens. Cache hits cost 10% of normal input price. | -60–90% |
| 3 | Use the Batch API for non-real-time work | Daily reports, bulk analysis, scheduled research — anything that can wait up to 24 hours gets a 50% flat discount automatically. | -50% |
| 4 | Trim conversation history | Set a token budget on your message history and trim aggressively. Agents rarely need more than the last 5–6 turns plus the original task. | -20–50% |
| 5 | Set appropriate max_tokens per task type | Classification needs 10 tokens. Routing needs 50. Stop defaulting to 4096 for everything. Unused max_tokens don't cost money, but overshooting encourages verbose output. | -15–30% |
| 6 | Cache tool results within a session | If an agent searches for the same query twice in one run, return the cached result. A simple dict keyed on tool name + serialized arguments is enough. | -10–30% |
| 7 | Use streaming to stop generation early | Stream responses and stop consuming tokens once you have what you need (e.g., once you detect a JSON object is complete). You pay for tokens generated, not max_tokens. | -5–25% |
| 8 | Pre-filter requests with Haiku before expensive calls | Use Haiku to check "Is this request actually within scope?" or "Does this need a tool call at all?" before sending to Sonnet or Opus. | -30–60% |
| 9 | Add max_iterations to all agent loops | A bug or an adversarial prompt can cause an agent to loop indefinitely. Hard cap at 10–20 iterations; most tasks need fewer than 5. | Prevents runaway cost |
| 10 | Monitor per-user usage | One user testing your agent aggressively (or one bug in a loop) can consume your entire daily budget. Track usage by user ID and alert on outliers. | Budget protection |
| 11 | Set hard spending limits in the Anthropic console | Code bugs happen. Set a hard monthly cap at console.anthropic.com. This is your last line of defense against a runaway agent or billing accident. | Last-resort failsafe |
| 12 | Summarize long tool results before adding to context | A web search returning 5,000 tokens of HTML? Run a cheap Haiku call to extract a 200-token summary before adding it to the agent's conversation history. | -20–40% |
| 13 | Batch similar requests to maximize cache hits | When sending similar requests (same system prompt, different user inputs), group them together in time. Cache entries expire after 5 minutes by default — send within that window to share cache hits. | +10–20% cache efficiency |
| 14 | Profile your actual token usage to find real bottlenecks | Log every call with input_tokens, output_tokens, model, and task type. In most systems, 20% of task types account for 80% of cost. Fix those first, not the cheap ones. | Identifies highest-impact fixes |
| 15 | Consider fine-tuning for specific high-volume tasks | For very specific, high-volume tasks (e.g., classifying your exact support ticket taxonomy), fine-tuning a smaller model can match larger-model quality at a fraction of the cost. Only viable at 100K+ monthly tasks. | -70–90% at scale |
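Item 6 needs nothing more than a dict keyed on tool name plus serialized arguments. A minimal sketch, with a stand-in search tool:

```python
import json

class ToolResultCache:
    """In-session memo for tool calls: same tool + same args returns the cached result."""

    def __init__(self):
        self._cache = {}
        self.hits = 0

    def call(self, tool_name: str, args: dict, execute):
        # Sorted-key JSON makes the cache key stable across dict orderings.
        key = (tool_name, json.dumps(args, sort_keys=True))
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        result = execute(**args)  # only runs on a cache miss
        self._cache[key] = result
        return result

# Usage with a stand-in search tool:
cache = ToolResultCache()
fake_search = lambda query: f"results for {query}"
cache.call("web_search", {"query": "claude pricing"}, fake_search)
cache.call("web_search", {"query": "claude pricing"}, fake_search)  # cache hit
print(cache.hits)  # 1
```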
With cost optimization in place, you are ready to tackle the most powerful (and expensive) agent architecture: multi-agent systems. In the Multi-Agent guide, you will learn how to build networks of specialized agents that collaborate on complex tasks — and how to keep costs from multiplying when you add more agents.