01

Why Agent Costs Spiral Out of Control

A single agent call is not a single API call. Every step of an agentic loop sends the full conversation history — system prompt, all previous messages, all tool results — back to the model. A research agent that runs 10 iterations makes 10 API calls, each one larger than the last.

- 10x more tokens than a simple API call
- 50K tokens for a 10-step research agent
- $0.15 for that same run on Sonnet
- $150/day at 1,000 runs/day without optimization

The Three Cost Multipliers

1. Long system prompts. A 3,000-token system prompt gets sent with every API call in the loop. At 10 iterations, that is 30,000 tokens just for the system prompt — before the agent has done anything. Prompt caching (Section 3) eliminates most of this cost.

2. Accumulating context window. Tool results get appended to the conversation history. A web search returning 2,000 tokens of results, run 5 times, adds 10,000 tokens to every subsequent call. Context trimming (Section 5) prevents this from spiraling.

3. Using the wrong model. If you use Claude Opus for routing tasks that could run on Haiku, you are spending 18x more than you need to. Tiered model selection (Section 6) is the single biggest cost lever available.

Worked Example: Realistic Agent Cost Without Optimization

| Iteration | System Prompt | Conversation History | Tool Results | Total Input Tokens | Output Tokens |
|---|---|---|---|---|---|
| 1 | 3,000 | 200 | 0 | 3,200 | 350 |
| 2 | 3,000 | 750 | 1,500 | 5,250 | 300 |
| 3 | 3,000 | 1,500 | 3,000 | 7,500 | 400 |
| 4 | 3,000 | 2,400 | 4,500 | 9,900 | 350 |
| 5 | 3,000 | 3,500 | 6,000 | 12,500 | 450 |
| Total | 15,000 | 8,350 | 15,000 | 38,350 | 1,850 |

At Sonnet pricing ($3 input / $15 output per million tokens): 38,350 input tokens = $0.115 and 1,850 output tokens = $0.028 — roughly $0.14 per agent run. That sounds small. At 1,000 runs/day it is $140/day, $4,200/month. With optimization, the same workload can cost under $500/month.
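The arithmetic above can be verified in a few lines of Python, with the token counts taken from the table and the Sonnet rates quoted above:

```python
# Per-iteration token counts from the worked example:
# (system prompt, conversation history, tool results, output)
iterations = [
    (3_000, 200, 0, 350),
    (3_000, 750, 1_500, 300),
    (3_000, 1_500, 3_000, 400),
    (3_000, 2_400, 4_500, 350),
    (3_000, 3_500, 6_000, 450),
]
INPUT_PRICE = 3.0 / 1_000_000    # Sonnet: $3 per million input tokens
OUTPUT_PRICE = 15.0 / 1_000_000  # Sonnet: $15 per million output tokens

total_input = sum(s + h + t for s, h, t, _ in iterations)
total_output = sum(o for *_, o in iterations)
cost_per_run = total_input * INPUT_PRICE + total_output * OUTPUT_PRICE

print(f"Total input tokens:  {total_input:,}")   # 38,350
print(f"Total output tokens: {total_output:,}")  # 1,850
print(f"Cost per run: ${cost_per_run:.3f}")      # ~$0.143
```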

02

Model Selection Strategy

The single most impactful decision you make is which model you use for each task. There is an 18x price difference between Haiku and Opus. For tasks where Haiku performs equally well — classification, formatting, simple Q&A, routing — using Opus is pure waste.

These prices are approximate as of early 2026. Check console.anthropic.com for current pricing before building cost projections.
| Model | Input (per M tokens) | Output (per M tokens) | Speed | Best For |
|---|---|---|---|---|
| claude-opus-4-6 | $15.00 | $75.00 | Slower | Complex reasoning, multi-step analysis, high-stakes decisions, creative tasks requiring depth |
| claude-sonnet-4-6 | $3.00 | $15.00 | Fast | Most production workloads, coding, structured output, tool use, general agents |
| claude-haiku-4-5 | $0.80 | $4.00 | Fastest | Classification, intent routing, simple extraction, summarization, high-volume tasks |

Model Selection Decision Tree

1. Is this a high-stakes decision requiring nuanced reasoning, weighing complex tradeoffs, or producing long creative work that must be excellent? → Opus
2. Is this a standard production task — coding, tool use, structured output generation, agent reasoning with tools, customer-facing responses requiring good quality? → Sonnet
3. Is this classification, routing, extraction, or a simple transformation — tagging a support ticket, detecting language, extracting a date, routing a question to the right agent? → Haiku
4. Is this a batch job running on thousands of items overnight — bulk content categorization, document indexing, large-scale analysis? → Haiku + Batch API
5. Not sure? Start with Sonnet (the default). Profile your actual quality results, downgrade tasks to Haiku that pass quality checks, and upgrade tasks to Opus that fail.
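The tree above collapses into a small helper. The boolean flags here are illustrative stand-ins for whatever answers each question in your system (heuristics or a classifier, as in Section 6):

```python
def choose_tier(high_stakes: bool = False, standard: bool = False,
                simple: bool = False, batch_job: bool = False) -> str:
    """Walk the decision tree top to bottom; the first 'yes' wins."""
    if high_stakes:
        return "opus"
    if standard:
        return "sonnet"
    if simple:
        # Simple work that can wait gets the extra batch discount
        return "haiku + batch" if batch_job else "haiku"
    return "sonnet"  # not sure? start with Sonnet and profile

print(choose_tier(high_stakes=True))             # opus
print(choose_tier(simple=True))                  # haiku
print(choose_tier(simple=True, batch_job=True))  # haiku + batch
print(choose_tier())                             # sonnet
```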
03

Prompt Caching — Save Up to 90%

- Up to 90% savings on cached content
- 5-minute default TTL
- 1-hour extended TTL

Prompt caching is the highest-leverage optimization for agents with long system prompts. You mark a portion of your system prompt with cache_control. The first call pays to write the cache (cache writes are billed at a slight premium, about 1.25x the normal input rate). Every subsequent call within the TTL (Time To Live) window pays only 10% of the normal input price to read from cache — a 90% discount.

For a 3,000-token system prompt sent 100 times per day at Sonnet pricing: without caching, that is 300,000 tokens × $3/M = $0.90/day just for the system prompt. With caching, it is one 3,000-token cache write plus 99 reads of 3,000 tokens at $0.30/M, which comes to about $0.098/day: roughly a 9x reduction on that portion of cost (treating the cache write at the base input rate for simplicity).

Python — Prompt Caching
import anthropic

client = anthropic.Anthropic()

# With prompt caching — mark the system prompt with cache_control
# The system prompt costs 10% on cache hits (within the TTL window)
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": """You are an expert financial analyst with deep knowledge of
            public company earnings, balance sheets, and market dynamics.

            When analyzing companies, always consider:
            - Revenue growth trajectory (YoY and QoQ)
            - Gross margin trends and drivers
            - Operating leverage and EBITDA margin expansion
            - Free cash flow conversion from net income
            - Balance sheet health: net debt/EBITDA, current ratio
            - Competitive positioning: market share, moat characteristics
            - Management guidance credibility vs historical accuracy
            - Key risks: regulatory, competitive, macro sensitivity

            Format your analysis in structured sections with clear headers.
            Support every claim with specific numbers from the provided data.
            Always state your assumptions explicitly.
            [... the rest of your 2,000+ token system prompt ...]
            """,
            "cache_control": {"type": "ephemeral"}  # Cache this!
        }
    ],
    messages=[{"role": "user", "content": "Analyze Apple's Q4 2024 earnings: ..."}]
)

# Check cache performance in the usage metadata
usage = response.usage
print(f"Input tokens:        {usage.input_tokens:,}")
print(f"Cache write tokens:  {getattr(usage, 'cache_creation_input_tokens', 0):,}  (full price, first call)")
print(f"Cache read tokens:   {getattr(usage, 'cache_read_input_tokens', 0):,}   (10% price, cache hit)")
print(f"Output tokens:       {usage.output_tokens:,}")

# Calculate actual cost for this call
SONNET_INPUT_PRICE = 3.0 / 1_000_000    # $3 per million
SONNET_CACHE_PRICE = 0.30 / 1_000_000   # $0.30 per million (10% of input)
SONNET_OUTPUT_PRICE = 15.0 / 1_000_000  # $15 per million

cache_write = getattr(usage, 'cache_creation_input_tokens', 0)
cache_read = getattr(usage, 'cache_read_input_tokens', 0)
# Note: usage.input_tokens excludes cache tokens; they are reported in separate fields
regular_input = usage.input_tokens

cost = (
    regular_input * SONNET_INPUT_PRICE +
    cache_write * SONNET_INPUT_PRICE +     # Approximated at base rate (real writes carry ~25% premium)
    cache_read * SONNET_CACHE_PRICE +      # Read costs 10%
    usage.output_tokens * SONNET_OUTPUT_PRICE
)
print(f"Estimated cost:      ${cost:.6f}")

Cache Cost Comparison: 100 Agent Runs Per Day

| Scenario | System Prompt Tokens | Token Cost | Daily Cost (100 runs) | Monthly Cost |
|---|---|---|---|---|
| No caching | 3,000 per call × 100 | 300,000 × $3/M | $0.90 | $27.00 |
| With caching (5 min TTL) | 1 write + 99 reads at 10% | 3,000 full + 297,000 × $0.30/M | $0.098 | $2.94 |
| Extended caching (1 hr TTL) | ~1 write per hour | Fewer cache misses | ~$0.05 | ~$1.50 |

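The first two rows reduce to simple arithmetic. A quick sketch under the same assumptions (3,000-token prompt, 100 runs/day, Sonnet rates, write treated at the base rate):

```python
PROMPT_TOKENS = 3_000
RUNS_PER_DAY = 100
INPUT_PRICE = 3.00 / 1_000_000        # $3 per million input tokens
CACHE_READ_PRICE = 0.30 / 1_000_000   # 10% of the input price

no_cache = PROMPT_TOKENS * RUNS_PER_DAY * INPUT_PRICE
with_cache = (PROMPT_TOKENS * INPUT_PRICE                               # 1 cache write
              + PROMPT_TOKENS * (RUNS_PER_DAY - 1) * CACHE_READ_PRICE)  # 99 cache reads

print(f"No caching:   ${no_cache:.3f}/day")    # $0.900/day
print(f"With caching: ${with_cache:.3f}/day")  # $0.098/day
```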
Best Candidates for Caching

- Long system prompts that define agent behavior and don't change between runs.
- RAG documents: shared context like product documentation, internal knowledge bases, or policy documents.
- Few-shot examples that appear at the top of every prompt.

The cache_control marker can appear multiple times in a prompt to cache different segments at different depths.
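As a sketch of multi-segment caching, each system block can carry its own cache_control marker. The content strings here are placeholders; only the block structure matters:

```python
# Hypothetical content — stable instructions, shared RAG context, reusable examples
SYSTEM_PROMPT = "You are a support agent for Acme Corp. ..."
PRODUCT_DOCS = "[~20,000 tokens of product documentation]"
FEW_SHOT = "Example 1: ...\nExample 2: ..."

system_blocks = [
    {"type": "text", "text": SYSTEM_PROMPT,
     "cache_control": {"type": "ephemeral"}},  # segment 1: instructions
    {"type": "text", "text": PRODUCT_DOCS,
     "cache_control": {"type": "ephemeral"}},  # segment 2: docs
    {"type": "text", "text": FEW_SHOT,
     "cache_control": {"type": "ephemeral"}},  # segment 3: examples
]

# Passed as the system parameter:
#   client.messages.create(model="claude-sonnet-4-6", system=system_blocks, ...)
cached_segments = [b for b in system_blocks if "cache_control" in b]
print(f"{len(cached_segments)} cacheable segments")  # 3 cacheable segments
```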

04

Batching for Non-Real-Time Tasks

- 50% flat discount
- Up to 24-hour processing window
- Up to 10,000 requests per batch

Anthropic's Message Batches API gives you a 50% discount on all requests submitted as a batch. The tradeoff: results are returned asynchronously, potentially up to 24 hours later. This is the right tool for any work that does not need an immediate response: daily data processing, bulk document analysis, overnight report generation, scheduled research tasks.

If you run 10,000 classification tasks per day with Haiku, that is roughly $8/day normally. With the Batch API, it is $4/day — saving $120/month with zero code change beyond the batching wrapper.
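Under assumed per-item token counts (roughly 1,000 input tokens and a one-word label per classification — these counts are not stated above), the arithmetic looks like:

```python
ITEMS_PER_DAY = 10_000
INPUT_TOKENS = 1_000   # assumed per-item prompt size
OUTPUT_TOKENS = 10     # a one-word classification label
HAIKU_INPUT = 0.80 / 1_000_000   # $0.80 per million input tokens
HAIKU_OUTPUT = 4.00 / 1_000_000  # $4.00 per million output tokens

per_item = INPUT_TOKENS * HAIKU_INPUT + OUTPUT_TOKENS * HAIKU_OUTPUT
realtime_daily = ITEMS_PER_DAY * per_item
batched_daily = realtime_daily * 0.50   # flat 50% batch discount

print(f"Real-time: ${realtime_daily:.2f}/day")  # $8.40/day
print(f"Batched:   ${batched_daily:.2f}/day")   # $4.20/day
```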

Python — Message Batches API
import anthropic
import time
import json

client = anthropic.Anthropic()

# Your documents to process
documents_to_analyze = [
    "Customer complaint about shipping delay on order #4821...",
    "Positive review: product exceeded expectations, would recommend...",
    "Refund request for defective item received on Jan 15...",
    # ... up to 10,000 items per batch
]

# 1. Create the batch — all requests submitted at once
print(f"Submitting batch of {len(documents_to_analyze)} requests...")
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc-{i}",  # Your ID to match results to inputs
            "params": {
                "model": "claude-haiku-4-5-20251001",
                "max_tokens": 100,
                "messages": [{
                    "role": "user",
                    "content": (
                        f"Classify this customer message. Reply with ONLY one of: "
                        f"complaint, compliment, refund_request, question, other\n\n"
                        f"Message: {doc}"
                    )
                }]
            }
        }
        for i, doc in enumerate(documents_to_analyze)
    ]
)

print(f"Batch created: {batch.id}")
print(f"Status: {batch.processing_status}")

# 2. Poll until complete (in production, use a webhook or scheduled job instead)
while True:
    status = client.messages.batches.retrieve(batch.id)
    print(f"Status: {status.processing_status} | "
          f"Succeeded: {status.request_counts.succeeded} | "
          f"Errored: {status.request_counts.errored}")

    if status.processing_status == "ended":
        break

    time.sleep(60)  # Check every minute

# 3. Retrieve and process results
results = {}
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        classification = result.result.message.content[0].text.strip()
        results[result.custom_id] = classification
    else:
        # Handle errors
        results[result.custom_id] = f"ERROR: {result.result.error.type}"

# 4. Print summary
from collections import Counter
counts = Counter(results.values())
print("\nClassification results:")
for label, count in counts.most_common():
    print(f"  {label}: {count}")

print(f"\nTotal processed: {len(results)}")
print(f"Estimated cost vs real-time: ~50% savings")
When NOT to Use Batching

Real-time user interactions, anything that requires a response within a few seconds, or tasks where you need to react to early results before processing all items. Batching is for scheduled, bulk, or background workloads only.

05

Token Budgeting

Every token you send or receive costs money. Most developers leave significant savings on the table through two habits: setting max_tokens to a default large number regardless of task, and letting conversation history grow unbounded across iterations.

Setting max_tokens by Task Type

| Task Type | Recommended max_tokens | Reasoning |
|---|---|---|
| Classification / routing | 10–50 | You need one word or a short phrase |
| Simple Q&A / extraction | 200–500 | A few sentences, not a full essay |
| Tool call reasoning (in agentic loop) | 500–1024 | Reasoning + tool call, not final output |
| Standard agent response | 1024–2048 | Detailed but focused output |
| Long-form writing / analysis | 2048–4096 | Only when you actually need a long response |
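One way to make these budgets systematic is a lookup keyed by task type, so callers don't hard-code limits. The task-type names here are illustrative:

```python
# Output-token budgets per task type, following the table above
MAX_TOKENS_BY_TASK = {
    "classification": 50,
    "extraction": 500,
    "tool_reasoning": 1024,
    "agent_response": 2048,
    "long_form": 4096,
}

def max_tokens_for(task_type: str, default: int = 1024) -> int:
    """Return the output budget for a task type, with a safe fallback."""
    return MAX_TOKENS_BY_TASK.get(task_type, default)

print(max_tokens_for("classification"))  # 50
print(max_tokens_for("weird_new_task"))  # 1024 (fallback)
```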

Trimming Conversation History

Without trimming, your agent's context window grows with every tool call. A 10-step agent can easily accumulate 30,000+ tokens of history — most of it earlier reasoning steps the model no longer needs. The trimmer below keeps recent messages within a token budget while always preserving the first user message (the original task). One caution: if your history contains tool_use and tool_result blocks, trim them as pairs, since the API rejects a tool_result whose matching tool_use has been removed.

Python — Conversation History Trimmer
def estimate_tokens(msg: dict) -> int:
    """
    Rough token estimate: 4 characters ≈ 1 token.
    Use tiktoken or Anthropic's token counting endpoint for accuracy.
    """
    content = msg.get('content', '')
    if isinstance(content, list):
        # Handle structured content (tool results, etc.)
        text = ' '.join(
            item.get('text', '') if isinstance(item, dict) else str(item)
            for item in content
        )
    else:
        text = str(content)
    return max(1, len(text) // 4)


def trim_conversation_history(
    messages: list,
    max_tokens: int = 50_000,
    always_keep_first: bool = True
) -> list:
    """
    Trim conversation history to stay within a token budget.

    Strategy:
    - Always keep the first user message (original task context)
    - Always keep the most recent N messages
    - Remove messages from the middle when over budget

    Args:
        messages: Full conversation history
        max_tokens: Maximum total tokens to allow (rough estimate)
        always_keep_first: If True, never remove messages[0]

    Returns:
        Trimmed message list
    """
    if not messages:
        return messages

    total = sum(estimate_tokens(m) for m in messages)

    if total <= max_tokens:
        return messages  # No trimming needed

    print(f"[trim] Conversation at ~{total:,} tokens, trimming to {max_tokens:,}...")

    if not always_keep_first:
        # Simple approach: keep most recent messages within budget
        result = []
        budget = max_tokens
        for msg in reversed(messages):
            tokens = estimate_tokens(msg)
            if budget - tokens > 0:
                result.insert(0, msg)
                budget -= tokens
            else:
                break
        return result

    # Keep first message, fill from the end
    first_msg = messages[0]
    remaining_messages = messages[1:]

    first_tokens = estimate_tokens(first_msg)
    budget = max_tokens - first_tokens

    result = []
    for msg in reversed(remaining_messages):
        tokens = estimate_tokens(msg)
        if budget - tokens > 0:
            result.insert(0, msg)
            budget -= tokens
        else:
            break

    trimmed = [first_msg] + result
    trimmed_total = sum(estimate_tokens(m) for m in trimmed)
    removed = len(messages) - len(trimmed)

    if removed > 0:
        print(f"[trim] Removed {removed} messages. New size: ~{trimmed_total:,} tokens")

    return trimmed


# Usage in your agent loop:
messages = trim_conversation_history(messages, max_tokens=50_000)

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=messages
)
06

Tiered Systems — Cheap Models Route, Expensive Models Reason

- Biggest cost lever available
- Route 80% to Haiku
- Opus for 5% of requests

The most powerful cost optimization is architectural: do not use one model for everything. Use Haiku to classify and route requests, Sonnet for the bulk of work, and Opus only for the tasks that genuinely require its capabilities.

In most production agent systems, roughly 50–70% of requests can be handled by Haiku once properly classified — because many user requests are actually simple lookups, yes/no questions, or straightforward transformations, not complex reasoning tasks.

Architecture: Intent Classification Pipeline

| Tier | Model | % of Requests | Task Types | Cost vs All-Opus |
|---|---|---|---|---|
| Routing | Haiku | 100% (all requests classified here) | Intent classification only, tiny output | -98% |
| Simple | Haiku | ~50% | FAQ, extraction, formatting, simple Q&A | -95% |
| Standard | Sonnet | ~45% | Tool use, coding, analysis, general agent work | -80% |
| Complex | Opus | ~5% | Deep research, strategy, multi-step reasoning | Baseline |

Python — Tiered Model Router
import anthropic

client = anthropic.Anthropic()

# Pricing per million tokens (approximate, check Anthropic console for current rates)
MODEL_PRICING = {
    'claude-haiku-4-5-20251001':  {'input': 0.80, 'output': 4.00},
    'claude-sonnet-4-6':           {'input': 3.00, 'output': 15.00},
    'claude-opus-4-6':             {'input': 15.00, 'output': 75.00},
}

def classify_complexity(task: str) -> str:
    """
    Use cheap Haiku to classify task complexity.
    Costs roughly $0.0001 per classification (effectively free).
    """
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=20,  # We only need one word
        messages=[{
            "role": "user",
            "content": f"""Classify this task complexity. Reply with ONLY one word: simple, moderate, or complex.

Rules:
- simple: factual lookup, yes/no, formatting, basic calculation, FAQ
- moderate: multi-step reasoning, code generation, analysis, tool use
- complex: deep research, strategic planning, multi-document synthesis, novel problem-solving

Task: {task}

Classification:"""
        }]
    )
    result = response.content[0].text.strip().lower()
    # Validate — default to moderate if unexpected response
    return result if result in ('simple', 'moderate', 'complex') else 'moderate'


def select_model(complexity: str) -> str:
    """Map complexity to the appropriate model."""
    return {
        'simple':   'claude-haiku-4-5-20251001',
        'moderate': 'claude-sonnet-4-6',
        'complex':  'claude-opus-4-6',
    }[complexity]


def run_tiered_agent(task: str, system: str = "", verbose: bool = True) -> dict:
    """
    Route a task to the appropriate model based on complexity.
    Returns result dict with model used, response, and estimated cost.
    """
    # Step 1: Classify (costs ~$0.0001)
    complexity = classify_complexity(task)
    model = select_model(complexity)

    if verbose:
        print(f"[tiered] complexity={complexity} → model={model}")

    # Step 2: Run on the appropriate model
    messages = [{"role": "user", "content": task}]
    kwargs = {
        "model": model,
        "max_tokens": 2048,
        "messages": messages
    }
    if system:
        kwargs["system"] = system

    response = client.messages.create(**kwargs)

    # Step 3: Calculate cost
    usage = response.usage
    pricing = MODEL_PRICING[model]
    cost = (
        usage.input_tokens / 1_000_000 * pricing['input'] +
        usage.output_tokens / 1_000_000 * pricing['output']
    )

    return {
        "complexity": complexity,
        "model": model,
        "response": response.content[0].text,
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        "estimated_cost_usd": round(cost, 6)
    }


# Example usage
tasks = [
    "What is the capital of France?",                          # → Haiku (simple)
    "Write a Python function to parse ISO 8601 timestamps",    # → Sonnet (moderate)
    "Develop a comprehensive go-to-market strategy for a B2B SaaS product targeting mid-market healthcare companies, including competitive positioning, pricing model analysis, and channel strategy"  # → Opus (complex)
]

total_cost = 0
for task in tasks:
    result = run_tiered_agent(task)  # classify and run on the full task
    total_cost += result['estimated_cost_usd']
    preview = task[:80] + "..." if len(task) > 80 else task  # truncate for display only
    print(f"  Task: {preview}")
    print(f"  Cost: ${result['estimated_cost_usd']:.6f} | Model: {result['model']}\n")

print(f"Total: ${total_cost:.6f}")
07

Real Cost Comparison Tables

All calculations use approximate Anthropic pricing as of early 2026: Haiku $0.80/$4.00 per M tokens input/output, Sonnet $3.00/$15.00, Opus $15.00/$75.00. Assumes average task: 2,000 input tokens, 500 output tokens. Check console.anthropic.com for current pricing.

Table 1: Cost per 1,000 Tasks

Assumes average 2,000 input tokens + 500 output tokens per task

| Strategy | Input Cost | Output Cost | Total / 1K Tasks | vs All-Opus |
|---|---|---|---|---|
| All Opus | 2M × $15/M = $30.00 | 500K × $75/M = $37.50 | $67.50 | Baseline |
| All Sonnet | 2M × $3/M = $6.00 | 500K × $15/M = $7.50 | $13.50 | -80% |
| All Haiku | 2M × $0.80/M = $1.60 | 500K × $4/M = $2.00 | $3.60 | -95% |
| Tiered (5% Opus / 45% Sonnet / 50% Haiku) | Blended | Blended | $9.18 | -86% |
| Tiered + Prompt Caching | System prompt at 10% | Same | $5.80 | -91% |
| Tiered + Caching + Batch API | 50% batch discount | 50% batch discount | $2.90 | -96% |
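The single-model rows follow directly from the per-task assumption (2,000 input + 500 output tokens). A small sketch reproduces them:

```python
# Approximate prices per million tokens: (input, output)
PRICES = {"opus": (15.00, 75.00), "sonnet": (3.00, 15.00), "haiku": (0.80, 4.00)}

def cost_per_1k_tasks(model: str, in_tok: int = 2_000, out_tok: int = 500) -> float:
    """Dollar cost of 1,000 tasks at the given per-task token counts."""
    p_in, p_out = PRICES[model]
    return 1_000 * (in_tok * p_in + out_tok * p_out) / 1_000_000

for model in PRICES:
    print(f"All {model}: ${cost_per_1k_tasks(model):.2f} per 1K tasks")
# opus $67.50, sonnet $13.50, haiku $3.60
```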

Table 2: Monthly Costs at Scale

Using "Tiered + Caching + Batching" strategy vs "All Sonnet" baseline

| Usage Level | Tasks / Month | All Sonnet Cost | Optimized Cost | Monthly Savings |
|---|---|---|---|---|
| Small (100/day) | 3,000 | $40.50 | $8.70 | $31.80 |
| Medium (1,000/day) | 30,000 | $405.00 | $87.00 | $318.00 |
| Large (10,000/day) | 300,000 | $4,050.00 | $870.00 | $3,180.00 |
| Enterprise (100K/day) | 3,000,000 | $40,500.00 | $8,700.00 | $31,800.00 |
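Table 2 is simply the Table 1 per-1K costs scaled by volume. A quick check:

```python
ALL_SONNET_PER_1K = 13.50  # from Table 1
OPTIMIZED_PER_1K = 2.90    # tiered + caching + batch, from Table 1

def monthly(tasks_per_month: int) -> tuple:
    """Return (baseline, optimized, savings) in dollars for a monthly volume."""
    baseline = tasks_per_month / 1_000 * ALL_SONNET_PER_1K
    optimized = tasks_per_month / 1_000 * OPTIMIZED_PER_1K
    return baseline, optimized, baseline - optimized

for tasks in (3_000, 30_000, 300_000, 3_000_000):
    base, opt, saved = monthly(tasks)
    print(f"{tasks:>9,} tasks/mo: ${base:>9,.2f} -> ${opt:>8,.2f}  (${saved:,.2f} saved)")
```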

Table 3: Cost per Run for Common Agent Types

Typical token usage per complete agent run (all iterations combined)

| Agent Type | Avg Input Tokens | Avg Output Tokens | Cost on Sonnet | Cost on Haiku | Cost Optimized |
|---|---|---|---|---|---|
| Customer support Q&A (3 turns) | 4,500 | 800 | $0.026 | $0.007 | $0.007 |
| Code review (single file) | 8,000 | 1,500 | $0.047 | $0.012 | $0.025 |
| Research agent (5 web searches) | 18,000 | 2,000 | $0.084 | $0.022 | $0.040 |
| Data analysis (DB + charts) | 12,000 | 3,000 | $0.081 | $0.022 | $0.038 |
| Long research report (10+ tools) | 45,000 | 6,000 | $0.225 | Not recommended | $0.095 |
08

Monitoring and Alerting

Cost optimization is not a one-time setup — it is an ongoing practice. You need visibility into where your tokens are actually going. Often the most expensive parts of a system are not where you expect: a verbose tool result that adds 3,000 tokens to every iteration, a system prompt that grew to 5,000 tokens over six months of edits, a single user running 500 agent calls in a day.

Python — Usage Tracker with Budget Alerts
import anthropic
from dataclasses import dataclass, field
from datetime import datetime
from collections import defaultdict
import json

@dataclass
class UsageTracker:
    """
    Track token usage and costs per model, per user, per day.
    Alerts when approaching budget limits.
    """
    daily_budget_usd: float = 10.0  # Total daily budget
    per_user_daily_limit: float = 1.0  # Per-user daily limit

    # Pricing per million tokens (approximate, early 2026)
    PRICING = {
        'claude-opus-4-6':            {'input': 15.0,  'output': 75.0},
        'claude-sonnet-4-6':           {'input': 3.0,   'output': 15.0},
        'claude-haiku-4-5-20251001':  {'input': 0.80,  'output': 4.0},
    }

    _log: list = field(default_factory=list)

    def track(self, model: str, usage: anthropic.types.Usage,
              user_id: str = "default", task_type: str = "unknown") -> float:
        """Record usage and return the cost of this call."""
        prices = self.PRICING.get(model, {'input': 3.0, 'output': 15.0})

        # Account for prompt caching if available.
        # Note: usage.input_tokens excludes cache tokens; they arrive in separate fields.
        cache_read = getattr(usage, 'cache_read_input_tokens', 0)
        cache_write = getattr(usage, 'cache_creation_input_tokens', 0)
        regular_input = usage.input_tokens

        cache_read_price = prices['input'] * 0.10  # Cache reads at 10%
        cost = (
            regular_input / 1_000_000 * prices['input'] +
            cache_write   / 1_000_000 * prices['input'] +
            cache_read    / 1_000_000 * cache_read_price +
            usage.output_tokens / 1_000_000 * prices['output']
        )

        entry = {
            'timestamp': datetime.now().isoformat(),
            'user_id': user_id,
            'model': model,
            'task_type': task_type,
            'input_tokens': usage.input_tokens,
            'output_tokens': usage.output_tokens,
            'cache_read_tokens': cache_read,
            'cost_usd': cost
        }
        self._log.append(entry)

        # Check budget after each call
        self._check_alerts(user_id, cost)
        return cost

    def today_total(self) -> float:
        """Total spend today across all users."""
        today = datetime.now().date().isoformat()
        return sum(e['cost_usd'] for e in self._log if e['timestamp'][:10] == today)

    def today_by_user(self) -> dict:
        """Today's spend grouped by user."""
        today = datetime.now().date().isoformat()
        totals = defaultdict(float)
        for e in self._log:
            if e['timestamp'][:10] == today:
                totals[e['user_id']] += e['cost_usd']
        return dict(totals)

    def today_by_model(self) -> dict:
        """Today's spend grouped by model."""
        today = datetime.now().date().isoformat()
        totals = defaultdict(float)
        for e in self._log:
            if e['timestamp'][:10] == today:
                totals[e['model']] += e['cost_usd']
        return dict(totals)

    def _check_alerts(self, user_id: str, new_cost: float):
        """Fire alerts when approaching budget limits."""
        daily = self.today_total()
        user_daily = self.today_by_user().get(user_id, 0)

        if daily > self.daily_budget_usd * 0.80:
            print(f"[BUDGET WARNING] Daily spend at ${daily:.2f} / "
                  f"${self.daily_budget_usd:.2f} ({daily/self.daily_budget_usd:.0%})")

        if daily > self.daily_budget_usd:
            raise RuntimeError(
                f"[BUDGET EXCEEDED] Daily budget of ${self.daily_budget_usd:.2f} reached. "
                f"Current: ${daily:.2f}. Set a higher limit in UsageTracker or "
                f"add a hard spending limit at console.anthropic.com"
            )

        if user_daily > self.per_user_daily_limit:
            print(f"[USER LIMIT] User '{user_id}' has spent ${user_daily:.4f} today "
                  f"(limit: ${self.per_user_daily_limit:.2f})")

    def report(self) -> str:
        """Print a summary of today's usage."""
        lines = [
            f"=== Usage Report ({datetime.now().date()}) ===",
            f"Total spend:     ${self.today_total():.4f} / ${self.daily_budget_usd:.2f}",
            "",
            "By model:",
        ]
        for model, cost in sorted(self.today_by_model().items(), key=lambda x: -x[1]):
            short_model = model.split('-')[1] if '-' in model else model
            lines.append(f"  {short_model:12} ${cost:.4f}")
        lines.extend(["", "Top users:"])
        for user, cost in sorted(self.today_by_user().items(), key=lambda x: -x[1])[:5]:
            lines.append(f"  {user:20} ${cost:.4f}")
        return "\n".join(lines)


# Usage in your agent:
tracker = UsageTracker(daily_budget_usd=50.0)

response = client.messages.create(model="claude-sonnet-4-6", ...)
cost = tracker.track(
    model="claude-sonnet-4-6",
    usage=response.usage,
    user_id="user_12345",
    task_type="research_agent"
)
print(f"This call cost: ${cost:.6f}")
print(tracker.report())
Hard Spending Limits

Your usage tracker is a soft limit in code. Always set a hard spending limit in the Anthropic console (console.anthropic.com → Settings → Billing → Spend Limits). This is your failsafe — it stops spending even if your code has a bug or an agent goes rogue. Set it 20–30% above your expected spend to avoid unexpected cutoffs.

09

Cost Optimization Checklist

Quick-reference list of optimizations, ranked roughly by impact. Implement the top items first — they give the most savings for the least effort.

1. Tier your models (Section 6): route with Haiku, default to Sonnet, reserve Opus for the ~5% of tasks that need it.
2. Cache long, stable prompt segments (Section 3): system prompts, shared RAG documents, few-shot examples.
3. Batch non-real-time work (Section 4) for a flat 50% discount.
4. Trim conversation history (Section 5) so agent context does not grow unbounded.
5. Set max_tokens per task type (Section 5) instead of one large default.
6. Track spend per model and per user (Section 8), and set a hard spending limit in the console.

Your Next Step

With cost optimization in place, you are ready to tackle the most powerful (and expensive) agent architecture: multi-agent systems. In the Multi-Agent guide, you will learn how to build networks of specialized agents that collaborate on complex tasks — and how to keep costs from multiplying when you add more agents.