Why Agent Costs Spiral Out of Control
A single agent run is not a single API call. Every step of an agentic loop sends the full conversation history — system prompt, all previous messages, all tool results — back to the model. A research agent that runs 10 iterations makes 10 API calls, each one larger than the last.
The Three Cost Multipliers
1. Long system prompts. A 3,000-token system prompt gets sent with every API call in the loop. At 10 iterations, that is 30,000 tokens just for the system prompt — before the agent has done anything. Prompt caching (Section 3) eliminates most of this cost.
2. Accumulating context window. Tool results get appended to the conversation history. A web search returning 2,000 tokens of results, run 5 times, adds 10,000 tokens to every subsequent call. Context trimming (Section 5) prevents this from spiraling.
3. Using the wrong model. If you use Claude Opus for routing tasks that could run on Haiku, you are spending 18x more than you need to. Tiered model selection (Section 6) is the single biggest cost lever available.
Worked Example: Realistic Agent Cost Without Optimization
| Iteration | System Prompt | Conversation History | Tool Results | Total Input Tokens | Output Tokens |
|---|---|---|---|---|---|
| 1 | 3,000 | 200 | 0 | 3,200 | 350 |
| 2 | 3,000 | 750 | 1,500 | 5,250 | 300 |
| 3 | 3,000 | 1,500 | 3,000 | 7,500 | 400 |
| 4 | 3,000 | 2,400 | 4,500 | 9,900 | 350 |
| 5 | 3,000 | 3,500 | 6,000 | 12,500 | 450 |
| Total | 15,000 | 8,350 | 15,000 | 38,350 | 1,850 |
At Sonnet pricing ($3 input / $15 output per million tokens): 38,350 input tokens = $0.115 and 1,850 output tokens = $0.028 — roughly $0.14 per agent run. That sounds small. At 1,000 runs/day it is $140/day, $4,200/month. With optimization, the same workload can cost under $500/month.
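The arithmetic can be reproduced in a few lines, using the per-run totals from the table above:

```python
# Reproduce the worked example: 38,350 input / 1,850 output tokens per run
# at Sonnet pricing ($3 input, $15 output per million tokens).
INPUT_PRICE = 3.00 / 1_000_000
OUTPUT_PRICE = 15.00 / 1_000_000

input_tokens, output_tokens = 38_350, 1_850
cost_per_run = input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

print(f"Per run:   ${cost_per_run:.4f}")          # ~ $0.14
print(f"Per day:   ${cost_per_run * 1_000:.2f}")  # at 1,000 runs/day
print(f"Per month: ${cost_per_run * 30_000:.2f}")
```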
Model Selection Strategy
The single most impactful decision you make is which model you use for each task. There is an 18x price difference between Haiku and Opus. For tasks where Haiku performs equally well — classification, formatting, simple Q&A, routing — using Opus is pure waste.
| Model | Input (per M tokens) | Output (per M tokens) | Speed | Best For |
|---|---|---|---|---|
| claude-opus-4-6 | $15.00 | $75.00 | Slower | Complex reasoning, multi-step analysis, high-stakes decisions, creative tasks requiring depth |
| claude-sonnet-4-6 | $3.00 | $15.00 | Fast | Most production workloads, coding, structured output, tool use, general agents |
| claude-haiku-4-5 | $0.80 | $4.00 | Fastest | Classification, intent routing, simple extraction, summarization, high-volume tasks |
Model Selection Decision Tree
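In code form, the decision tree amounts to a small routing function. A sketch, with illustrative task-type names (only the model IDs come from the pricing table above):

```python
def choose_model(task_type: str, high_stakes: bool = False) -> str:
    """Sketch of the selection logic; task-type names are illustrative."""
    simple = {"classification", "routing", "extraction", "summarization", "faq"}
    complex_tasks = {"deep_research", "strategy", "multi_step_reasoning"}

    if high_stakes or task_type in complex_tasks:
        return "claude-opus-4-6"      # complex reasoning, high-stakes decisions
    if task_type in simple:
        return "claude-haiku-4-5"     # high-volume, simple tasks
    return "claude-sonnet-4-6"        # default for most production workloads

print(choose_model("routing"))   # claude-haiku-4-5
print(choose_model("coding"))    # claude-sonnet-4-6
print(choose_model("strategy"))  # claude-opus-4-6
```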
Prompt Caching — Save Up to 90%
Prompt caching is the highest-leverage optimization for agents with long system prompts. You mark a portion of your system prompt with cache_control. The first call pays full price (plus a small cache-write premium) to write the cache. Every subsequent call within the TTL (Time To Live) window pays only 10% of the normal input price to read from cache — a 90% discount.
For a 3,000-token system prompt sent 100 times per day at Sonnet pricing: without caching, that is 300,000 tokens × $3/M = $0.90/day just for the system prompt. With caching, it is 3,000 tokens at full price (the cache write) + 99 × 3,000 tokens at $0.30/M (cache reads at 10%) = ~$0.10/day — roughly a 9x reduction on that portion of cost.
import anthropic
client = anthropic.Anthropic()
# With prompt caching — mark the system prompt with cache_control
# The system prompt costs 10% on cache hits (within the TTL window)
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=[
{
"type": "text",
"text": """You are an expert financial analyst with deep knowledge of
public company earnings, balance sheets, and market dynamics.
When analyzing companies, always consider:
- Revenue growth trajectory (YoY and QoQ)
- Gross margin trends and drivers
- Operating leverage and EBITDA margin expansion
- Free cash flow conversion from net income
- Balance sheet health: net debt/EBITDA, current ratio
- Competitive positioning: market share, moat characteristics
- Management guidance credibility vs historical accuracy
- Key risks: regulatory, competitive, macro sensitivity
Format your analysis in structured sections with clear headers.
Support every claim with specific numbers from the provided data.
Always state your assumptions explicitly.
[... the rest of your 2,000+ token system prompt ...]
""",
"cache_control": {"type": "ephemeral"} # Cache this!
}
],
messages=[{"role": "user", "content": "Analyze Apple's Q4 2024 earnings: ..."}]
)
# Check cache performance in the usage metadata
usage = response.usage
print(f"Input tokens: {usage.input_tokens:,}")
print(f"Cache write tokens: {getattr(usage, 'cache_creation_input_tokens', 0):,} (full price, first call)")
print(f"Cache read tokens: {getattr(usage, 'cache_read_input_tokens', 0):,} (10% price, cache hit)")
print(f"Output tokens: {usage.output_tokens:,}")
# Calculate actual cost for this call
SONNET_INPUT_PRICE = 3.0 / 1_000_000 # $3 per million
SONNET_CACHE_PRICE = 0.30 / 1_000_000 # $0.30 per million (10% of input)
SONNET_OUTPUT_PRICE = 15.0 / 1_000_000 # $15 per million
cache_write = getattr(usage, 'cache_creation_input_tokens', 0)
cache_read = getattr(usage, 'cache_read_input_tokens', 0)
# usage.input_tokens reports only the uncached input tokens; cache writes
# and reads are reported separately, so no subtraction is needed.
regular_input = usage.input_tokens
cost = (
regular_input * SONNET_INPUT_PRICE +
cache_write * SONNET_INPUT_PRICE + # Write costs full price
cache_read * SONNET_CACHE_PRICE + # Read costs 10%
usage.output_tokens * SONNET_OUTPUT_PRICE
)
print(f"Estimated cost: ${cost:.6f}")
Cache Cost Comparison: 100 Agent Runs Per Day
| Scenario | System Prompt Tokens | Token Cost | Daily Cost (100 runs) | Monthly Cost |
|---|---|---|---|---|
| No caching | 3,000 per call × 100 | 300,000 × $3/M | $0.90 | $27.00 |
| With caching (5 min TTL) | 1 write + 99 reads at 10% | 3,000 full + 297,000 × $0.30/M | $0.098 | $2.94 |
| Extended caching (1 hr TTL) | ~1 write per hour | Fewer cache misses | ~$0.05 | ~$1.50 |
Good candidates for caching: long system prompts that define agent behavior and don't change between runs; RAG documents — shared context like product documentation, internal knowledge bases, or policy documents; and few-shot examples that appear at the top of every prompt. The cache_control marker can appear multiple times in a prompt to cache different segments at different depths.
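Multiple cache breakpoints can be sketched as two system blocks: stable instructions in one, a large shared document in another, so updating one does not invalidate the other. The block contents below are placeholders.

```python
# Two cache breakpoints in the system parameter (contents are placeholders).
system_blocks = [
    {
        "type": "text",
        "text": "You are a support agent for Acme Corp. <stable instructions>",
        "cache_control": {"type": "ephemeral"},  # breakpoint 1: agent behavior
    },
    {
        "type": "text",
        "text": "<shared product documentation, several thousand tokens>",
        "cache_control": {"type": "ephemeral"},  # breakpoint 2: shared context
    },
]

# Passed exactly like the single-block example earlier:
# client.messages.create(model="claude-sonnet-4-6", max_tokens=1024,
#                        system=system_blocks, messages=[...])
```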
Batching for Non-Real-Time Tasks
Anthropic's Message Batches API gives you a 50% discount on all requests submitted as a batch. The tradeoff: results are returned asynchronously, potentially up to 24 hours later. This is the right tool for any work that does not need an immediate response: daily data processing, bulk document analysis, overnight report generation, scheduled research tasks.
If you run 10,000 classification tasks per day with Haiku, that is roughly $8/day normally. With the Batch API, it is $4/day — saving $120/month with zero code change beyond the batching wrapper.
import anthropic
import time
import json
client = anthropic.Anthropic()
# Your documents to process
documents_to_analyze = [
"Customer complaint about shipping delay on order #4821...",
"Positive review: product exceeded expectations, would recommend...",
"Refund request for defective item received on Jan 15...",
# ... up to 10,000 items per batch
]
# 1. Create the batch — all requests submitted at once
print(f"Submitting batch of {len(documents_to_analyze)} requests...")
batch = client.messages.batches.create(
requests=[
{
"custom_id": f"doc-{i}", # Your ID to match results to inputs
"params": {
"model": "claude-haiku-4-5-20251001",
"max_tokens": 100,
"messages": [{
"role": "user",
"content": (
f"Classify this customer message. Reply with ONLY one of: "
f"complaint, compliment, refund_request, question, other\n\n"
f"Message: {doc}"
)
}]
}
}
for i, doc in enumerate(documents_to_analyze)
]
)
print(f"Batch created: {batch.id}")
print(f"Status: {batch.processing_status}")
# 2. Poll until complete (in production, use a webhook or scheduled job instead)
while True:
status = client.messages.batches.retrieve(batch.id)
print(f"Status: {status.processing_status} | "
f"Succeeded: {status.request_counts.succeeded} | "
f"Errored: {status.request_counts.errored}")
if status.processing_status == "ended":
break
time.sleep(60) # Check every minute
# 3. Retrieve and process results
results = {}
for result in client.messages.batches.results(batch.id):
if result.result.type == "succeeded":
classification = result.result.message.content[0].text.strip()
results[result.custom_id] = classification
else:
# Handle errors
results[result.custom_id] = f"ERROR: {result.result.error.type}"
# 4. Print summary
from collections import Counter
counts = Counter(results.values())
print("\nClassification results:")
for label, count in counts.most_common():
print(f" {label}: {count}")
print(f"\nTotal processed: {len(results)}")
print(f"Estimated cost vs real-time: ~50% savings")
Do not use batching for real-time user interactions, anything that requires a response within a few seconds, or tasks where you need to react to early results before processing all items. Batching is for scheduled, bulk, or background workloads only.
Token Budgeting
Every token you send or receive costs money. Most developers leave significant savings on the table through two habits: setting max_tokens to a default large number regardless of task, and letting conversation history grow unbounded across iterations.
Setting max_tokens by Task Type
| Task Type | Recommended max_tokens | Reasoning |
|---|---|---|
| Classification / routing | 10–50 | You need one word or a short phrase |
| Simple Q&A / extraction | 200–500 | A few sentences, not a full essay |
| Tool call reasoning (in agentic loop) | 500–1024 | Reasoning + tool call, not final output |
| Standard agent response | 1024–2048 | Detailed but focused output |
| Long-form writing / analysis | 2048–4096 | Only when you actually need a long response |
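The table above can be encoded as a simple lookup. The task-type names here are illustrative, not an API concept:

```python
# Per-task-type output budgets, following the table above.
MAX_TOKENS_BY_TASK = {
    "classification": 50,
    "extraction": 500,
    "tool_reasoning": 1024,
    "agent_response": 2048,
    "long_form": 4096,
}

def max_tokens_for(task_type: str) -> int:
    # Unknown task types fall back to a moderate budget.
    return MAX_TOKENS_BY_TASK.get(task_type, 1024)

print(max_tokens_for("classification"))  # 50
print(max_tokens_for("unknown_task"))    # 1024
```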
Trimming Conversation History
Without trimming, your agent's context window grows with every tool call. A 10-step agent can easily accumulate 30,000+ tokens of history — most of it earlier reasoning steps the model no longer needs. The trimmer below keeps recent messages within a token budget while always preserving the first user message (the original task).
def estimate_tokens(msg: dict) -> int:
"""
Rough token estimate: 4 characters ≈ 1 token.
Use tiktoken or Anthropic's token counting endpoint for accuracy.
"""
content = msg.get('content', '')
if isinstance(content, list):
# Handle structured content (tool results, etc.)
text = ' '.join(
item.get('text', '') if isinstance(item, dict) else str(item)
for item in content
)
else:
text = str(content)
return max(1, len(text) // 4)
def trim_conversation_history(
messages: list,
max_tokens: int = 50_000,
always_keep_first: bool = True
) -> list:
"""
Trim conversation history to stay within a token budget.
Strategy:
- Always keep the first user message (original task context)
- Always keep the most recent N messages
- Remove messages from the middle when over budget
Args:
messages: Full conversation history
max_tokens: Maximum total tokens to allow (rough estimate)
always_keep_first: If True, never remove messages[0]
Returns:
Trimmed message list
"""
if not messages:
return messages
total = sum(estimate_tokens(m) for m in messages)
if total <= max_tokens:
return messages # No trimming needed
print(f"[trim] Conversation at ~{total:,} tokens, trimming to {max_tokens:,}...")
if not always_keep_first:
# Simple approach: keep most recent messages within budget
result = []
budget = max_tokens
for msg in reversed(messages):
tokens = estimate_tokens(msg)
if budget - tokens > 0:
result.insert(0, msg)
budget -= tokens
else:
break
return result
# Keep first message, fill from the end
first_msg = messages[0]
remaining_messages = messages[1:]
first_tokens = estimate_tokens(first_msg)
budget = max_tokens - first_tokens
result = []
for msg in reversed(remaining_messages):
tokens = estimate_tokens(msg)
if budget - tokens > 0:
result.insert(0, msg)
budget -= tokens
else:
break
trimmed = [first_msg] + result
trimmed_total = sum(estimate_tokens(m) for m in trimmed)
removed = len(messages) - len(trimmed)
if removed > 0:
print(f"[trim] Removed {removed} messages. New size: ~{trimmed_total:,} tokens")
return trimmed
# Usage in your agent loop:
messages = trim_conversation_history(messages, max_tokens=50_000)
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
messages=messages
)
Tiered Systems — Cheap Models Route, Expensive Models Reason
The most powerful cost optimization is architectural: do not use one model for everything. Use Haiku to classify and route requests, Sonnet for the bulk of work, and Opus only for the tasks that genuinely require its capabilities.
In most production agent systems, roughly 50–70% of requests can be handled by Haiku once properly classified — because many user requests are actually simple lookups, yes/no questions, or straightforward transformations, not complex reasoning tasks.
Architecture: Intent Classification Pipeline
| Tier | Model | % of Requests | Task Types | Cost vs All-Opus |
|---|---|---|---|---|
| Routing | Haiku | 100% (all requests classified here) | Intent classification only, tiny output | -98% |
| Simple | Haiku | ~50% | FAQ, extraction, formatting, simple Q&A | -95% |
| Standard | Sonnet | ~45% | Tool use, coding, analysis, general agent work | -80% |
| Complex | Opus | ~5% | Deep research, strategy, multi-step reasoning | Baseline |
import anthropic
client = anthropic.Anthropic()
# Pricing per million tokens (approximate, check Anthropic console for current rates)
MODEL_PRICING = {
'claude-haiku-4-5-20251001': {'input': 0.80, 'output': 4.00},
'claude-sonnet-4-6': {'input': 3.00, 'output': 15.00},
'claude-opus-4-6': {'input': 15.00, 'output': 75.00},
}
def classify_complexity(task: str) -> str:
"""
Use cheap Haiku to classify task complexity.
    Costs a fraction of a cent per classification (on the order of $0.0001).
"""
response = client.messages.create(
model="claude-haiku-4-5-20251001",
max_tokens=20, # We only need one word
messages=[{
"role": "user",
"content": f"""Classify this task complexity. Reply with ONLY one word: simple, moderate, or complex.
Rules:
- simple: factual lookup, yes/no, formatting, basic calculation, FAQ
- moderate: multi-step reasoning, code generation, analysis, tool use
- complex: deep research, strategic planning, multi-document synthesis, novel problem-solving
Task: {task}
Classification:"""
}]
)
result = response.content[0].text.strip().lower()
# Validate — default to moderate if unexpected response
return result if result in ('simple', 'moderate', 'complex') else 'moderate'
def select_model(complexity: str) -> str:
"""Map complexity to the appropriate model."""
return {
'simple': 'claude-haiku-4-5-20251001',
'moderate': 'claude-sonnet-4-6',
'complex': 'claude-opus-4-6',
}[complexity]
def run_tiered_agent(task: str, system: str = "", verbose: bool = True) -> dict:
"""
Route a task to the appropriate model based on complexity.
Returns result dict with model used, response, and estimated cost.
"""
    # Step 1: Classify the task with Haiku (costs a fraction of a cent)
complexity = classify_complexity(task)
model = select_model(complexity)
if verbose:
print(f"[tiered] complexity={complexity} → model={model}")
# Step 2: Run on the appropriate model
messages = [{"role": "user", "content": task}]
kwargs = {
"model": model,
"max_tokens": 2048,
"messages": messages
}
if system:
kwargs["system"] = system
response = client.messages.create(**kwargs)
# Step 3: Calculate cost
usage = response.usage
pricing = MODEL_PRICING[model]
cost = (
usage.input_tokens / 1_000_000 * pricing['input'] +
usage.output_tokens / 1_000_000 * pricing['output']
)
return {
"complexity": complexity,
"model": model,
"response": response.content[0].text,
"input_tokens": usage.input_tokens,
"output_tokens": usage.output_tokens,
"estimated_cost_usd": round(cost, 6)
}
# Example usage
tasks = [
"What is the capital of France?", # → Haiku (simple)
"Write a Python function to parse ISO 8601 timestamps", # → Sonnet (moderate)
"Develop a comprehensive go-to-market strategy for a B2B SaaS product targeting mid-market healthcare companies, including competitive positioning, pricing model analysis, and channel strategy" # → Opus (complex)
]
total_cost = 0
for task in tasks:
    preview = task if len(task) <= 80 else task[:80] + "..."
    print(f"Task: {preview}")
    result = run_tiered_agent(task)  # pass the full task; truncate only for display
total_cost += result['estimated_cost_usd']
print(f" Cost: ${result['estimated_cost_usd']:.6f} | Model: {result['model']}\n")
print(f"Total: ${total_cost:.6f}")
Real Cost Comparison Tables
Table 1: Cost per 1,000 Tasks
Assumes average 2,000 input tokens + 500 output tokens per task
| Strategy | Input Cost | Output Cost | Total / 1K Tasks | vs All-Opus |
|---|---|---|---|---|
| All Opus | 2M × $15/M = $30.00 | 500K × $75/M = $37.50 | $67.50 | Baseline |
| All Sonnet | 2M × $3/M = $6.00 | 500K × $15/M = $7.50 | $13.50 | -80% |
| All Haiku | 2M × $0.80/M = $1.60 | 500K × $4/M = $2.00 | $3.60 | -95% |
| Tiered (5% Opus / 45% Sonnet / 50% Haiku) | Blended | Blended | $11.25 | -83% |
| Tiered + Prompt Caching | System prompt at 10% | Same | $5.80 | -91% |
| Tiered + Caching + Batch API | 50% batch discount | 50% batch discount | $2.90 | -96% |
Table 2: Monthly Costs at Scale
Using "Tiered + Caching + Batching" strategy vs "All Sonnet" baseline
| Usage Level | Tasks / Month | All Sonnet Cost | Optimized Cost | Monthly Savings |
|---|---|---|---|---|
| Small (100/day) | 3,000 | $40.50 | $8.70 | $31.80 saved |
| Medium (1,000/day) | 30,000 | $405.00 | $87.00 | $318.00 saved |
| Large (10,000/day) | 300,000 | $4,050.00 | $870.00 | $3,180.00 saved |
| Enterprise (100K/day) | 3,000,000 | $40,500.00 | $8,700.00 | $31,800.00 saved |
Table 3: Cost per Run for Common Agent Types
Typical token usage per complete agent run (all iterations combined)
| Agent Type | Avg Input Tokens | Avg Output Tokens | Cost on Sonnet | Cost on Haiku | Cost Optimized |
|---|---|---|---|---|---|
| Customer support Q&A (3 turns) | 4,500 | 800 | $0.026 | $0.007 | $0.007 |
| Code review (single file) | 8,000 | 1,500 | $0.047 | $0.012 | $0.025 |
| Research agent (5 web searches) | 18,000 | 2,000 | $0.084 | $0.022 | $0.040 |
| Data analysis (DB + charts) | 12,000 | 3,000 | $0.081 | $0.022 | $0.038 |
| Long research report (10+ tools) | 45,000 | 6,000 | $0.225 | Not recommended | $0.095 |
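The per-run figures follow directly from the pricing table; for example, the research-agent row:

```python
# Spot-check one row of Table 3: research agent, 18,000 input / 2,000 output.
PRICING = {
    "sonnet": {"input": 3.00, "output": 15.00},  # $ per million tokens
    "haiku": {"input": 0.80, "output": 4.00},
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICING[model]
    return input_tokens / 1e6 * p["input"] + output_tokens / 1e6 * p["output"]

print(f"Sonnet: ${run_cost('sonnet', 18_000, 2_000):.3f}")  # $0.084
print(f"Haiku:  ${run_cost('haiku', 18_000, 2_000):.3f}")   # $0.022
```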
Monitoring and Alerting
Cost optimization is not a one-time setup — it is an ongoing practice. You need visibility into where your tokens are actually going. Often the most expensive parts of a system are not where you expect: a verbose tool result that adds 3,000 tokens to every iteration, a system prompt that grew to 5,000 tokens over six months of edits, a single user running 500 agent calls in a day.
import anthropic
from dataclasses import dataclass, field
from datetime import datetime
from collections import defaultdict
import json
@dataclass
class UsageTracker:
"""
Track token usage and costs per model, per user, per day.
Alerts when approaching budget limits.
"""
daily_budget_usd: float = 10.0 # Total daily budget
per_user_daily_limit: float = 1.0 # Per-user daily limit
# Pricing per million tokens (approximate, early 2026)
PRICING = {
'claude-opus-4-6': {'input': 15.0, 'output': 75.0},
'claude-sonnet-4-6': {'input': 3.0, 'output': 15.0},
'claude-haiku-4-5-20251001': {'input': 0.80, 'output': 4.0},
}
_log: list = field(default_factory=list)
def track(self, model: str, usage: anthropic.types.Usage,
user_id: str = "default", task_type: str = "unknown") -> float:
"""Record usage and return the cost of this call."""
prices = self.PRICING.get(model, {'input': 3.0, 'output': 15.0})
# Account for prompt caching if available
        cache_read = getattr(usage, 'cache_read_input_tokens', 0)
        cache_write = getattr(usage, 'cache_creation_input_tokens', 0)
        regular_input = usage.input_tokens  # input_tokens excludes cached tokens
cache_read_price = prices['input'] * 0.10 # Cache reads at 10%
cost = (
regular_input / 1_000_000 * prices['input'] +
cache_write / 1_000_000 * prices['input'] +
cache_read / 1_000_000 * cache_read_price +
usage.output_tokens / 1_000_000 * prices['output']
)
entry = {
'timestamp': datetime.now().isoformat(),
'user_id': user_id,
'model': model,
'task_type': task_type,
'input_tokens': usage.input_tokens,
'output_tokens': usage.output_tokens,
'cache_read_tokens': cache_read,
'cost_usd': cost
}
self._log.append(entry)
# Check budget after each call
self._check_alerts(user_id, cost)
return cost
def today_total(self) -> float:
"""Total spend today across all users."""
today = datetime.now().date().isoformat()
return sum(e['cost_usd'] for e in self._log if e['timestamp'][:10] == today)
def today_by_user(self) -> dict:
"""Today's spend grouped by user."""
today = datetime.now().date().isoformat()
totals = defaultdict(float)
for e in self._log:
if e['timestamp'][:10] == today:
totals[e['user_id']] += e['cost_usd']
return dict(totals)
def today_by_model(self) -> dict:
"""Today's spend grouped by model."""
today = datetime.now().date().isoformat()
totals = defaultdict(float)
for e in self._log:
if e['timestamp'][:10] == today:
totals[e['model']] += e['cost_usd']
return dict(totals)
def _check_alerts(self, user_id: str, new_cost: float):
"""Fire alerts when approaching budget limits."""
daily = self.today_total()
user_daily = self.today_by_user().get(user_id, 0)
if daily > self.daily_budget_usd * 0.80:
print(f"[BUDGET WARNING] Daily spend at ${daily:.2f} / "
f"${self.daily_budget_usd:.2f} ({daily/self.daily_budget_usd:.0%})")
if daily > self.daily_budget_usd:
raise RuntimeError(
f"[BUDGET EXCEEDED] Daily budget of ${self.daily_budget_usd:.2f} reached. "
f"Current: ${daily:.2f}. Set a higher limit in UsageTracker or "
f"add a hard spending limit at console.anthropic.com"
)
if user_daily > self.per_user_daily_limit:
print(f"[USER LIMIT] User '{user_id}' has spent ${user_daily:.4f} today "
f"(limit: ${self.per_user_daily_limit:.2f})")
def report(self) -> str:
"""Print a summary of today's usage."""
lines = [
f"=== Usage Report ({datetime.now().date()}) ===",
f"Total spend: ${self.today_total():.4f} / ${self.daily_budget_usd:.2f}",
"",
"By model:",
]
for model, cost in sorted(self.today_by_model().items(), key=lambda x: -x[1]):
short_model = model.split('-')[1] if '-' in model else model
lines.append(f" {short_model:12} ${cost:.4f}")
lines.extend(["", "Top users:"])
for user, cost in sorted(self.today_by_user().items(), key=lambda x: -x[1])[:5]:
lines.append(f" {user:20} ${cost:.4f}")
return "\n".join(lines)
# Usage in your agent:
tracker = UsageTracker(daily_budget_usd=50.0)
response = client.messages.create(model="claude-sonnet-4-6", ...)
cost = tracker.track(
model="claude-sonnet-4-6",
usage=response.usage,
user_id="user_12345",
task_type="research_agent"
)
print(f"This call cost: ${cost:.6f}")
print(tracker.report())
Your usage tracker is a soft limit in code. Always set a hard spending limit in the Anthropic console (console.anthropic.com → Settings → Billing → Spend Limits). This is your failsafe — it stops spending even if your code has a bug or an agent goes rogue. Set it 20–30% above your expected spend to avoid unexpected cutoffs.
Cost Optimization Checklist
Quick-reference list of optimizations ranked roughly by impact. Implement the top items first — they give the most savings for the least effort.
| # | Optimization | How | Typical Impact |
|---|---|---|---|
| 1 | Use Haiku for routing and classification | Route every incoming request through a Haiku classifier before deciding which model handles it. Classification prompts use <50 output tokens. | -80–95% |
| 2 | Enable prompt caching for long system prompts | Add cache_control: {"type": "ephemeral"} to any system prompt over 1,000 tokens. Cache hits cost 10% of normal input price. | -60–90% |
| 3 | Use the Batch API for non-real-time work | Daily reports, bulk analysis, scheduled research — anything that can wait up to 24 hours gets a 50% flat discount automatically. | -50% |
| 4 | Trim conversation history | Set a token budget on your message history and trim aggressively. Agents rarely need more than the last 5–6 turns plus the original task. | -20–50% |
| 5 | Set appropriate max_tokens per task type | Classification needs 10 tokens. Routing needs 50. Stop defaulting to 4096 for everything. Unused max_tokens don't cost money, but overshooting encourages verbose output. | -15–30% |
| 6 | Cache tool results within a session | If an agent searches for the same query twice in one run, return the cached result. A simple dict keyed on tool name + serialized arguments is enough. | -10–30% |
| 7 | Use streaming to stop generation early | Stream responses and stop consuming tokens once you have what you need (e.g., once you detect a JSON object is complete). You pay for tokens generated, not max_tokens. | -5–25% |
| 8 | Pre-filter requests with Haiku before expensive calls | Use Haiku to check "Is this request actually within scope?" or "Does this need a tool call at all?" before sending to Sonnet or Opus. | -30–60% |
| 9 | Add max_iterations to all agent loops | A bug or an adversarial prompt can cause an agent to loop indefinitely. Hard cap at 10–20 iterations; most tasks need fewer than 5. | Prevents runaway cost |
| 10 | Monitor per-user usage | One user testing your agent aggressively (or one bug in a loop) can consume your entire daily budget. Track usage by user ID and alert on outliers. | Budget protection |
| 11 | Set hard spending limits in the Anthropic console | Code bugs happen. Set a hard monthly cap at console.anthropic.com. This is your last line of defense against a runaway agent or billing accident. | Last-resort failsafe |
| 12 | Summarize long tool results before adding to context | A web search returning 5,000 tokens of HTML? Run a cheap Haiku call to extract a 200-token summary before adding it to the agent's conversation history. | -20–40% |
| 13 | Batch similar requests to maximize cache hits | When sending similar requests (same system prompt, different user inputs), group them together in time. Cache entries expire after 5 minutes by default — send within that window to share cache hits. | +10–20% cache efficiency |
| 14 | Profile your actual token usage to find real bottlenecks | Log every call with input_tokens, output_tokens, model, and task type. In most systems, 20% of task types account for 80% of cost. Fix those first, not the cheap ones. | Identifies highest-impact fixes |
| 15 | Consider fine-tuning for specific high-volume tasks | For very specific, high-volume tasks (e.g., classifying your exact support ticket taxonomy), fine-tuning a smaller model can match larger-model quality at a fraction of the cost. Only viable at 100K+ monthly tasks. | -70–90% at scale |
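Item 6 needs nothing more than a dict keyed on tool name plus serialized arguments. A minimal sketch, with a stand-in search tool:

```python
import json

class ToolResultCache:
    """In-session memo for tool calls: same tool + same args returns the cached result."""

    def __init__(self):
        self._cache = {}
        self.hits = 0

    def call(self, tool_name: str, args: dict, execute):
        # Sorted-key JSON makes the cache key stable across dict orderings.
        key = (tool_name, json.dumps(args, sort_keys=True))
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        result = execute(**args)  # only runs on a cache miss
        self._cache[key] = result
        return result

# Usage with a stand-in search tool:
cache = ToolResultCache()
fake_search = lambda query: f"results for {query}"
cache.call("web_search", {"query": "claude pricing"}, fake_search)
cache.call("web_search", {"query": "claude pricing"}, fake_search)  # cache hit
print(cache.hits)  # 1
```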
With cost optimization in place, you are ready to tackle the most powerful (and expensive) agent architecture: multi-agent systems. In the Multi-Agent guide, you will learn how to build networks of specialized agents that collaborate on complex tasks — and how to keep costs from multiplying when you add more agents.