LLM Token Costs Benchmarked: What Engineering and FinOps Leaders Actually Need to Know

Pricing pages for LLM APIs look deceptively simple. A number for input. A number for output. A neat table that invites you to pick the cheapest column and move on.
The problem is that this framing is systematically misleading, and the companies that take it at face value are often the ones absorbing costs they don't fully understand. Output tokens cost three to ten times more than input tokens, depending on the model. Reasoning tokens are billed as output but never appear in your response. And the "per million tokens" rate you see published is almost always a best-case figure measured under conditions that don't resemble your production traffic.
This article is not another pricing table. Those exist, go stale within weeks, and are indexed everywhere. What's missing from the current landscape is a framework for reading pricing tables intelligently and translating raw rates into what your specific workloads actually cost to run at scale. That's the benchmark that moves budgets.
For a broader foundation on managing AI spend, see our guide to AI cost optimization.
The pricing illusion: why published rates mislead
Before you can use LLM pricing data correctly, it helps to understand the three ways it routinely misleads.
Input price leads, but output price dominates. Every major provider publishes input and output rates side by side, but input price almost always gets the headline. This is backwards for most production workloads. Output tokens require more computation than input tokens. The model has to generate each token sequentially, which is fundamentally more expensive than reading a prompt. The result: output rates are typically three to ten times higher than input rates, and for any workload that generates substantial responses, output cost dominates the bill. A system that sends long prompts with short responses will have a completely different cost profile than one requesting detailed generated outputs, even if the total token count is similar.
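A minimal sketch of that last point, using illustrative rates of $3 per million input tokens and $15 per million output tokens (a 5× spread, within the range above):

```python
# Illustrative per-token rates; substitute your provider's published pricing.
INPUT_RATE = 3.00 / 1_000_000    # $ per input token
OUTPUT_RATE = 15.00 / 1_000_000  # $ per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request at the rates above."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Same 2,000 total tokens per request, opposite input/output splits.
prompt_heavy = request_cost(input_tokens=1_800, output_tokens=200)       # long prompt, short answer
generation_heavy = request_cost(input_tokens=200, output_tokens=1_800)   # short prompt, long answer

print(f"prompt-heavy:     ${prompt_heavy:.4f} per request")
print(f"generation-heavy: ${generation_heavy:.4f} per request")
# The generation-heavy request costs roughly 3.3x more despite identical total token counts.
```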
Benchmark conditions don't match production conditions. Published performance benchmarks are measured with large, homogeneous batches, consistent sequence lengths, and warm infrastructure. Enterprise traffic is none of those things. Real users send variable-length prompts at unpredictable intervals. Applications frequently mix interactive and batch workloads in the same pipeline. The advertised cost per million tokens is a best-case figure; your effective cost in production will almost always be higher.
Reasoning tokens are invisible but billed. Extended thinking modes, increasingly standard in frontier models, generate internal reasoning tokens that are never returned in the API response. You don't see them. You do pay for them. For reasoning-heavy workloads, this can double or triple expected output costs without any warning in the response payload. If you're not accounting for reasoning tokens in your cost model, you're understating your real spend.
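One way to keep the hidden cost visible is to carry reasoning tokens as an explicit term in your cost model. The sketch below assumes you can read a reasoning-token count from the provider's usage metadata; the exact field names vary by provider and are not shown here.

```python
# Reasoning tokens are billed at the output rate even though they never appear
# in the response body. Treat this as a template, not a drop-in integration:
# where the reasoning-token count comes from depends on your provider's API.

def request_cost_with_reasoning(
    input_tokens: int,
    visible_output_tokens: int,
    reasoning_tokens: int,
    input_rate_per_m: float,
    output_rate_per_m: float,
) -> float:
    """Estimate request cost, counting hidden reasoning tokens as output."""
    billed_output = visible_output_tokens + reasoning_tokens
    return (
        input_tokens * input_rate_per_m / 1_000_000
        + billed_output * output_rate_per_m / 1_000_000
    )

# A response that "looks like" 400 output tokens but used 1,200 reasoning tokens:
naive = request_cost_with_reasoning(2_000, 400, 0, input_rate_per_m=2.50, output_rate_per_m=10.00)
actual = request_cost_with_reasoning(2_000, 400, 1_200, input_rate_per_m=2.50, output_rate_per_m=10.00)
print(f"naive estimate: ${naive:.4f}")
print(f"actual billing: ${actual:.4f}")  # more than double the naive total; 4x the output line item
```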
The model tiers: a map, not a ranking
Model selection is not about finding the "best" model. It's about matching capability to task, and the capability differences between tiers are meaningful only in context.
Tier 1 – Frontier models (Claude Opus 4.6 at $5/$25, GPT-5.4 at $2.50/$15, Gemini 3.1 Pro at $1.25/$10 per 1M tokens): These are the right choice for complex multi-step reasoning, high-stakes content generation where quality degradation has business consequences, and agentic orchestration where nuanced judgment is required at each step. They are the wrong choice for high-volume classification, straightforward extraction, or summarization – tasks where mid-tier models produce near-identical output at a fraction of the cost.
Tier 2 – Mid-tier / workhorse models (Claude Sonnet 4.6 at $3/$15, GPT-5.4 mini at $0.25/$2, Gemini 2.5 Flash at $0.30/$2.50): This tier handles the majority of production workloads well. For high-volume applications, such as customer support, document processing, and RAG pipelines, mid-tier models typically represent the best balance of quality and cost. Routing decisions that push everything to Tier 1 "just to be safe" are where significant overspend originates.
Tier 3 – Budget / open-weight models (DeepSeek V3 at $0.27/$1.10, Llama 4 hosted at $0.15/$0.60, Gemini Flash-Lite at $0.10/$0.40): Many engineering teams report that swapping a frontier model for DeepSeek V3 on straightforward classification and extraction tasks produces near-identical output quality at four to five times lower cost. The quality gap widens significantly on tasks requiring deep multi-step reasoning or precise instruction adherence, but for well-scoped, high-volume tasks, the quality parity is often closer than assumed.
The range here is not subtle. Pricing across major LLM APIs varies by roughly 600× – from $0.05 to $30 per million input tokens. The right tier for your workload depends on task complexity, volume, and how much quality variation is acceptable.
Real cost by workload type: the benchmark that actually matters
Abstract pricing comparisons have limited value. What matters is what a specific workload type actually costs at production scale. Here are five archetypes with cost profiles that reflect real engineering tradeoffs.
Customer support / conversational chatbot
Typical structure: short user input (100–300 tokens), moderate system prompt (500–800 tokens), short to medium output (150–400 tokens). Input-heavy overall.
At 1 million monthly conversations with 500 input tokens and 200 output tokens per conversation, a flagship model at $2.50/$10 pricing costs roughly $3,250/month. The same workload on a budget-tier model at $0.15/$0.60 costs approximately $195 – a 16× difference for identical token counts. The common overspend pattern: running Tier 1 models on support tickets that a fine-tuned or purpose-built Tier 3 model handles equally well. Most customer support queries are not complex reasoning tasks.
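The arithmetic behind those figures is easy to reproduce and worth wiring into your own cost model; a minimal sketch using the volumes and rates quoted above:

```python
def monthly_cost(conversations: int, input_tokens: int, output_tokens: int,
                 input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Monthly spend for a fixed-shape conversational workload."""
    total_input = conversations * input_tokens
    total_output = conversations * output_tokens
    return (total_input * input_rate_per_m + total_output * output_rate_per_m) / 1_000_000

CONVERSATIONS = 1_000_000  # per month
flagship = monthly_cost(CONVERSATIONS, 500, 200, input_rate_per_m=2.50, output_rate_per_m=10.00)
budget = monthly_cost(CONVERSATIONS, 500, 200, input_rate_per_m=0.15, output_rate_per_m=0.60)

print(f"flagship tier: ${flagship:,.0f}/month")   # ~$3,250
print(f"budget tier:   ${budget:,.0f}/month")     # ~$195
print(f"ratio:         {flagship / budget:.1f}x")
```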
Document processing / summarization pipeline
Typical structure: large input (a 10-page document ≈ 4,000 tokens), short to medium output (300–500 tokens). Highly input-dominant.
For document pipelines processing thousands of documents daily, model selection has a direct P&L impact: at the tier prices above, a 4,000-token document with a 400-token summary costs roughly $0.009 on Gemini 3.1 Pro versus roughly $0.03 on Claude Opus 4.6 – more than a 3× difference on identical work, repeated across every document in the pipeline. Caching wins are enormous in this archetype. Long documents with repeated context (headers, instructions, schemas) are the ideal use case for prompt caching. Implementing caching on a high-volume document pipeline is often the single fastest cost reduction available.
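The per-document figures follow directly from the tier prices above; a minimal sketch, with the 10,000-documents-per-day volume as an assumed illustration:

```python
DOCS_PER_DAY = 10_000            # assumed pipeline volume
INPUT_TOKENS_PER_DOC = 4_000     # ~10-page document
OUTPUT_TOKENS_PER_DOC = 400      # summary

def per_doc_cost(input_rate_per_m: float, output_rate_per_m: float) -> float:
    return (INPUT_TOKENS_PER_DOC * input_rate_per_m
            + OUTPUT_TOKENS_PER_DOC * output_rate_per_m) / 1_000_000

for name, rates in {
    "Gemini 3.1 Pro ($1.25/$10)": (1.25, 10.00),
    "Claude Opus 4.6 ($5/$25)": (5.00, 25.00),
}.items():
    cost = per_doc_cost(*rates)
    monthly = cost * DOCS_PER_DAY * 30
    print(f"{name}: ${cost:.4f}/doc, ${monthly:,.0f}/month at {DOCS_PER_DAY:,} docs/day")
```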
RAG (retrieval-augmented generation)
Typical structure: variable input (base query + retrieved chunks, often 2,000–8,000 tokens), moderate output (300–600 tokens). Input cost dominates because of context stuffing.
The optimization lever in RAG is retrieval quality, not model selection. Every unnecessary chunk passed to the model is billed as input tokens. Over-retrieval, passing five chunks when two would suffice, is one of the most common and expensive RAG mistakes, and it compounds at scale. Reducing the average retrieved context by 40% reduces input tokens by a corresponding amount. Most RAG tasks don't require frontier reasoning; they require accurate retrieval and clean synthesis, which mid-tier models handle well.
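A rough sketch of what retrieval discipline is worth, assuming 800-token chunks, a 400-token base prompt, mid-tier rates from the table above, and 2 million queries per month (all assumptions for illustration):

```python
CHUNK_TOKENS = 800        # average retrieved chunk (assumed)
BASE_PROMPT_TOKENS = 400  # user query + system prompt (assumed)
OUTPUT_TOKENS = 500
INPUT_RATE_PER_M, OUTPUT_RATE_PER_M = 0.30, 2.50  # mid-tier pricing from the tier map

def rag_query_cost(chunks_retrieved: int) -> float:
    input_tokens = BASE_PROMPT_TOKENS + chunks_retrieved * CHUNK_TOKENS
    return (input_tokens * INPUT_RATE_PER_M + OUTPUT_TOKENS * OUTPUT_RATE_PER_M) / 1_000_000

MONTHLY_QUERIES = 2_000_000
five_chunks = rag_query_cost(5) * MONTHLY_QUERIES
two_chunks = rag_query_cost(2) * MONTHLY_QUERIES
print(f"5 chunks per query: ${five_chunks:,.0f}/month")
print(f"2 chunks per query: ${two_chunks:,.0f}/month")
# Same model, same output length; the only change is retrieval discipline.
```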
Agentic / multi-step workflows
Typical structure: unpredictable and often recursive. Each step can range from 500 to 5,000 tokens; agents can run 5–50 steps per task. Output cost frequently exceeds input.
This is the highest-risk archetype from a cost perspective. Models supporting context windows exceeding one million tokens increase the risk of runaway costs if prompts or retrieved context aren't tightly controlled. A misconfigured agent loop can execute dozens of full-context calls before anyone notices, and those calls stack. Guardrails on step count and token budgets per task are non-negotiable. The recommended architecture: Tier 1 models for orchestration decisions where reasoning quality matters, Tier 2 or 3 for sub-tasks where the logic is well-defined.
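A minimal guardrail sketch: call_model and plan_next_step are placeholders for your own orchestration code, not a specific SDK; the point is the hard ceilings on steps and tokens.

```python
class BudgetExceeded(RuntimeError):
    pass

MAX_STEPS = 20                 # hard ceiling on agent iterations per task
MAX_TOKENS_PER_TASK = 150_000  # hard ceiling on total tokens per task

def run_agent(task, call_model, plan_next_step):
    """Run an agent loop with hard ceilings on steps and total tokens.

    call_model and plan_next_step are placeholders for your orchestration
    code; call_model is assumed to return (result, tokens_consumed).
    """
    tokens_used = 0
    state = task
    for step in range(MAX_STEPS):
        step_input = plan_next_step(state)
        if step_input is None:  # agent decided it is done
            return state
        result, step_tokens = call_model(step_input)
        tokens_used += step_tokens
        if tokens_used > MAX_TOKENS_PER_TASK:
            raise BudgetExceeded(
                f"task aborted at step {step}: {tokens_used:,} tokens used"
            )
        state = result
    raise BudgetExceeded(f"task aborted: exceeded {MAX_STEPS} steps")
```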
Batch classification / extraction / enrichment
Typical structure: moderate input (500–1,000 tokens), short output (50–200 tokens). High volume, time-insensitive. The classic background pipeline.
This is where the 50% batch API discount becomes material. All major providers offer batch APIs that process requests asynchronously at roughly half the standard rate. Any workload that doesn't require real-time responses should be routed through the batch API. This is free money for non-interactive use cases. The most common overspend pattern in this archetype is running synchronous, real-time API calls for classification or enrichment jobs that could be queued and processed overnight.
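The saving is straightforward to quantify; a sketch assuming the roughly 50% batch discount and an illustrative volume of 5 million records per month:

```python
def classification_monthly_cost(jobs: int, input_tokens: int, output_tokens: int,
                                input_rate_per_m: float, output_rate_per_m: float,
                                batch_discount: float = 0.0) -> float:
    """Monthly cost for a classification/enrichment pipeline.

    batch_discount is the fraction taken off the standard rate (roughly 0.5
    for the asynchronous batch APIs described above).
    """
    per_job = (input_tokens * input_rate_per_m + output_tokens * output_rate_per_m) / 1_000_000
    return jobs * per_job * (1 - batch_discount)

JOBS = 5_000_000  # records enriched per month (assumed volume)
sync_cost = classification_monthly_cost(JOBS, 800, 100, 0.30, 2.50)
batch_cost = classification_monthly_cost(JOBS, 800, 100, 0.30, 2.50, batch_discount=0.5)
print(f"synchronous: ${sync_cost:,.0f}/month")
print(f"batch API:   ${batch_cost:,.0f}/month")  # half the bill for the same work
```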
The cost levers that move the number more than model switching
After all the emphasis on model selection, here's the counterintuitive finding: teams that switch models without optimizing caching, output control, and routing often leave more savings on the table than teams that optimize all three and stay on a more expensive model.
Prompt caching. Every major provider now supports server-side caching of frequently reused prompts and context. Anthropic charges 10% of the base input price for cache hits; Google's context caching is similarly discounted. For applications with consistent system prompts, which describes almost every production deployment, this is the highest single-leverage cost reduction available. A 1,000-token system prompt reused across 1 million requests per month represents 1 billion input tokens at raw billing; with cache hits billed at 10% of the base rate, the effective charge is equivalent to roughly 100 million.
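Worked through at an assumed base input rate of $3 per million tokens, and ignoring the small one-time cache-write premium some providers charge, the example above looks like this:

```python
SYSTEM_PROMPT_TOKENS = 1_000
REQUESTS_PER_MONTH = 1_000_000
INPUT_RATE_PER_M = 3.00          # assumed base input rate, $/1M tokens
CACHE_HIT_MULTIPLIER = 0.10      # cache reads billed at 10% of the base rate

uncached = SYSTEM_PROMPT_TOKENS * REQUESTS_PER_MONTH * INPUT_RATE_PER_M / 1_000_000
cached = uncached * CACHE_HIT_MULTIPLIER  # ignores the one-time cache-write premium

print(f"system prompt, no caching: ${uncached:,.0f}/month")  # $3,000
print(f"system prompt, cached:     ${cached:,.0f}/month")    # $300
```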
Output length control. Reducing average output length by 40% cuts total costs by 20–30%, depending on your input/output ratio. The practical techniques: instruct the model explicitly to be concise, request structured output formats (JSON, structured lists) rather than prose when structure is what you actually need, set max_tokens limits to prevent runaway generation, and post-process to extract only the information your downstream system consumes.
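The 20–30% figure falls straight out of output's share of total spend; a minimal sketch:

```python
def total_cost_reduction(output_share_of_spend: float, output_length_cut: float) -> float:
    """Fraction of total spend saved by shortening outputs.

    output_share_of_spend: what fraction of the current bill is output tokens
    (this depends on your input/output ratio and the model's rate spread).
    """
    return output_share_of_spend * output_length_cut

for share in (0.5, 0.6, 0.75):
    saving = total_cost_reduction(share, output_length_cut=0.40)
    print(f"output = {share:.0%} of spend -> {saving:.0%} total saving from 40% shorter outputs")
# Prints 20%, 24%, and 30%, matching the range quoted above.
```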
Intelligent model routing. A typical enterprise routing strategy might send 70% of queries to a budget model, 20% to a mid-tier model, and 10% to a premium model for the most demanding tasks. Compared to routing all traffic through a single premium model, this tiered approach can reduce average per-query cost by 60–80%. The key is building routing logic around task classification and treating that classification as a first-class engineering problem, not an afterthought.
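A sketch of the blended per-query cost for that 70/20/10 split, using representative rates from the tier map above and an assumed query shape of 1,000 input and 300 output tokens:

```python
INPUT_TOKENS, OUTPUT_TOKENS = 1_000, 300  # assumed per-query shape

def per_query_cost(input_rate_per_m: float, output_rate_per_m: float) -> float:
    return (INPUT_TOKENS * input_rate_per_m + OUTPUT_TOKENS * output_rate_per_m) / 1_000_000

TIERS = {
    "budget":  (0.27, 1.10, 0.70),   # (input $/1M, output $/1M, traffic share)
    "mid":     (0.30, 2.50, 0.20),
    "premium": (1.25, 10.00, 0.10),
}

blended = sum(per_query_cost(i, o) * share for i, o, share in TIERS.values())
all_premium = per_query_cost(1.25, 10.00)
print(f"blended:     ${blended:.5f}/query")
print(f"all-premium: ${all_premium:.5f}/query")
print(f"saving:      {1 - blended / all_premium:.0%}")
# Roughly 75% with these assumptions, within the 60-80% range quoted above.
```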
The principle: model selection determines your ceiling; caching, output control, and routing determine how close to the floor you actually operate.
Pricing is a moving target: how to stay current
Any article that publishes specific token prices has a shelf-life problem, and the honest response is to name it directly. LLM API prices dropped approximately 80% between early 2025 and early 2026. That rate of change is not slowing. Pricing that was accurate at the time you built your cost model may be significantly wrong six months later in either direction.
Three practices that help:
Establish a quarterly model review cadence. The models available and their relative price-performance ratios shift meaningfully every six to nine months. A model that was the right mid-tier choice at your product launch may have been displaced by a stronger option at a lower price point by the time you're reading this.
Track cost per workload type, not just cost per token. If a new model is 30% cheaper per token but produces outputs requiring 20% more follow-up requests or human review, the net saving may be negative. The unit of measurement should be cost per task completed, not cost per token generated; a short worked sketch of that comparison follows these three practices.
Treat model commitments with the same scrutiny as reserved cloud instances. Provisioned throughput units (PTUs) and similar commitment-based pricing make sense when utilization is predictable and sustained. They carry the same risks as reserved instances: you're trading flexibility for cost efficiency, and that trade degrades badly when workloads shift.
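And a minimal sketch of the second practice above, comparing models on cost per completed task rather than per token; the rework and review figures are illustrative assumptions, not benchmarks:

```python
def cost_per_completed_task(cost_per_call: float, calls_per_task: float,
                            human_review_rate: float, cost_per_review: float) -> float:
    """Effective cost of getting one task to 'done', not one API call answered."""
    return cost_per_call * calls_per_task + human_review_rate * cost_per_review

# Incumbent model vs. a model 30% cheaper per token that needs 20% more
# follow-up calls and twice the human-review rate (illustrative figures).
incumbent = cost_per_completed_task(0.010, calls_per_task=1.0,
                                    human_review_rate=0.02, cost_per_review=2.50)
cheaper = cost_per_completed_task(0.007, calls_per_task=1.2,
                                  human_review_rate=0.04, cost_per_review=2.50)
print(f"incumbent model:     ${incumbent:.4f}/task")
print(f"'cheaper' model:     ${cheaper:.4f}/task")
# With these assumptions the per-token "cheaper" model costs nearly twice as
# much per completed task once rework and review are counted.
```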
Conclusion
The pricing tables that exist online answer a narrow question: what does this model cost per token? The harder and more valuable question is: what does this workload actually cost to run at scale? Answering that requires layering workload structure, caching strategy, routing logic, and output control on top of raw pricing.
That's the benchmark that moves budgets. Not which model has the lower published rate, but which combination of model, caching configuration, routing logic, and output discipline produces the best cost-per-outcome for each class of work you're actually running.
Knowing what a workload costs is the first step. The next question is who owns that cost across teams, products, and business units. This is where AI cost allocation becomes a critical practice. That's where the conversation continues.

