If you're scoping an AI project on AWS, the first question is always the same: how much will Amazon Bedrock actually cost? The answer is more complicated than the per-token rates suggest.
The problem most teams run into: they budget for model inference tokens and get blindsided by a $350/month OpenSearch Serverless bill for their Knowledge Base, agent calls that consume 5-10x the expected tokens, or CloudWatch logging costs that spiral from verbose prompt storage. I've seen this pattern repeatedly, and it's almost always preventable.
This guide covers every Amazon Bedrock pricing dimension, from per-token model costs to the hidden infrastructure charges that cause bill shock, with real dollar amounts throughout. All pricing data verified as of February 2026. I also maintain an up-to-date Amazon Bedrock Pricing Calculator so you can estimate your specific monthly costs after reading this.
How Amazon Bedrock Pricing Works
Before diving into specific model rates, you need a mental model for how Bedrock costs add up. Most pricing guides jump straight to per-token tables, but that only tells a third of the story.
Bedrock uses a token-based, pay-per-use pricing model. A token is approximately 4 characters or 0.75 words in English. All text model pricing is quoted per 1 million tokens. Input tokens (your prompt) are always cheaper than output tokens (the model's response), typically by a factor of 3-5x. This asymmetry exists because generating output requires significantly more compute than processing input.
There are three distinct cost layers that make up your total Bedrock bill:
Layer 1 is what everyone covers: the per-token cost of calling a foundation model. Layer 2 is where teams get surprised: Knowledge Bases, Agents, Guardrails, and Flows each add their own charges on top of model inference. Layer 3 is the operational tail: logging, storage, and data transfer costs that accumulate silently.
Understanding all three layers is the difference between an accurate budget and a surprise bill. Let's start with a detail most guides skip.
Token-Based Pricing Explained
Every time you send a request to a Bedrock model, you're charged for two things: the tokens in your prompt (input) and the tokens in the model's response (output). Pricing is always quoted per 1 million tokens (per 1M tokens).
Here's what you need to know:
- Input tokens include your system message, user message, and any context you provide
- Output tokens are what the model generates in its response
- Extended thinking tokens (used by models like Claude with chain-of-thought reasoning) are billed as output tokens, not as a separate tier
- Image models are priced per image generated or processed, not per token
- Embedding models charge for input tokens only (no output)
Input vs. Output Token Costs
The pricing gap between input and output tokens matters for cost planning. For example, Claude Sonnet 4.5 charges $3.00 per 1M input tokens but $15.00 per 1M output tokens, a 5x difference. If your application generates long responses (code generation, detailed analysis), output tokens will dominate your bill.
Some models also charge more for extended context windows. Claude models with 1M token context windows bill inputs beyond 200K tokens at approximately 2x the standard rate. This is worth knowing if you're building RAG applications that stuff large documents into the context.
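To make the token math concrete, here's a minimal cost-estimator sketch. The rates are hardcoded from the tables in this guide, the ~4-characters-per-token rule is an approximation, and the 200K long-context threshold applies only to the Claude 1M-context models; verify everything against the current AWS pricing page before budgeting.

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    """USD cost for one request; rates are USD per 1M tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1e6

def long_context_input_cost(input_tokens: int, base_rate: float = 3.00) -> float:
    """Claude 1M-context billing: input beyond 200K tokens bills at ~2x."""
    standard = min(input_tokens, 200_000)
    extended = max(input_tokens - 200_000, 0)
    return (standard * base_rate + extended * base_rate * 2) / 1e6

# Claude Sonnet 4.5 on-demand ($3.00 in / $15.00 out per 1M tokens):
# a 2,000-token prompt with a 1,500-token response costs about $0.0285
cost = request_cost(2_000, 1_500, 3.00, 15.00)
```

Notice how the 1,500 output tokens account for nearly 80% of that request's cost; this is why long-response workloads are output-dominated.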
Does Bedrock Have a Free Tier?
Amazon Bedrock does not have a permanent free tier. You pay from the first API call.
However, new AWS accounts created after July 15, 2025 receive $200 in credits ($100 on sign-up plus $100 for completing guided activities, which include a Bedrock text playground tutorial). These credits are usable across 200+ AWS services including Bedrock and expire after 6 months. If you're evaluating Bedrock, this is a good way to test without upfront cost.
Now that you understand the pricing model, let me show you the costs that most guides skip entirely: the hidden infrastructure charges that catch teams off guard.
Hidden Costs You Need to Know Before Building
This section is the reason I wrote this guide. Every competitor article mentions hidden costs in a paragraph or two, but none of them give you the actual dollar amounts. Here's the full picture.
When you move beyond basic model inference and start using Bedrock features like Knowledge Bases, Agents, and Guardrails, you trigger additional charges that can easily exceed your inference bill. Understanding these upfront, before you architect your solution, is the best way to avoid bill shock.
| Hidden Cost Component | Minimum Monthly Floor | Mitigation Strategy |
|---|---|---|
| OpenSearch Serverless (Knowledge Base) | ~$350/month (2 OCUs) | Switch to Amazon S3 Vectors (up to 90% cheaper) |
| Agent cost multiplication | 5-10x your naive token estimate | Build focused, single-purpose agents |
| Guardrails (content + denied topics) | $0.30/1K text units combined | Enable only the filters you need |
| Flows | $0.035/1K node transitions | Minimize node count per workflow |
| CloudWatch Logging | $0.50/GB ingestion | Disable full prompt logging in production |
| Reranking (Amazon Rerank 1.0) | $1.00/1K queries | Use only when retrieval quality requires it |
Knowledge Base Costs (OpenSearch Serverless Floor)
Knowledge Bases for Bedrock don't have a separate KB fee. The costs come from the underlying components: embedding model inference, vector database storage, and the foundation model for RAG responses.
The catch is the vector database. If you use Amazon OpenSearch Serverless (the default option), you'll pay a minimum of approximately $350 per month for the required 2 OCUs (OpenSearch Compute Units) at ~$0.24 per OCU per hour, regardless of how many queries you run.
Here's the good news: Amazon S3 Vectors, launched in December 2025, is up to 90% cheaper and supports trillions of vectors with sub-second latency. If you're starting a new Knowledge Base project, S3 Vectors should be your default choice. For existing deployments, the migration is worth the effort.
You can estimate your OpenSearch infrastructure costs specifically with our OpenSearch pricing calculator.
Additional Knowledge Base costs to budget for:
- Reranking with Amazon Rerank 1.0: $1.00 per 1,000 queries (up to 100 document chunks per query)
- Structured data retrieval: $0.002 per GenerateQuery API call
- Data Automation for document parsing: $0.01 per page (standard) or $0.04 per page (custom output)
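Here's a quick sanity check you can run before choosing a vector store. The OCU floor and reranking rates come from this guide; the ~$35/month S3 Vectors figure is a rough placeholder estimate (actual S3 Vectors cost depends on storage and query volume), and embedding/generation inference bills separately.

```python
def kb_monthly_overhead(queries_per_month: int, use_opensearch: bool = True,
                        rerank: bool = False) -> float:
    """Knowledge Base overhead (vector DB + optional reranking), USD/month.

    Excludes embedding and generation-model inference, billed separately.
    """
    # OpenSearch Serverless floor: 2 OCUs x ~$0.24/OCU-hr x ~730 hrs/month
    # vs a rough ~$35/month placeholder for S3 Vectors
    vector_db = 2 * 0.24 * 730 if use_opensearch else 35.0
    # Amazon Rerank 1.0 at $1.00 per 1,000 queries
    reranking = queries_per_month / 1_000 * 1.00 if rerank else 0.0
    return vector_db + reranking

# 300K queries/month on the default OpenSearch setup with reranking:
overhead = kb_monthly_overhead(300_000, use_opensearch=True, rerank=True)
```

The takeaway: the OpenSearch floor dominates at low query volumes, while reranking dominates at high ones.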
Agent Cost Multiplication
This is the hidden cost that catches the most teams off guard. Bedrock Agents don't charge a per-invocation fee (the InvokeAgent API itself is free), but a single user query triggers multiple internal model invocations:
- Planning step: the model determines what to do
- Tool selection: the model selects which APIs to call
- Tool execution: each tool call may involve additional model calls
- Summarization: the model synthesizes results into a final response
The result? A single user query can consume 5-10x the tokens you'd estimate from looking at the prompt and response alone. The internal reasoning traces and multi-step processing are all billed at standard token rates.
How to mitigate: Build focused, single-purpose agents rather than Swiss-army-knife agents that try to do everything. Fewer tools means fewer internal reasoning steps.
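To see why the multiplier matters, here's a sketch of the estimate you should be making. The token counts and the 6-call loop are illustrative assumptions, not measured values; profile your own agent traces to get real numbers.

```python
def agent_monthly_cost(interactions: int, calls_per_interaction: int,
                       in_tokens: int, out_tokens: int,
                       in_rate: float, out_rate: float) -> float:
    """Monthly agent inference cost; rates are USD per 1M tokens."""
    calls = interactions * calls_per_interaction
    return calls * (in_tokens * in_rate + out_tokens * out_rate) / 1e6

# Claude Sonnet 4.5, 150K interactions/month, ~3K in / 800 out per call.
# Naive estimate assumes 1 model call per interaction; a realistic agent
# loop (plan, select tools, execute, summarize) often makes ~6.
naive = agent_monthly_cost(150_000, 1, 3_000, 800, 3.00, 15.00)
realistic = agent_monthly_cost(150_000, 6, 3_000, 800, 3.00, 15.00)
```

The gap between `naive` and `realistic` is exactly the budgeting miss described above.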
Guardrails Pricing
Guardrails add safety filters to your model inputs and outputs. Pricing is per filter, per 1,000 text units (where 1 text unit = up to 1,000 characters):
| Filter | Price per 1,000 Text Units |
|---|---|
| Content filters (text) | $0.15 |
| Denied topics | $0.15 |
| Sensitive information (PII, ML-based) | $0.10 |
| Contextual grounding checks | $0.10 |
| Automated Reasoning checks | $0.17 per policy |
| Word/regex filters | Free |
These prices were reduced by 80-85% in December 2024, which makes Guardrails more practical at scale. But they still add up. A customer support chatbot processing 1,000 queries per hour with content filters and denied topics enabled (assuming ~3 text units per query) costs approximately $0.90 per hour in Guardrails charges alone.
Important billing detail: if the guardrail blocks the input prompt, you pay for the guardrail evaluation only (no model inference charge). If it blocks the model response, you pay for both guardrail evaluation and model inference.
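Since Guardrails bill per filter, the cost scales with both traffic and the number of filters you enable. A small sketch using the rates above (the ~3 text units per query is an assumption; measure your own prompt lengths):

```python
def guardrails_cost(requests: int, text_units_per_request: int,
                    filter_rates: list[float]) -> float:
    """Guardrails cost in USD; filter_rates are USD per 1,000 text units."""
    units = requests * text_units_per_request
    return units / 1_000 * sum(filter_rates)

# 1,000 queries/hour, ~3 text units each, content filters + denied topics
# ($0.15 each per 1K text units):
hourly = guardrails_cost(1_000, 3, [0.15, 0.15])
```

Dropping a filter you don't need cuts this cost linearly, which is why "enable only the filters you need" is the mitigation in the table above.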
Flows and Additional Feature Costs
Bedrock Flows (the visual workflow builder for multi-step AI pipelines) charges $0.035 per 1,000 node transitions, plus standard FM inference costs for any model calls within the flow.
For a simple news summarization flow with 25 node transitions running 960 times per month, the flow transition cost is only $0.84/month, but the model inference within those flows is billed separately.
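The flow-transition arithmetic for that example looks like this (node counts and run volume are the illustrative figures from the paragraph above):

```python
# News summarization flow: 25 node transitions per run, 960 runs/month
transitions = 25 * 960                    # 24,000 transitions/month
flow_cost = transitions / 1_000 * 0.035   # Flows rate: $0.035 per 1K transitions
```

At $0.84/month the transition fee is negligible; the model calls inside the flow are what you actually budget for.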
CloudWatch Logging Costs
This one is easy to overlook. If you enable full model invocation logging (including prompts and responses), Bedrock can generate hundreds of gigabytes of log data. CloudWatch Logs charges $0.50 per GB for ingestion.
My recommendation: disable full prompt and response logging in production unless you need it for compliance or debugging. If you do need it, set a CloudWatch Logs retention policy to avoid indefinite storage costs.
With the full cost picture in mind, let's look at what you'll pay for model inference, the core of your Bedrock bill.
On-Demand Model Pricing by Provider
On-demand pricing is what you pay per token without any commitment or volume discounts. This is how most teams start with Bedrock, and the price range is enormous: from $0.035 per 1M input tokens (Nova Micro) to $75.00 per 1M output tokens (Claude Opus 4.6), a 2,000x difference.
The model you choose is the single biggest cost lever you have. Let me walk through the key providers, organized from budget-friendly to premium.
Amazon Nova Models (Budget-Friendly)
Amazon's own Nova models are designed for price-performance leadership. They're the cheapest option for most use cases on Bedrock.
| Model | Context Window | Input (per 1M tokens) | Output (per 1M tokens) | Best For |
|---|---|---|---|---|
| Nova Micro | 128K | $0.035 | $0.14 | Simple classification, routing, extraction |
| Nova Lite | 300K | $0.06 | $0.24 | General Q&A, summarization, multimodal |
| Nova Pro | 300K | $0.80 | $3.20 | Complex reasoning, coding, multi-step tasks |
| Nova Premier | 1M | $2.50 | $12.50 | Frontier intelligence, 1M context |
Nova Micro is text-only, while Nova Lite, Pro, and Premier support multimodal input (text, image, and video). All Nova models support Standard, Priority, Flex, and Batch tiers, with batch pricing at 50% of on-demand.
Nova Act, Amazon's agent-oriented model, uses a different pricing model: $4.75 per agent hour (elapsed real-world time while an agent is working, excluding human-in-the-loop wait time).
Anthropic Claude Models
Claude models are the most popular choice on Bedrock for complex reasoning and coding tasks. Here are the current prices:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Batch Input | Batch Output |
|---|---|---|---|---|
| Claude Opus 4.6 | $15.00 | $75.00 | $7.50 | $37.50 |
| Claude Sonnet 4.5/4.6 | $3.00 | $15.00 | $1.50 | $7.50 |
| Claude Opus 4.5 | $5.00 | $25.00 | $2.50 | $12.50 |
| Claude Haiku 4.5 | $1.00 | $5.00 | $0.50 | $2.50 |
| Claude 3 Haiku | $0.25 | $1.25 | $0.125 | $0.625 |
Long context pricing: For Claude models with 1M context windows, inputs beyond 200K tokens are billed at 2x the standard rate. For example, Claude Sonnet 4.5 charges $6.00 per 1M tokens for input beyond the 200K threshold.
Extended Access pricing: Legacy models like Claude 3.5 Sonnet (now in Public Extended Access) increased from $3/$15 to $6/$30 per 1M tokens as of December 2025. If you're still using these, migrating to Claude Sonnet 4.5 puts you back at $3/$15 with better performance.
Prompt caching is available for Claude models with two TTL options: 5-minute (1.25x write, 0.1x read) and 1-hour (2.0x write, 0.1x read). I'll cover caching math in the optimization section.
Meta Llama, Mistral, and Open-Source Models
Open-source models on Bedrock offer competitive pricing with no vendor lock-in on the model weights:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Mistral Large 3 | $0.50 | $1.50 |
| Magistral Small 1.2 | $0.50 | $1.50 |
| Devstral 2 135B | $0.40 | $2.00 |
| Ministral 3B | $0.10 | $0.10 |
| DeepSeek v3.2 | $0.62 | $1.85 |
| Llama 2 Chat (13B) | $0.75 | $1.00 |
| Llama 2 Chat (70B) | $1.95 | $2.56 |
Mistral's smaller models (Ministral 3B at $0.10/$0.10, Voxtral Mini at $0.04/$0.04) are excellent budget options for simple tasks. DeepSeek v3.2, added in February 2026, offers strong reasoning at $0.62/$1.85.
All Other Providers (DeepSeek, Google, NVIDIA, Qwen, and More)
Bedrock now hosts models from 15+ providers. Here's a curated selection of the newest and most cost-effective options:
| Provider | Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|---|
| Google | Gemma 3 4B | $0.04 | $0.08 |
| Google | Gemma 3 27B | $0.23 | $0.38 |
| NVIDIA | Nemotron Nano 2 | $0.06 | $0.23 |
| OpenAI (GPT-OSS) | GPT-OSS 20B | $0.07 | $0.30 |
| OpenAI (GPT-OSS) | GPT-OSS 120B | $0.15 | $0.60 |
| Qwen | Qwen3 32B | $0.15 | $0.60 |
| Qwen | Qwen3 Coder Next | $0.50 | $1.20 |
| MiniMax AI | MiniMax M2.1 | $0.30 | $1.20 |
| Moonshot AI | Kimi K2.5 | $0.60 | $3.00 |
| Z AI | GLM 4.7 | $0.60 | $2.20 |
| Z AI | GLM 4.7 Flash | $0.07 | $0.40 |
| Writer | Palmyra X5 | $0.60 | $6.00 |
Google Gemma 3 4B at $0.04/$0.08 is one of the cheapest models on Bedrock (alongside Voxtral Mini at $0.04/$0.04 and Nova Micro at $0.035/$0.14). Six of the models in this guide (DeepSeek v3.2, MiniMax M2.1, GLM 4.7, GLM 4.7 Flash, Kimi K2.5, and Qwen3 Coder Next) were added in February 2026.
For the exhaustive model list, see the official Amazon Bedrock pricing page.
Image and Multimodal Model Pricing
Stability AI provides image generation and editing models on Bedrock, priced per operation:
| Operation | Price per Image |
|---|---|
| Fast Upscale | $0.03 |
| Outpaint | $0.06 |
| Remove Background / Erase Object / Control / Inpaint | $0.07 |
| Style Transfer | $0.08 |
| Conservative Upscale | $0.40 |
| Creative Upscale | $0.60 |
Amazon Nova Canvas (image generation) and Nova Reel (video generation) are also available with per-image and per-second-of-video pricing respectively.
Those are the standard on-demand rates, but Bedrock offers several pricing tiers that can significantly reduce your costs. Let me walk you through when each one makes sense.
Bedrock Service Tiers: Standard, Priority, Flex, and Batch
Beyond on-demand pricing, Bedrock offers multiple tiers that let you trade flexibility for savings (or pay a premium for guaranteed performance). These tiers were formalized in February 2025, and most competitor guides still don't cover them.
Understanding which tier to use can save you 50% or more on your inference costs. Here's how to decide:
Standard Tier (Default)
The Standard tier is the default for all Bedrock inference. You get consistent performance at the base rates listed in the pricing tables above. This is the right choice for most production workloads where you need reliable latency without paying a premium.
Priority Tier (75% Premium for Low Latency)
The Priority tier provides preferential compute allocation with up to 25% better output-tokens-per-second latency at a 75% price premium over Standard.
Use Priority only for mission-critical, latency-sensitive applications where response time directly impacts user experience or revenue. For most workloads, the 75% premium isn't justified.
Example: Claude Sonnet 4.5 at Priority tier costs $5.25/$26.25 per 1M tokens instead of $3.00/$15.00 at Standard.
Flex Tier (50% Discount for Non-Urgent Work)
The Flex tier is one of the best cost optimization levers available: a 50% discount from Standard rates in exchange for relaxed latency. The key difference from batch: Flex is accessible through the regular Converse and InvokeModel APIs, so you don't need to restructure your application around file-based batch processing.
Flex is ideal for evaluations, background summarizations, non-urgent agentic tasks, and any workload where a few seconds of additional latency is acceptable.
Batch Inference (50% Discount for Async Processing)
Batch inference gives you the same 50% discount as Flex, but processes requests asynchronously from a file. You submit a batch job and get results when processing completes.
Best for: large-scale document processing, data enrichment, model evaluations, and any workload that doesn't need real-time responses.
| Model | Standard Input/Output | Batch Input/Output |
|---|---|---|
| Claude Sonnet 4.5 | $3.00 / $15.00 | $1.50 / $7.50 |
| Claude Haiku 4.5 | $1.00 / $5.00 | $0.50 / $2.50 |
| Nova Lite | $0.06 / $0.24 | $0.03 / $0.12 |
| Nova Pro | $0.80 / $3.20 | $0.40 / $1.60 |
Batch and prompt caching discounts can be combined for multiplicative savings. More on that in the optimization section.
Provisioned Throughput (Reserved Capacity)
Provisioned Throughput reserves dedicated model capacity at fixed hourly rates per Model Unit (MU). This is the enterprise option for very high, consistent inference volumes.
Commitment options: no commitment, 1-month, or 6-month (longer commitments cost less). Monthly costs range from $15,000 to $37,000+ per MU depending on the model and commitment term.
| Model | 1-Month Rate (per MU/hr) | Monthly Cost |
|---|---|---|
| Meta Llama 2 (1 MU) | $21.18 | ~$15,758 |
| Amazon Titan Text Express (2 MUs) | $18.40 | ~$27,379 |
| Cohere Command (1 MU) | $39.60 | ~$29,462 |
Provisioned Throughput only makes sense when your per-token costs at on-demand rates consistently exceed the hourly MU cost. Run a breakeven analysis before committing.
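A breakeven sketch you can adapt: compare the fixed monthly MU cost to what the same volume would cost on-demand. The $1.50/1M blended rate below is a hypothetical average across your input/output mix, not a published figure.

```python
def provisioned_monthly_cost(mu_hourly_rate: float, mus: int = 1,
                             hours: float = 730) -> float:
    """Fixed monthly cost of reserved Model Units."""
    return mu_hourly_rate * mus * hours

# Llama 2 at $21.18/MU-hr vs a hypothetical $1.50/1M blended on-demand rate:
pt_monthly = provisioned_monthly_cost(21.18)      # ~$15,461/month
breakeven_tokens = pt_monthly / 1.50 * 1e6        # ~10.3B tokens/month
```

Below roughly 10 billion blended tokens per month in this example, on-demand is cheaper; above it, the reserved MU starts to pay off (assuming one MU can actually serve that throughput).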
For the full tier comparison and model availability, see the Bedrock service tiers documentation.
If you need a customized model rather than the base versions, here's what that costs.
Fine-Tuning and Model Customization Costs
Fine-tuning a foundation model on your own data involves three cost components: training, storage, and inference on the customized model.
This isn't something every team needs, but when you do, it helps to understand the full cost picture upfront.
Training Costs
Fine-tuning is billed per token trained (total tokens in your training data multiplied by the number of epochs):
| Model | Price per 1,000 Training Tokens |
|---|---|
| Amazon Titan Text Lite | $0.001 |
| Amazon Titan Text Express | $0.008 |
| Meta Llama 2 (13B) | $0.00149 |
| Meta Llama 2 (70B) | $0.00799 |
| Cohere Command | $0.004 |
Reinforcement Fine-Tuning (RFT) uses hourly billing instead of per-token: $80 per hour for GPT-OSS 20B and Qwen3 32B models.
Custom Model Storage and Inference
Every custom model costs $1.95 per month for storage. Inference on custom models typically requires provisioned throughput, though some models support on-demand pricing.
Full cost example: Fine-tuning Titan Text Express with 20M tokens over 3 epochs:
- Training: (60M tokens / 1K) x $0.008 = $480 one-time
- Storage: $1.95/month
- Inference (1-month commitment, 1 MU): $13,248/month
- Inference (6-month commitment, 1 MU): $10,656/month
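The one-time training figure is just (dataset tokens x epochs x per-1K rate); a quick check using the Titan Text Express rate from the table:

```python
# Fine-tuning Titan Text Express: 20M training tokens, 3 epochs
tokens_trained = 20_000_000 * 3                   # 60M tokens billed
training_cost = tokens_trained / 1_000 * 0.008    # $0.008 per 1K training tokens
storage_monthly = 1.95                            # custom model storage, USD/month
```

Note that training and storage are rounding errors next to the provisioned-throughput inference cost; that hosting line is what decides whether fine-tuning is affordable.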
Model Distillation (Cheaper Alternative)
If full fine-tuning seems expensive, consider model distillation. This transfers knowledge from a larger "teacher" model to a smaller "student" model, and the results are impressive: distilled models can be up to 500% faster and 75% less expensive with less than 2% accuracy loss.
Distillation costs include synthetic data generation (at on-demand teacher model rates) and student model training. It's been generally available since May 2025 and supports pathways like Nova Premier to Nova Pro, and Claude 3.5 Sonnet v2 to Llama 3.2 1B/3B.
Model evaluation is also worth noting: algorithmic scores (relevance, coherence, fluency) are free. Human evaluation costs $0.21 per completed task. LLM-as-a-judge evaluation charges at standard token rates for the judge model.
All these per-token rates and pricing tiers can feel abstract. Let me show you what Bedrock actually costs for real-world use cases.
What Does Bedrock Actually Cost? Real-World Scenarios
This is the section I wish existed when I first started budgeting Bedrock projects. No competitor provides use-case cost projections, so here are three realistic scenarios with monthly estimates.
These estimates use US East region pricing at Standard tier. Your actual costs will vary based on configuration, traffic patterns, and model choice.
RAG Chatbot (Knowledge Base + Claude Haiku 4.5)
Assumptions: 10,000 queries per day, ~2,000 input tokens + 500 output tokens per query, using Claude Haiku 4.5 for generation and Titan Embeddings V2 for embedding.
| Cost Component | Monthly Estimate |
|---|---|
| Claude Haiku 4.5 inference (300K queries: 600M input + 150M output tokens) | $600 input + $750 output = $1,350 |
| Titan Embeddings V2 (600M input tokens for embedding queries + chunks) | $12 |
| Vector database (S3 Vectors) | ~$35 |
| Vector database (OpenSearch Serverless, if used instead) | ~$350 |
| Reranking (optional, 300K queries) | $300 |
| Total (with S3 Vectors, no reranking) | ~$1,397/month |
| Total (with OpenSearch + reranking) | ~$2,012/month |
The vector database choice alone creates a $315/month difference. This is why I recommend S3 Vectors as the default for new projects.
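The scenario table above reduces to a few lines of arithmetic you can rerun with your own traffic. The ~$0.02/1M embedding rate is inferred from the table ($12 for 600M tokens), and the $35/$350 vector-store figures are the estimates used in this guide.

```python
queries = 10_000 * 30                                    # 300K queries/month
# Claude Haiku 4.5: $1.00 in / $5.00 out per 1M tokens
inference = queries * (2_000 * 1.00 + 500 * 5.00) / 1e6  # ~$1,350
embeddings = 600 * 0.02                                  # 600M tokens, ~$12
total_s3 = inference + embeddings + 35                   # S3 Vectors, no rerank
total_oss = inference + embeddings + 350 + 300           # OpenSearch + rerank
```

Swap in your own query volume and token counts; inference scales linearly while the OpenSearch floor does not.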
Document Summarization Pipeline (Batch + Nova Lite)
Assumptions: 50,000 documents per day, ~4,000 input tokens + 1,000 output tokens per document, using Nova Lite with batch inference (50% discount).
| Cost Component | Monthly Estimate |
|---|---|
| Nova Lite batch input (1.5M docs x 4K tokens = 6B tokens) | 6,000 x $0.03 = $180 |
| Nova Lite batch output (1.5M docs x 1K tokens = 1.5B tokens) | 1,500 x $0.12 = $180 |
| S3 storage for batch files | ~$5 |
| Total | ~$365/month |
Using batch inference cuts the inference cost in half compared to on-demand (~$725/month total at standard rates). If you used Nova Micro instead of Nova Lite for simpler documents, the batch inference cost drops to approximately $210/month.
Customer Service Agent (Multi-Step Workflow)
Assumptions: 5,000 interactions per day, using Claude Sonnet 4.5 with 6 internal model calls per interaction (agent planning, tool selection, execution, summarization), ~3,000 input + 800 output tokens per call, Guardrails with content filters and denied topics enabled.
| Cost Component | Monthly Estimate |
|---|---|
| Claude Sonnet 4.5 input (150K interactions x 6 calls = 900K calls x 3K tokens = 2.7B tokens) | 2,700 x $3.00 = $8,100 |
| Claude Sonnet 4.5 output (900K calls x 800 tokens = 720M tokens) | 720 x $15.00 = $10,800 |
| Guardrails (content + denied topics, ~3 text units per interaction) | 150K x 3 x $0.30/1K = $135 |
| Total | ~$19,035/month |
This is the agent cost multiplier in action. A naive estimate based on one model call per interaction would suggest ~$3,200/month. The actual cost with 6 internal calls is roughly 6x higher. Budget accordingly.
Cost Estimation with Our Bedrock Pricing Calculator
These scenarios give you a ballpark, but your actual configuration will differ. Use the Amazon Bedrock Pricing Calculator to estimate costs for your specific model, traffic volume, and feature combination. It covers all models, tiers, and features so you can budget accurately before building.
Now that you know what to expect, here are 8 strategies to bring those costs down.
8 Strategies to Reduce Your Bedrock Costs
Each strategy below includes the savings potential and the specific scenario where it works best. I've ordered them from highest impact to most specialized.
1. Right-Size Your Model Selection
Savings: 50-99% per request
This is the single biggest cost lever. Most teams default to a powerful model like Claude Sonnet 4.5 ($3.00/$15.00) when Nova Micro ($0.035/$0.14) or Nova Lite ($0.06/$0.24) would produce acceptable results for simpler tasks.
Rule of thumb: Start with the cheapest model that meets your quality bar, then escalate only when testing shows the cheaper model falls short. Use Intelligent Prompt Routing ($1.00 per 1,000 requests) to automatically route each request to the cheapest adequate model within a family. AWS reports up to 30% cost reduction without compromising accuracy.
2. Use Batch Inference for Non-Real-Time Workloads
Savings: 50% on all tokens
Any workload that doesn't need real-time responses should use batch inference. Document processing, data enrichment, evaluations, content generation pipelines: all of these can run as batch jobs at half the on-demand price.
3. Enable Prompt Caching for Repeated Context
Savings: up to 90% on repeated input tokens
If your application reuses the same context across multiple requests (system prompts, shared documents, RAG context), prompt caching can dramatically reduce costs.
Here's how the math works for Claude Sonnet 4.5:
| Operation | Cost per 1M tokens | Relative to Standard |
|---|---|---|
| Standard input | $3.00 | 1.0x |
| Cache write (5-min TTL) | $3.75 | 1.25x |
| Cache write (1-hour TTL) | $6.00 | 2.0x |
| Cache read (hit) | $0.30 | 0.1x |
Breakeven: With the 5-minute TTL, caching pays off as soon as the cached context is read once. The first request costs 1.25x (cache write), so a write plus one hit totals 1.35x versus 2.0x for two uncached requests, and every additional hit costs only 0.1x.
Requirements: minimum 1,024 tokens per cache checkpoint, maximum 4 checkpoints, up to 32K cached tokens for Claude (20K for Nova). The 1-hour TTL option launched in January 2026 for Claude Sonnet 4.5, Haiku 4.5, and Opus 4.5.
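The breakeven math in code, using the Claude Sonnet 4.5 multipliers from the table (the 10K-token prefix size is an illustrative assumption):

```python
def cached_input_cost(uses: int, tokens: int, base_rate: float = 3.00,
                      write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Input cost for `uses` requests sharing one cached prefix (5-min TTL)."""
    write = tokens / 1e6 * base_rate * write_mult        # first request writes
    reads = (uses - 1) * tokens / 1e6 * base_rate * read_mult
    return write + reads

def uncached_input_cost(uses: int, tokens: int, base_rate: float = 3.00) -> float:
    return uses * tokens / 1e6 * base_rate

# A 10K-token shared prefix used twice within 5 minutes:
# cached = 1.25x + 0.1x = 1.35x vs 2.0x uncached, so caching already wins
two_cached = cached_input_cost(2, 10_000)
two_plain = uncached_input_cost(2, 10_000)
```

A single use loses money (1.25x vs 1.0x), so only cache context you're confident will be reused inside the TTL window.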
4. Switch to the Flex Tier for Tolerant Workloads
Savings: 50% from Standard
Flex gives you the same 50% discount as batch but through the regular Converse and InvokeModel APIs. No restructuring needed. If your workload can tolerate slightly higher latency, Flex is the easiest cost optimization to implement.
5. Use Intelligent Prompt Routing
Savings: up to 30%
For $1.00 per 1,000 requests, Intelligent Prompt Routing automatically selects between models in the same family (like Claude Sonnet + Haiku, or Nova Pro + Lite) based on prompt complexity. Simple queries go to the cheaper model; complex ones go to the more capable one.
At 10,000 requests per day, the routing cost is $10/day ($300/month). If routing saves 30% on a $5,000/month inference bill, that's $1,500 saved for $300 spent.
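The ROI check from that example, as a sketch (the $5,000 baseline bill and the 30% savings figure are the assumptions from the paragraph above; your routing savings will differ):

```python
requests = 10_000 * 30                      # monthly requests
routing_fee = requests / 1_000 * 1.00       # $1.00 per 1K requests -> $300/month
baseline_bill = 5_000.0                     # hypothetical monthly inference spend
gross_savings = baseline_bill * 0.30        # AWS reports up to 30%
net_savings = gross_savings - routing_fee   # what you actually keep
```

Routing only makes sense when `net_savings` is positive, i.e. when your inference bill is large enough that 30% of it clears the per-request fee.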
6. Use Cross-Region Inference for Better Throughput
Savings: ~10% on select models (Global CRIS)
Cross-region inference routes requests across AWS regions with no additional routing or data transfer fee. The Global option offers approximately 10% savings on select models compared to geographic cross-region inference.
The main trade-off is additional latency (double-digit milliseconds when re-routing occurs). If your workload is latency-tolerant, Global cross-region inference gives you better throughput and modest savings at no extra cost.
7. Optimize Your Prompts
Savings: 10-50% on input tokens
Shorter prompts mean fewer input tokens, and fewer input tokens mean lower costs. This sounds obvious, but I've seen system prompts grow to thousands of tokens with redundant instructions.
Use the Prompt Optimization feature ($0.03 per 1,000 tokens) to automatically generate more efficient prompts. The one-time optimization cost is quickly recouped through ongoing input token savings.
8. Replace OpenSearch Serverless with S3 Vectors
Savings: up to 90% on vector storage
If you're running a Knowledge Base on OpenSearch Serverless and paying the ~$350/month floor, switching to Amazon S3 Vectors (launched December 2025) can cut your vector storage costs by up to 90%. S3 Vectors supports trillions of vectors with sub-second latency, making it suitable for most RAG workloads.
For a shift-left FinOps approach, catch these cost decisions during code review before they go to production.
Shift-Left Your FinOps Practice
Move cost awareness from monthly bill reviews to code review. CloudBurn shows AWS cost impact in every PR, empowering developers to make informed infrastructure decisions.
One question I hear a lot: is it cheaper to call model APIs directly instead of going through Bedrock?
Bedrock vs. Direct API Pricing
This is a common question for teams evaluating Bedrock, especially for Anthropic Claude models. The short answer: Bedrock's per-token rates for Claude closely match Anthropic's direct API pricing. The real comparison is about total cost of ownership, not just per-token rates.
Bedrock vs. Anthropic API
Claude Sonnet 4.5 on Bedrock costs $3.00/$15.00 per 1M tokens. Anthropic's direct API pricing is comparable. The per-token cost isn't where Bedrock differentiates.
Where Bedrock adds value:
- Batch inference (50% discount) and Flex tier (50% discount) are Bedrock-exclusive
- Prompt caching with combinable batch discounts
- Unified AWS billing with IAM-based access control and VPC integration
- Guardrails, Knowledge Bases, and Agents as managed features
- Cross-region inference for better throughput at no routing cost
When Bedrock Is the Better Deal
If you're already on AWS, Bedrock is almost always the better deal. You avoid data egress fees, get IAM authentication instead of managing API keys, and have access to cost optimization features (batch, Flex, caching) that reduce your effective per-token cost well below the base rate.
Bedrock also gives you access to 20+ model providers through a single API. If you want to evaluate or switch between Claude, Nova, Llama, Mistral, and DeepSeek, you can do that without integrating separate APIs.
Direct API access might make sense if you need only one model provider, don't use any AWS services, or need features not yet available on Bedrock. But for most AWS teams, Bedrock's infrastructure advantages and cost optimization features make it the more cost-effective choice.
For more details on optimizing your AI infrastructure costs, see the AWS Bedrock cost optimization guide.
Wrapping Up
Amazon Bedrock pricing has a lot of dimensions, but once you understand the three cost layers, the pricing decisions become clearer. Here's what to remember:
Key takeaways:
- Bedrock uses token-based pricing with input tokens 3-5x cheaper than output. Model costs range from $0.035 per 1M input tokens (Nova Micro) to $75.00 per 1M output tokens (Claude Opus 4.6).
- Hidden costs can exceed your inference bill. OpenSearch Serverless ($350/month floor), Agent cost multiplication (5-10x tokens), Guardrails, and logging all add up. Budget for the full stack, not just tokens.
- Batch inference and Flex tier each offer 50% savings. Prompt caching can cut costs by up to 90% on repeated context. These are your biggest cost levers after model selection.
- The right model for your use case is usually not the most expensive one. Start with Nova Micro or Claude Haiku and escalate only when quality demands it.
- Use the Amazon Bedrock Pricing Calculator to estimate your specific monthly costs before building.
For broader cost management strategies, check out the AWS cost optimization checklist, explore AWS cost estimation tools, and read our AWS cost estimation guide for your complete infrastructure.
What's your experience with Bedrock pricing? Have you run into any unexpected costs? I'd love to hear about your setup in the comments below.