Speculative Decoding
Written by Shaoni Mukherjee, AI Technical Writer
By Global Outreach
Shaoni is an experienced AI technical writer with a strong background in data science
Speculative decoding can significantly improve token throughput, but it can also increase latency and memory usage if not implemented correctly
Speculative decoding is a conditional optimization that delivers real gains on the right workloads, but it can quietly degrade everything else
This article provides an operational framework for making the decision to use speculative decoding, including which draft model to pick and how to measure its effectiveness
The performance figures in this article come from the vLLM team's own published benchmarks, but it's essential to measure performance in your own production conditions
TL;DR
- Speculative decoding proposes tokens using a small draft model and verifies them in one target model pass
- It helps at low query rates on structured workloads, but degrades at high query rates
- Pick a draft model at a 1:8-1:12 size ratio from the same model family
- Monitor spec_decode_draft_acceptance_rate in production to determine effectiveness
Choosing the Right Draft Model for Speculative Decoding
The first decision is which draft model to use, yet it is rarely treated as the most important one.
Why the size ratio is the primary lever
Speculative decoding short-circuits the process of generating tokens by using a small draft model to propose tokens and then verifying them in a single target model pass
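The propose-then-verify round described above can be sketched in a few lines. This is a toy illustration rather than how any serving engine implements it: the "models" are hypothetical functions that return the next token greedily, acceptance is a plain equality check (the temperature=0 case), and the target is called in a loop where a real engine would score all proposed positions in one batched forward pass.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token sequence to the next token
# it would pick greedily.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Run one propose/verify round and return the tokens actually emitted."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each proposed position in order.
    emitted: List[int] = []
    for token in proposed:
        expected = target(prefix + emitted)
        if token != expected:
            # First mismatch: keep the target's token, discard the rest.
            emitted.append(expected)
            return emitted
        emitted.append(token)

    # 3. Every proposal matched, so the target's pass yields one bonus token.
    emitted.append(target(prefix + emitted))
    return emitted


def toy_target(seq: List[int]) -> int:
    """Toy target: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    """Toy draft: agrees with the target except right after a 3."""
    return 0 if seq[-1] == 3 else (seq[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step([0], toy_draft, toy_target, k=5))
```

With these toy models the first three proposals are accepted, the fourth is rejected and replaced by the target's own token, and the fifth is discarded, so the round emits four tokens for one verification pass.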
```
Standard generation: 5 tokens = 5 sequential target model passes

[ 70B ] → t₁ → [ 70B ] → t₂ → [ 70B ] → t₃ → [ 70B ] → t₄ → [ 70B ] → t₅
(pass 1)       (pass 2)       (pass 3)       (pass 4)       (pass 5)

Speculative decoding: 5 proposed tokens = 1 draft pass + 1 verify pass

[ 8B Draft ]   → t₁ t₂ t₃ t₄ t₅       (one fast pass, all proposed)
      |
      ▼
[ 70B Target ] → ✓t₁ ✓t₂ ✓t₃ ✗t₄ -    (one parallel verify pass)
                 └─────────────┘
                 3 tokens accepted, 2 rejected
                 target model ran once instead of five times
```
The size ratio of the draft model to the target model is critical, as a larger draft model is better at predicting what the target model would say.
A larger draft model predicts more accurately, but it also costs more in terms of VRAM and compute
The mechanics of the algorithm point to a consistent sweet spot: a 1:8 to 1:12 size ratio using same-family, same-training-distribution models
These pairings illustrate how the size ratio affects draft model quality, and acceptance rates vary significantly with hardware, quantization, and prompt distribution
The 1B/70B pairing looks cheap but rarely pays off, as the target model rejects too many of the draft model's tokens.
The vLLM team's published benchmarks used a 0.5B draft model against Llama-3-70B, a 1:140 ratio, and saw 1.5x speedup at low query rates and 1.4x slowdown at high query rates
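As a quick sanity check on a candidate pairing, the ratio arithmetic can be written down directly. The sketch below only encodes the 1:8 to 1:12 band mentioned above; the function names are illustrative, and parameter counts are in billions.

```python
def size_ratio(draft_params_b: float, target_params_b: float) -> float:
    """Return N for a 1:N draft-to-target size ratio."""
    return target_params_b / draft_params_b


def in_sweet_spot(draft_params_b: float, target_params_b: float,
                  low: float = 8.0, high: float = 12.0) -> bool:
    """True if the pairing falls inside the 1:8 to 1:12 band."""
    return low <= size_ratio(draft_params_b, target_params_b) <= high


if __name__ == "__main__":
    for draft_b in (8.0, 1.0, 0.5):
        ratio = size_ratio(draft_b, 70.0)
        verdict = in_sweet_spot(draft_b, 70.0)
        print(f"{draft_b}B draft / 70B target -> 1:{ratio:.2f}, in band: {verdict}")
```

The band is a starting heuristic, not a guarantee: same-family training and the measured acceptance rate still decide whether a pairing pays off.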
Temperature destroys your benchmark numbers
Temperature controls how predictable or random a model's output is, and it affects the effectiveness of speculative decoding
At temperature=0, decoding is greedy and the model always picks its most likely token, so a well-matched draft model guesses right most of the time. At higher temperatures, the model samples more surprising tokens, and the draft model's guesses start missing more often.
Most benchmarks are run at temperature=0, which is where speculative decoding looks best, but in production, the temperature used depends on the task
The simplest way to see the effect of temperature is to run the same prompt at different temperatures and watch spec_decode_draft_acceptance_rate shift in real-time
```python
import requests

VLLM_URL = "http://localhost:8000/v1/chat/completions"
METRICS_URL = "http://localhost:8000/metrics"
PROMPT = "Write a short story about a robot learning to paint."


def get_acceptance_rate():
    """Return the draft acceptance rate from /metrics, or None if it is absent."""
    text = requests.get(METRICS_URL, timeout=10).text
    for line in text.split("\n"):
        if "spec_decode_draft_acceptance_rate" in line and not line.startswith("#"):
            return float(line.split()[-1])
    return None


# The metric is server-wide, so run this against an otherwise idle server.
for temperature in [0.0, 0.4, 0.8, 1.0]:
    # Send 20 requests at this temperature
    for _ in range(20):
        requests.post(
            VLLM_URL,
            json={
                "model": "meta-llama/Llama-3.1-70B-Instruct",
                "messages": [{"role": "user", "content": PROMPT}],
                "temperature": temperature,
                "max_tokens": 200,
            },
            timeout=120,
        )
    rate = get_acceptance_rate()
    # The metric is missing when speculative decoding is not enabled
    shown = "n/a" if rate is None else f"{rate:.2f}"
    print(f"temperature={temperature} acceptance_rate={shown}")
```
Expected output shape: find the row that matches your production temperature and check the acceptance rate.
```
temperature=0.0 acceptance_rate=0.81
temperature=0.4 acceptance_rate=0.71
temperature=0.8 acceptance_rate=0.52
temperature=1.0 acceptance_rate=0.38
```
If the acceptance rate is below 0.5, stop: speculative decoding is net-negative on this workload.
Higher temperature leads to a flatter probability distribution, and the draft model's concentrated guesses miss more often, leading to more rejections
Source: vLLM Team, How Speculative Decoding Boosts vLLM Performance by up to 2.8x, October 2024
The last two rows are not edge cases: at production query rates, speculative decoding adds overhead instead of removing it
The intuition for the crossover point: an 8B draft model costs roughly 1/9th the compute of a 70B target model
The practical implication is straightforward: don't assume speculative decoding is helping just because your benchmark looked good
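To make the crossover intuition concrete, here is a back-of-the-envelope model based on the standard expected-tokens formula from the speculative decoding literature. It rests on strong assumptions that are mine, not measurements: each draft token is accepted independently with probability `alpha`, a draft step costs `cost_ratio` of a target step, and the verify pass costs the same as one ordinary target decode step. That last assumption only holds while the GPU has spare compute, so the result is best read as an upper bound at low query rates.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verify pass with k draft tokens.

    Assumes each draft token is accepted independently with probability alpha.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def idealized_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Upper-bound speedup over plain decoding.

    One round costs k draft steps (each cost_ratio of a target step)
    plus one target verify pass.
    """
    return expected_tokens_per_step(alpha, k) / (k * cost_ratio + 1.0)


if __name__ == "__main__":
    for alpha in (0.8, 0.5, 0.38):
        speedup = idealized_speedup(alpha, k=5, cost_ratio=1 / 9)
        print(f"alpha={alpha:.2f} idealized speedup={speedup:.2f}x")
```

Under load the verify pass competes with other requests for compute and is no longer close to free, which is why the break-even acceptance rate in production sits well above what this idealized model suggests.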
Memory Budget Reality
Running speculative decoding means running two models simultaneously, which is obvious in principle but surprisingly painful in practice
Actual footprint numbers
Using Llama-3.1 as a concrete example, weight sizes are derived from parameter counts and precision
Llama-3.1-70B target model
- BF16: 140GB, too large for a single 80GB H100
- INT8: 70GB, fits on a single H100 with 10GB to spare
Llama-3.1-8B draft model
- BF16: 16GB
- INT8: 8GB
- INT4: 4GB
On a 2x H100 SXM5 setup (160GB total), a common configuration for production 70B serving:
- Target (70B INT8, 70GB) + Draft (8B BF16, 16GB) = 86GB weights, leaving 74GB for KV cache
The practical takeaway for H100: if you want to run speculative decoding with a 70B target, you have to quantize it to INT8
On DigitalOcean's H200 GPU Droplets (141GB per GPU), this constraint goes away: with two GPUs (282GB total), you can run 70B BF16 + 8B BF16 (156GB of weights) with 126GB of headroom.
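The footprint arithmetic above is easy to reproduce for other pairings. The sketch below counts weights only, using decimal gigabytes and bytes-per-parameter for each precision. The two-GPU H200 case is an assumption inferred from the 126GB headroom figure, since 2 x 141GB minus 156GB of weights gives that number.

```python
# Bytes per parameter for each precision (weights only).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billions: float, precision: str) -> float:
    """Weight footprint in GB: billions of parameters times bytes per parameter."""
    return params_billions * BYTES_PER_PARAM[precision]


def headroom_gb(total_vram_gb: float, target: tuple, draft: tuple) -> float:
    """VRAM left for KV cache and activations after loading both models."""
    return total_vram_gb - weight_gb(*target) - weight_gb(*draft)


if __name__ == "__main__":
    # 2x H100 (80GB each), INT8 target + BF16 draft
    print(headroom_gb(2 * 80, (70, "int8"), (8, "bf16")))
    # 2x H100, both models in BF16: almost nothing left
    print(headroom_gb(2 * 80, (70, "bf16"), (8, "bf16")))
    # 2x H200 (141GB each, assumed GPU count), both models in BF16
    print(headroom_gb(2 * 141, (70, "bf16"), (8, "bf16")))
```

The second case shows why BF16 on 2x H100 is not workable in practice: only about 4GB would remain for everything that is not weights.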
How the draft model shrinks your KV cache
To understand this, you need to know what a KV cache is: it stores the attention keys and values already computed for every token in a sequence, so they are not recomputed at each decoding step.
The more tokens a request has, the more KV cache it needs, and the more concurrent requests you serve, the more KV cache you need in total
In practice, this means one or both of the following: shorter maximum context length or smaller maximum batch size
- Shorter maximum context length: requests with long inputs or conversation histories will hit memory limits sooner
- Smaller maximum batch size: you can serve fewer concurrent requests before running out of KV cache space
The latency gains from faster token generation can easily be wiped out by the latency increase from being forced to process fewer requests in parallel
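A rough way to size this squeeze is to compute KV cache bytes per token and divide the remaining budget by it. The layer counts, KV head counts, and head dimension below are taken from the public Llama 3.1 model configurations as I understand them, so treat them as assumptions to check against your own `config.json`. The sketch also assumes a BF16 cache, that the draft model holds a KV entry for every token the target does, and it ignores activations and paging overhead.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def tokens_that_fit(kv_budget_gb: float, bytes_per_token: int) -> int:
    """How many cached tokens fit in a KV budget given in decimal GB."""
    return int(kv_budget_gb * 1e9 // bytes_per_token)


# Assumed architecture values (grouped-query attention with 8 KV heads).
TARGET_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)
DRAFT_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)

target_only = kv_bytes_per_token(**TARGET_70B)
with_draft = target_only + kv_bytes_per_token(**DRAFT_8B)

if __name__ == "__main__":
    # 2x H100 with an INT8 target: 90GB free without a draft, 74GB with one.
    print("no draft:  ", tokens_that_fit(90, target_only), "tokens")
    print("with draft:", tokens_that_fit(74, with_draft), "tokens")
```

The draft model costs capacity twice: its weights shrink the budget, and its own cache raises the per-token price of whatever budget is left.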
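```bash
# vLLM config for speculative decoding on 2× H100
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --speculative-model meta-llama/Llama-3.1-8B-Instruct \
  --num-speculative-tokens 5 \
  --speculative-draft-tensor-parallel-size 1 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.92 \
  --dtype bfloat16
```
The `--speculative-draft-tensor-parallel-size 1` flag is worth noting explicitly: the draft model typically runs on a single GPU while the target model spans both. Note that with `--dtype bfloat16` and no quantization flag, the footprint numbers above put the weights at roughly 156GB, so on 2x H100 this command only works in practice with a quantized target, as in the configuration in the next section.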
Quantization and Speculative Decoding
Running speculative decoding with a quantized target model is one of the most common production configurations
Why quantization changes acceptance rates
The speculative decoding algorithm works by comparing the draft model's proposed tokens against what the target model would have generated
When the target model is quantized, its output distribution shifts slightly, introducing small numerical rounding errors
The draft model was optimized against the full-precision target model's probability distribution, not the quantized variant's
The asymmetry between quantizing draft vs. target
The two models can be quantized independently, and the performance implications are not symmetric
Quantizing the target model has the biggest impact because it decides whether each token proposed by the draft model is accepted
Quantizing the draft model is usually less risky, as the target model still verifies every suggestion
In practice, it's generally safe to quantize the draft model more aggressively than the target model
Weights only: KV cache, activations, and page tables consume additional VRAM
Draft model quantization is handled separately from the target model via --speculative-model-quantization
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --quantization bitsandbytes \
  --load-format bitsandbytes \
  --speculative-model meta-llama/Llama-3.1-8B-Instruct \
  --speculative-model-quantization bitsandbytes \
  --num-speculative-tokens 5 \
  --speculative-draft-tensor-parallel-size 1 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.92
```
The `--spec-decoding-acceptance-method` flag is worth knowing about: it trades a small reduction in output quality for a higher acceptance rate.
```bash
  --spec-decoding-acceptance-method typical_acceptance_sampler \
  --typical-acceptance-sampler-posterior-threshold 0.09 \
  --typical-acceptance-sampler-posterior-alpha 0.3
```
The defaults are reasonable starting points, but test on your actual prompt distribution before adjusting.
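For intuition about what the two thresholds control, here is a minimal sketch of the typical-acceptance rule as described in the Medusa paper: a draft token is accepted when the target's probability for it exceeds the smaller of a fixed threshold and an entropy-scaled threshold. Mapping `posterior_threshold` and `posterior_alpha` onto that rule in this way is my reading of the flag names, so check the vLLM source for your version before relying on it.

```python
import math
from typing import Sequence


def typical_accept(target_probs: Sequence[float], token: int,
                   posterior_threshold: float = 0.09,
                   posterior_alpha: float = 0.3) -> bool:
    """Accept a draft token if the target finds it 'typical enough'.

    target_probs is the target model's next-token distribution at this position.
    """
    # Entropy of the target distribution: high when the model is unsure.
    entropy = -sum(p * math.log(p) for p in target_probs if p > 0.0)
    # The bar drops as entropy rises, so flat distributions accept more tokens.
    threshold = min(posterior_threshold, posterior_alpha * math.exp(-entropy))
    return target_probs[token] > threshold


if __name__ == "__main__":
    peaked = [0.97, 0.01, 0.01, 0.01]
    flat = [0.01] * 100
    print(typical_accept(peaked, 0))  # the dominant token
    print(typical_accept(peaked, 1))  # an unlikely token, confident target
    print(typical_accept(flat, 0))    # an equally unlikely token, unsure target
```

The design point is that the same 1% token is rejected when the target is confident and accepted when the target is unsure, which is where the extra acceptance rate, and the departure from the target's exact distribution, comes from.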
One more useful flag for quantized deployments under variable load: --speculative-disable-by-batch-size
What to watch after switching to a quantized configuration
After enabling quantization on either model, pull spec_decode_draft_acceptance_rate and compare it against your baseline measurement
The right quantization configuration depends on your VRAM constraints, context length requirements, and acceptance rate tolerance
Continuous Batching: Where the Scheduler Gets Complicated
The performance advantage of continuous batching comes from requests sharing GPU compute in the same forward pass
What the scheduler actually assumes
Under standard continuous batching, every iteration of the forward pass generates exactly one new token per request
Speculative decoding breaks this assumption: each iteration consists of proposing tokens and verifying them
The number of tokens actually appended to each request's sequence after the verification step is variable: anywhere from one token to the full set of proposed tokens plus one.
What this means for mixed workloads
Under a homogeneous workload, the irregularity is predictable enough that the scheduler handles it gracefully
Under a mixed workload, the picture is messier: imagine a batch where half the requests are running structured JSON extraction and half are running open-ended creative generation
The scheduler can't split these cleanly: they share the same verification pass, which means the requests that don't benefit from speculation are still paying its cost
In practice, workload homogeneity matters more than most teams realize: speculative decoding is well-suited for dedicated deployments
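The uneven progress inside a mixed batch is easy to see in a small simulation. This is a toy model under stated assumptions, not a scheduler: each draft token is accepted independently with a per-request probability, every request in the batch shares the same verify pass, and the acceptance rates 0.8 and 0.38 are illustrative stand-ins for a structured and a creative request.

```python
import random
from typing import List


def tokens_this_step(alpha: float, k: int, rng: random.Random) -> int:
    """Tokens appended to one request in one propose/verify round."""
    accepted = 0
    while accepted < k and rng.random() < alpha:
        accepted += 1
    # +1 for the target's own token (correction or bonus).
    return accepted + 1


def simulate(alphas: List[float], k: int, steps: int, seed: int = 0) -> List[float]:
    """Mean tokens per shared verify pass for each request in the batch."""
    rng = random.Random(seed)
    totals = [0] * len(alphas)
    for _ in range(steps):
        for i, alpha in enumerate(alphas):
            totals[i] += tokens_this_step(alpha, k, rng)
    return [total / steps for total in totals]


if __name__ == "__main__":
    structured, creative = simulate([0.8, 0.38], k=5, steps=2000)
    print(f"structured request: {structured:.2f} tokens per pass")
    print(f"creative request:   {creative:.2f} tokens per pass")
```

Both requests pay for the same five draft steps and the same verify pass every round, but the low-acceptance request gets far fewer tokens out of it.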
How to Measure This on Your Own Deployment
The published reference numbers give you a baseline, but what actually matters is the numbers you collect on your own hardware and prompt distribution
The most common operational mistake is measuring the wrong thing: a benchmark that shows 2x speedup on isolated single-request tests will not reliably predict behavior under concurrent production load
What to actually measure
Acceptance rate, per request: track it as a histogram, not an average, to see the distribution
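TTFT vs. TPOT separately: speculative decoding targets time-per-output-token (TPOT); time-to-first-token (TTFT) does not improve and may rise slightly, so a blended latency number hides what actually changed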
P50 vs. P99 latency under load: this is where the scheduler interactions surface
```bash
# Pull speculative decoding metrics from the metrics endpoint
curl -s http://localhost:8000/metrics | grep spec_decode
```
A healthy deployment looks like this: acceptance rate consistently above 0.7, accepted tokens close to the number of draft tokens.
```
# HELP vllm:spec_decode_draft_acceptance_rate Speculative decoding draft acceptance rate
vllm:spec_decode_draft_acceptance_rate{...} 0.76
vllm:spec_decode_num_draft_tokens_total{...} 48200
vllm:spec_decode_num_accepted_tokens_total{...} 36600   # ~76% accepted
```
An unhealthy deployment: acceptance rate below 0.5, most draft tokens rejected, overhead not paying off.
```
vllm:spec_decode_draft_acceptance_rate{...} 0.38
vllm:spec_decode_num_draft_tokens_total{...} 51000
vllm:spec_decode_num_accepted_tokens_total{...} 19400   # ~38% accepted
```
At 38% acceptance with a 1:9 draft/target ratio, you are adding latency, not removing it: turn off speculative decoding until you've addressed the root cause.
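Because the token counters are cumulative, the acceptance rate for a specific window is the ratio of the counter deltas between two scrapes. The sketch below parses the text format shown above and computes that ratio. It assumes a single label set per metric (one served model) and no trailing timestamps on metric lines; the label value in the sample is made up for the example.

```python
from typing import Dict, Optional

DRAFT = "vllm:spec_decode_num_draft_tokens_total"
ACCEPTED = "vllm:spec_decode_num_accepted_tokens_total"


def parse_spec_decode_metrics(text: str) -> Dict[str, float]:
    """Extract spec_decode metric values from Prometheus text output."""
    values: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "spec_decode" not in line:
            continue
        name = line.split("{", 1)[0].split()[0]          # drop labels, if any
        values[name] = float(line.rsplit(None, 1)[-1])   # last field is the value
    return values


def acceptance_between(before: Dict[str, float],
                       after: Dict[str, float]) -> Optional[float]:
    """Acceptance rate over the window between two scrapes, or None if idle."""
    drafted = after[DRAFT] - before[DRAFT]
    accepted = after[ACCEPTED] - before[ACCEPTED]
    return accepted / drafted if drafted > 0 else None


SAMPLE = """\
# HELP vllm:spec_decode_draft_acceptance_rate Speculative decoding draft acceptance rate
vllm:spec_decode_draft_acceptance_rate{model_name="example"} 0.76
vllm:spec_decode_num_draft_tokens_total{model_name="example"} 48200
vllm:spec_decode_num_accepted_tokens_total{model_name="example"} 36600
"""

if __name__ == "__main__":
    start = {DRAFT: 0.0, ACCEPTED: 0.0}
    rate = acceptance_between(start, parse_spec_decode_metrics(SAMPLE))
    print(f"acceptance over window: {rate:.3f}")
```

Scraping before and after a load test with this approach gives a rate for that test alone, rather than a number blended with whatever the server handled earlier.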
The monitoring setup you need before trusting the flag
- Acceptance rate histogram segmented by temperature bucket
- TTFT and TPOT at P50, P95, P99, compared against a baseline without speculative decoding
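The latency side of that checklist can be computed from three client-side timestamps per request. The helper names below are illustrative, the percentile uses the simple nearest-rank definition, and the sample values are made up for the example. Run it once with speculative decoding on and once with it off, under the same load.

```python
import math
from typing import List


def ttft(request_start: float, first_token_at: float) -> float:
    """Time to first token, in seconds."""
    return first_token_at - request_start


def tpot(first_token_at: float, last_token_at: float, num_tokens: int) -> float:
    """Time per output token after the first, in seconds."""
    if num_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (last_token_at - first_token_at) / (num_tokens - 1)


def percentile(values: List[float], p: float) -> float:
    """Nearest-rank percentile for 0 < p <= 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100.0))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Made-up per-request TPOT samples in milliseconds.
    samples = [22.0, 24.0, 21.0, 25.0, 23.0, 80.0, 22.0, 26.0, 24.0, 23.0]
    for p in (50, 95, 99):
        print(f"P{p} TPOT: {percentile(samples, p):.1f} ms")
```

A single slow outlier barely moves P50 but dominates P95 and P99, which is why the tail percentiles are the ones to compare against the baseline.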
Why aggregate benchmarks lie
Consider a deployment where 60% of requests are low-temperature structured queries and 40% are high-temperature creative requests
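A few lines of arithmetic show how the blended number hides the problem. The 60/40 split comes from the scenario above; the per-segment acceptance rates are borrowed from the temperature sweep earlier in this article purely as illustrative values.

```python
from typing import Dict, List

# Illustrative segments: shares from the scenario, acceptance rates assumed.
SEGMENTS: List[Dict] = [
    {"name": "structured, low temperature", "share": 0.60, "acceptance": 0.81},
    {"name": "creative, high temperature", "share": 0.40, "acceptance": 0.38},
]


def blended_acceptance(segments: List[Dict]) -> float:
    """Traffic-weighted average acceptance rate."""
    return sum(s["share"] * s["acceptance"] for s in segments)


def share_below(segments: List[Dict], threshold: float = 0.5) -> float:
    """Fraction of traffic whose acceptance rate is under the threshold."""
    return sum(s["share"] for s in segments if s["acceptance"] < threshold)


if __name__ == "__main__":
    print(f"blended acceptance: {blended_acceptance(SEGMENTS):.3f}")
    print(f"traffic below 0.5:  {share_below(SEGMENTS):.0%}")
```

The aggregate lands comfortably above the 0.5 cutoff even though 40% of requests sit well below it, which is the case for segmenting the histogram by temperature bucket.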
Decision Framework
Turn it on when all of these conditions hold:
- Your workload is temperature-homogeneous and skews toward structured output or code - target temperature ≤ 0.5
- You have VRAM headroom after target model weights (draft model should consume no more than ~15% of total available VRAM)
- Your workload is batch-size-consistent - you’re not mixing request types with dramatically different acceptance rates in the same batch
- You’ve validated acceptance rates at production temperatures, not just greedy benchmarks
- Your primary SLO is TPOT/throughput, not TTFT
Leave it off when:
- Temperature varies widely across your request mix, or your median temperature is above 0.7
- You’re VRAM-constrained and serving long-context requests - the KV cache squeeze will cost you more than the token throughput gains
- Your workload is TTFT-bound rather than throughput-bound
- You haven’t set up acceptance rate monitoring - you can’t tell whether it’s helping
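The two lists above can be encoded as a small pre-flight check. This is a sketch of one way to do it: the field names are hypothetical, and the thresholds (median temperature 0.7, draft VRAM share 15%, acceptance 0.5) are the ones quoted in this article rather than universal constants.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Deployment:
    median_temperature: float
    draft_vram_fraction: float                 # draft weights / total VRAM
    acceptance_at_prod_temp: Optional[float]   # None if never measured
    primary_slo: str                           # "tpot" or "ttft"
    has_acceptance_monitoring: bool


def reasons_to_leave_off(d: Deployment) -> List[str]:
    """Return every checklist item that argues against enabling the flag."""
    reasons: List[str] = []
    if d.median_temperature > 0.7:
        reasons.append("median temperature above 0.7")
    if d.draft_vram_fraction > 0.15:
        reasons.append("draft model uses more than ~15% of VRAM")
    if d.acceptance_at_prod_temp is None:
        reasons.append("acceptance rate not validated at production temperature")
    elif d.acceptance_at_prod_temp < 0.5:
        reasons.append("acceptance rate below 0.5")
    if d.primary_slo != "tpot":
        reasons.append("primary SLO is not TPOT/throughput")
    if not d.has_acceptance_monitoring:
        reasons.append("no acceptance rate monitoring in place")
    return reasons


if __name__ == "__main__":
    candidate = Deployment(0.2, 0.10, 0.78, "tpot", True)
    print(reasons_to_leave_off(candidate) or "no blockers found")
```

An empty list means nothing in the checklist argues against it, not that speculative decoding is guaranteed to help; the measurements still have the final say.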
Draft model selection checklist:
Speculative decoding is a genuine win for the right workloads. For everything else, the flag is not a universal accelerant. Treat it like any other performance optimization: measure first, at production conditions, then decide.
FAQ
Does speculative decoding change the model’s outputs?
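No, as long as you use the standard rejection-sampling acceptance method. Its acceptance criterion guarantees the final output distribution is identical to what the target model would have produced on its own. It is a pure latency optimization: it changes how fast tokens are generated, not what tokens are generated. You can enable it without touching your prompts, sampling parameters, or output validation. The exception is typical acceptance sampling, which deliberately relaxes this guarantee in exchange for a higher acceptance rate.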
Does it improve time-to-first-token (TTFT) or time-per-output-token (TPOT)?
It improves TPOT, not TTFT. The draft model adds a small prefill step before the first token is returned, so TTFT may actually increase slightly. If your SLO is primarily TTFT-bound - interactive chat where users notice the first response delay more than the generation speed - speculative decoding may not move the metric that matters to you. It’s most valuable when your bottleneck is throughput or output speed, not initial responsiveness.
What’s the difference between draft model and n-gram speculative decoding?
Draft model speculation uses a separate smaller model to propose tokens - it works across any prompt type but costs VRAM and requires a compatible model family. N-gram speculation reuses repeated phrases from the input prompt itself, which makes it nearly free on memory but only useful when the output closely echoes the input (summarization, RAG, document Q&A). For general chat or code generation, use a draft model. For summarization pipelines where the answer largely paraphrases the source, n-gram is often the better choice and requires no additional model at all.
Resources
- Speculative Decoding - vLLM Documentation
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
- Speculative Decoding and Beyond: An In-Depth Survey
- How to Choose the Right GPU for vLLM Inference | DigitalOcean
- LLM Inference Optimization 101 | DigitalOcean
- Splitting LLMs Across Multiple GPUs | DigitalOcean
- The LLM Inference Trilemma | DigitalOcean
- FlashAttention 4 | DigitalOcean
- Deploy NVIDIA Dynamo | DigitalOcean
About the author
With a strong background in data science and over six years of experience, I am passionate about creating in-depth content on technologies. Currently focused on AI, machine learning, and GPU computing, working on topics ranging from deep learning frameworks to optimizing GPU-based workloads.