Speculative Decoding

Written by Shaoni Mukherjee, AI Technical Writer

Shaoni is an experienced AI technical writer with a strong background in data science

Speculative decoding can significantly improve token throughput, but it can also increase latency and memory usage if not implemented correctly

Speculative decoding is a conditional optimization that delivers real gains on the right workloads, but it can quietly degrade everything else

This article provides an operational framework for making the decision to use speculative decoding, including which draft model to pick and how to measure its effectiveness

The performance figures in this article come from the vLLM team's own published benchmarks, but it's essential to measure performance in your own production conditions

TL;DR

Speculative decoding proposes tokens using a small draft model and verifies them in one target model pass
It helps at low query rates on structured workloads, but degrades performance at high query rates
Pick a draft model at a 1:8-1:12 size ratio from the same model family
Monitor spec_decode_draft_acceptance_rate in production to determine effectiveness

Choosing the Right Draft Model for Speculative Decoding

The first decision is which draft model to use, yet it's rarely treated as the most important one

Why the size ratio is the primary lever

Speculative decoding short-circuits the process of generating tokens by using a small draft model to propose tokens and then verifying them in a single target model pass

    Standard generation: 5 tokens = 5 sequential target model passes

    [ 70B ] → t₁ → [ 70B ] → t₂ → [ 70B ] → t₃ → [ 70B ] → t₄ → [ 70B ] → t₅
    (pass 1)       (pass 2)       (pass 3)       (pass 4)       (pass 5)

    Speculative decoding: 5 proposed tokens = 1 draft pass + 1 verify pass

    [ 8B Draft ]   → t₁ t₂ t₃ t₄ t₅         (one fast pass, all proposed)
          |
          ▼
    [ 70B Target ] → ✓t₁ ✓t₂ ✓t₃ ✗t₄ -      (one parallel verify pass)
                     └─────────────┘
                     3 tokens accepted, 2 rejected
                     target model ran once instead of five times
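To make the verify step in the diagram concrete, here is a minimal sketch of acceptance under greedy decoding (temperature 0), where verification reduces to prefix matching. The function name `verify_greedy` and the integer token IDs are illustrative assumptions, not vLLM internals; sampling at non-zero temperature uses a probabilistic acceptance rule that this sketch does not cover.

```python
from typing import List


def verify_greedy(draft_tokens: List[int], target_tokens: List[int]) -> List[int]:
    """Return the tokens emitted by one propose/verify step under greedy decoding.

    draft_tokens:  the k tokens proposed by the draft model.
    target_tokens: k + 1 greedy picks from the single target pass, where
                   target_tokens[i] is the target's choice given the prompt
                   plus draft_tokens[:i].
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have exactly one more entry than draft_tokens")

    emitted: List[int] = []
    for i, proposed in enumerate(draft_tokens):
        if proposed == target_tokens[i]:
            emitted.append(proposed)          # draft guess matches: accept it
        else:
            emitted.append(target_tokens[i])  # first mismatch: take the target's token
            return emitted                    # everything after it is discarded
    emitted.append(target_tokens[-1])         # all accepted: keep the extra target token
    return emitted


# Mirrors the diagram: the first three proposals match, the fourth does not.
draft = [11, 12, 13, 99, 15]
target = [11, 12, 13, 14, 15, 16]
print(verify_greedy(draft, target))  # [11, 12, 13, 14]
```

Note that even a fully rejected draft still yields one token from the target pass, so the cost of a bad draft is wasted draft compute and a wider verify pass rather than a stalled request.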

The size ratio of the draft model to the target model is critical because it sets the trade-off between how well the draft predicts what the target model would say and how much each prediction costs

A larger draft model predicts more accurately, but it also costs more in terms of VRAM and compute

The mechanics of the algorithm point to a consistent sweet spot: a 1:8 to 1:12 size ratio using same-family, same-training-distribution models

These pairings illustrate how the size ratio affects draft model quality, and acceptance rates vary significantly with hardware, quantization, and prompt distribution

The 1B/70B pairing looks cheap but rarely pays off, as the target model rejects too many of the draft model's proposed tokens

The vLLM team's published benchmarks used a 0.5B draft model against Llama-3-70B, a 1:140 ratio, and saw 1.5x speedup at low query rates and 1.4x slowdown at high query rates
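As a quick sanity check on the ratios mentioned above, the following sketch computes the target-to-draft size ratio for each pairing and flags whether it falls in the 1:8 to 1:12 band. The helper names and the dictionary of pairings are illustrative; parameter counts are the nominal sizes quoted in this article.

```python
TARGET_PARAMS_B = 70.0  # nominal target size in billions of parameters

# Nominal draft sizes (billions of parameters) for the pairings discussed above
DRAFT_PARAMS_B = {"8B": 8.0, "1B": 1.0, "0.5B": 0.5}


def size_ratio(draft_b: float, target_b: float) -> float:
    """Return N for a 1:N draft-to-target size ratio."""
    return target_b / draft_b


def in_sweet_spot(ratio: float, low: float = 8.0, high: float = 12.0) -> bool:
    """True when the ratio falls inside the 1:8 to 1:12 band."""
    return low <= ratio <= high


for name, params in DRAFT_PARAMS_B.items():
    ratio = size_ratio(params, TARGET_PARAMS_B)
    print(f"{name} draft vs 70B target: 1:{ratio:g} in_sweet_spot={in_sweet_spot(ratio)}")
```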

Temperature destroys your benchmark numbers

Temperature controls how predictable or random a model's output is, and it affects the effectiveness of speculative decoding

At temperature=0, the model is highly predictable, but at higher temperatures, the model picks more surprising tokens, and the draft model's guesses start missing more often
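The effect can be illustrated with a temperature-scaled softmax. This is a sketch over made-up logits (the values in `logits` are hypothetical), showing that the probability mass on the most likely token shrinks as temperature rises, which is the token a draft model is most likely to guess. Temperature 0 is treated by inference engines as plain argmax, so it is not included in the loop.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; temperature 0 means argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.5, 1.0, 0.5]  # hypothetical next-token logits

for temperature in (0.4, 0.8, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    print(f"temperature={temperature} top-token probability={max(probs):.2f}")
```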

Most benchmarks are run at temperature=0, which is where speculative decoding looks best, but in production, the temperature used depends on the task

The simplest way to see the effect of temperature is to run the same prompt at different temperatures and watch spec_decode_draft_acceptance_rate shift in real time

    import requests

    VLLM_URL = "http://localhost:8000/v1/chat/completions"
    METRICS_URL = "http://localhost:8000/metrics"
    PROMPT = "Write a short story about a robot learning to paint."


    def get_acceptance_rate():
        """Return the current draft acceptance rate from /metrics, or None."""
        text = requests.get(METRICS_URL, timeout=10).text
        for line in text.split("\n"):
            # Skip "# HELP" / "# TYPE" comment lines; the value is the last field
            if "spec_decode_draft_acceptance_rate" in line and not line.startswith("#"):
                return float(line.split()[-1])
        return None


    for temperature in [0.0, 0.4, 0.8, 1.0]:
        # Send 20 requests at this temperature
        for _ in range(20):
            requests.post(
                VLLM_URL,
                json={
                    "model": "meta-llama/Llama-3.1-70B-Instruct",
                    "messages": [{"role": "user", "content": PROMPT}],
                    "temperature": temperature,
                    "max_tokens": 200,
                },
                timeout=120,
            )
        rate = get_acceptance_rate()
        # The metric is absent when speculative decoding is not enabled
        rate_str = f"{rate:.2f}" if rate is not None else "n/a"
        print(f"temperature={temperature} acceptance_rate={rate_str}")

Expected output shape: find the row that matches your production temperature and check the acceptance rate

    temperature=0.0 acceptance_rate=0.81
    temperature=0.4 acceptance_rate=0.71
    temperature=0.8 acceptance_rate=0.52
    temperature=1.0 acceptance_rate=0.38

If the acceptance rate is below 0.5, stop: speculative decoding is net-negative on this workload

Higher temperature leads to a flatter probability distribution, and the draft model's concentrated guesses miss more often, leading to more rejections

Source: vLLM Team, How Speculative Decoding Boosts vLLM Performance by up to 2.8x, October 2024

The last two rows are not edge cases: at production query rates, speculative decoding adds overhead instead of removing it

The intuition for the crossover point: an 8B draft model costs roughly 1/9th the compute of a 70B target model
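A rough way to reason about the crossover is the idealized speedup estimate from the original speculative decoding analysis: with per-token acceptance rate α, k proposed tokens, and draft cost c relative to one target pass, the expected speedup is (1 − α^(k+1)) / ((1 − α)(kc + 1)). The sketch below assumes each token is accepted independently with the same probability and ignores batching and scheduler overhead, so treat the output as an upper-bound intuition rather than a prediction. The function names are illustrative.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verify pass with k proposals and acceptance rate alpha."""
    if alpha >= 1.0:
        return float(k + 1)  # every proposal accepted, plus the extra target token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Idealized speedup over plain decoding.

    cost_ratio is the cost of one draft pass relative to one target pass,
    for example roughly 1/9 for an 8B draft against a 70B target.
    """
    step_cost = k * cost_ratio + 1.0  # k draft passes plus one target verify pass
    return expected_tokens_per_step(alpha, k) / step_cost


for alpha in (0.38, 0.50, 0.76):
    speedup = estimated_speedup(alpha, k=5, cost_ratio=1 / 9)
    print(f"acceptance={alpha:.2f} estimated speedup={speedup:.2f}x")
```

Under these assumptions the estimate stays close to 1.0x at an acceptance rate of 0.38, so any overhead the formula does not model (larger verify batches, scheduling, reduced KV cache) can tip the deployment into a net slowdown.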

The practical implication is straightforward: don't assume speculative decoding is helping just because your benchmark looked good

Memory Budget Reality

Running speculative decoding means running two models simultaneously, which is obvious in principle but surprisingly painful in practice

Actual footprint numbers

Using Llama-3.1 as a concrete example, weight sizes are derived from parameter counts and precision

Llama-3.1-70B target model

BF16: 140GB, too large for a single 80GB H100
INT8: 70GB, fits on a single H100 with 10GB to spare

Llama-3.1-8B draft model

BF16: 16GB
INT8: 8GB
INT4: 4GB

On a 2x H100 SXM5 setup (160GB total), a common configuration for production 70B serving:

Target (70B INT8, 70GB) + Draft (8B BF16, 16GB) = 86GB weights, leaving 74GB for KV cache

The practical takeaway for H100: if you want to run speculative decoding with a 70B target, you have to quantize it to INT8

On DigitalOcean's H200 GPU Droplets, this constraint goes away: with two H200s (282GB total), you can run 70B BF16 + 8B BF16 with 126GB of headroom
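The arithmetic behind these figures is simple enough to script for your own pairing. This sketch derives weight sizes from nominal parameter counts and bytes per parameter, then subtracts them from total VRAM. It counts weights only, and the 282GB figure assumes two H200 GPUs at 141GB each; the helper names are illustrative.

```python
from typing import Tuple

# Bytes per parameter for each precision
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB: parameters (billions) x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


def kv_headroom_gb(total_vram_gb: float,
                   target: Tuple[float, str],
                   draft: Tuple[float, str]) -> float:
    """VRAM left after loading both models' weights (ignores activations and overhead)."""
    return total_vram_gb - weight_gb(*target) - weight_gb(*draft)


# 2x H100 (160GB): INT8 target + BF16 draft
print(kv_headroom_gb(160.0, (70.0, "int8"), (8.0, "bf16")))  # 74.0

# Assumed 2x H200 (282GB): BF16 target + BF16 draft
print(kv_headroom_gb(282.0, (70.0, "bf16"), (8.0, "bf16")))  # 126.0
```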

How the draft model shrinks your KV cache

To understand this, you need to know what a KV cache is: it stores intermediate values called keys and values

The more tokens a request has, the more KV cache it needs, and the more concurrent requests you serve, the more KV cache you need in total

In practice, this means one or both of the following: shorter maximum context length or smaller maximum batch size

Shorter maximum context length: requests with long inputs or conversation histories will hit memory limits sooner
Smaller maximum batch size: you can serve fewer concurrent requests before running out of KV cache space

The latency gains from faster token generation can easily be wiped out by the latency increase from being forced to process fewer requests in parallel
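To put rough numbers on the squeeze, the sketch below estimates KV cache bytes per token from model architecture. The layer counts, KV head counts, and head dimension are taken from the published Llama-3.1 model configurations (an assumption relative to this article), the cache is assumed to be stored in BF16, and the draft model is assumed to keep its own KV entries for every cached token. Real allocators add block and fragmentation overhead that is not modeled here.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: keys and values for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(budget_gb: float, bytes_per_token: int) -> int:
    """How many tokens fit in a KV cache budget given in GB (1 GB = 1e9 bytes)."""
    return int(budget_gb * 1e9 // bytes_per_token)


# Assumed architecture values: 80 layers / 8 KV heads / head_dim 128 for the 70B,
# 32 layers / 8 KV heads / head_dim 128 for the 8B
TARGET_70B = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
DRAFT_8B = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

budget_gb = 74.0  # the headroom left in the 2x H100 example

print("target only:   ", max_cached_tokens(budget_gb, TARGET_70B), "tokens")
print("target + draft:", max_cached_tokens(budget_gb, TARGET_70B + DRAFT_8B), "tokens")
```

The total token budget is shared across all concurrent requests, which is why it shows up as either shorter contexts or smaller batches.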

    # vLLM config for speculative decoding on 2× H100
    # Note: per the weight sizes above, unquantized BF16 70B weights (140GB) plus
    # the draft leave almost no KV cache room on 2× H100; use a quantized target
    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.1-70B-Instruct \
      --speculative-model meta-llama/Llama-3.1-8B-Instruct \
      --num-speculative-tokens 5 \
      --speculative-draft-tensor-parallel-size 1 \
      --tensor-parallel-size 2 \
      --gpu-memory-utilization 0.92 \
      --dtype bfloat16

The --speculative-draft-tensor-parallel-size 1 flag is worth noting explicitly: the draft model typically runs on a single GPU while the target model spans both

Quantization and Speculative Decoding

Running speculative decoding with a quantized target model is one of the most common production configurations

Why quantization changes acceptance rates

The speculative decoding algorithm works by comparing the draft model's proposed tokens against what the target model would have generated

When the target model is quantized, small numerical rounding errors shift its output distribution slightly

The draft model was optimized against the full-precision target model's probability distribution, not the quantized variant's, so more of its proposals can be rejected

The asymmetry between quantizing draft vs. target

The two models can be quantized independently, and the performance implications are not symmetric

Quantizing the target model has the biggest impact because it decides whether each token proposed by the draft model is accepted

Quantizing the draft model is usually less risky, as the target model still verifies every suggestion

In practice, it's generally safe to quantize the draft model more aggressively than the target model

Weights only: KV cache, activations, and page tables consume additional VRAM

Draft model quantization is handled separately from the target model via --speculative-model-quantization

    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.1-70B-Instruct \
      --quantization bitsandbytes \
      --load-format bitsandbytes \
      --speculative-model meta-llama/Llama-3.1-8B-Instruct \
      --speculative-model-quantization bitsandbytes \
      --num-speculative-tokens 5 \
      --speculative-draft-tensor-parallel-size 1 \
      --tensor-parallel-size 2 \
      --gpu-memory-utilization 0.92

The --spec-decoding-acceptance-method flag is worth knowing about: setting it to typical_acceptance_sampler trades a small reduction in output quality for a higher acceptance rate

      --spec-decoding-acceptance-method typical_acceptance_sampler \
      --typical-acceptance-sampler-posterior-threshold 0.09 \
      --typical-acceptance-sampler-posterior-alpha 0.3

The defaults are reasonable starting points, but test on your actual prompt distribution before adjusting

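One more useful flag for quantized deployments under variable load: --speculative-disable-by-batch-size, which turns speculation off for new requests once the number of queued requests exceeds the value you set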

What to watch after switching to a quantized configuration

After enabling quantization on either model, pull spec_decode_draft_acceptance_rate and compare it against your baseline measurement

The right quantization configuration depends on your VRAM constraints, context length requirements, and acceptance rate tolerance

Continuous Batching: Where the Scheduler Gets Complicated

The performance advantage of continuous batching comes from requests sharing GPU compute in the same forward pass

What the scheduler actually assumes

Under standard continuous batching, every iteration of the forward pass generates exactly one new token per request

Speculative decoding breaks this assumption: each iteration consists of proposing tokens and verifying them

The number of tokens actually appended to each request's sequence after verification is variable: anywhere from one token up to the number of proposed tokens plus one

What this means for mixed workloads

Under a homogeneous workload, the irregularity is predictable enough that the scheduler handles it gracefully

Under a mixed workload, the picture is messier: imagine a batch where half the requests are running structured JSON extraction and half are running open-ended creative generation

The scheduler can't split these cleanly: they share the same verification pass, which means the requests that don't benefit from speculation are still paying its cost

In practice, workload homogeneity matters more than most teams realize: speculative decoding is well-suited for dedicated deployments

How to Measure This on Your Own Deployment

The published reference numbers give you a baseline, but what actually matters is the numbers you collect on your own hardware and prompt distribution

The most common operational mistake is measuring the wrong thing: a benchmark that shows 2x speedup on isolated single-request tests will not reliably predict behavior under concurrent production load

What to actually measure

Acceptance rate, per request: track it as a histogram, not an average, to see the distribution

TTFT vs. TPOT separately: speculative decoding affects time-per-output-token, not time-to-first-token

P50 vs. P99 latency under load: this is where the scheduler interactions surface
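TTFT and TPOT can be separated with nothing more than arrival timestamps from a streaming response. This is a minimal sketch assuming you record a wall-clock time when the request is sent and when each token arrives; the function name and the sample timestamps are illustrative. With speculative decoding, tokens tend to arrive in bursts, so the averaged TPOT is more meaningful than any single inter-token gap.

```python
from typing import List, Tuple


def ttft_and_tpot(request_start: float, token_times: List[float]) -> Tuple[float, float]:
    """Return (time-to-first-token, mean time-per-output-token) in seconds.

    request_start: timestamp when the request was sent.
    token_times:   arrival timestamp of each streamed token, in order.
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    ttft = token_times[0] - request_start
    if len(token_times) == 1:
        return ttft, 0.0  # TPOT is undefined for a single token; report 0.0
    # Average gap between tokens after the first one
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Illustrative timestamps (seconds): first token after 0.5s, then bursts
ttft, tpot = ttft_and_tpot(0.0, [0.5, 0.5, 0.5, 0.8, 0.8])
print(f"TTFT={ttft:.3f}s TPOT={tpot:.3f}s")
```

Collect these per request with and without speculative decoding, then compare percentiles rather than means.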

    # Pull speculative decoding metrics from the metrics endpoint
    curl -s http://localhost:8000/metrics | grep spec_decode

A healthy deployment looks like this: acceptance rate consistently above 0.7, accepted tokens close to the number of draft tokens

    # HELP vllm:spec_decode_draft_acceptance_rate Speculative decoding draft acceptance rate
    vllm:spec_decode_draft_acceptance_rate{...} 0.76
    vllm:spec_decode_num_draft_tokens_total{...} 48200
    vllm:spec_decode_num_accepted_tokens_total{...} 36600   # ~76% accepted

An unhealthy deployment: acceptance rate below 0.5, most draft tokens rejected, overhead not paying off

    vllm:spec_decode_draft_acceptance_rate{...} 0.38
    vllm:spec_decode_num_draft_tokens_total{...} 51000
    vllm:spec_decode_num_accepted_tokens_total{...} 19400   # ~38% accepted

At 38% acceptance with a 1:9 draft/target ratio, you are adding latency, not removing it: turn off speculative decoding until you've addressed the root cause
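The same check can be scripted by deriving the acceptance rate from the two counters instead of reading the gauge. This sketch parses Prometheus-style text using the metric names shown above; the `model_name` label in the sample is a placeholder, and the parser assumes no trailing timestamp on metric lines.

```python
from typing import Dict, Optional


def parse_spec_decode_metrics(metrics_text: str) -> Dict[str, float]:
    """Collect every spec_decode metric from Prometheus text output."""
    values: Dict[str, float] = {}
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name = line.split("{", 1)[0].split()[0]  # metric name without labels
        if "spec_decode" in name:
            values[name] = float(line.split()[-1])  # value is the last field
    return values


def acceptance_from_counters(values: Dict[str, float]) -> Optional[float]:
    """Accepted / drafted tokens, or None if nothing has been drafted yet."""
    drafted = values.get("vllm:spec_decode_num_draft_tokens_total", 0.0)
    accepted = values.get("vllm:spec_decode_num_accepted_tokens_total", 0.0)
    return accepted / drafted if drafted > 0 else None


SAMPLE = """\
# HELP vllm:spec_decode_draft_acceptance_rate Speculative decoding draft acceptance rate
vllm:spec_decode_draft_acceptance_rate{model_name="placeholder"} 0.38
vllm:spec_decode_num_draft_tokens_total{model_name="placeholder"} 51000
vllm:spec_decode_num_accepted_tokens_total{model_name="placeholder"} 19400
"""

rate = acceptance_from_counters(parse_spec_decode_metrics(SAMPLE))
print(f"acceptance from counters: {rate:.2f}")
```

Because the counters are cumulative, taking the difference between two scrapes gives the acceptance rate for just that interval.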

The monitoring setup you need before trusting the flag

Acceptance rate histogram segmented by temperature bucket
TTFT and TPOT at P50, P95, P99, compared against a baseline without speculative decoding
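One way to get the temperature-segmented view is to aggregate per-request counts in your own request logs. The sketch below assumes you can record, for each request, its temperature plus the number of drafted and accepted tokens; that per-request record is a hypothetical logging format, not something the metrics endpoint provides directly, and the bucket boundaries are arbitrary.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def bucket_for(temperature: float) -> str:
    """Map a request temperature to a coarse bucket label (boundaries are arbitrary)."""
    if temperature <= 0.3:
        return "low (<=0.3)"
    if temperature <= 0.7:
        return "mid (0.3-0.7]"
    return "high (>0.7)"


def acceptance_by_bucket(records: Iterable[Tuple[float, int, int]]) -> Dict[str, float]:
    """records: (temperature, drafted_tokens, accepted_tokens) per request."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [drafted, accepted]
    for temperature, drafted, accepted in records:
        bucket = bucket_for(temperature)
        totals[bucket][0] += drafted
        totals[bucket][1] += accepted
    return {b: acc / dr for b, (dr, acc) in totals.items() if dr > 0}


# Hypothetical per-request log entries
records = [(0.0, 100, 80), (0.2, 100, 82), (1.0, 100, 38)]
for bucket, rate in acceptance_by_bucket(records).items():
    print(f"{bucket}: {rate:.2f}")
```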

Why aggregate benchmarks lie

Consider a deployment where 60% of requests are low-temperature structured queries and 40% are high-temperature creative requests
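A small worked example shows how the aggregate hides the split. It assumes, purely for illustration, that the structured traffic sees the 0.81 acceptance rate and the creative traffic the 0.38 rate from the temperature sweep output shown earlier, and that both classes draft similar numbers of tokens per request.

```python
# Traffic mix and per-class acceptance rates (illustrative assumptions)
segments = {
    "structured (low temperature)": {"share": 0.60, "acceptance": 0.81},
    "creative (high temperature)": {"share": 0.40, "acceptance": 0.38},
}

# The aggregate metric is the traffic-weighted mean of the per-class rates
blended = sum(s["share"] * s["acceptance"] for s in segments.values())
print(f"aggregate acceptance rate: {blended:.3f}")

# Each class compared against the 0.5 stop threshold used in this article
for name, s in segments.items():
    verdict = "helping" if s["acceptance"] >= 0.5 else "net-negative"
    print(f"{name}: {s['acceptance']:.2f} -> {verdict}")
```

The blended figure lands around 0.64 and looks acceptable, while 40% of requests sit well below the 0.5 threshold.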

Decision Framework

Speculative decoding delivers when these conditions hold:

Your workload is temperature-homogeneous and skews toward structured output or code - target temperature ≤ 0.5
You have VRAM headroom after target model weights (draft model should consume no more than ~15% of total available VRAM)
Your workload is batch-size-consistent - you’re not mixing request types with dramatically different acceptance rates in the same batch
You’ve validated acceptance rates at production temperatures, not just greedy benchmarks
Your primary SLO is TPOT/throughput, not TTFT

Leave it off when:

Temperature varies widely across your request mix, or your median temperature is above 0.7
You’re VRAM-constrained and serving long-context requests - the KV cache squeeze will cost you more than the token throughput gains
Your workload is TTFT-bound rather than throughput-bound
You haven’t set up acceptance rate monitoring - you can’t tell whether it’s helping
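The two lists above can be encoded as a simple pre-flight check. This is a sketch, not a complete policy: the `Deployment` fields and the function name are hypothetical, and the thresholds are the ones stated in this article (acceptance rate of at least 0.5, median temperature at most 0.7, draft model under roughly 15% of VRAM). Temperatures between 0.5 and 0.7 are a gray zone that the check does not flag.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Deployment:
    median_temperature: float
    draft_vram_fraction: float        # draft weights / total available VRAM
    acceptance_rate: Optional[float]  # measured at production temperatures, None if unmonitored
    ttft_bound: bool                  # True if the primary SLO is TTFT
    mixed_batches: bool               # True if request types with very different rates share batches


def reasons_to_leave_off(d: Deployment) -> List[str]:
    """Return every reason to keep speculative decoding disabled (empty list = go ahead)."""
    reasons: List[str] = []
    if d.acceptance_rate is None:
        reasons.append("no acceptance-rate monitoring")
    elif d.acceptance_rate < 0.5:
        reasons.append("acceptance rate below 0.5")
    if d.median_temperature > 0.7:
        reasons.append("median temperature above 0.7")
    if d.draft_vram_fraction > 0.15:
        reasons.append("draft model uses more than ~15% of VRAM")
    if d.ttft_bound:
        reasons.append("workload is TTFT-bound")
    if d.mixed_batches:
        reasons.append("mixed request types share batches")
    return reasons


candidate = Deployment(median_temperature=0.2, draft_vram_fraction=0.10,
                       acceptance_rate=0.76, ttft_bound=False, mixed_batches=False)
print(reasons_to_leave_off(candidate) or "no blockers found")
```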

Draft model selection checklist:

Speculative decoding is a genuine win for the right workloads, but the flag is not a universal accelerant. Treat it like any other performance optimization: measure first, at production conditions, then decide.

FAQ

Does speculative decoding change the model’s outputs?

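With standard rejection sampling, no. The acceptance criterion guarantees the final output distribution is identical to what the target model would have produced on its own. It is a pure latency optimization - it changes how fast tokens are generated, not what tokens are generated. You can enable it without touching your prompts, sampling parameters, or output validation. The exception is the typical acceptance sampler described earlier, which deliberately trades a small amount of output quality for a higher acceptance rate.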

Does it improve time-to-first-token (TTFT) or time-per-output-token (TPOT)?

It improves TPOT, not TTFT. The draft model adds a small prefill step before the first token is returned, so TTFT may actually increase slightly. If your SLO is primarily TTFT-bound - interactive chat where users notice the first response delay more than the generation speed - speculative decoding may not move the metric that matters to you. It’s most valuable when your bottleneck is throughput or output speed, not initial responsiveness.

What’s the difference between draft model and n-gram speculative decoding?

Draft model speculation uses a separate smaller model to propose tokens - it works across any prompt type but costs VRAM and requires a compatible model family. N-gram speculation reuses repeated phrases from the input prompt itself, which makes it nearly free on memory but only useful when the output closely echoes the input (summarization, RAG, document Q&A). For general chat or code generation, use a draft model. For summarization pipelines where the answer largely paraphrases the source, n-gram is often the better choice and requires no additional model at all.
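The n-gram idea can be sketched in a few lines: look for the most recent tokens of the output inside the prompt and, if they appear, propose whatever followed them there. This is a simplified illustration with hypothetical names and word-level tokens; vLLM's actual n-gram proposer may differ in matching strategy and defaults.

```python
from typing import List, Sequence


def ngram_propose(prompt_tokens: Sequence[str], generated_tokens: Sequence[str],
                  n: int = 2, k: int = 3) -> List[str]:
    """Propose up to k tokens by matching the last n generated tokens against the prompt."""
    if len(generated_tokens) < n:
        return []  # not enough context to form an n-gram yet
    tail = list(generated_tokens[-n:])
    # Scan the prompt from the end so the most recent match wins
    for start in range(len(prompt_tokens) - n, -1, -1):
        if list(prompt_tokens[start:start + n]) == tail:
            return list(prompt_tokens[start + n:start + n + k])
    return []  # no match: fall back to normal decoding for this step


prompt = ["the", "cat", "sat", "on", "the", "mat"]
generated = ["a", "cat", "sat"]
print(ngram_propose(prompt, generated))  # ['on', 'the', 'mat']
```

The proposals still go through the same target-model verification as draft-model proposals, so a bad match costs a wasted verify slot rather than a wrong output.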

Resources

Speculative Decoding - vLLM Documentation
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
Speculative Decoding and Beyond: An In-Depth Survey
How to Choose the Right GPU for vLLM Inference | DigitalOcean
LLM Inference Optimization 101 | DigitalOcean
Splitting LLMs Across Multiple GPUs | DigitalOcean
The LLM Inference Trilemma | DigitalOcean
FlashAttention 4 | DigitalOcean
Deploy NVIDIA Dynamo | DigitalOcean

About the author

With a strong background in data science and over six years of experience, I am passionate about creating in-depth content on technologies. Currently focused on AI, machine learning, and GPU computing, working on topics ranging from deep learning frameworks to optimizing GPU-based workloads.
