HBM Tax
By Haimantika Mitra and Shaoni Mukherjee
Vision-language models require different hardware for vision encoding and language decoding. When run on the same GPU, the model underperforms due to the mismatch in hardware requirements.
The logs do not explain the underperformance.
There are no OOM errors, runaway processes, or obvious resource fights, just an expensive GPU underperforming for no apparent reason.
Teams often start tuning by changing batch sizes, sampling settings, and quantization configurations, but these changes do not fix the real problem, which lies in the hardware contract.
A vision-language model performs two types of work: vision encoding and language decoding.
Vision encoding is computationally intensive, while language decoding requires high memory bandwidth. Running both tasks on the same GPU results in a permanent compromise, underutilizing the GPU in two different ways.
This is known as the HBM tax or High Bandwidth Memory tax. For multimodal traffic, it can consume a third or more of the inference budget.
A paper by Donglin Yu provides hard numbers on the inefficiency of running vision-language models on the same GPU.
Key Takeaways
- Vision-language models do two kinds of work with opposite hardware needs.
- The HBM tax is the cost of this mismatch.
- Standard monitoring does not show the tax.
- The right fix is to split the pipeline at the modality boundary.
Two Pipelines, Two Hardware Regimes
A vision-language model has two distinct jobs: vision encoding and language decoding.
Vision encoding is a computationally intensive task that barely touches memory.
Language decoding is the reverse, requiring high memory bandwidth and generating one token at a time.
These two tasks have different hardware requirements.
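The end-to-end latency of a request decomposes into four terms:
$$T_{\text{e2e}} = T_{\text{vision}} + T_{\text{xfer}} + T_{\text{prefill}} + T_{\text{decode}}$$
Here $T_{\text{vision}}$ is the time spent on vision encoding, $T_{\text{xfer}}$ is the time spent transferring the encoded output to the language model, $T_{\text{prefill}}$ is the time spent processing the prompt (including the image tokens), and $T_{\text{decode}}$ is the time spent generating the response token by token.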
Measurements show that during vision encoding, the GPU's compute utilization is high, while memory bandwidth utilization is low. During language decoding, the opposite is true.
The two phases are structurally opposite in their hardware requirements, and no batch size can reconcile them.
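The opposition between the two phases can be made concrete with a roofline-style check. The following is a minimal sketch, not a measurement: the matrix shapes are illustrative, and the peak compute and bandwidth figures are assumptions taken from the A100 80GB datasheet, so substitute the numbers for your own card.

```python
# Illustrative roofline check: is a matmul compute-bound or memory-bound?
# Hardware peaks below are assumptions (A100 80GB datasheet values).
PEAK_FLOPS = 312e12        # FP16 tensor-core peak, FLOP/s (assumed)
PEAK_BANDWIDTH = 2.039e12  # HBM bandwidth, bytes/s (assumed)


def gemm_intensity(m: int, k: int, n: int, bytes_per_element: int = 2) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) matmul, counting one read of
    each input and one write of the output."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def regime(intensity: float) -> str:
    """Compare against the ridge point where compute and bandwidth balance."""
    ridge = PEAK_FLOPS / PEAK_BANDWIDTH
    return "compute-bound" if intensity > ridge else "memory-bound"


# Encoder-like: hundreds of image tokens pushed through a weight matrix at once.
encode = gemm_intensity(m=576, k=4096, n=4096)
# Decode-like: a single new token against the same weight matrix.
decode = gemm_intensity(m=1, k=4096, n=4096)

print(f"encode: {encode:.1f} FLOPs/byte -> {regime(encode)}")
print(f"decode: {decode:.2f} FLOPs/byte -> {regime(decode)}")
```

Under these assumptions the batched, encoder-like shape lands far above the ridge point and the one-token, decode-like shape lands far below it. Changing the batch size moves both along the same axis and does not change which side each phase naturally sits on.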
What the KV Cache Actually Costs
To understand the cost, look at the KV cache, which holds an entry for every image token and occupies HBM for the entire decode loop.
What the KV Cache Is
The KV cache is used by the transformer to store intermediate results for every token it processes.
Without a cache, the expensive part is that for every new token, the model has to recompute the K and V matrices for all previous tokens.
The KV cache is the fix, storing the K and V matrices for each token so they can be read from memory instead of recomputed.
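To see what the cache saves, here is a toy single-head attention loop in plain Python, run once with recomputation and once with a cache. It is a sketch for intuition only: the fixed scalings in `project_kv` stand in for real weight matrices, and all names are hypothetical.

```python
import math

stats = {"projections": 0}  # counts K/V projections performed


def project_kv(x):
    """Toy K and V projections (fixed scalings stand in for weight matrices)."""
    stats["projections"] += 1
    return [2.0 * v for v in x], [0.5 * v for v in x]


def attend(query, keys, values):
    """Single-head softmax attention over all positions seen so far."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    norm = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / norm
            for i in range(len(values[0]))]


def decode_without_cache(tokens):
    outputs = []
    for t in range(1, len(tokens) + 1):
        pairs = [project_kv(x) for x in tokens[:t]]  # recompute every K and V
        keys = [k for k, _ in pairs]
        values = [v for _, v in pairs]
        outputs.append(attend(tokens[t - 1], keys, values))
    return outputs


def decode_with_cache(tokens):
    keys, values, outputs = [], [], []
    for x in tokens:
        k, v = project_kv(x)  # project only the newest token
        keys.append(k)        # earlier entries are read back from the cache
        values.append(v)
        outputs.append(attend(x, keys, values))
    return outputs


tokens = [[0.1 * i, 0.2 * i] for i in range(1, 9)]  # 8 toy token vectors

stats["projections"] = 0
slow = decode_without_cache(tokens)
calls_without = stats["projections"]

stats["projections"] = 0
fast = decode_with_cache(tokens)
calls_with = stats["projections"]

print(calls_without, calls_with)  # 36 vs 8 projections for identical outputs
```

Both loops produce the same outputs, but the uncached one performs n(n+1)/2 projections for n tokens while the cached one performs n. The price is that every cached entry must be kept in memory and read back on each step.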
What Gets Stored
For each layer and each token, two tensors are stored: the key (K) and the value (V). The total size scales with:
- Sequence length
- Number of layers
- Model dimension/number of heads
- Batch size
- Precision
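A rough formula for the cache size of one sequence is:
$$\text{KV cache size} = 2 \times \text{num\_layers} \times \text{seq\_len} \times \text{num\_kv\_heads} \times \text{head\_dim} \times \text{bytes\_per\_element}$$
The leading 2 counts K and V, and the head count is the number of KV heads, which is smaller than the number of query heads under grouped-query attention. For Llama 3 70B (80 layers, 8 KV heads, head dimension 128), a single 128K-token sequence uses approximately 40 GB at 16-bit precision and about 80 GB at 32-bit precision.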
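The formula is easy to evaluate directly. This sketch plugs in the publicly documented Llama 3 70B shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128); those configuration values are assumptions added here, not figures stated in the text above.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers, seq_len, num_kv_heads, head_dim,
                   bytes_per_element, batch_size=1):
    """Bytes of KV cache: the leading 2 accounts for storing both K and V."""
    return (2 * num_layers * seq_len * num_kv_heads * head_dim
            * bytes_per_element * batch_size)


# Assumed Llama 3 70B shape; adjust for the model you actually serve.
llama3_70b = dict(num_layers=80, num_kv_heads=8, head_dim=128)
seq_len = 128 * 1024  # one 128K-token sequence

fp16 = kv_cache_bytes(seq_len=seq_len, bytes_per_element=2, **llama3_70b)
fp32 = kv_cache_bytes(seq_len=seq_len, bytes_per_element=4, **llama3_70b)

print(f"16-bit cache: {fp16 / GIB:.0f} GiB")  # 40 GiB
print(f"32-bit cache: {fp32 / GIB:.0f} GiB")  # 80 GiB
```

The result is linear in every factor, so doubling the precision, the context length, or the batch size doubles the footprint.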
The Core Tension
Model weights are static, while the KV cache is dynamic and per-request.
- Long contexts are memory-hungry
- High concurrency is hard
- Evicting/recomputing cache is a real latency tradeoff
The KV cache exists because attention needs to look back, and recomputing the K and V matrices for every token would be quadratic.
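For a single request, the size of the KV cache is given by a formula involving the number of layers $L$, the number of KV heads $n_{kv}$, the head dimension $d_h$, the context length $s_{ctx}$, and the bytes per element $b$. A batch multiplies this by the batch size.
$$D_{KV} = 2 \cdot L \cdot n_{kv} \cdot d_h \cdot s_{ctx} \cdot b$$
For a 7B MHA model, the KV cache size is approximately 350 MB for a single request.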
The image tokens are written into the KV cache at prefill and stay there for the entire decode loop, and every output token has to read all of them from memory, again and again.
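This is where arithmetic intensity, the number of FLOPs performed per byte moved from memory, explains the damage: the decode loop has low arithmetic intensity and is memory-bound, so every cached image token adds memory traffic without adding arithmetic.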
The per-token latency grows faster than linearly with image token count, which is the signature of the HBM tax.
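A first-order way to see the effect is to count the bytes a single decode step must read from HBM and divide by the memory bandwidth. The sketch below is a lower bound under stated assumptions: a 7B MHA model at FP16 (32 layers, 32 heads, head dimension 128), a 126-token text prompt, a batch of 32, and A100-class bandwidth. It ignores contention, so it is linear in image-token count, and measured curves under concurrency can be worse.

```python
WEIGHT_BYTES = 7_000_000_000 * 2            # 7B parameters at FP16 (assumed)
KV_BYTES_PER_TOKEN = 2 * 32 * 32 * 128 * 2  # K and V, 32 layers, 32 heads, d_h=128, FP16
HBM_BANDWIDTH = 2.039e12                    # bytes/s (assumed, A100 80GB class)


def decode_step_bytes(weight_bytes, kv_bytes_per_token, context_tokens, batch_size):
    """Bytes read to emit one token per request: weights are read once per
    step, and each request's KV cache is read in full."""
    return weight_bytes + batch_size * context_tokens * kv_bytes_per_token


def min_step_ms(step_bytes, bandwidth=HBM_BANDWIDTH):
    """Bandwidth-bound lower limit on the duration of one decode step."""
    return step_bytes / bandwidth * 1e3


BATCH = 32
TEXT_TOKENS = 126    # assumed prompt length
IMAGE_TOKENS = 576   # one LLaVA-style image

text_only = decode_step_bytes(WEIGHT_BYTES, KV_BYTES_PER_TOKEN, TEXT_TOKENS, BATCH)
with_image = decode_step_bytes(WEIGHT_BYTES, KV_BYTES_PER_TOKEN,
                               TEXT_TOKENS + IMAGE_TOKENS, BATCH)
image_share = (with_image - text_only) / with_image

print(f"text only : {min_step_ms(text_only):.2f} ms per step")
print(f"with image: {min_step_ms(with_image):.2f} ms per step")
print(f"share of traffic from image tokens: {image_share:.0%}")
```

Even in this idealized model, the image tokens account for a large share of the bytes read on every step, and none of that traffic produces new arithmetic.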
Why Existing Systems Don't Solve This
The natural reaction is to disaggregate the pipeline, but standard cuts fall short.
Stage-level disaggregation splits prefill from decode onto separate instances, but the cut point is after prefill, requiring the migration of the entire KV cache to the decode instance.
Intra-node co-location keeps everything on one GPU but multiplexes the two workloads spatially, improving utilization but not changing the cross-device communication structure.
Homogeneous serving is the workhorse, using techniques like paged attention and continuous batching, but it does not address the hardware mismatch.
The paper's diagnosis is that stage-level partitioning bakes in a language-only assumption, which breaks for multimodal inference.
The Ideal Point to Separate Vision and Text Processing
The load-bearing idea is to split the pipeline at the modality boundary, between the vision encoder's output and the language model's input.
The vision encoder's output is a single embedding tensor of size O(N_v * d), which does not grow with the language model's depth L.
For LLaVA-7B, the embedding size is approximately 4.5 MB, which is 78 times smaller than the KV cache size.
Embedding to transfer = 576 × 4096 × 2 bytes ≈ 4.5 MB
KV cache (the stage-level alternative) ≈ 350 MB
The numbers are not subtle. In general, the ratio $R$ of KV cache size to embedding size, in which the bytes per element cancel, is:
$$R = \frac{2 \cdot L \cdot n_{kv} \cdot d_h \cdot s_{ctx}}{N_v \cdot d}$$
For MHA, $n_{kv} \cdot d_h = d$, and with $s_{ctx} = N_v + s_{text}$ the ratio simplifies to:
$$R_{MHA} = 2L \cdot \left(1 + \frac{s_{text}}{N_v}\right)$$
Two consequences fall out: the MHA ratio is independent of the hidden dimension $d$, and the advantage scales with depth $L$.
Across architectures, the ratio of KV cache size to embedding size ranges from roughly 12× to 196×.
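The ratio of KV-cache bytes to embedding bytes can be checked numerically. This sketch assumes a LLaVA-7B-like shape (32 layers, 32 heads, head dimension 128, hidden size 4096, 576 image tokens) and a 126-token text prompt. The prompt length is an assumption chosen because it reproduces the 78× figure, and the GQA variant is hypothetical.

```python
def transfer_ratio(num_layers, num_kv_heads, head_dim, context_tokens,
                   vision_tokens, hidden_dim):
    """KV-cache elements divided by embedding elements (bytes per element cancel)."""
    kv_elements = 2 * num_layers * num_kv_heads * head_dim * context_tokens
    embedding_elements = vision_tokens * hidden_dim
    return kv_elements / embedding_elements


def transfer_ratio_mha(num_layers, text_tokens, vision_tokens):
    """Closed form for MHA, where num_kv_heads * head_dim equals hidden_dim."""
    return 2 * num_layers * (1 + text_tokens / vision_tokens)


VISION_TOKENS = 576
TEXT_TOKENS = 126  # assumed prompt length

mha = transfer_ratio(num_layers=32, num_kv_heads=32, head_dim=128,
                     context_tokens=VISION_TOKENS + TEXT_TOKENS,
                     vision_tokens=VISION_TOKENS, hidden_dim=4096)
closed_form = transfer_ratio_mha(32, TEXT_TOKENS, VISION_TOKENS)

# Hypothetical GQA variant of the same model with 8 KV heads instead of 32.
gqa = transfer_ratio(num_layers=32, num_kv_heads=8, head_dim=128,
                     context_tokens=VISION_TOKENS + TEXT_TOKENS,
                     vision_tokens=VISION_TOKENS, hidden_dim=4096)

print(f"MHA ratio: {mha:.1f}x (closed form {closed_form:.1f}x)")
print(f"GQA ratio: {gqa:.1f}x")
```

Sharing KV heads shrinks the cache and therefore the ratio, which is why GQA models sit at the conservative end of the range while still moving far less data than a KV-cache migration.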
Even at the conservative GQA end, you're moving an order of magnitude less data than a KV-cache migration would.
You don't need exotic interconnects, and a standard 25 Gbps private network handles the embedding transfer with negligible overhead.
The Cost Argument: Why This Is About Dollars
The reason to care is the invoice, and the paper lays out a hardware asymmetry that practitioners feel but rarely price out.
An RTX 4090 roughly matches an A100 on raw compute, but the A100 pulls ahead on memory, with more HBM and higher bandwidth.
The two cards are specialists, with the RTX 4090 being a compute bargain that's memory-poor, and the A100 being a memory powerhouse that you overpay for if all you need is FLOPs.
The paper builds a closed-form cost model and validates it, predicting 31.4% savings from heterogeneous deployment and measuring 40.6%.
A $38k heterogeneous cluster delivered better tokens-per-dollar than a $64k homogeneous one, with no latency regression.
What a Phase-Aware System Looks Like
To prove the theory, the paper builds a serving system, HeteroServe, around four ideas worth understanding.
Modality-level partitioning runs the vision encoder on a compute-dense GPU and the language model on an HBM-heavy GPU, with only embeddings crossing the link between them.
An embedding-only transfer protocol uses dynamic buffer allocation to absorb variable token counts.
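One way to picture an embedding-only transfer with dynamic buffer allocation is a length-prefixed frame: the sender announces the tensor shape, and the receiver allocates exactly that much before reading the payload. The sketch below illustrates the general idea and is not HeteroServe's actual wire format, which the text does not describe. The header layout and function names are hypothetical, and an in-memory stream stands in for the network socket.

```python
import io
import struct

# Hypothetical header: num_tokens (u32), hidden_dim (u32), bytes_per_element (u8).
HEADER = struct.Struct("!IIB")


def send_embedding(stream, payload, num_tokens, hidden_dim, bytes_per_element=2):
    """Write a shape header followed by the raw embedding bytes."""
    expected = num_tokens * hidden_dim * bytes_per_element
    if len(payload) != expected:
        raise ValueError(f"payload is {len(payload)} bytes, expected {expected}")
    stream.write(HEADER.pack(num_tokens, hidden_dim, bytes_per_element))
    stream.write(payload)


def read_exact(stream, size):
    """Allocate a buffer of exactly `size` bytes and fill it from the stream."""
    buffer = bytearray(size)  # sized only after the header is known
    view = memoryview(buffer)
    filled = 0
    while filled < size:
        count = stream.readinto(view[filled:])
        if not count:
            raise EOFError("stream ended before the full frame arrived")
        filled += count
    return bytes(buffer)


def recv_embedding(stream):
    """Read one frame and return (num_tokens, hidden_dim, payload)."""
    num_tokens, hidden_dim, bytes_per_element = HEADER.unpack(
        read_exact(stream, HEADER.size))
    payload = read_exact(stream, num_tokens * hidden_dim * bytes_per_element)
    return num_tokens, hidden_dim, payload


# Two requests with different token counts share one stream.
stream = io.BytesIO()
small = bytes(4 * 8 * 2)   # 4 tokens x 8 dims x 2 bytes
large = bytes(range(96))   # 6 tokens x 8 dims x 2 bytes
send_embedding(stream, small, num_tokens=4, hidden_dim=8)
send_embedding(stream, large, num_tokens=6, hidden_dim=8)

stream.seek(0)
first = recv_embedding(stream)
second = recv_embedding(stream)
print(first[0], second[0])  # token counts recovered from the headers
```

Because the size travels with each frame, the receiver never needs a worst-case preallocated buffer, which matters when image resolution changes the token count from request to request.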
Cross-type work stealing allows encoder GPUs to steal decode work from the language pool when they have no vision work, recovering utilization without role-switching complexity.
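The scheduling rule behind cross-type work stealing can be sketched in a few lines. This is a simplified illustration under assumed names (`next_task`, the role strings, and the two queues are all hypothetical), not the paper's scheduler: an encoder worker always prefers vision work and only takes decode work when its own queue is empty.

```python
from collections import deque


def next_task(role, vision_queue, decode_queue):
    """Pick the next task for a worker of the given role.

    Encoder workers prefer vision work and steal decode work only when idle.
    Decoder workers never change role. Returns (kind, task) or None."""
    if role == "encoder":
        if vision_queue:
            return ("vision", vision_queue.popleft())
        if decode_queue:
            # Steal from the tail so the decoder pool keeps the oldest requests.
            return ("stolen-decode", decode_queue.pop())
        return None
    if role == "decoder":
        if decode_queue:
            return ("decode", decode_queue.popleft())
        return None
    raise ValueError(f"unknown role: {role}")


vision_queue = deque(["img-1"])
decode_queue = deque(["req-a", "req-b", "req-c"])

print(next_task("encoder", vision_queue, decode_queue))  # vision work first
print(next_task("encoder", vision_queue, decode_queue))  # idle, so it steals
print(next_task("decoder", vision_queue, decode_queue))  # decoder takes the oldest
```

Stealing from the opposite end of the queue is a common work-stealing convention that reduces contention between the owner and the thief. Whether the paper's system does the same is not stated in the text.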
Engine optimizations are kept separate, using CUDA Graph-accelerated decoding, packed prefill, and lazy KV allocation.
Mapping This Onto DigitalOcean
Everything above is hardware-agnostic theory, and here's how it lands on infrastructure you can actually rent.
The paper's same-box assumption has to be adapted, as DigitalOcean GPU Droplets are single-GPU-class virtual machines spun up independently.
The practical realization is two Droplets on the same private network, one tier for encode and one for decode, with embeddings crossing the wire between them.
The hardware tiers map cleanly, with DigitalOcean's compute-dense cards being the natural home for the vision encoder and HBM-heavy cards for the language model.
- A 4.5 MB LLaVA-7B embedding crosses in a low-single-digit number of milliseconds.
- A 14 MB Qwen2.5-VL embedding crosses in roughly 5-7 ms.
- Even a full 128-image batch crosses in a few hundred milliseconds.
The transfer survives a network hop instead of PCIe, and the embedding is small enough to tolerate ordinary networking.
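The transfer times above follow from simple arithmetic on payload size and link speed. This sketch computes the wire time on a 25 Gbps link while ignoring protocol overhead and latency, which is why real transfers take somewhat longer. The `efficiency` parameter is a hypothetical knob for modelling that overhead.

```python
def transfer_ms(payload_bytes, link_gbps=25.0, efficiency=1.0):
    """Milliseconds to push payload_bytes over a link, ignoring latency.

    efficiency < 1.0 models protocol overhead (hypothetical knob)."""
    bits = payload_bytes * 8
    return bits / (link_gbps * 1e9 * efficiency) * 1e3


llava_embedding = 576 * 4096 * 2       # about 4.5 MB, from the figure above
qwen_embedding = 14_000_000            # about 14 MB, from the figure above
batch_of_128 = 128 * llava_embedding   # a full 128-image batch

print(f"LLaVA-7B embedding : {transfer_ms(llava_embedding):.2f} ms")
print(f"Qwen2.5-VL embedding: {transfer_ms(qwen_embedding):.2f} ms")
print(f"128-image batch     : {transfer_ms(batch_of_128):.0f} ms")
```

For comparison, pushing the roughly 350 MB KV cache of a single request through the same function gives a wire time above 100 ms per request, which is the cost that the modality-boundary cut avoids.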
You can build this on GPU Droplets, giving you GPU-level control and the ability to wire the two pools together over a private network.
When Serverless Is Still the Right Call
This whole argument cuts against fully managed, serverless inference, and that deserves an honest assessment.
For a large fraction of teams, serverless is the right trade, because the operational simplicity it provides is worth more than a cost optimization.
How to Measure the Tax in Your Own Stack
You don’t have to take any of this on faith. The HBM tax is directly observable, and quantifying it in your own deployment is the right first move before you change any infrastructure.
Run the isolation experiment. Hold the LLM backbone fixed and vary only the visual input: (a) text only, (b) one low-resolution image, (c) one high-resolution image, (d) multiple images. Plot inter-token latency against image-token count. The slope of that line is the tax: the per-image-token cost your decode loop pays on every step.
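Once the four runs are collected, the slope can be estimated with an ordinary least-squares fit. The numbers in this sketch are placeholders, not real measurements, so replace them with the image-token counts and mean inter-token latencies from your own harness.

```python
def least_squares(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Placeholder data, NOT real measurements: image tokens per request and the
# mean inter-token latency (ms) for runs (a) through (d).
image_tokens = [0, 144, 576, 2304]
itl_ms = [20.0, 20.6, 22.3, 29.2]

slope, baseline = least_squares(image_tokens, itl_ms)
print(f"baseline ITL: {baseline:.2f} ms")
print(f"tax: {slope * 1000:.2f} ms per 1,000 image tokens")
```

The intercept approximates the text-only inter-token latency, and the slope is the per-image-token cost added to every decode step.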
What to instrument:
- HBM utilization broken out by phase (Nsight Systems). Confirm the split: near-zero bandwidth during encode, near-saturation during decode.
- KV cache size per request at varying image-token counts. Watch it balloon as resolution climbs.
- Throughput vs. image-token count at a fixed batch size. This is your headline curve.
- ITL degradation under concurrency. Where the compounding shows up, and the metric closest to what users feel.
Tools that already give you most of this: vLLM’s Prometheus metrics expose KV-cache utilization and request-queue depth; NVIDIA Nsight Compute gives per-kernel bandwidth so you can attribute traffic to phases; a custom harness with controlled batch sizes closes the gap. DigitalOcean’s vLLM sizing guide walks through reading TTFT, ITL, and KV-cache pressure if you want a checklist.
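Prometheus metrics are exposed as plain text, so a few lines of parsing are enough to pull out the values of interest. In this sketch the metric names and the sample payload are assumptions for illustration. vLLM's metric names have changed between releases, so check the `/metrics` output of your own server before relying on them.

```python
def parse_metrics(text):
    """Parse Prometheus text-format lines into {metric_name: [values]}.

    Labels are ignored, and comment lines and malformed lines are skipped."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        left, _, raw_value = line.rpartition(" ")
        name = left.split("{", 1)[0].strip()
        try:
            value = float(raw_value)
        except ValueError:
            continue  # not a "name value" sample line
        metrics.setdefault(name, []).append(value)
    return metrics


# Hypothetical sample of what a scrape might return; names are assumptions.
SAMPLE = """\
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="llava"} 0.82
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="llava"} 3.0
"""

parsed = parse_metrics(SAMPLE)
print(parsed["vllm:gpu_cache_usage_perc"])  # KV-cache utilization samples
print(parsed["vllm:num_requests_waiting"])  # request-queue depth samples
```

Sampling these values while you sweep image-token count gives the KV-cache pressure curve alongside the latency curve from the isolation experiment.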
What the data should reveal: under concurrent load, throughput degrades faster than linearly with image-token count. That’s the signature. It’s super-linear because HBM bandwidth is shared and contended, and the decode loop pays the inflated KV-cache cost on every token for every request in the batch at once. Linear would mean each image costs a fixed amount; super-linear means the images are fighting each other for bandwidth: the tax, observed in the wild.
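A simple test for super-linear growth is to fit a power law in log-log space: if the added per-token latency scales as `n ** p` with image-token count `n`, an exponent `p` above 1 indicates super-linear growth. The sketch below uses synthetic data generated from a known exponent to show the mechanics, so it is not evidence about any real system.

```python
import math


def power_law_exponent(xs, ys):
    """Fit y = c * x**p by least squares on (log x, log y) and return p."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    return sxy / sxx


image_tokens = [144, 576, 1152, 2304]

# Synthetic overheads (ms above the text-only baseline), NOT measurements.
linear_overhead = [0.004 * n for n in image_tokens]
contended_overhead = [1e-3 * n ** 1.5 for n in image_tokens]

p_linear = power_law_exponent(image_tokens, linear_overhead)
p_contended = power_law_exponent(image_tokens, contended_overhead)

print(f"linear case exponent   : {p_linear:.2f}")
print(f"contended case exponent: {p_contended:.2f}")
```

With real measurements, subtract the text-only baseline before taking logs and use only runs with a non-zero image-token count, since the logarithm is undefined at zero.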
This Generalizes Beyond Vision
The reason this matters past LLaVA and Qwen-VL is that the core advantage is a property of the architecture, not of vision specifically.
The asymmetry is simple: an encoder produces output that’s O(1) per layer (a fixed embedding, regardless of decoder depth) while a decoder accumulates O(L) of KV state. Any pairing with that shape benefits from the same analysis. Audio encoders like Whisper feeding a language model: same structure, same cut. Video encoders, which produce enormous token counts and would make a KV-cache migration brutal: same cut, bigger payoff. Multimodal models with several encoder branches: each branch is another O(1)-per-layer output you can partition off cheaply.
And the trend lines compound the advantage over time. Models keep getting deeper, which raises L and grows the transfer ratio in modality-disaggregation’s favor. Compute density on cheaper cards keeps climbing faster than interconnect bandwidth, widening the gap between cheap FLOPs and expensive bytes. The paper isn’t describing a quirk of one model generation: it’s describing a structural property of encoder-decoder multimodal systems that gets more true as both hardware and models advance.
The Tax Was Never a Software Bug
Come back to the anomaly we opened with: the model that crawled while every dashboard insisted the GPU was fine. You can explain it precisely now. Image tokens enter the KV cache at prefill and stay there through every decode step. HBM bandwidth, the single scarcest resource in language decoding, gets split between the model weights and an ever-growing cache now carrying hundreds of image tokens that contribute memory traffic but no arithmetic. The GPU looked healthy because no single counter was pegged. The tax is paid in the contention between resources, not the saturation of any one of them.
The fix is a question about where in the inference graph you draw the line. The research says the modality boundary (the seam between vision encoder and language model) is the provably optimal place to draw it. Cut there and you reduce cross-device transfer by 12× to 196× depending on architecture and attention scheme (the GQA-heavy models that dominate today’s deployments at the lower end, deeper and MHA models at the upper), you make ordinary networking sufficient for the transfer, and you open up a cost structure (on the order of 30–40% cheaper in the source research) that homogeneous deployments structurally cannot reach. On DigitalOcean, that means an Ada-class encoder pool, an HBM-class decoder pool, and 25 Gbps of private network between them.
The GPU was never the problem. Asking one GPU to be two different machines was.
FAQ
What is the HBM tax?
HBM stands for High Bandwidth Memory, the fast memory on a GPU. The “HBM tax” refers to the hidden performance and cost penalty you pay when a vision-language model runs both image encoding and text generation on the same GPU. The two phases need different things from the hardware, so neither gets what it needs.
My GPU metrics look normal. How do I know if I’m paying this tax?
Run the same model with different inputs (text only, one small image, one large image, multiple images) and measure how long each generated token takes as image size goes up. If the per-token latency grows faster than linearly with image token count, you’re paying the tax.
What is a KV cache and why does it matter here?
KV stands for Key-Value. When a language model generates text, it stores intermediate results for every token it has already processed so it doesn’t have to recompute them. This stored data is the KV cache. Image tokens get added to this cache and stay there for the entire generation, so every output token has to read all of them from memory, again and again.
Why can’t I fix this by tuning batch size or quantization?
Those settings help at the margins, but they don’t address the root cause. The problem is that vision encoding and language decoding want fundamentally different hardware. No config change reconciles that on a single GPU.
What does modality-level disaggregation mean?
It means running the vision encoder on one GPU and the language model on a separate GPU. The only thing transferred between them is the encoder’s output embedding, a small tensor whose size does not grow with model depth, rather than the full KV cache. This makes the transfer fast enough to work over a standard cloud private network.
Do I need special hardware or interconnects like NVLink?
No. The embedding transferred between GPUs is small enough (a few megabytes) that a standard 25 Gbps private network handles it with negligible overhead. This is what makes the approach practical on standard cloud infrastructure.
When does this actually make sense to implement?
When three things are true: you’re serving enough multimodal traffic that a 30–40% cost reduction is meaningful, your requests are image- or video-heavy (not occasional), and you’re willing to manage your own serving setup. If you’re early-stage or low-volume, a managed serverless endpoint is simpler and likely the better trade-off.
Does this apply to models beyond vision-language models?
Yes. Any architecture where a fixed-output encoder feeds a depth-scaling decoder has the same structure. Audio models, video models, and multi-encoder multimodal systems all benefit from the same analysis.
Sources
- Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity
- Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects
- How KV Caching Slashes LLM Inference Costs at Scale
- How to Choose the Right GPU for vLLM Inference
- Long-Context Inference at Scale: The Hidden Infrastructure Cost
About the author(s)
A Developer Advocate by profession. I like to build with Cloud, GenAI and can build beautiful websites using JavaScript.
With a strong background in data science and over six years of experience, I am passionate about creating in-depth content on technologies. Currently focused on AI, machine learning, and GPU computing, working on topics ranging from deep learning frameworks to optimizing GPU-based workloads.