Optimizing LLM Compression with SparseGPT and Wanda


Authored by Adrien Payong and Shaoni Mukherjee

Training large models is expensive, but inference is the recurring cost. Each request consumes GPU memory and compute. Model weights must stay resident in GPU memory, and the KV cache grows as tokens are generated. The serving engine must also handle many concurrent users at acceptable latency.

Consider a quick thought experiment.

The memory needed for model weights is the number of parameters multiplied by the bytes per parameter. For instance, a seven-billion-parameter model in FP16 requires approximately 14 GB for the weights alone (7 × 10^9 parameters × 2 bytes). This does not account for additional buffers or caches. Large models quickly exceed the VRAM of commodity GPUs, and at long contexts or high concurrency the KV cache can demand more memory than the weights. Simply acquiring larger GPUs is not a sustainable solution.
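The arithmetic above can be written as a small helper. This is a minimal sketch: the function name `weight_memory_gb` and the dtype table are illustrative, and 1 GB is taken as 10^9 bytes to match the 14 GB figure in the text.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Common storage widths. Buffers, activations, and the KV cache are excluded.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

for dtype, width in BYTES_PER_PARAM.items():
    print(f"7B parameters in {dtype}: {weight_memory_gb(7e9, width):.1f} GB")
```

Halving the bytes per parameter halves the weight footprint, which is why precision and sparsity are usually considered together when sizing a deployment.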

This is where LLM compression comes in. Instead of continually upgrading to larger GPUs, teams can create smaller, more efficient models. Various compression techniques exist, including quantization, distillation, low-rank approximation, and pruning. This article focuses on pruning, specifically SparseGPT and Wanda, two methods that compress large language models without costly retraining.

Key Takeaways

Pruning lowers LLM inference costs by reducing non-zero weights and alleviating memory pressure.
SparseGPT excels in maintaining quality, especially under aggressive pruning or higher sparsity levels.
Wanda offers a faster and simpler implementation by avoiding complex calculations.
Sparse models do not automatically guarantee speed; real improvements require specialized runtimes and hardware support.
Pruning should be part of a broader optimization strategy, integrating quantization, KV-cache adjustments, batching, caching, speculative decoding, and routing.

The Importance of LLM Compression

LLM inference faces four primary production bottlenecks:

GPU VRAM limitations: Model weights, KV cache, runtime tensors, and multiple active requests must fit in memory. Offline tests may indicate functionality, but real-world traffic can lead to failures due to memory constraints from increased sequence lengths and batch sizes.
Memory bandwidth constraints: During autoregressive decoding, the model generates tokens sequentially, requiring frequent loading of weights and cached states. This can render decoding memory-bound rather than compute-bound, limiting the benefits of higher FLOPs.
Latency issues: Applications need low time-to-first-token, consistent inter-token latency, and predictable end-to-end latency. A model may generate high-quality responses, but slow speeds can negatively impact user experience.
Cost implications: GPU cloud infrastructure can be costly when GPUs are underutilized or overprovisioned. Small efficiency gains can significantly lower costs per million tokens at scale.
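To see how the KV cache competes with the weights for VRAM, a back-of-the-envelope estimate helps. The sketch below uses the standard accounting of one key and one value tensor per layer. The model shape (32 layers, 32 KV heads, head dimension 128) is a hypothetical configuration chosen for illustration, not a figure from this article.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_value: int = 2,  # FP16
) -> int:
    """Approximate KV-cache size: keys and values for every layer, token, and sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Hypothetical 7B-class shape, 8 concurrent sequences of 4096 tokens each.
cache = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096, batch_size=8)
weights = 7e9 * 2  # FP16 weights of a 7B-parameter model

print(f"KV cache: {cache / 1e9:.1f} GB, weights: {weights / 1e9:.1f} GB")
```

Under these assumptions the cache is larger than the weights, which is the situation the VRAM and bandwidth bottlenecks describe.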

Pruning decreases the number of non-zero weights in a model, producing sparse weight matrices in which a large share of the entries are zero. In theory this reduces memory and compute requirements. If the inference framework exploits this sparsity through compressed storage formats and specialized matrix multiplication kernels, models can achieve a smaller VRAM footprint, higher throughput, and lower latency. The actual speed improvements depend on the hardware and software employed.

Beyond GPU servers, compression opens new deployment possibilities. Pruned models can run on smaller GPUs, edge devices, or multi-model servers. Memory savings allow multiple models to coexist on one GPU, use more affordable instances, or accommodate larger batch sizes within the same budget. Compression directly influences inference economics by lowering memory usage per token, thus reducing costs.

Understanding Sparsity: Unstructured vs. Structured

Neural networks are generally dense, with most weights being non-zero. In contrast, sparse networks have a large fraction of zero-valued weights. Pruning increases sparsity by permanently zeroing selected weights. There are two key types of sparsity in language models: unstructured and structured.

Unstructured Sparsity

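Unstructured pruning removes individual weights anywhere in the model, with no constraint on their position. The weights are chosen by an importance criterion rather than at random, which gives maximum flexibility in selecting what to discard while minimizing the effect on model performance. For a given sparsity level, this method usually preserves accuracy better than structured alternatives.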

However, GPUs are optimized for dense matrix operations. Sparse matrices with irregularly positioned non-zero weights create irregular memory access patterns, leading to indexing overhead and reduced kernel efficiency. Without specialized sparse-matrix multiplication kernels, dense operations still process the zero values, so unstructured sparsity does not by itself improve latency.
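A small PyTorch sketch makes the point concrete. It uses plain magnitude pruning as a stand-in criterion (not SparseGPT or Wanda), and the 64×64 matrix is an arbitrary toy size.

```python
import torch

torch.manual_seed(0)
W = torch.randn(64, 64)

# Unstructured magnitude pruning: zero the 50% of entries with the smallest |w|,
# wherever they happen to sit in the matrix.
k = W.numel() // 2
threshold = W.abs().flatten().kthvalue(k).values
W_pruned = torch.where(W.abs() > threshold, W, torch.zeros_like(W))

sparsity = (W_pruned == 0).float().mean().item()
dense_bytes_before = W.numel() * W.element_size()
dense_bytes_after = W_pruned.numel() * W_pruned.element_size()

print(f"sparsity: {sparsity:.2%}")
print(f"dense storage before/after: {dense_bytes_before} / {dense_bytes_after} bytes")

# A dense matmul performs the same multiply-adds on either matrix.
x = torch.randn(64, 8)
y = W_pruned @ x
```

The pruned tensor is still a dense tensor: it occupies the same number of bytes and goes through the same dense kernel until it is converted to a sparse format that a sparse kernel can consume.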

Structured Sparsity

Structured pruning removes weights in regular patterns, such as rows, columns, or fixed groups. A common hardware-friendly pattern is 2:4 sparsity, where two weights in a block of four are set to zero, achieving a 50% sparsity rate.

Supported GPUs can accelerate these fixed sparse patterns. NVIDIA’s Ampere architecture supports Sparse Tensor Cores for fine-grained structured sparsity, including the 2:4 pattern, theoretically doubling matrix multiplication throughput compared to dense operations. However, actual latency improvements vary based on the model and workload.
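As an illustration of the 2:4 pattern, the sketch below builds a mask that keeps the two largest-magnitude weights in every contiguous group of four along the input dimension. The helper name `two_four_mask` is hypothetical, and magnitude is used only as a simple selection rule. SparseGPT and Wanda would rank weights with their own scores.

```python
import torch


def two_four_mask(W: torch.Tensor) -> torch.Tensor:
    """Boolean mask keeping the 2 largest-|w| entries in each group of 4 input weights."""
    out_features, in_features = W.shape
    if in_features % 4 != 0:
        raise ValueError("in_features must be divisible by 4 for a 2:4 pattern.")
    groups = W.abs().reshape(out_features, in_features // 4, 4)
    keep = groups.topk(2, dim=-1).indices          # positions of the 2 survivors per group
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep, True)
    return mask.reshape(out_features, in_features)


torch.manual_seed(0)
W = torch.randn(8, 16)
mask = two_four_mask(W)
W_24 = W * mask

print(f"sparsity: {(W_24 == 0).float().mean().item():.0%}")
```

Because every group of four contains exactly two zeros, the positions of the survivors can be encoded compactly, which is what makes the pattern hardware-friendly.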

Comparing Traditional and Modern LLM Pruning

Traditional neural network pruning typically follows a three-step process: Train a dense model, remove less critical weights, and fine-tune or retrain the sparse model to regain accuracy. This iterative process can yield very sparse networks but requires significant computational resources.

For billion-parameter LLMs, this approach is impractical. It requires loading massive models across distributed GPU clusters and preparing data for recovery training. After each pruning phase, the model must be fine-tuned and its quality evaluated before proceeding.

Older pruning techniques often rely on costly second-order approximations or iterative updates, which may maintain accuracy but are harder to scale for larger models as we transition from millions to billions of parameters.

Modern LLM pruning techniques emphasize post-training, one-shot pruning. They utilize a pretrained model and a small calibration dataset to identify less important weights, enabling pruning while keeping outputs close to the original without full retraining.

SparseGPT and Wanda exemplify this approach. SparseGPT employs second-order layer reconstruction for importance estimation, while Wanda uses a simpler activation-aware weight-importance scoring.

SparseGPT: Reconstruction-Based One-Shot Pruning

SparseGPT is a one-shot pruning technique designed for large GPT-style models, framing pruning as a layer-wise sparse regression challenge. For a linear layer with weights W and calibration activations X, the goal is to find a pruned matrix that minimizes reconstruction error.

The outputs of the pruned layer should closely resemble those of the original layer for a specific calibration dataset. SparseGPT aims to align this output rather than simply discarding the smallest weights. It utilizes second-order information from calibration activations to estimate the impact of pruning on layer output. After determining which weights to prune, SparseGPT adjusts the remaining weights to compensate for the removed ones and minimize reconstruction error.
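The difference between "prune only" and "prune and compensate" can be seen on a single output row. The sketch below is not SparseGPT's algorithm, which uses approximate second-order updates to stay tractable at scale. It solves the same layer-wise objective directly with least squares for a fixed mask, on toy random data, to show why adjusting the surviving weights reduces reconstruction error.

```python
import torch

torch.manual_seed(0)
in_features, n_tokens = 8, 64
# Toy calibration inputs X: [in_features, n_tokens]; one output row w: [in_features].
X = torch.randn(in_features, n_tokens, dtype=torch.float64)
w = torch.randn(in_features, dtype=torch.float64)


def row_error(w_ref: torch.Tensor, w_hat: torch.Tensor, X: torch.Tensor) -> float:
    """Squared reconstruction error ||w_ref X - w_hat X||^2 for one output row."""
    return torch.sum((w_ref @ X - w_hat @ X) ** 2).item()


# Fixed mask for illustration: keep the 4 largest-magnitude weights.
keep = w.abs().topk(4).indices

# (a) Prune only: zero the dropped weights, leave the survivors untouched.
w_zeroed = torch.zeros_like(w)
w_zeroed[keep] = w[keep]

# (b) Prune and compensate: refit the survivors by least squares so that
#     w_comp @ X is as close as possible to the dense output w @ X.
target = (w @ X).unsqueeze(1)                               # [n_tokens, 1]
solution = torch.linalg.lstsq(X[keep].T, target).solution   # [4, 1]
w_comp = torch.zeros_like(w)
w_comp[keep] = solution.squeeze(1)

err_zeroed = row_error(w, w_zeroed, X)
err_comp = row_error(w, w_comp, X)
print(f"prune only: {err_zeroed:.3f}   prune + compensate: {err_comp:.3f}")
```

Both pruned rows have the same sparsity pattern. The compensated row can only match or beat the uncompensated one on the calibration data, because the least-squares solution is optimal among all rows that share that mask.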

Why SparseGPT is Effective

SparseGPT effectively balances accuracy and efficiency by integrating several principles:

Utilizes second-order sensitivity to rank weights based on their impact on output reconstruction, allowing important weights to be preserved even if they are not the largest.
Processes layers individually, so each layer's reconstruction problem is solved on its own and the method stays tractable for very large models.
Employs weight compensation, allowing adjustments to remaining weights after pruning to reduce reconstruction error and uphold model behavior at higher sparsity levels.
Supports various patterns, including unstructured and semi-structured pruning patterns like 2:4, making it compatible with hardware designed for structured sparsity.

Due to these attributes, SparseGPT maintains quality significantly better than simpler pruning methods, scaling effectively to models with tens or hundreds of billions of parameters.

Wanda: An Activation-Aware Pruning Method

Wanda, or Pruning by Weights and Activations, is a lightweight pruning strategy introduced as a simpler alternative to reconstruction-based methods like SparseGPT. It does not involve complex layer-wise reconstruction or Hessian estimation; instead, it relies on a straightforward activation-aware importance score.

The underlying premise of this score is that a weight is important if it has a high magnitude and connects to an input dimension with strong activation. Weights with lower magnitudes or those linked to weakly activated input dimensions receive low importance scores and can be pruned. Wanda scores each weight as its magnitude multiplied by the norm of the corresponding input activation, then removes the lowest-scoring weights within each output row. This method requires no retraining or weight updates, so the pruned model can be used directly. In experiments with LLaMA and LLaMA-2, Wanda significantly outperforms magnitude-based pruning and competes well against more complex methods.
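A two-weight example shows how the activation term can change which weight is pruned. The numbers are made up for illustration, and `index_to_prune` is a hypothetical helper.

```python
# One output row with two weights, and the L2 norm of each input feature
# measured over a (hypothetical) calibration set.
weights = [0.9, 0.2]
activation_norms = [0.1, 10.0]

magnitude_scores = [abs(w) for w in weights]
wanda_scores = [abs(w) * n for w, n in zip(weights, activation_norms)]


def index_to_prune(scores):
    """Index of the lowest-scoring weight, i.e. the one removed first."""
    return min(range(len(scores)), key=scores.__getitem__)


print("magnitude pruning removes weight", index_to_prune(magnitude_scores))
print("Wanda-style scoring removes weight", index_to_prune(wanda_scores))
```

Magnitude pruning drops the small weight even though it sits on a strongly activated input, while the activation-aware score drops the large weight whose input is almost always near zero.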

Advantages of Wanda

The simplicity of Wanda offers several advantages:

Simple implementation: The algorithm only requires collecting activation norms and sorting weights, with no need for Hessian approximations or large regression problems.
Rapid pruning process: There are no weight updates during pruning, making it significantly faster than SparseGPT; Wanda can be 5-10 times quicker for 70B models on a single GPU.
Maintains competitive accuracy: Despite its simplicity, Wanda preserves quality well at moderate sparsity levels, achieving perplexity close to SparseGPT on LLaMA-2 models and outperforming basic magnitude pruning.
Useful baseline for experimentation: Engineering teams can quickly implement Wanda to evaluate the behavior of sparse models before committing to more advanced optimization techniques.

However, Wanda may experience a quicker decline in accuracy compared to SparseGPT at very high sparsity levels. It does not incorporate weight compensation and primarily focuses on unstructured pruning, although it does include options for 2:4 and 4:8 patterns.

Contrasting SparseGPT and Wanda

The following table outlines the comparison:

SparseGPT excels when accuracy retention is critical and when engineering resources are available. Wanda is preferable for quick experiments, lighter deployments, or when approximate results are acceptable. Many teams evaluate both methods to identify the optimal approach for their models, sparsity targets, and hardware.

Implementing Pruning Workflow on GPU Cloud

Deploying pruned LLMs requires consideration of both model-level and infrastructure-level factors. A typical workflow involves:

Python Example: Applying Wanda to a Linear Layer

This example outlines a simplified PyTorch function for applying Wanda pruning to a single linear layer. In practice, it should be extended to encompass all projection layers and manage structured patterns.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def wanda_prune_linear(
    layer: nn.Linear,
    input_activations: torch.Tensor,
    sparsity: float = 0.5,
) -> nn.Linear:
    """
    Apply Wanda-style unstructured pruning to one Linear layer.

    Args:
        layer: PyTorch Linear layer.
        input_activations: Calibration activations with shape
            [batch, seq_len, hidden_dim] or [num_tokens, hidden_dim].
        sparsity: Fraction of weights to prune per output row.

    Returns:
        The pruned layer, modified in-place.
    """
    if not isinstance(layer, nn.Linear):
        raise TypeError("wanda_prune_linear expects an nn.Linear layer.")
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1.")

    # Flatten activations to shape [n_tokens, input_dim]
    if input_activations.dim() == 3:
        X = input_activations.reshape(-1, input_activations.shape[-1])
    else:
        X = input_activations

    W = layer.weight
    if X.shape[-1] != W.shape[1]:
        raise ValueError(
            f"Activation dimension {X.shape[-1]} does not match "
            f"layer input dimension {W.shape[1]}."
        )

    # Compute L2 norm of each input dimension (in float32 to avoid fp16 overflow)
    activation_norm = torch.norm(X.float(), p=2, dim=0)

    # Compute Wanda importance scores: |W_ij| * ||X_j||
    scores = torch.abs(W).float() * activation_norm.unsqueeze(0)

    # Number of weights to prune per output row
    num_prune = int(W.shape[1] * sparsity)
    if num_prune == 0:
        return layer

    # Build pruning mask: False marks the lowest-scoring weights in each row
    prune_indices = torch.topk(scores, k=num_prune, dim=1, largest=False).indices
    mask = torch.ones_like(W, dtype=torch.bool)
    mask.scatter_(1, prune_indices, False)

    # Apply mask in-place
    W.mul_(mask)
    return layer


# Example usage
device = "cuda" if torch.cuda.is_available() else "cpu"
hidden_dim = 4096

linear = nn.Linear(hidden_dim, hidden_dim, bias=False).half().to(device)
calibration_activations = torch.randn(
    4, 128, hidden_dim, device=device, dtype=torch.float16
)

pruned_layer = wanda_prune_linear(linear, calibration_activations, sparsity=0.5)

zero_count = torch.sum(pruned_layer.weight == 0).item()
total_count = pruned_layer.weight.numel()
print(f"Sparsity: {zero_count / total_count:.2%}")
```

Here’s a simplified example of Wanda-style pruning for a single nn.Linear layer in PyTorch. Calibration activations are utilized to compute the L2 norm for every input dimension. Activation norms are multiplied by the absolute weight values to calculate Wanda importance scores. The lowest-scoring weights are pruned for each output row based on the desired sparsity ratio using a binary mask. In this example, a half-precision linear layer is created, random calibration activations generated, and the layer pruned to 50% sparsity, with the final percentage of zeros in the weights displayed.
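Extending this to a whole model requires the input activation norm of every linear layer, not just one. One way to gather them, sketched below, is with PyTorch forward hooks. The helper `collect_input_norms` and the toy two-layer model are hypothetical, and the sketch collects all norms from the dense model in a single pass, which is a simplification compared with pruning layers one after another.

```python
import torch
import torch.nn as nn


@torch.no_grad()
def collect_input_norms(model: nn.Module, calibration_batches):
    """Return {layer_name: L2 norm of each input feature} for every nn.Linear."""
    sq_sums = {}
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            # Flatten [batch, seq_len, dim] to [tokens, dim] and accumulate squared sums.
            x = inputs[0].reshape(-1, inputs[0].shape[-1]).float()
            sq_sums[name] = sq_sums.get(name, 0) + (x ** 2).sum(dim=0)
        return hook

    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            handles.append(module.register_forward_hook(make_hook(name)))

    for batch in calibration_batches:
        model(batch)

    # Remove hooks so later forward passes are unaffected.
    for handle in handles:
        handle.remove()

    return {name: s.sqrt() for name, s in sq_sums.items()}


# Toy stand-in for a block's projection layers.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
batches = [torch.randn(4, 10, 16) for _ in range(3)]

norms = collect_input_norms(model, batches)
for name, n in norms.items():
    print(name, tuple(n.shape))
```

Accumulating squared sums across batches gives the same per-feature L2 norm as concatenating all calibration tokens, without holding every activation in memory at once.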

Evaluating Sparse Models: Metrics and Expectations

Assessing pruned models necessitates metrics that reflect both quality and efficiency:

Perplexity and downstream accuracy demonstrate the effects of pruning on language modeling and task performance. SparseGPT authors report that GPT-family models can achieve 50% or more sparsity in one shot without retraining and with minimal accuracy loss, with OPT-175B and BLOOM-176B reaching 60% unstructured sparsity with negligible perplexity increase.
VRAM consumption indicates memory savings. If the inference stack utilizes compressed formats for pruned weights and employs sparse kernels, VRAM usage decreases in line with sparsity.
Throughput (tokens per second) and latency assess serving efficiency. Speed improvements necessitate replacing dense kernels with sparse ones; otherwise, zero weights still consume compute resources. For semi-structured 2:4 sparsity, a PyTorch tutorial shows a 1.3x speedup for BERT on A100 GPUs.
Time-to-first token and inter-token latency evaluate user experience. A reduced memory footprint can help decrease these latencies.
Cost per million tokens links engineering enhancements to business value. Compression that halves memory and boosts throughput can significantly lower costs per token.
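The last metric is simple arithmetic once throughput has been measured. The sketch below assumes a single GPU billed by the hour and kept fully busy. The hourly price and token rates are placeholder values, not benchmarks.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """Serving cost in USD per one million generated tokens on one fully utilized GPU."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive.")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1e6


# Placeholder numbers: same GPU price, dense vs. a hypothetical 1.3x faster sparse setup.
dense_cost = cost_per_million_tokens(gpu_hourly_usd=3.0, tokens_per_second=1000)
sparse_cost = cost_per_million_tokens(gpu_hourly_usd=3.0, tokens_per_second=1300)

print(f"dense:  ${dense_cost:.3f} per 1M tokens")
print(f"sparse: ${sparse_cost:.3f} per 1M tokens")
```

Cost scales inversely with throughput, so a measured throughput gain translates directly into a per-token saving only if utilization and quality stay the same.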

It's crucial to maintain realistic expectations: achieving 50% sparsity does not guarantee a 2x speedup. Actual speed improvements depend on hardware capabilities, kernel implementations, batch sizes, and whether weight computation or KV-cache operations dominate the workload. Structured sparsity is generally easier to accelerate due to hardware and software support for specific patterns, while unstructured sparsity often necessitates custom CUDA kernels or frameworks like Triton.
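One way to set expectations is an Amdahl's-law estimate: only the fraction of step time spent in accelerated weight matmuls gets faster. This is a rough model, and the fractions below are illustrative assumptions that would need to be measured with a profiler on a real workload.

```python
def end_to_end_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's-law estimate of overall speedup when only part of the work is accelerated."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1.")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive.")
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup)


# Assume sparse kernels run the weight matmuls 2x faster (the theoretical 2:4 peak).
for fraction in (1.0, 0.6, 0.3):
    print(f"{fraction:.0%} of time in weight matmuls -> "
          f"{end_to_end_speedup(fraction, 2.0):.2f}x overall")
```

When attention over a long KV cache or other non-matmul work dominates, the overall gain shrinks well below the kernel-level figure.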

Infrastructure Considerations for Sparse Inference

Pruning is not merely a model optimization technique; it presents an infrastructure challenge. Several factors determine whether sparsity translates into speed:

Checkpoint format: Storing zero weights in dense tensors wastes storage and forfeits the potential memory savings. Semi-structured sparsity can be stored in compressed formats, where non-zero elements are accompanied by metadata describing their locations. Libraries like cuSPARSELt offer kernels optimized for specific sparse formats.
Kernel support: Dense matrix multiplication kernels do not skip zero weights. To achieve speed, the runtime must use sparse kernels that exploit the sparse weight layout. PyTorch’s to_sparse_semi_structured function can convert weights with supported semi-structured sparsity patterns into a sparse tensor type that enables sparse-kernel execution.
Hardware support: NVIDIA Ampere and Hopper GPUs support Sparse Tensor Cores, but only for hardware-friendly patterns like 2:4 sparsity. Unstructured sparsity (weights removed at arbitrary positions) is harder to accelerate and may require general sparse GEMM libraries or custom kernels. It often does not yield significant speedups unless the model is extremely sparse and the runtime is optimized for that sparsity.
Batching and scheduling: While sparse matrix multiplication may offer speed advantages, overall latency also hinges on KV-cache management, batching, request scheduling, and memory bandwidth. In workloads with long contexts or high concurrency, the KV cache can dominate memory usage, diminishing the practical benefits of weight pruning.
Observability: Effective monitoring of latency, throughput, GPU utilization, memory usage, and quality is essential. A sparsified model may show lower latency in a benchmark while its quality on user-facing tasks falls below production standards.
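The checkpoint-format point can be quantified for the 2:4 pattern. The sketch below assumes a compressed layout that stores the two surviving values from each group of four plus a 2-bit position index per surviving value. That metadata width is an assumption about the format, and real libraries may add padding or alignment overhead.

```python
def two_four_compressed_fraction(
    bits_per_value: int = 16,
    metadata_bits_per_kept_value: int = 2,  # assumed index width, see lead-in
) -> float:
    """Compressed size of a 2:4 sparse matrix as a fraction of its dense size."""
    dense_bits = 4 * bits_per_value                                   # 4 values per group
    compressed_bits = 2 * bits_per_value + 2 * metadata_bits_per_kept_value
    return compressed_bits / dense_bits


fp16_fraction = two_four_compressed_fraction(bits_per_value=16)
int8_fraction = two_four_compressed_fraction(bits_per_value=8)

print(f"FP16 2:4 weights: {fp16_fraction:.2%} of dense size")
print(f"INT8 2:4 weights: {int8_fraction:.2%} of dense size")
```

Under these assumptions, 50% sparsity yields slightly less than a 50% reduction in weight storage because the position metadata has to be stored as well.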

The accompanying image illustrates that pruning can only enhance LLM inference when the model, sparse checkpoint format, GPU kernels, and serving infrastructure are properly aligned. It uses a simple 2:4 sparsity example to demonstrate why pruning alone is insufficient for meaningful production gains.

When to Opt for SparseGPT and Wanda

Select SparseGPT or Wanda to lower inference costs without the need for complete retraining. They are useful for fitting larger models onto smaller GPUs, serving multiple models on a single GPU node, reducing weight memory, and freeing VRAM headroom for the KV cache. These methods can facilitate rapid testing of compressed LLM variants, prepare models for edge deployment, and improve the cost-effectiveness of deploying open-source LLMs.

Avoid relying solely on pruning if your inference engine cannot leverage sparse weights, if your workload is dominated by KV-cache memory and bandwidth rather than weights, if model quality declines unacceptably after pruning, or if your deployment hardware lacks sparse acceleration support.

Choose SparseGPT if your primary focus is on retaining quality at higher sparsity levels. Opt for Wanda when swift experimentation and ease of implementation are more critical.

The essential production question is not merely: "Is the model sparse?" A more pertinent inquiry is: "Does this sparse model lower the cost per useful token while maintaining quality?"

FAQs

What is LLM pruning?
LLM pruning is a compression method that eliminates less significant weights from a model to decrease memory usage and inference costs while preserving quality.

What distinguishes SparseGPT from Wanda?
SparseGPT employs second-order reconstruction to maintain layer outputs post-pruning, whereas Wanda uses a simpler activation-aware score based on weight magnitude and input activation norm, making it quicker and easier to implement.

Does pruning inherently accelerate an LLM?
No. Pruning enhances speed only if the inference stack can utilize sparse weights through compatible formats, kernels, and hardware. If dense kernels process zero weights, latency may not improve.

When should SparseGPT be chosen?
SparseGPT is ideal when maintaining quality at higher sparsity levels is paramount. Though more complex, it is designed to uphold model behavior through reconstruction and weight compensation.

When should Wanda be selected?
Wanda is preferable for quick experimentation, simplicity, and low implementation complexity. It serves as a strong baseline for assessing how pruning impacts a model before delving into more intricate optimization techniques.

Conclusion

Both SparseGPT and Wanda demonstrate that large language models can be pruned post-training without retraining. SparseGPT uses a reconstruction-based approach with second-order approximations to do this at scale while retaining accuracy. In contrast, Wanda adopts a simpler activation-aware score, multiplying each weight's magnitude by the norm of its input activation, which enables quick pruning with minimal engineering effort. Both approaches yield sparse models that reduce memory demands and, with appropriate kernels and hardware, can improve inference throughput and latency.

Pruning is just one aspect to consider. Speed improvements and cost reductions will only materialize with system-level support for sparse formats, custom sparse kernels, smart memory optimization, and additional techniques like quantization and KV-cache management. Production inference stacks will likely employ a combination of strategies—pruning, quantization, caching, batching, speculative decoding, and routing—to deliver responsive AI services at the lowest possible cost. In this future landscape, pruning will be a standard component of the LLM optimization process: a practical strategy for reducing costs per token and efficiently scaling AI services.

References

SparseGPT: Accurate One-Shot Pruning for Large Language Models
An Effective Pruning Method for Large Language Models
Accelerating BERT with Semi-Structured (2:4) Sparsity
Estimating Memory Requirements for LLM Inference
Pruning LLMs via Weights and Activations


About the Authors

I am an AI consultant and technical writer with over four years of experience, holding a master’s in AI. I create insightful articles that provide developers and researchers with practical insights, establishing myself as a trusted voice in the tech community.

With a solid background in data science and over six years of experience, I am dedicated to producing comprehensive content on technology topics. My current focus is on AI, machine learning, and GPU computing, covering areas from deep learning frameworks to optimizing GPU workloads.


