Serverless Inference

Teams usually research extensively before selecting a large language model (LLM) for an application. Yet once the model is integrated into the app and pushed to production, its accuracy, time to first token (TTFT), and throughput may be worse than expected, even though it is the same model that was evaluated.

The discrepancy arises because providers do not serve every model the same way. Each provider makes its own infrastructure decisions: how many replicas to keep warm, what precision to serve the model at, which GPU tier to allocate, and how to prioritize request queues. These decisions are rarely documented, and they vary significantly from provider to provider and from model to model on the same provider.

What Providers Actually Control

Most developers assume that if a model is listed on a provider's platform, it is being served in a standard, equivalent way. However, providers make several decisions for each model that compound to produce the latency and consistency observed.

Replica Count and Warm Pool Size

Serverless inference works by dynamically allocating GPU capacity to handle incoming requests. Popular models with consistent, high-volume traffic justify keeping multiple warm replicas ready at all times. Less popular models may have zero warm replicas outside of peak periods, so a request that arrives then triggers a cold start: the model has to be loaded onto a GPU before the first token can be generated.

Quantization

The same model weights can be served at different numerical precisions. Full-precision serving uses the most memory but preserves the original weights exactly. Lower precisions reduce memory footprint but can produce measurable quality differences on reasoning-heavy benchmarks.
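To make the memory side of this trade-off concrete, here is a rough sketch that estimates weight memory at several precisions. The 70-billion-parameter count is a hypothetical figure chosen for illustration, not a model from this article, and the estimate covers weights only (it ignores the KV cache, activations, and runtime overhead).

```python
# Approximate storage cost per parameter at common serving precisions.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "bf16": 2.0,
    "fp8": 1.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return the approximate size of the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Hypothetical model size, used only to show how the numbers scale.
PARAMS = 70e9

for precision in ("fp16", "fp8", "int4"):
    print(f"{precision}: {weight_memory_gb(PARAMS, precision):.0f} GB")
```

Halving the bytes per parameter halves the weight footprint, which is why a provider can fit the same model on fewer or smaller GPUs by lowering precision.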

GPU Hardware Allocation

Not all GPUs are equal, and providers may route different models to different GPU tiers based on demand, available inventory, and the economics of that model's traffic volume.

Inference Engine and Kernel Optimization

How a model is actually executed matters as much as the hardware it runs on. Different inference engines and kernel optimizations produce meaningfully different throughput and latency profiles for the same model weights.

Request Queue Priority and Batching

Under load, providers batch multiple requests together to improve GPU utilization. Popular models with steady, predictable traffic batch efficiently, while niche models with sparse, bursty traffic batch poorly.
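The effect of traffic shape on batching can be illustrated with a toy model. The sketch below assumes a simple fixed-window batcher (a batch opens at the first waiting request and closes after a wait window or when it is full). Real providers may use other schemes, such as continuous batching, and the arrival times, window, and batch limit here are made-up values.

```python
from typing import List


def batch_sizes(arrivals_ms: List[int], window_ms: int, max_batch: int) -> List[int]:
    """Group sorted arrival times (in ms) into batches and return each batch's size."""
    sizes = []
    i = 0
    while i < len(arrivals_ms):
        deadline = arrivals_ms[i] + window_ms  # batch closes at this time
        j = i
        while (
            j < len(arrivals_ms)
            and arrivals_ms[j] <= deadline
            and j - i < max_batch
        ):
            j += 1
        sizes.append(j - i)
        i = j
    return sizes


# Steady traffic: one request every 10 ms (synthetic).
steady = list(range(0, 1000, 10))
# Sparse traffic: a handful of widely spaced requests (synthetic).
sparse = [0, 400, 900, 1500, 2300]

steady_batches = batch_sizes(steady, window_ms=50, max_batch=8)
sparse_batches = batch_sizes(sparse, window_ms=50, max_batch=8)

print("steady mean batch size:", sum(steady_batches) / len(steady_batches))
print("sparse mean batch size:", sum(sparse_batches) / len(sparse_batches))
```

In this toy setup the steady stream fills batches with several requests each, while every sparse request ends up in a batch of one, so the GPU does the same per-batch work for far less output.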

The Compounding Effect

These decisions don't operate independently. A niche model might be served at a conservative precision level, on an older GPU tier, without speculative decoding, with no warm replicas, and with poor batching efficiency. Each factor adds latency individually, and all of them stack.

Why Popularity Drives These Decisions

Serverless inference providers operate on thin GPU margins. Capacity is expensive, and pre-allocating warm replicas for every model in a catalog is not economically feasible. The allocation decision is therefore straightforward: providers invest heavily in models that generate consistent, high-volume traffic and reduce investment in models that sit idle most of the time.

What Internal Testing Revealed

Internal testing showed that the same model can behave like a completely different product depending on the provider. The coefficient of variation (CV) of latency, which is the standard deviation divided by the mean, ranged from 21% on one provider to 710% on another for the same model.
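For readers who want to compute the same statistics on their own measurements, here is a minimal sketch. It assumes CV is the sample standard deviation divided by the mean, and it uses a nearest-rank p95. The article does not state which exact definitions the internal testing used, and the sample values below are synthetic.

```python
import math
import statistics
from typing import Dict, List


def summarize(samples: List[float]) -> Dict[str, float]:
    """Return median, nearest-rank p95, and CV (as a percentage) for latency samples."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank percentile, 1-based
    mean = statistics.mean(ordered)
    return {
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "cv_pct": 100 * statistics.stdev(ordered) / mean,
    }


# Synthetic TTFT samples in seconds, for illustration only.
tight = [0.38, 0.39, 0.40, 0.41, 0.39]
spiky = tight + [6.0]  # same distribution plus one slow outlier

print(summarize(tight))
print(summarize(spiky))
```

A single slow request barely moves the median but inflates the CV sharply, which is why CV is a useful signal for consistency when the median alone looks healthy.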

Finding 1: The Same Model Will Behave Differently on Different Providers

DeepSeek V4 Pro is a widely used model with strong coding and reasoning performance. In testing, the best-performing provider for DeepSeek V4 Pro showed a CV of 21%, which indicates tight, predictable latency: the median TTFT was 0.39 seconds and the p95 was 0.57 seconds. A second provider showed a CV of 541%, with a median of 0.55 seconds and a p95 of 6.3 seconds. A third provider showed a CV of 710%, with a median of 0.73 seconds and a p95 of 6.9 seconds.

Finding 2: There Is No Universal 'Best' Provider

Kimi K2.6 tells the opposite story. On one provider, the CV was 989%, with a median of 0.35 seconds and a p95 of 5.98 seconds. On a second provider, the CV was 1266%: the median was 0.43 seconds and the p95 only 1.70 seconds, but extreme outliers drove the standard deviation to 5.4 seconds. On the provider that supported it best, the CV dropped to 102%, with a median of 0.25 seconds and a p95 of 1.08 seconds. Measured by CV, that is roughly 10x more consistent than on the other two platforms.
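One practical consequence is that provider choice has to be made per model. The sketch below selects the lowest-CV provider for each model using the CV figures quoted above. The provider labels are placeholders that only reflect the order in which each result is listed in the text. They are not real provider names, and the same label under two models does not mean the same provider.

```python
from typing import Dict, Tuple

# CV (%) per model, copied from the findings above.
# Keys are placeholder labels (listing order in the text), not provider names.
CV_PERCENT: Dict[str, Dict[str, int]] = {
    "DeepSeek V4 Pro": {"listed_1st": 21, "listed_2nd": 541, "listed_3rd": 710},
    "Kimi K2.6": {"listed_1st": 989, "listed_2nd": 1266, "listed_3rd": 102},
}


def most_consistent(cv_by_provider: Dict[str, int]) -> Tuple[str, int]:
    """Return the (label, CV) pair with the lowest CV."""
    return min(cv_by_provider.items(), key=lambda item: item[1])


for model, cv_by_provider in CV_PERCENT.items():
    label, cv = most_consistent(cv_by_provider)
    print(f"{model}: lowest CV is {cv}% ({label})")
```

A routing table built this way needs to be refreshed from new measurements, since nothing in the findings suggests a provider's behavior for a given model stays fixed.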

Prerequisites

To measure the consistency of a model across providers, you will need to set up a benchmarking environment. This can be done using a cloud provider such as DigitalOcean.
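As a starting point for such an environment, here is a minimal TTFT timing harness. It assumes each provider's client can be wrapped in a zero-argument function that returns an iterator of streamed chunks. The `fake_stream` generator is a hypothetical stand-in for a real provider call, used here so the sketch runs without credentials.

```python
import time
from typing import Callable, Iterator, List


def measure_ttft(start_stream: Callable[[], Iterator[str]]) -> float:
    """Return seconds from issuing the request until the first chunk arrives."""
    start = time.perf_counter()
    stream = start_stream()
    next(stream, None)  # blocks until the first chunk (or the stream ends)
    return time.perf_counter() - start


def collect_samples(start_stream: Callable[[], Iterator[str]], runs: int) -> List[float]:
    """Repeat the measurement to build a latency sample for one model/provider pair."""
    return [measure_ttft(start_stream) for _ in range(runs)]


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a provider's streaming response."""
    time.sleep(0.02)  # simulated delay before the first token
    yield "Hello"
    yield " world"


samples = collect_samples(fake_stream, runs=5)
print([round(s, 3) for s in samples])
```

To benchmark a real provider, replace `fake_stream` with a function that sends a streaming request through that provider's client and yields chunks as they arrive, then run enough repetitions at different times of day to capture cold starts and load spikes.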

Troubleshooting

If you encounter issues with model consistency, check the provider's documentation for any details on its infrastructure decisions and optimization strategies. Because these decisions are rarely documented, you may also need to contact the provider's support team for more information.