The State of LLM Serving in 2026: Ollama, SGLang, TensorRT, Triton, and vLLM
Spent the last few weeks reading through the source code of five major inference serving frameworks. Here’s what I found.
| Framework | Focus | Language | Target User |
|---|---|---|---|
| Ollama | Local LLM execution | Go + C++ | Developers, enthusiasts |
| SGLang | High-performance serving | Python/Rust/CUDA | Production deployments |
| TensorRT | NVIDIA optimization | C++/CUDA | Enterprise, NVIDIA users |
| Triton | GPU kernel compiler | Python/MLIR/C++ | Kernel developers |
| vLLM | Fast LLM serving | Python/CUDA | Production deployments |
¶Ollama
Ollama is not trying to compete with SGLang or vLLM on throughput. It’s solving a different problem: making local model execution trivial.
The architecture is straightforward. Go server handles HTTP and scheduling. llama.cpp does inference. Clean separation.
The scheduler tells you everything about the design philosophy:
```go
type Scheduler struct {
    pendingReqCh  chan *LlmRequest
    finishedReqCh chan *LlmRequest
    expiredCh     chan *runnerRef
    loaded        map[string]*runnerRef
}

var defaultModelsPerGPU = 3
```
`defaultModelsPerGPU = 3`. SGLang and vLLM think in thousands of concurrent requests. Ollama thinks in “a few models fitting on consumer hardware.”
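The loaded-model map behaves more like a small cache of runners than a batch queue: load on demand, evict when the per-GPU cap is hit. A toy sketch of one such policy (LRU here, purely illustrative, not Ollama’s actual code):

```python
from collections import OrderedDict

MODELS_PER_GPU = 3  # mirrors Ollama's defaultModelsPerGPU

class ModelCache:
    """Toy cache of loaded model runners, capped per GPU, evicting the least recently used."""

    def __init__(self, capacity: int = MODELS_PER_GPU):
        self.capacity = capacity
        self.loaded: "OrderedDict[str, object]" = OrderedDict()

    def get_runner(self, model: str):
        if model in self.loaded:
            self.loaded.move_to_end(model)                    # mark as most recently used
            return self.loaded[model]
        if len(self.loaded) >= self.capacity:
            evicted, _ = self.loaded.popitem(last=False)      # evict least recently used
            print(f"unloading {evicted}")
        runner = f"runner for {model}"                        # stand-in for loading the weights
        self.loaded[model] = runner
        return runner
```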
The API is minimal:
```go
type GenerateRequest struct {
    Model  string      `json:"model"`
    Prompt string      `json:"prompt"`
    Stream *bool       `json:"stream,omitempty"`
    Images []ImageData `json:"images,omitempty"`
    Think  *ThinkValue `json:"think,omitempty"` // Reasoning support
}
```
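Calling it is equally minimal. A sketch against the default local endpoint (assumes the server is running on port 11434 and the model tag has already been pulled):

```python
import requests

# One-shot completion from a local Ollama server (default port 11434).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",          # any model tag you have pulled locally
        "prompt": "Why is the sky blue?",
        "stream": False,              # single JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```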
Cross-platform GPU detection handles NVIDIA, Apple Silicon, and Jetson. The code discovers available compute and adjusts.
Development branches show where it’s headed: MLX backends for Apple Silicon, constrained sampling for JSON schemas, agent loops for tool calling.
¶SGLang
SGLang powers 400,000+ GPUs in production. A large part of that performance comes from RadixAttention.
The key insight: many LLM requests share common prefixes. System prompts, few-shot examples, conversation history. RadixAttention caches these in a trie:
```python
class TreeNode:
    def __init__(self, id: Optional[int] = None, priority: int = 0):
        self.children = defaultdict(TreeNode)
        self.value: Optional[torch.Tensor] = None       # GPU KV cache
        self.host_value: Optional[torch.Tensor] = None  # CPU backup
        self.lock_ref = 0
        self.hit_count = 0
```
Dual storage: GPU tensors for active cache, CPU tensors for overflow. Hierarchical caching that exceeds GPU memory limits.
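To see why the trie pays off, here is a toy version of the prefix-matching idea: a token-level trie (SGLang’s real structure is a compressed radix tree with eviction and locking) where matching a request’s prefix tells you how much KV can be reused.

```python
from typing import Optional

class ToyNode:
    def __init__(self):
        self.children: dict[int, "ToyNode"] = {}
        self.kv: Optional[list] = None  # stand-in for this token's cached KV entry

class ToyPrefixCache:
    """Token-level trie: match_prefix reports how many leading tokens of a request
    already have cached KV, so only the remaining suffix needs a prefill pass."""

    def __init__(self):
        self.root = ToyNode()

    def match_prefix(self, tokens: list[int]) -> int:
        node, matched = self.root, 0
        for t in tokens:
            child = node.children.get(t)
            if child is None or child.kv is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens: list[int], kv_per_token: list):
        node = self.root
        for t, kv in zip(tokens, kv_per_token):
            node = node.children.setdefault(t, ToyNode())
            node.kv = kv
```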
The scheduler uses a mixin pattern with 35+ modules:
```python
from sglang.srt.managers.scheduler_dp_attn_mixin import SchedulerDPAttnMixin
from sglang.srt.managers.scheduler_metrics_mixin import SchedulerMetricsMixin
from sglang.srt.managers.scheduler_output_processor_mixin import SchedulerOutputProcessorMixin
from sglang.srt.managers.scheduler_pp_mixin import SchedulerPPMixin
from sglang.srt.disaggregation.prefill import SchedulerDisaggregationPrefillMixin
```
Need disaggregated prefill/decode? Add the mixin. Metrics? Add that mixin. Composability without code duplication.
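The composition itself is plain Python multiple inheritance. A stripped-down sketch of the pattern (class and method names here are illustrative, not SGLang’s):

```python
class MetricsMixin:
    """Optional behaviour: count scheduled batches."""
    def record_batch(self, batch):
        self.num_batches = getattr(self, "num_batches", 0) + 1

class PrefillDisaggregationMixin:
    """Optional behaviour: hand prompts to a remote prefill worker."""
    def dispatch_prefill(self, req):
        return f"sending {req!r} to prefill worker"

class ToyScheduler(MetricsMixin, PrefillDisaggregationMixin):
    """The core loop stays small; features are mixed in by inheritance."""
    def step(self, batch):
        self.record_batch(batch)                          # from MetricsMixin
        return [self.dispatch_prefill(r) for r in batch]  # from PrefillDisaggregationMixin
```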
33 attention backends. FlashAttention, FlashInfer, AMD AITER, TensorRT-LLM. This isn’t bloat. Optimal performance requires hardware-specific implementations.
EAGLE speculative decoding with tree-based draft verification:
```python
def build_tree_kernel_efficient(
    verified_id: torch.Tensor,
    parent_list: List[torch.Tensor],
    top_scores_index: torch.Tensor,
    draft_tokens: torch.Tensor,
    ...
):
    draft_tokens = torch.cat((verified_id.unsqueeze(1), draft_tokens), dim=1).flatten()
```
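The verification idea is easy to state even without the tree: the target model keeps the longest drafted prefix it agrees with, and always emits its own token at the first disagreement. A greedy, chain-shaped sketch (EAGLE verifies an entire draft tree in one batched forward pass; this toy checks one position at a time):

```python
from typing import Callable, List

def verify_draft(
    prefix: List[int],
    draft: List[int],
    target_next_token: Callable[[List[int]], int],
) -> List[int]:
    """Accept drafted tokens while they match the target model's greedy choice.

    `target_next_token` stands in for one greedy decoding step of the target model;
    a real system verifies every draft position in a single batched forward pass.
    """
    accepted: List[int] = []
    context = list(prefix)
    for d in draft:
        t = target_next_token(context)
        accepted.append(t)          # the target's token is always kept
        if t != d:                  # first disagreement ends acceptance
            break
        context.append(t)
    return accepted
```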
Claims of 5x speedups from RadixAttention are common for production workloads with heavy prefix sharing.
¶TensorRT
TensorRT is not a serving framework. It’s a collection of optimized inference primitives that other systems build on.
45+ CUDA kernel plugins. Attention, normalization, detection:
```cpp
QKVToContextPluginDynamic::QKVToContextPluginDynamic(
    const std::string name,
    const DataType type,
    const int32_t hiddenSize,
    const int32_t numHeads,
    float const dqProbs,
    bool hasImask)
    : mHeadSize(hiddenSize / numHeads)
    , mHiddenSize(hiddenSize)
    , mNumHeads(numHeads)
{
    mSM = getSmVersion(); // Auto-detect GPU SM version
}
```
`getSmVersion()` reveals the approach: every kernel is specialized per GPU architecture. Same multi-head attention, different implementations:
```
fused_multihead_attention_fp16_128_64_kernel.sm75.cpp   # Turing
fused_multihead_attention_fp16_128_64_kernel.sm80.cpp   # Ampere
fused_multihead_attention_fp16_128_64_kernel.sm90.cpp   # Hopper
fused_multihead_attention_fp16_128_64_kernel.sm100.cpp  # Blackwell
```
Expensive to maintain. Delivers performance generic implementations cannot match.
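The dispatch pattern behind those files is simple: query the device’s compute capability and select the most specific kernel at or below it. A sketch using PyTorch’s device query (the kernel table is illustrative):

```python
import torch

# Illustrative table: compute capability -> kernel name (mirrors the file list above).
KERNELS = {
    (7, 5): "fmha_fp16_sm75",    # Turing
    (8, 0): "fmha_fp16_sm80",    # Ampere
    (9, 0): "fmha_fp16_sm90",    # Hopper
    (10, 0): "fmha_fp16_sm100",  # Blackwell
}

def select_kernel() -> str:
    cc = torch.cuda.get_device_capability()          # e.g. (9, 0) on Hopper
    candidates = [sm for sm in KERNELS if sm <= cc]  # newest kernel at or below this arch
    if not candidates:
        raise RuntimeError(f"no kernel built for SM {cc}")
    return KERNELS[max(candidates)]
```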
IPluginV3 separates concerns cleanly:
```cpp
IPluginCapability* QKVToContextPluginDynamic::getCapabilityInterface(
    PluginCapabilityType type) noexcept
{
    if (type == PluginCapabilityType::kBUILD)
        return static_cast<IPluginV3OneBuild*>(this);
    if (type == PluginCapabilityType::kRUNTIME)
        return static_cast<IPluginV3OneRuntime*>(this);
    return static_cast<IPluginV3OneCore*>(this);
}
```
Build-time optimization, runtime execution, core functionality. Aggressive optimization during build, minimal runtime footprint.
Not for prototyping. For production-grade performance on NVIDIA hardware.
¶Triton
Triton takes a different approach. Not pre-built kernels. A language for writing GPU kernels with Python-like syntax.
CUDA is hard to write well, but the underlying concepts (tiled computation, memory coalescing, shared memory) are not inherently complex. Triton provides abstractions that expose them without the low-level boilerplate.
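The canonical example is a vector add, in the style of the Triton tutorials: each program instance handles one tile of elements, and a mask guards the ragged final tile.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # which tile this program handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard the final partial tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)                    # one program per 1024-element tile
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```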
JIT compilation with automatic dependency tracking:
```python
class DependenciesFinder(ast.NodeVisitor):
    """
    AST visitor that finds dependencies to invalidate
    a JITFunction's hash when source code changes.
    """

    def __init__(self, name, globals, nonlocals, src) -> None:
        self.hasher = hashlib.sha256(src.encode("utf-8"))
        self.used_global_vals: Dict[Tuple[str, int], Tuple[Any, Dict[str, Any]]] = {}
```
The compiler tracks the kernel source and every referenced global variable. Change a constant and the kernel recompiles. Iterative development becomes practical.
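The caching idea generalizes beyond Triton: hash the kernel source together with the values it closes over, and recompile only when the hash changes. A simplified sketch (this is not Triton’s actual cache key):

```python
import hashlib
import inspect

_compiled: dict = {}

def cache_key(fn, constants: dict) -> str:
    """Hash the function's source plus the constant values it depends on."""
    h = hashlib.sha256(inspect.getsource(fn).encode("utf-8"))
    for name in sorted(constants):
        h.update(f"{name}={constants[name]!r}".encode("utf-8"))
    return h.hexdigest()

def compile_if_needed(fn, constants: dict):
    key = cache_key(fn, constants)
    if key not in _compiled:
        _compiled[key] = f"compiled {fn.__name__} with {constants}"  # stand-in for codegen
    return _compiled[key]
```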
23 GPU-specific transformation passes:
| Pass | Purpose |
|---|---|
| AccelerateMatmul | MMA instruction selection |
| Pipeliner | Software pipelining |
| RemoveLayoutConversions | Eliminate redundant conversions |
| ReorderInstructions | Instruction scheduling |
| WarpSpecialization | Warp-level parallelism |
| Coalesce | Memory coalescing |
Software pipelining hides memory latency:
```cpp
struct LoopPipelinerInternal {
    struct LiverangeInfo {
        unsigned lastUseStage = 0;
        unsigned defStage = 0;
    };
    ForOp forOp;
    unsigned maxStage = 0;
    DenseMap<Operation *, unsigned> stages;
};
```
Automatic overlap of memory operations with computation; getting this right by hand requires significant expertise.
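Conceptually, the pass rewrites a load-then-compute loop so that the load for iteration i+1 is issued while iteration i computes. A Python sketch of the restructured schedule (the real pass works on MLIR and overlaps asynchronous copies with tensor-core work; in plain Python the calls still run sequentially):

```python
def pipelined_loop(tiles, load, compute):
    """Restructure load-then-compute so load(i+1) is issued before compute(i) finishes.
    In the real pass the load is an asynchronous copy; this only shows the schedule."""
    if not tiles:
        return []
    results = []
    current = load(tiles[0])                                      # prologue: fill the pipeline
    for i in range(len(tiles)):
        nxt = load(tiles[i + 1]) if i + 1 < len(tiles) else None  # prefetch the next tile
        results.append(compute(current))                          # compute on the tile loaded last iteration
        current = nxt
    return results
```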
Triton is what PyTorch 2.0’s torch.compile emits by default for GPU kernels (via the Inductor backend). That says enough about maturity.
¶vLLM
vLLM introduced PagedAttention. Treating KV cache memory like OS pages allows efficient utilization when sequence lengths vary.
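The core bookkeeping is a block table: each sequence maps its logical token positions onto fixed-size physical blocks drawn from a shared pool, exactly like virtual-to-physical page mapping. A minimal sketch (no attention math, just the allocation):

```python
BLOCK_SIZE = 16  # tokens per KV block

class ToyBlockManager:
    """Map each sequence's token positions onto fixed-size blocks from a shared pool,
    the way an OS maps virtual pages onto physical frames."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> physical block ids
        self.seq_lens: dict[int, int] = {}

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a KV slot for the next token; returns (physical_block, offset)."""
        pos = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:                    # current block full (or first token)
            table.append(self.free_blocks.pop())     # grab a new block on demand
        self.seq_lens[seq_id] = pos + 1
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def free(self, seq_id: int):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```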
Currently transitioning to V1 architecture:
```python
class Scheduler(SchedulerInterface):
    def __init__(
        self,
        vllm_config: VllmConfig,
        kv_cache_config: KVCacheConfig,
        structured_output_manager: StructuredOutputManager,
        ...
    ) -> None:
        self.connector = None
        if self.vllm_config.kv_transfer_config is not None:
            self.connector = KVConnectorFactory.create_connector(
                config=self.vllm_config,
                role=KVConnectorRole.SCHEDULER,
                kv_cache_config=self.kv_cache_config,
            )
```
KVConnector enables prefill/decode disaggregation. Run compute-intensive prefill on different hardware than memory-bound decode. Increasingly important as models grow.
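Schematically, disaggregation means a prefill worker produces the KV cache and hands it to a decode worker that generates tokens against it; the connector is the glue that moves the cache between them. A toy sketch of that handoff (ignoring transport, batching, and real KV layouts):

```python
def prefill_worker(prompt_tokens: list[int]) -> dict:
    # Compute-bound: one large forward pass over the whole prompt.
    return {"tokens": list(prompt_tokens)}          # stand-in for the real KV tensors

def decode_worker(kv_cache: dict, max_new_tokens: int) -> list[int]:
    # Memory-bandwidth-bound: one token per step, reusing the transferred cache.
    out = []
    for _ in range(max_new_tokens):
        next_token = 0                               # stand-in for a real decode step
        kv_cache["tokens"].append(next_token)
        out.append(next_token)
    return out

# The connector's job is to move kv_cache from the prefill host to the decode host.
kv = prefill_worker([1, 2, 3])
print(decode_worker(kv, max_new_tokens=4))
```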
218 model architectures. More than any other framework.
V1 attention backends redesigned. FlashAttention primary, multiple MLA implementations:
```python
class FlashAttentionBackend(AttentionBackend):
    accept_output_buffer: bool = True
    supported_dtypes: ClassVar[list[torch.dtype]] = [torch.float16, torch.bfloat16]

    @staticmethod
    def get_kv_cache_shape(num_blocks, block_size, num_kv_heads, head_size, ...):
        return (2, num_blocks, block_size, num_kv_heads, head_size)
```
Optimization levels:

- `-O0`: No optimizations (fastest startup)
- `-O1`: Quick optimizations (`CUDAGraphMode.PIECEWISE`)
- `-O2`: Full optimizations (default, `CUDAGraphMode.FULL_AND_PIECEWISE`)
V1 integrates torch.compile. Moving from hand-written CUDA to compiled optimizations that adapt to new hardware.
¶Where This Is All Heading
These frameworks started in different places. They’re converging.
¶Attention: MHA to MLA
Multi-head Latent Attention (MLA) compresses key-value representations, cutting KV cache size and memory bandwidth requirements. DeepSeek popularized it. SGLang has multiple MLA backends. vLLM has four implementations.
Standard MHA → FlashAttention → MLA → Sparse Attention
Each step trades generality for efficiency.
¶KV Cache: Paged to Hierarchical
PagedAttention was a breakthrough. RadixAttention (SGLang) exploits prefix sharing. HiCache adds hierarchical storage with CPU offloading.
PagedAttention → RadixAttention → HiCache
¶Compilation: Static to Dynamic
All frameworks are moving toward torch.compile. vLLM V1 uses it by default. SGLang has active development branches exploring it. Even TensorRT is adapting.
¶Distribution: TP/PP to Disaggregation
Tensor and pipeline parallelism are giving way to prefill/decode disaggregation: run different phases on different hardware. SGLang and vLLM are both implementing this.
¶Quantization: Lower Precision
| Format | Support |
|---|---|
| FP16 | Universal |
| INT8 | Common |
| FP8 | Growing (TensorRT, SGLang, vLLM) |
| INT4 | Common for local (GGML) |
| FP4/MXFP | Emerging (Triton, TensorRT) |
MXFP formats are next. Better quality-per-bit than integer quantization.
¶Which One
| Use Case | Framework | Reason |
|---|---|---|
| Local development | Ollama | Simplest setup |
| Maximum throughput | SGLang | RadixAttention, proven at scale |
| NVIDIA-only performance | TensorRT | SM-specific optimization |
| Custom kernels | Triton | DSL flexibility |
| Production serving | vLLM | Mature ecosystem, wide model support |
Production usually comes down to SGLang vs vLLM. SGLang has better raw throughput on prefix-heavy workloads via RadixAttention; vLLM has broader model support and a more mature ecosystem. Both are actively developed.
Building custom inference primitives? Triton. It’s becoming the common compilation target.
Getting started? Ollama.