Spent the last few weeks reading through the source code of five major inference serving frameworks. Here’s what I found.

Ollama · SGLang · TensorRT · Triton · vLLM · Trends · Recommendations

| Framework | Focus | Language | Target User |
|---|---|---|---|
| Ollama | Local LLM execution | Go + C++ | Developers, enthusiasts |
| SGLang | High-performance serving | Python/Rust/CUDA | Production deployments |
| TensorRT | NVIDIA optimization | C++/CUDA | Enterprise, NVIDIA users |
| Triton | GPU kernel compiler | Python/MLIR/C++ | Kernel developers |
| vLLM | Fast LLM serving | Python/CUDA | Production deployments |

Ollama

Ollama is not trying to compete with SGLang or vLLM on throughput. It’s solving a different problem: making local model execution trivial.

The architecture is straightforward. A Go server handles HTTP and scheduling; llama.cpp does the inference. Clean separation.

The scheduler tells you everything about the design philosophy:

type Scheduler struct {
    pendingReqCh  chan *LlmRequest      // requests waiting for a model runner
    finishedReqCh chan *LlmRequest      // completed requests to clean up after
    expiredCh     chan *runnerRef       // runners whose keep-alive has expired
    loaded        map[string]*runnerRef // models currently loaded in memory
}

var defaultModelsPerGPU = 3

defaultModelsPerGPU = 3. SGLang and vLLM think in thousands of concurrent requests. Ollama thinks in “a few models fitting on consumer hardware.”

The API is minimal:

type GenerateRequest struct {
    Model  string      `json:"model"`
    Prompt string      `json:"prompt"`
    Stream *bool       `json:"stream,omitempty"`
    Images []ImageData `json:"images,omitempty"`
    Think  *ThinkValue `json:"think,omitempty"` // Reasoning support
}
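
Calling it is equally minimal. A hedged example in Python against a default local install (port 11434, any model you have already pulled; the model name below is just a placeholder):

# Minimal client for Ollama's /api/generate endpoint.
# Assumes a local server on the default port 11434 and an already-pulled model.
import json
import urllib.request

payload = {
    "model": "llama3.2",           # placeholder: any model you have pulled
    "prompt": "Why is the sky blue?",
    "stream": False,               # one JSON response instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])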

Cross-platform GPU detection handles NVIDIA, Apple Silicon, and Jetson. The code discovers available compute and adjusts.

Development branches show where it’s headed: MLX backends for Apple Silicon, constrained sampling for JSON schemas, agent loops for tool calling.

SGLang

SGLang powers 400,000+ GPUs in production. Much of the performance comes from RadixAttention.

The key insight: many LLM requests share common prefixes. System prompts, few-shot examples, conversation history. RadixAttention caches these in a trie:

from collections import defaultdict
from typing import Optional
import torch

class TreeNode:
    def __init__(self, id: Optional[int] = None, priority: int = 0):
        self.id = id
        self.priority = priority
        self.children = defaultdict(TreeNode)
        self.value: Optional[torch.Tensor] = None  # GPU KV cache
        self.host_value: Optional[torch.Tensor] = None  # CPU backup
        self.lock_ref = 0   # pinned while in use, protects against eviction
        self.hit_count = 0

Dual storage: GPU tensors for active cache, CPU tensors for overflow. Hierarchical caching that exceeds GPU memory limits.
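
The mechanism is easier to see in a toy version. This is a sketch of the idea, not SGLang code, and it uses a plain per-token trie rather than a compressed radix tree: insert token sequences, then walk the longest shared prefix of a new request to find reusable KV cache.

# Toy prefix trie over token IDs -- a sketch of the RadixAttention idea.
class Node:
    def __init__(self):
        self.children = {}      # token id -> child node
        self.kv_block = None    # stand-in for a cached KV tensor

def insert(root: "Node", tokens: list[int]) -> None:
    node = root
    for t in tokens:
        node = node.children.setdefault(t, Node())
        node.kv_block = f"kv@{t}"   # pretend KV for this prefix is cached

def match_prefix(root: "Node", tokens: list[int]) -> int:
    """Return how many leading tokens already have cached KV."""
    node, matched = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node, matched = node.children[t], matched + 1
    return matched

root = Node()
insert(root, [1, 2, 3, 4])                 # e.g. a shared system prompt
print(match_prefix(root, [1, 2, 3, 9]))    # -> 3 tokens of KV can be reused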

The scheduler uses a mixin pattern with 35+ modules:

from sglang.srt.managers.scheduler_dp_attn_mixin import SchedulerDPAttnMixin
from sglang.srt.managers.scheduler_metrics_mixin import SchedulerMetricsMixin
from sglang.srt.managers.scheduler_output_processor_mixin import SchedulerOutputProcessorMixin
from sglang.srt.managers.scheduler_pp_mixin import SchedulerPPMixin
from sglang.srt.disaggregation.prefill import SchedulerDisaggregationPrefillMixin

Need disaggregated prefill/decode? Add the mixin. Metrics? Add that mixin. Composability without code duplication.
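
The pattern itself is plain Python multiple inheritance. A generic illustration with hypothetical names, not SGLang's actual classes:

# Generic mixin composition -- hypothetical names, shown only to illustrate
# the pattern, not SGLang's real scheduler.
class MetricsMixin:
    def record_step(self, batch_size: int) -> None:
        self.steps = getattr(self, "steps", 0) + 1
        print(f"step {self.steps}: batch={batch_size}")

class PrefillMixin:
    def run_prefill(self, prompt_tokens: list[int]) -> None:
        print(f"prefill over {len(prompt_tokens)} tokens")

class ToyScheduler(MetricsMixin, PrefillMixin):
    """Capabilities come from whichever mixins appear in the base list."""
    def step(self, prompt_tokens: list[int]) -> None:
        self.run_prefill(prompt_tokens)
        self.record_step(batch_size=1)

ToyScheduler().step([101, 2023, 2003])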

33 attention backends. FlashAttention, FlashInfer, AMD AITER, TensorRT-LLM. This isn’t bloat. Optimal performance requires hardware-specific implementations.

EAGLE speculative decoding with tree-based draft verification:

def build_tree_kernel_efficient(
    verified_id: torch.Tensor,
    parent_list: List[torch.Tensor],
    top_scores_index: torch.Tensor,
    draft_tokens: torch.Tensor,
    ...
):
    draft_tokens = torch.cat((verified_id.unsqueeze(1), draft_tokens), dim=1).flatten()

5x speedup claims with RadixAttention are common in production.
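
The core loop behind any speculative decoder is draft-then-verify. A simplified sketch with a single draft chain and greedy acceptance (EAGLE verifies a whole tree of drafts in one target pass); the two model functions are arithmetic stand-ins:

# Draft-then-verify speculative decoding, simplified to one draft chain.
# `draft_next` and `target_next` are stand-ins for real model calls.
def draft_next(tokens: list[int]) -> int:
    return (tokens[-1] * 7 + 3) % 100

def target_next(tokens: list[int]) -> int:
    return (tokens[-1] * 7 + 3) % 100 if tokens[-1] % 5 else 42

def speculative_step(tokens: list[int], k: int = 4) -> list[int]:
    draft = list(tokens)
    for _ in range(k):                      # 1. draft k tokens cheaply
        draft.append(draft_next(draft))
    accepted = []
    for tok in draft[len(tokens):]:         # 2. verify; keep the accepted prefix
        expected = target_next(tokens + accepted)
        if tok != expected:
            accepted.append(expected)       # target's token replaces the rejection
            break
        accepted.append(tok)
    return tokens + accepted                # real systems verify all k positions
                                            # in a single target forward pass

print(speculative_step([1]))                # -> [1, 10, 42]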

TensorRT

TensorRT is not a serving framework. It’s a collection of optimized inference primitives that other systems build on.

45+ CUDA kernel plugins. Attention, normalization, detection:

QKVToContextPluginDynamic::QKVToContextPluginDynamic(
    const std::string name,
    const DataType type,
    const int32_t hiddenSize,
    const int32_t numHeads,
    float const dqProbs,
    bool hasImask)
    : mHeadSize(hiddenSize / numHeads)
    , mHiddenSize(hiddenSize)
    , mNumHeads(numHeads)
{
    mSM = getSmVersion();  // Auto-detect GPU SM version
}

getSmVersion() reveals the approach: every kernel is specialized per GPU architecture. Same multi-head attention, different implementations:

fused_multihead_attention_fp16_128_64_kernel.sm75.cpp   # Turing
fused_multihead_attention_fp16_128_64_kernel.sm80.cpp   # Ampere
fused_multihead_attention_fp16_128_64_kernel.sm90.cpp   # Hopper
fused_multihead_attention_fp16_128_64_kernel.sm100.cpp  # Blackwell

Expensive to maintain. Delivers performance generic implementations cannot match.
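
The dispatch idea itself is simple even if the kernels are not. A sketch in Python using PyTorch's compute-capability query (the table and kernel names are hypothetical; TensorRT does this in C++ via getSmVersion()):

# Selecting a kernel variant by GPU architecture -- a sketch, not TensorRT code.
import torch

KERNELS = {                           # hypothetical table: SM version -> kernel
    75: "fmha_fp16_128_64_sm75",      # Turing
    80: "fmha_fp16_128_64_sm80",      # Ampere
    90: "fmha_fp16_128_64_sm90",      # Hopper
}

def pick_kernel() -> str:
    major, minor = torch.cuda.get_device_capability()
    sm = major * 10 + minor
    eligible = [v for v in KERNELS if v <= sm]   # newest variant we can run
    if not eligible:
        raise RuntimeError(f"no kernel built for sm{sm}")
    return KERNELS[max(eligible)]

if torch.cuda.is_available():
    print(pick_kernel())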

IPluginV3 separates concerns cleanly:

IPluginCapability* QKVToContextPluginDynamic::getCapabilityInterface(
    PluginCapabilityType type) noexcept
{
    if (type == PluginCapabilityType::kBUILD)
        return static_cast<IPluginV3OneBuild*>(this);
    if (type == PluginCapabilityType::kRUNTIME)
        return static_cast<IPluginV3OneRuntime*>(this);
    return static_cast<IPluginV3OneCore*>(this);
}

Build-time optimization, runtime execution, core functionality. Aggressive optimization during build, minimal runtime footprint.

Not for prototyping. For production-grade performance on NVIDIA hardware.

Triton

Triton (OpenAI's GPU kernel compiler, not NVIDIA's Triton Inference Server) takes a different approach. Not pre-built kernels. A language for writing GPU kernels with Python-like syntax.

CUDA is hard, but the underlying concepts (tiled computation, memory coalescing, shared memory) are not inherently complex. Triton provides abstractions for them.
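
A minimal kernel shows the level of abstraction: blocks of indices, masked loads and stores, no explicit thread management. This is essentially the vector-add example from Triton's own tutorials:

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # which block am I?
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)                     # one program per block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out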

JIT compilation with automatic dependency tracking:

import ast
import hashlib
from typing import Any, Dict, Tuple

class DependenciesFinder(ast.NodeVisitor):
    """
    AST visitor that finds dependencies to invalidate
    a JITFunction's hash when source code changes.
    """

    def __init__(self, name, globals, nonlocals, src) -> None:
        self.hasher = hashlib.sha256(src.encode("utf-8"))  # hash starts from kernel source
        self.used_global_vals: Dict[Tuple[str, int], Tuple[Any, Dict[str, Any]]] = {}  # globals the kernel reads

The compiler tracks the kernel source and every referenced global variable. Change a constant, and the kernel recompiles. Iterative development becomes practical.

23 GPU-specific transformation passes:

| Pass | Purpose |
|---|---|
| AccelerateMatmul | MMA instruction selection |
| Pipeliner | Software pipelining |
| RemoveLayoutConversions | Eliminate redundant conversions |
| ReorderInstructions | Instruction scheduling |
| WarpSpecialization | Warp-level parallelism |
| Coalesce | Memory coalescing |

Software pipelining hides memory latency:

struct LoopPipelinerInternal {
    struct LiverangeInfo {
        unsigned lastUseStage = 0;
        unsigned defStage = 0;
    };

    ForOp forOp;
    unsigned maxStage = 0;
    DenseMap<Operation *, unsigned> stages;
};

Memory operations overlap with computation automatically. Doing this by hand requires significant expertise.

Triton is the default backend for PyTorch 2.0’s torch.compile. That says enough about maturity.
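
If you've called torch.compile on a GPU, you've almost certainly run Triton-generated kernels without noticing; the default Inductor backend lowers elementwise chains like this into fused Triton code:

import torch

def gelu_scale(x: torch.Tensor) -> torch.Tensor:
    # An elementwise chain Inductor can fuse into one generated kernel.
    return torch.nn.functional.gelu(x) * 0.5

compiled = torch.compile(gelu_scale)   # default backend: TorchInductor

device = "cuda" if torch.cuda.is_available() else "cpu"
y = compiled(torch.randn(4096, device=device))   # on GPU, this runs Triton kernels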

vLLM

vLLM introduced PagedAttention. Treating KV cache memory like OS pages allows efficient utilization when sequence lengths vary.
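
The analogy is literal: a block table maps each sequence's logical token positions to fixed-size physical blocks, so memory is allocated on demand instead of being reserved for the maximum sequence length. A toy sketch of the bookkeeping, not vLLM's implementation:

# Toy paged KV cache: fixed-size blocks plus a per-sequence block table.
BLOCK_SIZE = 16

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))    # physical block pool
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> block ids
        self.seq_lens: dict[int, int] = {}            # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> None:
        """Reserve room for one more token, allocating a block only when needed."""
        n = self.seq_lens.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:                       # last block full (or none yet)
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(40):                                   # a 40-token sequence
    cache.append_token(seq_id=0)
print(cache.block_tables[0])                          # 3 blocks, not a max-length slab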

The codebase is currently transitioning to the V1 architecture:

class Scheduler(SchedulerInterface):
    def __init__(
        self,
        vllm_config: VllmConfig,
        kv_cache_config: KVCacheConfig,
        structured_output_manager: StructuredOutputManager,
        ...
    ) -> None:
        self.connector = None
        if self.vllm_config.kv_transfer_config is not None:
            self.connector = KVConnectorFactory.create_connector(
                config=self.vllm_config,
                role=KVConnectorRole.SCHEDULER,
                kv_cache_config=self.kv_cache_config,
            )

KVConnector enables prefill/decode disaggregation. Run compute-intensive prefill on different hardware than memory-bound decode. Increasingly important as models grow.
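
In the abstract, disaggregation splits a request's lifecycle into two services with different bottlenecks, and the KV cache is what travels between them. A toy sketch of the flow (not vLLM's KVConnector API; the model steps are stand-ins):

# Prefill/decode disaggregation in the abstract -- a toy flow, not vLLM code.
def prefill(prompt_tokens: list[int]) -> dict:
    # Compute-bound: one large pass over the prompt builds the KV cache.
    return {"kv_cache": [t * 2 for t in prompt_tokens], "last": prompt_tokens[-1]}

def decode(state: dict, max_new_tokens: int) -> list[int]:
    # Memory-bandwidth-bound: one token per step, rereading the KV cache.
    out, token = [], state["last"]
    for _ in range(max_new_tokens):
        token = (token + len(state["kv_cache"])) % 100   # stand-in model step
        state["kv_cache"].append(token * 2)
        out.append(token)
    return out

state = prefill([5, 9, 12])   # could run on a compute-heavy prefill node...
print(decode(state, 4))       # ...while decode runs elsewhere, given the KV cache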

218 model architectures. More than any other framework.

The V1 attention backends were redesigned. FlashAttention is the primary backend, with multiple MLA implementations alongside it:

class FlashAttentionBackend(AttentionBackend):
    accept_output_buffer: bool = True
    supported_dtypes: ClassVar[list[torch.dtype]] = [torch.float16, torch.bfloat16]

    @staticmethod
    def get_kv_cache_shape(num_blocks, block_size, num_kv_heads, head_size, ...):
        return (2, num_blocks, block_size, num_kv_heads, head_size)

Optimization levels:

-O0: No optimizations (fastest startup)
-O1: Quick optimizations (CUDAGraphMode.PIECEWISE)
-O2: Full optimizations (default, CUDAGraphMode.FULL_AND_PIECEWISE)

V1 integrates torch.compile. The shift is from hand-written CUDA to compiled optimizations that adapt to new hardware.

Trends

These frameworks started in different places. They're converging.

Attention: MHA to MLA

Multi-head Latent Attention (MLA) compresses key-value representations, reducing memory bandwidth. DeepSeek popularized it. SGLang has multiple backends; vLLM has four implementations.
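
The compression is low-rank: instead of caching full per-head keys and values, each token stores a small latent that is expanded back at attention time. A shape-only sketch with made-up dimensions (the DeepSeek-V2 formulation also handles rotary embeddings separately):

import torch

# Shape-only sketch of MLA's KV compression. Dimensions are hypothetical.
d_model, n_heads, d_head, d_latent = 4096, 32, 128, 512

W_down = torch.randn(d_model, d_latent)            # compress hidden state to a latent
W_up_k = torch.randn(d_latent, n_heads * d_head)   # expand latent -> keys
W_up_v = torch.randn(d_latent, n_heads * d_head)   # expand latent -> values

h = torch.randn(1, d_model)                        # one token's hidden state
latent = h @ W_down                                # this is what gets cached

k = (latent @ W_up_k).view(n_heads, d_head)        # reconstructed at attention time
v = (latent @ W_up_v).view(n_heads, d_head)
print(latent.numel(), k.numel() + v.numel())       # 512 cached vs 8192 for full K/V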

Standard MHA → FlashAttention → MLA → Sparse Attention

Each step trades generality for efficiency.

KV Cache: Paged to Hierarchical

PagedAttention was a breakthrough. RadixAttention (SGLang) exploits prefix sharing. HiCache adds hierarchical storage with CPU offloading.

PagedAttention → RadixAttention → HiCache

Compilation: Static to Dynamic

All frameworks are moving toward torch.compile. vLLM V1 uses it by default. SGLang has active branches. Even TensorRT is adapting.

Distribution: TP/PP to Disaggregation

Tensor and pipeline parallelism are giving way to prefill/decode disaggregation. Different phases run on different hardware. SGLang and vLLM are both implementing this.

Quantization: Lower Precision

| Format | Support |
|---|---|
| FP16 | Universal |
| INT8 | Common |
| FP8 | Growing (TensorRT, SGLang, vLLM) |
| INT4 | Common for local (GGML) |
| FP4/MXFP | Emerging (Triton, TensorRT) |

MXFP formats are next. Better quality-per-bit than integer quantization.

Which One

| Use Case | Framework | Reason |
|---|---|---|
| Local development | Ollama | Simplest setup |
| Maximum throughput | SGLang | RadixAttention, proven at scale |
| NVIDIA-only performance | TensorRT | SM-specific optimization |
| Custom kernels | Triton | DSL flexibility |
| Production serving | vLLM | Mature ecosystem, wide model support |

Production usually comes down to SGLang vs vLLM. SGLang has better raw throughput via RadixAttention. vLLM has broader model support and a more mature ecosystem. Both are actively developed.

Building custom inference primitives? Triton. It’s becoming the common compilation target.

Getting started? Ollama.