22× stock TensorRT-LLM on the same silicon.

Production · Measured · Verified

Inference Fabric saturates NVIDIA silicon by fixing the supply chain around the kernels. Measured on production workloads on a single RTX 5090: 96.9% mean SM utilization, 250K peak tokens per second, 0.42% padding waste.

Source of truth

Production telemetry, captured during sustained saturation runs on the development host. Numbers are published in the site's canonical fact-sheet (/llms.txt) and cross-referenced in the hero and evidence sections of the Inference Fabric page.

Mechanism

SIR (Saturated Inference Runtime) wraps TensorRT-LLM with shape-pure batching, a zero-allocation hot path, eight-level backpressure, and class-keyed KV reuse. Producers fire and forget into the buffer; the GPU only ever sees clean, homogeneous batches.
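
SIR's internals are not shown on this page, so the following is only a minimal sketch of what shape-pure batching could look like. It assumes requests are keyed by token count rounded up to a fixed bucket and that a batch is released only when one shape class fills. The names `Request`, `shape_class`, `ShapePureBatcher`, and `padding_waste`, and the bucket size of 64, are all hypothetical and not part of SIR or TensorRT-LLM.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    request_id: str
    num_tokens: int


def shape_class(num_tokens: int, bucket: int = 64) -> int:
    """Round a token count up to the next multiple of `bucket` (hypothetical shape key)."""
    return ((num_tokens + bucket - 1) // bucket) * bucket


class ShapePureBatcher:
    """Queues requests per shape class and releases only full, homogeneous batches."""

    def __init__(self, batch_size: int, bucket: int = 64) -> None:
        self.batch_size = batch_size
        self.bucket = bucket
        self.queues = defaultdict(list)  # shape class -> pending requests

    def submit(self, req: Request) -> Optional[list]:
        """Enqueue a request; return a batch once its shape class fills up."""
        key = shape_class(req.num_tokens, self.bucket)
        queue = self.queues[key]
        queue.append(req)
        if len(queue) >= self.batch_size:
            batch = queue[: self.batch_size]
            del queue[: self.batch_size]
            return batch
        return None


def padding_waste(batch: list, bucket: int = 64) -> float:
    """Fraction of padded token slots in a batch that carry no real token."""
    padded = sum(shape_class(r.num_tokens, bucket) for r in batch)
    real = sum(r.num_tokens for r in batch)
    return (padded - real) / padded
```

In this sketch the producer never waits on the GPU: `submit` returns immediately, and the consumer sees a batch only when every member shares one padded shape. A real runtime would also need a timeout so that sparse shape classes do not wait indefinitely.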

Evidence

GPU SATURATION · PRODUCTION TELEMETRY

  96.9%      mean SM utilization (100% peak)
  250K       tokens/sec peak, single RTX 5090
  160K+      tokens/sec sustained, single GPU
  22×        throughput vs stock TensorRT-LLM
  ~95%       lower marginal cost per million tokens at saturation
  99.9%      KV cache hit rate
  0          shape switches
  0          XID errors
  0.42%      padding waste  (industry typical: 15–40%)

  Hardware path: NVIDIA Blackwell (SM 120) · CUDA 13 · TensorRT-LLM · FP8/FP4
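
The page does not say how the ~95% cost figure was derived. One plausible reading, sketched below, assumes the GPU's hourly cost is fixed, so cost per token scales inversely with throughput. Under that assumption the 22× figure implies a reduction of 1 − 1/22. The function name `marginal_cost_reduction` is hypothetical.

```python
def marginal_cost_reduction(speedup: float) -> float:
    """Fractional drop in cost per token when throughput rises by `speedup`.

    Assumes a fixed hourly hardware cost, so cost per token is
    proportional to 1 / throughput. This is an assumption, not a
    derivation stated by the source.
    """
    return 1.0 - 1.0 / speedup


reduction = marginal_cost_reduction(22.0)
print(f"{reduction:.1%}")  # prints 95.5%
```

Under this assumption the result is consistent with the ~95% figure in the table above.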

Reproduction

Source artifact

Inference Fabric production telemetry + DCGM / nvidia-smi observation

Command

Run a sustained saturation workload through SIR, then observe DCGM SM% and the Fabric trace logs.
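
No literal command is given above, so here is a rough sketch of one way to sample utilization from the `nvidia-smi` side. Note that `utilization.gpu` reports the percentage of time at least one kernel was executing, which is a coarser signal than the SM-level activity DCGM exposes, so it is only a proxy for SM%. The helper names are hypothetical, and `sample_once` needs a host with an NVIDIA driver installed.

```python
import statistics
import subprocess

# Standard nvidia-smi query flags; prints one bare number per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(output: str) -> list:
    """Turn nvidia-smi's one-number-per-line output into floats."""
    return [float(line) for line in output.splitlines() if line.strip()]


def sample_once() -> list:
    """Run nvidia-smi once and return utilization per GPU (requires a GPU host)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_utilization(result.stdout)


def mean_utilization(samples: list) -> float:
    """Mean of the collected utilization samples, in percent."""
    return statistics.fmean(samples)
```

Calling `sample_once()` in a loop during the run and passing the collected values to `mean_utilization` gives a figure to set beside the DCGM reading.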

Expected output

SM% sustained in the 96–100 range; tokens/sec near 250K peak on RTX 5090; KV hit ≥99%; padding waste <0.5%
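
The numeric thresholds above can be turned into a simple pass/fail check. This is an illustrative sketch: the function name and argument names are hypothetical, and the tokens/sec target is left out because "near 250K" does not define a bound.

```python
def check_run(sm_pct_mean: float, kv_hit_pct: float, padding_waste_pct: float) -> list:
    """Return the list of expectations a run fails; an empty list means it passes."""
    failures = []
    if not 96.0 <= sm_pct_mean <= 100.0:
        failures.append(f"SM% {sm_pct_mean} outside 96-100")
    if kv_hit_pct < 99.0:
        failures.append(f"KV hit {kv_hit_pct}% below 99%")
    if padding_waste_pct >= 0.5:
        failures.append(f"padding waste {padding_waste_pct}% not below 0.5%")
    return failures


# The headline figures from the evidence table satisfy every threshold.
assert check_run(sm_pct_mean=96.9, kv_hit_pct=99.9, padding_waste_pct=0.42) == []
```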

Verification

Same hardware (Blackwell SM 120), same workload class, and same Fabric build → results should match within 1–2%.
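
The source does not say whether "within 1–2%" means relative error or percentage points. The sketch below assumes relative error against the published figure and uses the looser 2% bound. The function name `within_tolerance` is hypothetical.

```python
def within_tolerance(measured: float, reference: float, rel_tol: float = 0.02) -> bool:
    """True if `measured` is within `rel_tol` (relative) of `reference`."""
    return abs(measured - reference) <= rel_tol * abs(reference)


# 246K tokens/sec against the 250K peak is a 1.6% gap, so it passes.
assert within_tolerance(246_000, 250_000)
# 240K is a 4% gap, so it fails.
assert not within_tolerance(240_000, 250_000)
```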

Caveats

Numbers above are from sustained production runs on NVIDIA Blackwell (SM 120, CUDA 13, TensorRT-LLM, FP8/FP4). Reproducible on equivalent silicon with the same SIR build.
