Validated with 2,000 concurrent AI workers.

Production · Measured · Verified

The benchmark intentionally exceeded expected production load: each retrieval surface was driven at full concurrency on its own, whereas real autonomous organizations spread requests across multiple tools rather than sending every worker to the same endpoint.

Validated concurrent workers

Tool	Validated concurrency
Search Observations	2,000 ✓
Trace Symbol Dependencies	2,000 ✓
List Indexes	2,000 ✓
Create Observation	2,000 ✓
Search Code	2,000 ✓
Search References	1,500 ✓
Explain Code Path	100

Search Code validated modes: exact_symbol · keyword. Search References validated at 1,500 concurrent (default mode); hybrid and semantic modes validated at 1,200 concurrent. Explain Code Path is LLM-bound and scales with available model throughput.

Retrieval surfaces · peak throughput at 2,000 concurrent

Tool	Mode	RPS	p50	p99
Search Observations	keyword	1,468	1.34s	1.90s
Search Observations	by-id	1,246	1.61s	3.82s
Trace Symbol Dependencies	graph	1,191	1.65s	2.37s
List Indexes	registry	1,070	1.87s	2.51s
Create Observation	fire-and-forget	1,079	1.59s	3.49s
Create Observation	sync	963	2.13s	3.55s
Search Code	exact symbol	268	8.02s	11.06s
Search Code	keyword	98	19.5s	40.4s
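As a rough consistency check on the table above, Little's law relates the number of requests in flight to throughput and mean latency: in flight = RPS × mean latency. The sketch below assumes a closed-loop harness in which each of the 2,000 workers issues its next request as soon as the previous one returns; the source does not state this, so treat it as an assumption. The row values are copied from the table, while the function and variable names are illustrative only. The table reports p50, not the mean, so the two figures are expected to be close rather than equal.

```python
# Rows copied from the table above: (tool, mode, RPS, p50 in seconds).
ROWS = [
    ("Search Observations", "keyword", 1468, 1.34),
    ("Search Observations", "by-id", 1246, 1.61),
    ("Trace Symbol Dependencies", "graph", 1191, 1.65),
    ("List Indexes", "registry", 1070, 1.87),
    ("Create Observation", "fire-and-forget", 1079, 1.59),
    ("Create Observation", "sync", 963, 2.13),
    ("Search Code", "exact symbol", 268, 8.02),
    ("Search Code", "keyword", 98, 19.5),
]

# Assumption: every worker always has exactly one request in flight.
CONCURRENCY = 2000


def implied_mean_latency(concurrency: int, rps: float) -> float:
    """Little's law, L = lambda * W, rearranged to W = L / lambda."""
    return concurrency / rps


def report(rows=ROWS, concurrency=CONCURRENCY):
    """Return (tool, mode, implied mean latency, reported p50) per row."""
    result = []
    for tool, mode, rps, p50 in rows:
        mean = round(implied_mean_latency(concurrency, rps), 2)
        result.append((tool, mode, mean, p50))
    return result


if __name__ == "__main__":
    for tool, mode, mean, p50 in report():
        print(f"{tool:26s} {mode:16s} implied mean {mean:6.2f}s  p50 {p50:5.2f}s")
```

For Search Observations in keyword mode this gives an implied mean of about 1.36s against a reported p50 of 1.34s. Where the implied mean and the p50 differ more, the gap reflects the shape of the latency distribution or a harness that departs from the closed-loop assumption.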

Hybrid retrieval · peak throughput at 1,200 concurrent

Tool	Mode	RPS	p50	p99
Search Code	semantic	445	2.63s	3.58s
Search Code	hybrid	448	2.61s	3.37s
Search References	semantic	340	1.95s	8.93s
Search References	hybrid	344	2.60s	11.08s
Search Observations	hybrid	810	0.81s	31.1s

Hybrid retrieval prioritizes recall over tail latency by combining semantic and lexical ranking. Under extreme synthetic concurrency, p99 grows while the success rate stays at 100%.
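The source does not say how the semantic and lexical rankings are combined. One common approach is reciprocal rank fusion (RRF), sketched below purely as an illustration; it is not presented as the FAFO Memory implementation, and the document ids and function name are hypothetical. The production stack also lists a cross-encoder reranker, so the real pipeline may well differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is the commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two retrievers.
lexical = ["obs-7", "obs-2", "obs-9"]
semantic = ["obs-2", "obs-4", "obs-7"]

fused = reciprocal_rank_fusion([lexical, semantic])
print(fused)
```

The fused list contains every document returned by either retriever, which is where the recall gain comes from. The cost is that both retrievers must answer before fusion can run, so the slower of the two sets the latency of the combined request.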

Reasoning surfaces · LLM-bound

Tool	Concurrency	RPS	p50	Notes
Explain Code Path	100	11	8.79s	Bounded by available model throughput. Scales with the Inference Fabric's concurrent model serving.

Benchmark environment

Workstation

Single production workstation

CPU

Intel Core Ultra · 24 cores

Memory

256 GB DDR5

GPU

NVIDIA RTX 5080

Powers production inference services. Not used to accelerate retrieval. Retrieval benchmarks exercised the production runtime while these services remained active.

Platform runtime

Production AgentOS runtime · running concurrently:

AgentOS
Agent Swarm
FAFO Memory
Inference Fabric
FAFO Buffer
MCP endpoint
PostgreSQL
Redis
Vector indexes
Embedding service
Cross-encoder reranker
Local model execution
Background orchestration
Observability / telemetry services

Endpoint

Production MCP endpoint — not a stripped-down benchmark harness

Methodology

Window

30s warmup · 60s measure · 30s cooldown per run

Captured

2026-05-20 → 2026-05-22

Coverage

7 fafo-memory tools across 5+ retrieval modes (keyword · semantic · hybrid · exact symbol · dependency · by-id · fire-and-forget) plus the LLM-bound reasoning surface

Measurement

REST mode against the live production deployment; p50/p95/p99 latency + RPS + error rate per (tool, mode, concurrency)

Reproducibility

Equivalent workstation hardware running the same production AgentOS stack should reproduce these results
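The methodology above can be sketched as a small summarizer that keeps only requests completed inside the 60s measurement window and reports the listed metrics. This is a minimal sketch under stated assumptions: the `Sample` fields, the choice to attribute a request to the window by its completion time, and the nearest-rank percentile method are all illustrative, since the source does not describe how the harness handles requests that straddle a window boundary.

```python
import math
from dataclasses import dataclass

# Window lengths taken from the methodology: 30s warmup, 60s measure.
WARMUP_S, MEASURE_S = 30.0, 60.0


@dataclass
class Sample:
    finished_at: float  # seconds since the run started (hypothetical field)
    latency_s: float    # request latency in seconds
    ok: bool            # True if the response was successful


def percentile(sorted_values, pct):
    """Nearest-rank percentile on an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(samples):
    """Compute RPS, error rate, and p50/p95/p99 for the measurement window."""
    start, end = WARMUP_S, WARMUP_S + MEASURE_S
    # Assumption: a request counts if it finished inside [start, end).
    window = [s for s in samples if start <= s.finished_at < end]
    if not window:
        raise ValueError("no samples completed inside the measurement window")
    latencies = sorted(s.latency_s for s in window)
    errors = sum(1 for s in window if not s.ok)
    return {
        "rps": len(window) / MEASURE_S,
        "error_rate": errors / len(window),
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
    }
```

Running `summarize` once per (tool, mode, concurrency) combination would yield one row per combination, with warmup and cooldown traffic excluded from every figure.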

The claim

The production AgentOS runtime sustained 2,000 concurrent autonomous workers against its memory layer while the broader platform was also running: Agent Swarm, Inference Fabric, FAFO Buffer, PostgreSQL, Redis, embeddings, reranking, local model services, and background orchestration.

PostgreSQL wasn't dedicated. Redis wasn't dedicated. The GPU wasn't dedicated. FAFO Memory wasn't running on an empty machine.

The benchmark exercised the system the way customers would actually deploy it.

P-005 · GPU Saturation