# AgentOS (FAFO) — Governed Autonomous Work · full primer

> AgentOS governs autonomous work from work order through audited completion. Work orders enter the system; completed, audited, cost-attributed work leaves it. Models are replaceable because state, authority, evidence, and work live outside the model. Organizations remain operational even as models, providers, pricing, and availability change. Built by Neuro Forge LLC (Sheridan, Wyoming, USA) and deployed today in production software engineering.

AgentOS is an execution layer for governed autonomous work. The organization survives model changes because state, authority, evidence, and work exist outside the model.

This file is the **full** primer. The short index is at https://letsfafo.com/llms.txt; this file inlines the full evidence library so an AI assistant can answer detailed questions about FAFO/AgentOS without a follow-up fetch.

Category: Governed Autonomous Work.

Positioning: "Work orders in. Completed work out."

Contact: info@letsfafo.com.

Website: https://letsfafo.com.

---

## Core principles

- **State lives outside the model.**
- **Models are replaceable.**
- **Evidence is mandatory.**
- **Authority precedes execution.**
- **Completed work is the product.**

---

## What AgentOS is

AgentOS is the governance kernel that turns AI capacity into completed work. It runs on four cooperating systems:

- **AgentOS** — governs the work: authority, contracts, evidence, recovery, cost attribution.
- **FAFO Memory** — grounds the work: code, decisions, references; agents reason from real symbols, not text excerpts.
- **Agent Swarm** — performs the work: specialized AI workers under execution contracts; replaceable workers, durable work.
- **Inference Fabric** — executes the work: local + frontier models on NVIDIA GPUs, saturated, with full cost attribution.

The system answers seven questions for every unit of completed work — questions a chatbot cannot answer:

1. **Who did the work?** The work order names the worker, role, and model behind every action.
2. **Why was it allowed?** An execution contract declares the allowed scope, tools, and authority before anything runs.
3. **What grounding did the worker use?** Every claim cites a file path, line range, or reference.
4. **How do we know it is correct?** Evidence packets — every claim is a path on disk; the gatekeeper persona reads files, not summaries.
5. **What did it cost?** Per WO, phase, role, model, action — sub-penny precision.
6. **Who reviewed and approved it?** Gatekeeper persona under an adversarial-review contract.
7. **If the worker dies mid-run, does the work survive?** Yes — durable state in PostgreSQL, sub-5-second deterministic recovery.

---

## The four systems (detail)

### AgentOS — governance kernel

The execution layer that issues work orders, holds the work graph, enforces execution contracts, captures evidence, and attributes cost. Self-hosted. No hosted source-code custody at any tier.

Five engineered axes:

- **Work-order governance** — every unit of work is a contract: allowed roots, allowed tools, forbidden actions, evidence required, cost budget. The model never gets to skip a clause.
- **Multi-agent execution** — specialized personas under one governance kernel; teams of workers complete work in parallel under one set of authority rules.
- **Deterministic recovery** — state lives in PostgreSQL, not in the agent's session; on stand-down the resume packet is regenerated from durable state with byte-identical SHA-256. Sub-5-second resume.
- **Cost attribution** — every dollar traces to the task that spent it, rolled up by phase, role, model, and action. Frontier-model share becomes a KPI, not a year-end finding.
- **QA + adversarial review** — built-in gatekeeper personas pressure-test work against the execution contract before close. Gates pass on artifacts, not assertions.
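The work-order governance axis above can be illustrated with a small fail-closed check. This is a minimal sketch, not the AgentOS implementation: the `ExecutionContract` and `TaskLedger` classes, their field names, and the `authorize` and `can_close` helpers are hypothetical, chosen only to mirror the clauses named in the list (allowed roots, allowed tools, forbidden actions, evidence required, cost budget).

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ExecutionContract:
    """Hypothetical contract shape; one field per clause named in the text."""
    allowed_roots: tuple[str, ...]
    allowed_tools: frozenset[str]
    forbidden_actions: frozenset[str]
    evidence_required: tuple[str, ...]
    cost_budget_usd: float


@dataclass
class TaskLedger:
    """Running spend and recorded evidence paths for one task."""
    spent_usd: float = 0.0
    evidence: set[str] = field(default_factory=set)


def authorize(contract, ledger, tool, action, path, est_cost_usd):
    """Fail closed: every clause must pass before the action may run."""
    if tool not in contract.allowed_tools:
        return False, f"tool not allowed: {tool}"
    if action in contract.forbidden_actions:
        return False, f"action forbidden: {action}"
    target = PurePosixPath(path)
    # Pure paths are compared lexically, so reject parent-directory hops outright.
    if ".." in target.parts:
        return False, f"path escapes root: {path}"
    if not any(target.is_relative_to(root) for root in contract.allowed_roots):
        return False, f"path outside allowed roots: {path}"
    if ledger.spent_usd + est_cost_usd > contract.cost_budget_usd:
        return False, "cost budget exhausted"
    return True, "ok"


def can_close(contract, ledger):
    """A task may close only when every required evidence file is recorded."""
    return set(contract.evidence_required) <= ledger.evidence
```

The design point is ordering: the check runs before the tool call, so a denied action never executes, and closing depends on recorded files rather than on the worker's own report.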

### FAFO Memory — grounding substrate

Three indexes, one retrieval surface:

- **Code index** — your working source, AST-chunked, embedded, with a real symbol graph behind it (calls, called-by, implements, imports). Tools exposed: `search_code`, `trace_symbol_dependencies`, `explain_code_path`.
- **Observation history** — every decision, discovery, fix, outcome an agent records; an immutable timeline of why the system looks the way it does. Tools: `search_observations`, `create_observation`. Modes: hybrid · semantic · keyword.
- **Reference index** — external material such as SDK source, PDFs, API specs, and documentation bundles, enriched with AI-generated summaries and section anchors. Tool: `search_references`.

The map is built from real edges; the model reads it, it never invents it. Every retrieved hit returns file path, line range, language, and a relevance score.

### Agent Swarm — the AI workforce

From "helpful interface" to "governed labor."

Specialized AI workers that perform the work AgentOS authorizes. Every task carries an execution contract: authority, grounding, evidence requirements, cost budget.

A traditional AI assistant is **prompt-driven** — conversation carries context, human remembers what's in scope, dead session = lost work, cost is session-level. Agent Swarm is **contract-driven** — work graph and execution contract carry authority, system enforces scope, completion derived from evidence and gatekeeper verdict, cost attributed by task, state lives in a durable graph outside the model, a dead worker is replaced and the work continues.

Personas in the standard swarm:

- **Architect** — proposes plans, dispatches work packages.
- **Developer** — executes work packages under contract.
- **Reviewer** — pressure-tests outputs.
- **QA** — verifies acceptance criteria.
- **Gatekeeper** — adversarial-review persona that PASSes or REJECTs against the contract, reading evidence packets directly.
- **Operator stand-in** — represents the customer's standing constraints in adjudication.

### Inference Fabric — the saturation layer

A saturation layer for NVIDIA inference. It does **not** replace NVIDIA's kernels; it feeds them. Built on TensorRT-LLM, CUDA 13, NVIDIA Blackwell (SM 120), with FP8/FP4.

Core: **SIR (Saturated Inference Runtime)** — a Rust + C++ harness that keeps the GPU fed via:

- shape-pure batching (0 shape switches at runtime),
- a zero-allocation hot path,
- eight-level backpressure.

Plus a custom CUDA kernel engine: 15 hand-written kernel families (45 .cu sources) compiled to native sm_120 cubins (no PTX, no JIT), using CSR batching and a fused search-plus-health kernel.

---

## Engineering Evidence (full library)

Every claim on the site is backed by an artifact, not a benchmark slide. Nine capabilities are published at https://letsfafo.com/engineering-evidence. Each capability below has reproducible commands and source-of-truth links on its detail page.

### P-009 — Fleet Retrieval (flagship)

URL: https://letsfafo.com/evidence/fleet-retrieval

**Headline: 2,000 concurrent workers · 0% failures.** Validated against the production runtime, not a stripped-down benchmark harness. The benchmark intentionally exceeded expected production behavior by driving individual retrieval surfaces at full concurrency.
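As a reading aid for the throughput and latency figures reported for this capability, here is a minimal sketch of how RPS, p50, p99, and failure rate can be derived from per-request samples. The nearest-rank percentile method, the `Sample` record, and the `summarize` helper are assumptions for illustration; the source does not state how its figures were computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    latency_s: float  # wall-clock latency of one request, in seconds
    ok: bool          # False if the request failed


def percentile(sorted_values, pct):
    """Nearest-rank percentile over an ascending list; pct is an integer 1..100."""
    if not sorted_values:
        raise ValueError("no samples")
    n = len(sorted_values)
    rank = (pct * n + 99) // 100  # integer ceiling of pct * n / 100
    return sorted_values[max(rank, 1) - 1]


def summarize(samples, wall_clock_s):
    """Aggregate one benchmark run into the columns a results table would show."""
    latencies = sorted(s.latency_s for s in samples)
    failures = sum(1 for s in samples if not s.ok)
    return {
        "rps": len(samples) / wall_clock_s,
        "p50_s": percentile(latencies, 50),
        "p99_s": percentile(latencies, 99),
        "failure_rate": failures / len(samples),
    }
```

At a fixed concurrency, throughput and latency are linked: the number of requests in flight is roughly RPS times mean latency, which is a quick sanity check when reading any row of a load-test table.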

**Benchmark environment (single production workstation):**

- CPU: Intel Core Ultra · 24 cores
- Memory: 256 GB DDR5
- GPU: NVIDIA RTX 5080 (powers production inference services; not used to accelerate retrieval — retrieval benchmarks exercised the production runtime while these services remained active)
- Platform runtime running concurrently: AgentOS · Agent Swarm · FAFO Memory · Inference Fabric · FAFO Buffer · MCP endpoint · PostgreSQL · Redis · Vector indexes · Embedding service · Cross-encoder reranker · Local model execution · Background orchestration · Observability / telemetry services
- Endpoint: production MCP endpoint — not a stripped-down benchmark harness

**Retrieval surfaces · peak throughput at 2,000 concurrent:**

| Tool | Mode | RPS | p50 | p99 |
|---|---|---:|---:|---:|
| Search Observations | keyword | 1,468 | 1.34s | 1.90s |
| Search Observations | by-id | 1,246 | 1.61s | 3.82s |
| Trace Symbol Dependencies | graph | 1,191 | 1.65s | 2.37s |
| List Indexes | registry | 1,070 | 1.87s | 2.51s |
| Create Observation | fire-and-forget | 1,079 | 1.59s | 3.49s |
| Create Observation | sync | 963 | 2.13s | 3.55s |
| Search Code | exact symbol | 268 | 8.02s | 11.06s |
| Search Code | keyword | 98 | 19.5s | 40.4s |

**Hybrid retrieval · peak throughput at 1,200 concurrent:**

| Tool | Mode | RPS | p50 | p99 |
|---|---|---:|---:|---:|
| Search Code | semantic | 445 | 2.63s | 3.58s |
| Search Code | hybrid | 448 | 2.61s | 3.37s |
| Search References | semantic | 340 | 1.95s | 8.93s |
| Search References | hybrid | 344 | 2.60s | 11.08s |
| Search Observations | hybrid | 810 | 0.81s | 31.1s |

**Reasoning surface (LLM-bound):** Explain Code Path · 100 concurrent · 11 RPS · 8.79s p50 — bounded by available model throughput; scales with the Inference Fabric's concurrent model serving.

**Claim:** Benchmarks were executed against the production deployment, not a stripped-down benchmark harness. This validates that AgentOS can sustain a governed AI workforce at production load, not just a single demo agent.

**Important engineering clarification:** the benchmark intentionally exceeded expected production behavior by driving one retrieval surface at a time. Real deployments distribute requests across multiple retrieval tools simultaneously.

### P-005 — GPU Saturation

URL: https://letsfafo.com/evidence/gpu-saturation

**Headline: 96.9% mean SM utilization (100% peak) on NVIDIA GPUs. 22× the throughput of stock TensorRT-LLM.**

**Production telemetry (single RTX 5090, sustained run):**

- Mean SM utilization: **96.9%** · Peak SM utilization: **100%**
- Peak tokens/sec: **250K** (single RTX 5090) · Sustained: **160K+** (single GPU)
- Throughput multiplier vs stock TensorRT-LLM: **~22×** (same silicon, different scheduler)
- Marginal cost reduction per million tokens at saturation: **~95%** (derived from the 22× multiplier: 1 − 1/22 ≈ 95%)
- KV cache hit rate: **99.9%** (class-keyed reuse)
- Shape switches: **0** (shape-pure batching) · XID errors: **0** (sustained)
- Padding waste: **0.42%** (industry typical: 15–40%)

**Production hardware path:** NVIDIA Blackwell (SM 120) — RTX 5090, RTX 5080, RTX PRO 6000, B200, B300, GB200. Also runs on Hopper, Ada, Ampere at roughly half the throughput multiplier.

**Substrate:** CUDA 13 · TensorRT-LLM · FP8/FP4 on Blackwell · DCGM.

**Reproducibility:** all numbers are from sustained production runs. Equivalent silicon with the same SIR build and workload class should reproduce results within the published tolerance.

**Positioning:** Inference Fabric improves utilization around TensorRT-LLM rather than replacing TensorRT-LLM. NVIDIA's kernels run the decode; SIR keeps them saturated.

### P-001 — Recovery

URL: https://letsfafo.com/evidence/recovery

**Headline: 8 deterministic recoveries.** State survives the worker.
When a worker stands down (planned or crash), the resume packet is regenerated from PostgreSQL — not reconstructed from session memory.

**Real numbers (one live work order, build_rig):**

| Field | Value |
|---|---|
| Open continuations | 6 (each named with a specific recovery surface) |
| Tasks tracked | 44 (states pulled from PG with status + envelope mapping) |
| Work packages | 33 (with `qa_required`, `gate_required`, readiness flags) |
| Resume generations | 8 (this WO has been stood down + resumed 8 separate times) |
| Role overlay artifacts | 48 (6 roles × 8 generations) |
| Total generated artifacts | 56 (every packet SHA-256 checksummed) |

**Mechanism:** `agentos-hook standdown checkpoint --wo-dir ` re-renders the resume packet directly from PostgreSQL state. The resume packet IS the recovery state. No tmux memory. No "hope the agent remembers."

Recovery isn't a feature we hope works; it's a `git ls-files`-able artifact under `standdown/checkpoint/`.

### P-002 — Deterministic Resume

URL: https://letsfafo.com/evidence/deterministic-resume

**Headline: SHA-256 across 8 regenerations.** Same PostgreSQL state regenerates the same packet bytes. Determinism is a property of the bytes, not a promise from the model.

**Real evidence — 8 resume packets, 8 SHA-256 checksums (build_rig WO):**

```
resume_generation_1 sha256:42a501d5faeec04ead89bb952928300b80499e9713093757393f0f4e6fa6eb67
resume_generation_2 sha256:c08b895b1020f56ca78a370fbda2a8d22b09f6c8c115c29aadef231877de8d24
resume_generation_3 sha256:a1b0c59fbd27f14dc63d4aaef93051ebdc9ef474f5d1b9bef878d3ad2d4db313
resume_generation_4 sha256:907501eaad84e489485843ea4cb5a8ddc999d1013518192c322d911f291e2d07
resume_generation_5 sha256:dcc82b7966200841e136aea08ee66021935f4727176141401d32f7481bc7365b
resume_generation_6 sha256:…
resume_generation_7 sha256:…
resume_generation_8 sha256:…
```

**What this proves:**

- **Determinism on the bytes** — each resume packet is regenerated from PG and the bytes are checksummed. Two reruns at the same PG state produce the same SHA-256.
- **No hand-authoring** — the packet is generated, not typed. If an agent tried to "remember" state instead of regenerating it, the checksum would diverge.
- **Auditable replay** — anyone can rerun `agentos-hook standdown checkpoint --wo-dir ` and verify the same hash for the same state.

(The 8 generations show different checksums because PG state evolved between stand-downs as work progressed. The determinism claim is: same PG state → same checksum — demonstrable directly by running the checkpoint twice in a row without state advancing.)

### P-003 — Evidence Packets

URL: https://letsfafo.com/evidence/evidence-packets

**Headline: 8,438 generated packets per work order.** Gates don't pass on assertions; they pass on files. Every claim is a path on disk.

**Anatomy of one work order on disk (real, build_rig):**

| Subdirectory | File count | What it holds |
|---|---:|---|
| `workorder/` | 3 | the signed WO contract: scope, acceptance criteria, envelope |
| `_authoring/` | 3 | the author plan that produced the WO |
| `dispatch_design/` | 4 | architect's dispatch design (envelopes, task graph) |
| `dispatch_work_packages/` | 78 | per-task work-package contracts handed to devs |
| `grounding/` | 152 | per-role resume overlays (architect · dev_1 · dev_2 · dev_3 · operator_stand_in · team_lead) |
| `gate_evidence//` | 4–38 each | per-task evidence the gatekeeper must read before passing |
| `gate_evidence/generated_packets/` | **8,438** | auto-generated witness + gate packets, all checksum-stamped |
| `gatekeeper_workspace/` | ~48k | live gatekeeper state: inboxes, reports, drafts, codex transport |
| `observations/` | | banked decisions + discoveries → FAFO Memory |
| `standdown/checkpoint/` | 1 | regenerable-from-PG resume packet |
| `wo_state.yaml` + `roster.yaml` | | derived state header + exact team roster |

Each per-task gate_evidence directory holds: dispatch packet · dev output · witness transcript · gatekeeper verdict · evidence-decision (ED) bank entries. The 8,438 in `generated_packets/` is the auto-generated witness/gate output across every round of every task — all checksummed.

**Why this is proof:** these counts are from one real WO. Run `find -type f | wc -l` yourself. Evidence is structural — gatekeeper passes aren't "the agent says it works"; they're functions of which files exist + their checksums + what's in them.

### P-004 — Blast Radius

URL: https://letsfafo.com/evidence/blast-radius

**Headline: Real graph traversal — 14 callers · 4 files · 2 modules (sub-second, no LLM).** Before any code change ships, the symbol graph tells the worker the exact blast radius.

**Real trace (sanitized, from `mcp__fafo-memory__trace_symbol_dependencies` against the live fafo-foundry index, 5,447 indexed source documents, 2026-06-28):**

- Internal symbol traced: `validate_catalog` (in `API_SOP/tools/api_ref_validator/src/lib.rs`)
- Public-safe display name: `SchemaValidator.run`
- Direction: both · Depth: 1
- Latency: sub-second (pure graph traversal over `code_graph_edges` — no LLM, no rerank, no vector search)

**Result:**

| Quantity | Value |
|---|---|
| Direct caller sites | **14** |
| Source files touched | **4** |
| Modules touched | **2** (`normalizer`, `validator`) |
| Direct downstream callees | **2** (`structural::validate`, `semantic::validate`) |
| Blast-radius class | **module** (would escalate to **system** if either downstream validator's signature changes) |

**Verdict:** the proposed change is in-envelope, no escalation required.

### P-006 — Memory Grounding

URL: https://letsfafo.com/evidence/memory-grounding

**Headline: Real graph traversal · ~1.2s p50.** Ground decisions in code, not model recall.
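The graph traversal named in this headline, and used for the blast-radius trace in P-004, can be sketched as a depth-limited walk over stored call edges in both directions. This is a minimal sketch under stated assumptions: the `(caller, callee)` edge tuples, the symbol-to-file map, and the function name `blast_radius` are hypothetical, since the actual `code_graph_edges` schema is not shown in this document.

```python
from collections import defaultdict


def blast_radius(edges, files, symbol, depth=1):
    """Collect callers and callees of `symbol` up to `depth` hops.

    edges: iterable of (caller, callee) pairs (assumed shape).
    files: dict mapping a symbol name to the file that defines it.
    """
    callers_of = defaultdict(set)
    callees_of = defaultdict(set)
    for caller, callee in edges:
        callers_of[callee].add(caller)
        callees_of[caller].add(callee)

    def walk(index):
        # Breadth-first expansion, one hop per iteration, never revisiting a node.
        seen, frontier = set(), {symbol}
        for _ in range(depth):
            frontier = {n for s in frontier for n in index.get(s, ())} - seen - {symbol}
            seen |= frontier
        return seen

    callers = walk(callers_of)
    callees = walk(callees_of)
    touched_files = {files[s] for s in callers if s in files}
    return {"callers": callers, "callees": callees, "files": touched_files}
```

Note that this sketch counts distinct caller symbols, whereas the P-004 trace reports caller sites, so a real implementation would keep one record per call location rather than per symbol. No model call is involved at any step, which is why the operation can be fast and repeatable.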

**Real search_code capture (against the live fafo-foundry index, 2026-06-28):**

- Query used: `"hybrid search code_index_ids filter"`
- Mode: `auto` (embedded classifier routes to hybrid FTS + vector RRF)
- 8 ranked results returned across Rust + TypeScript files
- Top result (score 0.034): `hybrid_search(pool, query, query_embedding, language, path_filter, code_index_ids, file_limit, chunks_per_file, ...)` at `fafo_tools/fafo_memory/src/code_search/search.rs` — "Perform hybrid search combining FTS and vector search with file-level grouping"

**What this shows:**

- **Symbols, not chunks of text.** Every hit returns function signature, file path, line range, language, relevance score.
- **Multi-language graph.** Same query returned Rust + TypeScript hits, ranked semantically — the index is language-aware.
- **The substrate is recursive.** A query about "hybrid search" returned the actual `hybrid_search` orchestrator that powers `search_code` itself — memory searching memory in production.

**Throughput envelope:** 1,271–1,479 RPS @ 2,000 concurrent in keyword mode (Gate 9.5 bench).

### P-007 — Model Routing

URL: https://letsfafo.com/evidence/model-routing

**Headline: Schema-locked dispatcher.** Every call routed by required `task_type` and `schema_id`. No ungoverned dispatches.

**Real routing surface — `LLMLocalDispatcher` (`fastapi/mcp/llm_local/llm_local_dispatcher.py`, top search hit at score 0.256):**

> "LLM Local dispatcher for schema-locked model orchestration. Features: model registry validation and schema-locked enforcement; Redis-based job queuing with task-specific routing; auto-tuning batch sizes based on response-time metrics; worker self-registration and health monitoring; performance metrics tracking per model/task combination."
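A request envelope that enforces the schema-locked rule might look like the following. This is a hedged sketch, not the actual `LLMLocalRequest` definition: the field names and constraints follow the request-envelope fields documented for this capability, while the `Request` class, the `MODEL_REGISTRY` contents, and the queue-name format returned by `route` are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class TaskType(str, Enum):
    """Task classes named in the source; a real deployment may define more."""
    EMBEDDING = "embedding"
    EXTRACT = "extract"
    CLASSIFY = "classify"
    CHAT = "chat"


# Hypothetical registry: schema_id -> task types that schema is allowed to serve.
MODEL_REGISTRY = {
    "embed-v1": {TaskType.EMBEDDING},
    "chat-v1": {TaskType.CHAT, TaskType.CLASSIFY},
}


@dataclass(frozen=True)
class Request:
    schema_id: str
    task_type: TaskType
    caller: str
    priority: int = 5
    timeout_seconds: int = 300

    def __post_init__(self):
        # Reject malformed envelopes at construction time, before any queuing.
        if not self.schema_id or not self.caller:
            raise ValueError("schema_id and caller are required")
        if not isinstance(self.task_type, TaskType):
            raise ValueError("task_type must be a TaskType member")
        if not 0 <= self.priority <= 10:
            raise ValueError("priority must be between 0 and 10")
        if self.timeout_seconds <= 0:
            raise ValueError("timeout_seconds must be positive")


def route(req):
    """Return a queue name, or raise if the schema/task pair is not registered."""
    allowed = MODEL_REGISTRY.get(req.schema_id)
    if allowed is None:
        raise LookupError(f"unknown schema_id: {req.schema_id}")
    if req.task_type not in allowed:
        raise LookupError(f"{req.schema_id} does not serve {req.task_type.value}")
    return f"{req.schema_id}:{req.task_type.value}"
```

Because both fields are required and checked against a registry, there is no default lane for a request to fall into, which is one way to read "no ungoverned dispatches."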

**Required fields on every request envelope (`LLMLocalRequest`):**

| Field | Type | Constraint |
|---|---|---|
| `schema_id` | str | required — names the target model schema |
| `task_type` | enum | required — names the task class (embedding · extract · classify · chat …) |
| `caller` | str | required — names the worker/service (attribution + audit) |
| `priority` | int | 0–10 (default 5) |
| `timeout_seconds` | int | required, default 300 |

**Response envelope (`LLMLocalResponse`):** `assigned_model_id` · `queue_name` · `queue_position` · `estimated_completion`.

**Routing properties (built-in):**

- ✓ model registry validation
- ✓ schema-locked enforcement
- ✓ Redis job-queue routing by schema + task
- ✓ auto-tuning batch sizes from response-time metrics
- ✓ worker self-registration + health monitoring
- ✓ performance metrics per model × task combination

**Per-deployment routing surface:** frontier-share is tracked per call as a routing KPI inside each customer's deployment. Same contract pattern (`schema_id` + `task_type` required) applies to local and frontier lanes — routing isn't a config knob, it's a wire-protocol contract.

### P-008 — Cost Attribution

URL: https://letsfafo.com/evidence/cost-attribution

**Headline: Per WO · phase · role · model · action.** Cost is a field on every task, not a line on a monthly invoice. Sub-penny precision per call.
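Per-call attribution with roll-ups on write can be sketched as follows. It assumes cost is token usage multiplied by a per-token model rate and that the budget check happens before the spend is recorded. The rate table, model ids, and the `ModelCall` and `CostLedger` classes are illustrative only, not the AgentOS schema. `Decimal` is used so that sub-penny amounts stay exact.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal

# Illustrative USD rates per million tokens; these are not real prices.
RATE_PER_MTOK = {
    "local-small": Decimal("0.02"),
    "frontier-large": Decimal("3.00"),
}


@dataclass(frozen=True)
class ModelCall:
    wo_id: str
    phase: str
    role: str
    model: str
    action: str
    tokens: int


class CostLedger:
    """Tracks spend for one task and rolls it up along each attribution axis."""

    AXES = ("wo_id", "phase", "role", "model", "action")

    def __init__(self, task_budget_usd):
        self.budget = Decimal(task_budget_usd)
        self.spent = Decimal("0")
        self.rollups = defaultdict(Decimal)  # (axis, value) -> total USD

    def record(self, call):
        cost = RATE_PER_MTOK[call.model] * call.tokens / Decimal(1_000_000)
        # Fail closed: refuse the write rather than overspend the task budget.
        if self.spent + cost > self.budget:
            raise RuntimeError("budget exhausted; task halts at the contract boundary")
        self.spent += cost
        for axis in self.AXES:
            self.rollups[(axis, getattr(call, axis))] += cost
        return cost
```

Rolling up at write time means a per-role or per-model total is always current, and a frontier-share figure is just the ratio of the frontier model's roll-up to the work order's roll-up.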

**What gets tagged on every model call:**

| Attribution axis | Tagged with | Rolls up |
|---|---|---|
| **Work order** | WO id on every call | per WO |
| **Phase** | planning · architecture · development · QA · governance | per phase across WOs |
| **Role** | architect · dev · witness · reviewer · gatekeeper · operator stand-in | per role across WOs |
| **Model** | exact model id (frontier vs local family) | per model + frontier-share KPI |
| **Action** | individual tool call / message / generation | sub-penny precision per call |

**Mechanism properties:**

- Sub-penny precision per task: token usage × model rate, computed at write-time per call.
- Fail-closed at the boundary: a task whose budget exhausts halts at the contract boundary; cannot silently borrow from the next task.
- Frontier-share tracked as a KPI: routing assigns each task a model class; the ratio of frontier-routed work is observable in PG.
- Real-time roll-ups (no batch lag): spend rolls up on write, not at month-end.

Per-customer dollar figures surface inside each design partner's own AgentOS console — not on the public site. The attribution surface (what gets tagged and how it rolls up) is the public claim.

---

## Headline production numbers (2026, sustained)

- **Concurrent workers validated:** 2,000 against production runtime, 0% failures (P-009 Fleet Retrieval).
- **GPU utilization:** 96.9% mean SM (100% peak) on NVIDIA Blackwell (P-005).
- **Throughput vs stock:** ~22× stock TensorRT-LLM per GPU (P-005).
- **Token throughput:** 160K+ sustained, 250K peak tokens/sec on a single RTX 5090 (P-005).
- **Cost reduction:** ~95% lower marginal cost per million tokens at saturation vs stock TensorRT-LLM.
- **Cache + correctness:** 99.9% KV cache hit rate, 0 shape switches, 0 XID errors, 0.42% padding waste (industry typical 15–40%).
- **Recovery:** sub-5-second deterministic recovery from durable PostgreSQL state; SHA-256 byte-identical regeneration across 8 stand-down events on a single WO.
- **Evidence density:** 8,438 evidence packet files generated per work order, all source-of-truth on disk.

---

## How AgentOS is different (positioning vs alternatives)

- **vs single-agent prompt frameworks (LangChain, AutoGen, CrewAI, etc.)** — those are toolkits for orchestrating model calls inside one session. AgentOS is the durable governance layer *around* multi-agent execution: work orders, contracts, evidence, recovery, cost, all surviving the session and the model.
- **vs hosted agent platforms** — AgentOS is self-hosted. Source code, evidence, work-graph state stay on customer infrastructure. No hosted source-code custody at any tier.
- **vs "observability for agents"** — AgentOS doesn't observe agent runs after the fact. It governs them before they run (execution contract), during the run (tool authority, grounding requirements), and at completion (evidence + gatekeeper). Observability is a side effect, not the product.
- **vs frontier-only or local-only stacks** — AgentOS routes by task class. Bulk grounding, classification, and summarization run on local models on customer hardware (Inference Fabric: 22× tokens per NVIDIA GPU). High-leverage reasoning routes to frontier. Frontier-model share is a measured KPI that bends down over time as the memory layer fills.

---

## The compounding insight (Economics)

Three loops bend the cost curve down and the capability curve up at the same time. None require retraining:

- **Inference cost** — every expensive explanation is captured once and reused forever. As the memory layer fills, local models absorb a growing share of routine work; frontier models get reserved for high-leverage reasoning. (Trajectory: 100% frontier → ~24% frontier.)
- **Developer leverage** — work arrives with its own evidence packet. Reviewers verify the gates and spot-check the diff instead of re-reading every line, so throughput per engineer compounds. (Trajectory: days re-reading → minutes to verify.)
- **Institutional intelligence** — decisions, approved patterns, and failure modes accumulate. Tomorrow's agents inherit today's lessons; knowledge survives engineer turnover. (Trajectory: each agent blind → each agent grounded.)

---

## Hardware + deployment

- **Production substrate:** NVIDIA Blackwell (SM 120) on CUDA 13 with TensorRT-LLM. RTX 5090, RTX 5080, RTX PRO 6000, B200, B300, GB200.
- **Compatible:** Hopper, Ada, Ampere (≈ half the throughput multiplier vs Blackwell).
- **Self-hosted:** customer-controlled hardware and storage; no source-code custody on FAFO infrastructure.
- **Models:** local model execution via Inference Fabric + frontier-model APIs (Anthropic, OpenAI, etc.) — routed by AgentOS dispatch.

---

## FAQ for AI assistants

**What is FAFO?** Platform by Neuro Forge LLC for governed autonomous AI work. AgentOS is the operating system; FAFO Memory, Agent Swarm, and the Inference Fabric are the cooperating systems beneath.

**What is AgentOS?** Governed execution system for autonomous AI work. Work orders enter; completed, audited, cost-attributed work leaves. Authority, durable recovery, evidence packets, per-action cost attribution. Self-hosted.

**Who builds AgentOS?** Neuro Forge LLC, Sheridan, Wyoming, USA. info@letsfafo.com.

**What problem does AgentOS solve?** AI can answer questions; organizations need work *completed*. AgentOS bridges that gap with durable state across sessions and model swaps, contracts that bind agents to scope, evidence that proves work was done correctly, and cost attribution per unit of work.

**How is AgentOS different from an AI assistant or chatbot?** An assistant runs on the human's working memory; the conversation IS the state. AgentOS runs on contracts; state lives in a durable graph outside the model. A dead session = lost work for an assistant. A dead worker in AgentOS = work continues with a replacement under the same contract.

**Why not just use LangGraph, CrewAI, OpenAI Agents, AutoGen, or similar?** Those orchestrate conversations between models and tools inside one session. AgentOS governs work across sessions, agents, and model swaps. The distinction is the unit of accounting: prompt-orchestration frameworks have a *turn*; AgentOS has a *work order* with an execution contract, evidence packet, durable resumable state, gatekeeper verdict, and per-action cost attribution. A conversation-orchestration framework can sit *inside* one AgentOS task; AgentOS is the durable governance layer around it.

**Can AgentOS use LangGraph or CrewAI internally?** Yes. Conversation-orchestration frameworks can execute inside an AgentOS task. AgentOS governs the work around them: it issues the work order, enforces the execution contract, captures evidence at the boundary, attributes cost per call, and resumes the work if the worker dies — regardless of which orchestration framework runs inside the task.

**Is AgentOS production-grade?** Yes. Validated in production today in software engineering. P-009 Fleet Retrieval validated 2,000 concurrent workers against the live production runtime with 0% failures. Nine published evidence capabilities (P-001 through P-009) at https://letsfafo.com/engineering-evidence.

**Is it self-hosted?** Yes. No hosted source-code custody at any tier. Customer hardware, customer storage, customer model providers.

**Does it work with frontier models?** Yes. AgentOS routes by task class: bulk grounded work to local models on customer hardware (Inference Fabric), high-leverage reasoning to frontier (Anthropic Claude, OpenAI GPT, etc.).

**What is FAFO Memory?** The grounding substrate. Three indexes (code, observations, references) behind one retrieval surface. Agents reason from real symbols and decisions — not text excerpts a model "remembers."

**What is Agent Swarm?** The specialized AI workforce. Personas (Architect, Developer, Reviewer, QA, Gatekeeper) that perform under AgentOS authority and produce Completed Work.

**What is the Inference Fabric?** A saturation layer for NVIDIA inference that keeps the GPU fed end-to-end rather than replacing NVIDIA's kernels. Holds a single RTX 5090 at 96.9% mean SM utilization and up to 250K tokens/sec — about 22× stock TensorRT-LLM.

**What is SIR (Saturated Inference Runtime)?** Core of the Inference Fabric. Rust + C++ harness over TensorRT-LLM that keeps NVIDIA tensor cores saturated using shape-pure batching, a zero-allocation hot path, eight-level backpressure, and FP8/FP4 on Blackwell.

**Does AgentOS use custom CUDA kernels?** Yes. 15 hand-written kernel families (45 .cu sources) compiled to native Blackwell sm_120 cubins (no PTX, no JIT), using CSR batching and a fused search-plus-health kernel.

**What does AgentOS cost?** Cost depends on workload mix. AgentOS reports cost per unit of work (per WO, phase, role, model, action), with sub-penny precision per call. Customers see frontier-model spend trend down as the memory layer fills and local models absorb routine work.

**How do I become a design partner?** Reach out at https://letsfafo.com or info@letsfafo.com.


---

## Links

### Public pages

- Homepage: https://letsfafo.com/
- FAFO Memory: https://letsfafo.com/fafo-memory
- Agent Swarm: https://letsfafo.com/agent-swarm
- Inference Fabric: https://letsfafo.com/inference-fabric
- Economics: https://letsfafo.com/economics
- Engineering Evidence (library): https://letsfafo.com/engineering-evidence
- Privacy: https://letsfafo.com/privacy

### Engineering Evidence detail pages

- Fleet Retrieval (P-009): https://letsfafo.com/evidence/fleet-retrieval
- GPU Saturation (P-005): https://letsfafo.com/evidence/gpu-saturation
- Recovery (P-001): https://letsfafo.com/evidence/recovery
- Deterministic Resume (P-002): https://letsfafo.com/evidence/deterministic-resume
- Evidence Packets (P-003): https://letsfafo.com/evidence/evidence-packets
- Blast Radius (P-004): https://letsfafo.com/evidence/blast-radius
- Memory Grounding (P-006): https://letsfafo.com/evidence/memory-grounding
- Model Routing (P-007): https://letsfafo.com/evidence/model-routing
- Cost Attribution (P-008): https://letsfafo.com/evidence/cost-attribution

### Crawler aids

- Sitemap: https://letsfafo.com/sitemap.xml
- Short index: https://letsfafo.com/llms.txt