# AgentOS (FAFO) — Governed Autonomous Work

> AgentOS governs autonomous work from work order through audited completion. Work orders enter the system; completed, audited, cost-attributed work leaves it. Models are replaceable because state, authority, evidence, and work live outside the model. Organizations remain operational even as models, providers, pricing, and availability change. Built by Neuro Forge LLC (Sheridan, Wyoming, USA) and deployed today in production software engineering.

AgentOS is an execution layer for governed autonomous work. The organization survives model changes because state, authority, evidence, and work exist outside the model. The shape of the product is a deliberate departure from "AI assistant." Most AI systems handle one request at a time and lose state when the session ends. AgentOS operates a governed work system: work orders in, completed outcomes out, with governance, evidence, recovery, and cost built into every unit of work.

Category: Governed Autonomous Work. Positioning: "Work orders in. Completed work out."

Contact: info@letsfafo.com. Website: https://letsfafo.com.

---

## Core principles

- **State lives outside the model.**
- **Models are replaceable.**
- **Evidence is mandatory.**
- **Authority precedes execution.**
- **Completed work is the product.**

---

## What AgentOS is

AgentOS is the governance kernel that turns AI capacity into completed work. It runs on four cooperating systems:

- **AgentOS** — governs the work: authority, contracts, evidence, recovery, cost attribution.
- **FAFO Memory** — grounds the work: code, decisions, references; agents reason from real symbols, not text excerpts.
- **Agent Swarm** — performs the work: specialized AI workers under execution contracts; replaceable workers, durable work.
- **Inference Fabric** — executes the work: local + frontier models on NVIDIA GPUs, saturated, with full cost attribution.

The system answers seven questions for every unit of completed work — questions a chatbot cannot answer:

1. Who did the work? (worker, role, model behind every action)
2. Why was it allowed? (execution contract declares scope, tools, authority before run)
3. What grounding did the worker use? (every claim cites a file path, line range, or reference)
4. How do we know it is correct? (evidence packets — every claim is a path on disk)
5. What did it cost? (per WO, phase, role, model, action — sub-penny precision)
6. Who reviewed and approved it? (gatekeeper persona, adversarial review)
7. If the worker dies mid-run, does the work survive? (yes — durable state, sub-5-second deterministic recovery)

---

## The four systems

### AgentOS — governance kernel
The execution layer that issues work orders, holds the work graph, enforces execution contracts, captures evidence, and attributes cost. Self-hosted. No hosted source-code custody at any tier. Five engineered axes:

- **Work-order governance** — every unit of work is a contract: allowed roots, allowed tools, forbidden actions, evidence required, cost budget.
- **Multi-agent execution** — specialized personas under one governance kernel; teams of workers complete work in parallel.
- **Deterministic recovery** — state lives in PostgreSQL, not in the agent's session; on stand-down the resume packet is regenerated from durable state with byte-identical SHA-256.
- **Cost attribution** — every dollar traces to the task that spent it, rolled up by phase, role, model, and action. Frontier-model share becomes a KPI, not a year-end finding.
- **QA + adversarial review** — built-in gatekeeper personas pressure-test work against the execution contract before close.
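
As a rough illustration of the work-order governance axis, a contract can be modelled as an immutable record that is checked before every tool call. This is a minimal sketch, not AgentOS's actual schema: the class name, field names, the `permits` method, and the `edit_file` tool are all hypothetical, chosen to mirror the clauses listed above.

```python
from dataclasses import dataclass
from decimal import Decimal
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ExecutionContract:
    """Hypothetical contract shape; fields mirror the clauses in the list above."""
    allowed_roots: tuple[str, ...]
    allowed_tools: frozenset[str]
    forbidden_actions: frozenset[str]
    evidence_required: tuple[str, ...]
    cost_budget_usd: Decimal

    def permits(self, tool: str, action: str, path: str, spent_usd: Decimal) -> bool:
        """Return True only if every contract clause allows this step."""
        if tool not in self.allowed_tools:
            return False
        if action in self.forbidden_actions:
            return False
        if spent_usd > self.cost_budget_usd:
            return False
        target = PurePosixPath(path)
        # is_relative_to is purely lexical, so reject parent-directory hops outright.
        if ".." in target.parts:
            return False
        # The path must sit under at least one allowed root.
        return any(target.is_relative_to(root) for root in self.allowed_roots)


contract = ExecutionContract(
    allowed_roots=("src/billing",),
    allowed_tools=frozenset({"search_code", "edit_file"}),
    forbidden_actions=frozenset({"delete"}),
    evidence_required=("test_report",),
    cost_budget_usd=Decimal("2.50"),
)
print(contract.permits("edit_file", "write", "src/billing/invoice.py", Decimal("0.10")))  # True
print(contract.permits("edit_file", "write", "src/auth/login.py", Decimal("0.10")))       # False
```

The point of the design is that scope is decided by data declared before the run, not by whatever the worker's prompt happens to say.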

### FAFO Memory — grounding substrate
Three indexes, one retrieval surface:

- **Code index** — your working source, AST-chunked, embedded, with a real symbol graph behind it (calls, called-by, implements, imports). Tools: `search_code`, `trace_symbol_dependencies`, `explain_code_path`.
- **Observation history** — every decision, discovery, fix, outcome an agent records; an immutable timeline of why the system looks the way it does. Tools: `search_observations`, `create_observation`. Modes: hybrid · semantic · keyword.
- **Reference index** — external material such as SDK source, PDFs, API specs, and documentation bundles, enriched with summaries and section anchors. Tool: `search_references`.

The map is built from real edges; the model reads it, it never invents it. Every retrieved hit returns file path, line range, and a relevance score.
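
A retrieved hit, as described above, can be pictured as a small record that turns directly into a citation. The sketch below is illustrative only: the `Hit` class, its field names, and `top_citations` are assumptions, not the return type of `search_code` or any other FAFO Memory tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical hit record: file path, line range, relevance score."""
    file_path: str
    start_line: int
    end_line: int
    score: float

    def citation(self) -> str:
        # Render the hit as "path:start-end" so a claim can point at exact lines.
        return f"{self.file_path}:{self.start_line}-{self.end_line}"


def top_citations(hits: list[Hit], limit: int = 3) -> list[str]:
    """Return citations for the highest-scoring hits, best first."""
    ranked = sorted(hits, key=lambda hit: hit.score, reverse=True)
    return [hit.citation() for hit in ranked[:limit]]


hits = [Hit("src/tax.py", 10, 42, 0.71), Hit("src/invoice.py", 5, 18, 0.93)]
print(top_citations(hits))  # ['src/invoice.py:5-18', 'src/tax.py:10-42']
```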

### Agent Swarm — the AI workforce
From "helpful interface" to "governed labor." Specialized AI workers (Architect, Developer, Reviewer, QA, Gatekeeper, and more) that perform the work AgentOS authorizes. Every task carries an execution contract: authority, grounding, evidence requirements, cost budget.

A traditional AI assistant is **prompt-driven**: the conversation carries context, a human remembers what is in scope, a dead session means lost work, and cost is tracked per session. Agent Swarm is **contract-driven**: the work graph and execution contract carry authority, the system enforces scope, completion is derived from evidence and the gatekeeper verdict, and cost is attributed by task. State lives in a durable graph outside the model, so a dead worker is replaced and the work continues.

### Inference Fabric — the saturation layer
A saturation layer for NVIDIA inference. It does **not** replace NVIDIA's kernels; it feeds them. Built on TensorRT-LLM, CUDA 13, NVIDIA Blackwell (SM 120), with FP8/FP4.

Core: **SIR (Saturated Inference Runtime)** — a Rust + C++ harness that keeps the GPU fed via shape-pure batching, a zero-allocation hot path, and eight-level backpressure. Plus a custom CUDA kernel engine: 15 hand-written kernel families (45 .cu sources) compiled to native sm_120 cubins, with CSR batching and a fused search-plus-health kernel.
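
The source does not define "shape-pure batching." One plausible reading is that every batch holds requests padded to a single shape, so the GPU never switches shapes mid-batch. The sketch below illustrates that general technique (length bucketing) in Python for readability; it is not SIR's Rust + C++ implementation, and the bucket sizes, function names, and the padding-waste formula are all assumptions.

```python
from collections import defaultdict

BUCKETS = (128, 256, 512, 1024)  # hypothetical padded sequence lengths


def bucket_for(length: int) -> int:
    """Pick the smallest bucket that fits a sequence of `length` tokens."""
    for size in BUCKETS:
        if length <= size:
            return size
    raise ValueError(f"sequence of {length} tokens exceeds the largest bucket")


def shape_pure_batches(lengths: list[int]) -> dict[int, list[int]]:
    """Group sequence lengths so each batch contains exactly one padded shape."""
    batches = defaultdict(list)
    for length in lengths:
        batches[bucket_for(length)].append(length)
    return dict(batches)


def padding_waste(batches: dict[int, list[int]]) -> float:
    """Fraction of padded token slots that carry no real token (assumed definition)."""
    padded = sum(size * len(members) for size, members in batches.items())
    real = sum(sum(members) for members in batches.values())
    return (padded - real) / padded


batches = shape_pure_batches([120, 128, 250, 256, 500])
print(batches)
print(f"{padding_waste(batches):.2%}")
```

Under this reading, padding waste falls as request lengths cluster near the bucket boundaries, which is why bucket selection matters as much as the batching itself.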

---

## Engineering Evidence (the proof library)

Every claim on the site is backed by an artifact, not a benchmark slide. Nine capabilities are published with reproducible commands and source-of-truth artifacts at https://letsfafo.com/engineering-evidence:

### P-009 — Fleet Retrieval (flagship)
**2,000 concurrent workers · 0% failures.** Validated against the production runtime, not a stripped-down benchmark harness. The concurrent benchmark exercised individual retrieval surfaces at full concurrency while the rest of the platform (AgentOS, Agent Swarm, FAFO Memory, Inference Fabric, PostgreSQL, Redis, vector indexes, embedding service, reranker, local model execution, observability) ran live.

Peak retrieval throughput on a single production workstation: 1,468 RPS keyword observation search at 1.34s p50, 1,191 RPS symbol-dependency traversal at 1.65s p50, 1,079 RPS observation create at 1.59s p50. Hybrid retrieval: 810 RPS observation hybrid search at 0.81s p50. Single LLM-bound surface: 11 RPS on `explain_code_path` at 8.79s p50 (bounded by available model throughput; scales with Inference Fabric).

Hardware: Intel Core Ultra · 24 cores · 256 GB DDR5 · NVIDIA RTX 5080.

The benchmark intentionally exceeded expected production behavior by driving one retrieval surface at a time. Real deployments distribute requests across multiple retrieval tools simultaneously.

URL: https://letsfafo.com/evidence/fleet-retrieval

### P-005 — GPU Saturation
**96.9% mean SM utilization (100% peak) on NVIDIA GPUs.** 22× the throughput of stock TensorRT-LLM. 160K+ sustained / 250K peak tokens per second on a single RTX 5090. 99.9% KV cache hit rate, 0 shape switches, 0 XID errors. 0.42% padding waste (industry typical: 15–40%).

Inference Fabric improves utilization around TensorRT-LLM rather than replacing TensorRT-LLM.

URL: https://letsfafo.com/evidence/gpu-saturation

### P-001 — Recovery
**8 deterministic recoveries.** State survives the worker. When a worker stands down (planned or crash), the resume packet is regenerated from PostgreSQL — not reconstructed from session memory. Sub-5-second durable recovery.

URL: https://letsfafo.com/evidence/recovery

### P-002 — Deterministic Resume
**Byte-identical SHA-256 across 8 regenerations.** The same PostgreSQL state regenerates the same packet bytes. Determinism is a property of the bytes, not a promise from the model.
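
The idea can be sketched with a canonical serialization: if the packet is built only from durable state, with a fixed key order and fixed separators, the bytes and therefore the digest cannot vary between regenerations. This is a minimal sketch under stated assumptions. The packet format, function names, and the sample state are hypothetical, and a plain dict stands in for rows read from PostgreSQL.

```python
import hashlib
import json


def build_resume_packet(state: dict) -> bytes:
    """Serialize durable state into canonical bytes (sorted keys, fixed separators)."""
    text = json.dumps(state, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return text.encode("utf-8")


def packet_digest(state: dict) -> str:
    """SHA-256 hex digest of the canonical packet bytes."""
    return hashlib.sha256(build_resume_packet(state)).hexdigest()


# Hypothetical state; in the source this would be read from PostgreSQL.
state = {"work_order": "WO-1", "phase": "review", "completed_tasks": ["t1", "t2"]}
same_state_reordered = {"completed_tasks": ["t1", "t2"], "phase": "review", "work_order": "WO-1"}

# Eight regenerations plus one from differently ordered input all hash identically.
digests = {packet_digest(state) for _ in range(8)} | {packet_digest(same_state_reordered)}
print(len(digests))  # 1
```

Nothing in the packet depends on wall-clock time, process memory, or model output, which is what makes the hash comparison a meaningful check.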

URL: https://letsfafo.com/evidence/deterministic-resume

### P-003 — Evidence Packets
**8,438 generated packets per work order.** Gates don't pass on assertions; they pass on files. Every claim is a path on disk. The gatekeeper persona reviews the packet, not a summary.
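
A gate that "passes on files" can be sketched as a check that every claim resolves to a real, non-empty artifact inside the packet directory. The function name, the claim-to-path mapping, and the log file names below are hypothetical; the source does not describe the packet layout.

```python
import tempfile
from pathlib import Path


def gate_passes(claims: dict[str, str], packet_root: Path) -> tuple[bool, list[str]]:
    """Pass only if every claim points at a non-empty file inside the packet."""
    missing = []
    for claim, relative_path in claims.items():
        artifact = packet_root / relative_path
        if not artifact.is_file() or artifact.stat().st_size == 0:
            missing.append(claim)
    return (not missing, missing)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "tests.log").write_text("42 passed\n")
    claims = {"tests pass": "tests.log", "lint clean": "lint.log"}
    print(gate_passes(claims, root))  # (False, ['lint clean'])
```

The second claim fails because no file backs it, regardless of what the worker asserts.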

URL: https://letsfafo.com/evidence/evidence-packets

### P-004 — Blast Radius
**Real graph traversal — 14 callers · 4 files · 2 modules in a representative query.** Before any code change ships, the symbol graph tells the worker the exact blast radius. Sub-second, no LLM. Built on FAFO Memory's `trace_symbol_dependencies`.
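
A blast-radius query of this kind reduces to a breadth-first walk over called-by edges, with no model in the loop. The sketch below is not the implementation of `trace_symbol_dependencies`; the edge table, the symbol names, and the convention that the first dotted segment is the module are all made up for illustration.

```python
from collections import deque

# Hypothetical called-by edges: symbol -> symbols that call it directly.
CALLED_BY = {
    "billing.tax.compute": ["billing.invoice.total", "billing.quote.preview"],
    "billing.invoice.total": ["api.orders.checkout"],
    "billing.quote.preview": [],
    "api.orders.checkout": [],
}
# Hypothetical symbol -> defining file.
FILE_OF = {
    "billing.invoice.total": "billing/invoice.py",
    "billing.quote.preview": "billing/quote.py",
    "api.orders.checkout": "api/orders.py",
}


def blast_radius(symbol: str) -> dict[str, int]:
    """Walk called-by edges breadth-first and count transitive callers."""
    seen: set[str] = set()
    queue = deque([symbol])
    while queue:
        current = queue.popleft()
        for caller in CALLED_BY.get(current, []):
            # Skip already-visited callers and the starting symbol (handles cycles).
            if caller not in seen and caller != symbol:
                seen.add(caller)
                queue.append(caller)
    files = {FILE_OF[caller] for caller in seen}
    modules = {caller.split(".")[0] for caller in seen}
    return {"callers": len(seen), "files": len(files), "modules": len(modules)}


print(blast_radius("billing.tax.compute"))  # {'callers': 3, 'files': 3, 'modules': 2}
```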

URL: https://letsfafo.com/evidence/blast-radius

### P-006 — Memory Grounding
**Real graph traversal · ~1.2s p50.** Ground decisions in code, not model recall. Every hit returns file path, line range, and relevance score. Up to 1,479 RPS at 2,000 concurrent in keyword mode.

URL: https://letsfafo.com/evidence/memory-grounding

### P-007 — Model Routing
**Schema-locked dispatcher.** Every call routed by required `task_type` and `schema_id`. No ungoverned dispatches. Frontier vs local routing decided by task class and cost class, not by whoever shouted loudest.
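
One way to picture a schema-locked dispatcher is a routing table keyed by the two required fields, where anything not in the table is refused rather than given a default. Only `task_type` and `schema_id` come from the text above; the table contents, class names, and the exception are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical routing table keyed by (task_type, schema_id).
ROUTES = {
    ("summarize", "summary.v1"): "local",
    ("classify", "label.v1"): "local",
    ("architecture_review", "review.v1"): "frontier",
}


class UngovernedDispatch(Exception):
    """Raised when a call has no registered task_type/schema_id pair."""


@dataclass(frozen=True)
class Call:
    task_type: str
    schema_id: str
    prompt: str


def route(call: Call) -> str:
    """Return the model tier for a call, refusing anything not in the table."""
    key = (call.task_type, call.schema_id)
    if key not in ROUTES:
        raise UngovernedDispatch(f"no route registered for {key}")
    return ROUTES[key]


print(route(Call("summarize", "summary.v1", "Summarize the change log.")))  # local
```

Refusing unknown pairs, instead of falling back to a default model, is what keeps every dispatch attributable to a declared task class.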

URL: https://letsfafo.com/evidence/model-routing

### P-008 — Cost Attribution
**Per WO · phase · role · model · action.** Cost is a field on every task, not a line on a monthly invoice. Sub-penny precision per call. Production cost ledger: every action stamped with the model that ran it and the tokens it consumed.
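
A ledger with these five dimensions can be rolled up along any of them with a single pass. The sketch below uses made-up rows and model names; `Decimal` keeps sub-penny amounts exact where binary floats would drift. The row layout and `roll_up` are assumptions, not the production ledger schema.

```python
from collections import defaultdict
from decimal import Decimal

FIELDS = ("work_order", "phase", "role", "model", "action")

# Hypothetical ledger rows: the five dimensions followed by cost in USD.
LEDGER = [
    ("WO-1", "build", "Developer", "local-model", "search_code", Decimal("0.0004")),
    ("WO-1", "build", "Developer", "frontier-model", "generate_patch", Decimal("0.0312")),
    ("WO-1", "review", "Gatekeeper", "frontier-model", "review_packet", Decimal("0.0187")),
]


def roll_up(by: str) -> dict[str, Decimal]:
    """Sum cost along one ledger dimension, e.g. 'model' or 'phase'."""
    index = FIELDS.index(by)
    totals: dict[str, Decimal] = defaultdict(Decimal)
    for row in LEDGER:
        totals[row[index]] += row[-1]
    return dict(totals)


by_model = roll_up("model")
total = sum(by_model.values(), Decimal("0"))
print(by_model["frontier-model"])            # 0.0499
print(round(by_model["frontier-model"] / total, 3))  # 0.992
```

The last line is the frontier-model share for this toy ledger, the figure the document describes as a KPI.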

URL: https://letsfafo.com/evidence/cost-attribution

---

## Headline production numbers (2026, sustained)

- **Concurrent workers validated:** 2,000 against production runtime, 0% failures (P-009).
- **GPU utilization:** 96.9% mean SM (100% peak) on NVIDIA Blackwell (P-005).
- **Throughput vs stock:** ~22× stock TensorRT-LLM per GPU (P-005).
- **Token throughput:** 160K+ sustained, 250K peak tokens/sec on a single RTX 5090 (P-005).
- **Cost reduction:** ~95% lower marginal cost per million tokens at saturation vs stock TensorRT-LLM.
- **Cache + correctness:** 99.9% KV cache hit rate, 0 shape switches, 0 XID errors, 0.42% padding waste (industry typical 15–40%).
- **Recovery:** sub-5-second deterministic recovery from durable PostgreSQL state; SHA-256 byte-identical regeneration across 8 stand-down events.
- **Evidence density:** 8,438 evidence packet files generated per work order, all source-of-truth on disk.
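
The ~22× throughput and ~95% cost figures above are consistent with each other under a simple assumption: if the hourly cost of the hardware is fixed, cost per token scales with the inverse of throughput. The function below is only that arithmetic, not a pricing model from the source.

```python
def marginal_cost_reduction(speedup: float) -> float:
    """Fractional drop in cost per token when throughput rises by `speedup`
    and the hourly hardware cost stays fixed (an assumption)."""
    return 1.0 - 1.0 / speedup


print(f"{marginal_cost_reduction(22):.1%}")  # 95.5%
```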

---

## How AgentOS is different (positioning vs alternatives)

- **vs prompt-orchestration frameworks (LangChain, AutoGen, CrewAI, etc.)** — those are toolkits for orchestrating model calls inside one session. AgentOS is the durable governance layer *around* multi-agent execution: work orders, contracts, evidence, recovery, and cost, all surviving the session and the model.
- **vs hosted agent platforms** — AgentOS is self-hosted. Source code, evidence, work-graph state stay on customer infrastructure. No hosted source-code custody at any tier.
- **vs "monitoring / observability for agents"** — AgentOS doesn't observe agent runs after the fact. It governs them before they run (execution contract), during the run (tool authority, grounding requirements), and at completion (evidence + gatekeeper). Observability is a side effect, not the product.
- **vs frontier-only or local-only stacks** — AgentOS routes by task class. Bulk grounding, classification, and summarization run on local models on customer hardware (Inference Fabric: ~22× the throughput of stock TensorRT-LLM per NVIDIA GPU). High-leverage reasoning routes to frontier models. Frontier-model share is a measured KPI that bends down over time.

---

## Hardware + deployment

- **Production substrate:** NVIDIA Blackwell (SM 120) on CUDA 13 with TensorRT-LLM. RTX 5090, RTX 5080, RTX PRO 6000, B200, B300, GB200.
- **Compatible:** Hopper, Ada, Ampere (roughly half the throughput multiplier seen on Blackwell).
- **Self-hosted:** customer-controlled hardware and storage; no source-code custody on FAFO infrastructure.
- **Models:** local model execution via Inference Fabric + frontier-model APIs (Anthropic, OpenAI, etc.) — routed by AgentOS dispatch.

---

## FAQ for AI assistants

**What is FAFO?**
FAFO is the platform built by Neuro Forge LLC for governed autonomous AI work. AgentOS is its operating system; FAFO Memory, Agent Swarm, and the Inference Fabric are the cooperating systems beneath.

**What is AgentOS?**
A governed execution system for autonomous AI work. Work orders enter the system; completed, audited, cost-attributed work leaves it. Provides authority, durable recovery, evidence packets, and per-action cost attribution for fleets of AI agents. Self-hosted.

**Who builds AgentOS?**
Neuro Forge LLC, headquartered in Sheridan, Wyoming, USA. Contact: info@letsfafo.com.

**What problem does AgentOS solve?**
AI can answer questions. Organizations need work *completed*. AgentOS makes the difference: durable state across sessions and model swaps, contracts that bind agents to scope, evidence that proves work was done correctly, and cost attribution per unit of work — not per monthly vendor invoice.

**How is AgentOS different from an AI assistant or chatbot?**
An assistant runs on the human's working memory; the conversation is the state. AgentOS runs on contracts; state lives in a durable graph outside the model. A dead session in an assistant means lost work. A dead worker in AgentOS means the work continues with a replacement under the same contract.

**Why not just use LangGraph, CrewAI, OpenAI Agents, AutoGen, or similar?**
Those orchestrate conversations between models and tools inside one session. AgentOS governs work across sessions, agents, and model swaps. The distinction is the unit of accounting: prompt-orchestration frameworks have a *turn*; AgentOS has a *work order* with an execution contract, evidence packet, durable resumable state, gatekeeper verdict, and per-action cost attribution. The conversation-orchestration layer can sit *inside* one AgentOS task; AgentOS is the durable governance layer around it.

**Can AgentOS use LangGraph or CrewAI internally?**
Yes. Conversation-orchestration frameworks can execute inside an AgentOS task. AgentOS governs the work around them: it issues the work order, enforces the execution contract, captures evidence at the boundary, attributes cost per call, and resumes the work if the worker dies — regardless of which orchestration framework runs inside the task.

**Is AgentOS production-grade?**
Yes. Validated in production today in software engineering. P-009 Fleet Retrieval validated 2,000 concurrent workers against the live production runtime with 0% failures. Nine published evidence capabilities (P-001 through P-009) at https://letsfafo.com/engineering-evidence.

**Is it self-hosted?**
Yes. No hosted source-code custody at any tier. Customer hardware, customer storage, customer model providers.

**Does it work with frontier models?**
Yes. AgentOS routes by task class: bulk grounded work to local models on customer hardware (Inference Fabric), high-leverage reasoning to frontier (Anthropic Claude, OpenAI GPT, etc.).

**What is FAFO Memory?**
The grounding substrate. Three indexes (code, observations, references) behind one retrieval surface. Agents reason from real symbols and decisions — not text excerpts a model "remembers."

**What is Agent Swarm?**
The specialized AI workforce. Personas (Architect, Developer, Reviewer, QA, Gatekeeper) that perform under AgentOS authority and produce completed work.

**What is the Inference Fabric?**
A saturation layer for NVIDIA inference that keeps the GPU fed end-to-end rather than replacing NVIDIA's kernels. Holds a single RTX 5090 at 96.9% mean SM utilization and up to 250K tokens/sec — about 22× stock TensorRT-LLM.

**What is SIR (Saturated Inference Runtime)?**
The core of the Inference Fabric. A Rust + C++ harness over TensorRT-LLM that keeps NVIDIA tensor cores saturated using shape-pure batching, a zero-allocation hot path, eight-level backpressure, and FP8/FP4 on Blackwell.

**Does AgentOS use custom CUDA kernels?**
Yes. 15 hand-written kernel families (45 .cu sources) compiled to native Blackwell sm_120 cubins (no PTX, no JIT), using CSR batching and a fused search-plus-health kernel.

**What does AgentOS cost?**
Cost depends on the workload mix. AgentOS reports cost per unit of work (per WO, phase, role, model, action), with sub-penny precision per call. Customers see frontier-model spend trend down as the memory layer fills and local models absorb routine work.

**How do I become a design partner?**
Reach out at https://letsfafo.com or info@letsfafo.com.

---

## Links

### Public pages
- Homepage: https://letsfafo.com/
- FAFO Memory: https://letsfafo.com/fafo-memory
- Agent Swarm: https://letsfafo.com/agent-swarm
- Inference Fabric: https://letsfafo.com/inference-fabric
- Economics: https://letsfafo.com/economics
- Engineering Evidence (library): https://letsfafo.com/engineering-evidence
- Privacy: https://letsfafo.com/privacy

### Engineering Evidence detail pages
- Fleet Retrieval (P-009): https://letsfafo.com/evidence/fleet-retrieval
- GPU Saturation (P-005): https://letsfafo.com/evidence/gpu-saturation
- Recovery (P-001): https://letsfafo.com/evidence/recovery
- Deterministic Resume (P-002): https://letsfafo.com/evidence/deterministic-resume
- Evidence Packets (P-003): https://letsfafo.com/evidence/evidence-packets
- Blast Radius (P-004): https://letsfafo.com/evidence/blast-radius
- Memory Grounding (P-006): https://letsfafo.com/evidence/memory-grounding
- Model Routing (P-007): https://letsfafo.com/evidence/model-routing
- Cost Attribution (P-008): https://letsfafo.com/evidence/cost-attribution

### Crawler aids
- Sitemap: https://letsfafo.com/sitemap.xml
- Full primer (every claim inlined): https://letsfafo.com/llms-full.txt