Triskelion: Agents as Documents
The problem with AI agents isn't that they're hard to build. It's that they're impossible to see into.
You launch an agent, it starts spinning — tool calls, memory writes, LLM queries, internal state updates — and then either it finishes, or it doesn't. When it doesn't, you have logs. Maybe. If you remembered to add them. And those logs tell you what happened in sequence, but not what the agent believed about the world at the moment it made a bad decision. You can't pause mid-run and SQL-query the agent's reasoning state. You can't replay a specific conversation from saved context. You can't restart a failed agent and have it pick up exactly where it left off, because "exactly where it left off" was in process memory that's now gone.
This is why I built Triskelion differently.
The Core Idea: Agent State Is a Subtree
In Triskelion, an agent is not a running process. It's a subtree of documents in a PostgreSQL database.
Everything the agent knows, everything it has decided, every tool call it has made and every result it received — all of it lives in the database as rows in a document hierarchy. The agent doesn't hold state; the database does. The services that do the actual work (LLM reasoning, tool execution, semantic matching) are stateless workers that read from and write to that hierarchy. They can crash and restart. They can be scaled horizontally. The agent's "state" survives any of it, because the agent's state is data.
The document tree for a session looks like this:
```
portal (id: 100)
└── session (id: 200)
    ├── agent_instance (id: 300)
    │   └── references agent_card (id: 50)
    └── messages/
        ├── user_message (id: 400)
        ├── query_response (id: 401)   ← LMQL reasoning output
        ├── tool_call (id: 402)
        ├── tool_result (id: 403)
        ├── query_response (id: 404)   ← Next reasoning iteration
        └── agent_message (id: 405)    ← Response to user
```
That's not pseudocode for illustration purposes. That's an actual document hierarchy in PostgreSQL. You can query it with SELECT. You can filter by usetype. You can join tool calls against tool results by parent_id. The agent's entire cognitive history — what it was asked, what it reasoned, what tools it called, what answers it got, how it responded — is a set of rows that you can inspect with any database client you already know.
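To make that concrete, here is a toy version of the same idea — a single documents table with parent_id links, queried the way the article describes. This is a minimal sketch using sqlite3 instead of PostgreSQL, with column names borrowed from the queries shown later in this article; it is not Triskelion's actual schema.

```python
import sqlite3

# Toy document hierarchy: one table, parent_id links.
# Column names (usetype, parent_id, content) follow the SQL queries
# later in this article; the real Triskelion schema is assumed.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE documents (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER,
        usetype   TEXT,
        content   TEXT
    )
""")
rows = [
    (200, None, "session",       ""),
    (400, 200,  "user_message",  "find docs about Python"),
    (402, 200,  "tool_call",     "vdo_search(query='Python')"),
    (403, 402,  "tool_result",   "3 documents found"),
    (405, 200,  "agent_message", "I found 3 documents."),
]
conn.executemany("INSERT INTO documents VALUES (?, ?, ?, ?)", rows)

# Join tool calls against their results via parent_id, exactly as
# described above.
pairs = conn.execute("""
    SELECT d1.content, d2.content
    FROM documents d1
    JOIN documents d2 ON d2.parent_id = d1.id
                     AND d2.usetype = 'tool_result'
    WHERE d1.usetype = 'tool_call'
""").fetchall()
print(pairs)
```

The point of the sketch is that the agent's history is ordinary relational data: any client that can run a JOIN can inspect it.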
How We Got Here: Three Generations
This architecture didn't arrive fully formed. It's the third generation of a system that started as local_docs_embedding in May 2025 — a simple embedding pipeline that stored chunked documents with pgvector for semantic search. It became vdo_frontend in August 2025, which developed the hierarchical document model (the VDO — Vector Doctree Ontology) and pushed it to 109 commits as a standalone library. Then in November 2025, Triskelion absorbed the lessons from both predecessors and added the agent orchestration layer on top.
225 commits in 6 weeks. November 21 to December 30, 2025. That's the number people notice when they look at the project history.
I'm not citing it as a vanity metric. I'm citing it because it reflects something specific: the architecture held. When an architecture is fundamentally wrong, development velocity collapses as you fight it. You spend your time working around design decisions instead of building on them. The 225-commit sprint happened because every major architectural decision made the next problem easier to solve, not harder. The document model gave us observability for free. The stateless services gave us fault tolerance without effort. The task queue gave us orchestration without a framework.
The Parts That Actually Do the Work
Three core modules, layered with a hard dependency constraint:
```
        vdo_core  (zero external dependencies)
          ↑                    ↑
          │                    │
  dispatcher_core          agent_core
```
vdo_core is the foundation. It owns the PostgreSQL schema, the document CRUD operations, the vector storage (1024-dimensional ModernBERT embeddings via pgvector), the knowledge graph (subject-predicate-object triples), and the task queue. It imports nothing from the other modules. This isn't a soft preference — it's an architectural constraint enforced by the module structure.
Both dispatcher_core and agent_core depend only on vdo_core. They communicate with each other not by calling each other's APIs, but by reading and writing VDO documents. If I want the dispatcher to know that an agent needs a GPU service started, it writes a task into the task queue. The dispatcher reads from the queue. No direct coupling.
This matters for the same reason good API design matters. Circular dependencies are where systems go to die slowly.
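The queue-mediated handoff between modules can be sketched in a few lines. This uses sqlite3 for portability; the status column, its values, and the task name are illustrative assumptions, not Triskelion's real queue schema.

```python
import sqlite3

# Sketch of queue-mediated coordination: agent_core enqueues a task row,
# dispatcher_core later claims it. Neither module calls the other.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_queue (id INTEGER PRIMARY KEY, task_type TEXT, status TEXT)"
)

def enqueue(task_type):
    """Written by agent_core: add a pending task row."""
    conn.execute(
        "INSERT INTO task_queue (task_type, status) VALUES (?, 'pending')",
        (task_type,),
    )

def claim_next():
    """Read by dispatcher_core: claim the oldest pending task.
    In PostgreSQL you would typically guard this with
    SELECT ... FOR UPDATE SKIP LOCKED for concurrent workers."""
    row = conn.execute(
        "SELECT id, task_type FROM task_queue "
        "WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE task_queue SET status = 'claimed' WHERE id = ?", (row[0],))
    return row[1]

enqueue("start_gpu_service")   # agent_core asks for a GPU service
print(claim_next())            # dispatcher picks it up → start_gpu_service
print(claim_next())            # queue drained → None
```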
The Execution Loop: Documents All the Way Down
Here's how a user message becomes an agent response:
- User message is written to the document tree as a chat document
- A PostgreSQL NOTIFY trigger fires: a document_created event with the document ID and type
- The Workflow Coordinator (listening on that channel) creates an lmql_reasoning task in the task queue
- The LMQL Service picks up the task, builds context from the document tree, runs constrained generation via LMTP, and writes a query_response document
- That write fires another NOTIFY trigger
- The Workflow Coordinator sees query_response and creates a tool_matching task
- The Classification Service picks it up, runs the ModernBERT cross-encoder against the available tools, and writes a tool_selection document
- Another NOTIFY. The Workflow Coordinator creates an arg_generation task.
- LMQL generates the argument list and writes a tool_call document
- The Tools Service executes the tool and writes a tool_result document
- NOTIFY fires again. The loop continues until the agent calls wait().
Every step creates a document. Every document is observable. The loop is driven entirely by PostgreSQL triggers and a task queue — no event bus, no message broker, no ZeroMQ. The infrastructure that was already there for "store documents in a database" is doing double duty as the agent orchestration layer.
Why Not LangChain, AutoGPT, or Semantic Kernel?
The honest answer is that I've used all of these and they all share the same failure mode: they think the right place to manage agent state is in Python.
LangChain and LlamaIndex treat agents as running processes with event loops. State lives in memory. Conversation history is a Python list. Tool results are Python objects that get passed to the next function call. This works fine until it doesn't — until you need to pause mid-run, until you need to debug what the agent was thinking three steps ago, until you need to resume a failed run from a checkpoint. At that point, you're serializing state to JSON and hoping you got everything.
LangGraph is the current state of the art in this space and it's a real improvement: it encodes agent state as a typed dictionary flowing through a compiled directed graph, and it has checkpointing built in. If you're building on top of an existing LangChain application, LangGraph is a meaningful upgrade. But the graph is still compiled Python. The checkpointer serializes state to a store — SQLite or Redis or Postgres — as a snapshot at each node boundary, not as structured first-class objects you can query relationally. You can restore a run, but you can't ask "how many times did the planner node choose tool A vs tool B across all runs last week" with a JOIN.
AutoGPT's approach is similar in spirit, just more opinionated: the agent is a process, memory goes to a vector store, tool calls happen in-process, and the whole thing is orchestrated by polling. The vector store adds persistence, but it's second-class — the canonical state is still in the process.
Triskelion flips this. The database is canonical. Processes are workers. Here's the comparison that actually matters:
| Feature | Triskelion | LangChain/LangGraph | AutoGPT |
|---|---|---|---|
| Agent state | PostgreSQL documents | Process memory | Process memory |
| Debugging | SELECT queries | Log parsing | Log parsing |
| Replay | Native (read docs) | Complex (re-run) | Not supported |
| Fault recovery | Restart any service | Agent restart = state loss | Partial |
| Coordination | NOTIFY triggers | Event loops / compiled graphs | Polling |
| Multi-agent | Native document sharing | Explicit graph edges | Not designed |
| State durability | Committed to DB on creation | Serialized on request | Serialized on request |
The thing that's different isn't just the storage location. It's that SQL becomes your debugging interface. When something goes wrong, I don't read log files — I run:
```sql
-- What was the agent's last reasoning step?
SELECT content AS reasoning,
       structured_content->>'task_description' AS next_task
FROM documents
WHERE usetype = 'query_response'
ORDER BY create_ts DESC
LIMIT 1;

-- What tool calls happened and what did they return?
SELECT d1.structured_content->>'tool' AS tool,
       d1.structured_content->>'args' AS args,
       d2.content AS result
FROM documents d1
JOIN documents d2 ON d2.parent_id = d1.id
                 AND d2.usetype = 'tool_result'
WHERE d1.usetype = 'tool_call'
ORDER BY d1.create_ts DESC;
```
That's the state of the world at any moment. Not filtered through a logging framework, not reconstructed from stack traces — the actual decision history, queryable in real time.
Two Technical Innovations Worth Naming
The architecture is the main story, but there are two technical decisions inside it that I'd build into any future agent system:
LMQL for constrained generation. Standard prompting for tool calls is unreliable. The LLM might return valid JSON, or it might return almost-valid JSON with a trailing comma, or it might return a tool call wrapped in markdown fencing, or it might return an explanation of what it would call instead of actually calling it. Every one of those variants requires parsing logic that will eventually fail in production.
LMQL enforces constraints during token generation, not after. The output is structurally correct by construction:
```python
@lmql.query
async def reasoning_query(...):
    '''lmql
    argmax
        "Reasoning: [REASONING]"
    where
        STOPS_AT(REASONING, "\n\n") and
        len(TOKENS(REASONING)) < 200
    return {"reasoning": REASONING.strip()}
    '''
```
The model cannot produce a REASONING longer than 200 tokens, and generation stops at the first \n\n; those constraints are enforced at the sampling level. There's no regex, no JSON parsing, no try-except around the output. This is the approach I should have been using for two years before I found it.
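The mechanics can be illustrated outside LMQL with a toy decode loop that enforces the same two constraints during generation. This is a sketch of the idea of sampling-time constraints, not of LMQL's actual implementation:

```python
def constrained_decode(next_token, max_tokens=200, stop="\n\n"):
    """Toy sampling-time constraint enforcement: the stop sequence and
    the token budget are checked inside the decode loop, so an output
    that violates them can never be produced in the first place."""
    out = []
    while len(out) < max_tokens:
        out.append(next_token())
        if stop in "".join(out):
            break
    # The loop guarantees a clean cut at the stop sequence (or budget).
    return "".join(out).split(stop)[0]

# Fake "model" emitting one character per token, then rambling on.
stream = iter(list("The user wants a search.\n\nIGNORED TRAILING TEXT"))
print(constrained_decode(lambda: next(stream)))
# → The user wants a search.
```

Because the check happens per token, the invalid tail is never generated at all, which is the property that makes post-hoc parsing unnecessary.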
ModernBERT cross-encoder for tool selection. The naive approach to tool routing is keyword matching: the agent's output contains "search", so you call the search tool. This breaks on synonyms ("find", "look up", "query"), paraphrasing ("can you check if there are any docs about..."), and anything slightly indirect.
A cross-encoder handles all of that. ModernBERT-large scores each (task description, tool description) pair with cross-attention, giving you similarity scores across the full semantic field:
```
Task: "find documents about Python"

vdo_search (Search VDO for documents): 0.89   ← selected
send       (Send message to user):     0.12
wait       (Wait for user input):      0.05
```
The model runs locally with GPU acceleration — no EnsoNet dependency, no cloud call. Separately, for the embedding pipeline, CPU ModernBERT runs 24/7 for real-time document ingestion; the dispatcher only spins up the GPU variant for bulk operations.
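Once the cross-encoder has scored each pair, selection is an argmax with a floor. In this sketch the 0.5 threshold and the fall-back-to-wait behavior are my assumptions, not documented Triskelion behavior:

```python
def select_tool(scores: dict[str, float], threshold: float = 0.5) -> str:
    """Pick the highest-scoring tool; fall back to wait() when nothing
    clears the threshold. Threshold value and fallback are assumptions."""
    best_tool, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_tool if best_score >= threshold else "wait"

# Scores from the example above.
scores = {"vdo_search": 0.89, "send": 0.12, "wait": 0.05}
print(select_tool(scores))  # → vdo_search
```

A floor like this matters because a cross-encoder always produces a ranking, even for tasks no tool can handle; without it, the agent would confidently call the least-wrong tool.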
What It Actually Enables
The agents-as-documents architecture makes some things trivially easy that are genuinely hard in process-based systems:
Checkpoint and resume. An agent that gets halfway through a long task and then the LMQL service crashes can be resumed exactly where it stopped. The services are stateless; the task queue shows what was in-flight; the document tree shows exactly what the agent knew. Restart the service, re-claim the task, continue.
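The resume path can be sketched as a single recovery step on restart: reset anything left in-flight, then claim it again. The single-table layout and status values here are illustrative assumptions (sqlite3 stands in for PostgreSQL):

```python
import sqlite3

# Sketch of crash recovery. A task was claimed by a service that died;
# on restart, in-flight work is reset to pending and re-claimed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_queue (id INTEGER PRIMARY KEY, task_type TEXT, status TEXT)"
)
conn.execute(
    "INSERT INTO task_queue (task_type, status) VALUES ('lmql_reasoning', 'claimed')"
)

# The service crashed mid-task. Because the services themselves are
# stateless, recovery is just a status reset: the document tree already
# holds everything the agent knew at the moment of the crash.
conn.execute("UPDATE task_queue SET status = 'pending' WHERE status = 'claimed'")

pending = conn.execute(
    "SELECT task_type FROM task_queue WHERE status = 'pending'"
).fetchall()
print(pending)  # → [('lmql_reasoning',)]
```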
Conversation replay. Feed the existing document tree to a test service and replay the reasoning, with or without executing real tool calls. This is how you regression-test agent behavior: run the same document context through an updated LMQL service and check whether the tool selection changed.
Multi-agent coordination without a framework. If Agent A needs to ask Agent B a question, A writes a chat document into B's session. B's NOTIFY listener fires, the orchestration loop kicks off, and A gets a tool_result back when B finishes. No RPC layer. No message broker. Documents.
Full observability by default. There's no "add logging to debug this." Every action creates a document. Every document has a timestamp, a type, a parent, and content. You can always ask "what did the agent do between 18:30 and 18:35" and get a complete answer.
Where the Sharp Edges Are
Phase 1 is complete. Phase 1.5 — the full tool execution loop with multiple reasoning iterations and proper wait() termination — is in progress as of the last commit. This is honest: the architecture is proven, the core pipeline is working, and the loop termination is the next engineering problem.
The testing situation is typical for a six-week sprint: the integration test passes, but coverage is thin. For something that's going to run production multi-agent workflows, the enrichment pipeline (NER, entity extraction, relationship mining) needs more hardening. It had race conditions that were fixed in the commit history, and it had an infinite loop bug that was found and patched. Neither of those is a mark against the architecture — they're the bugs you expect in a rapidly-built pipeline. But they're also the reminder that the infrastructure is not yet production-hardened.
Where It Goes
The long-term vision is the convergence point for the whole ecosystem. Voice notes, processed into VDO documents. Logseq and Obsidian exports, flowing into the knowledge graph. YouTube transcripts, chunked and embedded. Code documentation, pulled in on demand. All of it searchable with hybrid BM25 + semantic search, all of it queryable by agents through the same tool calling pipeline.
The architecture that makes this possible isn't complicated. It's the combination of two old ideas: store everything as data (not processes), and let event-driven triggers handle coordination (not polling). PostgreSQL has been doing both of these things since before most current LLM frameworks existed.
The interesting part is that applying those old ideas to agent state, specifically, turns out to unlock a lot. Observability you get for free. Fault tolerance you get for free. Replay you get for free. Multi-agent coordination is just document writes.
225 commits says it works. The next 225 will tell us where it scales.
This article was scaffolded with backblog.