OpenXyz: my agentic harness experiments, and why I stopped building it
Written with AI assistance, pulling from the ~190 journal entries I kept while building OpenXyz.
For a few months, I was building a side project called OpenXyz, an agent harness for agentic workflows. I got it working, ran it as my own assistant for a while, and then stopped. This is what it was, and why I stopped, which mostly comes down to realizing the harness isn’t where the value is.
What is OpenXyz?
OpenXyz is an agent harness for agentic workflows, not a coding tool. The idea is a personal assistant, a chief-of-staff or a janitor or a researcher, backed by one shared agent session that you talk to wherever you already are. The same agent on Telegram, in a terminal, in Slack, rather than another app you have to go open.
The agent lives in a filesystem it can rewrite. Tools, skills, sub-agents, and the channel adapters are all just files in a project directory, and the agent can write its own tools that take effect on the next turn.
openxyz start boots a single process: one agent loop, fed by whatever channels the template declares.
It worked end to end. I ran the janitor template as my own chief-of-staff and used it most days.
Motivation
I didn’t start with a thesis. After Claude Code, building an agent harness became the default side project for anyone paying attention, and I’m not immune to a default. opencode, OpenClaw, a dozen framework repos, everyone converging on roughly the same shape at the same time. So I built one too. The honest version is that I stumbled into it the same way most engineers did, by reaching for the obvious thing to build.
The one decision I made deliberately was to point it at my own life instead of code. I wanted something for the day to day: a personal assistant for my own tasks, a bot my wife and I could both talk to, a shared one in the group chat with friends, something to quietly handle the family logistics. That’s where the names came from, “OpenBrain” for the personal assistant, “OpenFamily” for the shared household one, and OpenXyz for the harness they all run on.
As with any project in a crowded space, I started by reading the prior art. I keep a shelf of clean shallow clones of the upstreams worth reading, opencode, OpenClaw, Hermes, Codex, the Vercel AI SDK, and a short reference note on each for where its agent loop lives and how it handles compaction. Steal patterns, not code. Reading a dozen of them next to each other is also the fastest way to see what everyone has quietly agreed on, and where they still disagree because nobody really knows yet.
What’s actually in an agent loop
The first thing building one teaches you is that the agent loop is small. The model does the work, and these days the SDK hands you the loop itself. What separates a toy from something you would run is the unglamorous reliability you wrap around it, and almost none of it shows up in a demo.
Take termination. I didn’t really write the loop: the AI SDK’s tool-loop agent owns the natural stop and ends the turn when the model replies with no tool calls. What I wrote is the guardrail around it: a hard step budget so a runaway turn can’t go on forever, and a final-step guard that, on the last step of the budget, makes the model put its tools down and write a clean reply instead of getting cut off mid-tool-call.
import { ToolLoopAgent, stepCountIs } from 'ai';

const MAX_STEPS = 100; // normal turns finish in 1-15; 100 means something is wrong

// `model` and `tools` are defined elsewhere in the harness
new ToolLoopAgent({
  model,
  tools,
  stopWhen: stepCountIs(MAX_STEPS), // hard runaway stop
  prepareStep: ({ stepNumber }) => {
    // every step before the last one runs with the default settings
    if (stepNumber < MAX_STEPS - 1) return undefined;
    return {
      toolChoice: 'none', // last step: no more tools, just wrap up
      system: [
        {
          role: 'system',
          content: "You've hit the step budget. Summarise what you did and reply without calling more tools.",
        },
      ],
    };
  },
});

The piece I found most interesting was subtler. By default the agent gets rebuilt every turn from a lossy view of its own past, only the text it showed the user, not the tool calls and results behind that text. It can see what it did but not how it got there. So the agent keeps its own ledger, separate from what gets displayed. It’s just a list of messages, user and assistant turns with the tool-call and tool-result pairs in between.
That ledger would balloon fast. A single calendar pull or web fetch can be tens of kilobytes, and left alone it sits in the prompt forever. So on every write the older tool outputs get pruned to a stub, while the call id and tool name stay, so the model still sees the shape of the loop, the call happened and returned something, just without the bytes.
[
  { "role": "user", "content": "who am I having lunch with this week?" },
  {
    "role": "assistant",
    "content": [
      { "type": "text", "text": "Let me check your calendar." },
      {
        "type": "tool-call",
        "toolCallId": "call_1",
        "toolName": "mcp__calendar_read",
        "input": { "range": "this_week" }
      }
    ]
  },
  {
    "role": "tool",
    "content": [
      {
        "type": "tool-result",
        "toolCallId": "call_1",
        "toolName": "mcp__calendar_read",
        "output": { "type": "text", "value": "[pruned 18423 bytes — re-run the tool if you need this output]" }
      }
    ]
  },
  { "role": "assistant", "content": "Lunch with Wei on Tuesday and your wife on Friday." }
]

It’s cheaper than compaction because there’s no model call, and it keeps the prompt prefix stable, which is exactly what prompt caching wants. The freshest output survives one turn, since the agent just watched it stream. Two tools, skill and delegate, are never pruned, because their output is context the agent is actively working from rather than a result it can re-fetch. And if the model really does need a pruned result, it just runs the tool again.
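The pruning pass can be sketched in a few lines. This is a minimal illustration rather than the OpenXyz implementation: the type names, the `pruneLedger` function, and the rule that everything after the last user message counts as the freshest turn are all assumptions, and the stub text only mirrors the shape of the one in the ledger excerpt.

```typescript
// Hypothetical message shapes, modelled on the ledger excerpt.
type TextPart = { type: 'text'; text: string };
type ToolCallPart = { type: 'tool-call'; toolCallId: string; toolName: string; input: unknown };
type ToolResultPart = {
  type: 'tool-result';
  toolCallId: string;
  toolName: string;
  output: { type: 'text'; value: string };
};

type LedgerMessage =
  | { role: 'user'; content: string }
  | { role: 'assistant'; content: string | Array<TextPart | ToolCallPart> }
  | { role: 'tool'; content: ToolResultPart[] };

// Tools whose output is working context rather than a re-fetchable result.
const NEVER_PRUNE = new Set(['skill', 'delegate']);
const STUB_PREFIX = '[pruned ';

function pruneLedger(ledger: LedgerMessage[]): LedgerMessage[] {
  // Assumption: everything from the last user message onward is the freshest turn.
  let lastUser = -1;
  ledger.forEach((message, index) => {
    if (message.role === 'user') lastUser = index;
  });

  return ledger.map((message, index): LedgerMessage => {
    // Only tool results from earlier turns are candidates for pruning.
    if (message.role !== 'tool' || index >= lastUser) return message;
    return {
      role: 'tool',
      content: message.content.map((part): ToolResultPart => {
        const value = part.output.value;
        // Skip protected tools and outputs that are already stubs.
        if (NEVER_PRUNE.has(part.toolName) || value.startsWith(STUB_PREFIX)) return part;
        const bytes = new TextEncoder().encode(value).length;
        // Call id and tool name stay; only the payload is replaced.
        return {
          ...part,
          output: { type: 'text', value: `[pruned ${bytes} bytes, re-run the tool if you need this output]` },
        };
      }),
    };
  });
}
```

Because already-stubbed outputs are skipped, running the pass on every write is idempotent, so older messages stop changing once they have been pruned.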
Once that’s in place, compaction handles what pruning can’t reach, the conversation itself.
The part I like is that compaction isn’t a subsystem. There’s no special summariser engine, compact is just an agent, defined by a markdown file with frontmatter, exactly like every user-facing one. The frontmatter sandboxes it to read-only filesystem access and a small toolset, and the body is its system prompt.
---
description: Summarize a conversation into goal, discoveries, accomplishments, pending, and references
filesystem: read-only
model: auto
tools:
  bash: true
  read: true
  glob: true
  grep: true
---
You are a compaction agent. Your only job is to summarize a conversation into a
dense, preservation-focused summary that another agent will read instead of the
raw transcript.

When a session outgrows the model’s input budget, the runtime spins that agent up, feeds it the older turns, and writes the summary back in their place, keeping the last two user turns verbatim so the current reply doesn’t lose its thread. The same agent runs again mid-turn, for when a single turn’s own tool-loop blows past the context ceiling. Both paths fail open: if the summariser errors, the turn proceeds with the oversized prompt rather than dying.
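The between-turns path can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the OpenXyz code: `compactIfNeeded`, `tailStart`, the `overBudget` predicate, and the choice to write the summary back as a single system message are all hypothetical, and the `summarise` callback stands in for running the compact agent.

```typescript
type Message = { role: 'user' | 'assistant' | 'tool' | 'system'; content: string };

// Hypothetical stand-in for invoking the compact agent on the older turns.
type Summarise = (older: Message[]) => Promise<string>;

const KEEP_USER_TURNS = 2;

// Index where the last `keep` user turns begin; 0 if there are fewer than that.
function tailStart(messages: Message[], keep: number): number {
  let seen = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === 'user' && ++seen === keep) return i;
  }
  return 0;
}

async function compactIfNeeded(
  messages: Message[],
  overBudget: (messages: Message[]) => boolean,
  summarise: Summarise,
): Promise<Message[]> {
  if (!overBudget(messages)) return messages;

  const cut = tailStart(messages, KEEP_USER_TURNS);
  if (cut === 0) return messages; // nothing older than the kept tail

  try {
    const summary = await summarise(messages.slice(0, cut));
    // Assumption: the summary replaces the older turns as one system message.
    return [{ role: 'system', content: `Summary of earlier conversation:\n${summary}` }, ...messages.slice(cut)];
  } catch {
    // Fail open: proceed with the oversized prompt rather than killing the turn.
    return messages;
  }
}
```

Returning the original array untouched on failure is what makes the path fail open: the caller cannot tell a skipped compaction from a failed one, and in both cases the turn simply continues.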
Channels were their own swamp of edge cases, and I lost real weeks to Telegram in particular, but that part is plumbing. The point is that the agentic core, the bit everyone is racing to build, just isn’t that big. That eventually became the reason I stopped.
Over-engineering past the point
Once the loop was settled, the hard problems all moved outward, into the substrate the loop runs on. That is the honest reason the project kept growing, and the point where the depth stopped paying for itself.
It started as a hard fork of opencode, for the TUI and the session infrastructure. Eleven iterations in, the fork cost more than it returned, between browser build conditions breaking imports, flaky TTY handoff, an Effect runtime that turned plain code into ceremony, and a surgical dead-code dance to stay mergeable with upstream. So I cut it out and rebuilt clean on Bun and the AI SDK. Then I spent a month making the whole thing serverless before admitting it was the wrong target. Every Vercel and Cloudflare Workers headache traced back to the same cause, forcing a stateful agent into a runtime designed for stateless isolates, so I deleted that too. The runtime became a container running openxyz start, no build step and nothing platform-specific. Easily the best decision I made on it.
That decision then opened a door I should have left shut. The in-process shell I had been running was [just-bash](https://github.com/vercel-labs/just-bash), a sandboxed bash that executes against a virtual filesystem instead of the real one. The part of it I actually liked is that mounts compose, and any mount can be wrapped read-only. So the deployer decides how much of itself the agent can touch: mount the harness, its instructions and skills and tools, read-only and the agent can read how it works but never rewrite it; mount it read-write and it can edit its own tools; expose a single context file and that’s the one thing it’s allowed to change. A read-only wrapper that throws EACCES on every write is the whole enforcement.
// any filesystem, wrapped read-only: reads pass through, writes throw
class ReadOnlyFs implements IFileSystem {
  // the wrapped filesystem; reads are delegated to it
  constructor(private readonly inner: IFileSystem) {}

  readFile(path: string) {
    return this.inner.readFile(path);
  }
  // eacces() is a small helper, not shown, that builds the EACCES error
  async writeFile() {
    throw eacces('writeFile');
  }
  async rm() {
    throw eacces('rm');
  }
  // ...every other mutation throws EACCES too
}

But once the container itself is the sandbox, an in-process bash simulating a filesystem is redundant, and the agent should just get a real shell with the real CLIs. So just-bash had to go, and I talked myself into the grand version: everything is a file. Session as an append-only log. GitHub, Linear and Notion mounted as directories you cat and grep like any folder. Memory as a per-mount file that keeps itself up to date. To make it real I started a second project in Rust, a filesystem daemon with FUSE and NFS mounts, provider processes talking over a socket, and schema-checked writes so a malformed issue fails before it ever hits the API.
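The schema-checked write is the easiest of those pieces to illustrate. The sketch below is in TypeScript to match the other code in this post, not the Rust of the actual daemon, and everything in it is hypothetical: the `Issue` fields, the JSON file format, `SchemaError`, and the `push` callback standing in for the provider's API call.

```typescript
// Hypothetical schema for an issue file; the real field set is not shown in this post.
type Issue = { title: string; state: 'open' | 'closed'; labels: string[] };

class SchemaError extends Error {}

// Validate the raw bytes of a write before anything is sent upstream.
function parseIssue(raw: string): Issue {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    throw new SchemaError('not valid JSON');
  }
  if (typeof data !== 'object' || data === null) throw new SchemaError('expected an object');

  const { title, state, labels } = data as Record<string, unknown>;
  if (typeof title !== 'string' || title.trim() === '') {
    throw new SchemaError('title must be a non-empty string');
  }
  if (state !== 'open' && state !== 'closed') {
    throw new SchemaError('state must be "open" or "closed"');
  }
  if (!Array.isArray(labels) || !labels.every((label) => typeof label === 'string')) {
    throw new SchemaError('labels must be an array of strings');
  }
  return { title, state, labels: labels as string[] };
}

// Write handler: `push` (the API call) only runs if validation passed.
async function writeIssue(raw: string, push: (issue: Issue) => Promise<void>): Promise<void> {
  const issue = parseIssue(raw); // throws before `push` is ever called
  await push(issue);
}
```

The ordering is the whole point: the write fails locally, as an error on the file write itself, and the remote API never sees the malformed payload.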
The one thing that filesystem couldn’t do is the thing it most wanted to: you can’t turn grep into semantic search at the filesystem layer. POSIX has no search verb. From the mount’s side, grep -r foo /mnt is just a flurry of read(file, offset, count) calls returning bytes, while the word foo lives in grep’s memory and never reaches the filesystem at all. I learned this reading SMFS, which wants exactly that and pulls it off without touching the filesystem. It ships a grep shim on your PATH: inside one of its mounts, a flagless grep pattern calls a semantic search API, anything with a flag like grep -n pattern execs the real grep, and outside a mount it’s plain grep untouched. Flagless is semantic, flagged is literal, and that split is the whole UX. The filesystem does nothing; the shim does all the work.
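The shim's behaviour comes down to one small routing decision. What follows is a minimal sketch of that decision, not SMFS's actual code: the function names, the use of the working directory to detect a mount, and the example paths are all assumptions made for illustration.

```typescript
type GrepRoute = 'semantic' | 'literal';

// Hypothetical check: is `cwd` equal to, or nested under, one of the mount roots?
function insideMount(cwd: string, mountRoots: string[]): boolean {
  return mountRoots.some((root) => {
    const base = root.endsWith('/') ? root.slice(0, -1) : root;
    return cwd === base || cwd.startsWith(base + '/');
  });
}

// `args` is everything that followed `grep` on the command line.
function routeGrep(args: string[], cwd: string, mountRoots: string[]): GrepRoute {
  // Outside a mount the shim stays out of the way: plain grep, untouched.
  if (!insideMount(cwd, mountRoots)) return 'literal';
  // Any flag at all (-n, -r, --color, ...) means the caller wants real grep.
  const hasFlag = args.some((arg) => arg.startsWith('-') && arg.length > 1);
  return hasFlag ? 'literal' : 'semantic';
}
```

A real shim would then exec the system grep for the literal route and call the search API for the semantic one; for example, `routeGrep(['auth bug'], '/mnt/smfs/notes', ['/mnt/smfs'])` takes the semantic route, while adding `-n` to the arguments sends it back to real grep.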
It is good systems engineering work. But it is also a very long way from a person talking to an assistant, many degrees of separation from agentic workflows.
Why I stopped
Build the same project as everyone else, and you tend to land on the same realization as everyone else. IKEA effect.
Reading all those harnesses next to each other, what stood out is how much of what they carefully build is dissolving underneath them. Two of the ones I studied had hand-rolled retry logic that the AI SDK now ships for free. The clever loop, the compaction, the cost tracking, the platform keeps swallowing it. I was polishing the exact layer the model and the SDK are busy commoditizing. A harness like OpenXyz, or OpenClaw, or any of them, has no moat on its own. It still needs a network effect to matter as a product, and a harness has none.
The clearest proof landed right after I stopped, when Vercel shipped Eve. “The framework for building agents”, where an agent is a directory of Markdown instructions and TypeScript tools, one codebase deploying across Slack, Discord, WhatsApp, Linear and the web, durable by default so it can park between messages and resume on delivery. That is the omni-channel agentic assistant I had been building, shipped by the people who already own the model gateway, the sandbox, and the runtime it sits on. The harness didn’t get commoditized in the abstract. It showed up as a platform feature in roughly my exact shape.
There is a second-order point in that. If standing up an agent is this easy, nobody serious is going to buy a run-of-the-mill one. They will build their own, shaped to how they actually work, and let it grow with them. The value was never the harness, it is the customization, the slow fitting of an agent to one specific context. That points away from the usual SaaS shape and toward something more like an agency, where agents are grown around a business instead of sold to everyone as the same product.
And then there is the harness itself, which my last note already half-knew. The whole idea of “harness fit”, carefully tuning the scaffold around the model, assumes the model is weak and needs the scaffold. A model like the Mythos/Fable class doesn’t. A good enough agent is adaptive, and given a shell it will spin up and arrange its own harness on the fly. The elaborate, opinionated one I was building optimizes for exactly the thing that is about to go away.
So I stopped. Not because it didn’t work, it did, but because I had built a careful answer to a question that is coming apart. The code is still up at github.com/fuxingloh/openxyz if you want to read it, with the usual caveat that it’s experimental and now abandoned.