Signal to Noise
A few weeks ago I published The Context Problem Nobody Talks About With Multi-Agent Teams. That piece was about why multi-agent teams (in particular, human teams with multiple agents) fall apart, and it doubled as a soft pitch for Mycelium, the project I lead at work. Mycelium is a system for sharing context between agents, and the centerpiece is something we call semantic negotiation: a structured back-and-forth where two or three or four agents with asymmetric context, asymmetric goals, or asymmetric incentives actually hash it out before they act. It works, and it's quite interesting.
But there's something else I find interesting: these ... patterns of behavior I see in myself, my coworkers, my teammates.
The sheer volume of stuff being produced has cranked way, way up. Not even just code, but reports, analyses, documents... all the things that drive day-to-day work. There's just ... a LOT more of it, and the heuristics I used to use to tell signal from noise have been quietly eroding underneath it. Detailed bug reports used to mean something. Long PR descriptions used to mean something. Design docs, PowerPoint presentations, dataflow diagrams used to mean that someone sat down for three or four hours to think really hard about a specific problem and come up with an elegant solution.
None of that is reliably true anymore.
The auditing problem
Ostensibly, AI is about letting more people do more things. People are shipping work this year they couldn't have shipped last year, and that's honestly great. I've upskilled a lot in my own career by being able to take my existing skillset and apply it to areas that I'm less familiar with.
But the friction we removed was in some ways doing a second job as a quality gate.
When you remove the friction, you also remove what came packaged with it: the implicit certification that the person on the other end had to acquire some minimum competence to produce the artifact at all.
What replaces that, if anything, is the audit. The work of figuring out whether a piece of work is real. Whether it's bullshit. Whether the person who sent it understood what they were sending.
Auditing is much harder to scale than pure AI generation. Generation can be automated, and auditing, so far, mostly can't.
When the loop breaks
We've been trying to hook Mycelium up to OpenClaw, and frankly OpenClaw is a hot mess. There are so many ways to deploy it and so many ways to configure it. If you're just running it as a personal agent it's fine, but when you're building a complex multi-agent coordination system on top, the variance is super hard to reason about. Predictably, we've been getting a lot of bug reports, and what we see in those reports is interesting.
We have a lot of users installing and running Mycelium who have less context or technical capability than the agent they're operating. The agent generates an answer. The user can't actually evaluate the output, so the agent has full leeway to do whatever it wants. It edits the wrong config files, it hallucinates fixes, it monkey-patches in awkward and unscalable ways. The user ends up in this flow where they're just copying and pasting errors into the chat over and over.
Eventually a bug report lands against the project. It's AI-written, polished, articulate, and confidently specific about what's wrong. The catch is that the actual problem could be living anywhere: the agent's misconfiguration, the user's setup, a real Mycelium bug, an OpenClaw quirk, a new OpenClaw version behavior, or some interaction across all of them. The report tells me where the user thinks the bug is. The audit is the real man-hours required to figure out where it actually is.
None of this is malicious and the people doing it are genuinely trying to help. They want the project to succeed. The bug reports are real artifacts of real time spent, and they are often genuinely useful.
But each one needs real due diligence before I can act on it. The surface used to tell me a lot: was the report careful, was the language specific, did the person clearly poke at the thing. But that stuff is gone. Every report now demands the same time investment whether it's the careful one or the throwaway one.
I'm not exempt from this either
Everything I just described is also true of me.
When I run semantic negotiation tests, the output is ten to twenty pages of agent logs per run. I run a baseline condition without negotiation alongside the test condition with it, and I will absolutely just dump those in front of a teammate and say "here lol, check this out" like it's nothing. The logs contain the inputs, the outputs, the accept/reject/counteroffer trace, the agents' own reasoning. The information is real, but jesus christ, there could be any amount of noise in there and neither I nor my teammate could feasibly validate all of that data.
Twenty pages of dense log is basically a wall, and almost no one will read it carefully. Even if you pipe it through your agent and ask for a one-to-five score on the efficacy of the system, the judge inherits the same audit problem the human reviewer would have. I think it's inarguably better than having humans read all of the outputs and reason about success metrics themselves, but whether the LLM-as-judge score is meaningful depends on whether the model is looking for, catching, and appropriately surfacing the various issues within a huge, amorphous blob of text.
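For concreteness, here's roughly what that judge pass looks like. This is a sketch, not Mycelium code: `call_model` is a stand-in for whatever model client you actually use, and the prompt and score format are illustrative only.

```python
# Minimal LLM-as-judge sketch over a baseline/negotiated log pair.
# call_model() is a placeholder; wire it to your model client of choice.
import json
from dataclasses import dataclass


@dataclass
class RunPair:
    """One experiment's raw material: the same task, with and without negotiation."""
    baseline_log: str    # full agent log, negotiation disabled
    negotiated_log: str  # full agent log, negotiation enabled


JUDGE_PROMPT = (
    "You are scoring two multi-agent runs of the same task.\n"
    "Run A had no negotiation step; Run B negotiated before acting.\n\n"
    "Score each run from 1 to 5 on whether the agents actually coordinated:\n"
    "shared the right context, resolved conflicts, acted consistently.\n"
    "For each score, quote the specific log lines that justify it.\n\n"
    "Reply with a JSON object with keys 'baseline' and 'negotiated', each\n"
    "holding an integer 'score' and a list of 'evidence' quotes.\n\n"
    "--- RUN A (baseline) ---\n{a}\n\n--- RUN B (negotiated) ---\n{b}\n"
)


def call_model(prompt: str) -> str:
    """Placeholder: call your model here and return its text reply."""
    raise NotImplementedError


def judge(run: RunPair) -> dict:
    raw = call_model(JUDGE_PROMPT.format(a=run.baseline_log, b=run.negotiated_log))
    return json.loads(raw)  # the evidence quotes are the part you still have to audit
```

Even asking the judge to quote its evidence only shrinks the blob you have to check; it doesn't make the audit go away.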
I've also built a documentation site for Mycelium with a giant copy button at the top of every page, because the docs are not really for humans. They're for agents. I'm explicitly telling my users don't read this, just dump it into your terminal.
I have switched my development flow to being very GitHub-centric and sometimes I'll write a GitHub issue like it's my Magnum Opus -- I'll sit with Claude and craft exactly the issue, exactly the design spec, we'll audit and cross-reference everything that the new feature or bugfix will touch. Aaaaand, then sometimes I'll be in the middle of work, I'll spot a bug, then I'll just open a new Ghostty tab and be like "CLAUDE IM GETTING THIS ERROR, PLS FILE AN ISSUE" and Claude just writes slop and I just hit the send button without a care in the world.
The hardest part isn't the throwaway artifacts. It's that the careful ones and the throwaway ones look nearly identical from the outside.
The three-hour issue and the five-second issue have the same voice, the same structure, the same length. There's no visible feature that distinguishes them. And this stays true even when the person producing them is competent and well-intentioned, because volume is the actual constraint of modern work. You cannot put three hours into every artifact. AI lets you produce the five-second version, and the five-second version looks just as serious as the three-hour version did last year.
So competent, well-intentioned people lose the ability to signal their own care. Care is no longer visible on the surface of the work.
A small thing I started doing
I wrote about this Webex skill in the last piece. When I have an AI agent post on Webex under my name, I make it prefix the message with [claude-code-agent].
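If you want the shape of it, it's nothing more than a wrapper in front of the posting call. This is a sketch with made-up function and parameter names, not the actual skill:

```python
# Tag every agent-authored message before it goes out under my name.
AGENT_PREFIX = "[claude-code-agent]"


def tag_agent_message(body: str) -> str:
    """Prefix an agent-authored message so the reader knows what they're reading."""
    if body.startswith(AGENT_PREFIX):
        return body  # don't double-tag on retries
    return f"{AGENT_PREFIX} {body}"


def post_as_agent(send, room_id: str, body: str) -> None:
    """`send` is whatever Webex posting call you already have; this just wraps it."""
    send(room_id, tag_agent_message(body))
```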
It's a small move, and I don't know what my coworkers think about it, but I like it because it's honest. It tells the reader, before they spend any cognitive energy parsing the message, what kind of artifact they're looking at. They can choose how much weight to give it. They can throw it away as a hallucination or they can use it as context for their own agents.
It also restores, in a small way, what bylines used to do. When my actual name is on a comment, I feel pressure to stand behind the content. I don't want to be the person in a meeting who, when asked how a feature works, has to admit I don't know because the AI wrote the comment.
I have absolutely had moments where someone asks me how a piece of my own system works and I have to go ask Claude. The pattern from the OpenClaw story is also the pattern in my own codebase, on bad days. I'm not immune. The [claude-code-agent] prefix is a small bit of honesty I can share with the world: "I was only somewhat involved in the output of this message."
Auditing all the way down
This pattern of sifting through crap for the outlier is basically my entire job.
My day job is Mycelium, which involves going through agent logs and monitoring the behavior of autonomous systems. I spend half my time writing new features and half diagnosing why the agents sometimes simply refuse to coordinate.
My side project is Polaris, a platform for student interviews: we transcribe and tag them, run them through a multi-pass pipeline, and produce reports that surface the quotes that matter.
My recent weekends are spent auditing my co-founder's Claude Code output on Polaris. He handles the pedagogy side, but the report-building process requires a significant amount of web development, and Claude Code has given him both the ability to ship features and the ability to make a lot of mistakes in a domain he doesn't have direct judgment for.
It's the same problem every time. Which bug report is real, which agent log shows actual success, which student quote is the one, which task is safe to route to a cheaper model and which is mission-critical. The bottleneck is always the audit. And auditing is always bound to someone with enough domain familiarity to do it. Which is the one thing the systems we're building now cannot manufacture.
What I want to do next
I don't have a great answer yet, honestly.
What I want to do with Mycelium, in the next few months, is start working seriously on noise reduction rather than noise generation. I don't fully know what that looks like yet. I think it's largely an unsolved problem, and most of the obvious paths are paths I'm skeptical of.
Knowledge graphs are an obvious candidate. I'm skeptical for the same reason I'm skeptical of any tertiary memory store: they go stale. The graph drifts away from the system it represents and you end up auditing a third artifact alongside the first two.
Agentic curation is the other obvious candidate. I'm even more skeptical there, and the more I use these tools the more skeptical I get. If you let an agent curate your dataset, the curation is just one more AI artifact you can't verify: you've effectively let the noise become the dataset.
Spec-driven development is the version of this I see promoted most in engineering circles, and it suffers from the same problem. The spec doesn't compile. The spec doesn't run. The spec drifts the moment the code moves. You end up with two artifacts to keep in sync, both of which can be AI-generated, neither of which is grounded in execution, and your job becomes verifying the text matches the code and the code matches the text. For most projects, a CLAUDE.md and a wiki and clean commits do the job. Adding a spec layer on top is solving the wrong problem in the wrong direction.
I think the answer, whatever it ends up being, lives in systems that tighten and audit existing outputs rather than produce more of them. I don't know exactly what those look like yet. But I'm pretty sure that's where the interesting work is, and I'm excited to go figure it out.