Eating our own Dogfood

What breaks first when you actually use your own product

Tim Jordan · March 16, 2026 · 5 min read

Every component in the system passed its tests. The cognitive pipeline ran cleanly in development, memory retrieval returned relevant results, and model routing dispatched requests to the right models. Everything worked.

Then we started running real ventures on it, and "works" turned out to have a lot of gradations we hadn't considered.

The context management problem

The first thing to break wasn't a component; it was the interaction between components. Each module in the context assembly pipeline was doing its job correctly: memories were being retrieved, knowledge was being retrieved, tool descriptions were being loaded, and all of it was relevant.

The problem was volume, and it was a problem I hadn't anticipated. Every module added context, and none of them knew how much context the others were adding. By the time all 11 modules had run, the context window was saturated, and the agent had so much information that it couldn't reason effectively about any of it.

In testing we'd used short conversation histories and limited knowledge bases. In production, the venture had months of accumulated memories, hundreds of knowledge documents, and 46 available tools. The math simply didn't work.

That's when we learned that constraints aren't the enemy of good AI; they're the prerequisite. We built the 80% token budget ceiling and the gradient compression system to enforce it. The system now actively prioritizes what goes into the context window and cuts the least relevant information rather than trying to include everything. It was a humbling lesson: more information doesn't mean better reasoning, curated information does.
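To make the budget-ceiling idea concrete, here's a minimal sketch of what "prioritize, then cut" can look like. The item structure, relevance scoring, and greedy selection are illustrative assumptions, not our actual pipeline; only the 80% ceiling comes from the text.

```python
# Hypothetical sketch of a token-budget ceiling over assembled context.
# ContextItem fields and the greedy strategy are made up for illustration.
from dataclasses import dataclass

@dataclass
class ContextItem:
    text: str
    relevance: float  # 0.0-1.0, higher means more relevant to the task
    tokens: int

def assemble_context(items: list[ContextItem], window_tokens: int,
                     ceiling: float = 0.8) -> list[ContextItem]:
    """Fill the context up to `ceiling` of the window, most relevant
    items first; anything that doesn't fit is simply dropped."""
    budget = int(window_tokens * ceiling)
    selected, used = [], 0
    for item in sorted(items, key=lambda i: i.relevance, reverse=True):
        if used + item.tokens <= budget:
            selected.append(item)
            used += item.tokens
    return selected
```

The point of the sketch is the inversion: instead of every module appending freely, a single allocator decides what survives, so the total can never exceed the ceiling.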

The model cost surprise

Our model routing system worked exactly as designed: expensive models handled hard tasks, cheap models handled easy ones, and the routing logic was sound.

What we didn't anticipate was how many tasks the system classified as "hard." The routing was correct in each individual case, but in aggregate the cost was higher than expected, because our initial calibration of what counts as "hard" was too generous.

This is a subtle bug, the kind that hides in aggregation. No single request was misrouted; the overall pattern was just expensive. We had to recalibrate the tier thresholds, push more tasks to the local models, and start tracking cost per conversation turn to make the economics work at scale.
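The recalibration amounts to moving tier boundaries and measuring cost per turn. A toy version, assuming a difficulty score in [0, 1]; the tier names, thresholds, and per-token prices here are placeholders, not our real numbers:

```python
# Illustrative tier routing with per-turn cost tracking.
# Thresholds and prices are invented for the example.
TIERS = [
    # (max_difficulty, model, usd_per_1k_tokens)
    (0.4, "local-small", 0.0),
    (0.7, "mid-tier", 0.002),
    (1.0, "frontier", 0.03),
]

def route(difficulty: float) -> tuple[str, float]:
    """Pick the cheapest tier whose difficulty ceiling covers the task."""
    for ceiling, model, price in TIERS:
        if difficulty <= ceiling:
            return model, price
    return TIERS[-1][1], TIERS[-1][2]

def turn_cost(difficulty: float, tokens: int) -> float:
    """Cost of one conversation turn under the current calibration."""
    _, price = route(difficulty)
    return price * tokens / 1000
```

Lowering the "hard" threshold (the 0.7 here) is exactly the kind of recalibration that shifted more traffic onto the local models; tracking `turn_cost` per conversation is what made the aggregate pattern visible in the first place.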

The memory relevance problem

Memory retrieval was technically accurate: when the agent searched for memories related to "customer onboarding," it got memories related to customer onboarding. The problem was that "related to customer onboarding" returned 47 memories. Some were from last week, some from three months ago; some were tangentially relevant, some central.

The agent treated them all with roughly equal weight. A decision made three months ago in a different context got surfaced alongside a conversation from yesterday, and the agent had no strong sense of which memories were relevant to the current situation versus which ones just happened to contain similar words.

This pushed us to rethink memory weighting: not just "is this memory semantically similar to the current query?" but "how recent is it, how significant was the original context, and how many times has it been referenced since?" We're still iterating on this, and I don't think we've cracked it yet.

The tool awareness gap

I mentioned this in a previous article, but it bears repeating because it was one of the more surprising failures. The agent had access to a set of tools and was capable of using them, but its awareness of which tools were available and when to use them was inconsistent.

Sometimes the agent would attempt to do manually what a tool could have handled in seconds; other times it would reference a tool that wasn't in its current permission set. The tool discovery module we built was a direct response to this problem: it uses semantic search to surface relevant tools based on the current task.
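A toy version of that discovery step, to show the shape of it. The real module presumably works over embedding vectors; here a bag-of-words cosine stands in, and the tool catalog is entirely made up:

```python
# Toy tool discovery by semantic search over tool descriptions.
# Bag-of-words cosine is a stand-in for real embedding similarity.
from collections import Counter
import math

TOOLS = {  # hypothetical tool catalog
    "send_email": "send an email message to a contact",
    "create_invoice": "create and send a customer invoice",
    "search_crm": "search customer records in the crm",
}

def _vec(text: str) -> Counter:
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def discover_tools(task: str, top_k: int = 2) -> list[str]:
    """Return the tool names whose descriptions best match the task."""
    q = _vec(task)
    ranked = sorted(TOOLS, key=lambda t: _cosine(q, _vec(TOOLS[t])),
                    reverse=True)
    return ranked[:top_k]
```

Surfacing only the top-k matching tools per task, rather than the full list of 46, is what keeps tool awareness from drowning in its own descriptions.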

But I wonder whether this is a deeper issue than tool indexing, and maybe it’s about how agents understand their own capabilities, and when you hire a person they know what they can do but an AI agent only knows what you’ve told it and “telling” it through tool descriptions and permission lists is a surprisingly lossy form of communication.

What this taught us about testing

The meta-lesson is that unit tests and integration tests are necessary but not sufficient for systems this complex. Every component worked in isolation, but the failures were emergent, arising from how components interacted under real-world load with real-world data volumes.

The only way to find these problems was to run real workloads: not simulated workloads with curated data, but real ventures, real conversations, real accumulated memory, and real cost pressures.

Dogfooding isn't just a product strategy; it's a testing strategy. For systems with this many interacting components, it might be the only testing strategy that actually finds the problems that matter.
