Coding agents like Claude Code are phenomenal at boilerplate generation, but less so at high-level system design. Flip that around to a data-first context and you get a powerful idea: an AI that excels at the grunt work of pattern detection, transformation suggestion, and rapid iteration, while the human focuses on domain rules and judgement.
That is precisely what we built with Prysm.
Prysm is an agentic data-cleaning system for messy CSV dumps. It operates column-first and moves through four phases: ingestion, profiling, planning, and execution. Each phase is split into a worker and a reviewer session, so humans remain in the loop at every decision point. The system is fully auditable and can generate novel Python code when a transformation is too specific for the built-in functions.
Below, I’ll share what I learned from building a production-ready agentic system for real-world data, not toy datasets.
Memory, memory, memory
AI agents are like kids: they forget where they put their homework, they’re about to leave the house without a jacket, they ask the same question a million times and then immediately forget the explanation. For kids, this is a feature of a developing brain. For agents, it’s a consequence of one simple fact: their memory systems are terrible. Every agent developer runs into the same brick wall:
- Orchestration can’t reliably track progress across columns.
- Agents loop in endless self-talk.
- Information discovered in ingestion mysteriously evaporates by the time you reach execution.
- “Long-term memory” is often just a polite euphemism for vibes.
Very quickly, it became obvious that naïve agentic systems are not yet ready for prime time. Despite the toy demos (flight-booking agents, trip planners, etc.) floating around in ADK samples, reality is harsher.
Instead of running the entire cleaning pipeline inside one giant shared-memory session, we took the opposite approach:
- Each phase runs in its own isolated session.
- Every agent invocation receives a tailored memory state, containing only what is relevant for that task.
- Sessions do not bleed into each other. No accidental long-term context leakage.
This single design choice dramatically reduced agent drift, side quests, and nonsense behaviour. The more context you give an agent, the harder it becomes for it to correctly identify its task.
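To make "tailored memory state" concrete, here is a minimal sketch of the idea (the names `PipelineMemory` and `build_session_state` are illustrative, not our actual code): the orchestrator owns the structured memory and seeds each fresh session with only the slice that invocation needs.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class PipelineMemory:
    """Structured memory owned by the orchestrator, never by the agents."""
    column_profiles: dict[str, dict[str, Any]] = field(default_factory=dict)
    cleaning_plans: dict[str, dict[str, Any]] = field(default_factory=dict)

def build_session_state(memory: PipelineMemory, phase: str, column: str) -> dict[str, Any]:
    """Assemble the minimal state a single agent invocation is allowed to see.

    Each invocation gets a fresh, isolated session seeded with only the
    artifacts relevant to its task; nothing from earlier sessions leaks in.
    """
    if phase == "profiling":
        return {"column": column}
    if phase == "planning":
        return {"column": column, "profile": memory.column_profiles[column]}
    if phase == "execution":
        return {"column": column, "plan": memory.cleaning_plans[column]}
    raise ValueError(f"unknown phase: {phase}")
```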
Agents do the local reasoning, but they do not decide the global flow. The actual pipeline progression is controlled by a deterministic state machine that:
- Tracks progress across all columns
- Stores phase-to-phase learnings in a structured pipeline memory
- Decides when to re-loop a worker
- Decides when human review is required
- Decides when to advance to the next phase
So in essence, we get a deterministic state machine with extremely narrow intelligence at specific stages.
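Stripped of detail, that control loop looks something like this sketch (the hooks `run_worker`, `needs_rework`, `needs_human_review`, `ask_human`, and `memory.store` are hypothetical stand-ins): the agents answer narrow questions inside `run_worker`, while the loop owns every progression decision.

```python
from enum import Enum, auto

class Phase(Enum):
    INGESTION = auto()
    PROFILING = auto()
    PLANNING = auto()
    EXECUTION = auto()
    DONE = auto()

def run_pipeline(columns, memory, run_worker, needs_rework, needs_human_review, ask_human):
    """Deterministic orchestrator: agents reason locally, this loop decides globally."""
    phase = Phase.INGESTION
    while phase is not Phase.DONE:
        for column in columns:
            result = run_worker(phase, column, memory)    # isolated agent session
            while needs_rework(result):                   # re-loop the worker if needed
                result = run_worker(phase, column, memory)
            if needs_human_review(phase, result):         # pause for the reviewer
                result = ask_human(phase, column, result)
            memory.store(phase, column, result)           # structured phase-to-phase learning
        phase = Phase(phase.value + 1)                    # advance only when all columns are done
```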
Only do one thing, and do it well
Another issue with LLM agents is that the more tasks you give them, the higher their error rate climbs. If you tell your agent to look at context, call tools, and output into a structured format, the last mile (the structured format) almost always fails. What we learned is that it is much safer to split agents into "sub-agents" and assemble them into a small assembly line using the sequential-agent approach.
For instance, the profiling worker agent is a sequential agent made of two sub-agents. The first reads the context for a given column and profiles it via different tools to detect nulls, string-formatting issues, numerical outliers, and so on. The second takes all of this "research" and puts it into a rigidly structured JSON format, making the outputs across all columns identical, as sketched below.
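In ADK terms, the pattern looks roughly like this. The instructions, tool stubs, model string, and `ColumnProfile` schema are all illustrative rather than our production prompts; the point is the shape: a tool-using researcher feeding a tool-free formatter.

```python
from pydantic import BaseModel
from google.adk.agents import LlmAgent, SequentialAgent

class ColumnProfile(BaseModel):
    """Illustrative schema: every column ends up with an identically shaped output."""
    null_fraction: float
    format_issues: list[str]
    outliers: list[str]

# Hypothetical tool stubs; the real versions inspect the actual dataframe.
def detect_nulls(column: str) -> dict: ...
def detect_format_issues(column: str) -> dict: ...
def detect_outliers(column: str) -> dict: ...

# Sub-agent 1: free-form research with tools, no structured-output burden.
researcher = LlmAgent(
    name="column_researcher",
    model="gemini-2.0-flash",
    instruction="Profile the given column. Use the tools to detect nulls, "
                "string formatting issues, and numerical outliers.",
    tools=[detect_nulls, detect_format_issues, detect_outliers],
    output_key="research_notes",
)

# Sub-agent 2: no tools, one job only: coerce the research into the fixed schema.
formatter = LlmAgent(
    name="profile_formatter",
    model="gemini-2.0-flash",
    instruction="Rewrite the research in {research_notes} as a ColumnProfile.",
    output_schema=ColumnProfile,
    output_key="column_profile",
)

profiling_worker = SequentialAgent(
    name="profiling_worker",
    sub_agents=[researcher, formatter],
)
```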
To facilitate human interaction, a review agent then takes over. This is a mini-orchestration agent (controlling the local flow within the phase) that reads the JSON (the structured research) together with the column context and presents the findings to the user. The user can now tell the review agent what to interpret differently, how to resolve potential problems, and so on. The review agent will then either make the change directly or call the worker-agent flow to correct things.
As in standard software development, this separation of concerns and responsibility makes for a much more stable system: the agents are not overwhelmed with choices but have very niche functions that, when composed into a whole, feel more "intelligent".
Novel tasks, learning system
While the agentic system already has a large toolbox of standard data transformations, it cannot cover every possible data-quality issue. Adding more tools would likely end up confusing the agents, as it becomes harder and harder to figure out which tool to use in a given situation: should it use trim whitespace, or trim whitespace but not tabs, and so on?
What is really neat about agents is that they can produce code and run it. So apart from the series of standard transformations, we gave the agents the ability to suggest a custom code solution: spell out what the code should do, what the input looks like, and what the expected result should be. Similar to how you would instruct your AI coding assistant.
This turned out to be immensely powerful. The agents were now able to address novel problems, reason about them, suggest code, test the code, iterate again if it failed, and save successful code snippets for future use.
That essentially provides a single mechanism for the agentic system to adapt to a specific data domain, by creating more and more bespoke transformations based on the data it encounters.
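Concretely, the custom-code tool takes a specification rather than raw code. Here is a hypothetical sketch of what the agent has to spell out (`CustomTransformSpec` and the phone-number example are invented for illustration); generated code is only saved to the snippet library once it reproduces every expected output.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CustomTransformSpec:
    """What the agent must spell out before any code is generated."""
    description: str              # what the code should do
    input_examples: list[str]     # what the input looks like
    expected_outputs: list[str]   # what the result should be

def accept_transform(spec: CustomTransformSpec, transform: Callable[[str], str]) -> bool:
    """Accept a generated transform only if it reproduces every expected output."""
    return all(
        transform(value) == expected
        for value, expected in zip(spec.input_examples, spec.expected_outputs)
    )

# The agent proposes a spec; generated code is tested against it, retried on
# failure, and saved for reuse on similar columns once it passes.
spec = CustomTransformSpec(
    description="Normalise Danish phone numbers to +45XXXXXXXX",
    input_examples=["45 12 34 56 78", "+45-1234-5678"],
    expected_outputs=["+4512345678", "+4512345678"],
)
```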
Audit and transparency
Debugging agentic systems is painful due to their stochastic nature. It is almost impossible to trigger the same output twice, so we explicitly made our system rely on structured outputs: each column in each phase results in a JSON-formatted output describing the agent's insights and planned actions. This makes debugging much easier, as you don't have to worry about the entire conversation trace, but can focus on one completely defined output variable. It also has the benefit of making the entire system auditable. You can open a given column file for a given phase and see exactly what the agent learned, what cleaning tools it plans to use, and how the actual execution of those tools changed your data.
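For a flavour of what these artifacts look like, here is a hypothetical per-column, per-phase record (the field names and tool names are illustrative):

```python
import json
from pathlib import Path

# Illustrative audit artifact for one column in the planning phase.
record = {
    "column": "order_date",
    "phase": "planning",
    "insights": ["3 distinct date formats detected", "12 null values"],
    "planned_actions": [
        {"tool": "parse_dates", "params": {"target_format": "%Y-%m-%d"}},
        {"tool": "fill_nulls", "params": {"strategy": "flag_for_review"}},
    ],
}

# One completely defined output variable per column per phase: debugging means
# reading these files, not replaying a stochastic conversation trace.
out_dir = Path("audit/planning")
out_dir.mkdir(parents=True, exist_ok=True)
(out_dir / "order_date.json").write_text(json.dumps(record, indent=2))
```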
Roadmap
Next on the roadmap is row-centric data cleaning to handle standard issues like duplicated or nearly identical entries, and possible inference of missing cells.
In line with inference of missing data, many real-world datasets suffer from incomplete data that could be inferred from outside sources, for instance a zip code when you have the rest of the address. Enabling the agentic system to connect to other resources will greatly increase its ability to do novel missing-data replacement.
What I learned, what have you learned?
These patterns—isolated sessions, single-responsibility agents, deterministic orchestration—aren’t specific to data cleaning. They’re what it takes to make agentic systems reliable today. If you’re building something similar, I’m curious what’s working for you.