Using AI to build your AI foundation

A barrier to entering AI is data. You might not need massive amounts of data, depending on your use case, but you will need “clean” data. That means entities are encoded consistently, null values are represented the same way everywhere, everything in a string column is actually a string, and so on.

For many companies this represents a huge barrier and a potentially large upfront investment in getting legacy data into shape. Cost is not the only driver: often you lack the internal resources (data engineers), the infrastructure, and simply the time away from the day-to-day operations that keep the lights on.

However, most companies are eager to get their AI pilots rolling. So why not let the first pilot be data cleaning? You will learn exactly how hard it is to get agentic AI to work, and you’ll get a sense of your data quality and data access restrictions. And at the end of the pilot you’ll have a system that builds the foundation for all future AI pilots.

Building a data refinery

While manual data cleaning is expensive and requires both training and experience, data cleaning can also be viewed as a finite search space problem: for any given data set there is a limited number of data types, and a limited number of valid operations on those types to bring them from an unclean to a clean state.

As an easy example, consider phone numbers. All US phone numbers have the same length, but they are often written differently: (555) 123 4567 or 555-123-4567 or 555.123.4567, or some other variation with or without the country code. However, there is an ITU standard for international phone numbers (E.164), and for any given variation of a phone number there is a finite set of transformations needed to get to that format. Hence, for an AI system it is a matter of finding the correct chain of operations for each data entry to reach that format.

It does not take more than a few hundred rows of phone numbers and a few different entry formats before this becomes a labor-intensive task for a human. For an AI system, however, it is as simple as “reasoning” (pattern matching) over the different variations, suggesting operations, and observing whether the end result adheres to the correct format.
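
To make that concrete, here is a minimal sketch of such a normalization in Python, assuming US numbers and a default +1 country code. The function name and the exact rules are illustrative, not a complete E.164 implementation:

```python
import re

def normalize_us_phone(raw: str, default_country_code: str = "1") -> str | None:
    """Normalize a messy US phone number to E.164 (+1XXXXXXXXXX), or return None."""
    digits = re.sub(r"\D", "", raw)           # strip everything that is not a digit
    if len(digits) == 11 and digits.startswith(default_country_code):
        digits = digits[1:]                   # drop an explicit leading country code
    if len(digits) != 10:                     # anything else fails the length check
        return None
    return f"+{default_country_code}{digits}"

# The same finite set of transformations covers all the common variants:
for raw in ["(555) 123 4567", "555-123-4567", "555.123.4567", "+1 555 123 4567"]:
    print(raw, "->", normalize_us_phone(raw))
```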

Enter Agents

This reason-and-act pattern, known as ReAct, is exactly what makes AI agents (LLMs with tool use) so powerful. They do a reasoning step (pattern recognition) and then act by executing a given tool. For instance:

  • I am about to clean the column “phone”.
  • It is likely a phone number – let me test how many different variants of phone number I see.
  • I see three different patterns.
  • Now let me create three different chains of cleaning operations to get all phone variants into the same format,
  • and execute that cleaning pattern on the correct rows.
  • Let me finally check if all phone numbers are now following the correct pattern.

This is a series of relatively simple reasoning and acting steps. Such simple steps can be made extremely powerful if you are able to break down each data domain, or data type, into a finite set of operations that logically follow each other.
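
As a rough sketch, the loop behind those steps can be as small as this. Here `llm_reason` and the tool registry are placeholders for whatever model client and tools you use, not any particular framework’s API:

```python
# Minimal sketch of a ReAct-style loop: the model proposes a tool call,
# we execute it, feed the observation back, and stop when it declares success.
def react_loop(task: str, tools: dict, llm_reason, max_steps: int = 10):
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        decision = llm_reason(history)        # e.g. {"tool": "count_patterns", "args": {...}} or {"done": True}
        if decision.get("done"):
            return history
        tool = tools[decision["tool"]]        # act: run the chosen tool
        observation = tool(**decision.get("args", {}))
        history.append(f"Action: {decision['tool']} -> Observation: {observation}")
    return history
```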

For numerical data we often want min and max values and the range, to find outliers that could represent bad data entries, or floating-point values in a column that should contain only integers, and so on.

The trick is to equip the agent with a diverse enough toolbox to handle all of these use cases, but in a simple enough format that there is no ambiguity about what a given tool does and when it should be used. Once that is established you have your data profiling agent.
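
To illustrate, a few profiling tools might look like the sketch below. The names and the pattern heuristic are purely illustrative; the point is that each tool does one unambiguous thing and says so in its docstring, which doubles as the description handed to the agent:

```python
import re
from statistics import mean

def count_value_patterns(values: list[str]) -> dict[str, int]:
    """Group string values by their character pattern (digits -> 9, letters -> A)."""
    counts: dict[str, int] = {}
    for v in values:
        pattern = re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", v))
        counts[pattern] = counts.get(pattern, 0) + 1
    return counts

def numeric_summary(values: list[float]) -> dict[str, float]:
    """Return min, max, mean and range so obvious outliers stand out."""
    return {"min": min(values), "max": max(values),
            "mean": mean(values), "range": max(values) - min(values)}

def non_integer_count(values: list[float]) -> int:
    """Count entries that are not whole numbers in a column expected to hold integers."""
    return sum(1 for v in values if v != int(v))
```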

A similar approach can be used for the actual cleaning steps in your pipeline, that is, the sequence of operations needed to go from one format to another. You can think of these as a series of atomic operations executed sequentially to arrive at the final format. Continuing with our phone number example, this could be (see the sketch after the list):

  • Remove whitespace
  • Remove dots
  • Remove dashes
  • Remove (
  • Remove )
  • Check length
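
A minimal sketch of such a chain, assuming the operation names above map onto simple string functions:

```python
# Each atomic operation does exactly one thing, so the agent can compose and reorder them freely.
ATOMIC_OPS = {
    "remove_whitespace":  lambda s: s.replace(" ", ""),
    "remove_dots":        lambda s: s.replace(".", ""),
    "remove_dashes":      lambda s: s.replace("-", ""),
    "remove_open_paren":  lambda s: s.replace("(", ""),
    "remove_close_paren": lambda s: s.replace(")", ""),
    "check_length":       lambda s: s if len(s) in (10, 11) else None,
}

def apply_chain(value: str, chain: list[str]):
    """Apply a named sequence of atomic operations; None signals a failed check."""
    for op in chain:
        value = ATOMIC_OPS[op](value)
        if value is None:
            break
    return value

print(apply_chain("(555) 123.4567", [
    "remove_whitespace", "remove_dots", "remove_dashes",
    "remove_open_paren", "remove_close_paren", "check_length",
]))  # -> "5551234567"
```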

Again, the trick is to think carefully about which operations are unique, atomic and applicable to a given domain. If you can do that, and the operations can be chained, you are able to clean any type of data, provided there is a finite sequence of atomic steps that gets you there.

Agents are stupid and lazy

As you work with AI agents (LLMs with tools) you’ll discover that they are lazy or outright stupid. They will keep using the wrong tool to gain information about a given column, or they will keep trying the same cleaning tool even when it has no effect.

This is a common problem, and it has been solved before with a suggester / validator, or writer / critic, flow. Essentially, one less-than-brilliant agent suggests a series of operations, tests them out and reports the effect. Another agent then looks it over and suggests changes to improve the result. Done iteratively, this flow is surprisingly powerful and will in many cases vastly improve on a naive single-agent execution of the steps.
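
A sketch of what that loop could look like, with `suggester` and `critic` standing in for the two LLM-backed agents and `apply_chain` for the operation executor from earlier:

```python
# Suggester / validator loop: propose a chain, try it, let a second agent critique it,
# and feed the critique back until the result is accepted or we run out of rounds.
def refine(column, suggester, critic, apply_chain, max_rounds: int = 3):
    chain, cleaned, feedback = [], list(column), None
    for _ in range(max_rounds):
        chain = suggester(column, previous_chain=chain, feedback=feedback)  # propose operations
        cleaned = [apply_chain(v, chain) for v in column]                   # try them out
        verdict = critic(column, cleaned, chain)                            # independent review
        if verdict.get("accepted"):
            break
        feedback = verdict.get("feedback")                                  # feed critique back in
    return chain, cleaned
```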

All the non-agent stuff

But the agents in such an agentic data cleaning pipeline are almost the least important part. Don’t get me wrong, they are the glue that makes this possible, but only if they are equipped with the right tools and the right logical structures. A carpenter without tools is not able to build anything; a developer without automated testing and deployment slows to a crawl. The same is true for agents: they depend entirely on the environment you create around them.

Then you have to remember you are dealing with precious data, and agents do make mistakes, so you have to safeguard against them. That means figuring out where to place human-in-the-loop approvals of profiling plans, transformations and so on. It also means thinking carefully about what kind of output to present to users, so they see only the information relevant to making a decision. There is no need to show the hundreds of rows with malformed phone numbers; an overview is enough: we have 3 variants of phone numbers, would you like all of them to have format X?
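
A sketch of what such an approval gate could look like. The summary format and the console prompt are only illustrative; in practice this would sit behind whatever UI your users already have:

```python
# Human-in-the-loop gate: summarize what the agent found and plans to do,
# and require an explicit yes before any transformation runs.
def request_approval(column_name: str, pattern_counts: dict[str, int], target_format: str) -> bool:
    print(f"Column '{column_name}': found {len(pattern_counts)} phone number variants:")
    for pattern, count in pattern_counts.items():
        print(f"  {pattern}: {count} rows")
    answer = input(f"Convert all of them to {target_format}? [y/N] ")
    return answer.strip().lower() == "y"
```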

Speaking of mistakes, data transaction systems fail all the time. That is why we love ACID: atomicity, consistency, isolation, durability. In short, either your entire sequence of operations is successful and valid, or nothing is done to the data. Once you implement your data cleaning steps following this pattern you get audit trails and rollback capabilities “for free”, meaning you never overwrite data in a way that cannot be undone later on.
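
A sketch of that pattern on SQLite, assuming a hypothetical contacts table with id and phone columns. The whole batch of updates runs in one transaction, so any failure rolls everything back, and an audit table records each change so the step can be reviewed or reversed later:

```python
import sqlite3

def clean_column_transactional(db_path: str, updates: list[tuple[str, str, int]]):
    """updates: (old_value, new_value, row_id) triples produced by the cleaning agent."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute("""CREATE TABLE IF NOT EXISTS audit_log
                            (row_id INTEGER, old_value TEXT, new_value TEXT,
                             changed_at TEXT DEFAULT CURRENT_TIMESTAMP)""")
            for old, new, row_id in updates:
                # 'contacts' and its columns are assumed for the example.
                conn.execute("UPDATE contacts SET phone = ? WHERE id = ?", (new, row_id))
                conn.execute("INSERT INTO audit_log (row_id, old_value, new_value) VALUES (?, ?, ?)",
                             (row_id, old, new))
    finally:
        conn.close()
```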

Finally, there are the usual problems of data size versus hardware constraints, I/O limitations, persistent storage with rollback, data versioning, monitoring of agent performance and so on.

TL;DR

Your very first AI pilot could be the pilot that fuels all future pilots. Implementing an agentic data cleaning service will teach you a great deal about the importance of tool definitions, the stupidity of LLMs and how to mitigate it, and it will give you insight into just how small a component the AI is in an AI system: that little nugget of powerful pattern recognition, overlaid with good, old, trusty software tools.

We’ve built it, and we are continuously refining it to handle more and more advanced data and to include more and more domains. We learned a ton and continue to learn, experimenting with self-improving tools, tool suggestion and building out a data ontology across domains: basically a “seen it once, repeat it forever” pattern textbook.

What we have now is a powerful framework for:

Raw data -> Profile -> Suggest -> Validate -> Clean -> Audit -> Store.

This is capable of handling roughly 80% of typical data cleaning needs in a fraction of the time of a manual process, and it requires an understanding of the business purpose of the data rather than coding and data wrangling skills.

If data is your oil, your first AI pilot should build the refinery: the foundation for everything that follows.