CrushBank CTO: Data Preparation For AI Still Needs ‘Human In The Loop’

‘You can build the processing, but you need to have human oversight, human error checking. There’s a lot of training that needs to be done. I think we’re slowly moving away from the days where you need to be a high-end data scientist to do some of this stuff, because some of the tools can add that functionality, but you need to understand how these data sets work,’ CrushBank CTO David Tan tells CRN.

https://www.youtube.com/watch?v=oRl0cjtmN-A

One of the biggest minefields organizations must navigate before adopting AI into their processes lies in selecting and preparing the data that is going to be used, CrushBank CTO and co-founder David Tan told CRN.

“That data is messy and it’s disparate and it’s incomplete and it’s inaccurate and there’s multiple versions of it,” Tan said. “There’s just a lot of challenges. So it’s kind of funny. I don’t want to jump to the end, but one of my key pieces of advice and key takeaways every time I talk about AI, especially with MSPs trying to work with small to mid-sized businesses, is become a data expert. Get your client’s data in order, understand where it is, understand who owns it, understand the lineage, the security around it. It’s not simple, but it is doable.”

Tan is the former co-owner of an MSP that was headquartered in New York. He began working with machine learning models in 2015, training them on data sets related to the tickets, training modules and knowledge-base articles (KBAs) used by the techs on the MSP’s help desk.

Along the way, he said, they learned several common pitfalls in preparing data before it is ingested by AI models. He said their method is to run the data through translations, interpretations and extractions so that it ends up stored in three different formats.

“We store pure unstructured data in a data lake that you can get access to and you can use to build your AI solution,” he said. “We also chunk that data up, store it into vector databases so you can do conversational search and generative AI. We also extract data and store that into a structured database so it looks a lot like SQL on the backend. That’s not what it is. But say you’re ingesting a bunch of invoices or proposals, and you’re pulling out the key metrics and storing them in a database, now it really doesn’t matter how messy those are, because I can do a query and I can see how much we spent with XYZ company last quarter because I have that data stored in a SQL database.”
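The third of those formats, structured extraction, can be illustrated with a minimal sketch. It is not CrushBank’s pipeline: the sample invoices, the regular expressions, the table layout and the function names `extract_fields`, `build_db` and `spend_with_vendor` are all hypothetical, and SQLite stands in for whatever structured store a real system would use. The point is only that once the key fields are pulled out, a messy source document can be queried like a table.

```python
import re
import sqlite3

# Hypothetical raw invoice text, deliberately inconsistent in layout.
RAW_INVOICES = [
    "INVOICE\nVendor: XYZ Company\nDate: 2024-02-14\nTotal due: $1,250.00",
    "vendor: XYZ Company | date: 2024-03-03 | total due: $749.50",
    "Vendor: Acme Corp\nDate: 2024-02-20\nTotal Due: $300.00",
    "Vendor: XYZ Company\nDate: 2024-05-01\nTotal due: $90.00",
]

# Illustrative patterns only; real documents would need far more robust parsing.
FIELD_PATTERNS = {
    "vendor": re.compile(r"vendor:\s*([^|\n]+)", re.IGNORECASE),
    "invoice_date": re.compile(r"date:\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "total": re.compile(r"total due:\s*\$([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_fields(text):
    """Return the key fields of one invoice, or None if any field is missing."""
    found = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            return None
        found[name] = match.group(1).strip()
    # Store money as integer cents to avoid floating-point drift in sums.
    dollars, cents = found.pop("total").replace(",", "").split(".")
    found["total_cents"] = int(dollars) * 100 + int(cents)
    return found


def build_db(raw_docs):
    """Load every invoice that could be parsed into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invoices (vendor TEXT, invoice_date TEXT, total_cents INTEGER)"
    )
    for doc in raw_docs:
        row = extract_fields(doc)
        if row is not None:
            conn.execute(
                "INSERT INTO invoices VALUES (?, ?, ?)",
                (row["vendor"], row["invoice_date"], row["total_cents"]),
            )
    conn.commit()
    return conn


def spend_with_vendor(conn, vendor, start, end):
    """Total spend in dollars for invoices dated in [start, end)."""
    cur = conn.execute(
        "SELECT COALESCE(SUM(total_cents), 0) FROM invoices "
        "WHERE vendor = ? AND invoice_date >= ? AND invoice_date < ?",
        (vendor, start, end),
    )
    return cur.fetchone()[0] / 100


if __name__ == "__main__":
    conn = build_db(RAW_INVOICES)
    # Spend with XYZ Company in the first quarter of 2024.
    print(spend_with_vendor(conn, "XYZ Company", "2024-01-01", "2024-04-01"))
```

In this sketch, invoices that cannot be parsed are silently skipped; in practice those are the documents that would be handed to a person for checking.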

While many software tools have arrived to aid in this process, Tan warns that it remains a largely manual one, and there is not yet an AI agent that can simply be told to “clean that up.”

“There’s a big human-in-the-loop component,” he said. “So what I mean by that is you can build the processing, but you need to have human oversight, human error checking. There’s a lot of training that needs to be done. I think we’re slowly moving away from the days where you need to be a high-end data scientist to do some of this stuff, because some of the tools can add that functionality, but you need to understand how these data sets work.”
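One way to picture that human oversight is as a triage step between automated extraction and storage. The sketch below is an assumption, not a description of CrushBank’s tooling: the `Extraction` record, its `confidence` score, the required fields and the 0.85 threshold are all hypothetical. It routes any record with missing fields or a low score to a person instead of accepting it automatically.

```python
from dataclasses import dataclass

# Hypothetical settings; a real pipeline would tune these per data set.
REQUIRED_FIELDS = ("vendor", "invoice_date", "total")
REVIEW_THRESHOLD = 0.85


@dataclass
class Extraction:
    doc_id: str
    fields: dict
    confidence: float  # assumed score from the extraction step, 0.0 to 1.0


def review_reasons(item):
    """List the reasons a record needs a human look; empty means none."""
    reasons = [
        f"missing {name}" for name in REQUIRED_FIELDS if not item.fields.get(name)
    ]
    if item.confidence < REVIEW_THRESHOLD:
        reasons.append(f"low confidence ({item.confidence:.2f})")
    return reasons


def triage(items):
    """Split records into auto-accepted ones and ones queued for review."""
    accepted, needs_review = [], []
    for item in items:
        reasons = review_reasons(item)
        if reasons:
            needs_review.append((item, reasons))
        else:
            accepted.append(item)
    return accepted, needs_review


if __name__ == "__main__":
    batch = [
        Extraction("inv-001", {"vendor": "XYZ Company", "invoice_date": "2024-02-14", "total": "1250.00"}, 0.97),
        Extraction("inv-002", {"vendor": "XYZ Company", "invoice_date": "", "total": "749.50"}, 0.91),
        Extraction("inv-003", {"vendor": "Acme Corp", "invoice_date": "2024-02-20", "total": "300.00"}, 0.62),
    ]
    accepted, needs_review = triage(batch)
    for item, reasons in needs_review:
        print(item.doc_id, "->", "; ".join(reasons))
```

The corrections a reviewer makes to queued records could also be fed back as training examples, which is one reading of the training Tan mentions.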