The critical flaw of AI: data

One of the biggest flaws of artificial intelligence is poor quality data, claims MarkLogic CEO Gary Bloom

Artificial intelligence is all the rage.

Governments are throwing money at it, new companies are being formed faster than they are being sold, and industry giants like Toyota and Google have set up their own venture capital funds to identify and invest in the potential winners in the coming revolution.

The excitement is easy to grasp given the intoxicating notion of AI: apply intelligence against data to gain better insights and make better decisions. It's what the human brain does all the time. If data is the "new oil", AI has been dubbed the "new electricity".

But there's often something missing from these discussions: data quality.

Artificial intelligence — in fact, any intelligence — can only be as good as the data from which it draws inferences. And data kept today by companies, government agencies and institutions is often a mess, stuck in silos built over decades as corporations and entities grew.

Throw in mergers and IT/corporate turf wars, and you have even more silos. Even publicly available data is often a mishmash in which, for example, different terms describe the same thing. For many organisations, the state of data is like a house full of messy cupboards, each one separate and packed with stuff — some of which is valuable, some of which actually belongs elsewhere and some of which no one knows is even there.
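To make the "different terms, same thing" problem concrete, here is a minimal Python sketch of reconciling two silos before anything reaches an AI engine. The synonym table, the department names and the counts are all invented for illustration. A real organisation would need a properly governed vocabulary, not a hand-written dictionary.

```python
# Hypothetical synonym table (an assumption for this sketch): each key is a
# label that might appear in one silo, each value is the agreed canonical term.
CANONICAL_TERMS = {
    "heart attack": "myocardial infarction",
    "mi": "myocardial infarction",
    "myocardial infarction": "myocardial infarction",
}


def normalise_term(raw):
    """Lower-case the label, collapse whitespace, then map it to a canonical term."""
    key = " ".join(raw.lower().split())
    # Labels with no known synonym pass through in their cleaned-up form.
    return CANONICAL_TERMS.get(key, key)


def merge_silos(*silos):
    """Combine per-silo counts so that synonymous labels land in one bucket."""
    merged = {}
    for silo in silos:
        for term, count in silo.items():
            canonical = normalise_term(term)
            merged[canonical] = merged.get(canonical, 0) + count
    return merged


# Two made-up departmental silos that describe the same condition differently.
cardiology = {"Heart Attack": 12, "MI": 3}
emergency = {"myocardial infarction": 7, "Stroke": 4}

combined = merge_silos(cardiology, emergency)
print(combined)
```

Without the normalisation step, the same condition would be counted under three separate labels, and any model trained on the result would inherit that confusion.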

Getting all of that stuff into one AI engine is an arduous task, but doing so is crucial if AI is to reach its potential.

IBM Watson: all about the data

One high-profile example of the challenges for AI and data is MD Anderson's $62 million attempt to use IBM Watson to help it improve cancer care.

The effort was put on hold after it fell short of its goals. A University of Texas audit detailed a number of shortcomings, including a change of focus and the fact that "clinical trial and drug protocol data in the system are outdated, and the pilot program doesn't work with the hospital's current electronic health records," according to the Wall Street Journal.

While IBM told the WSJ that the "pilot was a success", the WSJ quoted Peter Szolovits, head of the Clinical Decision-Making Group at the MIT Computer Science and Artificial Intelligence Laboratory. Szolovits noted that medical institutions "often struggle to bring all data onto the same platform" and that the way "medical information is stored and labeled can differ widely, even between departments at the same institution".

If the way data is stored or labeled changes, "often the artificial-intelligence software must be retrained," he told the Wall Street Journal.

The scrutiny that IBM's Watson AI has come under is no surprise. It is "one of the more mature cognitive computing platforms", a research report by investment banking firm Jefferies noted. As such, Watson is closely watched as a bellwether for the AI industry.

Data and talent: key ingredients

The idea that some of the MD Anderson issues revolved around data is also not a surprise.

After all, data and talent, not algorithms, will be the two main sources of competitive advantage in the AI war, according to the Jefferies report. Scarcity of talent, it adds, is a big issue.

Already, 86 per cent of tech hiring managers and recruiters say it is challenging to find and hire tech talent. With AI investment rising, this will only get worse. Companies are already nabbing top talent from universities and from each other.

Data will be a big deal, too. "If there isn't enough data available, or if the data is of poor quality in content or structure, smart machines won't be able to make a reliable decision," wrote Gartner's Kasey Panetta.

To that end, leading companies are investing in, and focusing on, data quality to aid AI efforts. In 2016, eBay purchased Expertmaker, a Swedish AI and machine learning firm, to help organise and analyse huge sets of data.

Joaquin Candela, director of applied machine learning at Facebook, told Harvard Business Review that he's focused on getting more and better data — and on the speed of experimentation, not on better algorithms. "I'm not saying don't work with the algorithm at all. I'm saying that focusing on giving it more data and better data, and then experimenting faster, makes a lot more sense," Candela said.

Siloed data is a common problem in the enterprise and will prove vexing for all kinds of AI efforts.

"The biggest obstacle to using advanced data analysis isn't skill base or technology; it's plain old access to the data," wrote Ed Wilder-James, vice president of technology strategy at Silicon Valley Data Science, in a Harvard Business Review article last year. The headline of the article? "Breaking Down Data Silos."

Reliable decisions

Not only is enterprise data often stuck in silos, which makes integrating it difficult and expensive, but data scientists report that 80 per cent of their time is spent wrangling data into shape.

That problem won't abate with AI. Artificial intelligence needs data that is clean, current and well-governed so that everyone knows where it came from, under what terms it was collected and how it may have been transformed before getting to an AI engine. AI also needs data that is easily accessible and shared.
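As a rough illustration of what "well-governed" could mean in practice, the Python sketch below attaches provenance to a dataset and checks it before the data is handed to an AI engine. The `DatasetRecord` fields, the `readiness_issues` helper and the one-year freshness threshold are assumptions made for this example, not an established standard or any vendor's API.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class DatasetRecord:
    """Hypothetical provenance record kept alongside a dataset."""
    name: str
    source: str              # where the data came from
    collection_terms: str    # under what terms it was collected
    last_updated: date       # how current it is
    transformations: list = field(default_factory=list)  # how it was changed


def readiness_issues(record, today, max_age_days=365):
    """Return a list of governance gaps; an empty list means none were found."""
    issues = []
    if not record.source:
        issues.append("source not recorded")
    if not record.collection_terms:
        issues.append("collection terms not recorded")
    if today - record.last_updated > timedelta(days=max_age_days):
        issues.append("data is stale")
    return issues


# Invented example: the terms of collection are missing and the data is two years old.
record = DatasetRecord(
    name="trial_protocols",
    source="oncology department export",
    collection_terms="",
    last_updated=date(2016, 1, 1),
    transformations=["de-duplicated"],
)

issues = readiness_issues(record, today=date(2018, 1, 1))
print(issues)
```

The point of the check is not sophistication. Questions about origin, terms and freshness can be answered mechanically only if someone recorded the answers in the first place.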

However, sharing requires knowledge of who should have access to either the data or parts of the data and who shouldn't. Privacy concerns around consumer data will only continue to grow as more intelligence is set against data, revealing connections that may violate privacy boundaries.

If AI is a recipe for increased efficiency in all kinds of business and social ways, then good data is the key ingredient.

Gary Bloom is the former CEO of Veritas and now CEO at NoSQL database software vendor MarkLogic. He can be contacted via the MarkLogic website.
