Not all machine learning is created equal
Kevin Gidney of Seal Software talks about the training and work that must go alongside machine learning
Many solution providers talk about using machine learning, essentially AI, to extract intelligence from documents, whether they are contracts, invoices or orders. The part they don't talk about is that it is not the ML engine alone that successfully extracts the valuable insight, but all of the training and work that go into it.
There are more and more companies entering the contract discovery and analytics market, and that is a good thing. In fact, it confirms that contracts hold critical information which helps organisations better manage their M&A activity, regulatory compliance initiatives, and procurement and sales functions, as well as make better decisions and create competitive advantage in their markets.
The customers I work with typically have two requirements: first, they have lots of contracts to be processed, typically from tens- to hundreds of thousands, from which they need to get information; and second, they have a specific set of business objectives in mind.
Those objectives range from 'Getting their house in order' to migrating contracts to a business system; running broad scale analytics; kicking off a regulatory compliance initiative; performing due diligence; or any number of projects where contract data is needed. The need for understanding contractual data is driven by ever-changing regulations, business events or the need to look for cost savings or revenue-generating opportunities.
Different projects will require varying levels of accuracy to achieve the informational objectives. These objectives translate into scores for precision and recall, which are achieved through how a system is tuned and trained.
Precision is the percentage of retrieved instances in a search that are relevant to the search (or how useful the results are). Recall is the percentage of relevant instances in the population that are retrieved in an extraction (or how complete the results are). It is the combination of massive numbers of contracts, and specific precision and recall requirements for a project, that drive the need for higher degrees of scalability and accuracy out of an ML system.
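The two definitions above can be expressed directly as set arithmetic. This is a minimal sketch with hypothetical clause IDs, not output from any real system:

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for a retrieval result.

    precision = |retrieved ∩ relevant| / |retrieved|  (how useful)
    recall    = |retrieved ∩ relevant| / |relevant|   (how complete)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical extraction: 8 clauses retrieved, 6 of them correct,
# out of 10 relevant clauses in the whole contract population.
retrieved = {1, 2, 3, 4, 5, 6, 7, 8}
relevant = {1, 2, 3, 4, 5, 6, 9, 10, 11, 12}
p, r = precision_recall(retrieved, relevant)
# p = 6/8 = 0.75, r = 6/10 = 0.6
```

A project can weight these differently: a due diligence review might demand near-total recall, while a quick analytics pass may favour precision.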
On top of these is the need for increased usability, which means abstracting away the complexity of the system for business users, so that they, and not only data scientists, can effectively use the system and maximise the value they receive. There is another factor in training a solution that is almost always disregarded by data analytics solution providers: standard deviation. In ML terms, the standard deviation measures the 'trustability' of any model or method to extract information. When we talk about trusting a model, we expect there to be a low standard deviation.
When a model trained on too little data is re-trained, its recall and precision values can actually decrease significantly. A good model, like any statistical function, needs data, and it needs an appropriate amount of it before the swings in its learning are smoothed out. This is called the learning curve, and it typically results in a gradual reduction in the standard deviation.
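The learning-curve effect can be simulated with a toy stand-in for model training. Here each "training run" estimates a true accuracy of 0.8 from n examples; the spread (standard deviation) of those estimates shrinks as n grows. This is an illustrative assumption, not a model of any particular product:

```python
import random
import statistics

def score_spread(n_examples, trials=200, seed=7):
    """Std deviation of a simulated accuracy estimate across many
    training runs, each fitted on n_examples data points.

    Toy learning-curve demo: the 'model' just estimates a true
    accuracy of 0.8 from n Bernoulli draws per run.
    """
    rng = random.Random(seed)
    estimates = [
        sum(rng.random() < 0.8 for _ in range(n_examples)) / n_examples
        for _ in range(trials)
    ]
    return statistics.stdev(estimates)

small = score_spread(20)    # few training examples: wide swings
large = score_spread(500)   # many training examples: stable scores
```

With 20 examples per run the estimates swing widely; with 500 they cluster tightly, which is exactly the low standard deviation a trustable model needs.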
To put this into context, suppose there was a cohort of 20 people, all American, from the East Coast states. Then a model learns from only those elements and it is very good on that specific dataset. However, if five people from Europe and five people from the West Coast are added to the dataset, the model now performs badly. This is because it had too few examples in the first place, and adding more data caused it to change significantly.
This is high standard deviation in action, and it is the reason the best solutions use a traffic light system and different algorithms depending on data volumes and requirements, but done in a way that abstracts it from the user in a simple and automated way.
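A traffic light scheme like the one described could be sketched as follows. The thresholds here are purely illustrative assumptions, not values from Seal Software or any other vendor:

```python
def traffic_light(recall, precision, std_dev,
                  green=(0.90, 0.90, 0.05), amber=(0.75, 0.75, 0.10)):
    """Classify a model's trustability for business users.

    green : meets the project's recall/precision targets with low spread
    amber : usable but needs more training data or review
    red   : not yet trustable for automated extraction
    Thresholds are hypothetical and would vary by project.
    """
    r_g, p_g, s_g = green
    r_a, p_a, s_a = amber
    if recall >= r_g and precision >= p_g and std_dev <= s_g:
        return "green"
    if recall >= r_a and precision >= p_a and std_dev <= s_a:
        return "amber"
    return "red"

status = traffic_light(recall=0.95, precision=0.92, std_dev=0.03)
```

The point of such a scheme is that a business user sees only a colour, while the choice of algorithm and the statistics behind it stay hidden.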
Along with the intense focus on accuracy, system scalability and usability are critical to solution providers. I've always wanted to create a system that could be used by people who need data. If we had a system that was only appropriate for legal professionals or data scientists, we know the distance between extractions and business value may be too large to consistently overcome. The trick is resolving the often-opposing forces of sophistication and power on the one hand, and the usability for business users on the other.
Finally, it is the ML engine, together with the components that make up the broader platform, that meets the precision and recall objectives for a particular clause. An ML engine cannot provide all the capabilities on its own to deliver the results that businesses require; it takes several technologies and techniques working together, including:
- Natural Language Processing (NLP) to optimise the capabilities for the system to understand written language and process it within the ML engine;
- Latent Semantic Indexing (LSI) for identifying and extracting information not presented in standard terms or language, but which exists through associations of words or phrases or in different locations in a document;
- The use of Deep Learning methods to increase performance of the ML engine;
- The inclusion of UDML to simplify training and automatically select the best model and hyperparameters for any given data, with users only required to select the text to train on;
- Including document review capabilities within the system for efficient side-by-side review and comparison across clauses and language;
- Extensive reporting and data visualisation to be able to easily draw actionable insight from the data;
- Automatic discovery and linkage of related documents such as amendments to master agreements; and
- Simplicity within the UI for information layering and normalisation, to allow the ML framework to effectively use all available information and to allow users and engineers to quickly find and prepare it for use.
If my experience has taught me anything, it's that businesses need more than an ML engine to successfully extract valuable and actionable insights from their data, and that not all machine learning solutions are created equal, particularly in the contract discovery and analytics market.
Kevin Gidney is co-founder of Seal Software
Kevin has held various senior technical positions within Legato, EMC, Kazeon, Iptor and Open Text. His roles have included management, solutions architecture and technical pre-sales, with a background in electronics and computer engineering, applied to both software and hardware solutions.