Getting to grips with machine learning
Jean-Cyril Schütterlé explains how ML is taking us from rules-based algorithms to data-driven processes
Machine learning (ML) may sound like a daunting concept to anyone unfamiliar with it, conjuring outlandish ideas about machines poised to enslave mankind. Fortunately, that is not what ML is: it is essentially a major advance in the development of IT. For ML to benefit your organisation, you first have to understand both its full benefits and its limitations.
While the principles of ML are rather simple and intuitive to grasp, it does require specific statistical and IT skills that few people currently possess. To understand the idea, think of a common and rather mundane language translation service such as Google Translate; it was this that helped me realise the transformative potential of ML.
To put it simply, language translation software was long built by programming dictionaries, grammatical rules and their numerous exceptions, an approach that involves considerable effort.
From rule-based to data-driven processes
The new methodology stemmed from a simpler idea: don't try to define rules and lexical tables from scratch; let the software discover them. How?
In three steps:
- Millions of pages that have already been translated from one language to another are collected from international organisations, such as the documentation published online by the UN or the European institutions.
- When a user submits text for translation, the software slices it into basic elements and then searches the source-language side of the corpus for similar elements.
- The most likely translation is then extracted from the bilingual corpus and suggested to the user. Relevant statistical patterns found in the data therefore replace translation rules. Instead of being painstakingly programmed, they are simply "learned" by the software. This approach is highly cost-efficient, and the quality of the translation is often on a par with the traditional approach, as the toy sketch below illustrates.
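To make the statistical idea concrete, here is a toy sketch in Python. The four-sentence "corpus" and word-level matching stand in for the millions of aligned pages and phrase-level statistics a real system would use; everything in it is illustrative:

```python
from collections import Counter, defaultdict

# Toy parallel corpus: (French, English) pairs standing in for millions
# of pre-translated pages gathered from international organisations.
corpus = [
    ("le chat noir", "the black cat"),
    ("le chien noir", "the black dog"),
    ("un chat", "a cat"),
    ("un chien", "a dog"),
]

# Steps 1-2: slice each pair into basic elements (single words, here)
# and count how often each source word co-occurs with each target word.
cooccurrences = defaultdict(Counter)
for src, tgt in corpus:
    for s in src.split():
        for t in tgt.split():
            cooccurrences[s][t] += 1

# Step 3: suggest, for each source word, its most frequent co-occurring
# target word. Statistics learned from the corpus replace hand-written
# grammar rules and dictionaries.
def translate(sentence: str) -> str:
    out = []
    for word in sentence.split():
        if word in cooccurrences:
            out.append(cooccurrences[word].most_common(1)[0][0])
        else:
            out.append(word)  # unknown word: pass it through unchanged
    return " ".join(out)

print(translate("le chat"))  # -> "the cat"
```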
In areas less complex than translating human languages, the productivity gains are compounded by substantial quality improvements. Anyone who has worked on software knows how hard it is to anticipate every problem that will arise once it enters production. The software's functional rules are based on assumptions drawn from a limited number of observations. Reality often proves far more complex than expected, so the automation ends up suboptimal or the software requires expensive corrections.
Machine learning, on the other hand, learns from all the available data, however large the volume, so the risk of a pattern or use case being left out of the picture is limited.
Humans must remain in charge
The limitations become apparent when machines are left to operate without human intelligence and have to work from imperfect selections of data.
A good example is the automated processing of loan requests received by banks. An algorithm parses the archives of previous requests, in which each borrower's key information is recorded (age, wealth, family status and so on) along with the repayment outcome (whether they duly paid the bank back or defaulted). From this it highlights the likely relationship between a borrower's profile and the risk of default. Applied to a new loan request, the algorithm will predict, with a level of accuracy considered sufficient, whether the borrower will pay back the loan. This reduces the risk of a bad decision and removes the influence of a bank operative's mood.
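Here is a minimal sketch of that idea in Python with scikit-learn. The features, figures and the choice of logistic regression are all illustrative assumptions, not a description of any bank's actual system:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical archive of past loans: [age, income, dependants] for each
# borrower, and the outcome: 1 = defaulted, 0 = repaid in full.
X = np.array([
    [25, 22_000, 0],
    [40, 55_000, 2],
    [35, 30_000, 1],
    [50, 80_000, 3],
    [23, 18_000, 0],
    [45, 60_000, 2],
], dtype=float)
y = np.array([1, 0, 1, 0, 1, 0])

# Learn the relationship between borrower profile and default risk.
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

# Score a new application: estimated probability that it ends in default.
applicant = np.array([[30, 28_000, 1]], dtype=float)
print(model.predict_proba(applicant)[0, 1])
```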
Nonetheless, it is crucial that humans remain the ultimate decision makers. The software is not perfect: it is governed by settings made by humans. For instance, it may have been optimised to avoid false positives (cases where a loan is granted to a borrower who goes on to default) and so will lean towards rejecting borderline applications. It may also discard observations that do not fit its criteria.
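In practice, that kind of optimisation often comes down to a human-chosen decision threshold. A sketch, with a purely illustrative cutoff value:

```python
# The cutoff is a human-made setting, not something the data dictates.
# Lowering it reduces false positives (loans granted to eventual
# defaulters) at the price of rejecting more sound applicants.
DEFAULT_RISK_CUTOFF = 0.2  # illustrative value

def decide(p_default: float) -> str:
    """Grant only when the predicted default probability is low enough."""
    return "grant" if p_default < DEFAULT_RISK_CUTOFF else "reject"

print(decide(0.15))  # grant
print(decide(0.35))  # reject
```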
Users must therefore check that the system's recommendations are legitimate and, if necessary, reject them. If a loan is granted against the system's recommendation and the borrower turns out to meet the payment schedule, that outcome will have to be fed back into the training data so that the algorithm accepts similar applicant profiles next time.
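Continuing the hypothetical loan sketch above (reusing its `X`, `y`, `model` and `applicant`), that feedback step might look like this:

```python
# The human override becomes a new labelled observation: the borrower
# repaid in full, so the outcome label is 0 (same encoding as above).
X = np.vstack([X, applicant])
y = np.append(y, 0)

# Retrain on the enlarged archive so similar profiles score better.
model.fit(X, y)
```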
Humans should also remain in the loop to ensure ethical standards are met, especially where an individual's rights are concerned. The law on the automated processing of non-anonymised data is likely to evolve further to protect citizens and consumers against the harmful effects of excessive statistical generalisation.
Data über alles
The performance of the automation will depend on meeting two imperatives:
- Data quality - To weed out erroneous observations, extensive cleansing and formatting are required. This work is often huge compared with the effort needed to set up the model itself.
- Training set representativeness - ML is far more effective when it is carried out on unbiased observations that resemble the real-life cases the software will have to deal with. For instance, the range of wages at one company may be substantially wider than at another, so a model trained on the first may mislead when applied to the second (see the sketch after this list).
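As a hedged illustration of both points, here is a sketch in Python with pandas; the column names and data defects are invented for the example:

```python
import pandas as pd

# Hypothetical raw archive with the defects that make cleansing the
# bulk of the work: missing values, impossible values, mixed formats.
raw = pd.DataFrame({
    "age":    [25, 40, None, 250, 35],
    "income": ["22000", "55 000", "30000", "80000", "n/a"],
})

clean = raw.copy()
# Normalise income strings ("55 000", "n/a") into numbers; entries that
# cannot be parsed become NaN rather than silently wrong values.
clean["income"] = pd.to_numeric(
    clean["income"].str.replace(" ", "", regex=False), errors="coerce")
clean = clean.dropna()
clean = clean[clean["age"].between(18, 100)]  # drop impossible ages

# A crude representativeness check: does the wage range in the training
# data resemble the population the model will actually score?
print(clean["income"].describe())
```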
Access to data is crucial to an ML project's success; ultimately, no level of algorithmic sophistication will make up for a poor dataset.
Machine learning tends to eliminate arbitrary behaviour. It is up to us to make sure it does not replace it with inappropriate over-generalisation.
Jean-Cyril Schütterlé is VP product and data science at Sidetrade