We need to talk about metadata: an interview with IBM distinguished engineer Mandy Chessell
Computing speaks to IBM's Mandy Chessell about the need for a common language for data about data
Everyone understands the importance of data. The modern world is utterly dependent on the stuff. But data on its own is pretty useless, a chess set with no instructions on how to play the game. If you want to share data or analyse it you need to know certain things about its properties - where it came from, the format, who created or modified it, its quality, who should be permitted to use it, its relative value, when it was generated and many other contextual factors all of which come under the heading of metadata - or data about data.
At the recent DataWorks Summit, Computing spoke to Mandy Chessell, distinguished engineer and master inventor at IBM, about the need to treat metadata as an asset rather than an afterthought. Chessell, together with industry colleagues at Hortonworks and ING Bank, is working on a model to standardise how metadata is classified and exchanged.
More data, more metadata
For organisations, particularly large ones, a robust data strategy that underpins the business's goals is increasingly important. And as the number of locations where data is generated, controlled and accessed proliferates, along with growing use of cloud (particularly DBaaS) and the IoT, a proper handle on metadata is required too, so the organisation can keep tabs on it all.
While plenty of metadata standards exist, their use tends to be parochial, restricted to narrow domains. The information about location and lens settings embedded in a digital photo is an example; while extremely useful for data scientists and holiday snappers alike, the vocabulary of digital photography is not widely applicable elsewhere.
The same applies to the metadata created by different platform vendors: it is unlikely to be compatible with the data catalogues deployed by competitors. For those trying to build applications on top of such platforms the mismatched standards can be a major headache. What is required is the metadata equivalent of the web's W3C standards: a shared vocabulary whose conventions are adopted by everyone involved.
If there's no standard there'll be bedlam
"If there's no standard there'll be bedlam. Our goal is to drive some sort of consolidation in that space, regardless of any of the commercial implications which that might have," said Scott Gnau, CTO of Hortonworks, one of the companies working on metadata standards.
The joint effort began three years ago. It is being carried out under the auspices of the ODPi, a project of the Linux Foundation dedicated to the simplification and standardisation of the big data ecosystem.
The open metadata standards being developed by ODPi are managed through a data infrastructure based on Apache Atlas, the data governance and metadata framework for Hadoop. Atlas is the basis of an open and extensible library of types, interchange flows, interchange protocols and formats designed to enable metadata repositories from different vendors to work together in a peer-to-peer manner.
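To make that concrete, the sketch below shows how a shared classification type might be registered with an Apache Atlas instance over its REST API. The endpoint path follows Atlas's v2 typedefs interface, but the host, credentials and the type itself are placeholders rather than part of any published open metadata library.

```python
"""Illustrative sketch: registering a shared classification type with an
Apache Atlas instance over its REST API. The payload is simplified and the
host, credentials and type name are placeholders."""
import requests

ATLAS_URL = "http://atlas.example.com:21000"   # placeholder host
AUTH = ("admin", "admin")                      # placeholder credentials

# A classification type that repositories across the ecosystem could agree on.
typedefs = {
    "classificationDefs": [
        {
            "name": "SensitivePersonalData",
            "description": "Data that identifies a living individual",
            "superTypes": [],
            "attributeDefs": [],
        }
    ]
}

resp = requests.post(
    f"{ATLAS_URL}/api/atlas/v2/types/typedefs",
    json=typedefs,
    auth=AUTH,
)
resp.raise_for_status()
print("Registered classification:", resp.json())
```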
A common language
Metadata might be every bit as important as the primary data itself (a former CIA and NSA chief famously said that those agencies kill people on the basis of metadata) but it is not nearly as well understood or appreciated.
IBM's Mandy Chessell has been advising large organisations about data and information strategy, including metadata, for more than a decade. She says it can be very difficult to bring real-world meaning to what is, on the face of it, a fairly dry and academic subject. Metadata is out of sight, out of mind.
"Many people are surprised that metadata isn't taken care of automatically", she said. "And at other times we have to explain that it isn't just keeping a spreadsheet of all the datasets. It's about using metadata to drive the data infrastructure and put the business in greater control".
Data looks different depending on your viewpoint
A common language for metadata is required by the wider digital economy, and this is a major driver for the current efforts, Chessell said.
"A lot of this is ensuring we have a market for vendors so that vendors can bring their value-add."
The metadata standards system being designed by IBM and its partners is necessarily decentralised. Rather than there being a central metadata repository, which would be impossible to manage, individual vendors and other parties maintain their own repositories, with proprietary classifications mapped to the agreed standard equivalents.
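As a rough illustration of that mapping idea, the snippet below translates a vendor's proprietary classification labels into an agreed open term whenever metadata crosses the repository boundary. All of the labels are hypothetical.

```python
"""A minimal sketch of the mapping described above: each vendor keeps its
own repository and proprietary labels, and translates them to an agreed
open term when metadata leaves the repository. All names are hypothetical."""

# Vendor-specific classification -> agreed open metadata term
VENDOR_TO_OPEN = {
    "ACME_PII":       "SensitivePersonalData",
    "ACME_FINANCIAL": "FinancialData",
    "ACME_PUBLIC":    "PublicData",
}

def to_open_term(vendor_label: str) -> str:
    """Translate a proprietary label to the shared vocabulary, falling back
    to an 'Unclassified' marker when no mapping has been agreed."""
    return VENDOR_TO_OPEN.get(vendor_label, "Unclassified")

print(to_open_term("ACME_PII"))   # -> SensitivePersonalData
```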
Adoption is critical to the success of any standardisation effort, and ultimately it is hoped that vendors will adopt the common standards as they develop their data catalogues.
Asked about competing schemes Chessell said a winner-takes-all scenario is unlikely.
"There are so many standards and it's almost impossible to cover everything so you want the design to be open and extensible. Our model is a combination of lots of different standards."
In cohorts
Within individual organisations cohorts can be formed. These are simply groups of metadata repositories that an organisation has decided should exchange metadata. Rules are imposed about how the repositories can communicate with others in the cohort. One repository might be able to take a read-only reference copy of another, while a second might be able to run Atlas-based queries and extract metadata that way.
"When a repository joins the cohort it becomes part of the metadata ecosystem and will receive metadata from other repositories, and if changes are made to its own repository then information about those changes is sent out to the cohort," Chessell explained.
It's a peer-to-peer protocol, there's no one repository in charge
"It's a peer-to-peer protocol, there's no one repository in charge and a repository can join multiple cohorts, so in an international organisation you might have country-level cohorts then a head office repository that connects to all the cohorts and can see across the whole enterprise."
Which is all very well, but it's still a bit, well, meta. Chessell gave some examples of how the system can help businesses lighten the administrative burden.
"Let's say we change the classifiction and somebody's vehicle registration is now sensitive data. The moment that the business makes that change in one repository the system should react and propagate it. So its easier for the business to set up, define what needs to happen and sign it off without there being lots of developers translating that into code which then is baked into the system," she said.
"What we're trying to do is to get to the point where the business is responsible and able to manage its policies around data."
Data is defined by the data that surrounds it
Another example is automatic masking of sensitive data. Analysis of payroll data is made difficult by the presence of confidential information in the dataset such as salaries. With the correct structures in place rules can be set up in the HR data catalogue to limit access and redact automatically any sensitive information when the dataset is used outside of its own domain.
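A minimal sketch of that kind of metadata-driven masking is shown below, assuming the HR catalogue records a classification for each column; the column names and the redaction rule are illustrative.

```python
"""A minimal sketch of metadata-driven masking, assuming the HR catalogue
records a classification per column. Column names, classifications and the
redaction rule are invented for illustration."""

# Classifications the HR data catalogue might hold for a payroll dataset.
COLUMN_CLASSIFICATIONS = {
    "employee_id":  "Internal",
    "department":   "Internal",
    "salary":       "Confidential",
    "bank_account": "Confidential",
}

def mask_for_external_use(row: dict, classifications: dict) -> dict:
    """Redact any column classified as Confidential before the row leaves
    the HR domain; everything else passes through unchanged."""
    return {
        column: ("***REDACTED***" if classifications.get(column) == "Confidential"
                 else value)
        for column, value in row.items()
    }

row = {"employee_id": "E123", "department": "Finance",
       "salary": 54000, "bank_account": "GB00XXXX"}
print(mask_for_external_use(row, COLUMN_CLASSIFICATIONS))
```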
Contextual rules can also be enforced using APIs. A company might have a separate API for data science tools, security tools or BI tools, each offering a different window onto the same dataset. These APIs have their own metadata and need to be managed in the same way.
"We need to be able to retain metadata about APIs - 'this API has this function, it's using this data and the data is located here' - because that becomes a requirement for things like GDPR where you have to demonstrate you're using the data only in ways for which you have permission. So tying together metadata about APIs and metadata about policy and showing how one is using the other is a key part of demonstrating compliance," said Chessell.
Have metadata will travel
Another consideration is where and how the metadata is stored. To go back to the digital photos example, information about the lens, the lighting conditions and location is embedded in the photo itself, enabling anyone to read it from a digital copy. "The metadata travels with the data," as Chessell puts it.
Similarly, governance rules describing how the data may or may not be used can be incorporated. This can help with information sharing between different organisations.
"You might want to have the Ts&Cs embedded, classifications, information about its lineage, quality metrics and things like that. Once you have an open standard you have a mechanism for both archiving to file and exchanging it across the network. Then it becomes much easier to have that metadata embedded with data as its bought and sold or exchanged between organisations," said Chessell.
In other cases though, metadata may be best retained within the repository, separate from the data.
Data strategy
Considerations about metadata are very much bound up with data strategy, and that strategy is subordinate to the overall goals of the business.
"Decide what you're going to do with the data because then you can target assets," Chessell asserted when asked how a business should get started.
"Often the data strategy comes from having some sense of where they are trying to get to. Are they focused on customer data to deliver better customer care, or on efficiency around certain parts of the operation? Are they interested in mining particular types of IP? So depending on their business strategy their data strategy should then follow through, and they can start to identify where the priorities are."
It is then that the metadata aspects come into play: "Data discovery, cataloguing and clarifying how that data should be managed and then thinking how do we automate that, how do we push that into the underlying operation into the tools and systems we're using every day?"
Ultimately, metadata needs to be everyone's business, she said.
"In most organisations knowledge about data and its context is widely scattered. It's often not written down and it's spread between technical people and business people. So we need an ecosystem that's gathering that information as an ongoing process that's part of everybody's everyday job."