Databricks open-sources Delta Lake, its next-gen data lake platform
Company wants more developers to drive adoption
Databricks, the company founded by the originators of Apache Spark, has open-sourced its Delta Lake storage platform.
Delta Lake is a storage layer that brings ACID transactions to Apache Spark and big data workloads. It is designed to address some of the weaknesses of data lakes: repositories where all sorts of data, structured and unstructured, are stored before being extracted for analysis. Data lakes were a response to the big data revolution, the idea being to keep as much data as possible in the hope that analysing it would reveal patterns that were hitherto impossible to ascertain.
However, many organisations have struggled to make this promise a reality. According to Ali Ghodsi, CEO of Databricks, one reason is that the data stored in data lakes gradually goes stale. Schemas change, semantics are subtly altered, duplicates build up, and incomplete or failed operations leave a sediment of malformed data. When the data is eventually queried, results can be unreliable, performance can suffer, and the causes of the inaccuracies are difficult to pin down.
"We've seen a lot of projects delayed, projects that are not performing and there's a whole slew of issues behind that," Ghodsi told Computing.
"Just dumping everything into a gigantic lake was a mistake. It turns out that if you didn't think about scalability and data quality at the outset, and if you just dumped the data hoping that in five years time you could get performance and scalability and reliability, well that's really unlikely. It's garbage in, garbage out."
According to Ghodsi, Delta Lake offers a 'filter' for data before it is stored, in the form of an adjustable (and optional) schema. As data is ingested (both batch and streaming data can be processed), records that are malformed, corrupt or that otherwise fail to match the required schema are rejected. Writes are atomic, meaning they either complete in full or not at all: there is no halfway house, so failed attempts do not pollute the system. The store is also versioned, so developers can step back should something go wrong, or test different iterations of a machine learning algorithm against earlier snapshots.
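How that looks in practice can be sketched in a few lines of PySpark. The snippet below is illustrative rather than taken from the announcement: it assumes the open-source delta-core package is available to the Spark session (the version pinned here is an example), and the table path is hypothetical.

    from pyspark.sql import SparkSession

    # Pull in the open-source Delta Lake package; the pinned version is an assumption.
    spark = (SparkSession.builder
             .appName("delta-demo")
             .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
             .getOrCreate())

    # The first write records the table's schema alongside the data.
    events = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "action"])
    events.write.format("delta").save("/tmp/delta/events")

    # An append whose schema doesn't match is rejected in its entirety,
    # so the failed attempt leaves nothing behind in the table.
    bad = spark.createDataFrame([("oops", 3.14)], ["wrong_col", "other"])
    try:
        bad.write.format("delta").mode("append").save("/tmp/delta/events")
    except Exception as err:
        print("append rejected, table unchanged:", err)

    # Every successful commit is versioned, so earlier states can be read back.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")
    v0.show()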
Available both as a managed cloud service and as an on-premises deployment, Delta Lake allows full ACID transactions across multiple simultaneous readers and writers, meaning again that data stored in Delta Lake's open Parquet file format can be queried and operated upon without the side effects that can lead to inconsistencies and performance woes.
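Those guarantees rest on pairing ordinary Parquet data files with an ordered transaction log that readers and writers coordinate through. Continuing the hypothetical example above, a quick look at the table directory shows the layout:

    import os

    table_path = "/tmp/delta/events"  # hypothetical path from the earlier sketch

    # The table directory holds plain Parquet data files plus a _delta_log folder.
    print(sorted(os.listdir(table_path)))

    # The log is a sequence of zero-padded JSON files, one per committed transaction;
    # readers reconstruct a consistent snapshot from it, untouched by in-flight writes.
    print(sorted(os.listdir(os.path.join(table_path, "_delta_log"))))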
Asked why the company had waited until now to open source the project, Ghodsi said customers' use of the system had underlined its potential as a general-purpose big data storage platform, and that he wanted as many developers working on it as possible.
"Seeing how customers have used it we realised it's something really massive. If you really want to achieve a paradigms shift in the market and have 10 million developers use it then it had better be open source, and it had better be real open source."
Delta Lake is released under the permissive Apache 2.0 licence.
In another announcement, Databricks said that Microsoft is joining its open-source machine learning project MLflow as an active contributor, and is adding native support for the platform to its Azure Machine Learning service.