Microsoft unveils machine learning library for Apache Spark

Software giant hopes the library will make data scientists more productive with Spark

Microsoft has unveiled a machine learning library for Apache Spark, which it says will make data scientists more productive when using the big data processing tool.

The software and, increasingly, cloud services company aims to increase the rate of experimentation by data scientists and to enable them to make better use of new machine learning techniques, including deep learning, on very large datasets.

Microsoft said that while its customers have found Spark to be a powerful platform for building scalable machine learning models, they've struggled with low-level APIs to index strings, assemble feature vectors and coerce data into a layout expected by machine learning algorithms.

It said its library simplifies many of these tasks for building models in PySpark, letting data scientists be more productive and focus on the data science aspect of machine learning. The library takes care of tokenising strings, converting them into numerical vectors, assembling those vectors together and indexing the label column.
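To make those four steps concrete, here is a minimal sketch in plain Python of what such a featurisation pass does to a handful of rows. It is an illustration only and not MMLSpark's or Spark's API: the function names, the column names and the choice of eight hash buckets are all assumptions made for the example, and the label indexing here simply numbers labels in order of first appearance.

```python
import hashlib
from typing import Dict, List, Sequence, Tuple


def tokenise(text: str) -> List[str]:
    """Split a string into lower-case, whitespace-separated tokens."""
    return text.lower().split()


def hash_tokens(tokens: Sequence[str], num_buckets: int = 8) -> List[float]:
    """Turn tokens into a fixed-length count vector using the hashing trick."""
    vector = [0.0] * num_buckets
    for token in tokens:
        # md5 is used only because it is stable across runs, unlike hash().
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vector[int(digest, 16) % num_buckets] += 1.0
    return vector


def index_labels(labels: Sequence[str]) -> Dict[str, int]:
    """Map each distinct label to an integer, in order of first appearance."""
    mapping: Dict[str, int] = {}
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
    return mapping


def featurise(
    rows: Sequence[dict],
    text_col: str,
    numeric_cols: Sequence[str],
    label_col: str,
    num_buckets: int = 8,
) -> Tuple[List[Tuple[List[float], int]], Dict[str, int]]:
    """Hypothetical helper: build (feature vector, label index) pairs."""
    label_index = index_labels([row[label_col] for row in rows])
    examples = []
    for row in rows:
        text_vector = hash_tokens(tokenise(row[text_col]), num_buckets)
        numeric_vector = [float(row[col]) for col in numeric_cols]
        # "Assembling" is just concatenating the per-column vectors.
        examples.append((text_vector + numeric_vector, label_index[row[label_col]]))
    return examples, label_index


if __name__ == "__main__":
    # Made-up rows, purely for illustration.
    sample = [
        {"review": "Great great product", "stars": 5, "sentiment": "pos"},
        {"review": "Bad product", "stars": 1, "sentiment": "neg"},
    ]
    for features, label in featurise(sample, "review", ["stars"], "sentiment")[0]:
        print(label, features)
```

In Spark itself, each of these steps is typically a separate pipeline stage that has to be configured and chained by hand, which is the kind of boilerplate Microsoft says its library removes.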

In addition, Microsoft said that its MMLSpark tool provides Python APIs that operate on Spark DataFrames and are integrated into the SparkML pipeline model.

"By using these APIs, you can rapidly build image analysis and computer vision pipelines that use the cutting-edge DNN algorithms," it said.

One of the capabilities of MMLSpark is using a pre-trained neural network to extract features from images and then passing these features on to traditional machine learning models such as logistic regression or decision forests. Another is the ability to train a deep neural network (DNN) model when a pre-trained model is too domain-specific and therefore unsuitable.

"You can use Spark worker nodes to pre-process and condense large datasets prior to DNN training, then feed the data to a GPU VM for accelerated DNN training, and finally broadcast the model to worker nodes for scalable scoring," it said.

Finally, data scientists can use OpenCV-based image transformations to read in and prepare their data.

MMLSpark has been released as an open source project on GitHub. Microsoft is welcoming contributions, both from people who can provide feedback on issues, request features and report bugs, and from those who can contribute documentation, new features and bug fixes.