Big data in Excel? 'Bad idea!' IBM's top tips on managing analytics projects

IBM high-performance computing and medical data expert Janis Landry-Lane provides five tips for better big data

Take a good look at Spark, don't over-copy data and, for goodness' sake, don't use Microsoft Excel. This was just some of the advice offered by Janis Landry-Lane, worldwide software-defined life sciences industry lead at IBM, as she delivered her headline session at the Computing IT Leaders Forum.

Speaking at the 'Learning from the Leaders in Big Data' event in London, Landry-Lane spelt out her key advice for big data practitioners based on 15 years of experience in high-performance computing, a field that led directly to the big data revolution. She is now spearheading big data projects that are helping sequence genomes and cure diseases.

1. Effective proof-of-concept periods lead to speed and provenance

"Our role is to say 'Let's design the proof-of-concept [PoC] so then when it's wildly successful, you can get the answer today - not tomorrow, not the next day'," said Landry-Lane. It's a complex model indeed, which needs to take a huge array of factors into account.

"Speed is key, data provenance is key. Provenance means - in healthcare - if they're going to sequence your genome and give you a drug based on your genome, and it's the right drug, and that it's your genome, and the data has not been touched and that it's been processed with the latest known algorithms.

"These are the types of things we have to pay attention to."

2. Don't use Excel

Landry-Lane explained that one of her biggest takeaways from "installing some of the very largest installations in the world" was how scaling applies to big data: "making it better, and stopping people falling off the edge".

Falling off the edge, she explained, essentially means falling back on spreadsheets to attempt big data analysis - still a surprisingly common practice among fledgling, or simply lazy, big data projects.

3. Use Spark

"I love the fact that Spark is gaining momentum," said Landry-Lane about the Apache open source cluster framework.

Landry-Lane is also a fan of "build, ship and run" automated deployment platform Docker. "I deal a lot in other industries with Docker containers - we have a big issue with reproducibility and a big issue of metadata," she said.

4. Keep a single dataset, and manage it properly

"We've seen users doing a lot of their own administration - they were moving data, copying data, analysing data," explained Landry-Lane.

"But big data is too big to copy, move and have many copies of. You really can get down to one copy - you can get to something called a global namespace, where everything shares a global file system.

"You do not have to have it copied over to Spark or Hadoop and then reload into MongoDB or whatever - you don't need to do this. There is technology - which has been out there since 1998 - so why should you do this? It just hasn't come to all industries yet from out of high performance computing."

Landry-Lane was referring to Spectrum Scale - an "IBM asset, and one of the finest we have", she said - arguing that software-defined storage for high-performance workloads can deliver greater scalability with just one copy of the data.
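
That read-in-place approach looks something like the following PySpark sketch, which assumes the shared file system is mounted at a POSIX path such as /gpfs/genomics (the mount point, file layout and column names are illustrative, not details from the talk): Spark reads the single shared copy directly, and the results are written back to the same namespace rather than being reloaded into a separate database.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("read-in-place").getOrCreate()

# Read directly from the shared, globally mounted file system - no second copy
# in HDFS. The path must be visible to every executor, which is exactly what a
# global namespace provides.
variants = spark.read.parquet("file:///gpfs/genomics/variants/")

# Run the analysis against the single, shared copy of the data.
summary = (
    variants
    .groupBy("chromosome")
    .agg(F.count("*").alias("variant_count"))
)

# Write results back into the same global namespace instead of reloading them
# into a separate store such as MongoDB.
summary.write.mode("overwrite").parquet("file:///gpfs/genomics/summaries/variant_counts/")

spark.stop()
```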

5. Think also about software-defined storage simply because data is getting massive

A typical genome sequence weighs in at 600GB, and will still be 200GB when compressed, said Landry-Lane. This is an indicator not just of the scale of medical data, but of how big data in general is, as ever, getting bigger and bigger.

"So let's use software to manage the data - you can't afford to keep data on spinning disk, you can't afford to keep it in object store, but you may need to keep it," she explained.

"We need to keep it potentially for the life of the patient, and have it sequenced many times, and analysed many times.

"The typical sequence size is 600GB. We're going to start keeping that kind of data on individuals, and that will change a lot, because your DNA sequences change."