Q&A: CERN group leader of fabric infrastructure and operations, Tony Cass

Cass explains how CERN is processing data from Large Hadron Collider experiments in a virtual environment for the first time

Cass: data created by Large Hadron Collider experiments is processed on virtual servers

The European Organisation for Nuclear Research (CERN), the scientific research facility using the Large Hadron Collider (LHC) to probe the mysteries of the universe, is testing a virtualised server environment spanning multiple shared datacentres that could eventually see all of the facility’s grid-based number-crunching applications run on virtual, rather than physical, machines.

CERN group leader of fabric infrastructure and operations, Tony Cass, tells Computing how this presents a significant management and security challenge, especially when it comes to convincing 10,000 exacting physicists that virtual servers can run their batch processing jobs properly.

Computing: What is CERN doing with server virtualisation?
Cass: We are working on a virtualisation model for a batch computing environment as a test at the moment, involving 20 machines offered to one of our existing customers, but we plan to roll out virtualisation much more widely next year. We want to convince the researchers accessing our systems that their [batch] jobs will carry on running much as they did in the past. We have 5,000 physical systems each with 16 cores, which can potentially support 80,000 virtual machines (VMs). But we need something to help us manage and understand what is going on in that virtual environment.
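For illustration, the capacity figure Cass quotes follows from simple arithmetic, assuming one virtual machine per core (an assumption implied by the numbers rather than stated outright in the interview):

```python
# Back-of-the-envelope check of the capacity figure quoted above.
# Assumption (not stated in the interview): one virtual machine per core.
physical_hosts = 5_000
cores_per_host = 16

potential_vms = physical_hosts * cores_per_host
print(f"Potential VMs: {potential_vms:,}")  # Potential VMs: 80,000
```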

What runs on those CERN servers?
It is a batch environment handling two types of data: raw data from the LHC experiments in the form of ones and zeros, and raw images like those from a digital camera. Around 10,000 physicists around the world process and analyse snapshots of this data using the grid, submitting jobs to process data held on a mass disk subsystem at CERN. What matters specifically is how we manage those jobs, routing them to the compute capacity we have.
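The routing Cass describes can be pictured with a simple sketch: jobs declare how many cores they need and are matched against hosts with free capacity. The Job and Host records and the greedy placement policy below are hypothetical illustrations, not CERN's grid middleware or Platform LSF's actual scheduling logic.

```python
# Illustrative sketch of routing batch jobs to available compute capacity.
# The Job/Host structures and the greedy placement policy are hypothetical.
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    free_cores: int

@dataclass
class Job:
    job_id: str
    cores_needed: int

def route_jobs(jobs: list[Job], hosts: list[Host]) -> dict[str, str]:
    """Assign each job to the first host with enough free cores."""
    placement: dict[str, str] = {}
    for job in jobs:
        for host in hosts:
            if host.free_cores >= job.cores_needed:
                host.free_cores -= job.cores_needed
                placement[job.job_id] = host.name
                break
    return placement

hosts = [Host("node01", 16), Host("node02", 16)]
jobs = [Job("analysis-1", 4), Job("reco-2", 8), Job("analysis-3", 16)]
print(route_jobs(jobs, hosts))
# {'analysis-1': 'node01', 'reco-2': 'node01', 'analysis-3': 'node02'}
```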

We also need to provide enough processing capacity to support the production and analysis of more than 15 petabytes of data per year, including all of the data we collect from the LHC experiments, which we expect to add to year on year.

Do any of the researchers using CERN resources worry that virtualising systems will mean a performance drop-off?
They have expressed concerns, but people are always nervous about that sort of thing if there is any chance of disruption. The LHC started last year, ran for 10 days, then stopped, and has only just started up again, and people do not want to think anything can stop their data processing. CPU and network access carry no performance penalty; the biggest challenge is accessing data on local disk servers, because there is input/output (I/O) traffic all over the network.

What are the management challenges posed by virtualising servers on this scale?
We have to make sure all of those VMs are up to date with security patches and so on, and provide a policy-driven approach that guarantees a minimum level of service for some customers while letting us expand [compute] capacity for others as necessary. In any server environment that uses live migration with a hypervisor, you have to know where all the VMs are, where the physical hardware is, and what will be affected.
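The "know where all the VMs are" requirement boils down to keeping an accurate placement inventory. The minimal sketch below, with made-up VM and host names, shows the kind of query an operator needs to answer before draining or live-migrating a physical host; it is not the tooling CERN actually runs.

```python
# Minimal sketch of a VM placement inventory: which guests are affected
# when a physical host is drained for maintenance or live migration?
# All VM and host names here are hypothetical.

# virtual machine -> physical host it currently runs on
vm_to_host = {
    "batch-vm-001": "blade-a01",
    "batch-vm-002": "blade-a01",
    "web-service-vm": "blade-b07",
}

def vms_on_host(host: str) -> list[str]:
    """Return every VM whose placement record points at the given host."""
    return [vm for vm, h in vm_to_host.items() if h == host]

print(vms_on_host("blade-a01"))  # ['batch-vm-001', 'batch-vm-002']
```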

How is CERN addressing this management challenge?
We are using Platform Computing’s ISF adaptive cluster and LSF grid workload management software to manage both virtual and physical machines: the short-lived VMs used for batch processing as well as the longer-lived machines that run classical server processes for many days or months.

Back in the mid-1990s CERN was one of the first to support running tens of thousands of jobs on Linux, but now that is mainstream. ISF offers a pretty comprehensive range of software, but there are other things we might need, like OpenNebula, an open source toolkit for cloud computing that acts as a virtual machine manager. The attraction here is that it will interoperate with other packages as necessary, because we need to cover the whole range rather than rely on a single hypervisor management system from one company.
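OpenNebula describes virtual machines with plain-text templates. The fragment below writes a simplified, hypothetical template of that general shape; the image and network names are invented, and the exact attributes and CLI commands vary between OpenNebula versions, so consult the project's documentation rather than treating this as CERN's configuration.

```python
# Writes a simplified, OpenNebula-style VM template to disk.
# Attribute values, image name and network name are hypothetical examples;
# check the OpenNebula documentation for the syntax your version supports.
from pathlib import Path

template = """\
NAME   = batch-worker-01
CPU    = 1
MEMORY = 2048
DISK   = [ IMAGE = "sl5-batch-image" ]
NIC    = [ NETWORK = "batch-net" ]
"""

Path("batch-worker.one").write_text(template)
# On an OpenNebula front-end, a template like this would then be submitted
# with the command-line tools, e.g.:  onevm create batch-worker.one
```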