Flattening the IT stress curve
We're asking IT to do more, yet we still want the same level of accountability and security as business-as-usual. How can we avoid overloading systems, processes - and people?
Remember 2019? It was only a few months ago, and yet for all that's been going on this year it may as well have been 5 years ago. Clearly, things are different now but it's helpful to look back before looking ahead. This is a word cloud of analyst inquiries, representing what was top-of-mind for organisations in December 2019:
That all seems pretty typical and reasonable. Well, perhaps not the appearance of "application rationalisation" in the lower right-hand corner (if you find any truly rational software, cherish it like the treasure it is!). The same chart in March 2020 looks quite different:
Given what we now know about the world, this is also unsurprising. I'm assuming that "procurement" in the lower right-hand corner here refers to IT infrastructure and not toilet paper or N95 masks, though I'm sure it's only a matter of time before AWS announces that feature at re:Invent 2020. These two charts are 3 months apart; it's amazing how little time it took for us to completely change what's top-of-mind.
It's not just what's on our minds that's changed. We've changed how we're working. We've changed what work we're prioritising. We've changed the constraints under which we work. The demands placed upon our applications and our infrastructure have changed. PagerDuty recently did some research to quantify the growth of IT incidents in recent times as compared to before the pandemic really took hold:
This points to a massive increase in how stressed organisations are. This is most acutely felt in sectors that are experiencing unprecedented demand, such as online learning and collaboration services. Demand is not always positive; for the travel industry, this can take the form of massive increases in refund or cancellation requests. But for those of us in the infrastructure world, it doesn't matter - after all, your servers don't care if the load is from users booking flights or asking for refunds.
So we're asking IT to do more, yet at the same time, companies are laying off IT staff and reducing IT spending to cope with current economic volatility. We're asking IT to make changes quickly to adapt to pandemic circumstances, yet we still want the same level of accountability and security from the time of business-as-usual. This combination of high load and high urgency conspires to overwhelm our infrastructure, our applications, our processes, and (most importantly) our people.
Many countries' pandemic responses are centred around the epidemiological principle of "flattening the curve": slowing the rate of infection so that at no point does the number of infected exceed the medical system's capacity to handle them.
IT is no stranger to capacity management; we balance watts, bytes, and budgets all the time. But we must similarly flatten the "stress curve": we must make sure that the peak stress we place upon our IT staff does not exceed their ability to handle it. This is true even in the best of times, but current circumstances have changed everything. Tens of thousands of IT teams are now working from home, leaving systems over-exposed. We've extended corporate networks into people's homes and opened up access to internal systems as a way to quickly adapt, creating greater attack surfaces. These IT teams are now tasked with managing complex infrastructure in ways that were difficult under ideal circumstances, let alone remotely where the demands of home life and work are fighting for attention. Failure to flatten the stress curve leads to more system failures, poorer-quality decisions, and burn-out.
The pandemic has magnified our strengths as well as our weaknesses. This is as true for IT as it is for humanity at large. The best parts of IT now have a chance to shine, but the places where IT has struggled are now under intense scrutiny and pressure. There's nowhere to hide.
The systems and processes that will best survive in this environment are the ones that are the most flexible and resilient. How quickly can you add capacity? How quickly can you patch your systems? How quickly can you push out changes to your applications? How easily can you move workloads to places where there's more capacity? How rapidly can you get a change approved by all the people that need to sign off on it? How quickly can you deploy new tools to the places that need them (e.g. VPN, videoconferencing)?
Agility in IT has long been sought after; that's much of the driving force behind the rise of DevOps. As Puppet's 2019 State of DevOps report noted, the most nimble IT teams are the highest-performing, and are best adapted to cope with volatile circumstances.
Unfortunately, many of the organisations most vital to restoring any semblance of normalcy are among the least flexible, working in some of the least flexible sectors, using some of the least flexible infrastructure out there. In the US, for example, state governments have been the worst hit. Their outdated systems have been inundated with an unprecedented volume of requests, leaving millions hitting refresh for hours - and sometimes days - to file for unemployment. But if we can't free up our systems' capacity, we can at least concentrate on freeing up human capacity.
In the delicate balance between time, our systems' capacity, and our actual brain capacity, the human element is what is under the most pressure. Organisations are under strain across the board and are facing vital decisions that must be met by cooperation and collaboration. At a time when collaboration is more important than ever, it is also more difficult than it has ever been. Urgency is high. The way to balance moving quickly versus moving carefully is by putting the people doing the actual moving first.
Automation isn't a cure-all, but it can be a definite force-multiplier, helping to free up people's time and brainpower. Even the most inflexible legacy systems can benefit from putting a bare minimum amount of automation in place. What are the most time-consuming, most repetitive tasks? Optimising those can help save people's energy so they can focus on more fundamental, less rote problems.
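One rough way to answer that question is to multiply how often a manual task is done by how long it takes each time, then sort. The sketch below does just that. The task names and figures are hypothetical, purely for illustration, and the `Task` class and `rank_automation_candidates` function are made-up names rather than part of any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    runs_per_week: int      # how often someone does this by hand
    minutes_per_run: float  # hands-on time for each run

    @property
    def weekly_minutes(self) -> float:
        # Total hands-on time this task consumes per week.
        return self.runs_per_week * self.minutes_per_run


def rank_automation_candidates(tasks):
    """Sort tasks by total hands-on minutes per week, largest first."""
    return sorted(tasks, key=lambda t: t.weekly_minutes, reverse=True)


# Hypothetical figures, for illustration only.
tasks = [
    Task("Reset VPN credentials", runs_per_week=40, minutes_per_run=5),
    Task("Provision test server", runs_per_week=6, minutes_per_run=45),
    Task("Rotate TLS certificate", runs_per_week=1, minutes_per_run=30),
]

for task in rank_automation_candidates(tasks):
    print(f"{task.name}: {task.weekly_minutes / 60:.1f} hours/week")
```

Time spent is only a first approximation. A task that is rare but error-prone, or that tends to wake someone up at 3am, may deserve a higher place on the list than the raw hours suggest.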
No matter the scale of the operation, the maturity of the organisation or the happiness of the people, the stress is real. The stress is compounded: by our systems, by the knowledge that there is only so much that can be done, and by the fact that in the other room, three kids are arguing about whose turn it is to be P1 in Super Smash Bros. If we can agree on the stressors that are pulling at us (systems, time, and human capacity), we can judge the risks and rewards when one of those is out of balance.
We can also give a little thanks. Have some gratitude for our teams, even if they aren't as agile as we'd hoped. Gratitude for our systems, inflexible though they may be. And gratitude for living in a time when remote work is possible.
Deepak Giridharagopal is CTO at Puppet