As a business owner, you have many things to weigh when evaluating a cloud contract. If you ask, “What happens when my cloud server fails?” you are in the right headspace. You want to know the game plan when disaster strikes, and the answer hinges on two important aspects of managed cloud services: response time and recovery.
Before I dive into that, let’s put cloud services in context.
The cloud progression
Most people think “the cloud” is a recent phenomenon, but it isn’t. You can trace the history of cloud computing back to 1963, right around the time that the idea for the Internet (via ARPANET) sparked a gleam in the eyes of nerds like J.C.R. Licklider, who was, of all things, a psychologist who dreamed of an “Intergalactic Computer Network.”
No matter its origins or the corny original terms, the Internet did come into being, as did the cloud, although the full potential of the cloud only began to be realized in the 1990s.
Fast forward to 2021, and the cloud is used everywhere. Life as we now know it would not be possible without the cloud, as it gives us the ability to access information wherever we are, whenever we want.
Can you imagine not being able to access your Spotify playlist while waiting for your vaccine, or not being able to scroll through thousands of movies on Netflix, trying to find something you haven’t watched yet, while on lockdown? Neither can I.
Yet as ubiquitous and generally reliable as the cloud is, few people really understand that it is not infallible. The cloud is an advanced technology, yes, but it is still technology with a physical structure and many moving parts. It can still fail and break down, and it has, somewhat spectacularly.
The big cloud server fails
People rely on the cloud for information access, while businesses rely on cloud servers to provide customers with that access safely and reliably. It would not be an exaggeration to say that the biggest businesses today would not be around without the cloud. However, even the biggest cloud service providers, with all their redundancies and with the biggest companies relying on them, are not proof against failure.
In June 2019, Google Cloud was hit by a severe networking outage that caused companies and services such as Snapchat, Shopify, Discord, YouTube, Gmail, Google Search, and Google Docs to experience significant downtime. It took hours for engineers to figure out the problem, which turned out to be an issue originating from Level 3, an internal network provider. The same provider was responsible for authentication failures that affected Office 365 and Azure Government Cloud, among others, in January of the same year.
Microsoft Azure has also suffered several outages recently that affected various parts of its operations. The reasons ranged from capacity problems (Europe) to a cooling system failure (US East region) to coding problems, and the outages affected services such as Office 365 and the Azure Pipelines used by DevOps teams.
The question of size
Google Cloud and Microsoft Azure are massive service providers catering to equally massive businesses, and they have multiple teams of engineers and IT professionals managing their systems. Yet it took hours for them to solve the problem every time they took a hit. Why is that?
The most probable answer is size. When dealing with complex structures such as the cloud, scalability has its limits. It is hard to do maintenance and diagnostics effectively when you are dealing with a large number of interconnected servers and systems, and it gets exponentially harder as the capacity grows. Something is bound to slip through the cracks.
Chime in with size cliches:
The bigger they are, the harder they fall.
Mo’ money, mo’ problems.
Okay, I’m done.
The thing is, companies with massive amounts of data and users probably have no choice but to use cloud services with enough capacity to house them. Smaller companies, however, do have other options.
The personal touch
I wanted to find out if size really did matter, so I reached out to Intelligent Technical Solutions (ITS) CEO Tom Andrulis. ITS is a managed IT service provider servicing small to medium-sized businesses in Las Vegas, Chicago, Phoenix, and Los Angeles. Tom describes his company as a “small local company” servicing a wide range of local and online businesses.
“Oh yeah, our response times are less than 10 minutes,” he said when I asked him how quickly he jumped on a client’s problem. “If somebody's server is down, it's less than 5 or 10 minutes like we’ll have an engineer working on that. We drop other things.”
As to the question of recovery, the answer was not so simple. “That’s a tough question,” he said laughingly. “There's all these different scenarios that can occur. The first step is to diagnose the problem. Our advantage over bigger companies is that we are pretty familiar with all the systems we maintain, so we typically find the problem pretty quickly.”
That sounds about right.
If you are a small company worried about what happens when your cloud server fails, you have a right to be, because it can, and it has. However, if you have a cloud contract with a reliable, local company, there is a very good chance your business is protected by safeguards in place against extended downtimes and permanent data loss.
I made a point of asking Tom what would happen in the worst-case scenario: a data loss for which ITS was solely responsible.
“We do our best to limit our liability, but the reality is, you know something catastrophic like that can happen,” he said. “So, we have what's called an Errors and Omissions insurance. If something goes bad on our end and it affects the client’s data or their service, our insurance carrier will compensate the client.”
I asked him if he ever had to make an E&O insurance claim in the 18 years he had been in the business.
He said no.