When Data Centers Fail, Excuses Don’t Cut It

This is no way to run an Internet.

When an Amazon data center in northern Virginia went out of service on June 29 in the midst of severe thunderstorms that caused power outages, it was a very big deal because Amazon provides its cloud-based Web Services to marquee names like Instagram, Pinterest, and Netflix among many others. Adding insult to injury was the fact that this was the second outage at that facility in less than a month.

We shouldn’t be living in a world where Internet users have to watch a weather map and wonder whether a cumulonimbus cloud hovering over the D.C. suburbs is going to derail their photo uploading and movie downloading. Aren’t these companies and their infrastructures supposed to be more failsafe than that?

Of course they are. Any company that powers its products or services with the help of a third-party cloud provider needs a fail-over plan. The easiest strategy is to minimize downtime risk by spreading the work among several data centers (something Netflix actually does). In fact, Amazon gives its cloud customers the option of storing their data at eight separate data centers, but the cost is prohibitive for up-and-coming companies that turned to the cloud as a money-saving strategy in the first place.
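To see why spreading the work across data centers pays off, here is a back-of-the-envelope sketch in Python. It assumes each data center fails independently of the others, which is a simplifying assumption for illustration and not something Amazon publishes; a storm that knocks out several facilities in one region breaks it. The function name and the 99 percent figure are hypothetical.

```python
def combined_availability(single_site: float, sites: int) -> float:
    """Chance that at least one of `sites` data centers is up.

    Assumes every site has the same availability and that
    failures are independent (an idealization).
    """
    if sites < 1:
        raise ValueError("need at least one site")
    # The service is down only when every site is down at once.
    all_down = (1 - single_site) ** sites
    return 1 - all_down


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} site(s) at 99% each: {combined_availability(0.99, n):.6f}")
```

Under that assumption, two sites that are each up 99 percent of the time are both down only about 0.01 percent of the time. The real gain is smaller whenever failures are correlated, which is the argument for picking sites in different regions.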

As for Amazon, there’s simply no excuse. Data center designs include uninterruptible power supply strategies, and if that means you have to install batteries that are as big as Mack trucks, then so be it. Whatever it takes. This is a business where 99 percent uptime doesn’t cut it. One can only hope that any government data centers that are vital to national security and that are very likely also located in northern Virginia rode out the storm better than Amazon’s did.
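For a sense of what an uptime percentage means in practice, the arithmetic can be sketched in a few lines of Python. The function name is made up for illustration, and the year is taken as 365 days; none of these figures come from Amazon's actual service terms.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(uptime_percent: float) -> float:
    """Hours per year a service may be down at a given uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        hours = annual_downtime_hours(pct)
        print(f"{pct}% uptime allows about {hours:.2f} hours of downtime a year")
```

At 99 percent, that works out to roughly 87.6 hours a year, or more than three and a half days of darkness. Each additional nine cuts the allowance by a factor of ten.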

If any good comes out of this inconvenient and embarrassing mess, it will be in the form of serious discussion about how robust cloud-based implementations really are and how we can strengthen them going forward. After all, business is heading to the cloud, government is heading to the cloud, and most of us have already arrived there with our personal photos and files and e-mail. We shouldn’t have to lose sleep over a little midnight thunder.

Comments

  1. BY Fred Bosick says:

    If you want reliability, take the MBAs out of the chain of command. Put in techies and *pay* for the equipment they specify.

    MBAs are geniuses at “satisficing”: how bad can we make a product before customers go elsewhere? MBAs care only for ROI. Customers are a distant 2nd.

    • BY Glen Smith says:

      I disagree that MBAs are only concerned about ROI, unless you mean THEIR ROI and not even the ROI of their employer. They seem more concerned with selling their projects than with ensuring that the project has a solid ROI for their company. If they were concerned only about ROI, such things as you are complaining about would probably not happen. What you are talking about is bad ROI projections. Bad ROI projections would include failure to properly adjust cost and revenue for negative risk, failure to properly discount future revenue, taking on too much risk that could have been dealt with on the front end, and failure to properly cost project resources. What you are talking about is falsely projecting ROI as higher than what should have been projected so the given project is selected (i.e., a focus on sales and not on ROI).

      Do you know the story about the condemned prisoner who told the king he’d teach the king’s horse to sing? The condemned prisoner is analogous to the MBA, and his promise to teach the horse to sing is analogous to the false ROI. Now the prisoner has saved his life (the MBA saves his job) and gets to live in splendor (the MBA gets a nice income and high status); things may actually work out (I’ve seen badly planned projects work), or time may cause the king to be interested in another thing (there could be a better project in the future). In any case the prisoner is in a better position to escape (blame the project failure on something/someone else, redefine what failure/success is so you can claim success even though you failed, or move on while the project looks OK from an outsider’s view).

  2. BY Portaluser says:

    It doesn’t matter what a company thinks it deserves or what does and doesn’t “cut it.” If what a company receives is at or above the SLAs that it is paying for, it has no justification to act as if it has been slighted. If you feel you need more 9s than the cloud provider can give you at a given price point, find another solution; don’t pitch a fit because you didn’t get more than you paid for.

  3. BY Rob LoBue says:

    Neither Amazon nor their clients have any excuse. Amazon needs to provide the uptime stated in their SLA, and their clients need to have redundancy in place. No amount of batteries prevents that DC from disappearing in a major disaster, so anyone running ALL of their services in a single DC is asking for failure. Ultimately, the client is at fault for not providing THEIR customers a DR plan as Netflix did.
