Super Bowl night in America is practically a national holiday. This year’s game was interrupted
by one of the most widely viewed public power outages of the last decade: play stopped for more
than 30 minutes, with an incalculable financial impact, during one of the most popular television
spectacles in the United States. When outages like this occur, one cannot help but wonder about
the redundancy and resiliency of infrastructure as crucial as power. And while the Super Bowl’s
power requirements may exceed those of a typical business day, in light of “the outage,” is your
business’s IT infrastructure protected against a similar failure?
Businesses often install power systems similar to those one might expect at the Superdome. For
example, some datacenters have redundant power feeds entering the building to limit the impact
of an interruption on a single power grid. Downstream of the incoming power, businesses often use
battery backup to ride through momentary losses of power, and a generator to sustain systems
during a prolonged outage. Without these systems, or without ongoing maintenance, a business
might suffer the same impact as was experienced Sunday night.
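The layered protection described above (redundant feeds, then battery backup, then a generator) can be reasoned about with a small model. The sketch below is illustrative only: `PowerChain` and `load_stays_up` are hypothetical names, the figures are made up, and it ignores real-world factors such as fuel supply, transfer-switch behavior, and battery aging. It asks a single question: does the protected load survive a utility outage of a given length?

```python
from dataclasses import dataclass


@dataclass
class PowerChain:
    """Hypothetical model of the protection layers; all figures are illustrative."""
    feeds_available: int       # utility feeds still energized (0, 1 or 2)
    ups_runtime_s: float       # battery runtime at the current load, in seconds
    generator_start_s: float   # time for the generator to start and take the load
    generator_ok: bool = True  # False if the generator fails to start (e.g. missed maintenance)


def load_stays_up(chain: PowerChain, outage_s: float) -> bool:
    """Return True if the protected load rides through an outage of outage_s seconds."""
    if chain.feeds_available > 0:
        # A surviving redundant feed carries the load; the batteries are never tapped.
        return True
    if outage_s <= chain.ups_runtime_s:
        # Momentary loss: the UPS bridges the whole gap on its own.
        return True
    # Prolonged loss: the generator must be online before the batteries run out.
    # Fuel is not modeled, so a running generator is assumed to last indefinitely.
    return chain.generator_ok and chain.generator_start_s < chain.ups_runtime_s
```

For example, `load_stays_up(PowerChain(0, 600, 30), 3600)` is `True`, but the same one-hour outage returns `False` once `generator_ok=False`, which is exactly the scenario that skipped maintenance invites.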
Even with all of these fail-safes, issues sometimes occur. And when they happen, are your
procedures up to date so that you can get all of your systems back up and running in a timely
fashion? Your recovery time objectives are directly linked to your ability to regain power and
sustain it through the recovery process. For instance, during the Superdome outage, we saw the
lights come back on following a detailed checklist, with teams allowing each set of lights to
warm up and reach operating current before moving on to the next set of units.
This strategy is common in datacenter environments, because the startup (inrush) current of most
electronic devices far exceeds their operating current. Most datacenters are sized for the
operating current of the devices in place, which does not necessarily leave enough capacity to
start all devices at the same time.
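A quick arithmetic check makes the sizing problem concrete. This hedged sketch treats every device’s inrush peak as if it occurred at the same instant (a worst-case simplification); the function name `can_start_together` and the current figures are hypothetical examples, not measurements.

```python
def can_start_together(devices, capacity_amps):
    """Check a circuit against steady-state and simultaneous-startup draw.

    devices: list of (running_amps, inrush_amps) pairs (illustrative figures).
    Returns (fits_when_running, fits_when_all_start_at_once).
    """
    running = sum(r for r, _ in devices)
    inrush = sum(i for _, i in devices)
    return running <= capacity_amps, inrush <= capacity_amps


# Hypothetical rack: 20 servers drawing 2 A each when running, 8 A each at startup.
servers = [(2.0, 8.0)] * 20
print(can_start_together(servers, 60.0))  # (True, False): 40 A running fits, 160 A inrush does not
```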
In this situation, a startup order for your hardware assets after a power outage is a necessity,
so that you can prioritize the workload and bring up critical services first. Powering on too
many devices at once can cause another outage, while being too conservative can prolong the
outage.
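One way to turn a startup order into concrete batches is a greedy plan: sort devices by priority, then keep adding devices to the current batch until the running draw of everything already on, plus the batch’s combined inrush, would exceed the circuit’s capacity. This is a sketch under stated assumptions: a single shared circuit, each batch settling to running current before the next is switched on, and a hypothetical `Device` type, `plan_startup` function, and inventory whose figures are illustrative rather than vendor data.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    priority: int        # lower number = more critical, powered on earlier
    running_amps: float  # steady-state draw once booted
    inrush_amps: float   # peak draw while powering on


def plan_startup(devices, capacity_amps):
    """Group devices into power-on batches in strict priority order.

    A batch is closed as soon as adding the next device would make
    (running draw of devices already on) + (combined inrush of the batch)
    exceed capacity_amps.
    """
    batches, batch = [], []
    running = 0.0       # steady-state draw of devices in earlier batches
    batch_inrush = 0.0  # combined inrush of the batch being assembled
    for d in sorted(devices, key=lambda dev: dev.priority):
        if batch and running + batch_inrush + d.inrush_amps > capacity_amps:
            # Close the batch; once it settles, only its running draw remains.
            batches.append([x.name for x in batch])
            running += sum(x.running_amps for x in batch)
            batch, batch_inrush = [], 0.0
        if running + d.inrush_amps > capacity_amps:
            raise ValueError(f"{d.name} cannot start without exceeding capacity")
        batch.append(d)
        batch_inrush += d.inrush_amps
    if batch:
        batches.append([x.name for x in batch])
    return batches


# Hypothetical inventory on a 24 A circuit.
inventory = [
    Device("core-switch", 0, 1.0, 4.0),
    Device("san", 1, 4.0, 12.0),
    Device("db1", 2, 2.0, 8.0),
    Device("db2", 2, 2.0, 8.0),
    Device("web1", 3, 1.0, 6.0),
    Device("web2", 3, 1.0, 6.0),
]
print(plan_startup(inventory, 24.0))
# [['core-switch', 'san', 'db1'], ['db2', 'web1'], ['web2']]
```

Keeping strict priority order means a less critical device is never started ahead of a more critical one, at the cost of occasionally leaving some headroom unused in a batch.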
Power distribution is an issue as well. Typical datacenter redundancy exists at the server’s
power supplies: plugging two cords into two separate power systems (e.g., two UPS units)
distributes the load and avoids overloading either side during normal conditions. Unfortunately,
some devices don’t have redundant power supplies, which unbalances the load between the two power
sources and can overload one of them. Carefully monitoring what is plugged in where, as part of
this type of recovery plan, is the only way to ensure that circuits are not overloaded.
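The imbalance problem can be checked with simple bookkeeping. This sketch assumes, as a simplification, that a dual-corded device splits its draw evenly across its two supplies (real power supplies may not share exactly 50/50), and that when one side fails every dual-corded device shifts its full draw to the surviving side while single-corded devices on the failed side simply lose power. The function names, the `"AB"`/`"A"`/`"B"` labels, and all figures are hypothetical.

```python
def feed_loads(devices):
    """Normal-condition draw on sides A and B.

    devices: list of (amps, cords) where cords is "AB" (dual-corded),
    "A" or "B" (single-corded on that side).
    """
    a = b = 0.0
    for amps, cords in devices:
        if cords == "AB":
            a += amps / 2  # assumed even split across both supplies
            b += amps / 2
        elif cords == "A":
            a += amps
        elif cords == "B":
            b += amps
        else:
            raise ValueError(f"unknown cord label: {cords!r}")
    return a, b


def failover_load(devices, surviving):
    """Draw on the surviving side ("A" or "B") if the other side fails."""
    return sum(amps for amps, cords in devices if surviving in cords)


# Hypothetical rack: five dual-corded servers plus two single-corded devices on side A.
rack = [(4.0, "AB")] * 5 + [(3.0, "A"), (2.0, "A")]
capacity_per_side = 24.0
print(feed_loads(rack))  # (15.0, 10.0): both sides look fine in normal operation
for side in ("A", "B"):
    load = failover_load(rack, side)
    print(side, load, "OK" if load <= capacity_per_side else "OVERLOADED")
```

Day to day both sides look healthy, yet the single-corded devices on side A leave no room for side A to carry the full load alone (25 A against 24 A), while a side A failure would simply drop those two devices.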
Finally, as we saw at the Superdome, as well as in some of the other recent natural disasters
around the world, pre-plan for any potential incident. If brownouts are expected in the area due
to heavy air-conditioner use or severe snowstorms, or if a major event at your business will push
power requirements very high, forethought and planning around datacenter redundancy and
resiliency are always a good idea.
To summarize:
- Ensure that all systems (e.g., generator, UPS, batteries, HVAC) are running and in peak
operating condition
- Complete all regular maintenance on these units, and keep track of their life expectancies
- Maintain a power-on order and timing for the complete datacenter in case many devices are
lost in a single incident
- Continuously monitor power systems to ensure that none would be overloaded if it had to carry
the load alone during a disaster
- Pre-plan for any major event or upcoming weather
While none of these would necessarily have saved the Superdome, they may spare your business a
major loss of service and a potentially public outage.