November in Seattle is always cool and rainy and sometimes stormy – windstorms, that is. Seattle’s all-time high temperature – for any day of the year – is 100 degrees. That all-time high is, of course, outside. But it reached 108 degrees here on Sunday, November 16th. Inside a data center. The City of Seattle’s data center.
To make a short blog entry even shorter, I’ll skip to the root cause: a failed power breaker on a pump for the domestic water supply to the building housing the data center. The water supply flows to CRAC (“crack” or computer room air conditioning) units which, in turn, cool the data center. For HomeCity Security reasons, I won’t reveal the actual location of the data center, but let’s just say it is in a downtown 60-story skyscraper which also houses about 3500 office workers during the week. The problem started about noon and was fixed at about 8:00 PM.
The data center holds about 500 servers, storage systems and other equipment. We shut down a lot of servers and many services starting almost immediately. Nevertheless, the temperature in the data center rose to that toasty 108 degrees, setting a new record high (sort of) for Seattle.
So why is this notable? For two reasons: the problem and the response.
In terms of the “problem”, let me assure you (especially if you live in Seattle) that cooling problems like this will be rare to non-existent in the future. Years ago we installed a one megawatt generator for backup power. This year we’ve been working on a project to install “dry coolers”. These aren’t really “dry”, but the water cooling the data center will flow in a “closed loop” between the new coolers and the center, so we’ll no longer be dependent on external water or power supplies. Unfortunately, the dry coolers don’t come online until January, which is why we went to 108 degrees last Sunday.
But there’s a more general issue here – every city and county government has data centers and servers and vital information. Every area of the country is subject to some sort of a disaster and every government needs to have a backup and recovery plan.
But for what disaster should we prepare?
Here in Seattle, everyone is concerned about the “big one” – a magnitude 8.0 earthquake. While we need to be ready for a major earthquake, we have about one of those “big ones” every 300 years. Much more likely are disasters like last Sunday – a failure of water and cooling, a “meltdown” if you will (non-radioactive, however!). Or perhaps the disaster will be the opposite – too much water from a broken pipe, and a flood drowning those servers. Or – and this also happens in computer centers – a fire followed by (drum roll), a flood as the fire suppression system kicks in. Should we have a plan for “the big one”, that earthquake? Sure. But most of our disaster preparation effort should plan for the much more probable disaster of fire and water.
Finally, any disaster response plan has one element which is vastly more important than any other: people. And, on November 16th, the “people” (employees) of the City of Seattle and its Department of Information Technology performed splendidly. A dozen IT professionals showed up on site within two hours (despite interference from the traffic around a nearby Seahawks football game). The computer center manager – a 44-year employee and true hero, Ken Skraban – was on site and immediately in charge. Two employees set up an IT operations center with an incident commander and support staff. Several responded to the data center and shut down servers in an orderly, pre-planned, color-coded (red-green-orange-yellow) fashion, with the most critical servers (for example “Blackberry” support) staying up continuously. Server administrators from every major department in City government responded on site.
And when the crisis was past and cool water was again flowing to the “crack” units, those same folks brought all services up in an orderly fashion. And there was not a single call to the help desk on Monday morning as a result of our unanticipated “summer” high.
Disasters happen. Careful planning and skilled, trained staff will always mitigate their effects.