Musings on IT, data management, whitewater rafting, and backpacking

Thursday, March 22, 2012

When triple redundancy isn't enough


Last year we implemented a data archiving system, where Availability and Survivability were important design parameters. We had one primary server in California, with mirrors in Florida and Massachusetts – triple redundancy. Life was good for many months.

Right now, all three systems have been down for over 24 hours. 

Here's the story.

Three servers running, and then ...

Six weeks ago, we lost remote administrative and mirror access to the system in Massachusetts. After some confusing questions and replies, we were told the local sys admins had revoked our access to the system in the name of Security, without warning or appeal. When we raised this problem with Management, the reply was: we support our local sys admins.

We asked for our system back, so we could move it to a location with more reasonable access policies. Another round of confusing questions and replies, and the final answer: Your system doesn't really exist, but we'll buy you a new one. Uh, OK, send it to me in California. It's sitting in my office now.

While waiting for delivery, I requested space for the new server in another California server room we control. To make a long story short, I've been redirected to a centrally managed data center, and I'm waiting for a call back.

One down, two to go.

Four weeks ago, the California sys admin drops by to say he's leaving for much greener pastures. So he gives me the system password, and oh-by-the-way this system hangs at least once per week, so here's the hands-on process to revive it (power everything down, disconnect server from storage array, power up, wait for green lights, reconnect array). I ask him to cycle the system just before he leaves. The next day, it's down again.


We decide this is a good opportunity to test failover, so we fail over to the Florida server. That works.


However, I don't have permission to touch computers in the California server room, and the replacement sys admin is "under discussion" with management, meaning don't hold your breath. So I start negotiating for server room access, or assignment of another sys admin – which has dragged on for three weeks now, in part because I've been out with a terrible cold.

Two down, one to go.

Two days ago, I'm copied on an email from Florida, stating they are changing IP addresses for all the servers there, but everything should be up in a couple of hours. I check several hours later: no response from our server.

Oops, they forgot to change the DNS TTL for our server, here's the new IP address, should resolve in 8 hours or so. Go direct to that IP address: no response. It's after hours in Florida by now; they'll fix it in the morning.
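
For anyone who wants to script that by-hand check, here is a minimal Python sketch. It separates "DNS is still handing out the old address" from "the server itself isn't answering". It is an illustration only, not what we actually ran: the function names are made up for this example, and the hostname, address, and port in the usage comment are placeholders.

```python
import socket
from typing import Callable, Optional


def resolve(hostname: str) -> Optional[str]:
    """Return the IPv4 address DNS currently gives for hostname, or None."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def tcp_reachable(ip: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to ip:port opens within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(
    hostname: str,
    new_ip: str,
    port: int,
    resolver: Callable[[str], Optional[str]] = resolve,
    probe: Callable[[str, int], bool] = tcp_reachable,
) -> str:
    """Classify the outage: dead server, stale DNS, or nothing wrong."""
    # Probe the new address directly, so a stale DNS record cannot hide
    # a server that is actually up (or mask one that is actually down).
    if not probe(new_ip, port):
        return "server unreachable at new IP"
    # The server answers; now see whether DNS has caught up with the move.
    if resolver(hostname) != new_ip:
        return "server up, DNS still stale"
    return "ok"


# Hypothetical usage (placeholder hostname, address, and SSH port):
#   diagnose("archive.example.org", "192.0.2.10", 22)
```

The resolver and probe are passed in as parameters so the logic can be tried out without touching the network. In our case the first branch would have fired: waiting out the TTL was never going to help.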

Morning comes: server's up, but storage array is dead. Needs hands-on access to fix, but the sys admin is off today.

Three down. Game over.


I'm not sure what lessons to learn from this, yet.

Obviously, a theoretically elegant system design can be defeated by circumstances beyond your control – or imagination.
