Musings on IT, data management, whitewater rafting, and backpacking

Saturday, January 9, 2010

Massive Disk Failure, Part 2

Most of the dust has settled, though we are still waiting for root cause analysis from our drive/chassis vendor. They found nothing wrong with the first two drives that failed, but apparently this is a common finding.

Recovering from backup tapes was far more painful than anticipated. Despite weekly backups to tape, several people lost 2-3 weeks of work for various reasons:

  • Tapes written during our ill-advised "let's keep running during rebuild" phase had lots of mangled files and filenames.
  • The tapes before those just happened to be written during the first round of disk failures. More mangled files and filenames.
  • Discovering which files and names were mangled took lots of extra time and effort, including re-reading tapes.
In several ways we had a near-worst-case scenario for recovery time:
  • 2 drives failed, rebuild took most of a week
  • 2 more drives failed near the end of that week
  • Restoring from tapes took most of another week, and then some
  • "pax" tapes must be read from beginning to end while searching for mangled files.
For our 8-drive ZFS pool, our real-life bad-case recovery time is 2+ weeks.

We've had some heated internal discussions about what this implies for our three Thumpers (Sun X4500 servers). Those are currently configured with one giant pool of forty 500 GB drives.

By my math, scaling our 8-drive experience up by a factor of five, this means the real-life recovery time might be:
  • 2 drives fail, rebuild takes 5 weeks
  • Near the end of the five weeks, one or more additional drives fail
  • Restoring from tapes takes another 5+ weeks
For a 40-drive ZFS RAIDZ2 pool, that is a bad-case recovery time of 10+ weeks, during which the system must be offline. If we had 2 TB drives, that could be 40+ weeks!
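
The back-of-the-envelope math above can be sketched in a few lines of Python. This is only a rough model. It assumes recovery time scales linearly with raw pool capacity, that the 8-drive pool also used 500 GB drives (the post does not say), and that the observed rebuild and restore each round to one week. The function and parameter names are made up for illustration.

```python
def bad_case_recovery_weeks(drives, drive_gb,
                            base_drives=8, base_drive_gb=500,
                            base_rebuild_weeks=1.0, base_restore_weeks=1.0):
    """Scale the observed 8-drive recovery time linearly with raw capacity.

    The baseline values are assumptions: roughly one week of rebuild plus
    roughly one week of tape restore on an 8-drive pool of 500 GB drives.
    """
    scale = (drives * drive_gb) / (base_drives * base_drive_gb)
    return scale * (base_rebuild_weeks + base_restore_weeks)


# Compare the pool we lost with the Thumper pools, today and with 2 TB drives.
for drives, drive_gb in [(8, 500), (40, 500), (40, 2000)]:
    weeks = bad_case_recovery_weeks(drives, drive_gb)
    print(f"{drives} x {drive_gb} GB: about {weeks:.0f}+ weeks")
```

Real rebuild time depends on how full the pool is and how busy it stays, so linear scaling is probably pessimistic in some cases and optimistic in others.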

Even in our relaxed environment, that's pretty bad. But that 10+ weeks must be weighed against our anticipated failure rate for this scenario. We have about 11 server-years of operating experience with ZFS across 4 servers, and one disaster so far. One disaster every 11 server-years? With 4 servers, one disaster roughly every 2.75 years?
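
To put rough numbers on that trade-off, here is a small sketch. It reads the 11 operating years as the fleet-wide total (an assumption) and counts this incident as the only disaster so far. One event is a very thin sample, so treat the result as an order-of-magnitude guess. The function names are hypothetical.

```python
WEEKS_PER_YEAR = 52


def years_between_disasters(server_years, disasters, servers):
    """Average calendar years between disasters somewhere in the fleet."""
    return server_years / disasters / servers


def fleet_recovery_fraction(outage_weeks, years_between):
    """Fraction of calendar time during which some server is in recovery."""
    return outage_weeks / (years_between * WEEKS_PER_YEAR)


# Assumed inputs: 11 server-years total, 1 disaster, 4 servers, 10-week outage.
interval = years_between_disasters(server_years=11, disasters=1, servers=4)
fraction = fleet_recovery_fraction(outage_weeks=10, years_between=interval)

print(f"one disaster about every {interval:.2f} years across the fleet")
print(f"some server in recovery about {fraction:.1%} of the time")
```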

Maybe worse than that. Our oldest Thumper just started throwing disk error messages. Only one disk so far, so we are at the wait-and-see stage.

My confidence in ZFS is somewhat shaken, but maybe the problem lies in our expectations.

Our recovery from a true physical disaster (fire, earthquake, flood, etc.) would take much longer than 10 weeks, since we don't have spare servers and data centers in place.

The obvious solution - servers mirrored and geographically separated - is not in my budget for the foreseeable future. And I'm not quite sure how you keep from mirroring garbage when the primary system goes flaky but not down.

We've got some thinking to do.
