Musings on IT, data management, whitewater rafting, and backpacking

Tuesday, December 29, 2009

Massive disk failure

I have been a big fan of Sun's ZFS file system for many reasons. We have about 100 TB of active ZFS storage, mostly in ZFS RAIDZ2 pools, which can survive two simultaneous drive failures per pool.
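As a rough sketch of what that buys us, here is a back-of-the-envelope Python calculation. The drive count and drive size are hypothetical examples, not our actual configuration; the point is just that an eight-drive RAIDZ2 vdev gives up two drives' worth of space to parity and keeps running through any two drive failures:

  # Rough RAIDZ2 capacity math. Drive count and drive size are
  # hypothetical examples, not our actual configuration.
  drives_per_vdev = 8      # drives in one RAIDZ2 vdev
  drive_tb = 1.0           # raw capacity per drive, in TB
  parity_drives = 2        # RAIDZ2 reserves two drives' worth of parity

  usable_tb = (drives_per_vdev - parity_drives) * drive_tb
  overhead = float(parity_drives) / drives_per_vdev

  print("usable capacity per vdev: %.1f TB" % usable_tb)  # 6.0 TB
  print("parity overhead: %.0f%%" % (overhead * 100))     # 25%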

We recently suffered the biggest storage failure we've had in decades. The story is still unfolding, but the beginning and middle have already been written.


Two drives in an eight-drive pool failed over one weekend. This pool is on our most heavily used production server, so we decided to let the system rebuild onto new drives while staying in production. The rebuild was going to take more than a week. During the rebuild, many critical files were found to be corrupted. Despite daily ZFS snapshots, we were forced to restore those critical files from tape or other sources while the rebuild was still running.
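The week-plus estimate is less surprising once you work the numbers. The drive size and resilver rates below are assumptions for illustration, not measurements from our array, but they show how badly production I/O can stretch out a resilver:

  # Back-of-the-envelope resilver time. Drive size and effective
  # resilver rates are assumed values, not measured on our array.
  drive_gb = 1000.0          # capacity of the replaced drive, in GB

  idle_rate_mb_s = 60.0      # assumed resilver rate on an idle pool
  busy_rate_mb_s = 1.5       # assumed rate while also serving production I/O

  def resilver_days(rate_mb_s):
      seconds = (drive_gb * 1024.0) / rate_mb_s
      return seconds / 86400.0

  print("idle pool: %.1f days" % resilver_days(idle_rate_mb_s))  # ~0.2 days
  print("busy pool: %.1f days" % resilver_days(busy_rate_mb_s))  # ~7.9 days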

Before the week-long rebuild completed, two more drives failed. Now we were truly hosed. Four drives failing on one system in one week! We disabled multi-user access to the system and began the laborious, lengthy process of restoring from tape to a previously unused, identical array. The restore should finish today -- about 10 days later! The drive array vendor is taking this incident very seriously. The vendor has overnighted replacement drives and a new chassis with power supply so we can move our existing drives over. We've seen bad power supplies take out multiple disk drives or other components before. The vendor plans to analyze our failed drives, old chassis, and power supply to find the root cause.

Here are some of our preliminary lessons learned from this incident:
  • With ZFS RAIDZ2, a double drive failure means we must take the pool offline for the rebuild. Do not try to keep the pool in production during the rebuild.
  • Smaller ZFS pools rebuild or restore from tape much faster than larger pools.
  • We must balance larger ZFS pools against rebuild/restore times. Larger pools have lower parity overhead, but much longer rebuild and restore times (see the rough numbers sketched after this list).
  • Tapes are way slow. We need to find a better scheme for disaster recovery.
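Here is a rough worked example of that overhead-versus-time tradeoff. The vdev widths and the sustained restore rate are hypothetical; the real rate depends entirely on the tape and disk hardware:

  # Parity overhead vs. rebuild/restore time for RAIDZ2 vdevs of
  # different widths. All numbers are hypothetical, chosen only to
  # show how the tradeoff scales.
  restore_rate_mb_s = 80.0    # assumed sustained restore throughput
  drive_tb = 1.0              # raw capacity per drive, in TB

  for drives in (6, 8, 12, 16):
      usable_tb = (drives - 2) * drive_tb       # RAIDZ2 keeps 2 drives of parity
      overhead_pct = 200.0 / drives             # parity as a % of raw capacity
      restore_days = (usable_tb * 1024 * 1024) / restore_rate_mb_s / 86400.0
      print("%2d drives: %4.1f TB usable, %4.1f%% parity, ~%.1f days to restore"
            % (drives, usable_tb, overhead_pct, restore_days))

Wider vdevs waste less space on parity, but every extra terabyte of usable data is another chunk of time the pool spends rebuilding or restoring.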
Possible alternative disaster recovery schemes include:
  • Restore from multiple tape drives simultaneously. This needs a more complex backup/restore scheme, which might have other problems, and we are ultimately limited by the write speed of the destination disks, which two or three tape drives might exceed (see the throughput sketch after this list).
  • Offsite backup/restore over very high-speed WAN connections to disk pools. There are many variations on this theme; all involve very expensive WAN links, duplicated storage, and offsite data center space.
  • Offsite server mirroring. Duplicate your primary server, mirror contents in near-real-time, and fail over to the offsite mirror when your primary fails. This means expensive duplication of servers, possibly a WAN speed increase depending on the bandwidth needed by the mirroring process or user access, and offsite data center space.
  • Outsourced offsite backup/restore over very high-speed WAN connections, such as Amazon AWS. We must encrypt anything that goes offsite, and that might slow the process down too much to be practical. It also takes a lot of money for the offsite storage and the WAN link.
  • Outsourced offsite server mirroring is not an option, due to our security requirements.
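For comparing these options, the arithmetic is simply data size divided by the slowest link in the chain. The throughput figures below are hypothetical, not measurements of our gear or quotes for our WAN; they just show where each option bottlenecks:

  # Restore-time arithmetic for the options above. All throughput
  # figures are hypothetical, for comparison only.
  data_tb = 20.0               # amount of data to restore, in TB (example size)
  data_mb = data_tb * 1024 * 1024

  disk_write_mb_s = 300.0      # assumed write ceiling of the destination pool
  tape_drive_mb_s = 100.0      # assumed sustained rate of one tape drive
  wan_mb_s = 120.0             # assumed ~1 Gb/s WAN link, roughly 120 MB/s

  def restore_days(source_rate_mb_s):
      # The restore can never go faster than the destination pool can write.
      effective = min(source_rate_mb_s, disk_write_mb_s)
      return data_mb / effective / 86400.0

  print("one tape drive:    %.1f days" % restore_days(tape_drive_mb_s))
  print("three tape drives: %.1f days" % restore_days(3 * tape_drive_mb_s))
  print("1 Gb/s WAN link:   %.1f days" % restore_days(wan_mb_s))

In this example the third tape drive buys almost nothing, because the destination disks are already the bottleneck.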
Or we could simply accept the risk of extended downtime every decade or so and continue our existing practices, with some operational modifications.

We'll have some strategic discussions after the dust settles to decide how to proceed.
