Musings on IT, data management, whitewater rafting, and backpacking

Friday, October 22, 2010

Massive Disk Failure Deja Vu

Last year, we suffered through extended downtime and data loss on our primary server due to multiple disk failures on a large ZFS RAID-Z2 array. The disk array vendor found no trouble in the array or drives, and my confidence in ZFS was badly shaken. If you need to refresh your memory: Part 1, Part 2, Part 3, Part 4.

It almost happened again a few weeks ago, with some bizarre new twists.

Two zpools share one hot spare drive. Our trouble started when that hot spare went offline over the weekend. Both zpools went offline, and we had an unusable system on Monday morning. Some frantic ZFS command-line efforts later, and we were back in operation, but resilvering both zpools. We crossed our fingers and hoped for the best overnight.
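
For the record, the frantic effort was nothing exotic, just the usual zpool incantations. A rough sketch of the kind of session involved, with made-up pool and device names (the real commands depended on what zpool status reported):

    # Check pool and device state; both pools showed the shared spare as faulted
    zpool status -v pool1 pool2

    # Clear error counters and try to bring the flaky spare back online
    zpool clear pool1
    zpool online pool1 c4t7d0

    # If the spare really is dead, drop it so the pools stop depending on it
    zpool remove pool1 c4t7d0
    zpool remove pool2 c4t7d0

    # Then watch the resilver grind along (and distrust the time estimate)
    zpool status pool1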

Next morning, another drive had gone offline. Too much like last year. We took the entire system down, reseated all the drives and array boards, and told ZFS to rebuild everything in sight. Like many systems, ZFS gives overly optimistic estimated completion times, but eventually the zpools were scrubbed. Based on last year's experience with file stumps (the file name is there but the file is empty), we used find and file to verify that every file still had reasonable contents. Then we rsynced to a spare array, backed up to tapes, and brought the system back online for users.
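
The stump check is easy to script. A simplified sketch of what I mean, with a made-up mount point (in practice we also spot-check a sample of files by hand):

    # Any zero-length file is a candidate stump: the name survived, the data didn't
    find /export/pool1 -type f -size 0 > /tmp/stumps.txt

    # For everything else, ask file(1) what the contents look like; files that
    # should be images, PDFs, etc. but come back as plain "data" or "empty"
    # get a closer look
    find /export/pool1 -type f ! -size 0 -exec file {} + > /tmp/filetypes.txt
    egrep ': (empty|data)$' /tmp/filetypes.txt > /tmp/suspects.txt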

We exchanged several rounds of cranky phone calls and emails with our array vendor again, including confusion over an array firmware update that had just been released but turned out to be irrelevant to our system. Our array vendor truly doesn't understand ZFS, and repeatedly pushed us to "upgrade" to their RAID 6 array to prevent further problems like this. We ignored them.

Bottom line:
  • No data lost. 
  • About two days of lost productivity for dozens of people.
  • No further problems after nine days.
Now what?

Something must change. We have experienced one catastrophe and one near catastrophe in less than one year on our most important server. Unfortunately, we have no idea what the root cause might be.

Since we can't discover the root cause, we must improve our ability to recover from similar failures in the future.

Some ideas I'm considering in the short term:
  • Reduce zpool sizes, shrinking rebuild times and the amount of data at risk from any single drive failure. This cuts our net storage, but we're not full yet.
  • Aggressively scrub zpools to get early warnings of problems.
  • rsync to another array daily, so we can redirect to the second array and resume operation while fixing the first array.
  • Back up to tape more often. We back up from ZFS snapshots, so our backup window is not tightly constrained. (A rough scheduling sketch follows this list.)
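
Most of that is scheduling, not new tooling. A rough sketch of the nightly job I have in mind, assuming a hypothetical pool1/data filesystem mounted at /export/pool1 and a standby array at /spare:

    #!/bin/sh
    # Nightly job sketch: snapshot, mirror to the standby array, scrub on weekends.
    # Pool names, mount points, and the schedule are illustrative, not our real config.

    today=`date +%Y%m%d`

    # Snapshot first; the tape backup reads from the snapshot, not the live filesystem
    zfs snapshot pool1/data@$today

    # Mirror the snapshot to the standby array so we can redirect users to it in a pinch
    rsync -a --delete /export/pool1/.zfs/snapshot/$today/ /spare/pool1/

    # On Saturdays, kick off a scrub for early warning of latent disk or checksum errors
    if [ `date +%w` -eq 6 ]; then
        zpool scrub pool1
    fi
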
In the long run, we must consider replacing our slow, commodity-drive based system with a fast, enterprise-drive based system. Our data are mostly "cold", but we need faster backup and recovery times than our current setup can deliver. The purchase cost will be 2x to 5x higher, but we're losing nearly that much every year in lost productivity, and potentially in lost data.

And we need to migrate from Solaris to Red Hat Linux in 2011. Oracle support for Sun hardware and Solaris software has been terrible. Third party software support for Solaris has already vanished for several of our key applications.

Of course, Linux means no ZFS, which means back to RAID 6 or some other scheme to improve the reliability of the data we store. Yes, FreeBSD has ZFS, but none of our third party application providers support FreeBSD.

I really hate to walk away from many ZFS features, including flexible zpools and partitions, snapshots, granular block verification, etc.
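
For anyone who hasn't lived with ZFS, a couple of one-liner pairs show what we'd be giving up (dataset and pool names are made up):

    # Instant, nearly free point-in-time copies, and instant rollback to them
    zfs snapshot pool1/data@before-cleanup
    zfs rollback pool1/data@before-cleanup

    # Walk every allocated block, verify its checksum, and repair from redundancy
    zpool scrub pool1
    zpool status -v pool1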

Moving off Solaris will end 25+ years of using Sun products for our group. We picked up two Sun-1/100 workstations with serial numbers around 170 from their roll-up door startup in Mountain View in the early 1980s, and have used mostly Sun servers ever since.

But our failure experiences, combined with terrible Oracle support and lack of third party software support, will force us off of Solaris+ZFS.

6 comments:

  1. Update: Yes, you can get ZFS on Linux from http://zfsonlinux.org/ and http://zfs-fuse.net/ and other sources.

    No, you can't find any vendors that support ZFS on Linux, because of the legal roadblocks.

    Given our history of poor vendor support, and lack of in-house Linux hackers, we really need good vendor support.

  2. Hi, was this a backplane issue on the disk array? I've had some Chenbro backplanes go bad before.

    Would you share what brand disks and disk array these were?

    Thanks
    SR

  3. Anonymous,

    We strongly suspect a backplane or array controller as the source of our problems. We're considering periodic shutdowns and reseats of controllers and drives to prevent future problems.

    Because we need to maintain a good working relationship with our array and drive vendors, and we really have not isolated the source of the problems, we won't reveal any names at this time. I know how frustrating this can be to others trying to evaluate vendors, because I've been on both sides now.

  4. The problematic disks had no errors? That is strange. Was the problem a faulty power supply? What WAS the problem? How can you know that Linux will not show similar problems?

    Before you consider moving to hardware raid, maybe you should read this post:
    http://opensolaris.org/jive/thread.jspa?messageID=502969#502969

  5. I've read your complete story and several things come to mind:

    - multiple disk failures. Were these professional-grade disks? Were they Seagate Barracudas from late 2009? Those are terrible (8 to 15% failure rates). I even switched to Hitachi because of the awful failure rate of these drives. I'm talking about several thousand drives per year, so these are significant statistics.

    - slow rebuild: yes, it's a bit of a pet peeve, but a decent hardware RAID controller rebuilds a multi-TB array in 8 to 12 hours while in moderate use. The RAID controller doesn't rebuild more slowly when your CPU is brought to its knees. Use hardware RAID. Use hardware RAID. Use friggin' hardware RAID, for chrissake! Your vendor is right on this one!

    - ZFS isn't the best thing since sliced bread. ZFS is great if you have tons of CPU power, tons of bandwidth, and you don't need tons of performance. Sad but true.

    - slow tape: I don't know how you use your tapes, but current tape drives (LTO-4 or LTO-5) read/write at more than 100 MB/s. If you don't get that throughput, you're doing it wrong. Did you set up a proper block size (32K should be enough, but don't go any lower)?

  6. Anonymous@November 2,

    In round 1, we returned the entire disk array with disks to the vendor. They ran every test they had -- No Trouble Found.

    In round 2, after reseating everything, all errors vanished, including the red lights on drive slots.

    I think we have transient problems. Many disk errors are transients.

    When we move to Linux, we'll get new storage arrays from a different vendor.

    I'm very familiar with the limitations of hardware RAID and the benefits of ZFS. I've been very concerned with data integrity for decades, and have been writing internal papers on the topic for five years.

    But our real-world experiences are terrible with ZFS and this particular combination of controller, array, and drives. We have three Sun Thumpers (first generation X4500s) with much better track records.

    So maybe our lesson is that sprinkling ZFS pixie dust over cheap hardware doesn't turn a pumpkin into a carriage.
