Musings on IT, data management, whitewater rafting, and backpacking

Tuesday, October 26, 2010

Massive Disk Failure Deja Vu, Part 2

So far, no further problems with this system.

Some corrections to the events reported in Massive Disk Failure Deja Vu, and follow-ups on what we can and can't do to recover faster.

Corrected early event sequence:
  1. One disk in one zpool went bad over the weekend and began resilvering onto the hot spare.
  2. The other zpool complained about not having a hot spare.
  3. The first zpool held the home directories of all users, and the resilver severely impacted performance -- logins still hadn't completed after 5 minutes!
  4. Eventually, we reduced the performance impact, and allowed normal use while the resilver completed.
All of this is standard ZFS behavior, except for the surprisingly severe performance hit.  The flurry of never-seen-before error messages, and many disgruntled users breathing down our necks, led to confusion about what actually happened.
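In a situation like this, resilver progress can be watched with `zpool status`.  A minimal sketch of pulling the progress line out of that output -- the pool name "home" and the sample figures are made up for illustration; on a live system you'd pipe `zpool status home` instead of the embedded sample:

```shell
#!/bin/sh
# Stand-in for real `zpool status` output; pool name and numbers are hypothetical.
sample_status() {
cat <<'EOF'
  pool: home
 state: DEGRADED
  scan: resilver in progress since Tue Oct 26 09:14:02 2010
    312G scanned out of 1.45T at 41.2M/s, 8h2m to go
EOF
}

# Grab the progress line and strip leading whitespace.
progress=$(sample_status | grep 'to go' | sed 's/^ *//')
echo "$progress"
```

Checking this periodically during the resilver at least tells you how long the performance hit will last.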

The next day, when we discovered that a second drive in the same array went bad, we decided to be super-safe, and take the system down as reported earlier.

Follow-ups on short-term ideas to improve reliability:
  • Reduce zpool size -- we decided this was too risky, and too much work, for the benefit we might get.
  • Aggressively scrub zpools -- we currently scrub once each weekend, which takes 20+ hours and carries a big performance impact.  Scrubbing during working hours isn't practical.
  • rsync from snapshot to another array -- we're already doing this to the extent our current storage allows; doing more would mean adding capacity.  We'll adjust our rsync schedule to avoid some of the problems described here.
  • Back up to tape more often -- our weekly full backups take 3.5 days to run, and our current software can't run partial backups.  Backing up more often isn't practical.
Bottom line -- we're not changing much.
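For what it's worth, the scrub and rsync items above boil down to a weekend schedule.  A hypothetical crontab sketch -- the pool name "home", snapshot name "weekly", host "backuphost", and all paths are made up, and the timings would need tuning against the real scrub and backup windows:

```
# Start the weekly scrub Friday night so its ~20-hour run
# finishes well before Monday logins.
0 23 * * 5  zpool scrub home

# Sunday: rsync from a read-only snapshot (via the hidden .zfs
# directory) so the copy is consistent even while users write
# to the live filesystem.
0 2 * * 0   rsync -a /home/.zfs/snapshot/weekly/ backuphost:/backup/home/
```

Rsyncing from the snapshot rather than the live filesystem is what keeps the copy self-consistent; the trade-off is that the copy is only as fresh as the snapshot.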

To get significantly better reliability, we need to replace this system's storage, or the system itself.

Until then, we hope for the best.
