Some corrections to the events reported in "Massive Disk Failure Deja Vu," and follow-ups on what we can or can't do to recover faster.
Corrected early event sequence:
- One disk in one zpool went bad over the weekend, and the pool started resilvering onto its hot spare.
- The other zpool complained about not having a hot spare.
- The first zpool held the home directories of all users, and the resilver severely impacted performance -- logins still hadn't completed after 5 minutes!
- Eventually, we reduced the performance impact enough to allow normal use while the resilver completed.
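For the record, keeping an eye on the resilver is just a matter of polling zpool status; a minimal sketch, with "tank" as a placeholder pool name:

    # Check on the resilver; the status output reports progress
    # and an estimated time to completion. "tank" is a
    # placeholder pool name, not our actual pool.
    zpool status tank

    # Poll once a minute until the resilver finishes.
    while zpool status tank | grep -q 'resilver in progress'; do
        sleep 60
    done
    echo resilver complete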
All of this is standard ZFS behavior, except for the surprisingly severe performance hit. The flurry of never-before-seen error messages, and the many disgruntled users breathing down our necks, led to confusion about what had actually happened.
The next day, when we discovered that a second drive in the same array had gone bad, we decided to be super-safe and take the system down, as reported earlier.
Follow-ups on short-term ideas to improve reliability:
- Reduce zpool size -- we decided this was too risky, and too much work, for the benefits we might get.
- Aggressively scrub zpools -- we're currently scrubbing once each weekend (the schedule is sketched after this list), which takes 20+ hours and has a big performance impact. Scrubbing during working hours is not practical.
- rsync from snapshot to another array -- we're already doing that to the extent we can without adding more storage (see the snapshot-copy sketch after this list). We'll adjust our rsync schedule to avoid some of the problems described here.
- Back up to tape more often -- we're running full backups once per week that take 3.5 days, and our current software can't run partial backups. Backing up more often is not practical.
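For reference, the weekend scrub is just a cron job along these lines; the pool names, path, and start time are placeholders, not our actual configuration:

    # /etc/crontab fragment: start a scrub of each pool early
    # Saturday morning. A scrub takes 20+ hours here, so it
    # effectively owns the weekend.
    0 2 * * 6  root  /sbin/zpool scrub homepool
    0 2 * * 6  root  /sbin/zpool scrub datapool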
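And the rsync-from-snapshot copy is conceptually this -- reading from a snapshot gives a consistent source even while users keep writing. Filesystem names and mount points are placeholders:

    # Snapshot the home filesystem, copy the snapshot to the
    # second array, then drop the snapshot.
    zfs snapshot homepool/home@nightly

    # Snapshots are exposed read-only under .zfs/snapshot/
    rsync -a --delete /homepool/home/.zfs/snapshot/nightly/ \
          /backup-array/home/

    zfs destroy homepool/home@nightly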
Bottom line -- we're not changing much.
To get significantly better reliability, we need to replace either the storage on this system or the system itself.
Until then, we hope for the best.