Here are some preliminary thoughts on a post-RAID storage design that meets our needs.
We've had big problems with RAID arrays, including multi-week rebuilds and multi-week tape restores.
Now I'm designing a server refresh, and taking a fresh look at storage.
We need:
- 600 GB of ultra-fast storage
- 9 TB of fast storage
- 50 TB of slow, reliable storage.
The conventional RAID design would be:
- 600 GB PCIe Flash card for ultra-fast storage.
- 600 GB, 15K RPM, SAS drives with RAID 0 for fast storage.
- 2 TB, 7.2K RPM, SAS drives with RAID 6 for slow storage.
So how about this design instead?
- 600 GB PCIe Flash card for ultra-fast storage.
- 600 GB, 15K RPM, SAS drives as JBOD for fast storage.
- 2 TB, 7.2K RPM, SAS drives as JBOD for slow storage.
- More 2 TB, 7.2K RPM, SAS drives as JBOD mirrors of each of the above
- Daily rsync mirror of the ultra-fast storage to a 600 GB mirror drive partition.
- Daily rsync mirror of each fast disk to a 600 GB mirror drive partition.
- Daily rsync mirror of each slow disk to a mirror drive.
- Run the rsync jobs at about 6 pm, Monday through Friday only.
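As a rough sizing check, here is a minimal sketch of the drive counts this design implies. It assumes decimal units (1 TB = 1000 GB), ignores formatting overhead and spares, and assumes each 2 TB mirror drive is split into 600 GB partitions; the function and constant names are mine, for illustration only.

```python
import math

# Requirements and drive sizes taken from the lists above (decimal GB).
FAST_NEEDED_GB = 9_000
SLOW_NEEDED_GB = 50_000
FAST_DRIVE_GB = 600
SLOW_DRIVE_GB = 2_000


def drive_counts():
    """Estimate how many drives the JBOD-plus-mirror design needs."""
    fast_drives = math.ceil(FAST_NEEDED_GB / FAST_DRIVE_GB)
    slow_drives = math.ceil(SLOW_NEEDED_GB / SLOW_DRIVE_GB)

    # Assumption: a 2 TB mirror drive holds as many whole 600 GB
    # partitions as fit, with the remainder left unused.
    partitions_per_mirror = SLOW_DRIVE_GB // FAST_DRIVE_GB

    # One 600 GB partition per fast drive, plus one for the flash card.
    partitions_needed = fast_drives + 1
    mirrors_for_fast = math.ceil(partitions_needed / partitions_per_mirror)

    # Each slow drive gets a whole mirror drive of the same size.
    mirrors_for_slow = slow_drives

    return {
        "fast_drives": fast_drives,
        "slow_drives": slow_drives,
        "mirror_drives": mirrors_for_fast + mirrors_for_slow,
    }


if __name__ == "__main__":
    print(drive_counts())
```

Under those assumptions the mirrors add 31 more 2 TB drives on top of the 15 fast and 25 slow data drives.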
When a drive fails:
- We need to recognize the problem before the next rsync job runs
- Halt that particular rsync job
- Redirect to the mirror partition or drive, possibly with reduced performance
- Schedule down time to power cycle the failed drive, and replace if needed
- Copy the mirror partition or drive back to the original drive
- Redirect back to the original drive
- Restart the rsync job
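The recovery steps above could be tracked per drive with a small state record. This is only a sketch: the `Volume` class, its field names, and the example volume name are hypothetical, and the actual redirect and copy-back would be done by whatever mount or path mechanism we end up using.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    ORIGINAL = "original"
    MIRROR = "mirror"


@dataclass
class Volume:
    """Tracks where one JBOD volume is served from (hypothetical model)."""
    name: str
    active: Source = Source.ORIGINAL
    rsync_enabled: bool = True

    def fail_over(self) -> None:
        # Halt the rsync job first so the mirror is never overwritten
        # from a failing drive, then redirect users to the mirror.
        self.rsync_enabled = False
        self.active = Source.MIRROR

    def restore(self) -> None:
        # Call only after the drive is repaired or replaced and the
        # mirror has been copied back to it.
        if self.active is not Source.MIRROR:
            raise RuntimeError(f"{self.name} is not running from its mirror")
        self.active = Source.ORIGINAL
        self.rsync_enabled = True


if __name__ == "__main__":
    vol = Volume("fast01")
    vol.fail_over()
    print(vol)
    vol.restore()
    print(vol)
```

The ordering inside `fail_over` is the point: the rsync job is disabled before anything else happens, which matches the requirement to catch the problem before the next run.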
Even if we needed to recover one disk from a backup tape, that would take much less time than recovering a large RAID array from many backup tapes.
We would have similar steps for recovering a corrupted or deleted file. Ideally, we would have daily ZFS snapshots, but that has other issues.
Why the specific rsync days and times?
- Virtually all of our work is done during normal working hours, 8 am to 6 pm Monday through Friday.
- We typically recognize and begin restoration procedures only during normal working hours.
- We don't want our mirrors to reflect corrupted disks or files before we get a chance to recognize and restore.
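The schedule reasoning above can be sketched as a guard plus an rsync command builder. The paths are made up for illustration, and the choice of `-a` and `--delete` is my assumption about how an exact mirror would be kept, not something settled in this design.

```python
from datetime import datetime


def in_rsync_window(now: datetime) -> bool:
    """True on Monday through Friday at or after 6 pm."""
    # datetime.weekday(): Monday is 0, Sunday is 6.
    return now.weekday() < 5 and now.hour >= 18


def rsync_command(source: str, mirror: str) -> list[str]:
    """Build an rsync invocation that makes `mirror` match `source`."""
    # The trailing slash on the source copies its contents rather
    # than the directory itself; --delete removes files from the
    # mirror that no longer exist on the source.
    return ["rsync", "-a", "--delete", source.rstrip("/") + "/", mirror]


if __name__ == "__main__":
    # Hypothetical mount points for one fast disk and its mirror partition.
    if in_rsync_window(datetime.now()):
        print(" ".join(rsync_command("/data/fast01", "/mirror/fast01")))
    else:
        print("Outside the rsync window; skipping.")
```

In practice the same schedule could probably be expressed as a cron entry such as `0 18 * * 1-5`, with the guard kept as a safety check. Note that `--delete` also propagates deletions to the mirror, which is exactly why the job must not run before a problem has been noticed.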
This design also reduces our risk from total RAID array recovery failure, i.e., everything goes wrong and we are unable to recover any data on a large RAID array. With this design, a double or triple disk failure only loses the data on those disks, rather than the entire array. And others have observed correlated or cascading disk failures due to identical designs, the same manufacturing batch, and identical operating environments.
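To put rough numbers on that comparison, here is a small sketch of how much data would need restoring after simultaneous disk failures. It assumes the 50 TB of slow storage sits in a single RAID 6 array (which survives two failures and loses everything on a third) versus independent 2 TB JBOD disks; the function names are illustrative.

```python
def jbod_at_risk_tb(failed_disks: int, drive_tb: int = 2) -> int:
    """Data to restore from mirrors or tape when JBOD disks fail."""
    # Each JBOD disk stands alone, so only the failed disks are affected.
    return failed_disks * drive_tb


def raid6_at_risk_tb(failed_disks: int, usable_tb: int = 50) -> int:
    """Data to restore when disks fail in one RAID 6 array."""
    # RAID 6 tolerates two failed disks; a third loses the whole array.
    return usable_tb if failed_disks > 2 else 0


if __name__ == "__main__":
    for failed in (1, 2, 3):
        print(failed, jbod_at_risk_tb(failed), raid6_at_risk_tb(failed))
```

The trade-off is visible in both directions: RAID 6 needs no restore for one or two failures, while JBOD always needs a per-disk restore, but a triple failure costs 6 TB of restores here instead of all 50 TB.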
I'll have to think through all the implications of this design for a while.