- Worst case RAID rebuild times
- Worst case storage fill times
Then I can choose my Maximum Storage Pain Threshold.
Recently I've lived through storage systems that failed completely or came right to the edge of failure. What I didn't know, and couldn't plan for, was how long recovery would take.
Scenario #1 – Full RAID rebuild:
- A RAID 5 system suffers a single disk failure, or a RAID 6 system suffers a double disk failure.
- Rebuild starts to hot spare(s) or swapped disks.
- How long will that rebuild take?
This number is almost impossible for customers to compute from component specifications. Each system has unique variables including:
- RAID controller performance
- Onboard RAID versus HBA RAID versus Software RAID
- Alternative data protection architectures like ZFS, WAFL, BeyondRAID, ...
- Storage system interfaces including SATA, SAS, FC, iSCSI, ...
- Disk interfaces including SATA, SAS, FC, ...
- Disk performance specs like total capacity, rotational speed, latency, seek times, I/O transfer rates, cache size, ...
Given all these variables, we probably need a scheme like SPEC: standard workloads, with vendors or reviewers reporting the exact hardware and software used for each test. Testing like that still has problems, but it's better than nothing, which is what we have now.
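Even without a benchmark, one number can be worked out from a spec sheet: a floor on rebuild time. A rebuild has to write the whole replacement disk, so it cannot finish faster than that disk's capacity divided by its sustained write rate. Here is a minimal sketch of that arithmetic. The function name and the example figures are my own illustrations, not measurements, and a real rebuild will take longer because of every variable listed above.

```python
def rebuild_floor_hours(disk_capacity_tb: float, sustained_write_mb_s: float) -> float:
    """Best-case hours to write one replacement disk end to end.

    Ignores controller overhead, parity computation, and competing host I/O,
    so a real rebuild should be expected to take longer.
    """
    if disk_capacity_tb <= 0 or sustained_write_mb_s <= 0:
        raise ValueError("capacity and write rate must be positive")
    capacity_mb = disk_capacity_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    seconds = capacity_mb / sustained_write_mb_s
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical 4 TB disk that sustains 150 MB/s on sequential writes.
    print(f"Rebuild floor: {rebuild_floor_hours(4, 150):.1f} hours")
```

If the floor alone is already past what the business can tolerate, no controller or software choice will fix it.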
Scenario #2 – Full restore:
- A completely full RAID 5 system suffers a double disk failure, or a RAID 6 system suffers a triple disk failure.
- Rebuild is impossible, so we swap disks and reformat.
- We start restoring from backups – tape, another server, another storage array, the Cloud. For the purpose of this exercise, assume an infinitely fast backup source.
- How long will that restore take?
In addition to all the factors in Scenario #1, now we are stressing total system throughput. In this scenario, using the same SPEC-like model and computing maximum theoretical write speed might be sufficient.
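As a rough sketch of that calculation: with an infinitely fast backup source, restore time is the amount of data divided by the array's write throughput. The code below assumes the array is completely full, that usable capacity is the disk count minus the parity disks (one for RAID 5, two for RAID 6) times the per-disk capacity, and that the write throughput figure comes from a benchmark or a vendor claim. All names and numbers are hypothetical.

```python
def usable_capacity_tb(disk_count: int, disk_capacity_tb: float, parity_disks: int) -> float:
    """Usable capacity of a parity RAID set (parity_disks: 1 for RAID 5, 2 for RAID 6)."""
    if disk_count <= parity_disks:
        raise ValueError("need more disks than parity disks")
    return (disk_count - parity_disks) * disk_capacity_tb


def full_restore_hours(data_tb: float, array_write_mb_s: float) -> float:
    """Best-case hours to write data_tb back to the array at array_write_mb_s."""
    if data_tb < 0 or array_write_mb_s <= 0:
        raise ValueError("data size must be non-negative and throughput positive")
    data_mb = data_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return data_mb / array_write_mb_s / 3600


if __name__ == "__main__":
    # Hypothetical RAID 6 set: eight 4 TB disks, 500 MB/s aggregate write speed.
    usable = usable_capacity_tb(disk_count=8, disk_capacity_tb=4, parity_disks=2)
    hours = full_restore_hours(usable, array_write_mb_s=500)
    print(f"{usable:.0f} TB usable, full restore takes at least {hours:.1f} hours")
```

Because the backup source is assumed to be infinitely fast, this is a best case. A real tape library, network link, or cloud provider will usually be the bottleneck instead.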
What's my Maximum Storage Pain Threshold?
Short definition: It's the amount of time the storage system can be down before my job is threatened.
Longer, more cynical definition: Starting with a thorough analysis of business needs, blah, blah, blah, manipulate the numbers to match my budget and the amount of time the storage system can be down before my job is threatened.
MSPT will be a combination of times I've discussed before:
- Corruption Recognition
- Recovery Initiation
- Data Recovery Time
MSPT will vary depending on the criticality of the data.
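To make the comparison concrete, here is a minimal sketch that treats the "combination" as a simple sum of the three times and checks it against a threshold. Summing is my simplifying assumption, since the phases happen one after another. The class name, field names, and hour values are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class OutageEstimate:
    """Estimated hours for each phase of an outage."""
    corruption_recognition_hours: float
    recovery_initiation_hours: float
    data_recovery_hours: float

    def total_hours(self) -> float:
        # Assumes the phases run back to back with no overlap.
        return (
            self.corruption_recognition_hours
            + self.recovery_initiation_hours
            + self.data_recovery_hours
        )


def within_mspt(estimate: OutageEstimate, mspt_hours: float) -> bool:
    """True if the estimated outage fits inside the pain threshold."""
    return estimate.total_hours() <= mspt_hours


if __name__ == "__main__":
    # Hypothetical thresholds that vary with the criticality of the data.
    mspt_by_criticality = {"critical": 4, "important": 24, "archive": 72}
    estimate = OutageEstimate(
        corruption_recognition_hours=2,
        recovery_initiation_hours=1,
        data_recovery_hours=13,
    )
    for tier, limit in mspt_by_criticality.items():
        verdict = "OK" if within_mspt(estimate, limit) else "too slow"
        print(f"{tier}: {estimate.total_hours():.0f} h against a {limit} h limit: {verdict}")
```

A system that is fine for archive data can blow straight through the threshold for critical data, which is why the check has to run per tier.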