Monday, January 25, 2010

XIV Disk Fault Questions [Updated]

UPDATE [8/6/2010]:  If you're interested in more current XIV information, I recommend reading Tony Pearson's recent posts here and here.  He also provided additional information in the comments to one of my posts here.
Today, I came across an XIV RAID-X post by IBMer KT Stevenson: RAID in the 21st Century.  It is a good overview of the XIV disk layout/RAID algorithm.  I have limited my questions in this post to ones raised by KT’s post since this post is already a bit lengthy.
In fact, DS8000 Intelligent Write Caching makes RAID-6 arrays on the DS8000 perform almost as well as pre-Intelligent Write Caching RAID-5 arrays.
Any array that does caching for all incoming writes should be able to claim the same (from a host perspective).  For large writes where the entire stripe is resident in memory for parity computations, there should be almost NO performance degradation.  It is great that the DS8000 performs well with RAID-6, but that is rapidly becoming “table stakes” if it isn’t already.
When data comes in to XIV, it is divided into 1 MB “partitions”.  Each partition is allocated to a drive using a pseudo-random distribution algorithm.  A duplicate copy of each partition is written to another drive with the requirement that the copy not reside within the same module as the original.  This protects against a module failure.  A global distribution table tracks the location of each partition and its associated duplicate.
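To make the placement rule in that description concrete, here is a minimal sketch of it. The uniform random choice, the function names, and the (module, slot) disk addressing are all my assumptions for illustration; the actual XIV distribution algorithm is not documented in the post being quoted.

```python
import random

MODULES = 15           # modules in a full frame
DISKS_PER_MODULE = 12  # 180 disks / 15 modules

def place_partition(rng):
    """Pick (primary, mirror) disks as (module, slot) tuples on different modules."""
    primary = (rng.randrange(MODULES), rng.randrange(DISKS_PER_MODULE))
    # The mirror may live on any of the 14 modules that do not hold the primary.
    other_modules = [m for m in range(MODULES) if m != primary[0]]
    mirror = (rng.choice(other_modules), rng.randrange(DISKS_PER_MODULE))
    return primary, mirror

def build_distribution_table(num_partitions, seed=0):
    """Hypothetical stand-in for the global distribution table: partition id -> (primary, mirror)."""
    rng = random.Random(seed)
    return {pid: place_partition(rng) for pid in range(num_partitions)}

table = build_distribution_table(1000)
```

Every entry in `table` keeps the two copies of a partition on different modules, which is the property that protects against a module failure.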
The most common ways to mitigate the risk of data loss are to decrease the probability that a critical failure combination can occur, and/or decrease the window of time where there is insufficient redundancy to protect against a second failure.  RAID-6 takes the former approach.  RAID-10 and RAID-X take the combination approach.  Both RAID-10 and RAID-X reduce the probability that a critical combination of failures will occur by keeping two copies of each bit of data.  Unlike RAID-5, where the failure of any two drives in the array can cause data loss, RAID-10 and RAID-X require that a specific pair of drives fail.  In the case of RAID-X, there is a much larger population of drives than a typical RAID-10 array can handle, so the probability of having the specific pair of failures required to lose data is even lower than RAID-10.
While at first this paragraph made perfect sense to me, there was something that just didn't seem to sit right.  Namely, this portion:
In the case of RAID-X, there is a much larger population of drives than a typical RAID-10 array can handle, so the probability of having the specific pair of failures required to lose data is even lower than RAID-10.

The following is based on what I've read online... allocations are divided up into 1 MB partitions, which are then distributed across the frame.  For the purpose of this question, I will assume 100% of all disks are available for distribution (which is untrue, but it is the absolute best case scenario) and that the data is perfectly evenly distributed.

In a fully loaded XIV frame, there are 180 physical disks.  What I'm interested in is the number of chunks that can be mirrored among the 180 disks without repeating the pair – once a ‘unique’ pair is repeated, you are vulnerable to a double disk failure with every allocation past that point.  So, 180 C 2 = 16110.  With 1MB per chunk, that is 16 GB.  From an array perspective, you run out of uniqueness after 16GB of utilization.  From an allocation perspective, any allocation larger than 16GB would be impacted by a double disk fault.  I assume XIV doesn’t “double up” on 1MB allocations (going for a less wide stripe for reducing the chances of a double fault) simply because I've always heard that hotspots aren't an issue.  This is best case assuming a perfect distribution, as near as I can reason - I'm sure any XIVers out there will correct this if I'm making an invalid assumption.
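The arithmetic above is easy to check. This is just a sketch of the same best-case assumptions (every disk can pair with every other disk, 1MB per chunk); the variable names are mine.

```python
from math import comb

DISKS = 180
CHUNK_MB = 1  # chunk size as documented online

# Number of distinct unordered disk pairs: 180 choose 2.
unique_pairs = comb(DISKS, 2)

# One mirrored chunk per unique pair before any pair has to repeat.
capacity_mb = unique_pairs * CHUNK_MB

print(unique_pairs)                  # 16110
print(round(capacity_mb / 1024, 2))  # 15.73 (GiB), i.e. roughly 16GB
```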

If you look closer, though, it’s a little worse than that.  Not every disk is a candidate as a mirror target – as noted above, XIV does not mirror data within a module.  With 15 modules of 12 disks each in a 180 disk system, that means for each mirror position the other 11 disks in the same module cannot be used.  The math gets beyond me at this point, so if anyone wants to comment on what that actually equates to, I’d be interested.
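For what it's worth, the module restriction can be worked out by subtracting the same-module pairs from the total. This is a sketch under the same idealized assumptions as before (180 disks, 15 modules of 12, perfectly even distribution), not a statement about how XIV actually counts anything.

```python
from math import comb

MODULES = 15
DISKS_PER_MODULE = 12
DISKS = MODULES * DISKS_PER_MODULE  # 180

all_pairs = comb(DISKS, 2)                               # 16110
same_module_pairs = MODULES * comb(DISKS_PER_MODULE, 2)  # 15 * 66 = 990
cross_module_pairs = all_pairs - same_module_pairs

# Cross-check: each disk has 180 - 12 = 168 valid partners, and each
# pair is counted twice when summing over disks.
assert cross_module_pairs == DISKS * (DISKS - DISKS_PER_MODULE) // 2

print(cross_module_pairs)  # 15120
```

So under these assumptions the number of unique mirror pairs drops from 16110 to 15120, which at 1MB per chunk is a bit under 15GB rather than roughly 16GB.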

  1. What is the “blast radius” of a double drive fault with the two drives on different modules?  Is it just the duplicate 1MB chunks that are shared between the two drives, or does it have broader impacts (at the array level)?
  2. At what size of allocation does a double drive fault guarantee data loss (computed as roughly 16GB above)?
  3. What is the impact of a read error during a rebuild of a faulted disk?  How isolated is it?
  4. Does XIV varyoff the volumes that are affected by data loss incurred by a double drive issue, or is everything portrayed as “ok” until the invalid bits get read?
  5. If there is data loss due to a double drive issue, are there reports that can identify which volumes were affected?
Update (01/26/2010):
I realize that the math part of this post is a little hard to understand, especially with 180 spindles in play, so I went ahead and drew it out with only 5 spindles (5 C 2 = 10).

This shows the 5 spindle example half utilized.  In this diagram it is possible to lose 2 spindles without data loss... for example, you can lose spindle 2 and spindle 4 - since no mirror has both of its copies on those two spindles, no data is lost.

This shows the 5 spindle example with all unique positions utilized.  In this diagram, it is impossible to lose 2 spindles without losing both sides of one of the mirrors.

The 16GB number quoted above is based on a 1MB chunk size, which is what has been documented online.  If the chunk size were larger, then that amount would be higher before guaranteed loss.  Of course, if you lose the wrong two drives prior to 16GB, you'll still lose data.  The percentage chance of data loss increases as you get closer to 16GB.
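The 5 spindle case is small enough to enumerate outright. The sketch below is mine, and the "half utilized" layout in it is an arbitrary choice of 5 pairs, not necessarily the layout drawn in the diagram.

```python
from itertools import combinations

def double_faults_losing_data(num_spindles, used_pairs):
    """Return every 2-spindle failure that takes out both copies of some mirror."""
    used = {frozenset(pair) for pair in used_pairs}
    failures = combinations(range(1, num_spindles + 1), 2)
    return [f for f in failures if frozenset(f) in used]

all_pairs = list(combinations(range(1, 6), 2))  # 5 C 2 = 10 unique pairs
half_used = all_pairs[:5]                       # arbitrary half of the pairs

fatal_half = double_faults_losing_data(5, half_used)
fatal_full = double_faults_losing_data(5, all_pairs)

print(len(fatal_half), "of", len(all_pairs))  # 5 of 10
print(len(fatal_full), "of", len(all_pairs))  # 10 of 10
```

With every unique pair in use, all 10 possible double faults lose data. With k distinct pairs in use, k of the 10 do, so for a random double fault the chance of loss is k/10. Under the same assumptions, the 180 spindle case would be k/16110.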

I know KT is working on a response to this, and I'm looking forward to being shown where the logic above is faulty (or where my assumptions went south).


K.T. Stevenson said...

Good stuff. I'll tackle your RAID-X specific observations in a little while. For now, I want to explore RAID-6 and caching algorithms a bit more.

Managing the write destage rate is critically important for squeezing high performance out of RAID-6. The degenerate case you describe ("large writes where the entire stripe is resident in memory for parity computations") is easy. Unfortunately, we don't always get the luxury of doing what is easy.

Since the number of physical I/O operations required to harden a RAID-6 write to stable storage is higher than what is required for RAID-5, back-end array response times are worse in RAID-6. We hide this response time from the application with write cache. All other things being equal, it takes more write cache in front of RAID-6 to maintain the same I/O response time as a RAID-5 array.

Cache is expensive. Customers, for some reason, only seem to want to buy as much of it as they absolutely have to. One of the ways the DS8000 differentiates itself is the quality of the caching algorithms. The Intelligent Write Caching feature I mentioned does an outstanding job of coalescing writes and reducing almost all physical I/O operations to full stripe writes. This allows us to maintain excellent I/O response times with less cache than would be required by other implementations.

techmute said...

Agree almost completely. All enterprise arrays have write cache, hopefully sized to the projected workload. The cache commonly serves to hide back end response time on writes, be it RAID 5 or RAID 6.

Similarly, all vendors say that their cache algorithms are the best. It is practically impossible for a customer to determine whether a given algorithm requires less cache than a competing one.

My main point was, all vendors claim fast RAID 6 implementations with competitive advantages due to cache algorithms - hence these claims are "table stakes." I'd be a little concerned if someone didn't claim this :-).

Tim Stone said...

My main point was, all vendors claim fast RAID 6 implementations with competitive advantages due to cache algorithms - hence these claims are "table stakes."

A bit off topic: Vendors, or more specifically their marketers, rarely seem to get this principle. Nobody wants to compete on price alone, but often it's determined that competitive pressures require that our specs meet-or-beat their specs, so you should buy our stuff. So all competing products start to look alike, their similarities vaguely masked by the usage of different terminologies and semantics. So in the face of product homogeneity, other factors come into play, such as "When I call your support number, do I get to talk to someone who has a clue?"

I know you'll feel free to delete this comment if you don't like off-topics!

Colin H said...

Just a few general notes on the above comments.
Cache is only expensive in proprietary disk arrays; "open arrays" as sold by certain vendors using ZFS have economically priced SSD cache.
RAID-6 is at the end of its useful life; RAID Z3 (triple parity) solves all RAID-5/6 issues.

SRJ said...


I can't speak to the accuracy of your math, but I can tell you that it isn't really necessary in order to get to the conclusion you're arriving at. Here's why:

The total capacity of a full XIV system is divided up into 16,000 primary slices and 16,000 secondary slices. These slices are spread evenly across all 180 disks in the system. Now that you know that...think about these facts:

All volume writes (1MB "partitions" or "chunks") are striped across all primary slices.

All writes to primary slices are pseudo-randomly mirrored to a secondary slice.

Minimum volume size on XIV is 16GB. You can make smaller volumes if you want, but 16GB will still be mapped out to ensure that even the smallest volumes are able to access all 180 spindles.

Does that 16GB number sound familiar? :)

Your math looks right to me, but it doesn't really matter if you know that bit about how the volumes are laid out on the system. Same result. Your point is correct...that any second drive failure out of the 168 candidates will cause data loss.

Game over for XIV? All the competitors would say so, but I disagree....and I think my position is reasonable. It would take a lot of time and space to explain why I think my position is reasonable, and I probably should do that. Perhaps I will, but comments on a blog are probably not the best place for that. Customers who don't automatically shut their ears after hearing the FUD from competitors usually agree with me after I've explained.

Now, to be fair...the XIV sales team tries to make it seem like RAID-X is absolutely impervious and is the end-all, be-all data protection solution. It very clearly is not. But in practice, it is also not any worse than (and is probably better than most) other vendors' best-practice designs. NetApp is the tricky one here since their best practices all use dual-parity. I'll say that the XIV is worse than RAID-DP in some important ways, and is also better in some important ways.

Lots to discuss...just the tip of the iceberg.


SRJ said...

Just realized I really only responded to your #2 question... Here goes my attempt at the others:

1. Only the 1MB partitions common to both failed drives are affected.

3. Please be more precise with the question...not sure exactly what you are asking.

4. If the system cannot recover from lost data, it will vary off. Support will try to force the failed disks back to life. Many times a "failed" drive isn't really dead and can easily be forced back to life. This is not unique.

5. Not "reports" per se...but there is a distributed map of where all the 1MB chunks and their mirrors are located. This is used to determine exactly which volumes are affected, etc... An end user would not have access to this information without the help of IBM support.

SRJ said...

As I was posting on our XIV Google Wave, I realized that I misspoke in my first comment here. I said:

"any second drive failure out of the 168 candidates will cause data loss."

This is not correct. It should read:

"a second drive failure in the opposite module type (so either a pool of 108 candidates or a pool of 72 candidates) will cause data loss."

That's a little more confusing, but it is technically more accurate. For details, can we point readers to the Wave?

Thanks for posting my correction...

Anonymous said...

I started thinking about this also, and it's much worse than you thought. 78TB of capacity mirrored on 180 drives means that you have 78TB across 90 drives. You have the potential to have 866GB per 1TB drive. If you drive the system to 75% utilization, you have used around 650GB of capacity per drive, or 650,000 1MB chunks. Those can only be mirrored, equally, across 168 drives. Any 2 drives share ~3800 copies of data. If you have a 78TB system and use 20GB LUNs, you'd have about 4000 LUNs. A double drive failure would probably corrupt the entire array.

Anonymous said...

Your math is way off. With a 15 module system there are 180 drives that are used. XIV does not mirror disks but rather the 1MB chunks, so your comment about only using 90 drives is flawed, which makes the rest of your equation irrelevant. When a LUN is created it is spread across all 180 drives, and the rebuild of a failed drive would use 168 (180-12, because no mirrored data is on the same module).

For argument's sake, let's say that all the stars were aligned, you were wearing your purple striped socks, standing on one foot, and there was a double drive failure. All that would be lost would be the data common to the two drives, which would most likely not be all LUNs on the entire array.

As SRJ posted, those blocks that were lost can be pinpointed by support personnel and then restored.

Not to get too much off topic but >90% of DD failures happen within the same shelf and it is due to a common problem within that shelf.