Comments on techmute: XIV Disk Fault Questions [Updated]
Anonymous (2010-06-09 11:45):

Anonymous,

Your math is way off. With a 15-module system there are 180 drives in use. XIV does not mirror disks, but rather the 1 MB chunks, so your comment about only using 90 drives is flawed, which makes the rest of your equation irrelevant. When a LUN is created it is spread across all 180 drives, and the rebuild of a failed drive would use 168 (180 - 12, because no mirrored data is on the same module).

For argument's sake, let's say that all the stars were aligned, you were wearing your purple striped socks, standing on one foot, and there was a double drive failure. All that would be lost would be the data common to the two drives, which would most likely not be all LUNs on the entire array.

As SRJ posted, those blocks that were lost can be pinpointed by support personnel and then restored.

Not to get too far off topic, but >90% of double-drive failures happen within the same shelf, due to a common problem within that shelf.

Anonymous (2010-03-16 14:09):

I started thinking about this also; it's much worse than you thought. 78 TB of capacity mirrored on 180 drives means that you have 78 TB across 90 drives, which is potentially 866 GB per 1 TB drive. If you drive the system to 75% utilization, you use around 650 GB of capacity per drive, or 650,000 1 MB chunks. Those can only be mirrored, spread equally across 168 drives, so any two drives share roughly 3,800 chunks of data. If you have a 78 TB system and use 20 GB LUNs, you'd have about 4,000 LUNs.
A double drive failure would probably corrupt the entire array.

SRJ (2010-03-09 22:50):

As I was posting on our XIV Google Wave, I realized that I misspoke in my first comment here. I said:

"any second drive failure out of the 168 candidates will cause data loss."

This is not correct. It should read:

"a second drive failure in the opposite module type (so either a pool of 108 candidates or a pool of 72 candidates) will cause data loss."

That's a little more confusing, but it is technically more accurate. For details, can we point readers to the Wave?

Thanks for posting my correction...

SRJ (2010-03-01 23:31):

Just realized I really only responded to your #2 question... Here goes my attempt at the others:

1. Only the 1 MB partitions common to both failed drives are affected.

3. Please be more precise with the question... not sure exactly what you are asking.

4. If the system cannot recover from lost data, it will vary off. Support will try to force the failed disks back to life. Many times a "failed" drive isn't really dead and can easily be forced back to life. This is not unique.

5. Not "reports" per se... but there is a distributed map of where all the 1 MB chunks and their mirrors are located. This is used to determine exactly which volumes are affected, and so on. An end user would not have access to this information without the help of IBM support.
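The chunk arithmetic traded in the comments above can be sanity-checked in a few lines. This is a sketch using the commenters' own figures and the thread's assumption of uniform pseudo-random mirror placement, not official XIV specifications:

```python
# Sanity check of the chunk arithmetic in the comments above.
# All figures are the commenters' assumptions (78 TB mirrored capacity,
# 75% utilization, mirrors never on the same 12-drive module),
# not official XIV specifications.

total_drives = 180
module_drives = 12                                  # mirrors avoid the failed drive's module
mirror_candidates = total_drives - module_drives    # 168

capacity_per_drive_gb = 78_000 / 90                 # 78 TB mirrored -> ~866 GB per drive
used_chunks_per_drive = int(capacity_per_drive_gb * 0.75 * 1000)   # 1 MB chunks at 75% full

# Expected number of chunks whose only two copies sit on one given
# pair of drives, if mirrors are spread uniformly:
shared = used_chunks_per_drive / mirror_candidates

print(f"used chunks per drive: ~{used_chunks_per_drive}")   # ~650000
print(f"chunks shared by any two drives: ~{shared:.0f}")    # ~3869
```

This reproduces the ~3,800 shared chunks quoted above; since a 20 GB LUN is itself striped across all 180 drives, losing that many chunk pairs plausibly touches a large fraction of the LUNs on the box, which is the commenter's point.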
SRJ (2010-03-01 21:51):

techmute,

I can't speak to the accuracy of your math, but I can tell you that it isn't really necessary in order to reach the conclusion you're arriving at. Here's why:

The total capacity of a full XIV system is divided into 16,000 primary slices and 16,000 secondary slices. These slices are spread evenly across all 180 disks in the system. Now that you know that, think about these facts:

All volume writes (1 MB "partitions" or "chunks") are striped across all primary slices.

All writes to primary slices are pseudo-randomly mirrored to a secondary slice.

The minimum volume size on XIV is 16 GB. You can make smaller volumes if you want, but 16 GB will still be mapped out to ensure that even the smallest volumes are able to access all 180 spindles.

Does that 16 GB number sound familiar? :)

Your math looks right to me, but it doesn't really matter if you know that bit about how the volumes are laid out on the system. Same result. Your point is correct: any second drive failure out of the 168 candidates will cause data loss.

Game over for XIV? All the competitors would say so, but I disagree, and I think my position is reasonable. It would take a lot of time and space to explain why, and I probably should do that. Perhaps I will, but comments on a blog are probably not the best place for that. Customers who don't automatically shut their ears after hearing the FUD from competitors usually agree with me after I've explained.

Now, to be fair, the XIV sales team tries to make it seem like RAID-X is absolutely impervious and is the end-all, be-all data protection solution. It very clearly is not. But in practice, it is also not any worse than (and is probably better than most) other vendors' best-practice designs. NetApp is the tricky one here, since their best practices all use dual parity.
I'll say that XIV is worse than RAID-DP in some important ways, and also better in some important ways.

Lots to discuss... just the tip of the iceberg.

Thoughts?
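SRJ's corrected failure pools (108 or 72 candidates) fall directly out of the module counts. A toy sketch, assuming a split of 6 interface modules and 9 data modules of 12 drives each with strictly cross-type mirroring; the split is inferred from the 108/72 figures in the correction, not stated in the thread:

```python
# Sketch of SRJ's correction: after one drive fails, only a drive of the
# *opposite* module type can hold the mirror copies, so only those drives
# are data-loss candidates. The 6-interface / 9-data module split is an
# assumption inferred from the 72/108 pool sizes, not from the post.

iface = [("iface", m, d) for m in range(6) for d in range(12)]   # 72 drives
data  = [("data",  m, d) for m in range(9) for d in range(12)]   # 108 drives

def loss_pool(failed_drive):
    """Drives whose subsequent failure would lose data, under the
    assumption that mirrors always cross module types."""
    return data if failed_drive[0] == "iface" else iface

print(len(loss_pool(iface[0])))   # 108 candidates after an interface-drive failure
print(len(loss_pool(data[0])))    # 72 candidates after a data-drive failure
```

Either way the candidate pool is smaller than the naive 168, which is why SRJ calls the corrected statement "technically more accurate."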
Colin H (2010-02-01 19:10):

Just a few general notes on the above comments. Cache is only expensive in proprietary disk arrays; "open arrays" as sold by certain vendors using ZFS have economically priced SSD cache. RAID-6 is at the end of its useful life; RAID-Z3 (triple parity) solves all RAID-5/6 issues.

Unknown (2010-01-26 16:02):

"My main point was, all vendors claim fast RAID 6 implementations with competitive advantages due to cache algorithms - hence these claims are 'table stakes.'"

A bit off topic: vendors, or more specifically their marketers, rarely seem to get this principle. Nobody wants to compete on price alone, but often it's determined that competitive pressures require that our specs meet or beat their specs, so you should buy our stuff. So all competing products start to look alike, their similarities vaguely masked by the use of different terminology and semantics. In the face of product homogeneity, other factors come into play, such as "When I call your support number, do I get to talk to someone who has a clue?"

I know you'll feel free to delete this comment if you don't like off-topic comments!

techmute (2010-01-26 09:57):

Agree almost completely; all enterprise arrays have write cache, hopefully sized to the projected workload. The cache commonly serves to hide back-end response time on writes, be it RAID 5 or RAID 6.

Similarly, all vendors say that their cache algorithms are the best.
As a customer, it is practically impossible to determine whether a given algorithm requires less cache than a competing one.

My main point was, all vendors claim fast RAID 6 implementations with competitive advantages due to cache algorithms - hence these claims are "table stakes." I'd be a little concerned if someone didn't claim this :-).

K.T. Stevenson (2010-01-26 09:40):

Good stuff. I'll tackle your RAID-X specific observations in a little while. For now, I want to explore RAID-6 and caching algorithms a bit more.

Managing the write destage rate is critically important for squeezing high performance out of RAID-6. The degenerate case you describe ("large writes where the entire stripe is resident in memory for parity computations") is easy. Unfortunately, we don't always get the luxury of doing what is easy.

Since the number of physical I/O operations required to harden a RAID-6 write to stable storage is higher than what is required for RAID-5, back-end array response times are worse with RAID-6. We hide this response time from the application with write cache. All other things being equal, it takes more write cache in front of RAID-6 to maintain the same I/O response time as a RAID-5 array.

Cache is expensive. Customers, for some reason, only seem to want to buy as much of it as they absolutely have to. One of the ways the DS8000 differentiates itself is the quality of its caching algorithms. The Intelligent Write Caching feature I mentioned does an outstanding job of coalescing writes and reducing almost all physical I/O operations to full-stripe writes. This allows us to maintain excellent I/O response times with less cache than would be required by other implementations.
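The write-penalty argument in the last two comments comes down to the textbook read-modify-write I/O counts. A generic sketch of that accounting (standard RAID arithmetic, not any vendor's actual implementation):

```python
# Why RAID-6 needs more write cache than RAID-5: the classic
# read-modify-write counts for a small (sub-stripe) write, and why
# coalescing into full-stripe writes helps so much.

def small_write_ios(parity_disks):
    # Read old data + each old parity block, then write new data
    # + each new parity block.
    return 2 * (1 + parity_disks)

def full_stripe_ios(data_disks, parity_disks):
    # Full stripe resident in memory: parity is computed there,
    # so every I/O is a pure write.
    return data_disks + parity_disks

print(small_write_ios(1))      # RAID-5 small write: 4 physical I/Os
print(small_write_ios(2))      # RAID-6 small write: 6 physical I/Os
print(full_stripe_ios(6, 2))   # 6+2 RAID-6 full stripe: 8 I/Os for 6 data chunks
```

The 6-versus-4 gap is why, all else being equal, RAID-6 needs more write cache to hold the same response time, and why a destage algorithm that coalesces small writes into full stripes can close most of that gap.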