UPDATE [8/6/2010]: If you're interested in more current XIV information, I recommend reading Tony Pearson's recent posts here and here. He also provided additional information in the comments to one of my posts here.
A few weeks ago, I created a Google Wave to discuss the architecture surrounding XIV and the related FUD (some of it fact-based) that this architecture attracted. I intended to post a recap after the wave had died down.
This is not that recap. The recap was about 80% complete, but more reputable sources have since posted much of the same information. If you're interested in the actual Wave content, contact me and I'll send a PDF (assuming there's a decent way to print a Wave). I also participated in a podcast Nigel hosted last week, which is available in his podcast archives.
New Zealand IBMer the Storage Buddhist wrote this post discussing the disk layout and points of failure in IBM's XIV array... which generated this response by NetApp's Alex McDonald. Both posts, especially the comments, are interesting and show both sides of the argument around disk reliability for XIV.
This post is meant to bridge a few gaps on both sides, and requires a little disclaimer. Most of the technical information below came from the Google Wave, primarily from IBM-badged employees and VARs. I have been unable to independently verify its accuracy; even the IBM Redbook on XIV contains data-layout diagrams that contradict these explanations, though with disclaimers saying the diagrams are for illustrative purposes and don't show how it really works. So, caveat emptor: make sure you go over the architecture's tradeoffs with your sales team.
Hosts are connected to the XIV through interface nodes: 6 of the 15 servers in an XIV system, each with FC and iSCSI Ethernet interfaces providing host connectivity. Below an unspecified capacity threshold, each incoming write is written to an interface node (most likely the one it arrived on) and mirrored to a data node (one of the other 9 servers in the system).
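To make the placement policy concrete, here is a minimal sketch of the mirroring scheme as described above. The node numbering and the random mirror-selection function are my own illustrative assumptions, not IBM's actual algorithm:

```python
import random

# Hypothetical sketch of XIV write placement as described in the Wave.
# Node IDs and the choice of mirror target are illustrative assumptions.

INTERFACE_NODES = list(range(1, 7))    # 6 servers with FC/iSCSI host ports
DATA_NODES = list(range(7, 16))        # the other 9 servers (no host ports)

def place_write(ingress_node: int) -> tuple:
    """Return (primary, mirror) node IDs for an incoming write.

    The primary copy lands on the interface node the write arrived on;
    the mirror copy goes to a pseudo-randomly chosen data node.
    """
    assert ingress_node in INTERFACE_NODES
    mirror = random.choice(DATA_NODES)
    return ingress_node, mirror

primary, mirror = place_write(3)
```

The key property this models: below the capacity threshold, the two copies of any chunk always live in different pools, one on an interface node and one on a data node.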
At this point, you can have drive failures in multiple interface nodes without data loss. In fact, one person claimed that you could lose all of the interface nodes without losing any data (of course, this would halt the array). The "data-loss" risk in this case is losing one drive in an interface module (40% of the disks) followed by one drive in a data module (60% of the disks) before the rebuild completes (worst case, an estimated 30-40 minutes). Or, as it was put in the wave:
"If I lose a drive from any of a pool of 72 drives, and then I lose a second disk from a separate pool of 108 drives before the rebuild completes for the first drive, I'm going to have a pretty huge problem."

Past a certain unknown threshold, incoming writes start getting mirrored between two data nodes rather than between an interface node and a data node. At that point, double disk failures across different data nodes can also cause a pretty huge problem.
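A rough feel for that exposure window can be had with back-of-the-envelope arithmetic. The drive counts (72 interface-module drives, 108 data-module drives) and the 40-minute worst-case rebuild come from the discussion above; the per-drive annual failure rate is my own illustrative assumption:

```python
# Back-of-the-envelope sketch of the double-failure window described above.
# The 3% annual failure rate is an assumed figure, not a measured one.

TOTAL_DRIVES = 180           # 15 modules x 12 drives
DATA_DRIVES = 108            # drives in the 9 data modules (60%)
REBUILD_HOURS = 40 / 60      # worst-case rebuild estimate from the post
AFR = 0.03                   # assumed annual failure rate per drive

HOURS_PER_YEAR = 24 * 365
p_fail_in_window = AFR * REBUILD_HOURS / HOURS_PER_YEAR  # per drive

# After an interface-module drive fails, data is at risk only if one of
# the 108 data-module drives also fails before the rebuild completes.
p_second_failure = 1 - (1 - p_fail_in_window) ** DATA_DRIVES

print(f"P(data-module drive fails during the rebuild) ~ {p_second_failure:.2e}")
```

Under these assumptions the window probability comes out small, which is the pro-XIV side of the argument; the anti-XIV side is that, unlike RAID-6-style layouts, the consequence when it does happen is array-wide rather than confined to one RAID group.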
From a 'hot spare' perspective, the XIV has spare capacity to cover 15 drive failures. When you hear XIV resources discuss "sequential failures," they typically mean drive failures that occur after the previous rebuild has completed but before the failed drive has been replaced. This is an important statistic from the perspective of double drive failures that occur because the failed drive was never detected (have you verified YOUR phone home lately?).
A couple of final thoughts. First off, the effect of an uncorrectable error during a rebuild was never fully explained. I have heard in passing that "the lab" can tell you what the affected volume is and that it shouldn't cause the same impact as two failed drives. Secondly, Hector Servadac mentions the following on the Storage Buddhist's post:
"2 disk failures in specific nodes each one, during a 30 min windows, is likely as 2 controller failure"

Unless I'm misunderstanding the impact of a two-controller failure, there is no data loss with that type of 'unlikely' failure... with the double drive failure, there is significant data loss. Still, as a measure of how likely XIV/IBM considers this outage scenario to be, it serves as a decent yardstick.
I tried to make this as unbiased as possible. I am positive I will be brutally corrected in the comments :-).