Wednesday, March 24, 2010

XIV Final Thoughts- Drive Failures a Red Herring?

UPDATE [8/6/2010]:  If you're interested in more current XIV information, I recommend reading Tony Pearson's recent posts here and here.  He also provided additional information in the comments to one of my posts here.
--

Over the past few weeks, between the Wave and the blog posts, I've been thinking about XIV quite a bit. It has taken IBM quite a while to attempt to explain the impact and risk of double drive failures on XIV.

IBM definitely has an explanation, one that could have been offered quite a while ago.  In fact, I'd assume this is the same explanation they've been giving customers who pushed the point: that the risk is less than it seems due to quick rebuilds and the way mirrored data is distributed between interface and data nodes.  I realize that UREs are a very large concern, but to be honest, I bet fewer than 5% of customers even think about storage at that level.  Perhaps the double drive failure issue is just a red herring that draws attention away from other issues.

One thing that continues to stick out in my mind is the ratio of interface nodes to data nodes.  On the Google Wave, one of the IBM VARs made the following statement:
Remember there is more capacity in the data modules than in the interface modules.  (9 data, 6 interface)  Why they couldn't make this easy and have an equal number of both module types, I'll never know!  :) 
The interface nodes make up only 40% of the array.  Even IBM VARs can't explain why this is a 40:60 ratio rather than 50:50.  It increases the probability of double drive faults causing data loss at high capacity, and it is a pretty specific design decision.
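As a rough illustration of what the module split means for double failures, here's a back-of-the-envelope calculation.  It assumes (and this is only an assumption, based on the discussion above) that mirror copies always land in the opposite module group, so only an interface+data pair of simultaneous failures can lose data, and it uses the standard 15-module, 12-drives-per-module frame:

```python
# Illustrative only: assumes mirrors always cross between interface and
# data module groups, a simplification of XIV's real (unpublished) layout.
INTERFACE_MODULES = 6
DATA_MODULES = 9
DRIVES_PER_MODULE = 12

iface = INTERFACE_MODULES * DRIVES_PER_MODULE   # 72 drives
data = DATA_MODULES * DRIVES_PER_MODULE         # 108 drives
total = iface + data                            # 180 drives

# Probability that two simultaneous random drive failures span the two
# module groups (one interface drive, one data drive):
p_cross = (iface / total) * (data / (total - 1)) \
        + (data / total) * (iface / (total - 1))
print(f"P(cross-group double failure) = {p_cross:.3f}")
```

Of course, the real exposure also depends on rebuild time and how widely each drive's partners are spread; this only looks at which failure pairs could collide at all.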

I wonder if it is related to the Gig-E interconnect and driving "acceptable" performance out of the non-interface nodes.  Jesse over at 50 Micron shares similar thoughts.  Thinking this through (and this is all simply a hypothesis)... perhaps the latency and other limitations of the Gig-E interconnect are somewhat offset by having additional spindles (IOPS + throughput) on the "remote" data nodes.  I'd like to load an XIV frame to 50% utilization, run a 100% read workload against it, and see if the interface nodes are hit much harder than the data nodes (in effect, performing like RAID 0, not RAID 1).  If that were true, for optimal performance you'd never want to load a frame past the point where new volumes would be allocated solely from data nodes.

I am not claiming this is true (no way for me to test it), but if XIV changes the interconnect to a different type (Infiniband, for example), I will find it interesting if "suddenly" there is a 50:50 ratio of interface to data nodes.

6 comments:

Jesse said...

Funny - the IBM guy who was here installing the XIV did say that Infiniband was in the future... So you may be on to something.

If they do that they might be on to something.

So does that make The XIV IBM's version of Windows Vista? You know, the "We don't have anything so we'll release something even if it's not quite finished yet" type of deal.

I'm also curious as to how much memory the BLOB map takes up, and what the lookup times are for individual records? It seems that the access nodes would have to do SOME kind of lookup to determine where each of the logical parts of a file are, then go fetch them, then (presumably it happens in this order) put them into the correct sequence.

For 80TB usable, divided into 1MB chunks and mirrored, we're talking about roughly 80 million records - double that to keep track of the mirror blobs.  (I'm assuming fetch doesn't happen from the remote data nodes if it can be avoided, due to network latency.)

That's still a lot of state to keep consistent in one place, so it is probably not held on any one of the access nodes by itself.
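Redoing that arithmetic as a quick sanity check (decimal units for simplicity; the bytes-per-entry figure is purely my guess, since XIV's internals aren't public):

```python
# Quick sanity check on the blob-map size (decimal units; the per-entry
# size is an assumption - XIV's real record format isn't documented).
USABLE_TB = 80
CHUNK_MB = 1

chunks = USABLE_TB * 10**12 // (CHUNK_MB * 10**6)  # primary 1MB chunks
entries = chunks * 2                               # double for mirror copies
print(f"{entries:,} map entries")                  # 160,000,000

bytes_per_entry = 16                               # assumed
map_gb = entries * bytes_per_entry / 10**9
print(f"~{map_gb:.1f} GB of map at {bytes_per_entry} B/entry")
```

So the map itself needn't be enormous; the harder problem is keeping it consistent across nodes.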

It's more likely that each access/data node keeps track of the blobs on its own 12TB of data, and coordinates that information when reads/writes are issued.

Seems like it would work, seems like it would go SPECTACULARLY wrong when it doesn't.

techmute said...

Regarding the blob map-

Reading the Redbook, it appears everything occurs at 17GB intervals (which is slightly greater than the perfect combination of all the drives).

I suspect the data+mirror positions are not actually random but an algorithmic layout.  Then XIV wouldn't need to hold the blob map in memory, just do a relatively quick derivation to determine data placement.  If they actually do that, two things come to mind.

1. That's pretty clever.
2. You'd see different performance on a loaded array during a rebuild on the affected 1MB chunks (unless they cache as much as possible of the affected ~800GB of chunks that don't map to the perfect layout).
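For what an algorithmic layout might look like, here's a toy sketch - pure conjecture on my part, since the actual distribution function isn't public.  It derives a primary and a mirror drive for each (volume, chunk) pair from a hash, forcing the two copies into different modules, so no stored map is needed:

```python
import hashlib

# Toy stand-in for an algorithmic layout (conjecture, not XIV's algorithm).
N_MODULES = 15
DRIVES_PER_MODULE = 12

def place(volume_id: int, chunk_no: int) -> tuple[int, int]:
    """Derive (primary_drive, mirror_drive) for a 1MB chunk.

    Deterministic and evenly spread; the mirror is forced into a
    different module than the primary.
    """
    h = hashlib.sha256(f"{volume_id}:{chunk_no}".encode()).digest()
    primary = int.from_bytes(h[:4], "big") % (N_MODULES * DRIVES_PER_MODULE)
    p_module = primary // DRIVES_PER_MODULE

    r = int.from_bytes(h[4:8], "big")
    # An offset of 1..N_MODULES-1 guarantees a different module.
    m_module = (p_module + 1 + r % (N_MODULES - 1)) % N_MODULES
    mirror = m_module * DRIVES_PER_MODULE + r % DRIVES_PER_MODULE
    return primary, mirror

p, m = place(7, 12345)
assert place(7, 12345) == (p, m)                         # deterministic: no map
assert p // DRIVES_PER_MODULE != m // DRIVES_PER_MODULE  # copies span modules
```

A rebuild would then just re-run the function over the affected chunk range instead of walking a table, which fits the "quick derivation" idea above.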

Hector Servadac said...

In fact it's a pseudo-random distribution: blocks are copied to different nodes, with some kind of preference for placing them on specific nodes (e.g. interface nodes vs. data nodes).
Remember, when XIV talks about rebuild and self-healing it's talking about spare space, not spare disks, so it can stripe across all disks all the time.

techmute said...

So, if it's indeed pseudo-random rather than algorithmically defined, where is the blob map stored?  Globally or per node - and if it's per node, how much is contained on each node?

Hector Servadac said...

The XIV Storage System reserves physical disk capacity for:
* Global spare capacity
* Metadata, including statistics and traces
* Mirrored copies of data

In other words: global space

Storagegorilla said...

Simple maths suggests:
79TB usable x 2 copies = 158TB raw required just for data.
Out of 180TB raw, that leaves 22TB, of which 15TB is required for global rebuilds.
That leaves around 7TB for partition/distribution tables and other gubbins such as stats and the OS (about 40GB per disk).
Still plenty of space to run a database in.
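That arithmetic, spelled out (assuming a full frame of 180 x 1TB drives, decimal units):

```python
# Capacity breakdown for a full XIV frame (assumption: 180 x 1TB drives).
RAW_TB = 180
USABLE_TB = 79
REBUILD_SPARE_TB = 15       # reserved for global rebuilds

mirrored = USABLE_TB * 2                # 158 TB: two copies of all data
leftover = RAW_TB - mirrored            # 22 TB remaining
overhead = leftover - REBUILD_SPARE_TB  # ~7 TB for tables, stats, OS
per_disk_gb = overhead * 1000 / RAW_TB
print(f"{overhead} TB overhead, ~{per_disk_gb:.0f} GB per disk")
```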