Monday, March 22, 2010

XIV Recap

UPDATE [8/6/2010]:  If you're interested in more current XIV information, I recommend reading Tony Pearson's recent posts here and here.  He also provided additional information in the comments to one of my posts here.

A few weeks ago, I created a Google Wave to discuss the architecture surrounding XIV and the related FUD (some of it fact-based) that this architecture attracted.  I intended to post a recap after the wave had died down.

This is not that recap.  The recap was about 80% complete, but more reputable resources have since posted much of the same information.  For anyone interested in the actual Wave content, contact me and I'll send a PDF (provided there is some mechanism to decently print a Wave).  I also participated in a podcast Nigel hosted last week, which is available in his podcast archives.

New Zealand IBMer the Storage Buddhist wrote this post discussing the disk layout and points of failure in IBM's XIV array... which generated this response by NetApp's Alex McDonald.  Both posts, especially the comments, are interesting and show both sides of the argument around disk reliability for XIV.

This post is meant to bridge a few gaps on both sides, and requires a little disclaimer.  Most of the technical information below came from the Google Wave, primarily from IBM-badged employees and VARs.  I have been unable to independently verify its accuracy; even the IBM Redbook on XIV has diagrams of data layout that contradict these explanations, but with disclaimers that basically say the diagrams are for illustrative purposes and don't show how it actually works.  So, caveat emptor: make sure you go over the architecture's tradeoffs with your sales team.

Hosts are connected to the XIV through interface nodes: 6 of the 15 servers in an XIV system have FC and iSCSI Ethernet interfaces providing host connectivity.  Below an unspecified capacity threshold, each incoming write is written to an interface node (most likely the one it came in on) and mirrored to a data node (one of the other 9 servers in the system).

At this point, you can have drive failures in multiple interface nodes without data loss.  In fact, one person claimed that you could lose all of the interface nodes without losing any data (of course, this would halt the array).  The data-loss risk in this case is losing one drive in an interface module (40% of the disks) followed by one drive in a data module (60% of the disks) before the rebuild completes (estimated, worst case, at 30-40 minutes).  Or, as it was put in the wave:
"If I lose a drive from any of a pool of 72 drives, and then I lose a second disk from a separate pool of 108 drives before the rebuild completes for the first drive, I'm going to have a pretty huge problem." 
Past a certain unknown threshold, incoming writes start getting mirrored between two data nodes rather than between an interface node and a data node.  At that point, a double disk failure across two different data nodes can also cause a pretty huge problem.
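As a toy illustration of the layout described above, here's a small simulation of pseudo-random 1MB-chunk mirroring.  This is a hypothetical model, not IBM's actual algorithm; the module and drive counts (6 interface modules, 9 data modules, 12 drives each) come from the discussion in the Wave.  It shows why, below the threshold, any specific interface-drive/data-drive pair shares some chunks, which is exactly the double-failure exposure:

```python
import random

# Hypothetical model of the layout described above -- NOT IBM's actual
# algorithm.  6 interface modules + 9 data modules, 12 drives each.
INTERFACE_DRIVES = [("iface", m, d) for m in range(6) for d in range(12)]   # 72 drives
DATA_DRIVES      = [("data",  m, d) for m in range(9) for d in range(12)]   # 108 drives

random.seed(42)
NUM_CHUNKS = 100_000  # stand-in for the millions of 1MB chunks on a real box

# Below the capacity threshold: primary copy on an interface drive,
# mirror copy on a data drive, both chosen pseudo-randomly.
placement = [(random.choice(INTERFACE_DRIVES), random.choice(DATA_DRIVES))
             for _ in range(NUM_CHUNKS)]

# Fail one specific interface drive and one specific data drive: every
# chunk whose two copies sat on exactly that pair is gone.
failed_iface, failed_data = INTERFACE_DRIVES[0], DATA_DRIVES[0]
lost = sum(1 for pri, sec in placement
           if pri == failed_iface and sec == failed_data)

print(f"chunks lost by this one drive pair: {lost}")
print(f"expected per pair: {NUM_CHUNKS / (72 * 108):.1f}")
```

Scaled up to roughly a million chunks per drive, that small per-pair overlap is what produces the "a few GB lost, but potentially spread across many LUNs" behavior people argue about.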

From a 'hot spare' perspective, the XIV has spare capacity to cover 15 drive failures.  When you hear XIV resources discuss "sequential failures," they typically mean drive failures that occur after the previous one has rebuilt, but prior to the replacement of the failed drive.  This is an important statistic from the perspective of double drive failures that occur because the first failed drive was never detected (have you verified YOUR phone-home lately?).

A couple of final thoughts.  First, the effect of an uncorrectable error during a rebuild was never fully explained.  I have heard in passing that "the lab" can tell you which volume is affected and that it shouldn't cause the same impact as two failed drives.  Second, Hector Servadac mentions the following on the Storage Buddhist's post:
2 disk failures in specific nodes each one, during a 30 min windows, is likely as 2 controller failure
Unless I'm misunderstanding the impact of a two-controller failure, there is no data loss with that type of 'unlikely' failure, while with the double drive failure there is significant data loss.  Still, as a measure of "how likely does XIV/IBM feel this outage scenario is," it serves as a decent yardstick.

I tried to make this as unbiased as possible.  I am positive I will be brutally corrected in the comments :-).


Hector Servadac said...

Matt, thanks for your comments.
I've seen big redundant systems fail because of a misconfiguration. I'm talking about financial services companies with large multi-frame systems and suddenly... the SAN is down.
The same applies to operating systems: people run magic scripts and then Solaris, AIX or Linux is on its knees, despite their integrated security.
In XIV, things aren't perfect, but neither are they so bad as to call it a mistake.
If you can't touch the internal architecture, you can't break it. Ease of configuration, with a good design, can reduce points of failure like human creativity.
If you look at system failures by operating system, Windows fails more often than Solaris, Solaris fails more often than AIX, and AIX fails more often than IBM i. Why, if IBM i and AIX run on the same system hardware? Why, if IBM i tends to run with 1 controller path to DAS and AIX with 2 controller paths to SAN? Human-ware!
Remember, you need 2 disk failures, 1 in an interface node and 1 in a data node, within a 30-minute rebuild window. According to David Floyer's calculation, you have a 0.41% probability of a double disk failure in 5 years, and it doesn't take into account specific nodes, just any 2 disk failures. I think a second failure is more likely to occur in the same node, because of the back-plane circuit, so if I'm not so wrong, you have less than 0.41% in 5 years, assuming you don't upgrade your system before then.
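A quick sketch of that arithmetic in code, taking Floyer's ~0.41% five-year figure and the 72/108 drive split as given (and ignoring the above-threshold case, where data/data pairs are also exposed):

```python
# Rough conditioning of the ~0.41% five-year double-disk-failure figure
# on the "dangerous" cross-pool case.  All inputs are assumptions from
# the discussion above, not vendor numbers.
TOTAL, IFACE, DATA = 180, 72, 108
p_any_ddf = 0.0041   # P(any two drives fail within one rebuild window, 5 yr)

# The dangerous pair (below the capacity threshold) is one interface
# drive plus one data drive: the second failure must land in the pool
# opposite the first.
p_cross = (IFACE / TOTAL) * (DATA / (TOTAL - 1)) \
        + (DATA / TOTAL) * (IFACE / (TOTAL - 1))

print(f"P(second failure is in the opposite pool) = {p_cross:.3f}")
print(f"five-year cross-pool DDF estimate = {p_any_ddf * p_cross:.4%}")
```

Under those assumptions, roughly half of all random double failures land in the dangerous cross-pool configuration, so the cross-pool estimate comes out somewhat below Floyer's overall figure.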

Now, I have some holes in my knowledge:

1) What happens to performance and latency with no QoS and a big IO-consuming server eating internal bandwidth?
2) What happens with port balancing/distribution? I think you need some kind of balanced attachment to interface node ports, like in Symmetrix, to get enough bandwidth and avoid data moving through the internal network.
3) Tony Pearson mentions in some blog post I can't remember now that not all data is duplicated between DATA and INTERFACE nodes; some is mirrored between DATA and DATA nodes. It makes sense, since you have 6 interface nodes and 9 data nodes, but what about a data-node plus data-node failure? I think this could change the equation. Still a perfect pair, tho.

SRJ said...

Hey man - thought it was a fair write-up...nice work. Glad to see we made some progress clearing up some misconceptions out there. Unfortunately, that work will probably never be done...

I actually just checked the Wave and saw you had posted another question...sorry for the delay in answering. If you want to discuss the URE scenarios, fire away!

Stephen Foskett said...

I am very curious about URE as well as whole-disk failure. This wasn't much of a problem back when disks were small, but 1 bit in 12 TB does add up when you're talking something as large as the XIV!

If it's got single-parity RAID OR mirroring, failure is guaranteed.

Dave Vellante said...

I hope you don't mind me posting this. We went through a similar exercise about 60 days ago on Wikibon. It's amazing how polarized people are on XIV...but we tried (and I think succeeded) in summarizing the huge volume of feedback into a single note.

I'd love to see the PDF summary and would be interested in comparing and incorporating the key points here:

Feel free to use this as you see fit:

SRJ said...

I think it may be time to have a Wave to discuss the much-hyped URE. What do you think?

Unknown said...

I guess it depends what you decide is a "huge problem". Losing two drives within minutes results in only losing a few GB of data, which can be identified and restored in less time than a double drive failure on RAID-5 systems. See my post:

Tony Pearson (IBM)

techmute said...

Thanks for participating, Tony. I'll start by stating that we are debating an extremely unlikely event. However, low-probability/high-impact events should be understood during due diligence, prior to implementation.

Doing some rough "back of the napkin" math, a double disk failure (DDF) would probably result in a union list of approximately 5 GB of lost chunks (assuming 80 TB usable and a 16-17 GB allocation increment, which is close to a perfect distribution, minus the 60/40 split between interface and data nodes).

First assumption: the 16-17 GB allocation size, coupled with the public literature on RAID-X, indicates that a DDF between an interface node and a data node will probably impact the vast majority of allocations (once the array fills up and both copies are written to data nodes, of course, that exposure is reduced). I believe best practice is to zone/allocate through all interface nodes, so I'm assuming no arbitrary host segregation.

While 5 GB of data loss isn't a substantial amount of data, it's a huge deal if it impacts almost every allocation on the array. I'd be surprised if most file systems didn't simply crash once the unavailable data is accessed (I'd reference a recent post on the quality of most filesystems, but Google is failing me at the moment). Additionally, while file servers / unstructured data allow the possibility of just restoring the affected files, any database allocations will most likely need to be restored completely.

Furthermore, per your blog, the process to identify the lost data is manual:

"the client can determine which hosts these LUNs are attached to, run file scan utility to the file systems that these LUNs represent. Files that get a media error during this scan will be listed as needing recovery. A chunk could contain several small files, or the chunk could be just part of a large file. To minimize time, the scans and recoveries can all be prioritized and performed in parallel across host systems zoned to these LUNs."

So, while 5 GB isn't a huge amount of data, if it hits every allocation in the array, it's a very large issue.
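Here's that napkin math as a short script. Every input (80 TB usable, 180 drives, a 108-drive mirror-partner pool, 1MB chunks) is an assumption pulled from the public XIV descriptions, not a vendor spec, and it lands in the same single-digit-GB ballpark:

```python
# Back-of-the-napkin DDF loss estimate.  All figures are assumptions
# from public XIV descriptions, not vendor specs.
USABLE_TB    = 80     # usable capacity
DRIVES       = 180    # 15 modules x 12 drives
PARTNER_POOL = 108    # data drives an interface drive can mirror to
CHUNK_MB     = 1      # XIV distributes data in 1MB chunks

# Usable data is mirrored, so the raw occupied space is double.
raw_mb_per_drive = USABLE_TB * 2 * 1_000_000 / DRIVES
chunks_per_drive = raw_mb_per_drive / CHUNK_MB

# With an even pseudo-random spread, a given interface drive shares an
# equal slice of its chunks with each of its possible mirror partners.
shared_chunks = chunks_per_drive / PARTNER_POOL
print(f"chunks per drive: {chunks_per_drive:,.0f}")
print(f"lost per interface/data drive pair: ~{shared_chunks * CHUNK_MB / 1000:.1f} GB")
```

The exact figure moves with the fill level and the interface/data split, but any reasonable set of inputs stays in the "a few GB, scattered across the box" range.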

Main questions:

1. Regardless of data loss, what percentage of allocations/LUNs are affected by a DDF in a best-practices configuration?
2. How does snapshot data play into how much data is at risk (are single file restores an option, what if the 'gold' copy is affected, etc)?

Of course, correct any poor assumptions on my part up above...

Finally, could you please talk to whoever manages the IBM blogs and get them to accept OpenID instead of yet-another-registration to comment? The current implementation limits public commenting (which could be a feature rather than a bug, I suppose).

Unknown said...

To answer your first question, 80 percent of the time after a DDF, the IBM SE will be able to reboot the second drive in read-only mode using an HGST tool, which will then allow the original failed drive to complete rebuild, thus zero percent loss.

The other 20 percent of the time, the percentage of LUNs affected would be anywhere from 0 to 100 percent, depending on how far apart the two drives failed timewise, with "instantaneously" being closer to 100 and nearly 30 minutes apart being near 0.

Note that only the affected 1MB chunks are lost. The LUNs remain online in all cases; you can read/write all unaffected data.

The rebuild process sorts by LUNid, so all the 1MB chunks of a specific LUN are done as a group, so that the entire LUN can be taken off the affected list.

As for the second question: a 1MB chunk can be part of a primary LUN, part of one or more snapshots, or a combination of the two. In other words, a 1MB chunk could be on a snapshot only, in which case only the snapshots that point to this chunk are affected. The same file scan process (fsck/CHKDSK) can be used against snapshots to see the files affected.

For some people, losing a snapshot is no big deal; for others, it could have been a very important point-in-time copy.

-- Tony Pearson (IBM)

Dimitris Krekoukias said...

Hello, D from NetApp here.

I'm a bit confused still -

How do modern OSes deal with little chunks of a LUN missing?

On a fairly full XIV system, a dual drive failure (drives failing in rapid succession) will, by definition, affect ALL LUNs if the drives do fail at about the same time.

Since, according to Tony Pearson, the LUNs stay online during such events, I wonder what the ramifications are on the client OSes.

Even if the drives can be rebooted with the special tool, what happens during the time they're down?

I guess I'm totally not OK with the box leaving the LUNs online while such things are going on...

I have no idea what happens to the OS if it figures out stuff is missing. I don't have a way to take chunks of a LUN out of circulation like XIV seems to, but I'd think that no modern filesystem would be exactly happy.

Why are we glossing over this? It has not been answered satisfactorily.

I'm not saying the likelihood of this happening is great, I'm saying the possibility exists. Insurance is there to cover such possibilities, nobody insures against the impossible.


Unknown said...

Fortunately, all modern OSes have to deal with this already. The XIV will either suspend or fail any read or write request against an affected 1MB chunk with a media error. This is a standard process that all disk systems must support, as every disk has the possibility of an individual spot on the physical HDD failing.

Modern file systems keep two or three copies of their master inode table. The affected 1MB chunk could impact many small files, or be part of a big file. In either case, the files are easily identified and can be restored using standard procedures. If the 1MB chunk happens to affect the directory structure, then the file system's journal may be needed to effect repair. Most modern file systems are journaled for exactly this purpose. Tools like "fsck" and "CHKDSK" were not written for the XIV; they existed long before the XIV for exactly the same reason: some disks don't fail all at once, but rather block by block.

Even WAFL on IBM's N series has an "fsck" tool. Was it written for the XIV? No, it exists because even NetApp recognizes that block failures are possible.

-- Tony Pearson (IBM)

Joerg Hallbauer said...

The answer to the question is pretty straightforward, although maybe a little uncomfortable for IBM. As someone said above, in the case of a DDF, somewhere between 100% and 0% of the LUNs will be affected. So on average, 50% will be affected.

OK, so now 50% of the LUNs have "holes" in them, but they stay online. What will happen is that when the host(s) try to read any of the blocks in those "holes," the result will be a SCSI read error and, in most cases, a series of retries that all fail as well. This all happens below the FS layer, and on UNIX systems it will fill your syslog with messages.

How the FS reacts to this depends on the FS (and the volume manager). Some volume managers will take the volume offline; others will just pass the read error on through to the FS, and then on to the application. How the application reacts is application-dependent. For example, databases will barf all over the place, and as pointed out earlier, you're probably looking at restoring those databases to a point in time prior to the error. How painful that might be depends on a lot of factors I don't have time to go into right now.

The bottom line is that if anyone thinks having to restore even half the databases on the array to a previous point in time isn't going to get someone on the business side questioning the sanity of the person who decided to buy it, then you must be living in an alternate reality from those of us who have to provide a service to an actual business.