techmute

Thanks, Matt. Pulled this out again today so I cou...

2011-02-19T10:47:25.920-08:00

Thanks, Matt. Pulled this out again today so I could burn my howto brewing DVD for iPad format.

Like the old saying goes: "The lawyer that re...

2010-10-19T09:00:48.278-07:00

Like the old saying goes: "The lawyer that represents himself has an idiot for a client." Oftentimes our self-interest does not serve our best interest.

There's also "that's the way we'v...

2010-10-18T06:01:43.559-07:00

There's also "that's the way we've always done it" view.

Hi Matt, Thanks for giving my book a mention. I...

2010-09-21T16:27:10.790-07:00

Hi Matt,

Thanks for giving my book a mention. I've set up a companion site, www.storage-brain.com, for folks to voice opinions about trends in data storage evolution and have also set up a searchable reference library. I'll add your required reading to the library-

Larry

The answer to the question is pretty straightforwa...

2010-08-25T16:28:07.430-07:00

The answer to the question is pretty straightforward, although maybe a little uncomfortable for IBM. As someone said above, in the case of a DDF somewhere between 100% and 0% of the LUNs will be affected. So on average, 50% will be affected.

OK, so now 50% of the LUNs have "holes" in them. But they stay on-line. What will happen is that when the host(s) try to read any of the blocks in those "holes", the result will be a SCSI read error, and in most cases a series of retries that all fail as well. This is all happening below the FS layer, and on UNIX systems it will fill your syslog with messages.

How the FS reacts to this depends on the FS (and the volume manager). Some volume managers will take the volume off-line. Others will just pass the read error on through to the FS, and then on to the application. How the application reacts is application dependent. For example, databases will barf all over the place, and as pointed out earlier, you're probably looking at restoring those databases to a point in time prior to the error. How painful that might be depends on a lot of factors I don't have time to go into right now.

The bottom line is that if anyone thinks that having to restore even half the databases using the array to a previous point in time isn't going to get someone on the business side questioning the sanity of the person who decided to buy this array, then they must be living in an alternate reality from those of us who have to provide a service to an actual business.

Matt, I like this post! Very good idea to have it ...

2010-08-07T06:29:42.803-07:00

Matt,
I like this post! Very good idea to have it summed up for newbies. Trying to find quality posts via Google would take hrs or longer.
Thanks

Nice to see you rounding up some of the better fou...

2010-08-07T04:50:17.674-07:00

Nice to see you rounding up some of the better foundational posts out there on the web.

I've got two more posts along these lines that I'm always asked for.

One discusses the basics of storage caching: http://chucksblog.emc.com/chucks_blog/2010/03/storage-caching-101.html

The other discusses the different approaches to managing storage in enterprises large and small: http://chucksblog.emc.com/chucks_blog/2010/07/how-will-you-manage-storage.html

Hope people find them useful ...

-- Chuck

Anonymous, Your math is way off. With a 15 module ...

2010-06-09T11:45:10.565-07:00

Anonymous,
Your math is way off. With a 15 module system there are 180 drives that are used. XIV does not mirror disks but rather the 1 MB chunks, so your comment about only using 90 drives is flawed, which makes the rest of your equation irrelevant. When a LUN is created it is spread across all 180 drives, and the rebuild of a failed drive would use 168 (180-12, because no mirrored data is on the same module).

For argument's sake, let's say that all the stars were aligned, you were wearing your purple striped socks, standing on one foot, and there was a double drive failure. All that would be lost would be the union data between the two drives, which would most likely not be all LUNs on the entire array.

As SRJ posted, those blocks that were lost can be pinpointed by support personnel and then restored.

Not to get too much off topic but >90% of DD failures happen within the same shelf and it is due to a common problem within that shelf.

Dimitris, Fortunately, all modern OS have to deal ...

2010-06-08T16:29:07.292-07:00

Dimitris,
Fortunately, all modern OSes have to deal with this already. The XIV will either suspend or fail any read or write request against an affected 1MB block with a Media Error. This is a standard process that all disk systems must support, as on any disk an individual spot on the physical HDD can fail.

Modern file systems keep two or three copies of their master inode table. The affected 1MB chunk could impact many small files, or be part of a big file. In either case, the files are easily identified and can be restored using standard procedures. If the 1MB chunk happens to affect the directory structure, then the file system's journal may be needed to effect repair. Most modern file systems are now journaled for this purpose. Tools like "fsck" and "CHKDSK" were not written for the XIV; they existed long before the XIV for exactly the same reason: some disks don't fail all at once, but rather block by block.

Even WAFL on IBM's N series has an "fsck" tool. Was it written for the XIV? No, it exists because even NetApp recognizes that block failures are possible.

-- Tony Pearson (IBM)

Hello, D from NetApp here. I'm a bit confused...

2010-06-08T11:49:19.959-07:00

Hello, D from NetApp here.

I'm a bit confused still -

How do modern OSes deal with little chunks of a LUN missing?

On a fairly full XIV system, a dual drive failure (drives failing in rapid succession) will, by definition, affect ALL LUNs if the drives do fail at about the same time.

Since, according to Tony Pearson, the LUNs stay online during such events, I wonder what the ramifications are on the client OSes.

Even if the drives can be rebooted with the special tool, what happens during the time they're down?

I guess I'm totally not OK with the box leaving the LUNs online while such things are going on...

I have no idea what happens to the OS if it figures out stuff is missing, I don't have a way to take out of circulation chunks of a LUN like XIV seems to, but I'd think that no modern filesystem would be exactly happy.

Why are we glossing over this? It has not been answered satisfactorily.

I'm not saying the likelihood of this happening is great, I'm saying the possibility exists. Insurance is there to cover such possibilities, nobody insures against the impossible.

D

Matt, To answer your first question, 80 percent of...

2010-06-07T17:07:52.032-07:00

Matt,
To answer your first question, 80 percent of the time after a DDF, the IBM SE will be able to reboot the second drive in read-only mode using an HGST tool, which will then allow the original failed drive to complete rebuild, thus zero percent loss.

The other 20 percent of the time, the percentage of LUNs affected would be anywhere from 0 to 100 percent, depending on how far apart the two drives failed timewise, with "instantaneously" being closer to 100 and nearly 30 minutes apart being near 0.

Note that only the 1MB chunks are affected. The LUNs themselves remain online in all cases, and you can read/write all unaffected data.

The rebuild process sorts by LUNid, so all the 1MB chunks of a specific LUN are done as a group, so that the entire LUN can be taken off the affected list.

As for the second question: a 1MB chunk can be part of a primary LUN, part of one or more snapshots, or a combination of the two. In other words, a 1MB chunk could be on a snapshot only, in which case only the snapshots that point to this chunk are affected. The same file scan process (fsck/CHKDSK) can be used against snapshots to see the files affected.

For some people, losing a snapshot is no big deal, and for others, that could have been a very important point-in-time copy.

-- Tony Pearson (IBM)

Thanks for participating Tony. I'll start by ...

2010-05-25T18:34:51.009-07:00

Thanks for participating Tony. I'll start by stating, we are debating an extremely unlikely event. However, low probability/high impact events should be understood prior to implementation during due diligence.

Doing some rough "back of the napkin" math, a double disk failure (DDF) would probably result in a union list of approximately 5 GB (assuming 80 TB usable and a 16-17 GB allocation increment which is close to being a perfect distribution (minus the 60/40 split between interface and data nodes)).

First assumption: The 16-17GB interval size coupled with the public literature on RAID-X indicates that, in the case of a DDF between an interface node and an access node, you'll probably impact the vast majority of allocations (once the array gets to be full and both copies are written on data nodes, of course, that reduces that exposure). I believe best practice is to zone/allocate through all interface nodes, so I'm assuming no arbitrary host segregation.

While 5 GB of data loss isn't a substantial amount of data, it's a huge deal if it impacts almost every allocation on the array. I'd be surprised if a lot of file systems don't just crash once the unavailable data is accessed (I'd reference a recent post on the quality of most filesystems, but Google is failing me at the moment). Additionally, while file servers / unstructured data have the possibility of just restoring the affected files, any database allocations will most likely need to be completely restored.

Furthermore, per your blog, the process to identify the lost data is manual:

"the client can determine which hosts these LUNs are attached to, run file scan utility to the file systems that these LUNs represent. Files that get a media error during this scan will be listed as needing recovery. A chunk could contain several small files, or the chunk could be just part of a large file. To minimize time, the scans and recoveries can all be prioritized and performed in parallel across host systems zoned to these LUNs."

So, while 5 GB isn't a huge amount of data, if it hits every allocation in the array, it's a very large issue.
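
The file scan described in the quoted passage could look roughly like the following. This is only a sketch of the general idea, not IBM's procedure or tool: `scan_tree` is a hypothetical helper name, and it assumes that a read against a lost chunk surfaces to the host as an I/O error (typically `EIO` on UNIX-like systems) when the file is read end to end.

```python
import os

CHUNK = 1024 * 1024  # read in 1 MB pieces, the chunk size discussed in this thread


def scan_tree(root):
    """Read every file under root and return [(path, errno)] for files that fail.

    Hypothetical helper: a file that raises OSError while being read is
    listed as needing recovery; everything else is assumed intact.
    """
    damaged = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    # Read the whole file so every block is touched.
                    while fh.read(CHUNK):
                        pass
            except OSError as exc:
                damaged.append((path, exc.errno))
    return damaged
```

Running `scan_tree` once per mounted file system, in parallel across hosts, matches the "prioritized and performed in parallel" part of the quote. Note that data already sitting in the host's cache may be served without touching the array, so a real scan would need to account for that.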

Main questions:

1. Regardless of data loss, what percentage of allocations/LUNs are affected by a DDF in a best-practices configuration?
2. How does snapshot data play into how much data is at risk (are single file restores an option, what if the 'gold' copy is affected, etc)?

Of course, correct any poor assumptions on my part up above...

Finally, could you please talk to whoever manages the IBM blogs and get them to accept OpenID instead of yet-another-registration to comment? The current implementation limits public commenting (which could be a feature rather than a bug, I suppose).

I guess it depends what you decide is a "huge...

2010-05-12T23:49:42.077-07:00

I guess it depends what you decide is a "huge problem". Losing two drives within minutes results in only losing a few GB of data, which can be identified and restored in less time than a double drive failure on RAID-5 systems. See my post:

https://www.ibm.com/developerworks/mydeveloperworks/blogs/InsideSystemStorage/entry/ddf-debunked-xiv-two-years-later?lang=en

Tony Pearson (IBM)

Simple maths suggests: 79TB Usable x 2 copies = 15...

2010-03-30T19:37:27.650-07:00

Simple maths suggests:
79TB Usable x 2 copies = 158TB Raw required just for data.
Leaves 22TB - 15TB required for global rebuilds.
Leaves around 7TB for partition/distribution tables and other gubbins such as stats and OS (about 40GB per disk).
Still a plenty big space to run a database in.
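
The same budget, written out as a small sketch. The 180 x 1 TB raw figure is an assumption taken from the drive counts quoted elsewhere in this thread; the other numbers are the ones in this comment.

```python
DRIVES = 180             # drives in a 15 module system, per this thread
DRIVE_TB = 1             # assumption: 1 TB drives
USABLE_TB = 79           # usable capacity quoted in this comment
COPIES = 2               # every chunk is stored twice
REBUILD_RESERVE_TB = 15  # space set aside for global rebuilds


def capacity_budget():
    """Return the raw-capacity breakdown in TB, plus leftover GB per disk."""
    raw_tb = DRIVES * DRIVE_TB
    data_tb = USABLE_TB * COPIES
    after_data_tb = raw_tb - data_tb
    leftover_tb = after_data_tb - REBUILD_RESERVE_TB
    return {
        "raw_tb": raw_tb,
        "data_tb": data_tb,
        "after_data_tb": after_data_tb,
        "leftover_tb": leftover_tb,
        "leftover_gb_per_disk": leftover_tb * 1000 / DRIVES,
    }


if __name__ == "__main__":
    for key, value in capacity_budget().items():
        print(f"{key}: {value:.1f}")
```

The leftover works out to a little under 39 GB per disk, which is where the "about 40GB per disk" figure comes from.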

The XIV Storage System reserves physical disk capa...

2010-03-30T15:50:11.558-07:00

The XIV Storage System reserves physical disk capacity for:
* Global spare capacity
* Metadata, including statistics and traces
* Mirrored copies of data

In other words: global space

So, if its indeed pseudo-random rather than algori...

2010-03-29T12:12:12.605-07:00

So, if it's indeed pseudo-random rather than algorithmically defined, where is the blob map stored? Globally or per node, and if it's per node, how much is contained on each node?

In fact it's pseudo-random distribution, becau...

2010-03-29T11:02:37.506-07:00

In fact it's a pseudo-random distribution, because blocks are copied to different nodes and there's some kind of preference for specific nodes (e.g. interface nodes or data nodes).
Remember, when XIV talks about rebuild and self-healing it's talking about spare space, not spare disks, so you can stripe across all disks all the time.

Regarding the blob map- Reading the Redbook, it a...

2010-03-28T21:28:25.377-07:00

Regarding the blob map-

Reading the Redbook, it appears everything occurs at 17GB intervals (which is slightly greater than the perfect Combination of all the drives).

I suspect that all Data+Mirror positions are not random and are actually an algorithmic layout. So XIV wouldn't need to hold the blob map in memory, just do a relatively quick derivation to determine data placement. If they do actually do that, two things come to mind.

1. That's pretty clever.
2. You'd see different performance on a loaded array during a rebuild on the affected 1MB chunks (unless they cache as much as possible of the affected 800 GB-ish of chunks that don't map to the perfect layout).

I think it may be time to have a Wave to discuss t...

2010-03-25T20:47:34.772-07:00

I think it may be time to have a Wave to discuss the much-hyped URE. What do you think?

Funny - the IBM buy who was here installing the XI...

2010-03-25T10:26:32.179-07:00

Funny - the IBM guy who was here installing the XIV did say that Infiniband was in the future... So you may be on to something.

If they do that they might be on to something.

So does that make The XIV IBM's version of Windows Vista? You know, the "We don't have anything so we'll release something even if it's not quite finished yet" type of deal.

I'm also curious as to how much memory the BLOB map takes up, and what the lookup times are for individual records? It seems that the access nodes would have to do SOME kind of lookup to determine where each of the logical parts of a file are, then go fetch them, then (presumably it happens in this order) put them into the correct sequence.

For 80TB usable, divided into 1MB chunks, and mirrored (though I'm assuming fetch doesn't happen from the remote data nodes if it can be avoided, due to network latency), we're talking about 81920000000000000 or 81 thousand trillion records. (Double that to keep track of the mirror blobs.)

That's a lot of space. Therefore it is probably NOT held in memory, or in fact held on any one of the access nodes by itself.

It's more likely that each access/data node keeps track of the blobs on its own 12TB of data, and coordinates that information when reads/writes are issued.

Seems like it would work, seems like it would go SPECTACULARLY wrong when it doesn't.
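
For scale, the record count can be rechecked with a few lines of arithmetic. This is a minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a hypothetical 16 bytes per map entry, which is not a figure from IBM. Under those assumptions the count lands in the tens of millions of chunks, so the in-memory question turns mostly on how large each entry is.

```python
USABLE_BYTES = 80 * 10**12  # 80 TB usable, decimal units (assumption)
CHUNK_BYTES = 10**6         # 1 MB chunks, decimal units (assumption)
COPIES = 2                  # primary plus mirror
ENTRY_BYTES = 16            # hypothetical size of one map entry


def blob_map_size():
    """Return (chunks, records including mirrors, map size in GB)."""
    chunks = USABLE_BYTES // CHUNK_BYTES
    records = chunks * COPIES
    map_gb = records * ENTRY_BYTES / 10**9
    return chunks, records, map_gb


if __name__ == "__main__":
    chunks, records, map_gb = blob_map_size()
    print(f"chunks: {chunks:,}")
    print(f"records with mirrors: {records:,}")
    print(f"map size at {ENTRY_BYTES} B/entry: {map_gb:.2f} GB")
```

Changing `ENTRY_BYTES` shows how sensitive the total is to the per-record overhead, and dividing by the 15 modules gives a rough per-node share if the map is distributed.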

I hope you don't mind me posting this. We went...

2010-03-23T15:31:23.692-07:00

I hope you don't mind me posting this. We went through a similar exercise about 60 days ago on Wikibon. It's amazing how polarized people are on XIV...but we tried (and I think succeeded) in summarizing the huge volume of feedback into a single note.

I'd love to see the PDF summary and would be interested in comparing and incorporating the key points here:

Feel free to use this as you see fit:
http://wikibon.org/wiki/v/The_IBM_XIV_Storage_Array_Performance_and_Availability_Envelope

I am very curious about URE as well as whole-disk ...

2010-03-23T13:53:10.485-07:00

I am very curious about URE as well as whole-disk failure. This wasn't much of a problem back when disks were small, but 1 bit in 12 TB does add up when you're talking something as large as the XIV!

If it's got single-parity RAID OR mirroring, failure is guaranteed.
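
To put a rough number on the URE exposure, here is a small sketch. It assumes independent bit errors at a rate of 1 unrecoverable error per 10^14 bits read (about 12.5 TB, close to the "1 bit in 12 TB" above); that rate is a commonly quoted spec-sheet figure used here as an assumption, not a measurement of XIV drives.

```python
import math


def p_at_least_one_ure(tb_read, ure_per_bit=1e-14):
    """Probability of hitting at least one URE while reading tb_read terabytes.

    Assumes independent errors at ure_per_bit per bit read (an assumption).
    Uses log1p/expm1 so the tiny per-bit rate is not lost to rounding.
    """
    bits = tb_read * 1e12 * 8
    return -math.expm1(bits * math.log1p(-ure_per_bit))


if __name__ == "__main__":
    for tb in (1, 12.5, 79):
        print(f"{tb} TB read: {p_at_least_one_ure(tb):.3f}")
```

Under these assumptions the chance of at least one URE approaches certainty only as the amount read grows well past 12.5 TB, so what matters for a rebuild is how much data has to be read back, not the total size of the array.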

Hey man - thought it was a fair write-up...nice wo...

2010-03-23T00:10:21.744-07:00

Hey man - thought it was a fair write-up...nice work. Glad to see we made some progress clearing up some misconceptions out there. Unfortunately, that work will probably never be done...

I actually just checked the Wave and saw you had posted another question...sorry for the delay in answering. If you want to discuss the URE scenarios, fire away!

Matt, thanks for your comments. I've seen big ...

2010-03-22T21:31:19.945-07:00

Matt, thanks for your comments.
I've seen big redundant systems failing because of a misconfiguration. I'm talking about financial services companies with large multi-frame systems and suddenly... SAN is down.
The same applies to operating systems: people run magic scripts and then you have Solaris, AIX or Linux on its knees, despite their integrated security.
In XIV things aren't perfect, but neither are they so bad as to call it a mistake.
If you can't touch the internal architecture, you can't break it. Ease of configuration, with a good design, can reduce points of failure such as human creativity.
If you look at system failure statistics by operating system, Windows fails more often than Solaris, Solaris fails more often than AIX, and AIX fails more often than IBM i. Why, if IBM i and AIX run on the same system hardware? Why, if IBM i usually runs with 1 controller path to DAS and AIX with 2 controller paths to SAN? Human-ware!
Remember, you need 2 disk failures, 1 in an interface node and 1 in a data node, within a 30-minute rebuild window. According to David Floyer's calculation in
http://wikibon.org/wiki/v/The_IBM_XIV_Storage_Array_Performance_and_Availability_Envelope you have a 0.41% probability of a double disk failure in 5 years; it doesn't take specific nodes into account, just any 2 disk failures. I think failures are more likely to occur in the same node, because of the back-plane circuit, so if I'm not wrong, you have less than 0.41% in 5 years, assuming you didn't upgrade your system before that.

Now, I have some holes in my knowledge:

1) What happens with performance and latency when there is no QoS and a big IO-consuming server is eating internal bandwidth?
2) What happens with port balancing/distribution? I think you need some kind of balanced attachment to interface node ports, like in Symmetrix, to get enough bandwidth and avoid data moving through the internal network.
3) Tony Pearson mentions in some blog I can't remember now that not all data is duplicated between DATA and INTERFACE nodes, but also between DATA and DATA nodes. It makes sense, you have 6 interface nodes and 9 data nodes, but what about 1 data-node and 1 data-node failure? I think this could change the equation. Still a perfect pair, tho.

I started thinking about this also, it's much ...

2010-03-16T14:09:17.672-07:00

I started thinking about this also, and it's much worse than you thought. 78TB of capacity mirrored on 180 drives means that you have 78TB across 90 drives. You have the potential for 866GB per TB drive. If you drive the system to 75% utilization you use around 650GB of capacity, or 650,000 1 MB chunks. Those can only be mirrored, equally, across 168 drives. Any 2 drives share ~3800 copies of data. If you have a 78 TB system and use 20GB LUNs, you'd have 4000 LUNs. A double drive failure would probably corrupt the entire array.
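
The arithmetic in this comment can be reproduced as follows. This is a sketch of the comment's own assumptions only (data laid out over 90 of the 180 drives, mirrors spread evenly over 168 drives, decimal units), not a description of how XIV actually places data; the parameter names are made up for illustration.

```python
def ddf_overlap(usable_tb=78, data_drives=90, utilization=0.75,
                mirror_targets=168, lun_gb=20):
    """Return (GB per drive, used 1 MB chunks per drive,
    chunks shared by any two drives, number of LUNs)."""
    per_drive_gb = usable_tb * 1000 / data_drives
    used_chunks = per_drive_gb * utilization * 1000  # 1 MB chunks in use per drive
    shared_per_pair = used_chunks / mirror_targets   # even spread over the mirror targets
    luns = usable_tb * 1000 / lun_gb
    return per_drive_gb, used_chunks, shared_per_pair, luns


if __name__ == "__main__":
    per_drive_gb, used_chunks, shared, luns = ddf_overlap()
    print(f"capacity per drive: {per_drive_gb:.0f} GB")
    print(f"used chunks per drive: {used_chunks:,.0f}")
    print(f"chunks shared by a drive pair: {shared:,.0f}")
    print(f"20 GB LUNs: {luns:,.0f}")
```

With these inputs the shared-chunk count and the LUN count come out at roughly the same size (about 3,900 each), which is the basis for the claim that a double drive failure could touch most LUNs. Changing `data_drives` or `mirror_targets` shows how much the conclusion depends on the layout assumptions.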