techmute: March 2010

Wednesday, March 24, 2010

XIV Final Thoughts- Drive Failures a Red Herring?

UPDATE [8/6/2010]: If you're interested in more current XIV information, I recommend reading Tony Pearson's recent posts here and here. He also provided additional information in the comments to one of my posts here.

Over the past few weeks, between the Wave and the blog posts, I've been thinking about XIV quite a bit. It has taken IBM quite a while to attempt to explain the impact and risk of double drive failures on XIV.

IBM definitely has an explanation, one that could have been told quite a while ago. In fact, I'd assume that this is the same explanation they've been giving customers who pushed the point: that the risk is less than it seems due to quick rebuilds and the way parity is distributed between interface and data nodes. I realize that UREs are a very large concern, but to be honest, I bet less than 5% of customers even think about storage at that level. Perhaps the double drive failure issue is just a red herring that draws attention away from other issues.

One thing that continues to stick out in my mind is the ratio of interface nodes to data nodes. On the Google Wave, one of the IBM VARs made the following statement:

Remember there is more capacity in the data modules than in the interface modules. (9 data, 6 interface) Why they couldn't make this easy and have an equal number of both module types, I'll never know! :)

The interface nodes are only 40% of the array. Even IBM VARs can't explain why this is a 40:60 ratio rather than 50:50. It increases the probability of double drive faults causing data loss at high capacity and it is a pretty specific design decision.

I wonder if it is related to the Gig-E interconnect and driving out "acceptable" performance from non-interface nodes. Jesse over at 50 Micron shares similar thoughts. Thinking this through (and this is all simply a hypothesis)... perhaps the latency and other limitations of the Gig-E interconnect are somewhat offset by having additional spindles (IOPS+throughput) on the "remote data nodes." I'd like to load an XIV frame to 50% utilization, run a 100% read workload at it, and see if the interface nodes are hit much harder than the data nodes (in effect, performing like a RAID 0, not a RAID 1). If that were true, for optimal performance you'd never want to load a frame past the point where new volumes would be allocated solely from data nodes.

I am not claiming this is true (no way for me to test it), but if XIV changes the interconnect to a different type (Infiniband, for example), I will find it interesting if "suddenly" there is a 50:50 ratio of interface to data nodes.

Monday, March 22, 2010

XIV Recap

A few weeks ago, I created a Google Wave to discuss the architecture surrounding XIV and the related FUD (some of it fact-based) that this architecture attracted. I intended to post a recap after the wave had died down.

This is not that recap. The recap was about 80% complete, but more reputable resources have posted much of the same information. For anyone interested in the actual Wave information, contact me and I'll send a PDF (provided there is some mechanism to decently print the Wave). Nigel also hosted a podcast last week that I participated in; it is available in his podcast archives.

New Zealand IBMer the Storage Buddhist wrote this post discussing the disk layout and points of failure in IBM's XIV array... which generated this response by NetApp's Alex McDonald. Both posts, especially the comments, are interesting and show both sides of the argument around disk reliability for XIV.

This post is meant to bridge a few gaps on both sides, and requires a little disclaimer. Most of the technical information below came from the Google Wave, primarily from IBM badged employees and VARs. I have been unable to independently guarantee accuracy- even the IBM RedBook on XIV has diagrams of data layout that contradict these explanations, but with disclaimers that basically say the diagrams are for illustrative purposes and don't actually show how it really works. So, caveat emptor - make sure you go over the architecture's tradeoffs with your sales team.

Hosts are connected to the XIV through interface nodes. Interface nodes are the 6 of the 15 servers in an XIV system that have FC and iSCSI Ethernet interfaces providing host connectivity. Prior to an unspecified capacity threshold, each incoming write is written to an interface node (most likely the one it came in on) and mirrored to a data node (one of the 9 other servers in an XIV system).

At this point, you can have drive failures in multiple interface nodes without data loss. In fact, one person claimed that you could lose all of the interface nodes without losing any data (of course, this would halt the array). The "data-loss" risk in this case is losing one drive in an interface module (40% of the disks) followed by one drive in a data module (60% of the disks) prior to a rebuild being complete (approximated at, worst case, 30-40 minutes). Or, as it was put in the wave:

"If I lose a drive from any of a pool of 72 drives, and then I lose a second disk from a separate pool of 108 drives before the rebuild completes for the first drive, I'm going to have a pretty huge problem."

Past a certain unknown threshold, incoming writes start getting mirrored between two data nodes rather than an interface node and a data node. At that point, double disk failures between different data nodes can also cause a pretty huge problem.
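To get a feel for the size of that window, here is a rough back-of-the-envelope sketch. It assumes independent drives with a constant failure rate, and the 3% annualized failure rate is purely my assumption, not an IBM figure. The 72 and 108 drive counts and the 40 minute worst-case rebuild come from the Wave discussion above. It also ignores UREs and the fact that a second failure only loses the data it shares with the first drive, so treat the result as an order-of-magnitude estimate at best.

```python
import math

HOURS_PER_YEAR = 24 * 365

def p_any_failure(drives: int, afr: float, window_minutes: float) -> float:
    """Probability that at least one of `drives` fails within the window,
    assuming independent drives with a constant (exponential) failure rate."""
    # Convert the annualized failure rate into a per-hour hazard rate.
    rate_per_hour = -math.log(1.0 - afr) / HOURS_PER_YEAR
    window_hours = window_minutes / 60.0
    return 1.0 - math.exp(-drives * rate_per_hour * window_hours)

INTERFACE_DRIVES = 72   # drives in the 6 interface modules (from the Wave quote)
DATA_DRIVES = 108       # drives in the 9 data modules (from the Wave quote)
ASSUMED_AFR = 0.03      # ASSUMPTION: 3% annualized failure rate per drive
REBUILD_MINUTES = 40    # worst-case rebuild window mentioned above

# An interface-module drive has already failed; how likely is a second
# failure somewhere in the data-module pool before the rebuild finishes?
p_second = p_any_failure(DATA_DRIVES, ASSUMED_AFR, REBUILD_MINUTES)
print(f"P(second failure in data pool during rebuild) = {p_second:.6f}")
```

Changing `ASSUMED_AFR` or `REBUILD_MINUTES` shows how sensitive the estimate is to rebuild speed, which is the core of IBM's argument.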

From a 'hot spare' perspective, the XIV has spare capacity to cover 15 drive failures. When you hear XIV resources discuss "sequential failures," they typically mean drive failures that occur after the previous one has rebuilt, but prior to the replacement of the failed drive. This is an important statistic from the perspective of double drive failures that occur because the failed drive was never detected (have you verified YOUR phone home lately?).

A couple of final thoughts. First off, the effect of an uncorrectable error during a rebuild was never fully explained. I have heard in passing that "the lab" can tell you what the affected volume is and that it shouldn't cause the same impact as two failed drives. Secondly, Hector Servadac mentions the following on the StorageBuddhist's post:

2 disk failures in specific nodes each one, during a 30 min windows, is likely as 2 controller failure

Unless I'm misunderstanding the impact of a 2 controller failure, there is no data loss with that type of 'unlikely' failure... with the double drive failure, there is significant data loss. But as a measure of "how likely does XIV/IBM feel this outage scenario is," it serves as a decent yardstick.

I tried to make this as unbiased as possible. I am positive I will be brutally corrected in the comments :-).

Tuesday, March 16, 2010

Breaking Datacenter Boundaries

Chuck Hollis (EMC) has an interesting post up regarding the future of workload optimization and fluid architectures. First off, he has one of the clearest definitions of cloud architecture and private clouds I've seen recently:

"What makes a cloud a cloud is three things: technology (dynamic pools of virtual resources), operations (low-touch and zero-touch service delivery) and consumption (convenient consumption models, including pay as you go)... What makes a cloud "private" is that IT exerts control over the resources and associated service delivery."

Let's take a look at today's dynamic datacenter, especially in an organization where private cloud is being pursued.

You have a very high virtualization rate. Due to less friction for resource acquisition, you can assume that more and more systems will become virtualized on the private cloud as time goes on.
You have a variable cost model, allowing for changing costs based off of actual consumption and performance utilization.
You have an automation engine, to drive processes/systems through the private cloud.
Regardless of technology, you're hopefully pursuing loosely coupled systems that do not have low latency requirements and provide rich web interfaces.

From a technology perspective, you have at least most of these in play:

VMware - VMs are moving among hosts based on dynamic workload decisions - "where" something is running becomes less important.
Intelligent Storage Optimization - placing the right data in the right place without sacrificing performance.
Replication - ensuring production data is recoverable in a remote location.

Virtualization allows IT organizations to break down silos and drive utilization up while controlling costs. Most large organizations maintain several data centers, and resources are not easily shared between them. That's the next silo to be knocked down... by leveraging the investment in virtualization and storage technologies, it could be possible in the near future.

1. You have extremely high visibility into utilization, data traffic, response times, frequency... basically, what drives the physical location of a VM. The main reason not to move a workload to a different data center typically has to do with latency between users and the application layer, or between the application layer and the data. By hooking into the hypervisor, you could determine likely candidates that can be moved without massively disrupting the user experience.
2. The "heavy lifting" of migrating large portions of production data is already taken care of. You have an asynchronous mirror of the data at the remote site, probably hooked up to an existing VMware Cluster. The remaining "system state" information could be replicated with a brief outage at a predefined window and then promoted to production at the remote site (flipping the replication to maintain recoverability).

Given the end-to-end knowledge from #1, and the data proximity of #2, you can theoretically "warm" migrate a VM from one datacenter to another, keeping response times the same or better, and increase the flexibility of the environments.
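As a minimal sketch of what that candidate selection could look like, assume the hypervisor can hand you a per-VM response time, a latency budget, and a flag for whether the data is already mirrored remotely. The `VmProfile` fields and the `migration_candidates` helper are hypothetical names for illustration, not any VMware API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VmProfile:
    name: str
    latency_budget_ms: float   # response time users will tolerate (assumed input)
    local_response_ms: float   # response time measured in the current datacenter
    replicated: bool           # True if the data is already mirrored remotely

def migration_candidates(vms: List[VmProfile], added_latency_ms: float) -> List[str]:
    """Names of VMs that would stay within their latency budget after
    adding the inter-site latency, and whose data is already remote."""
    return [
        vm.name
        for vm in vms
        if vm.replicated
        and vm.local_response_ms + added_latency_ms <= vm.latency_budget_ms
    ]

# Illustrative inventory; every number here is made up.
inventory = [
    VmProfile("reporting", latency_budget_ms=500.0, local_response_ms=120.0, replicated=True),
    VmProfile("trading", latency_budget_ms=20.0, local_response_ms=15.0, replicated=True),
    VmProfile("intranet", latency_budget_ms=300.0, local_response_ms=80.0, replicated=False),
]
print(migration_candidates(inventory, added_latency_ms=30.0))
```

In this made-up inventory only the latency-tolerant, already-replicated VM qualifies, which is the "what percentage of applications are eligible" question in miniature.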

So, in the end, it comes down to what percentage of applications are eligible for this type of workload distribution based on network and performance requirements. By optimizing at that level, you can more evenly spread out your workload requirements geographically. The notion of distributed cache coherence comes into play for applications that don't behave well in a higher latency location. Finally, once you have that technology in place, disaster recovery becomes much simpler - instead of vMotioning between hosts, you vMotion to an alternate datacenter.

Sure, none of this is available right now... but looking forward, you can see how an entirely fluid, geographically dispersed IT infrastructure is possible.

Thursday, March 11, 2010

Odds and Ends - Tiering, and Performance Planning

A few articles I wanted to briefly highlight:

Storagebod posted a brief article on automated storage tiering. To briefly summarize, imperfect automated storage tiering is better than nothing... it is an easy way to get value out of SSDs in an existing environment and it provides a mechanism to move less-used data off of FC drives and onto SATA drives. One thing is certain... the importance of manual data layouts is decreasing. Between array architectures that don't 'allow' it (XIV being the most notorious example), don't 'need' it (NetApp FAS), and traditional architectures getting performance-driven automated storage tiering, using Excel to mismanage storage layouts could finally be over. Dimitris makes the point that due diligence still needs to be applied to allocations that require high performance a few times a month to ensure the volumes don't get migrated to the wrong tier (among other comments). There are excellent comments on that post from EMC and NetApp discussing the two approaches.
Dimitris also has a good post on vendor competition and under-sizing proposals to get the sale. It is worth reading just for the 'basics' explanation of performance-sizing small arrays - it also has some good information on Compellent's architecture. My comments regarding this vendor comparison are attached to that post. As always, prior to storage acquisitions, make sure you understand how the vendor determined their bid's sizing and get guarantees on performance/capacity if you are at all concerned about meeting your requirements.
Chuck Hollis (EMC) and Marc Farley (3PAR) have excellent posts up on storage caching.

Monday, March 1, 2010

FAST & PAM Contrasted

** Updates Appended Below **

Over the past few days, I've been thinking about storage tiering... both in general, and specifically FAST and PAM II. Each takes a very different approach to providing better storage performance without highly specific tuning. This is an outsider's view based off of publicly available information (so, in cases where I'm wrong, both vendors have shown that they aren't shy in correcting misconceptions). First, some general definitions:

FASTv1: Released in December, it is the first version of EMC's Fully Automated Storage Tiering. It works at the LUN level, and requires identical LUN sizes across tiers. It is not compatible with Thin Provisioned/Virtual Provisioned LUNs.

FASTv2: Scheduled to be released in the second half of 2010, it is the next version of FAST that works at a sub-LUN level. It requires Thin Provisioning/Virtual Provisioning to manage the allocations since it utilizes that functionality to provide the granularity of migration.

PAM II: NetApp's Flash solution, Performance Acceleration Module. It acts as an additional layer of cache memory and does not have specific layout requirements.

Architecture Differences

FAST runs as a task on the processors of the DMX/VMAX. At user-specified windows, it will determine volumes/sub-volumes that would benefit from relocation and perform a migration to a different tier. This requires some IO capacity to migrate the data, so off-hours/weekends are ideal window candidates. It does a semi-permanent relocation so all reads/writes are serviced by the new location post-migration (semi-permanent since FAST can relocate an allocation back to the prior tier if the performance data indicates it is a good swap). Since RAID protection is maintained throughout the migration, the loss of components does not substantially affect response time.
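To make the window-based relocation concrete, here is a toy LUN-level planner. This is not EMC's algorithm, which isn't public as far as I know. It simply assumes identically sized LUNs, a fixed number of fast-tier slots, and IOPS statistics gathered since the last window, then ranks LUNs by activity. The `plan_moves` function and the LUN names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

def plan_moves(
    lun_iops: Dict[str, float],
    current_fast: Set[str],
    fast_slots: int,
) -> Tuple[List[str], List[str]]:
    """Return (promote, demote) lists so the busiest LUNs occupy the fast tier.

    Assumes every LUN is the same size, so each promotion can be paired
    with a demotion (a swap) during the migration window.
    """
    # Rank LUNs from busiest to quietest using the observed IOPS.
    ranked = sorted(lun_iops, key=lambda lun: lun_iops[lun], reverse=True)
    target_fast = set(ranked[:fast_slots])
    promote = sorted(target_fast - current_fast)   # not on the fast tier yet
    demote = sorted(current_fast - target_fast)    # no longer busy enough
    return promote, demote

# Made-up statistics for one migration window.
observed = {"lun_a": 900.0, "lun_b": 50.0, "lun_c": 400.0, "lun_d": 10.0}
promote, demote = plan_moves(observed, current_fast={"lun_b", "lun_c"}, fast_slots=2)
print("promote:", promote, "demote:", demote)
```

Running the planner again in a later window with new statistics can reverse a move, which matches the "semi-permanent" behavior described above.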

PAM II is treated as an extremely large read cache. Basically, as a given read-block is expired in memory, it trickles down to PAM until it is finally flushed and resides solely on disk. This gives PAM II a few nice features. First of all, there is no performance hit during the 'charging' of the PAM - since it is fed by expired 'tier-1' cache, there is no additional performance impact after the un-cached block is read. Secondly, it does not cache any writes. This is a giant assumption on my part, but I assume that due to the 'virtualization' WAFL provides, PAM does not need to track changed blocks on the disk. Since everything is pointer based (think of NetApp snaps), when the track is changed on disk, future reads hit the new disk location then get migrated through the cache levels like 'new' reads since the location has changed (the old location/data gets expired fairly quickly). The downside to this approach is that the loss of PAM requires all reads to be serviced by disk+tier 1 memory until it is replaced and recharged.
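The "trickle down" behavior can be sketched as a two-level read cache in which the second level is fed only by blocks evicted from the first. This is my simplified model of the description above, not NetApp's implementation. The LRU policy, the promotion back to the first level on a second-level hit, and the `TwoLevelReadCache` class are all assumptions.

```python
from collections import OrderedDict

class TwoLevelReadCache:
    """Toy model: level 1 stands in for controller memory, level 2 for PAM."""

    def __init__(self, l1_size: int, l2_size: int, backing: dict):
        self.l1 = OrderedDict()   # oldest entry first (LRU order)
        self.l2 = OrderedDict()
        self.l1_size = l1_size
        self.l2_size = l2_size
        self.backing = backing    # stands in for the disks
        self.disk_reads = 0

    def read(self, block):
        if block in self.l1:
            self.l1.move_to_end(block)        # refresh recency
            return self.l1[block]
        if block in self.l2:
            data = self.l2.pop(block)         # hit in the flash layer (assumed promotion)
        else:
            data = self.backing[block]        # un-cached: go to disk
            self.disk_reads += 1
        self._insert_l1(block, data)
        return data

    def _insert_l1(self, block, data):
        self.l1[block] = data
        if len(self.l1) > self.l1_size:
            old_block, old_data = self.l1.popitem(last=False)
            self.l2[old_block] = old_data     # expired block trickles down
            if len(self.l2) > self.l2_size:
                self.l2.popitem(last=False)   # finally flushed; on disk only

disks = {i: f"data{i}" for i in range(5)}
cache = TwoLevelReadCache(l1_size=1, l2_size=2, backing=disks)
for block in (0, 1, 2, 0):
    cache.read(block)
print("disk reads:", cache.disk_reads)
```

Note that nothing is written to the second level at read time; it is populated only on eviction, which is why "charging" it adds no extra disk IO in this model. Dropping `l2_size` to zero approximates losing the PAM card: every first-level miss goes back to disk.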

One thing that the NetApp resources on Twitter kept repeating was the benefits of PAM as an extension of cache. I assume the main benefit of taking this approach to Flash is that it is accessed via memory routines (fewer layers/translation to execute through) rather than disk routines. Whether or not this is a significant performance benefit, I really can't say.

From the initial implementation, PAM will provide almost immediate benefit as data expires from cache. FAST will require a few iterations of the swap window before things have optimized. Taking a longer view, FAST will work best with consistent workloads... after a few weeks, the migrations should hit an equilibrium and response times should be stable and fast. Component failures should not adversely affect response time. PAM, as an extension of cache, will continuously optimize whatever blocks are getting hit hardest at any given moment. While this is more flexible day-to-day than data migrations, consistent performance could be an issue. Additionally, the IO hit of losing PAM would increase response times, but the impact of this is somewhat reduced by the fact that ramping up PAM is much faster than the data migrations that FAST requires.

Both solutions make various trade offs between performance, stability, and consistency. Understanding these trade offs will benefit the customer as they choose which tools to leverage in their environment. Following are a few considerations...

Considerations

Many customers have performance testing environments. Since both of these approaches optimize as tests run, what relationship can be drawn between the 3rd-5th week of integration/performance testing and the production implementation? Theoretically, if the data is identical between performance testing and production, NetApp dedup could leverage performance testing optimizations during the production implementation.
Can customers run both FASTv1 and FASTv2 simultaneously since they have mutually exclusive volume requirements? Are both separately licensed? There are implementations where LUN level optimization may be preferred over sub-LUN.
NetApp can simulate the benefits of PAM II in an environment. Can similar benefits be simulated for FAST prior to implementation?
I assume that FAST will promote as much into SSD as possible to improve response time. How can customers determine when to grow that tier of storage?
If a customer is using PAM II to meet a performance requirement, what can they do to reduce the impact of a PAM II failure?
For both FASTv2 and PAM II, how can a customer migrate to a new array while keeping the current performance intact? With FASTv1, it is a simple LUN migration since it is determinable what tier a LUN is on. With FASTv2 and PAM II, it gets tricky (please note, I'm not talking about migrating the data, which is a standard procedure, I'm talking about making sure you hit performance requirements post-migration).

** Updates - 03/02/2010 AM **
To be clear, this is an "apples to oranges" comparison. Each solution takes a completely different approach to implementing flash into an array, and the two solutions behave very differently.

Additionally, since I was focusing on Flash in particular, I neglected to compare cache capacity directly. DMX/VMAX has a much higher cache capacity than the NetApp arrays. Per Storagezilla on Twitter: "Symmetrix already has acres of globally accessible DRAM for read/write and doesn't need anything like PAM."

Finally, cost does play into comparing the two approaches, but I don't have access to any sort of real-world pricing.
