Monday, March 1, 2010

FAST & PAM Contrasted

** Updates Appended Below **

Over the past few days, I've been thinking about storage tiering... both in general and, specifically, FAST and PAM II.  Each takes a very different approach to providing better storage performance without highly specific tuning.  This is an outsider's view based on publicly available information (so, in cases where I'm wrong, both vendors have shown that they aren't shy about correcting misconceptions).  First, some general definitions:

FASTv1:  Released in December, it is the first version of EMC's Fully Automated Storage Tiering.  It works at the LUN level, and requires identical LUN sizes across tiers.  It is not compatible with Thin Provisioned/Virtual Provisioned LUNs.

FASTv2:  Scheduled for release in the second half of 2010, this is the next version of FAST, and it works at the sub-LUN level.  It requires Thin Provisioning/Virtual Provisioning to manage the allocations, since it relies on that functionality to provide the granularity of migration.

PAM II:  NetApp's Flash solution, the Performance Acceleration Module.  It acts as an additional layer of cache memory and does not have specific layout requirements.

Architecture Differences
FAST runs as a task on the processors of the DMX/VMAX.  During user-specified windows, it determines which volumes/sub-volumes would benefit from relocation and migrates them to a different tier.  The migration consumes some IO capacity, so off-hours/weekends are ideal window candidates.  The relocation is semi-permanent: all reads/writes are serviced by the new location post-migration (semi-permanent because FAST can move an allocation back to the prior tier if the performance data indicates it is a good swap).  Since RAID protection is maintained throughout the migration, the loss of a component does not substantially affect response time.
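To make the window-based relocation concrete, here is a minimal Python sketch of the general idea: count IOs per relocatable unit, then, during the window, swap the busiest units on the slow tier with the quietest units on the fast tier, capped by a swap budget that stands in for the limited IO available for migration.  This is not EMC's algorithm.  The two-tier model, the `Unit` class, the raw IO-count metric, and the `max_swaps` budget are all hypothetical, and units are assumed to be equal in size (echoing FASTv1's identical-LUN-size requirement).

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """A relocatable unit: a whole LUN (FASTv1-style) or a sub-LUN extent (FASTv2-style)."""
    name: str
    tier: str       # "fast" or "slow" in this hypothetical two-tier model
    io_count: int   # IOs observed since the last window (hypothetical metric)


def plan_swaps(units, max_swaps):
    """Pair the busiest slow-tier units with the quietest fast-tier units.

    A pair is swapped only if the slow-tier unit is busier than the one it
    would displace. Because both lists are sorted, the first pair that fails
    this test means no later pair can pass, so planning stops there.
    """
    slow = sorted((u for u in units if u.tier == "slow"),
                  key=lambda u: u.io_count, reverse=True)
    fast = sorted((u for u in units if u.tier == "fast"),
                  key=lambda u: u.io_count)
    swaps = []
    for hot, cold in zip(slow, fast):
        if len(swaps) >= max_swaps or hot.io_count <= cold.io_count:
            break
        swaps.append((hot.name, cold.name))
    return swaps


def run_window(units, max_swaps):
    """Apply planned swaps during the migration window, then reset counters."""
    by_name = {u.name: u for u in units}
    swaps = plan_swaps(units, max_swaps)
    for promote, demote in swaps:
        by_name[promote].tier = "fast"
        by_name[demote].tier = "slow"
    for u in units:
        u.io_count = 0   # the next window starts from fresh activity
    return swaps


units = [Unit("A", "fast", 5), Unit("B", "fast", 900),
         Unit("C", "slow", 700), Unit("D", "slow", 3)]
print(run_window(units, max_swaps=10))   # [('C', 'A')]
```

Because swaps are re-evaluated every window, a promoted unit that cools off can be demoted again later, which mirrors the "semi-permanent" behavior described above.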
PAM II is treated as an extremely large read cache.  Basically, as a given read block expires from memory, it trickles down to PAM until it is finally flushed and resides solely on disk.  This gives PAM II a few nice features.  First, there is no performance hit while 'charging' the PAM: since it is fed by blocks expiring from 'tier-1' cache, there is no additional performance impact beyond the initial read of the un-cached block.  Second, it does not cache any writes.  This is a giant assumption on my part, but I assume that because of the 'virtualization' WAFL provides, PAM does not need to track changed blocks on disk.  Since everything is pointer-based (think of NetApp snaps), when a block is changed it lands in a new disk location; future reads hit that new location and then work their way through the cache levels like 'new' reads, while the old location/data gets expired fairly quickly.  The downside to this approach is that the loss of PAM requires all reads to be serviced by disk plus tier-1 memory until the module is replaced and recharged.
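The "trickle down" behavior described above is essentially a victim cache.  The following minimal Python sketch models it under the same assumptions I made about WAFL: DRAM evictions drop into flash, writes bypass flash entirely and go to a fresh physical location, and both caches are keyed by physical address, so a stale copy left in flash is simply never read again.  The `LRU` and `VictimCachedVolume` classes are hypothetical; this is a toy model, not NetApp's implementation.

```python
from collections import OrderedDict


class LRU:
    """Fixed-capacity LRU map; put() returns the evicted (key, value) or None."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)   # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            return self.data.popitem(last=False)   # evict least recently used
        return None

    def pop(self, key):
        return self.data.pop(key, None)


class VictimCachedVolume:
    """DRAM cache backed by a flash victim cache over write-anywhere storage.

    Hypothetical model: blocks evicted from DRAM drop into flash; writes never
    touch flash; each write goes to a new physical address and the logical
    pointer is updated, so an old copy in flash is never referenced again.
    """

    def __init__(self, dram_blocks, flash_blocks):
        self.dram = LRU(dram_blocks)
        self.flash = LRU(flash_blocks)
        self.disk = {}      # physical address -> data
        self.pointer = {}   # logical block -> physical address
        self.next_free = 0
        self.stats = {"dram": 0, "flash": 0, "disk": 0}

    def write(self, logical, data):
        phys = self.next_free           # write anywhere: always a fresh location
        self.next_free += 1
        self.disk[phys] = data
        self.pointer[logical] = phys    # old address is no longer referenced by the pointer

    def read(self, logical):
        phys = self.pointer[logical]
        data = self.dram.get(phys)
        if data is not None:
            self.stats["dram"] += 1
            return data
        data = self.flash.pop(phys)     # exclusive: a flash hit moves back up to DRAM
        if data is not None:
            self.stats["flash"] += 1
        else:
            self.stats["disk"] += 1
            data = self.disk[phys]
        evicted = self.dram.put(phys, data)
        if evicted is not None:
            self.flash.put(*evicted)    # "charging": DRAM evictions trickle into flash
        return data
```

Constructing the volume with `flash_blocks=0` roughly approximates a failed module: every DRAM miss is served from disk until the flash tier is replaced and recharged.  Making flash exclusive (a hit moves the block back up to DRAM) is another assumption of this sketch; an inclusive design would illustrate the idea equally well.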

One thing that the NetApp resources on Twitter kept repeating was the benefit of PAM as an extension of cache.  I assume the main benefit of taking this approach to Flash is that it is accessed via memory routines (fewer layers of translation to execute through) rather than disk routines.  Whether or not this is a significant performance benefit, I really can't say.

From the initial implementation, PAM will provide almost immediate benefit as data expires from cache, whereas FAST will require a few iterations of the swap window before things are optimized.  Taking a longer view, FAST will work best with consistent workloads... after a few weeks, the migrations should reach an equilibrium, and response times should be stable and fast.  Component failures should not adversely affect response time.  PAM, as an extension of cache, will continuously optimize for whichever blocks are being hit hardest at any given moment.  While this is more flexible day-to-day than data migrations, consistent performance could be an issue.  Additionally, losing PAM would increase response times, although the impact is somewhat reduced by the fact that PAM recharges much faster than FAST can complete its data migrations.
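The cost of losing PAM can be estimated with simple weighted-average arithmetic: average read latency is each level's hit ratio times its service time, summed over DRAM, flash, and disk.  The sketch below uses made-up latencies and hit ratios purely for illustration; they are not measurements of either vendor's hardware, and `effective_read_latency_ms` is a hypothetical helper.

```python
def effective_read_latency_ms(dram_hit, flash_hit, t_dram, t_flash, t_disk):
    """Average read latency for a DRAM -> flash -> disk hierarchy.

    dram_hit and flash_hit are the fractions of reads served by each level;
    whatever is left over falls through to disk.
    """
    if not (0 <= dram_hit and 0 <= flash_hit and dram_hit + flash_hit <= 1):
        raise ValueError("hit ratios must be non-negative and sum to <= 1")
    disk_share = 1 - dram_hit - flash_hit
    return dram_hit * t_dram + flash_hit * t_flash + disk_share * t_disk


# Hypothetical per-read service times in milliseconds, for illustration only.
T_DRAM, T_FLASH, T_DISK = 0.1, 1.0, 8.0

healthy = effective_read_latency_ms(0.5, 0.3, T_DRAM, T_FLASH, T_DISK)
# With the flash module gone, the 30% of reads it absorbed fall through to disk.
pam_lost = effective_read_latency_ms(0.5, 0.0, T_DRAM, T_FLASH, T_DISK)

print(f"healthy: {healthy:.2f} ms, flash lost: {pam_lost:.2f} ms")
```

With these hypothetical numbers, losing the flash tier pushes the average from about 1.95 ms to about 4.05 ms, because the reads it was absorbing now pay the full disk latency.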

Both solutions make various trade-offs between performance, stability, and consistency.  Understanding these trade-offs will help customers as they choose which tools to leverage in their environments.  Following are a few considerations...

  1. Many customers have performance testing environments.  Since both of these approaches optimize as tests run, what relationship can be drawn between the 3rd-5th weeks of integration/performance testing and the production implementation?  Theoretically, if the data is identical between performance testing and production, NetApp dedup could carry performance-testing optimizations over to the production implementation.
  2. Can customers run FASTv1 and FASTv2 simultaneously, since they have mutually exclusive volume requirements?  Are they licensed separately?  There are implementations where LUN-level optimization may be preferred over sub-LUN.
  3. NetApp can simulate the benefits of PAM II in an environment.  Can similar benefits be simulated for FAST prior to implementation?
  4. I assume that FAST will promote as much data into SSD as possible to improve response time.  How can customers determine when to grow that tier of storage?
  5. If a customer is using PAM II to meet a performance requirement, what can they do to reduce the impact of a PAM II failure?
  6. For both FASTv2 and PAM II, how can a customer migrate to a new array while keeping current performance intact?  With FASTv1, it is a simple LUN migration, since it is clear which tier a LUN is on.  With FASTv2 and PAM II, it gets tricky (please note, I'm not talking about migrating the data, which is a standard procedure; I'm talking about making sure you hit performance requirements post-migration).

** Updates - 03/02/2010 AM **
To be clear, this is an "apples to oranges" comparison.  Each solution takes a completely different approach to implementing flash into an array, and the two solutions behave very differently.

Additionally, since I was focusing on Flash in particular, I neglected to compare cache capacity directly.  DMX/VMAX has a much higher cache capacity than the NetApp arrays.  Per Storagezilla on Twitter:  "Symmetrix already has acres of globally accessible DRAM for read/write and doesn't need anything like PAM."

Finally, cost does play into comparing the two approaches, but I don't have access to any sort of real-world pricing.


    Storagezilla said...

    You've used Symmetrix in the example but regardless of FAST Symm already has a massive DRAM cache globally available regardless of what you do with FAST.

    So yes FAST V1 will promote hot LUNs to a higher storage tier and FAST V2 will promote hot blocks to a higher storage tier but what's hotter still will sit in those acres of DRAM in the processing complex.

    Alex McDonald said...

    To clarify Zilla's "acres" comment; it's 128GB per V Max engine, for a total of 1TB. Each head in a FAS6080 can support 4TB per head, for a total of 8TB. Square miles by comparison.

    Alex McDonald said...

    Correction: 4TB not 8.

    Storagezilla said...

    That's 4TB of NAND Flash not 4TB of DRAM Alex.

    The Symm has up to 512GB of mirrored Global DRAM. Not non mirrored non global FLASH occupying slots in the head and costing anywhere between $80K to $100K per card.

    Again apples to oranges.