Tag Archives: snapshot

Defining RTO and RPO for your data…

Posted on by

Do you have a clearly defined Recovery Point Objective (RPO) for your data?  What about a clearly defined Recovery Time Objective (RTO)?

One challenge I run in to quite often is that, while most customers assume they need to protect their data in some way, they don’t have clear cut RPO and RTO requirements, nor do they have a realistic budget for deploying backup and/or other data protection solutions.  This makes it difficult to choose the appropriate solution for their specific environment.  Answering the above questions will help you choose a solution that is the most cost effective and technically appropriate for your business.

But how do you answer these questions?

First, let’s discuss WHY you back up… The purpose of a backup is to guarantee your ability to restore data at some point in the future, in response to some event.  The event could be inadvertent deletion, virus infection, corruption, physical device failure, fire, or natural disaster.  So the key to any data protection solution is the ability to restore data if/when you decide it is necessary.  This ability to restore is dependent on a variety of factors, ranging from the reliability of the backup process, to the method used to store the backups, to the media and location of the backup data itself.  What I find interesting is that many customers do not focus on the ability to restore data; they merely focus on the daily pains of just getting it backed up.  Restore is key! If you never intend to restore data, why would you back it up in the first place?

What is the Risk?

USA Today published an article in 2006 titled “Lost Digital Data Cost Businesses Billions” referencing a whole host of surveys and reports showing the frequency and cost to businesses who experience data loss.

Two key statistics in the article stand out.

  • 69% of business people lost data due to accidental deletion, disk or system failure, viruses, fire or another disaster
  • 40% Lost data two or more times in the last year

Flipped around, you have at least a 40% chance of having to restore some or all of your data each year.  Unfortunately, you won’t know ahead of time what portion of data will be lost.  What if you can’t successfully restore that data?

This is why one of my coworkers refuses to talk to customers about “Backup Solutions”, instead calling them “Restore Solutions”, a term I have adopted as well.  The key to evaluating Restore Solutions is to match your RPO and RTO requirements against the solution’s backup speed/frequency and restore speed respectively.

Recovery Point Objective (RPO)

Since RPO represents the amount of data that will be lost in the event a restore is required, the RPO can be improved by running a backup job more often.  The primary limiting factor is the amount of time a backup job takes to complete.  If the job takes 4 hours then you could, at best, achieve a 4-hour RPO if you ran backup jobs all day.  If you can double the throughput of a backup, then you could get the RPO down to 2 hours.  In reality, CPU, Network, and Disk performance of the production system can (and usually is) affected by backup jobs so it may not be desirable to run backups 24 hours a day.  Some solutions can protect data continuously without running a scheduled job at all.

Recovery Time Objective (RTO)

Since RTO represents the amount of time it takes to restore the application once a recovery operation begins, reducing the RTO can be achieved by shortening the time to begin the restore process, and speeding up the restore process itself.  Starting the restore process earlier requires the backup data to be located closer to the production location.  A tape located in the tape library, versus in a vault, versus at a remote location, for example affects this time.  Disk is technically closer than tape since there is no requirement to mount the tape and fast forward it to find the data.  The speed of the process itself is dependent on the backup/restore technology, network bandwidth, type of media the backup was stored on, and other factors.  Improving the performance of a restore job can be done one of two ways – increase network bandwidth or decrease the amount of data that must be moved across the network for the restore.

This simple graph shows the relationship of RTO and RPO to the cost of the solution as well as the potential loss.The values here are all relative since every environment has a unique profit situation and the myriad backup/restore options on the market cover every possible budget.

Improving RTO and/or RPO generally increases the cost of a solution.  This is why you need to define the minimum RPO and RTO requirements for your data up front, and why you need to know the value of your data before you can do that.  So how do you determine the value?

Start by answering two questions…

How much is the data itself worth?  

If your business buys or creates copyrighted content and sells that content, then the content itself has value.  Understanding the value of that data to your business will help you define how much you are willing to spent to ensure that data is protected in the event of corruption, deletion, fire, etc.  This can also help determine what Recovery Point Objective you need for this data, ie: how much of the data can you lose in the event of a failure.

If the total value of your content is $1000 and you generate $1 of new content per day, it might be worth spending 10% of the total value ($100) to protect the data and achieve an RPO of 24 hours.  Remember, this 10% investment is essentially an insurance policy against the 40% chance of data loss mentioned above which could involve some or all of your $1000 worth of content.  Also keep in mind that you will lose up to 24 hours of the most recent data ($1 value) since your RPO is 24 hours.  You could implement a more advanced solution that shortens the RPO to 1 hour or even zero, but if the additional cost of that solution is more than the value of the data it protects, it might not be worth doing.  Legal, Financial, and/or Government regulations can add a cost to data loss through fines which should also be considered.  If the loss of 24 hours of data opens you up to $100 in fines, then it makes sense to spend money to prevent that situation.

How much value does the data create per minute/hour/day?

Whether or not your data itself has value on it’s own, the ability to access it may have value.  For example, If your business sells products or services through a website and a database must be online for sales transactions to occur, then an outage of that database causes loss of revenue.  Understanding this will help you define a Recovery Time Objective, ie: for how long is it acceptable for this database to be down in the event of a failure, and how much should you spend trying to shorten the RTO before you get diminishing returns.

If you have a website that supports company net profits of $1000 a day,  it’s pretty easy to put together an ROI for a backup solution that can restore the website back into operation quickly.  In this example, every hour you save in the restore process prevents $42 of net loss.  Compare the cost of improving restore times against the net loss per hour of outage.  There is a crossover point which will provide a good return on your investment.

Your vendor will be happy when you give them specific RPO and RTO requirements.

Nothing derails a backup/recovery solution discussion quicker than a lack of requirements.  Your vendor of choice will most likely be happy to help you define them but it will help immensely if you have some idea of your own before discussions start.  There are many different data protection solutions on the market and each has it’s own unique characteristics that can provide a range of RPO and RTO’s as well as fit different budgets. Several vendors, including EMC, have multiple solutions of their own — one size definitely does not fit all.  Once you understand the value of your data, you can work with your vendor(s) to come up with a solution that meets your desired RPO and RTO while also keeping a close eye on the financial value of the solution.

Using Cloud as a SAN Tier?

Posted on by

I came across this press release today from a company that I wasn’t familiar with and immediately wanted more information.  Cirtas Systems has announced support for Atmos-based clouds, including AT&T Synaptic Storage.  Whenever I see these types of announcements, I read on in hopes of seeing real fiber channel block storage leveraging cloud-based architectures in some way.  So far I’ve been a bit disappointed since the closest I’ve seen has been NAS based systems, at best including iSCSI.

Cirtas BlueJet Cloud Storage Controller is pretty interesting in its own right though.  It’s essentially an iSCSI storage array with a cache and a small amount of SSD and SAS drives for local storage.  Any data beyond the internal 5TB of usable capacity is stored in “the cloud” which can be an onsite Private Cloud (Atmos or Atmos/VE) and/or a Public Cloud hosted by Amazon S3, Iron Mountain, AT&T Synaptic, or any Atmos-based cloud service provider.

Cirtas BlueJet

The neat thing with BlueJet is that it leverages a ton of the functionality that many storage vendors have been developing recently such as data de-duplication, compression, some kind of block level tiering, and space efficient snapshots to improve performance and reduce the costs of cloud storage.  It seems that pretty much all of the local storage (SAS, SSD, and RAM) is used as a tiered cache for hot data.  This gives users and applications the sense of local SAN performance even while hosting the majority of data offsite.

While I haven’t seen or used a BlueJet device and can’t make any observations about performance or functionality, I believe this sort of block->cloud approach has pretty significant customer value.  It reduces physical datacenter costs for power and cooling, and it presents some rather interesting disaster recovery opportunities.

Similar to how Compellent’s signature feature, tiered block storage, has been added to more traditional storage arrays, I think modified implementations of Cirtas’ technology will inevitably come from the larger players, such as EMC, as a feature in standard storage arrays.  If you consider that EMC Unified Storage and EMC Symmetrix VMAX both have large caches and block- level tiering today, it’s not too much of a stretch to integrate Atmos directly into those storage systems as another tier.  EMC already does this for NAS with the EMC File Management Appliance.

Conceptual Diagram

I can imagine leveraging FASTCache and FASTVP to tier locally for the data that must be onsite for performance and/or compliance reasons and pushing cold/stale blocks off to the cloud.  Additionally, adding cloud as a tier to traditional storage arrays allows customers to leverage their existing investment in Storage, FC/FCoE networks, reporting and performance trending tools, extensive replication options available, and the existing support for VMWare APIs like SRM and VAAI.

With this model, replication of data for disaster recovery/avoidance only needs to be done for the onsite data since the cloud data could be accessed from anywhere.  At a DR site, a second storage system connects to the same cloud and can access the cold/stale data in the event of a disaster.

Another option would be adding this functionality to virtualization platforms like EMC VPLEX for active/active multi-site access to SAN data, while only needing to store the majority of the company’s data once in the cloud for lower cost.  Customers would no longer have to buy double the required capacity to implement a disaster recovery strategy.

I’m eagerly awating the implementation of cloud into traditional block storage and I can see how some vendors will be able to do this easily, while others may not have the architecture to integrate as easily.  It will be interesting to see how this plays out.

EMC Unified: Guaranteed Efficiency with Better Application Availability

Posted on by

(Warning: This is a long post…)

You have a critical application that you can’t afford to lose:

So you want to replicate your critical applications because they are, well, critical.   And you are looking at the top midrange storage vendors for a solution.  NetApp touts awesome efficiency, awesome snapshots, etc while EMC is throwing considerable weight behind it’s 20% Efficiency Guarantee.  While EMC guarantees to be 20% more efficient in any unified storage solution, there is perhaps no better scenario than a replication solution to prove it.

I’m going to describe a real-world scenario using Microsoft Exchange as the example application and show why the EMC Unified platform requires less storage, and less WAN bandwidth for replication, while maintaining the same or better application availability vs. a NetApp FAS solution.  The example will use a single Microsoft Exchange 2007 SP2 server with ten 100GB mail databases connected via FibreChannel to the storage array.  A second storage array exists in a remote site connected via IP to the primary site and a standby Exchange server is attached to that array.

Basic Assumptions:

  • 100GB per database, 1 database per storage group, 1 storage group per LUN, 130GB LUNs
  • 50GB Log LUNs, ensure enough space for extra log creation during maintenance, etc
  • 10% change rate per day average
  • Nightly backup truncates logs as required
  • Best Practices followed by all vendors
  • 1500 users (Heavy Users 0.4IOPS), 10% of users leverage Blackberry (BES Server = 4X IOPS per user)
  • Approximate IOPS requirement for Exchange: 780IOPS for this server.
  • EMC Solution: 2 x EMC Unified Storage systems with SnapView/SANCopy and Replication Manager
  • NetApp Solution: 2 x NetApp FAS Storage systems with SnapMirror and SnapManager for Exchange
  • RPO: 4 hours (remote site replication update frequency)

Based on those assumptions we have 10 x 130GB DB LUNs and 10 x 50GB Log LUNs and we need approximately 780 host IOPS 50/50 read/write from the backend storage array.

Disk IOPS calculation: (50/50 read/write)

  • RAID10, 780 host IOPS translates to 1170 disk IOPS (r+w*2)
  • RAID5, 780 host IOPS translates to 1950 disk IOPS (r+w*4)
  • RAIDDP is essentially RAID6 so we have about 2730 disk IOPS (r + w*6)

Note: NetApp can create sequential stripes on writes to improve write performance for RAIDDP but that advantage drops significantly as the volumes fill up and free space becomes fragmented which is extremely likely to happen after a few months or less of activity.

Assuming 15K FiberChannel drives can make 180 IOPS with reasonable latencies for a database we’d need:

  • RAID10, Database 6.5 disks (round up to 8), using 450GB 15K drives =  1.7TB usable (1 x 4+4)
  • RAID5, 10.8 disks for RAID5 (round up to 12), using 300GB 15K drives = 2.8TB usable (2 x 5+1)
  • RAID6/DP, 15.1 disks for RAID6 (round up to 16), using 300GB 15K drives = 3.9TB usable (1 x 14+2)

Log writes are highly cachable so we generally need fewer disks; for both the RAID10 and RAID5 EMC options we’ll use a single RAID1 1+1 raid group with 2 x 600GB 15K drives.  Since we can’t do RAID1 or RAID10 on NetApp we’ll have to use at least 3 disks (1 data and 2 parity) for the 500GB worth of Log LUNs but we’ll actually need more than that.

Picking a RAID Configuration and Sizing for snapshots:

For EMC, the RAID10 solution uses fewer disks and provides the most appropriate amount of disk space for LUNs vs. the RAID5 solution.  With the NetApp solution there really isn’t another alternative so we’ll stick with the 16 disk RAID-DP config.  We have loads of free space but we need some of that for snapshots which we’ll see next.  We also need to allocate more space to the Log disks for those snapshots.

Since we expect about 10% change per day in the databases (about 10GB per database) we’ll double that to be safe and plan for 20GB of changes per day per LUN (DB and Log).

NetApp arrays store snapshot data in the same volume (FlexVol) as the application data/LUN so you need to size the FlexVol’s and Aggregates appropriately.  We need 200GB for the DB LUNs and 200GB for the Log LUNs to cover our daily change rate but we’re doubling that to 400GB each to cover our 2 day contingency.  In the case of the DB LUNs the aggregate has more than enough space for the 400GB of snapshot data we are planning for but we need to add 400GB to the Log aggregate as well so we need 4 x 600GB 15K drives to cover the Exchange logs and snapshot data.

EMC Unified arrays store snapshot data for all LUNs in centralized location called the Reserve LUN Pool or RLP.  The RLP actually consists of a number of LUNs that can be used and released as needed by snapshot operations occurring across the entire array.  The RLP LUNs can be created on any number of disks, using any RAID type to handle various IO loads and sizing an RLP is based on the total change rate of all simultaneously active snapshots across the array.  Since we need 400GB of space in the Reserve LUN Pool for one day of changes, we’ll again be safe by doubling that to 800GB which we’ll provide with 6 dedicated 300GB 15K drives in RAID10.

At this point we have 20 disks on the NetApp array and 16 disks on the EMC array.  We have loads of free space in the primary database aggregate on the NetApp but we can’t use that free space because it’s sized for the IOPS workload we expect from the Exchange server.

In order to replicate this data to an alternate site, we’ll configure the appropriate tools.

EMC:

  1. Install Replication Manager on a server and deploy an agent to each Exchange server
  2. Configure SANCopy connectivity between the two arrays over the IP ports built-in to each array
  3. In Replication Manager, Configure a job that quiesces Exchange, then uses SANCopy to incrementally update a copy of the database and log LUNs on the remote array and schedule for every 4 hours using RM’s built in scheduler.

NetApp:

  1. Install SnapManager for Exchange on each Exchange server
  2. Configure SnapMirror connectivity betweeen the two arrays over the IP ports built-in to each array
  3. In SnapManager, Configure a backup job that quiesces Exchange and takes a Snapshot of the Exchange DBs and Logs, then starts a SnapMirror session to replicate the updated FlexVol (including the snapshot) to the remote array.  Configure a schedule in Windows Task Manager to run the backup job every 4 hours.

Both the EMC and NetApp solutions run on schedule, create remote copies, and everything runs fine, until...

Tuesday night during the weekly maintenance window, the Exchange admins decide to migrate half of the users from DB1, to DB2 and DB3 and half of the users from DB4, to DB5 and DB6.  About 80GB of data is moved (25GB to each of the target DBs.)  The transactions logs on DB1 and DB4 jump to almost 50GB, 35GB each on DB2, DB3, DB5, and DB6.

On the NetApp array, the 50GB log LUNs already have about 10GB of snapshot data stored and as the migration is happening, new snapshot data is tracked on all 6 of the affected DB and Log LUNs.  The 25GB of new data plus the 10GB of existing data exceeds the 20GB of free space in the FlexVol that each LUN is contained in and guess what…  Exchange chokes because it can no longer write to the LUNs.

There are workarounds: First, you enable automatic volume expansion for the FlexVols and automatic Snapshot deletion as a secondary fallback.  In the above scenario, the 6 affected FlexVols autoextend to approximately 100GB each equaling 300GB of snapshot data for those 6 LUNs and another 40GB for the remaining 4 LUNs.  There is only 60GB free in the aggregate for any additional snapshot data across all 10 LUNs.  Now, SnapMirror struggles to update the 1200GB of new data (application data + snapshot data) across the WAN link and as it falls behind more data changes on the production LUNs increasing the amount of snapshot data and the aggregate runs out of space.  By default, SnapMirror snapshots are not included in the “automatically delete snapshots” option so Exchange goes down.  You can set a flag to allow SnapMirror owned snapshots to be automatically deleted but then you have to resync the databases from scratch.  In order to prevent this problem from ever occurring, you need to size the aggregate to handle >100% change meaning more disks.

Consider how the EMC array handles this same scenario using SANCopy.  The same changes occur to the databases and approximately 600GB of data is changed across 12 LUNs (6 DB and 6 Log).  When the Replication Manager job starts, SANCopy takes a new snapshot of all of the blocks that just changed for purposes of the current update and begins to copy those changed blocks across the WAN.

EMC Advantages:

  • SANCopy/Inc is not tracking the changes that occur AS they occur, only while an update is in process so the Reserve LUN Pool is actually empty before the update job starts.  If you want additional snapshots on top of the ones used for replication, that will increase the amount of data in the Reserve LUN Pool for tracking changes, but snapshots are created on both arrays independently and the snapshot data is NOT replicated.  This nuance allows you to have different snapshot schedules in production vs. disaster recovery for example.
  • Because SANCopy/Inc only replicates the blocks that have changed on the production LUNs, NOT the snapshot data, it copies only half of the data across the WAN vs SnapMirror which reduces the time out of sync.  This translates to lower WAN utilization AND a better RPO.
  • IF an update was occurring when the maintenance took place, the amount of data put in the Reserve LUN pool would be approximately 600GB (leaving 200GB free for more changed data).  More efficient use of the Snapshot pool and more flexibility.
  • IF the Reserve LUN Pool ran out of space, the SANCopy update would fail but the production LUNs ARE NEVER AFFECTED.  Higher availability for the critical application that you devoted time and money to replicate.
  • Less spinning disk on the EMC array vs. the NetApp.

EMC has several replication products available that each act differently.  I used SANCopy because, combined with Replication Manager, it provides similar functionality to NetApp SnapMirror and SnapManager.  MirrorView/Async has the same advantages as SANCopy/Incremental in these scenarios and can replicate Exchange, SQL, and other applications without any host involvement.

Higher Application availability, lower WAN Utilization , Better RPO, Fewer Spinning Disks, without even leveraging advanced features for even better efficiency and performance.

While EMC users benefit from Replication Manager, NetApp users NEED SnapManager

Posted on by

This is a follow up to my recent post NetApp and EMC: Replication Management Tools Comparison, in which I discussed the differences between EMC Replication Manager and NetApp SnapManager.

————

As a former customer of both NetApp and EMC, and now as an employee of EMC, I noticed a big difference between NetApp and EMC as far as marketing their replication management tools. As a customer, EMC talked about Replication Manager several times and we purchased it and deployed it. NetApp made SnapManager a very central part of their sales campaign, sometimes skipping any discussion of the underlying storage in favor of showing off SnapManager functionality. This is an extremely effective sales technique and NetApp sales teams are so good at this that many people don’t even realize that other vendors have similar, and in my opinion EMC has better, functionality.  One of the reasons for this difference in marketing strategy is that NetApp users NEED SnapManager, while EMC users do not always need Replication Manager.

The reason why is both simple and complex…

EMC storage arrays (Clariion, Symmetrix, RecoverPoint, Invista) all have one technology in common that NetApp Filers do not–Consistency Groups. A consistency group allows the storage system to take a snapshot of multiple LUNs simultaneously, so simultaneous in fact that all of the snapshots are at the exact same point in time down to the individual write. This means that, without taking any applications offline and without any orchestration software, EMC storage arrays can create crash-consistent copies of nearly any kind of data at any time.

The EMC Whitepaper “EMC CLARiiON Database Storage Solutions: Oracle 10g/11g with CLARiiON Storage Replication Consistency” downloadable from EMC’s website has the following explanation of consistency groups in general…

“…Consistent replication operates on multiple LUNs as a set such that if the replication action fails for one member in the set, replication for all other members of the set are canceled or stopped.  Thus the contents of all replicated LUNs in the set are guaranteed to be identical point-in-time replicas of their source and dependent-write consistency is maintained…”

“…With consistent replication, the database does not have to be shut down or put into “hot backup mode.”  Replicates created with SnapView or MV/S (or MV/A, Timefinder, SRDF, Recoverpoint, etc) consistency operations, without first quiescing or halting the application, are restartable point-in-time replicas of the production data and guaranteed to be dependent-write consistent.”

Consistency is important for any application that is writing to multiple LUNs at the same time such as SQL database and log volumes. SnapManager and Replication Manager actually prepare the application by quiescing the database during the snapshot creation process. This process creates “application-consistent” copies which are technically better for recovery compared with “storage-consistent” copies (also known as crash-consistent copies).

So, while I will acknowledge that quiescing the database during a snapshot/replication operation provides the best possible recovery image, that may not be realistic in some scenarios.  The first issue is that the actual operation of quiescing, snapping, checking the image, then pushing an update to a remote storage array takes some time.  Depending on the size of the dataset, this operation can take from several minutes to several hours to complete.  If you have a Recovery Point Objective (RPO) of 5 minutes or less, using either of these tools is pretty much a non-starter.

Another issue is one of application support.  EMC Replication Manager and NetApp SnapManager have very wide support for the most popular operating systems, filesystems, databases, and applications, they certainly don’t support every application.  A very simple example is a Novell Netware file server with a NSS pool/volume spanning multiple LUNs.  Neither NetApp nor EMC have support for Novell Netware in their replication management tools.  While you can certainly replicate all of the LUNs with NetApp SnapManager, SnapManager has no consistency technology built-in to keep the LUNs write-order consistent.  The secondary copy will appear completely corrupt to the Netware server if a recovery is attempted.  Through the use of consistency groups with MirrorView/Async, the replication of each LUN is tracked as a group and all of the LUNs are write-order consistent with each other, keeping the filesystem itself consistent.  You would need to have either array-level consistency technology, or support for Netware in the replication management tool in order to replication such a server..  Unfortunately, NetApp provides neither.

You may have complex applications that consist of Oracle and SQL databases, NTFS filesystems, and application servers running as VMs.  Using array-based consistency groups, you can replicate all of these components simultaneously and keep them all consistent with each other.  This way you won’t have transactions that normally affect two databases end up missing in one of the two after a recovery operation, even if those databases are different technologies (Oracle and MySQL, or PostgreSQL for example).

EMC Storage arrays provide consistency group technology for Snapshots and Replication in Clariion and Symmetrix storage arrays.  In fact, with Symmetrix, consistency groups can span multiple arrays without any host software.  By comparison, NetApp Filers do not have consistency group technology in the array.  Snapshots are taken (for local replicas and for SnapMirror) at the FlexVolume level.  Two FlexVolumes cannot be snapped consistently with each other without SnapManager.

There are a couple workarounds for NetApp users–you can snapshot an aggregate, but that is not recommended by NetApp for most customers, or you can put multiple LUNs in the same FlexVol, but that still limits you to 16TB of data including snapshot reserve space, and both options violate best practices for database designs of keeping data and logs in separate spindles for recovery.  Even with these workarounds, you cannot gain LUN consistency across the two controllers in an HA Filer pair, something the CLARiiON does natively, and can help for load balancing IO across the storage processors.

In general, I recommend that EMC customers use EMC Replication Manager and NetApp customers use SnapManager for the applications that are supported, and for most scenarios.  But when RPO’s are short, or the environment falls outside the support matrix for those tools, consistency groups become the best or only option.

Incidentally, with EMC RecoverPoint, you get the best of both worlds.  CDP or near-CDP replication of data using consistency groups for zero or near-zero RPOs plus application-consistent bookmarks made anytime the database is quiesced.  Recovery is done from the up-to-the-second version of the data, but if that data is not good for any reason, you can roll back to another point in time, including a point-in-time when the database was quiesced (a bookmark).

So, while EMC has, in Replication Manager, an equivalent offering to NetApp’s SnapManager, EMC customers are not required to use it, and in some cases they can achieve better results using array-based consistency technologies.

NetApp and EMC: Replication Management Tools Comparison

Posted on by

I started this post before I started working for EMC and got sidetracked with other topics.  Recent discussions I’ve had with people have got me thinking more about orchestration of data protection, replication, and disaster recovery, so it was time to finish this one up…

———————————–

Prior to me coming to work for EMC, I was working on a project to leverage NetApp and EMC storage simultaneously for redundancy.  I had a chance to put various tools from EMC and NetApp into production and have been able to make some observations with respect to some of the differences.  This is a follow up my previous NetApp and EMC posts…

NetApp and EMC: Real world comparisons
NetApp and EMC: Startup and First Impressions
NetApp and EMC: ESX and Exchange 2007 CCR
NetApp and EMC: Exchange 2007 Replication

Specifically this post is a comparison between NetApp SnapManager 5.x and EMC Replication Manager 5.x.  First, here’s a quick background on both tools based on my personal experience using them.

Description

EMC Replication Manager (RM) is a single application that runs on a dedicated “Replication Manager Server.”  RM agents are deployed to the hosts of applications that will be replicated.  RM supports local and remote replication features in EMC’s Clariion storage array, Celerra Unified NAS, Symmetrix DMX/V-Max, Invista, and RecoverPoint products.  With a single interface, Replication Manager lets you schedule, modify, and monitor snapshot, clone, and replication jobs for Exchange, SQL, Oracle, Sharepoint, VMWare, Hyper-V, etc.  RM supports Role-Based authentication so application owners can have access to jobs for their own applications for monitoring and managing replication.  RM can manage jobs across all of the supported applications, array types, and replication technologies simultaneously.  RM is licensed by storage array type and host count. No specific license is required to support the various applications.

NetApp SnapManager is actually a series of applications designed for each application that NetApp supports.  There are versions of SnapManager for Exchange, SQL, Sharepoint, SAP, Oracle, VMWare, and Hyper-V.  The SnapManager application is installed on each host of an application that will be replicated, and jobs are scheduled on each specific host using Windows Task Scheduler.  Each version of SnapManager is licensed by application and host count.  I believe you can also license SnapManager per-array instead of per-host which could make financial sense if you have lots of hosts.

Commonality

EMC Replication Manager and NetApp SnapManager products tackle the same customer problem–provide guaranteed recoverability of an application, in the primary or a secondary datacenter, using array-based replication technologies.  Both products leverage array-based snapshot and replication technology while layering application-consistency intelligence to perform their duties.  In general, they automate local and remote protection of data.  Both applications have extensive CLI support for those that want that.

Differences

  • Deployment
    • EMC RM – Replication Manager is a client-server application installed on a control server.  Agents are deployed to the protected servers.
    • NetApp SM – SnapManager is several applications that are installed directly on the servers that host applications being protected.
  • Job Management
    • EMC RM – All job creation, management, and monitoring is done from the central GUI. Replication Manager has a Java based GUI.
    • NetApp SM – Job creation and monitoring is done via the SnapManager GUI on the server being protected.  SnapManager utilizes an MMC based GUI.
  • Job Scheduling
    • EMC RM – Replication Manager has a central scheduler built-in to the product that runs on the RM Server.  Jobs are initiated and controlled by the RM Server, the agent on the protected server performs necessary tasks as required.
    • NetApp SM – SnapManager jobs are scheduled with Windows Task Scheduler after creation.  The SnapManager GUI creates the initial scheduled task when a job is created through the wizard.  Modifications are made by editing the scheduled task in Windows task scheduler.

So while the tools essentially perform the same function, you can see that there are clear architectural differences, and that’s where the rubber meets the road.  Being a centrally managed client-server application, EMC Replication Manager has advantages for many customers.

Simple Comparison Example: Exchange 2007 CCR cluster
(snapshot and replicate one of the two copies of Exchange data)

With NetApp SnapManager, the application is installed on both cluster nodes, then an administrator must log on to the console on the node that hosts the copy you want to replicate, and create two jobs which run on the same schedule.  Job A is configured to run when the node is the active node, Job B is configured to run when the node is passive.  Due to some of the differences in the settings, I was unable to configure a single job that ran successfully regardless of whether the node was active or passive.  If you want to modify the settings, you either have to edit the command line options in the Scheduled Task, or create a new job from scratch and delete the old one.

With EMC Replication Manager, you deploy the agent to both cluster nodes, then in the RM GUI, create a job against the cluster virtual name, not the individual node.  You define which server you want the job to run on in the cluster, and whether the job should run when the node is passive, active, or both.  All logs, monitoring, and scheduling is done in the same RM GUI, even if you have 50 Exchange clusters, or SQL and Oracle for that matter.  Modifying the job is done by right-clicking on the job and editing the properties.  Modifying the schedule is done in the same way.

So as the number of servers and clusters increases in your environment, having a central UI to manage and monitor all jobs across the enterprise really helps.  But here’s where having a centrally managed application really shines…

But what if it gets complicated?

Let’s say you have a multi-tier application like IBM FileNet, EMC Documentum, or OpenText and you need to replicate multiple servers, multiple databases, and multiple file systems that are all related to that single application.  Not only does EMC Replication Manager support SQL and Filesystems in the same GUI, you can tie the jobs together and make them dependent on each other for both failure reporting and scheduling.  For example, you can snapshot a database and a filesystem, then replicate both of them without worrying about how long the first job takes to complete.  Jobs can start other jobs on completely independent systems as necessary.

Without this job dependence functionality, you’d generally have to create scheduled tasks on each server and have dependent jobs start with a delay that is long enough to allow the first job to complete while as short as possible to prevent the two parts of the application from getting too far out of sync.  Some times the first job takes longer than usual causing subsequent jobs to complete incorrectly.  This is where Replication Manager shows it’s muscle with it’s ability to orchestrate complex data protection strategies, across the entire enterprise, with your choice of protection technologies (CDP, Snapshot, Clone, Bulk Copy, Async, Sync) from a single central user interface.

You don’t need a Backup solution!

Posted on by

Well, not exactly.  What you really need is a restore solution!

I was discussing this with a colleague recently as we compared difficulties multiple customers are having with backups in general.  My colleague was relating a discussion he had with his customer where he told them, “stop thinking about how to design a backup solution, and start thinking about how to design a restore solution!”

Most of our customers are in the same boat, they work really hard to make sure that their data is backed up within some window of time, and offsite as soon as possible in order to ensure protection in the event of a catastrophic failure.  What I’ve noticed in my previous positions in IT and more so now as a technical consultant with EMC is that (in my experience) most people don’t really think about how that data is going to get restored when it is needed.  There are a few reasons for this:

  • Backing up data is the prerequisite for a restore; IT professionals need to get backups done, regardless of whether they need to restore the data.  It’s difficult to plan for theoretical needs and restore is still viewed, incorrectly, as theoretical.
  • Backup throughput and duration is easily measured on a daily basis, restores occur much more rarely and are not normally reported on.
  • Traditional backup has been done largely the same way for a long time and most customers follow the same model of nightly backups (weekly full, daily incremental) to disk and/or tape, shipping tape offsite to Iron Mountain or similar.

I think storage vendors, EMC and NetApp particularly, are very good at pointing out the distinction between a backup solution and a restore solution, where backup vendors are not quite as good at this.  So what is the difference?

When designing a backup solution the following factors are commonly considered:

  • Size of Protected Data – How much data do I have to protect with backup (usually GB or TB)
  • Backup Window – How much time do I have each night to complete the backups (in hours)
  • Backup Throughput – How fast can I move the data from it’s normally location to the backup target
  • Applications – What special applications do I have to integrate with (Exchange, Oracle, VMWare)
  • Retention Policy – How long do I have to hang on to the backups for policy or legal purposes
  • Offsite storage – How do I get the data stored at some other location in case of fire or other disaster

If you look at it from a restore prospective, you might think about the following:

  • How long can I afford to be down after a failure?  Recovery Time Objective (RTO): This will determine the required restore speed.  If all backups are stored offsite, the time to recall a tape or copy data across the WAN affects this as well.
  • How much data can I afford to lose if I have to restore? Recovery Point Objective (RPO):  This will determine how often the backup must occur, and in many cases this is less than 24 hours.
  • Where do I need to restore the application? This will help in determining where to send the data offsite.

Answer these questions first and you may find that a traditional backup solution is not going to fulfill your requirements.  You may need to look at other technologies, like Snapshots, Clones, replication, CDP, etc.  If a backup takes 8 hours, the restore of that data will most likely take at least 8 hours, if not closer to 16 hours.  If you are talking about a highly transactional database, hosting customer facing web sites, and processing millions of dollars per hour, 8 hours of downtime for a restore is going to cost you tens or hundreds of millions of dollars in lost revenue.

Two of my customers have database instances hosted on EMC storage, for example, which are in the 20TB size range.  They’ve each architected a backup solution that can get that 20TB database backed up within their backup window.  The problem is, once that backup completes, they still have to offsite the backup, and replicate it to their DR site across a relatively small WAN link.  They both use compressed database dumps for backup because, from the DBA’s perspective, dumps are the easiest type of backup to restore from, and the compression helps get 20TB of data pushed across 1gbe Ethernet connections to the backup server.  One of the customers is actually backing up all of their data to DataDomain deduplication appliances already; the other is planning to deploy DataDomain.  The problem in both cases is that, if you pre-compress the backup data, you break deduplication, and you get no benefit from the DataDomain appliance vs. traditional disk.  Turning off compression in the dump can’t be done because the backup would take longer than the backup window allows.  The answer here is to step back, think about the problem you are trying to solve–restoring data as quickly as possible in the event of failure–and design for that problem.

How might these customers leverage what they already have, while designing a restore solution to meet their needs?

Since they are already using EMC storage, the first step would be to start taking snapshots and/or clones of the database.  These snapshots can be used for multiple purposes…

  • In the event of database corruption, or other host/filesystem/application level problem, the production volume can be reverted to a snapshot in a matter of minutes regardless of the size of the database (better RTO).  Snapshots can be taken many times a day to reduce the amount of data loss incurred in the event of a restore (better RPO).
  • A snapshot copy of the database can be mounted to a backup server directly and backed up directly to tape or backup disk.  This eliminates the requirement to perform database dumps at all as well as any network bottleneck between the database server and backup server.  Since there is no dump process, and no requirement to pre-compress the data, de-duplication (via DataDomain) can be employed most efficiently.  Using a small 10gbps private network between the backup media servers and DataDomain appliances, in conjunction with DD-BOOST, throughput can be 2.5X faster than with CIFS, NFS, or VTL to the same DataDomain appliance.  And with de-duplication being leveraged, retention can be very long since each day’s backup only adds a small amount of new data to the DataDomain.
  • Now that we’ve improved local restore RTO/RPO, eliminated the backup window entirely for the database server, and decreased the amount of disk required for backup retention, we can replicate the backup to another DataDomain appliance at the DR site.  Since we are taking full advantage of de-duplication now, the replication bandwidth required is greatly reduced and we can offsite the backup data in a much shorter period of time.
  • Next, we give the DBAs back the ability to restore databases easily, and at will, by leveraging EMC Replication Manager.  RM manages the snapshot schedules, mounting of snaps to the backup server, and initiation of backup jobs from the snapshot, all in a single GUI that storage admins and DBAs can access simultaneously.

So we leveraged the backup application they already own, the DataDomain appliances they already own, storage arrays they already own, built a small high-bandwidth backup network, and layered some additional functionality, to drastically improve their ability to restore critical data.  The very next time they have a data integrity problem that requires a restore, these customer’s will save literally millions of dollars due to their ability to restore in minutes vs. hours.

If RPO’s of a few hours are not acceptable, then a Continuous Data Protection (CDP) solution could be added to this environment.  EMC RecoverPoint CDP can journal all database activity to be used to restore to any point in time, bringing data loss (RPO) to zero or near-zero, something no amount of snapshots can provide, and keeping restore time (RTO) within minutes (like snapshots).  Further, the journaled copy of the database can be stored on a different storage array providing complete protection for the entire hardware/software stack.  RecoverPoint CDP can be combined with Continuous Remote Replication (CRR) to replicate the journaled data to the DR site and provide near-zero RPO and extremely low RTO in a DR/BC scenario.  Backups could be transitioned to the DR site leveraging the RecoverPoint CRR copies to reduce or eliminate the need to replicate backup data.  EMC Replication Manager manages RecoverPoint jobs in the same easy to use GUI as snapshot and clone jobs.

There are a whole host of options available from EMC (and other storage vendors) to protect AND restore data in ways that traditional backup applications cannot match.  This does not mean that backup software is not also needed, as it usually ends up being a combined solution.

The key to architecting a restore solution is to start thinking about what would happen if you had to restore data, how that impacts the business and the bottom line, and then architect a solution that addresses the business’ need to run uninterrupted, rather than a solution that is focused on getting backups done in some arbitrary daily/nightly window.

EMC CLARiiON and Celerra Updates – Defining Unified Storage

Posted on by

This past week, during EMC World 2010 in Boston, EMC made several announcements of updates to the Celerra and CLARiiON midrange platforms.  Some of the most impressive were new capabilities coming to CLARiiON FLARE in just a couple short months.  Major updates to Celerra DART will coincide with the FLARE updates and if you are already running CLARiiON CX4 hardware, or are evaluating CX4 (or Celerra), you will want to check these new features out.  They will be available to existing CX4(120,240,480,960)/NS(120,480,960) systems as part of a software update.

Here’s a list of key changes in FLARE 30:

  • Unified management for midrange storage platforms including CLARiiON and Celerra today, plus RecoverPoint, Replication Manager and more in the future.  This is a true single pane of glass for monitoring AND managing SAN, NAS, and data protection and it’s built in to the platform.  “EMC Unisphere” replaces Navisphere Manager and Celerra Manager and supports multiple storage systems simultaneously in a single window. (Video Demo)
  • Extremely large cache (ie: FASTCache) – Up to 2TB of additional read/write cache in CLARiiON using SSDs (Video Demo)
  • Block level Fully Automated Storage Tiering (ie: sub-LUN FAST) – Fully automated assignment of data across multiple disk types
  • Block Level Compression – Compress LUNs in the CLARiiON to reduce disk space requirements
  • VAAI Support – Integrate with vSphere ESX for improved performance

These features are in addition to existing features like:

  • Seamless and non-disruptive mobility of LUNs within a storage array – (via Virtual LUNs)
  • Non-Disruptive Data Migration – (via PowerPath Migration Enabler)
  • VMWare Aware Storage Management – (Navisphere, Unisphere, and vSphere Plugins giving complete visibility  and self-service provisioning for VMWare admins (Video Demo) AND Storage Admins
  • CIFS and NFS Compression – Compress production data on Celerra to reduce disk space requirements including VMs
  • Dynamic SAN path load balancing – (via PowerPath)
  • At-Rest-Encryption – (via PowerPath w/RSA)
  • SSD, FC, and SATA drives in the same system – Balance performance and capacity as needed for your application
  • Local and Remote replication with array level consistency – (SnapView, MirrorView, etc)
  • Hot-swap, Hot-Add, Hot-Upgrade IO Modules – Upgrade connectivity for FC, FCoE, and iSCSI with no downtime
  • Scale to 1.8PB of storage in a single system
  • Simultaneously provide FC, iSCSI, MPFS, NFS, and CIFS access

All together, this is an impressive list of features for a single platform. In fact, while many of EMC’s competitors have similar features, none of them have all of them in the same platform, or leverage them all simultaneously to gain efficiency.  When CLARiiON CX4 and Celerra NS are integrated and managed as a single Unified storage system with EMC Unisphere there is tremendous value as I’ll point out below…

Improve Performance easily…

  • Install a couple SSD drives into a CLARiiON and enable FASTCache to increase the array’s read/write cache from the industry competive 4GB-32GB up to 2TB of array based non-volatile Read AND Write cache available to ALL applications including NAS data hosted by the array.
  • Install PowerPath on Windows, Linux, Solaris, AND VMWare ESX hosts to automatically balance IO across all available paths to storage.  PowerPath detects latency and queuing occuring on each path and adjusts automatically, improving performance at the storage array AND for your hosts.  This is a huge benefit in VMWare environments especially.
  • When VMWare releases the updated version of vSphere ESX that supports VAAI, ESX will be able to leverage VAAI support in the CLARiiON to reduce the amount of IO required to do many tasks, improving performance across the environment again.
  • Upgrade from 1gbe iSCSI to 10gbe iSCSI, or from 4gbe FiberChannel to 8gbe FiberChannel, without a screwdriver or downtime.
  • Provide NAS shared file access with block-level performance for any application using EMC’s MPFS protocol.

Improve Efficiency and cost easily…

  • Create a single pool of storage containing some SSD, some FC, and some SATA drives, that automatically monitors and moves portions of data to the appropriate disk type to both improve performance AND decrease cost simultaneously.
  • Non-disruptively compress volumes and/or files with a single click to save 50% of your disk space in many cases.
  • Convert traditional LUNs to more efficient Thin-LUNs non-disruptively using PowerPath Migration Enabler, saving more disk space.

Increase and Manage Capacity easily…

  • Add additional storage non-disruptively with SSD, FC, and SATA drives in any mix up to 1.8PB of raw storage in a single CLARiiON CX4.
  • Using FASTCache, iSCSI, FC, and FCoE connectivity simultaneously does not reduce total capacity of the system.
  • Expanding LUNs, RAID Groups, and Storage Pools is non-disruptive.
  • Migrating LUNs between RAID groups and/or Storage Pools is non-disruptive using built-in CLARiiON LUN Migration, as is migrating data to a different storage array (using PowerPath Migration Enabler)!
  • Balancing workload between storage processors is non-disruptive and at individual LUN granularity.

Protect your data easily…

  • Snapshot, Clone, and Replicate any of the data to anywhere with built in array tools that can maintain complete data consistency across a single, or multiple applications without installing software.
  • Maintain application consistency for Exchange, SQL, Oracle, SAP, and much more, even within VMWare VMs, while replicating to anywhere with a single pane-of-glass.
  • Encrypt sensitive data seamlessly using PowerPath Encryption w/RSA.

Maintain Flexibility…

  • While you can do all of these things quickly and simply, you still have the flexibility to create traditional RAID sets using RAID 0, 1, 5, 6, and 10 where you need highly predicable performance, or tune read and write cache at the array and LUN level for specific workloads.  Do you want read/write snapshots? How about full copy clones on completely separate disks for workload isolation and failure protection? What about the ability to rollback data to different points in time using snapshots without deleting any other snapshots?  EMC Storage arrays have been able to do this for a long time and that hasn’t changed.

There are few manufacturers aside from EMC that can provide all of these capabilities, let alone provide them within a single platform.  That’s the definition of simple, efficient, Unified Storage in my opinion.