Tag Archives: data

Defining RTO and RPO for your data…

Do you have a clearly defined Recovery Point Objective (RPO) for your data?  What about a clearly defined Recovery Time Objective (RTO)?

One challenge I run into quite often is that, while most customers assume they need to protect their data in some way, they don’t have clear-cut RPO and RTO requirements, nor do they have a realistic budget for deploying backup and/or other data protection solutions.  This makes it difficult to choose the appropriate solution for their specific environment.  Answering the above questions will help you choose a solution that is the most cost-effective and technically appropriate for your business.

But how do you answer these questions?

First, let’s discuss WHY you back up… The purpose of a backup is to guarantee your ability to restore data at some point in the future, in response to some event.  The event could be inadvertent deletion, virus infection, corruption, physical device failure, fire, or natural disaster.  So the key to any data protection solution is the ability to restore data if/when you decide it is necessary.  This ability to restore is dependent on a variety of factors, ranging from the reliability of the backup process, to the method used to store the backups, to the media and location of the backup data itself.  What I find interesting is that many customers do not focus on the ability to restore data; they merely focus on the daily pains of just getting it backed up.  Restore is key! If you never intend to restore data, why would you back it up in the first place?

What is the Risk?

USA Today published an article in 2006 titled “Lost Digital Data Cost Businesses Billions” referencing a whole host of surveys and reports showing the frequency and cost to businesses who experience data loss.

Two key statistics in the article stand out.

  • 69% of business people lost data due to accidental deletion, disk or system failure, viruses, fire or another disaster
  • 40% lost data two or more times in the last year

Flipped around, you have at least a 40% chance of having to restore some or all of your data each year.  Unfortunately, you won’t know ahead of time what portion of data will be lost.  What if you can’t successfully restore that data?

This is why one of my coworkers refuses to talk to customers about “Backup Solutions”, instead calling them “Restore Solutions”, a term I have adopted as well.  The key to evaluating Restore Solutions is to match your RPO and RTO requirements against the solution’s backup speed/frequency and restore speed respectively.

Recovery Point Objective (RPO)

Since RPO represents the amount of data that will be lost in the event a restore is required, the RPO can be improved by running a backup job more often.  The primary limiting factor is the amount of time a backup job takes to complete.  If the job takes 4 hours then you could, at best, achieve a 4-hour RPO if you ran backup jobs all day.  If you can double the throughput of a backup, then you could get the RPO down to 2 hours.  In reality, CPU, network, and disk performance of the production system can be (and usually is) affected by backup jobs, so it may not be desirable to run backups 24 hours a day.  Some solutions can protect data continuously without running a scheduled job at all.
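
As a rough sketch of the arithmetic above, assuming the job duration is simply the data size divided by the backup throughput (the 400 GB size and the throughput figures are hypothetical, chosen only to reproduce the 4-hour and 2-hour cases):

```python
def best_case_rpo_hours(data_gb: float, throughput_gb_per_hour: float) -> float:
    """Return the shortest RPO achievable with back-to-back scheduled backups.

    If jobs run continuously, the newest restorable copy can still be as old
    as one full job, so the job duration is the floor for the RPO.
    """
    if data_gb < 0 or throughput_gb_per_hour <= 0:
        raise ValueError("data_gb must be >= 0 and throughput must be > 0")
    return data_gb / throughput_gb_per_hour


# Hypothetical sizing: 400 GB backed up at 100 GB/hour is a 4-hour job.
print(best_case_rpo_hours(400, 100))  # 4.0
# Doubling the throughput halves the job time, and with it the best-case RPO.
print(best_case_rpo_hours(400, 200))  # 2.0
```

This ignores the production impact mentioned above, so a real schedule will usually land well short of the best case.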

Recovery Time Objective (RTO)

Since RTO represents the amount of time it takes to restore the application once a recovery operation begins, reducing the RTO can be achieved by shortening the time to begin the restore process, and speeding up the restore process itself.  Starting the restore process earlier requires the backup data to be located closer to the production location.  For example, whether a tape sits in the tape library, in a vault, or at a remote location affects this time.  Disk is technically closer than tape since there is no requirement to mount the tape and fast forward it to find the data.  The speed of the process itself is dependent on the backup/restore technology, network bandwidth, type of media the backup was stored on, and other factors.  Improving the performance of a restore job can be done in one of two ways: increase network bandwidth or decrease the amount of data that must be moved across the network for the restore.
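
The two levers described here (time to begin the restore, and the speed of the restore itself) can be sketched as a simple estimate. This is only a back-of-the-envelope model; the function name and all of the numbers are hypothetical:

```python
def estimated_restore_hours(hours_to_start: float, data_gb: float,
                            restore_gb_per_hour: float) -> float:
    """Estimate restore time as (time to begin) + (data moved / restore rate)."""
    if hours_to_start < 0 or data_gb < 0 or restore_gb_per_hour <= 0:
        raise ValueError("times and sizes must be >= 0 and the rate must be > 0")
    return hours_to_start + data_gb / restore_gb_per_hour


# Hypothetical: tape recalled from a vault (4 hours), 500 GB at 100 GB/hour.
print(estimated_restore_hours(4, 500, 100))     # 9.0
# Same data and rate, but the backup is on local disk (15 minutes to start).
print(estimated_restore_hours(0.25, 500, 100))  # 5.25
# Same disk copy, but only the 100 GB that was actually lost is restored.
print(estimated_restore_hours(0.25, 100, 100))  # 1.25
```

The last two calls show the same levers as the text: moving the backup closer cuts the start time, and moving less data cuts the transfer time.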

This simple graph shows the relationship of RTO and RPO to the cost of the solution as well as the potential loss.  The values here are all relative since every environment has a unique profit situation and the myriad backup/restore options on the market cover every possible budget.

Improving RTO and/or RPO generally increases the cost of a solution.  This is why you need to define the minimum RPO and RTO requirements for your data up front, and why you need to know the value of your data before you can do that.  So how do you determine the value?

Start by answering two questions…

How much is the data itself worth?  

If your business buys or creates copyrighted content and sells that content, then the content itself has value.  Understanding the value of that data to your business will help you define how much you are willing to spend to ensure that data is protected in the event of corruption, deletion, fire, etc.  This can also help determine what Recovery Point Objective you need for this data, i.e., how much of the data you can lose in the event of a failure.

If the total value of your content is $1000 and you generate $1 of new content per day, it might be worth spending 10% of the total value ($100) to protect the data and achieve an RPO of 24 hours.  Remember, this 10% investment is essentially an insurance policy against the 40% chance of data loss mentioned above which could involve some or all of your $1000 worth of content.  Also keep in mind that you will lose up to 24 hours of the most recent data ($1 value) since your RPO is 24 hours.  You could implement a more advanced solution that shortens the RPO to 1 hour or even zero, but if the additional cost of that solution is more than the value of the data it protects, it might not be worth doing.  Legal, Financial, and/or Government regulations can add a cost to data loss through fines which should also be considered.  If the loss of 24 hours of data opens you up to $100 in fines, then it makes sense to spend money to prevent that situation.
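
To make the RPO side of this example concrete, here is a minimal sketch of the worst-case exposure per incident, assuming new content is created at a steady rate. The function name is hypothetical, and the model deliberately ignores the probability of loss and the value of older data:

```python
def max_exposure(new_value_per_day: float, rpo_hours: float,
                 fines: float = 0.0) -> float:
    """Worst-case cost of losing everything created since the last backup.

    Assumes value is created evenly through the day; `fines` is any penalty
    triggered by losing that window of data.
    """
    if new_value_per_day < 0 or rpo_hours < 0 or fines < 0:
        raise ValueError("inputs must be non-negative")
    return new_value_per_day * rpo_hours / 24 + fines


# $1 of new content per day with a 24-hour RPO: up to $1 lost per incident.
print(max_exposure(1.0, 24))               # 1.0
# Shortening the RPO to 1 hour reduces that to about 4 cents.
print(round(max_exposure(1.0, 1), 4))      # 0.0417
# If losing 24 hours of data also triggers $100 in fines, exposure jumps.
print(max_exposure(1.0, 24, fines=100.0))  # 101.0
```

Comparing the drop in exposure against the extra cost of the shorter-RPO solution is the trade-off described above.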

How much value does the data create per minute/hour/day?

Whether or not your data itself has value on its own, the ability to access it may have value.  For example, if your business sells products or services through a website and a database must be online for sales transactions to occur, then an outage of that database causes loss of revenue.  Understanding this will help you define a Recovery Time Objective, i.e., how long it is acceptable for this database to be down in the event of a failure, and how much you should spend trying to shorten the RTO before you get diminishing returns.

If you have a website that supports company net profits of $1000 a day, it’s pretty easy to put together an ROI for a backup solution that can restore the website back into operation quickly.  In this example, every hour you save in the restore process prevents about $42 of net loss ($1000 divided by 24 hours).  Compare the cost of improving restore times against the net loss per hour of outage.  There is a crossover point which will provide a good return on your investment.
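
The crossover point can be sketched with a little arithmetic. This assumes profit accrues evenly over 24 hours; the $500 solution cost and the 6 hours saved are hypothetical figures, not from any real product:

```python
def loss_per_hour(net_profit_per_day: float) -> float:
    """Net loss for each hour of outage, assuming profit accrues evenly."""
    return net_profit_per_day / 24


def outages_to_break_even(solution_cost: float, hours_saved_per_outage: float,
                          net_profit_per_day: float) -> float:
    """Number of outages after which the faster restore has paid for itself."""
    if hours_saved_per_outage <= 0 or net_profit_per_day <= 0:
        raise ValueError("hours saved and daily profit must be > 0")
    # Same as cost / (hours saved * loss per hour), rearranged so the
    # division happens once.
    return solution_cost * 24 / (hours_saved_per_outage * net_profit_per_day)


print(round(loss_per_hour(1000)))             # 42
# Hypothetical: a $500 improvement that saves 6 hours on every restore.
print(outages_to_break_even(500, 6, 1000))    # 2.0
```

In this made-up case the improvement pays for itself after two outages; a solution that needs dozens of outages to break even is past the point of diminishing returns.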

Your vendor will be happy when you give them specific RPO and RTO requirements.

Nothing derails a backup/recovery solution discussion quicker than a lack of requirements.  Your vendor of choice will most likely be happy to help you define them but it will help immensely if you have some idea of your own before discussions start.  There are many different data protection solutions on the market and each has its own unique characteristics that can provide a range of RPOs and RTOs as well as fit different budgets.  Several vendors, including EMC, have multiple solutions of their own; one size definitely does not fit all.  Once you understand the value of your data, you can work with your vendor(s) to come up with a solution that meets your desired RPO and RTO while also keeping a close eye on the financial value of the solution.

If You Are Using SSDs, You Should Be Encrypting

I saw the following article come across Twitter today.

http://www.zdnet.com/blog/storage/ssd-security-the-worst-of-all-worlds/1326

In it, Robin Harris describes the issues around data recovery and secure erasure specific to SSD disks.  In layman’s terms, since SSDs do all sorts of fancy things with writes to increase longevity and performance, disk erasure is nearly impossible using normal methods, and forensic or malicious data recovery is quite easy.  So if you have sensitive data being stored on SSDs, that data is at risk of being read by someone, some day, in the future.  It seems that pretty much the only way to mitigate this risk is to use encryption at some level outside the SSD disk itself.

Did you know that EMC Symmetrix VMAX offers data-at-rest encryption that is completely transparent to hosts and applications, and has no performance impact?  With Symmetrix D@RE, each individual disk is encrypted with a unique key, managed by a built-in RSA key manager, so disks are unreadable if removed from the array.   Since the data is encrypted as the VMAX is writing to the physical disk, attempting to read data off an individual disk without the key is pointless, even for SSD disks.

The beauty of this feature is that it’s set-it-and-forget-it.  No management is needed; it’s enabled during installation and that’s it.  All disks are encrypted, all the time.

  • Ready to decomm an old array and return it, trade it, or sell it?  Destroy the keys and the data is gone.  No need for an expensive Data Erasure professional services engagement.
  • Failed disk replaced by your vendor?  No need for special arrangements with your vendor to keep those disks onsite, or certify erasure of a disk every time one is replaced.  The key stays with the array and the data on that disk is unreadable.

If you have to comply with PCI and/or other compliance rules that require secure erasure of disks, you should consider putting that data on a VMAX with data-at-rest encryption.

Now, what if you have an existing EMC storage system and the same need to encrypt data?  You can encrypt at the volume level with PowerPath Encryption.  PowerPath encrypts the data at the host with a unique key managed by an RSA Key Manager.  And it works with the non-EMC arrays that PowerPath supports as well.

Under normal circumstances, PowerPath Encryption does have some level of performance impact on the host; however, HBA vendors such as Emulex are now offering HBAs with encryption offload that work with PowerPath.  If you combine PowerPath Encryption with Emulex Encryption HBAs, you get in-flight AND at-rest encryption with near-zero performance impact.

  • Do you replicate your sensitive data to a 3rd party remote datacenter for business continuity?  PowerPath Encryption prevents unauthorized access to the data because no host can read it without the proper key.

Auto-Tiering, Cloud Storage, and Risk-Free Thin Pools

Some customers are afraid of thin provisioning…

Practically every week I have discussions with customers about leveraging thin provisioning to reduce their storage costs, and just as often the customer pushes back, worried that some day, some number of applications, for some reason, will suddenly consume all of their allocated space in a short period of time and cause the storage pool to run out of space.  If this were to happen, every application using that storage pool would essentially experience an outage, and resolving the problem would require allocating more space to the pool, migrating data, and/or deleting data, each of which would take precious time and/or money.  In my opinion, this fear is the primary gating factor to customers using thin provisioning.  Exacerbating the issue, most large organizations have a complex procurement process that forces them to buy storage many months in advance of needing it, further reducing the usefulness of thin provisioning.  The IT organization for one of my customers can only purchase new storage AFTER a business unit requests it and senior management approves it; and they batch those requests before approving a storage purchase.  This means that the business unit may have to wait months to get the storage they requested.

This same customer recently purchased a Symmetrix VMAX with FASTVP and will be leveraging sub-LUN tiering with SSD, FC, and SATA disks totaling over 600TB of usable capacity in this single system.  As we began design work for the storage array the topic of thin provisioning came up and the same fear of running out of space in the pool was voiced.  To prevent this, the customer fully allocates all LUNs in the pool up front, which prevents oversubscription.  It’s an effective way to guarantee performance and availability, but it means that any free space not used by application owners is locked up by the application server and not available to other applications.  If you take their entire environment into account with approximately 3PB of usable storage and NO thin provisioning, there is probably close to $1 million in storage not being used and not available for applications.  Weighing that against the risk of an outage costing several million dollars per hour in lost revenue, the customer has decided the risk outweighs the potential savings.  I’ve seen this decision made time and again in various IT shops.

Sub-LUN Tiering pushes the costs for growth down

I previously blogged about using cloud storage for block storage in the form of Cirtas BlueJet and how it would not be too much of a stretch to add this functionality to sub-LUN tiering software like EMC’s FASTVP to leverage cloud storage as a block storage tier as shown in this diagram.

Let’s first assume the customer is already using FASTVP for automated sub-LUN tiering on a VMAX.  FASTVP is already identifying the hot and cold data and moving it to the appropriate tier, and as a result the lowest tier is likely seeing the least amount of IOPS per GB.  In a VMAX, each tier consists of one or more virtual provisioned pools, and as the amount of data stored on the array grows FASTVP will continually adjust, pushing the hot data up to higher tiers and cold data down to the lower tiers.  The cold data is more likely to be old data as well, so in many cases the data sort of ages down the tiers over time and it’s the old/least used portion of the data that grows.  Conceptually, the only tier you may have to expand is the lowest (i.e., SATA) when you need more space.  This reduces the long term cost of data growth, which is great.  But you still need to monitor the pools and expand them before they run out of space, or an outage may occur.  Most storage arrays have alerts and other methods to let you know that you will soon run out of space.
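
The monitoring problem boils down to comparing how long the pool will last against how long it takes to buy more storage. Here is a minimal sketch, assuming linear growth; the function names and figures are hypothetical and not tied to any array's actual alerting interface:

```python
def days_until_full(capacity_tb: float, used_tb: float,
                    growth_tb_per_day: float) -> float:
    """Project how many days remain before the pool runs out of space."""
    if growth_tb_per_day <= 0:
        return float("inf")  # a pool that is not growing never fills up
    return max(capacity_tb - used_tb, 0.0) / growth_tb_per_day


def needs_expansion(capacity_tb: float, used_tb: float,
                    growth_tb_per_day: float, lead_time_days: float) -> bool:
    """True when the pool would fill before new storage could arrive."""
    return days_until_full(capacity_tb, used_tb, growth_tb_per_day) <= lead_time_days


# Hypothetical pool: 100 TB, 80 TB used, growing 0.25 TB per day.
print(days_until_full(100, 80, 0.25))      # 80.0
# With a 90-day procurement cycle, the order needs to go in now.
print(needs_expansion(100, 80, 0.25, 90))  # True
```

The longer the procurement lead time, the earlier the alert has to fire, which is exactly why slow purchasing erodes the benefit of thin provisioning.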

Risk-Free Thin Provisioning

What if the storage array had the ability to automatically expand itself into a cloud storage provider, such as AT&T Synaptic, to prevent itself from running out of space?  Technically this is not much different from using the cloud as a tier all its own, but I’m thinking about temporary use of a cloud provider versus long term.  The cloud provider becomes a buffer for times when the procurement process takes too long, or unexpected growth of data in the pool occurs.  With an automated tiering solution, this becomes relatively easy to do with fairly low impact on production performance.  In fact, I’d argue that you MUST have automated tiering to do this or the array wouldn’t have any method for determining what data it should move to the cloud.  Without that level of intelligence, you’d likely be moving hot data to the cloud, which could heavily impact performance of the applications.

Once the customer is able to physically add storage to the pool to deal with the added data, the array would auto-adjust by bringing the data back from the cloud freeing up that space.  The cloud provider would only charge for the transfer of data in/out and the temporary use of space.  Storage reduction technologies like compression and de-duplication could be added to the cloud interface to improve performance for data stored in the cloud and reduce costs.  Zero detect and reclaim technologies could also be leveraged to keep LUNs thin over time as well as prevent the movement of zero’d blocks to the cloud.
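
To illustrate the idea (and only the idea), here is a minimal sketch of how an array might pick what to push to the cloud: coldest extents first, skipping zeroed blocks, until the pool is back under a high-water mark. Everything here is hypothetical, including the `Extent` type, the 90% threshold, and the use of recent IOPS as the temperature signal; it is not how FASTVP or any shipping product works:

```python
from dataclasses import dataclass


@dataclass
class Extent:
    extent_id: str
    size_gb: float
    iops: float           # recent activity; lower means colder
    zeroed: bool = False  # zero-filled blocks can be reclaimed, not moved


def select_extents_to_spill(extents, pool_capacity_gb, high_watermark=0.9):
    """Pick the coldest extents to move out until usage is under the watermark."""
    live = [e for e in extents if not e.zeroed]  # zeroed extents hold no data
    used_gb = sum(e.size_gb for e in live)
    target_gb = pool_capacity_gb * high_watermark
    chosen = []
    for extent in sorted(live, key=lambda e: e.iops):  # coldest first
        if used_gb <= target_gb:
            break
        chosen.append(extent)
        used_gb -= extent.size_gb
    return chosen


pool = [
    Extent("a", 40, iops=500),
    Extent("b", 30, iops=5),
    Extent("c", 25, iops=50),
    Extent("d", 10, iops=0, zeroed=True),
]
# 95 GB of live data in a 100 GB pool is over the 90 GB watermark,
# so the coldest extent is selected.
print([e.extent_id for e in select_extents_to_spill(pool, 100)])  # ['b']
```

Bringing data back once physical capacity is added would be the same selection in reverse, starting with the warmest extents sitting in the cloud.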

Using cloud storage as a buffer for thin provisioning in this way could reduce the risk of using thin provisioning, increase the utilization rate of the storage, and reduce the overall cost to store data.

What do you think?  Would you feel better about oversubscribing storage pools if you had a fully automated buffer, even if that buffer cost some amount of money in the event it was used?

Compression better than Dedup? NetApp Confirms!

The more I talk with customers, the more I find that the technical details of how something works are much less important than the business outcome it achieves.  When it comes to storage, most customers just want a device that will provide the capacity and performance they need, at a price they can afford, and it better not be too complicated.  Pretty much any vendor trying to sell something will attempt to make their solution fit your needs even if they really don’t have the right products.  It’s a fact of life, sell what you have.  Along these lines, there has been a lot of back and forth between vendors about dedup vs. compression technology and which one solves customer problems best.

After snapshots and thin provisioning, data reduction technology in storage arrays has become a big focus in storage efficiency lately; and there are two primary methods of data reduction — compression and deduplication.

While EMC has been marketing compression technology for block and file data in Celerra, Unified, and Clariion storage systems, NetApp has been marketing deduplication as the technology of choice for block and file storage savings.  But which one is the best choice?  The short answer is... it depends.  Some data types benefit most from deduplication while others get better savings with compression.

Currently, EMC supports file compression on all EMC Celerra NS20, 40, 80, 120, 480, 960, VG2, and VG8 systems running DART 5.6.47.x+ and block compression on all CX4 based arrays running FLARE30.x+.  In all cases, compression is enabled on a volume/LUN level with a simple check box and processing can be paused, resumed, and disabled completely, uncompressing the data if desired.  Data is compressed out-of-band and has no impact on writes, with minimal overhead on reads.  Any or all LUN(s) and/or Filesystem(s) can be compressed if desired even if they existed prior to upgrading the array to newer code levels.

With the release of OnTap 8.0.1, NetApp has added support for in-line compression within their FAS arrays.  It is enabled per-FlexVol and as far as I have been able to determine, cannot be disabled later (I’m sure Vaughn or another NetApp representative will correct me if I’m wrong here.)  Compression requires 64-bit aggregates which are new in OnTap 8, so FlexVols that existed prior to an upgrade to 8.x cannot be compressed without a data migration which could be disruptive.  Since compression is inline, it creates overhead in the FAS controller and could impact performance of reads and writes to the data.

Vaughn Stewart, of NetApp, expertly blogged today about the new compression feature, including some of the caveats involved, and to me the most interesting part of the post was the following graphic he included showing the space savings of compression vs. dedup for various data types.

Image Credit: Vaughn Stewart, NetApp

The first thing that struck me was how much better compression performed over deduplication for all but one data type (Virtualization will usually fare well because in a typical environment there are many VMs with the same operating system files).  In fact, according to NetApp, deduplication achieves very little savings, if any, for the majority of the data types here.
 
The light green bar indicates savings with both dedup AND compression enabled on the same dataset.  In 5 out of 9 cases, dedup adds ZERO savings over compression alone.  I can’t help but wonder why anyone would enable dedup on those data types if they already had compression, since both features use storage array CPU resources to find and compress or dedup data.  I am aware that in some cases, dedup can improve performance on NetApp systems due to dedup-aware cache, but I also believe that any performance gain is directly related to the amount of duplication in the data.  Using this chart, virtualization is really the only place where dedup seems particularly effective and hence the only place where real performance gains would likely present themselves.
 
The challenge for NetApp customers will be getting their data into a configuration that supports compression due to the 64-bit aggregate requirement, lack of an easy and non-disruptive LUN migration feature (DataMotion appears to only support iSCSI and NFS and requires several additional licenses), and no way to convert an aggregate from 32-bit to 64-bit.  Once compression has been enabled, if there is truly no way to disable it, any resulting performance impact will be very difficult to rectify.
 
On the other hand, any EMC customer with current maintenance can upgrade their NS or CX4 array to newer versions of DART or FLARE, and compression can be enabled on any existing data after the fact.  If performance becomes an issue for a particular dataset once compressed, the data can be uncompressed later.  Both operations are completely non-disruptive and run in the background.  While block compression only works on LUNs in a virtual pool, as opposed to a traditional RAID group, enabling compression on a normal LUN will automatically migrate the LUN into a virtual pool, perform zero-page reclaim, followed by compression, and the entire process is completely non-disruptive to the application.  Oh, and compressed data can still be tiered with FASTVP across SSD, FC, and SATA disk and/or benefit from up to 2TB of FASTCache.
 
I admit that there is a place for deduplication as well as compression in reducing the footprint of customer data.  However, based on what I’ve seen in my career as an IT professional, and with my customers in my current role at EMC, there are more use cases for compression than there are for deduplication when it comes to primary data, whether SAN or NAS.  Either way, if I were using a new technology for the first time on a particular data set, whether compression or deduplication, I would definitely want a backout plan in case the drawbacks outweigh the benefits.