Can you Compress AND Dedupe? It Depends


My recent post about Compression vs Dedupe, which was sparked by Vaughn’s blog post about NetApp’s new compression feature, got me thinking more about the use of de-duplication and compression at the same time.  Can they work together?  What is the resulting effect on storage space savings?  What if we throw encryption of data into the mix as well?

What is Data De-Duplication?

De-duplication in the data storage context is a technology that finds duplicate patterns of data in chunks or blocks (sized from 4-128KB or so, depending on the implementation), stores each unique pattern only once, and uses reference pointers to reconstruct the original data when needed.  The net effect is a reduction in the amount of physical disk space consumed.
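As a rough illustration (not any vendor's actual implementation), a fixed-block de-duplicator can be sketched in a few lines of Python: hash each chunk, store each unseen chunk once, and keep an ordered list of pointers so the original data can be reconstructed.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # 8KB fixed blocks here; real products use 4-128KB, often variable-sized

def dedupe(data: bytes):
    """Split data into fixed blocks, store each unique block once,
    and return (block_store, pointer_list) for later reconstruction."""
    store = {}      # fingerprint -> unique block
    pointers = []   # ordered fingerprints that rebuild the original stream
    for i in range(0, len(data), CHUNK_SIZE):
        block = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha256(block).hexdigest()
        store.setdefault(fp, block)   # only the first copy is physically kept
        pointers.append(fp)
    return store, pointers

def rehydrate(store, pointers) -> bytes:
    """Follow the pointers to reassemble the original data."""
    return b"".join(store[fp] for fp in pointers)

# 100 copies of the same 8KB pattern physically store only one block
data = (b"A" * CHUNK_SIZE) * 100
store, pointers = dedupe(data)
assert rehydrate(store, pointers) == data
print(len(data), "logical bytes ->", sum(len(b) for b in store.values()), "physical bytes")
```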

What is Data Compression?

Compression finds very small patterns in data (down to just a couple bytes or even bits at a time in some cases) and replaces those patterns with representative patterns that consume fewer bytes than the original pattern.  An extremely simple example would be replacing 1000 x “0”s with “0-1000”, reducing 1000 bytes to only 6.
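To make that concrete, here is a toy run-length encoder in the spirit of the example above; real compressors (LZ-family algorithms, gzip, and so on) are far more sophisticated, but the principle of replacing repeated patterns with shorter tokens is the same.

```python
from itertools import groupby

def rle_encode(text: str) -> str:
    """Replace each run of a repeated character with '<char><count>'."""
    return "".join(f"{ch}{sum(1 for _ in run)}" for ch, run in groupby(text))

print(rle_encode("0" * 1000))   # -> "01000": 1000 bytes shrink to 5
print(rle_encode("ABABAB"))     # -> "A1B1A1B1A1B1": data with no runs actually expands
```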

Compression works on a more micro level, while de-duplication takes a slightly more macro view of the data.

What is Data Encryption?

In a very loose sense, encryption behaves like a more complex cousin of compression.  Rather than comparing the original data to itself, encryption uses an external input (a key) to compute new patterns from the original patterns, making the data unintelligible if it is read without the matching key.

Encryption and Compression break De-Duplication

One of the interesting things about most compression and encryption algorithms is that the output bears little resemblance to the input.  Encrypted data comes out different every time because keys, IVs, and salts change, and compressed data replaces the original byte patterns with a dense encoded stream that has almost no internal repetition.  This means that even if the source data has repeating patterns, the compressed and/or encrypted version of that data most likely does not.  So if you are using a technology that looks for repeating patterns of bytes in fairly large chunks (4-128KB), such as data de-duplication, compression and encryption both reduce the space savings significantly, if not completely.
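Here is a rough sketch of the effect (standard library only; the "encryption" is just an XOR keystream seeded with a random nonce, standing in for a real cipher), counting how many 4KB blocks repeat before and after compressing or encrypting a highly redundant dataset.

```python
import os, random, zlib

BLOCK = 4 * 1024  # dedupe granularity for this sketch

def block_stats(label: str, data: bytes) -> None:
    """Count how many 4KB blocks are unique - a rough proxy for dedupe potential."""
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    print(f"{label}: {len(set(blocks))} unique of {len(blocks)} blocks")

def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """Placeholder for a real cipher: XOR with a keystream seeded by key + a random nonce,
    so identical plaintext never yields identical ciphertext across runs."""
    nonce = os.urandom(8)
    rng = random.Random(key + nonce)
    stream = bytes(rng.getrandbits(8) for _ in range(len(data)))
    return nonce + bytes(a ^ b for a, b in zip(data, stream))

# 100 identical 4KB blocks: an ideal input for de-duplication
data = os.urandom(BLOCK) * 100

block_stats("raw       ", data)                       # 1 unique of 100
block_stats("encrypted ", toy_encrypt(data, b"key"))  # every block unique
block_stats("compressed", zlib.compress(data))        # repeats gone (the stream also shrinks)
```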

I see this problem a lot in backup environments with DataDomain customers.  When a customer encrypts or compresses the backup data before it passes through the backup application and into the DataDomain appliance, the space savings drop, and the customer often becomes frustrated by what they perceive as a failing technology.  A really common example is using Oracle RMAN or SQL LightSpeed to compress database dumps prior to backing them up with a traditional backup product (such as NetWorker or NetBackup).

Sure, LightSpeed will compress the dump by 95%, but every subsequent dump of the same database looks like unique data to a de-duplication engine, and you will get little if any benefit from de-duplication.  If you leave the dump uncompressed, the de-duplication engine will find common patterns across multiple dumps and will usually achieve higher overall savings.  This becomes even more important when you are replicating backups over the WAN, since de-duplication also reduces replication traffic.

It all depends on the order

The truth is you CAN use de-duplication with compression, and even encryption.  The key is the order in which the data is processed by each algorithm.  Essentially, de-duplication must come first.  After de-duplication, the remaining unique 4-128KB blocks still contain enough redundancy to compress well, and the compressed blocks can then be encrypted.  Just as with de-duplication, compression produces lackluster results on encrypted data, so encrypt last.

Original Data -> De-Dupe -> Compress -> Encrypt -> Store
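As a toy illustration of that ordering (not how any particular product implements it), here is a minimal Python sketch that fingerprints blocks first, compresses only the unique blocks, and then applies a placeholder XOR "cipher" standing in for real encryption.

```python
import hashlib, os, zlib

BLOCK = 8 * 1024

def store_pipeline(data: bytes) -> dict:
    """Dedupe first, then compress each unique block, then encrypt the compressed block.
    The XOR 'cipher' below is only a placeholder for a real encryption layer."""
    key = os.urandom(32)
    stored = {}   # fingerprint -> encrypted, compressed unique block
    recipe = []   # ordered fingerprints that reconstruct the logical data
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        fp = hashlib.sha256(block).hexdigest()        # 1. de-dupe via fingerprints
        if fp not in stored:
            compressed = zlib.compress(block)          # 2. compress the unique blocks
            encrypted = bytes(b ^ key[j % len(key)]    # 3. encrypt last
                              for j, b in enumerate(compressed))
            stored[fp] = encrypted
        recipe.append(fp)
    return {"key": key, "stored": stored, "recipe": recipe}

backup = (b"same old database dump " * 400)[:BLOCK] * 50   # highly redundant input
result = store_pipeline(backup)
physical = sum(len(v) for v in result["stored"].values())
print(f"{len(backup)} logical bytes stored as {physical} compressed+encrypted bytes")
```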

There are good examples of this already:

EMC DataDomain – After incoming data has been de-duplicated, the DataDomain appliance compresses the blocks using a standard algorithm.  If you look at statistics on an average DDR appliance you’ll see 1.5-2X compression on top of the de-duplication savings.  DataDomain also offers an encryption option that encrypts the filesystem and does not affect the de-duplication or compression ratios achieved.

EMC Celerra NAS – Celerra De-Duplication combines single-instance store with file-level compression.  First, the Celerra hashes the files to find any duplicates, then removes the duplicates, replacing them with pointers.  Then the remaining files are compressed.  If the Celerra compressed the files first, the hash process would not be able to find duplicate files.
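A hypothetical sketch of that order of operations at the file level (hash whole files first, keep one copy per unique content, then compress what remains); the function names here are made up for illustration and are not Celerra's actual interfaces.

```python
import hashlib, zlib

def single_instance_then_compress(files: dict) -> tuple:
    """files: {filename: bytes}.  Keep one copy per unique file content (single-instance
    store), record pointers for the duplicates, then compress the surviving copies."""
    store = {}     # content hash -> compressed unique file
    catalog = {}   # filename -> content hash (the "pointer")
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        catalog[name] = digest
        if digest not in store:                     # duplicates beyond the first are dropped
            store[digest] = zlib.compress(content)  # compression runs after the dedupe pass
    return store, catalog

files = {
    "report_v1.doc":   b"quarterly numbers " * 500,
    "report_copy.doc": b"quarterly numbers " * 500,  # exact duplicate, stored once
    "notes.txt":       b"meeting notes " * 200,
}
store, catalog = single_instance_then_compress(files)
print(len(files), "files ->", len(store), "stored objects,",
      sum(len(v) for v in store.values()), "bytes on disk")
```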

So what’s up with NetApp’s numbers?

Back to my earlier post on Dedupe vs. Compression; what is the deal with NetApp’s dedupe+compression numbers being mostly the same as compression alone?  Well, I don’t know all of the details of the compression implementation in ONTAP 8.0.1, but based on what I’ve been able to find, compression could be happening before de-duplication.  That would easily explain the storage savings graph Vaughn provided in his blog.  Also, NetApp says that ONTAP compression is inline, and we already know that ONTAP de-duplication is a post-process technology.  This suggests that compression occurs during the initial writes, while de-duplication comes along after the fact looking for duplicate 4KB blocks.  Maybe the de-duplication engine in ONTAP uncompresses each 4KB block before checking for duplicates, but that would seem to add unnecessary CPU overhead on the filer.

Encryption before or after de-duplication/compression – What about compliance?

My recommendation here is to encrypt data last, i.e. after all data-reduction technologies have been applied.  However, the caveat is that for some customers, with some data, this is simply not possible.  If you must encrypt data end-to-end for compliance or business/national-security reasons, then by all means, do it.  The unfortunate byproduct of that requirement is that you may get very little space savings on that data from de-duplication, both in primary storage and in a backup environment.  It also affects WAN bandwidth when replicating, since encrypted data is difficult to compress and accelerate as well.

7 comments on “Can you Compress AND Dedupe? It Depends”


  2. Exar’s BitWackr technology, embedded in appliances like the BridgeSTOR Application Optimized Storage, performs hashing for deduplication, compression using eLZS, and encryption using AES-256 in a single DMA operation, using a 9725 co-processor on a PCIe card. This keeps latency low for in-line performance and very effectively reduces the size of data while keeping it secure through encryption.

  3. How do you use EMC NAS dedupe with Data Domain? Primary storage needs dedupe to save space and avoid buying disk, and backup also needs to reduce storage… if both dedupe, then one of them is doing redundant work and doesn’t add value… what is the best practice? thx

    • The answer is, it depends… I’m assuming you are using some sort of backup software (Commvault, Networker, Avamar, Netbackup, BackupExec, etc.) to manage the backups, right?

      The EMC NAS compresses files within the file system to save space. When a client reads that file, it is decompressed on the fly so the client doesn’t see it as compressed. In the case of a backup, how it looks to DataDomain depends on the method of backup.

      On the receiving end, the DataDomain will achieve higher deduplication ratios if the data it receives is not compressed.

      If your backup software is just mounting the file share and reading via NFS or CIFS, then the data will be sent to datadomain uncompressed since the NAS will decompress each file as it’s being backed up. This will work fine with DataDomain’s deduplication.

      If you are using NDMP backup, there are two possible scenarios.
      1.) TAR or DUMP backups via NDMP will cause the NAS to uncompress the files on the fly for the NDMP backup and therefore the DataDomain works normally.
      2.) EMC VBB backup via NDMP backs up the entire filesystem at a block level, then sends the file list to the backup catalog. This makes for a much faster backup than reading every file one by one. However, since it’s reading the filesystem itself, rather than the individual files, any compressed data is sent to the backup system still compressed. This would reduce the DataDomain’s deduplication effectiveness.

      VBB backups are an option on EMC NAS and are not commonly used from what I’ve seen. Assuming your backup application is using NAS mount or NDMP DUMP or TAR, you should be just fine.

      The most optimized way to back up NAS is to use Avamar (with or without Datadomain) since it can perform incremental-forever backups for NAS using NDMP. With Avamar, you don’t need the weekly fulls anymore so your backups complete super fast. DataDomain is also a very fine choice.

  4. “Compressed deduplication” compresses better than simple deduplication…
    but does “compressed deduplication” compress better than simple compression? 🙂

    (is there a benefit to deduplicating before compressing?)

    • The results you can expect from compression vs. deduplication vs. compression+deduplication will vary depending on the data set and the way it’s accessed and stored.

      For example, on primary storage the data tends to be more random and accessed frequently, while on secondary (ie: backup to disk) there is much more redundancy across files and less frequent access.

      It also depends on whether the deduplication process and/or compression process happens inline or as a post process action. There are trade-offs to each technology depending on the different data types, as well as the process.

      For example, inline deduplication can do very fast data reduction but requires large amounts of memory to hold the hash table and a lot of CPU to handle the work. When done as a post-process, deduplication can eliminate the requirement for a large amount of memory and CPU, but it requires more disk space since new writes are not deduplicated. The same goes for inline compression vs. post-process.

      Different datasets make a large difference in which technologies work well. Lots of very small text files would be better suited to compression instead of deduplication because the commonality inside each file may be too small for deduplication to find. Compression algorithms usually can find savings from extremely small datasets.

      Compressing a single entire Windows operating system install in a virtual machine would net maybe 50% savings, while deduplication would not save as much. However, if you compress 100 identical virtual machines you still get 50% savings, while deduplication could be drastically higher since each of those Windows virtual machines has a ton of common files.

      Anyway, it all really depends. The best plan is to start with what you are actually trying to accomplish, and then talk to each vendor about how their solutions address it. The other thing to consider is that deduplication and compression are just two of the ways to gain efficiency in storage and reduce cost. Since reducing cost is the ultimate goal, you may find that some vendors have other technologies that don’t necessarily reduce the amount of disk space being used, but save you money in other ways.
