Medusa

Implementing Data Reduction

What do you do when things start to get a little too messy and disorganized in your home? Unless you derive a strange sort of pleasure from wading through heaps of boxes, clothes, and old ’90s CDs every time you need to locate something, you grab a trash bag and a broom and start cleaning. Similar logic applies to computers.

Rapidly growing volumes of data on all kinds of computers, from personal laptops to corporate servers, mean growing demands for storage space to contain and organize that data. When things get a little too chaotic and storage space dwindles, there are some things you can do to effectively “clean up” the mess – namely, data reduction.

Why is Data Reduction Important?

Data reduction, in a nutshell, is simply a method of decreasing the amount of data that needs to be stored. Methods of data reduction include de-duplication and compression. As the name suggests, the most important function of data reduction is to increase the available storage space on a computer.

This can lower expenditures on additional disks, let you retain data on disk for longer, reduce the amount of storage backups consume, and reduce the sheer amount of data sent over a WAN for things like disaster recovery. Let’s take a closer look at the different methods of data reduction.

Data de-duplication is also sometimes referred to as single-instance storage or intelligent compression. The goal of de-duplication is to eliminate redundant data: identical pieces of data that occur multiple times on a particular storage device. The process removes all of the superfluous copies and leaves only a single instance of each unique piece of data.

For instance, if there are 250 instances of a single 1MB data set, then 250MB of storage space is being occupied by this one piece of data. If you apply de-duplication, however, the extra 249 instances are eliminated and replaced by pointers that lead to the single unique copy, which requires just 1MB of storage space.
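
To make the arithmetic concrete, here is a minimal single-instance storage sketch in Python; the class and method names are invented for illustration. Each logical write stores a block only if an identical copy is not already present, so 250 writes of the same 1MB block consume roughly 1MB of physical space.

```python
class SingleInstanceStore:
    """Toy single-instance (de-duplicated) block store."""

    def __init__(self):
        self.blocks = {}    # one stored copy per unique block
        self.pointers = []  # one entry per logical write, referencing a stored block

    def write(self, data: bytes) -> None:
        # Keep the block only if an identical copy is not already stored.
        self.blocks.setdefault(data, data)
        self.pointers.append(data)  # the "pointer" here is just a reference to that copy

    def logical_size(self) -> int:
        return sum(len(p) for p in self.pointers)

    def physical_size(self) -> int:
        return sum(len(b) for b in self.blocks)


store = SingleInstanceStore()
block = b"x" * (1024 * 1024)            # a single 1MB data set
for _ in range(250):                    # written 250 times
    store.write(block)

print(store.logical_size() // 2**20)    # 250 (MB of logical data)
print(store.physical_size() // 2**20)   # 1 (MB actually stored)
```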

De-duplication usually works by using a hash algorithm to generate a unique identification number for each chunk of data, which is then indexed. Problems can occasionally arise from something called a “hash collision”. A hash collision occurs when the algorithm assigns two different chunks of data the same identification number; one of the chunks is then discarded as a supposed duplicate, resulting in data loss.

Using a weak hashing algorithm in an attempt to optimize CPU performance puts you at greater risk for a hash collision.
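
The sketch below (again with invented names) indexes chunks by a hash instead of by their full contents. It uses a deliberately weak identifier, a simple byte sum, to show how two different chunks can receive the same identification number and one be silently dropped as a “duplicate”; a cryptographic hash such as SHA-256 makes that outcome astronomically unlikely.

```python
import hashlib

def strong_id(chunk: bytes) -> str:
    # Cryptographic hash: collisions are astronomically unlikely.
    return hashlib.sha256(chunk).hexdigest()

def weak_id(chunk: bytes) -> int:
    # Deliberately weak "hash": different chunks collide easily.
    return sum(chunk) % 256

index = {}

def dedupe(chunk: bytes, ident) -> None:
    key = ident(chunk)
    if key in index and index[key] != chunk:
        print("hash collision: a distinct chunk would be dropped as a duplicate")
    index.setdefault(key, chunk)

a, b = b"ab", b"ba"     # different data with the same byte sum
dedupe(a, weak_id)
dedupe(b, weak_id)      # collides -> risks silent data loss

index.clear()
dedupe(a, strong_id)
dedupe(b, strong_id)    # distinct SHA-256 digests, both chunks kept
```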

Data compression involves reducing the size of data, which can save precious storage space as well as cut back on transmission time. Similar to de-duplication, compression programs use a special algorithm to reduce data. Text file compression, for example, typically replaces repeated sequences or the most frequently occurring characters with shorter codes. A matching algorithm is then used to decompress the data afterwards.
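
As a rough illustration, Python’s standard zlib module (the DEFLATE family of algorithms also used in zip files) shows the effect; the sample text below is made up.

```python
import zlib

text = ("the quick brown fox jumps over the lazy dog " * 100).encode("utf-8")

compressed = zlib.compress(text, level=9)   # shrink the data for storage or transmission
restored = zlib.decompress(compressed)      # the matching algorithm restores it afterwards

print(len(text), "bytes before,", len(compressed), "bytes after")
assert restored == text                     # the round trip is exact
```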

Compressing graphic images can prove a little trickier, and there are several different methods for reducing image size. Some of these methods are “lossy”, meaning small pieces of information are lost during the process, and some are “lossless”, meaning the decompressed image will be identical to the original. Zip files are a common example of a compressed file format.
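
Since zip files come up here, the short sketch below uses Python’s zipfile module with hypothetical file names; the zip format’s DEFLATE compression is lossless, so the extracted bytes match the original exactly.

```python
import zipfile

data = b"quarterly report contents... " * 1000   # hypothetical file contents

with zipfile.ZipFile("archive.zip", "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("report.txt", data)              # compress the file into the archive

with zipfile.ZipFile("archive.zip") as zf:
    restored = zf.read("report.txt")

assert restored == data                          # lossless: nothing was lost
```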

While compression and de-duplication can each be applied on their own, the most effective data reduction strategies apply both. When utilized together, these two methods can effectively “clean up” your system, optimizing storage space and performance.
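
A hypothetical pipeline combining the two might de-duplicate blocks by hash first and then compress only the unique copies, for example:

```python
import hashlib
import zlib

def reduce_blocks(blocks):
    """De-duplicate by SHA-256, then compress each unique block (illustrative only)."""
    unique = {}   # digest -> compressed unique block
    refs = []     # per-block pointers to the stored copies
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest not in unique:
            unique[digest] = zlib.compress(block)
        refs.append(digest)
    return unique, refs

blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096]   # first and last are duplicates
unique, refs = reduce_blocks(blocks)
print(len(blocks), "logical blocks ->", len(unique), "compressed unique blocks")
```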

Written by Steven Bishop. Steven assists in giving customers the freedom and flexibility to protect their own data.