Although many large-capacity drives are now available, many customers still need to save capacity and get more usable space out of what they already have. There are many ways to save capacity; the one most commonly heard of in the market is deduplication. Today we will look at its advantages and disadvantages, and at how each brand's storage saves capacity.
What is Data Deduplication?
Data deduplication is a process that eliminates excessive copies of data and significantly decreases storage capacity requirements.
Deduplication can be run as an inline process as the data is being written into the storage system and/or as a background process to eliminate duplicates after the data is written to disk.
The performance overhead is minimal for deduplication operations because it runs in a dedicated efficiency domain that is separate from the client read/write domain. It runs behind the scenes, regardless of what application is run or how the data is being accessed (in NAS or SAN).
Deduplication savings are maintained as data moves around – when the data is replicated to a DR site, when it’s backed up to a vault, or when it moves between on-premises, hybrid cloud, or public cloud.
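To make the idea concrete, here is a minimal sketch of block-level deduplication in Python. The fixed 4 KB block size, function names, and in-memory chunk store are illustrative assumptions, not any vendor's actual implementation; real systems use variable-length chunking, on-disk metadata, and far more robust bookkeeping.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed block size; real systems vary


def deduplicate(data: bytes):
    """Split data into fixed-size blocks and keep only unique ones.

    Returns a chunk store (hash -> block) plus the ordered list of
    hashes (the "recipe") needed to reconstruct the original data.
    """
    store = {}
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        block = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # write the block only if unseen
        recipe.append(digest)
    return store, recipe


def reconstruct(store, recipe):
    """Rebuild the logical data from the recipe and the chunk store."""
    return b"".join(store[d] for d in recipe)


# Ten identical blocks (think ten clones of the same VM image block)
# dedupe down to a single stored copy.
data = b"A" * CHUNK_SIZE * 10
store, recipe = deduplicate(data)
assert reconstruct(store, recipe) == data
print(len(store), "unique block(s) stored for", len(recipe), "logical blocks")
```

Because the recipe preserves the logical layout, the savings survive wherever the recipe and chunk store travel together, which is the same reason real deduplicated data stays small when replicated or backed up.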
Why do I need deduplication?
It helps storage administrators reduce costs that are associated with duplicated data. Large datasets often have a lot of duplication, which increases the costs of storing the data. For example:
- User file shares may have many copies of the same or similar files.
- Virtualization guests might be almost identical from VM-to-VM.
- Backup snapshots might have minor differences from day to day.
The capacity savings that you can gain from Data Deduplication depend on the dataset or workload on the volume. Datasets that have high duplication could see optimization rates of up to 95%, or a 20x reduction in storage utilization. In addition, it can also “Improve write performance” and “Save network bandwidth”.
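The relationship between a savings percentage and a reduction ratio can be checked with a little arithmetic; the numbers below simply restate the 95% / 20x example from above, and the function name is my own.

```python
def dedup_savings(logical_bytes: int, physical_bytes: int):
    """Return (reduction ratio, fraction of capacity saved)."""
    ratio = logical_bytes / physical_bytes
    savings = 1 - physical_bytes / logical_bytes
    return ratio, savings


# 20,000 logical units stored in 1,000 physical units:
ratio, savings = dedup_savings(logical_bytes=20_000, physical_bytes=1_000)
print(f"{ratio:.0f}x reduction, {savings:.0%} capacity saved")
# A 95% saving and a 20x reduction are the same thing stated two ways.
```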
What kind of environment will need this?
Data deduplication works by finding repeated blocks of data across a relatively wide range; the repeated blocks are generally 1 KB or larger. The technology is widely used in network drives, email systems, disk-based backup devices, and so on.
It is useful regardless of workload type. Maximum benefit is seen in virtual environments where multiple virtual machines are used for test/dev and application deployments.
Virtual desktop infrastructure (VDI) is another very good candidate for deduplication because the duplicate data among desktops is very high.
Some relational databases such as Oracle and SQL do not benefit greatly from deduplication, because they often have a unique key for each database record, which prevents the deduplication engine from identifying them as duplicates.
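The database point above can be illustrated with a toy experiment. The block size, padding scheme, and synthetic data are assumptions made purely for illustration: many desktops sharing one OS image collapse to a single unique block, while records that each carry a unique key produce no duplicate blocks at all.

```python
import hashlib

BLOCK = 4096  # hypothetical fixed block size


def unique_blocks(data: bytes) -> int:
    """Count distinct fixed-size blocks by hashing each one."""
    hashes = {
        hashlib.sha256(data[i:i + BLOCK]).digest()
        for i in range(0, len(data), BLOCK)
    }
    return len(hashes)


# VDI-like data: 100 desktops sharing identical OS image blocks.
vdi = b"os-image-block".ljust(BLOCK, b"\0") * 100

# Database-like data: each record embeds a unique key, so every
# block differs even though the rest of the payload is identical.
db = b"".join(
    f"key={i:08d}".encode().ljust(BLOCK, b"\0") for i in range(100)
)

print("VDI unique blocks:", unique_blocks(vdi))  # 1
print("DB unique blocks:", unique_blocks(db))    # 100
```

The unique key in every block changes its hash, so the engine sees no duplicates, which is exactly why such workloads gain little from deduplication.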
Advantages & Challenges
Cost effective without compromise (AFA vs. Hybrid)
Look at the comparison table above: deduplication methods vary from vendor to vendor. A design that not only saves space but also works with SSD Cache to accelerate performance is a new trend and could become a must in the future; it is very helpful for saving time during backups or data transfers, even when the dedup ratio is high. In addition, with some brands a hybrid design that uses SSD Cache while deduplication is enabled is far more affordable than an all-flash design running deduplication, letting you achieve high performance and low latency on a smaller budget while still saving a lot of storage capacity through deduplication.
However, if you need high random IOPS and low latency, such as IO patterns with heavy SQL access or a VDI environment, then NVMe all-flash will be the best choice. Avoid enabling deduplication in this kind of scenario; every product design has its own purpose.
A good deduplication design can optimize read performance with little impact on write performance while delivering the capacity savings discussed above, so choosing the storage that best fits your environment is important before purchasing on a limited budget. Although disks keep getting larger, people always want to maximize the use of their resources. There are additional techniques on the market, such as RAID 2.0 and Fast Rebuild, that address concerns about risk during the RAID rebuilding process and can greatly reduce rebuild time when larger-capacity disks are used.