Question 1

What is deduplication?

Accepted Answer

Data deduplication is the process of removing duplicate copies of data to optimize storage resources and improve performance. By eliminating redundant information, the system frees storage space and reduces the size of datasets. This lowers the cost of storage and improves the performance of applications that can carry out their tasks using smaller sets of data.

Data deduplication makes the most sense when applied to secondary storage that holds backup data. The secondary storage repositories used for backups tend to have higher duplication rates, which is what makes dedup worthwhile. Primary storage used for production environments, on the other hand, prioritizes performance over other factors, so dedup may not be used there.

When deploying deduplication, be aware that some relational databases, such as Oracle and Microsoft SQL Server, do not benefit much from dedup. Such databases often keep a unique key for each record, which prevents deduplication engines from recognizing otherwise identical records as duplicates.

Consider a common example. The CEO of a company sends an email to all 100 employees with an updated organization chart as an attachment. All 100 employees receive the same file and save it to their desktops so they can refer to it later. When the backup runs that night, it encounters all 100 copies of the file, but because of deduplication only one actual file is saved, along with 99 references, or pointers, to it. The deduplication ratio in this example is 100 to 1. When deduplication is paired with data compression, only the saved instances of the data are compressed, which further increases effective storage capacity.
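The email example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `dedupe_files`, the file paths, and the use of SHA-256 as the content fingerprint are all hypothetical choices, not a description of how any particular backup product works. Here every path keeps a pointer to the single stored copy.

```python
import hashlib

def dedupe_files(files):
    """File-level dedup sketch. files maps path -> bytes.

    Returns (store, references): one saved copy per distinct content,
    plus a pointer (the content digest) for every path.
    """
    store = {}       # content digest -> the single saved copy
    references = {}  # path -> digest acting as the pointer
    for path, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        store.setdefault(digest, content)  # keep only the first copy
        references[path] = digest
    return store, references

# Hypothetical stand-in for the organization chart attachment.
attachment = b"updated organization chart" * 1000
desktops = {f"employee_{i:03d}/org_chart.pdf": attachment for i in range(100)}

store, references = dedupe_files(desktops)
logical_bytes = sum(len(c) for c in desktops.values())
stored_bytes = sum(len(c) for c in store.values())
ratio = logical_bytes / stored_bytes
print(f"{len(desktops)} files, {len(store)} stored copy, ratio {ratio:.0f}:1")
```

The ratio is simply the logical size (what the backup was asked to protect) divided by the physical size (what was actually written), which is why 100 identical files give 100:1.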

Question 2

What is the deduplication process, and how does it work?

Accepted Answer

Deduplication involves analyzing data to identify its unique blocks before storing them. The agent performing the deduplication assigns a unique identifier number to each stored block of data. New blocks, or chunks, are then compared by checking their identifiers against those of the data already stored. When a duplicate identifier is found, the agent assumes the new chunk is a redundant copy, removes it, and replaces it with a reference, or pointer, to the saved unique block.

Data deduplication can be performed either in-line, as the data is flowing, or “post-process,” after the information has already been written to storage devices. In the post-process case, deduplication runs in the background and follows prescribed optimization policies to identify files that need work. The dedup agent then breaks the target files into variable-size chunks, identifies the unique ones, and stores them with their assigned identifiers. Duplicate chunks are removed and replaced with references to the identical saved data. Finally, the dedup agent re-establishes the original file stream from the newly optimized files.

As the email example showed, deduplication ratios can be very high. With certain data types, ratios as high as 90:1 are not uncommon, which increases the benefits of deduplication. Data deduplication is a simple concept that can dramatically reduce storage requirements and the costs associated with them. When it is coupled with data compression, which further reduces the size of stored data and its storage footprint, the savings increase, and the performance of storage devices and applications improves.
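The process described above can be sketched as a small in-memory chunk store. Several things here are assumptions made for illustration: the class name `ChunkStore`, the fixed 4 KiB chunk size (the answer mentions variable-size chunks, which take more code), SHA-256 standing in for the "unique identifier number", and zlib standing in for whatever compression a real product would use.

```python
import hashlib
import zlib

CHUNK_SIZE = 4096  # assumed fixed size; real engines often use variable-size chunks

class ChunkStore:
    """Minimal block-level dedup sketch: unique chunks plus per-file recipes."""

    def __init__(self):
        self.chunks = {}   # identifier -> compressed unique chunk
        self.recipes = {}  # file name -> ordered list of identifiers

    def write(self, name, data):
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            ident = hashlib.sha256(chunk).hexdigest()
            if ident not in self.chunks:
                # Only chunks not seen before are compressed and stored.
                self.chunks[ident] = zlib.compress(chunk)
            recipe.append(ident)  # duplicates become references only
        self.recipes[name] = recipe

    def read(self, name):
        # Rebuild the original stream by following the references in order.
        return b"".join(zlib.decompress(self.chunks[i]) for i in self.recipes[name])

# Two hypothetical nightly backups that differ only in their last chunk.
monday = b"".join(bytes([i]) * CHUNK_SIZE for i in range(4))
tuesday = monday[:3 * CHUNK_SIZE] + bytes([9]) * CHUNK_SIZE

store = ChunkStore()
store.write("monday.bak", monday)
store.write("tuesday.bak", tuesday)

referenced = sum(len(r) for r in store.recipes.values())
print(f"{referenced} chunks referenced, {len(store.chunks)} chunks stored")
```

Because the two backups share three of their four chunks, eight chunk references resolve to five stored chunks, and `store.read()` still returns each backup byte for byte. Treating a matching digest as proof of identical content mirrors the assumption the answer describes; a production system would also need to consider hash collisions and persistence.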

Question 3

Why is deduplication important?

Accepted Answer

Unchecked exponential data growth, coupled with shrinking backup windows, data retention SLAs, and regulatory requirements, strains IT resources and business growth. Data deduplication technology reduces physical disk capacity requirements while still meeting data retention requirements.

The amount of data created daily is staggering, and all of it must be optimized for better utilization of storage capacity and protected against loss. On average, in 2020, humans generated 1.7 MB of data every second for every person on earth [1]. By 2025, it is estimated that 463 exabytes of data will be created each day [2].

Unfortunately, much of the stored data consists of duplicates that increase customers’ storage costs and lower cloud systems’ performance and utilization. For example, file sharing among users results in many duplicate copies of the same file. In virtualized environments, the guest operating systems of several virtual machines (VMs) may be almost identical. Even backup snapshots may differ only slightly from one day to the next. Deduplication removes these inefficiencies from storage systems. According to Microsoft’s estimate, user documents may see typical space savings of up to 50%, while virtualization libraries might realize up to 95% space savings [3]. The savings vary depending on the dataset.

Question 4

Does Metallic offer deduplication?

Accepted Answer

Yes! Metallic offers enterprise-grade deduplication in our portfolio of data protection offerings. Through our sophisticated dedup capabilities, Metallic customers enjoy fast performance, storage optimization, and cost savings. Learn more about Metallic by visiting Metallic.io.

Sources