Data deduplication is a key technology in backup and data management. It helps enterprises address the challenges of rapid data growth, significantly improves data protection efficiency, and reduces costs. As an important cornerstone of Cloud Backup, data deduplication plays a vital role in improving backup performance, accelerating data transmission, and saving storage space. This topic describes how the data deduplication technology of Cloud Backup works.

Workflow
Cloud Backup uses source-side deduplication based on efficient slicing algorithms. Each backup vault functions as an independent deduplication domain, so data is deduplicated globally within a backup vault. The workflow consists of the following five steps; a minimal code sketch follows the list.
1. During a backup, the backup engine (a backup client or a backup service cluster) reads the data of your original files, databases, and virtual machines (VMs).
2. The backup engine slices the raw data into fragments.
3. The backup engine compares the fragments with the slices that already exist in the backup vault to identify the slices that are not yet stored.
4. The backup engine uploads only the slices that do not yet exist in the backup vault.
5. The backup engine stores the slice IDs that make up each file or VM in the backup vault, so that the raw data can be reassembled and written to the recovery destination during restoration.
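The following Python sketch illustrates the five steps under simplifying assumptions: fixed-size slicing (the actual slicing algorithm of Cloud Backup is not detailed in this topic), SHA-256 fingerprints used as slice IDs, and in-memory dictionaries standing in for the backup vault's slice store and manifests. It is an illustration, not the Cloud Backup implementation.

```python
# Minimal sketch of vault-level, source-side deduplication (illustrative only).
import hashlib

SLICE_SIZE = 4 * 1024 * 1024   # hypothetical 4 MiB slice size

vault_index = {}       # slice_id -> slice data already stored in the vault
vault_manifests = {}   # source path -> ordered list of slice IDs, used for restore

def backup(path: str, raw: bytes) -> None:
    """Slice the raw data, upload only unknown slices, and record the slice IDs."""
    slice_ids = []
    for offset in range(0, len(raw), SLICE_SIZE):
        fragment = raw[offset:offset + SLICE_SIZE]
        slice_id = hashlib.sha256(fragment).hexdigest()   # fingerprint of the slice
        if slice_id not in vault_index:                   # step 3: compare with the vault
            vault_index[slice_id] = fragment              # step 4: upload only new slices
        slice_ids.append(slice_id)
    vault_manifests[path] = slice_ids                     # step 5: record slice IDs for restore

def restore(path: str) -> bytes:
    """Reassemble the raw data from the slice IDs recorded in the manifest."""
    return b"".join(vault_index[s] for s in vault_manifests[path])

data = b"hello world " * 1_000_000
backup("/data/file1", data)
backup("/data/file1-copy", data)           # duplicate content: no new slices are uploaded
assert restore("/data/file1-copy") == data
```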
Benefits
Reduced storage consumption of backup data: Cloud Backup slices data before deduplication, which identifies duplicate data at a finer granularity than file-level deduplication. Deduplication is performed at the backup vault level: all data stored in the same backup vault is deduplicated globally, which provides a larger deduplication scope and a higher deduplication ratio. Compared with backup services that do not provide deduplication, Cloud Backup significantly reduces the required backup storage space. When you use Cloud Backup, you are charged only for the backup storage capacity that is actually used after deduplication.
Saved network bandwidth: Data is deduplicated at the source, and only the fragments that do not exist in the backup vault are uploaded. This saves network bandwidth when you back up data to the cloud in hybrid cloud environments.
Improved backup performance: Reading, slicing, and deduplicating data form an efficient pipeline. Cloud Backup quickly identifies duplicate data and avoids transmitting it, which can improve backup performance severalfold. (A simplified pipeline sketch follows this list.)
Improved backup data security: Cloud Backup slices data into fragments of variable lengths and uploads the fragments, so the original file layout is reorganized. As a result, it is difficult for attackers to identify the original format of the data, which enhances security during data transmission and storage.
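To make the pipeline idea concrete, the following sketch chains the read, slice, and deduplicate stages with threads and bounded queues so that the stages overlap. The stage boundaries, queue sizes, and fixed-size slicing are illustrative assumptions, not the actual Cloud Backup architecture.

```python
# Illustrative read -> slice -> deduplicate pipeline built with threads and queues.
import hashlib
import os
import queue
import threading

read_q = queue.Queue(maxsize=8)     # raw blocks waiting to be sliced
slice_q = queue.Queue(maxsize=64)   # slices waiting to be deduplicated
vault_index = set()                 # fingerprints of slices already in the vault

def reader(blocks):
    for block in blocks:            # stage 1: read source data
        read_q.put(block)
    read_q.put(None)                # signal end of input

def slicer():
    while (block := read_q.get()) is not None:
        for i in range(0, len(block), 1024):       # stage 2: slice the raw data
            slice_q.put(block[i:i + 1024])
    slice_q.put(None)

def deduplicator(uploaded):
    while (fragment := slice_q.get()) is not None:
        fp = hashlib.sha256(fragment).hexdigest()
        if fp not in vault_index:                  # stage 3: upload only new slices
            vault_index.add(fp)
            uploaded.append(fragment)

block_a, block_b = os.urandom(4096), os.urandom(4096)
uploaded = []
threads = [
    threading.Thread(target=reader, args=([block_a, block_b, block_a],)),
    threading.Thread(target=slicer),
    threading.Thread(target=deduplicator, args=(uploaded,)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{len(uploaded)} of 12 slices uploaded")    # 8: the duplicate block adds nothing
```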
Associated concepts
Compression: Compression is the process of using an encoding technology to reduce the storage space occupied by data. Cloud Backup combines deduplication and compression to further reduce the amount of data that must be stored. After deduplication, each unique data fragment is compressed and then stored in the backup vault, which effectively saves storage space.
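The following sketch shows this order of operations: each fragment is deduplicated first, and only the fragments that actually need to be stored are compressed. zlib is used here as a stand-in codec; the compression algorithm that Cloud Backup uses is not specified in this topic.

```python
# Minimal sketch of deduplication combined with per-fragment compression.
import hashlib
import zlib

vault = {}   # slice_id -> compressed fragment stored in the backup vault

def store_fragment(fragment: bytes) -> str:
    """Deduplicate first, then compress only the fragments that must be stored."""
    slice_id = hashlib.sha256(fragment).hexdigest()
    if slice_id not in vault:
        vault[slice_id] = zlib.compress(fragment)   # compress after deduplication
    return slice_id

def load_fragment(slice_id: str) -> bytes:
    return zlib.decompress(vault[slice_id])

fragment = b"log line: request ok\n" * 200
sid = store_fragment(fragment)
store_fragment(fragment)                     # duplicate: neither stored nor recompressed
stored = sum(len(v) for v in vault.values())
print(f"raw {len(fragment)} B -> stored {stored} B")   # compression shrinks the stored copy
assert load_fragment(sid) == fragment
```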
Deduplication and compression ratio (deduplication ratio for short): the ratio of the total amount of raw data that is backed up to the amount of data that is actually stored in a backup vault. For example, assume that you perform a full backup of a 30 GB directory every day and retain the backups for 30 days. The total amount of raw data that is backed up is 30 GB × 30 = 900 GB. After deduplication and compression, the data occupies 28 GB of storage space in the backup vault. Therefore, the deduplication ratio is 900:28, which is approximately equal to 32:1.
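The calculation in this example can be written out as follows:

```python
# Reproducing the example above: 30 GB backed up daily for 30 days,
# 28 GB actually stored after deduplication and compression.
raw_total_gb = 30 * 30          # total raw data covered by the backups: 900 GB
stored_gb = 28                  # actual storage used in the backup vault
ratio = raw_total_gb / stored_gb
print(f"deduplication ratio = {raw_total_gb}:{stored_gb} ≈ {ratio:.0f}:1")   # ≈ 32:1
```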
Incremental-forever backup: Cloud Backup uses an efficient incremental-forever backup mechanism to back up files, such as on-premises files, Elastic Compute Service (ECS) files, File Storage NAS (NAS) file systems, Object Storage Service (OSS) buckets, and Cloud Parallel File Storage (CPFS) file systems. Each time a backup job runs, the backup engine identifies the files that have changed since the last backup (including new, modified, and deleted files), reads only the content of these changed files, and writes the content to a backup vault. On the backup vault backend, Cloud Backup then merges this incremental data with the data of the last full backup to generate a new full backup point. This significantly reduces the amount of customer data that must be read and effectively improves overall backup efficiency. Note that the deduplication ratio is calculated based on the full data that corresponds to each backup point.
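The following sketch models the incremental-forever mechanism under simplified assumptions: a dictionary of file paths to contents stands in for the protected data source, and the change-detection and merge logic are illustrative, not the actual backend implementation.

```python
# Minimal sketch of incremental-forever backup with synthetic full backup points.

def detect_changes(previous: dict, current: dict) -> dict:
    """Return only the files that are new, modified, or deleted since the last backup."""
    changes = {}
    for path, content in current.items():
        if previous.get(path) != content:
            changes[path] = content          # new or modified file: content must be read
    for path in previous:
        if path not in current:
            changes[path] = None             # deleted file: marked with a tombstone
    return changes

def synthesize_full(previous_full: dict, incremental: dict) -> dict:
    """Merge the increment into the last full backup to create a new full backup point."""
    merged = dict(previous_full)
    for path, content in incremental.items():
        if content is None:
            merged.pop(path, None)
        else:
            merged[path] = content
    return merged

full_day1 = {"/a.txt": b"v1", "/b.txt": b"v1"}
source_day2 = {"/a.txt": b"v2", "/c.txt": b"v1"}       # a.txt modified, b.txt deleted, c.txt added
incremental = detect_changes(full_day1, source_day2)    # only changed files are read and uploaded
full_day2 = synthesize_full(full_day1, incremental)     # new full backup point on the backend
assert full_day2 == source_day2
```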