Validate data consistency after migration

Validate data consistency after you complete a migration task in Data Online Migration.

Important

Data Online Migration does not guarantee data consistency. You must validate data consistency.
During the validation, requesting data from the source or destination may incur request and traffic fees.
If data is being written to the source or destination during the validation, the validation results may be affected.

Validation dimensions

Data validation includes three dimensions: file count consistency, file content consistency, and file metadata (partial) consistency. Perform all three to complete data consistency validation.

File count consistency: the source file set (S1) and destination file set (S2) match, with no missing files.
File content consistency: each file has the same content on both ends, with no corrupted or out-of-order data.
File metadata (partial) consistency: supported metadata entries for each file are consistent between source and destination. Data Online Migration migrates only part of the metadata. For details, see the Notes and Limits sections in the relevant migration tutorials.
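The file count dimension reduces to a set comparison between S1 and S2. The following is a minimal sketch, assuming both listings have already been collected as relative paths or object keys in the same form. The names `CountReport` and `compare_file_sets` are hypothetical and are not part of Data Online Migration.

```python
from typing import Iterable, List, NamedTuple


class CountReport(NamedTuple):
    missing: List[str]      # present in the source, absent from the destination
    unexpected: List[str]   # present in the destination, absent from the source


def compare_file_sets(source_keys: Iterable[str], dest_keys: Iterable[str]) -> CountReport:
    """Compare the source file set (S1) with the destination file set (S2)."""
    s1, s2 = set(source_keys), set(dest_keys)
    return CountReport(missing=sorted(s1 - s2), unexpected=sorted(s2 - s1))


report = compare_file_sets(
    ["a.txt", "dir/b.bin", "dir/c.log"],   # hypothetical source listing
    ["a.txt", "dir/c.log"],                # hypothetical destination listing
)
print(report.missing)  # files that need investigation
```

Entries in `unexpected` are not necessarily errors, because the destination may have held files before the migration. They are reported only so that you can rule them out.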

Validation logic

Different destination types offer different validation features. For example, OSS can generate a file list for a bucket and provides reliable CRC-64 values for all uploaded objects. A local file system (LocalFS) has no comparable features, so you must list the files and calculate the checksums yourself.
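For a local file system, listing files and calculating checksums yourself might look like the following sketch. It uses only the Python standard library. The function names `list_files` and `file_md5` are hypothetical, and normalizing paths to `/` separators is an assumption made so that the listing can be compared with object keys from the source.

```python
import hashlib
import os


def list_files(root):
    """Return regular-file paths under root, relative to root, with '/' separators."""
    result = []
    # os.walk does not follow symbolic links to directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            result.append(rel.replace(os.sep, "/"))
    return sorted(result)


def file_md5(path, chunk_size=1024 * 1024):
    """Calculate the MD5 hex digest of a file without loading it fully into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Empty directories do not appear in this listing. If they matter to you, collect them separately from the directory names that `os.walk` yields.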

The following describes the data consistency validation logic for different destination types:

Destination type	Validation dimension	Validation logic
OSS	File count consistency	Validation principle: Traverse the source file collection and ensure each file exists in the destination collection. Validation steps: Get the source file list: Obtain it using the relevant traversal interface, or generate it using the list feature (if available). Get the destination file list: Obtain it using the `ListObjects` interface, or generate it using the OSS list feature. Perform validation: Traverse the source file list and ensure each file exists in the destination file list.
	File content consistency	Validation principle: Compare the checksums of each file. Validation steps: Get the source CRC-64: If the source exposes a CRC-64 value of the same type, obtain it directly using the interface. Otherwise, read each source file and calculate its CRC-64 value. Get the destination CRC-64: The CRC-64 value of each successfully migrated object is listed in the migration report. You can also call the `HeadObject` interface to retrieve it from OSS in real time. Perform validation: Compare the CRC-64 values obtained from both ends.
	File metadata (partial) consistency	Validation principle: Compare the (partial) metadata entries of each file. Validation steps: Get the source file metadata: Obtain it using the relevant interface, such as S3's `HeadObject`. Get the destination file metadata: Obtain it using the `HeadObject` interface. Perform validation: Compare the two sets of metadata entries (only those supported by Data Online Migration) one by one.
LocalFS	File count consistency	Validation principle: Traverse the source file collection and ensure each file exists in the destination collection. Validation steps: Get the source file list: Obtain it using the relevant traversal interface, or generate it using the list feature (if available). Get the destination file list: Obtain it using the relevant traversal interface, which is the only option for a local file system. Perform validation: Traverse the source file list and ensure each file exists in the destination file list.
	File content consistency	Validation principle: Compare the checksums of each file. Validation steps: Get the source checksum: If the source has a directly usable checksum (MD5, CRC32, CRC-64, and so on), obtain it directly using the interface. Otherwise, select a checksum algorithm in advance (MD5 is recommended), then read each source file and calculate its checksum with that algorithm. Get the destination checksum: File systems do not store checksum metadata, so the corresponding validation field in the migration report is unavailable. Read each LocalFS file and calculate its checksum with the same algorithm used for the source. Perform validation: Compare the checksums obtained from both ends.
	File metadata (partial) consistency	Validation principle: Compare the (partial) metadata entries of each file. Validation steps: Get the source file metadata: Obtain it using the relevant interface, such as S3's `HeadObject`. Get the destination file metadata: Obtain it using a POSIX file I/O interface, such as `stat`. Perform validation: Compare the two sets of metadata entries (only those supported by Data Online Migration) one by one.
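When the source does not expose a CRC-64 value, you have to calculate one from the file content. The following is a minimal sketch in pure Python. It assumes that the OSS CRC-64 uses the ECMA-182 polynomial in reflected form with an all-ones initial value and final XOR (the parameter set commonly catalogued as CRC-64/XZ), and that the destination value is supplied as a decimal string. Confirm both assumptions against the OSS documentation before relying on the result. All function names are hypothetical.

```python
import io

_POLY = 0xC96C5795D7870F42   # reflected form of the ECMA-182 polynomial 0x42F0E1EBA9EA3693
_MASK = 0xFFFFFFFFFFFFFFFF


def _build_table():
    """Precompute the CRC of every possible byte value."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc64_update(data, crc=0):
    """Extend a previously returned CRC-64 (0 for a new stream) with more bytes."""
    crc ^= _MASK
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ _MASK


def crc64_of_stream(stream, chunk_size=1024 * 1024):
    """Calculate the CRC-64 of a binary stream by reading it in chunks."""
    crc = 0
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        crc = crc64_update(chunk, crc)
    return crc


def matches_destination(stream, destination_crc):
    """Compare a locally calculated CRC-64 with a decimal string from the destination (assumed format)."""
    return crc64_of_stream(stream) == int(destination_crc)


# Usage with an in-memory stream; a real check would open the source file in binary mode.
print(crc64_of_stream(io.BytesIO(b"example content")))
```

Pure Python is slow for large files. The sketch only illustrates the comparison logic, and a native CRC-64 implementation is a better fit for bulk validation.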

Cost and performance optimization suggestions

Use a private network for validation: Request data from both source and destination over a private network. This reduces latency and can eliminate or lower related network costs.
Optimize the validation logic: When you validate file content consistency, first compare the file sizes. If the sizes differ, mark the file as failed without reading its content. Retrieve or calculate the checksums only if the sizes match.
Use a sampling validation strategy: For very large datasets, such as terabytes or petabytes of storage and hundreds of millions of files, full validation is extremely costly. Use sampling to balance confidence and cost. Include files with different characteristics. The following suggestions apply:
- Sample by file size:
  - Small file samples: Extract several files that are smaller than 150 MB. These files are typically uploaded using a simple upload method.
  - Large file samples: Extract several files that are larger than 150 MB. These files are typically uploaded using concurrent sharding.
  - Extra-large file samples: If your files exceed 5 GB in size, sample them separately. These files typically have long transfer periods and consume high bandwidth.
  - Directory and empty file samples: These are edge cases. Confirm that they are successfully created and their metadata meets expectations, such as the Uid, Gid, and Permissions of directories.
- Sample by business features and metadata:
  - Core business samples: If the migrated data includes core business files, perform a full validation on these files.
  - File type samples: Based on your business type, extract files of key formats, such as .jpg, .pdf, .log, .json, and .sql.
  - Special metadata samples: Based on your business type, if source files have custom metadata, such as x-oss-meta-* or specific HTTP headers, extract these files to validate the consistency of the custom metadata.
  - Hot and cold data samples: Extract recently updated files and historical archived objects. Validate whether metadata, such as LastModified, is accurately retained.
  - Path and naming samples: Sample files with deep directory levels (such as more than 10 levels) and files with spaces, Chinese characters, other Unicode characters, or special symbols in their paths and names. Check for encoding, decoding, and escaping issues.
- Random statistical sampling:
  - Use a random algorithm to extract a certain percentage of files, such as 1% to 5%, from the file list. This helps uncover sporadic errors that the targeted samples above can miss.
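Two of the suggestions above lend themselves to a short sketch: the size-first content check, and sampling by file size combined with a random percentage. The code below is illustrative only. The function names, the `per_bucket` and `random_fraction` parameters, and the choice of binary megabytes and gigabytes for the 150 MB and 5 GB boundaries are assumptions. The checksum arguments are zero-argument callables so that no file is read when the sizes already differ.

```python
import random

SMALL_LIMIT = 150 * 1024 ** 2   # 150 MB boundary (binary units assumed)
HUGE_LIMIT = 5 * 1024 ** 3      # 5 GB boundary (binary units assumed)


def content_matches(src_size, dst_size, src_checksum, dst_checksum):
    """Compare sizes first; call the checksum functions only if the sizes match."""
    if src_size != dst_size:
        return False
    return src_checksum() == dst_checksum()


def size_bucket(size):
    """Assign a file to one of the size classes used for sampling."""
    if size == 0:
        return "empty"
    if size < SMALL_LIMIT:
        return "small"
    if size <= HUGE_LIMIT:
        return "large"
    return "extra_large"


def pick_samples(files, per_bucket=3, random_fraction=0.01, seed=None):
    """Pick sample paths from a list of (path, size) tuples.

    Takes up to per_bucket files from each size class, then adds a random
    fraction of the whole list. Returns the sampled paths in sorted order.
    """
    rng = random.Random(seed)
    buckets = {}
    for path, size in files:
        buckets.setdefault(size_bucket(size), []).append(path)

    chosen = set()
    for paths in buckets.values():
        chosen.update(rng.sample(paths, min(per_bucket, len(paths))))

    all_paths = [path for path, _size in files]
    extra = min(len(all_paths), max(1, round(len(all_paths) * random_fraction))) if all_paths else 0
    chosen.update(rng.sample(all_paths, extra))
    return sorted(chosen)
```

Passing a fixed `seed` makes a sampling run repeatable, which helps when a failed sample has to be checked again. Sampling by business features, file type, or path characteristics would need additional filters that depend on your data.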

Important

A sampling validation strategy is a compromise that balances validation cost and efficiency in scenarios that involve a massive amount of data. Sampling validation is based on statistical principles and carries a risk of missed detections. It cannot replace a full validation. For core business data, a full validation is still strongly recommended.

Validation issues and troubleshooting

If you find data inconsistencies during the validation, perform the following steps to troubleshoot the issues:

Confirm whether the file was updated on the source or destination.
Retrieve all information about the file from the migration report. Compare the information with the source and destination information separately and analyze the possible causes.
If you cannot locate the issue, submit a ticket to contact technical support. In the ticket, specify information such as the console region, task ID, and file path.

Mark task status

After you complete the validation and confirm that no issues exist, log on to the Data Online Migration console. On the Task List page, find the task and select Confirm data integrity and consistency in the Status column to confirm that the validation is complete.