Validate data consistency after you complete a migration task in Data Online Migration.
Data Online Migration does not guarantee data consistency. You must validate it yourself.
During the validation, requesting data from the source or destination may incur request and traffic fees.
If data is being written to the source or destination during the validation, the validation results may be affected.
Validation dimensions
Data validation includes three dimensions: file count consistency, file content consistency, and file metadata (partial) consistency. Perform all three to complete data consistency validation.
File count consistency: the source file set (S1) and destination file set (S2) match, with no missing files.
File content consistency: each file has the same content on both ends, with no corrupted or out-of-order data.
File metadata (partial) consistency: supported metadata entries for each file are consistent between source and destination. Data Online Migration migrates only part of metadata. For details, see the Notes and Limits sections in the relevant migration tutorials.
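The file count check above amounts to a set comparison between S1 and S2. The following is a minimal sketch, assuming you have already obtained both file lists as relative paths. The function name `compare_file_sets` and the result keys are hypothetical and not part of Data Online Migration.

```python
def compare_file_sets(source_keys, destination_keys):
    """Compare the source file set (S1) with the destination file set (S2).

    Both arguments are iterables of relative paths (hypothetical input format).
    """
    s1 = set(source_keys)
    s2 = set(destination_keys)
    return {
        # Files that exist on the source but not on the destination.
        "missing_in_destination": sorted(s1 - s2),
        # Files that exist only on the destination (for example, written after migration).
        "unexpected_in_destination": sorted(s2 - s1),
        "consistent": s1 == s2,
    }


if __name__ == "__main__":
    report = compare_file_sets(
        ["a.txt", "dir/b.txt", "dir/c.txt"],
        ["a.txt", "dir/b.txt"],
    )
    print(report)
```

Whether files that exist only on the destination count as an inconsistency depends on your scenario, so the sketch reports them separately.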
Validation logic
Different destination types offer different validation features. For example, OSS can generate a file list for a bucket and provides reliable CRC-64 values for all uploaded objects. A local file system (LocalFS) has no similar features, so you must list the files and calculate the checksums yourself.
The following describes the data consistency validation logic for different destination types:
Destination type | Validation dimension | Validation logic
OSS | File count consistency |
OSS | File content consistency |
OSS | File metadata (partial) consistency |
LocalFS | File count consistency |
LocalFS | File content consistency |
LocalFS | File metadata (partial) consistency |
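For LocalFS, where you list the files and calculate the checksums yourself, the work can be sketched as building a manifest of each file's size, checksum, and ownership metadata. This is an illustration only, not the official validation logic. The helper names (`crc64_bytes`, `crc64_file`, `build_manifest`) are hypothetical. The CRC-64 parameters (ECMA-182 polynomial in reflected form, all-ones initial value and final XOR) are an assumption; confirm them against the OSS documentation before comparing the results with CRC-64 values reported by OSS.

```python
import os
import stat

# Assumption: reflected ECMA-182 polynomial with all-ones init and final XOR.
_POLY = 0xC96C5795D7870F42
_MASK = 0xFFFFFFFFFFFFFFFF


def _build_table():
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def _update(register, chunk):
    """Feed a chunk of bytes into the raw CRC register."""
    for b in chunk:
        register = _TABLE[(register ^ b) & 0xFF] ^ (register >> 8)
    return register


def crc64_bytes(data):
    return _update(_MASK, data) ^ _MASK


def crc64_file(path, chunk_size=1 << 20):
    """Compute the CRC-64 of a file by streaming it in chunks."""
    register = _MASK
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            register = _update(register, chunk)
    return register ^ _MASK


def build_manifest(root):
    """Map each relative file path under root to its size, CRC-64, and basic metadata."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            st = os.stat(full)
            manifest[rel] = {
                "size": st.st_size,
                "crc64": crc64_file(full),
                "uid": st.st_uid,
                "gid": st.st_gid,
                "mode": stat.S_IMODE(st.st_mode),
            }
    return manifest
```

A pure-Python CRC is slow for large files. For real workloads, a native CRC-64 implementation is likely a better choice.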
Cost and performance optimization suggestions
Use a private network for validation: Request data from both source and destination over a private network. This reduces latency and can eliminate or lower related network costs.
Optimize the validation logic: When you validate file content consistency, compare the file sizes first. If the sizes differ, mark the file as failed without retrieving the checksums. Retrieve and compare the checksums only if the sizes match.
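The size-first check can be sketched as follows. The function name `validate_content` and the callable parameters are hypothetical. The checksums are passed as callables so that the potentially costly retrieval runs only when the sizes match.

```python
def validate_content(source_size, destination_size,
                     get_source_checksum, get_destination_checksum):
    """Return a validation result, comparing sizes before checksums.

    The two get_* arguments are zero-argument callables (hypothetical) that
    fetch or compute a checksum only when called.
    """
    if source_size != destination_size:
        # Cheap check failed: skip checksum retrieval entirely.
        return "failed: size mismatch"
    if get_source_checksum() != get_destination_checksum():
        return "failed: checksum mismatch"
    return "passed"


if __name__ == "__main__":
    print(validate_content(10, 12, lambda: 1, lambda: 1))
    print(validate_content(10, 10, lambda: 1, lambda: 2))
    print(validate_content(10, 10, lambda: 1, lambda: 1))
```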
Use a sampling validation strategy: For very large datasets, such as terabytes or petabytes of storage and hundreds of millions of files, full validation is extremely costly. Use sampling to balance credibility and cost. Include files with different characteristics. The following suggestions apply:
Sample by file size:
Small file samples: Extract several files that are smaller than 150 MB. These files are typically uploaded using a simple upload method.
Large file samples: Extract several files that are larger than 150 MB. These files are typically uploaded using concurrent sharding.
Extra-large file samples: If your files exceed 5 GB in size, sample them separately. These files typically have long transfer periods and consume high bandwidth.
Directory and empty file samples: These are edge cases. Confirm that they are successfully created and their metadata meets expectations, such as the Uid, Gid, and Permissions of directories.
Sample by business features and metadata:
Core business samples: If the migrated data includes core business files, perform a full validation on these files.
File type samples: Based on your business type, extract files of key formats, such as .jpg, .pdf, .log, .json, and .sql.
Special metadata samples: Based on your business type, if source files have custom metadata, such as x-oss-meta-* or specific HTTP headers, extract these files to validate the consistency of the custom metadata.
Hot and cold data samples: Extract recently updated files and historical archived objects. Validate whether metadata, such as LastModified, is accurately retained.
Path and naming samples: For example, you can sample files with deep directory levels (such as more than 10 levels) and files with spaces, Chinese characters, Unicode characters, or special symbols in their paths and names. Validate for encoding, decoding, and escape issues.
Random statistical sampling:
Use a random algorithm to extract a certain percentage of files, such as 1% to 5%, from the file list. This helps discover irregular random errors.
A sampling validation strategy is a compromise that balances validation cost and efficiency in scenarios that involve a massive amount of data. Sampling validation is based on statistical principles and carries a risk of missed detections. It cannot replace a full validation. For core business data, a full validation is still strongly recommended.
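The size-based and random sampling suggestions above can be combined in a small selection routine. This is a minimal sketch under stated assumptions: the file list is a sequence of (path, size in bytes) pairs, 1 MB is taken as 1024 × 1024 bytes, and a file of exactly 150 MB is placed in the large bucket. The names `size_bucket` and `pick_samples` and their parameters are hypothetical. Samples by business features, metadata, and path naming would need additional filters that are not shown here.

```python
import random

SMALL_LIMIT = 150 * 1024 ** 2       # 150 MB threshold from the suggestions above
EXTRA_LARGE_LIMIT = 5 * 1024 ** 3   # 5 GB threshold from the suggestions above


def size_bucket(size):
    """Classify a file size in bytes into one of the sampling buckets."""
    if size == 0:
        return "empty"
    if size < SMALL_LIMIT:
        return "small"
    if size <= EXTRA_LARGE_LIMIT:
        return "large"
    return "extra_large"


def pick_samples(files, per_bucket=3, random_fraction=0.01, seed=None):
    """Pick paths to validate: a few per size bucket plus a random share of all files."""
    files = list(files)
    rng = random.Random(seed)

    buckets = {}
    for path, size in files:
        buckets.setdefault(size_bucket(size), []).append(path)

    selected = set()
    # Sample by file size: several files from every bucket that is present.
    for paths in buckets.values():
        selected.update(rng.sample(paths, min(per_bucket, len(paths))))

    # Random statistical sampling: a percentage of the whole file list.
    if files:
        count = min(len(files), max(1, round(len(files) * random_fraction)))
        selected.update(path for path, _size in rng.sample(files, count))

    return sorted(selected)
```

Passing a fixed `seed` makes the selection reproducible, which helps when you need to rerun the validation on the same sample.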
Validation issues and troubleshooting
If you find data inconsistencies during the validation, perform the following steps to troubleshoot the issues:
Confirm whether the file was updated on the source or destination.
Retrieve all information about the file from the migration report. Compare the information with the source and destination information separately and analyze the possible causes.
If you cannot locate the issue, submit a ticket to contact technical support. In the ticket, specify information such as the console region, task ID, and file path.
Mark task status
After you complete the validation and confirm that no issues exist, log on to the Data Online Migration console. On the Task List page, find the task and select Confirm data integrity and consistency in the Status column to confirm that the validation is complete.