The PAI dynamic dataset loading service loads datasets on demand and replaces them automatically during training in DSW or DLC. This improves resource utilization and keeps training efficient.
This feature is currently in invitational preview. To use this feature, contact your account manager.
Background
In large model training, AI dataset sizes have grown rapidly from terabytes (TB) to petabytes (PB), or even larger. This growth introduces challenges related to storage costs and performance bottlenecks. To address these issues, PAI provides the dynamic dataset loading service. This service supports on-demand dynamic loading and automatic replacement of datasets during training. It significantly reduces storage costs, improves resource utilization, and ensures an efficient and stable training experience. The service stores the full dataset in the low-cost Object Storage Service (OSS). When data is needed for training, the service dynamically pulls only the required data subsets into a high-performance Cloud Parallel File System (CPFS) cache. This method greatly reduces the need for high-performance storage capacity without sacrificing performance.
Scenarios
Very large datasets: The total dataset size exceeds the CPFS storage capacity. For example, training a 100 PB dataset using only a 10 PB CPFS cache.
Clear access locality: Suitable for large files or directories with spatio-temporal locality access patterns, such as video, image, audio, and clip data.
High requirements for developer experience: Compatibility with native file APIs eliminates the need to change data reading logic. This minimizes code refactoring and learning costs.
Technical architecture
System components
OSS: Persistently stores the original full dataset. It provides a low-cost and highly reliable base storage layer.
CPFS: Serves as a high-performance cache that holds the data subsets required by training tasks in real time.
CPFS data flow component: Pulls data on demand from OSS to the CPFS cache based on scheduling instructions.
Main flow
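Each dataset file goes through a fixed call sequence during training: report it for preloading, mark it active before reading, then release it so its cache entry can be evicted. A minimal sketch of this lifecycle, using a stand-in client that only records the call order (the real client is DatasetFileLoaderClient from the dataset_file_loader package shown in the integration steps below; the file names here are hypothetical):

```python
class StubLoaderClient:
    """Stand-in for DatasetFileLoaderClient that records the call sequence."""
    def __init__(self):
        self.calls = []

    def preload_dataset_file(self, path):
        # Ask the service to pull the file from OSS into the CPFS cache.
        self.calls.append(("preload", path))

    def active_dataset_file(self, path):
        # Mark the file as in use so it is not evicted while being read.
        self.calls.append(("active", path))

    def release_dataset_file(self, path):
        # Reading is done; the cache entry may now be evicted.
        self.calls.append(("release", path))

client = StubLoaderClient()
for path in ["clip_000.mp4", "clip_001.mp4"]:  # hypothetical sample files
    client.preload_dataset_file(path)   # 1. Report the upcoming file
    client.active_dataset_file(path)    # 2. Mark usage before reading
    # ... read the file from the CPFS mount with ordinary file APIs ...
    client.release_dataset_file(path)   # 3. Free the cache slot

print(client.calls)
```

In the real service, the preload step triggers the CPFS data flow component to pull the file from OSS; the active and release calls bound the window during which the cached copy must stay resident.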

Project code integration steps
This example shows how to integrate the code for a DLC training task. The CPFS file system is mounted to /mnt/data/ using the Mount Dataset feature.
1. Initialize the dataset file preloading client
When the training process starts, create a DatasetFileLoaderClient instance to initialize the context:
import os
from dataset_file_loader import DatasetFileLoaderClient
from dataset_file_loader.model import DatasetFile
worker_id = int(os.environ.get('LOCAL_RANK', 0)) # Get the current worker ID
client = DatasetFileLoaderClient(
    worker_id=worker_id,
    cpfs_mount_path="/mnt/data/",  # The CPFS mount path, which must be the same as in the training task
    persistence_file_path=".dataset_file_persistence/task_1/",  # The path for metadata persistence
    preload_request_buffer_size=20  # The size of the preload buffer. A value from 10 to 50 is recommended.
)
worker_id must be the same as the local rank in the distributed training framework, such as LOCAL_RANK in PyTorch. cpfs_mount_path and persistence_file_path must be exactly the same as in the preload service configuration.
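The dataset_file_path values passed to the client in the following steps are relative to the dataset root, while training code reads the file from the CPFS mount. Assuming the file is addressed as the mount path joined with that relative path (a plausible layout, not confirmed by the service documentation), the read path can be derived as:

```python
import os

cpfs_mount_path = "/mnt/data/"                # same value passed to the client
dataset_file_path = "videos/clip_000.mp4"     # hypothetical relative path, as reported to the client

# The path your data loader would open with ordinary file APIs.
read_path = os.path.join(cpfs_mount_path, dataset_file_path)
print(read_path)  # /mnt/data/videos/clip_000.mp4
```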
2. Report preload requirements (before using data)
During the data sampling phase, report the upcoming file paths in batches to preload the data files:
# Query the current preload quota to avoid overload
quota = client.get_preload_dataset_files_quota(index)
for _ in range(quota):
    client.preload_dataset_file(
        DatasetFile(
            dataset_file_path=file_path,  # The relative path of the OSS folder (file) to preload
            sampler_index=index  # The sampler index
        )
    )
3. Mark data usage status (when data use begins)
When you start to read a dataset file, notify the system:
client.active_dataset_file(
    DatasetFile(
        dataset_file_path=file_path,  # The relative path of the dataset file being read
        sampler_index=index  # The sampler index
    )
)
4. Release data resources (after using data)
After you finish reading the data, immediately release the resources. This action notifies the system that it can safely clear the cache to free up space:
client.release_dataset_file(
    DatasetFile(
        dataset_file_path=file_path,  # The relative path of the dataset file to release
        sampler_index=index  # The sampler index
    )
)
Billing
The PAI dynamic dataset loading service is currently available at no charge.