Platform for AI: Dynamic dataset loading service

Last Updated: Nov 03, 2025

The PAI dynamic dataset loading service lets you dynamically load and automatically replace datasets on demand during training in Data Science Workshop (DSW) or Deep Learning Containers (DLC). This improves resource utilization and ensures efficient training.

Note

This feature is currently in invitational preview. To use this feature, contact your account manager.

Background

In large model training, dataset sizes have grown rapidly from terabytes (TB) to petabytes (PB) and beyond, which introduces storage cost and performance challenges. To address these issues, PAI provides the dynamic dataset loading service, which supports on-demand loading and automatic replacement of datasets during training. The service stores the full dataset in low-cost Object Storage Service (OSS). When data is needed for training, the service dynamically pulls only the required subsets into a high-performance Cloud Parallel File Storage (CPFS) cache. This approach greatly reduces the required high-performance storage capacity without sacrificing performance, lowering storage costs while keeping training efficient and stable.

Scenarios

  • Very large datasets: The total dataset size exceeds the CPFS storage capacity. For example, training on a 100 PB dataset using only a 10 PB CPFS cache.

  • Clear access locality: Suitable for large files or directories accessed with spatial and temporal locality, such as video, image, audio, and clip data.

  • High requirements for developer experience: Compatibility with native file APIs means you do not need to change your data reading logic, which minimizes code refactoring and learning costs (see the sketch below).
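
For example, once a file has been staged into the CPFS cache, training code reads it through the mount point with standard file operations; no service-specific read API is involved. A minimal sketch (the mount path and file name below are illustrative):

# Data staged into the CPFS cache is read like any local file
with open("/mnt/data/videos/clip_0001.mp4", "rb") as f:
    sample = f.read()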

Technical architecture

System components

  • OSS: Persistently stores the original full dataset. It provides a low-cost and highly reliable base storage layer.

  • CPFS: Serves as a high-performance cache that holds the data subsets required by training tasks in real time.

  • CPFS data flow component: Pulls data on demand from OSS to the CPFS cache based on scheduling instructions.

Main flow

(Figure: main flow. The CPFS data flow component pulls data on demand from OSS into the CPFS cache based on scheduling instructions from the training task.)

Project code integration steps

This example shows how to integrate the service into the code of a DLC training task. The CPFS file system is mounted to /mnt/data/ using the Mount Dataset feature.

1. Initialize the dataset file preloading client

When the training process starts, create a DatasetFileLoaderClient instance to initialize the context:

import os
from dataset_file_loader import DatasetFileLoaderClient
from dataset_file_loader.model import DatasetFile

worker_id = int(os.environ.get('LOCAL_RANK', 0))  # Get the current worker ID

client = DatasetFileLoaderClient(
    worker_id=worker_id,
    cpfs_mount_path="/mnt/data/",                   # The CPFS mount path, which must be the same as in the training task
    persistence_file_path=".dataset_file_persistence/task_1/",  # The path for metadata persistence
    preload_request_buffer_size=20                  # The size of the preload buffer. A value from 10 to 50 is recommended.
)
Note
  • worker_id must be the same as the local rank in the distributed training framework, such as LOCAL_RANK in PyTorch.

  • cpfs_mount_path and persistence_file_path must be exactly the same as in the preload service configuration.
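
For reference, a minimal sketch of how LOCAL_RANK is typically populated in a PyTorch distributed job (this assumes the task is launched with torchrun; the script name and process count are illustrative):

# Launch command (outside the script):
#   torchrun --nproc_per_node=8 train.py
# torchrun starts 8 worker processes and exports LOCAL_RANK=0..7, one value
# per process, so each worker constructs its own rank-specific client.
import os

worker_id = int(os.environ.get("LOCAL_RANK", 0))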

2. Report preload requirements (before using data)

During the data sampling phase, report the upcoming file paths in batches to preload the data files:

# Query the current preload quota for this sampler to avoid overloading the cache.
# `index` is the sampler index; `upcoming_file_paths` holds the relative paths
# that the sampler will read next (produced by your data pipeline).
quota = client.get_preload_dataset_files_quota(index)

for file_path in upcoming_file_paths[:quota]:
    client.preload_dataset_file(
        DatasetFile(
            dataset_file_path=file_path, # The relative path of the OSS folder (file) to preload
            sampler_index=index # The sampler index
        )
    )

3. Mark data usage status (when data use begins)

When you start to read a dataset file, notify the system:

client.active_dataset_file(
    DatasetFile(
        dataset_file_path=file_path, # The relative path of the OSS folder (file) being read
        sampler_index=index # The sampler index
    )
)

4. Release data resources (after using data)

After you finish reading the data, immediately release the resources. This action notifies the system that it can safely clear the cache to free up space:

client.release_dataset_file(
    DatasetFile(
        dataset_file_path=file_path, # The relative path of the OSS folder (file) to release
        sampler_index=index # The sampler index
    )
)
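
Taken together, the three calls map onto a dataset's read path: report upcoming files ahead of time, activate each file as it is read, and release it afterward. The following end-to-end sketch is hypothetical (the PreloadedDataset class and its members are illustrative, not part of the service API) and assumes a PyTorch map-style dataset reading raw bytes through the /mnt/data/ mount:

import os

from torch.utils.data import Dataset

from dataset_file_loader import DatasetFileLoaderClient
from dataset_file_loader.model import DatasetFile

class PreloadedDataset(Dataset):
    """Illustrative wrapper that preloads upcoming files and releases them after use."""

    def __init__(self, file_paths, client, sampler_index):
        self.file_paths = file_paths   # Relative paths under the OSS prefix
        self.client = client           # A DatasetFileLoaderClient for this worker
        self.index = sampler_index     # The sampler index reported to the service
        self.next_to_preload = 0       # Position of the next file to report

    def _preload_ahead(self):
        # Report as many upcoming files as the current quota allows
        quota = self.client.get_preload_dataset_files_quota(self.index)
        while quota > 0 and self.next_to_preload < len(self.file_paths):
            self.client.preload_dataset_file(DatasetFile(
                dataset_file_path=self.file_paths[self.next_to_preload],
                sampler_index=self.index,
            ))
            self.next_to_preload += 1
            quota -= 1

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, i):
        self._preload_ahead()
        path = self.file_paths[i]
        descriptor = DatasetFile(dataset_file_path=path, sampler_index=self.index)

        # Mark the file as in use, read it through the CPFS mount, then release it
        self.client.active_dataset_file(descriptor)
        with open(os.path.join("/mnt/data/", path), "rb") as f:
            sample = f.read()  # Replace with your actual decoding logic
        self.client.release_dataset_file(descriptor)
        return sample

Querying the quota before reporting keeps the number of outstanding preload requests bounded, so they stay within the buffer configured by preload_request_buffer_size in step 1.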

Billing

The PAI dynamic dataset loading service is currently available at no charge.