Store and access checkpoints in OSS

This article shows how to use OssCheckpoint to directly read from and write to checkpoints in Object Storage Service (OSS). A checkpoint saves a model's state at a specific point during training.

Prerequisites

OSS Connector for AI/ML is installed and configured. For more information, see Install OSS Connector for AI/ML and Configure OSS Connector for AI/ML.

OssCheckpoint

Use OssCheckpoint to read and write training results during model training.

This example shows how to use OssCheckpoint to read from and write to checkpoints.

import torch
from osstorchconnector import OssCheckpoint

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CRED_PATH = "/root/.alibabacloud/credentials"
CONFIG_PATH = "/etc/oss-connector/config.json"

# Create a checkpoint object.
checkpoint = OssCheckpoint(endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH)

# Read from a checkpoint.
CHECKPOINT_READ_URI = "oss://checkpoint/epoch.0"
with checkpoint.reader(CHECKPOINT_READ_URI) as reader:
    state_dict = torch.load(reader)

# Write to a checkpoint.
CHECKPOINT_WRITE_URI = "oss://checkpoint/epoch.1"
with checkpoint.writer(CHECKPOINT_WRITE_URI) as writer:
    torch.save(state_dict, writer)
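
The example above reads `epoch.0` and writes `epoch.1`. In a training loop, the checkpoint URI is usually derived from the epoch number. The following is a minimal sketch of such a helper. `checkpoint_uri` and its arguments are hypothetical names used for illustration and are not part of OSS Connector for AI/ML.

```python
def checkpoint_uri(bucket: str, epoch: int, prefix: str = "") -> str:
    """Build an OSS URI such as oss://checkpoint/epoch.0 for a given epoch."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    # Join the optional prefix and the object name without duplicate slashes.
    parts = [p.strip("/") for p in (prefix, f"epoch.{epoch}") if p.strip("/")]
    return f"oss://{bucket}/" + "/".join(parts)


print(checkpoint_uri("checkpoint", 0))               # oss://checkpoint/epoch.0
print(checkpoint_uri("checkpoint", 1, "resnet18/"))  # oss://checkpoint/resnet18/epoch.1
```

The returned string can be passed to `checkpoint.reader` or `checkpoint.writer` in place of a hard-coded URI.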

Data types

The checkpoint object created by OssCheckpoint implements common I/O interfaces. For more information, see Data types in OSS Connector for AI/ML.
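
Because the reader and writer are used as file-like objects, code that saves and loads state can be exercised locally before it is pointed at OSS. The sketch below assumes only that `writer(uri)` and `reader(uri)` return context managers that yield binary file-like objects. `InMemoryCheckpoint` is a hypothetical stand-in written for this illustration, not a class in OSS Connector for AI/ML, and `pickle` stands in for `torch.save` and `torch.load` to keep the sketch free of third-party dependencies.

```python
import io
import pickle
from contextlib import contextmanager


class InMemoryCheckpoint:
    """Hypothetical stand-in that mimics the reader()/writer() usage pattern."""

    def __init__(self):
        self._objects = {}  # Maps a URI to the stored bytes.

    @contextmanager
    def writer(self, uri):
        buffer = io.BytesIO()
        yield buffer
        # Publish the object only after the with block completes without error.
        self._objects[uri] = buffer.getvalue()

    @contextmanager
    def reader(self, uri):
        yield io.BytesIO(self._objects[uri])


def save_state(checkpoint, uri, state):
    with checkpoint.writer(uri) as writer:
        pickle.dump(state, writer)


def load_state(checkpoint, uri):
    with checkpoint.reader(uri) as reader:
        return pickle.load(reader)


store = InMemoryCheckpoint()
save_state(store, "oss://checkpoint/epoch.1", {"epoch": 1, "loss": 0.25})
restored = load_state(store, "oss://checkpoint/epoch.1")
```

The same `save_state` and `load_state` shape applies when `torch.save` and `torch.load` are used with an `OssCheckpoint` object.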

Parameters

OssCheckpoint requires the following parameters.

Parameter	Type	Required	Description
endpoint	string	Yes	The access domain name for OSS. For more information, see Regions and endpoints.
cred_path	string	Yes	The path of the credential file. The default path is `/root/.alibabacloud/credentials`. For more information, see Configure access credentials.
config_path	string	Yes	The path of the OSS Connector configuration file. The default path is `/etc/oss-connector/config.json`. For more information, see Configure OSS Connector.
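
The same three parameters are passed to every connector object in this article, so it can help to collect them once and fail early if a file is missing. The following sketch shows one way to do that. `connector_kwargs` is a hypothetical helper, and the file existence check is an assumption about what is useful, not behavior of OSS Connector for AI/ML.

```python
import os

DEFAULT_CRED_PATH = "/root/.alibabacloud/credentials"
DEFAULT_CONFIG_PATH = "/etc/oss-connector/config.json"


def connector_kwargs(endpoint, cred_path=DEFAULT_CRED_PATH,
                     config_path=DEFAULT_CONFIG_PATH, check_files=True):
    """Return the keyword arguments from the parameter table as a dict."""
    if not endpoint:
        raise ValueError("endpoint must not be empty")
    if check_files:
        # Fail early with a clear message instead of at the first OSS request.
        for name, path in (("cred_path", cred_path), ("config_path", config_path)):
            if not os.path.isfile(path):
                raise FileNotFoundError(f"{name} does not exist: {path}")
    return {"endpoint": endpoint, "cred_path": cred_path, "config_path": config_path}


kwargs = connector_kwargs(
    "http://oss-cn-beijing-internal.aliyuncs.com", check_files=False
)
# The dict can then be unpacked, for example: OssCheckpoint(**kwargs)
```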

Distributed checkpoint (DCP)

OSS Connector for AI/ML supports the PyTorch Distributed Checkpoint (DCP) feature starting from V1.2.3. You can use OssDCPFileSystem to directly store and read distributed checkpoints on OSS.

This example shows how to use OssDCPFileSystem to save and load a distributed checkpoint.

import torch
import torch.distributed.checkpoint as DCP
import torchvision
from osstorchconnector import OssDCPFileSystem

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ossconnectorbucket/dcp-checkpoint-resnet18"

model = torchvision.models.resnet18()

# Write to OSS.
fs = OssDCPFileSystem(endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH)
oss_storage_writer = fs.writer(OSS_URI)
# Use DCP.save or DCP.async_save.
checkpoint_future = DCP.async_save(
    state_dict=model.state_dict(),
    storage_writer=oss_storage_writer,
)
checkpoint_future.result()


# Load from OSS.
loaded_state_dict = {
    key: torch.zeros_like(value) for key, value in model.state_dict().items()
}
oss_storage_reader = fs.reader(OSS_URI)
DCP.load(
    loaded_state_dict,
    storage_reader=oss_storage_reader,
)
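
`DCP.async_save` returns a future, which lets training continue while the checkpoint is written. Calling `result()` blocks until the write finishes. The sketch below illustrates that waiting pattern with the standard library only. `slow_save` is a hypothetical placeholder for the checkpoint write and does not call PyTorch or OSS.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_save(step):
    """Hypothetical placeholder for a checkpoint write that takes some time."""
    time.sleep(0.05)
    return f"step-{step}"


completed = []
pending = None
with ThreadPoolExecutor(max_workers=1) as pool:
    for step in range(3):
        # ... one training step would run here, overlapping the previous save ...
        if pending is not None:
            # Wait for the previous checkpoint before starting the next one.
            completed.append(pending.result())
        pending = pool.submit(slow_save, step)
    # Wait for the last checkpoint before exiting.
    completed.append(pending.result())

print(completed)  # ['step-0', 'step-1', 'step-2']
```

Waiting on the previous future before starting a new save keeps at most one checkpoint write in flight at a time.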

Safetensors

OSS Connector for AI/ML supports the safetensors format starting from V1.2.0rc6. You can use OssSafetensor to directly store and read safetensors files on OSS.

This example shows how to use OssSafetensor to save and load safetensors files.

import torch
from osstorchconnector import OssSafetensor

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ossconnectorbucket/safetensors/model.safetensors"

sfts = OssSafetensor(endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH)

# Save tensors as a safetensors file to OSS.
tensors = {"embedding": torch.rand((512, 1024)), "attention": torch.rand((256, 256))}
metadata = {"a": "a", "b": "b"}
sfts.save_file(tensors, OSS_URI, metadata)

# Load a safetensors file from OSS.
loaded_tensors = sfts.load_file(OSS_URI, device="cpu")

# Or load tensors by using safe_open.
with sfts.safe_open(OSS_URI, device="cpu") as f:
    metadata = f.metadata()  # Get metadata.
    for key in f.keys():  # Read tensors by key.
        tensor = f.get_tensor(key)
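
The metadata and keys read through `safe_open` come from the header of the safetensors file. A safetensors file starts with an 8-byte little-endian header length, followed by a JSON header and then the raw tensor bytes. The following standard-library sketch builds and parses that layout in simplified form to show where those values come from. `build_safetensors` and `read_header` are hypothetical helpers for illustration. They are not part of OSS Connector for AI/ML or the safetensors library, and they do not describe how OssSafetensor is implemented.

```python
import json
import struct


def build_safetensors(tensors, metadata=None):
    """Serialize {name: (dtype, shape, raw_bytes)} into safetensors bytes."""
    header, payload = {}, b""
    if metadata:
        header["__metadata__"] = metadata
    for name, (dtype, shape, raw) in tensors.items():
        # Offsets are relative to the start of the data section.
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [len(payload), len(payload) + len(raw)],
        }
        payload += raw
    header_bytes = json.dumps(header).encode("utf-8")
    # Layout: 8-byte little-endian header length, JSON header, tensor data.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def read_header(blob):
    """Return (metadata, tensor_entries) parsed from safetensors bytes."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8:8 + header_len])
    metadata = header.pop("__metadata__", {})
    return metadata, header


# Two float32 values (8 bytes) stored under the key "embedding".
blob = build_safetensors(
    {"embedding": ("F32", [2], struct.pack("<2f", 1.0, 2.0))},
    metadata={"a": "a"},
)
metadata, entries = read_header(blob)
print(metadata)       # {'a': 'a'}
print(list(entries))  # ['embedding']
```

Because the header is small and sits at the start of the file, the keys and metadata can be listed without reading the tensor data itself.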