Object Storage Service: Store and access checkpoints by using OssCheckpoint

Last Updated: Apr 01, 2026

This article shows how to use OssCheckpoint to directly read from and write to checkpoints in Object Storage Service (OSS). A checkpoint saves a model's state at a specific point during training.

Prerequisites

OSS Connector for AI/ML is installed and configured. For more information, see Install OSS Connector for AI/ML and Configure OSS Connector for AI/ML.

OssCheckpoint

Use OssCheckpoint to read and write training results during model training.

This example shows how to use OssCheckpoint to read from and write to checkpoints.

import torch
from osstorchconnector import OssCheckpoint

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CRED_PATH = "/root/.alibabacloud/credentials"
CONFIG_PATH = "/etc/oss-connector/config.json"

# Create a checkpoint object.
checkpoint = OssCheckpoint(endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH)

# Read from a checkpoint.
CHECKPOINT_READ_URI = "oss://checkpoint/epoch.0"
with checkpoint.reader(CHECKPOINT_READ_URI) as reader:
    state_dict = torch.load(reader)

# Write to a checkpoint.
CHECKPOINT_WRITE_URI = "oss://checkpoint/epoch.1"
with checkpoint.writer(CHECKPOINT_WRITE_URI) as writer:
    torch.save(state_dict, writer)
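
The example above names checkpoints by epoch (epoch.0, epoch.1). The following is a minimal sketch of helpers that build such URIs and find the highest epoch in a list of checkpoint names obtained elsewhere. The names checkpoint_uri and latest_epoch and the prefix layout are assumptions for illustration and are not part of OSS Connector for AI/ML.

```python
import re

# Hypothetical helpers for illustration; not part of osstorchconnector.
_EPOCH_RE = re.compile(r"epoch\.(\d+)$")


def checkpoint_uri(prefix: str, epoch: int) -> str:
    """Build a URI such as oss://checkpoint/epoch.3 from a prefix and an epoch."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return f"{prefix.rstrip('/')}/epoch.{epoch}"


def latest_epoch(uris):
    """Return the highest epoch number found in the given URIs, or None."""
    epochs = [int(m.group(1)) for m in map(_EPOCH_RE.search, uris) if m]
    return max(epochs, default=None)


# Build the read and write URIs used in the example above.
read_uri = checkpoint_uri("oss://checkpoint", 0)
write_uri = checkpoint_uri("oss://checkpoint", 1)
```

The resulting strings can be passed to checkpoint.reader and checkpoint.writer in place of the hard-coded URIs.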

Data types

The checkpoint object created by OssCheckpoint implements common I/O interfaces. For more information, see Data types in OSS Connector for AI/ML.
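
Because the reader and writer are used like file objects, code written against Python's standard binary I/O interface can often accept them unchanged. The sketch below assumes that the reader supports read(size) in the same way as an io.BufferedIOBase object, which this article does not confirm. io.BytesIO stands in for an OSS reader, and sha256_of_stream is a hypothetical helper.

```python
import hashlib
import io


def sha256_of_stream(stream, chunk_size: int = 1 << 20) -> str:
    """Hash a binary file-like object in chunks and return the hex digest."""
    digest = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # An empty result signals end of stream.
            break
        digest.update(chunk)
    return digest.hexdigest()


# io.BytesIO is a stand-in for the object returned by checkpoint.reader(...).
fake_reader = io.BytesIO(b"checkpoint bytes")
checksum = sha256_of_stream(fake_reader)
```

Reading in fixed-size chunks keeps memory use bounded for large checkpoints, provided the underlying reader honors the size argument.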

Parameters

OssCheckpoint requires the following parameters.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| endpoint | string | Yes | The access domain name for OSS. For more information, see Regions and endpoints. |
| cred_path | string | Yes | The path of the credential file. The default path is /root/.alibabacloud/credentials. For more information, see Configure access credentials. |
| config_path | string | Yes | The path of the OSS Connector configuration file. The default path is /etc/oss-connector/config.json. For more information, see Configure OSS Connector. |
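
Since all three parameters are required, it can help to check them before constructing OssCheckpoint. The following is a minimal sketch of such a check. check_connector_args is a hypothetical helper, not part of OSS Connector for AI/ML, and it only verifies that the endpoint is non-empty and that both paths point to existing files.

```python
import os


def check_connector_args(endpoint: str, cred_path: str, config_path: str) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if not endpoint:
        problems.append("endpoint is empty")
    for name, path in (("cred_path", cred_path), ("config_path", config_path)):
        if not os.path.isfile(path):
            problems.append(f"{name} does not point to a file: {path}")
    return problems


# Example call with the paths used in this article.
issues = check_connector_args(
    "http://oss-cn-beijing-internal.aliyuncs.com",
    "/root/.alibabacloud/credentials",
    "/etc/oss-connector/config.json",
)
for issue in issues:
    print(issue)
```

This check does not validate the contents of either file; it only catches missing files early, before any request is sent to OSS.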

Distributed checkpoint (DCP)

OSS Connector for AI/ML supports the PyTorch Distributed Checkpoint (DCP) feature starting from V1.2.3. You can use OssDCPFileSystem to directly store and read distributed checkpoints on OSS.

This example shows how to use OssDCPFileSystem to save and load a distributed checkpoint.

import torchvision
import torch.distributed.checkpoint as DCP
from osstorchconnector import OssDCPFileSystem
import torch

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ossconnectorbucket/dcp-checkpoint-resnet18"

model = torchvision.models.resnet18()

# Write to OSS.
fs = OssDCPFileSystem(endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH)
oss_storage_writer = fs.writer(OSS_URI)
# Save asynchronously with DCP.async_save. DCP.save is the synchronous alternative.
checkpoint_future = DCP.async_save(
    state_dict=model.state_dict(),
    storage_writer=oss_storage_writer,
)
# Block until the asynchronous save completes.
checkpoint_future.result()


# Load from OSS.
# DCP.load fills tensors in place, so first allocate a state dict with matching shapes.
loaded_state_dict = {
    key: torch.zeros_like(value) for key, value in model.state_dict().items()
}
oss_storage_reader = fs.reader(OSS_URI)
DCP.load(
    loaded_state_dict,
    storage_reader=oss_storage_reader,
)

Safetensors

OSS Connector for AI/ML supports the safetensors format starting from V1.2.0rc6. You can use OssSafetensor to directly store and read safetensors files on OSS.

This example shows how to use OssSafetensor to save and load safetensors files.

import torch
from osstorchconnector import OssSafetensor

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ossconnectorbucket/safetensors/model.safetensors"

sfts = OssSafetensor(endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH)

# Save tensors as a safetensors file to OSS.
tensors = {"embedding": torch.rand((512, 1024)), "attention": torch.rand((256, 256))}
metadata = {"a": "a", "b": "b"}
sfts.save_file(tensors, OSS_URI, metadata)

# Load a safetensors file from OSS.
loaded_tensors = sfts.load_file(OSS_URI, device="cpu")

# Or load tensors by using safe_open.
with sfts.safe_open(OSS_URI, device="cpu") as f:
    metadata = f.metadata() # Get metadata.
    for key in f.keys(): # Read tensors by key.
        tensor = f.get_tensor(key)
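
The safetensors format stores metadata as a string-to-string map, which is why the example above uses string values. The following is a minimal sketch of converting arbitrary training metadata to that form before passing it to save_file. to_str_metadata is a hypothetical helper, and whether OssSafetensor applies additional restrictions on metadata is not covered here.

```python
import json


def to_str_metadata(metadata: dict) -> dict:
    """Convert keys and values to strings; non-string values are JSON-encoded."""
    result = {}
    for key, value in metadata.items():
        result[str(key)] = value if isinstance(value, str) else json.dumps(value)
    return result


# Numeric values become their JSON text form; strings pass through unchanged.
metadata = to_str_metadata({"epoch": 3, "lr": 0.001, "tag": "baseline"})
```

Values encoded this way can be recovered after loading by applying json.loads to the non-string fields.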