This article shows how to use OssCheckpoint to directly read from and write to checkpoints in Object Storage Service (OSS). A checkpoint saves a model's state at a specific point during training.
Prerequisites
OSS Connector for AI/ML is installed and configured. For more information, see Install OSS Connector for AI/ML and Configure OSS Connector for AI/ML.
OssCheckpoint
Use OssCheckpoint to read and write training results during model training.
This example shows how to use OssCheckpoint to read from and write to checkpoints.
import torch
from osstorchconnector import OssCheckpoint
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CRED_PATH = "/root/.alibabacloud/credentials"
CONFIG_PATH = "/etc/oss-connector/config.json"
# Create a checkpoint object.
checkpoint = OssCheckpoint(endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH)
# Read from a checkpoint.
CHECKPOINT_READ_URI = "oss://checkpoint/epoch.0"
with checkpoint.reader(CHECKPOINT_READ_URI) as reader:
    state_dict = torch.load(reader)
# Write to a checkpoint.
CHECKPOINT_WRITE_URI = "oss://checkpoint/epoch.1"
with checkpoint.writer(CHECKPOINT_WRITE_URI) as writer:
    torch.save(state_dict, writer)
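The example reads `epoch.0` and writes `epoch.1`. In a longer training loop it can help to derive these URIs instead of hard-coding them. The following is a minimal sketch under that assumption: the helper names (`checkpoint_uri`, `epoch_of`, `next_checkpoint_uri`) are hypothetical and not part of OSS Connector for AI/ML, and the `<prefix>/epoch.<n>` layout is only a convention borrowed from the URIs above.

```python
import re

# Hypothetical helpers, not part of the connector. The "<prefix>/epoch.<n>"
# layout is an assumption taken from the example URIs, not a requirement.
_EPOCH_RE = re.compile(r"/epoch\.(\d+)$")


def checkpoint_uri(prefix: str, epoch: int) -> str:
    """Build the URI of the checkpoint for a given epoch."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return f"{prefix.rstrip('/')}/epoch.{epoch}"


def epoch_of(uri: str) -> int:
    """Extract the epoch number from a checkpoint URI."""
    match = _EPOCH_RE.search(uri)
    if match is None:
        raise ValueError(f"not an epoch checkpoint URI: {uri!r}")
    return int(match.group(1))


def next_checkpoint_uri(uri: str) -> str:
    """Return the URI for the epoch that follows the given checkpoint."""
    prefix = _EPOCH_RE.sub("", uri)
    return checkpoint_uri(prefix, epoch_of(uri) + 1)
```

With these helpers, `next_checkpoint_uri(CHECKPOINT_READ_URI)` could replace the hard-coded `CHECKPOINT_WRITE_URI` in the example.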
Data types
The checkpoint object created by OssCheckpoint implements common I/O interfaces. For more information, see Data types in OSS Connector for AI/ML.
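To illustrate what a file-like I/O interface means in practice, the sketch below writes save and load helpers against any binary stream. It is only an analogy that runs without OSS access: `io.BytesIO` stands in for the objects returned by `checkpoint.writer()` and `checkpoint.reader()`, and `pickle` stands in for `torch.save` and `torch.load`. The helper names `save_state` and `load_state` are hypothetical.

```python
import io
import pickle
from typing import Any, BinaryIO


def save_state(state: Any, writer: BinaryIO) -> None:
    """Serialize `state` into any writable binary file-like object."""
    pickle.dump(state, writer)


def load_state(reader: BinaryIO) -> Any:
    """Deserialize a state object from any readable binary file-like object."""
    return pickle.load(reader)


# Local stand-in for an OSS writer/reader pair (assumption for illustration).
buffer = io.BytesIO()
save_state({"epoch": 1, "loss": 0.25}, buffer)

buffer.seek(0)  # Rewind so the same buffer can be read back.
restored = load_state(buffer)
```

Because the helpers depend only on the stream interface, the same calling pattern applies when the stream comes from a `with checkpoint.writer(...)` or `with checkpoint.reader(...)` block.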
Parameters
OssCheckpoint requires the following parameters.
Parameter | Type | Required | Description |
endpoint | string | Yes | The access domain name for OSS. For more information, see Regions and endpoints. |
cred_path | string | Yes | The default path of the credential file is |
config_path | string | Yes | The default path of the OSS Connector configuration file is |
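Because both file paths are required, a pre-flight check can surface a missing file before any OSS request is made. This is a minimal sketch; `missing_connector_files` is a hypothetical helper and not part of the connector, and it checks only that the files exist, not that their contents are valid.

```python
from pathlib import Path
from typing import List


def missing_connector_files(cred_path: str, config_path: str) -> List[str]:
    """Return the given paths that do not point to an existing file."""
    return [path for path in (cred_path, config_path) if not Path(path).is_file()]


def require_connector_files(cred_path: str, config_path: str) -> None:
    """Raise FileNotFoundError if either required file is missing."""
    missing = missing_connector_files(cred_path, config_path)
    if missing:
        raise FileNotFoundError("missing connector file(s): " + ", ".join(missing))
```

Calling `require_connector_files(CRED_PATH, CONFIG_PATH)` before constructing `OssCheckpoint` is one way to fail early with a clear message.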
Distributed checkpoint (DCP)
OSS Connector for AI/ML supports the PyTorch Distributed Checkpoint (DCP) feature starting from V1.2.3. You can use OssDCPFileSystem to directly store and read distributed checkpoints on OSS.
This example shows how to use OssDCPFileSystem to save and load a distributed checkpoint.
import torchvision
import torch.distributed.checkpoint as DCP
from osstorchconnector import OssDCPFileSystem
import torch
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ossconnectorbucket/dcp-checkpoint-resnet18"
model = torchvision.models.resnet18()
# Write to OSS.
fs = OssDCPFileSystem(endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH)
oss_storage_writer = fs.writer(OSS_URI)
# Use DCP.save or DCP.async_save.
checkpoint_future = DCP.async_save(
    state_dict=model.state_dict(),
    storage_writer=oss_storage_writer,
)
checkpoint_future.result()
# Load from OSS.
loaded_state_dict = {
    key: torch.zeros_like(value) for key, value in model.state_dict().items()
}
oss_storage_reader = fs.reader(OSS_URI)
DCP.load(
    loaded_state_dict,
    storage_reader=oss_storage_reader,
)
Safetensors
OSS Connector for AI/ML supports the safetensors format starting from V1.2.0rc6. You can use OssSafetensor to directly store and read safetensors files on OSS.
This example shows how to use OssSafetensor to save and load safetensors files.
import torch
from osstorchconnector import OssSafetensor
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ossconnectorbucket/safetensors/model.safetensors"
sfts = OssSafetensor(endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH)
# Save tensors as a safetensors file to OSS.
tensors = {"embedding": torch.rand((512, 1024)), "attention": torch.rand((256, 256))}
metadata = {"a": "a", "b": "b"}
sfts.save_file(tensors, OSS_URI, metadata)
# Load a safetensors file from OSS.
loaded_tensors = sfts.load_file(OSS_URI, device="cpu")
# Or load tensors by using safe_open.
with sfts.safe_open(OSS_URI, device="cpu") as f:
    metadata = f.metadata()  # Get metadata.
    for key in f.keys():  # Read tensors by key.
        tensor = f.get_tensor(key)
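The metadata in the example maps strings to strings, which matches the safetensors format, where header metadata is a string-to-string map. If you want to record values such as an epoch number or a learning rate, one option is to encode them as strings first. The sketch below assumes `OssSafetensor.save_file` follows the same convention as the safetensors library; `to_safetensors_metadata` is a hypothetical helper.

```python
import json
from typing import Any, Dict


def to_safetensors_metadata(values: Dict[str, Any]) -> Dict[str, str]:
    """Convert arbitrary JSON-serializable values to a str-to-str map.

    Strings are kept as-is; every other value is JSON-encoded.
    """
    return {
        str(key): value if isinstance(value, str) else json.dumps(value)
        for key, value in values.items()
    }


metadata = to_safetensors_metadata({"epoch": 3, "lr": 0.001, "tag": "baseline"})
```

The resulting dictionary can be passed as the `metadata` argument in the `save_file` call shown above.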