This topic describes how to use OSS SDK for Python and OSS API for Python to read data from and write data to Object Storage Service (OSS).
Background information
OSS is a secure, cost-effective, and highly reliable cloud storage service provided by Alibaba Cloud. You can use OSS to store a large amount of data in the cloud. By default, Data Science Workshop (DSW) instances of Platform for AI (PAI) come with preset File Storage NAS file systems and are compatible with OSS.
If you need to frequently access and process large-scale data, we recommend that you register an OSS bucket as a dataset when you create a DSW instance and then mount the dataset. If you only need to access OSS data temporarily for specific business requirements, you can access OSS in a more flexible manner, such as by using OSS SDKs and APIs, as described in the following sections.
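After a dataset is mounted, it appears as a local directory in the DSW instance and can be read with standard file operations, without the OSS SDK. The following is a minimal sketch under the assumption that the dataset is mounted at /mnt/data and contains a file named train.csv; both names are hypothetical and depend on your own dataset configuration.

# Hypothetical example: read a file from an OSS dataset mounted at /mnt/data.
# The mount path and file name depend on your own dataset configuration.
with open('/mnt/data/train.csv') as f:
    print(f.readline())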
OSS Python SDK
DSW is preinstalled with the oss2 package for Python. The following section describes how to read data from and write data to OSS.
Authentication and initialization.
import oss2

auth = oss2.Auth('<your_AccessKey_ID>', '<your_AccessKey_Secret>')
bucket = oss2.Bucket(auth, 'http://oss-cn-beijing-internal.aliyuncs.com', '<your_bucket_name>')
Configure the following parameters based on your requirements:

<your_AccessKey_ID>: The AccessKey ID of your Alibaba Cloud account.

<your_AccessKey_Secret>: The AccessKey secret of your Alibaba Cloud account.

http://oss-cn-beijing-internal.aliyuncs.com: The endpoint of the OSS bucket. Select an endpoint based on the region in which your DSW instances are deployed. Examples:
Pay-as-you-go instances deployed in the China (Beijing) region: oss-cn-beijing.aliyuncs.com
Subscription instances deployed in the China (Beijing) region: oss-cn-beijing-internal.aliyuncs.com
GPU P100 instances and CPU instances deployed in the China (Shanghai) region: oss-cn-shanghai.aliyuncs.com
GPU M40 instances deployed in the China (Shanghai) region: oss-cn-shanghai-internal.aliyuncs.com

<your_bucket_name>: The name of the OSS bucket. The name cannot start with oss://.
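To avoid hard-coding the AccessKey pair in notebooks, you can read it from environment variables before you initialize the client. The following is a minimal sketch; the variable names OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET are example names that you must set in your own environment.

import os
import oss2

# Hypothetical example: read the AccessKey pair from environment variables
# instead of writing it in plain text in the code.
access_key_id = os.environ['OSS_ACCESS_KEY_ID']
access_key_secret = os.environ['OSS_ACCESS_KEY_SECRET']
auth = oss2.Auth(access_key_id, access_key_secret)
bucket = oss2.Bucket(auth, 'http://oss-cn-beijing-internal.aliyuncs.com', '<your_bucket_name>')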
Read data from and write data to OSS.
# Read a file from OSS.
result = bucket.get_object('<your_file_path/your_file>')
print(result.read())

# Read data by range.
result = bucket.get_object('<your_file_path/your_file>', byte_range=(0, 99))

# Write data to OSS.
bucket.put_object('<your_file_path/your_file>', '<your_object_content>')

# Append data to a file.
result = bucket.append_object('<your_file_path/your_file>', 0, '<your_object_content>')
result = bucket.append_object('<your_file_path/your_file>', result.next_position, '<your_object_content>')
In the preceding code, <your_file_path/your_file> specifies the OSS path from which the data is read or to which the data is written, and <your_object_content> specifies the content that you want to write or append to the file.
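If you prefer to work with local files rather than in-memory objects, the oss2 SDK also provides methods that download an object to a local file and upload a local file to OSS. The following is a minimal sketch; the local path /tmp/local_copy is a placeholder example.

# Download an OSS object to a local file in the DSW instance.
bucket.get_object_to_file('<your_file_path/your_file>', '/tmp/local_copy')

# Upload a local file to OSS.
bucket.put_object_from_file('<your_file_path/your_file>', '/tmp/local_copy')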
OSS Python API
For PyTorch users, DSW provides OSS API for Python to read data from and write data to OSS.
You can store training data, logs, and model files in OSS.
Load training data
You can store training data in an OSS bucket. The paths and labels of the data must be stored in an index file in the same OSS bucket. You can implement a custom Dataset and use the DataLoader API in PyTorch to read data in parallel with multiple worker processes. Sample code:
import io

import oss2
import torch
from PIL import Image


class OSSDataset(torch.utils.data.dataset.Dataset):
    def __init__(self, endpoint, bucket, auth, index_file):
        self._bucket = oss2.Bucket(auth, endpoint, bucket)
        # The index file lists samples separated by commas. Each sample is in the
        # <image_path>:<label> format.
        self._indices = self._bucket.get_object(index_file).read().decode().split(',')

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, index):
        img_path, label = self._indices[index].strip().split(':')
        # Download the image from OSS into an in-memory buffer and decode it.
        img_str = self._bucket.get_object(img_path)
        img_buf = io.BytesIO()
        img_buf.write(img_str.read())
        img_buf.seek(0)
        img = Image.open(img_buf).convert('RGB')
        img_buf.close()
        return img, label


dataset = OSSDataset(endpoint, bucket, auth, index_file)
data_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=batch_size,
    num_workers=num_loaders,
    pin_memory=True)
In the preceding code, endpoint specifies the endpoint of OSS, bucket specifies the name of the OSS bucket, auth specifies the authentication object, and index_file specifies the path of the index file.
Note: In this topic, samples in the index file are separated with commas (,), and the sample path and label are separated with a colon (:).
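As written, __getitem__ returns a PIL image and a string label, which the default DataLoader collation cannot batch directly. The following is a minimal sketch of one way to convert samples to tensors before batching; it assumes that torchvision is available and that the label in the index file is an integer class ID. Both assumptions are for illustration and are not part of the original example.

from torchvision import transforms

# Hypothetical example: convert each sample to tensors so that DataLoader can batch it.
to_tensor = transforms.ToTensor()

class TensorOSSDataset(OSSDataset):
    def __getitem__(self, index):
        img, label = super().__getitem__(index)
        # Assumes that the label in the index file is an integer class ID.
        return to_tensor(img), int(label)

dataset = TensorOSSDataset(endpoint, bucket, auth, index_file)
data_loader = torch.utils.data.DataLoader(
    dataset, batch_size=batch_size, num_workers=num_loaders, pin_memory=True)

for images, labels in data_loader:
    pass  # Training loop goes here.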
Save or load models
You can use the OSS SDK for Python (oss2) to save or load PyTorch models. For more information about how to save or load models in PyTorch, see the PyTorch documentation.
Save a model
from io import BytesIO

import oss2
import torch

bucket_name = "<your_bucket_name>"
bucket = oss2.Bucket(auth, endpoint, bucket_name)

# Serialize the model state dictionary into an in-memory buffer and upload it to OSS.
buffer = BytesIO()
torch.save(model.state_dict(), buffer)
bucket.put_object("<your_model_path>", buffer.getvalue())
In the preceding code, endpoint specifies the endpoint of OSS, <your_bucket_name> specifies the name of the OSS bucket (the name cannot start with oss://), auth specifies the authentication object, and <your_model_path> specifies the path of the model file.
Load a model
from io import BytesIO

import oss2
import torch

bucket_name = "<your_bucket_name>"
bucket = oss2.Bucket(auth, endpoint, bucket_name)

# Download the model file from OSS into memory and load the state dictionary.
buffer = BytesIO(bucket.get_object("<your_model_path>").read())
model.load_state_dict(torch.load(buffer))
In the preceding code, endpoint specifies the endpoint of OSS, <your_bucket_name> specifies the name of the OSS bucket (the name cannot start with oss://), auth specifies the authentication object, and <your_model_path> specifies the path of the model file.
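If you also want to resume training, you can extend the same pattern to a full checkpoint that contains the model and optimizer states. The following is a minimal sketch under that assumption; the checkpoint keys, the optimizer and epoch variables, and the <your_checkpoint_path> placeholder are illustrative and not part of the original example.

from io import BytesIO

import torch

# Hypothetical example: save a training checkpoint (model and optimizer states) to OSS.
checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "epoch": epoch,
}
buffer = BytesIO()
torch.save(checkpoint, buffer)
bucket.put_object("<your_checkpoint_path>", buffer.getvalue())

# Load the checkpoint from OSS and restore the states.
buffer = BytesIO(bucket.get_object("<your_checkpoint_path>").read())
checkpoint = torch.load(buffer)
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])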