Platform for AI: Read data from and write data to OSS

Last Updated: Oct 31, 2024

This topic describes how to use OSS SDK for Python and OSS API for Python to read data from and write data to Object Storage Service (OSS).

Background information

OSS is a secure, cost-effective, and highly reliable cloud storage service provided by Alibaba Cloud. You can use OSS to store a large amount of data in the cloud. By default, Data Science Workshop (DSW) instances of Platform for AI (PAI) are preset with File Storage NAS file systems and compatible with OSS.

If you need to frequently access and process large-scale data, we recommend that you register an OSS bucket as a dataset when you create a DSW instance and then mount the dataset. If you need only temporary access to OSS data for specific business requirements, you can access OSS directly by using the OSS SDKs or APIs.

OSS Python SDK

The oss2 package for Python is preinstalled in DSW. The following steps describe how to read data from and write data to OSS.

  1. Authentication and initialization.

    import oss2
    auth = oss2.Auth('<your_AccessKey_ID>', '<your_AccessKey_Secret>')
    bucket = oss2.Bucket(auth, 'http://oss-cn-beijing-internal.aliyuncs.com', '<your_bucket_name>')

    Configure the following parameters based on your requirements.

    • <your_AccessKey_ID>: The AccessKey ID of your Alibaba Cloud account.

    • <your_AccessKey_Secret>: The AccessKey secret of your Alibaba Cloud account.

    • http://oss-cn-beijing-internal.aliyuncs.com: The endpoint of the OSS bucket. Select an endpoint based on the region in which your DSW instances are deployed. Examples:

      • Pay-as-you-go instances deployed in the China (Beijing) region: oss-cn-beijing.aliyuncs.com

      • Subscription instances deployed in the China (Beijing) region: oss-cn-beijing-internal.aliyuncs.com

      • GPU P100 instances and CPU instances deployed in the China (Shanghai) region: oss-cn-shanghai.aliyuncs.com

      • GPU M40 instances deployed in the China (Shanghai) region: oss-cn-shanghai-internal.aliyuncs.com

    • <your_bucket_name>: The name of the OSS bucket. The name cannot start with oss://.

  2. Read data from and write data to OSS.

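    # Read an entire object from OSS.
    result = bucket.get_object('<your_file_path/your_file>')
    print(result.read())
    # Read a byte range. Both ends are inclusive, so (0, 99) returns the first 100 bytes.
    result = bucket.get_object('<your_file_path/your_file>', byte_range=(0, 99))
    # Write data to OSS. An existing object at the same path is overwritten.
    bucket.put_object('<your_file_path/your_file>', '<your_object_content>')
    # Append data. The first call uses position 0, so use a path that does not
    # already hold an object written by put_object.
    result = bucket.append_object('<your_file_path/your_file>', 0, '<your_object_content>')
    # Later calls continue from the position returned by the previous call.
    result = bucket.append_object('<your_file_path/your_file>', result.next_position, '<your_object_content>')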

    <your_file_path/your_file> specifies the OSS path from which the data is read or to which the data is written. <your_object_content> specifies the content that you want to write or append to the file.
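
The range read in step 2 can be extended to read a large object in pieces. The following is a minimal sketch, not part of the oss2 package: the helper name inclusive_byte_ranges and the chunk size are illustrative assumptions. It relies on byte_range following HTTP Range semantics, in which both ends are inclusive.

```python
def inclusive_byte_ranges(total_size, chunk_size):
    """Yield (start, end) pairs that cover total_size bytes, both ends inclusive."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_size, chunk_size):
        # The end offset is inclusive, so subtract 1 from the exclusive bound.
        yield (start, min(start + chunk_size, total_size) - 1)


# Hypothetical usage with an existing oss2.Bucket named `bucket`
# and a known object size `object_size`:
# for byte_range in inclusive_byte_ranges(object_size, 4 * 1024 * 1024):
#     chunk = bucket.get_object('<your_file_path/your_file>', byte_range=byte_range).read()

print(list(inclusive_byte_ranges(250, 100)))
# [(0, 99), (100, 199), (200, 249)]
```

Each tuple can be passed directly as the byte_range argument of get_object.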

OSS Python API

For PyTorch users, DSW provides OSS API for Python to read data from and write data to OSS.

You can store training data, logs, and model files in OSS.

  • Load training data

    You can store training data in an OSS bucket. The paths and labels of the samples must be stored in an index file in the same OSS bucket. You can define a custom Dataset class and use the PyTorch DataLoader API to read data in parallel by using multiple worker processes. Sample code:

    import io
    import oss2
    import torch
    from PIL import Image

    class OSSDataset(torch.utils.data.Dataset):
        def __init__(self, endpoint, bucket, auth, index_file):
            self._bucket = oss2.Bucket(auth, endpoint, bucket)
            # read() returns bytes. The index file is assumed to be UTF-8 text.
            index_text = self._bucket.get_object(index_file).read().decode('utf-8')
            self._indices = [s for s in index_text.split(',') if s.strip()]

        def __len__(self):
            return len(self._indices)

        def __getitem__(self, index):
            # Each sample has the form <image_path>:<label>.
            img_path, label = self._indices[index].strip().split(':')
            # Download the image into memory and decode it.
            img_bytes = self._bucket.get_object(img_path).read()
            img = Image.open(io.BytesIO(img_bytes)).convert('RGB')
            # Convert img to a tensor here (for example, with a torchvision transform)
            # so that the default collate function of DataLoader can batch the samples.
            return img, label

    dataset = OSSDataset(endpoint, bucket, auth, index_file)
    data_loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_loaders,
        pin_memory=True)

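    In the preceding code, endpoint specifies the endpoint of OSS, bucket specifies the name of the OSS bucket, auth specifies the oss2.Auth object that is used for authentication, and index_file specifies the path of the index file. batch_size and num_loaders specify the batch size and the number of DataLoader worker processes.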

    Note

    In this example, samples in the index file are separated with commas (,). Within each sample, the path and the label are separated with a colon (:).
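
    To make the index file format concrete, the following minimal sketch parses index content into (path, label) pairs without contacting OSS. The function name parse_index and the sample file names are hypothetical and used only for illustration.

```python
def parse_index(raw):
    """Parse index file content into a list of (path, label) tuples."""
    # get_object(...).read() returns bytes, so decode before splitting.
    text = raw.decode('utf-8') if isinstance(raw, bytes) else raw
    samples = []
    for entry in text.split(','):
        entry = entry.strip()
        if not entry:
            continue  # Skip empty entries, for example after a trailing comma.
        # Split on the last colon so that a path containing ':' still parses.
        path, label = entry.rsplit(':', 1)
        samples.append((path, label))
    return samples


print(parse_index(b"images/cat_001.jpg:0,images/dog_001.jpg:1\n"))
# [('images/cat_001.jpg', '0'), ('images/dog_001.jpg', '1')]
```

    Labels are returned as strings. Convert them to integers before you use them as class indices in training.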

  • Save or load models

    You can use the oss2 API for Python to save PyTorch models to OSS or load them from OSS. For information about how to save or load models by using PyTorch, see PyTorch.

    • Save a model

      from io import BytesIO
      import torch
      import oss2
      # The bucket name must not start with oss://.
      bucket_name = "<your_bucket_name>"
      bucket = oss2.Bucket(auth, endpoint, bucket_name)
      buffer = BytesIO()
      torch.save(model.state_dict(), buffer)
      bucket.put_object("<your_model_path>", buffer.getvalue())

      In the preceding code, endpoint specifies the endpoint of OSS and <your_bucket_name> specifies the name of the OSS bucket. The name cannot start with oss://. auth specifies the oss2.Auth object that is used for authentication, model specifies the PyTorch model to save, and <your_model_path> specifies the path of the model file in the bucket.

    • Load a model

      from io import BytesIO
      import torch
      import oss2
      bucket_name = "<your_bucket_name>"
      bucket = oss2.Bucket(auth, endpoint, bucket_name)
      buffer = BytesIO(bucket.get_object("<your_model_path>").read())
      model.load_state_dict(torch.load(buffer))

      In the preceding code, endpoint specifies the endpoint of OSS and <your_bucket_name> specifies the name of the OSS bucket. The name cannot start with oss://. auth specifies the oss2.Auth object that is used for authentication, model specifies an instance of the model whose parameters are loaded, and <your_model_path> specifies the path of the model file in the bucket.
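
Because the bucket name cannot start with oss://, a full URI such as oss://<bucket>/<path> must be split before it is passed to oss2.Bucket and to get_object or put_object. The following is a minimal sketch of such a split. The helper split_oss_uri and the example URI are hypothetical and are not part of the oss2 package.

```python
def split_oss_uri(uri):
    """Split 'oss://<bucket>/<key>' into (bucket_name, object_key)."""
    prefix = 'oss://'
    if not uri.startswith(prefix):
        raise ValueError("expected a URI that starts with oss://")
    # Everything up to the first '/' is the bucket name; the rest is the object key.
    bucket_name, _, object_key = uri[len(prefix):].partition('/')
    if not bucket_name or not object_key:
        raise ValueError("URI must contain both a bucket name and an object key")
    return bucket_name, object_key


bucket_name, model_path = split_oss_uri('oss://example-bucket/models/model.pth')
print(bucket_name, model_path)
# example-bucket models/model.pth
```

The returned bucket name can then be used as <your_bucket_name> and the object key as <your_model_path> in the preceding samples.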