Platform For AI: Read data from and write data to OSS

Last Updated: Aug 03, 2023

This topic describes how to use OSS SDK for Python and OSS API for Python to read data from and write data to Object Storage Service (OSS).

Background information

OSS is a secure, cost-effective, and highly reliable cloud storage service provided by Alibaba Cloud. It enables you to store a large amount of data in the cloud. By default, Data Science Workshop (DSW) instances come with Network Attached Storage (NAS) file systems attached. You can also use DSW instances with OSS if you require larger storage.

OSS SDK for Python

In most cases, you can use OSS SDK for Python to read data from and write data to OSS. For more information, see the OSS SDK for Python documentation. The oss2 package is preinstalled in DSW.
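
If you want to confirm that the package is available in your environment, the following minimal check prints the installed oss2 version:

    import oss2
    # Print the installed version of the oss2 package. If the import fails,
    # install the package by running: pip install oss2
    print(oss2.__version__)

The following steps describe how to read data from and write data to OSS: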

  1. Authentication and initialization.

    import oss2
    # Create an authentication object from your AccessKey pair.
    auth = oss2.Auth('<your_AccessKey_ID>', '<your_AccessKey_Secret>')
    # Bind the bucket by using the endpoint of the region in which your DSW instance is deployed.
    bucket = oss2.Bucket(auth, 'http://oss-cn-beijing-internal.aliyuncs.com', '<your_bucket_name>')

    Set the following parameters based on your requirements.

    • <your_AccessKey_ID>: the AccessKey ID of your Alibaba Cloud account.

    • <your_AccessKey_Secret>: the AccessKey secret of your Alibaba Cloud account.

    • http://oss-cn-beijing-internal.aliyuncs.com: the endpoint of OSS. Select an endpoint based on the region where your DSW instances are deployed.

      • Pay-as-you-go instances deployed in China (Beijing): oss-cn-beijing.aliyuncs.com

      • Subscription instances deployed in China (Beijing): oss-cn-beijing-internal.aliyuncs.com

      • GPU P100 instances and CPU instances deployed in China (Shanghai): oss-cn-shanghai.aliyuncs.com

      • GPU M40 instances deployed in China (Shanghai): oss-cn-shanghai-internal.aliyuncs.com

    • <your_bucket_name>: the name of the OSS bucket. The name cannot start with oss://.
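
    To avoid hardcoding the AccessKey pair in your notebook, you can also read the credentials from environment variables. The following is a minimal sketch; the variable names OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET are assumptions for this example and must match the names that you export in your environment.

    import os
    import oss2

    # Read the AccessKey pair from environment variables instead of writing
    # it into the notebook. The variable names are assumptions for this sketch.
    access_key_id = os.environ['OSS_ACCESS_KEY_ID']
    access_key_secret = os.environ['OSS_ACCESS_KEY_SECRET']

    auth = oss2.Auth(access_key_id, access_key_secret)
    bucket = oss2.Bucket(auth, 'http://oss-cn-beijing-internal.aliyuncs.com', '<your_bucket_name>')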

  2. Read data from and write data to OSS.

    # Read a file from OSS.
    result = bucket.get_object('<your_file_path/your_file>')
    print(result.read())
    # Read data by range.
    result = bucket.get_object('<your_file_path/your_file>', byte_range=(0, 99))
    # Write data to OSS.
    bucket.put_object('<your_file_path/your_file>', '<your_object_content>')
    # Append content to a file.
    result = bucket.append_object('<your_file_path/your_file>', 0, '<your_object_content>')
    result = bucket.append_object('<your_file_path/your_file>', result.next_position, '<your_object_content>')

    <your_file_path/your_file> indicates the OSS path of the object that you want to read or write. <your_object_content> indicates the content that you want to write or append. Set the parameters based on your requirements.
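
    If you need to exchange whole files between the DSW instance and OSS, or to enumerate the objects in a bucket, the following sketch shows other calls provided by OSS SDK for Python (get_object_to_file, put_object_from_file, object_exists, and ObjectIterator). The paths are placeholders that you replace with your own.

    # Download an OSS object to a local file, and upload a local file to OSS.
    bucket.get_object_to_file('<your_file_path/your_file>', '<your_local_file>')
    bucket.put_object_from_file('<your_file_path/your_file>', '<your_local_file>')
    # Check whether an object exists before you read it.
    if bucket.object_exists('<your_file_path/your_file>'):
        print('The object exists.')
    # List the objects whose names start with a specific prefix.
    for obj in oss2.ObjectIterator(bucket, prefix='<your_file_path>/'):
        print(obj.key)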

OSS API for Python

For PyTorch users, DSW provides OSS API for Python to read data from and write data to OSS.

You can store training data and model files in OSS.

  • Load training data

    You can store training data in an OSS bucket. The paths and labels of the data samples must be stored in an index file in the same OSS bucket. You can implement a custom Dataset class and use the DataLoader API in PyTorch to read data in parallel through multiple worker processes. The following code block shows an example:

    import io
    import oss2
    import torch
    from PIL import Image
    class OSSDataset(torch.utils.data.dataset.Dataset):
        def __init__(self, endpoint, bucket, auth, index_file):
            self._bucket = oss2.Bucket(auth, endpoint, bucket)
            # The index file lists samples as comma-separated "<path>:<label>" entries.
            self._indices = self._bucket.get_object(index_file).read().decode('utf-8').split(',')
        def __len__(self):
            return len(self._indices)
        def __getitem__(self, index):
            img_path, label = self._indices[index].strip().split(':')
            # Download the image object and decode it in memory.
            img_str = self._bucket.get_object(img_path)
            img_buf = io.BytesIO()
            img_buf.write(img_str.read())
            img_buf.seek(0)
            img = Image.open(img_buf).convert('RGB')
            img_buf.close()
            return img, label
    dataset = OSSDataset(endpoint, bucket, auth, index_file)
    data_loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_loaders,
        pin_memory=True)

    endpoint indicates the endpoint of OSS. bucket indicates the name of the OSS bucket. auth indicates the oss2.Auth authentication object. index_file indicates the path of the index file in the bucket. Set the parameters based on your requirements.

    Note

    In this topic, samples in the index file are separated with commas (,). The sample path and labels are separated with colons (:).
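
    The following sketch shows how an index file in this format could be built and uploaded by using the bucket object from the previous section. The object paths and labels are made up for this example.

    # Build the index content as comma-separated "<path>:<label>" entries and
    # upload it to the bucket. The sample paths and labels are placeholders.
    samples = [('images/cat_001.jpg', '0'), ('images/dog_001.jpg', '1')]
    index_content = ','.join('{}:{}'.format(path, label) for path, label in samples)
    # index_content is 'images/cat_001.jpg:0,images/dog_001.jpg:1'.
    bucket.put_object('<your_index_file_path>', index_content)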

  • Save or load models

    You can use OSS SDK for Python (oss2) to save or load PyTorch models. For more information about how to save or load models by using PyTorch, see the PyTorch documentation.

    • Save a model

      from io import BytesIO
      import torch
      import oss2
      # The name of the OSS bucket in which the model is stored.
      bucket_name = "<your_bucket_name>"
      bucket = oss2.Bucket(auth, endpoint, bucket_name)
      # Serialize the model state into an in-memory buffer and upload it to OSS.
      buffer = BytesIO()
      torch.save(model.state_dict(), buffer)
      bucket.put_object("<your_model_path>", buffer.getvalue())

      endpoint indicates the endpoint of OSS. bucket_name indicates the name of the OSS bucket. The name cannot start with oss://. auth indicates the oss2.Auth authentication object. <your_model_path> indicates the OSS path in which you want to store the model. Set the parameters based on your requirements.

    • Load a model

      from io import BytesIO
      import torch
      import oss2
      bucket_name = "<your_bucket_name>"
      bucket = oss2.Bucket(auth, endpoint, bucket_name)
      # Download the serialized model and load the state dictionary into the model.
      buffer = BytesIO(bucket.get_object("<your_model_path>").read())
      model.load_state_dict(torch.load(buffer))

      endpoint indicates the endpoint of OSS. bucket_name indicates the name of the OSS bucket. The name cannot start with oss://. auth indicates the oss2.Auth authentication object. <your_model_path> indicates the OSS path in which the model is stored. Set the parameters based on your requirements.
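
    If you also want to resume training, a common pattern is to serialize a checkpoint dictionary that contains the model state, the optimizer state, and the current epoch into a single OSS object. The following is a minimal sketch under the same assumptions as above: auth, endpoint, model, optimizer, and epoch already exist in your code, and <your_checkpoint_path> is a placeholder.

      from io import BytesIO
      import torch
      import oss2
      bucket = oss2.Bucket(auth, endpoint, "<your_bucket_name>")
      # Save the model state, optimizer state, and current epoch in one object.
      buffer = BytesIO()
      torch.save({
          "epoch": epoch,
          "model_state_dict": model.state_dict(),
          "optimizer_state_dict": optimizer.state_dict(),
      }, buffer)
      bucket.put_object("<your_checkpoint_path>", buffer.getvalue())
      # Restore the checkpoint later to resume training.
      checkpoint = torch.load(BytesIO(bucket.get_object("<your_checkpoint_path>").read()))
      model.load_state_dict(checkpoint["model_state_dict"])
      optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
      start_epoch = checkpoint["epoch"] + 1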