
Read and write OSS data

Last Updated: Mar 02, 2020

Alibaba Cloud Object Storage Service (OSS) (https://www.aliyun.com/product/oss) is a high-capacity, secure, cost-effective, and highly reliable cloud storage service. In addition to the built-in Network Attached Storage (NAS) system, Data Science Workshop (DSW) is integrated with OSS. This topic describes how to use DSW to read data from and write data to OSS.

OSS Python SDK

In most cases, you can call the OSS Python API to read data from and write data to OSS. For more information, see https://aliyun-oss-python-sdk.readthedocs.io/en/stable/oss2.html.

The oss2 Python package is preinstalled in DSW. Before you call the OSS API, refer to the following example to perform authentication and initialization:

import oss2

auth = oss2.Auth('your-access-key-id', 'your-access-key-secret')

# OSS endpoint descriptions:
# DSW CPU instances deployed in China (Beijing): oss-cn-beijing.aliyuncs.com
# DSW GPU instances deployed in China (Beijing): oss-cn-beijing-internal.aliyuncs.com
# DSW GPU P100 instances and CPU instances deployed in China (Shanghai): oss-cn-shanghai.aliyuncs.com
# DSW GPU M40 instances deployed in China (Shanghai): oss-cn-shanghai-internal.aliyuncs.com

# The name of the specified OSS bucket (oss:// is excluded).
bucket = oss2.Bucket(auth, 'http://oss-cn-beijing-internal.aliyuncs.com', '<your_bucket_name>')
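
Hardcoding credentials in a notebook is easy to leak. As a minimal sketch, assuming you have exported the hypothetical environment variables OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET in your DSW instance beforehand, you can read them at run time instead:

import os
import oss2

# Assumes OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET were exported in the
# DSW environment; adjust the variable names to your own setup.
auth = oss2.Auth(os.environ['OSS_ACCESS_KEY_ID'], os.environ['OSS_ACCESS_KEY_SECRET'])
bucket = oss2.Bucket(auth, 'http://oss-cn-beijing-internal.aliyuncs.com', '<your_bucket_name>')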

Run the following sample code to read data from and write data to OSS:

# Read all data in an object.
result = bucket.get_object('path/to/your_file')
print(result.read())

# Read data within a specified range.
result = bucket.get_object('path/to/your_file', byte_range=(0, 99))

# Write data to OSS.
bucket.put_object('path/to/your_file', 'content of the object')

# Append to an object.
result = bucket.append_object('path/to/your_file', 0, 'content of the object')
result = bucket.append_object('path/to/your_file', result.next_position, 'content of the object')
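
For larger files, it is often more convenient to transfer data directly between OSS and the local disk, and to enumerate the objects under a prefix. The following sketch uses the oss2 helpers get_object_to_file, put_object_from_file, and ObjectIterator; the paths are placeholders:

# Download an OSS object to a local file, and upload a local file to OSS.
bucket.get_object_to_file('path/to/your_file', '/home/admin/local_copy.txt')
bucket.put_object_from_file('path/to/your_file', '/home/admin/local_copy.txt')

# List all objects whose keys start with a given prefix.
for obj in oss2.ObjectIterator(bucket, prefix='path/to/'):
    print(obj.key)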

TensorFlow OSS IO

DSW allows TensorFlow users to use the tensorflow_io.oss module (https://github.com/tensorflow/io/blob/master/tensorflow_io/oss/README.md) to read data stored in OSS. With this module, you no longer need to copy data or model files when you run TensorFlow model training tasks.

The tensorflow_io.oss package is preinstalled in DSW. Before you use it, import tensorflow_io.oss and construct the OSS bucket URL in the following format:

import tensorflow as tf
import tensorflow_io.oss

access_id = "<your_ak_id>"
access_key = "<your_ak_key>"

# OSS endpoint descriptions:
# DSW CPU instances deployed in China (Beijing): oss-cn-beijing.aliyuncs.com
# DSW GPU instances deployed in China (Beijing): oss-cn-beijing-internal.aliyuncs.com
# DSW GPU P100 instances and CPU instances deployed in China (Shanghai): oss-cn-shanghai.aliyuncs.com
# DSW GPU M40 instances deployed in China (Shanghai): oss-cn-shanghai-internal.aliyuncs.com
host = "oss-cn-beijing-internal.aliyuncs.com"
bucket = "oss://<your_bucket_name>"
oss_bucket_root = "{}\x01id={}\x02key={}\x02host={}/".format(bucket, access_id, access_key, host)

Use GFile to read and write OSS text objects. The sample code is as follows:

oss_file = oss_bucket_root + "test.txt"

# Write a text object to OSS.
with tf.gfile.GFile(oss_file, "w") as f:
    f.write("xxxxxxxxx")

# Read the object back.
with tf.gfile.GFile(oss_file, "r") as f:
    print(f.read())

You can also use TextLineDataset to read data stored on OSS. The sample code is as follows:

# Test the TextLineDataset reader op.
oss_file = oss_bucket_root + "test.txt"
dataset = tf.data.TextLineDataset([oss_file])
iterator = dataset.make_initializable_iterator()
a = iterator.get_next()

with tf.Session() as sess:
    tf.global_variables_initializer().run()
    sess.run(iterator.initializer)
    print(sess.run(a))
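
Because tensorflow_io.oss registers OSS as a TensorFlow file system, other tf.gfile operations should also work on OSS paths. A minimal sketch, assuming the oss_bucket_root constructed above and a hypothetical checkpoints/ directory:

ckpt_dir = oss_bucket_root + "checkpoints/"

# Create the directory if it does not exist yet.
if not tf.gfile.Exists(ckpt_dir):
    tf.gfile.MkDir(ckpt_dir)

# List the objects under the directory.
print(tf.gfile.ListDirectory(ckpt_dir))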

Note: The preceding sample code uses the TensorFlow 1.x API. For more information about how to migrate TensorFlow 1.x code to TensorFlow 2.x, see https://www.tensorflow.org/beta/guide/migration_guide.

Call the OSS Python API in PyTorch

PyTorch users can use OSS to store training data, log data, and models. To perform these tasks, call the OSS Python API in PyTorch.

Read training data

You can store your training data in an OSS bucket and save the data paths and their corresponding labels to an index file in the same bucket. PyTorch allows you to wrap such data in a Dataset and read it through the DataLoader API with multiple parallel worker processes.

The sample code is as follows. In the sample code, the endpoint argument specifies the OSS endpoint, the bucket argument specifies the name of the OSS bucket, the auth argument specifies the authentication object, and the index_file argument specifies the path of the index file.

In the sample index file, entries are separated with commas (,), and within each entry the data path and the label are separated with a colon (:).
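
For example, an index file that follows this format (the paths and labels here are hypothetical) might look like:

images/001.jpg:cat,images/002.jpg:dog,images/003.jpg:cat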

import io

import oss2
import torch
from PIL import Image


class OSSDataset(torch.utils.data.dataset.Dataset):
    def __init__(self, endpoint, bucket, auth, index_file):
        self._bucket = oss2.Bucket(auth, endpoint, bucket)
        # The object body is returned as bytes, so decode it before splitting.
        self._indices = self._bucket.get_object(index_file).read().decode().split(',')

    def __len__(self):
        return len(self._indices)

    def __getitem__(self, index):
        img_path, label = self._indices[index].strip().split(':')
        img_str = self._bucket.get_object(img_path)
        img_buf = io.BytesIO()
        img_buf.write(img_str.read())
        img_buf.seek(0)
        img = Image.open(img_buf).convert('RGB')
        img_buf.close()
        return img, label


dataset = OSSDataset(endpoint, bucket, auth, index_file)
data_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=batch_size,
    num_workers=num_loaders,
    pin_memory=True)
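
A minimal usage sketch, assuming endpoint, bucket, auth, index_file, batch_size, and num_loaders are defined as above. Note that the default DataLoader collate function cannot batch PIL images of varying sizes, so in practice you would convert each image to a fixed-size tensor inside __getitem__ before iterating:

from torchvision import transforms

to_tensor = transforms.ToTensor()

# With __getitem__ returning (to_tensor(img), label) for same-sized images,
# each batch is a tensor of shape (batch_size, 3, H, W) plus a tuple of labels.
for imgs, labels in data_loader:
    print(imgs.shape, labels[0])
    break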

Write log data to OSS

You can implement a custom StreamHandler to write log data to OSS. In the following sample code, the endpoint argument specifies the OSS endpoint, the bucket argument specifies the name of the OSS bucket, the auth argument specifies the authentication object, and the log_file argument specifies the path of the log file.

Note that because an OSS appendable object tracks a single write position, you cannot write to the same log file from multiple processes.

import logging

import oss2


class OSSLoggingHandler(logging.StreamHandler):
    def __init__(self, endpoint, bucket, auth, log_file):
        super(OSSLoggingHandler, self).__init__()
        self._bucket = oss2.Bucket(auth, endpoint, bucket)
        self._log_file = log_file
        # Create an empty appendable object and remember the next write position.
        self._pos = self._bucket.append_object(self._log_file, 0, '')

    def emit(self, record):
        # Append the formatted record (plus a newline) at the current position.
        msg = self.format(record) + '\n'
        self._pos = self._bucket.append_object(self._log_file, self._pos.next_position, msg)


oss_handler = OSSLoggingHandler(endpoint, bucket, auth, log_file)
# Register the handler directly; passing it as a stream would fail because
# a logging handler does not implement the stream write() method.
logging.basicConfig(
    handlers=[oss_handler],
    format='[%(asctime)s] [%(levelname)s] [%(process)d#%(threadName)s] ' +
           '[%(filename)s:%(lineno)d] %(message)s',
    level=logging.INFO)
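
Once the handler is configured, the standard logging calls append records to the OSS log object. For example:

logging.info('training started')
logging.warning('learning rate reduced to %f', 0.001)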

Save and load models

You can call the oss2 Python API to save and load PyTorch models. For more information about how to save and load PyTorch models, see https://pytorch.org/tutorials/beginner/saving_loading_models.html.

Save PyTorch models:

from io import BytesIO

import oss2
import torch

# The name of the specified OSS bucket (oss:// is excluded).
# auth and endpoint are initialized as shown in the preceding examples.
bucket_name = "your_bucket_name"
bucket = oss2.Bucket(auth, endpoint, bucket_name)

# Serialize the model parameters into an in-memory buffer, then upload it.
buffer = BytesIO()
torch.save(model.state_dict(), buffer)
bucket.put_object("your_model_path", buffer.getvalue())

Load PyTorch models:

from io import BytesIO

import oss2
import torch

# The name of the specified OSS bucket (oss:// is excluded).
bucket_name = "your_bucket_name"
bucket = oss2.Bucket(auth, endpoint, bucket_name)

# Download the serialized parameters and load them into the model.
buffer = BytesIO(bucket.get_object("your_model_path").read())
model.load_state_dict(torch.load(buffer))
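
If the model was saved on a GPU instance and is loaded on a CPU instance, pass map_location to torch.load. A minimal sketch:

# Remap tensors that were saved on a GPU to the CPU when loading.
model.load_state_dict(torch.load(buffer, map_location=torch.device('cpu')))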