OssIterableDataset

An OssIterableDataset is ideal for scenarios that involve limited memory or large data volumes. It is primarily used for sequential processing where random access and parallel processing are not required. This topic describes how to build a dataset using OssIterableDataset.

Prerequisites

OSS Connector for AI/ML is installed and configured. For more information, see Install OSS Connector for AI/ML and Configure OSS Connector for AI/ML.

Build a dataset

Methods

You can build a dataset using OssIterableDataset in three ways:

OSS URI prefix: Use this method when OSS storage paths follow a consistent pattern.
List of OSS URIs: Use this method for specific, non-sequential OSS storage paths.
Manifest file: Use this method to reduce the overhead of listing OSS objects. This method is suitable for datasets with many files, such as tens of millions, that require repeated loading. It is also suitable for buckets where the OSS scalar retrieval feature is enabled.

Build a dataset from an OSS URI prefix

The following example shows how to use the from_prefix method of OssIterableDataset to build a dataset from a specified prefix (OSS URI) in OSS.

from osstorchconnector import OssIterableDataset

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
REGION = "cn-beijing"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"

# Build a dataset using the from_prefix method of OssIterableDataset
iterable_dataset = OssIterableDataset.from_prefix(OSS_URI, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)

# Traverse the objects in the dataset
for item in iterable_dataset:
    print(item.key)
    print(item.size)
    content = item.read()
    print(len(content))
    item.close()
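
The loop above closes each object after reading it, but that call is skipped if an exception is raised while the object is being processed. The following is a minimal sketch of one way to guarantee cleanup. It relies only on the key, read, and close members used above. The helper name read_objects and the _FakeObject stand-in are hypothetical and exist only so that the sketch runs without an OSS connection.

```python
from typing import Iterable, Iterator, Tuple


def read_objects(dataset: Iterable) -> Iterator[Tuple[str, bytes]]:
    """Yield (key, content) for each object and close the object afterwards."""
    for item in dataset:
        try:
            yield item.key, item.read()
        finally:
            # Runs when iteration moves on to the next item, and also if read() raises
            item.close()


class _FakeObject:
    """Hypothetical stand-in for a dataset item; not part of the connector."""

    def __init__(self, key: str, data: bytes):
        self.key = key
        self._data = data
        self.closed = False

    def read(self) -> bytes:
        return self._data

    def close(self) -> None:
        self.closed = True


fake_items = [_FakeObject("a.png", b"\x89PNG"), _FakeObject("b.png", b"")]
results = list(read_objects(fake_items))
```

With a real dataset, you would pass iterable_dataset to read_objects in place of fake_items.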

Build a dataset from a list of OSS URIs

The following example shows how to use the from_objects method of OssIterableDataset to build a dataset from a specified list of OSS URIs. In the example, uris is an iterable of strings (a list in this case), each of which is an OSS URI.

from osstorchconnector import OssIterableDataset

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
REGION = "cn-beijing"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"

uris = [
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00001.png",
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00002.png",
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00003.png"
]

# Build a dataset using the from_objects method of OssIterableDataset
iterable_dataset = OssIterableDataset.from_objects(uris, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)

# Traverse the objects in the dataset
for item in iterable_dataset:
    print(item.key)
    print(item.size)
    content = item.read()
    print(len(content))
    item.close()

Build a dataset from a manifest file

Before you build a dataset from a manifest file, you must first create the manifest file.

Create a manifest file:

Run the touch manifest_file command in any location to create a manifest file. Then, populate the manifest file as shown in the examples.

Example of a manifest file with OSS object names:

Img/BadImag/Bmp/Sample001/img001-00001.png
Img/BadImag/Bmp/Sample001/img001-00002.png
Img/BadImag/Bmp/Sample001/img001-00003.png

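Example of a manifest file with OSS object names and labels. If you use the built-in imagenet_manifest_parser, separate the object name and the label with a tab character, because that parser splits each line on '\t':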

Img/BadImag/Bmp/Sample001/img001-00001.png label1
Img/BadImag/Bmp/Sample001/img001-00002.png label2
Img/BadImag/Bmp/Sample001/img001-00003.png label3
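
Instead of filling in the manifest by hand, you can generate it. The sketch below writes one object name per line and, when a label is present, appends it after a tab character, which is the separator that the built-in imagenet_manifest_parser shown later in this topic splits on. The function name write_manifest and the temporary output path are hypothetical.

```python
import os
import tempfile
from typing import Iterable, Optional, Tuple


def write_manifest(path: str, entries: Iterable[Tuple[str, Optional[str]]]) -> None:
    """Write one 'name<TAB>label' line per entry; omit the tab when there is no label."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for name, label in entries:
            line = name if not label else f"{name}\t{label}"
            f.write(line + "\n")


entries = [
    ("Img/BadImag/Bmp/Sample001/img001-00001.png", "label1"),
    ("Img/BadImag/Bmp/Sample001/img001-00002.png", "label2"),
    ("Img/BadImag/Bmp/Sample001/img001-00003.png", None),
]
manifest_path = os.path.join(tempfile.mkdtemp(), "manifest_file")
write_manifest(manifest_path, entries)

# Read the file back to inspect what was written
with open(manifest_path, encoding="utf-8") as f:
    manifest_lines = f.read().splitlines()
```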

Build the dataset from the manifest file:

The following example shows how to use the from_manifest_file method of OssIterableDataset to build a dataset from a specified manifest file.

from osstorchconnector import OssIterableDataset
from osstorchconnector import imagenet_manifest_parser

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
REGION = "cn-beijing"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_BASE_URI = "oss://ai-testset/EnglistImg/"

# Build a dataset from a local file using the from_manifest_file method of OssIterableDataset
# The manifest_file_path parameter specifies the local path of the manifest file.
# The manifest_parser parameter is the method for parsing the manifest file. This example uses the built-in parsing method imagenet_manifest_parser.
# The oss_base_uri parameter specifies the base OSS URI. It is used to concatenate with the URI parsed from the manifest to form a full OSS URI. FULL_OSS_URI = BASE_OSS_URI + URI.
MANIFEST_FILE_LOCAL = "/path/to/manifest_file"
iterable_dataset = OssIterableDataset.from_manifest_file(manifest_file_path=MANIFEST_FILE_LOCAL, manifest_parser=imagenet_manifest_parser, oss_base_uri=OSS_BASE_URI, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)
for item in iterable_dataset:
    print(item.key)
    print(item.size)
    print(item.label)
    content = item.read()
    print(len(content))
    item.close()

# Build a dataset from a manifest file in an OSS Bucket using the from_manifest_file method of OssIterableDataset
MANIFEST_FILE_URI = "oss://ai-testset/EnglistImg/manifest_file"
iterable_dataset = OssIterableDataset.from_manifest_file(manifest_file_path=MANIFEST_FILE_URI, manifest_parser=imagenet_manifest_parser, oss_base_uri=OSS_BASE_URI, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)
for item in iterable_dataset:
    print(item.key)
    print(item.size)
    print(item.label)
    content = item.read()
    print(len(content))
    item.close()
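
The comment in the example states that the full OSS URI is the base URI followed by the URI parsed from the manifest. As a plain-string illustration of that rule, the first manifest entry shown earlier resolves as follows. This is not the connector's internal code, and full_oss_uri is a hypothetical name.

```python
OSS_BASE_URI = "oss://ai-testset/EnglistImg/"


def full_oss_uri(base_uri: str, manifest_uri: str) -> str:
    """Plain concatenation, mirroring FULL_OSS_URI = BASE_OSS_URI + URI."""
    return base_uri + manifest_uri


resolved = full_oss_uri(OSS_BASE_URI, "Img/BadImag/Bmp/Sample001/img001-00001.png")
print(resolved)

# With an empty base URI, the manifest entry must already be a full OSS URI
unchanged = full_oss_uri("", "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00001.png")
```

If the two parts are joined exactly as described, a missing or doubled slash between them yields a different object name, so check that the base URI ends where the manifest entries begin.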

Data types in the dataset

Objects in the dataset are of a data type that implements common I/O interfaces. For more information, see Data types in OSS Connector for AI/ML.

Parameter description

Configure the following parameters to build a dataset using OssMapDataset or OssIterableDataset.

Parameter	Type	Required	Description
endpoint	string	Yes	Common parameter: The endpoint for accessing the OSS service. For more information, see Regions and endpoints.
region	string	No	Common parameter: The OSS region, such as `cn-beijing`. If this parameter is not set, the connector infers the region from the `endpoint`. Inference can fail if the endpoint does not contain region information, so we recommend that you specify the region explicitly.
transform	object	No	Common parameter: A transform function used to convert a DataObject (OSS object) to any type. You can customize this function as needed. For more information, see transform. Important: Do not return a `DataObject` directly from the transform function, because this may cause the iterator to stop working. To return the object, return the result of its copy method instead.
cred_path	string	Yes	Common parameter: The default path of the authentication file is `/root/.alibabacloud/credentials`. For more information, see Configure access credentials.
config_path	string	Yes	Common parameter: The default path of the OSS Connector configuration file is `/etc/oss-connector/config.json`. For more information, see Configure OSS Connector.
oss_uri	string	Yes	from_prefix method parameter: The OSS resource path used to build the dataset from an OSS URI prefix. Only OSS URIs that start with `oss://` are supported.
object_uris	string	Yes	from_objects method parameter: A list of OSS resource paths used to build the dataset. Only OSS URIs that start with `oss://` are supported.
manifest_file_path	string	Yes	from_manifest_file method parameter: The path of the manifest file. Local file paths and OSS URIs that start with `oss://` are supported.
manifest_parser	Callable Object	Yes	from_manifest_file method parameter: A built-in method for parsing the manifest file. It accepts an opened manifest file as input and returns an iterator. Each element in the iterator is a tuple of `(oss_uri, label)`. For more information, see manifest_parser. You can also customize the manifest_parser method based on the format of different dataset manifest files.
oss_base_uri	string	Yes	from_manifest_file method parameter: The base OSS URI. It is used to create a complete OSS URI by concatenating with a potentially incomplete OSS URI from the manifest file. If there is no oss_base_uri, set this parameter to `""`.
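
The table states that oss_uri and object_uris accept only URIs that start with `oss://`. A small pre-check can surface a malformed URI before the dataset is built. This is a sketch: check_oss_uris is a hypothetical helper, not part of the connector, and the connector may perform its own validation.

```python
from typing import Iterable, List


def check_oss_uris(uris: Iterable[str]) -> List[str]:
    """Return the URIs as a list, raising ValueError if any lacks the oss:// scheme."""
    checked = []
    for uri in uris:
        if not uri.startswith("oss://"):
            raise ValueError(f"Not an OSS URI (expected an oss:// prefix): {uri!r}")
        checked.append(uri)
    return checked


good = check_oss_uris(["oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00001.png"])

# A path without the scheme is rejected
try:
    check_oss_uris(["ai-testset/EnglistImg/manifest_file"])
    rejected = False
except ValueError:
    rejected = True
```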

Built-in methods

transform

When you build a dataset, the dataset provides an iterator that yields the result of `transform(DataObject)`. `DataObject` is a data type in OSS Connector for AI/ML.

The transform method can be customized. If you do not specify a transform method when building the dataset, the default method is used.

Default transform method

The following example shows the default transform method. You do not need to specify it when building a dataset.

# Default transform function
def identity(obj: DataObject) -> DataObject:
    if obj is not None:
        return obj.copy()
    else:
        return None

Custom transform method

The following example shows how to use a custom transform method when building a dataset.

import io
import torchvision.transforms as transforms
from PIL import Image
from osstorchconnector import OssIterableDataset

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
REGION = "cn-beijing"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"

# Define the transformation operations for image data
trans = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Create a transform method to process the input object
def transform(obj):
    # Decode the object's bytes into an RGB image, then apply the preprocessing pipeline.
    # Any exception raised here propagates to the caller.
    img = Image.open(io.BytesIO(obj.read())).convert('RGB')
    val = trans(img)
    return obj.key, val

# Use the transform=transform parameter when building the dataset
iterable_dataset = OssIterableDataset.from_prefix(OSS_URI, endpoint=ENDPOINT, transform=transform, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)
for item in iterable_dataset:
    print(item[0])
    print(item[1].shape)

manifest_parser

To build a dataset using the built-in manifest_parser method, import it as shown in the following example.

from osstorchconnector import imagenet_manifest_parser

The following example shows the built-in manifest_parser method.

import io
import logging
from typing import Iterable, Tuple

def imagenet_manifest_parser(reader: io.IOBase) -> Iterable[Tuple[str, str]]:
    lines = reader.read().decode("utf-8").strip().split("\n")
    for i, line in enumerate(lines):
        try:
            items = line.strip().split('\t')
            if len(items) >= 2:
                key = items[0]
                label = items[1]
                yield (key, label)
            elif len(items) == 1:
                key = items[0]
                yield (key, '')
            else:
                raise ValueError("format error")
        except ValueError as e:
            logging.error(f"Error: {e} for line {i}: {line}")

Create a data loader with PyTorch from a dataset

The following example shows how to create a PyTorch data loader using a dataset built with OssIterableDataset as the data source.

import torch
from osstorchconnector import OssIterableDataset

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
REGION = "cn-beijing"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"


def transform(obj):
    return obj.key, obj.label

# Build a dataset using the from_prefix method of OssIterableDataset
iterable_dataset = OssIterableDataset.from_prefix(OSS_URI, endpoint=ENDPOINT, transform=transform, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)

# Create a PyTorch data loader based on iterable_dataset
loader = torch.utils.data.DataLoader(iterable_dataset, batch_size=256, num_workers=32, prefetch_factor=2)
# Use the data in the training loop
# for batch in loader:
#     Perform training operations
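
Each sample produced by the transform above is a (key, label) tuple, and the data loader groups batch_size samples into one batch. If you want to control how samples are grouped, torch.utils.data.DataLoader accepts a collate_fn argument. The sketch below defines one that splits a batch into a list of keys and a list of labels. The name collate_key_label is hypothetical, and the function is exercised on plain tuples so that it runs without torch or an OSS connection.

```python
from typing import List, Sequence, Tuple


def collate_key_label(batch: Sequence[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Turn [(key, label), ...] into ([key, ...], [label, ...])."""
    keys = [key for key, _ in batch]
    labels = [label for _, label in batch]
    return keys, labels


# Stand-in samples shaped like the output of transform(obj) above
samples = [("img001-00001.png", "label1"), ("img001-00002.png", "label2")]
keys, labels = collate_key_label(samples)
```

To use it, pass collate_fn=collate_key_label when you construct the data loader.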

References

OSS Connector for AI/ML can also be used for data training tasks in a containerized environment. For more information, see Build a Docker image that contains the OSS Connector for AI/ML environment.