OssMapDataset

An `OssMapDataset` is suitable for scenarios with small data volumes and sufficient memory that require frequent random access and parallel processing. This topic describes how to build a dataset using `OssMapDataset`.

Prerequisites

OSS Connector for AI/ML is installed and configured. For more information, see Install OSS Connector for AI/ML and Configure OSS Connector for AI/ML.

Build a dataset

Methods

You can build a dataset with `OssMapDataset` in three ways:

OSS_URI prefix: Suitable for scenarios where OSS storage paths follow a uniform pattern.
List of OSS_URIs: Suitable for scenarios where OSS storage paths are specific but scattered.
Manifest file: Avoids the overhead of OSS `list object` operations. Suitable for datasets with a very large number of files (for example, tens of millions) that are loaded repeatedly, and for buckets that have the OSS scalar retrieval feature enabled.

Build a dataset from an OSS_URI prefix

This example shows how to use the `from_prefix` method of `OssMapDataset` to build a dataset from a specified prefix (OSS_URI) in OSS.

from osstorchconnector import OssMapDataset

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
REGION = "cn-beijing"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"

# Build a dataset using the from_prefix method of OssMapDataset.
map_dataset = OssMapDataset.from_prefix(oss_uri=OSS_URI, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)

# Randomly access an object in the created dataset.
item = map_dataset[0]
print(item.key)
print(item.size)
content = item.read()
print(len(content))
item.close()

# Traverse the objects in the dataset.
for item in map_dataset:
    print(item.key)
    print(item.size)
    content = item.read()
    print(len(content))
    item.close()
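
The example above calls `item.close()` explicitly after reading, so the call is skipped if `read()` raises. One way to guarantee the close is the standard library's `contextlib.closing`, which works with any object that has a `close()` method. The following is a minimal sketch: `FakeItem` is a hypothetical stand-in for a dataset item so that the snippet runs without OSS access, and it is not part of OSS Connector for AI/ML.

```python
from contextlib import closing


class FakeItem:
    """Hypothetical stand-in for a dataset item; not part of the connector."""

    def __init__(self, key: str, data: bytes):
        self.key = key
        self.size = len(data)
        self._data = data
        self.closed = False

    def read(self) -> bytes:
        return self._data

    def close(self) -> None:
        self.closed = True


def read_all(item) -> bytes:
    # closing() calls item.close() on exit, even if read() raises.
    with closing(item):
        return item.read()


item = FakeItem("Sample001/img001-00001.png", b"\x89PNG")
content = read_all(item)
print(item.key, len(content), item.closed)
```

With a real dataset, the same `read_all` helper could be applied to `map_dataset[0]` or to each item in a loop.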

Build a dataset from a list of OSS_URIs

This example shows how to use the `from_objects` method of `OssMapDataset` to build a dataset from a specified list of OSS_URIs.

from osstorchconnector import OssMapDataset

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
REGION = "cn-beijing"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"

# uris is an iterable of strings. Each string is a complete OSS_URI.
uris = [
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00001.png",
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00002.png",
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00003.png"
]

# Build a dataset using the from_objects method of OssMapDataset.
map_dataset = OssMapDataset.from_objects(object_uris=uris, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)

# Randomly access an object in the created dataset.
item = map_dataset[1]
print(item.key)
print(item.size)
content = item.read()
print(len(content))
item.close()

# Traverse the objects in the dataset.
for item in map_dataset:
    print(item.key)
    print(item.size)
    content = item.read()
    print(len(content))
    item.close()
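
When object names follow a numbering scheme, the list of OSS_URIs can be generated instead of typed out. The following sketch assumes the naming pattern used in the example above (`img001-00001.png`, `img001-00002.png`, and so on). The helper name `build_uris` is made up for illustration.

```python
from typing import List


def build_uris(prefix: str, count: int) -> List[str]:
    """Return complete OSS_URIs for img001-00001.png up to img001-<count>.png."""
    if not prefix.startswith("oss://"):
        raise ValueError("OSS_URIs must start with oss://")
    return [f"{prefix}img001-{i:05d}.png" for i in range(1, count + 1)]


uris = build_uris("oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/", 3)
for uri in uris:
    print(uri)
```

The resulting list can be passed as `object_uris=uris`.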

Build a dataset from a manifest file

Before you can build a dataset from a manifest file, you must create the manifest file.

Create a manifest file.

Run the `touch manifest_file` command in any directory to create a manifest file. Then, populate the file as shown in the following examples.

Example of a manifest file with OSS object names:

Img/BadImag/Bmp/Sample001/img001-00001.png
Img/BadImag/Bmp/Sample001/img001-00002.png
Img/BadImag/Bmp/Sample001/img001-00003.png

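Example of a manifest file with OSS object names and labels. Separate the object name and the label with a tab character, which is the delimiter that the built-in `imagenet_manifest_parser` splits on:

Img/BadImag/Bmp/Sample001/img001-00001.png	label1
Img/BadImag/Bmp/Sample001/img001-00002.png	label2
Img/BadImag/Bmp/Sample001/img001-00003.png	label3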
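
A manifest like the ones above can also be generated by a short script instead of edited by hand. The following sketch writes one object name per line and, when a label is present, separates it from the name with a tab, which is the delimiter that the built-in `imagenet_manifest_parser` splits on. The function name `write_manifest` and the temporary output path are illustrative.

```python
import os
import tempfile
from typing import Iterable, Optional, Tuple


def write_manifest(path: str, entries: Iterable[Tuple[str, Optional[str]]]) -> None:
    """Write one entry per line: the name alone, or name<TAB>label."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for name, label in entries:
            line = name if label is None else f"{name}\t{label}"
            f.write(line + "\n")


entries = [
    ("Img/BadImag/Bmp/Sample001/img001-00001.png", "label1"),
    ("Img/BadImag/Bmp/Sample001/img001-00002.png", "label2"),
    ("Img/BadImag/Bmp/Sample001/img001-00003.png", None),
]
manifest_path = os.path.join(tempfile.mkdtemp(), "manifest_file")
write_manifest(manifest_path, entries)

# Read the file back to check what was written.
with open(manifest_path, encoding="utf-8") as f:
    manifest_lines = f.read().splitlines()
print(manifest_lines)
```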

Build the dataset from the manifest file.

This example shows how to use the `from_manifest_file` method of `OssMapDataset` to build a dataset from a specified manifest file.

from osstorchconnector import OssMapDataset
from osstorchconnector import imagenet_manifest_parser

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
REGION = "cn-beijing"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_BASE_URI = "oss://ai-testset/EnglistImg/"

# Build a dataset from a local file using the from_manifest_file method of OssMapDataset.
# The manifest_file_path parameter specifies the local path of the manifest file.
# The manifest_parser parameter specifies the method for parsing the manifest file. This example uses the built-in parser imagenet_manifest_parser.
# The oss_base_uri parameter specifies the base OSS_URI. It is concatenated with the URI parsed from the manifest to form the full OSS_URI. The format is: FULL_OSS_URI = BASE_OSS_URI + URI.
MANIFEST_FILE_LOCAL = "/path/to/manifest_file"
map_dataset = OssMapDataset.from_manifest_file(manifest_file_path=MANIFEST_FILE_LOCAL, manifest_parser=imagenet_manifest_parser, oss_base_uri=OSS_BASE_URI, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)
for item in map_dataset:
    print(item.key)
    print(item.size)
    print(item.label)
    content = item.read()
    print(len(content))
    item.close()

# Build a dataset from a manifest file in an OSS bucket using the from_manifest_file method of OssMapDataset.
MANIFEST_FILE_URI = "oss://ai-testset/EnglistImg/manifest_file"
map_dataset = OssMapDataset.from_manifest_file(manifest_file_path=MANIFEST_FILE_URI, manifest_parser=imagenet_manifest_parser, oss_base_uri=OSS_BASE_URI, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)
for item in map_dataset:
    print(item.key)
    print(item.size)
    print(item.label)
    content = item.read()
    print(len(content))
    item.close()

Data types in OSS Connector for AI/ML

The data type for objects in the dataset implements common I/O interfaces. For more information, see Data types in OSS Connector for AI/ML.

Parameters

When you build a dataset using `OssMapDataset` or `OssIterableDataset`, you can configure the following parameters.

Parameter	Type	Required	Description
endpoint	string	Yes	Common parameter: The endpoint used to access OSS, such as `http://oss-cn-beijing-internal.aliyuncs.com`. For more information, see Regions and endpoints.
region	string	No	Common parameter: The OSS region, such as `cn-beijing`. If you do not set this parameter, the region is inferred from the `endpoint`. This inference can fail if the endpoint does not contain region information. We recommend that you specify the region.
transform	object	No	Common parameter: A transform function that converts a DataObject (OSS object) to any type. You can customize this method as needed. For more information, see transform. Important: Do not return a `DataObject` object directly in the transform. This may cause the iterator to stop working. To return an object, call the `copy` method.
cred_path	string	Yes	Common parameter: The path of the access credential file. The default path is `/root/.alibabacloud/credentials`. For more information, see Configure access credentials.
config_path	string	Yes	Common parameter: The path of the OSS Connector configuration file. The default path is `/etc/oss-connector/config.json`. For more information, see Configure OSS Connector.
oss_uri	string	Yes	Parameter for the `from_prefix` method: The path of an OSS resource. Use this parameter to build a dataset from an OSS_URI prefix. Only OSS_URIs that start with `oss://` are supported.
object_uris	iterable of strings	Yes	Parameter for the `from_objects` method: A list of OSS resource paths. Use this parameter to build a dataset from the paths in the list. Only OSS_URIs that start with `oss://` are supported.
manifest_file_path	string	Yes	Parameter for the `from_manifest_file` method: The path of the manifest file. You can use a local file path or an OSS_URI that starts with `oss://`.
manifest_parser	Callable Object	Yes	Parameter for the `from_manifest_file` method: The method used to parse the manifest file. It accepts an opened manifest file as input and returns an iterator. Each element in the iterator is a `(oss_uri, label)` tuple. A built-in parser is provided. For more information, see manifest_parser. You can also create a custom `manifest_parser` method that suits the format of your dataset's manifest file.
oss_base_uri	string	Yes	Parameter for the `from_manifest_file` method: The base OSS_URI. It is concatenated with a possibly incomplete OSS_URI from the manifest file to form a complete OSS_URI. If you do not have a base OSS_URI, set this parameter to `""`.
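
The `oss_base_uri` rule is plain string concatenation (FULL_OSS_URI = BASE_OSS_URI + URI), so a missing or doubled `/` at the join changes the resulting object name. The following sketch mimics that rule outside the connector to preview the URIs that a manifest would produce. `full_oss_uri` is a hypothetical helper, not a connector function.

```python
def full_oss_uri(oss_base_uri: str, uri: str) -> str:
    """Mimic the documented rule FULL_OSS_URI = BASE_OSS_URI + URI."""
    full = oss_base_uri + uri
    if not full.startswith("oss://"):
        raise ValueError(f"not a complete OSS_URI: {full!r}")
    return full


# Base URI plus a relative name taken from the manifest.
a = full_oss_uri(
    "oss://ai-testset/EnglistImg/",
    "Img/BadImag/Bmp/Sample001/img001-00001.png",
)
# Empty base URI when the manifest already holds complete OSS_URIs.
b = full_oss_uri(
    "",
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00001.png",
)
print(a)
print(b)
```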

Built-in methods

transform

Note

When you build a dataset, the dataset provides an iterator that yields the result of `transform(DataObject)`. `DataObject` is a data type in OSS Connector for AI/ML.

The `transform` method can be customized. If you do not specify a `transform` method when you build a dataset, the default method is used.

Default transform method

The following example shows the default `transform` method. You do not need to specify it when you build a dataset.

# Default transform function
def identity(obj: DataObject) -> DataObject:
    if obj is not None:
        return obj.copy()
    else:
        return None

Custom transform method

The following example shows how to use a custom `transform` method when you build a dataset.

import io
import torchvision.transforms as transforms
from PIL import Image
from osstorchconnector import OssMapDataset

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
REGION = "cn-beijing"
CONFIG_PATH = "/etc/oss-connector/config.test.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"

# Define the transformation operations for image data.
trans = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

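# Create a transform method to process the input object.
# The parameter is named obj to avoid shadowing the Python built-in `object`.
def transform(obj):
    # Decode the object content into an RGB image, then apply the preprocessing pipeline.
    # Any exception raised here propagates to the caller.
    img = Image.open(io.BytesIO(obj.read())).convert('RGB')
    val = trans(img)
    return obj.key, val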

# Use the transform=transform parameter when you build the dataset.
map_dataset = OssMapDataset.from_prefix(OSS_URI, endpoint=ENDPOINT, transform=transform, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)
for item in map_dataset:
    print(item[0])
    print(item[1].shape)
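
A transform does not have to involve image libraries. The following sketch returns the object key together with the raw bytes, which also follows the rule of not returning the `DataObject` itself. `FakeDataObject` is a hypothetical stand-in that exposes only `key` and `read()` so that the function can be exercised without OSS access. The function name `key_and_bytes` is illustrative.

```python
from typing import Tuple


class FakeDataObject:
    """Hypothetical stand-in exposing only the attributes used below."""

    def __init__(self, key: str, data: bytes):
        self.key = key
        self._data = data

    def read(self) -> bytes:
        return self._data


def key_and_bytes(obj) -> Tuple[str, bytes]:
    # Return plain Python values rather than the object itself.
    return obj.key, obj.read()


result = key_and_bytes(FakeDataObject("Sample001/img001-00001.png", b"abc"))
print(result)
```

With the connector, the function would be passed as `transform=key_and_bytes` when the dataset is built.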

manifest_parser

To use the built-in `manifest_parser` method, `imagenet_manifest_parser`, when you build a dataset, import it as shown in the following example.

from osstorchconnector import imagenet_manifest_parser

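The following code shows the implementation of the built-in `imagenet_manifest_parser` method. Each line is split on the tab character: the first field is the object name and the optional second field is the label.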

import io
import logging
from typing import Iterable, Tuple


def imagenet_manifest_parser(reader: io.IOBase) -> Iterable[Tuple[str, str]]:
    lines = reader.read().decode("utf-8").strip().split("\n")
    for i, line in enumerate(lines):
        try:
            items = line.strip().split('\t')
            if len(items) >= 2:
                key = items[0]
                label = items[1]
                yield (key, label)
            elif len(items) == 1:
                key = items[0]
                yield (key, '')
            else:
                raise ValueError("format error")
        except ValueError as e:
            logging.error(f"Error: {e} for line {i}: {line}")
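
A custom parser follows the same shape as the built-in one: it takes the opened manifest and yields `(oss_uri, label)` tuples. The following sketch handles a comma-separated manifest. It assumes, as the built-in parser does, that `reader.read()` returns bytes. `csv_manifest_parser` is an illustrative name and is not part of the connector.

```python
import csv
import io
from typing import Iterable, Tuple


def csv_manifest_parser(reader: io.IOBase) -> Iterable[Tuple[str, str]]:
    """Yield (oss_uri, label) from lines such as 'path/to/img.png,label'."""
    text = reader.read().decode("utf-8")
    for row in csv.reader(io.StringIO(text)):
        if not row or not row[0].strip():
            continue  # Skip blank lines.
        key = row[0].strip()
        # A line without a second field gets an empty label.
        label = row[1].strip() if len(row) >= 2 else ""
        yield (key, label)


sample = io.BytesIO(b"Img/a.png,cat\nImg/b.png\n\nImg/c.png,dog\n")
parsed = list(csv_manifest_parser(sample))
print(parsed)
```

Such a parser would be passed as `manifest_parser=csv_manifest_parser` to `from_manifest_file`.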

Create a PyTorch data loader from a dataset

The following example shows how to create a PyTorch data loader that uses a dataset built by `OssMapDataset` as the data source.

import torch
from osstorchconnector import OssMapDataset

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
REGION = "cn-beijing"
CONFIG_PATH = "/etc/oss-connector/config.test.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"


def transform(obj):
    # Read the object content. A real transform would decode `data` here.
    data = obj.read()
    return obj.key, obj.label

# Build a dataset using the from_prefix method of OssMapDataset.
map_dataset = OssMapDataset.from_prefix(OSS_URI, endpoint=ENDPOINT, transform=transform, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)

# Create a PyTorch data loader based on map_dataset.
loader = torch.utils.data.DataLoader(map_dataset, batch_size=256, num_workers=32, prefetch_factor=2, shuffle=True)
# Use the data in the training loop.
for batch in loader:
    # Perform training operations on the batch here.
    pass
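
Shuffling works here because a map-style dataset supports `len()` and integer indexing, so the loader can draw indices in any order. The following is a simplified, pure-Python illustration of that idea, not PyTorch's actual implementation: it permutes the indices, groups them into batches, and fetches each element by index. A plain list stands in for the dataset, and `shuffled_batches` is a made-up name.

```python
import random
from typing import Iterator, List, Sequence


def shuffled_batches(dataset: Sequence, batch_size: int, seed: int = 0) -> Iterator[List]:
    """Yield batches of elements fetched by random index access."""
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]


dataset = [f"img001-{i:05d}.png" for i in range(1, 8)]
batches = list(shuffled_batches(dataset, batch_size=3))
for batch in batches:
    print(batch)
```

Every element appears exactly once per pass, and the last batch is smaller when the dataset size is not a multiple of the batch size.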

References

You can also use OSS Connector for AI/ML for data training tasks in a containerized environment. For more information, see Build a Docker image that contains the OSS Connector for AI/ML environment.