
Object Storage Service: Use OSS data to build a map-style dataset for random reads

Last Updated: Mar 20, 2026

OssMapDataset is a map-style dataset that loads OSS objects into memory and supports random access by index. Use it when your dataset is small enough to fit in memory and your training loop requires frequent random access, parallel processing, or shuffling.
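
To make "map-style" concrete, the sketch below shows the two methods that PyTorch expects of any map-style dataset, `__len__` and `__getitem__`. It is a plain-Python stand-in, not OssMapDataset itself: the class name InMemoryMapDataset and the sample keys are hypothetical, and the payloads are held in a dict instead of being fetched from OSS.

```python
from typing import Dict, List, Tuple


class InMemoryMapDataset:
    """Hypothetical stand-in for a map-style dataset (not the OSS connector)."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        # Fix the key order once so that every index maps to exactly one object.
        self._keys: List[str] = sorted(objects)
        self._objects = dict(objects)

    def __len__(self) -> int:
        return len(self._keys)

    def __getitem__(self, index: int) -> Tuple[str, bytes]:
        key = self._keys[index]
        return key, self._objects[key]


dataset = InMemoryMapDataset({
    "img001-00001.png": b"aaa",
    "img001-00002.png": b"bbbb",
    "img001-00003.png": b"ccccc",
})

# Random access: any index can be read without visiting the others first.
for index in (2, 0, 1):
    key, content = dataset[index]
    print(index, key, len(content))
```

Because every item is addressable by index, a loader can shuffle or partition the index range without reading any data up front.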

Prerequisites

Before you begin, make sure that you have:

Build a dataset

OssMapDataset supports three construction methods. Choose the one that matches how your OSS data is organized:

| Method | Use when |
| --- | --- |
| from_prefix | All objects share a common OSS URI prefix |
| from_objects | Object paths are known but scattered across the bucket |
| from_manifest_file | The dataset is large (tens of millions of objects), is frequently loaded, and data indexing is enabled for the bucket. This method avoids the API listing fees incurred by the other two methods. |

Build a dataset from a prefix

Use from_prefix when all your training data lives under a single OSS URI prefix.

from osstorchconnector import OssMapDataset

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"

# Build a dataset from all objects under the prefix.
map_dataset = OssMapDataset.from_prefix(
    oss_uri=OSS_URI,
    endpoint=ENDPOINT,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH,
)

# Access an object by index.
item = map_dataset[0]
print(item.key)
content = item.read()
print(item.size)
print(len(content))

# Traverse all objects.
for item in map_dataset:
    print(item.key)
    print(item.size)
    content = item.read()
    print(len(content))

Build a dataset from an object list

Use from_objects when your training data is spread across multiple paths in a bucket.

from osstorchconnector import OssMapDataset

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"

# Provide the full OSS URI for each object.
uris = [
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00001.png",
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00002.png",
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00003.png",
]

# Build a dataset from the explicit URI list.
map_dataset = OssMapDataset.from_objects(
    object_uris=uris,
    endpoint=ENDPOINT,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH,
)

# Access an object by index.
item = map_dataset[1]
print(item.key)
print(item.size)
content = item.read()
print(len(content))

# Traverse all objects.
for item in map_dataset:
    print(item.key)
    print(item.size)
    content = item.read()
    print(len(content))

Build a dataset from a manifest file

Use from_manifest_file when your dataset contains a large number of objects and data indexing is enabled on the bucket. The manifest file replaces the API listing call, which reduces costs at scale.

Step 1: Create a manifest file.

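Run the touch manifest_file command to create an empty manifest file, then add one object path per line. To attach a label, append it after a tab character (rendered as whitespace in the example below); the built-in imagenet_manifest_parser splits each line on tabs. The two layouts below are alternatives. The lines that start with # only annotate them on this page: the built-in parser does not skip comments, so leave those lines out of the real file.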

# Names only
Img/BadImag/Bmp/Sample001/img001-00001.png
Img/BadImag/Bmp/Sample001/img001-00002.png
Img/BadImag/Bmp/Sample001/img001-00003.png
# Names with labels
Img/BadImag/Bmp/Sample001/img001-00001.png label1
Img/BadImag/Bmp/Sample001/img001-00002.png label2
Img/BadImag/Bmp/Sample001/img001-00003.png label3
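
Because the built-in parser described later on this page splits each line on a tab character, generating the manifest with a script can be less error-prone than typing it by hand. The following is a minimal sketch; the helper name write_manifest and the temporary output location are hypothetical.

```python
import os
import tempfile
from typing import Iterable, Optional, Tuple


def write_manifest(path: str, entries: Iterable[Tuple[str, Optional[str]]]) -> int:
    """Write one object path per line, followed by a tab and a label if given."""
    count = 0
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for key, label in entries:
            line = key if label is None else f"{key}\t{label}"
            f.write(line + "\n")
            count += 1
    return count


entries = [
    ("Img/BadImag/Bmp/Sample001/img001-00001.png", "label1"),
    ("Img/BadImag/Bmp/Sample001/img001-00002.png", "label2"),
    ("Img/BadImag/Bmp/Sample001/img001-00003.png", None),  # entry without a label
]

# Write to a temporary directory so the sketch leaves no files behind in the project.
manifest_path = os.path.join(tempfile.mkdtemp(), "manifest_file")
written = write_manifest(manifest_path, entries)
print(f"wrote {written} lines to {manifest_path}")
```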

Step 2: Build the dataset.

The manifest_file_path parameter accepts either a local file path or an OSS URI. Both options are shown below:

import io
from typing import Iterable, Tuple, Union
from osstorchconnector import OssMapDataset, imagenet_manifest_parser

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_BASE_URI = "oss://ai-testset/EnglistImg/"

# Option 1: Load the manifest file from a local path.
MANIFEST_FILE_LOCAL = "/path/to/manifest_file.txt"
map_dataset = OssMapDataset.from_manifest_file(
    manifest_file_path=MANIFEST_FILE_LOCAL,
    manifest_parser=imagenet_manifest_parser,
    oss_base_uri=OSS_BASE_URI,
    endpoint=ENDPOINT,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH,
)
for item in map_dataset:
    print(item.key)
    print(item.size)
    print(item.label)
    content = item.read()
    print(len(content))

# Option 2: Load the manifest file directly from OSS.
MANIFEST_FILE_URI = "oss://ai-testset/EnglistImg/manifest_file"
map_dataset = OssMapDataset.from_manifest_file(
    manifest_file_path=MANIFEST_FILE_URI,
    manifest_parser=imagenet_manifest_parser,
    oss_base_uri=OSS_BASE_URI,
    endpoint=ENDPOINT,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH,
)

Parameters

All three construction methods share common parameters. Method-specific parameters are listed separately.

Common parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| endpoint | string | Yes | The OSS endpoint. See Regions and endpoints. |
| cred_path | string | Yes | Path to the credentials file. Default: /root/.alibabacloud/credentials. See Configure access credentials. |
| config_path | string | Yes | Path to the OSS Connector configuration file. Default: /etc/oss-connector/config.json. See Configure OSS Connector. |
| transform | object | No | A function applied to each DataObject before it is returned. If not specified, the default identity function is used. See transform. |

from_prefix parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| oss_uri | string | Yes | The OSS URI prefix. Must start with oss://. |

from_objects parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| object_uris | string | Yes | A list of OSS URIs. Each URI must start with oss://. |
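
Both oss_uri and object_uris must use the oss:// scheme, so it can help to check URIs before building a dataset. The sketch below is one possible pre-flight check; the helper name split_oss_uri is hypothetical and is not part of the connector.

```python
from typing import Tuple


def split_oss_uri(uri: str) -> Tuple[str, str]:
    """Split 'oss://bucket/key' into (bucket, key); raise ValueError otherwise."""
    scheme = "oss://"
    if not uri.startswith(scheme):
        raise ValueError(f"OSS URI must start with {scheme!r}: {uri!r}")
    bucket, _, key = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"OSS URI has no bucket name: {uri!r}")
    return bucket, key


uris = [
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00001.png",
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00002.png",
]

# Validate every URI up front so that a typo fails fast, before any OSS request.
for uri in uris:
    bucket, key = split_oss_uri(uri)
    print(bucket, key)
```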

from_manifest_file parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| manifest_file_path | string | Yes | Path to the manifest file. Accepts a local file path or an OSS URI starting with oss://. |
| manifest_parser | Callable | Yes | A function that reads an open manifest file and returns an iterator of (oss_uri, label) tuples. Use the built-in imagenet_manifest_parser or provide a custom implementation. See manifest_parser. |
| oss_base_uri | string | Yes | The OSS base URI prepended to relative paths in the manifest file to form complete OSS URIs. If all paths in the manifest file are already complete, pass "". |
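
As a rough illustration of how oss_base_uri and the manifest entries combine, the sketch below prepends the base URI to each path. The helper name resolve_manifest_key is hypothetical, and plain string concatenation is an assumption drawn from the description above; the connector's own joining logic may differ.

```python
def resolve_manifest_key(oss_base_uri: str, key: str) -> str:
    """Assumed behaviour: prepend the base URI to the path read from the manifest."""
    return oss_base_uri + key


# Relative path in the manifest, completed by the base URI.
relative = resolve_manifest_key(
    "oss://ai-testset/EnglistImg/",
    "Img/BadImag/Bmp/Sample001/img001-00001.png",
)

# Complete URI in the manifest, so the base URI is the empty string.
absolute = resolve_manifest_key(
    "",
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00001.png",
)

print(relative)
print(absolute)
```

Under this assumption both calls yield the same URI, which is why a trailing slash on the base URI matters when the manifest paths are relative.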

For the DataObject attributes available on each item (key, size, label, read(), copy()), see Data type in OSS Connector for AI/ML.

Built-in methods

transform

Default transform method

When a dataset is built, each DataObject is passed through the transform function before being returned. The default transform copies the object and returns it:

# Default transform — applied automatically when transform is not specified.
def identity(obj: DataObject) -> DataObject:
    if obj is not None:
        return obj.copy()
    else:
        return None

Custom transform method

Important

Do not return a DataObject directly from a custom transform function — the iterator may fail. Return obj.copy() instead, or convert the object to another type (such as a tensor).

The following example applies torchvision transforms to image objects:

import io
import torchvision.transforms as transforms
from PIL import Image
from osstorchconnector import OssMapDataset

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.test.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"

# Define the image preprocessing pipeline.
trans = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def transform(obj):
    img = Image.open(io.BytesIO(obj.read())).convert("RGB")
    return trans(img), obj.label

# Pass the transform function at construction time.
map_dataset = OssMapDataset.from_prefix(
    OSS_URI,
    endpoint=ENDPOINT,
    transform=transform,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH,
)
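
The rule from the Important note can be tried without OSS access. In the sketch below, FakeDataObject is a hypothetical stand-in that only mimics the key, label, read(), and copy() members listed on this page; it is not the connector's DataObject. The transform converts each object to plain Python values, so the original object is never returned.

```python
import hashlib
from typing import Tuple


class FakeDataObject:
    """Hypothetical stand-in exposing key, label, read() and copy()."""

    def __init__(self, key: str, data: bytes, label: str = "") -> None:
        self.key = key
        self.label = label
        self._data = data

    def read(self) -> bytes:
        return self._data

    def copy(self) -> "FakeDataObject":
        return FakeDataObject(self.key, self._data, self.label)


def to_record(obj: FakeDataObject) -> Tuple[str, int, str, str]:
    """Custom transform: return plain values instead of the object itself."""
    content = obj.read()
    digest = hashlib.sha256(content).hexdigest()
    return obj.key, len(content), digest, obj.label


record = to_record(FakeDataObject("Sample001/img001-00001.png", b"\x89PNG", "label1"))
print(record)
```

A function shaped like to_record would be passed through the transform parameter in the same way as the torchvision example above.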

manifest_parser

manifest_parser is a callable that reads an open manifest file and returns an iterator of (oss_uri, label) tuples. The built-in imagenet_manifest_parser handles tab-separated manifest files:

from osstorchconnector import imagenet_manifest_parser

Implementation reference:

import io
import logging
from typing import Iterable, Tuple


def imagenet_manifest_parser(reader: io.IOBase) -> Iterable[Tuple[str, str]]:
    lines = reader.read().decode("utf-8").strip().split("\n")
    for i, line in enumerate(lines):
        try:
            items = line.strip().split("\t")
            if len(items) >= 2:
                key = items[0]
                label = items[1]
                yield (key, label)
            elif len(items) == 1:
                key = items[0]
                yield (key, "")
            else:
                raise ValueError("format error")
        except ValueError as e:
            logging.error(f"Error: {e} for line {i}: {line}")

Provide a custom manifest_parser if your manifest file uses a different format.
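
As one possible custom parser, the sketch below handles comma-separated manifests with an optional label column. The name csv_manifest_parser and the CSV layout are illustrative assumptions. Like the built-in parser, it assumes that reader.read() returns bytes.

```python
import csv
import io
from typing import Iterable, Tuple


def csv_manifest_parser(reader: io.IOBase) -> Iterable[Tuple[str, str]]:
    """Parse 'path,label' lines; the label column is optional."""
    text = reader.read().decode("utf-8")
    for row in csv.reader(io.StringIO(text)):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        key = row[0].strip()
        label = row[1].strip() if len(row) > 1 else ""
        yield (key, label)


# An in-memory file stands in for the opened manifest.
sample = io.BytesIO(
    b"# path,label\n"
    b"Img/BadImag/Bmp/Sample001/img001-00001.png,label1\n"
    b"\n"
    b"Img/BadImag/Bmp/Sample001/img001-00002.png\n"
)
parsed = list(csv_manifest_parser(sample))
print(parsed)
```

The function would then be supplied as manifest_parser=csv_manifest_parser when calling from_manifest_file.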

Create a PyTorch data loader

Pass OssMapDataset directly to torch.utils.data.DataLoader. The example below uses shuffle=True and multiple worker processes for parallel loading:

import torch
from osstorchconnector import OssMapDataset

ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.test.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"

def transform(obj):
    data = obj.read()
    return obj.key, obj.label

map_dataset = OssMapDataset.from_prefix(
    OSS_URI,
    endpoint=ENDPOINT,
    transform=transform,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH,
)

# Create the data loader.
loader = torch.utils.data.DataLoader(
    map_dataset,
    batch_size=256,
    num_workers=32,
    prefetch_factor=2,
    shuffle=True,
)

# Use the loader in your training loop.
for batch in loader:
    # Perform training operations.
    ...
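
For intuition about what shuffle=True and batch_size imply for a map-style dataset, the following plain-Python sketch groups a shuffled index range into batches. It is a conceptual model only, not PyTorch's implementation: the real DataLoader uses sampler objects and hands index batches to worker processes. The function name shuffled_batches and the item count of 1000 are hypothetical.

```python
import random
from typing import Iterator, List


def shuffled_batches(num_items: int, batch_size: int, seed: int) -> Iterator[List[int]]:
    """Yield index batches that cover every item exactly once, in random order."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    for start in range(0, num_items, batch_size):
        yield indices[start:start + batch_size]


batches = list(shuffled_batches(num_items=1000, batch_size=256, seed=0))

# The last batch is smaller when the item count is not a multiple of the batch size.
print(len(batches), [len(batch) for batch in batches])
```

Each index in a batch corresponds to one indexed read on the dataset, which is why random access by index is the property a shuffling loader relies on.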
