OssMapDataset is a map-style dataset that loads OSS objects into memory and supports random access by index. Use it when your dataset is small enough to fit in memory and your training loop requires frequent random access, parallel processing, or shuffling.
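To make "map-style" concrete, the sketch below shows the general pattern: a dataset exposes a length and index-based access, so a loader can shuffle simply by visiting indices in a random order. InMemoryMapDataset and shuffled_indices are hypothetical names used only for illustration; this is not how OssMapDataset is implemented internally.

```python
import random
from typing import List, Sequence, Tuple


class InMemoryMapDataset:
    """Hypothetical stand-in for a map-style dataset; not part of osstorchconnector."""

    def __init__(self, items: Sequence[Tuple[str, bytes]]) -> None:
        self._items = list(items)

    def __len__(self) -> int:
        return len(self._items)

    def __getitem__(self, index: int) -> Tuple[str, bytes]:
        # Random access: any item can be fetched directly by its index.
        return self._items[index]


def shuffled_indices(dataset_len: int, seed: int) -> List[int]:
    """Return a reproducible random visiting order over the dataset indices."""
    order = list(range(dataset_len))
    random.Random(seed).shuffle(order)
    return order


dataset = InMemoryMapDataset([
    ("img001-00001.png", b"aaa"),
    ("img001-00002.png", b"bbbb"),
    ("img001-00003.png", b"ccccc"),
])

order = shuffled_indices(len(dataset), seed=0)
for index in order:
    key, content = dataset[index]
    print(index, key, len(content))
```

Because every index is visited exactly once, shuffling changes only the order of access, not the set of objects that are read.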
Prerequisites
Before you begin, make sure that you have:
Installed OSS Connector for AI/ML. See Install OSS Connector for AI/ML.
Configured OSS Connector for AI/ML. See Configure OSS Connector for AI/ML.
Build a dataset
OssMapDataset supports three construction methods. Choose the one that matches how your OSS data is organized:
| Method | Use when |
|---|---|
| from_prefix | All objects share a common OSS URI prefix |
| from_objects | Object paths are known but scattered across the bucket |
| from_manifest_file | The dataset is large (tens of millions of objects), is frequently loaded, and data indexing is enabled for the bucket. This method avoids the API listing fees incurred by the other two methods. |
Build a dataset from a prefix
Use from_prefix when all your training data lives under a single OSS URI prefix.
from osstorchconnector import OssMapDataset
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"
# Build a dataset from all objects under the prefix.
map_dataset = OssMapDataset.from_prefix(
    oss_uri=OSS_URI,
    endpoint=ENDPOINT,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH,
)
# Access an object by index.
item = map_dataset[0]
print(item.key)
content = item.read()
print(item.size)
print(len(content))
# Traverse all objects.
for item in map_dataset:
    print(item.key)
    print(item.size)
    content = item.read()
    print(len(content))
Build a dataset from an object list
Use from_objects when your training data is spread across multiple paths in a bucket.
from osstorchconnector import OssMapDataset
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
# Provide the full OSS URI for each object.
uris = [
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00001.png",
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00002.png",
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00003.png",
]
# Build a dataset from the explicit URI list.
map_dataset = OssMapDataset.from_objects(
    object_uris=uris,
    endpoint=ENDPOINT,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH,
)
# Access an object by index.
item = map_dataset[1]
print(item.key)
print(item.size)
content = item.read()
print(len(content))
# Traverse all objects.
for item in map_dataset:
    print(item.key)
    print(item.size)
    content = item.read()
    print(len(content))
Build a dataset from a manifest file
Use from_manifest_file when your dataset contains a large number of objects and data indexing is enabled on the bucket. The manifest file replaces the API listing call, which reduces costs at scale.
Step 1: Create a manifest file.
Run the touch manifest_file command to create a manifest file, then add one object path per line. Optionally, append a label to each path. The built-in imagenet_manifest_parser splits each line on a tab character, so separate the path and the label with a tab rather than a space. The two layouts look like this:
Names only:
Img/BadImag/Bmp/Sample001/img001-00001.png
Img/BadImag/Bmp/Sample001/img001-00002.png
Img/BadImag/Bmp/Sample001/img001-00003.png
Names with labels:
Img/BadImag/Bmp/Sample001/img001-00001.png	label1
Img/BadImag/Bmp/Sample001/img001-00002.png	label2
Img/BadImag/Bmp/Sample001/img001-00003.png	label3
Step 2: Build the dataset.
The manifest_file_path parameter accepts either a local file path or an OSS URI. Both options are shown below:
import io
from typing import Iterable, Tuple, Union
from osstorchconnector import OssMapDataset, imagenet_manifest_parser
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_BASE_URI = "oss://ai-testset/EnglistImg/"
# Option 1: Load the manifest file from a local path.
MANIFEST_FILE_LOCAL = "/path/to/manifest_file.txt"
map_dataset = OssMapDataset.from_manifest_file(
    manifest_file_path=MANIFEST_FILE_LOCAL,
    manifest_parser=imagenet_manifest_parser,
    oss_base_uri=OSS_BASE_URI,
    endpoint=ENDPOINT,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH,
)
for item in map_dataset:
    print(item.key)
    print(item.size)
    print(item.label)
    content = item.read()
    print(len(content))
# Option 2: Load the manifest file directly from OSS.
MANIFEST_FILE_URI = "oss://ai-testset/EnglistImg/manifest_file"
map_dataset = OssMapDataset.from_manifest_file(
    manifest_file_path=MANIFEST_FILE_URI,
    manifest_parser=imagenet_manifest_parser,
    oss_base_uri=OSS_BASE_URI,
    endpoint=ENDPOINT,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH,
)
Parameters
All three construction methods share common parameters. Method-specific parameters are listed separately.
Common parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| endpoint | string | Yes | The OSS endpoint. See Regions and endpoints. |
| cred_path | string | Yes | Path to the credentials file. Default: /root/.alibabacloud/credentials. See Configure access credentials. |
| config_path | string | Yes | Path to the OSS Connector configuration file. Default: /etc/oss-connector/config.json. See Configure OSS Connector. |
| transform | object | No | A function applied to each DataObject before it is returned. If not specified, the default identity function is used. See transform. |
from_prefix parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| oss_uri | string | Yes | The OSS URI prefix. Must start with oss://. |
from_objects parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| object_uris | list of strings | Yes | A list of OSS URIs. Each URI must start with oss://. |
from_manifest_file parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| manifest_file_path | string | Yes | Path to the manifest file. Accepts a local file path or an OSS URI starting with oss://. |
| manifest_parser | Callable | Yes | A function that reads an open manifest file and returns an iterator of (oss_uri, label) tuples. Use the built-in imagenet_manifest_parser or provide a custom implementation. See manifest_parser. |
| oss_base_uri | string | Yes | The OSS base URI prepended to relative paths in the manifest file to form complete OSS URIs. If all paths in the manifest file are already complete, pass "". |
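For large datasets, the manifest is usually generated by a script rather than typed by hand. The following is a minimal sketch that writes the tab-separated layout read by the built-in imagenet_manifest_parser, with one relative object path per line and an optional label. The write_manifest helper and the temporary output path are hypothetical and are not part of OSS Connector for AI/ML.

```python
import os
import tempfile
from typing import Iterable, Optional, Tuple


def write_manifest(path: str, entries: Iterable[Tuple[str, Optional[str]]]) -> int:
    """Write one object path per line, adding a tab and a label when one is given."""
    count = 0
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for key, label in entries:
            line = key if not label else f"{key}\t{label}"
            f.write(line + "\n")
            count += 1
    return count


# Paths are relative to the value later passed as oss_base_uri.
entries = [
    ("Img/BadImag/Bmp/Sample001/img001-00001.png", "label1"),
    ("Img/BadImag/Bmp/Sample001/img001-00002.png", "label2"),
    ("Img/BadImag/Bmp/Sample001/img001-00003.png", None),
]

manifest_path = os.path.join(tempfile.mkdtemp(), "manifest_file.txt")
written = write_manifest(manifest_path, entries)

# Read the file back to confirm its layout.
with open(manifest_path, encoding="utf-8") as f:
    manifest_lines = f.read().splitlines()
print(written, manifest_lines)
```

The resulting file can be passed as manifest_file_path from its local path, or uploaded to OSS and referenced by its oss:// URI.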
For the DataObject attributes available on each item (key, size, label, read(), copy()), see Data type in OSS Connector for AI/ML.
Built-in methods
transform
Default transform method
When a dataset is built, each DataObject is passed through the transform function before being returned. The default transform copies the object and returns it:
# Default transform, applied automatically when transform is not specified.
def identity(obj: DataObject) -> DataObject:
    if obj is not None:
        return obj.copy()
    else:
        return None
Custom transform method
Do not return the DataObject itself from a custom transform function, because the iterator may fail. Return obj.copy() instead, or convert the object to another type (such as a tensor).
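The two safe patterns can be sketched without any OSS access. FakeDataObject below is a hypothetical stand-in that only mimics the attributes named in this document (key, size, label, read(), copy()); it is not the connector's DataObject class, and the function names are illustrative.

```python
from typing import Tuple


class FakeDataObject:
    """Hypothetical stand-in exposing the attributes named in this document."""

    def __init__(self, key: str, content: bytes, label: str = "") -> None:
        self.key = key
        self.label = label
        self.size = len(content)
        self._content = content

    def read(self) -> bytes:
        return self._content

    def copy(self) -> "FakeDataObject":
        return FakeDataObject(self.key, self._content, self.label)


def transform_copy(obj: FakeDataObject) -> FakeDataObject:
    # Pattern 1: return a copy rather than the object that was passed in.
    return obj.copy()


def transform_to_values(obj: FakeDataObject) -> Tuple[str, bytes, str]:
    # Pattern 2: convert to plain Python values so that no DataObject is returned.
    return obj.key, obj.read(), obj.label


source = FakeDataObject("Sample001/img001-00001.png", b"\x89PNG", "label1")
copied = transform_copy(source)
key, content, label = transform_to_values(source)
print(copied is source, key, len(content), label)
```

Either function could be passed as the transform argument when the dataset is built. The second pattern is usually the more convenient one for training, because batches then contain ordinary values.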
The following example applies torchvision transforms to image objects:
import io
import torchvision.transforms as transforms
from PIL import Image
from osstorchconnector import OssMapDataset
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.test.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"
# Define the image preprocessing pipeline.
trans = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
def transform(obj):
    img = Image.open(io.BytesIO(obj.read())).convert("RGB")
    return trans(img), obj.label
# Pass the transform function at construction time.
map_dataset = OssMapDataset.from_prefix(
    OSS_URI,
    endpoint=ENDPOINT,
    transform=transform,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH,
)
manifest_parser
manifest_parser is a callable that reads an open manifest file and returns an iterator of (oss_uri, label) tuples. The built-in imagenet_manifest_parser handles tab-separated manifest files:
from osstorchconnector import imagenet_manifest_parser
Implementation reference:
import io
import logging
from typing import Iterable, Tuple
def imagenet_manifest_parser(reader: io.IOBase) -> Iterable[Tuple[str, str]]:
    # Read the whole manifest, then yield one (key, label) pair per line.
    lines = reader.read().decode("utf-8").strip().split("\n")
    for i, line in enumerate(lines):
        try:
            items = line.strip().split("\t")
            if len(items) >= 2:
                key = items[0]
                label = items[1]
                yield (key, label)
            elif len(items) == 1:
                key = items[0]
                yield (key, "")
            else:
                raise ValueError("format error")
        except ValueError as e:
            logging.error(f"Error: {e} for line {i}: {line}")
Provide a custom manifest_parser if your manifest file uses a different format.
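As an illustration, a custom parser for a comma-separated manifest might look like the sketch below. The csv_manifest_parser name and the key,label line format are assumptions made for this example. The sketch also assumes that the reader yields bytes, as the decode call in the implementation reference suggests.

```python
import csv
import io
from typing import Iterable, Tuple


def csv_manifest_parser(reader: io.IOBase) -> Iterable[Tuple[str, str]]:
    """Hypothetical parser for manifests with lines of the form key,label."""
    text = reader.read().decode("utf-8")
    for row in csv.reader(io.StringIO(text)):
        if not row or not row[0].strip():
            continue  # Skip blank lines.
        key = row[0].strip()
        # A missing label becomes an empty string, as in the built-in parser.
        label = row[1].strip() if len(row) >= 2 else ""
        yield (key, label)


# Exercise the parser on an in-memory manifest instead of a real file.
sample = (
    b"Img/BadImag/Bmp/Sample001/img001-00001.png,label1\n"
    b"\n"
    b"Img/BadImag/Bmp/Sample001/img001-00002.png\n"
)
parsed = list(csv_manifest_parser(io.BytesIO(sample)))
print(parsed)
```

Such a function would be passed as manifest_parser=csv_manifest_parser when calling from_manifest_file.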
Create a PyTorch data loader
Pass OssMapDataset directly to torch.utils.data.DataLoader. The example below uses shuffle=True and multiple worker processes for parallel loading:
import torch
from osstorchconnector import OssMapDataset
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
CONFIG_PATH = "/etc/oss-connector/config.test.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"
def transform(obj):
    data = obj.read()
    return obj.key, obj.label
map_dataset = OssMapDataset.from_prefix(
    OSS_URI,
    endpoint=ENDPOINT,
    transform=transform,
    cred_path=CRED_PATH,
    config_path=CONFIG_PATH,
)
# Create the data loader.
loader = torch.utils.data.DataLoader(
    map_dataset,
    batch_size=256,
    num_workers=32,
    prefetch_factor=2,
    shuffle=True,
)
# Use the loader in your training loop.
for batch in loader:
    # Perform training operations.
    ...
What's next
To use OSS Connector for AI/ML in a Docker environment, see Build a Docker image that contains an OSS Connector for AI/ML environment.
For streaming datasets with lower memory overhead, see OssIterableDataset.