An OssIterableDataset is ideal for scenarios that involve limited memory or large data volumes. It is primarily used for sequential processing where random access and parallel processing are not required. This topic describes how to build a dataset using OssIterableDataset.
Prerequisites
OSS Connector for AI/ML is installed and configured. For more information, see Install OSS Connector for AI/ML and Configure OSS Connector for AI/ML.
Build a dataset
Methods
You can build a dataset using OssIterableDataset in three ways:
OSS URI prefix: Use this method when OSS storage paths follow a consistent pattern.
List of OSS URIs: Use this method for specific, non-sequential OSS storage paths.
Manifest file: Use this method to reduce the overhead of listing OSS objects. This method is suitable for datasets with many files, such as tens of millions, that require repeated loading. It is also suitable for buckets where the OSS scalar retrieval feature is enabled.
Build a dataset from an OSS URI prefix
The following example shows how to use the from_prefix method of OssIterableDataset to build a dataset from a specified prefix (OSS URI) in OSS.
from osstorchconnector import OssIterableDataset
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
REGION = "cn-beijing"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"
# Build a dataset using the from_prefix method of OssIterableDataset
iterable_dataset = OssIterableDataset.from_prefix(OSS_URI, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)
# Traverse the objects in the dataset
for item in iterable_dataset:
    print(item.key)
    print(item.size)
    content = item.read()
    print(len(content))
    item.close()
Build a dataset from a list of OSS URIs
The following example shows how to use the from_objects method of OssIterableDataset to build a dataset from a specified list of OSS URIs. In the example, uris is a list of strings, each of which is a complete OSS URI.
from osstorchconnector import OssIterableDataset
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
REGION = "cn-beijing"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
uris = [
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00001.png",
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00002.png",
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00003.png"
]
# Build a dataset using the from_objects method of OssIterableDataset
iterable_dataset = OssIterableDataset.from_objects(uris, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)
# Traverse the objects in the dataset
for item in iterable_dataset:
    print(item.key)
    print(item.size)
    content = item.read()
    print(len(content))
    item.close()
Build a dataset from a manifest file
Before you build a dataset from a manifest file, you must first create the manifest file.
Create a manifest file:
Run the touch manifest_file command in any location to create a manifest file. Then, populate the manifest file as shown in the following examples, with one entry per line.
Example of a manifest file with OSS object names:
Img/BadImag/Bmp/Sample001/img001-00001.png
Img/BadImag/Bmp/Sample001/img001-00002.png
Img/BadImag/Bmp/Sample001/img001-00003.png
Example of a manifest file with OSS object names and labels (the built-in imagenet_manifest_parser expects a tab character between the name and the label):
Img/BadImag/Bmp/Sample001/img001-00001.png	label1
Img/BadImag/Bmp/Sample001/img001-00002.png	label2
Img/BadImag/Bmp/Sample001/img001-00003.png	label3
Build the dataset from the manifest file:
The following example shows how to use the from_manifest_file method of OssIterableDataset to build a dataset from a specified manifest file.
from osstorchconnector import OssIterableDataset
from osstorchconnector import imagenet_manifest_parser
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
REGION = "cn-beijing"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_BASE_URI = "oss://ai-testset/EnglistImg/"
# Build a dataset from a local file using the from_manifest_file method of OssIterableDataset
# The manifest_file_path parameter specifies the local path of the manifest file.
# The manifest_parser parameter is the method for parsing the manifest file. This example uses the built-in parsing method imagenet_manifest_parser.
# The oss_base_uri parameter specifies the base OSS URI. It is used to concatenate with the URI parsed from the manifest to form a full OSS URI. FULL_OSS_URI = BASE_OSS_URI + URI.
MANIFEST_FILE_LOCAL = "/path/to/manifest_file"
iterable_dataset = OssIterableDataset.from_manifest_file(manifest_file_path=MANIFEST_FILE_LOCAL, manifest_parser=imagenet_manifest_parser, oss_base_uri=OSS_BASE_URI, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)
for item in iterable_dataset:
    print(item.key)
    print(item.size)
    print(item.label)
    content = item.read()
    print(len(content))
    item.close()
# Build a dataset from a manifest file in an OSS Bucket using the from_manifest_file method of OssIterableDataset
MANIFEST_FILE_URI = "oss://ai-testset/EnglistImg/manifest_file"
iterable_dataset = OssIterableDataset.from_manifest_file(manifest_file_path=MANIFEST_FILE_URI, manifest_parser=imagenet_manifest_parser, oss_base_uri=OSS_BASE_URI, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)
for item in iterable_dataset:
    print(item.key)
    print(item.size)
    print(item.label)
    content = item.read()
    print(len(content))
    item.close()
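As a rough illustration of the manifest format and of how oss_base_uri is combined with each entry, the following sketch writes a small labeled manifest to a temporary file and prints the full OSS URI that each entry would resolve to. It does not call OSS Connector for AI/ML. The helper names write_manifest and full_uri are hypothetical, and the tab separator is an assumption taken from the built-in imagenet_manifest_parser.

```python
import os
import tempfile

OSS_BASE_URI = "oss://ai-testset/EnglistImg/"

# Sample entries: (object name relative to OSS_BASE_URI, label)
ENTRIES = [
    ("Img/BadImag/Bmp/Sample001/img001-00001.png", "label1"),
    ("Img/BadImag/Bmp/Sample001/img001-00002.png", "label2"),
    ("Img/BadImag/Bmp/Sample001/img001-00003.png", "label3"),
]


def write_manifest(path, entries):
    """Hypothetical helper: write one entry per line, tab between name and label."""
    with open(path, "w", encoding="utf-8") as f:
        for key, label in entries:
            f.write(f"{key}\t{label}\n")


def full_uri(base_uri, key):
    """Hypothetical helper: plain concatenation, FULL_OSS_URI = BASE_OSS_URI + URI."""
    return base_uri + key


# Write the manifest into a fresh temporary directory
manifest_path = os.path.join(tempfile.mkdtemp(), "manifest_file")
write_manifest(manifest_path, ENTRIES)

# Read it back and show the URI each entry would resolve to
with open(manifest_path, encoding="utf-8") as f:
    manifest_lines = f.read().splitlines()

for line in manifest_lines:
    key, label = line.split("\t")
    print(full_uri(OSS_BASE_URI, key), label)
```

Because the concatenation is plain string joining, the base URI in this sketch ends with a slash and the manifest entries do not start with one.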
Data types in the dataset
Objects in the dataset are of a data type that implements common I/O interfaces. For more information, see Data types in OSS Connector for AI/ML.
Parameter description
Configure the following parameters to build a dataset using OssMapDataset or OssIterableDataset.
Parameter | Type | Required | Description |
endpoint | string | Yes | Common parameter: The endpoint for accessing the OSS service. For more information, see Regions and endpoints. |
region | string | No | Common parameter: The OSS region, such as |
transform | object | No | Common parameter: A transform function used to convert a DataObject (OSS object) to any type. You can customize this method as needed. For more information, see transform. Important Do not directly return a |
cred_path | string | Yes | Common parameter: The default path of the authentication file is |
config_path | string | Yes | Common parameter: The default path of the OSS Connector configuration file is |
oss_uri | string | Yes | from_prefix method parameter: The OSS resource path used to build the dataset from an OSS URI prefix. Only OSS URIs that start with |
object_uris | string | Yes | from_objects method parameter: A list of OSS resource paths used to build the dataset. Only OSS URIs that start with |
manifest_file_path | string | Yes | from_manifest_file method parameter: The path of the manifest file. Local file paths and OSS URIs that start with |
manifest_parser | Callable Object | Yes | from_manifest_file method parameter: A built-in method for parsing the manifest file. It accepts an opened manifest file as input and returns an iterator. Each element in the iterator is a tuple of |
oss_base_uri | string | Yes | from_manifest_file method parameter: The base OSS URI. It is used to create a complete OSS URI by concatenating with a potentially incomplete OSS URI from the manifest file. If there is no oss_base_uri, set this parameter to |
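The manifest_parser parameter takes a callable, so a manifest in a different layout could in principle be handled by a custom parser. The following is a minimal sketch, assuming a custom callable is accepted as long as it mirrors the built-in imagenet_manifest_parser: it receives an opened binary reader and yields (key, label) tuples. The name csv_manifest_parser and the comma-separated layout are hypothetical.

```python
import io
from typing import Iterable, Tuple


def csv_manifest_parser(reader: io.IOBase) -> Iterable[Tuple[str, str]]:
    """Hypothetical parser for manifest lines of the form 'key,label' or 'key'."""
    text = reader.read().decode("utf-8")
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # partition() returns an empty label when the line has no comma
        key, _, label = line.partition(",")
        yield (key.strip(), label.strip())


# Try the parser on an in-memory manifest instead of a real file
sample = io.BytesIO(b"Img/a.png,label1\nImg/b.png\n\nImg/c.png, label3\n")
parsed = list(csv_manifest_parser(sample))
print(parsed)
```

Under that assumption, the function would be passed as manifest_parser=csv_manifest_parser when calling from_manifest_file.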
Built-in methods
transform
When you build a dataset, the dataset provides an iterator that yields the result of `transform(DataObject)`. `DataObject` is a data type in OSS Connector for AI/ML.
The transform method can be customized. If you do not specify a transform method when building the dataset, the default method is used.
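To make the relationship between the dataset iterator and transform concrete, here is a minimal sketch that runs without OSS access. StubObject is a hypothetical stand-in that exposes only key and read(), as the examples in this topic use them. It is not the connector's DataObject, and iterate_with_transform only mimics the behavior described above.

```python
import io


class StubObject:
    """Hypothetical stand-in for a DataObject: exposes key and read() only."""

    def __init__(self, key, data):
        self.key = key
        self._stream = io.BytesIO(data)

    def read(self):
        return self._stream.read()


def key_and_size(obj):
    """Custom transform: return the object key and the number of bytes read."""
    return obj.key, len(obj.read())


def iterate_with_transform(objects, transform):
    """Yield transform(obj) for each object, mimicking the dataset iterator."""
    for obj in objects:
        yield transform(obj)


objects = [StubObject("a.png", b"1234"), StubObject("b.png", b"")]
results = list(iterate_with_transform(objects, key_and_size))
print(results)
```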
Default transform method
The following example shows the default transform method. You do not need to specify it when building a dataset.
# Default transform function
def identity(obj: DataObject) -> DataObject:
    if obj is not None:
        return obj.copy()
    else:
        return None
Custom transform method
The following example shows how to use a custom transform method when building a dataset.
import io
import torchvision.transforms as transforms
from PIL import Image
from osstorchconnector import OssIterableDataset
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
REGION = "cn-beijing"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"
# Define the transformation operations for image data
trans = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
# Create a transform method to process the input object
def transform(obj):
    try:
        img = Image.open(io.BytesIO(obj.read())).convert('RGB')
        val = trans(img)
    except Exception as e:
        raise e
    return obj.key, val
# Use the transform=transform parameter when building the dataset
iterable_dataset = OssIterableDataset.from_prefix(OSS_URI, endpoint=ENDPOINT, transform=transform, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)
for item in iterable_dataset:
    print(item[0])
    print(item[1].shape)
manifest_parser
To build a dataset using the default manifest_parser method, import it as shown in the following example.
from osstorchconnector import imagenet_manifest_parser
The following example shows the default manifest_parser method.
import io
import logging
from typing import Iterable, Tuple

def imagenet_manifest_parser(reader: io.IOBase) -> Iterable[Tuple[str, str]]:
    lines = reader.read().decode("utf-8").strip().split("\n")
    for i, line in enumerate(lines):
        try:
            items = line.strip().split('\t')
            if len(items) >= 2:
                key = items[0]
                label = items[1]
                yield (key, label)
            elif len(items) == 1:
                key = items[0]
                yield (key, '')
            else:
                raise ValueError("format error")
        except ValueError as e:
            logging.error(f"Error: {e} for line {i}: {line}")
Create a data loader with PyTorch from a dataset
The following example shows how to create a PyTorch data loader using a dataset built with OssIterableDataset as the data source.
import torch
from osstorchconnector import OssIterableDataset
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
REGION = "cn-beijing"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"
def transform(obj):
    return obj.key, obj.label
# Build a dataset using the from_prefix method of OssIterableDataset
iterable_dataset = OssIterableDataset.from_prefix(OSS_URI, endpoint=ENDPOINT, transform=transform, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)
# Create a PyTorch data loader based on iterable_dataset
loader = torch.utils.data.DataLoader(iterable_dataset, batch_size=256, num_workers=32, prefetch_factor=2)
# Use the data in the training loop
# for batch in loader:
#     Perform training operations
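For intuition about what batch_size does with an iterable-style dataset, the following sketch groups items from any iterator into fixed-size lists. It is a simplified, pure-Python analogy for a single worker, not PyTorch's implementation: collation into tensors and multi-worker behavior are left out, and the name batched is hypothetical.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Hypothetical helper: yield lists of up to batch_size items, in order."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return  # the source iterator is exhausted
        yield batch


# A generator stands in for a dataset that can only be read sequentially
keys = (f"img001-{i:05d}.png" for i in range(1, 6))
batches = list(batched(keys, 2))
print([len(b) for b in batches])
```

The last batch is shorter when the number of items is not a multiple of the batch size, which matches how a data loader behaves when it keeps the final partial batch.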
References
OSS Connector for AI/ML can also be used for data training tasks in a containerized environment. For more information, see Build a Docker image that contains the OSS Connector for AI/ML environment.