An `OssMapDataset` is suitable for scenarios with small data volumes and sufficient memory that require frequent random access and parallel processing. This topic describes how to build a dataset using `OssMapDataset`.
Prerequisites
OSS Connector for AI/ML is installed and configured. For more information, see Install OSS Connector for AI/ML and Configure OSS Connector for AI/ML.
Build a dataset
Methods
You can build a dataset with `OssMapDataset` in three ways:
OSS_URI prefix: Suitable for scenarios where OSS storage paths follow a uniform pattern.
List of OSS_URIs: Suitable for scenarios where OSS storage paths are specific but scattered.
Manifest file: Reduces the overhead of OSS `list object` operations. This method is suitable for datasets with many files, such as tens of millions, that are loaded repeatedly, and for buckets where the scalar retrieval feature for OSS is enabled.
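All three methods identify objects by OSS_URI. As a rough illustration of how such a URI breaks down, the following sketch uses Python's standard `urllib.parse` to split an OSS_URI into a bucket and an object key, and to assemble a list of OSS_URIs from known keys. The helpers `split_oss_uri` and `build_oss_uris` are hypothetical and are not part of OSS Connector for AI/ML.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urlsplit


def split_oss_uri(oss_uri: str) -> Tuple[str, str]:
    """Hypothetical helper: split 'oss://bucket/key' into (bucket, key)."""
    parts = urlsplit(oss_uri)
    if parts.scheme != "oss" or not parts.netloc:
        raise ValueError(f"not an OSS_URI: {oss_uri!r}")
    # Note: keys containing '?' or '#' would need extra handling with urlsplit.
    return parts.netloc, parts.path.lstrip("/")


def build_oss_uris(bucket: str, keys: Iterable[str]) -> List[str]:
    """Hypothetical helper: turn object keys into full OSS_URIs."""
    return [f"oss://{bucket}/{key.lstrip('/')}" for key in keys]


bucket, prefix = split_oss_uri("oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/")
uris = build_oss_uris(bucket, [prefix + "img001-00001.png", prefix + "img001-00002.png"])
print(bucket)
print(uris)
```

A list built this way has the same shape as the `uris` list passed to `from_objects`, while the prefix on its own corresponds to what `from_prefix` expects.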
Build a dataset from an OSS_URI prefix
This example shows how to use the `from_prefix` method of `OssMapDataset` to build a dataset from a specified prefix (OSS_URI) in OSS.
from osstorchconnector import OssMapDataset
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
REGION = "cn-beijing"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"
# Build a dataset using the from_prefix method of OssMapDataset.
map_dataset = OssMapDataset.from_prefix(oss_uri=OSS_URI, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)
# Randomly access an object in the created dataset.
item = map_dataset[0]
print(item.key)
print(item.size)
content = item.read()
print(len(content))
item.close()
# Traverse the objects in the dataset.
for item in map_dataset:
    print(item.key)
    print(item.size)
    content = item.read()
    print(len(content))
    item.close()
Build a dataset from a list of OSS_URIs
This example shows how to use the `from_objects` method of `OssMapDataset` to build a dataset from a specified list of OSS_URIs.
from osstorchconnector import OssMapDataset
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
REGION = "cn-beijing"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
# uris is an iterable of strings, each of which is an OSS_URI.
uris = [
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00001.png",
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00002.png",
    "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/img001-00003.png"
]
# Build a dataset using the from_objects method of OssMapDataset.
map_dataset = OssMapDataset.from_objects(object_uris=uris, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)
# Randomly access an object in the created dataset.
item = map_dataset[1]
print(item.key)
print(item.size)
content = item.read()
print(len(content))
item.close()
# Traverse the objects in the dataset.
for item in map_dataset:
    print(item.key)
    print(item.size)
    content = item.read()
    print(len(content))
    item.close()
Build a dataset from a manifest file
Before you can build a dataset from a manifest file, you must create the manifest file.
Create a manifest file.
Run the `touch manifest_file` command in any directory to create a manifest file. Then, populate the file as shown in the following examples, with one object per line.
Example of a manifest file with OSS object names:
Img/BadImag/Bmp/Sample001/img001-00001.png
Img/BadImag/Bmp/Sample001/img001-00002.png
Img/BadImag/Bmp/Sample001/img001-00003.png
Example of a manifest file with OSS object names and labels. When you use the built-in `imagenet_manifest_parser`, separate the name and the label with a tab character:
Img/BadImag/Bmp/Sample001/img001-00001.png	label1
Img/BadImag/Bmp/Sample001/img001-00002.png	label2
Img/BadImag/Bmp/Sample001/img001-00003.png	label3
Build the dataset from the manifest file.
This example shows how to use the `from_manifest_file` method of `OssMapDataset` to build a dataset from a specified manifest file.
from osstorchconnector import OssMapDataset
from osstorchconnector import imagenet_manifest_parser
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
REGION = "cn-beijing"
CONFIG_PATH = "/etc/oss-connector/config.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_BASE_URI = "oss://ai-testset/EnglistImg/"
# Build a dataset from a local file using the from_manifest_file method of OssMapDataset.
# The manifest_file_path parameter specifies the local path of the manifest file.
# The manifest_parser parameter specifies the method for parsing the manifest file. This example uses the built-in parser imagenet_manifest_parser.
# The oss_base_uri parameter specifies the base OSS_URI. It is concatenated with the URI parsed from the manifest to form the full OSS_URI. The format is: FULL_OSS_URI = BASE_OSS_URI + URI.
MANIFEST_FILE_LOCAL = "/path/to/manifest_file"
map_dataset = OssMapDataset.from_manifest_file(manifest_file_path=MANIFEST_FILE_LOCAL, manifest_parser=imagenet_manifest_parser, oss_base_uri=OSS_BASE_URI, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)
for item in map_dataset:
    print(item.key)
    print(item.size)
    print(item.label)
    content = item.read()
    print(len(content))
    item.close()
# Build a dataset from a manifest file in an OSS bucket using the from_manifest_file method of OssMapDataset.
MANIFEST_FILE_URI = "oss://ai-testset/EnglistImg/manifest_file"
map_dataset = OssMapDataset.from_manifest_file(manifest_file_path=MANIFEST_FILE_URI, manifest_parser=imagenet_manifest_parser, oss_base_uri=OSS_BASE_URI, endpoint=ENDPOINT, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)
for item in map_dataset:
    print(item.key)
    print(item.size)
    print(item.label)
    content = item.read()
    print(len(content))
    item.close()
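A manifest such as the one used above can also be generated with a short script instead of being edited by hand. The following sketch is an illustration under stated assumptions: `write_manifest` is a hypothetical helper, and it writes one object name per line with an optional tab-separated label, which is the layout that the built-in `imagenet_manifest_parser` shown later in this topic splits on.

```python
import os
import tempfile
from typing import Iterable, Optional, Tuple


def write_manifest(path: str, entries: Iterable[Tuple[str, Optional[str]]]) -> int:
    """Hypothetical helper: write 'key<TAB>label' lines; omit the label if it is None."""
    count = 0
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for key, label in entries:
            f.write(key if label is None else f"{key}\t{label}")
            f.write("\n")
            count += 1
    return count


entries = [
    ("Img/BadImag/Bmp/Sample001/img001-00001.png", "label1"),
    ("Img/BadImag/Bmp/Sample001/img001-00002.png", "label2"),
    ("Img/BadImag/Bmp/Sample001/img001-00003.png", None),
]
# Write to a temporary directory so that the sketch has no side effects elsewhere.
manifest_path = os.path.join(tempfile.mkdtemp(), "manifest_file")
written = write_manifest(manifest_path, entries)
with open(manifest_path, encoding="utf-8") as f:
    manifest_lines = f.read().splitlines()
print(written, manifest_lines)
```

The resulting local path could then be passed as `manifest_file_path`, or the file could be uploaded to a bucket and referenced by its OSS_URI.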
Data types in OSS Connector for AI/ML
The data type for objects in the dataset implements common I/O interfaces. For more information, see Data types in OSS Connector for AI/ML.
Parameters
When you build a dataset using `OssMapDataset` or `OssIterableDataset`, you can configure the following parameters.
Parameter | Type | Required | Description |
endpoint | string | Yes | Common parameter: The public endpoint of the OSS service. For more information, see Regions and endpoints. |
region | string | No | Common parameter: The OSS region, such as |
transform | object | No | Common parameter: A transform function that converts a DataObject (OSS object) to any type. You can customize this method as needed. For more information, see transform. Important Do not return a |
cred_path | string | Yes | Common parameter: The path of the access credential file. The default path is |
config_path | string | Yes | Common parameter: The path of the OSS Connector configuration file. The default path is |
oss_uri | string | Yes | Parameter for the `from_prefix` method: The path of an OSS resource. Use this parameter to build a dataset from an OSS_URI prefix. Only OSS_URIs that start with |
object_uris | string | Yes | Parameter for the `from_objects` method: A list of OSS resource paths. Use this parameter to build a dataset from the paths in the list. Only OSS_URIs that start with |
manifest_file_path | string | Yes | Parameter for the `from_manifest_file` method: The path of the manifest file. You can use a local file path or an OSS_URI that starts with |
manifest_parser | Callable Object | Yes | Parameter for the `from_manifest_file` method: The built-in method to parse the manifest file. It accepts an opened manifest file as input and returns an iterator. Each element in the iterator is a |
oss_base_uri | string | Yes | Parameter for the `from_manifest_file` method: The base OSS_URI. It is concatenated with a possibly incomplete OSS_URI from the manifest file to form a complete OSS_URI. If you do not have a base OSS_URI, set this parameter to |
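Read literally, the rule for `oss_base_uri` given in the manifest example (FULL_OSS_URI = BASE_OSS_URI + URI) is plain string concatenation, so a trailing slash on the base would matter. A minimal sketch, in which `full_oss_uri` is a hypothetical helper that only mimics the documented rule and is not the connector's own code:

```python
def full_oss_uri(oss_base_uri: str, uri: str) -> str:
    """Hypothetical helper mimicking FULL_OSS_URI = BASE_OSS_URI + URI."""
    return oss_base_uri + uri


base = "oss://ai-testset/EnglistImg/"
key = "Img/BadImag/Bmp/Sample001/img001-00001.png"
joined = full_oss_uri(base, key)
# Without the trailing slash, literal concatenation fuses the two path segments.
fused = full_oss_uri(base.rstrip("/"), key)
print(joined)
print(fused)
```

Under this reading, keeping the trailing slash on the base and omitting a leading slash from manifest entries avoids malformed paths.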
Built-in methods
transform
When you build a dataset, the dataset provides an iterator that yields the result of `transform(DataObject)`. `DataObject` is a data type in OSS Connector for AI/ML.
The `transform` method can be customized. If you do not specify a `transform` method when you build a dataset, the default method is used.
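The relationship between a dataset and its `transform` can be sketched without touching OSS. The classes below are in-memory stand-ins, not the connector's implementation: `FakeDataObject` and `FakeMapDataset` are hypothetical names that only imitate the behavior described above, namely that each access yields `transform(DataObject)`.

```python
import io
from typing import Any, Callable, List, Tuple


class FakeDataObject:
    """Stand-in for the connector's DataObject (hypothetical, in-memory)."""

    def __init__(self, key: str, data: bytes):
        self.key = key
        self.size = len(data)
        self._stream = io.BytesIO(data)

    def read(self) -> bytes:
        return self._stream.read()


class FakeMapDataset:
    """Stand-in map-style dataset that applies transform on every access."""

    def __init__(self, objects: List[Tuple[str, bytes]],
                 transform: Callable[[FakeDataObject], Any]):
        self._objects = objects
        self._transform = transform

    def __len__(self) -> int:
        return len(self._objects)

    def __getitem__(self, index: int) -> Any:
        key, data = self._objects[index]
        return self._transform(FakeDataObject(key, data))


def key_and_length(obj: FakeDataObject) -> Tuple[str, int]:
    # Custom transform: reduce the object to plain Python values.
    return obj.key, len(obj.read())


dataset = FakeMapDataset([("a.png", b"\x00" * 10), ("b.png", b"\x01" * 3)], key_and_length)
items = [dataset[i] for i in range(len(dataset))]
print(items)
```

The point of the sketch is that callers never see the raw object, only whatever the transform returns.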
Default transform method
The following example shows the default `transform` method. You do not need to specify it when you build a dataset.
# Default transform function
def identity(obj: DataObject) -> DataObject:
    if obj is not None:
        return obj.copy()
    else:
        return None
Custom transform method
The following example shows how to use a custom `transform` method when you build a dataset.
import io
import torchvision.transforms as transforms
from PIL import Image
from osstorchconnector import OssMapDataset
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
REGION = "cn-beijing"
CONFIG_PATH = "/etc/oss-connector/config.test.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"
# Define the transformation operations for image data.
trans = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
# Create a transform method to process the input object.
def transform(object):
    try:
        img = Image.open(io.BytesIO(object.read())).convert('RGB')
        val = trans(img)
    except Exception as e:
        raise e
    return object.key, val
# Use the transform=transform parameter when you build the dataset.
map_dataset = OssMapDataset.from_prefix(OSS_URI, endpoint=ENDPOINT, transform=transform, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)
for item in map_dataset:
    print(item[0])
    print(item[1].shape)
manifest_parser
To use the default `manifest_parser` method when you build a dataset, import it as shown in the following example.
from osstorchconnector import imagenet_manifest_parser
The following code shows the default `manifest_parser` method.
import io
import logging
from typing import Iterable, Tuple
def imagenet_manifest_parser(reader: io.IOBase) -> Iterable[Tuple[str, str]]:
    lines = reader.read().decode("utf-8").strip().split("\n")
    for i, line in enumerate(lines):
        try:
            items = line.strip().split('\t')
            if len(items) >= 2:
                key = items[0]
                label = items[1]
                yield (key, label)
            elif len(items) == 1:
                key = items[0]
                yield (key, '')
            else:
                raise ValueError("format error")
        except ValueError as e:
            logging.error(f"Error: {e} for line {i}: {line}")
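Because `manifest_parser` is typed as a callable, a parser for a different manifest layout would presumably follow the same shape: take the opened manifest and yield (key, label) pairs. The following sketch assumes a comma-separated layout. `csv_manifest_parser` is a hypothetical name, and whether your version of the connector accepts a custom parser is an assumption to verify before relying on it.

```python
import csv
import io
from typing import Iterable, Tuple


def csv_manifest_parser(reader: io.IOBase) -> Iterable[Tuple[str, str]]:
    """Hypothetical parser: each row is 'key' or 'key,label'; blank rows are skipped."""
    text = reader.read().decode("utf-8")
    for row in csv.reader(io.StringIO(text)):
        if not row or not row[0].strip():
            continue
        key = row[0].strip()
        label = row[1].strip() if len(row) >= 2 else ""
        yield (key, label)


# Exercise the parser on an in-memory manifest instead of a real file.
sample = io.BytesIO(b"Img/a.png,label1\n\nImg/b.png\n")
parsed = list(csv_manifest_parser(sample))
print(parsed)
```

Like the built-in parser, this sketch yields an empty string as the label when a row contains only an object name.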
Create a PyTorch data loader from a dataset
The following example shows how to create a PyTorch data loader that uses a dataset built by `OssMapDataset` as the data source.
import torch
from osstorchconnector import OssMapDataset
ENDPOINT = "http://oss-cn-beijing-internal.aliyuncs.com"
REGION = "cn-beijing"
CONFIG_PATH = "/etc/oss-connector/config.test.json"
CRED_PATH = "/root/.alibabacloud/credentials"
OSS_URI = "oss://ai-testset/EnglistImg/Img/BadImag/Bmp/Sample001/"
def transform(object):
    data = object.read()
    return object.key, object.label
# Build a dataset using the from_prefix method of OssMapDataset.
map_dataset = OssMapDataset.from_prefix(OSS_URI, endpoint=ENDPOINT, transform=transform, cred_path=CRED_PATH, config_path=CONFIG_PATH, region=REGION)
# Create a PyTorch data loader based on map_dataset.
loader = torch.utils.data.DataLoader(map_dataset, batch_size=256, num_workers=32, prefetch_factor=2, shuffle=True)
# Use the data in the training loop.
# for batch in loader:
#     Perform training operations.
References
You can also use OSS Connector for AI/ML for data training tasks in a containerized environment. For more information, see Build a Docker image that contains the OSS Connector for AI/ML environment.