MaxFrame is a distributed AI computing engine that supports multimodal data processing. It processes large multimodal datasets (such as images, videos, and audio) at scale, with operations such as image decoding, video frame extraction, and speech recognition, and supplies preprocessed data for large model training and inference.
Overview
Challenges
In large-scale AI model training scenarios, processing multimodal data like images and audio presents the following challenges:
Low processing efficiency: Single-machine processing cannot handle large-scale multimodal data, resulting in long processing cycles.
Complex O&M: Requires building and maintaining dedicated clusters for data processing services.
Siloed development: Separate pipelines for structured data processing, multimodal data processing, and AI inference increase development complexity.
Key features
MaxFrame provides built-in multimodal data processing operators. It integrates multimodal processing with AI inference by combining large-scale data processing with the offline large model inference capabilities of AI-Functions.
Distributed processing
Powered by the MaxCompute distributed computing engine, MaxFrame supports parallel processing of massive multimodal datasets.
Native OSS integration
It allows direct access to multimodal data in OSS, eliminating the need for data migration.
Rich set of operators
Includes a comprehensive set of operators for image processing (decoding, property extraction, resizing, cropping, and format conversion) and audio processing (decoding, transcription, language detection, and voice activity detection (VAD)).
Declarative API
Provides a pandas-like API that supports method chaining, which shortens the learning curve.
Security authentication
Supports using a RAM role (role_arn) to access OSS, eliminating the need to hardcode AccessKey pairs.
Use cases
Image data cleaning: Filter out invalid images and select images that meet specific criteria.
Image preprocessing: Prepare training data for AI models, such as resizing, cropping, and format conversion.
Audio transcription: Transcribe audio files to text in batches with support for multi-language recognition.
Multimodal dataset construction: Build large-scale image and audio datasets for model training.
Prerequisites
Python version: Python 3.11 or later.
SDK version: MaxFrame SDK 2.6.0 or later.
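Both requirements can also be checked from Python. The following is a minimal sketch, not part of the MaxFrame SDK: the helper names (parse_version, meets_minimum, check_environment) are hypothetical, and the sketch assumes the installed distribution is named maxframe, as in the pip command in this section.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 11)   # Python 3.11 or later
MIN_SDK = (2, 6, 0)    # MaxFrame SDK 2.6.0 or later


def parse_version(text):
    """Turn '2.6.0' (or '2.6.0rc1') into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(version_text, minimum):
    """Return True if version_text is at least the minimum tuple."""
    parsed = parse_version(version_text)
    # Pad short versions so that '2.6' compares equal to (2, 6, 0).
    padded = parsed + (0,) * (len(minimum) - len(parsed))
    return padded >= minimum


def check_environment():
    """Return (python_ok, sdk_ok, sdk_version) for the current interpreter."""
    python_ok = sys.version_info[:2] >= MIN_PYTHON
    try:
        sdk_version = metadata.version("maxframe")
    except metadata.PackageNotFoundError:
        sdk_version = None  # SDK is not installed
    sdk_ok = sdk_version is not None and meets_minimum(sdk_version, MIN_SDK)
    return python_ok, sdk_ok, sdk_version


if __name__ == "__main__":
    print(check_environment())
```

If sdk_ok is False, upgrading the SDK should resolve it.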
# Check the version
python -c "import maxframe; print(maxframe.__version__)"
# Upgrade the SDK
pip install --upgrade maxframe
Operators
Image operators
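Operator | Description | Example |
url.download() | Downloads a file from a URL or OSS and returns the content as a binary byte stream. | df["oss_path"].url.download(storage_options={"role_arn": ROLE_ARN}) |
image.decode() | Decodes image bytes into an image object. | df["img_bytes"].image.decode() |
image.width | Returns the image width in pixels. | df["img_obj"].image.width |
image.height | Returns the image height in pixels. | df["img_obj"].image.height |
image.size | Returns the image file size in bytes. | df["img_obj"].image.size |
image.format | Returns the image format, such as JPEG or PNG. | df["img_obj"].image.format |
image.mode | Returns the image color mode, such as RGB, RGBA, or L. | df["img_obj"].image.mode |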
Audio operators
Operator | Description | Example |
| Downloads an audio file from a URL or OSS and returns the content as a binary byte stream. | |
| Decodes an audio byte stream into an AudioObject. | |
| Returns the number of audio channels. | |
| Returns the audio file size in bytes. | |
| Returns the audio sample rate in Hz. | |
| Returns the audio duration in seconds. | |
| Returns the audio encoding format. | |
| Automatically detects the language of the audio. | |
| Transcribes speech to text and automatically detects the language. | |
| Transcribes speech to text for a specified language. | |
| Performs voice activity detection (VAD) and returns segments of active speech. | |
OSS access authentication
The url.download() method uses the storage_options parameter to configure OSS access credentials:
# Authorize with a RAM role (recommended, automatically requests a temporary STS token)
df["img_bytes"] = df["oss_path"].url.download(
    storage_options={"role_arn": "acs:ram::[account-id]:role/[role-name]"}
)
Use cases
The following examples demonstrate typical use cases for multimodal operators. All examples assume that you have created a session and configured the engine.
We recommend setting DPE as the preferred engine using options.dag.settings = {"engine_order": ["DPE", "MCSQL", "SPE"]} for optimal multimodal data processing performance.
Use case 1: Extract and filter image properties
Download images from OSS in batches, extract their properties, filter them based on conditions, and write the results to a sink table.
from maxframe import dataframe as md

# RAM role used to access OSS. Replace the bracketed placeholders with your own values.
ROLE_ARN = "acs:ram::[account-id]:role/[role-name]"

# Read the source table that contains the OSS paths of images.
df = md.read_odps_table("image_src_table")

# Download, decode, and extract properties using method chaining.
df["img_bytes"] = df["oss_path"].url.download(
    storage_options={"role_arn": ROLE_ARN}
)
df["img_obj"] = df["img_bytes"].image.decode()
df["width"] = df["img_obj"].image.width
df["height"] = df["img_obj"].image.height
df["size"] = df["img_obj"].image.size
df["format"] = df["img_obj"].image.format
df["mode"] = df["img_obj"].image.mode

# Filter images by width and height ranges.
df_filtered = df[
    df["width"].between(1000, 5000) &
    df["height"].between(2000, 6000)
]

# Write the results to the destination table.
df_filtered[
    ["id", "oss_path", "width", "height", "size", "format", "mode"]
].to_odps_table("image_sink_table", overwrite=True).execute()
Use case 2: Preprocess images
Resize, crop, and convert image formats to prepare training data for AI models.
from maxframe import dataframe as md

# RAM role used to access OSS. Replace the bracketed placeholders with your own values.
ROLE_ARN = "acs:ram::[account-id]:role/[role-name]"

df = md.read_odps_table("image_src_table")
df["img_bytes"] = df["oss_path"].url.download(
    storage_options={"role_arn": ROLE_ARN}
)
df["img_obj"] = df["img_bytes"].image.decode()

# Resize images to 224x224, a common input size for models.
df["img_resized"] = df["img_obj"].image.resize((224, 224))

# Crop a specific region of an image (left, top, right, bottom).
df["img_cropped"] = df["img_obj"].image.crop((100, 100, 500, 500))

# Convert the color mode, for example, from RGBA to RGB.
df["img_rgb"] = df["img_obj"].image.convert("RGB")

# These steps are built lazily. Nothing runs until execute() is called on a result.
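The crop box in this use case is given as (left, top, right, bottom). The following plain-Python sketch shows how such a box maps to an output size and how to check that it lies inside an image before cropping. It assumes pixel coordinates with the origin at the top-left corner and exclusive right and bottom edges, which is the convention Pillow uses; whether MaxFrame treats out-of-range boxes the same way is not stated here. The helper names crop_size and box_fits are hypothetical and not part of MaxFrame.

```python
def crop_size(box):
    """Return (width, height) of a (left, top, right, bottom) crop box."""
    left, top, right, bottom = box
    if right <= left or bottom <= top:
        raise ValueError(f"empty or inverted crop box: {box}")
    return right - left, bottom - top


def box_fits(box, image_width, image_height):
    """Return True if the crop box lies entirely inside the image."""
    left, top, right, bottom = box
    return (
        0 <= left < right <= image_width
        and 0 <= top < bottom <= image_height
    )


if __name__ == "__main__":
    box = (100, 100, 500, 500)
    print(crop_size(box))             # size of the cropped region
    print(box_fits(box, 1000, 2000))  # box inside a 1000x2000 image
    print(box_fits(box, 400, 400))    # box exceeds a 400x400 image
```

For the box used above, crop_size((100, 100, 500, 500)) returns (400, 400). A check like box_fits can be combined with the width and height properties extracted in use case 1 to skip images that are too small for the crop.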
Usage notes
Item | Description |
Engine configuration | Multimodal operators require the DPE engine. Make sure that DPE is prioritized in the |
OSS path format | Use the full internal endpoint of OSS: |
Lazy execution | All operator actions are built lazily. You must call |
Security authentication | Use a RAM role ( |
Resource cleanup | After processing, call |
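The engine and authentication notes above can be collected in a small configuration helper. This is a sketch in plain Python rather than a MaxFrame API: the function names are hypothetical, the ARN pattern is the one shown in the OSS access authentication section, and the default engine order is the one recommended earlier in this document.

```python
def build_role_arn(account_id, role_name):
    """Format a RAM role ARN using the pattern acs:ram::<account>:role/<name>."""
    if not account_id or not role_name:
        raise ValueError("account_id and role_name are both required")
    return f"acs:ram::{account_id}:role/{role_name}"


def build_storage_options(account_id, role_name):
    """Return the dict to pass as storage_options to url.download()."""
    return {"role_arn": build_role_arn(account_id, role_name)}


def build_engine_settings(preferred="DPE", fallbacks=("MCSQL", "SPE")):
    """Return a settings dict that lists the preferred engine first."""
    order = [preferred] + [name for name in fallbacks if name != preferred]
    return {"engine_order": order}


if __name__ == "__main__":
    # The account ID and role name below are made-up placeholders.
    print(build_storage_options("1234567890", "example-role"))
    print(build_engine_settings())
```

The result of build_engine_settings() is intended to be assigned to options.dag.settings, and the result of build_storage_options() to be passed as the storage_options argument of url.download(). Keeping both in one place avoids repeating the ARN string in every pipeline.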
FAQ
Q1: How do I troubleshoot image download failures?
Verify that the OSS path format is correct (you must use an internal endpoint).
Verify that the RAM role has the required permissions to access OSS.
Review the error logs in Logview for details.
Q2: How do I process data at scale?
Multimodal operators run in parallel on the MaxFrame DPE engine. The engine automatically handles data slicing and task scheduling, so you do not need to manually configure parallelism. For custom processing logic, you can use the MaxFrame apply_chunk operator for batch processing.
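The batch pattern behind custom chunk processing can be illustrated in plain Python. The sketch below is not the apply_chunk API and does not reproduce its signature. It only shows the idea of handing a user function one batch of rows at a time and concatenating the results. The names iter_batches and process_in_batches are hypothetical.

```python
from itertools import islice


def iter_batches(rows, batch_rows):
    """Yield consecutive lists of at most batch_rows items from rows."""
    if batch_rows <= 0:
        raise ValueError("batch_rows must be positive")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_rows))
        if not batch:
            return
        yield batch


def process_in_batches(rows, batch_fn, batch_rows):
    """Apply batch_fn to each batch and concatenate the outputs in order."""
    results = []
    for batch in iter_batches(rows, batch_rows):
        results.extend(batch_fn(batch))
    return results


def keep_wide_images(batch):
    """Example batch function: keep rows whose width is at least 1000."""
    return [row for row in batch if row["width"] >= 1000]


if __name__ == "__main__":
    rows = [{"id": i, "width": w} for i, w in enumerate([800, 1200, 3000, 640, 1000])]
    print(process_in_batches(rows, keep_wide_images, batch_rows=2))
```

In a distributed engine the batches would be processed in parallel on different workers rather than in a loop, but the contract for the user function is the same: it receives a batch and returns the processed rows for that batch.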