MaxCompute: MaxFrame multimodal data processing operators

Last Updated: May 14, 2026

MaxFrame is a distributed AI computing engine that provides multimodal data processing. It processes large image, video, and audio datasets in parallel, with operations such as image decoding, video frame extraction, and speech recognition, and serves as the data preprocessing stage for large model training and inference.

Overview

Challenges

In large-scale AI model training scenarios, processing multimodal data like images and audio presents the following challenges:

  • Low processing efficiency: Single-machine processing cannot handle large-scale multimodal data, resulting in long processing cycles.

  • Complex O&M: Dedicated clusters must be built and maintained for data processing services.

  • Siloed development: Separate pipelines for structured data processing, multimodal data processing, and AI inference increase development complexity.

Key features

MaxFrame provides built-in multimodal data processing operators. It integrates multimodal processing with AI inference by combining large-scale data processing with the offline large model inference capabilities of AI-Functions.

  • Distributed processing

    Powered by the MaxCompute distributed computing engine, MaxFrame supports parallel processing of massive multimodal datasets.

  • Native OSS integration

    Accesses multimodal data in OSS directly, eliminating the need for data migration.

  • Rich set of operators

    Includes a comprehensive set of operators for image processing (decoding, property extraction, resizing, cropping, and format conversion) and audio processing (decoding, transcription, language detection, and voice activity detection (VAD)).

  • Declarative API

    Provides a pandas-like API with method chaining, which reduces the learning curve.

  • Security authentication

    Supports using a RAM role (role_arn) to access OSS, eliminating the need to hardcode AccessKey pairs.

Use cases

  • Image data cleaning: Filter out invalid images and select images that meet specific criteria.

  • Image preprocessing: Prepare training data for AI models, such as resizing, cropping, and format conversion.

  • Audio transcription: Transcribe audio files to text in batches with support for multi-language recognition.

  • Multimodal dataset construction: Build large-scale image and audio datasets for model training.

Prerequisites

  • Python version: Python 3.11 or later.

  • SDK version: MaxFrame SDK 2.6.0 or later.

# Check the version
python -c "import maxframe; print(maxframe.__version__)"

# Upgrade the SDK
pip install --upgrade maxframe
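
The same prerequisites can also be checked from Python before a job starts. The following is a minimal sketch that uses only the standard library. The helper names (parse_version, meets_minimum, check_prerequisites) are illustrative and not part of MaxFrame, and the parser assumes a plain dotted numeric version string such as 2.6.0.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 11)      # from the prerequisites above
MIN_MAXFRAME = "2.6.0"    # from the prerequisites above


def parse_version(text):
    """Turn '2.6.0' into (2, 6, 0). Stops at the first non-numeric part
    and pads with zeros, so '2.6' also becomes (2, 6, 0)."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check_prerequisites():
    """Return a list of human-readable problems; empty means all good."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python 3.11 or later is required")
    try:
        installed = metadata.version("maxframe")
    except metadata.PackageNotFoundError:
        problems.append("maxframe is not installed")
    else:
        if not meets_minimum(installed, MIN_MAXFRAME):
            problems.append(f"maxframe {installed} is older than {MIN_MAXFRAME}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```

An empty list from check_prerequisites() means both version requirements are met.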

Operators

Image operators

| Operator | Description | Example |
| --- | --- | --- |
| url.download() | Downloads a file from a URL or OSS and returns the content as a binary byte stream. | df["path"].url.download(storage_options={...}) |
| image.decode() | Decodes image bytes into an image object. | df["bytes"].image.decode() |
| image.width | Returns the image width in pixels. | df["img"].image.width |
| image.height | Returns the image height in pixels. | df["img"].image.height |
| image.size | Returns the image file size in bytes. | df["img"].image.size |
| image.format | Returns the image format, such as JPEG or PNG. | df["img"].image.format |
| image.mode | Returns the image color mode, such as RGB, RGBA, or L. | df["img"].image.mode |
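
To make the property operators more concrete, the sketch below reads the width and height out of a PNG byte stream using only the standard library. It illustrates where such values live in an image file. It is not how MaxFrame implements image.width or image.height, and the helper names make_png_header and read_png_size are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_png_header(width, height):
    """Build the PNG signature plus an IHDR chunk for an 8-bit RGB image.
    No pixel data is included; this is only enough to carry the dimensions."""
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    chunk = b"IHDR" + ihdr
    return (
        PNG_SIGNATURE
        + struct.pack(">I", len(ihdr))
        + chunk
        + struct.pack(">I", zlib.crc32(chunk))
    )


def read_png_size(data):
    """Return (width, height) from the IHDR chunk of a PNG byte stream."""
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG byte stream")
    width, height = struct.unpack(">II", data[16:24])
    return width, height


header = make_png_header(1920, 1080)
print(read_png_size(header))
```

A byte stream that fails this kind of header check is the sort of invalid input that an image cleaning step would filter out.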

Audio operators

| Operator | Description | Example |
| --- | --- | --- |
| url.download() | Downloads an audio file from a URL or OSS and returns the content as a binary byte stream. | df["audio_path"].url.download(storage_options={...}) |
| audio.decode() | Decodes an audio byte stream into an AudioObject. | df["audio_bytes"].audio.decode() |
| audio.channels | Returns the number of audio channels. | df["audio_obj"].audio.channels |
| audio.size | Returns the audio file size in bytes. | df["audio_obj"].audio.size |
| audio.sample_rate | Returns the audio sample rate in Hz. | df["audio_obj"].audio.sample_rate |
| audio.duration | Returns the audio duration in seconds. | df["audio_obj"].audio.duration |
| audio.format | Returns the audio encoding format. | df["audio_obj"].audio.format |
| audio.detect_language() | Automatically detects the language of the audio. | df["audio_bytes"].audio.detect_language() |
| audio.transcribe() | Transcribes speech to text and automatically detects the language. | df["audio_bytes"].audio.transcribe() |
| audio.transcribe(language="zh") | Transcribes speech to text for a specified language. | df["audio_bytes"].audio.transcribe(language="zh") |
| audio.vad_detect() | Performs voice activity detection (VAD) and returns segments of active speech. | df["audio_bytes"].audio.vad_detect() |
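
The audio properties are related to one another: the duration equals the number of frames divided by the sample rate. The sketch below shows this locally with the standard-library wave module on an in-memory WAV file. It is an illustration of what the properties mean, not MaxFrame's implementation, and the helper names make_silent_wav and describe_wav are hypothetical.

```python
import io
import wave


def make_silent_wav(seconds, sample_rate=16000, channels=1):
    """Create an in-memory 16-bit PCM WAV file that contains only silence."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        n_frames = int(seconds * sample_rate)
        wav.writeframes(b"\x00\x00" * n_frames * channels)
    return buffer.getvalue()


def describe_wav(data):
    """Return channels, sample rate (Hz), duration (s), and size (bytes)."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "duration": frames / rate,
            "size": len(data),
        }


info = describe_wav(make_silent_wav(0.5))
print(info)
```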

OSS access authentication

The url.download() method uses the storage_options parameter to configure OSS access credentials:

# Authorize with a RAM role (recommended, automatically requests a temporary STS token)
df["img_bytes"] = df["oss_path"].url.download(
    storage_options={"role_arn": "acs:ram::[account-id]:role/[role-name]"}
)
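
When several jobs share the same role, it can help to build the storage_options dictionary in one place. The following is a minimal sketch. The helpers build_role_arn and build_storage_options are hypothetical and not part of MaxFrame, the account ID and role name in the usage line are made-up placeholders, and the digits-only check assumes that Alibaba Cloud account IDs are numeric.

```python
import re


def build_role_arn(account_id, role_name):
    """Format a RAM role ARN in the acs:ram::<account-id>:role/<role-name> form."""
    if not re.fullmatch(r"\d+", account_id):
        raise ValueError("account_id must contain digits only")
    if not role_name:
        raise ValueError("role_name must not be empty")
    return f"acs:ram::{account_id}:role/{role_name}"


def build_storage_options(account_id, role_name):
    """Return the dictionary passed as storage_options to url.download()."""
    return {"role_arn": build_role_arn(account_id, role_name)}


# Placeholder values for illustration only.
options = build_storage_options("1234567890123456", "maxframe-oss-reader")
print(options)
```

The resulting dictionary can then be passed as storage_options=options in the url.download() call shown above.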

Use cases

The following examples demonstrate typical use cases for multimodal operators. All examples assume that you have created a session and configured the engine.

We recommend setting DPE as the preferred engine using options.dag.settings = {"engine_order": ["DPE", "MCSQL", "SPE"]} for optimal multimodal data processing performance.
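
A small guard can confirm that DPE comes first before the settings are applied. This is a sketch in plain Python: the helpers build_engine_settings and dpe_is_prioritized are hypothetical, and the actual assignment to options.dag.settings is left as a comment because it needs a MaxFrame installation and session.

```python
PREFERRED_ENGINE = "DPE"


def build_engine_settings(fallbacks=("MCSQL", "SPE")):
    """Return a settings dictionary with DPE first, followed by the fallbacks."""
    order = [PREFERRED_ENGINE]
    order += [engine for engine in fallbacks if engine != PREFERRED_ENGINE]
    return {"engine_order": order}


def dpe_is_prioritized(settings):
    """Return True if DPE is the first entry in engine_order."""
    order = settings.get("engine_order", [])
    return bool(order) and order[0] == PREFERRED_ENGINE


settings = build_engine_settings()
if not dpe_is_prioritized(settings):
    raise RuntimeError("DPE must be first in engine_order")

# With MaxFrame available, the recommendation above corresponds to:
#   options.dag.settings = settings
print(settings)
```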

Use case 1: Extract and filter image properties

Download images from OSS in batches, extract their properties, filter them based on conditions, and write the results to a sink table.

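from maxframe import dataframe as md

# Placeholder RAM role ARN; replace with your own account ID and role name.
ROLE_ARN = "acs:ram::[account-id]:role/[role-name]"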

# Read the source table that contains the OSS paths of images.
df = md.read_odps_table("image_src_table")

# Download, decode, and extract properties using method chaining.
df["img_bytes"] = df["oss_path"].url.download(
    storage_options={"role_arn": ROLE_ARN}
)
df["img_obj"] = df["img_bytes"].image.decode()

df["width"]  = df["img_obj"].image.width
df["height"] = df["img_obj"].image.height
df["size"]   = df["img_obj"].image.size
df["format"] = df["img_obj"].image.format
df["mode"]   = df["img_obj"].image.mode

# Filter images by width and height ranges.
df_filtered = df[
    df["width"].between(1000, 5000) &
    df["height"].between(2000, 6000)
]

# Write the results to the destination table.
df_filtered[
    ["id", "oss_path", "width", "height", "size", "format", "mode"]
].to_odps_table("image_sink_table", overwrite=True).execute()
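
The filter condition can be checked locally on a few sample values before running the job. The sketch below mirrors the condition in plain Python. It assumes that between() in MaxFrame follows the pandas default and includes both endpoints. The helper names and sample file names are hypothetical.

```python
def between(value, low, high):
    """Inclusive on both ends, like pandas Series.between with default settings."""
    return low <= value <= high


def keep_image(width, height):
    """Mirror of the filter above: width in [1000, 5000], height in [2000, 6000]."""
    return between(width, 1000, 5000) and between(height, 2000, 6000)


# (file name, width, height) samples, including boundary values.
samples = [
    ("a.jpg", 1000, 2000),  # exactly on the lower bounds
    ("b.jpg", 999, 3000),   # width too small
    ("c.jpg", 4000, 6001),  # height too large
    ("d.jpg", 5000, 6000),  # exactly on the upper bounds
]

kept = [name for name, width, height in samples if keep_image(width, height)]
print(kept)
```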

Use case 2: Preprocess images

Resize, crop, and convert image formats to prepare training data for AI models.

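from maxframe import dataframe as md

# Placeholder RAM role ARN; replace with your own account ID and role name.
ROLE_ARN = "acs:ram::[account-id]:role/[role-name]"

df = md.read_odps_table("image_src_table")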

df["img_bytes"] = df["oss_path"].url.download(
    storage_options={"role_arn": ROLE_ARN}
)
df["img_obj"] = df["img_bytes"].image.decode()

# Resize images to 224x224, a common input size for models.
df["img_resized"] = df["img_obj"].image.resize((224, 224))

# Crop a specific region of an image (left, top, right, bottom).
df["img_cropped"] = df["img_obj"].image.crop((100, 100, 500, 500))

# Convert the color mode, for example, from RGBA to RGB.
df["img_rgb"] = df["img_obj"].image.convert("RGB")

# The steps above are lazy: nothing runs until .execute() is called on a result.
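
The geometry of these operations can be worked out ahead of time. The sketch below computes the output size of a crop box and checks whether a resize target changes the aspect ratio. It assumes that the crop box follows the Pillow convention, where right and bottom are exclusive, so the output is (right - left) by (bottom - top) pixels. The helper names are hypothetical.

```python
def crop_size(box):
    """Return (width, height) of the region described by (left, top, right, bottom)."""
    left, top, right, bottom = box
    if right <= left or bottom <= top:
        raise ValueError("crop box must have positive width and height")
    return (right - left, bottom - top)


def changes_aspect_ratio(original_size, target_size):
    """Return True if resizing to target_size would stretch or squash the image."""
    orig_w, orig_h = original_size
    target_w, target_h = target_size
    # Cross-multiply to compare the ratios without floating-point division.
    return orig_w * target_h != orig_h * target_w


print(crop_size((100, 100, 500, 500)))
print(changes_aspect_ratio((1920, 1080), (224, 224)))
print(changes_aspect_ratio((400, 400), (224, 224)))
```

Under that assumption, cropping a square region first and then resizing it to 224x224 avoids distorting the image.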

Usage notes

| Item | Description |
| --- | --- |
| Engine configuration | Multimodal operators require the DPE engine. Make sure that DPE is prioritized in the engine_order setting. |
| OSS path format | Use the full internal endpoint of OSS: oss://oss-cn-<region>-internal.aliyuncs.com/<bucket>/<path>. |
| Lazy execution | All operator actions are built lazily. You must call .execute() to trigger execution. |
| Security authentication | Use a RAM role (role_arn) to avoid hardcoding AccessKey pairs. |
| Resource cleanup | After processing, call session.destroy() to release cluster resources. |

FAQ

Q1: How do I troubleshoot image download failures?

  1. Verify that the OSS path format is correct (you must use an internal endpoint).

  2. Verify that the RAM role has the required permissions to access OSS.

  3. Review the error logs in Logview for details.
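
The first check in the list can be automated before a job is submitted. The sketch below validates paths against the internal endpoint format given in the usage notes, oss://oss-cn-<region>-internal.aliyuncs.com/<bucket>/<path>. It covers only that documented pattern. The helper names, region, and bucket name are hypothetical examples.

```python
import re

# Matches only the documented internal endpoint form.
OSS_INTERNAL_PATTERN = re.compile(
    r"^oss://oss-cn-(?P<region>[a-z0-9-]+)-internal\.aliyuncs\.com"
    r"/(?P<bucket>[^/]+)/(?P<path>.+)$"
)


def build_oss_path(region, bucket, key):
    """Build a full internal-endpoint OSS path from its parts."""
    return f"oss://oss-cn-{region}-internal.aliyuncs.com/{bucket}/{key.lstrip('/')}"


def is_internal_oss_path(path):
    """Return True if the path uses the documented internal endpoint format."""
    return OSS_INTERNAL_PATTERN.match(path) is not None


good = build_oss_path("hangzhou", "my-bucket", "images/cat.jpg")
bad = "oss://oss-cn-hangzhou.aliyuncs.com/my-bucket/images/cat.jpg"

print(good, is_internal_oss_path(good))
print(bad, is_internal_oss_path(bad))
```

Running a check like this over the source table's path column separates path format problems from permission problems before Logview is consulted.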

Q2: How do I process data at scale?

Multimodal operators run in parallel on the MaxFrame DPE engine. The engine automatically handles data slicing and task scheduling, so you do not need to manually configure parallelism. For custom processing logic, you can use the MaxFrame apply_chunk operator for batch processing.