MaxFrame multimodal data processing operators (image and audio) - MaxCompute

MaxFrame provides distributed multimodal data processing operators for large-scale image, video, and audio workloads, including image decoding, video frame extraction, and speech recognition, to prepare data for AI model training and inference.

Overview

Challenges

Processing multimodal data (images, audio) at scale for AI model training presents several challenges:

Low throughput: Single-machine processing cannot handle large-scale multimodal data efficiently.
Complex operations: Building and maintaining dedicated processing clusters adds operational overhead.
Fragmented pipelines: Structured data processing, multimodal processing, and AI inference run on separate systems, increasing development complexity.

Key features

MaxFrame provides built-in multimodal operators that unify large-scale data processing with AI Function offline inference in a single pipeline.

Distributed processing: Processes massive multimodal datasets in parallel on the MaxCompute distributed engine.
Native OSS integration: Reads multimodal data directly from OSS without data migration.
Rich operator set: Covers image operations (decode, property extraction, resize, crop, format conversion) and audio operations (decode, transcription, language detection, voice activity detection (VAD)).
Declarative API: Pandas-like API with chainable calls for a minimal learning curve.
Secure authentication: Accesses OSS through RAM roles (role_arn), so no hard-coded AK/SK is required.

Use cases

Image data cleaning: Filter invalid images and select those that meet size or format requirements.
Image preprocessing: Prepare training data through resize, crop, and format conversion.
Audio transcription: Batch-transcribe audio to text with multi-language support.
Multimodal dataset construction: Build large-scale image and audio datasets for model training.

Requirements

Python: 3.11 or later
MaxFrame SDK: 2.6.0 or later

# Check the installed SDK version
python -c "import maxframe; print(maxframe.__version__)"

# Upgrade the SDK
pip install --upgrade maxframe
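
Both requirements can also be checked from Python before a pipeline is built. The following is a minimal sketch, not part of MaxFrame: `meets_minimum` is a hypothetical helper that compares a version string (such as the one printed by `maxframe.__version__`) against the 2.6.0 minimum. It assumes the string starts with dot-separated integers and ignores any suffix such as `rc1`.

```python
import re
import sys


def meets_minimum(version: str, minimum: tuple = (2, 6, 0)) -> bool:
    """Return True if the leading numeric part of `version` is >= `minimum`."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = tuple(int(p) for p in match.group().split("."))
    # Pad both tuples to the same length so "2.6" compares equal to (2, 6, 0)
    width = max(len(parts), len(minimum))
    parts += (0,) * (width - len(parts))
    padded_min = tuple(minimum) + (0,) * (width - len(minimum))
    return parts >= padded_min


# Python requirement from this section: 3.11 or later
print("Python 3.11+:", sys.version_info >= (3, 11))
# Sample string used for illustration; pass maxframe.__version__ in practice
print("SDK 2.6.0+:", meets_minimum("2.6.1"))
```

Comparing integer tuples rather than raw strings avoids the common mistake of ranking "2.10.0" below "2.6.0".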

Operator reference

Image operators

Operator	Description	Example
`url.download()`	Downloads a file from a URL or OSS path and returns raw bytes.	`df["path"].url.download(storage_options={...})`
`image.decode()`	Decodes image bytes into an image object.	`df["bytes"].image.decode()`
`image.width`	Returns image width in pixels.	`df["img"].image.width`
`image.height`	Returns image height in pixels.	`df["img"].image.height`
`image.size`	Returns image file size in bytes.	`df["img"].image.size`
`image.format`	Returns image format (JPEG, PNG, etc.).	`df["img"].image.format`
`image.mode`	Returns image color mode (RGB, RGBA, L, etc.).	`df["img"].image.mode`

Audio operators

Operator	Description	Example
`url.download()`	Downloads an audio file from a URL or OSS path and returns raw bytes.	`df["audio_path"].url.download(storage_options={...})`
`audio.decode()`	Decodes audio bytes into an AudioObject.	`df["audio_bytes"].audio.decode()`
`audio.channels`	Returns the number of audio channels.	`df["audio_obj"].audio.channels`
`audio.size`	Returns audio file size in bytes.	`df["audio_obj"].audio.size`
`audio.sample_rate`	Returns audio sample rate in Hz.	`df["audio_obj"].audio.sample_rate`
`audio.duration`	Returns audio duration in seconds.	`df["audio_obj"].audio.duration`
`audio.format`	Returns the audio encoding format.	`df["audio_obj"].audio.format`
`audio.detect_language()`	Detects the language of the audio.	`df["audio_bytes"].audio.detect_language()`
`audio.transcribe()`	Transcribes speech to text (auto-detects language).	`df["audio_bytes"].audio.transcribe()`
`audio.transcribe(language="zh")`	Transcribes speech to text (specified language).	`df["audio_bytes"].audio.transcribe(language="zh")`
`audio.vad_detect()`	Voice activity detection (VAD). Returns segments containing speech.	`df["audio_bytes"].audio.vad_detect()`
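
To make the property operators above more concrete, the sketch below computes the same kinds of values (channels, sample rate, duration, size) for a WAV file with the Python standard library. It is a local illustration only and does not show how MaxFrame implements these operators. The helper names `make_silent_wav` and `wav_properties` are hypothetical, and the code handles uncompressed WAV input only.

```python
import io
import wave


def make_silent_wav(seconds: float, sample_rate: int = 16000, channels: int = 1) -> bytes:
    """Build an in-memory 16-bit PCM WAV file of silence (sample data only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)
        n_frames = int(seconds * sample_rate)
        wf.writeframes(b"\x00\x00" * channels * n_frames)
    return buf.getvalue()


def wav_properties(wav_bytes: bytes) -> dict:
    """Read basic properties from WAV bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        sample_rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate": sample_rate,                 # Hz
            "duration": wf.getnframes() / sample_rate,  # seconds
            "size": len(wav_bytes),                     # bytes, header included
        }


props = wav_properties(make_silent_wav(2.0, sample_rate=16000, channels=2))
print(props)
```

Duration is the frame count divided by the sample rate, which is why it does not depend on the number of channels.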

OSS access authentication

url.download() uses the storage_options parameter to configure OSS access credentials:

# RAM role authorization (recommended, automatically requests temporary STS token)
df["img_bytes"] = df["oss_path"].url.download(
    storage_options={"role_arn": "acs:ram::[account-id]:role/[role-name]"}
)
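
The `role_arn` string follows the pattern shown above, so it can be assembled from its two parts instead of being typed by hand. The sketch below is a hypothetical helper (`build_storage_options` is not a MaxFrame function) and assumes the account ID consists only of digits; the account ID and role name in the usage line are made-up sample values.

```python
def build_storage_options(account_id: str, role_name: str) -> dict:
    """Return a storage_options dict in the form url.download() is shown to accept.

    Assumption: the account ID is purely numeric.
    """
    if not account_id.isdigit():
        raise ValueError("account_id must contain only digits")
    if not role_name:
        raise ValueError("role_name must not be empty")
    return {"role_arn": f"acs:ram::{account_id}:role/{role_name}"}


# Sample values for illustration only
options = build_storage_options("1234567890123456", "maxframe-oss-reader")
print(options)
```

The returned dictionary can then be passed as the `storage_options` argument in the call shown above.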

Examples

The following examples demonstrate typical multimodal operator workflows. All examples assume a MaxFrame session is already created and the engine is configured.

Set DPE as the preferred engine with options.dag.settings = {"engine_order": ["DPE", "MCSQL", "SPE"]} for optimal multimodal processing performance.

Example 1: Extract and filter image properties

Download images from OSS in batch, extract properties, filter by criteria, and write results to a table.

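from maxframe import dataframe as md

# RAM role that grants OSS read access (replace the placeholders with your own values)
ROLE_ARN = "acs:ram::[account-id]:role/[role-name]"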

# Read data table containing image OSS paths
df = md.read_odps_table("image_src_table")

# Download → decode → property extraction (chained calls)
df["img_bytes"] = df["oss_path"].url.download(
    storage_options={"role_arn": ROLE_ARN}
)
df["img_obj"] = df["img_bytes"].image.decode()

df["width"]  = df["img_obj"].image.width
df["height"] = df["img_obj"].image.height
df["size"]   = df["img_obj"].image.size
df["format"] = df["img_obj"].image.format
df["mode"]   = df["img_obj"].image.mode

# Filter by width and height range
df_filtered = df[
    df["width"].between(1000, 5000) &
    df["height"].between(2000, 6000)
]

# Write to result table
df_filtered[
    ["id", "oss_path", "width", "height", "size", "format", "mode"]
].to_odps_table("image_sink_table", overwrite=True).execute()
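
In pandas, `Series.between` includes both endpoints by default. Assuming the MaxFrame DataFrame API follows pandas here (the text above describes it as pandas-like but does not state this), the filter keeps images that are exactly 1000 or 5000 pixels wide. The plain-Python sketch below mirrors that inclusive logic on a few made-up rows; `in_range` is a hypothetical helper.

```python
def in_range(value: int, low: int, high: int) -> bool:
    """Inclusive range check, matching the pandas default for Series.between."""
    return low <= value <= high


# Made-up sample rows for illustration
rows = [
    {"id": 1, "width": 1000, "height": 2000},  # on the lower bounds: kept
    {"id": 2, "width": 999,  "height": 3000},  # width too small: dropped
    {"id": 3, "width": 3000, "height": 1999},  # height too small: dropped
    {"id": 4, "width": 5000, "height": 6000},  # on the upper bounds: kept
    {"id": 5, "width": 5001, "height": 6000},  # width too large: dropped
]

kept = [
    r for r in rows
    if in_range(r["width"], 1000, 5000) and in_range(r["height"], 2000, 6000)
]
print([r["id"] for r in kept])
```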

Example 2: Image preprocessing (resize, crop, and format conversion)

Resize, crop, and convert image formats to prepare training data for AI models.

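from maxframe import dataframe as md

# RAM role that grants OSS read access (replace the placeholders with your own values)
ROLE_ARN = "acs:ram::[account-id]:role/[role-name]"

# Read data table containing image OSS paths
df = md.read_odps_table("image_src_table")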

df["img_bytes"] = df["oss_path"].url.download(
    storage_options={"role_arn": ROLE_ARN}
)
df["img_obj"] = df["img_bytes"].image.decode()

# Resize image to 224×224 (commonly used for model input)
df["img_resized"] = df["img_obj"].image.resize((224, 224))

# Crop image to specified region (left, top, right, bottom)
df["img_cropped"] = df["img_obj"].image.crop((100, 100, 500, 500))

# Convert color mode (e.g., RGBA → RGB)
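df["img_rgb"] = df["img_obj"].image.convert("RGB")

# The calls above only build a lazy computation graph; nothing runs until
# .execute() is called on a result (see "Lazy execution" under Usage notes).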
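
The crop box in this example is easy to misread as (x, y, width, height). The sketch below works through the (left, top, right, bottom) arithmetic in plain Python. It assumes a Pillow-style convention with the origin at the top-left corner, and it rejects boxes that extend past the image, which is a stricter choice than the text above specifies for `image.crop`. `crop_output_size` is a hypothetical helper.

```python
def crop_output_size(box: tuple, image_size: tuple) -> tuple:
    """Return (width, height) of the region selected by a (left, top, right, bottom) box."""
    left, top, right, bottom = box
    width, height = image_size
    if not (0 <= left < right <= width and 0 <= top < bottom <= height):
        raise ValueError(f"box {box} does not fit inside an image of size {image_size}")
    return (right - left, bottom - top)


# The box from the example selects a 400x400 region, not a 500x500 one
print(crop_output_size((100, 100, 500, 500), (1024, 768)))
```

A check like this also shows why filtering by size first can matter: an image smaller than 500 pixels on either side does not contain the requested region.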

Usage notes

Item	Description
Engine configuration	Multimodal operators require the DPE engine. Set DPE first in `engine_order`.
OSS path format	Use the full internal OSS endpoint: `oss://oss-cn-<region>-internal.aliyuncs.com/<bucket>/<path>`
Lazy execution	All operator calls are lazily constructed. Call `.execute()` to trigger actual computation.
Authentication	Use a RAM role (`role_arn`) instead of hard-coded AK/SK credentials.
Resource cleanup	Call `session.destroy()` after processing to release cluster resources.
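
The OSS path format in the table can be produced programmatically when a source table stores only bucket and object names. The sketch below is a hypothetical helper (`build_oss_path` is not part of MaxFrame). Its `region` argument is the part that replaces `<region>` in the pattern, and the region, bucket, and object names in the usage line are sample values.

```python
def build_oss_path(region: str, bucket: str, key: str) -> str:
    """Build a path of the form oss://oss-cn-<region>-internal.aliyuncs.com/<bucket>/<path>."""
    if not region or not bucket or not key.strip("/"):
        raise ValueError("region, bucket, and key must all be non-empty")
    # Drop leading slashes so the result never contains "//" after the bucket
    return f"oss://oss-cn-{region}-internal.aliyuncs.com/{bucket}/{key.lstrip('/')}"


print(build_oss_path("hangzhou", "my-bucket", "images/cat.jpg"))
```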

FAQ

Q1: How do I troubleshoot image download failures?

Verify the OSS path uses the internal endpoint.
Confirm the RAM role has OSS access permissions.
Check the Logview error logs for details.
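
The first check can be automated before a job is submitted. The sketch below is a hypothetical local validator (`find_path_problems` is not a MaxFrame function) that tests a path against the internal-endpoint format given in the usage notes. It checks the shape of the string only and cannot confirm that the bucket exists or that the RAM role has access.

```python
from urllib.parse import urlsplit


def find_path_problems(path: str) -> list:
    """Return a list of human-readable problems; an empty list means the format looks right."""
    problems = []
    parts = urlsplit(path)
    if parts.scheme != "oss":
        problems.append("scheme is not oss://")
    host = parts.netloc
    if not (host.startswith("oss-cn-") and host.endswith("-internal.aliyuncs.com")):
        problems.append("host is not an internal OSS endpoint")
    # Expect at least <bucket>/<object> after the host
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2:
        problems.append("bucket or object path is missing")
    return problems


# Sample paths for illustration
print(find_path_problems("oss://oss-cn-hangzhou-internal.aliyuncs.com/my-bucket/images/cat.jpg"))
print(find_path_problems("oss://oss-cn-hangzhou.aliyuncs.com/my-bucket/images/cat.jpg"))
```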

Q2: How do I handle large-scale data?

Multimodal operators run in parallel automatically on the MaxFrame DPE engine. Data sharding and task scheduling are handled internally, so no manual parallelism configuration is required. For custom processing logic, use the MaxFrame apply_chunk operator to process data in batches.