Deploy GPU-Accelerated LLM Inference via MaxFrame AI

MaxFrame AI Function is an end-to-end solution for offline large language model (LLM) inference in MaxCompute. It combines data processing with AI capabilities, so you can run batch LLM inference directly on GPU resources allocated through a GU quota without leaving your data warehouse workflow.

This topic shows you how to call a managed LLM using llm.generate on GU resources.

Quick start

The following snippet shows the minimal end-to-end call. Read the sections below for parameter details and debugging guidance.

import os
import maxframe.dataframe as md
from maxframe import new_session
from maxframe.config import options
from maxframe.learn.contrib.llm.models.managed import ManagedTextGenLLM
from odps import ODPS

# Initialize session
o = ODPS(
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='<your project>',
    endpoint='https://service.cn-<your region>.maxcompute.aliyun.com/api',
)
session = new_session(o)
options.session.gu_quota_name = "<your GU quota name>"  # Required to route tasks to GPU workers.

# Prepare input data
df = md.DataFrame({"query": ["What is the boiling point of water?"]})

# Initialize the managed LLM client
llm = ManagedTextGenLLM(name="Qwen3-4B-Instruct-2507-FP8")  # Name must be an exact match.

# Define the prompt template and run inference
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Please answer the following question: {query}"},
]
result_df = llm.generate(df, prompt_template=messages)
result_df.execute()

Prerequisites

Before you begin, make sure you have:

MaxFrame software development kit (SDK) version 2.3.0 or later
Python 3.11
A GU quota enabled for your MaxCompute project. For details, see Purchase and use MaxCompute AI computing resources.
At least project-level read and write permissions for MaxCompute

Set up the environment

The following code initializes a MaxFrame session and assigns a GU quota to it. All subsequent LLM tasks run under this session.

import logging
import os

import maxframe.dataframe as md
from maxframe import new_session
from maxframe.config import options
from maxframe.udf import with_running_options  # Helper for attaching resource options to UDF-based tasks.
from odps import ODPS

# Route tasks to the DPE and MCSQL engines; disable SPE for GPU workloads.
options.dag.settings = {
    "engine_order": ["DPE", "MCSQL"],
    "unavailable_engines": ["SPE"],
}

logging.basicConfig(level=logging.INFO)

# Initialize the MaxFrame session with your project credentials.
o = ODPS(
    # Make sure the ALIBABA_CLOUD_ACCESS_KEY_ID environment variable is set to your AccessKey ID,
    # and the ALIBABA_CLOUD_ACCESS_KEY_SECRET environment variable is set to your AccessKey secret.
    # Do not use the AccessKey ID and AccessKey secret strings directly.
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='<your project>',
    endpoint='https://service.cn-<your region>.maxcompute.aliyun.com/api',
)

session = new_session(o)

options.session.gu_quota_name = "<your GU quota name>"  # Replace with your GU quota name.

print("Logview address:", session.get_logview_address())

Important

gu_quota_name is required to use GU resources. Without it, inference tasks are not dispatched to GPU workers.
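
Because a missing credential or an unreplaced quota placeholder only surfaces once the job is submitted, it can help to check the settings locally first. The following is a minimal sketch, not part of the MaxFrame SDK: missing_settings is a hypothetical helper, and treating any value that starts with "<" as an unreplaced placeholder is an assumption based on the placeholders used in this topic.

```python
import os

# Environment variable names taken from the setup code in this topic.
REQUIRED_ENV_VARS = (
    "ALIBABA_CLOUD_ACCESS_KEY_ID",
    "ALIBABA_CLOUD_ACCESS_KEY_SECRET",
)


def missing_settings(env, gu_quota_name):
    """Return the names of settings that are unset, empty, or still placeholders.

    Hypothetical helper for illustration; it does not contact MaxCompute.
    """
    missing = [name for name in REQUIRED_ENV_VARS if not env.get(name)]
    # Catches an empty value as well as an unreplaced "<...>" placeholder.
    if not gu_quota_name or gu_quota_name.startswith("<"):
        missing.append("gu_quota_name")
    return missing


if __name__ == "__main__":
    problems = missing_settings(os.environ, "<your GU quota name>")
    if problems:
        print("Fix before creating the session:", ", ".join(problems))
```

Running the check before new_session keeps configuration mistakes separate from failures that occur on the service side.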

Call a managed LLM

The four steps below use llm.generate to submit an asynchronous inference task against a managed model.

Step 1: Prepare input data

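import pandas as pd
from IPython.display import HTML, display

# Set display options for debugging.
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_columns", None)
# Notebook only: wrap long output lines. display() is needed because a bare
# HTML(...) expression is rendered only when it is the last statement in a cell.
display(HTML("<style>div.output_area pre {white-space: pre-wrap;}</style>"))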

# Create a query list.
query_list = [
    "What is the average distance between the Earth and the Sun?",
    "In what year did the American Revolutionary War begin?",
    "What is the boiling point of water?",
    "How can I quickly relieve a headache?",
    "Who is the main character in the Harry Potter series?",
]

# Convert to a MaxFrame DataFrame.
df = md.DataFrame({"query": query_list})
df.execute()

Step 2: Initialize the LLM instance

ManagedTextGenLLM is the client for managed text generation models hosted on MaxCompute. Pass the exact model name to identify which model to use.

from maxframe.learn.contrib.llm.models.managed import ManagedTextGenLLM

llm = ManagedTextGenLLM(
    name="Qwen3-4B-Instruct-2507-FP8"  # The model name must be an exact match.
)

Note

A typo or version mismatch in the model name causes the task to fail. For the full list of supported models, see Supported models for MaxFrame AI Function.

Step 3: Define the prompt template

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Please answer the following question: {query}"},
]

Template syntax:

Use {column_name} as a placeholder. At runtime, it is replaced with the value from the corresponding DataFrame column.
Pass a messages list to support multi-turn conversations.
Use the system role to define the model's behavior.
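
To make the placeholder rule concrete, the sketch below mimics the substitution locally with Python's built-in str.format. It is an illustration only: render_messages is a hypothetical helper, and MaxFrame's own substitution runs on the service side and may differ in detail (for example, in how literal braces are handled).

```python
def render_messages(messages, row):
    """Fill {column_name} placeholders in each message with values from one row."""
    return [
        {"role": m["role"], "content": m["content"].format(**row)}
        for m in messages
    ]


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Please answer the following question: {query}"},
]

# Each dict stands in for one DataFrame row; the key matches the column name.
rows = [
    {"query": "What is the boiling point of water?"},
    {"query": "In what year did the American Revolutionary War begin?"},
]

rendered = [render_messages(messages, row) for row in rows]
print(rendered[0][1]["content"])
# Please answer the following question: What is the boiling point of water?
```

Each input row produces its own copy of the messages list. In this local sketch, a placeholder that names a column missing from the row raises a KeyError.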

Step 4: Run the generation task

result_df = llm.generate(
    df,                          # Input DataFrame
    prompt_template=messages,
    running_options={
        "max_tokens": 4096,      # Maximum output length
        "verbose": True          # Enable verbose log output
    },
    params={"temperature": 0.7},
)

# Execute and get the result.
result_df.execute()

Output schema

result_df is a MaxFrame DataFrame with the following fields:

Field	Type	Description
`query`	string	Original input
`generated_text`	string	Response generated by the model
`finish_reason`	string	Completion reason: `stop` or `length`
`usage.prompt_tokens`	int	Number of input tokens
`usage.completion_tokens`	int	Number of output tokens
`usage.total_tokens`	int	Total number of tokens
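
One common use of these fields is to find truncated responses (finish_reason equal to length) and to total the token usage. The sketch below works on plain Python dictionaries whose keys follow the field names in the table. The sample rows are made-up values for illustration, not real model output, and summarize is a hypothetical helper. How you bring the results to the client is outside the scope of this sketch.

```python
def summarize(rows):
    """Return (queries whose output was truncated, total tokens across all rows)."""
    truncated = [r["query"] for r in rows if r["finish_reason"] == "length"]
    total_tokens = sum(r["usage.total_tokens"] for r in rows)
    return truncated, total_tokens


# Made-up rows that only mirror the output schema.
sample_rows = [
    {
        "query": "What is the boiling point of water?",
        "generated_text": "100 degrees Celsius at sea level.",
        "finish_reason": "stop",
        "usage.prompt_tokens": 25,
        "usage.completion_tokens": 9,
        "usage.total_tokens": 34,
    },
    {
        "query": "How can I quickly relieve a headache?",
        "generated_text": "Several approaches may help, such as",
        "finish_reason": "length",
        "usage.prompt_tokens": 26,
        "usage.completion_tokens": 8,
        "usage.total_tokens": 34,
    },
]

truncated, total_tokens = summarize(sample_rows)
print(truncated)      # ['How can I quickly relieve a headache?']
print(total_tokens)   # 68
```

Rows reported as truncated are candidates for a rerun with a larger max_tokens value.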

Performance and debugging

Performance optimization

Optimization	Recommendation
Batch size	Keep each batch to fewer than 100 items to avoid out-of-memory (OOM) errors
GU allocation	`gu=2` is suitable for 4B models, and larger models require more GU. Set it through `running_options`, for example `running_options={"gu": 2}`
Degree of parallelism	MaxFrame automatically schedules concurrent jobs; control this with `num_workers`
Cache intermediate results	Use `to_odps_table()` to save intermediate tables and avoid recomputation
Timeout	Add `timeout=3600` to prevent jobs from getting stuck
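
The batch size recommendation can be applied on the client before building the input DataFrame. The sketch below splits a query list into chunks of at most 99 items. batched is a hypothetical helper, and the limit of 99 is simply "fewer than 100" from the table rather than a value enforced by MaxFrame.

```python
def batched(items, max_size=99):
    """Yield consecutive slices of items, each holding at most max_size elements."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    for start in range(0, len(items), max_size):
        yield items[start:start + max_size]


queries = [f"Question {i}" for i in range(250)]
batches = list(batched(queries))
print([len(b) for b in batches])  # [99, 99, 52]
```

Each chunk can then be turned into its own DataFrame and passed to llm.generate, which keeps any single call within the recommended size.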

Debugging tips

View execution logs

print(session.get_logview_address())  # Click the link to view real-time MaxFrame job logs.

Run a small-scale test before full execution

df_sample = df.head(2)  # Use two rows for testing.
result_sample = llm.generate(df_sample, prompt_template=messages, running_options={"gu": 2})
result_sample.execute()

Check resource usage

View detailed job execution status in MaxFrame Logview.