MaxFrame is a Pandas-compatible DataFrame API for MaxCompute. It lets you run Pandas-style data operations on large-scale datasets without being limited by local memory or compute resources. Use MaxFrame when your dataset exceeds single-machine memory or when your Pandas workflows are too slow to run locally.
To migrate from Pandas, change one import — MaxFrame handles the rest by distributing computation across the MaxCompute backend:
# Before (pandas)
import pandas as pd
# After (MaxFrame)
import maxframe.dataframe as md
MaxFrame uses lazy evaluation. Operations build a computation graph but do not run until you call .execute(). Chain multiple transformations freely, then call .execute() once to trigger computation.
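Because the MaxFrame API mirrors pandas, a plain-pandas chain illustrates what such a graph computes. The sketch below uses a toy DataFrame of our own (not from the sample dataset); the MaxFrame version is identical apart from the import and the final .execute().fetch():

```python
import pandas as pd

# Toy DataFrame; with MaxFrame this would be md.DataFrame(...).
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# Chain two transformations: add 1 to every value, then keep
# rows where the original col1 exceeds 1.
result = (df + 1)[df['col1'] > 1]
print(result)
# Result:
#    col1  col2
# 1     3     5

# In MaxFrame, no work runs until the final call:
# result.execute().fetch()
```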
Prerequisites
Before you begin, ensure that you have:
- A MaxCompute project with the required permissions
- An Alibaba Cloud AccessKey ID and AccessKey secret (set as environment variables)
- MaxFrame installed in your Python environment
Sample dataset
This guide uses the maxframe_ml_100k_users table from the MaxCompute public dataset. The table is available at BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users — no import needed.
Initialize a session
Before running any MaxFrame job, initialize a session at the entry point of your code. All subsequent data processing goes through this session, which communicates with the MaxCompute backend.
import os
from maxframe import new_session
from odps import ODPS
# Initialize MaxCompute with your credentials.
o = ODPS(
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='your-default-project',
    endpoint='your-end-point',
)
# Start the MaxFrame session.
new_session(odps_entry=o)
For parameter details, see new_session.
Create a DataFrame
Use read_odps_table or read_odps_query to load data from MaxCompute into a MaxFrame DataFrame. You can also create a DataFrame from local data for quick testing.
Read from a MaxCompute table
read_odps_table reads a MaxCompute table and returns a DataFrame.
import maxframe.dataframe as md
# Read all columns from the table.
df = md.read_odps_table('BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users')
Use index_col to set a column as the DataFrame index, or columns to select a subset of columns:
# Set user_id as the index.
df = md.read_odps_table(
    'BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users',
    index_col='user_id'
)
# Select specific columns.
df = md.read_odps_table(
    'BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users',
    columns=['user_id', 'age', 'sex']
)
For full parameter details, see read_odps_table.
Read from a SQL query
Run a SQL query and use the result as a DataFrame:
import maxframe.dataframe as md
df = md.read_odps_query(
    'select user_id, age, sex FROM `BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users`'
)
# Optionally set the index column.
df = md.read_odps_query(
    'select user_id, age, sex FROM `BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users`',
    index_col='user_id'
)
For full parameter details, see read_odps_query.
Create from local data
For quick local testing, create a DataFrame directly from a Python dict:
import maxframe.dataframe as md
d = {'col1': [1, 2], 'col2': [3, 4]}
df = md.DataFrame(data=d)
print(df.execute().fetch())
# Result:
# col1 col2
# 0 1 3
# 1 2 4
Process data
MaxFrame supports the same data operations as Pandas. All operations use lazy evaluation — chain as many as you need, then call .execute() once to trigger computation.
Mathematical operations
Apply arithmetic operations column-wise. All standard operators (+, -, *, /) work as in Pandas.
import maxframe.dataframe as md
df = md.DataFrame({'angles': [0, 3, 4],
                   'degrees': [360, 180, 360]},
                  index=['circle', 'triangle', 'rectangle'])
# Addition
print((df + 1).execute().fetch())
# Result:
# angles degrees
# circle 1 361
# triangle 4 181
# rectangle 5 361
# Multiplication
print((df * 2).execute().fetch())
# Result:
# angles degrees
# circle 0 720
# triangle 6 360
# rectangle 8 720
# Division
print((df / 2).execute().fetch())
# Result:
# angles degrees
# circle 0.0 180.0
# triangle 1.5 90.0
# rectangle 2.0 180.0
# Exponentiation
print((df ** 2).execute().fetch())
# Result:
# angles degrees
# circle 0 129600
# triangle 9 32400
# rectangle 16 129600
For more information, see Binary operator functions.
Filtering, projection, and sampling
Filtering
Select rows that match a condition:
import maxframe.dataframe as md
df = md.read_odps_table('BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users')
# Keep only rows where age > 18.
filtered_df = df[df['age'] > 18]
print(filtered_df.execute().fetch())
# Result:
# user_id age sex occupation zip_code
# 0 1 24 M technician 85711
# 1 2 53 F other 94043
# ...
Projection
Select a subset of columns:
import maxframe.dataframe as md
df = md.read_odps_table('BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users')
# Select specific columns.
projected_df = df[['user_id', 'age']]
print(projected_df.execute().fetch())
# Result:
# user_id age
# 0 1 24
# 1 2 53
# ...
Sampling
Draw a random sample from a DataFrame:
import maxframe.dataframe as md
df = md.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])
print(df['num_legs'].sample(n=3, random_state=1).execute().fetch())
# Result:
# falcon 2
# fish 0
# dog 4
# Name: num_legs, dtype: int64
For more information, see Reindexing / selection / label manipulation.
Sorting
Sort a DataFrame by one or more columns using sort_values:
import maxframe.dataframe as md
import numpy as np
df = md.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
})
# Sort by col1.
res = df.sort_values(by=['col1']).execute()
print(res.fetch())
# Result:
# col1 col2 col3
# 0 A 2 0
# 1 A 1 1
# 2 B 9 9
# 5 C 4 3
# 4 D 7 2
# 3 None 8 4
# Sort by col1, then col2.
res = df.sort_values(by=['col1', 'col2']).execute()
print(res.fetch())
# Result:
# col1 col2 col3
# 1 A 1 1
# 0 A 2 0
# 2 B 9 9
# 5 C 4 3
# 4 D 7 2
# 3 None 8 4
For more information, see Reshaping / sorting / transposing.
Join, merge, and concatenate
Combine DataFrames horizontally (join/merge) or vertically (concatenate).
import maxframe.dataframe as md
df1 = md.DataFrame({'key': ['K0', 'K1', 'K2'],
                    'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})
df2 = md.DataFrame({'key': ['K0', 'K1', 'K2'],
                    'C': ['C0', 'C1', 'C2'],
                    'D': ['D0', 'D1', 'D2']})
# Merge on a key column.
result = md.merge(df1, df2, on='key')
print(result.execute().fetch())
# Result:
# key A B C D
# 0 K0 A0 B0 C0 D0
# 1 K1 A1 B1 C1 D1
# 2 K2 A2 B2 C2 D2
# Concatenate vertically.
result = md.concat([df1, df2])
print(result.execute().fetch())
# Result:
# key A B C D
# 0 K0 A0 B0 NaN NaN
# 1 K1 A1 B1 NaN NaN
# 2 K2 A2 B2 NaN NaN
# 0 K0 NaN NaN C0 D0
# 1 K1 NaN NaN C1 D1
# 2 K2 NaN NaN C2 D2
For more information, see Combining / joining / merging.
Aggregation and user-defined functions (UDFs)
Aggregation
Summarize data across groups or columns with aggregation functions:
import maxframe.dataframe as md
import numpy as np
df = md.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9],
                   [np.nan, np.nan, np.nan]],
                  columns=['A', 'B', 'C'])
# Apply multiple aggregation functions.
print(df.agg(['sum', 'min']).execute().fetch())
# Result:
# A B C
# min 1.0 2.0 3.0
# sum 12.0 15.0 18.0
# Apply different functions to different columns.
print(df.agg({'A': ['sum', 'min'], 'B': ['min', 'max']}).execute().fetch())
# Result:
# A B
# max NaN 8.0
# min 1.0 2.0
# sum 12.0 NaN
UDFs
MaxFrame supports UDFs through transform and apply. Use UDFs for custom logic that built-in operators don't cover.
Before running a UDF, declare the common image by setting config.options.sql.settings before calling new_session.
All UDF examples use this config block:
from maxframe import config, new_session

config.options.sql.settings = {
    "odps.session.image": "common",
    "odps.sql.type.system.odps2": "true"
}
session = new_session(odps_entry=o)
Use `transform` to apply a UDF column-wise (output shape matches input):
import maxframe.dataframe as md
df = md.DataFrame({'A': range(3), 'B': range(1, 4)})
print(df.transform(lambda x: x + 1).execute().fetch())
# Result:
# A B
# 0 1 2
# 1 2 3
# 2 3 4
Use `apply` when the output column count differs from the input:
import maxframe.dataframe as md
import numpy as np
def simple(row):
    row['is_man'] = row['sex'] == 'M'
    return row

df = md.read_odps_table('BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users')
new_dtypes = df.dtypes.copy()
new_dtypes["is_man"] = np.dtype(np.bool_)
print(df.apply(
    simple,
    axis=1,
    result_type="expand",
    output_type="dataframe",
    dtypes=new_dtypes
).execute().fetch())
# Result:
# user_id age sex occupation zip_code is_man
# 0 1 24 M technician 85711 True
# 1 2 53 F other 94043 False
# ...
# [943 rows x 6 columns]
For more information, see Function application / GroupBy / window.
Store results
After processing, save results to a MaxCompute table with to_odps_table.
# Write results to a MaxCompute table (appends if the table exists).
filtered_df.to_odps_table('<table_name>')
# Set a lifecycle (in days) for the result table.
filtered_df.to_odps_table('<table_name>', lifecycle=7)
# Overwrite existing data instead of appending.
filtered_df.to_odps_table('<table_name>', overwrite=True)
| Parameter | Description |
|---|---|
| <table_name> | Destination table name. If the table does not exist, MaxCompute creates it automatically. If it already exists, data is appended by default. |
| lifecycle | Number of days to retain the result table. |
| overwrite | Set to True to overwrite existing data instead of appending. |
For more information, see to_odps_table.
Execute and retrieve results
MaxFrame uses lazy evaluation — operations build a computation graph but do not run until you call .execute(). Chain .execute().fetch() to trigger execution and retrieve results.
# Execute the computation and fetch results.
data = filtered_df.execute().fetch()
print(data)
| Method | Description |
|---|---|
| .execute() | Submits the computation graph to MaxCompute and waits for completion. |
| .fetch() | Retrieves a portion of the result data after execution. |
| .execute().fetch() | Triggers execution and returns results in one step. |