MaxFrame is a Pandas-compatible DataFrame API for MaxCompute. It lets you run Pandas-style data operations on large-scale datasets without being limited by local memory or compute resources. Use MaxFrame when your dataset exceeds single-machine memory or when your Pandas workflows are too slow to run locally.
To migrate from Pandas, change one import — MaxFrame handles the rest by distributing computation across the MaxCompute backend:
# Before (pandas)
import pandas as pd
# After (MaxFrame)
import maxframe.dataframe as md
MaxFrame uses lazy evaluation. Operations build a computation graph but do not run until you call .execute(). Chain multiple transformations freely, then call .execute() once to trigger computation.
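Because the MaxFrame API mirrors pandas, a plain-pandas chain illustrates what such a graph computes. The sketch below uses a toy DataFrame of our own (not from the sample dataset); the MaxFrame version is identical apart from the import and the final .execute().fetch():

```python
import pandas as pd

# Toy DataFrame; with MaxFrame this would be md.DataFrame(...).
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# Chain two transformations: add 1 to every value, then keep
# rows where the original col1 exceeds 1.
result = (df + 1)[df['col1'] > 1]
print(result)
# Result:
#    col1  col2
# 1     3     5

# In MaxFrame, no work runs until the final call:
# result.execute().fetch()
```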
Prerequisites
Before you begin, ensure that you have:
- A MaxCompute project with the required permissions
- An Alibaba Cloud AccessKey ID and AccessKey secret (set as environment variables)
- MaxFrame installed in your Python environment
Sample dataset
This guide uses the maxframe_ml_100k_users table from the MaxCompute public dataset. The table is available at BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users — no import needed.
Initialize a session
Before running any MaxFrame job, initialize a session at the entry point of your code. All subsequent data processing goes through this session, which communicates with the MaxCompute backend.
import os
from maxframe import new_session
from odps import ODPS
# Initialize MaxCompute with your credentials.
o = ODPS(
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='your-default-project',
    endpoint='your-end-point',
)
# Start the MaxFrame session.
new_session(odps_entry=o)
For parameter details, see new_session.
Create a DataFrame
Use read_odps_table or read_odps_query to load data from MaxCompute into a MaxFrame DataFrame. You can also create a DataFrame from local data for quick testing.
Read from a MaxCompute table
read_odps_table reads a MaxCompute table and returns a DataFrame.
import maxframe.dataframe as md
# Read all columns from the table.
df = md.read_odps_table('BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users')
Use index_col to set a column as the DataFrame index, or columns to select a subset of columns:
# Set user_id as the index.
df = md.read_odps_table(
    'BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users',
    index_col='user_id'
)
# Select specific columns.
df = md.read_odps_table(
    'BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users',
    columns=['user_id', 'age', 'sex']
)
For full parameter details, see read_odps_table.
Read from a SQL query
Run a SQL query and use the result as a DataFrame:
import maxframe.dataframe as md
df = md.read_odps_query(
    'select user_id, age, sex FROM `BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users`'
)
# Optionally set the index column.
df = md.read_odps_query(
    'select user_id, age, sex FROM `BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users`',
    index_col='user_id'
)
For full parameter details, see read_odps_query.
Create from local data
For quick local testing, create a DataFrame directly from a Python dict:
import maxframe.dataframe as md
d = {'col1': [1, 2], 'col2': [3, 4]}
df = md.DataFrame(data=d)
print(df.execute().fetch())
# Result:
# col1 col2
# 0 1 3
# 1 2 4
Process data
MaxFrame supports the same data operations as Pandas. All operations use lazy evaluation — chain as many as you need, then call .execute() once to trigger computation.
Mathematical operations
Apply arithmetic operations column-wise. All standard operators (+, -, *, /) work as in Pandas.
import maxframe.dataframe as md
df = md.DataFrame({'angles': [0, 3, 4],
                   'degrees': [360, 180, 360]},
                  index=['circle', 'triangle', 'rectangle'])
# Addition
print((df + 1).execute().fetch())
# Result:
# angles degrees
# circle 1 361
# triangle 4 181
# rectangle 5 361
# Multiplication
print((df * 2).execute().fetch())
# Result:
# angles degrees
# circle 0 720
# triangle 6 360
# rectangle 8 720
# Division
print((df / 2).execute().fetch())
# Result:
# angles degrees
# circle 0.0 180.0
# triangle 1.5 90.0
# rectangle 2.0 180.0
# Exponentiation
print((df ** 2).execute().fetch())
# Result:
# angles degrees
# circle 0 129600
# triangle 9 32400
# rectangle 16 129600
For more information, see Binary operator functions.
Filtering, projection, and sampling
Filtering
Select rows that match a condition:
import maxframe.dataframe as md
df = md.read_odps_table('BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users')
# Keep only rows where age > 18.
filtered_df = df[df['age'] > 18]
print(filtered_df.execute().fetch())
# Result:
# user_id age sex occupation zip_code
# 0 1 24 M technician 85711
# 1 2 53 F other 94043
# ...
Projection
Select a subset of columns:
import maxframe.dataframe as md
df = md.read_odps_table('BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users')
# Select specific columns.
projected_df = df[['user_id', 'age']]
print(projected_df.execute().fetch())
# Result:
# user_id age
# 0 1 24
# 1 2 53
# ...
Sampling
Draw a random sample from a DataFrame:
import maxframe.dataframe as md
df = md.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])
print(df['num_legs'].sample(n=3, random_state=1).execute().fetch())
# Result:
# falcon 2
# fish 0
# dog 4
# Name: num_legs, dtype: int64
For more information, see Reindexing / selection / label manipulation.
Sorting
Sort a DataFrame by one or more columns using sort_values:
import maxframe.dataframe as md
import numpy as np
df = md.DataFrame({
    'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
})
# Sort by col1.
res = df.sort_values(by=['col1']).execute()
print(res.fetch())
# Result:
# col1 col2 col3
# 0 A 2 0
# 1 A 1 1
# 2 B 9 9
# 5 C 4 3
# 4 D 7 2
# 3 None 8 4
# Sort by col1, then col2.
res = df.sort_values(by=['col1', 'col2']).execute()
print(res.fetch())
# Result:
# col1 col2 col3
# 1 A 1 1
# 0 A 2 0
# 2 B 9 9
# 5 C 4 3
# 4 D 7 2
# 3 None 8 4
For more information, see Reshaping / sorting / transposing.
Join, merge, and concatenate
Combine DataFrames horizontally (join/merge) or vertically (concatenate).
import maxframe.dataframe as md
df1 = md.DataFrame({'key': ['K0', 'K1', 'K2'],
                    'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})
df2 = md.DataFrame({'key': ['K0', 'K1', 'K2'],
                    'C': ['C0', 'C1', 'C2'],
                    'D': ['D0', 'D1', 'D2']})
# Merge on a key column.
result = md.merge(df1, df2, on='key')
print(result.execute().fetch())
# Result:
# key A B C D
# 0 K0 A0 B0 C0 D0
# 1 K1 A1 B1 C1 D1
# 2 K2 A2 B2 C2 D2
# Concatenate vertically.
result = md.concat([df1, df2])
print(result.execute().fetch())
# Result:
# key A B C D
# 0 K0 A0 B0 NaN NaN
# 1 K1 A1 B1 NaN NaN
# 2 K2 A2 B2 NaN NaN
# 0 K0 NaN NaN C0 D0
# 1 K1 NaN NaN C1 D1
# 2 K2 NaN NaN C2 D2
For more information, see Combining / joining / merging.
Aggregation and user-defined functions (UDFs)
Aggregation
Summarize data across groups or columns with aggregation functions:
import maxframe.dataframe as md
import numpy as np
df = md.DataFrame([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9],
                   [np.nan, np.nan, np.nan]],
                  columns=['A', 'B', 'C'])
# Apply multiple aggregation functions.
print(df.agg(['sum', 'min']).execute().fetch())
# Result:
# A B C
# min 1.0 2.0 3.0
# sum 12.0 15.0 18.0
# Apply different functions to different columns.
print(df.agg({'A': ['sum', 'min'], 'B': ['min', 'max']}).execute().fetch())
# Result:
# A B
# max NaN 8.0
# min 1.0 2.0
# sum 12.0 NaN
UDFs
MaxFrame supports UDFs through transform and apply. Use UDFs for custom logic that built-in operators don't cover.
Before running a UDF, declare the common image by setting config.options.sql.settings before calling new_session.
All UDF examples use this config block:
from maxframe import config, new_session

config.options.sql.settings = {
    "odps.session.image": "common",
    "odps.sql.type.system.odps2": "true"
}
session = new_session(odps_entry=o)
Use `transform` to apply a UDF column-wise (output shape matches input):
import maxframe.dataframe as md
df = md.DataFrame({'A': range(3), 'B': range(1, 4)})
print(df.transform(lambda x: x + 1).execute().fetch())
# Result:
# A B
# 0 1 2
# 1 2 3
# 2 3 4
Use `apply` when the output column count differs from the input:
import maxframe.dataframe as md
import numpy as np
def simple(row):
    row['is_man'] = row['sex'] == 'M'
    return row

df = md.read_odps_table('BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users')
new_dtypes = df.dtypes.copy()
new_dtypes["is_man"] = np.dtype(np.bool_)
print(df.apply(
    simple,
    axis=1,
    result_type="expand",
    output_type="dataframe",
    dtypes=new_dtypes
).execute().fetch())
# Result:
# user_id age sex occupation zip_code is_man
# 0 1 24 M technician 85711 True
# 1 2 53 F other 94043 False
# ...
# [943 rows x 6 columns]
For more information, see Function application / GroupBy / window.
Store results
After processing, save results to a MaxCompute table with to_odps_table.
# Write results to a MaxCompute table (appends if the table exists).
filtered_df.to_odps_table('<table_name>')
# Set a lifecycle (in days) for the result table.
filtered_df.to_odps_table('<table_name>', lifecycle=7)
# Overwrite existing data instead of appending.
filtered_df.to_odps_table('<table_name>', overwrite=True)
| Parameter | Description |
|---|---|
| <table_name> | Destination table name. If the table does not exist, MaxCompute creates it automatically. If it already exists, data is appended by default. |
| lifecycle | Number of days to retain the result table. |
| overwrite | Set to True to overwrite existing data instead of appending. |
For more information, see to_odps_table.
Execute and retrieve results
MaxFrame uses lazy evaluation — operations build a computation graph but do not run until you call .execute(). Chain .execute().fetch() to trigger execution and retrieve results.
# Execute the computation and fetch results.
data = filtered_df.execute().fetch()
print(data)
| Method | Description |
|---|---|
| .execute() | Submits the computation graph to MaxCompute and waits for completion. |
| .fetch() | Retrieves a portion of the result data after execution. |
| .execute().fetch() | Triggers execution and returns results in one step. |