MaxFrame provides a set of APIs beyond the standard pandas interface for managing sessions, reading and writing MaxCompute tables, triggering distributed computation, and retrieving results locally.
Session
new_session
Source code: new_session
new_session(
session_id: str = None,
default: bool = True,
new: bool = True,
odps_entry: Optional[ODPS] = None
)
Creates a MaxFrame session and connects to MaxCompute.
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| session_id | String | No | None | A unique identifier for the session. If not specified, MaxFrame generates one automatically. When new=False, this identifies the existing session to reuse. |
| default | Boolean | No | True | Sets the session as the global default. When True, subsequent calls to execute() and fetch() use this session without requiring an explicit session argument. |
| new | Boolean | No | True | Creates a new session. Set to False to connect to an existing session identified by session_id. |
| odps_entry | ODPS | Yes | - | The MaxCompute entry object. See Create a MaxCompute entry point. |
Returns: The session object.
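The interaction between session_id and new can be sketched with two small helpers. Both `load_credentials` and `session_kwargs` are hypothetical names introduced here for illustration; they are not part of MaxFrame. They only use the environment variable names and the new_session parameters documented above.

```python
import os
from typing import Any, Dict, Optional


def load_credentials() -> Dict[str, str]:
    """Read the AccessKey pair from the environment, failing fast if either is unset."""
    names = ("ALIBABA_CLOUD_ACCESS_KEY_ID", "ALIBABA_CLOUD_ACCESS_KEY_SECRET")
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: os.environ[name] for name in names}


def session_kwargs(odps_entry: Any, session_id: Optional[str] = None) -> Dict[str, Any]:
    """Build keyword arguments for new_session (hypothetical helper).

    Without a session_id a new session is requested; with one, new=False
    asks MaxFrame to connect to the existing session instead.
    """
    kwargs: Dict[str, Any] = {"odps_entry": odps_entry, "default": True}
    if session_id is None:
        kwargs["new"] = True
    else:
        kwargs.update(session_id=session_id, new=False)
    return kwargs
```

With a MaxCompute entry object `o`, a call such as `new_session(**session_kwargs(o, session_id='<existing-id>'))` would then reuse a session, where `<existing-id>` is a placeholder for an identifier you already hold.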
Example
import os
from maxframe import new_session
from odps import ODPS
# Initialize the MaxCompute entry object.
# Store credentials in environment variables — do not hardcode them.
o = ODPS(
os.environ.get('ALIBABA_CLOUD_ACCESS_KEY_ID'),
os.environ.get('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
project='your-default-project',
endpoint='your-endpoint',
)
# Create the MaxFrame session.
session = new_session(odps_entry=o)
Input/Output
The following functions read data from and write data to MaxCompute.
| Function | Description |
|---|---|
| read_odps_table | Reads a MaxCompute table into a DataFrame |
| read_odps_query | Runs a SQL query and returns results as a DataFrame |
| to_odps_table | Writes a DataFrame to a MaxCompute table |
| to_odps_model | Saves a trained XGBoost model to MaxCompute |
Choosing between `read_odps_table` and `read_odps_query`: Use read_odps_table when reading from a specific table (with optional partition and column filters). Use read_odps_query when you need SQL-level filtering or joins across multiple tables.
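As a rough illustration of that choice, the sketch below expresses the same column selection both ways. `table_read_kwargs` and `select_query` are hypothetical helpers written for this example. They only assemble arguments and SQL text, so they run without a MaxCompute connection.

```python
from typing import Any, Dict, List, Optional


def table_read_kwargs(table: str, columns: List[str],
                      partitions: Optional[str] = None) -> Dict[str, Any]:
    """Arguments for read_odps_table: one table, optional partition and column filters."""
    kwargs: Dict[str, Any] = {"table_name": table, "columns": columns}
    if partitions is not None:
        kwargs["partitions"] = partitions
    return kwargs


def select_query(table: str, columns: List[str], where: Optional[str] = None) -> str:
    """SQL text for read_odps_query: the same projection plus an optional row filter."""
    sql = "SELECT {} FROM `{}`".format(", ".join(columns), table)
    if where:
        # The filter is inserted verbatim, so pass only trusted text here.
        sql += " WHERE " + where
    return sql
```

The first form maps onto `md.read_odps_table(**table_read_kwargs(...))` and the second onto `md.read_odps_query(select_query(...))`. Only the second can carry a row-level condition such as `age > 30`.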
read_odps_table
Source code: read_odps_table
read_odps_table(
table_name: Union[str, Table],
partitions: Union[None, str, List[str]] = None,
columns: Optional[List[str]] = None,
index_col: Union[None, str, List[str]] = None,
odps_entry: ODPS = None,
string_as_binary: bool = None,
append_partitions: bool = False
)
Reads data from a MaxCompute table and returns it as a DataFrame. If no index columns are specified, a RangeIndex is generated.
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| table_name | String/Table | Yes | - | The MaxCompute table name or table object to read from. |
| partitions | String/List | No | None | The partition or list of partitions to read. Format: <partition_name>=<partition_value>. If not specified, all partitions are read. |
| columns | List | No | None | The columns to read. Format: <column1>, <column2>, .... If not specified, all non-partition columns are read. |
| index_col | String/List | No | None | One or more columns to use as the DataFrame index. |
| odps_entry | ODPS | No | None | The MaxCompute entry object. See Create a MaxCompute entry point. |
| string_as_binary | Boolean | No | None | Reads string columns in binary form. |
| append_partitions | Boolean | No | False | When True and columns is not specified, includes partition key columns in the result. |
Returns: A DataFrame object.
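The partitions format can be sketched as follows. `partition_spec` and `daily_partitions` are hypothetical helpers, and joining several partition columns with a comma is an assumption based on the `pt1=xxx, pt2=yyy` example used for to_odps_table rather than something this section states for read_odps_table.

```python
from typing import Dict, List


def partition_spec(values: Dict[str, str]) -> str:
    """Join partition columns into '<name>=<value>' pairs, in dict order.

    Assumption: multi-level partitions are comma-separated.
    """
    if not values:
        raise ValueError("at least one partition column is required")
    return ",".join(f"{name}={value}" for name, value in values.items())


def daily_partitions(column: str, days: List[str]) -> List[str]:
    """Build one single-column spec per day, suitable for a list-valued partitions argument."""
    return [partition_spec({column: day}) for day in days]
```

For a table partitioned by a hypothetical `ds` column, `md.read_odps_table('my_table', partitions=daily_partitions('ds', ['20240101', '20240102']), append_partitions=True)` would read two partitions and, because columns is not specified, keep the partition key in the result.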
Example
import maxframe.dataframe as md
df = md.read_odps_table(
'BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users',
index_col='user_id',
columns=['age', 'sex']
)
print(df.execute().fetch())
# Output:
# age sex
# user_id
# 1 24 M
# 2 53 F
# 3 23 M
# 4 24 M
# 5 33 F
# ... ... ..
# 939 26 F
# 940 32 M
# 941 20 M
# 942 48 F
# 943 22 M
#
# [943 rows x 2 columns]
read_odps_query
Source code: read_odps_query
read_odps_query(
query: str,
odps_entry: ODPS = None,
index_col: Union[None, str, List[str]] = None,
string_as_binary: bool = None
)
Runs a MaxCompute SQL query and returns the results as a DataFrame. If no index columns are specified, a RangeIndex is generated.
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| query | String | Yes | - | The MaxCompute SQL statement to run. |
| odps_entry | ODPS | No | None | The MaxCompute entry object. See Create a MaxCompute entry point. |
| index_col | String/List | No | None | One or more columns to use as the DataFrame index. |
| string_as_binary | Boolean | No | None | Reads string columns in binary form. |
Returns: A DataFrame object.
Example
import maxframe.dataframe as md
df = md.read_odps_query(
'SELECT user_id, age, sex FROM `BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users`'
)
to_odps_table
Source code: to_odps_table
to_odps_table(
table: Union[Table, str],
partition: Optional[str] = None,
partition_col: Union[None, str, List[str]] = None,
overwrite: bool = False,
unknown_as_string: Optional[bool] = None,
index: bool = True,
index_label: Union[None, str, List[str]] = None,
lifecycle: Optional[int] = None
)
Writes a DataFrame to a MaxCompute table. If the table does not exist, MaxFrame creates it automatically.
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| table | String/Table | Yes | - | The target table name or table object. |
| partition | String | No | None | The target partition. Example: pt1=xxx, pt2=yyy. |
| partition_col | String/List | No | None | DataFrame columns to use as partition key columns in the output table. |
| overwrite | Boolean | No | False | Overwrites data if the table or partition already exists. |
| unknown_as_string | Boolean | No | False | When True, object-type columns in the DataFrame are written as STRING. An error may occur if type conversion fails. |
| index | Boolean | No | True | Writes the DataFrame index as a column in the output table. |
| index_label | String/List | No | None | Column name for the index. Defaults to index for a single-level index, or level_x (where x is the level of the index) for a multi-level index. |
| lifecycle | int | No | None | Lifecycle of the output table in days (positive integer). If the table already exists, this overwrites its current lifecycle setting. |
Returns: A DataFrame object.
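A common combination of these parameters is a daily overwrite of a single partition. The sketch below assembles such arguments. `daily_write_kwargs` is a hypothetical helper, and the partition column name `ds` and its `YYYYMMDD` value format are assumptions chosen for the example.

```python
import datetime
from typing import Any, Dict


def daily_write_kwargs(day: datetime.date, lifecycle_days: int,
                       partition_column: str = "ds") -> Dict[str, Any]:
    """Keyword arguments for to_odps_table that overwrite one daily partition."""
    # The lifecycle parameter is documented as a positive integer number of days.
    if not isinstance(lifecycle_days, int) or lifecycle_days <= 0:
        raise ValueError("lifecycle must be a positive integer number of days")
    return {
        "partition": f"{partition_column}={day:%Y%m%d}",
        "overwrite": True,   # replace the partition if it already exists
        "index": False,      # do not write the DataFrame index as a column
        "lifecycle": lifecycle_days,
    }
```

A call such as `df.to_odps_table('output_table', **daily_write_kwargs(datetime.date.today(), 7)).execute()` would then write the partition. The trailing `execute()` is needed because the returned DataFrame is evaluated lazily.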
Example
import maxframe.dataframe as md
df = md.read_odps_query(
'SELECT user_id, age, sex FROM `BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users`',
index_col='user_id'
)
# to_odps_table is lazy; execute() triggers the write.
df.to_odps_table('output_table', lifecycle=7).execute()
to_odps_model
to_odps_model(
model_name: str,
model_version: str = None,
schema: str = None,
project: str = None,
description: Optional[str] = None,
version_description: Optional[str] = None,
create_model: bool = True,
set_default_version: bool = False
)
Saves an XGBoost model trained in a MaxFrame job as a MaxCompute model object. Call .execute() on the returned Scalar to trigger the save operation.
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| model_name | String | Yes | - | The model name. If project and schema are specified separately, provide only the model name. Otherwise, use the format project.schema.model_name. |
| model_version | String | No | None | The model version. If not specified, the system generates a version automatically. |
| schema | String | No | "default" | The schema the model belongs to. |
| project | String | No | None | The project the model belongs to. |
| description | String | No | None | A description of the model. |
| version_description | String | No | None | A description of the model version. |
| create_model | Boolean | No | True | Creates the model if it does not already exist. |
| set_default_version | Boolean | No | False | Sets the saved version as the default version of the model. |
Returns: A Scalar object. Call .execute() to trigger the model saving operation.
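The two naming styles for model_name can be sketched with a small validator. `model_target` is a hypothetical helper, and the rule that a dotted name must not be combined with separate project or schema arguments is this sketch's reading of the parameter description, not confirmed library behavior.

```python
from typing import Dict, Optional


def model_target(model_name: str, project: Optional[str] = None,
                 schema: Optional[str] = None) -> Dict[str, str]:
    """Return to_odps_model naming arguments for either style (hypothetical helper)."""
    parts = model_name.split(".")
    if len(parts) == 1:
        # Bare name: project and schema travel as separate arguments, if given.
        kwargs = {"model_name": model_name}
        if project is not None:
            kwargs["project"] = project
        if schema is not None:
            kwargs["schema"] = schema
        return kwargs
    if len(parts) == 3 and all(parts) and project is None and schema is None:
        # Fully qualified name: project.schema.model_name in a single string.
        return {"model_name": model_name}
    raise ValueError(
        "use a bare model name, or 'project.schema.model_name' without "
        "separate project/schema arguments"
    )
```

The result can be unpacked into the call, for example `clf.to_odps_model(**model_target('my_model', project='my_project', schema='default'), model_version='version1').execute()`, where `my_project` is a placeholder.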
Example
from maxframe.learn.contrib.xgboost import XGBClassifier
import maxframe.dataframe as md
# Train an XGBoost model.
# X (features), y (labels), and cols (column names) are assumed to be defined already.
X_df = md.DataFrame(X, columns=cols)
clf = XGBClassifier(n_estimators=10)
clf.fit(X_df, y)
# Save the model to MaxCompute.
clf.to_odps_model(
model_name='my_model',
# If project and schema are not specified separately,
# use the format: model_name='project.schema.my_model'
model_version='version1'
).execute()
Execute
execute
Source code: execute
execute(
session: SessionType = None
)
Submits a data processing task to MaxCompute for execution. Because MaxFrame uses lazy execution, operations on a DataFrame are not computed until you call execute().
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| session | Session | No | None | The session to use for execution. If not specified, the global default session created by new_session is used. |
Returns: The object that execute() was called on, which allows chained calls such as df.execute().fetch().
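Lazy execution can be pictured with a toy model. The `LazyColumn` class below is purely illustrative and is not how MaxFrame is implemented: it records operations when they are declared, runs them only in execute(), and exposes the result through fetch().

```python
from typing import Callable, List, Optional


class LazyColumn:
    """Toy stand-in for a lazily evaluated column; not MaxFrame's implementation."""

    def __init__(self, values: List[int]) -> None:
        self._values = values
        self._steps: List[Callable[[int], int]] = []
        self._result: Optional[List[int]] = None

    def map(self, func: Callable[[int], int]) -> "LazyColumn":
        """Record a transformation; nothing is computed yet."""
        self._steps.append(func)
        return self

    def execute(self) -> "LazyColumn":
        """Apply every recorded step in order and keep the result."""
        result = list(self._values)
        for step in self._steps:
            result = [step(value) for value in result]
        self._result = result
        return self

    def fetch(self) -> List[int]:
        """Return the computed result, which exists only after execute()."""
        if self._result is None:
            raise RuntimeError("call execute() before fetch()")
        return self._result
```

In this toy, `LazyColumn([1, 2, 3]).map(lambda v: v + 1)` performs no work until `execute()` is called, which mirrors why a MaxFrame DataFrame must be executed before its data can be fetched.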
Example
import maxframe.dataframe as md
df = md.read_odps_query(
'SELECT user_id, age, sex FROM `BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users`',
index_col='user_id'
)
df.execute()
Fetch
fetch
Source code: fetch
fetch(
session: SessionType = None
)
Retrieves the computation result from MaxCompute and returns it as a pandas DataFrame or Series in your local environment. Always call execute() before fetch().
Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| session | Session | No | None | The session to use for fetching results. If not specified, the global default session created by new_session is used. |
Returns: A pandas DataFrame or Series.
Example
import maxframe.dataframe as md
df = md.read_odps_query(
'SELECT user_id, age, sex FROM `BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users`',
index_col='user_id'
)
result = df.execute().fetch()
print(result)
# Output:
# age sex
# user_id
# 1 24 M
# 2 53 F
# 3 23 M
# 4 24 M
# 5 33 F
# ... ... ..
# 939 26 F
# 940 32 M
# 941 20 M
# 942 48 F
# 943 22 M
#
# [943 rows x 2 columns]
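When more than one session is in use, passing the session explicitly to both calls avoids relying on the global default. The wrapper below is a minimal sketch. `execute_and_fetch` is a hypothetical helper that relies only on the session keyword documented for execute() and fetch().

```python
from typing import Any, Optional


def execute_and_fetch(obj: Any, session: Optional[Any] = None) -> Any:
    """Run obj on the given session (or the default one) and return the local result."""
    if session is None:
        # Fall back to the global default session created by new_session.
        return obj.execute().fetch()
    # Use the same session for both steps so the result is looked up where it was computed.
    return obj.execute(session=session).fetch(session=session)
```

With a session object returned by new_session, `execute_and_fetch(df, session=session)` behaves like the chained call in the example above while pinning both steps to that session.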