MaxFrame Python API Overview - MaxCompute

MaxFrame provides a set of APIs beyond the standard pandas interface for managing sessions, reading and writing MaxCompute tables, triggering distributed computation, and retrieving results locally.

Session

new_session

Source code: new_session

new_session(
    session_id: str = None,
    default: bool = True,
    new: bool = True,
    odps_entry: Optional[ODPS] = None
)

Creates a MaxFrame session and connects to MaxCompute.

Parameters

Parameter	Type	Required	Default	Description
`session_id`	String	No	None	A unique identifier for the session. If not specified, MaxFrame generates one automatically. When `new=False`, this identifies the existing session to reuse.
`default`	Boolean	No	True	Sets the session as the global default. When True, subsequent calls to `execute()` and `fetch()` use this session without requiring an explicit `session` argument.
`new`	Boolean	No	True	Creates a new session. Set to False to connect to an existing session identified by `session_id`.
`odps_entry`	ODPS	Yes	—	The MaxCompute entry object. See Create a MaxCompute entry point.

Returns: The session object.

Example

import os
from maxframe import new_session
from odps import ODPS

# Initialize the MaxCompute entry object.
# Store credentials in environment variables — do not hardcode them.
o = ODPS(
    os.environ.get('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.environ.get('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='your-default-project',
    endpoint='your-endpoint',
)

# Create the MaxFrame session.
session = new_session(odps_entry=o)
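
Because `os.environ.get` returns `None` for an unset variable, a missing credential may only surface later, when the connection is attempted. A minimal sketch of an up-front check is shown below. `require_env` and `REQUIRED_VARS` are hypothetical names introduced for illustration and are not part of MaxFrame or PyODPS.

```python
import os
from typing import Mapping, Tuple

# Environment variable names used in the example above.
REQUIRED_VARS = (
    "ALIBABA_CLOUD_ACCESS_KEY_ID",
    "ALIBABA_CLOUD_ACCESS_KEY_SECRET",
)


def require_env(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (access_key_id, access_key_secret); raise if either is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[REQUIRED_VARS[0]], env[REQUIRED_VARS[1]]
```

The returned pair can then be passed as the first two arguments of `ODPS(...)` in place of the direct `os.environ.get` calls.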

Input/Output

The following functions read data from and write data to MaxCompute.

Function	Description
`read_odps_table`	Reads a MaxCompute table into a DataFrame
`read_odps_query`	Runs a SQL query and returns results as a DataFrame
`to_odps_table`	Writes a DataFrame to a MaxCompute table
`to_odps_model`	Saves a trained XGBoost model to MaxCompute

Choosing between `read_odps_table` and `read_odps_query`: Use `read_odps_table` when reading from a specific table (with optional partition and column filters). Use `read_odps_query` when you need SQL-level filtering or joins across multiple tables.

read_odps_table

Source code: read_odps_table

read_odps_table(
    table_name: Union[str, Table],
    partitions: Union[None, str, List[str]] = None,
    columns: Optional[List[str]] = None,
    index_col: Union[None, str, List[str]] = None,
    odps_entry: ODPS = None,
    string_as_binary: bool = None,
    append_partitions: bool = False
)

Reads data from a MaxCompute table and returns it as a DataFrame. If no index columns are specified, a RangeIndex is generated.

Parameters

Parameter	Type	Required	Default	Description
`table_name`	String/Table	Yes	—	The MaxCompute table name or table object to read from.
`partitions`	String/List	No	None	The partition or list of partitions to read. Format: `<partition_name>=<partition_value>`. If not specified, all partitions are read.
`columns`	List	No	None	The names of the columns to read, as a list. Example: `['age', 'sex']`. If not specified, all non-partition columns are read.
`index_col`	String/List	No	None	One or more columns to use as the DataFrame index.
`odps_entry`	ODPS	No	None	The MaxCompute entry object. See Create a MaxCompute entry point.
`string_as_binary`	Boolean	No	None	Reads string columns in binary form.
`append_partitions`	Boolean	No	False	When True and `columns` is not specified, includes partition key columns in the result.

Returns: A DataFrame object.

Example

import maxframe.dataframe as md

df = md.read_odps_table(
    'BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users',
    index_col='user_id',
    columns=['age', 'sex']
)
print(df.execute().fetch())

# Output:
#          age sex
# user_id
# 1         24   M
# 2         53   F
# 3         23   M
# 4         24   M
# 5         33   F
# ...      ...  ..
# 939       26   F
# 940       32   M
# 941       20   M
# 942       48   F
# 943       22   M
#
# [943 rows x 2 columns]
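
The `partitions` argument takes strings in the `<partition_name>=<partition_value>` form. As a sketch, such a string can be assembled from key-value pairs. `partition_spec` is a hypothetical helper, and joining several partition levels with commas is an assumption based on the `pt1=xxx, pt2=yyy` example given for `to_odps_table`.

```python
from typing import Mapping


def partition_spec(parts: Mapping[str, object]) -> str:
    """Join partition key/value pairs as 'k1=v1,k2=v2' in insertion order."""
    if not parts:
        raise ValueError("at least one partition key is required")
    return ",".join(f"{key}={value}" for key, value in parts.items())


spec = partition_spec({"pt": "20240101", "region": "cn"})
print(spec)  # pt=20240101,region=cn
```

The resulting string could be passed as `partitions=spec`, or several such strings could be passed as a list.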

read_odps_query

Source code: read_odps_query

read_odps_query(
    query: str,
    odps_entry: ODPS = None,
    index_col: Union[None, str, List[str]] = None,
    string_as_binary: bool = None
)

Runs a MaxCompute SQL query and returns the results as a DataFrame. If no index columns are specified, a RangeIndex is generated.

Parameters

Parameter	Type	Required	Default	Description
`query`	String	Yes	—	The MaxCompute SQL statement to run.
`odps_entry`	ODPS	No	None	The MaxCompute entry object. See Create a MaxCompute entry point.
`index_col`	String/List	No	None	One or more columns to use as the DataFrame index.
`string_as_binary`	Boolean	No	None	Reads string columns in binary form.

Returns: A DataFrame object.

Example

import maxframe.dataframe as md

df = md.read_odps_query(
    'SELECT user_id, age, sex FROM `BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users`'
)
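
When the table name or column list varies, the query string can be built in one place. The sketch below uses a hypothetical `build_select` helper (not a MaxFrame function) that backtick-quotes the table name as in the example above. It performs no escaping, so it is only suitable for trusted inputs.

```python
from typing import Optional, Sequence


def build_select(table: str, columns: Sequence[str], where: Optional[str] = None) -> str:
    """Build a SELECT statement with a backtick-quoted table name."""
    if not columns:
        raise ValueError("columns must not be empty")
    query = f"SELECT {', '.join(columns)} FROM `{table}`"
    if where:
        query += f" WHERE {where}"
    return query


query = build_select(
    "BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users",
    ["user_id", "age", "sex"],
    where="age >= 30",
)
print(query)
```

The resulting string can be passed as the `query` argument of `read_odps_query`.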

to_odps_table

Source code: to_odps_table

to_odps_table(
    table: Union[Table, str],
    partition: Optional[str] = None,
    partition_col: Union[None, str, List[str]] = None,
    overwrite: bool = False,
    unknown_as_string: Optional[bool] = None,
    index: bool = True,
    index_label: Union[None, str, List[str]] = None,
    lifecycle: Optional[int] = None
)

Writes a DataFrame to a MaxCompute table. If the table does not exist, MaxFrame creates it automatically.

Parameters

Parameter	Type	Required	Default	Description
`table`	String/Table	Yes	—	The target table name or table object.
`partition`	String	No	None	The target partition. Example: `pt1=xxx, pt2=yyy`.
`partition_col`	String/List	No	None	DataFrame columns to use as partition key columns in the output table.
`overwrite`	Boolean	No	False	Overwrites data if the table or partition already exists.
`unknown_as_string`	Boolean	No	None	When True, object-type columns in the DataFrame are written as STRING. An error may occur if type conversion fails.
`index`	Boolean	No	True	Writes the DataFrame index as a column in the output table.
`index_label`	String/List	No	None	Column name for the index. Defaults to `index` for a single-level index, or `level_x` (where x is the level of the index) for a multi-level index.
`lifecycle`	int	No	None	Lifecycle of the output table in days (positive integer). If the table already exists, this overwrites its current lifecycle setting.

Returns: A DataFrame object.

Example

import maxframe.dataframe as md

df = md.read_odps_query(
    'SELECT user_id, age, sex FROM `BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users`',
    index_col='user_id'
)
df.to_odps_table('output_table', lifecycle=7)
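
The default naming rule described for `index_label` can be sketched as follows. `default_index_labels` is a hypothetical helper that only mirrors the rule in the parameter table. Numbering levels from 0 is an assumption borrowed from pandas and should be checked against the actual output table.

```python
from typing import List


def default_index_labels(n_levels: int) -> List[str]:
    """Return the assumed default index column names for an index with n_levels levels."""
    if n_levels < 1:
        raise ValueError("an index has at least one level")
    if n_levels == 1:
        return ["index"]
    # Assumption: levels are numbered from 0, as pandas does.
    return [f"level_{i}" for i in range(n_levels)]


print(default_index_labels(1))  # ['index']
print(default_index_labels(2))  # ['level_0', 'level_1']
```

Passing `index_label` explicitly avoids depending on these defaults.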

to_odps_model

to_odps_model(
    model_name: str,
    model_version: str = None,
    schema: str = None,
    project: str = None,
    description: Optional[str] = None,
    version_description: Optional[str] = None,
    create_model: bool = True,
    set_default_version: bool = False
)

Saves an XGBoost model trained in a MaxFrame job as a MaxCompute model object. Call `.execute()` on the returned Scalar to trigger the save operation.

Parameters

Parameter	Type	Required	Default	Description
`model_name`	String	Yes	—	The model name. If `project` and `schema` are specified separately, provide only the model name. Otherwise, use the format `project.schema.model_name`.
`model_version`	String	No	None	The model version. If not specified, the system generates a version automatically.
`schema`	String	No	`"default"`	The schema the model belongs to.
`project`	String	No	None	The project the model belongs to.
`description`	String	No	None	A description of the model.
`version_description`	String	No	None	A description of the model version.
`create_model`	Boolean	No	True	Creates the model if it does not already exist.
`set_default_version`	Boolean	No	False	Sets the saved version as the default version of the model.

Returns: A Scalar object. Call `.execute()` to trigger the model saving operation.

Example

from maxframe.learn.contrib.xgboost import XGBClassifier
import maxframe.dataframe as md

# Train an XGBoost model.
# X, y, and cols are placeholders for your feature data, labels, and column names.
X_df = md.DataFrame(X, columns=cols)
clf = XGBClassifier(n_estimators=10)
clf.fit(X_df, y)

# Save the model to MaxCompute.
clf.to_odps_model(
    model_name='my_model',
    # If project and schema are not specified separately,
    # use the format: model_name='project.schema.my_model'
    model_version='version1'
).execute()
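
The two naming styles described for `model_name` can be sketched as a small resolver. `split_model_name` is a hypothetical helper, not part of MaxFrame. Falling back to the `default` schema follows the parameter table, and rejecting any other name shape is a choice made for this sketch.

```python
from typing import Optional, Tuple


def split_model_name(
    model_name: str,
    project: Optional[str] = None,
    schema: Optional[str] = None,
) -> Tuple[Optional[str], str, str]:
    """Return (project, schema, name) from either naming style."""
    parts = model_name.split(".")
    if len(parts) == 3:
        # Fully qualified form: project.schema.model_name
        if project is not None or schema is not None:
            raise ValueError("give project/schema separately or inside model_name, not both")
        return parts[0], parts[1], parts[2]
    if len(parts) == 1:
        # Bare name: project and schema come from the separate arguments.
        return project, schema or "default", model_name
    raise ValueError("expected 'model_name' or 'project.schema.model_name'")


print(split_model_name("my_project.my_schema.my_model"))
print(split_model_name("my_model", project="my_project", schema="my_schema"))
```

Both calls resolve to the same project, schema, and model name.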

Execute

execute

Source code: execute

execute(
    session: SessionType = None
)

Submits a data processing task to MaxCompute for execution. Because MaxFrame uses lazy execution, operations on a DataFrame are not computed until you call `execute()`.

Parameters

Parameter	Type	Required	Default	Description
`session`	Session	No	None	The session to use for execution. If not specified, the global default session created by `new_session` is used.

Returns: The object on which `execute()` was called, which allows chaining such as `df.execute().fetch()`.

Example

import maxframe.dataframe as md

df = md.read_odps_query(
    'SELECT user_id, age, sex FROM `BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users`',
    index_col='user_id'
)
df.execute()
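
The lazy execution model can be illustrated with a toy class in plain Python. This is a conceptual sketch only, not MaxFrame's implementation: `LazyValue` is a hypothetical class that records operations, runs them when `execute()` is called, and returns itself so that `fetch()` can be chained. Raising `RuntimeError` when `fetch()` is called too early is a choice made for this sketch.

```python
from typing import Callable, List, Optional


class LazyValue:
    """Toy stand-in for a lazily evaluated object (not MaxFrame code)."""

    def __init__(self, source: List[int]) -> None:
        self._source = source
        self._ops: List[Callable[[List[int]], List[int]]] = []
        self._result: Optional[List[int]] = None

    def map(self, fn: Callable[[int], int]) -> "LazyValue":
        # Record the operation; nothing is computed yet.
        self._ops.append(lambda data: [fn(x) for x in data])
        return self

    def execute(self) -> "LazyValue":
        # Run every recorded operation, then return self so fetch() can be chained.
        data = list(self._source)
        for op in self._ops:
            data = op(data)
        self._result = data
        return self

    def fetch(self) -> List[int]:
        if self._result is None:
            raise RuntimeError("call execute() before fetch()")
        return self._result


calls: List[int] = []


def double(x: int) -> int:
    calls.append(x)  # Track when the function actually runs.
    return x * 2


lazy = LazyValue([1, 2, 3]).map(double)
print(calls)                   # [] because nothing has run yet
print(lazy.execute().fetch())  # [2, 4, 6]
```

The first `print` shows that `map` only records work; the computation happens inside `execute()`.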

Fetch

fetch

Source code: fetch

fetch(
    session: SessionType = None
)

Retrieves the computation result from MaxCompute and returns it as a pandas DataFrame or Series in your local environment. Always call `execute()` before `fetch()`.

Parameters

Parameter	Type	Required	Default	Description
`session`	Session	No	None	The session to use for fetching results. If not specified, the global default session created by `new_session` is used.

Returns: A pandas DataFrame or Series.

Example

import maxframe.dataframe as md

df = md.read_odps_query(
    'SELECT user_id, age, sex FROM `BIGDATA_PUBLIC_DATASET.data_science.maxframe_ml_100k_users`',
    index_col='user_id'
)
result = df.execute().fetch()
print(result)

# Output:
#          age sex
# user_id
# 1         24   M
# 2         53   F
# 3         23   M
# 4         24   M
# 5         33   F
# ...      ...  ..
# 939       26   F
# 940       32   M
# 941       20   M
# 942       48   F
# 943       22   M
#
# [943 rows x 2 columns]