Managing third-party Python dependencies in distributed MaxFrame jobs typically requires manually uploading packages to MaxCompute before each run. The automatic packaging service removes this step: declare your dependencies in code using with_python_requirements, and MaxFrame resolves and bundles them at runtime automatically.
Prerequisites
Before you begin, ensure that you have:
- A MaxFrame session connected to MaxCompute.
- (If using the on-premises MaxFrame client) MaxFrame SDK version V0.1.0b5 or later. See Preparations for setup instructions.
How it works
1. Decorate your UDF with @with_python_requirements, listing the packages your function needs.
2. When the job runs, MaxFrame resolves the listed packages from PyPI and bundles them into the job environment.
3. On subsequent runs, if the packaged result is already cached, MaxFrame skips repackaging.

Packaging is triggered on the first run. If the cache is cleared (temporary resources are deleted daily when force_rebuild=False), MaxFrame repackages automatically on the next run, which adds latency.
Declare dependencies with with_python_requirements
The with_python_requirements decorator is the entry point for the automatic packaging service.
```python
def with_python_requirements(
    *requirements: str,
    force_rebuild: bool = False,
    prefer_binary: bool = False,
    pre_release: bool = False,
): ...
```
Parameters
`requirements` (required)
One or more dependency package specifiers, following PEP 508 syntax — the same format pip uses.
```python
@with_python_requirements("scikit_learn>1.0", "xgboost>1.0")
```
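Because each argument is a standard PEP 508 specifier, you can sanity-check a specifier locally before submitting a job. Below is a minimal sketch using the `packaging` library (the parser pip itself builds on); the specifier strings are examples, not required values:

```python
from packaging.requirements import Requirement

# Each argument to with_python_requirements should be one valid PEP 508 string.
specs = ["scikit_learn>1.0", "xgboost>1.0", "jieba==0.40"]

for spec in specs:
    # Requirement() raises InvalidRequirement on malformed input,
    # so this catches typos before the job is submitted.
    req = Requirement(spec)
    print(req.name, str(req.specifier))
```

A malformed string such as `"jieba==0.40 cloudpickle"` (two packages in one string) fails this check, which mirrors how the decorator expects one package per argument.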
`force_rebuild` (optional, default: False)
Controls whether MaxFrame repackages dependencies that are already cached.
| Value | Behavior |
|---|---|
| False (default) | Skip repackaging if a cached result exists. The cached package is stored as a temporary resource and deleted daily. |
| True | Always repackage using the latest PyPI image version. The result is stored as a long-term resource and is not deleted automatically. |
For development and iterative testing, keep the default False. Use force_rebuild=True when you want to force an upgrade to the latest package version and have the result stored as a long-term resource.
With force_rebuild=False, the temporary resource is deleted daily. If the cache is cleared between runs, MaxFrame repackages automatically, which adds latency to the next run.
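The cache behavior described above can be summarized with a small sketch. This is an illustrative toy model, not MaxFrame's actual implementation: a cached entry may disappear between runs (daily cleanup of temporary resources), and a rebuild simply happens on the next run.

```python
# Toy model of the packaging cache (illustrative only; not MaxFrame's real logic).
cache = {}

def ensure_packaged(requirements, force_rebuild=False):
    """Return (package, was_rebuilt) for a tuple of requirement strings."""
    key = tuple(requirements)
    if not force_rebuild and key in cache:
        return cache[key], False           # cached: skip repackaging
    package = f"bundle({', '.join(key)})"  # stand-in for the real build step
    cache[key] = package
    return package, True                   # (re)built: adds latency to this run

reqs = ("jieba==0.40", "cloudpickle")
_, rebuilt_first = ensure_packaged(reqs)   # first run: packages
_, rebuilt_second = ensure_packaged(reqs)  # second run: cache hit, no rebuild
cache.clear()                              # models the daily cleanup
_, rebuilt_third = ensure_packaged(reqs)   # cache cleared: repackages, job still succeeds
```

The third call illustrates the note above: a cleared cache never fails the job; it only costs repackaging time once.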
`prefer_binary` (optional, default: False)
Controls whether MaxFrame prefers pre-built binary wheel files over source distributions.
| Value | Behavior |
|---|---|
| False (default) | No preference; pip resolves the best match normally. |
| True | Prefer binary wheels, equivalent to passing --prefer-binary to pip. |
Preferring binary wheels can speed up packaging, but the selected version may not be the latest release.
`pre_release` (optional, default: False)
Controls whether pre-release (alpha or beta) package versions are eligible for packaging.
| Value | Behavior |
|---|---|
| False (default) | Only stable releases are packaged. |
| True | Alpha and beta releases are included. |
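All three options can be combined on a single decorator. Because maxframe may not be installed where you are reading this, the sketch below defines a hypothetical no-op stand-in with the same signature purely to show the call shape; in a real job you would import with_python_requirements from maxframe.udf instead.

```python
# Hypothetical stand-in mirroring the documented signature, so the call shape
# can be shown without a MaxFrame installation. In real code, replace this with:
#   from maxframe.udf import with_python_requirements
def with_python_requirements(*requirements, force_rebuild=False,
                             prefer_binary=False, pre_release=False):
    def wrap(func):
        func.requirements = requirements  # what would be resolved and bundled
        func.options = {
            "force_rebuild": force_rebuild,
            "prefer_binary": prefer_binary,
            "pre_release": pre_release,
        }
        return func
    return wrap

# Prefer wheels for faster packaging and allow pre-release versions.
@with_python_requirements("xgboost>1.0", prefer_binary=True, pre_release=True)
def predict(row):
    import xgboost  # resolved and bundled by the packaging service at runtime
    return row
```

Keyword options apply to the whole package set, not to individual specifiers.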
Example
The following example uses with_python_requirements to inject jieba, cloudpickle, and pandas into a DataFrame apply job.
```python
import os

import maxframe.dataframe as md
from maxframe import new_session
from maxframe.udf import with_python_requirements
from odps import ODPS

# Initialize the ODPS client.
# Load credentials from environment variables; avoid hardcoding the AccessKey ID
# and AccessKey secret in your code.
o = ODPS(
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_ID'),
    os.getenv('ALIBABA_CLOUD_ACCESS_KEY_SECRET'),
    project='your-default-project',
    endpoint='your-end-point',
)
session = new_session(o)

data = [["abcd"], ["efgh"], ["ijkl"], ["mno"]]
md_df = md.DataFrame(data, columns=["col1"])

# Declare dependencies as separate PEP 508 specifiers.
# MaxFrame packages them automatically at runtime.
@with_python_requirements("jieba==0.40", "cloudpickle", "pandas")
def process(row):
    import jieba

    row["col1"] = row["col1"] + "_" + jieba.__version__
    return row

md_result = (
    md_df.apply(
        process,
        axis=1,
        result_type="expand",
        output_type="dataframe",
        dtypes=md_df.dtypes.copy(),
    )
    .execute()
    .fetch()
)
```
Replace the following placeholders with your actual values:
| Placeholder | Description |
|---|---|
| your-default-project | Your MaxCompute project name |
| your-end-point | Your MaxCompute endpoint |
FAQ
When does packaging happen?
Packaging is triggered at the start of the first job run. If the packaged result is already cached, MaxFrame skips repackaging and the job starts immediately.
What if the cached package is deleted before my next run?
With force_rebuild=False, the cached package is stored as a temporary resource and deleted daily. If it is deleted before your next run, MaxFrame repackages automatically. This adds latency to that run but does not cause the job to fail.
How do I ensure the latest package versions are used across runs?
Set force_rebuild=True. MaxFrame repackages using the latest PyPI image version and stores the result as a long-term resource that is not deleted automatically.