Function introduction
MaxFrame is a distributed computing framework from Alibaba Cloud MaxCompute that provides a Python programming interface. It addresses two key challenges in traditional Python data processing: performance bottlenecks and inefficient data movement. With MaxFrame, you can process and analyze petabyte-scale data directly on MaxCompute, including visual data exploration and analytics, scientific computing, machine learning, and AI development. This meets the growing demand for efficient big data processing and AI development within the Python ecosystem.
Use cases
Interactive data exploration
MaxFrame delivers a smooth experience that is not constrained by local memory. You can explore, manipulate, and visualize massive datasets in real time, just as you would in a local Jupyter Notebook.
Large-scale data preprocessing (ETL)
For tasks such as cleansing multi-terabyte raw data, format conversion, and feature engineering, you can replace complex SQL+UDF logic with more expressive and maintainable Python code, while still benefiting from the high performance of distributed execution.
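As a rough illustration of the kind of cleansing, format conversion, and feature engineering logic described above, the following sketch uses plain Pandas on a tiny in-memory sample. It is a local stand-in only: the column names, the sample records, and the `clean_and_enrich` helper are hypothetical, and whether each call shown here is available through the MaxFrame Pandas-compatible API is an assumption that should be checked against the MaxFrame reference.

```python
import pandas as pd

# Hypothetical raw records; in practice the data would live in a MaxCompute table.
raw = pd.DataFrame(
    {
        "user_id": [1, 2, 2, 3, 4],
        "signup_date": ["2024-01-05", "2024-02-10", "2024-02-10", None, "2024-03-15"],
        "amount": ["10.5", "20", "20", "7.25", "bad"],
    }
)


def clean_and_enrich(df: pd.DataFrame) -> pd.DataFrame:
    """Cleanse rows, convert column formats, and derive simple features."""
    # Cleansing and format conversion: remove exact duplicates, then parse
    # strings into typed columns. Unparseable values become NaN or NaT.
    out = df.drop_duplicates().assign(
        amount=lambda d: pd.to_numeric(d["amount"], errors="coerce"),
        signup_date=lambda d: pd.to_datetime(d["signup_date"], errors="coerce"),
    )
    # Cleansing: drop rows where either parsed field is missing.
    out = out.dropna(subset=["amount", "signup_date"])
    # Feature engineering: derive new columns from the cleaned ones.
    out = out.assign(
        signup_month=lambda d: d["signup_date"].dt.month,
        is_large_order=lambda d: d["amount"] > 15,
    )
    return out.reset_index(drop=True)


features = clean_and_enrich(raw)
print(features)
```

Expressing each stage as a chained DataFrame transformation keeps the logic in one readable Python function, which is the maintainability benefit this use case points to when compared with SQL plus separately packaged UDFs.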
AI and machine learning
In the model development workflow, MaxFrame unifies data processing and model training. Use it to efficiently prepare training data, then use the image feature to import libraries such as Scikit-learn and XGBoost and build end-to-end AI workflows.
Usage notes
Supported regions
China (Hangzhou), China (Shanghai), China (Beijing), China (Shenzhen), China (Hong Kong), Japan (Tokyo), Singapore, Indonesia (Jakarta), Germany (Frankfurt), US (Silicon Valley), and US (Virginia).
Supported environments
Local Python development environment.
MaxCompute Notebook.
DataWorks Notebook.
DataWorks Data Development PyODPS 3 task nodes.
Billing
MaxFrame billing is based on compute resource usage per job. It supports both pay-as-you-go and subscription billing methods.
Subscription: Jobs consume the quota of purchased resource groups with no additional charges.
For more information, see Analyze MaxCompute bill and usage details.
Core advantages
Compared to other Python development tools, MaxFrame better aligns with familiar development habits, enables more efficient data processing, provides more elastic computing resources, and delivers a more convenient development experience.
Pandas-compatible API: MaxFrame provides an API highly compatible with Pandas. This supports the smooth migration of existing code to the MaxCompute platform and significantly reduces learning and migration costs.
Server-side distributed execution: MaxFrame jobs run directly within the MaxCompute cluster. Data does not need to be pulled to a local machine. This eliminates performance bottlenecks caused by insufficient client memory and enables efficient processing of petabyte-scale data.
Elastic computing resources: MaxFrame relies on the MaxCompute serverless architecture to allocate computing resources on demand. This lets you process data tasks of any scale without managing a cluster.
Simplified development environment: MaxFrame provides built-in Python 3.7 and Python 3.11 environments with pre-installed common libraries such as Pandas and XGBoost. Manage third-party dependencies with simple annotations. This simplifies environment configuration and dependency management. It is more convenient than manually packaging and uploading user-defined function (UDF) dependencies.
The following table compares MaxFrame with other development tools:
Comparison Item | MaxFrame | PyODPS | Mars | SQL+UDF |
Development API | Compatible with Pandas. | Syntax and API differ significantly from Pandas DataFrame. | | Requires using two sets of APIs: SQL and Python. |
Data processing | At runtime, data is processed on the server and does not need to be pulled to a local machine. This reduces unnecessary local data transfer and improves job execution efficiency. | The | Distributed execution is supported for only some operators. Cluster creation is required during initialization, which is slow and offers low stability. | Supports distributed jobs based on MaxCompute SQL capabilities. |
Computing resources | Not limited by the size of local resources, breaking the single-machine performance bottleneck of Python. | Limited by the size of local resources. | Limited by resource size. The size of workers, CPU, and memory must be specified. | Enables elastic computing for SQL jobs based on the MaxCompute serverless capabilities. |
Development experience | Out-of-the-box interactive development environment and offline scheduling capabilities. Common libraries are built-in. Manage third-party dependencies using annotations, with no need for manual packaging. | Out-of-the-box interactive development environment and offline scheduling capabilities. | Requires preparing the corresponding runtime environment and launching a Mars cluster. | Dependency packages for Python UDFs must be manually packaged and uploaded. |
How it works
MaxFrame hides the complexity of distributed computing from the user. The automated workflow is as follows:
Code submission: Write and execute Python code on a client, such as a Notebook. The MaxFrame software development kit (SDK) captures the code and submits it to MaxCompute.
Parsing and optimization: After the MaxCompute execution engine receives the job, it performs syntax parsing and logical optimization. It then transforms the job into a physical plan that can be executed in parallel.
Distributed execution: The optimized task is distributed to numerous compute nodes in the MaxCompute cluster. The nodes directly read the data and perform parallel computing.
Result return: After the computation is complete, the results are aggregated and returned to your client.
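The four steps above can be mimicked with a toy model in pure Python. This is not the MaxFrame SDK or the MaxCompute engine: `LazyDataset`, `compile_plan`, and `execute_sum` are hypothetical names, a thread pool stands in for the compute nodes, and the "optimization" is limited to fusing all recorded steps into a single pass over each partition.

```python
from concurrent.futures import ThreadPoolExecutor


class LazyDataset:
    """Records operations instead of running them (step 1: code submission)."""

    def __init__(self, partitions, steps=()):
        self.partitions = partitions  # data already split across "nodes"
        self.steps = tuple(steps)     # logical plan: ("map" | "filter", fn) pairs

    def map(self, fn):
        return LazyDataset(self.partitions, self.steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self.partitions, self.steps + (("filter", fn),))

    def execute_sum(self, max_workers=4):
        # Step 2: turn the recorded plan into one executable task.
        task = compile_plan(self.steps)
        # Step 3: run the task on every partition in parallel.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            partial_sums = list(pool.map(task, self.partitions))
        # Step 4: aggregate the partial results and return them to the caller.
        return sum(partial_sums)


def compile_plan(steps):
    """Fuse all recorded steps into a single pass over one partition."""

    def run_partition(rows):
        total = 0
        for row in rows:
            keep = True
            for kind, fn in steps:
                if kind == "map":
                    row = fn(row)
                elif not fn(row):  # a filter step rejected the row
                    keep = False
                    break
            if keep:
                total += row
        return total

    return run_partition


# Two partitions stand in for data stored on two compute nodes.
partitions = [range(0, 5), range(5, 10)]
plan = LazyDataset(partitions).map(lambda x: x * 2).filter(lambda x: x % 3 == 0)
result = plan.execute_sum()
print(result)  # 36
```

Nothing is computed while `plan` is being built. Work starts only when `execute_sum` is called, and only one number per partition travels back to the caller, which mirrors the idea that the client receives aggregated results rather than raw data.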