MaxFrame is a distributed scientific computing framework developed by Alibaba Cloud. Building on PyODPS and Mars, MaxFrame provides Pandas-compatible APIs and lets you work with MaxCompute the same way you work with Python.
Background information
Python is the mainstream programming language for machine learning and AI model development. It offers rich scientific computing and visualization libraries such as NumPy (multidimensional array operations), Pandas (data analytics), Matplotlib (2D plotting), and scikit-learn (data analytics and mining algorithms), as well as training frameworks such as TensorFlow, PyTorch, XGBoost, and LightGBM.
MaxCompute provides a Python development ecosystem for large-scale data processing, analytics, mining, and model training. With unified Python APIs, you can perform data processing and mining efficiently.
Development history
The following figure shows the development history of the MaxCompute Python development ecosystem.
PyODPS
PyODPS was officially released in 2015 as the MaxCompute SDK for Python. It lets you operate on MaxCompute data through Python interfaces. Over multiple iterations, PyODPS added DataFrame support with Pandas-like syntax and built-in operators for aggregation, sorting, and deduplication.
Core features of PyODPS:
-
Support for basic operations on MaxCompute objects (in 2015):
-
Access MaxCompute objects such as tables, resources, and functions.
-
Submit SQL requests by using the
run_sqlorexecute_sqlmethod. -
Run Platform for AI (PAI) machine learning tasks by using the
run_xfloworexecute_xflowmethod. -
Upload and download data by using the
open_write,open_reader, or Tunnel API operations.
-
-
DataFrame APIs and Pandas-like interfaces for MaxCompute DataFrame computing (from 2016 to 2022):
-
PyODPS DataFrame lets you perform data operations in Python, leveraging its native language features.
-
PyODPS DataFrame provides Pandas-like interfaces with extended syntax, including MapReduce APIs for the big data environment.
-
PyODPS DataFrame provides built-in functions for aggregation, sorting, deduplication, sampling, and visualization.
-
Mars
The Python ecosystem offers rich scientific computing libraries such as NumPy, Pandas, and scikit-learn with convenient analytics and mining operators. However, most of these libraries are limited to standalone resources. Mars is a tensor-based distributed computing framework that implements approximately 70% of NumPy interfaces in a distributed manner, significantly reducing the difficulty of writing distributed scientific computing code.
Core features of Mars:
Compatibility and distributed capability: Mars was officially open-sourced in January 2019. It enables NumPy, Pandas, scikit-learn, and Python functions to run in a distributed manner while remaining compatible with most interfaces.
MaxFrame
Mars and PyODPS target different scenarios. Users familiar with Pandas who want to run NumPy or scikit-learn in parallel are better suited to Mars. Users who need DataFrame-based processing with high stability at terabyte scale or above are better suited to PyODPS. However, the complexity of this split architecture creates difficulties for users.
MaxFrame is a distributed scientific computing framework developed by Alibaba Cloud. It is fully compatible with Pandas interfaces and automatically selects the optimal underlying engine to run jobs, improving both performance and development efficiency. You no longer need to choose an execution engine manually, and can efficiently complete the entire workflow from data development and analytics to AI training and inference. The architecture is shown in the following diagram.
Core features of MaxFrame:
-
More familiar development habits
MaxFrame is compatible with the Python ecosystem and provides unified development interfaces for MaxCompute. You can use the same Python code throughout the entire data and AI development process.
MaxFrame can directly reference third-party libraries such as NumPy, SciPy, Pandas, and Matplotlib for scientific computing, data analysis, and visualization, reducing operation costs.
-
Higher processing performance
MaxFrame accesses MaxCompute data directly without pulling data to your on-premises machine, eliminating data transfers and improving execution efficiency.
MaxFrame leverages the elastic computing resources in MaxCompute with automatic distribution and parallel processing, significantly reducing data processing time.
-
More convenient development experience
MaxFrame is integrated with MaxCompute Notebook and DataWorks. You can use MaxFrame directly in either environment without additional configuration, or install it in your on-premises environment.
MaxFrame supports both built-in and custom images in MaxCompute, reducing environment setup time and preventing version conflicts.
-
Improved operator support
MaxFrame is fully compatible with Pandas interfaces and automatically distributes processing, delivering powerful data processing capabilities with high computing efficiency.