
Scientific Analysis of Large-Scale Data Based on the Distributed Python Capabilities of MaxCompute

This article explains how to accelerate data science with distributed Python on the cloud.

By Meng Shuo, Product Expert of Alibaba Cloud Intelligence

How can we accelerate data science with distributed Python on the cloud? If you are familiar with the data science technology stack, including NumPy, pandas, and scikit-learn, but are limited by the computing performance of a single machine, this article shows how MaxCompute lets you apply parallel and distributed technologies to accelerate data science. In other words, if you already know NumPy, pandas, and scikit-learn, picking up the distributed Python capabilities of MaxCompute will not be difficult.

1. The Importance of the Python Ecosystem

Why Python?

Python has grown to become the dominant language in data analytics and general programming.

According to Stack Overflow statistics, the following figure shows the trends of Python, C#, JavaScript, Java, PHP, C++, SQL, and R from 2009 to 2021. Python is clearly on the rise, especially in the data analysis and data science fields, where it has become almost the top programming language. This is the trajectory of the Python ecosystem. However, in data analysis, data science, and machine learning, the programming language is not the only thing that matters.

[Figure: Programming language trends from 2009 to 2021. Source: https://insights.stackoverflow.com/trends]

Technology Stack of Data Science

Programming languages are only one aspect of the data science field. Python is one option, but some data analysts use SQL (the traditional analysis language), R, or the functional programming language Scala. A relatively complete data science technology stack also includes:

  • Analysis libraries, such as NumPy and pandas, and visualization libraries
  • An O&M technology stack for the clusters where Python runs, such as Docker and Kubernetes
  • Data cleaning and extract, transform, and load (ETL) processes in the early stage of data analysis and data science; cleaning work that takes more than one or two steps requires a workflow to complete the overall ETL process, which brings in popular components such as Spark and the workflow scheduler Airflow
  • Storage for the final results, typically a PostgreSQL database or the in-memory database Redis, with a BI tool connected externally to display them
  • Machine learning platforms and components, such as TensorFlow and PyTorch
  • Web development frameworks, such as Flask, for quickly building a frontend platform
  • Business intelligence software, including BI tools such as Tableau and Power BI, along with SaaS software that is often used in the data science field

This is a relatively complete overview of the data science technology stack. Starting from programming languages, we find that data science on large-scale data requires consideration of all these aspects.

[Figure: Overview of the data science technology stack]

2. An Introduction to the Distributed Python Capabilities of MaxCompute

The Distributed Python Technology of MaxCompute: PyODPS

MaxCompute is a cloud data warehouse in SaaS mode that is compatible with the Python ecosystem.

PyODPS is the Python SDK for MaxCompute. It provides basic operations on MaxCompute objects and a DataFrame framework (a two-dimensional table structure that supports adding, retrieving, updating, and deleting data records) for analyzing data on MaxCompute.

SQL and DataFrame tasks submitted by PyODPS are converted into MaxCompute SQL for distributed execution. If a third-party library can run in the form of UDF + SQL, it can also run in a distributed manner.
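As a minimal sketch of what this looks like in practice, the following assumes a hypothetical table named sales_records with region and amount columns; the filter and aggregation below are compiled into MaxCompute SQL and executed in a distributed manner:

from odps import ODPS
from odps.df import DataFrame

# placeholder credentials; replace with your own values
o = ODPS('<access_id>', '<secret_access_key>', '<project>', endpoint='<endpoint>')

# wrap a MaxCompute table (hypothetical name) in a PyODPS DataFrame
df = DataFrame(o.get_table('sales_records'))

# the filter and aggregation run as MaxCompute SQL on the cluster
result = df[df.amount > 100].groupby('region').agg(total=df.amount.sum())
print(result.head(10))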

Sometimes tasks need to be split into subtasks to run in a distributed manner in Python. For example, native Python has no distributed capabilities for large-scale vector computing. Here we recommend MaxCompute Mars, a framework that can split Python tasks into subtasks for distributed execution.

Use Third-Party Packages in Custom Functions

MaxCompute also supports using third-party packages in Python. The steps are listed below:

Step 1

Determine the third-party packages to be used:

  • sklearn and scipy

Step 2

Find all dependencies for the corresponding packages:

  • sklearn, scipy, pytz, pandas, six, and python-dateutil

Step 3

Download the corresponding third-party packages from PyPI:

  • python-dateutil-2.6.0.zip
  • pytz-2017.2.zip, six-1.11.0.tar.gz
  • pandas-0.20.2-cp27-cp27m-manylinux1_x86_64.zip
  • scipy-0.19.0-cp27-cp27m-manylinux1_x86_64.zip
  • scikit_learn-0.18.1-cp27-cp27m-manylinux1_x86_64.zip

Step 4

Convert the uploaded resources into Resource objects of MaxCompute. By doing so, the third-party packages can be used when we create functions and reference custom functions (a sketch of this step follows the function code below).

Custom Function Code

def test(x):
    # import third-party packages inside the function so they are
    # resolved from the uploaded resources on the MaxCompute workers
    from sklearn import datasets, svm
    from scipy import misc

    # train a simple SVM classifier on the iris dataset as a sanity check
    iris = datasets.load_iris()
    clf = svm.LinearSVC()
    clf.fit(iris.data, iris.target)
    pred = clf.predict([[5.0, 3.6, 1.3, 0.25]])
    assert pred[0] == 0
    # verify that scipy works by loading its sample image
    assert misc.face().shape is not None
    return x
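Below is a minimal sketch of Step 4 with PyODPS, assuming the o connection and a DataFrame df like those sketched earlier, and that the packages from Step 3 have been downloaded locally; the column name feature is a placeholder:

# upload one of the downloaded packages as a MaxCompute resource
with open('scikit_learn-0.18.1-cp27-cp27m-manylinux1_x86_64.zip', 'rb') as f:
    o.create_resource('scikit_learn.zip', 'file', file_obj=f)

# apply the custom function to a hypothetical column; the libraries
# parameter ships the packages and their dependencies to the workers
df.feature.map(test).execute(
    libraries=['python-dateutil-2.6.0.zip', 'pytz-2017.2.zip',
               'six-1.11.0.tar.gz',
               'pandas-0.20.2-cp27-cp27m-manylinux1_x86_64.zip',
               'scipy-0.19.0-cp27-cp27m-manylinux1_x86_64.zip',
               'scikit_learn-0.18.1-cp27-cp27m-manylinux1_x86_64.zip'])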

The Distributed Python Technology of MaxCompute: Mars

Project Name: Mars

The name is derived from Matrix and Array.

Why Was Mars Developed?

  • It was designed for large-scale scientific computing. The programming interfaces of big data engines are not very friendly to scientific computing, and their framework designs are not suitable for scientific computing models.
  • Traditional scientific computing is based on a single machine, while large-scale scientific computing requires supercomputing-level capability.

Tips: Scientific computing is about computers processing data. Data processing has evolved from Excel → databases (such as MySQL) → Hadoop, Spark, and MaxCompute. The data volume has changed dramatically, while the computing model has not: it still revolves around two-dimensional tables with projection, sharding, aggregation, filtering, and sorting, grounded in relational algebra and set theory. The infrastructure of scientific computing, however, is not a two-dimensional table. For example, although a picture has two dimensions, each pixel in it is not a single number but an RGB value plus an α (transparency) channel.

  • The traditional SQL model has insufficient processing capability for these workloads. Existing databases are inefficient at linear algebra operations such as matrix multiplication.
  • Status Quo: R and NumPy are designed for stand-alone machines. Dask in the Python ecosystem serves as a bridge between big data and scientific computing.

Cases

  • Customer A: TB-level matrices with tens of billions of entries stored in MaxCompute need to be multiplied. The existing MapReduce mode performs poorly on this kind of task, while Mars handles it efficiently. Currently, Mars is the only large-scale scientific computing engine.

A New Way to Accelerate Data Science

Methods based on Dask or MaxCompute Mars combine scale-up and scale-out. The lower left of the figure below represents doing data science by running Python libraries on a stand-alone basis. Scale-up is the idea behind large-scale supercomputing: increasing the hardware capability of a single machine, for example, with multiple cores. Every computer or server now has more than one core, as well as GPUs, TPUs, NPUs, and other hardware for deep learning, and Python running on such hardware gains acceleration. The technology here includes Modin, which provides multi-core acceleration for pandas.

The lower right part shows frameworks for distributed Python, the scale-out direction. For example, Ray is a distributed framework heavily used at Ant Group; Mars can run on Ray, which in essence acts as a scheduling service for the Python ecosystem, analogous to a Kubernetes service. Dask and Mars are also frameworks for distributed Python.

The best mode is to combine scale-up and scale-out: computation can be distributed across nodes while the hardware capabilities of each single node are fully utilized. Mars can be deployed both on large-scale clusters and on a single machine with GPUs. The following figure shows these ways to accelerate data science:

[Figure: Combining scale-up and scale-out to accelerate data science]

The Design Logic of Distributed Python

In essence, the design idea of Mars is to distribute the Python data science stack: it can split DataFrame, NumPy, and scikit-learn workloads into parallel subtasks.

[Figure: Mars distributes the Python data science stack]

The idea is to split a large-scale task into small tasks for distributed computing, and the framework handles the splitting. First, the client submits a task. Then, the Mars framework splits it into subtasks and builds a DAG (directed acyclic graph). Finally, the computed results are collected.

[Figure: Mars splits a task into a DAG of subtasks and collects the results]
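As a minimal sketch of this model, the following Mars tensor code (a hypothetical workload, not from the original demo) builds a chunked matrix computation; nothing runs until execute() is called, at which point Mars turns the graph into a DAG of per-chunk subtasks:

import mars.tensor as mt

# a 10000 x 10000 random matrix, stored as 1000 x 1000 chunks
a = mt.random.rand(10000, 10000, chunk_size=1000)

# lazily build the computation graph for a.dot(a.T), then execute it;
# Mars schedules the per-chunk subtasks and collects the result
result = a.dot(a.T).sum().execute()
print(result)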

Mars Scenario 1: Hybrid CPU and GPU Computing

  1. In the security and finance fields, data mining cycles and wait times on traditional big data platforms are long, and resources are insufficient.
  2. Mars DataFrame accelerates data processing: large-scale sorting, statistics, and aggregation analysis.
  3. Mars Learn accelerates unsupervised learning, and Mars can launch distributed deep learning computations.
  4. GPUs are used to accelerate specific computations.

Mars Scenario 2: Explanatory Computing

  1. Interpretation algorithms for advertising attribution and feature insight are time-consuming and involve a huge amount of computation.
  2. Mars Remote distributes the computation across dozens of servers, improving performance a hundredfold (see the sketch after this list).
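A minimal sketch of the Mars Remote pattern follows; the function names and workload are hypothetical, but the spawn-based API is how Mars runs arbitrary Python functions as distributed tasks:

import mars.remote as mr

def attribution_score(slice_id):
    # placeholder for a costly per-slice interpretation computation
    return slice_id * slice_id

def merge(scores):
    return sum(scores)

# spawn one remote task per data slice; Mars schedules them across workers
scores = [mr.spawn(attribution_score, args=(i,)) for i in range(100)]

# spawned results can be passed to another spawned function for merging
total = mr.spawn(merge, args=(scores,)).execute().fetch()
print(total)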

Mars Scenario 3: Large-Scale K-Nearest Neighbors Algorithm

  1. The popularity of embedding makes it very common to express entities as vectors.
  2. The NearestNeighbors algorithm of Mars is compatible with scikit-learn. In a top-10 similarity computation over three million vectors (nine trillion vector comparisons), the brute-force algorithm takes two hours with 20 workers, and big data platforms cannot complete the computation at all with SQL + UDF. On smaller scales, Mars improves performance a hundredfold compared with big data platforms (see the sketch after this list).
  3. Mars supports distributed acceleration of Faiss and Proxima (the vector retrieval library of Alibaba DAMO Academy) to handle tens of millions to billions of vectors.
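The following is a minimal sketch of the scikit-learn-compatible interface of Mars Learn, using randomly generated stand-in data at a reduced scale rather than the three-million-vector workload described above:

import mars.tensor as mt
from mars.learn.neighbors import NearestNeighbors

# hypothetical stand-in data: 100,000 vectors of dimension 64, split into chunks
data = mt.random.rand(100000, 64, chunk_size=20000)

# the interface mirrors sklearn.neighbors.NearestNeighbors
nn = NearestNeighbors(n_neighbors=10)
nn.fit(data)

# top-10 nearest neighbors (distances and indices) for every vector
distances, indices = nn.kneighbors(data)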

3. Best Practices

Mars integrates many third-party Python packages, including mainstream machine learning and deep learning libraries. The demo below uses Mars with the LightGBM binary classification algorithm in an intelligent recommendation scenario: judging whether to send discounts to certain users.

The main step in the first figure is to connect to MaxCompute with the AccessKey, project name, and endpoint information. Next, create a 4-node cluster with an 8-core CPU and 32 GB of memory per node, install the required extension packages, and generate training data: 64-dimensional description information for 1 million users.

[Figure: Connecting to MaxCompute, creating the Mars cluster, and generating training data]
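Below is a minimal sketch of this step, assuming placeholder credentials; the exact data-generation code in the original demo may differ:

from odps import ODPS
import mars.tensor as mt

# placeholder credentials; replace with your own AccessKey, project, and endpoint
o = ODPS('<access_id>', '<secret_access_key>', '<project>', endpoint='<endpoint>')

# a 4-node Mars cluster: 8 CPU cores and 32 GB of memory per node
client = o.create_mars_cluster(worker_num=4, worker_cpu=8, worker_mem=32)

# training data: 1,000,000 users, each described by a 64-dimensional vector,
# with a synthetic binary label (send a discount or not)
X = mt.random.rand(1000000, 64, chunk_size=250000)
y = (mt.random.rand(1000000, chunk_size=250000) > 0.5).astype('int64')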

Model training using the LightGBM binary classification algorithm:

[Figure: Training the LightGBM classifier]
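A minimal sketch using the LightGBM integration in Mars Learn, assuming the X and y generated above; the hyperparameters are illustrative:

from mars.learn.contrib.lightgbm import LGBMClassifier

# the Mars wrapper mirrors lightgbm.LGBMClassifier but trains on Mars tensors
clf = LGBMClassifier(n_estimators=10)
clf.fit(X, y)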

Send the trained model to MaxCompute as a resource object through create_resource and prepare the test set data.

[Figure: Uploading the model as a resource and preparing the test set]
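One way to sketch this step is to pickle the model and upload it with create_resource; the resource name and the test data here are placeholders:

import pickle

# serialize the trained model and upload it as a MaxCompute file resource
o.create_resource('lgbm_model.pkl', 'file', file_obj=pickle.dumps(clf))

# hypothetical test set in the same 64-dimensional format
X_test = mt.random.rand(10000, 64, chunk_size=5000)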

Use the test set data to validate the model and obtain the classification result:

[Figure: Running prediction on the test set]
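A minimal sketch of the prediction step, assuming the clf and X_test above; releasing the cluster at the end follows the MaxCompute Mars client API:

# predict returns a Mars tensor; execute() triggers the distributed run
pred = clf.predict(X_test)
print(pred.execute())

# release the Mars cluster when the job is finished
client.stop_server()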
