Alibaba Releases Mars, Its First Self-Developed Scientific Computing Engine

Abstract: After more than a year of research and development, the MaxCompute R&D team of Alibaba's unified big data computing platform broke through the boundary between big data and scientific computing and completed and open-sourced the first version of Mars, a unified tensor-based distributed computing framework. Using Mars for scientific computing not only shrinks a large-scale scientific computing task from the thousands of lines of code it takes in MapReduce to a few lines of Mars code, but also greatly improves performance.

A few days ago, Alibaba officially published the source code of its distributed scientific computing engine Mars. Developers can download and install it from PyPI, or obtain the source code on GitHub and participate in development.

Alibaba had announced this open source plan as early as September 2018 at the Hangzhou Yunqi Conference. Mars breaks away from the relational-algebra-based computing model of existing big data engines, brings distributed technology to scientific/numerical computing, and greatly expands the scale and efficiency of scientific computing. It has already been applied in business and production scenarios of Alibaba and its cloud customers. This article introduces Mars' design motivation and technical architecture in detail.

Overview

Scientific computing, or numerical computing, refers to using computers to solve the mathematical computing problems that arise in scientific research and engineering. It is applied in many fields such as image processing, machine learning, and deep learning. Many languages and libraries provide tools for scientific computing; among them, NumPy, with its concise, easy-to-use syntax and strong performance, has become the leader, and a huge technology stack has grown on top of it (shown below).


NumPy's core concept, the multidimensional array, is the foundation of all these upper-level tools. A multidimensional array is also called a tensor, and compared with a two-dimensional table or matrix, a tensor has greater expressive power. Accordingly, the popular deep learning frameworks are also built on tensor data structures.


With the boom in machine learning and deep learning, the concept of the tensor has become increasingly familiar, and the demand for large-scale general-purpose computation on tensors keeps growing. The reality, however, is that excellent scientific computing libraries such as NumPy remain stuck in the single-machine era and cannot break through the scale bottleneck, while today's popular distributed computing engines were not born for scientific computing: the mismatch of their upper-level interfaces makes scientific computing tasks hard to express in traditional SQL/MapReduce, and their execution engines are not optimized for scientific computing, so efficiency is unsatisfactory.

Given this state of scientific computing, the MaxCompute R&D team of Alibaba's unified big data computing platform, after more than a year of research and development, broke through the boundary between big data and scientific computing and completed and open-sourced the first version of Mars, a unified tensor-based distributed computing framework. Using Mars for scientific computing not only shrinks large-scale scientific computing tasks from thousands of lines of MapReduce code to a few lines of Mars code, but also greatly improves performance. At present, Mars implements the tensor module, i.e. a distributed NumPy, covering about 70% of the common NumPy interfaces. The upcoming Mars 0.2 release is distributing pandas as well, providing a fully pandas-compatible interface to build out the whole ecosystem.

As a new generation of ultra-large-scale scientific computing engine, Mars not only brings scientific computing into the distributed era, but also makes efficient scientific computing on big data possible.

Mars' core capabilities

User-friendly interface
Mars provides a NumPy-compatible interface through its tensor module. Users can port existing NumPy-based code to Mars simply by replacing the import; the same code logic then handles data at a scale tens of thousands of times larger, with processing capacity improved by dozens of times. Currently, Mars implements about 70% of the common NumPy interfaces.
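As a minimal sketch of such a port (the shapes below are illustrative), only the import and the final execute() call change:

import numpy as np
a = np.random.rand(1000, 2000)        # single-machine NumPy
print((a + 1).sum(axis=1))

import mars.tensor as mt
a = mt.random.rand(1000, 2000)        # the same logic on Mars
print((a + 1).sum(axis=1).execute())  # execute() triggers the actual computation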

Take advantage of GPU acceleration
In addition, Mars extends NumPy to take full advantage of the GPU's established strengths in scientific computing. Specifying gpu=True when creating a tensor makes subsequent computations run on the GPU. For example:
a = mt.random.rand(1000, 2000, gpu=True) # specify to create on GPU
(a + 1).sum(axis=1).execute()
Sparse matrices
Mars also supports two-dimensional sparse matrices, which are created by specifying sparse=True. Take the eye interface as an example: it creates an identity matrix, which has values only on the diagonal and zeros everywhere else, so it can be stored sparsely.
a = mt.eye(1000, sparse=True) # Specify to create a sparse matrix
(a + 1).sum(axis=1).execute()
System design

Next, I will introduce the system design of Mars, to show how Mars automatically parallelizes scientific computing tasks and where its strong performance comes from.

Divide and conquer - tile
Mars takes a divide-and-conquer approach to scientific computing tasks. Given a tensor, Mars automatically splits it into small chunks along each dimension and processes the chunks separately. All operators implemented in Mars support this automatic splitting for parallelism; the splitting process is called tile in Mars.
For example, given a 1000×2000 tensor, if the chunk size along each dimension is 500, the tensor is tiled into 2×4 = 8 chunks. Subsequent operators, such as addition (ADD) and summation (SUM), are tiled automatically as well. The tile process of one tensor operation is shown in the following figure.


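The tiling above can be reproduced directly in code. A minimal sketch, assuming the chunk_size keyword argument that Mars uses to control chunk extents:

import mars.tensor as mt

a = mt.random.rand(1000, 2000, chunk_size=500)  # tiled into 2 x 4 = 8 chunks
b = (a + 1).sum(axis=1)                         # ADD and SUM are tiled over the same chunks
b.execute()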
Lazy execution and fusion optimization
At present, code written in Mars needs an explicit call to execute to be triggered; this is based on Mars' deferred execution mechanism. While the user writes the intermediate code, no actual computation takes place. The advantage is that more optimizations can be applied to the intermediate steps, so the whole task executes better. The main optimization Mars currently applies is fusion, i.e. merging multiple operations into one execution.
For the graph in the previous example, once tiling is complete, Mars performs fusion optimization on the fine-grained chunk-level graph: each of the 8 RAND+ADD+SUM chains can be merged into a single node. On the one hand, the merged node can use the numexpr library to generate accelerated code; on the other hand, reducing the number of actually running nodes effectively reduces the overhead of scheduling the execution graph.
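To give a feel for what fusion gains, here is a standalone NumPy/numexpr sketch (not Mars internals): numexpr evaluates the whole expression in one pass over the data instead of materializing the intermediate (a + 1) array.

import numpy as np
import numexpr as ne

a = np.random.rand(1000, 2000)
fused = ne.evaluate("sum(a + 1, axis=1)")  # add and reduce fused into one pass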

Multiple scheduling methods
Mars supports multiple scheduling methods:
| Multi-threaded mode: Mars can use multiple threads to schedule and execute the chunk-level graph locally. For NumPy, most operators execute in a single thread. Even with this scheduling method alone, Mars gains the ability to tile the execution graph on a single machine, breaking through NumPy's single-machine memory limit, making full use of all CPU/GPU resources of the machine, and running several times faster than NumPy.

| Single-machine cluster mode: Mars can start the entire distributed runtime on a single machine and use multiple processes to speed up task execution; this mode is suitable for development and debugging against a simulated distributed environment (see the sketch after this list).

| Distributed: Mars can start one or more schedulers and multiple workers; the schedulers dispatch chunk-level operators to the workers for execution.
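A minimal sketch of the single-machine cluster mode; the new_cluster helper and the cluster.session attribute follow the early Mars 0.x deploy API and are assumptions that may differ in later versions:

import mars.tensor as mt
from mars.deploy.local import new_cluster  # assumption: Mars 0.x local-deploy API

cluster = new_cluster()                    # starts schedulers and workers as local processes
a = mt.random.rand(2000, 2000, chunk_size=500)
a.sum().execute(session=cluster.session)   # run the task against the local cluster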

The following figure shows the distributed execution architecture of Mars:


When Mars executes in distributed mode, it starts multiple schedulers and multiple workers; the figure shows 3 schedulers and 5 workers. The schedulers form a consistent hash ring. The user creates a session, explicitly or implicitly, on the client side, and a SessionActor is allocated on one of the schedulers according to the consistent hash. The user then submits the computation of a tensor through execute, and a GraphActor is created to manage that tensor's execution. Inside the GraphActor, the tensor is tiled into a chunk-level graph. Assuming there are 3 chunks, 3 OperandActors are created on the corresponding schedulers. These OperandActors are submitted to the workers for execution once their dependencies are satisfied and the cluster has sufficient resources. When all OperandActors have finished, the GraphActor is notified that the task is complete, and the client can then pull the data to display or plot.
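From the client's point of view, the flow above reduces to a few lines. A minimal sketch, assuming a running cluster with a web endpoint and the new_session helper from the early Mars client API:

import mars.tensor as mt
from mars.session import new_session  # assumption: early Mars client API

sess = new_session('http://<scheduler_web_ip>:<web_port>')  # allocates a SessionActor
a = mt.random.rand(3000, 3000, chunk_size=1000)
result = sess.run((a + 1).sum(axis=1))  # tiled, scheduled to workers, result pulled back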

Scale in and out
Mars' flexible tiled execution graph and multiple scheduling modes allow the same Mars code to scale in and out freely. Scaled in to a single machine, it uses multiple cores to run scientific computing tasks in parallel; scaled out to a distributed cluster, it can employ thousands of workers to complete tasks that a single machine could never finish.
Benchmark

In a real scenario, we encountered the need for giant matrix multiplication: multiplying two matrices of about 2.25 TB each, 100 billion elements apiece. With five lines of code, Mars used 1,600 CUs (200 workers, each with 8 cores and 32 GB of memory) and completed the computation in two and a half hours. Previously, the same computation could only be simulated with more than a thousand lines of MapReduce code, and it took 9,000 CUs and 10 hours to finish.
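A hypothetical sketch of what such a five-line program can look like; the shapes and chunk size here are illustrative, not the production values:

import mars.tensor as mt

a = mt.random.rand(316228, 316228, chunk_size=10000)  # ~1e11 elements each
b = mt.random.rand(316228, 316228, chunk_size=10000)
a.dot(b).execute()                                    # distributed matrix multiplication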

Let's look at two more comparisons. The figure below shows adding one to each element of a 3.6-billion-element matrix and then multiplying by two. The red crosses are NumPy's computation time, the green solid line is Mars' computation time, and the blue dotted line is the theoretical computation time. A single-machine Mars is already several times faster than NumPy, and as workers are added the speedup is almost linear.


The figure below scales the computation up further, to 14.4 billion elements: add one to each element, multiply by two, and then sum. At this point the input data is 115 GB; single-machine NumPy can no longer complete the operation, while Mars still can, and as machines are added it achieves a good speedup ratio.


Open source address

Mars has been open-sourced on GitHub:
https://github.com/mars-project/mars , and all future development will take place in the open on GitHub, following standard open source practice. You are welcome to use Mars and become a contributor.

Mars Scientific Computing Engine Product Launch

Launch event live replay >>
https://yq.aliyun.com/live/800

Release event page >>
https://promotion.aliyun.com/ntms/act/cloud/marsfbh.html

Big data computing service MaxCompute official website >>
https://www.aliyun.com/product/odps

MaxCompute trial application page >>
https://i.aliyun.com/inviteapplyagent_id=183

Roundtable discussion >>
https://yq.aliyun.com/roundtable/493400

Author: Jin Heng
