In-depth Review of Apache Spark: Spark + AI Summit 2020

Matei Zaharia, founder of the Spark project, gave an in-depth review of Spark at the Spark + AI Summit 2020 in conjunction with its 10-year anniversary.

By Zheng Kai (Tiejie), Senior Technical Expert at Alibaba
At the Spark + AI Summit 2020, Matei Zaharia gave a wonderful, in-depth keynote speech on the development of Spark over the past decade.

After watching the speech, I want to present my personal views on Apache Spark.

The following figure shows a slide from a video of the speech that I watched on YouTube. The content of this slide can be summarized as follows:

Point 1: SQL is key to the Spark APIs, which are available in multiple languages. Spark SQL and Spark Core function in much the same way. It is still important to continuously optimize SQL. This is not a big challenge at all.

Point 2: Spark SQL delivers new performance gains, far surpassing the performance of Presto. In benchmark tests over the past few years, many competitors outperformed Spark. However, Apache Spark 3.0 has injected new confidence into the Spark ecosystem.

Point 3: Alibaba Cloud outperforms other cloud service providers in developing Spark technologies. Alibaba Cloud has topped the TPC-DS benchmark chart for the second time this year. I appreciate Matei Zaharia's acknowledgment of our achievements.

While vigorously developing Flink, Alibaba also invests in Spark technologies to extend its Spark user base and foster the Spark ecosystem. This demonstrates that Alibaba is an enthusiastic participant in the Spark ecosystem. Our persistent efforts to top the TPC-DS benchmark chart are backed by steady investment in technology. We have been increasing our investments in SQL optimizers, native code generation and execution, and caching based on object storage. For more information, see the article "Native Codegen Framework Behind Performance Optimization of EMR Spark SQL."
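To make the idea of code generation more concrete, here is a toy sketch. It is not the EMR Native Codegen Framework and not Spark's own code generator. It generates Python source rather than native code, and every name in it (`Col`, `Lit`, `BinOp`, `interpret`, `to_source`, `compile_expr`) is hypothetical. It only illustrates the general principle: instead of walking an expression tree for every row, the engine turns the tree into code once and reuses the compiled function for all rows.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

Row = Dict[str, float]


@dataclass(frozen=True)
class Col:
    """Reference to a column by name (hypothetical toy type)."""
    name: str


@dataclass(frozen=True)
class Lit:
    """A literal constant."""
    value: float


@dataclass(frozen=True)
class BinOp:
    """A binary operation; op is one of the keys of _OPS."""
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[Col, Lit, BinOp]

_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    ">": lambda a, b: a > b,
}


def interpret(expr: Expr, row: Row):
    """Interpreted path: walk the expression tree again for every row."""
    if isinstance(expr, Col):
        return row[expr.name]
    if isinstance(expr, Lit):
        return expr.value
    return _OPS[expr.op](interpret(expr.left, row), interpret(expr.right, row))


def to_source(expr: Expr) -> str:
    """Emit a Python source fragment equivalent to the expression tree."""
    if isinstance(expr, Col):
        return f"row[{expr.name!r}]"
    if isinstance(expr, Lit):
        return repr(expr.value)
    if expr.op not in _OPS:
        raise ValueError(f"unsupported operator: {expr.op}")
    return f"({to_source(expr.left)} {expr.op} {to_source(expr.right)})"


def compile_expr(expr: Expr) -> Callable[[Row], object]:
    """Codegen path: generate source once, compile once, reuse for all rows."""
    source = f"lambda row: {to_source(expr)}"
    return eval(compile(source, "<generated>", "eval"))


# Toy predicate: price * quantity > 100
predicate = BinOp(">", BinOp("*", Col("price"), Col("quantity")), Lit(100.0))
rows = [
    {"price": 10.0, "quantity": 5.0},
    {"price": 30.0, "quantity": 4.0},
    {"price": 2.5, "quantity": 50.0},
]

compiled = compile_expr(predicate)
interpreted_result = [interpret(predicate, r) for r in rows]
compiled_result = [compiled(r) for r in rows]
print(to_source(predicate))
print(compiled_result)
```

Both paths return the same answers. The difference is that the compiled function avoids the per-row tree walk and dispatch. A native code generator pursues the same goal one level lower, by emitting machine-level code instead of source for an interpreter.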

The Photon project unveiled at the Spark + AI Summit 2020 works in much the same way as the Native Codegen Framework. The goals and processes of performance optimization do not vary much among developers: we proceed to the native layer after we complete all architectural optimizations. What's more, it is important for us to maximize server capabilities when we work on the cloud. I will introduce the Photon engine in detail in a future article.

Point 4: Many cloud service providers are just now finding it easier and more efficient to work with Databricks. Compared with the second time Alibaba Cloud set a world record in TPC-DS, Alibaba Cloud Elastic MapReduce (EMR)'s first-place finish in the TPC-DS benchmark chart by using Spark represents an even greater achievement. However, it is worth pondering that Matei Zaharia recognized Alibaba's second victory in TPC-DS at the Spark + AI Summit 2020. Matei Zaharia is the founder of the Spark project and chief technologist at Databricks.

In short, the development of Spark is inseparable from the efforts of cloud service providers, and Databricks will continue to embrace and develop cloud platforms. This helps promote the mutually beneficial cooperation between open-source communities and cloud service providers.

I believe we can learn a lot from the Spark + AI Summit 2020, held on the 10-year anniversary of the Spark project. I hope the enthusiasm around Spark continues, and I need to keep learning so that I can keep up with the pace of Spark development.

About the Author

Zheng Kai (Tiejie) is a senior technical expert at Alibaba, an Apache Hadoop Project Management Committee (PMC) member, and the founder of Apache Kerby. He has been deeply engaged in the development of distributed systems and open-source big data systems for many years. Currently, he is working on Hadoop and Spark big data platforms at Alibaba Cloud, aiming to improve the ease of use and elasticity of these platforms.
