Apache Spark 3.0.0: A Comprehensive Analysis of Important Features

2020-06-24

Introduction: On June 18, 2020, Apache Spark™ 3.0.0 was officially released after nearly two years of development (since October 2018). The release contains more than 3,400 patches contributed by the open source community, makes significant progress in Python and SQL functionality, and focuses on ease of use in development and production. This year also marks the 10th anniversary of Spark as an open source project, and the release reflects how Spark has continued to meet the needs of a wider audience and more application scenarios since it was open sourced.

On June 18, 2020, Apache Spark 3.0.0 was officially released after nearly two years of development (since October 2018)!

Apache Spark 3.0.0 contains more than 3,400 patches, the result of great contributions from the open source community. The release makes significant progress in Python and SQL functionality and focuses on ease of use in development and production. This year also marks the 10th anniversary of Spark as an open source project, and these efforts reflect how Spark has continued to meet the needs of a wider audience and more application scenarios since it was open sourced.

First, let's take a look at the main new features of Apache Spark 3.0.0:

  • 1. In TPC-DS benchmarks, with adaptive query execution, dynamic partition pruning, and other optimizations enabled, performance is roughly 2x that of Spark 2.4.
  • 2. ANSI SQL compliance
  • 3. Significant improvements to the pandas APIs, including Python type hints and additional pandas UDFs
  • 4. Simplified PySpark exceptions and more Pythonic error handling
  • 5. A new UI for Structured Streaming
  • 6. Up to 40x speedup for calling R user-defined functions
  • 7. More than 3,400 Jira issues resolved across Spark's core components

In addition, when upgrading to Spark 3.0, most existing application code does not need to change.

Spark SQL is the engine that powers most Spark applications. For example, at Databricks, more than 90% of Spark API calls use the DataFrame, Dataset, and SQL APIs, as well as other libraries optimized by the SQL optimizer. This means that even Python and Scala developers route most of their work through the Spark SQL engine.
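For illustration, here is a minimal sketch (the event data and column names are made up for this example) showing that the same computation expressed through the DataFrame API and through a SQL query goes through the same Spark SQL engine and optimizer:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-sql-engine-demo").getOrCreate()

# A small, made-up dataset used only for illustration.
df = spark.createDataFrame(
    [("2020-06-18", "download", 3), ("2020-06-18", "install", 1)],
    ["day", "event", "cnt"],
)
df.createOrReplaceTempView("events")

# DataFrame API: optimized by the Catalyst optimizer...
by_event_df = df.groupBy("event").agg(F.sum("cnt").alias("total"))

# ...and the equivalent SQL query is handled by the same engine,
# so both produce the same optimized physical plan.
by_event_sql = spark.sql("SELECT event, SUM(cnt) AS total FROM events GROUP BY event")

by_event_df.show()
by_event_sql.show()
```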

Measured end to end in this benchmark, Spark 3.0 runs approximately twice as fast as Spark 2.4.

Next, we will introduce the new features of the Spark SQL engine.

Adaptive query execution

Adaptive Query Execution (AQE) optimizes query plans at runtime. Even when the initially generated plan is not ideal because data statistics are missing or inaccurate and cost estimates are wrong, AQE allows the Spark planner to choose among alternative execution plans at runtime. These plans are re-optimized based on runtime statistics to improve performance.

Because storage and compute are decoupled in Spark, the characteristics of incoming data cannot be predicted in advance. For these reasons, runtime adaptivity is particularly important for Spark. AQE currently provides three main adaptive optimizations (a configuration sketch follows the list):

  • 1. Dynamically merge shuffle partitions

    You can simplify or even avoid tuning the number of shuffle partitions: set a relatively large number of shuffle partitions at the beginning, and AQE merges adjacent small partitions into larger ones at runtime.

  • 2. Dynamically adjust the join policy

    To a certain extent, this avoids executing sub-optimal plans caused by missing statistics or incorrect size estimates (both can of course occur at the same time). This adaptive optimization can convert a sort-merge join into a broadcast hash join at runtime to further improve performance.

  • 3. Dynamically optimize skew joins

    Skewed joins can lead to extreme load imbalance and severely degrade performance. When AQE detects skew from the shuffle file statistics, it splits the skewed partitions into smaller ones and joins them with the corresponding partitions on the other side. This optimization parallelizes the processing of skewed data and achieves better overall performance.
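The optimizations above are controlled by configuration. Below is a minimal sketch of enabling AQE in Spark 3.0.0; note that AQE is disabled by default in this release, and the values shown (such as the initial partition number) are illustrative:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-demo")
    # Master switch for adaptive query execution (disabled by default in 3.0.0).
    .config("spark.sql.adaptive.enabled", "true")
    # 1. Dynamically coalesce shuffle partitions: start large, merge small ones at runtime.
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1000")
    # 3. Dynamically optimize skewed joins by splitting skewed partitions.
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)
# 2. Dynamic join-strategy switching is part of AQE itself and needs no extra flag;
#    it is governed by the broadcast threshold, spark.sql.autoBroadcastJoinThreshold.
```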

In the 3 TB TPC-DS benchmark, Spark with AQE achieved more than 1.5x speedups on two queries compared with Spark without AQE, and more than 1.1x speedups on another 37 queries.

Dynamic partition pruning

When the optimizer cannot identify the partitions it can skip at compile time, it can use dynamic partition pruning, that is, further partition pruning based on information inferred at runtime. This is common in star schemas, where one or more fact tables reference any number of dimension tables. In such join operations, the partitions read from the fact table can be pruned by identifying the partitions that survive the filters on the dimension tables. In a TPC-DS benchmark, 60 out of 102 queries achieved speedups of 2x to 18x.

For more information about dynamic partition pruning, see https://databricks.com/session_eu19/dynamic-partition-pruning-in-apache-spark#:~:text=Dynamic%20partition%20pruning%20occurs%20when,any%20number%20of%20dimension%20tables.
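As a rough sketch of the star-schema pattern that benefits from dynamic partition pruning, here is a toy example; the `sales` and `dates` tables are made up for illustration, and the configuration key shown is documented in Spark 3.0 (the feature is enabled by default):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dpp-demo").getOrCreate()

# Dynamic partition pruning is enabled by default in Spark 3.0.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# Hypothetical star schema: a fact table partitioned by date_id and a small dimension table.
spark.createDataFrame(
    [(20200618, 1, 10.0), (20200619, 2, 20.0)], ["date_id", "item_id", "amount"]
).write.mode("overwrite").partitionBy("date_id").saveAsTable("sales")

spark.createDataFrame(
    [(20200618, 2020), (20200619, 2019)], ["date_id", "year"]
).write.mode("overwrite").saveAsTable("dates")

# The filter on the dimension table is evaluated at runtime, and its result is
# used to prune the partitions read from the fact table.
spark.sql("""
    SELECT f.item_id, SUM(f.amount) AS total
    FROM sales f
    JOIN dates d ON f.date_id = d.date_id
    WHERE d.year = 2020
    GROUP BY f.item_id
""").explain()
```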

ANSI SQL compatibility

ANSI SQL compatibility is critical for migrating workloads from other SQL engines to Spark SQL.

To improve compatibility, this release switches to the Proleptic Gregorian calendar and lets users forbid the use of ANSI SQL reserved keywords as identifiers. In addition, runtime overflow checking is introduced for numeric operations, and compile-time type checking is enforced when inserting data into tables with a predefined schema. These new validation mechanisms improve data quality.

For more information about ANSI compatibility, see https://spark.apache.org/docs/3.0.0/sql-ref-ansi-compliance.html.
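A minimal sketch of the new checks, using the `spark.sql.ansi.enabled` and `spark.sql.storeAssignmentPolicy` settings described on the page above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ansi-demo").getOrCreate()

# Runtime overflow checking for numeric operations.
spark.conf.set("spark.sql.ansi.enabled", "true")
try:
    spark.sql("SELECT 2147483647 + 1").show()
except Exception as e:
    # With ANSI mode on, integer overflow raises an error instead of
    # silently wrapping around as in the legacy behavior.
    print("overflow detected:", e)

# Stricter type checking when inserting into tables with a predefined schema.
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
```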

Join hints

Although the community keeps improving the optimizer, it cannot guarantee an optimal decision in every scenario: the choice of join algorithm is based on statistics and heuristics. When the optimizer cannot make the best choice, users can use join hints to influence it so that it picks a better plan.

Apache Spark 3.0 extends existing join hints by adding new hints, including SHUFFLE_MERGE, SHUFFLE_HASH, and SHUFFLE_REPLICATE_NL.
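Hints can be supplied either in SQL comments or through the DataFrame `hint` method. A minimal sketch with two made-up tables `t1` and `t2`:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-hints-demo").getOrCreate()

t1 = spark.createDataFrame([(1, "a"), (2, "b")], ["key", "v1"])
t2 = spark.createDataFrame([(1, "x"), (2, "y")], ["key", "v2"])
t1.createOrReplaceTempView("t1")
t2.createOrReplaceTempView("t2")

# SQL syntax: hint the join strategy for the relation t2.
spark.sql(
    "SELECT /*+ SHUFFLE_HASH(t2) */ * FROM t1 JOIN t2 ON t1.key = t2.key"
).explain()

# DataFrame syntax: the same hint applied to one side of the join.
t1.join(t2.hint("shuffle_hash"), "key").explain()

# The other hints added in 3.0 are SHUFFLE_MERGE and SHUFFLE_REPLICATE_NL;
# BROADCAST was already supported in earlier releases.
```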

Python is now a widely used programming language on Spark and was therefore a focus of Spark 3.0. At Databricks, 68% of notebook commands are written in Python, and monthly downloads of PySpark on the Python Package Index (PyPI) exceed 5 million.

Many Python developers use the pandas API for data structures and data analysis, but pandas is limited to single-node processing. Databricks continues to develop Koalas, an implementation of the pandas API on top of Apache Spark, which enables data scientists to work with big data more efficiently in a distributed environment.

With Koalas, data scientists do not need to build many functions themselves (for example, plotting support) in PySpark to get efficient performance across the whole cluster.

After more than a year of development, the Koalas API covers nearly 80% of the pandas API. Monthly PyPI downloads of Koalas have rapidly grown to 850,000, and Koalas is evolving on a biweekly release cadence. Although Koalas may be the simplest way to migrate single-node pandas code, many people still use the PySpark APIs directly, which are also growing in popularity.
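A minimal sketch of the Koalas API (assuming the `koalas` package is installed, for example via `pip install koalas`; the data is made up):

```python
import pandas as pd
import databricks.koalas as ks  # pip install koalas

# Existing single-node pandas code...
pdf = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})

# ...can be moved onto a Spark cluster with the same pandas-style API.
kdf = ks.from_pandas(pdf)
print(kdf.groupby("id").sum())   # runs as distributed Spark jobs
print(kdf.describe())            # familiar pandas-style summary
```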

Spark 3.0 provides multiple enhancements for PySpark API:

  • 1. New pandas APIs with Python type hints

pandas UDFs were originally introduced in Spark 2.3 for scaling user-defined functions in PySpark and integrating pandas APIs into PySpark applications. However, as more UDF types were added, the existing interface became hard to understand. This release introduces a new pandas UDF interface that uses Python type hints to address the proliferation of pandas UDF types. The new interface is more Pythonic and self-descriptive (a minimal sketch follows this list).

  • 2. New pandas UDF types and pandas function APIs

     This release adds two new pandas UDF types: iterator of Series to iterator of Series, and iterator of multiple Series to iterator of Series. These are very useful for data prefetching and expensive initialization.
    
     In addition, this release adds two new pandas function APIs, map and co-grouped map. For more details, see: https://databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html.
    
  • 3. Better error handling

     PySpark error handling was not friendly to Python users. This release simplifies PySpark exceptions, hides unnecessary JVM stack traces, and makes errors more Pythonic.
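As referenced in items 1 and 2 above, here is a minimal sketch of the new type-hint-based pandas UDFs; the function names and data are made up for illustration:

```python
from typing import Iterator
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()
df = spark.range(5)

# New in 3.0: the UDF type is inferred from the Python type hints
# (pd.Series -> pd.Series denotes a scalar pandas UDF).
@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

# Iterator of Series -> Iterator of Series: useful when an expensive
# initialization step should run once per partition rather than per batch.
@pandas_udf("long")
def plus_ten(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    state = 10  # placeholder for expensive setup (e.g., loading a model)
    for s in batches:
        yield s + state

df.select(plus_one("id"), plus_ten("id")).show()
```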
    

Improving Python support and usability in Spark remains one of our highest priorities.

Spark 3.0 completes the key components of Project Hydrogen and introduces new features to improve streaming and extensibility.

  • 1. Accelerator-aware scheduling

     Project Hydrogen aims to better unify deep learning and data processing on Spark. GPUs and other accelerators are already widely used to accelerate deep learning workloads. To let Spark take advantage of hardware accelerators on the target platform, this release enhances the existing scheduler so that the cluster manager becomes accelerator-aware.
    
     Users can specify accelerators through configuration (see https://spark.apache.org/docs/3.0.0/configuration.html#custom-resource-scheduling-and-configuration-overview for the detailed options). They can then call the new RDD APIs to make use of these accelerators, as in the sketch below.
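A minimal sketch of accelerator-aware scheduling; the resource amounts and the discovery script path are illustrative placeholders, and the example assumes a cluster that actually provides GPUs:

```python
from pyspark import TaskContext
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gpu-scheduling-demo")
    # Illustrative values: each executor exposes 1 GPU, each task requests 1 GPU.
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "1")
    # Script that reports the GPU addresses available on each executor (path is hypothetical).
    .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")
    .getOrCreate()
)

def train_partition(rows):
    # New TaskContext API: look up the accelerator addresses assigned to this task.
    gpus = TaskContext.get().resources()["gpu"].addresses
    # ... run the GPU workload on `gpus` here ...
    yield gpus

# On a GPU cluster, trigger the work with:
# spark.sparkContext.range(0, 2).mapPartitions(train_partition).collect()
```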
    
  • 2. New UI for Structured Streaming

Structured Streaming was originally introduced in Spark 2.0. At Databricks, its usage has grown 4x, and it now processes more than 5 trillion records per day.

Apache Spark 3.0 adds a new Spark UI for inspecting streaming jobs. The new UI provides two sets of statistics:

(1) aggregate information about completed streaming query jobs;

(2) detailed statistics for streaming queries, including Input Rate, Process Rate, Input Rows, Batch Duration, and Operation Duration.

  • 3. Observable metrics

Continuously monitoring changes in data quality is an important capability for managing data pipelines. Spark 3.0 introduces this kind of monitoring for both batch and streaming applications. Observable metrics are aggregate functions that can be defined on a query (DataFrame). As soon as the execution of a DataFrame reaches a completion point (for example, a batch query finishes), an event is emitted that contains the metrics for the data processed since the last completion point.

Existing data source APIs lack the ability to access and manipulate the metadata of external data sources. This release enhances the Data Source V2 API and introduces the new catalog plugin API. For external data sources that implement both the catalog plugin API and the Data Source V2 API, users can operate directly on the data and metadata of external tables via identifiers (after the corresponding external catalog is registered), as sketched below.
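A minimal sketch of how an external catalog is registered through configuration and then addressed by identifier; the catalog name and the implementation class below are hypothetical placeholders for a real connector that implements the TableCatalog plugin API:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("catalog-plugin-demo")
    # Register an external catalog named "my_catalog"; the class is a placeholder
    # for a connector implementing the new catalog plugin (TableCatalog) API.
    .config("spark.sql.catalog.my_catalog", "com.example.MyCatalogImplementation")
    .getOrCreate()
)

# Once registered, tables in the external system can be addressed directly by a
# multi-part identifier of the form <catalog>.<namespace>.<table>, for example:
# spark.sql("SELECT * FROM my_catalog.db.events")
# spark.sql("CREATE TABLE my_catalog.db.new_table (id BIGINT) USING parquet")
```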

Spark 3.0 is an important release for the community, resolving more than 3,400 Jira issues. It is the result of the joint efforts of more than 440 contributors, including individual developers as well as employees of Databricks, Google, Microsoft, Intel, IBM, Alibaba, Facebook, Nvidia, Netflix, Adobe, and other companies.

In this blog post, we have focused on Spark's key improvements in SQL, Python, and streaming.

In addition, as a milestone release, Spark 3.0 contains many other improvements. For more information, see the release notes: https://spark.apache.org/releases/spark-release-3-0-0.html. The release notes provide more details about improvements in data sources, the ecosystem, monitoring, and more.

Finally, congratulations on the 10th anniversary of Spark's open source development!

Spark was born in UC Berkeley's AMPLab, a research lab dedicated to data-intensive computing. AMPLab researchers worked with large internet companies on their data and AI problems, and realized that the same problems would also be faced by every company with massive and growing data. The team therefore developed a new engine to handle these emerging workloads while making data processing APIs easier for developers to use.

The community quickly expanded Spark into different areas, adding new capabilities around streaming, Python, and SQL, and these patterns now make up some of Spark's main use cases. Spark has seen continuous investment and has become the de facto engine for data processing, data science, machine learning, and data analytics workloads. Apache Spark 3.0 continues this trend by significantly improving support for SQL and Python, currently the two most widely used languages on Spark, and by optimizing performance and operability.

This article mainly draws on the Databricks blog and the official Apache Spark website, including but not limited to the following articles:

1.https://databricks.com/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html

2.https://spark.apache.org/releases/spark-release-3-0-0.html

For a more detailed introduction to the important features of Apache Spark™ 3.0.0, beyond the content of this article, you can also refer to other technical blogs from Databricks:

  • 1.Adaptive Query Execution blog

    https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html

  • 2.Pandas UDFs and Python Type Hints blog

    https://databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html
