The serverless Spark engine of Data Lake Analytics (DLA) uses a cloud-native architecture to provide data analytics and computing services for data lake scenarios. After you activate DLA, you can submit Spark jobs by completing simple configurations. This frees you from the complex deployment of Spark virtual clusters (VCs).

Challenges facing Apache Spark

Apache Spark is a popular engine in the big data field. It is well suited to data lake scenarios and uses built-in connectors to access a variety of data sources. The connector APIs are also easy to extend. Apache Spark supports SQL and provides DataFrame APIs in multiple programming languages, which makes it easy to use and flexible. It also serves as an end-to-end engine that supports SQL, streaming, machine learning, and graph computing.

Before you can use Apache Spark, you must deploy a set of basic open source big data components, such as YARN, HDFS, and ZooKeeper. This approach can cause the following issues:
  • Complex development and O&M operations: To complete the development and O&M operations, developers must be familiar with a variety of big data components. If they encounter issues, they must conduct in-depth research on the source code provided by the Apache Spark community.
  • High O&M costs: Enterprises require an O&M team to maintain open source components. The O&M team needs to configure resource nodes, configure and deploy open source software, monitor and update open source components, and scale clusters. Customized development is also required to meet enterprise-level requirements, such as permission isolation and monitoring and alerting.
  • High resource costs: The load of Spark jobs fluctuates significantly over time. During off-peak hours, large amounts of resources in Apache Spark clusters sit idle. Cluster management and control components, such as master nodes, ZooKeeper, and Hadoop, still consume resources during off-peak hours but do not bring business value to customers.
  • Lack of elasticity: Before peak hours, enterprises need to accurately estimate resource requirements and add machines if required. If they add too many machines, some machines remain unused. If they add too few, business may be affected due to insufficient resources. In addition, the cluster scale-out process is complex and time-consuming, so resources may still be insufficient when they are needed.

Solution

The serverless Spark engine of DLA is a big data analytics and computing service. This engine is developed based on Apache Spark and uses a service-oriented architecture (SOA).

The engine deeply integrates Apache Spark with serverless and cloud-native technologies. Compared with a self-managed Apache Spark deployment, the serverless Spark engine of DLA brings the following benefits:
  • Easy to use: provides simple APIs and scripts without requiring developers to learn about basic components at the underlying layer. In addition, the serverless Spark engine provides an easy way to perform operations in the DLA console. It enables developers with only a basic knowledge of Apache Spark to develop big data services.
  • Zero O&M: provides product interfaces for you to manage Spark jobs. You do not need to configure servers or Hadoop clusters, or perform O&M operations such as scaling.
  • Low costs: uses the pay-as-you-go billing method. You are charged only for the resources that your jobs consume. You are not charged for resource management and control. In addition, you do not need to pay for idle computing resources during off-peak hours.
  • Job-based scalability: allocates resources for each job at the granularity of the driver and executors. Compared with scaling an Apache Spark cluster, this reduces the probability of insufficient resources. The engine can start 500 to 1,000 compute units (CUs) within a minute to meet business resource requirements.
  • Superior performance: improves performance threefold to fivefold in typical scenarios when Alibaba Cloud services, such as Object Storage Service (OSS), are deployed. To achieve this purpose, the development team of DLA customizes and optimizes the serverless Spark engine based on Apache Spark.
  • Enterprise-level capability: shares metadata with the serverless Presto engine of DLA. You can execute the GRANT and REVOKE statements to manage permissions granted to RAM users. The serverless Spark engine also provides a user-friendly web UI. Unlike the Apache Spark history server, this web UI opens within a few seconds, no matter how complex a job is.

Basic concepts

  • Virtual cluster

    The serverless Spark engine of DLA uses a multitenancy architecture in which Spark processes run in isolated environments. A virtual cluster (VC) is the unit of resource and security isolation. A VC does not have fixed computing resources. Therefore, you only need to allocate a resource quota based on your business requirements and configure the network in which the data that you want to access resides. You do not need to configure or maintain CUs. You can also configure default parameters for the Spark jobs of a VC, which facilitates unified management of Spark jobs. For more information about how to create a VC, see Create a virtual cluster.

  • Compute unit

    A compute unit (CU) is the basic unit of computing resources in the serverless Spark engine of DLA. One CU equals 1 CPU core and 4 GB of memory. After a job is complete, DLA calculates the resources that the job consumed, in CU-hours, by using the following formula: Total number of CUs used on the driver and executors × Number of hours in which CUs are used. For more information about the billing methods, see Billing methods.

  • Resource specifications

    Elastic container instances are used for the serverless Spark engine at the underlying layer. Similar to ECS instances, elastic container instances have their own specifications. You do not need to configure the detailed specifications of elastic container instances. Instead, you only need to set resource specifications to small, medium, large, or xlarge. By default, the serverless Spark engine preferentially uses elastic container instances with higher specifications.

    Resource specifications    Computing resource specifications    CU specifications
    small                      1 CPU core and 4 GB of memory        1 CU
    medium                     2 CPU cores and 8 GB of memory       2 CUs
    large                      4 CPU cores and 16 GB of memory      4 CUs
    xlarge                     8 CPU cores and 32 GB of memory      8 CUs
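
    The CU formula and the resource specifications above can be combined into a small worked example. The following Python sketch is illustrative only and is not part of DLA: the function names total_cus and cu_hours are hypothetical, and the sketch assumes that the driver and all executors run for the full duration of the job and that no rounding or minimum charge applies. For the actual billing rules, see Billing methods.

```python
# CUs per resource specification, taken from the table above.
CUS_PER_SPEC = {"small": 1, "medium": 2, "large": 4, "xlarge": 8}


def total_cus(driver_spec, executor_spec, executor_count):
    """Return the CUs held by one job: one driver plus its executors."""
    if executor_count < 0:
        raise ValueError("executor_count must not be negative")
    driver_cus = CUS_PER_SPEC[driver_spec]
    executor_cus = CUS_PER_SPEC[executor_spec] * executor_count
    return driver_cus + executor_cus


def cu_hours(driver_spec, executor_spec, executor_count, runtime_seconds):
    """Apply the formula: total CUs x number of hours in which CUs are used.

    Assumption: every CU is held for the whole runtime of the job.
    """
    hours = runtime_seconds / 3600
    return total_cus(driver_spec, executor_spec, executor_count) * hours


# Example: a medium driver and 10 medium executors that run for 30 minutes.
job_cus = total_cus("medium", "medium", 10)
job_cu_hours = cu_hours("medium", "medium", 10, 30 * 60)
print(job_cus, job_cu_hours)
```

    In this example, the job holds 2 + 10 × 2 = 22 CUs. Over half an hour, that amounts to 11 CU-hours under the stated assumptions.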

Use the serverless Spark engine

Submit a Spark job. For more information, see Quick start of the serverless Spark engine.

Connect to a data source. For more information, see Connect to data sources.

Perform spatio-temporal calculations. For more information, see Overview.

You can also contact Alibaba Cloud customer service for detailed information about this engine. For more information, see Expert service.