The serverless Spark engine of Data Lake Analytics (DLA) uses a cloud-native architecture to provide data analytics and computing services for data lake scenarios. After you activate DLA, you can submit Spark jobs by completing simple configurations. This frees you from the complex deployment of Spark virtual clusters (VCs).
Challenges facing Apache Spark
Apache Spark is a prevailing engine in the big data field. It is well suited to data lake scenarios because it uses built-in connectors to access data sources, and its connector APIs are easy to extend. Apache Spark supports SQL and allows you to write DataFrame code in multiple programming languages, which makes it both easy to use and flexible. It also serves as an end-to-end engine that supports SQL, streaming, machine learning, and graph computing.
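The following is a minimal PySpark sketch of the DataFrame and SQL programming model described above. The application name, data, and column names are hypothetical and serve only as an illustration:

```python
from pyspark.sql import SparkSession

# Start a Spark session (the application name is hypothetical).
spark = SparkSession.builder.appName("dataframe-and-sql-demo").getOrCreate()

# Build a small in-memory DataFrame. In a data lake scenario, a built-in
# connector would read this data from a store such as OSS instead.
orders = spark.createDataFrame(
    [("o1", "shipped"), ("o2", "open"), ("o3", "shipped")],
    ["order_id", "status"],
)

# The same aggregation expressed with the DataFrame API ...
orders.groupBy("status").count().show()

# ... and with SQL against a temporary view.
orders.createOrReplaceTempView("orders")
spark.sql("SELECT status, COUNT(*) AS cnt FROM orders GROUP BY status").show()

spark.stop()
```

Despite these strengths, self-managed Apache Spark poses the following challenges: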
- Complex development and O&M operations: Developers must be familiar with a variety of big data components to develop jobs and operate clusters. When issues occur, they often need to dig into the source code maintained by the Apache Spark community.
- High O&M costs: Enterprises need an O&M team to maintain open source components. The team must configure resource nodes, deploy and configure open source software, monitor and update components, and scale clusters. Customized development is also required to meet enterprise-level requirements, such as permission isolation and monitoring and alerting.
- High resource costs: The load of Spark jobs fluctuates significantly over time. During off-peak hours, Apache Spark clusters hold large amounts of idle resources. Cluster management and control components, such as master nodes, ZooKeeper, and Hadoop services, still consume resources during off-peak hours but do not bring business value to customers.
- Lack of elasticity: During peak hours, enterprises must accurately estimate resource requirements and add machines as needed. If they add too many machines, some machines sit idle; if they add too few, business may be affected by insufficient resources. In addition, the scale-out process is complex and time-consuming, so resources may still fall short when the load spikes.
Solution
The serverless Spark engine of DLA is a big data analytics and computing service that is developed based on Apache Spark and uses a service-oriented architecture (SOA). It addresses the preceding challenges in the following ways:
- Easy to use: provides simple APIs and scripts, so developers do not need to learn about the underlying components. The serverless Spark engine also provides an easy way to perform operations in the DLA console, which enables developers with only basic knowledge of Apache Spark to develop big data services.
- Zero O&M: provides product interfaces for you to manage Spark jobs. You do not need to configure servers or Hadoop clusters, or perform O&M operations such as scaling.
- Low costs: uses the pay-as-you-go billing method. You are charged only for the jobs that you run, not for resource management and control. You also do not pay for idle computing resources during off-peak hours.
- Job-based scalability: allocates resources for each job based on its driver and executor configuration. Compared with the fixed compute units (CUs) of Apache Spark clusters, this reduces the probability of insufficient resources. The engine can start 500 to 1,000 CUs within one minute, which meets business resource requirements.
- Superior performance: delivers a threefold to fivefold performance improvement in typical scenarios that involve Alibaba Cloud services such as Object Storage Service (OSS). To achieve this, the DLA development team customizes and optimizes the serverless Spark engine based on Apache Spark.
- Enterprise-level capabilities: shares metadata with the serverless Presto engine of DLA and allows you to execute the GRANT and REVOKE statements to manage permissions granted to RAM users. The serverless Spark engine also provides a user-friendly web UI that opens within seconds no matter how complex a job is, unlike the Apache Spark history server.
Basic concepts
- Virtual cluster
The serverless Spark engine of DLA uses a multitenancy architecture, and Spark processes run in isolated environments. A VC is the unit of resource and security isolation. A VC does not hold fixed computing resources. Therefore, you need only to allocate a resource quota based on your business requirements and configure the network in which the destination data resides; you do not need to configure or maintain CUs. You can also configure default parameters for the Spark jobs of a VC, which facilitates unified management of Spark jobs. For more information about how to create a VC, see Create a virtual cluster.
- Compute unit
A compute unit (CU) is the basic unit of computing resources in the serverless Spark engine of DLA. One CU equals 1 CPU core and 4 GB of memory. After a job is complete, DLA calculates the number of CUs consumed by the driver and executors by using the following formula: Total number of CUs allocated to the driver and executors × Number of hours for which the CUs are used. For a worked example, see the sketch after this list. For more information about the billing methods, see Billing methods.
- Resource specifications
The serverless Spark engine runs on elastic container instances at the underlying layer. Similar to ECS instances, elastic container instances have their own specifications. However, you do not need to configure the detailed specifications of elastic container instances. You need only to set resource specifications, such as small, medium, or large. By default, the serverless Spark engine preferentially uses elastic container instances with higher specifications.
| Resource specification | Computing resources | CUs |
| --- | --- | --- |
| c.small | 1 core, 2 GB | 0.8 CU |
| small | 1 core, 4 GB | 1 CU |
| m.small | 1 core, 8 GB | 1.5 CU |
| c.medium | 2 cores, 4 GB | 1.6 CU |
| medium | 2 cores, 8 GB | 2 CU |
| m.medium | 2 cores, 16 GB | 3 CU |
| c.large | 4 cores, 8 GB | 3.2 CU |
| large | 4 cores, 16 GB | 4 CU |
| m.large | 4 cores, 32 GB | 6 CU |
| c.xlarge | 8 cores, 16 GB | 6.4 CU |
| xlarge | 8 cores, 32 GB | 8 CU |
| m.xlarge | 8 cores, 64 GB | 12 CU |
| c.2xlarge | 16 cores, 32 GB | 12.8 CU |
| 2xlarge | 16 cores, 64 GB | 16 CU |
| m.2xlarge | 16 cores, 128 GB | 24 CU |
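To make the CU formula concrete, the following is a minimal sketch for a hypothetical job that uses one medium driver and four medium executors (2 CUs each, per the preceding table) and runs for 2 hours. The job shape and duration are illustrative only:

```python
# Hypothetical CU-hour calculation: one "medium" driver (2 cores, 8 GB = 2 CUs)
# plus four "medium" executors (2 CUs each), running for 2 hours.
driver_cus = 2.0
executor_cus = 4 * 2.0
hours = 2.0

# Total CUs allocated to the driver and executors x hours used,
# as in the formula above.
cu_hours = (driver_cus + executor_cus) * hours
print(cu_hours)  # 20.0 CU-hours billed
```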
Use the serverless Spark engine
- Submit a Spark job. For more information, see Quick start of the serverless Spark engine. A minimal configuration sketch appears after this list.
- Connect to a data source. For more information, see Connect to data sources.
- Perform spatio-temporal calculations. For more information, see Overview.
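The following is a minimal sketch of a Spark job configuration, assuming the JSON job format described in the quick start. The job name, OSS path, entry class, and resource values are hypothetical placeholders; refer to Quick start of the serverless Spark engine for the exact fields supported:

```python
import json

# A hypothetical job configuration for the serverless Spark engine.
# All values below are placeholders for illustration only.
job_conf = {
    "name": "my-first-spark-job",
    "file": "oss://my-bucket/jars/my-job.jar",   # hypothetical OSS path
    "className": "com.example.MyJob",            # hypothetical entry class
    "conf": {
        "spark.driver.resourceSpec": "medium",    # 2 cores, 8 GB (2 CUs)
        "spark.executor.resourceSpec": "medium",  # per-executor specification
        "spark.executor.instances": "4",
    },
}
print(json.dumps(job_conf, indent=2))
```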
You can also contact Alibaba Cloud customer service for detailed information about this engine. For more information, see Expert service.