The serverless Spark engine of DLA allows you to orchestrate and schedule tasks by using DataWorks or DMS. In addition, DLA provides a custom SDK and the spark-submit tool so that you can orchestrate and schedule Spark tasks as needed. This topic describes how to use DataWorks to orchestrate and schedule DLA Spark tasks.
DataWorks is a platform that is used to process and analyze large amounts of data in offline mode. DataWorks offers fully hosted services for visualized workflow development, scheduling, and O&M. In DataWorks, tasks can be hosted and scheduled by time or dependency.
Note To submit a Spark job in the DLA console as a RAM user for the first time, you must configure the permissions for the RAM user. For more information, see Grant permissions to a RAM user (detailed version).
- The DLA, DataWorks, and OSS services are activated and deployed in the same region. In this topic, the three services are all deployed in the China (Beijing) region.
- A DataWorks workspace is created. For more information, see Create a workspace.
- Add an OSS data source in DataWorks.
- Log on to the DataWorks console, find your workspace, and then click Data Integration in the Actions column. In the left-side navigation pane, click Connection, and then click Data Source.
- Click New data source in the upper-right corner. In the Add data source dialog box, click the OSS icon in the Semi-structured storage section.
- In the Add OSS data source dialog box, configure the parameters. You must specify the AccessKey ID and AccessKey secret that are used to submit Spark jobs. A connectivity test is not required. After you configure the parameters, click Complete.
The following table describes the parameters.
| Parameter | Description |
| --- | --- |
| Data Source Name | The name of the data source. We recommend that you specify an informative name for easy management. |
| Description | The description of the data source. This parameter is optional. |
| Network connection type | The network connection type. You must select Aliyun Vpc from the drop-down list. |
| Endpoint | The endpoint, which must start with http://. You can specify this parameter as needed. Example: http://dummy. |
| Bucket | The name of the bucket to which the Spark job belongs. You can specify this parameter as needed. Example: dummy. |
| AccessKey ID | The AccessKey ID that is used to submit a Spark job. |
| AccessKey Secret | The AccessKey secret that is used to submit a Spark job. |
- Create a DLA scheduling task in DataWorks.
- Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the page that appears, find your workspace, and click Data Analytics in the Actions column.
- In the left-side panel, right-click Business Flow and choose Create Workflow. In the Create Workflow dialog box, set Workflow Name to dla_serverless_spark_test.
- Double-click the dla_serverless_spark_test workflow. In the left-side panel of the dla_serverless_spark_test tab, expand Custom, and drag DLA Serverless Spark to the canvas three times to create three nodes named test_1, test_2, and test_3. Then, connect the three nodes, as shown in the following figure.
- In the Data Analytics panel, expand Business Flow, and choose dla_serverless_spark_test > UserDefined > test_1. Select the OSS data source from the Select a connection drop-down list. Repeat these operations for test_2 and test_3.
- On the canvas of test_1, click Properties. In the Properties panel, enter the values of VC_NAME and SKYNET_REGION in the Arguments field, and select an appropriate option from the Rerun drop-down list. Repeat these operations for test_2 and test_3.
- VC_NAME is a required parameter for each task. It specifies the name of the virtual cluster (VC) that is used to run jobs.
- SKYNET_REGION specifies the region where your VC is deployed. For example, cn-beijing indicates that your VC is deployed in the China (Beijing) region. For more information about the mappings between regions and SKYNET_REGION values, see Regions and zones.
- In the Dependencies section of the Properties panel, click Use Root Node for test_1.
- After you complete the preceding configurations, click the Run icon and check whether logs are generated for the three tasks. The operational logs include a link to the Apache Spark web UI. For more information about how to configure the directory in which logs are saved, see Apache Spark web UI.
- After the test succeeds, click Save in the top navigation bar on the test_1, test_2, and test_3 tabs. Then, click the Submit icon to submit tasks for test_1, test_2, and test_3 in sequence, as shown in the following figure.
- After you configure these tasks, you can publish and maintain the tasks. For more information, see Publish a task and Task O&M.
You can also click the Run Smoke Test in Development Environment icon to perform smoke testing and check whether a single task succeeds.
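For reference, the body of a DLA Serverless Spark node describes the Spark job in JSON. The following snippet is a minimal sketch only: the OSS path, class name, and resource values are placeholders rather than values from this topic, and the exact schema is defined in the DLA Spark job configuration documentation.

```json
{
  "name": "SparkPi",
  "file": "oss://your-bucket/jars/spark-examples.jar",
  "className": "org.apache.spark.examples.SparkPi",
  "conf": {
    "spark.driver.resourceSpec": "small",
    "spark.executor.instances": 2,
    "spark.executor.resourceSpec": "small"
  }
}
```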
Orchestrate and schedule custom tasks
You can also combine the custom SDK or the spark-submit tool with a third-party task orchestration and scheduling system, such as Apache Airflow, to build a custom task workflow.
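As a sketch of this approach, the following Python snippet builds a spark-submit command line that a scheduler task (for example, an Airflow BashOperator) could execute. The `spark.dla.*` configuration keys are hypothetical placeholders, not the real DLA flags; check the DLA spark-submit documentation for the actual options.

```python
import shlex


def build_spark_submit_command(vc_name: str, region: str,
                               jar_path: str, main_class: str) -> str:
    """Assemble a spark-submit command for a third-party scheduler.

    The flag layout follows the open-source spark-submit convention;
    the DLA spark-submit wrapper may differ, so treat this as a sketch.
    """
    args = [
        "spark-submit",
        "--class", main_class,
        # Hypothetical keys for the DLA virtual cluster and region;
        # replace them with the options from the DLA documentation.
        "--conf", f"spark.dla.vc.name={vc_name}",
        "--conf", f"spark.dla.region={region}",
        jar_path,
    ]
    # shlex.quote protects against spaces or shell metacharacters
    # in user-supplied values such as the OSS path.
    return " ".join(shlex.quote(a) for a in args)


command = build_spark_submit_command(
    vc_name="my-vc",
    region="cn-beijing",
    jar_path="oss://my-bucket/jars/spark-examples.jar",
    main_class="org.apache.spark.examples.SparkPi",
)
print(command)
```

A scheduler would pass the resulting string to a shell-executing task, so each task in the workflow submits one Spark job to the VC.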