
Configure global Spark parameters

Last Updated: Apr 01, 2024

You can configure global Spark parameters at the workspace level for DataWorks services. By default, tasks are run with the global Spark parameters. You can configure custom global Spark parameters by referring to Apache Spark - A Unified engine for large-scale data analytics, and specify whether the global Spark parameters configured at the workspace level take precedence over the Spark parameters configured for a single task in a specific DataWorks service, such as DataStudio, DataAnalysis, or Operation Center. This topic describes how to configure global Spark parameters.

Background information

Apache Spark is a unified analytics engine for large-scale data processing. In DataWorks, you can use one of the following methods to configure Spark parameters that are used to schedule nodes:

  • Method 1: Configure global Spark parameters

    You can configure global Spark parameters that are used by a DataWorks service to run EMR tasks at the workspace level, and specify whether the global Spark parameters have a higher priority than the Spark parameters that you configure to run a single task in the same DataWorks service. For more information, see the Configure global Spark parameters section in this topic.

  • Method 2: Configure Spark parameters to run a single task in a DataWorks service

    • In DataStudio, you can configure Spark parameters for an EMR Hive or EMR Spark node. Go to the configuration tab of the node, click Advanced Settings in the right-side navigation pane, and then configure the Spark properties that you want to use to run a task on the node.

    • In other DataWorks services, you cannot configure Spark properties for a single task.
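As a concrete illustration of what either method sets, the following sketch lists a few standard Apache Spark properties as name-value pairs. The property names come from open source Spark; the values are placeholders, not recommendations:

```python
# A few standard Apache Spark properties, expressed as name -> value pairs.
# The names come from open source Spark; the values are illustrative only.
global_spark_props = {
    "spark.executor.memory": "4g",          # memory per executor
    "spark.executor.cores": "2",            # CPU cores per executor
    "spark.sql.shuffle.partitions": "200",  # partitions used by Spark SQL shuffles
}

for name, value in sorted(global_spark_props.items()):
    print(f"{name}={value}")
```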

Limits

  • You can use only the following accounts and roles to configure global Spark parameters:

    • An Alibaba Cloud account

    • RAM users or RAM roles to which the AliyunDataWorksFullAccess policy is attached

    • RAM users to which the Workspace Administrator role is assigned

  • Spark parameters take effect only for EMR Spark, EMR Spark SQL, and EMR Spark Streaming nodes.

  • You can update Spark-related configurations on the SettingCenter page in the DataWorks console or in the EMR console. If the configurations of the same Spark properties are different between the DataWorks console and the EMR console, the configurations of the Spark properties on the SettingCenter page in the DataWorks console are used for tasks that you commit in DataWorks.
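The rule above can be sketched as a simple merge in which the SettingCenter values win for overlapping property names. The helper below is hypothetical, for illustration only; it is not a DataWorks or EMR API:

```python
def effective_conf(emr_console_conf: dict, setting_center_conf: dict) -> dict:
    """Return the configuration used for a task committed in DataWorks.

    For properties set in both places, the value from the SettingCenter
    page in the DataWorks console wins, per the rule described above.
    Illustrative sketch only; not an actual DataWorks or EMR API.
    """
    merged = dict(emr_console_conf)     # start from the EMR console values
    merged.update(setting_center_conf)  # SettingCenter overrides overlaps
    return merged

emr = {"spark.executor.memory": "2g", "spark.dynamicAllocation.enabled": "true"}
dw = {"spark.executor.memory": "4g"}
print(effective_conf(emr, dw)["spark.executor.memory"])  # -> 4g
```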

  • You can configure global Spark parameters only for DataStudio, Data Quality, DataAnalysis, and Operation Center.

Prerequisites

An EMR cluster is registered to DataWorks. For more information, see Register an EMR cluster to DataWorks.

Configure global Spark parameters

  1. Go to the page for configuring global Spark parameters.

    1. Go to the Management Center page.

      Log on to the DataWorks console. In the left-side navigation pane, click Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.

    2. In the left-side navigation pane, click Open Source Clusters.

    3. On the Open Source Clusters page, find the desired EMR cluster and click the Spark-related Parameter tab.


  2. Configure global Spark parameters.

    Click Edit Spark-related Parameter in the upper-right corner of the Spark-related Parameter tab to configure global Spark parameters and parameter priorities for DataWorks services.

    Note

    The configurations globally take effect in a workspace. You must confirm the workspace before you configure the parameters.

    The following parameters are available:

    • Spark Property Name and Spark Property Value

      The Spark properties that you want to configure to run EMR tasks in a DataWorks service. You can configure Spark properties by referring to Spark Configuration and Running Spark on Kubernetes.

    • Global Settings Take Precedence

      Specifies whether the global configuration takes precedence over the separate configuration for a single task in a DataWorks service. If you select this check box, the globally configured Spark properties are used when tasks on nodes are run.

    • Global configuration: Go to the SettingCenter page. In the left-side navigation pane, click Open Source Clusters. On the Open Source Clusters page, find the desired EMR cluster and click the Spark-related Parameter tab.

      Note

      You can configure global Spark parameters only for DataStudio, Data Quality, DataAnalysis, and Operation Center.

    • Separate configurations for single tasks in DataWorks services:

      • In DataStudio, you can configure Spark parameters for an EMR Hive or EMR Spark node. Go to the configuration tab of the node, click Advanced Settings in the right-side navigation pane, and then configure the Spark properties that you want to use to run a task on the node.

      • In other DataWorks services, you cannot configure Spark properties for a single task.
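The effect of the Global Settings Take Precedence check box can be sketched as the following merge of global and task-level Spark properties. The helper is hypothetical, for illustration only; it is not a DataWorks API:

```python
def resolve_conf(global_conf: dict, task_conf: dict,
                 global_takes_precedence: bool) -> dict:
    """Merge global and task-level Spark properties for one task.

    If Global Settings Take Precedence is selected, globally configured
    properties override task-level ones for overlapping names; otherwise
    the task-level settings win. Illustrative sketch only.
    """
    if global_takes_precedence:
        merged = dict(task_conf)
        merged.update(global_conf)  # global values override overlaps
    else:
        merged = dict(global_conf)
        merged.update(task_conf)    # task-level values override overlaps
    return merged

g = {"spark.executor.memory": "4g"}
t = {"spark.executor.memory": "8g", "spark.executor.cores": "2"}
print(resolve_conf(g, t, True))   # global memory wins; cores kept from task
print(resolve_conf(g, t, False))  # task-level memory wins
```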