Alibaba Cloud DataWorks supports creating Hive, Spark SQL, Spark, and other nodes on E-MapReduce to configure and schedule task workflows. It also provides metadata management and data quality monitoring alert features to help users efficiently develop and govern data. This topic describes how to submit jobs through Alibaba Cloud DataWorks.
Supported cluster types
DataWorks currently supports registering the following cluster types:
DataLake cluster (new data lake)
Custom cluster
Hadoop cluster (old data lake)
You can use EMR Hadoop clusters of the following versions in DataWorks:
EMR-3.38.2, EMR-3.38.3, EMR-4.9.0, EMR-5.6.0, EMR-3.26.3, EMR-3.27.2, EMR-3.29.0, EMR-3.32.0, EMR-3.35.0, EMR-4.3.0, EMR-4.4.1, EMR-4.5.0, EMR-4.5.1, EMR-4.6.0, EMR-4.8.0, EMR-5.2.1, EMR-5.4.3
Hadoop clusters (old data lake) are no longer recommended. You must migrate to DataLake clusters as soon as possible. For more information, see Migrate Hadoop clusters to DataLake clusters.
Limits
Task type: You cannot run EMR Flink tasks in the DataWorks console.
Task running: You can use a serverless resource group (recommended) or an old-version exclusive resource group for scheduling to run an EMR task.
Task governance:
Only SQL tasks in EMR Hive, EMR Spark, and EMR Spark SQL nodes can be used to generate data lineages. If your EMR cluster is of V3.43.1, V5.9.1, or a minor version later than V3.43.1 or V5.9.1, you can view the table-level lineages and field-level lineages of the preceding nodes that are created based on the cluster.
Note: For Spark-based EMR nodes, if the EMR cluster is of V5.8.0, V3.42.0, or a minor version later than V5.8.0 or V3.42.0, the nodes can be used to view table-level and field-level lineages. If the EMR cluster is of a minor version earlier than V5.8.0 or V3.42.0, only the Spark-based EMR nodes that use Spark 2.x can be used to view table-level lineages.
If you want to manage metadata for a DataLake or custom cluster in DataWorks, you must configure EMR-HOOK in the cluster first. If you do not configure EMR-HOOK in the desired cluster, metadata cannot be displayed in real time, audit logs cannot be generated, and data lineages cannot be displayed in DataWorks. In addition, EMR governance tasks cannot be run. EMR-HOOK can be configured for EMR Hive and EMR Spark SQL services. For more information, see Use the Hive extension feature to record data lineage and historical access information and Use the Spark SQL extension feature to record data lineage and historical access information.
Supported regions: EMR Serverless Spark is available in the China (Hangzhou), China (Shanghai), China (Beijing), China (Zhangjiakou), China (Shenzhen), Singapore, Germany (Frankfurt), and US (Silicon Valley) regions.
For an EMR cluster for which Kerberos authentication is enabled, you must add inbound rules of UDP ports to the security group of the EMR cluster for the CIDR block of the vSwitch with which a resource group is associated.
Note: To add an inbound rule, perform the following operations: Log on to the EMR console. Go to the Basic Information tab of your EMR cluster. In the Security section, click the icon to the right of the Cluster Security Group parameter. On the Security Group Details tab of the Security Groups page, click the Inbound tab in the Access Rule section. On the Inbound tab, click Add Rule. Set the Protocol Type parameter to Custom UDP, the Port Range parameter to the port specified in the /etc/krb5.conf file of your EMR cluster, and the Authorization Object parameter to the CIDR block of the vSwitch with which a resource group is associated.
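The UDP port for the inbound rule is read from the kdc entries of the /etc/krb5.conf file on the EMR cluster (the Kerberos default is port 88). As a minimal illustration, the following Python sketch parses sample krb5.conf content to find the port to open; the sample realm and host names are assumptions, not output from a real cluster, so read the actual file on your cluster.

```python
import re

# Hypothetical /etc/krb5.conf content; on a real cluster, read the actual file.
KRB5_CONF = """
[realms]
 EMR.EXAMPLE.COM = {
  kdc = emr-header-1.cluster-12345:88
  admin_server = emr-header-1.cluster-12345:749
 }
"""

def kdc_ports(conf_text):
    """Return the distinct ports found on kdc lines (88 is the Kerberos default)."""
    ports = []
    for match in re.finditer(r"^\s*kdc\s*=\s*\S+?:(\d+)\s*$", conf_text, re.MULTILINE):
        port = int(match.group(1))
        if port not in ports:
            ports.append(port)
    return ports

print(kdc_ports(KRB5_CONF))  # -> [88]
```

Each port returned this way needs a Custom UDP inbound rule whose authorization object is the vSwitch CIDR block of the resource group.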
Prerequisites
The following permissions have been granted. Only RAM users or RAM roles with the following identities can register EMR clusters. For operation details, see Grant permissions to RAM users.
An Alibaba Cloud account.
A RAM user or RAM role that has both the DataWorks workspace administrator role and the AliyunEMRFullAccess policy.
A RAM user or RAM role that has both the AliyunDataWorksFullAccess and AliyunEMRFullAccess policies.
The corresponding type of EMR cluster has been purchased. In this example, the region of the EMR cluster is China (Shanghai).
For more information about the cluster types that DataWorks supports registering, see Supported cluster types.
Precautions
If you want to isolate EMR data in the development environment from EMR data in the production environment by using a workspace in standard mode, you must register different EMR clusters in the development and production environments of the workspace. In addition, the metadata of the EMR clusters must be stored by using one of the following methods:
Method 1: Store the metadata in two different catalogs in DLF. We recommend that you use this method. For more information, see Use DLF for unified metadata storage.
Method 2: Store the metadata in two different ApsaraDB RDS databases. For information about how to configure an ApsaraDB RDS database as the metadatabase of an EMR cluster, see Configure a self-managed ApsaraDB RDS for MySQL database.
You can register an EMR cluster to multiple workspaces within the same Alibaba Cloud account, but you cannot register an EMR cluster to workspaces across Alibaba Cloud accounts. For example, if you register an EMR cluster to a workspace within the current Alibaba Cloud account, you cannot register the cluster to a workspace in another Alibaba Cloud account.
If a DataWorks resource group and an EMR cluster are deployed in the same virtual private cloud (VPC) and use the same vSwitch, but the resource group cannot connect to the EMR cluster as expected, check the security group rules of the EMR cluster. Add the CIDR block of the vSwitch and inbound rules for the ports of common open source components to the security group rules to ensure that the DataWorks resource group can access the EMR cluster as expected. For more information, see Manage security groups.
Prepare a DataWorks environment
Before you develop tasks in DataWorks, you must activate DataWorks. For more information, see Prepare an environment.
Step 1: Create a workspace
If a workspace exists in the China (Shanghai) region, skip this step and use the existing workspace.
Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region.
In the left-side navigation pane, click Workspace. On the Workspaces page, click Create Workspace to create a workspace in standard mode. For more information, see Create a workspace. For a workspace in standard mode, the development environment is isolated from the production environment.
Step 2: Create a serverless resource group
This tutorial requires a serverless resource group for data synchronization and scheduling. Therefore, you need to purchase and configure a serverless resource group.
Purchase a serverless resource group.
Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Resource Group to go to the Resource Groups page.
On the Resource Groups page, click Create Resource Group. On the buy page, set Region and Zone to China (Shanghai), specify the resource group name, configure other parameters as prompted, and then follow on-screen instructions to pay for the resource group. For information about the billing details of serverless resource groups, see Billing of serverless resource groups.
Note: If no virtual private cloud (VPC) or vSwitch exists in the current region, click the link in the parameter description to go to the VPC console and create one. For more information about VPCs and vSwitches, see What is VPC?
Associate the serverless resource group with the DataWorks workspace.
You can use the serverless resource group that you purchased in subsequent operations only after you associate the serverless resource group with a workspace.
Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, click Resource Group. On the Resource Groups page, find the serverless resource group that you purchased, and click Associate Workspace in the Actions column. In the Associate Workspace panel, find the workspace with which you want to associate and click Associate in the Actions column.
Enable the serverless resource group to access the Internet.
The test data used in this tutorial must be obtained over the Internet. By default, the serverless resource group cannot be used to access the Internet. You must configure an Internet NAT gateway for the VPC with which the serverless resource group is associated and configure an EIP for the VPC to establish a network connection between the VPC and the network environment of the test data. This way, you can use the serverless resource group to access the test data.
Go to the Internet NAT Gateway page in the VPC console. In the top navigation bar, select the China (Shanghai) region.
Click Create Internet NAT Gateway and configure the parameters. The following table describes the key parameters that are required in this tutorial. You can retain the default values for the parameters that are not described in the following table.
Parameter | Description |
Region | Select China (Shanghai). |
VPC | Select the VPC with which the resource group is associated. To view the VPC and vSwitch with which the resource group is associated, perform the following operations: Log on to the DataWorks console. In the top navigation bar, select the region in which you activate DataWorks. In the left-side navigation pane, click Resource Group. On the Resource Groups page, find the created resource group and click Network Settings in the Actions column. In the Data Scheduling & Data Integration section of the VPC Binding tab on the page that appears, view the VPC and vSwitch with which the resource group is associated. For more information about VPCs and vSwitches, see What is VPC? |
Associate vSwitch | Select the vSwitch with which the resource group is associated. |
Access Mode | Select SNAT-enabled Mode. |
EIP | Select Purchase EIP. |
Service-linked Role | If this is the first time you create a NAT gateway, click Create Service-linked Role to create the service-linked role. |
Click Buy Now. On the Confirm page, read the terms of service, select the check box for Terms of Service, and then click Activate Now.
For more information about how to create and use a serverless resource group, see Use serverless resource groups.
Step 3: Register the EMR cluster to DataWorks and initialize the resource group
You can use the EMR cluster in DataWorks only if you register the cluster to DataWorks.
Go to the Register EMR Cluster page.
Go to the SettingCenter page.
Log on to the DataWorks console. In the top navigation bar, select the China (Shanghai) region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, click Cluster Management. On the Cluster Management page, click Register Cluster. In the Select Cluster Type dialog box, click E-MapReduce. The Register EMR Cluster page appears.
Register the EMR cluster to DataWorks.
On the Register EMR Cluster page, configure cluster information. The following table describes the key parameters.
Parameter
Description
Alibaba Cloud Account to Which Cluster Belongs
Set it to Current Alibaba Cloud Account.
Cluster Type
Select Data Lake.
Default Access Identity
Set it to Cluster Account: hadoop.
Pass Proxy User Information
Set it to Pass.
Initialize the resource group.
Go to the Cluster Management page in SettingCenter. Find the EMR cluster that is registered to DataWorks and click Initialize Resource Group in the section that displays the information of the EMR cluster.
In the Initialize Resource Group dialog box, find the desired resource group and click Initialize.
After the initialization is complete, click OK.
Important: You must make sure that the initialization of the resource group is successful. Otherwise, tasks that use the resource group may fail. If the initialization of the resource group fails, you can view the failure cause and perform a network connectivity diagnosis as prompted.
For more information about how to register an EMR cluster, see DataStudio (old version): Associate an EMR computing resource.
Submit EMR jobs
Submit EMR Hive jobs
Step 1: Create an EMR Hive node
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Create an EMR Hive node.
Find the desired workflow, right-click the name of the workflow, and then choose .
Note: Alternatively, you can move the pointer over the Create icon and choose .
In the Create Node dialog box, configure the Name, Engine Instance, Node Type, and Path parameters. Click Confirm. The configuration tab of the EMR Hive node appears.
Note: The node name can contain only letters, digits, underscores (_), and periods (.).
Step 2: Develop an EMR Hive task
You can develop a Hive task on the configuration tab of the EMR Hive node.
Develop SQL code
In the SQL editor, develop node code. You can define variables in the ${Variable} format in the node code and configure the scheduling parameters that are assigned to the variables as values in the Scheduling Parameter section of the Properties tab. This way, the values of the scheduling parameters are dynamically replaced in the node code when the node is scheduled to run. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters. Sample code:
```
show tables;
select '${var}'; -- You can assign a specific scheduling parameter to the var variable.
select * from userinfo;
```
The size of the SQL statements for Hive task development cannot exceed 130 KB.
If multiple EMR computing resources are associated with DataStudio in your workspace, select one computing resource based on your business requirements. If only one EMR computing resource is associated with DataStudio in your workspace, you do not need to select one.
If you want to change the scheduling parameter that is assigned to the variable in the code, click Run with Parameters in the top toolbar. For more information about the value assignment logic of scheduling parameters, see What are the differences in the value assignment logic of scheduling parameters among the Run, Run with Parameters, and Perform Smoke Testing in Development Environment modes?
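The substitution described above amounts to replacing each ${Variable} placeholder with its assigned scheduling-parameter value before the SQL is submitted. The following Python sketch is a minimal simulation of that replacement, not the DataWorks implementation; the helper name and the sample parameter value are illustrative assumptions.

```python
import re

def resolve_scheduling_params(sql, params):
    """Replace each ${name} placeholder with its assigned scheduling-parameter value."""
    def repl(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"No scheduling parameter assigned to variable: {name}")
        return params[name]
    return re.sub(r"\$\{(\w+)\}", repl, sql)

# At run time, the scheduler would supply a concrete value, e.g. the data timestamp.
sql = "select '${var}';"
print(resolve_scheduling_params(sql, {"var": "20240101"}))  # -> select '20240101';
```

A variable that has no scheduling parameter assigned would be left unresolved in practice; the sketch raises an error instead to make the failure mode explicit.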
Run the Hive task
In the top toolbar, click the Run icon. In the Parameters dialog box, select the desired resource group from the Resource Group Name drop-down list and click Run.
Note: If you want to access a computing resource over the Internet or a virtual private cloud (VPC), use the resource group for scheduling that is connected to the computing resource. For more information, see Network connectivity solutions.
If you want to change the resource group in subsequent operations, you can click the Run with Parameters icon to change the resource group in the Parameters dialog box.
If you use an EMR Hive node to query data, a maximum of 10,000 data records can be returned, and the total size of the returned data records cannot exceed 10 MB.
Click the Save icon in the top toolbar to save the SQL statements.
(Optional) Perform smoke testing.
You can perform smoke testing on the node in the development environment when you commit the node or after you commit the node. For more information, see Perform smoke testing.
If you want to modify the queue to which jobs are committed, see Configure advanced parameters.
Step 3: Configure scheduling properties
If you want the system to periodically run a task on the node, you can click Properties in the right-side navigation pane on the configuration tab of the node to configure task scheduling properties based on your business requirements. For more information, see Overview.
You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.
Step 4: Deploy the task
After a task on a node is configured, you must commit and deploy the task. After you commit and deploy the task, the system runs the task on a regular basis based on scheduling configurations.
Click the Save icon in the top toolbar to save the task. Then, click the Commit icon in the top toolbar to commit the task. In the Submit dialog box, configure the Change description parameter. Then, determine whether to review the task code after you commit the task based on your business requirements.
Note: You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.
You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.
If you use a workspace in standard mode, you must deploy the task in the production environment after you commit the task. To deploy a task on a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploy nodes.
Submit EMR Spark SQL jobs
Step 1: Create an EMR Spark SQL node
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Create an EMR Spark SQL node.
Find the desired workflow, right-click the workflow name, and then choose .
Note: Alternatively, you can move the pointer over the Create icon and choose .
In the Create Node dialog box, configure the Name, Engine Instance, Node Type, and Path parameters. Click Confirm to go to the EMR Spark SQL node configuration tab.
Note: The node name can contain only letters, digits, underscores (_), and periods (.).
Step 2: Develop an EMR Spark SQL task
You can perform the following operations to develop an EMR Spark SQL task on the configuration tab of the EMR Spark SQL node:
Develop SQL code
In the SQL editor, develop node code. You can define variables in the ${Variable} format in the node code and configure the scheduling parameters that are assigned to the variables as values in the Scheduling Parameter section of the Properties tab. This way, the values of the scheduling parameters are dynamically replaced in the node code when the node is scheduled to run. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters. Sample code:
```
SHOW TABLES;
-- Define a variable named var in the ${var} format. If you assign the ${yyyymmdd} parameter to the variable as a value, you can create a table whose name is suffixed with the data timestamp.
CREATE TABLE IF NOT EXISTS userinfo_new_${var} (
    ip STRING COMMENT 'IP address',
    uid STRING COMMENT 'User ID'
) PARTITIONED BY (
    dt STRING
); -- You can assign a specific scheduling parameter to the var variable.
```
The size of SQL statements for the node cannot exceed 130 KB.
If multiple EMR data sources are associated with DataStudio in your workspace, you must select one from the data sources based on your business requirements. If only one EMR data source is associated with DataStudio in your workspace, you do not need to select a data source.
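To make the table-name suffix concrete: if the var variable above is assigned the data timestamp in the yyyymmdd format, the CREATE TABLE statement produces a table whose name carries that date. The sketch below simulates the resolved table name for a given business date; the helper and the sample dates are illustrative assumptions, not DataWorks behavior verbatim.

```python
from datetime import date, timedelta

def table_name_for(bizdate):
    """Resolve userinfo_new_${var} when var is assigned the data timestamp (yyyymmdd)."""
    return f"userinfo_new_{bizdate:%Y%m%d}"

# The data timestamp is typically the day before the scheduled run date.
run_date = date(2024, 1, 2)
print(table_name_for(run_date - timedelta(days=1)))  # -> userinfo_new_20240101
```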
Execute SQL statements
Click the Run icon in the top toolbar. In the Parameters dialog box, select a created resource group for scheduling and click Run.
Note: If you want to access a data source over the Internet or a virtual private cloud (VPC), you must use the resource group for scheduling that is connected to the data source. For more information, see Network connectivity solutions.
If you want to change the resource group in subsequent operations, you can click the Run with Parameters icon to change the resource group in the Parameters dialog box.
If you use an EMR Spark SQL node to query data, a maximum of 10,000 data records can be returned, and the total size of the returned data records cannot exceed 10 MB.
Click the Save icon in the top toolbar to save the SQL statements.
(Optional) Perform smoke testing.
You can perform smoke testing on the node in the development environment when you commit the node or after you commit the node. For more information, see Perform smoke testing.
If you want to modify the queue to which jobs are committed, see Configure advanced parameters.
Step 3: Configure scheduling properties
If you want the system to periodically run a task on the node, you can click Properties in the right-side navigation pane on the configuration tab of the node to configure task scheduling properties based on your business requirements. For more information, see Overview.
You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.
Step 4: Deploy the task
After a task on a node is configured, you must commit and deploy the task. After you commit and deploy the task, the system runs the task on a regular basis based on scheduling configurations.
Click the Save icon in the top toolbar to save the task. Then, click the Commit icon in the top toolbar to commit the task. In the Submit dialog box, configure the Change description parameter. Then, determine whether to review the task code after you commit the task based on your business requirements.
Note: You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.
You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.
If you use a workspace in standard mode, you must deploy the task in the production environment after you commit the task. To deploy a task on a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploy nodes.
Submit EMR Spark jobs
Step 1: Create an EMR Spark node
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Create an EMR Spark node.
Find the desired workflow, right-click the workflow name, and then choose .
NoteAlternatively, you can move the pointer over the Create icon and choose .
In the Create Node dialog box, configure the Name, Engine Instance, Node Type, and Path parameters. Click Confirm. The configuration tab of the EMR Spark node appears.
NoteThe node name can contain letters, digits, underscores (_), and periods (.).
Step 2: Develop a Spark task
You can use one of the following methods based on your business requirements to develop a Spark task on the configuration tab of the EMR Spark node:
(Recommended) Upload a resource from your on-premises machine to DataStudio and then reference the resource. For more information, see the Method 1: Upload and reference an EMR JAR resource section in this topic.
Use the OSS REF method to reference an OSS resource. For more information, see the Method 2: Reference an OSS resource section in this topic.
Method 1: Upload and reference an EMR JAR resource
DataWorks allows you to upload a resource from your on-premises machine to DataStudio and then reference the resource. You must obtain the JAR package that is generated after the code of a Spark task is compiled, and store it in a location that EMR can access. The method for storing the JAR package varies based on its size.
You can upload the JAR package to the DataWorks console as an EMR JAR resource and commit the resource. You can also store the JAR package in HDFS of EMR. However, for a Spark cluster that is created on the EMR on ACK page or an EMR Serverless Spark cluster, you cannot upload resources to HDFS.
If the JAR package is less than 200 MB in size
Create an EMR JAR resource.
You can upload the JAR package from your on-premises machine to the DataWorks console as an EMR JAR resource. This way, you can manage the JAR package in the DataWorks console in a visualized manner. After you create an EMR JAR resource, you must commit the resource. For more information, see Create and use an EMR resource.
Note: The first time you create an EMR JAR resource, if you want the JAR package to be stored in OSS after the JAR package is uploaded, you must perform authorization as prompted first.
Reference the EMR JAR resource.
Double-click the name of the created EMR Spark node to go to the configuration tab of the node.
Find the desired EMR JAR resource under Resource in the EMR folder, right-click the resource name, and then select Insert Resource Path.
Resource reference code is automatically added to the configuration tab of the EMR Spark node. Sample code:
```
##@resource_reference{"spark-examples_2.12-1.0.0-SNAPSHOT-shaded.jar"}
spark-examples_2.12-1.0.0-SNAPSHOT-shaded.jar
```
If the preceding code is automatically added, the resource is referenced. spark-examples_2.12-1.0.0-SNAPSHOT-shaded.jar is the name of the JAR package that you uploaded.
Rewrite the code of the EMR Spark node and add the spark-submit command. The following sample code is provided for reference only.
Note: You cannot add comments when you write code for an EMR Spark node. If you add comments, an error is reported when you run the node. You can refer to the following sample code to rewrite the code of an EMR Spark node.
```
##@resource_reference{"spark-examples_2.11-2.4.0.jar"}
spark-submit --class org.apache.spark.examples.SparkPi --master yarn spark-examples_2.11-2.4.0.jar 100
```
Components:
org.apache.spark.examples.SparkPi: the main class of the task in the compiled JAR package.
spark-examples_2.11-2.4.0.jar: the name of the JAR package that you uploaded.
You can keep the settings of other parameters unchanged. You can also run the following command to view the help documentation for the spark-submit command and modify the spark-submit command based on your business requirements.
Note: If you want to use a shorthand parameter supported by the spark-submit command, such as --executor-memory 2G, in an EMR Spark node, you must add the parameter to the code of the EMR Spark node. You can use Spark nodes on YARN to submit jobs only if your nodes are in cluster mode. If you commit a node by using spark-submit, we recommend that you set deploy-mode to cluster rather than client.
```
spark-submit --help
```
If the JAR package is 200 MB or greater in size
Store the JAR package in HDFS of EMR.
You cannot upload the JAR package from your on-premises machine to the DataWorks console as a DataWorks resource. We recommend that you store the JAR package in HDFS of EMR and record the storage path of the JAR package. This way, you can reference the JAR package in this path when you use DataWorks to schedule Spark tasks.
Reference the JAR package.
You can reference the JAR package by specifying the storage path of the JAR package in the code of an EMR Spark node.
Double-click the name of the created EMR Spark node to go to the configuration tab of the node.
Write the spark-submit command. Example:
```
spark-submit --master yarn --deploy-mode cluster --name SparkPi --driver-memory 4G --driver-cores 1 --num-executors 5 --executor-memory 4G --executor-cores 1 --class org.apache.spark.examples.JavaSparkPi hdfs:///tmp/jars/spark-examples_2.11-2.4.8.jar 100
```
Parameter description:
hdfs:///tmp/jars/spark-examples_2.11-2.4.8.jar: the storage path of the JAR package in HDFS.
org.apache.spark.examples.JavaSparkPi: the main class of the task in the compiled JAR package.
Other parameters are configured in the EMR cluster that is used. You can modify the parameters based on your business requirements. You can also run the following command to view the help documentation for the spark-submit command and modify the spark-submit command based on your business requirements.
Important: If you want to use a shorthand parameter supported by the spark-submit command, such as --executor-memory 2G, in an EMR Spark node, you must add the parameter to the code of the EMR Spark node. You can use Spark nodes on YARN to submit jobs only if your nodes are in cluster mode. If you commit a node by using spark-submit, we recommend that you set deploy-mode to cluster rather than client.
```
spark-submit --help
```
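The cluster-mode requirement above can be enforced at the point where the command is assembled. The following Python sketch builds the spark-submit argument list used in this example and rejects client mode; it only constructs the command and does not submit a job, and the helper function is an illustrative assumption rather than a DataWorks API.

```python
def build_spark_submit(main_class, jar_path, app_args, deploy_mode="cluster", conf=None):
    """Assemble a spark-submit command for YARN; cluster mode is required for scheduled nodes."""
    if deploy_mode != "cluster":
        raise ValueError("Set deploy-mode to cluster rather than client for scheduled EMR Spark nodes.")
    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]  # long-form --conf instead of shorthand flags
    cmd += ["--class", main_class, jar_path] + list(app_args)
    return cmd

cmd = build_spark_submit(
    "org.apache.spark.examples.JavaSparkPi",
    "hdfs:///tmp/jars/spark-examples_2.11-2.4.8.jar",
    ["100"],
    conf={"spark.executor.memory": "4g"},
)
print(" ".join(cmd))
```

Expressing memory and core settings through --conf key=value pairs sidesteps the shorthand-parameter caveat noted above, since every option is spelled out explicitly in the node code.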
Method 2: Reference an OSS resource
(Optional) Configure advanced parameters
You can configure Spark-specific parameters on the Advanced Settings tab of the configuration tab of the current node. For more information about how to configure the parameters, see Spark Configuration. The following table describes the advanced parameters that can be configured for different types of EMR clusters.
DataLake cluster or custom cluster: created on the EMR on ECS page
Advanced parameter | Description |
queue | The scheduling queue to which jobs are committed. Default value: default. If you have configured a workspace-level YARN queue when you register an EMR cluster to a DataWorks workspace, the following configurations apply:
For information about EMR YARN, see YARN schedulers. For information about queue configuration when you register an EMR cluster, see Configure a global YARN queue. |
priority | The priority. Default value: 1. |
FLOW_SKIP_SQL_ANALYZE | The manner in which SQL statements are executed. Valid values: true: Multiple SQL statements are executed at a time. false: Only one SQL statement is executed at a time.
Note: This parameter is available only for testing in the development environment of a DataWorks workspace. |
Others |
|
Hadoop cluster: created on the EMR on ECS page
Advanced parameter | Description |
queue | The scheduling queue to which jobs are committed. Default value: default. If you have configured a workspace-level YARN queue when you register an EMR cluster to a DataWorks workspace, the following configurations apply:
For information about EMR YARN, see YARN schedulers. For information about queue configuration when you register an EMR cluster, see Configure a global YARN queue. |
priority | The priority. Default value: 1. |
FLOW_SKIP_SQL_ANALYZE | The manner in which SQL statements are executed. Valid values: true: Multiple SQL statements are executed at a time. false: Only one SQL statement is executed at a time.
Note: This parameter is available only for testing in the development environment of a DataWorks workspace. |
USE_GATEWAY | Specifies whether to use a gateway cluster to commit jobs on the current node. Valid values: true: Use a gateway cluster to commit jobs. false: Do not use a gateway cluster to commit jobs.
Note: If the EMR cluster to which the node belongs is not associated with a gateway cluster but the USE_GATEWAY parameter is set to true, jobs may fail to be committed. |
Others |
|
Spark cluster: created on the EMR on ACK page
Advanced parameter | Description |
queue | This parameter is not supported. |
priority | This parameter is not supported. |
FLOW_SKIP_SQL_ANALYZE | The manner in which SQL statements are executed. Valid values: true: Multiple SQL statements are executed at a time. false: Only one SQL statement is executed at a time.
Note: This parameter is available only for testing in the development environment of a DataWorks workspace. |
Others |
|
EMR Serverless Spark cluster
For more information about parameter settings, see the Step 3: Submit a Spark task section of the "Use the spark-submit CLI to submit a Spark job" topic.
Advanced parameter | Description |
queue | The scheduling queue to which jobs are committed. Default value: dev_queue. |
priority | The priority. Default value: 1. |
FLOW_SKIP_SQL_ANALYZE | The manner in which SQL statements are executed. Valid values: true: Multiple SQL statements are executed at a time. false: Only one SQL statement is executed at a time.
Note: This parameter is available only for testing in the development environment of a DataWorks workspace. |
SERVERLESS_RELEASE_VERSION | The version of the Spark engine. By default, the value specified by the Default Engine Version parameter on the Register EMR Cluster page is used. To go to the Register EMR Cluster page, you can perform the following operations: Go to the SettingCenter page. In the left-side navigation pane, click Cluster Management. On the Cluster Management page, click Register Cluster and select E-MapReduce in the Select Cluster Type dialog box. You can configure this parameter to specify different engine versions for different types of tasks. |
SERVERLESS_QUEUE_NAME | The resource queue. By default, the value specified by the Default Resource Queue parameter on the Register EMR Cluster page is used. You can add queues to meet resource isolation and management requirements. For more information, see Manage resource queues. |
Others |
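For example, a node bound to an EMR Serverless Spark cluster could override the registered defaults as follows. This is a sketch with hypothetical values; the engine version string and queue name must match what is actually available in your workspace:

```
"priority": "1",
"FLOW_SKIP_SQL_ANALYZE": "false",
"SERVERLESS_RELEASE_VERSION": "esr-2.2",
"SERVERLESS_QUEUE_NAME": "dev_queue"
```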
Execute SQL statements
Click the Run with Parameters icon in the top toolbar. In the Parameters dialog box, select a created resource group for scheduling and click Run.
Note: If you want to access a computing source over the Internet or a virtual private cloud (VPC), use a resource group for scheduling that is connected to the computing source. For more information, see Network connectivity solutions. If you want to change the resource group in subsequent operations, you can click the Run with Parameters icon again and change the resource group in the Parameters dialog box.
If you use an EMR Spark node to query data, a maximum of 10,000 data records can be returned, and the total size of the returned data records cannot exceed 10 MB.
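For reference, the statements you run on the node might look like the following minimal sketch. The table and column names are hypothetical, and ${bizdate} assumes that a scheduling parameter named bizdate is defined for the node:

```sql
-- Hypothetical source table; replace with your own.
-- ${bizdate} is replaced with the value of the bizdate scheduling parameter at run time.
SELECT uid, COUNT(*) AS pv
FROM dwd_log_detail
WHERE ds = '${bizdate}'
GROUP BY uid
LIMIT 100;
```

Because query results returned in DataWorks are truncated at 10,000 records and 10 MB, add a LIMIT clause for exploratory queries and write complete results to a table instead.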
Click the Save icon in the top toolbar to save the SQL statements.
Optional. Perform smoke testing.
You can perform smoke testing on the node in the development environment when you commit the node or after you commit the node. For more information, see Perform smoke testing.
Step 3: Configure scheduling properties
If you want the system to periodically run a task on the node, you can click Properties in the right-side navigation pane on the configuration tab of the node to configure task scheduling properties based on your business requirements. For more information, see Overview.
You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.
Step 4: Deploy the task
After a task on a node is configured, you must commit and deploy the task. After you commit and deploy the task, the system runs the task on a regular basis based on scheduling configurations.
Click the Save icon in the top toolbar to save the task. Then, click the Submit icon in the top toolbar to commit the task. In the Submit dialog box, configure the Change description parameter. Then, determine whether to review the task code after you commit the task based on your business requirements.
Note: You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.
You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.
If you use a workspace in standard mode, you must deploy the task in the production environment after you commit the task. To deploy a task on a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploy nodes.
What to do next
After a task is deployed, it is automatically added to Operation Center. You can view the task running status in Operation Center or manually trigger the task to run. For more information, see Operation Center.
FAQ
After I prepare the DataWorks environment and submit an EMR Hive job, the java.net.ConnectException: Connection timed out (Connection timed out) error occurs. What do I do?
Check whether the EMR cluster and the DataWorks environment are configured as required in the documentation, and confirm that the DataWorks resource group and the EMR cluster are associated with the same VPC and vSwitch.
Check the security group rules of the EMR cluster to ensure that port 10000 of the ECS instance is open. For more information, see Manage security groups. When you submit jobs of other components in DataWorks, you must open the corresponding ECS ports. For more information, see Commonly used ports of open source components.
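To narrow down a Connection timed out error, you can first verify TCP reachability from a machine in the same VPC as the resource group. The following is a minimal sketch in Python; the host name is hypothetical and must be replaced with the address of your EMR master node (HiveServer2 listens on port 10000 by default):

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the full TCP handshake, so a True
        # result means the port is reachable and not filtered.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical EMR master node address; replace with your own:
# port_open("emr-master-1.cluster-example", 10000)
```

If the check fails from the resource group's network but succeeds from inside the cluster, the security group rules or the VPC and vSwitch association described above are the likely cause.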
References
If your task needs to be periodically scheduled to run, you need to define the scheduling-related properties of the task, including the scheduling cycle, scheduling dependencies, and scheduling parameters. For more information, see Node scheduling configuration.
If your task requires complex string processing or mathematical operations, you can create user-defined functions in DataWorks. For more information, see Create an EMR function.