DataWorks:Create an EMR Spark node

Last Updated:Mar 26, 2026

DataWorks EMR Spark nodes let you schedule and run Apache Spark tasks on EMR clusters without managing infrastructure. This topic walks you through creating a node, configuring your JAR package reference, and deploying the task for periodic execution.

Prerequisites

Before you begin, ensure that you have:

  • An EMR cluster registered with DataWorks. See DataStudio (legacy version): Bind an EMR computing resource.

  • A resource group purchased and configured — workspace binding and network connectivity must be complete. See Use a Serverless resource group.

  • A business flow created. All DataStudio development is organized around business flows. See Create a business flow.

  • (Optional) RAM user permissions. If developing with a Resource Access Management (RAM) user account, add the user to the workspace with the Development or Workspace Manager role. See Add members to a workspace.

  • (Optional) A custom image. If your task requires a specific runtime environment — for example, to replace Spark JAR packages or bundle specific libraries, files, or JAR packages — create a custom image based on the official dataworks_emr_base_task_pod image. See Custom images.

Limitations

  • EMR Spark nodes run only on a serverless resource group or an exclusive resource group for scheduling. A serverless resource group is recommended. To use a custom image in DataStudio, use a serverless resource group.

  • For DataLake or custom clusters, configure EMR-HOOK on the cluster to enable real-time metadata visibility, audit logs, data lineage, and EMR-related data administration in DataWorks. See Configure EMR-HOOK for Spark SQL.

  • EMR on ACK Spark clusters do not support viewing data lineage. EMR Serverless Spark clusters support viewing data lineage.

  • EMR on ACK Spark clusters and EMR Serverless Spark clusters support OSS resource references via ossref and uploading resources to OSS only. Uploading to Hadoop Distributed File System (HDFS) is not supported.

  • DataLake clusters and custom clusters support ossref references, uploading to OSS, and uploading to HDFS.

Usage notes

If Ranger access control is enabled for Spark in the bound EMR cluster:

  • Tasks using the default image support Ranger access control automatically.

  • Tasks using a custom image require an image upgrade. Submit a ticket to request the upgrade.

Step 1: Create an EMR Spark node

  1. Go to the DataStudio page. Log on to the DataWorks console. In the top navigation bar, select a region. In the left-side navigation pane, choose Data Development and O&M > Data Development. Select your workspace from the drop-down list and click Go to Data Development.

  2. Create an EMR Spark node.

    1. Right-click the target business flow and choose Create Node > EMR > EMR Spark.

      Note: You can also hover over Create and choose Create Node > EMR > EMR Spark.

    2. In the Create Node dialog box, set Name, Engine Instance, Node Type, and Path, then click Confirm.

      Note: The node name can contain uppercase letters, lowercase letters, Chinese characters, digits, underscores (_), and periods (.).

Step 2: Develop the Spark task

Before scheduling an EMR Spark task, you need a compiled JAR package. See Spark overview for how to develop and compile one.

Double-click the created node to open the task development page, then choose one of the following approaches based on where your JAR package is stored:

| Approach | When to use |
| --- | --- |
| Upload and reference an EMR JAR resource | The JAR package is available locally (under 500 MB), or you want to store it in HDFS for larger packages |
| Reference an OSS resource directly | The JAR package is already in OSS, or you want DataWorks to auto-load it from OSS at runtime |
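
The choice between these approaches can be sketched as a small shell helper. This is only an illustration: it assumes "500 MB" means 500 × 1024 × 1024 bytes (this topic does not say), the cluster type labels are hypothetical, and falling back to OSS for any cluster type other than DataLake or custom is a conservative default chosen for the sketch.

```shell
#!/bin/sh
# Sketch: suggest where to store a JAR package based on its size and cluster type.
# Assumption: "500 MB" is read as 500 * 1024 * 1024 bytes.
LIMIT_BYTES=$((500 * 1024 * 1024))

# choose_jar_path SIZE_BYTES CLUSTER_TYPE
# CLUSTER_TYPE is a hypothetical label: datalake | custom | ack | serverless
choose_jar_path() {
  size=$1
  cluster=$2
  # Packages under the limit can be uploaded as a DataWorks EMR JAR resource.
  if [ "$size" -lt "$LIMIT_BYTES" ]; then
    echo "emr-jar-resource"
    return 0
  fi
  # Larger packages go to HDFS where it is supported, otherwise to OSS.
  case "$cluster" in
    datalake|custom) echo "hdfs" ;;
    *) echo "oss" ;;
  esac
}

# Example: a 120 MB JAR on a DataLake cluster.
choose_jar_path $((120 * 1024 * 1024)) datalake
```

To check a real file, pass its size in bytes, for example the output of `wc -c < your.jar`.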

Scenario 1: Upload and reference an EMR JAR resource

Choose the upload path based on JAR package size.

JAR package smaller than 500 MB

  1. Create an EMR JAR resource. Upload the JAR package as a DataWorks EMR JAR resource. This keeps the resource managed within the DataWorks console. After creating the resource, submit it. See Create and use EMR resources for details.

    When uploading a JAR package to OSS for the first time, complete the authorization as prompted.

  2. Reference the EMR JAR resource in the node.

    1. Double-click the EMR Spark node to open its code editor.

    2. In the EMR > Resource panel, right-click the uploaded EMR JAR resource and select Insert Resource Path. A resource reference block is automatically inserted into the editor:

      ##@resource_reference{"spark-examples_2.12-1.0.0-SNAPSHOT-shaded.jar"}
      spark-examples_2.12-1.0.0-SNAPSHOT-shaded.jar

    3. Add the spark-submit command below the resource reference. Example:

      ##@resource_reference{"spark-examples_2.11-2.4.0.jar"}
      spark-submit --class org.apache.spark.examples.SparkPi --master yarn spark-examples_2.11-2.4.0.jar 100

      | Parameter | Required | Description |
      | --- | --- | --- |
      | --class | Yes | Fully qualified name of the main class in the JAR package, e.g., org.apache.spark.examples.SparkPi |
      | --master yarn | Yes | Submits the job to YARN. Spark nodes support YARN cluster mode only. |
      | JAR name | Yes | Must match the name of the uploaded EMR JAR resource |
      | --executor-memory, etc. | No | Additional Spark parameters, added directly to the command, e.g., --executor-memory 2G |

      Note

      The EMR Spark node editor does not support comments. Do not add comments to the task code, as this causes execution errors.

      Set --deploy-mode cluster. The client deploy mode is not recommended for scheduled tasks.

      Run spark-submit --help to view all available parameters.
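
As a rough illustration of the points above, the following shell sketch prints a node body that pairs the resource reference with a spark-submit command in cluster deploy mode. The function name and its arguments are hypothetical. The comments belong to the generator script only; the text it prints contains none, because the node editor does not support them.

```shell
#!/bin/sh
# Sketch: print the editor content for an EMR Spark node that references an
# uploaded EMR JAR resource. Function and variable names are hypothetical.

# render_node_body JAR_NAME MAIN_CLASS [APP_ARGS...]
render_node_body() {
  jar=$1
  main_class=$2
  shift 2
  # The resource reference must name the uploaded EMR JAR resource.
  printf '##@resource_reference{"%s"}\n' "$jar"
  # Cluster deploy mode is set explicitly, as the note recommends.
  printf 'spark-submit --class %s --master yarn --deploy-mode cluster %s' "$main_class" "$jar"
  # Append application arguments, if any.
  for arg in "$@"; do
    printf ' %s' "$arg"
  done
  printf '\n'
}

render_node_body spark-examples_2.11-2.4.0.jar org.apache.spark.examples.SparkPi 100
```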

JAR package 500 MB or larger

JAR packages 500 MB or larger cannot be uploaded as DataWorks resources. Store the JAR package in HDFS and reference it by path.

EMR on ACK Spark clusters and EMR Serverless Spark clusters do not support HDFS. For these cluster types, upload the JAR to OSS and use Scenario 2 instead.

  1. Store the JAR package in the HDFS of your EMR cluster and record its path, for example: hdfs:///tmp/jars/spark-examples_2.11-2.4.8.jar.

  2. Double-click the EMR Spark node to open its code editor, and write the spark-submit command referencing the HDFS path:

    | Parameter | Required | Description |
    | --- | --- | --- |
    | --master yarn | Yes | Submits to YARN |
    | --deploy-mode cluster | Yes | Runs the driver on the cluster. Set to cluster, not client |
    | --class | Yes | Fully qualified main class name, e.g., org.apache.spark.examples.JavaSparkPi |
    | HDFS path | Yes | Full HDFS path to the JAR package, e.g., hdfs:///tmp/jars/spark-examples_2.11-2.4.8.jar |
    | --driver-memory, --executor-memory, etc. | No | Configure based on your cluster resources |

    spark-submit --master yarn
    --deploy-mode cluster
    --name SparkPi
    --driver-memory 4G
    --driver-cores 1
    --num-executors 5
    --executor-memory 4G
    --executor-cores 1
    --class org.apache.spark.examples.JavaSparkPi
    hdfs:///tmp/jars/spark-examples_2.11-2.4.8.jar 100

    Run spark-submit --help to view all available parameters.
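
Step 1 above (storing the JAR package in HDFS) could look roughly like the following sketch. It assumes a host with the Hadoop client configured for your EMR cluster and a local copy of the JAR; the helper names are hypothetical. DRY_RUN defaults to 1 so the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: copy a large JAR package into HDFS so a node can reference it by path.
# DRY_RUN defaults to 1, so the commands are only printed; set DRY_RUN=0 on a
# host with the Hadoop client configured to execute them.
DRY_RUN=${DRY_RUN:-1}

# run CMD [ARGS...]: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# upload_jar_to_hdfs LOCAL_JAR HDFS_DIR
upload_jar_to_hdfs() {
  local_jar=$1
  hdfs_dir=$2
  # Create the target directory if it does not exist.
  run hdfs dfs -mkdir -p "$hdfs_dir"
  # Upload the JAR, overwriting an existing copy.
  run hdfs dfs -put -f "$local_jar" "$hdfs_dir/"
  # List the directory to confirm the file is present.
  run hdfs dfs -ls "$hdfs_dir"
}

upload_jar_to_hdfs ./spark-examples_2.11-2.4.8.jar /tmp/jars
```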

Scenario 2: Reference an OSS resource directly

EMR Spark nodes support referencing JAR packages stored in OSS directly via ossref://. At runtime, DataWorks automatically downloads the OSS resource before executing the task. This approach is useful when tasks depend on JAR files or scripts stored in OSS.
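
The ossref:// path format given later in this scenario, ossref://{endpoint}/{bucket}/{object} with an optional endpoint, can be assembled with a small helper. The sketch below is illustrative only and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: build an ossref:// path. The function name is hypothetical.
# Format from this topic: ossref://{endpoint}/{bucket}/{object}, endpoint optional.

# build_ossref BUCKET OBJECT [ENDPOINT]
build_ossref() {
  bucket=$1
  object=${2#/}      # drop one leading slash from the object key, if present
  endpoint=${3:-}
  if [ -n "$endpoint" ]; then
    echo "ossref://$endpoint/$bucket/$object"
  else
    # Without an endpoint, the bucket must be in the same region as the EMR cluster.
    echo "ossref://$bucket/$object"
  fi
}

build_ossref onaliyun-bucket-2 emr/jars/SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar
```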

Step 1: Build the JAR package

  1. Prepare code dependencies. Check the required Spark dependencies on your EMR cluster's master node at /usr/lib/emr/spark-current/jars/. The following example uses Spark 3.4.2. Add the following dependencies to your pom.xml:

    Add pom dependencies

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>3.4.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>3.4.2</version>
        </dependency>
    </dependencies>

    Reference related plug-ins

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.7.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <configuration>
                    <recompileMode>incremental</recompileMode>
                </configuration>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
  2. Write the Spark application code. Example:

    package com.aliyun.emr.example.spark
    
    import org.apache.spark.sql.SparkSession
    
    object SparkMaxComputeDemo {
      def main(args: Array[String]): Unit = {
        // Create a SparkSession
        val spark = SparkSession.builder()
          .appName("HelloDataWorks")
          .getOrCreate()
    
        // Print the Spark version
        println(s"Spark version: ${spark.version}")

        // Stop the session to release cluster resources when the job is done
        spark.stop()
      }
    }
  3. Package the code into a JAR file. Build the project to produce a JAR package. This example generates SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar.
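
The packaging step might be run from the command line as sketched below. The artifactId (SparkWorkOSS) and version (1.0-SNAPSHOT) are assumptions inferred from the JAR name in this example, since the corresponding part of pom.xml is not shown. DRY_RUN defaults to 1 so the Maven command is only printed.

```shell
#!/bin/sh
# Sketch: build the project and locate the assembled JAR.
# Assumption: artifactId is SparkWorkOSS and version is 1.0-SNAPSHOT, inferred
# from the JAR name in this example; adjust both to match your pom.xml.
DRY_RUN=${DRY_RUN:-1}

# assembled_jar ARTIFACT_ID VERSION
# The jar-with-dependencies descriptor of maven-assembly-plugin appends that
# suffix to the default artifactId-version file name under target/.
assembled_jar() {
  echo "target/$1-$2-jar-with-dependencies.jar"
}

if [ "$DRY_RUN" = "1" ]; then
  echo "mvn clean package"
else
  # The assembly plugin is bound to the package phase in the build section.
  mvn clean package
fi

assembled_jar SparkWorkOSS 1.0-SNAPSHOT
```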

Step 2: Upload the JAR package to OSS

  1. Log on to the OSS console. In the left navigation pane, click Buckets.

  2. Click the target bucket name to open the Object Management page. This example uses the bucket onaliyun-bucket-2.

  3. Click Create Directory and set Directory Name to emr/jars.

  4. Navigate to the emr/jars directory, click Upload Object, and select SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar.
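
If you prefer the command line to the console, the same upload can be sketched with the ossutil CLI. This assumes ossutil is installed and already configured with credentials and an endpoint, which this topic does not cover; the helper names are hypothetical. DRY_RUN defaults to 1 so the commands are only printed.

```shell
#!/bin/sh
# Sketch: upload the JAR with the ossutil CLI instead of the console.
# Assumption: ossutil is installed and configured; set DRY_RUN=0 to execute.
DRY_RUN=${DRY_RUN:-1}

# run CMD [ARGS...]: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# upload_jar_to_oss LOCAL_JAR BUCKET DIRECTORY
upload_jar_to_oss() {
  # Copy the local file into the target directory of the bucket.
  run ossutil cp "$1" "oss://$2/$3/"
  # List the directory to confirm the object exists.
  run ossutil ls "oss://$2/$3/"
}

upload_jar_to_oss target/SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar onaliyun-bucket-2 emr/jars
```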

Step 3: Reference the JAR package in the node

  1. On the editor page of the EMR Spark node, enter the spark-submit command with the ossref:// path:

    | Parameter | Required | Description |
    | --- | --- | --- |
    | --class | Yes | Fully qualified main class name, e.g., com.aliyun.emr.example.spark.SparkMaxComputeDemo |
    | --master yarn | Yes | Submits the job to YARN |
    | ossref path | Yes | Format: ossref://{endpoint}/{bucket}/{object}. The endpoint is optional; if omitted, the OSS bucket must be in the same region as the EMR cluster. Log on to the OSS console to find your bucket name and object path. |

    spark-submit --class com.aliyun.emr.example.spark.SparkMaxComputeDemo --master yarn ossref://onaliyun-bucket-2/emr/jars/SparkWorkOSS-1.0-SNAPSHOT-jar-with-dependencies.jar
  2. Click the Run With Parameters icon and select the serverless resource group to run the node. After the task completes, note the applicationId printed in the console, for example, application_1730367929285_xxxx.

  3. Verify the result. Create an EMR Shell node and run the following command to view the task output:

    yarn logs -applicationId application_1730367929285_xxxx
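
If you saved the run output as text, the applicationId can be pulled out with grep rather than copied by hand. The sketch below is a minimal illustration: the sample log line is hypothetical and the exact wording in your console output may differ.

```shell
#!/bin/sh
# Sketch: extract a YARN application ID from log text and build the
# corresponding yarn logs command. The function name is hypothetical.

# extract_app_id: read log text on stdin, print the first application ID found.
extract_app_id() {
  grep -Eo 'application_[0-9]+_[0-9]+' | head -n 1
}

# Hypothetical log line, for illustration only.
sample_log='INFO Client: Submitted application application_1730367929285_0042'

app_id=$(printf '%s\n' "$sample_log" | extract_app_id)
echo "yarn logs -applicationId $app_id"
```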

(Optional) Configure advanced parameters

Configure Spark-specific properties in the Advanced Settings panel of the node. DataWorks formats each custom parameter as --conf key=value before sending it to the EMR cluster. For the full list of Spark properties, see Spark configuration.
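
The --conf key=value formatting described above can be pictured with the following sketch. It only illustrates the resulting flag layout and is not DataWorks' actual implementation; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: turn key=value pairs into the --conf flags that spark-submit accepts.
# Illustrative only; this is not how DataWorks itself is implemented.

# to_conf_flags KEY=VALUE...
to_conf_flags() {
  out=""
  for kv in "$@"; do
    out="$out --conf $kv"
  done
  # Strip the leading space added by the loop.
  echo "${out# }"
}

to_conf_flags spark.eventLog.enabled=false spark.executor.memory=2g
```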

Available parameters vary by cluster type.

DataLake cluster / Custom cluster (EMR on ECS)

| Parameter | Default | Description |
| --- | --- | --- |
| queue | default | The YARN scheduling queue. If a workspace-level YARN Resource Queue is configured during EMR cluster registration: when Global Settings Take Precedence is enabled, that queue applies; otherwise, this node's setting applies. See Basic queue configuration and Set a global YARN resource queue. |
| priority | 1 | Job priority. |
| FLOW_SKIP_SQL_ANALYZE | false | SQL execution mode. true: run multiple SQL statements at once. false: run one SQL statement at a time. Available in the development environment only. |
| Other | | Add any custom Spark parameters, for example, "spark.eventLog.enabled":false. To enable Ranger access control, add spark.hadoop.fs.oss.authorization.method=ranger in Set global Spark parameters. |
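
The queue precedence rule in the table can be read as the following sketch. It is one interpretation of the description above, not DataWorks' implementation, and the function and argument names are hypothetical.

```shell
#!/bin/sh
# Sketch: which YARN queue applies to a node, following the rule in the table.
# One reading of the documented behavior; names are hypothetical.

# effective_queue GLOBAL_QUEUE GLOBAL_PRECEDENCE NODE_QUEUE
effective_queue() {
  global_queue=$1          # workspace-level YARN Resource Queue, empty if unset
  global_precedence=$2     # "true" when Global Settings Take Precedence is enabled
  node_queue=${3:-default} # the node's queue parameter; its default is "default"
  if [ -n "$global_queue" ] && [ "$global_precedence" = "true" ]; then
    echo "$global_queue"
  else
    echo "$node_queue"
  fi
}

effective_queue prod_queue true etl_queue
```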

EMR Serverless Spark cluster

For the full parameter reference, see Set parameters for submitting a Spark task.

| Parameter | Default | Description |
| --- | --- | --- |
| queue | dev_queue | The scheduling queue. |
| priority | 1 | Job priority. |
| FLOW_SKIP_SQL_ANALYZE | | Same as above. Available in the development environment only. |
| SERVERLESS_RELEASE_VERSION | Default Engine Version (Cluster Management) | The Spark engine version. Set here to override the cluster-level default. |
| SERVERLESS_QUEUE_NAME | Default Resource Queue (SettingCenter > Cluster Management) | The resource queue. Set here if you have resource isolation requirements. See Manage resource queues. |
| Other | | Add custom Spark parameters, for example, "spark.eventLog.enabled":false. See Set global Spark parameters. |

Spark cluster (EMR on ACK)

| Parameter | Default | Description |
| --- | --- | --- |
| queue | | Not supported. |
| priority | | Not supported. |
| FLOW_SKIP_SQL_ANALYZE | | Same as above. Available in the development environment only. |
| Other | | Add custom Spark parameters, for example, "spark.eventLog.enabled":false. See Set global Spark parameters. |

Hadoop cluster (EMR on ECS)

| Parameter | Default | Description |
| --- | --- | --- |
| queue | default | The YARN scheduling queue. Same workspace-level override behavior as the DataLake cluster. |
| priority | 1 | Job priority. |
| FLOW_SKIP_SQL_ANALYZE | false | Same as above. Available in the development environment only. |
| USE_GATEWAY | | true: submit the job through a Gateway cluster. false: submit to the header node. If the cluster has no associated Gateway cluster, setting this to true causes submission failure. |
| Other | | Add custom Spark parameters. To enable Ranger access control, add spark.hadoop.fs.oss.authorization.method=ranger in Set global Spark parameters. |

Run the task

  1. Click the Run With Parameters icon. In the Parameters dialog box, select the scheduling resource group and click Run.

    - The resource group must have passed the network connectivity test with the computing resources. See Network connectivity solutions.
    - To change the resource group, click the Run With Parameters icon and select another resource group.
    - Query results are capped at 10,000 records and 10 MB total.

  2. Click the Save icon to save the node.

  3. (Optional) Run smoke testing in the development environment before committing the node. See Perform smoke testing.

Step 3: Configure node scheduling

To run the task on a schedule, click Properties in the right-side navigation pane and configure the scheduling properties. See Overview for the full scheduling configuration reference.

Configure Rerun and Parent Nodes on the Properties tab before committing.
If your task requires a custom runtime environment, create a custom image based on dataworks_emr_base_task_pod and use it in DataStudio.

Step 4: Publish the node task

  1. Click the Save icon to save the task.

  2. Click the Submit icon to commit the task. In the Submit dialog box, fill in Change description. Optionally enable code review to require approval before deployment. See Code review.

  3. If your workspace is in standard mode, click Deploy in the upper-right corner to deploy the task to the production environment. See Deploy nodes.

What's next

After you deploy the task, it runs automatically on the configured schedule. To monitor execution, click Operation Center in the upper-right corner to view scheduling status and task history. See View and manage auto triggered tasks.

FAQ

Why does the DlfMetaStoreClientFactory not found error occur when running spark-submit in YARN cluster mode after enabling Kerberos?

See this FAQ entry for the resolution.

Why does a connection timeout occur when running a node?

Check network connectivity between the resource group and the EMR cluster. Go to the computing resource list page, click Re-initialize, and verify that initialization succeeds.

References