
E-MapReduce: Submit a Spark job

Last Updated: Aug 29, 2023

Alibaba Cloud E-MapReduce (EMR) allows you to submit a Spark job by using a CustomResourceDefinition (CRD), by running the spark-submit command, or in the EMR console. This topic describes each of these methods.

Prerequisites

A Spark cluster is created on the EMR on ACK page. For more information, see Create a cluster.

Precautions

In this topic, the desired JAR file is packaged into an image. If you are using your own JAR file, you can upload the JAR file to Alibaba Cloud Object Storage Service (OSS). For more information about how to upload a file, see Simple upload.

In that case, replace local:///opt/spark/examples/spark-examples.jar in the commands with the actual OSS path of the JAR file, in the oss://<yourBucketName>/<path>.jar format.
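For example, in the SparkApplication manifest used in Method 1, the mainApplicationFile field would then point to OSS instead of the image, and the JAR path in the spark-submit commands would change in the same way (the bucket name and path are placeholders):

    mainApplicationFile: "oss://<yourBucketName>/<path>.jar"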

Submit a Spark job

Method 1: Submit a Spark job by using a CRD

  1. Connect to an Alibaba Cloud Container Service for Kubernetes (ACK) cluster by using kubectl. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.

  2. Create a job file named spark-pi.yaml. The following code shows the content in the file:

    apiVersion: "sparkoperator.k8s.io/v1beta2"
    kind: SparkApplication
    metadata:
      name: spark-pi-simple
    spec:
      type: Scala
      sparkVersion: 3.2.1
      mainClass: org.apache.spark.examples.SparkPi
      mainApplicationFile: "local:///opt/spark/examples/spark-examples.jar"
      arguments:
        - "1000"
      driver:
        cores: 1
        coreLimit: 1000m
        memory: 4g
      executor:
        cores: 1
        coreLimit: 1000m
        memory: 8g
        memoryOverhead: 1g
        instances: 1

    For information about the fields in the code, see spark-on-k8s-operator.

    Note
    • You can specify a custom file name. In this example, spark-pi.yaml is used.

    • In this example, Spark 3.2.1 for EMR V5.6.0 is used. If you use another version of Spark, configure the sparkVersion parameter based on your business requirements.
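    Optionally, you can validate the manifest before you submit it by performing a client-side dry run. This is standard kubectl behavior and does not create any resources:

    kubectl apply -f spark-pi.yaml --dry-run=client --namespace <Namespace in which the cluster resides>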

  3. Run the following command to submit a job:

    kubectl apply -f spark-pi.yaml --namespace <Namespace in which the cluster resides>

    Replace <Namespace in which the cluster resides> with the actual namespace of your cluster. To view the namespace, log on to the EMR console and go to the Cluster Details tab.

    The following information is returned:

    sparkapplication.sparkoperator.k8s.io/spark-pi-simple created
    Note

    spark-pi-simple is the name of the submitted Spark job.
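    Because the job is represented by a SparkApplication custom resource, you can also inspect or stop it with standard kubectl commands. A minimal sketch:

    # Check the state of the job as reported by the Spark operator.
    kubectl get sparkapplication spark-pi-simple --namespace <Namespace in which the cluster resides>
    # Terminate the job by deleting the custom resource.
    kubectl delete sparkapplication spark-pi-simple --namespace <Namespace in which the cluster resides>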

  4. Optional. View the information about the submitted Spark job on the Job Details tab.

Method 2: Submit a Spark job by running the spark-submit command

  1. Connect to an Alibaba Cloud ACK cluster by using kubectl. For more information, see Obtain the kubeconfig file of a cluster and use kubectl to connect to the cluster.

  2. Run the following commands to download the emr-spark-ack tool that is provided by EMR and grant execute permissions on the tool:

    wget https://ecm-repo-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/emr-on-ack/util/emr-spark-ack
    chmod 755 emr-spark-ack
  3. Submit a Spark job by using the emr-spark-ack tool.

    Syntax:

     ./emr-spark-ack -n <Namespace in which the cluster resides> <Spark command>
    Note

    You can replace <Spark command> with a spark-submit, spark-sql, spark-shell, or pyspark command, depending on how you want to run Spark.

    • Cluster mode:

      Run the spark-submit command to submit a job named spark-pi-submit. Syntax:

      ./emr-spark-ack -n <Namespace in which the cluster resides> spark-submit \
          --name spark-pi-submit \
          --deploy-mode cluster \
          --class org.apache.spark.examples.SparkPi \
          local:///opt/spark/examples/spark-examples.jar \
          1000
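      Because <Spark command> here is an ordinary spark-submit command, standard spark-submit options such as --conf should pass through as usual. The following sketch adds example resource settings; the values are illustrative, not recommendations:

      ./emr-spark-ack -n <Namespace in which the cluster resides> spark-submit \
          --name spark-pi-submit \
          --deploy-mode cluster \
          --class org.apache.spark.examples.SparkPi \
          --conf spark.executor.instances=2 \
          --conf spark.executor.memory=4g \
          local:///opt/spark/examples/spark-examples.jar \
          1000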
    • Client mode:

      • Run the spark-sql command to submit a Spark job. Syntax:

        # Prepare an SQL file on your on-premises machine.
        echo "select 1+1">test.sql
        # Submit a Spark job.
        ./emr-spark-ack -n <Namespace in which the cluster resides> spark-sql -f test.sql

        In Spark 3 or later for EMR V5.X, the emr-spark-ack tool automatically uploads the local files that a submit command depends on, including files specified by options such as --jars, --files, and -f, to the Spark cluster. This lets you submit a Spark job that references local files to an ACK cluster, as shown in the following example.

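        For example, a job that depends on an additional local JAR could be submitted as follows; my-udfs.jar is a hypothetical local file, and both it and test.sql would be uploaded automatically:

        ./emr-spark-ack -n <Namespace in which the cluster resides> spark-sql \
            --jars my-udfs.jar \
            -f test.sql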

      • Run the spark-shell command to submit a Spark job:

        ./emr-spark-ack -n <Namespace in which the cluster resides> spark-shell

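        After the shell starts, you can run standard Spark code interactively. A minimal sketch that uses the SparkContext (sc) provided by spark-shell:

        // Distribute a small dataset across the cluster and aggregate it.
        val rdd = sc.parallelize(1 to 1000)
        println(rdd.map(_ * 2).sum())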

  4. Optional. View the information about the submitted job on the Job Details tab.

  5. Optional. Use the emr-spark-ack tool to terminate the Spark job.

    Syntax:

     ./emr-spark-ack -n <Namespace in which the cluster resides> kill <Spark_app_id>
    Note

    The emr-spark-ack tool generates the Spark application ID (<Spark_app_id>) when you submit a job. You can find the ID in the output log of the submission.
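    For example, if the output log showed an application ID of spark-a1b2c3 (a hypothetical value; use the ID from your own log), you would run:

    ./emr-spark-ack -n <Namespace in which the cluster resides> kill spark-a1b2c3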

Method 3: Submit a Spark job in the EMR console

  1. Go to the Access Links and Ports tab.

    1. Log on to the EMR console. In the left-side navigation pane, click EMR on ACK.

    2. On the EMR on ACK page, find the desired cluster and click the name of the cluster in the Cluster ID/Name column.

    3. On the page that appears, click the Access Links and Ports tab.

  2. On the Access Links and Ports tab, click the link that corresponds to SparkSubmitGateway UI in the Access URL column.

    Then, you can go to the Shell terminal.

  3. In the Shell terminal, run one of the following commands:

    • spark-sql command:

      spark-sql

      After you run the spark-sql command, you can perform interactive queries.

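      For example, you can run ordinary Spark SQL statements directly in the session; the following statements are illustrative:

      SELECT 1 + 1;
      SHOW DATABASES;
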
    • spark-submit command:

      spark-submit \
          --name spark-pi-submit \
          --deploy-mode cluster \
          --class org.apache.spark.examples.SparkPi \
          local:///opt/spark/examples/spark-examples.jar \
          1000
  4. Optional. View the information about the submitted Spark job on the Job Details tab.
