This topic describes how to use the spark-submit command-line interface (CLI) and provides sample code.

Prerequisites

The spark-submit package (dla-spark-toolkit.tar.gz) is downloaded and decompressed.

You can click the download link to obtain the dla-spark-toolkit.tar.gz package, or run the following wget command to download it:
wget https://dla003.oss-cn-hangzhou.aliyuncs.com/dla_spark_toolkit_1/dla-spark-toolkit.tar.gz
After you download the package, run the following command to decompress it:
tar zxvf dla-spark-toolkit.tar.gz
Note To use the spark-submit CLI, make sure that JDK 8 or later is installed.
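You can run the following command to check the installed JDK version. The output must show version 1.8 (JDK 8) or later:
java -version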

Procedure

  1. View help information.
    • Command used to view help information:
      cd /path/to/dla-spark-toolkit
      ./bin/spark-submit --help
    • Command output:
      Usage: spark-submit [options] <app jar> [app arguments]
      Usage: spark-submit --list [PAGE_NUMBER] [PAGE_SIZE]
      Usage: spark-submit --kill [JOB_ID] 
      
      Options:
        --keyId                     Your ALIYUN_ACCESS_KEY_ID, required
        --secretId                  Your ALIYUN_ACCESS_KEY_SECRET, required
        --regionId                  Your Cluster Region Id, required
        --vcName                    Your Virtual Cluster Name, required
        --oss-keyId                 Your ALIYUN_ACCESS_KEY_ID to upload local resource to oss.
                                    The default is the same as --keyId
        --oss-secretId              Your ALIYUN_ACCESS_KEY_SECRET, the default is the same as --secretId
        --oss-endpoint              Oss endpoint where the resource will upload. The default is http://oss-$regionId.aliyuncs.com
        --oss-upload-path           The user oss path where the resource will upload
                                    If you want to upload a local jar package to the OSS directory,
                                    you need to specify this parameter
      
      
      
        --class CLASS_NAME          Your application's main class (for Java / Scala apps).
        --name NAME                 A name of your application.
        --jars JARS                 Comma-separated list of jars to include on the driver
                                    and executor classpaths.
      
        --conf PROP=VALUE           Arbitrary Spark configuration property
        --help, -h                  Show this help message and exit.
        --driver-resource-spec      Indicates the resource specifications used by the driver:
                                    small | medium | large
        --executor-resource-spec    Indicates the resource specifications used by the executor:
                                    small | medium | large
        --num-executors             Number of executors to launch
        --properties-file           spark-defaults.conf properties file location, only local files are supported
                                    The default is ${SPARK_SUBMIT_TOOL_HOME}/conf/spark-defaults.conf
        --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                                    on the PYTHONPATH for Python apps.
        --files FILES               Comma-separated list of files to be placed in the working
                                    directory of each executor. File paths of these files
                                    in executors can be accessed via SparkFiles.get(fileName).
                                    Specially, you can pass in a custom log output format file named `log4j.properties`
                                    Note: The file name must be `log4j.properties` to take effect
      
      
        --status job_id             If given, requests the status and details of the job specified
        --verbose                   print more messages, enable spark-submit print job status and more job details.
      
        List Spark Job Only:
        --list                      List Spark Job, should use specify --vcName and --regionId
        --pagenumber, -pn           Set page number which want to list (default: 1)
        --pagesize, -ps             Set page size which want to list (default: 10)
      
        Get Job Log Only:
        --get-log job_id            Get job log
      
        Kill Spark Job Only:
        --kill job_id,job_id        Comma-separated list of job to kill spark job with specific ids
      
        Spark Offline SQL options:
        -e <quoted-query-string>    SQL from command line
        -f <filename>               SQL from files
    • Description for the exit code of a spark-submit job:
      255    #Indicates that the job fails.
      0      #Indicates that the job succeeds.
      143    #Indicates that the job is terminated.
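    • A minimal sketch of acting on the exit code in a shell script. It assumes that the required connection parameters are configured in spark-defaults.conf; the class name and JAR path are placeholders:
      ./bin/spark-submit --class com.example.Main oss://{bucket-name}/jars/xxx.jar
      ret=$?
      if [ $ret -eq 0 ]; then
          echo "The job succeeded."
      elif [ $ret -eq 143 ]; then
          echo "The job was terminated."
      else
          echo "The job failed."
      fi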
  2. Use the spark-defaults.conf file to configure common parameters.
    The spark-defaults.conf file allows you to configure the following parameters. In the spark conf section, only common parameters are listed. A filled-in example follows the note at the end of this step.
    #  cluster information
    # AccessKeyId
    #keyId =
    #  AccessKeySecret
    #secretId =
    #  RegionId
    #regionId =
    #  set vcName
    #vcName =
    #  set OssUploadPath, if you need upload local resource
    #ossUploadPath =
    
    ##spark conf
    #  driver specifications : small 1c4g | medium 2c8g | large 4c16g
    #spark.driver.resourceSpec =
    #  executor instance number
    #spark.executor.instances =
    #  executor specifications : small 1c4g | medium 2c8g | large 4c16g
    #spark.executor.resourceSpec =
    #  when use ram,  role arn
    #spark.dla.roleArn =
    #  when use option -f or -e, set catalog implementation
    #spark.sql.catalogImplementation =
    #  config dla oss connectors
    #spark.dla.connectors = oss
    #  config eni, if you want to use eni
    #spark.dla.eni.enable =
    #spark.dla.eni.vswitch.id =
    #spark.dla.eni.security.group.id =
    #  config log location, need an oss path to store logs
    #spark.dla.job.log.oss.uri =
    #  config spark read dla table
    #spark.sql.hive.metastore.version = dla
    Note
    • The spark-submit script automatically reads the spark-defaults.conf file in the conf folder.
    • Parameters in the command line take precedence over parameters in the spark-defaults.conf file.
    • For more information about the mappings between regions and region IDs, see Regions and zones.
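    The following is a minimal sketch of a filled-in spark-defaults.conf file. All values are placeholders that you must replace with your own cluster information:
    keyId = yourAccessKeyId
    secretId = yourAccessKeySecret
    regionId = cn-hangzhou
    vcName = your-virtual-cluster-name
    ossUploadPath = oss://your-bucket/spark-resources/
    spark.driver.resourceSpec = medium
    spark.executor.instances = 2
    spark.executor.resourceSpec = medium
    spark.dla.job.log.oss.uri = oss://your-bucket/spark-logs/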
  3. Submit a job.

    For more information, see Create and run Spark jobs.

    Before you use the spark-submit CLI, a Spark job is submitted in the following JSON format:
    {
        "name": "xxx",
        "file": "oss://{bucket-name}/jars/xxx.jar",
        "jars": "oss://{bucket-name}/jars/xxx.jar,oss://{bucket-name}/jars/xxx.jar",
        "className": "xxx.xxx.xxx.xxx.xxx",
        "args": [
            "xxx",
            "xxx"
        ],
        "conf": {
            "spark.executor.instances": "1",
            "spark.driver.resourceSpec": "medium",
            "spark.executor.resourceSpec": "medium",
            "spark.dla.job.log.oss.uri": "oss://{bucket-name}/path/to/log/"
        }
    }
    After you use the spark-submit CLI, the same job is submitted in the following format:
    $ ./bin/spark-submit  \
    --class xxx.xxx.xxx.xxx.xxx \
    --verbose \
    --name xxx \
    --jars oss://{bucket-name}/jars/xxx.jar,oss://{bucket-name}/jars/xxx.jar \
    --conf spark.driver.resourceSpec=medium \
    --conf spark.executor.instances=1 \
    --conf spark.executor.resourceSpec=medium \
    oss://{bucket-name}/jars/xxx.jar \
    xxx xxx
    
    ## --verbose is used to display the parameters and execution status of the submitted job.
    
    ## The main program file, the JAR packages specified by --jars, and the files specified by --py-files or --files can be stored in a local directory or an Object Storage Service (OSS) directory.
    ## Local files must be specified by absolute paths. The spark-submit CLI automatically uploads local files to the specified OSS directory.
    ## You can use --oss-upload-path or set ossUploadPath in the spark-defaults.conf file to specify the OSS directory.
    ## When a local file is uploaded, its content is verified by using MD5. If a file that has the same name and MD5 value as your local file already exists in the specified OSS directory, the upload is skipped.
    ## If you manually update the JAR package in the specified OSS directory, delete the MD5 file that corresponds to the JAR package.
    ## Format: --jars  /path/to/local/directory/XXX.jar,/path/to/local/directory/XXX.jar
    ## Separate multiple files with commas (,) and specify an absolute path for each file.
    
    ## --jars, --py-files, and --files also allow you to specify a local directory from which all files are uploaded. Files in subdirectories are not recursively uploaded.
    ## You must specify an absolute path for the directory. Example: --jars /path/to/local/directory/,/path/to/local/directory2/
    ## Separate multiple directories with commas (,) and specify an absolute path for each directory.
    ## For a complete example that uploads local JAR packages, see the command after the notes at the end of this step.
    
    
    ## Program output. You can use the sparkUI field in the following output to access the Spark UI of the job, and view Job Detail to check whether the parameters submitted with the job meet your expectations.
    job status: starting
    job status: starting
    job status: starting
    job status: starting
    job status: starting
    job status: starting
    job status: running
    {
      "jobId": "",
      "jobName": "SparkPi",
      "status": "running",
      "detail": "",
      "sparkUI": "",
      "createTime": "2020-08-20 14:12:07",
      "updateTime": "2020-08-20 14:12:07",
      ...
    }
    Job Detail: {
      "name": "SparkPi",
      "className": "org.apache.spark.examples.SparkPi",
      "conf": {
        "spark.driver.resourceSpec": "medium",
        "spark.executor.instances": "1",
        "spark.executor.resourceSpec": "medium"
      },
      "file": ""
    }
    Note
    • For more information about how to use the AccessKey ID and AccessKey secret of a RAM user to submit jobs, see Configure RAM user permissions in fine-grained manner.
    • If you use spark-submit to upload a JAR package from your local directory, the RAM user must be granted permissions to access OSS. You can grant the AliyunOSSFullAccess permission to the RAM user. For more information, see Users.
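    The following command is a minimal sketch of submitting a job whose main JAR package and dependent JAR packages are stored in a local directory. The class name, file paths, and OSS upload path are placeholders, and the connection parameters are assumed to be configured in spark-defaults.conf. The spark-submit CLI uploads the local files to the directory specified by --oss-upload-path before it submits the job:
    $ ./bin/spark-submit \
    --class com.example.Main \
    --name local-jar-example \
    --oss-upload-path oss://{bucket-name}/upload/ \
    --jars /path/to/local/directory/dependency.jar \
    --conf spark.driver.resourceSpec=medium \
    --conf spark.executor.instances=1 \
    --conf spark.executor.resourceSpec=medium \
    /path/to/local/directory/main.jar \
    xxx xxx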
  4. Terminate a job.
    Run the following command to terminate a job:
    $ ./bin/spark-submit \
    --kill <jobId>
    
    ## The following result is returned:
    {"data":"deleted"}
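    To terminate multiple jobs at a time, specify a comma-separated list of job IDs, as described in the help output. The job IDs are placeholders:
    $ ./bin/spark-submit \
    --kill <jobId1>,<jobId2>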
  5. View jobs.
    You can run the following command to view jobs. The following sample command queries the first page of the job list, with one job displayed per page.
    $ ./bin/spark-submit \
    --list --pagenumber 1 --pagesize 1
    
    ## The following result is returned:
    {
      "requestId": "",
      "dataResult": {
        "pageNumber": "1",
        "pageSize": "1",
        "totalCount": "251",
        "jobList": [
          {
            "createTime": "2020-08-20 11:02:17",
            "createTimeValue": "1597892537000",
            "detail": "",
            "driverResourceSpec": "large",
            "executorInstances": "4",
            "executorResourceSpec": "large",
            "jobId": "",
            "jobName": "",
            "sparkUI": "",
            "status": "running",
            "submitTime": "2020-08-20 11:01:58",
            "submitTimeValue": "1597892518000",
            "updateTime": "2020-08-20 11:22:01",
            "updateTimeValue": "1597893721000",
            "vcName": ""
          }
        ]
      }
    }
                            
  6. Obtain the parameters and SparkUI of a submitted job.
    ./bin/spark-submit --status <jobId>
    
    ## The following result is returned:
    Status: success
    job status: success
    {
      "jobId": "",
      "jobName": "SparkPi",
      "status": "success",
      "detail": "",
      "sparkUI": "",
      "createTime": "2020-08-20 14:12:07",
      "updateTime": "2020-08-20 14:12:33",
      "submitTime": "2020-08-20 14:11:49",
      "createTimeValue": "1597903927000",
      "updateTimeValue": "1597903953000",
      "submitTimeValue": "1597903909000",
      "vcName": "",
      "driverResourceSpec": "medium",
      "executorResourceSpec": "medium",
      "executorInstances": "1"
    }
  7. Obtain job logs.
    ./bin/spark-submit --get-log <jobId>
    
    ## The following result is returned:
    20/08/20 06:24:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    20/08/20 06:24:58 INFO SparkContext: Running Spark version 2.4.5
    20/08/20 06:24:58 INFO SparkContext: Submitted application: Spark Pi
    20/08/20 06:24:58 INFO SecurityManager: Changing view acls to: spark
    20/08/20 06:24:58 INFO SecurityManager: Changing modify acls to: spark
    20/08/20 06:24:58 INFO SecurityManager: Changing view acls groups to: 
    20/08/20 06:24:58 INFO SecurityManager: Changing modify acls groups to: 
    20/08/20 06:24:58 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(spark); groups with view permissions: Set(); users  with modify permissions: Set(spark); groups with modify permissions: Set()
    ...
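    To keep the log for later inspection, you can redirect the output of the command to a local file. The job ID and file name are placeholders:
    ./bin/spark-submit --get-log <jobId> > spark-job.log 2>&1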