A Serverless Spark job must be described in the JSON format. The job description includes the job name, the storage location of the JAR package, and the job configuration parameters. This topic describes how to configure a Serverless Spark job.

Example of a Spark job

This section shows how to write a Spark job that accesses OSS data. The job parameters are in the JSON format.

{
  "args": ["oss://${oss-buck-name}/data/test/test.csv"],
  "name": "spark-oss-test",
  "file": "oss://${oss-buck-name}/jars/test/spark-examples-0.0.1-SNAPSHOT.jar",
  "className": "com.aliyun.spark.oss.SparkReadOss",
  "conf": {
    "spark.driver.resourceSpec": "medium",
    "spark.executor.resourceSpec": "medium",
    "spark.executor.instances": 2,
    "spark.dla.connectors": "oss"
  }
}

This example shows the format of a typical offline Spark JAR job, including the job name, the main JAR package, the entry class, the parameters of the entry class, and the Spark job configurations. These configurations are similar to the parameters of the spark-submit command in the Apache Spark community, because Serverless Spark uses the same parameter names and semantics as the Apache Spark community.

Job parameters

The following list describes the parameters of a Serverless Spark job.

args
Description: The parameters that you want to pass to the Spark job. Separate multiple parameters with commas (,).
Required: No

name
Description: The name of the Spark job.
Required: No

file
Description: The location where the main file of the Spark job is stored. The main file can be a JAR package that contains the entry class or an entry execution file of Python.
Note: The main file of a Spark job can only be stored in OSS.
Required: Yes for Python, Java, and Scala

className
Description: The entry class of Java or Scala, such as com.aliyun.spark.oss.SparkReadOss. This parameter is not required for Python.
Required: Yes for Java and Scala

sqls
Description: Allows you to submit offline SQL jobs without the need to submit JAR packages or Python files. This parameter is a self-developed feature of Data Lake Analytics (DLA) and is not provided by the Apache Spark community. This parameter cannot be used together with the file parameter. You can specify multiple SQL statements in a Spark job. Separate the SQL statements with commas (,). They are executed in the specified order. For an example, see the SQL job sketch after this list.
Required: Yes for SQL

jars
Description: The JAR packages on which the Spark job depends. Separate multiple JAR packages with commas (,). When the Spark job runs, these JAR packages are added to the classpaths of the driver and executor JVMs.
Note: The JAR packages on which a Spark job depends must be stored in OSS.
Required: No

files
Description: The files on which the Spark job depends. These files are downloaded to the working directories of the driver and executors. You can specify an alias for a file, such as oss://bucket/xx/yy.txt#yy. In this case, you only need to enter ./yy in the code to access the file. If you do not specify an alias, you must use ./yy.txt to access the file. Separate multiple files with commas (,).
Note: The files on which a Spark job depends must be stored in OSS.
Required: No

archives
Description: The packages on which the Spark job depends. The packages must be in the ZIP, TGZ, TAR, or TAR.GZ format. The packages are decompressed to the working directory of the Spark process. You can specify an alias for a package, such as oss://bucket/xx/yy.zip#yy. In this case, you only need to enter ./yy/zz.txt in the code to access the decompressed files, where zz.txt is a file in the yy.zip package. If you do not specify an alias, you must use ./yy.zip/zz.txt. Separate multiple packages with commas (,).
Note: The packages on which a Spark job depends must be stored in OSS. If a package fails to be decompressed, the job also fails.
Required: No

pyFiles
Description: The Python files on which PySpark depends. The files must be in the ZIP, PY, or EGG format. If PySpark depends on multiple files, we recommend that you use files in the ZIP or EGG format. These files can be referenced in the Python code as modules. Separate multiple files with commas (,). For an example, see the PySpark job sketch after this list.
Note: The Python files on which PySpark depends must be stored in OSS.
Required: Yes for Python

conf
Description: The configuration parameters of the Spark job. The fields must be the same as those used in open source Apache Spark and are in the key: value format. Separate multiple fields with commas (,). For information about the configuration differences between DLA and the Apache Spark community, see Serverless Spark fields.
If you do not specify the conf parameter, the default settings specified when the virtual cluster was created are used.
Required: No
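
For reference, the following sketch shows an offline SQL job that uses the sqls parameter. It is a minimal example for illustration only: the job name, the example_db database, and the example_table table are hypothetical, and the sqls value is written as a JSON array of statement strings, in the same way that args is written in the example at the beginning of this topic.

{
  "sqls": [
    "SHOW TABLES IN example_db",
    "SELECT * FROM example_db.example_table LIMIT 10"
  ],
  "name": "spark-sql-test",
  "conf": {
    "spark.driver.resourceSpec": "small",
    "spark.executor.resourceSpec": "small",
    "spark.executor.instances": 1
  }
}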
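
The following sketch shows how the dependency parameters might fit together in a PySpark job. All OSS paths, file names, and aliases (main.py, deps.zip, settings.json#settings, and resources.tar.gz#resources) are hypothetical placeholders, single dependencies are written as plain strings following the comma-separated convention described above, and spark.dla.connectors is set to oss as in the first example.

{
  "name": "spark-pyspark-test",
  "file": "oss://${oss-buck-name}/code/main.py",
  "args": ["oss://${oss-buck-name}/data/input/"],
  "pyFiles": "oss://${oss-buck-name}/code/deps.zip",
  "files": "oss://${oss-buck-name}/conf/settings.json#settings",
  "archives": "oss://${oss-buck-name}/libs/resources.tar.gz#resources",
  "conf": {
    "spark.driver.resourceSpec": "medium",
    "spark.executor.resourceSpec": "medium",
    "spark.executor.instances": 2,
    "spark.dla.connectors": "oss"
  }
}

In this sketch, the code in main.py can access ./settings instead of settings.json because of the #settings alias, and the files extracted from resources.tar.gz are available under ./resources.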

Serverless Spark fields

Most of the fields used by DLA Serverless Spark are the same as those used by the Apache Spark community. This section describes the fields that differ between DLA Serverless Spark and the Apache Spark community, as well as the fields that are unique to DLA Serverless Spark.

  1. "spark.driver.cores","spark.driver.memory","spark.executor.cores","spark.executor.memory" are not supported in the conf parameter. You must use "spark.driver.resourceSpec" and "spark.executor.resourceSpec".

    Example:
    "conf": {
        "spark.driver.resourceSpec": "small",
        "spark.executor.resourceSpec": "medium",
        "spark.executor.instances": 2
    }
    In this example:
    • "spark.driver.resourceSpec":"small" indicates that the specifications of the driver are small, which include 1 vCPU and 4 GB of memory.
    • "spark.executor.resourceSpec":"medium" indicates that the specifications of the executor are medium, which include 2 vCPUs and 8 GB of memory.
  2. spark.dla.connectors is a field provided by DLA Serverless Spark. It specifies the self-developed DLA Spark connectors that a job uses, such as OSS, Tablestore, and HBase. You can specify multiple connectors and separate them with commas (,).
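
    Example (a minimal sketch): the following conf assumes a job that accesses data in both OSS and Tablestore, so both connectors are enabled; the other job parameters are omitted. The lowercase values follow the style of the oss value used in the first example of this topic.
    "conf": {
        "spark.dla.connectors": "oss,tablestore",
        "spark.driver.resourceSpec": "small",
        "spark.executor.resourceSpec": "small",
        "spark.executor.instances": 1
    }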