A Spark job must be described in the JSON format. The job information includes the job name, the OSS path of the JAR package, and the job configuration parameters. This topic describes how to configure a Spark job.

Example of a Spark job

This example shows how to write a Spark job that reads data from Object Storage Service (OSS). The job parameters are specified in the JSON format on the command line.

{
  "args": ["oss://${oss-buck-name}/data/test/test.csv"],
  "name": "spark-oss-test",
  "file": "oss://${oss-buck-name}/jars/test/spark-examples-0.0.1-SNAPSHOT.jar",
  "className": "com.aliyun.spark.oss.SparkReadOss",
  "conf": {
    "spark.driver.resourceSpec": "medium",
    "spark.executor.resourceSpec": "medium",
    "spark.executor.instances": 2,
    "spark.dla.connectors": "oss"
  }
}

This example shows the format of a typical offline Spark job. The format specifies the job name, the main JAR package, the entry class, the parameters that are passed to the entry class, and the Spark job configurations. This information is similar to the parameters of the spark-submit command defined by the Apache Spark community, because the serverless Spark engine of Data Lake Analytics (DLA) uses the same configuration parameter names and semantics as those defined by the Apache Spark community.
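
The same JSON format applies to a PySpark job: the file parameter points to the Python entry file instead of a JAR package, and className is omitted because it is not required for Python. The following sketch is only an illustration; the OSS paths and file names are hypothetical, and pyFiles is shown as a comma-separated OSS path string, as described in the parameter list below.

{
  "args": ["oss://${oss-buck-name}/data/test/test.csv"],
  "name": "spark-oss-pyspark-test",
  "file": "oss://${oss-buck-name}/code/test/example_main.py",
  "pyFiles": "oss://${oss-buck-name}/code/test/example_deps.zip",
  "conf": {
    "spark.driver.resourceSpec": "medium",
    "spark.executor.resourceSpec": "medium",
    "spark.executor.instances": 2,
    "spark.dla.connectors": "oss",
    "spark.kubernetes.pyspark.pythonVersion": "3"
  }
}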

Job parameters

The following list describes the parameters of a Spark job.

  • args: The arguments that you want to pass to the Spark job. Separate multiple arguments with commas (,). Required: No.
  • name: The name of the Spark job. Required: No.
  • file: The path of the main file of the Spark job. The main file can be the JAR package that contains the entry class or the entry execution file of Python.
    Note: The main file of a Spark job must be stored in OSS.
    Required: Yes for Python, Java, and Scala.
  • className: The entry class of Java or Scala, such as com.aliyun.spark.oss.SparkReadOss. This parameter is not required for Python.
    Required: Yes for Java and Scala.
  • sqls: A feature that is developed by the DLA team and is not defined by the Apache Spark community. This feature allows you to submit offline SQL jobs without the need to submit JAR packages or Python files. This parameter cannot be used together with the file parameter. You can specify multiple SQL statements in a Spark job. Separate the SQL statements with commas (,); they are executed in the specified order. For a sketch of an SQL job, see the example after this list.
    Required: Yes for SQL.
  • jars: The JAR packages that are required for the Spark job. Separate multiple JAR packages with commas (,). When the Spark job runs, the JAR packages are added to the classpaths of the JVMs of the driver and executors.
    Note: The JAR packages required for a Spark job must be stored in OSS.
    Required: No.
  • files: The files that are required for the Spark job. These files are downloaded to the working directories of the driver and executors. You can configure an alias for a file. For example, if the alias of the yy.txt file in the oss://bucket/xx/ directory is yy, you need only to enter ./yy in the code to access the file. If you do not configure an alias, you must use ./yy.txt to access the file. Separate multiple files with commas (,).
    Note: If the log4j.properties file in oss://<path/to>/ is specified for this parameter, the Spark job uses it as the log configuration file.
    Note: All the files required for a Spark job must be stored in OSS.
    Required: No.
  • archives: The packages that are required for the Spark job. The packages must be in the ZIP, TGZ, TAR, or TAR.GZ format and are decompressed to the directory in which the Spark process runs. You can configure an alias for a package. For example, if the alias of the yy.zip package in the oss://bucket/xx/ directory is yy and the package contains the zz.txt file, you need only to enter ./yy/zz.txt in the code to access the file. If you do not configure an alias, you must use ./yy.zip/zz.txt to access the file. Separate multiple packages with commas (,).
    Note: The packages required for a Spark job must be stored in OSS. If a package fails to be decompressed, the job also fails.
    Required: No.
  • pyFiles: The Python files that are required for PySpark. The files must be in the ZIP, PY, or EGG format. If PySpark requires multiple Python files, we recommend that you use the ZIP or EGG format so that the files can be referenced in the Python code as a module. Separate multiple files with commas (,).
    Note: The Python files required for PySpark must be stored in OSS.
    Required: Optional for Python.
  • conf: The configuration parameters that are required for the Spark job. The parameters are the same as those defined by Apache Spark and are specified in the "key": "value" format. Separate multiple parameters with commas (,). If you do not specify conf, the default settings that you configured when you created the virtual cluster (VC) are used.
    Required: No.
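
The following sketch shows what an offline SQL job that uses the sqls parameter might look like, based on the description above. This is an illustration only: the database and table names are hypothetical, the sqls value is shown as a JSON array of statements that are executed in order, and spark.sql.hive.metastore.version is set to dla so that the statements can access the metadata of DLA (see the configuration parameters below).

{
  "sqls": [
    "SHOW DATABASES",
    "SELECT * FROM example_db.example_table LIMIT 10"
  ],
  "name": "spark-sql-test",
  "conf": {
    "spark.driver.resourceSpec": "small",
    "spark.executor.resourceSpec": "small",
    "spark.executor.instances": 1,
    "spark.sql.hive.metastore.version": "dla"
  }
}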

Configuration parameters

The configuration parameters of the serverless Spark engine of DLA are basically the same as those defined by the Apache Spark community. This section describes only the differences and the parameters that are specific to the serverless Spark engine of DLA.

  • Differences
    The serverless Spark engine of DLA uses different parameters to configure the Spark driver and executors. These parameters map to the parameters defined by the Apache Spark community as follows:
    • spark.driver.resourceSpec: The resource specifications of the Spark driver. Valid values: small (1 CPU core, 4 GB of memory), medium (2 CPU cores, 8 GB of memory), large (4 CPU cores, 16 GB of memory), and xlarge (8 CPU cores, 32 GB of memory). Corresponds to spark.driver.cores and spark.driver.memory in the Apache Spark community.
    • spark.executor.resourceSpec: The resource specifications of Spark executors. Valid values are the same as those of spark.driver.resourceSpec. Corresponds to spark.executor.cores and spark.executor.memory in the Apache Spark community.
  • Parameters supported by the serverless Spark engine of DLA
    For a sketch that shows how these parameters can be combined in the conf section of a job, see the example at the end of this topic.
    • Parameter used to access SparkUI
      • spark.dla.job.log.oss.uri (default: none): The uniform resource identifier (URI) of the OSS directory in which the logs generated by a Spark job and the SparkUI event logs are saved. Only OSS directories are supported. If you do not specify this parameter, you cannot view job logs or access SparkUI after a job is complete.
    • Parameter used to submit a Spark job as a RAM user
      • spark.dla.roleArn (default: none): The Aliyun Resource Name (ARN) of the RAM user who is granted the permissions to submit jobs in the RAM console. This parameter is required only when you submit a job as a RAM user.
    • Parameters for built-in data source connectors
      • spark.dla.connectors (default: none): The names of the built-in connectors of the serverless Spark engine of DLA. Separate multiple connector names with commas (,). Valid values: oss, hbase1.x, and tablestore.
      • spark.hadoop.job.oss.fileoutputcommitter.enable (default: false) and spark.sql.parquet.output.committer.class (default: com.aliyun.hadoop.mapreduce.lib.output.OSSFileOutputCommitter): The parameters that are required to optimize the committer for writing Parquet files. For more information, see OSS FileOutputCommitter.
        Notice:
        1. The two parameters must be used at the same time.
        2. Parquet files cannot be used together with files in other formats.
        3. spark.dla.connectors must be set to oss.
      • spark.hadoop.io.compression.codec.snappy.native (default: false): Specifies whether a Snappy file is in the standard Snappy format. By default, Hadoop recognizes Snappy files edited in Hadoop. If this parameter is set to true, the standard Snappy library is used for decompression. Otherwise, the default Snappy library of Hadoop is used for decompression.
    • Parameters used to access a VPC and connect to a data source
      • spark.dla.eni.enable (default: false): Specifies whether DLA can access a VPC. If this parameter is set to true, DLA can access the VPC.
      • spark.dla.eni.vswitch.id (default: none): The ID of the vSwitch that is associated with an elastic network interface (ENI). This ID is used to access the VPC. In most cases, if your ECS instance can access the destination data source, you can set this parameter to the ID of the vSwitch of the ECS instance.
      • spark.dla.eni.security.group.id (default: none): The ID of the security group that is associated with the ENI. This ID is used to access the VPC. In most cases, if your ECS instance can access the destination data source, you can set this parameter to the ID of the security group of the ECS instance.
      • spark.dla.eni.extra.hosts (default: none): The mappings between IP addresses and hostnames. This parameter enables the serverless Spark engine of DLA to correctly resolve the domain names of data sources. You must specify this parameter if you use DLA to access a Hive data source. For more information about how to access a Hive data source, see Hive.
        Notice: Separate an IP address and a hostname with a space. Separate multiple IP address and hostname pairs with commas (,), for example, "ip0 master0, ip1 master1".
    • Parameter used to access the metadata of DLA
      • spark.sql.hive.metastore.version (default: 1.2.1): The version of the Hive metastore. In addition to the values defined by the Apache Spark community, the serverless Spark engine of DLA supports the value dla. If this parameter is set to dla, you can execute SQL statements in the serverless Spark engine to access the metadata of DLA.
    • PySpark parameter
      • spark.kubernetes.pyspark.pythonVersion (default: 2): The Python version used by the serverless Spark engine of DLA. Valid values: 2 and 3. The value 2 indicates Python 2, and the value 3 indicates Python 3.
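
The following sketch shows how the DLA-specific parameters described in this section might be combined in the conf section of a job description. This is an illustration only: the OSS log directory, vSwitch ID, security group ID, and host mappings are placeholders that you must replace with your own values, and spark.dla.eni.enable is set to true only because this sketch assumes that the job needs to access a VPC.

"conf": {
  "spark.driver.resourceSpec": "medium",
  "spark.executor.resourceSpec": "medium",
  "spark.executor.instances": 2,
  "spark.dla.connectors": "oss",
  "spark.dla.job.log.oss.uri": "oss://${oss-buck-name}/spark-logs/",
  "spark.dla.eni.enable": "true",
  "spark.dla.eni.vswitch.id": "vsw-xxxxxxxx",
  "spark.dla.eni.security.group.id": "sg-xxxxxxxx",
  "spark.dla.eni.extra.hosts": "ip0 master0, ip1 master1"
}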