A Spark job must be described in the JSON format. The job information includes the job name, the directory in which JAR packages are saved, and job configuration parameters. This topic describes how to configure a Spark job.

Example of a Spark job

This example demonstrates how to configure a Spark job that reads data from Object Storage Service (OSS). The job parameters are specified in the JSON format. Sample job configuration:

{
  "args": ["oss://${oss-buck-name}/data/test/test.csv"],
  "name": "spark-oss-test",
  "file": "oss://${oss-buck-name}/jars/test/spark-examples-0.0.1-SNAPSHOT.jar",
  "className": "com.aliyun.spark.oss.SparkReadOss",
  "conf": {
    "spark.driver.resourceSpec": "medium",
    "spark.executor.resourceSpec": "medium",
    "spark.executor.instances": 2,
    "spark.dla.connectors": "oss"
  }
}

This example shows the format of a typical offline Spark job. The format specifies the job name, main JAR package, entry class, entry class parameters, and Spark job configurations. These configurations are similar to those in the spark-submit commands defined by the Apache Spark community because the serverless Spark engine uses the same configuration parameter names and semantics as Apache Spark.

Job parameters

The following table describes the parameters of a Spark job.

Parameter Required Example Description
args No "args":["args0", "args1"] The arguments that are passed to the Spark job. Separate multiple arguments with commas (,).
name No "name": "your_job_name" The name of a Spark job.
file Required for jobs that are written in Python, Java, or Scala "file":"oss://bucket/path/to/your/jar" The OSS path of the main file of a Spark job. The main file can be a JAR package that contains the entry class or an entry execution file of Python.
Note The main files of Spark jobs must be stored in OSS.
className Required for jobs that are written in Java or Scala "className":"com.aliyun.spark.oss.SparkReadOss" The Java or Scala program entry class. This parameter is not required for jobs that are written in Python.
sqls Required for Spark SQL jobs "sqls":["select * from xxxx","show databases"] This parameter provides a feature that is developed by the Data Lake Analytics (DLA) team and is not defined by the Apache Spark community. This feature allows you to submit offline SQL jobs without the need to submit JAR packages or Python files. This parameter cannot be used together with the file, className, or args parameter. You can specify multiple SQL statements for one Spark job and separate them with commas (,). The statements are executed in the order in which they are specified. For a sample job, see the Spark SQL example that follows this table.
jars No "jars":["oss://bucket/path/to/jar","oss://bucket/path/to/jar"] The JAR packages that are required for a Spark job. Separate multiple JAR packages with commas (,). When a Spark job runs, these JAR packages are added to the classpaths of the driver and executor Java virtual machines (JVMs).
Note The JAR packages required for a Spark job must be stored in OSS.
files No "files":["oss://bucket/path/to/files","oss://bucket/path/to/files"] The files required for a Spark job. These files are downloaded to the working directories of the driver and executors. You can configure an alias for a file. For example, the alias of the yy.txt file in the oss://bucket/xx/ directory is yy. In this case, you need only to enter ./yy in the code to access the file. If you do not configure an alias for the file, you must use ./yy.txt to access the file. Separate multiple files with commas (,).
Note
  • If the log4j.properties file in oss://<path/to>/ is specified for this parameter, the Spark job uses the log4j.properties file as the log configuration file.
  • All the files required for a Spark job must be stored in OSS.
archives No "archives":["oss://bucket/path/to/archives","oss://bucket/path/to/archives"] The packages required for a Spark job. These packages must be in the ZIP, TAR, or TAR.GZ format. The packages are decompressed to the directory in which the Spark process is running. You can configure an alias for a package. For example, the alias of the yy.zip package in the oss://bucket/xx/ directory is yy and the zz.txt file is included in the package. In this case, you need only to enter ./yy/zz.txt in the code to access the zz.txt file. If you do not configure an alias for the package, you must use ./yy.zip/zz.txt to access the file. Separate multiple packages with commas (,).
Note The packages required for a Spark job must be stored in OSS. If a package fails to be decompressed, the job also fails.
pyFiles Required for jobs that are written in Python "pyFiles":["oss://bucket/path/to/pyfiles","oss://bucket/path/to/pyfiles"] The Python files required for PySpark. These files must be in the ZIP, PY, or EGG format. If multiple Python files are required, we recommend that you package them in the ZIP or EGG format. Python code can reference these files as modules. Separate multiple files with commas (,). For a sample job, see the PySpark example that follows this table.
Note The Python files required for PySpark must be stored in OSS.
conf No "conf":{"spark.xxxx":"xxx","spark.xxxx":"xxxx"} The configuration parameters that are required for a Spark job, which are the same as those configured for Apache Spark. The parameters are in the "key": "value" format. Separate multiple parameters with commas (,).

If you do not specify the conf parameter, the default settings that are configured when you created the virtual cluster (VC) are used.
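
The following sample configuration is a minimal sketch of a Spark SQL job that uses the sqls parameter. The job name, SQL statements, and table name are placeholders, and the conf values are examples only. The spark.sql.hive.metastore.version parameter, which allows Spark SQL to access the metadata of DLA, is described in the Configuration parameters section of this topic.

{
  "name": "spark-sql-test",
  "sqls": ["show databases", "select * from your_db.your_table limit 10"],
  "conf": {
    "spark.driver.resourceSpec": "small",
    "spark.executor.resourceSpec": "small",
    "spark.executor.instances": 1,
    "spark.sql.hive.metastore.version": "dla"
  }
}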
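
The following sample configuration is a sketch of a PySpark job that uses the pyFiles and files parameters. The OSS paths and file names, such as main.py and deps.zip, are placeholders that you must replace with your own objects. The spark.kubernetes.pyspark.pythonVersion parameter is described in the Configuration parameters section of this topic.

{
  "name": "pyspark-oss-test",
  "file": "oss://${oss-buck-name}/code/main.py",
  "pyFiles": ["oss://${oss-buck-name}/code/deps.zip"],
  "files": ["oss://${oss-buck-name}/conf/log4j.properties"],
  "args": ["oss://${oss-buck-name}/data/test/test.csv"],
  "conf": {
    "spark.driver.resourceSpec": "small",
    "spark.executor.resourceSpec": "small",
    "spark.executor.instances": 1,
    "spark.dla.connectors": "oss",
    "spark.kubernetes.pyspark.pythonVersion": "3"
  }
}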

Configuration parameters

The configuration parameters of the serverless Spark engine of DLA are basically the same as those defined by the Apache Spark community. This section describes only the parameters that differ from the community parameters and the parameters that are specific to the serverless Spark engine of DLA.

  • Differences
    The serverless Spark engine of DLA uses different parameters to configure the Spark driver and executors. These parameters are mapped to the parameters defined by the Apache Spark community.
    Parameter Description Parameter defined by the Apache Spark community
    spark.driver.resourceSpec The resource specifications of the Spark driver. Valid values:
    • small: 1 CPU core and 4 GB of memory.
    • medium: 2 CPU cores and 8 GB of memory.
    • large: 4 CPU cores and 16 GB of memory.
    • xlarge: 8 CPU cores and 32 GB of memory.
    This parameter maps to spark.driver.cores and spark.driver.memory.
    spark.executor.resourceSpec The resource specifications of Spark executors. The valid values are the same as those of spark.driver.resourceSpec. This parameter maps to spark.executor.cores and spark.executor.memory.
  • Parameters supported by the serverless Spark engine of DLA
    • Parameter used to access the Spark web UI
      Parameter Default value Description
      spark.dla.job.log.oss.uri N/A. The uniform resource identifier (URI) of the OSS directory in which the logs that are generated by a Spark job and the Spark web UI event logs are saved. Only OSS directories are supported. If you do not specify this parameter, you cannot view job logs or access the Spark web UI after a job is complete.
    • Parameter used to submit a Spark job as a RAM user
      Parameter Default value Description
      spark.dla.roleArn N/A. The Aliyun Resource Name (ARN) of the RAM role that is assigned to the RAM user who submits the job in the RAM console. This parameter is required only when you submit a job as a RAM user.
    • Parameters for built-in data source connectors
      Parameter Default value Description
      spark.dla.connectors N/A. The names of the built-in connectors in the serverless Spark engine of DLA. Separate multiple connector names with commas (,). Valid values: oss, hbase1.x, and tablestore.
      spark.hadoop.job.oss.fileoutputcommitter.enable false The parameters that are used to optimize the committer for Parquet files. For more information, see OSS. A sample conf snippet is provided at the end of this topic.
      spark.sql.parquet.output.committer.class com.aliyun.hadoop.mapreduce.lib.output.OSSFileOutputCommitter
      Notice
      • The two committer parameters must be used at the same time.
      • Parquet files cannot be mixed with files in other formats.
      • You must also set "spark.dla.connectors" to "oss".
      spark.hadoop.io.compression.codec.snappy.native false Specifies whether a Snappy file is in the standard Snappy format. By default, Hadoop recognizes only Snappy files that are written by Hadoop. If this parameter is set to true, the standard Snappy library is used to decompress files. Otherwise, the default Snappy library of Hadoop is used.
    • Parameters used to access a VPC and connect to a data source
      Parameter Default value Description
      spark.dla.eni.enable false Specifies whether the serverless Spark engine of DLA can access a virtual private cloud (VPC). Set this parameter to true to allow access to a VPC. A sample conf snippet is provided at the end of this topic.
      spark.dla.eni.vswitch.id N/A. The ID of the vSwitch that is associated with an elastic network interface (ENI). This ID is used to access a VPC. In most cases, if your ECS instance can access a destination data source, you can directly set this parameter to the vSwitch ID of the ECS instance.
      spark.dla.eni.security.group.id N/A. The ID of the security group that is associated with an ENI. This ID is used to access a VPC. In most cases, if your ECS instance can access a destination data source, you can directly set this parameter to the security group ID of the ECS instance.
      spark.dla.eni.extra.hosts N/A. The mappings between IP addresses and hostnames. This parameter enables the serverless Spark engine of DLA to correctly resolve the domain names of data sources. You must specify this parameter if you use DLA to access a Hive data source. For more information, see Hive.
      Notice Separate an IP address and a hostname with a space. Separate multiple groups of IP addresses and hostnames with commas (,), for example, "ip0 master0, ip1 master1".
    • Parameter used for Spark SQL to access the metadata of DLA:
      Parameter Default value Description
      spark.sql.hive.metastore.version 1.2.1 The version of a Hive metastore. The serverless Spark engine of DLA supports more values of this parameter in addition to the values that are defined by the Apache Spark community. If this parameter is set to dla, you can use Spark SQL to access the metadata of DLA.
    • PySpark parameter
      Parameter Default value Description
      spark.kubernetes.pyspark.pythonVersion 2 The Python version used by the serverless Spark engine of DLA. Valid values: 2 and 3. The value 2 indicates Python 2, and the value 3 indicates Python 3.
    • Parameters related to job attempts
      Parameter Default value Description Examples
      spark.dla.job.maxAttempts 1 The maximum number of attempts that can be made for a job. The default value is 1, which indicates that a failed job is not retried.
      Note Valid values: [1, 9999]. If a job succeeds, no more attempts are made. If a job fails and the value of this parameter is greater than 1, the next attempt is automatically made.
      If the spark.dla.job.maxAttempts parameter is set to 3, a maximum of three job attempts can be made.
      spark.dla.job.attemptFailuresValidityInterval -1 The validity interval for job attempt tracking. The default value is -1, which indicates that job attempt tracking is not enabled.
      Notice
      • If the difference between the end time of a job attempt and the current time exceeds the value of this parameter, this attempt is not counted as a failure.
      • If this parameter is set to a small value, the failed attempts of a faulty job may fall outside the validity interval and are not counted. As a result, the job may be retried over and over. Therefore, we recommend that you do not specify this parameter.
      • Supported units:
        • ms: milliseconds. This is the default unit.
        • m: minutes.
        • h: hours.
        • d: days.
      If the spark.dla.job.attemptFailuresValidityInterval parameter is set to 30m, the current time is 12:40, the end time of JobAttempt0 is 12:00, the end time of JobAttempt1 is 12:30, and the end time of JobAttempt2 is 12:35, JobAttempt0 falls outside the validity interval and is not counted. The valid job attempts are JobAttempt1 and JobAttempt2, and the total number of job attempts is 2.
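
The following conf snippet is a sketch of the retry settings that are used in the preceding scenario. The other job fields are omitted, and the values are examples only.

"conf": {
  "spark.dla.job.maxAttempts": 3,
  "spark.dla.job.attemptFailuresValidityInterval": "30m"
}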
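
The following conf snippet is a sketch that combines the built-in OSS connector with the Parquet committer optimization described in the Parameters for built-in data source connectors section. The values are examples only.

"conf": {
  "spark.dla.connectors": "oss",
  "spark.hadoop.job.oss.fileoutputcommitter.enable": "true",
  "spark.sql.parquet.output.committer.class": "com.aliyun.hadoop.mapreduce.lib.output.OSSFileOutputCommitter"
}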
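
The following conf snippet is a sketch of the VPC access settings described in the Parameters used to access a VPC and connect to a data source section. The vSwitch ID, security group ID, IP address, and hostname are placeholders.

"conf": {
  "spark.dla.eni.enable": "true",
  "spark.dla.eni.vswitch.id": "vsw-xxxxxxxxxxxx",
  "spark.dla.eni.security.group.id": "sg-xxxxxxxxxxxx",
  "spark.dla.eni.extra.hosts": "192.168.0.2 hive-master0"
}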