In the serverless Spark engine of Data Lake Analytics (DLA), a Spark job must be described in the JSON format. The job description includes information such as the job name, the directory in which the JAR packages are stored, and the job configuration parameters. This topic describes how to configure a Spark job.
Sample code of a Spark job
In this topic, a Spark job is configured to read data from Object Storage Service (OSS). The job description that you submit, for example on the command line, is written in the JSON format. Sample code:
{
    "args": ["oss://${oss-buck-name}/data/test/test.csv"],
    "name": "spark-oss-test",
    "file": "oss://${oss-buck-name}/jars/test/spark-examples-0.0.1-SNAPSHOT.jar",
    "className": "com.aliyun.spark.oss.SparkReadOss",
    "conf": {
        "spark.driver.resourceSpec": "medium",
        "spark.executor.resourceSpec": "medium",
        "spark.executor.instances": 2,
        "spark.dla.connectors": "oss"
    }
}
The preceding code shows the format of a typical offline Spark job. It specifies the job name, the main JAR file, the entry class, the parameters that are passed to the entry class, and the Spark configurations. This information is similar to the parameters of the spark-submit scripts defined by the Apache Spark community, because the serverless Spark engine of DLA uses the same configuration parameter names and semantics as the Apache Spark community.
Job parameters
The following table describes the parameters for configuring a Spark job.
Parameter | Required | Example | Description |
---|---|---|---|
args | No | "args":["args0", "args1"] | The arguments that are passed to the Spark job. Separate multiple arguments with commas (,). |
name | No | "name": "your_job_name" | The name of the Spark job. |
file | Required for jobs that are written in Python, Java, or Scala | "file":"oss://bucket/path/to/your/jar" | The path of the main file of the Spark job. The main file can be a JAR package that includes the entry class or the entry execution file of a Python program. Note: The main file of a Spark job must be stored in OSS. |
className | Required for jobs that are written in Java or Scala | "className":"com.aliyun.spark.oss.SparkReadOss" | The entry class of the Java or Scala program. This parameter is not required for jobs that are written in Python. |
sqls | Required for SQL jobs | "sqls":["select * from xxxx","show databases"] | A feature that is developed by the Alibaba Cloud DLA team and is not defined by the Apache Spark community. This feature allows you to submit offline SQL jobs without the need to submit JAR packages or Python files. This parameter cannot be used together with the file, className, or args parameter. You can specify multiple SQL statements for a Spark job. Separate the SQL statements with commas (,). They are executed in the specified order. (See the example after this table.) |
jars | No | "jars":["oss://bucket/path/to/jar","oss://bucket/path/to/jar"] | The JAR packages that are required for the Spark job. Separate multiple JAR packages with commas (,). When the Spark job runs, these JAR packages are added to the classpaths of the driver and executor Java virtual machines (JVMs). Note: The JAR packages that are required for a Spark job must be stored in OSS. |
files | No | "files":["oss://bucket/path/to/files","oss://bucket/path/to/files"] | The files that are required for the Spark job. These files are downloaded to the working directories of the driver and executors. You can configure an alias for a file. For example, if the alias of the yy.txt file in the oss://bucket/xx/ directory is yy, you need only to enter ./yy in the code to access the file. If you do not configure an alias, you must use ./yy.txt to access the file. Separate multiple files with commas (,). |
archives | No | "archives":["oss://bucket/path/to/archives","oss://bucket/path/to/archives"] | The packages that are required for the Spark job. The packages must be in the ZIP, TAR, or TAR.GZ format. The packages are decompressed to the directory in which the Spark process runs. You can configure an alias for a package. For example, if the alias of the yy.zip package in the oss://bucket/xx/ directory is yy and the package contains the zz.txt file, you need only to enter ./yy/zz.txt in the code to access the zz.txt file. If you do not configure an alias, you must use ./yy.zip/zz.txt to access the file. Separate multiple packages with commas (,). Note: The packages that are required for a Spark job must be stored in OSS. If a package fails to be decompressed, the job also fails. |
pyFiles | Required for jobs that are written in Python | "pyFiles":["oss://bucket/path/to/pyfiles","oss://bucket/path/to/pyfiles"] | The Python files that are required for a PySpark job. The files must be in the ZIP, PY, or EGG format. If multiple Python files are required, we recommend that you use files in the ZIP or EGG format. These files can be referenced in the Python code as modules. Separate multiple files with commas (,). Note: The Python files that are required for a PySpark job must be stored in OSS. |
conf | No | "conf":{"spark.xxxx":"xxx","spark.xxxx":"xxxx"} | The configuration parameters that are required for the Spark job, which are the same as those configured for an Apache Spark job. The parameters are in the key: value format. Separate multiple parameters with commas (,). If you do not specify conf, the default settings that you configured when you created the virtual cluster are used. |
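For reference, a minimal job description that uses the sqls parameter might look like the following sketch. The database name, table name, and resource specifications are placeholders for illustration; adjust them to your environment.
{
    "sqls": ["show databases", "select * from your_db.your_table limit 10"],
    "name": "spark-sql-test",
    "conf": {
        "spark.driver.resourceSpec": "small",
        "spark.executor.resourceSpec": "small",
        "spark.executor.instances": 1
    }
}
Because the sqls parameter cannot be used together with the file, className, or args parameter, no JAR package or Python file is specified in this job description.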
Configuration parameters
The configuration parameters of the serverless Spark engine of DLA are largely the same as those defined by the Apache Spark community. This section describes only the parameters that differ from the community-defined parameters and the parameters that are specific to the serverless Spark engine of DLA.
- Differences

  The serverless Spark engine of DLA uses different parameters to configure the resources of the Spark driver and executors. These parameters are mapped to the parameters defined by the Apache Spark community.

  Parameter | Description | Parameter defined by the Apache Spark community |
  ---|---|---|
  spark.driver.resourceSpec | The resource specifications of the Spark driver. Valid values: small (1 CPU core and 4 GB of memory), medium (2 CPU cores and 8 GB of memory), large (4 CPU cores and 16 GB of memory), and xlarge (8 CPU cores and 32 GB of memory). | spark.driver.cores and spark.driver.memory |
  spark.executor.resourceSpec | The resource specifications of each Spark executor. The valid values are the same as those of spark.driver.resourceSpec. | spark.executor.cores and spark.executor.memory |

- Parameters supported by the serverless Spark engine of DLA
  - Parameter used to access the Apache Spark web UI

    Parameter | Default value | Description |
    ---|---|---|
    spark.dla.job.log.oss.uri | N/A | The uniform resource identifier (URI) of the OSS directory in which the logs generated by the Spark job and the Spark web UI event logs are saved. Only OSS directories are supported. If you do not configure this parameter, you cannot query job logs or access the Spark web UI after the job is complete. |
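    For example, you might add an entry similar to the following to the conf section of the job description to save logs and Spark web UI event logs to an OSS directory. The bucket name and path are placeholders.
    "conf": {
        "spark.dla.job.log.oss.uri": "oss://your-bucket/your-spark-logs/"
    }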
  - Parameter used to grant permissions to a RAM user

    Parameter | Default value | Description |
    ---|---|---|
    spark.dla.roleArn | N/A | The Alibaba Cloud Resource Name (ARN) of the RAM user that is granted the permissions to submit a job in the RAM console. This parameter is required only when you submit a job as a RAM user. |
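    If you submit jobs as a RAM user, the conf section might contain an entry similar to the following sketch. The account ID and role name are placeholders; use the exact ARN that is displayed in the RAM console.
    "conf": {
        "spark.dla.roleArn": "acs:ram::${your-account-id}:role/${your-role-name}"
    }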
  - Parameters for built-in data source connectors

    Parameter | Default value | Description |
    ---|---|---|
    spark.dla.connectors | N/A | The names of the built-in connectors of the serverless Spark engine of DLA. Separate multiple connector names with commas (,). Valid values: oss, hbase1.x, and tablestore. |
    spark.hadoop.job.oss.fileoutputcommitter.enable | false | The parameters that are used to optimize the committer for Parquet files. For more information, see OSS. Notice: The two parameters must be used together. Parquet files cannot be used with files in other formats. The two parameters take effect only when the spark.dla.connectors parameter is set to oss. |
    spark.sql.parquet.output.committer.class | com.aliyun.hadoop.mapreduce.lib.output.OSSFileOutputCommitter | See the description of spark.hadoop.job.oss.fileoutputcommitter.enable. |
    spark.hadoop.io.compression.codec.snappy.native | false | Specifies whether a Snappy file is in the standard Snappy format. By default, Hadoop recognizes the Snappy files that are edited in Hadoop. If this parameter is set to true, the standard Snappy library is used for decompression. Otherwise, the default Snappy library of Hadoop is used for decompression. |
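    As an illustration, the following conf fragment enables the built-in OSS connector together with the optimized Parquet committer. The values follow the table above; this is a sketch, not a complete job description.
    "conf": {
        "spark.dla.connectors": "oss",
        "spark.hadoop.job.oss.fileoutputcommitter.enable": "true",
        "spark.sql.parquet.output.committer.class": "com.aliyun.hadoop.mapreduce.lib.output.OSSFileOutputCommitter"
    }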
  - Parameters used to configure the network of data sources and connect to a data source

    Parameter | Default value | Description |
    ---|---|---|
    spark.dla.eni.enable | false | Specifies whether DLA can access a VPC. If this parameter is set to true, DLA can access the VPC. |
    spark.dla.eni.vswitch.id | N/A | The ID of the vSwitch that is associated with an elastic network interface (ENI). This ID is used to access a VPC. In most cases, if your Elastic Compute Service (ECS) instance can access the destination data source, you can set this parameter to the ID of the vSwitch that is connected to the ECS instance. |
    spark.dla.eni.security.group.id | N/A | The ID of the security group that is associated with an ENI. This ID is used to access a VPC. In most cases, if your ECS instance can access the destination data source, you can set this parameter to the ID of the security group to which the ECS instance is added. |
    spark.dla.eni.extra.hosts | N/A | The mappings between IP addresses and hostnames. This parameter enables the serverless Spark engine of DLA to correctly resolve the domain names of data sources. You must pass this parameter if you use DLA to access a Hive data source. Notice: IP addresses and hostnames are separated by spaces. Groups of IP addresses and hostnames are separated by commas (,), for example, "ip0 master0, ip1 master1". |
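    For example, to access a data source in a VPC, the conf section might contain entries similar to the following. The vSwitch ID, security group ID, and host mappings are placeholders.
    "conf": {
        "spark.dla.eni.enable": "true",
        "spark.dla.eni.vswitch.id": "vsw-xxxxxxxx",
        "spark.dla.eni.security.group.id": "sg-xxxxxxxx",
        "spark.dla.eni.extra.hosts": "192.168.0.10 master0, 192.168.0.11 master1"
    }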
  - Parameter for Spark SQL

    Parameter | Default value | Description |
    ---|---|---|
    spark.sql.hive.metastore.version | 1.2.1 | The version of the Hive metastore. The serverless Spark engine of DLA supports more values for this parameter than those defined by the Apache Spark community. If this parameter is set to dla, you can use Spark SQL to access the metadata of DLA. |
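    For example, to let Spark SQL read the metadata that is managed by DLA, you might add the following entry to the conf section of a SQL job.
    "conf": {
        "spark.sql.hive.metastore.version": "dla"
    }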
  - Parameter for PySpark

    Parameter | Default value | Description |
    ---|---|---|
    spark.kubernetes.pyspark.pythonVersion | 2 | The Python version that is used by the serverless Spark engine of DLA. Valid values: 2 and 3. The value 2 indicates Python 2.0. The value 3 indicates Python 3.0. |
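    A PySpark job that runs on Python 3 might therefore be described as follows. The OSS paths and file names are placeholders; the file and pyFiles parameters are described in the job parameter table above.
    {
        "file": "oss://your-bucket/path/to/main.py",
        "name": "spark-pyspark-test",
        "pyFiles": ["oss://your-bucket/path/to/deps.zip"],
        "conf": {
            "spark.driver.resourceSpec": "medium",
            "spark.executor.resourceSpec": "medium",
            "spark.executor.instances": 2,
            "spark.kubernetes.pyspark.pythonVersion": "3"
        }
    }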
  - Parameters related to job attempts

    Parameter | Default value | Description | Example |
    ---|---|---|---|
    spark.dla.job.maxAttempts | 1 | The maximum number of attempts for a job. The default value is 1, which indicates that a failed job is not retried. Note: Valid values: [1, 9999]. If a job succeeds, it is not retried. If a job fails and this parameter is set to a value greater than 1, the failed job is automatically retried. | If the spark.dla.job.maxAttempts parameter is set to 3, the job can be attempted a maximum of three times. |
    spark.dla.job.attemptFailuresValidityInterval | -1 | The validity interval for job attempt tracking. The default value is -1, which indicates that job attempt tracking is disabled. Notice: If the difference between the end time of a job attempt and the current time exceeds the value of this parameter, the attempt is not counted as a failure. If this parameter is set to a small value, a failed job may be retried repeatedly. Therefore, we recommend that you do not configure this parameter. Supported units: ms (milliseconds, the default unit), m (minutes), h (hours), and d (days). | Assume that the spark.dla.job.attemptFailuresValidityInterval parameter is set to 30m and the current time is 12:40. The end time of JobAttempt0 is 12:00, the end time of JobAttempt1 is 12:30, and the end time of JobAttempt2 is 12:35. In this case, JobAttempt0 is not counted as a job attempt. The valid job attempts are JobAttempt1 and JobAttempt2, and the total number of job attempts is 2. |
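    For illustration, the following conf fragment allows a failed job to be attempted up to three times and counts only the attempts that ended within the last 30 minutes, which matches the worked example above. The table above recommends leaving the interval unconfigured in most cases.
    "conf": {
        "spark.dla.job.maxAttempts": "3",
        "spark.dla.job.attemptFailuresValidityInterval": "30m"
    }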
  - Parameters related to resource configurations

    Parameter | Default value | Description |
    ---|---|---|
    spark.dla.driver.cpu-vcores-ratio | 1 | The ratio between vCPU cores and the actual CPU cores that are used by the driver. For example, if the driver uses the medium resource specifications and this parameter is set to 2, the driver can run 4 vCPU cores in parallel. You can also set spark.driver.cores to 4 to achieve the same effect. |
    spark.dla.executor.cpu-vcores-ratio | 1 | The ratio between vCPU cores and the actual CPU cores that are used by an executor. If the CPU utilization of a job is low, you can configure this parameter to increase the CPU utilization. For example, if an executor uses the medium resource specifications and this parameter is set to 2, the executor can run 4 vCPU cores in parallel, which means that 4 tasks are scheduled in parallel. You can also set spark.executor.cores to 4 to achieve the same effect. |
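    For example, if a job is I/O-bound and its CPU utilization is low, you might raise the executor ratio as shown in the following sketch. The value 2 is illustrative.
    "conf": {
        "spark.executor.resourceSpec": "medium",
        "spark.dla.executor.cpu-vcores-ratio": "2"
    }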
  - Monitoring-related parameter

    Parameter | Default value | Description |
    ---|---|---|
    spark.dla.monitor.enabled | true | Specifies whether to enable job monitoring. The default value is true, which indicates that job monitoring is enabled. If this parameter is set to false, the monitoring data of the job is not collected. |