
Create and execute a Spark job

Last Updated: Jun 10, 2020

After a virtual cluster is created and enters the RUNNING state, you can use it to execute Spark jobs.

Procedure

  1. Log on to the Data Lake Analytics console.

  2. In the top navigation bar, select the region where Data Lake Analytics is deployed.

    The Serverless Spark feature is available only in the China (Hong Kong), Singapore, and US (Silicon Valley) regions.

  3. In the left-side navigation pane, choose Serverless Spark > Submit job.

  4. On the Parameter Configuration page, click Create Job.

  5. In the Create Job dialog box, specify the parameters as required.

    The following parameters are available:
    • File Name: The name of the file or folder.
    • Data Format: The type of the item that you want to create, which can be File or Folder.
    • Parent: The parent directory of the file or folder.
      • The job list is the root directory. All jobs must be created in the job list.
      • You can first create a folder in the job list and then create jobs in that folder. Alternatively, you can create jobs directly in the root directory.


  6. Click OK.

  7. After the job is created, compile the Spark job by referring to the Configurations of a Spark job section in this topic.

  8. Perform any of the following operations as required:

    • Click Save to save the Spark job so that you can reuse it later.

    • Click Execute to execute the Spark job. You can check the execution status in real time.

    After the job is submitted, the following parameters are displayed:
    • Task ID: The ID of the Spark job, which is generated by Data Lake Analytics.
    • State: The running status of the Spark job.
      • STARTING: The Spark job is being submitted.
      • RUNNING: The Spark job is being executed.
      • SUCCESS: The Spark job is executed successfully.
      • DEAD: An error occurred during the execution of the Spark job. You can view the log to troubleshoot the error.
      • KILLED: The Spark job is killed.
    • Task Name: The name of the Spark job, which is specified by the name parameter when the job is compiled by using the command line.
    • Submit Time: The time when the Spark job was submitted.
    • Start Up Time: The time when the Spark job started to run.
    • Update Time: The time when the status of the Spark job last changed.
    • Duration: The time that was required to execute the Spark job.
    • Operation: You can perform the following four operations:
      • Log: queries the log of the Spark job. Only the latest 300 log lines can be queried.
      • SparkUI: opens the Spark UI of the job. If the token expires, click Refresh to obtain the latest address.
      • Details: views the JSON script that is used to submit the Spark job.
      • kill: kills the Spark job.

    Data Lake Analytics provides a SparkPi job as an execution example. To view the example, click Example. Then, click Execute to run it.
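
    For reference, a SparkPi job estimates the value of Pi with a Monte Carlo simulation. The following Scala sketch follows the standard open-source SparkPi pattern and is shown only for illustration; it is not necessarily the exact code behind the Example button.

      import scala.math.random
      import org.apache.spark.sql.SparkSession

      object SparkPiExample {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().appName("Spark Pi").getOrCreate()

          // Number of partitions; it can be passed as the first job argument.
          val slices = if (args.length > 0) args(0).toInt else 2
          val n = math.min(100000L * slices, Int.MaxValue).toInt

          // Sample random points in the unit square and count those inside the unit circle.
          val count = spark.sparkContext.parallelize(1 until n, slices).map { _ =>
            val x = random * 2 - 1
            val y = random * 2 - 1
            if (x * x + y * y <= 1) 1 else 0
          }.reduce(_ + _)

          println(s"Pi is roughly ${4.0 * count / (n - 1)}")
          spark.stop()
        }
      }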


Configurations of a Spark job

This section describes how to compile a Spark job that accesses OSS data by using the command line. The parameters are JSON strings.

  {
      "args": ["oss://${oss-buck-name}/data/test/test.csv"],
      "name": "spark-oss-test",
      "file": "oss://${oss-buck-name}/jars/test/spark-examples-0.0.1-SNAPSHOT.jar",
      "className": "com.aliyun.spark.oss.SparkReadOss",
      "jars": ["oss://${oss-buck-name}/jars/oss/hadoop-aliyun-2.7.2-rc5.jar"],
      "conf": {
          "spark.driver.resourceSpec": "medium",
          "spark.executor.resourceSpec": "medium",
          "spark.executor.instances": 2,
          "spark.hadoop.fs.oss.impl": "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem"
      }
  }
The following list describes the parameters:

  • args (required): The parameters that you want to pass to the Spark job. Separate multiple parameters with commas (,). In the preceding example, the value of args is the OSS path of the test.csv file.
  • name (optional): The name of the Spark job.
  • file (required): The OSS path of the JAR file that is used to run the Spark job, for example, oss://${oss-buck-name}/spark-examples-0.0.1-SNAPSHOT.jar.
    Note: The JAR files on which a Spark job depends must be stored in OSS.
  • className (required): The entry class, such as com.aliyun.spark.oss.SparkReadOss. For more information, see Appendix in this topic.
  • jars (required): The JAR files on which the Spark job depends. Separate multiple JAR files with commas (,). In the preceding example, hadoop-aliyun-2.7.2-rc5.jar is the dependent JAR file.
    Note: The JAR files on which a Spark job depends must be stored in OSS.
  • conf (optional): The configuration fields of the Spark job. The fields must be the same as those configured in open-source Spark and are in the key: value format. Separate multiple fields with commas (,).
    The fields spark.driver.cores, spark.driver.memory, spark.executor.cores, and spark.executor.memory are not supported. Use spark.driver.resourceSpec and spark.executor.resourceSpec instead.
    For example, "conf": {"spark.driver.resourceSpec": "small", "spark.executor.resourceSpec": "medium", "spark.executor.instances": 5}, where:
    • "spark.driver.resourceSpec": "small" indicates that the driver uses the small specification with one vCPU and 4 GB of memory.
    • "spark.executor.resourceSpec": "medium" indicates that each executor uses the medium specification with two vCPUs and 8 GB of memory.
    In the preceding example, "spark.hadoop.fs.oss.impl": "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem" specifies the implementation class that is used to access OSS data, and hadoop-aliyun-2.7.2-rc5.jar is the JAR file that contains this implementation class.
    If you do not specify the conf parameter, the default values that were specified when the virtual cluster was created are used.

Appendix

The following content is the Scala source code of the main class that is used to access OSS data.

  import org.apache.spark.SparkConf
  import org.apache.spark.sql.SparkSession

  object SparkReadOss {
    def main(args: Array[String]): Unit = {
      // Create a SparkSession for the job.
      val conf = new SparkConf().setAppName("spark oss test")
      val sparkSession = SparkSession
        .builder()
        .config(conf)
        .getOrCreate()

      // The first element of args is the OSS path of the CSV file,
      // which is specified by the args parameter of the job configuration.
      val inputPath = args(0)

      // Read the CSV file from OSS and print its content.
      val data = sparkSession.read.format("csv").load(inputPath)
      data.show()

      sparkSession.stop()
    }
  }
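
The following Scala sketch is a hypothetical extension of the preceding class. It reads the same CSV file from OSS, treats the first line as a header, and writes the result back to OSS in Parquet format. The output path that is read from args(1) is an assumption used for illustration only and is not part of the original example.

  import org.apache.spark.sql.SparkSession

  // Hypothetical extension of SparkReadOss: read a CSV file from OSS and write it back as Parquet.
  object SparkReadWriteOss {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession
        .builder()
        .appName("spark oss read-write test")
        .getOrCreate()

      val inputPath = args(0)   // For example, oss://${oss-buck-name}/data/test/test.csv.
      val outputPath = args(1)  // Assumed output location, passed as a second element of the args parameter.

      // Read the CSV file. The OSS file system implementation is taken from the
      // spark.hadoop.fs.oss.impl setting in the conf parameter of the job.
      val data = spark.read
        .format("csv")
        .option("header", "true")
        .load(inputPath)

      // Write the data back to OSS in Parquet format.
      data.write.mode("overwrite").parquet(outputPath)

      spark.stop()
    }
  }

If you use such an extension, add the output path to the args array of the job configuration so that it is passed to args(1).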