
AnalyticDB: Introduction to Spark application development

Last Updated: Mar 30, 2026

AnalyticDB for MySQL uses the same development method for Spark batch applications and streaming applications. This topic describes the available development tools, configuration parameters, and language-specific parameters for Java, Scala, and Python applications.

Prerequisites

Before you begin, make sure that:

  • You have an AnalyticDB for MySQL cluster.

  • You have an Object Storage Service (OSS) bucket in the same region as the cluster.

All application files, including JARs, Python files, dependencies, and compressed packages, must be stored in OSS.

Development tools

You can use one of the following tools to develop Spark batch applications and streaming applications:

Application configuration

Spark applications in AnalyticDB for MySQL are configured using JSON. The following example shows a Java application that reads data from OSS, including the common parameters (name, file, conf) and the Java-specific parameters (args, className).

{
  "args": ["args0", "args1"],
  "name": "spark-oss-test",
  "file": "oss://<testBucketName>/jars/test/spark-examples-0.0.1-SNAPSHOT.jar",
  "className": "com.aliyun.spark.oss.SparkReadOss",
  "conf": {
    "spark.driver.resourceSpec": "medium",
    "spark.executor.resourceSpec": "medium",
    "spark.executor.instances": 2,
    "spark.adb.connectors": "oss"
  }
}

For language-specific parameter details, see Java application parameters, Scala application parameters, or Python application parameters.

Common parameters

| Parameter | Required | Description | Example |
| --- | --- | --- | --- |
| name | No | The name of the Spark application. | "name": "spark-oss-test" |
| file | Yes (Java, Scala, Python) | The absolute OSS path of the application's main file. For Java and Scala, this is the JAR file that contains the entry point. For Python, this is the executable entry point. The OSS bucket must be in the same region as the cluster. | "file": "oss://<testBucketName>/jars/test/spark-examples-0.0.1-SNAPSHOT.jar" |
| files | No | OSS paths of additional files to download to the driver and executor working directories. Supports aliases using # (for example, oss://<testBucketName>/test/test1.txt#test1 makes the file accessible as ./test1 or ./test1.txt). Separate multiple paths with commas. If you include log4j.properties, Spark uses it as the log configuration file. | "files": ["oss://<testBucketName>/path/to/file1", "oss://<testBucketName>/path/to/file2"] |
| archives | No | OSS paths of TAR.GZ compressed packages to decompress into the Spark process working directory. Supports aliases using # (for example, oss://<testBucketName>/test/test1.tar.gz#test1 makes test2.txt inside the package accessible as ./test1/test2.txt or ./test1.tar.gz/test2.txt). Separate multiple paths with commas. If a package fails to decompress, the job fails. | "archives": ["oss://<testBucketName>/path/to/archive1.tar.gz", "oss://<testBucketName>/path/to/archive2.tar.gz"] |
| conf | Yes | Spark configuration in key: value format, similar to Apache Spark. Separate multiple entries with commas. For parameters specific to AnalyticDB for MySQL, see Spark application configuration parameters. | "conf": {"spark.driver.resourceSpec": "medium", "spark.executor.resourceSpec": "medium", "spark.executor.instances": 2, "spark.adb.connectors": "oss"} |
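The files and archives parameters can be combined in a single configuration. The following sketch is illustrative only: the bucket, paths, alias names, and main class (com.example.Demo) are hypothetical. With these settings, the aliased text file would be readable as ./lookup, and the archive contents would be extracted under ./env/.

{
  "name": "spark-alias-test",
  "file": "oss://<testBucketName>/jars/demo.jar",
  "className": "com.example.Demo",
  "files": [
    "oss://<testBucketName>/conf/log4j.properties",
    "oss://<testBucketName>/data/lookup.txt#lookup"
  ],
  "archives": [
    "oss://<testBucketName>/deps/env.tar.gz#env"
  ],
  "conf": {
    "spark.driver.resourceSpec": "medium",
    "spark.executor.resourceSpec": "medium",
    "spark.executor.instances": 2
  }
}

Because log4j.properties appears in files, Spark would use it as the log configuration file for this job.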

Java application parameters

| Parameter | Required | Description | Example |
| --- | --- | --- | --- |
| args | No | Arguments passed to the JAR. Separate multiple arguments with commas. | "args": ["args0", "args1"] |
| className | Yes | The main class of the Java application. | "className": "com.aliyun.spark.oss.SparkReadOss" |
| jars | No | Absolute OSS paths of additional JAR files added to the driver and executor JVM (Java Virtual Machine) classpaths at runtime. The OSS bucket must be in the same region as the cluster. Separate multiple paths with commas. | "jars": ["oss://<testBucketName>/path/to/app.jar", "oss://<testBucketName>/path/to/lib.jar"] |
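Combining the Java-specific parameters with the common parameters yields a complete configuration. The sketch below reuses the JAR and main class from the earlier example; the lib.jar path is hypothetical.

{
  "args": ["args0", "args1"],
  "name": "spark-oss-test",
  "file": "oss://<testBucketName>/jars/test/spark-examples-0.0.1-SNAPSHOT.jar",
  "className": "com.aliyun.spark.oss.SparkReadOss",
  "jars": ["oss://<testBucketName>/jars/test/lib.jar"],
  "conf": {
    "spark.driver.resourceSpec": "medium",
    "spark.executor.resourceSpec": "medium",
    "spark.executor.instances": 2,
    "spark.adb.connectors": "oss"
  }
}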

Scala application parameters

| Parameter | Required | Description | Example |
| --- | --- | --- | --- |
| className | Yes | The main class of the Scala application. | "className": "com.aliyun.spark.oss.SparkReadOss" |
| jars | No | Absolute OSS paths of additional JAR files added to the driver and executor JVM classpaths at runtime. The OSS bucket must be in the same region as the cluster. Separate multiple paths with commas. | "jars": ["oss://<testBucketName>/path/to/app.jar", "oss://<testBucketName>/path/to/lib.jar"] |

Python application parameters

| Parameter | Required | Description | Example |
| --- | --- | --- | --- |
| pyFiles | Yes | OSS paths of Python files for the PySpark application. Supported formats: ZIP, PY, or EGG. For multiple files, use ZIP or EGG format. Python files can be imported as modules in your code. Separate multiple paths with commas. | "pyFiles": ["oss://<testBucketName>/path/to/app.zip", "oss://<testBucketName>/path/to/lib.egg"] |
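For a PySpark application, the file parameter still points to the executable entry point (a .py file), while pyFiles supplies the importable modules. The following sketch is illustrative; the main.py and deps.zip names are hypothetical.

{
  "name": "spark-python-test",
  "file": "oss://<testBucketName>/python/main.py",
  "pyFiles": ["oss://<testBucketName>/python/deps.zip"],
  "conf": {
    "spark.driver.resourceSpec": "medium",
    "spark.executor.resourceSpec": "medium",
    "spark.executor.instances": 2
  }
}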