
AnalyticDB for MySQL: Overview

Last Updated: Nov 13, 2023

AnalyticDB for MySQL Data Lakehouse Edition (V3.0) provides the same development method for Spark batch applications and streaming applications. This topic describes how to develop Spark batch applications.

Development tools

You can use the following tools to develop Spark batch applications and streaming applications:

Sample code

The following sample code provides an example of how to develop a Spark batch application based on data that is stored in Object Storage Service (OSS). The code includes common parameters such as name and conf, and parameters that are specific to Java, Scala, and Python applications. The parameters are written in the JSON format.

{
  "args": ["args0", "args1"],
  "name": "spark-oss-test",
  "file": "oss://<testBucketName>/jars/test/spark-examples-0.0.1-SNAPSHOT.jar",
  "className": "com.aliyun.spark.oss.SparkReadOss",
  "conf": {
    "spark.driver.resourceSpec": "medium",
    "spark.executor.resourceSpec": "medium",
    "spark.executor.instances": 2,
    "spark.adb.connectors": "oss"
  }
}
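
In this sample, file points to a JAR package and className specifies its entry class. The source code of com.aliyun.spark.oss.SparkReadOss is not included in this topic; the following is a minimal Scala sketch of what such an entry class might look like, assuming that the "spark.adb.connectors": "oss" setting makes oss:// paths readable through the standard Spark APIs and that the OSS input path is passed in through args. The path and the row-count logic are illustrative only.

package com.aliyun.spark.oss

import org.apache.spark.sql.SparkSession

// Hypothetical sketch of an entry class such as com.aliyun.spark.oss.SparkReadOss.
// It assumes that "spark.adb.connectors": "oss" makes oss:// paths accessible
// through the regular Spark read APIs and that the input path arrives in args.
object SparkReadOss {
  def main(args: Array[String]): Unit = {
    // args(0): OSS path to read, for example oss://<testBucketName>/data/input.txt (hypothetical)
    val inputPath = args(0)

    val spark = SparkSession.builder()
      .appName("spark-oss-test")
      .getOrCreate()

    // Read the OSS object as plain text and print a simple row count.
    val df = spark.read.text(inputPath)
    println(s"Row count: ${df.count()}")

    spark.stop()
  }
}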

Common parameters

Parameter: name
Required: No
Example: "name": "spark-oss-test"
Description: The name of the Spark application.

Parameter: file
Required: Yes for Python, Java, and Scala applications
Example: "file": "oss://<testBucketName>/jars/test/spark-examples-0.0.1-SNAPSHOT.jar"
Description: The absolute path of the main file of the Spark application. The main file can be a JAR package that contains the entry point or an executable file that serves as the entry point for the Python application.
Important
  • You must store the main files of Spark applications in OSS.
  • The OSS bucket must reside in the same region as the AnalyticDB for MySQL cluster.

Parameter: files
Required: No
Example: "files": ["oss://<testBucketName>/path/to/files_name1","oss://<testBucketName>/path/to/files_name2"]
Description: The files that are required for the Spark application. These files are downloaded to the working directories of the driver and executor processes.
You can configure aliases for the files. Example: oss://<testBucketName>/test/test1.txt#test1. In this example, test1 is the alias of the file, and you can specify ./test1 or ./test1.txt to access the file. For a sketch of how application code reads such a file, see the example after this table.
Separate multiple files with commas (,).
Note
  • If you specify the log4j.properties file for this parameter, the Spark application uses it as the log configuration file.
  • You must store all files that are required for Spark applications in OSS.

Parameter: archives
Required: No
Example: "archives": ["oss://<testBucketName>/path/to/archives","oss://<testBucketName>/path/to/archives"]
Description: The compressed packages that are required for the Spark application. The packages must be in the .tar.gz format. The packages are decompressed to the working directory of the Spark process.
You can configure aliases for the files that are contained in a package. Example: oss://<testBucketName>/test/test1.tar.gz#test1. In this example, test1 is the alias of the package. If test2.txt is a file contained in the test1.tar.gz package, you can specify ./test1/test2.txt or ./test1.tar.gz/test2.txt to access the file.
Separate multiple packages with commas (,).
Note
You must store all compressed packages that are required for Spark applications in OSS. If a package fails to be decompressed, the job fails.

Parameter: conf
Required: Yes
Example: "conf": {"spark.driver.resourceSpec": "medium", "spark.executor.resourceSpec": "medium", "spark.executor.instances": 2, "spark.adb.connectors": "oss"}
Description: The configuration parameters that are required for the Spark application, similar to those of Apache Spark. The parameters must be in the key:value format. Separate multiple parameters with commas (,). For information about the configuration parameters that are different from those of Apache Spark or that are specific to AnalyticDB for MySQL, see Spark application configuration parameters.
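
The following minimal Scala sketch shows how application code might read a file that is distributed by using the files parameter and accessed by its alias, as described in the table above. It assumes the files example from the table, in which oss://<testBucketName>/test/test1.txt is distributed with the alias test1; an archive distributed by using the archives parameter works the same way with paths such as ./test1/test2.txt.

import scala.io.Source

// Hedged sketch: assumes the application was submitted with
//   "files": ["oss://<testBucketName>/test/test1.txt#test1"]
// so that the file is available as ./test1 in the working directories of the
// driver and executor processes.
object ReadDistributedFile {
  def main(args: Array[String]): Unit = {
    // Open the distributed file by its alias in the working directory.
    val source = Source.fromFile("./test1")
    try {
      source.getLines().foreach(println)
    } finally {
      source.close()
    }
  }
}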

Java application parameters

Parameter: args
Required: No
Example: "args": ["args0", "args1"]
Description: The arguments that are passed to the main class of the JAR package. Separate multiple arguments with commas (,).

Parameter: className
Required: Yes
Example: "className": "com.aliyun.spark.oss.SparkReadOss"
Description: The entry class of the Java application.

Parameter: jars
Required: No
Example: "jars": ["oss://<testBucketName>/path/to/jar","oss://<testBucketName>/path/to/jar"]
Description: The absolute paths of the JAR packages that are required for the Spark application. Separate multiple JAR packages with commas (,). When a Spark application runs, the JAR packages are added to the classpaths of the driver and executor Java virtual machines (JVMs).
Important
  • You must store all JAR packages that are required for Spark applications in OSS.
  • The OSS bucket must reside in the same region as the AnalyticDB for MySQL cluster.

Scala application parameters

Parameter: className
Required: Yes
Example: "className": "com.aliyun.spark.oss.SparkReadOss"
Description: The entry class of the Scala application.

Parameter: jars
Required: No
Example: "jars": ["oss://<testBucketName>/path/to/jar","oss://<testBucketName>/path/to/jar"]
Description: The absolute paths of the JAR packages that are required for the Spark application. Separate multiple JAR packages with commas (,). When a Spark application runs, the JAR packages are added to the classpaths of the driver and executor Java virtual machines (JVMs). For a sketch of how classes from these packages are used, see the example after this table.
Important
  • You must store all JAR packages that are required for Spark applications in OSS.
  • The OSS bucket must reside in the same region as the AnalyticDB for MySQL cluster.
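
For both Java and Scala applications, the packages listed in jars are added to the driver and executor classpaths, so their classes can be imported and used directly in application code. The following Scala sketch illustrates this under the assumption that a commons-lang3 JAR has been uploaded to OSS and listed in jars; the JAR path, application name, and data are hypothetical and are not part of this topic.

import org.apache.commons.lang3.StringUtils
import org.apache.spark.sql.SparkSession

// Hedged sketch: assumes commons-lang3 was uploaded to OSS and listed in "jars",
// for example "jars": ["oss://<testBucketName>/jars/commons-lang3-3.12.0.jar"].
// Because that JAR is on the driver and executor classpaths, its classes can be
// imported and used inside Spark transformations.
object UseClasspathJar {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("jars-example").getOrCreate()
    import spark.implicits._

    // Build a small Dataset and trim each value with a class provided by the extra JAR.
    val cleaned = Seq("  a ", " b", "c  ").toDS()
      .map(value => StringUtils.strip(value))

    cleaned.show()
    spark.stop()
  }
}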

Python application parameters

Parameter: pyFiles
Required: Yes
Example: "pyFiles": ["oss://<testBucketName>/path/to/pyfiles","oss://<testBucketName>/path/to/pyfiles"]
Description: The Python files that are required for the PySpark application. The files must be in the ZIP, PY, or EGG format. If multiple Python files are required, we recommend that you use files in the ZIP or EGG format. You can reference the Python files as modules in Python code. Separate multiple files with commas (,).
Note
You must store all Python files that are required for Spark applications in OSS.