This topic describes how to set up a Spark on MaxCompute development environment that runs the Windows operating system.
If you have a Linux operating system installed, follow the instructions in Set up a Linux development environment to set up a Spark on MaxCompute development environment that runs the Linux operating system.
Prerequisites
Before you set up a Spark on MaxCompute development environment, make sure that the following software is installed in your Windows operating system:
The software version number and software installation path used in this topic are for reference only. The actual software version that you must download and install may differ based on your operating system.
JDK
In this example, JDK 1.8.0_361 is used. For more information about how to download JDK, go to the JDK official website.
Python
In this example, Python 3.7 is used. For more information about how to download Python, go to the Python official website.
Note: In this example, Spark 2.4.5 is used. If you use another version of Spark, download and install a version of Python that corresponds to the Spark version. For more information, see https://pypi.org/project/pyspark/.
Maven
In this example, Apache Maven 3.8.7 is used. For more information about how to download Apache Maven, go to the Maven official website.
Git
In this example, Git 2.39.1.windows.1 is used. For more information about how to download Git, go to the Git official website.
Scala
In this example, Scala 2.13.10 is used. For more information about how to download Scala, go to the Scala official website.
Download the Spark on MaxCompute client package
The Spark on MaxCompute client package is released with the MaxCompute authentication feature. This allows Spark on MaxCompute to serve as a client that submits jobs to your MaxCompute project by using the spark-submit script. MaxCompute provides release packages for Spark 1.x, Spark 2.x, and Spark 3.x. You can download these packages from the following links. In this example, Spark 2.4.5 is used.
Spark 1.6.3: used to develop Spark 1.x applications.
Spark 2.3.0: used to develop Spark 2.x applications.
Spark 2.4.5: used to develop Spark 2.x applications. For more information about the precautions on using Spark 2.4.5, see Precautions on using Spark 2.4.5.
Spark 3.1.1: used to develop Spark 3.x applications. For more information about the precautions on using Spark 3.1.1, see Precautions on using Spark 3.1.1.
Configure environment variables
In the Windows operating system, right-click This PC and select Properties from the shortcut menu. On the page that appears, click Advanced system settings. On the Advanced tab, click Environment Variables and configure the environment variables. The following content describes how to configure each environment variable.
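If you prefer the command line, you can also set these variables from a Command Prompt with the built-in setx command instead of the dialog. The following is a minimal sketch that uses the example installation paths from this topic; replace them with your actual paths. Note that setx writes user variables by default (add /M in an elevated Command Prompt to write system variables) and only takes effect in newly opened Command Prompt windows. The Path variable is still easiest to edit in the Environment Variables dialog.
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_361"
setx SPARK_HOME "D:\PC\spark\spark-2.4.5-odps0.33.2"
rem The same approach applies to the other variables, such as MAVEN_HOME.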
Configure Java environment variables.
Obtain the Java installation path.
Edit Java environment variables.
Add the JAVA_HOME variable to the system variables and set its value to the Java installation path.
Add %JAVA_HOME%\bin to the Path variable in the system variables.
Check whether the Java environment variables are successfully configured.
Verification method
Press Win+R. In the Run dialog box, enter cmd and click OK. In the Command Prompt, enter java -version. If the returned result is as expected, the Java environment variables are successfully configured.
Example of an expected result:
java version "1.8.0_361"
Java(TM) SE Runtime Environment (build 1.8.0_361-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.361-b09, mixed mode)
Configure Spark environment variables.
Obtain the path to which the Spark on MaxCompute package is decompressed.
Edit Spark environment variables.
Add the SPARK_HOME variable to the system variables and set its value to the path to which the Spark on MaxCompute client package is decompressed.
Add %SPARK_HOME%\bin to the Path variable in the system variables.
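A quick way to confirm that the Spark on MaxCompute scripts are reachable is to open a new Command Prompt and check how the Path resolves. This is only a sanity check; the printed path depends on where you decompressed the package.
echo %SPARK_HOME%
where spark-submit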
Configure Scala environment variables.
Check whether Scala environment variables are successfully configured.
Verification method
Press Win+R. In the Run dialog box, enter cmd and click OK. In the Command Prompt, enter scala. If the returned result is as expected, the Scala environment variables are successfully configured.
Example of an expected result:
Welcome to Scala 2.13.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_361).
Type in expressions for evaluation. Or try :help.
scala>
Configure Python environment variables.
Obtain the Python installation path.
Edit Python environment variables.
Add the Python installation path and the Scripts subdirectory of the Python installation path to the Path variable in the system variables.
Check whether the Python environment variables are successfully configured.
Verification method
Press Win+R. In the Run dialog box, enter cmd and click OK. In the Command Prompt, enter python --version. If the returned result is as expected, the Python environment variables are successfully configured.
Example of an expected result:
Python 3.10.6
Configure Maven environment variables.
Obtain the path to which the Maven package is decompressed.
Edit Maven environment variables.
Add the MAVEN_HOME variable to the system variables and set its value to the path to which the Maven package is decompressed.
Add %MAVEN_HOME%\bin to the Path variable in the system variables.
Check whether the Maven environment variables are successfully configured.
Verification method
Press Win+R. In the Run dialog box, enter cmd and click OK. In the Command Prompt, enter mvn --version. If the returned result is as expected, the Maven environment variables are successfully configured.
Example of an expected result:
# *** indicates the partial path to which the Maven package is decompressed.
Apache Maven 3.8.7 (b89d5959fcde851dcb1c8946a785a163f14e1e29)
Maven home: D:\***\apache-maven-3.8.7-bin\apache-maven-3.8.7
Java version: 1.8.0_361, vendor: Oracle Corporation, runtime: C:\Program Files\Java\jdk1.8.0_361\jre
Default locale: zh_CN, platform encoding: GBK
OS name: "windows 10", version: "10.0", arch: "amd64", family: "windows"
Configure Git environment variables.
Check whether the Git environment variables are successfully configured.
Verification method
Press Win+R. In the Run dialog box, enter cmd and click OK. In the Command Prompt, enter git --version. If the returned result is as expected, the Git environment variables are successfully configured.
Example of an expected result:
git version 2.39.1.windows.1
Configure the spark-defaults.conf file.
If you use the Spark on MaxCompute client for the first time, rename the spark-defaults.conf.template file in the conf folder of the path to which the Spark on MaxCompute client package is decompressed to spark-defaults.conf, and then configure the file. If both the spark-defaults.conf.template and spark-defaults.conf files exist, you do not need to rename the file. You need only to configure the spark-defaults.conf file. Sample code:
# Go to the path to which the Spark on MaxCompute client package is decompressed and open the conf folder. The actual path may vary.
# Open the spark-defaults.conf file.
# Add the following configurations to the end of the configuration file:
spark.hadoop.odps.project.name = <MaxCompute_project_name>
spark.hadoop.odps.access.id = <AccessKey_id>
spark.hadoop.odps.access.key = <AccessKey_secret>
# The endpoint that is used to connect the Spark on MaxCompute client to your MaxCompute project. You can modify the endpoint based on your business requirements.
spark.hadoop.odps.end.point = <Endpoint>
# For Spark 2.3.0, set spark.sql.catalogImplementation to odps. For Spark 2.4.5, set spark.sql.catalogImplementation to hive.
spark.sql.catalogImplementation={odps|hive}
# Retain the following parameter configurations:
spark.hadoop.odps.task.major.version = cupid_v2
spark.hadoop.odps.cupid.container.image.enable = true
spark.hadoop.odps.cupid.container.vm.engine.type = hyper
spark.hadoop.odps.moye.trackurl.host = http://jobview.odps.aliyun.com
MaxCompute_project_name: the name of the MaxCompute project that you want to access.
This parameter specifies the name of your MaxCompute project instead of the DataWorks workspace to which the MaxCompute project corresponds. You can log on to the MaxCompute console. In the top navigation bar, select a region. In the left-side navigation pane, choose Workspace > Projects to view the name of the MaxCompute project.
AccessKey_id: the AccessKey ID that is used to access the MaxCompute project.
You can obtain the AccessKey ID from the AccessKey Pair page.
AccessKey_secret: the AccessKey secret that corresponds to the AccessKey ID.
You can obtain the AccessKey secret from the AccessKey Pair page.
Endpoint: the public endpoint of the region where your MaxCompute project resides.
For more information about the public endpoint of each region, see Endpoints in different regions (Internet).
VPC_endpoint: the VPC endpoint of the region where your MaxCompute project resides.
For more information about the VPC endpoint of each region, see Endpoints in different regions (VPC).
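For reference, the following is a filled-in sketch of the connection settings for a Spark 2.4.5 client and a project in the China (Hangzhou) region. The project name is a placeholder, the AccessKey placeholders must be replaced with your own values, and you should verify the endpoint against the endpoint documentation referenced above.
spark.hadoop.odps.project.name = doc_test_project
spark.hadoop.odps.access.id = <AccessKey_id>
spark.hadoop.odps.access.key = <AccessKey_secret>
spark.hadoop.odps.end.point = http://service.cn-hangzhou.maxcompute.aliyun.com/api
spark.sql.catalogImplementation = hive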
Prepare a project
Spark on MaxCompute provides a demo project template. We recommend that you download and copy the template to develop your application.
In the demo project, the dependency scope for Spark on MaxCompute is provided. Do not change this scope. Otherwise, the job that you submit may not run as expected.
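For example, a Spark dependency in the demo POM keeps the provided scope, roughly as in the following sketch. The artifact and version shown here are only an illustration for the Spark 2.4.5/Scala 2.11 case; check the actual POM file in the template for the exact entries.
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.5</version>
    <!-- Keep the provided scope. The MaxCompute cluster supplies Spark at runtime. -->
    <scope>provided</scope>
</dependency>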
Prepare a project in the Windows operating system.
Download the Spark-1.x template and compile the template.
# Start the downloaded Git client (Git Bash), go to the directory to which the project is downloaded, and then run the following code:
git clone https://github.com/aliyun/MaxCompute-Spark.git
# Go to the project folder.
cd MaxCompute-Spark/spark-1.x
# Compile the project package.
mvn clean package
Download the Spark-2.x template and compile the template.
# Start the downloaded Git client (Git Bash), go to the directory to which the project is downloaded, and then run the following code:
git clone https://github.com/aliyun/MaxCompute-Spark.git
# Go to the project folder.
cd MaxCompute-Spark/spark-2.x
# Compile the project package.
mvn clean package
Download the Spark-3.x template and compile the template.
# Start the downloaded Git client (Git Bash), go to the directory to which the project is downloaded, and then run the following code:
git clone https://github.com/aliyun/MaxCompute-Spark.git
# Go to the project folder.
cd MaxCompute-Spark/spark-3.x
# Compile the project package.
mvn clean package
If the compilation fails after you run the preceding commands, some environment configurations are invalid. Check the configurations against the preceding instructions and fix any invalid configurations.
Configure dependencies
In the Spark on MaxCompute project that you prepared, configure the dependencies. The following content provides sample commands that you can run on the Git client to configure dependencies. You can also directly open related files and configure dependencies.
Configure the dependencies that are required for accessing tables in your MaxCompute project.
The Spark-1.x template is used.
# Go to the spark-1.x folder.
cd MaxCompute-Spark/spark-1.x
# Edit the POM file to add the odps-spark-datasource dependency.
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>odps-spark-datasource_2.10</artifactId>
    <version>3.3.8-public</version>
</dependency>
The Spark-2.x template is used.
# Go to the spark-2.x folder.
cd MaxCompute-Spark/spark-2.x
# Edit the POM file to add the odps-spark-datasource dependency.
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>odps-spark-datasource_2.11</artifactId>
    <version>3.3.8-public</version>
</dependency>
Configure the dependency that is required for accessing Object Storage Service (OSS).
If your job needs to access OSS, add the following dependency:
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>hadoop-fs-oss</artifactId>
    <version>3.3.8-public</version>
</dependency>
For more information about the dependencies that are required when the Spark-1.x, Spark-2.x, or Spark-3.x template is used, see Spark-1.x pom, Spark-2.x pom, or Spark-3.x pom.
Smoke testing
After you complete the preceding operations, conduct smoke testing to check the end-to-end connectivity of Spark on MaxCompute.
SparkPi smoke testing
For example, you can run the following commands to conduct SparkPi smoke testing for a Spark 2.x application:
# Press Win+R. In the Run dialog box, enter cmd and click OK.
# Go to the bin folder in the path to which the Spark on MaxCompute client package is decompressed. In this example, the path is D:\PC\spark\spark-2.4.5-odps0.33.2.
cd D:\PC\spark\spark-2.4.5-odps0.33.2\bin
# Run the following command. The caret (^) continues the command on the next line in the Command Prompt.
spark-submit ^
--class com.aliyun.odps.spark.examples.SparkPi ^
--master yarn ^
--deploy-mode cluster ^
/path/to/your/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar
# If the following log information is displayed, smoke testing is successful.
19/06/11 11:57:30 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 11.222.166.90
ApplicationMaster RPC port: 38965
queue: queue
start time: 1560225401092
final status: SUCCEEDED
IntelliJ IDEA smoke testing in local mode
Open the downloaded project in IntelliJ IDEA and add the directory specified by the --jars parameter of Spark on MaxCompute to the project in IntelliJ IDEA. For more information, see Notes on running Spark on MaxCompute in local mode by using IntelliJ IDEA.
Add the following code in IntelliJ IDEA for debugging in local mode:
val spark = SparkSession
  .builder()
  .appName("SparkPi")
  .config("spark.master", "local[4]") // The code can run after you set spark.master to local[N]. N indicates the number of concurrent Spark jobs.
  .getOrCreate()
You must specify the related configurations in the odps.conf file in the main/resources directory when you run the code in IntelliJ IDEA in local mode. You cannot directly reference the configurations in the spark-defaults.conf file. The following code provides an example.
Note: You must specify configuration items in the odps.conf file for Spark 2.4.5 and later.
odps.access.id=""
odps.access.key=""
odps.end.point=""
odps.project.name=""
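Putting the preceding pieces together, the following is a minimal local-mode example that you can run in IntelliJ IDEA after the odps.conf file is in place. The object name is arbitrary, and the query does not depend on any MaxCompute table, so it only verifies that the Spark session starts correctly.
import org.apache.spark.sql.SparkSession

object LocalSmokeTest {
  def main(args: Array[String]): Unit = {
    // local[4] runs Spark locally with 4 concurrent tasks.
    val spark = SparkSession
      .builder()
      .appName("LocalSmokeTest")
      .config("spark.master", "local[4]")
      .getOrCreate()

    // A trivial query that only checks that the session works.
    spark.sql("SELECT 1 + 1").show()

    spark.stop()
  }
}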
Precautions on using Spark 2.4.5
Use Spark 2.4.5 to submit jobs
Submit a job in a Yarn cluster. For more information, see Cluster mode.
Changes in using Spark 2.4.5
If you submit jobs in a Yarn cluster, you must add the HADOOP_CONF_DIR environment variable and set it to $SPARK_HOME/conf. A Windows example follows after this list.
If you perform debugging in local mode, you must create a file named odps.conf in the $SPARK_HOME/conf directory and add the following configurations to the file:
odps.project.name =
odps.access.id =
odps.access.key =
odps.end.point =
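On Windows, a minimal way to set the HADOOP_CONF_DIR variable for the current Command Prompt session before you run spark-submit is shown below. The decompression path is the example path used earlier in this topic and must match your environment.
set SPARK_HOME=D:\PC\spark\spark-2.4.5-odps0.33.2
set HADOOP_CONF_DIR=%SPARK_HOME%\conf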
Changes in the parameter settings of Spark 2.4.5
spark.sql.catalogImplementation: This parameter is set to hive.
spark.sql.sources.default: This parameter is set to hive.
spark.sql.odps.columnarReaderBatchSize: specifies the number of rows that the vectorized reader reads at a time. Default value: 4096.
spark.sql.odps.enableVectorizedReader: specifies whether to enable the vectorized reader. Default value: True.
spark.sql.odps.enableVectorizedWriter: specifies whether to enable the vectorized writer. Default value: True.
spark.sql.odps.split.size: This parameter can be used to adjust the concurrency of data reading operations on MaxCompute tables. By default, this parameter is set to 256 for each partition. Unit: MB.
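If needed, these values can be overridden per job. For example, the following sketch raises the split size for a single submission of the SparkPi example from the smoke test; the value 512 is arbitrary, and --conf is the standard spark-submit way to pass a configuration item.
spark-submit ^
--conf spark.sql.odps.split.size=512 ^
--class com.aliyun.odps.spark.examples.SparkPi ^
--master yarn ^
--deploy-mode cluster ^
/path/to/your/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar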
Precautions on using Spark 3.1.1
Use Spark 3.1.1 to submit jobs
Submit a job in a Yarn cluster. For more information, see Cluster mode.
Changes in using Spark 3.1.1
If you submit jobs in a Yarn cluster, you must add the HADOOP_CONF_DIR environment variable and set it to $SPARK_HOME/conf.
If you submit PySpark jobs in a Yarn cluster, you must add the following configurations to the spark-defaults.conf file to use Python 3:
spark.hadoop.odps.cupid.resources = public.python-3.7.9-ucs4.tar.gz
spark.pyspark.python = ./public.python-3.7.9-ucs4.tar.gz/python-3.7.9-ucs4/bin/python3
If you perform debugging in local mode:
You must create an odps.conf file in the $SPARK_HOME/conf directory and add the following configurations to the file:
odps.project.name =
odps.access.id =
odps.access.key =
odps.end.point =
You must add spark.hadoop.fs.defaultFS = file:///. Sample code:
val spark = SparkSession
  .builder()
  .config("spark.hadoop.fs.defaultFS", "file:///")
  .enableHiveSupport()
  .getOrCreate()
Changes in the parameter settings of Spark 3.1.1
spark.sql.defaultCatalog: This parameter is set to odps.
spark.sql.catalog.odps: This parameter is set to org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog.
spark.sql.sources.partitionOverwriteMode: This parameter is set to dynamic.
spark.sql.extensions: This parameter is set to org.apache.spark.sql.execution.datasources.v2.odps.extension.OdpsExtensions.
spark.sql.odps.enableVectorizedReader: specifies whether to enable the vectorized reader. Default value: True.
spark.sql.odps.enableVectorizedWriter: specifies whether to enable the vectorized writer. Default value: True.
spark.sql.catalog.odps.splitSizeInMB: This parameter can be used to adjust the concurrency of data reading operations on MaxCompute tables. By default, this parameter is set to 256 for each partition. Unit: MB.