This topic describes how to set up a Spark on MaxCompute development environment.
Prerequisites
- JDK 1.8
The following sample command shows how to install JDK 1.8 in a Linux operating system. The actual JDK package name prevails. You can run the yum -y list java* command to obtain the JDK package name.
yum install -y java-1.8.0-openjdk-devel.x86_64
- Python 2.7
The following sample commands show how to install Python 2.7 in a Linux operating system. The actual Python package name prevails.
# Obtain the Python package.
wget https://www.python.org/ftp/python/2.7.10/Python-2.7.10.tgz
# Decompress the Python package.
tar -zxvf Python-2.7.10.tgz
# Switch to the path to which the Python package is decompressed and specify the installation path.
cd Python-2.7.10
./configure --prefix=/usr/local/python2
# Compile and install the Python package.
make
make install
- Maven
The following sample commands show how to install Maven in a Linux operating system. The actual path of the Maven package prevails.
# Obtain the Maven package.
wget https://dlcdn.apache.org/maven/maven-3/3.8.4/binaries/apache-maven-3.8.4-bin.tar.gz
# Decompress the Maven package.
tar -zxvf apache-maven-3.8.4-bin.tar.gz
- Git
The following sample commands show how to install Git in a Linux operating system.
# Obtain the Git package.
wget https://github.com/git/git/archive/v2.17.0.tar.gz
# Decompress the Git package.
tar -zxvf v2.17.0.tar.gz
# Install the dependencies that are required for compiling the source code of Git.
yum install curl-devel expat-devel gettext-devel openssl-devel zlib-devel gcc perl-ExtUtils-MakeMaker
# Switch to the path to which the Git package is decompressed.
cd git-2.17.0
# Compile the source code of Git.
make prefix=/usr/local/git all
# Install Git in the /usr/local/git path.
make prefix=/usr/local/git install
Download the Spark on MaxCompute client package and upload the package to your operating system
- Spark-1.6.3: used to develop Spark 1.x applications.
- Spark-2.3.0: used to develop Spark 2.x applications.
- Spark-2.4.5: used to develop Spark 2.x applications. For more information, see Notes on using Spark 2.4.5.
Upload the Spark on MaxCompute client package to your Linux operating system and decompress the package. You can go to the path where the Spark on MaxCompute client package is located and run the following command to decompress the package:
tar -xzvf spark-2.3.0-odps0.33.0.tar.gz
Configure environment variables
You must configure the following environment variables in the CLI of your operating system. The following content describes how to configure them in a Linux operating system.
- Configure Java environment variables.
- Obtain the Java installation path. Sample command:
# If you use yum to install Java, Java is installed in the /usr path by default. You can run the following commands to query the Java installation path. If you use a custom installation path, the actual path prevails.
whereis java
ls -lrt /usr/bin/java
ls -lrt /etc/alternatives/java
# The following information is returned. /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.322.b06-1.1.al7.x86_64 is the Java installation path.
/etc/alternatives/java -> /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.322.b06-1.1.al7.x86_64/jre/bin/java
- Edit Java environment variables. Sample commands:
# Edit the configuration file for environment variables.
vim /etc/profile
# Press i to enter the edit mode and add environment variables to the end of the configuration file.
# Replace JAVA_HOME with the actual Java installation path.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.322.b06-1.1.al7.x86_64
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH
# Press the Esc key to exit the edit mode and enter :wq to close the configuration file.
# Run the following command to make the modification take effect:
source /etc/profile
# Check whether the Java environment variables are successfully configured.
java -version
# Sample output:
openjdk version "1.8.0_322"
OpenJDK Runtime Environment (build 1.8.0_322-b06)
OpenJDK 64-Bit Server VM (build 25.322-b06, mixed mode)
- Configure Spark environment variables.
- Obtain the path to which the Spark on MaxCompute client package is decompressed. In this example, the path is /root/spark-2.3.0-odps0.33.0. The actual decompression path and name prevail.
- Edit Spark environment variables. Sample commands:
# Edit the configuration file for environment variables.
vim /etc/profile
# Press i to enter the edit mode and add environment variables to the end of the configuration file.
# Replace SPARK_HOME with the actual path to which the Spark on MaxCompute client package is decompressed.
export SPARK_HOME=/root/spark-2.3.0-odps0.33.0
export PATH=$SPARK_HOME/bin:$PATH
# Press the Esc key to exit the edit mode and enter :wq to close the configuration file.
# Run the following command to make the modification take effect:
source /etc/profile
- Configure Python environment variables.
If you use PySpark, you must configure Python environment variables.
- Obtain the Python installation path. Sample command:
whereis python
- Edit Python environment variables. Sample commands:
# Edit the configuration file for environment variables.
vim /etc/profile
# Press i to enter the edit mode and add environment variables to the end of the configuration file.
# Replace /usr/bin/python/bin/ with the actual Python installation path.
export PATH=/usr/bin/python/bin/:$PATH
# Press the Esc key to exit the edit mode and enter :wq to close the configuration file.
# Run the following command to make the modification take effect:
source /etc/profile
# Check whether the Python environment variables are successfully configured.
python --version
# Sample output:
Python 2.7.5
- Configure Maven environment variables.
- Obtain the path to which the Maven package is decompressed. In this example, the path is /root/apache-maven-3.8.4. The actual decompression path and name prevail.
- Edit Maven environment variables. Sample commands:
# Edit the configuration file for environment variables.
vim /etc/profile
# Press i to enter the edit mode and add environment variables to the end of the configuration file.
# Replace MAVEN_HOME with the actual path to which the Maven package is decompressed.
export MAVEN_HOME=/root/apache-maven-3.8.4
export PATH=$MAVEN_HOME/bin:$PATH
# Press the Esc key to exit the edit mode and enter :wq to close the configuration file.
# Run the following command to make the modification take effect:
source /etc/profile
# Check whether the Maven environment variables are successfully configured.
mvn -version
# Sample output:
Apache Maven 3.8.4 (9b656c72d54e5bacbed989b64718c159fe39b537)
Maven home: /root/apache-maven-3.8.4
Java version: 1.8.0_322, vendor: Red Hat, Inc., runtime: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.322.b06-1.1.al7.x86_64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.19.91-25.1.al7.x86_64", arch: "amd64", family: "unix"
- Configure Git environment variables.
- Obtain the Git installation path. Sample command:
whereis git
- Edit Git environment variables. Sample commands:
# Edit the configuration file for environment variables.
vim /etc/profile
# Press i to enter the edit mode and add environment variables to the end of the configuration file.
# Replace /usr/local/git/bin/ with the actual Git installation path.
export PATH=/usr/local/git/bin/:$PATH
# Press the Esc key to exit the edit mode and enter :wq to close the configuration file.
# Run the following command to make the modification take effect:
source /etc/profile
# Check whether the Git environment variables are successfully configured.
git --version
# Sample output:
git version 2.24.4
Configure the spark-defaults.conf file
You must rename the spark-defaults.conf.template file in the conf folder of the decompression path of the Spark on MaxCompute client package as spark-defaults.conf and then configure information in the file. If you do not rename the file, the configuration cannot take effect. Sample commands:
# Switch to the path to which the Spark on MaxCompute client package is decompressed and go to the conf folder. The actual path prevails.
cd /root/spark-2.3.0-odps0.33.0/conf
# Rename the spark-defaults.conf.template file.
mv spark-defaults.conf.template spark-defaults.conf
# Edit the spark-defaults.conf file.
vim spark-defaults.conf
# Press i to enter the edit mode and add the following configuration to the end of the configuration file.
spark.hadoop.odps.project.name = <MaxCompute_project_name>
spark.hadoop.odps.access.id = <AccessKey_id>
spark.hadoop.odps.access.key = <AccessKey_secret>
spark.hadoop.odps.end.point = <Endpoint> # The endpoint that is used to connect the Spark on MaxCompute client to your MaxCompute project. You can modify the endpoint based on your business requirements. For more information about endpoints, see Endpoints.
spark.hadoop.odps.runtime.end.point = <VPC_endpoint> # The virtual private cloud (VPC) endpoint of the region where your MaxCompute project resides. You can modify the endpoint based on your business requirements.
# For Spark 2.3.0, set spark.sql.catalogImplementation to odps. For Spark 2.4.5, set spark.sql.catalogImplementation to hive.
spark.sql.catalogImplementation={odps|hive}
# Retain the configurations of the following parameters:
spark.hadoop.odps.task.major.version = cupid_v2
spark.hadoop.odps.cupid.container.image.enable = true
spark.hadoop.odps.cupid.container.vm.engine.type = hyper
spark.hadoop.odps.cupid.webproxy.endpoint = http://service.cn.maxcompute.aliyun-inc.com/api
spark.hadoop.odps.moye.trackurl.host = http://jobview.odps.aliyun.com
- MaxCompute_project_name: the name of the MaxCompute project that you want to access.
This parameter specifies the name of your MaxCompute project, not the name of the DataWorks workspace to which the MaxCompute project corresponds. You can log on to the MaxCompute console, select a region in the top navigation bar, and then view the name of the MaxCompute project on the Project management tab.
- AccessKey_id: the AccessKey ID that is used to access the MaxCompute project.
You can obtain the AccessKey ID from the AccessKey Pair page.
- AccessKey_secret: the AccessKey secret that corresponds to the AccessKey ID.
You can obtain the AccessKey secret from the AccessKey Pair page.
- Endpoint: the public endpoint of the region where your MaxCompute project resides.
For more information about the public endpoint of each region, see Endpoints in different regions (Internet).
- VPC_endpoint: the VPC endpoint of the region where your MaxCompute project resides.
For more information about the VPC endpoint of each region, see Endpoints in different regions (VPC).
Other configuration parameters are required for special scenarios and features. For more information, see Spark on MaxCompute configuration details.
Prepare a project
- Download the Spark-2.x template and compile the template.
git clone https://github.com/aliyun/MaxCompute-Spark.git
cd MaxCompute-Spark/spark-2.x
mvn clean package
- Download the Spark-1.x template and compile the template.
git clone https://github.com/aliyun/MaxCompute-Spark.git
cd MaxCompute-Spark/spark-1.x
mvn clean package
If an error message appears after you run the preceding commands and indicates that the environment configuration is invalid, check the configurations against the preceding instructions and correct any invalid configurations.
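If the compilation succeeds, the packaged JAR files are generated in the target folder of the corresponding template. As a quick check, you can list that folder; the exact artifact names depend on the template version, so the file name below is only an example.
# Check the build output of the Spark-2.x template.
ls MaxCompute-Spark/spark-2.x/target/
# The output is expected to include a shaded JAR similar to spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar, which is the package that you submit to MaxCompute.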
Configure dependencies
In the Spark on MaxCompute project that you prepared, configure the required dependencies. The following examples show how to configure them.
- Configure the dependencies that are required for accessing tables in your MaxCompute
project.
- The Spark-1.x template is used.
# Go to the spark-1.x folder.
cd MaxCompute-Spark/spark-1.x
# Edit the POM file to add the odps-spark-datasource dependency.
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>odps-spark-datasource_2.10</artifactId>
    <version>3.3.8-public</version>
</dependency>
- The Spark-2.x template is used.
# Go to the spark-2.x folder.
cd MaxCompute-Spark/spark-2.x
# Edit the POM file to add the odps-spark-datasource dependency.
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>odps-spark-datasource_2.11</artifactId>
    <version>3.3.8-public</version>
</dependency>
- Configure the dependency that is required for accessing Object Storage Service (OSS).
If your job needs to access OSS, add the following dependency:
<dependency>
    <groupId>com.aliyun.odps</groupId>
    <artifactId>hadoop-fs-oss</artifactId>
    <version>3.3.8-public</version>
</dependency>
For more information about the dependencies that are required when the Spark-1.x or Spark-2.x template is used, see Spark-1.x POM file or Spark-2.x POM file.
Reference external files
You must reference external files if you develop the following types of Spark jobs:
- Spark jobs that need to read data from specific configuration files.
- Spark jobs that require additional resource packages or third-party libraries, such as JAR files or Python libraries.
You can use one of the following methods to reference external files:
- Method 1: Upload external files by using Spark parameters.
Spark on MaxCompute supports the following parameters that are defined by Apache Spark: --jars, --py-files, --files, and --archives. You can configure these parameters to upload external files when you submit jobs. When you run the jobs, these external files are uploaded to the working directories.
- Use the spark-submit script to upload files on the Spark on MaxCompute client. A sample command is provided at the end of this method.
Note
- --jars: This parameter specifies the JAR packages that you want to upload to the current working directories of the driver and each executor. Package names are separated by commas (,). These JAR packages are added to the classpaths of the driver and each executor. You can use "./your_jar_name" in the configurations of Spark jobs to specify these packages. This is the same as in Apache Spark.
- --files and --py-files: These two parameters specify the common files or Python files that you want to upload to the current working directories of the driver and each executor. File names are separated by commas (,). You can use "./your_file_name" in the configurations of Spark jobs to specify these files. This is the same as in Apache Spark.
- --archives: This parameter is slightly different from the parameter defined by Apache Spark. Configure this parameter in the xxx#yyy format and separate file names with commas (,). After you configure the --archives parameter, the archive files that you specify, such as ZIP files, are decompressed to the subdirectories of the current working directories of the driver and each executor. For example, if this parameter is set to xx.zip#yy, you can use "./yy/xx/" to reference the content in the archive file. If this parameter is set to xx.zip, you can use "./xx.zip/xx/" to reference the content in the archive file. If you want to use "./xxx/" to decompress the archive file directly to the current working directory, use the spark.hadoop.odps.cupid.resources parameter.
- Use DataWorks to add the resources required by a job. For more information, see Create a MaxCompute resource.
Note In the DataWorks console, you can upload files with a size of up to 50 MB. If the size of files that you want to upload exceeds 50 MB, you can use the MaxCompute client to upload these files as MaxCompute resources and add these resources to the DataStudio page in the DataWorks console. For more information about MaxCompute resources, see MaxCompute resources.
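The following spark-submit command is a minimal sketch of how the --jars, --files, and --archives parameters can be combined when the SparkPi example from the smoke test below is submitted. The names your_dependency.jar, your_config.properties, and your_archive.zip are placeholders for illustration and are not provided by the templates.
cd $SPARK_HOME
# Upload a dependency JAR, a configuration file, and an archive together with the job.
# The uploaded files are placed in the working directories of the driver and each executor.
bin/spark-submit --master yarn-cluster \
--class com.aliyun.odps.spark.examples.SparkPi \
--jars /path/to/your_dependency.jar \
--files /path/to/your_config.properties \
--archives /path/to/your_archive.zip#yy \
/path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar
# In the job code, reference the uploaded files by relative paths, for example
# "./your_config.properties" for the file and "./yy/" for the decompressed content of the archive.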
- Method 2: Upload files as MaxCompute resources.
Spark on MaxCompute provides the spark.hadoop.odps.cupid.resources parameter. This parameter allows you to directly reference resources in MaxCompute. When you run a job, the referenced resources are uploaded to the working directories. To upload files as MaxCompute resources, perform the following steps. A combined sketch is provided after the sample code below.
- Log on to the MaxCompute client and upload files as resources to your MaxCompute project. The maximum size of a file that you can upload is 500 MB.
- Add the spark.hadoop.odps.cupid.resources parameter to the configurations of your Spark job. This parameter specifies the MaxCompute resources that are required for running the Spark job. Configure this parameter in the <projectname>.<resourcename> format. If you want to reference multiple files, separate the file names with commas (,). Example:
spark.hadoop.odps.cupid.resources=public.python-python-2.7-ucs4.zip,public.myjar.jar
The specified resource files are downloaded to the current working directories of the driver and each executor. By default, a downloaded file is named in the <projectname>.<resourcename> format. You can also rename the file in the <projectname>.<resourcename>:<newresourcename> format. The following configuration shows an example.
spark.hadoop.odps.cupid.resources=public.myjar.jar:myjar.jar
Notice The spark.hadoop.odps.cupid.resources parameter takes effect only if you configure it in the spark-defaults.conf file or in the DataWorks console. The parameter does not take effect if it is set in code.
After the resource files are downloaded to the working directories, your job code can read them by file name. Sample code:
import scala.io.Source

val targetFile = "File name"
val file = Source.fromFile(targetFile)
for (line <- file.getLines)
  println(line)
file.close
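Putting the steps of Method 2 together, the following is a minimal sketch. It assumes that the MaxCompute client (odpscmd) is installed and connected to your project; myjar.jar and your_project_name are placeholders.
# On the MaxCompute client, upload a local JAR file as a resource of your MaxCompute project.
# The -f option overwrites an existing resource that has the same name.
add jar /path/to/myjar.jar -f;
# In the spark-defaults.conf file or in the DataWorks console, reference the resource and rename it for the job.
spark.hadoop.odps.cupid.resources=your_project_name.myjar.jar:myjar.jar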
Conduct SparkPi smoke testing
# Replace /path/to/MaxCompute-Spark with a valid path to the compiled JAR package.
cd $SPARK_HOME
bin/spark-submit --master yarn-cluster --class com.aliyun.odps.spark.examples.SparkPi \
/path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar
# If the following log information is displayed, smoke testing is successful.
19/06/11 11:57:30 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 11.222.166.90
ApplicationMaster RPC port: 38965
queue: queue
start time: 1560225401092
final status: SUCCEEDED
Notes on running Spark on MaxCompute in local mode by using IntelliJ IDEA
- Specify the spark.master parameter in the code.
val spark = SparkSession
  .builder()
  .appName("SparkPi")
  .config("spark.master", "local[4]") // The code can run only after you set the spark.master parameter to local[N]. N indicates the parallelism.
  .getOrCreate()
- Add the following dependency of the Spark on MaxCompute client in IntelliJ IDEA.
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
In the pom.xml file, the scope parameter is set to provided. This setting may cause the following "NoClassDefFoundError" error when you run code:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/SparkSession$
    at com.aliyun.odps.spark.examples.SparkPi$.main(SparkPi.scala:27)
    at com.aliyun.odps.spark.examples.SparkPi.main(SparkPi.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.SparkSession$
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 2 more
You can use the following method to manually add the directories specified by the --jars parameter on Spark on MaxCompute to the project template of IntelliJ IDEA. This way, the scope=provided configuration can be retained and the "NoClassDefFoundError" error is not returned when you run code in local mode by using IntelliJ IDEA.
- In the main menu bar of IntelliJ IDEA, choose File > Project Structure….
- On the Project Structure page, click Modules in the left-side navigation pane. Then, select the resource package, and click the Dependencies tab for the resource package.
- On the Dependencies tab, click the plus sign (+) in the lower-left corner and select JARs or directories… to add the directory specified by the --jars parameter on Spark on MaxCompute.
- The spark-defaults.conf file cannot be directly referenced when you run code in local mode. You must manually
configure related parameters in the code.
If you submit jobs by using the spark-submit script, the system reads the configurations from the spark-defaults.conf file. If you submit jobs in local mode, you must manually configure related parameters in code. The following code shows how to use Spark SQL to read data from MaxCompute tables in local mode.
val spark = SparkSession
  .builder()
  .appName("SparkPi")
  .config("spark.master", "local[4]") // The code can run after you set spark.master to local[N]. N indicates the parallelism.
  .config("spark.hadoop.odps.project.name", "****")
  .config("spark.hadoop.odps.access.id", "****")
  .config("spark.hadoop.odps.access.key", "****")
  .config("spark.hadoop.odps.end.point", "http://service.cn.maxcompute.aliyun.com/api")
  .config("spark.sql.catalogImplementation", "odps")
  .getOrCreate()
Notes on using Spark 2.4.5
- Use Spark 2.4.5 to submit jobs
- Submit a job in a Yarn cluster.
- Set the spark.hadoop.odps.spark.version parameter to spark-2.4.5-odps0.33.0 in the DataWorks console. If the Spark version of exclusive resource groups in DataWorks is not updated to Spark 2.4.5, you can use shared resource groups to schedule jobs or contact the MaxCompute technical support team to update the Spark version.
- Changes in using Spark 2.4.5
- If you submit jobs in a Yarn cluster, you must run the export HADOOP_CONF_DIR=$SPARK_HOME/conf command to set the HADOOP_CONF_DIR environment variable.
- If you perform debugging in local mode, you must create a file named odps.conf in the $SPARK_HOME/conf path and add the following configurations to the file.
odps.project.name =
odps.access.id =
odps.access.key =
odps.end.point =
- Changes in the parameter settings of Spark 2.4.5
- spark.sql.catalogImplementation is set to hive.
- spark.sql.sources.default is set to hive.
- spark.sql.odps.columnarReaderBatchSize: specifies the number of rows that the vectorized reader reads at a time. Default value: 4096.
- spark.sql.odps.enableVectorizedReader: specifies whether to enable the vectorized reader. Default value: True.
- spark.sql.odps.enableVectorizedWriter: specifies whether to enable the vectorized writer. Default value: True.
- spark.sql.odps.split.size: adjusts the concurrency of data reading operations on MaxCompute tables. Default value: 256. Unit: MB.
You can override these parameters when you submit a job, as shown in the sketch after this list.
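As a reference, the following is a minimal sketch of overriding these parameters at submission time by using the standard --conf option of spark-submit. The values are examples only; adjust them based on your workload.
cd $SPARK_HOME
# Submit the SparkPi example with Spark 2.4.5 parameter overrides.
bin/spark-submit --master yarn-cluster \
--class com.aliyun.odps.spark.examples.SparkPi \
--conf spark.sql.catalogImplementation=hive \
--conf spark.sql.odps.split.size=512 \
--conf spark.sql.odps.enableVectorizedReader=true \
/path/to/MaxCompute-Spark/spark-2.x/target/spark-examples_2.11-1.0.0-SNAPSHOT-shaded.jar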