By Aaron Handoko and M Rifandy Zulvan, Solution Architects, Alibaba Cloud Indonesia
In this article, we discuss how to quickly set up an ODPS Spark environment using a Docker image and how to run PySpark on ODPS from the CLI.
Prerequisites
1). An active Alibaba Cloud account
2). A MaxCompute project
3). Familiarity with Spark
1). Create a new Dockerfile and paste in the following code. The resulting image installs all the libraries needed to run Spark on Linux.
FROM centos:7.6.1810
# Install JDK
RUN yum install -y java-1.8.0-openjdk-devel.x86_64
RUN set -ex \
&& yum install wget -y \
&& yum install git -y
# Install Maven
RUN set -ex \
&& wget --no-check-certificate \
&& tar -zxvf apache-maven-3.9.6-bin.tar.gz
# Install the Spark MaxCompute client
RUN set -ex \
&& wget \
&& tar -xzvf spark-2.4.5-odps0.33.2.tar.gz
# Install python
RUN set -ex \
# Preinstall the required components.
&& yum install -y wget tar libffi-devel zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gcc make initscripts zip \
&& wget \
&& tar -zxvf Python-3.7.0.tgz \
&& cd Python-3.7.0 \
&& ./configure --prefix=/usr/local/python3 \
&& make \
&& make install \
&& make clean \
&& rm -rf /Python-3.7.0* \
&& yum install -y epel-release \
&& yum install -y python-pip
# Set the default Python version to Python 3.
RUN set -ex \
# Back up resources of Python 2.7.
&& mv /usr/bin/python /usr/bin/python27 \
&& mv /usr/bin/pip /usr/bin/pip-python27 \
# Set the default Python version to Python 3.
&& ln -s /usr/local/python3/bin/python3.7 /usr/bin/python \
&& ln -s /usr/local/python3/bin/pip3 /usr/bin/pip
# Fix the YUM bug that is caused by the change in the Python version.
RUN set -ex \
&& sed -i "s#/usr/bin/python#/usr/bin/python27#" /usr/bin/yum \
&& sed -i "s#/usr/bin/python#/usr/bin/python27#" /usr/libexec/urlgrabber-ext-down \
&& yum install -y deltarpm
RUN pip install --upgrade pip
ENV JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-
ENV CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
ENV SPARK_HOME=/spark-2.4.5-odps0.33.2
ENV MAVEN_HOME=/apache-maven-3.9.6
ENV PATH=$MAVEN_HOME/bin:$SPARK_HOME/bin:/usr/local/git/bin/:$PATH
WORKDIR /spark-job
# Clone repos
RUN git clone .\
&& cd ./spark-2.x \
&& mvn clean package
RUN echo PATH=/usr/local/git/bin/:$PATH
2). Run the following commands to build the image and open a shell inside the container
docker build -t spark-odps .
docker run -it spark-odps bash
3). Edit the configuration file
vim conf/spark-defaults.conf
Below is an example of the configuration file:
spark.hadoop.odps.project.name = <YOUR MAXCOMPUTE PROJECT>
spark.hadoop.odps.access.id = <YOUR ACCESSKEY ID>
spark.hadoop.odps.access.key = <YOUR ACCESSKEY SECRET>
spark.hadoop.odps.end.point = <YOUR MAXCOMPUTE ENDPOINT>
# For Spark 2.3.0, set spark.sql.catalogImplementation to odps. For Spark 2.4.5, set spark.sql.catalogImplementation to hive.
spark.sql.catalogImplementation = hive
# Retain the following configurations:
spark.hadoop.odps.task.major.version = cupid_v2
spark.hadoop.odps.cupid.container.image.enable = true
spark.hadoop.odps.cupid.container.vm.engine.type = hyper
spark.hadoop.odps.cupid.webproxy.endpoint = =
# Accessing OSS
spark.hadoop.fs.oss.accessKeyId = <YOUR ACCESSKEY ID>
spark.hadoop.fs.oss.accessKeySecret = <YOUR ACCESSKEY SECRET>
spark.hadoop.fs.oss.endpoint = <OSS ENDPOINT>
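Before submitting a job, it can help to confirm that no placeholder was left unfilled. The following is a minimal sketch, not part of the MaxCompute tooling: the function names `parse_conf` and `unfilled_keys` are hypothetical, and the parser assumes every setting is written as `key = value`, as in the example above.

```python
import re

# A value such as "<YOUR ACCESSKEY ID>" is treated as an unfilled placeholder.
PLACEHOLDER = re.compile(r"^<.*>$")


def parse_conf(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '=' at all
            settings[key.strip()] = value.strip()
    return settings


def unfilled_keys(settings):
    """Return the keys whose value is empty or still a <...> placeholder."""
    return sorted(k for k, v in settings.items() if not v or PLACEHOLDER.match(v))


sample = """
spark.hadoop.odps.access.key = <YOUR ACCESSKEY SECRET>
# Retain the following configurations:
spark.hadoop.odps.task.major.version = cupid_v2
"""
print(unfilled_keys(parse_conf(sample)))
```

In practice you would pass the contents of conf/spark-defaults.conf instead of `sample` and stop if the returned list is not empty.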
4). Create odps.conf in the conf folder of Spark
Here is an example of the configuration required inside odps.conf:
odps.project.name = <MAXCOMPUTE PROJECT NAME>
odps.access.id = <ACCESSKEY ID>
odps.access.key = <ACCESSKEY SECRET>
odps.end.point = <MAXCOMPUTE ENDPOINT>
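The entries in odps.conf mirror the `spark.hadoop.odps.*` entries in spark-defaults.conf, so one file can be generated from the other. The sketch below assumes both files should carry the same project, credentials, and endpoint; `to_odps_conf` is a hypothetical helper, and its input is a plain dictionary of settings already read from spark-defaults.conf.

```python
PREFIX = "spark.hadoop."

# Keys that odps.conf needs; each one is expected under "spark.hadoop." as well.
ODPS_KEYS = (
    "odps.project.name",
    "odps.access.id",
    "odps.access.key",
    "odps.end.point",
)


def to_odps_conf(spark_settings):
    """Render the text of odps.conf from spark-defaults.conf settings."""
    lines = []
    for key in ODPS_KEYS:
        value = spark_settings.get(PREFIX + key)
        if value is None:
            raise KeyError(f"missing {PREFIX + key} in the Spark settings")
        lines.append(f"{key} = {value}")
    return "\n".join(lines) + "\n"


example = {
    "spark.hadoop.odps.project.name": "<MAXCOMPUTE PROJECT NAME>",
    "spark.hadoop.odps.access.id": "<ACCESSKEY ID>",
    "spark.hadoop.odps.access.key": "<ACCESSKEY SECRET>",
    "spark.hadoop.odps.end.point": "<MAXCOMPUTE ENDPOINT>",
}
print(to_odps_conf(example), end="")
```

Generating the file this way keeps the two sets of credentials from drifting apart when a key is rotated.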
5). There are two running modes you can use: local mode and cluster mode.
Running on Local Mode (on-premise engine)
1). Run the following command:
./bin/spark-submit --master local[4] /root/MaxCompute-Spark/spark-2.x/src/main/python/
Running on Cluster Mode (ODPS engine)
a). Run the following commands to add the environment variables
b). Run the following command to run Spark in cluster mode
bin/spark-submit --master yarn-cluster --class com.aliyun.odps.spark.examples.SparkPi /root/MaxCompute-Spark/spark-2.x/src/main/python/
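The two modes differ mainly in the value passed to `--master`. The sketch below builds both command lines from one function so the difference is easy to see. The helper `spark_submit_args` and the script name `my_job.py` are hypothetical and stand in for your own application. In spark-submit, `--class` names the main class of a Java or Scala application, so the sketch treats it as optional.

```python
import shlex


def spark_submit_args(master, app, main_class=None):
    """Build the argument list for bin/spark-submit.

    master     -- value for --master, e.g. "local[4]" or "yarn-cluster"
    app        -- path to the application (a .py script or a .jar)
    main_class -- optional main class, only meaningful for JVM applications
    """
    args = ["bin/spark-submit", "--master", master]
    if main_class:
        args += ["--class", main_class]
    args.append(app)
    return args


# my_job.py is a placeholder name, not a file from the MaxCompute-Spark repo.
local_cmd = spark_submit_args("local[4]", "my_job.py")
cluster_cmd = spark_submit_args("yarn-cluster", "my_job.py")

# shlex.join quotes arguments so the printed line can be pasted into a shell.
print(shlex.join(local_cmd))
print(shlex.join(cluster_cmd))
```

Keeping the arguments as a list also lets you pass them straight to `subprocess.run` from the Spark home directory without worrying about shell quoting.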
