Object Storage Service (OSS) is an optional data store for Hadoop and Spark. This topic describes how to create a file in OSS and access the file from a Spark application that runs on Elastic Container Instance (ECI).

Prepare data

1. Create an OSS bucket.

2. Upload a file to the OSS bucket.

You can upload the file by using the OSS SDK (a minimal upload sketch is provided below), or use HDP 2.6-based Hadoop to read and write OSS data. The following figure shows the result in the OSS console after the file is uploaded.

The path of the file is oss://liumi-hust/A-Game-of-Thrones.txt, and the endpoint of the bucket is oss-cn-hangzhou-internal.aliyuncs.com. Take note of both values. The data is now prepared.
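For reference, the following is a minimal upload sketch that uses the OSS SDK for Java. The local file path and the AccessKey pair are placeholders; the bucket name and object name match the example above.

import java.io.File;
import com.aliyun.oss.OSS;
import com.aliyun.oss.OSSClientBuilder;

public class UploadToOss {
    public static void main(String[] args) {
        // Public endpoint of the region. The internal endpoint works only from within Alibaba Cloud.
        String endpoint = "https://oss-cn-hangzhou.aliyuncs.com";
        String accessKeyId = "<yourAccessKeyId>";         // placeholder
        String accessKeySecret = "<yourAccessKeySecret>"; // placeholder

        OSS ossClient = new OSSClientBuilder().build(endpoint, accessKeyId, accessKeySecret);
        try {
            // Upload the local file as the A-Game-of-Thrones.txt object in the liumi-hust bucket.
            ossClient.putObject("liumi-hust", "A-Game-of-Thrones.txt",
                    new File("/path/to/A-Game-of-Thrones.txt"));
        } finally {
            ossClient.shutdown();
        }
    }
}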

Access data in OSS from a Spark application

1. Develop a Spark application.

You can develop the Spark application in the same way as in traditional deployment modes. A complete minimal sketch is provided after the following snippet.

SparkConf conf = new SparkConf().setAppName(WordCount.class.getSimpleName());
// Create the JavaSparkContext that is used to read from and write to OSS.
JavaSparkContext sc = new JavaSparkContext(conf);

JavaRDD<String> lines = sc.textFile("oss://liumi-hust/A-Game-of-Thrones.txt", 250);

...
wordsCountResult.saveAsTextFile("oss://liumi-hust/A-Game-of-Thrones-result");

sc.close();
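The omitted word-count logic is the standard Spark example. For completeness, the following is a minimal end-to-end sketch; the transformation shown is a typical implementation and is not necessarily identical to the original code.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName(WordCount.class.getSimpleName());
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the input file from OSS. The second argument is the number of partitions.
        JavaRDD<String> lines = sc.textFile("oss://liumi-hust/A-Game-of-Thrones.txt", 250);

        // Split each line into words, map each word to (word, 1), and sum the counts per word.
        JavaPairRDD<String, Integer> wordsCountResult = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        // Write the result back to OSS.
        wordsCountResult.saveAsTextFile("oss://liumi-hust/A-Game-of-Thrones-result");

        sc.close();
    }
}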

2. Place the core-site.xml file in the resources directory of the application project. A quick way to verify that the file is picked up from the classpath is sketched after the listing below.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <!-- OSS configuration -->
    <property>
        <name>fs.oss.impl</name>
        <value>org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem</value>
    </property>
    <property>
        <name>fs.oss.endpoint</name>
        <value>oss-cn-hangzhou-internal.aliyuncs.com</value>
    </property>
    <property>
        <name>fs.oss.accessKeyId</name>
        <value>{Temporary AccessKey ID used to connect to OSS}</value>
    </property>
    <property>
        <name>fs.oss.accessKeySecret</name>
        <value>{Temporary AccessKey secret used to connect to OSS}</value>
    </property>
    <property>
        <name>fs.oss.buffer.dir</name>
        <value>/tmp/oss</value>
    </property>
    <property>
        <name>fs.oss.connection.secure.enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>fs.oss.connection.maximum</name>
        <value>2048</value>
    </property>

</configuration>
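Hadoop loads core-site.xml from the classpath when a Configuration object is created. If you want to confirm that the file is picked up before you submit the job, you can run a quick check similar to the following sketch. The class name and the check itself are illustrative and are not part of the original project.

import org.apache.hadoop.conf.Configuration;

public class CheckOssConfig {
    public static void main(String[] args) {
        // new Configuration() loads core-site.xml from the classpath, for example from src/main/resources.
        Configuration conf = new Configuration();
        System.out.println("fs.oss.impl     = " + conf.get("fs.oss.impl"));
        System.out.println("fs.oss.endpoint = " + conf.get("fs.oss.endpoint"));
    }
}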

3. Create a JAR package with all the dependencies included.

mvn assembly:assembly

The content of the pom.xml file of the Spark application is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.aliyun.liumi.spark</groupId>
    <artifactId>SparkExampleJava</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>2.4.3</version>
        </dependency>

        <dependency>
            <groupId>com.aliyun.dfs</groupId>
            <artifactId>aliyun-sdk-dfs</artifactId>
            <version>1.0.3</version>
        </dependency>

    </dependencies>

    <build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>2.6</version>
            <configuration>
                <appendAssemblyId>false</appendAssemblyId>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <archive>
                    <manifest>
                        <mainClass>com.aliyun.liumi.spark.example.WordCount</mainClass>
                    </manifest>
                </archive>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>assembly</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
    </build>
</project>

4. Create the Dockerfile.

The following Dockerfile is used for OSS:

# Spark base image
FROM registry.cn-beijing.aliyuncs.com/eci_open/spark:2.4.4
# Replace the bundled kubernetes-client JAR with a newer version for compatibility with recent Kubernetes versions.
RUN rm $SPARK_HOME/jars/kubernetes-client-*.jar
ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/4.4.2/kubernetes-client-4.4.2.jar $SPARK_HOME/jars
RUN mkdir -p /opt/spark/jars
COPY SparkExampleJava-1.0-SNAPSHOT.jar /opt/spark/jars
# Copy the JAR dependency packages of OSS.
COPY aliyun-sdk-oss-3.4.1.jar /opt/spark/jars
COPY hadoop-aliyun-2.7.3.2.6.1.0-129.jar /opt/spark/jars
COPY jdom-1.1.jar /opt/spark/jars

For more information about how to download the JAR dependency packages of OSS, see Use HDP 2.6-based Hadoop to read and write OSS data.

5. Build the image of the Spark application.

docker build -t registry.cn-beijing.aliyuncs.com/liumi/spark:2.4.4-example -f Dockerfile .

6. Push the image to Alibaba Cloud Container Registry.

docker push registry.cn-beijing.aliyuncs.com/liumi/spark:2.4.4-example

The image of the Spark application is now ready. You can deploy the Spark application in a Kubernetes cluster.