Prepare data

Apsara File Storage for HDFS is one of the optional data stores for Hadoop and Spark batch jobs. This topic describes how to write a file to Apsara File Storage for HDFS and access it from a Spark application.

1. Activate the Apsara File Storage for HDFS service and create a file system.

2. Create and configure a permission group.

1. Create a permission group.

2. Configure rules for the permission group.

3. Bind the permission group to a mount target.

The Apsara File Storage for HDFS file system is created and configured.

3. Install Apache Hadoop.

After the Apsara File Storage for HDFS file system is configured, you can write files to it. In this example, the HDFS client that is included with Apache Hadoop is used to write the files.

To download Apache Hadoop, click here. We recommend that you download Apache Hadoop 2.7.2 or later. This topic uses Apache Hadoop 2.7.2 as an example.

1. Run the following command to decompress the Apache Hadoop package to the specified directory:

tar -zxvf hadoop-2.7.2.tar.gz -C /usr/local/

2. Run the following command to open the core-site.xml configuration file:

vim /usr/local/hadoop-2.7.2/etc/hadoop/core-site.xml

Modify the content of the file as follows:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>dfs://f-4b1fcae5dvexx.cn-hangzhou.dfs.aliyuncs.com:10290</value>
        <!-- Enter the mount target of your Apsara File Storage for HDFS file system. -->
    </property>
    <property>
        <name>fs.dfs.impl</name>
        <value>com.alibaba.dfs.DistributedFileSystem</value>
    </property>
    <property>
        <name>fs.AbstractFileSystem.dfs.impl</name>
        <value>com.alibaba.dfs.DFS</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>8388608</value>
    </property>
    <property>
        <name>alidfs.use.buffer.size.setting</name>
        <value>false</value>
        <!-- We recommend that you do not enable this feature. If you enable this feature, the I/O size will be greatly reduced, which affects data throughput. -->
    </property>
    <property>
        <name>dfs.usergroupservice.impl</name>
        <value>com.alibaba.dfs.security.LinuxUserGroupService.class</value>
    </property>
    <property>
        <name>dfs.connection.count</name>
        <value>256</value>
    </property>
</configuration>	
Notice In this example, you do not need to set the parameters related to YARN because you will run a Spark application in a Kubernetes cluster. You only need to set the parameters related to HDFS. The core-site.xml file that you modify in this step will be used in subsequent steps.

3. Run the following command to open the /etc/profile configuration file:

vim /etc/profile

Add the following environment variables to the file:

export HADOOP_HOME=/usr/local/hadoop-2.7.2
export HADOOP_CLASSPATH=/usr/local/hadoop-2.7.2/etc/hadoop:/usr/local/hadoop-2.7.2/share/hadoop/common/lib/*:/usr/local/hadoop-2.7.2/share/hadoop/common/*:/usr/local/hadoop-2.7.2/share/hadoop/hdfs:/usr/local/hadoop-2.7.2/share/hadoop/hdfs/lib/*:/usr/local/hadoop-2.7.2/share/hadoop/hdfs/*:/usr/local/hadoop-2.7.2/share/hadoop/yarn/lib/*:/usr/local/hadoop-2.7.2/share/hadoop/yarn/*:/usr/local/hadoop-2.7.2/share/hadoop/mapreduce/lib/*:/usr/local/hadoop-2.7.2/share/hadoop/mapreduce/*:/usr/local/hadoop-2.7.2/contrib/capacity-scheduler/*.jar
export HADOOP_CONF_DIR=/usr/local/hadoop-2.7.2/etc/hadoop

Run the following command to make your configurations take effect:

source /etc/profile
Notice In this example, only one HDFS client is required. You do not need to deploy an HDFS cluster.

4. Run the following command to add the Apsara File Storage for HDFS SDK dependency:

cp aliyun-sdk-dfs-1.0.3.jar  /usr/local/hadoop-2.7.2/share/hadoop/hdfs

To download the Apsara File Storage for HDFS SDK, click here.

4. Upload data.

# Create a directory for storing data.
[root@liumi-hdfs ~]# $HADOOP_HOME/bin/hadoop fs -mkdir -p /pod/data
# Upload a local file, which is a text file of a novel, to Apsara File Storage for HDFS.
[root@liumi-hdfs ~]# $HADOOP_HOME/bin/hadoop fs -put ./A-Game-of-Thrones.txt /pod/data/A-Game-of-Thrones.txt
# Check the file. It is about 30 GB in size.
[root@liumi-hdfs local]# $HADOOP_HOME/bin/hadoop fs -ls /pod/data
Found 1 items
-rwxrwxrwx   3 root root 33710040000 2019-11-10 13:02 /pod/data/A-Game-of-Thrones.txt

The data is prepared and ready in Apsara File Storage for HDFS.
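
If you want to verify access programmatically before moving on, the following minimal Java sketch lists the uploaded file through the Hadoop FileSystem API. It assumes that the modified core-site.xml and the aliyun-sdk-dfs JAR are on the classpath; the ListData class name is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper class used only to verify that the uploaded data is accessible.
public class ListData {
    public static void main(String[] args) throws Exception {
        // new Configuration() loads core-site.xml from the classpath,
        // so FileSystem.get() resolves to the Apsara File Storage for HDFS mount target.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Expected output: one entry for /pod/data/A-Game-of-Thrones.txt with its size in bytes.
        for (FileStatus status : fs.listStatus(new Path("/pod/data"))) {
            System.out.println(status.getPath() + "\t" + status.getLen());
        }
        fs.close();
    }
}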

Access data in Apsara File Storage for HDFS from a Spark application

1. Develop a Spark application.

You can develop a Spark application in the same way as in traditional deployment modes. The following snippet shows the key calls against Apsara File Storage for HDFS; a complete sketch of the word-count logic follows it.

SparkConf conf = new SparkConf().setAppName(WordCount.class.getSimpleName());
JavaSparkContext sc = new JavaSparkContext(conf);

JavaRDD<String> lines = sc.textFile("dfs://f-4b1fcae5dvxxx.cn-hangzhou.dfs.aliyuncs.com:10290/pod/data/A-Game-of-Thrones.txt", 250);

...
wordsCountResult.saveAsTextFile("dfs://f-4b1fcae5dvxxx.cn-hangzhou.dfs.aliyuncs.com:10290/pod/data/A-Game-of-Thrones-Result");

sc.close();   	
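
The lines elided above ("...") compute the word counts. A complete version of the application might look like the following minimal sketch, which assumes the standard flatMap/mapToPair/reduceByKey word-count pattern and the example mount target used in this topic; it is not necessarily the exact code of the original example.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName(WordCount.class.getSimpleName());
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read the input file from Apsara File Storage for HDFS with 250 partitions.
        JavaRDD<String> lines = sc.textFile(
                "dfs://f-4b1fcae5dvxxx.cn-hangzhou.dfs.aliyuncs.com:10290/pod/data/A-Game-of-Thrones.txt", 250);

        // Split each line into words, map each word to (word, 1), and sum the counts.
        JavaPairRDD<String, Integer> wordsCountResult = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        // Write the result back to Apsara File Storage for HDFS.
        wordsCountResult.saveAsTextFile(
                "dfs://f-4b1fcae5dvxxx.cn-hangzhou.dfs.aliyuncs.com:10290/pod/data/A-Game-of-Thrones-Result");

        sc.close();
    }
}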

2. Place the core-site.xml file that you modified earlier in the resources directory (src/main/resources) of the application project. A short sketch for verifying that the file is picked up follows the configuration below.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>

    <!-- HDFS configuration -->
    <property>
        <name>fs.defaultFS</name>
        <value>dfs://f-4b1fcae5dvexx.cn-hangzhou.dfs.aliyuncs.com:10290</value>
        <!-- Enter the mount target of your Apsara File Storage for HDFS file system. -->
    </property>
    <property>
        <name>fs.dfs.impl</name>
        <value>com.alibaba.dfs.DistributedFileSystem</value>
    </property>
    <property>
        <name>fs.AbstractFileSystem.dfs.impl</name>
        <value>com.alibaba.dfs.DFS</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>8388608</value>
    </property>
    <property>
        <name>alidfs.use.buffer.size.setting</name>
        <value>false</value>
        <!-- We recommend that you do not enable this feature. If you enable this feature, the I/O size will be greatly reduced, which affects data throughput. -->
    </property>
    <property>
        <name>dfs.usergroupservice.impl</name>
        <value>com.alibaba.dfs.security.LinuxUserGroupService.class</value>
    </property>
    <property>
        <name>dfs.connection.count</name>
        <value>256</value>
    </property>
</configuration>	
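
To confirm that the packaged core-site.xml is picked up at run time, you can load it through the Hadoop Configuration class, which reads core-site.xml from the classpath automatically. The following minimal sketch prints the key properties; the ConfigCheck class name is hypothetical, and the expected values are the example values used in this topic.

import org.apache.hadoop.conf.Configuration;

// Hypothetical helper class used only to verify the packaged configuration.
public class ConfigCheck {
    public static void main(String[] args) {
        // new Configuration() loads core-site.xml from the classpath (the resources directory).
        Configuration conf = new Configuration();
        // Expected: dfs://f-4b1fcae5dvexx.cn-hangzhou.dfs.aliyuncs.com:10290
        System.out.println(conf.get("fs.defaultFS"));
        // Expected: com.alibaba.dfs.DistributedFileSystem
        System.out.println(conf.get("fs.dfs.impl"));
    }
}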

3. Create a JAR package with all the dependencies included.

mvn assembly:assembly

The content of the pom.xml file of the Spark application is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.aliyun.liumi.spark</groupId>
    <artifactId>SparkExampleJava</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.12</artifactId>
            <version>2.4.3</version>
        </dependency>

        <dependency>
            <groupId>com.aliyun.dfs</groupId>
            <artifactId>aliyun-sdk-dfs</artifactId>
            <version>1.0.3</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.6</version>
                <configuration>
                    <appendAssemblyId>false</appendAssemblyId>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>com.aliyun.liumi.spark.example.WordCount</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>assembly</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

4. Create the Dockerfile.

# Spark base image
FROM registry.cn-hangzhou.aliyuncs.com/eci_open/spark:2.4.4
# An error may occur if you use the default version of the Kubernetes client. We recommend that you use the latest version.
RUN rm $SPARK_HOME/jars/kubernetes-client-*.jar
ADD https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/4.4.2/kubernetes-client-4.4.2.jar $SPARK_HOME/jars
# Copy the local JAR package.
RUN mkdir -p /opt/spark/jars
COPY SparkExampleJava-1.0-SNAPSHOT.jar /opt/spark/jars

5. Build the image of the Spark application.

docker build -t registry.cn-beijing.aliyuncs.com/liumi/spark:2.4.4-example -f Dockerfile .	

6. Push the image to Alibaba Cloud Container Registry.

docker push registry.cn-beijing.aliyuncs.com/liumi/spark:2.4.4-example	

The image of the Spark application is prepared. You can deploy the Spark application in a Kubernetes cluster.