Processing Alibaba Cloud NAS data using the E-MapReduce service
Posted: Mar 14, 2017
File storage (NAS) is a storage service launched by Alibaba Cloud in 2017. Because it provides standard file access protocols, you can access a distributed file system without making any changes to your existing applications, and benefit from the following highlights: unlimited capacity and performance scaling, a single namespace, sharing across multiple mounts, and high reliability and availability. E-MapReduce is a big data service provided by Alibaba Cloud that helps you build big data platforms based on open-source components such as Hadoop.
This document explains how to combine E-MapReduce Hadoop jobs with file storage (NAS) to make full use of both distributed storage and distributed computing capabilities.
Prepare the environment
Step 1: In the file storage console, create a file system and a mount point, and configure permission group rules as described in the official documentation. Note that in the classic network environment the mount point provides no default permission group, and each permission group rule must authorize a single IP address rather than a network segment, so you have to add the rules manually in the console. Make sure every node in the E-MapReduce cluster is granted read and write access to the NAS.
Step 2: Log in to the E-MapReduce nodes through SSH and mount the NAS. Note: the NAS must be mounted on all master and worker nodes.
sudo mkdir /mnt/nas
sudo mount -t nfs4 <nas-url>.cn-hangzhou.nas.aliyuncs.com:/ /mnt/nas
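To keep the mount across reboots, you can also record it in /etc/fstab on each node. This is a minimal sketch, assuming the same mount point and the mount address placeholder from the command above:

```shell
# Append an fstab entry so the NAS is remounted automatically at boot.
# <nas-url>.cn-hangzhou.nas.aliyuncs.com is the mount address from Step 1.
echo '<nas-url>.cn-hangzhou.nas.aliyuncs.com:/ /mnt/nas nfs4 defaults,_netdev 0 0' \
    | sudo tee -a /etc/fstab

# Mount everything listed in fstab, including the new entry.
sudo mount -a
```

The `_netdev` option tells the system to wait for the network before attempting the mount.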
Step 3: Verify that the mount works. For example, create a directory on the master node, then create a file in that directory on a worker node. If the file is visible on all nodes, the NAS configuration has succeeded.
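The check can be scripted as follows. This is a minimal sketch, assuming the NAS is mounted at /mnt/nas; the `NAS_MOUNT` variable is introduced here only to make the commands easy to adapt:

```shell
# Mount point of the NAS (adjust if you mounted it elsewhere).
NAS_MOUNT=${NAS_MOUNT:-/mnt/nas}

# Run on the master node: create a test input directory on the NAS.
mkdir -p "$NAS_MOUNT/wc-in"

# Run on a worker node: create a file inside that directory.
echo "hello nas" > "$NAS_MOUNT/wc-in/1.txt"

# Run on any node: if the file is listed, the shared mount works.
ls -l "$NAS_MOUNT/wc-in"
```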
[hadoop@emr-header-1 ~]$ ls -l /mnt/nas/wc-in
-rw-rw-r-- 1 hadoop hadoop 27 Dec 12 10:32 1.txt
-rw-rw-r-- 1 hadoop hadoop 28 Dec 12 10:32 2.txt
Run Hadoop MapReduce tasks
Once the environment is ready, you can run Hadoop tasks. The following uses the most common WordCount task as an example:
hadoop jar /opt/apps/hadoop-2.7.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount file:///mnt/nas/wc-in file:///mnt/nas/wc-out
Because the NAS is mounted into the local file system, you can directly use the processing components supplied with Hadoop. By prefixing the input and output directories (or files) with file:///, the MapReduce job reads its input from the NAS, processes the data, and writes the result back to the NAS.
View the result
[hadoop@emr-worker-2 wc-out]$ cat /mnt/nas/wc-out/part-*
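Since the output lives on an ordinary file system, you can post-process it with standard shell tools. A small sketch, assuming the tab-separated "word count" lines that WordCount's default output format emits:

```shell
# Each line of a part-* file is "word<TAB>count"; sort by count,
# descending, and show the ten most frequent words.
cat /mnt/nas/wc-out/part-* | sort -t$'\t' -k2,2nr | head -n 10
```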