Object Storage Service: Use HDP 2.6-based Hadoop to read and write OSS data

Last Updated: Feb 23, 2024

Hortonworks Data Platform (HDP) is a big data platform released by Hortonworks and consists of open source components such as Hadoop, Hive, and HBase. Hadoop 3.1.1 is included in HDP 3.0.1 and supports Object Storage Service (OSS). However, earlier versions of HDP do not support OSS. This topic uses HDP 2.6.1.0 as an example to describe how to configure HDP 2.6 to read and write OSS data.

Prerequisites

An HDP 2.6.1.0 cluster is created.

If you do not have an HDP 2.6.1.0 cluster, use one of the following methods to create one:

  • Use Ambari to create an HDP 2.6.1.0 cluster.
  • If Ambari is not available, you can manually create an HDP 2.6.1.0 cluster.

Procedure

  1. Download the HDP 2.6.1.0 package that supports OSS.
  2. Run the following command to decompress the downloaded package:
    sudo tar -xvf hadoop-oss-hdp-2.6.1.0-129.tar

    Sample success response:

    hadoop-oss-hdp-2.6.1.0-129/
    hadoop-oss-hdp-2.6.1.0-129/aliyun-java-sdk-ram-3.0.0.jar
    hadoop-oss-hdp-2.6.1.0-129/aliyun-java-sdk-core-3.4.0.jar
    hadoop-oss-hdp-2.6.1.0-129/aliyun-java-sdk-ecs-4.2.0.jar
    hadoop-oss-hdp-2.6.1.0-129/aliyun-java-sdk-sts-3.0.0.jar
    hadoop-oss-hdp-2.6.1.0-129/jdom-1.1.jar
    hadoop-oss-hdp-2.6.1.0-129/aliyun-sdk-oss-3.4.1.jar
    hadoop-oss-hdp-2.6.1.0-129/hadoop-aliyun-2.7.3.2.6.1.0-129.jar
  3. Move the JAR packages to the required directories.
    Note In this topic, all content enclosed in ${} is a placeholder. Replace each placeholder with the value for your environment.
    1. Move the hadoop-aliyun-2.7.3.2.6.1.0-129.jar package to the ${/usr/hdp/current}/hadoop-client/ directory, as shown in the sketch that follows this step. Run the following command to check whether the package is in place:
      sudo ls -lh /usr/hdp/current/hadoop-client/hadoop-aliyun-2.7.3.2.6.1.0-129.jar

      Sample success response:

      -rw-r--r-- 1 root root 64K Oct 28 20:56 /usr/hdp/current/hadoop-client/hadoop-aliyun-2.7.3.2.6.1.0-129.jar
    2. Move the other JAR packages to the ${/usr/hdp/current}/hadoop-client/lib/ directory. Run the following command to check whether the packages are in place:
      sudo ls -ltrh /usr/hdp/current/hadoop-client/lib

      Sample success response:

      total 27M
      ......
      drwxr-xr-x 2 root root 4.0K Oct 28 20:10 ranger-hdfs-plugin-impl
      drwxr-xr-x 2 root root 4.0K Oct 28 20:10 ranger-yarn-plugin-impl
      drwxr-xr-x 2 root root 4.0K Oct 28 20:10 native
      -rw-r--r-- 1 root root 114K Oct 28 20:56 aliyun-java-sdk-core-3.4.0.jar
      -rw-r--r-- 1 root root 513K Oct 28 20:56 aliyun-sdk-oss-3.4.1.jar
      -rw-r--r-- 1 root root  13K Oct 28 20:56 aliyun-java-sdk-sts-3.0.0.jar
      -rw-r--r-- 1 root root 211K Oct 28 20:56 aliyun-java-sdk-ram-3.0.0.jar
      -rw-r--r-- 1 root root 770K Oct 28 20:56 aliyun-java-sdk-ecs-4.2.0.jar
      -rw-r--r-- 1 root root 150K Oct 28 20:56 jdom-1.1.jar
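
    The move commands themselves are not shown above; the following is a minimal sketch, assuming the package was decompressed into the current working directory:

    # Sketch: run from the directory that contains the decompressed package.
    cd hadoop-oss-hdp-2.6.1.0-129
    # The Hadoop OSS connector goes into the hadoop-client directory.
    sudo mv hadoop-aliyun-2.7.3.2.6.1.0-129.jar /usr/hdp/current/hadoop-client/
    # The remaining SDK JAR packages go into the lib directory.
    sudo mv aliyun-*.jar jdom-1.1.jar /usr/hdp/current/hadoop-client/lib/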
  4. Perform the preceding operations on all HDP nodes.
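    If the nodes are reachable over SSH, a loop like the following sketch can copy the packages to each node. The hostnames hdp-node1 and hdp-node2 are assumptions; replace them with your node names:

    # Sketch: distribute the JAR packages to the other HDP nodes (hostnames are assumptions).
    for node in hdp-node1 hdp-node2; do
      scp /usr/hdp/current/hadoop-client/hadoop-aliyun-2.7.3.2.6.1.0-129.jar "$node":/usr/hdp/current/hadoop-client/
      scp /usr/hdp/current/hadoop-client/lib/aliyun-*.jar /usr/hdp/current/hadoop-client/lib/jdom-1.1.jar "$node":/usr/hdp/current/hadoop-client/lib/
    done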
  5. Use Ambari to add the configurations described in the following table. If your cluster is not managed by Ambari, add the same configurations to core-site.xml; a core-site.xml sketch follows the table. In this example, Ambari is used.

    fs.oss.endpoint
      The endpoint of the region in which the bucket that you want to access is located.
      Example: oss-cn-zhangjiakou-internal.aliyuncs.com.
    fs.oss.accessKeyId
      The AccessKey ID used to access OSS.
    fs.oss.accessKeySecret
      The AccessKey secret used to access OSS.
    fs.oss.impl
      The class used to implement the OSS file system based on Hadoop. Set the value to org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem.
    fs.oss.buffer.dir
      The directory used to store temporary files. We recommend that you set this parameter to /tmp/oss.
    fs.oss.connection.secure.enabled
      Specifies whether to enable HTTPS. Performance may be affected when HTTPS is enabled. We recommend that you set this parameter to false.
    fs.oss.connection.maximum
      The maximum number of connections to OSS. We recommend that you set this parameter to 2048.

    For more information about other parameters, see the Hadoop-Aliyun module documentation.
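
    The following is a minimal core-site.xml sketch with the same configurations. The endpoint is an example, and the AccessKey values are placeholders that you must replace:

    <!-- OSS-related properties in core-site.xml (sketch; replace placeholder values). -->
    <property>
      <name>fs.oss.endpoint</name>
      <value>oss-cn-zhangjiakou-internal.aliyuncs.com</value>
    </property>
    <property>
      <name>fs.oss.accessKeyId</name>
      <value>${your-access-key-id}</value>
    </property>
    <property>
      <name>fs.oss.accessKeySecret</name>
      <value>${your-access-key-secret}</value>
    </property>
    <property>
      <name>fs.oss.impl</name>
      <value>org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem</value>
    </property>
    <property>
      <name>fs.oss.buffer.dir</name>
      <value>/tmp/oss</value>
    </property>
    <property>
      <name>fs.oss.connection.secure.enabled</name>
      <value>false</value>
    </property>
    <property>
      <name>fs.oss.connection.maximum</name>
      <value>2048</value>
    </property>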

  6. Restart the cluster as prompted by Ambari.
  7. Test whether data can be read from and written to OSS.
    1. Run the following command to test whether data can be read from OSS:
      sudo hadoop fs -ls oss://${your-bucket-name}/
    2. Run the following command to test whether data can be written to OSS:
      sudo hadoop fs -mkdir oss://${your-bucket-name}/hadoop-test
      If data can be read from and written to OSS, the configurations are successful. Otherwise, check whether the configurations are correct.
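      To confirm the write, you can list the new directory; the command exits with an error if the directory was not created:
      sudo hadoop fs -ls oss://${your-bucket-name}/hadoop-test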
  8. To run MapReduce jobs, run the following commands to add the OSS support JAR packages to the hdfs://hdp-master:8020/hdp/apps/2.6.1.0-129/mapreduce/mapreduce.tar.gz package:
    Note In this example, MapReduce jobs are used. To run jobs of other types, adapt the following commands. For example, to run TEZ jobs, add the JAR packages to the hdfs://hdp-master:8020/hdp/apps/2.6.1.0-129/tez/tez.tar.gz package.
    # Switch to the hdfs user and work from its home directory.
    sudo su hdfs
    cd
    # Download mapreduce.tar.gz from HDFS, remove the HDFS copy, and keep a local backup.
    hadoop fs -copyToLocal /hdp/apps/2.6.1.0-129/mapreduce/mapreduce.tar.gz .
    hadoop fs -rm /hdp/apps/2.6.1.0-129/mapreduce/mapreduce.tar.gz
    cp mapreduce.tar.gz mapreduce.tar.gz.bak
    # Add the OSS support JAR packages to the extracted archive.
    tar zxf mapreduce.tar.gz
    cp /usr/hdp/current/hadoop-client/hadoop-aliyun-2.7.3.2.6.1.0-129.jar hadoop/share/hadoop/tools/lib/
    cp /usr/hdp/current/hadoop-client/lib/aliyun-* hadoop/share/hadoop/tools/lib/
    cp /usr/hdp/current/hadoop-client/lib/jdom-1.1.jar hadoop/share/hadoop/tools/lib/
    # Repackage the archive and upload it back to HDFS.
    tar zcf mapreduce.tar.gz hadoop
    hadoop fs -copyFromLocal mapreduce.tar.gz /hdp/apps/2.6.1.0-129/mapreduce/
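
    To confirm the upload, you can list the HDFS directory and check that mapreduce.tar.gz is present:
    hadoop fs -ls /hdp/apps/2.6.1.0-129/mapreduce/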

Verify the configurations

You can run TeraGen and TeraSort jobs to check whether the configurations take effect.

  • Run the following command to test TeraGen:
    sudo hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar teragen -Dmapred.map.tasks=100 10995116 oss://${your-bucket-name}/1G-input

    Sample success response:

    18/10/28 21:32:38 INFO client.RMProxy: Connecting to ResourceManager at cdh-master/192.168.0.161:8050
    18/10/28 21:32:38 INFO client.AHSProxy: Connecting to Application History server at cdh-master/192.168.0.161:10200
    18/10/28 21:32:38 INFO aliyun.oss: [Server]Unable to execute HTTP request: Not Found
    [ErrorCode]: NoSuchKey
    [RequestId]: 5BD5BA7641FCE369BC1D052C
    [HostId]: null
    18/10/28 21:32:38 INFO aliyun.oss: [Server]Unable to execute HTTP request: Not Found
    [ErrorCode]: NoSuchKey
    [RequestId]: 5BD5BA7641FCE369BC1D052F
    [HostId]: null
    18/10/28 21:32:39 INFO terasort.TeraSort: Generating 10995116 using 100
    18/10/28 21:32:39 INFO mapreduce.JobSubmitter: number of splits:100
    18/10/28 21:32:39 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1540728986531_0005
    18/10/28 21:32:39 INFO impl.YarnClientImpl: Submitted application application_1540728986531_0005
    18/10/28 21:32:39 INFO mapreduce.Job: The url to track the job: http://cdh-master:8088/proxy/application_1540728986531_0005/
    18/10/28 21:32:39 INFO mapreduce.Job: Running job: job_1540728986531_0005
    18/10/28 21:32:49 INFO mapreduce.Job: Job job_1540728986531_0005 running in uber mode : false
    18/10/28 21:32:49 INFO mapreduce.Job:  map 0% reduce 0%
    18/10/28 21:32:55 INFO mapreduce.Job:  map 1% reduce 0%
    18/10/28 21:32:57 INFO mapreduce.Job:  map 2% reduce 0%
    18/10/28 21:32:58 INFO mapreduce.Job:  map 4% reduce 0%
    ...
    18/10/28 21:34:40 INFO mapreduce.Job:  map 99% reduce 0%
    18/10/28 21:34:42 INFO mapreduce.Job:  map 100% reduce 0%
    18/10/28 21:35:15 INFO mapreduce.Job: Job job_1540728986531_0005 completed successfully
    18/10/28 21:35:15 INFO mapreduce.Job: Counters: 36
    ...
  • Run the following command to test TeraSort:
    sudo hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar terasort -Dmapred.map.tasks=100 oss://${your-bucket-name}/1G-input oss://${your-bucket-name}/1G-output

    Sample success response:

    18/10/28 21:39:00 INFO terasort.TeraSort: starting
    ...
    18/10/28 21:39:02 INFO mapreduce.JobSubmitter: number of splits:100
    18/10/28 21:39:02 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1540728986531_0006
    18/10/28 21:39:02 INFO impl.YarnClientImpl: Submitted application application_1540728986531_0006
    18/10/28 21:39:02 INFO mapreduce.Job: The url to track the job: http://cdh-master:8088/proxy/application_1540728986531_0006/
    18/10/28 21:39:02 INFO mapreduce.Job: Running job: job_1540728986531_0006
    18/10/28 21:39:09 INFO mapreduce.Job: Job job_1540728986531_0006 running in uber mode : false
    18/10/28 21:39:09 INFO mapreduce.Job:  map 0% reduce 0%
    18/10/28 21:39:17 INFO mapreduce.Job:  map 1% reduce 0%
    18/10/28 21:39:19 INFO mapreduce.Job:  map 2% reduce 0%
    18/10/28 21:39:20 INFO mapreduce.Job:  map 3% reduce 0%
    ...
    18/10/28 21:42:50 INFO mapreduce.Job:  map 100% reduce 75%
    18/10/28 21:42:53 INFO mapreduce.Job:  map 100% reduce 80%
    18/10/28 21:42:56 INFO mapreduce.Job:  map 100% reduce 86%
    18/10/28 21:42:59 INFO mapreduce.Job:  map 100% reduce 92%
    18/10/28 21:43:02 INFO mapreduce.Job:  map 100% reduce 98%
    18/10/28 21:43:05 INFO mapreduce.Job:  map 100% reduce 100%
    18/10/28 21:43:56 INFO mapreduce.Job: Job job_1540728986531_0006 completed successfully
    18/10/28 21:43:56 INFO mapreduce.Job: Counters: 54
    ...

If the tests are successful, the configurations take effect.