
Use CDH 6 together with LindormDFS

Last Updated: Jul 09, 2021

CDH is Cloudera's open source platform distribution that includes Apache Hadoop. CDH allows you to install, manage, maintain, and monitor Hadoop components. You can use CDH 6.x to manage Hadoop clusters. This topic describes how to use CDH 6 together with LindormDFS of ApsaraDB for Lindorm (Lindorm). LindormDFS replaces the underlying Hadoop Distributed File System (HDFS) storage. You can use CDH 6 together with LindormDFS to build an open source, cloud-native big data system in which computing is decoupled from storage.

Prerequisites

  • Your Lindorm instance and CDH 6 are deployed in the same virtual private cloud (VPC).

  • The IP address of a node on which CDH 6 is deployed is added to the whitelist of the Lindorm instance. For information about how to add an IP address to a whitelist, see Create an instance.

    Note

    In this topic, CDH 6.3.2 is used. LindormDFS supports CDH 5.13.0 or later. If you need to obtain information about other versions, contact Expert support.

Set the file engine of LindormDFS as the default storage engine

  1. Activate LindormDFS. For more information, see Activate LindormDFS.

  2. Configure the endpoint of LindormDFS.

    1. Log on to Cloudera Manager. On the homepage of Cloudera Manager, choose Configuration > Advanced Configuration Snippets. The Advanced Configuration Snippets page appears.

    2. By default, the Hadoop distributed file system that is built into CDH 6 uses a primary/secondary architecture, in which NameNodes are deployed in primary/secondary mode to ensure high availability (HA). Therefore, deploy HDFS in HA mode when you initialize CDH 6.

    3. Search for hdfs-site.xml in the search box at the top of the page.

    4. In the HDFS Service Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml section, add the following configuration items for LindormDFS:

      Configuration item | Value | Description | Required
      ------------------ | ----- | ----------- | --------
      dfs.client.failover.proxy.provider.${Instance ID} | org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider | The failover policy that determines which node is the primary node. | Yes
      dfs.ha.namenodes.${Instance ID} | nn1,nn2 | The name of LindormDFS in HA mode. | Yes
      dfs.namenode.rpc-address.${Instance ID}.nn1 | ${Instance ID}-master1-001.lindorm.rds.aliyuncs.com:8020 | The address used by NameNode nn1 for remote procedure calls (RPCs). | Yes
      dfs.namenode.rpc-address.${Instance ID}.nn2 | ${Instance ID}-master2-001.lindorm.rds.aliyuncs.com:8020 | The address used by NameNode nn2 for RPCs. | Yes
      dfs.nameservices | ${Instance ID},nameservice1 | nameservice1 specifies the initial name of the nameservice provided by LindormDFS. | Yes

      Note

      ${Instance ID} specifies the ID of the Lindorm instance. Set the value to your actual instance ID.
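      For reference, the configuration items above correspond to the following hdfs-site.xml fragment. This is a sketch: replace ${Instance ID} with your actual Lindorm instance ID.

      ```xml
      <!-- Sketch of the LindormDFS endpoint settings for hdfs-site.xml.
           Replace ${Instance ID} with your actual Lindorm instance ID. -->
      <property>
        <name>dfs.nameservices</name>
        <value>${Instance ID},nameservice1</value>
      </property>
      <property>
        <name>dfs.client.failover.proxy.provider.${Instance ID}</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
      </property>
      <property>
        <name>dfs.ha.namenodes.${Instance ID}</name>
        <value>nn1,nn2</value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.${Instance ID}.nn1</name>
        <value>${Instance ID}-master1-001.lindorm.rds.aliyuncs.com:8020</value>
      </property>
      <property>
        <name>dfs.namenode.rpc-address.${Instance ID}.nn2</name>
        <value>${Instance ID}-master2-001.lindorm.rds.aliyuncs.com:8020</value>
      </property>
      ```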

    5. In the HDFS Client Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml section, add the same configuration items with the same values as in the previous step.

      Note

      ${Instance ID} specifies the ID of the Lindorm instance. Set the value to your actual instance ID.

    6. Search for core-site.xml in the search box.

    7. In the Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml section, add the configuration item fs.defaultFS and set the value to hdfs://${Instance ID}. This specifies that the file engine of LindormDFS is the default storage engine.

    8. Click Save Changes.

    9. Return to the homepage of Cloudera Manager and click HDFS. On the management page of LindormDFS that appears, choose Actions > Deploy Client Configuration.
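      The fs.defaultFS setting configured above corresponds to the following core-site.xml fragment (a sketch; replace ${Instance ID} with your actual Lindorm instance ID):

      ```xml
      <!-- Make the LindormDFS file engine the default storage engine. -->
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://${Instance ID}</value>
      </property>
      ```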

  3. Check whether you can use CDH 6 to connect to LindormDFS.

    1. Log on to a node on which CDH 6 is deployed. Then, run the following command:

      $ hadoop fs -ls /
    2. Verify the result. If the command output lists the directories in the root of LindormDFS without errors, the endpoint of LindormDFS is configured.

Install YARN

  1. Modify the installation script of YARN.

    1. Log on to the ResourceManager node, which is a management node in YARN.

    2. Modify the installation script. You can run the following commands:

      # sudo su -
      root@cdhlindorm001 /opt/cloudera/cm-agent/service $ vim /opt/cloudera/cm-agent/service/yarn/yarn.sh
        # Find the DEFAULT_FS field and add the following line below it.
        # The new assignment overrides the original DEFAULT_FS="$3" line.
        DEFAULT_FS="hdfs://${Instance ID}"
        # ${Instance ID} specifies the ID of the Lindorm instance. Set the value to your actual instance ID.
      Note

      The script is used to configure an environment when you install YARN.
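      The edit above can also be scripted. The following is a minimal sketch that patches the DEFAULT_FS assignment with GNU sed; it operates on a local copy for illustration, and ld-example001 is a placeholder instance ID, not a real one. On a real node, point sed at /opt/cloudera/cm-agent/service/yarn/yarn.sh (as the root user, after backing the file up) and use your actual instance ID.

      ```shell
      # Sketch: hard-code DEFAULT_FS in a copy of yarn.sh.
      # ld-example001 is a placeholder; replace it with your Lindorm instance ID.
      INSTANCE_ID="ld-example001"
      yarn_sh="$(mktemp)"                          # stand-in for yarn.sh
      printf 'DEFAULT_FS="$3"\n' > "$yarn_sh"      # the original assignment in the script
      # Comment out the original line and append the hard-coded Lindorm endpoint (GNU sed).
      sed -i 's|^DEFAULT_FS="\$3"|#&\nDEFAULT_FS="hdfs://'"$INSTANCE_ID"'"|' "$yarn_sh"
      cat "$yarn_sh"   # shows the original line commented out and the new line below it
      ```
      
      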

  2. Log on to Cloudera Manager. Find the cluster that you want to manage, click the icon next to the cluster, and then click Add Service. The wizard to add a service appears.

  3. Select YARN (MR2 Included) and click Continue. Follow the steps in the wizard to configure YARN. You can retain the default settings.

  4. After YARN is installed, modify the mapred-site.xml file.

    1. Log on to Cloudera Manager. On the homepage of Cloudera Manager, choose Configuration > Advanced Configuration Snippets. The Advanced Configuration Snippets page appears.

    2. Search for mapred-site.xml in the search box at the top of the page.

    3. In the YARN Service MapReduce Advanced Configuration Snippet (Safety Valve) section, add the following configuration item:

      Add the configuration item mapreduce.application.framework.path and set the value to /user/yarn/mapreduce/mr-framework/{version}-mr-framework.tar.gz#mr-framework.

    4. Click Save Changes.
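      The item above corresponds to the following mapred-site.xml fragment. {version} is a placeholder for the MapReduce framework version that is bundled with your CDH release; substitute the actual version as appropriate to your setup.

      ```xml
      <property>
        <name>mapreduce.application.framework.path</name>
        <value>/user/yarn/mapreduce/mr-framework/{version}-mr-framework.tar.gz#mr-framework</value>
      </property>
      ```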

  5. Configure the memory size that can be used by YARN. Spark runs on YARN after Spark is installed, so YARN must be allocated sufficient memory.

    1. Return to the homepage of Cloudera Manager and click YARN. The management page of YARN appears.

    2. Click the Configuration tab.

    3. Search for yarn.scheduler.maximum-allocation-mb in the search box at the top of the page. Set the value to 20 GiB (20480 MB). You can adjust the value based on your business requirements.

    4. Search for yarn.nodemanager.resource.memory-mb in the search box at the top of the page. Set the value to 20 GiB (20480 MB). You can adjust the value based on your business requirements.

    5. Click Save Changes.
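    In yarn-site.xml terms, the two settings above would look like the following fragment (a sketch; 20480 MB equals the 20 GiB used in this example, and you can adjust the values to your business requirements):

    ```xml
    <!-- Maximum memory per container request, and total memory per NodeManager, in MB. -->
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>20480</value>
    </property>
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>20480</value>
    </property>
    ```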

  6. Configure the new settings and restart YARN.

    1. Return to the homepage of Cloudera Manager. Find YARN and click the Stale Configuration: Client configuration redeployment needed icon to redeploy YARN.

    2. On the Stale Configurations page, click Restart Stale Services.

    3. On the Restart Stale Services page, select the service that you want to restart and click Restart Now.

    4. Wait until YARN is restarted and other components are re-configured. Then, click Finish.

  7. Check whether YARN is running as expected.

    After you use CDH 6 to install YARN, an example file named hadoop-mapreduce-examples-3.0.0-cdh6.3.2.jar is generated in the /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars directory. You can use this file to test whether YARN is running as expected.

    1. Log on to a node on which CDH 6 is deployed. Then, run the following command to generate a test file of 128 MB in the directory /tmp/randomtextwriter:

      yarn jar /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/hadoop-mapreduce-examples-3.0.0-cdh6.3.2.jar randomtextwriter  -D mapreduce.randomtextwriter.totalbytes=134217728  -D mapreduce.job.maps=4 -D mapreduce.job.reduces=4   /tmp/randomtextwriter

      In the command, hadoop-mapreduce-examples-3.0.0-cdh6.3.2.jar specifies the example file that is generated in CDH 6. The file name varies based on your CDH version. Replace the value with the actual file name.

    2. Check whether the job is submitted to YARN.

      1. Log on to a node on which CDH 6 is deployed. Then, run the following command:

        $ yarn application -list

        If the command output lists the submitted randomtextwriter job, YARN is running as expected.
      2. Run a test program to test whether YARN is running as expected.

        1. Log on to a node on which CDH 6 is deployed. Then, run the following command:

          $  hadoop fs -ls  /tmp/randomtextwriter

          If the command output lists the generated files in the /tmp/randomtextwriter directory, YARN is running as expected.

Install Hive

  1. Install a database that runs on the MySQL engine. The database is used to store Hive metadata.

    1. Log on to a node on which CDH 6 is deployed and install a database that runs on the MySQL engine. You can run the following commands:

      # Switch to the root user. 
      sudo su -
      # Download a source Red Hat Package Manager (RPM) package to install a database that runs on the MySQL engine.
      root@cdhlindorm001 ~/tool $ wget http://repo.mysql.com/mysql-community-release-el7-5.noarch.rpm 
      # Install a MySQL server and a MySQL client.
      root@cdhlindorm001 ~/tool $ yum install mysql-server mysql -y

    2. Start the database that runs on the MySQL engine, create a user, and grant the user the required permissions.

      1. Run the following commands:

        # Start the database that runs on the MySQL engine.
        root@cdhlindorm001 ~/tool $ systemctl start mysqld.service
        # Log on to the MySQL client, create a user, and grant the user the required permissions.
        root@cdhlindorm001 ~/tool $ mysql
        Welcome to the MySQL monitor.  Commands end with ; or \g.
        Your MySQL connection id is 2
        Server version: 5.6.50 MySQL Community Server (GPL)
        
        Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.
        
        Oracle is a registered trademark of Oracle Corporation and/or its
        affiliates. Other names may be trademarks of their respective
        owners.
        
        Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
        
        mysql> CREATE USER 'hive'@'%' IDENTIFIED BY '123456';
        Query OK, 0 rows affected (0.00 sec)
        mysql> create DATABASE hive;
        Query OK, 1 row affected (0.00 sec)
        mysql> use mysql;
        Query OK, 0 rows affected (0.00 sec)
        mysql> delete from user;
        Query OK, 9 rows affected (0.00 sec)
        mysql> grant all privileges on *.* to 'root'@'%' identified by '123456' with grant option;
        # Flush the permissions.
        mysql> flush privileges;
        Query OK, 0 rows affected (0.01 sec)
      2. Check whether the database that runs on the MySQL engine is configured.

        # Check whether you can log on to the database. If the following information appears, the database is configured.
        root@cdhlindorm001 ~/tool $ mysql -uroot -p
        Warning: Using a password on the command line interface can be insecure.
        Welcome to the MySQL monitor.  Commands end with ; or \g.
        Your MySQL connection id is 10
        Server version: 5.6.50 MySQL Community Server (GPL)
        
        Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.
        
        Oracle is a registered trademark of Oracle Corporation and/or its
        affiliates. Other names may be trademarks of their respective
        owners.
        
        Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
  2. Log on to Cloudera Manager. Find the cluster that you want to manage, click the icon next to the cluster, and then click Add Service. The wizard to add a service appears.

  3. Select Hive and click Continue. Follow the steps in the wizard to configure Hive.

    1. Configure the database information.

    2. Install Hive. Wait until Hive is installed.

  4. Check whether Hive is running as expected.

    1. Log on to a node on which CDH 6 is deployed. Then, run the following commands:

      # Log on to the Hive client.
      [root@cdhlindorm001 ~]# hive
      WARNING: Use "yarn jar" to launch YARN applications.
      SLF4J: Class path contains multiple SLF4J bindings.
      SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
      SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
      
      Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/hive-common-2.1.1-cdh6.3.2.jar!/hive-log4j2.properties Async: false
      
      WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
      hive> create table foo (id int, name string);
      OK
      Time taken: 1.409 seconds
      hive> insert into table foo select * from (select 12,"xyz")a;
      Query ID = root_20201126162112_59e6a5fc-99c2-45a4-bf84-73c16c39de8a
      Total jobs = 3
      Launching Job 1 out of 3
      Number of reduce tasks is set to 0 since there's no reduce operator
      20/11/26 16:21:12 INFO client.RMProxy: Connecting to ResourceManager at cdhlindorm001/192.168.0.218:8032
      20/11/26 16:21:13 INFO client.RMProxy: Connecting to ResourceManager at cdhlindorm001/192.168.0.218:8032
      Starting Job = job_1606364936355_0003, Tracking URL = http://cdhlindorm001:8088/proxy/application_1606364936355_0003/
      Kill Command = /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hadoop/bin/hadoop job  -kill job_1606364936355_0003
      Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
      2020-11-26 16:21:20,758 Stage-1 map = 0%,  reduce = 0%
      2020-11-26 16:21:25,864 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.17 sec
      MapReduce Total cumulative CPU time: 1 seconds 170 msec
      Ended Job = job_1606364936355_0003
      Stage-4 is selected by condition resolver.
      Stage-3 is filtered out by condition resolver.
      Stage-5 is filtered out by condition resolver.
      Moving data to directory hdfs://ld-xxxxxxxxxxx/user/hive/warehouse/foo/.hive-staging_hive_2020-11-26_16-21-12_133_7009525880995260840-1/-ext-10000
      Loading data to table default.foo
      MapReduce Jobs Launched:
      Stage-Stage-1: Map: 1   Cumulative CPU: 1.17 sec   HDFS Read: 5011 HDFS Write: 74 HDFS EC Read: 0 SUCCESS
      Total MapReduce CPU Time Spent: 1 seconds 170 msec
      OK
      Time taken: 15.429 seconds
    2. Use the Hive client to query the inserted data. If the returned data is valid, Hive is installed.

      hive> select * from foo;
      OK
      12 xyz
      Time taken: 0.091 seconds, Fetched: 1 row(s)
      hive>
    3. Run a test program to test whether Hive is running as expected.

      1. Run the following command to query file information:

        [root@cdhlindorm001 ~]# hadoop fs -ls /user/hive/warehouse/foo
      2. If the command output lists the data files of the foo table in /user/hive/warehouse/foo, Hive is properly connected to LindormDFS.

Install HBase

  1. Log on to Cloudera Manager. Find the cluster that you want to manage, click the icon next to the cluster, and then click Add Service. The wizard to add a service appears.

  2. Select HBase and click Continue. Follow the steps in the wizard to configure HBase. You can retain the default settings.

    1. Install HBase.

    2. Wait until the wizard shows that HBase is installed. Then, click Finish.

  3. Modify the hbase-site.xml file.

    1. Log on to Cloudera Manager. On the homepage of Cloudera Manager, choose Configuration > Advanced Configuration Snippets. The Advanced Configuration Snippets page appears.

    2. Search for hbase-site.xml in the search box at the top of the page.

    3. In the HBase Service Advanced Configuration Snippet (Safety Valve) for hbase-site.xml and HBase Client Advanced Configuration Snippet (Safety Valve) for hbase-site.xml sections, add the following configuration items:

      1. Add the configuration item hbase.rootdir and set the value to /hbase.

      2. Add the configuration item hbase.unsafe.stream.capability.enforce and set the value to false.

      3. Click Save Changes.
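      The two items above correspond to the following hbase-site.xml fragment (a sketch):

      ```xml
      <!-- Store HBase data in the /hbase directory of LindormDFS and relax
           the stream capability check that LindormDFS does not require. -->
      <property>
        <name>hbase.rootdir</name>
        <value>/hbase</value>
      </property>
      <property>
        <name>hbase.unsafe.stream.capability.enforce</name>
        <value>false</value>
      </property>
      ```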

  4. Configure the new settings and start HBase.

    1. Return to the homepage of Cloudera Manager. Find HBase and click the Stale Configuration: Client configuration redeployment needed icon to redeploy HBase.

    2. On the Stale Configurations page, click Restart Stale Services.

    3. On the Restart Stale Services page, select the service that you want to restart and click Restart Now.

    4. Wait until HBase is restarted and other components are re-configured. Then, click Finish.

  5. Check whether HBase is running as expected. Log on to a node on which CDH 6 is deployed and perform the following steps:

    1. Run the following command to open HBase Shell:

      [root@cdhlindorm001 ~]# hbase shell
      HBase Shell
      Use "help" to get list of supported commands.
      Use "exit" to quit this interactive shell.
      For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
      Version 2.1.0-cdh6.3.2, rUnknown, Fri Nov  8 05:44:07 PST 2019
      Took 0.0009 seconds
      hbase(main):001:0>
    2. Create a test table on HBase and insert data into the table.

      hbase(main):001:0> create 'hbase_test','info'
      Created table hbase_test
      Took 1.0815 seconds
      => Hbase::Table - hbase_test
      hbase(main):002:0> put 'hbase_test','1', 'info:name' ,'Sariel'
      Took 0.1233 seconds
      hbase(main):003:0>  put 'hbase_test','1', 'info:age' ,'22'
      Took 0.0089 seconds
      hbase(main):004:0> put 'hbase_test','1', 'info:industry' ,'IT'
      Took 0.0115 seconds

    3. Run the following command to query the inserted data. If the returned result is valid, HBase is running as expected.

      hbase(main):005:0> scan 'hbase_test'

    4. Run the following command to query the /hbase/data/default directory of LindormDFS. If the hbase_test directory exists in /hbase/data/default, HBase is running as expected.

      # hadoop fs -ls /hbase/data/default

Install Spark

  1. Log on to Cloudera Manager. Find the cluster that you want to manage, click the icon next to the cluster, and then click Add Service. The wizard to add a service appears.

  2. Select Spark and click Continue. Follow the steps in the wizard to configure Spark. You can retain the default settings.

    1. Select the dependencies based on your business requirements.

    2. Install Spark.

    3. Verify the result. Wait until the wizard shows that Spark is installed.

  3. Start Spark. Spark does not automatically start after it is installed.

    1. Return to the homepage of Cloudera Manager and find Spark.

    2. Click the More icon next to Spark and click Start.

  4. Check whether Spark is running as expected.

    1. Log on to a node on which CDH 6 is deployed and use Spark to query the test file on LindormDFS.

      [root@cdhlindorm001 ~]# spark-submit   --master yarn --executor-memory 900M --executor-cores 2  --class org.apache.spark.examples.DFSReadWriteTest  /opt/cloudera/parcels/CDH/jars/spark-examples_2.11-2.4.0-cdh6.3.2.jar   /etc/profile /output
    2. Verify the result. If the command output indicates that the DFS read and write tests succeeded, Spark is running as expected.
    3. Run the following command to query the test data:

      [root@cdhlindorm001 ~]# hadoop fs -ls /output/dfs_read_write_test

      If the command output lists the files in the /output/dfs_read_write_test directory, Spark is running as expected.