CDH is Cloudera's open source platform distribution that includes Apache Hadoop. You can use CDH to install, manage, maintain, and monitor Hadoop components, and CDH 6.x to manage Hadoop clusters. This topic describes how to use CDH 6 together with LindormDFS of ApsaraDB for Lindorm (Lindorm), which replaces the underlying Hadoop Distributed File System (HDFS) storage. By using CDH 6 with LindormDFS, you can build an open source, cloud-native big data system in which computing and storage are decoupled.
Prerequisites
Your Lindorm instance and CDH 6 are deployed in the same virtual private cloud (VPC).
The IP address of a node on which CDH 6 is deployed is added to the whitelist of the Lindorm instance. For information about how to add an IP address to a whitelist, see Create an instance.
Note: In this topic, CDH 6.3.2 is used. LindormDFS supports CDH 5.13.0 and later. For information about other versions, contact Expert support.
Set the file engine of LindormDFS as the default storage engine
Activate LindormDFS. For more information, see Activate LindormDFS.
Configure the endpoint of LindormDFS.
Log on to Cloudera Manager. On the homepage of Cloudera Manager, choose Configuration > Advanced Configuration Snippets. The Advanced Configuration Snippets page appears.
By default, the HDFS service that is built into CDH 6 uses a primary/secondary architecture in which NameNodes are deployed in primary/secondary mode to ensure high availability (HA). Therefore, deploy HDFS in HA mode when you initialize CDH 6.
Search for hdfs-site.xml in the search box at the top of the page.
In the HDFS Service Advanced Configuration Snippet (Safety Valve) for hdfs-site.xml section, add the following configuration items for LindormDFS:

Configuration item | Value | Description | Required
dfs.client.failover.proxy.provider.${Instance ID} | org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider | The failover policy that determines which node is the primary node. | Yes
dfs.ha.namenodes.${Instance ID} | nn1,nn2 | The name of LindormDFS in HA mode. | Yes
dfs.namenode.rpc-address.${Instance ID}.nn1 | ${Instance ID}-master1-001.lindorm.rds.aliyuncs.com:8020 | The address used by a NameNode for remote procedure calls (RPCs). | Yes
dfs.namenode.rpc-address.${Instance ID}.nn2 | ${Instance ID}-master2-001.lindorm.rds.aliyuncs.com:8020 | The address used by a NameNode for RPCs. | Yes
dfs.nameservices | ${Instance ID},nameservice1 | nameservice1 specifies the initial name of the nameservice provided by LindormDFS. | Yes

Note: ${Instance ID} specifies the ID of the Lindorm instance. Set the value to your actual instance ID.
Search for core-site.xml in the search box.
In the Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml section, add the configuration item fs.defaultFS and set the value to hdfs://${Instance ID}. This specifies that the file engine of LindormDFS is the default storage engine.
Click Save Changes.
Return to the homepage of Cloudera Manager and click HDFS. The management page of LindormDFS appears. On the page that appears, choose Actions > Deploy Client Configuration.
Check whether you can use CDH 6 to connect to LindormDFS.
Log on to a node on which CDH 6 is deployed. Then, run the following command:
$ hadoop fs -ls /
Verify the result. If the command lists the contents of the root directory of LindormDFS without errors, the endpoint of LindormDFS is configured correctly.
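In addition, you can confirm that the deployed client configuration resolves the Lindorm nameservice. The following commands are a minimal check; ld-xxxxxxxxxxx is a placeholder for your actual instance ID, and the commands should return the values that you configured above:
# Check that the Lindorm nameservice and the default file system come from the deployed client configuration.
$ hdfs getconf -confKey dfs.nameservices
$ hdfs getconf -confKey fs.defaultFS
$ hdfs getconf -confKey dfs.namenode.rpc-address.ld-xxxxxxxxxxx.nn1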
Install YARN
Modify the installation script of YARN.
Log on to the ResourceManager node, which is the management node of YARN.
Modify the installation script. You can run the following commands:
# Switch to the root user.
sudo su -
# Open the installation script of YARN.
root@cdhlindorm001 /opt/cloudera/cm-agent/service $ vim /opt/cloudera/cm-agent/service/yarn/yarn.sh
# Find the DEFAULT_FS field and add the following configuration:
DEFAULT_FS="$3"
DEFAULT_FS="hdfs://${Instance ID}"    # ${Instance ID} specifies the ID of the Lindorm instance. Set the value to your actual instance ID.
Note: The script is used to configure the environment when you install YARN.
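Before you add the YARN service, you can optionally confirm that the script was modified as intended. The following is a minimal check that assumes the script path shown above:
# Verify that the DEFAULT_FS override points to the Lindorm instance.
$ grep -n "DEFAULT_FS" /opt/cloudera/cm-agent/service/yarn/yarn.sh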
Log on to Cloudera Manager. Find the cluster that you want to manage, click the icon next to the cluster, and then click Add Service. The wizard to add a service appears.
Select YARN (MR2 Included) and click Continue. Follow the steps in the wizard to configure YARN. You can retain the default settings.
After YARN is installed, modify the mapred-site.xml file.
Log on to Cloudera Manager. On the homepage of Cloudera Manager, choose Configuration > Advanced Configuration Snippets. The Advanced Configuration Snippets page appears.
Search for mapred-site.xml in the search box at the top of the page.
In the YARN Service MapReduce Advanced Configuration Snippet (Safety Valve) section, add the following configuration item:
Add the configuration item mapreduce.application.framework.path and set the value to /user/yarn/mapreduce/mr-framework/{version}-mr-framework.tar.gz#mr-framework.
Click Save Changes.
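The value of mapreduce.application.framework.path must point to an MR framework archive that exists on LindormDFS. The following is a minimal check; the exact archive name depends on your CDH version:
# List the MapReduce framework archives that are available on LindormDFS.
$ hadoop fs -ls /user/yarn/mapreduce/mr-framework/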
Configure the memory size that can be used by YARN. This configuration is required because Spark jobs are started on YARN after Spark is installed.
Return to the homepage of Cloudera Manager and click YARN. The management page of YARN appears.
Click the Configuration tab.
Search for yarn.scheduler.maximum-allocation-mb in the search box at the top of the page. Set the value to 20 GiB. You can configure the value based on your business requirements.
Search for yarn.nodemanager.resource.memory-mb in the search box at the top of the page. Set the value to 20 GiB. You can configure the value based on your business requirements.
Click Save Changes.
Configure the new settings and restart YARN.
Return to the homepage of Cloudera Manager. Find YARN and click the Stale Configuration: Client configuration redeployment needed icon to redeploy the client configuration of YARN.
On the Stale Configurations page, click Restart Stale Services.
On the Restart Stale Services page, select the service that you want to restart and click Restart Now.
Wait until YARN is restarted and other components are re-configured. Then, click Finish.
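Before you submit test jobs, you can optionally confirm that the NodeManagers have registered with ResourceManager. A minimal check:
# List all registered NodeManagers and their states.
$ yarn node -list -all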
Check whether YARN is running as expected.
After you install YARN in CDH 6, a MapReduce example JAR file named hadoop-mapreduce-examples-3.0.0-cdh6.3.2.jar is available in the /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars directory. Use this file to test whether YARN is running as expected.
Log on to a node on which CDH 6 is deployed. Then, run the following command to generate a test file of 128 MB in the directory /tmp/randomtextwriter:
yarn jar /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/hadoop-mapreduce-examples-3.0.0-cdh6.3.2.jar randomtextwriter -D mapreduce.randomtextwriter.totalbytes=134217728 -D mapreduce.job.maps=4 -D mapreduce.job.reduces=4 /tmp/randomtextwriter
In the command, hadoop-mapreduce-examples-3.0.0-cdh6.3.2.jar specifies the example JAR file provided by CDH 6. Replace the value with the actual file name in your environment.
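To find the exact file name of the example JAR in your environment, you can list the jars directory of your CDH parcel. The parcel directory name below matches CDH 6.3.2 and varies with the CDH version:
# List the example JAR files that are shipped with the CDH parcel.
$ ls /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/ | grep hadoop-mapreduce-examples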
Check whether the job is submitted to YARN.
Log on to a node on which CDH 6 is deployed. Then, run the following command:
$ yarn application -list
If the submitted job appears in the list of applications, YARN is running as expected.
Check the output files of the test program to verify that YARN is running as expected.
Log on to a node on which CDH 6 is deployed. Then, run the following command:
$ hadoop fs -ls /tmp/randomtextwriter
If the test files are listed in the /tmp/randomtextwriter directory, YARN is running as expected.
Install Hive
Install a database that runs on the MySQL engine. The database is used to store Hive metadata.
Log on to a node on which CDH 6 is deployed and install a database that runs on the MySQL engine. You can run the following commands:
# Switch to the root user.
sudo su -
# Download a source Red Hat Package Manager (RPM) package to install a database that runs on the MySQL engine.
root@cdhlindorm001 ~/tool $ wget http://repo.mysql.com/mysql-community-release-el7-5.noarch.rpm
# Install the downloaded repository package so that yum can find the MySQL packages.
root@cdhlindorm001 ~/tool $ rpm -ivh mysql-community-release-el7-5.noarch.rpm
# Install a MySQL server and a MySQL client.
root@cdhlindorm001 ~/tool $ yum install mysql-server mysql -y
Start the database that runs on the MySQL engine, create a user, and grant the user the required permissions.
Run the following commands:
# Start the database that runs on the MySQL engine.
root@cdhlindorm001 ~/tool $ systemctl start mysqld.service
# Log on to the MySQL client, create a user, and grant the user the required permissions.
root@cdhlindorm001 ~/tool $ mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.6.50 MySQL Community Server (GPL)

Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> CREATE USER 'hive'@'%' IDENTIFIED BY '123456';
Query OK, 0 rows affected (0.00 sec)

mysql> create DATABASE hive;
Query OK, 1 row affected (0.00 sec)

mysql> use mysql;
Query OK, 0 rows affected (0.00 sec)

mysql> delete from user;
Query OK, 9 rows affected (0.00 sec)

mysql> grant all privileges on *.* to 'root'@'%' identified by '123456' with grant option;
# Flush the permissions.
mysql> flush privileges;
Query OK, 0 rows affected (0.01 sec)
Check whether the database that runs on the MySQL engine is configured.
# Check whether you can log on to the database. If the following information appears, the database is configured.
root@cdhlindorm001 ~/tool $ mysql -uroot -p
Warning: Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 10
Server version: 5.6.50 MySQL Community Server (GPL)

Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
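You can also confirm that the hive metadata database created in the previous step exists. A minimal check that uses the root password set above (123456):
# Confirm that the database that stores Hive metadata exists.
root@cdhlindorm001 ~/tool $ mysql -uroot -p123456 -e "SHOW DATABASES LIKE 'hive';"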
Log on to Cloudera Manager. Find the cluster that you want to manage, click the icon next to the cluster, and then click Add Service. The wizard to add a service appears.
Select Hive and click Continue. Follow the steps in the wizard to configure Hive.
Configure the database information.
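The wizard prompts for the connection details of the metastore database. Based on the setup above, a typical configuration uses the host on which MySQL was installed (cdhlindorm001 in this example), the database hive, and an account that has access to that database; the exact fields depend on your CDH version. A minimal connectivity check before you continue, using the root account created above as an example:
# Hypothetical pre-check: confirm that the configured account can access the hive database.
$ mysql -h cdhlindorm001 -uroot -p123456 -e "USE hive; SHOW TABLES;"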
Install Hive. Wait until Hive is installed.
Check whether Hive is running as expected.
Log on to a node on which CDH 6 is deployed. Then, run the following commands:
# Log on to the Hive client.
[root@cdhlindorm001 ~]# hive
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/jars/hive-common-2.1.1-cdh6.3.2.jar!/hive-log4j2.properties Async: false

WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive> create table foo (id int, name string);
OK
Time taken: 1.409 seconds
hive> insert into table foo select * from (select 12,"xyz")a;
Query ID = root_20201126162112_59e6a5fc-99c2-45a4-bf84-73c16c39de8a
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
20/11/26 16:21:12 INFO client.RMProxy: Connecting to ResourceManager at cdhlindorm001/192.168.0.218:8032
20/11/26 16:21:13 INFO client.RMProxy: Connecting to ResourceManager at cdhlindorm001/192.168.0.218:8032
Starting Job = job_1606364936355_0003, Tracking URL = http://cdhlindorm001:8088/proxy/application_1606364936355_0003/
Kill Command = /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hadoop/bin/hadoop job -kill job_1606364936355_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-11-26 16:21:20,758 Stage-1 map = 0%, reduce = 0%
2020-11-26 16:21:25,864 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.17 sec
MapReduce Total cumulative CPU time: 1 seconds 170 msec
Ended Job = job_1606364936355_0003
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://ld-xxxxxxxxxxx/user/hive/warehouse/foo/.hive-staging_hive_2020-11-26_16-21-12_133_7009525880995260840-1/-ext-10000
Loading data to table default.foo
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.17 sec   HDFS Read: 5011 HDFS Write: 74 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 170 msec
OK
Time taken: 15.429 seconds
Use the Hive client to query the inserted data. If the returned data is valid, Hive is installed.
hive> select * from foo;
OK
12      xyz
Time taken: 0.091 seconds, Fetched: 1 row(s)
hive>
Check whether the data of the foo table is stored on LindormDFS.
Run the following command to query file information:
[root@cdhlindorm001 ~]# hadoop fs -ls /user/hive/warehouse/foo
If the data files of the foo table are listed, Hive is properly connected to LindormDFS.
Install HBase
Log on to Cloudera Manager. Find the cluster that you want to manage, click the icon next to the cluster, and then click Add Service. The wizard to add a service appears.
Select HBase and click Continue. Follow the steps in the wizard to configure HBase. You can retain the default settings.
Install HBase.
If the wizard indicates that all steps are complete, HBase is installed. Then, click Finish.
Modify the hbase-site.xml file.
Log on to Cloudera Manager. On the homepage of Cloudera Manager, choose Configuration > Advanced Configuration Snippets. The Advanced Configuration Snippets page appears.
Search for hbase-site.xml in the search box at the top of the page.
In the HBase Service Advanced Configuration Snippet (Safety Valve) for hbase-site.xml and HBase Client Advanced Configuration Snippet (Safety Valve) for hbase-site.xml sections, add the following configuration items:
Add the configuration item hbase.rootdir and set the value to /hbase.
Add the configuration item hbase.unsafe.stream.capability.enforce and set the value to false.
Click Save Changes.
Configure the new settings and start HBase.
Return to the homepage of Cloudera Manager. Find HBase and click the Stale Configuration: Client configuration redeployment needed icon to redeploy the client configuration of HBase.
On the Stale Configurations page, click Restart Stale Services.
On the Restart Stale Services page, select the service that you want to restart and click Restart Now.
Wait until HBase is restarted and other components are re-configured. Then, click Finish.
Check whether HBase is running as expected. Log on to a node on which CDH 6 is deployed and perform the following steps:
Run the following command to open HBase Shell:
[root@cdhlindorm001 ~]# hbase shell
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.1.0-cdh6.3.2, rUnknown, Fri Nov  8 05:44:07 PST 2019
Took 0.0009 seconds
hbase(main):001:0>
Create a test table on HBase and insert data into the table.
[root@cdhlindorm001 ~]# hbase shell
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.1.0-cdh6.3.2, rUnknown, Fri Nov  8 05:44:07 PST 2019
Took 0.0009 seconds
hbase(main):001:0> create 'hbase_test','info'
Created table hbase_test
Took 1.0815 seconds
=> Hbase::Table - hbase_test
hbase(main):002:0> put 'hbase_test','1', 'info:name' ,'Sariel'
Took 0.1233 seconds
hbase(main):003:0> put 'hbase_test','1', 'info:age' ,'22'
Took 0.0089 seconds
hbase(main):004:0> put 'hbase_test','1', 'info:industry' ,'IT'
Took 0.0115 seconds
Run the following command to query the inserted data:
hbase(main):005:0> scan 'hbase_test'
If the inserted rows are returned, HBase is running as expected.
Run the following command to query the /hbase/data/default directory of LindormDFS. If the hbase_test directory exists in /hbase/data/default, HBase is running as expected.
# hadoop fs -ls /hbase/data/default
Install Spark
Log on to Cloudera Manager. Find the cluster that you want to manage, click the icon next to the cluster, and then click Add Service. The wizard to add a service appears.
Select Spark and click Continue. Follow the steps in the wizard to configure Spark. You can retain the default settings.
Select the dependencies based on your business requirements.
Install Spark.
Verify the result. If the wizard indicates that all steps are complete, Spark is installed.
Start Spark. Spark does not automatically start after it is installed.
Return to the homepage of Cloudera Manager and find Spark.
Click the More icon next to Spark and click Start.
Check whether Spark is running as expected.
Log on to a node on which CDH 6 is deployed and run the Spark DFSReadWriteTest example to read and write test data on LindormDFS.
[root@cdhlindorm001 ~]# spark-submit --master yarn --executor-memory 900M --executor-cores 2 --class org.apache.spark.examples.DFSReadWriteTest /opt/cloudera/parcels/CDH/jars/spark-examples_2.11-2.4.0-cdh6.3.2.jar /etc/profile /output
Verify the result. If the job runs to completion and writes the output to LindormDFS without errors, Spark is running as expected.
Run the following command to query the test data:
[root@cdhlindorm001 ~]# hadoop fs -ls /output/dfs_read_write_test
If the output files are listed in the /output/dfs_read_write_test directory, Spark is running as expected.
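To inspect the data that the test job wrote, you can print the first lines of the output files. The part-* file names are the default naming of Spark text output and may differ in your environment:
# Print the first lines of the files that the DFSReadWriteTest job wrote to LindormDFS.
$ hadoop fs -cat /output/dfs_read_write_test/part-* | head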