Use E-MapReduce to access LindormDFS

Last Updated: Jul 11, 2021

This topic describes how to use E-MapReduce (EMR) to access LindormDFS provided by ApsaraDB for Lindorm (Lindorm).

Introduction to EMR

Alibaba Cloud EMR is a platform as a service (PaaS) solution for big data processing. EMR is built on top of Alibaba Cloud Elastic Compute Service (ECS) and developed based on open source services such as Apache Hadoop, Apache Spark, Apache Hive, and Apache Flink. EMR allows you to use open source technologies to build data warehouses in the cloud and to meet business requirements in scenarios such as batch processing, stream processing, ad hoc queries, and machine learning.

Migrate data from EMR to LindormDFS

  1. Prepare for data migration.

    1. Activate EMR and create an EMR cluster. For more information, see Create a cluster.

      Note

      When you replace the EMR Hadoop Distributed File System (HDFS) service with LindormDFS, you can use ultra disks, SSD disks, or local disks as the temporary local storage media for shuffled data. For more information about storage planning, see Storage.

    2. Migrate data.

      1. Log on to the EMR console.

      2. On the Instances page, click one of the ECS instances to go to the Elastic Compute Service (ECS) console. In the Basic Information section, click Connect to log on to the instance.

      3. In the Service Configuration section, click the hdfs-site tab. On this tab, find the dfs.nameservices field and enter the namespace of LindormDFS in this field.

        Click Custom Configuration and add the configuration items of LindormDFS. To obtain these items, click Generate Configuration Items in the Lindorm console. In the Client Connection Configurations message that appears, find the configuration information on the hdfs-site Configurations tab.
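
        The following sketch shows the general shape of the generated items in key=value form. The property names follow the standard HDFS high availability client convention; the NameNode endpoints below are placeholders, so use the exact values from the Client Connection Configurations message.

          # Hypothetical example. Replace ${Instance ID} and the endpoints with the
          # values generated in the Lindorm console.
          dfs.nameservices=${Instance ID}
          dfs.ha.namenodes.${Instance ID}=nn1,nn2
          dfs.namenode.rpc-address.${Instance ID}.nn1=<NameNode 1 endpoint>
          dfs.namenode.rpc-address.${Instance ID}.nn2=<NameNode 2 endpoint>
          dfs.client.failover.proxy.provider.${Instance ID}=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider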

      4. Modify the configuration.

        Choose Cluster Service > HDFS and click Deploy Client Configuration.

      5. Check whether LindormDFS can be accessed as expected.

        Run the following command on an EMR node. Replace ${Instance ID} in the command with the actual instance ID of LindormDFS.

        hadoop fs -ls hdfs://${Instance ID}/
      6. Migrate data.

        We recommend that you migrate all the data in the related data directories and service directories, such as /user, /hbase, /spark-history, and /apps, to LindormDFS.

        1. Use the data migration tool hadoop distcp to migrate data in EMR HDFS. For more information, see Migrate data from self-managed HDFS to LindormDFS.

        2. Run the following commands to migrate all the data of the EMR HDFS service to LindormDFS. Replace ${Instance ID} in each command with the actual instance ID of LindormDFS.

          hadoop distcp /apps hdfs://${Instance ID}/
          hadoop distcp /emr-flow hdfs://${Instance ID}/
          hadoop distcp /emr-sparksql-udf hdfs://${Instance ID}/
          hadoop distcp /hbase hdfs://${Instance ID}/
          hadoop distcp /spark-history hdfs://${Instance ID}/
          hadoop distcp /tmp hdfs://${Instance ID}/
          hadoop distcp /user hdfs://${Instance ID}/

          Modify directory permissions.

          hadoop fs -chmod 777 hdfs://${Instance ID}/tmp
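
          Optionally, before you verify the migration, you can compare file counts between the source and destination directories. This spot check is not part of the original procedure; hadoop fs -count prints the directory count, file count, and total size for each path:

            hadoop fs -count /user hdfs://${Instance ID}/user
            hadoop fs -count /hbase hdfs://${Instance ID}/hbase
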
        3. Verify the result of data migration.

          1. Run the following command to view the data migration result. Replace ${Instance ID} in the command with the actual instance ID of LindormDFS.

            hadoop fs -ls hdfs://${Instance ID}/

            If the command output lists the migrated directories, such as /user and /hbase, the data migration was successful.

  2. Activate LindormDFS. For more information, see Activate LindormDFS.

    Note

    The virtual private cloud (VPC) and vSwitch specified for LindormDFS must be consistent with those for the EMR cluster. To view the configuration, perform the following steps:

    • Click Cluster Overview to view details about the VPC and vSwitches in the Network Info section.

Configure EMR to use LindormDFS

  1. Configure the HDFS service.

    1. Log on to the EMR console.

    2. In the top navigation bar, click Cluster Management. On the Cluster Management page, find the EMR cluster that you want to use to access LindormDFS and go to the cluster details page.

    3. Modify the configuration.

      1. Choose Cluster Service > HDFS and click the Configure tab.

      2. In the Service Configuration section of the Configure tab, click the core-site tab.

      3. Find the configuration item fs.defaultFS and replace the value with the endpoint of your LindormDFS.

      4. Click Save. In the Confirm Changes dialog box, configure the Description parameter and click OK.

      5. Click Deploy Client Configuration. In the Cluster Activities dialog box, configure the Description parameter and click OK.

    4. Restart the Apache YARN service.

      1. Choose Cluster Service > YARN.

      2. On the right side of the page, click Restart All Components.
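
      After the components are restarted, you can confirm that clients resolve the new default file system. This quick check is not part of the original procedure and uses the standard hdfs getconf command:

        hdfs getconf -confKey fs.defaultFS

      The command should print the LindormDFS endpoint, for example, hdfs://ld-uf681d0qf7w50f800.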

  2. Configure the Apache Hive service.

    Note

    • You can configure the Apache Hive service only after the HDFS service is configured.

    1. Modify the configuration of the Apache Hive service.

      1. Choose Cluster Service > Hive and click the Configure tab.

      2. In the Service Configuration section of the Configure tab, click the hive-site tab.

      3. Find the configuration item hive.metastore.warehouse.dir and delete the domain name of the EMR HDFS service from the value of the configuration item. Keep only /user/hive/warehouse.

      4. Click Save. In the Confirm Changes dialog box, configure the Description parameter and click OK.

      5. Click Deploy Client Configuration. In the Cluster Activities dialog box, configure the Description parameter and click OK.
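
      After the client configuration is deployed, you can confirm the new value from any node. A quick check, assuming the hive CLI is on the PATH:

        hive -e "set hive.metastore.warehouse.dir;"

      The output should show only the /user/hive/warehouse path, without the EMR HDFS domain name.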

    2. Modify the metadata of Apache Hive.

      1. Log on to the web shell console of the ECS instance where EMR Hive MetaStore is located and run the cd $HIVE_CONF_DIR command to go to the directory that stores Apache Hive configuration data. Then, obtain the database-related information from the hivemetastore-site.xml file.

        1. From the value of the configuration item javax.jdo.option.ConnectionURL, obtain the MySQL host name and the information about the database where the metadata is stored.

        2. From the value of the configuration item javax.jdo.option.ConnectionUserName, obtain the username to access MySQL.

        3. From the value of the configuration item javax.jdo.option.ConnectionPassword, obtain the password used to access MySQL.
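
        For convenience, you can print all three values at once. A minimal sketch, assuming that in hivemetastore-site.xml each <name> element is immediately followed by its <value> element on the next line:

          cd $HIVE_CONF_DIR
          grep -A 1 'javax.jdo.option.Connection' hivemetastore-site.xml | grep '<value>'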

      2. Go to the MySQL database hivemeta where the metadata of Apache Hive is stored. Modify the DBS and SDS tables, as shown in the following example.

        MariaDB [(none)]> use hivemeta;
        ## Modify the DBS table.
        
        MariaDB [hivemeta]> select * from DBS;
        +-------+-----------------------+----------------------------------------------------------------------------------+-------------------+------------+------------+
        | DB_ID | DESC                  | DB_LOCATION_URI                                                                  | NAME              | OWNER_NAME | OWNER_TYPE |
        +-------+-----------------------+----------------------------------------------------------------------------------+-------------------+------------+------------+
        |     1 | Default Hive database | hdfs://emr-header-1.cluster-190507:9000/user/hive/warehouse                      | default           | public     | ROLE       |
        |     2 | NULL                  | hdfs://emr-header-1.cluster-190507:9000/user/hive/warehouse/emr_presto_init__.db | emr_presto_init__ | has        | USER       |
        +-------+-----------------------+----------------------------------------------------------------------------------+-------------------+------------+------------+
        2 rows in set (0.00 sec)
        
        MariaDB [hivemeta]> UPDATE DBS
            -> SET    DB_LOCATION_URI = 'hdfs://ld-uf681d0qf7w50f800/user/hive/warehouse'
            -> WHERE  DB_ID = 1;
        Query OK, 1 row affected (0.00 sec)
        Rows matched: 1  Changed: 1  Warnings: 0
        MariaDB [hivemeta]> UPDATE DBS
            -> SET    DB_LOCATION_URI = 'hdfs://ld-uf681d0qf7w50f800/user/hive/warehouse/emr_presto_init__.db'
            -> WHERE  DB_ID = 2;
        Query OK, 1 row affected (0.00 sec)
        Rows matched: 1  Changed: 1  Warnings: 0
        ## Modify the SDS table. This table has no data because the EMR cluster is newly created.
        MariaDB [hivemeta]> select * from  SDS ;
        Empty set (0.00 sec)
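
        If your cluster already contains Hive tables, the SDS table is not empty and each LOCATION value in it must be rewritten as well. The following is a hedged sketch of a bulk rewrite; the MySQL host and credentials come from hivemetastore-site.xml, the prefixes reuse the sample values shown above, and you should back up the hivemeta database first:

          mysql -h <MySQL host> -u <username> -p hivemeta -e "
          UPDATE SDS
          SET    LOCATION = REPLACE(LOCATION,
                   'hdfs://emr-header-1.cluster-190507:9000',
                   'hdfs://ld-uf681d0qf7w50f800')
          WHERE  LOCATION LIKE 'hdfs://emr-header-1.cluster-190507:9000%';"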
        
      3. On the right side of the page, click Restart All Components to restart the Apache Hive service.

  3. Configure the Apache Spark service.

    Note

    • You can configure the Apache Spark service only after the HDFS service is configured.

    1. Modify the configuration of the Apache Spark service.

      1. Choose Cluster Service > Spark and click the Configure tab.

      2. In the Service Configuration section of the Configure tab, click the spark-defaults tab.

      3. Find the configuration item spark_eventlog_dir and replace the value of the configuration item with the connection string of LindormDFS.

      4. Click Save. In the Confirm Changes dialog box, configure the Description parameter and click OK.

      5. Click Deploy Client Configuration. In the Cluster Activities dialog box, configure the Description parameter and click OK.

    2. On the right side of the page, click Restart All Components to restart the Apache Spark service.
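
    To confirm that the change took effect on a node, you can inspect the deployed Spark client configuration. The configuration path below follows the EMR layout shown elsewhere in this topic and may differ in your version:

      grep eventLog /etc/ecm/spark-conf/spark-defaults.conf

    The spark.eventLog.dir value should point to the LindormDFS endpoint.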

  4. Configure the Apache HBase service.

    Note

    • You can configure the Apache HBase service only after the HDFS service is configured.

    1. Modify the configuration of the Apache HBase service.

      1. Choose Cluster Service > HBase and click the Configure tab.

      2. In the Service Configuration section of the Configure tab, click the hbase-site tab.

      3. Find the configuration item hbase.rootdir and delete the domain name of the EMR HDFS service from the value of the configuration item. Keep only /hbase.

      4. Click Save. In the Confirm Changes dialog box, configure the Description parameter and click OK.

      5. Click Deploy Client Configuration. In the Cluster Activities dialog box, configure the Description parameter and click OK.

    2. On the right side of the page, click Restart All Components to restart the Apache HBase service.
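
    Similarly, you can verify the deployed HBase client configuration. A minimal check, assuming the standard EMR configuration path:

      grep -A 1 'hbase.rootdir' /etc/ecm/hbase-conf/hbase-site.xml

    The <value> element that follows should contain only the /hbase path, without the EMR HDFS domain name.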

  5. Disable the HDFS service.

    Note

    Before you disable the HDFS service, make sure that the data stored on EMR HDFS is already migrated to LindormDFS. For more information about how to migrate the data, see Migrate data from self-managed HDFS clusters to LindormDFS.

    1. Choose Cluster Service > HDFS.

    2. On the right side of the page, click Stop All Components to disable the HDFS service.

    3. In the Cluster Activities dialog box, configure the Description parameter and click OK.

  6. Verify the Apache Hadoop service.

    1. Verify whether the Apache Hadoop service can be accessed as expected. Use the test package hadoop-mapreduce-examples-2.x.x.jar provided in EMR Hadoop to test the service. By default, the test package is stored in the /opt/apps/ecm/service/hadoop/2.x.x-1.x.x/package/hadoop-2.x.x-1.x.x/share/hadoop/mapreduce/ path.

      1. Run the following command to generate a file whose size is 128 MB in the /tmp/randomtextwriter path:

        hadoop jar /opt/apps/ecm/service/hadoop/2.8.5-1.5.3/package/hadoop-2.8.5-1.5.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar randomtextwriter -D mapreduce.randomtextwriter.totalbytes=134217728 -D mapreduce.job.maps=2 -D mapreduce.job.reduces=2 /tmp/randomtextwriter

        In the command, hadoop-mapreduce-examples-2.8.5.jar is the test package of EMR. Replace this package name with the name of the actual test package.

      2. Run the following command to check whether the file is generated as expected. If the file is generated as expected, LindormDFS is connected to the Apache Hadoop service.

        hadoop fs -ls hdfs://${Instance ID}/tmp/randomtextwriter

        Replace ${Instance ID} in the command with the actual instance ID of your LindormDFS.

  7. Verify whether LindormDFS can be accessed by using Apache Spark.

    1. Use the test package spark-examples_2.x-2.x.x.jar provided in EMR to test the service. By default, the test package is stored in the /opt/apps/ecm/service/spark/2.x.x-1.0.0/package/spark-2.x.x-1.0.0-bin-hadoop2.8/examples/jars path.

      1. Run the following command to generate a file whose size is 128 MB in the /tmp/randomtextwriter path:

        hadoop jar /opt/apps/ecm/service/hadoop/2.8.5-1.5.3/package/hadoop-2.8.5-1.5.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar randomtextwriter -D mapreduce.randomtextwriter.totalbytes=134217728 -D mapreduce.job.maps=2 -D mapreduce.job.reduces=2 /tmp/randomtextwriter

        In the command, hadoop-mapreduce-examples-2.8.5.jar is the test package of EMR. Replace this package name with the name of the actual test package.

      2. Use the test package of Apache Spark to read the test file from LindormDFS and run a word count on it.

        spark-submit   --master yarn --executor-memory 2G --executor-cores 2  --class org.apache.spark.examples.JavaWordCount /opt/apps/ecm/service/spark/2.4.5-hadoop2.8-1.1.0/package/spark-2.4.5-hadoop2.8-1.1.0/examples/jars/spark-examples_2.11-2.4.5.jar  /tmp/randomtextwriter

        Verify the result. If the command output contains word count results, the configuration of Apache Spark has taken effect.

  8. Verify the configuration of Apache Hive.

    1. Run the following command to go to the Apache Hive CLI:

      [hadoop@emr-worker-2 ~]$ hive
      
      Logging initialized using configuration in file:/etc/ecm/hive-conf-2.3.5-1.2.0/hive-log4j2.properties Async: true
      Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
      hive> 
    2. Run the following command to create a test table:

      hive> create table default.tt(id int , name string )  row format delimited   fields terminated by '\t'  lines terminated by '\n';
      OK
      Time taken: 2.058 seconds
    3. Run the following command to view the test table. If the value of the Location attribute in the command output is a LindormDFS path, the configuration of Apache Hive takes effect.

        hive> desc formatted default.tt;
        OK
        # col_name              data_type               comment             
                         
        id                      int                                         
        name                    string                                      
                         
        # Detailed Table Information             
        Database:               default                  
        Owner:                  hadoop                   
        CreateTime:             Mon Sep 28 21:26:14 CST 2020     
        LastAccessTime:         UNKNOWN                  
        Retention:              0                        
        Location:               hdfs://ld-uf681d0qf7w50f800/user/hive/warehouse/tt       
        Table Type:             MANAGED_TABLE            
        Table Parameters:                
                COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
                numFiles                0                   
                numRows                 0                   
                rawDataSize             0                   
                totalSize               0                   
                transient_lastDdlTime   1601299574          
                         
        # Storage Information            
        SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
        InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
        OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
        Compressed:             No                       
        Num Buckets:            -1                       
        Bucket Columns:         []                       
        Sort Columns:           []                       
        Storage Desc Params:             
                field.delim             \t                  
                line.delim              \n                  
                serialization.format    \t                  
        Time taken: 0.507 seconds, Fetched: 33 row(s)
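
    As an optional further check, not part of the original procedure, you can write a row into the test table and read it back to confirm that the data lands on LindormDFS. Replace ${Instance ID} with the actual instance ID:

      hive -e "insert into default.tt values (1, 'test'); select * from default.tt;"
      hadoop fs -ls hdfs://${Instance ID}/user/hive/warehouse/tt

    The SELECT statement should return the inserted row, and the second command should list the new data file under the table directory.
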
  9. Verify the configuration of Apache HBase.

    1. Run the following command to go to the Apache HBase shell:

      [hadoop@emr-worker-2 ~]$ hbase shell
      SLF4J: Class path contains multiple SLF4J bindings.
      SLF4J: Found binding in [jar:file:/opt/apps/ecm/service/hbase/1.4.9-1.0.0/package/hbase-1.4.9-1.0.0/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: Found binding in [jar:file:/opt/apps/ecm/service/hadoop/2.8.5-1.5.3/package/hadoop-2.8.5-1.5.3/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
      SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
      SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
      HBase Shell
      Use "help" to get list of supported commands.
      Use "exit" to quit this interactive shell.
      Version 1.4.9, r8214a16c5d80f077abf1aa01bb312851511a2b15, Thu Jan 31 20:35:22 CST 2019
      
      hbase(main):001:0> 
    2. Create a test table in Apache HBase.

      hbase(main):001:0> create 'hbase_test','info'
      0 row(s) in 1.6700 seconds
      
      => Hbase::Table - hbase_test
      hbase(main):002:0> put 'hbase_test','1', 'info:name' ,'Sariel'
      0 row(s) in 0.0960 seconds
      
      hbase(main):003:0>  put 'hbase_test','1', 'info:aa' ,'33'
      0 row(s) in 0.0110 seconds
      
      hbase(main):004:0>  
      
    3. Run the following command to check the /hbase/data/default/ path in LindormDFS. If the /hbase/data/default/ path contains the hbase_test directory, the configuration of Apache HBase takes effect.

      hadoop fs -ls /hbase/data/default
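
      You can also read the test rows back through the HBase shell. A minimal sketch that pipes a scan command into a non-interactive shell session:

        echo "scan 'hbase_test'" | hbase shell

      The output should list the info:name and info:aa cells that were written in the previous step.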

Detach and release the disks that are used by EMR HDFS

When EMR runs properly on LindormDFS, the disks attached to the ECS instances store only the temporary shuffle files generated during computing. You can detach and release the disks that were used to build the EMR HDFS service based on your business requirements to reduce the total cost of ownership (TCO) of your cluster.

Procedure

  1. Detach a data disk. For more information, see Detach a data disk.

  2. Release a disk. For more information, see Release a disk.