You can use a Hive internal table or external table to access HBase data in your E-MapReduce (EMR) cluster. This topic describes how to use Hive in your EMR cluster to access EMR HBase data.

Prerequisites

An EMR Hadoop cluster is created, and HBase and ZooKeeper are selected from the optional services when you create the cluster. For more information, see Create a cluster.

Access HBase data by using a Hive internal table

If the table does not yet exist in HBase, you can create an internal table in Hive. Hive then automatically creates a table with the same schema in HBase. In this example, an internal table is created in Hive to access HBase data.

  1. Open the Hive CLI.
    1. Log on to the master node of your EMR cluster in SSH mode. For more information, see Log on to a cluster.
    2. Run the following command to open the Hive CLI:
      hive
      If the following information is returned, the Hive CLI is opened:
      Logging initialized using configuration in file:/etc/ecm/hive-conf-2.3.5-2.0.3/hive-log4j2.properties Async: true
      Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  2. Create an internal table in Hive and query data in the table.
    1. Run the following command to create an internal table in Hive:
      create table hive_hbase_table(key int, value string)
      stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
      with serdeproperties("hbase.columns.mapping" = ":key,cf1:val")
      tblproperties("hbase.table.name" = "hive_hbase_table", "hbase.mapred.output.outputtable" = "hive_hbase_table");
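      Note HBaseStorageHandler stores the data of the internal table in HBase and reads it back through Hive. In hbase.columns.mapping, :key maps the Hive column key to the HBase row key, and cf1:val maps the Hive column value to the val column in the cf1 column family.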
    2. Run the following command to insert data into the internal table:
      insert into hive_hbase_table values(212,'bab');
    3. Run the following command to query data in the table:
      select * from hive_hbase_table;
      The following information is returned:
      OK
      212 bab
      Time taken: 0.337 seconds, Fetched: 1 row(s)
  3. Open the HBase CLI.
    1. Log on to the master node of your EMR cluster in SSH mode. For more information, see Log on to a cluster.
    2. Run the following command to open the HBase CLI:
      hbase shell
      If the HBase shell starts and displays its prompt, the HBase CLI is opened.
  4. Run the following command to check whether a table with the same schema as the Hive internal table exists in HBase:
    describe 'hive_hbase_table'
    The output describes hive_hbase_table, including the cf1 column family.
    Note This output shows that Hive created the table in HBase.
  5. Run the following command to check whether the table in HBase contains the same data as the Hive internal table:
    scan 'hive_hbase_table'
    The following information is returned:
    ROW                                           COLUMN+CELL                                                                                                                          
     212                                          column=cf1:val, timestamp=1624513121062, value=bab                                                                                   
    1 row(s) in 0.2320 seconds
    Note The preceding information shows that the table in HBase contains the same data as the Hive internal table. This indicates that you have used Hive to access HBase data.
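
The same check can also be run without opening either CLI interactively. The following is a minimal sketch, assuming the hive and hbase clients are on the PATH of the master node. The function name verify_internal_table is hypothetical. It uses hive -e to run a single statement and pipes a single command into hbase shell -n (non-interactive mode).

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads hive_hbase_table through both Hive and HBase.
verify_internal_table() {
  # Query the table through Hive.
  hive -e "select * from hive_hbase_table;" || return 1
  # Read the same rows directly from HBase.
  echo "scan 'hive_hbase_table'" | hbase shell -n || return 1
}

# Run the check only on a node where both clients are installed.
if command -v hive >/dev/null 2>&1 && command -v hbase >/dev/null 2>&1; then
  verify_internal_table
else
  echo "hive or hbase client not found; skipping check"
fi
```

If both commands list the row 212, Hive and HBase are reading the same underlying table.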

Access HBase data by using a Hive external table

If you want to use Hive to access an existing HBase table, create an external table in Hive and map it to the HBase table. In this example, the existing HBase table is named hbase_table and stores its columns in a column family named f.
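
The following steps assume that hbase_table already exists in HBase. This topic does not show how the table was created. As a minimal sketch, assuming a column family named f (taken from the column mapping used in the steps) and one sample row chosen to match the query output in the last step, the table could be prepared from the master node as follows. The function name hbase_setup_commands is hypothetical, and the script calls the HBase shell only if the hbase client is on the PATH.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the HBase shell commands that create the
# sample table. The column family 'f' and the sample row are assumptions
# inferred from the column mapping and query output in this topic.
hbase_setup_commands() {
  cat <<'EOF'
create 'hbase_table', 'f'
put 'hbase_table', '1122', 'f:col1', 'hello'
scan 'hbase_table'
EOF
}

# Feed the commands to the HBase shell in non-interactive mode (-n),
# but only on a node where the hbase client is installed.
if command -v hbase >/dev/null 2>&1; then
  hbase_setup_commands | hbase shell -n
fi
```

No value is written to f:col2 in this sketch, so a Hive query on the mapped external table returns NULL for col2 in that row.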

  1. Open the Hive CLI.
    1. Log on to the master node of your EMR cluster in SSH mode. For more information, see Log on to a cluster.
    2. Run the following command to open the Hive CLI:
      hive
      If the following information is returned, the Hive CLI is opened:
      Logging initialized using configuration in file:/etc/ecm/hive-conf-2.3.5-2.0.3/hive-log4j2.properties Async: true
      Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
  2. Run the following command to create an external table named hbase_table in Hive and establish a mapping between the Hive external table and the HBase table:
    create external table hbase_table(key int,col1 string,col2 string)
    stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    with serdeproperties("hbase.columns.mapping" = ":key,f:col1,f:col2")
    tblproperties("hbase.table.name" = "hbase_table", "hbase.mapred.output.outputtable" = "hbase_table");
  3. Run the following command to query data in the hbase_table external table in Hive:
    select * from hbase_table;
    The following information is returned:
    OK
    1122  hello NULL
    Time taken: 2.201 seconds, Fetched: 1 row(s)
    Note The preceding information shows that the hbase_table external table contains the same data as the HBase table. This indicates that you have used Hive to access HBase data.