
E-MapReduce: Read data from and write data to HBase

Last Updated: Sep 19, 2025

EMR Serverless Spark uses the official HBase Spark connector to connect to HBase. You must add the required configurations during development to establish the connection. This topic describes how to read data from and write data to HBase in an EMR Serverless Spark environment.

Prerequisites

  • A Serverless Spark workspace is created. For more information, see Create a workspace.

  • An HBase cluster is created.

    This topic uses a custom cluster created in EMR on ECS as an example. The cluster includes the HBase service and is referred to as the EMR HBase cluster. For more information about how to create a cluster, see Create a cluster.

Limits

The operations described in this topic are supported only by the following Serverless Spark engine versions:

  • esr-4.x: esr-4.1.0 and later

  • esr-3.x: esr-3.1.0 and later

  • esr-2.x: esr-2.5.0 and later

Procedure

Step 1: Obtain the HBase Spark Connector JAR packages and upload them to OSS

Complete the following steps to obtain the required dependency packages based on the version compatibility requirements for Spark, Scala, Hadoop, and HBase. For more information, see the official HBase Spark Connector documentation.

  1. Compile and package the connector.

    Compile the HBase Spark Connector based on the versions of Spark, Scala, Hadoop, and HBase in your target environment. This process generates the following two core JAR packages:

    • hbase-spark-1.1.0-SNAPSHOT.jar

    • hbase-spark-protocol-shaded-1.1.0-SNAPSHOT.jar

      For example, you can use the following Maven command to compile and package the connector based on the specified versions.

      mvn -Dspark.version=3.4.2 -Dscala.version=2.12.10 -Dhadoop-three.version=3.2.0 -Dscala.binary.version=2.12 -Dhbase.version=2.4.9 clean package -DskipTests

      If your environment uses the same versions as listed above (Spark 3.4.2, Scala 2.12.10, Hadoop 3.2.0, and HBase 2.4.9), you can use the pre-compiled JAR packages directly.

  2. Obtain HBase dependencies. From the HBase installation directory, fetch the following dependency packages from the lib/shaded-clients and lib/client-facing-thirdparty folders. In this example, 2.4.9 is the HBase version number.

    • hbase-shaded-client-2.4.9.jar

    • hbase-shaded-mapreduce-2.4.9.jar

    • slf4j-log4j12-1.7.30.jar

  3. Upload the five JAR packages to Alibaba Cloud OSS. For more information about this operation, see Simple upload.
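
    If you prefer to upload the files by using the OSS Python SDK instead of the console, the following is a minimal sketch. The endpoint, bucket name, credentials, object paths, and local paths are placeholders that you must replace with your own values.

    import oss2

    # Placeholders: replace with your own credentials, endpoint, and bucket name.
    auth = oss2.Auth("<yourAccessKeyId>", "<yourAccessKeySecret>")
    bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "<bucketname>")

    # Upload each local JAR package to the target OSS path.
    for jar in [
        "hbase-spark-1.1.0-SNAPSHOT.jar",
        "hbase-spark-protocol-shaded-1.1.0-SNAPSHOT.jar",
        "hbase-shaded-client-2.4.9.jar",
        "hbase-shaded-mapreduce-2.4.9.jar",
        "slf4j-log4j12-1.7.30.jar",
    ]:
        bucket.put_object_from_file("spark/hbase/" + jar, "/local/path/to/" + jar)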

Step 2: Create a network connection

Serverless Spark requires network connectivity to the HBase cluster to access the HBase service. For more information about network connections, see Network connectivity between EMR Serverless Spark and other VPCs.

Important

When you add security group rules, set the port range to open the required ports. Valid port values range from 1 to 65535. In this example, you must open the ZooKeeper client port (2181), the HBase Master port (16000), and the HBase RegionServer port (16020).
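
To quickly confirm that the network connection works before you configure Spark, you can run a simple TCP check from a Notebook session that uses the network connection. This is only a sketch; the IP address below is a placeholder for the private IP address of your HBase cluster node.

import socket

def can_connect(host, port, timeout=3):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder private IP address; ports: ZooKeeper (2181), HBase Master (16000), RegionServer (16020).
for port in (2181, 16000, 16020):
    print(port, can_connect("192.168.0.1", port))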

Step 3: Create a table in the EMR HBase cluster

  1. Connect to the cluster using Secure Shell (SSH). For more information, see Log on to a cluster.

  2. Run the following command to connect to HBase.

    hbase shell
  3. Run the following command to create a test table.

    create 'hbase_table', 'c1', 'c2'
  4. Run the following command to write test data.

    put 'hbase_table', 'r1', 'c1:name', 'Alice'
    put 'hbase_table', 'r1', 'c1:age', '25'
    put 'hbase_table', 'r1', 'c2:city', 'New York'
    
    put 'hbase_table', 'r2', 'c1:name', 'Bob'
    put 'hbase_table', 'r2', 'c1:age', '30'
    put 'hbase_table', 'r2', 'c2:city', 'San Francisco'

Step 4: Read data from an HBase table using Serverless Spark

  1. Create a Notebook session. For more information, see Manage Notebook sessions.

    When you create the session, select an engine version that matches your HBase Spark Connector from the Engine Version drop-down list. For Network Connection, select the network connection that you created in Step 2. In the Spark Configuration section, add the following parameters to load the HBase Spark Connector.

    spark.jars                                        oss://<bucketname>/path/to/hbase-shaded-client-2.4.9.jar,oss://<bucketname>/path/to/hbase-shaded-mapreduce-2.4.9.jar,oss://<bucketname>/path/to/hbase-spark-1.1.0-SNAPSHOT.jar,oss://<bucketname>/path/to/hbase-spark-protocol-shaded-1.1.0-SNAPSHOT.jar,oss://<bucketname>/path/to/slf4j-log4j12-1.7.30.jar
    spark.hadoop.hbase.zookeeper.quorum               The private IP address of ZooKeeper
    spark.hadoop.hbase.zookeeper.property.clientPort  The service port of ZooKeeper

    The parameters are described as follows.

    Parameter: spark.jars
    Description: The paths of the external dependency JAR packages.
    Example: The five JAR files uploaded to OSS, for example, oss://<yourBucketname>/spark/hbase/hbase-shaded-client-2.4.9.jar.

    Parameter: spark.hadoop.hbase.zookeeper.quorum
    Description: The private IP address of ZooKeeper.
    Example:
      • If you use a different HBase cluster, specify the configuration as needed.
      • If you use an Alibaba Cloud EMR HBase cluster, you can find the private IP address of the master node on the Node Management page of the EMR HBase cluster.

    Parameter: spark.hadoop.hbase.zookeeper.property.clientPort
    Description: The service port of ZooKeeper.
    Example:
      • If you use a different HBase cluster, specify the configuration as needed.
      • If you use an Alibaba Cloud EMR HBase cluster, the port is 2181.
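
    If you later run the same logic as a batch task instead of a Notebook session, the equivalent settings can be supplied when the SparkSession is created. The following is a sketch only; the OSS paths and the IP address are placeholders, and in a Notebook session the spark object is already created with the session-level configuration, so this code is not needed there.

    from pyspark.sql import SparkSession

    # Placeholder OSS paths for the five JAR packages uploaded in Step 1.
    jars = ",".join([
        "oss://<bucketname>/path/to/hbase-shaded-client-2.4.9.jar",
        "oss://<bucketname>/path/to/hbase-shaded-mapreduce-2.4.9.jar",
        "oss://<bucketname>/path/to/hbase-spark-1.1.0-SNAPSHOT.jar",
        "oss://<bucketname>/path/to/hbase-spark-protocol-shaded-1.1.0-SNAPSHOT.jar",
        "oss://<bucketname>/path/to/slf4j-log4j12-1.7.30.jar",
    ])

    spark = (
        SparkSession.builder
        .appName("hbase-spark-example")
        .config("spark.jars", jars)
        .config("spark.hadoop.hbase.zookeeper.quorum", "192.168.0.1")  # placeholder private IP
        .config("spark.hadoop.hbase.zookeeper.property.clientPort", "2181")
        .getOrCreate()
    )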

  2. On the Data Development page, create a task of the Interactive Development > Notebook type. Then, in the upper-right corner, select the Notebook session that you created.

    For more information, see Manage Notebook sessions.

  3. Copy the following code into the new Notebook tab, modify the parameters as required, and then click Run.

    # Read the HBase table.
    df = spark.read.format("org.apache.hadoop.hbase.spark") \
        .option("hbase.columns.mapping", "id STRING :key, name STRING c1:name, age STRING c1:age, city STRING c2:city") \
        .option("hbase.table", "hbase_table") \
        .option("hbase.spark.pushdown.columnfilter", False) \
        .load()
    
    # Register a temporary view.
    df.createOrReplaceTempView("hbase_table_view")
    
    # Query data using SQL.
    results = spark.sql("SELECT * FROM hbase_table_view")
    results.show()
    

    If data is returned successfully, the configuration is correct.
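
    Because hbase.spark.pushdown.columnfilter is disabled in the read options, column filters are evaluated on the Spark side rather than pushed down to HBase. You can still filter the temporary view with SQL as usual. For example, with the sample data written in Step 3:

    # Filter the data in Spark; the age column is mapped as STRING, so cast it before comparing.
    spark.sql("SELECT name, city FROM hbase_table_view WHERE CAST(age AS INT) > 25").show()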


Step 5: Write data to an HBase table using Serverless Spark

In the same Notebook tab, copy the following code, modify the parameters as required, and then click Run.

from pyspark.sql.types import StructType, StructField, StringType

data = [
    ("r3", "sam", "26", "New York")
]

schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("age", StringType(), True),
    StructField("city", StringType(), True)
])
 
testDS = spark.createDataFrame(data=data, schema=schema)

testDS.write.format("org.apache.hadoop.hbase.spark") \
    .option("hbase.columns.mapping", "id STRING :key, name STRING c1:name, age STRING c1:age, city STRING c2:city") \
    .option("hbase.table", "hbase_table") \
    .save()

After the data is written, you can query the table to confirm that the data was written successfully.
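
For example, the following snippet re-reads the table with the same options as in Step 4 and filters for the new row key r3. This is a sketch; adjust the column mapping if yours differs.

# Re-read the HBase table and check that row r3 now exists.
verify_df = spark.read.format("org.apache.hadoop.hbase.spark") \
    .option("hbase.columns.mapping", "id STRING :key, name STRING c1:name, age STRING c1:age, city STRING c2:city") \
    .option("hbase.table", "hbase_table") \
    .option("hbase.spark.pushdown.columnfilter", False) \
    .load()

verify_df.filter(verify_df["id"] == "r3").show()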
