E-MapReduce: Read data from and write data to HBase

Last Updated: Mar 26, 2026

EMR Serverless Spark uses the official HBase Spark Connector to read from and write to HBase. This topic walks you through the full setup: preparing dependencies, establishing network connectivity, and running read/write operations from a Notebook.

Prerequisites

Before you begin, make sure you have:

  • A Serverless Spark workspace. See Create a workspace.

  • An HBase cluster with network connectivity to EMR Serverless Spark. This topic uses a custom cluster created in EMR on ECS (referred to as the EMR HBase cluster). See Create a cluster.

Supported engine versions

The HBase Spark Connector is supported on the following EMR Serverless Spark engine versions:

Engine series    Minimum version
esr-4.x          esr-4.1.0
esr-3.x          esr-3.1.0
esr-2.x          esr-2.5.0

Overview

Setting up HBase access from EMR Serverless Spark involves five steps:

  1. Get the HBase Spark Connector JAR packages and upload them to OSS.

  2. Create a network connection between Serverless Spark and the HBase cluster.

  3. Create a test table in the HBase cluster.

  4. Read data from HBase using Serverless Spark.

  5. Write data to HBase using Serverless Spark.

Step 1: Get the JAR packages and upload them to OSS

The HBase Spark Connector requires five JAR packages: two connector packages and three HBase dependency packages.

Get the connector packages

Compile the connector to match the Spark, Scala, Hadoop, and HBase versions in your environment. For version compatibility details, see the official HBase Spark Connector documentation.

Run the following Maven command from the root of the hbase-connectors source tree to compile the connector:

mvn -Dspark.version=3.4.2 -Dscala.version=2.12.10 -Dhadoop-three.version=3.2.0 -Dscala.binary.version=2.12 -Dhbase.version=2.4.9 clean package -DskipTests

This produces two JAR files:

  • hbase-spark-1.1.0-SNAPSHOT.jar

  • hbase-spark-protocol-shaded-1.1.0-SNAPSHOT.jar

If your environment uses Spark 3.4.2, Scala 2.12.10, Hadoop 3.2.0, and HBase 2.4.9, you can skip compilation and download the pre-compiled packages directly.

Get the HBase dependency packages

From the HBase installation directory, copy the following three files. The version number (2.4.9) must match your HBase version.

  • hbase-shaded-client-2.4.9.jar (from lib/shaded-clients)

  • hbase-shaded-mapreduce-2.4.9.jar (from lib/shaded-clients)

  • slf4j-log4j12-1.7.30.jar (from lib/client-facing-thirdparty)
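
If you want to script this step, the following is a minimal sketch that assumes the HBASE_HOME environment variable points at an HBase 2.4.9 installation; the staging folder name is illustrative:

import os
import shutil

hbase_home = os.environ["HBASE_HOME"]  # assumption: HBASE_HOME is set

jars = [
    "lib/shaded-clients/hbase-shaded-client-2.4.9.jar",
    "lib/shaded-clients/hbase-shaded-mapreduce-2.4.9.jar",
    "lib/client-facing-thirdparty/slf4j-log4j12-1.7.30.jar",
]

# Copy the three dependency JARs into a local staging folder for upload.
os.makedirs("hbase-jars", exist_ok=True)
for jar in jars:
    shutil.copy(os.path.join(hbase_home, jar), "hbase-jars/")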

Upload to OSS

Upload all five JAR packages to an OSS bucket. See Simple upload.
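
If you prefer to script the upload, here is a sketch using the OSS Python SDK (oss2); the credentials, endpoint, bucket name, and object prefix are placeholders:

import oss2

# Placeholders: replace with your own credentials, endpoint, and bucket.
auth = oss2.Auth("<AccessKeyId>", "<AccessKeySecret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "<bucket-name>")

jars = [
    "hbase-spark-1.1.0-SNAPSHOT.jar",
    "hbase-spark-protocol-shaded-1.1.0-SNAPSHOT.jar",
    "hbase-shaded-client-2.4.9.jar",
    "hbase-shaded-mapreduce-2.4.9.jar",
    "slf4j-log4j12-1.7.30.jar",
]
for jar in jars:
    # Objects land under spark/hbase/ in the bucket (an illustrative prefix).
    bucket.put_object_from_file(f"spark/hbase/{jar}", jar)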

Step 2: Create a network connection

Serverless Spark must reach the HBase cluster over the network. See Network connectivity between EMR Serverless Spark and other VPCs to set up the connection.

When configuring security group rules, open the following ports:

Service               Port
ZooKeeper             2181
HBase Master          16000
HBase RegionServer    16020

Alternatively, set Port Range to 1/65535 to cover all required ports at once.
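
To confirm connectivity before moving on, you can run a quick reachability check from a Notebook cell once the network connection is attached. This is an optional sketch; the IP address is a placeholder for your ZooKeeper node's private IP, and in a multi-node cluster the Master and RegionServers may run on different hosts:

import socket

zk_host = "192.168.0.1"  # placeholder: your ZooKeeper node's private IP

# Try each service port; connect_ex returns 0 when the port is reachable.
for service, port in [("ZooKeeper", 2181), ("HBase Master", 16000), ("HBase RegionServer", 16020)]:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(5)
        status = "reachable" if s.connect_ex((zk_host, port)) == 0 else "unreachable"
        print(f"{service} ({port}): {status}")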

Step 3: Create a test table in the HBase cluster

  1. Connect to the EMR HBase cluster using Secure Shell (SSH). See Log on to a cluster.

  2. Open the HBase shell:

    hbase shell
  3. Create a table with two column families (c1 and c2):

    create 'hbase_table', 'c1', 'c2'
  4. Insert test records:

    put 'hbase_table', 'r1', 'c1:name', 'Alice'
    put 'hbase_table', 'r1', 'c1:age', '25'
    put 'hbase_table', 'r1', 'c2:city', 'New York'
    
    put 'hbase_table', 'r2', 'c1:name', 'Bob'
    put 'hbase_table', 'r2', 'c1:age', '30'
    put 'hbase_table', 'r2', 'c2:city', 'San Francisco'

Step 4: Read data from HBase

Create a Notebook session

  1. Create a Notebook session. See Manage Notebook sessions. When creating the session, configure the following:

    • Engine Version: Select the version that matches your HBase Spark Connector.

    • Network Connection: Select the network connection created in Step 2.

    • Spark Configuration: Add the following parameters to load the connector.

    • spark.jars: Comma-separated OSS paths to the five JAR packages. Example: oss://my-bucket/spark/hbase/hbase-shaded-client-2.4.9.jar

    • spark.hadoop.hbase.zookeeper.quorum: Private IP address of the ZooKeeper node. For an EMR HBase cluster, find this on the Node Management page under Private IP of the master node. Example: 192.168.x.x

    • spark.hadoop.hbase.zookeeper.property.clientPort: ZooKeeper service port. For an EMR HBase cluster, the default is 2181. Example: 2181

    Replace the placeholders with your values:

    spark.jars                                        oss://<bucket-name>/path/to/hbase-shaded-client-2.4.9.jar,oss://<bucket-name>/path/to/hbase-shaded-mapreduce-2.4.9.jar,oss://<bucket-name>/path/to/hbase-spark-1.1.0-SNAPSHOT.jar,oss://<bucket-name>/path/to/hbase-spark-protocol-shaded-1.1.0-SNAPSHOT.jar,oss://<bucket-name>/path/to/slf4j-log4j12-1.7.30.jar
    spark.hadoop.hbase.zookeeper.quorum               <ZooKeeper private IP>
    spark.hadoop.hbase.zookeeper.property.clientPort  <ZooKeeper port>
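
    Once the session starts, you can sanity-check from a Notebook cell that it picked up these settings. This is an optional sketch; the keys are the ones configured above:

    # Print the HBase-related settings the session was started with.
    conf = spark.sparkContext.getConf()
    for key in ("spark.hadoop.hbase.zookeeper.quorum",
                "spark.hadoop.hbase.zookeeper.property.clientPort",
                "spark.jars"):
        print(key, "=", conf.get(key, "NOT SET"))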

Run the read query

  1. On the Data Development page, create a task of the Interactive Development > Notebook type, then select the session you created. For details, see Manage Notebook sessions.

  2. Paste the following code into a Notebook cell and click Run:

    # Load the HBase table into a Spark DataFrame.
    # hbase.columns.mapping syntax: <spark_col> <type> <cf>:<qualifier>
    # Use :key for the row key column (no column family).
    df = spark.read.format("org.apache.hadoop.hbase.spark") \
        .option("hbase.columns.mapping",
                "id STRING :key, name STRING c1:name, age STRING c1:age, city STRING c2:city") \
        .option("hbase.table", "hbase_table") \
        .option("hbase.spark.pushdown.columnfilter", False) \
        .load()
    
    # Register a temporary view for SQL queries.
    df.createOrReplaceTempView("hbase_table_view")
    
    # Query all rows.
    results = spark.sql("SELECT * FROM hbase_table_view")
    results.show()

    A successful read returns output similar to:

    +---+-----+---+-------------+
    | id| name|age|         city|
    +---+-----+---+-------------+
    | r1|Alice| 25|     New York|
    | r2|  Bob| 30|San Francisco|
    +---+-----+---+-------------+
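
    The temporary view accepts ordinary Spark SQL predicates as well. For example, you can look up a single row by its row key; whether the filter is applied by HBase or by Spark after the scan depends on the connector version and pushdown settings (column-filter pushdown is disabled above):

    # Look up one row by its row key (the id column maps to :key).
    spark.sql("SELECT name, city FROM hbase_table_view WHERE id = 'r2'").show()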

Understanding `hbase.columns.mapping`

Each entry follows the pattern <spark_column> <type> <column_family>:<qualifier>. The row key maps to :key (no column family):

Mapping entry          Spark column   Type     HBase column family   HBase qualifier
id STRING :key         id             STRING   — (row key)           —
name STRING c1:name    name           STRING   c1                    name
age STRING c1:age      age            STRING   c1                    age
city STRING c2:city    city           STRING   c2                    city
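
To confirm the mapping took effect, you can inspect the schema of the DataFrame loaded in Step 4; each mapping entry should appear as one column with its declared type:

# Each entry in hbase.columns.mapping becomes one column in the schema.
df.printSchema()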

Step 5: Write data to HBase

In the same Notebook, paste the following code into a new cell and click Run:

from pyspark.sql.types import StructType, StructField, StringType

# Define the row to write. Schema must match the column mapping used in Step 4.
data = [("r3", "sam", "26", "New York")]

schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("age", StringType(), True),
    StructField("city", StringType(), True)
])

testDS = spark.createDataFrame(data=data, schema=schema)

# Write to HBase using the same column mapping.
testDS.write \
    .format("org.apache.hadoop.hbase.spark") \
    .option("hbase.columns.mapping",
            "id STRING :key, name STRING c1:name, age STRING c1:age, city STRING c2:city") \
    .option("hbase.table", "hbase_table") \
    .save()

After the write completes, query the table again to confirm that the new row was stored.
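
For example, re-run the query from Step 4 in the same Notebook session; the temporary view scans HBase again on each query, so the new row appears without re-registering the view:

# Look up the row written above (assumes the Step 4 cell ran in this session).
spark.sql("SELECT * FROM hbase_table_view WHERE id = 'r3'").show()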
