EMR Serverless Spark uses the official HBase Spark Connector to read from and write to HBase. This topic walks you through the full setup: preparing dependencies, establishing network connectivity, and running read/write operations from a Notebook.
Prerequisites
Before you begin, make sure you have:
A Serverless Spark workspace. See Create a workspace.
An HBase cluster with network connectivity to EMR Serverless Spark. This topic uses a custom cluster created in EMR on ECS (referred to as the EMR HBase cluster). See Create a cluster.
Supported engine versions
The HBase Spark Connector is supported on the following EMR Serverless Spark DPI engine versions:
| Engine series | Minimum version |
|---|---|
| esr-4.x | esr-4.1.0 |
| esr-3.x | esr-3.1.0 |
| esr-2.x | esr-2.5.0 |
Overview
Setting up HBase access from EMR Serverless Spark involves five steps:
Get the HBase Spark Connector JAR packages and upload them to OSS.
Create a network connection between Serverless Spark and the HBase cluster.
Create a test table in the HBase cluster.
Read data from HBase using Serverless Spark.
Write data to HBase using Serverless Spark.
Step 1: Get the JAR packages and upload them to OSS
The HBase Spark Connector requires five JAR packages: two connector packages and three HBase dependency packages.
Get the connector packages
Compile the connector to match the Spark, Scala, Hadoop, and HBase versions in your environment. For version compatibility details, see the official HBase Spark Connector documentation.
Run the following Maven command to compile the connector:
```shell
mvn -Dspark.version=3.4.2 -Dscala.version=2.12.10 -Dhadoop-three.version=3.2.0 -Dscala.binary.version=2.12 -Dhbase.version=2.4.9 clean package -DskipTests
```

This produces two JAR files:

- hbase-spark-1.1.0-SNAPSHOT.jar
- hbase-spark-protocol-shaded-1.1.0-SNAPSHOT.jar
If your environment uses Spark 3.4.2, Scala 2.12.10, Hadoop 3.2.0, and HBase 2.4.9, you can download the pre-compiled packages directly instead of compiling them yourself.
Get the HBase dependency packages
From the HBase installation directory, copy the following three files from the lib/shaded-clients and lib/client-facing-thirdparty folders. The version number (2.4.9) must match your HBase version.
- hbase-shaded-client-2.4.9.jar
- hbase-shaded-mapreduce-2.4.9.jar
- slf4j-log4j12-1.7.30.jar
Upload to OSS
Upload all five JAR packages to an OSS bucket. See Simple upload.
Step 2: Create a network connection
Serverless Spark must reach the HBase cluster over the network. See Network connectivity between EMR Serverless Spark and other VPCs to set up the connection.
When configuring security group rules, open the following ports:
| Service | Port |
|---|---|
| ZooKeeper | 2181 |
| HBase Master | 16000 |
| HBase RegionServer | 16020 |
Alternatively, set Port Range to 1/65535 to cover all required ports with a single rule.
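Before launching Spark jobs, a quick TCP reachability check can confirm that the security group rules took effect. The following is a minimal Python sketch; the IP address is a placeholder for your ZooKeeper or HBase node:

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Check the HBase-related ports on the cluster node (placeholder IP).
for port in (2181, 16000, 16020):
    state = "open" if port_open("192.168.0.10", port) else "closed/unreachable"
    print(port, state)
```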
Step 3: Create a test table in the HBase cluster
Connect to the EMR HBase cluster using Secure Shell (SSH). See Log on to a cluster.
Open the HBase shell:

```shell
hbase shell
```

Create a table with two column families (`c1` and `c2`):

```shell
create 'hbase_table', 'c1', 'c2'
```

Insert test records:

```shell
put 'hbase_table', 'r1', 'c1:name', 'Alice'
put 'hbase_table', 'r1', 'c1:age', '25'
put 'hbase_table', 'r1', 'c2:city', 'New York'
put 'hbase_table', 'r2', 'c1:name', 'Bob'
put 'hbase_table', 'r2', 'c1:age', '30'
put 'hbase_table', 'r2', 'c2:city', 'San Francisco'
```
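Each `put` writes a single cell addressed by row key and `column_family:qualifier`. To make that layout concrete, the following Python sketch (a hypothetical helper, not part of the connector) generates the same `put` statements from structured rows:

```python
# Rows expressed as (row_key, {"cf:qualifier": value, ...}).
rows = [
    ("r1", {"c1:name": "Alice", "c1:age": "25", "c2:city": "New York"}),
    ("r2", {"c1:name": "Bob", "c1:age": "30", "c2:city": "San Francisco"}),
]

def to_put_statements(table, rows):
    """Expand structured rows into one HBase shell `put` per cell."""
    stmts = []
    for row_key, cells in rows:
        for column, value in cells.items():
            stmts.append(f"put '{table}', '{row_key}', '{column}', '{value}'")
    return stmts

for stmt in to_put_statements("hbase_table", rows):
    print(stmt)
```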
Step 4: Read data from HBase
Create a Notebook session
Create a Notebook session. See Manage Notebook sessions. When creating the session, configure the following:
Engine Version: Select the version that matches your HBase Spark Connector.
Network Connection: Select the network connection created in Step 2.
Spark Configuration: Add the following parameters to load the connector.
| Parameter | Description | Example |
|---|---|---|
| `spark.jars` | Comma-separated OSS paths to the five JAR packages | `oss://my-bucket/spark/hbase/hbase-shaded-client-2.4.9.jar` |
| `spark.hadoop.hbase.zookeeper.quorum` | Private IP address of the ZooKeeper node. For an EMR HBase cluster, find this on the Node Management page under Private IP of the master node. | `192.168.x.x` |
| `spark.hadoop.hbase.zookeeper.property.clientPort` | ZooKeeper service port. For an EMR HBase cluster, the default is 2181. | `2181` |

For example:

```
spark.jars oss://<bucket-name>/path/to/hbase-shaded-client-2.4.9.jar,oss://<bucket-name>/path/to/hbase-shaded-mapreduce-2.4.9.jar,oss://<bucket-name>/path/to/hbase-spark-1.1.0-SNAPSHOT.jar,oss://<bucket-name>/path/to/hbase-spark-protocol-shaded-1.1.0-SNAPSHOT.jar,oss://<bucket-name>/path/to/slf4j-log4j12-1.7.30.jar
spark.hadoop.hbase.zookeeper.quorum <ZooKeeper private IP>
spark.hadoop.hbase.zookeeper.property.clientPort <ZooKeeper port>
```

Replace the placeholders with your values.
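The `spark.jars` value is a single comma-separated string, which is easy to get wrong by hand. As a sketch, you can assemble it programmatically; the bucket name and path prefix below are placeholders:

```python
def build_spark_jars(bucket, prefix, jars):
    """Join JAR file names into the comma-separated oss:// list expected by spark.jars."""
    return ",".join(f"oss://{bucket}/{prefix}/{jar}" for jar in jars)

jars = [
    "hbase-shaded-client-2.4.9.jar",
    "hbase-shaded-mapreduce-2.4.9.jar",
    "hbase-spark-1.1.0-SNAPSHOT.jar",
    "hbase-spark-protocol-shaded-1.1.0-SNAPSHOT.jar",
    "slf4j-log4j12-1.7.30.jar",
]
print(build_spark_jars("my-bucket", "spark/hbase", jars))
```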
Run the read query
On the Data Development page, create a task of the Interactive Development > Notebook type, then select the session you created. For details, see Manage Notebook sessions.
Paste the following code into a Notebook cell and click Run:
```python
# Load the HBase table into a Spark DataFrame.
# hbase.columns.mapping syntax: <spark_col> <type> <cf>:<qualifier>
# Use :key for the row key column (no column family).
df = spark.read.format("org.apache.hadoop.hbase.spark") \
    .option("hbase.columns.mapping",
            "id STRING :key, name STRING c1:name, age STRING c1:age, city STRING c2:city") \
    .option("hbase.table", "hbase_table") \
    .option("hbase.spark.pushdown.columnfilter", False) \
    .load()

# Register a temporary view for SQL queries.
df.createOrReplaceTempView("hbase_table_view")

# Query all rows.
results = spark.sql("SELECT * FROM hbase_table_view")
results.show()
```

A successful read returns output similar to:

```
+---+-----+---+-------------+
| id| name|age|         city|
+---+-----+---+-------------+
| r1|Alice| 25|     New York|
| r2|  Bob| 30|San Francisco|
+---+-----+---+-------------+
```
Understanding `hbase.columns.mapping`
Each entry follows the pattern <spark_column> <type> <column_family>:<qualifier>. The row key maps to :key (no column family):
| Mapping entry | Spark column | Type | HBase column family | HBase qualifier |
|---|---|---|---|---|
| `id STRING :key` | id | STRING | — (row key) | — |
| `name STRING c1:name` | name | STRING | c1 | name |
| `age STRING c1:age` | age | STRING | c1 | age |
| `city STRING c2:city` | city | STRING | c2 | city |
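To make the mapping syntax concrete, here is a small Python sketch (a hypothetical parser for illustration, not connector code) that splits an `hbase.columns.mapping` string into its parts:

```python
def parse_mapping(mapping):
    """Parse 'col TYPE cf:qualifier' entries; ':key' marks the row key."""
    entries = []
    for part in mapping.split(","):
        col, col_type, hbase_col = part.strip().split(" ")
        if hbase_col == ":key":
            entries.append((col, col_type, None, None))  # row key: no column family
        else:
            cf, qualifier = hbase_col.split(":")
            entries.append((col, col_type, cf, qualifier))
    return entries

mapping = "id STRING :key, name STRING c1:name, age STRING c1:age, city STRING c2:city"
for entry in parse_mapping(mapping):
    print(entry)
```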
Step 5: Write data to HBase
In the same Notebook, paste the following code into a new cell and click Run:
```python
from pyspark.sql.types import StructType, StructField, StringType

# Define the row to write. The schema must match the column mapping used in Step 4.
data = [("r3", "sam", "26", "New York")]
schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("age", StringType(), True),
    StructField("city", StringType(), True)
])
testDS = spark.createDataFrame(data=data, schema=schema)

# Write to HBase using the same column mapping.
testDS.write \
    .format("org.apache.hadoop.hbase.spark") \
    .option("hbase.columns.mapping",
            "id STRING :key, name STRING c1:name, age STRING c1:age, city STRING c2:city") \
    .option("hbase.table", "hbase_table") \
    .save()
```

After the data is written, query the table again as in Step 4 to confirm that the new row was stored.
