EMR Serverless Spark connects to HBase through the official HBase Spark Connector. You must add the required configurations during development to establish the connection. This topic describes how to read data from and write data to HBase in an EMR Serverless Spark environment.
Prerequisites
A Serverless Spark workspace is created. For more information, see Create a workspace.
An HBase cluster is created.
This topic uses a custom cluster created in EMR on ECS as an example. The cluster includes the HBase service and is referred to as the EMR HBase cluster. For more information about how to create a cluster, see Create a cluster.
Limits
The operations described in this topic are supported only by the following Serverless Spark engine versions:
esr-4.x: esr-4.1.0 and later
esr-3.x: esr-3.1.0 and later
esr-2.x: esr-2.5.0 and later
Procedure
Step 1: Get the HBase Spark Connector JAR packages and upload them to OSS
Complete the following steps to obtain the required dependency packages based on the version compatibility requirements for Spark, Scala, Hadoop, and HBase. For more information, see the official HBase Spark Connector documentation.
Compile and package the connector.
Compile the HBase Spark Connector based on the versions of Spark, Scala, Hadoop, and HBase in your target environment. This process generates the following two core JAR packages:
hbase-spark-1.1.0-SNAPSHOT.jar
hbase-spark-protocol-shaded-1.1.0-SNAPSHOT.jar

For example, you can use the following Maven command to compile and package the connector for the specified versions:

mvn -Dspark.version=3.4.2 -Dscala.version=2.12.10 -Dhadoop-three.version=3.2.0 -Dscala.binary.version=2.12 -Dhbase.version=2.4.9 clean package -DskipTests

If your environment uses the same versions as listed above (Spark 3.4.2, Scala 2.12.10, Hadoop 3.2.0, and HBase 2.4.9), you can directly use the pre-compiled JAR packages.
Obtain HBase dependencies.
From the HBase installation directory, fetch the following dependency packages from the lib/shaded-clients and lib/client-facing-thirdparty folders. In this example, 2.4.9 is the HBase version number.
hbase-shaded-client-2.4.9.jar
hbase-shaded-mapreduce-2.4.9.jar
slf4j-log4j12-1.7.30.jar
Upload the five JAR packages to Alibaba Cloud OSS. For more information about this operation, see Simple upload.
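If you prefer to script the upload, the following is a minimal sketch that uses the oss2 Python SDK. The access key values, the endpoint, the bucket name, and the spark/hbase/ destination path are placeholders; replace them with your own values.

import oss2

# Placeholder credentials and bucket details; replace with your own values.
auth = oss2.Auth("<yourAccessKeyId>", "<yourAccessKeySecret>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "<yourBucketname>")

# The five JAR packages obtained in the previous steps.
jars = [
    "hbase-spark-1.1.0-SNAPSHOT.jar",
    "hbase-spark-protocol-shaded-1.1.0-SNAPSHOT.jar",
    "hbase-shaded-client-2.4.9.jar",
    "hbase-shaded-mapreduce-2.4.9.jar",
    "slf4j-log4j12-1.7.30.jar",
]
for jar in jars:
    # Upload each local JAR file to the spark/hbase/ prefix in the bucket.
    bucket.put_object_from_file(f"spark/hbase/{jar}", jar)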
Step 2: Create a network connection
Serverless Spark requires network connectivity to the HBase cluster to access the HBase service. For more information about network connections, see Network connectivity between EMR Serverless Spark and other VPCs.
When you add security group rules, set Port Range to the ports that you need to open. Valid values are 1 to 65535. This example requires you to open the ZooKeeper service port (2181), the HBase Master port (16000), and the HBase RegionServer port (16020).
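Optionally, you can verify reachability with a minimal socket test before running any Spark jobs, for example from a notebook session that uses this network connection (see Step 4). This sketch assumes that ZooKeeper, the HBase Master, and a RegionServer are all reachable at the same private IP address, which holds for a small test cluster; the address below is a placeholder.

import socket

# Placeholder private IP address of the HBase/ZooKeeper node; replace with your own value.
host = "192.168.0.1"

# The ports opened above: ZooKeeper, HBase Master, HBase RegionServer.
for port in (2181, 16000, 16020):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(5)
        # connect_ex returns 0 when the TCP connection succeeds.
        result = s.connect_ex((host, port))
        print(port, "reachable" if result == 0 else "unreachable")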
Step 3: Create a table in the EMR HBase cluster
Connect to the cluster using Secure Shell (SSH). For more information, see Log on to a cluster.
Run the following command to connect to HBase.
hbase shell

Run the following command to create a test table:

create 'hbase_table', 'c1', 'c2'

Run the following commands to write test data:

put 'hbase_table', 'r1', 'c1:name', 'Alice'
put 'hbase_table', 'r1', 'c1:age', '25'
put 'hbase_table', 'r1', 'c2:city', 'New York'
put 'hbase_table', 'r2', 'c1:name', 'Bob'
put 'hbase_table', 'r2', 'c1:age', '30'
put 'hbase_table', 'r2', 'c2:city', 'San Francisco'
Step 4: Read data from an HBase table using Serverless Spark
Create a Notebook session. For more information, see Manage Notebook sessions.
When you create the session, select an engine version that matches your HBase Spark Connector from the Engine Version drop-down list. For Network Connection, select the network connection that you created in Step 2. In the Spark Configuration section, add the following parameters to load the HBase Spark Connector.
spark.jars oss://<bucketname>/path/to/hbase-shaded-client-2.4.9.jar,oss://<bucketname>/path/to/hbase-shaded-mapreduce-2.4.9.jar,oss://<bucketname>/path/to/hbase-spark-1.1.0-SNAPSHOT.jar,oss://<bucketname>/path/to/hbase-spark-protocol-shaded-1.1.0-SNAPSHOT.jar,oss://<bucketname>/path/to/slf4j-log4j12-1.7.30.jar
spark.hadoop.hbase.zookeeper.quorum <The private IP address of ZooKeeper>
spark.hadoop.hbase.zookeeper.property.clientPort <The service port of ZooKeeper>

The following table describes the parameters.
Parameter: spark.jars
Description: The paths of the external dependency JAR packages.
Example: The five files uploaded to OSS, such as oss://<yourBucketname>/spark/hbase/hbase-shaded-client-2.4.9.jar.

Parameter: spark.hadoop.hbase.zookeeper.quorum
Description: The private IP address of ZooKeeper. If you use a different HBase cluster, specify the configuration as needed.
Example: If you use an Alibaba Cloud EMR HBase cluster, you can find the Private IP of the master node on the Node Management page of the EMR HBase cluster.

Parameter: spark.hadoop.hbase.zookeeper.property.clientPort
Description: The service port of ZooKeeper. If you use a different HBase cluster, specify the configuration as needed.
Example: If you use an Alibaba Cloud EMR HBase cluster, the port is 2181.
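For example, if the JAR packages were uploaded to a bucket named mybucket under the spark/hbase/ path and the ZooKeeper node has the private IP address 192.168.0.10 (both hypothetical values), the Spark Configuration section would look as follows.

spark.jars oss://mybucket/spark/hbase/hbase-shaded-client-2.4.9.jar,oss://mybucket/spark/hbase/hbase-shaded-mapreduce-2.4.9.jar,oss://mybucket/spark/hbase/hbase-spark-1.1.0-SNAPSHOT.jar,oss://mybucket/spark/hbase/hbase-spark-protocol-shaded-1.1.0-SNAPSHOT.jar,oss://mybucket/spark/hbase/slf4j-log4j12-1.7.30.jar
spark.hadoop.hbase.zookeeper.quorum 192.168.0.10
spark.hadoop.hbase.zookeeper.property.clientPort 2181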
On the Data Development page, create a task of the Notebook type. Then, in the upper-right corner, select the Notebook session that you created.
For more information, see Manage Notebook sessions.
Copy the following code into the new Notebook tab, modify the parameters as required, and then click Run.
# Read the HBase table.
df = spark.read.format("org.apache.hadoop.hbase.spark") \
    .option("hbase.columns.mapping", "id STRING :key, name STRING c1:name, age STRING c1:age, city STRING c2:city") \
    .option("hbase.table", "hbase_table") \
    .option("hbase.spark.pushdown.columnfilter", False) \
    .load()

# Register a temporary view.
df.createOrReplaceTempView("hbase_table_view")

# Query the data using SQL.
results = spark.sql("SELECT * FROM hbase_table_view")
results.show()

If data is returned, the configuration is correct.
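Because the temporary view behaves like any other Spark SQL table, you can also run filtered queries against it. For example, the following statement returns only the row for Alice:

# Query only the rows whose c1:name column equals 'Alice'.
spark.sql("SELECT id, name, city FROM hbase_table_view WHERE name = 'Alice'").show()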

Step 5: Write data to an HBase table using Serverless Spark
In the same Notebook tab, copy the following code, modify the parameters as required, and then click Run.
from pyspark.sql.types import StructType, StructField, StringType
# A test row to append to the HBase table.
data = [
    ("r3", "sam", "26", "New York")
]
schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("age", StringType(), True),
    StructField("city", StringType(), True)
])
testDS = spark.createDataFrame(data=data, schema=schema)

# Write the DataFrame to HBase with the same column mapping used for reading.
testDS.write.format("org.apache.hadoop.hbase.spark") \
    .option("hbase.columns.mapping", "id STRING :key, name STRING c1:name, age STRING c1:age, city STRING c2:city") \
    .option("hbase.table", "hbase_table") \
    .save()
After the data is written, query the table to confirm that the write succeeded.
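As a quick check, the following sketch re-runs the read from Step 4 and filters for the new row key r3; it reuses the same options shown earlier.

# Read the table back and check that row r3 is present.
verify_df = spark.read.format("org.apache.hadoop.hbase.spark") \
    .option("hbase.columns.mapping", "id STRING :key, name STRING c1:name, age STRING c1:age, city STRING c2:city") \
    .option("hbase.table", "hbase_table") \
    .option("hbase.spark.pushdown.columnfilter", False) \
    .load()
verify_df.filter(verify_df.id == "r3").show()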
