ApsaraDB for HBase V2.0 is an end-to-end NoSQL service for big data. It is suitable
for the storage, retrieval, and analysis of gigabytes to petabytes of data, and it
supports many core services of Alibaba Group, such as Taobao recommendations, Alipay
online payment, and risk control for Ant Credit Pay.
This topic describes how to use the serverless Spark engine of Data Lake Analytics
(DLA) to access Phoenix SQL of ApsaraDB for HBase V2.0.
Prerequisites
- Phoenix SQL of ApsaraDB for HBase V2.0 is activated.
- Object Storage Service (OSS) is activated. For more information, see Activate OSS.
- The security group ID and vSwitch ID that are used by the serverless Spark engine
of DLA to access Phoenix SQL of the ApsaraDB for HBase cluster are obtained. For more
information, see Access your VPC.
- The Classless Inter-Domain Routing (CIDR) block of the virtual private cloud (VPC)
from which you access the ApsaraDB for HBase cluster is added to a whitelist of the
ApsaraDB for HBase cluster in the ApsaraDB for HBase console.
- A table is created in Phoenix SQL of ApsaraDB for HBase V2.0 and data is inserted
into the table. In this topic, the us_population table is created. Sample statements:
-- Statement for creating a table:
CREATE TABLE IF NOT EXISTS us_population (
    state CHAR(2) NOT NULL,
    city VARCHAR NOT NULL,
    population BIGINT
    CONSTRAINT my_pk PRIMARY KEY (state, city));

-- Statements for inserting data into the table:
UPSERT INTO us_population VALUES('NY','New York',8143197);
UPSERT INTO us_population VALUES('CA','Los Angeles',3844829);
UPSERT INTO us_population VALUES('IL','Chicago',2842518);
UPSERT INTO us_population VALUES('TX','Houston',2016582);
UPSERT INTO us_population VALUES('PA','Philadelphia',1463281);
UPSERT INTO us_population VALUES('AZ','Phoenix',1461575);
UPSERT INTO us_population VALUES('TX','San Antonio',1256509);
UPSERT INTO us_population VALUES('CA','San Diego',1255540);
UPSERT INTO us_population VALUES('TX','Dallas',1213825);
UPSERT INTO us_population VALUES('CA','San Jose',912332);
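The UPSERT statements above all follow the same pattern. As an illustrative sketch only (the case class and helper below are hypothetical and not part of the sample code in this topic), they can be generated from a list of rows:

```scala
// Hypothetical helper: generates the UPSERT statements above from
// (state, city, population) rows of the us_population table.
case class Population(state: String, city: String, population: Long)

def upsertStatement(row: Population): String =
  s"UPSERT INTO us_population VALUES('${row.state}','${row.city}',${row.population});"

val rows = Seq(
  Population("NY", "New York", 8143197L),
  Population("CA", "Los Angeles", 3844829L)
)

// Print one UPSERT statement per row.
rows.map(upsertStatement).foreach(println)
```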
Procedure
- Write the following test code and add the dependency required for accessing Phoenix
SQL of ApsaraDB for HBase V2.0 to the pom.xml file. Then, package the test code and
dependency into a JAR file and upload this file to OSS.
Sample test code:
package com.aliyun.spark

import org.apache.spark.sql.SparkSession

object SparkOnHBase2xForPhoenix {
  def main(args: Array[String]): Unit = {
    // queryServerAddress: the URL of Phoenix SQL of the ApsaraDB for HBase cluster, in the format http://xxx:8765.
    val queryServerAddress = args(0)
    // The name of the table in Phoenix SQL. You must create this table in Phoenix SQL in advance.
    val phoenixTableName = args(1)
    // The name of the table in the serverless Spark engine of DLA.
    val sparkTableName = args(2)

    val sparkSession = SparkSession
      .builder()
      .appName("scala spark on Phoenix5.x test")
      .getOrCreate()

    // If the table already exists, drop it.
    sparkSession.sql(s"drop table if exists $sparkTableName")

    val driver = "org.apache.phoenix.queryserver.client.Driver"
    val url = "jdbc:phoenix:thin:url=" + queryServerAddress + ";serialization=PROTOBUF"
    val createCmd = "CREATE TABLE " +
      sparkTableName +
      " USING org.apache.spark.sql.jdbc\n" +
      "OPTIONS (\n" +
      "  'driver' '" + driver + "',\n" +
      "  'url' '" + url + "',\n" +
      "  'dbtable' '" + phoenixTableName + "',\n" +
      "  'fetchsize' '" + 100 + "'\n" +
      ")"
    println(" createCmd: \n" + createCmd)
    sparkSession.sql(createCmd)

    val querySql = "select * from " + sparkTableName + " limit 1"
    sparkSession.sql(querySql).show
    sparkSession.stop()
  }
}
Dependency in the pom.xml file of Phoenix SQL:
<dependency>
    <groupId>com.aliyun.phoenix</groupId>
    <artifactId>ali-phoenix-shaded-thin-client</artifactId>
    <version>5.2.2-HBase-2.x-SNAPSHOT</version>
</dependency>
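The sample code connects through the Phoenix Query Server with the thin client, so the JDBC URL embeds the query server address and the PROTOBUF serialization setting. As a minimal sketch (the helper name below is hypothetical; only the URL format comes from the sample code), the URL is assembled as follows:

```scala
// Sketch of the thin-client JDBC URL built in the sample code above.
// buildPhoenixThinUrl is a hypothetical helper name, not a Phoenix API.
def buildPhoenixThinUrl(queryServerAddress: String): String =
  s"jdbc:phoenix:thin:url=$queryServerAddress;serialization=PROTOBUF"

println(buildPhoenixThinUrl("http://hb-xxx-proxy-phoenix.hbase.rds.aliyuncs.com:8765"))
```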
- Log on to the DLA console.
- In the top navigation bar, select the region where Phoenix SQL of the ApsaraDB for
HBase cluster resides.
- In the left-side navigation pane, choose .
- On the Parameter Configuration page, click Create Job.
- In the Create Job dialog box, configure the parameters and click OK to create a Spark job.
- In the Job List navigation tree, click the Spark job that you created and enter the
following job content in the code editor. Replace the parameter values based on the
parameter descriptions in the comments. Then, click Save and Execute.
{
  "args": [
    "http://hb-xxx-proxy-phoenix.hbase.rds.aliyuncs.com:8765",  # The URL of Phoenix SQL of the ApsaraDB for HBase cluster. This can be the address of a load balancer or of a Phoenix Query Server.
    "us_population",  # The name of the table in Phoenix SQL.
    "spark_on_hbase2x_phoenix"  # The name of the Spark table that maps to the table in Phoenix SQL.
  ],
  "file": "oss://spark_test/jars/hbase2x/spark-examples-0.0.1-SNAPSHOT.jar",  # The OSS directory where the test code is saved.
  "name": "hbase2x-for-phoenix-test",
  "jars": [
    "oss://spark_test/jars/hbase2x/ali-phoenix-shaded-thin-client-5.2.2-HBase-2.x-SNAPSHOT.jar"  # The OSS directory where the JAR file that contains the dependency of the test code is saved.
  ],
  "className": "com.aliyun.spark.SparkOnHBase2xForPhoenix",
  "conf": {
    "spark.driver.resourceSpec": "small",  # The specifications of the Spark driver. Valid values: small, medium, large, and xlarge.
    "spark.executor.instances": 2,  # The number of Spark executors.
    "spark.executor.resourceSpec": "small",  # The specifications of each Spark executor. Valid values: small, medium, large, and xlarge.
    "spark.dla.eni.enable": "true",  # Specifies whether to enable an elastic network interface (ENI) for VPC access. To access data in the VPC, set this parameter to true.
    "spark.dla.eni.vswitch.id": "vsw-xxx",  # The ID of the vSwitch to which the ApsaraDB for HBase cluster belongs.
    "spark.dla.eni.security.group.id": "sg-xxx"  # The ID of the security group to which the ApsaraDB for HBase cluster belongs.
  }
}
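The ENI-related settings in the "conf" section travel together: as the parameter descriptions note, enabling spark.dla.eni.enable requires the vSwitch ID and the security group ID as well. A minimal sketch of that rule (a hypothetical check, not a DLA API):

```scala
// Hypothetical sanity check, not a DLA API: when spark.dla.eni.enable
// is "true", the vSwitch ID and the security group ID must also be set.
def eniConfComplete(conf: Map[String, String]): Boolean =
  conf.get("spark.dla.eni.enable") match {
    case Some("true") =>
      conf.contains("spark.dla.eni.vswitch.id") &&
        conf.contains("spark.dla.eni.security.group.id")
    case _ => true  // ENI disabled or unset: nothing else is required.
  }
```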
Result
After the job succeeds, find the job in the job list and click Log in the Operation
column to view its logs.