ApsaraDB for HBase V2.0 is an end-to-end NoSQL service for big data. It is suitable for the storage, retrieval, and analysis of data at scales from gigabytes to petabytes. ApsaraDB for HBase V2.0 supports many core services of Alibaba Group, such as Taobao recommendations, Alipay online payment, and risk control for Ant Credit Pay. This topic describes how to use the serverless Spark engine of Data Lake Analytics (DLA) to access Phoenix SQL of ApsaraDB for HBase V2.0.

Prerequisites

  • Phoenix SQL of ApsaraDB for HBase V2.0 is activated.
  • Object Storage Service (OSS) is activated. For more information, see Activate OSS.
  • The security group ID and vSwitch ID that are used by the serverless Spark engine of DLA to access Phoenix SQL of the ApsaraDB for HBase cluster are obtained. For more information, see Access your VPC.
  • The Classless Inter-Domain Routing (CIDR) block of the virtual private cloud (VPC) from which you access the ApsaraDB for HBase cluster is added to a whitelist of the ApsaraDB for HBase cluster in the ApsaraDB for HBase console.
  • A table is created in Phoenix SQL of ApsaraDB for HBase V2.0, and data is inserted into the table. This topic uses the us_population table as an example; a plain JDBC connectivity check is sketched after this list. Sample statements:
    -- Statement for creating the table:
    CREATE TABLE IF NOT EXISTS us_population (
      state CHAR(2) NOT NULL,
      city VARCHAR NOT NULL,
      population BIGINT
      CONSTRAINT my_pk PRIMARY KEY (state, city));
    -- Statements for inserting data into the table:
    UPSERT INTO us_population VALUES('NY','New York',8143197);
    UPSERT INTO us_population VALUES('CA','Los Angeles',3844829);
    UPSERT INTO us_population VALUES('IL','Chicago',2842518);
    UPSERT INTO us_population VALUES('TX','Houston',2016582);
    UPSERT INTO us_population VALUES('PA','Philadelphia',1463281);
    UPSERT INTO us_population VALUES('AZ','Phoenix',1461575);
    UPSERT INTO us_population VALUES('TX','San Antonio',1256509);
    UPSERT INTO us_population VALUES('CA','San Diego',1255540);
    UPSERT INTO us_population VALUES('TX','Dallas',1213825);
    UPSERT INTO us_population VALUES('CA','San Jose',912332);
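
Before you submit a Spark job, you can check that the Phoenix Query Server is reachable and that the table exists by running a plain JDBC query with the shaded thin client. The following code is a minimal sketch, not part of the official sample; the object name PhoenixConnectivityCheck and the endpoint URL are placeholders that you must replace with values for your cluster.

    package com.aliyun.spark

    import java.sql.DriverManager

    object PhoenixConnectivityCheck {
      def main(args: Array[String]): Unit = {
        // Register the thin-client JDBC driver that the Spark test code also uses.
        Class.forName("org.apache.phoenix.queryserver.client.Driver")
        // Placeholder endpoint; replace xxx with the address of your Phoenix Query Server.
        val url = "jdbc:phoenix:thin:url=http://xxx:8765;serialization=PROTOBUF"
        val conn = DriverManager.getConnection(url)
        try {
          val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM us_population")
          while (rs.next()) println(s"us_population rows: ${rs.getLong(1)}")
        } finally {
          conn.close()
        }
      }
    }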

Procedure

  1. Write the following test code, and add to your pom.xml file the dependency that is required to access Phoenix SQL of ApsaraDB for HBase V2.0. Then, package the test code into a JAR file and upload it, together with the dependency JAR file, to OSS. An equivalent DataFrame-API read is sketched after this procedure.
    Sample test code:
    package com.aliyun.spark
    
    import org.apache.spark.sql.SparkSession
    
    object SparkOnHBase2xForPhoenix {
    
      def main(args: Array[String]): Unit = {
        // queryServerAddress is the URL of Phoenix SQL of the ApsaraDB for HBase cluster. The URL is in the format of http://xxx:8765.
        val queryServerAddress = args(0)
        // The name of the table in Phoenix SQL. You must create a table in Phoenix SQL in advance.
        val phoenixTableName = args(1)
        // The name of the table in the serverless Spark engine of DLA.
        val sparkTableName = args(2)
    
        val sparkSession = SparkSession
          .builder()
          .appName("scala spark on Phoenix5.x test")
          .getOrCreate()
    
        // Drop the Spark table if it already exists.
        sparkSession.sql(s"drop table if exists $sparkTableName")
    
        val driver = "org.apache.phoenix.queryserver.client.Driver"
        val url = "jdbc:phoenix:thin:url=" + queryServerAddress + ";serialization=PROTOBUF"
        val createCmd =
          s"""CREATE TABLE $sparkTableName
             |USING org.apache.spark.sql.jdbc
             |OPTIONS (
             |  'driver' '$driver',
             |  'url' '$url',
             |  'dbtable' '$phoenixTableName',
             |  'fetchsize' '100'
             |)""".stripMargin
        println(" createCmd: \n" + createCmd)
        sparkSession.sql(createCmd)
        val querySql = "select * from " + sparkTableName + " limit 1"
        sparkSession.sql(querySql).show
        sparkSession.stop()
      }
    }
    Dependency to declare in the pom.xml file for Phoenix SQL:
            <dependency>
                <groupId>com.aliyun.phoenix</groupId>
                <artifactId>ali-phoenix-shaded-thin-client</artifactId>
                <version>5.2.2-HBase-2.x-SNAPSHOT</version>
            </dependency>
  2. Log on to the DLA console.
  3. In the top navigation bar, select the region where Phoenix SQL of the ApsaraDB for HBase cluster resides.
  4. In the left-side navigation pane, choose Serverless Spark > Submit job.
  5. On the Parameter Configuration page, click Create Job.
  6. In the Create Job dialog box, configure the parameters and click OK to create a Spark job.
  7. In the Job List navigation tree, click the Spark job that you created and enter the following job configuration in the code editor. Replace the parameter values based on the following parameter descriptions. Then, click Save and Execute.
    {
        "args": [
            "http://hb-xxx-proxy-phoenix.hbase.rds.aliyuncs.com:8765",  # The URL of Phoenix SQL of the ApsaraDB for HBase cluster. The URL can be accessed from a load balancer or a Phoenix Query server.
            "us_population", # The name of the table in Phoenix SQL.
            "spark_on_hbase2x_phoenix"  # The name of the Spark table that maps to the table in Phoenix SQL.
        ],
        "file": "oss://spark_test/jars/hbase2x/spark-examples-0.0.1-SNAPSHOT.jar",  # The OSS directory where the test code is saved.
        "name": "hbase2x-for-phoenix-test",
        "jars": [
            "oss://spark_test/jars/hbase2x/ali-phoenix-shaded-thin-client-5.2.2-HBase-2.x-SNAPSHOT.jar"  # The OSS directory where the JAR file that contains the dependency of the test code is saved.
        ],
        "className": "com.aliyun.spark.SparkOnHBase2xForPhoenix",
        "conf": {
            "spark.driver.resourceSpec": "small",  # The specifications of the Spark driver, which can be small, medium, large, or xlarge.
            "spark.executor.instances": 2,  # The number of Spark executors.
            "spark.executor.resourceSpec": "small",  # The specifications of Spark executors, which can be small, medium, large, or xlarge.
            "spark.dla.eni.enable": "true",  # Specifies whether to enable an elastic network interface (ENI) for the VPC. If you want to access data of the VPC, set spark.dla.eni.enable to true.
            "spark.dla.eni.vswitch.id": "vsw-xxx",  # The ID of the vSwitch to which the ApsaraDB for HBase cluster belongs.
            "spark.dla.eni.security.group.id": "sg-xxx"  # The ID of the security group to which the ApsaraDB for HBase cluster belongs.
        }
    }
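
The CREATE TABLE ... USING org.apache.spark.sql.jdbc statement in the test code maps the Phoenix SQL table to a Spark table through JDBC. The same mapping can also be expressed with the DataFrame reader API. The following code is a minimal sketch under the same driver and URL assumptions as the test code; the object name and the endpoint URL are placeholders.

    package com.aliyun.spark

    import org.apache.spark.sql.SparkSession

    object SparkOnPhoenixDataFrame {
      def main(args: Array[String]): Unit = {
        val sparkSession = SparkSession
          .builder()
          .appName("scala dataframe on Phoenix5.x test")
          .getOrCreate()

        // The same JDBC options as in the CREATE TABLE statement of the test code.
        val df = sparkSession.read
          .format("jdbc")
          .option("driver", "org.apache.phoenix.queryserver.client.Driver")
          // Placeholder endpoint; replace xxx with the address of your Phoenix Query Server.
          .option("url", "jdbc:phoenix:thin:url=http://xxx:8765;serialization=PROTOBUF")
          .option("dbtable", "us_population")
          .option("fetchsize", "100")
          .load()
        df.show(1)
        sparkSession.stop()
      }
    }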

Result

After the job succeeds, click Log in the Operation column of the job to view its logs.
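
If the job runs against the sample data from the prerequisites, the output of the SELECT ... LIMIT 1 query in the logs resembles the following. The exact row that is returned depends on the scan order of the table, so treat this output as illustrative only:

    +-----+-------+----------+
    |STATE|   CITY|POPULATION|
    +-----+-------+----------+
    |   AZ|Phoenix|   1461575|
    +-----+-------+----------+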