An elastic network interface (ENI) is a virtual network interface controller (NIC) that can be bound to a VPC-type ECS instance. You can use ENIs to deploy high-availability clusters and perform low-cost failovers and fine-grained network management. This topic describes how to use the serverless Spark engine to access HBase clusters in your VPC through an ENI VSwitch.

Prerequisites

The serverless Spark engine can access your VPC. For more information, see Access your VPC.
Notice You can reuse the VSwitch and security group that are used by the existing ENI in your cluster.

Procedure

  1. Add the CIDR block of the ENI VSwitch to the whitelist or security group of your HBase cluster.
    • If you want to access an HBase cluster in EMR, add the CIDR block of the ENI VSwitch to an inbound rule of the security group of the HBase cluster, and open HBase port 2181 to this CIDR block. If you have changed the default HBase port, open the port that you actually use. For example, if the CIDR block of the ENI VSwitch is 192.168.0.0/24, add an inbound rule that allows access from 192.168.0.0/24 to port 2181. The following figure shows the security group of the HBase cluster.
    • If you want to access an HBase cluster where X-Pack is deployed, you can add the CIDR block of the ENI VSwitch to the whitelist of the HBase cluster in the ApsaraDB for HBase console. For more information, see Configure a whitelist or a security group.
    • If you want to access a self-managed HBase cluster, add the CIDR block of the ENI VSwitch to the security group of the cluster, and open HBase port 2181 to this CIDR block. If you have changed the default HBase port, open the port that you actually use.
  2. Obtain the parameters that must be configured in the serverless Spark engine.
    Note Skip this step if you have already obtained the required configurations of your HBase cluster.
    The serverless Spark engine cannot automatically read configurations from the directory specified by HBASE_CONF_DIR in your HBase cluster. Therefore, you must manually configure the parameters that are used to read the configurations of the HBase cluster. Alibaba Cloud provides a tool that reads these configurations for you. Run the following wget command to download the spark-examples-0.0.1-SNAPSHOT-shaded.jar package and upload the package to OSS. Then, submit a Spark job to your cluster and obtain the required configurations from the output of the job.
    wget https://dla003.oss-cn-hangzhou.aliyuncs.com/GetSparkConf/spark-examples-0.0.1-SNAPSHOT-shaded.jar
    • HBase clusters in EMR: You can run the following command to submit the job after the JAR package is uploaded to OSS:
      --class com.aliyun.spark.util.GetConfForServerlessSpark
      --deploy-mode client
      ossref://{path/to}/spark-examples-0.0.1-SNAPSHOT-shaded.jar
      get hbase

      After the job succeeds, you can use SparkUI to view the stdout output of the driver. You can also view the output from the job logs on the job details page.

    • HBase clusters where X-Pack is deployed: You can log on to the ApsaraDB for HBase console. In the left-side navigation pane, click Clusters, click the instance name, and obtain hbase.zookeeper.quorum, as shown in the following figure.
    • Self-managed HBase clusters: Submit the following job. If the HBASE_CONF_DIR environment variable is not specified for the cluster, use the --hadoop-conf-dir option to specify the configuration directory of the cluster.
      --class com.aliyun.spark.util.GetConfForServerlessSpark
      --deploy-mode client
      /{path/to}/spark-examples-0.0.1-SNAPSHOT-shaded.jar
      get --hadoop-conf-dir </path/to/your/hbase/conf/dir> hbase
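    If you can log on to the master node of a self-managed cluster and access its configuration directory directly, you can also read hbase.zookeeper.quorum yourself instead of submitting the helper job. The following Scala snippet is only a minimal sketch of that approach; it assumes that hadoop-common is available on the classpath and that the configuration directory is passed as the first argument, and the object name PrintHBaseQuorum is illustrative only.
      import java.nio.file.Paths
      
      import org.apache.hadoop.conf.Configuration
      
      // Illustrative helper: reads hbase-site.xml from a local configuration directory
      // and prints the hbase.zookeeper.quorum value that the serverless Spark engine needs.
      object PrintHBaseQuorum {
        def main(args: Array[String]): Unit = {
          // The first argument is the configuration directory, for example the directory that HBASE_CONF_DIR points to.
          val confDir = args(0)
          val conf = new Configuration(false)
          conf.addResource(Paths.get(confDir, "hbase-site.xml").toUri.toURL)
          val quorum = conf.get("hbase.zookeeper.quorum")
          println(s"hbase.zookeeper.quorum=$quorum")
        }
      }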
  3. Edit the code in the SparkApplication file to access the HBase cluster.
    • Sample code:
      package com.aliyun.spark
      
      import org.apache.spark.sql.SparkSession
      
      object SparkHbase {
        def main(args: Array[String]): Unit = {
          // The ZooKeeper address of the HBase cluster. The address in the sample code is only for reference. Replace it with the ZooKeeper address of your HBase cluster.
          // Format: xxx-002.hbase.rds.aliyuncs.com:2181,xxx-001.hbase.rds.aliyuncs.com:2181,xxx-003.hbase.rds.aliyuncs.com:2181
          val zkAddress = args(0)
          // The name of the table in your HBase cluster. You must create a table in advance. For more information about how to create a table in your HBase cluster, click the following link: https://help.aliyun.com/document_detail/52051.html?spm=a2c4g.11174283.6.577.7e943c2eiYCq4k
          val hbaseTableName = args(1)
          // The name of the table in the serverless Spark engine.
          val sparkTableName = args(2)
      
          val sparkSession = SparkSession
            .builder()
            //      .enableHiveSupport() // Uncomment enableHiveSupport() to query the table that this code creates through the JDBC endpoint of the serverless Spark engine.
            .appName("scala spark on HBase test")
            .getOrCreate()
      
      
          import sparkSession.implicits._
      
          // If the table is found, delete it.
          sparkSession.sql(s"drop table if exists $sparkTableName")
      
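          // The catalog below maps the Spark table columns to HBase: col0 maps to the HBase rowkey, and col1 maps to column col1 in column family cf.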
          val createCmd =
            s"""CREATE TABLE ${sparkTableName} USING org.apache.hadoop.hbase.spark
               |    OPTIONS ('catalog'=
               |    '{"table":{"namespace":"default", "name":"${hbaseTableName}"},"rowkey":"rowkey",
               |    "columns":{
               |    "col0":{"cf":"rowkey", "col":"rowkey", "type":"string"},
               |    "col1":{"cf":"cf", "col":"col1", "type":"String"}}}',
               |    'hbase.zookeeper.quorum' = '${zkAddress}'
               |    )""".stripMargin
      
          println(s" the create sql cmd is: \n $createCmd")
          sparkSession.sql(createCmd)
          val querySql = "select * from " + sparkTableName + " limit 10"
          sparkSession.sql(querySql).show
        }
      }
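      If you prefer the DataFrame API over Spark SQL, you can pass the same data source and options through sparkSession.read. The following is a minimal sketch of this variant; it assumes that the connector accepts the catalog and hbase.zookeeper.quorum options through the DataFrame reader in the same way as through the SQL OPTIONS clause, and the object name SparkHbaseDataFrame is illustrative only.
      package com.aliyun.spark
      
      import org.apache.spark.sql.SparkSession
      
      object SparkHbaseDataFrame {
        def main(args: Array[String]): Unit = {
          // The same arguments as in the SQL-based sample: the ZooKeeper address and the HBase table name.
          val zkAddress = args(0)
          val hbaseTableName = args(1)
      
          val sparkSession = SparkSession
            .builder()
            .appName("scala spark on HBase test (DataFrame)")
            .getOrCreate()
      
          // The same catalog JSON that the CREATE TABLE statement above uses.
          val catalog =
            s"""{"table":{"namespace":"default", "name":"${hbaseTableName}"},"rowkey":"rowkey",
               |"columns":{
               |"col0":{"cf":"rowkey", "col":"rowkey", "type":"string"},
               |"col1":{"cf":"cf", "col":"col1", "type":"String"}}}""".stripMargin
      
          // Pass the catalog and the ZooKeeper address as data source options instead of SQL OPTIONS.
          val df = sparkSession.read
            .format("org.apache.hadoop.hbase.spark")
            .option("catalog", catalog)
            .option("hbase.zookeeper.quorum", zkAddress)
            .load()
      
          df.show(10)
        }
      }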
    • Dependencies configured for your HBase cluster in the pom.xml file:
      <dependency>
          <groupId>com.aliyun.apsaradb</groupId>
          <artifactId>alihbase-spark</artifactId>
          <version>1.1.3_2.4.3-1.0.4</version>
          <scope>provided</scope>
      </dependency>
      <dependency>
          <groupId>com.aliyun.hbase</groupId>
          <artifactId>alihbase-client</artifactId>
          <version>1.1.3</version>
          <scope>provided</scope>
          <exclusions>
              <exclusion>
                  <groupId>io.netty</groupId>
                  <artifactId>netty-all</artifactId>
              </exclusion>
          </exclusions>
      </dependency>
      <dependency>
          <groupId>com.aliyun.hbase</groupId>
          <artifactId>alihbase-protocol</artifactId>
          <version>1.1.3</version>
          <scope>provided</scope>
      </dependency>
      <dependency>
          <groupId>com.aliyun.hbase</groupId>
          <artifactId>alihbase-server</artifactId>
          <version>1.1.3</version>
          <scope>provided</scope>
          <exclusions>
              <exclusion>
                  <groupId>io.netty</groupId>
                  <artifactId>netty-all</artifactId>
              </exclusion>
          </exclusions>
      </dependency>
  4. Upload the SparkApplication JAR package and dependencies to OSS.
    For more information, see Upload objects.
    Note The region where OSS is deployed must be the same as the region where the serverless Spark engine is deployed.
  5. Submit a job in the serverless Spark engine and perform computation.
    1. Use HBase Shell of your HBase cluster to prepare data.
      bin/hbase shell
      hbase(main):001:0> create 'mytable', 'cf'
      hbase(main):001:0> put 'mytable', 'rowkey1', 'cf:col1', 'this is value'
    2. In the DLA console, submit a job to access your HBase cluster. For more information, see Create and run Spark jobs.
      {
          "args": [
              "{ip0}:2181,{ip1}:2181,{ip2}:2181",
              "mytable",
              "spark_on_hbase_job"
          ],
          "name": "spark-on-hbase",
          "className": "com.aliyun.spark.SparkHbase",
          "conf": {
              "spark.dla.eni.vswitch.id": "{The ID of your ENI VSwitch}",
              "spark.dla.eni.security.group.id": "{The ID of your security group}",
              "spark.driver.resourceSpec": "medium",
              "spark.dla.eni.enable": "true",
              "spark.dla.connectors": "hbase",
              "spark.executor.instances": 2,
              "spark.executor.resourceSpec": "medium"
          },
          "file": "oss://{The OSS directory where your JAR package is saved}"
      }
      The following list describes the parameters.
      • ip0:2181,ip1:2181,ip2:2181: The ZooKeeper address of your HBase cluster. You can obtain this address from the hbase.zookeeper.quorum configuration item in the ${HBASE_CONF_DIR}/hbase-site.xml file on the master node of your HBase cluster. The value of hbase.zookeeper.quorum is in the Domain name:Port format and must be converted to the IP address:Port format. To obtain the mappings between domain names and IP addresses, view the /etc/hosts file on the master node of your cluster, or ping the domain names on the master node. You can also obtain the mappings from the output of the job in Step 2. If no port number is specified, port 2181 is used.
      • mytable: The name of the table in your HBase cluster. In this topic, the mytable table that is created in HBase Shell is used.
      • spark_on_hbase_job: The name of the table in the serverless Spark engine. This table maps to the table in your HBase cluster.
      • spark.dla.connectors: Specifies whether to add the built-in HBase connector JAR package of the serverless Spark engine to the classpath. If the JAR package that you upload does not contain the dependencies that are required to read tables in your HBase cluster, set this parameter to hbase. If your JAR package contains these dependencies, you do not need to configure this parameter.
      • spark.dla.eni.vswitch.id: The ID of your ENI VSwitch.
      • spark.dla.eni.security.group.id: The ID of your security group.
      • spark.dla.eni.enable: Specifies whether to enable ENI-based access to your VPC.
      After the job succeeds, find the job in the task list and click Log in the Operation column to view the job logs.
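      With the sample row that is inserted through HBase Shell and the catalog mapping that is used in the sample code (col0 maps to the rowkey and col1 maps to cf:col1), the output of show() in the driver stdout should look similar to the following. The exact formatting depends on your Spark version.
      +-------+-------------+
      |   col0|         col1|
      +-------+-------------+
      |rowkey1|this is value|
      +-------+-------------+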