This topic describes how to configure Spark on MaxCompute to access ApsaraDB for HBase.
Background
Spark on MaxCompute can access instances of Alibaba Cloud services that are deployed in virtual private clouds (VPCs), such as Elastic Compute Service (ECS), ApsaraDB for HBase, and ApsaraDB RDS. By default, the underlying network of MaxCompute is isolated from external networks. To access ApsaraDB for HBase in a VPC, Spark on MaxCompute requires you to configure the spark.hadoop.odps.cupid.eni.info=<region_id>:<vpc_id> parameter. The configurations for ApsaraDB for HBase Standard Edition and ApsaraDB for HBase Performance-enhanced Edition (Lindorm) are different, as described in the following sections.
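For both editions, the Spark task configuration enables the elastic network interface (ENI) and specifies the region ID and VPC ID of the target cluster. The following lines show the format; the VPC ID is a sample value, so replace the region ID and VPC ID with those of your own VPC:
spark.hadoop.odps.cupid.eni.enable = true
spark.hadoop.odps.cupid.eni.info = cn-beijing:vpc-2zeaeq21mb1dmkqh0****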
Preparations
Before you begin, complete the following preparations:
Activate MaxCompute and create a MaxCompute project. For more information, see Activate MaxCompute and Create a MaxCompute project.
Activate DataWorks. For more information, see DataWorks purchase guide.
Activate ApsaraDB for HBase. For more information, see ApsaraDB for HBase purchase guide.
Activate a VPC, and then configure the security group and whitelist for the ApsaraDB for HBase cluster. For more information, see Network activation process.
Note
For ApsaraDB for HBase Standard Edition, open ports 2181, 10600, and 16020 in the security group.
For ApsaraDB for HBase Performance-enhanced Edition (Lindorm), open ports 30020, 10600, and 16020 in the security group.
Access ApsaraDB for HBase Standard Edition from Spark on MaxCompute
In the HBase client, run the following statement to create an HBase table.
create 'test','cf'
Note
For more information about HBase commands, see Introduction to HBase Shell.
In IntelliJ IDEA, write the Spark code and package it.
Use the Scala programming language to write the Spark code as shown in the following example.
import org.apache.hadoop.hbase.{HBaseConfiguration, HConstants}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.sql.SparkSession

object App {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder()
      .appName("HbaseTest")
      .config("spark.sql.catalogImplementation", "odps")
      .config("spark.hadoop.odps.end.point", "http://service.cn.maxcompute.aliyun.com/api")
      .config("spark.hadoop.odps.runtime.end.point", "http://service.cn.maxcompute.aliyun-inc.com/api")
      .getOrCreate()
    val sc = spark.sparkContext
    val config = HBaseConfiguration.create()
    //The ZooKeeper endpoint of the HBase cluster.
    val zkAddress = "hb-2zecxg2ltnpeg8me4-master*-***:2181,hb-2zecxg2ltnpeg8me4-master*-***:2181,hb-2zecxg2ltnpeg8me4-master*-***:2181"
    config.set(HConstants.ZOOKEEPER_QUORUM, zkAddress)
    val jobConf = new JobConf(config)
    jobConf.setOutputFormat(classOf[TableOutputFormat])
    //The HBase table name.
    jobConf.set(TableOutputFormat.OUTPUT_TABLE, "test")
    try {
      //Write data from a MaxCompute table to an HBase table. The following statement for querying the MaxCompute table uses constants as an example. Replace them with your actual values in the development environment.
      spark.sql("select '7', 88 ").rdd.map(row => {
        val name = row(0).asInstanceOf[String]
        val id = row(1).asInstanceOf[Integer]
        //Use the id as the row key and write the name to the cf column family.
        val put = new Put(Bytes.toBytes(id))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes(id), Bytes.toBytes(name))
        (new ImmutableBytesWritable, put)
      }).saveAsHadoopDataset(jobConf)
    } finally {
      sc.stop()
    }
  }
}
Note
Log on to the ApsaraDB for HBase console. On the product page for the HBase cluster instance, navigate to the Database Connection page to obtain the ZooKeeper endpoint.
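If you want to confirm that the rows were written, the table can be read back in the same Spark job. The following is a minimal sketch that could be placed before sc.stop() in the example above; it assumes the zkAddress and sc values defined there and the hbase-mapreduce dependency listed below.
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, HConstants}

//Read the rows back from the HBase table to confirm that the write succeeded.
val readConf = HBaseConfiguration.create()
readConf.set(HConstants.ZOOKEEPER_QUORUM, zkAddress)
readConf.set(TableInputFormat.INPUT_TABLE, "test")
val hbaseRdd = sc.newAPIHadoopRDD(
  readConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])
//Print the row keys. Bytes.toStringBinary renders non-printable bytes safely.
hbaseRdd.map { case (_, result) => Bytes.toStringBinary(result.getRow) }
  .collect()
  .foreach(println)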
The corresponding HBase dependencies are as follows.
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-mapreduce</artifactId>
    <version>2.0.2</version>
</dependency>
<dependency>
    <groupId>com.aliyun.hbase</groupId>
    <artifactId>alihbase-client</artifactId>
    <version>2.0.5</version>
</dependency>
In IntelliJ IDEA, package the code and its dependencies into a JAR file. Then, use the MaxCompute client to upload the JAR file to your MaxCompute project. For more information, see Add a resource.
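If the project is built with Maven, one way to package the code together with these dependencies is the maven-shade-plugin. The following is a minimal sketch under that assumption; adapt it to your own build configuration.
<build>
    <plugins>
        <plugin>
            <!-- Bundle the code and the HBase dependencies into one JAR for upload to MaxCompute. -->
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.2.4</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>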
Note
The DataWorks interface limits JAR package uploads to 50 MB. Therefore, you must use the MaxCompute client to upload the JAR package if it exceeds this limit.
Create and configure an ODPS Spark node in DataWorks.
In DataWorks, select the MaxCompute project environment and add the uploaded JAR package as a resource. For more information, see Create and use MaxCompute resources.
Create an ODPS Spark node and set the task parameters. For more information, see Develop an ODPS Spark task.
The configuration parameters for submitting the Spark task are as follows.
spark.hadoop.odps.cupid.eni.enable = true
spark.hadoop.odps.cupid.eni.info = cn-beijing:vpc-2zeaeq21mb1dmkqh0****
Access ApsaraDB for HBase Performance-enhanced Edition (Lindorm) from Spark on MaxCompute
In the HBase client, run the following statement to create an HBase table.
create 'test','cf'
Note
For more information about HBase commands, see Introduction to HBase Shell.
In IntelliJ IDEA, write the Spark code and package it.
Use the Scala programming language to write the Spark code as shown in the following example.
import java.util

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

object McToHbase {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder()
      .appName("spark_sql_ddl")
      .config("spark.sql.catalogImplementation", "odps")
      .config("spark.hadoop.odps.end.point", "http://service.cn.maxcompute.aliyun.com/api")
      .config("spark.hadoop.odps.runtime.end.point", "http://service.cn.maxcompute.aliyun-inc.com/api")
      .getOrCreate()
    val sc = spark.sparkContext
    try {
      //Write data from a MaxCompute table to an HBase table. The following statement for querying the MaxCompute table uses constants as an example. Replace them with your actual values in the development environment.
      spark.sql("select '7', 'long'").rdd.foreachPartition { iter =>
        val config = HBaseConfiguration.create()
        //The endpoint of the ZooKeeper cluster (VPC internal network endpoint).
        config.set("hbase.zookeeper.quorum", "<ZooKeeper_endpoint>:30020")
        //The username and password for ApsaraDB for HBase.
        config.set("hbase.client.username", "<username>")
        config.set("hbase.client.password", "<password>")
        //The HBase table name.
        val tableName = TableName.valueOf("test")
        val conn = ConnectionFactory.createConnection(config)
        val table = conn.getTable(tableName)
        val puts = new util.ArrayList[Put]()
        //Use the id as the row key and write the name to the cf column family.
        iter.foreach { row =>
          val id = row(0).asInstanceOf[String]
          val name = row(1).asInstanceOf[String]
          val put = new Put(Bytes.toBytes(id))
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes(id), Bytes.toBytes(name))
          puts.add(put)
        }
        //Write the batched puts once per partition, and then release the connection.
        table.put(puts)
        table.close()
        conn.close()
      }
    } finally {
      sc.stop()
    }
  }
}
Note
Log on to the HBase console. On the Database Connection page of the cluster instance details, obtain the ZooKeeper connection address and the HBase username and password.
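To confirm the write, the row can be read back with a Get on the driver. The following is a minimal sketch that reuses the same connection settings; <ZooKeeper_endpoint>, <username>, and <password> are placeholders as in the example above.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

//Read back the row that was written above (row key "7", column cf:7, value "long").
val config = HBaseConfiguration.create()
config.set("hbase.zookeeper.quorum", "<ZooKeeper_endpoint>:30020")
config.set("hbase.client.username", "<username>")
config.set("hbase.client.password", "<password>")
val conn = ConnectionFactory.createConnection(config)
val table = conn.getTable(TableName.valueOf("test"))
try {
  val result = table.get(new Get(Bytes.toBytes("7")))
  println(Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("7"))))
} finally {
  table.close()
  conn.close()
}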
The corresponding HBase dependency is as follows.
<dependency>
    <groupId>com.aliyun.hbase</groupId>
    <artifactId>alihbase-client</artifactId>
    <version>2.0.8</version>
</dependency>
In IntelliJ IDEA, package the code and its dependencies into a JAR file. Then, use the MaxCompute client to upload the JAR file to your MaxCompute project. For more information, see Add a resource.
Note
The DataWorks interface limits JAR package uploads to 50 MB. Therefore, you must use the MaxCompute client to upload the JAR package if it exceeds this limit.
Create and configure an ODPS Spark node in DataWorks.
In DataWorks, select the MaxCompute project environment and add the uploaded JAR package as a resource. For more information, see Create and use MaxCompute resources.
Create an ODPS Spark node and set the task parameters. For more information, see Develop an ODPS Spark task.
The configuration parameters for submitting the Spark task are as follows.
spark.hadoop.odps.cupid.eni.enable = true
spark.hadoop.odps.cupid.eni.info = cn-beijing:vpc-2zeaeq21mb1dmkqh0****