MaxCompute:Access Phoenix data using Spark on MaxCompute

Last Updated: Sep 10, 2025

This topic describes how to use Spark to connect to Phoenix and write data from HBase to MaxCompute.

Background information

Phoenix is the SQL layer for HBase. It is designed for scenarios that require high concurrency, low latency, and simple queries. This topic describes how to access Phoenix data using Spark on MaxCompute. The procedure includes creating a Phoenix table, writing data to the table, writing Spark code in IntelliJ IDEA, and performing a smoke test on the code in DataWorks.

Prerequisites

Before you begin, complete the following prerequisites:

  • Enable MaxCompute and create a MaxCompute project. For more information, see Enable MaxCompute and Create a MaxCompute project.

  • Enable DataWorks. For more information, see DataWorks purchase guide.

  • Enable HBase. For more information, see HBase purchase guide.

    Note

    This tutorial uses HBase 1.1 as an example. You can use other versions of HBase in your development environment.

  • Download and install Phoenix 4.12.0. For more information, see Use HBase SQL (Phoenix) 4.x.

    Note

    HBase 1.1 corresponds to Phoenix 4.12.0. Ensure that the versions are compatible during development.

  • Enable a virtual private cloud (VPC) and configure a security group for the HBase cluster. For more information, see Network connection procedure.

    Note

    In this tutorial, the HBase cluster is in a VPC. Therefore, ports 2181, 10600, and 16020 must be open in the security group.

Procedure

  1. Navigate to the bin directory of Phoenix and run the following sample command to start the Phoenix client.

    ./sqlline.py hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com:2181,hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com:2181,hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com:2181
    Note

    hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com:2181,hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com:2181,hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com:2181 is the ZooKeeper connection address of the HBase cluster. To obtain it, log in to the HBase console, go to the instance details page of the HBase cluster, and view the Database Connection page.

  2. In the Phoenix client, run the following statements to create a table named users and insert data.

    CREATE TABLE IF NOT EXISTS users (
        id UNSIGNED_INT NOT NULL,
        username CHAR(50),
        password CHAR(50)
        CONSTRAINT my_ph PRIMARY KEY (id));
    UPSERT INTO users(id,username,password) VALUES (1,'kongxx','Letmein');
    Note

    For more information about Phoenix syntax, see Get started with HBase SQL (Phoenix).

  3. In the Phoenix client, run the following statement to view the data in the users table.

    SELECT * FROM users;
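
    If the table was created and the row was inserted successfully, the query returns the row that you inserted in the previous step. The output is similar to the following (the exact layout depends on the sqlline version):

    +-----+-----------+-----------+
    | ID  | USERNAME  | PASSWORD  |
    +-----+-----------+-----------+
    | 1   | kongxx    | Letmein   |
    +-----+-----------+-----------+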
  4. Write and package the Spark code in IntelliJ IDEA.

    1. Use the Scala programming language to write the Spark test code.

      In IntelliJ IDEA, configure the local development environment based on the POM file. You can first use the public endpoint of MaxCompute for local testing. After the code is verified, change the value of the spark.hadoop.odps.end.point parameter in the sample code to the endpoint of the region where your MaxCompute project resides. The following is the sample code:

      package com.phoenix
      import org.apache.hadoop.conf.Configuration
      import org.apache.spark.sql.SparkSession
      import org.apache.phoenix.spark._
      /**
        * This example applies to Phoenix 4.x.
        */
      object SparkOnPhoenix4xSparkSession {
        def main(args: Array[String]): Unit = {
          // The ZooKeeper connection address of the HBase cluster.
          val zkAddress = "hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com:2181,hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com:2181,hb-2zecxg2ltnpeg8me4-master*-***.hbase.rds.aliyuncs.com:2181"
          // The name of the source table in Phoenix.
          val phoenixTableName = "users"
          // The name of the destination table in MaxCompute.
          val ODPSTableName = "users_phoenix"
          val sparkSession = SparkSession
            .builder()
            .appName("SparkSQL-on-MaxCompute")
            .config("spark.sql.broadcastTimeout", 20 * 60)
            .config("spark.sql.crossJoin.enabled", true)
            .config("odps.exec.dynamic.partition.mode", "nonstrict")
            // Set spark.master to local[N] to run the code directly. N is the number of concurrent operations.
            //.config("spark.master", "local[4]") 
            .config("spark.hadoop.odps.project.name", "***")
            .config("spark.hadoop.odps.access.id", "***")
            .config("spark.hadoop.odps.access.key", "***")
            //.config("spark.hadoop.odps.end.point", "http://service.cn.maxcompute.aliyun.com/api")
            .config("spark.hadoop.odps.end.point", "http://service.cn-beijing.maxcompute.aliyun-inc.com/api")
            .config("spark.sql.catalogImplementation", "odps")
            .getOrCreate()
          val df = sparkSession.read.format("org.apache.phoenix.spark").option("table", phoenixTableName).option("zkUrl", zkAddress).load()
          df.show()
          df.write.mode("overwrite").insertInto(ODPSTableName)
        }
      }
                              

      The following is the POM file.

      <?xml version="1.0" encoding="UTF-8"?>
      <!--
        Licensed under the Apache License, Version 2.0 (the "License");
        you may not use this file except in compliance with the License.
        You may obtain a copy of the License at
          http://www.apache.org/licenses/LICENSE-2.0
        Unless required by applicable law or agreed to in writing, software
        distributed under the License is distributed on an "AS IS" BASIS,
        WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        See the License for the specific language governing permissions and
        limitations under the License. See accompanying LICENSE file.
      -->
      <project xmlns="http://maven.apache.org/POM/4.0.0"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
          <modelVersion>4.0.0</modelVersion>
          <properties>
              <spark.version>2.3.0</spark.version>
              <cupid.sdk.version>3.3.8-public</cupid.sdk.version>
              <scala.version>2.11.8</scala.version>
              <scala.binary.version>2.11</scala.binary.version>
              <phoenix.version>4.12.0-HBase-1.1</phoenix.version>
          </properties>
          <groupId>com.aliyun.odps</groupId>
          <artifactId>Spark-Phonix</artifactId>
          <version>1.0.0-SNAPSHOT</version>
          <packaging>jar</packaging>
          <dependencies>
              <dependency>
                  <groupId>org.jpmml</groupId>
                  <artifactId>pmml-model</artifactId>
                  <version>1.3.8</version>
              </dependency>
              <dependency>
                  <groupId>org.jpmml</groupId>
                  <artifactId>pmml-evaluator</artifactId>
                  <version>1.3.10</version>
              </dependency>
              <dependency>
                  <groupId>org.apache.spark</groupId>
                  <artifactId>spark-core_${scala.binary.version}</artifactId>
                  <version>${spark.version}</version>
                  <scope>provided</scope>
                  <exclusions>
                      <exclusion>
                          <groupId>org.scala-lang</groupId>
                          <artifactId>scala-library</artifactId>
                      </exclusion>
                      <exclusion>
                          <groupId>org.scala-lang</groupId>
                          <artifactId>scalap</artifactId>
                      </exclusion>
                  </exclusions>
              </dependency>
              <dependency>
                  <groupId>org.apache.spark</groupId>
                  <artifactId>spark-sql_${scala.binary.version}</artifactId>
                  <version>${spark.version}</version>
                  <scope>provided</scope>
              </dependency>
              <dependency>
                  <groupId>org.apache.spark</groupId>
                  <artifactId>spark-mllib_${scala.binary.version}</artifactId>
                  <version>${spark.version}</version>
                  <scope>provided</scope>
              </dependency>
              <dependency>
                  <groupId>org.apache.spark</groupId>
                  <artifactId>spark-streaming_${scala.binary.version}</artifactId>
                  <version>${spark.version}</version>
                  <scope>provided</scope>
              </dependency>
              <dependency>
                  <groupId>com.aliyun.odps</groupId>
                  <artifactId>cupid-sdk</artifactId>
                  <version>${cupid.sdk.version}</version>
                  <scope>provided</scope>
              </dependency>
              <dependency>
                  <groupId>com.aliyun.phoenix</groupId>
                  <artifactId>ali-phoenix-core</artifactId>
                  <version>4.12.0-AliHBase-1.1-0.8</version>
                  <exclusions>
                      <exclusion>
                          <groupId>com.aliyun.odps</groupId>
                          <artifactId>odps-sdk-mapred</artifactId>
                      </exclusion>
                      <exclusion>
                          <groupId>com.aliyun.odps</groupId>
                          <artifactId>odps-sdk-commons</artifactId>
                      </exclusion>
                  </exclusions>
              </dependency>
              <dependency>
                  <groupId>com.aliyun.phoenix</groupId>
                  <artifactId>ali-phoenix-spark</artifactId>
                  <version>4.12.0-AliHBase-1.1-0.8</version>
                  <exclusions>
                      <exclusion>
                          <groupId>com.aliyun.phoenix</groupId>
                          <artifactId>ali-phoenix-core</artifactId>
                      </exclusion>
                  </exclusions>
              </dependency>
          </dependencies>
          <build>
              <plugins>
                  <plugin>
                      <groupId>org.apache.maven.plugins</groupId>
                      <artifactId>maven-shade-plugin</artifactId>
                      <version>2.4.3</version>
                      <executions>
                          <execution>
                              <phase>package</phase>
                              <goals>
                                  <goal>shade</goal>
                              </goals>
                              <configuration>
                                  <minimizeJar>false</minimizeJar>
                                  <shadedArtifactAttached>true</shadedArtifactAttached>
                                  <artifactSet>
                                      <includes>
                                          <!-- Include here the dependencies you
                                              want to be packed in your fat jar -->
                                          <include>*:*</include>
                                      </includes>
                                  </artifactSet>
                                  <filters>
                                      <filter>
                                          <artifact>*:*</artifact>
                                          <excludes>
                                              <exclude>META-INF/*.SF</exclude>
                                              <exclude>META-INF/*.DSA</exclude>
                                              <exclude>META-INF/*.RSA</exclude>
                                              <exclude>**/log4j.properties</exclude>
                                          </excludes>
                                      </filter>
                                  </filters>
                                  <transformers>
                                      <transformer
                                              implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                          <resource>reference.conf</resource>
                                      </transformer>
                                      <transformer
                                              implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                          <resource>META-INF/services/org.apache.spark.sql.sources.DataSourceRegister</resource>
                                      </transformer>
                                  </transformers>
                              </configuration>
                          </execution>
                      </executions>
                  </plugin>
                  <plugin>
                      <groupId>net.alchim31.maven</groupId>
                      <artifactId>scala-maven-plugin</artifactId>
                      <version>3.3.2</version>
                      <executions>
                          <execution>
                              <id>scala-compile-first</id>
                              <phase>process-resources</phase>
                              <goals>
                                  <goal>compile</goal>
                              </goals>
                          </execution>
                          <execution>
                              <id>scala-test-compile-first</id>
                              <phase>process-test-resources</phase>
                              <goals>
                                  <goal>testCompile</goal>
                              </goals>
                          </execution>
                      </executions>
                  </plugin>
              </plugins>
          </build>
      </project>
    2. In IntelliJ IDEA, package the code and dependency files into a JAR file. Then, upload the JAR file to the MaxCompute project environment by using the MaxCompute client, as shown in the sample commands after the following note. For more information, see Add resources.

      Note

      The DataWorks interface has a 50 MB upload limit for JAR files. Therefore, you must use the MaxCompute client to upload the JAR file.
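
      The following commands are a minimal sketch of this step. The JAR file name is only an example derived from the artifactId and version in the sample POM file; replace the file path with the actual name and location of the JAR file that you package.

      # Run in a shell from the project root directory to build the shaded JAR file.
      mvn clean package -DskipTests
      # Run in the MaxCompute client (odpscmd) to add the JAR file as a resource. The -f flag overwrites an existing resource with the same name.
      add jar target/Spark-Phonix-1.0.0-SNAPSHOT-shaded.jar -f;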

  5. Perform a smoke test in DataWorks.

    1. In DataWorks, run the following statement to create a MaxCompute table. For more information, see Create and use MaxCompute tables.

      CREATE TABLE IF NOT EXISTS users_phoenix
      (
          id       INT   ,
          username STRING,
          password STRING
      ) ;
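
      Optionally, run the following statement to confirm that the table was created with the expected schema:

      DESC users_phoenix;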
    2. In DataWorks, select the MaxCompute project environment and add the uploaded JAR file as a resource to the data development environment. For more information, see Create and use MaxCompute resources.

    3. Create an ODPS Spark node and set the job parameters. For more information, see Develop an ODPS Spark job.

      Set the following parameters in the job configuration so that the Spark job can access the HBase cluster in the VPC:

      spark.hadoop.odps.cupid.eni.enable=true
      spark.hadoop.odps.cupid.eni.info=cn-beijing:vpc-2zeaeq21mb1dmkqh0****
    4. Click the Smoke Test icon to start the smoke test.

  6. After the smoke test succeeds, run the following query statement in an ad hoc query node.

    SELECT * FROM users_phoenix;

    The query result shows that the data has been written to the MaxCompute table.
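
    For example, the result should contain the row that was written from the Phoenix table (the exact layout depends on the query tool):

    id    username    password
    1     kongxx      Letmein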