AnalyticDB for MySQL: Use Spark SQL to access Lindorm data

Last Updated: Nov 27, 2023

AnalyticDB for MySQL Data Lakehouse Edition (V3.0) allows you to use Spark SQL to access Lindorm data. This topic describes how to use Spark SQL to access the data of Hive tables and wide tables in Lindorm.

Prerequisites

  • A single-zone Lindorm instance is created. For more information, see Create an instance.

    Important

    You must activate LindormTable, Lindorm Distributed Processing System (LDPS), and LindormDFS for the Lindorm instance. For information about how to activate LDPS, see Activate LDPS and modify the configurations.

  • Object Storage Service (OSS) is activated. A bucket and a directory are created in OSS. For more information, see Activate OSS, Create buckets, and Manage directories.

  • An AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster is created. For more information, see Create a cluster.

  • A job resource group is created in the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster. For more information, see Create a resource group.

  • A database account is created for the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster.

Preparations

  1. Log on to the Lindorm console and go to the Instance Details page to view the vSwitch ID of the Lindorm instance.

  2. Log on to the Lindorm console and go to the Access Control page. Click the Security Groups tab to view the ID of the security group to which the Lindorm instance belongs. For information about how to add a security group, see Add a security group.
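
    Both IDs are used in the ENI-related SET statements of the Spark SQL examples in the following sections. For reference, this minimal sketch uses the placeholder values from those examples; replace them with your own IDs:

    SET spark.adb.eni.vswitchId=vsw-bp14pj8h0k5p0kwu3****;
    SET spark.adb.eni.securityGroupId=sg-bp11m93k021tp4ca****;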

Access the data of a Hive table in Lindorm

  1. Obtain the hdfs-site configuration information of your Hadoop Distributed File System (HDFS) client.

    Note

    You can log on to the Lindorm console, go to the Database Connection page, and then click the File Engine tab. Click Generate Configuration Items to obtain the hdfs-site configuration information.
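
    Each item in the generated hdfs-site configuration maps to one SET statement in the format SET spark.hadoop.<name>=<value>;, as used in Step 5. The following minimal sketch shows the mapping for a single item; the property name and value are placeholders taken from the example below, so use the values generated for your own instance:

    -- hdfs-site item: name = dfs.nameservices, value = ld-bp1ttz8833x9c****
    -- becomes the following Spark SQL statement:
    SET spark.hadoop.dfs.nameservices=ld-bp1ttz8833x9c****;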

  2. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition (V3.0) tab, find the cluster that you want to manage and click the cluster ID.

  3. In the left-side navigation pane, choose Job Editor > SQL Development.

  4. On the SQLConsole tab, select the Spark engine and a job resource group.

  5. On the SQLConsole tab, enter the following code:

    SET spark.adb.connectors=oss,external_hive;
    SET spark.adb.eni.enabled=true;
    SET spark.adb.eni.vswitchId=vsw-bp14pj8h0k5p0kwu3****;
    SET spark.adb.eni.securityGroupId=sg-bp11m93k021tp4ca****;
    SET spark.sql.catalog.lindorm_catalog=org.apache.spark.sql.hive.V2HiveExternalCatalog;
    SET spark.adb.eni.adbHostAlias.enabled=true;
    SET spark.sql.catalogImplementation=hive;
    -- Configure the Hive metastore and HDFS-related parameters.
    SET spark.sql.catalog.lindorm_catalog.spark.hadoop.hive.metastore.uris=thrift://ld-bp1ttz8833x9c****-proxy-ldps-hms.lindorm.aliyuncs.com:9083;
    SET spark.hadoop.dfs.nameservices=ld-bp1ttz8833x9c****;
    SET spark.hadoop.dfs.client.failover.proxy.provider.ld-bp1ttz8833x9c****=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider;
    SET spark.hadoop.dfs.ha.automatic-failover.enabled=true;
    SET spark.hadoop.dfs.ha.namenodes.ld-bp1ttz8833x9c****=nn1,nn2;
    SET spark.hadoop.dfs.namenode.rpc-address.ld-bp1ttz8833x9c****.nn1=ld-bp1ttz8833x9c****-master1-001.lindorm.rds.aliyuncs.com:8020;
    SET spark.hadoop.dfs.namenode.rpc-address.ld-bp1ttz8833x9c****.nn2=ld-bp1ttz8833x9c****-master2-001.lindorm.rds.aliyuncs.com:8020;
    SET spark.hadoop.dfs.namenode.http-address.ld-bp1ttz8833x9c****.nn1=ld-bp1ttz8833x9c****-master1-001.lindorm.rds.aliyuncs.com:50071;
    SET spark.hadoop.dfs.namenode.http-address.ld-bp1ttz8833x9c****.nn2=ld-bp1ttz8833x9c****-master2-001.lindorm.rds.aliyuncs.com:50071;
    
    -- Create an external database and an external table in OSS.
    CREATE DATABASE external_oss_db LOCATION 'oss://testBucketName/warehouse/db';
    CREATE TABLE external_oss_db.test(id int, name string) USING parquet LOCATION 'oss://testBucketName/warehouse/db/test';
    
    -- Insert the data of the Hive table into the OSS external table.
    INSERT INTO external_oss_db.test SELECT * FROM lindorm_catalog.spark_test.test;
    
    -- Read the data of the OSS external table.
    SELECT * FROM external_oss_db.test;

    The parameters are described as follows:

    • spark.adb.connectors (required): The names of the built-in connectors of AnalyticDB for MySQL Spark. Separate multiple names with commas (,). Valid values: oss, hudi, delta, adb, odps, and external_hive. In this example, oss and external_hive are used.

    • spark.adb.eni.enabled (required): Specifies whether to enable an elastic network interface (ENI). If you use external tables to access other external data sources, you must enable ENI. Valid values: true and false. In this example, this parameter is set to true.

    • spark.adb.eni.vswitchId (required): The vSwitch ID of the Lindorm instance, which you obtained during preparations.

    • spark.adb.eni.securityGroupId (required): The security group ID of the Lindorm instance, which you obtained during preparations.

    • spark.sql.catalog.lindorm_catalog (required): The catalog method that Spark SQL uses to access the data source. Set this parameter to org.apache.spark.sql.hive.V2HiveExternalCatalog. Note: You can customize the catalog_name part in the parameter name. In this example, lindorm_catalog is used.

    • spark.adb.eni.adbHostAlias.enabled (required): Specifies whether to automatically write the resolved domain name information that AnalyticDB for MySQL requires to a mapping table of domain names and IP addresses. Valid values: true and false. If you use an ENI to read data from or write data to EMR Hive, you must set this parameter to true.

    • HDFS-related parameters (required): The hdfs-site configuration information that you obtained in Step 1. Format: SET spark.hadoop.<name>=<value>;. Note: Replace <name> and <value> with the actual values from the configuration information.

    • location (required): The OSS path of the external table. Format: oss://<bucketname>/<database>/<table>. In this example, oss://testBucketName/warehouse/db/test is used. Note: Replace the path with your actual OSS path.

  6. Click Run Now.

  7. After the Spark job is run, go to the Spark JAR Development page, click the Applications tab, and then click Log in the Actions column to view the table data. For more information, see Spark editor.
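
Before you run the INSERT statement, you can optionally check that the Lindorm catalog is reachable by running read-only statements on the SQLConsole tab. The following is a minimal sketch that assumes the spark_test database and test table from the example above exist, and that the catalog supports standard Spark SQL metadata statements; the SET statements from Step 5 must be submitted in the same job:

    -- List the tables in the Lindorm Hive database (names assumed from the example).
    SHOW TABLES IN lindorm_catalog.spark_test;
    -- Preview a few rows of the Hive table before copying it to OSS.
    SELECT * FROM lindorm_catalog.spark_test.test LIMIT 10;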

Access the data of a wide table in Lindorm

  1. Obtain the Virtual Private Cloud (VPC) endpoint of LindormTable that is used for access over the HBase Java API. For more information, see View endpoints.

  2. Download the package that contains the JAR files required to connect to the Lindorm instance, decompress it, and upload the JAR files to OSS. For more information, see the "Upload an object" section of the Get started by using the OSS console topic.

  3. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition (V3.0) tab, find the cluster that you want to manage and click the cluster ID.

  4. In the left-side navigation pane, choose Job Editor > SQL Development.

  5. On the SQLConsole tab, select the Spark engine and a job resource group.

  6. On the SQLConsole tab, enter the following code:

    SET spark.adb.eni.enabled=true;
    SET spark.adb.eni.vswitchId=vsw-bp14pj8h0k5p0kwu3****;
    SET spark.adb.eni.securityGroupId=sg-bp11m93k021tp4ca****;
    SET spark.sql.catalog.lindorm_table=com.alibaba.lindorm.ldspark.datasources.v2.LindormCatalog;
    SET spark.sql.catalog.lindorm_table.url=ld-bp1ttz8833x9c****-proxy-lindorm.lindorm.rds.aliyuncs.com:30020;
    SET spark.sql.catalog.lindorm_table.username=test;
    SET spark.sql.catalog.lindorm_table.password=password;
    ADD JAR "oss://<bucketname>/hbase_jars/*";
    
    -- Create an external database and an external table in OSS.
    CREATE DATABASE IF NOT EXISTS external_oss_db LOCATION 'oss://testBucketName/warehouse/db';
    CREATE TABLE IF NOT EXISTS external_oss_db.lindorm_table_2(id string, name string) USING parquet LOCATION 'oss://testBucketName/warehouse/db/lindorm_table_2';
    
    -- Insert the data of the wide table into the OSS external table.
    INSERT INTO external_oss_db.lindorm_table_2 SELECT * FROM lindorm_table.spark_test.tb;
    
    -- Read the data of the OSS external table.
    SELECT * FROM external_oss_db.lindorm_table_2;

    The parameters are described as follows:

    • spark.adb.eni.enabled (required): Specifies whether to enable an elastic network interface (ENI). If you use external tables to access other external data sources, you must enable ENI. Valid values: true and false. In this example, this parameter is set to true.

    • spark.adb.eni.vswitchId (required): The vSwitch ID of the Lindorm instance, which you obtained during preparations.

    • spark.adb.eni.securityGroupId (required): The security group ID of the Lindorm instance, which you obtained during preparations.

    • spark.sql.catalog.lindorm_table (required): The catalog method that Spark SQL uses to access the data source. Set this parameter to com.alibaba.lindorm.ldspark.datasources.v2.LindormCatalog. Note: You can customize the catalog_name part in the parameter name. In this example, lindorm_table is used.

    • spark.sql.catalog.lindorm_table.url (required): The VPC endpoint that you obtained in Step 1 for access over the HBase Java API.

    • spark.sql.catalog.lindorm_table.username (required): The username of LindormTable.

    • spark.sql.catalog.lindorm_table.password (required): The password of LindormTable. If you forget your password, you can change it in the cluster management system of LindormTable. For more information, see Manage users.

    • jar (required): The OSS path of the JAR package that you downloaded in Step 2. In this example, oss://testBucketName/hbase_jars/* is used.

    • location (required): The OSS path of the external table. Format: oss://<bucketname>/<database>/<table>. In this example, oss://testBucketName/warehouse/db/lindorm_table_2 is used. Note: Replace the path with your actual OSS path.

  7. Click Run Now.

  8. After the Spark job is run, go to the Spark JAR Development page, click the Applications tab, and then click Log in the Actions column to view the table data. For more information, see Spark editor.
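
Before you run the INSERT statement, you can optionally verify connectivity to the wide table by running read-only statements on the SQLConsole tab. The following is a minimal sketch that assumes the spark_test.tb wide table from the example above exists, and that the catalog supports standard Spark SQL metadata statements; the SET statements and the ADD JAR statement from Step 6 must be submitted in the same job:

    -- List the tables visible through the Lindorm wide table catalog (names assumed from the example).
    SHOW TABLES IN lindorm_table.spark_test;
    -- Preview a few rows of the wide table before copying it to OSS.
    SELECT * FROM lindorm_table.spark_test.tb LIMIT 10;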