Read data from Lindorm using Spark SQL - AnalyticDB - Alibaba Cloud Documentation Center

AnalyticDB for MySQL allows you to use Spark SQL to read data from Lindorm Hive tables and wide tables by routing data through OSS external tables.

Prerequisites

Before you begin, make sure you have:

A single-zone Lindorm instance with LindormTable, Lindorm Distributed Processing System (LDPS), and LindormDFS activated. For information about how to activate LDPS, see Activate LDPS and modify the configurations
An OSS bucket and a directory created in OSS. See Activate OSS, Create a bucket, and Create a directory
An AnalyticDB for MySQL Data Lakehouse Edition cluster with a job resource group and a database account. See Create a resource group

How it works

Both access paths follow the same three-stage flow:

Configure Spark session settings — enable Elastic Network Interface (ENI), set network parameters, and configure the Lindorm catalog.
Create an OSS external table as an intermediate destination.
Insert Lindorm data into the external table, then query it with SELECT.

The two paths differ in how they connect to Lindorm:

	Hive table	Wide table
Catalog class	`org.apache.spark.sql.hive.V2HiveExternalCatalog`	`com.alibaba.lindorm.ldspark.datasources.v2.LindormCatalog`
Connection method	Hive Metastore URI + Hadoop Distributed File System (HDFS) NameNode addresses	VPC endpoint for HBase Java API
Extra setup	Get hdfs-site configuration from Lindorm technical support	Download and upload a JAR package to OSS

Preparations

Before running either access path, collect the following from the Lindorm console:

Log in to the Lindorm console and go to Instance Details. Note the vSwitch ID of your Lindorm instance.
Go to Access Control > Security Groups tab. Note the security group ID of your Lindorm instance. To add a security group, see Add a security group.

Read data from a Hive table

Before you start

Contact Lindorm technical support (DingTalk ID: s0s3eg3) to get the hdfs-site configuration for your Lindorm instance. Provide your instance ID. The configuration contains the Hive Metastore URI and HDFS NameNode addresses used in the SQL below.

Steps

Log in to the AnalyticDB for MySQL console. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, click your cluster ID.
In the left-side navigation pane, choose Job Development > SQL Development.
On the SQLConsole tab, select the Spark engine and your job resource group.

Enter and run the following SQL. It sets up the Lindorm Hive catalog over ENI, creates an OSS external table, copies Hive table data into it, and reads the result. Replace the placeholders with values from your preparations and the hdfs-site configuration:

SET spark.adb.connectors = oss,external_hive;
SET spark.adb.eni.enabled=true;
SET spark.adb.eni.vswitchId=vsw-bp14pj8h0k5p0kwu3****;
SET spark.adb.eni.securityGroupId=sg-bp11m93k021tp4ca****;
SET spark.sql.catalog.lindorm_catalog = org.apache.spark.sql.hive.V2HiveExternalCatalog;
SET spark.adb.eni.adbHostAlias.enabled=true;
SET spark.sql.catalogImplementation = hive;

-- HDFS-related parameters from hdfs-site configuration
SET spark.sql.catalog.lindorm_catalog.spark.hadoop.hive.metastore.uris=thrift://ld-bp1ttz8833x9c****-proxy-ldps-hms.lindorm.aliyuncs.com:9083;
SET spark.hadoop.dfs.nameservices=ld-bp1ttz8833x9c****;
SET spark.hadoop.dfs.client.failover.proxy.provider.ld-bp1ttz8833x9c****=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider;
SET spark.hadoop.dfs.ha.automatic-failover.enabled=true;
SET spark.hadoop.dfs.ha.namenodes.ld-bp1ttz8833x9c****=nn1,nn2;
SET spark.hadoop.dfs.namenode.rpc-address.ld-bp1ttz8833x9c****.nn1=ld-bp1ttz8833x9c-master1-001.lindorm.rds.aliyuncs.com:8020;
SET spark.hadoop.dfs.namenode.rpc-address.ld-bp1ttz8833x9c****.nn2=ld-bp1ttz8833x9c****-master2-001.lindorm.rds.aliyuncs.com:8020;
SET spark.hadoop.dfs.namenode.http-address.ld-bp1ttz8833x9c****.nn1=ld-bp1ttz8833x9c****-master1-001.lindorm.rds.aliyuncs.com:50071;
SET spark.hadoop.dfs.namenode.http-address.ld-bp1ttz8833x9c****.nn2=ld-bp1ttz8833x9c****-master2-001.lindorm.rds.aliyuncs.com:50071;

-- Create an external database and table in OSS
CREATE DATABASE external_oss_db location 'oss://testBucketName/warehouse/db';
CREATE TABLE external_oss_db.test(id int, name string) using parquet location 'oss://testBucketName/warehouse/db/test';

-- Copy data from the Lindorm Hive table into the OSS external table
INSERT INTO external_oss_db.test SELECT * FROM lindorm_catalog.spark_test.test;

-- Query the data
SELECT * FROM external_oss_db.test;

The following table describes the key parameters.

Parameter	Required	Description
`spark.adb.connectors`	Yes	Connectors to enable. Set to `oss,external_hive` for this use case. Valid values: `oss`, `hudi`, `delta`, `adb`, `odps`, `external_hive`, `jindo`.
`spark.adb.eni.enabled`	Yes	Enables ENI for connecting to external data sources. Set to `true`.
`spark.adb.eni.vswitchId`	Yes	The vSwitch ID noted during preparations.
`spark.adb.eni.securityGroupId`	Yes	The security group ID noted during preparations.
`spark.sql.catalog.lindorm_catalog`	Yes	Configures the Hive catalog. Set to `org.apache.spark.sql.hive.V2HiveExternalCatalog`. You can replace `lindorm_catalog` with a custom catalog name.
`spark.adb.eni.adbHostAlias.enabled`	Yes	Writes domain name resolution mappings required by AnalyticDB for MySQL. If you use an ENI to read data from or write data to EMR Hive, you must set this parameter to `true`.
`spark.sql.catalogImplementation`	Yes	Set to `hive`.
HDFS-related parameters	Yes	From the hdfs-site configuration. Format: `SET spark.hadoop.<name>=<value>`.
`location`	Yes	OSS path of the external table. Format: `oss://<bucket-name>/<database>/<table>`.

Click Execute.
After the Spark job completes, go to Spark JAR Development. On the Applications tab, click Logs in the Actions column to view the table data. For details, see Spark editor.

Read data from a wide table

Before you start

Get the VPC endpoint of LindormTable for access by using the HBase Java API. See the View the endpoints of LindormTable section.
Download the required JAR package, then extract and upload it to your OSS bucket. See the Upload an object section.

Steps

Log in to the AnalyticDB for MySQL console. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, click your cluster ID.
In the left-side navigation pane, choose Job Development > SQL Development.
On the SQLConsole tab, select the Spark engine and your job resource group.

Enter and run the following SQL. It configures the LindormTable catalog over ENI using the HBase Java API endpoint, loads the JAR package, creates an OSS external table, copies wide table data into it, and reads the result. Replace the placeholders with your actual values:

SET spark.adb.eni.enabled=true;
SET spark.adb.eni.vswitchId=vsw-bp14pj8h0k5p0kwu3****;
SET spark.adb.eni.securityGroupId=sg-bp11m93k021tp4ca****;
SET spark.sql.catalog.lindorm_table = com.alibaba.lindorm.ldspark.datasources.v2.LindormCatalog;
SET spark.sql.catalog.lindorm_table.url=ld-bp1ttz8833x9c****-proxy-lindorm.lindorm.rds.aliyuncs.com:30020;
SET spark.sql.catalog.lindorm_table.username=test;
SET spark.sql.catalog.lindorm_table.password=password;
add jar "oss://<bucketname>/hbase_jars/*";

-- Create an external database and table in OSS
CREATE DATABASE IF NOT exists external_oss_db location 'oss://testBucketName/warehouse/db';
CREATE TABLE IF NOT exists external_oss_db.lindorm_table_2(id string, name string) using parquet location 'oss://testBucketName/warehouse/db/lindorm_table_2';

-- Copy data from the Lindorm wide table into the OSS external table
INSERT INTO external_oss_db.lindorm_table_2 SELECT * FROM lindorm_table.spark_test.tb;

-- Query the data
SELECT * FROM external_oss_db.lindorm_table_2;

The following table describes the key parameters.

Parameter	Required	Description
`spark.adb.eni.enabled`	Yes	Enables ENI for connecting to external data sources. Set to `true`.
`spark.adb.eni.vswitchId`	Yes	The vSwitch ID noted during preparations.
`spark.adb.eni.securityGroupId`	Yes	The security group ID noted during preparations.
`spark.sql.catalog.lindorm_table`	Yes	Configures the LindormTable catalog. Set to `com.alibaba.lindorm.ldspark.datasources.v2.LindormCatalog`. You can replace `lindorm_table` with a custom catalog name.
`spark.sql.catalog.lindorm_table.url`	Yes	The VPC endpoint for HBase Java API access obtained in step 1 of "Before you start".
`spark.sql.catalog.lindorm_table.username`	Yes	The LindormTable username.
`spark.sql.catalog.lindorm_table.password`	Yes	The LindormTable password. To reset the password, go to the LindormTable cluster management system.
`jar`	Yes	The OSS path of the JAR package uploaded in step 2 of "Before you start". Format: `oss://<bucket-name>/hbase_jars/*`.
`location`	Yes	OSS path of the external table. Format: `oss://<bucket-name>/<database>/<table>`.

Click Execute.
After the Spark job completes, go to Spark JAR Development. On the Applications tab, click Logs in the Actions column to view the table data. For details, see Spark editor.

AnalyticDB for MySQL supports two methods to access data in Lindorm wide tables: using Java Database Connectivity (JDBC) and using the Spark Datasource V2 interface. The differences between the two methods are as follows:

The JDBC method supports accessing only the default database in a Lindorm wide table.
The Spark Datasource V2 interface method supports accessing any database in a wide table.
The Lindorm team reports no significant difference in efficiency between the two methods because the Spark Datasource V2 interface also concatenates SQL statements at the underlying layer.

Access wide table data using JDBC

Obtain the virtual private cloud (VPC) endpoint for LindormTable SQL. For more information, see View endpoints.
Download the Lindorm JDBC Driver dependency and upload the JAR package to OSS. For more information, see Upload a file.
Note
The version of lindorm-all-client must be 2.1.6.
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.
In the navigation pane on the left, click Job Development > SQL Development.
In the SQLConsole window, select the Spark engine and the job resource group.

In the SQLConsole window, enter the following job content:

set spark.adb.eni.enabled=true;
set spark.adb.eni.vswitchId=<vsw-bp14pj8h0k5p0kwu3****>;
set spark.adb.eni.securityGroupId=<sg-bp11m93k021tp4ca****>;

add jar oss://<bucketname>/hbase_jars/lindorm-all-client-2.1.6.jar;

CREATE TEMPORARY VIEW tb
USING org.apache.spark.sql.jdbc
OPTIONS (
  driver 'com.aliyun.lindorm.table.client.Driver',
  url 'jdbc:lindorm:table:url=http://ld-bp1ttz8833x9c****-proxy-lindorm.lindorm.rds.aliyuncs.com:30060/default', -- private network address
  dbtable 'tb',
-- The database username obtained from the Lindorm console
  user 'test',
-- The database password obtained from the Lindorm console
  password 'password',
  fetchsize 10000
);

-- Create an external OSS database and table
create database if not exists external_oss_db location 'oss://<bucketname>/warehouse/db';
create table if not exists external_oss_db.lindorm_table_1(id string, name string) using parquet location 'oss://<bucketname>/warehouse/db/lindorm_table_1/';

-- Insert data from the Lindorm wide table into the external OSS table
insert into external_oss_db.lindorm_table_1 select * from tb;

-- Read data from the external OSS table
select * from external_oss_db.lindorm_table_1;

Parameter description:

Parameter	Description
jar	The OSS path where the JAR package that you downloaded in Step 2 is stored.
driver	The Lindorm JDBC Driver dependency. The value is fixed at `com.aliyun.lindorm.table.client.Driver`.
url	The VPC endpoint for LindormTable SQL that you obtained in Step 1. The format is `VPC endpoint for LindormTable SQL/database name`. For example: `jdbc:lindorm:table:url=http://ld-bp1ttz8833x9c****-proxy-lindorm.lindorm.rds.aliyuncs.com:30060/default`
dbtable	The name of the Lindorm wide table database. In this example, the value is `tb`.
user	If you forget the user password, you can change it in the cluster management system of LindormTable.
password
fetchsize	The number of data rows to read per connection. You can adjust this value as needed. Important You must configure the fetchsize parameter when you submit a Spark job. Otherwise, data cannot be queried.

Click Execute Now.
After a Spark job runs successfully, you can view the table data in the Log section on the Spark Jar Development page and the Application List tab. For more information, see Spark development editor.