AnalyticDB for MySQL allows you to use Spark SQL to read data from Lindorm Hive tables and wide tables by routing data through OSS external tables.
Prerequisites
Before you begin, make sure you have:
A single-zone Lindorm instance with LindormTable, Lindorm Distributed Processing System (LDPS), and LindormDFS activated. For information about how to activate LDPS, see Activate LDPS and modify the configurations
An OSS bucket and a directory created in OSS. See Activate OSS, Create a bucket, and Create a directory
An AnalyticDB for MySQL Data Lakehouse Edition cluster with a job resource group and a database account. See Create a resource group
How it works
Both access paths follow the same three-stage flow:
Configure Spark session settings — enable Elastic Network Interface (ENI), set network parameters, and configure the Lindorm catalog.
Create an OSS external table as an intermediate destination.
Insert Lindorm data into the external table, then query it with
SELECT.
The two paths differ in how they connect to Lindorm:
Hive table | Wide table | |
Catalog class |
|
|
Connection method | Hive Metastore URI + Hadoop Distributed File System (HDFS) NameNode addresses | VPC endpoint for HBase Java API |
Extra setup | Get hdfs-site configuration from Lindorm technical support | Download and upload a JAR package to OSS |
Preparations
Before running either access path, collect the following from the Lindorm console:
Log in to the Lindorm console and go to Instance Details. Note the vSwitch ID of your Lindorm instance.
Go to Access Control > Security Groups tab. Note the security group ID of your Lindorm instance. To add a security group, see Add a security group.
Read data from a Hive table
Before you start
Contact Lindorm technical support (DingTalk ID: s0s3eg3) to get the hdfs-site configuration for your Lindorm instance. Provide your instance ID. The configuration contains the Hive Metastore URI and HDFS NameNode addresses used in the SQL below.
Steps
Log in to the AnalyticDB for MySQL console. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, click your cluster ID.
In the left-side navigation pane, choose Job Development > SQL Development.
On the SQLConsole tab, select the Spark engine and your job resource group.
Enter and run the following SQL. It sets up the Lindorm Hive catalog over ENI, creates an OSS external table, copies Hive table data into it, and reads the result. Replace the placeholders with values from your preparations and the hdfs-site configuration:
SET spark.adb.connectors = oss,external_hive; SET spark.adb.eni.enabled=true; SET spark.adb.eni.vswitchId=vsw-bp14pj8h0k5p0kwu3****; SET spark.adb.eni.securityGroupId=sg-bp11m93k021tp4ca****; SET spark.sql.catalog.lindorm_catalog = org.apache.spark.sql.hive.V2HiveExternalCatalog; SET spark.adb.eni.adbHostAlias.enabled=true; SET spark.sql.catalogImplementation = hive; -- HDFS-related parameters from hdfs-site configuration SET spark.sql.catalog.lindorm_catalog.spark.hadoop.hive.metastore.uris=thrift://ld-bp1ttz8833x9c****-proxy-ldps-hms.lindorm.aliyuncs.com:9083; SET spark.hadoop.dfs.nameservices=ld-bp1ttz8833x9c****; SET spark.hadoop.dfs.client.failover.proxy.provider.ld-bp1ttz8833x9c****=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider; SET spark.hadoop.dfs.ha.automatic-failover.enabled=true; SET spark.hadoop.dfs.ha.namenodes.ld-bp1ttz8833x9c****=nn1,nn2; SET spark.hadoop.dfs.namenode.rpc-address.ld-bp1ttz8833x9c****.nn1=ld-bp1ttz8833x9c-master1-001.lindorm.rds.aliyuncs.com:8020; SET spark.hadoop.dfs.namenode.rpc-address.ld-bp1ttz8833x9c****.nn2=ld-bp1ttz8833x9c****-master2-001.lindorm.rds.aliyuncs.com:8020; SET spark.hadoop.dfs.namenode.http-address.ld-bp1ttz8833x9c****.nn1=ld-bp1ttz8833x9c****-master1-001.lindorm.rds.aliyuncs.com:50071; SET spark.hadoop.dfs.namenode.http-address.ld-bp1ttz8833x9c****.nn2=ld-bp1ttz8833x9c****-master2-001.lindorm.rds.aliyuncs.com:50071; -- Create an external database and table in OSS CREATE DATABASE external_oss_db location 'oss://testBucketName/warehouse/db'; CREATE TABLE external_oss_db.test(id int, name string) using parquet location 'oss://testBucketName/warehouse/db/test'; -- Copy data from the Lindorm Hive table into the OSS external table INSERT INTO external_oss_db.test SELECT * FROM lindorm_catalog.spark_test.test; -- Query the data SELECT * FROM external_oss_db.test;The following table describes the key parameters.
Parameter
Required
Description
spark.adb.connectorsYes
Connectors to enable. Set to
oss,external_hivefor this use case. Valid values:oss,hudi,delta,adb,odps,external_hive,jindo.spark.adb.eni.enabledYes
Enables ENI for connecting to external data sources. Set to
true.spark.adb.eni.vswitchIdYes
The vSwitch ID noted during preparations.
spark.adb.eni.securityGroupIdYes
The security group ID noted during preparations.
spark.sql.catalog.lindorm_catalogYes
Configures the Hive catalog. Set to
org.apache.spark.sql.hive.V2HiveExternalCatalog. You can replacelindorm_catalogwith a custom catalog name.spark.adb.eni.adbHostAlias.enabledYes
Writes domain name resolution mappings required by AnalyticDB for MySQL. If you use an ENI to read data from or write data to EMR Hive, you must set this parameter to
true.spark.sql.catalogImplementationYes
Set to
hive.HDFS-related parameters
Yes
From the hdfs-site configuration. Format:
SET spark.hadoop.<name>=<value>.locationYes
OSS path of the external table. Format:
oss://<bucket-name>/<database>/<table>.Click Execute.
After the Spark job completes, go to Spark JAR Development. On the Applications tab, click Logs in the Actions column to view the table data. For details, see Spark editor.
Read data from a wide table
Before you start
Get the VPC endpoint of LindormTable for access by using the HBase Java API. See the View the endpoints of LindormTable section.
Download the required JAR package, then extract and upload it to your OSS bucket. See the Upload an object section.
Steps
Log in to the AnalyticDB for MySQL console. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition tab, click your cluster ID.
In the left-side navigation pane, choose Job Development > SQL Development.
On the SQLConsole tab, select the Spark engine and your job resource group.
Enter and run the following SQL. It configures the LindormTable catalog over ENI using the HBase Java API endpoint, loads the JAR package, creates an OSS external table, copies wide table data into it, and reads the result. Replace the placeholders with your actual values:
SET spark.adb.eni.enabled=true; SET spark.adb.eni.vswitchId=vsw-bp14pj8h0k5p0kwu3****; SET spark.adb.eni.securityGroupId=sg-bp11m93k021tp4ca****; SET spark.sql.catalog.lindorm_table = com.alibaba.lindorm.ldspark.datasources.v2.LindormCatalog; SET spark.sql.catalog.lindorm_table.url=ld-bp1ttz8833x9c****-proxy-lindorm.lindorm.rds.aliyuncs.com:30020; SET spark.sql.catalog.lindorm_table.username=test; SET spark.sql.catalog.lindorm_table.password=password; add jar "oss://<bucketname>/hbase_jars/*"; -- Create an external database and table in OSS CREATE DATABASE IF NOT exists external_oss_db location 'oss://testBucketName/warehouse/db'; CREATE TABLE IF NOT exists external_oss_db.lindorm_table_2(id string, name string) using parquet location 'oss://testBucketName/warehouse/db/lindorm_table_2'; -- Copy data from the Lindorm wide table into the OSS external table INSERT INTO external_oss_db.lindorm_table_2 SELECT * FROM lindorm_table.spark_test.tb; -- Query the data SELECT * FROM external_oss_db.lindorm_table_2;The following table describes the key parameters.
Parameter
Required
Description
spark.adb.eni.enabledYes
Enables ENI for connecting to external data sources. Set to
true.spark.adb.eni.vswitchIdYes
The vSwitch ID noted during preparations.
spark.adb.eni.securityGroupIdYes
The security group ID noted during preparations.
spark.sql.catalog.lindorm_tableYes
Configures the LindormTable catalog. Set to
com.alibaba.lindorm.ldspark.datasources.v2.LindormCatalog. You can replacelindorm_tablewith a custom catalog name.spark.sql.catalog.lindorm_table.urlYes
The VPC endpoint for HBase Java API access obtained in step 1 of "Before you start".
spark.sql.catalog.lindorm_table.usernameYes
The LindormTable username.
spark.sql.catalog.lindorm_table.passwordYes
The LindormTable password. To reset the password, go to the LindormTable cluster management system.
jarYes
The OSS path of the JAR package uploaded in step 2 of "Before you start". Format:
oss://<bucket-name>/hbase_jars/*.locationYes
OSS path of the external table. Format:
oss://<bucket-name>/<database>/<table>.Click Execute.
After the Spark job completes, go to Spark JAR Development. On the Applications tab, click Logs in the Actions column to view the table data. For details, see Spark editor.
AnalyticDB for MySQL supports two methods to access data in Lindorm wide tables: using Java Database Connectivity (JDBC) and using the Spark Datasource V2 interface. The differences between the two methods are as follows:
The JDBC method supports accessing only the default database in a Lindorm wide table.
The Spark Datasource V2 interface method supports accessing any database in a wide table.
The Lindorm team reports no significant difference in efficiency between the two methods because the Spark Datasource V2 interface also concatenates SQL statements at the underlying layer.
Access wide table data using JDBC
Obtain the virtual private cloud (VPC) endpoint for LindormTable SQL. For more information, see View endpoints.
Download the Lindorm JDBC Driver dependency and upload the JAR package to OSS. For more information, see Upload a file.
NoteThe version of lindorm-all-client must be 2.1.6.
Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. Find the cluster that you want to manage and click the cluster ID.
In the navigation pane on the left, click Job Development > SQL Development.
In the SQLConsole window, select the Spark engine and the job resource group.
In the SQLConsole window, enter the following job content:
set spark.adb.eni.enabled=true; set spark.adb.eni.vswitchId=<vsw-bp14pj8h0k5p0kwu3****>; set spark.adb.eni.securityGroupId=<sg-bp11m93k021tp4ca****>; add jar oss://<bucketname>/hbase_jars/lindorm-all-client-2.1.6.jar; CREATE TEMPORARY VIEW tb USING org.apache.spark.sql.jdbc OPTIONS ( driver 'com.aliyun.lindorm.table.client.Driver', url 'jdbc:lindorm:table:url=http://ld-bp1ttz8833x9c****-proxy-lindorm.lindorm.rds.aliyuncs.com:30060/default', -- private network address dbtable 'tb', -- The database username obtained from the Lindorm console user 'test', -- The database password obtained from the Lindorm console password 'password', fetchsize 10000 ); -- Create an external OSS database and table create database if not exists external_oss_db location 'oss://<bucketname>/warehouse/db'; create table if not exists external_oss_db.lindorm_table_1(id string, name string) using parquet location 'oss://<bucketname>/warehouse/db/lindorm_table_1/'; -- Insert data from the Lindorm wide table into the external OSS table insert into external_oss_db.lindorm_table_1 select * from tb; -- Read data from the external OSS table select * from external_oss_db.lindorm_table_1;Parameter description:
Parameter
Description
jar
The OSS path where the JAR package that you downloaded in Step 2 is stored.
driver
The Lindorm JDBC Driver dependency. The value is fixed at
com.aliyun.lindorm.table.client.Driver.url
The VPC endpoint for LindormTable SQL that you obtained in Step 1. The format is
VPC endpoint for LindormTable SQL/database name.For example:
jdbc:lindorm:table:url=http://ld-bp1ttz8833x9c****-proxy-lindorm.lindorm.rds.aliyuncs.com:30060/defaultdbtable
The name of the Lindorm wide table database. In this example, the value is
tb.user
If you forget the user password, you can change it in the cluster management system of LindormTable.
password
fetchsize
The number of data rows to read per connection. You can adjust this value as needed.
ImportantYou must configure the fetchsize parameter when you submit a Spark job. Otherwise, data cannot be queried.
Click Execute Now.
After a Spark job runs successfully, you can view the table data in the Log section on the Spark Jar Development page and the Application List tab. For more information, see Spark development editor.