AnalyticDB for MySQL: Use Spark SQL to access Lindorm data

Last Updated: Nov 27, 2023

AnalyticDB for MySQL Data Lakehouse Edition (V3.0) allows you to use Spark SQL to access Lindorm data. This topic describes how to use Spark SQL to access the data of Hive tables and wide tables in Lindorm.

Prerequisites

  • A single-zone Lindorm instance is created. For more information, see Create an instance.

    Important

    You must activate LindormTable, Lindorm Distributed Processing System (LDPS), and LindormDFS for the Lindorm instance. For information about how to activate LDPS, see Activate LDPS and modify the configurations.

  • Object Storage Service (OSS) is activated. A bucket and a directory are created in OSS. For more information, see Activate OSS, Create buckets, and Manage directories.

  • An AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster is created. For more information, see Create a cluster.

  • A job resource group is created in the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster. For more information, see Create a resource group.

  • A database account is created for the AnalyticDB for MySQL Data Lakehouse Edition (V3.0) cluster.

Preparations

  1. Log on to the Lindorm console and go to the Instance Details page to view the vSwitch ID of the Lindorm instance.

  2. Log on to the Lindorm console and go to the Access Control page. Click the Security Groups tab to view the ID of the security group to which the Lindorm instance belongs. For information about how to add a security group, see Add a security group.
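
    Both IDs are used in the ENI-related SET statements of the Spark SQL examples in the following sections. For reference, this minimal sketch uses the placeholder values from those examples; replace them with your own IDs:

    SET spark.adb.eni.vswitchId=vsw-bp14pj8h0k5p0kwu3****;
    SET spark.adb.eni.securityGroupId=sg-bp11m93k021tp4ca****;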

Access the data of a Hive table in Lindorm

  1. Obtain the hdfs-site configuration information of your Hadoop Distributed File System (HDFS) client.

    Note

    You can log on to the Lindorm console, go to the Database Connection page, and then click the File Engine tab. Click Generate Configuration Items to obtain the hdfs-site configuration information.
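
    Each item in the generated hdfs-site configuration maps to one SET statement in the format SET spark.hadoop.<name>=<value>;, as used in Step 5. The following minimal sketch shows the mapping for a single item; the property name and value are placeholders taken from the example below, so use the values generated for your own instance:

    -- hdfs-site item: name = dfs.nameservices, value = ld-bp1ttz8833x9c****
    -- becomes the following Spark SQL statement:
    SET spark.hadoop.dfs.nameservices=ld-bp1ttz8833x9c****;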

  2. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition (V3.0) tab, find the cluster that you want to manage and click the cluster ID.

  3. In the left-side navigation pane, choose Job Editor > SQL Development.

  4. On the SQLConsole tab, select the Spark engine and a job resource group.

  5. On the SQLConsole tab, enter the following code:

    SET spark.adb.connectors=oss,external_hive;
    SET spark.adb.eni.enabled=true;
    SET spark.adb.eni.vswitchId=vsw-bp14pj8h0k5p0kwu3****;
    SET spark.adb.eni.securityGroupId=sg-bp11m93k021tp4ca****;
    SET spark.sql.catalog.lindorm_catalog=org.apache.spark.sql.hive.V2HiveExternalCatalog;
    SET spark.adb.eni.adbHostAlias.enabled=true;
    SET spark.sql.catalogImplementation=hive;
    -- Configure the Hive metastore and HDFS-related parameters.
    SET spark.sql.catalog.lindorm_catalog.spark.hadoop.hive.metastore.uris=thrift://ld-bp1ttz8833x9c****-proxy-ldps-hms.lindorm.aliyuncs.com:9083;
    SET spark.hadoop.dfs.nameservices=ld-bp1ttz8833x9c****;
    SET spark.hadoop.dfs.client.failover.proxy.provider.ld-bp1ttz8833x9c****=org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider;
    SET spark.hadoop.dfs.ha.automatic-failover.enabled=true;
    SET spark.hadoop.dfs.ha.namenodes.ld-bp1ttz8833x9c****=nn1,nn2;
    SET spark.hadoop.dfs.namenode.rpc-address.ld-bp1ttz8833x9c****.nn1=ld-bp1ttz8833x9c****-master1-001.lindorm.rds.aliyuncs.com:8020;
    SET spark.hadoop.dfs.namenode.rpc-address.ld-bp1ttz8833x9c****.nn2=ld-bp1ttz8833x9c****-master2-001.lindorm.rds.aliyuncs.com:8020;
    SET spark.hadoop.dfs.namenode.http-address.ld-bp1ttz8833x9c****.nn1=ld-bp1ttz8833x9c****-master1-001.lindorm.rds.aliyuncs.com:50071;
    SET spark.hadoop.dfs.namenode.http-address.ld-bp1ttz8833x9c****.nn2=ld-bp1ttz8833x9c****-master2-001.lindorm.rds.aliyuncs.com:50071;
    
    -- Create an external database and an external table in OSS.
    CREATE DATABASE external_oss_db LOCATION 'oss://testBucketName/warehouse/db';
    CREATE TABLE external_oss_db.test(id int, name string) USING parquet LOCATION 'oss://testBucketName/warehouse/db/test';
    
    -- Insert the data of the Hive table into the OSS external table.
    INSERT INTO external_oss_db.test SELECT * FROM lindorm_catalog.spark_test.test;
    
    -- Read the data of the OSS external table.
    SELECT * FROM external_oss_db.test;

    The parameters are described as follows:

    • spark.adb.connectors (required): The names of the built-in connectors of AnalyticDB for MySQL Spark. Separate multiple names with commas (,). Valid values: oss, hudi, delta, adb, odps, and external_hive. In this example, oss and external_hive are used.

    • spark.adb.eni.enabled (required): Specifies whether to enable an elastic network interface (ENI). If you use external tables to access other external data sources, you must enable ENI. Valid values: true and false. In this example, this parameter is set to true.

    • spark.adb.eni.vswitchId (required): The vSwitch ID of the Lindorm instance, which you obtained during preparations.

    • spark.adb.eni.securityGroupId (required): The security group ID of the Lindorm instance, which you obtained during preparations.

    • spark.sql.catalog.lindorm_catalog (required): The catalog method that Spark SQL uses to access the data source. Set this parameter to org.apache.spark.sql.hive.V2HiveExternalCatalog. Note: You can customize the catalog_name part in the parameter name. In this example, lindorm_catalog is used.

    • spark.adb.eni.adbHostAlias.enabled (required): Specifies whether to automatically write the resolved domain name information that AnalyticDB for MySQL requires to a mapping table of domain names and IP addresses. Valid values: true and false. If you use an ENI to read data from or write data to EMR Hive, you must set this parameter to true.

    • HDFS-related parameters (required): The hdfs-site configuration information that you obtained in Step 1. Format: SET spark.hadoop.<name>=<value>;. Note: Replace <name> and <value> with the actual values from the configuration information.

    • location (required): The OSS path of the external table. Format: oss://<bucketname>/<database>/<table>. In this example, oss://testBucketName/warehouse/db/test is used. Note: Replace the path with your actual OSS path.

  6. Click Run Now.

  7. After the Spark job is run, go to the Spark JAR Development page, click the Applications tab, and then click Log in the Actions column to view the table data. For more information, see Spark editor.
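
Before you run the INSERT statement, you can optionally check that the Lindorm catalog is reachable by running read-only statements on the SQLConsole tab. The following is a minimal sketch that assumes the spark_test database and test table from the example above exist, and that the catalog supports standard Spark SQL metadata statements; the SET statements from Step 5 must be submitted in the same job:

    -- List the tables in the Lindorm Hive database (names assumed from the example).
    SHOW TABLES IN lindorm_catalog.spark_test;
    -- Preview a few rows of the Hive table before copying it to OSS.
    SELECT * FROM lindorm_catalog.spark_test.test LIMIT 10;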

Access the data of a wide table in Lindorm

  1. Obtain the Virtual Private Cloud (VPC) endpoint of LindormTable that is used for access over the HBase Java API. For more information, see View endpoints.

  2. Download the package that contains the JAR files required to connect to the Lindorm instance, decompress it, and upload the JAR files to OSS. For more information, see the "Upload an object" section of the Get started by using the OSS console topic.

  3. Log on to the AnalyticDB for MySQL console. In the upper-left corner of the console, select a region. In the left-side navigation pane, click Clusters. On the Data Lakehouse Edition (V3.0) tab, find the cluster that you want to manage and click the cluster ID.

  4. In the left-side navigation pane, choose Job Editor > SQL Development.

  5. On the SQLConsole tab, select the Spark engine and a job resource group.

  6. On the SQLConsole tab, enter the following code:

    SET spark.adb.eni.enabled=true;
    SET spark.adb.eni.vswitchId=vsw-bp14pj8h0k5p0kwu3****;
    SET spark.adb.eni.securityGroupId=sg-bp11m93k021tp4ca****;
    SET spark.sql.catalog.lindorm_table=com.alibaba.lindorm.ldspark.datasources.v2.LindormCatalog;
    SET spark.sql.catalog.lindorm_table.url=ld-bp1ttz8833x9c****-proxy-lindorm.lindorm.rds.aliyuncs.com:30020;
    SET spark.sql.catalog.lindorm_table.username=test;
    SET spark.sql.catalog.lindorm_table.password=password;
    ADD JAR "oss://<bucketname>/hbase_jars/*";
    
    -- Create an external database and an external table in OSS.
    CREATE DATABASE IF NOT EXISTS external_oss_db LOCATION 'oss://testBucketName/warehouse/db';
    CREATE TABLE IF NOT EXISTS external_oss_db.lindorm_table_2(id string, name string) USING parquet LOCATION 'oss://testBucketName/warehouse/db/lindorm_table_2';
    
    -- Insert the data of the wide table into the OSS external table.
    INSERT INTO external_oss_db.lindorm_table_2 SELECT * FROM lindorm_table.spark_test.tb;
    
    -- Read the data of the OSS external table.
    SELECT * FROM external_oss_db.lindorm_table_2;

    The parameters are described as follows:

    • spark.adb.eni.enabled (required): Specifies whether to enable an elastic network interface (ENI). If you use external tables to access other external data sources, you must enable ENI. Valid values: true and false. In this example, this parameter is set to true.

    • spark.adb.eni.vswitchId (required): The vSwitch ID of the Lindorm instance, which you obtained during preparations.

    • spark.adb.eni.securityGroupId (required): The security group ID of the Lindorm instance, which you obtained during preparations.

    • spark.sql.catalog.lindorm_table (required): The catalog method that Spark SQL uses to access the data source. Set this parameter to com.alibaba.lindorm.ldspark.datasources.v2.LindormCatalog. Note: You can customize the catalog_name part in the parameter name. In this example, lindorm_table is used.

    • spark.sql.catalog.lindorm_table.url (required): The VPC endpoint that you obtained in Step 1 for access over the HBase Java API.

    • spark.sql.catalog.lindorm_table.username (required): The username of LindormTable.

    • spark.sql.catalog.lindorm_table.password (required): The password of LindormTable. If you forget your password, you can change it in the cluster management system of LindormTable. For more information, see Manage users.

    • jar (required): The OSS path of the JAR package that you downloaded in Step 2. In this example, oss://testBucketName/hbase_jars/* is used.

    • location (required): The OSS path of the external table. Format: oss://<bucketname>/<database>/<table>. In this example, oss://testBucketName/warehouse/db/lindorm_table_2 is used. Note: Replace the path with your actual OSS path.

  7. Click Run Now.

  8. After the Spark job is run, go to the Spark JAR Development page, click the Applications tab, and then click Log in the Actions column to view the table data. For more information, see Spark editor.
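
Before you run the INSERT statement, you can optionally verify connectivity to the wide table by running read-only statements on the SQLConsole tab. The following is a minimal sketch that assumes the spark_test.tb wide table from the example above exists, and that the catalog supports standard Spark SQL metadata statements; the SET statements and the ADD JAR statement from Step 6 must be submitted in the same job:

    -- List the tables visible through the Lindorm wide table catalog (names assumed from the example).
    SHOW TABLES IN lindorm_table.spark_test;
    -- Preview a few rows of the wide table before copying it to OSS.
    SELECT * FROM lindorm_table.spark_test.tb LIMIT 10;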