DataWorks:External catalog collection method

Last Updated: Dec 18, 2024

In Data Map, you can use the external catalog feature of a StarRocks data source to collect the metadata of external data sources. This topic describes the required configurations for creating an external catalog to collect the metadata of an external data source. After you complete the configurations and collect the metadata of an external data source, you can search for and view the metadata, such as the information about the tables and fields of the external data source, in Data Map.

Background information

By default, Data Map obtains only the metadata of the StarRocks internal catalog after you add a StarRocks data source to DataWorks and start metadata collection. If you want to obtain the metadata of an external data source by using the StarRocks external catalog feature, you must establish a connection with the external data source that is associated with the external catalog in DataWorks and collect the metadata. After the collection is complete, Data Map automatically associates the metadata of the external data source with the external catalog. You can view the external catalog and the associated metadata of the external data source in the StarRocks data source.

Prerequisites

  • A DataWorks workspace is created. For more information, see Create a workspace.

  • A StarRocks database is added to DataWorks as a StarRocks data source. For more information, see StarRocks data source.

  • If you want to collect the metadata of an external data source for which an IP address whitelist is enabled, make sure that the permissions related to the IP address whitelist are configured. For more information, see Configure IP address whitelists for metadata collection.

Limits

  • Elasticsearch external catalogs are not supported.

  • Object Storage Service (OSS) Paimon catalogs are not supported.

Procedure

In this topic, a MySQL external catalog supported by StarRocks is used to collect the metadata of a MySQL data source. You create a MySQL database named mysql_catalog_db and, in Data Map, create and configure a metadata crawler of the MySQL data source type. After the crawler collects the metadata of the MySQL database, you can search for and view the metadata of the mysql_catalog_db database on the Search page, with StarRocks selected on the Data Source tab.

Step 1: Prepare data

Create a MySQL database

Create a MySQL database named mysql_catalog_db and create a sample table named mysql_catalog_table in the database. Sample script:

CREATE TABLE mysql_catalog_table(
  catalog_table_id INT,
  catalog_table_name VARCHAR(255)
);
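
If you want to verify queries against the external catalog later, you can optionally insert a few sample rows. The following is a minimal sketch with placeholder values:

-- Insert placeholder rows so that later test queries return data.
INSERT INTO mysql_catalog_table (catalog_table_id, catalog_table_name)
VALUES
  (1, 'sample_name_1'),
  (2, 'sample_name_2');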

Prepare the MySQL JDBC driver package

You must upload a MySQL Java Database Connectivity (JDBC) driver package whose version matches your MySQL database to OSS.

  1. Log on to the OSS console. In the left-side navigation pane, click Buckets.

  2. On the Buckets page, find the desired bucket and click the bucket name to go to the Objects page. In this example, the catalog-bucket-oss bucket is used.

  3. On the Objects page, click Create Directory to create a directory that is used to store the JAR package. In the Create Directory panel, set Directory Name to libs and click OK.

  4. Go to the libs directory that you created and click Upload Object. In the Files to Upload section, click Select Files and add the mysql-connector-java-8.0.28.jar JDBC driver JAR package. Then, click Upload Object.

  5. Find the JDBC driver JAR package that you uploaded and click View Details in the Actions column. In the View Details panel, click Set ACL. In the Set ACL panel, set the ACL parameter to Public Read/Write and click OK.

Step 2: Establish a connection with the external data source

  1. Go to the query list of the StarRocks instance.

    1. Log on to the EMR console. In the left-side navigation pane, choose EMR Serverless > StarRocks. The Instances tab appears.

    2. Find the StarRocks instance that you created and click Connect in the Actions column. On the page that appears, click the New Connection tab.

    3. On the New Connection tab, set the Region parameter to the region where the created StarRocks instance is deployed and the Instance parameter to the name of the created StarRocks instance, configure the Connection Name, Username, and Password parameters, and then click Test Network Connectivity. After the instance is connected, click OK to go to the query list of the StarRocks instance.

  2. Establish a connection with the MySQL data source in the StarRocks data source.

    1. On the Queries tab, click + File. In the Create a file dialog box, configure the Name and Storage path parameters and click Confirm.

    2. Double-click the name of the file that you created under All files to go to the configuration tab of the StarRocks instance. Enter the following sample external connection script. For more information, see the Sample configuration of a StarRocks external catalog section in this topic.

      CREATE EXTERNAL CATALOG mysql_db_catalog PROPERTIES (
          "driver_class" = "com.mysql.cj.jdbc.Driver",
          "driver_url" = "https://catalog-bucket-oss.oss-cn-hangzhou-internal.aliyuncs.com/libs/mysql-connector-java-8.0.28.jar",
          "type" = "jdbc",
          "user" = "<UserName>",
          "password" = "<PassWord>",
          "jdbc_uri" = "jdbc:mysql://xxx:3306/mysql_catalog_db"
      );
      Note
      • mysql_db_catalog is the name of the external catalog that is created for the MySQL data source. You can change the name based on your business requirements.

      • Replace UserName and PassWord with the username and password of your MySQL database.

      • jdbc_uri specifies the Uniform Resource Identifier (URI) of the external database that you created.

    3. After you edit the script, click Run. After the script is run, you can view the information about the related table on the Databases tab.
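
      You can also verify the connection from the SQL side by running statements like the following in the query editor. This is a minimal sketch that uses the catalog, database, and table names from this example:

      -- List all catalogs. mysql_db_catalog is expected to appear in the output.
      SHOW CATALOGS;
      -- List the databases that are visible through the external catalog.
      SHOW DATABASES FROM mysql_db_catalog;
      -- Query the sample table by its fully qualified name: <catalog>.<database>.<table>.
      SELECT * FROM mysql_db_catalog.mysql_catalog_db.mysql_catalog_table LIMIT 10;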

Step 3: Configure the data sources

  1. Log on to the DataWorks console. In the top navigation bar, select the desired region. Then, click Workspace in the left-side navigation pane. On the Workspaces page, find the workspace that you created and click Manage in the Actions column. The SettingCenter page appears.

  2. In the left-side navigation pane of the SettingCenter page, choose Data Sources > Data Sources. On the Data Sources page, add the StarRocks and MySQL data sources. For more information, see Add a StarRocks data source and Add a MySQL data source.

    Note

    Set the Configuration Mode parameter to Connection String Mode when you add the MySQL data source. The same setting applies when you add any other external data source of the JDBC type.

Step 4: Configure a metadata crawler

Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Metadata Management > Data Map. On the page that appears, click Go to Data Map. Then, configure a metadata crawler in Data Map.

Collect the metadata of the StarRocks internal catalog

  1. In the left-side navigation pane of the DataMap page, click the metadata collection icon. On the page that appears, find the StarRocks data source type and click Manage in the upper-right corner.

  2. On the StarRocks Metadata Collection page, click the Data Sources for Which No Crawler Is Created tab. On the tab, set the Workspace parameter to your workspace and the Data Source Name parameter to the name of the StarRocks data source that you added in SettingCenter.

  3. Configure a metadata crawler.

    Find the desired StarRocks data source and click Create Crawler in the Actions column. In the Configure Collection Plan dialog box, set the Resource Group Name parameter to the name of the resource group that you created and click Test Network Connectivity. Wait until the connectivity test is successful, configure the Collection Plan parameter, and then click Confirm.

  4. Run the metadata crawler.

    Click the Data Sources for Which Crawler Is Created tab. On the tab, find the collection plan that you configured. Click Run in the Actions column. Wait until the Status of the collection plan changes to Successful.

Collect the metadata of the StarRocks external catalog

You must also perform the steps described in Collect the metadata of the StarRocks internal catalog for the MySQL data source to collect its metadata. Otherwise, the external catalog information of the MySQL data source cannot be found.

Note

If you create a metadata crawler for a MySQL data source, select the MySQL data source type on the Collect Metadata page.

Step 5: Search for metadata

Wait until the StarRocks and MySQL metadata collection tasks are complete. Then, click the search icon in the left-side navigation pane of the DataMap page to go to the search page.

  1. On the Data Source tab in the Type section, click StarRocks. In the Filter Conditions section, set the Instance parameter to the StarRocks instance that you created, the Data Catalog parameter to the name of the external catalog that you created for the MySQL data source, and the Database parameter to the corresponding MySQL database. You can then view the MySQL catalog information in the StarRocks data source.

  2. Click MySQL on the Data Source tab in the Type section. In the Filter Conditions section, set the Database parameter to the name of the MySQL database that you created. Then, check whether the table information is consistent with what you viewed in the StarRocks data source.

  3. View the table details.

    You can click the name of a table to view its details.

Sample configuration of a StarRocks external catalog

Syntax used by the sample configuration of a StarRocks external catalog

CREATE EXTERNAL CATALOG <Catalog_Name> COMMENT '' PROPERTIES("type"="","xxx1"="","xxx2"="");
Note

The Catalog_Name parameter specifies the name of the external catalog. You can configure the catalog name based on your business requirements.
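
After an external catalog is created, it can be managed with standard catalog statements. The following is a minimal sketch, assuming a catalog named my_catalog; SET CATALOG requires StarRocks 3.0 or later:

-- Switch the current session to the external catalog (StarRocks 3.0 and later).
SET CATALOG my_catalog;
-- Switch back to the internal catalog.
SET CATALOG default_catalog;
-- Delete the external catalog when it is no longer needed.
DROP CATALOG my_catalog;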

The following examples describe how to use the StarRocks external catalog feature to collect the metadata of different types of external data sources. Each example shows the collection method, the connection method, and a sample configuration of a StarRocks external catalog.

Default Catalog

default

If the default catalog is used, the internal metadata of StarRocks is collected by default. In this case, you do not need to configure an external catalog. For more information, see Overview of catalogs.
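
In StarRocks, the internal catalog is named default_catalog, so internal objects can be addressed with the same three-part naming that external catalogs use. A minimal sketch in which db_name and table_name are placeholders:

-- Browse and query the internal catalog; db_name and table_name are placeholders.
SHOW DATABASES FROM default_catalog;
SELECT * FROM default_catalog.db_name.table_name LIMIT 10;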

ODPS Catalog

VPC

CREATE EXTERNAL CATALOG cwy_odps_vpc_hangzhou_catalog PROPERTIES(
    "type"="odps",
    "odps.access.id"="<AccessId>",
    "odps.access.key"="<AccessKey>",
    "odps.endpoint"="http://service.cn-hangzhou-vpc.maxcompute.aliyun-inc.com/api",
    "odps.project"="<ODPSProject>"
);
Note

You must replace AccessId, AccessKey, and ODPSProject in the sample code with the actual values. The sample value of the odps.endpoint parameter, http://service.cn-hangzhou-vpc.maxcompute.aliyun-inc.com/api, is the VPC endpoint for the China (Hangzhou) region. Replace the value with the MaxCompute endpoint of your region. For more information, see Endpoints.

Hive Catalog

Hive Metastore (HMS)

CREATE EXTERNAL CATALOG `hive_catalog_hms` PROPERTIES (
    "hive.metastore.type" = "hive", 
    "hive.metastore.uris" = "thrift://192.xxx.xxx.xxx:9083", 
    "type" = "hive" 
);
Note

Replace the value of the hive.metastore.uris parameter in the sample code with the value of the same parameter in the hive-site.xml configuration file of your Hive data source.
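
StarRocks caches the metadata of Hive tables, so changes that are made on the Hive side may not be visible immediately. If necessary, you can manually refresh the cached metadata of a table. A minimal sketch in which hive_db and hive_table are placeholders:

-- Refresh the cached metadata of one table in the Hive external catalog.
REFRESH EXTERNAL TABLE hive_catalog_hms.hive_db.hive_table;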

Data Lake Formation (DLF)

CREATE EXTERNAL CATALOG `hive_catalog_dlf` PROPERTIES (
    "hive.metastore.type" = "dlf",
    "dlf.catalog.id" = "123456",
    "type" = "hive"
);
Note

Replace the value of the dlf.catalog.id parameter in the sample code with the value of the same parameter in the hive-site.xml configuration file of your Hive data source.

Iceberg Catalog

Hive

CREATE EXTERNAL CATALOG iceberg_catalog_by_hms
PROPERTIES (
  "iceberg.catalog.type" = "hive",
  "hive.metastore.uris" = "thrift://192.xxx.xxx.xxx:9083",
  "type" = "iceberg"
);
Note

If you use Hive Metastore as the metadata service when you configure an Iceberg external catalog, replace the value of the hive.metastore.uris parameter in the sample code with the value of the same parameter in the hive-site.xml configuration file of your Hive data source.

Hudi Catalog

Hive

CREATE EXTERNAL CATALOG hudi_catalog_by_hms
PROPERTIES (
  "hive.metastore.type" = "hive",
  "hive.metastore.uris" = "thrift://192.xxx.xxx.xxx:9083",
  "type" = "hudi"
);
Note

If you use Hive Metastore as the metadata service when you configure a Hudi external catalog, replace the value of the hive.metastore.uris parameter in the sample code with the value of the same parameter in the hive-site.xml configuration file of your Hive data source.

Data Lake Formation (DLF)

CREATE EXTERNAL CATALOG hudi_catalog_by_dlf
PROPERTIES (
  "hive.metastore.type" = "dlf",
  "dlf.catalog.id" = "123456",
  "type" = "hudi"
);
Note

If you use Data Lake Formation (DLF) as the metadata service when you configure a Hudi external catalog, replace the value of the dlf.catalog.id parameter in the sample code with the value of the same parameter in the hive-site.xml configuration file of your Hive data source.

Delta Lake Catalog

Hive

CREATE EXTERNAL CATALOG `deltalake_hive` PROPERTIES (
    "hive.metastore.type" = "hive",
    "hive.metastore.uris" = "thrift://192.xxx.xxx.xxx:9083",
    "type" = "deltalake",
    "aliyun.oss.access_key" = "<AccessId>",
    "aliyun.oss.secret_key" = "<AccessKey>",
    "aliyun.oss.endpoint" = "<EndPoint>"
);
Note
  • Replace the value of the hive.metastore.uris parameter in the sample code with the value of the same parameter in the hive-site.xml configuration file of your Hive data source.

  • Replace AccessId and AccessKey in the sample code with the actual values.

  • Replace EndPoint in the sample code based on the OSS endpoints of different regions. For more information, see Regions and endpoints. The sample value is http://oss-cn-hangzhou-internal.aliyuncs.com.

Data Lake Formation (DLF)

CREATE EXTERNAL CATALOG `deltalake_dlf` PROPERTIES (
    "hive.metastore.type" = "dlf",
    "dlf.catalog.id" = "123456",
    "type" = "deltalake"
);
Note

Replace the value of the dlf.catalog.id parameter in the sample code with the value of the same parameter in the hive-site.xml configuration file of your Hive data source.

JDBC Catalog

MySQL

CREATE EXTERNAL CATALOG cwy_mysql_db_catalog PROPERTIES (
    "driver_class" = "com.mysql.cj.jdbc.Driver",
    "driver_url" = "https://bucket-hz.oss-cn-hangzhou-internal.aliyuncs.com/libs/mysql-connector-java-8.0.28.jar",
    "type" = "jdbc",
    "user" = "<UserName>",
    "password" = "<PassWord>",
    "jdbc_uri" = "jdbc:mysql://xxx:3306/<database>"
);
Note
  • driver_url is the path where the JDBC driver JAR package is stored in OSS. For more information about how to upload or download a JDBC driver JAR package in OSS, see Simple download.

  • Replace UserName and PassWord in the sample code with the actual values.

  • The jdbc_uri parameter specifies the URI of the external database that you created. The database parameter specifies the name of the database that you created.
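
After the JDBC catalog is created, MySQL tables can be queried with three-part names and even joined with internal StarRocks tables. A minimal sketch in which internal_db, internal_table, and the id column are placeholders:

-- Join a MySQL table, accessed through the JDBC catalog, with an internal table.
SELECT m.catalog_table_id, m.catalog_table_name
FROM cwy_mysql_db_catalog.mysql_catalog_db.mysql_catalog_table AS m
JOIN default_catalog.internal_db.internal_table AS i
  ON m.catalog_table_id = i.id;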

PostgreSQL

CREATE EXTERNAL CATALOG `cwy_postgressql_db_catalog` PROPERTIES (
    "driver_class" = "org.postgresql.Driver",
    "driver_url" = "https://bucket-hz.oss-cn-hangzhou-internal.aliyuncs.com/libs/postgresql-42.3.3.jar",
    "type" = "jdbc",
    "user" = "<UserName>",
    "password" = "<PassWord>",
    "jdbc_uri" = "jdbc:postgresql://xxxx:5432/<database>"
);
Note
  • driver_url is the path where the JDBC driver JAR package is stored in OSS. For more information about how to upload or download a JDBC driver JAR package in OSS, see Simple download.

  • Replace UserName and PassWord in the sample code with the actual values.

  • The jdbc_uri parameter specifies the URI of the external database that you created. The database parameter specifies the name of the database that you created.

Paimon Catalog

Hive (StarRocks 3.1 and later)

CREATE EXTERNAL CATALOG paimon_by_hive
COMMENT ''
PROPERTIES (
  "type" = "paimon",
  "paimon.catalog.warehouse" = "oss://bucket-hz/paimon_warehouse/",
  "paimon.catalog.type" = "hive",
  "hive.metastore.uris" = "thrift://192.xxx.xxx.xxx:9083"
);
Note

If you use Hive as the metadata service when you configure a Paimon external catalog, take note of the following items:

  • The paimon.catalog.warehouse parameter is a property in the Paimon data lake framework. The parameter specifies the root path of a data warehouse.

  • You can replace the value of the hive.metastore.uris parameter in the sample code with the value of the same parameter in the hive-site.xml configuration file of your Hive data source.

Data Lake Formation (StarRocks 3.1 and later)

CREATE EXTERNAL CATALOG paimon_by_dlf
PROPERTIES (
  "type" = "paimon",
  "paimon.catalog.type" = "dlf",
  "paimon.catalog.warehouse" = "oss://bucket-hz/paimon_warehouse/",
  "aliyun.oss.endpoint" = "<EndPoint>",
  "dlf.catalog.id" = "123456"
);
Note

If you use Data Lake Formation (DLF) as the metadata service when you configure a Paimon external catalog, take note of the following items:

  • The paimon.catalog.warehouse parameter is a property in the Paimon data lake framework. The parameter specifies the root path of a data warehouse.

  • You can replace EndPoint in the sample code based on the OSS endpoints of different regions. For more information, see Regions and endpoints. The sample value is oss-cn-hangzhou-internal.aliyuncs.com.

  • You can replace the value of the dlf.catalog.id parameter in the sample code with the value of the same parameter in the hive-site.xml configuration file of your Hive data source.

Unified Catalog

Hive (StarRocks 3.2 and later)

CREATE EXTERNAL CATALOG unified_by_hive
PROPERTIES (
  "type" = "unified",
  "unified.metastore.type" = "hive",
  "hive.metastore.uris" = "thrift://192.xxx.xxx.xxx:9083"
);
Note

If you use Hive as the metadata service when you configure a unified external catalog, replace the value of the hive.metastore.uris parameter in the sample code with the value of the same parameter in the hive-site.xml configuration file of your Hive data source.
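
Because a unified catalog exposes data sources such as Hive, Iceberg, Hudi, and Delta Lake through a single entry point, you query it the same way as any other catalog. A minimal sketch in which db_name and table_name are placeholders:

-- Browse and query through the unified catalog; db_name and table_name are placeholders.
SHOW DATABASES FROM unified_by_hive;
SELECT * FROM unified_by_hive.db_name.table_name LIMIT 10;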