By creating a Kudu data source, you can enable Dataphin to read business data from Kudu or write data to Kudu. This topic describes how to create a Kudu data source.
Background information
Kudu provides features and data models similar to those of a relational database management system (RDBMS). It provides a storage structure similar to that of a relational database, allowing users to insert, update, and delete data in the same way as in a relational database. However, Kudu is only a storage layer and does not process data itself. It therefore depends on external Hadoop processing engines such as MapReduce, Spark, and Impala. Kudu stores data in a columnar format in the underlying Linux file system.
Kudu is suitable for hybrid transactional and analytical processing (HTAP) scenarios, such as the Internet of Things (IoT), which place high demands on data processing systems. From the early separation of OLTP and OLAP systems to the later Lambda architecture, traditional designs incur the complexity of replicating and synchronizing data between systems. Kudu's single data architecture avoids this complexity. For more information, see the Kudu official website.
Permissions
Only custom global roles with the permission to create data sources and the super administrator, administrator, domain architect, and project administrator roles can create data sources.
Procedure
In the top navigation bar of the Dataphin homepage, choose Management Center > Datasource Management.
On the Datasource page, click +Create Data Source.
On the Create Data Source page, select Kudu in the Big Data section.
If you have recently used Kudu, you can also select Kudu in the Recently Used section, or enter keywords in the search box to quickly filter for it.
On the Create Kudu Data Source page, configure the connection parameters.
Configure the basic information of the data source.
Parameter
Description
Datasource Name
The name must meet the following requirements:
It can contain only Chinese characters, letters, digits, underscores (_), and hyphens (-).
It cannot exceed 64 characters in length.
Datasource Code
After you configure the data source code, you can reference tables in the data source in Flink SQL tasks by using the data source code.table name or data source code.schema.table name format. If you need to automatically access the data source in the environment that corresponds to the current environment, use the variable format ${data source code}.table or ${data source code}.schema.table. For more information, see Dataphin data source table development method.
Important: The data source code cannot be modified after it is configured.
After the data source code is configured, you can preview data on the object details page in the asset directory and asset inventory.
Note: In Flink SQL, only MySQL, Hologres, MaxCompute, Oracle, StarRocks, Hive, and SelectDB data sources are currently supported.
Version
Select the corresponding Kudu version based on your actual situation. The supported versions are:
CDH5: 1.16
CDH6: 1.16
CDP7.1.3: 1.16
Data Source Description
A brief description of the data source. It cannot exceed 128 characters.
Data Source Configuration
Select the data source to be configured:
If the business data source distinguishes between production and development data sources, select Production + Development Data Source.
If the business data source does not distinguish between production and development data sources, select Production Data Source.
Tag
You can categorize and tag data sources using tags. For information about how to create tags, see Manage data source tags.
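The data source code reference formats described above can be illustrated with a small helper that builds the reference strings. This is a hypothetical sketch for clarity only, not part of any Dataphin API; the function name `build_table_ref` and the placeholder values are assumptions:

```python
def build_table_ref(code, table, schema=None, cross_env=False):
    """Build a Flink SQL table reference for a Dataphin data source.

    cross_env=True uses the ${...} variable form, which resolves to the
    production or development data source matching the current environment.
    """
    prefix = "${" + code + "}" if cross_env else code
    parts = [prefix] + ([schema] if schema else []) + [table]
    return ".".join(parts)

# Fixed reference: data source code.table name
print(build_table_ref("my_kudu", "orders"))
# -> my_kudu.orders

# Environment-aware reference: ${data source code}.schema.table
print(build_table_ref("my_kudu", "orders", schema="sales", cross_env=True))
# -> ${my_kudu}.sales.orders
```

The `${...}` form is useful when the same Flink SQL task is deployed to both the development and production environments, so that each deployment reads from its own data source.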
Configure the connection parameters between the data source and Dataphin.
If you select Production + Development Data Source, you must configure the connection information for both the production and development data sources. If you select Production Data Source, you only need to configure the connection information for the production data source.
Note: Typically, production and development data sources should be configured as different data sources to isolate the development environment from the production environment and reduce the impact of the development data source on the production data source. However, Dataphin also supports configuring them as the same data source with identical parameter values.
Parameter
Description
Connection Url
Enter the connection address of the Kudu data source. Example format: ip1:port1,ip2:port2.
Kerberos
Kerberos is an identity authentication protocol based on symmetric key technology that provides identity authentication for target services.
If Kudu has Kerberos authentication enabled, you need to enable Kerberos. After enabling it, you need to configure the following parameters:
Krb5 File Configuration or KDC Server: Upload a Krb5 file containing the Kerberos authentication domain name or configure the KDC server address to assist with Kerberos authentication.
Note: Multiple KDC Server addresses can be configured, separated by commas (,).
Keytab File: Upload the Keytab file for Kerberos authentication.
Principal: Configure the Principal name for Kerberos authentication. Example format:
xxxx/hadoopclient@xxx.xxx.
If Kudu does not have Kerberos authentication, you do not need to enable Kerberos.
Configuration File
Upload the Hadoop configuration file.
Note: You can upload configuration files only when Kerberos is set to Enable.
Table Prefix
Enter a table prefix. When multiple environments share the same Kudu service, table prefixes can effectively isolate the production and development environments. For example, when the same Kudu service is used together with multiple storage systems such as Impala, you can use impala as the table prefix to indicate that the source data comes from Impala and to distinguish those tables from tables in other storage systems.
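The connection parameters above follow simple textual formats: the connection URL is a comma-separated list of host:port pairs, and a Kerberos principal follows the primary/instance@REALM pattern. The sketch below validates both formats; the helper names and the sample values (including the conventional Kudu master RPC port 7051 and the EXAMPLE.COM realm) are illustrative assumptions, not Dataphin code:

```python
import re


def parse_connection_url(url):
    """Split a Kudu master address list like 'ip1:port1,ip2:port2'
    into (host, port) tuples, validating each entry."""
    endpoints = []
    for entry in url.split(","):
        host, sep, port = entry.strip().rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError("invalid endpoint: %r" % entry)
        endpoints.append((host, int(port)))
    return endpoints


# Kerberos principals have the form primary/instance@REALM.
PRINCIPAL_RE = re.compile(r"^([^/@]+)/([^/@]+)@([^/@]+)$")


def parse_principal(principal):
    """Split a Kerberos principal into (primary, instance, realm)."""
    m = PRINCIPAL_RE.match(principal)
    if not m:
        raise ValueError("invalid principal: %r" % principal)
    return m.groups()


# 7051 is the default Kudu master RPC port; addresses are examples.
print(parse_connection_url("10.0.0.1:7051,10.0.0.2:7051"))
# -> [('10.0.0.1', 7051), ('10.0.0.2', 7051)]
print(parse_principal("kudu/hadoopclient@EXAMPLE.COM"))
# -> ('kudu', 'hadoopclient', 'EXAMPLE.COM')
```

Validating these strings before saving the data source can surface typos (a missing port, a principal without a realm) earlier than a failed connection test.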
Select a Default Resource Group, which is used to run tasks related to the current data source, including database SQL, offline database migration, data preview, and more.
Click Test Connection or directly click OK to save and complete the creation of the Kudu data source.
When you click Test Connection, the system tests whether the data source can connect to Dataphin. If you click OK directly, the system automatically tests the connectivity of all selected clusters. However, even if all selected clusters fail the connection test, the data source can still be created.