Fully managed Flink allows you to ingest log data into data warehouses in real time. This topic describes how to create a draft that synchronizes data from a Message Queue for Apache Kafka instance to a Hologres instance in the console of fully managed Flink.
Background information
For example, a topic named users is created for a Message Queue for Apache Kafka instance and 100 JSON data records exist in the topic. These JSON data records represent the log data that is written to Message Queue for Apache Kafka by using a log file collection tool or application. The following figure shows the data distribution.
If you want to create a draft to synchronize all log data in the topic from the Message Queue for Apache Kafka instance to a Hologres instance, you can perform the following steps:
In this topic, the CREATE TABLE AS statement that is provided by fully managed Flink is used to synchronize log data with one click and synchronize table schema changes in real time.
Prerequisites
The RAM user or RAM role that you use to access the console of fully managed Flink has the required permissions. For more information, see Permission management.
A workspace is created. For more information, see Activate Realtime Compute for Apache Flink.
Upstream and downstream storage instances are created.
A Message Queue for Apache Kafka instance is created. For more information, see Step 3: Create resources.
A Hologres instance is created. For more information, see Purchase a Hologres instance.
NoteWe recommend that the Message Queue for Apache Kafka instance and the Hologres instance reside in the same virtual private cloud (VPC) as the fully managed Flink workspace. If the instances do not reside in the same VPC as the fully managed Flink workspace, you must establish network connections between the Message Queue for Apache Kafka instance and the fully managed Flink workspace and between the Hologres instance and the fully managed Flink workspace. For more information, see How does fully managed Flink access a service across VPCs? or How does the fully managed Flink service access the Internet?.
Step 1: Configure an IP address whitelist
To allow fully managed Flink to access the Message Queue for Apache Kafka instance and Hologres instance, you must add the CIDR block of the vSwitch to which the fully managed Flink workspace belongs to the whitelists of the Message Queue for Apache Kafka instance and Hologres instance.
Obtain the CIDR block of the vSwitch to which the fully managed Flink workspace belongs.
Log on to the Realtime Compute for Apache Flink console.
On the Fully Managed Flink tab, find the workspace that you want to manage, and choose in the Actions column.
In the Workspace Details dialog box, view the CIDR block of the vSwitch to which the fully managed Flink workspace belongs.
Add the CIDR block of the vSwitch to which the fully managed Flink workspace belongs to the IP address whitelist of the Message Queue for Apache Kafka instance.
For more information, see Configure whitelists.
Add the CIDR block of the vSwitch to which the fully managed Flink workspace belongs to the IP address whitelist of the Hologres instance.
For more information, see Configure an IP address whitelist.
Step 2: Prepare test data of the Message Queue for Apache Kafka instance
Use a Faker source table of fully managed Flink as a data generator and write the data to the Message Queue for Apache Kafka instance. You can perform the following steps to write data to a Message Queue for Apache Kafka instance in the console of fully managed Flink.
Create a topic named users in the Message Queue for Apache Kafka console.
For more information, see Step 1: Create a topic.
Create a draft that writes data to a specified Message Queue for Apache Kafka instance.
Log on to the Realtime Compute for Apache Flink console.
On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.
In the left-side navigation pane, click SQL Editor.
In the upper-left corner of the SQL Editor page, click New. On the SQL Scripts tab of the New Draft dialog box, click Blank Stream Draft.
Fully managed Flink provides various code templates and supports data synchronization. Each code template provides specific scenarios, code samples, and instructions for you. You can click the desired template to quickly understand the features and related syntax of Flink and implement your business logic. For more information, see Code templates and Data synchronization templates.
Click Next.
In the New Draft dialog box, configure the parameters of the draft. The following table describes the parameters.
Parameter
Example value
Description
Name
kafka-data-input
The name of the draft that you want to create.
NoteThe draft name must be unique in the current project.
Location
Development
The folder in which the code file of the draft is saved. By default, the code file of the draft is stored in the Development folder.
You can click the icon to the right of a folder to create a subfolder.
Engine Version
vvr-6.0.4-flink-1.15
You can view the engine version of Flink that is used by the draft. For more information about engine versions, version mappings, and important time points in the lifecycle of each version, see Engine version.
Click Create.
Copy the following code of a draft to the code editor.
CREATE TEMPORARY TABLE source ( id INT, first_name STRING, last_name STRING, `address` ROW<`country` STRING, `state` STRING, `city` STRING>, event_time TIMESTAMP ) WITH ( 'connector' = 'faker', 'number-of-rows' = '100', 'rows-per-second' = '10', 'fields.id.expression' = '#{number.numberBetween ''0'',''1000''}', 'fields.first_name.expression' = '#{name.firstName}', 'fields.last_name.expression' = '#{name.lastName}', 'fields.address.country.expression' = '#{Address.country}', 'fields.address.state.expression' = '#{Address.state}', 'fields.address.city.expression' = '#{Address.city}', 'fields.event_time.expression' = '#{date.past ''15'',''SECONDS''}' ); CREATE TEMPORARY TABLE sink ( id INT, first_name STRING, last_name STRING, `address` ROW<`country` STRING, `state` STRING, `city` STRING>, `timestamp` TIMESTAMP METADATA ) WITH ( 'connector' = 'kafka', 'properties.bootstrap.servers' = 'alikafka-host1.aliyuncs.com:9000,alikafka-host2.aliyuncs.com:9000,alikafka-host3.aliyuncs.com:9000', 'topic' = 'users', 'format' = 'json' ); INSERT INTO sink SELECT * FROM source;
Modify the following parameter configurations based on your business requirements.
Parameter
Example value
Description
properties.bootstrap.servers
alikafka-host1.aliyuncs.com:9000,alikafka-host2.aliyuncs.com:9000,alikafka-host3.aliyuncs.com:9000
The IP addresses or endpoints of Kafka brokers.
Format: host:port,host:port,host:port. Separate multiple host:port pairs with commas (,).
topic
users
The name of the Kafka topic.
Start the deployment for the draft.
In the upper-right corner of the SQL Editor page, click Deploy. In the dialog box that appears, click Confirm.
In the left-side navigation pane, click Deployments. Find the desired deployment and click Start in the Actions column. For more information about the parameters that you must configure when you start your deployment, see Start a deployment.
You can view the status and information about the deployment on the Deployments page.
The Faker data source provides bounded streams. Therefore, the deployment becomes complete about one minute after the deployment remains in the RUNNING state. When the deployment is complete, data in the deployment is written to the users topic of the Message Queue for Apache Kafka instance. The following sample code shows the format of the JSON data that is written to the Message Queue for Apache Kafka instance.
{ "id": 765, "first_name": "Barry", "last_name": "Pollich", "address": { "country": "United Arab Emirates", "state": "Nevada", "city": "Powlowskifurt" } }
Step 3: Create a Hologres catalog
If you want to perform single-table synchronization, you must create a destination table in a destination catalog. You can create a destination catalog in the console of fully managed Flink. In this topic, a Hologres catalog is used as the destination catalog. This section describes how to create a Hologres catalog.
Create a Hologres catalog named holo.
For more information, see Create a Hologres catalog.
ImportantYou must make sure that a database named flink_test_db is created in the instance to which you want to synchronize data. Otherwise, an error is returned when you create a catalog.
On the Catalogs page, verify that the catalog named holo is created.
Step 4: Create a data synchronization draft and start a data synchronization deployment
Log on to the console of fully managed Flink and create a data synchronization draft.
On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.
In the left-side navigation pane, click SQL Editor. In the upper-left corner of the SQL Editor page, click New.
On the SQL Scripts tab of the New Draft dialog box, click Blank Stream Draft.
Fully managed Flink provides various code templates and supports data synchronization. Each code template provides specific scenarios, code samples, and instructions for you. You can click the desired template to quickly understand the features and related syntax of Flink and implement your business logic. For more information, see Code templates and Data synchronization templates.
Click Next.
In the New Draft dialog box, configure the parameters of the draft. The following table describes the parameters.
Parameter
Example value
Description
Name
flink-quickstart-test
The name of the draft that you want to create.
NoteThe draft name must be unique in the current project.
Location
Development
The folder in which the code file of the draft is saved. By default, the code file of the draft is stored in the Development folder.
You can click the icon to the right of a folder to create a subfolder.
Engine Version
vvr-6.0.4-flink-1.15
You can view the engine version of Flink that is used by the draft. For more information about engine versions, version mappings, and important time points in the lifecycle of each version, see Engine version.
Click Create.
Copy the following code to the code editor and modify the parameter configurations in the code based on your business requirements.
You can use one of the following methods to synchronize data of the users topic from the Message Queue for Apache Kafka instance to the sync_kafka_users table of the flink_test_db database in Hologres:
NoteDuring data synchronization, we recommend that you declare the partition and offset fields of Kafka as the primary key for the Hologres table. This way, if data is retransmitted due to a deployment failover, only one copy of the data that has the same values of the partition and offset fields is stored.
You can use one of the following methods to specify the data types of the input and output values:
Use the CREATE TABLE AS statement
If you execute the CREATE TABLE AS statement to synchronize data, you do not need to manually create the table in Hologres or configure the types of the columns to which data is written as JSON or JSONB.
CREATE TEMPORARY TABLE kafka_users ( `id` INT NOT NULL, `address` STRING, `offset` BIGINT NOT NULL METADATA, `partition` BIGINT NOT NULL METADATA, `timestamp` TIMESTAMP METADATA, `date` AS CAST(`timestamp` AS DATE), `country` AS JSON_VALUE(`address`, '$.country'), PRIMARY KEY (`partition`, `offset`) NOT ENFORCED ) WITH ( 'connector' = 'kafka', 'properties.bootstrap.servers' = 'alikafka-host1.aliyuncs.com:9000,alikafka-host2.aliyuncs.com:9000,alikafka-host3.aliyuncs.com:9000', 'topic' = 'users', 'format' = 'json', 'json.infer-schema.flatten-nested-columns.enable' = 'true', -- Automatically expand nested columns. 'scan.startup.mode' = 'earliest-offset' ); CREATE TABLE IF NOT EXISTS holo.flink_test_db.sync_kafka_users WITH ( 'connector' = 'hologres' ) AS TABLE kafka_users;
NoteTo prevent duplicate data from being written to Hologres after a job failover, you can add the related primary key to the table to uniquely identify data. If data is retransmitted, Hologres ensures that only one copy of data that has the same values of the partition and offset fields is retained.
Use the INSERT INTO statement
A special method is used to optimize JSON and JSONB data in Hologres. Therefore, you can use the INSERT INTO statement to write nested JSON data to Hologres.
If you use the INSERT INTO statement to synchronize data, you must manually create a table in Hologres and configure the types of the columns to which data is written as JSON or JSONB. Then, you can execute the INSERT INTO statement to write the address data to the column of the JSON type in Hologres.
CREATE TEMPORARY TABLE kafka_users ( `id` INT NOT NULL, 'address' STRING, -- The data in this column is nested JSON data. `offset` BIGINT NOT NULL METADATA, `partition` BIGINT NOT NULL METADATA, `timestamp` TIMESTAMP METADATA, `date` AS CAST(`timestamp` AS DATE), `country` AS JSON_VALUE(`address`, '$.country') ) WITH ( 'connector' = 'kafka', 'properties.bootstrap.servers' = 'alikafka-host1.aliyuncs.com:9000,alikafka-host2.aliyuncs.com:9000,alikafka-host3.aliyuncs.com:9000', 'topic' = 'users', 'format' = 'json', 'json.infer-schema.flatten-nested-columns.enable' = 'true', -- Automatically expand nested columns. 'scan.startup.mode' = 'earliest-offset' ); CREATE TEMPORARY TABLE holo ( `id` INT NOT NULL, `address` STRING, `offset` BIGINT, `partition` BIGINT, `timestamp` TIMESTAMP, `date` DATE, `country` STRING ) WITH ( 'connector' = 'hologres', 'endpoint' = 'hgpostcn-****-cn-beijing-vpc.hologres.aliyuncs.com:80', 'username' = 'LTAI5tE572UJ44Xwhx6i****', 'password' = 'KtyIXK3HIDKA9VzKX4tpct9xTm****', 'dbname' = 'flink_test_db', 'tablename' = 'sync_kafka_users' ); INSERT INTO holo SELECT * FROM kafka_users;
The following table describes the parameters that you can configure in the code.
Parameter
Example value
Description
properties.bootstrap.servers
alikafka-host1.aliyuncs.com:9000,alikafka-host2.aliyuncs.com:9000,alikafka-host3.aliyuncs.com:9000
The IP addresses or endpoints of Kafka brokers.
Format: host:port,host:port,host:port. Separate multiple host:port pairs with commas (,).
topic
users
The name of the Kafka topic.
endpoint
hgpostcn-****-cn-beijing-vpc.hologres.aliyuncs.com:80
The endpoint of the Hologres instance.
Format: <ip>:<port>.
username
LTAI5tE572UJ44Xwhx6i****
The username that is used to access the Hologres database. You must enter the AccessKey ID of your Alibaba Cloud account.
password
KtyIXK3HIDKA9VzKX4tpct9xTm****
The password that is used to access the Hologres database. You must enter the AccessKey secret of your Alibaba Cloud account.
dbname
flink_test_db
The name of the Hologres database that you want to access.
tablename
sync_kafka_users
The name of the Hologres table.
NoteIf you use the INSERT INTO statement to synchronize data, you must create the sync_kafka_users table and define required fields in the database of the destination instance.
If the public schema is not used, you must set tablename to schema.tableName.
json.infer-schema.flatten-nested-columns.enable
true
Specifies whether to automatically expand nested columns. Valid values:
true: Flink automatically expands the new nested column and uses the access path of the column as the name of the column.
false: Flink does not automatically expand nested columns.
In the upper-right corner of the SQL Editor page, click Deploy. In the dialog box that appears, click Confirm.
In the left-side navigation pane, click Deployments. Find the desired deployment and click Start in the Actions column. For more information about the parameters that you must configure when you start your deployment, see Start a deployment.
Click Start.
You can view the status and information of the deployment on the Deployments page after the deployment is started.
Step 5: View the result of full data synchronization
Log on to the Hologres console.
On the Instances page, click the name of the instance.
In the upper-right corner of the page, click Connect to Instance.
On the Metadata Management tab, view the schema and data of the sync_kafka_users table to which data is synchronized in the users database.
The following figures show the schema and data of the sync_kafka_users table after full data synchronization.
Table schema
Double-click the name of the sync_kafka_users table to view the table schema.
Table data
In the upper-right corner of the page for the sync_kafka_users table, click Query table. In the SQL editor, enter the following statement and click Run:
SELECT * FROM public.sync_kafka_users order by partition, "offset";
The following figure shows the data of the sync_kafka_users table.
Step 6: Check whether table schema changes are automatically synchronized
Manually send a message that contains a new column in the Message Queue for Apache Kafka console.
Log on to the ApsaraMQ for Kafka console.
On the Instances page, click the name of the instance for data synchronization.
In the left-side navigation pane of the page that appears, click Topics. On the right side of the page, find the topic named users.
Choose More > Quick Start in the Actions column. In the Start to Send and Consume Message panel, configure the parameters and enter the content of the test message.
Parameter
Description
Method of Sending
Select Console.
Message Key
Enter flinktest.
Message Content
Copy and paste the following JSON content to the Message Content field.
{ "id": 100001, "first_name": "Dennise", "last_name": "Schuppe", "address": { "country": "Isle of Man", "state": "Montana", "city": "East Coleburgh" }, "house-points": { "house": "Pukwudgie", "points": 76 } }
NoteIn this example, house-points is a new nested column.
Send to Specified Partition
Select Yes.
Partition ID
Enter 0.
Click OK.
In the Hologres console, view the changes in the schema and data of the sync_kafka_users table.
Log on to the Hologres console.
On the Instances page, click the name of the instance for data synchronization.
In the upper-right corner of the page, click Connect to Instance.
On the Metadata Management tab, double-click the name of the sync_kafka_users table.
In the upper-right corner of the page for the sync_kafka_users table, click Query table. In the SQL editor, enter the following statement and click Run:
SELECT * FROM public.sync_kafka_users order by partition, "offset";
View the data of the table.
The following figure shows the data of the sync_kafka_users table.
The figure shows that the data record whose id is 100001 is written to Hologres. In addition, the house-points.house and house-points.points columns are added to Hologres.
NoteOnly data in the nested column house-points is included in the data that is inserted into the table of Message Queue for Apache Kafka. However, json.infer-schema.flatten-nested-columns.enable is declared in the parameters of the WITH clause for the kafka_users table. In this case, fully managed Flink automatically expands the new nested column. After the column is expanded, the path to access the column is used as the name of the column.
(Optional) Step 7: Adjust deployment resource configurations
To ensure optimal deployment performance, we recommend that you adjust the parallelism of deployments and resource configurations of different nodes based on the amount of data that needs to be processed. To adjust the parallelism of deployments and the number of CUs in a simple manner, use the basic resource configuration mode. To adjust the parallelism of deployments and resource configurations of nodes in a more fine-grained manner, use the expert resource configuration mode.
Log on to the console of fully managed Flink and go to the deployment details page.
Log on to the Realtime Compute for Apache Flink console.
On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.
In the left-side navigation pane, click Deployments.
On the Deployments page, click the name of the desired deployment. In the upper-right corner of the Resources section on the Deployment Detail tab, click Edit.
Modify resource configurations.
Select Expert for the Mode parameter.
Click Get Plan Now in the Resource Plan section.
Move the pointer over More and click Expand All.
View the complete topology to learn the data synchronization plan of the deployment. The plan shows the tables that need to be synchronized.
Manually configure the PARALLELISM parameter for each node.
The table in the users topic of Message Queue for Apache Kafka has four partitions. Therefore, you can set the PARALLELISM parameter for Message Queue for Apache Kafka to 4. Log data is written to only one Hologres table. To reduce the number of connections to Hologres, you can set the PARALLELISM parameter for Hologres to 2. For more information about how to configure resource parameters, see Configure a deployment.
In the upper-right corner of the Resources section, click Save.
On the Deployments page, find the desired deployment and click Start in the Actions column.
ImportantIf you want to modify the resource configuration of a deployment that is in the RUNNING state, perform the following steps: Find the deployment and click Cancel in the Actions column. When the deployment changes to the CANCELLED state, click Start in the Actions column. After the deployment is started, the modification on the resource configuration of the deployment takes effect.
Click the name of the desired deployment. On the Status tab, view the effect after the adjustment.
References
For more information about the CREATE TABLE AS statement, see CREATE TABLE AS statement.
For more information about the synchronization of table schema changes when the CREATE TABLE AS statement is executed to synchronize data from a Message Queue for Apache Kafka source table, see Create a Message Queue for Apache Kafka source table.