Realtime Compute for Apache Flink: Ingest log data into data warehouses in real time

Last Updated: Jan 19, 2024

Fully managed Flink allows you to ingest log data into data warehouses in real time. This topic describes how to create a draft that synchronizes data from a Message Queue for Apache Kafka instance to a Hologres instance in the console of fully managed Flink.

Background information

For example, a topic named users is created for a Message Queue for Apache Kafka instance and 100 JSON data records exist in the topic. These JSON data records represent the log data that is written to Message Queue for Apache Kafka by using a log file collection tool or application. The following figure shows the data distribution.

In this topic, the CREATE TABLE AS statement that is provided by fully managed Flink is used to synchronize log data with one click and synchronize table schema changes in real time.

Prerequisites

Step 1: Configure an IP address whitelist

To allow fully managed Flink to access the Message Queue for Apache Kafka instance and Hologres instance, you must add the CIDR block of the vSwitch to which the fully managed Flink workspace belongs to the whitelists of the Message Queue for Apache Kafka instance and Hologres instance.

  1. Obtain the CIDR block of the vSwitch to which the fully managed Flink workspace belongs.

    1. Log on to the Realtime Compute for Apache Flink console.

    2. On the Fully Managed Flink tab, find the workspace that you want to manage, and choose More > Workspace Details in the Actions column.

    3. In the Workspace Details dialog box, view the CIDR block of the vSwitch to which the fully managed Flink workspace belongs.

  2. Add the CIDR block of the vSwitch to which the fully managed Flink workspace belongs to the IP address whitelist of the Message Queue for Apache Kafka instance.

    For more information, see Configure whitelists.

  3. Add the CIDR block of the vSwitch to which the fully managed Flink workspace belongs to the IP address whitelist of the Hologres instance.

    For more information, see Configure an IP address whitelist.

Step 2: Prepare test data of the Message Queue for Apache Kafka instance

Use a Faker source table of fully managed Flink as a data generator and write the data to the Message Queue for Apache Kafka instance. You can perform the following steps to write data to a Message Queue for Apache Kafka instance in the console of fully managed Flink.

  1. Create a topic named users in the Message Queue for Apache Kafka console.

    For more information, see Step 1: Create a topic.

  2. Create a draft that writes data to a specified Message Queue for Apache Kafka instance.

    1. Log on to the Realtime Compute for Apache Flink console.

    2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.

    3. In the left-side navigation pane, click SQL Editor.

    4. In the upper-left corner of the SQL Editor page, click New. On the SQL Scripts tab of the New Draft dialog box, click Blank Stream Draft.

      Fully managed Flink provides various code templates and supports data synchronization. Each code template provides specific scenarios, code samples, and instructions for you. You can click the desired template to quickly understand the features and related syntax of Flink and implement your business logic. For more information, see Code templates and Data synchronization templates.

    5. Click Next.

    6. In the New Draft dialog box, configure the parameters of the draft. The following table describes the parameters.

      | Parameter | Example value | Description |
      |---|---|---|
      | Name | kafka-data-input | The name of the draft that you want to create. Note: The draft name must be unique in the current project. |
      | Location | Development | The folder in which the code file of the draft is saved. By default, the code file of the draft is stored in the Development folder. You can click the New Folder icon to the right of a folder to create a subfolder. |
      | Engine Version | vvr-6.0.4-flink-1.15 | The engine version of Flink that is used by the draft. For more information about engine versions, version mappings, and important time points in the lifecycle of each version, see Engine version. |

    7. Click Create.

    8. Copy the following code of a draft to the code editor.

      CREATE TEMPORARY TABLE source (
        id INT,
        first_name STRING,
        last_name STRING,
        `address` ROW<`country` STRING, `state` STRING, `city` STRING>,
        event_time TIMESTAMP
      ) WITH (
        'connector' = 'faker',
        'number-of-rows' = '100',
        'rows-per-second' = '10',
        'fields.id.expression' = '#{number.numberBetween ''0'',''1000''}',
        'fields.first_name.expression' = '#{name.firstName}',
        'fields.last_name.expression' = '#{name.lastName}',
        'fields.address.country.expression' = '#{Address.country}',
        'fields.address.state.expression' = '#{Address.state}',
        'fields.address.city.expression' = '#{Address.city}',
        'fields.event_time.expression' = '#{date.past ''15'',''SECONDS''}'
      );
      
      CREATE TEMPORARY TABLE sink (
        id INT,
        first_name STRING,
        last_name STRING,
        `address` ROW<`country` STRING, `state` STRING, `city` STRING>,
        `timestamp` TIMESTAMP METADATA
      ) WITH (
        'connector' = 'kafka',
        'properties.bootstrap.servers' = 'alikafka-host1.aliyuncs.com:9000,alikafka-host2.aliyuncs.com:9000,alikafka-host3.aliyuncs.com:9000',
        'topic' = 'users',
        'format' = 'json'
      );
      
      INSERT INTO sink SELECT * FROM source;
    9. Modify the following parameter configurations based on your business requirements.

      | Parameter | Example value | Description |
      |---|---|---|
      | properties.bootstrap.servers | alikafka-host1.aliyuncs.com:9000,alikafka-host2.aliyuncs.com:9000,alikafka-host3.aliyuncs.com:9000 | The IP addresses or endpoints of Kafka brokers. Format: host:port,host:port,host:port. Separate multiple host:port pairs with commas (,). |
      | topic | users | The name of the Kafka topic. |

  3. Start the deployment for the draft.

    1. In the upper-right corner of the SQL Editor page, click Deploy. In the dialog box that appears, click Confirm.

    2. In the left-side navigation pane, click Deployments. Find the desired deployment and click Start in the Actions column. For more information about the parameters that you must configure when you start your deployment, see Start a deployment.

      You can view the status and information about the deployment on the Deployments page.

      The Faker source produces a bounded stream. Therefore, the deployment finishes about one minute after it enters the RUNNING state. When the deployment finishes, the generated data has been written to the users topic of the Message Queue for Apache Kafka instance. The following sample shows the format of the JSON data that is written to the Message Queue for Apache Kafka instance.

      {
        "id": 765,
        "first_name": "Barry",
        "last_name": "Pollich",
        "address": {
          "country": "United Arab Emirates",
          "state": "Nevada",
          "city": "Powlowskifurt"
        }
      }
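
To confirm that the test data arrived, the topic can be read back with a second Kafka table. The following is a minimal sketch and not part of the original procedure: the table name kafka_users_check is hypothetical, and the broker addresses are the same placeholders that are used above. The query has no sink, so it suits an interactive or debugging run rather than a deployment.

```sql
-- Hypothetical check table; replace the broker list with your own endpoints.
CREATE TEMPORARY TABLE kafka_users_check (
  id INT,
  first_name STRING,
  last_name STRING,
  `address` ROW<`country` STRING, `state` STRING, `city` STRING>
) WITH (
  'connector' = 'kafka',
  'properties.bootstrap.servers' = 'alikafka-host1.aliyuncs.com:9000,alikafka-host2.aliyuncs.com:9000,alikafka-host3.aliyuncs.com:9000',
  'topic' = 'users',
  'format' = 'json',
  'scan.startup.mode' = 'earliest-offset' -- Read the topic from the beginning.
);

-- Project one nested field to confirm that the address object was written as expected.
SELECT id, first_name, last_name, `address`.`country` AS country
FROM kafka_users_check;
```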

Step 3: Create a Hologres catalog

If you want to perform single-table synchronization, you must create a destination table in a destination catalog. You can create a destination catalog in the console of fully managed Flink. In this topic, a Hologres catalog is used as the destination catalog. This section describes how to create a Hologres catalog.

  1. Create a Hologres catalog named holo.

    For more information, see Create a Hologres catalog.

    Important

    You must make sure that a database named flink_test_db is created in the instance to which you want to synchronize data. Otherwise, an error is returned when you create a catalog.

  2. On the Catalogs page, verify that the catalog named holo is created.
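
If the flink_test_db database does not exist yet, it can be created before you create the catalog. A minimal sketch, assuming that you are connected to the Hologres instance with a PostgreSQL-compatible SQL client and that your account is allowed to create databases:

```sql
-- Run this statement in Hologres, not in the Flink SQL editor.
CREATE DATABASE flink_test_db;
```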

Step 4: Create a data synchronization draft and start a data synchronization deployment

  1. Log on to the console of fully managed Flink and create a data synchronization draft.

    1. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.

    2. In the left-side navigation pane, click SQL Editor. In the upper-left corner of the SQL Editor page, click New.

    3. On the SQL Scripts tab of the New Draft dialog box, click Blank Stream Draft.

      Fully managed Flink provides various code templates and supports data synchronization. Each code template provides specific scenarios, code samples, and instructions for you. You can click the desired template to quickly understand the features and related syntax of Flink and implement your business logic. For more information, see Code templates and Data synchronization templates.

    4. Click Next.

    5. In the New Draft dialog box, configure the parameters of the draft. The following table describes the parameters.

      | Parameter | Example value | Description |
      |---|---|---|
      | Name | flink-quickstart-test | The name of the draft that you want to create. Note: The draft name must be unique in the current project. |
      | Location | Development | The folder in which the code file of the draft is saved. By default, the code file of the draft is stored in the Development folder. You can click the New Folder icon to the right of a folder to create a subfolder. |
      | Engine Version | vvr-6.0.4-flink-1.15 | The engine version of Flink that is used by the draft. For more information about engine versions, version mappings, and important time points in the lifecycle of each version, see Engine version. |

    6. Click Create.

  2. Copy the following code to the code editor and modify the parameter configurations in the code based on your business requirements.

    You can use one of the following methods to synchronize data of the users topic from the Message Queue for Apache Kafka instance to the sync_kafka_users table of the flink_test_db database in Hologres:

    Note

    During data synchronization, we recommend that you declare the partition and offset fields of Kafka as the primary key for the Hologres table. This way, if data is retransmitted due to a deployment failover, only one copy of the data that has the same values of the partition and offset fields is stored.

    You can use one of the following methods to specify the data types of the input and output values:

    • Use the CREATE TABLE AS statement

      If you execute the CREATE TABLE AS statement to synchronize data, you do not need to manually create the table in Hologres or configure the types of the columns to which data is written as JSON or JSONB.

      CREATE TEMPORARY TABLE kafka_users (
        `id` INT NOT NULL,
        `address` STRING,
        `offset` BIGINT NOT NULL METADATA,
        `partition` BIGINT NOT NULL METADATA,
        `timestamp` TIMESTAMP METADATA,
        `date` AS CAST(`timestamp` AS DATE),
        `country` AS JSON_VALUE(`address`, '$.country'),
        PRIMARY KEY (`partition`, `offset`) NOT ENFORCED
      ) WITH (
        'connector' = 'kafka',
        'properties.bootstrap.servers' = 'alikafka-host1.aliyuncs.com:9000,alikafka-host2.aliyuncs.com:9000,alikafka-host3.aliyuncs.com:9000',
        'topic' = 'users',
        'format' = 'json',
        'json.infer-schema.flatten-nested-columns.enable' = 'true', -- Automatically expand nested columns. 
        'scan.startup.mode' = 'earliest-offset'
      );
      
      CREATE TABLE IF NOT EXISTS holo.flink_test_db.sync_kafka_users
      WITH (
        'connector' = 'hologres'
      ) AS TABLE kafka_users;
      Note

      To prevent duplicate data from being written to Hologres after a job failover, you can add the related primary key to the table to uniquely identify data. If data is retransmitted, Hologres ensures that only one copy of data that has the same values of the partition and offset fields is retained.

    • Use the INSERT INTO statement

      Hologres applies dedicated optimizations to JSON and JSONB data. Therefore, you can use the INSERT INTO statement to write nested JSON data to Hologres.

      If you use the INSERT INTO statement to synchronize data, you must manually create a table in Hologres and configure the types of the columns to which data is written as JSON or JSONB. Then, you can execute the INSERT INTO statement to write the address data to the column of the JSON type in Hologres.

      CREATE TEMPORARY TABLE kafka_users (
        `id` INT NOT NULL,
        `address` STRING, -- The data in this column is nested JSON data. 
        `offset` BIGINT NOT NULL METADATA,
        `partition` BIGINT NOT NULL METADATA,
        `timestamp` TIMESTAMP METADATA,
        `date` AS CAST(`timestamp` AS DATE),
        `country` AS JSON_VALUE(`address`, '$.country')
      ) WITH (
        'connector' = 'kafka',
        'properties.bootstrap.servers' = 'alikafka-host1.aliyuncs.com:9000,alikafka-host2.aliyuncs.com:9000,alikafka-host3.aliyuncs.com:9000',
        'topic' = 'users',
        'format' = 'json',
        'json.infer-schema.flatten-nested-columns.enable' = 'true', -- Automatically expand nested columns. 
        'scan.startup.mode' = 'earliest-offset'
      );
      
      CREATE TEMPORARY TABLE holo (
        `id` INT NOT NULL,
        `address` STRING,
        `offset` BIGINT,
        `partition` BIGINT,
        `timestamp` TIMESTAMP,
        `date` DATE,
        `country` STRING
      ) WITH (
        'connector' = 'hologres',
        'endpoint' = 'hgpostcn-****-cn-beijing-vpc.hologres.aliyuncs.com:80',
        'username' = 'LTAI5tE572UJ44Xwhx6i****',
        'password' = 'KtyIXK3HIDKA9VzKX4tpct9xTm****',
        'dbname' = 'flink_test_db',
        'tablename' = 'sync_kafka_users'
      );
      
      INSERT INTO holo
      SELECT * FROM kafka_users;

    The following table describes the parameters that you can configure in the code.

    | Parameter | Example value | Description |
    |---|---|---|
    | properties.bootstrap.servers | alikafka-host1.aliyuncs.com:9000,alikafka-host2.aliyuncs.com:9000,alikafka-host3.aliyuncs.com:9000 | The IP addresses or endpoints of Kafka brokers. Format: host:port,host:port,host:port. Separate multiple host:port pairs with commas (,). |
    | topic | users | The name of the Kafka topic. |
    | endpoint | hgpostcn-****-cn-beijing-vpc.hologres.aliyuncs.com:80 | The endpoint of the Hologres instance. Format: <ip>:<port>. |
    | username | LTAI5tE572UJ44Xwhx6i**** | The username that is used to access the Hologres database. You must enter the AccessKey ID of your Alibaba Cloud account. |
    | password | KtyIXK3HIDKA9VzKX4tpct9xTm**** | The password that is used to access the Hologres database. You must enter the AccessKey secret of your Alibaba Cloud account. |
    | dbname | flink_test_db | The name of the Hologres database that you want to access. |
    | tablename | sync_kafka_users | The name of the Hologres table. Note: If you use the INSERT INTO statement to synchronize data, you must create the sync_kafka_users table and define required fields in the database of the destination instance. If the public schema is not used, you must set tablename to schema.tableName. |
    | json.infer-schema.flatten-nested-columns.enable | true | Specifies whether to automatically expand nested columns. Valid values: true (Flink automatically expands the new nested column and uses the access path of the column as the name of the column) and false (Flink does not automatically expand nested columns). |

  3. In the upper-right corner of the SQL Editor page, click Deploy. In the dialog box that appears, click Confirm.

  4. In the left-side navigation pane, click Deployments. Find the desired deployment and click Start in the Actions column. For more information about the parameters that you must configure when you start your deployment, see Start a deployment.

  5. Click Start.

    You can view the status and information of the deployment on the Deployments page after the deployment is started.
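
For the INSERT INTO method in this step, the destination table must exist in Hologres before the deployment starts. The following DDL is a sketch of what that table could look like and is not taken from the original procedure. The column list mirrors the holo sink table above, JSONB is chosen for the address column, and the primary key follows the recommendation to use the partition and offset fields. The Hologres-side types are assumptions, so verify them against the data type mappings of the Hologres connector.

```sql
-- Run in the flink_test_db database of the Hologres instance.
-- "offset", "partition", "timestamp", and "date" are quoted because they collide with SQL keywords.
BEGIN;
CREATE TABLE public.sync_kafka_users (
  id          INT NOT NULL,
  address     JSONB,               -- Nested JSON that Flink writes as a string.
  "offset"    BIGINT NOT NULL,
  "partition" BIGINT NOT NULL,
  "timestamp" TIMESTAMP,           -- Assumed mapping for the Flink TIMESTAMP column.
  "date"      DATE,
  country     TEXT,
  PRIMARY KEY ("partition", "offset") -- One row per Kafka record, even after a replay.
);
COMMIT;
```

The CREATE TABLE AS method does not need this DDL because it creates the destination table automatically.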

Step 5: View the result of full data synchronization

  1. Log on to the Hologres console.

  2. On the Instances page, click the name of the instance.

  3. In the upper-right corner of the page, click Connect to Instance.

  4. On the Metadata Management tab, view the schema and data of the sync_kafka_users table to which data is synchronized in the flink_test_db database.

    The following figures show the schema and data of the sync_kafka_users table after full data synchronization.

    • Table schema

      Double-click the name of the sync_kafka_users table to view the table schema.

    • Table data

      In the upper-right corner of the page for the sync_kafka_users table, click Query table. In the SQL editor, enter the following statement and click Run:

      SELECT * FROM public.sync_kafka_users order by partition, "offset";

      The following figure shows the data of the sync_kafka_users table.
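
Beyond browsing the rows, two quick checks can confirm that the synchronization is complete and free of duplicates. This is a sketch that assumes the table is in the public schema, as in the query above. In this walkthrough the topic holds 100 records, so the first query is expected to return 100 after the backlog is consumed. The second query is expected to return no rows.

```sql
-- Count the synchronized rows.
SELECT COUNT(*) AS row_count
FROM public.sync_kafka_users;

-- List any Kafka record that was stored more than once.
SELECT "partition", "offset", COUNT(*) AS copies
FROM public.sync_kafka_users
GROUP BY "partition", "offset"
HAVING COUNT(*) > 1;
```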

Step 6: Check whether table schema changes are automatically synchronized

  1. Manually send a message that contains a new column in the Message Queue for Apache Kafka console.

    1. Log on to the ApsaraMQ for Kafka console.

    2. On the Instances page, click the name of the instance for data synchronization.

    3. In the left-side navigation pane of the page that appears, click Topics. On the right side of the page, find the topic named users.

    4. Choose More > Quick Start in the Actions column. In the Start to Send and Consume Message panel, configure the parameters and enter the content of the test message.

      The following list describes the parameters.

      • Method of Sending: Select Console.

      • Message Key: Enter flinktest.

      • Message Content: Copy and paste the following JSON content to the Message Content field.

      {
        "id": 100001,
        "first_name": "Dennise",
        "last_name": "Schuppe",
        "address": {
          "country": "Isle of Man",
          "state": "Montana",
          "city": "East Coleburgh"
        },
        "house-points": {
          "house": "Pukwudgie",
          "points": 76
        }
      }
      Note

      In this example, house-points is a new nested column.

      • Send to Specified Partition: Select Yes.

      • Partition ID: Enter 0.

    5. Click OK.

  2. In the Hologres console, view the changes in the schema and data of the sync_kafka_users table.

    1. Log on to the Hologres console.

    2. On the Instances page, click the name of the instance for data synchronization.

    3. In the upper-right corner of the page, click Connect to Instance.

    4. On the Metadata Management tab, double-click the name of the sync_kafka_users table.

    5. In the upper-right corner of the page for the sync_kafka_users table, click Query table. In the SQL editor, enter the following statement and click Run:

      SELECT * FROM public.sync_kafka_users order by partition, "offset";
    6. View the data of the table.

      The following figure shows the data of the sync_kafka_users table.

      The figure shows that the data record whose id is 100001 is written to Hologres. In addition, the house-points.house and house-points.points columns are added to Hologres.

      Note

      The message that was sent to Message Queue for Apache Kafka adds only one nested column, house-points. Because json.infer-schema.flatten-nested-columns.enable is set to true in the WITH clause of the kafka_users table, fully managed Flink automatically expands the new nested column and uses the access path of each nested field as the column name. This produces the house-points.house and house-points.points columns.
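
The new columns can also be queried directly. A minimal sketch, assuming the table is in the public schema and the columns are named house-points.house and house-points.points as described above. The names contain a hyphen and a period, so they must be enclosed in double quotation marks.

```sql
-- Show the test message together with the two expanded columns.
SELECT id, "house-points.house", "house-points.points"
FROM public.sync_kafka_users
WHERE id = 100001;
```

Rows that were synchronized before the schema change have no values for these columns, so they are expected to contain NULL there.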

(Optional) Step 7: Adjust deployment resource configurations

To ensure optimal deployment performance, we recommend that you adjust the parallelism of deployments and resource configurations of different nodes based on the amount of data that needs to be processed. To adjust the parallelism of deployments and the number of CUs in a simple manner, use the basic resource configuration mode. To adjust the parallelism of deployments and resource configurations of nodes in a more fine-grained manner, use the expert resource configuration mode.

  1. Log on to the console of fully managed Flink and go to the deployment details page.

    1. Log on to the Realtime Compute for Apache Flink console.

    2. On the Fully Managed Flink tab, find the workspace that you want to manage and click Console in the Actions column.

    3. In the left-side navigation pane, click Deployments.

    4. On the Deployments page, click the name of the desired deployment. In the upper-right corner of the Resources section on the Deployment Detail tab, click Edit.

  2. Modify resource configurations.

    1. Select Expert for the Mode parameter.

    2. Click Get Plan Now in the Resource Plan section.

    3. Move the pointer over More and click Expand All.

      View the complete topology to learn the data synchronization plan of the deployment. The plan shows the tables that need to be synchronized.

    4. Manually configure the PARALLELISM parameter for each node.

      The users topic of the Message Queue for Apache Kafka instance has four partitions. Therefore, you can set the PARALLELISM parameter for the Message Queue for Apache Kafka node to 4. Log data is written to only one Hologres table. To reduce the number of connections to Hologres, you can set the PARALLELISM parameter for the Hologres node to 2. For more information about how to configure resource parameters, see Configure a deployment.

    5. In the upper-right corner of the Resources section, click Save.

    6. On the Deployments page, find the desired deployment and click Start in the Actions column.

      Important

      If you want to modify the resource configuration of a deployment that is in the RUNNING state, perform the following steps: Find the deployment and click Cancel in the Actions column. When the deployment changes to the CANCELLED state, click Start in the Actions column. After the deployment is started, the modification on the resource configuration of the deployment takes effect.

  3. Click the name of the desired deployment. On the Status tab, view the effect after the adjustment.

References