×
Community Blog FlinkSQL Temporary Join Development

FlinkSQL Temporary Join Development

The article introduces the use of temporary joins in real-time development for matching traffic logs with product attributes.

By Tuyuan

In real-time development, when obtaining the attributes corresponding to a specific moment through a dual-stream join, temporary joins are often utilized. During the real-time iteration of traffic upgrades, the author needed to accurately match traffic logs with product attributes corresponding to the browsing time, and encountered several challenges while developing with temporary joins. This article compiles some of those experiences for reference and discussion.

1. Background

Before reading this article, it is recommended to familiarize yourself with the background knowledge on real-time FlinkSQL dual-stream joins.

Currently, we have a TT stream A with traffic log details and a TT stream B with commodity tags. Performing a dual-stream join on streams A and B in Flink is similar to joining stream A with an HBase dimension table.

A temporary join has the following features:

  1. Single-stream driven: Although it is a dual-stream join, data emission is triggered by only one stream.
  2. You need to define a versioned table, which records the attribute information at each moment and is passively queried when you join two streams. Similar to a bank exchange rate table, you need to refer to the exchange rate at the time of currency exchange.
  3. Queries carry time-version information: The temporary join is triggered by the watermarks of the two streams. Therefore, the attributes retrieved correspond to those within the relevant time frame.

1
Source: Jincheng Sun, “Blink Series - Temporal Table JOIN”

2. Application Scenarios & Example Sharing

When it is necessary to calculate the total amount based on real-time exchange rates currency amounts or compute the total transaction amount from real-time product prices the number of items sold, a temporary join is often used to obtain the real-time exchange rate and price information. In my traffic upgrade business iteration, we need to acquire real-time product tags. Therefore, it is required to define a versioned table for product tags. The syntax is as follows:

CREATE TEMPORARY TABLE `tag_ri` (
  `id` VARCHAR,
  `tag` VARCHAR,
  `time` VARCHAR,
  `ts` AS `TO_TIMESTAMP`(`time`, 'yyyy-MM-dd HH:mm:ss'),
  WATERMARK FOR `ts` AS `withOffset`(`ts`, 0) -- Define the watermark
) WITH (
  'connector' = 'tt',
  'router' = '******',
  'topic' = 'tag_ri',
  'lineDelimiter' = '\n',
  'fieldDelimiter' = '\u0001',
  'encoding' = 'utf-8'
);

-- Define the version table
CREATE TEMPORARY VIEW `tag`
AS
SELECT `id`
     , `tag`
     , `time`
     , `ts`
FROM (    SELECT `id`
                , `tag`
                , `time`
                , `ts`
                , ROW_NUMBER() OVER (PARTITION BY 'id' -- Join the primary key
                  ORDER BY `time` DESC) AS `rownum`
           FROM `tag_ri`
     )
WHERE `rownum` = 1;

Similarly, we also need to define a watermark for the traffic log detail stream and perform a dual-stream join.

CREATE TEMPORARY TABLE `log_ri` (
  `id` VARCHAR,
  `time` VARCHAR,
  ......
  `ts` AS `TO_TIMESTAMP`(`time`, 'yyyy-MM-dd HH:mm:ss'),
  WATERMARK FOR `ts` AS `withOffset`(`ts`, 0)
) WITH (
  'connector' = 'tt',
  'router' = '******',
  'topic' = 'log_ri',
  'lineDelimiter' = '\n',
  'fieldDelimiter' = '\u0001',
  'encoding' = 'utf-8',
);


select `a`.`id`
      ,......
      ,`b`.`tag`
from  (
        SELECT *
        FROM `log_ri`
      ) AS `a`
LEFT JOIN `tag` FOR SYSTEM_TIME AS OF `a`.`ts` AS `b` ON `a`.`id` = `b`.`id`

The result is as follows:

-- Product tag information
12:00> SELECT * FROM tag_ri;

  id              tag (product tag)
=======      =======================
   t1                 A


12:30> SELECT * FROM tag_ri;

  id              tag (product tag)
=======      =======================
   t1                 B

-- Traffic log detail query, showing three entries for product t1
SELECT * FROM log_ri;

  id              time
=======          ========
  t1              12:00
  t1              12:15
  t1               12:30

-- Perform a temporary join
select `a`.`id`
      ,`a`.`time`
      ,`b`.`tag`
from  (
        SELECT *
        FROM `log_ri`
      ) AS `a`
LEFT JOIN `tag` FOR SYSTEM_TIME AS OF `a`.`ts` AS `b` ON `a`.`id` = `b`.`id`

  id               time              tag (product tag)
=======          ========        =======================
  t1              12:00                   A
  t1              12:15                   A
  t1              12:30                   B

3. Development Experience

3.1 Sparse Data Processing

Since a temporary join is triggered by the watermarks of two streams, if the versioned table represents a sparse stream (with no data flowing in for a period), the join might exhibit behavior where it waits without emitting any data. You can set the parameter set table.exec.source.idle-timeout = 10s, allowing data from stream A to bypass waiting.

For more information about the specific parameters, please refer to:
https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/table/config/#table-exec-source-idle-timeout

3.2 Delayed Data Emission

Problem Description

During actual development, we observed that after performing a temporary join, data would consistently wait before being emitted, only proceeding at the top of the hour.

2

Cause Analysis

By analyzing the TT log flow with SQL syntax, we discovered that the logic involves a union of four sources followed by a join with the defined versioned table:

select a.*
      ,b.tag
from
(
select * from source_1 
union all 
select * from source_2
union all 
select * from source_3
union all 
select * from source_4
) a
temporay join 
b stream

Source_4 processes a small portion of data from the 59th minute of the current hour at the top of the hour. Since the temporary join is triggered by the watermarks of both streams, this causes stream A to wait until stream B reaches the 59th minute of the current hour before triggering the join.

3

Solution

For parts of source_4 where log_time is greater than the current time, setting log_time to the current time during the temporary join resolves this issue.

4. Summary

  1. In scenarios where the dual-stream join is driven by a single stream, the temporary join is a common processing method.
  2. Temporary joins are triggered by the watermarks of two streams, so preprocessing the watermarks of both streams is necessary to prevent issues like data sparseness and data preemption from affecting data emission.

References

[01] https://www.51cto.com/article/713922.html
[02] https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/table/config/#table-exec-source-idle-timeout


Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.

0 1 0
Share on

Alibaba Cloud Community

1,129 posts | 348 followers

You may also like

Comments

Alibaba Cloud Community

1,129 posts | 348 followers

Related Products