By Tuyuan
In real-time development, when obtaining the attributes corresponding to a specific moment through a dual-stream join, temporary joins are often utilized. During the real-time iteration of traffic upgrades, the author needed to accurately match traffic logs with product attributes corresponding to the browsing time, and encountered several challenges while developing with temporary joins. This article compiles some of those experiences for reference and discussion.
Before reading this article, it is recommended to familiarize yourself with the background knowledge on real-time FlinkSQL dual-stream joins.
Currently, we have a TT stream A with traffic log details and a TT stream B with commodity tags. Performing a dual-stream join on streams A and B in Flink is similar to joining stream A with an HBase dimension table.
A temporary join has the following features:
Source: Jincheng Sun, “Blink Series - Temporal Table JOIN”
When it is necessary to calculate the total amount based on real-time exchange rates currency amounts or compute the total transaction amount from real-time product prices the number of items sold, a temporary join is often used to obtain the real-time exchange rate and price information. In my traffic upgrade business iteration, we need to acquire real-time product tags. Therefore, it is required to define a versioned table for product tags. The syntax is as follows:
CREATE TEMPORARY TABLE `tag_ri` (
`id` VARCHAR,
`tag` VARCHAR,
`time` VARCHAR,
`ts` AS `TO_TIMESTAMP`(`time`, 'yyyy-MM-dd HH:mm:ss'),
WATERMARK FOR `ts` AS `withOffset`(`ts`, 0) -- Define the watermark
) WITH (
'connector' = 'tt',
'router' = '******',
'topic' = 'tag_ri',
'lineDelimiter' = '\n',
'fieldDelimiter' = '\u0001',
'encoding' = 'utf-8'
);
-- Define the version table
CREATE TEMPORARY VIEW `tag`
AS
SELECT `id`
, `tag`
, `time`
, `ts`
FROM ( SELECT `id`
, `tag`
, `time`
, `ts`
, ROW_NUMBER() OVER (PARTITION BY 'id' -- Join the primary key
ORDER BY `time` DESC) AS `rownum`
FROM `tag_ri`
)
WHERE `rownum` = 1;
Similarly, we also need to define a watermark for the traffic log detail stream and perform a dual-stream join.
CREATE TEMPORARY TABLE `log_ri` (
`id` VARCHAR,
`time` VARCHAR,
......
`ts` AS `TO_TIMESTAMP`(`time`, 'yyyy-MM-dd HH:mm:ss'),
WATERMARK FOR `ts` AS `withOffset`(`ts`, 0)
) WITH (
'connector' = 'tt',
'router' = '******',
'topic' = 'log_ri',
'lineDelimiter' = '\n',
'fieldDelimiter' = '\u0001',
'encoding' = 'utf-8',
);
select `a`.`id`
,......
,`b`.`tag`
from (
SELECT *
FROM `log_ri`
) AS `a`
LEFT JOIN `tag` FOR SYSTEM_TIME AS OF `a`.`ts` AS `b` ON `a`.`id` = `b`.`id`
The result is as follows:
-- Product tag information
12:00> SELECT * FROM tag_ri;
id tag (product tag)
======= =======================
t1 A
12:30> SELECT * FROM tag_ri;
id tag (product tag)
======= =======================
t1 B
-- Traffic log detail query, showing three entries for product t1
SELECT * FROM log_ri;
id time
======= ========
t1 12:00
t1 12:15
t1 12:30
-- Perform a temporary join
select `a`.`id`
,`a`.`time`
,`b`.`tag`
from (
SELECT *
FROM `log_ri`
) AS `a`
LEFT JOIN `tag` FOR SYSTEM_TIME AS OF `a`.`ts` AS `b` ON `a`.`id` = `b`.`id`
id time tag (product tag)
======= ======== =======================
t1 12:00 A
t1 12:15 A
t1 12:30 B
Since a temporary join is triggered by the watermarks of two streams, if the versioned table represents a sparse stream (with no data flowing in for a period), the join might exhibit behavior where it waits without emitting any data. You can set the parameter set table.exec.source.idle-timeout = 10s, allowing data from stream A to bypass waiting.
For more information about the specific parameters, please refer to:
https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/table/config/#table-exec-source-idle-timeout
During actual development, we observed that after performing a temporary join, data would consistently wait before being emitted, only proceeding at the top of the hour.
By analyzing the TT log flow with SQL syntax, we discovered that the logic involves a union of four sources followed by a join with the defined versioned table:
select a.*
,b.tag
from
(
select * from source_1
union all
select * from source_2
union all
select * from source_3
union all
select * from source_4
) a
temporay join
b stream
Source_4 processes a small portion of data from the 59th minute of the current hour at the top of the hour. Since the temporary join is triggered by the watermarks of both streams, this causes stream A to wait until stream B reaches the 59th minute of the current hour before triggering the join.
For parts of source_4 where log_time is greater than the current time, setting log_time to the current time during the temporary join resolves this issue.
[01] https://www.51cto.com/article/713922.html
[02] https://nightlies.apache.org/flink/flink-docs-release-1.17/docs/dev/table/config/#table-exec-source-idle-timeout
Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.
Alibaba Cloud's Industry Leadership Recognized by Top Global Research Firms
Alibaba Cloud Unveils QwQ-32B: A Compact Reasoning Model with Cutting-Edge Performance
1,129 posts | 348 followers
FollowFarruh - January 12, 2024
Apache Flink Community China - July 27, 2021
Alibaba Cloud Community - September 3, 2024
Apache Flink Community China - July 28, 2020
Apache Flink Community China - December 25, 2019
ApsaraDB - January 15, 2024
1,129 posts | 348 followers
FollowRealtime Compute for Apache Flink offers a highly integrated platform for real-time data processing, which optimizes the computing of Apache Flink.
Learn MoreAlibaba Cloud provides big data consulting services to help enterprises leverage advanced data technology.
Learn MoreAlibaba Cloud experts provide retailers with a lightweight and customized big data consulting service to help you assess your big data maturity and plan your big data journey.
Learn MoreApsaraDB for HBase is a NoSQL database engine that is highly optimized and 100% compatible with the community edition of HBase.
Learn MoreMore Posts by Alibaba Cloud Community