Funnel analysis and retention analysis are common conversion analysis methods and are widely used in data operations and analysis scenarios, such as the analysis of user behavior, the traffic analysis of app data, and analysis of the product goal conversion. This topic describes the limits and features of funnel and retention functions that are supported by Hologres. This topic also provides examples on how to use the functions in Hologres.
Background information
An event represents a specific or series of meaningful behavior of a user, such as downloading, registering with, and logging on to a game app. The actual usage process of the user can be restored by analyzing various behavioral data of the user. This improves product conversion rates and helps business growth. Common user behavior analysis includes event analysis, funnel analysis, and retention analysis. Window funnel analysis and retention analysis are most commonly used for user behavior analysis.
Window funnel analysis: It is an analysis model used to analyze the conversion of users at specified stages. It can help you analyze the conversion rate of user behavior at each stage, and then help you measure the performance at each stage by using the conversion rate. This way, you can take specific measures to optimize user experience on products to increase conversion rates.
Retention analysis: It is an analysis model used to analyze user engagement. It helps you calculate the number of users who remain engaged with your product after they initially sign up or begin using your product. Then, you can judge the value of the product to users.
Hologres is a one-stop real-time data warehousing service developed by Alibaba Cloud. It supports multi-dimensional analysis of real-time data in multiple scenarios. In user behavior analysis scenarios, Hologres provides multiple funnel and retention functions to help you quickly and efficiently analyze user behavior. The functions are widely used in user analysis scenarios in industries such as the Internet, e-commerce, and gaming.
Limits
Only Hologres V0.9 and later support the windowFunnel and retention functions.
Only Hologres V0.10 and later support the range_retention_count and range_retention_sum functions.
Only Hologres V2.1 and later support the range_funnel function.
To use funnel and retention functions, you must execute the following statement as a superuser to install an extension. An extension is installed at the database level. For each database, you need to install an extension only once.
create extension flow_analysis; -- Install an extension.
Scenario description
The examples in this topic are based on the public event datasets on GitHub and are used to describe the functions related to window funnel analysis and retention analysis.
Dataset
A large number of developers develop open source projects on GitHub and generate a large number of events during the development of the projects. GitHub records the type and details of each event, the developer, and the code repository. GitHub also exposes public events, such as the events generated when you add items to your favorites or submit code. For more information about event types, see Webhook events and payloads.
Dataset import
You can use Hologres to import the github_event dataset to a Hologres instance with only a few clicks. For more information, see Import public datasets with a few clicks.
Statements for creating a table for GitHub public events:
BEGIN;
CREATE TABLE hologres_dataset_github_event.hologres_github_event (
id bigint,
actor_id bigint,
actor_login text,
repo_id bigint,
repo_name text,
org_id bigint,
org_login text,
type text,
created_at timestamp with time zone NOT NULL,
action text,
iss_or_pr_id bigint,
number bigint,
comment_id bigint,
commit_id text,
member_id bigint,
rev_or_push_or_rel_id bigint,
ref text,
ref_type text,
state text,
author_association text,
language text,
merged boolean,
merged_at timestamp with time zone,
additions bigint,
deletions bigint,
changed_files bigint,
push_size bigint,
push_distinct_size bigint,
hr text,
month text,
year text,
ds text
);
CALL set_table_property('hologres_dataset_github_event.hologres_github_event', 'orientation', 'column');
CALL set_table_property('hologres_dataset_github_event.hologres_github_event', 'bitmap_columns', 'actor_login,repo_name,org_login,type,action,commit_id,ref,ref_type,state,author_association,language,hr,month,year,ds');
CALL set_table_property('hologres_dataset_github_event.hologres_github_event', 'clustering_key', 'created_at:asc');
CALL set_table_property('hologres_dataset_github_event.hologres_github_event', 'dictionary_encoding_columns', 'actor_login:auto,repo_name:auto,org_login:auto,type:auto,action:auto,commit_id:auto,ref:auto,ref_type:auto,state:auto,author_association:auto,language:auto,hr:auto,month:auto,year:auto,ds:auto');
CALL set_table_property('hologres_dataset_github_event.hologres_github_event', 'distribution_key', 'id');
CALL set_table_property('hologres_dataset_github_event.hologres_github_event', 'segment_key', 'created_at');
CALL set_table_property('hologres_dataset_github_event.hologres_github_event', 'time_to_live_in_seconds', '3153600000');
COMMENT ON TABLE hologres_dataset_github_event.hologres_github_event IS NULL;
ALTER TABLE hologres_dataset_github_event.hologres_github_event OWNER TO test1_developer;
END;
windowFunnel function
Window funnel analysis is an analysis model used to analyze the conversion of users at specified stages. It can help you analyze the conversion rate of user behavior at each stage, and then help administrators or operations administrators measure the performance at each stage by using the conversion rate. This way, they can take specific measures to optimize user experience on products to increase conversion rates.
Function description
You can use the windowFunnel function to query events from a sliding time window. This function calculates the maximum number of events that match the query conditions.
This function returns the maximum number of ordered and matched events starting from the first event in the event list. If a match fails, the entire match ends.
Examples in which the time window is long enough:
The function specifies events c1, c2, and c3, and the user data indicates c1, c2, c3, and c4. In this case, events c1, c2, and c3 are matched. The function returns 3.
The function specifies events c1, c2, and c3, and the user data indicates c4, c3, c2, and c1. In this case, the event c1 is matched. The function returns 1.
The function specifies events c1, c2, and c3, and the user data indicates c4 and c3. In this case, no events are matched. The function returns 0.
Syntax
windowFunnel(window, mode, timestamp, cond1, cond2, ..., condN)
Parameter description
Parameter
Description
window
The length of the time window. The windowFunnel function uses the first event in the specified conditions as the start point of the time window. Then, this function determines the event list based on the length of the time window.
mode
The working mode of the function. Two modes are supported: default and strict. Default value: default.
default: The windowFunnel function matches as many events as possible starting from the first event in the specified time window. For example, if events c1, c2, and c3 are specified as conditions and the user data indicates c1, c2, c1, and c3, this function returns 3.
strict: The
windowFunnel()
function applies the conditions only to unique values. This function stops matching if a duplicate event appears. For example, if events c1, c2, and c3 are specified as conditions and the user data indicates c1, c2, c1, and c3, this function returns 2.
timestamp
The timestamp when an event occurred. Supported data types are TIMESTAMP, INT, and BIGINT.
cond
The event type, which indicates each step of the event.
Examples
The data in the public event datasets on GitHub in the preceding scenario description is used in this example. If you want to analyze the funnel data in which user events match the order of the specified events within a specified period of time, you can specify the following conditions in the SQL statement:
Time window: 30 minutes, which equals 1,800 seconds.
Time period: 2023-07-28 10:00:00+08 to 2023-07-31 10:00:00+08.
Event order: CreateEvent > PushEvent > IssuesEvent.
-- Analyze the funnel data of each user. SELECT actor_id, windowFunnel (1800, 'default', created_at, type = 'CreateEvent',type = 'PushEvent',type = 'IssuesEvent') AS level FROM hologres_dataset_github_event.hologres_github_event WHERE created_at >= TIMESTAMP '2023-07-28 10:00:00+08' AND created_at < TIMESTAMP '2023-07-31 10:00:00+08' GROUP BY actor_id
The following content is a part of the returned result, where:
level=0 indicates that the first event is not matched for the user within the time window.
level=1 indicates that the first event is matched for the user within the time window.
level=2 indicates that the second event is matched for the user within the time window.
level=3 indicates that all three events are matched for the user within the time window.
actor_id | level ----------+------ 69438111 | 0 62698183 | 0 139085682 | 1 1765075 | 1 132836353 | 2 122081842 | 2 89815869 | 3
To improve result readability, you can execute the following SQL statement to view the number of users converted at each stage:
WITH level_detail AS ( SELECT level, COUNT(1) AS count_user FROM ( SELECT actor_id, windowFunnel (1800, 'default', created_at, type = 'CreateEvent', type = 'PushEvent',type = 'IssuesEvent') AS level FROM hologres_dataset_github_event.hologres_github_event WHERE created_at >= TIMESTAMP '2023-07-28 10:00:00+08' AND created_at < TIMESTAMP '2023-07-31 10:00:00+08' GROUP BY actor_id) AS basic_table GROUP BY level ORDER BY level ASC ) SELECT CASE level WHEN 0 THEN 'total' WHEN 1 THEN 'CreateEvent' WHEN 2 THEN 'PushEvent' WHEN 3 THEN 'IssuesEvent' END ,SUM(count_user) over ( ORDER BY level DESC ) FROM level_detail GROUP BY level, count_user ORDER BY level ASC;
The following result is returned:
case | sum ------------+------ total | 864120 CreateEvent | 275053 PushEvent | 120242 IssuesEvent | 2652
range_funnel
When you use Hologres to perform window funnel analysis, you need to collect grouping statistics in most cases, such as statistics by day, statistics by hour, and statistics by another custom time window. To better meet business requirements, Hologres V2.1 and later introduce the additional function range_funnel based on the windowFunnel function.
Function description
The range_funnel function is used to calculate grouping statistics and aggregate the calculation results of a time window. The following section describes the matching logic:
If the function specifies events c1, c2, and c3 and the user data indicates c1, c2, c1, and c3, the function returns 3.
If the function specifies events c1, c1, and c1 and the user data indicates c1, c2, c1, and c3, the function returns 2.
Differences between the windowFunnel function and the range_funnel function:
The windowFunnel function aggregates the input event data only once, and returns a result for the complete time period. The range_funnel function can return both the aggregation result for the complete time period and the grouping statistics for a custom time period. The result is an array.
The windowFunnel function does not support the extraction of multiple identical events, whereas the range_funnel function supports extraction of multiple identical events.
Syntax
range_funnel(window, event_size, range_begin, range_end, interval, event_ts, event_bits)
Parameter description
Parameter
Data type
Description
window
interval
The length of the time window. The range_funnel function uses the first event in the specified conditions as the start point of the time window. Then, this function determines the event list based on the length of the time window. Unit: seconds.
If window is 0, the function performs truncation according to the start point and end point of each time period. If the truncation result is 0:00 every day, data is filtered by calendar day.
event_size
int
The total number of events that you want to analyze.
range_begin
timestamptz/timestamp/date
The start time of the time period for analysis, calculated from the first event.
range_end
timestamptz/timestamp/date
The end time of the time period for analysis, calculated from the first event.
interval
interval
The interval at which the time period for analysis is divided into multiple consecutive intervals. Then, window funnel analysis is performed on each interval to obtain results. Unit: seconds.
event_ts
timestamp/timestamptz
The time when an event occurred. The TIMESTAMP and TIMESTAMPTZ data types are supported.
NoteThis parameter is calculated from the 0:00. The time may differ from the actual time. In most case, this parameter is used to analyze the trend by day or week.
event_bits
bitmap
The event type. The value must be a bitmap of the INT32 type. An event is represented by an order from the least significant bit to the most significant bit. Therefore, the window funnel analysis supports up to 32 events.
Returned result
The range_funnel function returns an array of the INT64 type, which corresponds to BIGINT[] in SQL. Note that the result of the array is encoded. It is displayed by interval and consists of the start time of an interval and the number of extracted events. The length for the start time of an interval is 56 bits, and the length for the number of extracted events is 8 bits. Therefore, after the result is obtained, the content of the array must be decoded to obtain the final matching data.
Examples
The data in the public event datasets on GitHub in the preceding scenario description is used in this example. If you want to analyze the funnel data in which user events match the order of the specified events within a specified period of time and view the grouping result by day, you can specify the following conditions in the SQL statement:
Time window: 1 hour, which equals 3,600 seconds.
Time period: 2023-07-23 to 2023-07-25, 3 days in total
Event order: CreateEvent > PushEvent.
Grouping interval: 1 day, which equals 86,400 seconds. The window funnel analysis results are displayed by day.
The value of the type field is of the TEXT type. The value of the event_bits field in the range_funnel function must be a 32-bit bitmap. Therefore, you must use the bit_construct function to convert the value of the type field into a bitmap.
-- Result without decoding SELECT actor_id, range_funnel (3600, 2, '2023-07-23', '2023-07-25', 86400, created_at::timestamp, bits) AS result FROM ( SELECT actor_id, created_at::timestamp, type, bit_construct (a := type = 'CreateEvent', b := type = 'PushEvent') AS bits FROM hologres_dataset_github_event.hologres_github_event where ds >= '2023-07-23' AND ds <='2023-07-25') tt group by actor_id order by actor_id ;
The following content is a part of the returned result:
actor_id | result ----------+------ 17 |{432658022400,432680140800,432702259202,9223372036854775554} 47 |{432658022402,432680140800,432702259200,9223372036854775554} 235 |{432658022401,432680140800,432702259200,9223372036854775553}
Description of the result field:
If the value of the result field is empty, the behavior of the user does not match the condition within any interval.
If the value of the result field contains data, the data is an undecoded array that includes the overall funnel result by day.
Range funnel decoding functions
The results returned by the range_funnel function are encoded. You need to execute SQL statements to decode the results. Hologres V2.1.6 and later introduce the range_funnel_time
and range_funnel_level
functions for you to decode results returned by the range_funnel function.
Function description
range_funnel_time
: Decodes the event time in the result returned by the range_funnel function. The result is of the INT64 type.range_funnel_level
: Decodes the event level in the result returned by the range_funnel function. The result is of the INT64 type.
Syntax
range_funnel_time(range_funnel()) range_funnel_level(range_funnel())
Parameter description
range_funnel()
: the result of the INT64 type that is returned by therange_funnel
function.Examples
In this example, the
range_funnel_time
andrange_funnel_level
functions are used to decode the result that is returned by the range_funnel function in the preceding example. Sample SQL statement:SELECT actor_id, to_timestamp(range_funnel_time(result)) AS res_time, -- Decode the event time. range_funnel_level(result) AS res_level -- Decode the event level. FROM ( SELECT actor_id, result, count(1) AS cnt FROM ( SELECT actor_id, unnest(range_funnel (3600, 2, '2023-07-23', '2023-07-26', 86400, created_at::timestamp, bits)) as result from ( SELECT actor_id, created_at::timestamp, type, bit_construct (a := type = 'CreateEvent', b := type = 'PushEvent') AS bits from hologres_dataset_github_event.hologres_github_event where ds >= '2023-07-23' AND ds <='2023-07-25' ) a GROUP BY actor_id ) a GROUP BY actor_id ,result ) a ORDER BY actor_id ,res_time LIMIT 10000;
The following content is a part of the returned result, from which you can know the level and number of matches per user per day:
actor_id | res_time | res_level ----------+---------------------+----------- 17 |2023-07-23 08:00:00 | 0 17 |2023-07-24 08:00:00 | 0 17 |2023-07-24 08:00:00 | 2 17 |\N | 0 47 |2023-07-23 08:00:00 | 0 47 |2023-07-24 08:00:00 | 0 47 |2023-07-25 08:00:00 | 2 47 |\N | 2 235 |2023-07-23 08:00:00 | 1 235 |2023-07-24 08:00:00 | 0 235 |2023-07-25 08:00:00 | 0 235 |\N | 1
After you check the funnel result of each user by day, you can further explore the data based on your business requirements.
For example, you can execute the following SQL statement to view the daily step size summary and total summary data, where the next level contains the data of the previous level:
SELECT res_time, res_level, sum(cnt) over (partition by res_time ORDER BY res_level desc) AS res_cnt FROM ( SELECT to_timestamp(range_funnel_time(result)) AS res_time, -- Decode the event time. range_funnel_level(result) AS res_level, -- Decode the event level. cnt FROM ( SELECT result, count(1) AS cnt FROM ( SELECT actor_id, unnest(range_funnel (3600, 2, '2023-07-23', '2023-07-26', 86400, created_at::timestamp, bits)) AS result FROM ( SELECT actor_id, created_at::timestamp, type, bit_construct (a := type = 'CreateEvent', b := type = 'PushEvent') AS bits FROM hologres_dataset_github_event.hologres_github_event WHERE ds >= '2023-07-23' AND ds <='2023-07-25' ) a GROUP BY actor_id ) a GROUP BY result ) a )a WHERE res_level > 0 GROUP BY res_time, res_level, cnt ORDER BY res_time, res_level;
The following content shows the query result:
\N
indicates the result aggregated for multiple days.The cnt field contains the summary data of each level. The next level contains the data of the previous level. For example, if the value of the res_level field is 2 and the value of the cnt field is 1, only one user goes through CreateEvent and PushEvent.
res_time |res_level | cnt --------------------+-----------+------ 2023-07-23 08:00:00 |1 |114455 2023-07-23 08:00:00 |2 |56155 2023-07-24 08:00:00 |1 |154616 2023-07-24 08:00:00 |2 |71200 2023-07-25 08:00:00 |1 |178662 2023-07-25 08:00:00 |2 |81779 \N |1 |390262 \N |2 |188125
retention
Retention analysis is used to analyze user engagement. It is often used to analyze user retention rates over a period of time to help make further product-related decisions.
Function description
The retention function uses a set of conditions as parameters to determine whether an event meets the specified conditions. The parameters specify 1 to 32 conditions whose return values are of the UINT8 type.
Syntax
retention(cond1, cond2, ..., cond32);
Parameter description
Parameter
Description
cond
A conditional expression that returns a result. Return values:
1: The condition is met.
0: The condition is not met.
Examples
Example 1: Public event datasets on GitHub are used. You want to analyze user retention for the PushEvent event since a specific date. SQL statement:
SELECT sum(r[1]) as first_day, sum(r[2]) as second_day, sum(r[3]) as third_day, sum(r[4]) as fourth_day, sum(r[5]) as fifth_day, sum(r[6]) as sixth_day, sum(r[7]) as seventh_day FROM ( SELECT retention (ds = '2023-07-23' AND type = 'PushEvent',ds = '2023-07-24' AND type = 'PushEvent', ds = '2023-07-25'AND type = 'PushEvent', ds = '2023-07-26' AND type = 'PushEvent', ds = '2023-07-27'AND type = 'PushEvent' , ds = '2023-07-28'AND type = 'PushEvent', ds = '2023-07-29'AND type = 'PushEvent') AS r FROM hologres_dataset_github_event.hologres_github_event GROUP BY actor_id) tt;
The following result is returned:
first_day |second_day | third_day | fourth_day | fifth_day |sixth_day |seventh_day ----------+-----------+-----------+------------+------------+-----------+----------- 250678 |105351 |93219 |81071 |81474 |76264 |23319
Example 2: Analyze the retention of each user in the last seven days.
-- Retention of each user SELECT actor_id, retention( DATE(created_at) = DATE(TIMESTAMP '2023-07-23 10:00:00+08') , DATE(created_at) = DATE(TIMESTAMP '2023-07-23 10:00:00+08' + INTERVAL '1 day'), DATE(created_at) = DATE(TIMESTAMP '2023-07-23 10:00:00+08' + INTERVAL '2 day'), DATE(created_at) = DATE(TIMESTAMP '2023-07-23 10:00:00+08' + INTERVAL '6 day') ) AS r FROM hologres_dataset_github_event.hologres_github_event WHERE (created_at >= TIMESTAMP '2023-07-23 10:00:00+08') AND (created_at <= TIMESTAMP '2023-07-23 10:00:00+08' + INTERVAL '6 day') GROUP BY actor_id
The following result is returned:
actor_id | r ----------+------- 122588631 |{1,1,1,0} 909706 |{1,0,1,0} 1893117 |{0,0,0,0} 56136246 |{1,0,1,0} 33266253 |{0,0,0,0} 106332614 |{0,0,0,0}
range_retention_count and range_retention_sum
Function description
Retention analysis is the most common and typical user growth analysis scenario. In most cases, you can use charts to analyze user retention. Hologres provides the range_retention_count and range_retention_sum functions for retention analysis.
range_retention_count: records the retention of each user. The return value of this function is an array of the BIGINT type. The return value of the range_retention_count function cannot be directly read but can be used as the input of the range_retention_sum function.
range_retention_sum: calculates the daily retention rates for all users. The return value of this function is an array of the TEXT type.
Syntax
range_retention_count
range_retention_count(is_first, is_active, dt, retention_interval, retention_granularity,output_format)
range_retention_sum
range_retention_sum(range_retention_count)
Parameter description
Parameter
Data type
Description
is_first
boolean
Specifies whether the first event is matched. Valid values:
true: The first event is matched.
false: The first event is not matched.
is_active
boolean
Specifies whether the retention condition is met. Valid values:
true: The retention condition is met.
false: The retention condition is not met.
dt
date
The date on which the event occurred. Example: 2020-12-12.
retention_interval
int[]
The retention interval. A maximum of 15 retention intervals are supported. Example: ARRAY[1,3,5,7,15,30].
retention_granularity
text
The retention granularity. Valid values:
day
week
month
output_format
text
The output format. Default value: normal. Valid values:
normal
expand
NoteThis parameter is supported only in Hologres V1.1.38 and later.
You can set this parameter to expand to obtain daily retention details.
Examples
Public event datasets on GitHub are used in the following examples to analyze the retention of users.
Example 1: Calculate the retention of each user. SQL statement:
SELECT actor_id, r FROM ( SELECT actor_id, range_retention_count(type = 'CreateEvent', type = 'PushEvent', ds::date, array[1, 2],'day','normal') AS r FROM hologres_dataset_github_event.hologres_github_event GROUP BY actor_id ) AS t ORDER BY actor_id;
The following content is a part of the returned result:
actor_id | r ----------+------- 771 |{84022445211651} 772 |{84022445211651,84026740178947,84031035146243,84048215015424} 777 |{} 797 |{84013855277059,84018150244353,84022445211648,84035330113536}
Example 2: Calculate the retention rates in the last seven days. SQL statement:
WITH tbl_detail AS ( SELECT range_retention_count(is_first, is_active, dt, ARRAY[1, 3, 7],'day' ) AS detail FROM ( SELECT actor_id, created_at::DATE AS dt , CASE WHEN created_at >= timestamp '2023-07-23 00:00:00' AND created_at < timestamp '2023-07-23 00:00:00' + INTERVAL '7' day and type = 'CreateEvent' THEN true ELSE false END AS is_first , CASE WHEN created_at >= timestamp '2023-07-23 00:00:00' + INTERVAL '1' day AND created_at < timestamp '2023-07-23 00:00:00' + INTERVAL '7' day + INTERVAL '7' day THEN true ELSE false END AS is_active FROM hologres_dataset_github_event.hologres_github_event ) tbl GROUP BY actor_id ), tbl_sum AS ( SELECT regexp_split_to_array(unnest(range_retention_sum(detail)), ',') AS s FROM tbl_detail ) SELECT s[1] AS Start date ,s[3]::numeric / s[2]::numeric AS First day ,s[4]::numeric / s[2]::numeric AS Third day ,s[5]::numeric / s[2]::numeric AS Seventh day FROM tbl_sum ORDER BY s[1];
The following result is returned:
Start date | First day | Third day | Seventh day ----------+-------------+--------------+--------- 20230723 |0.3656546240 |0.2603730723 |0.0000000000 20230724 |0.4076098204 |0.3123156723 |0.2030384953 20230725 |0.3930270566 |0.3154280149 |0.0000000000 20230726 |0.4275885111 |0.1000445494 |0.0000000000 20230727 |0.4043748374 |0.0000000000 |0.0000000000 20230728 |0.1463595703 |0.2279064106 |0.0000000000 20230729 |0.0000000000 |0.0000000000 |0.0000000000