This topic describes how to use funnel and retention functions to perform funnel analysis in AnalyticDB for MySQL.
Prerequisites
The minor version of an AnalyticDB for MySQL cluster is 3.1.6.0 or later.
To view and update the minor version of an AnalyticDB for MySQL cluster, log on to the AnalyticDB for MySQL console and go to the Configuration Information section of the Cluster Information page.
Background information
Funnel analysis is a common conversion analysis method that shows user behavior conversion rates at various stages. It is widely used in user behavior analysis and app data analytics for tasks such as traffic analysis and product goal conversion analysis. AnalyticDB for MySQL supports the following funnel and retention functions:
window_funnel: searches an event list in a sliding time window and counts the maximum number of consecutive events.
retention: analyzes whether an event meets specified conditions.
retention_range_count and retention_range_sum: record and summarize retention status.
Test dataset
In this topic, the Taobao behavior data from Tianchi Lab is used to test funnel and retention functions. For more information, see User Behavior Data from Taobao for Recommendation.
The following table describes four types of user behavior that are provided in the test dataset.
Behavior type: | Description |
pv | Each product page view is counted as a click. |
buy | Purchases a commodity. |
cart | Adds a commodity to the shopping cart. |
fav | Adds a commodity to favorites. |
Before testing, upload the test dataset to Object Storage Service (OSS) and then import the data from OSS into AnalyticDB for MySQL.
Upload the test dataset to OSS. For more information, see Upload objects.
Use an OSS external table to import the data into AnalyticDB for MySQL. The following example shows the required operations:
Create an OSS external table.
CREATE TABLE `user_behavior_oss` ( `user_id` string, `item_id` string, `cate_id` string, `event` string, `ts` bigint) ENGINE = 'oss' TABLE_PROPERTIES = '{ "endpoint":"oss-cn-zhangjiakou.aliyuncs.com", "accessid":"******", "accesskey":"*******", "url":"oss://<bucket-name>/user_behavior/", "delimiter":"," }'For more information, see OSS external table syntax.
Create an AnalyticDB for MySQL test table.
CREATE TABLE user_behavior( uid string, event string, ts string )Import data from the OSS external table to the test table.
SUBMIT JOB INSERT OVERWRITE user_behavior SELECT user_id , event, ts FROM user_behavior_oss;
window_funnel
The window_funnel function is used to search an event list in a sliding time window and count the maximum number of consecutive events. The window_funnel function starts the count from the first event in the event list that you specify, checks the events in sequence, and then returns the maximum number of consecutive events.
If you specify an event list that contains c1, c2, and c3 and the user data contains c1, c2, c3, and c4, the function returns 3.
If you specify an event list that contains c1, c2, and c3 and the user data contains c4, c3, c2, and c1, the function returns 1.
If you specify an event list that contains c1, c2, and c3 and the user data contains c4 and c3, the function returns 0.
Syntax
window_funnel(window, mode, timestamp, cond1, cond2, ..., condN)Parameters
Parameter | Description |
window | The size of the sliding time window. |
mode | The working mode of the function. This parameter is an extensible parameter. Set the value to default. |
timestamp | The time column, which must be of the BIGINT data type. If the time column is of another data type, you must convert it to the BIGINT type. For example, for the TIMESTAMP data type, you must use the TIMESTAMPDIFF function to query the difference between the timestamp and the initial time. Sample statement: |
cond | The event steps. |
Examples
For example, you want to analyze the conversion status from 2017-11-25 00:00:00 to 2017-11-26 00:00:00 based on the following conversion path: browse a commodity > add the commodity to favorites > add the commodity to the shopping cart > purchase the commodity. The related SQL statement involves the following conditions:
The size of the sliding time window is 30 minutes, which is equivalent to 1,800 seconds.
The query time is from 2017-11-25 00:00:00 to 2017-11-26 00:00:00, which is equivalent to 1511539200 to 1511625600 in the timestamp format.
The event steps are performed in the following order: browse a commodity > add the commodity to favorites > add the commodity to the shopping cart > purchase the commodity.
Execute the following SQL statement:
SELECT funnel, count(1) FROM (SELECT uid, window_funnel(cast(1800 as integer),"default", ts, event='pv', event='fav', event='cart', event='buy') AS funnel FROM user_behavior WHERE ts > 1511539200 AND ts < 1511625600 group by uid) GROUP BY funnel;Sample result:
+--------+----------+
| funnel | count(1) |
+--------+----------+
| 0 | 19687 |
| 2 | 78458 |
| 3 | 11640 |
| 1 | 596104 |
| 4 | 746 |
+--------+----------+
5 rows in set (0.64 sec)retention
The retention function can use a set of conditions as parameters to analyze whether an event meets the conditions.
Syntax
retention(cond1, cond2, ..., cond32)Parameters
Parameter | Description |
cond | The analysis condition, which is of the UINT8 data type. You can specify 1 to 32 conditions. If a condition is met, the return value is increased by 1. If a condition is not met, the return value is increased by 0. |
Examples
For example, you want to analyze the user retention status as of November 25, 2017. The related SQL statement involves the following conditions:
The start date for analysis is 2017-11-25.
The number of active users on the first day is represented by
sum(r[1]). The number of retained users from the second to the seventh day is represented bysum(r[2]), sum(r[3])...sum(r[7]).
Execute the following SQL statement:
SELECT sum(r[1]),sum(r[2]),sum(r[3]),sum(r[4]),sum(r[5]),sum(r[6]),sum(r[7]) FROM (SELECT retention(ds='2017-11-25' AND event='pv', ds='2017-11-25', ds='2017-11-26',ds='2017-11-27',ds='2017-11-28',ds='2017-11-29',ds='2017-11-30') AS r FROM user_behavior_date GROUP BY uid);Sample result:
+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| sum(r[1]) | sum(r[2]) | sum(r[3]) | sum(r[4]) | sum(r[5]) | sum(r[6]) | sum(r[7]) |
+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| 686953 | 686953 | 544367 | 529979 | 523516 | 524530 | 528105 |
+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
1 row in set (2.96 sec)retention_range_count and retention_range_sum
The retention analysis functions (retention_range_count and retention_range_sum) can be used to generate visualized graphs and analyze user retention in user growth analysis scenarios. The retention_range_count function can be used to record the retention status of each user. The return value of this function is a 2-dimensional array and can serve as an input parameter of the retention_range_sum function. The retention_range_sum function can be used to summarize the daily retention rate of all users.
Syntax
retention_range_count
retention_range_count(is_first, is_active, dt, intervals, outputFormat)retention_range_sum
retention_range_sum(retention_range_count_result)
Parameters
Parameter | Description |
is_first | Specifies whether the first event is matched. Valid values:
|
is_active | Specifies whether the retention condition is met. Valid values:
|
dt | The date on which the behavior occurred. The format is date, for example, |
intervals[] | The retention interval. Up to 15 retention intervals are supported. |
outputFormat | The format of the return value. Valid values:
The default value is normal. |
Examples
For example, you want to query the retention status on May 1, 2022 and May 2, 2022 based on the user data from May 1, 2022 to May 4, 2022. The activation event is login, and the retention event is pay.
Create a test table and insert data into the table.
Create a table.
CREATE TABLE event(uid string, event string, ds date);Insert data into the table.
INSERT INTO event VALUES("user1", "pay", "2022-05-01"),("user1", "login", "2022-05-01"),("user1", "pay", "2022-05-02"),("user1", "login", "2022-05-02"),("user2", "login", "2022-05-01"),("user3", "login", "2022-05-02"),("user3", "pay", "2022-05-03"),("user3", "pay", "2022-05-04");Sample test data:
+-------+-------+------------+ | uid | event | ds | +-------+-------+------------+ | user1 | login | 2022-05-01 | | user1 | pay | 2022-05-01 | | user1 | login | 2022-05-02 | | user1 | pay | 2022-05-02 | | user2 | login | 2022-05-01 | | user3 | login | 2022-05-02 | | user3 | pay | 2022-05-03 | | user3 | pay | 2022-05-04 | +-------+-------+------------+
Query the retention status of each user.
SELECT uid, r FROM ( SELECT uid, retention_range_count(event = 'login', event = 'pay', ds, array(1, 2)) AS r FROM event GROUP BY uid ) AS t ORDER BY uid;Sample result:
+-------+-----------------------------+ | uid | r | +-------+-----------------------------+ | user1 | [[738642,0,0],[738641,1,0]] | | user2 | [[738641,0,0]] | | user3 | [[738642,1,1]] | +-------+-----------------------------+Query the daily retention rate of all users.
SELECT from_days(u [1]) AS ds, u [3] / u [2] AS retention_d1, u [4] / u [2] AS retention_d2 FROM ( SELECT retention_range_sum(r) AS r FROM ( SELECT uid, retention_range_count(event = 'login', event = 'pay', ds, array(1, 2)) AS r FROM event GROUP BY uid ) AS t ORDER BY uid ) AS r, unnest(r.r) AS t(u);Sample result:
+------------+--------------+--------------+ | ds | retention_d1 | retention_d2 | +------------+--------------+--------------+ | 2022-05-02 | 0.5 | 0.5 | | 2022-05-01 | 0.5 | 0.0 | +------------+--------------+--------------+