
Artificial Intelligence Recommendation: Data diagnostics

Last Updated: Jun 16, 2025

Data diagnostics analyzes user tables, item tables, and behavior tables to identify usable features, guide discretization parameter settings, determine the time window needed to compute user preference and item feature statistics, and estimate how much data the training samples require. The results help you configure data quality checks and model training resources sensibly, which makes feature engineering more rigorous, model training more efficient, and recommendation results more accurate.

Types of data diagnostics tasks

PAI-Rec data diagnostics includes the following task types:

Task type

Description

Basic statistical analysis

Analyzes the value distribution and missing rate of each field so that you can select usable features (excluding fields with high missing rates or abnormal values). For abnormal features, it also helps you check whether the cause is a problem in log upload, storage, or cleaning.

Analysis on item or user change rate

Used to analyze user tables or item tables.

For example, you can analyze the user_id field in a user table to calculate the number and percentage of user IDs added and removed each day. If many new users arrive daily, you need to consider recommendation strategies for new users. Likewise, if many new items are added daily, you need to consider recommendation strategies for cold start items.

Analysis on statistical period of user preference

  • Recurrence rate: Recurrence rate over k days = (Number of users who visited both on Day T and at least once from Day T-k to Day T-1) / (Number of users who visited on Day T). Adjust k to find the values k1 and k2 at which the recurrence rate reaches 80% and 90%, respectively. These values are then used in feature engineering to set the window for computing user preference features and the number of days of data to include in training samples. If the recurrence rate over k2 days is below 90%, more than 10% of the users who visited on Day T were not active in the preceding k2 days. The system has no recent preference or behavior data for these users, so you need to apply a cold start policy to them.

  • Retention rate: Retention rate over k days = (Number of users who visited on both Day T and Day T+k) / (Number of users who visited on Day T).

Two-table join analysis

Checks whether behavior data is usable, whether IDs are unique, and whether features are available after the join. For example, when a behavior table is joined with an item table, many item feature fields may turn out to be empty. In that case, you need to analyze why the values are missing.

Exception analysis

Analyzes user behavior tables based on upstream and downstream behaviors that you define. For example, if the upstream behavior is exposure, the downstream behavior is click or add to cart. If the upstream behavior is click, the downstream behavior is like or comment. To analyze both pairs of upstream and downstream behaviors, create two diagnostics tasks.

If the number of exposures or conversion rate of specific users or items is too high, analyze the user logs.
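The recurrence rate and retention rate formulas in the table above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how PAI-Rec computes the report. The `visits` structure (a mapping from a date to the set of user IDs active on that day), the function names, and the `max_k` search limit are all assumptions made for this example.

```python
from datetime import date, timedelta


def recurrence_rate(visits, day_t, k):
    """Share of Day T users who also visited during Day T-k .. Day T-1."""
    today = visits.get(day_t, set())
    if not today:
        return 0.0
    earlier = set()
    for offset in range(1, k + 1):
        earlier |= visits.get(day_t - timedelta(days=offset), set())
    return len(today & earlier) / len(today)


def retention_rate(visits, day_t, k):
    """Share of Day T users who also visited on Day T+k."""
    today = visits.get(day_t, set())
    if not today:
        return 0.0
    later = visits.get(day_t + timedelta(days=k), set())
    return len(today & later) / len(today)


def smallest_k(visits, day_t, target, max_k=30):
    """Smallest k whose recurrence rate reaches target, or None if none does."""
    for k in range(1, max_k + 1):
        if recurrence_rate(visits, day_t, k) >= target:
            return k
    return None


# Toy data (hypothetical): date -> set of user IDs active on that day.
day_t = date(2025, 6, 10)
visits = {
    day_t - timedelta(days=3): {"u1", "u5"},
    day_t - timedelta(days=2): {"u2"},
    day_t - timedelta(days=1): {"u3"},
    day_t: {"u1", "u2", "u3", "u4", "u5"},
    day_t + timedelta(days=1): {"u1", "u2"},
}

k1 = smallest_k(visits, day_t, 0.8)   # window where recurrence reaches 80%
k2 = smallest_k(visits, day_t, 0.9)   # None here: u4 never visited before Day T
print(k1, k2, retention_rate(visits, day_t, 1))
```

In this toy data, user u4 has no earlier visits at all, so no window reaches 90%. That is the situation in which the table recommends a cold start policy for the uncovered users.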

Prerequisites

You have registered data tables.

Create a diagnostics task

  1. Log on to the PAI-Rec console, and choose Recommendation Solution Customization > Data Diagnostics in the left-side navigation pane.

  2. On the Task Management tab, click Create Diagnostics Task. In the Create Diagnostics Task panel, select the corresponding task type and complete the relevant configurations.

    Basic statistical analysis

    Parameter

    Description

    Partition Field

    Select the corresponding ds field. The yyyymmdd and yyyy-mm-dd formats are supported.

    Tag Field

    Select the field to analyze.

    Tag Field Delimiter

    Select the delimiter for the Tag fields.

    KV Field

    Select the fields of the KV type (such as descriptions of user category preferences). This will analyze the number of keys and the distribution of values. If there are none, you can leave this parameter empty.

    KV Delimiter

    Specify the delimiter for each group of KV data.

    Text Field

    Select fields of the text type, such as a title. If there are none, you can leave this parameter empty.

    Null Value of STRING Type

    Specify the values to treat as null when counting null values and calculating the null rate.

    Example: a space.

    Data Percentile Distribution

    Specify the percentiles to calculate. Separate multiple percentiles with commas (,).

    Default percentiles: 0%, 1%, 25%, 50%, 75%, 99%, 100%.

    Periodic Running

    • No (default): Does not perform periodic analysis on the data table. The default business time for rerunning tasks is 7 days.

    • Yes: Set Periodic Running Time to perform periodic analysis on the data table.

    Analysis on item or user change rate

    Parameter

    Description

    Partition Field

    Select the corresponding ds field. The yyyymmdd and yyyy-mm-dd formats are supported.

    Analysis Field

    Select a field with unique identification information.

    Periodic Running

    • No (default): Does not perform periodic analysis on the data table. The default business time for rerunning tasks is 7 days.

    • Yes: Set Periodic Running Time to perform periodic analysis on the data table.

    Analysis on statistical period of user preference

    Parameter

    Description

    Partition Field

    Select the corresponding ds field. The yyyymmdd and yyyy-mm-dd formats are supported.

    User ID Field

    Select a field that uniquely identifies users.

    Statistical Period of Recurrence Rate

    Enter the number of days for the period to calculate. If there are multiple periods to calculate, separate them with commas (,).

    Statistical Period of Single-day Retention Rate

    Enter the number of days for the period to calculate. If there are multiple periods to calculate, separate them with commas (,).

    Statistical Period of Periodic Retention Rate

    Specify the retention rate from one period to another, such as the retention rate of January users in February.

    You can select By Week (1 week, 4 weeks, or 12 weeks) or By Month (1 month or 2 months).

    Periodic Running

    • No (default): Does not perform periodic analysis on the data table. The default business time for rerunning tasks is 7 days.

    • Yes: Set Periodic Running Time to perform periodic analysis on the data table.

    Two-table join analysis

    Parameter

    Description

    Left Table

    Select the data table to be joined. The left table is typically the behavior table.

    Left Table Partition Field

    Select the corresponding ds field. The yyyymmdd and yyyy-mm-dd formats are supported.

    Left Table Analysis Field

    Select the field to analyze.

    Right Table

    Select the data table to be joined.

    Right Table Partition Field

    Select the corresponding ds field. The yyyymmdd and yyyy-mm-dd formats are supported.

    Right Table Analysis Field

    Select the field to analyze.

    Task Name

    Specify a name for the node.

    Join Field

    Select the fields that exist in both the left and right tables and are used as the join keys.

    Join Failures Displayed

    Enter the maximum number of records that failed to join to display in the report.

    Example: 10.

    Periodic Running

    • No (default): Does not perform periodic analysis on the data table. The default business time for rerunning tasks is 7 days.

    • Yes: Set Periodic Running Time to perform periodic analysis on the data table.

    Exception analysis

    Parameter

    Description

    Partition Field

    Select the corresponding ds field. The yyyymmdd and yyyy-mm-dd formats are supported.

    User ID Field

    Select a field that uniquely identifies users.

    Item ID Field

    Select a field that uniquely identifies items.

    Behavior Field

    Select a field that distinguishes different behavior events.

    Upstream Behaviors

    Enter the upstream behavior events to analyze. If there are multiple behaviors, separate them with commas (,).

    Downstream Behaviors

    Enter the downstream behavior events to analyze. If there are multiple behaviors, separate them with commas (,).

    Buckets

    Enter the number of equal-interval buckets used to segment the behavior data. The report shows how users are distributed across the buckets.

    Periodic Running

    • No (default): Does not perform periodic analysis on the data table. The default business time for rerunning tasks is 7 days.

    • Yes: Set Periodic Running Time to perform periodic analysis on the data table.

  3. Click Save And Calculate.

View diagnostics reports

After you create a diagnostics task, perform the following operations to view the reports: Choose Recommendation Solution Customization > Data Diagnostics. On the Task Management tab, click Diagnostics Report in the Actions column of the desired diagnostics task.

image

The following diagnostics reports are examples for reference only. Your actual data diagnostics results prevail.

Basic statistical analysis

The basic statistical analysis report shows the daily user volume, along with the maximum value, minimum value, percentiles, and frequency statistics for multiple bigint features.

  • Missing rate. In this example, the city field has a missing rate greater than 0.4 and requires attention.

    image

  • Daily data volume

    image

  • Unique value statistics, showing the number of unique values in each field.

    image

    image

  • Percentile statistics. Taking the age field as an example, the 95th percentile is 50 years old, the maximum value is 52 years old, and the minimum value is 18 years old.

    image

  • Histogram statistics, which divide the data into 10 buckets and show the count in each bucket.

    image

  • Top 10 frequency statistics. Taking the age field as an example, the following figure shows the 10 ages that occur most frequently.

    image

  • Frequency percentiles. Check whether the maximum frequency matches the frequency of the top-ranked value in the Top 10 frequency statistics.

    image
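The statistics in the basic statistical analysis report can be approximated offline to sanity-check a field before you create a task. The sketch below assumes a nearest-rank percentile definition and treats None, an empty string, and a single space as null. Both choices are assumptions for illustration, and the method PAI-Rec actually uses may differ. The sample `cities` and `ages` values are invented.

```python
import math
from collections import Counter


def missing_rate(values, null_values=("", " ")):
    """Share of entries that are None or appear in null_values."""
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None or v in null_values)
    return missing / len(values)


def percentile(sorted_values, p):
    """Nearest-rank percentile for p in [0, 100]; input must be sorted and non-empty."""
    if p <= 0:
        return sorted_values[0]
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[rank - 1]


# Invented sample data for illustration.
cities = ["Hangzhou", "", None, "Beijing", " "]
ages = sorted([18, 21, 25, 25, 30, 34, 41, 47, 50, 52])

print("city missing rate:", missing_rate(cities))

# Default percentile list from the task configuration.
for p in (0, 1, 25, 50, 75, 99, 100):
    print(f"age p{p}:", percentile(ages, p))

# Top 10 most frequent values with their counts.
print(Counter(ages).most_common(10))
```

A field whose missing rate exceeds a threshold you choose (for example, the 0.4 seen for the city field above) is a candidate for exclusion or for a closer look at the logging pipeline.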

Analysis on item or user change rate

The item or user change rate analysis report shows the number of items or users added and removed over a period of time, and how the addition and removal rates change. For a user table, for example, the report shows the number of users added and removed and the trend of the user addition and removal rates.

image

image
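The added and removed counts in this report amount to set differences between two consecutive partitions. The following sketch shows the idea. The `daily_change` function is hypothetical, and dividing by the previous day's ID count to get the rates is an assumption, because the exact denominator used in the report is not stated here.

```python
def daily_change(prev_ids, curr_ids):
    """Compare the ID sets of two consecutive days.

    Rates are relative to the previous day's ID count (an assumption).
    """
    added = curr_ids - prev_ids
    removed = prev_ids - curr_ids
    base = len(prev_ids)
    return {
        "added": len(added),
        "removed": len(removed),
        "add_rate": len(added) / base if base else 0.0,
        "remove_rate": len(removed) / base if base else 0.0,
    }


# Invented user_id sets for two consecutive ds partitions.
yesterday = {"u1", "u2", "u3", "u4"}
today = {"u2", "u3", "u4", "u5", "u6"}

change = daily_change(yesterday, today)
print(change)
```

A consistently high addition rate suggests that new-user or cold start item strategies deserve attention, as noted in the task type description.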

Analysis on statistical period of user preference

The report for the analysis on statistical period of user preference summarizes user behavior statistics and shows user retention.

image

image

image

Two-table join analysis

The two-table join analysis report analyzes how well the data in two related tables matches and shows the association rate, that is, how often records in the left table can be joined to the right table.

image

image

image
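A rough offline equivalent of this check is to look up each left-table key in the right table, then report the match rate and a capped sample of failures. The sketch below uses plain lists of dictionaries. The `join_report` function, its field names, and the sample rows are hypothetical, and `max_failures` only mirrors the idea of the Join Failures Displayed parameter.

```python
def join_report(left_rows, right_rows, key, max_failures=10):
    """Summarize how many left rows find a match in the right table on `key`."""
    right_keys = [row[key] for row in right_rows]
    right_key_set = set(right_keys)
    failures = [row for row in left_rows if row[key] not in right_key_set]
    matched = len(left_rows) - len(failures)
    return {
        "match_rate": matched / len(left_rows) if left_rows else 0.0,
        # Non-zero means the right-table key is not unique.
        "duplicate_right_keys": len(right_keys) - len(right_key_set),
        "failed_samples": failures[:max_failures],
    }


# Invented behavior (left) and item (right) tables.
behavior = [
    {"user_id": "u1", "item_id": "i1"},
    {"user_id": "u2", "item_id": "i2"},
    {"user_id": "u3", "item_id": "i9"},
    {"user_id": "u1", "item_id": "i2"},
]
items = [
    {"item_id": "i1", "category": "book"},
    {"item_id": "i2", "category": ""},
]

report = join_report(behavior, items, key="item_id")
print(report)
```

Rows that do match can still carry empty feature values, such as the empty category for i2 here, so a high match rate alone does not guarantee usable features.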

Exception analysis

The exception analysis report analyzes upstream and downstream behaviors and shows whether there are abnormalities where downstream behaviors exceed upstream behaviors.

  • Abnormal ratio. A low abnormal ratio indicates that downstream behaviors rarely or never exceed upstream behaviors.

    image

  • Upstream behavior count statistics show the count of exposures, divided into 10 buckets. The x-axis represents the mean of upstream behavior counts, and the y-axis represents the frequency. Downstream behavior count statistics follow the same principle.

    image

  • Conversion rate analysis divides the conversion rate into 10 intervals and shows the count in each interval.

    image

  • Top statistics analysis displays the top values for upstream behaviors, downstream behaviors, and conversion rates, allowing you to identify the corresponding user IDs for more detailed analysis.

    image
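The checks in this report can be illustrated with per-user counts of upstream and downstream events. The sketch below flags users whose downstream count exceeds their upstream count, computes a per-user conversion rate, and buckets the counts into equal-width intervals. The event tuples, the function names, and the definition of the abnormal ratio as abnormal users divided by all users are assumptions for illustration only.

```python
from collections import Counter


def conversion_stats(events, upstream, downstream):
    """events: iterable of (user_id, behavior) pairs."""
    up, down = Counter(), Counter()
    for user_id, behavior in events:
        if behavior in upstream:
            up[user_id] += 1
        elif behavior in downstream:
            down[user_id] += 1
    users = set(up) | set(down)
    # Abnormal: more downstream events than upstream events.
    abnormal = sorted(u for u in users if down[u] > up[u])
    return {
        "upstream": up,
        "downstream": down,
        "rates": {u: down[u] / up[u] for u in users if up[u] > 0},
        "abnormal_users": abnormal,
        "abnormal_ratio": len(abnormal) / len(users) if users else 0.0,
    }


def equal_width_buckets(values, n_buckets):
    """Count values per equal-width interval between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets or 1.0  # avoid zero width when all values are equal
    counts = [0] * n_buckets
    for v in values:
        counts[min(int((v - lo) / width), n_buckets - 1)] += 1
    return counts


# Invented behavior log: u3 and u4 have more clicks than exposures.
events = (
    [("u1", "exposure")] * 4 + [("u1", "click")]
    + [("u2", "exposure")] * 2 + [("u2", "click")]
    + [("u3", "exposure")] + [("u3", "click")] * 3
    + [("u4", "click")]
)

stats = conversion_stats(events, upstream={"exposure"}, downstream={"click"})
print(stats["abnormal_users"], stats["abnormal_ratio"])
print(equal_width_buckets(list(stats["upstream"].values()), 3))
```

Users that appear in `abnormal_users`, or at the top of the upstream, downstream, or conversion rate rankings, are the ones whose logs are worth inspecting in detail.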

View task logs

After you create a diagnostics task, you can view the task progress on the Task Logs tab. To go to the tab, choose Recommendation Solution Customization > Data Diagnostics.

  • Click View Log in the Actions column of the desired task to view the task log.

  • Click Configuration in the Actions column of the desired task to view the configuration code that was used when the task was created.

image