All Products
Search
Document Center

Dataphin:Create a data profiling task

Last Updated:Jun 23, 2026

Data profiling helps you quickly understand data overview, evaluate data availability, and identify potential risks in advance. You can perform full profiling or sampling profiling on data tables.

Prerequisites

You must activate the Data Quality module to use data profiling. Before V5.2.1, only the Domain Quality module was supported. In V5.2.1 and later, you can activate either the Domain Quality module or the Global Quality module.

Limits

  • Data profiling is not supported for meta tables, mirror tables, or logical tag tables.

  • Data profiling is not supported when the storage class of the data table is Hologres or Kudu.

  • You can select a maximum of 1500 fields for each profiling task.

  • For information about the data sources and compute engines that support data profiling, see Supported profiling partitions and ranges for different data sources.

  • If the data source is Hive and the table is a data lake table, you must enable Spark tasks for the table's compute engine to perform data profiling.

Important

Selecting many profiling fields or scenarios may consume significant computing resources from the data source, project, or business unit of the table. This can affect ad hoc queries and periodic task scheduling. Select only the fields and scenarios you need for each profiling task.

Permissions

  • Super administrators and operations administrators can create and manage data profiling for all data tables. Table owners can only create and manage data profiling for tables they own.

  • Project administrators can create and manage data profiling for physical tables in the projects they manage.

  • Business unit administrators can create and manage data profiling for logical tables in the business units they manage.

Scenarios

The core fields of the source table in an integration task are of the JSON type. You need to use the get_json_object function to extract key field information before synchronizing it to the target table. You can create an automatic data profiling task for the target table that triggers after the integration task runs successfully. This verifies whether the parsing results of the core fields meet expectations, such as whether empty values exist or whether the field distribution is as expected, avoiding impact on downstream business use.

Configure manual data profiling

  1. In the top navigation bar of the Dataphin homepage, choose Governance > Asset Inventory.

  2. Select Dataphin assets, click the Table tab. You can filter physical tables, logical tables, physical views, logical views, materialized views, or select other system assets based on table type to enter the list.

  3. In the list, click the name of the target table or the Actions column image icon to go to the object details page.

  4. On the Object Details page, click the Data Profile tab, then click the Trigger Manual Profile button to open the Create Manual Data Profiling dialog box.

    If the same table already has a manual profiling task that is Running or Waiting, you can click Stop Profiling and then initiate a new one.

  5. In the Create Manual Data Profiling dialog box, configure the parameters.

    Profiling Configuration

    Parameter

    Description

    Data Table

    Displays the name of the current asset. This parameter cannot be modified.

    Profile Partition

    Only partitioned tables need to configure the partition range for profiling. You can select existing partitions (those shown on the Details Information - Partition Information tab of the data table asset details page), or manually enter an expression to profile multiple partitions at once. Multiple partition levels are connected with and. For example: ds > '20230101' and (age >10 or age < 20) and city in ('beijing', 'hangzhou').

    If there are multiple partition levels, at least specify one partition level. For details about partition filter expressions, see Introduction to partition filter expressions.

    Profiling Content

    Select the fields to be profiled and the corresponding profiling scenarios.

    • Missing Value Analysis: Counts the number of records with Null values in the field. All data types are supported.

      • Numeric type: Additionally counts records with 0 values.

      • Text type: Additionally counts records with empty strings.

    • Value Distribution Analysis: Analyzes the distribution of field values and generates a field value distribution chart. All data types are supported.

      • Numeric type: Calculates the maximum value (Max), minimum value (Min), average value (Avg), number of Null value records, number of unique value records, standard deviation, 25% quantile, median, and 75% quantile for the selected records of the field.

      • Text type: Calculates the maximum character length, minimum character length, average character length, number of Null value records, and number of unique value records for the selected records of the field.

        Note

        When the compute engine is TDH 6.x, the length of a char field obtained is the length defined by the field type, not the actual length. For example, for the field name (char 10), if only 4 characters name are stored, the length function returns 10, not 4.

        When the compute engine is other types, the length of a char field obtained is the actual length of the field, not the length defined by the field type. For example, for the field name (char 10), if only 4 characters name are stored, the length function returns 4, not 10.

      • Date and time type: Calculates the maximum value (Max), minimum value (Min), number of Null value records, and number of unique value records for the field.

      • Boolean type: Counts the number of Null value records for the field.

    • Unique Value Analysis: Counts the number of unique value records and the top 5 field values with the highest repetition frequency. Boolean type does not support unique value record counting.

    Data Filtering

    After enabling this feature, you can configure data filtering scripts in the code box. For example: city = 'hangzhou'.

    Profile Range

    Sets the number of records to profile. You can choose all records, random sampling of n records, or percentage sampling of n% records.

    • All Records: All records in the specified partition of the selected fields participate in profiling. This option provides the most accurate results but may take longer and consume more resources for large tables.

    • Random Sampling Of N Records: Randomly samples n records from the specified partition of the selected fields. If the total number of profiled records is less than the sampling number, all records are returned. You can enter an integer from 1 to 10,000.

    • Percentage Sampling Of N% Records: Randomly samples n% of records from the specified partition of the selected fields. The result is approximate.

      Note
      • Percentage sampling is only supported when the compute engine is MaxCompute.

      • When random sampling or percentage sampling is selected, sampling results may differ between fields and between profiling scenarios of the same field. Fewer records in the table increases the likelihood of result differences.

      • For profiling ranges supported by different data sources, see Detection partitions and detection ranges supported by different data sources.

    Profiling Results

    Controls the visibility of profiling results in the Asset Inventory. You can configure viewing permissions for profiling reports to control access to sensitive data. Two options are available:

    • Public: Users who have permission to view the details page of the data table can view the profiling report.

    • Only Visible to Administrators and Owners: Super administrators, operations administrators, project administrators and business unit administrators of the current data table, the owner of the current data table, and the maintenance personnel of the current data source table can view the report.

    SQL Preview

    You can view the profiling SQL generated from the current configuration. Switch between SQLs to view them.

    To optimize performance and system stability, the system automatically splits profiling statements into multiple SQLs based on the selected fields and scenarios.

  6. Click OK to complete the manual data profiling configuration and initiate profiling.

Configure automatic data profiling

  1. Click the Configure Auto Profiling button to open the Auto Profiling Configuration dialog box.

  2. In the Auto Profiling Configuration dialog box, configure the parameters.

    Profiling Configuration

    Parameter

    Description

    Auto Profiling

    Disabled by default. Enable it to configure automatic profiling. If the current table cannot enable automatic profiling, contact the operations administrator to enable the relevant configuration. For more information, see Global configuration for data profiling.

    Note
    • If previously enabled, disabling it stops generating new profiling records. Currently executing profiling tasks are not affected.

    • If profiling tasks have been previously configured, modifications do not affect tasks that have already been generated or are currently executing.

    Data Table

    Displays the name of the current asset. This parameter cannot be modified.

    Profile Partition

    Only partitioned tables need to configure the partition range for profiling. Supports selecting the latest partition or specifying partitions.

    • Latest Partition: The system retrieves the latest partition value from periodically collected partition data under Governance > Asset Inventory (same as the partitions on the Details Information - Partition Information tab). This may differ from the actual latest partition of the table.

    • Specify Partition: You can manually enter an expression to profile multiple partitions at once. Multiple partition levels are connected with and. For example: ds > '20230101' and (age >10 or age < 20) and city in ('beijing', 'hangzhou'). You can also quickly select built-in partition expressions. For more information, see Introduction to partition filter expressions.

      Note
      • Cross-partition profiling is supported. The more partitions, the longer the runtime. To optimize resources, avoid profiling across too many partitions.

      • If there are multiple partition levels, at least specify one partition level.

    Profiling Frequency

    Sets the scheduling scenario for the profiling task. Two methods are available:

    • Scheduled Profiling: Periodically profiles data according to the set time. Scheduling cycles include Daily, Weekly, and Monthly.

    • Profile After Specified Task Runs Successfully: After the selected data table runs successfully, it triggers the data profiling task for the current table. You can select script tasks with daily, weekly, or monthly scheduling in the production environment.

      Note

      Data profiling will not be triggered when the selected task is in dry-run mode.

    Profiling Content

    The configuration method is the same as manual profiling. For more information, see Profiling Configuration.

    Data Filtering

    Profile Range

    Profiling Results

    SQL Preview

    You can view the profiling SQL generated from the current configuration. For more information, see SQL Preview.

  3. Click OK to complete the automatic profiling configuration.

    Note

    When the range of data tables with automatic profiling enabled changes in some projects or business units, and a previously profiled data table is deleted, automatic profiling for all related data tables is automatically disabled after the table is removed. Currently executing and already generated tasks are not affected.

What to do next

After the profiling task is completed, you can view results for different field data types. For more information, see View data profiling reports and profiling records.