
Dataphin: Create a new data exploration task

Last Updated: Mar 05, 2025

Data exploration enables you to quickly understand data overviews, assess data availability, and identify potential risks in advance. You can conduct full data or sample explorations on data tables. This topic explains how to create a new data exploration task.

Prerequisites

The Data Quality feature must be activated before you can use data exploration.

Limits

  • The data exploration feature is not supported when the compute engine is AnalyticDB for PostgreSQL, ArgoDB, or StarRocks.

  • Data exploration features are not supported for meta tables, image tables, or logical tag tables.

  • Data exploration is not supported when the storage class of the data table is Hologres or Kudu.

  • Each exploration can select a maximum of 1500 fields.

    Important

    Selecting numerous fields or scenarios for exploration can significantly consume compute resources from the project or segment hosting the data table, potentially impacting features like ad hoc query or periodic task scheduling. It is advisable to choose the number of fields or scenarios based on the specific business scenario.

Permission description

  • Super administrators and operation administrators can create and manage data exploration tasks for all data tables. A data table owner can create and manage data exploration only for the tables that they own.

  • Project administrators can create and manage data exploration for physical tables within their projects.

  • Segment administrators can create and manage data exploration for logical tables within their segments.

Scenarios

In an integration task, the core fields of the source table are of JSON type, and key field information is extracted with the get_json_object function and synced to the target data table. You can create an automatic data exploration task that is triggered after the integration task runs successfully to verify whether the parsing results of the core fields meet expectations, such as checking for empty values or confirming that the field value distribution matches expectations, to prevent impact on downstream business use.
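The kind of check such an exploration performs can be sketched in Python. This is an illustrative analogue, not Dataphin code: the sample rows, the helper name get_json_object, and the key "city" are all hypothetical; the null and empty-string counts mirror what missing value analysis would report on the parsed column.

```python
import json

# Hypothetical sample of a JSON-typed core field after integration.
rows = [
    '{"user_id": "u1", "city": "hangzhou"}',
    '{"user_id": "u2"}',            # "city" key missing -> parses to None
    '{"user_id": "", "city": ""}',  # empty-string value
]

def get_json_object(value, key):
    """Rough Python analogue of SQL get_json_object(value, '$.key')."""
    try:
        return json.loads(value).get(key)
    except (json.JSONDecodeError, AttributeError):
        return None

cities = [get_json_object(r, "city") for r in rows]
null_count = sum(1 for c in cities if c is None)    # missing keys / bad JSON
empty_count = sum(1 for c in cities if c == "")     # empty strings (text type)
print(null_count, empty_count)
```

If either count is unexpectedly high after an integration run, the parsing logic or the source data likely needs attention before downstream tasks consume the table.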

Configure manual data exploration

  1. On the Dataphin home page, select Administration > Asset Checklist from the top menu bar.

  2. Click the Table tab. Here, you can filter by table type, including physical tables, logical tables, physical views, logical views, and materialized views.

  3. In the table list, click the name of the target table or the details icon in the Actions column to go to the object details page.

  4. On the object details page, click the Data Profile tab, and then click Trigger Manual Profile to open the Create Manual Data Exploration dialog box.

    If a running or pending manual exploration task exists for the same table, you can Terminate the Exploration and then start a new one.

  5. In the Create Manual Data Exploration dialog box, configure the parameters.

    Exploration Configuration

    Parameter

    Description

    Data Table

    Displays the name of the current asset, which cannot be modified.

    Profile Partition

    Only partitioned tables need to configure a partition range for exploration. You can quickly select existing specified partitions (the partitions generated on the partition information tab of the data table asset page) or manually enter an expression to explore multiple partitions at once. Use and to connect conditions across multi-level partitions. For example: ds > '20230101' and (age > 10 or age < 20) and city in ('beijing', 'hangzhou').

    If there are multiple levels of partitions, at least one primary partition must be specified. For details on partition filter expressions, see Introduction to Partition Filter Expressions.

    Exploration Content

    Select the fields to be explored and the corresponding exploration scenarios.

    • Missing Value Analysis: Counts the number of records with Null values for the field. All data types are supported.

      • Numeric: Additionally counts the number of records with a value of 0.

      • Text: Additionally counts the number of records with an empty string.

    • Value Distribution Analysis: Analyzes the distribution of field values and generates a field value distribution graph. All data types are supported.

      • Numeric: Counts the maximum value (Max), minimum value (Min), average value (Avg), number of Null value records, number of unique value records, standard deviation, 25% quantile, median, and 75% quantile of the selected records for the field.

      • Text: Counts the maximum character length, minimum character length, average character length, number of Null value records, and number of unique value records for the selected records for the field.

        Note

        The behavior of the length function (length) for the char field type varies between compute engines.

        When the compute engine is TDH 6.x, the length returned for a char field is the length defined by the field type, not the actual stored length. For example, for a field name of type char(10) that stores only 4 characters, the length function returns 10, not 4.

        When the compute engine is of another type, the length returned is the actual length of the stored value, not the defined length. For the same field name of type char(10) storing 4 characters, the length function returns 4, not 10.

      • DateTime: Counts the maximum value (Max), minimum value (Min), number of Null value records, and number of unique value records for the selected records for the field.

      • Boolean: Counts the number of Null value records for the field.

    • Unique Value Analysis: Counts the number of unique value records and the top 5 most frequently repeated field values for the field. Boolean type does not support unique value count.
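    As a rough sketch of what these scenarios compute for a numeric field, the following Python fragment derives the statistics listed above from a hypothetical column sample. This is an illustration only, not the exploration SQL that Dataphin generates, and the sample values are made up.

```python
import statistics

# Hypothetical numeric column sample, including NULLs (None) and a zero.
values = [3, 0, 7, None, 7, 12, None, 5]
non_null = sorted(v for v in values if v is not None)

profile = {
    "null_count": values.count(None),        # missing value analysis
    "zero_count": non_null.count(0),         # extra count for numeric fields
    "unique_count": len(set(non_null)),      # unique value analysis
    "max": max(non_null),
    "min": min(non_null),
    "avg": sum(non_null) / len(non_null),
    "stddev": statistics.pstdev(non_null),
    "q25": statistics.quantiles(non_null, n=4)[0],
    "median": statistics.median(non_null),
    "q75": statistics.quantiles(non_null, n=4)[2],
}
print(profile)
```

    Computing these on a small sample first is a cheap way to sanity-check a field before launching a full exploration over a large partition.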

    Data Filtering

    When enabled, you can configure data filtering scripts in the code box. For example: city = 'hangzhou'.

    Profile Range

    Used to set the number of records to be explored. Supports all records, random sampling of n records, and percentage sampling of n% records.

    • All Records: All records within the specified partition of the selected fields participate in the exploration. Suitable for scenarios where full data exploration is required, providing more accurate feedback on data conditions. If the number of records is large, it may run for a long time and consume more resources.

    • Random Sampling Of N Records: Randomly samples n records from the specified partition of the selected fields for exploration. If the total number of records available for exploration is less than the sampling number, all records are returned. A maximum of 10000 records can be sampled. Supports entering an integer between 1 and 10000.

    • Percentage Sampling Of N% Records: Randomly samples n% of the records from the specified partition of the selected fields for exploration. The number of sampled records is approximate, not exact.

      Note
      • Percentage sampling is only supported when the compute engine is MaxCompute.

      • When selecting random sampling or percentage sampling, if many exploration fields are selected, the sampling results for different fields and different exploration scenarios for the same field may vary, affecting the accuracy of the results.
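    The two sampling modes can be sketched as follows. This is a hypothetical Python illustration of the behavior described above, not how the engine actually samples; the function name and signature are invented for the example.

```python
import random

def sample_records(records, n=None, percent=None):
    """Sketch of the two sampling modes: fixed-count and percentage."""
    if n is not None:
        # Random sampling of n records; if fewer records exist than the
        # requested sample size, all records are returned.
        if len(records) <= n:
            return list(records)
        return random.sample(records, n)
    if percent is not None:
        # Percentage sampling: keep each record with probability percent/100,
        # so the returned count is approximate, as noted above.
        return [r for r in records if random.random() < percent / 100]
    return list(records)  # no sampling configured: all records

rows = list(range(1000))
print(len(sample_records(rows, n=100)))   # exactly 100
print(len(sample_records(rows, n=5000)))  # 1000: fewer rows than n
```

    The per-record coin flip in the percentage branch is one plausible reading of why the sampled count "is not precise"; the exact mechanism MaxCompute uses is not specified here.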

    Exploration Results

    Supports configuring viewing permissions for exploration reports for different data tables to better manage sensitive data. Supports two configuration methods:

    • Public: Users with permission to view the data table product page can view the exploration report.

    • Visible to Administrators and Owners Only: Only super administrators, operation administrators, the administrators of the project or segment where the data table is located, and the data table owner can view the exploration report.

    SQL Preview

    You can review the exploration SQL generated from the current configuration to better understand how the exploration runs. You can switch between the generated SQL statements.

    To ensure system stability and improve exploration efficiency, the system automatically divides exploration statements into multiple SQL executions based on the selected fields and scenarios.
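    The splitting behavior can be sketched roughly as batching (field, scenario) pairs into separate SQL executions. The batching rule and the max_per_sql threshold below are assumptions made for illustration; Dataphin's actual split criteria are not documented here.

```python
def split_exploration(fields, scenarios, max_per_sql=50):
    """Sketch: divide (field, scenario) pairs into batches, one SQL each.
    max_per_sql is a made-up threshold, not a documented Dataphin limit."""
    pairs = [(f, s) for f in fields for s in scenarios]
    return [pairs[i:i + max_per_sql] for i in range(0, len(pairs), max_per_sql)]

# 60 fields x 2 scenarios = 120 pairs -> 3 batches of at most 50.
batches = split_exploration([f"col{i}" for i in range(60)],
                            ["missing_value", "value_distribution"])
print(len(batches))
```

    Splitting keeps each statement small enough that one heavy exploration cannot monopolize the compute engine, which is the stability concern the paragraph above describes.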

  6. Click OK to finalize the manual data exploration configuration and initiate the exploration.

Configure automatic data exploration

  1. Click the Configure Auto Profiling button to open the Automatic Exploration Configuration dialog box.

  2. In the Automatic Exploration Configuration dialog box, configure the parameters.

    Exploration Configuration

    Parameter

    Description

    Auto Profiling

    Disabled by default. You can enable it and then configure automatic exploration. If automatic exploration cannot be enabled for the current table, contact the operation administrator to enable the relevant global configuration. For more information, see Global Configuration for Data Exploration.

    Note
    • If automatic exploration was previously enabled, disabling it stops new exploration records from being generated. Ongoing exploration tasks are not affected.

    • If exploration tasks were previously configured, modifications will not affect generated or ongoing exploration tasks.

    Data Table

    Displays the name of the current asset, which cannot be modified.

    Profile Partition

    Only partitioned tables need to configure the partition range for exploration. Supports selecting the latest partition and specified partitions.

    • Latest Partition: The system uses the latest partition value that is periodically collected under Administration > Asset Checklist (the partitions generated on the partition information tab of the data table asset page). This value may differ from the actual latest partition of the table.

    • Specified Partition: You can manually enter an expression to explore multiple partitions at once. Use and to connect conditions across multi-level partitions. For example: ds > '20230101' and (age > 10 or age < 20) and city in ('beijing', 'hangzhou'). You can also quickly select built-in partition filter expressions. For more information, see Introduction to Partition Filter Expressions.

      Note
      • Cross-partition exploration is supported. The more partitions, the longer the runtime. To optimize resources, avoid crossing too many partitions.

      • If there are multiple levels of partitions, at least one primary partition must be specified.

    Exploration Frequency

    Used to set the scheduling scenarios for exploration tasks. Supports two exploration methods:

    • Scheduled Exploration: Periodically explores data according to the set time. The scheduling cycle includes Day, Week, and Month.

    • Explore After Specified Task Runs Successfully: The data exploration task is triggered after the specified task runs successfully. Script tasks scheduled daily, weekly, or monthly in the production environment are supported.

      Note

      Data exploration is not triggered when the specified task performs a dry run.

    Exploration Content

    The configuration method is consistent with manual exploration. For more information, see Exploration Configuration.

    Data Filtering

    Profile Range

    Exploration Results

    SQL Preview

    You can view the exploration SQL that is generated based on the current configuration to gain deeper insights into the exploration process. For more information, see the manual exploration SQL Preview.

  3. Click OK to complete the automatic exploration configuration.

    Note

    If a data table with automatic exploration enabled leaves the supported scope, for example because the table is deleted, its automatic exploration is disabled after the removal. Ongoing and already generated exploration tasks are not affected.

Subsequent instructions

Once the exploration task is complete, you can examine the results for various data types. For more information, see View Data Exploration Reports and Records.