Data profiling helps you quickly gain an overview of your data, evaluate data availability, and identify potential risks in advance. You can perform full profiling or sampling profiling on data tables. This topic describes how to create a data profiling task.
Prerequisites
You must activate the Data Quality module to use the data profiling feature. In versions earlier than V5.2.1, the Domain Quality module had to be activated to use data profiling. In V5.2.1 and later, you can use data profiling after activating either the Domain Quality module or the Global Quality module.
Limits
Data profiling is not supported for meta tables, mirror tables, or logical tag tables.
Data profiling is not supported when the storage class of the data table is Hologres or Kudu.
You can select a maximum of 1500 fields for each profiling task.
For information about the data sources and compute engines that support data profiling, see Supported profiling partitions and ranges for different data sources.
If the data source is Hive and the table is a data lake table, you must enable Spark tasks for the table's compute engine to perform data profiling.
If you select multiple profiling fields or profiling scenarios, the profiling task may consume a large amount of computing resources from the data source of the source table, or from the project or business unit to which the data table belongs. This may affect features such as ad hoc queries or periodic task scheduling. We recommend that you limit the number of fields and profiling scenarios in a single profiling task to what you actually need.
Permissions
Super administrators and operations administrators can create and manage data profiling for all data tables. Table owners can only create and manage data profiling for tables they own.
Project administrators can create and manage data profiling for physical tables in the projects they manage.
Business unit administrators can create and manage data profiling for logical tables in the business units they manage.
Scenarios
The core fields of the source table in an integration task are of the JSON type, and you use the get_json_object function to extract key field information before synchronizing it to the target data table. You can create an automatic data profiling task for the target table and set it to trigger after the integration task runs successfully. This verifies whether the parsing results of the core fields meet expectations (for example, whether empty values exist or whether the field distribution is as expected), avoiding impact on downstream business use.
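For reference, a minimal sketch of what such an integration query might look like, assuming a hypothetical source table ods_order_raw with a JSON column named body, a hypothetical target table dwd_order, and ${bizdate} as the scheduled business date parameter:

-- Hypothetical example: extract key fields from a JSON column before writing to the target table.
INSERT OVERWRITE TABLE dwd_order PARTITION (ds = '${bizdate}')
SELECT
  get_json_object(body, '$.order_id') AS order_id,
  get_json_object(body, '$.city')     AS city,
  get_json_object(body, '$.amount')   AS amount
FROM ods_order_raw
WHERE ds = '${bizdate}';

After this task runs successfully, an automatic profiling task on dwd_order that includes missing value analysis on order_id and amount shows whether the JSON parsing produced unexpected null values.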
Configure manual data profiling
In the top navigation bar of the Dataphin homepage, choose Governance > Asset Inventory.
Select Dataphin assets and click the Table tab. You can filter the list by table type to view physical tables, logical tables, physical views, logical views, or materialized views, or select assets from other systems.
In the list, click the name of the target table or the icon in the Actions column to go to the object details page.
On the Object Details page, click the Data Profile tab, and then click Trigger Manual Profile to open the Create Manual Data Profiling dialog box.
If the same table already has a manual profiling task that is Running or Waiting, you can click Stop Profiling and then initiate a new one.
In the Create Manual Data Profiling dialog box, configure the parameters.
Profiling Configuration
Parameter
Description
Data Table
Displays the name of the current asset. This parameter cannot be modified.
Profile Partition
Only partitioned tables require a profiling partition range. You can quickly select existing specified partitions (partitions already generated on the Details Information - Partition Information tab of the data table asset details page), or manually enter an expression to profile multiple partitions at once. Multiple partition levels are connected with and. For example: ds > '20230101' and (age > 10 or age < 20) and city in ('beijing', 'hangzhou'). If there are multiple partition levels, specify at least one of them. For more information about partition filter expressions, see Introduction to partition filter expressions.
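As an illustration, for a hypothetical table with two partition levels, ds (level 1) and hh (level 2), any of the following expressions is valid as long as at least one partition level is constrained:

ds = '20230101'
ds = '20230101' and hh = '12'
ds >= '20230101' and ds <= '20230107'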
Profiling Content
Select the fields to be profiled and the corresponding profiling scenarios.
Missing Value Analysis: Counts the number of records with Null values in the field. All data types are supported.
Numeric type: Additionally counts records with 0 values.
Text type: Additionally counts records with empty strings.
Value Distribution Analysis: Analyzes the distribution of field values and generates a field value distribution chart. All data types are supported.
Numeric type: Calculates the maximum value (Max), minimum value (Min), average value (Avg), number of Null value records, number of unique value records, standard deviation, 25% quantile, median, and 75% quantile for the selected records of the field.
Text type: Calculates the maximum character length, minimum character length, average character length, number of Null value records, and number of unique value records for the selected records of the field.
Note: When the compute engine is TDH 6.x, the length obtained for a char field is the length defined by the field type, not the actual length. For example, for the field name (char(10)), if only the 4 characters 'name' are stored, the length function returns 10, not 4. For other compute engines, the length obtained for a char field is the actual length, not the length defined by the field type. In the same example, the length function returns 4, not 10.
Date and time type: Calculates the maximum value (Max), minimum value (Min), number of Null value records, and number of unique value records for the field.
Boolean type: Counts the number of Null value records for the field.
Unique Value Analysis: Counts the number of unique value records and the top 5 field values with the highest repetition frequency. Boolean type does not support unique value record counting.
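As a hedged illustration of the statistics listed above, the following sketch shows what a missing value and unique value analysis could look like for a hypothetical table dwd_order with a text field order_id and a numeric field amount (the actual statements are generated by the system and shown in SQL Preview):

-- Missing value and unique value analysis for one text field and one numeric field.
SELECT
  COUNT(*)                                            AS total_records,
  SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END)   AS order_id_null_records,
  SUM(CASE WHEN order_id = '' THEN 1 ELSE 0 END)      AS order_id_empty_strings,
  COUNT(DISTINCT order_id)                            AS order_id_unique_values,
  SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END)     AS amount_null_records,
  SUM(CASE WHEN amount = 0 THEN 1 ELSE 0 END)         AS amount_zero_records
FROM dwd_order
WHERE ds = '20230101';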
Data Filtering
After enabling this feature, you can configure data filtering scripts in the code box. For example: city = 'hangzhou'.
Profile Range
Used to set the number of records to be profiled. Supports all records, random sampling of n records, or percentage sampling of n% records.
All Records: All records in the specified partition of the selected fields participate in profiling. This is suitable for scenarios that require profiling of full data and can more accurately reflect the data situation. If there are many records, it may take longer to run and consume more resources.
Random Sampling Of N Records: Randomly samples n records from the specified partition of the selected fields for profiling. If the total number of records that can be profiled is less than the sampling number, all records are returned. The maximum sampling is 10,000 records. Supports entering integers between 1 and 10,000.
Percentage Sampling Of N% Records: Randomly samples n% of the records in the specified partition of the selected fields for profiling. The number of sampled records is approximate, not exact.
Note: Percentage sampling is supported only when the compute engine is MaxCompute.
When random sampling or percentage sampling is selected, the sampling results between different fields and between different profiling scenarios of the same field may be different. The fewer records in the profiled table, the more likely there will be differences in results.
For the profiling ranges supported by different data sources, see Supported profiling partitions and ranges for different data sources.
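Conceptually, the data filtering script and the profile range narrow the set of records before any statistics are computed. A minimal sketch, assuming the hypothetical dwd_order table and random sampling of 10,000 records (the system generates the real SQL, which may differ):

-- Records that participate in profiling after filtering and random sampling.
SELECT order_id, amount
FROM dwd_order
WHERE ds = '20230101'       -- profile partition
  AND city = 'hangzhou'     -- data filtering script
ORDER BY rand()             -- one possible way to express random sampling
LIMIT 10000;                -- random sampling of n records (n <= 10,000)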
Profiling Results
This configuration affects whether profiling results are visible in the Asset Inventory. It supports configuring viewing permissions for profiling reports of different data tables to better control sensitive data. Two configuration methods are supported:
Public: Users who have permission to view the details page of the data table can view the profiling report.
Only Visible to Administrators and Owners: Super administrators, operations administrators, project administrators and business unit administrators of the current data table, the owner of the current data table, and the maintenance personnel of the current data source table can view the report.
SQL Preview
You can view the profiling SQL generated based on the current profiling configuration to better understand how the profiling is performed. You can switch between the generated SQL statements to view each one.
To optimize profiling tasks, ensure system stability, and improve profiling efficiency, the system automatically splits the profiling statements into multiple SQLs based on the selected profiling fields and scenarios.
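As a rough illustration of this splitting, two fields with value distribution analysis selected might produce separate statements such as the following, assuming a Hive-style engine and the hypothetical dwd_order table (function names such as PERCENTILE_APPROX vary by compute engine):

-- SQL 1: value distribution statistics for the numeric field amount.
SELECT
  MIN(amount)                      AS min_value,
  MAX(amount)                      AS max_value,
  AVG(amount)                      AS avg_value,
  STDDEV(amount)                   AS std_dev,
  PERCENTILE_APPROX(amount, 0.25)  AS p25,
  PERCENTILE_APPROX(amount, 0.5)   AS median,
  PERCENTILE_APPROX(amount, 0.75)  AS p75
FROM dwd_order
WHERE ds = '20230101';

-- SQL 2: value distribution statistics for the text field order_id.
SELECT
  MIN(LENGTH(order_id)) AS min_char_length,
  MAX(LENGTH(order_id)) AS max_char_length,
  AVG(LENGTH(order_id)) AS avg_char_length
FROM dwd_order
WHERE ds = '20230101';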
Click OK to complete the manual data profiling configuration and initiate profiling.
Configure automatic data profiling
Click the Configure Auto Profiling button to open the Auto Profiling Configuration dialog box.
In the Auto Profiling Configuration dialog box, configure the parameters.
Profiling Configuration
Parameter
Description
Auto Profiling
Disabled by default. You can enable it and configure automatic profiling. If automatic profiling cannot be enabled for the current table, contact the operations administrator to enable the relevant configuration. For more information, see Global configuration for data profiling.
Note: If automatic profiling was previously enabled, disabling it means that no new profiling records are generated. Profiling tasks that are currently running are not affected.
If profiling tasks have been configured before, modifying the configuration does not affect profiling tasks that have already been generated or are currently running.
Data Table
Displays the name of the current asset. This parameter cannot be modified.
Profile Partition
Only partitioned tables need to configure the partition range for profiling. Supports selecting the latest partition or specifying partitions.
Latest Partition: The system uses the latest partition value that is periodically collected under Governance > Asset Inventory (the same partitions already generated on the Details Information - Partition Information tab of the data table asset details page). This may differ from the actual latest partition of the table.
Specify Partition: You can manually enter an expression to profile multiple partitions at once. Multiple partition levels are connected with and. For example: ds > '20230101' and (age > 10 or age < 20) and city in ('beijing', 'hangzhou'). You can also quickly select built-in partition expressions. For more information, see Introduction to partition filter expressions. If there are multiple partition levels, specify at least one of them.
Note: Cross-partition profiling is supported. The more partitions you profile, the longer the task runs. To conserve resources, avoid profiling across too many partitions.
Profiling Frequency
Used to set the scheduling scenario for the profiling task. Two profiling methods are supported:
Scheduled Profiling: Periodically profiles data according to the set time. Scheduling cycles include Daily, Weekly, and Monthly.
Profile After Specified Task Runs Successfully: After the selected task runs successfully, the data profiling task for the current table is triggered. You can select script tasks with daily, weekly, or monthly scheduling in the production environment.
Note: Data profiling is not triggered when the selected task runs in dry-run mode.
Profiling Content
The configuration method is the same as manual profiling. For more information, see Profiling Configuration.
Data Filtering
Profile Range
Profiling Results
SQL Preview
You can view the profiling SQL generated based on the current profiling configuration to understand more profiling information. For more information, see SQL Preview.
Click OK to complete the automatic profiling configuration.
Note: If the range of data tables for which automatic profiling is enabled changes in a project or business unit, and a data table that previously had data profiling configured is removed from that range, automatic profiling for that data table is automatically disabled after it is removed. Profiling tasks that are currently running and tasks that have already been generated are not affected.
What to do next
After the profiling task is completed, you can view the result display corresponding to different field data types. For more information, see View data profiling reports and profiling records.