Exploring data before synchronizing it to Dataphin helps you understand data distribution, null values, and other information in advance, facilitating more standardized data usage. This topic describes how to configure data exploration.
Prerequisites
You must purchase Data Quality to use the data exploration feature.
Limits
Data exploration is supported for tables of specific data source types. For supported data sources, see Supported partition exploration and exploration scope for different data sources.
This feature is not supported for compute source tables when the compute engine is set to AnalyticDB for PostgreSQL, ArgoDB, or StarRocks.
Permission description
Super administrators, operation administrators, and custom global roles with Exploration And Analysis-Data Exploration Configuration permissions can configure data exploration.
Data exploration configuration
In the top menu bar of the Dataphin home page, select Administration > Metadata.
In the left-side navigation pane, select General Configuration > Exploration And Analysis. On the Data Exploration And Analysis page, you can configure data exploration separately for compute source tables and data source tables.
Basic configuration
Configure the record retention policy for all data source types.
Click the Edit button at the bottom and configure the parameters.
Profiling Record: Two options are available:
Only Retain The Latest Exploration Record And Report:
If the latest run is successful and generates a report, all previous records, both successful and failed, will be deleted.
If the latest run fails, only the failed record and the most recent successful report will be kept, while other failed records are deleted. If no successful records exist, only the current failed record is retained.
Retain The Latest N Days Of Exploration Records: Keeps all records and reports from the past N days, both successful and failed. The default is 15 days. You can enter any integer from 1 to 90.
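The two retention options above can be sketched as follows. This is a minimal illustration of the documented rules, not Dataphin's implementation. The `ExplorationRecord` class, its fields, and both function names are hypothetical and introduced only for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class ExplorationRecord:
    # Hypothetical record shape; Dataphin's internal model is not documented here.
    finished_at: datetime
    succeeded: bool


def retain_latest_only(records: List[ExplorationRecord]) -> List[ExplorationRecord]:
    """Sketch of 'Only Retain The Latest Exploration Record And Report'."""
    if not records:
        return []
    ordered = sorted(records, key=lambda r: r.finished_at)
    latest = ordered[-1]
    if latest.succeeded:
        # Latest run succeeded: all earlier records, successful or failed, are dropped.
        return [latest]
    # Latest run failed: keep it plus the most recent successful record, if one exists.
    successes = [r for r in ordered if r.succeeded]
    return ([successes[-1]] if successes else []) + [latest]


def retain_last_n_days(
    records: List[ExplorationRecord], now: datetime, n_days: int = 15
) -> List[ExplorationRecord]:
    """Sketch of 'Retain The Latest N Days Of Exploration Records'."""
    if not isinstance(n_days, int) or not 1 <= n_days <= 90:
        raise ValueError("n_days must be an integer from 1 to 90")
    cutoff = now - timedelta(days=n_days)
    # Both successful and failed records inside the window are kept.
    return [r for r in records if r.finished_at >= cutoff]
```

Under the first option at most two records survive: the latest run and, when that run failed, the most recent successful one.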
Click Confirm to complete the basic configuration.
Compute source
Configure the scope of data tables eligible for automatic data exploration.
Important: Data exploration consumes compute resources from the project where the data table resides. Configure judiciously based on actual business needs.
Click the Edit button at the bottom and configure the parameters.
Parameter
Description
Concurrent Rate Limiting
Controls the number of simultaneously running tasks, including both data exploration and metric analysis tasks. Enter an integer from 1 to 5. The default is 5.
Advanced Parameter Configuration
When enabled, you can set global parameters for exploration and metric analysis tasks to optimize performance or accommodate specific compute engines.
Click the Reference Example box to view and copy the example statement.
Click Typical Scenario Description to view common exploration task errors and their solutions through parameter configuration. For more information, see Typical scenario description.
Exploration Timeout
Limits the maximum duration of exploration tasks to prevent prolonged resource use. Tasks that exceed the limit are marked as failed. Enter a value from 1 to 24 hours, with at most one decimal place.
Physical Table Range
Choose the range of physical tables and views for automatic exploration by project. Options include all projects, all production projects (Basic and Prod), or specific projects.
All Projects: Includes all physical tables and views under every project, both existing and future, for automatic exploration.
All Production Projects (Basic And Prod): Encompasses all physical tables and views under production projects, both existing and future, for automatic exploration.
Specified Projects: Allows selection of specific projects for automatic exploration, with support for multiple selections.
Logical Table Range
Select the range of logical tables and views for automatic exploration by data section. Options include all sections, all production sections (Basic and Prod), or specific sections.
All Sections: Covers all logical tables and views under every section, both existing and future, for automatic exploration.
All Production Sections (Basic And Prod): Includes all logical tables and views under production sections, both existing and future, for automatic exploration.
Specified Sections: Allows selection of specific sections for automatic exploration, with support for multiple selections.
Click Confirm to complete the compute source table data exploration configuration.
Note: If the scope of supported tables for automatic exploration changes, the automatic exploration switch will be turned off for tables that are no longer supported. Ongoing exploration tasks will not be affected.
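The three Physical Table Range options, and the effect of narrowing the range, can be modeled roughly as below. This is a hedged sketch that works at project granularity for brevity. The `Scope` enum, the `Project` class, its `is_production` flag, and both functions are assumptions made for illustration and are not part of Dataphin. The same logic would apply to data sections for logical tables.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Iterable, Set


class Scope(Enum):
    # Hypothetical names mirroring the three options in the UI.
    ALL_PROJECTS = "all"
    ALL_PRODUCTION_PROJECTS = "production"
    SPECIFIED_PROJECTS = "specified"


@dataclass(frozen=True)
class Project:
    name: str
    is_production: bool  # Assumed flag: True for Basic and Prod projects.


def in_scope(
    project: Project, scope: Scope, specified: FrozenSet[str] = frozenset()
) -> bool:
    """Return True if tables in this project are eligible for automatic exploration."""
    if scope is Scope.ALL_PROJECTS:
        return True
    if scope is Scope.ALL_PRODUCTION_PROJECTS:
        return project.is_production
    # SPECIFIED_PROJECTS: only explicitly selected projects qualify.
    return project.name in specified


def switches_to_turn_off(
    enabled_projects: Iterable[Project],
    new_scope: Scope,
    specified: FrozenSet[str] = frozenset(),
) -> Set[str]:
    """Names of projects whose tables fall out of scope after a range change."""
    return {p.name for p in enabled_projects if not in_scope(p, new_scope, specified)}
```

For example, switching from All Projects to All Production Projects would place every development project in the result of `switches_to_turn_off`, which corresponds to the tables whose automatic exploration switch is turned off.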
Data source
The Data Source page displays data source types that have been collected in metadata and support data source exploration and metric analysis. Configure the scope of data source tables eligible for automatic data exploration.
You can view information about data sources, including name, type, maximum concurrent tasks, data exploration status, exploration timeout, creator, and last modification time.
You can search by data source name or filter by data source type.
To configure data exploration for a target data source, click the Edit icon in the Operation column. In the Control Settings dialog box, configure the parameters.
Parameter
Description
Concurrency settings
Concurrent Rate Limiting
Controls the number of simultaneously running data source table exploration tasks. Enter an integer from 1 to 5. The default is 5.
Advanced Parameter Configuration
When enabled, you can set global parameters for data source table exploration and metric analysis tasks to optimize performance or accommodate specific compute engines.
Click Reference Example in the parameter configuration box to view and copy reference statements.
Click Typical Scenario Description to view common exploration task errors and their solutions through parameter configuration. For more information, see Typical scenario description.
Data exploration
Data Profile
Disabled by default. When enabled, data source tables that support data exploration can be explored.
Exploration Timeout: Available when data exploration is enabled. Limits the maximum duration of exploration tasks to prevent extended resource use. Tasks exceeding the set time will be marked as failed. Set any value from 1 to 24 hours, with precision up to one decimal place.
Click Confirm to complete the data source table data exploration configuration.
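The value limits described for Concurrent Rate Limiting (an integer from 1 to 5) and Exploration Timeout (1 to 24 hours, at most one decimal place) can be checked before values are entered in the console. The following is a minimal sketch. Both function names are hypothetical, and the console's own validation may differ in detail.

```python
from decimal import Decimal, InvalidOperation


def validate_concurrency(value: int) -> int:
    """Accept an integer from 1 to 5, the documented range for Concurrent Rate Limiting."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("concurrency must be an integer")
    if not 1 <= value <= 5:
        raise ValueError("concurrency must be between 1 and 5")
    return value


def validate_timeout_hours(value: str) -> Decimal:
    """Accept a timeout from 1 to 24 hours with at most one decimal place.

    The value is taken as a string so that the decimal places are exact.
    """
    try:
        hours = Decimal(value)
    except InvalidOperation as exc:
        raise ValueError(f"not a number: {value!r}") from exc
    if not hours.is_finite():
        raise ValueError("timeout must be a finite number")
    if not Decimal(1) <= hours <= Decimal(24):
        raise ValueError("timeout must be between 1 and 24 hours")
    # Quantizing to one decimal place must not change the value.
    if hours != hours.quantize(Decimal("0.1")):
        raise ValueError("timeout supports at most one decimal place")
    return hours
```

Using `Decimal` instead of `float` avoids binary rounding surprises when checking the one-decimal-place rule.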
What to do next
After completing the data exploration configuration, you can configure automatic exploration for data tables. For more information, see Create a data exploration task.