Sampling configuration - Dataphin - Alibaba Cloud Documentation Center

Data sampling helps business users understand data patterns and assists in SQL development. This topic describes how to configure data sampling.

Prerequisites

To use sampling configuration, you must have the Agile R&D Edition or a later version, or have the Asset Operations feature enabled.

Limits

Automatic sampling is supported only for data tables with 1500 or fewer fields. Tables that exceed this limit are automatically ignored.

Permissions

Super administrators, operations administrators, and users assigned custom global roles with the Sampling Configuration - Manage permission can manage sampling configurations.

Procedure

In the top menu bar of the Dataphin home page, choose Administration > Metadata.

In the navigation pane on the left, choose General Configuration > Sampling Configuration. On the Sampling Configuration page, you can configure sampling for compute source and data source tables.

Basic configuration

Click Edit at the bottom of the page to configure the parameters.

Parameter		Description
Data Sampling		The master switch for sampling configuration. After you enable this feature, you can configure basic settings, compute sources, and data sources. Check that the automatic data sampling settings on the compute source and data source pages meet your needs. After you disable this feature, neither automatic nor manual sampling can be triggered. You cannot use sample data in related scenarios. You can choose to delete the sample data or keep it. Delete Synchronously: Deletes the saved sample data. Keep Data: The saved sample data cannot be viewed or used, but it can be used directly the next time you enable sampling.
Query configuration	Automatic Sampling Trigger Scenarios	Configure the node types that automatically trigger data sampling queries. Supported types include Metadata Collection, Data Profile, and Execution Of Security Detection Rules/standard Label Mapping Rules. The system automatically determines whether to start a new sampling query based on the last sampling update time and the update policy. Data Profile: You must enable the Global Quality or Domain-specific Quality feature to use data profiling. Execution Of Security Detection Rules/standard Label Mapping Rules: Enable this option when security detection rules involve content-based detection, or when standard label mapping is configured for intelligent mapping based on detection features. Otherwise, each detection performs a temporary data query, which may consume significant computing resources.
	Automatic Sampling Update Policy	Controls the update frequency of data sampling queries. The system determines whether to perform a sampling query in the preceding scenarios based on the last sampling update time (including automatic and manual sampling) and this policy. You can choose to update at a fixed interval or not to update. Update at a fixed interval: Resamples if the last update was more than N days ago. N can be an integer from 1 to 60. For example, if N is 7, the system resamples during node execution if the last successful sampling was more than 7 days ago. Do Not Update: Samples and stores data only once. If sampling is successful, the data is not actively updated later.
	Null Value Compensation	The policy for handling null values in fields during a data sampling query. You can choose not to compensate or to perform a compensation query for fields that are entirely null. Do Not Compensate: If some fields in the sample data are entirely null, the system does not perform another non-null sampling for these fields, nor does it perform detection on them. When Some Fields In The Sample Result Are Null, Query To Compensate For Null Fields: If some fields in the sample data are entirely null, the system performs another non-null sampling for these fields. If the sampling is successful, the result is used for the next detection. If the sampling fails, detection is not performed on the field. The following is a sample script: `-- First sampling query for fields a, b, and c of tableA` `select a,b,c from tableA limit 100;` `-- Because the first 100 rows for field a are all null, a second sampling query is performed for field a` `select distinct a from tableA where a is not null limit 100;` Important: Null value compensation improves detection accuracy but consumes more computing resources. Configure this feature as needed.
Storage configuration	Sample Storage	The number of sample values to save for a single field. The default value is 100. You can enter an integer from 1 to 100.
Usage configuration	For Data Preview	Used for data preview in the asset checklist and asset folder. If a data table already has sample data, the sample data is displayed by default. You can also manually trigger a query for the latest data. If no sample data exists, a data preview query is automatically triggered. The sample data for each field is stored and sorted independently. The existence and correctness of row records are not guaranteed. During preview, the system first checks the column-level permissions of the current account and the desensitization policy of the field. You can only view sample data for fields you have permission to access. Data is not filtered based on row-level permissions. For example, a desensitization policy is configured for `field_b` in Table A. [Figure: raw data and sample data for Table A]
	For Security/standard Detection	This setting is displayed only if you have purchased the Data Security or Data Standard feature. When security detection rules involve content-based detection, or when standard label mapping is configured for intelligent mapping based on detection features, sample data is used by default. If no data is available, a temporary data query is performed.
	For Smart Applications	This setting is displayed only if you have enabled a smart application. You can edit the sampling data configuration for a smart application on the Super X > Smart application management > Smart application page.
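
The Automatic Sampling Update Policy in the table above can be read as a small decision rule. The following is a minimal sketch of that rule, not Dataphin's implementation: the function name, the policy constants, and the treatment of "no previous sample" and of the exact N-day boundary are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical identifiers for the two UI options
# "Update at a fixed interval" and "Do Not Update".
FIXED_INTERVAL = "fixed_interval"
DO_NOT_UPDATE = "do_not_update"


def needs_resample(last_success: Optional[datetime], now: datetime,
                   policy: str, n_days: int = 7) -> bool:
    """Return True if a trigger scenario should start a new sampling query."""
    if last_success is None:
        # Assumption: with no stored sample, both policies sample once.
        return True
    if policy == DO_NOT_UPDATE:
        # Sampled successfully before: never refresh automatically.
        return False
    if policy != FIXED_INTERVAL:
        raise ValueError(f"unknown policy: {policy!r}")
    if not 1 <= n_days <= 60:
        raise ValueError("N must be an integer from 1 to 60")
    # "More than N days ago": exactly N days is treated as still fresh.
    return now - last_success > timedelta(days=n_days)
```

With N set to 7, a table last sampled 9 days ago is resampled, while one sampled 5 days ago is left alone.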

Click OK to complete the basic configuration.
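
To make the Null Value Compensation option more concrete, the sketch below runs the two queries from the sample script against an in-memory SQLite database. It only illustrates the two-step pattern. The function name `sample_with_compensation` is hypothetical, SQLite stands in for the real compute engine, and the table and column names are assumed to be trusted identifiers.

```python
import sqlite3
from typing import Dict, List, Sequence

SAMPLE_LIMIT = 100  # same row limit as the sample script


def sample_with_compensation(conn: sqlite3.Connection, table: str,
                             columns: Sequence[str],
                             limit: int = SAMPLE_LIMIT) -> Dict[str, List]:
    """Sample columns once, then re-query columns whose sample is all NULL."""
    col_list = ", ".join(columns)
    # First sampling query: the first `limit` rows of all requested columns.
    rows = conn.execute(
        f"SELECT {col_list} FROM {table} LIMIT ?", (limit,)
    ).fetchall()
    samples = {c: [r[i] for r in rows] for i, c in enumerate(columns)}

    for col, values in samples.items():
        if values and all(v is None for v in values):
            # Compensation query: distinct non-null values for this column.
            comp = conn.execute(
                f"SELECT DISTINCT {col} FROM {table} "
                f"WHERE {col} IS NOT NULL LIMIT ?", (limit,)
            ).fetchall()
            # An empty result means the column has no non-null values,
            # so detection would be skipped for it.
            samples[col] = [r[0] for r in comp]
    return samples
```

The compensation query is the extra cost mentioned in the Important note: one additional scan for each entirely null field.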

Compute source

Configure the scope of data tables for which automatic sampling can be enabled.

Click Edit at the bottom of the page to configure the parameters.

Parameter		Description
Automatic Sampling		After you enable this feature, you can configure automatic data sampling for compute source tables. You can modify the trigger scenarios for automatic sampling on the Basic configuration page.
Automatic sampling configuration	Physical Table Scope	Select the scope of physical tables and physical views for which automatic sampling can be enabled. You can select all projects, all production projects (Basic and Prod), or specific projects. All Projects: Automatic sampling can be enabled for physical tables and physical views in all projects, including existing and future projects. All Production Projects (Basic And Prod): Automatic sampling can be enabled for physical tables and physical views in all production projects, including existing and future production projects. Specific Projects: Select the projects for which you want to enable automatic sampling. You can select multiple projects.
Automatic sampling configuration	Logical Table Scope	Select the scope of logical tables and logical views for which automatic sampling can be enabled by data board. You can select all data boards, all production data boards (Basic and Prod), or specific data boards. All Data Boards: Automatic sampling can be enabled for logical tables and logical views in all data boards, including existing and future data boards. All Production Data Boards (Basic And Prod): Automatic sampling can be enabled for logical tables and logical views in all production data boards, including existing and future production data boards. Specific Data Boards: Select the data boards for which you want to enable automatic sampling. You can select multiple data boards.
Sampling execution (Note: These settings apply both to automatic sampling and to the temporary sampling queries that security detection rules requiring content-based identification trigger when automatic sampling is disabled.)	Execution Space	Select the computing resources for executing data sampling query nodes. You can use the project where the data resides or a specified project. Project Where The Data Resides: Executes in the project to which the selected data asset belongs. Specified Project: Executes in a project in the corresponding environment based on the environment of the selected data asset. Development tables use computing resources from development projects, and production tables use computing resources from production projects. Note: Data sampling queries consume computing resources. We recommend that you execute them in the project where the data asset resides. If you want to reduce the resource pressure and query costs on the project where the data resides, you can assign dedicated project resources or queues for sampling queries. For example, you can select a separate subscription project. This avoids interference with normal business projects. Make sure that the account configured in the compute source of the selected project has read permissions on the relevant sample data tables.
	Concurrent Request Throttling	Controls the number of data sampling query nodes that can run at the same time. The default value is 16. You can enter an integer from 1 to 100. Note: Limiting concurrent queries helps keep the compute cluster stable and prevents system breakdowns caused by many queries in a short period. Raising the limit speeds up sampling query nodes but puts more pressure on the cluster, because scanning consumes cluster computing resources. Configure this setting as needed.
	Query Timeout	If the total runtime of a data sampling query node (from start to end, excluding resource and scheduling wait times) exceeds the set threshold, the system automatically stops the node and marks it as failed. The default value is 0.5 hours. You can set a value from 0 to 12 hours, with up to one decimal place.
	Scan Disable Period	After enabling this feature, set a start and end time. During this period, automatically triggered data sampling queries are not started and are marked as failed immediately. This prevents excessive computing resource consumption and ensures stable operation of production environment tasks.
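
The Query Timeout and Scan Disable Period settings above amount to two simple checks. The sketch below is illustrative only: both function names are hypothetical, and it assumes the disable period is a daily time-of-day window that may wrap past midnight and excludes its end time. The source does not specify any of these details.

```python
from datetime import time


def validate_timeout_hours(value: float) -> float:
    """Accept a timeout from 0 to 12 hours with at most one decimal place."""
    if not 0 <= value <= 12:
        raise ValueError("timeout must be between 0 and 12 hours")
    if round(value, 1) != value:
        raise ValueError("timeout allows at most one decimal place")
    return value


def in_disable_period(now: time, start: time, end: time) -> bool:
    """Return True if an automatic sampling query must not start at `now`."""
    if start <= end:
        # Same-day window, for example 09:00 to 18:00.
        return start <= now < end
    # Assumed wrap-around window, for example 22:00 to 06:00.
    return now >= start or now < end
```

A query that is triggered while `in_disable_period` returns True would be marked as failed immediately rather than queued.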

Click OK to complete the data sampling configuration for compute source tables.
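
As a rough illustration of what Concurrent Request Throttling does, the sketch below caps how many sampling tasks run at once by using a fixed-size thread pool. This is an analogy, not a description of how Dataphin schedules query nodes. The function name `run_throttled` and the callable-based task model are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

DEFAULT_MAX_CONCURRENT = 16  # default value from the configuration table


def run_throttled(tasks: Sequence[Callable[[], T]],
                  max_concurrent: int = DEFAULT_MAX_CONCURRENT) -> List[T]:
    """Run all tasks, with at most `max_concurrent` running at a time."""
    if not 1 <= max_concurrent <= 100:
        raise ValueError("limit must be an integer from 1 to 100")
    # The pool size is the throttle: extra tasks wait for a free worker.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # map() preserves the order of `tasks` in the returned results.
        return list(pool.map(lambda task: task(), tasks))
```

A higher limit drains the queue faster but lets more queries reach the cluster at the same moment, which is the trade-off the configuration note describes.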

Data source

The Data source tab displays the data source types that support data sampling and for which metadata has been collected. On this tab, you can configure the scope of data source tables for which automatic sampling is enabled.

You can view the name, type, maximum number of concurrent nodes, automatic data sampling status, query timeout, and last modification time for each data source.
You can search by data source name or filter by data source type.

To configure data sampling for a target data source, click the Edit icon in the Actions column. In the Sampling Configuration dialog box, configure the parameters.

Parameter		Description
Automatic sampling scope	Development/Production Environment	This can be configured only when a collection node is configured for the corresponding environment in the data source. You can configure the automatic sampling scope for the production and development environments separately. After you enable this feature, you can configure different collection scopes for different data source types. For more information, see Collection scope.
Sampling execution (Note: These settings apply both to automatic sampling and to the temporary sampling queries that security detection rules requiring content-based identification trigger when automatic sampling is disabled.)	Concurrent Request Throttling	Controls the number of data sampling query nodes that can run at the same time. The default value is 16. You can enter an integer from 1 to 100. Note: Limiting concurrent queries helps keep the compute cluster stable and prevents system breakdowns caused by many queries in a short period. Raising the limit speeds up sampling query nodes but puts more pressure on the cluster, because scanning consumes cluster computing resources. Configure this setting as needed.
	Query Timeout	If the total runtime of a data sampling query node (from start to end, excluding resource and scheduling wait times) exceeds the set threshold, the system automatically stops the node and marks it as failed. The default value is 0.5 hours. You can set a value from 0 to 12 hours, with up to one decimal place.
	Scan Disable Period	After enabling this feature, set a start and end time. During this period, automatically triggered data sampling queries are not started and are marked as failed immediately. This prevents excessive computing resource consumption and ensures stable operation of production environment tasks.

Click OK to complete the data sampling configuration for data source tables.