A collection task connects to a specified data source through a collection adapter, collects object metadata from the source database into Dataphin, parses it with built-in resolvers, stores it, and presents it in a unified way. This topic describes how to create and manage metadata collection tasks.
Prerequisites
Before you can use an application system as a collection source, create the application system in Management Center > Datasource Management > Application System.
Limits
If the collected metadata contains objects whose names differ only in case, the system recognizes only the default form supported by the compute engine. For example, Oracle recognizes uppercase object names by default, and DM (DaMeng) recognizes the object that was collected first. Other objects with the same name are not processed.
PolarDB-X (formerly DRDS) data sources of version 2.0 or later support the collection of view objects.
Metadata collection for relational databases is supported by default. To collect metadata from other data source types, you need to purchase the corresponding features.
Prior to version 5.3, some data sources required you to initialize the Metadata Center in the metadata warehouse tenant before collection could start. These data sources include AnalyticDB for MySQL 3.0, PolarDB-X (formerly DRDS), SAP HANA, and Hologres. In version 5.3 and later, you do not need to initialize the Metadata Center. You can configure collection tasks directly.
Due to collection workflow upgrades, if you created collection tasks for PostgreSQL, MySQL, Microsoft SQL Server, Oracle, IBM DB2, Hive (MySQL metadatabase), or StarRocks before V5.1 and then upgraded to V5.1 or later without re-running the collection tasks, you cannot view the run logs of historical collection instances.
Elasticsearch data sources do not support listing management.
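The first limit above, about names that differ only in case, can be illustrated with a small sketch. The function and its policy names are hypothetical and are not part of Dataphin. The sketch assumes two policies that mirror the examples given for Oracle (keep the uppercase spelling) and DM (keep the spelling collected first).

```python
from typing import Dict, Iterable, List


def resolve_case_collisions(names: Iterable[str], policy: str) -> List[str]:
    """Keep one object per case-insensitive name; the others are skipped.

    policy is a hypothetical switch used only for this illustration:
      "uppercase"  - prefer the all-uppercase spelling (Oracle-like example)
      "first_seen" - keep whichever spelling was collected first (DM-like example)
    """
    if policy not in ("uppercase", "first_seen"):
        raise ValueError(f"unknown policy: {policy}")

    kept: Dict[str, str] = {}  # case-folded name -> spelling that is kept
    for name in names:
        key = name.casefold()
        if key not in kept:
            kept[key] = name
        elif policy == "uppercase" and name.isupper() and not kept[key].isupper():
            # Assumption: if no uppercase spelling exists, the first one stays.
            kept[key] = name
    return list(kept.values())
```

For example, calling the function with `["orders", "ORDERS", "Orders"]` keeps a single entry under either policy. Which spelling survives depends on the policy.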
Permission requirements
Super administrators, system administrators, and custom global roles with metadata collection task management permissions can create and manage metadata collection tasks.
Metadata collection workflow description
If the data source to be collected is in a network environment that cannot reach the network environment of the Dataphin cluster, use the register scheduling cluster feature. The collected data is first written, as a transit step, to the object storage system that the Dataphin deployment depends on (such as OSS), and is then written to Dataphin. This process incurs additional storage costs.
Create a collection task
In the top navigation bar of the Dataphin homepage, choose Administration > Metadata.
In the left-side navigation pane, click Collection Task, and then click +New Collection Task to open the New Collection Task dialog box.
In the New Collection Task dialog box, configure the parameters.
Parameter
Description
Collection Task Name
The name of the collection task, which must be globally unique and cannot exceed 512 characters.
Owner
The owner of the collection task. You can select a member who has collection task management permissions.
Collection Task Description
You can add a description for the collection task, which cannot exceed 1,000 characters.
Data Source
Select the source from which to collect metadata. Supported collection sources are data sources and application systems.
Datasource: Supports relational databases and big data storage databases. For more information, see Data sources supported by Dataphin.
Application System: Currently only supports Quick BI. Select the application system from which you want to collect metadata.
You can click View to go to the Data Source Management page, where the system will filter the relevant data sources for you.
Note: If the selected data source does not have a data source encoding configured, you may not be able to use the collected metadata through JDBC or in a BI platform later. For information about how to configure data source encoding, see Data sources supported by Dataphin.
A data source can only be configured with one collection task. Two different environment sources (development environment and production environment) of the same data source can be configured with separate collection tasks.
Collection Range
You can configure different task collection ranges based on different data source types or application systems.
When the data source type is Hive, the system will automatically parse the corresponding dbname (database name) based on the JDBC URL configured for the data source.
If the data source type is MySQL, AnalyticDB for MySQL 3.0, PolarDB-X, StarRocks, OceanBase (MySQL Tenant), ClickHouse, Amazon RDS for MySQL, SelectDB, Doris, DolphinDB, or TDSQL for MySQL, you can configure the collection scope based on the database under the data source instance. You can select All Databases or Specified Database.
All Databases: Dynamically retrieves all databases with query permissions based on the data source configuration.
Specified Database: Specifies other databases with permissions based on the data source configuration. If a database is already configured for the data source, it will be filled in by default. If you enter a custom database, the characters are case-sensitive.
When the data source type is Oracle, PostgreSQL, Microsoft SQL Server, SAP HANA, IBM DB2, Hologres, OceanBase (Oracle tenant), Greenplum, Amazon RDS for PostgreSQL, Amazon RDS for SQL Server, Amazon RDS for Oracle, Amazon RDS for DB2, Amazon Redshift, DM (DaMeng), or openGauss, configure the collection range based on the schema, which is the database name under the data source instance. Select All Schemas or Specified Schema.
All Schemas: Dynamically retrieves all schemas with query permissions based on the data source configuration.
Specified Schema: Specifies other schemas with permissions based on the data source configuration or quickly fills in the default schema with one click. If you enter a custom schema, the characters are case-sensitive.
When the data source is Quick BI, you can configure the collection range based on workspace. You can select all workspaces or specified workspaces.
All Workspaces: Dynamically retrieves all workspaces with query permissions based on the application system configuration.
Specified Workspace: Specifies other workspaces with permissions based on the application system configuration.
Note: For Hive and StarRocks data sources, the system collects the most recent 100,000 partitions of each partitioned table, ordered by creation time.
When the data source is OceanBase, the collection range is determined by the tenant mode configured for the data source. MySQL tenant collects metadata based on Database, while Oracle tenant collects metadata based on Schema.
Collection Object Type
Selected by default and cannot be modified. When the collection source is a data source, Tables, Views, and Fields are collected. When the collection source is an application system, Dashboards are collected.
Note: When the data source is Elasticsearch, indexes are collected as tables and index aliases are collected as views.
When the data source is StarRocks, synchronous materialized views cannot be collected.
Source System
Available only when the collection source is a data source. Select the source system to which the metadata collected from this source belongs. The source system can be used for asset object filtering, source system lineage display, and other scenarios. For information about how to create a source system, see Create and manage source systems.
Automatic Data Sampling
This option is available if data sampling is enabled in Administration > Metadata > Sampling Configuration, the trigger scenario includes metadata collection, and data preview is supported. If you enable this option, sample data is automatically collected during execution based on the collection scope defined in Sampling Configuration > Data Source. You can modify the collection scope.
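The Collection Range rules in the table above can be summarized as a lookup from source type to the unit that scopes the collection. The following is a minimal sketch of that lookup. The function name and the returned strings are assumptions made for illustration, and only the types named in the table are covered.

```python
# Source types whose collection range is configured per database.
DATABASE_SCOPED = {
    "MySQL", "AnalyticDB for MySQL 3.0", "PolarDB-X", "StarRocks",
    "OceanBase (MySQL Tenant)", "ClickHouse", "Amazon RDS for MySQL",
    "SelectDB", "Doris", "DolphinDB", "TDSQL for MySQL",
}

# Source types whose collection range is configured per schema.
SCHEMA_SCOPED = {
    "Oracle", "PostgreSQL", "Microsoft SQL Server", "SAP HANA", "IBM DB2",
    "Hologres", "OceanBase (Oracle tenant)", "Greenplum",
    "Amazon RDS for PostgreSQL", "Amazon RDS for SQL Server",
    "Amazon RDS for Oracle", "Amazon RDS for DB2", "Amazon Redshift",
    "DM (DaMeng)", "openGauss",
}


def collection_scope_unit(source_type: str) -> str:
    """Return the unit used to scope collection (hypothetical helper)."""
    if source_type == "Hive":
        # The database name is parsed from the data source's JDBC URL.
        return "database_from_jdbc_url"
    if source_type in DATABASE_SCOPED:
        return "database"
    if source_type in SCHEMA_SCOPED:
        return "schema"
    if source_type == "Quick BI":
        return "workspace"
    raise ValueError(f"no collection range rule listed for {source_type!r}")
```

Types that the table does not mention raise an error instead of defaulting to a guess.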
Click Next to configure the collection strategy.
Parameter
Description
Data Update Strategy
New/Changed Metadata
Compared with the previous collection, if there is new or updated data in the source system, the system will Add New Metadata And Update Changed Metadata. For dashboards, if a work is modified but not published (the work status is "Saved but not published"), the system will retain the previously collected published data without updating it.
Deleted Metadata
Compared with the previous collection, if there is deleted data in the source system, you can choose Delete from metadata list and asset list or Ignore deletion operation. For dashboards, you can choose If The Work Status Changes From "Published" To "Offline", Treat As Deleted or Ignore Deletion Operation.
Delete from metadata list and asset list/If the work status changes from "Published" to "Offline", treat as deleted: The collected metadata is deleted at the same time and cannot be recovered.
Ignore deletion operation: Ignore the deletion operation in the source system. You can still view the object details and historical versions in the metadata list and asset list, and you can manually delete them later.
Data Collection Schedule
Collection Frequency
Used to control the frequency of task collection. Supports Scheduled Collection and Manual Collection.
Scheduled Collection: Runs the collection task automatically at the configured time. Suitable when the collected metadata must stay up to date. Daily, Weekly, and Monthly schedules are supported, and the start time can be set from 00:00 to 23:59. For a Monthly schedule, you can select Last day of month.
When the system time zone (the time zone in User Center) is different from the scheduling time zone (the time zone configured in Management Center > System Settings > Basic Settings), the system will display both time zones. When a collection task is configured with scheduled collection time, the system will automatically calculate the corresponding time in the scheduling time zone and execute according to that time.
Manual Collection: Requires manual triggering of task collection. Suitable for scenarios where metadata changes infrequently and resource conservation is desired.
Runtime Configuration
Error Retry
For failed collection instances, the configured Retry Count and Retry Interval determine whether and when the instance is rerun.
Retry Count: Whether to retry automatically after a collection instance fails, and the maximum number of automatic retries. The default is 1. Valid values are integers from 1 to 10.
Retry Interval: The interval between automatic reruns. The default is 5 minutes. Valid values are 1 to 60 minutes.
Note: Error retry and scheduled collection may conflict. If the next scheduled collection time arrives while the previous collection is still running, the next scheduled collection is automatically delayed. You can manually terminate the running task in the collection instance list. For more information, see View and manage collection instances.
Runtime Timeout
If the total running time of a collection task (from start to end, not including resource waiting and scheduling waiting time) exceeds the set threshold and has not ended, the system will automatically terminate it and mark it as failed. The configurable time range is 0 to 24 hours, with a maximum of one decimal place.
Schedule Resource
The collection task consumes the resource quota of this resource group when it is scheduled. To prevent highly concurrent collection from consuming too many resources and affecting other system tasks, all collection tasks created by all tenants share one global concurrency limit, so allocate scheduling resources carefully. You can select resource groups under the current tenant whose status is Normal.
The network environment of the data source you select and the network environment of the scheduling resource group need to be interconnected, otherwise the collection task cannot be executed. After selection, you can click Test Connection to test network connectivity. If the test connection fails, you can click View Log to see the specific failure reason.
Connection Configuration
You can view the connection configuration information of the selected collection source as a reference for collection frequency and collection time configuration. For more information, see Data sources supported by Dataphin.
Note: The current connection configuration will be applied to offline integration tasks, global quality monitoring rules, and metadata collection tasks.
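The Data Update Strategy rows above compare each collection with the previous one. The following sketch shows one way to express that comparison. It is an illustration under stated assumptions, not Dataphin's implementation: each snapshot is modeled as a dictionary from object name to a hypothetical definition fingerprint, and the policy strings "delete" and "ignore" stand in for the two Deleted Metadata choices.

```python
from typing import Dict, Set, Tuple


def diff_snapshots(
    previous: Dict[str, str], current: Dict[str, str]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (added, changed, deleted) object names between two collections."""
    added = set(current) - set(previous)
    deleted = set(previous) - set(current)
    changed = {
        name for name in set(previous) & set(current)
        if previous[name] != current[name]
    }
    return added, changed, deleted


def apply_update_strategy(
    stored: Dict[str, str], current: Dict[str, str], delete_policy: str
) -> Dict[str, str]:
    """Apply new and changed metadata, then handle deletions per policy.

    delete_policy (hypothetical values):
      "delete" - remove objects that no longer exist in the source
      "ignore" - keep them so they can still be viewed or deleted manually
    """
    if delete_policy not in ("delete", "ignore"):
        raise ValueError(f"unknown delete policy: {delete_policy}")

    added, changed, deleted = diff_snapshots(stored, current)
    result = dict(stored)
    for name in added | changed:
        result[name] = current[name]  # new and changed metadata always applied
    if delete_policy == "delete":
        for name in deleted:
            del result[name]
    return result
```

With the "ignore" policy, an object removed from the source stays in the result, which matches the description that its details and historical versions remain visible.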
Click OK to complete the creation of the collection task.
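The numeric limits of the collection strategy (Retry Count, Retry Interval, Runtime Timeout) and the time zone conversion described for Scheduled Collection can be sketched as follows. The class and function names are hypothetical. The sketch treats the stated ranges as inclusive and models time zones as fixed UTC offsets, so it ignores daylight saving time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class RuntimeConfig:
    timeout_hours: float             # Runtime Timeout; no default is stated
    retry_count: int = 1             # documented default
    retry_interval_minutes: int = 5  # documented default

    def validate(self) -> None:
        """Raise ValueError if a value falls outside the documented range."""
        if not isinstance(self.retry_count, int) or not 1 <= self.retry_count <= 10:
            raise ValueError("Retry Count must be an integer from 1 to 10")
        if not 1 <= self.retry_interval_minutes <= 60:
            raise ValueError("Retry Interval must be 1 to 60 minutes")
        if not 0 <= self.timeout_hours <= 24:
            raise ValueError("Runtime Timeout must be 0 to 24 hours")
        tenths = self.timeout_hours * 10
        if abs(tenths - round(tenths)) > 1e-9:
            raise ValueError("Runtime Timeout allows at most one decimal place")


def to_scheduling_time(hhmm: str, user_offset_hours: float,
                       scheduling_offset_hours: float) -> str:
    """Convert a start time from the user's zone to the scheduling zone.

    Only the time of day is returned; the calendar day may shift.
    """
    user_tz = timezone(timedelta(hours=user_offset_hours))
    scheduling_tz = timezone(timedelta(hours=scheduling_offset_hours))
    parsed = datetime.strptime(hhmm, "%H:%M")
    # The date is arbitrary; it only anchors the conversion.
    local = datetime(2024, 1, 1, parsed.hour, parsed.minute, tzinfo=user_tz)
    return local.astimezone(scheduling_tz).strftime("%H:%M")
```

For example, a start time of 08:30 entered in a UTC+8 user time zone corresponds to 00:30 in a UTC scheduling time zone.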
Manage collection tasks
The Collection Task page displays information about collection tasks, including name, data source and data source encoding, data source type, collection method, status and time of the most recent collection, description, owner, effective status, task status, and last update time. You can click the Datasource Management button in the upper right corner to navigate to the Management Center > Data Source page to manage collection sources.
Task Status: The collection task list shows the status of each task. The operations that are available depend on the task status, as shown in the following table.
Task Status
Operations
Normal
View, Edit, Temporary Manual Execution (supported for scheduled collection tasks), Manual Execution (supported for manual tasks), Clone, Delete, View Metadata, View Collection Instances, Enable or Disable Effective Status.
Creation Failed
Retry, View Execution Log, View, Edit, Delete.
Update Failed/Deletion Failed/Enable Failed/Disable Failed
Retry, View Execution Log, View, Edit, Delete, View Metadata, View Collection Instances.
Enabling/Disabling
View.
Modifying the effective status is not supported when enabling or disabling.
Creating/Updating/Deleting
View.
Abnormal
View, Edit, Delete, View Metadata, View Collection Instances.
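The status table above can be read as a lookup from task status to the set of permitted operations. The following is a minimal sketch of that lookup. The function name and the "scheduled" and "manual" method strings are assumptions made for illustration.

```python
from typing import Set

_FAILED_BASE = {"Retry", "View Execution Log", "View", "Edit", "Delete"}
_RESULTS = {"View Metadata", "View Collection Instances"}
_IN_PROGRESS = {"Enabling", "Disabling", "Creating", "Updating", "Deleting"}
_FAILED_WITH_RESULTS = {
    "Update Failed", "Deletion Failed", "Enable Failed", "Disable Failed",
}


def allowed_operations(task_status: str, collection_method: str) -> Set[str]:
    """Return the operations listed for a task status (hypothetical helper).

    collection_method is "scheduled" or "manual"; it only matters for
    tasks in Normal status.
    """
    if task_status == "Normal":
        ops = {
            "View", "Edit", "Clone", "Delete", "View Metadata",
            "View Collection Instances", "Enable or Disable Effective Status",
        }
        if collection_method == "scheduled":
            ops.add("Temporary Manual Execution")
        elif collection_method == "manual":
            ops.add("Manual Execution")
        else:
            raise ValueError(f"unknown collection method: {collection_method}")
        return ops
    if task_status == "Creation Failed":
        return set(_FAILED_BASE)
    if task_status in _FAILED_WITH_RESULTS:
        return _FAILED_BASE | _RESULTS
    if task_status in _IN_PROGRESS:
        return {"View"}
    if task_status == "Abnormal":
        return {"View", "Edit", "Delete"} | _RESULTS
    raise ValueError(f"unknown task status: {task_status}")
```

A caller could use the returned set to decide which actions to offer for a task in a given status.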
(Optional) You can search for target collection tasks by collection task or data source name, quickly filter tasks that you are responsible for or effective tasks, or filter target collection tasks by task status, effective status, owner, data source, or collection method.
You can perform the following operations in the operation column of the target collection task.
Operation
Description
Retry
Supports rerunning failed collection tasks.
View Execution Log
Supports viewing the execution logs of failed collection tasks.
View
Supports viewing the configuration information of collection tasks.
Edit
The data source type and data source cannot be modified. Changes to other settings do not affect the effective status.
Temporary Manual Execution
Only scheduled collection tasks in normal status support temporary manual execution. If the instance from this execution is still not finished when the next scheduled run time is reached, it may cause data inconsistency. If the task already has a running instance (scheduled collection instance or temporary manually executed collection instance), you need to terminate that instance first, and then perform the operation again.
Manual Execution
Only manual collection tasks in normal status support manual execution. If the task already has a running instance (scheduled collection instance or temporary manually executed collection instance), you need to terminate that instance first, and then perform the operation again.
Clone
Supports quickly copying the configuration information of a collection task, but you need to reconfigure the data source and collection range.
Delete
Single Delete: Click the icon in the operation column and select Delete to delete the collection task.
Batch Delete: Select the collection tasks that you want to delete, and then click the icon at the bottom to delete them in a batch.
Note: Deleting a task does not affect instances that are currently running. If needed, you can manually terminate them. After the task is successfully deleted, no new collection instances will be generated. You can configure the deletion strategy as Synchronously delete collected metadata or Only delete the task, retain collected metadata.
Synchronously delete collected metadata: Synchronously delete the metadata collected from the specified data source through this task from the metadata list and asset list.
Only delete the task, retain collected metadata: Only delete the collection task itself, and retain the metadata already collected from the specified data source in the metadata list and asset list. If you later create a new collection task with the same data source, it may overwrite the retained metadata information.
View Metadata List
Opens the metadata list page, filtered to the metadata related to the data source configured for this task.
View Collection Instances
Opens the collection instance list page, filtered to the instances related to this task.
Modify Effective Status
Modify Single Effective Status: Click the switch in the effective status column to enable or disable the effective status.
Batch Modify Effective Status: Select the collection tasks for which you want to modify the effective status, and then click the icon at the bottom to enable or disable the effective status.
Note: After enabling, the collection task will automatically execute according to the configured schedule. After disabling, instances that are currently running or have been generated and are waiting to run are not affected, but subsequently generated collection instances will not be automatically executed. You can manually run the task.
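The preconditions described above for Temporary Manual Execution and Manual Execution can be expressed as a simple guard. This is a sketch with hypothetical names, not an actual Dataphin API. It assumes that the task status, collection method, and running-instance flag have already been read from the collection task list.

```python
def manual_trigger_operation(collection_method: str, task_status: str,
                             has_running_instance: bool) -> str:
    """Return which manual trigger applies, or raise if it is not allowed."""
    if task_status != "Normal":
        raise RuntimeError("only tasks in Normal status can be triggered manually")
    if has_running_instance:
        # A running scheduled or manually triggered instance must be
        # terminated before the task is triggered again.
        raise RuntimeError("terminate the running instance first")
    if collection_method == "scheduled":
        return "Temporary Manual Execution"
    if collection_method == "manual":
        return "Manual Execution"
    raise ValueError(f"unknown collection method: {collection_method}")
```

The guard only selects the operation; it does not check whether a temporary run might still be in progress at the next scheduled time, which the table warns can cause data inconsistency.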
What to do next
After the collection task is completed, you can view the task's execution status in the collection instance list. For more information, see View and manage collection instances.
After the collection task runs successfully, you can view the collected metadata in the metadata list. For more information, see View and manage metadata list.