Sync Greenplum Data in Offline Pipelines via Input Component - Dataphin

The Greenplum input component reads data from a Greenplum data source. To synchronize data from a Greenplum data source to another data source, first configure the Greenplum input component to read data from the source. Then, configure the destination data source for data synchronization. This topic describes how to configure the Greenplum input component.

Prerequisites

You have created a Greenplum data source. For more information, see Create a Greenplum data source.
The account used to configure the Greenplum input component properties must have read-through permissions on the data source. If your account does not have the required permissions, request them. For more information, see Request data source permissions.

Procedure

On the top menu bar of the Dataphin home page, choose Developer > Data Integration.
On the top menu bar of the integration page, select a Project. If you are in Dev-Prod mode, you must also select an environment.
In the left navigation pane, click Offline Integration. In the Offline Integration list, click the offline pipeline that you want to develop to open its configuration page.
In the upper-right corner of the page, click Component Library to open the Component Library panel.
In the navigation pane on the left of the Component Library panel, select Input. From the list of input components on the right, find the Greenplum component and drag it to the canvas.
On the Greenplum input component card, click the icon to open the Greenplum Input Configuration dialog box.

In the Greenplum Input Configuration dialog box, configure the parameters.

Parameter	Description
Step Name	The name of the Greenplum input component. Dataphin automatically generates a step name, which you can change as needed. The name can contain only Chinese characters, letters, underscores (_), and digits, and cannot exceed 64 characters.
Datasource	The drop-down list displays all Greenplum data sources in Dataphin. This includes data sources for which you have read-through permissions and those for which you do not. If you do not have read-through permissions for a data source, click Request next to it to request the permissions. For more information, see Request data source permissions. If you do not have a Greenplum data source, click New to create one. For more information, see Create a Greenplum data source.
Schema	Select the schema that contains the source table. Reading tables across schemas is supported. If the data source connection specifies a schema, it is selected by default. You can also select another schema for which you have permissions.
Source Table Quantity	Select the number of source tables. Options include single table and multiple tables: Single table: Use this option to synchronize data from one table to one destination table. Multiple tables: Use this option to synchronize data from multiple tables to a single destination table. This supports enumeration, regular expression-like patterns, and a mix of both, such as `table_[001-100];table_102`.
Table Match Mode	Select either General Rule or Database Regex. Note: This parameter is available only when Source Table Quantity is set to Multiple tables.
Table	Select the source table or tables. If you selected Single table for Source Table Quantity, search by entering a table name keyword, or enter the exact table name and click Exact Match. After you select a table, the system automatically checks its status. Click the icon to copy the name of the selected table. If you selected Multiple tables for Source Table Quantity, enter an expression to add tables based on the selected table match mode. With General Rule as the match mode, enter an expression in the input box to filter for tables with the same structure. The system supports enumeration, regular expression-like patterns, and a mix of both, for example `table_[001-100];table_102;`. With Database Regex as the match mode, enter a regular expression supported by the current database. The system uses this expression to match tables in the source database. At runtime, the node matches the latest set of tables based on the regular expression and synchronizes them. After you enter the expression, click Exact Match to view the list of matched tables in the Confirm Match Details dialog box.
Split Key	Select a column of an integer data type from the source table to use as the split key. For best results, use the primary key or an indexed column. When reading data, the system partitions the data based on the split key to perform concurrent reads, which improves data synchronization efficiency.
Batch Read Size	The number of records to read in a single batch. Instead of reading one record at a time, you can configure a batch size, such as 1024 records. This reduces the number of round trips to the data source, which improves I/O efficiency and lowers network overhead.
Input Filter	Configure filter conditions to extract specific data. The configuration is as follows: Configure a static value to extract corresponding data. For example, `ds=20210101`. Configure a variable to extract a subset of data. For example, `ds=${bizdate}`.
Output Fields	The Output Fields section displays all fields from the selected tables that match the filter conditions. The following operations are supported:
Field management: If you do not need to output certain fields to downstream components, you can delete them. To delete a single field or a small number of fields, click the icon in the Actions column of each unwanted field. To delete many fields in batch, click Field Management. In the Field Management dialog box, select multiple fields, click the left arrow icon to move them to the unselected list, and then click OK.
Batch add: Click Batch Add to configure fields in batch using JSON, TEXT, or DDL format. Note: After you add fields in batch and click OK, the existing field configuration is overwritten.
JSON format example: `[{ "index": 1, "name": "id", "type": "int(10)", "mapType": "Long", "comment": "comment1" }, { "index": 2, "name": "user_name", "type": "varchar(255)", "mapType": "String", "comment": "comment2" }]` Note: index specifies the column number of the object, name specifies the field name after import, and type specifies the field type after import. For example, `"index":3,"name":"user_id","type":"String"` means to import the fourth column from the file, name the field user_id, and set the field type to String.
TEXT format example, with one field per row: `1,id,int(10),Long,comment1` `2,user_name,varchar(255),String,comment2` The row delimiter separates the information for each field. The default delimiter is a line feed (\n). Semicolons (;) and periods (.) are also supported. The column delimiter separates the values within a row, such as the field name and the field type. The default is a half-width comma (,). The field type can be omitted.
DDL format example: `CREATE TABLE tablename ( user_id serial, username VARCHAR(50), password VARCHAR(50), email VARCHAR(255), created_on TIMESTAMP );`
Add output field: Click + Add Output Field. Follow the on-screen instructions to enter the Column, Type, and Comment, and select the Mapping Type. After you configure the current row, click the icon to save.
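
The TEXT format for Batch Add can be checked before you paste it into the dialog box. The following is a minimal Python sketch, not Dataphin's own parser: the `FieldSpec` type and the `parse_text_fields` function are hypothetical names, and the sketch assumes every row carries all five values (index, name, type, mapping type, comment). It does not handle field types that themselves contain the column delimiter, such as `decimal(10,2)`.

```python
from typing import List, NamedTuple


class FieldSpec(NamedTuple):
    """One output field, mirroring the five values in a TEXT row (hypothetical type)."""
    index: int
    name: str
    type: str
    map_type: str
    comment: str


def parse_text_fields(text: str, row_delim: str = "\n", col_delim: str = ",") -> List[FieldSpec]:
    """Split TEXT-format input into rows, then into columns.

    Assumption: each row has exactly five values. Blank rows are skipped.
    """
    fields = []
    for row in text.split(row_delim):
        row = row.strip()
        if not row:
            continue
        parts = [p.strip() for p in row.split(col_delim)]
        if len(parts) != 5:
            raise ValueError(f"expected 5 columns, got {len(parts)}: {row!r}")
        index, name, type_, map_type, comment = parts
        fields.append(FieldSpec(int(index), name, type_, map_type, comment))
    return fields


sample = "1,id,int(10),Long,comment1\n2,user_name,varchar(255),String,comment2"
fields = parse_text_fields(sample)
for f in fields:
    print(f.index, f.name, f.type, f.map_type)
```

Passing `row_delim=";"` covers the case where semicolons are used as the row delimiter instead of line feeds.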

Click Confirm to save the configuration for the Greenplum input component.
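
To preview which tables a General Rule expression such as `table_[001-100];table_102` might cover, the expression can be expanded locally. The following Python sketch rests on assumptions that this topic does not confirm: that `;` separates entries, that `[001-100]` denotes an inclusive numeric range padded to the width of the start value, and that each entry has at most one range. The function name `expand_table_expression` is hypothetical. The Confirm Match Details dialog box remains the authoritative list of matched tables.

```python
import re
from typing import List

# Matches a bracketed numeric range such as [001-100].
_RANGE = re.compile(r"\[(\d+)-(\d+)\]")


def expand_table_expression(expr: str) -> List[str]:
    """Expand an enumeration/range expression into table names.

    Assumptions: ';' separates entries, ranges are inclusive, and numbers
    are zero-padded to the width of the range's start value.
    """
    names = []
    for item in expr.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing ';'
        match = _RANGE.search(item)
        if match is None:
            names.append(item)  # plain enumerated table name
            continue
        start, end = match.group(1), match.group(2)
        width = len(start)
        prefix, suffix = item[:match.start()], item[match.end():]
        for n in range(int(start), int(end) + 1):
            names.append(prefix + str(n).zfill(width) + suffix)
    return names


tables = expand_table_expression("table_[001-100];table_102;")
print(len(tables), tables[0], tables[-1])
```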