Write Data to Greenplum via Output Component - Dataphin

Configure a Greenplum output component to write data from external databases to Greenplum, or to replicate and push data from storage systems connected to big data platforms to Greenplum for data integration and reprocessing. This topic describes how to configure a Greenplum output component.

Prerequisites

A Greenplum data source is created. For more information, see Create Greenplum Data Source.
The account used to configure the Greenplum output component properties must have write-through permission for the data source. If you do not have this permission, request it. For more information, see Request Data Source Permission.

Procedure

On the Dataphin homepage, choose Development > Data Integration from the top menu bar.
On the Integration page, select a project from the top menu bar. In Dev-Prod mode, also select an environment.
In the left navigation pane, click Batch Pipeline. In the Batch Pipeline list, click the batch pipeline you want to develop to open its configuration page.
In the upper-right corner of the page, click Component Library to open the Component Library panel.
In the left navigation pane of the Component Library panel, select Outputs. Then, from the output component list on the right, find the Greenplum component and drag it to the canvas.
Click and drag the icon of the target input, transform, or flow component to connect it to the current Greenplum output component.
Click the icon in the Greenplum output component card to open the Greenplum Output Configuration dialog box.

In the Greenplum Output Configuration dialog box, configure parameters.

Parameter		Description
Basic Settings	Step Name	The name of the Greenplum output component. Dataphin automatically generates the step name, and you can modify it as needed. The name can contain only Chinese characters, letters, underscores (_), and digits, and cannot exceed 64 characters.
	Datasource	The data source drop-down list displays all Greenplum data sources, including those for which you have write-through permission and those for which you do not. For data sources for which you do not have write-through permission, click Request next to the data source to request write-through permission. For more information, see Request Data Source Permission. If you do not have a Greenplum data source, click the New icon to create one. For more information, see Create Greenplum Data Source.
	Schema	Select a schema from the database. This parameter is required. If the data source connection already specifies a schema, that schema is selected by default. You can also select any other schema for which you have permission.
	Table	Select the target table for the output data. If the Greenplum data source has no target table for data synchronization, use the one-click table creation feature to generate one. First, click One-Click Table Creation. Dataphin generates the SQL script for creating the target table, including the target table name (the source table name by default) and the field types (initially converted from the Dataphin field types). Then modify the script as needed and click New. After the target table is created, Dataphin automatically uses it as the target table for the output data. Note If a table with the same name already exists in the development environment, Dataphin reports an error indicating that the table already exists after you click New.
	Production Table Missing Policy	The policy applied when the target table does not exist in the production environment. Select Do Not Process or Automatic Creation. The default value is Automatic Creation. Do Not Process: The production table is not created when the task is published. If the target table does not exist, the system warns you when you submit the task, but you can still publish the task. In this case, create the target table in the production environment before you run the task. Automatic Creation: A table with the same name is created in the target environment when the task is published. You must edit the table creation statement. The statement is pre-populated with the table creation statement for the selected table, and you can modify it. The table name in the statement must be the placeholder `${table_name}`. Only this placeholder is supported. At runtime, it is replaced with the actual table name. If the target table does not exist, the system creates it based on the table creation statement. If table creation fails, the publishing check result is Failed. Modify the table creation statement based on the error message, and then publish the task again. If the target table already exists, the system does not create it. Note This parameter is supported only in Dev-Prod mode projects.
	Loading Policy	Select Append Data or Copy. Append Data: If a primary key or constraint conflict occurs, the system reports a dirty data error. Copy: The system handles conflicts based on the selected conflict resolution policy. This policy supports only tables, not views.
	Conflict Resolution Policy	When Loading Policy is set to Copy, select a conflict resolution policy. Greenplum supports only Report Error on Conflict.
	Batch Write Data Volume (Optional)	The maximum volume of data written in one batch. The default value is 32 MB. You can also set Batch Write Record Count. During writing, a batch is written as soon as either of the two limits is reached.
	Batch Write Record Count (Optional)	The maximum number of records written in one batch. The default value is 2,048 records. Data is written in batches, controlled by Batch Write Record Count and Batch Write Data Volume. When the accumulated data reaches either limit, the system considers the batch full and immediately writes it to the destination. Set Batch Write Data Volume to 32 MB, and adjust Batch Write Record Count based on the actual size of a single record. Usually, set a larger value to take full advantage of batch writing. For example, if a single record is about 1 KB and Batch Write Data Volume is set to 16 MB, set Batch Write Record Count to a value greater than 16 MB divided by 1 KB (that is, greater than 16,384 records), for example 20,000 records. With this configuration, the data volume limit is reached first, so a write is triggered each time the accumulated data reaches 16 MB.
	Prepare Statement (Optional)	The SQL script executed on the database before data import. For example, to keep a service continuously available, create the target table Target_A before this step writes data, and have this step write data to Target_A. After this step finishes writing data, rename Service_B (the table that continuously provides services in the database) to Temp_C, rename Target_A to Service_B, and finally delete Temp_C.
	End Statement (Optional)	The SQL script executed on the database after data import.
Field Mapping	Input Fields	The system displays input fields based on the output of upstream components.
	Output Fields	The system displays the output fields. The following operations are supported: Field Management: Click Field Management to select output fields. Click the icon to move fields from Selected Input Fields to Unselected Input Fields, or click the icon to move fields from Unselected Input Fields to Selected Input Fields. Batch Add: Click Batch Add to configure fields in batches in JSON, TEXT, or DDL format. JSON format example: `// Example: [{ "name": "user_id", "type": "String" }, { "name": "user_name", "type": "String" }]` Note name is the name of the imported field, and type is the field type after import. For example, `"name":"user_id","type":"String"` imports the field named user_id and sets its field type to String. TEXT format example: `// Example: user_id,String user_name,String` The row delimiter separates the entries for different fields. The default value is a line feed (\n). Line feeds (\n), semicolons (;), and periods (.) are supported. The column delimiter separates the field name from the field type. The default value is a comma (,). DDL format example: `CREATE TABLE tablename ( id INT PRIMARY KEY, name VARCHAR(50), age INT );` New Output Field: Click + New Output Field, enter Column as prompted on the page, and select Type. After you finish configuring the current row, click the icon to save it.
	Field Mapping	Manually select field mappings based on the upstream input fields and the target table fields. Quick Mapping includes Row-Based Mapping and Name-Based Mapping. Name-Based Mapping: Maps fields that have the same field name. Row-Based Mapping: Maps fields that are in the same row. Use it when the source table and target table have different field names but the fields in corresponding rows need to be mapped.
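
The difference between the two quick mapping modes described in the Field Mapping row can be sketched in a few lines of Python. This is only an illustration, not Dataphin code: the function names and the field lists are hypothetical, and the sketch assumes that names are compared exactly (case-sensitive).

```python
def name_based_mapping(input_fields, output_fields):
    """Pair each input field with the output field that has the same name."""
    output_names = set(output_fields)
    return [(name, name) for name in input_fields if name in output_names]


def row_based_mapping(input_fields, output_fields):
    """Pair fields by position; a field with no counterpart in its row stays unmapped."""
    return list(zip(input_fields, output_fields))


# Hypothetical field lists, for illustration only.
upstream = ["user_id", "user_name", "created_at"]
target = ["user_name", "uid", "created_at"]

by_name = name_based_mapping(upstream, target)
# [('user_name', 'user_name'), ('created_at', 'created_at')]

by_row = row_based_mapping(upstream, target)
# [('user_id', 'user_name'), ('user_name', 'uid'), ('created_at', 'created_at')]
```

With these lists, name-based mapping leaves user_id unmapped because the target has no field with that name, while row-based mapping pairs user_id with user_name only because both are in the first row. Check the result of row-based mapping whenever the field order differs between source and target.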

Click Confirm to complete the configuration of the Greenplum output component.
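
To check which of the two batch limits (Batch Write Data Volume or Batch Write Record Count) triggers a write for a given record size, the arithmetic from the parameter table can be written as a small Python sketch. The function name is hypothetical, and the sketch assumes that all records have the same size, that 1 MB is 1,024 KB, and that a limit counts as reached when the accumulated value equals it. It does not reproduce Dataphin's internal behavior.

```python
KB = 1024
MB = 1024 * KB


def flush_trigger(record_size_bytes, max_records, max_bytes):
    """Return which limit a batch of equally sized records reaches first,
    together with the number of records in that batch."""
    # Records needed to reach the data volume limit (ceiling division).
    records_to_fill_volume = -(-max_bytes // record_size_bytes)
    if records_to_fill_volume <= max_records:
        return "data_volume", records_to_fill_volume
    return "record_count", max_records


# Example from the parameter table: 1 KB records, 16 MB and 20,000 records.
example = flush_trigger(1 * KB, 20_000, 16 * MB)
# ('data_volume', 16384)

# Default limits (32 MB, 2,048 records) with the same 1 KB records.
defaults = flush_trigger(1 * KB, 2_048, 32 * MB)
# ('record_count', 2048)
```

Under these assumptions, the default limits write about 2 MB per batch for 1 KB records, because the record count limit is reached long before the data volume limit. Raising Batch Write Record Count above 16,384 lets a 16 MB data volume limit take effect.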