
DataWorks:Amazon Redshift

Last Updated: Mar 25, 2026

The Amazon Redshift data source provides a bidirectional channel to read from and write to Amazon Redshift. Use it to configure data synchronization tasks through the codeless UI or the code editor.

Supported sync modes

Capability                                   | Supported
Incremental sync (cursor-based, using where) | Yes
Concurrent reads (using splitPk)             | Yes
Write mode                                   | Insert only

Supported versions

Amazon Redshift is supported through the redshift-jdbc42 driver, version 2.1.0.1. For driver details, see Configure a JDBC driver for Amazon Redshift.
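If you want the same driver locally (for example, to verify connectivity before configuring the data source), Amazon publishes it to Maven Central. The coordinates below are an assumption inferred from the driver name and version above; verify them against your repository:

```xml
<!-- Assumed Maven coordinates for the Amazon Redshift JDBC 4.2 driver -->
<dependency>
    <groupId>com.amazon.redshift</groupId>
    <artifactId>redshift-jdbc42</artifactId>
    <version>2.1.0.1</version>
</dependency>
```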

Supported field types

The following table lists the supported field types and their SQL and Java mappings. For the full type reference, see the official Amazon Redshift documentation.

Amazon Redshift type | SQL type           | Java type
BIGINT               | SQL_BIGINT         | Long
BOOLEAN              | SQL_BIT            | Boolean
CHAR                 | SQL_CHAR           | String
DATE                 | SQL_TYPE_DATE      | java.sql.Date
DECIMAL              | SQL_NUMERIC        | BigDecimal
DOUBLE PRECISION     | SQL_DOUBLE         | Double
GEOMETRY             | SQL_LONGVARBINARY  | byte[]
INTEGER              | SQL_INTEGER        | Integer
OID                  | SQL_BIGINT         | Long
SUPER                | SQL_LONGVARCHAR    | String
REAL                 | SQL_REAL           | Float
SMALLINT             | SQL_SMALLINT       | Short
TEXT                 | SQL_VARCHAR        | String
TIME                 | SQL_TYPE_TIME      | java.sql.Time
TIMETZ               | SQL_TYPE_TIME      | java.sql.Time
TIMESTAMP            | SQL_TYPE_TIMESTAMP | java.sql.Timestamp
TIMESTAMPTZ          | SQL_TYPE_TIMESTAMP | java.sql.Timestamp
VARCHAR              | SQL_VARCHAR        | String

Prerequisites

Before you begin, ensure that you have:

  • Established network connectivity between Amazon Redshift and a serverless resource group or an exclusive resource group for Data Integration. Use a serverless resource group for most synchronization tasks. For setup instructions, see Network connectivity solutions.

  • The Amazon Redshift host, port, database name, username, and password.

Add the data source

Add Amazon Redshift as a data source in DataWorks by following the steps in Data source management. Check the parameter descriptions in the DataWorks console as you fill in each field.

The following parameters are specific to Amazon Redshift:

  • JDBC URL: The Java Database Connectivity (JDBC) connection string, which must include the host IP address, port number, database name, and any connection parameters. Both public and private IP addresses are supported. If you use a public IP address, confirm that the Data Integration resource group can reach the Amazon Redshift host.

  • Username: The username for the Amazon Redshift database.

  • Password: The password for the specified username.
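For reference, Amazon Redshift JDBC connection strings follow the standard format shown below. The host and database name in the second line are placeholders, not a real endpoint; Amazon Redshift listens on port 5439 by default:

```
jdbc:redshift://<host>:<port>/<database>
jdbc:redshift://examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com:5439/dev
```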

Develop a data sync task

To configure an offline sync task for a single table, use the codeless UI or the code editor in DataWorks Data Integration.

For code editor script parameters and examples, see Appendix: Script examples and parameter descriptions.

Appendix: Script examples and parameter descriptions

The following Reader and Writer scripts follow the unified script format for batch synchronization tasks. For format requirements, see Configure a task in the code editor.

Reader

Script example

{
  "stepType": "redshift",
  "parameter": {
    "datasource": "redshift_datasource",
    "table": "redshift_table_name",
    "where": "xxx=3",
    "splitPk": "id",
    "column": [
      "id",
      "table_id",
      "table_no",
      "table_name",
      "table_status"
    ]
  },
  "name": "Reader",
  "category": "reader"
}

Parameters

Parameter  | Description | Required | Default
datasource | The data source name. Must match the name of the data source you added in DataWorks. | Yes | None
table      | The name of the source table. | Yes | None
column     | The columns to sync, as a list. To sync all columns, use ["*"]. | Yes | None
where      | A filter condition used to build the SQL query. For incremental sync, use a condition such as gmt_create>$bizdate to sync only records from the current day. If left blank, all rows are synced. | No | None
splitPk    | The column used for data partitioning. When specified, the Reader splits the data and runs concurrent tasks to improve sync throughput. | No | None
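To make the splitPk behavior concrete, the sketch below shows one way a reader could shard a numeric primary key into ranges and build one SELECT per shard, with the where condition applied to every shard. This is an illustration of the partitioning idea only, not the actual DataWorks algorithm; all function names are hypothetical.

```python
def split_ranges(min_pk, max_pk, num_splits):
    """Divide [min_pk, max_pk] into contiguous, non-overlapping
    shard ranges (illustrative, not the real splitting algorithm)."""
    step = max(1, (max_pk - min_pk + 1) // num_splits)
    ranges = []
    lo = min_pk
    while lo <= max_pk:
        hi = min(lo + step - 1, max_pk)
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

def shard_query(table, columns, pk, lo, hi, where=None):
    """Build the SELECT for one shard; the user-supplied `where`
    filter is ANDed with the shard's primary-key range."""
    cond = f"{pk} >= {lo} AND {pk} <= {hi}"
    if where:
        cond = f"({where}) AND {cond}"
    return f"SELECT {', '.join(columns)} FROM {table} WHERE {cond}"
```

With four shards running concurrently, each query scans only its own key range, which is why specifying splitPk improves throughput on large tables.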

Writer

Script example

{
  "stepType": "redshift",
  "parameter": {
    "preSql": ["delete from XXX;"],
    "postSql": ["delete from XXX;"],
    "datasource": "redshift_datasource",
    "table": "redshift_table_name",
    "writeMode": "insert",
    "batchSize": 2048,
    "column": [
      "id",
      "table_id",
      "table_no",
      "table_name",
      "table_status"
    ]
  },
  "name": "Writer",
  "category": "writer"
}

Parameters

Parameter | Description | Required | Default
datasource | The data source name. Must match the name of the data source you added in DataWorks. | Yes | None
table     | The name of the destination table. | Yes | None
column    | The destination columns to write to, as a list. To write to all columns in order, use ["*"]. | Yes | None
preSql    | SQL to run before the sync task starts, such as a statement to clear old data. The codeless UI supports one statement; the code editor supports multiple. | No | None
postSql   | SQL to run after the sync task completes, such as a statement to add a timestamp. The codeless UI supports one statement; the code editor supports multiple. | No | None
batchSize | The maximum number of rows to write per batch. | No | 2048
writeMode | The write mode. Only insert is supported. | No | insert
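The batchSize parameter controls how many rows are buffered before each multi-row INSERT is issued. The sketch below illustrates that buffering pattern; the class and method names are hypothetical and do not reflect the DataWorks implementation.

```python
class BatchedInsertWriter:
    """Illustrative buffer-and-flush writer: rows accumulate until
    batch_size is reached, then the batch is written in one INSERT."""

    def __init__(self, batch_size=2048):
        self.batch_size = batch_size
        self.buffer = []
        self.flushed_batches = []  # stands in for executed INSERT batches

    def write(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            # A real writer would execute a multi-row INSERT here.
            self.flushed_batches.append(list(self.buffer))
            self.buffer.clear()
```

A larger batchSize means fewer round trips to Amazon Redshift at the cost of more memory per batch; the final partial batch is flushed when the task finishes.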