All Products
Search
Document Center

DataWorks:Amazon Redshift data source

Last Updated:Sep 04, 2023

DataWorks provides Amazon Redshift Reader and Amazon Redshift Writer for you to read data from and write data to Amazon Redshift data sources. You can configure a synchronization task for an Amazon Redshift data source by using the codeless user interface (UI) or code editor. This topic describes the capabilities of synchronizing data from or to Amazon Redshift data sources.

Supported Amazon Redshift versions

The driver that is used by Amazon Redshift is redshift-jdbc4.2 Driver 2.1.0.1. For information about the capabilities of the driver, see Configure a JDBC driver version for Amazon Redshift.

Data type mappings

For information about the mappings between Amazon Redshift data types and SQL data types and the mappings between Amazon Redshift data types and Java data types, see the official Amazon Redshift documentation. The following table lists the data type mappings.

Amazon Redshift data type

SQL data type

Java data type

BIGINT

SQL_BIGINT

LONG

BOOLEAN

SQL_BIT

Boolean

CHAR

SQL_CHAR

STRING

DATE

SQL_TYPE_DATE

java.sql.Date

DECIMAL

SQL_NUMERIC

BigDecimal

DOUBLE PRECISION

SQL_DOUBLE

Double

GEOMETRY

SQL_ LONGVARBINARY

byte[]

INTEGER

SQL_INTEGER

INTEGER

OID

SQL_BIGINT

LONG

SUPER

SQL_LONGVARCHAR

STRING

REAL

SQL_REAL

Float

SMALLINT

SQL_SMALLINT

SHORT

TEXT

SQL_VARCHAR

STRING

TIME

SQL_TYPE_TIME

java.sql.Time

TIMETZ

SQL_TYPE_TIME

java.sql.Time

TIMESTAMP

SQL_TYPE_ TIMESTAMP

java.sql.Timestamp

TIMESTAMPTZ

SQL_TYPE_ TIMESTAMP

java.sql.Timestamp

VARCHAR

SQL_VARCHAR

STRING

Establish a network connection for an Amazon Redshift data source

Before you run a synchronization task configured for an Amazon Redshift data source to synchronize data, you must establish a network connection between the data source and your exclusive resource group for Data Integration over an internal network. For more information, see Establish a network connection between a resource group and a data source.

Develop a synchronization task

For information about the entry point for and the procedure of configuring a synchronization task, see the following sections. For information about the parameter settings, view the infotip of each parameter on the configuration tab of the synchronization task.

Add a data source

Before you configure a synchronization task to synchronize data from or to a specific data source, you must add the data source to DataWorks. For more information, see Add and manage data sources. You must configure the following parameters when you add an Amazon Redshift data source to DataWorks:

  • JDBC URL: Specify a JDBC connection string that contains an IP address, a port number, a database name, and connection parameters. You can specify a public IP address or a private IP address. If you specify a public IP address, make sure that your exclusive resource group for Data Integration can access the host on which the Amazon Redshift data source is deployed.

  • Username: Specify the username of the Amazon Redshift database that you want to connect to.

  • Password: Specify the password of the Amazon Redshift database that you want to connect to.

Configure a batch synchronization task to synchronize data of a single table

Appendix: Code and parameters

Configure a batch synchronization task by using the code editor

If you use the code editor to configure a batch synchronization task, you must configure parameters for the reader and writer of the related data source based on the format requirements in the code editor. For more information about the format requirements, see Configure a batch synchronization task by using the code editor. The following information describes the configuration details of parameters for the reader and writer in the code editor.

Code for Amazon Redshift Reader and parameters in code for Amazon Redshift Reader

Code for Amazon Redshift Reader

{
  "stepType": "redshift"
  "parameter":
  {
    "datasource":"redshift_datasource",
    "table": "redshift_table_name",
    "where": "xxx=3",
    "splitPk": "id",
    "column":
    [
      "id",
      "table_id",
      "table_no",
      "table_name",
      "table_status"
    ]
  },
  "name": "Reader",
  "category": "reader"
}

Parameters in code for Amazon Redshift Reader

Parameter

Description

Required

Default value

datasource

The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor.

Yes

No default value

table

The name of the table from which you want to read data.

Yes

No default value

column

The names of the columns from which you want to read data. Separate the names with commas (,), such as "column":["id","name","age"].

If you want to read data from all the columns in the source table, set this parameter to an asterisk (*), such as "column":["*"].

Yes

No default value

where

The WHERE clause. The batch synchronization task concatenates the values of the column, table, and where parameters to form an SQL statement and executes the SQL statement to read data from the source table. When you perform a test, you can set the where parameter to limit 10.

To read the data that is generated on the current day, you can set the where parameter to gmt_create > $bizdate.

  • You can use the WHERE clause to read incremental data.

  • If the where parameter is not provided or is left empty, Amazon Redshift Reader reads all data from the source table.

No

No default value

splitPk

The field that is used for data sharding. If you configure this parameter, the source table is sharded based on the value of this parameter. Data Integration then runs parallel threads to read data. This way, data can be synchronized more efficiently.

No

No default value

Code for Amazon Redshift Writer and parameters in code for Amazon Redshift Writer

Code for Amazon Redshift Writer

{
  "stepType": "redshift",// The plug-in name. 
  "parameter":
  {
    "postSql":["delete from XXX;"],
    "preSql":["delete from XXX;"],
    "datasource":"redshift_datasource",// The name of the data source. 
    "table": "redshift_table_name",// The name of the table. 
    "writeMode": "insert",
    "batchSize": 2048,
    "column":
    [
      "id",
      "table_id",
      "table_no",
      "table_name",
      "table_status"
    ]
  },
  "name": "Writer",
  "category": "writer"
}

Parameters in code for Amazon Redshift Writer

Parameter

Description

Required

Default value

datasource

The name of the data source. It must be the same as the name of the added data source. You can add data sources by using the code editor.

Yes

No default value

table

The name of the table to which you want to write data.

Yes

No default value

column

The names of the columns to which you want to write data. Separate the names with commas (,), such as "column":["id","name","age"].

If you want to write data to all the columns in the destination table, set this parameter to an asterisk (*), such as "column":["*"].

Yes

No default value

preSql

The SQL statement that you want to execute before the batch synchronization task is run. For example, you can set this parameter to the SQL statement that is used to delete outdated data. You can execute only one SQL statement on the codeless UI and multiple SQL statements in the code editor.

No

No default value

postSql

The SQL statement that you want to execute after the batch synchronization task is run. For example, you can set this parameter to the SQL statement that is used to add a timestamp. You can execute only one SQL statement on the codeless UI and multiple SQL statements in the code editor.

No

No default value

batchSize

The maximum number of data records to write at a time.

No

2048

writeMode

The write mode. Set the value to insert.

No

insert