All Products
Search
Document Center

DataWorks:PostgreSQL

Last Updated:Mar 26, 2026

DataWorks supports PostgreSQL as a bidirectional data source for both offline (batch) and real-time synchronization. Use the codeless UI or the code editor to configure sync tasks.

Supported versions

Supported PostgreSQL versions: 10, 11, 12, 13, 14, 15, and 16.4.

To check your PostgreSQL version, run:

SHOW SERVER_VERSION;

When to use real-time sync vs. offline sync

Criteria Recommended mode
You need to capture row deletions Real-time sync
The database changes frequently and you need near-real-time data Real-time sync
You need full CDC (change data capture) support Real-time sync
You need a scheduled bulk transfer or one-time migration Offline sync
The database does not support logical replication Offline sync

Real-time sync requires additional database-level prerequisites (Steps 2–4 in Prerequisites). Offline sync only requires Step 1.

Limitations

Offline read and write

Reading from views is supported.
  • Authentication: Password-based authentication is supported, including SCRAM-SHA-256. If the password or authentication method changes in the PostgreSQL database, update the data source configuration, test the connection, and manually re-run related tasks.

  • Special characters in table and field names: Table or field names that start with a digit, are case-sensitive, or contain a hyphen (-) must be enclosed in double quotation marks (""). Because double quotation marks are a JSON keyword in the PostgreSQL Reader and Writer plugins, escape each one with a backslash (\). For example, 123Test becomes \"123Test\". Example (code editor):

    - Escape both the opening and closing double quotation marks. - The codeless UI does not support escaping. Switch to the code editor to escape characters.
    "parameter": {
        "datasource": "abc",
        "column": [
            "id",
            "\"123Test\"" // Escaped column name
        ],
        "where": "",
        "splitPk": "id",
        "table": "public.wpw_test"
    }
  • Updates based on unique index: Not supported. Write data to a temporary table first, then use the RENAME operation.

Real-time read

The following limitations apply to real-time sync tasks in Data Integration.

DDL operation support

DDL operation Supported Notes
ADD COLUMN Yes (with constraints) Cannot be combined with other ADD COLUMN, DROP COLUMN, or DDL statements in the same transaction. Combining with DROP COLUMN or RENAME COLUMN causes the sync task to fail.
ALTER TABLE No
CREATE TABLE No
TRUNCATE No
Other DDL operations No Data Integration cannot detect DDL operations other than ADD COLUMN.

Other limitations

  • Unsupported table types: Replication of TEMPORARY, UNLOGGED, and Hyper tables is not supported. PostgreSQL does not provide a mechanism to parse logs for these table types.

  • Sequences: Replication of Sequences (serial, bigserial, identity) is not supported.

  • Large objects: Replication of large objects (Bytea) is not supported.

  • Views and foreign tables: Replication of views, materialized views, and foreign tables is not supported.

  • Table ownership: For single-table or whole-database real-time sync, only tables owned by the configured database user can be synchronized.

Supported data types

Most PostgreSQL data types are supported. For offline read and write, verify that your data types are in the list below — unsupported types are excluded from the sync.

Category PostgreSQL data types
Integer types BIGINT, BIGSERIAL, INTEGER, SMALLINT, SERIAL
Floating-point types DOUBLE PRECISION, MONEY, NUMERIC, REAL
String types VARCHAR, CHAR, TEXT, BIT, INET
Date and time types DATE, TIME, TIMESTAMP
Boolean type BOOL
Binary type BYTEA
For the PostgreSQL Reader, cast MONEY, INET, and BIT to a compatible type using PostgreSQL cast syntax, for example: a_inet::varchar.

Prerequisites

The steps below prepare your PostgreSQL database for use with DataWorks. Steps 2–4 apply only to real-time sync.

Step 1: Create a database account and configure permissions

Create a database login account with REPLICATION and LOGIN permissions.

Real-time sync uses logical replication, which follows a publish-and-subscribe model. The subscriber pulls data changes from publications on the publisher node. Logical replication starts by copying a snapshot from the publisher database, then streams changes in real time.
  1. Create a database account.

  2. Verify that the account has the replication permission:

    SELECT userepl FROM pg_user WHERE usename='<username>';

    If the result is False, grant the permission:

    ALTER USER <username> REPLICATION;

Step 2: Confirm the database is a primary (real-time sync only)

Real-time sync supports only primary databases. Run the following to check:

SELECT pg_is_in_recovery();

The result must be False. If it returns True, the database is a standby. Update the data source configuration to use the primary database. For more information, see Configure a PostgreSQL data source.

Step 3: Verify that wal_level is set to logical (real-time sync only)

SHOW wal_level;

The result must be logical. If it is not, logical replication is not supported on this database.

Step 4: Verify that wal_sender processes are available (real-time sync only)

-- Check the maximum number of wal_sender processes allowed.
SHOW max_wal_senders;

-- Check how many are currently in use.
SELECT COUNT(*) FROM pg_stat_replication;

If max_wal_senders is greater than the count from pg_stat_replication, idle wal_sender processes are available. PostgreSQL starts a wal_sender process for each data synchronization program that sends logs to subscribers.

For each table you want to synchronize, run:

ALTER TABLE <table_name> REPLICA IDENTITY FULL;

This is required. Without it, the real-time sync task reports an error.

Warning

After a real-time PostgreSQL sync task starts, DataWorks automatically creates a replication slot and a publication in the database:

  • Slot name format: di_slot_<solution_id>

  • Publication name format: di_pub_<solution_id> After you stop or unpublish the real-time sync task, manually delete the slot and publication. If you do not, PostgreSQL Write-Ahead Logging (WAL) will continue to grow and may exhaust disk space.

Add a PostgreSQL data source

Add the PostgreSQL data source to DataWorks before developing a sync task. For instructions, see Data source management. Parameter descriptions are available in the DataWorks console when you add the data source.

If SSL authentication is enabled on your PostgreSQL database, also enable SSL authentication when adding the data source in DataWorks. For more information, see Appendix 2: Add SSL authentication to a PostgreSQL data source.

Configure a sync task

Offline sync (single table)

For a complete parameter reference and script example, see Appendix 1: Script examples and parameter descriptions.

Real-time sync and full database offline sync

See Configure a real-time sync task for a full database.

FAQ

Why does data from a standby database differ from the primary?

In an active/standby disaster recovery setup, the standby continuously restores data from the primary, but a time lag exists. Network latency or high replication load can widen this gap significantly. Data synchronized from the standby is not a complete, up-to-date image of the primary. Configure the sync task to use the primary database to avoid this.

Can the PostgreSQL Reader guarantee data consistency across concurrent threads?

Not strictly. The PostgreSQL Reader uses a snapshot-based approach for single-threaded reads, so another process writing to the database during a sync does not affect the read. For concurrent reads using splitPk, each shard runs as a separate read transaction with no shared snapshot across threads, so the data is not from a single consistent point in time.

If consistency is critical, choose one of these approaches:

  • Use single-threaded sync (no splitPk). This is slower but ensures a consistent read.

  • Disable other writers during the sync window — for example, by locking tables. This affects online services.

How do I perform incremental sync?

The PostgreSQL Reader uses SELECT ... WHERE ... for data extraction. Two common patterns:

  • Timestamp-based: If your records have a modified_at or similar field, set the where parameter to filter rows updated since the last sync timestamp.

  • Auto-increment ID-based: Set the where parameter to filter rows with an ID greater than the maximum ID from the previous sync.

If your data has no field to distinguish new or updated rows, full sync is the only option.

What happens with encoding?

PostgreSQL supports only EUC_CN and UTF-8 for Simplified Chinese. The PostgreSQL Reader uses Java Database Connectivity (JDBC), which automatically detects and transcodes the encoding — no additional configuration is needed.

If the actual write encoding of the PostgreSQL database differs from its configured encoding, the Reader cannot detect or correct this, and the exported data may contain garbled characters.

Is the `querySql` parameter safe?

The querySql parameter lets you specify a custom SELECT statement. The PostgreSQL Reader does not perform any security checks on the query. Review the SQL before using this parameter in production.

Appendix 1: Script examples and parameter descriptions

Reader script example

{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "postgresql",
            "parameter": {
                "datasource": "",        // Data source name
                "column": [              // Columns to sync
                    "col1",
                    "col2"
                ],
                "where": "",             // Filter condition
                "splitPk": "",           // Column for data sharding (enables concurrent sync)
                "table": ""              // Table name
            },
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "stream",
            "parameter": {},
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": "0"                // Maximum number of error records
        },
        "speed": {
            "throttle": true,            // true: throttling enabled; false: mbps has no effect
            "concurrent": 1,             // Number of concurrent jobs
            "mbps": "12"                 // Throttle rate (1 mbps = 1 MB/s)
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}

Reader parameters

Parameter Description Required Default
datasource Name of the data source. Must match the name used when the data source was added. Yes None
table Name of the table to sync. Yes None
column Columns to sync, as a JSON array. Use ["*"] to sync all columns. Supports column subsets, reordering, constants, and expressions. Yes None
splitPk Column used for data sharding. Enables concurrent sync. Set to the primary key for even distribution. Only integer types are supported — non-integer values cause this parameter to be ignored and single-channel sync is used. If not specified, single-channel sync is used. No None
splitFactor Number of shards per concurrent thread. Total shards = number of concurrent threads x splitFactor. Valid range: 1–100. A large value may cause an out-of-memory (OOM) error. No 5
where Filter condition. The Reader builds a SELECT query using column, table, and where. If not set, the entire table is synced. No None
querySql Custom SQL query for advanced filtering (not available in the codeless UI). When set, table, column, and where are ignored. Example: SELECT a, b FROM table_a JOIN table_b ON table_a.id = table_b.id. No None
fetchSize Number of records fetched per batch from the database. Controls network round trips between Data Integration and the server. Values greater than 2048 may cause an OOM error. No 512

Writer script example

{
    "type": "job",
    "version": "2.0",
    "steps": [
        {
            "stepType": "stream",
            "parameter": {},
            "name": "Reader",
            "category": "reader"
        },
        {
            "stepType": "postgresql",
            "parameter": {
                "datasource": "",        // Data source name
                "column": [              // Destination columns
                    "col1",
                    "col2"
                ],
                "table": "",             // Destination table name
                "preSql": [],            // SQL to run before the sync task
                "postSql": []            // SQL to run after the sync task
            },
            "name": "Writer",
            "category": "writer"
        }
    ],
    "setting": {
        "errorLimit": {
            "record": "0"
        },
        "speed": {
            "throttle": true,
            "concurrent": 1,
            "mbps": "12"
        }
    },
    "order": {
        "hops": [
            {
                "from": "Reader",
                "to": "Writer"
            }
        ]
    }
}

Writer parameters

Parameter Description Required Default
datasource Name of the data source. Must match the name used when the data source was added. Yes None
table Name of the destination table. Yes None
column Destination columns to write to, as a JSON array. Use ["*"] to write to all columns in order. For example: ["id", "name", "age"]. Yes None
writeMode Write mode: insert or copy. insert (default): Uses INSERT INTO ... VALUES .... If a primary key or unique index conflict occurs, the row fails and becomes dirty data. copy: Uses COPY FROM to bulk-load data. Use this if you encounter performance issues with insert mode. No insert
preSql SQL statements to run before the sync task. The codeless UI supports one statement; the code editor supports multiple, for example to clear old data. No None
postSql SQL statements to run after the sync task. The codeless UI supports one statement; the code editor supports multiple, for example to add a timestamp. No None
batchSize Number of records submitted per batch. A larger value reduces network interactions with PostgreSQL and improves throughput, but setting it too high may cause an OOM error in Data Integration. No 1,024
pgType Type mappings for PostgreSQL-specific types. Supported: bigint[], double[], text[], jsonb, json. No None

`pgType` example:

{
    "job": {
        "content": [{
            "reader": {},
            "writer": {
                "parameter": {
                    "column": [
                        "bigint_arr",
                        "double_arr",
                        "text_arr",
                        "jsonb_obj",
                        "json_obj"
                    ],
                    "pgType": {
                        "bigint_arr": "bigint[]",
                        "double_arr": "double[]",
                        "text_arr":   "text[]",
                        "jsonb_obj":  "jsonb",
                        "json_obj":   "json"
                    }
                }
            }
        }]
    }
}

Appendix 2: Add SSL authentication to a PostgreSQL data source

SSL authentication files

When creating or modifying a PostgreSQL data source connection in DataWorks, configure SSL authentication based on your Access Control List (ACL) mode.

SSL link encryption Client-based encryption ACL mode Files required
Enabled Disabled Not applicable Truststore File (optional). If not configured, a regular connection is used. If configured, an SSL-encrypted connection is used.
Enabled Enabled prefer Keystore Certificate File and Private Key File (optional). The server does not enforce client verification. If Private Key File is encrypted, also configure Private Key Password.
Enabled Enabled verify-ca Keystore Certificate File (required), Private Key File (required), Private Key Password (required only if the private key is encrypted).
prefer: The PostgreSQL server does not forcibly verify the client certificate. If no SSL files are configured, a regular connection is used.
verify-ca: All three files are required. The server verifies the client certificate before establishing the connection.

Obtain the SSL authentication files

This section uses an ApsaraDB RDS for PostgreSQL instance as an example.

Obtain the Truststore File

For full instructions, see Use a cloud certificate to quickly enable SSL encryption.

  1. Go to the RDS Instances page, locate the RDS instance, and click the instance ID.

  2. Select the connection string to protect.

    - If a public endpoint is enabled, both an internal and a public endpoint are displayed. A cloud certificate can protect only one endpoint. Protect the public endpoint, as it is exposed to the internet; the internal endpoint is more secure within the VPC. See View internal and public endpoints. - To protect both endpoints, see Use a custom certificate to enable SSL encryption. - After configuring the cloud certificate, the instance Running Status changes to Modifying SSL for about three minutes. Wait until it returns to Running before continuing.

    image

  3. Click Download CA Certificate. The downloaded package contains three files. Upload the .pem or .p7b file to the Truststore File field in DataWorks.

    image

Obtain the Keystore File, Private Key File, and Private Key Password

Prerequisites: SSL encryption is enabled on the RDS instance (via cloud or custom certificate), and OpenSSL is installed.

Linux includes OpenSSL by default. For Windows, download and install the OpenSSL package.

For full instructions, see Configure a client CA certificate.

  1. Generate a self-signed certificate (ca1.crt) and private key (ca1.key):

    openssl req -new -x509 -days 3650 -nodes -out ca1.crt -keyout ca1.key -subj "/CN=root-ca1"
  2. Generate a client certificate signing request (CSR) and client private key. Set the CN value to the database username:

    openssl req -new -nodes -text -out client.csr -keyout client.key -subj "/CN=<client_username>"
  3. Generate the client certificate (client.crt):

    openssl x509 -req -in client.csr -text -days 365 -CA ca1.crt -CAkey ca1.key -CAcreateserial -out client.crt
  4. If the RDS server needs to verify the client CA certificate, open ca1.crt, copy its content, and paste it into the Enter Public Key from Client Certificate Authority dialog box.

    image

  5. Convert the client private key to PKCS#8 format (client.pk8). Upload this file to the Private Key File field in DataWorks.

    cp client.key client.pk8
  6. (Optional) To password-protect the private key, run:

    openssl pkcs8 -topk8 -inform PEM -in client.key -outform der -out client.pk8 -v1 PBE-MD5-DES

    You will be prompted to set a password. Use the same password for the Private Key Password field in the DataWorks PostgreSQL data source configuration.

Configure SSL authentication files in DataWorks

Upload the certificate files when creating or editing the PostgreSQL data source:

image

To set the Configure ACL mode: Go to the RDS Instances page, click the instance ID, then go to Data Security > Configure ACL. For more information, see Force clients to use SSL connections.

image
prefer: The PostgreSQL server does not forcibly verify the client certificate.
verify-ca: Upload the correct client certificate in DataWorks so the server can verify the client's identity.