When you use DataWorks to synchronize data from a PostgreSQL database to Hologres, you can refer to the operations in this topic to configure the network, whitelists, and permissions for data sources to implement data synchronization.

Prerequisites

Before you configure data sources, make sure that the following operations are performed:
  • Data sources are purchased. A source PostgreSQL data source and a destination Hologres data source are purchased.
  • Plan and prepare resources: An exclusive resource group for Data Integration is purchased and configured. For more information, see Plan and configure resources.
  • Evaluate and plan the network environment: Before you perform data integration, connect data sources to exclusive resource groups for Data Integration based on your business requirements. After data sources and exclusive resource groups for Data Integration are connected, you can refer to the operations in this topic to configure access settings such as vSwitches and whitelists.
    • If data sources and exclusive resource groups for Data Integration reside in the same region and virtual private cloud (VPC), they are automatically connected.
    • If data sources and exclusive resource groups for Data Integration reside in different network environments, you must connect data sources and resource groups by using methods such as a VPN gateway.

Background information

Before you synchronize data from the source to the destination, make sure that the data sources and exclusive resource groups for Data Integration are connected. In addition, you must create an account and authorize the account to access the data sources.
  • Configure whitelists for the data sources
    If the data sources and the exclusive resource group for Data Integration reside in the same VPC, you must add the CIDR block of the vSwitch that is bound to the exclusive resource group for Data Integration during network configuration to the whitelists of the data sources. This ensures that the exclusive resource group for Data Integration can be used to access the data sources. VPC connection
  • Create an account and grant permissions the account

    You must create an account that can be used to access the data sources, read data from the source, and write data to the destination during the data synchronization process.

  • Check whether the database version is the version supported by real-time synchronization nodes of Data Integration in DataWorks.

    The following PostgreSQL database versions are supported: PostgreSQL 10, 11, and 12. You can execute the following statement to view the version of the PostgreSQL database:

    show server_version

Limits

Real-time synchronization nodes of Data Integration have the following limits:
  • Data Integration supports ADD COLUMN statements.
    • An ADD COLUMN statement cannot be combined with other DDL statements, such as DROP COLUMN.
      Notice If you execute an ADD COLUMN statement together with an ALTER COLUMN statement, such as a DROP COLUMN or RNAME COLUMN statement on a data synchronization node, the node cannot normally work.
    • DDL statements except ADD COLUMN statements cannot be identified.
  • ALTER TABLE and CREATE TABLE statements are not supported.
  • TEMPRORARY tables and UNLOGGED tables cannot be replicated. The PostgreSQL database does not provide a mechanism for log parsing subscriptions to these two types of tables.
  • Sequences cannot be replicated (serial, bigserial, and identity).
  • TRUNCATE statements are not supported.
  • Large objects cannot be replicated (BYTEA).
  • Views, materialized views, and foreign tables cannot be replicated.

Procedure

  1. Configure a whitelist for the PostgreSQL database.
    Add the CIDR block of the virtual private cloud (VPC) where the exclusive resource group resides to the whitelist of the PostgreSQL database.
    1. View and record the elastic IP address (EIP) and CIDR block of the exclusive resource group.
    2. Add the elastic IP address (EIP) and CIDR block of the exclusive resource group to the whitelist of the PostgreSQL database.
  2. Create an account and grant the required permissions to the account.
    You must create an account to log on to the PostgreSQL database for subsequent operations. The account must have the REPLICATION and LOGIN permissions on the PostgreSQL database.
    Note

    Real-time synchronization supports only the logical replication mechanism. Logical replication uses a publish and subscribe model in which one or more subscribers subscribe to one or more publications on the node of the publisher. The subscribers extract data from the publications to which they subscribe.

    Logical replication of a table usually starts with creating a snapshot for the data in the database of the publisher and replicating the snapshot to the subscriber. After logical replication is complete, modifications on the publisher are sent to the subscriber in real time.

    1. Create an account.
    2. Grant the required permissions to the account.
      Execute the following statement to check whether the account has the REPLICATION permission on the PostgreSQL database:

      select userepl from pg_user where usename='xxx'

      The expected return result is True. If False is returned, the account does not have the REPLICATION permission on the PostgreSQL database. You can execute the following statement to grant the REPLICATION permission to the account:

      ALTER USER <user> REPLICATION;

  3. Execute the following statement to check whether real-time synchronization supports secondary databases:
    SELECT pg_is_in_recovery()

    Real-time synchronization supports only primary databases. The expected return result is False. If True is returned, it indicates that the data source is a secondary database. Real-time synchronization does not support secondary databases. You must modify the configuration of the data source to the primary database. For more information, see Configure a PostgreSQL connection.

  4. Check whether the value of the wal_level parameter is logical.
    show wal_level

    The wal_level parameter specifies the wal_log level. The expected return result is logical. If logical is not returned, the logical replication mechanism is not supported.

  5. Check whether the wal_sender process can be started.
    -- Query the value of the max_wal_senders parameter.
    show max_wal_senders;
    
    -- Query the number of pg_stat_replication.
    select count(*) from pg_stat_replication

    If the max_wal_senders parameter is not empty and the value of the max_wal_senders parameter is greater than the number of pg_stat_replication, the wal_sender process can be used. The PostgreSQL database starts the wal_sender process for the data synchronization program to send logs to subscribers.