AnalyticDB for PostgreSQL: Migrate data from a self-managed Greenplum cluster to an AnalyticDB for PostgreSQL instance

Last Updated: Jun 25, 2023

AnalyticDB for PostgreSQL is optimized by Alibaba Cloud based on Greenplum and supports vector computing and transaction processing in a multi-coordinator architecture. It provides the same API operations as Greenplum Community Edition. This topic describes how to migrate data from a self-managed Greenplum cluster to an AnalyticDB for PostgreSQL instance.

Migration process

  1. Evaluate migration risks and select a migration solution.

  2. Purchase an AnalyticDB for PostgreSQL test instance and check whether the migration solution can be used to migrate your application and data.

  3. Purchase an AnalyticDB for PostgreSQL production instance and migrate the production library from the self-managed Greenplum cluster to the AnalyticDB for PostgreSQL instance.

  4. Connect your application to the AnalyticDB for PostgreSQL production instance and check whether your application and data are migrated to the instance.

Usage notes

  • The methods in this topic migrate data but do not synchronize it. If new data is written during the migration, you must backfill that data after the migration is complete.

  • We recommend that you migrate data when write operations are paused. Otherwise, your service performance is compromised and the data migration slows down.

  • After the data migration, the AnalyticDB for PostgreSQL instance executes the ANALYZE statement to collect statistics and generate optimal execution plans. While the ANALYZE statement runs, the CPU workload on the coordinator node temporarily increases in proportion to the data volume. You can also collect statistics manually, as shown in the example after this list.

  • We recommend that you create an AnalyticDB for PostgreSQL instance whose number of compute nodes is greater than or equal to the number of nodes in the self-managed Greenplum cluster. If you use gpcopy to migrate data, the same number of nodes delivers the optimal migration speed.

  • Check the differences between extensions that are supported by the self-managed Greenplum cluster and the AnalyticDB for PostgreSQL instance. For information about the extensions supported by AnalyticDB for PostgreSQL, see Manage extensions.

  • The syntax of AnalyticDB for PostgreSQL is partially incompatible with that of Greenplum 4X. You must modify the affected statements to adapt to the supported syntax.

  • If the self-managed Greenplum cluster contains partitioned tables or inheritance tables, we recommend that you migrate data at the child-table level.
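
If you need to refresh statistics manually, for example after you backfill data, you can run the ANALYZE statement yourself. The following is a minimal example. The table name public.t1 is a hypothetical placeholder:

    -- Collect statistics for a single migrated table.
    ANALYZE public.t1;
    
    -- Or collect statistics for the entire current database.
    ANALYZE;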

Select a migration method

The following list describes the scenario for each migration method:

  • Use gpcopy to migrate data: The self-managed Greenplum cluster can connect to the AnalyticDB for PostgreSQL instance.

    Important

    The gpcopy utility cannot be used to migrate data to AnalyticDB for PostgreSQL instances in elastic storage mode (V7.0) or Serverless mode.

  • Use gptransfer to migrate data: The AnalyticDB for PostgreSQL instance can connect to the self-managed Greenplum cluster.

  • Use an OSS foreign table to migrate data: The self-managed Greenplum cluster and the AnalyticDB for PostgreSQL instance cannot connect to each other.

Use gpcopy to migrate data

  1. Submit a ticket to contact Alibaba Cloud technical support and provide the following information: the ID of the AnalyticDB for PostgreSQL instance, the IP address and port number of the self-managed Greenplum cluster, an account of the Greenplum cluster that has superuser permissions to migrate data, and the IP address of the server where the Greenplum cluster resides.

  2. Add 0.0.0.0/0 to an IP address whitelist of the AnalyticDB for PostgreSQL instance to allow external access.

    Warning

    0.0.0.0/0 specifies that all IP addresses can access the AnalyticDB for PostgreSQL instance, which leads to high security risks. After data migration, modify the IP address whitelist in a timely manner.

  3. Download gpcopy on the server where the self-managed Greenplum cluster resides. Contact Alibaba Cloud technical support to download and install gpcopy on the AnalyticDB for PostgreSQL instance.

    Note

    In this topic, gpcopy 2.3.2 is used. We recommend that you use gpcopy 2.3.x to migrate data. If you use gpcopy 2.4.x to migrate data, SSL encryption must be enabled. For more information, see Configure SSL encryption.

  4. In the directory where the gpcopy package is located, run the following code to install gpcopy:

    #!/bin/bash
    
    # Extract the gpcopy package and install the binaries on the coordinator host.
    tar xzvf gpcopy-2.3.2.tar.gz
    
    cd gpcopy-2.3.2/
    cp gpcopy $GPHOME/bin
    cp gpcopy_helper $GPHOME/bin
    chmod 755 $GPHOME/bin/gpcopy
    chmod 755 $GPHOME/bin/gpcopy_helper
    
    # Query the host names of all primary segments.
    host_sql=`psql -A -t -c "select address from gp_segment_configuration where role = 'p' and content >= 0 group by address"`
    host=(${host_sql})
    host_len=${#host[@]}
    
    # Build the "-h <host>" argument list that gpscp and gpssh expect.
    host_str=""
    for ((i=0; i<$host_len; ++i)); do
        host_str=$host_str" -h "${host[$i]}
    done
    
    # Distribute gpcopy_helper to every segment host and make it executable.
    gpscp $host_str gpcopy_helper =:$GPHOME/bin
    gpssh $host_str -e "chmod 755 $GPHOME/bin/gpcopy_helper"
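
    To verify the installation, you can print the gpcopy version as a quick sanity check:

    gpcopy --version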
  5. Run the following command to migrate data:

    gpcopy --full --source-host <SOURCEHOST> --source-port <SOURCEPORT> --source-user <SOURCEUSER> --dest-host <DESTHOST> --dest-port <DESTPORT> --dest-user <DESTUSER> --jobs 4 --drop --validate count

    The following list describes the parameters:

    • --full: Migrates all data in the self-managed Greenplum cluster. To migrate only a single database of the self-managed Greenplum cluster, replace this parameter with -d <database name>.

    • --jobs: The number of tables that are migrated in parallel. Default value: 4. You can adjust this value based on your business requirements. A value of n produces 2 × n + 1 database connections.

    • <SOURCEHOST>: The IP address of the server where the self-managed Greenplum cluster resides. If you run the migration command on that server, you can specify localhost instead.

    • <SOURCEPORT>: The port number of the self-managed Greenplum cluster. Default value: 5432. You can run the gpstate -s command on the server where the self-managed Greenplum cluster resides to query the actual port number.

    • <SOURCEUSER>: An account of the self-managed Greenplum cluster that has superuser permissions.

    • <DESTHOST>: The IP address of the coordinator node of the AnalyticDB for PostgreSQL instance. To obtain the IP address, contact Alibaba Cloud technical support.

    • <DESTPORT>: The port number of the coordinator node of the AnalyticDB for PostgreSQL instance. To obtain the port number, contact Alibaba Cloud technical support.

    • <DESTUSER>: An account of the AnalyticDB for PostgreSQL instance that has superuser permissions. To obtain the account, contact Alibaba Cloud technical support.
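
    For reference, the following invocation shows the command with hypothetical placeholder values filled in. The IP addresses, ports, and account names below are examples only and must be replaced with your own values:

    gpcopy --full --source-host 192.168.0.10 --source-port 5432 --source-user gpadmin --dest-host 10.0.0.20 --dest-port 5432 --dest-user adbpg_admin --jobs 4 --drop --validate count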

  6. After data is migrated, modify the IP address whitelist of the AnalyticDB for PostgreSQL instance from 0.0.0.0/0 to the IP address that is required for your business. Contact Alibaba Cloud technical support to restore the configurations of the superuser account, security groups, and pg_hba password-free logon.

Use gptransfer to migrate data

  1. Create a privileged account for the AnalyticDB for PostgreSQL instance. For more information, see the "Create a privileged account" section of the Create a database account topic.

  2. Add 0.0.0.0/0 to an IP address whitelist of the AnalyticDB for PostgreSQL instance to allow external access.

    Warning

    0.0.0.0/0 specifies that all IP addresses can access the AnalyticDB for PostgreSQL instance, which leads to high security risks. After data migration, modify the IP address whitelist in a timely manner.

  3. Update the AnalyticDB for PostgreSQL instance to the latest minor version. For information about how to view and update the minor version, see View the minor engine version and Update the minor engine version.

  4. Submit a ticket to contact Alibaba Cloud technical support and provide the following information: the ID of the AnalyticDB for PostgreSQL instance, the IP address and port number of the self-managed Greenplum cluster, an account of the Greenplum cluster that has permissions to migrate data, and the IP address of the server where the Greenplum cluster resides.

  5. Wait until the data migration is complete.

  6. Modify the IP address whitelist of the AnalyticDB for PostgreSQL instance from 0.0.0.0/0 to the IP address that is required for your business and check whether your data is migrated to the instance. After you make sure that your data is migrated, connect your application to the AnalyticDB for PostgreSQL instance.

Use an OSS foreign table to migrate data

Schema migration

  1. On the master node of the self-managed Greenplum cluster, use pg_dumpall or pg_dump to export the DDL schema. Bash commands are used in the following example:

    export PGHOST=<IP address of the self-managed Greenplum cluster>
    export PGPORT=<Port number of the self-managed Greenplum cluster>
    export PGUSER=<Superuser account of the self-managed Greenplum cluster>
    export PGPASSWORD=<Password of the superuser account>
    
    #-- Export the full schema for a Greenplum 4X cluster. 
    pg_dumpall -s -q --gp-syntax  > full_schema.sql
    
    #-- Export the full schema for a Greenplum 5X or 6X cluster. 
    pg_dumpall -s --gp-syntax  > full_schema.sql
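
    As an optional sanity check, you can confirm that the dump file was created and contains DDL statements before you proceed:

    ls -lh full_schema.sql
    grep -c "CREATE TABLE" full_schema.sql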
  2. Use psql to log on to the self-managed Greenplum cluster. Execute the following statement to check whether custom libraries exist. If a custom library exists, you must manually create a schema in the AnalyticDB for PostgreSQL instance.

    SELECT * FROM pg_catalog.pg_library;
  3. Log on to the AnalyticDB for PostgreSQL instance and execute the DDL schema. Bash commands are used in the following example:

    export PGHOST=<IP address of the AnalyticDB for PostgreSQL instance>
    export PGPORT=<Port number of the AnalyticDB for PostgreSQL instance>
    export PGUSER=<Privileged account of the AnalyticDB for PostgreSQL instance>
    export PGPASSWORD=<Password of the privileged account>
    
    psql postgres -f full_schema.sql > psql.log 2>&1 &

    While the commands run, check psql.log for error messages. If an error occurs, analyze its cause and fix the problem. Most errors are related to SQL syntax, especially for Greenplum 4X. For information about incompatible items, see Compatibility comparisons between AnalyticDB for PostgreSQL V4.3 and V6.0 and Check for incompatibility between AnalyticDB for PostgreSQL V4.3 and V6.0.
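
    One simple way to scan the log for failures is to filter it with grep. The following sketch lists the first 50 matching lines:

    grep -in "error" psql.log | head -n 50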

Table data migration

  1. Use the COPY TO statement or a gpfdist external table to export table data from the self-managed Greenplum cluster.

    • COPY TO

      Execute the COPY TO statement on the psql client to export table data. Sample statements:

      • Export table data from Greenplum 4X, 5X, or 6X.

        COPY public.t1 TO '/data/gpload/public_t1.csv' WITH CSV;
      • Export table data from Greenplum 5X or 6X by using ON SEGMENT.

        COPY public.t1 TO '<SEG_DATA_DIR>/public_t1_<SEGID>.csv' CSV ON SEGMENT;
        Note

        Do not modify the <SEGID> string. The system automatically replaces it with the corresponding segment ID during the export. You can leave the <SEG_DATA_DIR> string unchanged or replace it with a custom absolute path.

    • gpfdist external table

      1. Start the gpfdist service, which is located in the bin directory of the Greenplum installation. Bash commands are used in the following example:

        mkdir -p /data/gpload
        gpfdist -d /data/gpload -p 8000 & 
      2. Use psql to connect to the self-managed Greenplum cluster and export table data from the cluster. Sample statements:

        -- public.t1 specifies the table to be migrated.
        CREATE WRITABLE EXTERNAL TABLE ext_w_t1 (LIKE public.t1)
        LOCATION ('gpfdist://<IP address of the self-managed Greenplum cluster>:8000/public_t1.csv') FORMAT 'CSV' ENCODING 'UTF8';
        
        -- Write the table data to an external table.
        INSERT INTO ext_w_t1 SELECT * FROM public.t1;
  2. Upload the exported data to an Object Storage Service (OSS) bucket. For more information, see Upload objects.

    Note

    The OSS bucket must reside in the same region as the AnalyticDB for PostgreSQL instance.

  3. Use an OSS foreign table to import data from the OSS bucket to the AnalyticDB for PostgreSQL instance. For more information, see the "Import OSS data to an AnalyticDB for PostgreSQL table" section of the Use OSS foreign tables for data lake analysis topic.
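
    For orientation only, the following sketch shows the general shape of an OSS-based import on the AnalyticDB for PostgreSQL side. The exact syntax and option names depend on your instance version and on whether you use the oss_ext extension or OSS foreign tables, so follow the linked topic for the authoritative syntax. All endpoint, bucket, and credential values below are placeholders:

    -- Sketch only: create a readable OSS external table that matches the target table.
    -- Option names and the LOCATION format vary by version; see the linked topic.
    CREATE READABLE EXTERNAL TABLE ext_r_t1 (LIKE public.t1)
    LOCATION ('oss://<oss-endpoint> dir=gpload/ bucket=<your-bucket> accessid=<your-AccessKey-ID> accesskey=<your-AccessKey-secret>')
    FORMAT 'CSV' ENCODING 'UTF8';
    
    -- Load the data from OSS into the local table.
    INSERT INTO public.t1 SELECT * FROM ext_r_t1;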

Data check

Check whether your data and business are consistent between the AnalyticDB for PostgreSQL instance and the self-managed Greenplum cluster. Data checks fall into two categories: data consistency checks and business consistency checks.

  • Data consistency check: You can check data consistency by comparing full or sampled data. If you use sample comparison, check whether data is consistent by examining the table row counts and the aggregate results of numeric fields, as shown in the sketch after this list.

  • Business consistency check: Connect your application to the AnalyticDB for PostgreSQL instance and check whether your business queries can be normally executed and return expected responses.
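
The following is a minimal sketch of a sample comparison. Run the same statements on the self-managed Greenplum cluster and on the AnalyticDB for PostgreSQL instance, and compare the results. The table name public.t1 and the column name amount are hypothetical placeholders:

    -- Compare row counts between the source and the destination.
    SELECT count(*) FROM public.t1;
    
    -- Compare aggregates over a numeric column.
    SELECT sum(amount), min(amount), max(amount) FROM public.t1;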

Other migration methods

You can also use Alibaba Cloud DataWorks to migrate data. For more information, see Use Data Integration to migrate and batch synchronize data.