AnalyticDB for PostgreSQL 6.0 is built on Greenplum 6.0 and optimized by Alibaba Cloud. AnalyticDB for PostgreSQL V6.0 supports vector computing and, when multiple coordinator nodes are enabled, transaction processing. It provides the same API operations as Greenplum Community Edition, so both applications and data can be migrated from Greenplum to AnalyticDB for PostgreSQL. You can smoothly migrate your applications and choose from a variety of data migration solutions.

Process for migrating data from a self-managed Greenplum cluster to an AnalyticDB for PostgreSQL instance:

  1. Evaluate migration risks and specify a migration solution.
  2. Buy an AnalyticDB for PostgreSQL test instance and check whether the migration solution can be used to migrate applications and data.
  3. Buy an AnalyticDB for PostgreSQL production instance and migrate the production database from the self-managed Greenplum cluster to the AnalyticDB for PostgreSQL instance.
  4. Connect to the AnalyticDB for PostgreSQL production instance and check whether applications and data are migrated to the instance.

The data migration solution varies based on the version of the self-managed Greenplum cluster (4X, 5X, or 6X) and whether the AnalyticDB for PostgreSQL instance can connect to the self-managed Greenplum cluster. You can migrate data by performing the following steps:

  1. Migrate the DDL schema.
    • Databases: pg_dump cannot be used to export database definitions. You must manually create the databases in the AnalyticDB for PostgreSQL instance.
    • Extensions: You must check the difference between extensions supported by the self-managed Greenplum cluster and the AnalyticDB for PostgreSQL instance. For more information about the extensions supported by AnalyticDB for PostgreSQL, see Manage extensions.
    • Syntax compatibility: AnalyticDB for PostgreSQL V6.0 is built on PostgreSQL 9.4, whose syntax is not fully compatible with that of Greenplum 4X. You must manually modify the incompatible SQL statements.
  2. Migrate table data.
    • Full migration and incremental migration: We recommend that you select full migration or incremental migration based on your business requirements. For example, you can perform incremental migration if a table contains fields such as a timestamp or a flag that identify new data.
    • Partitioned tables and inheritance tables: We recommend that you migrate data at the child table level.
  3. Check data and business consistency.
    • Check whether data is consistent by examining fields generated from aggregate functions such as COUNT, SUM, MIN, and MAX.
    • Check whether views and stored procedures are consistent and whether business queries can be normally executed and return correct results.
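The aggregate-based data check above can be scripted so that the same query runs against both sides. A minimal sketch; the table and column names are placeholders, and the generated SQL is piped into psql against each cluster:

```shell
#!/bin/sh
# Emit an aggregate-check query for one table. Run the same query
# against both the Greenplum cluster and the AnalyticDB for PostgreSQL
# instance, then diff the results. The table and numeric column are
# illustrative placeholders.
check_sql() {
  table="$1"
  col="$2"
  printf "SELECT COUNT(*), SUM(%s), MIN(%s), MAX(%s) FROM %s;\n" \
    "$col" "$col" "$col" "$table"
}

check_sql public.t1 amount
```

For example, `check_sql public.t1 amount | psql -h <host> -p <port> -t` runs the check on one side; compare the two outputs.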

Solution when the AnalyticDB for PostgreSQL instance cannot connect to the self-managed Greenplum cluster

  1. Schema migration
    1. Use pg_dumpall or pg_dump to export the DDL schema from the master node of the self-managed Greenplum cluster. Bash commands are used in the following example:
      export PGHOST=<Host of the source cluster>
      export PGPORT=<Port number of the source cluster>
      export PGUSER=<Superuser of the source cluster>
      export PGPASSWORD=<Password of the source cluster>
      
      #-- Export all schemas for a Greenplum 4X cluster.
      pg_dumpall -s -q --gp-syntax  > full_schema.sql
      
      #-- Export all schemas for a Greenplum 5X or 6X cluster.
      pg_dumpall -s --gp-syntax  > full_schema.sql
    2. Use psql to log on to the self-managed Greenplum cluster and execute the following statement to check whether custom databases exist. If a custom database exists, you must manually create it in the AnalyticDB for PostgreSQL instance.
      select * from pg_catalog.pg_database;
    3. Use psql to log on to the AnalyticDB for PostgreSQL instance and execute the DDL schema. Bash commands are used in the following example:
      export PGHOST=<Host of the destination instance>
      export PGPORT=<Port number of the destination instance>
      export PGUSER=<Superuser of the destination instance>
      export PGPASSWORD=<Password of the destination instance>
      
      psql postgres -f full_schema.sql > psql.log 2>&1 &

      Check psql.log for error messages. If an error message exists, analyze the error cause and troubleshoot the problem. Most error messages are related to SQL syntax, especially for Greenplum 4X. For more information about incompatible items, see Compatibility comparisons between AnalyticDB for PostgreSQL V4.3 and V6.0 and Check for incompatibility between AnalyticDB for PostgreSQL V4.3 and V6.0.

  2. Table data migration
    1. Use a COPY TO statement or a gpfdist external table to export table data from the self-managed Greenplum cluster.
      • COPY TO

        Execute a COPY TO statement on the psql client to export table data. Example:

        -- 4X/5X/6X
        COPY public.t1 TO '/data/gpload/public_t1.csv' CSV ENCODING 'UTF8';
        
        -- COPY ... ON SEGMENT is supported in Greenplum 5X and 6X.
        -- Do not modify the <SEGID> placeholder because the system replaces it automatically. You can customize <SEG_DATA_DIR>.
        COPY public.t1 TO '<SEG_DATA_DIR>/public_t1_<SEGID>.csv' CSV ENCODING 'UTF8' 
        ON SEGMENT;

        For more information about the COPY statement, see Use the \COPY command to import or export on-premises data.

      • gpfdist external table

        To use a gpfdist external table to export data, you must first start the gpfdist service, which is located in the bin directory of the Greenplum installation. Bash commands are used in the following example:

        mkdir -p /data/gpload
        gpfdist -d /data/gpload -p 8000 & 

        Use psql to connect to the self-managed Greenplum cluster and export its table data. Example:

        -- public.t1 indicates the table to be migrated.
        CREATE WRITABLE EXTERNAL TABLE ext_w_t1 (LIKE public.t1)
        LOCATION ('gpfdist://<host>:8000/public_t1.csv') FORMAT 'CSV' ENCODING 'UTF8';
        
        -- Write the table data to the external table.
        INSERT INTO ext_w_t1 SELECT * FROM public.t1;
    2. Upload the exported data to Alibaba Cloud Object Storage Service (OSS). For more information about Alibaba Cloud OSS, see What is OSS?
      Note The OSS bucket must reside in the same region as the AnalyticDB for PostgreSQL instance.
    3. Use psql to connect to the AnalyticDB for PostgreSQL instance and create an OSS foreign table. Write data to the destination table by reading the OSS foreign table. For more information, see Use OSS foreign tables to access OSS data.
  3. Data check

    Check whether data and business are consistent between the AnalyticDB for PostgreSQL instance and the self-managed Greenplum cluster.

    Data check falls into data consistency check and business consistency check.

    • Data consistency check: You can check data consistency by comparing full data or sampled data. If you compare samples, examine the table row counts and the fields generated by aggregate functions such as SUM, AVG, MIN, and MAX.
    • Business consistency check: You must connect your applications to the AnalyticDB for PostgreSQL instance and check whether business queries can be normally executed and return correct results.
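The gpfdist export step in the table data migration above can be scripted per table. A minimal sketch that only prints the export SQL for review before you pipe it into psql against the source cluster; `<host>` remains a placeholder for the gpfdist host:

```shell
#!/bin/sh
# Print the writable-external-table export statements for one
# schema-qualified table. Review the output, then pipe it into psql
# against the source Greenplum cluster. <host> stays a placeholder.
export_sql() {
  table="$1"                                   # e.g. public.t1
  file="$(echo "$table" | tr '.' '_').csv"     # e.g. public_t1.csv
  name="$(echo "$table" | cut -d. -f2)"        # e.g. t1
  printf "CREATE WRITABLE EXTERNAL TABLE ext_w_%s (LIKE %s)\n" "$name" "$table"
  printf "LOCATION ('gpfdist://<host>:8000/%s') FORMAT 'CSV' ENCODING 'UTF8';\n" "$file"
  printf "INSERT INTO ext_w_%s SELECT * FROM %s;\n" "$name" "$table"
}

export_sql public.t1
```

For example, `export_sql public.t1 | psql -h <source host> -p <source port>` exports one table; loop over a table list to export them all.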

Solution when the AnalyticDB for PostgreSQL instance can connect to the self-managed Greenplum cluster

Use gptransfer to migrate the DDL schema and data.

gptransfer is a data migration tool provided by Greenplum 4X and 5X. Because the syntax of Greenplum 4X differs from that of AnalyticDB for PostgreSQL, we recommend that you upgrade the self-managed Greenplum cluster from 4X to 5X before you use gptransfer.

You can obtain gptransfer from the /gpMgmt/bin directory in the 5X_STABLE pipeline of the Greenplum source code.

Note When full data is being migrated, we recommend that you do not write data to the self-managed Greenplum cluster.

gptransfer command

Before you can run the gptransfer command, you must comment out the following code in the gptransfer script:

cmd = MakeDirectory('Create work directory',
                    self._work_dir, REMOTE, self._options.dest_host)
self._pool.addCommand(cmd)
cmd = RemoveDirectory('Remove work directory',
                      self._work_dir, REMOTE, self._options.dest_host)
self._pool.addCommand(cmd)

Run the gptransfer command on the master node of the self-managed Greenplum cluster. Full data migration is used in the following example:

# Password of the current user on the Linux system that runs the source cluster
export SSHPASS=......

python gptransfer -a --validate count --batch-size 8 --max-line-length 104857600 \
--source-host <Host of the source cluster> --source-port <Port number of the source cluster> --source-user <Current user in the Linux system that runs the source cluster> \
--dest-host <AnalyticDB for PostgreSQL database endpoint> --dest-port <AnalyticDB for PostgreSQL database port number> --dest-user <AnalyticDB for PostgreSQL database user> \
--source-map-file <Segment IP configuration file of the source cluster> --work-base-dir <Work directory> \
--full --truncate --analyze --no-final-count > gptransfer.log &

The following list describes the command parameters:

  • --batch-size: the number of tables that can be migrated in parallel. Default value: 2. Maximum value: 10.
  • --source-map-file: the file that contains the segment hosts of the source cluster. Each line describes one segment host in the hostname,ip_address format. If the hostname is the IP address itself, the file content is in the following format:
    seg1_ip,seg1_ip
    seg2_ip,seg2_ip
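For illustration, the source-map-file can be generated from a plain list of segment IP addresses. A sketch that assumes a hypothetical seg_hosts.txt with one address per line and repeats each address as both hostname and IP, matching the format above:

```shell
#!/bin/sh
# Build the gptransfer --source-map-file from a plain list of segment
# IP addresses (one per line). Each output row repeats the address as
# both hostname and IP. The input file and addresses are examples.
printf '192.168.0.11\n192.168.0.12\n' > seg_hosts.txt   # sample input

awk '{print $1","$1}' seg_hosts.txt > source_map.txt
cat source_map.txt
```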

You can also use gptransfer to migrate data from a single database or table. For more information, see the output of the python gptransfer --help command.

gptransfer precautions

  • gptransfer does not support incremental migration.
  • The extensions used by the self-managed Greenplum cluster must be contained in the extension list supported by the AnalyticDB for PostgreSQL instance.
  • If the self-managed Greenplum cluster uses a large number of indexes, we recommend that you drop the indexes, migrate the data, and then recreate the indexes to accelerate data migration.
  • Ports 8000 to 9999 must be enabled for all nodes of the self-managed Greenplum cluster.
  • The AnalyticDB for PostgreSQL instance must be able to access the gpfdist service of the self-managed Greenplum cluster. To verify access, you can create a readable external table in the AnalyticDB for PostgreSQL instance and check the query results of SELECT statements.
  • If the gpfdist service can be accessed only over the Internet and the Internet bandwidth permits, you must modify the logic of gptransfer.
  • The IP address of the master node of the self-managed Greenplum cluster must be added to an IP address whitelist of the AnalyticDB for PostgreSQL instance, so that the Greenplum cluster can connect to the AnalyticDB for PostgreSQL instance by using psql without a password.
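For the index precaution above, the index definitions can be saved and the DROP statements generated from the pg_indexes catalog view. A sketch that only prints the catalog queries to run through psql -t; 'public' is a placeholder schema name:

```shell
#!/bin/sh
# Emit the two catalog queries used around migration: one to save the
# CREATE INDEX definitions (run and store the output before dropping),
# one to generate the DROP INDEX statements. Run each through
# psql -t against the source cluster; the schema name is a placeholder.
save_defs_sql() {
  printf "SELECT indexdef || ';' FROM pg_indexes WHERE schemaname = '%s';\n" "$1"
}
drop_sql() {
  printf "SELECT 'DROP INDEX ' || schemaname || '.' || indexname || ';' FROM pg_indexes WHERE schemaname = '%s';\n" "$1"
}

save_defs_sql public
drop_sql public
```

After migration, replay the saved definitions to recreate the indexes.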

If the gpfdist service can be accessed only over the Internet, modify the logic of gptransfer as follows:
  • Modify the initial value of the urls parameter by changing the internal endpoint to the public endpoint.
  • Check gptransfer.log for error messages. If an error message exists, analyze the error cause and troubleshoot the problem.

Solutions for other migration methods

You can also use Alibaba Cloud DataWorks to migrate data.