Migrate data from a Google BigQuery dataset to an AnalyticDB for PostgreSQL instance - AnalyticDB

This topic describes how to migrate data from a Google BigQuery dataset to an AnalyticDB for PostgreSQL instance.

Preparations

BigQuery is activated. A BigQuery dataset is created, and data is inserted into the dataset.
Google Cloud Storage is activated. A Cloud Storage bucket is created.
An Identity and Access Management (IAM) user is created and has the permissions to access Cloud Storage.
An AnalyticDB for PostgreSQL instance is created. For more information, see Create an instance.
Alibaba Cloud Object Storage Service (OSS) is activated. For more information, see What is OSS?
An OSS bucket is created. For more information, see Create buckets.
Note We recommend that you create an OSS bucket within the same region as the AnalyticDB for PostgreSQL instance to facilitate subsequent data imports.

Step 1: Export data from the BigQuery dataset to the Cloud Storage bucket

To export data from BigQuery to Cloud Storage, you must use the bq CLI tool. For more information, see Using the bq command-line tool.

Use the bq CLI tool to query DDL statements for the BigQuery dataset, and then download the statements to your computer. For more information, see INFORMATION_SCHEMA.TABLES view.
In BigQuery, you cannot use SHOW CREATE TABLE statements to query DDL statements in a specified dataset. You must use a built-in user-defined function (UDF).
Use the bq CLI tool to run the bq extract command to export table data from the BigQuery dataset to the Cloud Storage bucket. For information about the export process, data formats, and compression types, see Exporting table data.
The following sample code provides an example of the export command:
```
bq extract
--destination_format AVRO
--compression SNAPPY
tpcds_100gb.web_site
gs://bucket_name/web_site/web_site-**.avro.snappy;
```
Check whether data is exported to the Cloud Storage bucket.

Step 2: Synchronize data from Cloud Storage to OSS

Use the Data Transport service to synchronize data from the Cloud Storage bucket to the OSS bucket. For more information, see Migrate data.

Step 3: Create a destination table

Create a destination table that is used to store BigQuery data in the AnalyticDB for PostgreSQL instance. The destination table must use the same schema as the source table. For information about the syntax, see CREATE TABLE .

For information about the data type and DDL mappings between BigQuery and AnalyticDB for PostgreSQL, see the "Syntax conversion" section of this topic.

Step 4: Import data from OSS to the AnalyticDB for PostgreSQL instance

You can use the COPY statement or an OSS foreign table to import data to AnalyticDB for PostgreSQL:

For information about how to use the COPY statement to import OSS data, see Use the COPY or UNLOAD statement to import or export data between OSS foreign tables and AnalyticDB for PostgreSQL tables.
For information about how to use an OSS foreign table to import OSS data, see Import OSS data to an AnalyticDB for PostgreSQL table.
Important We recommend that you use Avro files to migrate data. You cannot migrate data of the STRUCT or GEOGRAPHY type to AnalyticDB for PostgreSQL.

Syntax conversion

Data types


BigQuery data type	AnalyticDB for PostgreSQL data type
INT64	BIGINT
FLOAT64	FLOAT
NUMERIC	DECIMAL
BIGNUMERIC	DECIMAL
BOOL	BOOLEAN
BYTES (2-byte header)	BYTES (1-byte header)
STRING/STRING()	TEXT/VARCHAR()
DATE	DATE
DATETIME	TIMESTAMP
TIME	TIME
TIMESTAMP	TIMESTAMP
INTERVAL	INTERVAL
ARRAY	ARRAY[]
STRUCT	CREATE TYPE
JSON	JSON
GEOGRAPHY	CREATE TYPE/GEOGRAPHY

DDL statements

CREATE TABLE

PARTITION BY
BigQuery supports the following partitioning methods:
- Integer-range partitioning
  This method corresponds to the number-range partitioning method of AnalyticDB for PostgreSQL. You can specify a column of a numeric data type as the partition key. The following section provides an example on how to specify number-range partitions in AnalyticDB for PostgreSQL:
```
CREATE TABLE rank (id int, rank int, year int, gender
char(1), count int)
DISTRIBUTED BY (id)
PARTITION BY RANGE (year)
( START (2020) END (2023) EVERY (1),
  DEFAULT PARTITION extra );
```
  In the preceding statement, EVERY corresponds to INTERVAL that is used to specify integer-range partitions in BigQuery.
- Time-unit column partitioning
  This method corresponds to the date-range partitioning method of AnalyticDB for PostgreSQL. You can specify a column of the DATE or TIMESTAMP type as the partition key column.
  BigQuery uses DAY, MONTH, and YEAR to specify the partition granularity. AnalyticDB for PostgreSQL uses EVERY(INTERVAL) to specify the partition granularity. The following section provides an example on how to specify date-range partitions by day in AnalyticDB for PostgreSQL:
```
CREATE TABLE sales(id int, date date, amt decimal(10,2))
DISTRIBUTED BY (id)
PARTITION BY RANGE (date)
( START (date '2022-01-01') INCLUSIVE 
    END (date '2023-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 day') );
```
- Ingestion-time partitioning
  This method corresponds to the date-range partitioning method of AnalyticDB for PostgreSQL. When you convert ingestion-time partitioning of BigQuery to date-range partitioning of AnalyticDB for PostgreSQL, you must add a column named create_time to the destination table because AnalyticDB for PostgreSQL does not support pseudo-time columns.
CLUSTER BY
The CLUSTER BY clause corresponds to the DISTRIBUTED BY clause of AnalyticDB for PostgreSQL. Each BigQuery table supports up to four CLUSTER BY columns. AnalyticDB for PostgreSQL does not impose limits on the number of DISTRIBUTED BY columns.
DEFAULT COLLATE
The COLLATE clause specifies the rule based on which results of clauses such as ORDER BY and GROUP BY are sorted. DEFAULT COLLATE indicates the default rule, which is binary. In most cases, this clause can be omitted during data migration.

CREATE EXTERNAL TABLE

The connection information about foreign tables is different between BigQuery and AnalyticDB for PostgreSQL.

BigQuery: uses the WITH CONNECTION clause to configure credentials for accessing external data and allows you to specify a connection name in the PROJECT_ID.LOCATION.CONNECTION_ID format.
AnalyticDB for PostgreSQL: uses the LOCATION clause to configure credentials and connection information for accessing foreign tables.
AnalyticDB for PostgreSQL supports the file, gpfdist, and HTTP protocols for foreign tables. If you have stored data in OSS, you can use the oss_fdw extension to create an OSS foreign table and export the data from OSS to AnalyticDB for PostgreSQL.

CREATE PROCEDURE

You can replace the CREATE PROCEDURE statement with the CREATE FUNCTION statement in AnalyticDB for PostgreSQL.

Other SQL statements

MERGE

BigQuery allows you to combine INSERT, UPDATE, and DELETE operations into a MERGE statement. If the destination table contains tuples that match the source table, the MERGE statement updates or deletes the tuples. If the destination table contains tuples that do not match the source table, the MERGE statement inserts, updates, or deletes the tuples.

In AnalyticDB for PostgreSQL, you can use a single transaction to execute the MERGE statement. In the following sample statement, the names and quantity of new products are inserted into a table, and the quantity of the existing products in the table is updated:

BEGIN;
-- other operations
SAVEPOINT sp1;
INSERT INTO wines VALUES('hateau Lafite 2023', '24');
-- Assume the above fails because of a unique key violation,
-- so now we issue these commands:
ROLLBACK TO sp1;
UPDATE wines SET stock = stock + 24 WHERE winename = 'Chateau Lafite 2023';
-- continue with other operations, and eventually
COMMIT;

If you use the primary key to match the table rows, you can also execute the INSERT ON CONFLICT statement.

INSERT INTO wines VALUES('Chateau Lafite 2023', '24') ON CONFLICT (winename) DO UPDATE SET
stock = stock + 24;

SELECT

AnalyticDB for PostgreSQL does not support the SELECT * EXCEPT/REPLACE statement.
AnalyticDB for PostgreSQL does not support the QUALIFY clause for filtering window function results. If you want to filter window function results, you can use nested queries.