×
Community Blog Data Migration from MaxCompute to HybridDB for PostgreSQL

Data Migration from MaxCompute to HybridDB for PostgreSQL

MaxCompute and HybridDB use a cluster architecture comprising of multiple data nodes. Users often struggle to migrate data from MaxCompute to HybridDB.

Introduction

Both MaxCompute and HybridDB use a cluster architecture comprising of multiple data nodes. You need to ensure that the data nodes are actively pushing data if you want to efficiently migrate data from MaxCompute to HybridDB with high throughput using such an architecture. You can make this possible using MaxCompute and HybridDB's ability to read and write data from and to Alibaba Cloud Object Storage Service (OSS) with this method. And there you have the solution.

It is also imperative to ensure that you have a common data format to exchange data on OSS. After a comprehensive investigation into this matter, I have discovered that MaxCompute supports writing data in text format (TEXT/CSV), and HybridDB supports reading data in text format.

Based on my personal experience, I will share some best practices to ensure a successful data migration from MaxCompute to HybridDB based on my research.

Steps to Migrate Data from MaxCompute to HybridDB

Step 1: Create an External MaxCompute OSS Table

Begin by creating an external table with the same structure as that of the MaxCompute data table. The table will serve to open the data channel between MaxCompute and OSS. Refer to the screenshot below to create the table.

 CREATE external TABLE `demo_oss_ext` (
    id string COMMENT 'id',
    data1 string COMMENT 'data1',
    data2 string COMMENT 'data2'
  ) 
partitioned by (ds string)
STORED BY 'com.alibabacloud.maxcompute.TextStorageHandler'
WITH SERDEPROPERTIES ('maxcompute.text.option.delimiter'='\t')
LOCATION 'oss://id:key@endpoint/bucketname/oss_dir/';

Key Parameters:

1.The parameter com.alibabacloud.maxcompute.TextStorageHandler defines the data format that you can use to store data in OSS.

•Text Storage Handler is developed in JAVA and is the default option for data delivery.
•The default Text Storage Handler does not support the complete TEXT/CSV protocol. If you want it to support the complete TEXT/CSV protocol, you can use open-source JAVA CSV.

2.Text Storage Handler supports two custom parameters:

•maxcompute.text.option.delimiter specifies the column delimiters.
•maxcompute.text.option.use.quote defines the quote characters.
•The default value of the NULL column is N, and you cannot change it.
•Text Storage Handler does not allow escaping special characters. You can escape special characters only through a custom handler.

3.LOCATION specifies the specific account and location on OSS to which you deliver the data. It includes ID, key, endpoint, bucket, and a specific location.

Step 2: Migrate Data from the External Table to OSS

Use the following SQL statement to migrate data from MaxCompute to OSS.

insert into demo_oss_ext select * from t_data;

Note:

•This operation is carried out in a parallel manner, where the default size of each concurrent unit is 256MB. You can also set a smaller value through set maxcompute.sql.mapper.split.size=xxx; to increase the concurrency.
•OSS data flow control affects the data transfer from MaxCompute to OSS. Technically, OSS network bandwidth for a single concurrency is 100MB/s.
•If you want to further increase the bandwidth, you would need to contact the OSS administrator to release the restrictions.

Step 3: Create a HybridDB External Table

HybridDB external table: oss_ext

The screenshot below will help you create a HybridDB external table.

CREATE READABLE EXTERNAL TABLE user_data_oss_ext (
    id int64,
    data1 text,
    data2 text
)
location('oss://endpoint 
    dir=data_oss_dir
    id=ossid 
    key=osskey 
    bucket=bucketname') 
FORMAT 'TEXT' (DELIMITER '\t' )
LOG ERRORS INTO error_track_table SEGMENT REJECT LIMIT 10;

Key Parameters:

1.Location specifies all OSS related parameters.
2.File format needs to match the format of the MaxCompute external table - FORMAT 'TEXT' (DELIMITER 't')
3.Set to skip wrong lines

oDuring the migration of heterogeneous data, you may encounter data that is not able to pass the verification. This data may exist in the form of special characters or invalid code.
oLOG ERRORS INTO error_track_table will write data that throws an error into a table.
oSEGMENT REJECT LIMIT X allows you to set the threshold number of errors permitted in a single segment, or the allowed percentage of error.

4.HybridDB also conducts import in parallel, the number of parallel executions is equal to the number of computing nodes.
5.Importing text/csv data in gzip format may improve the performance by more than 100%, provided that MaxCompute supports exporting compressed data.

Step 4: Execute Column Compression for HybridDB Local Tables

The next step involves compressing the column for HybridDB local tables. You can do this by referring to the screenshot below.

CREATE TABLE Tao(
    id int64,
    data1 text,
    data2 text
)
with (
APPENDONLY=true, COMPRESSTYPE=zlib,
 ,BLOCKSIZE=2097152,
ORIENTATION=COLUMN,CHECKSUM=true,
OIDS=false) 
DISTRIBUTED BY (id);

Key Parameters:

1.If you do not need to modify the imported data to HybridDB on a large scale, I would suggest you to use apply only to organize data by column, and then compress it.

•Use the following parameter settings: APPENDONLY=true COMPRESSTYPE=zlib COMPRESSLEVEL=5 ORIENTATION=COLUMN BLOCKSIZE=2097152
•HybridDB supports column compression, which offers a much higher compression ratio than row compression. By setting COMPRESSLEVEL=5, the compression ratio can easily reach 20% of the original side.

2.Use DISTRIBUTED BY (column) to distribute data evenly to each of the HybridDB computing nodes. Even distribution of data is the key to selecting a distribution column.

Step 5: Import Data from OSS to HybridDB

Use the following SQL statement to import data from OSS to HybridDB:

insert into t_ao select * from user_data_oss_ext;

Both HybridDB and PostgreSQL support reading and writing data from and to OSS

Similar to S3 of AWS, OSS is a low-price storage service that opens the communication channels between all cloud products. It is also Alibaba Cloud's recommended cloud data channel.

Both PostgreSQL and HybridDB on the cloud currently support reading and writing the OSS data source.

•PostgreSQL + OSS read and write external data source oss_fdw.
•HybridDB for PostgreSQL + OSS import and export data in parallel oss_ext.

Conclusion

I hope that the steps mentioned in this article will help you successfully migrate data from MaxCompute to HybridDB. To know more about data migration, click here.

References (articles in Chinese language):

1.PostgreSQL + OSS oss_fdw
2.HybridDB for PostgreSQL + OSS oss_ext
3.SLS supports delivery of CSV data to OSS
4.Formatting open-source JAVA data
5.Export data from -MaxCompute to OSS
6.How to access OSS from MaxCompute

1 1 1
Share on

Alibaba Clouder

2,605 posts | 747 followers

You may also like

Comments

Raja_KT March 6, 2019 at 3:39 am

Good one. MPP based solution performance is characterized by key(s) .