Both MaxCompute and HybridDB use a cluster architecture comprising of multiple data nodes. You need to ensure that the data nodes are actively pushing data if you want to efficiently migrate data from MaxCompute to HybridDB with high throughput using such an architecture. You can make this possible using MaxCompute and HybridDB's ability to read and write data from and to Alibaba Cloud Object Storage Service (OSS) with this method. And there you have the solution.
It is also imperative to ensure that you have a common data format to exchange data on OSS. After a comprehensive investigation into this matter, I have discovered that MaxCompute supports writing data in text format (TEXT/CSV), and HybridDB supports reading data in text format.
Based on my personal experience, I will share some best practices to ensure a successful data migration from MaxCompute to HybridDB based on my research.
Begin by creating an external table with the same structure as that of the MaxCompute data table. The table will serve to open the data channel between MaxCompute and OSS. Refer to the screenshot below to create the table.
CREATE external TABLE `demo_oss_ext` ( id string COMMENT 'id', data1 string COMMENT 'data1', data2 string COMMENT 'data2' ) partitioned by (ds string) STORED BY 'com.alibabacloud.maxcompute.TextStorageHandler' WITH SERDEPROPERTIES ('maxcompute.text.option.delimiter'='\t') LOCATION 'oss://id:key@endpoint/bucketname/oss_dir/';
1.The parameter com.alibabacloud.maxcompute.TextStorageHandler defines the data format that you can use to store data in OSS.
•Text Storage Handler is developed in JAVA and is the default option for data delivery.
•The default Text Storage Handler does not support the complete TEXT/CSV protocol. If you want it to support the complete TEXT/CSV protocol, you can use open-source JAVA CSV.
2.Text Storage Handler supports two custom parameters:
•maxcompute.text.option.delimiter specifies the column delimiters.
•maxcompute.text.option.use.quote defines the quote characters.
•The default value of the NULL column is N, and you cannot change it.
•Text Storage Handler does not allow escaping special characters. You can escape special characters only through a custom handler.
3.LOCATION specifies the specific account and location on OSS to which you deliver the data. It includes ID, key, endpoint, bucket, and a specific location.
Use the following SQL statement to migrate data from MaxCompute to OSS.
insert into demo_oss_ext select * from t_data;
•This operation is carried out in a parallel manner, where the default size of each concurrent unit is 256MB. You can also set a smaller value through set maxcompute.sql.mapper.split.size=xxx; to increase the concurrency.
•OSS data flow control affects the data transfer from MaxCompute to OSS. Technically, OSS network bandwidth for a single concurrency is 100MB/s.
•If you want to further increase the bandwidth, you would need to contact the OSS administrator to release the restrictions.
HybridDB external table: oss_ext
The screenshot below will help you create a HybridDB external table.
CREATE READABLE EXTERNAL TABLE user_data_oss_ext ( id int64, data1 text, data2 text ) location('oss://endpoint dir=data_oss_dir id=ossid key=osskey bucket=bucketname') FORMAT 'TEXT' (DELIMITER '\t' ) LOG ERRORS INTO error_track_table SEGMENT REJECT LIMIT 10;
1.Location specifies all OSS related parameters.
2.File format needs to match the format of the MaxCompute external table - FORMAT 'TEXT' (DELIMITER 't')
3.Set to skip wrong lines
oDuring the migration of heterogeneous data, you may encounter data that is not able to pass the verification. This data may exist in the form of special characters or invalid code.
oLOG ERRORS INTO error_track_table will write data that throws an error into a table.
oSEGMENT REJECT LIMIT X allows you to set the threshold number of errors permitted in a single segment, or the allowed percentage of error.
4.HybridDB also conducts import in parallel, the number of parallel executions is equal to the number of computing nodes.
5.Importing text/csv data in gzip format may improve the performance by more than 100%, provided that MaxCompute supports exporting compressed data.
The next step involves compressing the column for HybridDB local tables. You can do this by referring to the screenshot below.
CREATE TABLE Tao( id int64, data1 text, data2 text ) with ( APPENDONLY=true, COMPRESSTYPE=zlib, ,BLOCKSIZE=2097152, ORIENTATION=COLUMN,CHECKSUM=true, OIDS=false) DISTRIBUTED BY (id);
1.If you do not need to modify the imported data to HybridDB on a large scale, I would suggest you to use apply only to organize data by column, and then compress it.
•Use the following parameter settings: APPENDONLY=true COMPRESSTYPE=zlib COMPRESSLEVEL=5 ORIENTATION=COLUMN BLOCKSIZE=2097152
•HybridDB supports column compression, which offers a much higher compression ratio than row compression. By setting COMPRESSLEVEL=5, the compression ratio can easily reach 20% of the original side.
2.Use DISTRIBUTED BY (column) to distribute data evenly to each of the HybridDB computing nodes. Even distribution of data is the key to selecting a distribution column.
Use the following SQL statement to import data from OSS to HybridDB:
insert into t_ao select * from user_data_oss_ext;
Both HybridDB and PostgreSQL support reading and writing data from and to OSS
Similar to S3 of AWS, OSS is a low-price storage service that opens the communication channels between all cloud products. It is also Alibaba Cloud's recommended cloud data channel.
Both PostgreSQL and HybridDB on the cloud currently support reading and writing the OSS data source.
I hope that the steps mentioned in this article will help you successfully migrate data from MaxCompute to HybridDB. To know more about data migration, click here.
1.PostgreSQL + OSS oss_fdw
2.HybridDB for PostgreSQL + OSS oss_ext
3.SLS supports delivery of CSV data to OSS
4.Formatting open-source JAVA data
5.Export data from -MaxCompute to OSS
6.How to access OSS from MaxCompute
Alibaba Clouder - January 11, 2018
Alibaba Clouder - December 13, 2017
Michael Peng - September 24, 2019
Alibaba Clouder - February 27, 2018
Alibaba Clouder - September 3, 2019
afzaalvirgoboy - February 25, 2020
Secure and easy solutions for moving you workloads to the cloudLearn More
Fully managed and less trouble database servicesLearn More
Alibaba Cloud offers Independent Software Vendors (ISVs) the optimal cloud migration solutions to ready your cloud business with the shortest path.Learn More
An online MPP warehousing service based on the Greenplum Database open source programLearn More
More Posts by Alibaba Clouder