A Guide to Migration from Redshift to MaxCompute

By Long Qingyun

This article explains how Amazon Redshift can migrate to Alibaba Cloud MaxCompute, including the aspects of syntax comparison and data migration.

1. Overview

This article is a detailed introduction to the similarities and differences of SQL syntax between Redshift and MaxCompute. It is helpful to accelerate the migration from SQL tasks to MaxCompute. There are many syntax differences between Redshift and MaxCompute. Therefore, we need to modify the script written on Redshift before we use it in MaxCompute.

2. Comparisons between Redshift and MaxCompute before Migration

2.1 Data Type and Type Conversion

For more details about MaxCompoute data types, please see this link

2.2 Syntax Comparison

MaxCompute does not have the concepts of schema, group, library, and stored procedure. There are only projects, tables, and partitions. MaxCompute does not have auto-incremental sequence and foreign keys when creating tables. It does not support choosing the encoding method, and its default encoding method is UTF-8. The internal tables do not support choosing the storage format, with Aliorc as the default storage format.

2.3 Compound Expressions

2.4 Conditions

2.5 DDL Syntax

Main Differences:

MaxCompute does not support primary key auto-increment and PRIMARY KEY.
Specified default value default does not support functions.
The default value specified by decimal does not support -1.

2.6 DML Syntax Differences

2.7 Built-In Functions

Other unlisted functions are not supported in Redshift.

7_1
7_2
7_3
7_4
7_5
7_6

For more details about MaxCompoute built-In functions, please see this link

2.8 Features of MaxCompute

3. An Introduction to Migration Tools for Redshift to MaxCompute

Choose one syntax verification tool:

4. The General Migration Solution

Database migration mainly includes the following steps:

Migration Implementation Plan:

5. Detailed Migration Solution

5.1. Analysis of Current Situation and Demand

5.2. Design of Migration Plan

Users can make a reasonable migration plan based on their existing Redshift data volume, QPS, TPS, other performance indicators, high availability requirements, and future business growth requirements.

5.3. Resource Planning

Users need to prepare the MaxCompute environment and obtain corresponding migration tools. For more information about migration tools, please see the section entitled An Introduction to Migration Tools for Redshift to MaxCompute.

5.4 Transformation and Test Verification

5.4.1 Transformation

The migration tool can use MaxCompute studio (or a new SQL node created by DataWorks) client syntax verification to create a new SQL file. Sections with unsupported syntax will turn red.

MaxCompute studio installation document

5.4.1.1 Create a Table

Obtain the definitions of table list and table field in Redshift and convert them based on the field values supported by MaxCompute. A transactional table is needed for tables with update and delete statements.

For information about type conversion, please refer to the section entitled Data Type and Type Conversion.

The syntax for creating a table:

Create a table

create [external] table [if not exists] 
[(  [not null] [default ] [comment ], ...)]
[comment ]
[partitioned by (  [comment ], ...)]

Used to configure the Shuffle and Sort properties of a clustered table when creating it

[clustered by | range clustered by ( [, , ...]) [sorted by ( [asc | desc] [,  [asc | desc] ...])] into  buckets]

For external tables only

[stored by StorageHandler]

For external tables only

[with serdeproperties (options)]

For external tables only

[location ]

Set the table to a Transactional table. You can update or delete data from the transactional table later. The usage of transactional tables has restrictions, so a transactional table should be created based on your needs.

[tblproperties("transactional"="true")]   
[lifecycle ];

Create a table based on an existing table and replicate data from the existing table to the new table. Partition properties are not replicated.

create table [if not exists]  [lifecycle ] as ;

Create a table based on an existing table and replicate data from the existing table. Partition properties are not replicated.

create table [if not exists]  like  [lifecycle ];

Note:

Table and column names are not case-sensitive.
If you do not specify the if not exists” option and another table with the same name exists, an error is returned. If you specify this option, a message that indicates the operation succeeded is returned. The message is returned regardless of whether an existing table with the same name exists, even if its schema of the existing table is different from the table you want to create. The metadata of the existing table with the same name will remain unchanged.
A table name or column name cannot contain special characters. It can contain only lowercase and uppercase letters, digits, and underscores (_) and must start with a letter.

The name cannot exceed 128 bytes in length.

tblproperties("transactional"="true"): optional (update and delete statements must be set.) Set the table to a transactional table. Later, you can perform update and delete operations on transactional tables to update or delete data at the row level. For more information, please see this link.
The partitioned by option specifies the partition field. The value can only be a string. The partition value cannot contain double-byte characters (such as Chinese). It must start with a letter (lowercase or uppercase), followed by letters or digits. The name can be up to 128 bytes in length. Characters allowed include spaces, colons (:), underscores (_), dollar signs ($), pound signs (#), dots (.), exclamation points (!), and @. Other character lines are considered to be undefined, such as t, n, and /. After you use partition fields to partition a table, a full table scan is unnecessary when you add partitions, update partition data, or read partition data. This makes data processing more efficient.
The content of a comment is a string whose length does not exceed 1,024 bytes.
The lifecycle option indicates the lifecycle of the table in the unit of day. The create table like statement does not copy the lifecycle attribute of source table.
Theoretically, a source table can have up to six levels of partitions. Considering the partition's expansion way under extreme storage conditions, use as few partitions as possible.
You can configure the maximum number of table partitions for a specific project. The default maximum number is 60,000.
In the create table ... as select ... statement, if a constant is used as the column's value in a select clause, we advise specifying the column's name.
You can try the following create table ... like statement to ensure that the destination table has the same schema as the source table:

5.4.1.1.1 Examples of Table Creation

Double quotation marks for column names need to be removed.
The auto-increment columns of primary keys, such as BIGINT primary key identity(1,1), have to be removed, with only the default value of default 1 retained.
Numeric data type needs to be converted to decimal type.
In ::character varying, '1900/01/01'::text, the two colons and their following contents need to be deleted because MaxCompute does not support them.
In "n_car_no" numeric DEFAULT -1::numeric, -1 needs to be removed. MaxCompute does not support the default value of -1.
In the "ts_req_time" timestamp without time zone DEFAULT to_timestamp('1900/00/00 00:00:00'::text, 'YYYY-MM-DD HH24:MI:SS.MS'::text), you need to remove the time zone and change it to the timestamp DEFAULT timestamp "1900-01-01 00:00:00".
In INTERLEAVED SORTKEY(vc_trans_id), MaxCompute does not support the interlaced sorting column function, so consider replacing it with zorder.
MaxCompute does not support time zone, so contents related to it need to be deleted.
AUTO REFRESH YES needs to be removed when modifying materialized views, and the materialized view of MaxCompute does not support window functions.

5.4.1.2 SQL Migration

SQL migration converts Redshift SQL into MaxCompute SQL based on the differences to make SQL available. For the differences between SQL statements, please refer to the chapter entitled Comparison Between Redshift and MaxCompute Before Migration.

5.4.1.2.1 Examples of SQL Migration

DML Statements

You need to create a transaction table ("transactional"="true") to execute the update or delete statement.
For example, when adding a comment to the column using COMMENT ON column, atzc_dev_dw.t_com_fact_auto_pay_gw_trans_pay_gw."n_trans_amt" is 'dml';, you need to change it to the syntax supported by MaxCompute, using alter table change column comment '';.

DQL Statement:

Built-In Functions:

5.4.1.2.2 Migration of Stored Procedure

We suggest changing to the temporary tables or PyODPS method.

5.4.2 Data Migration

Serial No.	Description
1	Unload data from Amazon Redshift to a data lake on Amazon Simple Storage Service (S3).
2	Migrate data from Amazon S3 to an OSS bucket using the Data Online Migration service of OSS.
3	Migrate data from the OSS bucket to a MaxCompute project in the same region and then verify the integrity and accuracy of the migrated data.

Click this link to see the reference document on data migration.

5.4.3 Test and Verification

Currently, there is no tool to support the data test and verification of the migration from Redshift to MaxCompute. So, we have to write a script tool to complete this task. The common verification solutions are listed below:

Schema Verification: Export the columns and type definitions from Redshift and MaxCompute and use MD5 to verify
Rows Comparison of a Table: Execute SQL statements in Redshift and MaxCompute to count the number of rows of data in the same table for row-by-row comparison.
Full Data Verification: This is generally used in the core table and with a small data volume. Export full data or use the full data segmentation to use MD5 for verification.
Data Sampling Verification: This is generally used in the data verification scenario of large core tables. Sample data from the source and target table for verification according to certain sampling rules.

Community