×
Community Blog Realtime Synchronization From MySQL to MaxCompute with DataWorks - Friday Blog - Week 42

Realtime Synchronization From MySQL to MaxCompute with DataWorks - Friday Blog - Week 42

Learn how to import data from MySQL into MaxCompute in real time, using the new real time synchronization feature.

By: Jeremy Pedersen

Warning: The real-time synchronization feature depends on Exclusive Resource Groups for Data Integration, which is a paid feature in DataWorks. If you already have an Exclusive Resource Group, great! You should be able to use it to try the steps in this blog. If you don't (and you'd rather not pay), just read along instead. ^_^

Realtime Synchronization From MySQL To MaxCompute

MaxCompute is Alibaba Cloud's data warehousing and batch data processing engine. It's a great tool for storing datasets up to petabytes (PB) in size, and can run batch processing jobs at scale.

Typically, data is read into MaxCompute on a regular schedule: say, once per day or once per hour. This is well and good for offline batch processing, but what can you do about data sources that generate new data in near-real-time, such as a Kafka queue or a MySQL database that is constantly adding new records?

Luckily, there is a solution! MaxCompute can now read data in real time from several different data sources, including MySQL, Kafka, and DataHub.

Let's look at how we can set this up for a MySQL Database.

Setting Up The Environment

Before we start, we will need:

  1. A MySQL database
  2. A DataWorks Workspace

You can follow along with this blog post to set up an RDS MySQL database instance, and the first half of this post to set up a DataWorks Workspace and connect it to the MySQL Database.

Setting Up Real-time Synchronization

Now, we need to locate our DataWorks Workspace in the "Workspaces" list, and click on "Data Integration":

01_dataworks

From the Data Integration console, create a new task:

02_dataworks

Make sure the data source is MySQL and the destination is MaxCompute (odps). Make sure the task type is "One-click real-time synchronization to MaxCompute":

03_dataworks

Click "Next Step". Make sure the data source and destination are set correctly, and click both "Test Connectivity" buttons to make sure things are working, as shown here:

04_dataworks

Note that if you do not have an Exclusive Resource Group for Data Integration, you will have to create one. In my case, I have an existing Exclusive Resource Group leftover from a previous test of DataWorks, which you can see in the middle of the screenshot above (rg_group_dw_test).

For the next few steps, I assume that your MySQL database contains the users, products, and orders tables (and associated data) provided in this blog post. Follow the steps in the screenshots below, to configure the users table as the one which will be synchronized in real time:

05_dataworks

06_dataworks

Set up the Destination Table configuration, as shown here (you may need to click on "Refresh source table and MaxCompute table mapping" to get the table mapping to load):

07_dataworks

Click on "Next Step". Next, you need to configure how often records will be synchronized with your MaxCompute table.

"But wait", I hear you saying, "aren't records being imported in real time?". Yes they are, but they aren't written directly into our MaxCompute target table. They are first stored in a temporary "log" table. That table is then synchronized with our target MaxCompute table at regular intervals. I can synchronize by full date (yyyy-mm-dd), by day, or once an hour, as shown here:

08_dataworks

In my case, I chose to sync once a day, using the full date (yyyy-mm-dd) to partition my MaxCompute users table:

09_dataworks

Next, we need to tell the synchronization job what to do with DDL statements, which may be contained in the binlogs that MySQL ships us. We can leave all those settings at defaults (as shown here):

10_dataworks

Finally, we set up the synchronization task. Actually tasks: a one-time synchronization task followed by an ongoing realtime synchronization task, as shown here:

11_dataworks

Clicking on "Complete Configuration" returns us to the "Tasks" view, where we can click on "Submit and Run" to start the migration task:

12_dataworks

If everything is working, we should be able to click on the task and see its status (note: you might see a different status for the full migration, if it hasn't yet completed):

13_dataworks

That's it! Now we are ready to add a few records to the users table in our MySQL database, and see what happens.

Testing It Out: Adding Some Records To The MySQL Database

Next, let's see if it's really working! To do this, we need to navigate over to the RDS console and use DMS to log into the MySQL database:

14_rds

15_rds

16_rds

Assuming you have your MySQL database loaded with the test data from this blog post (mentioned several times above), you should be able to run these 3 INSERT INTO statements, to add some new users:

INSERT INTO `users` (user_id, name, age, sex, country, country_code) VALUES (1000, 'Kate Melgrove', 26, 'F', 'United States', 'US');

INSERT INTO `users` (user_id, name, age, sex, country, country_code) VALUES (1001, 'Gerald Rawlins', 55, 'M', 'United States', 'US');

INSERT INTO `users` (user_id, name, age, sex, country, country_code) VALUES (1002, 'Jeremy Pedersen', 32, 'M', 'United States', 'US');

Testing It Out: Seeing The Changes In MaxCompute:

My MaxCompute project is called jdp_dw_test_maxp and my users table is called mysql_source_users. Let's see what happens if I attempt a full table read, looking only for users with a user ID higher than 999 (i.e. only the 3 records added above).

The SQL code for that is:

set odps.sql.allow.fullscan=true;
select * from jdp_dw_test_maxp.mysql_source_users WHERE user_id > 999;

And the results? See for yourself:

17_maxp_test

Nothing! So where is the data that we added?

Remember when we configured our real-time synchronization task, the task was set to update the MaxCompute table once a day. So where are the records going, when we add new items to users in the MySQL database? Is MaxCompute simply ignoring the changes? No. MaxCompute is reading those records in, as soon as they appear in MySQL. They are being stored temporarily in a "log" table. You can actually see that log table on the left-hand side of the DataStudio console:

18_maxp_test

What happens if we query this table? Let's try:

set odps.sql.allow.fullscan=true;
select * from jdp_dw_test_maxp.mysql_source_odps_first_log;

So what's the result? This time, it works!

19_maxp_test

Waiting For The Synchronization

If we wait long enough, the entries in the "log" table will be added to the users table in MaxCompute, which is called mysql_source_users. I executed the INSERT INTO statements (above) on 2021-12-29, and they showed up in the _log table in MaxCompute just a few seconds later.

I then waited a few days, dear readers, and re-ran my SQL query on the mysql_source_users table, using this SQL query:

set odps.sql.allow.fullscan=true;
select * from jdp_dw_test_maxp.mysql_source_users WHERE ds = 20220103 and user_id > 999;

The result? Now the records have been added:

20_final_results

So everything is working as intended!

Now you know how to set up realtime synchronization between MySQL and MaxCompute! Use the knowledge well.

I've Got A Question!

Great! Reach out to me at jierui.pjr@alibabacloud.com and I'll do my best to answer in a future Friday Q&A blog.

You can also follow the Alibaba Cloud Academy LinkedIn Page. We'll re-post these blogs there each Friday.

Not a LinkedIn person? We're also on Twitter and YouTube.

0 0 0
Share on

JDP

65 posts | 76 followers

You may also like

Comments

JDP

65 posts | 76 followers

Related Products