Data Integration using Alibaba Cloud DataWorks

MaxCompute, previously known as Open Data Processing Service (ODPS), is a data warehouse for exabyte-scale data. It is a general-purpose, fully managed, multi-tenant data processing platform for large-scale data warehousing. MaxCompute supports various data import solutions and distributed computing models, enabling users to query massive datasets effectively, reduce production costs, and keep data secure. Besides providing a data lake and processing exabyte-scale data, MaxCompute supports multiple computational models and protects data through multi-level sandbox protection and monitoring. It is used for three main functions, listed below.

Data Integration
Data Processing
Data Visualization

Data integration is done with Data Studio in the DataWorks environment. Data processing uses Hive-style SQL to parse through the massive amount of data available, and Quick BI is used to visualize the data as a report or dashboard. To perform these operations, a workspace must first be created in DataWorks. This blog explains the steps to follow before creating nodes and performing tasks such as data integration, data processing, and data visualization. To start, open the Alibaba Cloud console and open DataWorks.
Click the “Create Workspace” button.
Now click “Confirm to Use Time Zone of Server”.
Enter the required details and then click “Commit”. In this example scenario, the workspace is named “BDDemo”. After the workspace is committed, a PAI instance is associated with it by default, and we are given the option to associate data warehousing services such as MaxCompute, Hologres, and E-MapReduce. For now we choose to associate MaxCompute. Many other options are available as well; to use them, go to Management Center.
Click “Associate Now” next to MaxCompute. In this example, we enter “BDDemoRD” as the resource display name. Click Pay by Volume and select an available quota from the drop-down list.
Then enter the project name, which is “BDDemoP” here, and grant access to the Alibaba Cloud primary account by selecting it. Then click “Complete Association”. “BDDemoP” is the MaxCompute project name.
The DataWorks workspace is now associated with the MaxCompute compute engine.
Click “Back”.
The newly created workspace now appears on the dashboard. Data integration requires the Data Integration portal and Data Studio. Data Integration can be reached by clicking the “Go to Data Integration” button or the “Data Integration” option in the left pane.
The Data Studio option is available in the left pane as well. With the workspace created, we next need to purchase and create a resource group. Click Resource Groups in the left pane.
Here I already have four resource groups for data integration. I can use any of these or create a new one. I choose the latter by clicking “Create Resource Group for Data Integration”.
We can select any of the listed regions and choose the configuration needed. Here I have chosen the minimum configuration of 4 vCPUs and 8 GiB of memory for one month.
Enter a name for the resource group so that it is easy to identify while working on the project. Click “Buy Now”.
Agree to the DataWorks Exclusive Resources Agreement of Service, then click “Pay” and proceed with the payment.
Complete the payment and click “Subscribe” to finish purchasing the resource group.
Click Console to continue working on data integration and processing.
On the Resource Groups page, the newly purchased resource group is listed. Because we named it during creation, the right resource group is easy to locate. Click Data Integration in the left pane.
Select the workspace we created earlier and click “Go to Data Integration”.
Data integration needs data sources and destinations. Click “Data Source” in the left pane.
A default ODPS (MaxCompute) data source is available. However, we will create our own data sources by adding a MySQL table and data from OSS. Click “Create Data Source”.
We need to add a MySQL database as a data source, so select “MySQL”.
Select Aliyun RDS as the access method, because we are associating an RDS instance. We have named the data source BDRDS1. Select the region where the RDS instance was created; in our case, Singapore. Select Current Alibaba Cloud Account as the instance owner. The instances owned by the Alibaba Cloud account in Singapore are then listed. Choose the required RDS instance and type the database name. Then enter the username and password and click Complete.
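The form fields filled in above can be pictured as one small record. The sketch below is only an illustration of that idea: the class, its field names, the instance ID, the database name, and the credentials are all hypothetical and are not a DataWorks API. Only the data source name, access method, region, and instance owner come from this walkthrough.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class RdsMySqlDataSource:
    """Values typed into the data source form (field names are illustrative)."""

    name: str
    access_method: str
    region: str
    instance_owner: str
    instance_id: str
    database: str
    username: str
    password: str

    def missing_fields(self) -> List[str]:
        """Return the names of fields that were left empty."""
        return [key for key, value in asdict(self).items() if not value.strip()]

    def to_redacted_json(self) -> str:
        """Serialize the record with the password masked, safe to paste in notes."""
        data = asdict(self)
        data["password"] = "***"
        return json.dumps(data, indent=2)


source = RdsMySqlDataSource(
    name="BDRDS1",                                   # from the walkthrough
    access_method="Aliyun RDS",                      # from the walkthrough
    region="Singapore",                              # from the walkthrough
    instance_owner="Current Alibaba Cloud Account",  # from the walkthrough
    instance_id="rm-example-instance",               # hypothetical placeholder
    database="example_db",                           # hypothetical placeholder
    username="example_user",                         # hypothetical placeholder
    password="example_password",                     # hypothetical placeholder
)

print(source.missing_fields())
print(source.to_redacted_json())
```

Keeping such a record next to the project notes makes it easier to recreate the data source later without storing the password in plain text.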
On the Data Source page, the MySQL data source we set up now appears among the data sources available for data integration. Next, we need to ensure that the resource group purchased earlier is bound to this workspace. Go back to the DataWorks console and click Resource Groups.
Click the “Change” option for our desired resource group.
Click “Bind” to connect the workspace and the resource group. Then go to DataWorks and click Data Integration.
Here we need to create a connection between a MySQL RDS instance and a MaxCompute table. Select MySQL as the source and MaxCompute as the destination, then click Create.
Fill in the details as needed.
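The source-to-destination pairing chosen above can be sketched as a small reader/writer description. This is a rough illustration only: the walkthrough configures the task entirely through the UI, and the key names below are assumptions loosely modeled on reader/writer style job descriptions rather than a confirmed DataWorks format. The destination data source name, the table names, and the columns are hypothetical.

```python
import json
from typing import Dict, List


def build_sync_job(
    source_name: str,
    source_table: str,
    dest_name: str,
    dest_table: str,
    columns: List[str],
) -> Dict:
    """Describe one reader step (MySQL) feeding one writer step (MaxCompute)."""
    if not columns:
        raise ValueError("at least one column is required")
    return {
        "type": "job",
        "steps": [
            {
                "name": "Reader",
                "category": "reader",
                "stepType": "mysql",  # assumed identifier for a MySQL source
                "parameter": {
                    "datasource": source_name,
                    "table": source_table,
                    "column": list(columns),
                },
            },
            {
                "name": "Writer",
                "category": "writer",
                "stepType": "odps",  # assumed identifier for a MaxCompute destination
                "parameter": {
                    "datasource": dest_name,
                    "table": dest_table,
                    "column": list(columns),
                },
            },
        ],
        # The reader output is handed to the writer.
        "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
    }


job = build_sync_job(
    source_name="BDRDS1",          # data source name from the walkthrough
    source_table="orders",         # hypothetical MySQL table
    dest_name="bd_maxcompute",     # hypothetical MaxCompute data source name
    dest_table="bd_orders",        # hypothetical MaxCompute table
    columns=["order_id", "amount"],
)
print(json.dumps(job, indent=2))
```

Writing the pairing down this way makes it clear which side each detail in the form belongs to: the source data source and table on the reader, the destination data source and table on the writer.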
For the synchronization method, select real-time synchronization to a single table. A pop-up then asks to continue in Data Studio. Click Confirm.
Select Business Flow/Workflow as the path. Give the node a convenient name and click Confirm.
In the second pane, right-click Table under MaxCompute and click Create Table.
Enter a name and click “Create”.
Click DDL.
Enter the DDL query and click Generate Table Schema.
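The walkthrough does not show the query that was entered, so the following is only a sketch of what such a statement might look like. It builds a MaxCompute-style CREATE TABLE statement from a column list. The table name, columns, comments, and lifecycle value are hypothetical, and the helper function is not part of any Alibaba Cloud tool.

```python
from typing import List, Optional, Tuple


def build_create_table_ddl(
    table: str,
    columns: List[Tuple[str, str, str]],
    lifecycle_days: Optional[int] = None,
) -> str:
    """Build a CREATE TABLE statement from (name, type, comment) tuples."""
    if not columns:
        raise ValueError("at least one column is required")
    column_lines = []
    for name, col_type, comment in columns:
        if "'" in comment:
            # Keep the sketch simple: reject quotes instead of guessing escape rules.
            raise ValueError(f"comment for {name} must not contain single quotes")
        column_lines.append(f"    {name} {col_type} COMMENT '{comment}'")
    ddl = f"CREATE TABLE IF NOT EXISTS {table} (\n" + ",\n".join(column_lines) + "\n)"
    if lifecycle_days is not None:
        # Optional clause: number of days the table data is retained.
        ddl += f"\nLIFECYCLE {lifecycle_days}"
    return ddl + ";"


# Hypothetical destination table mirroring a MySQL orders table.
ddl = build_create_table_ddl(
    "bd_orders",
    [
        ("order_id", "BIGINT", "Order identifier"),
        ("customer_name", "STRING", "Customer name"),
        ("amount", "DOUBLE", "Order amount"),
        ("created_at", "DATETIME", "Creation time"),
    ],
    lifecycle_days=30,
)
print(ddl)
```

The printed statement can be pasted into the DDL dialog and adjusted so that its columns match the source MySQL table.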
Now click “Commit to Production Environment”. With this, the destination dataset is available. Next, it must be added as a data source. Go to DataWorks and click New.
Enter a name for the data source, fill in the details corresponding to the dataset, and click Complete. Now move to the node’s tab and double-click MaxCompute.
Now go to Data Studio and select the MaxCompute data source.
With this, complete the synchronization in DataWorks and refresh the mapping between the source table and the MaxCompute table.
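Refreshing the mapping pairs each source column with a destination column. As a rough sketch of that idea, and not of how DataWorks implements it, the function below pairs columns that share a name (ignoring case) and reports the ones left over on either side. The column names are hypothetical.

```python
from typing import List, Tuple


def map_columns_by_name(
    source_cols: List[str], dest_cols: List[str]
) -> Tuple[List[Tuple[str, str]], List[str], List[str]]:
    """Pair columns with the same name (case-insensitive) and list the leftovers."""
    dest_lookup = {col.lower(): col for col in dest_cols}
    mapping = []
    unmatched_source = []
    for col in source_cols:
        target = dest_lookup.pop(col.lower(), None)
        if target is None:
            unmatched_source.append(col)
        else:
            mapping.append((col, target))
    # Whatever is still in the lookup has no counterpart in the source table.
    unmatched_dest = list(dest_lookup.values())
    return mapping, unmatched_source, unmatched_dest


# Hypothetical MySQL source columns and MaxCompute destination columns.
mapping, only_in_source, only_in_dest = map_columns_by_name(
    ["order_id", "Customer_Name", "amount", "updated_at"],
    ["order_id", "customer_name", "amount", "created_at"],
)
print(mapping)
print(only_in_source)
print(only_in_dest)
```

Columns reported as unmatched are the ones to check by hand in the mapping view before completing the configuration.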
Provide the required information.
Click Complete Configuration.
The node synchronization is now successful. In the same way, any number of data sources can be added and integrated into a MaxCompute data warehouse.