edit-icon download-icon

MaxCompute Data Ingestion from ApsaraDB for RDS

Last Updated: Apr 10, 2018

Introduction

MaxCompute (previously known as ODPS) is a general purpose, fully managed, multi-tenancy data processing platform for large-scale data warehousing. MaxCompute supports various data importing solutions and distributed computing models, enabling users to effectively query massive datasets, reduce production costs, and guarantee data security.

Alibaba Cloud offers a platform called DataWorks for users to perform data ingestion, data processing and data management in MaxCompute. It provides fully hosted workflow services and a one-stop development and management interface to help enterprises mine and explore the full value of their data. DataWorks uses MaxCompute as its core computing and storage engine, to provide massive offline data processing, analysis, and mining capabilities.

Currently, data from the following data sources can be imported to or exported from the workspace through the data integration function: RDS, MySQL, SQL Server, PostgreSQL, MaxCompute, ApsaraDB for Memcache, DRDS, OSS, Oracle, FTP, dm, HDFS, and MongoDB.

In this document, the focus is on data ingestion from a MySQL database in ApsaraDB for RDS.

Solution Architecture

In this solution architecture, the user ingests data into a MaxCompute ODPS table from MySQL RDS, by using the web-based DataWorks platform.

1

Pre-Requisites and Preparation

You need the following information.

  • An Alibaba Cloud Account.
  • A sample dataset (click here to download).
  • A sample database table.
  • A MySQL ApsaraDB for RDS instance.
  • A MySQL client (such as MySQL Workbench).
ComponentConfigurationSample values
MySQL、RDSRegionAsia Pacific SE 3 (Kuala Lumpur)
 Database Namedemo_db
 MySQL Tablecreate table if not exists credit_data_mysql ( id char(48), income double, expenses double, credit_cards bigint) ;
 Source csv filesample_credit_data.csv
 MaxCompute IP Whitelist (Malaysia Region)11.193.188.0/24,11.221.205.0/24,11.221.206.0/24,11.221.207.0/24,100.64.0.0/8,11.214.81.0/24,47.254.212.0/24
DataWorksData source namerds_demo_xxx
 RDS IDrm-z**
 Account UID5903**
 Database Namedemo_db
 Username**
 Password**
 ODPS Tablecreate table if not exists credit_data_rds ( id string, income double, expenses double, credit_cards bigint) ;

Getting Started

Setting up a Database in RDS

  • Create a database on your MySQL RDS instance by clicking on the Database section in the left navigation pane.

    2

  • Click Create Database to create a MySQL database.

    3

  • Name the MySQL database and select the authorized account for this database.

    4

  • Next, to create a table in the MySQL Database, connect to your RDS instance with a MySQL client.

  • MySQL client references are listed as follows.

  • In this document, MySQL Workbench is used to connect to the MySQL RDS instance.

  • After connecting successfully to the MySQL RDS instance, create a table by using the DDL code as follows.

  • If you are new to MySQL Workbench, be aware that after copy-pasting the code as follows into the “Query 1” window, you need to double click on your database name in the left-side menu to select it (it is called “demo_db” here), and then click on the small lightning bolt icon in the top bar to run your code.

    Create MySQL Table:

    1. create table if not exists credit_data_mysql (
    2. id char(48),
    3. income double,
    4. expenses double,
    5. credit_cards bigint
    6. ) ;

    5

  • Upload sample source data (from a local CSV file) to the table (copy-paste the code as follows).

  • Note that you must change the file path in this example to match the location where you have saved sample_credit_data.csv.

    Upload Local CSV:

    1. LOAD DATA LOCAL INFILE '/Users/xiaomei/Desktop/sample_credit_data.csv'
    2. INTO TABLE credit_data_mysql FIELDS TERMINATED BY ','
    3. ENCLOSED BY '' LINES TERMINATED BY '\n'

    6

  • Verify that the data has been successfully loaded into the table by running a SELECT statement (see as follows).

  • After running the SELECT query, you must see output that looks like this. Note that a row at the top says ID 0 0 0. This is because we did not remove the title row from our CSV file (this is also why we got a warning in the screenshot preceding). In general, you must make sure to remove any notes or title information before loading your data into RDS.

    Check RDS Table:

    1. SELECT * FROM credit_data_mysql;

    7

Whitelist MaxCompute IP in RDS instance

  • In order for MaxCompute to be able to read data from this MySQL RDS instance, MaxCompute IPs need to be whitelisted.

  • This whitelisting is an important security measure to only allow known IPs to access the database. If you do not add MaxCompute to the RDS instance’s IP whitelist, MaxCompute cannot be able to connect to your database.

  • Find a list of the MaxCompute IPs which must be whitelisted in each region here:

    Add whitelist

  • In the MySQL RDS instance, click on Security on the left navigation pane, then click on Add a Whitelist Group.

8

  • Input the MaxCompute IPs to be whitelisted:

    MaxCompute IP Whitelist(Malaysia Region)11.193.188.0/24,11.221.205.0/24,11.221.206.0/24,11.221.207.0/24,100.64.0.0/8,11.214.81.0/24,47.254.212.0/24
    9

  • The whitelisted group of MaxCompute IPs are displayed here:

    10

Set up MySQL RDS as Data Source in DataWorks

  • Navigate to DataWorks > Data Integration.

    11

  • On the Data Integration page, click on New Source to sync data from RDS.

    12

  • Select MySQL Relational Database as the data source.

13

  • Configure the MySQL RDS data source information (sample values are shown as follows).

    Data source namerds_demo_credit
    MySQL RDS IDrm-zf*
    Account UID5903**
    Database Namedemo_db
    Username**
    Password**
  • After that, click test connectivity to check whether DataWorks can connect to the MySQL RDS database.

  • If the connection is successful, a green box appear in the upper right corner saying “connectivity test successfully”.

    15

  • After that, click on Complete.

  • In DataWorks under Data Integration, click on Data Sources in the left navigation pane and the newly created data source from RDS is visible there.

    16

Data Ingestion: Get Source Data from RDS

  • Go to Sync Tasks in the left navigation pane within Data Integration and click on New Source.

    17

  • Click on Wizard Mode to set up data ingestion from RDS.

    18

Step 1: Configure Data Ingestion Source

  • The data source is a RDS data source that was created in preceding.

  • The table is a table that was created in the MySQL RDS database – credit_data_mysql.

  • Click on data preview to preview the data and verify that it is correct, then click Next.

    19

Step 2: Configure Data Ingestion Target

  • Choose odps_first (odps) as the data ingestion target.

    The odps_first is the default data repository for MaxCompute.

  • Before data can be ingested into MaxCompute, a table has to be created within MaxCompute.

  • Click on Create New Target Table.

    20

  • Enter the table creation statement (see as follows – you may need to replace the code that MaxCompute automatically fills in).

    ODPS Table:

    1. create table if not exists credit_data_rds (
    2. id string,
    3. income double,
    4. expenses double,
    5. credit_cards bigint
    6. ) ;

    21

  • Click on Create Table and then click Next.

Step 3: Configure Source and Target Column Mapping

  • It is important to make sure the order of columns in the source data map correctly onto the columns of the ODPS table.

    The recommended approach is to make sure the source data’s column order is the same as the order of the columns in the ODPS table.

  • Click on peer mapping to map source to target.

  • Click Next once the column mapping is correct.

In this tutorial the columns of the source data file and columns of the ODPS table are in the same order, so straight lines appear from source to target. In this case, it is not necessary to change the peer mapping.

22

Step 4: Configure Channel Control

  • Verify that Maximum Operating Rate and Concurrent Jobs appear as follows, then click Next.

    23

Step 5: Configuration Preview

  • Verify configuration and if everything is correctly configured, click Save.

    24

  • Give the data integration task a name (here we choose datasync_rds).

    25

  • After you have saved, click on operation to initiate data ingestion from RDS.

    26

  • Monitor the log near the bottom of the screen to check the status of the data synchronization task.

  • If the data synchronization ended with return code: [0], then data ingestion was successful.

    27

Data Validation: Validate the Ingested Data with MaxCompute Data Development

  • Click on Data Development in the DataWorks top menu, and select the table which data has been ingested into. You can then click on Create Script to create a script (query) which can be run against your ODPS table. The script name can be anything you like.

    28

  • After creating a script, copy-paste the code SELECT * FROM credit_data_rds; (your table name may be different) and click on Run to generate results (see as follows).

  • The result is displayed in the log tab.

  • Validate the data by comparing it against the original data in the RDS database.

    29

Conclusion

Ingesting data from ApsaraDB for RDS using the DataWorks IDE is user friendly and easy, can be done end-to-end using a web-based approach, and enables customers - especially business users – to quickly import data, saving time for more important tasks such as data analysis.

Products Product Links for Reference
RDS https://www.alibabacloud.com/product/rds
DataWorks https://www.alibabacloud.com/product/ide
MaxCompute https://www.alibabacloud.com/product/maxcompute
Thank you! We've received your feedback.