This topic explains how to add HttpFile and MySQL data sources to access basic user information and website access logs provided in the example. It also covers configuring data synchronization tasks to transfer data to a private OSS data source and using EMR Hive nodes to create tables and query the synchronized data.
Prerequisites
-
An environment is prepared. For more information, see Prepare the environment.
-
To ensure network connectivity, add a security group rule in the ECS console to open port
10000
on the ECS instance and set the authorization object to the CIDR block of the VPC switch in the resource group. For more information, see Add a security group rule.
Step 1: Create data sources
For the scenario in this experiment, create and register the following three data sources in the DataWorks workspace:
-
HttpFile data source: Accesses website access log data provided in this tutorial.
-
MySQL data source: Accesses user information data provided in this tutorial.
-
OSS data source: Receives test data from the HttpFile and MySQL data sources.
Create an HttpFile data source
Go to the Data Sources page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, click Data Sources.
-
Click Add Data Source, and in the Add Data Source dialog box, search for and select HttpFile as the data source type.
-
On the Create HttpFile Data Source page, configure the parameters using example values for both development and production environments.
Parameter
Description
Data Source Name
Enter the data source name. In this tutorial, enter user_behavior_analysis_httpfile.
Data Source Description
The description of the data source. Specify that the data source is exclusively provided for the use cases of DataWorks and is used as the source of a batch synchronization task to access the provided test data. In addition, you also need to specify that the data source can be used only for data reading in data synchronization scenarios.
URL Domain Name
The URL domain name for both the development and production environments is
https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com
. -
Click the Connectivity (development Environment) and Connectivity (production Environment) columns of the specified resource group, and click Test Connectivity. Wait for the interface to indicate that the test is complete and the connectivity status is Connected.
ImportantEnsure that at least one resource group is in the Connected status. Otherwise, this data source cannot be used to create a sync task in codeless UI.
-
Click Complete Creation.
Create a MySQL data source
-
On the Management Center page, click to enter the Data Source page and then click Add Data Source.
-
In the Add Data Source dialog box, search for and select MySQL as the data source type.
-
On the Create MySQL Data Source page, configure the parameters using example values for both development and production environments.
Parameter
Description
Data Source Name
Enter the data source name. In this tutorial, enter user_behavior_analysis_mysql.
Data Source Description
The description of the data source. Specify that the data source is exclusively provided for the use cases of DataWorks and is used as the source of a batch synchronization task to access the provided test data. In addition, you also need to specify that the data source can be used only for data reading in data synchronization scenarios.
Configuration Mode
Select Connection String Mode.
Connection Address
Host Address IP:
rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com
Port Number:
3306
Database Name
Enter the database name. In this tutorial, enter
workshop
.Username
Enter the username. In this tutorial, enter workshop.
Password
Enter the password. In this tutorial, enter workshop#2017.
Authentication Option
Set this parameter to No Authentication.
-
Click the Connectivity (development Environment) and Connectivity (production Environment) columns of the specified resource group, and click Test Connectivity. Wait for the interface to indicate that the test is complete and the connectivity status is Connected.
-
Click Complete Creation.
Create an OSS data source
In this experiment, synchronize user information from the MySQL data source and log information from the HttpFile data source to the OSS data source.
-
On the Management Center page, click to enter the Data Source page and then click Add Data Source.
-
In the Add Data Source dialog box, search for and select OSS as the data source type.
-
On the Create OSS Data Source page, configure the parameters using example values for both development and production environments.
Parameter
Description
Data Source Name
Enter the name of the data source. In this example, enter test_g.
Data Source Description
Provide a brief description of the data source.
Access Mode
Select Access Key Mode.
AccessKey ID
The AccessKey ID of the account that is used to log on. You can go to the AccessKey page to copy the AccessKey ID.
AccessKey Secret
Enter the AccessKey secret of the account that is used to log on.
ImportantThe AccessKey secret is displayed only during creation. You cannot view the AccessKey secret after you create it. Keep it confidential. If the AccessKey is leaked or lost, delete it and create a new AccessKey.
Endpoint
Enter
http://oss-cn-shanghai-internal.aliyuncs.com
.Bucket
The name of the OSS bucket that is configured when you prepare the environment and create the EMR cluster. In this example, enter dw-emr-demo.
-
Click the Connectivity (development Environment) and Connectivity (production Environment) columns of the specified resource group, and click Test Connectivity. Wait for the interface to indicate that the test is complete and the connectivity status is Connected.
NoteEnsure that at least one resource group is in the Connected status. Otherwise, this data source cannot be used to create a sync task in codeless UI.
-
Click Complete Creation.
Step 2: Set up synchronization links
-
Click the
icon in the upper left corner, and select .
-
In the Data Development panel, right-click Business Flow, and select Create Business Flow.
-
In the Create Business Flow dialog box, enter Business Name as workshop_emr, and then click Create.
-
Enter the business flow development panel, and create a zero load node and offline synchronization node.
-
Click Create Node, select Zero Load Node, and drag it onto the editing page on the right. In the Create Node dialog box, enter Node Name as workshop_start_emr, and click Confirm.
-
Click Create Node, select Offline Synchronization node, and drag it onto the editing page on the right. Set the Node Name to ods_raw_log_d_2oss_emr and ods_user_info_d_2oss_emr respectively, which are used to synchronize MySQL user information and HttpFile log information. Then click Confirm.
-
-
In the business flow development panel, drag and connect lines to set the workshop_start_emr node as the ancestor node of the two offline synchronization nodes.
Step 3: Configure synchronization tasks
Configure the initial node of the business flow
-
On the Data Development page, double-click the zero load node under the corresponding business flow. Open the node's editing page, and click Scheduling Configuration on the right.
-
Configure the scheduling properties.
Scheduling Configuration
Image Display
Description
Scheduling Time Configuration
The scheduling time for the zero load node is set to 00:30. This zero load node will start the current business flow and execute it at 00:30 every day.
Set Rerun Property to Rerun On Success Or Failure.
Scheduling Dependency Configuration
Since the zero load node
workshop_start_emr
has no upstream dependency, it can directly depend on the Workspace Root Node, which triggers the execution of theworkshop_start_emr
node.The workspace root node is named:
WorkspaceName_root
. -
After completing the configuration, click the
icon in the upper left corner to save.
Configure offline synchronization nodes
-
Synchronize user information from the MySQL data source to the self-built OSS.
-
On the Data Development page, double-click the ods_user_info_d_2oss_emr node to enter the node configuration page.
-
Establish network connections for data synchronization.
After completing the network and resource configurations, click Next, and follow the interface prompts to complete the connectivity test.
Parameter
Description
Data Source
Data Source: MySQL.
Data Source Name: user_behavior_analysis_mysql.
My Resource Group
Select the purchased Serverless Resource Group.
Data Destination
Data Destination: OSS.
Data Source Name: Set it to test_g, which specifies the name of the private OSS data source that you added.
-
Configure the data synchronization node.
Parameter
Description
Data Source
Table: Select ods_user_info_d from the data source.
Shard Key: It is recommended to use the primary key or an indexed column as the shard key. Only fields of the integer type are supported. Set the shard key to uid.
Data Destination
Text Type: Select text type.
File Name (including Path): Enter the path based on the directory of your self-built OSS. In this example, enter ods_user_info_d/user_${bizdate}/user_${bizdate}.txt. Here, ods_user_info_d is the directory name you created, and $bizdate indicates the date of the previous day.
Column Delimiter: Enter the column delimiter as |.
-
Configure the scheduling properties.
On the configuration page, click Scheduling Configuration on the right to enter the Scheduling Configuration panel to configure scheduling and node information. The following table describes the configurations.
Configuration Item
Configuration Content
Image
Scheduling Parameter
Add the following in the Scheduling Parameter area:
Parameter Name: bizdate
Parameter Value: $[yyyymmdd-1]
Time Attribute
Set Rerun Property to Rerun On Success Or Failure.
Scheduling Dependency
In Scheduling Dependency, confirm that the output table is set as the output of this node.
The format is
WorkSpaceName.NodeName
-
After the configuration is complete, click the
icon in the toolbar to save.
-
-
Synchronize log information from the HttpFile data source to the self-built OSS.
-
On the Data Development page, double-click the ods_raw_log_d_2oss_emr node to enter the node configuration page.
-
Establish network connections for data synchronization.
After completing the network and resource configurations, click Next, and follow the interface prompts to complete the connectivity test.
Parameter
Description
Data Source
Data Source: HttpFile.
Data Source Name: user_behavior_analysis_httpfile.
My Resource Group
Select the purchased Serverless Resource Group.
Data Destination
Data Destination: OSS.
Data Source Name: Set it to test_g, which specifies the name of the private OSS data source that you added.
-
Configure the data synchronization node.
Parameter
Description
Data Source
File Path: /user_log.txt.
Text Type: Select text type.
Column Delimiter: Enter the column delimiter as |.
Compression Format: Includes four types: None, Gzip, Bzip2, and Zip. Select None.
Skip Header: Select No.
Data Destination
Text Type: Select text type.
File Name (including Path): Enter the path based on the directory of your self-built OSS. In this example, enter ods_raw_log_d/log_${bizdate}/log_${bizdate}.txt. Here, ods_raw_log_d is the directory name you created, and $bizdate indicates the date of the previous day.
Column Delimiter: Enter the column delimiter as |.
-
Configure the scheduling properties.
On the configuration page, click Scheduling Configuration on the right to enter the Scheduling Configuration panel to configure scheduling and node information. The following table describes the configurations.
Configuration Item
Configuration Content
Image
Scheduling Parameter
Add the following in the Scheduling Parameter area:
Parameter Name: bizdate
Parameter Value: $[yyyymmdd-1]
Time Attribute
Set Rerun Property to Rerun On Success Or Failure.
Scheduling Dependency
In Scheduling Dependency, confirm that the output table is set as the output of this node.
The format is
WorkSpaceName.NodeName
-
After the configuration is complete, click the
icon in the toolbar to save.
-
Create EMR tables to obtain synchronized data
Create two tables ods_user_info_d_emr and ods_raw_log_d_emr to query the synchronized original MySQL user data and OSS log data respectively.
-
On the Data Development page, open the newly created business flow, right-click EMR, and select .
-
In the Create Node dialog box, enter Node Name, and click Confirm.
Create two EMR Hive nodes (ods_user_info_d_emr and ods_raw_log_d_emr) to establish the corresponding tables. Configure the upstream and downstream dependencies of the nodes in the business flow development panel by drawing lines, as shown in the figure below.
-
Enter the table creation statements in the EMR Hive nodes respectively, and in Scheduling Configuration, select Schedule Resource Group as the purchased Serverless Resource Group. Then click Save and Run each table creation statement.
-
Create the ods_user_info_d_emr table.
Double-click the ods_user_info_d_emr node to enter the editing page and configure the node.
-
Edit the SQL code.
CREATE EXTERNAL TABLE IF NOT EXISTS ods_user_info_d_emr ( `uid` STRING COMMENT 'User ID', `gender` STRING COMMENT 'Gender', `age_range` STRING COMMENT 'Age range', `zodiac` STRING COMMENT 'Zodiac sign' ) PARTITIONED BY ( dt STRING ) ROW FORMAT delimited fields terminated by '|' LOCATION 'oss://dw-emr-demo/ods_user_info_d/'; ALTER TABLE ods_user_info_d_emr ADD IF NOT EXISTS PARTITION (dt='${bizdate}') LOCATION 'oss://dw-emr-demo/ods_user_info_d/user_${bizdate}/';
NoteThe location in the code above is an example path, which should match the data destination configured in the offline synchronization task. Replace 'dw-emr-demo' with the OSS bucket domain name you created during environment preparation.
-
Configure the scheduling properties.
Configuration Item
Configuration Content
Image
Scheduling Parameter
Add the following in the Scheduling Parameter area:
Parameter Name: bizdate
Parameter Value: $[yyyymmdd-1]
Time Attribute
Set Rerun Property to Rerun On Success Or Failure.
Scheduling Dependency
In Scheduling Dependency, confirm that the output table is set as the output of this node.
The format is
WorkSpaceName.NodeName
-
Once the configuration is complete, click
to save the node.
-
-
Create the ods_raw_log_d_emr table.
Double-click the ods_raw_log_d_emr node to configure the ods_raw_log_d_emr node.
-
Edit the SQL code
--Create the target table corresponding to the OSS log CREATE EXTERNAL TABLE IF NOT EXISTS ods_raw_log_d_emr ( `col` STRING ) PARTITIONED BY ( dt STRING ); ALTER TABLE ods_raw_log_d_emr ADD IF NOT EXISTS PARTITION (dt='${bizdate}') LOCATION 'oss://dw-emr-demo/ods_raw_log_d/log_${bizdate}/';
NoteThe location in the code above is an example path, which should match the data destination configured in the offline synchronization task. Replace 'dw-emr-demo' with the OSS bucket name you created during environment preparation.
-
Configure the scheduling properties.
Configuration Item
Configuration Content
Image
Scheduling Parameter
Add the following in the Scheduling Parameter area:
Parameter Name: bizdate
Parameter Value: $[yyyymmdd-1]
Time Attribute
Set Rerun Property to Rerun On Success Or Failure.
Scheduling Dependency
In Scheduling Dependency, confirm that the output table is set as the output of this node.
The format is
WorkSpaceName.NodeName
-
After completing the configuration, click
to save the node.
-
-
Step 4: Run and view results
Run the business flow
-
On the Data Development page, double-click Business Flow under workshop_emr, open the business flow panel, and click the
icon in the toolbar to run the business flow according to the upstream and downstream dependencies.
-
Confirm the status.
-
View the task execution status:
-
If the node is in the
status, it indicates that the synchronization process is successful.
-
If the node is in the
status and an error such as
java.net.ConnectException: Connection timed out (Connection timed out)
occurs, you need to add a security group rule in the ECS console to open port10000
of the ECS instance and set the authorization object to the CIDR block of the VPC switch in the resource group. You can click Network Settings of the corresponding resource group in the resource group list to obtain the CIDR block. For more information about how to add a security group rule, see Add a security group rule.
-
-
View the task execution logs: Right-click the Ods_user_info_d_emr, Ods_raw_log_d_emr nodes, and select View Logs. If the information shown in the following figure appears in the logs, the node is run and data is synchronized.
-
Query synchronization results
-
Create an ad hoc query.
On the Data Development page, click
in the left-side navigation pane to enter the ad hoc query panel. Right-click Ad Hoc Query, and select Create Node>EMR Hive.
-
Query synchronization result tables.
Execute the following SQL statements to confirm the data write results. View the number of records that are imported into the ods_raw_log_d_emr and ods_user_info_d_emr tables.
--The partition key column in the query statement needs to be updated to the data timestamp. For example, if the node is run on November 7, 2019, the data timestamp is 20191106, which is one day before the node is run. SELECT * from ods_user_info_d_emr where dt=Data timestamp; SELECT * from ods_raw_log_d_emr where dt=Data timestamp;
What to do next
After familiarizing yourself with data synchronization in this tutorial, you can move on to the next one. The following tutorial will teach you how to compute and analyze the synchronized data. For more information, see Process data.