This topic describes how to add HttpFile and MySQL data sources to access the basic user information and user website access logs that are provided in this tutorial. This topic also describes how to configure data synchronization tasks to synchronize the data to the Object Storage Service (OSS) data source that you added during environment preparation, and how to use E-MapReduce (EMR) Hive nodes to create tables and query the synchronized data in OSS. You can then query the data synchronization results to verify that the synchronization is complete.
Prerequisites
An environment is prepared. For more information, see Prepare environments.
An inbound rule is added to the security group that is associated with the EMR cluster in the ECS console to ensure network connectivity. In the inbound rule, the Action parameter is set to Allow, the Port Range parameter is set to 10000, and the Authorization Object parameter is set to the CIDR block of the vSwitch with which the resource group is associated. For more information, see Add a security group rule.
Step 1: Add data sources
To ensure smooth data processing in subsequent operations, you must add the following data sources to your DataWorks workspace to obtain and store the test data provided by the system.
MySQL data source: a test data source provided by DataWorks to store the basic user information.
HttpFile data source: a test data source provided by DataWorks to store the user website access logs.
OSS data source: the OSS-HDFS data source that you created in Prepare an EMR environment. The OSS data source is used to store the test data of the basic user information and user website access logs obtained from the HttpFile and MySQL data sources.
Add a MySQL data source named user_behavior_analysis_mysql
The basic user information provided by this tutorial is stored in a MySQL database. You must add the MySQL database to your DataWorks workspace as a data source. This way, you can obtain the basic user information from the MySQL table ods_user_info_d in the data source.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, click Data Sources.
On the Data Sources page, click Add Data Source. In the Add Data Source dialog box, enter MySQL in the search box and click MySQL.
On the Add MySQL Data Source page, configure the parameters. In this tutorial, the parameters for the development environment and the production environment are configured by referring to the instructions in the following table.
The key parameters that you must configure in this tutorial are described below. You can retain the default values for other parameters.
Data Source Name: Enter a name for the data source. In this tutorial, user_behavior_analysis_mysql is used.
Data Source Description: Enter a description for the data source. In this tutorial, the following description is entered: The data source is exclusive for experiencing cases in DataWorks. You can configure the data source in a batch synchronization task to access the test data provided by the system. The data source can be used only for data reading in Data Integration and cannot be used in other services.
Configuration Mode: Select Connection String Mode.
Connection Address: For Host IP Address, enter rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com. For Port Number, enter 3306.
Database Name: Enter a database name. In this tutorial, workshop is entered.
Username: Enter the username that is used to connect to the database. In this tutorial, workshop is entered.
Password: Enter the password that is used to connect to the database. In this tutorial, workshop#2017 is entered.
Authentication Method: Select No Authentication.
In the Connection Configuration section, find the resource group that you want to use and separately click Test Network Connectivity in the Connection Status (Development Environment) and Connection Status (Production Environment) columns. Make sure that the displayed connection status is Connected.
Important: You must make sure that the resource group is associated with your workspace and can access the Internet. Otherwise, an error is reported when you run the synchronization task configured for the data source. For more information, see Prepare environments.
If no resource group is available, you can click Purchase or Associate Purchased Resource Group in the prompt message that is displayed in the Connection Configuration section to purchase a resource group or associate your resource group with your workspace.
Click Complete Creation.
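After the data source is created, you can optionally preview the source data to confirm that the database is accessible. The following is a minimal sketch of a preview query that you can run against the workshop database; the table name comes from this tutorial, but the query itself is an illustration rather than a required step.
-- Preview a few rows of the sample user data in the MySQL source.
-- This query is optional and is not part of the tutorial steps.
SELECT * FROM ods_user_info_d LIMIT 10;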
Add an HttpFile data source named user_behavior_analysis_httpfile
The user website access records provided by this tutorial are stored in OSS. You must add an HttpFile data source to your DataWorks workspace. This way, you can obtain the user website access records from the OSS object user_log.txt in the data source.
Go to the Data Sources page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, click Data Sources.
On the Data Sources page, click Add Data Source. In the Add Data Source dialog box, enter HttpFile in the search box and click HttpFile.
On the Add HttpFile Data Source page, configure the parameters. In this tutorial, the parameters for the development environment and the production environment are configured by referring to the instructions in the following table.
The key parameters that you must configure in this tutorial are described below. You can retain the default values for other parameters.
Data Source Name: Enter a name for the data source. In this tutorial, user_behavior_analysis_httpfile is entered.
Data Source Description: Enter a description for the data source. In this tutorial, the following description is entered: The data source is exclusive for experiencing cases in DataWorks. You can configure the data source in a batch synchronization task to access the test data provided by the system. The data source can be used only for data reading in Data Integration and cannot be used in other services.
URL: Enter a URL. In this tutorial, the URL parameter for both the development environment and the production environment is set to https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com.
In the Connection Configuration section, find the resource group that you want to use and separately click Test Network Connectivity in the Connection Status (Development Environment) and Connection Status (Production Environment) columns. Make sure that the displayed connection status is Connected.
Important: You must make sure that the resource group is associated with your workspace and can access the Internet. Otherwise, an error is reported when you run the synchronization task configured for the data source. For more information, see Prepare environments.
If no resource group is available, you can click Purchase or Associate Purchased Resource Group in the prompt message that is displayed in the Connection Configuration section to purchase a resource group or associate your resource group with your workspace.
Click Complete Creation.
Add an OSS data source named test_g
You can perform the following operations to add the OSS data source created in Prepare an EMR environment to DataWorks to store the data that is synchronized from the MySQL data source and the HttpFile data source.
When you add an OSS data source, you must make sure that the AliyunOSSFullAccess policy is attached to the current account. For more information, see View the permissions of a RAM user and Grant permissions to a RAM user.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, click Data Sources.
On the Data Sources page, click Add Data Source. In the Add Data Source dialog box, enter OSS in the search box and click OSS.
On the Add OSS Data Source page, configure the parameters. In this tutorial, the parameters for the development environment and the production environment are configured by referring to the following descriptions.
Data Source Name: The name of the data source. In this example, test_g is used.
Data Source Description: The description of the data source.
Access Mode: Select AccessKey Mode.
AccessKey ID: The AccessKey ID of the account that is used to log on to the DataWorks console. To obtain it, log on to the DataWorks console, move the pointer over the profile picture in the upper-right corner of the console, and click AccessKey to go to the AccessKey page. On the AccessKey page, you can obtain the AccessKey ID of the RAM user that has the required permissions.
AccessKey Secret: The AccessKey secret of the account that is used to log on to the DataWorks console. Important: The AccessKey secret is displayed only when it is created and cannot be viewed afterwards. Keep it confidential. If the AccessKey pair is leaked or lost, delete it and create a new AccessKey pair.
Region: In this example, select China (Shanghai).
Endpoint: The endpoint of OSS. In this example, enter oss-cn-shanghai-internal.aliyuncs.com.
Bucket: The name of the OSS bucket that you created when you prepared the environment. In this example, the bucket name is dw-emr-demo.
In the Connection Configuration section, find the resource group that you want to use and separately click Test Network Connectivity in the Connection Status (Development Environment) and Connection Status (Production Environment) columns. Make sure that the displayed connection status is Connected.
Important: You must make sure that the resource group is associated with your workspace and can access the Internet. Otherwise, an error is reported when you run the synchronization task configured for the data source. For more information, see Prepare environments.
If no resource group is available, you can click Purchase or Associate Purchased Resource Group in the prompt message that is displayed in the Connection Configuration section to purchase a resource group or associate your resource group with your workspace.
Click Complete Creation.
Step 2: Establish a synchronization link
In this step, you establish a synchronization link that synchronizes the basic user information and the user website access records to Hive tables in preparation for subsequent data processing.
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose in the Actions column.
In the left-side navigation pane of the Data Studio page, click the icon to go to the DATASTUDIO pane. The Workspace Directories section appears.
Create a workspace directory named work and a workflow named workshop_emr. For more information, see Workspace directories and Auto triggered workflow.
In the Create Node dialog box, create a zero load node, two batch synchronization nodes, and two EMR Hive nodes:
On the configuration tab of the workflow, drag Zero Load Node to the canvas. In the Create Node dialog box, set the Node Name parameter to workshop_start_emr and click Confirm.
On the configuration tab of the workflow, drag Batch Synchronization to the canvas twice. In the Create Node dialog box, set the Node Name parameters to ods_user_info_d_2oss_emr and ods_raw_log_d_2oss_emr to create two batch synchronization nodes, and click Confirm. The two nodes are separately used to synchronize the basic user information stored in MySQL and the user website access logs stored in HttpFile to the OSS bucket that you created.
On the configuration tab of the workflow, drag EMR Hive to the canvas twice. In the Create Node dialog box, set the Node Name parameters to ods_user_info_d_emr and ods_raw_log_d_emr to create two EMR Hive nodes, and click Confirm. The two nodes are used to synchronize the data in the OSS bucket to Hive tables.
The following list describes the node names that are used in this tutorial and the functionality of each node.
Zero load node workshop_start_emr: This node is used to manage the entire user profile analysis workflow and clarify the data forwarding path. This node is a dry-run node, and you do not need to edit code for it.
Batch synchronization node ods_user_info_d_2oss_emr: This node is used to synchronize the basic user information stored in MySQL to the OSS data source test_g.
Batch synchronization node ods_raw_log_d_2oss_emr: This node is used to synchronize the user website access logs stored in HttpFile to the OSS data source test_g.
EMR Hive node ods_user_info_d_emr: This node is used to synchronize the basic user information stored in the OSS data source test_g to the Hive table ods_user_info_d_emr.
EMR Hive node ods_raw_log_d_emr: This node is used to synchronize the user website access logs stored in the OSS data source test_g to the Hive table ods_raw_log_d_emr.
On the workflow configuration tab, drag directed lines to configure the workshop_start_emr node as the ancestor node of the two batch synchronization nodes.
Step 3: Configure the inner nodes of the workflow
You can perform the following steps to synchronize the basic user information in the MySQL data source and the user website access logs in the HttpFile data source to the OSS data source that you created, and to create Hive external tables to parse and obtain the synchronized data in OSS.
Configure the zero load node
You can perform the following operations to configure the scheduling properties of the zero load node.
Open the zero load node.
In the canvas, move the pointer over the zero load node and click Open Node.
Configure the zero load node.
In the right-side navigation pane of the configuration tab of the zero load node, click Properties. On the Properties tab, configure the parameters. The key parameters that you must configure in this tutorial are described below.
Resource Group For Scheduling: Select the serverless resource group that you purchased in the environment preparation phase.
Scheduling Dependencies: The workflow does not need to depend on another node, and you can ignore this parameter. To facilitate management, you can click Use Workspace Root Node to configure the root node of the workspace as the ancestor node of the workflow. The name of the root node is in the Workspace name_root format.
Save the zero load node.
After the configuration is complete, click the save icon in the top toolbar to save the node.
Step 4: Parse OSS data
In this step, you edit the two EMR Hive nodes to create Hive external tables that parse the test data synchronized to the test_g OSS data source. When the workflow runs, the EMR Hive nodes run after the batch synchronization nodes are complete.
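The exact node code is not reproduced in this topic. The following is a minimal sketch of what the two EMR Hive nodes might contain. It assumes OSS paths under the dw-emr-demo bucket, a |-delimited user file, and column names based on typical workshop sample data; treat the location, delimiter, and columns as assumptions rather than confirmed details of this tutorial.
-- Sketch of the ods_user_info_d_emr node: define a Hive external table over
-- the OSS directory that the batch synchronization node writes to.
-- The location, delimiter, and column names below are assumptions.
CREATE EXTERNAL TABLE IF NOT EXISTS ods_user_info_d_emr (
    uid       STRING COMMENT 'User ID',
    gender    STRING COMMENT 'Gender',
    age_range STRING COMMENT 'Age range',
    zodiac    STRING COMMENT 'Zodiac sign'
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 'oss://dw-emr-demo/ods_user_info_d/';

-- Sketch of the ods_raw_log_d_emr node: the raw access logs are parsed in a
-- later tutorial, so a single STRING column holding each log line suffices.
CREATE EXTERNAL TABLE IF NOT EXISTS ods_raw_log_d_emr (
    col STRING COMMENT 'Raw log line'
)
PARTITIONED BY (dt STRING)
LOCATION 'oss://dw-emr-demo/ods_raw_log_d/';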
Step 5: Run the batch synchronization nodes
Configure scheduling properties for the workflow.
In the right-side navigation pane of the configuration tab of the workflow, click Properties. On the Properties tab, configure the parameters. The parameters that you must configure for this tutorial are described below. You can retain the default values for other parameters.
Scheduling Parameters: You can configure scheduling parameters for the entire workflow, and the inner nodes of the workflow can directly use them. In this tutorial, the scheduling parameter bizdate=$[yyyymmdd-1] is configured to obtain the date of the previous day. A sketch of how an inner node can reference this parameter appears after these steps.
Scheduling Cycle: In this tutorial, this parameter is set to Day.
Scheduling Time: In this tutorial, this parameter is set to 00:30. This way, the workflow starts to run at 00:30 every day.
Scheduling Dependencies: The workflow does not have an ancestor dependency, and you can ignore this parameter. To facilitate unified management, you can click Use Workspace Root Node to configure the root node of the workspace as the ancestor node of the workflow. The root node of a workspace is named in the Workspace name_root format.
After the scheduling properties are configured for the workflow, click Save in the top toolbar of the configuration tab of the workflow. In the Change Review dialog box, click Save.
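The following minimal sketch illustrates how an inner EMR Hive node can reference the workflow-level scheduling parameter. The statement is an illustration, not code taken from this tutorial, and it assumes the external table sketched in Step 4.
-- ${bizdate} resolves to the previous day's date at run time. For example,
-- for the workflow instance of February 23, 2025, it resolves to 20250222.
ALTER TABLE ods_user_info_d_emr ADD IF NOT EXISTS PARTITION (dt='${bizdate}');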
In the top toolbar of the configuration tab, click Run. In the Enter runtime parameters dialog box, configure the Value Used in This Run parameter to specify the value that you want to use for the scheduling parameter defined for the inner nodes of the workflow in this run, and click OK. In this tutorial, the Value Used in This Run parameter is set to 20250223. In actual business scenarios, you can change the value based on your business requirements.
If the success icon is displayed next to the ods_raw_log_d_emr and ods_user_info_d_emr nodes, the synchronization operation is successful.
If the failure icon is displayed next to the ods_raw_log_d_emr and ods_user_info_d_emr nodes, errors occurred. If the java.net.ConnectException: Connection timed out (Connection timed out) error message is displayed, you must add an inbound rule to the security group associated with the EMR cluster in the ECS console to ensure network connectivity. In the inbound rule, set the Action parameter to Allow, the Port Range parameter to 10000, and the Authorization Object parameter to the CIDR block of the vSwitch with which the resource group is associated. To obtain the CIDR block, go to the Resource Groups page of the DataWorks console, find the resource group on the Exclusive Resource Groups tab, and click Network Settings in the Actions column. For more information about how to add a security group rule, see Add a security group rule.
Note: Data synchronization is estimated to require approximately 24 minutes.
Step 6: Query the data synchronization results
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose in the Actions column.
In the left-side navigation pane of the Data Studio page, click the icon to go to the DATASTUDIO pane. The Workspace Directories section appears.
Right-click the work workspace directory that you created and create an EMR Hive node. In the Create Node dialog box, enter a name for the EMR Hive node and click OK.
Go to the configuration tab of the EMR Hive node, replace the data timestamp in the following script with the actual data timestamp, and run the script to confirm the data write results and view the number of data records that are synchronized to the ods_raw_log_d_emr and ods_user_info_d_emr tables.
Note: You must change the partition key column value in the following query statements to the data timestamps of the ods_user_info_d_emr and ods_raw_log_d_emr nodes. For example, if a node is scheduled to run on February 23, 2025, the data timestamp of the node is 20250222, which is one day earlier than the scheduling time of the node.
SELECT * FROM ods_user_info_d_emr WHERE dt='<data timestamp>';
SELECT * FROM ods_raw_log_d_emr WHERE dt='<data timestamp>';
If the results returned after you execute the preceding statements show that data exists in the destination, the data synchronization is complete.
If the results returned after you execute the preceding statements show that data does not exist in the destination, you must make sure that the value specified for the scheduling parameter defined for the inner nodes of the workflow in this run is the same as the value of the dt field in the preceding statements. You can click Running History in the right-side navigation pane of the configuration tab of the workflow, and then click View in the Actions column of the running record generated for this run to view the data timestamp that was used when the workflow ran. In the run logs of the workflow, the data timestamp is in the partition=[pt=xxx] format.
What to do next
After you understand how to synchronize data based on this tutorial, you can proceed to the next tutorial. In the next tutorial, you will learn how to compute and analyze the synchronized data. For more information, see Process data.