This topic describes how to add HttpFile and MySQL data sources to access the basic user information and the user website access logs that are provided for this tutorial. It also describes how to configure batch synchronization tasks that synchronize the data to the Object Storage Service (OSS) data source you added during environment preparation, and how to use E-MapReduce (EMR) Hive nodes to create tables and query the synchronized data in OSS. You can then query the data to verify that the synchronization is complete.
Prerequisites
An environment is prepared. For more information, see Prepare environments.
A security group rule is added to the security group that is associated with the EMR cluster in the ECS console to ensure network connectivity. The rule must allow access over port 10000 of the Elastic Compute Service (ECS) instances so that DataWorks can connect to the cluster, and its Authorization Object must be set to the CIDR block of the vSwitch with which the resource group is associated. For more information, see Add a security group rule.
Step 1: Add data sources
To ensure smooth data processing in subsequent operations, you must add the following data sources to your DataWorks workspace to obtain and store the test data provided by the system.
MySQL data source: a test data source provided by DataWorks to store the basic user information.
HttpFile data source: a test data source provided by DataWorks to store the user website access logs.
OSS data source: the OSS-HDFS data source that you created in Prepare an EMR environment. The OSS data source is used to store the test data of the basic user information and user website access logs obtained from the HttpFile and MySQL data sources.
Add a MySQL data source named user_behavior_analysis_mysql
The basic user information provided by this tutorial is stored in a MySQL database. You must add the MySQL database to your DataWorks workspace as a data source. This way, you can obtain the basic user information from the table ods_user_info_d in the data source.
Log on to the DataWorks console and go to the SettingCenter page. In the left-side navigation pane of the SettingCenter page, click Data Sources.
On the Data Sources page, click Add Data Source. In the Add Data Source dialog box, click MySQL.
On the Add MySQL Data Source page, configure the parameters. In this tutorial, the parameters for the development environment and the production environment are configured by referring to the instructions in the following table.
The following table describes the key parameters that you must configure in this tutorial. You can retain default values for other parameters.
Data Source Name: Enter a name for the data source. In this tutorial, user_behavior_analysis_mysql is used.
Data Source Description: Enter a description for the data source. In this tutorial, the following description is entered: The data source is exclusive for experiencing cases in DataWorks. You can configure the data source in a batch synchronization task to access the test data provided by the system. The data source can be used only for data reading in Data Integration and cannot be used in other services.
Configuration Mode: Select Connection String Mode.
Connection Address: For Host IP Address, enter rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com. For Port Number, enter 3306.
Database Name: Enter a database name. In this tutorial, workshop is entered.
Username: Enter the username used to connect to the database. In this tutorial, workshop is entered.
Password: Enter the password used to connect to the database. In this tutorial, workshop#2017 is entered.
Authentication Method: Select No Authentication.
In the Connection Configuration section, find the resource group that you want to use and separately click Test Network Connectivity in the Connection Status (Development Environment) and Connection Status (Production Environment) columns. Make sure that the displayed connection status is Connected.
Important: You must make sure that the resource group is associated with your workspace and can access the Internet. Otherwise, an error is reported when you run the synchronization task configured for the data source. For more information, see Prepare environments.
If no resource group is available, you can click Purchase or Associate Purchased Resource Group in the prompt message that is displayed in the Connection Configuration section to purchase a resource group or associate your resource group with your workspace.
Click Complete Creation.
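To confirm that the test data is reachable before you configure synchronization, you can preview it with any MySQL client by using the connection information above. The following is a minimal check, assuming the client machine can reach the instance over port 3306:

```sql
-- Connect to database workshop on rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com:3306
-- with username workshop and password workshop#2017, then preview a few rows
-- of the basic user information test data.
SELECT *
FROM ods_user_info_d
LIMIT 10;
```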
Add an HttpFile data source named user_behavior_analysis_httpfile
The user website access records provided by this tutorial are stored in OSS. You must add an HttpFile data source to your DataWorks workspace. This way, you can obtain the user website access records from the OSS object user_log.txt in the data source.
Go to the Data Sources page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Management Center. On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
In the left-side navigation pane of the SettingCenter page, click Data Sources.
On the Data Sources page, click Add Data Source. In the Add Data Source dialog box, click HttpFile.
On the Add HttpFile Data Source page, configure the parameters. In this tutorial, the parameters for the development environment and the production environment are configured by referring to the instructions in the following table.
The following table describes the key parameters that you must configure in this tutorial. You can retain default values for other parameters.
Data Source Name: Enter a name for the data source. In this tutorial, user_behavior_analysis_httpfile is entered.
Data Source Description: Enter a description for the data source. In this tutorial, the following description is entered: The data source is exclusive for experiencing cases in DataWorks. You can configure the data source in a batch synchronization task to access the test data provided by the system. The data source can be used only for data reading in Data Integration and cannot be used in other services.
URL: Enter a URL. In this tutorial, the URL for both the development environment and the production environment is set to https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com.
In the Connection Configuration section, find the resource group that you want to use and separately click Test Network Connectivity in the Connection Status (Development Environment) and Connection Status (Production Environment) columns. Make sure that the displayed connection status is Connected.
Important: You must make sure that the resource group is associated with your workspace and can access the Internet. Otherwise, an error is reported when you run the synchronization task configured for the data source. For more information, see Prepare environments.
If no resource group is available, you can click Purchase or Associate Purchased Resource Group in the prompt message that is displayed in the Connection Configuration section to purchase a resource group or associate your resource group with your workspace.
Click Complete Creation.
Add an OSS data source named test_g
You can perform the following operations to add the OSS data source created in Prepare an EMR environment to DataWorks to store the data that is synchronized from the MySQL data source and the HttpFile data source.
Log on to the DataWorks console and go to the SettingCenter page. In the left-side navigation pane of the SettingCenter page, click Data Sources.
On the Data Sources page, click Add Data Source. In the Add Data Source dialog box, click OSS.
On the Add OSS Data Source page, configure the parameters. In this tutorial, the parameters for the development environment and the production environment are configured by referring to the instructions in the following table.
Data Source Name: The name of the data source. In this example, test_g is used.
Data Source Description: The description of the data source.
Access Mode: Select AccessKey Mode.
AccessKey ID: The AccessKey ID of the account that is used to log on to DataWorks. You can go to the AccessKey page to copy the AccessKey ID.
AccessKey Secret: The AccessKey secret of the account that is used to log on to DataWorks. Important: The AccessKey secret is displayed only during creation and cannot be viewed afterward. Keep it confidential. If the AccessKey pair is leaked or lost, delete it and create a new one.
Region: In this example, select China (Shanghai).
Endpoint: The endpoint of OSS. In this example, enter oss-cn-shanghai-internal.aliyuncs.com.
Bucket: The name of the OSS bucket that you created when you prepared the environment. In this example, the bucket name is dw-emr-demo.
In the Connection Configuration section, find the resource group that you want to use and separately click Test Network Connectivity in the Connection Status (Development Environment) and Connection Status (Production Environment) columns. Make sure that the displayed connection status is Connected.
Important: You must make sure that the resource group is associated with your workspace and can access the Internet. Otherwise, an error is reported when you run the synchronization task configured for the data source. For more information, see Prepare environments.
If no resource group is available, you can click Purchase or Associate Purchased Resource Group in the prompt message that is displayed in the Connection Configuration section to purchase a resource group or associate your resource group with your workspace.
Click Complete Creation.
Step 2: Establish a synchronization link
In this step, you need to establish a synchronization link to synchronize the basic user information and user website access records to Hive tables to make preparations for subsequent data processing.
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and open Data Studio from the Actions column.
In the left-side navigation pane of the Data Studio page, click the DATASTUDIO icon to go to the DATASTUDIO pane. The Workspace Directories section appears.
Create a workspace directory named work and a workflow named workshop_emr. For more information, see Workspace directories and Auto triggered workflow.
Create a zero load node, two batch synchronization nodes, and two EMR Hive nodes in the workflow:
On the configuration tab of the workflow, drag Zero Load Node to the canvas. In the Create Node dialog box, set the Node Name parameter to workshop_start_emr and click Confirm.
On the configuration tab of the workflow, drag Batch Synchronization to the canvas twice. In the Create Node dialog box, set the Node Name parameter to ods_user_info_d_2oss_emr for the first node and ods_raw_log_d_2oss_emr for the second, and click Confirm. These nodes synchronize the basic user information stored in MySQL and the user website access logs stored in HttpFile to the OSS bucket that you created.
On the configuration tab of the workflow, drag EMR Hive to the canvas twice. In the Create Node dialog box, set the Node Name parameter to ods_user_info_d_emr for the first node and ods_raw_log_d_emr for the second, and click Confirm. These nodes load the data from the OSS bucket into Hive tables.
The following list describes the nodes that are used in this tutorial and their functionalities.
workshop_start_emr (zero load node): Manages the entire user profile analysis workflow and clarifies the data forwarding path. This is a dry-run node. You do not need to write code for it.
ods_user_info_d_2oss_emr (batch synchronization node): Synchronizes the basic user information stored in MySQL to the OSS data source test_g.
ods_raw_log_d_2oss_emr (batch synchronization node): Synchronizes the user website access logs stored in HttpFile to the OSS data source test_g.
ods_user_info_d_emr (EMR Hive node): Loads the basic user information from the OSS data source test_g into the Hive table ods_user_info_d_emr.
ods_raw_log_d_emr (EMR Hive node): Loads the user website access logs from the OSS data source test_g into the Hive table ods_raw_log_d_emr.
On the workflow configuration tab, drag directed lines to configure the workshop_start_emr node as the ancestor node of the two batch synchronization nodes.
Configure scheduling properties for the workflow.
In the right-side navigation pane of the configuration tab of the workflow, click Properties. On the Properties tab, configure the parameters. The following table describes the key parameters that you must configure in this tutorial. You can retain default values for other parameters.
Scheduling Parameters: You can configure scheduling parameters for the workflow. The inner nodes of the workflow can directly use the configured scheduling parameters. In this tutorial, bizdate=$[yyyymmdd-1] is configured to obtain the date of the previous day. For an example of how an inner node can reference this parameter, see the sketch after these steps.
Scheduling Cycle: In this tutorial, this parameter is set to Day.
Scheduling Time: In this tutorial, this parameter is set to 00:30. This way, the workflow starts to run at 00:30 each day.
Scheduling Dependencies: The workflow does not need to depend on another node. You can ignore this parameter. To facilitate management, you can click Use Workspace Root Node to configure the root node of the workspace as the ancestor node of the workflow. The name of the root node is in the Workspace name_root format.
After you configure the workflow scheduling properties, click Save. Click Change Review. In the Change Review dialog box, check the configuration of the workflow. Then, click Save.
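The inner nodes of the workflow reference the workflow-level parameter as ${bizdate}. The following Hive SQL sketch illustrates how an EMR Hive node could use it; the table name comes from this tutorial, but the dt partition column is an assumption for illustration, so adapt it to your actual table definition.

```sql
-- Illustrative only. At run time, ${bizdate} is replaced with the value of
-- $[yyyymmdd-1], for example 20250223 if that value is entered in Step 4.
-- The dt partition column is an assumption for illustration.
ALTER TABLE ods_user_info_d_emr
  ADD IF NOT EXISTS PARTITION (dt = '${bizdate}');

SELECT COUNT(*)
FROM ods_user_info_d_emr
WHERE dt = '${bizdate}';
```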
Step 3: Configure the inner nodes of the workflow
You can perform the following steps to synchronize the basic user information in the MySQL data source and the user website access logs in the HttpFile data source to the OSS data source that you created, and to create Hive external tables to parse and obtain the synchronized data in OSS.
Configure the zero load node
You can perform the following operations to configure the scheduling properties of the zero load node.
Open the zero load node.
In the canvas, move the pointer over the zero load node and click Open Node.
Configure the zero load node.
In the right-side navigation pane of the configuration tab of the zero load node, click Properties. On the Properties tab, configure the parameters. The following list describes the key parameters that you must configure in this tutorial.
Support For Rerun: Set the value to Allow Regardless of Running Status.
Resource Group For Scheduling: Select the serverless resource group that you purchased in the environment preparation phase.
Scheduling Dependencies: This node does not need to depend on another node. You can ignore this parameter. To facilitate management, you can click Use Workspace Root Node to configure the root node of the workspace as the ancestor node. The name of the root node is in the Workspace name_root format.
Save the zero load node.
After the configuration is complete, click the Save icon in the top toolbar to save the node.
Step 4: Run the batch synchronization nodes
In the top toolbar of the configuration tab of the workflow, click Run. In the Enter runtime parameters dialog box, specify the value to use for the scheduling parameters defined for each node in this run, and click OK. In this tutorial, 20250223 is specified. You can specify a value based on your business requirements.
Step 5: Parse OSS data
Wait until the batch synchronization nodes are complete. You can then create Hive external tables to parse the test data that is stored in the test_g OSS data source.
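The exact DDL depends on the file layout that your synchronization nodes write. The following Hive SQL is a minimal sketch, assuming that each raw log record is stored as one line of text, that the user information is comma-delimited, and that the OSS paths under the dw-emr-demo bucket match the output paths of your synchronization nodes; adjust all three to your setup.

```sql
-- External table over the user website access logs synchronized from HttpFile.
CREATE EXTERNAL TABLE IF NOT EXISTS ods_raw_log_d_emr (
  col STRING  -- one raw access-log record per line
)
PARTITIONED BY (dt STRING)
LOCATION 'oss://dw-emr-demo/ods_raw_log_d/';  -- assumed output path

-- External table over the basic user information synchronized from MySQL.
CREATE EXTERNAL TABLE IF NOT EXISTS ods_user_info_d_emr (
  uid STRING,        -- assumed columns; align with the MySQL table ods_user_info_d
  gender STRING,
  age_range STRING,
  zodiac STRING
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'oss://dw-emr-demo/ods_user_info_d/';  -- assumed output path

-- After each run, register the partition for the business date, for example:
ALTER TABLE ods_raw_log_d_emr ADD IF NOT EXISTS PARTITION (dt = '${bizdate}');
ALTER TABLE ods_user_info_d_emr ADD IF NOT EXISTS PARTITION (dt = '${bizdate}');
```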
What to do next
After you understand how to synchronize data based on this tutorial, you can proceed to the next tutorial. In the next tutorial, you will learn how to compute and analyze the synchronized data. For more information, see Process data.