This tutorial shows how to create HttpFile and MySQL Data Sources to access user and website log data. You will configure a data synchronization pipeline to transfer this data to an Object Storage Service (OSS) bucket. You will then create an E-MapReduce (EMR) Hive External Table to parse the data in OSS and run a query to verify the results.
Prerequisites
Ensure the required environment is prepared. For more information, see Prepare the environment.
To ensure network connectivity, go to the ECS console and add a security group rule to the security group associated with your EMR cluster. This rule must allow inbound traffic on port 10000 from the vSwitch CIDR block of the resource group's VPC. For more information, see Add a security group rule.
1. Create data sources
Add the following Data Sources to your DataWorks workspace to retrieve and store the sample data for subsequent processing.
MySQL Data Source: A sample Data Source provided by DataWorks that stores basic user information.
HttpFile Data Source: A sample Data Source provided by DataWorks that stores user website access logs.
OSS Data Source: The EMR OSS-HDFS storage that you created in Prepare the EMR environment. This is used to store the sample user information and website access logs synchronized from the MySQL and HttpFile Data Sources.
Create a MySQL data source (user_behavior_analysis_mysql)
The sample user data for this tutorial is stored in a MySQL database. Create a MySQL Data Source to retrieve this data (ods_user_info_d).
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
On the SettingCenter page, click Data Sources in the left-side navigation pane.
Click Add Data Source, and then search for and select MySQL as the Data Source type.
On the Add MySQL Data Source page, configure the parameters. For this tutorial, use the same example values for both the development and production environments.
The following table describes only the key parameters. Use the default values for all other parameters.
Data source name: Enter a name for the Data Source. In this tutorial, enter user_behavior_analysis_mysql.
Data source description: This is a read-only Data Source for DataWorks tutorials. It provides sample data for Batch Synchronization nodes within Data Integration and is not supported in other modules.
Configuration mode: Select Connection String Mode.
Connection address: Host IP address: rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com. Port number: 3306.
Database name: Enter the database name. In this tutorial, enter workshop.
Username: Enter the username. In this tutorial, enter workshop.
Password: Enter the password. In this tutorial, enter workshop#2017.
Authentication method: No authentication.
In the Connection Configuration section, click Test Network Connectivity for both the production and development environments. Ensure that the connectivity status is Connected.
Important: Ensure that the resource group is attached to the workspace and has public network access enabled. Otherwise, data synchronization fails. For more information, see Prepare the environment.
If you do not have a resource group available, follow the prompts in the connection configuration section and click Purchase and Associate Purchased Resource Group.
Click Complete Creation.
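The MySQL Data Source is read-only and already contains the sample user table ods_user_info_d in the workshop database. As an optional sanity check that is not part of the tutorial steps, you can preview a few rows from any MySQL client connected with the values above; the LIMIT clause is only for previewing.
-- Optional sanity check (not a tutorial step): preview a few rows of the sample
-- user table in the read-only workshop database.
SELECT * FROM ods_user_info_d LIMIT 10;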
Create an HttpFile data source (user_behavior_analysis_httpfile)
The user website access logs for this tutorial are stored in a sample OSS bucket provided by DataWorks. Create an HttpFile Data Source to retrieve these logs (user_log.txt).
Go to the Data Sources page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
On the SettingCenter page, click Data Sources in the left-side navigation pane.
Click Add Data Source, and then search for and select HttpFile as the Data Source type.
On the Add HttpFile Data Source page, configure the parameters. For this tutorial, use the same example values for both the development and production environments.
The following table describes only the key parameters. Use the default values for all other parameters.
Data source name: Enter a name for the Data Source. In this tutorial, enter user_behavior_analysis_httpfile.
Data source description: This is a read-only Data Source for DataWorks tutorials. It provides sample data for Batch Synchronization nodes within Data Integration and is not supported in other modules.
URL: Set the URL to https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com for both the development and production environments.
In the Connection Configuration section, click Test Network Connectivity for both the production and development environments. Ensure that the connectivity status is Connected.
Important: Ensure that a resource group with public network access enabled is associated with the workspace to prevent data synchronization failures. For configuration steps, see Prepare the environment.
If you do not have an available resource group, follow the on-screen instructions and click Purchase and Associate Purchased Resource Group.
Click Complete Creation.
Create an OSS data source (test_g)
Add the OSS bucket you created in Prepare the EMR environment to DataWorks. This bucket will store the data synchronized from the MySQL and HttpFile Data Sources.
When adding an OSS Data Source, ensure the current account has the AliyunOSSFullAccess permission for Object Storage Service (OSS). For detailed instructions, see View the permissions of a RAM user and Manage the permissions of a RAM user.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Management Center.
On the SettingCenter page, click Data Sources in the left-side navigation pane.
Click Add Data Source, and then search for and select OSS as the Data Source type.
On the Add OSS Data Source page, configure the parameters. For this tutorial, use the same example values for both the development and production environments.
Data source name: Enter a name for the Data Source. In this example, use test_g.
Data source description: Enter a brief description for the Data Source.
Access mode: Select AccessKey Mode.
AccessKey ID: The AccessKey ID of the current account. You can log on to the DataWorks console, hover over your profile picture in the upper-right corner, and go to the AccessKey page to obtain the AccessKey ID of a RAM user with the required permissions.
AccessKey Secret: Enter the AccessKey Secret of the current account.
Important: The AccessKey Secret is displayed only upon creation and cannot be retrieved later. Keep it confidential. If your AccessKey is compromised or lost, delete it and create a new one.
Region: Select the China (Shanghai) region.
Endpoint: Enter oss-cn-shanghai-internal.aliyuncs.com.
Bucket: The name of the OSS bucket that you configured when creating the EMR cluster. In this example, use dw-emr-demo.
In the Connection Configuration section, click Test Network Connectivity for both the production and development environments. Ensure that the connectivity status is Connected.
Important: Ensure that the resource group is attached to the workspace and has public network access enabled. Otherwise, data synchronization fails. For more information, see Prepare the environment.
If you do not have a resource group available, follow the prompts in the connection configuration section and click Purchase and Associate Purchased Resource Group.
Click Complete Creation.
2. Build the synchronization pipeline
In this step, you build a synchronization pipeline to move user data and website access logs to an OSS bucket for subsequent processing.
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose in the Actions column.
On the Data Studio page, click the icon in the left-side navigation pane to go to the Data Development page. Then, select Workspace Directories in the directory tree on the left.
Create a directory named work and a workflow named workshop_emr. For more information, see Workspace Directories and Orchestrate periodic workflows.
Click the workshop_emr workflow to open the workflow orchestration page. Then, create a Zero Load Node, two Batch Synchronization nodes, and two EMR Hive nodes.
Select Zero Load Node and drag it to the editing canvas on the right. In the Create Node dialog box, enter workshop_start_emr for Node Name and click Confirm.
Drag two Data Integration nodes to the editing canvas. In the Create Node dialog box, set the source type to MySQL and the destination type to OSS, and select Batch Synchronization as the specific node type. Name the nodes ods_user_info_d_2oss_emr and ods_raw_log_d_2oss_emr, respectively. These nodes synchronize MySQL user information and HttpFile log data to your OSS bucket. Click Confirm.
Drag two EMR Hive nodes to the editing canvas. In the Create Node dialog box, name the nodes ods_user_info_d_emr and ods_raw_log_d_emr, respectively. These nodes create Hive tables to parse the data in your OSS bucket. Click Confirm.
The following table describes the example nodes used in this tutorial.
Zero Load Node (workshop_start_emr): Orchestrates the workflow and defines the data flow. This Zero Load Node requires no code editing.
Batch Synchronization node (ods_user_info_d_2oss_emr): Synchronizes basic user information from the MySQL Data Source to your OSS Data Source (test_g).
Batch Synchronization node (ods_raw_log_d_2oss_emr): Synchronizes user website access logs from the HttpFile Data Source (OSS) to your OSS Data Source (test_g).
EMR Hive node (ods_user_info_d_emr): Creates the Hive table ods_user_info_d_emr to parse basic user information from the OSS Data Source (test_g).
EMR Hive node (ods_raw_log_d_emr): Creates the Hive table ods_raw_log_d_emr to parse user website access logs from the OSS Data Source (test_g).
On the workflow orchestration page, drag connections to set the workshop_start_emr node as the upstream dependency for the two Batch Synchronization nodes. The final result should look like this:
3. Configure synchronization nodes
Follow these steps to synchronize user information from the MySQL Data Source and log data from the HttpFile Data Source to your OSS bucket. Then, create Hive External Tables to parse and query the data stored in OSS.
Configure the zero-load node
Configure the scheduling properties for the Zero Load Node as follows.
Open the Zero Load Node.
On the workflow orchestration page, hover over the Zero Load Node and click Open Node to go to the node editing page.
Configure the node.
In the right-side pane, click Properties and configure the following parameters.
Resource group for scheduling: Select the serverless resource group that you created during environment preparation.
Scheduling dependencies: The workflow has no upstream dependencies, so no configuration is needed. For unified management, you can click Use Workspace Root Node to set the Workspace Root Node as the parent dependency. The Workspace Root Node is named in the format WorkspaceName_root.
Save the node.
After you complete the configuration, click the save icon in the toolbar to save the node.
4. Parse OSS data
After the Batch Synchronization nodes finish running, follow these steps to create Hive External Tables to parse the sample data stored in your test_g OSS Data Source.
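The exact statements for these nodes are provided in the node code that you configure in DataWorks and are not reproduced here. As a rough sketch only, an external table over the synchronized OSS data could look like the following; the column list, field delimiter, and OSS location are assumptions for illustration, not the tutorial's actual definitions.
-- Sketch of a Hive external table over data synchronized to OSS (columns, delimiter,
-- and LOCATION are hypothetical and must be taken from the tutorial's node code).
CREATE EXTERNAL TABLE IF NOT EXISTS ods_user_info_d_emr (
  uid       STRING,   -- hypothetical column
  gender    STRING,   -- hypothetical column
  age_range STRING,   -- hypothetical column
  zodiac    STRING    -- hypothetical column
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','   -- assumed delimiter
STORED AS TEXTFILE
LOCATION 'oss://dw-emr-demo/ods_user_info_d_emr/';  -- assumed path in the dw-emr-demo bucket
Defining the table as EXTERNAL and partitioning it by dt matches how the verification queries later in this tutorial filter by the dt partition.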
5. Run the synchronization workflow
Configure workflow scheduling.
On the right side of the workflow orchestration page, click Properties and configure the parameters. The following table describes only the key parameters. Use the default values for all other parameters.
Scheduling parameters: Set scheduling parameters that are available to all nodes in the workflow. For this tutorial, set it to bizdate=$[yyyymmdd-1] to get the date of the previous day. A minimal example of how this parameter is referenced in node code appears at the end of this section.
Recurrence: For this tutorial, set Recurrence to Day.
Start time: For this tutorial, set the start time to 00:30. The workflow starts at 00:30 every day.
Scheduling dependencies: The workflow has no upstream dependencies, so no configuration is needed. For unified management, you can click Use Workspace Root Node to set the Workspace Root Node as the parent dependency. The Workspace Root Node is named in the format WorkspaceName_root.
After you configure the workflow scheduling properties, click Run in the toolbar at the top of the workflow canvas. Set the parameter variables for this run. This tutorial uses 20250223 as an example, but you can use a different value. Click Confirm and wait for the run to complete.
If the ods_raw_log_d_emr and ods_user_info_d_emr nodes show the success status, the synchronization succeeded.
If the ods_raw_log_d_emr and ods_user_info_d_emr nodes show the failure status and you see a java.net.ConnectException: Connection timed out (Connection timed out) error, you must add a security group rule in the ECS console to open ECS port 10000. Set the authorized object to the vSwitch CIDR block of the resource group's VPC. You can obtain the vSwitch CIDR block by clicking Network Settings for the corresponding resource group in the resource group list. For more information about how to add a security group rule, see Add a security group rule.
Data Synchronization takes approximately 24 minutes.
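As a reference for how the scheduling parameter from this section is consumed, the statement below is a minimal, hypothetical example of using ${bizdate} inside an EMR Hive node; it is not part of the tutorial's node code. At run time, DataWorks replaces ${bizdate} with the resolved value of bizdate=$[yyyymmdd-1], for example 20250222 when the run date is 20250223.
-- Hypothetical use of the bizdate scheduling parameter in an EMR Hive node:
-- ${bizdate} is replaced with the resolved data timestamp before the SQL runs.
ALTER TABLE ods_user_info_d_emr ADD IF NOT EXISTS PARTITION (dt='${bizdate}');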
6. Verify data synchronization
Go to the Workspaces page in the DataWorks console. In the top navigation bar, select a desired region. Find the desired workspace and choose in the Actions column.
On the Data Studio page, click the icon to enter Data Development, and find the Workspace Directories section in the navigation pane.
Right-click the work directory that you created and select the option to create an EMR Hive node. Enter a custom name for the EMR Hive node and press Enter to create the node.
On the EMR Hive node editing page, replace your_data_timestamp with the data timestamp from your workflow run. Run the following queries to verify that data is correctly mapped to the ods_raw_log_d_emr and ods_user_info_d_emr tables.
Note: The partition column in the query must be updated to the data timestamp. For example, if the workflow runs on 20250223, the data timestamp is 20250222, which is the day before the run date.
SELECT * FROM ods_user_info_d_emr WHERE dt=your_data_timestamp;
SELECT * FROM ods_raw_log_d_emr WHERE dt=your_data_timestamp;
If the preceding queries return data, the data synchronization is complete.
If no data is returned, ensure that the Value Used in This Run that is configured for the workflow run matches the data timestamp specified by dt in the query. You can click the workflow, click Running History in the right-side pane, and then click View in the Actions column for a run record to confirm the data timestamp value (partition=[pt=xxx]) in the workflow run log.
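If you are unsure which partitions were actually written, the following optional query (not a tutorial step) lists the partitions of the user table that contain data, so you can compare them against the Value Used in This Run.
-- Optional check (not a tutorial step): list partitions that contain data.
SELECT dt, COUNT(*) AS row_count
FROM ods_user_info_d_emr
GROUP BY dt;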
Next steps
Now that you have synchronized the data, you can proceed to the next tutorial to learn how to process and analyze it. For more information, see Process data.