This topic describes how to use DataWorks to collect logs to MaxCompute.
Background information
In this topic, you add an Object Storage Service (OSS) bucket and an ApsaraDB RDS instance as the data sources from which data is read. You also create the tables to which the data is written.
- You can use the data sources that are prepared for you in this topic. You can also use your own data sources.
- The prepared data sources reside in the China (Shanghai) region. We recommend that you use a workspace in the China (Shanghai) region to make sure that the prepared data sources are accessible when you add these data sources.
Add an OSS data source
- Go to the Data Source page.
- On the Data Source page, click Add data source in the upper-right corner.
- In the Add data source dialog box, click OSS.
- In the Add OSS data source dialog box, configure the parameters.
  - Data Source Name: The name of the data source. Enter oss_workshop_log.
  - Data Source Description: The description of the data source.
  - Environment: The environment in which the data source is used. Select Development. Note: After you add the data source in the development environment, add the same data source in the production environment by selecting Production. Otherwise, an error is reported when a node that uses this data source is run in the production environment.
  - Endpoint: The OSS endpoint. Enter http://oss-cn-shanghai-internal.aliyuncs.com.
  - Bucket: The name of the OSS bucket. Enter new-dataworks-workshop.
  - AccessKey ID & AccessKey Secret: The AccessKey pair that is used to connect to OSS. Enter LTAI4FvGT3iU4xjKotpUMAjS for the AccessKey ID and 9RSUoRmNxpRC9EhC4m9PjuG7Jzy7px for the AccessKey secret. Note: To help you better experience DataWorks services, an AccessKey pair that has read-only permissions is provided for trial use in this topic. The sample AccessKey pair can be used only to read the sample data. In the production environment, replace it with the AccessKey pair that is used to connect to your own data source.
- Test the connectivity between the data source and the resource group that you select. In this topic, the shared resource group for Data Integration (debugging) is used to run the node that uses this data source. Click Advanced in the lower-right corner of the dialog box to add the shared resource group for Data Integration (debugging) and test the connectivity between the data source and the resource group. We recommend that you use an exclusive resource group for Data Integration because exclusive resource groups provide enhanced capabilities that help ensure reliable data synchronization. For more information about how to add and use an exclusive resource group for Data Integration, see Create and use an exclusive resource group for Data Integration. A synchronization node can use only one type of resource group. To ensure that your synchronization nodes can run as expected, test the connectivity between the data source and all the resource groups for Data Integration on which your synchronization nodes will be run. To test the connectivity of multiple resource groups for Data Integration at a time, select the resource groups and click Batch test connectivity. For more information, see Establish a network connection between a resource group and a data source. Note:
  - By default, the resource group list displays only exclusive resource groups for Data Integration. To ensure the stability and performance of data synchronization, we recommend that you use exclusive resource groups for Data Integration.
  - If you want to test the network connectivity between the shared resource group or a custom resource group and the data source, click Advanced below the resource group list. In the Warning message, click Confirm. Then, all available shared and custom resource groups appear in the resource group list.
- After the connection passes the connectivity test, click Complete. Note:
  - If the connectivity test fails, check whether the AccessKey ID and AccessKey secret that you entered are correct and whether the DataWorks workspace is in the China (Shanghai) region.
  - If OSS cannot be accessed by using the internal endpoint, use the public endpoint.
  - If you want to double-check the endpoint, bucket, and AccessKey pair outside DataWorks, see the sketch after these steps.
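The following Python sketch is optional and is not part of the DataWorks procedure. It uses the oss2 SDK to verify that the bucket name and the trial AccessKey pair from the table above are usable. The public endpoint http://oss-cn-shanghai.aliyuncs.com is assumed here (derived from the internal endpoint in the table, because the internal endpoint is reachable only from within Alibaba Cloud).

# Minimal sketch, assuming the oss2 SDK is installed and you run it outside Alibaba Cloud.
# The bucket name and AccessKey pair are the trial values used in this topic; replace
# them with your own values in the production environment.
import oss2

auth = oss2.Auth('LTAI4FvGT3iU4xjKotpUMAjS', '9RSUoRmNxpRC9EhC4m9PjuG7Jzy7px')
bucket = oss2.Bucket(auth, 'http://oss-cn-shanghai.aliyuncs.com', 'new-dataworks-workshop')

# List a few objects; success indicates that the endpoint, bucket, and credentials are valid.
for obj in oss2.ObjectIterator(bucket, max_keys=5):
    print(obj.key)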
Add an ApsaraDB RDS data source
- Go to the Data Source page.
- On the Data Source page, click Add data source in the upper-right corner.
- In the Add data source dialog box, click MySQL.
- In the Add MySQL data source dialog box, configure the parameters.
  - Data Source Type: Set this parameter to Connection String Mode.
  - Data Source Name: The name of the data source. Enter rds_workshop_log.
  - Data Source Description: The description of the data source. Enter RDS user information synchronization.
  - Environment: The environment in which the data source is used. Select Development. Note: After you add the data source in the development environment, add the same data source in the production environment by selecting Production. Otherwise, an error is reported when a node that uses this data source is run in the production environment.
  - JDBC URL: Set this parameter to jdbc:mysql://rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com:3306/workshop.
  - Username: The username that is used to connect to the ApsaraDB RDS database. Enter workshop.
  - Password: The password that is used to connect to the ApsaraDB RDS database. Enter workshop#2017.
  - Special Authentication Method: Set this parameter to None Auth.
- Test the connectivity between the data source and the resource group that you select. In this topic, Data Integration is selected for Resource Group connectivity. Click Advanced in the lower-right corner of the Add MySQL data source dialog box to add a shared resource group and test the connectivity between the data source and the shared resource group. Because the data source is added in connection string mode and is accessed by using a public IP address, the MySQL data source that DataWorks provides for this tutorial can be accessed over the Internet. You are charged for the Internet traffic that is generated when the data source is accessed by using the public IP address. For information about the billing of Internet traffic in DataWorks, see Billing of Internet traffic. Important: If the data sources used for your data synchronization node are added in Alibaba Cloud instance mode, you cannot use the shared resource group for Data Integration to run the node to synchronize data across accounts or regions. We recommend that you use an exclusive resource group for Data Integration because exclusive resource groups provide enhanced capabilities that help ensure reliable data synchronization. For information about how to add and use an exclusive resource group for Data Integration, see Create and use an exclusive resource group for Data Integration. A synchronization node can use only one type of resource group. To ensure that your synchronization node can be run as expected, you must test the connectivity between the data source and all resource groups for Data Integration on which your synchronization node will be run. If you want to test the connectivity for multiple resource groups for Data Integration at a time, select the resource groups and click Batch test connectivity. For more information, see Establish a network connection between a resource group and a data source. If you want to verify the connection parameters outside DataWorks, see the sketch after these steps.
- After the data source passes the connectivity test, click Complete.
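The following Python sketch is optional and is not part of the DataWorks procedure. It uses PyMySQL to verify the connection parameters from the JDBC URL above over the public IP address. The host, port, database, username, and password are the trial values used in this topic.

# Minimal sketch, assuming PyMySQL is installed and the data source is reachable over the Internet.
# Replace the values with your own in the production environment.
import pymysql

conn = pymysql.connect(
    host='rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com',
    port=3306,
    user='workshop',
    password='workshop#2017',
    database='workshop',
)
try:
    with conn.cursor() as cursor:
        # A trivial query is enough to confirm that the connection parameters are valid.
        cursor.execute('SELECT 1')
        print(cursor.fetchone())
finally:
    conn.close()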
Create a workflow
- Click the icon in the upper-left corner of the Data Integration page and choose .
- In the Scheduled Workflow pane, right-click Business Flow and select Create Workflow.
- In the Create Workflow dialog box, configure the Workflow Name and Description parameters. Important The workflow name can be a maximum of 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
- Click Create.
- Double-click the new workflow to go to the workflow configuration tab. Drag Zero-Load Node in the General section to the canvas on the right.
- In the Create Node dialog box, set the Name parameter to workshop_start and click Commit.
- Drag Batch Synchronization in the Data Integration section to the canvas on the right to create two batch synchronization nodes named oss_synchronization and rds_synchronization.
- Drag directed lines to configure the workshop_start node as the ancestor node of the two batch synchronization nodes.
Configure the workshop_start node
- In the Scheduled Workflow pane, double-click the workshop_start node in the workflow. On the node configuration tab, click the Properties tab in the right-side navigation pane.
- In the Dependencies section, click Add Root Node to configure the root node of the workspace as the ancestor node of the workshop_start node. In the latest version of DataWorks, each node must have its ancestor and descendant nodes. Therefore, you must configure an ancestor node for the workshop_start node. In this example, the root node of the workspace is configured as the ancestor node of the workshop_start node. The root node of the workspace is named in the Workspace name_root format.
- Click the icon in the top toolbar.
Create tables to which you want to write data
- In the Scheduled Workflow pane, click the new workflow, right-click MaxCompute, and then choose .
- In the Create Table dialog box, configure the Name parameter and click Create. In this workshop, you must create two tables named ods_raw_log_d and ods_user_info_d. The ods_raw_log_d table is used to store logs that are synchronized from OSS and the ods_user_info_d table is used to store user information that is synchronized from ApsaraDB RDS.Important The table name can be a maximum of 64 characters in length. It must start with a letter and cannot contain special characters.
- Create the tables by executing DDL statements.
- Create the ods_raw_log_d table. On the creation tab of the ods_raw_log_d table, click DDL Statement. In the DDL Statement dialog box, enter the following statement to create the table.
-- Create the ods_raw_log_d table.
CREATE TABLE IF NOT EXISTS ods_raw_log_d
(
    col STRING
)
PARTITIONED BY
(
    dt STRING
);
- Create the ods_user_info_d table. On the creation tab of the ods_user_info_d table, click DDL Statement. In the DDL Statement dialog box, enter the following statement to create the table.
-- Create the ods_user_info_d table.
CREATE TABLE IF NOT EXISTS ods_user_info_d
(
    uid STRING COMMENT 'User ID',
    gender STRING COMMENT 'Gender',
    age_range STRING COMMENT 'Age range',
    zodiac STRING COMMENT 'Zodiac sign'
)
PARTITIONED BY
(
    dt STRING
);
- Click Generate Table Schema. In the Confirm message, click OK.
- On the creation tab for each table, enter the display name in the General section.
- After the creation is complete, click Commit to Development Environment and Commit to Production Environment in sequence.
Configure the batch synchronization nodes
After the nodes are deployed in the production environment, you can apply for the permissions to read data from and write data to the tables that are in the development environment.
- Configure the oss_synchronization node.
- Configure the rds_synchronization node.
Commit the workflow
- In the Scheduled Workflow pane, double-click the new workflow. On the workflow configuration tab, click the icon in the top toolbar.
- In the Commit dialog box, select the nodes to be committed, enter your comments in the Change description field, and then select Ignore I/O Inconsistency Alerts.
- Click Commit. The Committed successfully message appears.
Run the workflow
- In the Scheduled Workflow pane, double-click the new workflow. On the workflow configuration tab that appears, click the icon in the top toolbar.
- Right-click the rds_synchronization node and select View Log. If the logs indicate that the node is successfully run, data is synchronized from the ApsaraDB RDS data source.
- Right-click the oss_synchronization node and select View Log. View the logs to check whether the oss_synchronization node is run and data is synchronized.
Verify data synchronization to MaxCompute
- In the left-side navigation pane, click Ad-Hoc Query.
- In the Ad-Hoc Query pane, right-click Ad-Hoc Query and choose .
- In the Create Node dialog box, enter the node name and click Commit. On the node configuration tab that appears, write and execute SQL statements to view the number of data records that are synchronized to the ods_raw_log_d and ods_user_info_d tables. Note Execute the following SQL statements. In each statement, change the partition key value to the data timestamp of the node. For example, if the node is run on July 17, 2018, the data timestamp is 20180716, which is one day before the node is run.
-- Check whether the data is written to MaxCompute.
select count(*) from ods_raw_log_d where dt=Data timestamp of the node;
select count(*) from ods_user_info_d where dt=Data timestamp of the node;
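If you prefer to run the same check programmatically, the following Python sketch uses PyODPS. The AccessKey pair, project name, and MaxCompute endpoint are placeholders that you must replace with your own values; the partition value '20180716' is the example data timestamp from the note above (a node run on July 17, 2018).

# Minimal sketch, assuming PyODPS is installed. Placeholders must be replaced with your
# own AccessKey pair, MaxCompute project, and endpoint.
from odps import ODPS

o = ODPS(
    '<your-access-key-id>',
    '<your-access-key-secret>',
    project='<your-maxcompute-project>',
    endpoint='http://service.cn-shanghai.maxcompute.aliyun.com/api',
)

# Count the records in each partition written by the synchronization nodes.
for table in ('ods_raw_log_d', 'ods_user_info_d'):
    with o.execute_sql(
        f"select count(*) from {table} where dt='20180716'"
    ).open_reader() as reader:
        for record in reader:
            print(table, record[0])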
Subsequent steps
Now that you understand how to collect and synchronize data, you can proceed to the next tutorial, in which you learn how to compute and analyze the collected data. For more information, see Process data.