DataWorks: Prepare environments

Last Updated: Dec 13, 2024

This tutorial describes how to perform user profile analysis by using DataWorks to synchronize data, process data, and monitor data quality. All data resources involved in this tutorial reside in the China (Shanghai) region. Before you start, you must create an E-MapReduce (EMR) Serverless StarRocks instance and a DataWorks workspace, and configure the required environments.

Prepare an OSS environment

In this tutorial, a custom function is used. The resources used to register the function are uploaded to Object Storage Service (OSS). Make sure that OSS is activated and an OSS bucket is created.

Prepare an EMR Serverless StarRocks environment

In this tutorial, EMR Serverless StarRocks is used to process data, so you need an EMR Serverless StarRocks instance. If you do not have one, go to the Alibaba Cloud Free Trial page to check whether you are eligible for a free trial of EMR Serverless StarRocks, or purchase an instance on the buy page of E-MapReduce Serverless StarRocks. This tutorial uses an instance with the following settings:

  • Instance Type: Compute-storage Integration.

  • Region: China (Shanghai).

  • Instance Edition: Basic Edition.

    Important

    Basic Edition is only for trial use and feature testing. The service level agreement (SLA) for this edition is not guaranteed. You can select Standard Edition for the Instance Edition parameter based on your business requirements.

  • Version: 3.1.

In this tutorial, data is processed in a database named user_behavior_analysis. After the EMR Serverless StarRocks instance is created, log on to the instance and execute the following SQL statement in the SQL Editor to create the database:

CREATE DATABASE user_behavior_analysis;
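
If you prefer to create the database from a script instead of the SQL Editor, the same statement can be issued through a client that speaks the MySQL protocol, which StarRocks is compatible with. The following is a minimal sketch and not part of the original tutorial. `build_create_database_sql` and `ensure_database` are hypothetical helpers, the identifier check is a deliberately conservative assumption, and the connection object is assumed to be any Python DB-API connection that you create yourself with a MySQL client library.

```python
import re

# Conservative assumption: allow only letters, digits, and underscores,
# starting with a letter. This is stricter than StarRocks may require.
_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def build_create_database_sql(name: str) -> str:
    """Return an idempotent CREATE DATABASE statement for a validated name."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe database name: {name!r}")
    # IF NOT EXISTS makes the statement safe to run more than once.
    return f"CREATE DATABASE IF NOT EXISTS {name}"


def ensure_database(conn, name: str) -> None:
    """Execute the statement on an open DB-API connection (hypothetical helper)."""
    cursor = conn.cursor()
    try:
        cursor.execute(build_create_database_sql(name))
    finally:
        cursor.close()


print(build_create_database_sql("user_behavior_analysis"))
# prints: CREATE DATABASE IF NOT EXISTS user_behavior_analysis
```

Unlike the plain statement above, the `IF NOT EXISTS` form does not fail if the database was already created in an earlier run.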

Prepare a DataWorks environment

Before you develop tasks in DataWorks, you must activate DataWorks. For more information, see Prepare an environment.

Step 1: Create a DataWorks workspace

  1. Log on to the DataWorks console. In the upper-left corner, select a region in which DataWorks is activated.

  2. In the left-side navigation pane of the DataWorks console, click Workspaces. On the Workspaces page, click Create Workspace. In the Create Workspace panel, configure the parameters to create a workspace. For more information, see Create a workspace.

Note
  • If a workspace exists, skip this step and use the existing workspace.

  • In this tutorial, the MySQL and HttpFile data sources reside in the China (Shanghai) region. Therefore, the China (Shanghai) region is used in this tutorial.

Step 2: Create a resource group

  1. Purchase a resource group. You must use a resource group to run StarRocks tasks in DataWorks. For more information about how to purchase resource groups, see Create and use a serverless resource group.

  2. Test the network connectivity between the StarRocks data source and the resource group. To make sure that the resource group can connect to the StarRocks data source, complete the following items. For more information about how to establish a network connection between a resource group and a data source, see Network connectivity solutions.

    • Check the StarRocks network environment.

    • Associate the resource group with the virtual private cloud (VPC) in which the StarRocks data source resides.

    • Configure the IP address whitelist of the StarRocks data source to allow the serverless resource group to access the data source.

      1. Obtain the outbound IP address of the DataWorks serverless resource group.

      2. Click the name of the EMR Serverless StarRocks instance. In the Basic Information section of the Instance Details tab, click Internal Whitelist to add the CIDR block of the vSwitch with which the serverless resource group is associated.

    • Configure a NAT gateway for the VPC with which the resource group is associated, so that the resource group can access the data source over the Internet by using the elastic IP address (EIP) that is associated with the NAT gateway.

      1. Log on to the VPC console and go to the Internet NAT Gateway page. In the top navigation bar, select the China (Shanghai) region.

      2. In the upper-left corner of the Internet NAT Gateway page, click Create NAT Gateway. Configure the parameters that are described in the following table.

        | Parameter | Description |
        | --- | --- |
        | Region | Select China (Shanghai). |
        | VPC | Select the VPC and vSwitch with which the resource group is associated. To obtain the VPC and vSwitch with which the resource group is associated, perform the following steps: Log on to the DataWorks console. In the top navigation bar, select a region. In the left-side navigation pane, click Resource Groups. On the Resource Groups page, find the created resource group and click Network Settings in the Actions column. In the Data Scheduling & Data Integration section of the VPC Binding tab, view the VPC and vSwitch with which the resource group is associated. For more information about VPCs and vSwitches, see What is a VPC? |
        | Associate vSwitch | Select the vSwitch with which the resource group is associated, as described for the VPC parameter. |
        | Access Mode | Select SNAT for All VPC Resources. |
        | EIP | Select Purchase EIP. |
        | Create Service-Linked Role | Click Create Service-Linked Role to create a service-linked role. If this is the first time you create an Internet NAT gateway, this step is required. |

        Note

        Retain the default values for other parameters that are not described in the preceding table.

      3. Click Buy Now. On the Confirm page, read the terms of service, select the Terms of Service check box, and then click Confirm.
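
The whitelist configuration in this step comes down to a CIDR membership question: does the address that the resource group uses fall inside a block that the StarRocks instance allows? The following minimal sketch shows how to check this offline with the Python standard library before you run the connectivity test in the console. The addresses and CIDR block are hypothetical placeholders, not values from this tutorial, and `is_whitelisted` is an illustrative helper rather than a DataWorks or EMR API.

```python
import ipaddress


def is_whitelisted(ip: str, cidr_blocks: list[str]) -> bool:
    """Return True if the IP address falls inside at least one CIDR block."""
    address = ipaddress.ip_address(ip)
    return any(
        # strict=False tolerates blocks written with host bits set,
        # for example 192.168.0.5/24.
        address in ipaddress.ip_network(block, strict=False)
        for block in cidr_blocks
    )


# Hypothetical values: replace them with the vSwitch CIDR block that you added
# to the internal whitelist and an address used by your resource group.
whitelist = ["192.168.0.0/24"]
print(is_whitelisted("192.168.0.15", whitelist))  # True
print(is_whitelisted("10.0.0.8", whitelist))      # False
```

A result of False suggests that the whitelist entry does not cover the address, which is one common reason for a failed connectivity test.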

Step 3: Add a StarRocks data source

In the left-side navigation pane of the DataWorks console, click Management Center. On the page that appears, select a workspace from the drop-down list and click Go to Management Center. In the left-side navigation pane of the SettingCenter page, choose Data Sources > Data Sources. On the Data Sources page, click Add Data Source. In the Add Data Source dialog box, click StarRocks. On the Add StarRocks Data Source page, select Alibaba Cloud Instance Mode for the Configuration Mode parameter to add a StarRocks data source to the DataWorks workspace.

  1. Configure the basic information about the StarRocks data source.

    Log on to the EMR console to obtain the information about the StarRocks data source. In the DataWorks console, configure the basic information of the StarRocks data source based on the information on the Instance Details tab in the EMR console. The following table describes the required parameters.

    | Parameter | Description |
    | --- | --- |
    | Data Source Name | The name of the data source. In this tutorial, set the value to Doc_StarRocks_Storage_Compute_Tightly_01. |
    | Data Source Description | The description of the data source. |
    | Configuration Mode | Set the value to Alibaba Cloud Instance Mode. |
    | Region | Set the value to China East 2 (Shanghai). |
    | Instance | Select the Serverless instance that you created. |
    | Database Name | The name of the database in StarRocks. In this tutorial, set the value to user_behavior_analysis. All operations in this tutorial are performed in this database. |
    | Username | The username of the StarRocks database. |
    | Password | The password of the StarRocks database. |

  2. Test the network connectivity between the StarRocks data source and the resource group. If the network connectivity test is successful, click Complete Creation. The StarRocks data source is added to the DataWorks workspace.
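
If the connectivity test in the console fails, it can help to confirm basic TCP reachability from a machine in the same VPC as the resource group. The following minimal sketch uses only the Python standard library. `can_connect` is a hypothetical helper, the endpoint in the usage comment is a placeholder, and port 9030 is the default StarRocks FE query port, which is an assumption here: use the endpoint and port shown on the Instance Details tab of your instance. The console test remains the authoritative check, because it runs from the resource group itself.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Usage (placeholder endpoint, not a value from this tutorial):
# can_connect("<your-starrocks-internal-endpoint>", 9030)
```

A successful TCP connection shows only that the network path and whitelist allow traffic. It does not validate the username or password of the data source.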

Step 4: Add a MySQL data source

  1. On the SettingCenter page, choose Data Sources > Data Sources. On the Data Sources page, click Add Data Source.

  2. In the Add Data Source dialog box, select MySQL.

  3. On the Add MySQL Data Source page, configure the parameters. This example uses the sample values that are described in the following table.

    | Parameter | Description |
    | --- | --- |
    | Data Source Name | The name of the data source. In this example, user_behavior_analysis_mysql is used. |
    | Data Source Description | The description of the data source. State that the data source is provided exclusively for DataWorks use cases, serves as the source of a batch synchronization task that accesses the provided test data, and can be used only for reading data in data synchronization scenarios. |
    | Configuration Mode | Set this parameter to Connection String Mode. |
    | Connection Address | Host IP Address: Enter rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com. Port Number: Enter 3306. |
    | Database Name | The name of the database. In this example, workshop is used. |
    | Username | The username. In this example, workshop is used. |
    | Password | The password. In this example, workshop#2017 is used. |
    | Authentication Method | Set this parameter to No Authentication. |

  4. Find the resource group that you want to use, and click Test Network Connectivity in both the Connection Status (Development Environment) column and the Connection Status (Production Environment) column. If a network connectivity test is successful, Connectable appears in the corresponding column.

  5. Click Complete Creation.
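
The connection parameters in this step map onto a standard MySQL connection string. The following minimal sketch shows how the host, port, and database name combine into a JDBC-style URL, and why the sample password needs care if you ever place it in a URI: the `#` character would otherwise be read as the start of a URL fragment. `jdbc_url` and `client_uri` are hypothetical helpers for illustration, and the `mysql://` URI form is an assumption about your client tool, not something that DataWorks requires.

```python
from urllib.parse import quote

# Sample values from the MySQL data source parameters in this step.
HOST = "rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com"
PORT = 3306
DATABASE = "workshop"
USERNAME = "workshop"
PASSWORD = "workshop#2017"


def jdbc_url(host: str, port: int, database: str) -> str:
    """Compose a JDBC-style MySQL URL from the connection address fields."""
    return f"jdbc:mysql://{host}:{port}/{database}"


def client_uri(user: str, password: str, host: str, port: int, database: str) -> str:
    """Compose a URI with percent-encoded credentials (illustrative form)."""
    # safe="" forces reserved characters such as '#' and '@' to be encoded.
    return (
        f"mysql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


print(jdbc_url(HOST, PORT, DATABASE))
# prints: jdbc:mysql://rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com:3306/workshop
print(client_uri(USERNAME, PASSWORD, HOST, PORT, DATABASE))
# prints: mysql://workshop:workshop%232017@rm-bp1z69dodhh85z9qa.mysql.rds.aliyuncs.com:3306/workshop
```

In the DataWorks console, enter the password as is (workshop#2017). Percent-encoding applies only when the credentials are embedded in a URI.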

Step 5: Add an HttpFile data source

In the left-side navigation pane of the SettingCenter page, choose Data Sources > Data Sources. On the Data Sources page, click Add Data Source. In the Add Data Source dialog box, click HttpFile. On the Add HttpFile Data Source page, add an HttpFile data source to the DataWorks workspace.

  1. Configure the basic information about the HttpFile data source.

    The following table describes the parameters that you must configure in the Basic Information section to add an HttpFile data source.

    | Parameter | Description |
    | --- | --- |
    | Data Source Name | Enter the display name of the public HttpFile data source in your workspace. In this tutorial, set the value to user_behavior_analysis_httpfile. |
    | Data Source Description | The description of the data source. State that the data source is provided exclusively for DataWorks use cases, serves as the source of a batch synchronization task that accesses the provided test data, and can be used only for reading data in data synchronization scenarios. |
    | URL Domain | Enter https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com. |

  2. Test the network connectivity between the HttpFile data source and the resource group. If the network connectivity test is successful, click Complete Creation. The HttpFile data source is added to the DataWorks workspace.
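
The URL Domain value determines which host and port the resource group must be able to reach, and file paths used later are resolved against it. The following minimal sketch uses the Python standard library to make both points concrete. `endpoint` and `file_url` are hypothetical helpers, and the file path is a placeholder because this section does not name any files.

```python
from urllib.parse import urljoin, urlsplit

# URL Domain value from the HttpFile data source parameters in this step.
URL_DOMAIN = "https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com"


def endpoint(domain: str) -> tuple:
    """Return (scheme, host, port) that must be reachable for the domain."""
    parts = urlsplit(domain)
    # Fall back to the default port of the scheme when none is given.
    default_port = 443 if parts.scheme == "https" else 80
    return parts.scheme, parts.hostname, parts.port or default_port


def file_url(domain: str, relative_path: str) -> str:
    """Join the URL domain and a relative file path into a full URL."""
    return urljoin(domain.rstrip("/") + "/", relative_path.lstrip("/"))


print(endpoint(URL_DOMAIN))
# prints: ('https', 'dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com', 443)
print(file_url(URL_DOMAIN, "path/to/file.txt"))  # placeholder path
# prints: https://dataworks-workshop-2024.oss-cn-shanghai.aliyuncs.com/path/to/file.txt
```

Because the domain is a public HTTPS endpoint, the resource group needs outbound Internet access on port 443, which is what the NAT gateway configured in Step 2 provides.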

What to do next

You have prepared your environments and can proceed to the next tutorial. In the next tutorial, you will learn how to synchronize basic user information and website access logs of users to StarRocks. For more information, see Synchronize data.