This tutorial describes how to perform user profile analysis, using DataWorks to synchronize data, process data, and monitor data quality. Before you start the tutorial, you must create an E-MapReduce (EMR) cluster and a DataWorks workspace and configure the development environment.
Prerequisites
DataWorks is activated. For more information, see Purchase guide.
Note: All data resources involved in this tutorial reside in the China (Shanghai) region. We recommend that you activate DataWorks in the China (Shanghai) region.
Object Storage Service (OSS) is activated. For more information, see Activate OSS.
Step 1: Create an OSS bucket
This tutorial requires an OSS bucket, which is used to store user information and website access logs for data modeling and data analysis.
Log on to the OSS console.
In the left-side navigation pane, click Buckets. On the Buckets page, click Create Bucket.
In the Create Bucket panel, configure the parameters and click OK.
Bucket Name: Configure this parameter based on your business requirements.
Region: Select China (Shanghai).
OSS-HDFS: Turn on this switch.
For more information, see Create buckets.
Go back to the Buckets page, find the bucket, and then click the bucket name to go to the Objects page.
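The Bucket Name parameter is free-form, but the console rejects names that break the OSS naming rules. The following minimal Python sketch checks a candidate name before you open the console. It assumes the commonly documented rules (3 to 63 characters; lowercase letters, digits, and hyphens only; must start and end with a lowercase letter or digit). Confirm the rules against the current OSS documentation. The function name is hypothetical and is not part of any OSS SDK.

```python
import re

# Assumed OSS bucket naming rules (verify against the OSS documentation):
# 3-63 characters, lowercase letters / digits / hyphens only,
# first and last character must be a lowercase letter or a digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if the name satisfies the assumed OSS naming rules."""
    return _BUCKET_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    # "user-profile-demo" is an illustrative name, not one the tutorial requires.
    for candidate in ("user-profile-demo", "UserProfile", "-logs"):
        print(candidate, is_valid_bucket_name(candidate))
```

The check covers syntax only. Bucket names must also be globally unique, which only the OSS service can verify.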
Step 2: Create an EMR cluster
This tutorial also requires an EMR cluster that is registered to DataWorks, so that you can run data processing tasks on the cluster from the DataWorks console.
For more information, see Create a cluster. When you create an EMR cluster, take note of the following items in the Software Configuration step:
Region: Select China (Shanghai).
Business Scenario: Select Data Lake.
Product Version: Select the latest version.
Optional Services: Select components based on your business requirements. This tutorial requires the Hive component.
Metadata: Select DLF Unified Metadata.
Root Storage Directory of Cluster: Select the OSS bucket that you created in Step 1, for which OSS-HDFS is turned on.
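The software configuration items above can be written down as a checklist and compared with the settings you plan to use. The following Python snippet is a sketch only: the dictionary keys and the function name are hypothetical labels chosen for illustration, and nothing here calls an EMR API. Product Version is omitted because "the latest version" changes over time.

```python
from typing import Dict, List

# Values taken from the list above; the key names are illustrative assumptions.
REQUIRED_EMR_SETTINGS: Dict[str, str] = {
    "region": "China (Shanghai)",
    "business_scenario": "Data Lake",
    "metadata": "DLF Unified Metadata",
}


def check_emr_plan(plan: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means the plan matches."""
    problems = []
    for key, expected in REQUIRED_EMR_SETTINGS.items():
        actual = plan.get(key)
        if actual != expected:
            problems.append(f"{key}: expected {expected!r}, got {actual!r}")
    # The tutorial requires the Hive component among the optional services.
    if "Hive" not in plan.get("optional_services", []):
        problems.append("optional_services: Hive is required by this tutorial")
    # The root storage bucket must be the Step 1 bucket with OSS-HDFS turned on.
    if not plan.get("root_bucket_has_oss_hdfs", False):
        problems.append("root_bucket_has_oss_hdfs: OSS-HDFS must be turned on")
    return problems


if __name__ == "__main__":
    plan = {
        "region": "China (Shanghai)",
        "business_scenario": "Data Lake",
        "metadata": "DLF Unified Metadata",
        "optional_services": ["Hive"],
        "root_bucket_has_oss_hdfs": True,
    }
    print(check_emr_plan(plan) or "Plan matches the tutorial settings.")
```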
DataWorks support varies depending on how an EMR cluster is configured. Before you create an EMR cluster and develop EMR tasks in DataWorks based on the cluster, we recommend that you read the Best practices for configuring EMR clusters used in DataWorks topic.
Step 3: Create a DataWorks workspace
Before you develop tasks in DataWorks, you must create a DataWorks workspace.
All the data resources involved in this tutorial reside in the China (Shanghai) region. We recommend that you create a DataWorks workspace in the China (Shanghai) region. If you create a workspace in a different region and want to add a data source to the workspace, the data source may fail the network connectivity test. To simplify operations, we recommend that you set the Isolate Development and Production Environments parameter to No when you create a workspace.
Log on to the DataWorks console.
In the left-side navigation pane, click Workspaces.
In the top navigation bar, select the China (Shanghai) region.
On the Workspaces page, click Create Workspace. In the Create Workspace panel, enter a name in the Workspace Name field. For more information, see Create a workspace.
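Because a region mismatch can cause the network connectivity test to fail, it can help to record the region of each resource as you create it and compare them. The following Python sketch does this with region IDs. It assumes that China (Shanghai) corresponds to the region ID cn-shanghai; the resource labels in the example are hypothetical, and the function does not query any Alibaba Cloud service.

```python
from typing import Dict

# Assumption: "cn-shanghai" is the region ID for China (Shanghai).
TUTORIAL_REGION = "cn-shanghai"


def mismatched_regions(
    resources: Dict[str, str], expected: str = TUTORIAL_REGION
) -> Dict[str, str]:
    """Return the resources whose region differs from the expected region."""
    return {name: region for name, region in resources.items() if region != expected}


if __name__ == "__main__":
    # Illustrative labels for the resources created in Steps 1 to 3.
    resources = {
        "oss_bucket": "cn-shanghai",
        "emr_cluster": "cn-shanghai",
        "dataworks_workspace": "cn-hangzhou",
    }
    for name, region in mismatched_regions(resources).items():
        print(f"{name} is in {region}, expected {TUTORIAL_REGION}")
```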
Step 4: Configure the environment required to develop EMR tasks in DataWorks
Before you can develop and run EMR tasks in DataWorks, you must perform the following steps to prepare the required environment:
Purchase and configure a serverless resource group.
Serverless resource groups are dedicated computing resources, which can ensure that tasks are scheduled to run on time. You must purchase a serverless resource group and establish a network connection between the resource group and the virtual private cloud (VPC) in which the EMR cluster is deployed. For more information about how to purchase a serverless resource group, see Create and use a serverless resource group.
Optional. Add the RAM user that you want to use to the workspace as a member and grant the required permissions to the member.
Only workspace members can run EMR tasks in DataStudio. To run tasks as a RAM user, add the RAM user to the workspace as a member and grant the member the required permissions. For more information, see Manage permissions on workspace-level services.
Note: The Alibaba Cloud account to which a workspace belongs and the RAM user that is used to create the workspace automatically become members of the workspace and are assigned the Workspace Administrator role.
Register the EMR cluster to DataWorks and initialize the serverless resource group.
You can use the EMR cluster in DataWorks only if you register the cluster to DataWorks. For more information, see Register an EMR cluster to DataWorks.
Important: You must make sure that the initialization of the resource group is successful. Otherwise, tasks that use the resource group may fail. If the initialization of the resource group fails, you can view the failure cause and perform a network connectivity diagnosis as prompted.
Key parameters for registering an EMR cluster to DataWorks:
Alibaba Cloud Account To Which Cluster Belongs: Select Current Alibaba Cloud Account.
Cluster Type: Select Data Lake.
Default Access Identity: Select Cluster Account: Hadoop.
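The required parts of Step 4 depend on one another: tasks can fail if the resource group is not connected to the VPC of the EMR cluster or is not initialized. The following Python sketch tracks those prerequisites as a simple checklist. The class, field, and function names are hypothetical, and the flags are set by hand rather than read from DataWorks. The optional RAM user step is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EmrDevEnvironment:
    """Manually maintained status flags for the required parts of Step 4."""
    resource_group_purchased: bool = False
    resource_group_connected_to_emr_vpc: bool = False
    cluster_registered: bool = False
    resource_group_initialized: bool = False


# Field name paired with the action described in this step, in order.
REQUIRED_STEPS = [
    ("resource_group_purchased", "Purchase a serverless resource group"),
    ("resource_group_connected_to_emr_vpc",
     "Connect the resource group to the VPC of the EMR cluster"),
    ("cluster_registered", "Register the EMR cluster to DataWorks"),
    ("resource_group_initialized", "Initialize the serverless resource group"),
]


def outstanding_steps(env: EmrDevEnvironment) -> List[str]:
    """Return the required actions that are not yet marked as done."""
    return [label for field, label in REQUIRED_STEPS if not getattr(env, field)]


if __name__ == "__main__":
    env = EmrDevEnvironment(resource_group_purchased=True, cluster_registered=True)
    for step in outstanding_steps(env):
        print("TODO:", step)
```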