To make sure that you can complete the workshop, you must activate E-MapReduce (EMR), DataWorks, and Object Storage Service (OSS) for your Alibaba Cloud account.

Prerequisites

  • An Alibaba Cloud account is created.
  • Real-name verification for individuals or enterprises is completed.
  • An EMR compute engine instance is associated with the desired workspace. The EMR folder is displayed only after you associate an EMR compute engine instance with the workspace on the Workspace Management page. For more information, see Configure a workspace.
  • An Alibaba Cloud EMR cluster is created. The inbound rules of the security group to which the cluster belongs include the following rules:
    • Action: Allow
    • Protocol type: Custom TCP
    • Port range: 8898/8898
    • Authorization object: 100.104.0.0/16
  • If you integrate Hive with Ranger in EMR, you must modify whitelist configurations and restart Hive before you develop EMR nodes in DataWorks. Otherwise, the error message "Cannot modify spark.yarn.queue at runtime" or "Cannot modify SKYNET_BIZDATE at runtime" is returned when you run EMR nodes.
    1. You can modify the whitelist configurations by using custom parameters in EMR. You can append key-value pairs to the value of a custom parameter. In this example, the custom parameter for Hive components is used. The following code provides an example:
      hive.security.authorization.sqlstd.confwhitelist.append=tez.*|spark.*|mapred.*|mapreduce.*|ALISA.*|SKYNET.*
      Note In the code, ALISA.* and SKYNET.* are configurations in DataWorks.
    2. After the whitelist configurations are modified, restart the Hive service to make the configurations take effect. For more information, see Restart a service.
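The effect of the appended whitelist value can be previewed before you restart Hive. The following is a minimal sketch, not Hive's own code: it assumes that Hive treats the value as a regular expression and requires the whole parameter name to match, which is consistent with the "Cannot modify ... at runtime" errors described above. The function name is_covered is hypothetical, and the check covers only the appended value, not the whitelist entries that Hive already has by default.

```python
import re

# Value appended to hive.security.authorization.sqlstd.confwhitelist.append
# in the example above.
APPENDED_WHITELIST = "tez.*|spark.*|mapred.*|mapreduce.*|ALISA.*|SKYNET.*"


def is_covered(param_name: str, whitelist: str = APPENDED_WHITELIST) -> bool:
    """Return True if the whole parameter name matches the appended pattern.

    Assumption: the whitelist is matched as a regular expression against
    the full parameter name, so re.fullmatch is used here.
    """
    return re.fullmatch(whitelist, param_name) is not None


if __name__ == "__main__":
    # Two parameters named in the error messages, plus one that the
    # appended value does not cover.
    for name in ("spark.yarn.queue", "SKYNET_BIZDATE", "hive.exec.parallel"):
        print(name, is_covered(name))
```

A parameter that is not covered by the appended value, such as hive.exec.parallel in this sketch, is allowed only if another whitelist entry in Hive matches it.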
  • An exclusive resource group for scheduling is created, and the resource group is associated with the virtual private cloud (VPC) where the EMR cluster resides. For more information, see Create and use an exclusive resource group for scheduling.
    Note EMR Hive nodes can be run only on exclusive resource groups for scheduling.
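The inbound security group rule listed in the prerequisites (Custom TCP, port range 8898/8898, authorization object 100.104.0.0/16) can be reasoned about with a short check. This is an illustrative sketch that uses only the Python standard library. It does not call any Alibaba Cloud API, and the function names are hypothetical.

```python
import ipaddress

# Values taken from the inbound rule in the prerequisites.
AUTHORIZATION_OBJECT = ipaddress.ip_network("100.104.0.0/16")
PORT_RANGE = "8898/8898"  # written as start/end


def parse_port_range(text: str) -> tuple:
    """Split a 'start/end' port range into a pair of integers."""
    start, end = text.split("/")
    return int(start), int(end)


def rule_allows(source_ip: str, port: int) -> bool:
    """Return True if the rule would admit traffic from source_ip to port."""
    start, end = parse_port_range(PORT_RANGE)
    in_range = start <= port <= end
    in_cidr = ipaddress.ip_address(source_ip) in AUTHORIZATION_OBJECT
    return in_range and in_cidr


if __name__ == "__main__":
    print(rule_allows("100.104.12.7", 8898))  # address inside the CIDR block
    print(rule_allows("100.105.0.1", 8898))   # address outside the CIDR block
    print(rule_allows("100.104.12.7", 22))    # port outside the range
```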

Background information

The following Alibaba Cloud services are used in this workshop:

  • EMR
  • DataWorks
  • OSS

Procedure

  1. Create an EMR cluster.
    1. Log on to the EMR console.
    2. In the top navigation bar, select the China (Shanghai) region. On the Cluster Management page, click Cluster Wizard.
      Note
      • Source data used in this workshop is stored in the China (Shanghai) region. Therefore, we recommend that you create an EMR cluster in the same region as the source data.
      • You can select Quick Purchase or Cluster Wizard to create an EMR cluster. In this topic, Cluster Wizard is selected.
    3. On the Cluster Wizard page, set the Cluster Type parameter to Hadoop and use the default values for other parameters in the Software Settings step. Click Next: Hardware Settings.
    4. In the Hardware Settings step, set the Billing Method parameter to Pay-As-You-Go, set the parameters in the Network Settings and Instance sections, and then click Next: Basic Settings.
    5. In the Basic Settings step, set the Cluster Name parameter, select a key pair from the Key Pair drop-down list, and then click Next: Confirm.
      By default, Assign Public IP Address is turned off. If you do not turn on this switch, you cannot access the cluster over the Internet after the cluster is created. In this workshop, you are not required to assign a public IP address. Therefore, click Next in the Assign Public IP Address dialog box. To access the cluster over the Internet, log on to the Elastic Compute Service (ECS) console and assign an elastic IP address (EIP) to the ECS instance that corresponds to the cluster.
    6. In the Confirm step, verify your configuration, read the terms of service, select E-MapReduce Service Terms, and then click Create.
  2. Initialize the cluster.
    After the purchase is complete, view the created EMR cluster on the Cluster Management page. It takes a few minutes to initialize the cluster.
    1. After the cluster is initialized, click the Data Platform tab.
    2. On the Data Platform tab, click Create Project in the upper-right corner.
    3. In the Create Project dialog box, set the Project Name and Project Description parameters.
      Note Use your Alibaba Cloud account to create the project. The project must be associated with a DataWorks workspace in subsequent steps.
    4. Click Create.
  3. Create a DataWorks workspace.
    Note Data resources provided for this workshop are all stored in the China (Shanghai) region. Therefore, we recommend that you create a workspace in the China (Shanghai) region. Otherwise, the network connectivity test fails when you create a connection.
    1. Move the pointer over the icon in the upper-left corner of the EMR console and choose Products and Services > DTplus > DataWorks.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region where you want to create the workspace.
    4. On the Workspaces page, click Create Workspace. In the Create Workspace panel, set the parameters in the Basic Settings step and click Next.
      Section | Parameter | Description
      Basic Information | Workspace Name | The name of the workspace. The name must be 3 to 27 characters in length and start with a letter. It can contain only letters, underscores (_), and digits.
      Basic Information | Display Name | The display name of the workspace. The display name can be up to 27 characters in length. It must start with a letter and can contain only letters, underscores (_), and digits.
      Basic Information | Mode | Valid values: Basic Mode (Production Environment Only) and Standard Mode (Development and Production Environments). In this topic, set the parameter to Basic Mode (Production Environment Only).
      Basic Information | Description | The description of the workspace.
      Advanced Settings | Download SELECT Query Result | Specifies whether to allow workspace members to download the results queried in DataStudio.
    5. In the Select Engines and Services step, select E-MapReduce and click Next.
      DataWorks is now available as a commercial service. If you have not activated DataWorks in a region, activate it before you create a workspace in the region.
    6. In the Engine Details step, set the parameters based on your business requirements.
      Engine Details
      Parameter | Description
      Instance Display Name | The display name of the compute engine instance.
      Access ID | The AccessKey ID of the account that is authorized to access the EMR cluster.
      Access Key | The AccessKey secret of the account that is authorized to access the EMR cluster.
      EmrClusterID | The ID of the EMR cluster. You can obtain the ID from the EMR console.
      Cluster ID | The ID of the user who created the EMR cluster.
      Project ID | The ID of the project in the EMR cluster.
      YARN resource queue | The name of the resource queue in the EMR cluster. Unless otherwise specified, set the parameter to default.
      Endpoint | The endpoint of the EMR cluster. You can obtain the endpoint from the EMR console.
    7. After the configuration is complete, click Create Workspace.
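The naming rules stated for the Workspace Name and Display Name parameters in the Basic Settings step can be checked before you fill in the form. The following sketch encodes those rules as regular expressions. It assumes that "letters" means ASCII letters, which the source does not state, and the function names are hypothetical.

```python
import re

# Workspace Name: 3 to 27 characters, starts with a letter, and contains
# only letters, underscores (_), and digits.
# Assumption: "letters" is read as ASCII letters A-Z and a-z.
WORKSPACE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]{2,26}")

# Display Name: up to 27 characters, same character rules.
DISPLAY_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,26}")


def is_valid_workspace_name(name: str) -> bool:
    """Check a candidate workspace name against the documented rules."""
    return WORKSPACE_NAME.fullmatch(name) is not None


def is_valid_display_name(name: str) -> bool:
    """Check a candidate display name against the documented rules."""
    return DISPLAY_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("emr_workshop", "ab", "1workspace", "my-workspace"):
        print(candidate, is_valid_workspace_name(candidate))
```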
  4. Activate OSS and create a bucket.
    1. Activate OSS. For more information, see Activate OSS.
    2. Log on to the OSS console.
    3. In the left-side navigation pane, click Buckets.
    4. On the Buckets page, click Create Bucket.
    5. In the Create Bucket panel, set the parameters and click OK.
      Note Select China (Shanghai) from the Region drop-down list. For more information about the parameters, see Create buckets.
    6. Click the name of the created bucket in the Bucket Name column to go to the Files page.
    7. Click Create Folder on the Files page.
    8. In the Create Folder panel, set the Folder Name parameter and click OK.
      Note Create three folders: one for the OSS external data source, one for the Relational Database Service (RDS) data source, and one for JAR resources.
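The folder creation in the last steps can also be scripted. OSS stores objects in a flat namespace, and a folder is commonly represented by a zero-byte object whose key ends with a forward slash (/). The following sketch builds such keys and writes them through any object that offers a put_object(key, data) method. The folder names and the InMemoryBucket class are hypothetical and exist only so that the sketch runs without credentials. The commented lines at the end assume the oss2 Python SDK and the China (Shanghai) public endpoint, which this topic does not mention.

```python
from typing import Iterable, List, Tuple

# Hypothetical folder names; the workshop does not prescribe them.
FOLDER_NAMES = ["oss_data", "rds_data", "jar_resources"]


def folder_key(name: str) -> str:
    """Return the object key that represents a folder: the name plus '/'."""
    cleaned = name.strip("/")
    if not cleaned:
        raise ValueError("folder name must not be empty")
    return cleaned + "/"


def create_folders(bucket, names: Iterable[str]) -> List[str]:
    """Write one zero-byte object per folder and return the keys written.

    bucket is any object with a put_object(key, data) method.
    """
    keys = [folder_key(name) for name in names]
    for key in keys:
        bucket.put_object(key, b"")
    return keys


class InMemoryBucket:
    """Stand-in bucket for a dry run; it only records what would be written."""

    def __init__(self) -> None:
        self.objects: List[Tuple[str, bytes]] = []

    def put_object(self, key: str, data: bytes) -> None:
        self.objects.append((key, data))


if __name__ == "__main__":
    print(create_folders(InMemoryBucket(), FOLDER_NAMES))
    # Against a real bucket (assumes the oss2 SDK is installed):
    # import oss2
    # auth = oss2.Auth("<AccessKey ID>", "<AccessKey secret>")
    # bucket = oss2.Bucket(auth, "https://oss-cn-shanghai.aliyuncs.com", "<bucket name>")
    # create_folders(bucket, FOLDER_NAMES)
```

Passing the bucket in as a parameter keeps the key-building logic testable without network access.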