To make sure that you can complete the workshop, you must activate E-MapReduce, DataWorks, and Object Storage Service (OSS) for your Alibaba Cloud account.

Prerequisites

  • An Alibaba Cloud account is created.
  • Real-name verification for individuals or enterprises is completed.
  • An E-MapReduce cluster is bound to your workspace. The E-MapReduce service is available in a workspace only after you bind an E-MapReduce cluster to the workspace on the Workspace Management page.

Background information

The following Alibaba Cloud services are used in this workshop:

  • E-MapReduce
  • DataWorks
  • Object Storage Service (OSS)

Procedure

  1. Create an E-MapReduce cluster.
    1. Log on to the E-MapReduce console.
    2. Select China (Shanghai) and click Cluster Wizard.
      Note
      • Source data used in this workshop is stored in the China (Shanghai) region. Therefore, we recommend that you create an E-MapReduce cluster in the same region as the source data.
      • You can select Quick Purchase or Cluster Wizard to create an E-MapReduce cluster. In this topic, Cluster Wizard is selected.
    3. On the Cluster Wizard page, set the Cluster Type parameter to Hadoop and use the default values for other parameters in the Software Settings step. Click Next: Hardware Settings.
    4. In the Hardware Settings step, set the Billing Method parameter to Pay-As-You-Go, set the parameters in the Network Settings and Instance sections, and then click Next: Basic Settings.
    5. In the Basic Settings step, set the Cluster Name parameter, select a key pair from the Key Pair drop-down list, and then click Next: Confirm.
      By default, no public IP address is assigned to an E-MapReduce cluster, so after the cluster is created you can access it only over the internal network. This workshop does not require a public IP address, so click Next in the Assign Public IP Address dialog box. If you need to access the cluster over the Internet, log on to the Elastic Compute Service (ECS) console and assign an elastic IP address (EIP) to the corresponding ECS instance.
    6. In the Confirm step, verify your configuration, select the check box for E-MapReduce Service Terms, and then click Create.
  2. Initialize the cluster.
    After the purchase is completed, view the created E-MapReduce cluster on the Cluster Management tab. It takes a few minutes to initialize the cluster.
    1. After the cluster is initialized, click the Data Platform tab.
    2. On the Data Platform tab, click Create Project in the upper-right corner.
    3. In the Create Project dialog box, set the Project Name and Project Description parameters.
      Note Use your Alibaba Cloud account to create the project. The project is bound to a DataWorks workspace in a later step.
    4. Click Create.
  3. Create a DataWorks workspace.
    Note Data resources provided for this workshop are all stored in the China (Shanghai) region. Therefore, we recommend that you create a workspace in the China (Shanghai) region. Otherwise, the network connectivity test fails when you create a connection.
    1. Move the pointer over the icon in the upper-left corner and choose Products and Services > DTplus > DataWorks.
    2. In the left-side navigation pane, click Workspaces.
    3. On the Workspaces page, move the pointer over the region in the upper-left corner and select a region where you want to create a workspace.
    4. Click Create Workspace. Set the parameters in the Basic Settings step and click Next.
      Section | Parameter | Description
      Basic Information | Workspace Name | The name of the workspace must be 3 to 27 characters in length and start with a letter. It can contain only letters, underscores (_), and digits.
      Basic Information | Display Name | The display name can be up to 27 characters in length. It must start with a letter and can contain only letters, underscores (_), and digits.
      Basic Information | Mode | Valid values: Basic Mode (Production Environment Only) and Standard Mode (Development and Production Environments). In this topic, set the parameter to Basic Mode (Production Environment Only).
      Basic Information | Description | The description of the workspace.
      Advanced Settings | Download SELECT Query Result | Specifies whether to allow workspace members to download the results queried in DataStudio.
    5. In the Select Engines and Services step, select E-MapReduce and click Next.
      DataWorks is now available as a commercial service. If you have not activated DataWorks in a region, activate it before you create a workspace in the region.
    6. In the Engine Details step, set the parameters as required.
      Parameter | Description
      Instance Display Name | The display name of the compute engine instance.
      Access ID | The AccessKey ID of the account that is authorized to access the E-MapReduce cluster.
      Access Key | The AccessKey secret of the account that is authorized to access the E-MapReduce cluster.
      ClusterID | The ID of the E-MapReduce cluster. You can obtain the ID from the E-MapReduce console.
      EmrUserID | The ID of the user who created the E-MapReduce cluster.
      Workspace ID | The ID of the project in the E-MapReduce cluster.
      YARN Resource Queue | The name of the resource queue in the E-MapReduce cluster. Unless otherwise specified, set the parameter to default.
      Endpoint | The endpoint of the E-MapReduce cluster. You can obtain the endpoint from the E-MapReduce console.
    7. After the configuration is completed, click Create Workspace.
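The naming rules for the Workspace Name and Display Name parameters in the Basic Settings step can be checked before you open the console. The following Python snippet is a minimal sketch of such a check, not part of DataWorks itself. It assumes that "letters" means ASCII letters (A to Z, a to z), and the function names are hypothetical.

```python
import re

# Workspace Name: 3 to 27 characters, starts with a letter,
# and contains only letters, underscores (_), and digits.
# Assumption: "letters" is interpreted as ASCII letters only.
WORKSPACE_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]{2,26}")

# Display Name: up to 27 characters, same character rules.
DISPLAY_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,26}")


def is_valid_workspace_name(name: str) -> bool:
    """Return True if name satisfies the Workspace Name rules."""
    return WORKSPACE_NAME_RE.fullmatch(name) is not None


def is_valid_display_name(name: str) -> bool:
    """Return True if name satisfies the Display Name rules."""
    return DISPLAY_NAME_RE.fullmatch(name) is not None
```

For example, a hypothetical name such as emr_workshop_01 passes both checks, whereas a name that starts with a digit or contains a hyphen is rejected.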
  4. Activate OSS and create a bucket.
    1. Activate OSS. For more information, see Activate OSS.
    2. Log on to the OSS console.
    3. In the left-side navigation pane, click Buckets.
    4. On the Buckets page, click Create Bucket.
    5. In the Create Bucket panel, set the parameters as required and click OK.
      Note Select China (Shanghai) from the Region drop-down list. For more information about the parameters, see Create buckets.
    6. Click the name of the created bucket in the Bucket Name column to go to the Overview page.
    7. In the left-side navigation pane, click Files.
    8. Click Create Folder. In the Create Folder panel, set the Folder Name parameter and click OK.
      Note Create three folders: one for data synchronized from OSS, one for data synchronized from Relational Database Service (RDS), and one for JAR resources.
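
The three folders can also be created from a script instead of the console. In OSS, a folder is represented by a zero-byte object whose name ends with a forward slash (/). The following Python snippet is a minimal sketch under that model. The helper names and the folder names are hypothetical and do not come from this workshop. The `bucket` argument can be any object that provides a `put_object(key, data)` method, such as an `oss2.Bucket` instance from the OSS Python SDK.

```python
from typing import Iterable, List

# Hypothetical folder names for OSS data, RDS data, and JAR resources.
WORKSHOP_FOLDERS = ["ods_oss", "ods_rds", "jars"]


def folder_key(name: str) -> str:
    """Return the object key that represents a folder (ends with '/')."""
    cleaned = name.strip().strip("/")
    if not cleaned:
        raise ValueError("folder name must not be empty")
    return cleaned + "/"


def create_folders(bucket, names: Iterable[str]) -> List[str]:
    """Create one zero-byte object per folder and return the keys used."""
    keys = [folder_key(n) for n in names]
    for key in keys:
        # An empty body with a key ending in '/' appears as a folder.
        bucket.put_object(key, b"")
    return keys


# Usage with the oss2 SDK (assumption: oss2 is installed; the environment
# variable names and the bucket name below are placeholders):
#
#   import os, oss2
#   auth = oss2.Auth(os.environ["OSS_ACCESS_KEY_ID"],
#                    os.environ["OSS_ACCESS_KEY_SECRET"])
#   bucket = oss2.Bucket(auth, "https://oss-cn-shanghai.aliyuncs.com",
#                        "my-workshop-bucket")
#   create_folders(bucket, WORKSHOP_FOLDERS)
```

Passing the bucket in as a parameter keeps the helper independent of credentials, so it can be tried against a stand-in object before it is pointed at a real bucket in the China (Shanghai) region.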