To make sure that you can complete the workshop, you must activate E-MapReduce, DataWorks, and Object Storage Service (OSS) for your Alibaba Cloud account.

Prerequisites

  • An Alibaba Cloud account is registered.
  • Real-name verification for individuals or enterprises is completed.
  • DataWorks of the Professional Edition or a more advanced edition is purchased so that you can bind an E-MapReduce compute engine instance to your DataWorks workspace.
  • An E-MapReduce compute engine instance is added on the Workspace Management page. The E-MapReduce module is available on the DataStudio page only after the instance is added. For more information, see Configure a workspace.

Background information

The following Alibaba Cloud services are used in this workshop: E-MapReduce, DataWorks, and Object Storage Service (OSS).

Procedure

  1. Create an E-MapReduce cluster.
    1. Log on to the E-MapReduce console.
    2. Select China (Shanghai) from the region drop-down list at the top. On the Overview page, click Cluster Wizard in the Clusters section.
      Note
      • Source data used in this workshop is stored in the China (Shanghai) region. Therefore, we recommend that you create an E-MapReduce cluster in the same region as the source data.
      • You can select Quick Purchase or Cluster Wizard to create an E-MapReduce cluster. In this topic, Cluster Wizard is selected.
    3. On the Cluster Wizard page, set Cluster Type to Hadoop and use the default values for other parameters in the Software Settings step. Then click Next: Hardware Settings.
    4. In the Hardware Settings step, set Billing Method to Pay-As-You-Go, set parameters in the Network Settings and Instance sections, and then click Next: Basic Settings.
    5. In the Basic Settings step, set Cluster Name, click Key Pair, select a key pair from the Key Pair drop-down list, and then click Next: Confirm.
      By default, no Elastic IP Address (EIP) is assigned to an E-MapReduce cluster, so you can access a newly created cluster only over the internal network. This workshop does not require an EIP. Therefore, click Next in the Assign Public Network IP dialog box that appears. If you later need to access the cluster over the public network, log on to the Elastic Compute Service (ECS) console and assign an EIP to the corresponding ECS instance.
    6. In the Confirm step, verify your configuration, select the check box for E-MapReduce Service Terms, and then click Create.
  2. Wait for the cluster to be initialized, and then create a project.
    After the purchase is completed, click the Cluster Management tab and view the created E-MapReduce cluster. It takes a few minutes to initialize the cluster.
    1. After the cluster is initialized, click the Data Platform tab.
    2. On the Data Platform page, click Create Project in the upper-right corner.
    3. In the Create Project dialog box that appears, set Project Name and Project Description.
      Note Use your Alibaba Cloud account to create the project. The project will later be bound to a DataWorks workspace.
    4. Click Create.
  3. Create a DataWorks workspace.
    Note Data resources provided for this workshop are all stored in the China (Shanghai) region. Therefore, we recommend that you create a workspace in the China (Shanghai) region. Otherwise, the network connectivity test fails when you add a connection.
    1. Move the pointer over the icon in the upper-left corner and choose Products > DataWorks.
    2. On the Overview page that appears, click Create Workspace in the Shortcuts section.
      You can also click Workspaces in the left-side navigation pane and click Create Workspace on the Workspaces page.
      Note If you create a workspace on the Workspaces page, you must select a region in advance. This is because the Region parameter is unavailable in the Create Workspace dialog box that you open on the Workspaces page.
    3. In the Create Workspace dialog box that appears, set parameters in the Basic Settings step and click Next.
      Note In this tutorial, a workspace in basic mode is created as an example.
    4. In the Select Engines and Services step, select E-MapReduce and click Next.

      DataWorks is now available as a commercial service. If you have not activated DataWorks in a region, activate it before you create a workspace in that region. By default, the following services are selected when you create a workspace: Data Integration, Data Analytics, Operation Center, and Data Quality.

    5. In the Engine Details step, set parameters for the E-MapReduce engine.
      Engine: E-MapReduce
      • Instance Display Name: The name of the E-MapReduce cluster.
      • Access ID and Access Key: The AccessKey of the account authorized to access the E-MapReduce cluster.
      • Cluster ID: The ID of the E-MapReduce cluster, which is obtained from the E-MapReduce console.
      • EmrUserID: The ID of the user who created the E-MapReduce cluster. To view the ID, log on to the E-MapReduce console and click the Cluster Management tab at the top. On the Cluster Management page, find the cluster and click Details in the Actions column. On the page that appears, click Users in the left-side navigation pane. The user ID is displayed on the Users page.
      • Project ID: The ID of the project in the E-MapReduce cluster. To view the ID, log on to the E-MapReduce console, click the Data Platform tab at the top, and find the project on the Data Platform page.
      • YARN resource queue: The name of the resource queue in the E-MapReduce cluster. Set the value to default unless otherwise required.
      • Endpoint: The endpoint of the E-MapReduce cluster, which is obtained from the E-MapReduce console.
    6. Click Create Workspace.
  4. Activate OSS and create a bucket.
    1. Go to the OSS product landing page and click Buy Now.
    2. On the purchase page that appears, select the required configurations and click Enable Now.
    3. After the purchase is completed, click Console on the OSS product landing page to go to the OSS console.
    4. In the OSS console, click Buckets in the left-side navigation pane. On the Buckets page that appears, click Create Bucket. You can also click Create Bucket in the Bucket Management section on the Overview page.
    5. In the Create Bucket dialog box that appears, set parameters for the bucket and click OK.
      Note Set Region to China (Shanghai).
    6. Click the created bucket. On the page that appears, click Files in the left-side navigation pane. On the Files page, click Create Folder.
    7. In the Create Folder dialog box that appears, set Folder Name and click OK.
      Note Create three folders in total: one for data synchronized from OSS, one for data synchronized from Relational Database Service (RDS), and one for JAR resources.
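
Before you continue, it can help to check the values collected in this procedure in one place. The following Python code is a minimal sketch of such a checklist, not part of any Alibaba Cloud SDK. The class name, field names, and helper functions are hypothetical and only mirror the parameters of the Engine Details step. The sketch assumes that cn-shanghai is the region ID of China (Shanghai) and that OSS represents a folder as an object whose name ends with a forward slash (/).

```python
from dataclasses import dataclass, fields
from typing import List

# Assumption: "cn-shanghai" is the region ID of China (Shanghai).
WORKSHOP_REGION = "cn-shanghai"


@dataclass
class EmrEngineSettings:
    """Values entered in the Engine Details step (hypothetical field names)."""
    instance_display_name: str
    access_id: str
    access_key: str
    cluster_id: str
    emr_user_id: str
    project_id: str
    endpoint: str
    # The procedure recommends "default" unless otherwise required.
    yarn_resource_queue: str = "default"


def missing_settings(settings: EmrEngineSettings) -> List[str]:
    """Return the names of the fields that are empty or contain only whitespace."""
    return [
        f.name
        for f in fields(settings)
        if not str(getattr(settings, f.name)).strip()
    ]


def regions_match(*region_ids: str) -> bool:
    """Return True if every resource is in the region that stores the source data."""
    return all(region_id == WORKSHOP_REGION for region_id in region_ids)


def folder_keys(names: List[str]) -> List[str]:
    """Normalize folder names to object keys that end with exactly one '/'."""
    return [name.strip("/") + "/" for name in names]
```

For example, calling missing_settings on an EmrEngineSettings object whose emr_user_id is an empty string returns a list that contains only "emr_user_id". Calling regions_match with the region IDs of the cluster, the workspace, and the bucket confirms that all three are in China (Shanghai). Calling folder_keys with three folder names of your choice returns the keys of the three folders created in the last step.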