This topic describes how to create an E-MapReduce (EMR) table.

Prerequisites

  • An Alibaba Cloud EMR cluster is created, and an inbound rule with the following settings is added to the security group to which the cluster belongs:
    • Action: Allow
    • Protocol type: Custom TCP
    • Port range: 8898/8898
    • Authorization object: 100.104.0.0/16
  • An EMR compute engine instance is associated with your workspace. The EMR folder is displayed only after you associate an EMR compute engine instance with the workspace on the Workspace Management page. For more information, see Configure a workspace.
  • The metadata of an EMR data source is collected by using the Data Map service of DataWorks. For more information, see Collect metadata from an EMR data source.
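
As a rough illustration of which traffic the inbound rule in the first prerequisite admits, the following Python sketch checks a source address and port against the rule's values. It models only the matching logic locally and does not call any Alibaba Cloud API. The function name rule_allows is hypothetical.

```python
import ipaddress

# Values copied from the security group rule in the prerequisites.
ALLOWED_CIDR = ipaddress.ip_network("100.104.0.0/16")
ALLOWED_PORT = 8898


def rule_allows(source_ip: str, port: int, protocol: str = "tcp") -> bool:
    """Return True if the traffic matches the inbound rule (hypothetical helper)."""
    return (
        protocol.lower() == "tcp"
        and port == ALLOWED_PORT
        and ipaddress.ip_address(source_ip) in ALLOWED_CIDR
    )
```

For example, rule_allows("100.104.12.7", 8898) evaluates to True, whereas an address outside 100.104.0.0/16 or a port other than 8898 evaluates to False.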

Procedure

  1. Go to the DataStudio page.
    1. Log on to the DataWorks console.
    2. In the left-side navigation pane, click Workspaces.
    3. In the top navigation bar, select the region in which the workspace that you want to manage resides. Find the workspace and click DataStudio in the Actions column.
  2. Move the pointer over the Create icon and choose Create Table > EMR > Table.
    You can also find the workflow in which you want to create an EMR table, right-click EMR, and then select Create Table.
  3. In the Create Table dialog box, configure the following parameters.
    • Engine Type: The default value is EMR and cannot be changed.
    • Engine Instance: Select a compute engine instance from the drop-down list.
    • Name: The name of the EMR table that you want to create.
  4. Click Create. The table configuration tab appears.
    The upper part of the tab shows the configurations that you specified in the Create Table dialog box. You can change the database to which the EMR compute engine instance connects. To create a database, click Create a database. In the Create a database dialog box, configure the parameters and click OK.
  5. In the Basic attributes section, configure the following parameters.
    • Level 1 theme: The name of the level-1 folder in which the table resides.
      Note: The level-1 and level-2 folders show the table locations in DataWorks to help you manage tables.
    • Level 2 theme: The name of the level-2 folder in which the table resides.
    • Create a theme: Click Create a theme to go to the Folder Management tab, where you can create level-1 and level-2 folders.
    • Refresh: After you create a folder, click Refresh.
    • Description: The description of the table.
  6. In the Physical model design section, configure the following parameters.
    • Layer, Physical classification: Select a level and a category from the drop-down lists based on your business requirements. To create levels and categories, click Create Level to go to the Level Management tab. Only the workspace administrator can perform this operation. After you create levels and categories, click Refresh.
    • Partition type: Valid values: Partition table and Non-partitioned table.
    • Table type: Valid values: Internal tables and External tables.
  7. In the Table structure design section, configure the following parameters.
    • Add fields: To add a field, click Add fields, configure the field information, and then click Save.
    • Move up, Move down: Click these buttons to adjust the order of the fields in the table. To change the field order of an existing table, you must delete the table and create another table that has the same name. This operation is not allowed in the production environment.
    • Field name: The name of the field. The name can contain letters, digits, and underscores (_).
    • Field type: The data type of the field. EMR supports the following data types: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, DECIMAL, VARCHAR, CHAR, STRING, BINARY, DATETIME, DATE, TIMESTAMP, BOOLEAN, ARRAY, MAP, and STRUCT.
    • Length/Settings: The length limit of the field. This parameter is required if the data type that you specified for the field requires a length limit.
    • Description: The description of the field.
    • Primary key: Specifies whether the field serves as the primary key. The primary key is a business concept that ensures the uniqueness of a record for your business. DataWorks imposes no limits on the primary key.
    • Edit: Click this button for a field to edit the field, and then click Save.
    • Delete: Click this button for a field to delete the field.
      Note: To delete a field from an existing table and then commit the table, you must delete the table and create another table that has the same name. This operation is not allowed in the production environment.
    • Add: Click this button to add a partition to the current table. If you set the Partition type parameter to Partition table in the Physical model design section, you must configure a partition for the table. To add a partition to an existing table and then commit the table, you must delete the table and create another table that has the same name. This operation is not allowed in the production environment.

  8. Click the Submit icon in the top toolbar to commit the EMR table to the production environment.
    If you use a workspace in standard mode, commit the table to the development environment first and then to the production environment.
    Notice: You must select a resource group for scheduling when you commit the table. If you use an exclusive resource group for scheduling to commit the table, DataWorks issues a table creation node to a compute engine instance and prints the run logs. If an error occurs when you commit the table, you can use the run logs to troubleshoot the issue. If no exclusive resource groups for scheduling are available, you can purchase and configure an exclusive resource group for scheduling. For more information, see Create and use an exclusive resource group for scheduling.
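
To make the settings in steps 6 and 7 more concrete, the following Python sketch assembles a Hive-style CREATE TABLE statement from a table type, a field list, and optional partition fields. This is an assumption for illustration only: this topic does not show the statement that DataWorks actually submits to the compute engine instance, and the names Field and build_ddl are hypothetical. The field-name check mirrors the rule stated for the Field name parameter (letters, digits, and underscores).

```python
import re
from dataclasses import dataclass
from typing import Optional, Sequence

# Field names may contain letters, digits, and underscores (see Field name).
FIELD_NAME = re.compile(r"[A-Za-z0-9_]+")


def _quote(text: str) -> str:
    """Escape single quotes so the text can sit inside a COMMENT literal."""
    return text.replace("'", "\\'")


@dataclass
class Field:
    name: str
    type: str                      # for example "STRING" or "DECIMAL"
    length: Optional[str] = None   # Length/Settings, for example "10,2"
    description: str = ""

    def to_sql(self) -> str:
        if not FIELD_NAME.fullmatch(self.name):
            raise ValueError(f"invalid field name: {self.name!r}")
        col_type = f"{self.type}({self.length})" if self.length else self.type
        comment = f" COMMENT '{_quote(self.description)}'" if self.description else ""
        return f"`{self.name}` {col_type}{comment}"


def build_ddl(
    database: str,
    table: str,
    fields: Sequence[Field],
    partitions: Sequence[Field] = (),
    external: bool = False,
    description: str = "",
) -> str:
    """Build a Hive-style DDL string (hypothetical; not the DataWorks output)."""
    keyword = "CREATE EXTERNAL TABLE" if external else "CREATE TABLE"
    lines = [f"{keyword} IF NOT EXISTS `{database}`.`{table}` ("]
    lines.append(",\n".join("  " + f.to_sql() for f in fields))
    lines.append(")")
    if description:
        lines.append(f"COMMENT '{_quote(description)}'")
    if partitions:
        cols = ", ".join(p.to_sql() for p in partitions)
        lines.append(f"PARTITIONED BY ({cols})")
    return "\n".join(lines) + ";"
```

Calling build_ddl("demo_db", "orders", [Field("order_id", "BIGINT")], partitions=[Field("ds", "STRING")], external=True) corresponds to selecting External tables and Partition table in the Physical model design section. The database and table names here are placeholders.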