When you use DataWorks to perform big data development operations, you can isolate environments such as development, testing, and production. If you use DataWorks together with other Alibaba Cloud services, you can also configure the required environment settings and isolate environments based on your business requirements. This topic describes how to implement environment isolation when you use DataWorks together with Data Lake Formation (DLF), Object Storage Service (OSS), and E-MapReduce (EMR).

Background information

Enterprise users may have requirements for creating and isolating different environments, such as development, testing, and production environments, during big data development. After environment isolation is implemented, the physical storage paths of data, the compute engines on which nodes are run, and big data development scripts in different environments are isolated from each other. In addition, strict permission management is imposed on personnel who perform operations in different environments. For example, O&M personnel can use the production environment, and developers can use only the development environment.

In this example, DataWorks is used together with DLF, OSS, and EMR for big data development, and the development environment and production environment are isolated.
  • DataWorks is used to manage the development, O&M, and scheduling of big data jobs.
  • Two EMR clusters are separately used for the development environment and production environment.
  • OSS is used to store actual data.
  • DLF is used to store and manage metadata.
The following sections describe how to implement environment isolation.

Environment isolation for DLF

  1. Log on to the DLF console. In the left-side navigation pane, choose Metadata > Metadata. On the Metadata page, click the Catalog List tab. On the Catalog List tab, click New Catalog to create two catalogs named dev and prod. The dev catalog is used to store the metadata in the development environment, and the prod catalog is used to store the metadata in the production environment. When you create the catalogs, specify different values for the Location parameter.
    DLF
  2. Find the dev catalog and prod catalog and separately create a database in the catalogs. We recommend that you specify the same name and different OSS paths for the databases. This can facilitate subsequent data migration.
    Create a database
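The catalog and database layout from the two steps above can be sketched as follows. The bucket name and paths are hypothetical, and the console creates the databases for you; the DDL is shown only to illustrate that both databases share one name while pointing at different OSS locations:

```sql
-- Hypothetical OSS layout (bucket name is an assumption):
--   dev catalog Location:  oss://my-bucket/datalake/dev/
--   prod catalog Location: oss://my-bucket/datalake/prod/

-- Equivalent DDL for step 2 (illustrative only):
CREATE DATABASE IF NOT EXISTS db1 LOCATION 'oss://my-bucket/datalake/dev/db1';  -- in the dev catalog
CREATE DATABASE IF NOT EXISTS db1 LOCATION 'oss://my-bucket/datalake/prod/db1'; -- in the prod catalog
```

Because the database name is identical in both catalogs, SQL scripts can later be cloned between environments without edits; the catalog bound to each cluster determines which OSS path is actually used.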

Environment isolation for EMR

Log on to the EMR console and create two EMR clusters. Separately configure catalog information for engines in the two EMR clusters. Make sure that the engines in the EMR cluster for the development environment use the dev catalog and the engines in the EMR cluster for the production environment use the prod catalog.

In this example, the Hive engine is used. The dlf.catalog.id configuration item of the Hive engine in the EMR cluster for the development environment is set to dev, as shown in the following figure. For more information, see Manage configuration items.
EMR
Important
  • This section describes only how to configure catalog information for the Hive engine in the EMR cluster for the development environment. You must configure catalog information for all types of engines in the two EMR clusters.
  • After you configure catalog information for the two EMR clusters, you must issue the configurations and restart all engines to make the configurations take effect.
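In configuration files, the item described above corresponds to a property such as the following. This is a minimal sketch of a hive-site.xml fragment for the development cluster; the exact file and configuration tab depend on the engine type and EMR version:

```xml
<!-- Sketch: bind the Hive engine of the development cluster to the dev catalog. -->
<property>
  <name>dlf.catalog.id</name>
  <value>dev</value>
</property>
<!-- In the production cluster, set the value to prod instead. -->
```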

Environment isolation for DataWorks

  1. Log on to the DataWorks console and create two workspaces in basic mode. Use one workspace as the development environment and associate the EMR cluster for the development environment with the workspace. Use the other as the production environment and associate the EMR cluster for the production environment with the workspace. For information about how to create a workspace, see Create a workspace.
  2. In the workspace that is used as the development environment, create a node, configure scheduling properties for the node, and create a table by executing an SQL statement on the DataStudio page.
    The following code provides an example of the statement that you can use to create a table:
    CREATE TABLE IF NOT EXISTS db1.table1 (id INT, name STRING);
    Note: The OSS paths were already specified when the databases were created in the DLF catalogs. You do not need to declare a path in the CREATE TABLE statement.
  3. Create a workflow in the workspace that is used as the development environment and use the cross-project cloning feature to deploy the workflow to the workspace that is used as the production environment.
    Cross-project cloning
    To deploy the workflow, you must select the workflow and configure workflow settings, such as compute engine mappings and resource groups. For information about the cross-project cloning feature, see Overview. After the workflow is deployed, you can view the created node in the workspace that is used as the production environment. You can modify, test, and deploy the node in the production environment based on your business requirements.
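After the workflow is cloned, the same script can serve as a smoke test in both workspaces. A minimal sketch, assuming the db1.table1 table created earlier; because the database name is identical in the dev and prod catalogs, the statements run unchanged in either environment, and each environment reads and writes its own OSS path:

```sql
-- Runs unchanged in both workspaces; the catalog bound to the
-- cluster decides which OSS location the data lands in.
INSERT INTO db1.table1 VALUES (1, 'test');
SELECT id, name FROM db1.table1;
```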