DataWorks: Migrate EMR projects to DataWorks

Last Updated: Nov 06, 2023

In DataWorks, only DataLake clusters created in the new data lake scenario of E-MapReduce (EMR) can be associated with a workspace. To continue using projects that you created in Hadoop clusters in the original data lake scenario, you must migrate them to a DataWorks workspace for data development. This topic describes how to migrate EMR projects to a DataWorks workspace.

Prerequisites

Background information

You can use any of the three methods described in this topic to migrate workflows (nodes and scheduling settings), manually executed jobs, resources, and data sources from an EMR cluster to a DataWorks workspace.

After you trigger the migration, you can go to the Migration Assistant page in the DataWorks console to view the migration progress, migration results, and migration reports. For more information, see View the migration reports and result.

The following table lists the mappings between the original job types in EMR projects and the job types after the EMR projects are migrated to a DataWorks workspace.

Original job type    Job type after project migration to DataWorks
SQOOP                Data Integration (Batch synchronization)
SPARK_SQL            EMR_SPARK_SQL
SPARK                EMR_SPARK
SHELL                EMR_SHELL
PRESTO_SQL           EMR_PRESTO
MR                   EMR_MR
IMPALA_SQL           EMR_IMPALA
HIVE_SQL             EMR_HIVE
HIVE                 EMR_SHELL
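
The mapping in the table above can be used to pre-check a project before migration. The following Python sketch is illustrative only: the dictionary is transcribed from the table, while the helper names `expected_dataworks_type` and `check_inventory` are hypothetical and are not part of DataWorks or EMR.

```python
# Job-type mapping transcribed from the table in this topic.
EMR_TO_DATAWORKS = {
    "SQOOP": "Data Integration (Batch synchronization)",
    "SPARK_SQL": "EMR_SPARK_SQL",
    "SPARK": "EMR_SPARK",
    "SHELL": "EMR_SHELL",
    "PRESTO_SQL": "EMR_PRESTO",
    "MR": "EMR_MR",
    "IMPALA_SQL": "EMR_IMPALA",
    "HIVE_SQL": "EMR_HIVE",
    "HIVE": "EMR_SHELL",
}


def expected_dataworks_type(emr_job_type: str) -> str:
    """Return the DataWorks job type expected after migration (hypothetical helper)."""
    key = emr_job_type.strip().upper()
    if key not in EMR_TO_DATAWORKS:
        raise KeyError(f"No documented mapping for EMR job type: {emr_job_type!r}")
    return EMR_TO_DATAWORKS[key]


def check_inventory(job_types):
    """Split job types into a dict of mapped types and a list of unmapped ones."""
    mapped, unmapped = {}, []
    for job_type in job_types:
        key = job_type.strip().upper()
        if key in EMR_TO_DATAWORKS:
            mapped[key] = EMR_TO_DATAWORKS[key]
        else:
            unmapped.append(job_type)
    return mapped, unmapped
```

Job types that end up in the unmapped list have no documented target type in this topic and may need to be reviewed manually after migration.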

Method 1: Use the one-click migration feature in the old EMR console to migrate EMR projects to DataWorks

You can use the one-click migration feature in the old EMR console to migrate the configuration information of an EMR cluster to a DataWorks workspace.

  1. Log on to the old EMR console.

  2. In the top navigation bar, select the region where your cluster resides. Then, click the Data Platform tab.

  3. Create a task for one-click migration.

    1. Click the ID of the project that you want to migrate in the Project ID/Name column to go to the details page of the project.

    2. Perform the steps shown in the following figure to go to the page that displays the procedure for migrating an EMR workflow to DataWorks.

      [Figure: Project migration]
    3. Select the desired workspace and click Migrate.

      Note

      After you click Migrate, the system compresses the project that you want to migrate into a package, exports the package from EMR, and then imports the package to the desired DataWorks workspace.

    4. In the Note message, check the mappings of the types of nodes, scheduling settings, manually executed jobs, resources, and data sources before and after migration. You can use the mappings to check the integrity and validity of the migration. If the information is correct, click OK.

  4. The system starts to migrate the project.

    You can click Go to Import Tasks to view the migration progress. For more information, see View the migration reports and result.

Method 2: Use DataWorks Migration Assistant to export an EMR project as a package and then import the package to a DataWorks workspace

In the DataWorks console, you can export the nodes, scheduling settings, manually executed jobs, resources, and data sources that are stored in an EMR cluster as a package, and then import the package to a DataWorks workspace. The migration policies that Migration Assistant provides vary by DataWorks edition, and the permissions to use Migration Assistant vary by role. For more information, see Limits.

Note

If you use the Migration Assistant service as a RAM user, make sure that the AliyunEMRFullAccess policy is attached to the RAM user. Otherwise, the system reports an error when you select a value from the Project Name drop-down list. For information about how to attach a policy to a RAM user, see Grant permissions to RAM users.

  1. Go to the Migration Assistant page in the DataWorks console.

    1. Log on to the DataWorks console. In the left-side navigation pane, choose Data Modeling and Development > DataStudio. On the page that appears, select the desired workspace from the drop-down list and click Go to DataStudio.

    2. On the DataStudio page, click the icon in the upper-left corner and choose All Products > More > Migration Assistant.

  2. Export a project from EMR as a package.

    1. In the left-side navigation pane of the Migration Assistant page, choose Cloud tasks > Scheduling Engine Export.

    2. On the Schemes of Scheduling Engine Export page, click the EMR tab. Then, click Create Export Task.

    3. In the Create Export Task dialog box, configure the parameters.

    4. After the project is exported, return to the Schemes of Scheduling Engine Export page to view the export result. Click Download Export Package in the Actions column that corresponds to the export task to download the exported package to your on-premises machine.

      Note

      The download link is valid for 30 days. We recommend that you download the package before the validity period ends. After the validity period ends, you need to re-export the project if you want to download the package.

  3. Import the downloaded package to a DataWorks workspace.

    1. Create an import task.

      In the left-side navigation pane of the Migration Assistant page, choose Cloud tasks > Scheduling Engine Import. On the Import Tasks page, click Create Import Task in the upper-right corner.

    2. In the Create Import Task dialog box, configure the parameters and click OK. The parameters are described below.

      • Name: The name of the import task. You can specify a custom name for the import task.

      • Scheduling Engine: The type of the engine for the project that you want to import. In this example, E-MapReduce (EMR) is selected.

      • Upload From: The source of the package that you want to import. Valid values:

        • Local: Select this mode if the package is less than or equal to 30 MB in size.

        • OSS: Select this mode if the package exceeds 30 MB in size. If you select this mode, you must also enter the URL of the related Object Storage Service (OSS) object in the OSS Endpoint field. You can obtain the URL of a specified object in the View Details panel of the object in the OSS console.

          Note

          For information about how to upload objects to OSS, see Upload objects. For information about how to obtain the URL of an object in the OSS console, see Use object URLs.

      • Select File: The exported package of the EMR project. After the package is uploaded, the system checks whether the package meets the requirements.

        Note

        This parameter is required only if you select Local for the Upload From parameter.

      • OSS Endpoint: The OSS URL of the EMR project that you want to import.

        Note

        This parameter is required only if you select OSS for the Upload From parameter.

      • File name: The name of the package to be uploaded. This parameter is automatically specified after you configure the preceding parameters.

        Note

        This parameter is required only if you select Local for the Upload From parameter.

      • Remarks: The description of the import task.

    3. On the Edit import task page, check the project that you want to import and click start import in the upper-right corner.

    4. The system starts to import the project.

      You can click Go to Import Tasks to view the migration progress. For more information, see View the migration reports and result.
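
The Upload From rule in the import step of Method 2 (Local for packages up to 30 MB, OSS for larger packages) can be checked before you open the dialog box. The following Python sketch is a minimal illustration. It assumes that 1 MB equals 1024 × 1024 bytes, which this topic does not specify, and the function names are hypothetical.

```python
import os

# Assumption: the 30 MB limit is interpreted as 30 * 1024 * 1024 bytes.
LOCAL_UPLOAD_LIMIT_BYTES = 30 * 1024 * 1024


def choose_upload_source(package_size_bytes: int) -> str:
    """Return "Local" or "OSS" for the Upload From parameter based on package size."""
    if package_size_bytes < 0:
        raise ValueError("package size must not be negative")
    # Packages up to and including the limit can be uploaded from the local machine.
    if package_size_bytes <= LOCAL_UPLOAD_LIMIT_BYTES:
        return "Local"
    return "OSS"


def choose_upload_source_for_file(path: str) -> str:
    """Same decision, using the size of an exported package on disk."""
    return choose_upload_source(os.path.getsize(path))
```

If the result is OSS, upload the package to OSS first and enter the object URL in the OSS Endpoint field, as described in the parameter descriptions.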

Method 3: Package an EMR project by using a tool and then use DataWorks Migration Assistant to import the packaged EMR project to a DataWorks workspace

You can run commands to package an EMR project and then use DataWorks Migration Assistant to import the packaged EMR project to a DataWorks workspace.

Note

Before you use this method, you must install a Python environment on your on-premises machine.

  1. Package an EMR project to your on-premises machine.

    1. Download the package of the project packaging tool migrationx-reader to your on-premises machine.

    2. Run a command to package the EMR project that you want to migrate.

      Decompress the package of the project packaging tool and run the following command in the Python environment:

      python ./migrationx-reader/bin/reader.py -a aliyunemr -d . -i $accessId -k $accessKey -p $project -e emr.aliyuncs.com -r $regionId

      Take note of the following parameters:

      • $accessId and $accessKey: the AccessKey ID and AccessKey secret of the user account that is used to perform the packaging operation.

      • $project: the name of the EMR project that you want to package.

      • $regionId: the ID of the region where the EMR project resides.

  2. Use DataWorks Migration Assistant to import the package of the EMR project.

    1. Create an import task.

      In the left-side navigation pane of the Migration Assistant page, choose Cloud tasks > Scheduling Engine Import. On the Import Tasks page, click Create Import Task in the upper-right corner.

    2. In the Create Import Task dialog box, configure the parameters and click OK. The parameters are described below.

      • Name: The name of the import task. You can specify a custom name for the import task.

      • Scheduling Engine: The type of the engine for the project that you want to import. In this example, E-MapReduce (EMR) is selected.

      • Upload From: The source of the package that you want to import. Valid values:

        • Local: Select this mode if the package is less than or equal to 30 MB in size.

        • OSS: Select this mode if the package exceeds 30 MB in size. If you select this mode, you must also enter the URL of the related Object Storage Service (OSS) object in the OSS Endpoint field. You can obtain the URL of a specified object in the View Details panel of the object in the OSS console.

          Note

          For information about how to upload objects to OSS, see Upload objects. For information about how to obtain the URL of an object in the OSS console, see Use object URLs.

      • Select File: The exported package of the EMR project. After the package is uploaded, the system checks whether the package meets the requirements.

        Note

        This parameter is required only if you select Local for the Upload From parameter.

      • OSS Endpoint: The OSS URL of the EMR project that you want to import.

        Note

        This parameter is required only if you select OSS for the Upload From parameter.

      • File name: The name of the package to be uploaded. This parameter is automatically specified after you configure the preceding parameters.

        Note

        This parameter is required only if you select Local for the Upload From parameter.

      • Remarks: The description of the import task.

    3. On the Edit import task page, check the project that you want to import and click start import in the upper-right corner.

    4. The system starts to import the project.

      You can click Go to Import Tasks to view the migration progress. For more information, see View the migration reports and result.
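
The packaging command in step 1 of Method 3 can also be assembled from a script, which avoids typing the AccessKey pair directly into the shell. The following Python sketch is illustrative only. The flags and fixed values are copied from the command in this topic. The environment variable names (EMR_ACCESS_ID, EMR_ACCESS_KEY, EMR_PROJECT, EMR_REGION_ID), the function names, and the default tool directory are assumptions made for this example.

```python
import os
import subprocess
import sys


def build_reader_command(access_id, access_key, project, region_id,
                         tool_dir="./migrationx-reader"):
    """Build the argument list for the reader.py command shown in this topic."""
    reader = os.path.join(tool_dir, "bin", "reader.py")
    return [
        sys.executable, reader,   # run reader.py with the current Python interpreter
        "-a", "aliyunemr",        # fixed value from the documented command
        "-d", ".",                # fixed value from the documented command
        "-i", access_id,          # $accessId
        "-k", access_key,         # $accessKey
        "-p", project,            # $project: name of the EMR project
        "-e", "emr.aliyuncs.com",  # fixed value from the documented command
        "-r", region_id,          # $regionId: region of the EMR project
    ]


def command_from_env(env=os.environ):
    """Read the four values from hypothetical environment variables."""
    names = ["EMR_ACCESS_ID", "EMR_ACCESS_KEY", "EMR_PROJECT", "EMR_REGION_ID"]
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return build_reader_command(*(env[name] for name in names))


def run_packaging(command):
    """Run the packaging command and raise CalledProcessError if it fails."""
    subprocess.run(command, check=True)
```

To package a project, set the four environment variables and call `run_packaging(command_from_env())` from the directory into which you decompressed the tool.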

View the migration reports and result

After a project is migrated, you can go to the Migration Assistant page to view the migration progress, migration result, and migration reports.

  • Import

    On the Import Tasks page in Migration Assistant, find the desired import task and click View Import Report in the Actions column.

  • Export

    On the Schemes of Scheduling Engine Export page in Migration Assistant, click the EMR tab, find the desired export task, and then click View Export Report in the Actions column.