DataWorks allows you to associate only DataLake clusters in the new data lake scenario in E-MapReduce (EMR) with a DataWorks workspace. To use the projects that you created in Hadoop clusters in the original data lake scenario, you must migrate the projects to a DataWorks workspace for data development. This topic describes how to migrate EMR projects to a DataWorks workspace.
Prerequisites
DataWorks is activated and a DataWorks workspace is created. For information about how to create a DataWorks workspace, see Create and manage workspaces.
If you want to perform the migration as a RAM user, make sure that the RAM user is assigned the workspace administrator role and that the AliyunDataWorksFullAccess and AliyunEMRFullAccess policies are attached to the RAM user. For more information, see Add a RAM user to a workspace as a member and assign roles to the member and Grant permissions to RAM users.
The EMR cluster whose projects you want to migrate is associated with the DataWorks workspace that you created. For more information, see Associate an EMR compute engine with a workspace.
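The policy requirement for RAM users can be expressed as a simple set comparison. The following is a minimal sketch, assuming you already have the list of policy names attached to the RAM user (for example, copied from the RAM console). The helper name missing_policies is hypothetical and is not part of any Alibaba Cloud SDK.

```python
# Policies that this topic requires for a RAM user who performs the migration.
REQUIRED_POLICIES = {"AliyunDataWorksFullAccess", "AliyunEMRFullAccess"}


def missing_policies(attached):
    """Return the required policy names that are not in `attached`, sorted.

    `attached` is any iterable of policy names (hypothetical input; how you
    obtain the list is up to you).
    """
    return sorted(REQUIRED_POLICIES - set(attached))
```

An empty result means that both policies are attached. The check does not cover the workspace administrator role, which is assigned in DataWorks.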
Background information
You can use one of the three methods described in this topic to migrate workflows (nodes and scheduling settings), manually executed jobs, resources, and data sources from an EMR cluster to a DataWorks workspace.
After you trigger the migration, you can go to the Migration Assistant page in the DataWorks console to view the migration progress, migration results, and migration reports. For more information, see View the migration reports and result.
The following table lists the mappings between the original job types in EMR projects and the job types after the EMR projects are migrated to a DataWorks workspace.
Original job type | Job type after project migration to DataWorks |
SQOOP | Data Integration (Batch synchronization) |
SPARK_SQL | EMR_SPARK_SQL |
SPARK | EMR_SPARK |
SHELL | EMR_SHELL |
PRESTO_SQL | EMR_PRESTO |
MR | EMR_MR |
IMPALA_SQL | EMR_IMPALA |
HIVE_SQL | EMR_HIVE |
HIVE | EMR_SHELL |
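The mappings in the preceding table can be written as a lookup, which may help when you compare a migration report against the original project. The following is a minimal sketch; the helper name expected_dataworks_type is an assumption for illustration and is not part of DataWorks or EMR.

```python
# Job type mappings copied from the preceding table.
EMR_TO_DATAWORKS = {
    "SQOOP": "Data Integration (Batch synchronization)",
    "SPARK_SQL": "EMR_SPARK_SQL",
    "SPARK": "EMR_SPARK",
    "SHELL": "EMR_SHELL",
    "PRESTO_SQL": "EMR_PRESTO",
    "MR": "EMR_MR",
    "IMPALA_SQL": "EMR_IMPALA",
    "HIVE_SQL": "EMR_HIVE",
    "HIVE": "EMR_SHELL",
}


def expected_dataworks_type(emr_job_type: str) -> str:
    """Return the job type expected after migration (hypothetical helper)."""
    key = emr_job_type.strip().upper()
    if key not in EMR_TO_DATAWORKS:
        raise KeyError(f"No mapping listed for EMR job type: {emr_job_type!r}")
    return EMR_TO_DATAWORKS[key]
```

Both SHELL and HIVE jobs map to EMR_SHELL, so the original job type cannot be recovered from the migrated job type alone.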
Method 1: Use the one-click migration feature in the old EMR console to migrate EMR projects to DataWorks
You can use the one-click migration feature in the old EMR console to migrate the configuration information of an EMR cluster to a DataWorks workspace.
Log on to the old EMR console.
In the top navigation bar, select the region where your cluster resides. Then, click the Data Platform tab.
Create a task for one-click migration.
Click the ID of the project that you want to migrate in the Project ID/Name column to go to the details page of the project.
Perform the steps shown in the following figure to go to the page that displays the procedure for migrating an EMR workflow to DataWorks.
Select the desired workspace and click Migrate.
Note: After you click Migrate, the system compresses the project that you want to migrate into a package, exports the package from EMR, and then imports the package to the desired DataWorks workspace.
In the Note message, check the mappings of the types of nodes, scheduling settings, manually executed jobs, resources, and data sources before and after migration. You can use the mappings to check the integrity and validity of the migration. If the information is correct, click OK.
The system starts to migrate the project.
You can click Go to Import Tasks to view the migration progress. For more information, see View the migration reports and result.
Method 2: Use DataWorks Migration Assistant to export an EMR project as a package and then import the package to a DataWorks workspace
In the DataWorks console, you can export the nodes, scheduling settings, manually executed jobs, resources, and data sources that are stored in an EMR cluster as a package, and then import the package to a DataWorks workspace. The Migration Assistant service of DataWorks in different editions provides different migration policies. Different roles are granted different permissions to use the Migration Assistant service. For more information, see Limits.
If you use the Migration Assistant service as a RAM user, make sure that the AliyunEMRFullAccess policy is attached to the RAM user. Otherwise, the system reports an error when you select a value from the Project Name drop-down list. For information about how to attach a policy to a RAM user, see Grant permissions to RAM users.
Go to the Migration Assistant page in the DataWorks console.
Log on to the DataWorks console. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to DataStudio.
On the DataStudio page, click the icon in the upper-left corner and choose .
Export a project from EMR as a package.
In the left-side navigation pane of the Migration Assistant page, choose .
On the Schemes of Scheduling Engine Export page, click the EMR tab. Then, click Create Export Task.
In the Create Export Task dialog box, configure the parameters.
After the project is exported, return to the Schemes of Scheduling Engine Export page to view the export result. Click Download Export Package in the Actions column that corresponds to the export task to download the exported package to your on-premises machine.
Note: The download link is valid for 30 days. We recommend that you download the package before the link expires. After the link expires, you must re-export the project to download the package.
Import the downloaded package to a DataWorks workspace.
Create an import task.
In the left-side navigation pane of the Migration Assistant page, choose . On the Import Tasks page, click Create Import Task in the upper-right corner.
In the Create Import Task dialog box, configure the parameters and click OK. The following table describes the parameters.
Parameter | Description |
Name | The name of the import task. You can specify a custom name for the import task. |
Scheduling Engine | The type of the engine for the project that you want to import. In this example, E-MapReduce (EMR) is selected. |
Upload From | The source of the package that you want to import. Valid values: Local (select this mode if the package is 30 MB or smaller) and OSS (select this mode if the package exceeds 30 MB). If you select OSS, you must also enter the URL of the related Object Storage Service (OSS) object in the OSS Endpoint field. You can obtain the URL of an object in the View Details panel of the object in the OSS console. Note: For information about how to upload objects to OSS, see Upload objects. For information about how to obtain the URL of an object in the OSS console, see Use object URLs. |
Select File | The exported package of the EMR project. After the package is uploaded, the system checks whether the package meets the requirements. Note: This parameter is required only if you select Local for the Upload From parameter. |
OSS Endpoint | The OSS URL of the EMR project that you want to import. Note: This parameter is required only if you select OSS for the Upload From parameter. |
File name | The name of the package to be uploaded. This parameter is automatically specified after you configure the preceding parameters. Note: This parameter is required only if you select Local for the Upload From parameter. |
Remarks | The description of the import task. |
On the Edit import task page, check the project that you want to import and click start import in the upper-right corner.
The system starts to import the project.
You can click Go to Import Tasks to view the migration progress. For more information, see View the migration reports and result.
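The 30 MB threshold of the Upload From parameter can be checked before you create the import task. The following is a minimal sketch, assuming that 1 MB is counted as 1024 × 1024 bytes (this topic does not state the exact definition). The function names are hypothetical.

```python
import os

# Assumption: 30 MB means 30 * 1024 * 1024 bytes.
LOCAL_LIMIT_BYTES = 30 * 1024 * 1024


def choose_upload_source(size_bytes: int) -> str:
    """Return "Local" for packages up to 30 MB in size, otherwise "OSS"."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    return "Local" if size_bytes <= LOCAL_LIMIT_BYTES else "OSS"


def choose_upload_source_for_file(path: str) -> str:
    """Apply the same rule to a package file on the on-premises machine."""
    return choose_upload_source(os.path.getsize(path))
```

If the result is OSS, upload the package to OSS first and enter the object URL in the OSS Endpoint field.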
Method 3: Package an EMR project by using a tool and then use DataWorks Migration Assistant to import the packaged EMR project to a DataWorks workspace
You can run commands to package an EMR project and then use DataWorks Migration Assistant to import the packaged EMR project to a DataWorks workspace.
Before you use this method, you must install a Python environment on your on-premises machine.
Package an EMR project to your on-premises machine.
Download the package of the project packaging tool migrationx-reader to your on-premises machine.
Run a command to package the EMR project that you want to migrate.
Decompress the package of the project packaging tool and run the following command in the Python environment:
python ./migrationx-reader/bin/reader.py -a aliyunemr -d . -i $accessId -k $accessKey -p $project -e emr.aliyuncs.com -r $regionId
Take note of the following parameters:
$accessId and $accessKey: the AccessKey ID and AccessKey secret of the account that is used to perform the packaging operation.
$project: the name of the EMR project that you want to package.
$regionId: the ID of the region where the EMR project resides.
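If you prefer to start the packaging tool from a script, the command can be assembled as an argument list, which avoids shell quoting issues with the AccessKey pair. The following is a minimal sketch that uses only the flags shown in the preceding command. The function names, the tool_dir default, and the d_value parameter name are assumptions for illustration. This topic does not describe the meaning of the -d flag, so the sketch passes through the same value (.) that the command uses.

```python
import subprocess
import sys


def build_reader_command(access_id, access_key, project, region_id,
                         tool_dir="./migrationx-reader", d_value="."):
    """Build the argument list for reader.py (hypothetical helper).

    The flags and the fixed values "aliyunemr" and "emr.aliyuncs.com" are
    taken from the command in this topic.
    """
    return [
        sys.executable,               # the Python interpreter that runs this script
        f"{tool_dir}/bin/reader.py",
        "-a", "aliyunemr",
        "-d", d_value,
        "-i", access_id,
        "-k", access_key,
        "-p", project,
        "-e", "emr.aliyuncs.com",
        "-r", region_id,
    ]


def run_reader(access_id, access_key, project, region_id):
    """Run the packaging tool and raise CalledProcessError if it fails."""
    cmd = build_reader_command(access_id, access_key, project, region_id)
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list means that no shell is involved, so special characters in the AccessKey secret are not interpreted. Avoid hard-coding the AccessKey pair in the script.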
Use DataWorks Migration Assistant to import the package of the EMR project.
Create an import task.
In the left-side navigation pane of the Migration Assistant page, choose . On the Import Tasks page, click Create Import Task in the upper-right corner.
In the Create Import Task dialog box, configure the parameters and click OK. The following table describes the parameters.
Parameter | Description |
Name | The name of the import task. You can specify a custom name for the import task. |
Scheduling Engine | The type of the engine for the project that you want to import. In this example, E-MapReduce (EMR) is selected. |
Upload From | The source of the package that you want to import. Valid values: Local (select this mode if the package is 30 MB or smaller) and OSS (select this mode if the package exceeds 30 MB). If you select OSS, you must also enter the URL of the related Object Storage Service (OSS) object in the OSS Endpoint field. You can obtain the URL of an object in the View Details panel of the object in the OSS console. Note: For information about how to upload objects to OSS, see Upload objects. For information about how to obtain the URL of an object in the OSS console, see Use object URLs. |
Select File | The exported package of the EMR project. After the package is uploaded, the system checks whether the package meets the requirements. Note: This parameter is required only if you select Local for the Upload From parameter. |
OSS Endpoint | The OSS URL of the EMR project that you want to import. Note: This parameter is required only if you select OSS for the Upload From parameter. |
File name | The name of the package to be uploaded. This parameter is automatically specified after you configure the preceding parameters. Note: This parameter is required only if you select Local for the Upload From parameter. |
Remarks | The description of the import task. |
On the Edit import task page, check the project that you want to import and click start import in the upper-right corner.
The system starts to import the project.
You can click Go to Import Tasks to view the migration progress. For more information, see View the migration reports and result.
View the migration reports and result
After a project is migrated, you can go to the Migration Assistant page to view the migration progress, migration result, and migration reports.
Import
On the Import Tasks page in Migration Assistant, find the desired import task and click View Import Report in the Actions column.
Export
On the Schemes of Scheduling Engine Export page in Migration Assistant, click the EMR tab, find the desired export task, and then click View Export Report in the Actions column.