An Introduction and Best Practice of DataWorks Migration Assistant

Part 7 of this 10-part series introduces DataWorks migration assistant and best practices.

By Peng Min, Product Manager of DataWorks

This article is a part of the One-stop Big Data Development and Governance DataWorks Use Collection.

1. Product Capabilities

The migration assistant is a new DataWorks module for quickly replicating development results across different environments. Its core capabilities fall into two parts: task migration to the cloud and DataWorks migration.

In the preceding figure, the left side shows the open-source scheduling engine export page under task migration to the cloud, which helps users export scheduling tasks from an open-source scheduling engine. The right side shows the DataWorks migration page, which is the operation page for exported objects. The following introduces the two core capabilities of the migration assistant.

Task migration to the cloud quickly moves jobs from a self-built open-source scheduling engine to the cloud. The main scheduling engines supported are Oozie, Azkaban, and Airflow. The main node types that can be migrated are Sqoop nodes, Shell nodes, and Hive nodes. You can convert these nodes into MaxCompute tasks or EMR tasks in DataWorks. For example, when a Hive job is imported to DataWorks, you can convert it into a MaxCompute SQL job or an EMR Hive job.
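The Hive example above can be pictured as a small lookup. The following is only a sketch: the function name, the engine and target keys, and the labels are hypothetical and are not DataWorks identifiers or APIs. It covers only the Hive conversion described above.

```python
# Illustrative sketch only. All names here are hypothetical labels,
# not identifiers used by DataWorks or the migration assistant.

# Scheduling engines named in the article as supported sources.
SUPPORTED_ENGINES = {"oozie", "azkaban", "airflow"}

# The two conversion targets the article gives for a Hive job.
HIVE_TARGETS = {
    "maxcompute": "MaxCompute SQL",
    "emr": "EMR Hive",
}


def convert_hive_job(engine: str, target: str) -> str:
    """Return the job type a Hive job becomes for the chosen target."""
    if engine.lower() not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported scheduling engine: {engine!r}")
    try:
        return HIVE_TARGETS[target.lower()]
    except KeyError:
        raise ValueError(f"unknown conversion target: {target!r}") from None


if __name__ == "__main__":
    print(convert_hive_job("Oozie", "emr"))
    print(convert_hive_job("Airflow", "maxcompute"))
```

The point of the sketch is that the target type is chosen at import time, so the same exported Hive job can land on either compute engine.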

DataWorks migration is used to migrate development results on DataWorks, mainly in complex scenarios such as cross-tenant, cross-region, cross-cloud, and cross-version migration. The objects that support migration include recurring tasks, manual tasks, resources, functions, SQL components, ad hoc queries, data sources, and table metadata (DDL).

The migration assistant supports multiple export methods. DataWorks previously offered a backup and restoration function that supported full and incremental backup of recurring tasks and manual tasks. However, users sometimes need to customize the backup content, which backup and restoration cannot do, and it supports too few object types to meet migration requirements effectively. DataWorks therefore designed a new product for these requirements, which became the migration assistant module.

It also supports some advanced settings, such as an export blacklist that protects sensitive tasks during migration. Users can also set up mappings for resource groups and job dependencies to reduce the job modifications needed after importing to a new workspace. In addition, the DataWorks migration assistant produces detailed migration reports, so users can quickly understand what happened during the entire migration process: which tasks were exported successfully, which tasks failed, and what caused the failures. Finally, the migration process is compatible with historical versions of DataWorks in private deployments. It also supports migrating jobs developed in the public cloud to a private deployment of DataWorks.
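To make these settings concrete, here is a minimal sketch of how a blacklist, a resource group mapping, and a migration report could fit together. Every class, function, and field name is hypothetical; the article does not describe how the migration assistant implements these features, so this only illustrates the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical stand-in for an exportable task."""
    name: str
    resource_group: str


@dataclass
class MigrationReport:
    """Hypothetical report: what was exported, skipped, or failed and why."""
    exported: list = field(default_factory=list)
    skipped: list = field(default_factory=list)
    failed: dict = field(default_factory=dict)


def plan_export(tasks, blacklist, group_mapping):
    """Apply the blacklist and resource group mapping to a list of tasks."""
    report = MigrationReport()
    remapped = []
    for task in tasks:
        # Blacklisted (sensitive) tasks are never exported.
        if task.name in blacklist:
            report.skipped.append(task.name)
            continue
        # Rewrite the resource group so the task needs no edits after import.
        target_group = group_mapping.get(task.resource_group)
        if target_group is None:
            report.failed[task.name] = (
                f"no mapping for resource group {task.resource_group!r}"
            )
            continue
        remapped.append(Task(task.name, target_group))
        report.exported.append(task.name)
    return remapped, report


if __name__ == "__main__":
    tasks = [
        Task("daily_sales", "dev_group"),
        Task("payroll_sync", "dev_group"),
        Task("legacy_job", "old_group"),
    ]
    remapped, report = plan_export(
        tasks,
        blacklist={"payroll_sync"},
        group_mapping={"dev_group": "prod_group"},
    )
    print(report)
```

In this sketch a task with no resource group mapping is reported as failed with a reason rather than silently dropped, which mirrors the role the article gives to the migration report.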

2. Usage Scenarios for Migration Assistant

The main scenarios of the migration assistant are task migration to the cloud, cross-region migration, rapid build of a test environment, cross-cloud publish, disaster recovery, and rapid replication of development results. The following is an introduction to several core scenarios.

(1) Task Migration to the Cloud

As stated above, jobs on open-source scheduling engines, such as Oozie and Azkaban, can be converted into MaxCompute or EMR jobs.

(2) Cross-Region Migration

Cross-region migration allows users to migrate development results from one region to another, for example, from the Shanghai region to other regions. Users have raised the following requirement: when they first started using DataWorks, they could only use the Shanghai region because other regions had not been deployed yet, but their servers are in the Beijing region. How can they migrate their big data platform to the Beijing region?

If you want to move DataWorks as a whole from one region to another, the migration assistant cannot fully meet the needs of users because the overall migration involves many risks and details, such as how to migrate data, how to migrate tasks, how to migrate job environments, and how to migrate members and permissions.

If you must perform this kind of overall relocation across regions, submit a ticket or contact the DataWorks team in the DingTalk group to evaluate the risks. If you only want to migrate part of your business to other regions, you can use the migration assistant.

(3) Cross-Cloud Publish

Cross-cloud publish is a common requirement in the financial industry. Industries such as banking and insurance are subject to regulatory supervision, so their development and production environments must be physically isolated. As a result, there are two clusters and two sets of environments: one for development and one for production. Daily data development is carried out in the development environment, and tasks are currently published to the production environment through the migration assistant.

Why must publishing from development to production be completed with the migration assistant? There are three main reasons.

First, the physical isolation between development and production means the two systems cannot communicate with each other. Second, tasks cannot be published manually because the overall time window for a task release is small. Third, the published objects must be version-managed, so creating and migrating tasks by hand is not feasible either. Therefore, the current solution is for developers to use the migration assistant to export the tasks to be published. The O&M personnel then import the export package into the production environment and keep the migration report on file for subsequent version management.
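Keeping the migration report on file for version management could look like the following sketch. It assumes the export package is available as bytes and the report as a plain dictionary; the function name and record fields are hypothetical and are not a format produced by the migration assistant.

```python
import hashlib
import json


def build_release_record(package_bytes: bytes, report: dict, version: str) -> str:
    """Build a JSON record tying a release version to its package and report."""
    record = {
        "version": version,
        # A checksum lets O&M confirm the imported package is the exported one.
        "package_sha256": hashlib.sha256(package_bytes).hexdigest(),
        "report": report,
    }
    # Sorted keys keep the archived records stable and easy to diff.
    return json.dumps(record, sort_keys=True, indent=2)


if __name__ == "__main__":
    text = build_release_record(
        b"example export package contents",
        {"exported": ["daily_sales"], "failed": {}},
        "release-001",
    )
    print(text)
```

Storing one such record per release gives the O&M team a simple history of what was published to production and from which package.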

(4) Rapid Replication of Development Results

The core scenario of the migration assistant is copying development results quickly. This feature is mainly for DataWorks partners. Partners only need to develop code once and can then copy the development results and deliver them to customers.

There are two advantages for partners who use the migration assistant to replicate development results rapidly. First, DataWorks versions may differ, and data incompatibility issues may arise between them. The migration assistant resolves these issues and enables task code to be copied between different versions and environments. Second, professional data developers serve a large number of customers at the same time, and their R&D center develops multiple sets of task code in parallel. They must be able to flexibly select which task objects to migrate and deliver to each customer, and the migration assistant meets this requirement for customized export.

