Migrate self-built Hadoop data to Alibaba Cloud EMR

Best Practices Overview

Application Scenario

Customers build their own Hadoop clusters in an IDC or a public cloud environment and store data centrally in HDFS for data analysis tasks. When the self-built HDFS runs short of space and cannot retain long-term data, or when the customer needs to move the Hadoop cluster to the cloud, the data has to be migrated. This practice plan provides best practices for the following scenario:

Migrate data to an Alibaba Cloud EMR cluster over an IPSec VPN tunnel using DistCp (a native Hadoop tool). Target storage includes HDFS, Alibaba Cloud OSS, and Jindo of Alibaba Cloud EMR.
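As a rough illustration of the DistCp step in this scenario, the following Python sketch assembles a `hadoop distcp` command line without running it. The NameNode address, bucket name, and paths are hypothetical placeholders, and the map count and bandwidth cap are assumed values, not settings taken from this practice plan.

```python
import shlex


def build_distcp_command(source, target, max_maps=20, bandwidth_mb=None, update=True):
    """Return a `hadoop distcp` command as an argument list.

    -update copies only files that are missing or differ at the target,
    -m caps the number of parallel map tasks,
    -bandwidth caps the per-map bandwidth in MB/s.
    """
    cmd = ["hadoop", "distcp"]
    if update:
        cmd.append("-update")
    cmd += ["-m", str(max_maps)]
    if bandwidth_mb is not None:
        cmd += ["-bandwidth", str(bandwidth_mb)]
    cmd += [source, target]
    return cmd


if __name__ == "__main__":
    # Hypothetical endpoints: replace with the real NameNode and OSS bucket.
    cmd = build_distcp_command(
        "hdfs://idc-namenode:8020/logs",
        "oss://example-bucket/logs",
        bandwidth_mb=10,
    )
    print(shlex.join(cmd))
```

Building the command as a list keeps it safe to pass to `subprocess.run` later, and limiting the bandwidth per map is one way to avoid saturating the VPN tunnel during the copy.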

Technology Architecture

This practice plan is based on the technical architecture and main process as shown in the figure below:

Solution advantages

• Secure: data is transmitted over an IPSec VPN or a private line.
• Low cost: creating Hadoop-type EMR clusters on Alibaba Cloud has certain cost advantages compared with self-built Hadoop clusters. Alibaba Cloud EMR can also use OSS as the underlying storage to further reduce costs.

Before proceeding with this article, you need to complete the following preparations:

• Have an Alibaba Cloud account that has completed real-name verification.
• Own a domain name that has completed ICP filing.
• Ensure that the account balance is greater than RMB 100.
• Activate services such as ECS, OSS, EMR, and VPN Gateway.

Resource Planning Description
The resource planning in this solution is for demonstration only. Size the resources for an actual business scenario according to your own needs.
This solution involves activating and purchasing Alibaba Cloud resources. The examples that follow do not show service activation separately, so complete those operations yourself.
For reference, the general process of this practice plan and the time required for each operation are as follows (document reading time is not included):

1. Set up the self-built Hadoop cluster environment
In this practice plan, the Shanghai VPC environment is used to simulate the customer's IDC network, and the following components are mainly installed:

(1) Install FlexGW VPN on ECS to simulate the VPN gateway in the customer's IDC network;
(2) Install Apache log simulator on ECS to generate log information in Apache format;
(3) Install Kafka on ECS to centrally store the logs sent by Flume;
(4) Install a 3-node Hadoop cluster on ECS, where HDFS is used to centrally save log data information.

1.1. Create a VPC network
Step 1 Log in to the VPC product console.
Step 2 Click Create VPC.
Step 3 On the Create VPC page, configure the VPC and vSwitch parameters according to the table below, and click OK.

Step 4 After the VPC and vSwitch are created, click Finish.
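Because this VPC stands in for the customer's IDC and will later be connected to the cloud side over an IPSec VPN, it helps to choose address ranges that do not overlap. The sketch below checks a set of CIDR blocks for overlaps with Python's standard `ipaddress` module. The network names and CIDR values are hypothetical, since the parameter table is not reproduced here.

```python
import ipaddress


def overlapping_pairs(cidrs):
    """Return every pair of network names whose CIDR blocks overlap.

    `cidrs` maps a descriptive name to a CIDR string.
    """
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    names = list(nets)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if nets[a].overlaps(nets[b])
    ]


if __name__ == "__main__":
    # Hypothetical address plan: simulated IDC VPC and the EMR-side VPC.
    plan = {
        "simulated-idc-vpc": "172.16.0.0/16",
        "emr-vpc": "192.168.0.0/16",
    }
    conflicts = overlapping_pairs(plan)
    print("overlaps:", conflicts if conflicts else "none")
```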

1.2. Create ECS instances in batches
Create an ECS instance
Step 1 Log in to the ECS product console in the Shanghai region.
Step 2 Click Create Instance in the upper right corner.
Step 3 In the custom purchase mode, configure related parameters.
Configure the basic settings according to the table below.
To set the price cap for a preemptible instance, click View Historical Price. In the historical price chart, the current market price of instances in zone F is 0.034, so the upper limit price of a single instance is set to 0.04, slightly higher than the current market price.
After the configuration is complete, click Next: Network and Security Group.
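The price cap chosen in Step 3 can be sanity-checked with a little arithmetic. This sketch, using only the two figures quoted above (0.034 and 0.04), computes how far the cap sits above the market price. The function name is illustrative and the currency unit is left out because the source does not state it.

```python
from decimal import Decimal


def cap_headroom(market_price, price_cap):
    """Return the fraction by which the price cap exceeds the market price."""
    market = Decimal(market_price)
    cap = Decimal(price_cap)
    if cap <= market:
        raise ValueError("price cap must be above the current market price")
    return (cap - market) / market


if __name__ == "__main__":
    headroom = cap_headroom("0.034", "0.04")
    print(f"cap is {headroom:.1%} above the market price")
```

Using `Decimal` with string inputs avoids binary floating-point rounding on small prices like these.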

Step 4 On the Network and Security Group page, refer to the table below to configure related parameters.

After the configuration is complete, click Next: System Configuration.

Step 5 On the system configuration page, refer to the table below to configure related parameters.

After the configuration is complete, click Confirm Order.

Step 6 On the order confirmation page, check the parameter information. If it is correct, read and select the "Terms of Service for Cloud Server ECS" and "Terms of Use for Image Products" check boxes, and click Create Instance.

Step 7 After the task is submitted successfully, click Management Console to go to the ECS instance list page and view the details. To make the purpose of each ECS instance easy to identify in the console, first rename the instances as shown in the following figure:

Step 8 Shut down the Kafka queue and FlexGW VPN instances, and replace their system disks with Alibaba Cloud Marketplace images to save time deploying the basic environment.

Step 9 First replace the system disk for the FlexGW VPN gateway instance.

In the Actions column of the instance, click More > Disks and Images > Replace OS.

In the lower right corner, click the button that confirms replacing the system disk.

For the image type, select Marketplace Image, and click Select from image market (including the operating system).

Enter flexgw in the search box, navigate to FlexGW IPsec VPN Server Enterprise Edition, and click Use.

Select Custom Password and set a login password.

Step 10 Refer to Step 9 to replace the system disk for the Kafka queue instance, and select the following image.

(Optional) Configure a Security Group

Confirm that ports 22, 80, and 443 are open in the security group of the instance. If any of them is not open, follow the steps below to open it.
Step 1 On the ECS console, click Manage in the Actions column of the FlexGW VPN gateway instance.
Step 2 In the left navigation bar, click the security group of this instance.

Step 3 Click Configure Rule in the Actions column of the corresponding security group.

Step 4 On the Inbound tab, click Quick Create Rule.

Step 5 In the Quick Create Rule dialog box, configure related parameters according to the figure below, and click OK.
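After the rules are saved, one way to confirm the result from outside is to attempt a TCP connection to each port. The following is a minimal sketch using Python's standard `socket` module. The host address is a documentation placeholder, and a failed connection can also mean that no service is listening on that port yet, not only that the security group blocks it.

```python
import socket


def closed_ports(host, ports, timeout=2.0):
    """Return the ports on `host` that do not accept a TCP connection."""
    closed = []
    for port in ports:
        try:
            # create_connection raises OSError on refusal or timeout.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            closed.append(port)
    return closed


if __name__ == "__main__":
    # Placeholder address: replace with the public IP of the
    # FlexGW VPN gateway instance.
    host = "203.0.113.10"
    print("unreachable ports:", closed_ports(host, [22, 80, 443]))
```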
