You can use an Alibaba Cloud Logstash pipeline to migrate data from a self-managed Elasticsearch cluster to an Alibaba Cloud Elasticsearch cluster. This topic describes the migration procedure in detail.

Prerequisites

  • A self-managed Elasticsearch cluster is created.
    We recommend that you create a self-managed Elasticsearch cluster on Alibaba Cloud Elastic Compute Service (ECS) instances. For more information, see Install and Run Elasticsearch.
    Notice
    • The ECS instances that host the self-managed Elasticsearch cluster must be deployed in a virtual private cloud (VPC). You cannot use ECS instances that are connected to a VPC over ClassicLink.
    • Alibaba Cloud Logstash clusters are deployed in VPCs. Before you configure a Logstash pipeline, you must check whether the ECS instances that host the self-managed Elasticsearch cluster reside in the same VPC as the Alibaba Cloud Logstash cluster that you want to use. If they reside in different VPCs, you must configure NAT gateways to connect the ECS instances and Logstash cluster to the Internet. For more information, see Configure a NAT gateway for data transmission over the Internet.
    • In the security groups of the ECS instances that host the self-managed Elasticsearch cluster, you must configure rules that allow access from the IP addresses of the nodes in the Logstash cluster and open port 9200. You can obtain the IP addresses of the nodes in the Logstash cluster on the Basic Information page of the Logstash cluster.
    • In this example, an Alibaba Cloud Logstash V6.7.0 cluster is used to migrate data from a self-managed Elasticsearch 5.6.16 cluster to an Alibaba Cloud Elasticsearch V6.7.0 cluster. The scripts provided in this topic apply only to this type of data migration. If you want to perform other types of data synchronization, you must check whether your Elasticsearch clusters and Logstash cluster are compatible with each other based on the instructions in Compatibility matrixes. If they are not compatible with each other, you can upgrade their versions or purchase new clusters.
  • An Alibaba Cloud Logstash cluster is created.

    For more information, see Create an Alibaba Cloud Logstash cluster.

  • An Alibaba Cloud Elasticsearch cluster is created in the VPC where the Alibaba Cloud Logstash cluster resides. Make sure that the Alibaba Cloud Elasticsearch cluster is of the same version as the Logstash cluster. In this example, V6.7.0 is used.

    For more information, see Create an Alibaba Cloud Elasticsearch cluster.

  • The Auto Indexing feature is enabled for the Alibaba Cloud Elasticsearch cluster.

    For more information, see Configure the YML file.

    Note Logstash migrates only documents. It does not synchronize index configurations such as mappings and settings. Therefore, if you rely on the Auto Indexing feature, the structure of the indexes that are automatically created in the destination may differ from that of the source indexes. If you want the index structure to remain unchanged, we recommend that you create an empty index in the destination before you migrate data. When you create the index, copy the mappings and settings of the source index and set the numbers of shards and replicas to appropriate values, as shown in the following example.
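    The following Kibana Dev Tools commands are a minimal sketch of this approach. The index name my_index, the type name doc, and the field definitions are hypothetical; replace them with the actual configuration of your source index.

    # On the source cluster, retrieve the mappings and settings of the index
    # that you want to migrate. my_index is a hypothetical index name.
    GET my_index

    # On the destination Alibaba Cloud Elasticsearch cluster, create an empty
    # index that uses the same mappings. Adjust the numbers of shards and
    # replicas as needed. The type name doc and the fields are examples only.
    PUT my_index
    {
      "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1
      },
      "mappings": {
        "doc": {
          "properties": {
            "name": { "type": "keyword" },
            "created_at": { "type": "date" }
          }
        }
      }
    }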

Configure and run a Logstash pipeline

  1. Log on to the Elasticsearch console.
  2. Navigate to the desired cluster.
    1. In the top navigation bar, select the region where the cluster resides.
    2. In the left-side navigation pane, click Logstash Clusters. On the Logstash Clusters page, find the cluster and click its ID.
  3. In the left-side navigation pane of the page that appears, click Pipelines.
  4. On the Pipelines page, click Create Pipeline.
  5. In the Create Task wizard, enter a pipeline ID and configure the pipeline.
    In this example, the following configurations are used for the pipeline:
    input {
      elasticsearch {
        hosts => ["http://<IP address of the master node in the self-managed Elasticsearch cluster>:9200"]
        user => "elastic"
        index => "*,-.monitoring*,-.security*,-.kibana*"
        password => "your_password"
        docinfo => true
      }
    }
    filter {
    }
    output {
      elasticsearch {
        hosts => ["http://es-cn-mp91cbxsm000c****.elasticsearch.aliyuncs.com:9200"]
        user => "elastic"
        password => "your_password"
        index => "%{[@metadata][_index]}"
        document_type => "%{[@metadata][_type]}"
        document_id => "%{[@metadata][_id]}"
      }
      file_extend {
        path => "/ssd/1/ls-cn-v0h1kzca****/logstash/logs/debug/test"
      }
    }
    Table 1. Parameters
    • hosts: The endpoint of the self-managed Elasticsearch cluster or Alibaba Cloud Elasticsearch cluster. In the input part, specify a value in the format of http://<IP address of the master node in the self-managed Elasticsearch cluster>:<Port number>. In the output part, specify a value in the format of http://<ID of the Alibaba Cloud Elasticsearch cluster>.elasticsearch.aliyuncs.com:9200.
      Notice When you configure this parameter, replace <IP address of the master node in the self-managed Elasticsearch cluster>, <Port number>, and <ID of the Alibaba Cloud Elasticsearch cluster> with your actual values.
    • user: The username that is used to access the self-managed Elasticsearch cluster or Alibaba Cloud Elasticsearch cluster.
      Notice
      • The user and password parameters are required in most cases. If the X-Pack plug-in is not installed on the self-managed Elasticsearch cluster, you can leave the two parameters empty.
      • The default username that is used to access Alibaba Cloud Elasticsearch clusters is elastic, and this username is used in this example. You can also use a custom username. Before you use a custom username, you must create a role for it and grant the required permissions to the role. For more information, see Use the RBAC mechanism provided by Elasticsearch X-Pack to implement access control.
    • password: The password that is used to access the self-managed Elasticsearch cluster or Alibaba Cloud Elasticsearch cluster.
    • index: The names of the indexes from which or to which you want to migrate data. If you set this parameter to *,-.monitoring*,-.security*,-.kibana* in the input part, the system migrates data from all indexes except the .monitoring*, .security*, and .kibana* system indexes. If you set this parameter to %{[@metadata][_index]} in the output part, the system uses the index name in the document metadata. This way, the indexes that are generated on the Alibaba Cloud Elasticsearch cluster have the same names as the indexes on the self-managed Elasticsearch cluster. To check which indexes exist on the source cluster before you configure this parameter, see the example that follows.
    • docinfo: If you set this parameter to true, the system extracts the metadata of documents in the self-managed Elasticsearch cluster, such as the index, type, and id fields.
    • document_type: If you set this parameter to %{[@metadata][_type]}, the system uses the index type in the document metadata. This way, the indexes that are generated on the Alibaba Cloud Elasticsearch cluster have the same types as the indexes on the self-managed Elasticsearch cluster.
    • document_id: If you set this parameter to %{[@metadata][_id]}, the system uses the document ID in the document metadata. This way, the documents that are generated on the Alibaba Cloud Elasticsearch cluster have the same IDs as the documents on the self-managed Elasticsearch cluster.
    • file_extend: Optional. Specifies whether to enable the pipeline configuration debugging feature. You can use the path field to specify the path that stores debug logs. We recommend that you configure this parameter. After it is configured, you can view the output data of the pipeline directly in the console. If it is not configured, you must check the output data in the destination and, if the data is incorrect, modify the pipeline configuration in the console, which increases time and labor costs. For more information, see Use the pipeline configuration debugging feature.
      Notice Before you use the file_extend parameter, you must install the logstash-output-file_extend plug-in. For more information, see Install and remove a plug-in. By default, the path field is set to a system-specified path. We recommend that you do not change it. You can click Start Configuration Debug to obtain the path.

    For more information about how to configure parameters in the Config Settings field, see Logstash configuration files.
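    Before you configure the index parameter in the input part, you may want to check which indexes exist on the source cluster and how large they are. The following command is a sketch; you can run it in a Kibana console that is connected to the self-managed Elasticsearch cluster or send it to the cluster's HTTP endpoint. The h parameter limits the output to the listed columns.

    # List the indexes on the self-managed Elasticsearch cluster, together with
    # their document counts and storage sizes, to decide which indexes to migrate.
    GET /_cat/indices?v&h=index,docs.count,store.size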

  6. Click Next to configure pipeline parameters.
    Table 2. Pipeline parameters
    • Pipeline Workers: The number of worker threads that run the filter and output plug-ins of the pipeline in parallel. If a backlog of events exists or some CPU resources are idle, we recommend that you increase the number of threads to maximize CPU utilization. The default value of this parameter is the number of vCPUs.
    • Pipeline Batch Size: The maximum number of events that a single worker thread collects from input plug-ins before it runs the filter and output plug-ins. A larger value allows a single worker thread to collect more events but consumes more memory. To make sure that the worker thread has sufficient memory, specify the LS_HEAP_SIZE variable to increase the Java virtual machine (JVM) heap size. Default value: 125.
    • Pipeline Batch Delay: The maximum amount of time that Logstash waits for a new event when it creates event batches, before it dispatches an undersized batch to a pipeline worker thread. Default value: 50. Unit: milliseconds.
    • Queue Type: The internal queue model for buffering events. Valid values:
      • MEMORY: a traditional memory-based queue. This is the default value.
      • PERSISTED: a disk-based ACKed queue, which is a persistent queue.
    • Queue Max Bytes: The maximum amount of disk space that the persistent queue can consume. The value must be less than the total capacity of your disk. Default value: 1024. Unit: MB.
    • Queue Checkpoint Writes: The maximum number of events that are written before a checkpoint is enforced when persistent queues are enabled. The value 0 indicates no limit. Default value: 1024.
    Warning After you configure the parameters, you must save the settings and deploy the pipeline. This triggers a restart of the Logstash cluster. Before you can proceed, make sure that the restart does not affect your services.
  7. Click Save or Save and Deploy.
    • Save: After you click this button, the system stores the pipeline settings and triggers a cluster change. However, the settings do not take effect. After you click Save, the Pipelines page appears. On the Pipelines page, find the created pipeline and click Deploy in the Actions column. Then, the system restarts the Logstash cluster to make the settings take effect.
    • Save and Deploy: After you click this button, the system restarts the Logstash cluster to make the settings take effect.

View migration results

  1. Log on to the Kibana console of the Alibaba Cloud Elasticsearch cluster.
    For more information, see Log on to the Kibana console.
  2. In the left-side navigation pane, click Dev Tools.
  3. On the Console tab of the page that appears, run the GET /_cat/indices?v command to view the indexes that store the migrated data.
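    To further check whether all data is migrated, you can compare the document counts on the source and destination clusters. The following commands are a sketch; my_index is a hypothetical index name, and you can run the same commands on both clusters.

    # List all indexes together with their document counts.
    GET /_cat/indices?v

    # Check the document count of a specific index on both clusters.
    GET /_cat/count/my_index?v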

FAQ

  • Q: How do I connect the ECS instances that host the self-managed Elasticsearch cluster to the Alibaba Cloud Logstash cluster when the ECS instances and the Logstash cluster belong to different accounts?

    A: The ECS instances and the Logstash cluster belong to different accounts. Therefore, the ECS instances and the Logstash cluster reside in different VPCs. In this case, you can use Cloud Enterprise Network (CEN) to connect the ECS instances to the Logstash cluster. For more information, see Step 3: Attach network instances.

  • Q: An error occurs when Logstash writes data to the destination. What do I do?

    A: Troubleshoot the error based on the instructions provided in FAQ about data transfer by using Logstash.