This topic describes how to use the reindex API to migrate data from a self-managed Elasticsearch cluster that runs on Elastic Compute Service (ECS) instances to an Alibaba Cloud Elasticsearch cluster. Related operations include index creation and data migration.
Background information
- If the self-managed Elasticsearch cluster stores large volumes of data, use snapshots stored in Object Storage Service (OSS). For more information, see Use OSS to migrate data from a self-managed Elasticsearch cluster to an Alibaba Cloud Elasticsearch cluster.
- If you want to filter source data, use Logstash. For more information, see Use Alibaba Cloud Logstash to migrate data from a self-managed Elasticsearch cluster to an Alibaba Cloud Elasticsearch cluster.
Prerequisites
- A single-zone Alibaba Cloud Elasticsearch cluster is created.
For more information, see Create an Alibaba Cloud Elasticsearch cluster.
- A self-managed Elasticsearch cluster and the data to be migrated are prepared.
We recommend that you use a self-managed Elasticsearch cluster deployed on Alibaba Cloud ECS instances. For more information about how to deploy a self-managed Elasticsearch cluster, see Installing and Running Elasticsearch. The self-managed Elasticsearch cluster must meet the following requirements:
- The ECS instances that host the self-managed Elasticsearch cluster are deployed in the same virtual private cloud (VPC) as the Alibaba Cloud Elasticsearch cluster. You cannot use an ECS instance that is connected to a VPC over a ClassicLink.
- The IP addresses of nodes in the Alibaba Cloud Elasticsearch cluster are added to the security groups of the ECS instances that host the self-managed Elasticsearch cluster. You can query the IP addresses of the nodes in the Kibana console of the Alibaba Cloud Elasticsearch cluster. In addition, port 9200 is enabled.
- The self-managed Elasticsearch cluster is connected to the Alibaba Cloud Elasticsearch
cluster. You can test the connectivity by running the
curl -XGET http://<host>:9200
command on the server where you run scripts.Note You can run all scripts provided in this topic on a server that can be connected to both the self-managed Elasticsearch cluster and Alibaba Cloud Elasticsearch cluster over port 9200.
Limits
Scenario | Network architecture | Support for the reindex API | Solution |
---|---|---|---|
Migrate data between Alibaba Cloud Elasticsearch clusters | Both clusters are deployed in the original network architecture. | Yes | For more information, see Use the reindex API to migrate data. |
Both clusters are deployed in the new network architecture. | No | Use OSS or Logstash to migrate data between the clusters. For more information, see Use OSS to migrate data from a self-managed Elasticsearch cluster to an Alibaba Cloud Elasticsearch cluster and Use Alibaba Cloud Logstash to migrate data from a self-managed Elasticsearch cluster to an Alibaba Cloud Elasticsearch cluster. | |
One is deployed in the original network architecture, and the other is deployed in the new network architecture. | No | ||
Migrate data from a self-managed Elasticsearch cluster that runs on ECS instances to an Alibaba Cloud Elasticsearch cluster | The Alibaba Cloud Elasticsearch cluster is deployed in the original network architecture. | Yes | For more information, see Use the reindex API to migrate data from a self-managed Elasticsearch cluster to an Alibaba Cloud Elasticsearch cluster. |
The Alibaba Cloud Elasticsearch cluster is deployed in the new network architecture. | Yes | Use the PrivateLink service to establish a private connection between the VPC where
the Alibaba Cloud Elasticsearch cluster resides and the VPC where the self-managed
Elasticsearch cluster resides. Then, use the domain name of the endpoint you obtained
and the reindex API to migrate data between the clusters. For more information, see
Migrate data from a self-managed Elasticsearch cluster to an Alibaba Cloud Elasticsearch
cluster deployed in the new network architecture.
Note Only some regions support PrivateLink. For more information, see Regions and zones that support PrivateLink. If the zone where your Alibaba Cloud Elasticsearch cluster resides does not support
PrivateLink, you cannot use the reindex API to migrate data between the clusters.
|
Precautions
- The network architecture of Alibaba Cloud Elasticsearch was adjusted in October 2020. Elasticsearch clusters created before October 2020 are deployed in the original network architecture. Elasticsearch clusters created in October 2020 or later are deployed in the new network architecture.You cannot perform cross-cluster operations, such as reindex, searches, or replication, between a cluster deployed in the original network architecture and a cluster deployed in the new network architecture. If you want to perform these operations between two clusters, you must make sure that the clusters are deployed in the same network architecture.The time when the network architecture in the China (Zhangjiakou) region and the regions outside China was adjusted is uncertain. If you want to perform the preceding operations between a cluster created before October 2020 and a cluster created in October 2020 or later in such a region, submit a ticket to contact Alibaba Cloud technical support to check whether the clusters are deployed in the same network architecture.
- Alibaba Cloud Elasticsearch clusters deployed in the new network architecture reside in the VPC within the service account of Alibaba Cloud Elasticsearch. These clusters cannot access resources in other network environments. Alibaba Cloud Elasticsearch clusters deployed in the original network architecture reside in VPCs that are created by users. These clusters can access resources in other network environments.
- To ensure data consistency and normal data read, we recommend that you do not write data to the self-managed Elasticsearch cluster during the migration. After the migration, you can read data from and write data to the Alibaba Cloud Elasticsearch cluster. If you want to write data to the self-managed Elasticsearch cluster during the migration, we recommend that you configure loop execution for the reindex operation to shorten the time during which write operations are suspended. For more information, see the method used to migrate a large volume of data (without deletions and with update time) in Step 4: Migrate data.
- If you connect to the self-managed Elasticsearch cluster or the Alibaba Cloud Elasticsearch
cluster by using its domain name, do not include path in the URL, such as
http://host:port/path
.
Procedure
Step 1: (Optional) Obtain the domain name of an endpoint
Alibaba Cloud Elasticsearch clusters created in October 2020 or later are deployed in the new network architecture. These clusters reside in the VPC within the service account of Alibaba Cloud Elasticsearch. If your Alibaba Cloud Elasticsearch cluster is deployed in the new network architecture, you need to use the PrivateLink service to establish a private connection between the VPC and your VPC. Then, obtain the domain name of the related endpoint for future use. To obtain the domain name, perform the following steps:
Step 2: Create destination indexes
Create destination indexes on the Alibaba Cloud Elasticsearch cluster based on the index settings of the self-managed Elasticsearch cluster. You can also enable the Auto Indexing feature for the Alibaba Cloud Elasticsearch cluster. However, we recommend that you do not use this feature.
#!/usr/bin/python
# -*- coding: UTF-8 -*-
# File name: indiceCreate.py
import sys
import base64
import time
import httplib
import json
## Specify the host of the self-managed Elasticsearch cluster.
oldClusterHost = "old-cluster.com"
## Specify the username of the self-managed Elasticsearch cluster. The field can be empty.
oldClusterUserName = "old-username"
## Specify the password of the self-managed Elasticsearch cluster. The field can be empty.
oldClusterPassword = "old-password"
## Specify the host of the Alibaba Cloud Elasticsearch cluster. You can obtain the information from the Basic Information page of the Alibaba Cloud Elasticsearch cluster in the Alibaba Cloud Elasticsearch console.
newClusterHost = "new-cluster.com"
## Specify the username of the Alibaba Cloud Elasticsearch cluster.
newClusterUser = "elastic"
## Specify the password of the Alibaba Cloud Elasticsearch cluster.
newClusterPassword = "new-password"
DEFAULT_REPLICAS = 0
def httpRequest(method, host, endpoint, params="", username="", password=""):
conn = httplib.HTTPConnection(host)
headers = {}
if (username != "") :
'Hello {name}, your age is {age} !'.format(name = 'Tom', age = '20')
base64string = base64.encodestring('{username}:{password}'.format(username = username, password = password)).replace('\n', '')
headers["Authorization"] = "Basic %s" % base64string;
if "GET" == method:
headers["Content-Type"] = "application/x-www-form-urlencoded"
conn.request(method=method, url=endpoint, headers=headers)
else :
headers["Content-Type"] = "application/json"
conn.request(method=method, url=endpoint, body=params, headers=headers)
response = conn.getresponse()
res = response.read()
return res
def httpGet(host, endpoint, username="", password=""):
return httpRequest("GET", host, endpoint, "", username, password)
def httpPost(host, endpoint, params, username="", password=""):
return httpRequest("POST", host, endpoint, params, username, password)
def httpPut(host, endpoint, params, username="", password=""):
return httpRequest("PUT", host, endpoint, params, username, password)
def getIndices(host, username="", password=""):
endpoint = "/_cat/indices"
indicesResult = httpGet(oldClusterHost, endpoint, oldClusterUserName, oldClusterPassword)
indicesList = indicesResult.split("\n")
indexList = []
for indices in indicesList:
if (indices.find("open") > 0):
indexList.append(indices.split()[2])
return indexList
def getSettings(index, host, username="", password=""):
endpoint = "/" + index + "/_settings"
indexSettings = httpGet(host, endpoint, username, password)
print index + " Original settings: \n" + indexSettings
settingsDict = json.loads(indexSettings)
## By default, the number of primary shards is the same as that for the indexes on the self-managed Elasticsearch cluster.
number_of_shards = settingsDict[index]["settings"]["index"]["number_of_shards"]
## The default number of replica shards is 0.
number_of_replicas = DEFAULT_REPLICAS
newSetting = "\"settings\": {\"number_of_shards\": %s, \"number_of_replicas\": %s}" % (number_of_shards, number_of_replicas)
return newSetting
def getMapping(index, host, username="", password=""):
endpoint = "/" + index + "/_mapping"
indexMapping = httpGet(host, endpoint, username, password)
print index + " Original mappings: \n" + indexMapping
mappingDict = json.loads(indexMapping)
mappings = json.dumps(mappingDict[index]["mappings"])
newMapping = "\"mappings\" : " + mappings
return newMapping
def createIndexStatement(oldIndexName):
settingStr = getSettings(oldIndexName, oldClusterHost, oldClusterUserName, oldClusterPassword)
mappingStr = getMapping(oldIndexName, oldClusterHost, oldClusterUserName, oldClusterPassword)
createstatement = "{\n" + str(settingStr) + ",\n" + str(mappingStr) + "\n}"
return createstatement
def createIndex(oldIndexName, newIndexName=""):
if (newIndexName == "") :
newIndexName = oldIndexName
createstatement = createIndexStatement(oldIndexName)
print "New index " + newIndexName + " Index settings and mappings: \n" + createstatement
endpoint = "/" + newIndexName
createResult = httpPut(newClusterHost, endpoint, createstatement, newClusterUser, newClusterPassword)
print "New index " + newIndexName + " Creation result: " + createResult
## main
indexList = getIndices(oldClusterHost, oldClusterUserName, oldClusterPassword)
systemIndex = []
for index in indexList:
if (index.startswith(".")):
systemIndex.append(index)
else :
createIndex(index, index)
if (len(systemIndex) > 0) :
for index in systemIndex:
print index + " It may be a system index and will not be recreated. You can manually recreate the index based on your business requirements."
Step 3: Configure a remote reindex whitelist for the Alibaba Cloud Elasticsearch cluster
Step 4: Migrate data
This section describes how to migrate data to an Alibaba Cloud Elasticsearch cluster deployed in the original network architecture. You can use one of the following methods to migrate data. Select a method based on the volume of data that you want to migrate and your business requirements.
Migrate a small volume of data
#!/bin/bash
# file:reindex.sh
indexName="The name of the index"
newClusterUser="The username of the Alibaba Cloud Elasticsearch cluster"
newClusterPass="The password of the Alibaba Cloud Elasticsearch cluster"
newClusterHost="The host of the Alibaba Cloud Elasticsearch cluster"
oldClusterUser="The username of the self-managed Elasticsearch cluster"
oldClusterPass="The password of the self-managed Elasticsearch cluster"
# You must configure the host of the self-managed Elasticsearch cluster in the format of [scheme]://[host]:[port]. Example: http://10.37.*.*:9200.
oldClusterHost="The host of the self-managed Elasticsearch cluster"
curl -u ${newClusterUser}:${newClusterPass} -XPOST "http://${newClusterHost}/_reindex?pretty" -H "Content-Type: application/json" -d'{
"source": {
"remote": {
"host": "'${oldClusterHost}'",
"username": "'${oldClusterUser}'",
"password": "'${oldClusterPass}'"
},
"index": "'${indexName}'",
"query": {
"match_all": {}
}
},
"dest": {
"index": "'${indexName}'"
}
}'
Migrate a large volume of data (without deletions and with update time)
#!/bin/bash
# file: circleReindex.sh
# CONTROLLING STARTUP:
# This is a script that uses the reindex API to remotely reindex data. Requirements:
# 1. Indexes are created on the Alibaba Cloud Elasticsearch cluster, or the Auto Indexing and dynamic mapping features are enabled for the cluster.
# 2. A remote reindex whitelist is configured for the Alibaba Cloud Elasticsearch cluster in the YML File Configuration panel of the Alibaba Cloud Elasticsearch console. For example, the following information is specified in the Other Configurations field: reindex.remote.whitelist: 172.16.**.**:9200.
# 3. The host is configured in the format of [scheme]://[host]:[port].
USAGE="Usage: sh circleReindex.sh <count>
count: the number of reindex operations that you can perform. A negative number indicates loop execution.
Example:
sh circleReindex.sh 1
sh circleReindex.sh 5
sh circleReindex.sh -1"
indexName="The name of the index"
newClusterUser="The username of the Alibaba Cloud Elasticsearch cluster"
newClusterPass="The password of the Alibaba Cloud Elasticsearch cluster"
oldClusterUser="The username of the self-managed Elasticsearch cluster"
oldClusterPass="The password of the self-managed Elasticsearch cluster"
## http://myescluster.com
newClusterHost="The host of the Alibaba Cloud Elasticsearch cluster"
# You must configure the host of the self-managed Elasticsearch cluster in the format of [scheme]://[host]:[port]. Example: http://10.37.*.*:9200.
oldClusterHost="The host of the self-managed Elasticsearch cluster"
timeField="The update time of data"
reindexTimes=0
lastTimestamp=0
curTimestamp=`date +%s`
hasError=false
function reIndexOP() {
reindexTimes=$[${reindexTimes} + 1]
curTimestamp=`date +%s`
ret=`curl -u ${newClusterUser}:${newClusterPass} -XPOST "${newClusterHost}/_reindex?pretty" -H "Content-Type: application/json" -d '{
"source": {
"remote": {
"host": "'${oldClusterHost}'",
"username": "'${oldClusterUser}'",
"password": "'${oldClusterPass}'"
},
"index": "'${indexName}'",
"query": {
"range" : {
"'${timeField}'" : {
"gte" : '${lastTimestamp}',
"lt" : '${curTimestamp}'
}
}
}
},
"dest": {
"index": "'${indexName}'"
}
}'`
lastTimestamp=${curTimestamp}
echo "${reindexTimes} reindex operations are performed. The last reindex operation is complete at ${lastTimestamp}. Result: ${ret}."
if [[ ${ret} == *error* ]]; then
hasError=true
echo "An unknown error occurred when you perform this operation. All subsequent operations are suspended."
fi
}
function start() {
## A negative number indicates loop execution.
if [[ $1 -lt 0 ]]; then
while :
do
reIndexOP
done
elif [[ $1 -gt 0 ]]; then
k=0
while [[ k -lt $1 ]] && [[ ${hasError} == false ]]; do
reIndexOP
let ++k
done
fi
}
## main
if [ $# -lt 1 ]; then
echo "$USAGE"
exit 1
fi
echo "Start the reindex operation for the ${indexName} index."
start $1
echo "${reindexTimes} reindex operations are performed."
Migrate a large volume of data (without deletions and update time)
#!/bin/bash
# file:miss.sh
indexName="The name of the index"
newClusterUser="The username of the Alibaba Cloud Elasticsearch cluster"
newClusterPass="The password of the Alibaba Cloud Elasticsearch cluster"
newClusterHost="The host of the Alibaba Cloud Elasticsearch cluster"
oldClusterUser="The username of the self-managed Elasticsearch cluster"
oldClusterPass="The password of the self-managed Elasticsearch cluster"
# You must configure the host of the self-managed Elasticsearch cluster in the format of [scheme]://[host]:[port]. Example: http://10.37.*.*:9200.
oldClusterHost="The host of the self-managed Elasticsearch cluster"
timeField="updatetime"
curl -u ${newClusterUser}:${newClusterPass} -XPOST "http://${newClusterHost}/_reindex?pretty" -H "Content-Type: application/json" -d '{
"source": {
"remote": {
"host": "'${oldClusterHost}'",
"username": "'${oldClusterUser}'",
"password": "'${oldClusterPass}'"
},
"index": "'${indexName}'",
"query": {
"bool": {
"must_not": {
"exists": {
"field": "'${timeField}'"
}
}
}
}
},
"dest": {
"index": "'${indexName}'"
}
}'
FAQ
- Problem: When I run the curl command, the system displays
{"error":"Content-Type header [application/x-www-form-urlencoded] is not supported","status":406}
. What do I do?Solution: Add-H "Content-Type: application/json"
to the curl command and try again.// Obtain all the indexes on the self-managed Elasticsearch cluster. If you do not have the required permissions, remove the "-u user:pass" parameter. Replace oldClusterHost with the information about the host of the self-managed Elasticsearch cluster. curl -u user:pass -XGET http://oldClusterHost/_cat/indices | awk '{print $3}' // Obtain the settings and mappings of the index that you want to migrate for the specified user based on the returned indexes. Replace indexName with the index name that you want to query. curl -u user:pass -XGET http://oldClusterHost/indexName/_settings,_mapping?pretty=true // Create an index on the Alibaba Cloud Elasticsearch cluster based on the settings and mappings that you obtained. You can set the number of replica shards to 0 to accelerate data migration, and change the number to 1 after data is migrated. // Replace newClusterHost with the host information of the Alibaba Cloud Elasticsearch cluster, testindex with the name of the index that you have created, and testtype with the type of the index. curl -u user:pass -XPUT http://<newClusterHost>/<testindex> -d '{ "testindex" : { "settings" : { "number_of_shards" : "5", // Specify the number of primary shards for the index on the self-managed Elasticsearch cluster, such as 5. "number_of_replicas" : "0" // Set the number of replica shards to 0. } }, "mappings" : { // Specify the mappings of the index on the self-managed Elasticsearch cluster. Example: "testtype" : { "properties" : { "uid" : { "type" : "long" }, "name" : { "type" : "text" }, "create_time" : { "type" : "long" } } } } } }'
- Problem: What do I do if the source index stores large volumes of data and the data
migration is slow?
Solution:
- If you use the reindex API to migrate data, data is migrated in scroll mode. To improve the efficiency of data migration, you can increase the scroll size or configure a sliced scroll. The sliced scroll can parallelize the reindex process. For more information, see the reindex API.
- If the self-managed Elasticsearch cluster stores large volumes of data, we recommend that you use snapshots stored in OSS to migrate data. For more information, see Use OSS to migrate data from a self-managed Elasticsearch cluster to an Alibaba Cloud Elasticsearch cluster.
- If the source index stores large volumes of data, you can set the number of replica
shards to 0 and the refresh interval to -1 for the destination index before you migrate
data to accelerate data migration. After data is migrated, restore the settings to
the original values.
// You can set the number of replica shards to 0 and disable the refresh feature to accelerate the data migration. curl -u user:password -XPUT 'http://<host:port>/indexName/_settings' -d' { "number_of_replicas" : 0, "refresh_interval" : "-1" }' // After data is migrated, set the number of replica shards to 1 and the refresh interval to 1s, which is the default value. curl -u user:password -XPUT 'http://<host:port>/indexName/_settings' -d' { "number_of_replicas" : 1, "refresh_interval" : "1s" }'