You can use the reindex API to migrate data between Elasticsearch clusters. This topic describes the migration procedure in detail.

Background information

You can use the reindex API to perform the following operations:
  • Migrate data between Elasticsearch clusters.
  • Reindex the data in an index whose shards are inappropriately configured. For example, the data volume is large, but only a few shards are configured for the index. This slows down data write operations.
  • Change the mapping configuration of an index that stores a large volume of data. To do this, use the reindex API to replicate the data into an index that uses the new mapping configuration. This operation requires only a short period of time. You can also insert the data into a new index again, but that operation is time-consuming.
    Note After you define the mapping configuration for an index in an Elasticsearch cluster and insert data into the index, you cannot modify the mapping configuration.

Prerequisites

  • Two Alibaba Cloud Elasticsearch clusters are created. One is used as a local cluster, and the other is used as a remote cluster.

    For more information, see Create an Alibaba Cloud Elasticsearch cluster. The two clusters must belong to the same virtual private cloud (VPC) and vSwitch. In this example, an Elasticsearch V6.7.0 cluster is used as the local cluster, and an Elasticsearch V6.3.2 cluster is used as the remote cluster.

  • Test data is prepared.
    • Local cluster
      Create a destination index in the local cluster.
      PUT dest
      {
        "settings": {
          "number_of_shards": 5,
          "number_of_replicas": 1
        }
      }
    • Remote cluster
      Prepare the data that you want to migrate. In this example, the data in the "Quick start" topic is used. For more information, see Getting started.
      Important If you want to use a cluster that runs Elasticsearch V7.0 or later as a remote cluster, you must set the index type to _doc.

Limits

The network architecture of Alibaba Cloud Elasticsearch was adjusted in October 2020. Alibaba Cloud Elasticsearch clusters created before October 2020 are deployed in the original network architecture. Alibaba Cloud Elasticsearch clusters created in October 2020 or later are deployed in the new network architecture. Because of this adjustment, you cannot use the reindex API to migrate data between clusters in some scenarios. The following table describes the scenarios and provides a data migration solution for each of them.
| Scenario | Network architecture | Support for the reindex API | Solution |
| --- | --- | --- | --- |
| Migrate data between Alibaba Cloud Elasticsearch clusters | Both clusters are deployed in the original network architecture. | Yes | For more information, see Use the reindex API to migrate data. |
| Migrate data between Alibaba Cloud Elasticsearch clusters | Both clusters are deployed in the new network architecture. | No | Use OSS or Logstash to migrate data between the clusters. For more information, see Use OSS to migrate data from a self-managed Elasticsearch cluster to an Alibaba Cloud Elasticsearch cluster and Use Alibaba Cloud Logstash to migrate data from a self-managed Elasticsearch cluster to an Alibaba Cloud Elasticsearch cluster. |
| Migrate data between Alibaba Cloud Elasticsearch clusters | One is deployed in the original network architecture, and the other is deployed in the new network architecture. | No | Use OSS or Logstash to migrate data between the clusters. See the topics listed in the preceding row. |
| Migrate data from a self-managed Elasticsearch cluster that runs on ECS instances to an Alibaba Cloud Elasticsearch cluster | The Alibaba Cloud Elasticsearch cluster is deployed in the original network architecture. | Yes | For more information, see Use the reindex API to migrate data from a self-managed Elasticsearch cluster to an Alibaba Cloud Elasticsearch cluster. |
| Migrate data from a self-managed Elasticsearch cluster that runs on ECS instances to an Alibaba Cloud Elasticsearch cluster | The Alibaba Cloud Elasticsearch cluster is deployed in the new network architecture. | Yes | Use the PrivateLink service to establish a private connection between the VPC where the Alibaba Cloud Elasticsearch cluster resides and the VPC where the self-managed Elasticsearch cluster resides. Then, use the domain name of the endpoint you obtained and the reindex API to migrate data between the clusters. For more information, see Migrate data from a self-managed Elasticsearch cluster to an Alibaba Cloud Elasticsearch cluster deployed in the new network architecture. |
Note Only some regions support PrivateLink. For more information, see Regions and zones that support PrivateLink. If the zone where your Alibaba Cloud Elasticsearch cluster resides does not support PrivateLink, you cannot use the reindex API to migrate data between the clusters.
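
The rows above can also be read as a lookup from scenario and network architecture to reindex API support. The following Python sketch encodes them under stated assumptions: the function names and string labels are hypothetical and are not part of any Alibaba Cloud API, and October 1, 2020 is assumed as the cutoff because the text only states the month.

```python
from datetime import date

# Assumed cutoff: the text only says "October 2020", so the first day of the month is used.
ARCHITECTURE_CUTOFF = date(2020, 10, 1)


def network_architecture(created_on):
    """Return "original" for clusters created before October 2020, else "new"."""
    return "original" if created_on < ARCHITECTURE_CUTOFF else "new"


# Keys are (scenario, architecture) pairs. The labels are illustrative only.
REINDEX_SUPPORT = {
    ("alibaba_to_alibaba", "both_original"): True,
    ("alibaba_to_alibaba", "both_new"): False,
    ("alibaba_to_alibaba", "mixed"): False,
    ("self_managed_to_alibaba", "original"): True,
    # Supported only through a PrivateLink connection, in regions that offer PrivateLink.
    ("self_managed_to_alibaba", "new"): True,
}


def alibaba_pair_architecture(local_created_on, remote_created_on):
    """Classify two Alibaba Cloud clusters as both_original, both_new, or mixed."""
    kinds = {
        network_architecture(local_created_on),
        network_architecture(remote_created_on),
    }
    if kinds == {"original"}:
        return "both_original"
    if kinds == {"new"}:
        return "both_new"
    return "mixed"


def reindex_supported(scenario, architecture):
    """Look up whether the reindex API can be used for the given table row."""
    return REINDEX_SUPPORT[(scenario, architecture)]


# Example: one cluster created before the adjustment, one created after it.
pair = alibaba_pair_architecture(date(2020, 3, 1), date(2021, 6, 15))
print(pair, reindex_supported("alibaba_to_alibaba", pair))
```

A pair that is classified as both_new or mixed maps to the OSS or Logstash solution in the table.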

Procedure

  1. Log on to the Alibaba Cloud Elasticsearch console.
  2. In the left-side navigation pane, click Elasticsearch Clusters.
  3. Navigate to the desired cluster.
    1. In the top navigation bar, select the resource group to which the cluster belongs and the region where the cluster resides.
    2. On the Elasticsearch Clusters page, find the cluster and click its ID.
  4. Configure a reindex whitelist for the local cluster.
    1. In the left-side navigation pane of the page that appears, click Cluster Configuration.
    2. On the page that appears, click Modify Configuration on the right side of YML Configuration.
    3. In the Other Configurations field of the YML File Configuration panel, configure a reindex whitelist.
      For more information about how to configure a reindex whitelist, see Configure a remote reindex whitelist.
      • If the remote cluster is a single-zone cluster, specify the reindex whitelist in the format of <Domain name of the cluster>:9200.
        reindex.remote.whitelist: ["es-cn-09k1rgid9000g****.elasticsearch.aliyuncs.com:9200"]
      • If the remote cluster is a multi-zone cluster, the reindex whitelist must contain the IP addresses of all the data nodes in the cluster and the port number of the cluster.
        reindex.remote.whitelist: ["10.0.xx.xx:9200","10.0.xx.xx:9200","10.0.xx.xx:9200","10.15.xx.xx:9200","10.15.xx.xx:9200","10.15.xx.xx:9200"]
        Note You can obtain the IP addresses of all the data nodes in a cluster from the Node Visualization tab on the Basic Information page of the cluster. For more information, see View the basic information of nodes.
    4. Select This operation will restart the cluster. Continue? and click OK.
  5. In the local cluster, call the reindex API to reindex data.
    Log on to the Kibana console of the local cluster and run the following command to reindex data:
    POST _reindex
    {
      "source": {
        "remote": {
          "host": "http://es-cn-09k1rgid9000g****.elasticsearch.aliyuncs.com:9200",
          "username": "elastic",
          "password": "your_password"
        },
        "index": "product_info",
        "query": {
          "match": {
            "productName": "Wealth management"
          }
        }
      },
      "dest": {
        "index": "dest"
      }
    }
    The following list describes the parameters, grouped by the part of the request in which they appear.
    • source
      • host: The URL that is used to connect to the remote cluster. The URL must contain the protocol, domain name, and port number. Example: https://otherhost:9200.
        • If the remote cluster is a single-zone cluster, the value of the host parameter must be in the format of http://<Domain name of the cluster>:9200.
          Note You can obtain the domain name from the Basic Information page of the cluster. For more information, see View the basic information of a cluster.
        • If the remote cluster is a multi-zone cluster, the value of the host parameter must be in the format of http://<IP address of a data node in the cluster>:9200.
      • username: The username that is used to connect to the remote cluster. This parameter is optional. It is required only if basic authentication needs to be performed on requests that are sent to the remote cluster. The default username that is used to connect to Alibaba Cloud Elasticsearch clusters is elastic.
        Important
        • For security purposes, we recommend that you use HTTPS to send requests if basic authentication needs to be performed. Otherwise, the required password is transmitted in plaintext.
        • For Alibaba Cloud Elasticsearch clusters, you can use HTTPS in host only after you enable the protocol for the clusters.
      • password: The password that is used to connect to the remote cluster. The password is specified when you create the cluster. If you forget the password, you can reset it. For more information about the procedure and precautions for resetting the password, see Reset the access password for an Elasticsearch cluster.
      • index: The source index in the remote cluster.
      • query: Specifies the data that you want to migrate. For more information, see Reindex API.
    • dest
      • index: The destination index in the local cluster.
    Note When you reindex data from a remote cluster, manual slicing and automatic slicing are not supported for the data. For more information, see Manual slicing and Automatic slicing.
    If the command is successfully run, the following result is returned:
    {
      "took" : 51,
      "timed_out" : false,
      "total" : 2,
      "updated" : 2,
      "created" : 0,
      "deleted" : 0,
      "batches" : 1,
      "version_conflicts" : 0,
      "noops" : 0,
      "retries" : {
        "bulk" : 0,
        "search" : 0
      },
      "throttled_millis" : 0,
      "requests_per_second" : -1.0,
      "throttled_until_millis" : 0,
      "failures" : [ ]
    }
  6. Run the following command to view the migrated data:
    GET dest/_search
    The following figures show the command outputs.
    • Single-zone cluster: [Figure: View the migrated data]
    • Multi-zone cluster: [Figure: View the migrated data]
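
The reindex request in step 5 and the check of its result can also be sketched outside Kibana. The following Python example is a minimal sketch, not an official client: the helper names build_remote_reindex_body and summarize_reindex_response are hypothetical, the host and credentials are the placeholders used in this topic, and nothing is sent over the network.

```python
import json


def build_remote_reindex_body(host, source_index, dest_index,
                              username=None, password=None, query=None):
    """Build the JSON body for POST _reindex with a remote source."""
    remote = {"host": host}
    if username is not None:
        # Credentials are needed only if the remote cluster uses basic authentication.
        remote["username"] = username
        remote["password"] = password
    source = {"remote": remote, "index": source_index}
    if query is not None:
        # Without a query, all documents in the source index are migrated.
        source["query"] = query
    return {"source": source, "dest": {"index": dest_index}}


def summarize_reindex_response(response):
    """Return (ok, written). ok is False on a timeout or any reported failure."""
    ok = not response.get("timed_out", False) and not response.get("failures")
    written = response.get("created", 0) + response.get("updated", 0)
    return ok, written


body = build_remote_reindex_body(
    host="http://es-cn-09k1rgid9000g****.elasticsearch.aliyuncs.com:9200",
    source_index="product_info",
    dest_index="dest",
    username="elastic",
    password="your_password",
    query={"match": {"productName": "Wealth management"}},
)
print(json.dumps(body, indent=2))

# Fields taken from the sample result shown in step 5.
sample_response = {
    "timed_out": False,
    "total": 2,
    "updated": 2,
    "created": 0,
    "failures": [],
}
print(summarize_reindex_response(sample_response))
```

Comparing the number of written documents with the total field of the response is a quick way to confirm that every matched document reached the destination index.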

Summary

The configurations that are required to migrate data from a single-zone cluster are similar to the configurations that are required to migrate data from a multi-zone cluster. The following table lists the differences. The host values use HTTP, as in the procedure. Use HTTPS only after you enable the protocol for the remote cluster.

| Cluster type | Configuration of the reindex whitelist | Configuration of the host parameter |
| --- | --- | --- |
| Single-zone cluster | <Domain name of the cluster>:9200 | http://<Domain name of the cluster>:9200 |
| Multi-zone cluster | Combination of the IP addresses of all the data nodes in the cluster and the port number of the cluster | http://<IP address of a data node in the cluster>:9200 |
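
The difference between the two cluster types can be expressed as a small helper. The following Python sketch is illustrative only: the function names are hypothetical, the IP addresses are made-up placeholders, and the HTTP scheme is assumed as in the procedure.

```python
def whitelist_setting(endpoints, port=9200):
    """Format the reindex.remote.whitelist line for the YML configuration."""
    entries = ",".join(f'"{endpoint}:{port}"' for endpoint in endpoints)
    return f"reindex.remote.whitelist: [{entries}]"


def remote_host(endpoint, port=9200, scheme="http"):
    """Format the value of the host parameter in the reindex request."""
    return f"{scheme}://{endpoint}:{port}"


def reindex_settings(cluster_type, domain=None, data_node_ips=()):
    """Return (whitelist line, host value) for a remote cluster."""
    if cluster_type == "single-zone":
        # Single-zone: both settings use the domain name of the cluster.
        return whitelist_setting([domain]), remote_host(domain)
    if cluster_type == "multi-zone":
        if not data_node_ips:
            raise ValueError("a multi-zone cluster requires data node IP addresses")
        # Multi-zone: the whitelist lists every data node, the host uses one of them.
        return whitelist_setting(data_node_ips), remote_host(data_node_ips[0])
    raise ValueError(f"unknown cluster type: {cluster_type}")


# Placeholder IP addresses for illustration.
whitelist, host = reindex_settings("multi-zone", data_node_ips=["10.0.0.1", "10.0.0.2"])
print(whitelist)
print(host)
```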

Additional information

When you use the reindex API to reindex data, you can specify a batch size and timeout periods.
  • Batch size

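    When you reindex data from a remote cluster, the retrieved documents are buffered on the heap. The maximum size of this buffer is 100 MB by default. The batch size is the number of documents that are retrieved in each batch. It is specified by the size parameter in source, and the default value is 1000. If an index in the remote cluster contains large documents, you must change the batch size to a smaller value.

    In the following example, size is set to 10. Each batch then contains at most 10 documents.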
    POST _reindex
    {
      "source": {
        "remote": {
          "host": "http://otherhost:9200"
        },
        "index": "source",
        "size": 10,
        "query": {
          "match": {
            "test": "data"
          }
        }
      },
      "dest": {
        "index": "dest"
      }
    }
  • Timeout periods

    Use socket_timeout to specify a timeout period for socket reads. Use connect_timeout to specify a timeout period for establishing connections. In open source Elasticsearch, both parameters default to 30s.

    In the following example, socket_timeout is set to 1m, and connect_timeout is set to 10s.
    POST _reindex
    {
      "source": {
        "remote": {
          "host": "http://otherhost:9200",
          "socket_timeout": "1m",
          "connect_timeout": "10s"
        },
        "index": "source",
        "query": {
          "match": {
            "test": "data"
          }
        }
      },
      "dest": {
        "index": "dest"
      }
    }
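
The batch size and the timeout periods can be set in the same request. The following Python sketch shows one way to apply them to an existing request body. The helper name tune_remote_reindex is hypothetical, the host is the placeholder used in the preceding examples, and the sketch only builds the body without sending it.

```python
import copy
import json


def tune_remote_reindex(body, size=None, socket_timeout=None, connect_timeout=None):
    """Return a copy of a remote reindex body with batch size and timeouts applied."""
    tuned = copy.deepcopy(body)  # leave the caller's body unchanged
    source = tuned["source"]
    if size is not None:
        if size <= 0:
            raise ValueError("size must be a positive number of documents")
        # size belongs to source, not to source.remote.
        source["size"] = size
    if socket_timeout is not None:
        source["remote"]["socket_timeout"] = socket_timeout
    if connect_timeout is not None:
        source["remote"]["connect_timeout"] = connect_timeout
    return tuned


base = {
    "source": {
        "remote": {"host": "http://otherhost:9200"},
        "index": "source",
    },
    "dest": {"index": "dest"},
}

tuned = tune_remote_reindex(base, size=10, socket_timeout="1m", connect_timeout="10s")
print(json.dumps(tuned, indent=2))
```

The printed body can be pasted after POST _reindex in the Kibana console.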