You can use the reindex API to migrate data between Elasticsearch clusters. This topic describes the migration procedure in detail.

Background information

You can use the reindex API to perform the following operations:
  • Migrate data between Elasticsearch clusters.
  • Reindex the data in an index whose shards are inappropriately configured. For example, the data volume is large, but only a few shards are configured for the index. This slows down data write operations.
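  • Change the mapping configuration of an index that stores a large volume of data. Because the mapping configuration of an existing index cannot be modified, you must create an index that uses the new mapping configuration and use the reindex API to copy the data to the new index. This requires only a short period of time. You can also insert the data into the new index from the original data source again. However, that operation is time-consuming.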
    Note After you define the mapping configuration for an index in an Elasticsearch cluster and insert data into the index, you cannot modify the mapping configuration.
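
The last two use cases follow the same pattern: create a new index with the required settings or mapping configuration, and then reindex the existing data into it. The following Python code is a minimal sketch of the two requests that are involved. It only builds and prints the request bodies and does not send them. The function name and the index names old_index and new_index are hypothetical and are used only for illustration.

```python
import json


def build_reshard_requests(source_index, dest_index, number_of_shards, mappings=None):
    """Return (method, path, body) tuples: create the new index, then reindex into it."""
    create_body = {"settings": {"number_of_shards": number_of_shards}}
    if mappings is not None:
        # The mapping is passed through unchanged. Its exact shape depends on
        # the Elasticsearch version of the cluster.
        create_body["mappings"] = mappings
    reindex_body = {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
    }
    return [
        ("PUT", "/" + dest_index, create_body),
        ("POST", "/_reindex", reindex_body),
    ]


# Hypothetical index names: copy old_index into new_index, which has 10 primary shards.
planned = build_reshard_requests("old_index", "new_index", 10)
for method, path, body in planned:
    print(method, path)
    print(json.dumps(body, indent=2))
```

You can paste the printed requests into the Kibana console in the order in which they are printed.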

Prerequisites

  • Two Alibaba Cloud Elasticsearch clusters are created. One is used as a local cluster, and the other is used as a remote cluster.

    For more information, see Create an Alibaba Cloud Elasticsearch cluster. The two clusters must belong to the same virtual private cloud (VPC) and vSwitch. In this example, an Elasticsearch V6.7.0 cluster is used as the local cluster, and an Elasticsearch V6.3.2 cluster is used as the remote cluster.

  • Test data is prepared.
    • Local cluster
      Create a destination index in the local cluster.
      PUT dest
      {
        "settings": {
          "number_of_shards": 5,
          "number_of_replicas": 1
        }
      }
    • Remote cluster
      Prepare the data that you want to migrate. In this example, the data in the "Quick start" topic is used. For more information, see Getting started.
      Notice If you want to use a cluster that runs Elasticsearch V7.0 or later as a remote cluster, you must set the index type to _doc.

Precautions

The network architecture of Alibaba Cloud Elasticsearch was adjusted in October 2020. Due to this adjustment, you cannot use the reindex API to migrate data between clusters in some scenarios. The following table describes such scenarios and the data migration solutions in these scenarios. Alibaba Cloud Elasticsearch clusters created before October 2020 are deployed in the original network architecture, and those created in October 2020 or later are deployed in the new network architecture.
| Scenario | Network architecture | Support for the reindex API | Solution |
| --- | --- | --- | --- |
| Use the reindex API to migrate data between Alibaba Cloud Elasticsearch clusters | Both clusters are deployed in the original network architecture. | Yes | For more information, see Use the reindex API to migrate data. |
| Use the reindex API to migrate data between Alibaba Cloud Elasticsearch clusters | Both clusters are deployed in the new network architecture. | No | Use OSS or Logstash to migrate data between the clusters. For more information, see Use OSS to migrate data from a self-managed Elasticsearch cluster to an Alibaba Cloud Elasticsearch cluster and Use Alibaba Cloud Logstash to migrate data from a self-managed Elasticsearch cluster to an Alibaba Cloud Elasticsearch cluster. |
| Use the reindex API to migrate data between Alibaba Cloud Elasticsearch clusters | One is deployed in the original network architecture, and the other is deployed in the new network architecture. | No | Same as the preceding row: Use OSS or Logstash to migrate data between the clusters. |
| Migrate data from a self-managed Elasticsearch cluster that runs on ECS instances to an Alibaba Cloud Elasticsearch cluster | The Alibaba Cloud Elasticsearch cluster is deployed in the original network architecture. | Yes | For more information, see Use the reindex API to migrate data from a self-managed Elasticsearch cluster to an Alibaba Cloud Elasticsearch cluster. |
| Migrate data from a self-managed Elasticsearch cluster that runs on ECS instances to an Alibaba Cloud Elasticsearch cluster | The Alibaba Cloud Elasticsearch cluster is deployed in the new network architecture. | Yes | Use the PrivateLink service to establish a network connection between the Alibaba Cloud Elasticsearch cluster and the self-managed Elasticsearch cluster that runs on ECS instances. This way, the service account of Alibaba Cloud Elasticsearch can be used to access the self-managed Elasticsearch cluster. Then, use the domain name of the endpoint you obtained and the reindex API to migrate data between the clusters. For more information, see Migrate data from a self-managed Elasticsearch cluster to an Alibaba Cloud Elasticsearch cluster deployed in the new network architecture. Note Only some regions support PrivateLink. For more information, see Regions and zones that support PrivateLink. If the zone where your Alibaba Cloud Elasticsearch cluster resides does not support PrivateLink, you cannot use the reindex API to migrate data between the two clusters. |
Note
  • Alibaba Cloud Elasticsearch clusters deployed in the new network architecture reside in an exclusive VPC for Alibaba Cloud Elasticsearch. These clusters cannot access resources in other network environments. Alibaba Cloud Elasticsearch clusters deployed in the original network architecture reside in VPCs that are created by users. These clusters can still access resources in other network environments.
  • The network architecture in the China (Zhangjiakou) region and the regions outside China was adjusted before October 2020. If you want to perform operations between a cluster that is created before October 2020 and a cluster that is created in October 2020 or later in such a region, submit a ticket to contact Alibaba Cloud technical support to check whether the network architecture supports the operations.
  • Clusters created in other regions before October 2020 are deployed in the original network architecture, and those created in other regions in October 2020 or later are deployed in the new network architecture.
  • To ensure data consistency, we recommend that you stop writing data to the self-managed Elasticsearch cluster before the migration. This way, you can continue to read data from the cluster during the migration. After the migration, you can read data from and write data to the Alibaba Cloud Elasticsearch cluster. If you do not stop writing data to the self-managed Elasticsearch cluster, we recommend that you configure loop execution for reindex operations in the code to shorten the time during which write operations are suspended. For more information, see the method used to migrate a large volume of data (without deletions and with update time) in the Step 3: Migrate data section.
  • If you connect to the self-managed Elasticsearch cluster or the Alibaba Cloud Elasticsearch cluster by using its domain name, do not include a path in the URL. For example, a URL in the format of http://host:port/path is not supported. Use http://host:port instead.
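
The loop execution that is mentioned in the preceding note can be sketched as a series of reindex requests, each of which is limited to the documents that were updated since the previous pass. The following Python code is a minimal sketch under stated assumptions, not the procedure of the referenced topic: each document is assumed to contain a date field that records its update time in epoch milliseconds (hypothetically named updated_at), no documents are deleted in the source cluster, and the send argument is a function that you provide to submit a request body to POST _reindex on the destination cluster.

```python
import time


def incremental_body(remote, source_index, dest_index, since_ms, until_ms, time_field="updated_at"):
    """Build a reindex body limited to documents updated in [since_ms, until_ms)."""
    return {
        "source": {
            "remote": remote,
            "index": source_index,
            "query": {"range": {time_field: {"gte": since_ms, "lt": until_ms}}},
        },
        "dest": {"index": dest_index},
    }


def run_passes(send, remote, source_index, dest_index, passes,
               clock=lambda: int(time.time() * 1000)):
    """Send one reindex request per pass and return the end of the last window."""
    since_ms = 0  # The first pass copies everything that was updated before "now".
    for _ in range(passes):
        until_ms = clock()
        send(incremental_body(remote, source_index, dest_index, since_ms, until_ms))
        since_ms = until_ms  # The next pass starts where this pass ended.
    return since_ms


# Dry run with a fake clock: the request bodies are collected instead of being sent.
sent = []
ticks = iter([1000, 2000, 3000])
last = run_passes(sent.append, {"host": "http://otherhost:9200"},
                  "source", "dest", 3, clock=lambda: next(ticks))
print(len(sent), last)
```

Each pass is usually shorter than the previous one because it copies only recent changes. You then need to stop write operations only for the final pass. In practice, you may want the time windows to overlap slightly so that documents that are written close to a window boundary are not missed.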

Procedure

  1. Log on to the Elasticsearch console.
  2. In the left-side navigation pane, click Elasticsearch Clusters.
  3. Navigate to the desired cluster.
    1. In the top navigation bar, select the resource group to which the cluster belongs and the region where the cluster resides.
    2. In the left-side navigation pane, click Elasticsearch Clusters. On the Elasticsearch Clusters page, find the cluster and click its ID.
  4. Configure a reindex whitelist for the local cluster.
    1. In the left-side navigation pane of the page that appears, click Cluster Configuration.
    2. On the page that appears, click Modify Configuration on the right side of YML Configuration.
    3. In the Other Configurations field of the YML File Configuration panel, configure a reindex whitelist.
      For more information about how to configure a reindex whitelist, see Configure a remote reindex whitelist.
      • If the remote cluster is a single-zone cluster, specify the reindex whitelist in the format of <Domain name of the cluster>:9200.
        Configuration example of a remote single-zone cluster:
        reindex.remote.whitelist: ["es-cn-09k1rgid9000g****.elasticsearch.aliyuncs.com:9200"]
      • If the remote cluster is a multi-zone cluster, the reindex whitelist must contain the IP addresses of all the data nodes in the cluster and the port number of the cluster.
        Configuration example of a remote multi-zone cluster:
        reindex.remote.whitelist: ["10.0.xx.xx:9200","10.0.xx.xx:9200","10.0.xx.xx:9200","10.15.xx.xx:9200","10.15.xx.xx:9200","10.15.xx.xx:9200"]
        Note You can obtain the IP addresses of all the data nodes in a cluster from the Node Visualization tab on the Basic Information page of the cluster. For more information, see View the basic information of nodes.
    4. Select This operation will restart the cluster. Continue? and click OK.
  5. In the local cluster, call the reindex API to reindex data.
    Log on to the Kibana console of the local cluster and run the following command to reindex data:
    POST _reindex
    {
      "source": {
        "remote": {
          "host": "http://es-cn-09k1rgid9000g****.elasticsearch.aliyuncs.com:9200",
          "username": "elastic",
          "password": "your_password"
        },
        "index": "product_info",
        "query": {
          "match": {
            "productName": "Wealth management"
          }
        }
      },
      "dest": {
        "index": "dest"
      }
    }
    The following list describes the parameters in the command.
    • source
      • host: The URL that is used to connect to the remote cluster. The URL must contain the protocol, domain name, and port number. Example: https://otherhost:9200.
        • If the remote cluster is a single-zone cluster, the value of the host parameter must be in the format of http://<Domain name of the cluster>:9200.
          Note You can obtain the domain name from the Basic Information page of the cluster. For more information, see View the basic information of a cluster.
        • If the remote cluster is a multi-zone cluster, the value of the host parameter must be in the format of http://<IP address of a data node in the cluster>:9200.
      • username: The username that is used to connect to the remote cluster. This parameter is optional. It is required only if basic authentication needs to be performed on requests that are sent to the remote cluster. The default username that is used to connect to Alibaba Cloud Elasticsearch clusters is elastic.
        Notice
        • For security purposes, we recommend that you use HTTPS to send requests if basic authentication needs to be performed. Otherwise, the required password is transmitted in plaintext.
        • For Alibaba Cloud Elasticsearch clusters, you can use HTTPS in host only after you enable the protocol for the clusters.
      • password: The password that is used to connect to the remote cluster. The password is specified when you create the cluster. If you forget the password, you can reset it. For more information about the procedure and precautions for resetting the password, see Reset the access password for an Elasticsearch cluster.
      • index: The source index in the remote cluster.
      • query: Specifies the data that you want to migrate. For more information, see Reindex API.
    • dest
      • index: The destination index in the local cluster.
    Note When you reindex data from a remote cluster, manual slicing and automatic slicing are not supported for the data. For more information, see Manual slicing and Automatic slicing.
    If the command is successfully run, the following result is returned:
    {
      "took" : 51,
      "timed_out" : false,
      "total" : 2,
      "updated" : 2,
      "created" : 0,
      "deleted" : 0,
      "batches" : 1,
      "version_conflicts" : 0,
      "noops" : 0,
      "retries" : {
        "bulk" : 0,
        "search" : 0
      },
      "throttled_millis" : 0,
      "requests_per_second" : -1.0,
      "throttled_until_millis" : 0,
      "failures" : [ ]
    }
  6. Run the following command to view the migrated data:
    GET dest/_search
    The following figures show the command outputs.
    • Single-zone cluster
      [Figure: View the migrated data]
    • Multi-zone cluster
      [Figure: View the migrated data]
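
If you call the reindex API from a script instead of the Kibana console, you can check the response that is shown in step 5 programmatically. The following Python code is a minimal sketch that inspects only the fields that appear in the sample response. The function name summarize_reindex and the structure of its return value are hypothetical.

```python
def summarize_reindex(response):
    """Collect the obvious problem signals from a reindex response."""
    problems = []
    if response.get("timed_out"):
        problems.append("request timed out")
    conflicts = response.get("version_conflicts", 0)
    if conflicts > 0:
        problems.append("%d version conflicts" % conflicts)
    failures = response.get("failures", [])
    if failures:
        problems.append("%d failures" % len(failures))
    # Documents that were written to the destination index by this request.
    written = response.get("created", 0) + response.get("updated", 0)
    return {"ok": not problems, "written": written, "problems": problems}


# The sample response from step 5.
sample = {
    "took": 51, "timed_out": False, "total": 2, "updated": 2, "created": 0,
    "deleted": 0, "batches": 1, "version_conflicts": 0, "noops": 0,
    "retries": {"bulk": 0, "search": 0}, "throttled_millis": 0,
    "requests_per_second": -1.0, "throttled_until_millis": 0, "failures": [],
}
print(summarize_reindex(sample))
```

For the sample response, the summary reports no problems and two written documents. Both documents are counted in updated rather than created, which indicates that documents with the same IDs already existed in the destination index.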

Summary

The configurations that are required to migrate data from a single-zone cluster are similar to the configurations that are required to migrate data from a multi-zone cluster. The following table lists the differences. If HTTPS is enabled for the remote cluster, you can use https:// instead of http:// in the value of the host parameter.
| Cluster type | Configuration of the reindex whitelist | Configuration of the host parameter |
| --- | --- | --- |
| Single-zone cluster | <Domain name of the cluster>:9200 | http://<Domain name of the cluster>:9200 |
| Multi-zone cluster | <IP address>:9200 for each of the data nodes in the cluster | http://<IP address of a data node in the cluster>:9200 |
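
The rules in the preceding table can be sketched as a small Python helper that derives the reindex whitelist and the value of the host parameter from the cluster type, and checks that the host contains a protocol and a port number but no path. The function names are hypothetical, and the IP addresses in the usage example are placeholders, not real node addresses.

```python
import json
from urllib.parse import urlsplit


def remote_settings(domain=None, data_node_ips=None, port=9200, scheme="http"):
    """Return (whitelist entries, host URL) for a remote cluster.

    Pass domain for a single-zone cluster or data_node_ips for a multi-zone cluster.
    """
    if data_node_ips:
        # Multi-zone: whitelist every data node, connect to one of them.
        whitelist = ["%s:%d" % (ip, port) for ip in data_node_ips]
        host = "%s://%s:%d" % (scheme, data_node_ips[0], port)
    elif domain:
        # Single-zone: whitelist and connect to the domain name of the cluster.
        whitelist = ["%s:%d" % (domain, port)]
        host = "%s://%s:%d" % (scheme, domain, port)
    else:
        raise ValueError("specify a domain name or the data node IP addresses")
    return whitelist, host


def whitelist_line(entries):
    """Render the entries as a line for the YML configuration."""
    return "reindex.remote.whitelist: " + json.dumps(entries)


def check_host(host):
    """Raise ValueError unless host is protocol://name:port without a path."""
    parts = urlsplit(host)
    if parts.scheme not in ("http", "https"):
        raise ValueError("host must start with http:// or https://")
    if not parts.hostname or parts.port is None:
        raise ValueError("host must contain a domain name or IP address and a port number")
    if parts.path or parts.query or parts.fragment:
        raise ValueError("host must not contain a path")
    return host


# Multi-zone example with placeholder IP addresses.
whitelist, host = remote_settings(data_node_ips=["10.0.0.1", "10.0.0.2"])
print(whitelist_line(whitelist))
print(check_host(host))
```

For a single-zone cluster, call remote_settings(domain="<Domain name of the cluster>") instead.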

Additional information

When you use the reindex API to reindex data, you can specify a batch size and timeout periods.
  • Batch size

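    When you reindex data from a remote cluster, the retrieved data is cached in an on-heap buffer. The default maximum size of the buffer is 100 MB. The batch size is the number of documents in each batch and is specified by the size parameter in source. The default batch size is 1,000 documents. If an index in the remote cluster contains large documents, you must change the batch size to a smaller value.

    In the following example, size is set to 10. This way, each batch contains a maximum of 10 documents.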
    POST _reindex
    {
      "source": {
        "remote": {
          "host": "http://otherhost:9200"
        },
        "index": "source",
        "size": 10,
        "query": {
          "match": {
            "test": "data"
          }
        }
      },
      "dest": {
        "index": "dest"
      }
    }
  • Timeout periods

    Use socket_timeout to specify a timeout period for socket reads. Use connect_timeout to specify a timeout period for connections. In open source Elasticsearch, the default value of both parameters is 30s.

    In the following example, socket_timeout is set to 1m, and connect_timeout is set to 10s.
    POST _reindex
    {
      "source": {
        "remote": {
          "host": "http://otherhost:9200",
          "socket_timeout": "1m",
          "connect_timeout": "10s"
        },
        "index": "source",
        "query": {
          "match": {
            "test": "data"
          }
        }
      },
      "dest": {
        "index": "dest"
      }
    }
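
If you generate reindex requests from a script, the batch size and timeout periods can be treated as optional settings of one request builder. The following Python code is a minimal sketch. The function name remote_reindex_body and its argument names are hypothetical, and the keys in the generated body are the ones that are used in the preceding examples.

```python
import json


def remote_reindex_body(host, source_index, dest_index, query=None,
                        batch_size=None, socket_timeout=None, connect_timeout=None):
    """Build a remote reindex body. Settings that are not specified are omitted."""
    remote = {"host": host}
    if socket_timeout is not None:
        remote["socket_timeout"] = socket_timeout
    if connect_timeout is not None:
        remote["connect_timeout"] = connect_timeout
    source = {"remote": remote, "index": source_index}
    if batch_size is not None:
        source["size"] = batch_size  # Number of documents in each batch.
    if query is not None:
        source["query"] = query
    return {"source": source, "dest": {"index": dest_index}}


# Combine a small batch size with longer timeout periods in one request.
body = remote_reindex_body(
    "http://otherhost:9200", "source", "dest",
    query={"match": {"test": "data"}},
    batch_size=10, socket_timeout="1m", connect_timeout="10s",
)
print(json.dumps(body, indent=2))
```

Omitted settings are left out of the body so that the cluster applies its default values.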