You can use the reindex API to migrate data between Elasticsearch clusters. This topic describes the migration procedure in detail.

Background information

You can use the reindex API to perform the following operations:
  • Migrate data between Elasticsearch clusters.
  • Reindex the data in an index whose shards are inappropriately configured. For example, the data volume is large, but only a few shards are configured for the index. This slows down data write operations.
  • Replicate the data in an index that stores large volumes of data when you want to modify the mapping configuration of the index. Reindexing requires only a short period of time. Alternatively, you can insert the data into a new index again, but this is time-consuming.
    Note After you define the mapping configuration for an index in an Elasticsearch cluster and insert data into the index, you cannot modify the mapping configuration.

Prerequisites

  • Two Alibaba Cloud Elasticsearch clusters are created. One is used as a local cluster, and the other is used as a remote cluster.

    For more information, see Create an Alibaba Cloud Elasticsearch cluster. The two clusters must belong to the same virtual private cloud (VPC) and vSwitch. In this example, an Elasticsearch V6.7.0 cluster is used as the local cluster, and an Elasticsearch V6.3.2 cluster is used as the remote cluster.

  • Test data is prepared.
    • Local cluster
      Create a destination index in the local cluster.
      PUT dest
      {
        "settings": {
          "number_of_shards": 5,
          "number_of_replicas": 1
        }
      }
    • Remote cluster
      Prepare the data that you want to migrate. In this example, the data in the "Quick start" topic is used. For more information, see Quick start.
      Notice If you want to use a cluster that runs Elasticsearch V7.0 or later as a remote cluster, you must set the index type to _doc.

Procedure

  1. Log on to the Elasticsearch console.
  2. In the left-side navigation pane, click Elasticsearch Clusters.
  3. Navigate to the desired cluster.
    1. In the top navigation bar, select a resource group and a region.
    2. In the left-side navigation pane, click Elasticsearch Clusters. On the Elasticsearch Clusters page, find the desired cluster and click its ID.
  4. Configure a reindex whitelist for the local cluster.
    1. In the left-side navigation pane of the page that appears, click Cluster Configuration.
    2. On the page that appears, click Modify Configuration on the right side of YML Configuration.
    3. In the Other Configurations field of the YML File Configuration panel, configure a reindex whitelist.
      For more information about how to configure a reindex whitelist, see Configure a remote reindex whitelist.
      • If the remote cluster is a single-zone cluster, specify the reindex whitelist in the format of <Domain name of the cluster>:9200. The following example shows the configuration for a remote single-zone cluster:
        reindex.remote.whitelist: ["es-cn-09k1rgid9000g****.elasticsearch.aliyuncs.com:9200"]
      • If the remote cluster is a multi-zone cluster, the reindex whitelist must contain the IP addresses of all the data nodes in the cluster and the port number of the cluster. The following example shows the configuration for a remote multi-zone cluster:
        reindex.remote.whitelist: ["10.0.xx.xx:9200","10.0.xx.xx:9200","10.0.xx.xx:9200","10.15.xx.xx:9200","10.15.xx.xx:9200","10.15.xx.xx:9200"]
        Note You can obtain the IP addresses of all the data nodes in a cluster from the Node Visualization tab on the Basic Information page of the cluster. For more information, see View the basic information of nodes.
    4. Select This operation will restart the cluster. Continue? and click OK.
  5. In the local cluster, call the reindex API to reindex data.
    Log on to the Kibana console of the local cluster and run the following command to reindex data:
    POST _reindex
    {
      "source": {
        "remote": {
          "host": "http://es-cn-09k1rgid9000g****.elasticsearch.aliyuncs.com:9200",
          "username": "elastic",
          "password": "your_password"
        },
        "index": "product_info",
        "query": {
          "match": {
            "productName": "Wealth management"
          }
        }
      },
      "dest": {
        "index": "dest"
      }
    }
    The following list describes the parameters, grouped by the part of the request in which they appear.
    source
    • host: The URL that is used to connect to the remote cluster. The URL must contain the protocol, domain name, and port number. Example: https://otherhost:9200.
      • If the remote cluster is a single-zone cluster, the value of the host parameter must be in the format of http://<Domain name of the cluster>:9200.
        Note You can obtain the domain name from the Basic Information page of the cluster. For more information, see View the basic information of a cluster.
      • If the remote cluster is a multi-zone cluster, the value of the host parameter must be in the format of http://<IP address of a data node in the cluster>:9200.
    • username: The username that is used to connect to the remote cluster. This parameter is optional. It is required only if basic authentication needs to be performed on requests that are sent to the remote cluster. The default username that is used to connect to Alibaba Cloud Elasticsearch clusters is elastic.
      Notice
      • For security purposes, we recommend that you use HTTPS to send requests if basic authentication needs to be performed. Otherwise, the required password is transmitted in plaintext.
      • For Alibaba Cloud Elasticsearch clusters, you can use HTTPS in host only after you enable the protocol for the clusters.
    • password: The password that is used to connect to the remote cluster. The password is specified when you create the cluster. If you forget the password, you can reset it. For more information about the procedure and precautions for resetting the password, see Reset the access password for an Elasticsearch cluster.
    • index: The source index in the remote cluster.
    • query: Specifies the data that you want to migrate. For more information, see Reindex API.
    dest
    • index: The destination index in the local cluster.
    Note When you reindex data from a remote cluster, manual slicing and automatic slicing are not supported for the data. For more information, see Manual slicing and Automatic slicing.
    If the command is successfully run, the following result is returned:
    {
      "took" : 51,
      "timed_out" : false,
      "total" : 2,
      "updated" : 2,
      "created" : 0,
      "deleted" : 0,
      "batches" : 1,
      "version_conflicts" : 0,
      "noops" : 0,
      "retries" : {
        "bulk" : 0,
        "search" : 0
      },
      "throttled_millis" : 0,
      "requests_per_second" : -1.0,
      "throttled_until_millis" : 0,
      "failures" : [ ]
    }
  6. Run the following command to view the migrated data:
    GET dest/_search
    The following figures show the command outputs.
    • Single-zone cluster (figure: View the migrated data)
    • Multi-zone cluster (figure: View the migrated data)
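
The request in step 5 can also be assembled in a script, which may help when several indexes are migrated with the same settings. The following Python sketch is an illustration only and is not part of the console procedure. The function names build_remote_reindex_body and summarize_reindex_response are hypothetical, and remote.example.com is a placeholder host. The sketch only builds and inspects JSON. The body must still be sent to the _reindex endpoint of the local cluster, for example, by pasting the printed JSON into the Kibana console.

```python
import json
from urllib.parse import urlsplit


def build_remote_reindex_body(host, source_index, dest_index,
                              username=None, password=None, query=None):
    """Build the body of POST _reindex for a remote source (hypothetical helper)."""
    parts = urlsplit(host)
    # The host value must contain the protocol, the domain name or IP address, and the port.
    if parts.scheme not in ("http", "https") or not parts.hostname or parts.port is None:
        raise ValueError("host must look like http://<domain or IP>:9200")
    remote = {"host": host}
    # username and password are needed only if the remote cluster uses basic authentication.
    if username is not None:
        remote["username"] = username
        remote["password"] = password
    source = {"remote": remote, "index": source_index}
    if query is not None:
        source["query"] = query
    return {"source": source, "dest": {"index": dest_index}}


def summarize_reindex_response(response):
    """Return (succeeded, documents_written) from a _reindex response."""
    succeeded = not response.get("timed_out", False) and not response.get("failures")
    written = response.get("created", 0) + response.get("updated", 0)
    return succeeded, written


body = build_remote_reindex_body(
    host="http://remote.example.com:9200",  # placeholder, not a real cluster
    source_index="product_info",
    dest_index="dest",
    username="elastic",
    password="your_password",
    query={"match": {"productName": "Wealth management"}},
)
print(json.dumps(body, indent=2))
```

The summary helper treats a response as successful only if timed_out is false and the failures array is empty, which matches the sample response shown in step 5.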

Summary

The configurations that are required to migrate data from a single-zone cluster are similar to the configurations that are required to migrate data from a multi-zone cluster. The differences are as follows:
  • Single-zone cluster
    Configuration of the reindex whitelist: <Domain name of the cluster>:9200
    Configuration of the host parameter: http://<Domain name of the cluster>:9200 (use https:// only if HTTPS is enabled for the cluster)
  • Multi-zone cluster
    Configuration of the reindex whitelist: the IP addresses of all the data nodes in the cluster, each combined with the port number of the cluster
    Configuration of the host parameter: http://<IP address of a data node in the cluster>:9200 (use https:// only if HTTPS is enabled for the cluster)
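
The differences above can be expressed as a small helper that derives both values from the cluster type. This is a minimal Python sketch under stated assumptions: the function name remote_reindex_settings and the labels "single-zone" and "multi-zone" are hypothetical, the addresses are placeholders, and the choice of the first data node as the host of a multi-zone cluster is arbitrary (any data node in the cluster can be used).

```python
def remote_reindex_settings(cluster_type, domain=None, data_node_ips=None,
                            port=9200, scheme="http"):
    """Return (whitelist_line, host) for a remote cluster (hypothetical helper)."""
    if cluster_type == "single-zone":
        if not domain:
            raise ValueError("a single-zone cluster needs its domain name")
        addresses = [domain]
    elif cluster_type == "multi-zone":
        if not data_node_ips:
            raise ValueError("a multi-zone cluster needs the IP addresses of all data nodes")
        addresses = list(data_node_ips)
    else:
        raise ValueError("cluster_type must be 'single-zone' or 'multi-zone'")
    # Every whitelist entry has the form "<address>:<port>".
    entries = ",".join(f'"{address}:{port}"' for address in addresses)
    whitelist_line = f"reindex.remote.whitelist: [{entries}]"
    # host uses the domain name (single-zone) or one data node (multi-zone).
    host = f"{scheme}://{addresses[0]}:{port}"
    return whitelist_line, host


# Placeholder values for illustration only.
single = remote_reindex_settings("single-zone", domain="remote.example.com")
multi = remote_reindex_settings("multi-zone", data_node_ips=["10.0.0.1", "10.0.0.2"])
for line, host in (single, multi):
    print(line)
    print(host)
```

Pass scheme="https" only if HTTPS is enabled for the remote cluster.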

Additional information

When you use the reindex API to reindex data, you can specify a batch size and timeout periods.
  • Batch size

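    Reindexing from a remote cluster uses an on-heap buffer whose default maximum size is 100 MB. If an index in the remote cluster contains large documents, you must reduce the batch size. The batch size is the number of documents in each batch and is specified by the size parameter in source. The default value of size is 1000.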

    In the following example, size is set to 10.
    POST _reindex
    {
      "source": {
        "remote": {
          "host": "http://otherhost:9200"
        },
        "index": "source",
        "size": 10,
        "query": {
          "match": {
            "test": "data"
          }
        }
      },
      "dest": {
        "index": "dest"
      }
    }
  • Timeout periods

    Use socket_timeout to specify a timeout period for socket reads. The default value of socket_timeout is 30s. Use connect_timeout to specify a timeout period for connections. The default value of connect_timeout is 1s.

    In the following example, socket_timeout is set to 1m, and connect_timeout is set to 10s.
    POST _reindex
    {
      "source": {
        "remote": {
          "host": "http://otherhost:9200",
          "socket_timeout": "1m",
          "connect_timeout": "10s"
        },
        "index": "source",
        "query": {
          "match": {
            "test": "data"
          }
        }
      },
      "dest": {
        "index": "dest"
      }
    }
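
If the same request is generated by a script, the batch size and timeout periods can be applied as optional overrides to an existing request body. The following Python sketch assumes a body shaped like the examples above. The function name with_remote_tuning is hypothetical, and otherhost is the placeholder host from the examples. The function returns a modified copy and leaves the original body unchanged.

```python
import copy
import json


def with_remote_tuning(body, batch_size=None, socket_timeout=None, connect_timeout=None):
    """Return a copy of a remote reindex body with tuning options set (hypothetical helper)."""
    tuned = copy.deepcopy(body)
    source = tuned["source"]
    if batch_size is not None:
        if batch_size < 1:
            raise ValueError("batch_size must be a positive number of documents")
        # size belongs to source, next to index and query.
        source["size"] = batch_size
    # Both timeouts belong to source.remote and use time values such as "1m" or "10s".
    if socket_timeout is not None:
        source["remote"]["socket_timeout"] = socket_timeout
    if connect_timeout is not None:
        source["remote"]["connect_timeout"] = connect_timeout
    return tuned


base = {
    "source": {
        "remote": {"host": "http://otherhost:9200"},
        "index": "source",
        "query": {"match": {"test": "data"}},
    },
    "dest": {"index": "dest"},
}
tuned = with_remote_tuning(base, batch_size=10, socket_timeout="1m", connect_timeout="10s")
print(json.dumps(tuned, indent=2))
```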