
Lindorm: Migrate data from a self-managed Elasticsearch cluster using Logstash

Last Updated: Jan 29, 2026

Logstash is an open source data processing tool that collects, processes, and sends data to a destination database. This topic describes how to use Logstash to migrate data from a self-managed Elasticsearch cluster to the Lindorm Search Engine by configuring scripts and migration tasks.

Prerequisites

  • The self-managed Elasticsearch (ES) cluster runs a version from 7.0.0 to 7.10.1.

  • You have activated the Lindorm Search Engine. For more information, see Activation guide.

  • You have added the client IP address to the Lindorm whitelist. For more information, see Configure a whitelist.

    Note

    The examples in this topic use a self-managed ES cluster and a Logstash service that are deployed on an Alibaba Cloud Elastic Compute Service (ECS) instance. To ensure connectivity between the ECS instance and the Lindorm instance, you must add the IP address of the ECS instance to the Lindorm whitelist. For more information about how to create an Alibaba Cloud ECS instance, see Create a custom instance.

Migration solution selection

Choose a migration solution based on your business requirements and on how the data in the source ES index is updated.

  • Full data migration: Use this solution if documents in the source index of the self-managed ES cluster are not being added, deleted, or modified.

  • Incremental data migration: Use this solution if the self-managed ES index is updated with new or modified data, but no data is deleted. This solution requires that the documents contain a field that indicates the data update time.

  • Comprehensive migration solution: Use this solution if your data is updated but does not contain a field for the data update time. In this scenario, you must modify your business code to include this field before you can perform the migration.

Data preparation

The self-managed Elasticsearch cluster used in this topic is deployed on an Alibaba Cloud ECS instance. The data to be migrated is the geonames dataset from Rally. For more information about how to import the data, see Run a Benchmark: Races. In the example in this topic, the index in the Elasticsearch cluster is named geonames.

Data size: 11,396,503 documents. The decompressed data occupies 3.3 GB of space.
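
If you want to reproduce this dataset, you can load the geonames track into the source cluster with Rally. The following command is only a sketch and assumes Rally 2.x syntax; the host and port are placeholders:

# Run the geonames track against the existing self-managed ES cluster
esrally race --track=geonames --target-hosts=<host>:<port> --pipeline=benchmark-only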

Note

You can also migrate your own existing data from a self-managed ES cluster. For information about how to write data to a self-managed ES cluster, see the Index API.
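
For example, the following Index API request writes a single document to an index named geonames on a self-managed ES cluster. The field names and values are illustrative only; adapt them to your own data:

curl -XPOST "http://<host>:<port>/geonames/_doc" -H 'Content-Type: application/json' -d'{
  "name": "Sample place",
  "country_code": "CN",
  "population": 12345
}'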

Step 1: Install the Logstash service

To install and deploy the Logstash service, see the Logstash Reference.
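
For example, on a Linux ECS instance you can download and extract Logstash 7.10.0, which matches the Logstash version used in the commands later in this topic. Verify the download URL against the Logstash Reference before you use it:

# Download and extract Logstash 7.10.0 (Linux x86_64)
wget https://artifacts.elastic.co/downloads/logstash/logstash-7.10.0-linux-x86_64.tar.gz
tar -xzf logstash-7.10.0-linux-x86_64.tar.gz
cd logstash-7.10.0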

Step 2: Create a search index

Before you use Logstash to migrate data from an Elasticsearch cluster to the Lindorm Search Engine, you must create a destination index in Lindorm to store the migrated data.

Important

The Logstash service does not copy information, such as settings, from the source index to the destination index. If you want the destination index to have the same configuration and mapping rules as the source index, you must specify parameters such as settings, mappings, and shard counts when you create the destination index.

In this topic, the destination index is named geonames. The settings and mappings parameters are not specified when the index is created.

curl -XPUT "http://<url>/geonames" -u <username>:<password>

Parameter description

  • url: The Elasticsearch-compatible endpoint of the search engine. For more information about how to obtain the endpoint, see Elasticsearch-compatible endpoints.

    Important

    • If the application is deployed on an ECS instance, connect to the Lindorm instance over a virtual private cloud (VPC) for better security and lower network latency.

    • If the application is deployed locally, enable the public endpoint in the console before connecting to the Lindorm instance over the public network. To do this, go to the console. In the navigation pane on the left, choose Database Connections. Click the Search Engine tab, and then click Enable Public Endpoint in the upper-right corner of the tab.

    • If you use a VPC to access the Lindorm instance, set url to the LindormSearch VPC endpoint for Elasticsearch. If you use the Internet to access the Lindorm instance, set url to the LindormSearch Internet endpoint for Elasticsearch.

  • username and password: The username and password used to access the Search Engine. To obtain the default username and password, go to the console, choose Database Connections in the navigation pane on the left, and then click the Search Engine tab to view the credentials.

Note

You can specify different settings and mappings for the destination index from the source index. However, the mappings of the destination index must not conflict with the mappings of the source index. Conflicts may cause the data migration to fail. For more information, see the Create index API.
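
If the destination index must keep the same settings and mappings as the source index, one possible approach is to export them from the source cluster and specify them explicitly when you create the destination index. The following is only a sketch; the shard count and field mapping in the PUT request are illustrative placeholders that you must replace with the values exported from your source index:

# Export the settings and mappings of the source index
curl -XGET "http://<host>:<port>/geonames/_settings" -u <username>:<password>
curl -XGET "http://<host>:<port>/geonames/_mapping" -u <username>:<password>

# Create the destination index with explicit settings and mappings
curl -XPUT "http://<url>/geonames" -u <username>:<password> -H 'Content-Type: application/json' -d'{
  "settings": { "number_of_shards": 4 },
  "mappings": { "properties": { "name": { "type": "text" } } }
}'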

Step 3: Migrate data

Full data migration

If the source index in your self-managed ES cluster is no longer being modified, which means no documents are added, deleted, or updated, you can perform a full data migration. This method migrates all documents from the source index to the Lindorm Search Engine.

  1. Create a Logstash configuration file named fulldata.conf to configure the full data migration task. The following is an example:

    input{
      elasticsearch{
        # Source ES address
        hosts =>  ["http://<host>:<port>"]
        
        # Username & password for access
        user => "changeme"
        password => "changeme"
        
        # Source index name. Supports multiple comma-separated index names and fuzzy matching.
        index => "geonames"
      
        # Query all data and sync from the source
        query => '{"query":{"match_all":{}}}'
    
        # Defaults to false. Set to true to also read document metadata such as _id.
        docinfo => true
      }
    }
    
    filter {
      # Remove extra Logstash fields
      mutate {
        remove_field => ["@timestamp", "@version"]
      }
    }
    
    output{
      elasticsearch{
        # Lindorm connection address
        hosts => ["http://<lindorm-address>"]
      
        # Username & password for access
        user => "changeme"
        password => "changeme"
      
        index => "geonames"
        # Keep the original ID when writing data. Remove this line if not needed.
        document_id => "%{[@metadata][_id]}"
    
        # Disable the built-in Logstash template
        manage_template => false
      }
    }

    Parameter description

    input

      • hosts: The IP address and port of the self-managed Elasticsearch cluster.

      • user: The username for the self-managed Elasticsearch cluster. This parameter is optional. Specify it as needed.

      • password: The password for the self-managed Elasticsearch cluster. This parameter is optional. Specify it as needed.

      • index: The name of the source index in the self-managed Elasticsearch cluster to be migrated.

      • query: The query condition for the data migration. In the example, match_all queries all documents in the index, which migrates all documents from the source index to the destination index.

    output

      • hosts: The Elasticsearch-compatible endpoint of the search engine. For more information, see Elasticsearch-compatible endpoints.

        Important

        • If your Logstash service is deployed on an ECS instance, connect to the Lindorm instance over a virtual private cloud (VPC) for better security and lower network latency.

        • If your Logstash service is deployed locally, enable the public endpoint in the console before connecting to the Lindorm instance over the public network. To do this, go to the console. In the navigation pane on the left, choose Database Connections. Click the Search Engine tab, and then click Enable Public Endpoint in the upper-right corner of the tab.

        • To access the Lindorm instance over a VPC, enter the VPC address of the Elasticsearch-compatible endpoint. To access the Lindorm instance over the public network, enter the Internet address of the Elasticsearch-compatible endpoint.

      • user and password: The username and password used to access the Search Engine. To obtain the default username and password, go to the console, choose Database Connections in the navigation pane on the left, and then click the Search Engine tab to view the credentials.

      • index: The name of the destination index created in the Lindorm Search Engine.

  2. Specify fulldata.conf as the task configuration file and start the Logstash service to perform the data migration. Logstash automatically stops after the data migration is complete.

    cd logstash-7.10.0
    bin/logstash -f <path/to/fulldata.conf>

Incremental data migration

If your self-managed ES index is updated with new or modified data, but no data is deleted, you can use incremental data migration to complete the data migration task.

This method uses a rolling migration based on a data update time field. When the Logstash rolling migration task runs for the first time, it also migrates all existing historical data to the destination index.

The following sections provide the increment.conf configuration file and the rolling migration script for the incremental data migration task.

  • The increment.conf configuration file.

    input{
      elasticsearch{
        hosts =>  ["<connection-address-of-self-managed-ES-cluster>:<port-of-self-managed-ES-cluster>"]
        user => "<username-for-self-managed-ES-cluster>"
        password => "<password-for-self-managed-ES-cluster>"
        index => "<source-index-of-self-managed-ES-cluster>"
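        # Replace updateTimestampField in the query below with the update time field used by your business data.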
        query => '{"query":{"range":{"updateTimestampField":{"gte":"${TMP_LAST}","lt":"${TMP_CURR}"}}}}'
        docinfo=>true
      }
    }
    filter {
      mutate {
        remove_field => ["@timestamp", "@version"]
      }
    }
    output{
      elasticsearch{
        hosts => ["http://<lindorm-address>"]
        user => "changeme"
        password => "changeme"
        index => "geonames"
        document_id => "%{[@metadata][_id]}"
        manage_template => false
      }
    }
  • The task rolling migration script.

    #!/bin/bash
    
    unset TMP_LAST
    unset TMP_CURR
    
    # Logstash execution interval. Default is 30s.
    sleepInterval=30
    # Set a value slightly larger than the source's refresh interval. Unit is seconds. The default source interval is 15s.
    refreshInterval=16
    
    # Default conversion is to ms. Adjust based on the unit of the business field.
    export TMP_LAST=0
    i=1
    while true
    do
      echo "Starting Logstash data migration task ${i}..."
    
      # Default conversion is to ms. Adjust based on the unit of the business field.
      export TMP_CURR=$((($(date +%s%N)/1000000) - ($refreshInterval * 1000)))
      <path/to/logstash>/bin/logstash -f <path/to/increment.conf>
    
      echo "Logstash data migration task ${i} completed."
      echo "Data update time range for this migration: ${TMP_LAST} to ${TMP_CURR}"
      i=$(( $i + 1 ))
      export TMP_LAST=${TMP_CURR}
      sleep ${sleepInterval}
    done

    The preceding example uses environment variables in the increment.conf configuration file and a rolling migration script to continuously migrate data that is updated within a specific time window. For more information about how to use environment variables in a Logstash configuration, see Using environment variables.
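
    For example, you can save the script as increment_rolling.sh (a file name chosen for illustration) and run it in the background so that the rolling migration continues after the shell session ends:

    chmod +x increment_rolling.sh
    nohup ./increment_rolling.sh > increment_rolling.log 2>&1 &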

    Important

    The Lindorm Search Engine does not guarantee the processing order of data write requests because their arrival order on the server-side is not guaranteed. If you run full and incremental data migration tasks at the same time, historical data may overwrite updated data. Therefore, you must first run the full data migration task for historical data. After this task is complete, you can start the incremental data rolling migration task.

Comprehensive migration solution

If your data does not include a data update time field, you must modify your business code to add this field. For historical data that lacks this field, use the historical data migration task configuration and start a Logstash task to migrate it. For incremental data that includes the data update time field, use the rolling migration solution.
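
For example, after you modify the business code, each write request should carry the data update time field. The following request is only a sketch: the updateTimestampField name matches the field used in the incremental migration examples in this topic, and the millisecond timestamp value is illustrative:

curl -XPOST "http://<host>:<port>/geonames/_doc/<doc-id>" -u <username>:<password> -H 'Content-Type: application/json' -d'{
  "name": "Sample place",
  "updateTimestampField": 1700000000000
}'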

The following history.conf configuration file is used for the Logstash historical data migration task in this topic.

input{
  elasticsearch{
    hosts =>  ["http://<host>:<port>"]
    user => "changeme"
    password => "changeme"
    index => "geonames"
    query => '{"query":{"bool":{"must_not":{"exists":{"field":"updateTimestampField"}}}}}'
    docinfo=>true
  }
}
filter {
  mutate {
    remove_field => ["@timestamp", "@version"]
  }
}
output{
  elasticsearch{
    hosts => ["http://<lindorm-address>"]
    user => "changeme"
    password => "changeme"
    index => "geonames"
    document_id => "%{[@metadata][_id]}"
    manage_template => false
  }
}

For the rolling migration script example, see Task rolling migration script.

Step 4: Check the migration results

You can check whether the source ES index and the destination Lindorm index contain the same number of documents. You can also check whether recently updated data is consistent. This helps you verify that the historical and incremental data from the self-managed ES cluster has been completely migrated to the Lindorm Search Engine. The following example code shows how to perform these checks:

# View index details
curl -XGET "<url>/_cat/indices?v" -u <username>:<password>

# View recently updated data in the index
curl -XGET "<url>/<index>/_search" -u <username>:<password> -H'Content-Type:application/json' -d'{
  "query": {
    "bool": {
      "must": {
        "exists": {
          "field": "updateTimestampField"
        }
      }
    }
  },
  "sort": [
    {
      "updateTimestampField": {
        "order": "desc"
      }
    }
  ],
  "size": 20
}'
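
To compare document counts directly, you can also query the _count API on the source cluster and on the Lindorm Search Engine and confirm that the results match. For example:

# Document count in the source ES index
curl -XGET "http://<host>:<port>/geonames/_count" -u <username>:<password>

# Document count in the destination Lindorm index
curl -XGET "<url>/geonames/_count" -u <username>:<password>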