Use a catalog to migrate offline data from an Elasticsearch cluster to an ApsaraDB for SelectDB instance. The catalog maps Elasticsearch indexes to SelectDB tables, so you can run a single INSERT INTO ... SELECT statement to complete the migration.
Prerequisites
Before you begin, ensure that you have:

- Network connectivity between the Elasticsearch cluster and the SelectDB instance. All Elasticsearch cluster nodes must reside in the same virtual private cloud (VPC) as the SelectDB instance. If the nodes reside in different VPCs, configure cross-VPC connectivity first. For more information, see What do I do if a connection fails to be established between an ApsaraDB for SelectDB instance and a data source?
- The IP addresses of all Elasticsearch cluster nodes added to the IP address whitelist of the SelectDB instance. For more information, see Configure an IP address whitelist.
- The VPC IP addresses of the SelectDB instance added to the Elasticsearch cluster's whitelist (if the cluster supports whitelisting). To get the VPC IP addresses, see How do I view the IP addresses in the VPC to which my ApsaraDB SelectDB instance belongs? To get the public IP address, run `ping` against the SelectDB instance's public endpoint.
- Basic familiarity with catalogs in SelectDB. For background, see Data lakehouse.
How it works
Creating a catalog establishes a read-only connection from SelectDB to your Elasticsearch cluster. Once the catalog is created, SelectDB automatically creates a database named default_db inside it and maps each Elasticsearch index to a table in that database. You can then query the catalog and run INSERT INTO ... SELECT to copy data into a SelectDB table.
SelectDB can only read data from an external catalog. Writing back to the Elasticsearch cluster is not supported.
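Because the catalog is just a connection, you do not have to switch into it to read the mapped data. Any mapped index can be referenced from the internal catalog with a fully qualified three-part name. The catalog and index names below are the example values used in this topic:

```sql
-- Read the mapped Elasticsearch index through the catalog.
-- Format: <catalog>.<database>.<table>; the database is always default_db.
SELECT * FROM es_catalog.default_db.product_info LIMIT 10;
```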
Sample environment
The following steps use these example values. Substitute your own values as needed.
| Item | Value |
|---|---|
| Elasticsearch index (source) | product_info |
| SelectDB database (destination) | es_db |
| SelectDB table (destination) | test_Es2SelectDB |
Prepare the sample source data
For more information, see Getting started.
Migrate offline data
Step 1: Connect to the SelectDB instance
For connection instructions, see Connect to an instance.
Step 2: Create an Elasticsearch catalog
Run the following statement to create a catalog that points to your Elasticsearch cluster:
```sql
CREATE CATALOG es_catalog PROPERTIES (
    "type" = "es",
    "hosts" = "http://127.0.0.1:9200",
    "user" = "test_user",
    "password" = "test_passwd",
    "nodes_discovery" = "false"
);
```
The following table describes the catalog properties:
| Parameter | Required | Default | Description |
|---|---|---|---|
| `es_catalog` | Yes | — | The name of the catalog. Change the name based on your business requirements. |
| `type` | Yes | — | Set to `es`. |
| `hosts` | Yes | — | The access URL of the Elasticsearch cluster. Format: `http://<IP address>:<port>`. To use a Server Load Balancer (SLB) endpoint, set this to the SLB URL. |
| `user` | No | — | The username for accessing the Elasticsearch cluster. |
| `password` | No | — | The password for accessing the Elasticsearch cluster. |
| `doc_value_scan` | No | true | Enables column-oriented storage (doc_values) for querying field values. When enabled, SelectDB automatically queries fields with doc_values enabled, which significantly improves performance for queries on a small number of columns. |
| `keyword_sniff` | No | true | Detects TEXT fields in Elasticsearch and queries them using the corresponding KEYWORD fields. When true, SelectDB matches the full field value before tokenization. When false, SelectDB matches individual terms after tokenization. |
| `nodes_discovery` | No | true | Enables Elasticsearch node discovery. When true, SelectDB discovers and connects to all available data nodes. When false, SelectDB accesses nodes only through the URL specified in `hosts`. Set to false for Alibaba Cloud Elasticsearch clusters, which can only be accessed via URL. |
| `mapping_es_id` | No | false | Maps the Elasticsearch `_id` field to a SelectDB column. Set to true if you need to query the primary key value. When false, the `_id` field is not accessible in queries. |
| `like_push_down` | No | true | Converts LIKE queries to Elasticsearch wildcard queries and pushes them down to the cluster. This can increase CPU consumption on the Elasticsearch cluster in some scenarios. Set to false to disable LIKE pushdown. |
| `include_hidden_index` | No | false | Includes hidden indexes in query conditions. |
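As a sketch of how the optional properties combine, the following statement creates a catalog for the same cluster with `_id` mapping enabled and LIKE pushdown disabled. The catalog name `es_catalog_tuned` is hypothetical, and the endpoint and credentials are the same placeholders as above:

```sql
CREATE CATALOG es_catalog_tuned PROPERTIES (
    "type" = "es",
    "hosts" = "http://127.0.0.1:9200",
    "user" = "test_user",
    "password" = "test_passwd",
    "nodes_discovery" = "false",  -- required for Alibaba Cloud Elasticsearch
    "mapping_es_id" = "true",     -- expose the _id field as a queryable column
    "like_push_down" = "false"    -- evaluate LIKE in SelectDB instead of pushing it down
);
```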
Step 3: Verify the catalog
Run SHOW CATALOGS to confirm the catalog was created:
```sql
SHOW CATALOGS;
```
Expected output:
```
+--------------+--------------+----------+-----------+-------------------------+---------------------+------------------------+
| CatalogId    | CatalogName  | Type     | IsCurrent | CreateTime              | LastUpdateTime      | Comment                |
+--------------+--------------+----------+-----------+-------------------------+---------------------+------------------------+
| 436009309195 | es_catalog   | es       |           | 2024-08-06 17:09:08.058 | 2024-07-19 18:04:37 |                        |
| 0            | internal     | internal | yes       | UNRECORDED              | NULL                | Doris internal catalog |
+--------------+--------------+----------+-----------+-------------------------+---------------------+------------------------+
```
Step 4: (Optional) Query the Elasticsearch catalog
Switch to the external catalog to browse the mapped tables:
```sql
SWITCH es_catalog;
```
Query and access data in es_catalog the same way you would in the internal catalog. For query examples, see the "Query data from an Elasticsearch data source" section of the Elasticsearch data source topic.
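For example, after switching you can browse the mapped database and tables. The index name `product_info` is the sample source from this topic:

```sql
SHOW DATABASES;          -- contains the automatically created default_db
USE default_db;
SHOW TABLES;             -- each Elasticsearch index appears as a table
SELECT * FROM product_info LIMIT 10;
```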
Switch back to the internal catalog when done:
```sql
SWITCH internal;
```
Step 5: Create the destination database
Skip this step if the database already exists.
```sql
CREATE DATABASE es_db;
```
Switch to the destination database:
```sql
USE es_db;
```
Step 6: Create the destination table
Before creating the table, review the column type mappings between Elasticsearch and SelectDB. For the full type mapping reference, see the "Column type mappings" section of the Elasticsearch data source topic.
Create the table using column types that match the Elasticsearch index schema:
```sql
CREATE TABLE test_Es2SelectDB
(
    `annual_rate` VARCHAR(200),
    `describe`    TEXT,
    `productName` VARCHAR(200)
)
DISTRIBUTED BY HASH(`productName`) BUCKETS 4
PROPERTIES ("replication_num" = "1");
```
Step 7: Migrate data
Run INSERT INTO ... SELECT to copy all rows from the Elasticsearch index into the SelectDB table:
```sql
INSERT INTO test_Es2SelectDB SELECT * FROM es_catalog.default_db.product_info;
```
Step 8: Verify the migration
Query the destination table to confirm the data was imported:
```sql
SELECT * FROM test_Es2SelectDB;
```
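Beyond spot-checking individual rows, a simple consistency check is to compare row counts between the source (read through the catalog) and the destination:

```sql
SELECT COUNT(*) FROM es_catalog.default_db.product_info;  -- source index
SELECT COUNT(*) FROM test_Es2SelectDB;                    -- destination table
-- The two counts should match if every row was copied.
```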
Migrate incremental data
In production environments, Elasticsearch data typically includes both historical (offline) data and ongoing incremental data. Use the following strategies based on your data type.
Log data
Write new log entries simultaneously to both the Elasticsearch cluster and the SelectDB instance. Once enough data accumulates in SelectDB, run your analysis queries directly against SelectDB.
Transactional or event data
Use SelectDB's Unique key model to handle deduplication automatically. Write incremental data to both the Elasticsearch cluster and SelectDB in parallel. Migrate historical data using the offline migration steps above. If historical and incremental records overlap, the Unique key model deduplicates on the primary key, so no manual conflict resolution is needed.
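As a sketch, a Unique key version of the destination table might look like the following. The table name and the choice of `productName` as the unique key are assumptions for illustration; the key columns must uniquely identify each record in your data and must be the leading columns of the table:

```sql
CREATE TABLE test_Es2SelectDB_unique
(
    `productName` VARCHAR(200),   -- key column must come first
    `annual_rate` VARCHAR(200),
    `describe`    TEXT
)
UNIQUE KEY(`productName`)
DISTRIBUTED BY HASH(`productName`) BUCKETS 4
PROPERTIES ("replication_num" = "1");
-- Rows with the same productName are deduplicated; the latest write wins.
```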
ELK ecosystem compatibility
For log collection, SelectDB offers customized and enhanced Filebeat and Logstash plugins that efficiently collect and process log data. You can also collect logs by using a standard data import method.
What's next
- Elasticsearch data source: full parameter reference, column type mappings, and predicate pushdown details.
- Data lakehouse: working with external catalogs in SelectDB.
- Configure an IP address whitelist: network access control for your SelectDB instance.