Use a catalog to migrate offline data from an Elasticsearch cluster to an ApsaraDB for SelectDB instance. The catalog maps Elasticsearch indexes to SelectDB tables, so you can run a single INSERT INTO ... SELECT statement to complete the migration.
Prerequisites
Before you begin, ensure that you have:

- Network connectivity between the Elasticsearch cluster and the SelectDB instance. All Elasticsearch cluster nodes must reside in the same virtual private cloud (VPC) as the SelectDB instance. If the nodes reside in different VPCs, configure cross-VPC connectivity first. For more information, see What do I do if a connection fails to be established between an ApsaraDB for SelectDB instance and a data source?
- The IP addresses of all Elasticsearch cluster nodes added to the IP address whitelist of the SelectDB instance. For more information, see Configure an IP address whitelist.
- The VPC IP addresses of the SelectDB instance added to the Elasticsearch cluster's whitelist (if the cluster supports whitelisting). To get the VPC IP addresses, see How do I view the IP addresses in the VPC to which my ApsaraDB SelectDB instance belongs? To get the public IP address, run `ping` against the SelectDB instance's public endpoint.
- Basic familiarity with catalogs in SelectDB. For background, see Data lakehouse.
How it works
Creating a catalog establishes a read-only connection from SelectDB to your Elasticsearch cluster. Once the catalog is created, SelectDB automatically creates a database named default_db inside it and maps each Elasticsearch index to a table in that database. You can then query the catalog and run INSERT INTO ... SELECT to copy data into a SelectDB table.
SelectDB can only read data from an external catalog. Writing back to the Elasticsearch cluster is not supported.
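Because the catalog is just a connection, you do not have to switch into it to read the mapped data. Any mapped index can be referenced from the internal catalog with a fully qualified three-part name. The catalog and index names below are the example values used in this topic:

```sql
-- Read the mapped Elasticsearch index through the catalog.
-- Format: <catalog>.<database>.<table>; the database is always default_db.
SELECT * FROM es_catalog.default_db.product_info LIMIT 10;
```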
Sample environment
The following steps use these example values. Substitute your own values as needed.
| Item | Value |
|---|---|
| Elasticsearch index (source) | product_info |
| SelectDB database (destination) | es_db |
| SelectDB table (destination) | test_Es2SelectDB |
Prepare the sample source data
For more information, see Getting started.
Migrate offline data
Step 1: Connect to the SelectDB instance
For connection instructions, see Connect to an instance.
Step 2: Create an Elasticsearch catalog
Run the following statement to create a catalog that points to your Elasticsearch cluster:
```sql
CREATE CATALOG es_catalog PROPERTIES (
    "type" = "es",
    "hosts" = "http://127.0.0.1:9200",
    "user" = "test_user",
    "password" = "test_passwd",
    "nodes_discovery" = "false"
);
```
The following table describes the catalog properties:
| Parameter | Required | Default | Description |
|---|---|---|---|
| `es_catalog` | Yes | — | The name of the catalog. Change the name based on your business requirements. |
| `type` | Yes | — | Set to `es`. |
| `hosts` | Yes | — | The access URL of the Elasticsearch cluster. Format: `http://<IP address>:<port>`. To use a Server Load Balancer (SLB) endpoint, set this to the SLB URL. |
| `user` | No | — | The username for accessing the Elasticsearch cluster. |
| `password` | No | — | The password for accessing the Elasticsearch cluster. |
| `doc_value_scan` | No | true | Enables column-oriented storage (doc_values) for querying field values. When enabled, SelectDB automatically queries fields with doc_values enabled, which significantly improves performance for queries on a small number of columns. |
| `keyword_sniff` | No | true | Detects TEXT fields in Elasticsearch and queries them using the corresponding KEYWORD fields. When true, SelectDB matches the full field value before tokenization. When false, SelectDB matches individual terms after tokenization. |
| `nodes_discovery` | No | true | Enables Elasticsearch node discovery. When true, SelectDB discovers and connects to all available data nodes. When false, SelectDB accesses nodes only through the URL specified in `hosts`. Set to false for Alibaba Cloud Elasticsearch clusters, which can only be accessed via URL. |
| `mapping_es_id` | No | false | Maps the Elasticsearch `_id` field to a SelectDB column. Set to true if you need to query the primary key value. When false, the `_id` field is not accessible in queries. |
| `like_push_down` | No | true | Converts LIKE queries to Elasticsearch wildcard queries and pushes them down to the cluster. This can increase CPU consumption on the Elasticsearch cluster in some scenarios. Set to false to disable LIKE pushdown. |
| `include_hidden_index` | No | false | Includes hidden indexes in query conditions. |
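As a sketch of how the optional properties combine, the following statement creates a catalog for the same cluster with `_id` mapping enabled and LIKE pushdown disabled. The catalog name `es_catalog_tuned` is hypothetical, and the endpoint and credentials are the same placeholders as above:

```sql
CREATE CATALOG es_catalog_tuned PROPERTIES (
    "type" = "es",
    "hosts" = "http://127.0.0.1:9200",
    "user" = "test_user",
    "password" = "test_passwd",
    "nodes_discovery" = "false",  -- required for Alibaba Cloud Elasticsearch
    "mapping_es_id" = "true",     -- expose the _id field as a queryable column
    "like_push_down" = "false"    -- evaluate LIKE in SelectDB instead of pushing it down
);
```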
Step 3: Verify the catalog
Run SHOW CATALOGS to confirm the catalog was created:
```sql
SHOW CATALOGS;
```
Expected output:
```
+--------------+--------------+----------+-----------+-------------------------+---------------------+------------------------+
| CatalogId    | CatalogName  | Type     | IsCurrent | CreateTime              | LastUpdateTime      | Comment                |
+--------------+--------------+----------+-----------+-------------------------+---------------------+------------------------+
| 436009309195 | es_catalog   | es       |           | 2024-08-06 17:09:08.058 | 2024-07-19 18:04:37 |                        |
| 0            | internal     | internal | yes       | UNRECORDED              | NULL                | Doris internal catalog |
+--------------+--------------+----------+-----------+-------------------------+---------------------+------------------------+
```
Step 4: (Optional) Query the Elasticsearch catalog
Switch to the external catalog to browse the mapped tables:
```sql
SWITCH es_catalog;
```
Query and access data in es_catalog the same way you would in the internal catalog. For query examples, see the "Query data from an Elasticsearch data source" section of the Elasticsearch data source topic.
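For example, after switching you can browse the mapped database and tables. The index name `product_info` is the sample source from this topic:

```sql
SHOW DATABASES;          -- contains the automatically created default_db
USE default_db;
SHOW TABLES;             -- each Elasticsearch index appears as a table
SELECT * FROM product_info LIMIT 10;
```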
Switch back to the internal catalog when done:
```sql
SWITCH internal;
```
Step 5: Create the destination database
Skip this step if the database already exists.
```sql
CREATE DATABASE es_db;
```
Switch to the destination database:
```sql
USE es_db;
```
Step 6: Create the destination table
Before creating the table, review the column type mappings between Elasticsearch and SelectDB. For the full type mapping reference, see the "Column type mappings" section of the Elasticsearch data source topic.
Create the table using column types that match the Elasticsearch index schema:
```sql
CREATE TABLE test_Es2SelectDB
(
    `annual_rate` VARCHAR(200),
    `describe`    TEXT,
    `productName` VARCHAR(200)
)
DISTRIBUTED BY HASH(`productName`) BUCKETS 4
PROPERTIES ("replication_num" = "1");
```
Step 7: Migrate data
Run INSERT INTO ... SELECT to copy all rows from the Elasticsearch index into the SelectDB table:
```sql
INSERT INTO test_Es2SelectDB SELECT * FROM es_catalog.default_db.product_info;
```
Step 8: Verify the migration
Query the destination table to confirm the data was imported:
```sql
SELECT * FROM test_Es2SelectDB;
```
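Beyond spot-checking individual rows, a simple consistency check is to compare row counts between the source (read through the catalog) and the destination:

```sql
SELECT COUNT(*) FROM es_catalog.default_db.product_info;  -- source index
SELECT COUNT(*) FROM test_Es2SelectDB;                    -- destination table
-- The two counts should match if every row was copied.
```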
Migrate incremental data
In production environments, Elasticsearch data typically includes both historical (offline) data and ongoing incremental data. Use the following strategies based on your data type.
Log data
Write new log entries simultaneously to both the Elasticsearch cluster and the SelectDB instance. Once enough data accumulates in SelectDB, run your analysis queries directly against SelectDB.
Transactional or event data
Use SelectDB's Unique key model to handle deduplication automatically. Write incremental data to both the Elasticsearch cluster and SelectDB in parallel. Migrate historical data using the offline migration steps above. If historical and incremental records overlap, the Unique key model deduplicates on the primary key, so no manual conflict resolution is needed.
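As a sketch, a Unique key version of the destination table might look like the following. The table name and the choice of `productName` as the unique key are assumptions for illustration; the key columns must uniquely identify each record in your data and must be the leading columns of the table:

```sql
CREATE TABLE test_Es2SelectDB_unique
(
    `productName` VARCHAR(200),   -- key column must come first
    `annual_rate` VARCHAR(200),
    `describe`    TEXT
)
UNIQUE KEY(`productName`)
DISTRIBUTED BY HASH(`productName`) BUCKETS 4
PROPERTIES ("replication_num" = "1");
-- Rows with the same productName are deduplicated; the latest write wins.
```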
ELK ecosystem compatibility
For log collection, SelectDB offers customized and enhanced Filebeat and Logstash plugins that efficiently collect and process log data. You can also collect logs by using a standard data import method.
What's next
- Elasticsearch data source: full parameter reference, column type mappings, and predicate pushdown details.
- Data lakehouse: working with external catalogs in SelectDB.
- Configure an IP address whitelist: network access control for your SelectDB instance.