ApsaraDB for SelectDB supports querying Elasticsearch data directly through an Elasticsearch catalog, enabling federated OLAP analysis without moving data. The catalog automatically maps Elasticsearch index metadata, supporting multi-index joins within Elasticsearch and cross-system joins between SelectDB and Elasticsearch.
Supported versions: Elasticsearch 5.X and later.
Prerequisites
Before you begin, ensure that you have:
All nodes in your Elasticsearch cluster connected to the SelectDB instance — nodes must share the same virtual private cloud (VPC), or you must configure cross-VPC connectivity. For details, see What do I do if a connection fails to be established between an ApsaraDB for SelectDB instance and a data source?
The IP addresses of all Elasticsearch cluster nodes added to the SelectDB instance IP address whitelist. See Configure an IP address whitelist
The VPC IP addresses of the SelectDB instance added to the Elasticsearch cluster's allowlist (if the cluster enforces one). To find the SelectDB VPC IP addresses, see How do I view the IP addresses in the VPC to which my ApsaraDB SelectDB instance belongs?
Basic familiarity with SelectDB catalogs. See Data lakehouse
Create an Elasticsearch catalog
CREATE CATALOG test_es PROPERTIES (
"type"="es",
"hosts"="http://127.0.0.1:9200",
"user"="test_user",
"password"="test_passwd",
"nodes_discovery"="false"
);Because Elasticsearch has no database concept, SelectDB automatically creates a single database named default_db under the catalog. After switching to the catalog with SWITCH, SelectDB enters default_db automatically — no USE default_db statement needed.
Parameters
| Parameter | Required | Default | Description |
|---|---|---|---|
hosts | Yes | — | URL of the Elasticsearch data source. Accepts one or more URLs, or the URL of a Server Load Balancer (SLB) instance fronting the cluster. |
user | No | — | Account for accessing the Elasticsearch data source. |
password | No | — | Password for the account. |
doc_value_scan | No | true | Enable column-oriented storage (doc_values) for querying field values. |
keyword_sniff | No | true | Detect TEXT fields and query via their corresponding KEYWORD sub-fields. If set to false, queries match against tokenized terms from the TEXT field. |
nodes_discovery | No | true | Enable automatic node discovery. Set to false when using Alibaba Cloud Elasticsearch, which routes traffic through an SLB instance and does not expose nodes directly. |
ssl | No | false | Enable HTTPS access. SelectDB trusts all HTTPS requests from frontend (FE) and backend (BE) nodes regardless of SSL certificate validity. |
mapping_es_id | No | false | Map the _id metadata field from the Elasticsearch index. |
like_push_down | No | true | Convert LIKE conditions to Elasticsearch wildcards and push them down to Elasticsearch. This increases CPU usage on the Elasticsearch side. |
include_hidden_index | No | false | Include hidden indexes. |
Only HTTP basic authentication is supported. The account must have read access to/_cluster/state/and_nodes/http, plus read permissions on the indexes. If HTTPS is not enabled, the account and password are optional. For Elasticsearch 5.x or 6.x indexes with multiple types, SelectDB reads data from the first type only.
Query data
After creating the catalog, query Elasticsearch tables the same way you query internal SelectDB tables. All three of the following approaches are equivalent:
-- Switch to the catalog, then query
SWITCH test_es;
SELECT * FROM es_table LIMIT 10;
-- Use the fully qualified database path
USE test_es.default_db;
SELECT * FROM es_table LIMIT 10;
-- Use the fully qualified table name directly
SELECT * FROM test_es.default_db.es_table LIMIT 10;Rollup, pre-aggregation, and materialized views are not available for external Elasticsearch tables.
Basic queries
SELECT * FROM es_table WHERE k1 > 1000 AND k3 = 'term' OR k4 LIKE 'fu*z_';Extended esquery
Use esquery(field, QueryDSL) to push down queries that cannot be expressed in SQL — such as match_phrase and geo_shape — directly to Elasticsearch. The field parameter associates the query with an index. The QueryDSL parameter is a JSON object with exactly one root key.
match_phrase query:
SELECT * FROM es_table WHERE esquery(k4, '{"match_phrase": {"k4": "selectdb on es"}}');geo_shape query:
SELECT * FROM es_table WHERE esquery(k4, '{"geo_shape": {"location": {"shape": {"type": "envelope", "coordinates": [[13, 53], [14, 52]]}, "relation": "within"}}}');bool query:
SELECT * FROM es_table WHERE esquery(k4, '{"bool": {"must": [{"terms": {"k1": [11, 12]}}, {"terms": {"k2": [100]}}]}}');Column type mappings
| Elasticsearch type | SelectDB type | Notes |
|---|---|---|
| NULL | NULL | |
| BOOLEAN | BOOLEAN | |
| BYTE | TINYINT | |
| SHORT | SMALLINT | |
| INTEGER | INT | |
| LONG | BIGINT | |
| UNSIGNED_LONG | LARGEINT | |
| FLOAT | FLOAT | |
| HALF_FLOAT | FLOAT | |
| DOUBLE | DOUBLE | |
| SCALED_FLOAT | DOUBLE | |
| DATE | DATE | Supported formats: default, yyyy-MM-dd HH:mm:ss, yyyy-MM-dd, epoch_millis |
| KEYWORD | STRING | |
| TEXT | STRING | |
| IP | STRING | |
| NESTED | STRING | |
| OBJECT | STRING | |
| OTHER | UNSUPPORTED |
ARRAY type
Elasticsearch has no explicit ARRAY type, but a field can contain zero or more values. To declare a field as an array in SelectDB, add an array_fields entry under _meta.selectdb in the index mapping.
Example data structure for the doc index:
{
"array_int_field": [1, 2, 3, 4],
"array_string_field": ["selectdb", "is", "the", "best"],
"id_field": "id-xxx-xxx",
"timestamp_field": "2022-11-12T12:08:56Z",
"array_object_field": [{"name": "xxx", "age": 18}]
}Update the mapping to declare array fields:
# Elasticsearch 7.x and later
curl -X PUT "localhost:9200/doc/_mapping?pretty" -H 'Content-Type: application/json' -d '
{
"_meta": {
"selectdb": {
"array_fields": [
"array_int_field",
"array_string_field",
"array_object_field"
]
}
}
}'
# Elasticsearch 6.x and earlier
curl -X PUT "localhost:9200/doc/_mapping?pretty" -H 'Content-Type: application/json' -d '
{
"_doc": {
"_meta": {
"selectdb": {
"array_fields": [
"array_int_field",
"array_string_field",
"array_object_field"
]
}
}
}
}'Best practices
Filter condition pushdown
SelectDB pushes filter conditions down to Elasticsearch so only matching data is returned, reducing CPU, memory, and I/O load on both systems. The following table shows how SQL operators map to Elasticsearch Query DSL.
| SQL syntax | Elasticsearch query |
|---|---|
= | term query |
IN | terms query |
>, <, >=, <= | range query |
AND | bool.filter |
OR | bool.should |
NOT | bool.must_not |
NOT IN | bool.must_not + terms query |
IS NOT NULL | exists query |
IS NULL | bool.must_not + exists query |
esquery | Native Query DSL |
Enable columnar scan to accelerate queries
Set enable_docvalue_scan to true to read field values from column-oriented storage (doc_values) instead of the _source field. When only a few columns are queried, columnar scan can be more than ten times faster than reading from _source.
SelectDB applies two principles when columnar scan is enabled:
Best effort: If all queried fields have
doc_valueenabled, SelectDB reads entirely from column-oriented storage.Auto-downgrade: If any queried field lacks
doc_value, SelectDB falls back to reading from_sourcefor all fields.
TEXT fields cannot use column-oriented storage. If a TEXT field is in the query, SelectDB reads from
_source.When querying 25 or more fields, the performance difference between columnar scan and
_sourcebecomes negligible.
Detect KEYWORD fields
Set enable_keyword_sniff to true to make SelectDB automatically use KEYWORD sub-fields for equality queries on STRING fields.
When Elasticsearch auto-creates an index, STRING fields get both a TEXT field and a KEYWORD sub-field:
"k4": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}Without keyword sniffing, the SQL condition k4 = "SelectDB On ES" generates this Query DSL:
"term": {"k4": "SelectDB On ES"}Because k4 is a TEXT field, its value is tokenized into selectdb, on, and es — none of which match the full phrase, so no results are returned.
With enable_keyword_sniff set to true, SelectDB automatically rewrites the condition to target the KEYWORD sub-field:
"term": {"k4.keyword": "SelectDB On ES"}The KEYWORD field stores the original unmodified value, so the exact phrase matches correctly.
Enable node discovery
Set nodes_discovery to true to let SelectDB discover all Elasticsearch data nodes with allocated shards.
Alibaba Cloud Elasticsearch routes traffic through an SLB instance, which prevents direct access to individual cluster nodes. Always setnodes_discoverytofalsewhen connecting to Alibaba Cloud Elasticsearch.
Enable HTTPS
Set ssl to true to connect to the Elasticsearch cluster over HTTPS. SelectDB trusts all HTTPS requests from FE and BE nodes regardless of SSL certificate validity.
Handle time fields
The guidance in this section applies to external Elasticsearch tables only. For Elasticsearch catalogs, time fields are automatically mapped to the DATE or DATETIME type.
Configure the date field format in Elasticsearch to support a broad range of inputs:
"dt": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
}In SelectDB, define the field as date, datetime, or varchar. All of the following filter conditions push down correctly:
SELECT * FROM doe WHERE k2 > '2020-06-21';
SELECT * FROM doe WHERE k2 < '2020-06-21 12:00:00';
SELECT * FROM doe WHERE k2 < 1593497011;
SELECT * FROM doe WHERE k2 < now();
SELECT * FROM doe WHERE k2 < date_format(now(), '%Y-%m-%d');If no
formatis set in Elasticsearch, the default isstrict_date_optional_time||epoch_millis.When importing timestamp values into a DATE field in Elasticsearch, the timestamp must be in milliseconds. Elasticsearch requires millisecond precision for internal processing; other units cause errors.
Query the _id field
Elasticsearch automatically assigns a globally unique _id to each document when none is specified at import time. To query _id from an external table, declare it as a VARCHAR column:
CREATE EXTERNAL TABLE `doe` (
`_id` varchar COMMENT "",
`city` varchar COMMENT ""
) ENGINE=ELASTICSEARCH
PROPERTIES (
"hosts" = "http://127.0.0.1:8200",
"user" = "root",
"password" = "root",
"index" = "doe"
);To query _id from an Elasticsearch catalog, set mapping_es_id to true.
Filter the
_idfield using only the=orINoperator.The
_idfield must be of typeVARCHAR.
Limitations
Rollup, pre-aggregation, materialized views: Not supported for external Elasticsearch tables.
How it works
+----------------------------------------------+
| |
| SelectDB +------------------+ |
| | FE +--------------+-------+
| | | Request Shard Location
| +--+-------------+-+ | |
| ^ ^ | |
| | | | |
| +-------------------+ +------------------+ | |
| | | | | | | | |
| | +----------+----+ | | +--+-----------+ | | |
| | | BE | | | | BE | | | |
| | +---------------+ | | +--------------+ | | |
+----------------------------------------------+ |
| | | | | | |
| | | | | | |
| HTTP SCROLL | | HTTP SCROLL | |
+-----------+---------------------+------------+ |
| | v | | v | | |
| | +------+--------+ | | +------+-------+ | | |
| | | | | | | | | | |
| | | DataNode | | | | DataNode +<-----------+
| | | | | | | | | | |
| | | +<--------------------------------+
| | +---------------+ | | |--------------| | | |
| +-------------------+ +------------------+ | |
| Same Physical Node | |
| | |
| +-----------------------+ | |
| | | | |
| | MasterNode +<-----------------+
| ES | | |
| +-----------------------+ |
+----------------------------------------------+The query flow works as follows:
The FE sends a request to the configured host to get the HTTP port information for all nodes and the shard distribution of the index. If the request fails, the FE traverses the host list sequentially until one succeeds.
The FE generates a query execution plan based on the node and index metadata, then sends it to the relevant BEs.
Each BE concurrently fetches data from its assigned Elasticsearch index shards in streaming mode using the HTTP Scroll API — reading from either
_source(row-oriented) or doc_values (column-oriented), depending on the query.SelectDB computes the final results and returns them.