
ApsaraDB for SelectDB:Elasticsearch data source

Last Updated: Mar 28, 2026

ApsaraDB for SelectDB supports querying Elasticsearch data directly through an Elasticsearch catalog, enabling federated OLAP analysis without moving data. The catalog automatically maps Elasticsearch index metadata, supporting multi-index joins within Elasticsearch and cross-system joins between SelectDB and Elasticsearch.

Supported versions: Elasticsearch 5.X and later.

Prerequisites

Before you begin, ensure that you have a SelectDB instance and an Elasticsearch cluster (5.X or later) that the instance can reach over the network.

Create an Elasticsearch catalog

CREATE CATALOG test_es PROPERTIES (
    "type"="es",
    "hosts"="http://127.0.0.1:9200",
    "user"="test_user",
    "password"="test_passwd",
    "nodes_discovery"="false"
);

Because Elasticsearch has no database concept, SelectDB automatically creates a single database named default_db under the catalog. After switching to the catalog with SWITCH, SelectDB enters default_db automatically — no USE default_db statement needed.

Parameters

Parameter | Required | Default | Description
hosts | Yes | None | URL of the Elasticsearch data source. Accepts one or more URLs, or the URL of a Server Load Balancer (SLB) instance fronting the cluster.
user | No | None | Account for accessing the Elasticsearch data source.
password | No | None | Password for the account.
enable_docvalue_scan | No | true | Enable column-oriented storage (doc_values) for querying field values.
enable_keyword_sniff | No | true | Detect TEXT fields and query via their corresponding KEYWORD sub-fields. If set to false, queries match against tokenized terms from the TEXT field.
nodes_discovery | No | true | Enable automatic node discovery. Set to false when using Alibaba Cloud Elasticsearch, which routes traffic through an SLB instance and does not expose nodes directly.
ssl | No | false | Enable HTTPS access. SelectDB trusts all HTTPS requests from frontend (FE) and backend (BE) nodes regardless of SSL certificate validity.
mapping_es_id | No | false | Map the _id metadata field from the Elasticsearch index.
like_push_down | No | true | Convert LIKE conditions to Elasticsearch wildcards and push them down to Elasticsearch. This increases CPU usage on the Elasticsearch side.
include_hidden_index | No | false | Include hidden indexes.
Note
  • Only HTTP basic authentication is supported. The account must have read access to /_cluster/state/ and _nodes/http, plus read permissions on the indexes.

  • If HTTPS is not enabled, the account and password are optional.

  • For Elasticsearch 5.x or 6.x indexes with multiple types, SelectDB reads data from the first type only.
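Because only HTTP basic authentication is supported, the user and password properties end up as a standard Authorization header on each request. A minimal sketch of how that header is formed (basic_auth_header is a hypothetical helper for illustration, not part of SelectDB):

```python
import base64

def basic_auth_header(user: str, password: str) -> str:
    """Build the Authorization header value for HTTP basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"

# Credentials from the CREATE CATALOG example above.
print(basic_auth_header("test_user", "test_passwd"))
```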

Query data

After creating the catalog, query Elasticsearch tables the same way you query internal SelectDB tables. All three of the following approaches are equivalent:

-- Switch to the catalog, then query
SWITCH test_es;
SELECT * FROM es_table LIMIT 10;

-- Use the fully qualified database path
USE test_es.default_db;
SELECT * FROM es_table LIMIT 10;

-- Use the fully qualified table name directly
SELECT * FROM test_es.default_db.es_table LIMIT 10;
Rollup, pre-aggregation, and materialized views are not available for external Elasticsearch tables.

Basic queries

SELECT * FROM es_table WHERE k1 > 1000 AND k3 = 'term' OR k4 LIKE 'fu*z_';
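When like_push_down is enabled, a LIKE condition like the one above is turned into an Elasticsearch wildcard query. A sketch of that translation, assuming the usual SQL-to-wildcard substitution (% becomes *, _ becomes ?) and ignoring escape sequences; like_to_wildcard is a hypothetical helper:

```python
import json

def like_to_wildcard(field: str, pattern: str) -> str:
    """Translate a SQL LIKE pattern into an Elasticsearch wildcard query.

    SQL % (any run of characters) maps to * and SQL _ (any single
    character) maps to ?; other characters pass through unchanged.
    """
    wildcard = pattern.replace("%", "*").replace("_", "?")
    return json.dumps({"wildcard": {field: wildcard}})

print(like_to_wildcard("k4", "fu%z_"))  # {"wildcard": {"k4": "fu*z?"}}
```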

Extended esquery

Use esquery(field, QueryDSL) to push down queries that cannot be expressed in SQL — such as match_phrase and geo_shape — directly to Elasticsearch. The field parameter associates the query with an index. The QueryDSL parameter is a JSON object with exactly one root key.

match_phrase query:

SELECT * FROM es_table WHERE esquery(k4, '{"match_phrase": {"k4": "selectdb on es"}}');

geo_shape query:

SELECT * FROM es_table WHERE esquery(k4, '{"geo_shape": {"location": {"shape": {"type": "envelope", "coordinates": [[13, 53], [14, 52]]}, "relation": "within"}}}');

bool query:

SELECT * FROM es_table WHERE esquery(k4, '{"bool": {"must": [{"terms": {"k1": [11, 12]}}, {"terms": {"k2": [100]}}]}}');
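Hand-writing Query DSL strings inside SQL literals is error-prone; building the JSON programmatically also makes the one-root-key rule easy to enforce. A sketch (make_esquery_dsl is a hypothetical helper, not a SelectDB function):

```python
import json

def make_esquery_dsl(query: dict) -> str:
    """Serialize a Query DSL object for esquery(), enforcing the
    rule that the JSON object has exactly one root key."""
    if len(query) != 1:
        raise ValueError("esquery() Query DSL must have exactly one root key")
    return json.dumps(query)

# The bool example from above, built programmatically.
dsl = make_esquery_dsl({
    "bool": {
        "must": [
            {"terms": {"k1": [11, 12]}},
            {"terms": {"k2": [100]}},
        ]
    }
})
print(dsl)
```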

Column type mappings

Elasticsearch type | SelectDB type | Notes
NULL | NULL |
BOOLEAN | BOOLEAN |
BYTE | TINYINT |
SHORT | SMALLINT |
INTEGER | INT |
LONG | BIGINT |
UNSIGNED_LONG | LARGEINT |
FLOAT | FLOAT |
HALF_FLOAT | FLOAT |
DOUBLE | DOUBLE |
SCALED_FLOAT | DOUBLE |
DATE | DATE | Supported formats: default, yyyy-MM-dd HH:mm:ss, yyyy-MM-dd, epoch_millis
KEYWORD | STRING |
TEXT | STRING |
IP | STRING |
NESTED | STRING |
OBJECT | STRING |
OTHER | UNSUPPORTED |
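The mapping table above is a straight lookup; expressed as code, assuming lowercase Elasticsearch type names as they appear in index mappings:

```python
# Column type mapping from the table above, as a lookup dict.
ES_TO_SELECTDB = {
    "null": "NULL", "boolean": "BOOLEAN", "byte": "TINYINT",
    "short": "SMALLINT", "integer": "INT", "long": "BIGINT",
    "unsigned_long": "LARGEINT", "float": "FLOAT", "half_float": "FLOAT",
    "double": "DOUBLE", "scaled_float": "DOUBLE", "date": "DATE",
    "keyword": "STRING", "text": "STRING", "ip": "STRING",
    "nested": "STRING", "object": "STRING",
}

def map_es_type(es_type: str) -> str:
    """Return the SelectDB type for an Elasticsearch field type;
    anything not in the table is UNSUPPORTED."""
    return ES_TO_SELECTDB.get(es_type.lower(), "UNSUPPORTED")

print(map_es_type("half_float"))  # FLOAT
```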

ARRAY type

Elasticsearch has no explicit ARRAY type, but a field can contain zero or more values. To declare a field as an array in SelectDB, add an array_fields entry under _meta.selectdb in the index mapping.

Example data structure for the doc index:

{
  "array_int_field": [1, 2, 3, 4],
  "array_string_field": ["selectdb", "is", "the", "best"],
  "id_field": "id-xxx-xxx",
  "timestamp_field": "2022-11-12T12:08:56Z",
  "array_object_field": [{"name": "xxx", "age": 18}]
}

Update the mapping to declare array fields:

# Elasticsearch 7.x and later
curl -X PUT "localhost:9200/doc/_mapping?pretty" -H 'Content-Type: application/json' -d '
{
    "_meta": {
        "selectdb": {
            "array_fields": [
                "array_int_field",
                "array_string_field",
                "array_object_field"
            ]
        }
    }
}'

# Elasticsearch 6.x and earlier
curl -X PUT "localhost:9200/doc/_mapping?pretty" -H 'Content-Type: application/json' -d '
{
    "_doc": {
        "_meta": {
            "selectdb": {
                "array_fields": [
                    "array_int_field",
                    "array_string_field",
                    "array_object_field"
                ]
            }
        }
    }
}'
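To verify what the catalog will see, the declared array fields can be read back out of the mapping response. A sketch of that extraction, covering both mapping layouts shown above (array_fields_from_mapping is a hypothetical helper):

```python
def array_fields_from_mapping(mapping, doc_type=None):
    """Read the _meta.selectdb.array_fields list out of an index mapping.

    For Elasticsearch 6.x and earlier, _meta sits under the type name
    (doc_type, e.g. "_doc"); for 7.x and later it sits at the top level.
    """
    root = mapping[doc_type] if doc_type else mapping
    return root.get("_meta", {}).get("selectdb", {}).get("array_fields", [])

# 7.x-style mapping, mirroring the curl example above.
mapping_7x = {
    "_meta": {"selectdb": {"array_fields": ["array_int_field", "array_string_field"]}}
}
print(array_fields_from_mapping(mapping_7x))
```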

Best practices

Filter condition pushdown

SelectDB pushes filter conditions down to Elasticsearch so only matching data is returned, reducing CPU, memory, and I/O load on both systems. The following table shows how SQL operators map to Elasticsearch Query DSL.

SQL syntax | Elasticsearch query
= | term query
IN | terms query
>, <, >=, <= | range query
AND | bool.filter
OR | bool.should
NOT | bool.must_not
NOT IN | bool.must_not + terms query
IS NOT NULL | exists query
IS NULL | bool.must_not + exists query
esquery | Native Query DSL
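The table above can be sketched as a translation function for the simplest single-predicate cases (push_down is a hypothetical helper, not SelectDB's actual planner):

```python
import json

def push_down(field, op, value=None):
    """Translate one SQL predicate into Elasticsearch Query DSL,
    following the operator mapping table above."""
    if op == "=":
        return {"term": {field: value}}
    if op == "in":
        return {"terms": {field: value}}
    if op in (">", "<", ">=", "<="):
        range_op = {">": "gt", "<": "lt", ">=": "gte", "<=": "lte"}[op]
        return {"range": {field: {range_op: value}}}
    if op == "is_null":
        return {"bool": {"must_not": {"exists": {"field": field}}}}
    if op == "is_not_null":
        return {"exists": {"field": field}}
    raise ValueError(f"unsupported operator: {op}")

print(json.dumps(push_down("k1", ">", 1000)))  # {"range": {"k1": {"gt": 1000}}}
```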

Enable columnar scan to accelerate queries

Set enable_docvalue_scan to true to read field values from column-oriented storage (doc_values) instead of the _source field. When only a few columns are queried, columnar scan can be more than ten times faster than reading from _source.

SelectDB applies two principles when columnar scan is enabled:

  • Best effort: If all queried fields have doc_value enabled, SelectDB reads entirely from column-oriented storage.

  • Auto-downgrade: If any queried field lacks doc_value, SelectDB falls back to reading from _source for all fields.

Important
  • TEXT fields cannot use column-oriented storage. If a TEXT field is in the query, SelectDB reads from _source.

  • When querying 25 or more fields, the performance difference between columnar scan and _source becomes negligible.
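The best-effort and auto-downgrade principles amount to a simple per-query decision. A sketch, assuming field metadata of the form {"type": ..., "doc_value": bool} (choose_scan_mode is a hypothetical helper):

```python
def choose_scan_mode(queried_fields, field_meta):
    """Decide between doc_values (columnar) scan and _source, following
    the best-effort / auto-downgrade principles above."""
    for f in queried_fields:
        meta = field_meta[f]
        # TEXT fields have no doc_values; any field without doc_value
        # forces a fallback to _source for all fields.
        if meta["type"] == "text" or not meta.get("doc_value", False):
            return "_source"
    return "doc_values"

meta = {
    "k1": {"type": "long", "doc_value": True},
    "k4": {"type": "text", "doc_value": False},
}
print(choose_scan_mode(["k1"], meta))        # doc_values
print(choose_scan_mode(["k1", "k4"], meta))  # _source
```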

Detect KEYWORD fields

Set enable_keyword_sniff to true to make SelectDB automatically use KEYWORD sub-fields for equality queries on STRING fields.

When Elasticsearch auto-creates an index, STRING fields get both a TEXT field and a KEYWORD sub-field:

"k4": {
    "type": "text",
    "fields": {
        "keyword": {
            "type": "keyword",
            "ignore_above": 256
        }
    }
}

Without keyword sniffing, the SQL condition k4 = "SelectDB On ES" generates this Query DSL:

"term": {"k4": "SelectDB On ES"}

Because k4 is a TEXT field, its value is tokenized into selectdb, on, and es — none of which match the full phrase, so no results are returned.

With enable_keyword_sniff set to true, SelectDB automatically rewrites the condition to target the KEYWORD sub-field:

"term": {"k4.keyword": "SelectDB On ES"}

The KEYWORD field stores the original unmodified value, so the exact phrase matches correctly.
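The rewrite above can be sketched as a mapping-driven rule: if the field is TEXT and has a KEYWORD sub-field, target the sub-field instead (sniff_keyword is a hypothetical helper illustrating enable_keyword_sniff, not SelectDB's implementation):

```python
import json

def sniff_keyword(field, value, mapping):
    """Rewrite an equality condition to target the KEYWORD sub-field
    when the index mapping declares one."""
    props = mapping.get(field, {})
    if props.get("type") == "text":
        for sub_name, sub in props.get("fields", {}).items():
            if sub.get("type") == "keyword":
                return {"term": {f"{field}.{sub_name}": value}}
    return {"term": {field: value}}

# The auto-created mapping shown above.
mapping = {
    "k4": {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
    }
}
print(json.dumps(sniff_keyword("k4", "SelectDB On ES", mapping)))
# {"term": {"k4.keyword": "SelectDB On ES"}}
```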

Enable node discovery

Set nodes_discovery to true to let SelectDB discover all Elasticsearch data nodes with allocated shards.

Alibaba Cloud Elasticsearch routes traffic through an SLB instance, which prevents direct access to individual cluster nodes. Always set nodes_discovery to false when connecting to Alibaba Cloud Elasticsearch.

Enable HTTPS

Set ssl to true to connect to the Elasticsearch cluster over HTTPS. SelectDB trusts all HTTPS requests from FE and BE nodes regardless of SSL certificate validity.

Handle time fields

The guidance in this section applies to external Elasticsearch tables only. For Elasticsearch catalogs, time fields are automatically mapped to the DATE or DATETIME type.

Configure the date field format in Elasticsearch to support a broad range of inputs:

"dt": {
    "type": "date",
    "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
}

In SelectDB, define the field as date, datetime, or varchar. All of the following filter conditions push down correctly:

SELECT * FROM doe WHERE k2 > '2020-06-21';
SELECT * FROM doe WHERE k2 < '2020-06-21 12:00:00';
SELECT * FROM doe WHERE k2 < 1593497011;
SELECT * FROM doe WHERE k2 < now();
SELECT * FROM doe WHERE k2 < date_format(now(), '%Y-%m-%d');
Important
  • If no format is set in Elasticsearch, the default is strict_date_optional_time||epoch_millis.

  • When importing timestamp values into a DATE field in Elasticsearch, the timestamp must be in milliseconds. Elasticsearch requires millisecond precision for internal processing; other units cause errors.
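Since Elasticsearch expects millisecond-precision timestamps, a seconds-based epoch value such as the 1593497011 used above must be scaled before import. A sketch (to_epoch_millis is a hypothetical helper):

```python
from datetime import datetime, timezone

def to_epoch_millis(dt: datetime) -> int:
    """Convert a datetime to the millisecond epoch value that
    Elasticsearch DATE fields require."""
    return int(dt.timestamp() * 1000)

# 1593497011 seconds corresponds to 2020-06-30 06:03:31 UTC.
dt = datetime(2020, 6, 30, 6, 3, 31, tzinfo=timezone.utc)
print(to_epoch_millis(dt))  # 1593497011000
```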

Query the _id field

Elasticsearch automatically assigns a globally unique _id to each document when none is specified at import time. To query _id from an external table, declare it as a VARCHAR column:

CREATE EXTERNAL TABLE `doe` (
    `_id`  varchar COMMENT "",
    `city` varchar COMMENT ""
) ENGINE=ELASTICSEARCH
PROPERTIES (
    "hosts"    = "http://127.0.0.1:8200",
    "user"     = "root",
    "password" = "root",
    "index"    = "doe"
);

To query _id from an Elasticsearch catalog, set mapping_es_id to true.

Important
  • Filter the _id field using only the = or IN operator.

  • The _id field must be of type VARCHAR.

Limitations

  • Rollup, pre-aggregation, materialized views: Not supported for external Elasticsearch tables.

How it works

+----------------------------------------------+
|                                              |
| SelectDB   +------------------+              |
|            |       FE         +--------------+-------+
|            |                  |  Request Shard Location
|            +--+-------------+-+              |       |
|               ^             ^                |       |
|               |             |                |       |
|  +-------------------+ +------------------+  |       |
|  |            |      | |    |             |  |       |
|  | +----------+----+ | | +--+-----------+ |  |       |
|  | |      BE       | | | |      BE      | |  |       |
|  | +---------------+ | | +--------------+ |  |       |
+----------------------------------------------+       |
   |        |          | |        |         |          |
   |        |          | |        |         |          |
   |    HTTP SCROLL    | |    HTTP SCROLL   |          |
+-----------+---------------------+------------+       |
|  |        v          | |        v         |  |       |
|  | +------+--------+ | | +------+-------+ |  |       |
|  | |               | | | |              | |  |       |
|  | |   DataNode    | | | |   DataNode   +<-----------+
|  | |               | | | |              | |  |       |
|  | |               +<--------------------------------+
|  | +---------------+ | | +--------------+ |  |       |
|  +-------------------+ +------------------+  |       |
|   Same Physical Node                         |       |
|                                              |       |
|           +-----------------------+          |       |
|           |                       |          |       |
|           |      MasterNode       +<-----------------+
| ES        |                       |          |
|           +-----------------------+          |
+----------------------------------------------+

The query flow works as follows:

  1. The FE sends a request to the configured host to get the HTTP port information for all nodes and the shard distribution of the index. If the request fails, the FE traverses the host list sequentially until one succeeds.

  2. The FE generates a query execution plan based on the node and index metadata, then sends it to the relevant BEs.

  3. Each BE concurrently fetches data from its assigned Elasticsearch index shards in streaming mode using the HTTP Scroll API — reading from either _source (row-oriented) or doc_values (column-oriented), depending on the query.

  4. SelectDB computes the final results and returns them.
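Step 3's streaming reads follow the Scroll API's paging pattern: each call returns the next batch until the shard is exhausted. A simplified simulation of that batching over an in-memory list (scroll_pages is a hypothetical sketch, not the BE's actual HTTP client):

```python
def scroll_pages(docs, batch_size):
    """Yield successive fixed-size pages, mimicking how a BE consumes a
    shard batch by batch via the HTTP Scroll API."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size]

# Each BE streams its assigned shard's documents page by page.
shard_docs = list(range(10))
pages = list(scroll_pages(shard_docs, 4))
print(pages)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```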