
ApsaraDB for SelectDB:Elasticsearch data source

Last Updated: Mar 28, 2026

ApsaraDB for SelectDB supports querying Elasticsearch data directly through an Elasticsearch catalog, enabling federated OLAP analysis without moving data. The catalog automatically maps Elasticsearch index metadata, supporting multi-index joins within Elasticsearch and cross-system joins between SelectDB and Elasticsearch.

Supported versions: Elasticsearch 5.X and later.

Prerequisites

Before you begin, ensure that you have a SelectDB instance and an Elasticsearch cluster (5.X or later) that the instance can reach over the network.

Create an Elasticsearch catalog

CREATE CATALOG test_es PROPERTIES (
    "type"="es",
    "hosts"="http://127.0.0.1:9200",
    "user"="test_user",
    "password"="test_passwd",
    "nodes_discovery"="false"
);

Because Elasticsearch has no database concept, SelectDB automatically creates a single database named default_db under the catalog. After switching to the catalog with SWITCH, SelectDB enters default_db automatically — no USE default_db statement needed.

Parameters

Parameter | Required | Default | Description
hosts | Yes | None | URL of the Elasticsearch data source. Accepts one or more URLs, or the URL of a Server Load Balancer (SLB) instance fronting the cluster.
user | No | None | Account for accessing the Elasticsearch data source.
password | No | None | Password for the account.
enable_docvalue_scan | No | true | Enable column-oriented storage (doc_values) for querying field values.
enable_keyword_sniff | No | true | Detect TEXT fields and query via their corresponding KEYWORD sub-fields. If set to false, queries match against tokenized terms from the TEXT field.
nodes_discovery | No | true | Enable automatic node discovery. Set to false when using Alibaba Cloud Elasticsearch, which routes traffic through an SLB instance and does not expose nodes directly.
ssl | No | false | Enable HTTPS access. SelectDB trusts all HTTPS requests from frontend (FE) and backend (BE) nodes regardless of SSL certificate validity.
mapping_es_id | No | false | Map the _id metadata field from the Elasticsearch index.
like_push_down | No | true | Convert LIKE conditions to Elasticsearch wildcards and push them down to Elasticsearch. This increases CPU usage on the Elasticsearch side.
include_hidden_index | No | false | Include hidden indexes.
Note
  • Only HTTP basic authentication is supported. The account must have read access to /_cluster/state/ and _nodes/http, plus read permissions on the indexes.

  • If HTTPS is not enabled, the account and password are optional.

  • For Elasticsearch 5.x or 6.x indexes with multiple types, SelectDB reads data from the first type only.
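Because only HTTP basic authentication is supported, the user and password properties end up as a standard Authorization header on each request. A minimal sketch of how that header is formed (basic_auth_header is a hypothetical helper for illustration, not part of SelectDB):

```python
import base64

def basic_auth_header(user: str, password: str) -> str:
    """Build the Authorization header value for HTTP basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"

# Credentials from the CREATE CATALOG example above.
print(basic_auth_header("test_user", "test_passwd"))
```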

Query data

After creating the catalog, query Elasticsearch tables the same way you query internal SelectDB tables. All three of the following approaches are equivalent:

-- Switch to the catalog, then query
SWITCH test_es;
SELECT * FROM es_table LIMIT 10;

-- Use the fully qualified database path
USE test_es.default_db;
SELECT * FROM es_table LIMIT 10;

-- Use the fully qualified table name directly
SELECT * FROM test_es.default_db.es_table LIMIT 10;
Rollup, pre-aggregation, and materialized views are not available for external Elasticsearch tables.

Basic queries

SELECT * FROM es_table WHERE k1 > 1000 AND k3 = 'term' OR k4 LIKE 'fu*z_';
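When like_push_down is enabled, a LIKE condition like the one above is turned into an Elasticsearch wildcard query. A sketch of that translation, assuming the usual SQL-to-wildcard substitution (% becomes *, _ becomes ?) and ignoring escape sequences; like_to_wildcard is a hypothetical helper:

```python
import json

def like_to_wildcard(field: str, pattern: str) -> str:
    """Translate a SQL LIKE pattern into an Elasticsearch wildcard query.

    SQL % (any run of characters) maps to * and SQL _ (any single
    character) maps to ?; other characters pass through unchanged.
    """
    wildcard = pattern.replace("%", "*").replace("_", "?")
    return json.dumps({"wildcard": {field: wildcard}})

print(like_to_wildcard("k4", "fu%z_"))  # {"wildcard": {"k4": "fu*z?"}}
```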

Extended esquery

Use esquery(field, QueryDSL) to push down queries that cannot be expressed in SQL — such as match_phrase and geo_shape — directly to Elasticsearch. The field parameter associates the query with an index. The QueryDSL parameter is a JSON object with exactly one root key.

match_phrase query:

SELECT * FROM es_table WHERE esquery(k4, '{"match_phrase": {"k4": "selectdb on es"}}');

geo_shape query:

SELECT * FROM es_table WHERE esquery(k4, '{"geo_shape": {"location": {"shape": {"type": "envelope", "coordinates": [[13, 53], [14, 52]]}, "relation": "within"}}}');

bool query:

SELECT * FROM es_table WHERE esquery(k4, '{"bool": {"must": [{"terms": {"k1": [11, 12]}}, {"terms": {"k2": [100]}}]}}');
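Hand-writing Query DSL strings inside SQL literals is error-prone; building the JSON programmatically also makes the one-root-key rule easy to enforce. A sketch (make_esquery_dsl is a hypothetical helper, not a SelectDB function):

```python
import json

def make_esquery_dsl(query: dict) -> str:
    """Serialize a Query DSL object for esquery(), enforcing the
    rule that the JSON object has exactly one root key."""
    if len(query) != 1:
        raise ValueError("esquery() Query DSL must have exactly one root key")
    return json.dumps(query)

# The bool example from above, built programmatically.
dsl = make_esquery_dsl({
    "bool": {
        "must": [
            {"terms": {"k1": [11, 12]}},
            {"terms": {"k2": [100]}},
        ]
    }
})
print(dsl)
```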

Column type mappings

Elasticsearch type | SelectDB type | Notes
NULL | NULL |
BOOLEAN | BOOLEAN |
BYTE | TINYINT |
SHORT | SMALLINT |
INTEGER | INT |
LONG | BIGINT |
UNSIGNED_LONG | LARGEINT |
FLOAT | FLOAT |
HALF_FLOAT | FLOAT |
DOUBLE | DOUBLE |
SCALED_FLOAT | DOUBLE |
DATE | DATE | Supported formats: default, yyyy-MM-dd HH:mm:ss, yyyy-MM-dd, epoch_millis
KEYWORD | STRING |
TEXT | STRING |
IP | STRING |
NESTED | STRING |
OBJECT | STRING |
OTHER | UNSUPPORTED |
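The mapping table above is a straight lookup; expressed as code, assuming lowercase Elasticsearch type names as they appear in index mappings:

```python
# Column type mapping from the table above, as a lookup dict.
ES_TO_SELECTDB = {
    "null": "NULL", "boolean": "BOOLEAN", "byte": "TINYINT",
    "short": "SMALLINT", "integer": "INT", "long": "BIGINT",
    "unsigned_long": "LARGEINT", "float": "FLOAT", "half_float": "FLOAT",
    "double": "DOUBLE", "scaled_float": "DOUBLE", "date": "DATE",
    "keyword": "STRING", "text": "STRING", "ip": "STRING",
    "nested": "STRING", "object": "STRING",
}

def map_es_type(es_type: str) -> str:
    """Return the SelectDB type for an Elasticsearch field type;
    anything not in the table is UNSUPPORTED."""
    return ES_TO_SELECTDB.get(es_type.lower(), "UNSUPPORTED")

print(map_es_type("half_float"))  # FLOAT
```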

ARRAY type

Elasticsearch has no explicit ARRAY type, but a field can contain zero or more values. To declare a field as an array in SelectDB, add an array_fields entry under _meta.selectdb in the index mapping.

Example data structure for the doc index:

{
  "array_int_field": [1, 2, 3, 4],
  "array_string_field": ["selectdb", "is", "the", "best"],
  "id_field": "id-xxx-xxx",
  "timestamp_field": "2022-11-12T12:08:56Z",
  "array_object_field": [{"name": "xxx", "age": 18}]
}

Update the mapping to declare array fields:

# Elasticsearch 7.x and later
curl -X PUT "localhost:9200/doc/_mapping?pretty" -H 'Content-Type: application/json' -d '
{
    "_meta": {
        "selectdb": {
            "array_fields": [
                "array_int_field",
                "array_string_field",
                "array_object_field"
            ]
        }
    }
}'

# Elasticsearch 6.x and earlier
curl -X PUT "localhost:9200/doc/_mapping?pretty" -H 'Content-Type: application/json' -d '
{
    "_doc": {
        "_meta": {
            "selectdb": {
                "array_fields": [
                    "array_int_field",
                    "array_string_field",
                    "array_object_field"
                ]
            }
        }
    }
}'
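To verify what the catalog will see, the declared array fields can be read back out of the mapping response. A sketch of that extraction, covering both mapping layouts shown above (array_fields_from_mapping is a hypothetical helper):

```python
def array_fields_from_mapping(mapping, doc_type=None):
    """Read the _meta.selectdb.array_fields list out of an index mapping.

    For Elasticsearch 6.x and earlier, _meta sits under the type name
    (doc_type, e.g. "_doc"); for 7.x and later it sits at the top level.
    """
    root = mapping[doc_type] if doc_type else mapping
    return root.get("_meta", {}).get("selectdb", {}).get("array_fields", [])

# 7.x-style mapping, mirroring the curl example above.
mapping_7x = {
    "_meta": {"selectdb": {"array_fields": ["array_int_field", "array_string_field"]}}
}
print(array_fields_from_mapping(mapping_7x))
```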

Best practices

Filter condition pushdown

SelectDB pushes filter conditions down to Elasticsearch so only matching data is returned, reducing CPU, memory, and I/O load on both systems. The following table shows how SQL operators map to Elasticsearch Query DSL.

SQL syntax | Elasticsearch query
= | term query
IN | terms query
>, <, >=, <= | range query
AND | bool.filter
OR | bool.should
NOT | bool.must_not
NOT IN | bool.must_not + terms query
IS NOT NULL | exists query
IS NULL | bool.must_not + exists query
esquery | Native Query DSL
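The table above can be sketched as a translation function for the simplest single-predicate cases (push_down is a hypothetical helper, not SelectDB's actual planner):

```python
import json

def push_down(field, op, value=None):
    """Translate one SQL predicate into Elasticsearch Query DSL,
    following the operator mapping table above."""
    if op == "=":
        return {"term": {field: value}}
    if op == "in":
        return {"terms": {field: value}}
    if op in (">", "<", ">=", "<="):
        range_op = {">": "gt", "<": "lt", ">=": "gte", "<=": "lte"}[op]
        return {"range": {field: {range_op: value}}}
    if op == "is_null":
        return {"bool": {"must_not": {"exists": {"field": field}}}}
    if op == "is_not_null":
        return {"exists": {"field": field}}
    raise ValueError(f"unsupported operator: {op}")

print(json.dumps(push_down("k1", ">", 1000)))  # {"range": {"k1": {"gt": 1000}}}
```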

Enable columnar scan to accelerate queries

Set enable_docvalue_scan to true to read field values from column-oriented storage (doc_values) instead of the _source field. When only a few columns are queried, columnar scan can be more than ten times faster than reading from _source.

SelectDB applies two principles when columnar scan is enabled:

  • Best effort: If all queried fields have doc_value enabled, SelectDB reads entirely from column-oriented storage.

  • Auto-downgrade: If any queried field lacks doc_value, SelectDB falls back to reading from _source for all fields.

Important
  • TEXT fields cannot use column-oriented storage. If a TEXT field is in the query, SelectDB reads from _source.

  • When querying 25 or more fields, the performance difference between columnar scan and _source becomes negligible.
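The best-effort and auto-downgrade principles amount to a simple per-query decision. A sketch, assuming field metadata of the form {"type": ..., "doc_value": bool} (choose_scan_mode is a hypothetical helper):

```python
def choose_scan_mode(queried_fields, field_meta):
    """Decide between doc_values (columnar) scan and _source, following
    the best-effort / auto-downgrade principles above."""
    for f in queried_fields:
        meta = field_meta[f]
        # TEXT fields have no doc_values; any field without doc_value
        # forces a fallback to _source for all fields.
        if meta["type"] == "text" or not meta.get("doc_value", False):
            return "_source"
    return "doc_values"

meta = {
    "k1": {"type": "long", "doc_value": True},
    "k4": {"type": "text", "doc_value": False},
}
print(choose_scan_mode(["k1"], meta))        # doc_values
print(choose_scan_mode(["k1", "k4"], meta))  # _source
```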

Detect KEYWORD fields

Set enable_keyword_sniff to true to make SelectDB automatically use KEYWORD sub-fields for equality queries on STRING fields.

When Elasticsearch auto-creates an index, STRING fields get both a TEXT field and a KEYWORD sub-field:

"k4": {
    "type": "text",
    "fields": {
        "keyword": {
            "type": "keyword",
            "ignore_above": 256
        }
    }
}

Without keyword sniffing, the SQL condition k4 = "SelectDB On ES" generates this Query DSL:

"term": {"k4": "SelectDB On ES"}

Because k4 is a TEXT field, its value is tokenized into selectdb, on, and es — none of which match the full phrase, so no results are returned.

With enable_keyword_sniff set to true, SelectDB automatically rewrites the condition to target the KEYWORD sub-field:

"term": {"k4.keyword": "SelectDB On ES"}

The KEYWORD field stores the original unmodified value, so the exact phrase matches correctly.
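The rewrite above can be sketched as a mapping-driven rule: if the field is TEXT and has a KEYWORD sub-field, target the sub-field instead (sniff_keyword is a hypothetical helper illustrating enable_keyword_sniff, not SelectDB's implementation):

```python
import json

def sniff_keyword(field, value, mapping):
    """Rewrite an equality condition to target the KEYWORD sub-field
    when the index mapping declares one."""
    props = mapping.get(field, {})
    if props.get("type") == "text":
        for sub_name, sub in props.get("fields", {}).items():
            if sub.get("type") == "keyword":
                return {"term": {f"{field}.{sub_name}": value}}
    return {"term": {field: value}}

# The auto-created mapping shown above.
mapping = {
    "k4": {
        "type": "text",
        "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
    }
}
print(json.dumps(sniff_keyword("k4", "SelectDB On ES", mapping)))
# {"term": {"k4.keyword": "SelectDB On ES"}}
```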

Enable node discovery

Set nodes_discovery to true to let SelectDB discover all Elasticsearch data nodes with allocated shards.

Alibaba Cloud Elasticsearch routes traffic through an SLB instance, which prevents direct access to individual cluster nodes. Always set nodes_discovery to false when connecting to Alibaba Cloud Elasticsearch.

Enable HTTPS

Set ssl to true to connect to the Elasticsearch cluster over HTTPS. SelectDB trusts all HTTPS requests from FE and BE nodes regardless of SSL certificate validity.

Handle time fields

The guidance in this section applies to external Elasticsearch tables only. For Elasticsearch catalogs, time fields are automatically mapped to the DATE or DATETIME type.

Configure the date field format in Elasticsearch to support a broad range of inputs:

"dt": {
    "type": "date",
    "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
}

In SelectDB, define the field as date, datetime, or varchar. All of the following filter conditions push down correctly:

SELECT * FROM doe WHERE k2 > '2020-06-21';
SELECT * FROM doe WHERE k2 < '2020-06-21 12:00:00';
SELECT * FROM doe WHERE k2 < 1593497011;
SELECT * FROM doe WHERE k2 < now();
SELECT * FROM doe WHERE k2 < date_format(now(), '%Y-%m-%d');
Important
  • If no format is set in Elasticsearch, the default is strict_date_optional_time||epoch_millis.

  • When importing timestamp values into a DATE field in Elasticsearch, the timestamp must be in milliseconds. Elasticsearch requires millisecond precision for internal processing; other units cause errors.
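Since Elasticsearch expects millisecond-precision timestamps, a seconds-based epoch value such as the 1593497011 used above must be scaled before import. A sketch (to_epoch_millis is a hypothetical helper):

```python
from datetime import datetime, timezone

def to_epoch_millis(dt: datetime) -> int:
    """Convert a datetime to the millisecond epoch value that
    Elasticsearch DATE fields require."""
    return int(dt.timestamp() * 1000)

# 1593497011 seconds corresponds to 2020-06-30 06:03:31 UTC.
dt = datetime(2020, 6, 30, 6, 3, 31, tzinfo=timezone.utc)
print(to_epoch_millis(dt))  # 1593497011000
```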

Query the _id field

Elasticsearch automatically assigns a globally unique _id to each document when none is specified at import time. To query _id from an external table, declare it as a VARCHAR column:

CREATE EXTERNAL TABLE `doe` (
    `_id`  varchar COMMENT "",
    `city` varchar COMMENT ""
) ENGINE=ELASTICSEARCH
PROPERTIES (
    "hosts"    = "http://127.0.0.1:8200",
    "user"     = "root",
    "password" = "root",
    "index"    = "doe"
);

To query _id from an Elasticsearch catalog, set mapping_es_id to true.

Important
  • Filter the _id field using only the = or IN operator.

  • The _id field must be of type VARCHAR.

Limitations

  • Rollup, pre-aggregation, materialized views: Not supported for external Elasticsearch tables.

How it works

+----------------------------------------------+
|                                              |
| SelectDB   +------------------+              |
|            |       FE         +--------------+-------+
|            |                  |  Request Shard Location
|            +--+-------------+-+              |       |
|               ^             ^                |       |
|               |             |                |       |
|  +-------------------+ +------------------+  |       |
|  |            |      | |    |             |  |       |
|  | +----------+----+ | | +--+-----------+ |  |       |
|  | |      BE       | | | |      BE      | |  |       |
|  | +---------------+ | | +--------------+ |  |       |
+----------------------------------------------+       |
   |        |          | |        |         |          |
   |        |          | |        |         |          |
   |    HTTP SCROLL    | |    HTTP SCROLL   |          |
+-----------+---------------------+------------+       |
|  |        v          | |        v         |  |       |
|  | +------+--------+ | | +------+-------+ |  |       |
|  | |               | | | |              | |  |       |
|  | |   DataNode    | | | |   DataNode   +<-----------+
|  | |               | | | |              | |  |       |
|  | |               +<--------------------------------+
|  | +---------------+ | | +--------------+ |  |       |
|  +-------------------+ +------------------+  |       |
|   Same Physical Node                         |       |
|                                              |       |
|           +-----------------------+          |       |
|           |                       |          |       |
|           |      MasterNode       +<-----------------+
| ES        |                       |          |
|           +-----------------------+          |
+----------------------------------------------+

The query flow works as follows:

  1. The FE sends a request to the configured host to get the HTTP port information for all nodes and the shard distribution of the index. If the request fails, the FE traverses the host list sequentially until one succeeds.

  2. The FE generates a query execution plan based on the node and index metadata, then sends it to the relevant BEs.

  3. Each BE concurrently fetches data from its assigned Elasticsearch index shards in streaming mode using the HTTP Scroll API — reading from either _source (row-oriented) or doc_values (column-oriented), depending on the query.

  4. SelectDB computes the final results and returns them.
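Step 3's streaming reads follow the Scroll API's paging pattern: each call returns the next batch until the shard is exhausted. A simplified simulation of that batching over an in-memory list (scroll_pages is a hypothetical sketch, not the BE's actual HTTP client):

```python
def scroll_pages(docs, batch_size):
    """Yield successive fixed-size pages, mimicking how a BE consumes a
    shard batch by batch via the HTTP Scroll API."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size]

# Each BE streams its assigned shard's documents page by page.
shard_docs = list(range(10))
pages = list(scroll_pages(shard_docs, 4))
print(pages)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```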