User guide for search indexes on the JSON data type in Lindorm

Lindorm supports creating search indexes on JSON columns in wide tables. The query functions available to you depend on the structure of your JSON data. This guide covers all three supported JSON patterns — array of basic elements, JSON object, and array of objects — and shows you which index type and query functions apply to each.

Prerequisites

Before you begin, make sure that:

The LindormTable engine is version 2.8.5 or later
The Lindorm LTS engine is version 3.8.13.3 or later

To check or upgrade your engine version, see Minor version update.

Usage notes

When relying on JSON automatic type inference, if a JSON field's data type is inconsistent across writes (for example, field a is first written as a string and later as a number), the type inferred from the first write is used. Subsequent values that do not conform are skipped.
If a JSON document contains many internal fields, automatic inference may hit the search index column limit and cause index sync tasks to fail. We recommend that you index only the JSON fields you actually query (see the example below for details). The default column limit is 1000. On LindormTable 2.8.6.1 and later, you can adjust the limit with the following SQL:
```
ALTER INDEX idx ON search_table SET SEARCH_INDEX_MAX_COLUMN_COUNT='2000';
```

Choose the right index type

Before creating a search index on a JSON column, identify which of the three patterns your data follows. Mixing patterns in a single column causes invalid data to be silently skipped during indexing.

JSON pattern	Example	Index type	Supported query functions
Array of basic elements	`[1, 2, 3]` or `["a", "b"]`	`mapping` with scalar type (`keyword`, `integer`, etc.)	`JSON_CONTAINS`, `JSON_CONTAINS_ANY`
JSON object	`{"name": "Alice", "age": 13}`	`type=jsonobject` or `mapping` with `"type": "object"`	`JSON_EXTRACT`, `JSON_EXTRACT_STRING`, `JSON_CONTAINS`, `JSON_CONTAINS_ANY` (on nested arrays), `MATCH`...`AGAINST`
Array of objects	`[{"name": "Alice"}, {"name": "Bob"}]`	`type=jsonarray` or `mapping` with `"type": "nested"`	`SEARCH_QUERY` with Elasticsearch DSL only

Important

Do not mix the three patterns in a single column. Invalid data may be skipped and not indexed.

Array of basic elements

A basic element array stores a flat list of scalar values at the top level, such as ["101", "102", "109"] (string array) or [1, 2, 3] (integer array).

Create a wide table

CREATE TABLE test_json_array(id VARCHAR, c1 JSON, c2 JSON, PRIMARY KEY (id));

Insert data

UPSERT INTO test_json_array(id, c1, c2) VALUES ('1001', '["101", "102", "109"]', '[1, 2, 3]');
UPSERT INTO test_json_array(id, c1, c2) VALUES ('1002', '["999", "888", "777"]', '[1, 2, 3, 4, 5]');

Create a search index

Specify the element type in the mapping parameter. Use keyword for string arrays and integer for integer arrays.

CREATE INDEX idx USING SEARCH ON test_json_array(
  c1(mapping='{
    "type": "keyword"
  }'),
  c2(mapping='{
    "type": "integer"
  }')
);

Query

Two functions are available:

JSON_CONTAINS: returns rows where all specified values are present in the array
JSON_CONTAINS_ANY: returns rows where at least one of the specified values is present

`JSON_CONTAINS` — rows that contain all specified values

-- No row contains both "101" and "999", so the result is empty
Lindorm> SELECT * FROM test_json_array WHERE JSON_CONTAINS(c1, '["101", "999"]');
Empty set (0.11 sec)

-- id=1001 contains both "101" and "102"
Lindorm> SELECT * FROM test_json_array WHERE JSON_CONTAINS(c1, '["101", "102"]');
+------+-----------------------+-----------+
| id   | c1                    | c2        |
+------+-----------------------+-----------+
| 1001 | ["101", "102", "109"] | [1, 2, 3] |
+------+-----------------------+-----------+
1 row in set (0.01 sec)

`JSON_CONTAINS_ANY` — rows that contain at least one of the specified values

-- id=1001 has "101" and id=1002 has "999", so both rows are returned
Lindorm> SELECT * FROM test_json_array WHERE JSON_CONTAINS_ANY(c1, '["101", "999"]');
+------+-----------------------+---------------------+
| id   | c1                    | c2                  |
+------+-----------------------+---------------------+
| 1002 | ["999", "888", "777"] | [1, 2, 3, 5, 9, 10] |
| 1001 | ["101", "102", "109"] | [1, 2, 3]           |
+------+-----------------------+---------------------+
2 rows in set (0.01 sec)

JSON object

A JSON object stores a structured document at the top level, such as {"name": "Alice", "age": 13, "hobbies": ["read", "badminton"]}. Fields can be scalar values, arrays, or nested objects.

Create a wide table

CREATE TABLE test_json_object(id VARCHAR, user_info JSON, PRIMARY KEY (id));

Insert data

UPSERT INTO test_json_object(id, user_info) VALUES ('1001', '{"name": "Alice", "age": 13, "address": "Hangzhou, Zhejiang Province", "hobbies": ["play games", "read", "badminton"]}');
UPSERT INTO test_json_object(id, user_info) VALUES ('1002', '{"name": "Bob", "age": 9, "address": "Ningbo, Zhejiang Province", "hobbies": ["play games"]}');
UPSERT INTO test_json_object(id, user_info) VALUES ('1003', '{"name": "John", "age": 21, "address": "Shenzhen, Guangdong Province", "hobbies": ["read", "badminton", "food"]}');

Create a search index

Two approaches are available. Use predefined structure whenever possible.

Automatic type inference

Specify type=jsonobject to let the system infer field types from the first value written. For example, user_info.name is inferred as string and user_info.age as numeric. Data written later that doesn't match the inferred types is skipped to preserve index consistency.

CREATE INDEX idx USING SEARCH ON test_json_object(user_info(type=jsonobject));

Predefine the field structure (recommended)

Automatic type inference has two limitations:

To run tokenized (full-text) queries on a field, the field type must be explicitly set to text. The default inferred type is keyword, which doesn't support tokenization.
If a JSON object contains a nested array of objects, the field type must be explicitly set to nested.

Use mapping to define the internal field structure. Field types follow Elasticsearch syntax.

-- Explicitly define field types to enable tokenized queries and control indexing behavior
CREATE INDEX idx USING SEARCH ON test_json_object(
  user_info(mapping='{
    "type": "object",
    "properties": {
      "name": {
        "type": "keyword"
      },
      "age": {
        "type": "integer"
      },
      "address": {
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "hobbies": {
        "type": "keyword"
      }
    }
  }')
);

To index only a subset of fields and ignore the rest, add "dynamic": "false":

CREATE INDEX idx USING SEARCH ON test_json_object(
  user_info(mapping='{
    "type": "object",
    "dynamic": "false",
    "properties": {
      "name": {
        "type": "keyword"
      },
      "age": {
        "type": "integer"
      },
      "address": {
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "hobbies": {
        "type": "keyword"
      }
    }
  }')
);

Query

Three query methods are supported for JSON objects:

JSON_EXTRACT / JSON_EXTRACT_STRING — match a single scalar field by path
JSON_CONTAINS / JSON_CONTAINS_ANY — match elements in an array field within the object
MATCH...AGAINST combined with JSON_EXTRACT — full-text (tokenized) search on a text field

Matching a complete sub-object is not supported. For example, WHERE JSON_EXTRACT(json_col, '$.user') = '{"name": "Alice", "age": 12}' is not valid. JSON_EXTRACT in a search index matches only single scalar elements.

`JSON_EXTRACT` — find rows by a scalar field value

-- Find users whose name is Alice
Lindorm> SELECT * FROM test_json_object WHERE JSON_EXTRACT_STRING(user_info, '$.name')='Alice';
+------+------------------------------------------------------------------------------------------------+
| id   | user_info                                                                                      |
+------+------------------------------------------------------------------------------------------------+
| 1001 | {"name": "Alice", "age": 13, "address": "Hangzhou, Zhejiang Province", "hobbies": ["play games", "read", "badminton"]} |
+------+------------------------------------------------------------------------------------------------+
1 row in set (0.02 sec)

-- Find users whose age is 21
Lindorm> SELECT * FROM test_json_object WHERE JSON_EXTRACT(user_info, '$.age')=21;
+------+--------------------------------------------------------------------------------------------------------+
| id   | user_info                                                                                              |
+------+--------------------------------------------------------------------------------------------------------+
| 1003 | {"name": "John", "age": 21, "address": "Shenzhen, Guangdong Province", "hobbies": ["read", "badminton", "food"]}       |
+------+--------------------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)

`MATCH`...`AGAINST` with `JSON_EXTRACT` — full-text search on a text field

The address field must be predefined as type: text. The default inferred type (keyword) does not support tokenized queries.

-- Find users whose address contains "Zhejiang" (tokenized match across multiple words)
Lindorm> SELECT * FROM test_json_object WHERE MATCH(JSON_EXTRACT(user_info, '$.address')) AGAINST ('Zhejiang');
+------+---------------------------------------------------------------+
| id   | user_info                                                     |
+------+---------------------------------------------------------------+
| 1001 | {"name": "Alice", "age": 13, "address": "Hangzhou, Zhejiang", "hobbies": ["play games", "read", "badminton"]}    |
| 1002 | {"name": "Bob", "age": 9, "address": "Ningbo, Zhejiang", "hobbies": ["play games"]}                              |
+------+---------------------------------------------------------------+
2 rows in set (0.03 sec)

-- Find users whose address contains "Hangzhou"
Lindorm> SELECT * FROM test_json_object WHERE MATCH(JSON_EXTRACT(user_info, '$.address')) AGAINST ('Hangzhou');
+------+---------------------------------------------------------------+
| id   | user_info                                                     |
+------+---------------------------------------------------------------+
| 1001 | {"name": "Alice", "age": 13, "address": "Hangzhou, Zhejiang", "hobbies": ["play games", "read", "badminton"]}    |
+------+---------------------------------------------------------------+
1 row in set (0.01 sec)

`JSON_CONTAINS` — find rows where an array field contains specific values

-- Find users who have "read" in their hobbies list
Lindorm> SELECT * FROM test_json_object WHERE JSON_CONTAINS(user_info, '["read"]', '$.hobbies');
+------+---------------------------------------------------------------------------------------------------------------+
| id   | user_info                                                                                                     |
+------+---------------------------------------------------------------------------------------------------------------+
| 1003 | {"name": "John", "age": 21, "address": "Shenzhen, Guangdong Province", "hobbies": ["read", "badminton", "food"]}              |
| 1001 | {"name": "Alice", "age": 13, "address": "Hangzhou, Zhejiang Province", "hobbies": ["play games", "read", "badminton"]}       |
+------+---------------------------------------------------------------------------------------------------------------+
2 rows in set (0.04 sec)

Array of objects

An array of objects stores a top-level array where each element is a JSON object, such as [{"name": "Alice", "age": 12}, {"name": "Bob", "age": 20}].

Why array of objects requires a different index type

Without a specialized index type, a database flattens an array of objects into independent multi-value fields, losing the associations between fields that belong to the same object. For example, given the following two rows:

id	user
1002	`[{"name": "Alice", "age": 9}, {"name": "Bob", "age": 20}]`

A naive (non-nested) index would store the name and age values as flat lists: name = [Alice, Bob] and age = [9, 20]. A query for "Alice with age >= 10" would incorrectly match this row because Alice and age=20 happen to coexist in the flattened fields, even though Alice's actual age is 9.

Lindorm uses type: nested internally for array of objects indexes. Each object in the array is indexed as an isolated unit, so field associations within the same object are preserved. As a result, only SEARCH_QUERY with embedded Elasticsearch DSL is supported for querying this pattern — the DSL nested query expresses these intra-object constraints correctly.

Create a wide table

CREATE TABLE test_json_object_array(id VARCHAR, user JSON, primary key(id));

Insert data

UPSERT INTO test_json_object_array(id, user) VALUES ('1001', '[{"name": "Alice", "age": 12}]');
UPSERT INTO test_json_object_array(id, user) VALUES ('1002', '[{"name": "Alice", "age": 9},{"name": "Bob", "age": 20}]');
UPSERT INTO test_json_object_array(id, user) VALUES ('1003', '[{"name": "Alice", "age": 12},{"name": "Bob", "age": 20}]');

Create a search index

Two approaches are available. Use predefined structure whenever possible.

Automatic type inference

Specify type=jsonarray to let the system infer field types from the first value written for each object in the array. Data written later that doesn't match the inferred types is skipped.

CREATE INDEX idx USING SEARCH ON test_json_object_array(user(type=jsonarray));

Predefine the field structure (recommended)

Automatic type inference has the same limitations as for JSON objects: tokenized queries require explicit text type, and nested arrays of objects within a field require explicit nested type.

Use mapping to define the internal field structure. The top-level type must be nested. Field types follow Elasticsearch syntax.

CREATE INDEX idx USING SEARCH ON test_json_object_array(
  user(mapping='{
    "type": "nested",
    "properties": {
      "name": {
        "type": "keyword"
      },
      "age": {
        "type": "integer"
      }
    }
  }')
);

To index only a subset of fields and ignore the rest, add "dynamic": "false":

CREATE INDEX idx USING SEARCH ON test_json_object_array(
  user(mapping='{
    "type": "nested",
    "dynamic": "false",
    "properties": {
      "name": {
        "type": "keyword"
      },
      "age": {
        "type": "integer"
      }
    }
  }')
);

Query

Query arrays of objects using the SEARCH_QUERY function with embedded Elasticsearch DSL. Use a nested query with the path parameter set to the column name.

-- Find rows where a single user object has name "Alice" AND age >= 10
-- (preserves field associations within each object)
Lindorm> SELECT * FROM test_json_object_array WHERE SEARCH_QUERY('
  {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.name": "Alice" } },
            { "range": { "user.age": {"gte": 10} } }
          ]
        }
      }
    }
  }
');
+------+-----------------------------------------------------------+
| id   | user                                                      |
+------+-----------------------------------------------------------+
| 1003 | [{"name": "Alice", "age": 12},{"name": "Bob", "age": 20}] |
| 1001 | [{"name": "Alice", "age": 12}]                            |
+------+-----------------------------------------------------------+

id=1002 is excluded because its "Alice" entry has age=9, which is less than 10. The nested index preserves the association between fields within the same object, so Alice and age=9 are correctly evaluated together rather than mixed with Bob's age of 20.