
ApsaraDB for ClickHouse: New features of ApsaraDB for ClickHouse of version 22.8

Last Updated: Jan 26, 2024

This topic describes the new features of ApsaraDB for ClickHouse of version 22.8.

Extended value range of the date and time types

  • The value range of Date32 and DateTime64 is extended to support dates from the year 1900 to 2299. In earlier versions, the supported time range was only from the year 1925 to 2283.

  • DateTime64 values support a precision of up to eight decimal places within the extended range. If more than eight digits are placed after the decimal point (for example, nanosecond precision with nine digits), the maximum supported value is 2262-04-11.
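A minimal sketch of the extended range (the table name and sample values are hypothetical):

```sql
-- Hypothetical table demonstrating the extended 1900-2299 range
CREATE TABLE date_range_demo
(
    d  Date32,
    dt DateTime64(3)  -- millisecond precision; up to 8 decimal places keeps the full range
)
ENGINE = MergeTree
ORDER BY d;

-- Both boundary dates are representable; they were out of range in earlier versions
INSERT INTO date_range_demo VALUES
    ('1900-01-01', '1900-01-01 00:00:00.000'),
    ('2299-12-31', '2299-12-31 23:59:59.999');
```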

Lightweight delete

Version 22.8 supports the standard DELETE FROM SQL syntax and a lightweight delete implementation for tables of the MergeTree family. In earlier versions of ApsaraDB for ClickHouse, the DELETE operation is a mutation: all DELETE events are recorded in separate files and then executed asynchronously at irregular intervals based on kernel scheduling, and the affected records must be located before they can be physically deleted. This makes the DELETE operation heavyweight.

The DELETE syntax used in earlier versions:

ALTER TABLE [db.]table [ON CLUSTER cluster] DELETE WHERE filter_expr

The DELETE syntax used in version 22.8:

DELETE FROM [db.]table [WHERE expr]

Sample statement:

DELETE FROM hits WHERE Title LIKE '%hello%';

For more information about LIKE expressions, see like.

Compared with conventional mutation-based deletes, the standard SQL syntax simplifies development. According to the official performance test, for deletes on a single table that contains 100 million records across 110 data partitions, lightweight deletes perform about 40 times better than deletes in earlier versions of ApsaraDB for ClickHouse.

However, lightweight deletes triggered by the DELETE syntax are still executed asynchronously by default. The implementation reuses the column update feature of ApsaraDB for ClickHouse to mark deleted rows: a virtual system column named _row_exists is added to data partitions, and an update equivalent to UPDATE _row_exists = 0 WHERE predicate marks the rows as deleted.
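Because the delete is executed asynchronously, the mark-as-deleted step surfaces as a mutation. A sketch, assuming the hits table from the sample above:

```sql
-- Lightweight delete: matching rows are hidden from queries immediately
DELETE FROM hits WHERE Title LIKE '%hello%';

-- The underlying mark step is visible as a mutation on the table
SELECT command, is_done
FROM system.mutations
WHERE table = 'hits';
```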

Support for JSON types and dynamic subcolumns

Read and write of JSON data in earlier versions

In versions earlier than V22.8, ApsaraDB for ClickHouse stores JSON objects as the String type. JSON objects are text with complex structures, so special string parsing functions must be used to extract their internal fields. Because a String column stores each JSON object as a whole, querying a single internal property requires scanning all the other fields, which results in low query efficiency. Moreover, nested JSON data must be parsed and converted level by level. Sample statements:

CREATE TABLE games (data String)
ENGINE = MergeTree ORDER BY tuple();

SELECT JSONExtractString(data, 'teams', 1, 'name')
FROM games;

Read and write of JSON data in version 22.8

Version 22.8 supports an independent JSON object type. You only need to specify the JSON type in DDL statements. The engine dynamically matches each written JSON object value with a suitable data type and stores the data of each property in a separate column. If a newly written value does not match the data type of its column, the engine automatically widens the column type so that it is compatible with both the old and new values. The engine also adds new columns to store the data of new properties. Sample statements:

DROP TABLE IF EXISTS github_JSON;

SET allow_experimental_object_type=1;

CREATE TABLE github_JSON (event JSON) ENGINE = MergeTree ORDER BY tuple();
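Assuming the github_JSON table above, a minimal sketch of writing a JSON value and reading its properties (the sample payload is hypothetical):

```sql
-- The engine infers a subcolumn and a type for each property on write
INSERT INTO github_JSON VALUES ('{"type": "PushEvent", "actor": {"login": "alice"}, "repo": {"name": "demo/repo"}}');

-- Nested properties are read directly, without string parsing functions
SELECT event.type, event.actor.login FROM github_JSON;
```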

Dynamic subcolumns greatly improve the analysis efficiency and extensibility of unstructured data.

In versions earlier than V22.8, before you write semi-structured JSON data from Object Storage Service (OSS) to an ApsaraDB for ClickHouse cluster, you must create a structured destination table on the cluster and define a data type for each field so that each JSON property is stored in its own column and can be analyzed efficiently. If the structure of the JSON objects changes, you must modify the schema of the destination table.

In version 22.8, dynamic subcolumns are supported. You only need to create a JSON field in the destination table and then import the semi-structured data in batches, regardless of how deeply the JSON data is nested or which internal data types it uses. In addition, when properties are added to the JSON objects due to business changes, you do not need to modify the schema of the destination table: the kernel dynamically adds subcolumns to store the data, which greatly improves flexibility. Sample statement:

INSERT INTO github_JSON SELECT * FROM OSS('oss-endpoint',
'JSONAsObject', 'event JSON');

You do not need to parse strings or convert data types to read data. You can directly read the properties of nested JSON objects. Dynamic subcolumns greatly improve the efficiency of JSON data storage and query. According to the official test result, the string query efficiency for JSON data in version 22.8 is about 40 times higher than that in earlier versions. Sample statement:

SELECT event.type, event.repo, event.actor FROM github_JSON LIMIT 1;

Local cache for remote file systems

ApsaraDB for ClickHouse can read data from local file systems at high speed, partly because the operating system caches data from local disks or Elastic Compute Service (ECS) cloud disks in page caches, from which subsequent reads are served quickly. However, when ApsaraDB for ClickHouse reads data from a remote file system such as OSS, the operating system is not aware of the read operation and cannot cache the data in page caches. In version 22.8, the ApsaraDB for ClickHouse kernel implements its own cache for remote file systems: data read from a remote file system is cached both on local disks and in memory (RAM), which greatly improves read performance.
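In open-source ClickHouse 22.8, this cache can be toggled with query-level settings; a sketch under the assumption that the same setting name is exposed in ApsaraDB for ClickHouse (the table name is hypothetical):

```sql
-- Assumption: open-source ClickHouse 22.8 setting name; availability may differ in the managed service
SET enable_filesystem_cache = 1;

-- Subsequent reads of remote (OSS-backed) data can be served from the local cache
SELECT count() FROM oss_backed_table;
```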

Other features

Version 22.8 also provides some other features, for example:

  • Projection: Physical copies of a table are created based on different sort keys and are kept consistent with the source table. By trading storage for efficiency, projections can improve the query performance of non-sort-key data by tens of times.

  • User-defined functions: User-defined functions can be created in SQL statements and scripts, which strengthens the data cleansing and processing capabilities available to users.
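As a sketch of both features, assuming the hits table from the lightweight-delete sample and open-source ClickHouse 22.8 syntax:

```sql
-- Projection: a physically stored copy sorted by a non-sort-key column
ALTER TABLE hits ADD PROJECTION by_title (SELECT * ORDER BY Title);
ALTER TABLE hits MATERIALIZE PROJECTION by_title;  -- build the projection for existing parts

-- SQL user-defined function for data cleansing
CREATE FUNCTION normalize_title AS (s) -> lowerUTF8(trimBoth(s));
SELECT normalize_title('  Hello ClickHouse  ');
```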