All Products
Search
Document Center

ApsaraDB for ClickHouse:Compatibility and performance bottleneck analysis and solutions for self-managed ClickHouse migration to ApsaraDB for ClickHouse

Last Updated:Mar 30, 2026

After you migrate data from a self-managed ClickHouse cluster to the cloud, you may encounter compatibility and performance issues. To ensure a smooth migration, we recommend running this analysis in a test environment before switching production traffic.

Background information

Self-managed ClickHouse clusters migrated to ApsaraDB for ClickHouse may encounter the following issues:

  • Version compatibility issues:

    • MaterializedMySQL engine compatibility issues

    • SQL compatibility issues

  • CPU resources exhausted and memory insufficient after migrating to ApsaraDB for ClickHouse

To resolve these issues, focus on compatibility and performance during the migration.

This guide covers:

  1. Parameter compatibility — compare settings between clusters and align them

  2. MaterializedMySQL compatibility — handle engine differences after migration

  3. SQL compatibility verification — batch-test your existing queries against the new cluster

  4. Performance bottleneck analysis — locate the tables and queries causing CPU or memory issues

  5. SQL optimization — apply targeted fixes after identifying the root cause

Compatibility analysis

Parameter compatibility

Run this query on both your self-managed cluster and your ApsaraDB for ClickHouse cluster to export all configuration parameters:

SELECT
    name,
    groupArrayDistinct(value) AS value
FROM clusterAllReplicas(`default`, system.settings)
GROUP BY name
ORDER BY name ASC

Then use a diff tool (such as VS Code's built-in file comparison) to identify mismatches. For any parameter that differs, update the ApsaraDB for ClickHouse cluster to match the self-managed cluster's value.

Parameters that affect compatibility:

  • compatibility

  • prefer_global_in_and_join

  • distributed_product_mode

Parameters that affect performance:

  • max_threads

  • max_bytes_to_merge_at_max_space_in_pool

  • prefer_global_in_and_join

image.png

MaterializedMySQL compatibility

If your self-managed cluster synchronizes data from MySQL using the MaterializedMySQL engine, be aware of how ApsaraDB for ClickHouse handles this after migration.

The community version of the MaterializedMySQL engine is no longer maintained. ApsaraDB for ClickHouse uses Data Transmission Service (DTS) to sync MySQL data instead. DTS creates a ReplacingMergeTree local table on each cluster node and a distributed table that routes writes to those nodes — replacing the MaterializedMySQL table entirely.

This architectural change causes two common issues:

Issue 1: IN and JOIN queries break on the distributed table

In the self-managed cluster, MaterializedMySQL replicates data to each shard directly. After migration, data flows through a distributed table, which changes how IN and JOIN subqueries resolve. For details and fixes, see What do I do if an error occurs when I use a subquery on a distributed table?.

Issue 2: Duplicate data in query results

The ReplacingMergeTree table merges duplicate rows asynchronously in the background. If merges fall behind, queries return more duplicates than the self-managed cluster did. Two solutions are available:

Solution 1: Force deduplication at query time

Run the following command to enable the final modifier globally. This deduplicates results during each query but increases CPU and memory usage:

SET global final = 1;

Solution 2: Tune merge frequency on the target table

Increase how often ApsaraDB for ClickHouse merges parts in the ReplacingMergeTree table:

ALTER TABLE <AIM_TABLE>
    MODIFY SETTING
        min_age_to_force_merge_on_partition_only = 1,
        min_age_to_force_merge_seconds = 60;
Parameter Description Default
min_age_to_force_merge_on_partition_only When set to 1, forces data merging within partitions 0 (disabled)
min_age_to_force_merge_seconds Time interval between forced part merges (in seconds) 3600

SQL compatibility verification

Use a Python script to extract SELECT statements from your self-managed cluster's query log and replay them against ApsaraDB for ClickHouse. Failed queries indicate SQL compatibility issues to resolve before migration.

Prerequisites

Before you begin, ensure that you have:

Verify SQL compatibility

  1. Install the Python client library for ClickHouse:

    pip3 install clickhouse_driver
  2. Create a Python file with the following script. Replace the placeholder values with your actual connection details:

    from clickhouse_driver import connect
    import datetime
    import logging
    # pip3 install clickhouse_driver is required.
    
    # Connection settings for the self-managed cluster
    host_old = 'HOST_OLD'          # VPC endpoint
    port_old = TCP_PORT_OLD        # TCP port
    user_old = 'USER_OLD'          # Username
    password_old = 'PASSWORD_OLD'  # Password
    
    # Connection settings for ApsaraDB for ClickHouse
    host_new = 'HOST_NEW'          # VPC endpoint
    port_new = TCP_PORT_NEW        # TCP port
    user_new = 'USER_NEW'          # Username
    password_new = 'PASSWORD_NEW'  # Password
    
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    
    def create_connection(host, port, user, password):
        """Establish a connection to ClickHouse."""
        return connect(host=host, port=port, user=user, password=password)
    
    def get_query_hashes(cursor):
        """Query the query_hash list of the last two days."""
        get_queryhash_sql = '''
        select distinct normalized_query_hash from system.query_log
        where type='QueryFinish'
          and `is_initial_query`=1
          and `user` not in ('default', 'aurora')
          and lower(`query`) not like 'select 1%'
          and lower(`query`) not like 'select timezone()%'
          and lower(`query`) not like '%dms-websql%'
          and lower(`query`) like 'select%'
          and `event_time` > now() - INTERVAL 2 DAY;
        '''
        cursor.execute(get_queryhash_sql)
        return cursor.fetchall()
    
    def get_sql_info(cursor, queryhash):
        """Query the SQL information for the specified query_hash in the last 2 days (search range is 3 days)."""
        get_sqlinfo_sql = f'''
        select `event_time`, `query_duration_ms`, `read_rows`, `read_bytes`, `memory_usage`, `databases`, `tables`, `current_database`, `query`
        from system.query_log
        where `event_time` > now() - INTERVAL 3 DAY and `type`='QueryFinish' and `normalized_query_hash`='{queryhash}' limit 1
        '''
        cursor.execute(get_sqlinfo_sql)
        sql_info = cursor.fetchone()
        if sql_info:
            return [info.strftime('%Y-%m-%d %H:%M:%S') if isinstance(info, datetime.datetime) else info for info in sql_info[:-1]], sql_info[-1]
        return None
    
    def execute_sql_on_new_db(cursor, current_database, query_sql, execute_failed_sql):
        """Execute the SQL query on the new cluster and record any failures."""
        try:
            cursor.execute(f"USE {current_database};")
            cursor.execute(query_sql)
        except Exception as error:
            logging.error(f'query_sql execute in new db failed: {query_sql}')
            execute_failed_sql[query_sql] = error
    
    def main():
        conn_old = create_connection(host=host_old, port=port_old, user=user_old, password=password_old)
        conn_new = create_connection(host=host_new, port=port_new, user=user_new, password=password_new)
    
        cursor_old = conn_old.cursor()
        cursor_new = conn_new.cursor()
    
        # Extract query hashes and SQL from the self-managed cluster
        old_query_hashes = get_query_hashes(cursor_old)
        old_db_execute_dir = {}
        for queryhash in old_query_hashes:
            sql_info, query = get_sql_info(cursor_old, queryhash[0])
            if sql_info:
                old_db_execute_dir[query] = sql_info
    
        cursor_old.close()
        conn_old.close()
    
        # Replay queries against ApsaraDB for ClickHouse
        execute_failed_sql = {}
        keys_list = list(old_db_execute_dir.keys())
        for query_sql in old_db_execute_dir:
            position = keys_list.index(query_sql)
            current_database = old_db_execute_dir[query_sql][-1]
            logging.info(f"new db test the {position + 1}th/{len(old_db_execute_dir)}, running sql: {query_sql}\n")
            execute_sql_on_new_db(cursor_new, current_database, query_sql, execute_failed_sql)
    
        # Collect execution results from ApsaraDB for ClickHouse
        new_query_hashes = get_query_hashes(cursor_new)
        new_db_execute_dir = {}
        for queryhash in new_query_hashes:
            sql_info, query = get_sql_info(cursor_new, queryhash[0])
            if sql_info:
                new_db_execute_dir[query] = sql_info
    
        cursor_new.close()
        conn_new.close()
    
        # Log results
        for query_sql in new_db_execute_dir:
            if query_sql in old_db_execute_dir:
                logging.info(f'succeed sql: {query_sql}')
                logging.info(f'old sql info: {old_db_execute_dir[query_sql]}')
                logging.info(f'new sql info: {new_db_execute_dir[query_sql]}\n')
    
        for query_sql in execute_failed_sql:
            logging.error('\033[31m{}\033[0m'.format(f'failed sql: {query_sql}'))
            logging.error('\033[31m{}\033[0m'.format(f'failed error: {execute_failed_sql[query_sql]}\n'))
    
    if __name__ == "__main__":
        main()
  3. Run the script and review the output. Each failed query is logged in red with its error message. Investigate and fix failures based on the specific error before proceeding with production migration.

Performance analysis and optimization

After switching traffic to ApsaraDB for ClickHouse, you may see CPU saturation or memory exhaustion. The steps below help you systematically locate the root cause — from identifying the problematic tables, to isolating specific queries, to analyzing and fixing them.

If you already know which table or query is causing the performance issue, skip directly to Step 3: Analyze SQL performance.

Step 1: Identify tables causing the performance bottleneck

Two approaches are available: flame graphs and query_log analysis. Flame graphs require more setup but give you a visual, intuitive breakdown of CPU usage. query_log analysis is simpler and needs no external tools, but requires you to interpret the data yourself. Use both together for the most complete picture.

Create a flame graph

  1. Connect to ApsaraDB for ClickHouse using clickhouse-client. For connection instructions, see Connect to a ClickHouse cluster using the command-line interface.

  2. Export trace_log for the time window when the performance issue occurred. Replace <IP>, <port>, and the event_time values with your actual values:

    Parameter Description
    <IP> Virtual Private Cloud (VPC) endpoint of ApsaraDB for ClickHouse
    <port> TCP port of ApsaraDB for ClickHouse
    # trace_type = 'CPU' collects CPU call stacks.
    # trace_type = 'Real' collects wall-clock call stacks.
    /clickhouse/bin/clickhouse-client -h <IP> --port <port> -q \
      "SELECT arrayStringConcat(arrayReverse(arrayMap(x -> concat( addressToLine(x), '#', demangle(addressToSymbol(x)) ), trace)), ';') AS stack, count() AS samples \
      FROM system.trace_log \
      WHERE trace_type = 'CPU' \
        AND event_time >= '2025-01-08 19:31:00' \
        AND event_time < '2025-01-08 19:33:00' \
      GROUP BY trace \
      ORDER BY samples DESC \
      FORMAT TabSeparated \
      SETTINGS allow_introspection_functions=1" > cpu_trace_log.txt
  3. Install clickhouse-flamegraph. For download and installation instructions, see clickhouse-flamegraph.

  4. Generate the flame graph:

    cat cpu_trace_log.txt | flamegraph.pl > cpu_trace_log.svg

    Open the SVG file to identify functions consuming the most CPU. For example, if you see ReplacingSortedMerge appearing prominently, focus your investigation on queries against ReplacingMergeTree tables.

    image

Analyze query_log

Run these queries simultaneously on both clusters to perform a top N comparison. Look for tables where CPU or memory usage is significantly higher on ApsaraDB for ClickHouse than on the self-managed cluster — those are your candidates.

High CPU or memory usage on a table doesn't always cause overall cluster degradation. It only suggests that table is more likely to be the source of the problem. Compare results between both clusters to confirm.

Locate tables with CPU issues

SELECT
    tables,
    first_value(query),
    count() AS cnt,
    groupArrayDistinct(normalizedQueryHash(query)) AS normalized_query_hash,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'UserTimeMicroseconds')]) AS sum_user_cpu,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SystemTimeMicroseconds')]) AS sum_system_cpu,
    avg(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'UserTimeMicroseconds')]) AS avg_user_cpu,
    avg(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SystemTimeMicroseconds')]) AS avg_system_cpu,
    sum(memory_usage) AS sum_memory_usage,
    avg(memory_usage) AS avg_memory_usage
FROM clusterAllReplicas(default, system.query_log)
WHERE (event_time > '2025-01-08 19:30:00') AND (event_time < '2025-01-08 20:30:00') AND (query_kind = 'Select')
GROUP BY tables
ORDER BY sum_user_cpu DESC
LIMIT 5

Locate tables with memory issues

SELECT
    tables,
    first_value(query),
    count() AS cnt,
    groupArrayDistinct(normalizedQueryHash(query)) AS normalized_query_hash,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'UserTimeMicroseconds')]) AS sum_user_cpu,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SystemTimeMicroseconds')]) AS sum_system_cpu,
    avg(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'UserTimeMicroseconds')]) AS avg_user_cpu,
    avg(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SystemTimeMicroseconds')]) AS avg_system_cpu,
    sum(memory_usage) AS sum_memory_usage,
    avg(memory_usage) AS avg_memory_usage
FROM clusterAllReplicas(default, system.query_log)
WHERE (event_time > '2025-01-08 19:30:00') AND (event_time < '2025-01-08 20:30:00') AND (query_kind = 'Select')
GROUP BY tables
ORDER BY sum_memory_usage
LIMIT 5

Replace event_time values with the time window when the performance issue occurred.

Step 2: Identify queries causing the performance bottleneck

After identifying the problematic table in Step 1, run the following queries to isolate the specific SQL statements causing the issue. Replace <AIM_TABLE> with the table name from Step 1, and update event_time to match your target time window.

Locate SQL queries that cause CPU issues

SELECT
    tables,
    first_value(query),
    count() AS cnt,
    normalizedQueryHash(query) AS normalized_query_hash,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'UserTimeMicroseconds')]) AS sum_user_cpu,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SystemTimeMicroseconds')]) AS sum_system_cpu,
    avg(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'UserTimeMicroseconds')]) AS avg_user_cpu,
    avg(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SystemTimeMicroseconds')]) AS avg_system_cpu,
    sum(memory_usage) AS sum_memory_usage,
    avg(memory_usage) AS avg_memory_usage
FROM clusterAllReplicas(default, system.query_log)
WHERE (event_time > '2025-01-08 19:30:00') AND (event_time < '2025-01-08 20:30:00') AND (query_kind = 'Select') AND has(tables, '<AIM_TABLE>')
GROUP BY
    tables,
    normalized_query_hash
ORDER BY sum_user_cpu DESC
LIMIT 5

Locate SQL queries that cause memory issues

SELECT
    tables,
    first_value(query),
    count() AS cnt,
    normalizedQueryHash(query) AS normalized_query_hash,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'UserTimeMicroseconds')]) AS sum_user_cpu,
    sum(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SystemTimeMicroseconds')]) AS sum_system_cpu,
    avg(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'UserTimeMicroseconds')]) AS avg_user_cpu,
    avg(ProfileEvents.Values[indexOf(ProfileEvents.Names, 'SystemTimeMicroseconds')]) AS avg_system_cpu,
    sum(memory_usage) AS sum_memory_usage,
    avg(memory_usage) AS avg_memory_usage
FROM clusterAllReplicas(default, system.query_log)
WHERE (event_time > '2025-01-08 19:30:00') AND (event_time < '2025-01-08 20:30:00') AND (query_kind = 'Select') AND has(tables, 'AIM_TABLE')
GROUP BY
    tables,
    normalized_query_hash
ORDER BY sum_memory_usage
LIMIT 5

Step 3: Analyze SQL performance

Use EXPLAIN PIPELINE and system.query_log to understand why a specific query is slow.

Analyze the execution plan:

EXPLAIN PIPELINE <SQL with performance issues>

For details on reading EXPLAIN output, see the ClickHouse EXPLAIN documentation.

Analyze execution metrics from query_log:

SELECT
    hostname() AS host,
    *
FROM clusterAllReplicas(`default`, system.query_log)
WHERE
    event_time > '2025-01-18 00:00:00'
    AND event_time < '2025-01-18 03:00:00'
    AND initial_query_id = '<INITIAL_QUERY_ID>'
    AND type = 'QueryFinish'
ORDER BY query_start_time_microseconds

Replace <INITIAL_QUERY_ID> with the query ID of the statement you want to investigate, and update event_time to cover the relevant time window.

Focus on these fields in the output:

Field What to look for
ProfileEvents Detailed counters for CPU time, disk I/O, memory allocation, and network. Compares resource usage across nodes. See system events reference.
Settings Configuration parameters active during the query. Helps identify settings that may be causing unexpected behavior. See settings reference.
query_duration_ms Total query runtime. If a subquery is unusually slow, use its query_id to find its execution details in the log. To adjust log verbosity, see Configure config.xml parameters.

If the metrics alone don't explain the performance degradation, pull the same query's execution information from the self-managed cluster and compare it side by side with the ApsaraDB for ClickHouse results. Differences in ProfileEvents values often reveal the root cause.

Step 4: Optimize SQL

After identifying the slow query, apply the relevant optimizations below. Most performance regressions after migration have a specific root cause — focus on the technique that matches your findings rather than applying all of them at once.

Optimization Description
Index optimization Identify columns used frequently in WHERE clauses and create an index for them.
Data types Use the most appropriate data types to reduce storage overhead and speed up reads.
**Avoid SELECT *** Select only the columns you need to reduce I/O and network transfer.
Use primary key and index columns in WHERE Filter on indexed columns to enable pruning.
Use PREWHERE Move selective filter conditions to PREWHERE to reduce the data volume processed by subsequent stages.
Use GLOBAL JOIN for distributed tables Replace regular JOIN with GLOBAL JOIN to avoid per-shard broadcast inefficiency.
Tune `max_threads` Increasing max_threads uses more CPU cores, but setting it too high causes resource contention. See settings reference.
Create materialized views For frequently executed complex queries with stable results, pre-aggregate with a materialized view.
Tune data compression ClickHouse compresses data by default. Adjusting the compression algorithm and level can improve storage efficiency and query speed.
Expand the query cache For stable, frequently repeated queries, increase uncompressed_cache_size to reduce recomputation. For parameter modification instructions, see Configure config.xml parameters.