All Products
Search
Document Center

E-MapReduce:FAQ

Last Updated:Mar 26, 2026

This document answers frequently asked questions about ClickHouse in E-MapReduce (EMR).

How do I create a ClickHouse user?

Create a ClickHouse user using the EMR console or the clickhouse-client CLI.

EMR console

  1. In the EMR console, go to the ClickHouse service page and click the Configure tab.

  2. Click the server-users tab, then click Add Configuration Item.

  3. Add a configuration item with one of the following keys, and set its value to the password: Replace <YourUserName> with the username you want to create.

    Key Authentication method
    users.<YourUserName>.password Plaintext password
    users.<YourUserName>.password_sha256_hex SHA-256 hash
    users.<YourUserName>.password_double_sha1_hex Double SHA-1 hash
  4. Save the configuration and restart the ClickHouse service.

For more information, see Manage parameters for services and Restart a service.

clickhouse-client CLI

  1. Enable access management for the default user account. In the EMR console, go to Configure > server-users > Add Configuration Item and add users.default.access_management with the value 1. Save the configuration and restart the ClickHouse service.

  2. Log on to the EMR ClickHouse cluster via SSH. For more information, see Log on to a cluster.

  3. Start the ClickHouse client:

    clickhouse-client -h core-1-1 -m

    Replace core-1-1 with the name of the core node you logged on to.

  4. Create a user:

    CREATE USER IF NOT EXISTS user_test ON CLUSTER new_cluster_emr IDENTIFIED WITH plaintext_password BY '123456';

    Replace user_test with your username and 123456 with your password. The full CREATE USER syntax is:

    CREATE USER [IF NOT EXISTS | OR REPLACE] name1 [ON CLUSTER cluster_name1]
            [, name2 [ON CLUSTER cluster_name2] ...]
        [NOT IDENTIFIED | IDENTIFIED {[WITH {no_password | plaintext_password | sha256_password | sha256_hash | double_sha1_password | double_sha1_hash}] BY {'password' | 'hash'}} | {WITH ldap SERVER 'server_name'} | {WITH kerberos [REALM 'realm']}]
        [HOST {LOCAL | NAME 'name' | REGEXP 'name_regexp' | IP 'address' | LIKE 'pattern'} [,...] | ANY | NONE]
        [DEFAULT ROLE role [,...]]
        [GRANTEES {user | role | ANY | NONE} [,...] [EXCEPT {user | role} [,...]]]
        [SETTINGS variable [= value] [MIN [=] min_value] [MAX [=] max_value] [READONLY | WRITABLE] | PROFILE 'profile_name'] [,...];
  5. Verify the user was created:

    SHOW USERS;

    The output lists all existing users, including the one you just created.

What do I do if data loss occurs?

Symptom: You write A rows to ClickHouse but read back fewer than A rows.

Cause: The data is most likely not lost — it is a read routing issue with distributed tables. By default, a distributed table uses one connection per shard. If a shard has multiple replicas, only one node per shard is queried, so data on the other nodes is not returned.

Data is distributed to each node regardless of whether data is written using a distributed table or directly to local tables. For example:

CREATE TABLE db.table_local (...) Engine = MergeTree()

CREATE TABLE db.table_distributed (...) Engine = Distributed(cluster_emr, db, table_local, rand());

Solutions:

Solution Operation Notes
(Recommended) Recreate the local table as a replicated table Delete db.table_local and recreate it using the ReplicatedMergeTree engine. Ensures all replicas stay synchronized; any node in a shard can serve a complete copy of the data.
(Not recommended) Increase parallel replicas In the EMR console, go to Configure > server-users > Add Configuration Item and add profiles.<your_profile_name>.max_parallel_replicas with a value greater than or equal to the number of replicas per shard. Make sure users.<your_clickhouse-client_name>.profile matches <your_profile_name>. Increases read parallelism but does not fix the underlying data distribution issue. For more information, see Configure user permissions.

What do I do if the error "Memory limit (for total) exceeded" appears?

Cause: The server's total memory usage has exceeded the limit set by max_server_memory_usage. The cap is calculated as:

max_server_memory_usage = Physical server memory x max_server_memory_usage_to_ram_ratio

The default value of max_server_memory_usage_to_ram_ratio is 0.9.

Solution: In the EMR console, go to Configure > server-config > Add Configuration Item and add the max_server_memory_usage configuration item with a higher value. To change the ratio itself, add the max_server_memory_usage_to_ram_ratio configuration item.

What do I do if the error "Memory limit (for query) exceeded" appears?

Cause: A single query exceeded the per-query memory limit (max_memory_usage).

Solutions: Configure max_memory_usage at the scope that matches your situation:

Scope How to configure
Global (all queries) In the EMR console, go to Configure > server-config > Add Configuration Item and add profiles.<your_profile_name>.max_memory_usage. Make sure users.<your_clickhouse-client_name>.profile matches <your_profile_name>.
ClickHouse client In the EMR console, go to Configure > client-config > Add Configuration Item and add max_memory_usage.
Current session Run SET max_memory_usage=<value>. Takes effect for the session's duration.
Single query Append the setting to your SQL: SELECT column FROM table SETTINGS max_memory_usage=<value>.

For more information about configuration items, see Configure user permissions.

What do I do if the error "Memory limit (for user) exceeded" appears?

Cause: The total memory used by a single user has exceeded the per-user limit (max_memory_usage_for_user).

Solutions: Configure max_memory_usage_for_user at the scope that matches your situation:

Scope How to configure
Global (all users) In the EMR console, go to Configure > server-users > Add Configuration Item and add profiles.<your_profile_name>.max_memory_usage_for_user. Make sure users.<your_clickhouse-client_name>.profile matches <your_profile_name>.
ClickHouse client In the EMR console, go to Configure > client-config > Add Configuration Item and add max_memory_usage_for_user.
Current session Run SET max_memory_usage_for_user=<value>. Takes effect for the session's duration.
Single query Append the setting to your SQL: SELECT column FROM table SETTINGS max_memory_usage_for_user=<value>.

For more information about configuration items, see Configure user permissions.

How are ClickHouse components distributed in an OLAP cluster?

The ClickHouse service includes three components: ClickHouseRuntime, ClickHouseServer, and ClickHouseKeeper.

Their distribution depends on the EMR cluster version:

Component EMR V3.X EMR V5.X
ClickHouseRuntime All node types All node types
ClickHouseServer Core nodes only Core nodes only
ClickHouseKeeper Not installed First core node, or first three core nodes (no scale-out support)

In EMR V5.X, the number of nodes running ClickHouseKeeper is fixed at cluster creation time. For example, if you create a cluster with two core nodes, ClickHouseKeeper runs on the first core node only. Adding more core nodes later does not change this.

Why does data import speed gradually decrease?

This is expected behavior for MergeTree-based tables, including ReplicatedMergeTree and ReplacingMergeTree.

When data is written, ClickHouse sorts it in memory and flushes it to disk as a data part. Background threads continuously merge smaller parts into larger ones. In the early phase of an import, there are few parts to merge, so all resources go toward receiving and flushing data — import speed is at its peak.

As more data arrives and more parts accumulate, background merge threads compete with the import process for CPU and I/O. Import speed drops until it stabilizes at a balance point between incoming data and background merging.

How do I upgrade the ClickHouse service?

The ClickHouse version in EMR is tied to the EMR cluster version. To upgrade ClickHouse, upgrade the EMR cluster.

Why do I need to define a logical cluster?

A logical cluster defines the relationships between nodes — such as sharding and replication topology — within a physical EMR ClickHouse cluster. Different logical clusters can serve different business use cases on the same physical cluster.

By default, EMR configures one logical cluster that includes all nodes in the physical cluster. To view existing logical clusters, run:

SELECT * FROM system.clusters;

What are shards and replicas in ClickHouse?

  • Shard: A shard consists of one or more nodes. All nodes in a shard are replicas of each other and hold the same data for tables using the Replicated\*MergeTree engine.

  • Replica: A node can be a replica of one or more other nodes. Replicated nodes store identical data for Replicated\*MergeTree tables.

How do I test ClickHouse performance?

Use Star Schema Benchmark (SSB) for OLAP query throughput testing. For setup and usage, see Star Schema Benchmark.

For single-query latency testing, use clickhouse-benchmark. For additional datasets and test scenarios, see Tutorials and example datasets.

Can I change the number of shards and replicas after creating a cluster?

Yes, but changing shards and replicas directly on a cluster with existing data can cause data errors. Instead, add a new logical cluster with the desired shard and replica configuration, and keep the original logical cluster intact.

To add a logical cluster, go to Configure > server-metrika in the EMR console, find the clickhouse_remote_servers parameter, and add the new cluster configuration.

What settings are required when using HDFS in EMR ClickHouse?

For configuration details, see the HDFS table engine documentation.

What are the differences between EMR ClickHouse and ApsaraDB for ClickHouse?

Feature EMR ClickHouse ApsaraDB for ClickHouse
Cluster ID prefix c- (for example, c-3c8697f91408****) cc- (for example, cc-bp16qwvp7hy8i****)
Management model Semi-managed — you can log on to Elastic Compute Service (ECS) instances for O&M Fully managed — ECS instance access is not available

Can ClickHouse query Hive tables that use DLF for metadata storage?

No. ClickHouse can only query Hive tables that use MySQL or ApsaraDB RDS for metadata storage. Hive tables backed by Data Lake Formation (DLF) are not supported.

If you need to work with Hive data, consider these alternatives:

  • To import Hive data into ClickHouse: Use Apache Spark or Apache SeaTunnel.

  • To analyze Hive data in place: Use an engine such as StarRocks, Trino, or Impala.