FAQ - E-MapReduce - Alibaba Cloud Documentation Center

This topic provides answers to some frequently asked questions about ClickHouse.

How do I create a ClickHouse user?
What do I do if data loss occurs?
What do I do if the error message "Memory limit (for total) exceeded" appears?
What do I do if the error message "Memory limit (for query) exceeded" appears?
What do I do if the error message "Memory limit (for user) exceeded" appears?
How are the components of the ClickHouse service distributed in an OLAP cluster?
Why does the data import speed gradually decrease?
How do I upgrade the ClickHouse service?
Why do I need to define a logical cluster in ClickHouse?
What are shards and replicas in ClickHouse?
How do I test the performance of ClickHouse?
Can I change the number of shards and replicas after I create a cluster?
What settings must I configure when I use HDFS in EMR ClickHouse?
What are the differences between EMR ClickHouse and ApsaraDB for ClickHouse?
Can ClickHouse be used to query Hive tables that use DLF for metadata storage?

How do I create a ClickHouse user?

You can create a ClickHouse user in the E-MapReduce (EMR) console or by using the ClickHouse client.

EMR console
On the Configure tab of the ClickHouse service page in the EMR console, click the server-users tab. On the server-users tab, click Add Configuration Item. In the dialog box that appears, add a configuration item whose key is users.<YourUserName>.password, users.<YourUserName>.password_sha256_hex, or users.<YourUserName>.password_double_sha1_hex and specify a value for the configuration item. Then, save the configuration and restart the ClickHouse service.
Replace <YourUserName> with the name of the ClickHouse user that you want to create.
Note
For more information about how to add a configuration item, see Manage parameters for services. For more information about how to restart a service, see Restart a service.
ClickHouse client
1. On the Configure tab of the ClickHouse service page in the EMR console, click the server-users tab. On the server-users tab, click Add Configuration Item. In the dialog box that appears, add a configuration item whose key is users.default.access_management and value is 1, save the configuration, and then restart the ClickHouse service. Use the default user account to connect to your ClickHouse cluster.
2. Log on to the EMR ClickHouse cluster in SSH mode. For more information, see Log on to a cluster.
3. Run the following command to start the ClickHouse client:
```
clickhouse-client -h core-1-1 -m
```
  Note
  In the sample command, core-1-1 indicates the name of the core node that you log on to. If you have multiple core nodes, you can log on to one of the nodes.
4. Execute the following statement to create a ClickHouse user:
```
CREATE USER IF NOT EXISTS user_test ON CLUSTER new_cluster_emr IDENTIFIED WITH plaintext_password BY '123456';
```
  Note
  In this example, the name of the user is user_test and the password is 123456. You can specify the name and the password based on your business requirements.
  Syntax that is used to create a ClickHouse user:
```
CREATE USER [IF NOT EXISTS | OR REPLACE] name1 [ON CLUSTER cluster_name1]
        [, name2 [ON CLUSTER cluster_name2] ...]
    [NOT IDENTIFIED | IDENTIFIED {[WITH {no_password | plaintext_password | sha256_password | sha256_hash | double_sha1_password | double_sha1_hash}] BY {'password' | 'hash'}} | {WITH ldap SERVER 'server_name'} | {WITH kerberos [REALM 'realm']}]
    [HOST {LOCAL | NAME 'name' | REGEXP 'name_regexp' | IP 'address' | LIKE 'pattern'} [,...] | ANY | NONE]
    [DEFAULT ROLE role [,...]]
    [GRANTEES {user | role | ANY | NONE} [,...] [EXCEPT {user | role} [,...]]]
    [SETTINGS variable [= value] [MIN [=] min_value] [MAX [=] max_value] [READONLY | WRITABLE] | PROFILE 'profile_name'] [,...];
```
5. Execute the following statement to query existing users:
```
SHOW USERS;
```
  The information about existing users is returned.
```
┌─name──────┐
│ default          │
│ user_test        │
│ user_test2       │
└─────────┘
```

What do I do if data loss occurs?

Problem description: A total number of A data entries are written to ClickHouse, but only B data entries are read. B is less than A.
Cause: Data is not missing in ClickHouse in most cases. However, at least two replicas are stored in each shard. The *MergeTree engine is used for local tables. If you use a distributed table to read data, the number of data entries that are read may be less than the number of data entries that are written.
Data is distributed to each node regardless of whether data is written by using a distributed table or is directly written to local tables. When you read data by using a distributed table, each shard uses one connection by default. In this case, the number of connections is less than the number of nodes. As a result, data on some nodes cannot be read. Sample statements:
```
CREATE TABLE db.table_local
(
  ...
)
Engine = MergeTree()

CREATE TABLE db.table_distributed
(
  ...
)
Engine = Distributed(cluster_emr, db, table_local, rand());
```

Solutions

Solution	Operation
(Recommended) Solution 1	Delete the db.table_local table and create a replicated table. Use the created replicated table as a local table.
(Not recommended) Solution 2	On the Configure tab of the ClickHouse service page in the EMR console, click the server-users tab. On the server-users tab, click Add Configuration Item. In the dialog box that appears, add a configuration item whose key is profiles.<your_profile_name>.max_parallel_replicas and value is greater than or equal to the number of replicas in each shard. Make sure that the value of users.<your_clickhouse-client_name>.profile is the same as the value of <your_profile_name>. Note Replace <your_profile_name> with your profile name and replace <your_clickhouse-client_name> with the name of your ClickHouse client. For more information about the configuration items, see Configure user permissions.

What do I do if the error message "Memory limit (for total) exceeded" appears?

Cause: The memory usage exceeds the total memory available for the server.
Solution: On the Configure tab of the ClickHouse service page in the EMR console, click the server-config tab. On the server-config tab, click Add Configuration Item. In the dialog box that appears, add the max_server_memory_usage configuration item. The maximum value that you can specify for this configuration item is calculated by using the following formula: Physical server memory × Value of max_server_memory_usage_to_ram_ratio.
Note
The default value of the max_server_memory_usage_to_ram_ratio parameter in ClickHouse is 0.9. To change the value of this parameter, add the max_server_memory_usage_to_ram_ratio configuration item and specify the value based on your business requirements.

What do I do if the error message "Memory limit (for query) exceeded" appears?

Cause: The memory usage exceeds the maximum memory available for a single query.

Solutions

Scenario	Operation
Global configuration	On the Configure tab of the ClickHouse service page in the EMR console, click the server-config tab. On the server-config tab, click Add Configuration Item. In the dialog box that appears, add the profiles.<your_profile_name>.max_memory_usage configuration item. Make sure that the value of the users.<your_clickhouse-client_name>.profile configuration item is the same as the value of <your_profile_name>. Note Replace <your_profile_name> with your profile name and replace <your_clickhouse-client_name> with the name of your ClickHouse client. For more information about the configuration items, see Configure user permissions.
For the ClickHouse client	On the Configure tab of the ClickHouse service page in the EMR console, click the client-config tab. On the client-config tab, click Add Configuration Item. In the dialog box that appears, add the max_memory_usage configuration item.
For a session	Run the `SET max_memory_usage=xxxx` command. The configuration takes effect during the lifecycle of a session.
For a query	Add the setting of max_memory_usage to an SQL statement. The configuration takes effect only for the current query. Sample statement: `SELECT column FROM table SETTINGS max_memory_usage=xxxx`

What do I do if the error message "Memory limit (for user) exceeded" appears?

Cause: The memory usage exceeds the maximum memory available for a single user.

Solutions

Scenario	Operation
Global configuration	On the Configure tab of the ClickHouse service page in the EMR console, click the server-users tab. On the server-users tab, click Add Configuration Item. In the dialog box that appears, add the profiles.<your_profile_name>.max_memory_usage_for_user configuration item. Make sure that the value of the users.<your_clickhouse-client_name>.profile configuration item is the same as the value of <your_profile_name>. Note Replace <your_profile_name> with your profile name and replace <your_clickhouse-client_name> with the name of your ClickHouse client. For more information about the configuration items, see Configure user permissions.
For the ClickHouse client	On the Configure tab of the ClickHouse service page in the EMR console, click the client-config tab. On the client-config tab, click Add Configuration Item. In the dialog box that appears, add the max_memory_usage_for_user configuration item.
For a session	Run the `SET max_memory_usage_for_user=xxxx` command. The configuration takes effect during the lifecycle of a session.
For a query	Add the setting of max_memory_usage_for_user to an SQL statement. The configuration takes effect only for the current query. Sample statement: `SELECT column FROM table SETTINGS max_memory_usage_for_user=xxxx`

How are the components of the ClickHouse service distributed in an OLAP cluster?

The ClickHouse service contains the following components: ClickHouseRuntime, ClickHouseServer, and ClickHouseKeeper.

In OLAP clusters of EMR V3.X, the ClickHouseKeeper component is not installed, the ClickHouseRuntime component is installed on all types of nodes, and the ClickHouseServer component is installed only on core nodes.
In OLAP clusters of EMR V5.X, the ClickHouseRuntime component is installed on all types of nodes, and the ClickHouseServer component is installed only on core nodes. The ClickHouseKeeper component is installed on the first core node or the first three core nodes and does not support scale-out. For example, if you configure two core nodes when you create an OLAP cluster, the ClickHouseKeeper component is installed on the first core node. If you add two more core nodes to the cluster, the number of core nodes on which the ClickHouseKeeper component is installed remains unchanged.

Why does the data import speed gradually decrease?

In ClickHouse, most tables are MergeTree-related tables, such as ReplicatedMergeTree and ReplacingMergeTree tables.

When data is written to a MergeTree table, the data is sorted in the memory and then flushed to the disk. In this case, a data part is generated on the disk. Data parts on the disk are continuously merged by using background threads. This way, small data parts are merged into a large data part. In the initial phase of importing data to a MergeTree table, the background threads may find that no data parts can be merged. All resources are used to receive data and then flush the data to the disk. Therefore, the data import speed is high. After a period of time, more data parts can be merged by using the background threads. As a result, some resources are used to merge data parts. More data that is flushed to the disk indicates that more computing resources are required to merge data parts. The amount of resources that are used to merge data parts increase until a balance is reached between data import and data merging. As a result, the data import speed decreases.

How do I upgrade the ClickHouse service?

The version of the ClickHouse service varies based on the version of an EMR cluster. If you want to upgrade the ClickHouse service, you must upgrade the EMR cluster.

Why do I need to define a logical cluster in ClickHouse?

You can specify related configurations to define a logical cluster in ClickHouse. You can execute the select * from system.clusters SQL statement in ClickHouse to view the existing logical clusters.

Each logical cluster contains multiple machines. You can configure a logical cluster to establish the relationships between different machines, such as sharding and replication relationships. You can also use different logical clusters in different business scenarios.

By default, a logical cluster that contains all nodes in a physical EMR ClickHouse cluster is configured.

What are shards and replicas in ClickHouse?

In ClickHouse, a node can be a replica of one or more nodes. The data of tables that use the Replicated*MergeTree engine on the nodes is the same.

In ClickHouse, a shard consists of one or more nodes. The nodes in the shard are replicas of each other.

How do I test the performance of ClickHouse?

In most cases, you can use Star Schema Benchmark (SSB) to test the performance of ClickHouse. For more information, see Star Schema Benchmark. For information about other test cases, see Tutorials and Example Datasets and clickhouse-benchmark.

Can I change the number of shards and replicas after I create a cluster?

For example, when you create a cluster, the number of shards is A and the number of replicas is B. After the cluster is created, you want to change the number of shards to C and the number of replicas to D. ClickHouse allows you to change the number of shards and replicas. To change the number of shards and replicas, perform the following operations: On the Configure tab of the ClickHouse service page in the EMR console, click the server-metrika tab. On the server-metrika tab, find the clickhouse_remote_servers parameter and change the number of shards and the number of replicas in the settings of the parameter.

However, to prevent errors, we recommend that you do not directly change the number of shards and replicas. For example, if data already exists in the cluster, the changes may lead to a data error. We recommend that you add a logical cluster and configure the number of shards and replicas based on your business requirements. This way, you can retain the original configurations of shards and replicas.

What settings must I configure when I use HDFS in EMR ClickHouse?

For more information, see HDFS.

What are the differences between EMR ClickHouse and ApsaraDB for ClickHouse?

The ID of an EMR ClickHouse cluster starts with c-. Example: c-3c8697f91408****. The ID of an ApsaraDB for ClickHouse cluster starts with cc-. Example: cc-bp16qwvp7hy8i****.
EMR ClickHouse is semi-managed. You can log on to an Elastic Compute Service (ECS) instance to perform O&M operations. ApsaraDB for ClickHouse is fully managed. You cannot log on to an ECS instance to perform O&M operations.

Can ClickHouse be used to query Hive tables that use DLF for metadata storage?

No, ClickHouse cannot be used to query Hive tables that use Data Lake Formation (DLF) for metadata storage. ClickHouse can be used to query only Hive tables that use MySQL or ApsaraDB RDS for metadata storage. Therefore, we recommend that you do not use ClickHouse to query Hive tables.

The following suggestions are provided based on your business requirements:

If you want to import data from Hive to ClickHouse, we recommend that you use Apache Spark or Apache Seatunnel.
If you want to analyze Hive data, we recommend that you use an engine such as StarRocks, Trino, or Impala.