The configuration parameters for the ClickHouse service of an E-MapReduce (EMR) ClickHouse cluster include client parameters, server parameters, user permission parameters, and extended parameters. This topic describes how to configure the ClickHouse client, ClickHouse server, and extended parameters for the ClickHouse service.

Background information

The following table lists where to find information about each type of configuration parameter for the ClickHouse service.
Item References
ClickHouse client client-config
ClickHouse server server-config
Extended parameters server-metrika
User permissions Configure user permissions

Prerequisites

A ClickHouse cluster is created. For more information, see Create a cluster.

Precautions

Extensible Markup Language (XML) files are used to configure the ClickHouse service. An XML file can contain nested parameters and nested parameter values. Take note of the following rules when you add custom parameters:
  • If a parameter belongs directly under the yandex tag, add the parameter by its name. You do not need to add the yandex tag in the EMR console.
  • If a nested parameter is used, separate layers in the nested parameter with periods (.).

    For example, on the server-users tab, you can configure the nested parameter users.aliyun.password for a newly added user named aliyun. The value of this parameter is a password. You can specify a custom password.

  • When you add a custom parameter, do not specify the parameter name and parameter value in XML format.
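
As a sketch of the naming rule above, the dotted parameter users.aliyun.password from the example would correspond to the following nesting in the generated XML file. The value your_password is a placeholder, and the exact layout of the file that EMR generates is an assumption.

```xml
<yandex>
  <users>
    <!-- "aliyun" is the user name, the second layer of users.aliyun.password -->
    <aliyun>
      <!-- The parameter value becomes the text of the innermost tag -->
      <password>your_password</password>
    </aliyun>
  </users>
</yandex>
```

In the EMR console, you would enter only the name users.aliyun.password and the value your_password, not this XML.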

client-config

The parameters on the client-config tab are used to generate the config.xml file that is used by a ClickHouse client. You can go to the ClickHouse service page of the EMR console, click client-config on the Configure tab, and then configure the following parameters.

Parameter Description
user The username that is used to log on to the ClickHouse client. Default value: default.
password The password that is used to log on to the ClickHouse client. This parameter is left empty by default.
prompt_by_server_display_name.production, prompt_by_server_display_name.default, prompt_by_server_display_name.test The prompts that are customized for the ClickHouse client. The prompt that is used depends on the value of the display_name parameter on the server-config tab. For example, if you set the display_name parameter to default, the prompt is the value of the prompt_by_server_display_name.default parameter on the client-config tab. For more information about the color of prompts, see Color prompts with readline and tip_colors_and_formatting.
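
The following fragment sketches how the three prompt parameters might look in the client configuration file. The prompt strings are illustrative values modeled on the sample client configuration of open source ClickHouse and are not necessarily the EMR defaults. {display_name} is a macro that the client replaces with the display name that the server reports.

```xml
<prompt_by_server_display_name>
  <!-- Plain prompt, used when the server display_name is "default" -->
  <default>{display_name} :) </default>
  <!-- Green smiley (ANSI color code 32), used for "test" -->
  <test>{display_name} \x01\e[1;32m\x02:)\x01\e[0m\x02 </test>
  <!-- Red smiley (ANSI color code 31), used for "production" -->
  <production>{display_name} \x01\e[1;31m\x02:)\x01\e[0m\x02 </production>
</prompt_by_server_display_name>
```

The \x01 and \x02 markers wrap the non-printing color sequences so that readline computes the prompt width correctly.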

server-config

The parameters on the server-config tab are used to generate the config.xml file that is used by a ClickHouse server. You can go to the ClickHouse service page of the EMR console, click server-config on the Configure tab, and then configure the following parameters.

Parameter Description
tcp_port The TCP port that is used to communicate with the ClickHouse client. Default value: 9000.
logger.count The maximum number of archived ClickHouse log files. If the number of archived log files reaches the value of this parameter, ClickHouse deletes the earliest archived log files. Default value: 10.
logger.errorlog The path that is used by the ClickHouse server to store error logs. Default value: /var/log/clickhouse-server/clickhouse-server.err.log.
logger.level The log level. Default value: information. Valid values: none, which disables logging, and, in descending order of severity, fatal, critical, error, warning, notice, information, debug, and trace.
logger.size The maximum size of a log file. If the size of a log file reaches the value of this parameter, ClickHouse archives and renames the log file and creates another log file. Default value: 1000M.
logger.path The path that is used by the ClickHouse server to store common logs. The default path is /var/log/clickhouse-server/clickhouse-server.log. The log file records logs of the level specified by the logger.level parameter.
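
For orientation, the logger parameters above could appear in the server config.xml file as follows, assuming the dotted-name rule in the Precautions section. The values are the defaults listed in this table. logger.path is omitted because this topic does not state the tag name that EMR generates for it.

```xml
<logger>
  <!-- Messages at this level and at more severe levels are written -->
  <level>information</level>
  <!-- Separate file for error logs -->
  <errorlog>/var/log/clickhouse-server/clickhouse-server.err.log</errorlog>
  <!-- A log file is archived when it reaches this size -->
  <size>1000M</size>
  <!-- At most this many archived files are kept -->
  <count>10</count>
</logger>
```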
access_control_path The path of the folder that is used by the ClickHouse server to store the configurations of the users and roles that are created by executing SQL statements. Default value: /var/lib/clickhouse/access/.
user_files_path The path that stores user files. This parameter is used in the file() function of a table. Default value: /var/lib/clickhouse/user_files/.
path_to_regions_hierarchy_file The path that stores regional hierarchy files. This parameter is used by the ClickHouse internal dictionary. This parameter is left empty by default.
path_to_regions_names_files The path that stores files that contain region names. This parameter is used by the ClickHouse internal dictionary. This parameter is left empty by default.
distributed_ddl.path The path in ZooKeeper that stores the queue of distributed Data Definition Language (DDL) queries. Default value: /clickhouse/task_queue/ddl. By default, the CREATE, DROP, ALTER, and RENAME statements that are executed in the ClickHouse cluster affect only the server that processes the statements. You can configure the parameters that are prefixed with distributed_ddl to allow the statements to be run on all servers in the ClickHouse cluster. These parameters take effect only if ZooKeeper is enabled.
tmp_policy The policy that is used to store temporary data that is generated when large table queries are processed. This parameter is left empty by default.
You can set this parameter to a value based on disk policies that are specified by the storage_configuration parameter on the server-metrika tab.
Note If this parameter is left empty, the tmp_path parameter takes effect. If this parameter is specified, the tmp_path parameter is ignored.
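
As an illustration, the following setting points tmp_policy at the hdd_in_order policy that appears in the storage_configuration sample in this topic. Whether that policy is a suitable place for temporary data in your cluster is an assumption.

```xml
<!-- Temporary query data is written to the disks of the hdd_in_order policy.
     Because tmp_policy is set, tmp_path is ignored. -->
<tmp_policy>hdd_in_order</tmp_policy>
```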
path The path to the directory of data files. You must add a forward slash (/) to the end of the path. Default value: /var/lib/clickhouse/.
https_port The HTTPS port that is used to communicate with the ClickHouse server. Parameters related to OpenSSL are required only if you configure the https_port parameter. If you specify both the https_port parameter and the http_port parameter, the https_port parameter is ignored. This parameter is left empty by default.
query_log.flush_interval_milliseconds, query_log.engine, query_log.partition_by, query_log.database, query_log.table If log_queries=1 is configured for the profile that you use, information about the queries is stored in a table. The following parameters that are prefixed with query_log configure how the information is stored:
  • flush_interval_milliseconds: specifies the interval at which data in the memory is flushed to the table. Unit: milliseconds. Default value: 7500.
  • engine: specifies the type of engine that is used by the table. This parameter is left empty by default.
    Notice Do not configure the query_log.partition_by parameter if you configure the query_log.engine parameter. Otherwise, an error may occur.
  • partition_by: specifies the partition keys of the table. Default value: toYYYYMM(event_date).
  • database: specifies the name of the database to which the table belongs. Default value: system.
  • table: specifies the name of the table. Default value: query_log.
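
A minimal sketch of the resulting query_log section in the server config.xml file, assuming the dotted-name rule in the Precautions section. The table name query_log follows the open source ClickHouse default. engine is left out, so partition_by can be set.

```xml
<query_log>
  <!-- The log table is created as system.query_log -->
  <database>system</database>
  <table>query_log</table>
  <!-- Monthly partitions. Do not combine partition_by with engine. -->
  <partition_by>toYYYYMM(event_date)</partition_by>
  <!-- Buffered entries are written to the table every 7.5 seconds -->
  <flush_interval_milliseconds>7500</flush_interval_milliseconds>
</query_log>
```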
interserver_http_credentials.user, interserver_http_credentials.password The credentials that are used for authentication during replication. By default, if the name of the engine that is used by the table is prefixed with Replicated, table replication does not require authentication. You can configure these parameters to enable authentication. The credentials are used only for communication between replicas and are independent of the credentials of the ClickHouse client.
  • user: the username. This parameter is left empty by default.
  • password: the password. This parameter is left empty by default.
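
The two parameters could map to the following section of the server config.xml file. The names replica_user and replica_password are hypothetical placeholders. Every replica would need the same pair for replication to authenticate successfully.

```xml
<interserver_http_credentials>
  <!-- Used only between replicas, not by ClickHouse clients -->
  <user>replica_user</user>
  <password>replica_password</password>
</interserver_http_credentials>
```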
mlock_executable Specifies whether to call the mlockall function after the ClickHouse service is started. Calling the function reduces the latency of the first query and prevents the executable file of the ClickHouse service from being paged out when the I/O load is high. Default value: false.
Note We recommend that you set this parameter to true. However, take note that the time that is required to start the ClickHouse service is increased by several seconds if you set this parameter to true.
trace_log.table, trace_log.database, trace_log.partition_by, trace_log.engine, trace_log.flush_interval_milliseconds If the value of either the query_profiler_real_time_period_ns or query_profiler_cpu_time_period_ns parameter for the profile that you use is not 0, the stack traces that are recorded by the query profiler are stored in a table. The following parameters that are prefixed with trace_log configure how the information is stored:
  • database: specifies the name of the database to which the table belongs. Default value: system.
  • table: specifies the name of the table. Default value: trace_log.
  • partition_by: specifies the partition keys of the table. Default value: toYYYYMM(event_date).
  • engine: specifies the type of engine that is used by the table. This parameter is left empty by default.
    Notice Do not configure the trace_log.partition_by parameter if you configure the trace_log.engine parameter. Otherwise, an error may occur.
  • flush_interval_milliseconds: specifies the interval at which data in the memory is flushed to the table. Unit: milliseconds. Default value: 7500.
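
To illustrate the notice about engine and partition_by, the following sketch sets trace_log.engine and therefore leaves trace_log.partition_by out. The engine definition is an example value, not an EMR default. It carries its own PARTITION BY clause, which is why a separate partition_by would conflict with it.

```xml
<trace_log>
  <database>system</database>
  <table>trace_log</table>
  <!-- partition_by is deliberately absent because engine is set -->
  <engine>ENGINE = MergeTree PARTITION BY toYYYYMM(event_date) ORDER BY (event_date, event_time)</engine>
  <flush_interval_milliseconds>7500</flush_interval_milliseconds>
</trace_log>
```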
disable_internal_dns_cache Specifies whether to disable the internal DNS cache. The internal DNS cache is disabled if you set this parameter to a value that is not 0. Default value: 0.
Note We recommend that you configure this parameter in environments where the infrastructure changes frequently, such as Kubernetes.
listen_reuse_port Specifies whether to allow a port to be reused among sockets. Valid values:
  • 0: A port cannot be reused among sockets. It is the default value.
  • 1: A port can be reused among sockets.
query_thread_log.table, query_thread_log.database, query_thread_log.partition_by, query_thread_log.engine, query_thread_log.flush_interval_milliseconds If log_query_threads=1 is configured for the profile that you use, information about the threads that are used for queries is stored in a table. The following parameters that are prefixed with query_thread_log configure how the information is stored:
  • database: specifies the name of the database to which the table belongs. Default value: system.
  • table: specifies the name of the table. Default value: query_thread_log.
  • partition_by: specifies the partition keys of the table. Default value: toYYYYMM(event_date).
  • engine: specifies the type of engine that is used by the table. This parameter is left empty by default.
    Notice Do not configure the query_thread_log.partition_by parameter if you configure the query_thread_log.engine parameter. Otherwise, an error may occur.
  • flush_interval_milliseconds: specifies the interval at which data in the memory is flushed to the table. Unit: milliseconds. Default value: 7500.
default_database The name of the default database. Default value: default.
http_server_default_response The page that is automatically returned when you access the HTTP server of the ClickHouse service.
display_name The name that is displayed in the prompt of the ClickHouse client. This parameter is left empty by default.
builtin_dictionaries_reload_interval The interval at which the built-in dictionary is reloaded. Unit: seconds. Default value: 3600.
umask The file permission mask. Default value: 027, which specifies that other operating system users cannot read files such as log files and data files, and that users in the same group can only read the files.
uncompressed_cache_size The size of the cache for uncompressed blocks if the MergeTree table engine is used. Default value: 0. If you use the default value, caching is disabled.

timezone The time zone of the ClickHouse server. Default value: Asia/Shanghai.
max_session_timeout The maximum session timeout. Unit: seconds. Default value: 3600.
default_session_timeout The default session timeout. Unit: seconds. Default value: 60.
max_open_files The maximum number of files that you can open. Default value: 262144.
Note The valid values of this parameter vary based on the operating system that you use. If you leave this parameter empty, ClickHouse uses the value of the max_open_files parameter that is configured for the operating system.
tmp_path The path that stores temporary data that is generated when large table queries are processed. You must add a forward slash (/) to the end of the path. Default value: /var/lib/clickhouse/tmp/.
max_concurrent_queries The maximum number of queries that can be processed in parallel. Default value: 100.
tcp_port_secure The TCP port that is used for secure communication with the ClickHouse client. This parameter is left empty by default.
Note Parameters related to OpenSSL are required only if you configure the tcp_port_secure parameter.
listen_try Specifies whether the ClickHouse server continues to run instead of exiting if a protocol, such as IPv4 or IPv6, that is specified by listen_host cannot be used. Valid values:
  • 0: The server exits. It is the default value.
  • 1: The server does not exit.
mysql_port The MySQL port that is used to communicate with the ClickHouse client.
keep_alive_timeout The period of time that the ClickHouse service waits for an incoming request before it closes the existing connection. Unit: seconds. Default value: 3.
max_connections The maximum number of connections allowed. Default value: 4096.
dns_cache_update_period The interval at which the IP addresses that are stored in the internal DNS cache of the ClickHouse service are updated. The update is performed asynchronously in a separate system thread. Unit: seconds. Default value: 15.

include_from The path of the file that contains substitutions. The configuration file of the ClickHouse server is written in XML. Some XML tags contain the incl attribute. The content of these tags is replaced with the corresponding configurations in the file that is specified by include_from. Default value: /etc/ecm/clickhouse-conf/clickhouse-server/metrika.xml.
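
The substitution mechanism can be sketched as follows. In open source ClickHouse, the attribute is named incl, and its value is the name of a section in the file that include_from points to. The tag list below is an assumption that mirrors the parameter names on the server-metrika tab. The tags that EMR actually writes may differ.

```xml
<yandex>
  <include_from>/etc/ecm/clickhouse-conf/clickhouse-server/metrika.xml</include_from>
  <!-- Content is taken from the <clickhouse_remote_servers> section of metrika.xml -->
  <remote_servers incl="clickhouse_remote_servers" />
  <!-- optional="true": a missing section in metrika.xml is not treated as an error -->
  <zookeeper incl="zookeeper_servers" optional="true" />
  <compression incl="clickhouse_compression" optional="true" />
</yandex>
```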
interserver_http_port The port that is used for data exchange between ClickHouse servers. Default value: 9009.
dictionaries_config The path that stores the configuration file of the external dictionary. The path can contain wildcards such as asterisks (*) and question marks (?). Default value: *_dictionary.xml.
http_port The HTTP port that is used to communicate with the ClickHouse server. Default value: 8123. The open source ClickHouse JDBC driver also uses this port to access a ClickHouse cluster. For more information, see clickhouse-jdbc.

users_config The path that stores the user configuration, access control configuration, resource limit configuration, and setting configuration files. Default value: users.xml.
dictionaries_lazy_load Specifies whether to delay the creation of a dictionary. Valid values:
  • true: A dictionary is created when it is used for the first time. If the dictionary fails to be created, an exception occurs in the function that uses the dictionary. true is the default value.
  • false: All dictionaries are created when the ClickHouse server is started. If an error occurs, the ClickHouse server directly exits.
listen_host The IP address on which the ClickHouse server listens. You can set this parameter to an IPv4 or IPv6 address. If you set this parameter to ::, the server accepts connections on all IPv4 and IPv6 addresses. You can configure multiple addresses. Separate multiple addresses with commas (,), such as 127.0.0.1,localhost. Default value: 0.0.0.0.
default_profile The name of the default profile. Default value: default.
mark_cache_size The approximate size of the cache that is used by the mark index if the MergeTree table engine is used. Default value: 5368709120. Unit: bytes.
listen_backlog The backlog of the listening socket, which is the maximum number of pending connections in the queue. Default value: 64.
format_schema_path The path that stores the schema of input data. Default value: /var/lib/clickhouse/format_schemas/.

server-metrika

The parameters on the server-metrika tab are used to generate the metrika.xml file. By default, the metrika.xml file is referenced by the config.xml file of the ClickHouse server. You can go to the ClickHouse service page of the EMR console, click server-metrika on the Configure tab, and then configure the following parameters.

Parameter Description
clickhouse_compression The data compression settings for tables that use the MergeTree engine. For more information, see Server Settings. This parameter is left empty by default.

You can configure this parameter if you want to enable data compression.
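
As a hedged example, a value for clickhouse_compression could look like the following, modeled on the compression example in the open source ClickHouse documentation. The thresholds and the zstd method are illustrative, not EMR recommendations. The outer tag is omitted, in line with the other sample values in this topic.

```xml
<case>
  <!-- Applies to data parts of at least about 10 GB ... -->
  <min_part_size>10000000000</min_part_size>
  <!-- ... that also make up at least 1% of the table size -->
  <min_part_size_ratio>0.01</min_part_size_ratio>
  <!-- Parts that meet both conditions are compressed with zstd -->
  <method>zstd</method>
</case>
```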

storage_configuration The custom disk information. Alibaba Cloud EMR automatically creates a ClickHouse data directory on each disk and creates a storage policy named hdd_in_order for the disks.
zookeeper_servers The information about ZooKeeper servers that are used to configure a ClickHouse cluster. The default value is the information of a ZooKeeper server that is created when you create a ClickHouse cluster. You can specify multiple ZooKeeper servers. Separate the information of the ZooKeeper servers with commas (,), such as emr-header-1.cluster-12345:2181,emr-worker-1.cluster-12345:2181,emr-worker-2.cluster-12345:2181.
quotas_default You can configure multiple quotas to flexibly adjust resource limits. This parameter specifies the value of the quota that is named default. You can add custom quota settings.
clickhouse_remote_servers The information about shards and replicas that you configure for a ClickHouse cluster. The default value is the topology that is generated based on the numbers of shards and replicas that are configured when you create the ClickHouse cluster.
The following sample code shows the value of the clickhouse_remote_servers parameter if you configure two shards and two replicas during the ClickHouse cluster creation:
<cluster_emr>
  <shard>
    <weight>1</weight>
    <internal_replication>true</internal_replication>
    <replica>
      <host>emr-header-1.cluster-12345</host>
      <port>9000</port>
    </replica>
    <replica>
      <host>emr-worker-1.cluster-12345</host>
      <port>9000</port>
    </replica>
  </shard>
  <shard>
    <weight>1</weight>
    <internal_replication>true</internal_replication>
    <replica>
      <host>emr-worker-2.cluster-12345</host>
      <port>9000</port>
    </replica>
    <replica>
      <host>emr-worker-3.cluster-12345</host>
      <port>9000</port>
    </replica>
  </shard>
</cluster_emr>
The following sample code shows the value of the storage_configuration parameter if a server contains four disks:
<disks>
  <disk1>
    <path>/mnt/disk1/clickhouse/</path>
    <keep_free_space_bytes>10485760</keep_free_space_bytes>
  </disk1>
  <disk2>
    <path>/mnt/disk2/clickhouse/</path>
    <keep_free_space_bytes>10485760</keep_free_space_bytes>
  </disk2>
  <disk3>
    <path>/mnt/disk3/clickhouse/</path>
    <keep_free_space_bytes>10485760</keep_free_space_bytes>
  </disk3>
  <disk4>
    <path>/mnt/disk4/clickhouse/</path>
    <keep_free_space_bytes>10485760</keep_free_space_bytes>
  </disk4>
</disks>
<policies>
  <hdd_in_order>
    <volumes>
      <single>
        <disk>disk1</disk>
        <disk>disk2</disk>
        <disk>disk3</disk>
        <disk>disk4</disk>
      </single>
    </volumes>
  </hdd_in_order>
</policies>

References

For more information about the ClickHouse parameters, see the following official documentation:

What to do next

For more information about how to modify or add parameters, see Manage parameters for services.