This topic describes how to quickly write data to local tables of each node in a ClickHouse cluster in random mode.

Prerequisites

A ClickHouse cluster is created. For more information, see Create a cluster.

Procedure

  1. Log on to the ClickHouse cluster in SSH mode. For more information, see Log on to a cluster.
  2. Run the following command to download the official sample dataset:
    curl  https://datasets.clickhouse.com/hits/tsv/hits_v1.tsv.xz  | unxz --threads=`nproc` > hits_v1.tsv
  3. Run the following command to start the ClickHouse client:
    clickhouse-client -m
  4. Run the following command to create a database.
    You can run the following command that contains the on CLUSTER parameter to create a database for all nodes in the cluster. By default, the cluster is named cluster_emr.
    CREATE DATABASE IF NOT EXISTS demo on CLUSTER cluster_emr;
    The following output is returned. Create database
  5. Create a replicated table for all nodes in the cluster.

    The replicated table generates the same number of data replicas as the number of table replicas for each node in the cluster, and the data is consistent across all table replicas.

    CREATE TABLE demo.hits_local ON CLUSTER cluster_emr
    (
        `WatchID` UInt64,
        `JavaEnable` UInt8,
        `Title` String,
        `GoodEvent` Int16,
        `EventTime` DateTime,
        `EventDate` Date,
        `CounterID` UInt32,
        `ClientIP` UInt32,
        `ClientIP6` FixedString(16),
        `RegionID` UInt32,
        `UserID` UInt64,
        `CounterClass` Int8,
        `OS` UInt8,
        `UserAgent` UInt8,
        `URL` String,
        `Referer` String,
        `URLDomain` String,
        `RefererDomain` String,
        `Refresh` UInt8,
        `IsRobot` UInt8,
        `RefererCategories` Array(UInt16),
        `URLCategories` Array(UInt16),
        `URLRegions` Array(UInt32),
        `RefererRegions` Array(UInt32),
        `ResolutionWidth` UInt16,
        `ResolutionHeight` UInt16,
        `ResolutionDepth` UInt8,
        `FlashMajor` UInt8,
        `FlashMinor` UInt8,
        `FlashMinor2` String,
        `NetMajor` UInt8,
        `NetMinor` UInt8,
        `UserAgentMajor` UInt16,
        `UserAgentMinor` FixedString(2),
        `CookieEnable` UInt8,
        `JavascriptEnable` UInt8,
        `IsMobile` UInt8,
        `MobilePhone` UInt8,
        `MobilePhoneModel` String,
        `Params` String,
        `IPNetworkID` UInt32,
        `TraficSourceID` Int8,
        `SearchEngineID` UInt16,
        `SearchPhrase` String,
        `AdvEngineID` UInt8,
        `IsArtifical` UInt8,
        `WindowClientWidth` UInt16,
        `WindowClientHeight` UInt16,
        `ClientTimeZone` Int16,
        `ClientEventTime` DateTime,
        `SilverlightVersion1` UInt8,
        `SilverlightVersion2` UInt8,
        `SilverlightVersion3` UInt32,
        `SilverlightVersion4` UInt16,
        `PageCharset` String,
        `CodeVersion` UInt32,
        `IsLink` UInt8,
        `IsDownload` UInt8,
        `IsNotBounce` UInt8,
        `FUniqID` UInt64,
        `HID` UInt32,
        `IsOldCounter` UInt8,
        `IsEvent` UInt8,
        `IsParameter` UInt8,
        `DontCountHits` UInt8,
        `WithHash` UInt8,
        `HitColor` FixedString(1),
        `UTCEventTime` DateTime,
        `Age` UInt8,
        `Sex` UInt8,
        `Income` UInt8,
        `Interests` UInt16,
        `Robotness` UInt8,
        `GeneralInterests` Array(UInt16),
        `RemoteIP` UInt32,
        `RemoteIP6` FixedString(16),
        `WindowName` Int32,
        `OpenerName` Int32,
        `HistoryLength` Int16,
        `BrowserLanguage` FixedString(2),
        `BrowserCountry` FixedString(2),
        `SocialNetwork` String,
        `SocialAction` String,
        `HTTPError` UInt16,
        `SendTiming` Int32,
        `DNSTiming` Int32,
        `ConnectTiming` Int32,
        `ResponseStartTiming` Int32,
        `ResponseEndTiming` Int32,
        `FetchTiming` Int32,
        `RedirectTiming` Int32,
        `DOMInteractiveTiming` Int32,
        `DOMContentLoadedTiming` Int32,
        `DOMCompleteTiming` Int32,
        `LoadEventStartTiming` Int32,
        `LoadEventEndTiming` Int32,
        `NSToDOMContentLoadedTiming` Int32,
        `FirstPaintTiming` Int32,
        `RedirectCount` Int8,
        `SocialSourceNetworkID` UInt8,
        `SocialSourcePage` String,
        `ParamPrice` Int64,
        `ParamOrderID` String,
        `ParamCurrency` FixedString(3),
        `ParamCurrencyID` UInt16,
        `GoalsReached` Array(UInt32),
        `OpenstatServiceName` String,
        `OpenstatCampaignID` String,
        `OpenstatAdID` String,
        `OpenstatSourceID` String,
        `UTMSource` String,
        `UTMMedium` String,
        `UTMCampaign` String,
        `UTMContent` String,
        `UTMTerm` String,
        `FromTag` String,
        `HasGCLID` UInt8,
        `RefererHash` UInt64,
        `URLHash` UInt64,
        `CLID` UInt32,
        `YCLID` UInt64,
        `ShareService` String,
        `ShareURL` String,
        `ShareTitle` String,
        `ParsedParams` Nested(Key1 String,Key2 String,Key3 String,Key4 String,Key5 String,ValueDouble Float64),
        `IslandID` FixedString(16),
        `RequestNum` UInt32,
        `RequestTry` UInt8
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/{database}/hits_local', '{replica}')
    PARTITION BY toYYYYMM(EventDate)
    ORDER BY (CounterID, EventDate, intHash32(UserID))
    SAMPLE BY intHash32(UserID);
    Note The {shard} and {replica} parameters are macros that are automatically generated by Alibaba Cloud EMR for ClickHouse clusters and can be used directly.
    The following output is returned. Create table
  6. Run the following command to create a distributed table.

    A distributed table does not store data. It is only a view of an underlying table. You can use a distributed table to perform distributed queries on multiple servers. In this example, the random function rand() is used. The function ensures that data can be randomly written to local tables of each node.

    CREATE TABLE demo.hits_all on CLUSTER cluster_emr AS demo.hits_local 
    ENGINE = Distributed(cluster_emr, demo, hits_local, rand());
  7. Exit the ClickHouse client and run the following command in the directory in which the sample data is stored to import data:
    clickhouse-client --query "INSERT INTO demo.hits_all FORMAT TSV" --max_insert_block_size=100000 < hits_v1.tsv;
  8. Restart the ClickHouse client and view data.

    Data is written at random. Therefore, the amount of data may vary on each node.

    • Run the following command to view the amount of data in the demo.hits_all table on the emr-header-1 node:
      select count(*) from demo.hits_all;
      The following output is returned:
      ┌─count() ─┐
      │  8873898   │
      └──────┘
    • Run the following command to view the amount of data in the demo.hits_local table on the emr-header-1 node:
      select count(*) from demo.hits_local;
      The following output is returned:
      ┌─count() ─┐
      │  4434571   │
      └──────┘
    • View the amount of data in the demo.hits_local table on the emr-worker-1 node.
      Note You can perform the following operations to view the amount of data in the demo.hits_local table on the other nodes.
      1. Run the following command to switch to the hadoop user on the master node:
        su hadoop
      2. Run the following command to switch to the emr-worker-1 node:
        ssh emr-worker-1;
      3. Run the following command on the ClickHouse client to view the amount of data in the demo.hits_local table:
        select count(*) from demo.hits_local;
        The following output is returned:
        ┌─count() ─┐
        │  4437327   │
        └──────┘