All Products
Search
Document Center

E-MapReduce:Use ClickHouse

Last Updated:Jul 05, 2023

This topic describes how to quickly write data to local tables of each node in a ClickHouse cluster in random mode.

Prerequisites

A ClickHouse cluster is created. For more information, see Create a ClickHouse cluster.

Procedure

  1. Log on to the ClickHouse cluster in SSH mode. For more information, see Log on to a cluster.

  2. Run the following command to download the official sample dataset:

    curl https://datasets.clickhouse.com/hits/tsv/hits_v1.tsv.xz  | unxz --threads=`nproc` > hits_v1.tsv
  3. Run the following command to start the ClickHouse client:

    clickhouse-client -h core-1-1 -m
    Note

    In the sample command, core-1-1 indicates the name of the core node that you log on to. If you have multiple core nodes, you can log on to one of the nodes.

  4. Run the following command to create a database.

    You can run the following command that contains the on CLUSTER parameter to create a database for all nodes in the cluster. By default, the cluster is named cluster_emr.

    CREATE DATABASE IF NOT EXISTS demo on CLUSTER cluster_emr;

    The following output is returned. Create Database

  5. Create a replicated table for all nodes in the cluster.

    The replicated table generates the same number of data replicas as the number of table replicas for each node in the cluster, and the data is consistent across all table replicas.

    CREATE TABLE demo.hits_local ON CLUSTER cluster_emr
    (
        `WatchID` UInt64,
        `JavaEnable` UInt8,
        `Title` String,
        `GoodEvent` Int16,
        `EventTime` DateTime,
        `EventDate` Date,
        `CounterID` UInt32,
        `ClientIP` UInt32,
        `ClientIP6` FixedString(16),
        `RegionID` UInt32,
        `UserID` UInt64,
        `CounterClass` Int8,
        `OS` UInt8,
        `UserAgent` UInt8,
        `URL` String,
        `Referer` String,
        `URLDomain` String,
        `RefererDomain` String,
        `Refresh` UInt8,
        `IsRobot` UInt8,
        `RefererCategories` Array(UInt16),
        `URLCategories` Array(UInt16),
        `URLRegions` Array(UInt32),
        `RefererRegions` Array(UInt32),
        `ResolutionWidth` UInt16,
        `ResolutionHeight` UInt16,
        `ResolutionDepth` UInt8,
        `FlashMajor` UInt8,
        `FlashMinor` UInt8,
        `FlashMinor2` String,
        `NetMajor` UInt8,
        `NetMinor` UInt8,
        `UserAgentMajor` UInt16,
        `UserAgentMinor` FixedString(2),
        `CookieEnable` UInt8,
        `JavascriptEnable` UInt8,
        `IsMobile` UInt8,
        `MobilePhone` UInt8,
        `MobilePhoneModel` String,
        `Params` String,
        `IPNetworkID` UInt32,
        `TraficSourceID` Int8,
        `SearchEngineID` UInt16,
        `SearchPhrase` String,
        `AdvEngineID` UInt8,
        `IsArtifical` UInt8,
        `WindowClientWidth` UInt16,
        `WindowClientHeight` UInt16,
        `ClientTimeZone` Int16,
        `ClientEventTime` DateTime,
        `SilverlightVersion1` UInt8,
        `SilverlightVersion2` UInt8,
        `SilverlightVersion3` UInt32,
        `SilverlightVersion4` UInt16,
        `PageCharset` String,
        `CodeVersion` UInt32,
        `IsLink` UInt8,
        `IsDownload` UInt8,
        `IsNotBounce` UInt8,
        `FUniqID` UInt64,
        `HID` UInt32,
        `IsOldCounter` UInt8,
        `IsEvent` UInt8,
        `IsParameter` UInt8,
        `DontCountHits` UInt8,
        `WithHash` UInt8,
        `HitColor` FixedString(1),
        `UTCEventTime` DateTime,
        `Age` UInt8,
        `Sex` UInt8,
        `Income` UInt8,
        `Interests` UInt16,
        `Robotness` UInt8,
        `GeneralInterests` Array(UInt16),
        `RemoteIP` UInt32,
        `RemoteIP6` FixedString(16),
        `WindowName` Int32,
        `OpenerName` Int32,
        `HistoryLength` Int16,
        `BrowserLanguage` FixedString(2),
        `BrowserCountry` FixedString(2),
        `SocialNetwork` String,
        `SocialAction` String,
        `HTTPError` UInt16,
        `SendTiming` Int32,
        `DNSTiming` Int32,
        `ConnectTiming` Int32,
        `ResponseStartTiming` Int32,
        `ResponseEndTiming` Int32,
        `FetchTiming` Int32,
        `RedirectTiming` Int32,
        `DOMInteractiveTiming` Int32,
        `DOMContentLoadedTiming` Int32,
        `DOMCompleteTiming` Int32,
        `LoadEventStartTiming` Int32,
        `LoadEventEndTiming` Int32,
        `NSToDOMContentLoadedTiming` Int32,
        `FirstPaintTiming` Int32,
        `RedirectCount` Int8,
        `SocialSourceNetworkID` UInt8,
        `SocialSourcePage` String,
        `ParamPrice` Int64,
        `ParamOrderID` String,
        `ParamCurrency` FixedString(3),
        `ParamCurrencyID` UInt16,
        `GoalsReached` Array(UInt32),
        `OpenstatServiceName` String,
        `OpenstatCampaignID` String,
        `OpenstatAdID` String,
        `OpenstatSourceID` String,
        `UTMSource` String,
        `UTMMedium` String,
        `UTMCampaign` String,
        `UTMContent` String,
        `UTMTerm` String,
        `FromTag` String,
        `HasGCLID` UInt8,
        `RefererHash` UInt64,
        `URLHash` UInt64,
        `CLID` UInt32,
        `YCLID` UInt64,
        `ShareService` String,
        `ShareURL` String,
        `ShareTitle` String,
        `ParsedParams` Nested(Key1 String,Key2 String,Key3 String,Key4 String,Key5 String,ValueDouble Float64),
        `IslandID` FixedString(16),
        `RequestNum` UInt32,
        `RequestTry` UInt8
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/{database}/hits_local', '{replica}')
    PARTITION BY toYYYYMM(EventDate)
    ORDER BY (CounterID, EventDate, intHash32(UserID))
    SAMPLE BY intHash32(UserID);
    Note

    The {shard} and {replica} parameters are macros that are automatically generated by Alibaba Cloud E-MapReduce (EMR) for ClickHouse clusters and can be used directly.

  6. Run the following command to create a distributed table.

    A distributed table does not store data. It is only a view of an underlying table. You can use a distributed table to perform distributed queries on multiple servers. In this example, the random function rand() is used. The function ensures that data can be randomly written to local tables of each node.

    CREATE TABLE demo.hits_all on CLUSTER cluster_emr AS demo.hits_local 
    ENGINE = Distributed(cluster_emr, demo, hits_local, rand());
  7. Exit the ClickHouse client and run the following command in the directory in which the sample data is stored to import data:

    clickhouse-client -h core-1-1 --query "INSERT INTO demo.hits_all FORMAT TSV" --max_insert_block_size=100000 < hits_v1.tsv;
  8. Restart the ClickHouse client and view data.

    Data is written at random. Therefore, the amount of data may vary on each node.

    • Run the following command to view the amount of data in the demo.hits_all table on the core-1-1 node:

      select count(*) from demo.hits_all;
    • Run the following command to view the amount of data in the demo.hits_local table on the core-1-1 node:

      select count(*) from demo.hits_local;
    • View the amount of data in the demo.hits_local table on the core-1-2 node.

      Note

      You can perform the following operations to view the amount of data in the demo.hits_local table on the other nodes. You can view the node name on the Nodes tab of a cluster in the E-MapReduce console.

      1. Run the following command to log on to the ClickHouse client:

        clickhouse-client -h core-1-2 -m
      2. Run the following command on the ClickHouse client to view the amount of data in the demo.hits_local table:

        select count(*) from demo.hits_local;