全部產品
Search
文件中心

E-MapReduce:快速使用ClickHouse

更新時間:Jul 01, 2024

本文通過樣本為您介紹,如何快速將資料隨機寫入ClickHouse叢集各個節點的本地表。

前提條件

已建立ClickHouse叢集,詳情請參見建立ClickHouse叢集

操作步驟

  1. 使用SSH方式登入ClickHouse叢集,詳情請參見登入叢集

  2. 執行以下命令,下載官方範例資料集。

    curl https://datasets.clickhouse.com/hits/tsv/hits_v1.tsv.xz  | unxz --threads=`nproc` > hits_v1.tsv
  3. 執行如下命令,啟動ClickHouse用戶端。

    clickhouse-client -h core-1-1 -m
    說明

    本樣本登入core-1-1節點,如果您有多個Core節點,可以登入任意一個節點。

  4. 執行如下命令,建立資料庫。

    可以使用on CLUSTER參數在叢集的所有節點建立資料庫,預設叢集標識為cluster_emr。

    CREATE DATABASE IF NOT EXISTS demo on CLUSTER cluster_emr;

    返回資訊如下所示。Create Database

  5. 在叢集上的所有節點建立一張複製表(Replicated表)。

    複製表(Replicated表)會根據副本的個數,實現資料的多副本,並實現資料的最終一致性。

    CREATE TABLE demo.hits_local ON CLUSTER cluster_emr
    (
        `WatchID` UInt64,
        `JavaEnable` UInt8,
        `Title` String,
        `GoodEvent` Int16,
        `EventTime` DateTime,
        `EventDate` Date,
        `CounterID` UInt32,
        `ClientIP` UInt32,
        `ClientIP6` FixedString(16),
        `RegionID` UInt32,
        `UserID` UInt64,
        `CounterClass` Int8,
        `OS` UInt8,
        `UserAgent` UInt8,
        `URL` String,
        `Referer` String,
        `URLDomain` String,
        `RefererDomain` String,
        `Refresh` UInt8,
        `IsRobot` UInt8,
        `RefererCategories` Array(UInt16),
        `URLCategories` Array(UInt16),
        `URLRegions` Array(UInt32),
        `RefererRegions` Array(UInt32),
        `ResolutionWidth` UInt16,
        `ResolutionHeight` UInt16,
        `ResolutionDepth` UInt8,
        `FlashMajor` UInt8,
        `FlashMinor` UInt8,
        `FlashMinor2` String,
        `NetMajor` UInt8,
        `NetMinor` UInt8,
        `UserAgentMajor` UInt16,
        `UserAgentMinor` FixedString(2),
        `CookieEnable` UInt8,
        `JavascriptEnable` UInt8,
        `IsMobile` UInt8,
        `MobilePhone` UInt8,
        `MobilePhoneModel` String,
        `Params` String,
        `IPNetworkID` UInt32,
        `TraficSourceID` Int8,
        `SearchEngineID` UInt16,
        `SearchPhrase` String,
        `AdvEngineID` UInt8,
        `IsArtifical` UInt8,
        `WindowClientWidth` UInt16,
        `WindowClientHeight` UInt16,
        `ClientTimeZone` Int16,
        `ClientEventTime` DateTime,
        `SilverlightVersion1` UInt8,
        `SilverlightVersion2` UInt8,
        `SilverlightVersion3` UInt32,
        `SilverlightVersion4` UInt16,
        `PageCharset` String,
        `CodeVersion` UInt32,
        `IsLink` UInt8,
        `IsDownload` UInt8,
        `IsNotBounce` UInt8,
        `FUniqID` UInt64,
        `HID` UInt32,
        `IsOldCounter` UInt8,
        `IsEvent` UInt8,
        `IsParameter` UInt8,
        `DontCountHits` UInt8,
        `WithHash` UInt8,
        `HitColor` FixedString(1),
        `UTCEventTime` DateTime,
        `Age` UInt8,
        `Sex` UInt8,
        `Income` UInt8,
        `Interests` UInt16,
        `Robotness` UInt8,
        `GeneralInterests` Array(UInt16),
        `RemoteIP` UInt32,
        `RemoteIP6` FixedString(16),
        `WindowName` Int32,
        `OpenerName` Int32,
        `HistoryLength` Int16,
        `BrowserLanguage` FixedString(2),
        `BrowserCountry` FixedString(2),
        `SocialNetwork` String,
        `SocialAction` String,
        `HTTPError` UInt16,
        `SendTiming` Int32,
        `DNSTiming` Int32,
        `ConnectTiming` Int32,
        `ResponseStartTiming` Int32,
        `ResponseEndTiming` Int32,
        `FetchTiming` Int32,
        `RedirectTiming` Int32,
        `DOMInteractiveTiming` Int32,
        `DOMContentLoadedTiming` Int32,
        `DOMCompleteTiming` Int32,
        `LoadEventStartTiming` Int32,
        `LoadEventEndTiming` Int32,
        `NSToDOMContentLoadedTiming` Int32,
        `FirstPaintTiming` Int32,
        `RedirectCount` Int8,
        `SocialSourceNetworkID` UInt8,
        `SocialSourcePage` String,
        `ParamPrice` Int64,
        `ParamOrderID` String,
        `ParamCurrency` FixedString(3),
        `ParamCurrencyID` UInt16,
        `GoalsReached` Array(UInt32),
        `OpenstatServiceName` String,
        `OpenstatCampaignID` String,
        `OpenstatAdID` String,
        `OpenstatSourceID` String,
        `UTMSource` String,
        `UTMMedium` String,
        `UTMCampaign` String,
        `UTMContent` String,
        `UTMTerm` String,
        `FromTag` String,
        `HasGCLID` UInt8,
        `RefererHash` UInt64,
        `URLHash` UInt64,
        `CLID` UInt32,
        `YCLID` UInt64,
        `ShareService` String,
        `ShareURL` String,
        `ShareTitle` String,
        `ParsedParams` Nested(Key1 String,Key2 String,Key3 String,Key4 String,Key5 String,ValueDouble Float64),
        `IslandID` FixedString(16),
        `RequestNum` UInt32,
        `RequestTry` UInt8
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/{database}/hits_local', '{replica}')
    PARTITION BY toYYYYMM(EventDate)
    ORDER BY (CounterID, EventDate, intHash32(UserID))
    SAMPLE BY intHash32(UserID);
    說明

    {shard}{replica}是阿里雲EMR為ClickHouse叢集自動產生的宏定義,可以直接使用。

  6. 執行以下命令,建立分布式(Distributed)表。

    分布式表不儲存資料,僅僅是底層表的一個View,但可以在多個伺服器上進行分散式查詢。本例中使用隨機函數rand(),表示資料會隨機寫入各個節點的本地表。

    CREATE TABLE demo.hits_all on CLUSTER cluster_emr AS demo.hits_local 
    ENGINE = Distributed(cluster_emr, demo, hits_local, rand());
  7. 退出ClickHouse用戶端,在範例資料的目錄下執行以下命令,匯入資料。

    clickhouse-client -h core-1-1 --query "INSERT INTO demo.hits_all FORMAT TSV" --max_insert_block_size=100000 < hits_v1.tsv;
  8. 重新啟動ClickHouse用戶端,查看資料。

    因為資料是隨機寫入的,各節點資料量可能不同。

    • 查看core-1-1節點demo.hits_all的資料量。

      select count(*) from demo.hits_all;
    • 查看core-1-1節點demo.hits_local的資料量。

      select count(*) from demo.hits_local;
    • 查看core-1-2節點demo.hits_local的資料量。

      說明

      其餘節點,您也可以按照以下步驟來查看demo.hits_local的資料量。節點名稱您可以在EMR控制台的節點管理頁面查看。

      1. 執行以下命令,登入ClickHouse用戶端。

        clickhouse-client -h core-1-2 -m
      2. 在ClickHouse用戶端,執行以下命令,查看demo.hits_local的資料量。

        select count(*) from demo.hits_local;