This topic describes how to use Kudu in E-MapReduce, and its scenarios and architecture.

Overview

Kudu is a columnar storage manager developed for the Apache Hadoop platform. It completes the storage layer in the Hadoop ecosystem to enable fast analytics on rapidly changing data. You can use Kudu when you want to:

  • Quickly read and write data by using HBase. HBase also allows you to modify data.
  • Query and analyze tremendous Parquet data or large amounts of data stored in Hadoop Distributed File System (HDFS) by using Hive, Spark, or Impala. Parquet, the column-oriented data storage format, is well fit to this scenario because it is compatible with the preceding three SQL engines.

By using Kudu, you do not need to deploy HDFS and HBase clusters at the same time. This reduces the cost for operations and maintenance.

Typical scenarios

  • Real-time computing

    Real-time computing business needs to continuously import large amounts of data and read, write, and update the imported data in a quasi-real-time manner. Kudu can meet such requirements of real-time computing.

  • Time series data

    Time series data is a series of data points generated in the time order. You can use time series data to check metrics and predict the performance trend. For example, by storing data of customers' purchase and browse history in the time order, you can predict customers' purchase trend and analyze their purchase behavior. When you analyze the stored data, some data may be modified and new data may be generated based on customers' operations. The modified data and new data must be analyzed in a timely manner. Kudu offers a scalable and efficient solution to access and handle such data.

  • Prediction model

    A prediction model is built based on large amounts of data. The model is always changing because data on which the model depends is always changing. If you store the prediction model and relevant data in HDFS, HDFS needs to spend a few hours or even a whole day in rebuilding the model each time data changes. Kudu offers a more efficient solution. By using Kudu, you only need to update data and query data again when data changes. Then, Kudu updates the prediction model in seconds or minutes.

  • Tremendous baseline data

    A large amount of baseline data exists in the production environment. The baseline data may be stored in HDFS, Kudu, or a relational database management system (RDBMS). Impala allows you to use multiple underlying storage engines at the same time, such as HDFS and Kudu. When you use Kudu to process data from another storage engine, you do not need to migrate data to Kudu.

Basic architecture

Kudu has two types of components: master server and tablet server. Master servers manage metadata, including information about tablet servers and tablets. Master servers use the Raft protocol to implement high availability. Tablet servers store tablets and their replicas. Replicas of each tablet are stored on different tablet servers, which implement high availability of these tablet replicas by using the Raft protocol. The following figure shows the basic architecture of a Kudu cluster.

Basic architecture

Create a Kudu cluster

You can create a Kudu cluster as a service component of Hadoop clusters in E-MapReduce V3.22.0 and later. If you select the Kudu service when you create a Hadoop cluster in E-MapReduce, a Kudu cluster is created. By default, the Kudu cluster contains three master servers and supports high availability.

Create a Kudu cluster

Integrate Kudu with Impala

To integrate Kudu with Impala, add the following configuration to the Impala profile: kudu_master_hosts= [master1][:port],[master2][:port],[master3][:port], or set the kudu.master_addresses parameter in the TBLPROPERTIES statement to specify the Kudu cluster. For example, you can set the kudu.master_addresses parameter in the following SQL statements:

CREATE TABLE my_first_table
    (
      id BIGINT,
      name STRING,
      PRIMARY KEY(id)
    )
    PARTITION BY HASH PARTITIONS 16
    STORED AS KUDU
    TBLPROPERTIES(
      'kudu.master_addresses' = 'master1:7051,master2:7051,master3:7051');

Common commands

  • Check the health of a Kudu cluster
    kudu cluster ksck <master_addresses>
  • Dump available metrics of Kudu servers
    kudu-tserver --dump_metrics_json
    kudu-master --dump_metrics_json
  • Query available tables
    kudu table list <master_addresses>
  • Dump the content of a column file (CFile)
    kudu fs dump cfile <block_id>
  • Dump the tree of a Kudu file system
    kudu fs dump tree [-fs_wal_dir=<dir>] [-fs_data_dirs=<dirs>]