Impala is a real-time SQL query engine that features high performance and low latency. You can use Impala to query Apache Hadoop data in real time.

Background information

Impala uses the same metadata, SQL syntax (Hive SQL), and ODBC driver as Apache Hive to provide a familiar and unified platform for real-time or batch-oriented queries.


To avoid the latency of MapReduce jobs, Impala accesses data through its own distributed query engine, which is similar to the engines used in relational database management systems (RDBMSs). Actual performance depends on the query type and the configuration, but Impala can be an order of magnitude faster than Hive. The following figure shows the Impala architecture.

[Figure: Impala architecture]

For more information about Impala, see Apache Impala.

Impala components

Impala consists of the following components:
  • Impala Daemon

    The core component of Impala is the Impala Daemon, which runs on each node as a process named Impalad. The Impala Daemon reads and writes data files, receives queries submitted by using the impala-shell command or from Hue, JDBC, or ODBC, parallelizes the queries, and distributes tasks across the Impala nodes of the cluster. Each Impala Daemon also returns its locally computed query results to the coordinator node, which is the node whose Impala Daemon received the query.

  • StateStore Daemon

    The StateStore Daemon runs as a process named Statestored. It monitors the health of all Impalad processes in a cluster and forwards the results to every Impalad process. If an Impalad process becomes unavailable because of a node failure, a network failure, or a software problem, the StateStore Daemon notifies the other Impalad processes. New query requests are then no longer delivered to the unavailable Impalad process.

  • Catalog Daemon

    The Catalog Daemon runs as a process named Catalogd. It relays metadata changes made through one Impalad process to the other Impalad processes in the same cluster. We recommend that you run the StateStore Daemon and the Catalog Daemon on the same node because these metadata update requests are delivered through the Statestored process.
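
As a rough illustration of how the Impala Daemon and the StateStore Daemon cooperate, the following Python sketch models a coordinator that asks a health registry for the live daemons, assigns data blocks only to those daemons, and merges their partial results. This is a toy model, not Impala code: the class names only mirror the daemon names, and run_fragment, live_members, coordinate_avg, the healthy flag, and the round-robin block assignment are all assumptions made for the example. The Catalog Daemon is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Impalad:
    """Toy stand-in for one Impala Daemon (hypothetical model)."""
    name: str
    healthy: bool = True

    def run_fragment(self, blocks):
        # Compute a partial result (sum, count) over the assigned blocks only.
        values = [v for block in blocks for v in block]
        return sum(values), len(values)


@dataclass
class StateStore:
    """Toy stand-in for the StateStore Daemon: knows which daemons are up."""
    daemons: list = field(default_factory=list)

    def live_members(self):
        # In this sketch a health check is just reading a flag.
        return [d for d in self.daemons if d.healthy]


def coordinate_avg(statestore, blocks):
    """Act as coordinator: fan out to live daemons and merge partial results."""
    live = statestore.live_members()
    if not live:
        raise RuntimeError("no live Impalad process available")
    # Assumed policy: hand out blocks round-robin across the live daemons.
    assignment = {d.name: blocks[i::len(live)] for i, d in enumerate(live)}
    partials = [d.run_fragment(assignment[d.name]) for d in live]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count, assignment


if __name__ == "__main__":
    statestore = StateStore([Impalad("node-1"), Impalad("node-2"), Impalad("node-3")])
    blocks = [[10, 20], [30], [40, 50], [60]]  # shared data, split into blocks

    avg, plan = coordinate_avg(statestore, blocks)
    print(avg, sorted(plan))

    statestore.daemons[1].healthy = False  # simulate a failed node
    avg, plan = coordinate_avg(statestore, blocks)
    print(avg, sorted(plan))
```

In this model the blocks live in shared storage, so marking node-2 as unhealthy changes only which daemons receive work. The computed average is the same before and after the failure.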

Quick start