Impala is a real-time SQL query engine that features high performance and low latency. You can use SELECT clauses, JOIN clauses, and aggregate functions in Impala to query data stored in Hadoop Distributed File System (HDFS) or HBase in real time.

Background information

Impala uses the same metadata, SQL syntax (Hive SQL), and Open Database Connectivity (ODBC) driver as Apache Hive to provide a familiar and unified platform for batch processing or real-time queries.

Precautions

If you want to delete a partition of a Hive table while Impala is in use, run a command in the Impala or Hive CLI to delete the partition. If you delete the partition directory directly from the file system instead, the Hive table becomes unavailable.
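
As a minimal sketch of the recommended approach, the following Python helper builds an ALTER TABLE ... DROP PARTITION statement that you could then run in the Impala or Hive CLI. The table name sales_logs and the partition columns dt and region are hypothetical and are used only for illustration. The helper assumes that partition values contain no single quotation marks.

```python
def drop_partition_sql(table, partition):
    """Build an ALTER TABLE ... DROP PARTITION statement.

    `table` and the keys of `partition` are hypothetical names used for
    illustration; replace them with your own table and partition columns.
    """
    def literal(value):
        # Numbers are emitted bare; everything else is single-quoted.
        if isinstance(value, (int, float)):
            return str(value)
        text = str(value)
        if "'" in text:
            # Assumption: values with quotes are rejected rather than escaped.
            raise ValueError("partition values must not contain single quotes")
        return "'" + text + "'"

    spec = ", ".join(f"{col}={literal(val)}" for col, val in partition.items())
    return f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({spec})"


stmt = drop_partition_sql("sales_logs", {"dt": "2023-01-01", "region": "cn"})
print(stmt)
```

You can pass the printed statement to the CLI, for example with the -q option of impala-shell. Because the partition is removed through the SQL layer, the table metadata stays consistent with the data files.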

Benefits

To minimize latency, Impala bypasses MapReduce and accesses data through a distributed query engine that is similar to those in a relational database management system (RDBMS). Depending on the query type and configuration, this engine can be an order of magnitude faster than Hive. The following figure shows the architecture of Impala.
Compared with Hadoop, Impala brings the following benefits for SQL queries:
  • Local processing is performed on data nodes, which helps avoid network bottlenecks.
  • You do not need to convert data formats, so no conversion overhead is incurred.
  • You can use a single, open, and unified metadata storage.
  • All data can be immediately queried without delays for extract, transform, load (ETL) operations.
  • All hardware is used for Impala queries and MapReduce.
  • Only a single machine pool is required to scale.

For more information about Impala, see Apache Impala.

Impala components

The following figure shows the components of Impala in an E-MapReduce (EMR) cluster.
Impala consists of the following components:
  • Impalad

    Impalad processes are deployed on core and task nodes and can be scaled.

    The core component of Impala is Impala Daemon, which runs on each node as a process named Impalad. Impala Daemon reads data from and writes data to files, receives query statements submitted by using the impala-shell command or from Hue, Java Database Connectivity (JDBC), or ODBC, parallelizes the queries, and distributes tasks to the Impala nodes of the cluster. Impala Daemon also returns locally calculated query results to the coordinator node.

  • Statestored

    The Statestored process is deployed on the emr-header-1 node.

    StateStore Daemon is represented by a process named Statestored. StateStore Daemon monitors the health status of all Impalad processes in a cluster and forwards the status results to all Impalad processes. If an Impalad process becomes unavailable due to node exceptions, network exceptions, or software issues, StateStore Daemon notifies the other Impalad processes. As a result, new query requests are not delivered to the unavailable Impalad process.

  • Catalogd

    The Catalogd process is deployed on the emr-header-1 node.

    Catalog Daemon is represented by a process named Catalogd. Catalog Daemon relays metadata changes made through one Impalad process to all other Impalad processes in the same cluster. We recommend that you run StateStore Daemon and Catalog Daemon on the same node because these requests are delivered by using the Statestored process.

  • HAProxy

    HAProxy is deployed on the master node. HAProxy connects to the Impala nodes of a cluster and forwards query requests to the Impala nodes.
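
The health tracking described for Statestored can be illustrated with a toy model. The following Python sketch is not Impala code. The Statestore class, the assign_fragments function, the node names, and the round-robin policy are all assumptions made for illustration. It only shows the idea that work is distributed to Impalad processes that are currently reported as healthy.

```python
class Statestore:
    """Toy stand-in for Statestored: tracks the health of Impalad processes."""

    def __init__(self, impalads):
        # Every process is assumed healthy until reported otherwise.
        self._healthy = {name: True for name in impalads}

    def mark(self, name, healthy):
        # Record a health change for one Impalad process.
        self._healthy[name] = healthy

    def live(self):
        # Return the processes that are currently healthy, in insertion order.
        return [name for name, ok in self._healthy.items() if ok]


def assign_fragments(fragments, live_impalads):
    """Spread query fragments over healthy processes (round-robin assumption)."""
    if not live_impalads:
        raise RuntimeError("no healthy Impalad process is available")
    plan = {name: [] for name in live_impalads}
    for i, fragment in enumerate(fragments):
        plan[live_impalads[i % len(live_impalads)]].append(fragment)
    return plan


store = Statestore(["impalad-1", "impalad-2", "impalad-3"])
store.mark("impalad-2", False)  # simulate a node or network exception
plan = assign_fragments(["scan-a", "scan-b", "scan-c", "scan-d"], store.live())
print(plan)
```

In this sketch, impalad-2 receives no work after it is marked unhealthy, which mirrors the behavior described for new query requests.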

Quick start