Apache Flume is a distributed, reliable, and highly available system. You can use Apache Flume to collect, aggregate, and move large amounts of log data and store the data in a centralized manner. Various data sources are supported.

In E-MapReduce (EMR) V3.19.0 and later, you can configure and manage Flume agents in the EMR console.

Scenarios

In most cases, Flume is used to collect log data. You can also customize Flume sources to collect events from various external data sources.

Flume stores data on a real-time computing platform, an offline computing platform, or a storage system for subsequent data analysis and cleansing. A real-time computing platform can be Flink, Spark Streaming, or Storm. An offline computing platform can be MapReduce, Hive, or Presto. A storage system can be HDFS, Object Storage Service (OSS), Kafka, or Elasticsearch.

Architecture

A Flume agent is an instance of Flume. It is essentially a Java Virtual Machine (JVM) process that controls the transmission of events from producers to consumers. A Flume agent contains one or more sources, channels, and sinks. One source can connect to multiple channels. One channel can connect to multiple sinks. Flume

Terms

Term Description
event The basic unit of data that flows through a Flume agent. An event consists of a byte array of data and an optional set of string attributes that are added as headers.
Example:
--------------------------------
| Header (Map) | Body (byte[]) |
--------------------------------
               Flume Event
source A data collector. It collects events from an external data source and sends multiple events to one or more channels at the same time.
Common sources:
  • Avro Source: listens on an Avro port to receive events from an Avro client. Avro is a data serialization framework developed in an Apache Hadoop project.
  • Exec Source: runs a specified command, such as tail -f /var/log/messages, to obtain standard command output.
  • NetCat TCP Source: listens on a specified TCP port to obtain data. It provides similar functionality as Netcat UDP Source.
  • Taildir Source: monitors multiple files stored in a directory and records the offset. If you use Taildir Source, no data is lost. Taildir Source is a commonly used source.
channel A channel is located between a source and a sink to cache events.
Common channels:
  • Memory Channel: functions as an in-memory queue to cache events. This component features high performance and is commonly used.
  • File Channel: caches events to data files and writes out checkpoints. This component features high reliability but has poor performance.
  • JDBC Channel: caches events to relational databases.
  • Kafka Channel: caches events in Kafka.
sink Obtains events from a channel and commits the events as transactions to an external storage. After an event is committed to an external storage, the event is removed from the channel.
Common sinks:
  • Logger Sink: is used for testing.
  • Avro Sink: converts received events into Avro events and sends the Avro events to the source of the next Flume agent in a data flow.
  • HDFS Sink: writes events to Hadoop Distributed File System (HDFS). This component is commonly used.
  • Hive Sink: writes events as transactions to Hive tables or partitions.
  • Kafka Sink: writes events to Kafka.