Apache Flume is a distributed, reliable, and highly available system for collecting, aggregating, and moving large amounts of log data from a wide variety of data sources to centralized storage.
Scenarios
In most cases, Flume is used to collect log data. You can also customize Flume sources to collect events from various external data sources.
Flume can deliver data to a real-time computing platform, an offline computing platform, or a storage system for subsequent analysis and cleansing. A real-time computing platform can be Flink, Spark Streaming, or Storm. An offline computing platform can be MapReduce, Hive, or Presto. A storage system can be Hadoop Distributed File System (HDFS), Object Storage Service (OSS), Kafka, or Elasticsearch.
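As an illustration of the log-collection scenario above, the following is a sketch of an agent configuration that tails application log files and writes them to HDFS. The agent name (a1), file paths, and NameNode address are placeholders; adjust them to your environment.

```properties
# Illustrative agent: tail local log files and deliver them to HDFS.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Tail application log files as they are written.
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /var/log/app/.*log
a1.sources.r1.channels = c1

# Buffer events durably on local disk.
a1.channels.c1.type = file

# Write the collected logs to HDFS, partitioned by date.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/logs/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```

An agent with this configuration would typically be started with `bin/flume-ng agent -n a1 -c conf -f <path-to-this-file>`.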
Architecture
A Flume agent is an instance of Flume. It is essentially a Java Virtual Machine (JVM) process that controls the transmission of events from producers to consumers. A Flume agent contains one or more sources, channels, and sinks. One source can connect to multiple channels, and one channel can connect to multiple sinks.
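The fan-out described above (one source feeding multiple channels) can be sketched with Flume's replicating channel selector, which copies every event to all connected channels. The agent name (a1), port, and downstream host below are illustrative.

```properties
# One source fanning out to two channels, each drained by its own sink.
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444
# Replicate every event to both channels (this is the default selector).
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2

a1.channels.c1.type = memory
a1.channels.c2.type = memory

# k1 logs events locally for inspection; k2 forwards them to another agent.
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = collector-host
a1.sinks.k2.port = 4545
a1.sinks.k2.channel = c2
```

A multiplexing selector can be used instead of the replicating selector when events should be routed to different channels based on a header value.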
Terms
| Term | Description |
| --- | --- |
| event | The basic unit of data that flows through a Flume agent. An event consists of a byte array payload (the body) and an optional set of string attributes that are attached as headers, such as a `timestamp` or `host` header. |
| source | A data collector. A source consumes events from an external data source and delivers them, possibly in batches, to one or more channels. Common sources include the Avro, Exec, Taildir, Spooling Directory, and Kafka sources. |
| channel | A buffer located between a source and a sink that caches events until a sink consumes them. Common channels include the Memory, File, and Kafka channels. |
| sink | Obtains events from a channel and commits them as transactions to an external store. After an event is committed, it is removed from the channel. Common sinks include the HDFS, Hive, Kafka, Avro, and Logger sinks. |
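The event structure described in the table (a byte-array body plus optional string headers) can be sketched as follows. The class below is an illustrative model for this document, not Flume's own API; in real code you would use the `org.apache.flume.Event` interface and `org.apache.flume.event.EventBuilder`.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Illustrative model of a Flume event: a byte-array body plus
// an optional map of string headers.
public class SimpleEvent {
    private final byte[] body;
    private final Map<String, String> headers;

    public SimpleEvent(byte[] body, Map<String, String> headers) {
        this.body = body;
        this.headers = headers;
    }

    public byte[] getBody() { return body; }
    public Map<String, String> getHeaders() { return headers; }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("host", "web-01");          // example header: origin host
        headers.put("timestamp", "1700000000"); // example header: event time
        SimpleEvent event = new SimpleEvent(
            "INFO GET /index.html 200".getBytes(StandardCharsets.UTF_8),
            headers);
        System.out.println(new String(event.getBody(), StandardCharsets.UTF_8));
        System.out.println(event.getHeaders().get("host"));
    }
}
```

Headers do not affect the body; sources and interceptors use them to carry routing metadata (for example, a multiplexing channel selector can route on a header value).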