Apache Flume is a distributed, reliable, and highly available system that you can use to collect, aggregate, and move large amounts of log data from various data sources and store the data in a centralized manner.
In E-MapReduce (EMR) V3.19.0 and later, you can configure and manage Flume agents in the EMR console.
In most cases, Flume is used to collect log data. You can also customize Flume sources to collect events from various external data sources.
Flume delivers the collected data to a real-time computing platform, an offline computing platform, or a storage system for subsequent data analysis and cleansing. A real-time computing platform can be Flink, Spark Streaming, or Storm. An offline computing platform can be MapReduce, Hive, or Presto. A storage system can be HDFS, Object Storage Service (OSS), Kafka, or Elasticsearch.
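As a sketch of such a pipeline, the following agent configuration tails log files and writes them to HDFS. The agent name `a1`, the component names, the file paths, and the HDFS path are hypothetical placeholders, not values from this document; a real deployment would adjust them to its environment.

```properties
# Name the components of a hypothetical agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# TAILDIR source: tail log files under an assumed directory
a1.sources.r1.type = TAILDIR
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /var/log/app/.*\.log
a1.sources.r1.channels = c1

# File channel: buffers events durably on local disk
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data

# HDFS sink: writes events as plain text, partitioned by date
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs:///flume/logs/%Y%m%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```

Swapping the sink type (for example, to a Kafka or Elasticsearch sink) changes the destination without touching the source or channel, which is what makes the source-channel-sink split useful.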
| Term | Description |
| --- | --- |
| event | The basic unit of data that flows through a Flume agent. An event consists of a byte array of data and an optional set of string attributes that are added as headers. |
| source | A data collector. It collects events from an external data source and sends the events to one or more channels at the same time. |
| channel | A buffer that sits between a source and a sink and caches events. |
| sink | Obtains events from a channel and commits the events as transactions to external storage. After an event is committed to external storage, the event is removed from the channel. |