How to sync HDFS audit logs to HDFS

E-MapReduce supports multiple ways to start Flume. This topic shows you how to modify Flume configurations in the E-MapReduce console to start a Flume agent that synchronizes HDFS audit logs to HDFS in real time.

Prerequisites

A data lake cluster must be created with the Flume service selected. For more information, see Create a cluster.

Procedure

Go to the Services page.
1. Log on to the E-MapReduce console.
2. In the top navigation bar, select a region and resource group.
3. On the EMR on ECS page, click Services in the Actions column of the target cluster.
On the Services page, click Configure in the Flume service section.

Configure the Flume agent on the core-1-1 node and save the configuration.

These configurations are based on open-source Flume. For more information, see the official Flume documentation.

On the Configure page, click the flume-conf.properties tab.
From the drop-down lists at the top, select Independent Node Configuration and core-1-1.

Modify the parameters in flume-conf.properties as needed.

default-agent.sinks = default-sink
default-agent.sources = default-source
default-agent.channels = default-channel
default-agent.sinks.default-sink.type = hdfs
default-agent.sinks.default-sink.channel = default-channel
default-agent.channels.default-channel.type = file
default-agent.sources.default-source.type = avro
default-agent.sinks.default-sink.hdfs.path = hdfs://master-1-1:9000/path
default-agent.sinks.default-sink.hdfs.fileType = DataStream
default-agent.sinks.default-sink.hdfs.rollSize = 0
default-agent.sinks.default-sink.hdfs.rollCount = 0
default-agent.sinks.default-sink.hdfs.rollInterval = 86400
default-agent.sinks.default-sink.hdfs.batchSize = 51200
default-agent.sources.default-source.bind = 0.0.0.0
default-agent.sources.default-source.port = ****
default-agent.sources.default-source.channels = default-channel
default-agent.channels.default-channel.transactionCapacity = 10000
default-agent.channels.default-channel.dataDirs = ****
default-agent.channels.default-channel.checkpointDir = ****
default-agent.channels.default-channel.capacity = 1000000

Parameter	Description
default-agent.sinks	The names of the agent's sinks. For example, default-sink.
default-agent.sources	The names of the agent's sources. For example, default-source.
default-agent.channels	The names of the agent's channels. For example, default-channel.
default-agent.sinks.default-sink.hdfs.path	The destination path in HDFS. For a high-availability (HA) cluster, use a path such as hdfs://emr-cluster/path. For a non-HA cluster, use a path such as hdfs://master-1-1:9000/path.
default-agent.sinks.default-sink.hdfs.fileType	The file type. This value must be set to DataStream.
default-agent.sinks.default-sink.hdfs.rollSize	The file size in bytes that triggers a roll. When a temporary file reaches this size, it is renamed and moved to its final destination. Set to 0 to disable rolling based on size.
default-agent.sinks.default-sink.hdfs.rollCount	The number of events that triggers a file roll. When a file contains this many events, it is rolled. Set to 0 to disable rolling based on the event count.
default-agent.sinks.default-sink.hdfs.rollInterval	The interval in seconds at which to roll the file, regardless of its size or event count. For example, 86400.
default-agent.sinks.default-sink.hdfs.batchSize	The number of events to write to HDFS in a single batch. For example, 51200.
default-agent.sinks.default-sink.channel	The channel name for default-sink.
default-agent.sources.default-source.bind	The IP address or hostname to which the source binds. Set to 0.0.0.0 to bind to all network interfaces.
default-agent.sources.default-source.port	The port on which the source listens for events. Configure this port as required.
default-agent.sources.default-source.channels	The channel name for default-source.
default-agent.channels.default-channel.transactionCapacity	The maximum number of events that the channel can handle in a single transaction. The default value is 10000.
default-agent.channels.default-channel.dataDirs	The path where the channel stores event data. This parameter is optional. The default path is ~/.flume/file-channel/data.
default-agent.channels.default-channel.checkpointDir	The path where checkpoint files are stored. This parameter is optional. The default path is ~/.flume/file-channel/checkpoint.
default-agent.channels.default-channel.capacity	The maximum number of events that the channel can store. Set this value based on your HDFS roll settings. This parameter is optional. The default value is 1000000.
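
The wiring described in the table above (each declared source, sink, and channel has a type, and each source and sink points at a declared channel) can be checked before saving. The following is a minimal sketch, not part of E-MapReduce or Flume: the helper names parse_properties and check_agent are hypothetical, and the parser only handles plain "key = value" lines.

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_agent(props, agent="default-agent"):
    """Return a list of wiring problems found for one agent (empty if none)."""
    problems = []
    declared = {
        kind: props.get(f"{agent}.{kind}", "").split()
        for kind in ("sources", "sinks", "channels")
    }
    # Every declared component needs a type.
    for kind, names in declared.items():
        for name in names:
            if f"{agent}.{kind}.{name}.type" not in props:
                problems.append(f"{kind[:-1]} {name} has no type")
    channels = set(declared["channels"])
    # A sink reads from exactly one channel.
    for name in declared["sinks"]:
        ch = props.get(f"{agent}.sinks.{name}.channel")
        if ch not in channels:
            problems.append(f"sink {name} uses undeclared channel {ch}")
    # A source may write to several space-separated channels.
    for name in declared["sources"]:
        for ch in props.get(f"{agent}.sources.{name}.channels", "").split():
            if ch not in channels:
                problems.append(f"source {name} uses undeclared channel {ch}")
    return problems


# Subset of the core-node configuration shown in this topic.
SAMPLE = """
default-agent.sinks = default-sink
default-agent.sources = default-source
default-agent.channels = default-channel
default-agent.sinks.default-sink.type = hdfs
default-agent.sinks.default-sink.channel = default-channel
default-agent.channels.default-channel.type = file
default-agent.sources.default-source.type = avro
default-agent.sources.default-source.channels = default-channel
"""

if __name__ == "__main__":
    print(check_agent(parse_properties(SAMPLE)) or "wiring looks consistent")
```

The check is deliberately narrow: it does not validate ports, paths, or HDFS settings, which still need to be reviewed against the table.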

Save the configuration.
1. At the bottom of the page, click Save.
2. In the dialog box, enter a reason for the change and click OK.

Repeat the preceding step (configure and save) for the core-1-2 node, selecting core-1-2 instead of core-1-1 from the drop-down list.

Configure the Flume agent for the master node group and save the configuration.

These configurations are based on open-source Flume. For more information, see the official Flume documentation.

From the drop-down lists at the top, select Independent Node Configuration and master-1-1.

Modify the parameters in flume-conf.properties as needed.

default-agent.sinks = default-sink k1
default-agent.sources = default-source
default-agent.channels = default-channel
default-agent.sources.default-source.type = taildir
default-agent.sinks.default-sink.type = avro
default-agent.sinks.default-sink.channel = default-channel
default-agent.channels.default-channel.type = file
default-agent.sources.default-source.filegroups = f1
default-agent.sources.default-source.filegroups.f1 = /mnt/disk1/log/hadoop-hdfs/hdfs-audit.log.*
default-agent.sources.default-source.positionFile = ~/.flume/taildir_position.json
default-agent.sources.default-source.channels = default-channel
default-agent.sources.default-source.batchSize = 2000
default-agent.sources.default-source.ignoreRenameWhenMultiMatching = true
default-agent.channels.default-channel.checkpointDir = ****
default-agent.channels.default-channel.dataDirs = ****
default-agent.channels.default-channel.capacity = ****
default-agent.channels.default-channel.transactionCapacity = 2000
default-agent.sinkgroups = g1
default-agent.sinkgroups.g1.sinks = default-sink k1
default-agent.sinkgroups.g1.processor.type = failover
default-agent.sinkgroups.g1.processor.priority.default-sink = 10
default-agent.sinkgroups.g1.processor.priority.k1 = 5
default-agent.sinks.default-sink.hostname = ****
default-agent.sinks.default-sink.port = ****
default-agent.sinks.k1.hostname = ****
default-agent.sinks.k1.port = ****
default-agent.sinks.default-sink.batch-size = 2000
default-agent.sinks.k1.batch-size = 2000
default-agent.sinks.k1.type = avro
default-agent.sinks.k1.channel = default-channel
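
In open-source Flume, the file-name part of a taildir filegroup is treated as a regular expression rather than a shell glob, so hdfs-audit.log.* in the configuration above is worth testing before use. The sketch below uses Python's re module as an approximation of that matching. Whole-name matching and the candidate file names are assumptions made for illustration.

```python
import re

# File-name part of filegroups.f1 from the configuration above.
PATTERN = r"hdfs-audit.log.*"


def matches(filename, pattern=PATTERN):
    """Return True if the whole file name matches the pattern."""
    return re.fullmatch(pattern, filename) is not None


# Hypothetical file names that might appear in the log directory.
candidates = [
    "hdfs-audit.log",
    "hdfs-audit.log.1",
    "hdfs-audit.log.2024-01-01",
    "hdfs-auditXlog",
    "hadoop-hdfs-namenode.log",
]

for name in candidates:
    print(f"{name}: {'match' if matches(name) else 'no match'}")
```

Because an unescaped dot matches any character, the pattern also accepts a name such as hdfs-auditXlog. This rarely matters in a dedicated log directory, but it shows why the pattern behaves differently from a glob.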

Parameter	Description
default-agent.sinks	The names of the agent's sinks. For example, default-sink k1.
default-agent.sources	The names of the agent's sources. For example, default-source.
default-agent.channels	The names of the agent's channels. For example, default-channel.
default-agent.sources.default-source.filegroups.f1	The path to the log files to be collected. The default path is /mnt/disk1/log/hadoop-hdfs/hdfs-audit.log.*.
default-agent.sources.default-source.positionFile	The path to the position file that tracks reading progress. This parameter is optional. The default path is ~/.flume/taildir_position.json.
default-agent.channels.default-channel.checkpointDir	The path where checkpoint files are stored.
default-agent.channels.default-channel.dataDirs	The path where the channel stores event data.
default-agent.channels.default-channel.capacity	The maximum number of events that the channel can store. Set this value based on your HDFS roll settings.
default-agent.sources.default-source.batchSize	The maximum number of messages to write to the channel in a single batch. Example: 2000.
default-agent.channels.default-channel.transactionCapacity	The maximum number of events that a channel takes from a source or pushes to a sink in each transaction. Example: 2000.
default-agent.sources.default-source.ignoreRenameWhenMultiMatching	Data duplication can occur when a Flume taildir source uses wildcards in filegroups to match rotated Log4j files. Set this parameter to true to prevent this issue.
default-agent.sinkgroups	The names of the agent's sink groups. For example, g1.
default-agent.sinkgroups.g1.sinks	The names of the sinks in the g1 sink group. For example, default-sink k1.
default-agent.sinkgroups.g1.processor.type	The processing logic for the g1 sink group. Valid values: default (the default sink processor), failover (implements failover for the sink group), and load_balance (implements load balancing for the sink group).
default-agent.sinkgroups.g1.processor.priority.default-sink	The priority of the default-sink in the sink group g1. A higher value gives the sink higher priority. Example: 10.
default-agent.sinkgroups.g1.processor.priority.k1	The priority of the k1 sink in the sink group g1. A higher value gives the sink higher priority. Example: 5.
default-agent.sinks.default-sink.hostname	The IP address of the core-1-1 node.
default-agent.sinks.default-sink.port	The port of the Flume agent on the core-1-1 node.
default-agent.sinks.k1.hostname	The IP address of the core-1-2 node.
default-agent.sinks.k1.port	The port of the Flume agent on the core-1-2 node.
default-agent.sinks.default-sink.batch-size	The number of events sent in each batch by default-sink. Example: 2000.
default-agent.sinks.k1.batch-size	The number of events sent in each batch by the k1 sink. Example: 2000.
default-agent.sinks.k1.type	The sink type. For example, avro.
default-agent.sinks.k1.channel	The channel that the sink uses. For example, default-channel.
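
The effect of the failover priorities in the table above can be illustrated with a small model: events go to the available sink with the highest priority, and a lower-priority sink takes over only when that sink fails. This is a simplified sketch of the selection rule, not Flume's implementation. The function name pick_sink is hypothetical, and details such as retrying failed sinks are omitted.

```python
def pick_sink(priorities, failed=()):
    """Return the highest-priority sink not marked as failed, or None."""
    live = {name: prio for name, prio in priorities.items() if name not in failed}
    if not live:
        return None
    return max(live, key=live.get)


# Priorities from the master-node configuration in this topic.
PRIORITIES = {"default-sink": 10, "k1": 5}

if __name__ == "__main__":
    print(pick_sink(PRIORITIES))                           # both sinks healthy
    print(pick_sink(PRIORITIES, failed={"default-sink"}))  # core-1-1 unreachable
```

With the example values, default-sink (core-1-1) receives the events while it is healthy, and k1 (core-1-2) is used only as a standby.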

Save the configuration.
1. At the bottom of the page, click Save.
2. In the dialog box, enter a reason for the change and click OK.

Start the Flume agents.
1. In the upper-right corner, choose More > Restart.
2. In the dialog box that appears, enter an Execution Reason and click OK.
3. In the Confirm dialog box, click OK.
  After the restart, the Flume agents begin synchronizing HDFS audit logs to HDFS.
  The Flume agent logs are stored at /var/log/emr/flume/default-agent/flume.log.
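
To check whether an agent started cleanly, the log file mentioned above can be scanned for error lines. The following is a minimal sketch: the markers "ERROR" and "Exception" are assumptions about what problem lines contain, and the function name find_problem_lines is hypothetical.

```python
import os

LOG_PATH = "/var/log/emr/flume/default-agent/flume.log"


def find_problem_lines(lines, markers=("ERROR", "Exception")):
    """Return the lines that contain any of the given markers."""
    return [line.rstrip("\n") for line in lines if any(m in line for m in markers)]


# Run on the node that hosts the agent; does nothing if the log is absent.
if __name__ == "__main__" and os.path.exists(LOG_PATH):
    with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
        for line in find_problem_lines(log):
            print(line)
```

An empty result only means that no line contained the assumed markers. Confirm that synchronization works by checking that files appear under the configured hdfs.path.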