This topic describes how to use Flume to synchronize audit logs to HDFS in real time.

Prerequisites

An EMR Hadoop cluster is created, and Flume is selected from the optional services during cluster creation. For more information, see Create a cluster.

Background information

In EMR V3.19.0 and later, you can configure and manage Flume agents in the EMR console.

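The following figure shows the topology of Flume agents. The agent on the emr-header-1 node tails the HDFS audit log and forwards events over Avro to the agents on the emr-worker-1 and emr-worker-2 nodes, which write the events to HDFS.
Notice: You can adjust the topology as needed.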

For other use scenarios of Flume, see Configure Flume.

Start a Flume agent

  1. Go to the Flume page.
    1. Log on to the EMR console.
    2. In the top navigation bar, select the region where your cluster resides. Select the resource group as required. By default, all resources of the account appear.
    3. Click the Cluster Management tab.
    4. On the Cluster Management page that appears, find the target cluster and click Details in the Actions column.
    5. In the left-side navigation pane, click Cluster Service and then FLUME.
  2. Configure core node parameters.
    1. Configure a Flume agent on the emr-worker-1 node.
      1. Click the Configure tab. Configure the parameters listed in the following table.
        Parameter Value
        default-agent.sinks.default-sink.type hdfs
        default-agent.channels.default-channel.type file
        default-agent.sources.default-source.type avro
        deploy_node_hostname emr-worker-1
      2. In the Service Configuration section, click the flume-conf tab.
      3. In the upper-right corner of the Service Configuration section, click Custom Configuration. Add the parameters listed in the following table.
        Parameter Description
        default-agent.sinks.default-sink.hdfs.path The HDFS path to which events are written. For a high-availability cluster, use the form shown in this example: hdfs://emr-cluster/path.
        default-agent.sinks.default-sink.hdfs.fileType Set this parameter to DataStream.
        default-agent.sinks.default-sink.hdfs.rollSize Set this parameter to 0.
        default-agent.sinks.default-sink.hdfs.rollCount Set this parameter to 0.
        default-agent.sinks.default-sink.hdfs.rollInterval Set this parameter to 86400.
        default-agent.sinks.default-sink.hdfs.batchSize Set this parameter to 51200.
        default-agent.sources.default-source.bind Set this parameter to 0.0.0.0.
        default-agent.sources.default-source.port Set this parameter as required.
        default-agent.channels.default-channel.transactionCapacity Set this parameter to 51200.
        default-agent.channels.default-channel.dataDirs The path where events are stored.
        default-agent.channels.default-channel.checkpointDir The path where the checkpoint file is stored.
        default-agent.channels.default-channel.capacity Set this parameter based on the settings of the preceding HDFS rolling parameters.
      4. Click Save.
    2. Optional: Click the Component Deployment tab and verify the agent status.
      The status of the Flume agent on the emr-worker-1 node must be STARTED.
    3. Repeat Step 1 of this section to configure a Flume agent on the emr-worker-2 node. In this case, set deploy_node_hostname to emr-worker-2.
  3. Configure master node parameters.
    1. Configure a Flume agent on the emr-header-1 node.
      1. Click the Configure tab. Configure the parameters listed in the following table.
        Parameter Value
        additional_sinks k1
        deploy_node_hostname emr-header-1
        default-agent.sources.default-source.type taildir
        default-agent.sinks.default-sink.type avro
        default-agent.channels.default-channel.type file
      2. In the Service Configuration section, click the flume-conf tab.
      3. In the upper-right corner of the Service Configuration section, click Custom Configuration. Add the parameters listed in the following table.
        Parameter Description
        default-agent.sources.default-source.filegroups Set this parameter to f1.
        default-agent.sources.default-source.filegroups.f1 Set this parameter to /mnt/disk1/log/hadoop-hdfs/hdfs-audit.log.*.
        default-agent.sources.default-source.positionFile The path where the position file is stored.
        default-agent.channels.default-channel.checkpointDir The path where the checkpoint file is stored.
        default-agent.channels.default-channel.dataDirs The path where events are stored.
        default-agent.channels.default-channel.capacity Set this parameter based on the HDFS rolling parameters configured for the core nodes in Step 2.
        default-agent.sources.default-source.batchSize Set this parameter to 2000.
        default-agent.channels.default-channel.transactionCapacity Set this parameter to 2000.
        default-agent.sources.default-source.ignoreRenameWhenMultiMatching Set this parameter to true.
        default-agent.sinkgroups Set this parameter to g1.
        default-agent.sinkgroups.g1.sinks Set this parameter to default-sink k1.
        default-agent.sinkgroups.g1.processor.type Set this parameter to failover.
        default-agent.sinkgroups.g1.processor.priority.default-sink Set this parameter to 10.
        default-agent.sinkgroups.g1.processor.priority.k1 Set this parameter to 5.
        default-agent.sinks.default-sink.hostname The IP address of the emr-worker-1 node.
        default-agent.sinks.default-sink.port The port of the Flume agent on the emr-worker-1 node.
        default-agent.sinks.k1.hostname The IP address of the emr-worker-2 node.
        default-agent.sinks.k1.port The port of the Flume agent on the emr-worker-2 node.
        default-agent.sinks.default-sink.batch-size Set this parameter to 2000.
        default-agent.sinks.k1.batch-size Set this parameter to 2000.
        default-agent.sinks.k1.type Set this parameter to avro.
        default-agent.sinks.k1.channel Set this parameter to default-channel.
  4. Start the Flume agents.
    1. In the upper-right corner of the Service Configuration section, click Save.
    2. In the Confirm Changes dialog box, specify the parameters and click OK.
    3. Select Restart All Components from the Actions drop-down list in the upper-right corner.
    4. In the Cluster Activities dialog box, specify the parameters and click OK.
    5. In the Confirm message, click OK.
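
The parameters entered in the console in the preceding steps correspond to ordinary Flume properties prefixed with the agent name. The following Python sketch is only an illustration, not part of EMR: it renders a subset of the core node and master node parameters as properties lines, checks that no batch size exceeds the channel transactionCapacity, and shows which sink the failover group prefers. The port 9999, the 192.0.2.x addresses, and the channel capacity of 1000000 are assumed placeholders, and the helper names render, check_limits, and preferred_sink are hypothetical. The directory parameters (dataDirs, checkpointDir, positionFile) are omitted because their values depend on your cluster.

```python
# Sketch: render console parameters as Flume properties and sanity-check them.
# Ports, IP addresses, and channel capacity are ASSUMED placeholders.
AGENT = "default-agent"

WORKER = {  # core node agent: Avro source -> file channel -> HDFS sink
    "sources.default-source.type": "avro",
    "sources.default-source.bind": "0.0.0.0",
    "sources.default-source.port": "9999",  # assumed; set as required
    "channels.default-channel.type": "file",
    "channels.default-channel.transactionCapacity": "51200",
    "channels.default-channel.capacity": "1000000",  # assumed
    "sinks.default-sink.type": "hdfs",
    "sinks.default-sink.hdfs.path": "hdfs://emr-cluster/path",
    "sinks.default-sink.hdfs.fileType": "DataStream",
    "sinks.default-sink.hdfs.rollSize": "0",
    "sinks.default-sink.hdfs.rollCount": "0",
    "sinks.default-sink.hdfs.rollInterval": "86400",
    "sinks.default-sink.hdfs.batchSize": "51200",
}

MASTER = {  # master node agent: taildir source -> file channel -> Avro sinks
    "sources.default-source.type": "taildir",
    "sources.default-source.filegroups": "f1",
    "sources.default-source.filegroups.f1":
        "/mnt/disk1/log/hadoop-hdfs/hdfs-audit.log.*",
    "sources.default-source.batchSize": "2000",
    "channels.default-channel.type": "file",
    "channels.default-channel.transactionCapacity": "2000",
    "channels.default-channel.capacity": "1000000",  # assumed
    "sinkgroups": "g1",
    "sinkgroups.g1.sinks": "default-sink k1",
    "sinkgroups.g1.processor.type": "failover",
    "sinkgroups.g1.processor.priority.default-sink": "10",
    "sinkgroups.g1.processor.priority.k1": "5",
    "sinks.default-sink.type": "avro",
    "sinks.default-sink.hostname": "192.0.2.11",  # assumed emr-worker-1 IP
    "sinks.default-sink.port": "9999",  # assumed
    "sinks.default-sink.batch-size": "2000",
    "sinks.k1.type": "avro",
    "sinks.k1.hostname": "192.0.2.12",  # assumed emr-worker-2 IP
    "sinks.k1.port": "9999",  # assumed
    "sinks.k1.batch-size": "2000",
    "sinks.k1.channel": "default-channel",
}


def render(params, agent=AGENT):
    """Return the parameters as 'agent.key = value' properties lines."""
    return "\n".join(f"{agent}.{key} = {value}" for key, value in params.items())


def check_limits(params):
    """Return a list of problems with batch sizes and channel limits."""
    tx = int(params["channels.default-channel.transactionCapacity"])
    cap = int(params["channels.default-channel.capacity"])
    problems = []
    if tx > cap:
        problems.append(f"transactionCapacity {tx} exceeds capacity {cap}")
    for key, value in params.items():
        if key.endswith((".batchSize", ".batch-size")) and int(value) > tx:
            problems.append(f"{key} {value} exceeds transactionCapacity {tx}")
    return problems


def preferred_sink(params, failed=()):
    """Return the highest-priority sink of group g1 that is not in 'failed'."""
    prefix = "sinkgroups.g1.processor.priority."
    alive = {
        key[len(prefix):]: int(value)
        for key, value in params.items()
        if key.startswith(prefix) and key[len(prefix):] not in failed
    }
    return max(alive, key=alive.get) if alive else None


if __name__ == "__main__":
    print(render(MASTER))
    print(check_limits(WORKER) + check_limits(MASTER))
    print(preferred_sink(MASTER), preferred_sink(MASTER, failed={"default-sink"}))
```

With the values from the tables, check_limits reports no problems for either agent. Because the failover processor prefers the sink with the larger priority value, events go to the agent on emr-worker-1 (default-sink, priority 10) and switch to emr-worker-2 (k1, priority 5) only if default-sink fails.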

View logs

The logs of Flume agents are stored in the flume.log file in the /mnt/disk1/log/flume/default-agent/ directory.
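
If an agent does not start or no data reaches HDFS, a quick scan of this file for warnings and errors is often enough to find the cause. The following Python sketch is an assumption-laden illustration: it presumes that log lines contain the level tokens WARN or ERROR as whole words, as is typical for log4j output, and the helper name problem_lines is hypothetical.

```python
import re
from pathlib import Path

# Path taken from this topic; run the script on the node that hosts the agent.
LOG_PATH = Path("/mnt/disk1/log/flume/default-agent/flume.log")

# Assumption: the log level appears in each line as a whole word.
LEVEL_PATTERN = re.compile(r"\b(ERROR|WARN)\b")


def problem_lines(lines):
    """Return the lines that contain a WARN or ERROR level token."""
    return [line.rstrip("\n") for line in lines if LEVEL_PATTERN.search(line)]


if __name__ == "__main__":
    if LOG_PATH.exists():
        with LOG_PATH.open(errors="replace") as log_file:
            # Show only the 20 most recent matches.
            for line in problem_lines(log_file)[-20:]:
                print(line)
    else:
        print(f"{LOG_PATH} not found on this node")
```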

View monitoring information

You can view the monitoring information of Flume agents on the Status tab for the Flume service.