This topic describes some common parameters for the Taildir Source, File Channel, and HDFS Sink components. You can adjust the parameters to optimize the performance of Flume.

Taildir Source

Parameter Description
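filegroups A space-separated list of file groups. Each file group identifies a set of files to tail. To increase the read parallelism of Taildir Source, you can split the files in one directory across multiple file groups.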
batchSize Default value: 100. The maximum number of lines that are read and sent to the channel in a single batch. To improve the throughput, you can increase the value of this parameter.
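
The following fragment is a minimal sketch of how these two parameters might be set for Taildir Source. The agent name (a1), source name (r1), channel name (c1), file paths, and the batchSize value are assumptions for illustration only. Adjust them to match your own agent.

```properties
# Hypothetical agent a1 with one Taildir Source r1 that writes to channel c1.
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
# Assumed location of the file that records the read position of each tailed file.
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
# Two file groups (example paths) instead of one group that covers everything.
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/app1/.*log
a1.sources.r1.filegroups.f2 = /var/log/app2/.*log
# Raised from the default of 100. Keep this value at or below the
# transactionCapacity of the channel.
a1.sources.r1.batchSize = 1000
```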

File Channel

Parameter Description
checkpointInterval Default value: 30000. Unit: milliseconds. The interval between checkpoints. To write checkpoints more frequently, decrease the value of this parameter.
useDualCheckpoints Default value: false. Specifies whether File Channel backs up checkpoints. If you set this parameter to true, you must also set the backupCheckpointDir parameter. With a backup checkpoint available, the channel can recover from the backup when it restarts and does not need to read events from the beginning again.
maxFileSize Default value: 2146435071. Unit: bytes (about 2 GB). The maximum size of a single data file. To make data files roll sooner, decrease the value of this parameter. Smaller files can be deleted earlier after their events are consumed, which frees up disk space sooner.
capacity Default value: 1000000. The maximum number of events that File Channel can hold. To improve the throughput, you can increase the value of this parameter. To estimate the disk usage, multiply the value of this parameter by the size of a single event.
transactionCapacity Default value: 10000. The maximum number of events in a single transaction for File Channel.
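
The following fragment is a minimal sketch of a File Channel that combines the parameters in the preceding table. The agent name (a1), channel name (c1), directory paths, and all tuned values are assumptions for illustration, not recommended settings.

```properties
# Hypothetical agent a1 with one File Channel c1.
a1.channels = c1
a1.channels.c1.type = file
# Example directories. The backup directory must differ from the
# checkpoint directory and from the data directories.
a1.channels.c1.checkpointDir = /mnt/flume/checkpoint
a1.channels.c1.backupCheckpointDir = /mnt/flume/checkpoint_backup
a1.channels.c1.dataDirs = /mnt/flume/data
# Back up checkpoints so that a restart can recover from the backup.
a1.channels.c1.useDualCheckpoints = true
# Checkpoint interval in milliseconds.
a1.channels.c1.checkpointInterval = 30000
# 1 GiB per data file, so that files roll and are cleaned up sooner.
a1.channels.c1.maxFileSize = 1073741824
# Rough disk estimate: 2,000,000 events x about 1 KB per event is about 2 GB.
a1.channels.c1.capacity = 2000000
# Must be at least as large as the batch sizes of the source and the sink.
a1.channels.c1.transactionCapacity = 10000
```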

HDFS Sink

Parameter Description
hdfs.batchSize Default value: 100. The number of events that are written to a file before the file is flushed to HDFS. To improve the throughput, you can increase the value of this parameter. Note: We recommend that you set this parameter to the same value as the batchSize parameter for Taildir Source. Make sure that neither value exceeds the value of the transactionCapacity parameter for File Channel.
hdfs.threadsPoolSize Default value: 10. The number of HDFS I/O threads. You can adjust this parameter based on node configurations.
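hdfs.useLocalTimeStamp Default value: false. Specifies whether the local time, instead of the timestamp in the event header, is used to replace time-based escape sequences such as those in hdfs.path. If your events do not carry a timestamp header, set this parameter to true.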
hdfs.rollInterval Default value: 30. Unit: seconds. The interval at which a temporary file is rolled into a final file. If you set this parameter to 0, HDFS Sink does not roll files based on an interval.
hdfs.rollSize Default value: 1024. Unit: bytes. When the size of a file reaches the value of this parameter, HDFS Sink rolls the file into a final file. If you set this parameter to 0, HDFS Sink does not roll files based on file sizes.
hdfs.rollCount Default value: 10. When the number of events that are written to a file reaches the value of this parameter, HDFS Sink rolls the file into a final file. If you set this parameter to 0, HDFS Sink does not roll files based on the number of events.
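hdfs.minBlockReplicas The minimum number of replicas per HDFS file block. If you do not specify this parameter, the value is taken from the default Hadoop configuration in the classpath. In most cases, HDFS Sink can properly roll files only if this parameter is set to 1. Otherwise, under-replicated blocks can cause the sink to roll files before the configured roll conditions are met.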
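
The following fragment is a minimal sketch of an HDFS Sink that rolls files by time and size instead of by event count. The agent name (a1), sink name (k1), channel name (c1), HDFS path, and all tuned values are assumptions for illustration. Choose roll thresholds that suit your own file size and latency requirements.

```properties
# Hypothetical agent a1 with one HDFS Sink k1 that reads from channel c1.
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# Example path with time-based escape sequences.
a1.sinks.k1.hdfs.path = /flume/events/%Y-%m-%d
# Resolve the escape sequences from the local time of the agent.
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Match the batchSize of the source, and stay at or below the
# transactionCapacity of the channel.
a1.sinks.k1.hdfs.batchSize = 1000
a1.sinks.k1.hdfs.threadsPoolSize = 10
# Roll every 10 minutes or at 128 MiB, whichever comes first.
a1.sinks.k1.hdfs.rollInterval = 600
a1.sinks.k1.hdfs.rollSize = 134217728
# 0 disables rolling based on the number of events.
a1.sinks.k1.hdfs.rollCount = 0
# Avoid early rolls that are triggered by under-replicated blocks.
a1.sinks.k1.hdfs.minBlockReplicas = 1
```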