This topic describes some common parameters for the Taildir Source, File Channel, and HDFS Sink components. You can adjust the parameters to optimize the performance of Flume.
Taildir Source
Parameter | Description |
---|---|
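filegroups | A space-separated list of file groups. Each file group specifies a set of files to tail. To increase the read parallelism of Taildir Source, you can split the files in a large directory across multiple directories and configure one file group for each directory. |
batchSize | Default value: 100. The maximum number of lines that are read and sent to the channel at a time. To improve the throughput, you can increase the value of this parameter. |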
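
The following fragment is a minimal sketch of how these two parameters might be set. The agent name (a1), the component names (r1, c1), the paths, and the value 1000 are hypothetical placeholders, not recommended settings.

```properties
# Hypothetical agent (a1), source (r1), and channel (c1) names.
a1.sources = r1
a1.channels = c1

a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
# Hypothetical location of the file that records read positions.
a1.sources.r1.positionFile = /var/flume/taildir_position.json

# Two file groups, one for each directory, instead of a single group.
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/app1/.*log.*
a1.sources.r1.filegroups.f2 = /var/log/app2/.*log.*

# Larger than the default of 100. Keep this value at or below the
# transactionCapacity of the channel.
a1.sources.r1.batchSize = 1000
```
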
File Channel
Parameter | Description |
---|---|
checkpointInterval | Default value: 30000. Unit: milliseconds. The interval between two consecutive checkpoints. To write checkpoints more frequently, you can decrease the value of this parameter. |
useDualCheckpoints | Default value: false. Specifies whether File Channel backs up checkpoints. If you set this parameter to true, you must also set the backupCheckpointDir parameter. With a backup checkpoint, the channel does not need to read events from the beginning again when it restarts. |
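maxFileSize | Default value: 2146435071. Unit: bytes (about 2 GB). The maximum size of a single data file. To accelerate the rolling of files, you can decrease the value of this parameter. Smaller data files can be deleted sooner after their events are consumed, which frees up disk space earlier. |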
capacity | Default value: 1000000. The maximum number of events that File Channel can hold. To improve the throughput, you can increase the value of this parameter. You can also multiply the value of this parameter by the size of a single event to estimate the disk usage. |
transactionCapacity | Default value: 10000. The maximum number of events in a single transaction for File Channel. |
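
The following fragment sketches how the File Channel parameters above might be combined. The agent and channel names, the directories, and all values are hypothetical. The disk usage estimate in the comment assumes an average event size of 1 KB, which you would replace with a measurement from your own workload.

```properties
# Hypothetical agent (a1) and channel (c1) names and directories.
a1.channels = c1
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/disk1/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/disk1/flume/data

# Back up checkpoints. backupCheckpointDir is required when
# useDualCheckpoints is true.
a1.channels.c1.useDualCheckpoints = true
a1.channels.c1.backupCheckpointDir = /mnt/disk2/flume/checkpoint-backup

# Checkpoint every 30 seconds (the value is in milliseconds).
a1.channels.c1.checkpointInterval = 30000

# Roll data files at 1 GB (1073741824 bytes).
a1.channels.c1.maxFileSize = 1073741824

# Disk usage estimate under the assumed 1 KB event size:
# 2000000 events x 1024 bytes = 2048000000 bytes, or about 2 GB.
a1.channels.c1.capacity = 2000000

# Upper bound for the batch sizes of the source and the sink.
a1.channels.c1.transactionCapacity = 10000
```
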
HDFS Sink
Parameter | Description |
---|---|
hdfs.batchSize | Default value: 100. The number of events that are written to a file before the file is flushed to HDFS. To improve the throughput, you can increase the value of this parameter. Note: We recommend that you set this parameter to the same value as the batchSize parameter for Taildir Source. Make sure that the values of the two parameters do not exceed the value of the transactionCapacity parameter for File Channel. |
hdfs.threadsPoolSize | Default value: 10. The number of HDFS I/O threads. You can adjust this parameter based on node configurations. |
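hdfs.useLocalTimeStamp | Default value: false. Specifies whether the local time is used instead of the timestamp in the event header when time escape sequences, such as %Y-%m-%d in hdfs.path, are replaced. Set this parameter to true if events do not carry a timestamp header. |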
hdfs.rollInterval | Default value: 30. Unit: seconds. The interval at which a temporary file is rolled into a final file. If you set this parameter to 0, HDFS Sink does not roll files based on an interval. |
hdfs.rollSize | Default value: 1024. Unit: bytes. When the size of a file reaches the value of this parameter, HDFS Sink rolls the file into a final file. If you set this parameter to 0, HDFS Sink does not roll files based on file sizes. |
hdfs.rollCount | Default value: 10. When the number of events that are written to a file reaches the value of this parameter, HDFS Sink rolls the file into a final file. If you set this parameter to 0, HDFS Sink does not roll files based on the number of events. |
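hdfs.minBlockReplicas | The minimum number of replicas per HDFS file block. If this parameter is not specified, the value is taken from the Hadoop configuration on the classpath, which is typically the HDFS replication factor. In most cases, HDFS Sink can properly roll files only if this parameter is set to 1. Otherwise, blocks that are detected as under-replicated can trigger file rolls regardless of the hdfs.rollInterval, hdfs.rollSize, and hdfs.rollCount settings. |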
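
The following fragment sketches one way to combine the HDFS Sink parameters above. The agent, sink, and channel names, the HDFS path, and all values are hypothetical. The example assumes that the source batchSize is also 1000 and that the channel transactionCapacity is at least 1000.

```properties
# Hypothetical agent (a1), sink (k1), and channel (c1) names.
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1

# Hypothetical path. The %Y-%m-%d escape sequence needs a timestamp,
# which is taken from the local time because useLocalTimeStamp is true.
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# Same value as the assumed source batchSize, and not greater than
# the assumed channel transactionCapacity.
a1.sinks.k1.hdfs.batchSize = 1000
a1.sinks.k1.hdfs.threadsPoolSize = 10

# Roll every 10 minutes or at 128 MB (134217728 bytes), whichever
# comes first. A value of 0 disables rolling by event count.
a1.sinks.k1.hdfs.rollInterval = 600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0

# Prevent under-replicated blocks from forcing early rolls.
a1.sinks.k1.hdfs.minBlockReplicas = 1
```

With the default roll settings (30 seconds, 1024 bytes, 10 events), files roll as soon as any one threshold is reached, which tends to produce many small files. Raising the interval and size thresholds and disabling the count threshold, as in this sketch, is one way to avoid that.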