This topic describes how to use Flume to consume log data. You can use the aliyun-log-flume plug-in to connect Log Service to Flume and write log data to Log Service or consume log data from Log Service.

Background information

The aliyun-log-flume plug-in connects Log Service to Flume. After the connection is established, Log Service can exchange data with other systems, such as Hadoop Distributed File System (HDFS) and Kafka, through Flume. The aliyun-log-flume plug-in provides sinks and sources to connect Log Service to Flume.
  • Sink: reads data from other data sources and writes the data to Log Service.
  • Source: consumes log data from Log Service and writes the log data to other systems.
For more information about the aliyun-log-flume plug-in, visit GitHub.

Procedure

  1. Download and install Flume. For more information, see Flume.
  2. Download the aliyun-log-flume plug-in and save it to the ***/flume/lib directory. To download the plug-in, click aliyun-log-flume-1.3.jar.
  3. In the ***/flume/conf directory, create the flumejob.conf configuration file.
    • For more information about a sink example and how to configure a sink, see Sink.
    • For more information about a source example and how to configure a source, see Source.
  4. Start Flume.

Sink

You can configure a sink to write data from other data sources to Log Service by using Flume. Data can be parsed into the following formats:
  • SIMPLE: A Flume event is written to Log Service as a field.
  • DELIMITED: A Flume event is parsed into fields based on the configured column names and written to Log Service.
The following table describes the configuration parameters of a sink.
Parameter Required Description
type Yes Set the value to com.aliyun.loghub.flume.sink.LoghubSink.
endpoint Yes The endpoint of the region where the Log Service project resides. For more information, see Endpoints.
project Yes The name of the project.
logstore Yes The name of the Logstore.
accessKeyId Yes The AccessKey ID that is used to access Log Service.
accessKey Yes The AccessKey secret that is used to access Log Service.
batchSize No The number of data entries that are written to Log Service at a time. Default value: 1000.
maxBufferSize No The maximum number of data entries in cache queues. Default value: 1000.
serializer No The serialization format of the Flume event. Valid values:
  • DELIMITED: delimiter mode.
  • SIMPLE: single-line mode. This is the default value.
  • Custom serializer: custom serialization mode. In this mode, you must specify the fully qualified class name of the serializer.
columns No The column names. If you set the serializer parameter to DELIMITED, you must configure this parameter. Separate multiple columns with commas (,). Specify the columns in the same order as the fields appear in the data entries.
separatorChar No The delimiter, which must be a single character. If you set the serializer parameter to DELIMITED, you must configure this parameter. By default, commas (,) are used.
quoteChar No The quote character. If you set the serializer parameter to DELIMITED, you must configure this parameter. By default, double quotation marks (") are used.
escapeChar No The escape character. If you set the serializer parameter to DELIMITED, you must configure this parameter. By default, double quotation marks (") are used.
useRecordTime No Specifies whether to use the value of the timestamp field in the data entries as the log time when data is written to Log Service. Default value: false. This value indicates that the current time is used as the log time.
For more information about how to configure a sink, visit GitHub.
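The sink parameters above can be combined into a flumejob.conf sketch such as the following. This is a minimal example, not a definitive configuration: the agent, source, channel, and sink names, the Avro listener, and the endpoint, project, Logstore, and AccessKey values are all placeholder assumptions that you must replace with your own.

```properties
# Hypothetical agent layout: an Avro source feeding a Log Service sink
# through a memory channel. Replace all placeholder values.
agent.sources = src1
agent.channels = ch1
agent.sinks = sink1

agent.sources.src1.type = avro
agent.sources.src1.bind = 0.0.0.0
agent.sources.src1.port = 41414
agent.sources.src1.channels = ch1

agent.channels.ch1.type = memory

agent.sinks.sink1.type = com.aliyun.loghub.flume.sink.LoghubSink
agent.sinks.sink1.endpoint = cn-hangzhou.log.aliyuncs.com
agent.sinks.sink1.project = your-project
agent.sinks.sink1.logstore = your-logstore
agent.sinks.sink1.accessKeyId = <yourAccessKeyId>
agent.sinks.sink1.accessKey = <yourAccessKeySecret>
# Optional: parse each event into fields by delimiter.
# If serializer is omitted, SIMPLE mode is used.
agent.sinks.sink1.serializer = DELIMITED
agent.sinks.sink1.columns = time,ip,method,status
agent.sinks.sink1.separatorChar = ,
agent.sinks.sink1.channel = ch1
```

After you save the file, you can start the agent with a command such as bin/flume-ng agent --conf conf --conf-file conf/flumejob.conf --name agent.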

Source

You can configure a source to ship data from Log Service to other systems by using Flume. Data can be parsed into the following formats:
  • DELIMITED: Log data is written to Flume in delimiter mode.
  • JSON: Log data is written to Flume in the JSON format.
The following table describes the parameters of a source.
Parameter Required Description
type Yes Set the value to com.aliyun.loghub.flume.source.LoghubSource.
endpoint Yes The endpoint of the region where the Log Service project resides. For more information, see Endpoints.
project Yes The name of the project.
logstore Yes The name of the Logstore.
accessKeyId Yes The AccessKey ID that is used to access Log Service.
accessKey Yes The AccessKey secret that is used to access Log Service.
heartbeatIntervalMs No The heartbeat interval between the client and Log Service. Default value: 30000. Unit: milliseconds.
fetchIntervalMs No The interval at which data is read from Log Service. Default value: 100. Unit: milliseconds.
fetchInOrder No Specifies whether to consume log data in the order that the log data is written to Log Service. Default value: false.
batchSize No The number of log entries that are read at a time. Default value: 100.
consumerGroup No The name of the consumer group that reads log data.
initialPosition No The starting point from which data is read. Valid values: begin, end, and timestamp. Default value: begin.
Note: If a checkpoint exists on the server, the checkpoint takes precedence over this parameter.
timestamp No The UNIX timestamp. If you set the initialPosition parameter to timestamp, you must configure this parameter.
deserializer Yes The deserialization format of the event. Valid values:
  • DELIMITED: delimiter mode. This is the default value.
  • JSON: JSON format.
  • Custom deserializer: custom deserialization mode. In this mode, you must specify the fully qualified class name of the deserializer.
columns No The column names. If you set the deserializer parameter to DELIMITED, you must configure this parameter. Separate multiple columns with commas (,). Specify the columns in the same order as the fields appear in the log entries.
separatorChar No The delimiter, which must be a single character. If you set the deserializer parameter to DELIMITED, you must configure this parameter. By default, commas (,) are used.
quoteChar No The quote character. If you set the deserializer parameter to DELIMITED, you must configure this parameter. By default, double quotation marks (") are used.
escapeChar No The escape character. If you set the deserializer parameter to DELIMITED, you must configure this parameter. By default, double quotation marks (") are used.
appendTimestamp No Specifies whether to append the timestamp as a field to each log entry. If you set the deserializer parameter to DELIMITED, you must configure this parameter. Default value: false.
sourceAsField No Specifies whether to add the log source as a field named __source__. If you set the deserializer parameter to JSON, you must configure this parameter. Default value: false.
tagAsField No Specifies whether to add the log tag as a field. The field is named in the format of __tag__:{Name of the tag}. If you set the deserializer parameter to JSON, you must configure this parameter. Default value: false.
timeAsField No Specifies whether to add the log time as a field named __time__. If you set the deserializer parameter to JSON, you must configure this parameter. Default value: false.
useRecordTime No Specifies whether to use the value of the timestamp field in the log entries as the log time when log data is read from Log Service. Default value: false. This value indicates that the current time is used as the log time.
For more information about how to configure a source, visit GitHub.
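As a sketch, a flumejob.conf file that consumes a Logstore in JSON mode might look like the following. The agent, source, channel, and sink names, the consumer group name, and the endpoint, project, Logstore, and AccessKey values are placeholder assumptions; a logger sink is used here only so that consumed events are printed for quick verification.

```properties
# Hypothetical agent layout: a Log Service source feeding a logger sink
# through a memory channel. Replace all placeholder values.
agent.sources = src1
agent.channels = ch1
agent.sinks = sink1

agent.sources.src1.type = com.aliyun.loghub.flume.source.LoghubSource
agent.sources.src1.endpoint = cn-hangzhou.log.aliyuncs.com
agent.sources.src1.project = your-project
agent.sources.src1.logstore = your-logstore
agent.sources.src1.accessKeyId = <yourAccessKeyId>
agent.sources.src1.accessKey = <yourAccessKeySecret>
agent.sources.src1.deserializer = JSON
agent.sources.src1.consumerGroup = flume-consumer
agent.sources.src1.initialPosition = begin
# Optional JSON-mode fields: add the log source and time to each event.
agent.sources.src1.sourceAsField = true
agent.sources.src1.timeAsField = true
agent.sources.src1.channels = ch1

agent.channels.ch1.type = memory

agent.sinks.sink1.type = logger
agent.sinks.sink1.channel = ch1
```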