
Configure LogHub Reader

Last Updated: Feb 23, 2018

Log Service (or Log for short, formerly known as SLS), originally honed by the big data demands of Alibaba Group, is an all-in-one service for real-time data. With its capabilities to collect, consume, ship, query, and analyze log data, Log Service lets you process and analyze massive amounts of data efficiently. LogHub Reader uses the Java SDK of Log Service to consume real-time log data in LogHub, converts the log data to the Data Integration transfer protocol, and sends the converted data to a Writer.

How it works

LogHub Reader uses the Java SDK of Log Service to consume real-time log data in LogHub. The SDK version used is as follows:

  <dependency>
      <groupId>com.aliyun.openservices</groupId>
      <artifactId>aliyun-log</artifactId>
      <version>0.6.7</version>
  </dependency>

Logstore is the component of Log Service for collecting, storing, and querying log data. Logs read from and written to a Logstore are stored on shards. Each Logstore consists of several shards, and each shard covers a left-closed, right-open interval of the MD5 value range. These intervals do not overlap, and together they cover the entire MD5 value range. Each shard provides a certain service capacity:

  • Writing: 5 MB/s, 2,000 times/s

  • Reading: 10 MB/s, 100 times/s
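As a rough sizing check (an illustration, not part of the plugin), the per-shard limits above determine how many shards a given throughput needs:

```java
public class ShardSizing {
    // Per-shard service capacity, as documented above
    static final double WRITE_MB_PER_SEC = 5.0;
    static final double READ_MB_PER_SEC = 10.0;

    // Minimum number of shards needed to sustain a target write rate (MB/s)
    static int shardsForWrite(double mbPerSec) {
        return (int) Math.ceil(mbPerSec / WRITE_MB_PER_SEC);
    }

    // Minimum number of shards needed to sustain a target read rate (MB/s)
    static int shardsForRead(double mbPerSec) {
        return (int) Math.ceil(mbPerSec / READ_MB_PER_SEC);
    }

    public static void main(String[] args) {
        // e.g. a 12 MB/s ingest needs ceil(12 / 5) = 3 shards
        System.out.println(shardsForWrite(12.0)); // 3
        System.out.println(shardsForRead(25.0));  // 3
    }
}
```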

LogHub Reader consumes logs shard by shard. The consumption process (using the GetCursor- and BatchGetLog-related APIs) is as follows:

  • Obtain a cursor based on the time interval.

  • Read logs based on the cursor and step parameters, and obtain the next cursor.

  • Move the cursor forward repeatedly to consume the logs.

  • Split the task by shard for concurrent execution.
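The per-shard loop above can be sketched as follows. ShardStub is a hypothetical in-memory stand-in for the SDK's GetCursor/BatchGetLog calls, so the control flow is runnable standalone; the real plugin issues these as Log Service API requests against each shard.

```java
import java.util.List;

public class CursorLoop {
    // Hypothetical in-memory stand-in for a shard: getCursor maps a
    // time boundary to an offset, and batchGet reads up to `count`
    // entries starting from a cursor. The real SDK calls are
    // GetCursor and BatchGetLog on the Log Service client.
    static class ShardStub {
        final List<String> entries;
        ShardStub(List<String> entries) { this.entries = entries; }
        // One entry per "second" here, so cursor == timestamp for simplicity.
        int getCursor(int time) { return Math.min(time, entries.size()); }
        List<String> batchGet(int cursor, int count) {
            return entries.subList(cursor, Math.min(cursor + count, entries.size()));
        }
    }

    // Mirrors the described flow: obtain the begin/end cursors, then
    // read batches repeatedly, moving the cursor until the left-closed,
    // right-open range [begin, end) is fully consumed.
    static int consume(ShardStub shard, int beginTime, int endTime, int batchSize) {
        int cursor = shard.getCursor(beginTime);
        int end = shard.getCursor(endTime);
        int total = 0;
        while (cursor < end) {
            List<String> batch = shard.batchGet(cursor, Math.min(batchSize, end - cursor));
            total += batch.size();
            cursor += batch.size(); // advance the cursor past the consumed batch
        }
        return total;
    }
}
```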

The following table shows the data type conversion between LogHub Reader and LogHub:

Internal DataX type | LogHub data type
------------------- | ----------------
String              | String

Parameter description

  • endpoint

    • Description: The Log Service endpoint is a URL for accessing a project and its internal log data. It is associated with the Alibaba Cloud region and name of the project. For more information about the endpoints of different regions, see Regions and Endpoints.

    • Required: Yes

    • Default value: None
  • accessId

    • Description: The AccessKey ID for accessing Log Service, which identifies the accessing user.

    • Required: Yes

    • Default value: None
  • accessKey

    • Description: The AccessKey Secret for accessing Log Service, which authenticates the accessing user.

    • Required: Yes

    • Default value: None
  • project

    • Description: It refers to the project name of the target Log Service, which is the resource management component in the Log Service for isolating and controlling resources.

    • Required: Yes

    • Default value: None

  • logstore

    • Description: It refers to the name of the target Logstore. Logstore is a component of the Log Service for collecting, storing, and querying log data.

    • Required: Yes

    • Default value: None
  • topic

    • Description: Logs in one Logstore can be classified by log topic, and this configuration item refers to the log topic.

    • Required: No

    • Default value: Null

  • batchSize

    • Description: It refers to the number of data entries queried from the Log Service at a time.

    • Required: No

    • Default value: 128

  • column

    • Description: It refers to the Column names in each data entry. Here, you can set a metadata item in the Log Service as the synchronization column. Supported metadata items include “C_Topic”, “C_MachineUUID”, “C_HostName”, “C_Path”, and “C_LogTime”, which represent the log topic, unique identifier of the collection machine, host name, path, and log time, respectively.

    • Required: Yes

    • Default value: None

      Note: Column names are case-sensitive.

  • beginDateTime

    • Description: It refers to the start time of data consumption and is the left boundary of the time range (left-closed and right-open). It is a time string in the yyyyMMddHHmmss format (such as 20180111013000) and can be used with the scheduling time parameters of DataWorks.

    • Required: Select either this parameter or beginTimestampMillis.

    • Default value: None

      Note: This parameter must be used with endDateTime.

  • endDateTime

    • Description: It refers to the end time of data consumption and is the right boundary of the time range (left-closed and right-open). It is a time string in the yyyyMMddHHmmss format (such as 20180111013010) and can be used with the scheduling time parameters of DataWorks.

    • Required: Select either this parameter or endTimestampMillis.

    • Default value: None

      Note: This parameter must be used with beginDateTime.

  • beginTimestampMillis

    • Description: It refers to the start time of data consumption in milliseconds and is the left boundary of the time range (left-closed and right-open).

    • Required: Select either this parameter or beginDateTime.

    • Default value: None

      Note:

      This parameter must be used with endTimestampMillis. If its value is -1, it represents the initial Log Service cursor, namely, CursorMode.BEGIN. The beginDateTime mode is recommended.

  • endTimestampMillis

    • Description: It refers to the end time of data consumption in milliseconds and is the right boundary of the time range (left-closed and right-open).

    • Required: Select either this parameter or endDateTime.

    • Default value: None

      Note:

      This parameter must be used with beginTimestampMillis. If its value is -1, it represents the last Log Service cursor, namely, CursorMode.END. The endDateTime mode is recommended.
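Since beginDateTime/endDateTime and beginTimestampMillis/endTimestampMillis express the same boundaries, a yyyyMMddHHmmss string can be converted to milliseconds as sketched below. The time zone is an assumption: pick the zone in which your boundaries are meant (UTC+8 is common for Log Service projects in mainland China regions).

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class BoundaryTime {
    // Parse a yyyyMMddHHmmss boundary string (e.g. "20180111013000")
    // into epoch milliseconds in the given time zone.
    static long toMillis(String dateTime, String zone) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmss");
        fmt.setLenient(false); // reject malformed boundary strings
        fmt.setTimeZone(TimeZone.getTimeZone(zone));
        try {
            return fmt.parse(dateTime).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Bad boundary: " + dateTime, e);
        }
    }
}
```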

Development in script mode

The following is a script configuration sample. For more information about parameters, see the preceding Parameter description.

  {
    "type": "job",
    "version": "1.0",
    "configuration": {
      "setting": {
        "errorLimit": {
          "record": "0"
        },
        "speed": {
          "mbps": "1",
          "concurrent": "1"
        }
      },
      "reader": {
        "plugin": "loghubreader",
        "parameter": {
          "accessKey": "*****",
          "accessId": "*****",
          "batchSize": 1000,
          "beginDateTime": "",
          "endDateTime": "",
          "endpoint": "http://cn-hangzhou.sls.aliyuncs.com",
          "logstore": "xxx",
          "project": "xxx",
          "topic": "xxx",
          "column": [
            "col0",
            "col1",
            "col2",
            "col3",
            "col4",
            "C_Category",
            "C_Source",
            "C_Topic",
            "C_MachineUUID",
            "C_HostName",
            "C_Path",
            "C_LogTime"
          ]
        }
      },
      "writer": {
        "name": "streamwriter",
        "parameter": {
          "print": false
        }
      }
    }
  }