LogHub Shipper for Tablestore (LogHub Shipper) scrubs and converts data in Log Service, and then writes the data to data tables in Tablestore. Tablestore publishes LogHub Shipper to the Alibaba Cloud Container Hub as a Docker image. LogHub Shipper runs on Elastic Compute Service (ECS) instances that are managed by Alibaba Cloud Container Service for Kubernetes (ACK).

Introduction

Log Service stores data in the JSON format and uses log groups as the basic unit for data read and write operations. As a result, you cannot query and analyze logs in Log Service based on specific conditions. For example, you cannot query and analyze the log data of an app from the last 12 hours. For more information about log groups, see Terms.

LogHub Shipper converts log data in Log Service into structured data, and then writes the data to data tables in Tablestore in real time. Online services can then query the data in Tablestore accurately and with high performance.

Data examples

For example, Log Service contains log data that is in the following format:

{"__time__":1453809242,"__topic__":"","__source__":"47.100.XX.XX","ip":"47.100.XX.XX","time":"26/Jan/2016:19:54:02 +0800","url":"POST /PutData?Category=YunOsAccountOpLog&AccessKeyId=U0U***45A&Date=Fri%2C%2028%20Jun%202013%2006%3A53%3A30%20GMT&Topic=raw&Signature=pD12XYLmGxKQ%2Bmkd6x7hAgQ7b1c%3D HTTP/1.1","status":"200","user-agent":"aliyun-sdk-java"}

LogHub Shipper writes the data to a Tablestore data table whose primary key columns are ip and time in the following format.

ip | time | source | status | user-agent | url
47.100.XX.XX | 26/Jan/2016:19:54:02 +0800 | 47.100.XX.XX | 200 | aliyun-sdk-java | POST /PutData…

This way, you can retrieve historical data of a specific IP address based on a specific time point in the Tablestore data table in an efficient and accurate manner.
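The mapping from a log to a table row can be sketched in a few lines of Python. This is only an illustration, not LogHub Shipper's implementation: the `log_to_row` helper and the `PRIMARY_KEY` constant are hypothetical names, the log is shortened, and the sketch does not reproduce any column renaming or exclusion.

```python
import json

# Primary key columns of the example data table, in order.
PRIMARY_KEY = ("ip", "time")

def log_to_row(log, primary_key=PRIMARY_KEY):
    """Split one log (a dict) into primary key columns and attribute columns."""
    pk = [(name, log[name]) for name in primary_key]
    attrs = {k: v for k, v in log.items() if k not in primary_key}
    return pk, attrs

# A shortened version of the sample log shown earlier.
raw = (
    '{"ip": "47.100.XX.XX", "time": "26/Jan/2016:19:54:02 +0800", '
    '"status": "200", "user-agent": "aliyun-sdk-java"}'
)
pk, attrs = log_to_row(json.loads(raw))
print(pk)
print(attrs)
```

Because ip and time form the primary key, a lookup by IP address and time point addresses a single row directly.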

LogHub Shipper provides flexible data mapping rules. You can configure the mappings between the fields of log data and the attribute columns of data tables and convert the data in an efficient manner.

Terms

Terms of related services

Before you use LogHub Shipper, you need to understand the terms of related services. The following table describes the terms.

Service | Term
Log Service |
Tablestore |
ECS | Pay-as-you-go and Subscription
ACK | Node, Application, Service, and Container. When you specify the number of containers in a single LogHub Shipper, we recommend that you specify a number that is less than or equal to the number of shards in the corresponding Logstore.
Resource Access Management (RAM) | RAM user. We recommend that you authorize the RAM user of LogHub Shipper to only read data from Logstores and write data to Tablestore.

Data tables

Data tables store log data that is scrubbed and converted.

When you use data tables, take note of the following items:

  • You need to manually create a data table to store log data that is scrubbed and converted. LogHub Shipper does not automatically create data tables.
  • If Log Service and Tablestore are available, the latency between the time when a log enters Log Service and the time when the log is written to Tablestore is hundreds of milliseconds.
  • If Tablestore is unavailable, LogHub Shipper waits for up to 500 milliseconds and tries again.
  • LogHub Shipper regularly records persistent checkpoints.
  • If LogHub Shipper is unavailable due to an issue such as an upgrade, LogHub Shipper continues to consume logs from the last checkpoint upon recovery.
  • We recommend that you make sure that different logs in the same Logstore are written to different rows in the data table. This ensures the eventual consistency between the data table and the Logstore even when LogHub Shipper retries data consumption.
  • LogHub Shipper writes data to data tables by using the UpdateRow operation of Tablestore. Therefore, multiple LogHub Shippers can write data to the same data table. In this case, we recommend that you make sure that the LogHub Shippers write data to different attribute columns.
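The following Python sketch illustrates why writing each log to its own row keeps the data table consistent when consumption is retried from a checkpoint. It is a simplified model under stated assumptions: `update_row` is an in-memory stand-in that merges columns into a row, not the Tablestore SDK call, and the logs and column names are made up for the example.

```python
def update_row(table, primary_key, attrs):
    """In-memory stand-in for an upsert: merge columns into the row."""
    table.setdefault(primary_key, {}).update(attrs)

def consume(table, logs, start):
    """Write logs[start:] to the table; each log maps to its own row."""
    for log in logs[start:]:
        update_row(table, (log["ip"], log["time"]), {"status": log["status"]})

logs = [
    {"ip": "10.0.0.1", "time": "t1", "status": "200"},
    {"ip": "10.0.0.1", "time": "t2", "status": "404"},
    {"ip": "10.0.0.2", "time": "t1", "status": "200"},
]

# Uninterrupted run: every log is written exactly once.
once = {}
consume(once, logs, 0)

# Interrupted run: all logs were written, but the last persistent
# checkpoint only covered the first log, so two logs are replayed.
replayed = {}
consume(replayed, logs, 0)
consume(replayed, logs, 1)

print(once == replayed)  # True: replaying rewrites the same rows

# A second writer that uses a different attribute column does not
# overwrite the first writer's column in the same row.
update_row(once, ("10.0.0.1", "t1"), {"region": "example"})
print(once[("10.0.0.1", "t1")])
```

If two different logs shared the same primary key, a replay could change which log wins, which is why distinct rows are recommended.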

Status tables

LogHub Shipper uses the status table that you create in Tablestore to provide you with related information.

When you use the status table, take note of the following items:

  • Multiple LogHub Shippers can share the same status table.
  • When no errors occur, each LogHub Shipper container adds a record to the status table every 5 minutes.
  • When an error occurs but Tablestore is still available, the LogHub Shipper container immediately adds a record to the status table.
  • We recommend that you specify day-level time to live (TTL) for the status table. This way, the status table retains only recent data.

The status table contains the following four primary key columns:

  • project_logstore: String type. The names of the Log Service project and Logstore, separated by a vertical bar (|).
  • shard: Integer type. The shard number in Log Service.
  • target_table: String type. The name of the data table that stores the log data that is scrubbed and converted.
  • timestamp: Integer type. The time when a LogHub Shipper container adds a record to the status table. The value is a UNIX timestamp. Unit: milliseconds.

The following attribute columns record the status of data import. All attribute columns of a row in the status table are optional and may not exist.

  • shipper_id: String type. The ID of a LogHub Shipper container. This is the name of the container host.
  • error_code: String type. The error code defined in Tablestore. If no error occurs, this attribute column does not exist. For more information, see Error codes.
  • error_message: String type. The specific error message that is returned in Tablestore. If no error occurs, this attribute column does not exist.
  • failed_sample: String type. The log for which an error is reported. The value is a JSON string.
  • __time__: Integer type. The maximum value of the __time__ field of log data that the LogHub Shipper container writes to Tablestore after the most recent update of the status table by the container. For more information, see Terms.
  • row_count: Integer type. The number of logs that the LogHub Shipper container writes to Tablestore after the most recent update of the status table by the container.
  • cu_count: Integer type. The number of capacity units (CUs) that the LogHub Shipper container consumes after the most recent update of the status table by the container. For more information, see Read/write throughput.
  • skip_count: Integer type. The number of logs that the LogHub Shipper container discards during data scrubbing after the most recent update of the status table by the container.
  • skip_sample: String type. One of the logs that the LogHub Shipper container discards after the most recent update of the status table by the container. The value is a JSON string. The log of the container records each discarded log and the reason for discarding the log.
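As a rough illustration of the layout described above, the following Python sketch assembles the primary key and a few attribute columns of one status record. The helper names and all sample values (project, Logstore, table, container host, and error code) are hypothetical placeholders; the sketch builds plain Python structures and does not call Tablestore.

```python
import time

def status_primary_key(project, logstore, shard, target_table, now_ms=None):
    """Build the four primary key columns of a status record, in order."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)  # UNIX timestamp in milliseconds
    return [
        ("project_logstore", f"{project}|{logstore}"),
        ("shard", shard),
        ("target_table", target_table),
        ("timestamp", now_ms),
    ]

def status_attributes(shipper_id, row_count, skip_count, error=None):
    """Build attribute columns; error columns exist only when an error occurred."""
    attrs = {
        "shipper_id": shipper_id,
        "row_count": row_count,
        "skip_count": skip_count,
    }
    if error is not None:
        code, message = error
        attrs["error_code"] = code
        attrs["error_message"] = message
    return attrs

pk = status_primary_key("my-project", "my-logstore", 0, "my_table", now_ms=1453809242000)
ok = status_attributes("container-host-1", row_count=120, skip_count=3)
bad = status_attributes("container-host-1", 0, 0, error=("ExampleErrorCode", "example message"))
print(pk)
print(ok)
print(bad)
```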

Configurations

When you create LogHub Shipper, you need to specify the following environment variables for the container:

  • access_key_id and access_key_secret: the AccessKey ID and AccessKey secret of the Alibaba Cloud account that is used by LogHub Shipper.
  • loghub: the configurations of Log Service that are required by LogHub Shipper. The value is a JSON object that consists of the following parameters:
    • endpoint
    • logstore
    • consumer_group
  • tablestore: the configurations of Tablestore that are required by LogHub Shipper. The value is a JSON object that consists of the following parameters:
    • endpoint: the endpoint of the region in which the Tablestore instance is located.
    • instance: the name of the Tablestore instance.
    • target_table: the name of the data table. The data table must be in the specified instance.
    • status_table: the name of the status table. The status table must be in the specified instance.
  • exclusive_columns: the blacklist of attribute columns. The value is a JSON array that consists of JSON strings.

    If you specify this environment variable, LogHub Shipper does not write the specified fields to the data table as attribute columns. For example, the data table contains primary key A, the exclusive_columns environment variable is set to ["B", "C"], and a log contains three fields: A, B, and D. In this case, one row is written to the data table. The row contains primary key A and attribute column D. Field B exists in the log, but Column B is specified as an exclusive column, so LogHub Shipper does not write Column B to the data table. Field C does not exist in the log, so LogHub Shipper has no Column C to write.

  • transform: the simple conversions to apply. The value is a JSON object. Each key is the name of a column in the data table. The column can be a primary key column. Each value is a simple conversion expression that LogHub Shipper interprets based on the following rules:
    • A log field is an expression.
    • An unsigned integer is an expression.
    • A string in double quotes is an expression. Strings can contain the escape characters \" and \\\\.
    • ( func arg... ) is also an expression. Zero or multiple spaces or tabs can exist before and after the parentheses. At least one space exists between func and the parameter that follows func, and between different parameters. Each parameter must be an expression. The system supports the following functions:
      • ->int: converts a string to an integer. This function requires two parameters. The first parameter is the base, which can be 2 to 36. The second parameter is the string that you want LogHub Shipper to convert. Letters in the second parameter are not case-sensitive and represent the digit values 10 to 35.
      • ->bool: converts a string to a Boolean value. This function requires one parameter, which is the string that you want LogHub Shipper to convert. "true" is converted to true and "false" is converted to false. Other strings are regarded as invalid.
      • crc32: calculates CRC32 for a string and returns the result as an Integer value. This function requires one parameter whose value is a string you want LogHub Shipper to calculate.

If a log field is missing or an error occurs during conversion, the column that corresponds to the key is regarded as a missing column. If an error occurs, the log of the container records the details of the error.
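The expression rules and the missing-column behavior can be sketched as a small Python evaluator. This is an approximation under several assumptions, not LogHub Shipper's parser: the names `tokenize`, `parse`, `evaluate`, and `transform_column` are hypothetical, strings are assumed to use the escapes `\"` and `\\`, the tokenizer is more lenient about spacing than the rules require, Python's `int` accepts a few inputs (such as a leading sign) that the real converter may reject, and `None` stands in for a missing column.

```python
import re
import zlib

TOKEN = re.compile(r'\s*(\(|\)|"(?:[^"\\]|\\.)*"|[^\s()"]+)')

def tokenize(text):
    """Split an expression into parentheses, quoted strings, and bare words."""
    text, tokens, pos = text.strip(), [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"bad expression at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def parse(tokens):
    """Parse one expression; a list node represents ( func arg... )."""
    tok = tokens.pop(0)
    if tok == "(":
        items = []
        while tokens and tokens[0] != ")":
            items.append(parse(tokens))
        if not tokens:
            raise ValueError("missing closing parenthesis")
        tokens.pop(0)
        return items
    if tok == ")":
        raise ValueError("unexpected closing parenthesis")
    return tok

def to_int(base, text):
    if not 2 <= base <= 36:
        raise ValueError("base must be 2 to 36")
    return int(text, base)  # letters are case-insensitive digits 10 to 35

def to_bool(text):
    if text == "true":
        return True
    if text == "false":
        return False
    raise ValueError("not a Boolean string")

FUNCS = {
    "->int": to_int,
    "->bool": to_bool,
    "crc32": lambda text: zlib.crc32(text.encode("utf-8")),
}

def evaluate(node, log):
    if isinstance(node, list):               # function call
        func, *args = node
        return FUNCS[func](*(evaluate(arg, log) for arg in args))
    if node.startswith('"'):                 # string literal
        return re.sub(r"\\(.)", r"\1", node[1:-1])
    if node.isascii() and node.isdigit():    # unsigned integer
        return int(node)
    return log[node]                         # log field

def transform_column(expr, log):
    """Return the converted value, or None if the column counts as missing."""
    try:
        tokens = tokenize(expr)
        tree = parse(tokens)
        if tokens:
            raise ValueError("unexpected trailing tokens")
        return evaluate(tree, log)
    except (KeyError, ValueError, TypeError, IndexError):
        return None

log = {"status": "200", "ok": "true"}
print(transform_column("(->int 10 status)", log))   # 200
print(transform_column('(->int 16 "FF")', log))     # 255
print(transform_column("(->bool status)", log))     # None
```

A real implementation would also write the error details to the container log instead of silently returning a missing column.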

Data scrubbing follows a single rule: if a primary key column is missing, LogHub Shipper discards the corresponding log.
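The scrubbing rule and the exclusive_columns behavior can be combined in a short Python sketch. The `build_row` helper is a hypothetical name, conversions from transform are left out, and `None` is used to represent a discarded log.

```python
def build_row(log, primary_key, exclusive_columns=()):
    """Return (primary key, attribute columns), or None if the log is discarded."""
    if any(name not in log for name in primary_key):
        return None  # a primary key column is missing, so the log is scrubbed
    pk = {name: log[name] for name in primary_key}
    skipped = set(primary_key) | set(exclusive_columns)
    attrs = {k: v for k, v in log.items() if k not in skipped}
    return pk, attrs

# The example from the exclusive_columns description: primary key A,
# exclusive columns B and C, and a log with fields A, B, and D.
print(build_row({"A": 1, "B": 2, "D": 4}, ["A"], ["B", "C"]))  # ({'A': 1}, {'D': 4})

# A log without the primary key column A is discarded.
print(build_row({"B": 2, "D": 4}, ["A"], ["B", "C"]))  # None
```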