Alibaba Cloud Log Service Add-on for Splunk collects logs from Alibaba Cloud Log Service (SLS) and sends them to Splunk.

Architecture

The main features are as follows.
  • Splunk data inputs create SLS consumer groups to pull logs, which enables real-time log consumption from SLS.
  • Logs can be forwarded from the data inputs to Splunk indexers over either of two protocols: the Splunk private protocol or Splunk HTTP Event Collector (HEC).
(Figure: architecture of the add-on)

Mechanism

(Figure: consumer group mechanism)
  • Each data input creates one consumer which belongs to one consumer group.
  • A consumer group consists of multiple consumers. The consumers in a consumer group consume data from a logstore.
  • A Logstore has multiple shards.
    • Each shard can be allocated to one consumer.
    • One consumer can consume data in multiple shards.
  • The name of each consumer is unique within a consumer group, because it combines the consumer group name, hostname, process ID, and event protocol.

For more information about SLS consumer groups, see Use consumer groups to consume logs.
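
To make the mechanism concrete, the following is a minimal sketch of one consumer in a consumer group, based on the aliyun-log-python-sdk that the add-on's consumer is built on. All endpoint, credential, and name values are placeholders.

  from aliyun.log.consumer import (ConsumerProcessorBase, ConsumerWorker,
                                   CursorPosition, LogHubConfig)

  class PrintProcessor(ConsumerProcessorBase):
      # Called with each batch of log groups fetched from the shards allocated to this consumer.
      def process(self, log_groups, check_point_tracker):
          for log_group in log_groups.LogGroups:
              for log in log_group.Logs:
                  print({content.Key: content.Value for content in log.Contents})
          check_point_tracker.save_check_point(True)  # persist the checkpoint on the SLS server

  # Placeholder values; consumers that share the same consumer group name split the shards.
  option = LogHubConfig('cn-huhehaote.log.aliyuncs.com', '<AccessKey ID>', '<AccessKey Secret>',
                        '<Project name>', '<Logstore name>',
                        '<consumer group name>', '<unique consumer name>',
                        cursor_position=CursorPosition.BEGIN_CURSOR)
  ConsumerWorker(PrintProcessor, consumer_option=option).start()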

Preparations

  • Get an AccessKey for SLS.

    You need to get an AccessKey for SLS project access in Alibaba Cloud RAM. For more information, see AccessKey and Configure an AccessKey pair.

    You can also use the Log Service permission assistant in the Log Service console for this purpose. A typical RAM policy is shown below:
    Note Replace <Project name> with the name of the Log Service project and <Logstore name> with the name of the source logstore. Both can contain the wildcard character * for fuzzy matching.
    {
      "Version": "1",
      "Statement": [
        {
          "Action": [
            "log:ListShards",
            "log:GetCursorOrData",
            "log:GetConsumerGroupCheckPoint",
            "log:UpdateConsumerGroup",
            "log:ConsumerGroupHeartBeat",
            "log:ConsumerGroupUpdateCheckPoint",
            "log:ListConsumerGroup",
            "log:CreateConsumerGroup"
          ],
          "Resource": [
            "acs:log:*:*:project/<Project name>/logstore/<Logstore name>",
            "acs:log:*:*:project/<Project name>/logstore/<Logstore name>/*"
          ],
          "Effect": "Allow"
        }
      ]
    }
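
    Once the policy is attached, you can confirm that the AccessKey pair reaches the logstore with a quick sanity check. This is a minimal sketch using the aliyun-log-python-sdk; the endpoint, project, and logstore values are placeholders:

    from aliyun.log import LogClient

    # Placeholder endpoint and credentials; log:ListShards must be allowed by the policy above.
    client = LogClient('cn-huhehaote.log.aliyuncs.com', '<AccessKey ID>', '<AccessKey Secret>')
    shards = client.list_shards('<Project name>', '<Logstore name>').get_shards_info()
    print('shard IDs:', [shard['shardID'] for shard in shards])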
  • Splunk version and operating system check
    • Make sure to use the latest add-on version.
    • Operating system: Linux, macOS, or Windows.
    • Splunk version: Splunk heavy forwarder 8.0+, Splunk indexer 7.0+.
  • Configure HTTP Event Collector on Splunk Enterprise.
    If you use HEC to ingest events into Splunk, make sure HEC is enabled. Skip this step if you use the Splunk heavy forwarder private protocol directly.
    Note Enabling indexer acknowledgment is currently not supported, so leave it disabled when you create an Event Collector token.
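
    Before configuring the add-on, you can verify that HEC accepts events. This is a minimal sketch; the host and token are placeholders, and you should switch to https:// if SSL is enabled in the HEC global settings:

    import requests

    # Placeholder host and token; HEC listens on port 8088 by default.
    resp = requests.post('http://<HEC host>:8088/services/collector',
                         headers={'Authorization': 'Splunk <HEC token>'},
                         json={'event': 'HEC connectivity check', 'sourcetype': 'manual'})
    print(resp.status_code, resp.text)  # expect 200 and {"text":"Success","code":0}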

Installation of Add-on

There are two ways to install the add-on from the Splunk Web UI.
  • Method 1
    1. Click the gear icon next to Apps.
    2. Click Browse more apps.
    3. Search for Alibaba Cloud Log Service Add-on for Splunk, find the add-on, and click Install.
    4. After the installation is complete, restart the Splunk service as prompted.
  • Method 2
    1. Click the gear icon next to Apps.
    2. Click Install app from file.
    3. Select the Upgrade app check box, and click Upload.

      Upload the .tgz file that you downloaded from App Search Results.

    4. After the installation is complete, restart the Splunk service as prompted.

Configuration for Add-on

  1. Through Splunk Web UI, select the app Alibaba Cloud Log Service Add-on for Splunk.
  2. Global account settings.
    Select Configuration > Account. On the Account tab, you can set the SLS AccessKey.
    Note The username and password configured in the global account settings correspond to the AccessKey ID and AccessKey Secret.
  3. Log level settings.
    Select Configuration > Logging. On the Logging tab, you can set the log level for the add-on.
  4. Add data input.
    1. Click Inputs.
    2. Click Create New Input to create a new data input.
      Table 1. Data input parameters
      • name (Required, String): The unique name for the data input.
      • Interval (Required, Integer): Time in seconds after which the Splunk data input process is restarted when it exits unexpectedly. Default value: 10 seconds.
      • index (Required, String): The Splunk index.
      • SLS AccessKey (Required, String): The AccessKey, used as a pair of an AccessKey ID and an AccessKey Secret. Set it to the account name configured in the global account settings.
        Note The username and password configured in the global account settings correspond to the AccessKey ID and AccessKey Secret.
      • SLS endpoint (Required, String): The SLS service endpoint, for example, cn-huhehaote.log.aliyuncs.com or https://cn-huhehaote.log.aliyuncs.com. For more information, see Service endpoint.
      • SLS project (Required, String): The project in Log Service. For more information, see Manage a project.
      • SLS logstore (Required, String): The logstore in Log Service. For more information, see Manage a Logstore.
      • SLS consumer group (Required, String): The name of the consumer group used to consume the logstore. To scale out, configure multiple data inputs with the same consumer group name. For more information, see Use consumer groups to consume logs.
      • SLS cursor start time (Required, String): The start time from which data is consumed: "begin", "end", or a time in ISO format (for example, 2018-12-26 0:0:0+8:00). This parameter takes effect only when the consumer group is created for the first time; afterwards, consumption resumes from the saved checkpoint.
        Note The time refers to the time at which logs arrive at Log Service.
      • SLS heartbeat interval (Required, Integer): The heartbeat interval in seconds between the consumer and the SLS server. Default value: 60 seconds.
      • SLS data fetch interval (Required, Integer): The interval in seconds between data fetches. If data does not arrive frequently, do not set this value too small. Default value: 1 second.
      • Topic filter (Optional, String): A topic filter string with ";" as the separator. Logs whose topic is in this filter string are not sent to Splunk. For example, "TopicA;TopicB" means that logs with topic "TopicA" or "TopicB" are ignored.
      • Unfolded fields (Optional, JSON): A JSON string that maps a topic to a field list, in the form {"topicA": ["field_nameA1", "field_nameA2", ...], "topicB": ["field_nameB1", "field_nameB2", ...], ...}. For example, {"actiontrail_audit_event": ["event"]} means that if the topic of a log is "actiontrail_audit_event", the value of the field "event" is unfolded from string to JSON. (See the sketch after this table.)
      • Event source (Optional, String): The source of an event in Splunk.
      • Event source type (Optional, String): The source type of an event in Splunk.
      • Event retry times (Optional, Integer): The number of retries; 0 means infinite retransmission. Default value: 0.
      • Event protocol (Required): The protocol used to send events to Splunk: HTTP for HEC, HTTPS for HEC, or Private protocol. If the private protocol is selected, the parameters below can be ignored.
      • HEC host (Required when Event protocol is HEC, String): The host of HEC. For more information, see Set up and use HTTP Event Collector in Splunk Web.
      • HEC port (Required when Event protocol is HEC, Integer): The port of HEC.
      • HEC token (Required when Event protocol is HEC, String): The HEC token. For more information, see HEC token.
      • HEC timeout (Required when Event protocol is HEC, Integer): The HEC timeout in seconds. Default value: 120 seconds.
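
      To illustrate how Topic filter and Unfolded fields behave for a single log, here is an illustrative sketch; it is not the add-on's actual code:

      import json

      topic_filter = 'TopicA;TopicB'.split(';')                 # topics to skip
      unfolded_fields = {'actiontrail_audit_event': ['event']}  # topic -> fields to unfold

      def transform(log):
          topic = log.get('__topic__', '')
          if topic in topic_filter:
              return None                              # ignored, not sent to Splunk
          for field in unfolded_fields.get(topic, []):
              if field in log:
                  log[field] = json.loads(log[field])  # unfold the string into JSON
          return log

      print(transform({'__topic__': 'actiontrail_audit_event', 'event': '{"name": "test"}'}))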

Get started

  • Search data
    First, enable the data inputs configured in the previous step. In the Splunk Web UI, click Search & Reporting to open the App: Search & Reporting page, where you can see the logs collected from Alibaba Cloud Log Service.
    (Figure: search results)
  • Internal log
    • Use the command index="_internal" | search "SLS info" to view status information about SLS consumption.
    • Use the command index="_internal" | search "error" to view run-time errors.

Sizing and security guide

  • Performance Metrics

    The performance and throughput of data transmission are highly affected by several factors, including:

    • SLS endpoint: whether you use a service endpoint for the public network, the classic network, a VPC, or Internet-based global acceleration. If the network permits, the classic network and VPC endpoints normally give the best performance. For more information, see Service endpoint.
    • Bandwidth: the network bandwidth between SLS and the Splunk heavy forwarder hosting the add-on, and between the Splunk heavy forwarder and the indexers.
    • Capability of Splunk indexers: the receiving speed of the Splunk indexers.
    • Shard count: the more shards an SLS logstore has, the higher its data transmission capability. You can specify the number of shards. For more information, see Manage a shard.
    • Count of configured data inputs: the more data inputs configured with the same consumer group name for one logstore, the higher the throughput.
      Note The number of concurrently running consumers is limited by the number of shards of the SLS logstore.
    • Memory and CPU cores of the Splunk heavy forwarders: normally, a single Splunk data input uses 1 to 2 GB of memory and up to one CPU core.

    When the above conditions are met, a single Splunk data input creates one consumer that consumes raw logs at 1 to 2 MB/s. Determine the number of shards based on the generation rate of raw logs.

    For example, if a logstore produces raw logs at 10 MB/s, split the logstore into at least 10 shards and configure 10 data inputs with the same settings for the add-on. In single-machine deployment mode, the machine should have 10 free CPU cores and 12 GB of memory, as in the sizing sketch below.
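
    The rule of thumb above can be written as a small back-of-the-envelope helper; the 1 to 2 MB/s per-consumer rate and the per-input resource costs are the figures from this section:

    import math

    raw_rate_mb_s = 10        # raw log production speed of the logstore (MB/s)
    per_consumer_mb_s = 1     # conservative consumption speed per data input (MB/s)

    inputs_needed = math.ceil(raw_rate_mb_s / per_consumer_mb_s)  # also the minimum shard count
    print(f'shards and data inputs: {inputs_needed}')
    print(f'CPU cores: {inputs_needed}')             # up to one core per input
    print(f'memory: {inputs_needed * 1.2:.0f} GB')   # roughly 1 to 2 GB per input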

  • High availability

    A consumer group stores checkpoints on the server. When the data consumption process of one consumer stops, another consumer automatically takes over the process and continues the process from the checkpoint of the last consumption. You can create Splunk data inputs on different servers. If a server stops or is damaged, a Splunk data input on another server can take over the consumption process and continue the process from the checkpoint. To have sufficient consumers, you can create more Splunk data inputs than the number of shards on different servers.

  • HTTPS
    • Log Service

      To use HTTPS to encrypt the data transmitted between the add-on and Log Service, you must set the prefix of the service endpoint to https://, for example, https://cn-beijing.log.aliyuncs.com.

      The server certificate for *.aliyuncs.com is issued by GlobalSign. Most Linux and Windows servers are preconfigured to trust this certificate by default. If a server does not trust it, you can visit the following website to download and install a valid certificate: Install a trusted root CA or self-signed certificate.

    • Splunk

      To have HEC listen and communicate over HTTPS, click the Enable SSL checkbox in Splunk HEC Global Settings. For more information, see Configure HTTP Event Collector on Splunk Enterprise.

  • AccessKey storage protection

    Key information, such as the AccessKey of SLS and the HEC token of Splunk, is stored in Splunk confidential storage to prevent unexpected leakage.

FAQ

  • Configuration error
    • Basic configuration validations run when you add or modify data inputs in the web console. For details, see Table 1.
    • SLS configuration errors, for example, failure to create a consumer group.
      • Command: index="_internal" | search "error"
      • Exception logs:
        aliyun.log.consumer.exceptions.ClientWorkerException: 
        error occour when create consumer group, 
        errorCode: LogStoreNotExist, 
        errorMessage: logstore xxxx does not exist
      • ConsumerGroupQuotaExceed

        You can configure up to 20 consumer groups for each logstore in SLS; the ConsumerGroupQuotaExceed error is reported when this quota is exceeded. We recommend that you log on to the Log Service console in advance to view the status of the consumer groups. If the remaining quota is insufficient, delete the consumer groups that you no longer need.
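
        A hedged sketch for checking the quota with the aliyun-log-python-sdk (the same check can be done in the console; all values are placeholders):

        from aliyun.log import LogClient

        client = LogClient('cn-huhehaote.log.aliyuncs.com', '<AccessKey ID>', '<AccessKey Secret>')
        groups = client.list_consumer_group('<Project name>', '<Logstore name>').get_consumer_groups()
        print(len(groups), 'of 20 consumer groups used')
        # Free up quota by deleting a group that is no longer needed:
        # client.delete_consumer_group('<Project name>', '<Logstore name>', '<unused group>')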

  • Permission errors
    • No permission to access Alibaba Cloud Log Service
      • Command: index="_internal" | search "error"
      • Exception logs:
        aliyun.log.consumer.exceptions.ClientWorkerException: 
        error occour when create consumer group, 
        errorCode: SignatureNotMatch, 
        errorMessage: signature J70VwxYH0+W/AciA4BdkuWxK6W8= not match
    • No permission to access HEC
      • Command: index="_internal" | search "error"
      • Exception logs:
        ERROR HttpInputDataHandler - Failed processing http input, token name=n/a, channel=n/a, source_IP=127.0.0.1, reply=4, events_processed=0, http_input_body_size=369
        
        WARNING pid=48412 tid=ThreadPoolExecutor-0_1 file=base_modinput.py:log_warning:302 | 
        SLS info: Failed to write [{"event": "{\"__topic__\": \"topic_test0\", \"__source__\": \"127.0.0.1\", \"__tag__:__client_ip__\": \"10.10.10.10\", \"__tag__:__receive_time__\": \"1584945639\", \"content\": \"goroutine id [0, 1584945637]\", \"content2\": \"num[9], time[2020-03-23 14:40:37|1584945637]\"}", "index": "main", "source": "sls log", "sourcetype": "http of hec", "time": "1584945637"}] remote Splunk server (http://127.0.0.1:8088/services/collector) using hec. 
        Exception: 403 Client Error: Forbidden for url: http://127.0.0.1:8088/services/collector, times: 3
      • Possible causes:
        • HEC is not configured or is disabled.
        • The basic configuration and global settings for HEC are incorrect. For example, if HTTPS is used, SSL must be enabled.
        • Indexer acknowledgment is enabled for the Event Collector token; it must be disabled.
  • Consumption delay
    • If the service log feature is enabled, you can do the following:
      • You can log on to the Log Service console to view the status of a consumer group. For more information, see Service log dashboards.
      • You can use CloudMonitor to view latency associated with consumer groups and to configure alerts. For more information, see Configure an alert.
    • If the service log feature is not enabled, you can view the status of the consumer group in the Log Service console. For more information, see View the status of a consumer group.

    Refer to Performance Metrics to resolve the problem, for example, by splitting the logstore into more shards or creating more data inputs under the same consumer group.

  • Network jitter
    • Command: index="_internal" | search "SLS info: Failed to write"
    • Exception logs:
      WARNING pid=58837 tid=ThreadPoolExecutor-0_0 file=base_modinput.py:log_warning:302 |
      SLS info: Failed to write [{"event": "{\"__topic__\": \"topic_test0\", \"__source__\": \"127.0.0.1\", \"__tag__:__client_ip__\": \"10.10.10.10\", \"__tag__:__receive_time__\": \"1584951417\", \"content2\": \"num[999], time[2020-03-23 16:16:57|1584951417]\", \"content\": \"goroutine id [0, 1584951315]\"}", "index": "main", "source": "sls log", "sourcetype": "http of hec", "time": "1584951417"}] remote Splunk server (http://127.0.0.1:8088/services/collector) using hec. 
      Exception: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer')), times: 3

    Normally, the add-on retries automatically. If the problem persists, contact your network administrator.