Diagnose and monitor runtime status - Simple Log Service

When you use the Simple Log Service data collector to collect logs, you might encounter issues such as failed regular expression parsing, incorrect file paths, or traffic that exceeds the processing capacity of shards. Simple Log Service provides a diagnostic feature to help you identify collection errors. For real-time monitoring of the data collector, you can use built-in alert rules to receive notifications through channels such as DingTalk.

Prerequisites

You have collected logs using the Simple Log Service data collector. For more information, see Collect logs from a host.
Enable important logs for the destination project
The following steps summarize how to enable service logs. For more information, see Enable service logs.
1. Log on to the Simple Log Service console. In the Project list, click the target Project. On the Project details page, click the Service Log tab, and then click Enable Service Logs.
2. In the Enable Service Log panel, select Important Logs and Task Operational Logs, and click OK.
  - This operation automatically creates a Project named log-service-{user-id}-{region} in the destination region.
  - The ingestion, storage, query, and analysis of important logs and task operational logs are free of charge. You are charged on a pay-as-you-go basis for operations such as data transformation and data shipping.

Diagnose runtime issues

Runtime diagnostics are available in Premium and Basic editions:

Premium diagnostics (recommended): Provides an exception diagnosis dashboard. The dashboard clearly displays data collector-related exceptions and supports queries over a longer time range.
Basic diagnostics: Provides collection exception information from the last hour.

Scenarios

Abnormal data collector status: Heartbeat failures, inactive processes, or SSL certificate exceptions.
Log collection exceptions: Logs are not collected, latency is too high, or parsing fails, such as Regex Match errors.
Configuration errors: Incorrect file paths, mismatched machine group IP addresses, or cross-account permission issues.
Performance bottlenecks: The collection rate is close to or exceeds the default limit, such as 20 MB/s, which causes logs to be dropped.
Container log collection issues: Frequent pod restarts or rapid log rotation that leads to incomplete collection.
Plugin and custom collection issues: Failures in custom plugins, such as Grok parsing, or issues with HTTP data source collection.
Log loss: Logs can be lost if LoongCollector is not running or if logs are rotated too quickly.

Procedure

Log on to the Simple Log Service console. In the Project list, click the destination Project.
Click Log Storage. In the Logstores list, hover over the destination Logstore, and then click the icon.
Click Premium Diagnostics or Basic Diagnostics.
View the diagnostic information.
Basic diagnostics
The Log Collection Errors panel displays a list of all Logtail collection errors for the Logstore. You can click an error code to view its details. For more information, see Common data collection errors in Simple Log Service.
Premium diagnostics
On the Logtail Exception Monitoring page, view information such as Active Clients and All Error Information. For more information about the Collection Exception Monitoring dashboard, see View data reports. For more information about error codes, see Common data collection errors in Simple Log Service.
After you resolve the issues, verify that no new errors appear. Historical errors remain displayed until they expire and can be ignored. Logtail reports error messages at 10-minute intervals, so wait at least 10 minutes before you check.
To view the complete logs that failed to parse and were discarded, see the LoongCollector operational log at the following path:
Host scenario: The /usr/local/ilogtail/ilogtail.LOG file on the server.
Container scenario: The /usr/local/ilogtail/loongcollector.LOG file in the container.
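The search through the operational log can be sketched as a small script. This is only an illustration: the helper name find_matching_lines is hypothetical, and the keyword to search for is an assumption, because the exact wording of parse-failure entries depends on the collector version and the error code involved.

```python
from pathlib import Path
from typing import Iterator


def find_matching_lines(log_path: str, keyword: str) -> Iterator[tuple[int, str]]:
    """Yield (line_number, line) for each line that contains keyword.

    The match is case-insensitive. Both arguments are supplied by the
    caller; nothing here assumes a particular log line format.
    """
    needle = keyword.lower()
    # errors="replace" keeps the scan going if the log contains invalid bytes.
    with Path(log_path).open("r", encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if needle in line.lower():
                yield number, line.rstrip("\n")
```

For example, calling find_matching_lines("/usr/local/ilogtail/ilogtail.LOG", "parse") on a host would list every line that mentions "parse"; replace the keyword with the error code shown in the diagnostics panel.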

Health monitoring

Simple Log Service provides built-in alert policies for real-time monitoring of the data collector. You can configure these policies to meet the following monitoring needs:

Monitor for abnormal data collector heartbeats
Query the __topic__:logtail_status logs in internal-diagnostic_log to count the number of machines with normal Logtail heartbeats. Then, you can configure an alert rule to trigger an alert if the heartbeat count falls below the expected value. This helps you troubleshoot machines that are down or have network issues.
Data collector anomaly alerts
Run the __topic__: logtail_alarm query to count the errors of each type that occurred in the last 15 minutes, such as unreadable files, insufficient permissions, and parsing failures. This helps you promptly identify and resolve configuration issues to prevent log loss.
Monitor for performance bottlenecks
Use the Logtail Exception Monitoring dashboard to monitor the runtime status and resource usage of Logtail, such as CPU and memory. The dashboard displays the number of active Logtail clients, a list of restarts, and all error messages. This helps you identify performance bottlenecks or abnormal restarts.
Monitor centralized log collection
Use the Logtail File Collection Monitoring dashboard to monitor the log collection status in multi-account or multi-region scenarios. The dashboard displays the number of collected files, the average latency, and the parsing failure rate. This helps ensure the continuity of log collection.
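The logic behind the heartbeat and anomaly checks can be sketched in a few lines. This is a minimal illustration that operates on log records already fetched as Python dictionaries, not the built-in alert implementation. The function names and the field names hostname, alarm_type, and time are assumptions; adjust them to the fields that your logtail_status and logtail_alarm logs actually contain.

```python
from collections import Counter
from datetime import datetime, timedelta


def heartbeat_shortfall(status_logs, expected_machines, host_field="hostname"):
    """Return how many machines are missing compared with the expected count.

    host_field is an assumed field name that identifies a machine.
    """
    seen = {log[host_field] for log in status_logs if host_field in log}
    return max(expected_machines - len(seen), 0)


def recent_error_counts(alarm_logs, now, window=timedelta(minutes=15),
                        type_field="alarm_type", time_field="time"):
    """Count errors per type among logs whose timestamp is within the window.

    type_field and time_field are assumed field names; time values are
    expected to be datetime objects.
    """
    cutoff = now - window
    return Counter(
        log[type_field]
        for log in alarm_logs
        if cutoff <= log[time_field] <= now
    )
```

An alert condition would then be something like heartbeat_shortfall(logs, expected) > 0, or any count in recent_error_counts exceeding a threshold that you choose.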

Procedure

Configure an action policy. An action policy defines how to send notifications when the status of a monitoring alert changes.
1. Log on to the Simple Log Service console.
2. In the Project list, find the Project for which you enabled important logs and click the Project name.
3. In the navigation pane on the left, click Alerts. On the Alert Center page, select the Notification Policies > Action Policy tab.
4. In the action policy list, find the sls.app.logtail.builtin action policy and click Modify in the Actions column.
5. In the Edit Action Policy dialog box, select a channel and configure it as described in Notification channels. Then, click Confirm.
Create an alert rule. An alert rule triggers an alert when the health status of LoongCollector reaches a specified threshold.
1. On the Alert Center page, click Alert Rules, and then click the icon to the right of Create Alert.
2. Click Create from Template. In the Create from Template panel, click Logtail Error Monitoring under All Templates. In the panel on the right, click the target card.
3. In the Create Alert panel, review the configuration. The built-in alert monitoring rule includes preset parameters. Click OK. For more information about the configuration parameters, see Create an alert rule.
