When you use LoongCollector to collect logs, you might encounter issues such as regular expression parsing failures, incorrect file paths, or traffic that exceeds the processing capacity of a shard. Simple Log Service (SLS) provides a diagnostic feature to help you locate collection errors. To monitor LoongCollector in real time, you can use built-in alert monitoring rules to receive alert notifications through channels such as DingTalk.
Prerequisites
-
You have collected logs by using LoongCollector. For more information, see Continuously collect text logs from a host.
Runtime diagnostics
Diagnostics are available in two editions: Advanced Diagnostics and Basic Diagnostics.
-
Advanced Diagnostics (Recommended): Provides a diagnostic dashboard that clearly displays LoongCollector-related exceptions and lets you query for exception information over an extended period.
-
Basic Diagnostics: Provides information about collection exceptions that occurred within the last hour.
Scenarios
-
Abnormal LoongCollector status: heartbeat failures, inactive processes, or SSL certificate exceptions.
-
Log collection exceptions: uncollected logs, high collection latency, or parsing failures such as regular expression matching errors.
-
Configuration errors: Incorrect file paths, mismatched machine group IP addresses, or cross-account permission issues.
-
Performance bottlenecks: The collection rate approaches or exceeds the default limit, such as 20 MB/s, which causes logs to be dropped.
-
Container log collection issues: Frequent pod restarts or rapid log rotation that leads to incomplete collection.
-
Plugin and custom collection issues: Failures in custom plugins, such as a Grok parsing plugin, or failures in HTTP data source collection.
-
Data reliability issues: Log loss that occurs when LoongCollector is not running or log rotation is too fast.
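As an illustration of the performance bottleneck scenario, the following minimal sketch classifies an observed collection rate against a limit. The function name `classify_rate`, the 80% warning ratio, and the treatment of 1 MB as 1024 * 1024 bytes are assumptions made for this example. Only the 20 MB/s figure comes from the text above.

```python
# Example default limit mentioned in the text (MB/s).
DEFAULT_LIMIT_MB_PER_S = 20.0


def classify_rate(bytes_collected: int, seconds: float,
                  limit_mb_per_s: float = DEFAULT_LIMIT_MB_PER_S,
                  warn_ratio: float = 0.8) -> str:
    """Return 'ok', 'approaching', or 'exceeded' for an observed rate.

    Assumption: 1 MB = 1024 * 1024 bytes. warn_ratio is a hypothetical
    early-warning threshold, not a Simple Log Service setting.
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    rate_mb_per_s = bytes_collected / (1024 * 1024) / seconds
    if rate_mb_per_s >= limit_mb_per_s:
        return "exceeded"
    if rate_mb_per_s >= warn_ratio * limit_mb_per_s:
        return "approaching"
    return "ok"
```

A result of "approaching" suggests reviewing the collection configuration before logs start to be dropped.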
Procedure
-
Log on to the Simple Log Service console. In the Project list, click the destination Project.
-
Click Log Storage. In the list of Logstores, hover the pointer over the destination Logstore, and then click the icon.
-
Click Advanced Diagnostics or Basic Diagnostics to view the diagnostic information.
-
View the diagnostic information.
Basic diagnostics
The Log Collection Errors panel displays a list of all LoongCollector collection errors for the Logstore. You can click an error code to view its details. For more information, see Common data collection errors in Simple Log Service.
Advanced diagnostics
On the LoongCollector/Logtail Exception Monitoring page, view information such as Active Clients and All Error Information. For more information about the Collection Exception Monitoring dashboard, see View data reports. For more information about error codes, see Common data collection errors in Simple Log Service.
-
After you resolve the issues, verify that no new errors occur. LoongCollector reports errors at 10-minute intervals, so a new error can take up to 10 minutes to appear. Historical errors continue to appear until they expire, and you can ignore them.
To view the full logs that were dropped due to parsing failures, you can check the LoongCollector operational logs. The paths are as follows:
Host scenario: In the /usr/local/ilogtail/loongcollector.LOG file on the server.
Container scenario: In the container's /usr/local/ilogtail/loongcollector.LOG file.
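One way to search the operational log for dropped entries is sketched below. The helper names `find_lines` and `scan_log` are hypothetical, and the default keyword is only a placeholder: the exact wording of parsing failure messages in the log is not specified here, so adjust the keyword to match what your log contains.

```python
from typing import Iterable, Iterator, List

# Path taken from the text above; the same path applies inside a container.
LOG_PATH = "/usr/local/ilogtail/loongcollector.LOG"


def find_lines(lines: Iterable[str], keyword: str) -> Iterator[str]:
    """Yield lines that contain keyword, compared case-insensitively."""
    needle = keyword.lower()
    for line in lines:
        if needle in line.lower():
            yield line.rstrip("\n")


def scan_log(path: str = LOG_PATH, keyword: str = "parse") -> List[str]:
    """Return matching lines from the operational log.

    The default keyword "parse" is an assumption for illustration only.
    """
    # errors="replace" keeps the scan going if the log has invalid bytes.
    with open(path, encoding="utf-8", errors="replace") as f:
        return list(find_lines(f, keyword))
```

For example, `scan_log(keyword="regex")` would return every line of the log that mentions "regex" in any letter case.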
Runtime monitoring
Simple Log Service provides built-in alert policies to monitor LoongCollector in real time. You can configure these policies for the following monitoring purposes:
-
Monitor LoongCollector for heartbeat anomalies
Query logs in the internal-diagnostic_log Logstore with the search condition __topic__:logtail_status to count the number of machines that have normal LoongCollector heartbeats. Then, configure an Alert Rule to trigger an alert if the heartbeat count falls below the expected value. This helps you troubleshoot machines that are down or have network issues.
-
Create alerts for LoongCollector collection exceptions
Run the __topic__: logtail_alarm query to analyze the number of exceptions of different types that occurred in the last 15 minutes. These exceptions can include unreadable files, insufficient permissions, and parsing failures. This helps you promptly identify and resolve configuration issues to prevent log loss.
-
Receive early warnings for performance bottlenecks
Use the Logtail Exception Monitoring dashboard to monitor the runtime status and resource usage of LoongCollector, such as CPU and memory. The dashboard displays the number of active LoongCollector instances, a list of restarts, and complete error information. This helps you identify performance bottlenecks or abnormal restarts.
-
Monitor centralized log collection
Use the LoongCollector File Collection Monitoring dashboard to centrally manage the log collection status across multiple accounts or regions. The dashboard displays metrics such as the number of collected files, average latency, and parsing failure rate. This helps ensure collection continuity.
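The logic behind the heartbeat and collection exception alerts can be sketched as follows. This is not how the built-in rules are implemented. It assumes the query results have already been fetched as a list of dictionaries, and the field names `ip`, `alarm_type`, `alarm_count`, and `__time__` are assumptions about the record layout that you should check against your own internal-diagnostic_log data.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def count_heartbeat_machines(status_records: Iterable[Mapping]) -> int:
    """Count distinct machines that reported at least one status record.

    Assumption: each logtail_status record carries the machine IP in "ip".
    """
    return len({r["ip"] for r in status_records if r.get("ip")})


def heartbeat_alert(status_records: Iterable[Mapping], expected: int) -> bool:
    """Return True when fewer machines reported heartbeats than expected."""
    return count_heartbeat_machines(status_records) < expected


def count_alarms_by_type(alarm_records: Iterable[Mapping], now: float,
                         window_s: int = 15 * 60) -> Dict[str, int]:
    """Sum alarm counts per type for records in the last window_s seconds.

    Assumptions: "__time__" is a Unix timestamp in seconds, "alarm_type"
    names the exception, and a missing "alarm_count" counts as 1.
    """
    totals = Counter()
    for r in alarm_records:
        age = now - float(r["__time__"])
        if 0 <= age <= window_s:
            totals[r["alarm_type"]] += int(r.get("alarm_count", 1))
    return dict(totals)
```

With an expected count of 3 and records from only two distinct IP addresses, `heartbeat_alert` returns True, which corresponds to the condition under which the alert rule would fire.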
Procedure
-
Configure an Action Policy to define how notifications are sent when an alert changes status.
-
Log on to the Simple Log Service console.
-
In the Project list, find the Project for which you enabled important logs and click the Project name.
-
In the left-side navigation pane, click Alerts. On the Alert Center page, choose .
-
In the list of action policies, find the sls.app.logtail.builtin Action Policy, and click Modify in the Actions column.
-
In the Edit Action Policy dialog box, select and configure a notification channel based on your business requirements. For more information, see Notification methods. Then, click OK.
-
Create an Alert Rule to specify the conditions for triggering an alert when the LoongCollector runtime status meets a threshold.
-
On the Alert Center page, click Alert Rules, and then click the icon next to Create Alert Rule.
-
Click Create from Template. In the Create from Template panel, under All Templates, click Logtail Error Monitoring. Then, in the panel that appears on the right, click the card for the rule that you want to create.
-
In the Create Alert Rule panel, review the configuration. The built-in alert monitoring rule has preset parameters. Click OK. For more information about the configuration parameters, see Create an alert rule.