Network Disconnection and Power Outage Without Data Interruption: LoongCollector Reliable Collection Solution for Extreme Edge Scenarios

This article describes in detail how LoongCollector provides a complete and reliable collection solution for edge scenarios such as weak networks and power outages.

Background

With the rapid development of cloud computing and the Internet of Things (IoT) today, more and more business scenarios are pushing computing and data collection capabilities to the edge. From production line devices in smart manufacturing and onboard systems of new energy vehicles to retail terminals and smart home devices distributed across various locations, the observability data (logs, metrics, and traces) generated by these terminal devices is crucial for business operations, fault diagnosis, and user experience optimization.

However, the environments of terminal devices are extremely complex:

● Unstable network environment: Terminal devices often run in environments with weak networks or intermittent network disconnections. Problems such as mobile network signal fluctuations, unstable Wi-Fi connections, and high cross-region network latency are common.

● Unguaranteed power supply: Many terminal devices rely on batteries for power supply or face the threat of unexpected power outages.

● Extremely limited resources: The CPU, memory, storage, and network bandwidth of edge devices are extremely limited.

Collecting observability data under such extreme conditions faces significant challenges. For example, when a vehicle travels in a remote area, the vehicle may be in a state of weak network or network disconnection for a long time, and the network signal is intermittent. When the vehicle is turned off and the power is cut, all monitoring data cached in the memory is lost. In scenarios such as tunnels and underground parking lots, data collection is interrupted, and key fault diagnosis data cannot be transmitted back.


Three Major Challenges in Observability Data Collection for Terminal Devices

Challenge 1: Complex Network Environments

The network conditions of the operating environments of terminal devices are far more complex than those of data centers:

● Weak network scenarios: Unstable mobile network signals, weak Wi-Fi signals, and cross-region long links cause low network bandwidth, high latency, and high packet loss rates.

● Intermittent network disconnection: Device movement, network transitions, and temporary network faults cause periodic network interruptions.

● Long-term offline: In some scenarios, devices need to work offline for a long time, accumulating a large amount of data to be uploaded.

For example, when an in-vehicle terminal device is in transit in a remote area, the device may be in a state of weak network or network disconnection for a long time, and the normal network state is rare. When the vehicle is turned off or under maintenance, the in-vehicle terminal device is also powered off.

Challenge 2: Reliable Delivery of Observability Data

In unstable environments such as weak networks and power outages, ensuring reliable data delivery and consistency is the biggest challenge:

● Data loss threat: Network interruptions, device power outages, and process abnormalities can all lead to data loss.

● Ordering guarantee: Time-series data (such as metrics and traces) must keep the chronological order in which it was collected.

Challenge 3: Network Bandwidth Throttling

The network bandwidth of terminal devices is usually subject to strict limits:

● High traffic costs: The traffic fees of 4G/5G mobile networks are much higher than those of data center leased lines.

● Bandwidth contention: The upload of collected data needs to compete for limited bandwidth resources with business data transmission.

● Upload rate limits: Some carriers or network environments restrict upload bandwidth.

In such an environment, how to efficiently compress data, intelligently control the sending rate, and avoid bandwidth being fully occupied by collection traffic has become a problem that must be solved.

LoongCollector: Reliable Collection Solution Optimized for Edge Scenarios

LoongCollector is a high-performance, high-reliability observability data collector open-sourced by Alibaba Cloud. It supports deployments at the scale of tens of millions of instances inside Alibaba Cloud and has also been deeply optimized for edge scenarios.

Overview of Core Capabilities

Unified Observability Data Collection

LoongCollector provides complete observability data collection capabilities:

● Host monitoring: Real-time collection of system metrics such as CPU, memory, disk, and network. It supports 100+ system metric items.

● Prometheus protocol: Fully compatible with the Prometheus ecosystem, allowing the collection of all application metrics that support Prometheus collection.

● Log collection: High-efficiency text log collection capabilities, supporting multiple log formats and parsing methods.

Ultra-Low Resource Consumption

For resource-constrained terminal devices, LoongCollector has undergone extreme performance optimization:

This means that under the same hardware conditions, LoongCollector can support more collection tasks or run stably on devices with more limited resources.

Enterprise-Level Stability Assurance

● Production-grade verification: Supports the observability data collection of more than 10 million instances within Alibaba Cloud.

● High availability: Single-instance high availability, supporting self-recovery from faults.

● Time-tested: Verified through years of Double 11 sales promotions, burst traffic, and other extreme scenarios.

Solution Architecture: Data Persistence + Asynchronous Sending + Intelligent Retry

For edge scenarios such as weak networks, power outages, and network disconnections, LoongCollector adopts a core architecture design of "data persistence + asynchronous sending + intelligent retry."

Separation of collection and sending: Data collection and network sending are completely decoupled, and the collection procedure is not affected by network status.

Local persistence: Log data is already persisted in local files by nature, so this step mainly targets data that has no persistence of its own, such as metrics. The solution writes all collected metrics to local files first, so that no data is lost even during power outages or restarts.

Asynchronous consumption: An independent sending thread reads data from the persisted files and sends it, automatically retrying when sending fails.

Intelligent backpressure: When the network is abnormal, the data reading speed is automatically controlled to avoid excessive memory usage.

Metric Data Flushing Persistence

Traditional metric collection solutions (such as Telegraf and Prometheus Pushgateway) usually send collected metric data directly to the server. This architecture works well in stable network environments but has fatal flaws in edge scenarios:

● Data loss due to network disconnection: When the network breaks, newly collected metric data cannot be sent and can only be discarded or cached in the memory.

● Data loss due to power outage: When the device unexpectedly loses power, all data cached in the memory is lost.

● High memory pressure: When the network is disconnected for a long time, the memory cache expands rapidly, eventually leading to out-of-memory (OOM).

LoongCollector innovatively performs local file persistence for host monitoring metrics and Prometheus metrics, realizing reliable storage of metric data:

● Periodically scrapes host and application metric data.

● Flushes data in text format to the local file system.

● Rotates files automatically. The size of a single file and the number of files are configurable. Only a fixed number of the most recent files is retained, and expired files are deleted automatically to prevent historical data from filling up the disk.
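The flush-and-rotate behavior described above can be sketched as follows. This is a minimal illustration that assumes Python's standard `logging.handlers.RotatingFileHandler` as a stand-in for the file flusher; it is not LoongCollector's implementation. The helper names `make_metric_writer` and `persist_sample` and the JSON line layout are hypothetical, while `max_file_size` and `max_files` play the roles of the single file size and file count settings.

```python
import json
import logging
import os
import tempfile
import time
from logging.handlers import RotatingFileHandler


def make_metric_writer(path, max_file_size, max_files):
    """Return a logger that appends one line per sample and rotates by size."""
    logger = logging.getLogger("metric_writer:" + path)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # backupCount counts rotated files only, so the active file uses one slot.
    handler = RotatingFileHandler(
        path, maxBytes=max_file_size, backupCount=max_files - 1
    )
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


def persist_sample(logger, name, value, timestamp=None):
    """Write one metric sample as a JSON line (hypothetical layout)."""
    if timestamp is None:
        timestamp = int(time.time())
    # The handler flushes after every record; it does not fsync, so a real
    # collector would need an explicit sync policy to survive power loss.
    logger.info(json.dumps({"name": name, "value": value, "ts": timestamp}))


# Usage: tiny limits so that rotation is visible after a few hundred samples.
workdir = tempfile.mkdtemp()
writer = make_metric_writer(
    os.path.join(workdir, "host.log"), max_file_size=2048, max_files=3
)
for i in range(500):
    persist_sample(writer, "cpu_util", i % 100, timestamp=1700000000 + i)

files = sorted(os.listdir(workdir))
print(files)
```

The oldest rotated file is dropped once the file count is reached, so disk usage stays bounded by roughly the file size multiplied by the file count, at the cost of losing the oldest samples during a very long outage.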

File Collection Asynchronous Consumption Mechanism

After metric data is persisted, the next key issue is how to send it to the server efficiently and reliably. Challenges faced by traditional solutions include:

● Sending blocks collection: If the sending thread is coupled with the collection thread, a slow network slows down the collection speed.

● Sequence assurance: Metric data usually has time sequence requirements, and it is necessary to ensure that data is sent in the order of collection time.

● Resumable transmission: After the network recovers, sending needs to continue from the point of disconnection without duplication or omission.

LoongCollector adopts the method of file collection to asynchronously consume persisted metric data. The key technical points are as follows:

● Checkpoint mechanism: LoongCollector maintains fine-grained checkpoints that record the reading position of each file. Even if the process crashes or power is lost while a file is being read, reading resumes from the last recorded position after a restart, without data loss.

● File sequence assurance: The file rotation order ensures that data is sent in the order of collection time:

Earlier files are processed first.
Files in the same time segment are processed in increasing order of their sequence numbers.
The time recorded in the raw data can be used as the event time, which avoids data visualization issues caused by out-of-order UNIX timestamps.
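The checkpoint and ordering ideas above can be sketched as follows. This is a simplified illustration under stated assumptions, not LoongCollector's actual checkpoint format: rotated files are assumed to be named like `host.log.1`, `host.log.2` (a larger suffix means an older file), checkpoints are keyed by inode only, and the function names (`rotation_order`, `consume`, and so on) are hypothetical.

```python
import json
import os
import re
import tempfile


def rotation_order(filename):
    """Sort key that puts the oldest file first (larger suffix = older)."""
    match = re.search(r"\.(\d+)$", filename)
    return -int(match.group(1)) if match else 0


def load_checkpoint(path):
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def save_checkpoint(path, checkpoint):
    """Write to a temp file, fsync, then rename so a crash never leaves a torn file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(checkpoint, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def consume(data_dir, checkpoint_path, send):
    """Send complete lines in rotation order; stop at the first failed send."""
    checkpoint = load_checkpoint(checkpoint_path)
    for name in sorted(os.listdir(data_dir), key=rotation_order):
        path = os.path.join(data_dir, name)
        key = str(os.stat(path).st_ino)  # the inode survives a rename on rotation
        with open(path, "rb") as f:
            f.seek(checkpoint.get(key, 0))
            while True:
                line = f.readline()
                if not line.endswith(b"\n"):
                    break  # end of file, or a partially written last line
                if not send(line.rstrip(b"\n")):
                    return False  # keep the old offset; retry this line later
                checkpoint[key] = f.tell()
                save_checkpoint(checkpoint_path, checkpoint)
    return True


# Usage: the "network" fails after two lines, then recovers on the second run.
data_dir = tempfile.mkdtemp()
checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
with open(os.path.join(data_dir, "host.log.1"), "wb") as f:
    f.write(b"a\nb\n")
with open(os.path.join(data_dir, "host.log"), "wb") as f:
    f.write(b"c\nd\n")

delivered = []


def flaky_send(line):
    if len(delivered) == 2:
        return False
    delivered.append(line)
    return True


def reliable_send(line):
    delivered.append(line)
    return True


first = consume(data_dir, checkpoint_path, flaky_send)
second = consume(data_dir, checkpoint_path, reliable_send)
print(first, second, delivered)
```

Saving the checkpoint after every line keeps the sketch easy to reason about; a real collector would more likely batch checkpoint writes and accept a small window of possible re-sends after a crash.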

Intelligent Backpressure and Throttling

In a weak network environment, if data is read and sent without control, it will lead to:

● Memory usage surge: The read speed is much higher than the send speed, so data piles up in memory.

● Send queue overflow: After the queue is full, data is discarded or the process crashes.

● Bandwidth exhaustion: Collection traffic occupies the full bandwidth, impacting normal business communication.

LoongCollector implements a multilayer intelligent backpressure mechanism:

● Send concurrency adaptation: Drawing on the TCP congestion control algorithm, LoongCollector dynamically adjusts the send concurrency based on the network status. This adaptive mechanism ensures:

Fast response: When the network is normal, bandwidth is fully utilized to send data quickly.
Fast convergence: When the network is abnormal, the send frequency is quickly reduced to avoid invalid retries.
Automatic recovery: After the network recovers, concurrency is automatically increased without manual intervention.

● Queue backpressure: When the send queue backlog reaches the threshold, LoongCollector pauses file reading. This prevents unlimited memory growth and ensures that the system runs stably even in a weak network environment for a long time.

● Traffic throttling: LoongCollector supports configuring the maximum send rate to prevent collection traffic from impacting services. For example, the following setting in ilogtail_config.json limits the maximum send rate to 1 MB/s (1,048,576 bytes per second):

{
  "max_bytes_per_sec": 1048576
}
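The concurrency adaptation and queue backpressure described above can be illustrated with a small sketch. It assumes an additive-increase/multiplicative-decrease (AIMD) rule, which is the classic TCP congestion-avoidance pattern; the source does not state which exact rule LoongCollector uses, and the class names, limits, and queue size here are hypothetical.

```python
import queue


class AimdConcurrency:
    """Additive-increase / multiplicative-decrease limit on in-flight sends."""

    def __init__(self, min_limit=1, max_limit=16):
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.limit = min_limit

    def on_success(self):
        # Network looks healthy: probe for more bandwidth one step at a time.
        self.limit = min(self.max_limit, self.limit + 1)

    def on_failure(self):
        # Network looks unhealthy: back off quickly by halving the limit.
        self.limit = max(self.min_limit, self.limit // 2)


class BackpressuredReader:
    """Moves lines into a bounded send queue and pauses when it is full."""

    def __init__(self, lines, send_queue):
        self._lines = iter(lines)
        self._queue = send_queue

    def pump(self):
        """Read until the queue is full or input runs out; return lines moved."""
        moved = 0
        while not self._queue.full():
            try:
                line = next(self._lines)
            except StopIteration:
                break
            self._queue.put_nowait(line)
            moved += 1
        return moved


# Usage: concurrency grows on success and collapses on repeated failures.
controller = AimdConcurrency()
for _ in range(5):
    controller.on_success()
grown = controller.limit
controller.on_failure()
controller.on_failure()
controller.on_failure()
backed_off = controller.limit

# Usage: the reader stops at the queue bound and resumes after one send.
send_queue = queue.Queue(maxsize=3)
reader = BackpressuredReader((f"line-{i}" for i in range(10)), send_queue)
first_pump = reader.pump()
paused_pump = reader.pump()
send_queue.get_nowait()  # the sender drains one item
resumed_pump = reader.pump()
print(grown, backed_off, first_pump, paused_pump, resumed_pump)
```

Because unread data stays in the persisted files rather than in memory, pausing the reader costs nothing but latency, which is why a bounded queue is sufficient here.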

Best Practice for LoongCollector Terminal Deployment

This example uses host monitoring and Prometheus collection for an application.

LoongCollector Start Parameter Suggestions

Modify ilogtail_config.json in the /usr/local/ilogtail directory.

Disable discard_old_data.
Increase config_server_lost_connection_timeout, the interval after which LoongCollector restarts when it loses the connection to the server-side. We recommend 604800 seconds (7 days).
Increase force_quit_read_timeout, the interval after which a blocked read triggers a restart. We recommend 604800 seconds (7 days).
Limit the maximum send rate with max_bytes_per_sec. The traffic for host monitoring plus one Java application is about 0.88 KB/s, so we recommend a limit of 1 MB/s (1048576) to prevent abnormal traffic usage.
Set working_ip. In mobile terminal scenarios, the IP address changes constantly, so we recommend that you specify a fixed IP address for the machine.

ilogtail_config.json

{
  "discard_old_data": false,
  "config_server_lost_connection_timeout": 604800,
  "force_quit_read_timeout": 604800,
  "max_bytes_per_sec": 1048576,
  "cpu_usage_limit": 0.4,
  "mem_usage_limit": 384,
  "working_ip": "192.168.0.1"
}
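One way to picture what a byte-rate limit such as max_bytes_per_sec does is a token bucket. The sketch below is only an illustration of that general technique; the source does not say how LoongCollector enforces the limit, and the `TokenBucket` class and its method names are hypothetical. The clock is passed in explicitly so the behavior is easy to check; in real use it would be `time.monotonic()`.

```python
class TokenBucket:
    """Byte-rate limiter: tokens refill at a fixed rate up to a burst capacity."""

    def __init__(self, rate_bytes_per_sec, capacity=None):
        self.rate = rate_bytes_per_sec
        # By default allow a burst of at most one second of traffic.
        self.capacity = capacity if capacity is not None else rate_bytes_per_sec
        self.tokens = float(self.capacity)
        self.last = 0.0

    def _refill(self, now):
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

    def try_consume(self, nbytes, now):
        """Return True and spend tokens if nbytes may be sent at time `now`."""
        self._refill(now)
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


# Usage with the recommended limit of 1048576 bytes per second.
bucket = TokenBucket(1048576)
full_burst = bucket.try_consume(1048576, now=0.0)   # uses the whole budget
over_budget = bucket.try_consume(1, now=0.0)        # nothing left yet
after_half_second = bucket.try_consume(524288, now=0.5)
print(full_burst, over_budget, after_half_second)
```

A sender that is refused simply waits and retries, which pairs naturally with queue backpressure: data that cannot be sent yet stays on disk.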

Collection Configuration

Local Configuration - Host Monitoring Collection Configuration

Create an input_host_monitor.yaml file in the /etc/ilogtail/config/local directory to collect host metrics and write them to a local file, such as /usr/local/ilogtail/metrics/host.log.

enable: true
inputs:
  - Type: input_host_monitor
    Interval: 15
flushers:
  - Type: flusher_file
    MaxFileSize: 104857600
    MaxFiles: 10
    FilePath: /usr/local/ilogtail/metrics/host.log

Local Configuration - Custom Metric Collection Configuration

Create an input_prometheus.yaml file in the /etc/ilogtail/config/local directory to collect the application's Prometheus metrics and write them to a local file first, such as /usr/local/ilogtail/metrics/metric.log.

input_prometheus.yaml

enable: true
inputs:
  - Type: input_prometheus
    ScrapeConfig:
      job_name: node
      host_only_mode: true
      scrape_interval: 15s
      scrape_timeout: 10s
      static_configs:
        - targets: ["localhost:12345"]
flushers:
  - Type: flusher_file
    MaxFileSize: 524288000
    MaxFiles: 10
    FilePath: /usr/local/ilogtail/metrics/metric.log

Server-side Management Configuration - File Collection Configuration

{
    "aggregators": [],
    "global": {},
    "logSample": "",
    "inputs": [
        {
            "Type": "input_file",
            "FilePaths": [
                "/usr/local/ilogtail/metrics/*.log"
            ],
            "MaxDirSearchDepth": 0,
            "FileEncoding": "utf8",
            "EnableContainerDiscovery": false
        }
    ],
    "processors": [
        {
            "Type": "processor_parse_json_native",
            "SourceKey": "content",
            "KeepingSourceWhenParseFail": true
        }
    ]
}

Notices

Do not use extension plugins as processing plugins. Extension plugins launch Go modules, which increases memory usage.
In mobile terminal scenarios, the IP address changes constantly. We recommend that you use identity machine groups.

LoongCollector Resource Monitoring Test Report

CPU: Average 0.02 cores, peak 0.028 cores

Memory: Average 31.5 MB, peak 35 MB

Network: Average 1.07 KB/s, peak 1.10 KB/s

Before compression: Average 12.99 KB/s, peak 13.13 KB/s
Actual sending: Average 1.07 KB/s, peak 1.10 KB/s

Disk: Average 6.07 KB/s, peak 13.03 KB/s
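The figures above can be combined into a rough back-of-the-envelope check. The sketch below makes several assumptions that the report does not confirm: 1 KB is taken as 1024 bytes, the disk write rate is assumed to be the combined rate for both metric files, and the rotation settings are taken from the example collection configurations in this article (MaxFileSize multiplied by MaxFiles). Treat the results as estimates only.

```python
KIB = 1024
SECONDS_PER_DAY = 86400

# Averages from the test report.
before_compression_kib_s = 12.99
sent_kib_s = 1.07
disk_write_kib_s = 6.07

# Values from the example configurations in this article.
max_bytes_per_sec = 1048576
host_capacity_bytes = 104857600 * 10     # host.log: MaxFileSize * MaxFiles
metric_capacity_bytes = 524288000 * 10   # metric.log: MaxFileSize * MaxFiles

# How much smaller the data is on the wire than before compression.
compression_ratio = before_compression_kib_s / sent_kib_s

# How far the configured send limit sits above the average send rate.
throttle_headroom = max_bytes_per_sec / (sent_kib_s * KIB)

# Roughly how long the device could stay offline before rotation starts
# deleting unsent data, assuming the average disk write rate holds.
offline_days = (
    (host_capacity_bytes + metric_capacity_bytes)
    / (disk_write_kib_s * KIB)
    / SECONDS_PER_DAY
)

print(f"compression ratio: about {compression_ratio:.1f}x")
print(f"send limit vs. average rate: about {throttle_headroom:.0f}x")
print(f"estimated offline retention: about {offline_days:.1f} days")
```

Under these assumptions the retained files cover more than the 7-day timeouts recommended earlier, but a deployment with more scrape targets or smaller rotation limits should redo the estimate with its own numbers.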

Summary and Outlook

Observability data collection in edge scenarios is a long-underestimated technical challenge. Unstable networks, unreliable power supplies, and the complexity of data consistency cause traditional collection solutions to fail frequently in edge environments. LoongCollector systematically solves these problems through an innovative architecture of "data persistence + asynchronous sending + intelligent retry":

● Guaranteed reliable delivery of observability data

Local persistence ensures no data loss during network disconnection
Asynchronous sending mechanism achieves decoupling between collection and sending
Intelligent retry and backpressure ensure complete data upload after the network recovers

● Effectively implemented throttling

Efficient compression reduces the data transfer volume
Intelligent throttling avoids bandwidth saturation that would impact services

However, the collection solution of LoongCollector still has more room for optimization:

The current persistence collection solution requires configuring two pipelines (collection pipeline + file read pipeline). Although flexible, it increases the user's understanding and configuration costs. LoongCollector is undergoing pipeline optimization to support internal persistence capabilities within a single pipeline, facilitating user configuration.
Terminal devices have a strong demand for STS authentication. LoongCollector is adapting to Alibaba Cloud STS dynamic authentication to support auto-refresh of temporary credentials, avoiding the threat of terminal AccessKey leakage.
In traffic cost-sensitive scenarios, every percentage point increase in compression rate means significant cost savings. LoongCollector is also exploring more extreme compression policies to further reduce network traffic.

Community