With the rapid development of cloud computing and the Internet of Things (IoT), more and more business scenarios are pushing computing and data collection capabilities to the edge. From production line devices in smart manufacturing and onboard systems of new energy vehicles to retail terminals and smart home devices distributed across various locations, the observability data (logs, metrics, and traces) generated by these terminal devices is crucial for business operations, fault diagnosis, and user experience optimization.
However, the environments of terminal devices are extremely complex:
● Unstable network environment: Terminal devices often run in environments with weak networks or intermittent network disconnections. Problems such as mobile network signal fluctuations, unstable Wi-Fi connections, and high cross-region network latency are common.
● Unguaranteed power supply: Many terminal devices rely on batteries for power supply or face the threat of unexpected power outages.
● Extremely limited resources: The CPU, memory, storage, and network bandwidth of edge devices are extremely limited.
Collecting observability data under such extreme conditions is a significant challenge. For example, when a vehicle travels in a remote area, it may have a weak or intermittent network connection, or no connection at all, for a long time. When the vehicle is turned off and power is cut, all monitoring data cached in memory is lost. In scenarios such as tunnels and underground parking lots, data collection is interrupted, and key fault diagnosis data cannot be transmitted back.
This article describes in detail how LoongCollector provides a complete and reliable collection solution for edge scenarios such as weak networks and power outages.

The network conditions of the operating environments of terminal devices are far more complex than those of data centers:
● Weak network scenarios: Unstable mobile network signals, weak Wi-Fi signals, and cross-region long links cause low network bandwidth, high latency, and high packet loss rates.
● Intermittent network disconnection: Device movement, network transitions, and temporary network faults cause periodic network interruptions.
● Long-term offline: In some scenarios, devices need to work offline for a long time, accumulating a large amount of data to be uploaded.
For example, when an in-vehicle terminal device travels through a remote area, it may have a weak network connection or no connection at all for a long time, and a normal network state is rare. When the vehicle is turned off or under maintenance, the in-vehicle terminal device is also powered off.
In unstable environments such as weak networks and power outages, ensuring reliable data delivery and consistency is the biggest challenge:
● Data loss threat: Network interruptions, device power outages, and process abnormalities can all lead to data loss.
● Sequential guarantee: Time-series data (such as metrics and traces) must maintain the chronological order of collection.
The network bandwidth of terminal devices is usually subject to strict limits:
● High traffic costs: The traffic fees of 4G/5G mobile networks are much higher than those of data center leased lines.
● Bandwidth contention: The upload of collected data needs to compete for limited bandwidth resources with business data transmission.
● Upload rate limits: Some carriers or network environments restrict upload bandwidth.
In such an environment, how to efficiently compress data, intelligently control the sending rate, and avoid bandwidth being fully occupied by collection traffic has become a problem that must be solved.
LoongCollector is a high-performance, high-reliability observability data collector open sourced by Alibaba Cloud. While supporting internal Alibaba Cloud deployments at the scale of tens of millions of instances, it has also been deeply optimized for edge scenarios.
LoongCollector provides complete observability data collection capabilities:
● Host monitoring: Real-time collection of system metrics such as CPU, memory, disk, and network. It supports 100+ system metric items.
● Prometheus protocol: Fully compatible with the Prometheus ecosystem, allowing the collection of all application metrics that support Prometheus collection.
● Log collection: High-efficiency text log collection capabilities, supporting multiple log formats and parsing methods.
For resource-constrained terminal devices, LoongCollector has undergone extreme performance optimization:


This means that under the same hardware conditions, LoongCollector can support more collection tasks or run stably on devices with more limited resources.
● Production-grade verification: Supports the observability data collection of more than 10 million instances within Alibaba Cloud.
● High availability: Single-instance high availability, supporting self-recovery from faults.
● Time-tested: Verified through years of Double 11 sales promotions, burst traffic, and other extreme scenarios.
For edge scenarios such as weak networks, power outages, and network disconnections, LoongCollector adopts a core architecture design of "data persistence + asynchronous sending + intelligent retry."

Separation of collection and sending: Data collection and network sending are completely decoupled, and the collection procedure is not affected by network status.
Local persistence: Log data is already persisted locally in files, so this step mainly targets data that has no inherent persistence, such as metrics. The solution writes all collected metrics to local files first so that no data is lost even during power outages or restarts.
Asynchronous consumption: An independent sending thread reads data from the persisted files and sends it. It automatically retries when sending fails.
Intelligent backpressure: When the network is abnormal, the data reading speed is automatically controlled to avoid excessive memory usage.
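The four design points above can be illustrated with a minimal sketch. It is not LoongCollector's implementation: the function names `persist` and `drain`, the JSON-lines file layout, and the `send` callback are all assumptions made for illustration. The collection side only ever appends to a local file, and the sending side drains that file from a byte offset whenever the network allows.

```python
import json
import os
import tempfile

def persist(path, sample):
    """Collection side: append one sample as a JSON line and force it to disk."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(sample) + "\n")
        f.flush()
        os.fsync(f.fileno())  # reduce the window in which a power cut loses data

def drain(path, offset, send):
    """Sending side: send complete lines starting at byte `offset`.

    Returns the new offset. Stops at the first failed send so that the
    unsent line is retried on the next call.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line.endswith(b"\n"):  # EOF or a partially written line
                break
            if not send(json.loads(line)):
                break
            offset = f.tell()
    return offset

# Hypothetical usage: collection keeps working while the network is down.
path = os.path.join(tempfile.mkdtemp(), "metrics.log")
network = {"up": False}
delivered = []

def send(sample):
    if network["up"]:
        delivered.append(sample)
    return network["up"]

for i in range(3):
    persist(path, {"seq": i})

offset = drain(path, 0, send)       # network down: nothing is delivered
offset_while_down = offset
network["up"] = True
offset = drain(path, offset, send)  # network back: the backlog is sent in order
```

Because collection never calls `send` directly, a slow or broken network only delays delivery; it does not slow down or drop collection.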
Traditional metric collection solutions (such as Telegraf and Prometheus Pushgateway) usually send collected metric data directly to the server-side. This architecture works well in stable network environments but has fatal flaws in edge scenarios:
● Data loss due to network disconnection: When the network breaks, newly collected metric data cannot be sent and can only be discarded or cached in the memory.
● Data loss due to power outage: When the device unexpectedly loses power, all data cached in the memory is lost.
● High memory pressure: When the network is disconnected for a long time, the memory cache expands rapidly, eventually leading to out-of-memory (OOM).
LoongCollector innovatively performs local file persistence for host monitoring metrics and Prometheus metrics, realizing reliable storage of metric data:

● Periodically scrapes host and application metric data.
● Flushes data in text format to the local file system.
● Automatically rotates files. The size of a single file and the number of files are configurable. Only a fixed number of the most recent files is retained, and expired files are deleted automatically to prevent historical data from filling up the disk.
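The rotation step can be sketched as follows. This is a simplified illustration, not the flusher's actual code: the `.1`, `.2` suffix naming scheme and the function names are assumptions, and the tiny size limit is chosen only to make the behavior visible.

```python
import os
import tempfile

def rotate(path, max_files):
    """Shift path -> path.1 -> path.2 ..., dropping the oldest file."""
    oldest = f"{path}.{max_files - 1}"
    if os.path.exists(oldest):
        os.remove(oldest)
    for i in range(max_files - 2, 0, -1):
        src = f"{path}.{i}"
        if os.path.exists(src):
            os.replace(src, f"{path}.{i + 1}")
    if max_files > 1:
        os.replace(path, f"{path}.1")
    else:
        os.remove(path)

def append_with_rotation(path, line, max_file_size, max_files):
    """Append a line, rotating first if the active file would exceed the limit."""
    data = (line + "\n").encode("utf-8")
    if os.path.exists(path) and os.path.getsize(path) + len(data) > max_file_size:
        rotate(path, max_files)
    with open(path, "ab") as f:
        f.write(data)

# Hypothetical usage: 30-byte files, at most 3 files kept on disk.
directory = tempfile.mkdtemp()
path = os.path.join(directory, "host.log")
for i in range(10):
    append_with_rotation(path, f"sample-{i}", max_file_size=30, max_files=3)

files = sorted(os.listdir(directory))
with open(path, encoding="utf-8") as f:
    newest = f.read()
with open(path + ".2", encoding="utf-8") as f:
    oldest_kept = f.read()
```

After ten writes only three files remain, and the first three samples have been discarded together with the oldest file, which is the trade-off that bounds disk usage.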
After metric data is persisted, how to efficiently and reliably send the data to the server-side is the next key issue. Challenges faced by traditional solutions include:
● Sending blocks collection: If the sending thread is coupled with the collection thread, a slow network slows down the collection speed.
● Sequence assurance: Metric data usually has time sequence requirements, and it is necessary to ensure that data is sent in the order of collection time.
● Resumable transmission: After the network recovers, sending needs to continue from the point of disconnection without duplication or omission.
LoongCollector adopts the method of file collection to asynchronously consume persisted metric data. The key technical points are as follows:
● Checkpoint mechanism: LoongCollector maintains fine-grained checkpoints that record the reading position of each file. Even if the process crashes or power is lost while a file is being read, reading resumes from the last recorded position after a restart without data loss.
● File sequence assurance: The file rotation order ensures that data is sent in the order of collection time.
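A checkpoint of this kind can be sketched as a byte offset stored next to the data file. The sketch below is an assumption-laden illustration, not LoongCollector's checkpoint format: the JSON layout, the `.tmp` suffix, and the function names are hypothetical. The checkpoint is written to a temporary file, synced, and then renamed, so a crash leaves either the old or the new checkpoint and never a half-written one.

```python
import json
import os
import tempfile

def load_checkpoint(cp_path):
    """Return the saved byte offset, or 0 if no checkpoint exists yet."""
    try:
        with open(cp_path, encoding="utf-8") as f:
            return json.load(f)["offset"]
    except FileNotFoundError:
        return 0

def save_checkpoint(cp_path, offset):
    """Write the checkpoint atomically: temp file, fsync, then rename."""
    tmp = cp_path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"offset": offset}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, cp_path)

def read_batch(data_path, cp_path, max_lines):
    """Read up to max_lines complete lines after the checkpoint, then advance it."""
    offset = load_checkpoint(cp_path)
    lines = []
    with open(data_path, "rb") as f:
        f.seek(offset)
        while len(lines) < max_lines:
            line = f.readline()
            if not line.endswith(b"\n"):  # EOF or a partially written line
                break
            lines.append(line.decode("utf-8").rstrip("\n"))
            offset = f.tell()
    save_checkpoint(cp_path, offset)
    return lines

# Hypothetical usage: the second call stands in for a process restart.
directory = tempfile.mkdtemp()
data_path = os.path.join(directory, "metric.log")
cp_path = os.path.join(directory, "metric.log.checkpoint")
with open(data_path, "w", encoding="utf-8") as f:
    for i in range(5):
        f.write(f"m{i}\n")

first = read_batch(data_path, cp_path, max_lines=2)
second = read_batch(data_path, cp_path, max_lines=10)
```

In a real sender the checkpoint would typically be advanced only after the server acknowledges the batch, which yields at-least-once delivery; the sketch advances it on read to keep the example short.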
In a weak network environment, if data is read and sent without control, it will lead to:
● Memory usage surge: The read speed is much higher than the send speed, and data is stacked in the memory.
● Send queue overflow: After the queue is full, data is discarded or the process crashes.
● Bandwidth exhaustion: Collection traffic occupies the full bandwidth, impacting normal business communication.
LoongCollector implements a multilayer intelligent backpressure mechanism:
Send concurrency adaptation: Drawing on the TCP congestion control algorithm, LoongCollector dynamically adjusts the send concurrency based on the network status. This adaptive mechanism ensures:
● Fast response: When the network is normal, bandwidth is fully utilized to send data quickly.
● Fast convergence: When the network is abnormal, the send frequency is quickly reduced to avoid invalid retries.
● Automatic recovery: After the network recovers, concurrency is automatically increased without manual intervention.
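The adaptive behavior described here resembles the additive-increase/multiplicative-decrease (AIMD) rule used in TCP congestion control. The sketch below shows that rule applied to a concurrency limit. The class name, the bounds, and the halving factor are assumptions for illustration; the source does not state the exact constants LoongCollector uses.

```python
class AimdConcurrency:
    """Concurrency limit that grows by 1 on success and halves on failure."""

    def __init__(self, min_limit=1, max_limit=16, decrease_factor=0.5):
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.decrease_factor = decrease_factor
        self.limit = min_limit

    def on_success(self):
        # Additive increase: probe for more bandwidth one step at a time.
        self.limit = min(self.max_limit, self.limit + 1)

    def on_failure(self):
        # Multiplicative decrease: back off quickly when sends fail.
        self.limit = max(self.min_limit, int(self.limit * self.decrease_factor))

# Hypothetical usage: healthy network, then an outage, then recovery.
ctl = AimdConcurrency()
for _ in range(7):
    ctl.on_success()
healthy = ctl.limit

for _ in range(4):
    ctl.on_failure()
during_outage = ctl.limit

for _ in range(3):
    ctl.on_success()
recovered = ctl.limit
```

The asymmetry is the point of the design: the limit collapses within a few failed sends, which avoids wasting bandwidth on retries, and climbs back gradually once sends succeed again.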

● Queue backpressure: When the send queue backlog reaches the threshold, LoongCollector pauses file reading. This prevents unlimited memory growth and ensures that the system runs stably even in a weak network environment for a long time.
● Traffic throttling: LoongCollector supports configuring the maximum send rate to prevent collection traffic from impacting services. For example, the following setting in ilogtail_config.json limits the maximum send rate to 1 MiB/s (1,048,576 bytes per second):
{
    "max_bytes_per_sec": 1048576
}
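A byte-rate cap such as max_bytes_per_sec is commonly enforced with a token bucket. The sketch below shows that general technique; it is not taken from LoongCollector's source, and the class name, the one-second burst capacity, and the explicit `now` timestamps (used to keep the example deterministic) are assumptions.

```python
class TokenBucket:
    """Allow sends while tokens (bytes) are available; refill at a fixed rate."""

    def __init__(self, rate_bytes_per_sec, capacity_bytes, now=0.0):
        self.rate = rate_bytes_per_sec
        self.capacity = capacity_bytes
        self.tokens = capacity_bytes
        self.last = now

    def try_send(self, nbytes, now):
        # Refill according to the time elapsed since the last call.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False  # caller should wait and retry later

# Hypothetical usage with the 1 MiB/s limit and 768 KiB payloads.
bucket = TokenBucket(rate_bytes_per_sec=1048576, capacity_bytes=1048576)
payload = 786432
first = bucket.try_send(payload, now=0.0)   # fits in the initial budget
second = bucket.try_send(payload, now=0.0)  # budget exhausted for now
third = bucket.try_send(payload, now=0.5)   # half a second of refill is enough
```

When `try_send` returns False, the sender waits instead of dropping data, so throttling combines naturally with the queue backpressure described above.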
This example uses host monitoring and Prometheus collection for an application.
Modify ilogtail_config.json in the /usr/local/ilogtail directory:
{
    "discard_old_data": false,
    "config_server_lost_connection_timeout": 604800,
    "force_quit_read_timeout": 604800,
    "max_bytes_per_sec": 1048576,
    "cpu_usage_limit": 0.4,
    "mem_usage_limit": 384,
    "working_ip": "192.168.0.1"
}
Create an input_host_monitor.yaml file in the /etc/ilogtail/config/local directory to collect host metrics into a local file, such as /usr/local/ilogtail/metrics/host.log.
enable: true
inputs:
  - Type: input_host_monitor
    Interval: 15
flushers:
  - Type: flusher_file
    MaxFileSize: 104857600
    MaxFiles: 10
    FilePath: /usr/local/ilogtail/metrics/host.log
Create an input_prometheus.yaml file in the /etc/ilogtail/config/local directory to first collect the Prometheus metrics into a local file, such as /usr/local/ilogtail/metrics/metric.log.
enable: true
inputs:
  - Type: input_prometheus
    ScrapeConfig:
      job_name: node
      host_only_mode: true
      scrape_interval: 15s
      scrape_timeout: 10s
      static_configs:
        - targets: ["localhost:12345"]
flushers:
  - Type: flusher_file
    MaxFileSize: 524288000
    MaxFiles: 10
    FilePath: /usr/local/ilogtail/metrics/metric.log
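As a rough sizing check for these settings, the following sketch works out the worst-case disk footprint and the time needed to upload a full backlog. It assumes that MaxFileSize is expressed in bytes, that MaxFiles caps the number of retained files per flusher, and that data is sent uncompressed at exactly max_bytes_per_sec; real numbers will differ if any of these assumptions does not hold.

```python
MIB = 1024 * 1024

def retention_bytes(max_file_size, max_files):
    """Upper bound on disk space used by one rotating output."""
    return max_file_size * max_files

def drain_seconds(backlog_bytes, max_bytes_per_sec):
    """Time to upload a backlog at a fixed send rate."""
    return backlog_bytes / max_bytes_per_sec

host_budget = retention_bytes(104857600, 10)   # host.log settings
prom_budget = retention_bytes(524288000, 10)   # metric.log settings
total_budget = host_budget + prom_budget
full_drain = drain_seconds(total_budget, 1048576)

print(f"host metrics:       {host_budget // MIB} MiB")
print(f"prometheus metrics: {prom_budget // MIB} MiB")
print(f"full backlog:       {full_drain / 60:.0f} minutes at 1 MiB/s")
```

Under these assumptions the device needs about 6,000 MiB of free disk space, and a completely full backlog takes roughly 100 minutes to upload after the network recovers, which is worth weighing against the expected length of offline periods.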
Finally, use a file collection configuration that reads the persisted metric files and parses each line as JSON:
{
    "aggregators": [],
    "global": {},
    "logSample": "",
    "inputs": [
        {
            "Type": "input_file",
            "FilePaths": [
                "/usr/local/ilogtail/metrics/*.log"
            ],
            "MaxDirSearchDepth": 0,
            "FileEncoding": "utf8",
            "EnableContainerDiscovery": false
        }
    ],
    "processors": [
        {
            "Type": "processor_parse_json_native",
            "SourceKey": "content",
            "KeepingSourceWhenParseFail": true
        }
    ]
}
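Conceptually, the processor stage in this configuration takes the raw line stored under the `content` key and expands it into fields. The sketch below imitates that idea in plain Python. It is only an approximation of the intent behind SourceKey and KeepingSourceWhenParseFail, not the behavior of processor_parse_json_native itself; the function name and the sample field names are hypothetical.

```python
import json

def parse_json_line(event, source_key="content", keep_source_on_fail=True):
    """Expand a JSON string stored under source_key into top-level fields."""
    raw = event.get(source_key)
    try:
        parsed = json.loads(raw)
        if not isinstance(parsed, dict):
            raise ValueError("not a JSON object")
    except (TypeError, ValueError):
        # Parsing failed: keep the raw line so that nothing is silently lost.
        if keep_source_on_fail:
            return dict(event)
        return {k: v for k, v in event.items() if k != source_key}
    out = {k: v for k, v in event.items() if k != source_key}
    out.update(parsed)
    return out

# Hypothetical usage with one well-formed line and one truncated line.
ok = parse_json_line({"content": '{"__name__": "cpu_util", "value": 0.42}'})
bad = parse_json_line({"content": "truncated {"})
```

Keeping the source on failure matters in this scenario because a power cut can leave a partially written last line in a metric file; retaining it makes such cases visible instead of dropping them.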
Notices




Observability data collection in edge scenarios is a long-underestimated technical challenge. Unstable networks, unreliable power supplies, and the complexity of data consistency cause traditional collection solutions to fail frequently in edge environments. LoongCollector systematically solves these problems through an innovative architecture of "data persistence + asynchronous sending + intelligent retry":
● Guaranteed reliable delivery of observability data
● Effectively implemented throttling
However, the LoongCollector collection solution still has room for optimization.