Announcing One-time File Collection for LoongCollector

This article introduces LoongCollector's new one-time file collection feature for fast, reliable, and automated batch ingestion of historical or static files.

Have you ever encountered a scenario like this: you need to quickly migrate historical logs, backfill data, or process a batch of static files, but are hindered by traditional collection tools that "monitor constantly and only collect incremental data"? The one-time file collection feature launched by LoongCollector is tailored to exactly this type of requirement.

LoongCollector is a next-generation data collector launched by Alibaba Cloud Simple Log Service (SLS) that combines performance, stability, and programmability, and is designed to build the next-generation observability pipeline. LoongCollector extends and integrates the observability technology stack, moving beyond the single-scenario limitations of traditional log collectors, and supports the collection, processing, routing, and sending of Logs, Metrics, Traces, Events, and Profiles.

Commercial version: https://www.alibabacloud.com/help/en/sls/what-is-sls-loongcollector/

Open source version: https://github.com/alibaba/loongcollector

Unlike regular continuous collection, a one-time file collection configuration scans the matching files once after it starts, reads them to completion, and then ends automatically, with no manual monitoring required. It applies to scenarios such as historical file migration, data backfilling, and temporary batch processing, saving resources while ensuring that data is uploaded completely.

1. Stable, controllable, and traceable cloud-based batch automated data collection

Before the one-time file collection capability was released, LoongCollector (and its predecessor iLogtail) already provided a "history file collection" solution (reference: Import history logs). Compared with the old solution, the new one-time file collection configuration is simpler and faster, offers stronger batch processing capabilities and a clearer lifecycle, and improves stability and observability through finer-grained checkpoints.

The new version of one-time collection upgrades static data collection from "standalone manual operation" to "cloud-based batch automation," making it more stable, controllable, and traceable. How are these advantages realized? Let us walk through them one by one.

1.1 Understanding the execution logic

1.1.1 One-time collection configuration

What is "one-time" collection configuration?

The collection pipelines of LoongCollector can be divided into two categories:

Continuous: Runs constantly and continuously discovers and collects new content (typically input_file).

One-time: Executes only once after starting and ends when collection is complete (typically input_static_file_onetime).

The scenarios for the two types of pipelines can be summarized as follows:

|  | Continuous pipeline | One-time pipeline |
| --- | --- | --- |
| Data collection | Continuous collection of Logs, Metrics, Traces, Events, etc. | Historical log import, batch backfilling, folder structure preview, etc. |
| O&M jobs | Probe installation, background upgrades | One-time diagnosis, temporary jobs |

How to identify a one-time collection configuration

On the client side, the "toggle" for a one-time pipeline is global.ExecutionTimeout.

● When global.ExecutionTimeout exists in the configuration, LoongCollector identifies the pipeline as one-time and computes its time-to-live (TTL).

● In addition to global.ExecutionTimeout, the inputs plugin also needs to be a one-time input plugin (usually ending with _onetime); otherwise, the configuration does not take effect. In this topic, we use the input_static_file_onetime plugin to perform one-time file collection.

The comparison sample is as follows:

# Normal file collection
enable: true
inputs:
  - Type: input_file
    FilePaths:
      - /var/log/*.log
flushers:
  - Type: flusher_stdout
    OnlyStdout: true
    Tags: true

# One-time file collection
enable: true
global:
  ExecutionTimeout: 3600
inputs:
  - Type: input_static_file_onetime
    FilePaths:
      - /var/log/history/*.log
flushers:
  - Type: flusher_stdout
    OnlyStdout: true
    Tags: true

Execution window and expiration mechanism of one-time configurations

To fully understand the one-time collection pipeline, you need to consider both the configuration lifecycle on the server/console side and the execution and reliability mechanisms on the client side. You can think of it as follows:

Server side/console side: Decides when the configuration is distributed and how long it is retained (affecting which machines can obtain the configuration and for how long).

Client side: Decides how the configuration runs after it is obtained, how long it runs, and how to resume from checkpoints (affecting whether collection can complete, and whether data is missed or duplicated after a restart).

Server side: Distribution window, execution window, and retention period

One-time collection configurations usually contain three key time points on the console side:

  1. Configuration distribution window: Distributes configurations only to machines that have reported heartbeats within a period after the configuration creation (5 minutes; updating the configuration refreshes the window).
  2. Configuration execution window: After the configuration takes effect, the maximum time allowed for the configuration to run is the execution timeout of the configuration (that is, global.ExecutionTimeout; default: 10 minutes; range: 10 minutes to 1 week).
  3. Configuration retention period: The server-side retains the configuration for a period for tracing or reuse (7 days).

If machines are added to a group after configuration creation, they may miss the initial distribution window. When the data volume is large, increase ExecutionTimeout in advance to prevent collection from being interrupted when the execution window elapses before collection completes.
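The distribution-window and retention rules above can be modeled roughly as follows. This is an illustrative sketch with hypothetical names (`can_distribute`, the constants), not LoongCollector's actual server-side code:

```python
# Assumed constants from the console behavior described above (illustrative).
DISTRIBUTION_WINDOW_S = 5 * 60   # heartbeat window after configuration creation
RETENTION_S = 7 * 24 * 3600      # server-side retention period

def can_distribute(created_ts: int, last_heartbeat_ts: int, now: int) -> bool:
    """Return True if a machine is eligible to receive a one-time config:
    it must have reported a heartbeat within the distribution window after
    creation, and the configuration must still be within its retention period."""
    in_window = created_ts <= last_heartbeat_ts <= created_ts + DISTRIBUTION_WINDOW_S
    retained = now <= created_ts + RETENTION_S
    return in_window and retained
```

Note that updating the configuration refreshes the distribution window, which this sketch does not model.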

Client side: Execution and expiration

  1. Timeout range and default value: The unit of global.ExecutionTimeout is seconds, and the range is limited to 600 to 604,800 (10 minutes to 1 week).
  2. Expiration behavior: For one-time configurations, the client computes and records the expiration time (start + ExecutionTimeout). When the configuration expires, the client cleans up the expired configuration file and removes the configuration's status record.
  3. Whether configuration updates trigger a rerun (to avoid erroneous or duplicate collection): When a one-time configuration is updated, the client combines the following factors to decide whether re-execution is required:
     ● If global.ForceRerunWhenUpdate is true, the client forces a rerun whenever anything in the configuration changes.
     ● If global.ForceRerunWhenUpdate is false (default), the client reruns only if the hash of inputs or ExecutionTimeout has changed. If neither has changed, the client does not rerun the configuration and keeps the original expiration time; otherwise, it treats the configuration as a new one-time configuration.

One of the design goals of one-time configurations is to avoid duplicate execution of the same configuration; the update policy therefore aims for controllable reruns.
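The rerun decision can be sketched in a few lines of Python. This is a simplified model of the rules above; the helper names and the hashing scheme are illustrative, not LoongCollector's real implementation:

```python
import hashlib

def inputs_hash(inputs: list) -> str:
    # Hash the serialized inputs section; a stand-in for the client's real hashing.
    return hashlib.sha256(repr(inputs).encode()).hexdigest()

def should_rerun(old_cfg: dict, new_cfg: dict) -> bool:
    """Decide whether an updated one-time configuration must be re-executed.

    Mirrors the rules described above: ForceRerunWhenUpdate forces a rerun;
    otherwise rerun only if the inputs hash or ExecutionTimeout changed.
    """
    if new_cfg.get("global", {}).get("ForceRerunWhenUpdate", False):
        return True
    same_inputs = inputs_hash(old_cfg["inputs"]) == inputs_hash(new_cfg["inputs"])
    same_timeout = (old_cfg["global"]["ExecutionTimeout"]
                    == new_cfg["global"]["ExecutionTimeout"])
    # Unchanged inputs and timeout -> keep the original expiration, no rerun.
    return not (same_inputs and same_timeout)
```

When `should_rerun` returns False, the client continues using the original expiration time instead of restarting the job.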

1.1.2 One-time file collection

"Snapshot Semantics" of one-time file collection

The core semantics of input_static_file_onetime can be summarized in three points:

  1. Search for files once at startup: The client scans the matching paths at startup and freezes the "list of matching files existing at that moment" into the checkpoint. Files added later are not included in the current collection target.
  2. Read only up to the file size at the startup moment: Each file records an initial size. During collection, even if data is appended to the file, the client reads only up to that initial size (avoiding uncontrollable duplication or missed collection caused by reading while writing).
  3. Support for rotation positioning: The file fingerprint contains information such as dev, inode, sig_hash, and sig_size. sig_hash and sig_size come from a signature over at most the first 1024 bytes of the file. When rotation changes a file's path, the client attempts to locate the file by dev+inode within the folder and continues reading, avoiding missed collection as much as possible.
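The snapshot semantics can be sketched as follows. This is a simplified Python illustration (LoongCollector itself is implemented in C++, and the actual signature algorithm may differ); the function names are hypothetical:

```python
import hashlib
import os

SIG_MAX_BYTES = 1024  # the signature covers at most the first 1 KiB of the file

def snapshot_file(path: str) -> dict:
    """Record the per-file snapshot taken at startup: size cap and signature."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(SIG_MAX_BYTES)
    return {
        "filepath": path,
        "size": size,                                   # reads never go past this offset
        "sig_size": len(head),
        "sig_hash": hashlib.sha256(head).hexdigest(),   # stand-in for the real hash
    }

def read_snapshot(path: str, snap: dict, offset: int, chunk: int = 4096) -> bytes:
    """Read the next chunk, never beyond the size recorded at start time."""
    limit = min(offset + chunk, snap["size"])
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(max(0, limit - offset))
```

Data appended to the file after the snapshot is taken is simply never read, which is what makes "read while write" safe for this one-time pass.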

Reliability of one-time file collection (checkpoint mechanism)

One-time file collection records "configuration-level status + file-level progress" through checkpoints to support restarts, upgrades, and recovery from exceptions, while avoiding duplicate collection as much as possible.

Configuration-level checkpoint

This file records the core information of the one-time configuration (such as config_hash, expire_time, inputs_hash, and excution_timeout) and is used to restore the configuration's time-to-live (TTL) and to evaluate the update policy after a restart. The path is usually /etc/ilogtail/checkpoint/onetime_config_info.json.

File-level checkpoint

This file records the execution progress of one-time file collection and the status of each file. The path is usually located at: /etc/ilogtail/checkpoint/input_static_file/{config_name}@0.json.

Field description (aligned with the actual stored JSON):

| Parameter | Description |
| --- | --- |
| config_name | Collection configuration name |
| expire_time | Expiration time (TTL) of the collection configuration (Unix seconds) |
| file_count | Number of files to collect (snapshot at start time) |
| start_time | Time when collection started (Unix seconds) |
| finish_time | Time when collection finished (Unix seconds) |
| status | Collection configuration status: running / finished / abort |
| current_file_index | Index of the file currently being processed (while running) |
| files | List of per-file records; each record contains the fields below |
| filepath | File path (absolute path at start time) |
| sig_hash / sig_size | Signature hash computed over at most the first 1024 bytes of the file, and the recorded signature length |
| dev / inode | Device number and inode, used for rotation positioning |
| status | File status: waiting / reading / finished / abort |
| size | File size at start time (only collected up to this position in this run) |
| offset | Collection offset (present during reading/abort) |
| start_time | Time when collection of the file started (present during reading/finished/abort) |
| last_read_time | Last time the file was read (present during reading) |
| finish_time | Time when collection of the file finished (present when finished) |

{
  "config_name" : "xxxx",
  "expire_time" : 1768550944,
  "file_count" : 1,
  "files" : 
    [
      {
        "dev" : 2051,
        "filepath" : "/var/log/tmpfs.log",
        "finish_time" : 1768550345,
        "inode" : 2888304,
        "size" : 1282,
        "start_time" : 1768550345,
        "status" : "finished"
      }
    ],
  "finish_time" : 1768550345,
  "input_index" : 0,
  "start_time" : 1768550344,
  "status" : "finished"
}
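A checkpoint in this format can be inspected directly for troubleshooting. The following is a small sketch (the function name is hypothetical) that summarizes progress from a file-level checkpoint like the sample above:

```python
import json

def summarize_checkpoint(raw: str) -> dict:
    """Summarize a file-level one-time checkpoint (JSON text) into
    a short progress report: status, files finished, and elapsed time."""
    cp = json.loads(raw)
    done = sum(1 for f in cp["files"] if f["status"] == "finished")
    return {
        "config": cp["config_name"],
        "status": cp["status"],
        "progress": f"{done}/{cp['file_count']} files finished",
        # finish_time is absent while running; fall back to start_time.
        "elapsed_s": cp.get("finish_time", cp["start_time"]) - cp["start_time"],
    }
```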

Resource usage and throughput control

One-time file collection is a native input plugin (implemented in C++). It shares the reader system with regular file collection and offers good throughput: the theoretical peak performance of single-threaded collection of single-line text logs can reach 300 MB/s. At the same time, resource usage is kept controllable:

Single-threaded sequential execution: All input_static_file_onetime collection configurations are uniformly scheduled by the StaticFileServer module inside LoongCollector. The overall process is a single-threaded loop (different inputs are assigned time slices within the loop), avoiding uncontrolled resource usage caused by excessive concurrency.

Send rate limiting (flusher_sls.MaxSendRate): Use the advanced parameter MaxSendRate of the SLS flusher to rate-limit sending, in B/s. When MaxSendRate > 0, the sending queue enables the rate limiter, reducing the impact on network bandwidth and SLS write quotas.

2. Quick Start

SLS has released the one-time file collection capability. You can try the new feature in just three steps:

1.  Log on to the SLS console. On the Logtail configuration page, select "One-time Logtail Configuration" and click "Add Logtail Configuration".

2.  Select "One-time File Collection - Host".

3.  Fill in the file collection configuration (consistent with the configuration of regular file collection). Configure processing plugins as needed and save. For more detailed descriptions and parameter explanations, refer to the official documentation.

After saving, you can see that the data is collected:

You can also view the complete collection configuration in the configuration details:

3. Best Practices

3.1 Scenario 1: Backfilling a large volume of files across a large machine group

Hypothetical scenario:

● Because an accidental network disconnection lasted too long and exceeded LoongCollector's local fault tolerance limit, 1,000 nodes need to backfill data, about 10 GB per node.

● The target Logstore has 256 shards. The write limit for each shard is about 5 MB/s.

● The daily traffic of each machine is about 1 MB/s.

If you directly use default parameters to apply the one-time file collection configuration, the following may occur:

  1. The write rate surges instantly, triggering shard write quota errors.
  2. Backfill traffic occupies daily collection traffic.
  3. A backlog at the sender causes the one-time job to fail to complete within the ExecutionTimeout.

It is recommended to perform two-step control:

Step 1: Rate limiting (MaxSendRate)

Estimate roughly based on available quota: the remaining available write capacity is about 256 × 5 − 1,000 × 1 = 280 MB/s. Averaged across machines, that is about 0.28 MB/s per machine (≈ 286 KB/s ≈ 293,000 B/s), rounded to about 290,000 B/s. You can therefore set MaxSendRate to about 290000 (B/s) for rate limiting.

Step 2: Increase execution timeout (ExecutionTimeout)

At a send rate of about 286 KB/s, backfilling 10 GB takes at least about 10 GB / 286 KB/s ≈ 36,663 s ≈ 10.2 h. It is recommended to set ExecutionTimeout to 86400 (about 1 day) to leave enough margin for collection.

Summary: ExecutionTimeout: 86400 + MaxSendRate: 290000. This allows large-scale backfilling to be completed while minimizing the impact on daily online collection.
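The sizing above can be reproduced with a few lines of arithmetic. This is a rough planning aid, not part of LoongCollector; the function name and the 2× safety margin are illustrative choices:

```python
def plan_backfill(shards: int, shard_limit_mb: float, machines: int,
                  daily_mb_per_machine: float, backfill_gb_per_machine: float,
                  margin: float = 2.0) -> dict:
    """Estimate MaxSendRate (B/s) and ExecutionTimeout (s) for a backfill job."""
    # Cluster-wide write headroom left after daily traffic, in MB/s.
    spare_mb = shards * shard_limit_mb - machines * daily_mb_per_machine
    # Fair share per machine, converted to bytes per second.
    per_machine_bps = spare_mb / machines * 1024 * 1024
    # Pure transfer time for the per-machine backfill volume.
    transfer_s = backfill_gb_per_machine * 1024 ** 3 / per_machine_bps
    return {
        "max_send_rate": int(round(per_machine_bps, -4)),  # round to a tidy value
        "min_execution_timeout": int(transfer_s),
        "suggested_execution_timeout": int(transfer_s * margin),
    }
```

Plugging in the scenario's numbers (256 shards at 5 MB/s, 1,000 machines at 1 MB/s daily, 10 GB each) yields roughly the MaxSendRate 290000 and the order-of-10-hours transfer time derived above.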

3.2 Scenario 2: Only backfill data from a certain time period in the file

Hypothetical scenario (disregarding quota, only discussing "avoiding duplication"):

● The edge zone experienced a prolonged network abnormality that exceeded LoongCollector's local fault tolerance limit, resulting in the loss of approximately 12 hours of data.

● There are multiple rotated files in the edge zone, and many files are missing only partially.

● The logs are single-line JSON, for example:

  • {"timestamp":1768556120,"message":"hello world","level":"INFO"}

One-time file collection is executed in units of "file snapshots." If you simply recollect the files, time segments that have already been reported are likely to be recollected as well.

Solution: Add the UNIX timestamp filter processing plugin processor_timestamp_filter_native (combined with processor_parse_json_native / processor_parse_timestamp_native if necessary) to the one-time collection pipeline to retain only events within the target time range, achieving "precise recollection."

The console configuration diagram is as follows:

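The effect of the timestamp filter can be sketched in a few lines of Python. This illustrates only the filtering semantics (keep events whose timestamp falls in the target window), not the plugin's actual implementation:

```python
import json

def filter_by_time(lines, start_ts: int, end_ts: int):
    """Keep only single-line JSON events whose 'timestamp' field
    falls within the inclusive range [start_ts, end_ts]."""
    for line in lines:
        event = json.loads(line)
        if start_ts <= event["timestamp"] <= end_ts:
            yield line
```

Events outside the missing 12-hour window are dropped before sending, so already-reported time segments are not duplicated in the Logstore.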
3.3 Scenario 3: The one-time collection configuration needs to be modified (to avoid polluting the target dataset)

One-time collection is "executed immediately upon dispatch." If the initial configuration contains a logic error, some unexpected data may already have been generated even if the configuration is updated immediately, causing old and new data to mix and affecting analysis.

Suggested practice:

  1. Create the one-time configuration for the first time and find that the output does not meet expectations.
  2. Update the one-time configuration (you can set ForceRerunWhenUpdate: true to force a rerun and interrupt the previous collection task), then verify whether the newly collected data format is correct. Repeat until the requirements are met.
  3. Use a query statement to filter out the unexpected data and clean it up through SLS soft delete (sample document: Simple Log Service soft delete).

In this way, you can retain only the collection result corresponding to the "final correct configuration" to avoid affecting subsequent analysis.

4. Summary

One-time file collection is suitable for scenarios such as historical data migration, recollection after network disconnection, and temporary batch processing. After the configuration is dispatched, it executes against the "start-time file snapshot," with checkpoints ensuring recoverability and observability. Combined with ExecutionTimeout and MaxSendRate as a double safety net of "duration + traffic," you can steadily backfill static data without disturbing continuous online collection. You are welcome to try it out and provide feedback!
