
Elasticsearch: Troubleshoot Logstash performance issues

Last Updated: Mar 26, 2026

Alibaba Cloud Logstash follows the same architecture and tuning model as open-source Logstash. A pipeline processes events in three stages—input, filter, and output—each running on independent worker threads. Events enter a central queue (in-memory by default); worker threads pull batches from the queue, apply filters, and write to the output destination.
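A minimal pipeline configuration illustrates the three stages. The broker address, topic name, Elasticsearch endpoint, and index pattern below are placeholders, not values from this guide:

```
input {
  kafka {
    bootstrap_servers => "kafka-host:9092"   # placeholder broker address
    topics => ["app-logs"]                   # placeholder topic
    group_id => "logstash-app-logs"
  }
}

filter {
  # Parse Apache-style access logs into structured fields
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  elasticsearch {
    hosts => ["http://es-host:9200"]         # placeholder endpoint
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}
```

The filter stage is where worker threads spend CPU; the input and output stages are where upstream and downstream bottlenecks show up.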

This guide walks through performance issues in a structured order. Don't jump straight to pipeline parameters. Changing multiple variables at once makes it harder to isolate the root cause. Work through the checklist below in sequence, change one thing at a time, and measure after each change.

Troubleshooting order

Follow this sequence:

  1. Check your input sources and output destinations

  2. Check system resources (CPU, heap memory)

  3. Tune pipeline parameters (Pipeline Batch Size, Pipeline Workers)

Check input and output performance

Logstash throughput is bounded by the slowest component in the pipeline—if Kafka or Elasticsearch is the bottleneck, tuning Logstash parameters won't help.

Before adjusting anything in Logstash:

  • Verify that Kafka consumer lag is not caused by slow downstream writes.

  • Check Elasticsearch indexing rates and watch for 429 responses. A 429 means Elasticsearch's indexing queue is full. When this happens, Logstash retries automatically, but the underlying issue is on the Elasticsearch side—check cluster health and shard allocation before changing Logstash settings.

  • Monitor write latency on your output destination.

Set up monitoring

To get visibility into what's happening inside Logstash, configure at least one of the following:

  • CloudMonitor alert policy: Tracks system-level metrics (CPU, memory, disk I/O) for the Logstash cluster. See Configure a custom alert policy.

  • X-Pack Monitoring: Tracks Logstash-specific metrics—event receive rate, event transfer rate, CPU utilization, and memory usage. The Alibaba Cloud Elasticsearch cluster associated with your Logstash cluster must be in the same virtual private cloud (VPC). See Enable the X-Pack Monitoring feature.

To analyze per-pipeline processing details, install the logstash-output-file_extend plugin. After a pipeline starts, the plugin writes debug logs that show how business data moves through each stage. See Use the pipeline configuration debugging feature.
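As a sketch, the plugin is added as an extra output alongside the existing one. The path and parameters shown here are assumptions for illustration, not documented defaults; check the plugin documentation for the values your cluster expects:

```
output {
  elasticsearch {
    hosts => ["http://es-host:9200"]   # placeholder endpoint
  }
  # Assumed usage: file_extend writes per-pipeline debug logs to a local path.
  # The path below is illustrative only.
  file_extend {
    path => "/path/to/logstash/debug/logs"
  }
}
```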

Check CPU and heap memory

CPU

High CPU utilization by itself does not tell you where the bottleneck is; what matters is whether resources are actually being fully utilized.

  • If CPU is near 100%: resources are being used efficiently, and throughput can be improved by scaling out the cluster. Check heap memory as well, since frequent garbage collection often drives CPU spikes.

  • If CPU is consistently low: increasing cluster specs won't improve throughput. The bottleneck is likely upstream (slow input) or downstream (slow output).

Upgrading node specs only improves throughput when resources are already near full utilization.

Heap memory

Both an undersized and an oversized heap cause problems: too small, and the Java garbage collector (GC) runs frequently, which spikes CPU; too large, and individual GC pauses grow longer.

Configure heap memory to match your workload:

  • The typical range is 4 GB to 8 GB. For most workloads, staying within this range is sufficient.

  • If you see signs of memory pressure (high CPU with spiky GC patterns), double the current heap size and test whether performance improves.

  • Set heap memory and off-heap memory to the same size, following open-source Logstash best practices.

  • If you need more than 8 GB, scale out the Logstash cluster rather than continuing to increase the heap size on a single node.

Before moving to production, run load tests and tune heap size based on your actual traffic.
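In open-source Logstash, the heap bounds live in config/jvm.options. A common practice, assumed here with 8 GB as an example size, is to set the initial and maximum heap equal so the JVM never resizes the heap under load:

```
# config/jvm.options (example values)
-Xms8g   # initial heap size
-Xmx8g   # maximum heap size; keep equal to -Xms
```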

Tune pipeline parameters

Two parameters control how much work Logstash does concurrently. The total number of events in flight at any time is:

inflight count = Pipeline Workers × Pipeline Batch Size

Keep this formula in mind as you adjust either parameter—a higher inflight count means higher memory consumption.
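In logstash.yml terms, the two parameters and the resulting inflight count look like this (example values only):

```yaml
# logstash.yml (example values)
pipeline.workers: 8        # worker threads for the filter and output stages
pipeline.batch.size: 250   # events each worker pulls from the queue per cycle
# inflight count = 8 workers x 250 events = 2,000 events in memory at once
```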

Pipeline Batch Size

Controls the number of events each worker pulls from the queue per cycle. A larger batch improves throughput but increases memory usage.

When writing to Elasticsearch, aim for a bulk request size of around 5 MB. Tune Pipeline Batch Size to reach that target rather than setting it arbitrarily high.

Pipeline Batch Size maps directly to Elasticsearch's bulk setting. Larger batches mean fewer, larger bulk requests.
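For example, assuming an average event size of about 2 KB (an illustrative figure; measure your own), a batch of roughly 2,500 events yields a bulk request near the 5 MB target:

```yaml
# logstash.yml
# batch size ≈ target bulk size / average event size
# 5 MB / 2 KB per event ≈ 2,500 events
pipeline.batch.size: 2500
```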

Pipeline Workers

Controls the number of worker threads running the filter and output stages. Defaults to the number of vCPUs on each node.

When to increase Pipeline Workers:

  • CPU-bound pipelines (heavy filter computation, no network I/O): increase Pipeline Workers incrementally as long as CPU has headroom. Once CPU is saturated, adding more workers increases context-switching overhead and can *lower* throughput.

  • I/O-bound pipelines (network calls in filters or outputs, such as writing to Elasticsearch): these pipelines spend time waiting for I/O, so more workers can improve throughput even when CPU isn't fully used.

Increase the value, measure, and repeat. Change one value at a time.

Fix Kafka message accumulation

If messages are accumulating in Kafka topics, use the following approaches. Apply one at a time and measure before combining them. For more information, see Tips and Best Practices in the documentation for open source Logstash.

Increase partition count

For high-volume topics, calculate the minimum partition count as:

partitions ≥ Logstash nodes × consumer threads per node

More partitions allow more parallelism, but also increase overhead. Configure partitions based on your business requirements.
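For example, with 2 Logstash nodes each running a Kafka input with 4 consumer threads (consumer_threads is the Kafka input plugin setting; the values here are illustrative), the topic needs at least 8 partitions:

```
input {
  kafka {
    bootstrap_servers => "kafka-host:9092"   # placeholder
    topics => ["app-logs"]                   # placeholder
    consumer_threads => 4   # per node; 2 nodes x 4 threads = 8 partitions minimum
  }
}
```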

Distribute load with multiple pipelines

Configure multiple pipelines in the same Logstash cluster to use the same group ID. Kafka delivers each message to one consumer in the group, so this distributes load across pipelines on different nodes.
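As a sketch, each pipeline's Kafka input declares the identical group_id (the names below are placeholders); Kafka then balances the topic's partitions across all consumers in that group:

```
# Use the same group_id in every pipeline that should share the load
input {
  kafka {
    bootstrap_servers => "kafka-host:9092"   # placeholder
    topics => ["app-logs"]                   # placeholder
    group_id => "logstash-app-logs"          # identical across all pipelines
  }
}
```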

Increase Pipeline Workers and Pipeline Batch Size

For Kafka-heavy workloads, increasing both parameters together often helps. Start with Pipeline Batch Size (increase until you hit the 5 MB bulk request target for Elasticsearch), then increase Pipeline Workers. Monitor both Kafka consumer lag and Elasticsearch indexing latency as you tune.

What's next