Centralised Log Management at Scale with Alibaba Cloud Log Service

This article examines how Alibaba Cloud Log Service consolidates log collection, storage, indexing, and downstream delivery into a single managed plat...

Distributed applications running in production generate log data from dozens or hundreds of sources, including application containers, load balancers, database instances, network appliances, and managed cloud services. The engineering challenge is not log collection alone. It is the design of a centralised system that ingests at a sustained throughput, indexes for sub-second query response, retains data at a predictable cost, and integrates with downstream analytics and alerting without requiring bespoke pipeline code. Alibaba Cloud Log Service is structured around this principle, exposing collection, storage, query, and integration capabilities through a unified API and console.

Collection Agents and Ingestion Paths

Three ingestion paths cover most needs.

Logtail is the managed collection agent. It runs on virtual machines, container hosts, and Kubernetes nodes, parsing log files locally and forwarding records over HTTPS to the regional ingestion endpoint on port 443. The parsing side is where it earns its keep. Logtail recognises single-line text, regex with named capture groups, JSON, delimited formats (CSV and TSV), syslog, and the common access log shapes out of the box, so there's no pre-processing layer to maintain on the host. Collection rules and parser definitions live in the console and are pushed down to agents centrally, which sidesteps the usual headache of drifting config files across a fleet. JSON records get an extra benefit: every top-level key becomes a queryable field on the resulting log entry, indexable and addressable from SQL with no further transformation.
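
To make the two parsing modes concrete, the following is a minimal Python sketch of what regex parsing with named capture groups and JSON top-level key extraction produce. It is not Logtail code or configuration: the pattern, the function names, and the choice to keep non-string values as JSON text are assumptions made for illustration.

```python
import json
import re

# Hypothetical pattern for a simplified access-log line; not a Logtail built-in.
ACCESS_LOG = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
)


def parse_regex_line(line):
    """Return the named capture groups as a field dict, or None if the line does not match."""
    match = ACCESS_LOG.match(line)
    return match.groupdict() if match else None


def parse_json_line(line):
    """Expose every top-level JSON key as a field; non-string values are kept as JSON text."""
    record = json.loads(line)
    return {
        key: value if isinstance(value, str) else json.dumps(value)
        for key, value in record.items()
    }
```

Each named group or top-level key becomes one field of the resulting record, which is the shape that field indexing later operates on.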

For application-internal events that never hit disk, the producer SDKs (Java, Python, Go, .NET, and Node.js) call the PutLogs HTTP API directly. They batch locally, retry with exponential backoff, and dispatch asynchronously, keeping the application thread off the network path. The REST API stays open for lightweight cases where pulling in an SDK isn't warranted; requests are signed with HMAC-SHA256 using AccessKey credentials.
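
The batching and backoff behaviour can be sketched in a few lines of Python. This is not the producer SDK's implementation: `send` stands in for whatever performs the PutLogs call, and the function names, batch size, attempt limit, and base delay are all hypothetical. The asynchronous dispatch described above is left out to keep the sketch short.

```python
import time


def batches(records, max_batch_size):
    """Yield consecutive slices of at most max_batch_size records."""
    for start in range(0, len(records), max_batch_size):
        yield records[start:start + max_batch_size]


def send_with_backoff(send, batch, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call send(batch); on ConnectionError wait base_delay * 2**attempt, then retry.

    `send` is a placeholder for the real network call. The last failure is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return send(batch)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

A production client would typically add jitter to the delay and run this loop on a background thread so the caller never blocks on the network.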

Storage Organisation: Projects, Logstores, and Shards

Data is organised through three nested constructs. A Project is a region-bound logical container that defines a RAM permission boundary; cross-region replication is not supported, and region choice should be governed by data residency requirements and ingestion-source proximity. A Logstore is a schema-flexible container within a Project, housing log records of a related kind, typically one Logstore per service or per log family. A Shard is the unit of read and write throughput within a Logstore, supporting 5 MiB/s ingestion and 10 MiB/s read capacity per shard.
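
Using the per-shard figures quoted above, a rough shard count for a Logstore follows from simple arithmetic. The sketch below assumes the peak write and read rates are already known; the function name is illustrative and it ignores any headroom a real deployment would add.

```python
import math

WRITE_MIB_PER_SHARD = 5   # per-shard ingestion figure quoted in the text
READ_MIB_PER_SHARD = 10   # per-shard read figure quoted in the text


def shards_needed(peak_write_mib_s, peak_read_mib_s):
    """Return the smallest shard count that covers both peak write and peak read rates."""
    by_write = math.ceil(peak_write_mib_s / WRITE_MIB_PER_SHARD)
    by_read = math.ceil(peak_read_mib_s / READ_MIB_PER_SHARD)
    return max(1, by_write, by_read)
```

For example, a peak of 22 MiB/s written and 30 MiB/s read is bounded by the write side and needs five shards.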

Partition keys determine which shard a record routes to. An MD5 hash of the partition key value is computed at ingestion, and the record is written to the shard whose hash range contains the resulting value. Selecting a partition key with high cardinality and even distribution, such as a service instance identifier, trace identifier, or tenant identifier, prevents hot-shard conditions that cap effective throughput well below the aggregate Logstore limit. Where no natural partition key exists, omitting it causes records to be distributed in a round-robin manner across all available shards.
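
The routing rule can be illustrated with Python's standard `hashlib`. The sketch assumes shards that divide the 128-bit MD5 space evenly and half-open ranges; the real range boundaries are whatever the Logstore's shards currently hold, and the function names are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # an MD5 digest is 128 bits wide


def even_ranges(shard_count):
    """Split the hash space into contiguous [begin, end) ranges, one per shard."""
    step = HASH_SPACE // shard_count
    bounds = [i * step for i in range(shard_count)] + [HASH_SPACE]
    return list(zip(bounds[:-1], bounds[1:]))


def route(partition_key, ranges):
    """Return the index of the range that contains the MD5 hash of the key."""
    value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for index, (begin, end) in enumerate(ranges):
        if begin <= value < end:
            return index
    raise ValueError("ranges do not cover the hash space")
```

Because the hash is deterministic, every record with the same key lands on the same shard, which is why a low-cardinality or skewed key concentrates load.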

Shards can be split to increase capacity or merged to reduce idle cost. Splits divide a hash range into two contiguous sub-ranges; existing data remains in the parent shard, which transitions to read-only, while new writes are accepted by the two child shards. The operation is non-blocking and typically completes within seconds.
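
A toy model of the split semantics described above, under the assumption of half-open hash ranges and a midpoint split by default. The `Shard` class and `split_shard` function are illustrative names, not part of any Log Service SDK.

```python
from dataclasses import dataclass


@dataclass
class Shard:
    begin: int             # inclusive start of the hash range
    end: int               # exclusive end of the hash range
    writable: bool = True  # False once the shard has been split


def split_shard(parent, split_at=None):
    """Mark the parent read-only and return two writable children covering its range."""
    if not parent.writable:
        raise ValueError("shard is already read-only")
    if split_at is None:
        split_at = (parent.begin + parent.end) // 2
    if not parent.begin < split_at < parent.end:
        raise ValueError("split point must fall strictly inside the parent range")
    parent.writable = False  # existing data stays here and remains readable
    return Shard(parent.begin, split_at), Shard(split_at, parent.end)
```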

Indexing and Query Semantics

Log Service exposes two index types, applied per field within a Logstore. Full-text indexing extracts tokens from log bodies for keyword search using configurable delimiters; field indexing tags individual structured fields with their data type (text, long, double, or JSON) to support typed predicates and aggregations. Index configuration is mutable post hoc, but rebuilding for historical data requires a re-indexing operation scoped to the desired time range and incurs proportional cost.
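
How delimiter choice shapes full-text tokens can be seen in a small Python sketch. The delimiter set and the sample line are hypothetical, and the service's own tokeniser may differ in details such as case handling.

```python
def tokenize(text, delimiters):
    """Split text into tokens at any character found in `delimiters`, dropping empty tokens."""
    tokens, current = [], []
    for ch in text:
        if ch in delimiters:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens
```

With the delimiters `" /?="`, the line `GET /api/orders?id=42 status=500` yields the tokens `GET`, `api`, `orders`, `id`, `42`, `status`, and `500`. Dropping `/` from the set would keep `/api/orders` together as a single token, so a keyword search for `orders` alone would no longer match it.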

Indexed query latency on a single Logstore is generally under one second for time ranges up to 24 hours and scales with the time range and result set size beyond that. Query syntax supports boolean operators, range predicates, wildcard matching with leading-character restrictions, and field-scoped clauses. For analytical workloads, a SQL-92 subset operates over indexed data, including standard aggregation functions, GROUP BY, ORDER BY, and inner JOIN with a single Logstore on the right side. Queries take a two-stage form: a search clause filters the indexed dataset, and a pipe operator hands the result to a SQL stage for aggregation. Placing high-selectivity predicates in the search clause before the pipe reduces the dataset scanned by the SQL stage and directly improves query latency on large Logstores.
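
The two-stage form can be shown by composing a query string. The helper below is a hypothetical convenience, the field names `status`, `service`, and `request_uri` are invented for the example, and the use of `*` as the match-all search clause is an assumption about the query syntax rather than something stated above.

```python
def build_query(search, analytics=None):
    """Join a search clause and an optional SQL stage with the pipe operator."""
    search = search.strip() or "*"  # assumed match-all clause when no filter is given
    if not analytics or not analytics.strip():
        return search
    return f"{search} | {analytics.strip()}"


# Selective predicates go before the pipe so the SQL stage scans fewer records.
query = build_query(
    "status: 500 and service: checkout",
    "SELECT request_uri, count(*) AS hits GROUP BY request_uri ORDER BY hits DESC LIMIT 10",
)
```

Moving the `status` and `service` predicates into a SQL `WHERE` clause instead would return the same rows but force the SQL stage to consider every record in the time range.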

Downstream Delivery and Integration

Log Service integrates with several downstream targets without an intermediary pipeline. LogShipper delivers records on a scheduled cadence to Object Storage Service in Parquet, JSON, or CSV format, partitioned by time, for long-term cold storage and ad-hoc query through external table mechanisms. A parallel shipping configuration targets MaxCompute for unified batch analytics across logs and business data.

For real-time downstream consumption, the Consumer Library implements a checkpoint-aware consumer group model. Each consumer in a group is assigned a disjoint subset of shards; the checkpoint position is persisted server-side per consumer group and per shard, allowing horizontal scaling of consumers up to the shard count and automatic rebalancing on consumer addition or removal. Function Compute can be triggered directly on log arrival for low-throughput event-driven processing, such as alert routing, lightweight transformation, or webhook dispatch, without operating a continuously running consumer.
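
The disjoint-assignment property and the shard-count scaling limit can be demonstrated with a small Python sketch. The round-robin policy here is an assumption chosen for simplicity; the article does not state which assignment algorithm the service uses, and `assign_shards` is a hypothetical name.

```python
def assign_shards(shard_ids, consumers):
    """Distribute shards across consumers so that each shard has exactly one owner."""
    assignment = {consumer: [] for consumer in consumers}
    if not consumers:
        return assignment
    ordered = sorted(consumers)
    for position, shard in enumerate(sorted(shard_ids)):
        assignment[ordered[position % len(ordered)]].append(shard)
    return assignment
```

Re-running the function after a consumer joins or leaves models a rebalance. With more consumers than shards, the surplus consumers receive an empty list, which is why adding consumers beyond the shard count yields no extra parallelism.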

Operational Considerations

Shard count and partition key selection: Sustained write throttling on a Logstore at well below the aggregate shard capacity indicates skewed partition key distribution. The Logstore monitoring view exposes per-shard ingestion rate; uneven distribution warrants reconsideration of the partition key or migration to round-robin distribution. Shard count should be sized to peak ingestion rather than average.
Index coverage and cost: Indexing increases storage cost proportional to the volume of indexed content. Fields queried for filtering or aggregation should be field-indexed; large free-text bodies queried only occasionally can remain unindexed and accessed through scan queries at higher latency. Index keys not actively queried should be removed periodically.
Retention and access control: Retention is configurable per Logstore between 1 and 3650 days, and should be aligned with the governing compliance regime rather than left at the default. RAM policies should be scoped to Project and Logstore granularity; the AssumeRole pattern is preferred over long-lived AccessKey credentials for application access, with rotation on a 90-day cycle and access activity audited through ActionTrail.

Conclusion

Logtail and the producer SDKs deliver records from heterogeneous sources without per-source pipeline code.
Projects, Logstores, and Shards provide scalable, throughput-bounded storage with partition-aware routing.
Full-text and field indexes, combined with SQL analytics, support both diagnostic search and aggregate analysis from the same data.
LogShipper, the Consumer Library, and Function Compute integration extend logs into long-term storage, custom processing, and event-driven workflows.

Disclaimer: The views expressed herein are for reference only and don’t necessarily represent the official views of Alibaba Cloud.

PM - C2C_Yuan
