
Building Cross-Cloud Observability: One Architecture, Unified Analytics

This article introduces a unified observability architecture for cross-cloud log analysis and AIOps, designed to streamline multicloud O&M and reduce costs for global enterprises.

1. Customer Requirements

1.1 Unified Analysis of Multicloud Logs

A common pattern in multicloud scenarios is that edge security and access capabilities outside China are handled by Cloudflare (Web Application Firewall (WAF), Content Delivery Network (CDN), and Access), and verbose logs are uniformly stored in Amazon Simple Storage Service (S3) through Logpush for low-cost archiving and compliance retention. Meanwhile, the core business and observability systems of the headquarters often run on Alibaba Cloud: application, gateway, and business logs flow into Simple Log Service (SLS), and the alerting, on-call, and ticketing systems are also built around Alibaba Cloud. The result is that the chain of evidence for the same user request, the same attack, or the same release change is distributed across both the third-party cloud vendor and Alibaba Cloud, which makes it difficult to complete unified retrieval, correlation analysis, or closed-loop handling on a single platform.

For the platform engineering team, the core challenge is not the location of log storage, but rather the lack of a unified platform to perform analysis and complete operational tasks.

● Logs are in S3, but troubleshooting, security analytics, and operations analysis are scattered across multiple systems (Cloudflare console, Athena, Glue, Amazon Elastic MapReduce (EMR), CloudWatch, Business Intelligence (BI) tools, and self-built alerting).

● Metrics cannot be standardized: the same metric (such as 5xx rate, P99 latency, or WAF block ratio) is calculated separately in different systems, which makes changes hard to audit and definitions hard to reuse or migrate.

● The incident response chain is long: it requires first querying logs, then manually summarizing, then sending notifications, and finally dispatching tickets or rolling back, so the Mean Time To Detect (MTTD) and Mean Time To Resolve (MTTR) are artificially lengthened.


1.2 Reduce Costs and Simplify O&M

If S3 is used as log storage, actually using the data (query and analysis, visualization, and interactive alert filtering) usually requires a combination of additional components for querying, ETL, metrics, and alerting. The chain becomes longer, configuration and troubleshooting span multiple systems, and O&M complexity increases significantly.

If data is ingested directly into CloudWatch instead: CloudWatch Logs handles collection and storage, Logs Insights handles query and analysis, and Dashboards and Alarms close the loop for visualization and alerting. The overall cost is usually very high.

2. SLS Solutions


Next, the data import, processing, query and analysis, dashboard, and alerting features of this SLS solution are broken down and introduced step by step.

2.1 Import Data from S3 to SLS

To many people, data import is just a three-step procedure of "read, transmit, and write". But when you face:

● Logs that generate thousands of files per minute

● Attack and defense traffic that instantly surges from 1 GB to 10 GB

● Various mixed data formats such as gzip, snappy, JavaScript Object Notation (JSON), and Comma-Separated Values (CSV)

You will find that this is by no means a simple "copy and paste" operation.

Next, the difficulties encountered in the actual import process are described, followed by the corresponding implementation methods:

Challenge 1: Real-time discovery of massive numbers of small files is not simple (full traversal vs. real-time performance, incremental traversal vs. completeness)

The ListObjects operation of S3 only supports traversal in lexicographic order and does not support filtering by time. When the volume of historical files in a bucket or folder is huge, a full scan may take a long time. However, if only an incremental scan is performed, files may be missed because file names are out of order.

Consequences: new files are not discovered in time (latency increases), or they are missed entirely in extreme cases (a completeness risk).


Challenge 2: Throughput must keep up with peaks without relying on manual parameter tuning (traffic bursts, plus the "long tail" problem where processing is slowed down by a few oversized files)

  • In real business, traffic bursts: usually 1 GB/minute, but it may surge to 10 GB/minute during promotions or faults. If scale-out is slow, end-to-end latency spirals out of control as soon as the queue backs up.
  • Even if concurrent capacity is fully utilized, long tails still occur: assigning work evenly by file count lets a single oversized file drag down a job, so overall latency is determined by the slowest one.

Challenge 3: The data formats are often mixed and unpredictable

  • The same bucket often mixes JSON, CSV, and plain text. Even JSON may be line-delimited JSON, a JSON array, or a service-specific format (such as CloudTrail). The compression may be .gz, .snappy, .lz4, or .zstd.
  • Attempting to automatically detect the data format introduces sampling misjudgments and additional read overhead, which slows down the transmission chain instead.

Challenge 4: Data integrity and traceability must be guaranteed (ensuring no data is lost, supporting reprocessing, and enabling problem-file identification)

  • The import chain inherently involves retry and replay: network jitter, consumption timeouts, job restarts, and events and scans hitting the same object at the same time can all cause repeated pulls.
  • More importantly, data loss is often hidden: missed events, permission changes, checkpoint drift, and parsing errors can leave gaps in a given period without anyone noticing.

Our design solutions for these difficulties are as follows:

Design point 1: A "dual-mechanism" for file discovery ensures both timeliness and completeness.

  • Simple Queue Service (SQS) event-driven: S3 events → SQS → consumption by the data import job (suitable for scenarios with irregular file names or low-latency requirements).
  • Dual-pattern traversal: incremental catch-up to the latest checkpoint plus periodic full fallback scans (to prevent missed files).
Comparison dimension | Dual-pattern traversal | SQS event-driven
Real-time discovery of new files | Minute-level | Second-level
Configuration complexity | Simple, no additional configuration required | Requires configuring S3 event notifications and SQS
Reliability | High (full fallback) | Depends on SQS reliability
Cost | Only S3 API calls | Additional SQS fees
Scenarios | Standard log import | High real-time requirements, irregular file names

Design point 2: Auto Scaling + balanced allocation by data volume to handle traffic peaks and manage long-tail data.

  • Concurrent jobs automatically scale out or in based on queue depth and data volume, which avoids manual parameter tuning.
  • Work assignment is upgraded from "by file count" to "balanced allocation by data volume", so that a round of concurrent jobs finishes at roughly the same time.


Design point 3: Auto compression detection and explicit configuration of data formats (no guessing).

  • Compression formats are automatically detected and decompressed based on file suffixes, such as .gz, .snappy, .lz4, and .zstd.
  • Data formats are explicitly specified in the data import job (such as JSON, CSV, single-line, multi-line, CloudTrail, and JSON array). Encoding settings are also provided (UTF-8 by default, and can be overridden when necessary).

Design point 4: Checkpoint and status management + retry and isolation + file-level tracking make data backfilling feasible.

  • On the discovery side, "events + scan fallback" form a compensation closed loop to reduce the probability of missed discovery.
  • On the pull side, checkpoints and processing status are maintained. Failed files enter the retry or isolation queue, and data backfilling by replaying object keys is supported.
  • Deduplication and idempotence limit the impact of duplicates based on object identity (such as key + ETag/version + offset), making duplicates controllable and gaps visible.
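
As a quick sanity check on the SLS side, a query such as the following makes duplicate imports directly visible. This is only a sketch: it assumes the imported logs carry a unique request identifier, here the Cloudflare RayID field used later in this article.

* | SELECT
    RayID,
    count(*) AS ImportCopies
  FROM log
  GROUP BY RayID
  HAVING count(*) > 1
  ORDER BY ImportCopies DESC
  LIMIT 100

A result that stays empty (or shrinks to a handful of known replays) indicates that deduplication is keeping duplicates under control.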

2.2 One-stop Data Analytics

Data import is only the first step. A complete observability closed loop also requires data governance, interactive search, visualization, and intelligent alerting. SLS integrates these capabilities into a unified platform. The core principles of each step are described below.

Data transformation: fully managed streaming extract, transform, and load (ETL)

SLS data transformation is based on managed real-time consumption jobs and uses SPL (SLS Processing Language) syntax to process logs as streams. It is fully managed, supports auto scaling, and makes data visible within seconds. It also supports line-by-line debugging and code hints.


SLS uses the SPL engine as the kernel of the log pipeline, with advantages such as column-oriented computation, single instruction multiple data (SIMD) acceleration, and a C++ implementation. Based on the distributed architecture of the SPL engine, we redesigned the elasticity mechanism: instead of only scaling at the granularity of an instance (such as a Kubernetes pod or a compute unit) in the usual sense, it can scale quickly at the granularity of a DataBlock (MB level).


Scenario capabilities:

Pre-compliance: IP-to-Geo conversion and desensitization are completed outside China, and only compliant fields are transferred across the border to meet General Data Protection Regulation (GDPR) and data export requirements.

Data filtering: Invalid data is removed to reduce downstream indexing and storage overhead.

Structured extraction: Raw fields are transformed into analyzable metrics, and nested JSON is parsed in advance to avoid repeated computation at query time.

Field projection: Only gold fields are delivered, which can reduce cross-border traffic and index costs by 50% to 80%.

Field enrichment: Logs (such as order logs) are joined with dimension tables (such as user information tables) to add more dimensional information for analytics.

Data forwarding: Logstore data can be forwarded and aggregated to destination stores, and routing can be based flexibly on field content.
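
The following is a minimal SPL sketch of the filtering, structured extraction, and projection steps described above. The field names follow the Cloudflare WAF sample in Section 3.1.1; the exact filter condition and field list would of course depend on your own logs.

*
-- Data filtering: drop traffic that was explicitly allowed
| where SecurityAction <> 'allow'
-- Structured extraction: parse the nested security metadata once, during transformation
| parse-json -prefix='Security' SecuritySources
-- Field projection: keep only the gold fields that need to travel downstream
| project EdgeStartTimestamp, RayID, ClientRequestURI, SecurityAction, SecurityRuleID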

Query and analysis: high-performance engine and responses in seconds

SLS provides a high-performance query engine that supports index mode (responses in seconds over tens of billions of records) and scan mode (lightweight analysis). Queries run directly against indexes, with no need to pre-build datasets or wait for refresh delays. For ultra-large-scale analytics scenarios, SLS provides Dedicated SQL, which includes an enhanced mode and a complete-accuracy mode.


Query engine and capabilities:

Nearly a hundred functions: Built-in statistical, aggregation, string, time, geospatial, and window functions are provided out of the box.

Cross-store federated queries: StoreView supports correlated queries across Projects and Logstores.

Dedicated SQL: Provides high-precision analysis in large data volume scenarios to avoid sampling errors.

Scheduled SQL: Supports scheduled execution of SQL statements for report generation and metric pre-computation.
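
For example, a Scheduled SQL job might periodically materialize an hourly roll-up such as the following into a metrics store. This is a sketch only; it reuses the functions and field names that appear in the query samples in Section 3.1.2.

* | SELECT
    time_series(__time__, '1h', '%Y-%m-%d %H:%i:%s', '0') AS HourBucket,
    count(*) AS TotalRequests,
    count_if(OriginResponseStatus >= 500) AS Origin5xxCount,
    approx_distinct(ClientFingerprint) AS UniqueClients
  FROM log
  GROUP BY HourBucket
  ORDER BY HourBucket

Pre-computing such aggregates keeps dashboards and alert queries cheap, because they read the small roll-up result instead of rescanning the raw logs.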

Dashboards: Rich visualization, out-of-the-box

SLS dashboards are the data visualization tool provided by Simple Log Service to display query and analysis results in a graphical interface. A dashboard usually contains multiple statistical charts that summarize and render key performance metrics, important data, and analysis results.


Visualization capabilities:

Rich chart types: Multiple statistical charts such as tables, line charts, column charts, pie charts, and maps are supported. The Pro version supports overlaying multiple query results in one chart.

Interaction and drill-down: Supports global time filtering, variable-based filtering, and chart drill-down to trace from the overall picture down to details layer by layer.

Subscription and sharing: Supports periodically rendering dashboards into images and sending them by email or to DingTalk groups, and supports embedding the console into third-party systems.

Third-party integration: Can be integrated with visualization tools such as DataV, Grafana, and Tableau, and supports bidirectional import and export of Grafana dashboards.

Alerting: A one-stop artificial intelligence for IT operations (AIOps) platform

SLS alerting is a one-stop AIOps platform for alert monitoring, denoising, incident management, and notification dispatch. It consists of subsystems such as the alert monitoring system, alert management system, and notification management system. After logs or metrics are ingested, you can create monitoring jobs, notification channels, and alert policies within minutes.


Feature advantages:

Low cost and fully managed: Provided as Software as a Service (SaaS). Except for text messages and voice calls, no additional fees are charged for alert monitoring, incident management, or other features.

Denoising and dispatch: Supports grouping, deduplication, suppression, and escalation to avoid alert storms, and automatically dispatches alerts to different teams based on rules.

Rich notification channels: Natively integrates DingTalk, WeCom, Lark, Slack, text messages, voice calls, and Webhooks.

2.3 O&M Simplification (Using Integration to Replace Multiple Product Portfolios)

2.3.1 The third-party cloud multi-product portfolio: which components are usually required to achieve the same closed loop


Having multiple components is not necessarily bad, but when your requirement is "unified standards, minute-level closed loop, and controllable low cost," multiple components mean:

Longer pipeline: Data needs to be moved more times (ETL, intermediate tables, and dataset refreshes).

Larger failure surface: Jitter in any step affects end-to-end timeliness.

More fragmented billing: Storage, scans, ETL, alerting, visualization, and network are each billed separately, and the costs add up.

2.3.2 SLS integration vs. the third-party cloud multi-product portfolio

In SLS, you can create a reusable engineering template that combines "import + processing + index + query + dashboard + alerting/incident management", use the template to deliver the first version, and then iterate on costs and results through policies.


3. Case Study of Log Analysis Architecture Upgrades for Globalized Enterprises

Background and Solutions

A large globalized enterprise whose business covers regions such as Europe, Asia-Pacific, and North America achieves global access acceleration and web application protection through Cloudflare's CDN and security services. To meet data compliance and audit requirements outside China, the enterprise continuously archives its security and access logs to public cloud object storage (Amazon S3) for long-term retention and subsequent analysis through the platform's native log push capability (Logpush).

Currently, the enterprise uses a combination of multiple components on the third-party cloud to analyze and monitor logs outside China, and encounters the following problems:

Scattered data: S3 buckets are distributed across multiple regions such as Frankfurt and Tokyo, forming data silos that are difficult to manage and analyze uniformly.

High query and analysis costs: Athena bills based on scan volume. CloudWatch Logs Insights has limited query capabilities and requires separate queries across regions. The costs of daily retrievals and alerting queries increase linearly with frequency.

In addition, ETL depends on Glue or Lambda and must be maintained in-house, QuickSight visualization requires additional licensing and has synchronization latency, and CloudWatch Alarms configurations are scattered and lack unified denoising. The multi-product portfolio leads to high O&M complexity and hard-to-control costs.

You can build a unified observability analysis platform based on SLS to achieve the following goals:

Unified data transformation: You can use SPL to complete data governance outside China (such as field pruning, IP address desensitization, and Geo enrichment), which reduces cross-border transfer costs.

Unified query and analysis: Gold data is aggregated into the central Logstore in China, providing second-level interactive search over hundreds of millions of records.

Unified visualization: A one-stop dashboard is provided, and no additional business intelligence (BI) tools are required.

Unified alerting closed loop: Intelligent alerting based on SLS query and analysis is provided. It supports denoising, dispatching, and multi-channel notifications.

3.1 Data Flow

Data is pushed by Cloudflare Logpush to Amazon Web Services (AWS) S3 buckets in various regions outside China for archiving. SLS imports the data into Logstores in the same regions through event-driven mechanisms or scheduled scans. After the data is transformed by SPL, it is aggregated into the central Logstore in China to support unified query and analysis, dashboards, and alerting.


3.1.1 Sample SPL data transformation

Sample raw log (Cloudflare Web Application Firewall (WAF) log)

The sample Cloudflare WAF raw log contains sensitive and security fields such as ClientIP, SecurityAction, and SecuritySources, and covers three security action scenarios: block, allow, and challenge. You can directly use these logs to test SPL data transformation statements.

{  
  "EdgeStartTimestamp": "2024-12-25T10:30:00Z",  
  "RayID": "abc123def456",  
  "ClientIP": "203.0.113.50",  
  "OriginIP": "10.0.0.100",  
  "ClientRequestURI": "/api/v1/users?id=123",  
  "ClientRequestMethod": "POST",  
  "ClientRequestReferer": null,  
  "SecurityAction": "block",  
  "SecurityRuleID": "rule_001",  
  "SecuritySources": "[{\"source\":\"waf\",\"action\":\"block\"}]",  
  "OriginResponseStatus": 200,  
  "OriginResponseTime": 150,  
  "ResponseHeaders": "{\"x-cache\":\"MISS\"}"  
}

The following SPL script completes data governance outside China: time standardization, IP-to-Geo enrichment, IP address desensitization into anonymous fingerprints, security metadata parsing, and threat labeling. Finally, sensitive fields such as ClientIP and OriginIP are removed with project-away, and only gold fields are retained for cross-border transfer.

-- Core tracking and time standardization  
*   
| extend __time__ = cast(to_unixtime(date_parse(EdgeStartTimestamp, '%Y-%m-%dT%H:%i:%SZ')) as bigint)  
| extend RequestId = RayID  
| extend RequestPath = url_extract_path(ClientRequestURI)  
  
-- IP -> Geo (completed outside China)  
| extend  
    GeoCountry = ip_to_country(ClientIP),  
    GeoRegion  = ip_to_province(ClientIP),  
    GeoCity    = ip_to_city(ClientIP)  
  
-- IP address desensitization: Retain anonymous fingerprints (optional) and do not carry the raw IP address for cross-border transfer  
| extend ClientFingerprint = to_base64(sha256(to_utf8(ClientIP)))  
  
-- Security metadata parsing and labeling  
| expand-values -keep SecuritySources  
| parse-json -prefix='Security' SecuritySources  
| extend IsHighRisk = if(ClientRequestMethod = 'POST' and (ClientRequestReferer is null or SecurityAction = 'block'), 1, 0)  
  
-- Final denoising and field projection  
| project-away ClientIP, OriginIP, ResponseHeaders, RayID

Sample data after transformation

The transformed data has Geo enrichment, IP masking, and threat labeling applied. Sensitive fields have been removed, and the data can be used directly for downstream query, analysis, and alerting:

{  
    "RequestPath": "/api/v1/users",  
    "__time__": "1735122600",  
    "RequestId": "abc123def456",  
    "ClientFingerprint": "O1zTaFfLyH1ZqEHS03UiLSNMzwMX+4ZW7OsIVsDGgEg=",  
    "OriginResponseTime": "150",  
    "GeoCity": "Richardson",  
    "ClientRequestURI": "/api/v1/users?id=123",  
    "IsHighRisk": "1",  
    "EdgeStartTimestamp": "2024-12-25T10:30:00Z",  
    "SecurityAction": "block",  
    "SecurityRuleID": "rule_001",  
    "Securityaction": "block",  
    "GeoCountry": "United State",  
    "GeoRegion": "Texas",  
    "OriginResponseStatus": "200",  
    "Securitysource": "waf",  
    "ClientRequestMethod": "POST"  
}

3.1.2 Query and analysis samples

Sample 1: WAF rule hit statistics - This sample aggregates the hit count, high-risk hit count, and unique client count by rule.

* | SELECT   
  SecurityRuleID,  
  count(*) AS TotalHits,  
  count_if(IsHighRisk = 1) AS HighRiskHits,  
  approx_distinct(ClientFingerprint) AS UniqueClients  
FROM log  
WHERE SecurityRuleID IS NOT NULL AND SecurityRuleID <> ''  
GROUP BY SecurityRuleID   
ORDER BY TotalHits DESC


Sample 2: Top 10 attack source regions - This sample aggregates the block count and unique attacker count by country and city.

* | SELECT   
  GeoCountry,  
  GeoCity,  
  count(*) AS AttackCount,  
  approx_distinct(ClientFingerprint) AS UniqueAttackers  
FROM log  
WHERE SecurityAction = 'block'  
GROUP BY GeoCountry, GeoCity  
ORDER BY AttackCount DESC  
LIMIT 10  


Sample 3: Origin 5xx error trend - This sample aggregates the error count, error rate, and total request count by minute.

* | SELECT   
  time_series(__time__, '1m', '%Y-%m-%d %H:%i:%s', '0') AS TimeBucket,  
  count_if(OriginResponseStatus >= 500) AS Origin5xxCount,  
  count_if(OriginResponseStatus >= 500) * 100.0 / count(*) AS Origin5xxRate,  
  count(*) AS TotalRequests  
FROM log  
GROUP BY TimeBucket  
ORDER BY TimeBucket  


Sample 4: Request latency quantile analysis - This sample aggregates P50/P90/P99 latency by path to locate slow APIs.

* | SELECT   
  RequestPath,  
  count(*) AS RequestCount,  
  approx_percentile(OriginResponseTime, 0.50) AS LatencyP50,  
  approx_percentile(OriginResponseTime, 0.90) AS LatencyP90,  
  approx_percentile(OriginResponseTime, 0.99) AS LatencyP99  
FROM log  
WHERE OriginResponseTime IS NOT NULL  
GROUP BY RequestPath  
HAVING count(*) > 100  
ORDER BY LatencyP99 DESC  
LIMIT 20


3.1.3 Alert rule samples

Alert 1: Sudden increase in origin 5xx errors - This alert is triggered when the error rate exceeds 5%, to quickly detect origin anomalies.

* | SELECT  
    count_if(OriginResponseStatus >= 500) * 100.0 / count(*) AS Origin5xxRate  
  FROM log  
  HAVING Origin5xxRate > 5


Alert 2: Sudden increase in high-risk requests - This alert is triggered when the count exceeds 100 or the proportion exceeds 10%, to detect potential attacks.

* | SELECT  
    count_if(IsHighRisk = 1) AS HighRiskCount,  
    count_if(IsHighRisk = 1) * 100.0 / count(*) AS HighRiskRate  
  FROM log  
  HAVING HighRiskCount > 100 OR HighRiskRate > 10

Alert 3: Sudden increase in WAF blocks - This alert is triggered when the block count exceeds 1,000 or the unique attacker count exceeds 50, to assess the attack posture.

* | SELECT  
    count_if(SecurityAction = 'block') AS BlockCount,  
    approx_distinct(ClientFingerprint) AS UniqueAttackers  
  FROM log  
  HAVING BlockCount > 1000 OR UniqueAttackers > 50

4. Summary and Outlook

During data migration, the network quality and fees of cross-cloud and cross-border transfer cannot be ignored. We have therefore provided an optional capability that routes the transfer through CloudFront to reduce cross-cloud and cross-border transfer overhead.

