
Intelligent Traffic Management with Alibaba Cloud DNS and GTM

This article examines the failure modes that intelligent traffic management is designed to address, including regional outages, geographic latency var...

Every request to an internet-facing application begins with a DNS query, which makes the authoritative DNS layer the earliest and most consequential point at which traffic can be routed, redirected, or denied. A correctly designed DNS and traffic management layer prevents regional failures from becoming customer-visible outages, places users on endpoints with acceptable latency regardless of where they connect from, and remains available under deliberate attack. A poorly designed one becomes the single point of failure for every dependent service.

Alibaba Cloud separates these responsibilities across two services. Alibaba Cloud DNS handles authoritative resolution, answering recursive resolver queries quickly, accurately, and at scale. Global Traffic Manager (GTM) sits above the DNS layer and decides which backend endpoint a given resolution should return based on health signals and policy. The integration is straightforward: a user-facing record in Alibaba Cloud DNS is configured as a CNAME pointing to a GTM access domain, and GTM resolves that CNAME to the selected endpoint at query time. The interesting engineering decisions are not in the wiring but in how each service is configured to address a specific class of failure. This article works through four such classes.
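The CNAME hand-off described above can be modelled in a few lines. This is a minimal sketch rather than a real resolver: the record name, the GTM access domain, and the endpoint addresses are all hypothetical, and the selection logic is reduced to "first healthy endpoint in order".

```python
# Hypothetical names and addresses; this models the lookup chain only.
ZONE = {
    # User-facing record in the authoritative zone points at the GTM access domain.
    "www.example.com": ("CNAME", "app.gtm-access.example.net"),
}

GTM_POOLS = {
    "app.gtm-access.example.net": [
        {"address": "203.0.113.10", "healthy": False},  # primary endpoint, currently failing
        {"address": "198.51.100.20", "healthy": True},  # secondary endpoint
    ]
}


def gtm_select(access_domain):
    """Return the first healthy endpoint for an access domain, or None."""
    for endpoint in GTM_POOLS[access_domain]:
        if endpoint["healthy"]:
            return endpoint["address"]
    return None


def resolve(name):
    """Follow the CNAME into GTM and return the endpoint chosen at query time."""
    record_type, value = ZONE[name]
    if record_type == "CNAME" and value in GTM_POOLS:
        return gtm_select(value)
    return value


print(resolve("www.example.com"))  # prints 198.51.100.20
```

The user-facing record never changes during a failover in this model; only the answer behind the access domain does.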

Figure 1: End-to-end resolution path across Alibaba Cloud DNS and GTM.

Surviving Regional Failure

The first failure class is the loss of a region, whether through an infrastructure incident, network partition, or planned maintenance that exceeds the application's tolerance. The mechanism that addresses this is GTM's failover routing combined with continuous endpoint health verification.

In GTM, endpoints are organised into address pools, with each pool typically representing one region's set of backends. An access strategy designates a primary pool and one or more secondary pools, along with the order in which secondaries are activated. Health checks probe each address in a pool at a configured interval over HTTP, HTTPS, TCP, or ICMP. HTTP and HTTPS checks support custom request paths, expected response codes, and response body matching, which is the difference between confirming that a load balancer is reachable and confirming that the application behind it is actually serving traffic correctly. The latter is the only signal that matters for failover decisions.
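The difference between "reachable" and "serving correctly" can be illustrated with a small probe written against the Python standard library. This is a sketch of the idea, not GTM's probe implementation: the function names, the default expected code, and the body-matching rule are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def is_healthy(status, body, expected_codes=(200,), must_contain=None):
    """Apply the pass criteria: expected status code and optional body match."""
    if status not in expected_codes:
        return False
    if must_contain is not None and must_contain not in body:
        return False
    return True


def probe(url, timeout=5.0, **criteria):
    """Fetch a health path and evaluate it; network errors count as failures."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read(4096).decode("utf-8", errors="replace")
            return is_healthy(resp.status, body, **criteria)
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 503.
        return is_healthy(err.code, "", **criteria)
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout.
        return False
```

A call such as `probe("https://app.example.com/healthz", must_contain='"status":"ok"')` (hypothetical URL) fails when a load balancer returns 200 with a maintenance page, which a plain TCP or status-only check would pass.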

The detection latency for a failed pool is determined by three configured values: probe interval, probe timeout, and failure threshold. A 15-second interval with a 5-second timeout and a 3-failure threshold detects sustained failure in approximately 45 to 60 seconds, because three consecutive probes must fail and each can take up to the full timeout to be declared failed. Tightening these values reduces detection latency but increases probe traffic and raises the rate of false positives from transient network conditions. A single dropped packet on a 5-second timeout should not be allowed to trigger a regional failover.
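The 45 to 60 second figure can be reproduced with a simple model. The sketch below assumes that the best case is one interval per required failure and that the worst case adds a full timeout to each probe; that is an assumption about probe scheduling made for illustration, not documented GTM behaviour.

```python
def detection_window(interval_s, timeout_s, failure_threshold):
    """Return (best_case, worst_case) seconds to declare a pool failed.

    Simplified model: each required failure costs one interval, plus up to
    one full timeout when the probe hangs instead of failing fast.
    """
    best = failure_threshold * interval_s
    worst = failure_threshold * (interval_s + timeout_s)
    return best, worst


print(detection_window(15, 5, 3))  # prints (45, 60)
print(detection_window(5, 2, 2))   # prints (10, 14)
```

The second call shows the effect of tightening all three values: detection drops to roughly 10 to 14 seconds, but two consecutive slow responses are then enough to mark the pool as failed.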

Detection latency is only half of the failover budget. The other half is the TTL on the GTM-managed CNAME, which governs how long downstream resolvers continue to serve the pre-failover answer from cache. If the access domain TTL is 300 seconds, resolvers may continue directing users to the failed pool for up to five minutes after GTM has internally switched. For workloads requiring sub-minute failover, the access domain TTL must be lowered to the 30 to 60 second range, and the operator should be aware that some downstream resolvers enforce minimum TTL floors regardless of the authoritative value. Measuring actual resolver behaviour for the target user population is more reliable than assuming compliance with the published TTL.
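The two halves of the budget can be added up explicitly. In this sketch the resolver TTL floor is a hypothetical parameter standing in for resolvers that cache longer than the authoritative TTL allows; real floors vary by resolver and have to be measured.

```python
def worst_case_failover(detection_s, authoritative_ttl_s, resolver_ttl_floor_s=0):
    """Worst-case seconds until users stop being sent to a failed pool.

    A resolver that enforces a minimum TTL caches for the larger of the
    authoritative TTL and its own floor.
    """
    effective_ttl = max(authoritative_ttl_s, resolver_ttl_floor_s)
    return detection_s + effective_ttl


print(worst_case_failover(60, 300))                           # prints 360
print(worst_case_failover(60, 30))                            # prints 90
print(worst_case_failover(60, 30, resolver_ttl_floor_s=60))   # prints 120
```

The last line shows why lowering the authoritative TTL alone does not guarantee the target: a resolver with a 60-second floor makes a 30-second TTL behave like a 60-second one.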

Routing Users to the Right Endpoint

The second failure class is more subtle: users are technically being served, but from endpoints that are geographically or topologically distant, producing latency that degrades the application experience. Two mechanisms address this, operating at different layers and suited to different scenarios.

Line-based resolution in Alibaba Cloud DNS allows a single record name to return different values depending on the network origin of the query. The supported lines include default, the major mainland China carriers (China Telecom, China Unicom, China Mobile), overseas, and finer-grained geographic partitions. When a query arrives, the authoritative server determines its origin, using EDNS Client Subnet where supported and falling back to the recursive resolver's IP otherwise, and returns the value associated with the matching line. A resolver in mainland China receives an A record pointing to a China-region load balancer; a resolver in Europe receives one pointing to a European load balancer. Line-based resolution is sufficient when the routing decision is purely geographic, the endpoints are stable, and no health-driven failover is required between them.
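The matching step can be sketched with the standard `ipaddress` module. The line names mirror those in the text, but the CIDR ranges and record values below are documentation-reserved placeholders, not real carrier allocations, and the lookup is a linear scan rather than whatever data structure the service actually uses.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), not real carrier data.
LINE_CIDRS = [
    ("china_telecom", ipaddress.ip_network("192.0.2.0/24")),
    ("overseas", ipaddress.ip_network("198.51.100.0/24")),
]

LINE_RECORDS = {
    "china_telecom": "203.0.113.10",
    "overseas": "203.0.113.20",
    "default": "203.0.113.30",
}


def match_line(ip):
    """Return the first line whose range contains the address, else 'default'."""
    addr = ipaddress.ip_address(ip)
    for line, network in LINE_CIDRS:
        if addr in network:
            return line
    return "default"


def answer(resolver_ip, ecs_subnet=None):
    """Prefer the EDNS Client Subnet when present, else use the resolver IP."""
    if ecs_subnet is not None:
        source = str(ipaddress.ip_network(ecs_subnet, strict=False).network_address)
    else:
        source = resolver_ip
    return LINE_RECORDS[match_line(source)]


print(answer("198.51.100.7"))                             # prints 203.0.113.20
print(answer("198.51.100.7", ecs_subnet="192.0.2.0/24"))  # prints 203.0.113.10
```

The second call shows why Client Subnet matters: the same overseas resolver gets a different answer when it forwards a client subnet that belongs to another line.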

Geographic routing in GTM operates on the same principle but adds health awareness and policy composition. A strategy can map source regions to specific primary pools, while specifying fallback pools that are activated only when the regional primary fails its health checks. This composition is what distinguishes GTM from static line-based resolution: when the China primary pool fails, traffic from China does not need to remain stranded on a failed endpoint; it can be diverted to a designated secondary in Hong Kong or Singapore until the primary recovers, all without manual intervention or DNS record changes.
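The composition of geography and health can be expressed as an ordered pool list per source region. The region keys and pool names here are hypothetical, and the health map stands in for the result of GTM's health checks.

```python
# Ordered preference per source region: primary first, then fallbacks.
STRATEGY = {
    "china": ["cn-primary", "hk-secondary", "sg-secondary"],
    "europe": ["eu-primary", "sg-secondary"],
}
DEFAULT_ORDER = ["sg-secondary"]


def select_pool(source_region, pool_health):
    """Return the first healthy pool for the region, or None if all are down."""
    for pool in STRATEGY.get(source_region, DEFAULT_ORDER):
        if pool_health.get(pool, False):
            return pool
    return None


health = {"cn-primary": False, "hk-secondary": True, "sg-secondary": True, "eu-primary": True}
print(select_pool("china", health))   # prints hk-secondary
print(select_pool("europe", health))  # prints eu-primary
```

When `cn-primary` returns to healthy, the same call returns it again without any record change, which is the behaviour static line-based records cannot provide.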

For workloads where user-to-endpoint latency varies substantially within a single declared geography (for example, an APAC deployment served from multiple availability zones across Southeast Asia), latency-based routing is a more accurate basis for endpoint selection than geographic labels. GTM continuously measures resolver-to-endpoint latency and directs each query to the pool with the lowest observed latency from that resolver's vantage point. The trade-off is that latency-based routing depends on measurement coverage and is less predictable than geographic routing, which makes it better suited to user-facing read traffic than to workloads with strict regional data residency requirements.
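The dependence on measurement coverage can be made concrete. The sketch below picks the pool with the lowest median latency but ignores pools with too few samples and falls back to a geographic default when nothing qualifies. The sample threshold, pool names, and use of the median are illustrative assumptions, not a description of how GTM aggregates its measurements.

```python
from statistics import median


def pick_pool(samples_ms, geographic_default, min_samples=3):
    """Choose the lowest-median-latency pool among those with enough samples."""
    candidates = {
        pool: median(values)
        for pool, values in samples_ms.items()
        if len(values) >= min_samples
    }
    if not candidates:
        # No pool has enough measurements: fall back to the geographic choice.
        return geographic_default
    return min(candidates, key=candidates.get)


samples = {"sg": [40, 42, 45], "jkt": [20, 22, 21], "kl": [5]}
print(pick_pool(samples, geographic_default="sg"))  # prints jkt
print(pick_pool({}, geographic_default="sg"))       # prints sg
```

The `kl` pool has the lowest single reading but is skipped because one sample is not enough to trust, which is the kind of unpredictability that makes this mode a poor fit for residency-constrained traffic.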

Defending the Resolution Layer Itself

The third failure class is an attack against the DNS layer. Authoritative DNS is a frequent target of volumetric denial-of-service campaigns because resolution failure cascades into the unavailability of every dependent service. The attack surface includes NXDOMAIN floods designed to exhaust authoritative server capacity, random subdomain attacks that bypass resolver caches, and direct UDP query floods against the authoritative IPs.

Alibaba Cloud DNS Enterprise Edition handles these conditions by distributing query handling across geographically distinct clusters and applying built-in mitigation against amplification and reflection patterns. The Enterprise plan also provides isolated query-handling capacity that is not shared with standard-tier zones, which prevents noisy-neighbour effects from degrading authoritative response performance for high-value domains. Operators serving revenue-generating or regulated workloads should evaluate Enterprise hosting at the design stage rather than after an incident, since migrating a zone during an active attack is operationally more difficult than provisioning it correctly from the start.

DNSSEC addresses a different threat: cache poisoning and on-path tampering of responses between the authoritative server and the recursive resolver. Enabling DNSSEC at the zone level signs every response so that validating resolvers can verify its authenticity, and requires coordination with the domain registrar to publish the DS record in the parent zone. Key rotation should follow a documented schedule, and the service supports automated KSK and ZSK management, which removes the most common operational failure mode of manual DNSSEC deployment, where key rollover misalignment causes validation failure across the entire zone.
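The DS record published at the registrar is derived from the zone's key-signing key. The sketch below follows the key tag and SHA-256 digest calculations defined in RFC 4034 and RFC 4509; the public key bytes are a placeholder, not a real key, and in practice these values are normally copied from the DNS provider's console rather than computed by hand.

```python
import hashlib
import struct


def name_to_wire(name):
    """Encode a domain name in lowercase DNS wire format."""
    labels = [label for label in name.lower().rstrip(".").split(".") if label]
    return b"".join(bytes([len(label)]) + label.encode("ascii") for label in labels) + b"\x00"


def dnskey_rdata(flags, protocol, algorithm, public_key):
    """Build DNSKEY RDATA: 16-bit flags, 8-bit protocol, 8-bit algorithm, key."""
    return struct.pack("!HBB", flags, protocol, algorithm) + public_key


def key_tag(rdata):
    """Key tag checksum over the DNSKEY RDATA (RFC 4034, Appendix B)."""
    acc = 0
    for i, byte in enumerate(rdata):
        acc += byte if i & 1 else byte << 8
    acc += (acc >> 16) & 0xFFFF
    return acc & 0xFFFF


def ds_sha256(owner, rdata):
    """DS digest type 2: SHA-256 over owner name (wire format) plus RDATA."""
    return hashlib.sha256(name_to_wire(owner) + rdata).hexdigest().upper()


# Placeholder key material; flags 257 marks a KSK, algorithm 13 is ECDSA P-256.
rdata = dnskey_rdata(257, 3, 13, b"\x01\x02")
print(key_tag(rdata))  # prints 1296
print(len(ds_sha256("example.com.", rdata)))  # prints 64
```

If the key tag or digest at the parent does not match the key actually signing the zone, validating resolvers reject every answer, which is the rollover misalignment failure described above.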

Where workloads are mainland-China-facing, ICP filing is an additional administrative dependency that must be satisfied before traffic can be served from China-region endpoints. ICP status does not affect DNS resolution itself, but it determines whether the endpoint returned by resolution is reachable from Chinese networks, which makes it a coupled constraint that needs to be tracked alongside the DNS configuration rather than managed in isolation.

Knowing What the System Actually Did

The fourth failure class is operational opacity: the inability to confirm, after the fact, whether traffic was routed as policy intended, when failover events occurred, and which queries the system actually answered. Without this visibility, incident reconstruction relies on inference and downstream symptoms rather than primary evidence.

GTM exposes per-strategy and per-pool metrics through CloudMonitor, including query counts segmented by access region, health check pass and fail rates, and the currently active pool for each strategy. These metrics are the primary signal for confirming that traffic distribution matches policy and that failover events have completed as expected. CloudMonitor alarms should be configured on at least three conditions: any pool transitioning to unhealthy status, sustained elevation of query volume against a region beyond its established baseline, and any access strategy entering a state where all configured pools are failing simultaneously. The third condition is the operational equivalent of a total outage signal and warrants the highest alert priority.
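The three alarm conditions can be written as a single evaluation over a metrics snapshot. This is a sketch of the decision logic only: the input dictionaries, the severity labels, and the surge factor are hypothetical, and the query rates are assumed to be already averaged over a window so that a single spike does not count as sustained elevation.

```python
def evaluate_alarms(pool_health, strategy_pools, region_qps, region_baseline_qps,
                    surge_factor=2.0):
    """Return a list of (severity, message) tuples for the current snapshot."""
    alarms = []

    # Condition 1: any pool that is currently unhealthy.
    for pool, healthy in pool_health.items():
        if not healthy:
            alarms.append(("warning", f"pool {pool} is unhealthy"))

    # Condition 2: windowed query volume well above the regional baseline.
    for region, qps in region_qps.items():
        baseline = region_baseline_qps.get(region)
        if baseline and qps > surge_factor * baseline:
            alarms.append(("warning", f"query volume for {region} is {qps / baseline:.1f}x baseline"))

    # Condition 3: a strategy with no healthy pool left is a total outage signal.
    for strategy, pools in strategy_pools.items():
        if pools and all(not pool_health.get(p, False) for p in pools):
            alarms.append(("critical", f"strategy {strategy} has no healthy pool"))

    return alarms


alarms = evaluate_alarms(
    pool_health={"cn-primary": False, "hk-secondary": False, "eu-primary": True},
    strategy_pools={"china": ["cn-primary", "hk-secondary"], "europe": ["eu-primary"]},
    region_qps={"china": 900, "europe": 100},
    region_baseline_qps={"china": 300, "europe": 120},
)
for severity, message in alarms:
    print(severity, message)
```

With this input the snapshot yields two unhealthy-pool warnings, one volume warning for the `china` region, and one critical alarm for the `china` strategy.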

Alibaba Cloud DNS publishes query logs through Log Service when log delivery is enabled at the zone level. The logs capture per-query source resolver, queried record, returned value, and response latency. The immediate value is post-incident reconstruction, but the more durable value is in patterns that surface during routine review: high-volume queries on records that could safely be raised to longer TTLs to reduce authoritative load, query origins suggesting misconfigured downstream consumers, or distributions that reveal opportunities to consolidate line-based resolution rules. Treating query logs as a continuous source of optimisation signal rather than purely an incident-response artefact compounds operational improvement over time.
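One of the routine-review patterns, finding busy records with short TTLs, is easy to sketch once the logs are exported. The entry format below (a dictionary with a `record` key) and the thresholds are assumptions for illustration; the actual Log Service schema should be checked before adapting this.

```python
from collections import Counter


def ttl_raise_candidates(log_entries, record_ttls, min_queries=1000, max_ttl_s=60):
    """List (record, query_count, ttl) for busy records with a short TTL."""
    counts = Counter(entry["record"] for entry in log_entries)
    return [
        (record, n, record_ttls[record])
        for record, n in counts.most_common()
        # Records with an unknown TTL are skipped rather than guessed at.
        if n >= min_queries and record_ttls.get(record, max_ttl_s + 1) <= max_ttl_s
    ]


logs = (
    [{"record": "static.example.com"}] * 5
    + [{"record": "www.example.com"}] * 4
    + [{"record": "api.example.com"}] * 1
)
ttls = {"static.example.com": 30, "www.example.com": 600, "api.example.com": 30}
print(ttl_raise_candidates(logs, ttls, min_queries=3))
# prints [('static.example.com', 5, 30)]
```

The output is only a list of candidates. A record that fronts a GTM access domain may have a short TTL deliberately, to keep failover fast, so each candidate still needs a judgement call before its TTL is raised.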

Conclusion

DNS-layer intelligence is best understood not as a feature of the resolution service but as the result of two services cooperating: authoritative DNS optimised for query throughput and resilience, and a traffic manager optimised for fast reaction to health signals and policy intent. The separation allows each layer to evolve independently. Routing strategy can be refined in GTM without revisiting authoritative DNS configuration, and DNS-layer protections can be hardened without disturbing the application-level routing policy.

Three patterns are worth evaluating as extensions to this design. Combining GTM with Anti-DDoS Pro or Anti-DDoS Premium pushes protection beyond the resolution layer into the application layer, which matters when the endpoint itself is the primary attack target rather than the DNS infrastructure. Alibaba Cloud DNS PrivateZone applies the same authoritative record-management surface to traffic inside a VPC, which allows internal service-to-service routing to follow the same conventions as external traffic without exposing internal resolution publicly. For workloads with frequent deployment cycles, weighted round-robin in GTM supports gradual traffic shifting between pool versions, which enables canary rollouts at the routing layer rather than relying solely on application-level feature flags.
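The canary pattern amounts to weighted random selection between pool versions. The sketch below uses Python's `random.choices` to show how a small weight translates into a small traffic share; the pool names and weights are hypothetical, and because resolver caching means one DNS answer can serve many users, real traffic shares only approximate the configured weights.

```python
import random


def pick_version(weights, rng=random):
    """Pick one pool name with probability proportional to its weight."""
    pools = list(weights)
    return rng.choices(pools, weights=[weights[p] for p in pools], k=1)[0]


# Send roughly 5% of resolutions to the canary pool.
weights = {"stable": 95, "canary": 5}
rng = random.Random(0)  # seeded so the simulation is repeatable
picks = [pick_version(weights, rng) for _ in range(10_000)]
print(picks.count("canary") / len(picks))
```

Raising the canary weight in steps (for example 5, 25, 50, 100) shifts traffic gradually, and setting it back to 0 removes the canary from rotation without touching the stable pool.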


Disclaimer: The views expressed herein are for reference only and don't necessarily represent the official views of Alibaba Cloud.


PM - C2C_Yuan
