DNS resolution failures in an ACK cluster typically stem from a small set of root causes: CoreDNS pod issues, network connectivity problems, misconfigured DNS policies, or known Linux kernel and IPVS edge cases. This guide walks you through a systematic diagnosis process and provides targeted fixes for each failure pattern.
How DNS resolution works
When an application pod makes a DNS query, the request flows through the following path:
1. The pod sends a DNS query to the address in `/etc/resolv.conf`, which is typically the kube-dns Service IP.
2. kube-dns forwards the query to a CoreDNS pod in the `kube-system` namespace.
3. For internal domain names ending with `.cluster.local`, CoreDNS resolves the query from its own cache without contacting upstream servers.
4. For external domain names, CoreDNS forwards the query to the upstream DNS servers specified in its configuration. The default upstream servers are `100.100.2.136` and `100.100.2.138`, both deployed in the virtual private cloud (VPC).
If you have NodeLocal DNSCache installed, DNS queries go to the local cache (169.254.20.10) first. Only if the local cache cannot resolve the query does it fall back to the kube-dns Service.
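As a concrete reference point, the resolver configuration inside an application pod under the default ClusterFirst policy typically looks like the sketch below. The nameserver IP is a placeholder for your cluster's kube-dns Service IP:

```
# /etc/resolv.conf inside an application pod (illustrative values)
nameserver 172.21.0.10    # kube-dns Service IP; 169.254.20.10 when NodeLocal DNSCache is injected
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
```

The `ndots:5` option means that a name with fewer than five dots is first tried with each search suffix appended, which is why a typo in a short name often shows up in the query log as a series of NXDOMAIN responses for suffixed variants.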
Key concepts
| Term | Description |
|---|---|
| Internal domain name | A domain name ending with .cluster.local. CoreDNS resolves these from its cache, not upstream servers. |
| External domain name | Any domain name that does not end with .cluster.local. CoreDNS forwards queries for these to upstream DNS servers. |
| Application pod | Any pod that is not a system component. |
| kube-dns Service | The Kubernetes Service that routes DNS traffic to CoreDNS pods. Its IP is the default DNS server address in application pods. |
| NodeLocal DNSCache | A DaemonSet that runs a local DNS caching agent on each node. When enabled, pods send DNS queries to the local cache (169.254.20.10) instead of kube-dns directly. |
| Upstream DNS server | The DNS server that CoreDNS contacts for external domain names. Defaults to 100.100.2.136 and 100.100.2.138. |
Step 1: Identify your error message
Match your error message to determine the likely failure category.
| Client | Error message | Likely cause |
|---|---|---|
| ping | `ping: xxx.yyy.zzz: Name or service not known` | Domain name does not exist, or DNS server is unreachable. Resolution latency over 5 seconds suggests the DNS server is unreachable. |
| curl | `curl: (6) Could not resolve host: xxx.yyy.zzz` | Same as above. |
| PHP HTTP client | `php_network_getaddresses: getaddrinfo failed: Name or service not known in xxx.php on line yyy` | Same as above. |
| Golang HTTP client | `dial tcp: lookup xxx.yyy.zzz on 100.100.2.136:53: no such host` | Domain name does not exist. |
| dig | `;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: xxxxx` | Domain name does not exist. |
| Golang HTTP client | `dial tcp: lookup xxx.yyy.zzz on 100.100.2.139:53: read udp 192.168.0.100:42922->100.100.2.139:53: i/o timeout` | DNS server unreachable. |
| dig | `;; connection timed out; no servers could be reached` | DNS server unreachable. |
- Domain does not exist → Check whether the domain name itself is correct. If only external domains fail, see External domain name cannot be resolved.
- DNS server unreachable → Proceed to Step 2.
Step 2: Check DNS policy and server address
Verify that the pod is using CoreDNS as its DNS server.
Run the following commands:
```shell
# View the pod's DNS policy
kubectl get pod <pod-name> -o yaml

# Log in to the pod and inspect the DNS configuration
kubectl exec -it <pod-name> -- cat /etc/resolv.conf
```
Check the dnsPolicy field and the nameserver entries in /etc/resolv.conf.
| dnsPolicy value | Behavior |
|---|---|
| ClusterFirst | Default. The pod uses the kube-dns Service IP as its DNS server. |
| ClusterFirstWithHostNet | Same as ClusterFirst, but for pods that use the host network. |
| Default | The pod inherits DNS settings from the Elastic Compute Service (ECS) node. Use this only if the pod does not need to resolve cluster-internal names. |
| None | DNS is configured entirely through dnsConfig. NodeLocal DNSCache uses this value to inject 169.254.20.10 and the kube-dns IP as nameservers. |
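To make the None policy concrete, the sketch below shows a pod spec resembling what NodeLocal DNSCache injection produces. The kube-dns IP `172.21.0.10` and the search suffixes are placeholders; your cluster's values will differ:

```yaml
# Illustrative pod spec after NodeLocal DNSCache dnsConfig injection
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: app
    image: nginx
  dnsPolicy: None
  dnsConfig:
    nameservers:
    - 169.254.20.10   # NodeLocal DNSCache agent on the node
    - 172.21.0.10     # kube-dns Service IP, used as fallback
    searches:
    - default.svc.cluster.local
    - svc.cluster.local
    - cluster.local
    options:
    - name: ndots
      value: "5"
```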
If the pod is not using CoreDNS (nameserver is not the kube-dns Service IP), the pod may be overloaded or the conntrack table may be full. See Client is overloaded and Conntrack table is full.
If the pod is using NodeLocal DNSCache (nameserver is 169.254.20.10), see NodeLocal DNSCache does not work and Alibaba Cloud DNS PrivateZone names cannot be resolved.
If the pod is using CoreDNS, continue to Step 3.
Step 3: Check CoreDNS pod health
Run the following commands to inspect the CoreDNS pods:

```shell
# View CoreDNS pod status and placement
kubectl -n kube-system get pod -o wide -l k8s-app=kube-dns
```

Expected output:

```
NAME                      READY   STATUS    RESTARTS   AGE   IP            NODE
coredns-xxxxxxxxx-xxxxx   1/1     Running   0          25h   172.20.6.53   cn-hangzhou.192.168.0.198
```

```shell
# View real-time CPU and memory usage
kubectl -n kube-system top pod -l k8s-app=kube-dns
```

Expected output:

```
NAME                      CPU(cores)   MEMORY(bytes)
coredns-xxxxxxxxx-xxxxx   3m           18Mi
```
- Pods not in Running state → Run `kubectl -n kube-system describe pod <CoreDNS-pod-name>` to identify the cause. See CoreDNS pods do not run as normal.
- CPU or memory near the limit → See CoreDNS pods are overloaded.
- CPU usage uneven across pods → See DNS queries are not evenly distributed.
Step 4: Check CoreDNS operational logs
```shell
kubectl -n kube-system logs -f --tail=500 --timestamps <coredns-pod-name>
```
| Flag | Description |
|---|---|
| -f | Stream the log output. |
| --tail=500 | Show the last 500 lines. |
| --timestamps | Include a timestamp in each log line. |
Look for error patterns that match known issues. For DNS query-level logs, you must first enable the CoreDNS log plugin. For details, see Configure DNS resolution.
After enabling the log plugin, each resolved query produces an entry like:
```
[INFO] 172.20.2.25:44525 - 36259 "A IN redis-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd 110 0.000116946s
```
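For reference, enabling query logging amounts to adding the `log` directive to the Corefile in the coredns ConfigMap. The sketch below shows where it sits; your Corefile will likely contain more plugins:

```
.:53 {
    errors
    log                # emit one line per DNS query, as in the example above
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
    reload             # pick up ConfigMap edits without restarting the pods
}
```

The log plugin writes one line per query and can be costly on busy clusters, so consider disabling it again after troubleshooting.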
Common DNS response codes in the log:
| Response code | Meaning | What to do |
|---|---|---|
| NOERROR | Resolved successfully. | No action needed. |
| NXDOMAIN | Domain does not exist on the upstream server. | Check whether the domain name includes a search suffix that does not resolve. |
| SERVFAIL | Upstream DNS server returned an error. | Check CoreDNS connectivity to the upstream servers. |
| REFUSED | Upstream server rejected the query. | Check the CoreDNS Corefile configuration and the node's /etc/resolv.conf. |
For more information about DNS response codes, see RFC 1035.
Step 5: Reproduce the error and isolate the cause
If the error occurs consistently:
1. Check the DNS query log for error response codes. See External domain name cannot be resolved.
2. Test network connectivity between application pods and CoreDNS. See Test network connectivity between application pods and CoreDNS.
3. Diagnose the container network. See Diagnose the container network.
If the error occurs intermittently:
Capture packets to collect evidence. See Capture packets.
If none of the above steps resolve the issue, submit a ticket.
Diagnostic methods
Test network connectivity between application pods and CoreDNS
Connect to the application pod's network namespace using one of the following methods:
- Method 1 (recommended): Run `kubectl exec -it <pod-name> -- bash` to enter the pod directly.
- Method 2: Log in to the node, find the process ID with `ps aux | grep <application-process-name>`, then enter the network namespace with `nsenter -t <pid> -n bash`.
- Method 3 (for frequently restarting pods):
  1. Log in to the node.
  2. Run `docker ps -a | grep <application-container-name>` to find the sandbox container ID (names start with `k8s_POD_`).
  3. Run `docker inspect <sandboxed-container-ID> | grep netns` to find the network namespace path in `/var/run/docker/netns/xxxx`.
  4. Run `nsenter -n<netns-path> bash` to enter the namespace. Note: Do not add a space between `-n` and `<netns-path>`.

Once inside the pod's network namespace, run these connectivity tests:
```shell
# Test connectivity to the kube-dns Service
dig <domain> @<kube-dns-svc-ip>

# Test Internet Control Message Protocol (ICMP) connectivity to the CoreDNS pod
ping <coredns-pod-ip>

# Test a DNS query directly against the CoreDNS pod
dig <domain> @<coredns-pod-ip>
```
Replace <kube-dns-svc-ip> with the kube-dns Service IP in the kube-system namespace, and <coredns-pod-ip> with the IP of a CoreDNS pod.
| Symptom | Likely cause | Next step |
|---|---|---|
| Cannot reach the kube-dns Service | Node overloaded, kube-proxy down, or security group blocking User Datagram Protocol (UDP) port 53 | Verify that security group rules allow UDP port 53. If they do, submit a ticket. |
| Cannot reach the CoreDNS pod (ICMP) | Container network error or security group blocking ICMP | Diagnose the container network. |
| Cannot reach the CoreDNS pod (DNS) | Node overloaded or security group blocking UDP port 53 | Verify that security group rules allow UDP port 53. If they do, submit a ticket. |
Test network connectivity of CoreDNS
1. Log in to the node where the CoreDNS pod runs.
2. Run `ps aux | grep coredns` to get the CoreDNS process ID.
3. Run `nsenter -t <pid> -n bash` to enter the CoreDNS network namespace.
4. Test connectivity:

   ```shell
   # Test connectivity to the Kubernetes API server
   telnet <apiserver-slb-ip> 443

   # Test connectivity to the upstream DNS servers
   dig <domain> @100.100.2.136
   dig <domain> @100.100.2.138
   ```
| Symptom | Likely cause | Next step |
|---|---|---|
| Cannot reach the Kubernetes API server | API server error, node overloaded, or kube-proxy down | Submit a ticket. |
| Cannot reach upstream DNS servers | Node overloaded, CoreDNS misconfigured, or Express Connect routing error | Submit a ticket. |
Diagnose the container network
1. Log in to the ACK console.
2. On the Clusters page, click the name of your cluster or click Details in the Actions column.
3. In the left-side navigation pane, choose Operations > Cluster Check.
4. On the Container Intelligence Service page, choose Cluster Check > Diagnosis.
5. On the Diagnosis page, click the Network Diagnosis tab.
6. Set Source address to the application pod IP, Destination address to the kube-dns Service IP, and Destination port to `53`. Select Enable packet tracing and I know and agree, then click Create diagnosis.
7. In the diagnosis list, click Diagnosis details for your record.
The results show the Diagnosis result, Packet paths, and All possible paths sections, along with identified error causes. For details, see Use the cluster diagnostics feature to troubleshoot cluster issues.
Capture packets
Use packet capture when errors are intermittent and hard to reproduce.
1. Log in to the nodes where the application pods and the CoreDNS pod run.
2. Run the following command on each ECS instance:

   ```shell
   tcpdump -i any port 53 -C 20 -W 200 -w /tmp/client_dns.pcap
   ```

   This captures all traffic on port 53, rotating across up to 200 files of 20 MB each.
3. Reproduce the error, then analyze the packets from the time window when DNS failures occurred. Check your application log for the exact timestamps.
Packet capture does not affect your service. It causes only a slight increase in CPU utilization and disk I/O.
Known issues
The following issues are caused by environment-specific conditions. Scan this list before doing deeper investigation.
| Issue | Affected environments | Quick identification |
|---|---|---|
| Concurrent A and AAAA record queries | All (especially Alpine-based images and PHP apps) | Intermittent failures; packet capture shows simultaneous A/AAAA queries on the same port |
| IPVS UDP source port conflicts | kube-proxy in IP Virtual Server (IPVS) mode; CentOS or Alibaba Cloud Linux 2 with a kernel earlier than 4.19.91-25.1.al7.x86_64 | Failures last approximately 5 minutes during node scaling or CoreDNS scaling |
| Conntrack table full | High-traffic nodes | dmesg -H shows conntrack full; failures during peak hours |
| Alibaba Cloud DNS PrivateZone with NodeLocal DNSCache | Clusters using both NodeLocal DNSCache and DNS PrivateZone | PrivateZone or vpc-proxy domain names fail to resolve or resolve to wrong addresses |
| autopath plugin bug | Clusters creating containers at high frequency | External names intermittently fail or resolve to wrong IPs; internal names resolve normally |
| DNS PrivateZone and vpc-proxy names | Clusters where both internal and external domain names fail | Resolution errors only on domain names added to Alibaba Cloud DNS PrivateZone and domain names that contain vpc-proxy |
FAQ
External domain name cannot be resolved
Check the CoreDNS query log for the response code. Enable the CoreDNS log plugin if it is not already on (see Configure DNS resolution), then look for entries for the failing domain. An NXDOMAIN response means the domain does not exist on the upstream server, often because a search domain suffix was appended to a short name, producing an invalid fully qualified domain name (FQDN). A SERVFAIL or REFUSED response means the upstream DNS server itself has a problem; check the CoreDNS configuration and its connectivity to 100.100.2.136 and 100.100.2.138.
Domain names of headless Services cannot be resolved
In CoreDNS versions earlier than 1.7.0, network jitter between CoreDNS and the Kubernetes API server can cause CoreDNS to exit unexpectedly, so headless Service domain records are not updated while it is down. Update CoreDNS to 1.7.0 or later to fix this. See [Component Updates] Update CoreDNS.
Domain names of StatefulSet pods cannot be resolved
The pod YAML template must set serviceName to the name of the headless Service that exposes the StatefulSet. Without this, the per-pod DNS names (for example, pod.headless-svc.ns.svc.cluster.local) cannot be resolved, even though the Service-level name (for example, headless-svc.ns.svc.cluster.local) works fine. Set serviceName in the StatefulSet spec to the headless Service name.
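A minimal sketch of the required pairing, with all names as illustrative placeholders:

```yaml
# Headless Service (clusterIP: None) plus a StatefulSet that references it.
apiVersion: v1
kind: Service
metadata:
  name: headless-svc
spec:
  clusterIP: None           # headless: DNS returns per-pod records
  selector:
    app: demo
  ports:
  - port: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo
spec:
  serviceName: headless-svc   # required for per-pod DNS names like demo-0.headless-svc
  replicas: 2
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: app
        image: nginx
```

With this pairing, `demo-0.headless-svc.<namespace>.svc.cluster.local` resolves to the first pod's IP.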
DNS queries are blocked by security group rules or network ACLs
Security group rules or network access control lists (ACLs) that control the network communications of the ECS instance are blocking UDP port 53, causing CoreDNS resolution failures on some or all nodes. Modify the security group rules or network ACLs to allow inbound and outbound traffic on UDP port 53.
Container network connectivity errors cause DNS failures
UDP port 53 is blocked due to container network errors. Use the Network Diagnosis feature to identify the broken network path and the root cause.
CoreDNS pods are overloaded
When DNS query volume exceeds what the current number of CoreDNS replicas can handle, resolution latency rises and failures occur. Check whether CPU and memory usage on the CoreDNS pods is near the limit (kubectl -n kube-system top pod -l k8s-app=kube-dns).
Two fixes:
- Deploy NodeLocal DNSCache to absorb queries locally and reduce the load on CoreDNS. See Configure NodeLocal DNSCache.
- Scale out CoreDNS replicas so that peak CPU utilization per pod stays well below the node's available CPU.
DNS queries are not evenly distributed among CoreDNS pods
Imbalanced pod scheduling or a sessionAffinity setting on the kube-dns Service can cause some CoreDNS pods to handle far more queries than others. You'll see noticeably different CPU utilization across CoreDNS pods.
Two fixes:
- Scale out CoreDNS pods and spread them across different nodes.
- Remove the `sessionAffinity` setting from the kube-dns Service. See Configure the kube-dns Service.
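Removing session affinity amounts to setting it to `None` in the Service spec. A sketch of the relevant fragment; only the `sessionAffinity` field matters here:

```yaml
# kube-dns Service fragment (other fields omitted)
apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
spec:
  sessionAffinity: None   # ClientIP would pin each client to a single CoreDNS pod
```

The same change can be applied in place with a patch such as `kubectl -n kube-system patch service kube-dns -p '{"spec":{"sessionAffinity":"None"}}'`.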
CoreDNS pods do not run as normal
Misconfigured YAML or ConfigMap settings can prevent CoreDNS from starting or cause it to crash. Symptoms include pods not in Running state, a continuously increasing restart count, or errors in the operational log.
Check the CoreDNS log for these common errors:
| Error | Cause | Fix |
|---|---|---|
| `/etc/coredns/Corefile:4 - Error during parsing: Unknown directive 'ready'` | The CoreDNS ConfigMap contains a plugin that the current CoreDNS version does not support. | Delete the unsupported plugin (for example, ready) from the ConfigMap in the kube-system namespace. Repeat for any other plugins listed in the error. |
| `Failed to watch *v1.Pod: ... connect: connection refused` | Connections to the Kubernetes API server were interrupted when the log was generated. | If no DNS failures occurred during that period, this is not the root cause. Otherwise, test CoreDNS network connectivity. See Test network connectivity of CoreDNS. |
| `[ERROR] plugin/errors: 2 www.aliyun.com. A: read udp ...->100.100.2.136:53: i/o timeout` | CoreDNS could not reach the upstream DNS servers. | Test connectivity from the CoreDNS pod to 100.100.2.136 and 100.100.2.138. |
DNS resolutions fail because the client is overloaded
When the ECS instance running the pod is fully loaded, UDP packets can be dropped before they reach CoreDNS. Look for an abnormal network interface controller (NIC) retransmission rate and high CPU utilization on the instance in your monitoring data.
Two options:
- Submit a ticket for detailed investigation.
- Deploy NodeLocal DNSCache to reduce inter-node DNS traffic. See Configure NodeLocal DNSCache.
Conntrack table is full
When the Linux kernel's conntrack table is full, new UDP and TCP connections are dropped. This typically causes DNS failures during peak hours that recover during off-peak times. To confirm, run dmesg -H on the affected node and look for the keyword conntrack full during the failure window.
Increase the maximum number of entries in the conntrack table. See How do I increase the maximum number of tracked connections in the conntrack table of the Linux kernel?
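The relevant tunables live under `net.netfilter`. A sketch of an `/etc/sysctl.conf` fragment; the values are illustrative and should be sized to the node's memory and traffic:

```
# Raise the conntrack table ceiling (check the current value with
# sysctl net.netfilter.nf_conntrack_max before changing it).
net.netfilter.nf_conntrack_max = 1048576

# Optionally shorten the UDP conntrack timeout so stale DNS entries expire sooner.
net.netfilter.nf_conntrack_udp_timeout = 30
```

Apply the change with `sysctl -p`, and monitor `net.netfilter.nf_conntrack_count` afterward to confirm the table no longer fills during peak hours.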
autopath plugin does not work as normal
A known defect in CoreDNS's autopath plugin causes occasional resolution failures or wrong IP addresses for external domain names. Internal domain names continue to resolve correctly. The issue becomes more visible in clusters that create containers at a high frequency.
Disable the autopath plugin:
1. Run `kubectl -n kube-system edit configmap coredns` to open the CoreDNS ConfigMap.
2. Delete the `autopath @kubernetes` line. Save and exit.
3. Verify that the new configuration loaded by checking the CoreDNS log for the keyword `reload`.
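For orientation, the line to delete sits in the Corefile section of the ConfigMap. A sketch of the surrounding context; your Corefile may contain additional plugins:

```
.:53 {
    errors
    health
    autopath @kubernetes    # <- delete this line
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods verified       # autopath requires "pods verified"; unrelated to the fix
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
    reload
}
```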
DNS resolutions fail due to concurrent A and AAAA record queries
Some Linux distributions send A and AAAA record queries simultaneously over the same port. This can trigger a conntrack table conflict that drops UDP packets.
Symptoms: intermittent CoreDNS resolution failures, with packet capture or DNS query logs showing simultaneous A and AAAA queries from the same source port.
Fixes depend on your image base:
- CentOS or Ubuntu: Add `options timeout:2 attempts:3 rotate single-request-reopen` to the DNS resolver configuration.
- Alpine Linux: Replace the Alpine-based image with one based on another OS. See Alpine caveats.
- PHP with cURL: Add `CURL_IPRESOLVE_V4` to force IPv4-only resolution. See cURL functions.
- All environments: Deploy NodeLocal DNSCache, which mitigates the race condition. See Configure NodeLocal DNSCache.
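For glibc-based images (CentOS, Ubuntu), the resolver options above can also be injected per pod through dnsConfig instead of baking them into the image. A sketch; the option names follow resolv.conf(5), and musl-based images such as Alpine ignore `single-request-reopen`:

```yaml
# Pod spec fragment: inject resolver options without rebuilding the image
spec:
  dnsConfig:
    options:
    - name: timeout
      value: "2"
    - name: attempts
      value: "3"
    - name: rotate
    - name: single-request-reopen   # serialize A/AAAA lookups on fresh sockets
```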
DNS resolutions fail due to IPVS errors
If kube-proxy runs in IPVS mode and the cluster nodes run CentOS or Alibaba Cloud Linux 2 with a kernel earlier than 4.19.91-25.1.al7.x86_64, removing IPVS UDP backend pods causes source port conflicts that drop UDP packets. This appears as DNS failures lasting approximately 5 minutes during node scaling, node shutdown, or CoreDNS scaling events.
Two fixes:
- Deploy NodeLocal DNSCache to bypass the IPVS path for DNS queries. See Configure NodeLocal DNSCache.
- Shorten the UDP session timeout in IPVS mode. See Change the UDP timeout period in IPVS mode.
NodeLocal DNSCache does not work
All DNS queries go to CoreDNS instead of NodeLocal DNSCache when either of these conditions applies:
- `dnsConfig` was not injected into the application pods, so they still point to the kube-dns Service IP.
- The pods use an Alpine Linux base image, which sends DNS queries to all nameservers concurrently, including CoreDNS pods directly.
To fix the first case, enable automatic dnsConfig injection. See Configure NodeLocal DNSCache. For Alpine-based images, replace them with images built on another OS. See Alpine caveats.
Alibaba Cloud DNS PrivateZone names cannot be resolved
Alibaba Cloud DNS PrivateZone does not support TCP; it serves queries only over UDP. When NodeLocal DNSCache is in use, this surfaces in one of three ways: domain names added to Alibaba Cloud DNS PrivateZone cannot be resolved, the endpoints of Alibaba Cloud service APIs that contain vpc-proxy cannot be resolved, or domain names are resolved to wrong IP addresses.
Add prefer_udp to the CoreDNS configuration to force UDP for upstream queries. See Configure CoreDNS.
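The `prefer_udp` option belongs inside the `forward` block of the Corefile. A sketch; the rest of the Corefile shown here is illustrative:

```
.:53 {
    errors
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf {
        prefer_udp   # query upstream over UDP even when the client used TCP
    }
    cache 30
}
```

If NodeLocal DNSCache is deployed, apply the same option to its forward configuration as well, since it sits in front of CoreDNS on the query path.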