DNS resolution failures in an ACK cluster typically stem from a small set of root causes: CoreDNS pod issues, network connectivity problems, misconfigured DNS policies, or known Linux kernel and IPVS edge cases. This guide walks you through a systematic diagnosis process and provides targeted fixes for each failure pattern.
How DNS resolution works
When an application pod makes a DNS query, the request flows through the following path:
1. The pod sends a DNS query to the address in `/etc/resolv.conf`, which is typically the kube-dns Service IP.
2. kube-dns forwards the query to a CoreDNS pod in the `kube-system` namespace.
3. For internal domain names ending with `.cluster.local`, CoreDNS resolves the query from its own cache without contacting upstream servers.
4. For external domain names, CoreDNS forwards the query to the upstream DNS servers specified in its configuration. The default upstream servers are `100.100.2.136` and `100.100.2.138`, both deployed in the virtual private cloud (VPC).
If you have NodeLocal DNSCache installed, DNS queries go to the local cache (169.254.20.10) first. Only if the local cache cannot resolve the query does it fall back to the kube-dns Service.
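As a concrete reference point, the resolver configuration inside an application pod under the default ClusterFirst policy typically looks like the sketch below. The nameserver IP is a placeholder for your cluster's kube-dns Service IP:

```
# /etc/resolv.conf inside an application pod (illustrative values)
nameserver 172.21.0.10    # kube-dns Service IP; 169.254.20.10 when NodeLocal DNSCache is injected
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
```

The `ndots:5` option means that a name with fewer than five dots is first tried with each search suffix appended, which is why a typo in a short name often shows up in the query log as a series of NXDOMAIN responses for suffixed variants.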
Key concepts
| Term | Description |
|---|---|
| Internal domain name | A domain name ending with .cluster.local. CoreDNS resolves these from its cache, not upstream servers. |
| External domain name | Any domain name that does not end with .cluster.local. CoreDNS forwards queries for these to upstream DNS servers. |
| Application pod | Any pod that is not a system component. |
| kube-dns Service | The Kubernetes Service that routes DNS traffic to CoreDNS pods. Its IP is the default DNS server address in application pods. |
| NodeLocal DNSCache | A DaemonSet that runs a local DNS caching agent on each node. When enabled, pods send DNS queries to the local cache (169.254.20.10) instead of kube-dns directly. |
| Upstream DNS server | The DNS server that CoreDNS contacts for external domain names. Defaults to 100.100.2.136 and 100.100.2.138. |
Step 1: Identify your error message
Match your error message to determine the likely failure category.
| Client | Error message | Likely cause |
|---|---|---|
| ping | `ping: xxx.yyy.zzz: Name or service not known` | Domain name does not exist, or DNS server is unreachable. Resolution latency over 5 seconds suggests the DNS server is unreachable. |
| curl | `curl: (6) Could not resolve host: xxx.yyy.zzz` | Same as above. |
| PHP HTTP client | `php_network_getaddresses: getaddrinfo failed: Name or service not known in xxx.php on line yyy` | Same as above. |
| Golang HTTP client | `dial tcp: lookup xxx.yyy.zzz on 100.100.2.136:53: no such host` | Domain name does not exist. |
| dig | `;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: xxxxx` | Domain name does not exist. |
| Golang HTTP client | `dial tcp: lookup xxx.yyy.zzz on 100.100.2.139:53: read udp 192.168.0.100:42922->100.100.2.139:53: i/o timeout` | DNS server unreachable. |
| dig | `;; connection timed out; no servers could be reached` | DNS server unreachable. |
- Domain does not exist → Check whether the domain name itself is correct. If only external domains fail, see External domain name cannot be resolved.
- DNS server unreachable → Proceed to Step 2.
Step 2: Check DNS policy and server address
Verify that the pod is using CoreDNS as its DNS server.
Run the following commands:
```shell
# View the pod's DNS policy
kubectl get pod <pod-name> -o yaml

# Log in to the pod and inspect the DNS configuration
kubectl exec -it <pod-name> -- cat /etc/resolv.conf
```
Check the dnsPolicy field and the nameserver entries in /etc/resolv.conf.
| dnsPolicy value | Behavior |
|---|---|
| ClusterFirst | Default. The pod uses the kube-dns Service IP as its DNS server. |
| ClusterFirstWithHostNet | Same as ClusterFirst, but for pods that use the host network. |
| Default | The pod inherits DNS settings from the Elastic Compute Service (ECS) node. Use this only if the pod does not need to resolve cluster-internal names. |
| None | DNS is configured entirely through dnsConfig. NodeLocal DNSCache uses this value to inject 169.254.20.10 and the kube-dns IP as nameservers. |
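To make the None policy concrete, the sketch below shows a pod spec resembling what NodeLocal DNSCache injection produces. The kube-dns IP `172.21.0.10` and the search suffixes are placeholders; your cluster's values will differ:

```yaml
# Illustrative pod spec after NodeLocal DNSCache dnsConfig injection
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: app
    image: nginx
  dnsPolicy: None
  dnsConfig:
    nameservers:
    - 169.254.20.10   # NodeLocal DNSCache agent on the node
    - 172.21.0.10     # kube-dns Service IP, used as fallback
    searches:
    - default.svc.cluster.local
    - svc.cluster.local
    - cluster.local
    options:
    - name: ndots
      value: "5"
```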
If the pod is not using CoreDNS (nameserver is not the kube-dns Service IP), the pod may be overloaded or the conntrack table may be full. See Client is overloaded and Conntrack table is full.
If the pod is using NodeLocal DNSCache (nameserver is 169.254.20.10), see NodeLocal DNSCache does not work and Alibaba Cloud DNS PrivateZone names cannot be resolved.
If the pod is using CoreDNS, continue to Step 3.
Step 3: Check CoreDNS pod health
Run the following commands to inspect the CoreDNS pods:

```shell
# View CoreDNS pod status and placement
kubectl -n kube-system get pod -o wide -l k8s-app=kube-dns
```

Expected output:

```
NAME                      READY   STATUS    RESTARTS   AGE   IP            NODE
coredns-xxxxxxxxx-xxxxx   1/1     Running   0          25h   172.20.6.53   cn-hangzhou.192.168.0.198
```

```shell
# View real-time CPU and memory usage
kubectl -n kube-system top pod -l k8s-app=kube-dns
```

Expected output:

```
NAME                      CPU(cores)   MEMORY(bytes)
coredns-xxxxxxxxx-xxxxx   3m           18Mi
```
- Pods not in Running state → Run `kubectl -n kube-system describe pod <CoreDNS-pod-name>` to identify the cause. See CoreDNS pods do not run as normal.
- CPU or memory near the limit → See CoreDNS pods are overloaded.
- CPU usage uneven across pods → See DNS queries are not evenly distributed.
Step 4: Check CoreDNS operational logs
```shell
kubectl -n kube-system logs -f --tail=500 --timestamps <coredns-pod-name>
```
| Flag | Description |
|---|---|
| -f | Stream the log output. |
| --tail=500 | Show the last 500 lines. |
| --timestamps | Include a timestamp in each log line. |
Look for error patterns that match known issues. For DNS query-level logs, you must first enable the CoreDNS log plugin. For details, see Configure DNS resolution.
After enabling the log plugin, each resolved query produces an entry like:
```
[INFO] 172.20.2.25:44525 - 36259 "A IN redis-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd 110 0.000116946s
```
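For reference, enabling query logging amounts to adding the `log` directive to the Corefile in the coredns ConfigMap. The sketch below shows where it sits; your Corefile will likely contain more plugins:

```
.:53 {
    errors
    log                # emit one line per DNS query, as in the example above
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
    reload             # pick up ConfigMap edits without restarting the pods
}
```

The log plugin writes one line per query and can be costly on busy clusters, so consider disabling it again after troubleshooting.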
Common DNS response codes in the log:
| Response code | Meaning | What to do |
|---|---|---|
| NOERROR | Resolved successfully. | No action needed. |
| NXDOMAIN | Domain does not exist on the upstream server. | Check whether the domain name includes a search suffix that does not resolve. |
| SERVFAIL | Upstream DNS server returned an error. | Check CoreDNS connectivity to the upstream servers. |
| REFUSED | Upstream server rejected the query. | Check the CoreDNS Corefile configuration and the node's /etc/resolv.conf. |
For more information about DNS response codes, see RFC 1035.
Step 5: Reproduce the error and isolate the cause
If the error occurs consistently:
1. Check the DNS query log for error response codes. See External domain name cannot be resolved.
2. Test network connectivity between application pods and CoreDNS. See Test network connectivity between application pods and CoreDNS.
3. Diagnose the container network. See Diagnose the container network.
If the error occurs intermittently:
Capture packets to collect evidence. See Capture packets.
If none of the above steps resolve the issue, submit a ticket.
Diagnostic methods
Test network connectivity between application pods and CoreDNS
Connect to the application pod's network namespace using one of the following methods:
- Method 1 (recommended): Run `kubectl exec -it <pod-name> -- bash` to enter the pod directly.
- Method 2: Log in to the node, find the process ID with `ps aux | grep <application-process-name>`, then enter the network namespace with `nsenter -t <pid> -n bash`.
- Method 3 (for frequently restarting pods):
  1. Log in to the node.
  2. Run `docker ps -a | grep <application-container-name>` to find the sandbox container ID (names start with `k8s_POD_`).
  3. Run `docker inspect <sandboxed-container-ID> | grep netns` to find the network namespace path in `/var/run/docker/netns/xxxx`.
  4. Run `nsenter -n<netns-path> bash` to enter the namespace. Note: Do not add a space between `-n` and `<netns-path>`.

Once inside the pod's network namespace, run these connectivity tests:
```shell
# Test connectivity to the kube-dns Service
dig <domain> @<kube-dns-svc-ip>

# Test Internet Control Message Protocol (ICMP) connectivity to the CoreDNS pod
ping <coredns-pod-ip>

# Test a DNS query directly against the CoreDNS pod
dig <domain> @<coredns-pod-ip>
```
Replace <kube-dns-svc-ip> with the kube-dns Service IP in the kube-system namespace, and <coredns-pod-ip> with the IP of a CoreDNS pod.
| Symptom | Likely cause | Next step |
|---|---|---|
| Cannot reach the kube-dns Service | Node overloaded, kube-proxy down, or security group blocking User Datagram Protocol (UDP) port 53 | Verify that security group rules allow UDP port 53. If they do, submit a ticket. |
| Cannot reach the CoreDNS pod (ICMP) | Container network error or security group blocking ICMP | Diagnose the container network. |
| Cannot reach the CoreDNS pod (DNS) | Node overloaded or security group blocking UDP port 53 | Verify that security group rules allow UDP port 53. If they do, submit a ticket. |
Test network connectivity of CoreDNS
1. Log in to the node where the CoreDNS pod runs.
2. Run `ps aux | grep coredns` to get the CoreDNS process ID.
3. Run `nsenter -t <pid> -n bash` to enter the CoreDNS network namespace.
4. Test connectivity:

   ```shell
   # Test connectivity to the Kubernetes API server
   telnet <apiserver-slb-ip> 443

   # Test connectivity to the upstream DNS servers
   dig <domain> @100.100.2.136
   dig <domain> @100.100.2.138
   ```
| Symptom | Likely cause | Next step |
|---|---|---|
| Cannot reach the Kubernetes API server | API server error, node overloaded, or kube-proxy down | Submit a ticket. |
| Cannot reach upstream DNS servers | Node overloaded, CoreDNS misconfigured, or Express Connect routing error | Submit a ticket. |
Diagnose the container network
1. Log in to the ACK console.
2. On the Clusters page, click the name of your cluster or click Details in the Actions column.
3. In the left-side navigation pane, choose Operations > Cluster Check.
4. On the Container Intelligence Service page, choose Cluster Check > Diagnosis.
5. On the Diagnosis page, click the Network Diagnosis tab.
6. Set Source address to the application pod IP, Destination address to the kube-dns Service IP, and Destination port to `53`. Select Enable packet tracing and I know and agree, then click Create diagnosis.
7. In the diagnosis list, click Diagnosis details for your record.
The results show the Diagnosis result, Packet paths, and All possible paths sections, along with identified error causes. For details, see Use the cluster diagnostics feature to troubleshoot cluster issues.
Capture packets
Use packet capture when errors are intermittent and hard to reproduce.
1. Log in to the nodes where the application pods and the CoreDNS pod run.
2. Run the following command on each ECS instance:

   ```shell
   tcpdump -i any port 53 -C 20 -W 200 -w /tmp/client_dns.pcap
   ```

   This captures all traffic on port 53, rotating across up to 200 files of 20 MB each.
3. Reproduce the error, then analyze the packets from the time window when DNS failures occurred. Check your application log for the exact timestamps.
Packet capture does not affect your service. It causes only a slight increase in CPU utilization and disk I/O.
Known issues
The following issues are caused by environment-specific conditions. Scan this list before doing deeper investigation.
| Issue | Affected environments | Quick identification |
|---|---|---|
| Concurrent A and AAAA record queries | All (especially Alpine-based images and PHP apps) | Intermittent failures; packet capture shows simultaneous A/AAAA queries on the same port |
| IPVS UDP source port conflicts | kube-proxy in IP Virtual Server (IPVS) mode; CentOS or Alibaba Cloud Linux 2 with a kernel earlier than 4.19.91-25.1.al7.x86_64 | Failures last approximately 5 minutes during node scaling or CoreDNS scaling |
| Conntrack table full | High-traffic nodes | dmesg -H shows conntrack full; failures during peak hours |
| Alibaba Cloud DNS PrivateZone with NodeLocal DNSCache | Clusters using both NodeLocal DNSCache and DNS PrivateZone | PrivateZone or vpc-proxy domain names fail to resolve or resolve to wrong addresses |
| autopath plugin bug | Clusters creating containers at high frequency | External names intermittently fail or resolve to wrong IPs; internal names resolve normally |
| DNS PrivateZone and vpc-proxy names | Clusters where both internal and external domain names fail | Resolution errors only on domain names added to Alibaba Cloud DNS PrivateZone and domain names that contain vpc-proxy |
FAQ
External domain name cannot be resolved
Check the CoreDNS query log for the response code. Enable the CoreDNS log plugin if it is not already on (see Configure DNS resolution), then look for entries for the failing domain. An NXDOMAIN response means the domain does not exist on the upstream server, often because a search domain suffix was appended to a short name, producing an invalid fully qualified domain name (FQDN). A SERVFAIL or REFUSED response means the upstream DNS server itself has a problem; check the CoreDNS configuration and its connectivity to 100.100.2.136 and 100.100.2.138.
Domain names of headless Services cannot be resolved
In CoreDNS versions earlier than 1.7.0, network jitter between CoreDNS and the Kubernetes API server can cause CoreDNS to exit unexpectedly, so headless Service domain records are not updated while it is down. Update CoreDNS to 1.7.0 or later to fix this. See [Component Updates] Update CoreDNS.
Domain names of StatefulSet pods cannot be resolved
The pod YAML template must set serviceName to the name of the headless Service that exposes the StatefulSet. Without this, the per-pod DNS names (for example, pod.headless-svc.ns.svc.cluster.local) cannot be resolved, even though the Service-level name (for example, headless-svc.ns.svc.cluster.local) works fine. Set serviceName in the StatefulSet spec to the headless Service name.
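A minimal sketch of the required pairing, with all names as illustrative placeholders:

```yaml
# Headless Service (clusterIP: None) plus a StatefulSet that references it.
apiVersion: v1
kind: Service
metadata:
  name: headless-svc
spec:
  clusterIP: None           # headless: DNS returns per-pod records
  selector:
    app: demo
  ports:
  - port: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo
spec:
  serviceName: headless-svc   # required for per-pod DNS names like demo-0.headless-svc
  replicas: 2
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: app
        image: nginx
```

With this pairing, `demo-0.headless-svc.<namespace>.svc.cluster.local` resolves to the first pod's IP.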
DNS queries are blocked by security group rules or network ACLs
Security group rules or network access control lists (ACLs) that control the network communications of the ECS instance are blocking UDP port 53, causing CoreDNS resolution failures on some or all nodes. Modify the security group rules or network ACLs to allow inbound and outbound traffic on UDP port 53.
Container network connectivity errors cause DNS failures
UDP port 53 is blocked due to container network errors. Use the Network Diagnosis feature to identify the broken network path and the root cause.
CoreDNS pods are overloaded
When DNS query volume exceeds what the current number of CoreDNS replicas can handle, resolution latency rises and failures occur. Check whether CPU and memory usage on the CoreDNS pods is near the limit (kubectl -n kube-system top pod -l k8s-app=kube-dns).
Two fixes:
- Deploy NodeLocal DNSCache to absorb queries locally and reduce the load on CoreDNS. See Configure NodeLocal DNSCache.
- Scale out CoreDNS replicas so that peak CPU utilization per pod stays well below the node's available CPU.
DNS queries are not evenly distributed among CoreDNS pods
Imbalanced pod scheduling or a sessionAffinity setting on the kube-dns Service can cause some CoreDNS pods to handle far more queries than others. You'll see noticeably different CPU utilization across CoreDNS pods.
Two fixes:
- Scale out CoreDNS pods and spread them across different nodes.
- Remove the `sessionAffinity` setting from the kube-dns Service. See Configure the kube-dns Service.
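Removing session affinity amounts to setting it to `None` in the Service spec. A sketch of the relevant fragment; only the `sessionAffinity` field matters here:

```yaml
# kube-dns Service fragment (other fields omitted)
apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
spec:
  sessionAffinity: None   # ClientIP would pin each client to a single CoreDNS pod
```

The same change can be applied in place with a patch such as `kubectl -n kube-system patch service kube-dns -p '{"spec":{"sessionAffinity":"None"}}'`.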
CoreDNS pods do not run as normal
Misconfigured YAML or ConfigMap settings can prevent CoreDNS from starting or cause it to crash. Symptoms include pods not in Running state, a continuously increasing restart count, or errors in the operational log.
Check the CoreDNS log for these common errors:
| Error | Cause | Fix |
|---|---|---|
| `/etc/coredns/Corefile:4 - Error during parsing: Unknown directive 'ready'` | The CoreDNS ConfigMap contains a plugin that the current CoreDNS version does not support. | Delete the unsupported plugin (for example, ready) from the ConfigMap in the kube-system namespace. Repeat for any other plugins listed in the error. |
| `Failed to watch *v1.Pod: ... connect: connection refused` | Connections to the Kubernetes API server were interrupted when the log was generated. | If no DNS failures occurred during that period, this is not the root cause. Otherwise, test CoreDNS network connectivity. See Test network connectivity of CoreDNS. |
| `[ERROR] plugin/errors: 2 www.aliyun.com. A: read udp ...->100.100.2.136:53: i/o timeout` | CoreDNS could not reach the upstream DNS servers. | Test connectivity from the CoreDNS pod to 100.100.2.136 and 100.100.2.138. |
DNS resolutions fail because the client is overloaded
When the ECS instance running the pod is fully loaded, UDP packets can be dropped before they reach CoreDNS. Look for an abnormal network interface controller (NIC) retransmission rate and high CPU utilization on the instance in your monitoring data.
Two options:
- Submit a ticket for detailed investigation.
- Deploy NodeLocal DNSCache to reduce inter-node DNS traffic. See Configure NodeLocal DNSCache.
Conntrack table is full
When the Linux kernel's conntrack table is full, new UDP and TCP connections are dropped. This typically causes DNS failures during peak hours that recover during off-peak times. To confirm, run dmesg -H on the affected node and look for the keyword conntrack full during the failure window.
Increase the maximum number of entries in the conntrack table. See How do I increase the maximum number of tracked connections in the conntrack table of the Linux kernel?
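The relevant tunables live under `net.netfilter`. A sketch of an `/etc/sysctl.conf` fragment; the values are illustrative and should be sized to the node's memory and traffic:

```
# Raise the conntrack table ceiling (check the current value with
# sysctl net.netfilter.nf_conntrack_max before changing it).
net.netfilter.nf_conntrack_max = 1048576

# Optionally shorten the UDP conntrack timeout so stale DNS entries expire sooner.
net.netfilter.nf_conntrack_udp_timeout = 30
```

Apply the change with `sysctl -p`, and monitor `net.netfilter.nf_conntrack_count` afterward to confirm the table no longer fills during peak hours.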
autopath plugin does not work as normal
A known defect in CoreDNS's autopath plugin causes occasional resolution failures or wrong IP addresses for external domain names. Internal domain names continue to resolve correctly. The issue becomes more visible in clusters that create containers at a high frequency.
Disable the autopath plugin:
1. Run `kubectl -n kube-system edit configmap coredns` to open the CoreDNS ConfigMap.
2. Delete the `autopath @kubernetes` line. Save and exit.
3. Verify that the new configuration loaded by checking the CoreDNS log for the keyword `reload`.
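For orientation, the line to delete sits in the Corefile section of the ConfigMap. A sketch of the surrounding context; your Corefile may contain additional plugins:

```
.:53 {
    errors
    health
    autopath @kubernetes    # <- delete this line
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods verified       # autopath requires "pods verified"; unrelated to the fix
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
    reload
}
```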
DNS resolutions fail due to concurrent A and AAAA record queries
Some Linux distributions send A and AAAA record queries simultaneously over the same port. This can trigger a conntrack table conflict that drops UDP packets.
Symptoms: intermittent CoreDNS resolution failures, with packet capture or DNS query logs showing simultaneous A and AAAA queries from the same source port.
Fixes depend on your image base:
- CentOS or Ubuntu: Add `options timeout:2 attempts:3 rotate single-request-reopen` to the DNS resolver configuration.
- Alpine Linux: Replace the Alpine-based image with one based on another OS. See Alpine caveats.
- PHP with cURL: Add `CURL_IPRESOLVE_V4` to force IPv4-only resolution. See cURL functions.
- All environments: Deploy NodeLocal DNSCache, which mitigates the race condition. See Configure NodeLocal DNSCache.
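For glibc-based images (CentOS, Ubuntu), the resolver options above can also be injected per pod through dnsConfig instead of baking them into the image. A sketch; the option names follow resolv.conf(5), and musl-based images such as Alpine ignore `single-request-reopen`:

```yaml
# Pod spec fragment: inject resolver options without rebuilding the image
spec:
  dnsConfig:
    options:
    - name: timeout
      value: "2"
    - name: attempts
      value: "3"
    - name: rotate
    - name: single-request-reopen   # serialize A/AAAA lookups on fresh sockets
```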
DNS resolutions fail due to IPVS errors
If kube-proxy runs in IPVS mode and the cluster nodes run CentOS or Alibaba Cloud Linux 2 with a kernel earlier than 4.19.91-25.1.al7.x86_64, removing IPVS UDP backend pods causes source port conflicts that drop UDP packets. This appears as DNS failures lasting approximately 5 minutes during node scaling, node shutdown, or CoreDNS scaling events.
Two fixes:
- Deploy NodeLocal DNSCache to bypass the IPVS path for DNS queries. See Configure NodeLocal DNSCache.
- Shorten the UDP session timeout in IPVS mode. See Change the UDP timeout period in IPVS mode.
NodeLocal DNSCache does not work
All DNS queries go to CoreDNS instead of NodeLocal DNSCache when either of these conditions applies:
- `dnsConfig` was not injected into the application pods, so they still point to the kube-dns Service IP.
- The pods use an Alpine Linux base image, which sends DNS queries to all nameservers concurrently, including CoreDNS pods directly.
To fix the first case, enable automatic dnsConfig injection. See Configure NodeLocal DNSCache. For Alpine-based images, replace them with images built on another OS. See Alpine caveats.
Alibaba Cloud DNS PrivateZone names cannot be resolved
Alibaba Cloud DNS PrivateZone does not support TCP; it serves queries only over UDP. When NodeLocal DNSCache is in use, this surfaces in one of three ways: domain names added to Alibaba Cloud DNS PrivateZone cannot be resolved, the endpoints of Alibaba Cloud service APIs that contain vpc-proxy cannot be resolved, or domain names are resolved to wrong IP addresses.
Add prefer_udp to the CoreDNS configuration to force UDP for upstream queries. See Configure CoreDNS.
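The `prefer_udp` option belongs inside the `forward` block of the Corefile. A sketch; the rest of the Corefile shown here is illustrative:

```
.:53 {
    errors
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf {
        prefer_udp   # query upstream over UDP even when the client used TCP
    }
    cache 30
}
```

If NodeLocal DNSCache is deployed, apply the same option to its forward configuration as well, since it sits in front of CoreDNS on the query path.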