
Container Service for Kubernetes:Troubleshoot DNS resolution errors

Last Updated:Mar 26, 2026

DNS resolution failures in an ACK cluster typically stem from a small set of root causes: CoreDNS pod issues, network connectivity problems, misconfigured DNS policies, or known Linux kernel and IPVS edge cases. This guide walks you through a systematic diagnosis process and provides targeted fixes for each failure pattern.

How DNS resolution works

When an application pod makes a DNS query, the request flows through the following path:

  1. The pod sends a DNS query to the address in /etc/resolv.conf, which is typically the kube-dns Service IP.

  2. kube-dns forwards the query to a CoreDNS pod in the kube-system namespace.

  3. For internal domain names ending with .cluster.local, CoreDNS resolves the query from its own cache without contacting upstream servers.

  4. For external domain names, CoreDNS forwards the query to the upstream DNS servers specified in its configuration. The default upstream servers are 100.100.2.136 and 100.100.2.138, both deployed in the virtual private cloud (VPC).

If you have NodeLocal DNSCache installed, DNS queries go to the local cache (169.254.20.10) first. Only if the local cache cannot resolve the query does it fall back to the kube-dns Service.
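The client side of this path is driven entirely by the contents of /etc/resolv.conf. As an illustration (not part of any ACK tooling), a minimal Python sketch shows the fields the resolver reads and that the diagnostic steps below inspect:

```python
def parse_resolv_conf(text):
    """Parse the /etc/resolv.conf fields that matter for pod DNS."""
    conf = {"nameservers": [], "search": [], "options": {}}
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue
        if parts[0] == "nameserver":
            conf["nameservers"].append(parts[1])
        elif parts[0] == "search":
            conf["search"] = parts[1:]
        elif parts[0] == "options":
            for opt in parts[1:]:
                key, _, val = opt.partition(":")
                conf["options"][key] = int(val) if val else True
    return conf

# Typical contents inside a ClusterFirst pod (IPs are illustrative).
sample = """\
nameserver 172.21.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
"""
print(parse_resolv_conf(sample))
```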

Key concepts

| Term | Description |
| --- | --- |
| Internal domain name | A domain name ending with .cluster.local. CoreDNS resolves these from its cache, not upstream servers. |
| External domain name | Any domain name that does not end with .cluster.local. CoreDNS forwards queries for these to upstream DNS servers. |
| Application pod | Any pod that is not a system component. |
| kube-dns Service | The Kubernetes Service that routes DNS traffic to CoreDNS pods. Its IP is the default DNS server address in application pods. |
| NodeLocal DNSCache | A DaemonSet that runs a local DNS caching agent on each node. When enabled, pods send DNS queries to the local cache (169.254.20.10) instead of kube-dns directly. |
| Upstream DNS server | The DNS server that CoreDNS contacts for external domain names. Defaults to 100.100.2.136 and 100.100.2.138. |

Step 1: Identify your error message

Match your error message to determine the likely failure category.

| Client | Error message | Likely cause |
| --- | --- | --- |
| ping | ping: xxx.yyy.zzz: Name or service not known | Domain name does not exist, or the DNS server is unreachable. Latency over 5 seconds points to an unreachable DNS server. |
| curl | curl: (6) Could not resolve host: xxx.yyy.zzz | Same as above. |
| PHP HTTP client | php_network_getaddresses: getaddrinfo failed: Name or service not known in xxx.php on line yyy | Same as above. |
| Golang HTTP client | dial tcp: lookup xxx.yyy.zzz on 100.100.2.136:53: no such host | Domain name does not exist. |
| dig | ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: xxxxx | Domain name does not exist. |
| Golang HTTP client | dial tcp: lookup xxx.yyy.zzz on 100.100.2.139:53: read udp 192.168.0.100:42922->100.100.2.139:53: i/o timeout | DNS server unreachable. |
| dig | ;; connection timed out; no servers could be reached | DNS server unreachable. |

Step 2: Check DNS policy and server address

Verify that the pod is using CoreDNS as its DNS server.

Run the following commands:

# View the pod's DNS policy
kubectl get pod <pod-name> -o yaml

# Log in to the pod and inspect the DNS configuration
kubectl exec -it <pod-name> -- cat /etc/resolv.conf

Check the dnsPolicy field and the nameserver entries in /etc/resolv.conf.

| dnsPolicy value | Behavior |
| --- | --- |
| ClusterFirst | Default. The pod uses the kube-dns Service IP as its DNS server. |
| ClusterFirstWithHostNet | Same as ClusterFirst, for pods that use the host network. |
| Default | The pod inherits DNS settings from the Elastic Compute Service (ECS) node. Use this only if the pod does not need to resolve cluster-internal names. |
| None | DNS is configured entirely through dnsConfig. NodeLocal DNSCache uses this value to inject 169.254.20.10 and the kube-dns IP as nameservers. |
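For reference, a pod with dnsPolicy: None, roughly as NodeLocal DNSCache injects it, might look like the following sketch. The kube-dns Service IP varies per cluster, so it is shown as a placeholder:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  dnsPolicy: None
  dnsConfig:
    nameservers:
    - 169.254.20.10      # NodeLocal DNSCache agent on this node
    - <kube-dns-svc-ip>  # fallback: the kube-dns Service IP
    searches:
    - default.svc.cluster.local
    - svc.cluster.local
    - cluster.local
    options:
    - name: ndots
      value: "5"
  containers:
  - name: app
    image: <your-image>
```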

If the pod is not using CoreDNS (nameserver is not the kube-dns Service IP), the pod may be overloaded or the conntrack table may be full. See Client is overloaded and Conntrack table is full.

If the pod is using NodeLocal DNSCache (nameserver is 169.254.20.10), see NodeLocal DNSCache does not work and Alibaba Cloud DNS PrivateZone names cannot be resolved.

If the pod is using CoreDNS, continue to Step 3.

Step 3: Check CoreDNS pod health

Run the following commands to inspect the CoreDNS pods:

# View CoreDNS pod status and placement
kubectl -n kube-system get pod -o wide -l k8s-app=kube-dns

Expected output:

NAME                      READY   STATUS    RESTARTS   AGE   IP            NODE
coredns-xxxxxxxxx-xxxxx   1/1     Running   0          25h   172.20.6.53   cn-hangzhou.192.168.0.198

# View real-time CPU and memory usage
kubectl -n kube-system top pod -l k8s-app=kube-dns

Expected output:

NAME                      CPU(cores)   MEMORY(bytes)
coredns-xxxxxxxxx-xxxxx   3m           18Mi

Step 4: Check CoreDNS operational logs

kubectl -n kube-system logs -f --tail=500 --timestamps <coredns-pod-name>

| Flag | Description |
| --- | --- |
| -f | Stream the log output. |
| --tail=500 | Show the last 500 lines. |
| --timestamps | Include a timestamp in each log line. |

Look for error patterns that match known issues. For DNS query-level logs, you must first enable the CoreDNS log plugin. For details, see Configure DNS resolution.

After enabling the log plugin, each resolved query produces an entry like:

[INFO] 172.20.2.25:44525 - 36259 "A IN redis-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd 110 0.000116946s

Common DNS response codes in the log:

| Response code | Meaning | What to do |
| --- | --- | --- |
| NOERROR | Resolved successfully. | No action needed. |
| NXDOMAIN | Domain does not exist on the upstream server. | Check whether the domain name includes a search suffix that does not resolve. |
| SERVFAIL | Upstream DNS server returned an error. | Check CoreDNS connectivity to the upstream servers. |
| REFUSED | Upstream server rejected the query. | Check the CoreDNS Corefile configuration and the node's /etc/resolv.conf. |
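To triage these response codes in bulk, the query log lines can be parsed programmatically. The regular expression below is a best-effort sketch against the default log plugin format shown above, not an official parser:

```python
import re

# Field order in the default log plugin output:
# client, query id, quoted query (type/name/proto/size/do/bufsize),
# response code, flags, response size, duration.
LOG_RE = re.compile(
    r'\[INFO\] (?P<client>\S+) - \d+ '
    r'"(?P<qtype>\S+) IN (?P<name>\S+) \S+ \d+ \S+ \d+" '
    r'(?P<rcode>\S+) \S+ \d+ (?P<duration>[\d.]+)s'
)

def parse_query_log(line):
    """Return the query fields from one CoreDNS log line, or None."""
    m = LOG_RE.search(line)
    return m.groupdict() if m else None

line = ('[INFO] 172.20.2.25:44525 - 36259 '
        '"A IN redis-master.default.svc.cluster.local. udp 56 false 512" '
        'NOERROR qr,aa,rd 110 0.000116946s')
print(parse_query_log(line))
```

Feeding `kubectl logs` output through such a parser makes it easy to count NXDOMAIN or SERVFAIL answers per domain.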

For more information about DNS response codes, see RFC 1035.

Step 5: Reproduce the error and isolate the cause

If the error occurs consistently:

  1. Check the DNS query log for error response codes. See External domain name cannot be resolved.

  2. Test network connectivity between application pods and CoreDNS. See Test network connectivity between application pods and CoreDNS.

  3. Diagnose the container network. See Diagnose the container network.

If the error occurs intermittently:

Capture packets to collect evidence. See Capture packets.

If none of the above steps resolve the issue, submit a ticket.

Diagnostic methods

Test network connectivity between application pods and CoreDNS

Connect to the application pod's network namespace using one of the following methods:

  • Method 1 (recommended): Run kubectl exec -it <pod-name> -- bash to enter the pod directly.

  • Method 2: Log in to the node, find the process ID with ps aux | grep <application-process-name>, then enter the network namespace with nsenter -t <pid> -n bash.

  • Method 3 (for frequently restarting pods):

    1. Log in to the node.

    2. Run docker ps -a | grep <application-container-name> to find the sandbox container ID (its name starts with k8s_POD_).

    3. Run docker inspect <sandbox-container-ID> | grep netns to find the network namespace path, such as /var/run/docker/netns/xxxx.

    4. Run nsenter -n<netns-path> bash to enter the namespace. Note: Do not add a space between -n and <netns-path>.

Once inside the pod's network namespace, run these connectivity tests:

# Test connectivity to the kube-dns Service
dig <domain> @<kube-dns-svc-ip>

# Test Internet Control Message Protocol (ICMP) connectivity to the CoreDNS pod
ping <coredns-pod-ip>

# Test DNS query directly to the CoreDNS pod
dig <domain> @<coredns-pod-ip>

Replace <kube-dns-svc-ip> with the kube-dns Service IP in the kube-system namespace, and <coredns-pod-ip> with the IP of a CoreDNS pod.
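If dig is not available inside a minimal container image, the same kind of query can be built by hand. The sketch below constructs a minimal DNS query packet (RFC 1035) in Python; sending it over UDP port 53 to the kube-dns Service IP exercises the same path as dig. The function name and defaults are illustrative:

```python
import struct

def build_dns_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet, like dig sends on the wire.

    qtype 1 = A record. Flags 0x0100 set RD (recursion desired).
    """
    header = struct.pack(">HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode()
        for label in name.rstrip(".").split(".")
    ) + b"\x00"                                   # root label terminator
    question = qname + struct.pack(">HH", qtype, 1)  # QTYPE, QCLASS=IN
    return header + question

pkt = build_dns_query("kubernetes.default.svc.cluster.local")
# To send it from inside the pod's namespace (not run here):
#   import socket
#   s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#   s.sendto(pkt, ("<kube-dns-svc-ip>", 53))
```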

| Symptom | Likely cause | Next step |
| --- | --- | --- |
| Cannot reach the kube-dns Service | Node overloaded, kube-proxy down, or a security group blocking User Datagram Protocol (UDP) port 53 | Verify that security group rules allow UDP port 53. If they do, submit a ticket. |
| Cannot reach the CoreDNS pod (ICMP) | Container network error or a security group blocking ICMP | Diagnose the container network. |
| Cannot reach the CoreDNS pod (DNS) | Node overloaded or a security group blocking UDP port 53 | Verify that security group rules allow UDP port 53. If they do, submit a ticket. |

Test network connectivity of CoreDNS

  1. Log in to the node where the CoreDNS pod runs.

  2. Run ps aux | grep coredns to get the CoreDNS process ID.

  3. Run nsenter -t <pid> -n bash to enter the CoreDNS network namespace.

  4. Test connectivity:

    # Test connectivity to the Kubernetes API server
    telnet <apiserver-slb-ip> 443
    
    # Test connectivity to upstream DNS servers
    dig <domain> @100.100.2.136
    dig <domain> @100.100.2.138

| Symptom | Likely cause | Next step |
| --- | --- | --- |
| Cannot reach the Kubernetes API server | API server error, node overloaded, or kube-proxy down | Submit a ticket. |
| Cannot reach the upstream DNS servers | Node overloaded, CoreDNS misconfigured, or an Express Connect routing error | Submit a ticket. |

Diagnose the container network

  1. Log in to the ACK console.

  2. On the Clusters page, click the name of your cluster or click Details in the Actions column.

  3. In the left-side navigation pane, choose Operations > Cluster Check.

  4. On the Container Intelligence Service page, choose Cluster Check > Diagnosis.

  5. On the Diagnosis page, click the Network Diagnosis tab.

  6. Set Source address to the application pod IP, Destination address to the kube-dns Service IP, and Destination port to 53. Select Enable packet tracing and I know and agree, then click Create diagnosis.

  7. In the diagnosis list, click Diagnosis details for your record.

The results show the Diagnosis result, Packet paths, and All possible paths sections, along with identified error causes. For details, see Use the cluster diagnostics feature to troubleshoot cluster issues.

Capture packets

Use packet capture when errors are intermittent and hard to reproduce.

  1. Log in to the nodes where the application pods and the CoreDNS pod run.

  2. Run the following command on each ECS instance:

    tcpdump -i any port 53 -C 20 -W 200 -w /tmp/client_dns.pcap

    This captures all traffic on port 53, rotating across up to 200 files of 20 MB each.

  3. Reproduce the error, then analyze the packets from the time window when DNS failures occurred. Check your application log for the exact timestamps.

Packet capture does not affect your service. It causes only a slight increase in CPU utilization and disk I/O.

Known issues

The following issues are caused by environment-specific conditions. Scan this list before doing deeper investigation.

| Issue | Affected environments | Quick identification |
| --- | --- | --- |
| Concurrent A and AAAA record queries | All (especially Alpine-based images and PHP apps) | Intermittent failures; packet capture shows simultaneous A/AAAA queries on the same port |
| IPVS UDP source port conflicts | kube-proxy in IP Virtual Server (IPVS) mode; CentOS or Alibaba Cloud Linux 2 with a kernel earlier than 4.19.91-25.1.al7.x86_64 | Failures last approximately 5 minutes during node scaling or CoreDNS scaling |
| Conntrack table full | High-traffic nodes | dmesg -H shows conntrack full; failures during peak hours |
| Alibaba Cloud DNS PrivateZone with NodeLocal DNSCache | Clusters using both NodeLocal DNSCache and DNS PrivateZone | PrivateZone or vpc-proxy domain names fail to resolve or resolve to wrong addresses |
| autopath plugin bug | Clusters creating containers at high frequency | External names intermittently fail or resolve to wrong IPs; internal names resolve normally |
| DNS PrivateZone and vpc-proxy names | Clusters where both internal and external domain names fail | Resolution errors only on domain names added to Alibaba Cloud DNS PrivateZone and domain names that contain vpc-proxy |

FAQ

External domain name cannot be resolved

Check the CoreDNS DNS query log for the response code. Enable the CoreDNS log plugin if it is not already on (see Configure DNS resolution), then look for entries for the failing domain. An NXDOMAIN response means the domain does not exist on the upstream server, often because a search domain suffix was appended to a short name, producing an invalid fully qualified domain name (FQDN). A SERVFAIL or REFUSED response means the upstream DNS server itself has a problem; check the CoreDNS configuration and its connectivity to 100.100.2.136 and 100.100.2.138.
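The search-suffix effect can be sketched with a short illustration of the glibc expansion rules (ndots defaults to 5 inside pods), which shows why a short external name is first tried against cluster suffixes:

```python
def candidate_fqdns(name, search_domains, ndots=5):
    """Mimic the glibc resolver's search-list expansion (illustrative).

    A name with fewer than `ndots` dots is tried with each search
    suffix first, so a lookup of www.example.com first produces
    queries such as www.example.com.default.svc.cluster.local.,
    each of which returns NXDOMAIN before the bare name is tried.
    """
    if name.endswith("."):           # already fully qualified: no expansion
        return [name]
    with_suffixes = [f"{name}.{d}." for d in search_domains]
    if name.count(".") < ndots:
        return with_suffixes + [name + "."]
    return [name + "."] + with_suffixes

search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
print(candidate_fqdns("www.aliyun.com", search))
```

Appending a trailing dot to the name in the application (making it an FQDN) skips the expansion entirely.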

Domain names of headless Services cannot be resolved

In CoreDNS versions earlier than 1.7.0, a network jitter on the Kubernetes API server can cause CoreDNS to exit unexpectedly, which stops it from updating headless Service domain records while it is down. Update CoreDNS to 1.7.0 or later to fix this. See [Component Updates] Update CoreDNS.

Domain names of StatefulSet pods cannot be resolved

The pod YAML template must set serviceName to the name of the headless Service that exposes the StatefulSet. Without this, the per-pod DNS names (for example, pod.headless-svc.ns.svc.cluster.local) cannot be resolved, even though the Service-level name (for example, headless-svc.ns.svc.cluster.local) works fine. Set serviceName in the StatefulSet spec to the headless Service name.
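A minimal sketch of the required pairing (names here are illustrative): the StatefulSet's serviceName points at the headless Service, which makes per-pod names such as demo-0.headless-svc.default.svc.cluster.local resolvable:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: headless-svc
spec:
  clusterIP: None          # headless Service
  selector:
    app: demo
  ports:
  - port: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo
spec:
  serviceName: headless-svc   # must match the headless Service name
  replicas: 2
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: app
        image: nginx
```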

DNS queries are blocked by security group rules or network ACLs

Security group rules or network access control lists (ACLs) that control the network communications of the ECS instance are blocking UDP port 53, causing CoreDNS resolution failures on some or all nodes. Modify the security group rules or network ACLs to allow inbound and outbound traffic on UDP port 53.

Container network connectivity errors cause DNS failures

UDP port 53 is blocked due to container network errors. Use the Network Diagnosis feature to identify the broken network path and the root cause.

CoreDNS pods are overloaded

When DNS query volume exceeds what the current number of CoreDNS replicas can handle, resolution latency rises and failures occur. Check whether CPU and memory usage on the CoreDNS pods is near the limit (kubectl -n kube-system top pod -l k8s-app=kube-dns).

Two fixes:

  • Deploy NodeLocal DNSCache to absorb queries locally and reduce the load on CoreDNS. See Configure NodeLocal DNSCache.

  • Scale out CoreDNS replicas so that peak CPU utilization per pod stays well below the node's available CPU.

DNS queries are not evenly distributed among CoreDNS pods

Imbalanced pod scheduling or a sessionAffinity setting on the kube-dns Service can cause some CoreDNS pods to handle far more queries than others. You'll see noticeably different CPU utilization across CoreDNS pods.

Two fixes:

  • Scale out CoreDNS pods and spread them across different nodes.

  • Remove the sessionAffinity setting from the kube-dns Service. See Configure the kube-dns Service.

CoreDNS pods do not run as normal

Misconfigured YAML or ConfigMap settings can prevent CoreDNS from starting or cause it to crash. Symptoms include pods not in Running state, a continuously increasing restart count, or errors in the operational log.

Check the CoreDNS log for these common errors:

| Error | Cause | Fix |
| --- | --- | --- |
| /etc/coredns/Corefile:4 - Error during parsing: Unknown directive 'ready' | The CoreDNS ConfigMap contains a plugin that the current CoreDNS version does not support. | Delete the unsupported plugin (for example, ready) from the ConfigMap in the kube-system namespace. Repeat for any other plugins listed in the error. |
| Failed to watch *v1.Pod: ... connect: connection refused | Connections to the Kubernetes API server were interrupted when the log was generated. | If no DNS failures occurred during that period, this is not the root cause. Otherwise, test CoreDNS network connectivity. See Test network connectivity of CoreDNS. |
| [ERROR] plugin/errors: 2 www.aliyun.com. A: read udp ...->100.100.2.136:53: i/o timeout | CoreDNS could not reach the upstream DNS servers. | Test connectivity from the CoreDNS pod to 100.100.2.136 and 100.100.2.138. |

DNS resolutions fail because the client is overloaded

When the ECS instance running the pod is fully loaded, UDP packets can be dropped before they reach CoreDNS. Look for an abnormal network interface controller (NIC) retransmission rate and high CPU utilization on the instance in your monitoring data.

Two options:

  • Reduce the load on the ECS instance, for example by rescheduling pods or scaling out the node pool.

  • Deploy NodeLocal DNSCache so that most queries are answered locally. See Configure NodeLocal DNSCache.

Conntrack table is full

When the Linux kernel's conntrack table is full, new UDP and TCP connections are dropped. This typically causes DNS failures during peak hours that recover during off-peak times. To confirm, run dmesg -H on the affected node and look for the keyword conntrack full during the failure window.

Increase the maximum number of entries in the conntrack table. See How do I increase the maximum number of tracked connections in the conntrack table of the Linux kernel?
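On the node, the check and the tuning look roughly like this; the value below is illustrative and should be sized to your workload and memory:

```shell
# Compare current usage against the limit
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Raise the limit (illustrative value; not persistent across reboots)
sysctl -w net.netfilter.nf_conntrack_max=1048576
```

To persist the setting, also write it to /etc/sysctl.conf or a file under /etc/sysctl.d/.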

autopath plugin does not work as normal

A known defect in CoreDNS's autopath plugin causes occasional resolution failures or wrong IP addresses for external domain names. Internal domain names continue to resolve correctly. The issue becomes more visible in clusters that create containers at a high frequency.

Disable the autopath plugin:

  1. Run kubectl -n kube-system edit configmap coredns to open the CoreDNS ConfigMap.

  2. Delete the autopath @kubernetes line. Save and exit.

  3. Verify the new configuration loaded by checking the CoreDNS log for the keyword reload.

DNS resolutions fail due to concurrent A and AAAA record queries

Some Linux distributions send A and AAAA record queries simultaneously over the same port. This can trigger a conntrack table conflict that drops UDP packets.

Symptoms: intermittent CoreDNS resolution failures, with packet capture or DNS query logs showing simultaneous A and AAAA queries from the same source port.

Fixes depend on your image base:

  • CentOS or Ubuntu: Add options timeout:2 attempts:3 rotate single-request-reopen to the DNS resolver configuration.

  • Alpine Linux: Replace the Alpine-based image with one based on another OS. See Alpine caveats.

  • PHP with cURL: Add CURL_IPRESOLVE_V4 to force IPv4-only resolution. See cURL functions.

  • All environments: Deploy NodeLocal DNSCache, which mitigates the race condition. See Configure NodeLocal DNSCache.
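For the CentOS or Ubuntu option above, the resolver options can be injected through the pod spec's dnsConfig instead of being edited inside the image; a sketch of the relevant fragment:

```yaml
spec:
  dnsConfig:
    options:
    - name: timeout
      value: "2"
    - name: attempts
      value: "3"
    - name: rotate
    - name: single-request-reopen
```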

DNS resolutions fail due to IPVS errors

If kube-proxy runs in IPVS mode and the cluster nodes run CentOS or Alibaba Cloud Linux 2 with a kernel earlier than 4.19.91-25.1.al7.x86_64, removing IPVS UDP backend pods causes source port conflicts that drop UDP packets. This appears as DNS failures lasting approximately 5 minutes during node scaling, node shutdown, or CoreDNS scaling events.

Two fixes:

  • Update the node kernel to 4.19.91-25.1.al7.x86_64 or later.

  • Deploy NodeLocal DNSCache, which sends queries to CoreDNS over TCP and avoids the IPVS UDP defect. See Configure NodeLocal DNSCache.

NodeLocal DNSCache does not work

All DNS queries go to CoreDNS instead of NodeLocal DNSCache when either of these conditions applies:

  • dnsConfig was not injected into the application pods, so they still point to the kube-dns Service IP.

  • The pods use an Alpine Linux base image, which sends DNS queries to all nameservers concurrently, including CoreDNS pods directly.

To fix the first case, enable automatic dnsConfig injection. See Configure NodeLocal DNSCache. For Alpine-based images, replace them with images built on another OS. See Alpine caveats.

Alibaba Cloud DNS PrivateZone names cannot be resolved

Alibaba Cloud DNS PrivateZone does not support TCP; it requires UDP. Because NodeLocal DNSCache forwards queries over TCP by default, clusters that use it can see the following symptoms: domain names added to Alibaba Cloud DNS PrivateZone cannot be resolved, endpoints of Alibaba Cloud service APIs that contain vpc-proxy cannot be resolved, or domain names resolve to wrong IP addresses.

Add prefer_udp to the CoreDNS configuration to force UDP for upstream queries. See Configure CoreDNS.
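prefer_udp is an option of the CoreDNS forward plugin. A sketch of where it sits in the Corefile (your plugin list may differ; this is not a complete configuration):

```
.:53 {
    errors
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf {
        # Prefer UDP when contacting upstream servers
        prefer_udp
    }
    cache 30
}
```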

What's next