Fix DNS Resolution Failures in ACK - Container Service for Kubernetes

DNS resolution failures in Kubernetes are tricky to diagnose because the resolution pipeline spans multiple components — CoreDNS, kube-proxy, the Container Network Interface (CNI), and upstream DNS servers. This guide walks you through a structured process to isolate the root cause and apply the right fix.

Before diving into diagnostics, review Best practices for DNS services to reduce the risk of recurring DNS failures.

Key concepts

Term	Definition
Internal domain name	Resolved by CoreDNS from its local cache. Always ends with `.cluster.local`.
External domain name	Resolved by an upstream DNS server (Alibaba Cloud DNS, Alibaba Cloud DNS PrivateZone, or a third-party provider). CoreDNS only forwards these queries.
Application pod	Any pod that is not a system component pod in the Kubernetes cluster.
NodeLocal DNSCache	A local DNS cache that intercepts pod DNS queries before they reach CoreDNS. Queries that NodeLocal DNSCache cannot resolve are forwarded to the kube-dns Service.

Identify the error

Common error messages

Match your error message to identify the failure category:

Client	Error message	Failure category
`ping`	`ping: xxx.yyy.zzz: Name or service not known`	Domain not found, or DNS server unreachable (latency > 5s suggests unreachable)
`curl`	`curl: (6) Could not resolve host: xxx.yyy.zzz`	Domain not found, or DNS server unreachable
PHP HTTP client	`php_network_getaddresses: getaddrinfo failed: Name or service not known in xxx.php on line yyy`	Domain not found, or DNS server unreachable
Go HTTP client	`dial tcp: lookup xxx.yyy.zzz on 100.100.2.136:53: no such host`	Domain does not exist
`dig`	`;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: xxxxx`	Domain does not exist
Go HTTP client	`dial tcp: lookup xxx.yyy.zzz on 100.100.2.139:53: read udp 192.168.0.100:42922->100.100.2.139:53: i/o timeout`	DNS server unreachable
`dig`	`;; connection timed out; no servers could be reached`	DNS server unreachable

Set up a debug pod

Before running any diagnostic commands, establish a clean observation environment. This eliminates ambiguity between application bugs and DNS bugs.

kubectl run -it --rm debug --image=nicolaka/netshoot -- dig <Domain name>

You can also run nslookup from the same image:

kubectl run -it --rm debug --image=nicolaka/netshoot -- nslookup <Domain name>

Diagnostic procedure

The following diagram shows the overall troubleshooting flow for CoreDNS and NodeLocal DNSCache failures.

Step 1: Check the DNS configuration of the application pod

# Get the pod's YAML to check its dnsPolicy field
kubectl get pod <pod-name> -o yaml

# If dnsPolicy looks correct, inspect the in-container DNS config
kubectl exec -it <pod-name> -- cat /etc/resolv.conf

Check whether nameserver points to CoreDNS or NodeLocal DNSCache:

CoreDNS (kube-dns Service IP, e.g., 172.21.0.10): continue to Step 2.
NodeLocal DNSCache (169.254.20.10): see NodeLocal DNSCache issues.
Neither (no cluster DNS configured): the pod may be running on an overloaded node, or the conntrack table may be full. See Client overload and Conntrack table full.
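
The nameserver check above can be sketched in Go as follows. This is an illustrative helper rather than an official tool: `dnsTarget` and `nameservers` are hypothetical names, and the kube-dns Service IP `172.21.0.10` is the example value used in this guide, so replace it with the value from your cluster.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// nodeLocalIP is the NodeLocal DNSCache address used in this guide.
const nodeLocalIP = "169.254.20.10"

// nameservers returns the nameserver entries of a resolv.conf document, in order.
func nameservers(conf string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(conf))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "nameserver" {
			out = append(out, fields[1])
		}
	}
	return out
}

// dnsTarget classifies the first nameserver entry.
// kubeDNSIP is the kube-dns Service IP of the cluster (an assumption you must supply).
func dnsTarget(conf, kubeDNSIP string) string {
	ns := nameservers(conf)
	switch {
	case len(ns) == 0:
		return "none"
	case ns[0] == nodeLocalIP:
		return "NodeLocal DNSCache"
	case ns[0] == kubeDNSIP:
		return "CoreDNS"
	default:
		return "other"
	}
}

func main() {
	data, err := os.ReadFile("/etc/resolv.conf")
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	fmt.Println(dnsTarget(string(data), "172.21.0.10"))
}
```

Run it inside the application container, or paste the output of the `cat /etc/resolv.conf` command into `dnsTarget` on your workstation.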

DNS policy reference

The following table describes the available dnsPolicy values and when to use each:

`dnsPolicy` value	Behavior
`ClusterFirst` (default)	Uses the kube-dns Service IP as the DNS server. For host-network pods, behaves the same as `Default`.
`Default`	Uses the DNS servers from the ECS instance's `/etc/resolv.conf`. Use this only if the pod does not need cluster-internal service discovery.
`ClusterFirstWithHostNet`	Same as `ClusterFirst` but for host-network pods.
`None`	Lets you specify custom DNS servers and options in `dnsConfig`. Used when NodeLocal DNSCache injects its configuration automatically.

A pod configured for NodeLocal DNSCache typically looks like this:

apiVersion: v1
kind: Pod
metadata:
  name: <pod-name>
  namespace: <pod-namespace>
spec:
  containers:
  - image: <container-image>
    name: <container-name>
  dnsPolicy: None
  dnsConfig:
    nameservers:
    - 169.254.20.10
    - 172.21.0.10
    options:
    - name: ndots
      value: "3"
    - name: timeout
      value: "1"
    - name: attempts
      value: "2"
    searches:
    - default.svc.cluster.local
    - svc.cluster.local
    - cluster.local
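
The `ndots` option and the `searches` list determine which names are actually sent to the DNS server. The sketch below lists the candidate names in the order a glibc-style resolver would try them. It is a simplified model (other resolvers such as musl behave differently), and the `candidates` helper is a hypothetical name introduced for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// candidates lists the fully qualified names a glibc-style resolver tries, in order.
func candidates(name string, ndots int, searches []string) []string {
	if strings.HasSuffix(name, ".") {
		return []string{name} // already absolute: the search list is skipped
	}
	absolute := name + "."
	var expanded []string
	for _, s := range searches {
		expanded = append(expanded, name+"."+s+".")
	}
	// With at least ndots dots the name is tried as-is first, otherwise last.
	if strings.Count(name, ".") >= ndots {
		return append([]string{absolute}, expanded...)
	}
	return append(expanded, absolute)
}

func main() {
	searches := []string{"default.svc.cluster.local", "svc.cluster.local", "cluster.local"}
	for _, c := range candidates("www.aliyun.com", 3, searches) {
		fmt.Println(c)
	}
}
```

Because `www.aliyun.com` has only two dots, the three suffixed names are queried before the name itself, and each of those queries typically returns NXDOMAIN. Such NXDOMAIN entries in the query log are expected and do not indicate a failure on their own.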

Step 2: Check CoreDNS pod status

kubectl -n kube-system get pod -o wide -l k8s-app=kube-dns

Expected output (all pods in Running state):

NAME                      READY   STATUS    RESTARTS   AGE   IP            NODE
coredns-xxxxxxxxx-xxxxx   1/1     Running   0          25h   172.20.6.53   cn-hangzhou.192.168.0.198

Check resource usage:

kubectl -n kube-system top pod -l k8s-app=kube-dns

Expected output:

NAME                      CPU(cores)   MEMORY(bytes)
coredns-xxxxxxxxx-xxxxx   3m           18Mi

If a pod is not in Running state, describe it to find the cause:

kubectl -n kube-system describe pod <coredns-pod-name>

See CoreDNS pods not running as expected for common error logs and fixes.

Step 3: Check CoreDNS operational logs

kubectl -n kube-system logs -f --tail=500 --timestamps <coredns-pod-name>

Flag	Effect
`-f`	Streams live log output
`--tail=500`	Shows the last 500 lines
`--timestamps`	Adds a timestamp to each log line

Look for NXDOMAIN, SERVFAIL, or REFUSED responses in the logs. If any of these appear for external domain names, the upstream DNS server is returning errors. See External domain name cannot be resolved.

Step 4: Check whether the issue is reproducible

Always fails: proceed with Diagnose the DNS query log and Diagnose network connectivity between application pods and CoreDNS.
Intermittent: capture packets as described in Capture packets.

Diagnose the DNS query log

CoreDNS generates a query log entry for each DNS request only when the log plugin is enabled. To enable it, see Configure CoreDNS. After you save the configuration, CoreDNS reloads automatically. Confirm the reload by checking the logs for the reload keyword.

To view the DNS query logs, run the same command that you use to query operational logs. For more information, see Step 3: Check CoreDNS operational logs.

DNS query log format

[INFO] 172.20.2.25:44525 - 36259 "A IN redis-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd 110 0.000116946s

DNS response codes

For the full specification, see RFC 1035.

Response code	Meaning	Common cause
`NOERROR`	Resolved successfully	—
`NXDOMAIN`	Domain does not exist in the upstream DNS server	Pod requests append search domain suffixes; a suffixed name that doesn't exist triggers this code
`SERVFAIL`	Error in the upstream DNS server	Upstream DNS is unreachable or misconfigured
`REFUSED`	Query rejected by the upstream DNS server	Upstream DNS server (in the CoreDNS config or node's `/etc/resolv.conf`) cannot resolve the domain

If the log shows NXDOMAIN, SERVFAIL, or REFUSED for external domain names, the upstream DNS server is the root cause. By default, CoreDNS uses VPC DNS servers 100.100.2.136 and 100.100.2.138 as upstream resolvers. Submit a ticket to the ECS team and include:

Field	Description	Example
Domain name	The external domain name from the log	`www.aliyun.com`
DNS response code	The response code in the log	`NXDOMAIN`
Time	Log entry timestamp (seconds precision)	`2022-12-22 20:00:03`
ECS instance IDs	IDs of the ECS instances running CoreDNS pods	`i-xxxxx i-yyyyy`

Diagnose network connectivity of the CoreDNS pod

Use either the ACK console or the CLI.

ACK console

Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the cluster name. In the left-side navigation pane, choose Inspections and Diagnostics > Diagnostics.
On the Diagnosis page, click Network diagnosis.
In the Network panel, configure the following parameters:
- Source address: IP address of the CoreDNS pod
- Destination address: IP address of the upstream DNS server (100.100.2.136 or 100.100.2.138)
- Destination port: 53
- Protocol: udp
Read the warning, select I know and agree, then click Create diagnosis.
On the Diagnosis result page, the Packet paths section shows all nodes that were diagnosed.

CLI

Log on to the node running the CoreDNS pod.
Get the CoreDNS process ID:
```
ps aux | grep coredns
```
Enter the CoreDNS network namespace:
```
nsenter -t <pid> -n -- <related commands>
```
Replace <pid> with the process ID from the previous step.

Test connectivity:

# Test connectivity to the Kubernetes API server
telnet <apiserver_clusterip> 443

# Test connectivity to upstream DNS servers
dig <domain> @100.100.2.136
dig <domain> @100.100.2.138

Issue	Cause	Action
CoreDNS cannot connect to the Kubernetes API server	API server errors, node overload, or kube-proxy issues	Submit a ticket
CoreDNS cannot connect to upstream DNS servers	Node overload, wrong CoreDNS config, or Express Connect routing issues	Submit a ticket

Diagnose network connectivity between application pods and CoreDNS

ACK console

Log on to the ACK console. In the left-side navigation pane, click Clusters.
On the Clusters page, click the cluster name. In the left-side navigation pane, choose Inspections and Diagnostics > Diagnostics.
On the Diagnosis page, click Network diagnosis.
In the Network panel, configure the following parameters:
- Source address: IP address of the application pod
- Destination address: IP address of the CoreDNS pod, or the cluster IP of the kube-dns Service
- Destination port: 53
- Protocol: udp
Read the warning, select I know and agree, then click Create diagnosis.
On the Diagnosis result page, the Packet paths section shows all nodes that were diagnosed.

CLI

Connect to the application pod's network namespace using one of these methods:

Method 1 — kubectl exec:
```
kubectl exec -it <pod-name> -- bash
```

Method 2 — nsenter (when kubectl exec is unavailable):

ps aux | grep <application-process-name>
nsenter -t <pid> -n bash

Method 3 — for pods that restart frequently:

Do not add a space between -n and <netns-path>.

docker ps -a | grep <application-container-name>
docker inspect <sandboxed-container-id> | grep netns
nsenter -n<netns-path> bash

From inside the pod's network namespace, test connectivity:

# Test connectivity to the kube-dns Service
dig <domain> @<kube_dns_svc_ip>

# Test ICMP reachability of the CoreDNS pod
ping <coredns_pod_ip>

# Test DNS resolution via the CoreDNS pod directly
dig <domain> @<coredns_pod_ip>

Issue	Cause	Action
Cannot connect to the kube-dns Service	Node overload, kube-proxy issues, or UDP port 53 blocked	Check whether security group rules allow UDP port 53. If they do, submit a ticket.
Cannot ping the CoreDNS pod	Container network errors or ICMP blocked by security group rules	Check whether ICMP is allowed. If it is, submit a ticket.
`dig` to CoreDNS pod fails	Node overload or UDP port 53 blocked by security group rules	Check whether security group rules allow UDP port 53. If they do, submit a ticket.

Capture packets

If you cannot identify the issue through logs and connectivity tests, capture packets to narrow down where packets are being dropped.

Log on to the nodes running the application pods and the CoreDNS pod.
Capture packets on each ECS instance:
Packet capture does not interrupt service. It causes a minor increase in CPU utilization and disk I/O. The command rotates files and generates at most 200 .pcap files, each up to 20 MB.
```
tcpdump -i any port 53 -C 20 -W 200 -w /tmp/client_dns.pcap
```
Analyze the captured packets from the time period when the error occurred.

Troubleshooting reference

Use this table to jump directly to the relevant issue based on the symptom you observe.

Symptom	Relevant issues
Resolution errors on both internal and external domain names	Security group rules blocking DNS, Container network errors, CoreDNS overloaded, Unbalanced DNS load, CoreDNS pods not running, Conntrack table full, Autopath plugin, Concurrent A/AAAA queries, IPVS errors
Resolution errors only on external domain names	External domain name cannot be resolved
Resolution errors only on PrivateZone or vpc-proxy domain names	PrivateZone domain names cannot be resolved
Resolution errors only on headless Service domain names	Headless Service DNS failure, After CoreDNS update, StatefulSet pods
Errors every time	CoreDNS pods not running, PrivateZone, Security group rules, External domain, Headless Service, Container network errors
Errors only during peak hours	CoreDNS overloaded, Unbalanced DNS load
High-frequency errors	IPVS errors, NodeLocal DNSCache not working
Low-frequency or intermittent errors	Autopath plugin, Concurrent A/AAAA queries
Errors during node scaling or CoreDNS scaling	IPVS errors

Other modules in the DNS resolution pipeline

In addition to CoreDNS and NodeLocal DNSCache, the following components can cause DNS resolution failures.

Component	How it can cause failures
DNS resolver (Go, glibc, musl)	Language-level or library-level DNS implementation bugs can cause failures in rare cases
`/etc/resolv.conf`	Misconfigured DNS server IPs or search domains in the container
kube-proxy	After a CoreDNS update, if kube-proxy rules are not updated, CoreDNS becomes unreachable
Upstream DNS servers	CoreDNS forwards external domain queries to upstream servers (such as VPC Private DNS). Misconfigurations on the upstream server cause forwarded queries to fail.

FAQ

What to do if an external domain name cannot be resolved

Internal domain names resolve correctly, but external domain names fail.

Check the CoreDNS query log for NXDOMAIN, SERVFAIL, or REFUSED on the failing domain name. These codes indicate that the upstream DNS server (by default, 100.100.2.136 and 100.100.2.138) is returning an error. Submit a ticket to the ECS team with the domain name, response code, timestamp, and ECS instance IDs of the nodes running CoreDNS.

What to do if headless Service domain names cannot be resolved

CoreDNS cannot resolve headless Service domain names.

This typically happens with CoreDNS versions earlier than 1.7.0, which can exit unexpectedly during Kubernetes API server network jitters, leaving headless Service DNS records stale. Update CoreDNS to 1.7.0 or later. See [Component Updates] Update CoreDNS.

If dig shows the tc flag in the response, the headless Service has too many backing IP addresses and the DNS response exceeds the UDP packet size limit. Configure the client to send DNS queries over TCP:

For glibc-based resolvers, add use-vc to dnsConfig:
```
dnsConfig:
  options:
  - name: use-vc
```
This maps to the options directive in /etc/resolv.conf. For details, see the Linux man page for resolv.conf.

For Go applications, configure the resolver to use TCP:

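package main

import (
  "context"
  "fmt"
  "net"
)

func main() {
  // PreferGo selects Go's built-in resolver so that the custom Dial function is used.
  resolver := &net.Resolver{
    PreferGo: true,
    Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
      // Ignore the requested network (normally "udp") and always dial TCP.
      // DialContext honors the lookup's context for timeouts and cancellation.
      var d net.Dialer
      return d.DialContext(ctx, "tcp", address)
    },
  }

  addrs, err := resolver.LookupHost(context.Background(), "example.com")
  if err != nil {
    fmt.Println("Error:", err)
    return
  }
  fmt.Println("Addresses:", addrs)
}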
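
To see why a headless Service with many backing IP addresses sets the tc flag, the response size can be estimated. The sketch below is a rough model under loudly stated assumptions: only A records are returned, answer names are compressed to 2-byte pointers, the client does not advertise a larger EDNS0 buffer (so the classic 512-byte UDP limit applies), and there are no authority or additional records. The helper names and the Service name are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	headerLen  = 12 // fixed DNS header
	aRecordLen = 16 // name pointer (2) + type, class, TTL, rdlength (10) + IPv4 address (4)
)

// wireNameLen returns the encoded length of a domain name in the question section.
func wireNameLen(name string) int {
	name = strings.TrimSuffix(name, ".")
	if name == "" {
		return 1 // root name
	}
	return len(name) + 2 // one length byte per label plus the terminating zero byte
}

// responseSize estimates the size of a response carrying the given number of A records.
func responseSize(name string, records int) int {
	question := wireNameLen(name) + 4 // name + QTYPE + QCLASS
	return headerLen + question + records*aRecordLen
}

// maxARecords estimates how many A records fit within limit bytes.
func maxARecords(name string, limit int) int {
	fixed := responseSize(name, 0)
	if fixed > limit {
		return 0
	}
	return (limit - fixed) / aRecordLen
}

func main() {
	name := "my-svc.default.svc.cluster.local" // hypothetical headless Service
	fmt.Println("A records that fit in 512 bytes:", maxARecords(name, 512))
}
```

Under these assumptions, a few dozen backing pods are enough to exceed the limit, at which point the server truncates the response and the client must retry over TCP.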

What to do if headless Service domain names cannot be resolved after updating CoreDNS

After upgrading to Kubernetes 1.20 or later with CoreDNS 1.8.4 or later, open-source components such as etcd, Nacos, and Kafka fail to discover services.

CoreDNS 1.8.4 switched to the EndpointSlice API. Some open-source components rely on the service.alpha.kubernetes.io/tolerate-unready-endpoints annotation from the older Endpoints API to publish not-ready endpoints during initialization. This annotation is not supported in EndpointSlice. It was replaced by the publishNotReadyAddresses field in the Service spec.

Check whether the YAML or Helm chart for the affected component uses service.alpha.kubernetes.io/tolerate-unready-endpoints. If it does, upgrade the component or consult the component's community.

What to do if StatefulSet pod domain names cannot be resolved

Domain names of StatefulSet pods (e.g., pod.headless-svc.ns.svc.cluster.local) cannot be resolved, but the headless Service's own domain name resolves correctly.

The StatefulSet pod YAML template has serviceName set to the wrong value or left blank. Set serviceName in the StatefulSet spec to the exact name of the headless Service that exposes the StatefulSet.
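
The per-pod domain name is derived from serviceName, which is why a wrong or empty value makes the name unresolvable. As a small illustration (the `podFQDN` helper and the sample names are hypothetical, and the cluster domain is assumed to be `cluster.local`):

```go
package main

import "fmt"

// podFQDN builds the per-pod DNS name of a StatefulSet pod.
// serviceName must equal the name of the headless Service that governs the StatefulSet.
func podFQDN(pod, serviceName, namespace, clusterDomain string) string {
	return fmt.Sprintf("%s.%s.%s.svc.%s", pod, serviceName, namespace, clusterDomain)
}

func main() {
	// prints "web-0.nginx.default.svc.cluster.local"
	fmt.Println(podFQDN("web-0", "nginx", "default", "cluster.local"))
}
```

If the name that the application queries differs from this pattern in the Service component, compare it with the serviceName field of the StatefulSet.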

What to do if security group rules block DNS queries

DNS resolution fails on some or all nodes, persistently.

The security group rules or network access control lists (ACLs) on the ECS instances are blocking UDP port 53. Modify the security group rules or network ACLs to allow UDP port 53.

What to do if container network connectivity errors occur

DNS resolution fails on some or all nodes, persistently.

Container network connectivity errors or other issues are blocking UDP port 53. Use the network diagnostics feature to diagnose network connectivity between the application pods and the CoreDNS pod. If the issue persists, submit a ticket.

What to do if CoreDNS pods are overloaded

DNS resolution latency is high, or failures occur persistently or intermittently. CoreDNS pod CPU or memory usage is near the upper limit.

The number of CoreDNS replicas is too low to handle the DNS query volume. Take one or both of these steps:

Enable NodeLocal DNSCache to reduce the load on CoreDNS. See Configure NodeLocal DNSCache.
Scale out CoreDNS pods so that peak CPU utilization per pod stays below the node's available CPU headroom.

What to do if DNS query load is unbalanced

DNS resolution latency is high or failures are intermittent on some (not all) nodes. CoreDNS pod CPU utilization differs significantly between pods. Fewer than two CoreDNS replicas are running, or multiple pods are scheduled on the same node.

DNS queries are unevenly distributed due to imbalanced pod scheduling or SessionAffinity settings on the kube-dns Service. To fix this:

Scale out CoreDNS pods and schedule them on separate nodes.
Remove the SessionAffinity setting from the kube-dns Service. See Configure ACK to automatically update CoreDNS.

What to do if CoreDNS pods are not running as expected

DNS resolution latency is high, failures occur on some nodes, CoreDNS pods are not in Running state, the restart count keeps increasing, or the CoreDNS log shows errors.

Misconfigured YAML or a misconfigured CoreDNS ConfigMap is preventing pods from running. Check pod status and logs.

Common error messages:

Error	Cause	Fix
`/etc/coredns/Corefile:4 - Error during parsing: Unknown directive 'ready'`	The `ready` plugin in the Corefile is not supported by the current CoreDNS version	Delete `ready` (or the unsupported plugin) from the CoreDNS ConfigMap in `kube-system`
`Failed to watch *v1.Pod: Get "https://192.168.0.1:443/api/v1/": dial tcp 192.168.0.1:443: connect: connection refused`	API server connection interrupted at log time	If DNS resolution was not affected during this period, this is not the root cause. Otherwise, diagnose CoreDNS network connectivity.
`[ERROR] plugin/errors: 2 www.aliyun.com. A: read udp 172.20.6.53:58814->100.100.2.136:53: i/o timeout`	Upstream DNS server unreachable at log time	Check CoreDNS network connectivity to upstream DNS servers

What to do if DNS fails because the client is overloaded

DNS errors occur intermittently or during peak hours. ECS instance monitoring shows abnormal NIC retransmission rates or high CPU utilization.

The ECS instance hosting the pod that sends DNS queries is fully loaded, causing UDP packet loss. Submit a ticket. In parallel, enable NodeLocal DNSCache to reduce per-node DNS load. See Configure NodeLocal DNSCache.

What to do if the conntrack table is full

DNS resolution fails frequently on some or all nodes during peak hours but works during off-peak hours. Running dmesg -H on the instance shows conntrack full in the logs during the failure window.

The Linux kernel conntrack table is full, so UDP and TCP packets cannot be processed. Increase the maximum number of entries in the conntrack table. See How do I increase the maximum number of tracked connections in the conntrack table of the Linux kernel?
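
Before the table fills up, you can watch how close it is to the limit. The sketch below reads the current and maximum entry counts from the standard `/proc/sys/net/netfilter` files on the node. It assumes the nf_conntrack module is loaded, and `readInt` and `usagePercent` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readInt reads a single integer from a /proc file.
func readInt(path string) (int, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(strings.TrimSpace(string(data)))
}

// usagePercent returns how full the conntrack table is, as a percentage.
func usagePercent(count, limit int) float64 {
	if limit <= 0 {
		return 0
	}
	return 100 * float64(count) / float64(limit)
}

func main() {
	count, err := readInt("/proc/sys/net/netfilter/nf_conntrack_count")
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	limit, err := readInt("/proc/sys/net/netfilter/nf_conntrack_max")
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Printf("conntrack: %d/%d entries (%.1f%%)\n", count, limit, usagePercent(count, limit))
}
```

Run it on the node during peak hours. A value that approaches 100% suggests that the maximum needs to be raised.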

What to do if the autopath plugin does not work as expected

External domain names occasionally fail to resolve or resolve to a wrong IP address, while internal domain names resolve correctly. When the cluster creates containers at a high rate, internal domain names are occasionally resolved to wrong IP addresses.

The autopath plugin in CoreDNS has a known defect. Disable it:

Edit the CoreDNS ConfigMap:

kubectl -n kube-system edit configmap coredns

Delete the autopath @kubernetes line, save the file, and exit.
Verify that CoreDNS loaded the new configuration by checking the logs for the reload keyword.

What to do if DNS fails due to concurrent A and AAAA record queries

CoreDNS DNS resolution fails intermittently. Packet captures or DNS query logs show A record and AAAA record queries sent simultaneously over the same source port.

Concurrent A and AAAA record queries cause conntrack table errors, which lead to UDP packet loss. On ARM machines, libc versions earlier than 2.33 have this issue (see GLIBC#26600).

Apply one or more of these fixes based on your environment:

NodeLocal DNSCache: reduces the impact of packet loss. See Configure NodeLocal DNSCache.
CentOS or Ubuntu with libc: update libc to 2.33 or later, or add resolver options: options timeout:2 attempts:3 rotate single-request-reopen.
PHP cURL: set the CURLOPT_IPRESOLVE option to CURL_IPRESOLVE_V4 so that domain names are resolved only to IPv4 addresses. See curl_setopt.
Alpine Linux: replace with a non-Alpine base image. See Alpine Linux caveats.

What to do if DNS fails due to IPVS errors

DNS resolution fails intermittently when nodes are added or removed from the cluster, when nodes shut down, or when CoreDNS is scaled in. Failures typically last about 5 minutes.

kube-proxy is running in IPVS mode. When UDP backend pods are removed from CentOS or Alibaba Cloud Linux 2 nodes with kernel versions earlier than 4.19.91-25.1.al7.x86_64, source port conflicts cause UDP packets to be dropped.

To fix this:

Enable NodeLocal DNSCache to bypass IPVS for DNS traffic. See Configure NodeLocal DNSCache.
Shorten the UDP session timeout in IPVS mode. See Change the UDP timeout period in IPVS mode.

NodeLocal DNSCache issues

What to do if NodeLocal DNSCache is not working

All DNS queries are reaching CoreDNS instead of NodeLocal DNSCache.

One of two things is happening:

DNSConfig injection is not configured, so pods still use the kube-dns Service IP.
The pod uses an Alpine Linux base image, which sends DNS queries concurrently to all configured nameservers (including both NodeLocal DNSCache and CoreDNS).

To fix the first cause, configure automatic DNSConfig injection. To fix the second, replace the Alpine-based image with a non-Alpine image. See Configure NodeLocal DNSCache and Alpine Linux caveats.

What to do if Alibaba Cloud DNS PrivateZone domain names cannot be resolved

When NodeLocal DNSCache is in use, domain names added to Alibaba Cloud DNS PrivateZone cannot be resolved, Alibaba Cloud service API endpoints containing vpc-proxy fail to resolve, or domain names resolve to wrong IP addresses.

Alibaba Cloud DNS PrivateZone does not support TCP. NodeLocal DNSCache must forward these queries over UDP. Add the prefer_udp setting to the CoreDNS configuration. See Configure CoreDNS.

If none of the above steps resolve the issue, submit a ticket.