This topic provides answers to some frequently asked questions about Domain Name System (DNS) resolution errors in Container Service for Kubernetes (ACK) clusters.

What do I do if external domain names cannot be resolved in my cluster?

Cause

The upstream DNS server returns an error code, which indicates that a domain name resolution error occurred.

Symptom

Internal domain names of the cluster can be resolved, but external domain names cannot be resolved.

Solution

Check the DNS query log of CoreDNS.

Example of DNS query record
After CoreDNS responds to the DNS query of a client, CoreDNS generates a log entry to record the DNS query:
# If the response code is NOERROR, it indicates that the domain name is resolved without errors. 
[INFO] 172.20.2.25:44525 - 36259 "A IN redis-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd 110 0.000116946s
Common DNS response codes
For more information about DNS response codes, see Specification.
| Response code | Definition | Cause |
| --- | --- | --- |
| NXDOMAIN | The domain name does not exist on the upstream DNS server. | Domain names in pod requests are appended with the search domain suffixes. If a suffixed domain name does not exist on the DNS server, this response code is returned. If you find this response code in the DNS query log, a domain name resolution error occurred. |
| SERVFAIL | An error occurs on the upstream DNS server. | For example, connections to the upstream DNS server cannot be established. |
| REFUSED | The DNS query is rejected by the upstream DNS server. | The upstream DNS server that is specified in the CoreDNS configuration or the /etc/resolv.conf file of the node cannot resolve the domain name. You can check the configuration file of CoreDNS. |
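
When a query log is long, it can help to extract the response code from each entry and count the failures. The following is a minimal sketch, assuming that every entry follows the layout of the sample log entry above. The names LOG_PATTERN, parse_query_log, and count_failures are illustrative and are not part of CoreDNS.

```python
import re
from collections import Counter

# Pattern modeled on the sample log entry above (assumption: other
# log layouts are not handled and are silently skipped).
LOG_PATTERN = re.compile(
    r'^\[INFO\] (?P<client>\S+) - (?P<id>\d+) '
    r'"(?P<qtype>\S+) (?P<qclass>\S+) (?P<name>\S+) (?P<proto>\S+) [^"]*" '
    r'(?P<rcode>[A-Z]+) '
)


def parse_query_log(line):
    """Return (query type, domain name, response code), or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return match.group("qtype"), match.group("name"), match.group("rcode")


def count_failures(lines):
    """Count entries whose response code is anything other than NOERROR."""
    counts = Counter()
    for line in lines:
        parsed = parse_query_log(line)
        if parsed is not None and parsed[2] != "NOERROR":
            counts[parsed[2]] += 1
    return counts
```

Passing the lines of the CoreDNS pod log to count_failures gives a quick view of whether NXDOMAIN, SERVFAIL, or REFUSED dominates.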

What do I do if domain names of headless Services cannot be resolved?

Cause

In CoreDNS versions earlier than 1.7.0, CoreDNS may exit unexpectedly if network jitter affects its connection to the Kubernetes API server of the cluster. While CoreDNS is down, the domain names of headless Services are not updated.

Symptom

CoreDNS cannot resolve domain names of headless Services.

Solution

Update CoreDNS to 1.7.0 or later. For more information, see [Component Upgrades] Upgrade CoreDNS.

What do I do if domain names of StatefulSet pods cannot be resolved?

Cause

If a StatefulSet is exposed by using a headless Service, the serviceName field in the StatefulSet YAML template must be set to the name of the headless Service. Otherwise, the domain names of the StatefulSet pods, such as pod.headless-svc.ns.svc.cluster.local, cannot be resolved. The domain name of the headless Service itself, such as headless-svc.ns.svc.cluster.local, can still be resolved.

Symptom

The domain names of StatefulSet pods cannot be resolved.

Solution

Set the serviceName field in the StatefulSet YAML template to the name of the headless Service that is used to expose the StatefulSet pods.
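
The relationship between the StatefulSet and its headless Service can be checked mechanically. The following is a rough sketch, assuming that the manifests are already loaded as Python dictionaries and that both objects are in the same namespace. The function names are hypothetical. The check relies on the spec.serviceName field of the StatefulSet (the ServiceName setting described above) and on headless Services having clusterIP set to None.

```python
def check_service_name(statefulset, services):
    """Return a list of problems with the StatefulSet's spec.serviceName (empty if none)."""
    name = statefulset.get("spec", {}).get("serviceName")
    if not name:
        return ["spec.serviceName is not set"]
    by_name = {svc["metadata"]["name"]: svc for svc in services}
    service = by_name.get(name)
    if service is None:
        return [f"Service {name!r} does not exist"]
    # A headless Service is declared with clusterIP: None.
    if service.get("spec", {}).get("clusterIP") != "None":
        return [f"Service {name!r} is not headless"]
    return []


def pod_domain(pod, statefulset, namespace, cluster_domain="cluster.local"):
    """Build the per-pod domain name, such as pod.headless-svc.ns.svc.cluster.local."""
    service = statefulset["spec"]["serviceName"]
    return f"{pod}.{service}.{namespace}.svc.{cluster_domain}"
```

An empty result from check_service_name means that the name pod_domain returns is the one the StatefulSet pod is expected to resolve under.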

What do I do if DNS queries are blocked by security group rules or the network access control lists (ACLs) that are associated with vSwitches?

Cause

The security group rules or network ACLs that control the network communications of the Elastic Compute Service (ECS) instance block UDP port 53.

Symptom

DNS resolution failures of CoreDNS persist on some or all nodes.

Solution

Modify the security group rules or network ACLs to open UDP port 53.
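
To confirm whether UDP port 53 is reachable from a node or pod, you can send a single DNS query and wait for any reply. The following is a minimal sketch that uses only the Python standard library. The server address that you pass in (for example, the IP address of the kube-dns Service) and the default query name are assumptions that you must adapt to your cluster.

```python
import socket
import struct


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query for an A record in RFC 1035 wire format."""
    # Header: ID, flags (recursion desired), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # Root label terminator, then QTYPE=A (1) and QCLASS=IN (1).
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def udp53_reachable(server, name="kubernetes.default.svc.cluster.local", timeout=2.0):
    """Return True if a reply with the matching query ID arrives from server:53."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and unreachable-network errors
            return False
    return reply[:2] == query[:2]
```

A False result only shows that no reply arrived within the timeout. It does not tell you whether the security group, the network ACL, or the container network dropped the packet.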

What do I do if container network connectivity errors occur?

Cause

UDP port 53 is blocked due to container network connectivity errors or other causes.

Symptom

DNS resolution failures of CoreDNS persist on some or all nodes.

Solution

Diagnose the container network. For more information, see Use the cluster diagnosis feature to troubleshoot cluster issues.

What do I do if CoreDNS pods are overloaded?

Cause

The number of replicated pods that are configured for CoreDNS is insufficient to handle the DNS query load.

Symptom
  • The DNS resolution latency of CoreDNS is high, or DNS resolution failures of CoreDNS persist or occasionally occur on some or all nodes.
  • The status of CoreDNS pods shows that the CPU or memory utilization is about to reach the upper limit.
Solution
  • Use NodeLocal DNSCache to reduce the DNS query load of CoreDNS and improve DNS resolution efficiency. For more information, see Use NodeLocal DNSCache in an ACK cluster.
  • Scale out the number of CoreDNS pods to ensure that the peak CPU utilization of each pod is less than the amount of idle CPU resources of the node.

What do I do if the DNS query load is not balanced among CoreDNS pods?

Cause

The DNS query load is not balanced among CoreDNS pods due to imbalanced pod scheduling or improper SessionAffinity settings of the kube-dns Service.

Symptom
  • The DNS resolution latency of CoreDNS is high, or DNS resolution failures of CoreDNS persist or occasionally occur on some or all nodes.
  • The status of CoreDNS pods shows that the CPU utilization is different among the pods.
  • The number of replicated pods that are configured for CoreDNS is fewer than two, or multiple CoreDNS pods are deployed on the same node.
Solution
  • Scale out the number of CoreDNS pods and schedule the pods to different nodes.
  • You can delete the SessionAffinity parameter from the configuration of the kube-dns Service. For more information, see Configure the kube-dns Service.
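
To spot imbalanced scheduling, you can group CoreDNS pods by the node they run on. The following is a small sketch, assuming that you have already collected a mapping from pod name to node name (for example, from the output of kubectl get pods -o wide). The pod and node names in the usage note are hypothetical.

```python
from collections import Counter


def crowded_nodes(pod_to_node):
    """Return the sorted names of nodes that host more than one CoreDNS pod."""
    counts = Counter(pod_to_node.values())
    return sorted(node for node, count in counts.items() if count > 1)
```

For example, crowded_nodes({"coredns-a": "node-1", "coredns-b": "node-1"}) reports node-1, which suggests that the pods should be rescheduled to different nodes.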

What do I do if CoreDNS pods do not run as expected?

Cause

CoreDNS pods do not run as expected due to improper settings in the YAML file or the CoreDNS ConfigMap.

Symptom
  • The DNS resolution latency of CoreDNS is high, or DNS resolution failures of CoreDNS persist or occasionally occur on some or all nodes.
  • CoreDNS pods are not in the Running state or the number of pod restarts continuously increases.
  • The CoreDNS log data indicates that errors occurred.
Solution

Check the status and logs of CoreDNS pods.

CoreDNS errors and solutions
| Error | Cause | Solution |
| --- | --- | --- |
| /etc/coredns/Corefile:4 - Error during parsing: Unknown directive 'ready' | The configurations in the CoreDNS ConfigMap are incompatible with the current CoreDNS version. The Unknown directive content in the error record indicates that the current CoreDNS version does not support the ready plug-in that is specified in Corefile. | Delete the ready plug-in from the CoreDNS ConfigMap in the kube-system namespace. If other plug-ins appear in the error log, delete the plug-ins from the ConfigMap. |
| pkg/mod/k8s.io/client-go@v0.18.3/tools/cache/reflector.go:125: Failed to watch *v1.Pod: Get "https://192.168.0.1:443/api/v1/": dial tcp 192.168.0.1:443: connect: connection refused | Connections to the Kubernetes API server are interrupted during the period when the log was generated. | If no DNS resolution error occurs in this period, the error is not caused by network connectivity issues. Otherwise, check the network connectivity of CoreDNS pods. For more information, see Diagnose the network connectivity of the CoreDNS pod. |
| [ERROR] plugin/errors: 2 www.aliyun.com. A: read udp 172.20.6.53:58814->100.100.2.136:53: i/o timeout | Connections to the upstream DNS servers cannot be established during the period when the log was generated. | |

What do I do if DNS resolutions fail because the client is overloaded?

Cause

The ECS instance that hosts the pod that sends the DNS query to CoreDNS is fully loaded, which causes UDP packet loss.

Symptom

DNS resolution errors occur occasionally or during peak hours. The monitoring information about the ECS instance indicates an abnormal retransmission rate of the network interface controller (NIC) and an abnormal CPU utilization.

Solution

What do I do if the conntrack table is full?

Cause

The conntrack table of the Linux kernel is full. As a result, requests that are sent over UDP or TCP cannot be processed.

Symptom
  • CoreDNS frequently fails to resolve domain names on some or all nodes during peak hours, but resolves domain names as expected during off-peak hours.
  • The output of the dmesg -H command on the instance, for the period in which the resolution fails, contains the keyword conntrack full.
Solution

Increase the maximum number of entries in the conntrack table of the Linux kernel. For more information, see How do I increase the maximum number of tracked connections in the conntrack table of the Linux kernel?.
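
Before you raise the limit, it can be useful to see how close the table is to being full. The following is a minimal sketch, assuming a Linux node on which the nf_conntrack module is loaded, so that the two /proc files below exist. The 0.9 threshold is an arbitrary illustrative value, not a recommendation from this topic.

```python
from pathlib import Path

# Kernel counters for the current and maximum number of tracked connections.
COUNT_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def read_conntrack():
    """Read (current entries, maximum entries) from /proc on the node."""
    return int(COUNT_PATH.read_text()), int(MAX_PATH.read_text())


def conntrack_usage(count, maximum):
    """Return the fraction of the conntrack table that is in use."""
    return count / maximum


def is_near_full(count, maximum, threshold=0.9):
    """Return True if usage is at or above the threshold (assumed value: 0.9)."""
    return conntrack_usage(count, maximum) >= threshold
```

On a node, is_near_full(*read_conntrack()) returning True during peak hours is consistent with the symptom described above.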

What do I do if the autopath plug-in does not work as expected?

Cause

The autopath plug-in does not work as expected due to defects in CoreDNS.

Symptom
  • The external domain name occasionally fails to be resolved or is occasionally resolved to a wrong IP address. However, the internal domain name is resolved as normal.
  • When the cluster creates containers at a high frequency, the internal domain name is resolved to a wrong IP address.
Solution
Disable the autopath plug-in.
  1. Run the kubectl -n kube-system edit configmap coredns command to modify the CoreDNS ConfigMap.
  2. Delete autopath @kubernetes. Then, save the change and exit.
  3. Check the status and logs of CoreDNS pods. If the logs contain the keyword reload, the modification is successful.
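
The edit in step 2 amounts to dropping one line from the Corefile. The following is a small sketch of that transformation on the Corefile text, which may be handy if you script the change instead of editing interactively. SAMPLE_COREFILE is a hypothetical, shortened Corefile and does not reflect the default ACK configuration.

```python
# Hypothetical, shortened Corefile used only for illustration.
SAMPLE_COREFILE = """.:53 {
    errors
    autopath @kubernetes
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods verified
    }
    forward . /etc/resolv.conf
}
"""


def remove_autopath(corefile):
    """Return the Corefile text without any line whose first token is 'autopath'."""
    kept = [
        line for line in corefile.splitlines()
        if line.split()[:1] != ["autopath"]
    ]
    trailing = "\n" if corefile.endswith("\n") else ""
    return "\n".join(kept) + trailing
```

The function leaves every other line untouched, so the result can be written back to the ConfigMap as is.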

What do I do if DNS resolutions fail due to concurrent queries for A records and AAAA records?

Cause

Concurrent DNS queries for A records and AAAA records cause errors of the conntrack table of the Linux kernel, which results in UDP packet loss.

Symptom
  • DNS resolutions of CoreDNS occasionally fail.
  • The captured packets or the log of DNS queries to CoreDNS shows that queries for A records and AAAA records are initiated at the same time over the same port.
Solution
  • Use NodeLocal DNSCache to reduce the DNS query load of CoreDNS and improve DNS resolution efficiency. For more information, see Use NodeLocal DNSCache in an ACK cluster.
  • If the image that you use is based on CentOS or Ubuntu, add the options timeout:2 attempts:3 rotate single-request-reopen configuration.
  • If the image that you use is based on Alpine Linux, we recommend that you replace the image with an image that is based on another operating system. For more information, see Alpine.
  • A variety of resolution errors may occur when applications written in PHP send DNS queries by using short-lived connections. If you use PHP cURL, set the CURLOPT_IPRESOLVE option to CURL_IPRESOLVE_V4 so that domain names are resolved only to IPv4 addresses. For more information, see cURL functions.
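
One possible place to set the resolver options from the second item is the dnsConfig.options field of the pod spec, which Kubernetes writes into the /etc/resolv.conf file of the pod. This placement is an assumption, because the text above does not say where to add the options. The following sketch converts the options line into the name/value entries that dnsConfig expects. The function names are illustrative.

```python
def parse_resolver_options(text):
    """Turn an 'options ...' line into a list of dnsConfig option entries."""
    tokens = text.split()
    if tokens[:1] == ["options"]:
        tokens = tokens[1:]
    entries = []
    for token in tokens:
        name, separator, value = token.partition(":")
        entry = {"name": name}
        if separator:  # options such as rotate carry no value
            entry["value"] = value
        entries.append(entry)
    return entries


def with_dns_options(pod_spec, options_line):
    """Return a copy of pod_spec with dnsConfig.options set from options_line."""
    spec = dict(pod_spec)
    dns_config = dict(spec.get("dnsConfig", {}))
    dns_config["options"] = parse_resolver_options(options_line)
    spec["dnsConfig"] = dns_config
    return spec
```

The original pod_spec dictionary is not modified, so the result can be compared against it before the manifest is applied.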

What do I do if DNS resolutions fail due to IP Virtual Server (IPVS) errors?

Cause

After CoreDNS pods are removed from the IPVS backend, DNS queries that are still sent to the ports of the removed CoreDNS pods may be lost.

Symptom

DNS resolutions occasionally fail when nodes are added to or removed from the cluster, nodes are shut down, or CoreDNS is scaled in. In most cases, this situation lasts for about 5 minutes.

Solution

What do I do if NodeLocal DNSCache does not work?

Cause
  • DNSConfig is not injected into the application pods. The IP address of the kube-dns Service is configured as the address of the DNS server for the application pods.
  • The application pods are deployed by using an image based on Alpine Linux. As a result, DNS queries are concurrently sent to all nameservers, including the local DNS cache and CoreDNS pods.
Symptom

All DNS queries are sent to CoreDNS instead of NodeLocal DNSCache.

Solution
  • Configure automatic injection for DNSConfig. For more information, see Use NodeLocal DNSCache in an ACK cluster.
  • If the image that you use is based on Alpine Linux, we recommend that you replace the image with an image that is based on another operating system. For more information, see Alpine.
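
To check whether DNSConfig was injected, you can inspect the /etc/resolv.conf file of an application pod and see which nameserver is listed first. The following is a minimal sketch that works on the file content as a string. The address of the local DNS cache is passed in as a parameter, and the value in the usage note (169.254.20.10) is an assumption that you should replace with the address used in your cluster.

```python
def nameservers(resolv_conf):
    """Return the nameserver addresses in the order they appear."""
    servers = []
    for line in resolv_conf.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def uses_local_cache(resolv_conf, local_cache_ip):
    """Return True if the first nameserver is the local DNS cache address."""
    servers = nameservers(resolv_conf)
    return bool(servers) and servers[0] == local_cache_ip
```

For example, uses_local_cache(text, "169.254.20.10") returning False for a pod indicates that its queries go to the kube-dns Service first.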

What do I do if domain names that are added to Alibaba Cloud DNS PrivateZone cannot be resolved?

Cause

Alibaba Cloud DNS PrivateZone does not support TCP. You must use UDP.

Symptom

When NodeLocal DNSCache is used, domain names that are added to Alibaba Cloud DNS PrivateZone cannot be resolved, the endpoints of Alibaba Cloud service APIs that contain vpc-proxy cannot be resolved, or domain names are resolved to wrong IP addresses.

Solution

Add the prefer_udp configuration to CoreDNS. For more information, see Modify CoreDNS.
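
After you change the configuration, you can verify that prefer_udp sits inside the forward block of the Corefile. The following is a rough sketch, assuming the common layout in which the forward block opens with a line that ends in { and closes with a line that starts with }. It is a line-based check, not a full Corefile parser.

```python
def forward_prefers_udp(corefile):
    """Return True if a forward { ... } block contains the prefer_udp option."""
    in_forward = False
    for raw_line in corefile.splitlines():
        tokens = raw_line.split("#", 1)[0].split()  # drop comments, then tokenize
        if not tokens:
            continue
        if not in_forward:
            in_forward = tokens[0] == "forward" and tokens[-1] == "{"
        elif tokens[0] == "}":
            in_forward = False
        elif tokens[0] == "prefer_udp":
            return True
    return False
```

A False result for the Corefile in the coredns ConfigMap means that the forward plug-in may still use TCP toward the upstream DNS server.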