
Container Service for Kubernetes:Use KubeSkoop to troubleshoot network issues

Last Updated:Mar 25, 2026

Container network failures—intermittent DNS timeouts, TCP resets, Nginx 499/502/503/504 errors, and latency spikes—are notoriously hard to pin down because they are transient and difficult to reproduce. ACK KubeSkoop (formerly ACK Net Exporter) is an open-source network monitoring and troubleshooting suite that uses eBPF to collect per-Pod network metrics directly from each node, so you can identify root causes without manual packet captures or guesswork.

This topic shows you how to install KubeSkoop in a managed ACK cluster, configure monitoring, and use the per-Pod metrics to diagnose four common network failure types.

How it works

KubeSkoop runs as a DaemonSet, placing one agent on every node. Each agent uses eBPF to collect low-level network data from the node and aggregates it per Pod. The agent exposes Prometheus metrics and abnormal events through an HTTP endpoint on port 9102.

KubeSkoop supports:

  • Deep network monitoring

  • Connectivity diagnostics

  • Packet capturing

  • Latency probing

The following figure shows the core architecture of KubeSkoop.

KubeSkoop architecture

Prerequisites

Before you begin, make sure you have:

  • An ACK managed cluster.

  • A kubectl client that can connect to the cluster, or access to the cluster through the ACK console.

Install and configure ACK KubeSkoop

Install the ACK KubeSkoop add-on

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Add-ons.

  3. On the Add-ons page, search for ACK KubeSkoop, find the add-on, and click Install.

  4. On the Install Component ACK KubeSkoop page, click Confirm.

Configure the KubeSkoop add-on

KubeSkoop is configured through a ConfigMap named kubeskoop-config in the ack-kubeskoop namespace. Changes take effect immediately—no restart is required.

Option 1: Edit via kubectl

kubectl edit cm kubeskoop-config -n ack-kubeskoop

Option 2: Edit via the console

  1. Log on to the ACK console. In the left navigation pane, click Clusters.

  2. Click the name of your cluster. In the left navigation pane, choose Configurations > ConfigMaps.

  3. On the ConfigMaps page, set Namespace to ack-kubeskoop, search for kubeskoop-config, and click Edit in the Actions column.

  4. In the Edit panel, update the parameters and click OK.

The following parameters are supported.

  • debugmode: Enables debug mode. When set to true, provides DEBUG-level logs, debugging interfaces, and the Go pprof and gops diagnostic tools. Default: false

  • port: Port for the metrics HTTP endpoint. Default: 9102

  • enableController: Enables the Controller component, which interacts with the Kubernetes API to perform monitoring and management tasks. Default: true

  • controllerAddr: Address of the KubeSkoop Controller component. Default: dns:kubeskoop-controller:10263

  • metrics.probes: List of probe types to collect. Each probe maps to a metric category. Default probes: conntrack, qdisc, netdev, io, sock, tcpsummary, tcp, tcpext, udp, rdma. For the full probe reference, see Probes, metrics, and events.
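As an illustration, a kubeskoop-config ConfigMap that keeps the defaults above might look like the following sketch. The data key name (config.yaml here) and the exact key nesting are assumptions that can differ between add-on versions, so treat this as a template and verify it against the ConfigMap actually shipped with the add-on.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubeskoop-config
  namespace: ack-kubeskoop
data:
  # Key name and nesting are illustrative; check the installed ConfigMap.
  config.yaml: |
    debugmode: false
    port: 9102
    enableController: true
    controllerAddr: dns:kubeskoop-controller:10263
    metrics:
      probes:
        - conntrack
        - qdisc
        - netdev
        - io
        - sock
        - tcpsummary
        - tcp
        - tcpext
        - udp
        - rdma
```

Because changes take effect immediately, trimming the probe list is also a quick way to reduce agent overhead when you only need one metric category.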

Set up the ARMS Prometheus dashboard

  1. Log on to the ARMS console. In the left navigation pane, click Integration Management.

  2. On the Integration Management page, click Add Integration. Search for KubeSkoop and click ACK KubeSkoop Network Monitoring.

  3. In the ACK KubeSkoop Network Monitoring dialog box, select the ACK cluster, enter an Integration Name, and click OK.

  4. Log on to the ACK console. Click the cluster name. In the left navigation pane, choose Operations > Prometheus Monitoring.

  5. Click the Others tab. The node and Pod monitoring dashboards created by KubeSkoop appear in the dashboard list.

KubeSkoop Prometheus dashboards in the ACK console
For more information about Alibaba Cloud Prometheus Service, see Use Alibaba Cloud Prometheus Service.

Access KubeSkoop metrics manually

KubeSkoop exposes metrics in Prometheus format at port 9102 on each agent Pod. To query metrics directly:

  1. Get the list of KubeSkoop agent Pods and their node IPs:

    kubectl get pod -n ack-kubeskoop -o wide | grep kubeskoop-agent

    Expected output:

    kubeskoop-agent-2chvw   1/1   Running   0   43m   172.16.16.xxx   cn-hangzhou.172.16.16.xxx   <none>   <none>
    kubeskoop-agent-2qtbf   1/1   Running   0   43m   172.16.16.xxx   cn-hangzhou.172.16.16.xxx   <none>   <none>
    kubeskoop-agent-72pgf   1/1   Running   0   43m   172.16.16.xxx   cn-hangzhou.172.16.16.xxx   <none>   <none>
  2. Fetch all metrics from an agent. Replace 172.16.16.xxx with the IP from the previous step.

    curl http://172.16.16.xxx:9102/metrics

Metrics follow this format:

kubeskoop_netdev_rxbytes{k8s_namespace="",k8s_node="cn-hangzhou.172.16.16.xxx",k8s_pod=""} 2.970963745e+09

Each metric is labeled with k8s_namespace, k8s_node, and k8s_pod, so you can filter by namespace, node, or Pod in Prometheus queries.
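To use these labels outside Prometheus, you can filter a raw scrape with standard shell tools. The sketch below keeps only the lines for one Pod and strips the label block so each line becomes a plain name/value pair; the scrape text is an inline sample with made-up values, and in practice you would pipe the output of the curl command above into the same pipeline.

```shell
#!/bin/sh
# Sample scrape text with illustrative values. In practice, replace it with:
#   curl -s http://<agent-ip>:9102/metrics
scrape='kubeskoop_netdev_rxbytes{k8s_namespace="default",k8s_node="node-1",k8s_pod="app-1"} 2.9e+09
kubeskoop_netdev_txdropped{k8s_namespace="default",k8s_node="node-1",k8s_pod="app-1"} 0
kubeskoop_netdev_rxbytes{k8s_namespace="kube-system",k8s_node="node-1",k8s_pod="coredns-x"} 1.2e+08'

# Keep only the lines for one Pod, then strip the {label} block so each
# output line is "<metric-name> <value>"
filtered=$(printf '%s\n' "$scrape" \
  | grep 'k8s_pod="app-1"' \
  | awk '{sub(/\{.*\}/, "", $1); print $1, $2}')
echo "$filtered"
```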

Troubleshoot common network issues

The following sections describe the metrics to watch for each failure type and how to interpret them.

Troubleshoot DNS timeouts

DNS timeouts in container environments typically have one of three root causes:

  • The DNS server responds slowly, and the response arrives only after the client-side timeout has expired.

  • The sender cannot dispatch the DNS query packet promptly.

  • The server responds in time, but the sender drops the response due to resource constraints such as insufficient memory.

Because most cloud-native workloads rely on CoreDNS, monitor the following metrics for both your application Pods and any CoreDNS Pods.

  • kubeskoop_pod_udpsndbuferrors: Errors when sending UDP packets through the network layer

  • kubeskoop_pod_udpincsumerrors: Checksum errors when receiving UDP packets

  • kubeskoop_pod_udpnoports: Times the network layer cannot find a matching socket when receiving with __udp4_lib_rcv

  • kubeskoop_pod_udpinerrors: Errors when receiving UDP packets

  • kubeskoop_pod_udpoutdatagrams: Packets successfully sent by UDP through the network layer

  • kubeskoop_pod_udprcvbuferrors: Errors from an insufficient socket receive queue when copying data to the application layer
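Because these metrics are cumulative counters, what matters is whether they increase during an error window, not their absolute value. The sketch below diffs two scrapes of kubeskoop_pod_udpinerrors for a hypothetical CoreDNS Pod; the two sample lines, with made-up values, stand in for snapshots taken before and after the incident.

```shell
#!/bin/sh
# Two scrapes of the same cumulative counter, taken before and after a DNS
# error window (illustrative values; in practice save two snapshots of
# `curl -s http://<agent-ip>:9102/metrics`).
before='kubeskoop_pod_udpinerrors{k8s_namespace="kube-system",k8s_node="node-1",k8s_pod="coredns-abc"} 42'
after='kubeskoop_pod_udpinerrors{k8s_namespace="kube-system",k8s_node="node-1",k8s_pod="coredns-abc"} 45'

# The delta across the window is what matters: a nonzero delta means UDP
# receive errors occurred on the Pod during the incident.
v1=$(printf '%s\n' "$before" | awk '{print $2}')
v2=$(printf '%s\n' "$after" | awk '{print $2}')
delta=$((v2 - v1))
echo "udpinerrors delta: $delta"
```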

Troubleshoot Nginx Ingress 499/502/503/504 errors

These four status codes each point to a different failure layer.

  • 499 (client closed the TCP connection before Nginx responded): the client-side timeout was reached while Nginx was still processing; the server processed the connection slowly after establishment; the upstream backend was slow.

  • 502 (Nginx could not get a valid response from the upstream): DNS resolution for the backend failed (common when using a Kubernetes Service); Nginx failed to establish a connection with the upstream; the upstream request or response was too large, causing memory allocation failures.

  • 503 (all upstream servers unavailable): no available backends; traffic throttled by the Ingress limit-req setting.

  • 504 (Nginx timed out waiting for the upstream): delayed response from the upstream backend.

Collect baseline information first. Before checking KubeSkoop metrics, gather:

  • Nginx access_log, specifically request_time, upstream_connect_time, and upstream_response_time

  • Nginx error_log entries around the time of the issue

  • Liveness and readiness probe status if configured

If you suspect connection failures, check for changes in these metrics:

  • kubeskoop_tcpext_listenoverflow: Half-connection queue overflow on a socket in the LISTEN state

  • kubeskoop_tcpext_listendrops: Failures to create a SYN_RECV socket from a LISTEN socket

  • kubeskoop_netdev_txdropped: Packets dropped by the NIC (network interface card) due to a transmission error

  • kubeskoop_netdev_rxdropped: Packets dropped by the NIC due to a reception error

  • kubeskoop_tcp_activeopens: Times a Pod initiates a TCP handshake with a SYN packet (failed connections also increment this counter)

  • kubeskoop_tcp_passiveopens: Times a Pod completes a TCP handshake and allocates a socket (equivalent to successfully established connections)

  • kubeskoop_tcp_retranssegs: Total retransmitted segments in a single Pod, calculated after TCP Segmentation Offload (TSO)

  • kubeskoop_tcp_estabresets: Times a TCP connection is abnormally closed in a single Pod

  • kubeskoop_tcp_outrsts: Reset packets sent by TCP in a single Pod

  • kubeskoop_conntrack_invalid: Times a connection tracking (conntrack) entry cannot be established, but the packet is not dropped

  • kubeskoop_conntrack_drop: Packets dropped because a conntrack entry could not be established

If Nginx responses are slow (for example, request_time is long but upstream_response_time is short), check for queue buildup:

  • kubeskoop_tcpsummary_tcpestablishedconn: Current TCP connections in the ESTABLISHED state

  • kubeskoop_tcpsummary_tcptimewaitconn: Current TCP connections in the TIME_WAIT state

  • kubeskoop_tcpsummary_tcptxqueue: Total bytes in the send queue of ESTABLISHED TCP connections

  • kubeskoop_tcpsummary_tcprxqueue: Total bytes in the receive queue of ESTABLISHED TCP connections

  • kubeskoop_tcpext_tcpretransfail: Retransmitted packets that returned an error other than EBUSY, indicating retransmission failure
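One quick way to spot queue buildup is to flag Pods whose receive queue exceeds a threshold at scrape time. The snippet below runs that check against inline sample data; the Pod names, the gauge values, and the 1 MiB threshold are all illustrative, and in practice you would feed it the curl output from an agent.

```shell
#!/bin/sh
# Sample tcprxqueue gauges in bytes (illustrative values)
scrape='kubeskoop_tcpsummary_tcprxqueue{k8s_namespace="ingress",k8s_node="node-1",k8s_pod="nginx-ingress-a"} 2097152
kubeskoop_tcpsummary_tcprxqueue{k8s_namespace="ingress",k8s_node="node-2",k8s_pod="nginx-ingress-b"} 4096'

# Print the Pods whose receive queue exceeds 1 MiB: packets are arriving
# but the user-space process is not draining them fast enough.
flagged=$(printf '%s\n' "$scrape" | awk -F'[" ]' '$NF > 1048576 {
  for (i = 1; i <= NF; i++) if ($i ~ /k8s_pod=$/) print $(i + 1)
}')
echo "$flagged"
```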

Troubleshoot TCP resets

A TCP reset surfaces in applications as a connection reset by peer error (common in C-based apps like Nginx) or a Broken pipe error (common in Java or Python apps that use TCP connection wrappers).

Common causes in cloud-native environments:

  • Server-side resource exhaustion, such as insufficient TCP memory, triggers a proactive reset.

  • Load Balancing or a Kubernetes Service forwards traffic to an unexpected backend due to anomalies in Endpoint selection or conntrack state.

  • Security policies close the connection.

  • In NAT environments or under high concurrency, Protection Against Wrapped Sequence Numbers (PAWS) or sequence number wraparound occurs.

  • TCP keepalive expires because no business traffic has passed for an extended period.

Start by mapping the network topology between the client and server at the time the reset occurs. Then check the following metrics:

  • kubeskoop_tcpext_tcpabortontimeout: Resets sent because the maximum number of keepalive, window probe, or retransmission attempts was exceeded

  • kubeskoop_tcpext_tcpabortonlinger: Resets sent to reclaim connections in the FIN_WAIT2 state when the TCP Linger2 option is enabled

  • kubeskoop_tcpext_tcpabortonclose: Resets sent because unread data was present when a TCP connection was closed

  • kubeskoop_tcpext_tcpabortonmemory: Resets sent due to insufficient memory during allocation of tw_sock or tcp_sock resources

  • kubeskoop_tcpext_tcpabortondata: Resets sent for fast connection reclamation when the Linger or Linger2 option is enabled

  • kubeskoop_tcpext_tcpackskippedsynrecv: Times a socket in the SYN_RECV state did not reply with an ACK

  • kubeskoop_tcpext_tcpackskippedpaws: Times an ACK was not sent due to out-of-window (OOW) rate limiting, even though the PAWS mechanism triggered a correction

  • kubeskoop_tcp_estabresets: Times a TCP connection was abnormally closed in a single Pod

  • kubeskoop_tcp_outrsts: Reset packets sent by TCP in a single Pod
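To attribute a burst of resets to a specific abort path, it helps to rank the reset-related counters side by side: outrsts gives the total, and the largest abort counter suggests which path produced it. The sketch below sorts a sample scrape with made-up values; in practice you would pipe in the agent's curl output filtered for the affected Pod.

```shell
#!/bin/sh
# Reset-related counters from a single scrape (illustrative values)
scrape='kubeskoop_tcpext_tcpabortontimeout{k8s_pod="app-1"} 0
kubeskoop_tcpext_tcpabortonmemory{k8s_pod="app-1"} 17
kubeskoop_tcpext_tcpabortonclose{k8s_pod="app-1"} 2
kubeskoop_tcp_outrsts{k8s_pod="app-1"} 19'

# Strip labels, put the value first, and sort highest first so the
# dominant abort counter stands out.
ranked=$(printf '%s\n' "$scrape" \
  | awk '{sub(/\{.*\}/, "", $1); print $2, $1}' \
  | sort -rn)
echo "$ranked"
```

In this sample, tcpabortonmemory accounts for nearly all of outrsts, which would point at TCP memory pressure on the server side.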

Troubleshoot network latency jitter

Intermittent latency spikes in container networks often have non-obvious causes. Although the symptoms appear as network delays, the root cause is frequently an operating system scheduling or resource issue.

Common causes:

  • A real-time process managed by the RT scheduler runs for too long, starving business processes or network kernel threads.

  • The workload experiences occasional slow external calls, such as high-latency cloud disk I/O or intermittent RDS Round-Trip Time (RTT) spikes.

  • Uneven load between CPUs or NUMA nodes causes some CPUs to lag.

  • Stateful kernel mechanisms such as the conntrack confirm operation, or a large number of orphan sockets, slow down socket lookups.

Monitor the following metrics to narrow down the cause:

  • kubeskoop_io_ioreadsyscall: Times a process performs file system read operations (read, pread)

  • kubeskoop_io_iowritesyscall: Times a process performs file system write operations (write, pwrite)

  • kubeskoop_io_ioreadbytes: Bytes a process reads from the file system (typically from a block device)

  • kubeskoop_io_iowritebytes: Bytes a process writes to the file system

  • kubeskoop_tcpext_tcptimeouts: SYN packets that were not acknowledged and were retransmitted (incremented when the congestion avoidance state has not entered recovery, loss, or disorder)

  • kubeskoop_tcpsummary_tcpestablishedconn: Current TCP connections in the ESTABLISHED state

  • kubeskoop_tcpsummary_tcptimewaitconn: Current TCP connections in the TIME_WAIT state

  • kubeskoop_tcpsummary_tcptxqueue: Total bytes in the send queue of ESTABLISHED TCP connections

  • kubeskoop_tcpsummary_tcprxqueue: Total bytes in the receive queue of ESTABLISHED TCP connections

  • kubeskoop_softnet_processed: Packets from the NIC backlog processed by all CPUs within a single Pod

  • kubeskoop_softnet_dropped: Packets dropped by all CPUs within a single Pod
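A coarse signal for the CPU-side causes described above is the ratio of dropped to processed backlog packets. The sketch below computes it from two sample counter values (made up for illustration); a ratio that rises during a latency window points at CPU scheduling pressure rather than the network path.

```shell
#!/bin/sh
# softnet counters for one Pod from a single scrape (illustrative values)
processed=200000
dropped=300

# Percentage of backlog packets the Pod's CPUs dropped instead of
# processing; compare this across latency windows rather than in isolation.
ratio=$(awk -v p="$processed" -v d="$dropped" 'BEGIN { printf "%.2f", 100 * d / p }')
echo "softnet drop ratio: ${ratio}%"
```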

Customer use cases

The following cases show how ACK KubeSkoop was used to identify and resolve real production issues.

Case 1: Intermittent DNS timeouts

Problem: A customer running a PHP application experienced intermittent DNS resolution timeouts. The DNS service was CoreDNS, with the configured upstream resolver pointing to a public provider.

Findings: During error windows:

  • kubeskoop_udp_noports increased by 1.

  • kubeskoop_packetloss_total increased by 1.

Both changes were small in absolute terms.

Root cause: The public DNS server was responding slowly. The response arrived after the PHP application had already timed out on the client side. The small metric increments were consistent with occasional, not systematic, packet delays.

Case 2: Intermittent connection failures in a Java application

Problem: A customer's Tomcat service became unavailable intermittently, with each outage lasting approximately 5 to 10 seconds.

Findings: Log analysis confirmed that a Garbage Collection (GC) pause was occurring at the time of each incident. After deploying KubeSkoop, kubeskoop_tcpext_listendrops increased significantly during GC events.

Root cause: During GC, request processing slowed and connections were not released quickly. Incoming connection requests continued to accumulate, filling the listen socket's backlog and causing it to overflow, which triggered the spike in kubeskoop_tcpext_listendrops.

Resolution: The customer adjusted Tomcat's connection backlog parameters, which resolved the issue.
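For reference, the backlog of a Tomcat HTTP connector is controlled by the acceptCount attribute of the Connector element in server.xml. The fragment below is illustrative only, not the customer's actual setting; the right value depends on expected connection bursts and GC pause length.

```xml
<!-- server.xml: acceptCount sets the listen backlog for this connector,
     i.e. how many pending connections may queue while all worker
     threads are busy (for example, during a GC pause). -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           acceptCount="1024" />
```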

Case 3: Intermittent RTT spikes to Redis

Problem: A customer observed intermittent Redis requests with total response times exceeding 300 ms, but could not reproduce the issue on demand.

Findings: After deploying KubeSkoop, kubeskoop_virtcmdlatency_latency increased during the latency windows:

  • Bucket le=15 (threshold ~36 ms) increased, indicating at least one call exceeded 36 ms.

  • Bucket le=18 (threshold ~200 ms) increased, indicating a call exceeding 200 ms.

Root cause: Kernel virtualization calls triggered during batch Pod creation and deletion occupied CPU cores and could not be preempted, causing the customer's intermittent latency.

Case 4: Intermittent health check failures for Nginx Ingress

Problem: Nginx Ingress health checks failed intermittently, accompanied by business request failures.

Findings: At the time of each failure:

  • kubeskoop_tcpsummary_tcprxqueue and kubeskoop_tcpsummary_tcptxqueue both increased.

  • kubeskoop_tcpext_tcptimeouts increased.

  • kubeskoop_tcpsummary_tcptimewaitconn decreased while kubeskoop_tcpsummary_tcpestablishedconn increased.

The kernel and connection establishment paths appeared normal, but the user-space process was not consuming packets from the receive queue or flushing the send queue.

Root cause: cgroup CPU throttling was intermittently preventing the Ingress process from being scheduled, stalling both packet processing and transmission.

Resolution: The customer enabled CPU Burst for the Nginx Ingress Pods by following Enable CPU Burst, which eliminated the throttling.

What's next