Container network failures—intermittent DNS timeouts, TCP resets, Nginx 499/502/503/504 errors, and latency spikes—are notoriously hard to pin down because they are transient and difficult to reproduce. ACK KubeSkoop (formerly ACK Net Exporter) is an open-source network monitoring and troubleshooting suite that uses eBPF to collect per-Pod network metrics directly from each node, so you can identify root causes without manual packet captures or guesswork.
This topic shows you how to install KubeSkoop in a managed ACK cluster, configure monitoring, and use the per-Pod metrics to diagnose four common network failure types.
How it works
KubeSkoop runs as a DaemonSet, placing one agent on every node. Each agent uses eBPF to collect low-level network data from the node and aggregates it per Pod. The agent exposes Prometheus metrics and abnormal events through an HTTP endpoint on port 9102.
KubeSkoop supports:
Deep network monitoring
Connectivity diagnostics
Packet capturing
Latency probing
The following figure shows the core architecture of KubeSkoop.
Prerequisites
Before you begin, make sure you have:
A managed ACK cluster
Access to the ACK console
Access to the ARMS console (required for the Prometheus dashboard)
Install and configure ACK KubeSkoop
Install the ACK KubeSkoop add-on
Log on to the ACK console. In the left navigation pane, click Clusters.
On the Clusters page, click the name of your cluster. In the left navigation pane, click Add-ons.
On the Add-ons page, search for ACK KubeSkoop, find the add-on, and click Install.
On the Install Component ACK KubeSkoop page, click Confirm.
Configure the KubeSkoop add-on
KubeSkoop is configured through a ConfigMap named kubeskoop-config in the ack-kubeskoop namespace. Changes take effect immediately—no restart is required.
Option 1: Edit via kubectl
```
kubectl edit cm kubeskoop-config -n ack-kubeskoop
```

Option 2: Edit via the console
Log on to the ACK console. In the left navigation pane, click Clusters.
Click the name of your cluster. In the left navigation pane, choose Configurations > ConfigMaps.
On the ConfigMaps page, set Namespace to ack-kubeskoop, search for kubeskoop-config, and click Edit in the Actions column.
In the Edit panel, update the parameters and click OK.
The following table describes the supported parameters.
| Parameter | Description | Default value |
|---|---|---|
| debugmode | Enable debug mode. When set to true, provides DEBUG-level logs, debugging interfaces, and Go pprof and gops diagnostic tools. | false |
| port | Port for the metrics HTTP endpoint. | 9102 |
| enableController | Enable the Controller component, which interacts with the Kubernetes API to perform monitoring and management tasks. | true |
| controllerAddr | Address of the KubeSkoop Controller component. | dns:kubeskoop-controller:10263 |
| metrics.probes | List of probe types to collect. Each probe maps to a metric category. Default probes: conntrack, qdisc, netdev, io, sock, tcpsummary, tcp, tcpext, udp, rdma. For the full probe reference, see Probes, metrics, and events. | See description |
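As an illustration only, the parameters from the table might be arranged inside the ConfigMap roughly as follows. The exact key layout in kubeskoop-config depends on the add-on version, so treat this as a sketch and confirm the field names against your live ConfigMap (for example, with `kubectl get cm kubeskoop-config -n ack-kubeskoop -o yaml`) before editing.

```yaml
# Hypothetical layout; verify field names against your live ConfigMap.
debugmode: false
port: 9102
enableController: true
controllerAddr: dns:kubeskoop-controller:10263
metrics:
  probes:
  - name: conntrack
  - name: netdev
  - name: tcpext
```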
Set up the ARMS Prometheus dashboard
Log on to the ARMS console. In the left navigation pane, click Integration Management.
On the Integration Management page, click Add Integration. Search for KubeSkoop and click ACK KubeSkoop Network Monitoring.
In the ACK KubeSkoop Network Monitoring dialog box, select the ACK cluster, enter an Integration Name, and click OK.
Log on to the ACK console. Click the cluster name. In the left navigation pane, choose Operations > Prometheus Monitoring.
Click the Others tab. The node and Pod monitoring dashboards created by KubeSkoop appear in the dashboard list.

For more information about Alibaba Cloud Prometheus Service, see Use Alibaba Cloud Prometheus Service.
Access KubeSkoop metrics manually
KubeSkoop exposes metrics in Prometheus format at port 9102 on each agent Pod. To query metrics directly:
Get the list of KubeSkoop agent Pods and their node IPs:
```
kubectl get pod -n ack-kubeskoop -o wide | grep kubeskoop-agent
```
Expected output:
```
kubeskoop-agent-2chvw   1/1   Running   0   43m   172.16.16.xxx   cn-hangzhou.172.16.16.xxx   <none>   <none>
kubeskoop-agent-2qtbf   1/1   Running   0   43m   172.16.16.xxx   cn-hangzhou.172.16.16.xxx   <none>   <none>
kubeskoop-agent-72pgf   1/1   Running   0   43m   172.16.16.xxx   cn-hangzhou.172.16.16.xxx   <none>   <none>
```
Fetch all metrics from an agent. Replace 172.16.16.xxx with the IP from the previous step.
```
curl http://172.16.16.xxx:9102/metrics
```
Metrics follow this format:
```
kubeskoop_netdev_rxbytes{k8s_namespace="",k8s_node="cn-hangzhou.172.16.16.xxx",k8s_pod=""} 2.970963745e+09
```
Each metric is labeled with k8s_namespace, k8s_node, and k8s_pod, so you can filter by namespace, node, or Pod in Prometheus queries.
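Because every series carries these labels, you can also slice a raw scrape with standard shell tools before Prometheus is involved. The sketch below uses hypothetical sample text and Pod names (not real KubeSkoop output); in practice you would pipe the result of the `curl` command above instead.

```shell
# Hypothetical scrape excerpt; a real one comes from: curl http://<node-ip>:9102/metrics
sample='kubeskoop_netdev_rxbytes{k8s_namespace="default",k8s_node="node-1",k8s_pod="app-1"} 2048
kubeskoop_netdev_rxbytes{k8s_namespace="kube-system",k8s_node="node-1",k8s_pod="coredns-0"} 4096
kubeskoop_tcp_outrsts{k8s_namespace="default",k8s_node="node-1",k8s_pod="app-1"} 3'

# Keep only the series that belong to one Pod, selected by the k8s_pod label.
pod_metrics() {
  printf '%s\n' "$sample" | grep "k8s_pod=\"$1\""
}

pod_metrics app-1
```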
Troubleshoot common network issues
The following sections describe the metrics to watch for each failure type and how to interpret them.
Troubleshoot DNS timeouts
DNS timeouts in container environments typically have one of three root causes:
The DNS server responds too slowly, and the reply arrives only after the client-side timeout has already expired.
The sender cannot dispatch the DNS query packet promptly.
The server responds in time, but the sender drops the response due to resource constraints such as insufficient memory.
Because most cloud-native workloads rely on CoreDNS, monitor the following metrics for both your application Pods and any CoreDNS Pods.
| Metric | What it measures |
|---|---|
| kubeskoop_pod_udpsndbuferrors | Errors when sending UDP packets through the network layer |
| kubeskoop_pod_udpincsumerrors | Checksum errors when receiving UDP packets |
| kubeskoop_pod_udpnoports | Times the network layer cannot find a matching socket when receiving with __udp4_lib_rcv |
| kubeskoop_pod_udpinerrors | Errors when receiving UDP packets |
| kubeskoop_pod_udpoutdatagrams | Packets successfully sent by UDP through the network layer |
| kubeskoop_pod_udprcvbuferrors | Errors from an insufficient socket receive queue when copying data to the application layer |
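Because these are cumulative counters, what matters is whether they increase during an error window. One way to check is to diff two scrapes taken a few seconds apart. The snippet below is a sketch with hypothetical sample values, not real CoreDNS output; in practice each sample would come from a `curl` of the agent's :9102/metrics endpoint.

```shell
# Two hypothetical scrapes of a CoreDNS Pod, before and after an error window.
before='kubeskoop_pod_udpsndbuferrors{k8s_namespace="kube-system",k8s_pod="coredns-0"} 10
kubeskoop_pod_udpinerrors{k8s_namespace="kube-system",k8s_pod="coredns-0"} 2'
after='kubeskoop_pod_udpsndbuferrors{k8s_namespace="kube-system",k8s_pod="coredns-0"} 12
kubeskoop_pod_udpinerrors{k8s_namespace="kube-system",k8s_pod="coredns-0"} 2'

# Print the delta of one counter between the two scrapes.
counter_delta() {
  a=$(printf '%s\n' "$before" | awk -v m="$1" '$0 ~ "^"m {print $2}')
  b=$(printf '%s\n' "$after"  | awk -v m="$1" '$0 ~ "^"m {print $2}')
  echo $((b - a))
}

counter_delta kubeskoop_pod_udpsndbuferrors
```

A nonzero delta for udpsndbuferrors during the window points at send-side drops; a delta of zero across all UDP counters suggests the problem is upstream of the Pod.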
Troubleshoot Nginx Ingress 499/502/503/504 errors
These four status codes each point to a different failure layer.
| Status code | Meaning | Common causes |
|---|---|---|
| 499 | Client closed the TCP connection before Nginx responded | Client-side timeout reached while Nginx was processing; server processes the connection slowly after establishment; upstream backend is slow |
| 502 | Nginx could not get a valid response from the upstream | Failed DNS resolution for the backend (common when using a Kubernetes Service); failed to establish a connection with the upstream; upstream request or response too large, causing memory allocation failures |
| 503 | All upstream servers unavailable | No available backends; traffic throttled by the Ingress limit-req setting |
| 504 | Nginx timed out waiting for the upstream | Delayed response from the upstream backend |
Collect baseline information first. Before checking KubeSkoop metrics, gather:
Nginx access_log entries, specifically request_time, upstream_connect_time, and upstream_response_time
Nginx error_log entries around the time of the issue
Liveness and readiness probe status, if configured
If you suspect connection failures, check for changes in these metrics:
| Metric | What it measures |
|---|---|
| kubeskoop_tcpext_listenoverflow | Times the accept queue (full connection queue) of a socket in the LISTEN state overflows |
| kubeskoop_tcpext_listendrops | Failures to create a SYN_RECV socket from a LISTEN socket |
| kubeskoop_netdev_txdropped | Packets dropped by the NIC (network interface card) due to a transmission error |
| kubeskoop_netdev_rxdropped | Packets dropped by the NIC due to a reception error |
| kubeskoop_tcp_activeopens | Times a Pod initiates a TCP handshake with a SYN packet (failed connections also increment this counter) |
| kubeskoop_tcp_passiveopens | Times a Pod completes a TCP handshake and allocates a socket (equivalent to successfully established connections) |
| kubeskoop_tcp_retranssegs | Total retransmitted segments in a single Pod, calculated after TCP Segmentation Offload (TSO) |
| kubeskoop_tcp_estabresets | Times a TCP connection is abnormally closed in a single Pod |
| kubeskoop_tcp_outrsts | Reset packets sent by TCP in a single Pod |
| kubeskoop_conntrack_invalid | Times a connection tracking (conntrack) entry cannot be established, but the packet is not dropped |
| kubeskoop_conntrack_drop | Packets dropped because a conntrack entry could not be established |
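For a first pass over a suspect node, it can help to surface only the counters that are already nonzero in a single scrape. The sketch below uses a hypothetical scrape excerpt and Pod name; in practice, pipe a real `curl` of the agent's :9102/metrics endpoint through the same awk filter.

```shell
# Hypothetical scrape excerpt from an Ingress Pod (values illustrative).
scrape='kubeskoop_tcpext_listenoverflow{k8s_pod="ingress-0"} 0
kubeskoop_tcpext_listendrops{k8s_pod="ingress-0"} 7
kubeskoop_netdev_txdropped{k8s_pod="ingress-0"} 0
kubeskoop_conntrack_drop{k8s_pod="ingress-0"} 1'

# Print only the series whose current value is nonzero.
printf '%s\n' "$scrape" | awk '$2 + 0 != 0 {print $1, $2}'
```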
If Nginx responses are slow (for example, request_time is long but upstream_response_time is short), check for queue buildup:
| Metric | What it measures |
|---|---|
| kubeskoop_tcpsummary_tcpestablishedconn | Current TCP connections in the ESTABLISHED state |
| kubeskoop_tcpsummary_tcptimewaitconn | Current TCP connections in the TIME_WAIT state |
| kubeskoop_tcpsummary_tcptxqueue | Total bytes in the send queue of ESTABLISHED TCP connections |
| kubeskoop_tcpsummary_tcprxqueue | Total bytes in the receive queue of ESTABLISHED TCP connections |
| kubeskoop_tcpext_tcpretransfail | Retransmitted packets that returned an error other than EBUSY, indicating retransmission failure |
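A growing receive queue means the user-space process is not draining packets fast enough. As a sketch, you can flag Pods whose queue exceeds a threshold; the series, Pod names, and threshold below are all hypothetical and should be replaced with a real scrape and a value tuned to your workload.

```shell
# Hypothetical receive-queue series (values illustrative).
scrape='kubeskoop_tcpsummary_tcprxqueue{k8s_pod="nginx-0"} 524288
kubeskoop_tcpsummary_tcprxqueue{k8s_pod="nginx-1"} 0'

threshold=65536   # bytes; arbitrary example value
printf '%s\n' "$scrape" | awk -v t="$threshold" '$2 + 0 > t {print $1, "rx queue:", $2, "bytes"}'
```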
Troubleshoot TCP resets
A TCP reset surfaces in applications as a "connection reset by peer" error (common in C-based apps such as Nginx) or a "Broken pipe" error (common in Java or Python apps that use TCP connection wrappers).
Common causes in cloud-native environments:
Server-side resource exhaustion, such as insufficient TCP memory, triggers a proactive reset.
Load Balancing or a Kubernetes Service forwards traffic to an unexpected backend due to anomalies in Endpoint selection or conntrack state.
Security policies close the connection.
In NAT environments or under high concurrency, Protection Against Wrapped Sequence Numbers (PAWS) or sequence number wraparound occurs.
TCP keepalive expires because no business traffic has passed for an extended period.
Start by mapping the network topology between the client and server at the time the reset occurs. Then check the following metrics:
| Metric | What it measures |
|---|---|
| kubeskoop_tcpext_tcpabortontimeout | Resets sent because the maximum number of keepalive, window probe, or retransmission attempts was exceeded |
| kubeskoop_tcpext_tcpabortonlinger | Resets sent to reclaim connections in the FIN_WAIT2 state when the TCP Linger2 option is enabled |
| kubeskoop_tcpext_tcpabortonclose | Resets sent because unread data was present when a TCP connection was closed |
| kubeskoop_tcpext_tcpabortonmemory | Resets sent due to insufficient memory during allocation of tw_sock or tcp_sock resources |
| kubeskoop_tcpext_tcpabortondata | Resets sent for fast connection reclamation when the Linger or Linger2 option is enabled |
| kubeskoop_tcpext_tcpackskippedsynrecv | Times a socket in the SYN_RECV state did not reply with an ACK |
| kubeskoop_tcpext_tcpackskippedpaws | Times an ACK was not sent due to out-of-window (OOW) rate limiting, even though the PAWS mechanism triggered a correction |
| kubeskoop_tcp_estabresets | Times a TCP connection was abnormally closed in a single Pod |
| kubeskoop_tcp_outrsts | Reset packets sent by TCP in a single Pod |
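Since each abort counter corresponds to a different reset path, the fastest way to narrow the cause is to see which counter grew across the incident. The bash sketch below pairs two hypothetical scrapes line by line (all names and values are illustrative); it relies on both samples listing the same series in the same order, which holds for repeated scrapes of the same agent in practice.

```shell
# Two hypothetical scrapes, taken before and after a reset was observed.
before='kubeskoop_tcpext_tcpabortontimeout{k8s_pod="app-1"} 0
kubeskoop_tcpext_tcpabortonmemory{k8s_pod="app-1"} 0
kubeskoop_tcp_outrsts{k8s_pod="app-1"} 5'
after='kubeskoop_tcpext_tcpabortontimeout{k8s_pod="app-1"} 2
kubeskoop_tcpext_tcpabortonmemory{k8s_pod="app-1"} 0
kubeskoop_tcp_outrsts{k8s_pod="app-1"} 7'

# Pair the samples line by line and print any counter that grew.
paste <(printf '%s\n' "$before") <(printf '%s\n' "$after") |
  awk '$4 + 0 > $2 + 0 {print $1, "+" ($4 - $2)}'
```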
Troubleshoot network latency jitter
Intermittent latency spikes in container networks often have non-obvious causes. Although the symptoms appear as network delays, the root cause is frequently an operating system scheduling or resource issue.
Common causes:
A real-time process managed by the RT scheduler runs for too long, starving business processes or network kernel threads.
The workload experiences occasional slow external calls, such as high-latency cloud disk I/O or intermittent RDS Round-Trip Time (RTT) spikes.
Uneven load between CPUs or NUMA nodes causes some CPUs to lag.
Stateful kernel mechanisms such as the conntrack confirm operation, or a large number of orphan sockets, slow down socket lookups.
Monitor the following metrics to narrow down the cause:
| Metric | What it measures |
|---|---|
| kubeskoop_io_ioreadsyscall | Times a process performs file system read operations (read, pread) |
| kubeskoop_io_iowritesyscall | Times a process performs file system write operations (write, pwrite) |
| kubeskoop_io_ioreadbytes | Bytes a process reads from the file system (typically from a block device) |
| kubeskoop_io_iowritebytes | Bytes a process writes to the file system |
| kubeskoop_tcpext_tcptimeouts | SYN packets that were not acknowledged and were retransmitted (incremented when the congestion avoidance state has not entered recovery, loss, or disorder) |
| kubeskoop_tcpsummary_tcpestablishedconn | Current TCP connections in the ESTABLISHED state |
| kubeskoop_tcpsummary_tcptimewaitconn | Current TCP connections in the TIME_WAIT state |
| kubeskoop_tcpsummary_tcptxqueue | Total bytes in the send queue of ESTABLISHED TCP connections |
| kubeskoop_tcpsummary_tcprxqueue | Total bytes in the receive queue of ESTABLISHED TCP connections |
| kubeskoop_softnet_processed | Packets from the NIC backlog processed by all CPUs within a single Pod |
| kubeskoop_softnet_dropped | Packets dropped by all CPUs within a single Pod |
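The two softnet counters are most useful as a ratio: any sustained nonzero drop percentage suggests the CPUs servicing the backlog are falling behind, which points toward scheduling or load-balance issues rather than the network itself. The sketch below computes that ratio from a hypothetical scrape excerpt (metric values are illustrative).

```shell
# Hypothetical softnet counters for one Pod.
scrape='kubeskoop_softnet_processed{k8s_pod="app-1"} 100000
kubeskoop_softnet_dropped{k8s_pod="app-1"} 5'

processed=$(printf '%s\n' "$scrape" | awk '/softnet_processed/ {print $2}')
dropped=$(printf '%s\n' "$scrape" | awk '/softnet_dropped/ {print $2}')

# Drop percentage; a sustained nonzero value merits a look at CPU scheduling.
awk -v p="$processed" -v d="$dropped" 'BEGIN {printf "%.4f%%\n", 100 * d / p}'
```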
Customer use cases
The following cases show how ACK KubeSkoop was used to identify and resolve real production issues.
Case 1: Intermittent DNS timeouts
Problem: A customer running a PHP application experienced intermittent DNS resolution timeouts. The DNS service was CoreDNS, with the configured upstream resolver pointing to a public provider.
Findings: During error windows:
kubeskoop_udp_noports increased by 1.
kubeskoop_packetloss_total increased by 1.
Both changes were small in absolute terms.
Root cause: The public DNS server was responding slowly. The response arrived after the PHP application had already timed out on the client side. The small metric increments were consistent with occasional, not systematic, packet delays.
Case 2: Intermittent connection failures in a Java application
Problem: A customer's Tomcat service became unavailable intermittently, with each outage lasting approximately 5 to 10 seconds.
Findings: Log analysis confirmed that a Garbage Collection (GC) pause was occurring at the time of each incident. After deploying KubeSkoop, kubeskoop_tcpext_listendrops increased significantly during GC events.
Root cause: During GC, request processing slowed and connections were not released quickly. Incoming connection requests continued to accumulate, filling the listen socket's backlog and causing it to overflow, which triggered the spike in kubeskoop_tcpext_listendrops.
Resolution: The customer adjusted Tomcat's connection backlog parameters, which resolved the issue.
Case 3: Intermittent RTT spikes to Redis
Problem: A customer observed intermittent Redis requests with total response times exceeding 300 ms, but could not reproduce the issue on demand.
Findings: After deploying KubeSkoop, kubeskoop_virtcmdlatency_latency increased during the latency windows:
Bucket le=15 (threshold ~36 ms) increased, indicating at least one call exceeded 36 ms.
Bucket le=18 (threshold ~200 ms) increased, indicating at least one call exceeded 200 ms.
Root cause: Kernel virtualization calls triggered during batch Pod creation and deletion occupied CPU cores and could not be preempted, causing the customer's intermittent latency.
Case 4: Intermittent health check failures for Nginx Ingress
Problem: Nginx Ingress health checks failed intermittently, accompanied by business request failures.
Findings: At the time of each failure:
kubeskoop_tcpsummary_tcprxqueue and kubeskoop_tcpsummary_tcptxqueue both increased.
kubeskoop_tcpext_tcptimeouts increased.
kubeskoop_tcpsummary_tcptimewaitconn decreased while kubeskoop_tcpsummary_tcpestablishedconn increased.
The kernel and connection establishment paths appeared normal, but the user-space process was not consuming packets from the receive queue or flushing the send queue.
Root cause: cgroup CPU throttling was intermittently preventing the Ingress process from being scheduled, stalling both packet processing and transmission.
Resolution: The customer enabled CPU Burst for the Nginx Ingress Pods by following Enable CPU Burst, which eliminated the throttling.
What's next
Probes, metrics, and events — full reference for all KubeSkoop probes and the metrics they expose
Use Alibaba Cloud Prometheus Service — set up Prometheus monitoring for your ACK cluster
Enable CPU Burst — configure CPU Burst to prevent cgroup throttling on latency-sensitive workloads