Ysera
Investigating the "HUNG" Issue in the Layer-7 Server Load Balancer Instance

Posted: Jul 15, 2016 14:12
We recently received feedback from cnblogs.com reporting that Layer-7 Server Load Balancer instances occasionally experience sudden drops in traffic lasting about 10 seconds. The Server Load Balancer's own monitoring system has also observed this phenomenon.

To address this problem, we conducted a sustained investigation and finally traced the cause to the Nginx configuration. Here, we want to share our investigation and analysis with you, in the hope that it will help you use Nginx better.

The Problem

1. Layer-7 Server Load Balancers (Nginx) encounter occasional sudden drops in traffic lasting about 10 seconds, and these always occur at 12:00 pm or 12:00 am.

2. Looking at the Server Load Balancer traffic patterns, we found that some instances experience traffic surges at 12:00 pm and 12:00 am of as much as dozens of times their normal level.

Because both observations point to the same two times, we concluded that the sudden increase in traffic was causing an Nginx exception, which in turn caused the drop in traffic.

Analysis

3. We observed the traffic of each Nginx instance and found that the server load was far below the threshold; the CPU, MEM, and NET indicators were not high either.

4. Through packet capture, we found that numerous SYN packets were discarded and retransmitted.

Based on this evidence, we suspected a network problem, but found no network exceptions when troubleshooting the protocol stack, network cards, and switches.

5. curl requests to the service statistics interface on the Nginx machine itself also hung. This was the breakthrough.

Packet capture showed that the SYN packets of these locally initiated requests were also being discarded and retransmitted, which ruled out a network problem. The problem had to be in Nginx itself.

We looked at the Linux protocol stack source code and found two possible reasons for SYN packets being discarded:

1. The accept backlog (accept queue) was full
2. Memory was too low for the request sock to be allocated
The core code is as follows:

int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
{

...

   /* reason 1: the accept backlog (accept queue) is full */
   if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
       NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
       goto drop;
   }

   /* reason 2: the request sock cannot be allocated (memory too low) */
   req = inet_reqsk_alloc(&tcp_request_sock_ops);
   if (!req)
       goto drop;

...
}

The machine's memory was sufficient, so the problem could only have been caused by a full accept backlog.
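
This hypothesis can also be checked directly on a live machine: the LINUX_MIB_LISTENOVERFLOWS counter incremented in the kernel code above is exported as the ListenOverflows field of /proc/net/netstat. Below is a minimal C sketch (our addition, not part of the original investigation) that reads it; if the counter climbs during a traffic dip, the accept backlog is indeed overflowing:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>

/*
 * Minimal sketch: print the TcpExt "ListenOverflows" counter, i.e. the
 * LINUX_MIB_LISTENOVERFLOWS statistic from the kernel code above.
 * /proc/net/netstat stores header/value line pairs, e.g.:
 *   TcpExt: SyncookiesSent ... ListenOverflows ListenDrops ...
 *   TcpExt: 0 ... 42 42 ...
 * Field positions vary across kernel versions, so we look the name up.
 */
int main(void)
{
    char names[4096], values[4096];
    FILE *fp = fopen("/proc/net/netstat", "r");

    if (fp == NULL) {
        perror("fopen /proc/net/netstat");
        return 1;
    }

    while (fgets(names, sizeof(names), fp) != NULL &&
           fgets(values, sizeof(values), fp) != NULL)
    {
        if (strncmp(names, "TcpExt:", 7) != 0)
            continue;

        char *sn, *sv;
        char *n = strtok_r(names, " \n", &sn);
        char *v = strtok_r(values, " \n", &sv);

        /* walk the header and value tokens in lockstep */
        while (n != NULL && v != NULL) {
            if (strcmp(n, "ListenOverflows") == 0) {
                printf("ListenOverflows: %s\n", v);
                fclose(fp);
                return 0;
            }
            n = strtok_r(NULL, " \n", &sn);
            v = strtok_r(NULL, " \n", &sv);
        }
    }

    fclose(fp);
    fprintf(stderr, "ListenOverflows not found\n");
    return 1;
}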

However, we configured a separate server { listen vip; } block for each virtual IP, so how could one full backlog affect the virtual IPs of other business systems?

We then returned to the Nginx configuration file:

http {
    server {
        listen 1.1.1.1:80;
        location / {
            return 200 "10.232.6.3:80";
        }
    }

    server {
        listen 1.1.1.2:80;
        location / {
            return 200 "10.232.6.3:80";
        }
    }

    server {
        listen 1.1.1.3:80;
        location / {
            return 200 "10.232.6.3:80";
        }
    }

    ...

    server {
        listen 80;
        location / {
            return 200 "0.0.0.0:80";
        }
    }
}

If you know Nginx well, you may already see the problem.

The configuration file contains a great many virtual IP server blocks, and we had never noticed the "listen 80" directive in the last one (it is used for Nginx health checks and status statistics).

Cause: When Nginx processes bind and listen operations, it first merges all of the listen ip:port pairs. Because of the wildcard "listen 80" in the final block, all of the specific addresses on port 80 are absorbed into it, and Nginx binds only a single 0.0.0.0:80 socket. When a request arrives, Nginx uses the connection's destination IP to look up the corresponding virtual server. This is exactly the issue we observed: every virtual IP shares that one listening socket, so a momentary traffic spike on one of them fills the shared accept backlog and affects the services on all of the other virtual IPs.

We can also see this in the Nginx source code:

void
ngx_http_init_connection(ngx_connection_t *c)
{

   ...

   port = c->listening->servers;

   if (port->naddrs > 1) {

       /*
         * there are several addresses on this port and one of them
         * is an "*:port" wildcard so getsockname() in ngx_http_server_addr()
         * is required to determine a server address
         */

       if (ngx_connection_local_sockaddr(c, NULL, 0) != NGX_OK) {
           ngx_http_close_connection(c);
           return;
       }

       switch (c->local_sockaddr->sa_family) {

#if (NGX_HAVE_INET6)
       case AF_INET6:
           sin6 = (struct sockaddr_in6 *) c->local_sockaddr;

           addr6 = port->addrs;

           /* the last address is "*" */

           for (i = 0; i < port->naddrs - 1; i++) {
               if (ngx_memcmp(&addr6[i].addr6, &sin6->sin6_addr, 16) == 0) {
                   break;
               }
           }

           hc->addr_conf = &addr6[i].conf;

           break;
#endif

       default: /* AF_INET */
           sin = (struct sockaddr_in *) c->local_sockaddr;

           addr = port->addrs;

           /* the last address is "*" */

           for (i = 0; i < port->naddrs - 1; i++) {
               if (addr[i].addr == sin->sin_addr.s_addr) {
                   break;
               }
           }

           hc->addr_conf = &addr[i].conf;

           break;
       }

   } else {

       switch (c->local_sockaddr->sa_family) {

#if (NGX_HAVE_INET6)
       case AF_INET6:
           addr6 = port->addrs;
           hc->addr_conf = &addr6[0].conf;
           break;
#endif

       default: /* AF_INET */
           addr = port->addrs;
           hc->addr_conf = &addr[0].conf;
           break;
       }
   }

   /* the default server configuration for the address:port */
   hc->conf_ctx = hc->addr_conf->default_server->ctx;

   ...
}

Here, when the connection is initialized, Nginx looks up the appropriate virtual server for it by comparing the connection's local address against the addresses configured on the port. We can clearly see that multiple servers can sit behind a single listener, which means these servers share one listen socket and one accept backlog.

Solution

For the cause identified, there are several possible solutions.
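
One obvious mitigation is to enlarge the shared backlog itself. This is our note based on Nginx's documented listen parameters, not a step from the original investigation: the backlog= parameter sets the queue size (511 by default on Linux), and the effective value is additionally capped by the net.core.somaxconn sysctl. Note that this only raises the threshold; it does not isolate the virtual IPs from each other:

    server {
        # Hypothetical mitigation: enlarge the shared accept backlog.
        # The effective queue length is still capped by net.core.somaxconn.
        listen 80 backlog=4096;

        location / {
            return 200 "0.0.0.0:80";
        }
    }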

We solved the problem by changing the wildcard listen 80 to the machine's intranet IP, listen 172.168.1.1:80. This way, Nginx performs a separate bind and listen operation for each virtual IP, isolating the instances so the virtual IPs cannot affect each other.
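
Concretely, only the last server block of the configuration shown earlier changes (a sketch; the surrounding blocks stay as they are):

    server {
        # Before: "listen 80;" -- a wildcard that merged every *:80
        # listener into a single socket. Binding the machine's intranet
        # IP instead gives this server its own socket and backlog.
        listen 172.168.1.1:80;

        location / {
            return 200 "0.0.0.0:80";
        }
    }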

This means that a sudden increase of traffic to one virtual IP can no longer cause the entire service to hang.
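
Nginx also provides a more direct switch for this situation; this is our note based on the documented parameters of the listen directive, not part of the original fix. The bind parameter forces a separate bind() call for an address:port pair even when a wildcard listener exists on the same port, so each virtual IP again gets its own socket and backlog:

    server {
        # Hypothetical alternative: force a dedicated socket for this
        # virtual IP even though a wildcard "listen 80" exists elsewhere.
        listen 1.1.1.1:80 bind;

        location / {
            return 200 "10.232.6.3:80";
        }
    }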