×
Community Blog PouchContainer Goroutine Leak Detection Practices

PouchContainer Goroutine Leak Detection Practices

We will introduce how PouchContainer deals with goroutine leaks, including what constitutes a goroutine leak, how they can be detected, and the kinds of tools that can be used for analysis.

PouchContainer is an open source container runtime product built by Alibaba Group. It has strong isolation and portability capabilities, which can help enterprises to quickly achieve the containerization of inventory business and better utilize physical resources.

PouchContainer is also a golang project. In this project, goroutines are widely used to implement modules such as container management, image management, and log management. Goroutine is a user-mode "thread" supported by golang at the language level. This native concurrency support helps developers quickly build highly concurrent services.

Although it is easy for goroutine to perform concurrent or parallel operations, if there is a state in which the channel receiver is blocked for a long time and cannot be woken up, then a goroutine leak will occur. A goroutine leak is just as bad as a memory leak, and will constantly occupy resources, causing the system to run slower or even crash. Developers must prevent goroutine leaks if they are to ensure healthy operation of the system. Next we will introduce how PouchContainer deals with goroutine leaks, including what constitutes a goroutine leak, how they can be detected, and the kinds of tools that can be used for analysis.

1. Goroutine Leak

In the world of golang, there are a lot of gophers that you can control. They can handle a lot of the same problems at the same time, and they can work together to handle the same issue. As long as you command them properly, problems can be quickly processed. That's right, a gopher is a goroutine that we often talk about. With just a go, you have a gopher, and it will perform the tasks you specify:

func main() {
    waitCh := make(chan struct{})
    go func() {
        fmt.Println("Hi, Pouch. I'm new gopher!")
        waitCh <- struct{}{}
    }()

    <-waitCh
}

Normally, each time a gopher completes a task, it will return and wait for your next call. However, it is also possible for the gopher to not return for an extended period of time.

func main() {
        // /exec?cmd=xx&args=yy runs the shell command in the host
        http.HandleFunc("/exec", func(w http.ResponseWriter, r *http.Request) {
                defer func() { log.Printf("finish %v\n", r.URL) }()
                out, err := genCmd(r).CombinedOutput()
                if err != nil {
                        w.WriteHeader(500)
                        w.Write([]byte(err.Error()))
                        return
                }
                w.Write(out)
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
}

func genCmd(r *http.Request) (cmd *exec.Cmd) {
        var args []string
        if got := r.FormValue("args"); got != "" {
                args = strings.Split(got, " ")
        }

        if c := r.FormValue("cmd"); len(args) == 0 {
                cmd = exec.Command(c)
        } else {
                cmd = exec.Command(c, args...)
        }
        return
}

The above code will start the HTTP server, which will allow the client to remotely execute shell commands via HTTP requests. For example, you can use curl "{ip}:8080/exec?cmd=ps&args=-ef" to view processes on the server side. After execution, the gopher will print a log and indicate that the commands have been executed.

However, execution time sometimes exceeds the amount of time the operator is willing to wait for it. For example, the user could issue the command curl -m 3 "{ip}:8080/exec?cmd=dosomething", which indicates that the command must be executed within 3 seconds or the connection will be terminated by the requester. Since the above code does not check that the connection still exists, then even if the requester terminates the connection midway through the operation, the gopher will still complete the operation and issue a return. In the case of a command like curl -m 1 "{ip}:8080/exec?cmd=sleep&args=10000", the gopher will not return in time and system resources will be irrevocably occupied for the duration.

These unreturned and uncontrolled gophers are what we often refer to as goroutine leaks. There are many potential reasons behind a goroutine leak, for example if a channel does not have a sender, then it will become a leak. After running the following code, you will find that the runtime stably displays a total of two goroutines, one of which is the main function itself, and the other is the gopher that is waiting for input data.

func main() {
        logGoNum()

        // without sender and blocking....
        var ch chan int
        go func(ch chan int) {
                <-ch
        }(ch)

        for range time.Tick(2 * time.Second) {
                logGoNum()
        }
}

func logGoNum() {
        log.Printf("goroutine number: %d\n", runtime.NumGoroutine())
}

There are several situations in which a goroutine leak may occur. Next we will describe how to detect a leak according to Pouch Logs API calls, and provide some solutions to fix them.

2. Pouch Logs API Practices

2.1 Detailed Scenarios

For a better illustration, this article simplifies the code for the Pouch Logs HTTP Handler:

func logsContainer(ctx context.Context, w http.ResponseWriter, r *http.Request) {
    ...
    writeLogStream(ctx, w, msgCh)
    return
}

func writeLogStream(ctx context.Context, w http.ResponseWriter, msgCh <-chan Message) {
    for {
        select {
        case <-ctx.Done():
            return
        case msg, ok := <-msgCh:
            if !ok {
                return
            }
            w.Write(msg.Byte())
        }
    }
}

The Logs API Handler will start the goroutine to read logs and pass the data to writeLogStream via a channel. writeLogStream will return the data to the caller. This Logs API has a follow-up function that will continuously display new log content until the container stops. But for the caller, it will terminate the request at any time. So how do we detect whether there is a remaining goroutine?

When the connection is terminated and the Handler needs to send data to the client, the error "write: broken pipe" will occur. Normally the goroutine will exit. But if the Handler waits for too long, the operation is a goroutine leak.

2.2 How to Detect a goroutine Leak

For the HTTP server, we usually view the status of the current process by introducing the package net/http/pprof. One operation to view the goroutine stack information is {ip}:{port}/debug/pprof/goroutine?debug=2. Let's take a look at the goroutine stack information after the caller actively terminates the connection.

# step 1: create background job
pouch run -d busybox sh -c "while true; do sleep 1; done"

# step 2: follow the log and stop it after 3 seconds
curl -m 3 {ip}:{port}/v1.24/containers/{container_id}/logs?stdout=1&follow=1

# step 3: after 3 seconds, dump the stack info
curl -s "{ip}:{port}/debug/pprof/goroutine?debug=2" | grep -A 10 logsContainer
github.com/alibaba/pouch/apis/server.(*Server).logsContainer(0xc420330b80, 0x251b3e0, 0xc420d93240, 0x251a1e0, 0xc420432c40, 0xc4203f7a00, 0x3, 0x3)
        /tmp/pouchbuild/src/github.com/alibaba/pouch/apis/server/container_bridge.go:339 +0x347
github.com/alibaba/pouch/apis/server.(*Server).(github.com/alibaba/pouch/apis/server.logsContainer)-fm(0x251b3e0, 0xc420d93240, 0x251a1e0, 0xc420432c40, 0xc4203f7a00, 0x3, 0x3)
        /tmp/pouchbuild/src/github.com/alibaba/pouch/apis/server/router.go:53 +0x5c
github.com/alibaba/pouch/apis/server.withCancelHandler.func1(0x251b3e0, 0xc420d93240, 0x251a1e0, 0xc420432c40, 0xc4203f7a00, 0xc4203f7a00, 0xc42091dad0)
        /tmp/pouchbuild/src/github.com/alibaba/pouch/apis/server/router.go:114 +0x57
github.com/alibaba/pouch/apis/server.filter.func1(0x251a1e0, 0xc420432c40, 0xc4203f7a00)
        /tmp/pouchbuild/src/github.com/alibaba/pouch/apis/server/router.go:181 +0x327
net/http.HandlerFunc.ServeHTTP(0xc420a84090, 0x251a1e0, 0xc420432c40, 0xc4203f7a00)
        /usr/local/go/src/net/http/server.go:1918 +0x44
github.com/alibaba/pouch/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc4209fad20, 0x251a1e0, 0xc420432c40, 0xc4203f7a00)
        /tmp/pouchbuild/src/github.com/alibaba/pouch/vendor/github.com/gorilla/mux/mux.go:133 +0xed
net/http.serverHandler.ServeHTTP(0xc420a18d00, 0x251a1e0, 0xc420432c40, 0xc4203f7800)

We can see that the logsContainer goroutine still exists in the current process. Because this container does not have any chance to output any logs, this goroutine cannot exit with the error write: broken pipe, so it will continuously occupy system resources. So how can we solve this problem?

2.3 Solutions

The package net/http provided by golang features the ability to monitor for terminated connections:

// HTTP Handler Interceptors
func withCancelHandler(h handler) handler {
        return func(ctx context.Context, rw http.ResponseWriter, req *http.Request) error {
                // https://golang.org/pkg/net/http/#CloseNotifier
                if notifier, ok := rw.(http.CloseNotifier); ok {
                        var cancel context.CancelFunc
                        ctx, cancel = context.WithCancel(ctx)

                        waitCh := make(chan struct{})
                        defer close(waitCh)

                        closeNotify := notifier.CloseNotify()
                        go func() {
                                select {
                                case <-closeNotify:
                                        cancel()
                                case <-waitCh:
                                }
                        }()
                }
                return h(ctx, rw, req)
        }
}

When the request has not been executed, the client actively exits, then CloseNotify() will receive the corresponding message and cancel it via context.Context, so that we can solve the goroutine leak problem. In the world of golang, you will often see goroutines for read and write; the first parameter of the function will usually have context.Context, so you can control the recovery of goroutine through WithTimeout and WithCancel.

CloseNotify is not applicable to scenarios involving Hijack connections, because after the Hijack, the connection is handled by the Handler, rather than the HTTP server.

So can we automate this kind of detection? Next we will describe some commonly used analysis tools that can be utilized to this end.

3. Commonly Used Analysis Tools

3.1 net/http/pprof

When developing the HTTP server, we can introduce the package net/http/pprof to enable debug mode, and then access the goroutine stack information via /debug/pprof/goroutine. A goroutine stack will generally be designed according to the following style.

goroutine 93 [chan receive]:
github.com/alibaba/pouch/daemon/mgr.NewContainerMonitor.func1(0xc4202ce618)
        /tmp/pouchbuild/src/github.com/alibaba/pouch/daemon/mgr/container_monitor.go:62 +0x45
created by github.com/alibaba/pouch/daemon/mgr.NewContainerMonitor
        /tmp/pouchbuild/src/github.com/alibaba/pouch/daemon/mgr/container_monitor.go:60 +0x8d

goroutine 94 [chan receive]:
github.com/alibaba/pouch/daemon/mgr.(*ContainerManager).execProcessGC(0xc42037e090)
        /tmp/pouchbuild/src/github.com/alibaba/pouch/daemon/mgr/container.go:2177 +0x1a5
created by github.com/alibaba/pouch/daemon/mgr.NewContainerManager
        /tmp/pouchbuild/src/github.com/alibaba/pouch/daemon/mgr/container.go:179 +0x50b

The goroutine stack usually contains a goroutine ID on the first line, and detailed stack call information on the next few lines. With the stack call information, we can detect leaks by keyword matching.

In integration testing of Pouch, the Pouch Logs API is interested in goroutine stacks containing (Server).logsContainer. So after the follow-up mode is tested, the debug API is called to check whether the call stack of (Server).logsContainer is included. If we find that the call stack is included, it means that the goroutine has not been recycled and there is a risk of leaks.

In general, the way the debug API works applies to integration testing, because the test cases and the target service are not in the same process, and the goroutine stack of the target process needs to be dumped to get the leak information.

3.2 runtime.NumGoroutine

When the test cases and the target function/service are in the same process, the change in number of goroutines can be used to determine whether a leak has occurred.

func TestXXX(t *testing.T) {
    orgNum := runtime.NumGoroutine()
    defer func() {
        if got := runtime.NumGoroutine(); orgNum != got {
            t.Fatalf("xxx", orgNum, got)
        }
    }()

    ...
}

3.3 github.com/google/gops

Gops is similar to the package net/http/pprof in that it places an agent into your process and provides a command line API to view the state of the process running. The gops stack ${PID} can be used to view the current goroutine stack state.

Summary

When developing the HTTP server, net/http/pprof helps us analyze the code. If the code is logically complex and there is a potential for leaks, mark some functions that might leak and use them as part of the test so that the automated CI can spot problems before the code is reviewed.

0 0 0
Share on

You may also like

Comments

Related Products