Hyperledger Fabric Deployment on Alibaba Cloud Environment – SIGSEGV Problem Analysis and Solutions

In this article, we analyze the SIGSEGV errors that occur when deploying Hyperledger Fabric on Alibaba Cloud and discuss how to solve them.

According to recent feedback from the Hyperledger community, fatal SIGSEGV errors occur when the open-source blockchain project Hyperledger Fabric is deployed in an Alibaba Cloud environment. Having worked through this problem myself, I want to share my analysis process and solution for your reference.

Problem Description

During deployment of Hyperledger Fabric, the peer and orderer services fail to start, and running cli-test.sh in the CLI container returns an error. In every case, the error message is signal SIGSEGV: segmentation violation.

The following is an error log example:

2017-11-01 02:44:04.247 UTC [peer] updateTrustedRoots -> DEBU 2a0 Updating trusted root authorities for channel mychannel
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x63 pc=0x7f9d15ded259]
runtime stack:
runtime.throw(0xdc37a7, 0x2a)
        /opt/go/src/runtime/panic.go:566 +0x95
        /opt/go/src/runtime/sigpanic_unix.go:12 +0x2cc
goroutine 64 [syscall, locked to thread]:
runtime.cgocall(0xb08d50, 0xc4203bcdf8, 0xc400000000)
        /opt/go/src/runtime/cgocall.go:131 +0x110 fp=0xc4203bcdb0 sp=0xc4203bcd70
net._C2func_getaddrinfo(0x7f9d000008c0, 0x0, 0xc420323110, 0xc4201a01e8, 0x0, 0x0, 0x0)

Analysis Process

After in-depth analysis and testing, and with a hint from the Hyperledger Fabric bug report at https://jira.hyperledger.org/browse/FAB-5822, I found a way to identify and solve the problem.

In the Docker Compose YAML file, I added the line GODEBUG=netdns=go to the PEER, ORDERER, and CLI environment variables. With this setting, the pure Go resolver is used instead of the cgo resolver. The error log points to the cgo resolver as the source of the error: the crashing goroutine is inside the cgo call net._C2func_getaddrinfo.

I then looked into the conditions under which Go switches between the cgo resolver and the pure Go resolver:

Note: The following is quoted from the official Go documentation for package net at https://golang.org/pkg/net/.

Name Resolution

The method for resolving domain names, whether indirectly with functions like Dial or directly with functions like LookupHost and LookupAddr, varies by operating system.

On Unix systems, the resolver has two options for resolving names. It can use a pure Go resolver that sends DNS requests directly to the servers listed in /etc/resolv.conf, or it can use a cgo-based resolver that calls C library routines such as getaddrinfo and getnameinfo.

By default, the pure Go resolver is used because a blocked DNS request consumes only a goroutine, while a blocked C call consumes an operating system thread. When cgo is available, the cgo-based resolver is used instead under a variety of conditions: on systems that do not let programs make direct DNS requests (OS X), when the LOCALDOMAIN environment variable is present (even if empty), when the RES_OPTIONS or HOSTALIASES environment variable is non-empty, when the ASR_CONFIG environment variable is non-empty (OpenBSD only), when /etc/resolv.conf or /etc/nsswitch.conf specify the use of features that the Go resolver does not implement, and when the name being looked up ends in .local or is an mDNS name.

The resolver decision can be overridden by setting the netdns value of the GODEBUG environment variable (see package runtime) to go or cgo, as in:

export GODEBUG=netdns=go    # force pure Go resolver
export GODEBUG=netdns=cgo   # force cgo resolver
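
To confirm that a container actually sees the override, it can help to inspect the GODEBUG value at startup. The following is a minimal sketch, not part of Hyperledger Fabric or the Go standard library: forcedResolver is a hypothetical helper that reads the comma-separated GODEBUG settings and reports whether netdns forces a resolver. It assumes the documented form of the setting, in which a debug level may follow a plus sign (as in netdns=go+1), and it simplifies whatever parsing the Go runtime does internally.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// forcedResolver is a hypothetical helper. It reports which resolver a
// GODEBUG value forces: "go", "cgo", or "" when netdns is absent or only
// sets a debug level (for example netdns=1).
func forcedResolver(godebug string) string {
	for _, kv := range strings.Split(godebug, ",") {
		if !strings.HasPrefix(kv, "netdns=") {
			continue
		}
		value := strings.TrimPrefix(kv, "netdns=")
		// The value may carry a debug level after a plus sign, as in "go+1".
		mode := strings.SplitN(value, "+", 2)[0]
		if mode == "go" || mode == "cgo" {
			return mode
		}
	}
	return ""
}

func main() {
	mode := forcedResolver(os.Getenv("GODEBUG"))
	if mode == "" {
		fmt.Println("GODEBUG does not force a resolver; the net package decides")
		return
	}
	fmt.Println("GODEBUG forces the", mode, "resolver")
}
```

Running this inside a peer, orderer, or CLI container after editing the Compose file is one way to check that the variable reached the process environment.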

Comparing the Go and Cgo Resolvers

Based on this clue, I compared the underlying configuration files of the environment where deployment succeeded with those of the environment where it failed, and found the following difference:

In a container in the old environment (where the blockchain deploys successfully), /etc/resolv.conf shows:

# cat /etc/resolv.conf 

options ndots:0

In a container in the new environment (where the blockchain deployment fails), /etc/resolv.conf shows:

# cat /etc/resolv.conf 

options timeout:2 attempts:3 rotate single-request-reopen ndots:0

Because of this difference, the pure Go resolver is used in the environment where deployment succeeds, while the cgo resolver is used in the environment where it fails: the resolv.conf file there contains the option single-request-reopen, which the pure Go resolver does not support.

Note: Currently, the pure Go resolver supports only the ndots, timeout, attempts, and rotate options. Any other option sets conf.unknownOpt, as this excerpt from the Go net package shows:


        case "options": // magic options
            for _, s := range f[1:] {
                switch {
                case hasPrefix(s, "ndots:"):
                    n, _, _ := dtoi(s[6:])
                    if n < 0 {
                        n = 0
                    } else if n > 15 {
                        n = 15
                    }
                    conf.ndots = n
                case hasPrefix(s, "timeout:"):
                    n, _, _ := dtoi(s[8:])
                    if n < 1 {
                        n = 1
                    }
                    conf.timeout = time.Duration(n) * time.Second
                case hasPrefix(s, "attempts:"):
                    n, _, _ := dtoi(s[9:])
                    if n < 1 {
                        n = 1
                    }
                    conf.attempts = n
                case s == "rotate":
                    conf.rotate = true
                default:
                    // Any other option is unknown to the pure Go resolver.
                    conf.unknownOpt = true
                }
            }
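
The same check can be run against any resolv.conf content before deploying. The following is a minimal sketch rather than the net package's actual logic: unknownOptions and isSupported are hypothetical helpers, and the supported set is assumed to be exactly the four options named in the note above (ndots, timeout, attempts, rotate), which may differ in other Go versions.

```go
package main

import (
	"fmt"
	"strings"
)

// supportedPrefixes holds the value-carrying options named in the note above.
// This list is an assumption taken from the article, not read from Go itself.
var supportedPrefixes = []string{"ndots:", "timeout:", "attempts:"}

// isSupported reports whether a single resolv.conf option is in the assumed
// supported set.
func isSupported(opt string) bool {
	if opt == "rotate" {
		return true
	}
	for _, p := range supportedPrefixes {
		if strings.HasPrefix(opt, p) {
			return true
		}
	}
	return false
}

// unknownOptions returns every option on the "options" lines of resolvConf
// that falls outside the assumed supported set.
func unknownOptions(resolvConf string) []string {
	var unknown []string
	for _, line := range strings.Split(resolvConf, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 || fields[0] != "options" {
			continue
		}
		for _, opt := range fields[1:] {
			if !isSupported(opt) {
				unknown = append(unknown, opt)
			}
		}
	}
	return unknown
}

func main() {
	// The two container configurations shown earlier in this article.
	oldEnv := "options ndots:0\n"
	newEnv := "options timeout:2 attempts:3 rotate single-request-reopen ndots:0\n"

	fmt.Println(unknownOptions(oldEnv)) // prints []
	fmt.Println(unknownOptions(newEnv)) // prints [single-request-reopen]
}
```

For the two configurations in this article, only the new environment reports an unsupported option, which matches the observed switch to the cgo resolver.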

Next, I looked into why the resolv.conf file differs between the old and new containers. The cause is that the configuration file on the host ECS instance had changed:

Deployment failure environment - newly created ECS:

# cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)

options timeout:2 attempts:3 rotate single-request-reopen

Deployment successful environment - original ECS:

# cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)

Additionally, I tried to find out why SIGSEGV errors occur when the cgo resolver is used. The article at http://tbg.github.io/golang-static-linking-bug explains SIGSEGV errors caused by statically linked cgo code.

According to the description of the Hyperledger Fabric bug, the Hyperledger Fabric build is statically linked, and this includes the getaddrinfo-related functions. For details, visit: https://jira.hyperledger.org/browse/FAB-6403.

Recommended Solution

By now, we have found the root cause and reconstructed the chain of events: because the resolv.conf file on the new ECS host changed, inter-container domain name resolution in Hyperledger Fabric switched from the pure Go resolver to the cgo resolver. This triggered the known SIGSEGV error in statically linked cgo code, and the deployment of Hyperledger Fabric failed.

To prevent this issue, update the Docker Compose YAML template of Hyperledger Fabric and add the environment variable GODEBUG=netdns=go to all Hyperledger Fabric nodes (such as orderer, peer, ca, and cli) to force the pure Go resolver.
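
For Go programs you build yourself, such as a client that talks to the Fabric nodes, the same preference can be expressed in code. This is a sketch under assumptions, not something Fabric itself does: it relies on the PreferGo field of net.Resolver, which the standard library has offered since Go 1.9 and documents as equivalent to GODEBUG=netdns=go scoped to that one resolver. The helper name newPureGoResolver and the host name localhost are illustrative only.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// newPureGoResolver is a hypothetical helper that returns a resolver
// preferring Go's built-in DNS client over the cgo-based one.
func newPureGoResolver() *net.Resolver {
	return &net.Resolver{PreferGo: true}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// "localhost" is only a placeholder; substitute a peer or orderer host name.
	addrs, err := newPureGoResolver().LookupHost(ctx, "localhost")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(addrs)
}
```

The environment variable remains the practical fix for the prebuilt Fabric images, because it needs no code change or rebuild.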

Alibaba Cloud Container Service Blockchain Solution

Alibaba Cloud Container Service provides a basic solution for automatic configuration and deployment of Hyperledger Fabric, helping developers avoid complex underlying operations and focus on innovation in their blockchain business applications.

For more information, see:

  1. Alibaba Cloud Container Service Blockchain Solution
  2. Blockchain Solution Documentation of Alibaba Cloud Container Service