
Exploring Alibaba Group's PouchContainer Resource Management APIs – Part 2

In this article, we will introduce the common APIs of PouchContainer resource management and corresponding underlying kernel APIs.

PouchContainer is Alibaba Group's efficient, open source, enterprise-class container engine technology featuring strong isolation, high portability and low resource consumption. This article will introduce you to the common APIs of PouchContainer resource management and corresponding underlying kernel APIs.

The following is a detailed description of each resource management API, with test cases provided for ease of understanding. PouchContainer 0.4.0 is used in these cases. If the stress command is not available in your image, you can install it via sudo apt-get install stress.

1. Memory Resource Management

1.1 -m, --memory

This API limits the amount of memory used by the container; the corresponding cgroup file is cgroup/memory/memory.limit_in_bytes.

Unit: B, KB, MB, GB

By default, a container can consume an unlimited amount of memory until the host's memory resources are exhausted.

Run the following command to confirm that the cgroup file corresponds to the resource management of the container memory.

# pouch run -ti --memory 100M reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash -c "cat /sys/fs/cgroup/memory/memory.limit_in_bytes"
104857600

It can be seen that when the memory is limited to 100 MB, the corresponding value of the cgroup file is 104,857,600 bytes, which equals 100 MB.
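The unit conversion can be sketched as follows (to_bytes is a hypothetical helper for illustration, not PouchContainer's actual parser):

```python
# Convert the human-readable sizes accepted by --memory (e.g. "100M",
# "1GB") into the byte values written to memory.limit_in_bytes.
# Binary units, as the cgroup files use.
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}

def to_bytes(size: str) -> int:
    s = size.lower().rstrip("b")   # accept "M", "MB", "m", "mb", ...
    if s and s[-1] in UNITS:
        return int(s[:-1]) * UNITS[s[-1]]
    return int(s)                  # bare number of bytes

print(to_bytes("100M"))   # 104857600
print(to_bytes("1G"))     # 1073741824
```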

The local memory environment is as follows:

# free -m
              total        used        free      shared  buff/cache   available
Mem:         257755        2557      254234           1         963      254903
Swap:          2047           0        2047

We use the stress tool to verify that the memory limit is in effect. The following command creates a process in the container that constantly allocates (malloc) and frees (free) memory. Theoretically, as long as the memory used stays below the limit, the container works normally. Note that trying the boundary value, that is, occupying exactly 100 MB inside the container with the stress tool, usually fails because other processes are also running in the container.

We then attempt to occupy 150 MB of memory in a container whose memory usage is limited to 100 MB. The operation completes normally and no OOM occurs.

# pouch run -ti --memory 100M reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04stress stress --vm 1 --vm-bytes 150M
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd

Check the system's memory usage with the following command; you will find that swap usage has increased, indicating that the --memory option does not limit swap usage.

# free -m
              total        used        free      shared  buff/cache   available
Mem:         257755        2676      254114           1         965      254783
Swap:          2047          41        2006

After disabling swap with the swapoff -a command, we execute the previous command again. As the following log shows, an error occurs when the container uses more memory than the limit.

# pouch run -ti --memory 100M reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04stress stress --vm 1 --vm-bytes 150M
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: FAIL: [1](422) kill error: No such process
stress: FAIL: [1](452) failed run completed in 0s

1.2 --memory-swap

This API limits the total amount of memory plus swap used by the container; the corresponding cgroup file is cgroup/memory/memory.memsw.limit_in_bytes.

Value range: greater than the memory limit

Unit: B, KB, MB, GB

Run the following command to confirm that the cgroup file corresponds to the resource management of the container swap partition. It can be seen that when memory plus swap is limited to 1 GB, the corresponding value of the cgroup file is 1,073,741,824 bytes, which equals 1 GB.

# pouch run -ti -m 300M --memory-swap 1G reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash -c "cat /sys/fs/cgroup/memory/memory.memsw.limit_in_bytes"
1073741824

As shown below, the container throws an exception when it tries to occupy more memory than the memory-swap limit allows.

# pouch run -ti -m 100M --memory-swap 200M reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04stress bash -c "stress --vm 1 --vm-bytes 300M"
stress: info: [1] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: FAIL: [1](416) <-- worker 10 got signal 9
stress: WARN: [1](418) now reaping child worker processes
stress: FAIL: [1](422) kill error: No such process
stress: FAIL: [1](452) failed run completed in 0s

1.3 --memory-swappiness

This API sets the container's tendency to use the swap partition, as an integer from 0 to 100 (inclusive). 0 indicates that the container does not use the swap partition, and 100 indicates that the container uses the swap partition as much as possible. The corresponding cgroup file is cgroup/memory/memory.swappiness.

# pouch run -ti --memory-swappiness=100 reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash -c 'cat /sys/fs/cgroup/memory/memory.swappiness'
100

1.4 --memory-wmark-ratio

This API is used to calculate low_wmark: low_wmark = memory.limit_in_bytes * MemoryWmarkRatio / 100. When memory.usage_in_bytes exceeds low_wmark, a kernel thread is triggered to perform memory reclamation; when memory.usage_in_bytes falls below high_wmark, reclamation stops. The corresponding cgroup file is cgroup/memory/memory.wmark_ratio.

# pouch run -ti --memory-wmark-ratio=60 reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash -c 'cat /sys/fs/cgroup/memory/memory.wmark_ratio'
60
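Assuming the ratio is interpreted as a percentage of the memory limit, the watermark arithmetic can be sketched as follows (low_wmark is a hypothetical helper; the real calculation happens inside the kernel):

```python
# Illustrative sketch of the watermark arithmetic described above,
# assuming MemoryWmarkRatio is a percentage of the memory limit.
def low_wmark(limit_in_bytes: int, wmark_ratio: int) -> int:
    return limit_in_bytes * wmark_ratio // 100

# With --memory 100M (104,857,600 bytes) and --memory-wmark-ratio=60,
# background reclamation would start once usage exceeds about:
print(low_wmark(104857600, 60))  # 62914560 bytes (60 MB)
```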

1.5 --oom-kill-disable

When out-of-memory (OOM) occurs, the system kills the container process by default. If you do not want the container process to be killed, you can use this API. This API corresponds to the cgroup file cgroup/memory/memory.oom_control.

OOM is triggered when the container attempts to use more memory than the limit. There are then two cases: with --oom-kill-disable=false, the container is killed; with --oom-kill-disable=true, the container is suspended.

The following command sets the container's memory usage limit to 20 MB and sets --oom-kill-disable to true. Checking the cgroup file corresponding to this API shows that the value of oom_kill_disable is 1.

# pouch run -m 20m --oom-kill-disable=true reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash -c 'cat /sys/fs/cgroup/memory/memory.oom_control'
oom_kill_disable 1
under_oom 0

oom_kill_disable: a value of 0 or 1. 1 indicates that when the container tries to use more memory than the limit (i.e. 20 MB), the container will be suspended instead of killed.

under_oom: a value of 0 or 1. When the value is 1, OOM has occurred in the container.

Use x=a; while true; do x=$x$x$x$x; done to occupy as much memory as possible and force OOM to be triggered. The log is as follows.

# pouch run -m 20m --oom-kill-disable=false reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash -c 'x=a; while true; do x=$x$x$x$x; done'

[root@r10d08216.sqa.zmf /root]
# echo $?
137

As can be seen from the above log, when the container's memory is exhausted, the container exits with exit code 137. Because the container tried to use more memory than the limit, the system triggered OOM, the container was killed, and the under_oom value was set to 1. We can view the value of under_oom through the cgroup file (/sys/fs/cgroup/memory/docker/${container_id}/memory.oom_control) in the system (oom_kill_disable 1, under_oom 1).
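The exit code 137 follows the shell convention of 128 plus the signal number: the OOM killer terminates the process with SIGKILL (signal 9). A quick sketch:

```python
import signal

# A process killed by a signal exits with status 128 + signal number.
# The OOM killer sends SIGKILL (9), so the container exits with 137.
oom_exit_code = 128 + signal.SIGKILL
print(oom_exit_code)  # 137
```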

When --oom-kill-disable=true, the container will not be killed, but will be suspended by the system.

1.6 --oom-score-adj

The parameter --oom-score-adj adjusts how likely the container process is to be selected by the OOM killer. The larger the value, the more likely the container process is to be killed on OOM; when the value is -1000, the container process is never selected. This option corresponds to the underlying API /proc/$pid/oom_score_adj.

# pouch run -ti --oom-score-adj=300 reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04stress bash -c "cat /proc/self/oom_score_adj"
300

2. CPU Resource Management

2.1 --cpu-period

The default scheduling period of the Linux CFS (Completely Fair Scheduler) is 100 ms. We use --cpu-period to set the CPU scheduling period for the container; it needs to be used together with --cpu-quota, which sets the CPU time the container may use within each period. CFS is the kernel's default scheduler for allocating CPU resources to running processes. For multi-core CPUs, adjust the value of --cpu-quota as needed.

It corresponds to the cgroup file cgroup/cpu/cpu.cfs_period_us. The following command creates a container, sets the container's CPU scheduling period to 50,000 microseconds, and verifies the value of the corresponding cgroup file.

# pouch run -ti --cpu-period 50000 reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash -c "cat /sys/fs/cgroup/cpu/cpu.cfs_period_us"
50000

The following command sets the value of --cpu-period to 50,000 and the value of --cpu-quota to 25,000. The container can use at most 50% of one CPU at runtime.

# pouch run -ti --cpu-period=50000 --cpu-quota=25000 reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04stress stress -c 1
stress: info: [1] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd

As can be seen from the last line of the following top output, the CPU usage of the container is about 50.0%, which is in line with expectations.

# top -n1
top - 17:22:40 up 1 day, 57 min,  3 users,  load average: 0.68, 0.16, 0.05
Tasks: 431 total,   2 running, 429 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.1 sy,  0.0 ni, 99.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 26354243+total, 25960588+free,  1697108 used,  2239424 buff/cache
KiB Swap:  2096636 total,        0 free,  2096636 used. 25957392+avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 53256 root      20   0    7324    100      0 R  50.0  0.0   0:12.95 stress
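The ceiling enforced by CFS bandwidth control can be computed directly from the two values (cpu_percent is an illustrative helper, not PouchContainer code):

```python
# Expected CPU ceiling under CFS bandwidth control: the container may
# run for cpu.cfs_quota_us out of every cpu.cfs_period_us microseconds.
def cpu_percent(quota_us: int, period_us: int) -> float:
    return 100.0 * quota_us / period_us

print(cpu_percent(25000, 50000))  # 50.0, matching the %CPU column above
```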

2.2 --cpu-quota

It corresponds to the cgroup file cgroup/cpu/cpu.cfs_quota_us.

# pouch run -ti --cpu-quota 1600 reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash -c "cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us"
1600

The API --cpu-quota sets the CPU time the container may use within each period. Normally it needs to be used with the API --cpu-period; for detailed usage, refer to the description of --cpu-period.

2.3 --cpu-share

It corresponds to the cgroup file cgroup/cpu/cpu.shares.

# pouch run -ti --cpu-share 1600 reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash -c "cat /sys/fs/cgroup/cpu/cpu.shares"
1600

--cpu-share sets the weight for the container's CPU usage. This weight only matters for CPU-intensive processes: if the processes in a container are idle, other containers can use the CPU resources that the idle container would otherwise be entitled to. In other words, the --cpu-share setting only takes effect when two or more containers compete for the full CPU resource.

We use the following commands to create two containers with weights of 1024 and 512 respectively.

# pouch run -d --cpuset-cpus=0 --cpu-share 1024 reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04stress stress -c 1
c7b99f3bc4cf1af94da35025c66913d4b42fa763e7a0905fc72dce66c359c258

[root@r10d08216.sqa.zmf /root]
# pouch run -d --cpuset-cpus=0 --cpu-share 512 reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04stress stress -c 1
1ade73df0dd9939cc65e05117e3b0950b78079fb36f6cc548eff8b20e8f5ecb9

As can be seen from the top output, the PID of the first container's process is 10513 with CPU usage of 65.1%, and the PID of the second container's process is 10687 with CPU usage of 34.9%. The CPU usage ratio of the two containers is approximately 2:1, in line with expectations.

# top
top - 09:38:24 up 3 min,  2 users,  load average: 1.20, 0.34, 0.12
Tasks: 447 total,   3 running, 444 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.1 us,  0.0 sy,  0.0 ni, 96.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 26354243+total, 26187224+free,   964068 used,   706120 buff/cache
KiB Swap:  2096636 total,  2096636 free,        0 used. 26052548+avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 10513 root      20   0    7324    100      0 R  65.1  0.0   0:48.22 stress
 10687 root      20   0    7324     96      0 R  34.9  0.0   0:20.32 stress
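The expected split can be computed from the weights (expected_share is an illustrative helper, not part of PouchContainer):

```python
# Under contention on the same core, cpu.shares divides CPU time in
# proportion to each container's weight.
def expected_share(weight: int, *all_weights: int) -> float:
    return 100.0 * weight / sum(all_weights)

print(expected_share(1024, 1024, 512))  # ~66.7% for the first container
print(expected_share(512, 1024, 512))   # ~33.3% for the second
```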

2.4 --cpuset-cpus

This API corresponds to the cgroup file cgroup/cpuset/cpuset.cpus.

On a virtual machine with a multi-core CPU, start a container restricted to CPU core 1, and check that the corresponding cgroup file of the API is set to 1. The log is as follows.

# pouch run -ti --cpuset-cpus 1 reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash -c "cat /sys/fs/cgroup/cpuset/cpuset.cpus"
1

Use the following command to pin the container to CPU core 1 and run the stress command.

# pouch run -ti --cpuset-cpus 1 reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 stress -c 1

The top output for viewing CPU usage is as follows. Note that after starting top, pressing the number key 1 displays the status of each CPU core in the terminal.

# top
top - 17:58:38 up 1 day,  1:33,  3 users,  load average: 0.51, 0.11, 0.04
Tasks: 427 total,   2 running, 425 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

From the above log, only CPU core 1 is loaded at 100%, while the other CPU cores are idle; the result is in line with expectations.

2.5 --cpuset-mems

This API corresponds to the cgroup file cgroup/cpuset/cpuset.mems.

# pouch run -ti --cpuset-mems=0 reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash -c "cat /sys/fs/cgroup/cpuset/cpuset.mems"
0

The following command restricts the container process to allocating memory only from memory nodes 1 and 3.

# pouch run -ti --cpuset-mems="1,3" reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash

The following command restricts the container process to allocating memory only from memory nodes 0, 1 and 2.

# pouch run -ti --cpuset-mems="0-2" reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash

3. IO Resource Management

3.1 --blkio-weight

The container's block device IO weight can be set with the API --blkio-weight, an integer ranging from 10 to 1,000 (inclusive). By default, all containers get the same weight (500). It corresponds to the cgroup file cgroup/blkio/blkio.weight. The following command sets the container's block device IO weight to 10; in the log you can see that the value of the corresponding cgroup file is 10.

# pouch run -ti --rm --blkio-weight 10 reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash -c "cat /sys/fs/cgroup/blkio/blkio.weight"
10

Use the following two commands to create containers for different block device IO weight values.

# pouch run -it --name c1 --blkio-weight 300 reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 /bin/bash
# pouch run -it --name c2 --blkio-weight 600 reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 /bin/bash

Block device operations (such as the following command) are performed simultaneously in the two containers. You will find that the time spent is inversely proportional to the container's block device IO weight.

# time dd if=/mnt/zerofile of=test.out bs=1M count=1024 oflag=direct
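The expected slowdown can be sketched from the weights (time_ratio is a hypothetical helper for illustration):

```python
# Under blkio weighting, concurrent containers split bandwidth in
# proportion to their weights, so completion time for the same amount
# of IO is inversely proportional to weight.
def time_ratio(weight_a: int, weight_b: int) -> float:
    return weight_b / weight_a  # how much longer container a takes than b

# c1 (weight 300) vs c2 (weight 600): c1 should take about twice as long.
print(time_ratio(300, 600))  # 2.0
```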

3.2 --blkio-weight-device

The IO weight of a specific block device can be set with the API --blkio-weight-device="deviceName:weight", where weight is an integer ranging from 10 to 1,000 (inclusive).

It corresponds to the cgroup file cgroup/blkio/blkio.weight_device.

# pouch run --blkio-weight-device "/dev/sda:1000" reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash -c "cat /sys/fs/cgroup/blkio/blkio.weight_device"
8:0 1000

The "8:0" in the above log indicates the device number of the SDA. You can use the stat command to obtain the device number of a device. You can see that the primary device number corresponding to /dev/sda is 8 and the secondary device number is 0.

# stat -c %t:%T /dev/sda
8:0
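Python's os module exposes the same major/minor packing and can be used to interpret such pairs (a sketch; the 8:0 value for /dev/sda is taken from the log above):

```python
import os

# The "8:0" pair is the device's major and minor number.
# os.makedev / os.major / os.minor perform the kernel's packing.
dev = os.makedev(8, 0)               # device number for /dev/sda above
print(os.major(dev), os.minor(dev))  # 8 0
```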

If the API --blkio-weight-device is used together with --blkio-weight, PouchContainer uses the value of --blkio-weight as the default weight and then uses the value of --blkio-weight-device to set the weight for the specified device; the default weight does not take effect for that device.

# pouch run --blkio-weight 300 --blkio-weight-device "/dev/sda:500" reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash -c "cat /sys/fs/cgroup/blkio/blkio.weight_device"
8:0 500

As can be seen from the above log, when the API --blkio-weight is used with the API --blkio-weight-device, the weight of the /dev/sda device is determined by the value set by --blkio-weight-device.

3.3 --device-read-bps

This API limits the read rate of a specified device. The unit can be KB, MB, or GB. It corresponds to the cgroup file cgroup/blkio/blkio.throttle.read_bps_device.

# pouch run -it --device /dev/sda:/dev/sda --device-read-bps /dev/sda:1mb reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash -c "cat /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device"
8:0 1048576

The above log shows 8:0 1048576, in which 8:0 means /dev/sda, and 1,048,576 is the number of bytes in 1 MB (1,024 × 1,024).

Use the API --device-read-bps to set the device read rate to 500 KB/s when creating the container. As can be seen from the following log, the read rate is limited to 498 KB/s, which is in line with expectations.

# pouch run -it --device /dev/sda:/dev/sda --device-read-bps /dev/sda:500k reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash
root@r10f10195:/# dd iflag=direct,nonblock if=/dev/sda of=/dev/null bs=5000k count=1
1+0 records in
1+0 records out
5120000 bytes (5.1 MB) copied, 10.2738 s, 498 kB/s
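The expected duration follows directly from the throttle: 5,120,000 bytes at 500 KB/s (500 × 1,024 bytes per second) should take about 10 seconds, consistent with the 10.27 s measured above. A sketch (expected_seconds is a hypothetical helper):

```python
# Expected transfer time under blkio byte-rate throttling.
def expected_seconds(total_bytes: int, rate_bytes_per_s: int) -> float:
    return total_bytes / rate_bytes_per_s

print(expected_seconds(5120000, 500 * 1024))  # 10.0
```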

3.4 --device-write-bps

This API limits the write rate of a specified device. The unit can be KB, MB, or GB. It corresponds to the cgroup file cgroup/blkio/blkio.throttle.write_bps_device.

# pouch run -it --device /dev/sda:/dev/sda --device-write-bps /dev/sda:1mB reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash -c "cat /sys/fs/cgroup/blkio/blkio.throttle.write_bps_device"
8:0 1048576

The above log shows 8:0 1048576, in which 8:0 means /dev/sdb, and 1,048,576 is the number of bytes in 1 MB (1,024 × 1,024).

Use the API --device-write-bps to set the device write rate to 1 MB/s when creating the container. As can be seen from the following log, the write rate is limited to 1.0 MB/s, which is in line with expectations.

Rate limiting operation:

# pouch run -it --device /dev/sdb:/dev/sdb --device-write-bps /dev/sdb:1mB reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash
root@r10d08216:/# dd oflag=direct,nonblock of=/dev/sdb if=/dev/urandom bs=10K count=1000
1024+0 records in
1024+0 records out
10485760 bytes (10 MB) copied, 10.0022 s, 1.0 MB/s

3.5 --device-read-iops

This API limits the number of read IO operations per second for a specified device, and corresponds to the cgroup file cgroup/blkio/blkio.throttle.read_iops_device.

# pouch run -it --device /dev/sda:/dev/sda --device-read-iops /dev/sda:400 reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash -c "cat /sys/fs/cgroup/blkio/blkio.throttle.read_iops_device"
8:0 400

The read IOPS of /dev/sda can be limited to 400 operations per second with "--device-read-iops /dev/sda:400". The log is as follows.

# pouch run -it --device /dev/sda:/dev/sda --device-read-iops /dev/sda:400 reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash
root@r10d08216:/# dd iflag=direct,nonblock if=/dev/sda of=/dev/null bs=1k count=1024
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB) copied, 2.51044 s, 418 kB/s
root@r10d08216:/#

As can be seen from the above log, IO reads are limited to 400 per second, and a total of 1,024 reads are needed (count=1024 in the dd command). The measured execution time is 2.51044 seconds, close to the expected 2.56 (1,024/400) seconds.
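The expected duration under an IOPS throttle is simply operations divided by the limit (expected_seconds is a hypothetical helper):

```python
# Expected duration under IOPS throttling: operations / (ops per second).
def expected_seconds(ops: int, iops_limit: int) -> float:
    return ops / iops_limit

print(expected_seconds(1024, 400))  # 2.56, close to the measured 2.51 s
```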

3.6 --device-write-iops

This API limits the number of write IO operations per second for a specified device, and corresponds to the cgroup file cgroup/blkio/blkio.throttle.write_iops_device.

# pouch run -it --device /dev/sda:/dev/sda --device-write-iops /dev/sda:400 reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash -c "cat /sys/fs/cgroup/blkio/blkio.throttle.write_iops_device"
8:0 400

The write IOPS of /dev/sdb can be limited to 400 operations per second with "--device-write-iops /dev/sdb:400". The log is as follows.

# pouch run -it --device /dev/sdb:/dev/sdb --device-write-iops /dev/sdb:400 reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04 bash
root@r10d08216:/# dd oflag=direct,nonblock of=/dev/sdb if=/dev/urandom bs=1K count=1024
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB) copied, 2.50754 s, 418 kB/s

As can be seen from the above log, IO writes are limited to 400 per second, and a total of 1,024 writes are needed (count=1024 in the dd command). The measured execution time is 2.50754 seconds, close to the expected 2.56 (1,024/400) seconds.

3.7 Other resource management APIs

--pids-limit

--pids-limit limits the number of PIDs within a container, and corresponds to the cgroup file cgroup/pids/pids.max.

# pouch run -ti --pids-limit 100 reg.docker.alibaba-inc.com/sunyuan/ubuntu:14.04stress bash -c "cat /sys/fs/cgroup/pids/pids.max"
100

If processes are continuously created inside the container, the system reports the following error once the limit is reached.

bash: fork: retry: Resource temporarily unavailable
bash: fork: retry: Resource temporarily unavailable

4. Summary

PouchContainer's resource management relies on the underlying Linux kernel technologies. You can add targeted tests as needed to learn more; the kernel implementations themselves are beyond the scope of this article. For more information, refer to the kernel documentation and the PouchContainer community documentation.

Alibaba System Software