
Memory Model and Synchronization Primitive - Part 1: Memory Barrier

Part 1 of this two-part series explains the memory barrier and its associated functions in depth.

By Feixu (Jeffle Xu jefflexu@linux.alibaba.com)

Introduction

Memory Reordering

Let's introduce memory reordering before introducing the memory barrier.

Memory reordering occurs when instructions execute in an order different from the order written in the code. It usually happens in the following two stages:

  1. Compiler Reordering (Compile Time)
  2. CPU Reordering (Run Time)

Memory reordering is an optimization designed to make code run faster. Its basic principle is: "Memory reordering shall not modify the behavior of a single-threaded program." Therefore, memory reordering does not affect single-threaded programs.

However, since this optimization is carried out by the compiler and the CPU, neither of them knows whether the upper-layer software is single-threaded or multi-threaded.

If the software is single-threaded, memory reordering improves the running speed without changing the program's semantics. If it is multi-threaded, however, this optimization can break the program's semantics.

Memory Barrier

The memory reordering introduced above may affect the correct execution of a program. Memory barriers are needed to solve this problem.

Memory barriers receive little attention in daily programming because they are already built into locking primitives such as spinlock and mutex. Only lock-free programming requires developers to use memory barriers explicitly.

As mentioned above, memory reordering includes compiler reordering and CPU reordering. Correspondingly, memory barriers include compiler barriers and CPU barriers. Both are described in the following sections.

Compiler Barrier

Compiler Reordering

The compiler may apply a series of optimizations during compilation, such as caching variable values in registers and reordering memory-access instructions. The premise of these optimizations is that they must not change the observable behavior of a single-threaded program.

However, in multi-threaded programs, these optimizations can sometimes cause problems. For example:

int a = 1;

void foo(void) {
    while (a) ;
}

void bar(void) {
    a = 0;
}

Suppose there are two threads in the program. One executes the foo() function and the other executes the bar() function. The two threads will access the global variable "a" in parallel.

When compiler optimization is disabled (the default), the compiler generates the following code for foo():

foo:
.L2:
        movl    a(%rip), %eax   // %eax = a
        testl   %eax, %eax      // test if a == 0
        jne     .L2             // reenter the loop if a != 0
        popq    %rbp
        .cfi_def_cfa 7, 8
        ret                     // exit and return if a == 0

When optimization is not enabled, the execution of the foo() function is consistent with the design intent of the program.

Now let's look at the code generated for foo() when compiler optimization (gcc -O2) is enabled:

foo:
.LFB0:
        movl    a(%rip), %eax   // %eax = a
        testl   %eax, %eax      // test if %eax == 0
        jne     .L4
        rep ret                 // exit and return if %eax == 0
.L4:
        jmp     .L4             // infinite loop

The optimized foo() spins in an infinite loop. The generated code is equivalent to the following C code:

int a = 1;

void foo(void) {
        register int reg = a;
        while (reg) ;
}

The compiler caches the memory value in a register as an optimization. This does no harm to single-threaded programs. In a multi-threaded program, however, the thread running foo() gets stuck in an infinite loop even after another thread has modified the value of the global variable "a."

Compiler Barrier

This is the root of the problem: the compiler-level optimization knows nothing about the design of the upper-layer software, so it cannot tell whether the software is single-threaded or multi-threaded. The solution is for the designer of the upper-layer software to tell the compiler explicitly, through a compiler barrier, that when reading a specific variable, its value must be reread from memory rather than reused from a previously cached register value.

Note: The compiler barrier only constrains the compiler, not the CPU's behavior at runtime. It works during compilation and inserts no special instruction into the generated code. (The CPU barriers described below do correspond to special instructions; for example, the AArch64 architecture uses the DMB instruction.) The compiler barrier only affects the arrangement of the memory load/store instructions produced by the compilation.

Note: "barrier()" is purely a compiler barrier. It has zero impact on the CPU pipeline and zero impact on anything that GCC knows isn't visible in memory (i.e. local variables that don't have their address taken), so barrier()is cheap.

barrier()

There are several ways to implement the compiler barrier in Linux.

barrier() is one of the most common compiler barriers. Its implementation is compiler-specific; for the GCC compiler, it is defined in include/linux/compiler-gcc.h:

#define barrier() __asm__ __volatile__("": : :"memory")

In GCC, barrier() is an empty inline-assembly statement that carries only a "memory" clobber. The GCC documentation explains the "memory" clobber as follows:

The "memory" clobber tells the compiler that the assembly code performs memory reads or writes to items other than those listed in the input and output operands. GCC may need to flush specific register values to memory before executing the asm to ensure the memory contains the correct values. Furthermore, the compiler does not assume any values read from memory before an asm remain unchanged after that asm. Instead, it reloads them as needed. Using the "memory" clobber effectively forms a read/write memory barrier for the compiler.

The "memory" clobber tells the compiler that the instruction may read or write to any memory address. Thus, the compiler will hold back the actions, so the memory access operation before this instruction will not be moved after this instruction and vice versa. This prevents instruction reordering due to compiler optimization and ensures the relative order of the code blocks before and after barrier().

There is a side effect of "memory" clobber. The compiler will flush the cached values from all registers. Then, it will reread the values from the memory and caches them into the registers. Therefore, the compiler reordering optimization is suppressed.

The Effect of barrier()

Let's take a look at the effect of barrier():

#define barrier() __asm__ __volatile__("": : :"memory")
int a = 1;

void foo(void) {
    while (a) barrier();
}

When compiler optimization (gcc -O2) is enabled, the generated code for foo() is:

foo:
.LFB0:
        .cfi_startproc
        jmp     .L6
        .p2align 4,,10
        .p2align 3
.L5:
.L6:
        movl    a(%rip), %eax
        testl   %eax, %eax
        jne     .L5
        rep ret
        .cfi_endproc

After barrier() is added, the value of the variable is read from memory on every loop iteration.

Why volatile Is Deprecated

Another way to implement the compiler barrier is by using the volatile keyword in the C language.

The volatile keyword informs the compiler that the value of the variable it qualifies may be changed by factors outside the program; for example, the variable may live in memory that maps hardware register IO space. The compiler is therefore forbidden to cache the variable in a register: each time the variable is used, its value must be reread from memory.

Although both barrier() and volatile suppress compiler optimization, there are still subtle differences between them.

Since volatile qualifies a variable, it is permanently bound to that variable: the variable can never be cached in a register, and its value must be read from memory on every access.

In the following example, the global variable "a" is qualified with volatile and accessed in a loop:

volatile int a = 1;
int b;

void foo(void) {
    while (a) { b = a; }
}

When compilation optimization "gcc -O2" is enabled, the foo() function execution is:

foo:
.LFB0:
        movl    a(%rip), %eax
        testl   %eax, %eax
        je      .L1             // skip the loop if a == 0
.L6:
        movl    a(%rip), %edx   // read a from memory (value for "b = a")
        movl    a(%rip), %eax   // read a from memory again (for the loop test)
        testl   %eax, %eax      // test if a == 0
        jne     .L6
        movl    %edx, b(%rip)   // b = a
.L1:
        rep ret                 // exit and return if a == 0

Each loop iteration reads memory twice, which obviously brings unnecessary overhead.

Again, take a look at the effect of using barrier():

void foo(void) {
    while (a) {
        b = a;
        barrier();
    }
}

Now the generated code is:

foo:
        jmp     .L9
.L7:
        movl    %eax, b(%rip)   // b=a if a != 0
.L9:
        movl    a(%rip), %eax   // %eax = a
        testl   %eax, %eax      // test if %eax == 0
        jne     .L7
        rep ret                 // exit and return if a == 0

The memory is read only once per loop iteration, and the value of "a" cached in the register is reused when "b = a" is executed.

In the example above, barrier() is a better choice than volatile.

The volatile keyword was originally intended for variables in IO-mapped memory. Such variables are mapped to hardware device registers, so their values may be modified by the hardware logic at any time. When accessing them, the compiler must not reuse a value previously cached in a register; it must reread the value from memory every time.

Today, volatile is no longer limited to variables in IO-mapped memory; as shown above, it broadly restricts the compiler's optimization. For this reason, volatile is discouraged in the Linux kernel.

ACCESS_ONCE()

Is barrier() good enough? No. By the definition of the "memory" clobber, a call to barrier() forces the compiler to discard the values cached in all registers and reread them from memory. It acts as a full compiler barrier.

If we only need to suppress compiler reordering for one specific variable, barrier() suppresses it for all variables. It is a far heavier hammer than the targeted action we actually need.

The volatile keyword described in the previous section can suppress compiler optimization for a specific variable only. However, volatile is permanently bound to the variable, while most of the time we want to suppress the optimization for a specific variable only at a specific access.

This is why ACCESS_ONCE() was created. It is defined as:

#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

In the previous examples, the variable is declared volatile for its whole lifetime. ACCESS_ONCE() also uses the volatile keyword to suppress compiler optimization, but it does so through a pointer cast that makes the variable volatile only temporarily. Compiler optimization is thus suppressed only at the access made through ACCESS_ONCE(), while accesses to the variable elsewhere remain fully optimizable.
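
For example, the earlier busy-wait loop can use ACCESS_ONCE() so that only the read inside the loop bypasses the register cache. A minimal sketch, using the ACCESS_ONCE() definition shown above:

int a = 1;

void foo(void) {
    /* Each iteration rereads "a" from memory through the temporary
       volatile cast; accesses to "a" elsewhere in the program can
       still be fully optimized. */
    while (ACCESS_ONCE(a)) continue;
}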

READ_ONCE()/WRITE_ONCE()

READ_ONCE()/WRITE_ONCE() is an upgraded version of ACCESS_ONCE(). Due to the defects of ACCESS_ONCE() (see the LWN article "ACCESS_ONCE() and compiler bugs"), the kernel code gradually replaced ACCESS_ONCE() with READ_ONCE(), and ACCESS_ONCE() was finally removed in v4.15.

Today, both barrier() and READ_ONCE()/WRITE_ONCE() are used in the kernel as compiler barriers. The latter also explicitly marks lock-free accesses to the variables involved.

To be precise, the defect is not in ACCESS_ONCE() itself; a compiler bug caused it to misbehave.

GCC 4.6/4.7 has a bug: when the variable is not of a scalar type, the volatile qualifier is dropped, so ACCESS_ONCE() fails.

For example, accessing the following variable of the pte_t type through ACCESS_ONCE() will fail:

typedef struct {
    unsigned long pte;
} pte_t;

pte_t pte;
pte_t p = ACCESS_ONCE(pte);   /* GCC 4.6/4.7 may drop the volatile here */

READ_ONCE()

The behavior of ACCESS_ONCE() here is actually compatible with C language semantics, but since GCC 4.6/4.7 was widely used at the time, the kernel community decided to improve ACCESS_ONCE() to work around the problem:

#define __READ_ONCE(x, check) \
({                                    \
    union { typeof(x) __val; char __c[1]; } __u; \
    if (check)                        \
        __read_once_size(&(x), __u.__c, sizeof(x)); \
    else                                \
        __read_once_size_nocheck(&(x), __u.__c, sizeof(x)); \
    smp_read_barrier_depends(); /* Enforce dependency ordering from x */ \
    __u.__val;                        \
})
#define READ_ONCE(x) __READ_ONCE(x, 1)

The "__read_once_size()" is called by READ_ONCE().

#define __READ_ONCE_SIZE    \
({                                    \
    switch (size) {            \
    case 1: *(__u8 *)res = *(volatile __u8 *)p; break;   \
    case 2: *(__u16 *)res = *(volatile __u16 *)p; break; \
    case 4: *(__u32 *)res = *(volatile __u32 *)p; break; \
    case 8: *(__u64 *)res = *(volatile __u64 *)p; break; \
    default:                        \
        barrier();                    \
        __builtin_memcpy((void *)res, (const void *)p, size); \
        barrier();                    \
    }                                \
})

static __always_inline
void __read_once_size(const volatile void *p, void *res, int size)
{
    __READ_ONCE_SIZE;
}

The principle behind the workaround is simple: first convert the non-scalar variable into scalar-sized chunks (through the union and the size switch above), and then, exactly as in ACCESS_ONCE(), temporarily cast each scalar chunk to volatile for the access.
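
The idea can be demonstrated outside the kernel with a stripped-down version of the union trick. The read_once() macro below is a simplified illustration, not the kernel's implementation:

#include <assert.h>

typedef struct { unsigned long pte; } pte_t;

/* Simplified READ_ONCE-style copy: the access goes through a
   scalar-sized volatile pointer, so the volatile qualifier cannot
   be dropped the way it was for non-scalar types. */
#define read_once(x) ({                                     \
    union { __typeof__(x) val; unsigned long word; } u;    \
    u.word = *(volatile unsigned long *)&(x);               \
    u.val;                                                  \
})

int main(void) {
    pte_t pte = { .pte = 42 };
    pte_t p = read_once(pte);   /* works although pte_t is a struct */
    assert(p.pte == 42);
    return 0;
}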

WRITE_ONCE()

The corresponding improvement for WRITE_ONCE() has the following signature:

WRITE_ONCE(x, val)
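
For reference, the kernel implementation of the same era mirrors READ_ONCE(): the value is packed through a union and handed to a __write_once_size() helper that performs scalar-sized volatile stores. A sketch of that definition:

#define WRITE_ONCE(x, val) \
({                                    \
    union { typeof(x) __val; char __c[1]; } __u = \
        { .__val = (__force typeof(x)) (val) };   \
    __write_once_size(&(x), __u.__c, sizeof(x));  \
    __u.__val;                        \
})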

CPU Barrier

CPU Reordering

There are two sources of memory reordering, as mentioned above.

One is compiler reordering, which occurs during compilation. The solution is to insert a compiler barrier explicitly wherever compilation optimization must be suppressed.

The other is CPU reordering, which occurs while the CPU is running.

The memory reordering discussed here mainly refers to the reordering of two memory-access (load/store) instructions, so it is divided into the following four types:

  • Read After Write (RAW), StoreLoad
  • Between Two Writes (WAW), StoreStore
  • Write After Read (WAR), LoadStore
  • Between Two Reads (RAR), LoadLoad

In the early days of processors, memory reordering did not exist. It gradually emerged with the introduction of hardware optimization mechanisms such as the local cache, the invalidate queue, and the store buffer. The following sections describe the four types of memory reordering.

References for this section:

  1. Is Parallel Programming Hard, And, If So, What Can You Do About It?
  2. Memory Barriers: A Hardware View for Software Hackers

Store Buffer

The first example of memory reordering is related to the store buffer used by the CPU.

When the CPU performs a store operation, it must send an invalidate message and wait for the invalidate acknowledge messages from all other CPUs before writing the modified data to the cache line. This waiting is unnecessary, however, since the write can be completed asynchronously.

(Figure: writes stalling while waiting for invalidate acknowledge messages)

CPUs usually use a store buffer to resolve this issue.

(Figure: caches with store buffers, one between each CPU and its cache)

A store buffer sits between each CPU and its cache. During a store operation, the CPU can write the modified data directly into the store buffer after sending the invalidate message, and then continue with other work instead of waiting for the invalidate acknowledge messages from all other CPUs. Once all acknowledge messages have arrived, the store buffer writes the buffered data to the cache line.

However, the store buffer described above is a simplified design with a problem. Consider the following code:

a = 1;
b = a + 1;
assert(b == 2);

The initial values of global variables "a" and "b" are 0. Meanwhile, variable "a" is only cached in the cache line of CPU 1, and variable "b" is only cached in the cache line of CPU 0.

Let's imagine CPU 0 runs the preceding code. The running process is listed below:

  1. CPU 0 runs "a=1;" statement. Since the variable "a" is only cached in the cache line of CPU 1 at this time, cache miss occurs during the store operation. After sending a read invalidate message, CPU 0 writes the new value of variable "a" to the store buffer.
  2. CPU 1 receives the read invalidate message from CPU 0 and replies with the read response message and invalidate acknowledge message. The read response message describes the value of variable "a" as 0.
  3. Then, CPU 0 runs the "b=a+1;" statement. CPU 0 receives the read response message from CPU 1, in which the value of the variable "a" is described as 0. Therefore, the value of the variable "a" in the cache line of CPU 0 is updated as 0.
  4. CPU 0 runs "b=a+1;" statement. The value of variable "a" received from the cache line of CPU 0 is 0, and the value of variable "b" in the cache line is updated as 1.

The execution above goes wrong even though the code is single-threaded: CPU 0 holds the same variable both in the store buffer (value 1) and in the cache line (value 0) without any synchronization between the two. The following figure shows an improved store buffer:

(Figure: caches with store forwarding)

In the improved design, the CPU still saves modified values into the store buffer after a store and continues executing subsequent instructions. However, a load that follows a store must now consult the store buffer before the corresponding cache line: if the store buffer holds data for the loaded address, the load uses that buffered data instead of the data in the cache line. This technique is called store forwarding.
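
As a rough, purely illustrative C model of that load path (the types and callback are hypothetical, not real hardware or kernel code):

#include <stdint.h>

#define SB_SIZE 8

/* Hypothetical model of one pending store */
struct sb_entry { uintptr_t addr; uint64_t val; int valid; };

struct cpu_model {
    struct sb_entry sb[SB_SIZE];            /* pending stores, newest last */
    uint64_t (*cache_read)(uintptr_t addr); /* read from the cache line */
};

/* Load with store forwarding: check the store buffer before the cache. */
static uint64_t cpu_load(struct cpu_model *cpu, uintptr_t addr) {
    for (int i = SB_SIZE - 1; i >= 0; i--)      /* newest entry first */
        if (cpu->sb[i].valid && cpu->sb[i].addr == addr)
            return cpu->sb[i].val;              /* forward the pending store */
    return cpu->cache_read(addr);               /* otherwise read the cache */
}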

StoreStore

This section describes the first type of memory reordering, StoreStore.

As mentioned above, the store buffer speeds up the execution of CPU instructions. However, in an SMP system where multiple threads or processes run in parallel on multiple CPUs, the store buffer causes StoreStore reordering.

Example

Please see the following code:

void foo(void) {                    void bar(void) {
    a = 1;                              while (b == 0) continue;
    b = 1;                              assert(a == 1);
}                                   }

CPU 0                               CPU 1
b = 0                               a = 0

The initial values of global variables "a" and "b" are 0. Initially, variable "a" is only cached in the cache line of CPU 1, and variable "b" is only cached in the cache line of CPU 0.

Let's imagine that CPU 0 runs the foo() function and CPU 1 runs the bar() function. The running process is:

  1. CPU 0 executes "a=1". The cache line is not in the cache of CPU 0, so CPU 0 places the new value of "a" in its store buffer and transmits a "read invalidate" message.
  2. CPU 1 executes "while (b == 0) continue", but the cache line containing "b" is not in its cache. Therefore, it transmits a "read" message.
  3. CPU 0 executes "b=1". It already owns this cache line (in other words, the cache line is already in either the "modified" or the "exclusive" state), so it stores the new value of "b" in its cache line.
  4. CPU 0 receives the "read" message and transmits the cache line containing the now-updated value of "b" to CPU 1, also marking the line as "shared" in its cache.
  5. CPU 1 receives the cache line containing "b" and installs it in its cache.
  6. CPU 1 can now finish executing "while (b == 0) continue". Since it finds that the value of "b" is 1, it proceeds to the next statement.
  7. CPU 1 executes the "assert(a == 1)". Since CPU 1 is working with the old value of "a," this assertion fails.
  8. CPU 1 receives the "read invalidate" message and transmits the cache line containing "a" to CPU 0 and invalidates this cache line from its cache, but it is too late.
  9. CPU 0 receives the cache line containing "a" and applies the buffered store just in time to fall victim to the failed assertion of CPU 1.

Therefore, in the example above, CPU 1 first receives the read response message describing the value of variable "b" as 1, and only later receives the invalidate message for variable "a." Hence, from the view of CPU 1, the order of the instructions executed by CPU 0 is:

b = 1;
a = 1;

From the view of CPU 1, memory reordering occurred on CPU 0.

However, from the view of CPU 0, it modified variable "a" first and then variable "b." The execution order of its instructions is:

a = 1;
b = 1;

In this example, two Store instructions are reordered, so it is called StoreStore.

Memory Barrier to Solve This Problem

CPU 0 stores the new value of variable "a" in its store buffer after executing the "a = 1;" statement, and then executes "b = 1;" without waiting for the invalidate acknowledge messages from all other CPUs. This causes the synchronization problem between the caches of CPU 0 and CPU 1.

After the store buffer mechanism is introduced, the execution logic of threads or processes running in parallel on multiple CPUs may be interdependent. Neither the compiler nor the processor designer can perceive this dependency; only the software programmer knows it. Therefore, the programmer usually must use memory barriers explicitly to solve this synchronization problem.

Memory barrier is implemented by the hardware processor. CPUs with different architectures have different memory barrier instructions. Therefore, the programmer needs to explicitly call memory barrier in the code to solve the preceding problem.

void foo(void) {
    a=1;
    smp_mb();
    b=1;
}

void bar(void) {
    while (b == 0) continue;
    assert(a == 1);
}

For example, in the preceding code, the smp_mb() memory barrier implements store-store synchronization. When smp_mb() is executed, the CPU flushes the store buffer: the modified data cached in the store buffer is updated to the cache line once the invalidate acknowledge messages from all other CPUs have been received.

There are two ways to implement this. The CPU can simply stall until the store buffer is drained before executing any store operation after the barrier. Alternatively, the CPU can continue executing, as long as every store after the barrier is also placed in the store buffer, behind the entries present at the barrier, rather than written directly to the cache line. In either case, load operations after smp_mb() may still be performed before the flush completes.

In the code above, let's imagine the smp_mb() memory barrier is not used. When CPU 0 executes the "b = 1;" statement, CPU 0 can directly write the new value of variable "b" to the cache line because the variable "b" is only cached in the cache line of CPU 0, namely, the cache hit.

Now, imagine the smp_mb() memory barrier is used. When CPU 0 executes the "b = 1;" statement, the new value of variable "b" cannot be written to the cache line even though a cache hit occurs, because the store buffer flush is still in progress. Instead, the new value of "b" is written to the store buffer and updated to the cache line only during the store buffer flush.
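
A rough, purely illustrative C model of this marking scheme (hypothetical types, extending the store-buffer sketch above):

#include <stdint.h>

struct sb_entry { uintptr_t addr; uint64_t val; int valid; int marked; };

/* smp_mb() model: mark every store currently waiting in the store buffer. */
static void model_smp_mb(struct sb_entry *sb, int n) {
    for (int i = 0; i < n; i++)
        if (sb[i].valid)
            sb[i].marked = 1;
}

/* While any marked entry remains, later stores must stay in the store
   buffer (as unmarked entries) instead of going to the cache line. */
static int must_buffer_store(const struct sb_entry *sb, int n) {
    for (int i = 0; i < n; i++)
        if (sb[i].valid && sb[i].marked)
            return 1;
    return 0;
}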

After using the smp_mb() memory barrier, the execution process of the preceding code changes to:

  1. CPU 0 executes "a=1". The cache line is not in CPU 0's cache, so CPU 0 places the new value of "a" in its store buffer and transmits a "read invalidate" message.
  2. CPU 1 executes "while (b == 0) continue", but the cache line containing "b" is not in its cache. Therefore, it transmits a "read" message.
  3. CPU 0 executes "smp_mb()", and marks all current store-buffer entries (namely, the "a=1").
  4. CPU 0 executes "b=1". It already owns this cache line (in other words, the cache line is already in either the "modified" or the "exclusive" state), but there is a marked entry in the store buffer. Therefore, rather than store the new value of "b" in the cache line, it instead places it in the store buffer (but in an unmarked entry.)
  5. CPU 0 receives the "read" message and transmits the cache line containing the original value of "b" to CPU 1. It also marks its own copy of this cache line as "shared."
  6. CPU 1 receives the cache line containing "b" and installs it in its cache.
  7. CPU 1 can now finish executing "while (b == 0) continue", but since the value of "b" is still 0, it repeats the while statement. The new value of "b" is safely hidden in CPU 0's store buffer.
  8. CPU 1 receives the "read invalidate" message and transmits the cache line containing "a" to CPU 0 and invalidates this cache line from its cache.
  9. CPU 0 receives the cache line containing "a" and applies the buffered store, placing this line into the "modified" state.
  10. Since the store to "a" was the only entry in the store buffer that was marked by the "smp_mb()", CPU 0 can also store the new value of "b," except for the fact that the cache line containing "b" is now in "shared" state.
  11. Therefore, CPU 0 sends an "invalidate" message to CPU 1.
  12. CPU 1 receives the "invalidate" message, invalidatefs the cache line containing "b" from its cache, and sends an "acknowledge" message to CPU 0.
  13. CPU 1 executes "while (b == 0) continue", but the cache line containing "b" is not in its cache. Therefore, it transmits a "read" message to CPU 0.
  14. CPU 0 receives the "acknowledge" message and puts the cache line containing "b" into the "exclusive" state. CPU 0 stores the new value of "b" into the cache line.
  15. CPU 0 receives the "read" message and transmits the cache line containing the new value of "b" to CPU 1. It also marks its own copy of this cache line as "shared."
  16. CPU 1 receives the cache line containing "b" and installs it in its cache.
  17. CPU 1 can now finish executing "while (b == 0) continue", and since it finds that the value of "b" is 1, it proceeds to the next statement.
  18. CPU 1 executes the "assert(a == 1)", but the cache line containing "a" is no longer in its cache. Once it gets this cache from CPU 0, it will be working with the up-to-date value of "a" and therefore, the assertion passes.

StoreLoad

The store buffer also introduces StoreLoad reordering.

Please see the following sample code:

CPU 0                   CPU 1
WRITE(A, 1)             WRITE(B, 1)
READ(B)                 READ(A)

CPU 0                   CPU 1
B = 0                   A = 0

The initial values of memory locations A and B are both 0. CPU 0 executes "WRITE(A, 1)" and then "READ(B)", while CPU 1 executes "WRITE(B, 1)" and then "READ(A)". The final result may be that both reads return 0, i.e., "A = B = 0".

Let's imagine variable A is initially cached in the cache line of CPU 1 only, and variable B is only cached in the cache line of CPU 0. Let's see the following process:

  1. When CPU 0 executes "A = 1", cache miss occurs. Therefore, the new value of A is saved in store buffer and an invalidate message is sent to CPU 1.
  2. Then, CPU 0 continues to perform "READ(B)". The cache hit occurs and the value of B is 0.
  3. When CPU 1 executes "B = 1", cache miss occurs. Therefore, the new value of B is saved in the store buffer, and an invalidate message is sent to CPU 0.
  4. Then, CPU 1 continues to perform "READ(A)". The cache hit occurs, and the value of A is 0.
  5. CPU 1 then receives the invalidate message for variable A, invalidates its copy, and replies with an invalidate acknowledge message; CPU 0 can now drain "A = 1" from its store buffer into its cache line.
  6. Likewise, CPU 0 receives the invalidate message for variable B, invalidates its copy, and acknowledges it; CPU 1 can now drain "B = 1" from its store buffer into its cache line.

The preceding memory reordering is called StoreLoad reordering. The solution is to insert the StoreLoad memory barrier between Store and Load operations. When the StoreLoad memory barrier is executed, the store buffer is flushed. In this case, the CPU must wait for the data in the store buffer to be cleared, before proceeding with memory access operations after the memory barrier.

CPU 0                    CPU 1
WRITE(A, 1)             WRITE(B, 1)
smp_mb()                smp_mb()
READ(B)                 READ(A)

Invalidation Queue

The second example of memory reordering is related to the invalidate queue used by the CPU.

The store buffer is usually small in capacity and will soon be filled up in the following two cases:

  1. When a cache miss occurs during a store operation, the modified data must be cached in the store buffer before the following instructions can proceed. If cache misses occur during several consecutive store operations, the store buffer may fill up. The CPU must then wait for the invalidate acknowledge messages from the other CPUs, because updating the buffered data to the cache lines is the only way to free space in the store buffer.
  2. After the CPU executes a memory barrier, all subsequent store operations must be cached in the store buffer, whether the corresponding cache lines are ready or not. The store buffer may then fill up quickly, and again the CPU must wait for the invalidate acknowledge messages from the other CPUs.

In both cases, when the store buffer fills up, the CPU must wait for the invalidate acknowledge messages from the other CPUs. The invalidate queue reduces this waiting time.

A CPU may delay its invalidate acknowledge message because it must invalidate the corresponding cache line before acknowledging. When the CPU's load/store workload is heavy, it tends to set the received invalidate message aside for a while, processing it and replying with the invalidate acknowledge message only when the load becomes lighter.

With an invalidate queue, the CPU can place a received invalidate message into the queue and send the invalidate acknowledge message immediately. In exchange, before the CPU itself transmits an invalidate message for a cache line, it must check whether a pending invalidate message for that line already sits in its own invalidate queue.
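
A rough, purely illustrative C model of that receive path (the types and callback are hypothetical, not real hardware or kernel code):

#include <stdint.h>

#define IQ_SIZE 16

/* Hypothetical model of a per-CPU invalidate queue */
struct inv_queue {
    uintptr_t line[IQ_SIZE];   /* cache lines with pending invalidations */
    int       count;
};

/* On receiving an "invalidate" message: queue it and acknowledge at once,
   instead of invalidating the cache line first. */
static int recv_invalidate(struct inv_queue *iq, uintptr_t line,
                           void (*send_ack)(uintptr_t)) {
    if (iq->count == IQ_SIZE)
        return -1;                 /* queue full: process entries first */
    iq->line[iq->count++] = line;  /* defer the actual invalidation */
    send_ack(line);                /* reply immediately */
    return 0;
}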

(Figure: caches with invalidate queues)

LoadLoad

However, the introduction of the invalidate queue mechanism may cause the LoadLoad reordering.

Example

Please see the following code:

void foo(void) {
    a=1;
    smp_mb();
    b=1;
}

void bar(void) {
    while (b == 0) continue;
    assert(a == 1);
}

The initial values of the global variables "a" and "b" are both 0. Variable "a" is initially cached by both CPU 0 and CPU 1 in the shared state, and variable "b" is only cached in the cache line of CPU 0.

Let's imagine CPU 0 runs the foo() function and CPU 1 runs the bar() function. The running process is:

  1. CPU 0 executes "a=1". The corresponding cache line is read-only in CPU 0's cache, so CPU 0 places the new value of "a" in its store buffer and transmits an "invalidate" message to flush the corresponding cache line from CPU 1's cache.
  2. CPU 1 executes "while (b == 0) continue", but the cache line containing "b" is not in its cache. Therefore, it transmits a "read" message.
  3. CPU 1 receives CPU 0's "invalidate" message, queues it, and immediately responds to it.
  4. CPU 0 receives the response from CPU 1 and is free to proceed past the "smp_mb()" in foo(), moving the value of "a" from its store buffer to its cache line.
  5. CPU 0 executes "b=1". It already owns this cache line (in other words, the cache line is already in either the "modified" or the "exclusive" state), so it stores the new value of "b" in its cache line.
  6. CPU 0 receives the "read" message and transmits the cache line containing the now-updated value of "b" to CPU 1, also marking the line as "shared" in its cache.
  7. CPU 1 receives the cache line containing "b" and installs it in its cache.
  8. CPU 1 can now finish executing "while (b == 0) continue". Since it finds the value of "b" is 1, it proceeds to the next statement.
  9. CPU 1 executes the "assert(a == 1)". Since the old value of "a" is still in CPU 1's cache, this assertion fails.
  10. Despite the assertion failure, CPU 1 processes the queued "invalidate" message and (tardily) invalidates the cache line containing "a" from its cache.

Memory Barrier to Solve This Problem

The root cause is that CPU 1 does not check its invalidate queue for a pending invalidate message for variable "a" before performing the load that reads the value of "a." The fix is to add a memory barrier in bar() as well:

void foo(void) {
    a=1;
    smp_mb();
    b=1;
}

void bar(void) {
    while (b == 0) continue;
    smp_mb();
    assert(a == 1);
}

When a LoadLoad memory barrier is executed between two load operations, it marks the cache lines that have pending invalidate messages in the invalidate queue. A load after the barrier whose cache line is marked cannot proceed until the corresponding pending invalidate message has been processed.

A simpler implementation is for the LoadLoad memory barrier to stall until every pending invalidate message in the invalidate queue has been processed, and only then execute the loads after the barrier.

The code execution process is:

  1. CPU 0 executes "a=1". The corresponding cache line is read-only in CPU 0's cache, so CPU 0 places the new value of "a" in its store buffer and transmits an "invalidate" message to flush the corresponding cache line from CPU 1's cache.
  2. CPU 1 executes "while (b == 0) continue", but the cache line containing "b" is not in its cache. Therefore, it transmits a "read" message.
  3. CPU 1 receives CPU 0's "invalidate" message, queues it, and immediately responds to it.
  4. CPU 0 receives the response from CPU 1 and is free to proceed past the "smp_mb()" in foo(), moving the value of "a" from its store buffer to its cache line.
  5. CPU 0 executes "b=1". It already owns this cache line (in other words, the cache line is already in either the "modified" or the "exclusive" state), so it stores the new value of "b" in its cache line.
  6. CPU 0 receives the "read" message and transmits the cache line containing the now-updated value of "b" to CPU 1, also marking the line as "shared" in its cache.
  7. CPU 1 receives the cache line containing "b" and installs it in its cache.
  8. CPU 1 can now finish executing "while (b == 0) continue". Since it finds the value of "b" is 1, it proceeds to the next statement, which is now a memory barrier.
  9. CPU 1 must now stall until it processes all pre-existing messages in its invalidation queue.
  10. CPU 1 now processes the queued "invalidate" message and invalidates the cache line containing "a" from its cache.
  11. CPU 1 executes the "assert(a == 1)". Since the cache line containing "a" is no longer in CPU 1's cache, it transmits a "read" message.
  12. CPU 0 responds to this "read" message with the cache line containing the new value of "a."
  13. CPU 1 receives this cache line, which contains a value of 1 for "a," so the assertion does not trigger.

LoadStore

TODO

CPU Barrier

As described above, the solution to CPU reordering is to use CPU barriers:

CPU Reordering      CPU Barrier
LoadLoad            smp_rmb()
StoreStore          smp_wmb()
full                smp_mb()

The Linux kernel defines multiple types of CPU barriers and the specific implementation of them is architecture-related:

  • smp_rmb() is the read barrier. The load operation before the barrier is completed before the load operation after the barrier.
  • smp_wmb() is the write barrier. The store operation before the barrier is completed before the store operation after the barrier.
  • smp_mb() is the read/write barrier. The load/store operation before the barrier is completed before the load/store operation after the barrier.
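
As an illustration of how smp_wmb() and smp_rmb() pair up, consider the classic publish/consume pattern. A minimal kernel-style sketch (the variable names and the scenario are illustrative):

int data;
int ready;

/* Producer (e.g., on CPU 0): publish the data before setting the flag. */
void producer(void) {
    data = 42;
    smp_wmb();      /* order the data store before the flag store */
    ready = 1;
}

/* Consumer (e.g., on CPU 1): observe the flag before reading the data. */
void consumer(void) {
    while (ready == 0) continue;
    smp_rmb();      /* order the flag load before the data load */
    assert(data == 42);
}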

CPU barriers compile into architecture-specific instructions. For example, on the ARM architecture they compile into DMB instructions:

#define smp_mb()    dmb()
#define smp_rmb()    dmb()
#define smp_wmb()    dmb()

#define dmb() __asm__ __volatile__ ("dmb" : : : "memory")

Data Memory Barrier (DMB) causes the specified type of operations to appear as completed before any subsequent operations of the same type. The "type" of operations can be all operations or restricted to only writes (similar to the Alpha wmb and the POWER eieio instructions.) In addition, ARM allows cache coherence to have one of three scopes: single processor, a subset of the processors ("inner"), and global ("outer").
