Memory Model and Synchronization Primitive - Part 2: Memory Model

By Feixu (Jeffle Xu jefflexu@linux.alibaba.com)

Concept

What Is a Memory Model?

The memory model is usually discussed under the multi-processor architecture.

In the multi-processor architecture, multiple processors need to implement information transmission synchronously. Generally, two mechanisms are used.

One is the message-passing mechanism. In this mechanism, each processor has a local memory location and can only access its local memory location. The information transmission between processors depends on explicit message operation. Therefore, the system must implement two sets of instructions to access local memory and message operations. This brings additional complexity to the architecture design and program design.

All the processors in the system share the same shared-memory location. Now, the system only needs to implement one set of instructions to access the memory. Information transmission between multiple processors is achieved through reading and writing shared-memory. This simplifies the architecture design and program design.

However, under the shared-memory mechanism, multiple processors may access the same memory location in parallel. In this case, a set of rules must be abstracted to describe the results returned. This set of rules is called the memory consistency model. Sometimes, it is also called memory model for short.

In other words, the memory model describes when the effect of a write operation is visible to the current thread or other threads after a thread performs a write operation on a memory address in a multi-threaded environment. How long after the write operation can the read operation of the current thread (or other threads) reflect the values written?

Since shared-memory systems allow multiple processors to simultaneously read and write the same memory locations, programmers require a conceptual model for the semantics of memory operations to allow them to correctly use the shared memory. This model is typically referred to as a memory consistency model or memory model.

Kourosh Gharachorloo - "Memory Consistency Models for Shared-Memory Multiprocessors"

The source code is compiled by the compiler into machine code with a high-level language and is executed on a specific hardware processor. This process goes through multiple layers of abstraction. Therefore, a corresponding memory model can be implemented at each abstraction layer, such as the operating system (Linux), language (C/C++), and architecture (X86/ARM).

Ordering

Before going deep into the memory model, it is necessary to introduce several concepts.

1. Program Order

Program order describes the order of instructions in the source code. Since the memory model only focuses on memory operation instructions, program order refers to the order of memory operation instructions in the source code.

Given the following source code and program order, the write instruction to "buf" memory precedes the write instruction to "flag" memory:

int buf = 1;
int flag = 1;

2. Execution Order

Execution order refers to the order in which memory operation instructions are executed on the processor. The execution order may be inconsistent with the program order because the compiler and the processor will rearrange the compiled instruction order. This is called reordering, which is an important optimization method for modern compilers and processors. By doing so, the performance of applications can be improved.

For the preceding sample code, according to the execution order, the write instruction to the "flag" memory may come before the "buf" memory.

Sequential Memory Model

The most straightforward memory model is the sequential consistency memory model (SC).

SC in Single-Processor

A single-processor works naturally with the sequential memory model. Therefore, SC can be simply understood as that; all the accesses to the memory are executed serially. The memory model here is definite. The value read by the read operation is the value written by the last write operation on the location in program order.

The reordering between the compiler and the processor is carried out without changing the execution result of a single-threaded program. Although the reorder still exists with a single processor, only commands for non-memory operations or memory operations at different memory locations are reordered.

For example, the following example code contains three memory commands: WRITE(buf), WRITE(flag), and READ(flag).

int buf = 1;
int flag = 1;
int x = flag;

WRITE(buf) may be arranged after WRITE(flag) (or after READ(flag)), which is out of order. However, there will be no out-of-order between WRITE(flag) and READ(flag), because the reordering operation ensures that memory operation instructions at the same location will not be reordered.

Since the memory model (SC) under the single-processor scenario is clear, the discussion of the memory model in the single-processor scenario is usually of little significance. Only the discussion of memory model in multi-processor scenarios makes sense. The premises include:

It is in a multi-processor scenario.
A thread is run on each processor.
These threads running on multiple processors access the same memory location in parallel. Therefore, the following discussions on the memory model all meet the three premises above.

SC in Multi-Processor

SC Without Cache

When the computer architecture evolves to the multi-processor architecture, a simple idea emerges. This idea applies SC in the single-processor architecture to the multi-processor architecture.

In this case, SC in the multi-processor architecture is defined like this:

A multiprocessor system is sequentially consistent i] the result of any execution is the same as if the operations of all the processors were executed in some sequential order and the operations of each processor appear in this sequence in the order specified by its program.

Leslie Lamport - "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs"

In short, the memory access instructions of multiple processors must be executed in series. The following two conditions must be met in the system to execute this:

Maintain a single sequential order among operations from all processors
Maintain program order among operations from individual processors

First of all, the memory access instructions among multiple processors must be executed in series according to program order. Only one processor can execute the memory access instructions at one time. This is similar to reusing memory resources among multiple processors in a time-based manner.

For example:

P1                          P2
buf = 1;                    while (flag ==1) {
flag = 1;                       data = buf;
                            }

For the procedure above, it is necessary to ensure that the two processors access the "flag" memory location in series to implement SC. During program execution, either processor P1 or processor P2 accesses the "flag" first. The "flag" memory cannot be accessed by both processors simultaneously.

SC cannot be implemented with only the conditions above. For example, in the preceding sample code, the instructions executed by the processor P1 are reordered by the compiler or the processor. As a result, WRITE(flag) is rearranged before WRITE(buf). Imagine the following execution sequence:

P1: WRITE(flag)
P2: READ(flag)
P2: READ(buf)
P1: WRITE(buf)

At this time, the value of "buf" read by processor P2 is still the original value, which is 0. It does not conform to the semantics of the sequential memory model because the compiler or processor reorders the memory instructions on a single processor.

As mentioned earlier, the reordering of compilers or processors does not affect the semantics of the sequential memory model under the single-processor architecture. Here, the reordering is imperceptible to processor P1. Whether the reordering occurs or not does not affect the results of the code execution on P1. However, other processors can perceive the reordering on P1.

Therefore, another condition must be met to comply with the sequential memory model semantics. The memory operation instructions on a single processor must be executed in series according to program order.

Therefore, the sufficient conditions for the sequential memory model are:

Each processor issues memory requests in the order specified by its program.
Memory requests from all processors issued to an individual memory module are serviced from a single FIFO queue. Issuing a memory request consists of placing the request in this queue.

Leslie Lamport - "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs"

SC With Cache

The sufficient conditions for the sequential memory model described above are obtained without discussing cache or when no cache exists in the architecture. However, modern computer architectures generally use the cache mechanism to reduce memory access latency. In this case, the preceding sufficient conditions are not applicable. Please see the following sample code:

P1                  P2                  P3
A = 1               u = A               v = B
                    B = 1               w = A

Assume the value of A exists in the cache of the processor P1/2/3 at the same time. Consider the following execution order:

P1: WRITE(A)
P2: READ(A) (new value)
P2: WRITE(B)
P3: READ(B) (new value)
P3: READ(A) (old value)

When processor P1 executes WRITE(A), it needs to send an invalid message to processor P2/3.
Due to differences in processor topologies, the invalid message sent by P1 may arrive at each processor at different times. Assume the invalid message arrives first at processor P2 and processor P2 executes READ(A) and WRITE(B) in sequence. During the execution of WRITE(B), processor P2 also sends an invalid message to other processors.
Assume the invalid message sent by processor P2 arrives at processor P3 before the invalid message sent by processor P1. As a result, the final value of A read by processor P3 is still the original one, which is in line with the semantics of the sequential memory model.

There is an implicit premise that memory operations are all atomic in terms of sufficient conditions for the sequential memory model. This is met naturally in the architecture without cache. However, after the cache mechanism is introduced, copies of the value in the same memory location may be stored in the caches of multiple processors. At this point, sending an invalid message to other processors is far from enough to ensure the atomicity of the write operation because the invalid message may have a different delay when it is received by other processors. Therefore, a time interval exists in this process, which causes different results of the write operation perceived by other processors. Processor P2 perceives the newly updated value of A, while processor P3 perceives the old one.

Therefore, new sufficient conditions of the sequential memory model applicable to the cache architecture have been updated:

Each processor issues memory requests in the order specified by its program.
After a write operation is issued, the issuing processor should wait for the write to complete before issuing its next operation.
After a read operation is issued, the issuing processor should wait for the read to complete and for the write whose value is being returned by the read to complete before issuing its next operation.
Write operations to the same location are serialized in the same order for all processors.

Christoph Scheurich and Michel Dubois - "Correct Memory Operation of Cache-Based Multiprocessors"

The new sufficient conditions include the atomicity assurance of memory access operations in the cache architecture.

The Write Atomicity: After a write operation is performed on the memory, the write operation will not end until the system receives an invalid response message from all other processors. Then, other memory access operations can be performed.
The Read Atomicity: The read operation is finished only when the previous write operation on the same memory location is completed. Then, other memory access operations can be performed.

For example, after processor 1 performs a write operation on a memory location, it sends an invalid message to the other processors. Then, when processor 2 performs a read operation on that memory location, the read operation of processor 2 can only be returned when all the other processors receive the invalid response message from processor 1. Processor 1 receives all the invalid response messages from other processors.

The system must meet the following conditions to ensure the sequential memory model semantics:

Program Order Requirement: A processor must ensure that its previous memory operation is complete before proceeding with its next memory operation in program order.
The Write Atomicity: Writes to the same location are serialized (i.e., writes to the same location are made visible in the same order to all processors) and that value of a write is not returned by a read until all invalid messages or updates generated by the write are acknowledged.

Relaxed Memory Model

The semantics of the sequential memory model is relatively strict. This directly limits the widely used optimization measures under the modern processor architecture. For example, the reordering optimization of compilers and processors cannot be used, thus the system performance degrades.

The requirements for the sequential memory model are usually so strict that the performance might be affected. Sometimes, memory consistency can also be implemented with lower requirements for some aspects. Meanwhile, the performance is better than the sequential memory model. This kind of memory model is collectively referred to as the relaxed memory model.

Most modern processors use the relaxed memory model, which has fewer restrictions than the sequential consistency model. It usually loosens the synchronization requirements in the following aspects:

Program Order Requirement

In the sequential memory model, memory instructions in the single-processor scenarios can only be executed in series, regardless of whether the memory locations for the operation are the same. In the relaxed memory model, serial execution of two consecutive memory instructions is required only when the memory locations of two instructions are the same. This serial execution process is divided into the following four types:

Read After Write (RAW)
Between Two Writes (WAW)
Write After Read (WAR)
Between Two Reads (RAR)

The relaxed memory model can also be divided into several types according to the degree of looseness.

X86

Processor Consistency

X86 architecture adopts a memory model called processor consistency (PC) where RAW requirements are loosened:

Reads are not reordered with other reads.
Writes are not reordered with older reads.
Writes to memory are not reordered with other writes.
Reads may be reordered with older writes to different locations but not with older writes to the same location.
Reads may be reordered with older writes to the same location in case of Intra-Processor Forwarding.

Intel Architecture Software Developer Manual, volume 3A, chapter 8, section 8.2 "Memory Ordering"

In the X86 memory model, reordering may only occur in RAW scenarios. This only applies to operations at different memory locations.

In the following sample code, A and B are memory locations. The following code executes the WRITE(A) and READ(B) operations, respectively.

P1
A = 1
u = B

Since RAW operations at different memory locations can be reordered in the X86 memory model, the WRITE(A) and READ(B) operations mentioned above may be reordered.

Please see the following code, which performs the WRITE(A) and READ(A) operations, respectively:

P1
A = 1
u = A

In principle, the WRITE(A) and READ(A) operations cannot be reordered. The WRITE(A) operation must wait until the processor receives the invalid message from other processors. Then, the READ(A) operation can be performed. If the store buffer mechanism is implemented by the processor, the processor no longer needs to wait to receive the invalid acknowledge message. It can get the value of A directly from the store buffer.

CPU Barrier

The Linux kernel defines three types of CPU barriers:

smp_rmb() ensures the load operation before the barrier is completed before the one after the barrier.
smp_wmb() ensures the store operation before the barrier is completed before the one after the barrier.
smp_mb() ensures the load/store operation before the barrier is completed before the one after the barrier.

Only RAW reordering exists in the X86 memory model, thus:

Both smp_rmb() and smp_wmb() are no-op
smp_mb() is implemented with the "lock; addl" instruction to prohibit RAW reordering.

AArch64

Weak Ordering

AArch64 architecture adopts a memory model called weak ordering (WO).

In weak ordering, RAW, WAW, RAR, and WAR can be reordered after performing consecutive instructions at different memory locations. If not, the reordering cannot occur.

All four instruction orders in weak ordering can be reordered. When synchronization is required between different processors, special instructions are used to explicitly synchronize the data. In this way, the reordering optimization of the compiler and processor can be utilized as much as possible to improve the system performance.

CPU Barrier

Since all four instruction orders can be reordered in weak ordering, the three CPU barriers of the Linux kernel in the AArch64 architecture are compiled as DMB instructions.

Data Memory Barrier (DMB) causes the specified type of operations to appear as completed before any subsequent operations of the same type. The "type" of operations can be all operations or restricted to only writes (similar to the Alpha wmb and the POWER eieio instructions.) In addition, ARM allows cache coherence to have one of three scopes: single processor, a subset of the processors ("inner"), and global ("outer").

Community

Memory Model and Synchronization Primitive - Part 2: Memory Model

Concept

What Is a Memory Model?

Ordering

Sequential Memory Model

SC in Single-Processor

SC in Multi-Processor

SC Without Cache

SC With Cache

Relaxed Memory Model

X86

Processor Consistency

CPU Barrier

AArch64

Weak Ordering

CPU Barrier

Read previous post:

Jeffle Xu

You may also like

Comments

Jeffle Xu

Related Products

Alibaba Cloud Linux

Alibaba Cloud Model Studio

Storage Capacity Unit

Hybrid Cloud Storage