Guided reading of "Computer Organization and Design: The Hardware/Software Interface" (original English edition, 5th edition, RISC-V Edition), Part II: Instructions: Language of the Computer - Alibaba Cloud Developer Community

from: Huazhang Publishing House 2019-11-11 1147

Introduction: The book focuses on the most basic concepts in current computer design, shows the relationship between software and hardware, and comprehensively introduces the mainstream technology and the latest achievements in contemporary computer systems. It lists the complete MIPS instruction set and introduces the basics of networks and multiprocessor structures. Closely linking CPU performance with program performance is new in this edition, and the discussion of software and hardware is more in-depth. The author shows how software and hardware components affect program performance, and the CD provides materials for readers who focus on hardware or on software.


I speak Spanish to God, Italian to women, French to men, and German to my horse.

2.1 Introduction

To command a computer's hardware, you must speak its language. The words of a computer's language are called instructions, and its vocabulary is called an instruction set. In this chapter, you will see the instruction set of a real computer, both in the form written by people and in the form read by the computer. We introduce instructions in a top-down fashion. Starting from a notation that looks like a restricted programming language, we refine it step-by-step until you see the actual language of a real computer. Chapter 3 continues our downward descent, unveiling the hardware for arithmetic and the representation of floating-point numbers.

You might think that the languages of computers would be as diverse as those of people, but in reality computer languages are quite similar, more like regional dialects than independent languages. Hence, once you learn one, it is easy to pick up others.

The chosen instruction set is RISC-V, which was originally developed at UC Berkeley starting in 2010. To demonstrate how easy it is to pick up other instruction sets, we will also take a quick look at two other popular instruction sets.

  1. MIPS is an elegant example of the instruction sets designed since the 1980s. In several respects, RISC-V follows a similar design.
  2. The Intel x86 originated in the 1970s, but still today powers both the PC and the Cloud of the post-PC era.

This similarity of instruction sets occurs because all computers are constructed from hardware technologies based on similar underlying principles and because there are a few basic operations that all computers must provide. Moreover, computer designers have a common goal: to find a language that makes it easy to build the hardware and the compiler while maximizing performance and minimizing cost and energy. This goal is time-honored; the following quote was written before you could buy a computer, and it is as true today as it was in 1947:

The "simplicity of the equipment" is as valuable a consideration for today's computers as it was for those of the 1950s. The goal of this chapter is to teach an instruction set that follows this advice, showing both how it is represented in hardware and the relationship between high-level programming languages and this more primitive one. Our examples are in the C programming language; a later section shows how these would change for an object-oriented language like Java.

By learning how to represent instructions, you will also discover the secret of computing: the stored-program concept. Moreover, you will exercise your "foreign language" skills by writing programs in the language of the computer and running them on the simulator that comes with this book. You will also see the impact of programming languages and compiler optimization on performance. We conclude with a look at the historical evolution of instruction sets and an overview of other computer dialects.

We reveal our first instruction set a piece at a time, giving the rationale along with the computer structures. This top-down, step-by-step tutorial weaves the components with their explanations, making the computer's language more palatable. Figure 2.1 gives a sneak preview of the instruction set covered in this chapter.

2.2 Operations of the Computer Hardware

Every computer must be able to perform arithmetic. The RISC-V assembly language notation

add a, b, c

instructs a computer to add the two variables b and c and to put their sum in a.

This notation is rigid in that each RISC-V arithmetic instruction performs only one operation and must always have exactly three variables. For example, suppose we want to place the sum of four variables b, c, d, and e into variable a. (In this section, we are being deliberately vague about what a "variable" is; in the next section, we'll explain in detail.)

The following sequence of instructions adds the four variables:

add a, b, c    // The sum of b and c is placed in a
add a, a, d    // The sum of b, c, and d is now in a
add a, a, e    // The sum of b, c, d, and e is now in a

Thus, it takes three instructions to sum the four variables.

The words to the right of the double slashes (//) on each line above are comments for the human reader, so the computer ignores them. Note that, unlike other programming languages, each line of this language can contain at most one instruction. Another difference from C is that comments always terminate at the end of a line.

The natural number of operands for an operation like addition is three: the two numbers being added together and a place to put the sum. Requiring every instruction to have exactly three operands, no more and no less, conforms to the philosophy of keeping the hardware simple: hardware for a variable number of operands is more complicated than hardware for a fixed number. This situation illustrates the first of three underlying principles of hardware design:

Design Principle 1: Simplicity favors regularity.

We can now show, in the two examples that follow, the relationship of programs written in higher-level programming languages to programs in this more primitive notation.

For a given function, which programming language likely takes the most lines of code? Put the three representations below in order.

  1. Java
  2. C
  3. RISC-V assembly language

Elaboration: To increase portability, Java was originally envisioned as relying on a software interpreter. The instruction set of this interpreter is called Java bytecodes, which is quite different from the RISC-V instruction set. To get performance close to the equivalent C program, Java systems today typically compile Java bytecodes into native instruction sets like RISC-V. Because this compilation is normally done much later than for C programs, such Java compilers are often called Just In Time (JIT) compilers. Section 2.12 shows how JITs are used later than C compilers in the start-up process, and Section 2.13 shows the performance consequences of compiling versus interpreting Java programs.

2.3 Operands of the Computer Hardware

Unlike programs in high-level languages, the operands of arithmetic instructions are restricted; they must be from a limited number of special locations built directly in hardware called registers. Registers are primitives used in hardware design that are also visible to the programmer when the computer is completed, so you can think of registers as the bricks of computer construction. The size of a register in the RISC-V architecture is 64 bits; groups of 64 bits occur so frequently that they are given the name doubleword in the RISC-V architecture. (Another popular size is a group of 32 bits, called a word in the RISC-V architecture.)

One major difference between the variables of a programming language and registers is the limited number of registers, typically 32 on current computers, like RISC-V. (The history of the number of registers is covered elsewhere in the book.) Thus, continuing in our top-down, stepwise evolution of the symbolic representation of the RISC-V language, in this section we have added the restriction that the three operands of RISC-V arithmetic instructions must each be chosen from one of the 32 64-bit registers.

The reason for the limit of 32 registers may be found in the second of our three underlying design principles of hardware technology:

Design Principle 2: Smaller is faster.

A very large number of registers may increase the clock cycle time simply because it takes electronic signals longer when they must travel farther.

Guidelines such as "smaller is faster" are not absolutes; 31 registers may not be faster than 32. Even so, the truth behind such observations causes computer designers to take them seriously. In this case, the designer must balance the craving of programs for more registers with the designer's desire to keep the clock cycle fast. Another reason for not using more than 32 is the number of bits it would take in the instruction format, as Section 2.5 demonstrates.

Chapter 4 shows the central role that registers play in hardware construction; as we shall see in that chapter, effective use of registers is critical to program performance.

Although we could simply write instructions using numbers for registers, from 0 to 31, the RISC-V convention is x followed by the number of the register, except for a few register names that we will cover later.

Memory Operands

Programming languages have simple variables that contain single data elements, as in these examples, but they also have more complex data structures, such as arrays and structures. These composite data structures can contain many more data elements than there are registers in a computer. How can a computer represent and access such large structures?

Recall the five components of a computer introduced in Chapter 1 and repeated on page 61. The processor can keep only a small amount of data in registers, but computer memory contains billions of data elements. Hence, data structures (arrays and structures) are kept in memory.

As explained above, arithmetic operations occur only on registers in RISC-V instructions; thus, RISC-V must include instructions that transfer data between memory and registers. Such instructions are called data transfer instructions. To access a word or doubleword in memory, the instruction must supply the memory address. Memory is just a large, single-dimensional array, with the address acting as the index to that array, starting at 0. For example, in Figure 2.2, the address of the third data element is 2, and the value of memory[2] is 10.

The data transfer instruction that copies data from memory to a register is traditionally called load. The format of the load instruction is the name of the operation followed by the register to be loaded, then a register and a constant used to access memory. The sum of the constant portion of the instruction and the contents of the second register forms the memory address. The real RISC-V name for this instruction is ld, standing for load doubleword.
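The address calculation for a load can be sketched in C. This is only an illustrative model, not how the hardware is built: the byte array `memory`, the register array `x`, and the helper `load_doubleword` are hypothetical names introduced here, and the toy memory is assumed to be byte-addressed and 1024 bytes long.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MEM_BYTES 1024

static uint8_t  memory[MEM_BYTES]; /* hypothetical byte-addressed memory */
static uint64_t x[32];             /* hypothetical register file x0..x31 */

/* Models "ld rd, offset(rs1)": the memory address is the sum of the
   constant in the instruction and the contents of register rs1. */
static void load_doubleword(unsigned rd, int64_t offset, unsigned rs1)
{
    assert(rd < 32 && rs1 < 32);
    uint64_t address = x[rs1] + (uint64_t)offset;
    assert(address <= MEM_BYTES - 8);   /* stay inside the toy memory */
    if (rd != 0)                        /* register x0 always reads as zero in RISC-V */
        memcpy(&x[rd], &memory[address], sizeof x[rd]);
}
```

With a base address in x22, a call such as `load_doubleword(9, 64, 22)` plays the role of `ld x9, 64(x22)`: it copies the eight bytes starting at address x22 + 64 into x9.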

Elaboration: In many architectures, words must start at addresses that are multiples of 4 and doublewords must start at addresses that are multiples of 8. This requirement is called an alignment restriction. (Chapter 4 suggests why alignment leads to faster data transfers.) RISC-V and Intel x86 do not have alignment restrictions, but MIPS does.

Load doubleword and store doubleword are the instructions that copy doublewords between memory and registers in the RISC-V architecture. Some brands of computers use other instructions along with load and store to transfer data. An architecture with such alternatives is the Intel x86, described in Section 2.17.

Elaboration: Let's put the energy and performance of registers versus memory into perspective. Assuming 64-bit data, registers are roughly 200 times faster (0.25 vs. 50 nanoseconds) and are 10,000 times more energy efficient (0.1 vs. 1000 picoJoules) than DRAM in 2015. These large differences led to caches, which reduce the performance and energy penalties of going to memory (see Chapter 5).

Constant or Immediate Operands

Many times a program will use a constant in an operation, for example, incrementing an index to point to the next element of an array. In fact, more than half of the RISC-V arithmetic instructions have a constant as an operand when running the SPEC CPU2006 benchmarks.

Using only the instructions we have seen so far, we would have to load a constant from memory to use one. (The constants would have been placed in memory when the program was loaded.) For example, to add the constant 4 to register x22, we could use the code

ld x9, AddrConstant4(x3)    // x9 = constant 4
add x22, x22, x9            // x22 = x22 + x9 (where x9 == 4)

assuming that x3 + AddrConstant4 is the memory address of the constant 4.

An alternative that avoids the load instruction is to offer versions of the arithmetic instructions in which one operand is a constant. This quick add instruction with one constant operand is called add immediate or addi. To add 4 to register x22, we just write

addi x22, x22, 4    // x22 = x22 + 4

Constant operands occur frequently; indeed, addi is the most popular instruction in most RISC-V programs. By including constants inside arithmetic instructions, operations are much faster and use less energy than if constants were loaded from memory.

The constant zero has another role, which is to simplify the instruction set by offering useful variations. For example, you can negate the value in a register by using the sub instruction with zero for the first operand. Hence, RISC-V dedicates register x0 to be hard-wired to the value zero. Using frequency to justify the inclusion of constants is another example of the great idea from Chapter 1 of making the common case fast.

Given the importance of registers, what is the rate of increase in the number of registers in a chip over time?

  1. Very fast: They increase as fast as Moore's Law, which predicts doubling the number of transistors on a chip every 18 months.
  2. Very slow: Since programs are usually distributed in the language of the computer, there is inertia in instruction set architecture, and so the number of registers increases only as fast as new instruction sets become viable.

Elaboration: Although the RISC-V registers in this book are 64 bits wide, the RISC-V architects conceived multiple variants of the ISA. In addition to this variant, known as RV64, a variant named RV32 has 32-bit registers, whose reduced cost makes RV32 better suited to very low-cost processors.

Elaboration: The RISC-V offset plus base register addressing is an excellent match to structures as well as arrays, since the register can point to the beginning of the structure and the offset can select the desired element. We'll see such an example in Section 2.13.

Elaboration: The register in the data transfer instructions was originally invented to hold an index of an array with the offset used for the starting address of an array. Thus, the base register is also called the index register. Today's memories are much larger, and the software model of data allocation is more sophisticated, so the base address of the array is normally passed in a register since it won't fit in the offset, as we shall see.

Elaboration: The migration from 32-bit address computers to 64-bit address computers left compiler writers a choice of the size of data types in C. Clearly, pointers should be 64 bits, but what about integers? Moreover, C has the data types int, long int, and long long int. The problems come from converting from one data type to another and having an unexpected overflow in C code that is not fully standard compliant, which unfortunately is not rare code. The table below shows the two popular options:

While each compiler could have different choices, generally the compilers associated with each operating system make the same decision. To keep the examples simple, in this book we'll assume pointers are all 64 bits and declare all C integers as long long int to keep them the same size. We also follow the C99 standard and declare variables used as indexes to arrays to be size_t, which guarantees they are the right size no matter how big the array. They are typically declared the same as long int.
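As a minimal sketch of this declaration convention, the following C function declares its integers as long long int and its array index as size_t. The function name `sum_array` is made up for illustration and does not come from the book.

```c
#include <assert.h>
#include <stddef.h>

/* Sums n array elements. Integers are long long int and the index is
   size_t, following the convention described in the text. */
static long long int sum_array(const long long int *a, size_t n)
{
    assert(a != NULL || n == 0);   /* an empty array may be passed as NULL */
    long long int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

Because size_t is wide enough to index any object the implementation supports, the loop counter cannot overflow before the end of the array is reached.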

2.4 Signed and Unsigned Numbers

First, let's quickly review how a computer represents numbers. Humans are taught to think in base 10, but numbers may be represented in any base. For example, 123 base 10 = 1111011 base 2.

Numbers are kept in computer hardware as a series of high and low electronic signals, and so they are considered base 2 numbers. (Just as base 10 numbers are called decimal numbers, base 2 numbers are called binary numbers.)

A single digit of a binary number is thus the "atom" of computing, since all information is composed of binary digits or bits. This fundamental building block can be one of two values, which can be thought of as several alternatives: high or low, on or off, true or false, or 1 or 0.

Generalizing the point, in any number base, the value of the ith digit d is

$$d \times \text{Base}^{i}$$

where i starts at 0 and increases from right to left. This representation leads to an obvious way to number the bits in the doubleword: simply use the power of the base for that bit. We subscript decimal numbers with ten and binary numbers with two. For example,

$$1011_{\text{two}}$$

represents

$$(1 \times 2^{3}) + (0 \times 2^{2}) + (1 \times 2^{1}) + (1 \times 2^{0}) = 8 + 0 + 2 + 1 = 11_{\text{ten}}$$

We number the bits 0, 1, 2, 3, ... from right to left in a doubleword. The drawing below shows the numbering of bits within a RISC-V doubleword and the placement of the number 1011two (which we must unfortunately split in half to fit on the page of the book):

Since doublewords are drawn vertically as well as horizontally, leftmost and rightmost may be unclear. Hence, the phrase least significant bit is used to refer to the rightmost bit (bit 0 above) and most significant bit to the leftmost bit (bit 63).

The RISC-V doubleword is 64 bits long, so we can represent $2^{64}$ different 64-bit patterns. It is natural to let these combinations represent the numbers from 0 to $2^{64} - 1$ (18,446,744,073,709,551,615ten):

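That is, 64-bit binary numbers can be represented in terms of the bit value times a power of 2 (here $x_i$ means the ith bit of x):

$$(x_{63} \times 2^{63}) + (x_{62} \times 2^{62}) + \cdots + (x_{1} \times 2^{1}) + (x_{0} \times 2^{0})$$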

For reasons we will shortly see, these positive numbers are called unsigned numbers.
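The rule that each bit contributes its value times a power of 2 can be sketched in C. The helper `unsigned_value` is a hypothetical name used only for this illustration; it takes the bits as a string of '0' and '1' characters, most significant bit first, and treats any other character as a 0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Value of a bit string read as an unsigned number:
   the sum of bit_i * 2^i, where bit 0 is the rightmost character. */
static uint64_t unsigned_value(const char *bits)
{
    size_t n = strlen(bits);
    assert(n <= 64);                       /* at most one doubleword */
    uint64_t value = 0;
    for (size_t k = 0; k < n; k++) {
        size_t i = n - 1 - k;              /* bit position of character k */
        if (bits[k] == '1')
            value += (uint64_t)1 << i;     /* add 1 * 2^i */
    }
    return value;
}
```

For instance, `unsigned_value("1011")` adds 2^3 + 2^1 + 2^0, and `unsigned_value("1111011")` reproduces the 123 base 10 example from the start of this section.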

Keep in mind that the binary bit patterns above are simply representatives of numbers. Numbers really have an infinite number of digits, with almost all being 0 except for a few of the rightmost digits. We just don't normally show leading 0s.

Hardware can be designed to add, subtract, multiply, and divide these binary bit patterns. If the number that is the proper result of such operations cannot be represented by these rightmost hardware bits, overflow is said to have occurred. It's up to the programming language, the operating system, and the program to determine what to do if overflow occurs.

Computer programs calculate both positive and negative numbers, so we need a representation that distinguishes the positive from the negative. The most obvious solution is to add a separate sign, which conveniently can be represented in a single bit; the name for this representation is sign and magnitude.

Alas, sign and magnitude representation has several shortcomings. First, it's not obvious where to put the sign bit. To the right? To the left? Early computers tried both. Second, adders for sign and magnitude may need an extra step to set the sign because we can't know in advance what the proper sign will be. Finally, a separate sign bit means that sign and magnitude has both a positive and a negative zero, which can lead to problems for inattentive programmers. Because of these shortcomings, sign and magnitude representation was soon abandoned.

In the search for a more attractive alternative, the question arose as to what would be the result for unsigned numbers if we tried to subtract a large number from a small one. The answer is that it would try to borrow from a string of leading 0s, so the result would have a string of leading 1s.

Given that there was no obvious better alternative, the final solution was to pick the representation that made the hardware simple: leading 0s mean positive, and leading 1s mean negative. This convention for representing signed binary numbers is called two's complement representation:

The positive half of the numbers, from 0 to 9,223,372,036,854,775,807ten ($2^{63} - 1$), use the same representation as before. The following bit pattern (1000 ... 0000two) represents the most negative number -9,223,372,036,854,775,808ten ($-2^{63}$). It is followed by a declining set of negative numbers: -9,223,372,036,854,775,807ten (1000 ... 0001two) down to -1ten (1111 ... 1111two).

Two's complement does have one negative number that has no corresponding positive number: -9,223,372,036,854,775,808ten. Such imbalance was also a worry to the inattentive programmer, but sign and magnitude had problems for both the programmer and the hardware designer. Consequently, every computer today uses two's complement binary representations for signed numbers.

Two's complement representation has the advantage that all negative numbers have a 1 in the most significant bit. Thus, hardware needs to test only this bit to see if a number is positive or negative (with the number 0 considered positive). This bit is often called the sign bit. By recognizing the role of the sign bit, we can represent positive and negative 64-bit numbers in terms of the bit value times a power of 2:

$$(x_{63} \times -2^{63}) + (x_{62} \times 2^{62}) + \cdots + (x_{1} \times 2^{1}) + (x_{0} \times 2^{0})$$

The sign bit is multiplied by $-2^{63}$, and the rest of the bits are then multiplied by positive versions of their respective base values.
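The weighting of the sign bit can be sketched in C. The helper `twos_complement_value` is a hypothetical name for this illustration; it is generalized to an n-bit pattern (an assumption made here so that small examples can be checked by hand), so that with n = 64 the top bit carries the weight $-2^{63}$ described above.

```c
#include <assert.h>
#include <stdint.h>

/* Value of the low n bits of pattern read as a two's complement number:
   the top bit has weight -2^(n-1), every other bit i has weight +2^i. */
static int64_t twos_complement_value(uint64_t pattern, unsigned n)
{
    assert(n >= 1 && n <= 64);
    uint64_t top = (uint64_t)1 << (n - 1);          /* mask of the sign bit */
    int64_t rest = (int64_t)(pattern & (top - 1));  /* bits n-2..0, never negative */
    if ((pattern & top) == 0)
        return rest;                                /* sign bit 0: positive weights only */
    /* -2^63 cannot be formed by negating 2^63 in int64_t, so use INT64_MIN. */
    int64_t top_weight = (n == 64) ? INT64_MIN : -(int64_t)top;
    return top_weight + rest;
}
```

As a small check by hand, the 4-bit pattern 1011 evaluates to -8 + 2 + 1 = -5, while the 64-bit pattern of all 1s evaluates to -1.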

Just as an operation on unsigned numbers can overflow the capacity of hardware to represent the result, so can an operation on two's complement numbers. Overflow occurs when the leftmost retained bit of the binary bit pattern is not the same as the infinite number of digits to the left (the sign bit is incorrect): a 0 on the left of the bit pattern when the number is negative or a 1 when the number is positive.

Let's examine two useful shortcuts when working with two's complement numbers. The first shortcut is a quick way to negate a two's complement binary number. Simply invert every 0 to 1 and every 1 to 0, then add one to the result. This shortcut is based on the observation that the sum of a number and its inverted representation must be 111 ... 111two, which represents -1. Since $x + \bar{x} = -1$, therefore $x + \bar{x} + 1 = 0$ or $\bar{x} + 1 = -x$. (We use the notation $\bar{x}$ to mean invert every bit in x from 0 to 1 and vice versa.)
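The negation shortcut can be tried out in C on 64-bit patterns. The names `negate_pattern` and `check_shortcut` are hypothetical helpers for this sketch; unsigned arithmetic is used so that the wraparound is well defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Negate a two's complement bit pattern: invert every bit, then add one. */
static uint64_t negate_pattern(uint64_t x)
{
    return ~x + 1;
}

/* Checks the observation the shortcut rests on. */
static void check_shortcut(uint64_t x)
{
    assert(x + ~x == UINT64_MAX);          /* x plus its inverse is all 1s, i.e. -1 */
    assert(x + negate_pattern(x) == 0);    /* so x plus (inverse + 1) is 0 */
}
```

Negating the pattern for 2 gives 1111 ... 1110two, and negating that pattern again gives back 2.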

Our next shortcut tells us how to convert a binary number represented in n bits to a number represented with more than n bits. The shortcut is to take the most significant bit from the smaller quantity (the sign bit) and replicate it to fill the new bits of the larger quantity. The old nonsign bits are simply copied into the right portion of the new doubleword. This shortcut is commonly called sign extension.
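A minimal C sketch of sign extension follows. The helper name `sign_extend` and its interface (an n-bit quantity held in the low bits of a 64-bit value) are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low n bits of value to a 64-bit doubleword. */
static uint64_t sign_extend(uint64_t value, unsigned n)
{
    assert(n >= 1 && n <= 64);
    if (n == 64)
        return value;                               /* nothing to extend */
    uint64_t low_mask = ((uint64_t)1 << n) - 1;     /* the n original bits */
    uint64_t sign_bit = (value >> (n - 1)) & 1;     /* most significant of the n bits */
    value &= low_mask;                              /* copy the old bits unchanged */
    return sign_bit ? (value | ~low_mask) : value;  /* replicate the sign bit upward */
}
```

The 16-bit pattern 0000 0000 0000 0010two (2) extends to a doubleword with 48 more leading 0s, while 1111 1111 1111 1110two (-2) extends to a doubleword with 48 more leading 1s.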

This trick works because positive two's complement numbers really have an infinite number of 0s on the left and negative two's complement numbers have an infinite number of 1s. The binary bit pattern representing a number hides leading bits to fit the width of the hardware; sign extension simply restores some of them.

Summary

The main point of this section is that we need to represent both positive and negative integers within a computer, and although there are pros and cons to any option, the unanimous choice since 1965 has been two's complement.

Elaboration: For signed decimal numbers, we used "-" to represent negative because there are no limits to the size of a decimal number. Given a fixed data size, binary and hexadecimal (see Figure 2.4) bit strings can encode the sign; therefore, we do not normally use "+" or "-" with binary or hexadecimal notation.

What is the decimal value of this 64-bit two's complement number?

Elaboration: Two's complement gets its name from the rule that the unsigned sum of an n-bit number and its n-bit negative is $2^n$; hence, the negation or complement of a number x is $2^n - x$, or its "two's complement."

A third alternative representation to two's complement and sign and magnitude is called one's complement. The negative of a one's complement is found by inverting each bit, from 0 to 1 and from 1 to 0, or $\bar{x}$. This relation helps explain its name since the complement of x is $2^n - x - 1$. It was also an attempt to be a better solution than sign and magnitude, and several early scientific computers did use the notation. This representation is similar to two's complement except that it also has two 0s: 00 ... 00two is positive 0 and 11 ... 11two is negative 0. The most negative number, 10 ... 000two, represents -2,147,483,647ten (for a 32-bit word), and so the positives and negatives are balanced. One's complement adders did need an extra step to subtract a number, and hence two's complement dominates today.

A final notation, which we will look at when we discuss floating point in Chapter 3, is to represent the most negative value by 00 ... 000two and the most positive value by 11 ... 11two, with 0 typically having the value 10 ... 00two. This representation is called a biased notation, since it biases the number such that the number plus the bias has a non-negative representation.

2.5 Representing Instructions in the Computer

We are now ready to explain the difference between the way humans instruct computers and the way computers see instructions.

Instructions are kept in the computer as a series of high and low electronic signals and may be represented as numbers. In fact, each piece of an instruction can be considered as an individual number, and placing these numbers side by side forms the instruction. The 32 registers of RISC-V are just referred to by their number, from 0 to 31.

This layout of the instruction is called the instruction format. As you can see from counting the number of bits, this RISC-V instruction takes exactly 32 bits: a word, or one half of a doubleword. In keeping with our design principle that simplicity favors regularity, RISC-V instructions are all 32 bits long.

To distinguish it from assembly language, we call the numeric version of instructions machine language and a sequence of such instructions machine code.

It would appear that you would now be reading and writing long, tiresome strings of binary numbers. We avoid that tedium by using a higher base than binary that converts easily into binary. Since almost all computer data sizes are multiples of 4, hexadecimal (base 16) numbers are popular. As base 16 is a power of 2, we can trivially convert by replacing each group of four binary digits by a single hexadecimal digit, and vice versa. Figure 2.4 converts between hexadecimal and binary.

Because we frequently deal with different number bases, to avoid confusion, we will subscript decimal numbers with ten, binary numbers with two, and hexadecimal numbers with hex. (If there is no subscript, the default is base 10.) By the way, C and Java use the notation 0xnnnn for hexadecimal numbers.

RISC-V Fields

RISC-V fields are given names to make them easier to discuss:

Here is the meaning of each name of the fields in RISC-V instructions:

  • opcode: Basic operation of the instruction, and this abbreviation is its traditional name.
  • rd: The register destination operand. It gets the result of the operation.
  • funct3: An additional opcode field.
  • rs1: The first register source operand.
  • rs2: The second register source operand.
  • funct7: An additional opcode field.

A problem occurs when an instruction needs longer fields than those shown above. For example, the load register instruction must specify two registers and a constant. If the address were to use one of the 5-bit fields in the format above, the largest constant within the load register instruction would be limited to only $2^5 - 1$ or 31. This constant is used to select elements from arrays or data structures, and it often needs to be much larger than 31. This 5-bit field is too small to be useful.

Hence, we have a conflict between the desire to keep all instructions the same length and the desire to have a single instruction format. This conflict leads us to the final hardware design principle:

Design Principle 3: Good design demands good compromises.

The compromise chosen by the RISC-V designers is to keep all instructions the same length, thereby requiring distinct instruction formats for different kinds of instructions. For example, the format above is called R-type (for register). A second type of instruction format is I-type and is used by arithmetic operations with one constant operand, including addi, and by load instructions. The fields of the I-type format are

The 12-bit immediate is interpreted as a two's complement value, so it can represent integers from $-2^{11}$ to $2^{11} - 1$. When the I-type format is used for load instructions, the immediate represents a byte offset, so the load doubleword instruction can refer to any doubleword within a region of $\pm 2^{11}$ or 2048 bytes ($\pm 2^{8}$ or 256 doublewords) of the base address in the base register rs1. We see that more than 32 registers would be difficult in this format, as the rd and rs1 fields would each need another bit, making it harder to fit everything in one word.

Let's look at the load register instruction from page 71:

ld x9, 64(x22)    // Temporary reg x9 gets A[8]

Here, 22 (for x22) is placed in the rs1 field, 64 is placed in the immediate field, and 9 (for x9) is placed in the rd field. We also need a format for the store doubleword instruction, sd, which needs two source registers (for the base address and the store data) and an immediate for the address offset. The fields of the S-type format are
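Placing the field values of the load side by side can be sketched in C. The helper `encode_i_type` is a hypothetical name for this illustration. The bit positions (immediate in bits 31-20, rs1 in 19-15, funct3 in 14-12, rd in 11-7, opcode in 6-0) and the values opcode = 3 and funct3 = 3 for ld are taken from the RISC-V specification rather than from the text above, so treat them as assumptions of the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Packs the I-type fields into one 32-bit instruction word. */
static uint32_t encode_i_type(int32_t imm, unsigned rs1, unsigned funct3,
                              unsigned rd, unsigned opcode)
{
    assert(imm >= -2048 && imm <= 2047);           /* 12-bit two's complement range */
    assert(rs1 < 32 && rd < 32 && funct3 < 8 && opcode < 128);
    uint32_t imm12 = (uint32_t)imm & 0xFFF;        /* keep the low 12 bits */
    return (imm12 << 20)
         | ((uint32_t)rs1 << 15)
         | ((uint32_t)funct3 << 12)
         | ((uint32_t)rd << 7)
         | (uint32_t)opcode;
}
```

Under these assumptions, `encode_i_type(64, 22, 3, 9, 3)` packs ld x9, 64(x22) into the word 0x040B3483: 64 in the immediate field, 22 in rs1, and 9 in rd, as described above.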

The 12-bit immediate in the S-type format is split into two fields, which supply the lower 5 bits and upper 7 bits. The RISC-V architects chose this design because it keeps the rs1 and rs2 fields in the same place in all instruction formats. Keeping the instruction formats as similar as possible reduces hardware complexity. Similarly, the opcode and funct3 fields are the same size in all locations, and they are always in the same place.

In case you were wondering, the formats are distinguished by the values in the opcode field: each format is assigned a distinct set of opcode values in the first field (opcode) so that the hardware knows how to treat the rest of the instruction. Figure 2.5 shows the numbers used in each field for the RISC-V instructions covered so far.
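The split immediate can be sketched in C as follows. The helper names `encode_s_type` and `s_type_immediate` are hypothetical. The bit positions (upper 7 immediate bits in 31-25, rs2 in 24-20, rs1 in 19-15, funct3 in 14-12, lower 5 immediate bits in 11-7, opcode in 6-0) and the values opcode = 35 and funct3 = 3 for sd come from the RISC-V specification, not from the text above, and the instruction sd x9, 64(x22) is an example chosen here for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Packs the S-type fields, splitting the 12-bit immediate in two. */
static uint32_t encode_s_type(int32_t imm, unsigned rs2, unsigned rs1,
                              unsigned funct3, unsigned opcode)
{
    assert(imm >= -2048 && imm <= 2047);
    assert(rs2 < 32 && rs1 < 32 && funct3 < 8 && opcode < 128);
    uint32_t imm12  = (uint32_t)imm & 0xFFF;
    uint32_t imm_hi = imm12 >> 5;        /* upper 7 bits of the immediate */
    uint32_t imm_lo = imm12 & 0x1F;      /* lower 5 bits of the immediate */
    return (imm_hi << 25) | ((uint32_t)rs2 << 20) | ((uint32_t)rs1 << 15)
         | ((uint32_t)funct3 << 12) | (imm_lo << 7) | (uint32_t)opcode;
}

/* Reassembles and sign-extends the immediate of an S-type instruction. */
static int32_t s_type_immediate(uint32_t instruction)
{
    uint32_t imm12 = ((instruction >> 25) << 5) | ((instruction >> 7) & 0x1F);
    return (imm12 & 0x800) ? (int32_t)imm12 - 4096 : (int32_t)imm12;
}
```

Note that rs1 sits in bits 19-15 here, the same position as in the I-type sketch of a load, which is the regularity the paragraph above describes. With the assumed field values, `encode_s_type(64, 9, 22, 3, 35)` yields 0x049B3023.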

Elaboration: RISC-V assembly language programmers aren't forced to use addi when working with constants. The programmer simply writes add, and the assembler generates the proper opcode and the proper instruction format depending on whether the operands are all registers (R-type) or if one is a constant (I-type). We use the explicit names in RISC-V for the different opcodes and formats as we think it is less confusing when introducing assembly language versus machine language.

Elaboration: Although RISC-V has both add and sub instructions, it does not have a subi counterpart to addi. This is because the immediate field represents a two's complement integer, so addi can be used to subtract constants.

Figure 2.6 summarizes the portions of RISC-V machine language described in this section. As we shall see in Chapter 4, the similarity of the binary representations of related instructions simplifies hardware design. These similarities are another example of regularity in the RISC-V architecture.

What RISC-V instruction does this represent? Choose from one of the four options below.

2.6 Logical Operations

Although the first computers operated on full words, it soon became clear that it was useful to operate on fields of bits within a word or even on individual bits. Examining characters within a word, each of which is stored as 8 bits, is one example of such an operation (see Section 2.9). It follows that operations were added to programming languages and instruction set architectures to simplify, among other things, the packing and unpacking of bits into words. These instructions are called logical operations. Figure 2.8 shows logical operations in C, Java, and RISC-V.

The first class of such operations is called shifts. They move all the bits in a doubleword to the left or right, filling the emptied bits with OS. For example, if register xl9 contained

and the instruction to shift left by 4 was executed, the new value would be:

The dual of a shift left is a shift right. The actual names of the two RISC-V shift instructions are shift left logical immediate (slli) and shift right logical immediate (srli). The following instruction performs the operation above, if the original value was in register x19 and the result should go in register x11:

These shift instructions use the I-type format. Since it isn't useful to shift a 64-bit register by more than 63 bits, only the lower 6 bits of the I-type format's 12-bit immediate are actually used. The remaining 6 bits are repurposed as an additional opcode field, funct6.

The encoding of slli is 19 in the opcode field, rd contains 11, funct3 contains 1, rs1 contains 19, immediate contains 4, and funct6 contains 0.

Shift left logical provides a bonus benefit. Shifting left by i bits gives the identical result as multiplying by 2^i, just as shifting a decimal number by i digits is equivalent to multiplying by 10^i. For example, the above slli shifts by 4, which gives the same result as multiplying by 2^4 or 16. The first bit pattern above represents 9, and 9 × 16 = 144, the value of the second bit pattern.

RISC-V provides a third type of shift, shift right arithmetic (srai). This variant is similar to srli, except rather than filling the vacated bits on the left with zeros, it fills them with copies of the old sign bit. It also provides variants of all three shifts that take the shift amount from a register, rather than from an immediate: sll, srl, and sra.

Another useful operation that isolates fields is AND. (We capitalize the word to avoid confusion between the operation and the English conjunction.) AND is a bit-by-bit operation that leaves a 1 in the result only if both bits of the operands are 1. For example, if register x11 contains

and register x10 contains

then, after executing the RISC-V instruction

the value of register x9 would be

As you can see, AND can apply a bit pattern to a set of bits to force 0s where there is a 0 in the bit pattern. Such a bit pattern in conjunction with AND is traditionally called a mask, since the mask "conceals" some bits.

To place a value into one of these seas of 0s, there is the dual to AND, called OR. It is a bit-by-bit operation that places a 1 in the result if either operand bit is a 1. To elaborate, if the registers x10 and x11 are unchanged from the preceding example, the result of the RISC-V instruction

is this value in register x9:

The final logical operation is a contrarian. NOT takes one operand and places a 1 in the result if one operand bit is a 0, and vice versa. Using our prior notation, it calculates x̄.

In keeping with the three-operand format, the designers of RISC-V decided to include the instruction XOR (exclusive OR) instead of NOT. Since exclusive OR creates a 0 when bits are the same and a 1 if they are different, the equivalent to NOT is an xor with 111...111.

If the register x10 is unchanged from the preceding example and register x12 has the value 0, the result of the RISC-V instruction

is this value in register x9:

Figure 2.8 above shows the relationship between the C and Java operators and the RISC-V instructions. Constants are useful in logical operations as well as in arithmetic operations, so RISC-V also provides the instructions and immediate (andi), or immediate (ori), and exclusive or immediate (xori).
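The shift and XOR behavior described in this section can be sketched in C. This is only an illustration under stated assumptions: the function names are hypothetical, the arithmetic shift is built from logical operations rather than relying on how a particular compiler treats `>>` on negative values, and the final cast back to a signed type assumes the usual two's complement behavior.

```c
#include <assert.h>
#include <stdint.h>

/* Like slli: vacated low bits are filled with 0s. */
uint64_t shift_left_logical(uint64_t value, unsigned amount) {
    assert(amount < 64); /* only 6 bits of shift amount are meaningful */
    return value << amount;
}

/* Like srli: vacated high bits are filled with 0s. */
uint64_t shift_right_logical(uint64_t value, unsigned amount) {
    assert(amount < 64);
    return value >> amount;
}

/* Like srai: vacated high bits are filled with copies of the old sign bit. */
int64_t shift_right_arithmetic(int64_t value, unsigned amount) {
    assert(amount < 64);
    uint64_t bits = (uint64_t)value >> amount;
    if (value < 0 && amount > 0) {
        bits |= ~UINT64_C(0) << (64 - amount); /* replicate the sign bit */
    }
    return (int64_t)bits; /* assumes two's complement conversion */
}

/* NOT expressed as an exclusive OR with a pattern of all 1s. */
uint64_t not_via_xor(uint64_t value) {
    return value ^ ~UINT64_C(0);
}
```

Calling `shift_left_logical(9, 4)` reproduces the multiplication by 16 from the text, and `shift_right_arithmetic(-16, 2)` keeps the result negative where `shift_right_logical` would not.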

Elaboration: C allows bit fields or fields to be defined within doublewords, both allowing objects to be packed within a doubleword and to match an externally enforced interface such as an I/O device. All fields must fit within a single doubleword. Fields are unsigned integers that can be as short as 1 bit. C compilers insert and extract fields using logical instructions in RISC-V: andi, ori, slli, and srli.

Which operations can isolate a field in a doubleword?
1. AND
2. A shift left followed by a shift right
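As a rough illustration of how fields might be extracted and inserted with only the logical operations named above, here is a minimal C sketch. The helper names and the `low`/`width` parameters are assumptions made for this example; they do not come from any compiler or library.

```c
#include <assert.h>
#include <stdint.h>

/* Isolate `width` bits starting at bit `low`: shift right, then AND with a mask. */
uint64_t extract_with_mask(uint64_t word, unsigned low, unsigned width) {
    assert(width >= 1 && width < 64 && low + width <= 64);
    uint64_t mask = (UINT64_C(1) << width) - 1;
    return (word >> low) & mask;
}

/* Isolate the same field with a shift left followed by a shift right. */
uint64_t extract_with_shifts(uint64_t word, unsigned low, unsigned width) {
    assert(width >= 1 && low + width <= 64);
    unsigned discard_high = 64 - (low + width); /* bits above the field */
    return (word << discard_high) >> (64 - width);
}

/* Place `field` into the word: AND clears the old bits, OR sets the new ones. */
uint64_t insert_field(uint64_t word, unsigned low, unsigned width, uint64_t field) {
    assert(width >= 1 && width < 64 && low + width <= 64);
    uint64_t mask = ((UINT64_C(1) << width) - 1) << low;
    return (word & ~mask) | ((field << low) & mask);
}
```

For instance, extracting the 8-bit field at bit 4 of 0xABCD yields 0xBC with either helper.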

2.7 Instructions for Making Decisions

What distinguishes a computer from a simple calculator is its ability to make decisions. Based on the input data and the values created during computation, different instructions execute. Decision making is commonly represented in programming languages using the if statement, sometimes combined with go to statements and labels. RISC-V assembly language includes two decision-making instructions, similar to an if statement with a go to. The first instruction is

beq rs1, rs2, L1

This instruction means go to the statement labeled L1 if the value in register rs1 equals the value in register rs2. The mnemonic beq stands for branch if equal. The second instruction is

bne rs1, rs2, L1

It means go to the statement labeled L1 if the value in register rs1 does not equal the value in register rs2. The mnemonic bne stands for branch if not equal. These two instructions are traditionally called conditional branches.

Notice that the assembler relieves the compiler and the assembly language programmer from the tedium of calculating addresses for branches, just as it does for calculating data addresses for loads and stores (see Section 2.12).

The test for equality or inequality is probably the most popular test, but there are many other relationships between two numbers. For example, a loop may want to test to see if the index variable is less than 0. The full set of comparisons is less than (<), less than or equal (≤), greater than (>), greater than or equal (≥), equal (=), and not equal (≠).

Comparison of bit patterns must also deal with the dichotomy between signed and unsigned numbers. Sometimes a bit pattern with a 1 in the most significant bit represents a negative number and, of course, is less than any positive number, which must have a 0 in the most significant bit. With unsigned integers, on the other hand, a 1 in the most significant bit represents a number that is larger than any that begins with a 0. (We'll soon take advantage of this dual meaning of the most significant bit to reduce the cost of the array bounds checking.)

RISC-V provides instructions that handle both cases. These instructions have the same form as beq and bne, but perform different comparisons. The branch if less than (blt) instruction compares the values in registers rs1 and rs2 and takes the branch if the value in rs1 is smaller, when they are treated as two's complement numbers. Branch if greater than or equal (bge) takes the branch in the opposite case, that is, if the value in rs1 is at least the value in rs2. Branch if less than, unsigned (bltu) takes the branch if the value in rs1 is smaller than the value in rs2 when the values are treated as unsigned numbers. Finally, branch if greater than or equal, unsigned (bgeu) takes the branch in the opposite case.

An alternative to providing these additional branch instructions is to set a register based upon the result of the comparison, then branch on the value in that temporary register with the beq or bne instructions. This approach, used by the MIPS instruction set, can make the processor datapath slightly simpler, but it takes more instructions to express a program.

Yet another alternative, used by ARM's instruction sets, is to keep extra bits that record what occurred during an instruction. These additional bits, called condition codes or flags, indicate, for example, if the result of an arithmetic operation was negative, or zero, or resulted in overflow. Conditional branches then use combinations of these condition codes to perform the desired test. One downside to condition codes is that if many instructions always set them, it will create dependencies that will make it difficult for pipelined execution (see Chapter 4).

Bounds Check Shortcut

Treating signed numbers as if they were unsigned gives us a low-cost way of checking if 0 ≤ x < y, which matches the index out-of-bounds check for arrays. The key is that negative integers in two's complement notation look like large numbers in unsigned notation; that is, the most significant bit is a sign bit in the former notation but a large part of the number in the latter. Thus, an unsigned comparison of x < y checks if x is negative as well as if x is less than y.
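The bounds check shortcut can be tried out in C, where a cast to an unsigned type plays the role of an unsigned branch such as bgeu. This is a minimal sketch; the function names are hypothetical and the length is assumed to be non-negative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The obvious check: two signed comparisons. */
bool in_bounds_signed(int64_t index, int64_t length) {
    return index >= 0 && index < length;
}

/* The shortcut: one unsigned comparison. A negative index becomes a very
   large unsigned value, so it fails the same test as an index that is
   too big. */
bool in_bounds_unsigned(int64_t index, int64_t length) {
    assert(length >= 0); /* assumption: array lengths are never negative */
    return (uint64_t)index < (uint64_t)length;
}
```

Both functions agree for every index, but the second needs only one comparison and therefore only one branch.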

Case/Switch Statement

Most programming languages have a case or switch statement that allows the programmer to select one of many alternatives depending on a single value. The simplest way to implement switch is via a sequence of conditional tests, turning the switch statement into a chain of if-then-else statements.

Sometimes the alternatives may be more efficiently encoded as a table of addresses of alternative instruction sequences, called a branch address table or branch table, and the program needs only to index into the table and then branch to the appropriate sequence. The branch table is therefore just an array of doublewords containing addresses that correspond to labels in the code. The program loads the appropriate entry from the branch table into a register. It then needs to branch using the address in the register. To support such situations, computers like RISC-V include an indirect jump instruction, which performs an unconditional branch to the address specified in a register. In RISC-V, the jump-and-link register instruction (jalr) serves this purpose. We'll see an even more popular use of this versatile instruction in the next section.
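The branch table idea can be approximated in C with an array of function pointers. This is only a sketch: the handlers and the `dispatch` function are hypothetical, and function pointers stand in for the code labels that a compiler would place in a real branch table.

```c
#include <assert.h>
#include <stddef.h>

typedef long (*Handler)(long);

/* Hypothetical alternatives, one per case value. */
static long case_zero(long x)    { return x + 1; }
static long case_one(long x)     { return x * 2; }
static long case_two(long x)     { return -x; }
static long case_default(long x) { return x; }

/* The branch table: an array of addresses indexed by the selector. */
static const Handler branch_table[] = { case_zero, case_one, case_two };

long dispatch(long selector, long x) {
    size_t count = sizeof branch_table / sizeof branch_table[0];
    /* One unsigned comparison rejects both negative and too-large selectors. */
    if ((unsigned long)selector >= count) {
        return case_default(x);
    }
    /* Load the table entry, then jump indirectly through it. */
    return branch_table[selector](x);
}
```

The table lookup takes the same time no matter how many cases exist, whereas a chain of if-then-else tests grows with the number of alternatives.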

I. C has many statements for decisions and loops, while RISC-V has few. Which of the following does or does not explain this imbalance? Why?
1. More decision statements make code easier to read and understand.
2. Fewer decision statements simplify the task of the underlying layer that is responsible for execution.
3. More decision statements mean fewer lines of code, which generally reduces coding time.
4. More decision statements mean fewer lines of code, which generally results in the execution of fewer operations.

II. Why does C provide two sets of operators for AND (& and &&) and two sets of operators for OR (| and ||), while RISC-V doesn't?
1. Logical operations AND and OR implement & and |, while conditional branches implement && and ||.
2. The previous statement has it backwards: && and || correspond to logical operations, while & and | map to conditional branches.
3. They are redundant and mean the same thing: && and || are simply inherited from the programming language B, the predecessor of C.

2.8 Supporting Procedures in Computer Hardware

A procedure or function is one tool programmers use to structure programs, both to make them easier to understand and to allow code to be reused. Procedures allow the programmer to concentrate on just one portion of the task at a time; parameters act as an interface between the procedure and the rest of the program and data, since they can pass values and return results. The equivalent to procedures in Java is described elsewhere in the book, but Java needs everything from a computer that C needs. Procedures are one way to implement abstraction in software.

You can think of a procedure like a spy who leaves with a secret plan, acquires resources, performs the task, covers his or her tracks, and then returns to the point of origin with the desired result. Nothing else should be perturbed once the mission is complete. Moreover, a spy operates on only a "need to know" basis, so the spy can't make assumptions about the spymaster.

Similarly, in the execution of a procedure, the program must follow these six steps:
1. Put parameters in a place where the procedure can access them.
2. Transfer control to the procedure.
3. Acquire the storage resources needed for the procedure.
4. Perform the desired task.
5. Put the result value in a place where the calling program can access it.
6. Return control to the point of origin, since a procedure can be called from several points in a program.

As mentioned above, registers are the fastest place to hold data in a computer, so we want to use them as much as possible. RISC-V software follows the following convention for procedure calling in allocating its 32 registers:

  • x10-x17: eight parameter registers in which to pass parameters or return values.
  • x1: one return address register to return to the point of origin.

In addition to allocating these registers, RISC-V assembly language includes an instruction just for the procedures: it branches to an address and simultaneously saves the address of the following instruction to the destination register rd. The jump-and-link instruction (jal) is written

jal x1, ProcedureAddress // jump to ProcedureAddress and write return address to x1

The link portion of the name means that an address or link is formed that points to the calling site to allow the procedure to return to the proper address. This "link," stored in register x1, is called the return address. The return address is needed because the same procedure could be called from several parts of the program.

To support the return from a procedure, computers like RISC-V use an indirect jump, like the jump-and-link register instruction (jalr) introduced above to help with case statements:

jalr x0, 0(x1)

The jump-and-link register instruction branches to the address stored in register x1, which is just what we want. Thus, the calling program, or caller, puts the parameter values in x10-x17 and uses jal x1, X to branch to procedure X (sometimes named the callee). The callee then performs the calculations, places the results in the same parameter registers, and returns control to the caller using jalr x0, 0(x1).

Implicit in the stored-program idea is the need to have a register to hold the address of the current instruction being executed. For historical reasons, this register is almost always called the program counter, abbreviated PC in the RISC-V architecture, although a more sensible name would have been instruction address register. The jal instruction actually saves PC + 4 in its destination register (usually x1) to link to the byte address of the following instruction to set up the procedure return.

Elaboration: The jump-and-link instruction can also be used to perform an unconditional branch within a procedure by using x0 as the destination register. Since x0 is hard-wired to zero, the effect is to discard the return address:

jal x0, Label // unconditionally branch to Label

Using More Registers

Suppose a compiler needs more registers for a procedure than the eight argument registers. Since we must cover our tracks after our mission is complete, any registers needed by the caller must be restored to the values that they contained before the procedure was invoked. This situation is an example in which we need to spill registers to memory, as mentioned in the Hardware/Software Interface section on page 69.

The ideal data structure for spilling registers is a stack, a last-in-first-out queue. A stack needs a pointer to the most recently allocated address in the stack to show where the next procedure should place the registers to be spilled or where old register values are found. In RISC-V, the stack pointer is register x2, also known by the name sp. The stack pointer is adjusted by one doubleword for each register that is saved or restored. Stacks are so popular that they have their own buzzwords for transferring data to and from the stack: placing data onto the stack is called a push, and removing data from the stack is called a pop.

By historical precedent, stacks "grow" from higher addresses to lower addresses. This convention means that you push values onto the stack by subtracting from the stack pointer. Adding to the stack pointer shrinks the stack, thereby popping values off the stack.

In the previous example, we used temporary registers and assumed their old values must be saved and restored. To avoid saving and restoring a register whose value is never used, which might happen with a temporary register, RISC-V software separates 19 of the registers into two groups:
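A downward-growing stack of doublewords can be modeled in C to make the push and pop conventions concrete. This is a toy sketch under stated assumptions: the `Stack` type, its fixed size, and the function names are invented for illustration, and the comments name the RISC-V instructions (addi, sd, ld) that a real spill would typically use.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_DOUBLEWORDS 16 /* arbitrary size for the sketch */

typedef struct {
    uint64_t memory[STACK_DOUBLEWORDS];
    uint64_t *sp; /* points at the most recently pushed doubleword */
} Stack;

/* An empty stack starts with sp just past the highest address. */
void stack_init(Stack *stack) {
    stack->sp = stack->memory + STACK_DOUBLEWORDS;
}

/* Push: subtract from sp (the stack grows toward lower addresses), then store. */
void push(Stack *stack, uint64_t value) {
    assert(stack->sp > stack->memory); /* room left */
    stack->sp -= 1;                    /* like addi sp, sp, -8 */
    *stack->sp = value;                /* like sd value, 0(sp) */
}

/* Pop: load, then add to sp (the stack shrinks toward higher addresses). */
uint64_t pop(Stack *stack) {
    assert(stack->sp < stack->memory + STACK_DOUBLEWORDS); /* not empty */
    uint64_t value = *stack->sp;       /* like ld value, 0(sp) */
    stack->sp += 1;                    /* like addi sp, sp, 8 */
    return value;
}
```

Values come back in the reverse of the order in which they were pushed, and after matching pushes and pops the stack pointer returns to where it started, which is what a callee must guarantee.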

  • x5-x7 and x28-x31: temporary registers that are not preserved by the callee (called procedure) on a procedure call
  • x8-x9 and x18-x27: saved registers that must be preserved on a procedure call (if used, the callee saves and restores them)

This simple convention reduces register spilling. In the example above, since the caller does not expect registers x5 and x6 to be preserved across a procedure call, we can drop two stores and two loads from the code. We still must save and restore x20, since the callee must assume that the caller needs its value.

Nested Procedures

Procedures that do not call others are called leaf procedures. Life would be simple if all procedures were leaf procedures, but they aren't. Just as a spy might employ other spies as part of a mission, who in turn might use even more spies, so do procedures invoke other procedures. Moreover, recursive procedures even invoke "clones" of themselves. Just as we need to be careful when using registers in procedures, attention must be paid when invoking nonleaf procedures.

For example, suppose that the main program calls procedure A with an argument of 3, by placing the value 3 into register x10 and then using jal x1, A. Then suppose that procedure A calls procedure B via jal x1, B with an argument of 7, also placed in x10. Since A hasn't finished its task yet, there is a conflict over the use of register x10. Similarly, there is a conflict over the return address in register x1, since it now has the return address for B. Unless we take steps to prevent the problem, this conflict will eliminate procedure A's ability to return to its caller.

One solution is to push all the other registers that must be preserved on the stack, just as we did with the saved registers. The caller pushes any argument registers (x10-x17) or temporary registers (x5-x7 and x28-x31) that are needed after the call. The callee pushes the return address register x1 and any saved registers (x8-x9 and x18-x27) used by the callee. The stack pointer sp is adjusted to account for the number of registers placed on the stack. Upon the return, the registers are restored from memory, and the stack pointer is readjusted.

Figure 2.11 summarizes what is preserved across a procedure call. Note that several schemes preserve the stack, guaranteeing that the caller will get the same data back on a load from the stack as it stored onto the stack. The stack above sp is preserved simply by making sure the callee does not write above sp; sp is itself preserved by the callee adding exactly the same amount that was subtracted from it; and the other registers are preserved by saving them on the stack (if they are used) and restoring them from there.

Allocating Space for New Data on the Stack

The final complexity is that the stack is also used to store variables that are local to the procedure but do not fit in registers, such as local arrays or structures. The segment of the stack containing a procedure's saved registers and local variables is called a procedure frame or activation record. Figure 2.12 shows the state of the stack before, during, and after the procedure call.

Some RISC-V compilers use a frame pointer fp, or register x8, to point to the first doubleword of the frame of a procedure. A stack pointer might change during the procedure, and so references to a local variable in memory might have different offsets depending on where they are in the procedure, making the procedure harder to understand. Alternatively, a frame pointer offers a stable base register within a procedure for local memory references. Note that an activation record appears on the stack whether or not an explicit frame pointer is used. We've been avoiding using fp by avoiding changes to sp within a procedure: in our examples, the stack is adjusted only on entry to and exit from the procedure.

Allocating Space for New Data on the Heap

In addition to automatic variables that are local to procedures, C programmers need space in memory for static variables and for dynamic data structures. Figure 2.13 shows the RISC-V convention for allocation of memory when running the Linux operating system. The stack starts in the high end of the user address space (see Chapter 5) and grows down. The first part of the low end of memory is reserved, followed by the home of the RISC-V machine code, traditionally called the text segment. Above the code is the static data segment, which is the place for constants and other static variables. Although arrays tend to be a fixed length and thus are a good match to the static data segment, data structures like linked lists tend to grow and shrink during their lifetimes. The segment for such data structures is traditionally called the heap, and it is placed next in memory. Note that this allocation allows the stack and heap to grow toward each other, thereby allowing the efficient use of memory as the two segments wax and wane.

C allocates and frees space on the heap with explicit functions. malloc() allocates space on the heap and returns a pointer to it, and free() releases space on the heap to which the pointer points. C programs control memory allocation, which is the source of many common and difficult bugs. Forgetting to free space leads to a "memory leak," which ultimately uses up so much memory that the operating system may crash. Freeing space too early leads to "dangling pointers," which can cause pointers to point to things that the program never intended. Java uses automatic memory allocation and garbage collection just to avoid such bugs.

Figure 2.14 summarizes the register conventions for the RISC-V assembly language. This convention is another example of making the common case fast: most procedures can be satisfied with up to eight argument registers, twelve saved registers, and seven temporary registers without ever going to memory.

Elaboration: What if there are more than eight parameters? The RISC-V convention is to place the extra parameters on the stack just above the frame pointer. The procedure then expects the first eight parameters to be in registers x10 through x17 and the rest in memory, addressable via the frame pointer.

As mentioned in the caption of Figure 2.12, the frame pointer is convenient because all references to variables in the stack within a procedure will have the same offset. The frame pointer is not necessary, however. The RISC-V C compiler only uses a frame pointer in procedures that change the stack pointer in the body of the procedure.
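To make the malloc() and free() discussion concrete, here is a small C sketch of a linked list that lives on the heap. The `Node` type and the helper names are hypothetical; the point is only to show where a leak or a dangling pointer could arise and one way of avoiding each.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Node {
    long value;
    struct Node *next;
} Node;

/* Allocate one node on the heap and link it in front of `head`.
   Returns the new head, or NULL if malloc fails (the old list is untouched). */
Node *list_push(Node *head, long value) {
    Node *node = malloc(sizeof *node);
    if (node == NULL) {
        return NULL;
    }
    node->value = value;
    node->next = head;
    return node;
}

long list_sum(const Node *head) {
    long total = 0;
    for (const Node *node = head; node != NULL; node = node->next) {
        total += node->value;
    }
    return total;
}

/* Free every node. Skipping this call would leak the whole list.
   Returns NULL so the caller can overwrite its pointer instead of
   keeping a dangling one. */
Node *list_free(Node *head) {
    while (head != NULL) {
        Node *next = head->next; /* read the link before freeing the node */
        free(head);
        head = next;
    }
    return NULL;
}
```

A typical use is `list = list_free(list);`, which releases the memory and clears the caller's pointer in one step.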

Elaboration: Some recursive procedures can be implemented iteratively without using recursion. Iteration can significantly improve performance by removing the overhead associated with recursive procedure calls. For example, consider a procedure used to accumulate a sum:

long long int sum(long long int n, long long int acc) {
  if (n > 0)
    return sum(n - 1, acc + n);
  else
    return acc;
}

Consider the procedure call sum(3,0). This will result in recursive calls to sum(2,3), sum(1,5), and sum(0,6), and then the result 6 will be returned four times. This recursive call of sum is referred to as a tail call, and this example use of tail recursion can be implemented very efficiently (assume x10 = n, x11 = acc, and the result goes into x12):

sum: ble x10, x0, sum_exit    // go to sum_exit if n <= 0
add x11, x11, x10    // add n to acc
addi x10, x10, -1    // subtract 1 from n
jal x0, sum    // jump to sum
sum_exit:
addi x12, x11, 0    // return value acc
jalr x0, 0(x1)    // return to caller
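The assembly above is effectively a loop. A minimal C sketch of the same transformation, with the hypothetical names `sum_recursive` and `sum_iterative`, shows the recursive form next to the loop it can be turned into; the comments map each C statement to the corresponding instruction.

```c
#include <assert.h>

/* The tail-recursive form from the text. */
long long int sum_recursive(long long int n, long long int acc) {
    if (n > 0)
        return sum_recursive(n - 1, acc + n);
    else
        return acc;
}

/* The same computation as a loop, mirroring the assembly sequence. */
long long int sum_iterative(long long int n, long long int acc) {
    while (n > 0) { /* ble x10, x0, sum_exit */
        acc += n;   /* add x11, x11, x10 */
        n -= 1;     /* addi x10, x10, -1 */
    }               /* jal x0, sum */
    return acc;     /* addi x12, x11, 0 */
}
```

Because the recursive call is the last action of the procedure, nothing needs to be saved across it, which is why the loop version needs no stack traffic at all.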

Which of the following statements about C and Java is generally true?
1. C programmers manage data explicitly, while it's automatic in Java.
2. C leads to more pointer bugs and memory leak bugs than does Java.

2.9 Communicating with People

Computers were invented to crunch numbers, but as soon as they became commercially viable they were used to process text. Most computers today offer 8-bit bytes to represent characters, with the American Standard Code for Information Interchange (ASCII) being the representation that nearly everyone follows. Figure 2.15 summarizes ASCII.

A series of instructions can extract a byte from a doubleword, so load register and store register are sufficient for transferring bytes as well as words. Because of the popularity of text in some programs, however, RISC-V provides instructions to move bytes. Load byte unsigned (lbu) loads a byte from memory, placing it in the rightmost 8 bits of a register. Store byte (sb) takes a byte from the rightmost 8 bits of a register and writes it to memory. Thus, we copy a byte with the sequence

lbu x12, 0(x10)    // Read byte from source
sb x12, 0(x11)     // Write byte to destination

Characters are normally combined into strings, which have a variable number of characters. There are three choices for representing a string: (1) the first position of the string is reserved to give the length of a string, (2) an accompanying variable has the length of the string (as in a structure), or (3) the last position of a string is indicated by a character used to mark the end of a string. C uses the third choice, terminating a string with a byte whose value is 0 (named null in ASCII). Thus, the string "Cal" is represented in C by the following 4 bytes, shown as decimal numbers: 67, 97, 108, and 0. (As we shall see, Java uses the first option.)
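The null-terminated representation can be checked directly in C. The sketch below uses hypothetical helpers, `copy_string` and `string_length`, rather than the C library routines, and it assumes an ASCII-compatible character set; each loop iteration moves one byte, much like the lbu/sb pair above.

```c
#include <assert.h>
#include <stddef.h>

/* Copy bytes up to and including the terminating 0 byte.
   Returns the number of bytes written, terminator included.
   The destination is assumed to be large enough. */
size_t copy_string(char destination[], const char source[]) {
    size_t i = 0;
    while ((destination[i] = source[i]) != '\0') { /* copy, then test the byte */
        i += 1;
    }
    return i + 1;
}

/* Count the characters that come before the terminating 0 byte. */
size_t string_length(const char text[]) {
    size_t length = 0;
    while (text[length] != '\0') {
        length += 1;
    }
    return length;
}
```

Copying "Cal" writes 4 bytes even though the string has only 3 characters, because the terminator must travel with the text.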

Since the procedure strcpy above is a leaf procedure, the compiler could allocate i to a temporary register and avoid saving and restoring x19. Hence, instead of thinking of these registers as being just for temporaries, we can think of them as registers that the callee should use whenever convenient. When a compiler finds a leaf procedure, it exhausts all temporary registers before using registers it must save.

Characters and Strings in Java

Unicode is a universal encoding of the alphabets of most human languages. Figure 2.16 gives a list of Unicode alphabets; there are almost as many alphabets in Unicode as there are useful symbols in ASCII. To be more inclusive, Java uses Unicode for characters. By default, it uses 16 bits to represent a character.

The RISC-V instruction set has explicit instructions to load and store such 16-bit quantities, called halfwords. Load half unsigned (lhu) loads a halfword from memory, placing it in the rightmost 16 bits of a register, filling the leftmost 48 bits with zeros. Like load byte, load half (lh) treats the halfword as a signed number and thus sign-extends to fill the 48 leftmost bits of the register. Store half (sh) takes a halfword from the rightmost 16 bits of a register and writes it to memory. We copy a halfword with the sequence

lhu x19, 0(x10) // Read halfword (16 bits) from source
sh x19, 0(x11) // Write halfword (16 bits) to dest
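The difference between the unsigned and signed load variants comes down to how the upper bits of the 64-bit register are filled. A minimal C sketch, with the hypothetical helpers `zero_extend` and `sign_extend` operating on values rather than on memory, illustrates both behaviors for widths up to 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* What the unsigned loads do: keep the low `bits` bits, fill the rest with 0s. */
uint64_t zero_extend(uint64_t raw, unsigned bits) {
    assert(bits >= 1 && bits <= 32);
    return raw & ((UINT64_C(1) << bits) - 1);
}

/* What the signed loads do: copy the top bit of the field into all
   higher bits of the result. */
int64_t sign_extend(uint64_t raw, unsigned bits) {
    assert(bits >= 1 && bits <= 32);
    uint64_t low = raw & ((UINT64_C(1) << bits) - 1);
    int64_t value = (int64_t)low;        /* fits: at most 32 bits are set */
    if ((low >> (bits - 1)) != 0) {      /* is the field's sign bit set? */
        value -= INT64_C(1) << bits;     /* subtract 2^bits to make it negative */
    }
    return value;
}
```

With a 16-bit pattern of all 1s, `zero_extend` gives 65535 while `sign_extend` gives -1, which is the distinction between treating the halfword as unsigned or signed.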

Strings are a standard Java class with special built-in support and predefined methods for concatenation, comparison, and conversion. Unlike C, Java includes a word that gives the length of the string, similar to Java arrays.

Elaboration: RISC-V software is required to keep the stack aligned to "quadword" (16 byte) addresses to get better performance. This convention means that a char variable allocated on the stack may occupy as much as 16 bytes, even though it needs less. However, a C string variable or an array of bytes will pack 16 bytes per quadword, and a Java string variable or array of shorts packs 8 halfwords per quadword.

Elaboration: Reflecting the international nature of the web, most web pages today use Unicode instead of ASCII. Hence, Unicode may be even more popular than ASCII today.

Elaboration: RISC-V also includes instructions to move 32-bit values to and from memory. Load word unsigned (lwu) loads a 32-bit word from memory into the rightmost 32 bits of a register, filling the leftmost 32 bits with zeros. Load word (lw) instead fills the leftmost 32 bits with copies of bit 31. Store word (sw) takes a word from the rightmost 32 bits of a register and stores it to memory.

I. Which of the following statements about characters and strings in C and Java is true?
1. A string in C takes about half the memory as the same string in Java.
2. Strings are just an informal name for single-dimension arrays of characters in C and Java.
3. Strings in C and Java use null (0) to mark the end of a string.
4. Operations on strings, like length, are faster in C than in Java.

II. Which type of variable that can contain 1,000,000,000 (decimal) takes the most memory space?
1. long long int in C
2. string in C
3. string in Java

2.10 RISC-V Addressing for Wide Immediates and Addresses

Although keeping all RISC-V instructions 32 bits long simplifies the hardware, there are times when it would be convenient to have 32-bit or larger constants or addresses. This section starts with the general solution for large constants, and then shows the optimizations for instruction addresses used in branches.

Wide Immediate Operands

Although constants are frequently short and fit into the 12-bit fields, sometimes they are bigger.

The RISC-V instruction set includes the instruction load upper immediate (lui) to load a 20-bit constant into bits 12 through 31 of a register. The leftmost 32 bits are filled with copies of bit 31, and the rightmost 12 bits are filled with zeros. This instruction allows, for example, a 32-bit constant to be created with two instructions. lui uses a new instruction format, U-type, as the other formats cannot accommodate such a large constant.

Elaboration: In the previous example, bit 11 of the constant was 0. If bit 11 had been set, there would have been an additional complication: the 12-bit immediate is sign-extended, so the addend would have been negative. This means that in addition to adding in the rightmost 11 bits of the constant, we would have also subtracted 2^12. To compensate for this error, it suffices to add 1 to the constant loaded with lui, since the lui constant is scaled by 2^12.
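The compensation described in this elaboration can be worked through in C. The sketch below is an illustration only: the `WideImmediate` type and both function names are hypothetical, it models just the low 32 bits of the register (it ignores the sign extension of lui into the upper 32 bits), and the final cast assumes two's complement conversion.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t upper20; /* value for lui: ends up in bits 12 through 31 */
    int32_t lower12;  /* value for addi: a sign-extended 12-bit immediate */
} WideImmediate;

/* Split a 32-bit constant into a lui part and an addi part. */
WideImmediate split_constant(int32_t value) {
    WideImmediate parts;
    parts.lower12 = (int32_t)((uint32_t)value & 0xFFFu);
    if (parts.lower12 >= 2048) {
        parts.lower12 -= 4096; /* bit 11 set: addi will see a negative number */
    }
    /* Removing the (possibly negative) low part adds 1 to the upper part
       exactly when bit 11 was set. */
    parts.upper20 = (((uint32_t)value - (uint32_t)parts.lower12) >> 12) & 0xFFFFFu;
    return parts;
}

/* Recombine the way lui followed by addi would, using 32-bit wraparound. */
int32_t rebuild_constant(WideImmediate parts) {
    uint32_t upper = parts.upper20 << 12;
    return (int32_t)(upper + (uint32_t)parts.lower12);
}
```

For the constant 0x12345FFF, bit 11 is set, so the split yields an upper part of 0x12346 (one more than 0x12345) and a lower part of -1; adding them back gives the original value.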

Addressing in Branches

The RISC-V branch instructions use the RISC-V instruction format called SB-type. This format can represent branch addresses from -4096 to 4094, in multiples of 2. For reasons revealed shortly, it is only possible to branch to even addresses. The SB-type format consists of a 7-bit opcode, a 3-bit function code, two 5-bit register operands (rs1 and rs2), and a 12-bit address immediate. The address uses an unusual encoding, which simplifies datapath design but complicates assembly. The instruction

bne x10, x11, 2000 // if x10 != x11, go to location 2000 (decimal) = 0111 1101 0000 (binary)

could be assembled into this format (it's actually a bit more complicated, as we will see):

where the opcode for conditional branches is 1100011 (binary) and bne's funct3 code is 001 (binary).

The unconditional jump-and-link instruction (jal) is the only instruction that uses the UJ-type format. This instruction consists of a 7-bit opcode, a 5-bit destination register operand (rd), and a 20-bit address immediate. The link address, which is the address of the instruction following the jal, is written to rd.

Like the SB-type format, the UJ-type format's address operand uses an unusual immediate encoding, and it cannot encode odd addresses. So

jal x0, 2000 // go to location 2000 (decimal) = 0111 1101 0000 (binary)

is assembled into this format:

If addresses of the program had to fit in this 20-bit field, it would mean that no program could be bigger than 2^20, which is far too small to be a realistic option today. An alternative would be to specify a register that would always be added to the branch offset so that a branch instruction would calculate the following:

Program counter = Register + Branch offset

This sum allows the program to be as large as 2^64 and still be able to use conditional branches, solving the branch address size problem. Then the question is, which register?

The answer comes from seeing how conditional branches are used. Conditional branches are found in loops and in if statements, so they tend to branch to a nearby instruction. For example, about half of all conditional branches in SPEC benchmarks go to locations less than 16 instructions away. Since the program counter (PC) contains the address of the current instruction, we can branch within ±2^10 words of the current instruction, or jump within ±2^18 words of the current instruction, if we use the PC as the register to be added to the address. Almost all loops and if statements are smaller than 2^10 words, so the PC is the ideal choice. This form of branch addressing is called PC-relative addressing.

Like most recent computers, RISC-V uses PC-relative addressing for both conditional branches and unconditional jumps, because the destination of these instructions is likely to be close to the branch. On the other hand, procedure calls may require jumping more than 2^18 words away, since there is no guarantee that the callee is close to the caller. Hence, RISC-V allows very long jumps to any 32-bit address with a two-instruction sequence: lui writes bits 12 through 31 of the address to a temporary register, and jalr adds the lower 12 bits of the address to the temporary register and jumps to the sum.

Since RISC-V instructions are 4 bytes long, the RISC-V branch instructions could have been designed to stretch their reach by having the PC-relative address refer to the number of words between the branch and the target instruction, rather than the number of bytes. However, the RISC-V architects wanted to support the possibility of instructions that are only 2 bytes long, so the branch instructions represent the number of halfwords between the branch and the branch target. Thus, the 20-bit address field in the jal instruction can encode a distance of ±2^19 halfwords, or ±1 MiB, from the current PC. Similarly, the 12-bit field in the conditional branch instructions is also a halfword address, meaning that it represents a 13-bit byte address.
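The reach of these fields is plain arithmetic, and a short sketch makes it concrete. The Python below is an illustration only: the function names are hypothetical, and the lui/jalr split assumes that the 12-bit immediate is sign-extended, which is why the upper part is rounded before shifting. Addresses are treated modulo 2^32.

```python
def pc_relative_byte_range(field_bits):
    """Byte offsets reachable by a signed field that counts halfwords (2 bytes)."""
    lowest = -(1 << (field_bits - 1)) * 2
    highest = ((1 << (field_bits - 1)) - 1) * 2
    return lowest, highest


def split_for_lui_jalr(address):
    """Split a 32-bit address into a 20-bit upper part and a signed 12-bit lower part.

    Assumption: the lower part is sign-extended when it is added back, so the
    upper part is rounded by adding 0x800 before shifting.
    """
    rounded_upper = (address + 0x800) >> 12
    lower = address - (rounded_upper << 12)      # always in -2048..2047
    return rounded_upper & 0xFFFFF, lower


def rebuild(upper, lower):
    """Recombine the two parts the way the two-instruction sequence would."""
    return ((upper << 12) + lower) & 0xFFFFFFFF


branch_range = pc_relative_byte_range(12)   # conditional branch field
jump_range = pc_relative_byte_range(20)     # jal field
upper, lower = split_for_lui_jalr(0x12345FFF)
```

The conditional branch therefore reaches roughly 4 KiB in either direction and jal roughly 1 MiB, matching the halfword argument in the text.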

RISC-V Addressing Mode Summary

Multiple forms of addressing are generically called addressing modes. Figure 2.17 shows how operands are identified for each addressing mode. The addressing modes of the RISC-V instructions are the following:

  1. Immediate addressing, where the operand is a constant within the instruction itself.
  2. Register addressing, where the operand is a register.
  3. Base or displacement addressing, where the operand is at the memory location whose address is the sum of a register and a constant in the instruction.
  4. PC-relative addressing, where the branch address is the sum of the PC and a constant in the instruction.

Decoding Machine Language

Sometimes you are forced to reverse-engineer machine language to create the original assembly language. One example is when looking at a "core dump." Figure 2.18 shows the RISC-V encoding of the opcodes for the RISC-V machine language. This figure helps when translating by hand between assembly language and machine language.
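Decoding by hand amounts to slicing the 32-bit word into fields. As a minimal sketch, the Python below splits an R-type word using the standard RISC-V field positions; the function name and the sample word (the encoding of add x9, x20, x21) are chosen here for illustration and do not come from Figure 2.18.

```python
def decode_r_type(word):
    """Split a 32-bit R-type instruction word into its six fields."""
    return {
        "opcode": word & 0x7F,           # bits 6..0
        "rd":     (word >> 7) & 0x1F,    # bits 11..7
        "funct3": (word >> 12) & 0x7,    # bits 14..12
        "rs1":    (word >> 15) & 0x1F,   # bits 19..15
        "rs2":    (word >> 20) & 0x1F,   # bits 24..20
        "funct7": (word >> 25) & 0x7F,   # bits 31..25
    }


fields = decode_r_type(0x015A04B3)       # add x9, x20, x21
```

Reading the result back as assembly is then a table lookup: the opcode and funct fields select the operation, and the three 5-bit fields name the registers.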

Figure 2.19 shows all the RISC-V instruction formats. Figure 2.1 on pages 64-65 shows the RISC-V assembly language revealed in this chapter. The next chapter covers RISC-V instructions for multiply, divide, and arithmetic for real numbers.

I. What is the range of byte addresses for conditional branches in RISC-V (K = 1024)?

  1. Addresses between 0 and 4K - 1
  2. Addresses between 0 and 8K - 1
  3. Addresses up to about 2K before the branch to about 2K after
  4. Addresses up to about 4K before the branch to about 4K after

II. What is the range of byte addresses for the jump-and-link instruction in RISC-V (M = 1024K)?

  1. Addresses between 0 and 512K - 1
  2. Addresses between 0 and 1M - 1
  3. Addresses up to about 512K before the branch to about 512K after
  4. Addresses up to about 1M before the branch to about 1M after

2.11 Parallelism and Instructions: Synchronization

Parallel execution is easier when tasks are independent, but often they need to cooperate. Cooperation usually means some tasks are writing new values that others must read. To know when a task is finished writing so that it is safe for another to read, the tasks need to synchronize. If they don't synchronize, there is a danger of a data race, where the results of the program can change depending on how events happen to occur.

For example, recall the analogy of the eight reporters writing a story on pages 44-45 of Chapter 1. Suppose one reporter needs to read all the prior sections before writing a conclusion. Hence, he or she must know when the other reporters have finished their sections, so that there is no danger of sections being changed afterwards. That is, they had better synchronize the writing and reading of each section so that the conclusion will be consistent with what is printed in the prior sections.

In computing, synchronization mechanisms are typically built with user-level software routines that rely on hardware-supplied synchronization instructions. In this section, we focus on the implementation of lock and unlock synchronization operations. Lock and unlock can be used straightforwardly to create regions where only a single processor can operate, called mutual exclusion, as well as to implement more complex synchronization mechanisms.

The critical ability we require to implement synchronization in a multiprocessor is a set of hardware primitives with the ability to atomically read and modify a memory location. That is, nothing else can interpose itself between the read and the write of the memory location. Without such a capability, the cost of building basic synchronization primitives will be high and will increase unreasonably as the processor count increases.

There are a number of alternative formulations of the basic hardware primitives, all of which provide the ability to atomically read and modify a location, together with some way to tell if the read and write were performed atomically. In general, architects do not expect users to employ the basic hardware primitives, but instead expect system programmers will use the primitives to build a synchronization library, a process that is often complex and tricky.

Let's start with one such hardware primitive and show how it can be used to build a basic synchronization primitive. One typical operation for building synchronization operations is the atomic exchange or atomic swap, which interchanges a value in a register for a value in memory.

To see how to use this to build a basic synchronization primitive, assume that we want to build a simple lock where the value 0 is used to indicate that the lock is free and 1 is used to indicate that the lock is unavailable. A processor tries to set the lock by doing an exchange of 1, which is in a register, with the memory address corresponding to the lock. The value returned from the exchange instruction is 1 if some other processor had already claimed access, and 0 otherwise. In the latter case, the value is also changed to 1, preventing any competing exchange in another processor from also retrieving a 0.

For example, consider two processors that each try to do the exchange simultaneously: this race is prevented, since exactly one of the processors will perform the exchange first, returning 0, and the second processor will return 1 when it does the exchange. The key to using the exchange primitive to implement synchronization is that the operation is atomic: the exchange is indivisible, and two simultaneous exchanges will be ordered by the hardware. It is impossible for two processors trying to set the synchronization variable in this manner to both think they have simultaneously set the variable.

Implementing a single atomic memory operation introduces some challenges in the design of the processor, since it requires both a memory read and a write in a single, uninterruptible instruction.

An alternative is to have a pair of instructions in which the second instruction returns a value showing whether the pair of instructions was executed as if the pair was atomic. The pair of instructions is effectively atomic if it appears as if all other operations executed by any processor occurred before or after the pair. Thus, when an instruction pair is effectively atomic, no other processor can change the value between the pair of instructions.

In RISC-V this pair of instructions includes a special load called a load-reserved doubleword (lr.d) and a special store called a store-conditional doubleword (sc.d). These instructions are used in sequence: if the contents of the memory location specified by the load-reserved are changed before the store-conditional to the same address occurs, then the store-conditional fails and does not write the value to memory. The store-conditional is defined to both store the value of a (presumably different) register in memory and to change the value of another register to a 0 if it succeeds and to a nonzero value if it fails. Thus, sc.d specifies three registers: one to hold the address, one to indicate whether the atomic operation failed or succeeded, and one to hold the value to be stored in memory if it succeeded. Since the load-reserved returns the initial value, and the store-conditional returns 0 only if it succeeds, the following sequence implements an atomic exchange on the memory location specified by the contents of x20:

again: lr.d    x10, (x20)         // load-reserved
       sc.d    x11, x23, (x20)    // store-conditional
       bne     x11, x0, again     // branch if store fails
       addi    x23, x10, 0        // put loaded value in x23

Any time a processor intervenes and modifies the value in memory between the lr.d and sc.d instructions, the sc.d writes a nonzero value into x11, causing the code sequence to try again. At the end of this sequence, the contents of x23 and the memory location specified by x20 have been atomically exchanged.

Elaboration: Although it was presented for multiprocessor synchronization, atomic exchange is also useful for the operating system in dealing with multiple processes in a single processor. To make sure nothing interferes in a single processor, the store-conditional also fails if the processor does a context switch between the two instructions (see Chapter 5).

Elaboration: An advantage of the load-reserved/store-conditional mechanism is that it can be used to build other synchronization primitives, such as atomic compare and swap or atomic fetch-and-increment, which are used in some parallel programming models. These involve more instructions between the lr.d and the sc.d, but not too many.

Since the store-conditional will fail after either another attempted store to the load reservation address or any exception, care must be taken in choosing which instructions are inserted between the two instructions. In particular, only integer arithmetic, forward branches, and backward branches out of the load-reserved/store-conditional block can safely be permitted; otherwise, it is possible to create deadlock situations where the processor can never complete the sc.d because of repeated page faults. In addition, the number of instructions between the load-reserved and the store-conditional should be small to minimize the probability that either an unrelated event or a competing processor causes the store-conditional to fail frequently.
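The retry behavior of the load-reserved/store-conditional pair can be imitated in a few lines. The Python below is a toy model, not a description of real hardware: the class ReservationMemory and its methods are hypothetical names, each simulated processor holds at most one reservation, and any store to a reserved address cancels the reservations on it. Exceptions and context switches are not modeled.

```python
class ReservationMemory:
    """Toy shared memory with one load reservation per simulated processor."""

    def __init__(self):
        self.words = {}       # address -> value (unset addresses read as 0)
        self.reserved = {}    # processor id -> reserved address

    def load_reserved(self, cpu, address):
        self.reserved[cpu] = address
        return self.words.get(address, 0)

    def store(self, address, value):
        """Ordinary store: writes, and cancels every reservation on the address."""
        self.words[address] = value
        for cpu in [c for c, a in self.reserved.items() if a == address]:
            del self.reserved[cpu]

    def store_conditional(self, cpu, address, value):
        """Return 0 on success and 1 on failure, like the status register of sc.d."""
        if self.reserved.get(cpu) != address:
            return 1
        self.store(address, value)
        return 0


def atomic_exchange(memory, cpu, address, new_value):
    while True:
        old = memory.load_reserved(cpu, address)                       # lr.d
        if memory.store_conditional(cpu, address, new_value) == 0:     # sc.d
            return old                                                 # otherwise retry


def fetch_and_increment(memory, cpu, address):
    while True:
        old = memory.load_reserved(cpu, address)
        if memory.store_conditional(cpu, address, old + 1) == 0:
            return old


memory = ReservationMemory()
memory.store(0x100, 7)
old = atomic_exchange(memory, cpu=0, address=0x100, new_value=9)
before_increment = fetch_and_increment(memory, cpu=1, address=0x100)

# An intervening store between the load-reserved and the store-conditional
# makes the store-conditional fail and leaves memory untouched.
memory.load_reserved(0, 0x100)
memory.store(0x100, 5)
status = memory.store_conditional(0, 0x100, 1)
```

Fetch-and-increment differs from the exchange only in the value it tries to store, which is the sense in which other primitives can be built from the same pair.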
Elaboration: While the code above implemented an atomic exchange, the following code would more efficiently acquire a lock at the location in register x20, where the value 0 means the lock was free and 1 means the lock was acquired:

       addi    x12, x0, 1         // copy locked value
again: lr.d    x10, (x20)         // load-reserved to read lock
       bne     x10, x0, again     // check if it is 0 yet
       sc.d    x11, x12, (x20)    // attempt to store new value
       bne     x11, x0, again     // branch if store fails

We release the lock just using a regular store to write 0 into the location:

sd x0, 0(x20)    // free lock by writing 0
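The lock convention used in this section (0 means free, 1 means held, release by an ordinary store of 0) can be exercised with threads. In this sketch a Python threading.Lock stands in for the hardware guarantee that an exchange is indivisible; that substitution, and every name in the code, is an assumption made for illustration rather than a model of RISC-V itself.

```python
import threading
import time

FREE, HELD = 0, 1


class AtomicWord:
    """A memory word whose exchange cannot be interrupted.

    Assumption: the internal threading.Lock plays the role of the hardware
    that orders simultaneous exchanges.
    """

    def __init__(self, value=FREE):
        self._value = value
        self._hardware = threading.Lock()

    def exchange(self, new_value):
        with self._hardware:
            old = self._value
            self._value = new_value
            return old

    def store(self, new_value):
        with self._hardware:
            self._value = new_value


def acquire(lock_word):
    # Spin until the exchange returns FREE: only then did this thread set HELD.
    while lock_word.exchange(HELD) != FREE:
        time.sleep(0)          # yield so the current holder can make progress


def release(lock_word):
    lock_word.store(FREE)      # ordinary store of 0 frees the lock


def run_workers(num_threads, increments):
    lock_word = AtomicWord()
    counter = [0]

    def worker():
        for _ in range(increments):
            acquire(lock_word)
            counter[0] += 1    # critical section: one thread at a time
            release(lock_word)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter[0]


total = run_workers(num_threads=4, increments=200)
```

Because every increment happens inside the locked region, no update is lost regardless of how the threads interleave.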

When do you use primitives like load-reserved and store-conditional?

  1. When cooperating threads of a parallel program need to synchronize to get proper behavior for reading and writing shared data.
  2. When cooperating processes on a uniprocessor need to synchronize for reading and writing shared data.

2.12 Translating and Starting a Program

This section describes the four steps in transforming a C program in a file from storage (disk or flash memory) into a program running on a computer. Figure 2.20 shows the translation hierarchy. Some systems combine these steps to reduce translation time, but programs go through these four logical phases. This section follows this translation hierarchy.

Compiler

The compiler transforms the C program into an assembly language program, a symbolic form of what the machine understands. High-level language programs take many fewer lines of code than assembly language, so programmer productivity is much higher.

In 1975, many operating systems and assemblers were written in assembly language because memories were small and compilers were inefficient. The million-fold increase in memory capacity per single DRAM chip has reduced program size concerns, and optimizing compilers today can produce assembly language programs nearly as well as an assembly language expert, and sometimes even better for large programs.

Assembler

Since assembly language is an interface to higher-level software, the assembler can also treat common variations of machine language instructions as if they were instructions in their own right. The hardware need not implement these instructions; however, their appearance in assembly language simplifies translation and programming. Such instructions are called pseudoinstructions.

As mentioned above, the RISC-V hardware makes sure that register x0 always has the value 0. That is, whenever register x0 is used, it supplies a 0, and if the programmer attempts to change the value in x0, the new value is simply discarded. Register x0 is used to create the assembly language instruction that loads a constant into a register. Thus, the RISC-V assembler accepts the following instruction even though it is not found in the RISC-V machine language:

li x9, 123    // load immediate value 123 into register x9

The assembler converts this assembly language instruction into the machine language equivalent of the following instruction:

addi x9, x0, 123 // register x9 gets register x0 + 123

The RISC-V assembler also converts mv (move) into an addi instruction. Thus

mv x10, x11 // register x10 gets register x11

becomes

addi x10, x11, 0 // register x10 gets register x11 + 0
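These two rewrites are mechanical enough to sketch. The Python function below, whose name is hypothetical, expands li and mv exactly as described above; it handles only constants that fit the signed 12-bit immediate of addi (-2048 to 2047) and passes every other instruction through unchanged. A real assembler does considerably more.

```python
def expand_pseudo(line):
    """Rewrite `li` (small constants only) and `mv` as the equivalent addi."""
    mnemonic, _, rest = line.strip().partition(" ")
    operands = [operand.strip() for operand in rest.split(",")]
    if mnemonic == "li":
        rd, immediate = operands
        value = int(immediate, 0)            # accepts decimal, 0x..., 0o..., 0b...
        if not -2048 <= value <= 2047:
            raise ValueError("constant does not fit addi's 12-bit immediate")
        return f"addi {rd}, x0, {value}"
    if mnemonic == "mv":
        rd, rs = operands
        return f"addi {rd}, {rs}, 0"
    return line.strip()                      # real instructions are left alone


expanded_li = expand_pseudo("li x9, 123")
expanded_mv = expand_pseudo("mv x10, x11")
untouched = expand_pseudo("add x5, x6, x7")
```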

The assembler also accepts j Label to unconditionally branch to a label, as a stand-in for jal x0, Label. It also converts branches to faraway locations into a branch and a jump. As mentioned above, the RISC-V assembler allows large constants to be loaded into a register despite the limited size of the immediate instructions. Thus, the load immediate (li) pseudoinstruction introduced above can create constants larger than addi's immediate field can contain; the load address (la) macro works similarly for symbolic addresses. Finally, it can simplify the instruction set by determining which variation of an instruction the programmer wants. For example, the RISC-V assembler does not require the programmer to specify the immediate version of the instruction when using a constant for arithmetic and logical instructions; it just generates the proper opcode. Thus

and x9, x10, 15 // register x9 gets x10 AND 15

becomes

andi x9, x10, 15 // register x9 gets x10 AND 15

We include the "i" on the instructions to remind the reader that andi produces a different opcode in a different instruction format than the and instruction with no immediate operands.

In summary, pseudoinstructions give RISC-V a richer set of assembly language instructions than those implemented by the hardware. If you are going to write assembly programs, use pseudoinstructions to simplify your task. To understand the RISC-V architecture and be sure to get the best performance, however, study the real RISC-V instructions found in Figures 2.1 and 2.18.

Assemblers will also accept numbers in a variety of bases. In addition to binary and decimal, they usually accept a base that is more succinct than binary yet converts easily to a bit pattern. RISC-V assemblers use hexadecimal and octal.

Such features are convenient, but the primary task of an assembler is assembly into machine code. The assembler turns the assembly language program into an object file, which is a combination of machine language instructions, data, and information needed to place instructions properly in memory.

To produce the binary version of each instruction in the assembly language program, the assembler must determine the addresses corresponding to all labels. Assemblers keep track of labels used in branches and data transfer instructions in a symbol table. As you might expect, the table contains pairs of symbols and addresses.

The object file for UNIX systems typically contains six distinct pieces:

  • The object file header describes the size and position of the other pieces of the object file.
  • The text segment contains the machine language code.
  • The static data segment contains data allocated for the life of the program. (UNIX allows programs to use both static data, which is allocated throughout the program, and dynamic data, which can grow or shrink as needed by the program. See Figure 2.13.)
  • The relocation information identifies instructions and data words that depend on absolute addresses when the program is loaded into memory.
  • The symbol table contains the remaining labels that are not defined, such as external references.
  • The debugging information contains a concise description of how the modules were compiled so that a debugger can associate machine instructions with C source files and make data structures readable.
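The label bookkeeping an assembler performs can be sketched as a first pass over the source. The Python below assumes that every line holds at most one instruction, that each instruction occupies 4 bytes, and that comments start with //; the function names are hypothetical. The sample program is the lock-acquire sequence from Section 2.11.

```python
def build_symbol_table(lines, base_address=0):
    """First pass: map each label to the address of the instruction after it."""
    symbols = {}
    address = base_address
    for line in lines:
        text = line.split("//")[0].strip()       # drop the comment
        while ":" in text:                       # peel off any leading labels
            label, _, text = text.partition(":")
            symbols[label.strip()] = address
            text = text.strip()
        if text:                                 # one instruction = 4 bytes
            address += 4
    return symbols


def branch_offset(symbols, branch_address, label):
    """PC-relative byte offset from a branch to its target label."""
    return symbols[label] - branch_address


program = [
    "addi x12, x0, 1          // copy locked value",
    "again: lr.d x10, (x20)   // load-reserved to read lock",
    "bne x10, x0, again       // check if it is 0 yet",
    "sc.d x11, x12, (x20)     // attempt to store new value",
    "bne x11, x0, again       // branch if store fails",
]
symbols = build_symbol_table(program)
first_offset = branch_offset(symbols, 8, "again")     # bne at byte address 8
second_offset = branch_offset(symbols, 16, "again")   # bne at byte address 16
```

Labels that are used but never defined in the file would be left for the linker, which is why the object file carries its own symbol table and relocation information.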

The next subsection shows how to attach such routines that have already been assembled, such as library routines.

Linker

What we have presented so far suggests that a single change to one line of one procedure requires compiling and assembling the whole program. Complete retranslation is a terrible waste of computing resources. This repetition is particularly wasteful for standard library routines, because programmers would be compiling and assembling routines that by definition almost never change. An alternative is to compile and assemble each procedure independently, so that a change to one line would require compiling and assembling only one procedure. This alternative requires a new systems program, called a link editor or linker, which takes all the independently assembled machine language programs and "stitches" them together. The reason a linker is useful is that it is much faster to patch code than it is to recompile and reassemble.

There are three steps for the linker:

  1. Place code and data modules symbolically in memory.
  2. Determine the addresses of data and instruction labels.
  3. Patch both the internal and external references.

The linker uses the relocation information and symbol table in each object module to resolve all undefined labels. Such references occur in branch instructions and data addresses, so the job of this program is much like that of an editor: it finds the old addresses and replaces them with the new addresses. Editing is the origin of the name "link editor," or linker for short.

If all external references are resolved, the linker next determines the memory locations each module will occupy. Recall that Figure 2.13 on page 106 shows the RISC-V convention for allocation of program and data to memory. Since the files were assembled in isolation, the assembler could not know where a module's instructions and data would be placed relative to other modules. When the linker places a module in memory, all absolute references, that is, memory addresses that are not relative to a register, must be relocated to reflect its true location.

The linker produces an executable file that can be run on a computer. Typically, this file has the same format as an object file, except that it contains no unresolved references. It is possible to have partially linked files, such as library routines, that still have unresolved addresses and hence result in object files.
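The three linker steps can be traced on a toy object format. Everything about the format below is an assumption made for illustration: each module is a dictionary with a list of instructions, the labels it defines (as byte offsets), and relocation entries that map an instruction index to the label it refers to. The starting address 0x1000 is arbitrary, and the sketch records which addresses need patching instead of re-encoding instructions.

```python
def link(modules, text_base=0x1000):
    """Place modules back to back, resolve labels, and list the patches."""
    # Step 1: place each module's code in memory (4 bytes per instruction).
    bases, next_free = {}, text_base
    for name, module in modules.items():
        bases[name] = next_free
        next_free += 4 * len(module["text"])

    # Step 2: determine the absolute address of every defined label.
    symbols = {}
    for name, module in modules.items():
        for label, offset in module["defined"].items():
            symbols[label] = bases[name] + offset

    # Step 3: resolve each relocation entry to the address it must refer to.
    patches = {}
    for name, module in modules.items():
        for index, label in module["relocations"].items():
            if label not in symbols:
                raise KeyError(f"undefined reference to {label}")
            patches[bases[name] + 4 * index] = symbols[label]
    return symbols, patches


modules = {
    "main.o": {
        "text": ["jal x1, swap", "addi x10, x0, 0"],
        "defined": {"main": 0},
        "relocations": {0: "swap"},     # instruction 0 refers to an external label
    },
    "swap.o": {
        "text": ["slli x6, x11, 3", "jalr x0, 0(x1)"],
        "defined": {"swap": 0},
        "relocations": {},
    },
}
symbols, patches = link(modules)
```

Changing one module and relinking repeats only this bookkeeping, which is the saving the text describes over retranslating the whole program.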

Loader

Now that the executable file is on disk, the operating system reads it to memory and starts it. The loader follows these steps in UNIX systems:

  1. Reads the executable file header to determine the size of the text and data segments.
  2. Creates an address space large enough for the text and data.
  3. Copies the instructions and data from the executable file into memory.
  4. Copies the parameters (if any) to the main program onto the stack.
  5. Initializes the processor registers and sets the stack pointer to the first free location.
  6. Branches to a start-up routine that copies the parameters into the argument registers and calls the main routine of the program. When the main routine returns, the start-up routine terminates the program with an exit system call.

Dynamically Linked Libraries

The first part of this section describes the traditional approach to linking libraries before the program is run. Although this static approach is the fastest way to call library routines, it has a few disadvantages:

  • The library routines become part of the executable code. If a new version of the library is released that fixes bugs or supports new hardware devices, the statically linked program keeps using the old version.
  • It loads all routines in the library that are called anywhere in the executable, even if those calls are not executed. The library can be large relative to the program; for example, the standard C library on a RISC-V system running the Linux operating system is 1.5 MiB.

These disadvantages lead us to dynamically linked libraries (DLLs), where the library routines are not linked and loaded until the program is run. Both the program and library routines keep extra information on the location of nonlocal procedures and their names. In the original version of DLLs, the loader ran a dynamic linker, using the extra information in the file to find the appropriate libraries and to update all external references.

The downside of the initial version of DLLs was that it still linked all routines of the library that might be called, versus just those that are called during the running of the program. This observation led to the lazy procedure linkage version of DLLs, where each routine is linked only after it is called.

Like many innovations in our field, this trick relies on a level of indirection. Figure 2.21 shows the technique. It starts with the nonlocal routines calling a set of dummy routines at the end of the program, with one entry per nonlocal routine. These dummy entries each contain an indirect branch.
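The level of indirection can be sketched as a table whose entries initially point at a resolver and are overwritten on first use. The Python below is an analogy, not the mechanism of any particular operating system: the class LazyLinker, its table, and the sample library are all hypothetical, and Python callables stand in for machine code addresses.

```python
class LazyLinker:
    """Toy model of lazy procedure linkage through a table of indirect targets."""

    def __init__(self, library):
        self.library = library    # routine name -> callable standing in for library code
        self.table = {}           # routine name -> current target of the indirect branch
        self.resolutions = 0      # how many routines have actually been linked

    def stub(self, name):
        """Return the dummy entry that the program calls instead of the routine."""
        def resolve_then_call(*args):
            # First call only: find the routine and overwrite the table entry.
            self.resolutions += 1
            self.table[name] = self.library[name]
            return self.table[name](*args)

        self.table.setdefault(name, resolve_then_call)
        return lambda *args: self.table[name](*args)    # the indirect branch


linker = LazyLinker({"square": lambda x: x * x, "cube": lambda x: x ** 3})
square = linker.stub("square")
cube = linker.stub("cube")          # has an entry, but is never called
results = [square(3), square(4)]
```

Only the routine that is actually called gets resolved, and the second call goes straight through the table without visiting the resolver.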

The first time the library routine is called, the program calls the dummy entry and follows the indirect branch. It points to code that puts a number in a register to identify the desired library routine and then branches to the dynamic linker/loader. The linker/loader finds the wanted routine, remaps it, and changes the address in the indirect branch location to point to that routine. It then branches to it. When the routine completes, it returns to the original calling site. Thereafter, the call to the library routine branches indirectly to the routine without the extra hops.

In summary, DLLs require additional space for the information needed for dynamic linking, but do not require that whole libraries be copied or linked. They pay a good deal of overhead the first time a routine is called, but only a single indirect branch thereafter. Note that the return from the library pays no extra overhead. Microsoft's Windows relies extensively on dynamically linked libraries, and it is also the default when executing programs on UNIX systems today.

Starting a Java Program

The discussion above captures the traditional model of executing a program, where the emphasis is on fast execution time for a program targeted to a specific instruction set architecture, or even a particular implementation of that architecture. Indeed, it is possible to execute Java programs just like C. Java was invented with a different set of goals, however. One was to run safely on any computer, even if it might slow execution time.

Figure 2.22 shows the typical translation and execution steps for Java. Rather than compile to the assembly language of a target computer, Java is compiled first to instructions that are easy to interpret: the Java bytecode instruction set (see Section 2.15). This instruction set is designed to be close to the Java language so that this compilation step is trivial. Virtually no optimizations are performed. Like the C compiler, the Java compiler checks the types of data and produces the proper operation for each type. Java programs are distributed in the binary version of these bytecodes.

A software interpreter, called a Java Virtual Machine (JVM), can execute Java bytecodes. An interpreter is a program that simulates an instruction set architecture. For example, the RISC-V simulator used with this book is an interpreter. There is no need for a separate assembly step, since either the translation is so simple that the compiler fills in the addresses or the JVM finds them at runtime.

The upside of interpretation is portability. The availability of software Java virtual machines meant that most people could write and run Java programs shortly after Java was announced. Today, Java virtual machines are found in billions of devices, in everything from cell phones to Internet browsers.

The downside of interpretation is lower performance. The incredible advances in performance of the 1980s and 1990s made interpretation viable for many important applications, but the factor of 10 slowdown when compared to traditionally compiled C programs made Java unattractive for some applications.

To preserve portability and improve execution speed, the next phase of Java's development was compilers that translated while the program was running. Such Just In Time compilers (JIT) typically profile the running program to find where the "hot" methods are and then compile them into the native instruction set on which the virtual machine is running. The compiled portion is saved for the next time the program is run, so that it can run faster each time it is run. This balance of interpretation and compilation evolves over time, so that frequently run Java programs suffer little of the overhead of interpretation.

As computers get faster so that compilers can do more, and as researchers invent better ways to compile Java on the fly, the performance gap between Java and C or C++ is closing.
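To make "a program that simulates an instruction set" concrete, here is a minimal interpreter for a made-up stack instruction set. The opcodes below are hypothetical and are not Java bytecodes; the sketch only shows the fetch, decode, and execute loop that any such interpreter repeats for every instruction, which is also where the slowdown relative to native code comes from.

```python
def interpret(program, arguments=()):
    """Run a list of (opcode, operand) pairs on a small operand stack."""
    stack = list(arguments)
    pc = 0
    while pc < len(program):
        opcode, operand = program[pc]        # fetch and decode
        pc += 1
        if opcode == "push":                 # execute
            stack.append(operand)
        elif opcode == "add":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == "mul":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        elif opcode == "jump_if_zero":
            if stack.pop() == 0:
                pc = operand
        elif opcode == "return":
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
    raise RuntimeError("program ended without a return")


# Computes (2 + 3) * 4 on the operand stack.
program = [
    ("push", 2),
    ("push", 3),
    ("add", None),
    ("push", 4),
    ("mul", None),
    ("return", None),
]
result = interpret(program)
```

The same program list runs unchanged wherever the interpreter runs, which is the portability argument made above.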
Section 2.15 goes into much greater depth on the implementation of Java, Java bytecodes, JVM, and JIT compilers.

Which of the advantages of an interpreter over a translator was the most important for the designers of Java?

  1. Ease of writing an interpreter
  2. Better error messages
  3. Smaller object code
  4. Machine independence

2.13 A C Sort Example to Put it All Together

One danger of showing assembly language code in snippets is that you will have no idea what a full assembly language program looks like. In this section, we derive the RISC-V code from two procedures written in C: one to swap array elements and one to sort them.

The Procedure swap

Let's start with the code for the procedure swap in Figure 2.23. This procedure simply swaps two locations in memory. When translating from C to assembly language by hand, we follow these general steps:

  1. Allocate registers to program variables.
  2. Produce code for the body of the procedure.
  3. Preserve registers across the procedure invocation.

This section describes the swap procedure in these three pieces, concluding by putting all the pieces together.

Register Allocation for swap

As mentioned on page 98, the RISC-V convention on parameter passing is to use registers x10 to x17. Since swap has just two parameters, v and k, they will be found in registers x10 and x11. The only other variable is temp, which we associate with register x5 since swap is a leaf procedure (see page 102). This register allocation corresponds to the variable declarations in the first part of the swap procedure in Figure 2.23.

Code for the Body of the Procedure swap

The remaining lines of C code in swap are

temp = v[k];
v[k] = v[k+1];
v[k+1] = temp;

Recall that the memory address for RISC-V refers to the byte address, and so doublewords are really 8 bytes apart. Hence, we need to multiply the index k by 8 before adding it to the address. Forgetting that sequential doubleword addresses differ by 8 instead of by 1 is a common mistake in assembly language programming. Hence, the first step is to get the address of v[k] by multiplying k by 8 via a shift left by 3:

slli    x6, x11, 3     // reg x6 = k * 8
add     x6, x10, x6    // reg x6 = v + (k * 8)

Now we load v[k] using x6, and then v[k+1] by adding 8 to x6:

ld    x5, 0(x6)    // reg x5 (temp) = v[k]
ld    x7, 8(x6)    // reg x7 = v[k + 1]
                   // refers to next element of v

Next we store x7 and x5 to the swapped addresses:

sd    x7, 0(x6)    // v[k] = reg x7
sd    x5, 8(x6)    // v[k+1] = reg x5 (temp)

Now we have allocated registers and written the code to perform the operations of the procedure. What is missing is the code for preserving the saved registers used within swap. Since we are not using saved registers in this leaf procedure, there is nothing to preserve.

The Full swap Procedure

We are now ready for the whole routine. All that remains is to add the procedure label and the return branch.

swap:
slli    x6, x11, 3     // reg x6 = k * 8
add     x6, x10, x6    // reg x6 = v + (k * 8)
ld      x5, 0(x6)      // reg x5 (temp) = v[k]
ld      x7, 8(x6)      // reg x7 = v[k + 1]
sd      x7, 0(x6)      // v[k] = reg x7
sd      x5, 8(x6)      // v[k+1] = reg x5 (temp)
jalr    x0, 0(x1)      // return to calling routine
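The byte-address arithmetic in swap can be checked by mirroring the routine one instruction at a time. In this Python sketch, memory is modeled as a dictionary from byte address to doubleword value, and the base address 0x2000 is arbitrary; both choices are assumptions made for illustration. The local names follow the registers used above.

```python
def swap(memory, v, k):
    """Mirror the body of swap on a byte-addressed memory of doublewords."""
    x6 = k << 3              # slli x6, x11, 3   : k * 8
    x6 = v + x6              # add  x6, x10, x6  : address of v[k]
    x5 = memory[x6]          # ld   x5, 0(x6)    : temp = v[k]
    x7 = memory[x6 + 8]      # ld   x7, 8(x6)    : v[k + 1]
    memory[x6] = x7          # sd   x7, 0(x6)    : v[k] = v[k + 1]
    memory[x6 + 8] = x5      # sd   x5, 8(x6)    : v[k + 1] = temp


base = 0x2000                                        # hypothetical address of v
memory = {base + 8 * i: value for i, value in enumerate([10, 20, 30, 40])}
swap(memory, base, 1)
values = [memory[base + 8 * i] for i in range(4)]
```

Using k instead of k * 8 as the offset would address bytes inside v[0], which is the common mistake the text warns about.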

The Procedure sort

To ensure that you appreciate the rigor of programming in assembly language, we'll try a second, longer example. In this case, we'll build a routine that calls the swap procedure. This program sorts an array of integers, using bubble or exchange sort, which is one of the simplest if not the fastest sorts. Figure 2.24 shows the C version of the program. Once again, we present this procedure in several steps, concluding with the full procedure.

Register Allocation for sort

The two parameters of the procedure sort, v and n, are in the parameter registers x10 and x11, and we assign register x19 to i and register x20 to j.

Code for the Body of the Procedure sort

The procedure body consists of two nested for loops and a call to swap that includes parameters. Let's unwrap the code from the outside to the middle.

The first translation step is the first for loop:

for (i = 0; i < n; i += 1) {

Recall that the C for statement has three parts: initialization, loop test, and iteration increment. It takes just one instruction to initialize i to 0, the first part of the for statement:

li x19, 0

(Remember that li is a pseudoinstruction provided by the assembler for the convenience of the assembly language programmer; see page 125.) It also takes just one instruction to increment i, the last part of the for statement:

addi x19, x19, 1 // i += 1

The loop should be exited if i < n is not true or, said another way, should be exited if i ≥ n. This test takes just one instruction:

for1tst: bge x19, x11, exit1 // go to exit1 if x19 >= x11 (i >= n)

The bottom of the loop just branches back to the loop test:

j for1tst // branch to test of outer loop
exit1:

The skeleton code of the first for loop is then

li x19, 0    // i = 0
for1tst:
bge x19, x11, exit1 // go to exit1 if x19 >= x11 (i >= n)
...
(body of first for loop)
...
addi x19, x19, 1 // i += 1
j for1tst    // branch to test of outer loop
exit1:

Voila! (The exercises explore writing faster code for similar loops.)

The second for loop looks like this in C:

for (j = i - 1; j >= 0 && v[j] > v[j + 1]; j -= 1) {

The initialization portion of this loop is again one instruction:

addi x20, x19, -1    // j = i - 1

The decrement of j at the end of the loop is also one instruction:

addi x20, x20, -1    // j -= 1

The loop test has two parts. We exit the loop if either condition fails, so the first test must exit the loop if it fails (j < 0):

for2tst:
blt x20, x0, exit2    // go to exit2 if x20 < 0 (j < 0)

This branch will skip over the second condition test. If it doesn't skip, then j ≥ 0.

The second test exits if v[j] > v[j + 1] is not true, or exits if v[j] ≤ v[j + 1]. First we create the address by multiplying j by 8 (since we need a byte address) and add it to the base address of v:

slli    x5, x20, 3     // reg x5 = j * 8
add     x5, x10, x5    // reg x5 = v + (j * 8)

Now we load v[j]:

ld    x6, 0(x5)    // reg x6 = v[j]

Since we know that the second element is just the following doubleword, we add 8 to the address in register x5 to get v[j + 1]:

ld    x7, 8(x5)    // reg x7 = v[j + 1]

We test v[j] ≤ v[j + 1] to exit the loop:

ble x6, x7, exit2 // go to exit2 if x6 <= x7

The bottom of the loop branches back to the inner loop test:

j    for2tst    // branch to test of inner loop

Combining the pieces, the skeleton of the second for loop looks like this:

addi x20, x19, -1 // j = i - 1
for2tst: blt x20, x0, exit2 // go to exit2 if x20 < 0 (j < 0)
slli    x5, x20, 3    // reg x5 = j * 8
add    x5, x10, x5    // reg x5 = v + (j * 8)
ld    x6, 0(x5)    // reg x6 = v[j]
ld    x7, 8(x5)    // reg x7 = v[j + 1]
ble    x6, x7, exit2    // go to exit2 if x6 ≤ x7
    ...
    (body of second for loop)
    ...
addi    x20, x20, -1    // j -= 1
j    for2tst    // branch to test of inner loop

exit2:
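The control flow of this skeleton can also be written back in C with explicit labels, which makes the two-part loop test easier to see. The following is only a sketch: the helper names inner_loop and sort, and the use of a signed long long index (so that j can reach -1), are assumptions made for illustration and are not taken from the book's figures.

```c
/* Exchange v[k] and v[k + 1], as the swap procedure does. */
static void swap(long long v[], long long k) {
    long long temp = v[k];
    v[k] = v[k + 1];
    v[k + 1] = temp;
}

/* Inner loop of sort, written with labels and gotos so that each C
   statement lines up with one step of the assembly skeleton.
   Label names follow the skeleton; i is the outer-loop index. */
static void inner_loop(long long v[], long long i) {
    long long j = i - 1;               /* addi x20, x19, -1          */
for2tst:
    if (j < 0) goto exit2;             /* blt  x20, x0, exit2        */
    if (v[j] <= v[j + 1]) goto exit2;  /* two loads, then ble        */
    swap(v, j);                        /* body of second for loop    */
    j -= 1;                            /* addi x20, x20, -1          */
    goto for2tst;                      /* j for2tst                  */
exit2:
    return;
}

/* Outer loop: run the inner loop for i = 0 .. n - 1. */
void sort(long long v[], long long n) {
    for (long long i = 0; i < n; i += 1)
        inner_loop(v, i);
}
```

Because the first test exits before v[j] is ever read, the loads are never executed with a negative index, which is exactly what the order of the two branches in the assembly guarantees.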

The Procedure Call in sort

The next step is the body of the second for loop:

swap(v,j);

Calling swap is easy enough:

jal x1, swap

Passing Parameters in sort

The problem comes when we want to pass parameters, because the sort procedure needs the values in registers x10 and x11, yet the swap procedure needs to have its parameters placed in those same registers. One solution is to copy the parameters for sort into other registers earlier in the procedure, making registers x10 and x11 available for the call of swap. (This copy is faster than saving and restoring on the stack.) We first copy x10 and x11 into x21 and x22 during the procedure:

mv    x21, x10    // copy parameter x10 into x21
mv    x22, x11    // copy parameter x11 into x22

Then we pass the parameters to swap with these two instructions:

mv    x10, x21    // first swap parameter is v
mv    x11, x20    // second swap parameter is j

Preserving Registers in sort

The only remaining code is the saving and restoring of registers. Clearly we must save the return address in register x1, since sort is a procedure and is itself called. The sort procedure also uses the callee-saved registers x19, x20, x21, and x22, so they must be saved. The prologue of the sort procedure is then

addi    sp, sp, -40    // make room on stack for 5 regs
sd    x1, 32(sp)    // save x1 on stack
sd    x22, 24(sp)    // save x22 on stack
sd    x21, 16(sp)    // save x21 on stack
sd    x20, 8(sp)    // save x20 on stack
sd    x19, 0(sp)    // save x19 on stack

The tail of the procedure simply reverses all these instructions, and then adds a jalr to return.

The Full Procedure sort

Now we put all the pieces together in Figure 2.25, being careful to replace references to registers x10 and x11 in the for loops with references to registers x21 and x22. Once again, to make the code easier to follow, we identify each block of code with its purpose in the procedure. In this example, nine lines of the sort procedure in C became 34 lines in the RISC-V assembly language.

Elaboration : One optimization that works with this example is procedure inlining. Instead of passing arguments in parameters and invoking the code with a jal instruction, the compiler would copy the code from the body of the swap procedure to the place where the call to swap appears in the code. Inlining would avoid four instructions in this example. The downside of the inlining optimization is that the compiled code would be bigger if the inlined procedure is called from several locations. Such a code expansion might turn into lower performance if it increased the cache miss rate; see Chapter 5.
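At the source level, inlining amounts to something like the sketch below, where the three statements of swap replace the call. The name sort_inlined and the signed index type are assumptions for illustration; a real compiler performs this transformation on its internal representation rather than on C text.

```c
/* sort with the body of swap copied into the loop by hand.
   No call, no argument copies, and no return remain: only the
   three statements that perform the exchange. */
void sort_inlined(long long v[], long long n) {
    for (long long i = 0; i < n; i += 1) {
        for (long long j = i - 1; j >= 0 && v[j] > v[j + 1]; j -= 1) {
            long long temp = v[j];   /* was: swap(v, j); */
            v[j] = v[j + 1];
            v[j + 1] = temp;
        }
    }
}
```

If swap were called from many places, each call site would receive its own copy of these statements, which is the code growth the elaboration warns about.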

2.14 Arrays versus Pointers

A challenge for any new C programmer is understanding pointers. Comparing assembly code that uses arrays and array indices to the assembly code that uses pointers offers insights about pointers. This section shows C and RISC-V assembly versions of two procedures to clear a sequence of doublewords in memory: one using array indices and one with pointers. Figure 2.28 shows the two C procedures.

The purpose of this section is to show how pointers map into RISC-V instructions, and not to endorse a dated programming style. We'll see the impact of modern compiler optimization on these two procedures at the end of the section.
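Since the figure is not reproduced here, the following sketch shows what the two procedures might look like. It is inferred from the assembly walk-through in this section rather than copied from the figure, and the signed long long type for size is an assumption.

```c
/* Array-index version: the address of array[i] is recomputed
   from i on every iteration. */
void clear1(long long array[], long long size) {
    for (long long i = 0; i < size; i += 1)
        array[i] = 0;
}

/* Pointer version: p walks through the array directly and the
   loop stops when p reaches the address just past the array. */
void clear2(long long *array, long long size) {
    for (long long *p = &array[0]; p < &array[size]; p = p + 1)
        *p = 0;
}
```

Both procedures do the same work; they differ only in whether the loop variable is an index or an address.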

Array Version of Clear

Let's start with the array version, clear1, focusing on the body of the loop and ignoring the procedure linkage code. We assume that the two parameters array and size are found in the registers x10 and x11, and that i is allocated to register x5.

The initialization of i, the first part of the for loop, is straightforward:

li    x5, 0    // i = 0 (register x5 = 0)

To set array[i] to 0 we must first get its address. Start by multiplying i by 8 to get the byte address:

loop1: slli x6, x5, 3    // x6 = i * 8

Since the starting address of the array is in a register, we must add it to the index to get the address of array[i] using an add instruction:

add x7, x10, x6 // x7 = address of array[i]

Finally we can store 0 in that address:

sd x0, 0(x7)    // array[i] = 0

This instruction is the end of the body of the loop, so the next step is to increment i:

addi x5, x5, 1    // i = i + 1

The loop test checks if i is less than size:

blt x5, x11, loop1 // if (i < size) go to loop1

We have now seen all the pieces of the procedure. Here is the RISC-V code for clearing an array using indices:

li    x5, 0    // i = 0
loop1: slli    x6, x5, 3    // x6 = i * 8
add    x7, x10, x6    // x7 = address of array[i]
sd    x0, 0(x7)    // array[i] = 0
addi    x5, x5, 1    // i = i + 1
blt    x5, x11, loop1    // if (i < size) go to loop1

(This code works as long as size is greater than 0; ANSI C requires a test of size before the loop, but we'll skip that legality here.)

Pointer Version of Clear

The second procedure, which uses pointers, allocates the two parameters array and size to the registers x10 and x11 and allocates p to register x5. The code for the second procedure starts with assigning the pointer p to the address of the first element of the array:

mv x5, x10 // p = address of array[0]

The next code is the body of the for loop, which simply stores 0 into the location that p points to:

loop2: sd x0, 0(x5)    // Memory[p] = 0

This instruction implements the body of the loop, so the next code is the iteration increment, which changes p to point to the next doubleword:

addi x5, x5, 8    // p = p + 8

Incrementing a pointer by 1 means moving the pointer to the next sequential object in C. Since p is a pointer to integers declared as long long int, each of which uses 8 bytes, the compiler increments p by 8.

The loop test is next. The first step is calculating the address just past the last element of array. Start with multiplying size by 8 to get the size of the array in bytes:

slli x6, x11, 3    // x6 = size * 8

and then we add the product to the starting address of the array to get the address of the first doubleword after the array:

add x7, x10, x6 // x7 = address of array[size]

The loop test is simply to see if p is less than the address just past the last element of array:

bltu x5, x7, loop2 // if (p<&array[size]) go to loop2

With all the pieces completed, we can show a pointer version of the code to zero an array:

    mv    x5, x10    // p = address of array[0]
loop2:    sd    x0, 0(x5)    // Memory[p] = 0
    addi    x5, x5, 8    // p = p + 8
    slli    x6, x11, 3    // x6 = size * 8
    add    x7, x10, x6    // x7 = address of array[size]
    bltu    x5, x7, loop2    // if (p<&array[size]) go to loop2

As in the first example, this code assumes size is greater than 0.

Note that this program calculates the address of the end of the array in every iteration of the loop, even though it does not change. A faster version of the code moves this calculation outside the loop:

    mv    x5, x10       // p = address of array[0]
    slli  x6, x11, 3    // x6 = size * 8
    add   x7, x10, x6   // x7 = address of array[size]
loop2:    sd    x0, 0(x5)    // Memory[p] = 0
    addi  x5, x5, 8     // p = p + 8
    bltu  x5, x7, loop2 // if (p < &array[size]) go to loop2
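To make the byte arithmetic in this faster version visible at the C level, the sketch below does the scaling by hand through a char pointer instead of letting the compiler scale p + 1. The name clear2_hoisted is hypothetical, and sizeof(long long) is 8 only on targets like the 64-bit ones this chapter assumes.

```c
#include <assert.h>
#include <stddef.h>

/* Pointer version with the end address computed once, before the
   loop, and with the loop test at the bottom as in the assembly. */
void clear2_hoisted(long long *array, size_t size) {
    assert(size > 0);                                  /* same assumption as the assembly */
    char *p   = (char *)array;                         /* mv   x5, x10                    */
    char *end = (char *)array + size * sizeof *array;  /* slli, then add                  */
    do {
        *(long long *)p = 0;                           /* sd   x0, 0(x5)                  */
        p += sizeof(long long);                        /* addi x5, x5, 8                  */
    } while (p < end);                                 /* bltu x5, x7, loop2              */
}
```

In ordinary C one would simply write p = p + 1 on a long long pointer; the compiler supplies the multiplication by the object size, which is why the assembly adds 8 rather than 1.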

Comparing the Two Versions of Clear

Comparing the two code sequences side by side illustrates the difference between array indices and pointers (the changes introduced by the pointer version are highlighted):

The version on the left must have the "multiply" and add inside the loop because i is incremented and each address must be recalculated from the new index. The memory pointer version on the right increments the pointer p directly. The pointer version moves the scaling shift and the array bound addition outside the loop, thereby reducing the instructions executed per iteration from five to three. This manual optimization corresponds to the compiler optimization of strength reduction (shift instead of multiply) and induction variable elimination (eliminating array address calculations within loops). The compiler overview that follows describes these two and many other optimizations.

Elaboration : As mentioned earlier, a C compiler would add a test to be sure that size is greater than 0. One way would be to branch to the instruction after the loop with bge x0, x11, afterLoop, which skips the loop when size is not greater than 0.
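The effect of such a guard can be sketched in C with the same bottom-tested loop shape as the assembly. The name clear1_guarded is an assumption for illustration; the label names follow the text.

```c
/* Array version with an explicit test before the loop, so that a
   size of 0 (or a negative size) clears nothing. */
void clear1_guarded(long long array[], long long size) {
    long long i = 0;                 /* li x5, 0                          */
    if (size <= 0) goto afterLoop;   /* guard: branch past the loop       */
loop1:
    array[i] = 0;                    /* slli, add, sd                     */
    i = i + 1;                       /* addi x5, x5, 1                    */
    if (i < size) goto loop1;        /* blt x5, x11, loop1                */
afterLoop:
    return;
}
```

Without the guard, the bottom-tested loop would store one zero before ever comparing i with size.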

This section gives a brief overview of how the C compiler works and how Java is executed. Because the compiler will significantly affect the performance of a computer, understanding compiler technology today is critical to understanding performance. Keep in mind that the subject of compiler construction is usually taught in a one- or two-semester course, so our introduction will necessarily only touch on the basics.

The second part of this section is for readers interested in seeing how an object-oriented language like Java executes on a RISC-V architecture. It shows the Java byte-codes used for interpretation and the RISC-V code for the Java version of some of the C segments in prior sections, including Bubble Sort. It covers both the Java Virtual Machine and JIT compilers.

The rest can be found online.

2.16 Real Stuff: MIPS Instructions

The instruction set most similar to RISC-V, MIPS, also originated in academia, but is now owned by Imagination Technologies. MIPS and RISC-V share the same design philosophy, despite MIPS being 25 years more senior than RISC-V. The good news is that if you know RISC-V, it will be very easy to pick up MIPS. To show their similarity, Figure 2.29 compares instruction formats for RISC-V and MIPS.

The MIPS ISA has both 32-bit address and 64-bit address versions, sensibly called MIPS-32 and MIPS-64. These instruction sets are virtually identical except for the larger address size needing 64-bit registers instead of 32-bit registers. Here are the common features between RISC-V and MIPS:

  • All instructions are 32 bits wide for both architectures.
  • Both have 32 general-purpose registers, with one register being hardwired to 0.
  • The only way to access memory is via load and store instructions on both architectures.
  • Unlike some architectures, there are no instructions that can load or store many registers in MIPS or RISC-V.
  • Both have instructions that branch if a register is equal to zero and branch if a register is not equal to zero.
  • Both sets of addressing modes work for all word sizes.

One of the main differences between RISC-V and MIPS is for conditional branches other than equal or not equal. Whereas RISC-V simply provides branch instructions to compare two registers, MIPS relies on a comparison instruction that sets a register to 0 or 1 depending on whether the comparison is true. Programmers then follow that comparison instruction with a branch on equal to or not equal to zero, depending on the desired outcome of the comparison. In keeping with its minimalist philosophy, MIPS only performs less-than comparisons, leaving it up to the programmer to switch the order of operands or to switch the condition being tested by the branch to get all the desired outcomes. MIPS has both signed and unsigned versions of the set on less than instructions: slt and sltu.
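The way all the comparisons fall out of a single less-than test can be sketched in C. The functions below are illustrative models only: slt and sltu stand for the MIPS instructions of the same name, and each branch_ function (hypothetical names) returns 1 when the corresponding two-instruction sequence would take its branch.

```c
#include <stdint.h>

/* Set on less than: 1 if a < b, else 0 (signed and unsigned forms). */
static int slt(int32_t a, int32_t b)    { return a < b; }
static int sltu(uint32_t a, uint32_t b) { return a < b; }

/* Each comparison uses only slt plus a test of the result against 0. */
int branch_lt(int32_t a, int32_t b) { return slt(a, b) != 0; } /* slt, then branch if not zero   */
int branch_ge(int32_t a, int32_t b) { return slt(a, b) == 0; } /* same slt, branch if zero       */
int branch_gt(int32_t a, int32_t b) { return slt(b, a) != 0; } /* operands switched              */
int branch_le(int32_t a, int32_t b) { return slt(b, a) == 0; } /* switched, branch if zero       */

/* Unsigned less than uses sltu in the same way. */
int branch_ltu(uint32_t a, uint32_t b) { return sltu(a, b) != 0; }
```

Switching the operands turns less-than into greater-than, and switching the branch condition negates the result, so four signed comparisons need only one comparison instruction.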

When we look beyond the core instructions that are most commonly used, the other main difference is that the full MIPS is a much larger instruction set than RISC-V, as we shall see in Section 2.18.

2.17 Real Stuff: x86 Instructions

Designers of instruction sets sometimes provide more powerful operations than those found in RISC-V and MIPS. The goal is generally to reduce the number of instructions executed by a program. The danger is that this reduction can occur at the cost of simplicity, increasing the time a program takes to execute because the instructions are slower. This slowness may be the result of a slower clock cycle time or of requiring more clock cycles than a simpler sequence.

The path toward operation complexity is thus fraught with peril. Section 2.19 demonstrates the pitfalls of complexity.

Evolution of the Intel x86

RISC-V and MIPS were the vision of single groups working at the same time; the pieces of these architectures fit nicely together. Such is not the case for the x86; it is the product of several independent groups who evolved the architecture over almost 40 years, adding new features to the original instruction set as someone might add clothing to a packed bag. Here are important x86 milestones.

  • 1978 : The Intel 8086 architecture was announced as an assembly language-compatible extension of the then-successful Intel 8080, an 8-bit microprocessor. The 8086 is a 16-bit architecture, with all internal registers 16 bits wide. Unlike RISC-V, the registers have dedicated uses, and hence the 8086 is not considered a general-purpose register (GPR) architecture.
  • 1980 : The Intel 8087 floating-point coprocessor is announced. This architecture extends the 8086 with about 60 floating-point instructions. Instead of using registers, it relies on a stack (see Section 2.21 and Section 3.7).
  • 1982 : The 80286 extended the 8086 architecture by increasing the address space to 24 bits, by creating an elaborate memory-mapping and protection model (see Chapter 5), and by adding a few instructions to round out the instruction set and to manipulate the protection model.
  • 1985 : The 80386 extended the 80286 architecture to 32 bits. In addition to a 32-bit architecture with 32-bit registers and a 32-bit address space, the 80386 added new addressing modes and additional operations. The expanded instructions make the 80386 nearly a general-purpose register machine. The 80386 also added paging support in addition to segmented addressing (see Chapter 5). Like the 80286, the 80386 has a mode to execute 8086 programs without change.
  • 1989-95 : The subsequent 80486 in 1989, Pentium in 1992, and Pentium Pro in 1995 were aimed at higher performance, with only four instructions added to the user-visible instruction set: three to help with multiprocessing (see Chapter 6) and a conditional move instruction.
  • 1997 : After the Pentium and Pentium Pro were shipping, Intel announced that it would expand the Pentium and the Pentium Pro architectures with MMX (Multi Media Extensions). This new set of 57 instructions uses the floating-point stack to accelerate multimedia and communication applications. MMX instructions typically operate on multiple short data elements at a time, in the tradition of single instruction, multiple data (SIMD) architectures (see Chapter 6). Pentium II did not introduce any new instructions.
  • 1999 : Intel added another 70 instructions, labeled SSE (Streaming SIMD Extensions) as part of Pentium III. The primary changes were to add eight separate registers, double their width to 128 bits, and add a single precision floating-point data type. Hence, four 32-bit floating-point operations can be performed in parallel. To improve memory performance, SSE includes cache prefetch instructions plus streaming store instructions that bypass the caches and write directly to memory.
  • 2001 : Intel added yet another 144 instructions, this time labeled SSE2. The new data type is double precision arithmetic which allows pairs of 64-bit floating-point operations in parallel. Almost all of these 144 instructions are versions of existing MMX and SSE instructions that operate on 64 bits of data in parallel. Not only does this change enable more multimedia operations; it gives the compiler a different target for floating-point operations than the unique stack architecture. Compilers can choose to use the eight SSE registers as floating-point registers like those found in other computers. This change boosted the floating-point performance of the Pentium 4, the first microprocessor to include SSE2 instructions.
  • 2003 : A company other than Intel enhanced the x86 architecture this time. AMD announced a set of architectural extensions to increase the address space from 32 to 64 bits. Similar to the transition from a 16- to 32-bit address space in 1985 with the 80386, AMD64 widens all registers to 64 bits. It also increases the number of registers to 16 and increases the number of 128-bit SSE registers to 16. The primary ISA change comes from adding a new mode called long mode that redefines the execution of all x86 instructions with 64-bit addresses and data. To address the larger number of registers, it adds a new prefix to instructions. Depending on how you count, long mode also adds four to 10 new instructions and drops 27 old ones. PC-relative data addressing is another extension. AMD64 still has a mode that is identical to x86 (legacy mode) plus a mode that restricts user programs to x86 but allows operating systems to use AMD64 (compatibility mode). These modes allow a more graceful transition to 64-bit addressing than the HP/Intel IA-64 architecture.
  • 2004 : Intel capitulates and embraces AMD64, relabeling it Extended Memory 64 Technology (EM64T). The major difference is that Intel added a 128-bit atomic compare and swap instruction, which probably should have been included in AMD64. At the same time, Intel announced another generation of media extensions. SSE3 adds 13 instructions to support complex arithmetic, graphics operations on arrays of structures, video encoding, floating-point conversion, and thread synchronization (see Section 2.11). AMD added SSE3 in subsequent chips and the missing atomic swap instruction to AMD64 to maintain binary compatibility with Intel.
  • 2006 : Intel announces 54 new instructions as part of the SSE4 instruction set extensions. These extensions perform tweaks like sum of absolute differences, dot products for arrays of structures, sign or zero extension of narrow data to wider sizes, population count, and so on. They also added support for virtual machines (see Chapter 5).
  • 2007 : AMD announces 170 instructions as part of SSE5, including 46 instructions of the base instruction set that adds three operand instructions like RISC-V.
  • 2011 : Intel ships the Advanced Vector Extension that expands the SSE register width from 128 to 256 bits, thereby redefining about 250 instructions and adding 128 new instructions.

This history illustrates the impact of the "golden handcuffs" of compatibility on the x86, as the existing software base at each step was too important to jeopardize with significant architectural changes.

Whatever the artistic failures of the x86, keep in mind that this instruction set largely drove the PC generation of computers and still dominates the Cloud portion of the post-PC era. Manufacturing 350M x86 chips per year may seem small compared to 14 billion ARM chips, but many companies would love to control such a market. Nevertheless, this checkered ancestry has led to an architecture that is difficult to explain and impossible to love.

Brace yourself for what you are about to see! Do not try to read this section with the care you would need to write x86 programs; the goal instead is to give you familiarity with the strengths and weaknesses of the world's most popular desktop architecture.

Rather than show the entire 16-bit, 32-bit, and 64-bit instruction set, in this section we concentrate on the 32-bit subset that originated with the 80386. We start our explanation with the registers and addressing modes, move on to the integer operations, and conclude with an examination of instruction encoding.

x86 Registers and Data Addressing Modes

The registers of the 80386 show the evolution of the instruction set (Figure 2.30). The 80386 extended all 16-bit registers (except the segment registers) to 32 bits, prefixing an E to their name to indicate the 32-bit version. We'll refer to them generically as GPRs (general-purpose registers). The 80386 contains only eight GPRs. This means RISC-V and MIPS programs can use four times as many.

Figure 2.31 shows that the arithmetic, logical, and data transfer instructions are two-operand instructions. There are two important differences here. The x86 arithmetic and logical instructions must have one operand act as both a source and a destination; RISC-V and MIPS allow separate registers for source and destination. This restriction puts more pressure on the limited registers, since one source register must be modified. The second important difference is that one of the operands can be in memory. Thus, virtually any instruction may have one operand in memory, unlike RISC-V and MIPS.

Data memory-addressing modes, described in detail below, offer two sizes of addresses within the instruction. These so-called displacements can be 8 bits or 32 bits.

Although a memory operand can use any addressing mode, there are restrictions on which registers can be used in a mode. Figure 2.32 shows the x86 addressing modes and which GPRs cannot be used with each mode, as well as how to get the same effect using RISC-V instructions.

x86 Integer Operations

The 8086 provides support for both 8-bit (byte) and 16-bit (word) data types. The 80386 adds 32-bit addresses and data (doublewords) in the x86. (AMD64 adds 64-bit addresses and data, called quad words; we'll stick to the 80386 in this section.) The data type distinctions apply to register operations as well as memory accesses.

Almost every operation works on both 8-bit data and on one longer data size. That size is determined by the mode and is either 16 bits or 32 bits.

Clearly, some programs want to operate on data of all three sizes, so the 80386 architects provided a convenient way to specify each version without expanding code size significantly. They decided that either 16-bit or 32-bit data dominate most programs, and so it made sense to be able to set a default large size. This default data size is set by a bit in the code segment register. To override the default data size, an 8-bit prefix is attached to the instruction to tell the machine to use the other large size for this instruction.

The prefix solution was borrowed from the 8086, which allows multiple prefixes to modify instruction behavior. The three original prefixes override the default segment register, lock the bus to support synchronization (see Section 2.11), or repeat the following instruction until the register ECX counts down to 0. This last prefix was intended to be paired with a byte move instruction to move a variable number of bytes. The 80386 also added a prefix to override the default address size.

The x86 integer operations can be divided into four major classes:

  1. Data movement instructions, including move, push, and pop.
  2. Arithmetic and logic instructions, including test, integer, and decimal arithmetic operations.
  3. Control flow, including conditional branches, unconditional branches, calls, and returns.
  4. String instructions, including string move and string compare.

The first two categories are unremarkable, except that the arithmetic and logic instruction operations allow the destination to be either a register or a memory location. Figure 2.33 shows some typical x86 instructions and their functions.

Conditional branches on the x86 are based on condition codes or flags. Condition codes are set as a side effect of an operation; most are used to compare the value of a result to 0. Branches then test the condition codes. PC-relative branch addresses must be specified in the number of bytes, since unlike RISC-V and MIPS, 80386 instructions have no alignment restriction.

String instructions are part of the 8080 ancestry of the x86 and are not commonly executed in most programs. They are often slower than equivalent software routines (see the first Fallacy in Section 2.19).

Figure 2.34 lists some of the integer x86 instructions. Many of the instructions are available in both byte and word formats.
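The idea of setting condition codes as a side effect and testing them later can be modeled in a few lines of C. This is a simplified sketch, not a description of any figure in this section: the Flags structure and the function names are hypothetical, and only four of the x86 flags (zero, sign, carry, overflow) and three branch conditions are modeled.

```c
#include <stdint.h>

/* A hypothetical model of four x86 condition codes. */
typedef struct {
    int zf;  /* zero flag: the result is 0                 */
    int sf;  /* sign flag: top bit of the result           */
    int cf;  /* carry flag: borrow in unsigned subtraction */
    int of;  /* overflow flag: signed overflow             */
} Flags;

/* Model of a 32-bit compare: compute a - b and keep only the flags. */
Flags cmp32(uint32_t a, uint32_t b) {
    uint32_t r = a - b;                       /* wraps modulo 2^32 */
    Flags f;
    f.zf = (r == 0);
    f.sf = (int)(r >> 31);
    f.cf = (a < b);
    f.of = (int)(((a ^ b) & (a ^ r)) >> 31);  /* signs differ and result sign flipped */
    return f;
}

/* Branch conditions test the flags, never the operands themselves. */
int taken_je(Flags f) { return f.zf; }           /* equal              */
int taken_jl(Flags f) { return f.sf != f.of; }   /* signed less than   */
int taken_jb(Flags f) { return f.cf; }           /* unsigned below     */
```

The compare and the branch are decoupled: the branch sees only the flags, so the same compare can feed a signed test, an unsigned test, or an equality test.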

x86 Instruction Encoding

Saving the worst for last, the encoding of instructions in the 80386 is complex, with many different instruction formats. Instructions for the 80386 may vary from 1 byte, when there is only one operand, up to 15 bytes.

Figure 2.35 shows the instruction format for several of the example instructions in Figure 2.33. The opcode byte usually contains a bit saying whether the operand is 8 bits or 32 bits. For some instructions, the opcode may include the addressing mode and the register; this is true in many instructions that have the form "register = register op immediate." Other instructions use a "postbyte" or extra opcode byte, labeled "mod, reg, r/m," which contains the addressing mode information. This postbyte is used for many of the instructions that address memory. The base plus scaled index mode uses a second postbyte, labeled "sc, index, base."

Figure 2.36 shows the encoding of the two postbyte address specifiers for both 16-bit and 32-bit modes. Unfortunately, to understand fully which registers and which addressing modes are available, you need to see the encoding of all addressing modes and sometimes even the encoding of the instructions.

x86 Conclusion

Intel had a 16-bit microprocessor two years before its competitors' more elegant architectures, such as the Motorola 68000, and this head start led to the selection of the 8086 as the CPU for the IBM PC. Intel engineers generally acknowledge that the x86 is more difficult to build than computers like RISC-V and MIPS, but the large market meant in the PC era that AMD and Intel could afford more resources to help overcome the added complexity. What the x86 lacks in style, it rectifies with market size, making it beautiful from the right perspective.

Its saving grace is that the most frequently used x86 architectural components are not too difficult to implement, as AMD and Intel have demonstrated by rapidly improving performance of integer programs since 1978. To get that performance, compilers must avoid the portions of the architecture that are hard to implement fast.

In the post-PC era, however, despite considerable architectural and manufacturing expertise, x86 has not yet been competitive in the personal mobile device.

2.18 Real Stuff: The Rest of the RISC-V Instruction Set

With the goal of making an instruction set architecture suitable for a wide variety of computers, the RISC-V architects partitioned the instruction set into a base architecture and several extensions. Each is named with a letter of the alphabet and the base architecture is named I for integer. The base architecture has few instructions relative to other popular instruction sets today; indeed, this chapter has already covered nearly all of them. This section rounds out the base architecture, then describes the five standard extensions.

Figure 2.37 lists the remaining instructions in the base RISC-V architecture. The first instruction, auipc, is used for PC-relative memory addressing. Like the lui instruction, it holds a 20-bit constant that corresponds to bits 12 through 31 of an integer. auipc's effect is to add this number to the PC and write the sum to a register. Combined with an instruction like addi, it is possible to address any byte of memory within 4 GiB of the PC. This feature is useful for position-independent code, which can execute correctly no matter where in memory it is loaded. It is most frequently used in dynamically linked libraries.

The next four instructions compare two integers, then write the Boolean result of the comparison to a register. slt and sltu compare two registers as signed and unsigned numbers, respectively, then write 1 to a register if the first value is less than the second value, or 0 otherwise. slti and sltiu perform the same comparisons, but with an immediate for the second operand.

The remaining instructions should all look familiar, as their names are the same as other instructions discussed in this chapter, but with the letter w, short for word, appended. These instructions perform the same operation as the similarly named ones we've discussed, except these only operate on the lower 32 bits of their operands, ignoring bits 32 through 63. Additionally, they produce sign-extended 32-bit results: that is, bits 32 through 63 are all the same as bit 31. The RISC-V architects included these w instructions because operations on 32-bit numbers remain very common on computers with 64-bit addresses. The main reason is that the popular data type int remains 32 bits in Java and in most implementations of the C language.
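Two of these behaviors reward a small worked example: splitting a PC-relative offset between auipc and addi, and the sign extension performed by a w instruction such as addw. The C sketch below models both. The type PcRelParts and the function names are hypothetical, and the split assumes that addi sign-extends its 12-bit immediate, so the upper part must be rounded to compensate.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int64_t hi20;  /* immediate for auipc, a signed 20-bit value */
    int64_t lo12;  /* immediate for addi, a signed 12-bit value  */
} PcRelParts;

/* Split an offset so that (hi20 * 4096) + lo12 == offset, with
   lo12 in the range -2048 .. 2047. */
PcRelParts split_offset(int64_t offset) {
    PcRelParts p;
    p.lo12 = offset % 4096;                 /* remainder, sign follows offset */
    if (p.lo12 >= 2048) p.lo12 -= 4096;
    if (p.lo12 < -2048) p.lo12 += 4096;
    p.hi20 = (offset - p.lo12) / 4096;      /* exact division */
    assert(p.hi20 >= -524288 && p.hi20 <= 524287);  /* must fit in 20 bits */
    return p;
}

/* Address formed by auipc rd, hi20 followed by addi rd, rd, lo12. */
int64_t auipc_addi(int64_t pc, PcRelParts p) {
    int64_t rd = pc + p.hi20 * 4096;        /* auipc: PC + (hi20 << 12) */
    return rd + p.lo12;                     /* addi: add the low part   */
}

/* addw: add the low 32 bits, then sign-extend the 32-bit sum to 64. */
int64_t addw(int64_t a, int64_t b) {
    uint32_t sum = (uint32_t)a + (uint32_t)b;   /* wraps modulo 2^32 */
    return sum >= 0x80000000u ? (int64_t)sum - 4294967296LL
                              : (int64_t)sum;
}
```

For example, an offset of 2048 splits into hi20 = 1 and lo12 = -2048, because 4096 - 2048 = 2048; and addw(0x7FFFFFFF, 1) yields -2147483648, since bit 31 of the 32-bit sum is copied into bits 32 through 63.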

That's it for the base architecture! Figure 2.38 lists the five standard extensions. The first, M, adds instructions to multiply and divide integers. Chapter 3 will introduce several instructions in the M extension.

The second extension, A, supports atomic memory operations for multiprocessor synchronization. The load-reserved (lr.d) and store-conditional (sc.d) instructions introduced in Section 2.11 are members of the A extension. Also included are versions that operate on 32-bit words (lr.w and sc.w). The remaining 18 instructions are optimizations of common synchronization patterns, like atomic exchange and atomic addition, but do not add any additional functionality over load-reserved and store-conditional.

The third and fourth extensions, F and D, provide operations on floating-point numbers, which are described in Chapter 3.

The last extension, C, provides no new functionality at all. Rather, it takes the most popular RISC-V instructions, like addi, and provides equivalent instructions that are only 16 bits in length, rather than 32. It thereby allows programs to be expressed in fewer bytes, which can reduce cost and, as we will see in Chapter 5, can improve performance. To fit in 16 bits, the new instructions have restrictions on their operands: for example, some instructions can only access some of the 32 registers, and the immediate fields are narrower.

Taken together, the RISC-V base and extensions have 184 instructions, plus 13 system instructions that will be introduced at the end of Chapter 5.

2.19 Fallacies and Pitfalls

Fallacy: More powerful instructions mean higher performance.

Part of the power of the Intel x86 is the prefixes that can modify the execution of the following instruction. One prefix can repeat the subsequent instruction until a counter steps down to 0. Thus, to move data in memory, it would seem that the natural instruction sequence is to use move with the repeat prefix to perform 32-bit memory-to-memory moves.

An alternative method, which uses the standard instructions found in all computers, is to load the data into the registers and then store the registers back to memory. This second version of this program, with the code replicated to reduce loop overhead, copies at about 1.5 times as fast. A third version, which uses the larger floating-point registers instead of the integer registers of the x86, copies at about 2.0 times as fast as the complex move instruction.

Fallacy: Write in assembly language to obtain the highest performance.

At one time compilers for programming languages produced naive instruction sequences; the increasing sophistication of compilers means the gap between compiled code and code produced by hand is closing fast. In fact, to compete with current compilers, the assembly language programmer needs to understand the concepts in Chapters 4 and 5 thoroughly (processor pipelining and memory hierarchy).

This battle between compilers and assembly language coders is another situation in which humans are losing ground. For example, C offers the programmer a chance to give a hint to the compiler about which variables to keep in registers versus spilled to memory. When compilers were poor at register allocation, such hints were vital to performance. In fact, some old C textbooks spent a fair amount of time giving examples that effectively use register hints. Today's C compilers generally ignore these hints, because the compiler does a better job at allocation than the programmer does.

Even if writing by hand resulted in faster code, the dangers of writing in assembly language are the protracted time spent coding and debugging, the loss in portability, and the difficulty of maintaining such code. One of the few widely accepted axioms of software engineering is that coding takes longer if you write more lines, and it clearly takes many more lines to write a program in assembly language than in C or Java. Moreover, once it is coded, the next danger is that it will become a popular program. Such programs always live longer than expected, meaning that someone will have to update the code over several years and make it work with new releases of operating systems and recent computers. Writing in a higher-level language instead of assembly language not only allows future compilers to tailor the code to forthcoming machines; it also makes the software easier to maintain and allows the program to run on more brands of computers.

Fallacy: The importance of commercial binary compatibility means successful instruction sets don't change.

While backwards binary compatibility is sacrosanct, Figure 2.39 shows that the x86 architecture has grown dramatically. The average is more than one instruction per month over its 35-year lifetime!

Pitfall: Forgetting that sequential word or doubleword addresses in machines with byte addressing do not differ by one.

Many an assembly language programmer has toiled over errors made by assuming that the address of the next word or doubleword can be found by incrementing the address in a register by one instead of by the word or doubleword size in bytes. Forewarned is forearmed!

Pitfall: Using a pointer to an automatic variable outside its defining procedure.

A common mistake in dealing with pointers is to pass a result from a procedure that includes a pointer to an array that is local to that procedure. Following the stack discipline in Figure 2.12, the memory that contains the local array will be reused as soon as the procedure returns. Pointers to automatic variables can lead to chaos.
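The first pitfall can be checked from C. The following is a minimal sketch, not code from the book: the helper names doubleword_stride and doubleword_address are hypothetical, and it assumes 64-bit doublewords as used throughout this chapter.

```c
#include <stddef.h>
#include <stdint.h>

/* Byte distance between element i and element i+1 of an array of
   64-bit doublewords. On a byte-addressed machine this is the
   element size (8 bytes), not 1. */
static ptrdiff_t doubleword_stride(const int64_t *a, size_t i)
{
    const unsigned char *p = (const unsigned char *)&a[i];
    const unsigned char *q = (const unsigned char *)&a[i + 1];
    return q - p;
}

/* The address arithmetic an assembly programmer must do by hand:
   byte address of element i = base + i * 8. */
static uintptr_t doubleword_address(uintptr_t base, size_t i)
{
    return base + (uintptr_t)i * sizeof(int64_t);
}
```

With a base address of 0x1000, element 3 sits at 0x1018 (an offset of 24 bytes), not at 0x1003. This is why RISC-V code that walks an array of doublewords advances its pointer register by 8 per element.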

2.20 Concluding Remarks

The two principles of the stored-program computer are the use of instructions that are indistinguishable from numbers and the use of alterable memory for programs. These principles allow a single machine to aid cancer researchers, financial advisers, and novelists in their specialties. The selection of a set of instructions that the machine can understand demands a delicate balance among the number of instructions needed to execute a program, the number of clock cycles needed by an instruction, and the speed of the clock. As illustrated in this chapter, three design principles guide the authors of instruction sets in making that tricky tradeoff:

  1. Simplicity favors regularity. Regularity motivates many features of the RISC-V instruction set: keeping all instructions a single size, always requiring register operands in arithmetic instructions, and keeping the register fields in the same place in all instruction formats.
  2. Smaller is faster. The desire for speed is the reason that RISC-V has 32 registers rather than many more.
  3. Good design demands good compromises. One RISC-V example is the compromise between providing for larger addresses and constants in instructions and keeping all instructions the same length.

We also saw the great idea from Chapter 1 of making the common case fast applied to instruction sets as well as computer architecture. Examples of making the common RISC-V case fast include PC-relative addressing for conditional branches and immediate addressing for larger constant operands.

Above this machine level is assembly language, a language that humans can read. The assembler translates it into the binary numbers that machines can understand, and it even "extends" the instruction set by creating symbolic instructions that aren't in the hardware. For instance, constants or addresses that are too big are broken into properly sized pieces, common variations of instructions are given their own name, and so on. Figure 2.40 lists the RISC-V instructions we have covered so far, both real and pseudoinstructions. Hiding details from the higher level is another example of the great idea of abstraction.

Each category of RISC-V instructions is associated with constructs that appear in programming languages:

  • Arithmetic instructions correspond to the operations found in assignment statements.
  • Transfer instructions are most likely to occur when dealing with data structures like arrays or structures.
  • Conditional branches are used in if statements and in loops.
  • Unconditional branches are used in procedure calls and returns and for case/switch statements.

These instructions are not born equal; the popularity of the few dominates the many. For example, Figure 2.41 shows the popularity of each class of instructions for SPEC CPU2006. The varying popularity of instructions plays an important role in the chapters about datapath, control, and pipelining.

After we explain computer arithmetic in Chapter 3, we reveal more of the RISC-V instruction set architecture.
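The remark in this section that the assembler breaks constants that are too big into properly sized pieces can be made concrete. The sketch below, in C, splits a 32-bit constant into a 20-bit upper part (as would be loaded with lui) and a 12-bit lower part (as would be added with addi). The names li_parts, split_constant, and rebuild_constant are hypothetical. It assumes the lower part is sign-extended when added, so the upper part is adjusted by one whenever bit 11 of the constant is set.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two pieces of a 32-bit constant. */
struct li_parts {
    uint32_t upper20;  /* value placed in bits 31..12 (lui-style)    */
    int32_t  lower12;  /* signed 12-bit value added afterwards (addi) */
};

/* Recombine the pieces the way the hardware would: shift the upper
   part left by 12 and add the sign-extended lower part. */
static uint32_t rebuild_constant(struct li_parts p)
{
    return (p.upper20 << 12) + (uint32_t)p.lower12;
}

/* Split a 32-bit constant into pieces that fit the instruction fields. */
static struct li_parts split_constant(uint32_t value)
{
    struct li_parts p;
    int32_t low = (int32_t)(value & 0xFFFu);
    if (low >= 0x800)      /* bit 11 set: the 12-bit field is negative */
        low -= 0x1000;
    p.lower12 = low;
    /* Subtracting the signed low part compensates for its sign extension. */
    p.upper20 = ((value - (uint32_t)low) >> 12) & 0xFFFFFu;
    assert(rebuild_constant(p) == value);
    return p;
}
```

For 0x12345678 the pieces are 0x12345 and 0x678. For 0xDEADBEEF the low 12 bits (0xEEF) have bit 11 set, so the pieces become 0xDEADC and -273, which still sum to the original constant.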

2.21 Historical Perspective and Further Reading

This section surveys the history of instruction set architectures (ISAs) over time, and we give a short history of programming languages and compilers. ISAs include accumulator architectures, general-purpose register architectures, stack architectures, and a brief history of the x86 and ARM's 32-bit architecture, ARMv7. We also review the controversial subjects of high-level-language computer architectures and reduced instruction set computer architectures. The history of programming languages includes Fortran, Lisp, Algol, C, Cobol, Pascal, Simula, Smalltalk, C++, and Java, and the history of compilers includes the key milestones and the pioneers who achieved them. The rest is found online.

2.22 Exercises

2.1 [5] <§2.2> For the following C statement, write the corresponding RISC-V assembly code. Assume that the C variables f, g, and h have already been placed in registers x5, x6, and x7, respectively. Use a minimal number of RISC-V assembly instructions.

f = g + (h - 5);

2.2 [5] <§2.2> Write a single C statement that corresponds to the two RISC-V assembly instructions below.

add f, g, h
add f, i, f

2.3 [5] <§§2.2, 2.3> For the following C statement, write the corresponding RISC-V assembly code. Assume that the variables f, g, h, i, and j are assigned to registers x5, x6, x7, x28, and x29, respectively. Assume that the base address of the arrays A and B are in registers x10 and x11, respectively.

B[8] = A[i-j];

2.4 [10] <§§2.2, 2.3> For the RISC-V assembly instructions below, what is the corresponding C statement? Assume that the variables f, g, h, i, and j are assigned to registers x5, x6, x7, x28, and x29, respectively. Assume that the base address of the arrays A and B are in registers x10 and x11, respectively.

slli x30, x5, 3     // x30 = f*8
add  x30, x10, x30  // x30 = &A[f]
slli x31, x6, 3     // x31 = g*8
add  x31, x11, x31  // x31 = &B[g]
ld   x5, 0(x30)     // f = A[f]
addi x12, x30, 8
ld   x30, 0(x12)
add  x30, x30, x5
sd   x30, 0(x31)

2.5 [5] <§2.3> Show how the value 0xabcdef12 would be arranged in memory of a little-endian and a big-endian machine. Assume the data are stored starting at address 0 and that the word size is 4 bytes.

2.6 [5] <§2.4> Translate 0xabcdef12 into decimal.

2.7 [5] <§§2.2, 2.3> Translate the following C code to RISC-V. Assume that the variables f, g, h, i, and j are assigned to registers x5, x6, x7, x28, and x29, respectively. Assume that the base address of the arrays A and B are in registers x10 and x11, respectively. Assume that the elements of the arrays A and B are 8-byte words:

B[8] = A[i] + A[j];
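For Exercise 2.5, a hand-drawn byte layout can be checked with a short C sketch. The helper names store_little_endian and store_big_endian are hypothetical. The functions write the bytes of a 4-byte word in each order explicitly, so the result does not depend on the machine that runs them.

```c
#include <stdint.h>

/* Little-endian: the least significant byte goes to the lowest address. */
static void store_little_endian(uint32_t value, uint8_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)(value >> (8 * i));
}

/* Big-endian: the most significant byte goes to the lowest address. */
static void store_big_endian(uint32_t value, uint8_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)(value >> (8 * (3 - i)));
}
```

For the sample word 0x11223344, addresses 0 through 3 hold 44, 33, 22, 11 (hex) in little-endian order and 11, 22, 33, 44 in big-endian order.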

2.8 [10] <§§2.2, 2.3> Translate the following RISC-V code to C. Assume that the variables f, g, h, i, and j are assigned to registers x5, x6, x7, x28, and x29, respectively. Assume that the base address of the arrays A and B are in registers x10 and x11, respectively.

addi x30, x10, 8
addi x31, x10, 0
sd   x31, 0(x30)
ld   x30, 0(x30)
add  x5, x30, x31

2.9 [20] <§§2.2, 2.5> For each RISC-V instruction in Exercise 2.8, show the value of the opcode (op), source register (rs1), and destination register (rd) fields. For the I-type instructions, show the value of the immediate field, and for the R-type instructions, show the value of the second source register (rs2). For non U- and UJ-type instructions, show the funct3 field, and for R-type and S-type instructions, also show the funct7 field.

2.10 Assume that registers x5 and x6 hold the values 0x8000000000000000 and 0xD000000000000000, respectively.

2.10.1 [5] <§2.4> What is the value of x30 for the following assembly code?

add x30, x5, x6

2.10.2 [5] <§2.4> Is the result in x30 the desired result, or has there been overflow?

2.10.3 [5] <§2.4> For the contents of registers x5 and x6 as specified above, what is the value of x30 for the following assembly code?

sub x30, x5, x6

2.10.4 [5] <§2.4> Is the result in x30 the desired result, or has there been overflow?

2.10.5 [5] <§2.4> For the contents of registers x5 and x6 as specified above, what is the value of x30 for the following assembly code?

add x30, x5, x6
add x30, x30, x5

2.10.6 [5] <§2.4> Is the result in x30 the desired result, or has there been overflow?

2.11 Assume that x5 holds the value 128 (base ten).

2.11.1 [5] <§2.4> For the instruction add x30, x5, x6, what is the range(s) of values for x6 that would result in overflow?

2.11.2 [5] <§2.4> For the instruction sub x30, x5, x6, what is the range(s) of values for x6 that would result in overflow?

2.11.3 [5] <§2.4> For the instruction sub x30, x6, x5, what is the range(s) of values for x6 that would result in overflow?

2.12 [5] <§§2.2, 2.5> Provide the instruction type and assembly language instruction for the following binary value:

0000 0000 0001 0000 1000 0000 1011 0011 (base two)

Hint: Figure 2.20 may be helpful.

2.13 [5] <§§2.2, 2.5> Provide the instruction type and hexadecimal representation of the following instruction:

sd x5, 32(x30)
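Exercises 2.10 and 2.11 above turn on when 64-bit two's complement addition and subtraction overflow. The following C sketch states the usual sign rules and can be used to check an answer. The function names add_overflows and sub_overflows are hypothetical. The arithmetic is done on unsigned values because signed overflow is undefined behavior in C.

```c
#include <stdbool.h>
#include <stdint.h>

/* a + b overflows exactly when a and b have the same sign and the
   sign of the wrapped sum differs from it. */
static bool add_overflows(int64_t a, int64_t b)
{
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b;
    uint64_t sum = ua + ub;                 /* wraps modulo 2^64 */
    return (((ua ^ sum) & (ub ^ sum)) >> 63) != 0;
}

/* a - b overflows exactly when a and b have different signs and the
   sign of the wrapped difference differs from the sign of a. */
static bool sub_overflows(int64_t a, int64_t b)
{
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b;
    uint64_t diff = ua - ub;                /* wraps modulo 2^64 */
    return (((ua ^ ub) & (ua ^ diff)) >> 63) != 0;
}
```

For instance, adding 1 to the largest positive 64-bit value overflows, while 1 + 2 does not. RISC-V add and sub do not signal this condition, so a program that cares must test for it explicitly.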

2.14 [5] <§2.5> Provide the instruction type, assembly language instruction, and binary representation of the instruction described by the following RISC-V fields:

opcode=0x33, funct3=0x0, funct7=0x20, rs2=5, rs1=7, rd=6

2.15 [5] <§2.5> Provide the instruction type, assembly language instruction, and binary representation of the instruction described by the following RISC-V fields:

opcode=0x3, funct3=0x3, rs1=27, rd=3, imm=0x4
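Exercises 2.12 through 2.15 all come down to packing or unpacking the fixed-position fields of a 32-bit instruction. As a minimal sketch for the R-type format only, the C function below (hypothetical name encode_r_type) packs the fields in the order funct7, rs2, rs1, funct3, rd, opcode, with widths of 7, 5, 5, 3, 5, and 7 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the six R-type fields into one 32-bit instruction word.
   Bit positions: funct7 31..25, rs2 24..20, rs1 19..15,
   funct3 14..12, rd 11..7, opcode 6..0. */
static uint32_t encode_r_type(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                              uint32_t funct3, uint32_t rd, uint32_t opcode)
{
    /* Each field must fit in its allotted width. */
    assert(funct7 < 128 && rs2 < 32 && rs1 < 32);
    assert(funct3 < 8 && rd < 32 && opcode < 128);
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}
```

As a check on a value not used in the exercises, add x9, x20, x21 has funct7=0, rs2=21, rs1=20, funct3=0, rd=9, and opcode=0x33, and encode_r_type(0, 21, 20, 0, 9, 0x33) gives 0x015A04B3.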

2.16 Assume that we would like to expand the RISC-V register file to 128 registers and expand the instruction set to contain four times as many instructions.

2.16.1 [5] <§2.5> How would this affect the size of each of the bit fields in the R-type instructions?

2.16.2 [5] <§2.5> How would this affect the size of each of the bit fields in the I-type instructions?

2.16.3 [5] <§§2.5, 2.8, 2.10> How could each of the two proposed changes decrease the size of a RISC-V assembly program? On the other hand, how could the proposed change increase the size of a RISC-V assembly program?

2.17 Assume the following register contents:

x5 = 0x00000000AAAAAAAA, x6 = 0x1234567812345678

2.17.1 [5] <§2.6> For the register values shown above, what is the value of x7 for the following sequence of instructions?

slli x7, x5, 4
or x7, x7, x6

2.17.2 [5] <§2.6> For the register values shown above, what is the value of x7 for the following sequence of instructions?

slli x7, x6, 4

2.17.3 [5] <§2.6> For the register values shown above, what is the value of x7 for the following sequence of instructions?

srli x7, x5, 3
addi x7, x7, 0xFEF

2.18 [10] <§2.6> Find the shortest sequence of RISC-V instructions that extracts bits 16 down to 11 from register x5 and uses the value of this field to replace bits 31 down to 26 in register x6 without changing the other bits of registers x5 or x6. (Be sure to test your code using x5 = 0 and x6 = 0xffffffffffffffff. Doing so may reveal a common oversight.)

2.19 [5] <§2.6> Provide a minimal set of RISC-V instructions that may be used to implement the following pseudoinstruction:

not x5, x6 // bit-wise invert

2.20 [5] <§2.6> For the following C statement, write a minimal sequence of RISC-V assembly instructions that performs the identical operation. Assume x6 = A, and x17 is the base address of C.

A = C[0] << 4;

2.21 [5] <§2.7> Assume x5 holds the value 0x00000000001010000. What is the value of x6 after the following instructions?

bge x5, x0, ELSE
jal x0, DONE
ELSE: ori x6, x0, 2
DONE:

2.22 Suppose the program counter (PC) is set to 0x20000000.

2.22.1 [5] <§2.10> What range of addresses can be reached using the RISC-V jump-and-link (jal) instruction? (In other words, what is the set of possible values for the PC after the jump instruction executes?)

2.22.2 [5] <§2.10> What range of addresses can be reached using the RISC-V branch if equal (beq) instruction? (In other words, what is the set of possible values for the PC after the branch instruction executes?)

2.23 Consider a proposed new instruction named rpt. This instruction combines a loop's condition check and counter decrement into a single instruction. For example, rpt x29, loop would do the following:

if (x29 > 0) {
    x29 = x29 - 1;
    goto loop;
}

2.23.1 [5] <§2.7, 2.10> If this instruction were to be added to the RISC-V instruction set, what is the most appropriate instruction format?

2.23.2 [5] <§2.7> What is the shortest sequence of RISC-V instructions that performs the same operation?

2.24 Consider the following RISC-V loop:

LOOP: beq  x6, x0, DONE
      addi x6, x6, -1
      addi x5, x5, 2
      jal  x0, LOOP
DONE:

2.24.1 [5] <§2.7> Assume that the register x6 is initialized to the value 10. What is the final value in register x5 assuming that x5 is initially zero?

2.24.2 [5] <§2.7> For the loop above, write the equivalent C code. Assume that the registers x5 and x6 are integers acc and i, respectively.

2.24.3 [5] <§2.7> For the loop written in RISC-V assembly above, assume that the register x6 is initialized to the value N. How many RISC-V instructions are executed?

2.24.4 [5] <§2.7> For the loop written in RISC-V assembly above, replace the instruction c

2.25 [10] <§2.7> Translate the following C code to RISC-V assembly code. Use a minimum number of instructions. Assume that the values of a, b, i, and j are in registers x5, x6, x7, and x29, respectively. Also, assume that register x10 holds the base address of the array D.

for(i=0; i<a; i++)
for(j=0; j<b; j++)
D[4*j] = i + j;

2.26 [5] <§2.7> How many RISC-V instructions does it take to implement the C code from Exercise 2.25? If the variables a and b are initialized to 10 and 1 and all elements of D are initially 0, what is the total number of RISC-V instructions executed to complete the loop?

2.27 [5] <§2.7> Translate the following loop into C. Assume that the C-level integer i is held in register x5, x6 holds the C-level integer called result, and x10 holds the base address of the integer MemArray.

      addi x6, x0, 0
      addi x29, x0, 100
LOOP: ld   x7, 0(x10)
      add  x5, x5, x7
      addi x10, x10, 8
      addi x6, x6, 1
      blt  x6, x29, LOOP

2.28 [10] <§2.7> Rewrite the loop from Exercise 2.27 to reduce the number of RISC-V instructions executed. Hint: Notice that variable i is used only for loop control.

2.29 [30] <§2.8> Implement the following C code in RISC-V assembly. Hint: Remember that the stack pointer must remain aligned on a multiple of 16.

int fib(int n) {
    if (n == 0)
        return 0;
    else if (n == 1)
        return 1;
    else
        return fib(n-1) + fib(n-2);
}

2.30 [20] <§2.8> For each function call in Exercise 2.29, show the contents of the stack after the function call is made. Assume the stack pointer is originally at address 0x7ffffffc, and follow the register conventions as specified in Figure 2.11.

2.31 [20] <§2.8> Translate function f into RISC-V assembly language. Assume the function declaration for g is int g(int a, int b). The code for function f is as follows:

int f(int a, int b, int c, int d) {
    return g(g(a,b), c+d);
}

2.32 [5] <§2.8> Can we use the tail-call optimization in this function? If no, explain why not. If yes, what is the difference in the number of executed instructions in f with and without the optimization?

2.33 [5] <§2.8> Right before your function f from Exercise 2.31 returns, what do we know about contents of registers x10-x14, x8, x1, and sp? Keep in mind that we know what the entire function f looks like, but for function g we only know its declaration.

2.34 [30] <§2.9> Write a program in RISC-V assembly to convert an ASCII string containing a positive or negative integer decimal string to an integer. Your program should expect register x10 to hold the address of a null-terminated string containing an optional "+" or "-" followed by some combination of the digits 0 through 9. Your program should compute the integer value equivalent to this string of digits, then place the number in register x10. If a non-digit character appears anywhere in the string, your program should stop with the value -1 in register x10. For example, if register x10 points to a sequence of three bytes 50, 52, 0 (base ten; the null-terminated string "24"), then when the program stops, register x10 should contain the value 24 (base ten). The RISC-V mul instruction takes two registers as input. There is no "muli" instruction. Thus, just store the constant 10 in a register.

2.35 Consider the following code:

lb x6, 0(x7)
sd x6, 8(x7)

Assume that the register x7 contains the address 0x10000000 and the data at that address is 0x1122334455667788.

2.35.1 [5] <§2.3, 2.9> What value is stored in 0x10000008 on a big-endian machine?

2.35.2 [5] <§2.3, 2.9> What value is stored in 0x10000008 on a little-endian machine?

2.36 [5] <§2.10> Write the RISC-V assembly code that creates the 64-bit constant 0x1122334455667788 and stores that value to register x10.

2.37 [10] <§2.11> Write the RISC-V assembly code to implement the following C code as an atomic "set max" operation using the lr.d/sc.d instructions. Here, the argument shvar contains the address of a shared variable which should be replaced by x if x is greater than the value it points to:

void setmax(int* shvar, int x) {
    // Begin critical section
    if (x > *shvar)
        *shvar = x;
    // End critical section
}

2.38 [5] <§2.11> Using your code from Exercise 2.37 as an example, explain what happens when two processors begin to execute this critical section at the same time, assuming that each processor executes exactly one instruction per cycle.

2.39 Assume for a given processor the CPI of arithmetic instructions is 1, the CPI of load/store instructions is 10, and the CPI of branch instructions is 3. Assume a program has the following instruction breakdowns: 500 million arithmetic instructions, 300 million load/store instructions, 100 million branch instructions.

2.39.1 [5] <§§1.6, 2.13> Suppose that new, more powerful arithmetic instructions are added to the instruction set. On average, through the use of these more powerful arithmetic instructions, we can reduce the number of arithmetic instructions needed to execute a program by 25% while increasing the clock cycle time by only 10%. Is this a good design choice? Why?

2.39.2 [5] <§§1.6, 2.13> Suppose that we find a way to double the performance of arithmetic instructions. What is the overall speedup of our machine? What if we find a way to improve the performance of arithmetic instructions by 10 times?

2.40 Assume that for a given program 70% of the executed instructions are arithmetic, 10% are load/store, and 20% are branch.

2.40.1 [5] <§§1.6, 2.13> Given this instruction mix and the assumption that an arithmetic instruction requires two cycles, a load/store instruction takes six cycles, and a branch instruction takes three cycles, find the average CPI.

2.40.2 [5] <§§1.6, 2.13> For a 25% improvement in performance, how many cycles, on average, may an arithmetic instruction take if load/store and branch instructions are not improved at all?

2.40.3 [5] <§§1.6, 2.13> For a 50% improvement in performance, how many cycles, on average, may an arithmetic instruction take if load/store and branch instructions are not improved at all?
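Exercises 2.39 and 2.40 rest on two relations: total clock cycles are the sum over instruction classes of instruction count times CPI, and execution time is clock cycles times clock cycle time. A minimal C sketch of both follows. The names instr_class, total_cycles, and execution_time are hypothetical, and only the baseline numbers given in Exercise 2.39 are plugged in.

```c
#include <stddef.h>

/* One instruction class: how many instructions of this class the
   program executes, and how many clock cycles each one takes. */
struct instr_class {
    double count;
    double cpi;
};

/* Total clock cycles = sum over classes of (count * CPI). */
static double total_cycles(const struct instr_class *mix, size_t n)
{
    double cycles = 0.0;
    for (size_t i = 0; i < n; i++)
        cycles += mix[i].count * mix[i].cpi;
    return cycles;
}

/* Execution time = clock cycles * clock cycle time. */
static double execution_time(double cycles, double clock_cycle_time)
{
    return cycles * clock_cycle_time;
}
```

With the baseline mix of Exercise 2.39 (500 million arithmetic at CPI 1, 300 million load/store at CPI 10, 100 million branch at CPI 3), total_cycles gives 3.8 billion cycles. A proposed design change can then be evaluated by recomputing the cycles with the modified counts and scaling by the new clock cycle time.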
2.41 [10] <§2.19> Suppose the RISC-V ISA included a scaled offset addressing mode similar to the x86 one described in Section 2.17 (Figure 2.35). Describe how you would use scaled offset loads to further reduce the number of assembly instructions needed to carry out the function given in Exercise 2.4.

2.42 [10] <§2.19> Suppose the RISC-V ISA included a scaled offset addressing mode similar to the x86 one described in Section 2.17 (Figure 2.35). Describe how you would use scaled offset loads to further reduce the number of assembly instructions needed to implement the C code given in Exercise 2.7.

Answers to Check Yourself

§2.2, page 66: RISC-V, C, Java.
§2.3, page 73: 2) Very slow.
§2.4, page 80: 2) -8 (base ten)
§2.5, page 89: 3) sub x11, x10, x9
§2.6, page 92: Both. AND with a mask pattern of 1s will leave 0s everywhere but the desired field. Shifting left by the correct amount removes the bits from the left of the field. Shifting right by the appropriate amount puts the field into the rightmost bits of the doubleword, with 0s in the rest of the doubleword. Note that AND leaves the field where it was originally, and the shift pair moves the field into the rightmost part of the doubleword.
§2.7, page 97: I. All are true. II. 1).
§2.8, page 108: Both are true.
§2.9, page 113: I. 1) and 2). II. 3).
§2.10, page 121: I. 4) ±4K. II. 4) ±1M.
§2.11, page 124: Both are true.
§2.12, page 133: 4) Machine independence.
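The answer to the §2.6 Check Yourself describes two ways to isolate a bit field in a doubleword. The C sketch below shows both side by side. The function names isolate_with_mask and isolate_with_shifts are hypothetical, and the field is assumed to occupy bits hi down to lo of a 64-bit value, with 0 <= lo <= hi <= 63.

```c
#include <assert.h>
#include <stdint.h>

/* AND with a mask of 1s: everything outside the field becomes 0,
   and the field stays where it originally was. */
static uint64_t isolate_with_mask(uint64_t x, unsigned hi, unsigned lo)
{
    assert(lo <= hi && hi < 64);
    unsigned width = hi - lo + 1;
    uint64_t ones = (width == 64) ? ~0ULL : ((1ULL << width) - 1);
    return x & (ones << lo);
}

/* Shift pair: the left shift discards the bits above the field, the
   logical right shift discards the bits below it and leaves the
   field in the rightmost bits of the doubleword. */
static uint64_t isolate_with_shifts(uint64_t x, unsigned hi, unsigned lo)
{
    assert(lo <= hi && hi < 64);
    return (x << (63 - hi)) >> (63 - hi + lo);
}
```

For x = 0xABCD and the field in bits 11 down to 4, the mask version returns 0x0BC0 (field left in place) and the shift version returns 0xBC (field moved to the right end). Shifting the masked result right by lo gives the same value as the shift pair.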
