Chapter 2

Chapter 2
OVERVIEW OF PROCESSOR ARCHITECTURE AND PIPELINES

This section provides an overview of the pipelines and architectural features of the Pentium® and dynamic execution (P6-family) processors with MMX^TM technology. By understanding how the code flows through the pipeline of the processor, you can better understand why a specific optimization will improve the speed of your code. Additionally, it will help you to schedule and optimize your application for high performance.

2.1 Pipelines of Superscalar (Pentium® Family) and Dynamic Execution (P6-Family) Architectures

2.1.1 SUPERSCALAR (PENTIUM® FAMILY) PIPELINE

The Pentium processor is an advanced superscalar processor. It is built around two general-purpose integer pipelines and a pipelined floating-point unit, allowing the processor to execute two integer instructions simultaneously. A software-transparent dynamic branch-prediction mechanism minimizes pipeline stalls due to branches. Pentium processors with MMX technology add additional stages to the pipeline. The integration of the MMX Technology pipeline with the integer pipeline is very similar to that of the floating-point pipe.

Pentium processors can issue two instructions every clock cycle, one in each pipe. The first logical pipe is referred to as the "U" pipe, and the second as the "V" pipe. During decoding of any given instruction, the next two instructions are checked, and, if possible, they are issued such that the first one executes in the U-pipe and the second in the V-pipe. If it is not possible to issue two instructions, then the next instruction is issued to the U-pipe and no instruction is issued to the V-pipe.

When instructions execute in the two pipes, their behavior is exactly the same as if they were executed sequentially. When a stall occurs, successive instructions are not allowed to pass the stalled instruction in either pipe. Figure 2-1 shows the pipelining structure for this scheme:

Figure 2-1. MMX^TM Technology Pipeline Structure

Pentium processors with MMX technology add an additional stage to the integer pipeline. The instruction bytes are prefetched from the code cache in the prefetch (PF) stage, and they are parsed into instructions in the fetch (F) stage. Additionally, any prefixes are decoded in the F stage.

Instruction parsing is decoupled from the instruction decoding by means of an instruction First In, First Out (FIFO) buffer, which is situated between the F and Decode 1 (D1) stages. The FIFO has slots for up to four instructions. This FIFO is transparent; it does not add additional latency when it is empty.

During every clock cycle, two instructions can be pushed into the instruction FIFO (depending on availability of the code bytes, and on other factors such as prefixes). Instruction pairs are pulled out of the FIFO into the D1 stage. Since the average rate of instruction execution is less than two per clock, the FIFO is normally full. As long as the FIFO is full, it can buffer any stalls that may occur during instruction fetch and parsing. If such a stall occurs, the FIFO prevents the stall from causing a stall in the execution stage of the pipe. If the FIFO is empty, then an execution stall may result from the pipeline being "starved" for instructions to execute. Stalls at the FIFO entrance may result from long instructions or prefixes (see Sections 3.2.3 and 3.4.2).

The following chart details the MMX technology pipeline on superscalar processors and the conditions in which a stall may occur in the pipeline.

Figure 2-2. MMX^TM Technology Instruction Flow in a Pentium® Family Processor with MMX Technology

Table 2-1 details the functional units, latency, throughput, and execution pipes for each type of MMX technology instruction.

Table 2-1. MMX^TM Technology Instructions and Execution Units

Operation	Number of Functional Units	Latency	Throughput	Execution Pipes
ALU	2	1	1	U and V
Multiplexer	1	3	1	U or V
Shift/pack/unpack	1	1	1	U or V
Memory access	1	1	1	U only
Integer register access	1	1	1	U only

The Arithmetic Logic Unit (ALU) executes arithmetic and logic operations (that is, add, subtract, xor, and).
The Multiplier unit performs all multiplication operations. Multiplication requires three cycles but can be pipelined, resulting in one multiplication operation every clock cycle. The processor has only one multiplier unit which means that multiplication instructions cannot pair with other multiplication instructions. However, the multiplication instructions can pair with other types of instructions. They can execute in either the U- or V-pipes.
The Shift unit performs all shift, pack and unpack operations. Only one shifter is available so shift, pack and unpack instructions cannot pair with other shift unit instructions. However, the shift unit instructions can pair with other types of instructions. They can execute in either the U- or V-pipes.
MMX Technology Instructions that access memory or integer registers can only execute in the U-pipe and cannot be paired with any instructions that are not MMX Technology instructions.
After updating an MMX technology register, two clock cycles must pass before that MMX technology register can be moved to either memory or to an integer register.

Information on pairing requirements can be found in Section 3.3.

Additional information on instruction format can be found in the Intel Architecture MMX^TM Technology Programmer's Reference Manual, (Order Number 243007).

2.1.2. DYNAMIC EXECUTION (P6-FAMILY) PIPELINE

P6-family processors use a Dynamic Execution architecture that blend out-of-order and speculative execution with hardware register renaming and branch prediction. These processors feature an in-order issue pipeline, which breaks Intel386^TM processor macroinstructions up into simple, micro-operations called micro-ops (or uops), and an out-of-order, superscalar processor core, which executes the micro-ops. The out-of-order core of the processor contains several pipelines to which integer, jump, floating-point, and memory execution units are attached. Several different execution units may be clustered on the same pipeline: for example, an integer address logic unit and the floating-point execution units (adder, multiplier, and divider) share a pipeline. The data cache is pseudo-dual ported via interleaving, with one port dedicated to loads and the other to stores. Most simple operations (integer ALU, floating-point add, even floating-point multiply) can be pipelined with a throughput of one or two operations per clock cycle. Floating-point divide is not pipelined. Long latency operations can proceed in parallel with short latency operations.

The P6-family pipeline is comprised of three parts: the In-Order Issue Front-end, the Out-of-Order Core and the In-Order Retirement unit. Details about the In-Order Issue Front-end follow below.

Figure 2-3. Out-Of-Order Core and Retirement Pipeline

Since the dynamic execution processors execute instructions out of order, the most important consideration in performance tuning is making sure enough micro-ops are ready for execution. Correct branch prediction and fast decoding are essential to getting the most performance out of the In-Order Front-End. Branch prediction and the branch target buffer are detailed in Section 2.3. Decoding is discussed below.

During every clock cycle, up to three Intel Architecture macro instructions can be decoded in the ID1 pipestage. However, if the instructions are complex or are over seven bytes then the decoder is limited to decoding fewer instructions.

The decoders can decode:

Up to three macro-instructions per clock cycle.
Up to six micro-ops per clock cycle.
Macro-instructions up to seven bytes in length.

P6-family processors have three decoders in the D1 pipestage. The first decoder is capable of decoding one IA macro-instruction of four or fewer micro-ops in each clock cycle. The other two decoders can each decode an IA instruction of one micro-op in each clock cycle. Instructions composed of more than four micro-ops will take multiple cycles to decode. When programming in assembly language, scheduling the instructions in a 4-1-1 micro-op sequence increases the number of instructions that can be decoded each clock cycle. In general:

Simple instructions of the register-register form are only one micro-op.
Load instructions are only one micro-op.
Store instructions have two micro-ops.
Simple read-modify instructions are two micro-ops.
Simple instructions of the register-memory form have two to three micro-ops.
Simple read-modify write instructions are four micro-ops.
Complex instructions generally have more than four micro-ops, therefore they will take multiple cycles to decode.

For the purpose of counting micro-ops, MMX technology instructions are simple instructions. See Optimizations for Intel's 32-bit Processors, Application Note AP-526 (Order Number 242816), Appendix D for a table that specifies the number of micro-ops for each instruction in the Intel Architecture instruction set.

Once the micro-ops are decoded, they will be issued from the In-Order Front-End into the Reservation Station (RS), which is the beginning pipestage of the Out-of-Order core. In the RS, the micro-ops wait until their data operands are available. Once a micro-op has all data sources available, it will be dispatched from the RS to an execution unit. If a micro-op enters the RS in a data-ready state (that is, all data is available), then the micro-op will be immediately dispatched to an appropriate execution unit, if one is available. In this case, the micro-op will spend very few clock cycles in the RS. All of the execution units are clustered on ports coming out of the RS. Once the micro-op has been executed it returns to the ROB, and waits for retirement. In this pipestage, all data values are written back to memory and all micro-ops are retired in-order, three at a time. The figure below provides details about the Out-of-Order core and the In-Order retirement pipestages.

Figure 2-4. Out-Of-Order Core and Retirement Pipeline

Table 2-2. Dynamic Execution (P6-Family) Processor Execution Unit Pipelines
Port	Execution Units	Latency/Thruput
0	Integer ALU Unit LEA instructions Shift instructions Integer Multiplication instruction Floating-Point Unit FADD instruction FMUL FDIV Unit MMX^TM Technology ALU Unit MMX Technology Multiplier UnitFDIV	Latency 1, Thruput 1/cycle Latency 1, Thruput 1/cycle Latency 1, Thruput 1/cycle Latency 4, Thruput 1/cycle Latency 3, Thruput 1/cycle Latency 5, Thruput 1/2cycle1,2 Latency long and data dep., Thruput non-pipelined Latency 1, Thruput1/cycle Latency 3, Thruput 1/cycle
1	Integer ALU Unit MMX Technology ALU Unit MMX Technology Shifter Unit	Latency 1, Thruput 1/cycle Latency 1, Thruput 1/cycle Latency 1, Thruput 1/cycle
2	Load Unit	Latency 3 on a cache hit, Thruput 1/cycle4
3	Store Address Unit	Latency 3 (not applicable) Thruput 1/cycle3
4	Store Data Unit	Latency 1 (not applicable) Thruput 1/cycle

Notes:

The FMUL unit cannot accept a second FMUL within the cycle after it has accepted the first. This is NOT the same as only being able to do FMULs on even clock cycles.
FMUL is pipelined one every two clock cycles. One way of thinking about this is to imagine that a P6-family processor has only a 32x32->32 multiply pipelined.
Store latency is not all that important from a dataflow perspective. The latency that matters is with respect to determining when they can retire and be completed. They also have a different latency with respect to load forwarding. For example, if the store address and store data of a particular address, for example 100, dispatch in clock cycle 10, a load (of the same size and shape) to the same address 100 can dispatch in the same clock cycle 10 and not be stalled.
A load and store to the same address can dispatch in the same clock cycle.

2.2 Caches

The on-chip cache subsystem of processors with MMX technology consists of two 16 K four-way set associative caches with a cache line length of 32 bytes. The caches employ a write-back mechanism and a pseudo-LRU replacement algorithm. The data cache consists of eight banks interleaved on four-byte boundaries.

On Pentium processors with MMX technology, the data cache can be accessed simultaneously from both pipes, as long as the references are to different cache banks. On the dynamic execution (P6-family) processors, the data cache can be accessed simultaneously by a load instruction and a store instruction, as long as the references are to different cache banks. The delay for a cache miss on the Pentium processor with MMX technology is eight internal clock cycles. On dynamic execution processors with MMX technology the minimum delay is ten internal clock cycles.

2.3 Branch Target Buffer

Branch prediction for Pentium and dynamic execution processors with MMX technology is functionally identical except for one minor exception which will be discussed in Section 2.3.1.

The Branch Target Buffer (BTB) stores the history of the previously seen branches and their targets. When a branch is prefetched, the BTB feeds the target address directly into the Instruction Fetch Unit (IFU). Once the branch is executed, the BTB is updated with the target address. Using the branch target buffer, branches that have been seen previously are dynamically predicted. The branch target buffer prediction algorithm includes pattern matching and up to four prediction history bits per target address. For example, a loop which is four iterations long should have close to 100% correct prediction. Adhering to the following guideline will improve branch prediction performance:

Program conditional branches (except for loops) so that the most executed branch immediately follows the branch instruction (that is, fall through).

Additionally, processors with MMX technology have a Return Stack Buffer (RSB), which can correctly predict return addresses for procedures that are called from different locations in succession. This increases further the benefit of unrolling loops which contain function calls, and removes the need to in-line certain procedures.

2.3.1 CONSECUTIVE BRANCHES

On the Pentium processor with MMX technology, branches may be mispredicted when the last byte of two branch instructions occur in the same aligned four byte section of memory, as shown in the figure below.

Figure 2-5. Consecutive Branch Example

This may occur when there are two consecutive branches with no intervening instructions and the second instruction is only two bytes long (such as a jump relative +/- 128).

To avoid a misprediction in these cases, make the second branch longer by using a 16-bit relative displacement on the branch instruction instead of an 8-bit relative displacement.

2.4 Write Buffers

Processors with MMX technology have four write buffers (versus two in Pentium processors without MMX technology). Additionally, the write buffers can be used by either the U-pipe or the V-pipe (versus one corresponding to each pipe in Pentium processors without MMX technology). Performance of critical loops can be improved by scheduling the writes to memory; when you expect to see write misses, you should schedule the write instructions in groups no larger than four, then schedule other instructions before scheduling further write instructions.