Understanding uops
A uop is a micro-operation. The prefix "micro" can be written with the Greek letter μ and "operation" abbreviated to op, giving μop; for typing convenience, u usually replaces μ. The plural uops refers to micro-operations collectively, and this article uses uops throughout.
For x86, which is a CISC architecture, instructions of varying length and complexity are internally converted into one or more uops so that they can be scheduled and optimized more efficiently, thereby improving CPU performance.
You can think of x86 instructions as coarse-grained, while uops are tiny, fine-grained micro-instructions. By breaking complex coarse-grained instructions into simple fine-grained ones, the CPU effectively gains some of the advantages of a RISC architecture, making the overall structure more compact, flexible, and efficient.
Advantages of uops
The main advantages of splitting instructions into micro-operations are:
Out-of-Order Execution
Take the PUSH rbx instruction as an example. It decrements the stack pointer by 8 bytes and then stores the source operand at the top of the stack. Suppose that after decoding, PUSH rbx is broken into two dependent uops:
SUB rsp, 8
STORE [rsp], rbx
Typically, a function prologue saves multiple registers with several PUSH instructions. In this example, the next PUSH can begin executing as soon as the SUB uop of the previous PUSH completes, without waiting for that PUSH's STORE uop, which may still be in flight.
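The effect of that dependency chain can be sketched with a toy timing model. This is only an illustration: the latencies SUB_LATENCY and STORE_LATENCY are made-up values, the function names are hypothetical, and the model ignores real-world limits such as execution ports and store buffer capacity.

```python
# Toy timing model: a uop starts as soon as all of its inputs are ready.
# Both latencies are assumed, illustrative values, not measurements.
SUB_LATENCY = 1     # assumed cycles for SUB rsp, 8
STORE_LATENCY = 4   # assumed cycles for STORE [rsp], reg


def schedule_pushes(count):
    """Return (sub_done, store_done) cycles for `count` back-to-back PUSHes."""
    timeline = []
    rsp_ready = 0  # cycle at which the current rsp value is available
    for _ in range(count):
        sub_done = rsp_ready + SUB_LATENCY      # SUB waits only for the previous SUB
        store_done = sub_done + STORE_LATENCY   # STORE waits for this PUSH's SUB
        timeline.append((sub_done, store_done))
        rsp_ready = sub_done                    # the next PUSH needs rsp, not the STORE
    return timeline


def serialized_finish(count):
    """Finish cycle if every PUSH had to complete before the next one started."""
    return count * (SUB_LATENCY + STORE_LATENCY)


for index, (sub_done, store_done) in enumerate(schedule_pushes(3)):
    print(f"PUSH {index}: SUB done at cycle {sub_done}, STORE done at cycle {store_done}")
print("finish if fully serialized:", serialized_finish(3))
```

Under these assumed latencies, the three STOREs overlap, so the last one finishes far earlier than it would if each PUSH were an indivisible operation.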
Parallel Execution
Take the HADDPD xmm1, xmm2 instruction as an example. It horizontally adds (reduces) the two double-precision floating-point values held in each of xmm1 and xmm2 and stores both results in xmm1. Both right-hand sides below use the register values from before the instruction executes:
xmm1[63:0] = xmm1[127:64] + xmm1[63:0]
xmm1[127:64] = xmm2[127:64] + xmm2[63:0]
One way to microcode this instruction is to do the following:
- Reduce xmm2 and store the result in xmm_tmp1[63:0];
- Reduce xmm1 and store the result in xmm_tmp2[63:0];
- Merge xmm_tmp1 and xmm_tmp2 into xmm1.
That is three uops in total. Steps 1 and 2 are independent, so they can be executed in parallel.
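The three-step decomposition can be modeled in a few lines. This is a sketch, not a description of how any real CPU microcodes HADDPD: registers are represented as (low, high) tuples, the helper names reduce_lanes and merge are hypothetical, and the lane order of the result follows Intel's documented HADDPD behavior (low lane from xmm1, high lane from xmm2).

```python
def reduce_lanes(reg):
    """One 'reduce' uop: add the low and high 64-bit lanes of a register."""
    low, high = reg
    return low + high


def merge(low, high):
    """The 'merge' uop: pack two scalars into one 128-bit register value."""
    return (low, high)


def haddpd(xmm1, xmm2):
    """Model HADDPD xmm1, xmm2 as three uops; returns the new value of xmm1."""
    tmp1 = reduce_lanes(xmm2)   # uop 1: reads only xmm2
    tmp2 = reduce_lanes(xmm1)   # uop 2: reads only xmm1, independent of uop 1
    return merge(tmp2, tmp1)    # uop 3: needs both temporaries


print(haddpd((1.0, 2.0), (10.0, 20.0)))  # prints (3.0, 30.0)
```

Because tmp1 and tmp2 read different source registers and write different temporaries, neither has to wait for the other; only the merge depends on both.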
Micro-fusion and Macro-fusion
Sometimes uops can also be fused together. There are two types of fusion in modern CPUs:
Micro-fusion: fuses uops from the same machine instruction. Micro-fusion can only be applied to two kinds of combinations: memory-write operations and read-modify operations. For example:
add eax, [mem]
This instruction contains two uops:
- Read memory location mem;
- Add it to eax.
With micro-fusion, the two uops are fused into one during the decode stage.
Macro-fusion: fuses uops from different machine instructions. In some cases, the decoder can fuse an arithmetic or logical instruction with a subsequent conditional jump instruction into a single compute-and-branch uop. For example:
.loop:
dec rdi
jnz .loop
With macro-fusion, the two uops from DEC and JNZ are fused into one.
Both micro-fusion and macro-fusion save bandwidth in every pipeline stage, from decode to retirement. A fused uop occupies a single entry in the reorder buffer (ROB) instead of two, so the ROB's capacity is used more efficiently.
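To make the ROB saving concrete, here is a small counting sketch. It is a deliberately simplified model: the instruction tuples, the MACRO_FUSIBLE_PAIRS table, and the rob_entries function are all hypothetical, and the model assumes that a micro-fusible instruction collapses to exactly one entry and that both halves of a macro-fused pair are single-uop instructions.

```python
# Pairs the decoder is assumed to macro-fuse; only the DEC/JNZ pair
# from the text is listed here, for illustration.
MACRO_FUSIBLE_PAIRS = {("dec", "jnz")}


def rob_entries(instructions, micro_fusion=True, macro_fusion=True):
    """Count ROB entries for a list of (mnemonic, unfused_uops, micro_fusible)."""
    entries = 0
    i = 0
    while i < len(instructions):
        mnemonic, uops, micro_fusible = instructions[i]
        if macro_fusion and i + 1 < len(instructions):
            next_mnemonic = instructions[i + 1][0]
            if (mnemonic, next_mnemonic) in MACRO_FUSIBLE_PAIRS:
                entries += 1   # two instructions share one entry
                i += 2
                continue
        if micro_fusion and micro_fusible:
            entries += 1       # the instruction's uops share one entry
        else:
            entries += uops    # every uop takes its own entry
        i += 1
    return entries


# Loop body: add eax, [mem] / dec rdi / jnz .loop
loop_body = [("add", 2, True), ("dec", 1, False), ("jnz", 1, False)]
print("no fusion:        ", rob_entries(loop_body, False, False))
print("micro-fusion only:", rob_entries(loop_body, True, False))
print("both fusions:     ", rob_entries(loop_body, True, True))
```

In this model, one loop iteration needs four ROB entries without fusion, three with micro-fusion alone, and two with both kinds of fusion.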
Counting uops
To collect the number of uops issued, executed, and retired by an application, you can use Linux perf as follows:
$ perf stat -e uops_issued.any,uops_executed.thread,uops_retired.slots -- ./a.exe
2856278 uops_issued.any
2720241 uops_executed.thread
2557884 uops_retired.slots
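Counts like these are often easier to compare as ratios. The sketch below parses text in the same "count event" layout and computes the fraction of issued uops that were retired. The helper names are hypothetical, and the sample counts are placeholder values chosen for easy arithmetic, not measurements from any program.

```python
import re

# A line of interest looks like: "<count> <event name>", where the count
# may contain thousands separators.
LINE_RE = re.compile(r"^\s*(\d[\d,]*)\s+(\S+)")


def parse_perf_counts(text):
    """Map event name -> integer count for every 'count event' line in text."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            value, event = match.groups()
            counts[event] = int(value.replace(",", ""))
    return counts


def retired_fraction(counts):
    """Fraction of issued uops that were eventually retired."""
    return counts["uops_retired.slots"] / counts["uops_issued.any"]


# Placeholder counts for illustration only.
SAMPLE = """\
$ perf stat -e uops_issued.any,uops_executed.thread,uops_retired.slots -- ./a.exe
     1,000      uops_issued.any
       950      uops_executed.thread
       900      uops_retired.slots
"""

counts = parse_perf_counts(SAMPLE)
print(counts)
print(f"retired / issued = {retired_fraction(counts):.2f}")
```

A retired-to-issued ratio noticeably below 1 is a rough hint that some issued uops never retired, for example because they belonged to a mispredicted path.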
The way instructions are broken down into uops may vary across CPU generations. In general, the fewer uops an instruction requires, the better the hardware supports it, and the more likely it is to have low latency and high throughput. On the latest Intel and AMD CPUs, the vast majority of instructions generate exactly one uop.
References:
[1] Modern CPU Performance Analysis and Optimization
[2] Three Questions About the Philosophy of uops