How close to peak can you get on a CPU?

So I am trying to write a microbenchmark that achieves close to the advertised peak FLOPs on an Intel Core2 Quad Q9550 (45.2 single precision GFLOPs according to this source http://www.intel.com/support/processors/sb/cs-023143.htm ).

I am currently at ~16 GFlops using 4 threads and a single unrolled loop around a large number of independent 4-wide SSE vector instructions. I have verified that no registers are being spilled (there are no memory ops in the loop) and that the loop fits into the L1 instruction cache. At this point I think the LLVM code generator is working against me (the scheduled code looks a bit odd), and to get beyond this I need to start hand-scheduling the assembly. To do that I need a deeper understanding of the functional-unit latencies and bottlenecks in the datapath, and I haven't had much luck finding this information online.

Does anyone know either where to find this information, or, know where I can find a (simple) benchmark that achieves close to peak that I can use as an example?

Thanks in advance.

Are you running a 32bit or a 64bit operating system? One thing I found when hand-tuning my code is that the 8 FP registers available in 32bit are not enough to hide all latencies, while it was possible with the 16 registers in 64bit.

I'm running 64-bit Ubuntu. The entire body of the inner loop is kept within the SSE register file, and the code is written such that it should be possible for the compiler to schedule about 10 instructions back to back with no dependencies. I think the problem is that the compiler (LLVM) is messing up the schedule: even though there are plenty of independent instructions, it is scheduling dependent instructions back to back (the original C/CUDA source actually has a better schedule). I'm going to try hand-tuning the assembly next, but I was hoping there was some information about the SSE unit latencies…

Hi Greg,

The official “Intel 64 and IA-32 Architectures Optimization Reference Manual” includes tables of latency and throughput in Appendix C.
http://www.intel.com/products/processor/manuals/
(Q9550 is 06_17H/06_1DH)

Another useful resource is Agner Fog’s page:
http://www.agner.org/optimize/
…which is often more accurate than Intel's own documentation…

Never tried it, but Intel offers a free static code analyzer. It is designed for AVX, but should support plain SSE and older CPUs:
http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/
It shows an ASCII-art diagram of the issue port and latency of each instruction in order to assist manual scheduling.

This is when you realize how easy it is to program a GPU. :)

Have you tried Intel V-Tune? It’s made for profiling stuff like this:

http://software.intel.com/en-us/articles/intel-vtune/

Thanks for the suggestions. I put in a request to download VTune, but they are still getting back to me on that one. The Intel 64/IA-32 optimization manual was particularly useful.

So the conclusion is that, as I suspected, it is a problem with the instruction scheduler. I started out with a series of operations like this:

mul %rd1, %rs1, %rc1
mul %rd2, %rs2, %rc1
mul %rd3, %rs3, %rc1
mul %rd4, %rs4, %rc1

mul %rd1, %rs1, %rd1
mul %rd2, %rs2, %rd2
mul %rd3, %rs3, %rd3
mul %rd4, %rs4, %rd4

And the LLVM x86 code scheduler would reorder them into something more like this:

xmulps %rd1, %rs1, %rc1
xmulps %rd1, %rs1, %rd1

xmulps %rd2, %rs2, %rc1
xmulps %rd2, %rs2, %rd2

xmulps %rd3, %rs3, %rc1
xmulps %rd3, %rs3, %rd3

xmulps %rd4, %rs4, %rc1
xmulps %rd4, %rs4, %rd4

Manually reordering the instructions brought the throughput up to the expected ~42 GFlops. I’m filing a bug against LLVM for this. The good news is that if this eventually gets resolved in LLVM it should be possible to hit the theoretical peak on a CPU using a program written in CUDA, translated first to LLVM then to x86.

I would be surprised if it is a scheduling problem; either version should pose no challenge to an out-of-order core.

Out of curiosity, what is the x86 code produced by each version? Both scheduling options will require a few extra mov instructions in a 2-operand instruction set such as x86, depending on the register allocation.

Did you try intermixing add instructions to check whether you can reach the actual ~90 GFlops peak?

Here is a better description that Andy wrote, which we are planning to submit to the LLVM developers. I think the version I posted before used too many registers per multiply:

I've noticed an unusual behavior of the LLVM x86 code generator (with default options) that results in nearly a 4x slow-down in floating-point throughput for my microbenchmark.

I've written a compute-intensive microbenchmark to approach the theoretical peak throughput of the target processor by issuing a large number of independent floating-point multiplies. The distance between dependent instructions is at least four instructions to match the latency of the floating-point functional unit in my Intel Core2 Quad (Q9550 at 2.83 GHz).

The microbenchmark itself replicates the following block 512 times:

{
    p1 = p1 * a;
    p2 = p2 * b;
    p3 = p3 * c;
    p4 = p4 * d;
}

Compiling with NVCC, Ocelot, and LLVM, I can confirm the interleaved instruction schedule with a four-instruction reuse distance. An excerpt follows:

%r1500 = fmul float %r1496, %r24 ; compute %r1500
%r1501 = fmul float %r1497, %r23
%r1502 = fmul float %r1498, %r22
%r1503 = fmul float %r1499, %r21
%r1504 = fmul float %r1500, %r24 ; first use of %r1500
%r1505 = fmul float %r1501, %r23
%r1506 = fmul float %r1502, %r22
%r1507 = fmul float %r1503, %r21
%r1508 = fmul float %r1504, %r24 ; first use of %r1504

The JIT compiler, however, seems to break the interleaving of independent instructions and issues a long sequence of instructions with back-to-back dependencies. It is as if all p1 = … expressions are collected at once, followed by all p2 = … expressions, and so forth:

p1 = p1 * a
p1 = p1 * a
...
p2 = p2 * b
p2 = p2 * b
...
p3 = p3 * c
p3 = p3 * c
...

An actual excerpt of the generated x86 assembly follows:

mulss %xmm8, %xmm10
mulss %xmm8, %xmm10
... (repeated 512 times)
mulss %xmm7, %xmm9
mulss %xmm7, %xmm9
... (repeated 512 times)
mulss %xmm6, %xmm3
mulss %xmm6, %xmm3
... (repeated 512 times)

Since p1, p2, p3, and p4 are all independent, this reordering is correct. It would have the possible advantage of reducing the live ranges of values. However, in this microbenchmark the number of live values is eight single-precision floats, well within SSE's sixteen architectural registers.

Moreover, the pipeline latency is four cycles, meaning back-to-back instructions with true dependencies cannot issue immediately. The benchmark was intentionally written to avoid this hazard, but LLVM's code generator seems to ignore that when it schedules instructions.

When I run this benchmark on my 2.83 GHz CPU, I observe the following performance results:

1 thread  0.648891 GFLOP/s
2 threads 1.489049 GFLOP/s
3 threads 2.209838 GFLOP/s
4 threads 2.940443 GFLOP/s

When I rewrite the generated assembly by hand to exhibit the same interleaving as in the LLVM IR form

mulss %xmm8, %xmm10
mulss %xmm7, %xmm9
mulss %xmm6, %xmm3
mulss %xmm5, %xmm11
mulss %xmm8, %xmm10
mulss %xmm7, %xmm9
mulss %xmm6, %xmm3
mulss %xmm5, %xmm11
mulss %xmm8, %xmm10
mulss %xmm7, %xmm9
...

observed performance increases by nearly a factor of four:

1 thread  2.067118 GFLOP/s
2 threads 5.569419 GFLOP/s
3 threads 8.285519 GFLOP/s
4 threads 10.81742 GFLOP/s

I see similar results with packed single-precision floating-point SIMD instructions. The LLVM-generated machine code is nearly 4x slower than the interleaved schedule "suggested" by the LLVM IR input.

Vectorized - no instruction interleaving, back-to-back dependencies (issued by LLVM code generator):

1 thread  1.540621 GFLOP/s
2 threads 5.900833 GFLOP/s
3 threads 8.755953 GFLOP/s
4 threads 11.257122 GFLOP/s

Vectorized - interleaved, stride-4 reuse distance (hand-modified to interleave instructions):

1 thread  3.157255 GFLOP/s
2 threads 22.104369 GFLOP/s
3 threads 32.300111 GFLOP/s
4 threads 39.112162 GFLOP/s

It is worth noting that 39.1 GFLOP/s is approaching the theoretical limit of the processor (stated to be 45.25 GFLOP/s single-precision).

Do you mean using extra instructions to try to use more of the available issue ports/functional units, or using an FMA instruction instead of a MUL? We hadn't considered the first option, but the Core 2 that we are benchmarking does not support FMA…

Thanks.

Agreed, there is no way any existing CPU can do reordering across 512 instructions…

I mean the former. The most common definition of flop involves a mix of Adds and Muls. I believe the link you cite refers to double precision flops.

Code consisting only of MULPS has an IPC of only 1, so it is actually a very light load for a 6-issue superscalar…

Your compiler could schedule instructions and allocate registers at random and emit a lot of extraneous Movs and you would still be able to reach the peak. :)

(well, LLVM manages to do worse than random here…)

Hi Greg,
I have worked a little with Intel… Here is what I can tell you.

"
Getting close to peak performance involves keeping all the execution ports busy… (you can discount the load/store ports). This way you get more instructions retiring per clock cycle. The CPI (Clocks Per Instruction) metric in VTune serves as a good indicator.

The Intel optimization manual has latency and throughput info, as Sylvain rightly pointed out (Appendix C). I remember reading that the Core microarchitecture can retire 4 µops per cycle (0.25 CPI is the BEST you can get, assuming MUL and ADD each translate to a single µop). MUL and ADD have separate pipelines, so it would be useful to mix MULs and ADDs. This way you can hide the "throughput" delay (the CPU needs to wait a bit before accepting another instruction of the same kind; see the definition of "throughput" in Appendix C). The average number of µops waiting in the ROB (Re-Order Buffer Full Ratio in VTune) is a good indication of this kind of scheduling delay.

The Intel Optimization manual describes the microarchitecture of various cores. It should be very useful as a reference.

As Sylvain pointed out, Agner Fog’s works are also a great resource.
"

It would be great if you could post your final code that extracts all the FLOPs on this page. Thanks in advance,

Best Regards,
Sarnath
