Theoretical peak performance question: GF100 can't co-issue instructions, can it?

Greetings,

I’m in the final stages of writing up a paper concerning a CUDA implementation of a cross-correlation algorithm for radio astronomy. The code is highly optimized, and on large enough problem sizes achieves over 1060 GFLOPS on a GTX 480. (The algorithm is one of those ideal ones like GEMM, in that it has O(n^2) compute, but only O(n) memory traffic.)

I am currently trying to build a performance model of this code, to see what the limiting factors are. Each thread does the following computation in a for loop:

    [*] 128x fma instructions

    [*] 2x __syncthreads()

    [*] 2x floating point additions

    [*] 32x 32-bit loads from shared memory

    [*] 4x 32-bit stores to shared memory

    [*] 2x 64-bit texture requests

    [*] 1x branch compare (from the for loop)

This comes to a total of 171 instructions issued, with the 128 fma instructions being the actual desired computation, and that is what I am counting when I report the 1060 GFLOPS figure. This corresponds to 79% of peak performance. I’m trying to work out what the actual utilization is when I count the other instructions.

To the best of my knowledge GF100/GF110 cannot co-issue instructions within the same warp (unlike GF104/GF114), so I had assumed that every issued instruction that is not computation subtracts from the available GFLOPS. Thus, the effective utilization would be 79% * (171/128) = 106%. Clearly this can’t be correct, so what am I doing wrong here? Can GF100 co-issue some instructions, e.g., a shared memory load together with an fma?
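(For reference, here is the arithmetic behind that number as a tiny host-side sketch. The instruction counts and the 79% figure are the ones listed above; the assumption that every issued instruction costs one fma slot is exactly the assumption I am questioning.)

// Naive issue-slot model (host-side sketch, not part of the kernel): assumes every
// issued instruction, fma or not, consumes one slot that could have been an fma.
#include <stdio.h>

int main(void)
{
    const double measured_fma_fraction = 0.79;                  // 1060 GFLOPS / peak on GTX 480
    const int fma_per_iter   = 128;
    const int total_per_iter = 128 + 2 + 2 + 32 + 4 + 2 + 1;    // = 171 issued instructions

    double implied_slot_use = measured_fma_fraction * total_per_iter / (double)fma_per_iter;
    printf("implied issue-slot utilization: %.0f%%\n", implied_slot_use * 100.0);  // ~106%
    return 0;
}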

All cards can co-issue instructions, beginning with the very first compute capability 1.0 devices. The difference between compute capability 2.1 (GF104/GF114) and earlier cards is that 2.1 devices can (sometimes) issue two fma instructions at the same time, while earlier devices can only co-issue an fma together with certain other non-fma instructions (like loads/stores, as you said).

Thanks for the response.

Hmm, that’s the first I’ve heard of that. Take a read of this review of the 460 on anandtech:

http://www.anandtech.com/show/3809/nvidias-geforce-gtx-460-the-200-king/2

This is an overview of the architecture differences between GF100 and GF104. They say that the GF100 can only use 2 pipelines out of 6 simultaneously, whereas GF104 can use 4 out of 7 because of its superscalar architecture. This contradicts what you are stating. Who is correct?

No contradiction. The article just says:

In a best-case scenario GF104 can utilize 4 of 7 execution units, while GF100 could only utilize 2 of 6 execution units.

But it is not easy to utilize four functional units because of register pressure.

Think about GEMM: it has a huge block of fma instructions.

On GF104, the best case is to use 3 fp32 units (each unit has 16 cores).

So only 3 of 7 execution units are used.

If register pressure is not a problem, then a DFMA and a non-fma instruction can run at the same time.

However, so far a DFMA uses all the pipeline paths.

This is a contradiction. The previous poster said that GF100 can co-issue instructions. If that is the case, then GF100 can utilize more than 2 of its 6 execution units, which contradicts the anandtech article. The example you give with GF104 is beside the point.

A Fermi card has two warp schedulers: one is for warps with odd IDs, the other for warps with even IDs.

Each warp scheduler can choose a warp to issue to one functional unit.

So in most cases GF100 can use 2 functional units; that is dual issue (co-issue).

I don’t think GF100 should use superscalar issue to utilize more than 2 functional units, because of register pressure.

GEMM is a typical example: almost all of its operations are fma.

Although GF100 has 6 functional units, only two of them can do fma.

Even if GF100 adopted superscalar issue, it would not improve GEMM performance but would increase design complexity

(a core becomes bigger but die size stays the same).

I would like to see GF100 co-issue a DFMA with a load/store operation (shared memory ↔ register).

This would improve the performance of dgemm and zgemm.

Hi Lung Sheng,

If I understand correctly, you are saying that GF100 cannot currently co-issue any instruction with a DFMA, because register file bandwidth is the bottleneck.

A way to execute loads from shared memory and DFMAs simultaneously without impacting the register file would be to allow arithmetic instructions to get some of their operands directly from shared memory.

Ironically, this was the solution that was selected for Tesla, and then dropped with Fermi.

From a (low-level) programmer perspective, do you think going back to a Tesla-like instruction set with operands from memory would be better? Or are there problems with this solution too (ignoring technical issues that Tesla had, like the 2-cycle penalty you found when optimizing SGEMM for Tesla)?

Just curious. :)

Two months ago I was thinking about the peak performance of dgemm/zgemm. We know MAGMA’s kernel is the best dgemm so far,

and it reaches 300 Gflops. However, the peak DFMA performance on a C2050 is 515 Gflops, which is much higher than 300 Gflops.

I wanted to know why dgemm/zgemm is so difficult to improve.

I implemented MAGMA’s kernel with a 32x16 tile of A, a 16x32 tile of B, and a 32x32 tile of C; the thread block is 16x16.

I also used cuobjdump to check the assembly code on sm_13.

I wanted to count all the instructions in the kernel and estimate its peak performance.

The assembly code shows:

(1) DFMA takes no operand from shared memory; it is of the form “DFMA dst, src1, src2, src3”,

where dst, src1, src2, src3 are registers.

(2) “shared memory <--> register” transfers are 32-bit, not 64-bit,

so moving a "double" from shared memory to a register needs two instructions.

Hence the number of "shared mem --> register" moves is the same as the number of DFMAs.
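To make the accounting concrete, here is a hedged toy fragment (the names, tile shape, and indexing are my own assumptions, not MAGMA’s actual kernel): on sm_13 every double fetched from shared memory for a DFMA is a 64-bit move issued as two 32-bit transfers, and in this pattern there is one such move per DFMA.

// Toy sketch only, not MAGMA's kernel: one shared-memory double fetched per DFMA.
__global__ void dfma_accounting_sketch(const double *gA, const double *gB, double *gC, int k)
{
    __shared__ double sA[16][16];
    int tx = threadIdx.x, ty = threadIdx.y;

    sA[ty][tx] = gA[ty * 16 + tx];      // stage a tile of A in shared memory
    __syncthreads();

    double b = gB[tx];                  // B value kept in a register
    double c = 0.0;
    for (int i = 0; i < k; ++i) {
        double a = sA[ty][i & 15];      // shared -> register: two 32-bit moves on sm_13
        c = fma(a, b, c);               // one DFMA per double fetched from shared memory
    }
    gC[ty * 16 + tx] = c;
}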

The following table summarizes the cost; I ignore the O(N^2) terms.

Matrix A is m x k, matrix B is k x n, and matrix C is m x n.

-------------------------------------------------------------
operation                    |  number of instructions
-------------------------------------------------------------
Load A, B to shared memory   |  (mnk/8)*(37/8) = 0.578 mnk
shared memory --> register   |  4 mnk
DFMA                         |  4 mnk
__sync                       |  mnk/32
-------------------------------------------------------------

In order to do the estimation, I assume

  1. one DFMA does 2 flops and its execution time is T1, so 2/T1 = 515 Gflops;

  2. one non-fma instruction counts as 1 flop and its execution time is T2, so 1/T2 = 515 Gflops;

  3. no instruction can co-issue with a DFMA.

Then the expected performance of MAGMA’s kernel is

4*mnk * 2 (flops)
---------------------------------------- = 326.72 Gflops
4*mnk * T1 + (4 + 0.578 + 1/32)*mnk * T2

In fact, if the best scheduling strategy is considered, it should be 340 Gflops.

Now think about what happens if a DFMA and a “shared mem → reg” move can be issued at the same time; then

the expected performance of MAGMA’s kernel should be

4*mnk * 2 (flops)
---------------------------------------- = 478 Gflops
4*mnk * T1 + (0.578 + 1/32)*mnk * T2
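(A quick host-side check of these two numbers, using only the assumptions above, i.e. T1 = 2/515 per DFMA, T2 = 1/515 per other instruction, and the per-mnk counts from the table:)

// Sketch reproducing the two estimates above from the stated assumptions.
#include <stdio.h>

int main(void)
{
    const double peak = 515.0;           // Gflops, DFMA peak on C2050
    const double T1 = 2.0 / peak;        // time of one DFMA (2 flops)
    const double T2 = 1.0 / peak;        // time of one non-fma instruction

    // per-mnk instruction counts from the table
    const double dfma = 4.0, smem = 4.0, gload = 0.578, sync = 1.0 / 32.0;
    const double flops = dfma * 2.0;

    double t_seq  = dfma * T1 + (smem + gload + sync) * T2;   // nothing co-issues with DFMA
    double t_dual = dfma * T1 + (gload + sync) * T2;          // shared->reg hidden under DFMA

    printf("sequential : %.1f Gflops\n", flops / t_seq);      // ~326.7
    printf("dual issue : %.1f Gflops\n", flops / t_dual);     // ~478.6
    return 0;
}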

That is why I said:

I would like to see GF100 co-issue a DFMA with a load/store operation (shared memory <--> register).

This would improve the performance of dgemm and zgemm.

Even if a DFMA uses both pipeline paths, a “shared mem → reg” move could bypass them through the shared memory path.

I just wonder whether the register file is under pressure on its write ports.

The dual warp scheduler accounts for the ability to use 2 out of the 6 pipelines. This is what I assumed in my performance model when trying to calculate the maximum achievable performance of my kernel. The problem is that when I count all the instructions issued, my 1062 GFLOPS (= 79% of peak) of fma instructions is equivalent to exceeding peak performance. So what is going wrong in my performance model? If GF100 can’t co-issue instructions from within a single warp, then there must be another explanation. Note that the ratio of fma instructions to shared memory loads is 4:1, so if I only count these, my 79% of peak corresponds to effectively (5/4) * 79% = 99% of the issue slots. The problem is the additional instructions that bring me over 100%.

Regarding the GEMM performance model, the more relevant one to my work is probably the Magma CGEMM, as it has almost identical performance in terms of GFLOPS, and I’m guessing a similar ratio of fma instructions to shared memory loads.

I’m enjoying this discussion.

This is interesting. Thanks for sharing this!

Did you try to look at the assembly generated for sm_20, as a means of comparison? (using Imran’s script – or does CUDA 4.0 RC provide cuobjdump for Fermi?)

You are right that the DFMA in Tesla does not support shared memory operands and shared memory loads are limited to 32-bit.

My question was more on instruction set design in general, and about a hypothetical instruction set that would retain the features of Fermi (like 64-bit loads) while also offering memory-based operands for all arithmetic instructions.

Or put another way, how would these two imaginary architectures compare in the context of DGEMM:

  • Fermi with (DFMA + Load from shared mem) dual issue,

  • Fermi with a special DFMA instruction that can take an operand from shared memory, but no dual issue?

Did you count the instructions in the assembly code (actual asm, not PTX)? The compiler could have performed optimizations, like fusing together 32-bit loads into 128-bit loads, for instance.

I haven’t tried Imran’s script; I am waiting for CUDA 4.0. I would like to see cuobjdump for sm_20 and its inverse (an assembler).

Theoretically speaking, I would suggest the former.

Recall my calculation above: the number of “Load from shared mem” instructions is the same as the number of DFMAs.

-------------------------------------------------------------
operation                    |  number of instructions
-------------------------------------------------------------
shared memory --> register   |  4 mnk
DFMA                         |  4 mnk
-------------------------------------------------------------

Case 1: Fermi with (DFMA + load from shared mem) dual issue.

The expected performance of MAGMA’s kernel should be

4*mnk * 2 (flops)
---------------------------------------- = 478 Gflops
4*mnk * T1 + (0.578 + 1/32)*mnk * T2

because the DFMA and the “load from shared mem” run in parallel, not sequentially.

Case 2: Fermi with a special DFMA instruction that can take an operand from shared memory, but no dual issue.

The number of “load from shared mem” instructions is 2*mnk, because we no longer need to load the submatrix of B from shared memory

into registers. But we still need to load the submatrix of A from shared memory into registers, because ONLY ONE operand

of the DFMA comes from shared memory (a 64-bit data transfer).

The expected performance of MAGMA’s kernel should be

4*mnk * 2 (flops)
---------------------------------------- = 388 Gflops
4*mnk * T1 + (2 + 0.578 + 1/32)*mnk * T2
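(The same kind of host-side check as before, under the same T1/T2 assumptions, reproduces both cases:)

// Sketch comparing the two hypothetical architectures above.
#include <stdio.h>

int main(void)
{
    const double peak = 515.0;        // Gflops, DFMA peak on C2050
    const double T1 = 2.0 / peak;     // one DFMA
    const double T2 = 1.0 / peak;     // one non-fma instruction
    const double flops = 4.0 * 2.0;   // flops per mnk

    // case 1: the 4 mnk shared->register loads are dual-issued with the DFMAs
    double t_case1 = 4.0 * T1 + (0.578 + 1.0 / 32.0) * T2;
    // case 2: one DFMA operand comes from shared memory, so only the A tile
    //         (2 mnk loads) still goes through registers, and nothing co-issues
    double t_case2 = 4.0 * T1 + (2.0 + 0.578 + 1.0 / 32.0) * T2;

    printf("case 1: %.0f Gflops\n", flops / t_case1);   // ~478
    printf("case 2: %.0f Gflops\n", flops / t_case2);   // ~388
    return 0;
}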

Surprisingly, this is equivalent to supporting 64-bit data transfers between shared memory and registers.

I am not an architect, but so far the read and write bandwidth of the register file are not balanced.

“DFMA dst, src1, src2, src3” needs three 64-bit read ports (in fact, 3 x 64 x 16 (cores) = 3072 bits)

but only one 64-bit write port. If write ports are hard to add, then

“Fermi with a special DFMA instruction that can take an operand from shared memory, but no dual issue”

is a choice; otherwise, “64-bit data transfer between shared memory and registers” would be better.

Thanks! That makes sense.

From a hardware perspective, I think one of the most expensive parts is the crossbar between shared memory and registers: in the general case, we need to allow any thread of the warp to access any shared memory bank, so it requires a full 32x32, 32-bit crossbar.

Such a crossbar is huge, power-hungry, and likely has several clock cycles of latency. So it makes sense to put it inside a separate Load/Store unit, and also share it with the access path to global memory (whose L1 cache already shares hardware with smem).

This consideration was likely more important for Fermi than the number of write ports in the register file.

The bad news is that true “64-bit data transfer between shared memory and register” would be very expensive too, as it would double the size of the crossbar.

Unless we give up on allowing arbitrary gather/scatter at full speed…

Can’t GF104/GF114 do this already? The architecture should allow co-issue of an instruction to the FP pipeline and one to the load/store pipeline. If the even scheduler issues an FP instruction plus a load/store instruction, while the odd scheduler issues two FP instructions, I imagine GF104/GF114 would get higher peak performance than GF100/GF110 for kernels with the right balance of instructions. DGEMM probably isn’t a good example here since GF104/GF114 has impaired double precision. Does anyone have a 460 or 560 at hand to benchmark SGEMM/CGEMM?

SGEMM and CGEMM are good on GF100 because the number of “shared mem → register” moves is only 1/8 to 1/16 of the number of fma instructions.

So MAGMA can reach 780 Gflops on CGEMM.

Yes, GF104 is not good at dgemm because its peak DFMA performance is only 1/3 of its peak fma performance;

on GF100, peak DFMA performance is 1/2 of peak fma performance.

Thanks, I neglected the design complexity of the crossbar.

So if future GPUs should not support “64-bit data transfer between shared memory and register”,

then the only way to improve DGEMM/ZGEMM is to co-issue DFMA and “shared mem → reg”?

Just done this now. My instruction counts were accurate with one exception. I was astounded to find that the individual 32-bit shared memory loads were being coalesced into 128-bit transactions, e.g., my original CUDA code, which looks like this (TILE_WIDTH = TILE_HEIGHT = 8):

float col1Xreal = input[s][4*tx];
float col1Ximag = input[s][4*tx + 4*TILE_WIDTH];
float col1Yreal = input[s][4*tx + 1];
float col1Yimag = input[s][4*tx + 1 + 4*TILE_WIDTH];
float col2Xreal = input[s][4*tx + 2];
float col2Ximag = input[s][4*tx + 2 + 4*TILE_WIDTH];
float col2Yreal = input[s][4*tx + 3];
float col2Yimag = input[s][4*tx + 3 + 4*TILE_WIDTH];
float row1Xreal = input[s][4*ty + 8*TILE_WIDTH];
float row1Ximag = input[s][4*ty + 4*TILE_HEIGHT + 8*TILE_WIDTH];
float row1Yreal = input[s][4*ty + 1 + 8*TILE_WIDTH];
float row1Yimag = input[s][4*ty + 1 + 4*TILE_HEIGHT + 8*TILE_WIDTH];
float row2Xreal = input[s][4*ty + 2 + 8*TILE_WIDTH];
float row2Ximag = input[s][4*ty + 2 + 4*TILE_HEIGHT + 8*TILE_WIDTH];
float row2Yreal = input[s][4*ty + 3 + 8*TILE_WIDTH];
float row2Yimag = input[s][4*ty + 3 + 4*TILE_HEIGHT + 8*TILE_WIDTH];

is transformed into just four instructions of the form

ld b128 $r12q s[$r61+0x80]

My immediate thought was this should cause bank conflicts, but on second thoughts I don’t think this is the case. The starting address for each of these requests will correspond to a different bank for all threads in the warp. The 128-bit word is split into four 32-bit words, each of which operates at full bandwidth since there is no bank conflict for each 32-bit sub-word. I had no idea this was possible before.
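(Out of curiosity, the same access could presumably be written explicitly as a vector load. A hedged sketch of just the first four values, assuming each row of input starts on a 16-byte boundary so the cast is legal:)

// Hypothetical rewrite of the first four 32-bit loads as one explicit 128-bit load.
// 4*tx floats = 16*tx bytes, so the address is 16-byte aligned if the row base is.
float4 col1 = *reinterpret_cast<const float4 *>(&input[s][4 * tx]);
// col1.x == col1Xreal, col1.y == col1Yreal, col1.z == col2Xreal, col1.w == col2Yreal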

With my revised instruction counts I have

    [*] 128x fma instructions

    [*] 2x __syncthreads()

    [*] 2x floating point additions

    [*] 8x 128-bit loads from shared memory

    [*] 4x 32-bit stores to shared memory

    [*] 2x 64-bit texture requests

    [*] 1x branch compare (from the for loop)

My performance of 1062 GFLOPS is equivalent to (147/128) * 0.79 = 91% utilization. So I don’t hold much hope for further speed improvements.
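(Plugging the revised counts into the same naive issue-slot sketch as before:)

// Revised issue-slot model with the 128-bit shared-memory loads counted as 8 instructions.
#include <stdio.h>

int main(void)
{
    const double measured_fma_fraction = 0.79;                // 1062 GFLOPS / peak
    const int fma_per_iter   = 128;
    const int total_per_iter = 128 + 2 + 2 + 8 + 4 + 2 + 1;   // = 147 issued instructions

    printf("implied issue-slot utilization: %.0f%%\n",
           measured_fma_fraction * total_per_iter / (double)fma_per_iter * 100.0);  // ~91%
    return 0;
}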