Low-down on avoiding register bank conflicts?

I have a huge unrolled loop of multiply-adds, but it’s only getting 350 GFLOPS. I’m able to get 510 GFLOPS in another synthetic benchmark, which leads me to believe the loop has register bank conflicts. I know what you’re going to say - 350 compared to 510 is a small difference - but I’m being paid to get as much performance as possible.
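For concreteness, here is a stripped-down sketch of the kind of kernel I mean (the kernel name, unroll count, and constants are made up for illustration):

__global__ void madLoop(float *out, float a, float b)
{
    // One long chain of multiply-adds; #pragma unroll flattens the loop
    // into a straight run of MAD instructions.
    float x = threadIdx.x * 0.001f;
    #pragma unroll
    for (int i = 0; i < 256; ++i)
        x = x * a + b;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;   // keep x live
}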

Does anyone know the details of register bank allocation, other than that the thread block size should be a multiple of 64, which I’m already doing?

It seems clear the 1st priority would be to spread each variable across the banks so that all threads can access it in parallel. It’s unclear how the registers belonging to the same thread are allocated; ideally, you’d want those in separate banks as well. Since most instructions need 2 operands, this suggests there should be 8 SPs * 2 = 16 register banks, or 8 * 3 = 24 for multiply-add.

Attached is the simplest scheme I can think of that allows r2 = r0 + r1 to read its operands in parallel (the 1st index is the variable index within a thread, the 2nd is the thread index).
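In code form, one layout consistent with that scheme would be (an illustrative guess assuming 16 banks and 8 SPs, not the actual hardware):

// Illustrative guess: variable i of thread t lives in
//   bank(i, t) = (t + 8 * (i % 2)) % 16
// r0 and r1 of any given thread sit in different banks, so r2 = r0 + r1
// can read both operands in parallel; but r0 and r2 of thread 0 both
// land in bank 0.
int bank(int var, int thread) { return (thread + 8 * (var % 2)) % 16; }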

You wouldn’t be able to read r0 and r2 in parallel because they’re both in bank 0. I have yet to test my conjecture. Has anyone else tried figuring this out, or are any NVIDIA engineers willing to tell me?

Shared memory has bank conflicts, I guess… and registers have RAW (read-after-write) hazards which hamper performance (I guess).

Try with a thread block of >= 192 threads… (6 warps of 32, which is about what it takes to cover a ~24-cycle register RAW latency)

Also… try interleaving statements that use different registers, so that RAW stalls are minimized…
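Something like this, with two independent accumulator chains so that back-to-back MADs never depend on each other (just a sketch):

__global__ void interleaved(float *out, float a, float b)
{
    float x = threadIdx.x * 0.001f;
    float y = threadIdx.x * 0.002f;
    #pragma unroll
    for (int i = 0; i < 40; ++i) {
        x = x * a + b;   // chain 1
        y = y * a + b;   // chain 2, independent of chain 1
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x + y;
}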

I already know about the 24-cycle RAW delay, and my thread blocks are already 256 threads (512 doesn’t make a difference).

Instruction scheduling always helps, but it shouldn’t be needed on the GPU, because the hardware will just switch to another warp if one is available.

It does help… I showed that in a recent conference paper… I will post the paper here on the forums soon… (but just approx. 2-5% or so)

Also… as I said, bank conflicts I guess happen only in the case of shared memory… I have never heard of bank conflicts in registers… (my bad, maybe)…

Silly me. The time I measured included uploading the image to device memory, filling the ghost regions, and copying back the results. My actual performance is 480 GFLOPS. Maybe I’m still encountering register bank conflicts, since my loop is huge - 81 MAD instructions.
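In case it helps anyone else: timing just the kernel with CUDA events keeps the transfers out of the measurement (sketch; the kernel and buffer names are placeholders):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(dImage, dResult);   // placeholder launch
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
// GFLOPS = (flops per thread * number of threads) / (ms * 1e6)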

That assumes you don’t have enough threads to keep it busy. GPUs are an example of Gustafson’s weak scaling law - you need huge input sizes before you get maximum speedup.

Well, I achieved 500 GFLOPS… single-precision performance… for an N-body simulation… slightly better than NVIDIA’s own, though with the same data model…

Since threads are executed in SIMT fashion and there is no way to access registers indirectly, all threads in a warp always request the same register ID at the same time. In other words, the address presented to all banks will be the same.

So there is no need to spread these registers across several independent 32-bit-wide banks: storing them as vectors in one wider bank (32x32-bit or 16x32-bit-wide) will be enough, and will save address decoding hardware.

(Just like 128-bit SSE registers in a typical CPU…)
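A toy model of that organization (plain C for illustration, not actual hardware):

#define NUM_WARPS 24
#define REGS_PER_THREAD 16
#define WARP_SIZE 32

// One address (warp, regID) selects register regID for all 32 lanes at
// once, so a single address decode serves the whole 32-wide vector.
unsigned int regfile[NUM_WARPS][REGS_PER_THREAD][WARP_SIZE];

// e.g. regfile[w][8] is the 32-lane vector holding R8 for warp w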

Not necessarily. Keep in mind that the execution context is switched to another warp each (slow) cycle, and that latency does not matter much.

Allocating all registers that belong to the same warp in the same bank is actually not a bad idea.

Consider the instruction:

MUL R0, R8, R16

executed by 4 warps, on a register file (RF) with 2 banks.

All registers of warp 0 and warp 2 are stored in bank 0, registers of warp 1 and 3 in bank 1.

For each warp, the operands need to be read sequentially, but this delay can be overlapped with the operand reads of instructions from other warps. Instructions can be pipelined with no hazard to worry about:

cycle   warp   bank 0   bank 1
  0      0     0:R8
  1      1     0:R16    1:R8
  2      2     2:R8     1:R16
  3      3     2:R16    3:R8
  4                     3:R16

(w:Rn denotes register Rn of warp w)

The actual implementation is slightly more complex because of the different clock domains and the 2 execution units (SP & SFU), but the basic idea is the same.

If you are really curious, this patent describes something that should be fairly close to the register file of the Tesla architecture ;).

Anyway, the bottom line is: don’t worry too much about how register allocation affects bank conflicts, you just have to run enough warps (from the same or different blocks).

Right, good call. Putting them all in the same bank would be a large saving.

I think they would at least try to have an instruction’s 2 operands in separate banks. That way, each instruction only needs 1 cycle to read its operands, which would improve latency at the same throughput. Of course, having every instruction’s 2 operands in separate banks is infeasible unless you have enough banks.

However, I don’t think it’s practical to switch warps just because one warp needs 1 more cycle to read its 2nd operand. In that case, the pipeline would stall and throughput would decrease. I’m just guessing, though - I’ll read the patent to get a better idea.