I have a huge unrolled loop of multiply-adds, but it’s only getting 350 Gflops. I’m able to get 510 Gflops in another synthetic benchmark. This leads me to believe the loop is hitting register bank conflicts. I know what you’re going to say - 350 compared to 510 is a small difference, but I’m being paid to get as much performance as possible.
Does anyone know the details of register bank allocation other than to make the thread block size a multiple of 64, which I’m already doing?
It seems clear the first priority would be to spread the same variable across the banks so that all threads can access it in parallel. It’s unclear how they allocate the registers that belong to the same thread. Ideally, you’d want them in separate banks as well. Since most instructions need 2 operands, this suggests there need to be 8 SPs * 2 = 16 register banks, or 8 * 3 = 24 for multiply-add, which reads 3 operands.
Attached is the simplest scheme I can think of that allows r2 = r0 + r1 to read its operands in parallel (the 1st index is the variable index within the thread, the 2nd is the thread).
You wouldn’t be able to read r0 and r2 in parallel because they’re both in bank 0. I have yet to test my conjecture. Has anyone else tried figuring this out, or are any NVIDIA engineers willing to tell me?
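To make the conjecture concrete, here is a minimal host-side C++ sketch of one layout that is consistent with the description above. It is an assumption for illustration, not documented NVIDIA behavior and not necessarily the attached scheme: 16 banks, with the bank chosen from the variable index modulo 2 and the thread index modulo 8. The names `bank_of`, `conflicts`, and `count_conflicts` are hypothetical helpers.

```cpp
#include <cassert>

// Hypothetical model only: the real register bank mapping is not documented here.
constexpr int kNumSPs = 8;      // SPs per multiprocessor, as stated in the post
constexpr int kBanksPerSP = 2;  // assumption: one bank per operand of a 2-operand instruction

// Bank that would hold register `var` of thread `thread` under the assumed layout.
constexpr int bank_of(int var, int thread) {
    return (var % kBanksPerSP) * kNumSPs + (thread % kNumSPs);
}

// True when two registers of the same thread map to the same bank,
// meaning they could not be read in parallel in this model.
constexpr bool conflicts(int var_a, int var_b, int thread) {
    return bank_of(var_a, thread) == bank_of(var_b, thread);
}

// Number of threads in [0, num_threads) for which the two registers collide.
inline int count_conflicts(int var_a, int var_b, int num_threads) {
    assert(num_threads >= 0);
    int collisions = 0;
    for (int t = 0; t < num_threads; ++t) {
        if (conflicts(var_a, var_b, t)) {
            ++collisions;
        }
    }
    return collisions;
}
```

In this model, `conflicts(0, 1, 0)` is false, so r0 and r1 can be read together, while `conflicts(0, 2, 0)` is true because r0 and r2 of thread 0 both land in bank 0. Whether the hardware behaves this way would still have to be checked with a timing experiment on the GPU.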