Why do you need to have a block size of at least 64 threads? Why isn’t 32 with lots of concurrent blocks on the same multiprocessor equally good?
For example, the Tesla C1060 has compute capability 1.3:
- The maximum number of active threads per multiprocessor is 1024.
- The maximum number of active blocks per multiprocessor is 8.
So even if you choose a block size of 64, you cannot have 1024/64 = 16 active blocks;
you get at most 8 active blocks per multiprocessor.
The reason given by the Programming Guide is that successive instructions can have register memory bank conflicts. It’s hard to know exactly what that means without more detail on the multiprocessor structure. We can infer some things based on that statement though:
- The registers are organized in a manner similar to the shared memory, just bigger (8192 or 16384 32-bit registers rather than 4096 shared memory words). This makes sense, since there would be no other sensible way to address that many registers.
- Like shared memory, the register file is organized in banks, and if two words need to be read at the same time from the same bank, a delay is incurred. We could imagine that registers are ordered such that consecutive register locations are striped across banks, just as shared memory locations are.
- To minimize bank conflicts, consecutive locations in the register file would be assigned to the same “logical register” in consecutively numbered threads. Then the register fetch stages for a given instruction in the pipeline would naturally avoid bank conflicts.
- A block size of 64 would mean that there would generally be 8 pipeline stages between one instruction and the next. I don’t know if the 64-thread requirement is intended to space out requests to avoid conflicts, or if that number has something to do with the number of banks in the register file.
So that doesn’t answer your question, but hopefully it gives you some idea of why a block of 64 threads could behave differently from two blocks of 32.
Even if your block size is 32, the effective number of registers used by your kernel is 64*registers_in_cubin…
So, effectively, for kernels with a lot of registers, you will be losing if you use 32 as your block size…
Check the CUDA occupancy calculator for proof.
(I think it was either Seibert or SPWorly who found this out… or maybe my memory is wrong.)
Thanks for your answers!
I have benchmarked a few kernels that actually run faster at a block size of 32 than at 64. So don’t assume that 64 is always better.
@MisterAnderson42: It would be nice to see those kernels. Is it possible for you to share them with us?
Thanks!
Sure. One of them is here: https://codeblue.umich.edu/hoomd-blue/trac/browser/trunk/src/cuda/LJForceGPU.cu . On GTX 280/285 / Tesla 1000 series, it consistently benchmarks faster with a block size of 32.
Do you have any idea why? The cost of __syncthreads(), maybe? (Is __syncthreads() ignored/costless in 32-thread blocks?)
The only __syncthreads() in that kernel is in the initialization to load in a table, so I don’t think so.
The kernel is fully memory-bandwidth bound with a semi-random access pattern. Perhaps running with 32 threads in a block lowers contention for the memory bus? I don’t know, and I don’t really try to find out. A benchmark sweeping block sizes will always find the fastest :)
For that specific semi-random access pattern ;) But in general, yes, benchmarking, benchmarking & benchmarking.