Why do you need to have a block size of at least 64 threads? Why isn’t 32 with lots of concurrent blocks on the same multiprocessor equally good?
For example, the Tesla C1060 has compute capability 1.3:
- The maximum number of active threads per multiprocessor is 1024.
- The maximum number of active blocks per multiprocessor is 8.
So even if you choose a block size of 64, you cannot have 1024/64 = 16 active blocks;
you get at most 8 active blocks per multiprocessor.
The reason given by the Programming Guide is that successive instructions can have register memory bank conflicts. It’s hard to know exactly what that means without more detail on the multiprocessor structure. We can infer some things based on that statement though:
- The registers are organized in a manner similar to the shared memory, just bigger (8192 or 16384 32-bit registers rather than 4096 shared memory words). This makes sense, since there would be no other sensible way to address that many registers.
- Like shared memory, the register file is organized in banks, and if two words need to be read at the same time from the same bank, a delay is incurred. We could imagine that registers are ordered such that consecutive register locations are striped across banks, just as shared memory locations are.
- To minimize bank conflicts, consecutive locations in the register file would be assigned to the same “logical register” in consecutively numbered threads. Then the register fetch stages for a given instruction in the pipeline would naturally avoid bank conflicts.
- A block size of 64 would mean that there would generally be 8 pipeline stages between one instruction and the next. I don’t know if the 64-thread requirement is intended to space out requests to avoid conflicts, or if that number has something to do with the number of banks in the register file.
So that doesn’t answer your question, but hopefully it gives you some idea of why a block of 64 threads could behave differently from two blocks of 32.
Even if your block size is 32, the effective number of registers used by your kernel is 64*registers_in_cubin…
So, effectively, for kernels with a lot of registers, you will be losing if you use 32 as your block size…
Check the CUDA occupancy calculator for proof.
(I think it was either Seibert or SPWorly who found this out… or maybe my memory is wrong.)
Thanks for your answers!
I have benchmarked a few kernels that actually run faster at a block size of 32 than at 64. So don’t assume that 64 is always better.
@MisterAnderson42: It would be nice to see those kernels. Is it possible for you to share them with us?
Thanks!
Sure. One of them is here: https://codeblue.umich.edu/hoomd-blue/trac/browser/trunk/src/cuda/LJForceGPU.cu . On GTX 280/285 / Tesla 1000 series, it consistently benchmarks faster with a block size of 32.
Do you have any idea why? The cost of __syncthreads(), maybe? (Is __syncthreads() ignored/costless in 32-thread blocks?)
The only __syncthreads() in that kernel is in the initialization to load in a table, so I don’t think so.
The kernel is fully memory-bandwidth bound with a semi-random access pattern. Perhaps running with 32 threads in a block lowers contention for the memory bus? I don’t know, and I don’t really try to find out. A benchmark sweeping block sizes will always find the fastest :)
For that specific semi-random access pattern ;) But in general, yes, benchmarking, benchmarking & benchmarking.