I am doing some processing on a large chunk of data that cannot fit into shared memory if I try to process it all with a single kernel call; however, the data can be broken down into smaller pieces for independent processing. I have determined the maximum number of data elements that can fit into shared memory per multiprocessor.
I am wondering if I would get better performance by using a larger number of blocks per multiprocessor with fewer threads each (8 blocks with 32 threads each) or a smaller number of blocks per multiprocessor with more threads each (1 block with 256 threads). Or does the answer lie somewhere in the middle?
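For concreteness, here is a sketch of the two launch configurations I am comparing. The kernel name and the per-block loading pattern are made up for illustration; assume elemsPerBlock elements fit in shared memory per block:

```cuda
// Hypothetical kernel: each block stages its chunk into shared memory,
// processes it, and writes it back. elemsPerBlock is the element count
// I determined fits in shared memory.
__global__ void processChunk(const float *in, float *out, int elemsPerBlock)
{
    extern __shared__ float tile[];  // dynamic shared memory

    // Each thread strides over the block's chunk of elements.
    int base = blockIdx.x * elemsPerBlock;
    for (int i = threadIdx.x; i < elemsPerBlock; i += blockDim.x)
        tile[i] = in[base + i];
    __syncthreads();

    // ... process tile[] in shared memory ...

    for (int i = threadIdx.x; i < elemsPerBlock; i += blockDim.x)
        out[base + i] = tile[i];
}

// Option A: many small blocks, 32 threads each
processChunk<<<numChunks, 32, elemsPerBlock * sizeof(float)>>>(d_in, d_out, elemsPerBlock);

// Option B: fewer, larger blocks, 256 threads each
processChunk<<<numChunks, 256, elemsPerBlock * sizeof(float)>>>(d_in, d_out, elemsPerBlock);
```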
Also, do I take any sort of performance hit for making many (>64) independent kernel launches? My feeling is that any launch overhead would be more than made up for by being able to use shared memory instead of global memory (as I have been doing).
Thanks for any help.