Newbie: More threads == much slower? :(

nickpcuda · July 25, 2008, 9:31am

Why does this happen :(

grid size: 36768 block size: 32 exec time: 21 seconds
grid size: 36768 block size: 64 exec time: 16.8 seconds
grid size: 36768 block size: 128 exec time: 17.3 seconds
grid size: 36768 block size: 256 exec time: 20.1 seconds

The last three don’t make sense… more threads make it slower. There is no shared memory, and the kernel being executed is independent of other threads and only accesses constant GPU memory.

The only extra command-line option is --ptxas-options="--maxrregcount 20", which was meant to let me launch more threads, even though they make it slower :( :(

Please help…

_Big_Mac · July 25, 2008, 9:50am

Why does this make no sense? There’s more work to do.

You should compare the effectiveness of various block sizes with the total number of threads constant, so:

grid size 36768, block size 32
grid size 18384, block size 64
grid size 8192, block size 128
grid size 4096, block size 256

Now you get 1176576 threads in total in each test.

E.D_Riedijk · July 25, 2008, 9:54am

maxrregcount forces some registers to spill to (slow) local memory. That is why you see performance deteriorate.

nickpcuda · July 25, 2008, 10:30am

grid size 36768, block size 32 → 11.5 seconds
grid size 18384, block size 64 → 9.5 seconds
grid size 8192, block size 128 → 8.7 seconds
grid size 4096, block size 256 → 10.6 seconds

_Big_Mac · July 25, 2008, 12:14pm

My mistake, it should go like this:

36768 * 32 = 1176576 threads total

1176576 / 64 = 18384
1176576 / 128 = 9192
1176576 / 256 = 4596

That’s how many blocks you should be running. This is more than you have run, so the results may only get worse, though.
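For reference, the arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation; the helper name `grid_size_for` is made up here, and it rounds up so that a block size that does not divide the total evenly still covers every thread.

```python
def grid_size_for(total_threads, block_size):
    """Number of blocks needed to cover total_threads (rounded up)."""
    return (total_threads + block_size - 1) // block_size


# Total thread count from the first run: 36768 blocks of 32 threads.
TOTAL_THREADS = 36768 * 32

if __name__ == "__main__":
    for block_size in (32, 64, 128, 192, 256):
        blocks = grid_size_for(TOTAL_THREADS, block_size)
        print(f"block size {block_size:3d} -> grid size {blocks}")
```

With the total held at 1176576 threads, each block size does the same amount of work, so the timings become directly comparable.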

I’ve fed those parameters to the occupancy calculator. The results are:
for 20 registers and 0 shared memory
32 threads/block = 25%
64 and 128 = 50%
256 threads = 33%
(for G80)

Thus your results are pretty much what’s expected.

You might try 192 threads (and 6128 blocks); this also reports 50% occupancy (which is the maximum occupancy for 20 registers).
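A rough sketch of how numbers like these can be derived when registers are the only limiting resource. The hardware limits below are assumptions about G80 (compute capability 1.0) as I remember them from the occupancy calculator spreadsheet, not something stated in this thread: 8192 registers and 24 warps per multiprocessor, at most 8 resident blocks, registers allocated per block in units of 256, and the warp count rounded up to a multiple of 2 for register allocation. Shared memory is ignored since the kernel uses none.

```python
# Assumed G80 (compute capability 1.0) per-multiprocessor limits.
REGS_PER_SM = 8192
MAX_WARPS_PER_SM = 24        # i.e. 768 threads
MAX_BLOCKS_PER_SM = 8
WARP_SIZE = 32
REG_ALLOC_UNIT = 256         # assumed register allocation unit, per block
WARP_ALLOC_GRANULARITY = 2   # assumed warp rounding for register allocation


def round_up(value, unit):
    """Round value up to the next multiple of unit."""
    return -(-value // unit) * unit


def occupancy(threads_per_block, regs_per_thread):
    """Fraction of the maximum resident warps that can be active."""
    warps = -(-threads_per_block // WARP_SIZE)
    alloc_warps = round_up(warps, WARP_ALLOC_GRANULARITY)
    regs_per_block = round_up(alloc_warps * WARP_SIZE * regs_per_thread,
                              REG_ALLOC_UNIT)
    blocks = min(REGS_PER_SM // regs_per_block,   # register limit
                 MAX_WARPS_PER_SM // warps,       # warp limit
                 MAX_BLOCKS_PER_SM)               # block limit
    return blocks * warps / MAX_WARPS_PER_SM


if __name__ == "__main__":
    for threads in (32, 64, 128, 192, 256):
        print(f"{threads:3d} threads/block: {occupancy(threads, 20):.0%}")
```

Under these assumptions the function gives 25% for 32 threads, 50% for 64, 128 and 192, and 33% for 256, which matches the figures above. At 256 threads a single block needs 5120 registers, so only one block fits per multiprocessor.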
