Newbie: More threads == much slower? :(

Why does this happen :(

grid size: 36768 block size: 32 exec time: 21 seconds
grid size: 36768 block size: 64 exec time: 16.8 seconds
grid size: 36768 block size: 128 exec time: 17.3 seconds
grid size: 36768 block size: 256 exec time: 20.1 seconds

The last three don’t make sense… more threads make it slower. There is no shared memory, and the kernel being executed is independent of the other threads and only accesses constant GPU memory.

Only extra command line option is --ptxas-options="--maxrregcount 20", which was to let me launch more threads, even though they make it slower :( :(

Please help…

Why doesn’t this make sense? There’s more work to do: the grid size is fixed at 36768 blocks, so a larger block size means more total threads (36768 * 256 is eight times as many threads as 36768 * 32).

You should compare the effectiveness of various block sizes while keeping the total number of threads constant, so:

  1. grid size 36768, block size 32
  2. grid size 18384, block size 64
  3. grid size 8192, block size 128
  4. grid size 4096, block size 256

Now you get 1176576 threads in total in each test.
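
For example, a sweep along those lines could look like this; a minimal sketch, where the kernel name myKernel, its arguments, and the loop are placeholders rather than the poster’s actual code:

    // Hold the total thread count fixed and vary only the block size.
    const int totalThreads = 36768 * 32;           // 1176576 threads in every run
    const int blockSizes[] = { 32, 64, 128, 256 };

    for (int i = 0; i < 4; ++i) {
        int block = blockSizes[i];
        int grid  = totalThreads / block;          // these totals divide evenly
        myKernel<<<grid, block>>>(/* args */);     // hypothetical kernel
        cudaThreadSynchronize();                   // wait for the kernel to finish
    }                                              // (cudaDeviceSynchronize on newer toolkits)

This way any timing difference between runs comes from the launch configuration alone, not from a different amount of work.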

Capping registers with maxrregcount makes the compiler spill some values to (slow) local memory. That is why you see performance deteriorate.
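
You can check whether that is happening from the compiler output, using nvcc’s verbose ptxas flag (the numbers in the sample output below are only illustrative):

    nvcc --ptxas-options=-v kernel.cu
    ptxas info    : Used 20 registers, 128+0 bytes lmem

A nonzero lmem figure means some values were spilled from registers into local memory.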

  1. grid size 36768, block size 32 → 11.5 seconds
  2. grid size 18384, block size 64 → 9.5 seconds
  3. grid size 8192, block size 128 → 8.7 seconds
  4. grid size 4096, block size 256 → 10.6 seconds

My mistake, it should go like this:

36768 * 32 = 1176576 threads total

1176576 / 64 = 18384
1176576 / 128 = 9192
1176576 / 256 = 4596

That’s how many blocks you should be running. That’s more than you actually ran, though, so the results may only get worse.
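
If a total ever fails to divide evenly by the block size, the usual trick is to round the block count up and guard the thread index inside the kernel; a sketch with illustrative names:

    // Ceiling division: enough blocks to cover all totalThreads elements.
    int grid = (totalThreads + block - 1) / block;

    __global__ void myKernel(int n /*, ... */) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;                        // surplus threads exit immediately
        // ... per-thread work ...
    }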

I’ve fed those parameters into the occupancy calculator. For 20 registers and 0 bytes of shared memory (on G80), it reports:

  32 threads/block → 25%
  64 or 128 threads/block → 50%
  256 threads/block → 33%

Thus your results are pretty much what’s expected.

You might try 192 threads (and 6128 blocks); this also reports as 50% occupancy, which is the maximum occupancy for 20 registers.
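
For what it’s worth, CUDA toolkits much newer than the G80-era one used here can query occupancy at runtime instead of via the spreadsheet. A sketch, with myKernel again standing in for the real kernel:

    int numBlocks = 0;   // active blocks per multiprocessor at this configuration
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, myKernel, 192, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = (numBlocks * 192.0f) / prop.maxThreadsPerMultiProcessor;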