Dg 16 x Db 8 gives optimal, correct results?! (kernel config)

Hi,

1. I use a <<<16, 8, 8192>>> kernel configuration to get optimal performance. That’s weird, since it breaks all the usual configuration rules.

a typical cuda_profile.log line is like:

the .cubin is

2. CUDA_Occupancy_calculator.xls doesn’t ask for Dg (the number of blocks). Doesn’t Dg influence performance? What are its assumptions?

Thanks!

  1. A dynamic shared memory size of 8192 bytes means that only one block can run at a time on a multiprocessor, because the static smem of 48 * 4 bytes is added on top: 8192 + 192 = 8384 bytes per block, and two such blocks do not fit in the 16 KB of shared memory that a G8x multiprocessor has.
    But sure, if you get optimal performance that way, why not. I think it means that your kernel does a lot of memory accesses but hardly any computation, so there is no computation to hide the memory latency behind (I could be wrong here, though).

  2. Increasing the number of blocks influences the performance up until you have the right number of blocks to keep all the multiprocessors occupied at all times.
    As you can run only one block per multiprocessor with your kernel, your number is fine. It won’t scale up with future GPUs that have more multiprocessors, though.

The # of blocks still matters after the all-multiprocessors-busy point because of load balancing. Let the # of blocks be n and the busy point be B; performance is suboptimal if n % B != 0, unless n/B is big enough.

Indeed, the general tendency seems to be ‘lots of blocks is good’ :) But it’s interesting that this person found that a low number resulted in the best performance in his case.

Yeah :) This person finds there’s still a lot to learn about the G8 architecture :)