Hi all,
I have a strange performance issue I don’t understand:
I have Kernel A, which fits 5 blocks on each multiprocessor.
And I have Kernel B, which fits 8 blocks on each multiprocessor.
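(For reference, a minimal sketch of how such per-SM block counts can be checked at runtime, assuming a CUDA version that has the occupancy API; kernelA/kernelB and the 256-thread block size are placeholders for my real setup:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernels standing in for my real Kernel A and Kernel B.
    __global__ void kernelA(float *data) { }
    __global__ void kernelB(float *data) { }

    int main() {
        int blocksA = 0, blocksB = 0;
        // 256 threads per block and no dynamic shared memory are assumptions.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksA, kernelA, 256, 0);
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksB, kernelB, 256, 0);
        printf("blocks per SM: A=%d B=%d\n", blocksA, blocksB);  // here: 5 and 8
        return 0;
    }
)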
I need to run both on a lot of data. Each block processes the same amount of data, so I can choose an arbitrary granularity when scheduling blocks.
I have 14 MPs.
Scheduling 14x5 blocks for Kernel A -> good performance (fully filled GPU)
Scheduling 14x8 blocks for Kernel B -> good performance (fully filled GPU)
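In code, the launches look roughly like this; it's only a sketch, and kernelA/kernelB, the 256-thread block size, and the float* arguments are placeholders for my real kernels:

    #include <cuda_runtime.h>

    __global__ void kernelA(float *data) { }
    __global__ void kernelB(float *data) { }

    void launchAll(float *dataA, float *dataB) {
        int numSMs = 0;
        cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);  // 14 here

        // One full "wave" of blocks per kernel -> good performance:
        kernelA<<<numSMs * 5, 256>>>(dataA);      // 14 x 5 = 70 blocks
        kernelB<<<numSMs * 8, 256>>>(dataB);      // 14 x 8 = 112 blocks

        // The problematic case: a common multiple of both per-SM counts:
        kernelA<<<numSMs * 5 * 8, 256>>>(dataA);  // 560 blocks, slower per block
        kernelB<<<numSMs * 5 * 8, 256>>>(dataB);  // 560 blocks, slower per block
    }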
Now my intuition said that scheduling a multiple of both 5 and 8 blocks per multiprocessor should also give the best performance,
but scheduling 14x5x8 blocks decreases performance for both kernels.
To select the best number of blocks per kernel launch, I need to understand the reason for this performance breakdown.
I use CUDA events to measure the timings, and I measure only the kernel time, no memcopies.
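The measurement looks roughly like this (a sketch; kernelA, d_in, numBlocks and the 256-thread block size are placeholders):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void kernelA(float *data) { }

    float timeKernel(float *d_in, int numBlocks) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Only the kernel launch sits between the two events,
        // so host<->device copies are excluded from the measurement.
        cudaEventRecord(start);
        kernelA<<<numBlocks, 256>>>(d_in);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%d blocks: %.3f ms total, %.4f ms per block\n",
               numBlocks, ms, ms / numBlocks);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }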
In reality I also have a Kernel C that performs best with 6 blocks per SM, but once the issue is understood for two kernels, I should be able to solve it for three as well.
SHORT VERSION:
If 14 x Y blocks fully exploit the GPU's compute resources, why are 14 x Y x X blocks processed more slowly (measured in time per block), and how can I avoid this?
Thanks
Markus