Understanding performance behavior

Hi all,

I have a strange performance issue I don’t understand:
I have Kernel A which fits 5 blocks in each Multiprocessor.
And I have Kernel B which fits 8 blocks in each Multiprocessor.
I need to run both on a lot of data. Each block processes the same amount of data, so I can choose an arbitrary granularity when scheduling blocks.

I have 14 MPs.
When I schedule 14x5 blocks for Kernel A -> good performance (fully filled GPU)
Scheduling 14x8 blocks Kernel B -> good performance (fully filled GPU)

Now my intuition said: scheduling a common multiple of 5 and 8 blocks per multiprocessor should also lead to the best performance for both kernels.
But scheduling 14x5x8 blocks decreases performance for both kernels.
To select the best number of blocks to schedule per kernel launch, I need to understand the reason for this performance drop.
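
To make the block counts above concrete, here is a rough sketch of the arithmetic in plain C++ (host side only, nothing CUDA-specific). The helper names and the idea of counting "waves" (one wave = all SMs filled once) are assumptions for illustration; the real block scheduler is not guaranteed to hand out blocks in such uniform waves.

```cpp
#include <cassert>

// Hypothetical helper: greatest common divisor (Euclid's algorithm).
inline int gcdInt(int a, int b) {
    while (b != 0) {
        const int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

// Hypothetical helper: blocks resident at once if every SM holds blocksPerSM blocks.
inline int blocksPerWave(int numSMs, int blocksPerSM) {
    assert(numSMs > 0 && blocksPerSM > 0);
    return numSMs * blocksPerSM;
}

// Grid size that is a whole number of waves for both kernels:
// numSMs * lcm(blocksPerSM_A, blocksPerSM_B).
inline int commonGridSize(int numSMs, int blocksPerSM_A, int blocksPerSM_B) {
    assert(numSMs > 0 && blocksPerSM_A > 0 && blocksPerSM_B > 0);
    const int lcm = blocksPerSM_A / gcdInt(blocksPerSM_A, blocksPerSM_B) * blocksPerSM_B;
    return numSMs * lcm;
}

// Number of waves (rounded up) needed to run gridSize blocks.
inline int numWaves(int gridSize, int numSMs, int blocksPerSM) {
    const int wave = blocksPerWave(numSMs, blocksPerSM);
    return (gridSize + wave - 1) / wave;
}
```

With 14 SMs, `commonGridSize(14, 5, 8)` gives 560 blocks, which under this idealized model would be 8 full waves for Kernel A and 5 full waves for Kernel B.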

I use events to measure the timings, and I time only the kernel execution, not the memcopies.
In reality I also have Kernel C, which needs 6 blocks per SM for best performance, but once the issue is understood for 2 kernels, I should be able to solve it for 3 kernels as well.

SHORT VERSION:
If 14 x Y blocks fully exploit the GPU’s computing resources, why are 14 x Y x X blocks processed slower and how can I avoid this? (measured in time per block)
Thanks
Markus

“If 14 x Y blocks fully exploit the GPU’s computing resources, why are 14 x Y x X blocks processed slower and how can I avoid this? (measured in time per block)”

Slower than what?

Hi ymc,
Thanks for reading. Sorry, it seems I was a bit unclear here.
The figure I care about is of course “time per data”. In each block of either kernel, the same amount of data is processed.
So I only need to look at “time per block”, defined as “time to finish the kernel” / “number of blocks scheduled for this kernel”.
The “time per block” increases for 14 x X x Y blocks compared to 14 x X blocks, and I do not understand why.
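
As a small sketch of the metric defined above, in plain C++: the function names are made up for illustration, and the kernel times are assumed to be in milliseconds, as event timers typically report them.

```cpp
#include <cassert>

// "Time per block" as defined above: time to finish the kernel
// divided by the number of blocks scheduled for that launch.
inline double timePerBlock(double kernelTimeMs, int numBlocks) {
    assert(numBlocks > 0);
    return kernelTimeMs / numBlocks;
}

// Hypothetical comparison helper: a result above 1.0 means the larger
// launch costs more time per block than the smaller one.
inline double slowdown(double smallTimeMs, int smallBlocks,
                       double largeTimeMs, int largeBlocks) {
    return timePerBlock(largeTimeMs, largeBlocks) /
           timePerBlock(smallTimeMs, smallBlocks);
}
```

Calling `slowdown` with the measured time of the 14 x X launch and of the 14 x X x Y launch would give the factor by which the per-block cost grows.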
Markus

Are kernels A and B launched in different streams or in the same stream?

Which GPU are you using?

Hi,
Single stream, and the GPU is a Tesla C2070.
Markus