Global memory read interference?

I’m solving a task on a GPU with 27 multiprocessors (MPs).
When the task size is 1, it is run on 1 MP with running time T.
When the task size is 5, it is run on 5 MPs with running time 1.1T.
When the task size is 15, it is run on 15 MPs with running time 1.8T.
What can cause this behavior? Assume that no two threads try to write the same memory location.

Two things are going on here.

Firstly, the program only finishes when its slowest block finishes, so it runs at the speed of the slowest block. For this reason a run with many blocks is usually slower than a run with one block.
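
As a rough illustration of the first point, here is a tiny host-side model in Python (not GPU code). The per-block times are made-up numbers in units of T, and kernel_time is a hypothetical helper, not part of any CUDA API.

```python
def kernel_time(block_times):
    """Model: the launch is done only when its slowest block is done."""
    return max(block_times)

# Hypothetical per-block running times, in units of T.
one_block = [1.0]
five_blocks = [1.0, 0.95, 1.1, 1.02, 0.98]

print(kernel_time(one_block))    # 1.0
print(kernel_time(five_blocks))  # 1.1
```

The more blocks there are, the more likely it is that at least one of them is slowed down by something, and that one sets the total time.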

Secondly, there is partition camping. There isn’t any official documentation on this, but there is a white paper floating about, and a presentation on it in the SDK. It is basically like shared memory bank conflicts, but in global memory, and across all active blocks rather than within just a half warp.
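
To make partition camping a bit more concrete, here is a rough host-side model in Python (again, not GPU code). It assumes global memory is split into 8 partitions of 256 bytes each, interleaved by address; those two numbers are assumptions for illustration, so check the actual values for your hardware. The helper names are hypothetical.

```python
from collections import Counter

# Assumed layout, for illustration only.
NUM_PARTITIONS = 8
PARTITION_WIDTH = 256  # bytes

def partition_of(byte_address):
    """Partition that serves a given global-memory byte address."""
    return (byte_address // PARTITION_WIDTH) % NUM_PARTITIONS

def partition_load(num_blocks, stride_bytes):
    """Count how many blocks start reading in each partition,
    when block b starts at byte offset b * stride_bytes."""
    return Counter(partition_of(b * stride_bytes) for b in range(num_blocks))

spread = partition_load(15, 256)    # consecutive blocks hit consecutive partitions
camped = partition_load(15, 2048)   # 2048 = 8 * 256, so every block hits partition 0

print(max(spread.values()))  # 2
print(max(camped.values()))  # 15
```

In the second case all 15 blocks queue up on one partition while the other seven sit idle, which is the kind of slowdown that only shows up once several blocks are active at the same time.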

Thanks for the fast reply. It seems to me that you are right. I need to do some research now :)