The question is: can one experience lower latency when reading from global memory if fewer processors are active? If only one processor is active, can it get its data sooner?
I’m not sure if this is documented somewhere, but to me that sounds like something only experience can tell.
I don’t know whether the latency comes from arbitration between multiple processors (i.e. one processor gets access to the memory only after every other processor is done) or from the long set-up times of the DDRx memory.
I don’t think the memory latency varies based on the number of active processors. I’ve certainly never seen any evidence to support that claim, at least.
Mostly I’m going to say “no”, since the minimum amount of time that it’ll take for you to get back a result won’t change. So if you’re running one warp per multiprocessor (and are already at that minimum), you’re not going to get better performance by running one warp per GPU.
But if you’re running more warps and are hitting your bandwidth limits (and those could in fact be pretty low if you’re issuing uncoalesced accesses) then performance will improve by running fewer multiprocessors. (And maybe it makes more sense for you to run one big block rather than many little blocks.)
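If you want to check this on your own card rather than take anyone's word for it, a pointer-chasing microbenchmark is one way to do it. The sketch below is only an illustration, not a calibrated tool: every name and constant in it (`chase`, `kStride`, `kSteps`, `kChainGap`, `kMaxBlocks`) is made up for this example. One thread per block walks a chain of dependent loads and times it with `clock64()`, first with a single block and then with one block per multiprocessor, up to 32. It assumes the block scheduler places those blocks on different multiprocessors, which is typical but not guaranteed. On parts with caches, it also assumes the 128-byte spacing logic keeps every load a cold miss, and the figure will include TLB effects.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// All constants below are assumptions chosen for illustration.
constexpr unsigned int kStride = 4099;  // elements between dependent loads
constexpr int kSteps = 4096;            // dependent loads per chain
constexpr unsigned int kChainGap = 64;  // elements between chain starts (256 bytes)
constexpr int kMaxBlocks = 32;          // upper bound on concurrently timed blocks

// Thread 0 of each block follows its own chain. Each load's address depends
// on the previous load's result, so the loads cannot overlap.
__global__ void chase(const unsigned int *next, int firstChain,
                      long long *cycles, unsigned int *sink)
{
    if (threadIdx.x != 0) return;
    unsigned int idx = (firstChain + blockIdx.x) * kChainGap;
    long long start = clock64();
    for (int i = 0; i < kSteps; ++i)
        idx = next[idx];
    long long stop = clock64();
    cycles[blockIdx.x] = stop - start;
    sink[blockIdx.x] = idx;  // keep the loads from being optimized away
}

// Launches `blocks` blocks and returns the mean cycles per load over all blocks.
static double run(const unsigned int *d_next, int blocks, int firstChain,
                  long long *d_cycles, unsigned int *d_sink)
{
    chase<<<blocks, 32>>>(d_next, firstChain, d_cycles, d_sink);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());
    std::vector<long long> h_cycles(blocks);
    CHECK(cudaMemcpy(h_cycles.data(), d_cycles, blocks * sizeof(long long),
                     cudaMemcpyDeviceToHost));
    double total = 0.0;
    for (long long c : h_cycles) total += static_cast<double>(c);
    return total / blocks / kSteps;
}

int main()
{
    // Table of "next index" values: element i points kStride elements ahead.
    const size_t n = static_cast<size_t>(kSteps) * kStride
                   + static_cast<size_t>(kChainGap) * (kMaxBlocks + 1);
    std::vector<unsigned int> h_next(n);
    for (size_t i = 0; i < n; ++i)
        h_next[i] = static_cast<unsigned int>((i + kStride) % n);

    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    const int blocks = prop.multiProcessorCount < kMaxBlocks
                     ? prop.multiProcessorCount : kMaxBlocks;

    unsigned int *d_next = nullptr, *d_sink = nullptr;
    long long *d_cycles = nullptr;
    CHECK(cudaMalloc(&d_next, n * sizeof(unsigned int)));
    CHECK(cudaMalloc(&d_sink, kMaxBlocks * sizeof(unsigned int)));
    CHECK(cudaMalloc(&d_cycles, kMaxBlocks * sizeof(long long)));
    CHECK(cudaMemcpy(d_next, h_next.data(), n * sizeof(unsigned int),
                     cudaMemcpyHostToDevice));

    // Chain 0 is used alone; chains 1..blocks are used together, so the
    // second launch never revisits a line the first launch touched.
    const double alone = run(d_next, 1, 0, d_cycles, d_sink);
    const double together = run(d_next, blocks, 1, d_cycles, d_sink);

    printf("1 block:   %.1f cycles per dependent load\n", alone);
    printf("%d blocks: %.1f cycles per dependent load\n", blocks, together);

    CHECK(cudaFree(d_next));
    CHECK(cudaFree(d_sink));
    CHECK(cudaFree(d_cycles));
    return 0;
}
```

Build it with something like `nvcc -O2 chase.cu -o chase`. If the two numbers come out close, that supports the view that the unloaded latency is a fixed floor. To probe the bandwidth-limited case, you would need far more loads in flight (more threads per block, more blocks) than this sketch issues. On a GPU whose L2 cache is larger than the roughly 67 MB table, the host-to-device copy may leave part of the table cached, so treat the result with care there.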