Are 2 blocks not enough to utilize 2 MPs? How many blocks should there be?

Sorry for a newbie question; I'm trying to lay out CUDA performance behaviour in my head.

I have an 8500GT with only 2 MPs (a good choice for getting CUDA scheduling straight, I suppose).

I've written a simple kernel that computes the sum of an integer array using several blocks.

The problem: a call with <<<2, 128>>> runs about 4 times slower than <<<12, 128>>>.

Yes, I know about the limit of 24 warps per MP. But that shouldn't mean a CUDA application MUST put 768 threads on every MP to reach maximum performance, should it? That seems odd, because the 128 threads on an MP get split into 4 warps anyway, and those warps are executed serially.

Can anyone comment on this?

The kernel follows.

template<int GridSize, int BlockSize>
__global__ void cudaSum(int* data, int size, int* blockResults)
{
    int s = 0;
    int i = blockIdx.x * BlockSize * 2 + threadIdx.x;

    // Grid-stride loop: each thread accumulates two elements per iteration.
    for (; i < size - BlockSize; i += GridSize * BlockSize * 2)
        s += data[i] + data[i + BlockSize];

    // Pick up a possible trailing element.
    if (i < size)
        s += data[i];

    // In-block reduction: accumulate every thread's partial sum into thread 0.
    cuda::threads::sumToThread<BlockSize>(s, 0);

    if (threadIdx.x == 0)
        blockResults[blockIdx.x] = s;
}
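
For completeness, a minimal host-side harness along these lines reproduces the comparison. This is my own sketch, not the original test code: the array size, fill values and helper names are made up, and it assumes the cudaSum kernel above (together with the author's cuda::threads::sumToThread helper) is available in the same translation unit.

// Timing harness sketch (hypothetical): compares <<<2, 128>>> and <<<12, 128>>>
// using CUDA events. Requires the cudaSum kernel and its sumToThread helper above.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

template<int GridSize, int BlockSize>
float timeSum(int* dData, int size, int* dBlockResults)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaSum<GridSize, BlockSize><<<GridSize, BlockSize>>>(dData, size, dBlockResults);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int size = 1 << 22;                     // 4M integers, arbitrary test size
    std::vector<int> hData(size, 1);              // fill with ones, so the expected sum is 'size'

    int *dData = 0, *dBlockResults = 0;
    cudaMalloc(&dData, size * sizeof(int));
    cudaMalloc(&dBlockResults, 12 * sizeof(int)); // enough slots for the largest grid tested
    cudaMemcpy(dData, hData.data(), size * sizeof(int), cudaMemcpyHostToDevice);

    printf("<<<2, 128>>>  : %.3f ms\n", timeSum<2, 128>(dData, size, dBlockResults));
    printf("<<<12, 128>>> : %.3f ms\n", timeSum<12, 128>(dData, size, dBlockResults));

    // The per-block partial sums in dBlockResults still have to be added up
    // on the host (or by a second, tiny kernel) to get the final total.
    cudaFree(dData);
    cudaFree(dBlockResults);
    return 0;
}

In a real measurement I would also add a warm-up launch and average several runs, since the first launch carries one-time initialization overhead.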

Performance timing in milliseconds is attached.
cudaSum.perf.txt (2 KB)

Seems I've finally figured it out.

The problem is memory latency, which can be hidden when the MP is rotating through 24 warps, even though each individual warp then issues instructions at a lower rate.

With fewer warps, each individual warp could in principle be scheduled more often, but memory latency limits how soon a stalled warp becomes ready to run again. The MP ends up waiting, the arithmetic rate drops, and the total time for the sum goes up.
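
To put rough numbers on it (a back-of-the-envelope estimate of my own, using the ~400-600 cycle global memory latency and the 4 cycles an MP needs to issue one instruction for a whole warp, as given in the programming guide for this generation of hardware): if $W$ is the number of resident warps per MP and $I$ is the number of independent instructions each warp can issue before stalling on its next load, the latency $L$ is hidden roughly when

$$ W \cdot I \;\gtrsim\; \frac{L}{4} \;\approx\; \frac{400\ \text{to}\ 600\ \text{cycles}}{4\ \text{cycles per warp instruction}} \;=\; 100\ \text{to}\ 150 . $$

With 24 resident warps, 4 to 6 independent instructions per warp are enough; with one 128-thread block per MP ($W = 4$), each warp would need 25 or more independent instructions between loads, which a plain load-and-add loop simply doesn't have, so the MP sits idle.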

I've just verified that the same experiment with another kernel, one that operates only on local variables, fully utilizes the 2 MPs with only 2 blocks (a rough sketch of such a kernel is below).
Case closed.
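
For reference, a compute-only test kernel along these lines illustrates what I mean (a minimal sketch, not my exact code; the name, the recurrence and the output layout are arbitrary): it does all of its work in registers, so there is no memory latency to hide.

// Compute-only kernel sketch (hypothetical): the loop touches only registers,
// with a single global write per thread at the end so the work is not optimized away.
__global__ void cudaBusyLoop(int iterations, int* out)
{
    int s = threadIdx.x;
    for (int i = 0; i < iterations; ++i)
        s = s * 3 + 1;          // pure arithmetic, no global memory traffic

    out[blockIdx.x * blockDim.x + threadIdx.x] = s;
}

Since there are no loads inside the loop, a warp is always ready to issue its next instruction, so even 4 warps per MP (one 128-thread block each) keep the arithmetic pipeline full.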