Are 2 blocks not enough to utilize 2 MPs? How many blocks should there be?

Sorry for a newbie question; I'm trying to lay out CUDA performance behaviour in my head.

I have an 8500GT with only 2 MPs (a good choice for getting CUDA scheduling straight, I suppose).

I've written a simple kernel that computes the sum of an integer array using several blocks.

The problem: a call with <<<2, 128>>> runs about 4 times slower than <<<12, 128>>>.

Yes, I know about the limit of 24 warps per MP. But that shouldn't mean a CUDA application MUST put 768 threads on every MP to reach maximum performance, should it? That seems odd, because the 128 threads on an MP get split into 4 warps anyway, and those warps are executed serially.

Can anyone comment on this?

The kernel follows.

template<int GridSize, int BlockSize>
__global__ void cudaSum(int* data, int size, int* blockResults)
{
    int s = 0;
    int i = blockIdx.x * BlockSize * 2 + threadIdx.x;

    // Grid-stride loop: each thread accumulates two elements per iteration.
    for (; i < size - BlockSize; i += GridSize * BlockSize * 2)
        s += data[i] + data[i + BlockSize];

    // Pick up a possible trailing element.
    if (i < size)
        s += data[i];

    // In-block reduction: accumulate every thread's partial sum into thread 0.
    cuda::threads::sumToThread<BlockSize>(s, 0);

    if (threadIdx.x == 0)
        blockResults[blockIdx.x] = s;
}
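
For completeness, a minimal host-side harness along these lines reproduces the comparison. This is my own sketch, not the original test code: the array size, fill values and helper names are made up, and it assumes the cudaSum kernel above (together with the author's cuda::threads::sumToThread helper) is available in the same translation unit.

// Timing harness sketch (hypothetical): compares <<<2, 128>>> and <<<12, 128>>>
// using CUDA events. Requires the cudaSum kernel and its sumToThread helper above.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

template<int GridSize, int BlockSize>
float timeSum(int* dData, int size, int* dBlockResults)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaSum<GridSize, BlockSize><<<GridSize, BlockSize>>>(dData, size, dBlockResults);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int size = 1 << 22;                     // 4M integers, arbitrary test size
    std::vector<int> hData(size, 1);              // fill with ones, so the expected sum is 'size'

    int *dData = 0, *dBlockResults = 0;
    cudaMalloc(&dData, size * sizeof(int));
    cudaMalloc(&dBlockResults, 12 * sizeof(int)); // enough slots for the largest grid tested
    cudaMemcpy(dData, hData.data(), size * sizeof(int), cudaMemcpyHostToDevice);

    printf("<<<2, 128>>>  : %.3f ms\n", timeSum<2, 128>(dData, size, dBlockResults));
    printf("<<<12, 128>>> : %.3f ms\n", timeSum<12, 128>(dData, size, dBlockResults));

    // The per-block partial sums in dBlockResults still have to be added up
    // on the host (or by a second, tiny kernel) to get the final total.
    cudaFree(dData);
    cudaFree(dBlockResults);
    return 0;
}

In a real measurement I would also add a warm-up launch and average several runs, since the first launch carries one-time initialization overhead.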

Performance timing in milliseconds is attached.
cudaSum.perf.txt (2 KB)

Seems I've finally figured it out.

The problem is memory latency, which can be hidden when the MP is rotating through 24 warps, even though each individual warp then issues instructions at a lower rate.

With fewer warps, each individual warp could in principle be scheduled more often, but memory latency limits how soon a stalled warp becomes ready to run again. The MP ends up waiting, the arithmetic rate drops, and the total time for the sum goes up.
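
To put rough numbers on it (a back-of-the-envelope estimate of my own, using the ~400-600 cycle global memory latency and the 4 cycles an MP needs to issue one instruction for a whole warp, as given in the programming guide for this generation of hardware): if $W$ is the number of resident warps per MP and $I$ is the number of independent instructions each warp can issue before stalling on its next load, the latency $L$ is hidden roughly when

$$ W \cdot I \;\gtrsim\; \frac{L}{4} \;\approx\; \frac{400\ \text{to}\ 600\ \text{cycles}}{4\ \text{cycles per warp instruction}} \;=\; 100\ \text{to}\ 150 . $$

With 24 resident warps, 4 to 6 independent instructions per warp are enough; with one 128-thread block per MP ($W = 4$), each warp would need 25 or more independent instructions between loads, which a plain load-and-add loop simply doesn't have, so the MP sits idle.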

I've just verified that the same experiment with another kernel, one that operates only on local variables, fully utilizes the 2 MPs with only 2 blocks (a rough sketch of such a kernel is below).
Case closed.
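
For reference, a compute-only test kernel along these lines illustrates what I mean (a minimal sketch, not my exact code; the name, the recurrence and the output layout are arbitrary): it does all of its work in registers, so there is no memory latency to hide.

// Compute-only kernel sketch (hypothetical): the loop touches only registers,
// with a single global write per thread at the end so the work is not optimized away.
__global__ void cudaBusyLoop(int iterations, int* out)
{
    int s = threadIdx.x;
    for (int i = 0; i < iterations; ++i)
        s = s * 3 + 1;          // pure arithmetic, no global memory traffic

    out[blockIdx.x * blockDim.x + threadIdx.x] = s;
}

Since there are no loads inside the loop, a warp is always ready to issue its next instruction, so even 4 warps per MP (one 128-thread block each) keep the arithmetic pipeline full.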