16x16 VS 32x8 large difference -> bug?


I’m seeing some strange results when executing a simple kernel with 16x16 vs 32x8 block dimensions.

I’m using an 8800GT, which coalesces per half-warp, so I would expect 16x16 to perform identically to 32x8.

This is the kernel I’m running:

__global__ void ExampleProductKernel2D(float *input1, float *input2, float *output)
{
	// Global 2D coordinates of this thread
	const unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
	const unsigned int j = blockDim.y * blockIdx.y + threadIdx.y;

	// Row-major linear index into the images
	const unsigned int width = gridDim.x * blockDim.x;
	const unsigned int index = i + j * width;

	output[index] = input1[index] * input2[index];
}


These are the execution times for different image sizes (measured with Nsight):

	          1024x1024   2048x2048   4096x4096   6144x6144
	16x16       254.8      1004.9      6222.3     14560
	32x8        254.2      1007.7      4020.3      9122.4

Why would/could there be such a large difference between the 16x16 and 32x8 blocksizes with the larger resolutions?

Partition camping?

For reference, see the matrix transpose documentation in the SDK.

Yes, that is probably the problem.

Thank you :thumbup:

The CUDA scheduler looks at a 2-dimensional block as “Y” rows of X threads…
So 16x16 is seen as 16 rows of 16 threads,
and 32x8 as 8 rows of 32 threads.

If the scheduler schedules in terms of rows, 32x8 fits neatly into the warp scheme…
16x16 does not…

I am just guessing from previous experience… NVIDIA would be the best to answer this.

I don’t think this has anything to do with partition camping. The effective pressure on the memory controllers would be the same. Partition camping hits you when you access columns in a row-major ordered structure, or vice versa.

According to the NVIDIA Programming Guide, the dimensions of the block do not matter; only its size does. So there should be no difference between 32x8 and 16x16. If the warps in the 16x16 case were only half full, there would already be a performance loss at the lower resolutions, which does not occur.

Furthermore, I do think this kernel has a problem with partition camping.

On an 8800GT there are 6 partitions (256 bytes wide) and 12 multiprocessors. Blocks are scheduled in row-major order.

With 256 threads per block, 3 blocks fit on each multiprocessor, so 36 blocks run concurrently across the 12 multiprocessors.

With 16x16 blocks: the first 24 blocks will read from the partitions 1-6. The last 12 will read from 1-3.

With 32x8 blocks: the first 12 blocks will read from partitions 1-6, the next 12 too and the next 12 also.

With 16x16 blocks, more reads are always concentrated on a subset of the partitions. I think that is what slows the kernel down at larger resolutions.

EDIT: I tested with 16x8 blocks and now get the same result as in the 32x8 case. The number of blocks per multiprocessor doubles, which spreads the reads across the partitions more evenly. This supports the conclusion that partition camping is to blame here.

Ah… Right… I missed the “j*width” calculation… Good find, thanks for posting back.