Best performance with strange settings

Hi,

I have a really simple kernel that performs best with 64 threads / block, which I don’t really understand. But first of all, my kernel:

__global__ void multKernel(short2 *data1, short2 *data2, int2 *result)
{
	// One thread per element of a 1463-element row; blockIdx.y selects the row (and the offset into data2).
	unsigned int position = (blockIdx.x * blockDim.x) + threadIdx.x;
	if (position < 1463)
	{
		short2 data1_sample = data1[position];
		short2 data2_sample = data2[position + blockIdx.y];
		// Complex multiply: .x holds the real part, .y the imaginary part.
		result[(blockIdx.y * 1463) + position].x = (data2_sample.x * data1_sample.x) - (data2_sample.y * data1_sample.y);
		result[(blockIdx.y * 1463) + position].y = (data2_sample.x * data1_sample.y) + (data2_sample.y * data1_sample.x);
	}
}

Execution configuration is:

blockDim.x = 64 // this one can be changed
blockDim.y = 1
blockDim.z = 1
gridDim.x = ceil(1463.0 / blockDim.x) // = 23 for blockDim.x = 64
gridDim.y = 60
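For reference, the grid-size arithmetic behind this configuration can be sketched in plain host-side C++. This is only a rough illustration: `gridFor`, `totalBlocks`, and `idleThreadsPerRow` are hypothetical helper names, not part of the original code, and the constants are taken from the kernel's bounds check and the configuration above.

```cpp
// Host-side arithmetic only; all helper names here are illustrative assumptions.
const unsigned int N = 1463;   // elements per row, from the kernel's bounds check
const unsigned int ROWS = 60;  // gridDim.y from the configuration above

// Integer ceiling division: number of blocks needed to cover n elements.
unsigned int gridFor(unsigned int n, unsigned int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Total number of thread blocks in the 2D grid.
unsigned int totalBlocks(unsigned int threadsPerBlock) {
    return gridFor(N, threadsPerBlock) * ROWS;
}

// Threads per row that are launched but fail the (position < 1463) check.
unsigned int idleThreadsPerRow(unsigned int threadsPerBlock) {
    return gridFor(N, threadsPerBlock) * threadsPerBlock - N;
}
```

For blockDim.x = 64 this gives 23 blocks per row and 1380 blocks in total, with 9 idle threads per row. For 128, 256, or 512 threads / block, 73 threads per row fail the bounds check.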

As you can see, the only thing that varies is blockDim.x and therefore gridDim.x. The kernel has just 2 global loads and 2 global stores. Peak performance is reached with blockDim.x = 64 on a GTX 285 => 1380 thread blocks. The occupancy is only 50 % at that setting and gets higher with more threads / block. I am still using CUDA 2.1…

Does anyone have a hint why the kernel performs best with just 64 threads / block and worse with more, although the occupancy is much better?

Thanks in advance.

Higher occupancy does not necessarily mean higher performance… although it can help hide latency.

But it all boils down to what the bottleneck in your kernel is.

I am aware of that, but I thought the bottleneck of my kernel was bandwidth, so higher occupancy would increase the performance.

Am I wrong in that assumption?

Occupancy can hide latency, by allowing some warps to run while others are waiting for memory. But if they are all stalled on memory, then once you hit that saturation point, higher occupancy won’t improve bandwidth. It just means more warps are waiting.

OK, thanks. Texture memory will probably speed up the kernel…