Hi,
I have a really simple kernel that performs best with 64 threads per block, and I don't really understand why. But first of all, here is my kernel:
__global__
void multKernel(short2 *data1, short2 *data2, int2 *result)
{
    // One thread per sample; blockIdx.y selects the offset into data2.
    unsigned int position = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (position < 1463)
    {
        short2 data1_sample = data1[position];
        short2 data2_sample = data2[position + blockIdx.y];
        // Complex multiply: .x is the real part, .y the imaginary part.
        result[(blockIdx.y * 1463) + position].x = (data2_sample.x * data1_sample.x) - (data2_sample.y * data1_sample.y);
        result[(blockIdx.y * 1463) + position].y = (data2_sample.x * data1_sample.y) + (data2_sample.y * data1_sample.x);
    }
}
Execution configuration is:
blockDim.x = 64 // this one can be changed
blockDim.y = 1
blockDim.z = 1
gridDim.x = ceil(1463.0 / blockDim.x) // 23 for blockDim.x = 64
gridDim.y = 60
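To make the grid arithmetic concrete, here is a minimal host-side sketch in plain C++ (it would compile unchanged in a .cu file). The helper names ceilDiv, totalBlocks and idleThreadsPerRow are made up for illustration and are not part of my actual code; 1463 and 60 are the values from the configuration above.

```cpp
// Number of samples per row (the bound checked in the kernel) and gridDim.y.
constexpr unsigned int kNumSamples = 1463;
constexpr unsigned int kGridDimY = 60;

// Integer ceiling division: the smallest number of blocks of size
// `block` that covers `n` elements, without going through floating point.
constexpr unsigned int ceilDiv(unsigned int n, unsigned int block) {
    return (n + block - 1) / block;
}

// Total thread blocks launched for a given blockDim.x.
constexpr unsigned int totalBlocks(unsigned int blockDimX) {
    return ceilDiv(kNumSamples, blockDimX) * kGridDimY;
}

// Threads per grid row that fail the `position < 1463` check
// (they all sit in the last block of the row).
constexpr unsigned int idleThreadsPerRow(unsigned int blockDimX) {
    return ceilDiv(kNumSamples, blockDimX) * blockDimX - kNumSamples;
}
```

With blockDim.x = 64 this gives 23 blocks per row and 1380 blocks in total, with 9 idle threads per row; with blockDim.x = 128 it gives 12 blocks per row, 720 in total, and 73 idle threads per row.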
As you can see, the only thing that varies is blockDim.x, and with it gridDim.x. The kernel has just 2 global loads and 2 global stores per thread. Peak performance is reached with blockDim.x = 64 on a GTX 285, which gives 23 x 60 = 1380 thread blocks. The occupancy at that size is only 50 %, and it is higher with more threads per block. I am still using CUDA 2.1…
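The 50 % figure can be reproduced with a rough occupancy estimate. The sketch below is an assumption-laden simplification: it only uses the per-multiprocessor limits of compute capability 1.3 (the GTX 285), namely 1024 resident threads, 8 resident blocks and a warp size of 32, and it ignores register and shared memory usage, which can lower the real value. The names SmLimits and occupancy are hypothetical.

```cpp
#include <algorithm>

// Per-multiprocessor limits. The defaults are the compute capability 1.3
// values; register and shared memory limits are deliberately left out.
struct SmLimits {
    unsigned int maxThreads;  // max resident threads per multiprocessor
    unsigned int maxBlocks;   // max resident blocks per multiprocessor
    unsigned int warpSize;    // threads per warp
};

constexpr SmLimits kCc13 = {1024, 8, 32};

// Fraction of the warp slots on one multiprocessor that are occupied
// when every block has `threadsPerBlock` threads.
double occupancy(unsigned int threadsPerBlock, SmLimits sm = kCc13) {
    unsigned int warpsPerBlock = (threadsPerBlock + sm.warpSize - 1) / sm.warpSize;
    unsigned int maxWarps = sm.maxThreads / sm.warpSize;
    // Resident blocks are capped by the block limit and by the warp slots.
    unsigned int residentBlocks = std::min(sm.maxBlocks, maxWarps / warpsPerBlock);
    return static_cast<double>(residentBlocks * warpsPerBlock) / maxWarps;
}
```

Under these assumptions, 64 threads per block hits the limit of 8 resident blocks first (8 x 2 = 16 of 32 warps, i.e. 50 %), while 128, 256 or 512 threads per block fill all 32 warp slots.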
Does anyone have a hint as to why the kernel performs best with just 64 threads per block and worse with more, even though the occupancy is much better?
Thanks in advance.