Best performance with strange settings

Tobi_W · May 19, 2009, 7:49am

Hi,

I have a really simple kernel that performs best with 64 threads per block, and I don’t really understand why. But first of all, my kernel:

__global__ void multKernel(short2 *data1, short2 *data2, int2 *result)
{
	// One thread per complex sample; blockIdx.y selects the offset into data2.
	unsigned int position = (blockIdx.x * blockDim.x) + threadIdx.x;
	if (position < 1463)
	{
		short2 data1_sample = data1[position];
		short2 data2_sample = data2[position + blockIdx.y];
		// Complex multiply: (a + bi)(c + di) = (ac - bd) + (ad + bc)i
		result[(blockIdx.y * 1463) + position].x = (data2_sample.x * data1_sample.x) - (data2_sample.y * data1_sample.y);
		result[(blockIdx.y * 1463) + position].y = (data2_sample.x * data1_sample.y) + (data2_sample.y * data1_sample.x);
	}
}

Execution configuration is:

blockDim.x = 64 // this one can be changed
blockDim.y = 1
blockDim.z = 1
gridDim.x = ceil(1463.0 / blockDim.x) // 23 for blockDim.x = 64
gridDim.y = 60
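A side note on the grid computation: if `1463 / blockDim.x` is evaluated with integer operands, the division truncates before `ceil` ever sees it. Below is a minimal host-side sketch of the usual integer ceiling division. The names `ceilDiv`, `gridX`, `totalBlocks`, and `idleThreadsPerRow` are made up for illustration; the constants 1463 and 60 are taken from the post.

```cpp
#include <cassert>

// Samples per row and number of rows (gridDim.y), both taken from the post.
const unsigned int kSamples = 1463;
const unsigned int kRows = 60;

// Integer ceiling division: the smallest q such that q * b >= a.
unsigned int ceilDiv(unsigned int a, unsigned int b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

// gridDim.x for a given blockDim.x.
unsigned int gridX(unsigned int threadsPerBlock)
{
    return ceilDiv(kSamples, threadsPerBlock);
}

// Total number of thread blocks launched: gridDim.x * gridDim.y.
unsigned int totalBlocks(unsigned int threadsPerBlock)
{
    return gridX(threadsPerBlock) * kRows;
}

// Threads per row that fail the `position < 1463` check and do no work.
unsigned int idleThreadsPerRow(unsigned int threadsPerBlock)
{
    return gridX(threadsPerBlock) * threadsPerBlock - kSamples;
}
```

With 64 threads per block this gives `gridX(64) == 23` and `totalBlocks(64) == 1380`, matching the block count quoted in the post. `idleThreadsPerRow(64)` is 9, whereas `idleThreadsPerRow(128)` is 73, so larger blocks also leave more idle threads in the last block of each row.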

As you can see, the only thing that varies is blockDim.x and, with it, gridDim.x. The kernel has just 2 global loads and 2 global stores. Peak performance is reached with blockDim.x = 64 on a GTX 285, which gives 1380 thread blocks. Occupancy is only 50% in that case, and higher with more threads per block. I am still using CUDA 2.1…

Does anyone have a hint as to why the kernel performs best with just 64 threads per block and worse with more, although the occupancy is much better?
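The 50% figure can be reproduced with a little arithmetic. The sketch below assumes the compute capability 1.3 limits that apply to the GTX 285 (at most 8 resident blocks, 1024 resident threads, and 32 resident warps per multiprocessor). It also assumes that neither registers nor shared memory limit the kernel further, which the post does not state. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits, assumed here for compute capability 1.3.
const unsigned int kWarpSize = 32;
const unsigned int kMaxBlocksPerSM = 8;
const unsigned int kMaxThreadsPerSM = 1024;
const unsigned int kMaxWarpsPerSM = 32;

// Warps resident on one multiprocessor for a given block size,
// ignoring register and shared-memory limits (an assumption).
unsigned int residentWarps(unsigned int threadsPerBlock)
{
    assert(threadsPerBlock > 0 && threadsPerBlock <= 512);
    unsigned int warpsPerBlock = (threadsPerBlock + kWarpSize - 1) / kWarpSize;
    unsigned int blocksByThreads = kMaxThreadsPerSM / threadsPerBlock;
    unsigned int blocksByWarps = kMaxWarpsPerSM / warpsPerBlock;
    unsigned int blocks =
        std::min(kMaxBlocksPerSM, std::min(blocksByThreads, blocksByWarps));
    return blocks * warpsPerBlock;
}

// Occupancy as an integer percentage of the maximum resident warps.
unsigned int occupancyPercent(unsigned int threadsPerBlock)
{
    return 100 * residentWarps(threadsPerBlock) / kMaxWarpsPerSM;
}
```

Under these assumptions, 64 threads per block hits the 8-block limit first: 8 blocks of 2 warps give 16 of 32 warps, so `occupancyPercent(64)` is 50. From 128 threads per block upward the thread limit is reached instead and `occupancyPercent(128)` is 100.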

Thanks in advance.

Sarnath · May 19, 2009, 7:59am

Higher occupancy does not necessarily mean higher performance, although it does help hide latency.

It all boils down to what the bottleneck in your kernel is.

Tobi_W · May 19, 2009, 8:58am

I am aware of that, but I thought the bottleneck of my kernel was bandwidth, so higher occupancy would increase the performance.

Am I wrong in that assumption?

Jamie_K · May 19, 2009, 8:25pm

Occupancy can hide latency, by allowing some warps to run while others are waiting for memory. But if they are all stalled on memory, then once you hit that saturation point, higher occupancy won’t improve bandwidth. It just means more warps are waiting.
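One way to check whether the kernel is already at that saturation point is to compare its effective bandwidth with the card's theoretical peak. Below is a rough sketch, assuming each of the 1463 × 60 threads reads two `short2` values (4 bytes each) and writes one `int2` (8 bytes). The elapsed time is a placeholder that has to come from an actual measurement, and the function names are hypothetical.

```cpp
#include <cassert>

// Bytes moved through global memory by one launch, assuming every
// in-range thread reads two short2 values and writes one int2.
unsigned long long bytesMoved()
{
    const unsigned long long threads = 1463ULL * 60ULL;   // samples * rows
    const unsigned long long bytesPerThread = 2 * 4 + 8;  // 2 loads + 1 int2 store
    return threads * bytesPerThread;
}

// Effective bandwidth in GB/s (10^9 bytes per second) for a measured
// kernel time given in milliseconds.
double effectiveBandwidthGBs(double elapsedMs)
{
    assert(elapsedMs > 0.0);
    return (double)bytesMoved() / (elapsedMs * 1.0e-3) / 1.0e9;
}
```

If the result for the 64-thread configuration is already close to the peak, additional resident warps have nothing left to hide. The whole launch moves only about 1.4 MB, so launch overhead may also be a visible part of the measured time.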

Tobi_W · May 20, 2009, 6:31am

Ok, thanks. Probably texture memory will speed up the kernel…
