GPU profiling: 33% occupancy faster than 50-66%

Hi,
I've been profiling my CUDA kernels and my algorithm runs a lot faster at lower occupancy. When I use a block size of 32 threads I get 0.33 occupancy and the algorithm takes 0.97 s, but with 136 threads per block I get 0.5 occupancy and it takes 1.5 s. I'm using three shared arrays, each equal in size to the number of threads per block, plus a few extra ints and two 6-element constant arrays. I would appreciate any input on the following.

1. Does it make sense that I'm getting lower occupancy with smaller block sizes?

2. If so, why would I get a performance hit at higher occupancy? Shouldn't it be the opposite?

Thanks Danny

There could be a lot of factors in play here, so it’s not possible to answer your question without more info. What does your kernel do? How much shared memory are you using?

The maximum occupancy for blocks of 32 threads is in fact 33%. This is because the maximum number of active thread blocks per multiprocessor on G80 is 8, and the maximum number of active warps is 24. A 32-thread block is a single warp, so you can have at most 8 warps active, which is 8/24 ≈ 33%.
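Spelled out as arithmetic, the calculation looks something like this (a small host-side sketch using those G80 limits; the variable names are purely illustrative):

// Occupancy arithmetic for G80, using the limits quoted above:
// at most 8 resident blocks per SM and 24 resident warps (768 threads) per SM.
int threadsPerBlock = 32;
int warpsPerBlock   = (threadsPerBlock + 31) / 32;                       // = 1 warp
int blockLimit      = 8;                                                 // G80 blocks-per-SM limit
int warpLimit       = 24 / warpsPerBlock;                                // 24 blocks' worth of warps
int activeBlocks    = blockLimit < warpLimit ? blockLimit : warpLimit;   // block limit wins: 8
float occupancy     = (activeBlocks * warpsPerBlock) / 24.0f;            // 8/24 ≈ 0.33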

Mark

Hi, I'm doing something very similar to a convolution; it can be viewed as a convolution for the purposes of this discussion.

__global__ void
parallelFwdPrjKernel(float* g_dataVol, int zIdx, int dimVolX, int prjDimX, int prjDimY,
                     int kernelWidth, float* g_projectionPrev, float* g_projectionNext,
                     int projIdx, int volStride, int projStride)
{
    // offsetsConstant and kernelConstant are the two 6-element __constant__
    // arrays mentioned above, declared at file scope.
    __shared__ float Vol[BLOCKSIZEX][BLOCKSIZEY];
    __shared__ float ConvVol[BLOCKSIZEX][BLOCKSIZEY];
    __shared__ float Proj[BLOCKSIZEX][BLOCKSIZEY];

    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int i, j, k;

    // Flat index of this thread's pixel in the projection (projStride = prjDimX*prjDimY).
    j = (projStride * projIdx) + (prjDimX * y) + x;
    Proj[threadIdx.x][threadIdx.y] = g_projectionPrev[j];

    for (k = 0; k < kernelWidth; k++) {
        // Flat index into the Z plane of the volume (volStride = dimVolX*prjDimY),
        // shifted by the k-th kernel offset.
        i = (volStride * zIdx) + (dimVolX * y) + x + offsetsConstant[k];

        Vol[threadIdx.x][threadIdx.y] = g_dataVol[i];
        __syncthreads();

        // Weight by the k-th kernel tap and accumulate.
        ConvVol[threadIdx.x][threadIdx.y] = Vol[threadIdx.x][threadIdx.y] * kernelConstant[k];
        Proj[threadIdx.x][threadIdx.y] += ConvVol[threadIdx.x][threadIdx.y];
    }

    g_projectionNext[j] = Proj[threadIdx.x][threadIdx.y];
}

I know this is not the most optimized implementation, as I am making extra global memory reads that could be avoided with padded shared memory, similar to your own example. I did, however, create the padded shared-memory version that performs only one global read for the central values, with a few extra reads at the boundaries, all at the cost of thread-ID checks. I only saw about a 10% performance increase, a lot less than I expected considering the cost of a global fetch and the fact that the occupancy is 0.33. I should also mention that this kernel gets called 2500 times; could that be my bottleneck?
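For reference, this is roughly the shape of what I mean by the padded version (a minimal sketch, not the actual kernel: the block sizes, HALO width, and border clamping are placeholders, and it assumes the kernel offsets run along x and that the launch uses blockDim = (BLOCKSIZEX, BLOCKSIZEY)):

#define BLOCKSIZEX 16
#define BLOCKSIZEY 16
#define HALO        3   // assumed half-width; requires kernelWidth <= 2*HALO + 1

__constant__ float kernelConstant[2 * HALO + 1];   // filter taps, as in the kernel above

__global__ void paddedTileSketch(const float* g_in, float* g_out,
                                 int width, int height, int kernelWidth)
{
    // Tile padded by HALO on each side along x, so interior taps hit shared memory only.
    __shared__ float tile[BLOCKSIZEY][BLOCKSIZEX + 2 * HALO];

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Central load: exactly one global read per thread.
    tile[threadIdx.y][threadIdx.x + HALO] = g_in[y * width + x];

    // Boundary threads also fetch the left/right halo; these are the extra reads
    // and thread-ID checks mentioned above (borders simply clamped here).
    if (threadIdx.x < HALO) {
        int xl = max(x - HALO, 0);
        int xr = min(x + BLOCKSIZEX, width - 1);
        tile[threadIdx.y][threadIdx.x]                     = g_in[y * width + xl];
        tile[threadIdx.y][threadIdx.x + BLOCKSIZEX + HALO] = g_in[y * width + xr];
    }
    __syncthreads();

    // Accumulate entirely from shared memory.
    float acc = 0.0f;
    for (int k = 0; k < kernelWidth; k++)
        acc += tile[threadIdx.y][threadIdx.x + k] * kernelConstant[k];

    g_out[y * width + x] = acc;
}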

Any tips or advice is appreciated

Regards Danny