GPU profiling: 33% occupancy faster than 50-66%

Hi,
I've been profiling my CUDA kernels and my algorithm runs a lot faster at lower occupancy. When I use a block size of 32 threads I get 0.33 occupancy and the algorithm takes 0.97 s, but with 136 threads per block I get 0.5 occupancy and it takes 1.5 s. I'm using three shared arrays, each equal in size to the number of threads per block, plus a few extra ints and two 6-element constant arrays. I would appreciate any input on the following.

1. Does it make sense that I'm getting lower occupancy with smaller block sizes?

2. If so, why would I get a performance hit at higher occupancy? Shouldn't it be the opposite?

Thanks Danny

There could be a lot of factors in play here, so it’s not possible to answer your question without more info. What does your kernel do? How much shared memory are you using?

The maximum occupancy for blocks of 32 threads is in fact 33%. This is because the maximum number of active thread blocks per multiprocessor on G80 is 8, and the maximum number of active warps is 24. A 32-thread block is a single warp, so you can have at most 8 warps active, which is 8/24 ≈ 33%.
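Spelled out as arithmetic, the calculation looks something like this (a small host-side sketch using those G80 limits; the variable names are purely illustrative):

// Occupancy arithmetic for G80, using the limits quoted above:
// at most 8 resident blocks per SM and 24 resident warps (768 threads) per SM.
int threadsPerBlock = 32;
int warpsPerBlock   = (threadsPerBlock + 31) / 32;                       // = 1 warp
int blockLimit      = 8;                                                 // G80 blocks-per-SM limit
int warpLimit       = 24 / warpsPerBlock;                                // 24 blocks' worth of warps
int activeBlocks    = blockLimit < warpLimit ? blockLimit : warpLimit;   // block limit wins: 8
float occupancy     = (activeBlocks * warpsPerBlock) / 24.0f;            // 8/24 ≈ 0.33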

Mark

Hi, I'm doing something very similar to a convolution; it can be viewed as a convolution for the purposes of this discussion.

__global__ void
parallelFwdPrjKernel(float* g_dataVol, int zIdx, int dimVolX, int prjDimX, int prjDimY,
                     int kernelWidth, float* g_projectionPrev, float* g_projectionNext,
                     int projIdx, int volStride, int projStride)
{
    // offsetsConstant and kernelConstant are the two 6-element __constant__
    // arrays mentioned above, declared at file scope.
    __shared__ float Vol[BLOCKSIZEX][BLOCKSIZEY];
    __shared__ float ConvVol[BLOCKSIZEX][BLOCKSIZEY];
    __shared__ float Proj[BLOCKSIZEX][BLOCKSIZEY];

    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int i, j, k;

    // Flat index of this thread's pixel in the projection (projStride = prjDimX*prjDimY).
    j = (projStride * projIdx) + (prjDimX * y) + x;
    Proj[threadIdx.x][threadIdx.y] = g_projectionPrev[j];

    for (k = 0; k < kernelWidth; k++) {
        // Flat index into the Z plane of the volume (volStride = dimVolX*prjDimY),
        // shifted by the k-th kernel offset.
        i = (volStride * zIdx) + (dimVolX * y) + x + offsetsConstant[k];

        Vol[threadIdx.x][threadIdx.y] = g_dataVol[i];
        __syncthreads();

        // Weight by the k-th kernel tap and accumulate.
        ConvVol[threadIdx.x][threadIdx.y] = Vol[threadIdx.x][threadIdx.y] * kernelConstant[k];
        Proj[threadIdx.x][threadIdx.y] += ConvVol[threadIdx.x][threadIdx.y];
    }

    g_projectionNext[j] = Proj[threadIdx.x][threadIdx.y];
}

I know this is not the most optimized implementation, as I am making extra global memory reads that could be avoided with padded shared memory, similar to your own example. I did, however, create the padded shared-memory version that performs only one global read for the central values, with a few extra reads at the boundaries, all at the cost of thread-ID checks. I only saw about a 10% performance increase, a lot less than I expected considering the cost of a global fetch and the fact that the occupancy is 0.33. I should also mention that this kernel gets called 2500 times; could that be my bottleneck?
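For reference, this is roughly the shape of what I mean by the padded version (a minimal sketch, not the actual kernel: the block sizes, HALO width, and border clamping are placeholders, and it assumes the kernel offsets run along x and that the launch uses blockDim = (BLOCKSIZEX, BLOCKSIZEY)):

#define BLOCKSIZEX 16
#define BLOCKSIZEY 16
#define HALO        3   // assumed half-width; requires kernelWidth <= 2*HALO + 1

__constant__ float kernelConstant[2 * HALO + 1];   // filter taps, as in the kernel above

__global__ void paddedTileSketch(const float* g_in, float* g_out,
                                 int width, int height, int kernelWidth)
{
    // Tile padded by HALO on each side along x, so interior taps hit shared memory only.
    __shared__ float tile[BLOCKSIZEY][BLOCKSIZEX + 2 * HALO];

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Central load: exactly one global read per thread.
    tile[threadIdx.y][threadIdx.x + HALO] = g_in[y * width + x];

    // Boundary threads also fetch the left/right halo; these are the extra reads
    // and thread-ID checks mentioned above (borders simply clamped here).
    if (threadIdx.x < HALO) {
        int xl = max(x - HALO, 0);
        int xr = min(x + BLOCKSIZEX, width - 1);
        tile[threadIdx.y][threadIdx.x]                     = g_in[y * width + xl];
        tile[threadIdx.y][threadIdx.x + BLOCKSIZEX + HALO] = g_in[y * width + xr];
    }
    __syncthreads();

    // Accumulate entirely from shared memory.
    float acc = 0.0f;
    for (int k = 0; k < kernelWidth; k++)
        acc += tile[threadIdx.y][threadIdx.x + k] * kernelConstant[k];

    g_out[y * width + x] = acc;
}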

Any tips or advice is appreciated

Regards Danny