Why don't I have 100% Occupancy?

All the following code does is move a window of size (2N+1) x (2N+1) about every pixel in a left and a right image and calculate the mean pixel value in that window. The thread block size is (32,8,1) and the grid size is (32,96,1).

__constant__ int N;

texture <float4, 2, cudaReadModeNormalizedFloat> tex_recImageL;
texture <float4, 2, cudaReadModeNormalizedFloat> tex_recImageR;

__global__ void
cuStereoCorr(int* RES_H, int* RES_V, float* ResultDisp, float* MeanL, float* MeanR, float* STDL, float* STDR)
{
	int x = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
	int y = __mul24(blockIdx.y, blockDim.y) + threadIdx.y;
	int loc = __mul24(y, 1024) + x;
	int u = 0;
	int v = 0;
	float4 ValueL;
	float4 ValueR;

	float4 MeanL_Temp;
	float4 MeanR_Temp;

	#pragma unroll
	for(u=-N; u<=N; u++)
	{
		#pragma unroll
		for(v=-N; v<=N; v++)
		{
			ValueL = tex2D(tex_recImageL, x+u, y+v);
			MeanL_Temp.x = (MeanL_Temp.x + ValueL.x)/((float)(2.*N+1));
			MeanL_Temp.y = (MeanL_Temp.y + ValueL.y)/((float)(2.*N+1));
			MeanL_Temp.z = (MeanL_Temp.z + ValueL.z)/((float)(2.*N+1));

			ValueR = tex2D(tex_recImageR, x+u, y+v);
			MeanR_Temp.x = (MeanR_Temp.x + ValueR.x)/((float)(2.*N+1));
			MeanR_Temp.y = (MeanR_Temp.y + ValueR.y)/((float)(2.*N+1));
			MeanR_Temp.z = (MeanR_Temp.z + ValueR.z)/((float)(2.*N+1));
		}
	}

	MeanL[loc] = .11*MeanL_Temp.x + .59*MeanL_Temp.y + .3*MeanL_Temp.z;
	MeanR[loc] = .11*MeanR_Temp.x + .59*MeanR_Temp.y + .3*MeanR_Temp.z;
}

If I comment out the last 2 lines I get 100% occupancy, but leaving them in gives 33%. I don't understand this, since I have 0 incoherent loads or stores. If you have any ideas or thoughts on more efficient ways to do this, please let me know.

Thanks!

If you comment out the last two lines, your kernel essentially does nothing (it doesn't write anything anywhere), so the compiler detects that, and 100% occupancy is to be expected.

For 100% occupancy, you must use at most 10 registers.
With those block and grid sizes, 17 or more registers will leave you stuck at 33% occupancy.
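For reference, assuming a compute capability 1.0/1.1 part (8192 registers and 768 resident threads per multiprocessor), the arithmetic works out like this: a 32x8 block is 256 threads, so at 10 registers per thread a block needs 2560 registers and three blocks (768 threads, i.e. 100% occupancy) fit in the register file; at 17 registers per thread a block needs 4352 registers, only one block fits, and you are limited to 256/768 = 33%.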

To check it, compile with the -v flag passed to ptxas, and it will report the register usage.

If you fill in the information (use --ptxas-options=-v on the nvcc command line) in the occupancy calculator, you will see the limiting factor. My guess is that you are using too many registers to get 100% occupancy.
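For example (stereo.cu is just a placeholder for your source file name):

nvcc --ptxas-options=-v -c stereo.cu

ptxas will then print, per kernel, the number of registers used along with the local, shared and constant memory usage, which you can plug straight into the occupancy calculator.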

Note that having occupancy lower than 100% does not mean you will not get maximum performance. I have kernels with 33% occupancy that do close to 70 GB/s, so maximum throughput is already achieved.

You're right, I'm using 20 registers. What is driving the number of registers used, and what can I do to reduce it? Is there an easy way to push data to local memory? My .cubin file states:

The registers are used for all the local variables in your kernel, as well as for the texture unit addressing. Given your kernel, there probably isn’t much of a reduction you can get: 20 registers is decent.

Calculate your effective memory bandwidth, counting every texture read and memory write. Given your access pattern, you should be maxing out the device at ~70 GB/s. If this is the case, the hardware doesn't have anything more to give, and increasing the occupancy will not gain you anything.
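For what it's worth, here is a minimal host-side sketch of that bandwidth estimate. It assumes a 1024x768 image (which matches your 32x96 grid of 32x8 blocks), N = 3 as an example window half-size, and that the profiler GPU time of 11526 is in microseconds; it also counts every texture fetch at 16 bytes, so the texture cache will make the real DRAM traffic lower than this naive figure:

#include <stdio.h>

int main(void)
{
	const int width = 1024, height = 768;      /* matches the grid/block dimensions */
	const int n = 3;                           /* example window half-size */
	const double kernelTime_s = 11526e-6;      /* profiler GPU time, assumed to be in microseconds */

	double threads = (double)width * height;
	double window  = (2.0*n + 1.0) * (2.0*n + 1.0);

	/* two textures fetched per window position, 16 bytes per float4 fetch */
	double bytesRead    = threads * window * 2.0 * 16.0;
	/* two float results written per thread */
	double bytesWritten = threads * 2.0 * sizeof(float);

	printf("effective bandwidth: %.1f GB/s\n",
	       (bytesRead + bytesWritten) / kernelTime_s / 1e9);
	return 0;
}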

Hi, a few observations:

  • Before trying to increase occupancy, you should really see how many GB/s your kernel is achieving. If it is close to the memory bandwidth, then it is useless to increase occupancy.

  • I am not exactly sure whether the compiler optimizes away the 4th component of the float4 if it is not used (you could check the generated PTX by specifying -keep on the nvcc command line)

  • #pragma unroll does not unroll anything here, since the trip count depends on N, which is only known at run time; if you know the window size in advance (say N = 8, i.e. 17 iterations), you can specify #pragma unroll 17

  • I think your code has an error (the normalization is inside the for loops, while I think it should be outside); find below how I guess it should look (I also replaced the float4 accumulators with 3 floats)

__constant__ int N;

texture <float4, 2, cudaReadModeNormalizedFloat> tex_recImageL;
texture <float4, 2, cudaReadModeNormalizedFloat> tex_recImageR;

__global__ void
cuStereoCorr(int* RES_H, int* RES_V, float* ResultDisp, float* MeanL, float* MeanR, float* STDL, float* STDR)
{
	int x = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
	int y = __mul24(blockIdx.y, blockDim.y) + threadIdx.y;
	int loc = __mul24(y, 1024) + x;
	int u = 0;
	int v = 0;
	float4 ValueL;
	float4 ValueR;

	float MeanL_Tempx = 0.0f, MeanL_Tempy = 0.0f, MeanL_Tempz = 0.0f;
	float MeanR_Tempx = 0.0f, MeanR_Tempy = 0.0f, MeanR_Tempz = 0.0f;

	#pragma unroll
	for(u=-N; u<=N; u++)
	{
		#pragma unroll
		for(v=-N; v<=N; v++)
		{
			ValueL = tex2D(tex_recImageL, x+u, y+v);
			MeanL_Tempx += ValueL.x;
			MeanL_Tempy += ValueL.y;
			MeanL_Tempz += ValueL.z;

			ValueR = tex2D(tex_recImageR, x+u, y+v);
			MeanR_Tempx += ValueR.x;
			MeanR_Tempy += ValueR.y;
			MeanR_Tempz += ValueR.z;
		}
	}

	float factor = (float)(2*N+1);

	MeanL_Tempx /= factor;
	MeanL_Tempy /= factor;
	MeanL_Tempz /= factor;
	MeanR_Tempx /= factor;
	MeanR_Tempy /= factor;
	MeanR_Tempz /= factor;

	MeanL[loc] = 0.11f*MeanL_Tempx + 0.59f*MeanL_Tempy + 0.3f*MeanL_Tempz;
	MeanR[loc] = 0.11f*MeanR_Tempx + 0.59f*MeanR_Tempy + 0.3f*MeanR_Tempz;
}

The compiler has always optimized away the 4th component of a float4 for me. This can even be annoying sometimes when you are trying to perform a coalesced read or write of a float4.

I think occupancy is the problem, since I have another kernel with 260000 stores that has a GPU time of 725.44, while this kernel only has 65000 stores, and removing the final two lines reduces the GPU time from 11526 to 689, a difference of ~10000.

The programming guide states that

and N is a constant

Functionally it should not matter; I'm just expanding the summation. It may be less efficient, though.

Because if you remove the final two lines, your kernel can be optimized to a “do nothing” kernel.

Occupancy may improve things, but it isn't the main reason for that speedup.

But why does the number of registers used reduce from 20 to 5 when I comment out the last 2 lines if registers are only used for local variables and texture addressing?

But doesn't the fact that the profiler shows a nonzero value for the number of stores indicate that it isn't doing nothing, it's just not publishing what it did to global memory?

Because (as has already been said a number of times) the compiler is very good at removing dead code. When you comment out the global memory writes, the compiler sees that your large loops do nothing because the value calculated by them is never written to the output. Since the large loop is “dead code” the compiler sees no reason to include it in the final output.
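To make that concrete, with the two writes commented out the compiler is free to reduce your kernel to something roughly like this (a sketch of the effect, not the actual generated code):

__global__ void
cuStereoCorr(int* RES_H, int* RES_V, float* ResultDisp, float* MeanL, float* MeanR, float* STDL, float* STDR)
{
	/* no observable side effects remain, so the loops and all the
	   texture reads are dead code and can be dropped entirely */
}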

Regarding your comparison of two kernels with different numbers of stores: You need to also count the number of memory reads in tallying the effective memory bandwidth. Due to your double loop over N with a texture read inside, the number of reads in the kernel you posted here will be quite large.

If you say functionally it should be the same, then I think you still have an error ;)

I think you wanted to write:

MeanL_Temp.x = MeanL_Temp.x + ValueL.x/((float)(2.*N+1));

instead of

MeanL_Temp.x = (MeanL_Temp.x + ValueL.x)/((float)(2.*N+1));

Written like that, the first ValueL.x will be divided by 2N+1 a lot more times than the last ValueL.x, or am I missing something? (which is quite possible :P)
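A quick way to convince yourself, with made-up numbers: for a 3-sample window with values 3, 6 and 9, the original form computes ((3/3 + 6)/3 + 9)/3 ≈ 3.78, while the intended mean is (3 + 6 + 9)/3 = 6. A tiny CPU-side sketch of the two accumulation forms:

#include <stdio.h>

int main(void)
{
	float samples[3] = { 3.0f, 6.0f, 9.0f };
	float wrong = 0.0f, right = 0.0f;
	int i;

	for (i = 0; i < 3; i++)
		wrong = (wrong + samples[i]) / 3.0f;   /* original form */

	for (i = 0; i < 3; i++)
		right += samples[i] / 3.0f;            /* corrected form */

	printf("wrong = %f, right = %f\n", wrong, right);   /* ~3.78 vs 6.00 */
	return 0;
}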

You're right. Thanks for catching my mistake.

Also, I think I understand what the compiler is doing by removing dead code. Thanks for all the help.