Why don't I have 100% Occupancy?

All the following code does is move a window of size (2N+1) x (2N+1) about every pixel in a left and a right image and calculate the mean pixel value in that window. The thread block size is (32,8,1) and the grid size is (32,96,1).

__constant__ int N;

texture <float4, 2, cudaReadModeNormalizedFloat> tex_recImageL;
texture <float4, 2, cudaReadModeNormalizedFloat> tex_recImageR;

__global__ void
cuStereoCorr(int* RES_H, int* RES_V, float* ResultDisp, float* MeanL, float* MeanR, float* STDL, float* STDR)
{
	int x = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
	int y = __mul24(blockIdx.y, blockDim.y) + threadIdx.y;
	int loc = __mul24(y, 1024) + x;
	int u = 0;
	int v = 0;
	float4 ValueL;
	float4 ValueR;

	float4 MeanL_Temp;
	float4 MeanR_Temp;

	#pragma unroll
	for(u=-N; u<=N; u++)
	{
		#pragma unroll
		for(v=-N; v<=N; v++)
		{
			ValueL = tex2D(tex_recImageL, x+u, y+v);
			MeanL_Temp.x = (MeanL_Temp.x + ValueL.x)/((float)(2.*N+1));
			MeanL_Temp.y = (MeanL_Temp.y + ValueL.y)/((float)(2.*N+1));
			MeanL_Temp.z = (MeanL_Temp.z + ValueL.z)/((float)(2.*N+1));

			ValueR = tex2D(tex_recImageR, x+u, y+v);
			MeanR_Temp.x = (MeanR_Temp.x + ValueR.x)/((float)(2.*N+1));
			MeanR_Temp.y = (MeanR_Temp.y + ValueR.y)/((float)(2.*N+1));
			MeanR_Temp.z = (MeanR_Temp.z + ValueR.z)/((float)(2.*N+1));
		}
	}

	MeanL[loc] = .11*MeanL_Temp.x + .59*MeanL_Temp.y + .3*MeanL_Temp.z;
	MeanR[loc] = .11*MeanR_Temp.x + .59*MeanR_Temp.y + .3*MeanR_Temp.z;
}

If I comment out the last 2 lines I get 100% occupancy, but leaving them in gives 33%. I don't understand this, since I have 0 incoherent loads or stores. If you have any ideas or thoughts on more efficient ways to do this, please let me know.

Thanks!

If you comment out the last two lines, your kernel essentially does nothing (it doesn't write anything anywhere), so the compiler detects that, and 100% occupancy is to be expected.

For 100% occupancy, you must use at most 10 registers.
With those block and grid sizes, 17 or more registers will leave you stuck at 33% occupancy.
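For reference, assuming a compute capability 1.0/1.1 part (8192 registers and 768 resident threads per multiprocessor), the arithmetic works out like this: a 32x8 block is 256 threads, so at 10 registers per thread a block needs 2560 registers and three blocks (768 threads, i.e. 100% occupancy) fit in the register file; at 17 registers per thread a block needs 4352 registers, only one block fits, and you are limited to 256/768 = 33%.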

To check it, compile with the -v flag passed to ptxas, and it will report the register usage.

If you fill in the information (use --ptxas-options=-v on the nvcc command line) in the occupancy calculator, you will see the limiting factor. My guess is that you are using too many registers to get 100% occupancy.
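For example (stereo.cu is just a placeholder for your source file name):

nvcc --ptxas-options=-v -c stereo.cu

ptxas will then print, per kernel, the number of registers used along with the local, shared and constant memory usage, which you can plug straight into the occupancy calculator.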

Note that having occupancy lower than 100% does not mean you will not get maximum performance. I have kernels with 33% occupancy that do close to 70 GB/s, so maximum throughput is already achieved.

You're right, I'm using 20 registers. What is driving the number of registers used, and what can I do to reduce it? Is there an easy way to push data to local memory? My .cubin file states:

The registers are used for all the local variables in your kernel, as well as for the texture unit addressing. Given your kernel, there probably isn’t much of a reduction you can get: 20 registers is decent.

Calculate your effective memory bandwidth, counting every texture read and memory write. Given your access pattern, you should be maxing out the device at ~70 GB/s. If this is the case, the hardware doesn't have anything more to give, and increasing the occupancy will not gain you anything.
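For what it's worth, here is a minimal host-side sketch of that bandwidth estimate. It assumes a 1024x768 image (which matches your 32x96 grid of 32x8 blocks), N = 3 as an example window half-size, and that the profiler GPU time of 11526 is in microseconds; it also counts every texture fetch at 16 bytes, so the texture cache will make the real DRAM traffic lower than this naive figure:

#include <stdio.h>

int main(void)
{
	const int width = 1024, height = 768;      /* matches the grid/block dimensions */
	const int n = 3;                           /* example window half-size */
	const double kernelTime_s = 11526e-6;      /* profiler GPU time, assumed to be in microseconds */

	double threads = (double)width * height;
	double window  = (2.0*n + 1.0) * (2.0*n + 1.0);

	/* two textures fetched per window position, 16 bytes per float4 fetch */
	double bytesRead    = threads * window * 2.0 * 16.0;
	/* two float results written per thread */
	double bytesWritten = threads * 2.0 * sizeof(float);

	printf("effective bandwidth: %.1f GB/s\n",
	       (bytesRead + bytesWritten) / kernelTime_s / 1e9);
	return 0;
}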

Hi, a few observations:

  • Before trying to increase occupancy, you should really see how many GB/s your kernel is achieving. If it is close to the memory bandwidth, then it is useless to increase occupancy.

  • I am not exactly sure whether the compiler optimizes away the 4th component of the float4 if it is not used (you could check the generated PTX by specifying -keep on the nvcc command line)

  • #pragma unroll does not unroll anything here, since the trip count depends on N, which is only known at run time; if you know the window size in advance (say N = 8, i.e. 17 iterations), you can specify #pragma unroll 17

  • I think your code has an error (the normalization is inside the for loops, while I think it should be outside); find below how I guess it should look (I also replaced the float4 accumulators with 3 floats)

__constant__ int N;

texture <float4, 2, cudaReadModeNormalizedFloat> tex_recImageL;
texture <float4, 2, cudaReadModeNormalizedFloat> tex_recImageR;

__global__ void
cuStereoCorr(int* RES_H, int* RES_V, float* ResultDisp, float* MeanL, float* MeanR, float* STDL, float* STDR)
{
	int x = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
	int y = __mul24(blockIdx.y, blockDim.y) + threadIdx.y;
	int loc = __mul24(y, 1024) + x;
	int u = 0;
	int v = 0;
	float4 ValueL;
	float4 ValueR;

	float MeanL_Tempx = 0.0f, MeanL_Tempy = 0.0f, MeanL_Tempz = 0.0f;
	float MeanR_Tempx = 0.0f, MeanR_Tempy = 0.0f, MeanR_Tempz = 0.0f;

	#pragma unroll
	for(u=-N; u<=N; u++)
	{
		#pragma unroll
		for(v=-N; v<=N; v++)
		{
			ValueL = tex2D(tex_recImageL, x+u, y+v);
			MeanL_Tempx += ValueL.x;
			MeanL_Tempy += ValueL.y;
			MeanL_Tempz += ValueL.z;

			ValueR = tex2D(tex_recImageR, x+u, y+v);
			MeanR_Tempx += ValueR.x;
			MeanR_Tempy += ValueR.y;
			MeanR_Tempz += ValueR.z;
		}
	}

	float factor = (float)(2*N+1);

	MeanL_Tempx /= factor;
	MeanL_Tempy /= factor;
	MeanL_Tempz /= factor;
	MeanR_Tempx /= factor;
	MeanR_Tempy /= factor;
	MeanR_Tempz /= factor;

	MeanL[loc] = 0.11f*MeanL_Tempx + 0.59f*MeanL_Tempy + 0.3f*MeanL_Tempz;
	MeanR[loc] = 0.11f*MeanR_Tempx + 0.59f*MeanR_Tempy + 0.3f*MeanR_Tempz;
}

The compiler has always optimized away the 4th component of a float4 for me. This can even be annoying sometimes when you are trying to perform a coalesced read or write of a float4.

I think occupancy is the problem, since I have another kernel with 260000 stores that has a GPU time of 725.44, while this kernel only has 65000 stores, and removing the final two lines reduces the GPU time from 11526 to 689, a difference of ~10000.

The programming guide states that

and N is a constant

Functionally it should not matter; I'm just expanding the summation. It may be less efficient, though.

Because if you remove the final two lines, your kernel can be optimized to a “do nothing” kernel.

Occupancy may improve things, but it isn't the main reason for that speedup.

But why does the number of registers used reduce from 20 to 5 when I comment out the last 2 lines if registers are only used for local variables and texture addressing?

But doesn't the fact that the profiler shows a nonzero value for the number of stores indicate that it isn't doing nothing, it's just not publishing what it did to global memory?

Because (as has already been said a number of times) the compiler is very good at removing dead code. When you comment out the global memory writes, the compiler sees that your large loops do nothing because the value calculated by them is never written to the output. Since the large loop is “dead code” the compiler sees no reason to include it in the final output.
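To make that concrete, with the two writes commented out the compiler is free to reduce your kernel to something roughly like this (a sketch of the effect, not the actual generated code):

__global__ void
cuStereoCorr(int* RES_H, int* RES_V, float* ResultDisp, float* MeanL, float* MeanR, float* STDL, float* STDR)
{
	/* no observable side effects remain, so the loops and all the
	   texture reads are dead code and can be dropped entirely */
}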

Regarding your comparison of two kernels with different numbers of stores: You need to also count the number of memory reads in tallying the effective memory bandwidth. Due to your double loop over N with a texture read inside, the number of reads in the kernel you posted here will be quite large.

If you say functionally it should be the same, then I think you still have an error ;)

I think you wanted to write:

MeanL_Temp.x = MeanL_Temp.x + ValueL.x/((float)(2.*N+1));

instead of

MeanL_Temp.x = (MeanL_Temp.x + ValueL.x)/((float)(2.*N+1));

Written like that, the first ValueL.x will be divided by 2N+1 a lot more times than the last ValueL.x, or am I missing something? (which is quite possible :P)
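A quick way to convince yourself, with made-up numbers: for a 3-sample window with values 3, 6 and 9, the original form computes ((3/3 + 6)/3 + 9)/3 ≈ 3.78, while the intended mean is (3 + 6 + 9)/3 = 6. A tiny CPU-side sketch of the two accumulation forms:

#include <stdio.h>

int main(void)
{
	float samples[3] = { 3.0f, 6.0f, 9.0f };
	float wrong = 0.0f, right = 0.0f;
	int i;

	for (i = 0; i < 3; i++)
		wrong = (wrong + samples[i]) / 3.0f;   /* original form */

	for (i = 0; i < 3; i++)
		right += samples[i] / 3.0f;            /* corrected form */

	printf("wrong = %f, right = %f\n", wrong, right);   /* ~3.78 vs 6.00 */
	return 0;
}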

You're right. Thanks for catching my mistake.

Also, I think I understand what the compiler is doing by removing dead code. Thanks for all the help.