Performance drop with compute_20,sm_20 nvcc options

Hi again

I’m currently working on this code

// Range (intensity) weight of the bilateral filter.
__device__ float gaussian_range_value(float range, float sigma_r)
{
	return __expf(-(range * range) / (2.0 * sigma_r * sigma_r));
}

// Column pass of the bilateral filter: each thread filters one pixel along a
// vertical window of size m. loglum_tex (2D texture) and spatial_kernel
// (precomputed spatial weights) are declared elsewhere; sigma_s is unused here
// because the spatial weights are precomputed.
__global__ void bilateral_texture_col(float *D, int width, int height, int m, float sigma_s, float sigma_r)
{
	int j;
	float mult, val, k;
	float ci, cj;

	int X = blockIdx.x * blockDim.x + threadIdx.x;
	int Y = blockIdx.y * blockDim.y + threadIdx.y;

	if ((X < width) && (Y < height)) {
		ci = tex2D(loglum_tex, X, Y);   // center pixel
		k = 0;
		val = 0;

		for (j = -(m - 1) / 2; j <= (m - 1) / 2; j++) {
			cj = tex2D(loglum_tex, X, Y + j);
			mult = spatial_kernel[abs(j)] * gaussian_range_value(ci - cj, sigma_r);
			val += mult * cj;
			k += mult;
		}

		D[X + Y * width] = val / k;   // normalized result
	}
}
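
For completeness, loglum_tex is a 2D float texture bound to the log-luminance image, and spatial_kernel is a small __constant__ array holding the precomputed spatial weights. The declarations look roughly like this (the array size and the launch configuration below are just example values, not the real ones):

// 2D texture holding the log-luminance image (bound elsewhere with cudaBindTexture2D)
texture<float, 2, cudaReadModeElementType> loglum_tex;

// Precomputed spatial Gaussian weights, indexed by |j| (size must be at least (m+1)/2)
__constant__ float spatial_kernel[64];   // 64 is an arbitrary example size

// Example launch: one thread per pixel
// dim3 block(16, 16);
// dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
// bilateral_texture_col<<<grid, block>>>(D, width, height, m, sigma_s, sigma_r);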

Everything was fine when I used the default nvcc options compute_10,sm_10: bilateral_texture_col executed in 1 ms.

Then I changed the nvcc options to compute_20,sm_20 because I wanted to try atomicAdd for float, and I got a huge performance drop.
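
(Just for context, the kind of thing I wanted to try is a plain float accumulation like the sketch below; the names are made up, but atomicAdd on a float only compiles from compute capability 2.0 on.)

__global__ void sum_kernel(float *sum, const float *in, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		atomicAdd(sum, in[i]);   // float atomicAdd: needs sm_20 or higher (illustrative example)
}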

My kernel now executes in 4 ms. I was able to locate where the performance loss occurs: it is in the gaussian_range_value device function call.

The number of registers used by the kernel rose from 13 to 30, and I just can't explain why.

Sorry for my English.

Testi

Hi.

With cc 2.0 you enable double precision floating point operations, whereas with cc 1.0 doubles are silently demoted to float.

Now look at this:

__device__ float gaussian_range_value(float range, float sigma_r)
{
	return __expf(-(range * range) / (2.0 * sigma_r * sigma_r));
}

Your 2.0 is a double literal, forcing all the operands to be promoted to double. The arithmetic is now done in double, and the result is only cast back down to float when it is passed to __expf().

Just try replacing it with 2.0f.
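
Something like this, with the literal changed to a float constant so the whole expression stays in single precision:

__device__ float gaussian_range_value(float range, float sigma_r)
{
	// 2.0f is a float literal, so no operand gets promoted to double
	return __expf(-(range * range) / (2.0f * sigma_r * sigma_r));
}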

It seems you are right again.

Casting explicitly to float solved the problem; it actually made the code 20% faster…

The number of registers used remains a bit higher, 20 instead of 13, but it doesn't look that important.

Thank you for the fast and accurate answer.

Testi

Glad to be helpful.

IRT the extra registers, it might be due to the different compilers used for cc 1.3 and cc 2.0 if you are using CUDA 4.1. Indeed, CUDA 4.1 uses Open64 to generate GPU code for cc <= 1.3 and LLVM for cc >= 2.0. This can translate into very different optimisation strategies and register counts.
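
If you want to keep an eye on that, you can ask ptxas to report per-kernel register usage at compile time, and optionally cap it. Illustrative command lines (bilateral.cu is just a placeholder for your source file, and 20 an arbitrary cap):

nvcc -arch=sm_20 -Xptxas -v bilateral.cu              # prints registers / shared memory used by each kernel
nvcc -arch=sm_20 --maxrregcount=20 bilateral.cu       # ask the compiler to keep register usage at or below 20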