Hi again
I’m currently working on this code:
__device__ float gaussian_range_value(float range, float sigma_r)
{
    return __expf(-(range * range) / (2.0 * sigma_r * sigma_r));
}

// loglum_tex (texture reference) and spatial_kernel are declared at file scope (not shown here).
__global__ void bilateral_texture_col(float *D, int width, int height, int m, float sigma_s, float sigma_r)
{
    int j;
    float mult, val, k;
    float ci, cj;
    int X = blockIdx.x * blockDim.x + threadIdx.x;
    int Y = blockIdx.y * blockDim.y + threadIdx.y;
    if ((X < width) && (Y < height)) {
        ci = tex2D(loglum_tex, X, Y);
        k = 0;
        val = 0;
        // Walk the vertical window of size m, accumulating the weighted
        // sum (val) and the normalization factor (k).
        for (j = -(m - 1) / 2; j <= (m - 1) / 2; j++) {
            cj = tex2D(loglum_tex, X, Y + j);
            mult = spatial_kernel[abs(j)] * gaussian_range_value(ci - cj, sigma_r);
            val += mult * cj;
            k += mult;
        }
        D[X + Y * width] = val / k;
    }
}
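For completeness, the kernel is launched over the image roughly like this (the 16x16 block size and the d_D pointer name are illustrative, not my exact setup):

dim3 block(16, 16);
dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
bilateral_texture_col<<<grid, block>>>(d_D, width, height, m, sigma_s, sigma_r);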
Everything was fine when I used the nvcc options compute_10,sm_10 (the defaults): bilateral_texture_col executed in 1 ms.
Then I changed the nvcc options to compute_20,sm_20 because I wanted to try atomicAdd for float, and I got a huge performance drop.
My kernel now executes in 4 ms. I was able to locate where the performance loss occurs: it's in the gaussian_range_value device function call.
The number of registers for the kernel rose from 13 to 30, and I just can't explain why.
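For reference, register counts like these can be read from the verbose ptxas output; the source file name below is just an example:

nvcc -arch=sm_10 --ptxas-options=-v bilateral.cu
nvcc -arch=sm_20 --ptxas-options=-v bilateral.cu

ptxas then prints a "Used N registers" line for each kernel.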
Sorry for my English.
Testi