Strange performance issue

I am wondering why the kernel pad_kernel_fast runs 16x faster than pad_kernel_slow? The only difference between them is dst[idx] = 3.23424324 versus dst[idx] = srcvalue.

__global__ void pad_kernel_fast(float *dst, float *src, int width, int height)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;

    float srcvalue = src[iy * width + ix];
    dst[(iy + 2) * (width + 5) + ix + 2] = 3.23424324;
}

__global__ void pad_kernel_slow(float *dst, float *src, int width, int height)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;

    float srcvalue = src[iy * width + ix];
    dst[(iy + 2) * (width + 5) + ix + 2] = srcvalue;
}

Regards,

zlf

The compiler optimizes and throws out unused calculations.

Try changing 3.23424324 to 3.234243f and test. 3.23424324 is a double literal which is being converted to a float.
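
A minimal sketch of that change, assuming the rest of the kernel body stays as posted above:

    // Sketch: the 'f' suffix makes the literal single precision,
    // avoiding the implicit double-to-float conversion.
    dst[(iy + 2) * (width + 5) + ix + 2] = 3.234243f;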

–edit–
My bad, never mind, I misread your problem…

Dear Lev

I turned “Optimization” to “Disabled (/Od)” in “CUDA Build Rule/General”. The result is the same as before. Is there any way to turn compiler optimization off?

Regards

zlf

In the first case the compiler eliminates the load of srcvalue; maybe the code generator does too.

float srcvalue = src[iy * width + ix]; is never used, so I think there is no computation and no read of src[iy * width + ix].
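
In other words, once the unused load is eliminated, pad_kernel_fast effectively behaves like the sketch below (an illustration of the dead-code elimination described above, not actual compiler output):

__global__ void pad_kernel_fast_effective(float *dst, float *src, int width, int height)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;

    // src is never read, so the global-memory load is removed; only the store remains.
    dst[(iy + 2) * (width + 5) + ix + 2] = 3.23424324;
}

One way to check this would be to inspect the generated PTX, or to make the stored value depend on srcvalue (as pad_kernel_slow already does); if the load comes back, the timing gap should close.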