Big delay in code

Please advise me:

I have the following code:

[codebox]__device__ void CalcKernel(DWORD index, DWORD *d_mem)
{
    DWORD L, R, result;
    DWORD idx = blockIdx.x * blockDim.x + threadIdx.x;

    result = L ^ R;
    d_mem[idx] = result;
}[/codebox]

This function takes, for example, 3 seconds to run.

If I change the last line to

[codebox] d_mem[idx] = idx; [/codebox]

it takes 0.9 seconds to call the function.

Why is there such a big delay when I store a different variable?


Code optimization, most likely. Open64 has a very aggressive dead code removal algorithm, and it is likely that changing the value stored at the end of the kernel removes large sections of code (quite probably the whole inlined function), making the kernel a lot “faster” because most of its contents have been optimized away.
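To see the effect in isolation, here is a minimal sketch, not the original poster's code. The function `expensive` is a made-up stand-in for whatever computes L and R in the real kernel, and the kernel names and launch sizes are assumptions too. Both kernels contain the same computation, but only the first one stores a value that depends on it, so the compiler is free to drop the work from the second.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the expensive computation that is not shown in the post.
__device__ unsigned int expensive(unsigned int x)
{
    unsigned int v = x;
    for (int i = 0; i < 10000; ++i)
        v = v * 1664525u + 1013904223u;   // arbitrary busy work
    return v;
}

// The stored value depends on L and R, so the computation has to be kept.
__global__ void storeResult(unsigned int *d_mem)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int L = expensive(idx);
    unsigned int R = expensive(idx + 1u);
    unsigned int result = L ^ R;
    d_mem[idx] = result;
}

// Same computation, but result is never stored, so the compiler may remove it.
__global__ void storeIndex(unsigned int *d_mem)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int L = expensive(idx);
    unsigned int R = expensive(idx + 1u);
    unsigned int result = L ^ R;
    (void)result;                         // deliberately unused
    d_mem[idx] = idx;
}

// Time a single kernel launch with CUDA events; returns milliseconds.
static float timeKernel(void (*kernel)(unsigned int *), unsigned int *d_mem,
                        int blocks, int threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    kernel<<<blocks, threads>>>(d_mem);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int blocks = 1024, threads = 256;   // assumed launch configuration
    unsigned int *d_mem = NULL;
    if (cudaMalloc(&d_mem, blocks * threads * sizeof(unsigned int)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    timeKernel(storeIndex, d_mem, blocks, threads);   // warm-up launch, not reported
    printf("store result: %.3f ms\n", timeKernel(storeResult, d_mem, blocks, threads));
    printf("store idx:    %.3f ms\n", timeKernel(storeIndex, d_mem, blocks, threads));
    cudaFree(d_mem);
    return 0;
}
```

If the second kernel comes out much faster, that does not mean the store is faster. The kernel is simply doing less work.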

Maybe the problem is that the memory is slow?
d_mem was allocated with cudaMalloc().

DWORD L,R,result;
DWORD idx = offset + blockIdx.x * blockDim.x + threadIdx.x;
result = L^R;

d_mem[blockIdx.x * blockDim.x + threadIdx.x]=idx; - fast, the function uses 15 registers
d_mem[blockIdx.x * blockDim.x + threadIdx.x]=0; - fast
d_mem[blockIdx.x * blockDim.x + threadIdx.x]=0 & result; - fast

d_mem[blockIdx.x * blockDim.x + threadIdx.x]=1 & result; - slow
d_mem[blockIdx.x * blockDim.x + threadIdx.x]=result; - slow, the function uses 20 registers

Is there any chance to improve that?

It isn’t memory, it is code (or the absence thereof). The compiler is smart enough to know that

d_mem[blockIdx.x * blockDim.x + threadIdx.x]=0 & result

is equivalent to

d_mem[blockIdx.x * blockDim.x + threadIdx.x]=0

and that any intermediate code required to calculate result is superfluous and can be optimized away. If you don’t believe me, get yourself a copy of decuda and disassemble the cubin output from the compiler. You will find huge chunks of code missing in the “fast case”. The difference in register usage is a tell-tale sign.
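If you want to check the register counts without a disassembler, one option is to ask the runtime with cudaFuncGetAttributes. This is only a sketch: the two kernels and the helper `mix` are made-up stand-ins for the poster's code, and the actual counts depend on the compiler version and target architecture. Compiling with `nvcc --ptxas-options=-v` should print the same information at build time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the computation that produces L and R.
__device__ unsigned int mix(unsigned int x)
{
    unsigned int v = x;
    for (int i = 0; i < 64; ++i)
        v = (v << 5) ^ (v >> 3) ^ (v * 2654435761u);
    return v;
}

// Stores the real result, so the computation is live.
__global__ void storeResult(unsigned int *d_mem)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int result = mix(idx) ^ mix(idx + 1u);
    d_mem[idx] = result;
}

// Stores 0 & result, which is always 0, so the computation is dead.
__global__ void storeMasked(unsigned int *d_mem)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int result = mix(idx) ^ mix(idx + 1u);
    d_mem[idx] = 0 & result;
}

int main()
{
    cudaFuncAttributes attr;

    // numRegs is the number of registers each thread of the kernel uses.
    if (cudaFuncGetAttributes(&attr, storeResult) == cudaSuccess)
        printf("storeResult: %d registers per thread\n", attr.numRegs);

    if (cudaFuncGetAttributes(&attr, storeMasked) == cudaSuccess)
        printf("storeMasked: %d registers per thread\n", attr.numRegs);

    return 0;
}
```

A noticeably lower count for the masked version would be consistent with the computation having been removed.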