Local Loads and Stores in the CUDA Profiler

I have a CUDA kernel, __global__ void mykernel(), that calls three main functions for every thread. Adding one of these functions results in a very high execution time, even though most of that function should never run, because its body is only executed when a condition is true. For example:
__global__ void mykernel()
{
    function1();
    function2();
    function3();
}

__device__ function2(…, …)
{
    // declare local variables

    if (condition) {
        // lots of copies from global memory into local variables
    }
}

When I profiled it with cudaprof, the only difference I see is that the number of local loads and stores increases several times when function2 is compiled in, even though the condition always evaluates to false, so the code inside the if is never entered during execution. Could this be a case of register spilling? Perhaps the compiler cannot find enough registers for the new automatic local variables and spills them into local memory. But since I never enter this section of code, I am confused about why I would see an increase in local memory loads and stores anyway.

Another point: I am using Fermi with the L1 cache set to 48 KB, so I was not expecting register spilling to be a huge problem. When I compile the code, I see that I am maxing out register usage per thread (63), but the ptxas info makes no mention of any lmem being used.

What exactly does the local load and store count in the CUDA profiler mean? And is it true that merely compiling in a lot of local variables, even without using them during execution, can decrease overall performance?

thanks

Would replacing if (condition) { with if (0) { really give the same result, just faster?
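One way to run that comparison without editing the source twice is to make the branch a compile-time switch. The following is only a minimal sketch: the template parameter kEnabled, the 64-element tmp array, the buffer sizes, and the helper names timeVariant and runExperiment are all hypothetical stand-ins, not the original kernel. With kEnabled set to false the compiler can discard the whole branch, which corresponds to writing if (0).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for function2. kEnabled is a compile-time switch:
// when false, the branch is dead code, as with "if (0)".
template <bool kEnabled>
__device__ float function2(const float* in, int n, bool condition)
{
    float result = 0.0f;
    if (kEnabled && condition) {
        // A thread-indexed automatic array like this usually lives in local memory.
        float tmp[64];
        for (int i = 0; i < 64; ++i) tmp[i] = in[(threadIdx.x + i) % n];
        for (int i = 0; i < 64; ++i) result += tmp[(i * 7 + threadIdx.x) % 64];
    }
    return result;
}

template <bool kEnabled>
__global__ void mykernel(const float* in, float* out, int n, bool condition)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = in[idx] + function2<kEnabled>(in, n, condition);
}

// Launches one variant with the runtime condition set to false and
// returns the elapsed time in milliseconds.
template <bool kEnabled>
float timeVariant(const float* d_in, float* d_out, int n)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    mykernel<kEnabled><<<blocks, threads>>>(d_in, d_out, n, false);  // warm-up
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    mykernel<kEnabled><<<blocks, threads>>>(d_in, d_out, n, false);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Returns 0 on success, 1 if any CUDA call failed.
int runExperiment()
{
    const int n = 1 << 20;
    float *d_in = NULL, *d_out = NULL;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return 1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) { cudaFree(d_in); return 1; }
    cudaMemset(d_in, 0, n * sizeof(float));

    float msOn = timeVariant<true>(d_in, d_out, n);
    float msOff = timeVariant<false>(d_in, d_out, n);
    printf("branch compiled in : %.3f ms\n", msOn);
    printf("branch compiled out: %.3f ms\n", msOff);

    cudaError_t err = cudaGetLastError();
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}

int main() { return runExperiment(); }
```

Because the runtime condition is false in both launches, any timing or profiler-counter difference between the two variants would come from the code the compiler generated, not from work done inside the branch. A toy kernel this small may well show no difference at all; the point is the method, applied to the real kernel.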

In principle you are correct that more local variables, even if they are never accessed under certain conditions, can lead to more register spilling, and therefore to more local memory accesses and decreased performance.
The best way to see what actually happens would be to look at the generated machine code using the disassembler from the Nouveau project (together with a little script for ELF file conversion).
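As a lighter-weight check before disassembling, the CUDA runtime can report how many registers and how much local memory per thread a compiled kernel uses, via cudaFuncGetAttributes. The sketch below assumes a hypothetical kernel body (the tmp array and its indexing are invented for illustration); only the attribute query itself is the point.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: the branch holds a thread-indexed automatic array,
// which the compiler typically places in local memory.
__global__ void mykernel(const float* in, float* out, int n, bool condition)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float acc = in[idx];
    if (condition) {
        float tmp[64];
        for (int i = 0; i < 64; ++i) tmp[i] = in[(idx + i) % n];
        for (int i = 0; i < 64; ++i) acc += tmp[(i * 7 + threadIdx.x) % 64];
    }
    out[idx] = acc;
}

int main()
{
    // Query the compiled kernel's static resource usage.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, mykernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("registers per thread   : %d\n", attr.numRegs);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    return 0;
}
```

A nonzero localSizeBytes means the kernel reserves local memory for each thread, whether for spilled registers or for automatic arrays. Comparing the value with and without function2 compiled in would show whether adding it changes the kernel's local memory footprint, independently of whether the branch is ever taken at run time.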
