Performance issue with local memory

Hello,

I have a CUDA kernel, __global__ mykernel(), that calls three main functions for every thread. Adding one of these functions results in a very high execution time, even though most of that function should never run: its body is only executed when a condition is true. For example:
__global__ void mykernel()
{
    function1();
    function2();
    function3();
}

__device__ void function2(…, …)
{
    // declare local variables

    if (condition) {
        // many copies from global memory into local variables
    }
}

When I profile it with cudaprof, the only difference I see is that the number of local loads and stores increases several times when function2 is compiled in, even though the condition always evaluates to false, so the code inside the if block is never entered during execution. Could this be a case of register spilling? Perhaps the compiler cannot find enough registers for the new automatic local variables and spills them into local memory. But since I never enter this section of code, I am confused about why I would see an increase in loads and stores from local memory at all. Also, I am using Fermi with the L1 cache set to 48K, so I wasn't expecting register spilling to be a huge problem. When I compile the code, I see that I am maxing out the register usage per thread (63), but the ptxas info does not mention any lmem being used.
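For reference, here is a minimal sketch of how the compile-time numbers could be cross-checked from the host side, in addition to reading the ptxas output. It is not my real code: mykernel_sketch, function2_sketch, the 64-element scratch array, and report_kernel_resources are hypothetical stand-ins, and I am assuming the local variables behave like a dynamically indexed array. It uses cudaFuncSetCacheConfig to request the larger L1 split and cudaFuncGetAttributes to read the per-thread register count and local memory size recorded for the kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for function2: the scratch array is only touched
// when `flag` is non-zero, mirroring the rarely-true condition in the post.
__device__ void function2_sketch(const float *in, float *out, int flag)
{
    if (flag) {
        float scratch[64];                      // assumed size, for illustration only
        for (int i = 0; i < 64; ++i)
            scratch[i] = in[i];                 // copies from global memory
        int idx = static_cast<int>(in[0]) & 63; // index known only at run time
        out[threadIdx.x] = scratch[idx];        // run-time indexing usually forces local memory
    }
}

// Hypothetical stand-in for mykernel.
__global__ void mykernel_sketch(const float *in, float *out, int flag)
{
    function2_sketch(in, out, flag);
}

// Prints the static resource usage of the kernel. Returns 0 on success.
int report_kernel_resources()
{
    // Ask for the larger L1 configuration (48K L1 / 16K shared on Fermi).
    cudaError_t err = cudaFuncSetCacheConfig(mykernel_sketch, cudaFuncCachePreferL1);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncSetCacheConfig: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // These values are fixed at compile time; they do not depend on
    // whether the branch is actually taken at run time.
    cudaFuncAttributes attr;
    err = cudaFuncGetAttributes(&attr, mykernel_sketch);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("registers per thread:    %d\n", attr.numRegs);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    return 0;
}
```

Calling report_kernel_resources() from main and building with nvcc -Xptxas -v should let the reported local memory size be compared against what ptxas prints for the same kernel. The exact numbers will depend on the compiler version and target architecture.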

I would like to know whether I am looking in the right place. What exactly do the local load and store counts in the CUDA profiler mean? And is it true that just compiling in a lot of local variables, even without using them during execution, can decrease overall performance?

Thanks