Hello,
I am profiling a kernel function that has this lambda declared inside it:
int step = 1;
int sign = 1;
auto UpdateStep = [&] __device__ () {
    if (sign == -1) {
        step++;
    }
    sign = (-1) * sign;
};
Profiling this code with Nsight shows that, when UpdateStep is called in the kernel, it causes a memory access to Global(2) on the line
sign = (-1) * sign;
I’m just curious why this is happening, given that both the lambda and the variable sign are declared locally within the kernel.
I’m running CUDA 12.4 with Compute Capability 8.9.
I doubt this can be answered without a complete test case. It could be due to the capture specification, or the compiler may be attributing something to that line during optimization that is not obviously part of the lambda.
Local variables are generally placed in CUDA local memory (which has the same latency as global memory). Keeping them in registers is the result of an optimization step (within ptxas, I believe).
Your UpdateStep captures sign and step by reference, which effectively means taking their addresses. That could prevent the optimization.
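If the by-reference capture is what forces step and sign through memory, one workaround to try is a by-value, mutable capture that returns the updated values instead of writing through pointers. Here is a minimal host-side C++ sketch of the idea (update_step_by_value is a hypothetical name; in the actual kernel the lambda would also carry __device__ and the file would be compiled with --extended-lambda):

```cpp
#include <utility>

// Hypothetical illustration: by-value, mutable capture gives the lambda
// private copies of step and sign, which the compiler is free to keep
// in registers, rather than pointers back into the enclosing frame.
inline std::pair<int, int> update_step_by_value(int step, int sign) {
    auto UpdateStep = [step, sign]() mutable {
        if (sign == -1) {
            step++;
        }
        sign = (-1) * sign;
        // Results are handed back explicitly instead of being
        // stored through captured references.
        return std::make_pair(step, sign);
    };
    return UpdateStep();
}
```

Whether this removes the Global(2) access in your kernel would still need to be verified in Nsight, since ptxas can sometimes promote by-reference captures to registers on its own.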
I don't have much experience optimizing lambdas in CUDA device code.