The nvcc output shows that my program uses 63 registers plus 48 bytes of local memory (lmem), and the CUDA profiler reports a large number of local memory accesses. That is probably why my program performs badly.
There is a small loop in my program that is repeated hundreds of times, and I would not expect any local memory access inside that main loop: by my estimate, 61 registers should be enough for all of the calculation. But no luck, nvcc uses 8 more registers than that. I am sorry that I cannot post the code for now, but I would really appreciate any advice on avoiding unnecessary local memory access, as well as any optimization tips for nvcc.
Full unrolling is likely to make the problem worse, as the compiler will use as many registers as it sees fit to optimize the unrolled code.
Try switching unrolling off by putting #pragma unroll 1 in front of the loop. If that works, you can try increasing the number to reintroduce partial unrolling.
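Since the original code was not posted, here is only a rough sketch of where the pragma goes. The kernel accumulate, the helper run_accumulate and the loop body are all made up for illustration; the loop body just stands in for the real calculation.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel (not the poster's code): each thread runs a small
// loop `iters` times, standing in for the real inner loop.
__global__ void accumulate(float *out, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;

    // An unroll factor of 1 asks nvcc not to unroll this loop.
    // To try partial unrolling later, raise the factor (2, 4, ...)
    // and check the register and lmem numbers after each change.
    #pragma unroll 1
    for (int k = 0; k < iters; ++k) {
        acc += 0.5f * (float)k;   // placeholder for the real loop body
    }

    out[i] = acc;
}

// Hypothetical host helper: launches the kernel and copies the result back.
// Returns a zero-filled vector if the device allocation fails.
std::vector<float> run_accumulate(int n, int iters)
{
    std::vector<float> host(n, 0.0f);
    float *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return host;

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    accumulate<<<blocks, threads>>>(dev, n, iters);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Compiling with nvcc -Xptxas -v prints the register and lmem usage per kernel, so you can compare the numbers for each unroll factor you try. Whether this removes the spills in the real kernel depends on the actual loop body.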