The nvcc output shows that my program uses 63 registers and 48 bytes of local memory (lmem), and the CUDA profiler reports a large number of local memory accesses. That is probably why my program performs badly.
There is a small loop in my program that is repeated hundreds of times, and I expected no local memory access in that main loop. By my estimate, 61 registers should be enough for all of the calculation. No luck, though: nvcc wants to use 8 more registers. I am sorry that, for certain reasons, I cannot post the code for now, but I would really appreciate any advice on avoiding unnecessary local memory access, as well as any optimization tips for nvcc.