Hi there.
I’m trying to optimize my CUDA app. To increase occupancy I need to limit the number of registers per thread to 16. My kernel currently uses 18 regs/thread, so I need to compile with -maxrregcount 16, but that way I end up using 4 bytes of local memory.
My question is: is there any way to see where (in which code block) my kernel needs 18 regs? If I compile with the -ptx option I cannot see which registers are reused, because every instruction allocates a new register.
The number of registers used by your application (18) refers to machine registers, not PTX registers. If you are still interested in looking at the PTX register usage, you could modify nvopencc to dump a liveness analysis.
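As a practical starting point, the per-kernel machine-register count is reported by ptxas, and the disassembly shows where those registers are actually used. A minimal sketch, assuming a toolkit where -Xptxas -v is available; the file name and kernel are made up for illustration:

```cuda
// Compile with verbose ptxas output to see the machine-register count:
//   nvcc -Xptxas -v -maxrregcount=16 saxpy.cu
// ptxas then prints, per kernel, a line along the lines of:
//   ptxas info : Used 16 registers, 4 bytes lmem
// To see where registers are live in the actual machine code, disassemble
// the cubin (decuda on older toolkits, cuobjdump -sass on newer ones).
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];  // register pressure is visible here in the SASS
}
```

The PTX output will not tell you this, since PTX uses virtual registers that only get mapped to the 16/18 machine registers by ptxas.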
My objective isn’t to find which part of my app consumes the most time, but which part uses the maximum number of registers.
You’re right, but if I can reduce the maximum number of registers used without spilling to local memory, I will surely increase performance (in my opinion, at least :huh: )
That’s not correct :) occupancy is not 100% related to performance (search these forums, and maybe the programming guide from NVIDIA, for statements to that effect) :)
Therefore reducing the number of registers probably won’t help your performance, hence my suggestion :)
Obviously you should use the profiler and make sure all your gmem accesses are coalesced.
Not true. The compiler uses registers generously to store intermediate results and prevent them from having to be recalculated. By using the -maxrregcount flag you increase the workload done within the kernel, but you can increase occupancy. Whether this is an advantage or not is application dependent. I have come across cases at both extremes - one case where using lmem to increase occupancy increased performance, and one where reducing the register count by one to increase occupancy reduced performance.
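For what it’s worth, the register cap doesn’t have to be global. A per-kernel sketch, assuming a toolkit that supports the __launch_bounds__ qualifier (the kernel name and bounds here are made up):

```cuda
// Per-kernel alternative to the global -maxrregcount flag: __launch_bounds__
// lets the compiler cap registers only for this kernel, so other kernels in
// the same file keep their full register allocation.
__global__ void
__launch_bounds__(256, 4)  // max threads per block, min blocks per multiprocessor
scale_kernel(int n, float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;   // the compiler may spill to lmem to honour the bound
}
```

That makes it easier to test the occupancy-vs-spilling trade-off one kernel at a time instead of constraining the whole compilation unit.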