Maximum number of registers used How to get the block of code that uses the maximum number of regist

Hi there.
I’m trying to optimize my CUDA app. To increase occupancy I need to limit the number of registers per thread to 16. My kernel currently uses 18 regs/thread, so I need to compile with -maxrregcount 16, but then it uses 4 bytes of local memory.
My question is: is there any way to see where (in which code block) my kernel needs 18 regs? If I compile with the -ptx parameter I cannot see which regs are reused, because every instruction allocates a new register.

Please don’t tell me that the only way is to look at the PTX code and do a liveness analysis of each register.

thanks for your answers

Hi,

The number of registers used by your application (18) is a count of machine registers, not PTX registers. If you are still interested in looking at the PTX register usage, you could modify nvopencc to dump a liveness analysis.
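To illustrate what a liveness analysis reports, here is a minimal sketch in host-side C++. It is not how nvopencc or ptxas work internally. It assumes a toy, PTX-like representation in which each instruction defines at most one virtual register, and it handles straight-line code only (no branches or loops). The names `Instr`, `Pressure` and `max_pressure` are hypothetical and exist only for this example.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical toy instruction: defines at most one virtual register
// and reads zero or more others. def < 0 means "defines nothing" (e.g. a store).
struct Instr {
    int def;
    std::vector<int> uses;
};

// Location and size of the highest register pressure found.
struct Pressure {
    std::size_t index;  // instruction index where the peak occurs
    std::size_t live;   // registers live immediately before that instruction
};

// Backward liveness scan over straight-line code.
// Assumption: nothing is live after the last instruction.
Pressure max_pressure(const std::vector<Instr>& code) {
    std::set<int> live;
    Pressure peak{0, 0};
    for (std::size_t i = code.size(); i-- > 0;) {
        // The defined value does not exist before this instruction.
        if (code[i].def >= 0) live.erase(code[i].def);
        // Every operand must be live on entry to this instruction.
        live.insert(code[i].uses.begin(), code[i].uses.end());
        // ">=" while walking backwards keeps the earliest peak on ties.
        if (live.size() >= peak.live) peak = {i, live.size()};
    }
    return peak;
}
```

For a toy program that loads three values into r0, r1 and r2, computes r3 = r0 + r1, then r4 = r3 * r2, and finally stores r4, the scan reports the peak at the addition, where r0, r1 and r2 are all live at once. A real kernel has control flow, so a real analysis would iterate over basic blocks until the live sets stop changing, and the machine register count can still differ from this virtual-register estimate.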

What application are you using?

I find that the fastest way to check what costs time in your kernel is to comment out code and see how much time each section of your kernel takes. Make sure the compiler doesn’t optimize out your kernel in the process :)

Also bear in mind that increasing occupancy doesn’t always mean more performance.

eyal

I am developing an application, not using one :D

Actually I am using CUDA Toolkit 2.3 on an openSUSE 11.1 box. Could you tell me how to dump that liveness analysis?

thanks

EDIT: typing error

My objective isn’t to see which part of my app consumes the most time, but which part uses the maximum number of registers.

You’re right, but if I can reduce the maximum number of registers used without resorting to local memory, I will surely increase performance (in my opinion at least :huh: )

That’s not correct :) Occupancy is not 100% related to performance. (Google these forums, and maybe the programming guide from nVidia, for such a statement) :)

Therefore reducing the number of registers probably won’t help your performance, hence my suggestion :)

Obviously you should use the profiler and make sure all your gmem accesses are coalesced.

eyal

Not true. The compiler uses registers generously to store intermediate results so that they do not have to be recalculated. By using the -maxrregcount flag you increase the work done within the kernel, but you can increase occupancy. Whether this is an advantage or not is application dependent. I have come across cases at both extremes: one where using lmem to increase occupancy increased performance, and one where reducing the register count by one to increase occupancy reduced performance.
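For reference, the occupancy arithmetic behind the original 16-versus-18 question can be sketched in host-side C++. The numbers are assumptions, not taken from the thread: 8192 registers, 768 resident threads and 8 resident blocks per multiprocessor (the limits for compute capability 1.0/1.1 devices), and a block size of 256 threads. The sketch also ignores shared memory and register allocation granularity. `SmLimits`, `resident_blocks` and `occupancy` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>

// Assumed per-multiprocessor limits (compute capability 1.0/1.1 values).
struct SmLimits {
    int registers;    // 32-bit registers per multiprocessor
    int max_threads;  // resident threads per multiprocessor
    int max_blocks;   // resident blocks per multiprocessor
};

// Blocks that fit on one multiprocessor, limited by registers,
// by the thread limit and by the block-slot limit.
// Simplification: shared memory and allocation granularity are ignored.
int resident_blocks(const SmLimits& sm, int regs_per_thread, int threads_per_block) {
    assert(regs_per_thread > 0 && threads_per_block > 0);
    int by_regs = sm.registers / (regs_per_thread * threads_per_block);
    int by_threads = sm.max_threads / threads_per_block;
    return std::min({by_regs, by_threads, sm.max_blocks});
}

// Occupancy as the fraction of the thread limit that is actually resident.
double occupancy(const SmLimits& sm, int regs_per_thread, int threads_per_block) {
    int blocks = resident_blocks(sm, regs_per_thread, threads_per_block);
    return static_cast<double>(blocks * threads_per_block) / sm.max_threads;
}
```

Under these assumptions, calling `resident_blocks(SmLimits{8192, 768, 8}, 18, 256)` gives one block per multiprocessor (256 of 768 threads), while 16 registers per thread gives two blocks (512 of 768 threads). That doubles occupancy on paper; as the replies above point out, whether it makes the kernel faster still has to be measured.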