I’m trying to optimize my CUDA app. To increase occupancy I need to limit the number of registers per thread to 16. My kernel currently uses 18 registers per thread, so I have to compile with -maxrregcount 16, but then it uses 4 bytes of local memory.
My question is: is there any way to see where (in which code block) my kernel needs 18 registers? If I compile with the -ptx option I can’t see which registers are reused, because every instruction allocates a new register.
Please don’t tell me that the only way is to look at the PTX code and do a liveness analysis of each register.
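
A minimal sketch of the setup described above, assuming a hypothetical kernel scale_add in a file named kernel.cu (neither is the real code from this app). The nvcc options in the comments are the usual way to get ptxas to report per-kernel register and local memory usage; the numbers it prints depend on the toolkit version and target architecture, so none are shown here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Build commands (kernel.cu is an assumed file name):
//   nvcc --ptxas-options=-v -o kernel kernel.cu
//       ptxas prints registers and local memory used by each kernel
//   nvcc --ptxas-options=-v -maxrregcount=16 -o kernel kernel.cu
//       same report, with registers capped at 16 per thread
//   nvcc -ptx kernel.cu
//       emits PTX, which uses virtual registers, so the count
//       seen there is not the final allocation

// Hypothetical kernel, for illustration only: y[i] = a * x[i] + y[i]
__global__ void scale_add(const float *x, float *y, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Runs the kernel on a small input and checks the result on the host.
bool run_scale_add_check()
{
    const int n = 1024;
    static float hx[n], hy[n];
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx = 0, *dy = 0;
    if (cudaMalloc((void **)&dx, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc((void **)&dy, n * sizeof(float)) != cudaSuccess) { cudaFree(dx); return false; }

    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

    scale_add<<<(n + 255) / 256, 256>>>(dx, dy, 3.0f, n);
    bool ok = (cudaGetLastError() == cudaSuccess);

    // This copy also waits for the kernel to finish.
    ok = ok && (cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost) == cudaSuccess);

    cudaFree(dx);
    cudaFree(dy);

    // Each element should be 3 * 1 + 2 = 5.
    for (int i = 0; ok && i < n; ++i)
        ok = (hy[i] == 5.0f);
    return ok;
}

int main()
{
    bool ok = run_scale_add_check();
    printf("%s\n", ok ? "ok" : "failed");
    return ok ? 0 : 1;
}
```

Comparing the ptxas report with and without the register cap shows whether the cap introduces local memory traffic, which is the situation described above.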
Thanks for your answers.