I have a kernel which uses 12 registers, causing sub-optimal warp occupancy according to the calculator. I need to reduce it to 10 registers to improve performance. Any advice?
It’s kind of hard to say without more info. Don’t read blockDim.x in the kernel if you already have the block size as a constant somewhere. Use shared memory instead of registers, even if you don’t need to “share” it with the other threads. Reuse whatever variables you can.
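A rough sketch of those two suggestions, not a guaranteed fix. The names kBlockSize, scale_add and run_scale_add are made up for illustration, and whether the compiler really ends up with fewer registers depends on the kernel and toolkit version. Compiling with nvcc --ptxas-options=-v prints the register count so you can compare before and after.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical compile-time block size, used instead of reading blockDim.x.
constexpr int kBlockSize = 128;

// Each thread parks its intermediate value in its own shared-memory slot
// instead of a local variable. No __syncthreads() is needed because a thread
// only ever touches scratch[threadIdx.x].
__global__ void scale_add(const float* in, float* out, int n)
{
    __shared__ float scratch[kBlockSize];
    int i = blockIdx.x * kBlockSize + threadIdx.x;
    if (i < n) {
        scratch[threadIdx.x] = in[i] * 2.0f;
        out[i] = scratch[threadIdx.x] + 1.0f;
    }
}

// Host helper: runs the kernel on 1000 elements and checks out[i] == 2*i + 1.
bool run_scale_add()
{
    const int n = 1000;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    // The launch must use the same block size the kernel was compiled for.
    int blocks = (n + kBlockSize - 1) / kBlockSize;
    scale_add<<<blocks, kBlockSize>>>(d_in, d_out, n);

    bool ok = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_in);
    cudaFree(d_out);
    for (int i = 0; ok && i < n; ++i) ok = (h_out[i] == 2.0f * i + 1.0f);
    return ok;
}
```

Call run_scale_add() from your own main to try it. Keep in mind that shared memory is itself a per-block resource, so moving too much into it can limit occupancy just like registers do.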
If you have constants that can be initialized on the host side, you can put them in constant memory (__constant__).
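Something along these lines, as a minimal sketch. The coefficient array, the polynomial and the function names are invented for the example. cudaMemcpyToSymbol is the runtime call that fills a __constant__ variable from the host.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical coefficients, set once from the host and read by every thread.
__constant__ float c_coeffs[4];

// Evaluates c0 + c1*x + c2*x^2 + c3*x^3 with Horner's rule.
__global__ void eval_poly(const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        y[i] = ((c_coeffs[3] * v + c_coeffs[2]) * v + c_coeffs[1]) * v + c_coeffs[0];
    }
}

// Host helper: uploads the coefficients, runs the kernel, checks the result.
bool run_eval_poly()
{
    const int n = 16;
    const size_t bytes = n * sizeof(float);
    const float h_coeffs[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    if (cudaMemcpyToSymbol(c_coeffs, h_coeffs, sizeof(h_coeffs)) != cudaSuccess) return false;

    std::vector<float> h_x(n), h_y(n, 0.0f);
    for (int i = 0; i < n; ++i) h_x[i] = static_cast<float>(i);

    float *d_x = nullptr, *d_y = nullptr;
    if (cudaMalloc(&d_x, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_y, bytes) != cudaSuccess) { cudaFree(d_x); return false; }
    cudaMemcpy(d_x, h_x.data(), bytes, cudaMemcpyHostToDevice);

    eval_poly<<<1, n>>>(d_x, d_y, n);

    bool ok = cudaMemcpy(h_y.data(), d_y, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_x);
    cudaFree(d_y);
    // Small integer inputs keep every intermediate value exact in float.
    for (int i = 0; ok && i < n; ++i) {
        float v = static_cast<float>(i);
        ok = (h_y[i] == ((4.0f * v + 3.0f) * v + 2.0f) * v + 1.0f);
    }
    return ok;
}
```

Constant memory works best when all threads in a warp read the same address, as they do here.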
If your kernel uses only 12 registers, even the current occupancy is probably achieving (or has the potential of achieving) very good latency hiding. Higher occupancy doesn’t always mean higher performance; how much it helps depends heavily on the application.
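If you want to check the occupancy number from code rather than the calculator, here is a hedged sketch using the runtime occupancy API. The saxpy kernel and theoretical_occupancy are placeholder names. The __launch_bounds__ qualifier is one way to give the compiler a register budget hint, and nvcc -maxrregcount=10 forces a cap, at the risk of spilling to local memory.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel. __launch_bounds__(256) promises the compiler that the
// kernel is never launched with more than 256 threads per block.
__global__ void __launch_bounds__(256)
saxpy(float a, const float* x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Returns (resident threads per SM) / (max threads per SM) for the given
// block size, or -1.0f if a runtime call fails.
float theoretical_occupancy(int blockSize)
{
    int device = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&device) != cudaSuccess) return -1.0f;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1.0f;

    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, saxpy, blockSize, 0) != cudaSuccess) {
        return -1.0f;
    }
    return static_cast<float>(blocksPerSM * blockSize) /
           static_cast<float>(prop.maxThreadsPerMultiProcessor);
}
```

For example, theoretical_occupancy(256) gives the fraction for 256-thread blocks on the current device. This is only the theoretical limit, so it says nothing about whether the kernel is actually latency bound.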
If you’re seeing performance issues with 12 registers, I’d recommend making sure your global memory accesses are coalesced and then minimizing shared memory bank conflicts (in that order) before spending effort on reaching 100% occupancy.
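To illustrate what coalesced means here, a small sketch with two copy kernels that do the same work with different thread-to-address mappings. The kernel names, the stride of 32 and the problem size are arbitrary choices for the example. Timing them with a profiler or cudaEvent timers should show the difference on most hardware.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Adjacent threads touch adjacent floats, so a warp's accesses are contiguous.
__global__ void copy_coalesced(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Adjacent threads touch floats `stride` elements apart, so a warp's accesses
// are scattered. Assumes n is a multiple of stride.
__global__ void copy_strided(const float* in, float* out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int rows = n / stride;
        int j = (i % rows) * stride + (i / rows);  // a permutation of 0..n-1
        out[j] = in[j];
    }
}

// Host helper: runs both kernels and checks that each reproduces the input.
bool run_copies()
{
    const int n = 1 << 20, stride = 32, block = 256;
    const size_t bytes = n * sizeof(float);
    std::vector<float> h_in(n), h_a(n, -1.0f), h_b(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i % 1024);

    float *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_a, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    if (cudaMalloc(&d_b, bytes) != cudaSuccess) { cudaFree(d_in); cudaFree(d_a); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = (n + block - 1) / block;
    copy_coalesced<<<blocks, block>>>(d_in, d_a, n);
    copy_strided<<<blocks, block>>>(d_in, d_b, n, stride);

    bool ok = cudaMemcpy(h_a.data(), d_a, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    ok = ok && cudaMemcpy(h_b.data(), d_b, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_in);
    cudaFree(d_a);
    cudaFree(d_b);
    for (int i = 0; ok && i < n; ++i) ok = (h_a[i] == h_in[i]) && (h_b[i] == h_in[i]);
    return ok;
}
```

Both kernels use the same tiny number of registers, so any speed difference between them comes from the memory access pattern alone.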