Hi, I have a kernel that uses 40 registers per thread, which is a bit too many, so I want to reduce it.
The kernel contains two big chunks of code, 20+ lines each, that are almost identical and use a lot of registers. The only difference between the two chunks is their input parameter.
If either chunk is removed, register usage drops to around 27, and if both are removed, it drops to ZERO. So one approach I can think of is to keep a single copy of the chunk instead of two similar ones.
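For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of reading the per-thread register count at runtime with `cudaFuncGetAttributes`. The kernel `two_chunk_kernel`, its arithmetic, and the helper `registers_per_thread` are hypothetical stand-ins, not the real code from this post; they only mirror the "same chunk, two inputs" layout.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: the same arithmetic is applied
// to two different inputs, mirroring the "two similar chunks" layout.
__global__ void two_chunk_kernel(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = a[i];
    float r0 = x * x + 2.0f * x + 1.0f;   // chunk 1, input a

    float y = b[i];
    float r1 = y * y + 2.0f * y + 1.0f;   // chunk 2, input b

    out[i] = r0 + r1;
}

// Returns the number of registers per thread the compiled kernel uses,
// or -1 if the query fails (for example, when no CUDA device is available).
int registers_per_thread()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, two_chunk_kernel);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                     cudaGetErrorString(err));
        return -1;
    }
    return attr.numRegs;
}
```

Compiling with `nvcc -Xptxas -v` should report the same figure at build time. Note that a kernel whose results are never written out can have its body removed by the compiler entirely, so a count measured after deleting code is not always a fair baseline.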
I’ve tried a few ways:
(1) Making the chunk a device function. This didn’t work: judging from the PTX, the compiler inlines the device function into the caller’s body, so it is still 40 registers.
(2) Putting the chunk in a for loop that runs twice. This didn’t work either.
(3) Using an (evil) goto to jump back and run the chunk a second time. This still doesn’t work.
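To make attempts (1) and (2) concrete, here is a rough sketch of what they might look like. The chunk body, the names `chunk`, `looped_kernel` and `run_once`, and the two hints `__noinline__` and `#pragma unroll 1` are assumptions added for illustration; the post does not say whether they were used. Both are only hints asking the compiler not to inline the function and not to unroll the loop, and whether the register count actually drops depends on the compiler version and target architecture.

```cuda
#include <cuda_runtime.h>

// Hypothetical chunk body; the real one is 20+ lines.
// __noinline__ is a hint asking the compiler not to inline this function.
__device__ __noinline__ float chunk(float x)
{
    float acc = x;
    for (int k = 0; k < 4; ++k)
        acc = acc * x + 1.0f;
    return acc;
}

__global__ void looped_kernel(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float sum = 0.0f;
    // "unroll 1" asks the compiler to keep this as a real loop instead of
    // expanding it back into two copies of the chunk.
    #pragma unroll 1
    for (int pass = 0; pass < 2; ++pass) {
        float in = (pass == 0) ? a[i] : b[i];  // the only per-pass difference
        sum += chunk(in);
    }
    out[i] = sum;
}

// Host helper (hypothetical): runs the kernel on one pair of inputs and
// returns chunk(a) + chunk(b), or -1.0f if any CUDA call fails.
float run_once(float a, float b)
{
    float *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    float result = -1.0f;

    if (cudaMalloc(&d_a, sizeof(float)) == cudaSuccess &&
        cudaMalloc(&d_b, sizeof(float)) == cudaSuccess &&
        cudaMalloc(&d_out, sizeof(float)) == cudaSuccess &&
        cudaMemcpy(d_a, &a, sizeof(float), cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(d_b, &b, sizeof(float), cudaMemcpyHostToDevice) == cudaSuccess) {
        looped_kernel<<<1, 1>>>(d_a, d_b, d_out, 1);
        float host_out = 0.0f;
        // The blocking copy also waits for the kernel to finish.
        if (cudaMemcpy(&host_out, d_out, sizeof(float),
                       cudaMemcpyDeviceToHost) == cudaSuccess)
            result = host_out;
    }

    cudaFree(d_a);   // cudaFree on a null pointer is a no-op
    cudaFree(d_b);
    cudaFree(d_out);
    return result;
}
```

Building this with `nvcc -Xptxas -v` and comparing against the two-copy version would show whether the hints made any difference on a given setup.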
If anyone has had a similar experience or has any suggestions, please let me know.