save registers

Hi, I have a kernel that uses 40 registers per thread, which is a bit too many, so I want to reduce that.

In the kernel, there are two big parts that are almost the same: each is 20+ lines and uses lots of registers. The only difference is the input parameter to these two chunks of code:

If either part is removed, the register usage drops to around 27, and if both are removed, it drops to ZERO. So one way I can think of is to have just one chunk instead of two chunks of similar code.

I’ve tried a few ways:
(1) making the chunk a device function; that didn’t work because, from the PTX, it seems the compiler inlines the device function into the caller body, so it is still 40 registers
(2) putting the chunk in a for loop that runs twice; that didn’t work either
(3) using an (evil) goto to jump back and run the chunk twice; that still doesn’t work
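
For reference, a minimal sketch of what attempt (1) could look like when inlining is explicitly discouraged with the `__noinline__` qualifier. The real chunk is not shown in this post, so `process_chunk`, `two_pass`, `run_two_pass`, and the arithmetic are hypothetical stand-ins. `__noinline__` is only a hint, and whether the compiler honors it (and whether it lowers the register count at all) depends on the toolkit version and target architecture, so the result should be checked with `nvcc --ptxas-options=-v`.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the repeated 20+ line chunk.
// __noinline__ asks the compiler not to merge this into the caller.
__device__ __noinline__ float process_chunk(float x, float param)
{
    float acc = x;
    for (int i = 0; i < 4; ++i)
        acc = acc * param + x;
    return acc;
}

// Hypothetical kernel: the same chunk runs twice with a different parameter.
__global__ void two_pass(const float* in, float* out, int n, float p0, float p1)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    float x = in[idx];
    out[idx] = process_chunk(x, p0) + process_chunk(x, p1);
}

// Host helper (assumption: non-empty input, error checking omitted for brevity).
std::vector<float> run_two_pass(const std::vector<float>& h_in, float p0, float p1)
{
    const int n = static_cast<int>(h_in.size());
    std::vector<float> h_out(n, 0.0f);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    two_pass<<<blocks, threads>>>(d_in, d_out, n, p0, p1);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling `run_two_pass` from a host `main` with a small input is enough to confirm the refactored kernel still produces the same values before comparing register counts.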

If anyone has any similar experience or any suggestion, please let me know.

Add the volatile qualifier to a few of the local variable declarations within the kernel. I am not in the mood to explain why it works, but it is generally known as the “volatile trick”.
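
A minimal sketch of the trick, assuming a toy kernel since the original one is not shown. The names `scale_add`, `run_scale_add`, and `scale_add_regs` are hypothetical. The effect varies a lot between compiler versions and may be nil or even negative, so it is worth comparing the register count with and without `volatile`, either via `nvcc --ptxas-options=-v` or at runtime with `cudaFuncGetAttributes` as below.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel used only to show where the qualifier goes.
__global__ void scale_add(const float* in, float* out, int n, float a, float b)
{
    // "volatile trick": selected locals are declared volatile.
    volatile int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        volatile float x = in[idx];
        out[idx] = a * x + b;
    }
}

// Registers per thread as reported by the runtime, or -1 on error.
int scale_add_regs()
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, scale_add) != cudaSuccess)
        return -1;
    return attr.numRegs;
}

// Host helper (assumption: non-empty input, error checking omitted for brevity).
std::vector<float> run_scale_add(const std::vector<float>& h_in, float a, float b)
{
    const int n = static_cast<int>(h_in.size());
    std::vector<float> h_out(n, 0.0f);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    scale_add<<<blocks, threads>>>(d_in, d_out, n, a, b);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Building the kernel once with the `volatile` qualifiers and once without, then comparing the value returned by `scale_add_regs`, shows whether the trick helps for a given compiler and GPU.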

This can reduce execution speed slightly, but it saves a few registers. I doubt, though, that you can get from 40 registers down to 32.