I am certain this is documented somewhere, but I’m having trouble tracking it down: what factors affect how many registers a kernel uses?
You can find the number of registers a kernel uses in the *.cubin file (compiling with nvcc --ptxas-options=-v also prints it).
From the NVIDIA CUDA Programming Guide, Appendix A:
“The number of registers per multiprocessor is 8192”
If one thread uses 14 registers and you have 64 threads per block, the block needs 14 * 64 = 896 registers in total.
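Carrying that arithmetic one step further shows how many such blocks the register file alone allows. This is a minimal C++ sketch, assuming the 8192-register figure quoted above and ignoring every other per-multiprocessor limit (threads, blocks, shared memory) as well as any rounding the hardware applies when it allocates registers. The function names are hypothetical.

```cpp
#include <cassert>

// Registers one block needs: every thread in the block gets the same count.
int registers_per_block(int regs_per_thread, int threads_per_block) {
    assert(regs_per_thread > 0 && threads_per_block > 0);
    return regs_per_thread * threads_per_block;
}

// Upper bound on resident blocks per multiprocessor imposed by the register
// file alone. Integer division truncates because a partial block cannot run.
int blocks_limited_by_registers(int regs_per_thread, int threads_per_block,
                                int registers_per_sm) {
    return registers_per_sm / registers_per_block(regs_per_thread, threads_per_block);
}
```

For the numbers above, blocks_limited_by_registers(14, 64, 8192) is 8192 / 896 truncated, i.e. 9 blocks; other limits on the multiprocessor may cap it lower.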
Actually, I am trying to figure out how to predict how many registers will be needed for a given kernel, so I can try to intelligently control the number of registers used. I know how to find the actual number used, but the numbers are not intuitive. For example, removing a local variable from the kernel sometimes seems to make the number of registers increase.
The question came up for me when I was trying to tune a kernel. It was using 16 registers at the time. I could consider caching certain variables, passing in additional data to the kernel, or using more shared memory to improve performance. If the number of registers increased to 17, the number of blocks that could execute simultaneously, and with it the maximum occupancy, would drop substantially.
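The cliff between 16 and 17 registers can be seen with a small calculation. This sketch assumes a hypothetical block size of 256 threads, 8192 registers per multiprocessor, and a maximum of 768 resident threads per multiprocessor (treat both hardware figures as assumptions for your card), with registers as the only limiting resource.

```cpp
#include <cassert>

// Resident threads per multiprocessor when registers are the only limit.
// All figures are passed in by the caller; nothing here queries real hardware.
int resident_threads(int regs_per_thread, int threads_per_block, int registers_per_sm) {
    assert(regs_per_thread > 0 && threads_per_block > 0);
    int blocks = registers_per_sm / (regs_per_thread * threads_per_block);
    return blocks * threads_per_block;
}

// Occupancy as a fraction of the maximum resident threads per multiprocessor.
double occupancy(int regs_per_thread, int threads_per_block,
                 int registers_per_sm, int max_threads_per_sm) {
    int threads = resident_threads(regs_per_thread, threads_per_block, registers_per_sm);
    if (threads > max_threads_per_sm) threads = max_threads_per_sm;
    return static_cast<double>(threads) / max_threads_per_sm;
}
```

With 16 registers a 256-thread block needs 4096 registers, so two blocks (512 threads) fit; with 17 it needs 4352, so only one block (256 threads) fits, halving occupancy from about 0.67 to about 0.33.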
As a result, I am looking to get some idea of how my potential optimizations may affect the number of registers used at compile time.
You mean that you want to control the number of registers your program needs?
I think that is impossible.
By the way, as far as I know, you can specify the maximum number of registers each thread may use (the nvcc option -maxrregcount).
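To pick a value for such a cap, one approach is to work backwards from the number of threads you want resident on each multiprocessor. A minimal sketch, assuming registers are the only limit; the helper name is hypothetical. Note that forcing a lower cap than the compiler would choose can make it spill values to local memory, which is slow, so the cap is worth measuring rather than assuming.

```cpp
#include <cassert>

// Largest per-thread register count that still lets `target_threads` threads
// be resident on one multiprocessor when registers are the only limit.
// The result is a candidate value for a register cap such as -maxrregcount.
int register_budget(int registers_per_sm, int target_threads) {
    assert(registers_per_sm > 0 && target_threads > 0);
    return registers_per_sm / target_threads;
}
```

For example, register_budget(8192, 512) is 16, which matches the 16-register kernel mentioned earlier in the thread, while register_budget(8192, 256) allows 32.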
Maybe control is too strong of a word. I am just looking for general guidelines to help me predict whether a particular change is likely to affect the number of registers used by the kernel. If a change makes a difference, I would like to have a ballpark idea of which way it goes (i.e. more registers or fewer).
Here’s an example. I have a kernel that uses 35 registers. If I can reduce it to 32 registers, I can significantly increase the occupancy. If I want to reduce register usage, should I try to:
- Minimize the number of kernel pass parameters?
- Minimize the size of pass parameters?
- Reduce the number/size of local variables?
- Something entirely different?
Intuitively, I would expect that eliminating pass parameters or local variables would reduce register usage; however, I am fairly certain that I have seen it go up after performing such actions.
You may be able to reduce the total number of registers per thread.
I have a suggestion: why don’t you use shared memory in place of some registers? Pay attention to bank conflicts.
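Moving values from registers into shared memory is not free either, because shared memory is also divided among the blocks on a multiprocessor. This sketch models the trade-off under several assumptions: each moved value is a 4-byte word stored once per thread, the compiler frees exactly one register per moved value (it may not), the multiprocessor has 8192 registers and 16 KB of shared memory (check the figures for your card), and nothing else limits residency. The helper names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Shared memory one block needs if each thread keeps `values_per_thread`
// 4-byte values in shared memory instead of registers.
int shared_bytes_per_block(int values_per_thread, int threads_per_block) {
    return values_per_thread * 4 * threads_per_block;
}

// Resident blocks per multiprocessor when both registers and shared memory
// are limited; the tighter of the two limits wins.
int resident_blocks(int regs_per_thread, int values_in_shared, int threads_per_block,
                    int registers_per_sm, int shared_bytes_per_sm) {
    assert(regs_per_thread > 0 && threads_per_block > 0);
    int by_regs = registers_per_sm / (regs_per_thread * threads_per_block);
    int smem = shared_bytes_per_block(values_in_shared, threads_per_block);
    int by_smem = smem > 0 ? shared_bytes_per_sm / smem : by_regs;
    return std::min(by_regs, by_smem);
}
```

With 256-thread blocks, moving one value out of a 17-register kernel (17 to 16 registers, 1 KB of shared memory per block) raises residency from one block to two. Moving eight values (down to 9 registers) would allow three blocks by registers, but the 8 KB of shared memory per block caps it at two, so past some point shared memory becomes the limit instead.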
I have never tried it myself, but I hope it works for you. Tell me if you succeed.