I compiled with the --cubin option and printed the register count per thread. I then counted the register variables defined in my kernel, and the two numbers roughly match.
However, my code has many places where the right-hand side (RHS) requires a bunch of floating-point operations, such as
I imagine these operations also need temporary registers to hold the intermediate results while the RHS is evaluated. My question is: does the register count reported in the cubin file include these temporary registers? In other words, do these temporary registers count against the 8192-register limit?
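To make the arithmetic behind that limit concrete, here is a minimal host-side sketch in plain C++. It assumes the per-thread count printed in the cubin is the number that is charged against the 8192 registers, which is exactly the point in question. The function names are hypothetical, and the sketch ignores allocation granularity and any other launch limits.

```cpp
#include <cassert>

// Hypothetical helper: the largest number of threads whose registers fit
// in the register file, assuming every thread is charged regs_per_thread.
int max_threads_for_registers(int register_file_size, int regs_per_thread) {
    assert(register_file_size > 0 && regs_per_thread > 0);
    return register_file_size / regs_per_thread;  // integer division rounds down
}

// Hypothetical helper: true if a block of threads_per_block threads stays
// within the register budget under the same assumption.
bool block_fits(int register_file_size, int regs_per_thread, int threads_per_block) {
    return threads_per_block <=
           max_threads_for_registers(register_file_size, regs_per_thread);
}
```

For example, with made-up counts: at 16 registers per thread a 512-thread block would just fit in 8192 registers, while at 20 registers per thread the budget allows only 409 threads, so a 512-thread block would not.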
The reason I ask is that I get a “the launch timed out and was terminated” error when I set the thread count to a larger value. From what I found online, I think this is related to the register limit. Would anyone like to share their experience with this?
Thank you in advance.