I have a kernel that uses 10 registers, but if I compile it with --maxrregcount=16, it uses 9 registers.
Is that normal? I'd rather keep it at 10 registers: as I understand it, the optimizations triggered by maxrregcount often add a little overhead (and I'm not talking about register spilling).
PS: I use CUDA 3.1.
Does the __launch_bounds__() qualifier (see appendix B.16 of the Programming Guide) produce a better result?
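For anyone who wants to try this, here is a minimal sketch of what I mean. The kernel `scale` and the bound of 256 threads per block are made up for illustration and have nothing to do with the original poster's code. Unlike --maxrregcount, which applies to every kernel in the file, __launch_bounds__() is set per kernel. The host code reads back the register count with cudaFuncGetAttributes, so the two builds can be compared without guessing (compiling with nvcc --ptxas-options=-v should report the same numbers at build time).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only.
// __launch_bounds__(256) tells the compiler this kernel will never be
// launched with more than 256 threads per block, so it can choose its
// own register budget for this kernel instead of a file-wide cap.
__global__ void __launch_bounds__(256) scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    // Ask the runtime what the compiler actually allocated for the kernel.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scale);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("registers per thread:    %d\n", attr.numRegs);
    // Non-zero local memory can be a sign of register spilling
    // (it can also come from per-thread arrays).
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    return 0;
}
```

Building this once with and once without --maxrregcount, and comparing the printed values, should show whether the flag changes the allocation for a given kernel. I have not checked how this behaves on CUDA 3.1 specifically.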
Among other strange results:
regcount = 25, maxrregcount = 33 => kernel time = 80 ms
regcount = 25, maxrregcount = 54 => kernel time = 100 ms
Both were measured with CUDA 2.1. I never had time to investigate further; maybe someone knows why?