I have to compute a lengthy Taylor series expansion of the form:
s = s*t + constant1
s = s*t + constant2
...
s = s*t + constant40
I have to compute this for several inputs inside my kernel, and the problem shows up in the output of cuobjdump. The logic in my kernel shouldn't be register-heavy: it computes the expansion on only a few (4 to 8) variables, so it should need little more than 8 registers for the temporaries (plus a few extra for the pointers used to write out the results). Yet the compiled kernel uses the maximum number of registers available, plus a kilobyte of local memory for intermediate results.
To investigate further, I rewrote my kernel in PTX (using fma.rn) and got almost the same register usage. I then ran cuobjdump to see what my code looks like in the final cubin file and discovered the following:
For every constant used in the calculation, ptxas allocates a separate register and loads the constant into it; further down, that register is used in an fma instruction to accumulate the expansion. Since I have many constants, ptxas grabs almost all of the registers just to hold constants and starts moving the accumulation variables to local memory.
The final code generated by ptxas looks like this:
mov r1, constant1
mov r2, constant2
...
mov r50, constant50
As you can see, not only does this eat all the registers (and start spilling to the stack), it also adds an extra mov instruction for every fma.
What I'd expect is for the fma instructions to fetch the constants from constant memory directly (which actually happens for some predefined constants like 1.0, 0.5, etc.).
But no matter how I write the code, whether with immediate constants or with explicit constant tables, it always ends up loading the constants into registers instead of using them directly.
I'm using the latest toolkit, 5.0.
Can anyone help me beat this ptxas stupidity?