An algorithm I’m porting from C++ is huge by the time it gets to PTX, using around 54 registers (the count changes every time I compile, including a few times when I didn’t change any source files, so it’s hard to say for sure). I know I can get it under 24 registers if I write it by hand, and the compiler has been driving me up the wall, so I’m just going to write it in PTX. However, there are several other algorithms that will work just fine in a .cu file, and I don’t want to write those in PTX if I don’t have to.
What’s the best way to combine .cu and handwritten .ptx? In particular, the code in the .cu file needs the constant cache to attain good performance. There doesn’t appear to be a way to copy to constant space with the driver API, so I’d have to use the runtime API with all its C++ goodness. I’d prefer to use the driver API, but this feature is critical. If anyone knows how to do it, or sees something in the reference manual that I’m missing, let me know.
Thanks for your help,