Mixing PTX and C++, or: copy to __constant__ with the driver API?

Hey, all:

An algorithm I’m porting comes out bloated when written in C++: by the time it reaches PTX it uses 54 registers or so. (The count changes on every compile, including a few times when I didn’t change any source files, so it’s hard to say for sure.) I know I can get it under 24 registers by hand, and the compiler has been driving me up the wall, so I’m just going to write it in PTX. However, there are several other algorithms that will work just fine in a .cu, and I don’t want to write those in PTX if I don’t have to.

What’s the best way to combine .cu and handwritten .ptx? In particular, the code in the .cu needs the constant cache to get good performance. There doesn’t appear to be a way to copy to constant space with the driver API, so I’d have to use the runtime API with all its C++ goodness. I’d prefer to use the driver API, but this feature is critical. If anyone knows how to do it, or sees something in the reference manual that I’m missing, let me know.

Thanks for your help,

Put a global __constant__ variable in your module and look it up with cuModuleGetGlobal. That should give you a device pointer to it, which you can then fill like any other array.
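A minimal host-side sketch of that suggestion, assuming a current CUDA toolkit (where the size argument of cuModuleGetGlobal is a size_t). The module file name kernels.cubin and the symbol name coeffs are hypothetical; the module is assumed to define the symbol as extern "C" { __constant__ float coeffs[16]; } so that its name is not mangled.

```cuda
#include <cuda.h>
#include <cstdio>
#include <cstdlib>

// Abort with the numeric CUresult if a driver API call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        CUresult r_ = (call);                                         \
        if (r_ != CUDA_SUCCESS) {                                     \
            fprintf(stderr, "%s failed: CUresult %d\n", #call, (int)r_); \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuDevicePrimaryCtxRetain(&ctx, dev));
    CHECK(cuCtxSetCurrent(ctx));

    // Hypothetical module; assumed to contain
    //   extern "C" { __constant__ float coeffs[16]; }
    CHECK(cuModuleLoad(&mod, "kernels.cubin"));

    // Look up the device address and size of the __constant__ symbol.
    CUdeviceptr dptr;
    size_t bytes;
    CHECK(cuModuleGetGlobal(&dptr, &bytes, mod, "coeffs"));

    float host[16];
    for (int i = 0; i < 16; ++i) host[i] = 0.5f * i;

    // Guard against a mismatch between host buffer and device symbol.
    if (bytes != sizeof(host)) {
        fprintf(stderr, "coeffs is %zu bytes, expected %zu\n", bytes, sizeof(host));
        return EXIT_FAILURE;
    }

    // Fill the constant array with an ordinary host-to-device copy.
    CHECK(cuMemcpyHtoD(dptr, host, bytes));

    CHECK(cuModuleUnload(mod));
    CHECK(cuDevicePrimaryCtxRelease(dev));
    return 0;
}
```

Kernels in the same module that read coeffs will see the copied values on their next launch. Link against the driver library (for example with -lcuda).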

If it works, please write here :).

If you are using the driver API, then the modules you load are already compiled to cubin (or are PTX that the driver compiles at load time), so it doesn’t matter whether they were written in C or PTX assembly. Combining them should be no problem. But that’s again a guess :).
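As a rough illustration of that guess, here is a sketch that loads one module built from a .cu file and one handwritten PTX file into the same context. All file and kernel names (compiled.cubin, handwritten.ptx, easy_kernel, tuned_kernel) are hypothetical, the kernels are assumed to be declared extern "C" and to take (float *, int), and cuLaunchKernel is the launch call of current toolkits, not necessarily the one available when this thread was written.

```cuda
#include <cuda.h>
#include <cstdio>
#include <cstdlib>

// Abort with the numeric CUresult if a driver API call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        CUresult r_ = (call);                                         \
        if (r_ != CUDA_SUCCESS) {                                     \
            fprintf(stderr, "%s failed: CUresult %d\n", #call, (int)r_); \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main() {
    CUdevice dev;
    CUcontext ctx;
    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuDevicePrimaryCtxRetain(&ctx, dev));
    CHECK(cuCtxSetCurrent(ctx));

    // cuModuleLoad accepts cubin files as well as PTX text files;
    // PTX is compiled by the driver when the module is loaded.
    CUmodule compiledMod, handMod;
    CHECK(cuModuleLoad(&compiledMod, "compiled.cubin"));   // built from a .cu
    CHECK(cuModuleLoad(&handMod, "handwritten.ptx"));      // written by hand

    // Hypothetical kernel names, assumed unmangled.
    CUfunction easyKernel, tunedKernel;
    CHECK(cuModuleGetFunction(&easyKernel, compiledMod, "easy_kernel"));
    CHECK(cuModuleGetFunction(&tunedKernel, handMod, "tuned_kernel"));

    // Both kernels operate on the same device buffer.
    int n = 1024;
    CUdeviceptr buf;
    CHECK(cuMemAlloc(&buf, n * sizeof(float)));
    CHECK(cuMemsetD8(buf, 0, n * sizeof(float)));

    void *args[] = { &buf, &n };
    unsigned int block = 256;
    unsigned int grid = (n + block - 1) / block;

    CHECK(cuLaunchKernel(easyKernel, grid, 1, 1, block, 1, 1, 0, 0, args, 0));
    CHECK(cuLaunchKernel(tunedKernel, grid, 1, 1, block, 1, 1, 0, 0, args, 0));
    CHECK(cuCtxSynchronize());

    CHECK(cuMemFree(buf));
    CHECK(cuModuleUnload(handMod));
    CHECK(cuModuleUnload(compiledMod));
    CHECK(cuDevicePrimaryCtxRelease(dev));
    return 0;
}
```

One thing to keep in mind with this layout: a __constant__ variable belongs to the module that defines it, so each module that needs constant data has to have its own symbol looked up and filled.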

If registers are a problem, try the compiler option --maxrregcount=…

Indeed it does! I’m not sure how memory is distinguished between being const and just global - maybe it’s just by address - but my tests show that it works. Thanks for the tip.