Setting register count per __global__ function?

When compiling a kernel file, you have some control over how many registers the __global__ functions inside it are allowed to use through nvcc's “-maxrregcount” parameter. However, this applies to ALL the __global__ functions in the kernel file. As we cannot[*] compile multiple kernel files and link them together into a single cubin / object, we are stuck with a single maximum count for all the kernels. For example, I want to limit the register usage as follows:

__global__ A compiles to 17 registers, limit to 16
__global__ B compiles to 9 registers, limit to 8

The only “solution” I have found so far is to compile two different .cu files, each with a different -maxrregcount, and use the driver API to load the resulting modules separately. However, if these functions share global data or something similar, I have to get handles to that data in both cubin files, etc., which adds a lot of overhead.
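To make the overhead concrete, here is a minimal sketch of that two-module workaround using the driver API. The file names, the kernel names "A" and "B" (which assumes the kernels are declared extern "C" so the names are not mangled), the variable name "sharedData", and the KernelPair struct are all hypothetical; cleanup of the modules and context is left out.

```cuda
#include <cuda.h>    // CUDA driver API (link with -lcuda)
#include <cstdio>

// Hypothetical bundle of handles for two separately compiled cubins.
struct KernelPair {
    CUmodule    modA, modB;
    CUfunction  fnA, fnB;
    CUdeviceptr dataA, dataB;  // each module has its OWN copy of the global
};

// Report a failed driver call and turn the result into a bool.
static bool ok(CUresult r, const char* what) {
    if (r != CUDA_SUCCESS) {
        std::fprintf(stderr, "%s failed (error %d)\n", what, (int)r);
        return false;
    }
    return true;
}

// Loads both cubins and looks up every handle twice, once per module.
// Stops at the first failing call and returns false.
bool loadKernelPair(const char* cubinA, const char* cubinB, KernelPair* kp) {
    CUdevice  dev;
    CUcontext ctx;
    size_t    bytes;
    return ok(cuInit(0), "cuInit")
        && ok(cuDeviceGet(&dev, 0), "cuDeviceGet")
        && ok(cuDevicePrimaryCtxRetain(&ctx, dev), "cuDevicePrimaryCtxRetain")
        && ok(cuCtxSetCurrent(ctx), "cuCtxSetCurrent")
        && ok(cuModuleLoad(&kp->modA, cubinA), "cuModuleLoad(A)")
        && ok(cuModuleLoad(&kp->modB, cubinB), "cuModuleLoad(B)")
        && ok(cuModuleGetFunction(&kp->fnA, kp->modA, "A"), "cuModuleGetFunction(A)")
        && ok(cuModuleGetFunction(&kp->fnB, kp->modB, "B"), "cuModuleGetFunction(B)")
        && ok(cuModuleGetGlobal(&kp->dataA, &bytes, kp->modA, "sharedData"),
              "cuModuleGetGlobal(A)")
        && ok(cuModuleGetGlobal(&kp->dataB, &bytes, kp->modB, "sharedData"),
              "cuModuleGetGlobal(B)");
}
```

Because dataA and dataB point at two distinct device allocations, any "shared" value has to be copied into both, which is the bookkeeping described above.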

Is it possible to set a maximum register count per function, or is there a better way to achieve this?

[*] Please correct me if I am wrong and this is already possible; I have not found anything like it so far.

I don’t think you need the driver API to load the compiled code; you can compile each file to an object and let the linker combine them.

nvcc -c A.cu --maxrregcount=16 -o A.o
nvcc -c B.cu --maxrregcount=8 -o B.o
nvcc master.cu A.o B.o -o cudaProgram

… you’ll have to play with header files and includes and externs, but it shouldn’t be too bad.
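As a rough sketch of what the header / extern arrangement could look like: each .cu file defines its kernel plus a plain host wrapper that launches it, and master.cu only ever calls the wrappers. The names kernels.h, launchA, launchB and the kernel bodies are made up for illustration; the three files are shown in one block, with comments marking where each part would live.

```cuda
#include <cuda_runtime.h>

// --- kernels.h (hypothetical): the only thing master.cu needs to include ---
void launchA(float* d_data, int n);
void launchB(float* d_data, int n);

// --- A.cu: compiled on its own, with its own register limit ---
__global__ void A(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // placeholder work
}

// Host wrapper: the launch stays in the file that defines the kernel.
void launchA(float* d_data, int n) {
    A<<<(n + 255) / 256, 256>>>(d_data, n);
}

// --- B.cu: compiled on its own, with a different register limit ---
__global__ void B(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;   // placeholder work
}

void launchB(float* d_data, int n) {
    B<<<(n + 255) / 256, 256>>>(d_data, n);
}
```

Since every kernel is launched from the file that defines it, master.cu sees only ordinary host functions and the objects link like any other C++ code. A __device__ variable referenced from both A.cu and B.cu is a different matter: as far as I know, that needs relocatable device code (nvcc -rdc=true, available since CUDA 5.0).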