how to deal with __launch_bounds__() when there is a device function called by the global function?


I’m encountering a problem with _launch_bounds(). I want to set up the number of registers per thread manually, so I put _launch_bounds() just in the definition of my global function. However, I’m failed to build the code, I was always told I can’t let a function with maximum registers per thread of 64
call a function with registers of 70. In my opinion, it meant that I was not able to set up my expected number of registers. Then, I realized that there is a device function called by my global function, and I’m pretty sure it causes the trouble. However, I don’t know how to deal with this circumstance, 'cause launch_bounds only works on global function.

Any ideas?


It is no clear what your code actually looks like, and how it is being compiled. Here are two generic ideas:

(1) If you are building the code from a single source file (allowing the compiler to use whole program optimizations), try adding a forceinline attribute for the device function.

(2) If you are using multiple source files and separate compilation, try adding the compiler flag -maxrregcount=63 for the file that contains the device function in question.


Actually, your first solution works good to my problem. Thanks a lot. I have a question about your second suggestion. Do you know hot to set up the compiler flag -maxregcount in VS2010?

I am glad to hear you got things to work. I never use the Visual Studio IDE, for my Windows builds I use makefiles and commandline invocations of nvcc just as I do under Linux,. This makes it trivial to add the -maxrregcount flag where needed. So I am afraid I do not have a suggestion on how to add that flag to a Visual Studio build.

For Visual Studio 2010 you can set maxrregcount in the property page under /ConfigurationProperties/CUDA C/C++ / Device

Thank you so much

Fortunately, the guy above gave me a hint. However, do you know why the compiler cannot build my code without forceinline ? My code just like what I list below.

device YYY(){ <—here is the problem, the registers per thread is set to 70 by force.

global void launch_bounds(512, 2)
XXX() <------I set MinBlocksPerMP to 2, so that I have 64 registers per thread



I have not personally run into the situation you are seeing, but my reasoning is as follows: GPU resources are alloctaed at kernel launch time based on the propertier of the global function. This includes the allocation of registers. So a called subroutine (device function) that needs more registers cannot be called from a global function that provides fewer registers. This is different from CPUs where the number of registers is fixed for all functions.

By forcing the compiler to inline the device function you presumably wind up with a global function requiring more registers than the original global function, but due to the inlining there is no longer a conflict in register usage. Alternatively, if you compile all source files with the same -maxrregcount setting, you are creating the equivalent of the situation on CPUs where all code always assumes the same number of registers (except that you as a programmer get to choose that number). The GPU approach (that is, variable number of registers) gives greater flexibility for performance trade-offs but occassionaly leads to the kind of conflict you encountered here.