We have a kernel of the following form:
__global__ void streamclassifier( inputs ) {
    #include "header1.h"
    #include "header2.h"
    .
    .
    .
    #include "headerN.h"
    if (threadIdx.x < 32) {
        // call the device function in header1.h
    } else if (threadIdx.x < 64) {
        // call the device function in header2.h
    }
    …
    …
    …
    …
    else if (threadIdx.x < N*32) {
        // call the device function in headerN.h
    }
}
Here N could be as high as 140. Each device function that a warp calls performs a lot of bit manipulation (ANDing and ORing). Because all the header files are included in the same kernel, we get errors about virtual register allocation; specifically, the compiler runs out of virtual registers. We tried --opencc-options -OPT:Olimit=0, but the compiler ran out of memory and we had to reboot the machine. We have Tesla cards. At most, we can compile this with two header files (N = 2).
We are doing this so that, when we launch the kernel, each warp of threads computes a different function. Is there a better way to do what we want, such as a compiler switch that reuses the virtual registers?
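To make the per-warp dispatch concrete, here is a minimal sketch for N = 2. It is not our real code: classify_a, classify_b, dispatch, run_streamclassifier and the kernel signature are hypothetical stand-ins, and the bit operations are made up for illustration. It selects the function by warp index (threadIdx.x / 32) with a switch, which is equivalent to the if/else chain above.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-ins for the device functions that live in header1.h and
// header2.h; the real ones do far more ANDing and ORing.
__host__ __device__ unsigned int classify_a(unsigned int x) {
    return (x & 0x0F0Fu) | 0x8000u;
}

__host__ __device__ unsigned int classify_b(unsigned int x) {
    return (x | 0x00FFu) & 0x7FFFu;
}

// Select a function by warp index. Warps beyond the last case return 0.
__host__ __device__ unsigned int dispatch(unsigned int warp, unsigned int x) {
    switch (warp) {
        case 0:  return classify_a(x);
        case 1:  return classify_b(x);
        default: return 0u;
    }
}

// Each warp of 32 threads computes a different function on its own inputs.
__global__ void streamclassifier(const unsigned int *in, unsigned int *out,
                                 unsigned int n) {
    const unsigned int tid = threadIdx.x;
    if (tid < n) {
        out[tid] = dispatch(tid / 32u, in[tid]);
    }
}

// Host helper (assumption: a single block, so in.size() must not exceed the
// device's maximum threads per block). Returns all zeros if allocation fails.
std::vector<unsigned int> run_streamclassifier(const std::vector<unsigned int> &in) {
    const unsigned int n = static_cast<unsigned int>(in.size());
    std::vector<unsigned int> out(n, 0u);
    if (n == 0) return out;

    const size_t bytes = n * sizeof(unsigned int);
    unsigned int *d_in = nullptr;
    unsigned int *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return out;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);
        return out;
    }

    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);
    streamclassifier<<<1, n>>>(d_in, d_out, n);
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

Calling run_streamclassifier with 64 inputs launches two warps: threads 0 to 31 go through classify_a and threads 32 to 63 through classify_b. Our real kernel has the same shape, only with up to 140 cases and much larger function bodies.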