Very large kernels: How to compile a large CUDA kernel?

We have a (machine-generated) kernel file that is quite large. When we try to compile it, nvcc bails out with an “out of heap” error.

As a workaround, we tried splitting the one large file into multiple kernel files. Whether multiple template/kernel files are permitted is unclear from the docs, and we had no luck getting that to work in the Visual Studio environment. I don’t know if that is something we did wrong with Visual Studio, or just a fundamental limitation of how nvcc works. Basically, Visual Studio only wanted to compile one of the template files (even though we did the same kind of “custom build setup” on all the template files).

Even weirder, when we tried putting multiple #include statements in a single template file, only the first kernel file was included. The rest seemed to be ignored.

The manual says there is an upper limit of 2 million PTX instructions per kernel. Does that limitation manifest itself as the “out of heap” error we encountered during nvcc compilation? And just out of curiosity, why such a low limit on instruction count, and are there workarounds?

Could you split your kernel into some smaller ones, compile them to cubin or PTX files, and then call them in order using the driver API? You wouldn’t need to copy back the resulting data until the last sub-kernel has completed.
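A rough sketch of that idea, using the current driver API (which may differ from the toolkit version discussed in this thread). The function name `run_stages`, the kernel signature `(float *data, int n)`, and the module paths passed in are all assumptions made for illustration. Each sub-kernel is assumed to be declared `extern "C"` so its name is not mangled, and `n` must be greater than zero. The data stays in one device buffer between stages and is copied back only once at the end.

```cuda
#include <cuda.h>
#include <cstdio>
#include <cstddef>

// Report a failing driver call and return its error code to the caller.
// Note: this sketch does not release resources on the error path.
#define CU_CHECK(call)                                                 \
    do {                                                               \
        CUresult rc_ = (call);                                         \
        if (rc_ != CUDA_SUCCESS) {                                     \
            const char *msg_ = nullptr;                                \
            cuGetErrorString(rc_, &msg_);                              \
            std::fprintf(stderr, "%s failed: %s\n", #call,             \
                         msg_ ? msg_ : "unknown error");               \
            return rc_;                                                \
        }                                                              \
    } while (0)

// Runs n_stages separately compiled sub-kernels, in order, on one buffer.
// Hypothetical convention: every sub-kernel has the signature
//   extern "C" __global__ void name(float *data, int n)
CUresult run_stages(const char *const *module_paths,
                    const char *const *kernel_names,
                    int n_stages, float *host_data, int n)
{
    CU_CHECK(cuInit(0));
    CUdevice dev;
    CU_CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx;
    CU_CHECK(cuDevicePrimaryCtxRetain(&ctx, dev));
    CU_CHECK(cuCtxSetCurrent(ctx));

    // Upload the input once; it stays on the device between stages.
    size_t bytes = (size_t)n * sizeof(float);
    CUdeviceptr d_data;
    CU_CHECK(cuMemAlloc(&d_data, bytes));
    CU_CHECK(cuMemcpyHtoD(d_data, host_data, bytes));

    const unsigned threads = 256;
    const unsigned blocks = ((unsigned)n + threads - 1) / threads;

    for (int i = 0; i < n_stages; ++i) {
        CUmodule mod;
        CUfunction fn;
        // Each stage lives in its own cubin or PTX file.
        CU_CHECK(cuModuleLoad(&mod, module_paths[i]));
        CU_CHECK(cuModuleGetFunction(&fn, mod, kernel_names[i]));

        void *args[] = { &d_data, &n };
        CU_CHECK(cuLaunchKernel(fn, blocks, 1, 1, threads, 1, 1,
                                0, nullptr, args, nullptr));
        // Wait for this stage before unloading its module.
        CU_CHECK(cuCtxSynchronize());
        CU_CHECK(cuModuleUnload(mod));
    }

    // Copy the result back only after the last sub-kernel has finished.
    CU_CHECK(cuMemcpyDtoH(host_data, d_data, bytes));
    CU_CHECK(cuMemFree(d_data));
    CU_CHECK(cuDevicePrimaryCtxRelease(dev));
    return CUDA_SUCCESS;
}
```

The host program has to be linked against the driver library (for example `-lcuda` on Linux). Loading and unloading a module per stage keeps the sketch short; a real program would probably load all modules once up front.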

As for the 2M instruction limit…I don’t think there are any workarounds besides splitting your kernel into pieces. Something to do with a hardware limitation if I remember correctly.

Out of heap? I’ve had nvcc run out of stack, which I corrected by using ‘editbin’ to modify the executable. I’m not sure you can run out of “heap” unless you just run out of memory on your system.

I don’t know why Visual Studio is not letting you compile multiple files; try doing a “Rebuild All.”

Once you get it compiling, separate object files probably won’t work right away. You may need to put a C wrapper around each kernel call so that other files can call into it. (I don’t think the <<< >>> syntax will work.)
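For what it’s worth, a wrapper along those lines might look like the sketch below. The kernel `scale_kernel` and the wrapper `launch_scale` are made-up names for illustration, not anything from the original project. The <<< >>> launch stays inside the .cu file that nvcc compiles, and the other object files only see an ordinary C function. It assumes `n` is greater than zero.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Stand-in for one piece of the generated kernel (hypothetical).
__global__ void scale_kernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Plain C entry point: callers in other object files need no <<< >>> syntax.
// Returns 0 on success, -1 on any CUDA error.
extern "C" int launch_scale(float *host_data, float factor, int n)
{
    float *d_data = nullptr;
    size_t bytes = (size_t)n * sizeof(float);

    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return -1;

    cudaError_t err = cudaMemcpy(d_data, host_data, bytes,
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        scale_kernel<<<blocks, threads>>>(d_data, factor, n);
        err = cudaGetLastError();  // catches launch configuration errors
    }
    if (err == cudaSuccess) {
        // This blocking copy also waits for the kernel to finish.
        err = cudaMemcpy(host_data, d_data, bytes, cudaMemcpyDeviceToHost);
    }

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : -1;
}
```

A caller compiled by the regular host compiler would then only need the declaration `extern "C" int launch_scale(float *host_data, float factor, int n);` and can link against the object file nvcc produced.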

P.S. To test your idea on a basic level, copy your kernel to its own file and run nvcc -cubin on it. If that doesn’t work, then neither will juggling .cu files.