On-the-fly compilation

Hi !

A couple of times I’ve asked questions here and I got some pretty nice answers, so let me ask just one more :)

I have one huge CUDA kernel. Depending on the input, it may or may not need some of its code.

Since the fight with register spills is ongoing (believe me, I’ve tried every single piece of advice I got to reduce them), I thought there might be a different way.

What I want is to detect which parts of the code are needed and create the kernel “on the fly” with just those parts (or replace the unused ones with empty stubs). Of course, none of them will be inlined, but this is fine.

If all of my users have nvcc installed, this is not a problem (I will #ifdef the functions that are not needed, etc.). However, the CUDA SDK is quite big, and the last thing I want is to ship it with my app.

If you have seen some kind of reference/manual/topic about this, or if you know a way it can be done (with a relatively small disk/memory footprint for the users of the app), please share. For example, is it possible to combine multiple PTX files into a fatbin on the fly?

Thanks,
Bye.

p.s. I don’t have any C++ code in the kernel, if this matters …

How many different variants of this kernel do you anticipate you will need? If you need fewer than, say, twenty, you might want to look into using a templated function: instantiate the template for all the variants you need and simply invoke the desired instantiation at run time. I prefer function pointers for the latter; others prefer switch statements.
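Here is a minimal sketch of what I mean (the kernel name, parameters, and per-variant work are all placeholders): one kernel template, explicitly instantiated for each variant, with a small table of function pointers for run-time dispatch.

```cpp
template <int VARIANT>
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // VARIANT is a compile-time constant, so the branches that do not
    // apply to a given instantiation are removed by the compiler.
    if (VARIANT == 0) data[i] *= 2.0f;
    if (VARIANT == 1) data[i] += 1.0f;
    if (VARIANT == 2) data[i] = sqrtf(data[i]);
}

typedef void (*KernelFn)(float *, int);

// One table entry per instantiated variant.
static const KernelFn variants[3] = {
    myKernel<0>, myKernel<1>, myKernel<2>
};

void launchVariant(int variant, float *d_data, int n)
{
    variants[variant]<<<(n + 255) / 256, 256>>>(d_data, n);
}
```

A switch statement over the variant index with one explicit launch per case works just as well; the function-pointer table simply keeps the dispatch in one place.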

I am aware that there are CUDA applications that use their own PTX code generator and then use CUDA support for JITing PTX at runtime, but I have no personal experience with that approach. So far, the template approach has worked well enough for my own projects and is much less complex.

The slowdown comes from the fact that the code is there (and produces a lot of register spills). It causes slowdowns even when it is not being invoked (the problem is not that the code itself is slow).
What I need is to compile the kernel on the fly, in order to produce a version of it that does not contain this piece of code at all (and therefore reduces my spills).

I think OptiX works that way (but then again, I wasn’t able to run the examples and check what they are doing, since the links on the NVIDIA web page are broken …).

The CUDA documentation mentions JITing once, but only in the context of compiling against multiple architectures (so the most appropriate one can be selected at runtime).

p.s. The OptiX webpage mentions as a new feature:
“More Efficient Compilation
The need to compile when the scene changes is far less frequent than in previous releases.”
So they are compiling one way or another. The question is: do I need to distribute the whole SDK with my app?

I understand the situation. An approach worth considering is to make the various pieces of code you want removed dependent on template parameters, so that a piece of code is excluded when the template is instantiated with the corresponding parameter set to 0. I have used this technique in practice and it works quite well to reduce code size and register usage. The idea is similar to using #ifdef, except that one uses actual if-statements in the code, each controlled by a template parameter, roughly as in the sketch below.
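A minimal sketch of the idea, with invented names (the heavy computation here stands in for whatever code you want excluded):

```cpp
template <bool HEAVY_PATH>
__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = data[i];
    if (HEAVY_PATH) {
        // Register-hungry code lives here. In the <false> instantiation
        // this block is dead code and is eliminated at compile time,
        // together with the spills it would have caused.
        v = expf(v) * logf(v + 1.0f);
    }
    data[i] = v;
}

// Both instantiations live in the same binary; pick one at run time:
//   process<true ><<<grid, block>>>(d_data, n);
//   process<false><<<grid, block>>>(d_data, n);
```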

If you come to the conclusion that online compilation is the only workable solution, there are functions in the CUDA API to load fat binaries. You may need to use the driver API, as I recall. Have a look at the documentation for cuModuleLoadFatBinary().
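For illustration, a bare-bones sketch of loading a module at run time through the driver API (error checking omitted; the PTX string and the entry-point name “myKernel” are placeholders). cuModuleLoadDataEx() JIT-compiles a PTX image for the current device; cuModuleLoadFatBinary() is the analogous call for a fat binary image:

```cpp
#include <cuda.h>

CUfunction loadFromPtx(const char *ptx)
{
    CUmodule   module;
    CUfunction kernel;
    cuModuleLoadDataEx(&module, ptx, 0, NULL, NULL);  // JITs the PTX
    cuModuleGetFunction(&kernel, module, "myKernel"); // look up entry point
    return kernel;
}

// Launch later via the driver API, e.g.:
//   void *args[] = { &d_data, &n };
//   cuLaunchKernel(kernel, gridX, 1, 1, 256, 1, 1, 0, NULL, args, NULL);
```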

I am using the driver API already.

I expected CUDA templates to be instantiated at compile time (just like C++ ones). Are they instantiated at runtime?
Or do you suggest compiling all the needed versions once, and then loading the appropriate one at runtime? If so, if I want to turn on/off, let’s say, 8 pieces of code, I would end up with 256 versions of the kernel (in a single fatbin file, but exponential growth in the number of code versions is not really good) …

Chapter 6 of this paper describes what OptiX means by compilation:
http://graphics.cs.williams.edu/papers/OptiXSIGGRAPH10/

The CUDA compiler uses a C++ frontend, so templates are handled as they are by every other C++ compiler, meaning templates are instantiated at compile time. If your code requires all combinations of binary template parameters to be instantiated, this will cause exponential growth in the number of instances, as noted, and will quickly become cumbersome or unworkable.

I have used this technique with up to five parameters (= 32 instances) in my own projects. In other cases I found that only a smaller subset of the possible combinations made sense, because the parameters were not all independent; so even though there were six parameters, only twenty combinations were needed, making the template approach workable.

Certainly the use of templates is not a silver bullet, but it should at least be considered. Other approaches may work better, such as decomposing a large kernel into smaller ones (each of which performs one stage of the processing done by the original kernel) or using online compilation.

@savage309:

Another strategy is to restructure your mega-kernel into independent kernels connected by queues.

The Laine/Karras/Aila “Megakernels Considered Harmful” paper is worth reading.

This approach makes even more sense if your megakernel is composed of functions with very different GPU resource requirements.
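As a rough sketch of the queue idea (all names here are invented, and real implementations are more involved): each stage becomes its own kernel with its own resource footprint, and a stage appends the indices of items needing further processing to a queue via atomicAdd(), so the next stage can be launched over just those items.

```cpp
__global__ void stageA(const float *in, float *work,
                       int *queue, int *queueCount, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = in[i] * 2.0f;            // cheap work every item needs
    work[i] = v;
    if (v > 1.0f) {                    // only some items need stage B
        int slot = atomicAdd(queueCount, 1);
        queue[slot] = i;
    }
}

__global__ void stageB(float *work, const int *queue, int queueCount)
{
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= queueCount) return;
    int i = queue[q];
    work[i] = sqrtf(work[i]);          // expensive, register-hungry stage
}
```

The host reads back the queue count after stage A and sizes the launch of stage B accordingly, so the heavy code never occupies registers in the common path.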

Oh, and no one has mentioned the Dynamic Parallelism facility.

Dynamic Parallelism plus some of the other techniques mentioned above (function pointers, switch statements, etc.) might simplify your problem.
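For example, a minimal sketch (kernel names invented; requires compute capability 3.5+, compilation with -rdc=true, and linking against cudadevrt): a lightweight parent kernel decides at run time which child kernel is actually needed, and each child is compiled separately with its own register footprint.

```cpp
__global__ void lightChild(float *data, int n);  // defined elsewhere
__global__ void heavyChild(float *data, int n);  // defined elsewhere

__global__ void parent(float *data, int n, bool needHeavy)
{
    // A single thread inspects the situation and launches the
    // appropriate child kernel from the device.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        dim3 block(256), grid((n + block.x - 1) / block.x);
        if (needHeavy)
            heavyChild<<<grid, block>>>(data, n);
        else
            lightChild<<<grid, block>>>(data, n);
    }
}
```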

Thanks for the replies.
I appreciate them a lot! (The paper is great, and online compilation sounds interesting :))
Best,
savage309