-rdc=true causes massive register spills. Is it possible to use the device runtime without it?

I have a fairly complex kernel composed of dozens of device functions, all written in PTX. When compiled straight to cubin it uses 177 registers and all is well. It is intended to run as one long-lived block on a single SMX. It is not a typical GPU task, and every microsecond is important.

Now I want this kernel to launch another kernel using dynamic parallelism. That requires a library (libcudadevrt.a), and hence relocatable code. The problem is that as soon as I add -rdc=true I get a lot of register spilling: around 840 bytes for the main function, and something under 100 bytes for every device function. From reading this forum I understand that increased register usage can be normal with relocatable code. So I have these questions:

  1. Is there a way to use dynamic parallelism without relocatable code? I was able to extract PTX from libcudadevrt.a but it’s not clear if it can be combined with my actual program.

  2. If relocatable code is mandatory, any special tips for lowering register usage? Many device functions I wrote are mainly for modularity and can be replaced with in-place code - would that help?

  3. A lot of data I keep in registers are heavily used thread-specific constants that are known before kernel launch. If I have to use memory for these, what is the best approach? Constant memory, cache prefetch, something else?

I’m developing this on CUDA 8 with a GeForce 940MX (Maxwell), but eventually targeting Pascal/Volta Tesla cards. Thanks in advance.

CUDA Dynamic Parallelism requires separate compilation and linking of device code, so relocatable code is required.
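For reference, here is a minimal sketch of what a device-side launch and its build line look like in CUDA C. The kernel names, the helper run_cdp_demo, and the file name in the build comment are hypothetical, and sm_50 is simply my assumption for the 940MX; the original kernels are in PTX, so treat this only as an illustration of the compile/link requirement.

```cuda
// Assumed build line (file name is hypothetical; sm_50 matches a GeForce 940MX):
//   nvcc -arch=sm_50 -rdc=true cdp_demo.cu -o cdp_demo -lcudadevrt
#include <cuda_runtime.h>

// Hypothetical child kernel: each thread writes its index plus an offset.
__global__ void child(int *out, int offset)
{
    out[threadIdx.x] = threadIdx.x + offset;
}

// Hypothetical parent kernel: one thread launches the child grid.
__global__ void parent(int *out, int offset)
{
    if (threadIdx.x == 0) {
        // Device-side launch; this is the part that pulls in libcudadevrt.
        child<<<1, 32>>>(out, offset);
    }
}

// Host helper: returns out[31] after the run, or -1 on any CUDA error.
int run_cdp_demo(int offset)
{
    int *d_out = nullptr;
    int h_out[32] = {0};
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return -1;

    parent<<<1, 1>>>(d_out, offset);

    // A parent grid is not considered complete until its child grids have
    // finished, so a host-side synchronize covers both.
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    cudaFree(d_out);
    return err == cudaSuccess ? h_out[31] : -1;
}
```

Calling run_cdp_demo(10) from main should return 41 (thread 31 plus the offset). Without -rdc=true and -lcudadevrt the device-side launch will not compile or link.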

Possibly; replacing the device functions with in-place code is worth a try. You can still use good old compiler #include directives to maintain a “semblance” of modularity, if you wish. I don’t normally recommend writing in straight PTX, and I don’t spend much time thinking about how code is assembled in PTX, so it’s possible some of my comments don’t make sense in that context.
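To illustrate the idea in CUDA C (not PTX, so this may not map directly onto your code), here is a sketch where small helpers are kept as separate functions in the source but forced inline into the kernel. The function names and the header name mentioned in the comment are hypothetical.

```cuda
#include <cuda_runtime.h>

// In a real project these helpers could live in a header (say helpers.cuh,
// a hypothetical name) pulled in with #include, so that every kernel sees
// the definitions in its own translation unit.
__device__ __forceinline__ float scale(float x, float k) { return x * k; }
__device__ __forceinline__ float bias(float x, float b)  { return x + b; }

// The helpers are expanded at the call site instead of being called.
__global__ void transform(float *data, int n, float k, float b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = bias(scale(data[i], k), b);
}

// Host helper: runs transform on a single value and returns the result,
// or -1.0f if the allocation fails.
float run_inline_demo(float x, float k, float b)
{
    float *d = nullptr;
    if (cudaMalloc(&d, sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemcpy(d, &x, sizeof(float), cudaMemcpyHostToDevice);

    transform<<<1, 1>>>(d, 1, k, b);

    float result = -1.0f;
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&result, d, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return result;
}
```

Compiling with -Xptxas -v makes ptxas print register counts and spill stores/loads per function, which is a quick way to check whether inlining actually changed anything for your kernel.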

__constant__ memory is usually a good choice for uniform constants (every thread in a warp loads the same constant) that are known only at runtime. However, this doesn’t completely eliminate register usage - constants still need to be loaded into registers prior to use. Register spilling puts additional load on the memory subsystem, starting with the L1 cache. So keeping the L1 cache as pristine as possible, by using the read-only cache where possible (compute capability 3.5 and up) and using the constant cache (i.e. __constant__ memory) or even the texture cache, may help. Some later architectures have a unified L1/Tex cache.
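A rough sketch of the two options in CUDA C, assuming one uniform value and one array of per-thread values; the names c_gain, apply and run_constant_demo are made up for illustration. The uniform value goes into __constant__ memory, and the per-thread values are read through __ldg(), which goes through the read-only data cache on compute capability 3.5 and up.

```cuda
// Assumed build line: nvcc -arch=sm_50 const_demo.cu   (__ldg needs sm_35 or newer)
#include <vector>
#include <cuda_runtime.h>

// Uniform constant: every thread reads the same value.
__constant__ float c_gain;

// Per-thread constants are read-only for the kernel's lifetime, so they are
// marked const __restrict__ and loaded with __ldg().
__global__ void apply(const float *__restrict__ per_thread, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = c_gain * __ldg(&per_thread[i]);
}

// Host helper: per_thread[i] = i, so the last output is gain * (n - 1).
// Returns that last element, or -1.0f if setup fails.
float run_constant_demo(float gain, int n)
{
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1.0f; }

    // Constants known before launch are written once from the host.
    cudaMemcpyToSymbol(c_gain, &gain, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    apply<<<(n + 63) / 64, 64>>>(d_in, d_out, n);

    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out[n - 1];
}
```

Note that values which differ per thread are not a great fit for __constant__ memory, since the constant cache works best when the whole warp reads one address; that is why the per-thread array goes through the read-only path here.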