Any way to mitigate whole program optimization?

I have a switch statement of ~10-20 device functions that operate on a large structure (100+ registers).

The device functions are typically unrolled loops that work on different subsets of the larger structure.

The register footprint of each device function should fit in the registers left over once the large in-register structure is subtracted from the budget set by the launch bounds.

What’s interesting is that you would think the register footprint of the switch statement would be determined by the device function with the largest register footprint.

Unfortunately, that doesn’t seem to be the case.

Adding a simple, minimal-footprint device function to the switch statement impacts the entire kernel in both register usage and performance.

I’m not surprised that some of this is happening but am wondering if there are any strategies for mitigating whole program optimization.

I don’t want to fight PTXAS for something so simple.

I’d still like the individual device functions to be optimized since experimentation shows that -O1 does drop register usage but with an appreciable drop in performance. The “-no-bb-merge” option wasn’t very helpful either.

If you have any ideas I’ll try them out, but otherwise I think I’ll wind up having to reduce overall register pressure far enough that PTXAS has plenty of registers to work with.
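For anyone following along, here is a minimal sketch of the pattern described in this post. It is not the poster's code: `BigState`, the `op_*` functions, `dispatch`, and `apply_to_ramp` are hypothetical names, and the sizes are made up. The functions are marked `__host__ __device__` only so the sketch can also be exercised on the CPU; in the real kernel they would be plain device functions.

```cpp
// Under nvcc HD expands to the CUDA qualifiers; on a plain host compiler
// it expands to nothing so the sketch still builds.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical stand-in for the large structure that is meant to stay in registers.
struct BigState {
    float r[128];
};

// Each operation works on a different subset of the state. The loops have
// constant trip counts so the compiler is free to unroll them fully.
HD inline void op_scale_low(BigState &s) {
    for (int i = 0; i < 32; ++i) s.r[i] *= 2.0f;
}

HD inline void op_add_high(BigState &s) {
    for (int i = 96; i < 128; ++i) s.r[i] += 1.0f;
}

// A deliberately tiny operation, like the one whose addition changed the
// whole kernel's register allocation.
HD inline void op_tiny(BigState &s) {
    s.r[0] += s.r[1];
}

// The dispatcher: a single switch over all operations.
HD inline void dispatch(int op, BigState &s) {
    switch (op) {
        case 0: op_scale_low(s); break;
        case 1: op_add_high(s);  break;
        case 2: op_tiny(s);      break;
        default: break;
    }
}

// Host-side helper for trying the sketch: fills the state with 0, 1, 2, ...,
// applies one operation, and returns the value at `index`.
inline float apply_to_ramp(int op, int index) {
    BigState s;
    for (int i = 0; i < 128; ++i) s.r[i] = static_cast<float>(i);
    dispatch(op, s);
    return s.r[index];
}
```

The naive expectation is that the register cost of `dispatch` is the state plus the most expensive case. Building with `-Xptxas -v` before and after adding a case like `op_tiny` shows whether that holds for a given kernel.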

Did noinline help?
How about calling through function pointers?
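A rough sketch of the two suggestions, for reference. `__noinline__` is the real CUDA function qualifier; everything else (`State`, `op_a`, `op_b`, the two dispatchers, `run_variant`) is a hypothetical name invented for illustration. The functions are `__host__ __device__` only so the sketch can be run on the CPU as well.

```cpp
// Under nvcc the macros expand to the CUDA qualifiers; on a plain host
// compiler they expand to nothing so the sketch still builds.
#ifdef __CUDACC__
#define HD __host__ __device__
#define HD_NOINLINE __host__ __device__ __noinline__
#else
#define HD
#define HD_NOINLINE
#endif

// Hypothetical stand-in for the large state structure.
struct State {
    float r[128];
};

// Under nvcc, __noinline__ asks the compiler to keep these as real calls
// instead of folding them into the caller.
HD_NOINLINE void op_a(State &s) {
    for (int i = 0; i < 32; ++i) s.r[i] += 1.0f;
}

HD_NOINLINE void op_b(State &s) {
    for (int i = 32; i < 64; ++i) s.r[i] *= 3.0f;
}

// Variant 1: direct calls to the non-inlined functions.
HD inline void dispatch_noinline(int op, State &s) {
    switch (op) {
        case 0: op_a(s); break;
        case 1: op_b(s); break;
        default: break;
    }
}

// Variant 2: select a function pointer, then make an indirect call.
typedef void (*op_fn)(State &);

HD inline void dispatch_pointer(int op, State &s) {
    op_fn fn = (op == 0) ? op_a : op_b;
    fn(s);
}

// Host-side helper for trying the sketch: fills the state with 0, 1, 2, ...,
// runs one variant, and returns the value at `index`.
inline float run_variant(bool use_pointer, int op, int index) {
    State s;
    for (int i = 0; i < 128; ++i) s.r[i] = static_cast<float>(i);
    if (use_pointer) dispatch_pointer(op, s);
    else             dispatch_noinline(op, s);
    return s.r[index];
}
```

One caveat, stated as an assumption to verify rather than a fact about any particular compiler version: a real call that takes the state by reference generally needs the state to be addressable, so either variant may move it out of registers. The `-Xptxas -v` output will show whether local memory or spills appear.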

Thanks, Tera.

I will try noinline!

Calling through function pointers was disastrous because the function args were being copied instead of left in place.

Performant function pointers would’ve been my preference though.

Did you pass the args by reference?

  1. Wait for CUDA 8, or ask the NVIDIA guys to test it for you. Maybe NVIDIA still has an early access program?
  2. Try ‘if’ instead of switch.
  3. Use a union to force register reuse:
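__device__ void f(struct_with_f_locals &l);  // f keeps all of its locals in l
__device__ void g(struct_with_g_locals &l);  // g keeps all of its locals in l

// f's and g's locals share one storage area, so they cannot end up in different registers
union { struct_with_f_locals a; struct_with_g_locals b; } x;
f(x.a);
g(x.b);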

@tera: noinline appears to be tickling a compiler bug. The kernel drops to only using 127 registers and spills the rest.

@txbob: Yes, all args are passed by reference. An earlier version passed them as const-correct pointers, with identical results.

@BZ: I’m using CUDA 8.0 EA and already make use of unions. I’ve tried a pure switch vs. if/else if as well.

Thanks for all your suggestions – keep 'em coming if you have more!

It’s just confounding that adding one small-footprint function to the switch/if/else if chain can impact both performance and register allocation.

I suspect there is little I can do about this other than reduce overall register pressure.

You can file a bug report. To me it seems like a serious compiler shortcoming.

About unions: I mean that you should put the local variables of both functions into a union. That way they cannot be allocated in different registers. I modified my example above to emphasize the idea.
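To make the idea concrete, here is a small sketch under stated assumptions. `FLocals`, `GLocals`, `Scratch`, and `run_f_then_g` are hypothetical names, and the bodies of `f` and `g` are placeholders. The only point is that each function's locals live in a struct and the two structs overlap in a union. Whether the compiler then reuses registers the way you hope is something to check in the ptxas output, not something this sketch guarantees.

```cpp
// Under nvcc HD expands to the CUDA qualifiers; on a plain host compiler
// it expands to nothing so the sketch still builds.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical: every local that f needs lives in this struct ...
struct FLocals {
    float acc;
    float tmp[4];
};

// ... and every local that g needs lives in this one.
struct GLocals {
    int count;
    int idx[4];
};

// Placeholder body: sums x*1 + x*2 + x*3 + x*4 using only fields of l.
HD inline float f(FLocals &l, float x) {
    l.acc = 0.0f;
    for (int i = 0; i < 4; ++i) {
        l.tmp[i] = x * static_cast<float>(i + 1);
        l.acc += l.tmp[i];
    }
    return l.acc;
}

// Placeholder body: counts the odd values among n, n+1, n+2, n+3.
HD inline int g(GLocals &l, int n) {
    l.count = 0;
    for (int i = 0; i < 4; ++i) {
        l.idx[i] = n + i;
        l.count += l.idx[i] & 1;
    }
    return l.count;
}

// Both sets of locals overlap in the same storage.
union Scratch {
    FLocals a;
    GLocals b;
};

HD inline float run_f_then_g(float x, int n) {
    Scratch s;
    s.a = FLocals{};       // assigning the member makes `a` the active member
    float rf = f(s.a, x);
    s.b = GLocals{};       // switch the active member to `b`; f's locals are dead now
    int rg = g(s.b, n);
    return rf + static_cast<float>(rg);
}
```

The union only works because `f` is completely finished with its locals before `g` starts. Any value that has to survive from one function to the next must be kept outside the union.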

@BZ, got it – that’s an interesting idea. I will try it tomorrow morning.