I have a switch statement of ~10-20 device functions that operate on a large structure (100+ registers).
The device functions are typically unrolled loops that work on different subsets of the larger structure.
The register footprint of each device function should fit in the registers that remain once the large in-register structure is subtracted from the per-thread budget set by the launch bounds.
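To make the setup concrete, here is a scaled-down sketch of the pattern. Everything in it is a hypothetical stand-in rather than the real code: the names (`State`, `op_scale_low`, `op_accumulate_high`, `op_tiny`, `dispatch_kernel`, `run_single_op`), the structure size of 64, and the launch bounds of (128, 4) are all assumptions chosen for illustration, and the real switch has ~10-20 cases.

```cuda
#include <cuda_runtime.h>

// Hypothetical size; the real structure is 100+ registers.
constexpr int kStateSize = 64;

// The array can only live in registers if every index is a compile-time
// constant after unrolling; dynamic indexing would push it to local memory.
struct State {
    float r[kStateSize];
};

// Works on the lower half of the structure.
__device__ __forceinline__ void op_scale_low(State& s) {
    #pragma unroll
    for (int i = 0; i < kStateSize / 2; ++i) s.r[i] *= 2.0f;
}

// Works on the upper half of the structure.
__device__ __forceinline__ void op_accumulate_high(State& s) {
    #pragma unroll
    for (int i = kStateSize / 2 + 1; i < kStateSize; ++i) s.r[i] += s.r[i - 1];
}

// A "simple and minimal footprint" function of the kind described above.
__device__ __forceinline__ void op_tiny(State& s) {
    s.r[0] += 1.0f;
}

// Launch bounds are an assumption; they cap the per-thread register budget.
__global__ void __launch_bounds__(128, 4)
dispatch_kernel(const int* ops, int num_ops, float* out) {
    State s;
    #pragma unroll
    for (int i = 0; i < kStateSize; ++i) s.r[i] = static_cast<float>(i);

    for (int k = 0; k < num_ops; ++k) {
        switch (ops[k]) {
            case 0: op_scale_low(s); break;
            case 1: op_accumulate_high(s); break;
            case 2: op_tiny(s); break;
            default: break;
        }
    }

    // Reduce the structure so the compiler cannot discard the work.
    float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < kStateSize; ++i) sum += s.r[i];
    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}

// Host helper: runs one op in one block and returns thread 0's result.
float run_single_op(int op) {
    int* d_ops = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_ops, sizeof(int));
    cudaMalloc(&d_out, 128 * sizeof(float));
    cudaMemcpy(d_ops, &op, sizeof(int), cudaMemcpyHostToDevice);
    dispatch_kernel<<<1, 128>>>(d_ops, 1, d_out);
    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_ops);
    cudaFree(d_out);
    return result;
}
```

Compiling something like this with `nvcc -Xptxas -v` prints the per-kernel register count, which is how the effect of adding or removing a case can be observed.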
You would expect the register footprint of the switch statement to be determined by the device function with the largest register footprint.
Unfortunately, that doesn’t seem to be the case.
Adding even a simple device function with a minimal footprint to the switch statement affects the entire kernel, in both register usage and performance.
I’m not surprised that some of this is happening, but I am wondering if there are any strategies for mitigating the effects of whole-program optimization.
I don’t want to fight PTXAS for something so simple.
I’d still like the individual device functions to be optimized: experimentation shows that -O1 does drop register usage, but at the cost of an appreciable drop in performance. The “-no-bb-merge” option wasn’t very helpful either.
If you have any ideas, I’ll try them out. Otherwise, I think I’ll wind up having to reduce overall register pressure enough to leave PTXAS plenty of headroom.