The cost of Relocatable Device Code (-rdc=true)

I would like to know more about the consequences of compiling with -rdc=true.

From what I understand, the only cost comes from the compiler not being able to optimize function calls ACROSS compilation units; that is, it cannot inline a small device function defined in another compilation unit.

I would like to compile with rdc in order to communicate constant memory across files (using extern) and pass device functions across compilation units in my application.

My application consists of many .cu files which are currently independent of each other (no external constant memory and no communication of device functions between files), so no rdc is needed and I can compile each one separately using only “$nvcc -c”.

When I compile said application by using “$nvcc -dc” or -rdc=true the performance drops a noticeable amount in kernels that, as I said before, only use device functions that are on the same C.U.
In this case I would expect the compiler to still inline the same device functions as before (and it appears to be doing so). So I do not understand what else rdc is doing to harm performance.

LONG STORY SHORT

Besides function call optimization across compilation units, is there any other performance consequence of using relocatable device code?
Thanks!

Beyond the loss of cross-unit optimizations involving the function contents themselves (especially inlining, dead code elimination, loop unrolling, etc.), the cross-unit function calls themselves have more overhead, since they have to follow the full function calling ABI and need extra instructions to set up and recover function arguments and results.

Just hypothesizing about further effects, there might be some lost instruction cache efficiency, but as a guess that’d be negligible unless you designed some especially non-looping code with many diverse nonlocal functions.

I do not think this is it, since all device functions used in kernels are in the same compilation unit. And forcing the compiler to not inline key device functions has a much worse impact than rdc.

Interesting! This might be the source of my application’s slowdown. Although I believe there must be other sources as well.

SPWorley gave an excellent summary of details generally affecting the performance of code compiled with -rdc=true. Among the “extra overhead” of following the ABI may also be increased register usage and as a consequence lower occupancy (use the profiler to check). Without providing source code that others can compile, I am afraid more detailed advice isn’t possible.
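As a rough illustration of the register and occupancy check mentioned above, here is a minimal sketch that queries both through the CUDA runtime API instead of the profiler. The kernel scale and the helper query_kernel_cost are hypothetical names made up for this example; only cudaFuncGetAttributes and cudaOccupancyMaxActiveBlocksPerMultiprocessor come from the runtime. The idea is to build the same file once with -rdc=false and once with -rdc=true and compare the printed numbers. For a kernel this small they will likely match; the comparison only becomes interesting with your real kernels.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel; substitute one of your own.
__global__ void scale(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: reports registers per thread and how many blocks of
// the given size can be resident on one SM. Returns the first CUDA error.
cudaError_t query_kernel_cost(int blockSize, int* regsPerThread, int* blocksPerSM)
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scale);
    if (err != cudaSuccess) return err;
    *regsPerThread = attr.numRegs;
    // Last argument: dynamic shared memory per block (none here).
    return cudaOccupancyMaxActiveBlocksPerMultiprocessor(blocksPerSM, scale, blockSize, 0);
}

int main()
{
    const int blockSize = 256;
    int regs = 0, blocks = 0;
    cudaError_t err = query_kernel_cost(blockSize, &regs, &blocks);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("registers per thread: %d\n", regs);
    printf("resident blocks per SM at %d threads/block: %d\n", blockSize, blocks);
    return 0;
}
```

Passing -Xptxas -v to nvcc prints the per-kernel register count at compile time as well, which may be quicker if you only want that one number.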

A closer look at the generated machine code (not at the intermediate PTX code) should tell you precisely how -rdc=true affects your code in particular. Use cuobjdump --dump-sass to get the disassembled code.

Forgive my ignorance but isn’t RDC required for building well-structured projects that are split across multiple source files?

Given the choice between slower code that’s maintainable and well-organized vs a 50,000 line file that’s faster, I’d take the slower code any day. Besides, GPUs get faster all the time, right? :P

-rdc=true enables the creation of complex programs from multiple, separately compiled, compilation units. This requires adherence to an ABI and a linker, and is generally how code is built for CPU platforms. It does have a negative impact on performance; in the CPU world this impact was mitigated over time (two decades!) by the introduction of optimizing linkers, link-time code generation, and whole-program optimizations based on compiler-internal intermediate representations such as the Intel compiler’s .ipo databases. So once one enables -Qipo, one will often see a significant performance boost.

For historical reasons, the CUDA toolchain started out differently, as initial GPU hardware lacked sufficient hardware support for a proper ABI, and the compiler grew out of simple shader compilers used for graphics APIs. This also meant that the initial toolchain did whole-program optimizations since there was only a single compilation unit, with all the performance benefits that come along with it.

At this point, CUDA has a device code linker, supports an ABI and multiple compilation units that can be compiled separately. However, when these are used, performance will often drop compared to the old ways of building CUDA code. By some CUDA programmers this is then seen as a “performance regression” rather than an inevitable trade-off between the flexibility of structuring source code and code performance.

The ultimate solution would be for the CUDA toolchain to adopt many of the same link-time optimization approaches that are used by toolchains for CPUs. Looking at the history of CPU toolchains, that could be a slow process, and whether NVIDIA pursues such improvements is likely a function of how many CUDA customers express a need for that.

Some CUDA users do need performance at any cost, and for those, using a single compilation unit (which may be constructed from a multitude of #included files) with full-program optimization may well be the way to go. Projects that value modularity and are not as performance sensitive may well opt for use of -rdc=true. As the saying goes “Different strokes for different folks.”

MJ, absolutely.
95% of programming is not performance sensitive, but if you’re using GPU compute, you certainly ARE doing it for performance reasons. So you can’t ignore performance, and hence our job as software engineers is to balance performance and code elegance.

My own advice for cross-unit code structure is to avoid using cross-unit function calls for the inner-most loops. Instead, have the inner loop (with looping boilerplate) as part of the function itself, even if that loop code is repeated many times. That moves you from having say 500000 full-ABI cross-unit function calls inside a loop into having one cross-unit call that loops 500000 times internally, which would amortize to effectively no lost performance. What exactly is in that boilerplate loop depends on your application.

This is annoying, but if you had many such internal functions all sharing the same kind of boilerplate loop, you could just template the boilerplate. An example might be organizing different types of user selectable molecular force functions… don’t make the one line force computation the external callable function, instead make the templated accumulator loop part of the callable function even if repeated for each choice of force.
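To make the templated-loop idea a bit more concrete, here is a minimal single-file sketch. All names (HarmonicForce, InverseSquareForce, accumulate_force, the two accumulate_* entry points, total_force, run_demo) and the force formulas are invented for illustration. In a real project the two non-template entry points would be the functions exported from their own .cu file and called from elsewhere under -rdc=true; everything is kept in one file here only so the example builds on its own.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical one-line force laws. These are NOT the externally callable
// functions; they get inlined into the loop below.
struct HarmonicForce {
    __device__ float operator()(float r) const { return -r; }
};
struct InverseSquareForce {
    __device__ float operator()(float r) const { return 1.0f / (r * r); }
};

// The boilerplate accumulator loop, written once as a template.
template <typename Force>
__device__ float accumulate_force(const float* dist, int n, Force f)
{
    float sum = 0.0f;
    for (int j = 0; j < n; ++j) sum += f(dist[j]);
    return sum;
}

// Entry points another compilation unit would call: one full-ABI call each,
// with the whole loop inside, instead of one call per loop iteration.
__device__ float accumulate_harmonic(const float* dist, int n)
{
    return accumulate_force(dist, n, HarmonicForce());
}
__device__ float accumulate_inverse_square(const float* dist, int n)
{
    return accumulate_force(dist, n, InverseSquareForce());
}

__global__ void total_force(const float* dist, int n, float* out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        out[0] = accumulate_harmonic(dist, n);
        out[1] = accumulate_inverse_square(dist, n);
    }
}

// Host driver: copies four sample distances over, runs the kernel once,
// and returns 0 on success with the two sums in h_out.
int run_demo(float h_out[2])
{
    const int n = 4;
    const float h_dist[n] = {1.0f, 2.0f, 4.0f, 0.5f};
    float* d_dist = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_dist, n * sizeof(float)) != cudaSuccess) return 1;
    if (cudaMalloc(&d_out, 2 * sizeof(float)) != cudaSuccess) { cudaFree(d_dist); return 1; }
    cudaMemcpy(d_dist, h_dist, n * sizeof(float), cudaMemcpyHostToDevice);
    total_force<<<1, 1>>>(d_dist, n, d_out);
    // This blocking copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_dist);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}

int main()
{
    float out[2] = {0.0f, 0.0f};
    if (run_demo(out) != 0) { fprintf(stderr, "CUDA call failed\n"); return 1; }
    printf("harmonic sum: %f\ninverse-square sum: %f\n", out[0], out[1]);
    return 0;
}
```

The trade-off is the one described above: the loop body is instantiated once per force type, so the code is duplicated in the binary, but the source only contains the boilerplate once.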

But only if the kernel has a cross compilation unit device function call, right?

You are right, but the source is very big and I do not think it will help the discussion in this case. On the other hand, I have not been able to reproduce the negative effects in a minimal example, which is part of the reason why I started the thread… Anyway, your insight is already helpful as always. Thank you!

Sure, but maybe that defeats the purpose in some cases, given that you will have to maintain many copies of the same ‘boilerplate’ code. Besides, not inlining a device function seems to be catastrophic most of the time!

My case exactly. And as you suggested, my strategy has been to template a struct containing the critical code. The annoying thing is that in this case the structs need to be in the same CU as the kernel. The result is good because nvcc inlines it all and there is no overhead. This is not really an nvcc-specific restriction; the real restriction is the great impact of inlining in CUDA.

As njuffa said, I accomplish this by #including the file with the critical device functions everywhere they are needed, recompiling them as part of each C.U. in each case. Inlining occurs, so there are no naming issues.

Yeah, sure. But when I tried separate compilation, I didn’t expect to notice a performance decrease at all. I guess that it is still a young technology. So hopefully it will get better with each CUDA release.

Thank you all!

I hope it has become clear now why that was an unrealistic assumption. Where cross-file function inlining is supported by CPU toolchains, it can easily give 20%-30% performance increase versus calling functions in other compilation units. Function inlining is just a starting point which allows a whole bunch of other optimizations to kick in. How big of a performance difference are you seeing with your use case when building with -rdc=true vs -rdc=false?

The problem is not so much that the CUDA toolchain is young, as it uses several components also used by CPU toolchains, it’s that most CPU toolchains are very old and mature. I used the original Intel optimizing compiler (Proton) back in 1996, some twenty years ago. Back then it would frequently “blow up” if you tried to compile anything besides the SPEC benchmarks. The gcc toolchain is almost 30 years old now. This shows that it takes a lot of man-years (and calendar years!) to create a toolchain with all the bells and whistles.

It is clear now, thanks!

About 5%