Why are function pointers so slow?

Hi all.
I have a CUDA app that spends most of its time in a switch(…) statement. The case values run from 0 to 25 (no “holes” in between).

When I use an array of function pointers instead of the switch (with CUDA 5.5, since nvcc from CUDA 5.0 used to crash during compilation), the app gets slower (about 5-8%).

Is this normal? I believe the compiler doesn’t make a jump table from a switch (as a regular C++ compiler might), so my manually written jump table should be faster …

Also, I can’t run the profiler (as you can see here https://devtalk.nvidia.com/default/topic/545731/visual-profiler/visual-profiler-5-5-no-timeline-no-resuts-no-errors/) …

Any clues?

The CUDA compiler aggressively inlines all device function calls when it can, which eliminates the overhead of having to use a stack to call a function. A call through a function pointer generally cannot be inlined, because the target is not known at compile time, so the compiler has to generate a real call, which is slower.

This is the most likely cause. Note that the aggressive inlining currently requires both the caller and the callee to reside in the same compilation unit. No inlining takes place across boundaries of separately compiled object modules.
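To make the comparison concrete, here is a minimal host-side C++ sketch of the two dispatch styles under discussion. It is not the original poster's code: the three operations and all names (`op_add`, `op_dbl`, `op_neg`, `kOps`, `dispatch_switch`, `dispatch_table`) are hypothetical stand-ins for the 26 cases. Whether a given compiler actually inlines the switch bodies, or how it lowers the switch, is an assumption to verify in the generated code.

```cpp
#include <cassert>

// Hypothetical operations standing in for the cases of the original kernel.
static inline int op_add(int x) { return x + 1; }
static inline int op_dbl(int x) { return x * 2; }
static inline int op_neg(int x) { return -x; }

// Style 1: switch. Every callee is named directly at the call site,
// so the compiler is free to inline each body into the switch.
int dispatch_switch(int op, int x) {
    switch (op) {
        case 0:  return op_add(x);
        case 1:  return op_dbl(x);
        case 2:  return op_neg(x);
        default: return x;  // unknown op: pass the value through
    }
}

// Style 2: table of function pointers. The callee is selected at run time,
// so in general the compiler must emit a real call that follows the ABI.
using OpFn = int (*)(int);
static const OpFn kOps[] = { op_add, op_dbl, op_neg };
static const int kNumOps = sizeof(kOps) / sizeof(kOps[0]);

int dispatch_table(int op, int x) {
    if (op < 0 || op >= kNumOps) return x;  // same fallback as the switch
    return kOps[op](x);
}

// Both styles must compute the same result for every op, including invalid ones.
void check_dispatch_agrees() {
    for (int op = -1; op <= kNumOps; ++op)
        for (int x = -3; x <= 3; ++x)
            assert(dispatch_switch(op, x) == dispatch_table(op, x));
}
```

The two functions are equivalent in behavior; the difference is only in what the compiler is allowed to know about the callee at each call site.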

So the multiple checks in the switch() statement are faster than a function call?

It’s not just the cost of a function call per se. In order for functions to be callable, they need to follow the calling conventions specified by the ABI. This includes restrictions on register allocation, for example, as function arguments and return values are restricted to specific registers. This is no different on GPUs than it is on commonly used CPUs. In addition, inlining of functions often allows additional optimizations to occur once the code has been inlined (a simple example would be constant propagation). Conversely, not inlining the function precludes these optimizations. In this regard as well, code generation for CPUs and GPUs behaves in much the same way.
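As a small illustration of the constant propagation point, consider the host-side C++ sketch below. The function `scale` and both wrappers are made up for this example, and what a particular compiler does with them is an assumption to check in the generated code, not a guarantee.

```cpp
#include <cassert>

// Hypothetical helper with a special case that depends on an argument value.
static inline int scale(int x, int factor) {
    if (factor == 1) return x;  // cheap path
    return x * factor;
}

// Direct call with a literal argument: once scale() is inlined here, the
// compiler can see that factor == 1 and may reduce the body to "return x".
int direct_call(int x) { return scale(x, 1); }

// Indirect call: unless the compiler can prove which function fn points to,
// it has to pass both arguments per the calling convention, and the
// factor == 1 test is evaluated at run time inside the callee.
using ScaleFn = int (*)(int, int);
int indirect_call(ScaleFn fn, int x) { return fn(x, 1); }

// The two paths differ only in optimization opportunities, not in results.
void check_scale_paths_agree() {
    for (int x = -4; x <= 4; ++x)
        assert(direct_call(x) == indirect_call(scale, x));
}
```

When the pointer is a compile-time constant visible at the call site, a compiler may still resolve and inline the call; the penalty discussed in this thread applies when the target is genuinely chosen at run time, as with an index into a table.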

These are classical tradeoffs between performance and flexibility (e.g. function pointers, virtual functions, separate compilation etc).

Yeah, inlining was what I thought too, yet it seems strange to me that function call overhead can be so big (again, nvcc didn’t make a jump table into the inlined code itself; it evaluates a long if-else chain).
Thanks.

One caveat is that we are doing nothing more than intelligent speculation here, since we have examined neither the source code nor the generated machine code.

Unless I am confusing myself, branching through a jump table is not the same as calling functions through function pointers. One could inline all functions, and still use a branch table to direct control flow as appropriate after that.

As for the benefits of aggressive inlining, it’s not just about eliminating call overhead, and a speedup of 5-8% from aggressive inlining vs. no inlining does not strike me as unusual (independent of platform) if the code lends itself well to that optimization. If you have equivalent CPU code you may want to play with the inlining options (no inlining -> most aggressive inlining) to see what the difference is.

There was a semi-related CUDA Forums discussion here.

According to @eelsen, NVCC can generate jump tables (using the BRX instruction).

As you point out, the necessary support for jump tables is available in PTX. The compiler may have a heuristic for determining whether a jump table or a series of if-then-else is more beneficial for a particular switch() statement. If there is solid evidence that the switch() implementation strategy selected by the compiler is detrimental to performance in a particular app, I would suggest filing a bug report.