"Device Function Call Overhead" and similar latency issues using '__device__' and '__global__' functons inside libraries

I am currently working on an HPC project related to computer vision, using the NVIDIA CUDA toolkit with C and compiling everything with nvcc.

At the moment I'm building a system that has to perform a large amount of computation in a very short time window (around 8 milliseconds), where every algorithm within that window is measured in hundreds of microseconds.

Until now I have been organizing all the __device__ and __global__ functions I write into a kind of 'pseudo-library': files that are simply included with the main code rather than compiled separately. I did this because I had seen that __device__ and __global__ functions can have some kind of latency or resource-management issues that affect execution performance.
This workaround was useful during the experimental stage, when I just wanted to test my code's functionality, but now I'm at an early 'production' stage where I would like to move all those functions into real libraries.

I would like to know whether there really is an issue with these kinds of functions inside compiled libraries that could cause performance problems within the time window mentioned above, or whether such problems only arise in specific situations that can be prevented. I would also appreciate any pointers to documentation about this, in case it exists.

Regards and thanks in advance.

The main thing to be aware of is that the compiler can generally do a better job of optimization when the __device__ functions called from a __global__ kernel are together in the same compilation unit/module/file.
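As a minimal sketch (hypothetical file and function names), this is the everything-in-one-file case, where the compiler is free to inline the __device__ function directly into the kernel:

```
// vision_ops.cu -- device function and kernel in the same translation unit
__device__ float clamp01(float v)
{
    // small helper the compiler can inline into the kernel below
    return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
}

__global__ void normalize_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = clamp01(data[i]);   // typically inlined: no call overhead
}

// whole-program compilation, no relocatable device code needed:
//   nvcc -O3 -c vision_ops.cu
```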

When that is not the case (e.g. the __device__ function is in a different file than the kernel that calls it), then it is necessary to compile with relocatable device code with device linking enabled, using e.g. the -rdc=true nvcc compile switch, or similar.
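Continuing that sketch, the separate-compilation case might look roughly like this (again, hypothetical names):

```
// ops.cu -- "library" side: definition of the device function
__device__ float clamp01(float v)
{
    return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
}

// kernels.cu -- caller side: only a declaration is visible here
__device__ float clamp01(float v);   // defined in ops.cu, resolved at device link time

__global__ void normalize_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = clamp01(data[i]);   // cross-file call; the compiler cannot inline it here
}

// both files compiled with relocatable device code; nvcc performs the device link at the final link:
//   nvcc -rdc=true -c ops.cu kernels.cu
//   nvcc ops.o kernels.o -o app
```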

This type of compilation may see some performance degradation compared to the first case (everything in the same file), both because of the overhead of the actual function call and because of the reduced optimization opportunities available to the compiler.

A possible way to recover some of the performance loss, if any, is to compile with link-time optimization enabled.
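For example, assuming a CUDA 11.2 or newer toolkit, device link-time optimization could be enabled for the two-file sketch above roughly like this:

```
nvcc -rdc=true -dlto -c ops.cu kernels.cu   # store LTO intermediates at compile time
nvcc -dlto ops.o kernels.o -o app           # optimize across files at device-link time
```

With -dlto, the device linker can do cross-file inlining and other optimizations that whole-file compilation would otherwise get for free.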

There is nvcc documentation on these topics, as well as numerous forum questions and blog articles.

Thanks for the clarification, and also for the documentation pointers.
