"Device Function Call Overhead" and similar latency issues using '__device__' and '__global__' functons inside libraries

I am currently working on an HPC project related to computer vision, using the NVIDIA CUDA toolkit with C and compiling everything with nvcc.

At the moment I'm building a system that has to perform a large amount of computation in a very short time window (around 8 milliseconds), where every algorithm within that window is measured in hundreds of microseconds.

Until now I have been organizing all the __device__ and __global__ functions I write into a kind of 'pseudo-library': files that are simply included with the main code rather than compiled separately. I did this because I had seen that __device__ and __global__ functions can have some kind of latency or resource-management issues that affect execution performance.
This workaround was useful during the experimental stage, when I just wanted to test my code's functionality, but now I'm at an early 'production' stage where I would like to move all those functions into real libraries.

I would like to know whether there really is an issue with these kinds of functions inside compiled libraries that could cause performance problems within the time window mentioned above, or whether such problems only arise in specific situations that can be prevented. I would also appreciate any pointers to documentation about this, in case it exists.

Regards and thanks in advance.

The main thing to be aware of is that the compiler can generally do a better job of optimization when the __device__ functions called from a __global__ kernel are together in the same compilation unit/module/file.
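As a minimal sketch (hypothetical file and function names), this is the everything-in-one-file case, where the compiler is free to inline the __device__ function directly into the kernel:

```
// vision_ops.cu -- device function and kernel in the same translation unit
__device__ float clamp01(float v)
{
    // small helper the compiler can inline into the kernel below
    return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
}

__global__ void normalize_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = clamp01(data[i]);   // typically inlined: no call overhead
}

// whole-program compilation, no relocatable device code needed:
//   nvcc -O3 -c vision_ops.cu
```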

When that is not the case (e.g. the __device__ function is in a different file than the kernel that calls it), then it is necessary to compile with relocatable device code with device linking enabled, using e.g. the -rdc=true nvcc compile switch, or similar.
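Continuing that sketch, the separate-compilation case might look roughly like this (again, hypothetical names):

```
// ops.cu -- "library" side: definition of the device function
__device__ float clamp01(float v)
{
    return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
}

// kernels.cu -- caller side: only a declaration is visible here
__device__ float clamp01(float v);   // defined in ops.cu, resolved at device link time

__global__ void normalize_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = clamp01(data[i]);   // cross-file call; the compiler cannot inline it here
}

// both files compiled with relocatable device code; nvcc performs the device link at the final link:
//   nvcc -rdc=true -c ops.cu kernels.cu
//   nvcc ops.o kernels.o -o app
```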

This type of compilation may see some performance degradation compared to the first case (everything in the same file), both because of the overhead of the actual function call and because of the reduced optimization opportunities available to the compiler.

A possible way to recover some of the performance loss, if any, is to compile with link-time optimization enabled.
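For example, assuming a CUDA 11.2 or newer toolkit, device link-time optimization could be enabled for the two-file sketch above roughly like this:

```
nvcc -rdc=true -dlto -c ops.cu kernels.cu   # store LTO intermediates at compile time
nvcc -dlto ops.o kernels.o -o app           # optimize across files at device-link time
```

With -dlto, the device linker can do cross-file inlining and other optimizations that whole-file compilation would otherwise get for free.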

There is nvcc documentation on these topics, as well as numerous forum questions and blog articles.

Thanks for the clarification, and also for the documentation pointers.
