I am currently working on an HPC project related to computer vision, using the ‘NVIDIA CUDA’ library in ‘C’ and compiling everything with ‘NVCC’.
At the moment I’m building a system that has to perform a large amount of computation in a very short time window (around 8 milliseconds), where every algorithm within that window is measured in hundreds of microseconds.
Until now I have been organizing all my ‘__device__’ and ‘__global__’ functions into ‘pseudo-libraries’ that did not need to be compiled separately to work with the main code. I did this because I had seen that ‘__device__’ and ‘__global__’ functions placed in compiled libraries seemed to suffer from some kind of latency or resource-management issue that could hurt execution performance.
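To make the ‘pseudo-library’ idea concrete, here is a minimal sketch of the layout I mean. The names (scale_pixel, scale_kernel, run_scale) and the header name in the comments are hypothetical and only for illustration; this is not my real code. Everything is shown in one file, but the ‘__device__’ function stands for what I normally keep in an included header so that the compiler sees its definition in the same translation unit as the kernel.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

/* Hypothetical "pseudo-library" content. In my layout this would sit in a
 * header (say, vision_utils.cuh) that the main .cu file #includes, so the
 * definition is visible to the compiler and can be inlined into the kernel. */
__device__ __forceinline__ float scale_pixel(float v, float gain)
{
    return v * gain;
}

/* Kernel in the same translation unit as the device function it calls. */
__global__ void scale_kernel(float *data, int n, float gain)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = scale_pixel(data[i], gain);
}

/* Host wrapper: copies the buffer to the GPU, runs the kernel, copies it back.
 * Returns 0 on success and 1 on any CUDA error. */
int run_scale(float *host, int n, float gain)
{
    float *dev = NULL;
    size_t bytes = (size_t)n * sizeof(float);
    int ok;

    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess)
        return 1;
    ok = (cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice) == cudaSuccess);

    if (ok) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        scale_kernel<<<blocks, threads>>>(dev, n, gain);
        ok = (cudaGetLastError() == cudaSuccess) &&
             (cudaDeviceSynchronize() == cudaSuccess);
    }
    if (ok)
        ok = (cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess);

    cudaFree(dev);
    return ok ? 0 : 1;
}

int main(void)
{
    float pixels[4] = {1.0f, 2.0f, 3.0f, 4.0f};

    if (run_scale(pixels, 4, 2.0f) != 0) {
        fprintf(stderr, "CUDA error\n");
        return 1;
    }
    printf("%.1f %.1f %.1f %.1f\n", pixels[0], pixels[1], pixels[2], pixels[3]);
    return 0;
}
```

As far as I understand, the ‘real library’ alternative would mean declaring scale_pixel as extern in a header, defining it in a separate .cu file, and building with relocatable device code (nvcc -rdc=true, or -dc per file), optionally with device link-time optimization (-dlto). Whether that route brings back the overhead I was trying to avoid is exactly what I am unsure about.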
This workaround was useful in the experimental stages, when I just wanted to test my code’s functionality, but now I’m in an early ‘production’ stage where I would like to move all those functions into real libraries.
I would like to know whether functions of this kind really do cause performance problems when they live inside compiled libraries, given the time windows mentioned above, or whether those problems only arise in specific situations that can be prevented. I would also appreciate any documentation on the subject, if it exists.
Regards and thanks in advance.