When a complex Thrust function that uses CUDA Dynamic Parallelism is launched via cudaLaunchKernel, the NVIDIA profiler shows an extended delay (1+ sec) for the first invocation, whereas cudaLaunchKernel does not even appear in the profile for subsequent launches. Subsequent invocations take a total of < 0.4 sec to execute these functions.
I can only speculate that this is due to loading the Thrust libraries, as I have not seen it happen with custom kernels.
Does anyone here know why this happens? Is there a way to preload the Thrust libraries so that this delay does not occur during operation?