How can I make the CUDA "first-call overhead" happen only once for CUDA code called from a DLL?

I really wish I could see that “CUDA API” line in the timeline, but I can’t. Any hints on how to get it to show up? I followed these instructions: “When the Collect GPU Memory Usage option is selected from the Collect CUDA trace option set, Nsight Systems will track CUDA GPU memory allocations and deallocations and present a graph of this information in the timeline.”