With a complex CUDA Dynamic Parallelism (CDP) Thrust function being launched by cudaLaunchKernel, the NVIDIA Visual Profiler shows an extended delay (1+ sec) for the first invocation, while cudaLaunchKernel does not even appear for subsequent launches. Subsequent invocations show a total of < 0.4 sec for execution of these functions.
I can only speculate that this is due to loading the Thrust libraries, as I’ve not seen it happen with custom kernels.
Does anyone here know why this happens? Is there some way to pre-load the Thrust libraries to cut this delay during operation?
I’m mainly referring to CUDA here, not Thrust. CUDA very often has a start-up delay that can be a large fraction of a second or longer. It isn’t unique or specific to Thrust. Since Thrust uses CUDA, if the first GPU-related thing your code does is some Thrust function, then that function may “absorb” the CUDA start-up delay.
This is just conjecture. There are many questions on these forums that pertain to CUDA start-up delay.
In this case, there are numerous calls to small custom CUDA kernels among many cudaMalloc and cudaMemcpyAsync calls prior to the cudaLaunchKernel CDP Thrust function that is exhibiting the >2x delay, which is why I suspected loading of the Thrust libraries.
I guess I don’t understand what constitutes CUDA startup, which normally exhibits a substantial delay.
The CUDA startup overhead is basically the cost of creating and initializing a CUDA context. Due to the use of a unified address space between host and device, this may be fairly significant if a system has lots of memory (system memory plus GPU memory) that needs to be mapped into the unified address space.
CUDA context initialization occurs on demand, i.e. typically with the first CUDA API call. In practice, this is often a cudaMalloc() call. Sometimes it is beneficial to force context creation at a known point in an application by a call to cudaFree(0). As you note, your code has CUDA API calls ahead of the Thrust usage, so the observed delay should not be due to CUDA context creation overhead.
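A minimal sketch of the cudaFree(0) idiom mentioned above, forcing context creation at a known point before any timing-sensitive work (the error handling here is just one possible style):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Force CUDA context creation now, rather than lazily on the
    // first "real" API call. cudaFree(0) is a common idiom for this.
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // ... rest of the application: allocations, kernel launches, etc.
    return 0;
}
```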
Generally in benchmarking and performance analysis, it is a best practice to ignore the first (or even the first few) invocations of any particular piece of code to avoid distorted measurements due to cold-start effects arising both from hardware (e.g. caches, TLBs) as well as software (e.g. JIT compilation). Typically the interest is in measuring steady-state performance.
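The warm-up practice described above might be sketched like this with CUDA events, where myKernel is just a placeholder for whatever kernel is being measured:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { /* placeholder work */ }

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    myKernel<<<1, 1>>>();          // cold-start launch: not timed, discarded
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    myKernel<<<1, 1>>>();          // steady-state launch: this one is timed
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("steady-state launch: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

In practice one would average over many timed launches, or take the fastest, as noted later in this thread.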
So are you saying the initial extended execution time shown in the profiler is not real?
Experimenting, I have shown that I can move that extended-time initial invocation into the initialization code and get normal profiled execution times where it matters in the data stream.
However, that is somewhat of a kludge, and I would prefer an official way to get the code initialized/cached.
I suspect that there is some code caching, but others (my supervisor) have suggested that there should not be any such effects.
Cold-start effects slowing down the performance of a piece of code are very much real, but they should not be relevant to performance analysis under most circumstances.
Performance analysis and/or benchmarking usually looks at steady-state performance, i.e. after warmup, and may in addition choose to look at the average of multiple instances or the fastest of multiple instances.
I don’t know your application and obviously haven’t profiled and analyzed it, so I cannot tell you what specific cold-start effects may be affecting this particular code. I gave examples of potential software and hardware effects above; such effects definitely exist.
Is there any JIT compilation going on in this app? Either by explicit invocation or implicitly because of a mismatch between the architecture target selected at compilation time and the actual architecture of the GPU you are running on?
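One way to check for the mismatch mentioned above is to query the actual compute capability of the GPU at run time and compare it against the architecture the code was compiled for; a sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Query device 0 and report its compute capability. If the binary
    // does not contain SASS for this architecture, the driver must JIT
    // from PTX on the first launch, which shows up as a one-time delay.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Device 0: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);
    // A P100 reports 6.0, so the build should target sm_60.
    return 0;
}
```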
There’s no such animal if you are using the CUDA runtime API and are compiling for the correct architecture. CDP is not a special JIT case.
The way to “pre-compile” CUDA kernels, CDP or otherwise, is to compile for the correct architecture, with an appropriate set of compile switches. For example,
-arch=sm_60
would correctly compile for a P100 to avoid JIT (when using the CUDA runtime API, which Thrust pretty much assumes).
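A hypothetical full build line along those lines, assuming a single source file named app.cu (the -rdc=true and -lcudadevrt options are required for any CDP code):

```shell
# Embed SASS for the P100 (sm_60) so no JIT occurs at load time;
# relocatable device code and the device runtime are needed for CDP.
nvcc -arch=sm_60 -rdc=true app.cu -o app -lcudadevrt
```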
Since I don’t see this delay when running on a Quadro 1200M, I’m wondering whether there might be a bug in the JIT determination logic that unnecessarily invokes JIT for the P100.