First invocation of complex thrust function has extended delay. Why?

With a complex CUDA Dynamic Parallelism (CDP) Thrust function launched via cudaLaunchKernel, the NV Profiler shows an extended delay (1+ sec) for the first invocation, yet cudaLaunchKernel does not even appear for subsequent launches. Subsequent invocations take a total of < 0.4 sec to execute.

I can only speculate that this is due to loading Thrust libraries, as I’ve not seen it happen with custom kernels.

Does anyone here know why this happens? Is there some way to pre-load thrust libraries to cut this delay during operation?

Thrust is a template/header library. There are no compiled libraries (at least, none associated purely with Thrust) to load.

Thanks Robert. So any idea why the first invocation takes over twice as long to execute?

When I add a dummy call during initialization, the same first call in the stream takes as long as subsequent calls, i.e. < 0.4 sec.

The first CUDA call usually incurs overhead. That sounds like what you are running into.

I’m not able to explain much without a complete sample code to look at.

Ok, that makes sense. I’m just not used to seeing anything like 2x execution time with custom kernels.

Does thrust normally have that much more overhead for the initial invocation?

Related to complexity?

I’m mainly referring to CUDA here, not Thrust. CUDA very often has a start-up delay that can be a large fraction of a second or longer. It isn’t unique or specific to Thrust. Since Thrust uses CUDA, if the first GPU-related thing your code does is some Thrust function, then that function may “absorb” the CUDA start-up delay.

This is just conjecture. There are many questions on the forums that pertain to CUDA start up delay.

In this case, there are numerous calls to small custom CUDA kernels among many cudaMalloc and cudaMemcpyAsync calls prior to the cudaLaunchKernel CDP Thrust function that is exhibiting the >2x delay, which is why I suspected loading of Thrust libraries.

I guess I don’t understand what constitutes cuda startup which normally exhibits substantial delay.

The CUDA startup overhead is basically the cost of creating and initializing a CUDA context. Due to the use of a unified address space between host and device, this may be fairly significant if a system has lots of memory (system memory plus GPU memory) that needs to be mapped into the unified address space.

CUDA context initialization occurs on demand, i.e. typically with the first CUDA API call. In practice, this is often a cudaMalloc() call. Sometimes it is beneficial to force context creation at a known point in an application by a call to cudaFree(0). As you note, your code has CUDA API calls ahead of the Thrust usage, so the observed delay should not be due to CUDA context creation overhead.
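A minimal sketch of the cudaFree(0) idiom mentioned above, forcing context creation at a known point before any performance-sensitive work (error handling simplified):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Force CUDA context creation up front. cudaFree(0) is a common
    // idiom; any runtime API call that touches the context would do.
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // ... performance-sensitive kernel launches follow, no longer
    // absorbing the context-creation overhead ...
    return 0;
}
```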

Generally in benchmarking and performance analysis, it is a best practice to ignore the first (or even the first few) invocations of any particular piece of code to avoid distorted measurements due to cold-start effects arising both from hardware (e.g. caches, TLBs) as well as software (e.g. JIT compilation). Typically the interest is in measuring steady-state performance.
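The warmup-then-measure pattern described above is language-agnostic; here is a minimal host-side sketch in plain C++ (the helper name steady_state_seconds is hypothetical) that discards the first few cold-start runs and reports the fastest of the remaining ones:

```cpp
#include <algorithm>
#include <chrono>
#include <functional>

// Time a workload, discarding the first `warmup` invocations so that
// cold-start effects (context creation, JIT compilation, cold caches)
// do not skew the measurement. Returns the fastest measured run, in
// seconds, as a proxy for steady-state performance.
double steady_state_seconds(const std::function<void()>& work,
                            int warmup = 3, int measured = 10) {
    for (int i = 0; i < warmup; ++i) work();  // cold-start runs, ignored
    double best = 1e300;
    for (int i = 0; i < measured; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        work();
        auto t1 = std::chrono::steady_clock::now();
        best = std::min(best,
                        std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}
```

Reporting the fastest run (rather than the average) is one common choice; averaging the post-warmup runs works as well, as noted above.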

Then it probably isn’t CUDA startup delay. It may be something else.

Another possibility is caching effects. Really just guessing.

So are you saying the initial extended execution time shown in the profiler is not real?

Through experimentation, I have shown that I can move that extended-time first invocation into the initialization code, so the profiler shows normal execution times where it matters in the data stream.

However, that is somewhat of a kludge, and I would prefer an official way to get the code initialized/cached.

I suspect that there is some code caching, but others (my supervisor) have suggested that there should not be any such effects.

Cold-start effects slowing down the performance of a piece of code are very much real, but they should not be relevant to performance analysis under most circumstances.

Performance analysis and/or benchmarking usually looks at steady-state performance, i.e. after warmup, and may in addition choose to look at the average of multiple instances or the fastest of multiple instances.

I don’t know your application and obviously haven’t profiled or analyzed it, so I cannot tell you which specific cold-start effects may be affecting this particular code. I gave examples of potential software and hardware effects above; such effects definitely exist.

Is there any JIT compilation going on in this app? Either by explicit invocation or implicitly because of a mismatch between the architecture target selected at compilation time and the actual architecture of the GPU you are running on?

Running on a P100, I compile for capabilities that include 6.0.

I am not familiar with the JIT compilation behavior of Thrust CUDA Dynamic Parallelism kernels, but I would not be surprised if that were happening.

Is there some way to pre-compile thrust CDP kernels that is not normally done at compile time?

There’s no such animal if you are using the CUDA runtime API and are compiling for the correct architecture. CDP is not a special JIT case.

The way to “pre-compile” CUDA kernels, CDP or otherwise, is to compile for the correct architecture, with an appropriate set of compile switches. For example,

-arch=sm_60

would correctly compile for a P100 to avoid JIT (when using the CUDA runtime API, which thrust pretty much assumes).

Command line includes -gencode=arch=compute_60,code="sm_60,compute_60" so there should be no JIT compile, right?

Correct (for Tesla P100).

Since I don’t see this delay when running on a Quadro 1200M, I’m wondering if there might be a bug in the JIT determination algorithm that unnecessarily invokes JIT for the P100.

Exceedingly unlikely.

Is there a way to determine whether JIT is being activated? Disable it?

JIT depends on the presence of PTX in your fatbinary.

By building your fatbinary/executable with no PTX, it is guaranteed that there will be no JIT.

There are many questions on the web that discuss this as well as the nvcc compiler command options that would be needed.

For example, for a compute capability 6.0 device, the following compile switch:

nvcc -gencode arch=compute_60,code=sm_60 …

will generate SASS for a compute capability 6.0 device with no PTX in the binary.

https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html

The presence or absence of SASS or PTX in a binary can be verified using the CUDA binary utilities, which are documented.

For example:

cuobjdump -sass …
cuobjdump -ptx …

https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html