With a complex CUDA Dynamic Parallelism (CDP) Thrust function being launched by cudaLaunchKernel, the NVIDIA Visual Profiler shows an extended delay (1+ sec) for the first invocation, while cudaLaunchKernel does not even appear for subsequent launches. Subsequent invocations show a total of < 0.4 sec for execution of these functions.
I can only speculate that this is due to loading the Thrust libraries, as I’ve not seen it happen with custom kernels.
Does anyone here know why this happens? Is there some way to pre-load the Thrust libraries to cut this delay during operation?
I’m mainly referring to CUDA here, not Thrust. CUDA very often has a start-up delay that can be a large fraction of a second or longer. It isn’t unique or specific to Thrust. Since Thrust uses CUDA, if the first GPU-related thing your code does is some Thrust function, then that function may “absorb” the CUDA start-up delay.
This is just conjecture. There are many questions on these forums that pertain to CUDA start-up delay.
In this case, there are numerous calls to small custom CUDA kernels among many cudaMalloc and cudaMemcpyAsync calls prior to the cudaLaunchKernel CDP Thrust function that is exhibiting the >2x delay, which is why I suspected loading of the Thrust libraries.
I guess I don’t understand what constitutes CUDA startup, which normally exhibits a substantial delay.
The CUDA startup overhead is basically the cost of creating and initializing a CUDA context. Due to the use of a unified address space between host and device, this may be fairly significant if a system has lots of memory (system memory plus GPU memory) that needs to be mapped into the unified address space.
CUDA context initialization occurs on demand, i.e. typically with the first CUDA API call. In practice, this is often a cudaMalloc() call. Sometimes it is beneficial to force context creation at a known point in an application by a call to cudaFree(0). As you note, your code has CUDA API calls ahead of the Thrust usage, so the observed delay should not be due to CUDA context creation overhead.
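A minimal sketch of the cudaFree(0) idiom mentioned above, forcing context creation at a known point before any timing-sensitive work (the error handling here is just one possible style):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Force CUDA context creation now, rather than lazily on the
    // first "real" API call. cudaFree(0) is a common idiom for this.
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // ... rest of the application: allocations, kernel launches, etc.
    return 0;
}
```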
Generally in benchmarking and performance analysis, it is a best practice to ignore the first (or even the first few) invocations of any particular piece of code to avoid distorted measurements due to cold-start effects arising both from hardware (e.g. caches, TLBs) as well as software (e.g. JIT compilation). Typically the interest is in measuring steady-state performance.
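The warm-up practice described above might be sketched like this with CUDA events, where myKernel is just a placeholder for whatever kernel is being measured:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { /* placeholder work */ }

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    myKernel<<<1, 1>>>();          // cold-start launch: not timed, discarded
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    myKernel<<<1, 1>>>();          // steady-state launch: this one is timed
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("steady-state launch: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

In practice one would average over many timed launches, or take the fastest, as noted later in this thread.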
So are you saying the initial extended execution time shown in the profiler is not real?
Experimenting, I have shown that I can move that extended-time initial invocation into the initialization code and get normal profiled execution times where it matters in the data stream.
However, that is somewhat of a kludge, and I would prefer an official way to get the code initialized/cached.
I suspect that there is some code caching, but others (my supervisor) have suggested that there should not be any such effects.
Cold-start effects slowing down the performance of a piece of code are very much real, but they should not be relevant to performance analysis under most circumstances.
Performance analysis and/or benchmarking usually looks at steady-state performance, i.e. after warmup, and may in addition choose to look at the average of multiple instances or the fastest of multiple instances.
I don’t know your application and obviously haven’t profiled and analyzed it, so I cannot tell you what specific cold-start effects may be affecting this particular code. I gave examples of potential software and hardware effects above; such effects definitely exist.
Is there any JIT compilation going on in this app? Either by explicit invocation or implicitly because of a mismatch between the architecture target selected at compilation time and the actual architecture of the GPU you are running on?
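One way to check for the mismatch mentioned above is to query the actual compute capability of the GPU at run time and compare it against the architecture the code was compiled for; a sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Query device 0 and report its compute capability. If the binary
    // does not contain SASS for this architecture, the driver must JIT
    // from PTX on the first launch, which shows up as a one-time delay.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Device 0: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);
    // A P100 reports 6.0, so the build should target sm_60.
    return 0;
}
```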
There’s no such animal if you are using the CUDA runtime API and are compiling for the correct architecture. CDP is not a special JIT case.
The way to “pre-compile” CUDA kernels, CDP or otherwise, is to compile for the correct architecture, with an appropriate set of compile switches. For example,
-arch=sm_60
would correctly compile for a P100 to avoid JIT (when using the CUDA runtime API, which Thrust pretty much assumes).
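A hypothetical full build line along those lines, assuming a single source file named app.cu (the -rdc=true and -lcudadevrt options are required for any CDP code):

```shell
# Embed SASS for the P100 (sm_60) so no JIT occurs at load time;
# relocatable device code and the device runtime are needed for CDP.
nvcc -arch=sm_60 -rdc=true app.cu -o app -lcudadevrt
```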
Since I don’t see this delay when running on a Quadro 1200M, I’m wondering whether there might be a bug in the JIT determination logic that unnecessarily invokes JIT for the P100.