CUDA library initialization

I have read about CUDA libraries needing time to initialize the first time they are called.

I have experienced this first hand using NPP and my own CUDA kernels, and what I have found is that whatever initializes one method or library doesn’t necessarily initialize another. For instance, recently I have been playing with Canny filter code. I created my own Canny filter with my own kernels, managing the data myself with cudaMalloc, cudaMemcpy, etc. I found that if, before I run my main code, I create a small array (10 ints), cudaMalloc it, and copy it up to and back from the GPU, then all of my subsequent filter code runs fast on every iteration. If I don’t, the first iteration runs on the order of 50x slower than succeeding iterations (~120ms vs ~3ms).
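Concretely, the priming step I am describing looks roughly like this (a minimal sketch; error checking is omitted and the buffer size is arbitrary):

```cpp
// Rough sketch of the warm-up done before the main filter loop.
// Allocating a tiny buffer and copying it to and from the GPU forces the
// CUDA runtime/context to initialize before the timed iterations start.
#include <cuda_runtime.h>

void primeCudaRuntime()
{
    const int n = 10;
    int hostBuf[n] = {0};
    int *devBuf = nullptr;

    cudaMalloc(&devBuf, n * sizeof(int));
    cudaMemcpy(devBuf, hostBuf, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(hostBuf, devBuf, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(devBuf);
    cudaDeviceSynchronize();
}
```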

Then I tried using nppiFilterCannyBorder_8u_C1R() instead. When I first tried it, it took so long that I didn’t pursue it, but after realizing that the “pump might need priming” I went back and tried again. Unfortunately, what I did for my kernel code didn’t work for the NPP code: I had to run nppiFilterCannyBorder_8u_C1R() a second time to get the performance I expected from an Nvidia library, and the improvement was dramatic. The first time through, the call took ~130ms; the second time, on the same image, it took ~30us (~4000x faster).
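For what it is worth, the priming I ended up with looks roughly like the sketch below. The thresholds, mask size, norm, and border mode are placeholder choices for illustration, and error checking is omitted; the point is simply that one throwaway call to the filter itself absorbs the first-call cost.

```cpp
// Sketch: one throwaway Canny call during setup so the first timed call
// no longer pays the ~100+ ms first-call cost. Parameter values are
// placeholders, not recommendations.
#include <npp.h>
#include <cuda_runtime.h>

void primeNppCanny(int width, int height)
{
    NppiSize  roi       = { width, height };
    NppiPoint srcOffset = { 0, 0 };

    int bufSize = 0;
    nppiFilterCannyBorderGetBufferSize(roi, &bufSize);

    Npp8u *dSrc = nullptr, *dDst = nullptr, *dScratch = nullptr;
    cudaMalloc(&dSrc, width * height);
    cudaMalloc(&dDst, width * height);
    cudaMalloc(&dScratch, bufSize);
    cudaMemset(dSrc, 0, width * height);   // contents don't matter for a warm-up

    // The throwaway call that absorbs the first-call overhead.
    nppiFilterCannyBorder_8u_C1R(dSrc, width, roi, srcOffset,
                                 dDst, width, roi,
                                 NPP_FILTER_SOBEL, NPP_MASK_SIZE_3_X_3,
                                 72, 256, nppiNormL2,
                                 NPP_BORDER_REPLICATE, dScratch);
    cudaDeviceSynchronize();

    cudaFree(dSrc);
    cudaFree(dDst);
    cudaFree(dScratch);
}
```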

So, what does one have to do to “initialize” an Nvidia library, and does each one need to be initialized separately, or is there a way to initialize the “system”? And what is the best way to initialize things? I found that calling nppiFilterCannyBorderGetBufferSize() didn’t do it for the Canny function; I had to call the filter function itself to get things going.

Also, why is this behavior not more visible? I could be looking in the wrong places or searching for the wrong terms, but I haven’t seen anything “official” about this, just entries in the forums. It seems to me this should be documented more prominently by Nvidia, and that there should be an “official” way to prime things that could be done during an initialization step, when speed doesn’t yet matter, if there isn’t one already.

If you want to get rid of all aspects of first-call overhead, it is potentially necessary to call each function that you intend to use twice. This is true for both library calls and kernel calls: the first call absorbs the overheads, so the second and subsequent calls run at steady-state speed. There isn’t any provided method to initialize the “system” as a whole, or any other way to do this.
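A minimal sketch of how you might verify this for one of your own kernels (the dummy kernel and the timing method are illustrative only):

```cpp
// Time the same kernel launch twice: the first launch may pay lazy
// initialization/module loading, the second shows the steady-state cost.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void dummyKernel(int *p) { if (p) p[0] = 0; }

static float timedLaunchMs(int *d)
{
    auto t0 = std::chrono::steady_clock::now();
    dummyKernel<<<1, 1>>>(d);
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<float, std::milli>(t1 - t0).count();
}

int main()
{
    printf("first  launch: %.3f ms\n", timedLaunchMs(nullptr));  // includes startup overhead
    printf("second launch: %.3f ms\n", timedLaunchMs(nullptr));  // steady-state launch cost
    return 0;
}
```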

If you would like to see a change to the CUDA documentation, you can file a bug. The bug reporting process is linked in a sticky post at the top of the CUDA programming forum.

Thank you for the answer.

What kind of “overheads” are involved when these functions are called, and once removed, how long does that removal persist? For instance, if I make several calls to a particular NPP function in an application and then don’t call it again for 2, 5, 10, etc. minutes, does the overhead remain cleared or does it have to be removed again?

What I am looking for is predictable behavior, and right now I don’t have a feel for how to predict how fast things will run at any particular time.

As an FYI, our application has an approximate 10-15 ms window to process a set of ~500x500 scans before the next set has to be processed to maintain performance, so a processing time of 100+ ms, which is what I first experienced with the NPP Canny filter function, is a non-starter, but the 30 us I am seeing on subsequent calls is great.

My expectation is that the repeatable behavior should begin after the first call of that function, and should persist (for that function) for the duration of the application.

None of this is spelled out or specified anywhere, so you’ll need to properly characterize and test your application to ensure the desired behavior.

The only anomalous examples I see in your descriptions line up exactly with my expectations: the first call to a function may be an outlier. Thereafter calls to that function are generally “predictable” in behavior.

Thanks again.

You should be aware that part of first-call overhead is likely memory allocator activity, and that is actually host-side work. Often, multiple layers of memory allocators are involved, and calls to the lowest, OS-level allocators can be particularly slow.

For applications that de-allocate and allocate memory repeatedly, allocation overhead can therefore also vary considerably while the application is running. No layer of the CUDA software stack (and none of the operating systems CUDA runs on) has real-time properties, and lazy initialization occurs in various places. At best you can get some soft real-time behavior if you take precautions (e.g. allocating all needed memory up front) and your real-time requirements are lax enough.
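As an illustration of the “allocate up front” precaution (a sketch only; buffer names and sizes are placeholders), the idea is to hoist every allocation out of the per-frame path:

```cpp
// Allocate all device buffers once at startup, reuse them for every frame,
// and free them only at shutdown, so no allocator activity occurs inside
// the timing-critical loop.
#include <cuda_runtime.h>
#include <cstddef>

struct FrameBuffers {
    unsigned char *dSrc = nullptr;
    unsigned char *dDst = nullptr;
};

void initBuffers(FrameBuffers &b, size_t imageBytes)
{
    cudaMalloc(&b.dSrc, imageBytes);   // done once, outside the real-time window
    cudaMalloc(&b.dDst, imageBytes);
}

void processFrame(FrameBuffers &b, const unsigned char *hostSrc,
                  unsigned char *hostDst, size_t imageBytes)
{
    cudaMemcpy(b.dSrc, hostSrc, imageBytes, cudaMemcpyHostToDevice);
    // ... launch filter kernels or NPP calls using b.dSrc / b.dDst ...
    cudaMemcpy(hostDst, b.dDst, imageBytes, cudaMemcpyDeviceToHost);
}

void freeBuffers(FrameBuffers &b)
{
    cudaFree(b.dSrc);
    cudaFree(b.dDst);
}
```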

In general, the initialization delays and other overheads inside the CUDA software stack depend primarily on the single-thread performance of your CPU. So you would want to choose a CPU with high single-thread performance, which implies a high base clock rate; I would recommend > 3.5 GHz base clock.

Make sure your own kernels and libraries are built for the GPU architecture(s) you want to deploy on, otherwise JIT compilation overhead will add to your troubles. That, too, will be limited mostly by single-thread performance of the host system.
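For example, assuming an sm_86 target and a purely illustrative source file name, building a fat binary that contains real machine code for the deployment architecture (plus PTX for forward compatibility) avoids the JIT step:

```
nvcc -O3 \
     -gencode arch=compute_86,code=sm_86 \
     -gencode arch=compute_86,code=compute_86 \
     canny_filter.cu -o canny_filter
```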