How to get the CUDA "first-call overhead" to happen only once for CUDA called from a DLL?

You’re talking about something as simple as allocating max-sized buffers on the GPU at app startup and keeping track of the pointers at the highest level? I suppose that might be a good speed optimization anyway for my most common use cases. But I would still like to know whether my issue should be happening at all, especially if it’s because of some dumb thing I’m doing out of ignorance. I was just running a batch of tests using teeny-tiny memory (like block size = 64, grid size = 400) and getting the same thing: CUDA wants to churn for that 150-250 ms, spread cumulatively over the first 10 calls, even though the entire time for a normal function call is about 1 ms.
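Something like the sketch below is what I picture, I assume? Just a sketch; the names and the max size are placeholders, and the exported functions would be called once from the host app at startup and shutdown.

```cpp
// Sketch of the persistent-buffer idea: allocate worst-case-sized device
// buffers once at app startup and reuse them for every call, instead of
// doing cudaMalloc/cudaFree inside the exported DLL function each time.
// g_dIn, g_dOut and kMaxElems are placeholder names.
#include <cuda_runtime.h>

static float* g_dIn  = nullptr;          // persistent device input buffer
static float* g_dOut = nullptr;          // persistent device output buffer
static const size_t kMaxElems = 1 << 24; // worst case I ever expect to need

extern "C" __declspec(dllexport) int InitGpuBuffers()
{
    // One-time allocations; this also forces CUDA context creation up front.
    if (cudaMalloc(&g_dIn,  kMaxElems * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&g_dOut, kMaxElems * sizeof(float)) != cudaSuccess) return -1;
    return 0;
}

extern "C" __declspec(dllexport) void FreeGpuBuffers()
{
    cudaFree(g_dOut);
    cudaFree(g_dIn);
}
```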

You have to find out which CUDA function call it is. I would try Nsight Systems, even if the trace goes across several DLLs.

I’ve actually got Nsight working for this project, at least to get the kernel metrics. Newbie question: How do I get Nsight to report the cudaMalloc and cudaMemcpy durations?

Assuming you are using Nsight Systems, memory allocation is illustrated here.

If you are using Nsight Compute, that tool focuses solely on kernel metrics.

I really wish I could see that “CUDA API” line, but I can’t. Any hints on how to get it? I followed these instructions: "When the Collect GPU Memory Usage option is selected from the Collect CUDA trace option set, Nsight Systems will track CUDA GPU memory allocations and deallocations and present a graph of this information in the timeline."

In Figure 3 here, refer to the timeline row labelled “CUDA API”.

Thanks. That’s where I’ve been looking for about an hour, but I’m not seeing any of the good stuff under my process (or any other process). I’m seeing everything in Nsight Compute, so Nsight is clearly able to hook my process properly … but alas, Nsight Compute is showing blanks in the duration column. The Nsight Compute API statistics are useful for making guesses (and they prove that the data SHOULD be available for viewing), but I would like to see more granularity for certainty.

You might get some help over on the Nsight Systems forum.

Thanks, trying that right now.

I’m working on getting the CUDA API line to show up in Nsight Systems, but in the meantime I’m gathering intelligence from what I can see. I’m not sure that profiling the API calls is really telling me anything, because if I eliminate one thing the same 150-250 ms churn just shows up somewhere else. Yesterday all that time was in cudaMemcpy(), so I got rid of the largest cudaMemcpy calls and then the same time just started showing up in cudaDeviceSynchronize(). I’m wondering if a picture like the one below tells anyone anything. This is showing my usual 10 identical repeats of the same method call, with the first one taking huge time as usual. My question is, does the “Render” and “Dma Packet” stuff tell me anything useful?

For testing, you could disable asynchronous kernel launches with CUDA_LAUNCH_BLOCKING=1.
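If changing the process environment is awkward, the variable can also be set from code, as long as it runs before the first CUDA call in the process; as far as I know the runtime reads it when the context is initialized. A minimal sketch using the MSVC CRT:

```cpp
// For testing only: force synchronous (blocking) kernel launches.
// Must execute before the very first CUDA runtime call in the process.
#include <cstdlib>

void EnableBlockingLaunches()
{
    _putenv_s("CUDA_LAUNCH_BLOCKING", "1");
}
```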

Which traces have you selected in Nsight Systems?

Pure speculation mode ON:

I wonder if you’re at the mercy of the Windows driver running in WDDM mode. I came across this, and also Section 2.4 of this states:

“TCC reduces the latency of CUDA kernel launches”

Are you in a position to try TCC mode? It requires that the card is not being used for graphics output and that it supports TCC mode.

I will try CUDA_LAUNCH_BLOCKING=1. My Nsight looks like this:

I’m doing this dev on GeForce RTX cards - those are impossible to get into TCC mode, right? I’ve got an old P100 I was going to fire up for this project; is something of that vintage worth testing on?

The docs say GeForce cards aren’t TCC-capable, but it’s easy enough to test with nvidia-smi.
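If it’s easier from code, the runtime also reports the driver model: cudaDeviceProp has a tccDriver field that, as I understand it, is 1 under the TCC driver and 0 under WDDM. Quick sketch:

```cpp
// Print whether device 0 is running under the TCC or the WDDM driver.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);
    std::printf("%s: %s driver\n", prop.name, prop.tccDriver ? "TCC" : "WDDM");
    return 0;
}
```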

If all else fails it may be worth trying, depending on how much trouble it is to set up.

Is testing on Linux an option?

This should be fine for measuring kernel launch overhead. Note that NVIDIA has always used counter-measures to deal with the high launch overhead caused by the WDDM driver architecture and has refined these over time. The last time I checked, the average cost of kernel launches was not much different between TCC and WDDM when using the same system, i.e. only driver changed. The disadvantage of using WDDM is significant variability of this overhead.

Your mileage may differ, so it will be best to run your own experiments if you want to assess the performance impact of TCC vs WDDM. To clarify the picture even more, try Linux, too. For meaningful results you would want to run all this as a controlled experiment, where only the driver or the OS changes.
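A controlled experiment for the launch-overhead part can be as small as timing back-to-back launches of an empty kernel from the CPU side; a rough sketch:

```cpp
// Rough measurement of average kernel launch cost as seen by the CPU:
// launch an empty kernel many times, synchronize once, divide by the count.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main()
{
    cudaFree(0);  // force context creation so it is not part of the timing

    const int N = 10000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i)
        emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    double avgUs = std::chrono::duration<double, std::micro>(t1 - t0).count() / N;
    std::printf("average launch + execute: %.2f us\n", avgUs);
    return 0;
}
```

Running the same binary under WDDM, TCC, and Linux with nothing else changed should show the variability mentioned above.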

I have no insights into CUDA / C# interop, as I have never used C#.

Tried CUDA_LAUNCH_BLOCKING=1, no difference.

GeForce RTX doesn’t support TCC even via hacking, I’m pretty sure.

Trying TCC and/or Linux would be fun and educational, but it would not really address my issue, since I definitely need to support Windows and consumer-grade GPUs.

Today I tried the persistent memory idea. It made my normal runtime just a bit faster, but had no impact on my problem with the first call taking forever. FWIW, I have never seen any of the memory allocations take a long time, just the memcpys.

I’m kind of sad - you guys on the NVIDIA forums have given me a lot of good help and information on several topics, so now I have blazing-fast CUDA code that is a lot higher quality than what I started with … except the whole project is 100% worthless unless I can get past this issue of slow first-call time.

I guess now I need to work on a minimal reproducible example to pinpoint where the thing breaks: get rid of the C# and the DLLs, just a C++ exe.
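Roughly what I’m picturing for the repro: a plain C++ exe that times 10 identical “DLL-style” calls (copy in, kernel, copy out, sync). The kernel and sizes below are just placeholders for my real ones.

```cpp
// Skeleton of the minimal repro: time 10 identical calls and print each
// call's wall-clock duration, looking for the big first-call outlier.
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void doWork(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;   // placeholder work
}

void oneCall(float* dBuf, std::vector<float>& host)
{
    int n = static_cast<int>(host.size());
    cudaMemcpy(dBuf, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    doWork<<<400, 64>>>(dBuf, n);                 // grid 400, block 64 as in my tests
    cudaMemcpy(host.data(), dBuf, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
}

int main()
{
    const int n = 400 * 64;
    std::vector<float> host(n, 1.0f);
    float* dBuf = nullptr;
    cudaMalloc(&dBuf, n * sizeof(float));

    for (int call = 0; call < 10; ++call) {
        auto t0 = std::chrono::steady_clock::now();
        oneCall(dBuf, host);
        auto t1 = std::chrono::steady_clock::now();
        std::printf("call %d: %.3f ms\n", call,
                    std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    cudaFree(dBuf);
    return 0;
}
```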


One situation where I have seen CUDA calls suddenly take longer in the “midst” of processing is when something causes CPU threads to be put to sleep, and/or the GPU is idle for an exceptionally long period of time, so that the power-management algorithm drops the P-state.
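If you want to check whether a P-state drop is involved, NVML can report the current performance state right before each call; a quick sketch (link against nvml.lib):

```cpp
// Log the GPU's current performance state (P0 = full clocks, higher = slower).
// Call this right before the CUDA call that is unexpectedly slow.
#include <cstdio>
#include <nvml.h>

void printPState()
{
    nvmlDevice_t dev;
    nvmlPstates_t p;
    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);
    if (nvmlDeviceGetPerformanceState(dev, &p) == NVML_SUCCESS)
        std::printf("current P-state: P%d\n", static_cast<int>(p));
    nvmlShutdown();
}
```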

Definitely no threads doing anything unusual; I’m running the test code from the UI thread. The GPU is not idle, since it’s running my monitors, plus this issue happens no matter how quickly you go back and run the test method again. The test method runs about 80 calls to the same CUDA code. Except in infrequent cases where the 1st and 2nd calls split the slowness between them, it’s always just the 1st call that is slow, and then the other 79 calls are very consistent and fast.

What is “an exceptionally long period of time”, roughly? I did an experiment where I do my 10 calls and sleep between each call. If I sleep for 100 ms, then I only get the overhead on the first call; the other 9 are as fast as they should be. If I sleep for 1000 ms, then I get the full overhead on each of the 10 calls. If I sleep for various times between 100 and 1000 ms, then I get various unpredictable blends of per-call overhead times.
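For reference, the experiment is just the repro loop with a configurable sleep between iterations; a sketch of the harness, where oneCall stands for the same kernel + memcpy call as in the repro sketch above:

```cpp
// Variant of the repro loop: sleep a configurable amount between calls to
// see at which idle time the per-call overhead comes back.
#include <chrono>
#include <cstdio>
#include <functional>
#include <thread>

void timedCallsWithSleep(const std::function<void()>& oneCall, int sleepMs)
{
    for (int call = 0; call < 10; ++call) {
        auto t0 = std::chrono::steady_clock::now();
        oneCall();
        auto t1 = std::chrono::steady_clock::now();
        std::printf("sleep %d ms, call %d: %.3f ms\n", sleepMs, call,
                    std::chrono::duration<double, std::milli>(t1 - t0).count());
        std::this_thread::sleep_for(std::chrono::milliseconds(sleepMs));
    }
}
```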