I have a C# exe that loads a CUDA-free C++ DLL on demand (while working on this project I changed it to always load at startup). That DLL demand-loads another DLL that contains all the CUDA code. The CUDA code is working great with no open issues, except that the 150-200 ms “first-call…

My “solution” remains stable and performant, so I’m going to close this out. I don’t know why this topic ever got as lengthy as it is. It seems to me that this is a basic issue (for any CUDA code with any significant latency requirements) and there should be a small batch of topics dating back to 2…

You can make a single allocation once and then, on each call, divide it among your data structures (sufficiently aligned) any way you want.
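The idea of carving one long-lived block into aligned pieces can be sketched in plain C++ as below. This is only an illustration under stated assumptions: the `Arena` class and its `take`/`reset` methods are hypothetical names, not part of the CUDA runtime, and the 256-byte default alignment is an assumption chosen to match the alignment that `cudaMalloc` is documented to guarantee. In a real program `base` would come from one `cudaMalloc` call made at startup; the arithmetic is the same for any pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: hands out aligned sub-ranges of one big allocation.
// It never allocates or frees memory itself; it only does pointer arithmetic.
class Arena {
public:
    Arena(void* base, std::size_t capacity)
        : base_(static_cast<std::uint8_t*>(base)), capacity_(capacity) {}

    // Returns an aligned pointer inside the block, or nullptr if the request
    // does not fit. `alignment` must be a non-zero power of two.
    void* take(std::size_t bytes, std::size_t alignment = 256) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::uintptr_t base_addr = reinterpret_cast<std::uintptr_t>(base_);
        const std::uintptr_t start = base_addr + used_;
        // Round the next free address up to a multiple of `alignment`.
        const std::uintptr_t aligned =
            (start + alignment - 1) & ~(static_cast<std::uintptr_t>(alignment) - 1);
        const std::size_t offset = static_cast<std::size_t>(aligned - base_addr);
        if (offset > capacity_ || bytes > capacity_ - offset) return nullptr;
        used_ = offset + bytes;
        return base_ + offset;
    }

    // Forget all sub-allocations; the underlying block is kept for reuse.
    void reset() { used_ = 0; }

    // Bytes consumed so far, including alignment padding.
    std::size_t used() const { return used_; }

private:
    std::uint8_t* base_;
    std::size_t capacity_;
    std::size_t used_ = 0;
};
```

A caller would typically call `reset()` at the start of each request and then `take()` once per data structure, so no allocation call happens on the latency-critical path. The sketch is not thread-safe and does not track individual frees.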

You’re talking about something as simple as allocating max-sized buffers on the gpu at app startup and keeping track of the pointers at the highest level? I suppose that might be a good speed optimization anyway for my most common use cases. But I still would like to know if my issue should be happe…

You have to find out which CUDA function call it is. I would try Nsight Systems, even though the call chain goes across several DLLs.
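If the profiler timeline is not available, a rough fallback for narrowing down which call is slow is to wrap each suspect call in a wall-clock timer. The sketch below is plain C++ and assumes nothing about the original code: `time_ms` and `log_ms` are hypothetical helper names, and the CUDA calls appear only in a comment as an assumed usage.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <utility>

// Hypothetical helper: runs `fn` once and returns elapsed wall-clock time in ms.
template <typename Fn>
double time_ms(Fn&& fn) {
    const auto t0 = std::chrono::steady_clock::now();
    std::forward<Fn>(fn)();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Hypothetical helper: times a call and prints one labelled line to stderr.
template <typename Fn>
double log_ms(const char* label, Fn&& fn) {
    assert(label != nullptr);
    const double ms = time_ms(std::forward<Fn>(fn));
    std::fprintf(stderr, "%-16s %9.3f ms\n", label, ms);
    return ms;
}

// Assumed usage inside the CUDA DLL (requires cuda_runtime.h, not compiled here):
//   void* d_buf = nullptr;
//   log_ms("cudaMalloc", [&] { cudaMalloc(&d_buf, bytes); });
//   log_ms("cudaMemcpy", [&] {
//       cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
//   });
```

Host-side timing like this only measures how long the call blocks the calling thread. Kernel launches return before the kernel finishes, so a timed region that includes a launch would need a `cudaDeviceSynchronize()` inside the lambda to be meaningful.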

I’ve actually got Nsight working for this project, at least to get the kernel metrics. Newbie question: How do I get Nsight to report the cudaMalloc and cudaMemcpy durations?

speenmail: How do I get Nsight to report the cudaMalloc and cudaMemcpy durations?

Assuming you are using Nsight Systems, memory allocation is illustrated here. If you are using Nsight Compute, that tool focuses solely on kernel metrics.

In figure 3 here, refer to the timeline row labelled “CUDA API”.

Thanks. That’s where I’ve been looking for about an hour, but I’m not seeing any of the good stuff under my process (or any other process). I’m seeing everything in Nsight Compute so Nsight is able to hook my process properly … but alas Nsight Compute is showing blanks in the duration column. The Ns…

You might get some help over on the Nsight Systems forum.

How to get the cuda "first-call overhead" to happen only once for cuda called from dll?


speenmail November 6, 2024, 6:51pm 25

I really wish I could see that “CUDA API” line, but I can’t. Any hints on how to get it? I followed these instructions: "When the Collect GPU Memory Usage option is selected from the Collect CUDA trace option set, Nsight Systems will track CUDA GPU memory allocations and deallocations and present a graph of this information in the timeline."
