What's the reason for extended CUDA driver API duration?

I ran a TensorFlow GPU subgraph in a concurrent manner, meaning multiple operators run and call CUDA driver APIs concurrently, as shown in the figure below. I found from nsys that the duration of the CUDA driver API calls got extended.
Basically, there are three main types of operators (a simplified sketch of this call pattern is shown below the list):
- send operators, which call cuMemcpyHtoDAsync to transfer data between host and device;
- self-defined kernels, which are launched by calling the CUDA driver API cuLaunchKernel;
- batched BLAS APIs, which call cuMemAlloc_v2 three times, cuMemcpyHtoDAsync three times, and then launch kernels like gemvNSP_kernel.
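For reference, the call pattern looks roughly like this (a simplified sketch, not the actual TensorFlow code; the thread count, buffer size, and error-check macro are placeholders I made up):

```cpp
// Several host threads share one CUcontext and issue driver API calls
// concurrently, which is the pattern traced in nsys.
#include <cuda.h>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

#define CHECK(call)                                           \
    do {                                                      \
        CUresult err_ = (call);                               \
        if (err_ != CUDA_SUCCESS) {                           \
            const char *msg = nullptr;                        \
            cuGetErrorString(err_, &msg);                     \
            std::printf("CUDA error: %s\n", msg);             \
            std::exit(1);                                     \
        }                                                     \
    } while (0)

int main() {
    CHECK(cuInit(0));
    CUdevice dev;
    CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx;
    CHECK(cuCtxCreate(&ctx, 0, dev));

    const size_t bytes = 1 << 20;          // placeholder buffer size
    std::vector<char> host(bytes, 0);

    auto worker = [&]() {
        CHECK(cuCtxSetCurrent(ctx));       // all threads share the same context
        CUstream stream;
        CHECK(cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING));
        CUdeviceptr dptr;
        CHECK(cuMemAlloc(&dptr, bytes));   // shows up as cuMemAlloc_v2 in the trace
        CHECK(cuMemcpyHtoDAsync(dptr, host.data(), bytes, stream));
        // ... cuLaunchKernel of the self-defined kernels would go here ...
        CHECK(cuStreamSynchronize(stream));
        CHECK(cuMemFree(dptr));
        CHECK(cuStreamDestroy(stream));
    };

    std::vector<std::thread> threads;
    for (int i = 0; i < 8; ++i) threads.emplace_back(worker);
    for (auto &t : threads) t.join();

    CHECK(cuCtxDestroy(ctx));
    return 0;
}
```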
I want to know why the driver API time is extended. Is there any way to fix it?
Thanks for your advice.

Simultaneous access (e.g. from multiple threads) to the driver or runtime API may incur additional overhead. This is mentioned here:

> Any CUDA API call may block or synchronize for various reasons such as contention for or unavailability of internal resources.

There isn’t anything you can do about it from a CUDA perspective. Obviously if you want to rearchitect your application, you may observe some change due to reduced contention, but this may result in worse performance due to other factors (e.g. serialization of work). There is no recipe to address this in the general case.

With a bit of searching you can find other, similar forum reports.

Thank you for your reply.
Besides, I occasionally found that the time of cuMemAlloc is extremely long, reaching several hundred ms, as shown below.
I found in the forum that the initialization of the CUcontext may take a long time. But, from nsys, the initialization had already occurred earlier, and the application always uses the same context.


What is the CPU in this system? Memory allocations are basically pure host-side work, limited primarily by single-threaded CPU performance and to a much lesser degree by performance of the system memory. Your observations suggest potential use of a slow CPU. Allocation performance can also depend on the operating system. For example, on Windows with the default WDDM driver GPU memory allocations must be routed through an OS mechanism so the OS can always be in full control of GPU memory.

I do not claim any detailed knowledge of the CUDA memory allocators, but generally speaking, modern memory allocation mechanisms are constructed in layers, with the cost of allocations potentially increasing significantly when traversing layers. This can be triggered by a large allocation, for example, that the last-level allocator cannot handle.

It is best to keep memory allocation / de-allocation out of performance-critical sections of code. One strategy to do that is to perform all or most of the allocations up front and keep re-using buffers in performance-critical sections.
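A minimal sketch of that strategy with the driver API (buffer size, loop count, and the error-check helper are placeholders, not taken from your application):

```cpp
#include <cuda.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

static void check(CUresult err) {
    if (err != CUDA_SUCCESS) { std::printf("CUDA error %d\n", (int)err); std::exit(1); }
}

int main() {
    check(cuInit(0));
    CUdevice dev;  check(cuDeviceGet(&dev, 0));
    CUcontext ctx; check(cuCtxCreate(&ctx, 0, dev));
    CUstream stream; check(cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING));

    const size_t bytes = 1 << 20;
    std::vector<char> host(bytes, 0);

    // Pay the allocation cost once, outside the performance-critical loop.
    CUdeviceptr dbuf;
    check(cuMemAlloc(&dbuf, bytes));

    for (int step = 0; step < 1000; ++step) {
        // Reuse the same device buffer every iteration: no cuMemAlloc/cuMemFree here.
        check(cuMemcpyHtoDAsync(dbuf, host.data(), bytes, stream));
        // ... launch the kernels that consume dbuf on the same stream ...
    }
    check(cuStreamSynchronize(stream));

    check(cuMemFree(dbuf));
    check(cuStreamDestroy(stream));
    check(cuCtxDestroy(ctx));
    return 0;
}
```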

A cuMemAlloc operation is often synchronizing. That means that it will not begin until the device is idle (no kernels running). Therefore, when viewed from an API perspective (i.e. the duration of the call as viewed from the host thread, which is one of the depictions given in nsys), it may appear "long" because it is waiting for an idle device. I don't know if that applies to your case or not.
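If you want to check whether that is what you are hitting, a rough experiment along these lines may help. It uses the runtime API for brevity; the assumption that cudaMalloc exercises the same allocation path as cuMemAlloc, and the spin-cycle count, are mine, not something verified for your setup:

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

// Busy-wait on the device for roughly the given number of clock cycles.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    // Warm up so context creation is not part of the measurement.
    void *warm = nullptr;
    cudaMalloc(&warm, 1 << 20);
    cudaFree(warm);

    // Launch a long-running kernel, then time an allocation issued while it runs.
    spin<<<1, 1>>>(2000000000LL);   // on the order of a second on many GPUs

    auto t0 = std::chrono::steady_clock::now();
    void *p = nullptr;
    cudaError_t err = cudaMalloc(&p, 1 << 20);   // may block until the device is idle
    auto t1 = std::chrono::steady_clock::now();

    long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("cudaMalloc: %s, took %lld ms\n", cudaGetErrorString(err), ms);

    cudaDeviceSynchronize();
    cudaFree(p);
    return 0;
}
```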

The usual advice I know of for this was already given by njuffa:

When/if that is unavoidable, use of memory pools may be of interest.
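For example, the stream-ordered allocator (cuMemAllocAsync / cuMemFreeAsync) keeps freed blocks in a pool, so repeated allocations in a loop avoid the heavyweight cuMemAlloc path. A minimal sketch, assuming CUDA 11.2 or newer and a device that supports memory pools; sizes, loop count, and the release threshold are placeholders:

```cpp
#include <cuda.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

static void check(CUresult err) {
    if (err != CUDA_SUCCESS) { std::printf("CUDA error %d\n", (int)err); std::exit(1); }
}

int main() {
    check(cuInit(0));
    CUdevice dev;  check(cuDeviceGet(&dev, 0));
    CUcontext ctx; check(cuCtxCreate(&ctx, 0, dev));
    CUstream stream; check(cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING));

    // Keep freed memory cached in the default pool instead of releasing it back
    // to the driver at every synchronization point (threshold is a placeholder).
    CUmemoryPool pool;
    check(cuDeviceGetDefaultMemPool(&pool, dev));
    cuuint64_t threshold = UINT64_MAX;
    check(cuMemPoolSetAttribute(pool, CU_MEMPOOL_ATTR_RELEASE_THRESHOLD, &threshold));

    const size_t bytes = 1 << 20;
    for (int step = 0; step < 100; ++step) {
        // Allocation and free are enqueued in stream order and served from the
        // pool, so the loop avoids repeated heavyweight cuMemAlloc calls.
        CUdeviceptr dbuf;
        check(cuMemAllocAsync(&dbuf, bytes, stream));
        // ... enqueue copies / kernels that use dbuf on the same stream ...
        check(cuMemFreeAsync(dbuf, stream));
    }
    check(cuStreamSynchronize(stream));

    check(cuStreamDestroy(stream));
    check(cuCtxDestroy(ctx));
    return 0;
}
```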

Thanks for your advice.

Thank you, I will consider the memory pool solution.