I ran a TensorFlow GPU subgraph in a concurrent manner, meaning multiple operators run and call CUDA driver APIs concurrently, as shown in the figure below. I found from nsys that the duration of the CUDA driver API calls got extended.
Basically, there are three types of operators (a minimal sketch of this access pattern follows the list):
- the send operator, which calls cuMemcpyHtoDAsync to transfer data between host and device;
- self-defined kernels, which are launched by calling the CUDA driver API cuLaunchKernel;
- the batched BLAS API, which calls cuMemAlloc_v2 three times, cuMemcpyHtoDAsync three times, and then launches kernels such as gemvNSP_kernel.
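Here is a minimal sketch (not the actual TensorFlow code) of the access pattern: several host threads share one CUcontext and issue driver API calls concurrently, each on its own stream. The thread count, transfer size, and error-checking macro are illustrative assumptions.

```cpp
// Minimal sketch of the concurrent access pattern described above: several
// host threads share one CUcontext and issue driver API calls concurrently,
// each on its own stream. Sizes and thread count are assumptions.
#include <cuda.h>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

#define CHECK(call) do { CUresult r_ = (call); if (r_ != CUDA_SUCCESS) \
    fprintf(stderr, "driver API error %d at line %d\n", r_, __LINE__); } while (0)

int main() {
    CHECK(cuInit(0));
    CUdevice dev;   CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx;  CHECK(cuCtxCreate(&ctx, 0, dev));

    const size_t bytes = 1 << 20;                 // 1 MiB per transfer (assumption)
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {                 // 4 concurrent "operators"
        workers.emplace_back([&] {
            CHECK(cuCtxSetCurrent(ctx));          // all threads share the same context
            CUstream stream;
            CHECK(cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING));
            void* host = malloc(bytes);
            CUdeviceptr devPtr;
            CHECK(cuMemAlloc(&devPtr, bytes));                      // cf. cuMemAlloc_v2 in the trace
            CHECK(cuMemcpyHtoDAsync(devPtr, host, bytes, stream));  // cf. the send operator
            // a real operator would also call cuLaunchKernel here
            CHECK(cuStreamSynchronize(stream));
            CHECK(cuMemFree(devPtr));
            CHECK(cuStreamDestroy(stream));
            free(host);
        });
    }
    for (auto& w : workers) w.join();
    CHECK(cuCtxDestroy(ctx));
    return 0;
}
```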
I want to know why the driver API calls take longer. Is there any way to fix it?
Thanks for your advice.
Simultaneous access (e.g. from multiple threads) to the driver or runtime API may incur additional overhead. This is mentioned here:
Any CUDA API call may block or synchronize for various reasons such as contention for or unavailability of internal resources.
There isn’t anything you can do about it from a CUDA perspective. Obviously if you want to rearchitect your application, you may observe some change due to reduced contention, but this may result in worse performance due to other factors (e.g. serialization of work). There is no recipe to address this in the general case.
With a bit of searching you can find other, similar forum reports.
Thank you for your reply.
Besides, I occasionally found that the time of cuMemAlloc is extremely long, reaching several hundred ms, as shown below.
I found in the forum that the initialization of the CUcontext may take quite a long time. But from nsys, the initialization had already occurred earlier, and the application always uses the same context.
What is the CPU in this system? Memory allocations are basically pure host-side work, limited primarily by single-threaded CPU performance and to a much lesser degree by performance of the system memory. Your observations suggest potential use of a slow CPU. Allocation performance can also depend on the operating system. For example, on Windows with the default WDDM driver GPU memory allocations must be routed through an OS mechanism so the OS can always be in full control of GPU memory.
I do not claim any detailed knowledge of the CUDA memory allocators, but generally speaking, modern memory allocation mechanisms are constructed in layers, with the cost of allocations potentially increasing significantly when traversing layers. This can be triggered by a large allocation, for example, that the last-level allocator cannot handle.
It is best to keep memory allocation / de-allocation out of performance-critical sections of code. One strategy to do that is to perform all or most of the allocations up front and keep re-using buffers in performance-critical sections.
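A minimal sketch of that pattern with the driver API, assuming a single scratch buffer whose size is known up front; the buffer size and step count are placeholders:

```cpp
// Sketch: pay the cuMemAlloc cost once, outside the hot loop, then reuse
// the same buffer every iteration. Buffer size and step count are assumptions.
#include <cuda.h>
#include <vector>

int main() {
    cuInit(0);
    CUdevice dev;    cuDeviceGet(&dev, 0);
    CUcontext ctx;   cuCtxCreate(&ctx, 0, dev);
    CUstream stream; cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING);

    const size_t bytes = 16 << 20;               // 16 MiB scratch buffer (assumption)
    std::vector<char> host(bytes);

    CUdeviceptr scratch;
    cuMemAlloc(&scratch, bytes);                 // allocate once, up front

    for (int step = 0; step < 100; ++step) {
        // performance-critical section: only copies/kernels, no cuMemAlloc/cuMemFree
        cuMemcpyHtoDAsync(scratch, host.data(), bytes, stream);
        cuStreamSynchronize(stream);
    }

    cuMemFree(scratch);                          // free once, after the hot path
    cuStreamDestroy(stream);
    cuCtxDestroy(ctx);
    return 0;
}
```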
A cuMemAlloc operation is often synchronizing. That means it will not begin until the device is idle (no kernels running). Therefore, when viewed from an API perspective (i.e. the duration of the call as viewed from the host thread, one of the depictions given in nsys), it may appear "long" because it is waiting for an idle device. I don't know whether that applies to your case or not.
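One way to observe the effect described above is to time a device allocation from the host thread while a long-running kernel occupies the device. The sketch below uses the runtime API and a spin kernel purely for brevity; both are assumptions, not the original workload.

```cpp
// Sketch: time a device allocation from the host while a kernel is running.
// If the allocation synchronizes (as described above), the measured time
// includes the wait for an idle device. Spin length is an arbitrary assumption.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void busyKernel(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }       // keep the GPU busy for a while
}

int main() {
    busyKernel<<<1, 1>>>(2000000000LL);          // asynchronous launch; kernel keeps running

    auto t0 = std::chrono::steady_clock::now();
    void* p = nullptr;
    cudaMalloc(&p, 1 << 20);                     // may not return until the device is idle
    auto t1 = std::chrono::steady_clock::now();

    printf("cudaMalloc took %.1f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());

    cudaFree(p);
    cudaDeviceSynchronize();
    return 0;
}
```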
The usual advice I know of for this was already given by njuffa:
When/if that is unavoidable, use of memory pools may be of interest.
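If the memory pools in question are the stream-ordered allocator (cuMemAllocAsync / cuMemFreeAsync, available since CUDA 11.2), a minimal driver API sketch looks like the following; the release-threshold setting, sizes, and loop count are illustrative assumptions.

```cpp
// Sketch of the stream-ordered pool allocator (CUDA 11.2+): cuMemAllocAsync /
// cuMemFreeAsync serve allocations from a pool in stream order, so repeated
// alloc/free in the hot path avoids most of the cost of plain cuMemAlloc.
// The release threshold and sizes are illustrative assumptions.
#include <cuda.h>
#include <cstdint>

int main() {
    cuInit(0);
    CUdevice dev;    cuDeviceGet(&dev, 0);
    CUcontext ctx;   cuCtxCreate(&ctx, 0, dev);
    CUstream stream; cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING);

    // Keep freed blocks cached in the default pool instead of returning them
    // to the OS, so later allocations are served from the pool.
    CUmemoryPool pool;
    cuDeviceGetDefaultMemPool(&pool, dev);
    cuuint64_t threshold = UINT64_MAX;
    cuMemPoolSetAttribute(pool, CU_MEMPOOL_ATTR_RELEASE_THRESHOLD, &threshold);

    for (int step = 0; step < 100; ++step) {
        CUdeviceptr buf;
        cuMemAllocAsync(&buf, 1 << 20, stream);  // served from the pool after warm-up
        // ... enqueue copies / kernels that use buf on the same stream ...
        cuMemFreeAsync(buf, stream);             // returns the block to the pool
    }

    cuStreamSynchronize(stream);
    cuStreamDestroy(stream);
    cuCtxDestroy(ctx);
    return 0;
}
```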
Thanks for your advice.
Thank you, I will consider the memory pool solution.