Currently, I am implementing a single-host-threaded, multi-GPU image processing CUDA program. Approximately 512 images are processed in a loop, where each iteration involves host-device transfer using cudaMemcpy2DToArrayAsync and two kernel launch calls. Finally, after the loop, there is a device-host transfer using cudaMemcpyAsync. When I used a single stream for each GPU (a total of 2 streams), the profiles showed the expected results, as depicted in the figure below.
Later, I changed to using multiple streams for upload (host-device transfer), compute (kernel calls), and download (device-host transfer) (a total of 6 streams for all GPUs). I found that the launch latency of a CUDA host API call cudaMemcpy2DToArrayAsync is unusually high, as shown below.
To synchronize execution, cudaStreamWaitEvent is used, where the CUDA events are created using the flag cudaEventDisableTiming. I have not found any performance degradation due to GPU stall. Is it safe to ignore this issue especially considering future scalability?
For clarity, could you state the actual number(s)?
I am confused to see “launch latency” associated with a call to cudaMemcpy2DToArrayAsync, since this API call does not result in a kernel launch, best I know.
I am curious to understand why this occurs when I use multiple streams and events. Memory transfer overhead can block the host thread from launching additional CUDA calls. If this blocking occurs for many CUDA host API calls, the GPU may stall, resulting in degraded performance. In this aspect, is it better to use single stream without CUDA events and associated synchronization if it is possible to do so?
I am afraid I don’t know what “this” refers to. You are looking at [data-item] which looks odd to you because you are observing [value] while you are expecting [other-value] because of [reason]. Knowing the mapping for the formal arguments would facilitate discussion.
Should the concern be about any “maximum value”, I would generally ignore that. There are many “cold start” effects which can lead to high “maximum value” readings. Important is the steady-state performance, of which “minimum” and “median” give a good first-order indication. In the case you pulled out they are fairly close together but there is some variability, though nothing to be alarmed about.
Some general notes: [1] Contiguous 1D copies should be preferred over strided 2D copies. Strided copies achieve only a fraction of the effective bandwidth of a contiguous copy, depending on the stride. In the worst case the performance difference can be very pronounced, like 10x. [2] PCIe is a packetized transport that is not very efficient for small data transfers, i.e. effective bandwidth is low for small transfers. The corollary to this is that effective bandwidth is improved when several small data items are batched together for host<->device transfers. Maximum PCIe bandwidth is typically achieved when each transfer comprises several MB of data.
OK. Thank you very much for the reply. Yes, I was referring to the maximum value.
Regarding the term launch latency, is it primarily used for kernel launch calls or any asyhchronous task including memory transfer.? As per the following blog post, “Launch latency sometimes called induction time, is the time between requesting an asynchronous task and beginning to execute it. This definition includes the time of the launch API call.“ Understanding the Visualization of Overhead and Latency in NVIDIA Nsight Systems | NVIDIA Technical Blog
I only know of “launch overhead” with respect to the launching of kernels on the GPU. That was the original terminology of the CUDA team of which I was a founding member. The launch overhead is what limits the issue rate of kernel launches, about 200K launches per second at most on early CUDA GPUs, corresponding to 5 microseconds of overhead per kernel launch. Best I know, the maximum kernel launch rate with modern hardware today is about 500K kernel launches per second.
It is possible that subsequent generations of NVIDIA engineers have modified and extended the original terminology. I retired from NVIDIA in 2014, and while I still use CUDA regularly, I am not using it with an intensity that would prompt me to follow all of the latest developments in detail. For what it is worth, I have never come across the term “induction time”.
If I create the profile with the option --cuda-trace-all-apis=true, I observe that the problem is present even for single stream scenario as shown in the following figure. However, it shows cudaSetDevice is taking so long:
It appears that the aforementioned issue can be replicated in both single-GPU and single-stream implementations. I suspect that the problem arises from the high frequency of cudaEventRecord and/or cudaStreamWaitEvent calls. Additionally, I have discovered that switching the GPU mode from WDDM to TCC can resolve the issue.
I’m benchmarking mixed-precision execution paths on a Tesla T4 and observing a consistent FP16 vs FP32 latency delta that appears to align with Tensor Core engagement. I’d appreciate guidance on definitive confirmation methods on Turing GPUs.
Environment
GPU: Tesla T4 (16 GB)
Driver: 550.54.15
CUDA Toolkit: 12.5 (nvcc 12.5.82)
Python: 3.12.12
Benchmark Methodology
CUDA event–based timing (cudaEventRecord)
Explicit cudaDeviceSynchronize() before and after timing
Warmup iterations performed prior to measurement
Reported values are average per-iteration latency
Results
FP32: 354.92 ms
FP16: 40.68 ms
Speedup: ~8.7Ă—
The workload is shape-stable and the speedup is reproducible across runs under the same configuration.
Question
On Turing (Tesla T4), aside from using Nsight Compute, is there a recommended or lightweight way to conclusively confirm:
Tensor Core utilization, or
Whether the selected kernels are Tensor Core–accelerated (e.g., TensorOp paths in cuBLAS/cuDNN)?
I’m particularly interested in best practices for verifying this in minimal or headless benchmarking setups.