Overhead of cudaEventRecord/cudaLaunchKernelExC in multithreading

Hi, folks.

I am developing a library to measure kernel durations during LLM training.
I record a start event on the main thread, launch the kernel via cudaLaunchKernelExC, record an end event, and then push the (start, end) pair to a background thread. The background thread processes the events and returns them to an event memory pool.
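A minimal sketch of this pattern, assuming a single stream; push_to_worker() and the event pool are hypothetical placeholders, not the actual library code:

```cuda
// Sketch of the measurement pattern described above.
cudaEvent_t start, stop;
cudaEventCreate(&start);            // in practice, taken from a pre-allocated pool
cudaEventCreate(&stop);

cudaEventRecord(start, stream);     // device timestamp before the kernel
cudaLaunchKernelExC(&config, (void*)kernel, args);
cudaEventRecord(stop, stream);      // device timestamp after the kernel

push_to_worker(start, stop);        // background thread later calls
                                    // cudaEventSynchronize()/cudaEventElapsedTime()
                                    // and returns the events to the pool
```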

I have found that the latency of cudaEventRecord/cudaLaunchKernelExC has huge variance: cudaLaunchKernelExC ranges from about 5 µs to 500 µs, and the slow cudaEventRecord/cudaLaunchKernelExC calls come from the NCCL background thread.

This image shows the latency of recording the start/end events (in microseconds) and the latency of pushing events to the work queue.

Do CPU context switches or other factors influence cudaEventRecord/cudaLaunchKernelExC on the host side?
How can I reduce the variance of these CUDA host functions?
Thanks.

Could you also state your testing environment, please: GPU, operating system, driver mode? Is a screen connected to the same GPU?

Also, could you compare your numbers against the results shown by Nsight Systems for those kernels?

Thanks for your reply.
The environment is:

  1. CUDA: 12.1 (cuda-compat)
  2. GPU: 8 × A100-SXM4
  3. OS: CentOS 7
  4. Driver: 470.82.01
  5. Driver mode: persistence mode enabled
  6. Screen connected: no
  7. CPU: 2 × 8369B, 128 logical cores
  8. System load: < 20

Other environment details:

I am using Megatron-LM. The slow cudaEventRecord/cudaLaunchKernelExC calls are NCCL operations issued from the NCCL background thread; once I remove them, all cudaEventRecord/cudaLaunchKernelExC calls run fine.

Can Nsight Systems capture the duration of cudaEventRecord/cudaLaunchKernelExC calls? I currently just use std::chrono::high_resolution_clock::now() to measure their duration.

And is the performance of cudaEventRecord/cudaLaunchKernelExC affected by GPU load, or by anything else?

In a multithreading/multi-process situation, it is typical behavior that the time cost of various CUDA API calls is variable. The general indication of this possibility is given here:

Any CUDA API call may block or synchronize for various reasons such as contention for or unavailability of internal resources. Such behavior is subject to change and undocumented behavior should not be relied upon.

You could measure the kernel running time on the device side to avoid the blocking/synchronization issue:

If you have access to the kernel source code, the kernels themselves could write three values per block: the SM they were running on, and the start and end times from the clock() function, which returns a per-SM time counter.
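A sketch of that suggestion, using clock64() (the 64-bit variant of clock()) and the %smid special register; the struct layout and output buffer are illustrative only:

```cuda
// Per-block device-side timing: one record per block, written by thread 0.
struct BlockTiming {
    unsigned smid;        // SM the block ran on
    long long start, end; // per-SM clock counter at block start/end
};

__global__ void timed_kernel(BlockTiming* out /*, kernel args */) {
    long long t0 = clock64();                 // per-SM counter at block start
    unsigned smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));  // which SM this block runs on

    // ... actual kernel work ...

    if (threadIdx.x == 0) {
        out[blockIdx.x].smid  = smid;
        out[blockIdx.x].start = t0;
        out[blockIdx.x].end   = clock64();    // per-SM counter at block end
    }
}
```

Note that the counters of different SMs are not synchronized with each other, so start/end pairs should only be differenced within the same SM.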

Thanks.

So, since the behavior is undocumented, is there any way to reduce the variance of CUDA API calls? Maybe give each thread its own local CUDA event objects?

Thanks for your advice.

I am developing an LD_PRELOAD library, so I cannot modify the source.

I’m not aware of any methods to do so. The usual advice I offer, if this is really important, is to launch all CUDA activity from a single thread. That is not applicable here. Alternatively, schedule the CUDA activity so that only one thread is launching work at any given time.

Thanks.

And another question: Megatron-LM sets CUDA_DEVICE_MAX_CONNECTIONS=1, which limits the number of work queues bridging the CPU and GPU sides. Could CUDA_DEVICE_MAX_CONNECTIONS=1 increase the host-side overhead of CUDA API calls in a multithreaded program?

CUDA_DEVICE_MAX_CONNECTIONS is described in the programming guide. It sets the number of hardware work queues (channels) used for CPU-to-GPU communication, which also affects stream behavior. I’m not aware of it having any impact on CUDA API overhead in any situation, but it can definitely affect the behavior of multi-stream CUDA applications in terms of work scheduling.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.