Overhead of cudaEventRecord/cudaLaunchKernelExC in multithreading

Hi, folks.

I am developing a library to measure kernel durations during LLM training.
I record a start event on the main thread, launch the kernel via cudaLaunchKernelExC, record an end event, and then push the (start, end) pair to a background thread. The background thread processes the events and returns them to an event memory pool.
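A minimal sketch of this pattern, assuming a single stream; push_to_worker() and the event pool are hypothetical placeholders, not the actual library code:

```cuda
// Sketch of the measurement pattern described above.
cudaEvent_t start, stop;
cudaEventCreate(&start);            // in practice, taken from a pre-allocated pool
cudaEventCreate(&stop);

cudaEventRecord(start, stream);     // device timestamp before the kernel
cudaLaunchKernelExC(&config, (void*)kernel, args);
cudaEventRecord(stop, stream);      // device timestamp after the kernel

push_to_worker(start, stop);        // background thread later calls
                                    // cudaEventSynchronize()/cudaEventElapsedTime()
                                    // and returns the events to the pool
```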

I have found that the latency of cudaEventRecord/cudaLaunchKernelExC has huge variance: cudaLaunchKernelExC ranges from about 5 µs to 500 µs, and the slow cudaEventRecord/cudaLaunchKernelExC calls come from the NCCL background thread.

This image shows the latency of recording the start/end events (in microseconds) and the latency of pushing events to the work queue.

Do CPU context switches or other factors influence cudaEventRecord/cudaLaunchKernelExC on the host side?
How can I reduce the variance of these CUDA host functions?
Thanks.

Could you also state your testing environment, please: GPU, operating system, driver mode? Is a screen connected to the same GPU?

Also, could you compare your numbers against the results shown by Nsight Systems for those kernels?

Thanks for your reply.
The environment is:

  1. CUDA: 12.1 (cuda-compat)
  2. GPU: 8 × A100-SXM4
  3. OS: CentOS 7
  4. Driver: 470.82.01
  5. Driver mode: persistence mode enabled
  6. Screen connected: no
  7. CPU: 2 × 8369B, 128 logical cores
  8. System load: < 20

Other environment details:

I am using Megatron-LM. The slow cudaEventRecord/cudaLaunchKernelExC calls are NCCL operations issued from the NCCL background thread; once I remove them, all cudaEventRecord/cudaLaunchKernelExC calls run fine.

Can Nsight Systems capture the duration of cudaEventRecord/cudaLaunchKernelExC calls? I currently just use std::chrono::high_resolution_clock::now() to measure their duration.

And is the performance of cudaEventRecord/cudaLaunchKernelExC affected by GPU load, or by anything else?

In a multithreading/multi-process situation, it is typical behavior that the time cost of various CUDA API calls is variable. The general indication of this possibility is given here:

Any CUDA API call may block or synchronize for various reasons such as contention for or unavailability of internal resources. Such behavior is subject to change and undocumented behavior should not be relied upon.

You could measure the kernel running time on the device side to avoid the blocking/synchronization issue:

If you have access to the kernel source code, the kernels themselves could write three values per block: the SM they were running on, and the start and end times from the clock() function, which returns a per-SM time counter.
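A sketch of that suggestion, using clock64() (the 64-bit variant of clock()) and the %smid special register; the struct layout and output buffer are illustrative only:

```cuda
// Per-block device-side timing: one record per block, written by thread 0.
struct BlockTiming {
    unsigned smid;        // SM the block ran on
    long long start, end; // per-SM clock counter at block start/end
};

__global__ void timed_kernel(BlockTiming* out /*, kernel args */) {
    long long t0 = clock64();                 // per-SM counter at block start
    unsigned smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));  // which SM this block runs on

    // ... actual kernel work ...

    if (threadIdx.x == 0) {
        out[blockIdx.x].smid  = smid;
        out[blockIdx.x].start = t0;
        out[blockIdx.x].end   = clock64();    // per-SM counter at block end
    }
}
```

Note that the counters of different SMs are not synchronized with each other, so start/end pairs should only be differenced within the same SM.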

Thanks.

So, since the behavior is undocumented, is there any way to reduce the variance of CUDA API calls? Maybe give each thread its own local CUDA event objects?

Thanks for your advice.

I am developing an LD_PRELOAD library, so I cannot modify the source.

I’m not aware of any methods to do so. The usual advice I offer, if this is really important, is to launch all CUDA activity from a single thread. That is not applicable here. Alternatively, schedule the CUDA activity so that only one thread is launching work at any given time.

Thanks.

And another question: Megatron-LM sets CUDA_DEVICE_MAX_CONNECTIONS=1, which limits the number of work queues bridging the CPU and GPU sides. Could CUDA_DEVICE_MAX_CONNECTIONS=1 increase the host-side overhead of CUDA API calls in a multithreaded program?

CUDA_DEVICE_MAX_CONNECTIONS is described in the programming guide. It sets the number of hardware work queues (channels) used for CPU-to-GPU communication, which also affects stream behavior. I’m not aware of it having any impact on CUDA API overhead in any situation, but it can definitely affect the behavior of multi-stream CUDA applications in terms of work scheduling.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.