CUDA processing time spikes

I am working on a CUDA project that uses streams; the project is part of a larger, complex solution. I noticed processing time spikes in the CUDA part's execution time, measured as the sum of the asynchronous copy to device memory, the kernel launches, and the results copied back.

  • Initially I was creating and destroying streams (cudaStreamDestroy) at each cycle, but I had to remove that and reuse the streams instead, as it was creating processing time spikes that doubled in duration as the continuous running time grew (e.g. 15 ms at 10 min, 30 ms at 20 min), reaching magnitudes of up to a few seconds. (A minimal sketch of the reuse pattern follows this list.)
  • Using Nsight Systems I noticed the spikes tend to happen as I call cudaMemcpyAsync, so with more streams the probability of a spike increases. I wasn't able to reproduce the spikes under Nsight Compute, since it slows execution down and reduces the number of cycles.
  • The spikes are somehow linked to the number of streams I launch and to the other processes running on the CPU and the integrated graphics card. Their magnitude scales with the average processing time.
  • Increasing the number of streams has a greater impact on processing time than increasing the amount of data processed in each stream.
  • I mitigated most of the spikes by reducing the load on the CPU and GPU, but I am now left with a 25 ms spike (the usual processing time is 2 ms, rising to 10 ms as I increase the number of streams) whenever an InfluxDB query that moves data between buckets runs at the same time. Writing to the database is not an issue; only moving data between buckets is.
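
For reference, this is roughly the reuse pattern I ended up with — a minimal sketch, not my actual code; the stream count, buffer sizes, and kernel are placeholders. The sketch also keeps the host buffers pinned (cudaMallocHost), since cudaMemcpyAsync on pageable memory degrades to a staged, partly synchronous copy:

```cpp
// Minimal sketch: streams and buffers created once, reused every cycle.
// Stream count, sizes, and the kernel are placeholders.
#include <cuda_runtime.h>
#include <vector>

constexpr int kNumStreams = 4;            // placeholder stream count
constexpr size_t kChunkBytes = 1 << 20;   // 1 MiB per stream, placeholder

__global__ void process(float* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;           // stand-in for the real kernel
}

int main() {
    std::vector<cudaStream_t> streams(kNumStreams);
    std::vector<float*> hostBuf(kNumStreams), devBuf(kNumStreams);
    const size_t n = kChunkBytes / sizeof(float);
    const unsigned blocks = (unsigned)((n + 255) / 256);

    // Create streams and buffers ONCE, outside the processing loop.
    for (int s = 0; s < kNumStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        // Pinned host memory keeps cudaMemcpyAsync truly asynchronous.
        cudaMallocHost(&hostBuf[s], kChunkBytes);
        cudaMalloc(&devBuf[s], kChunkBytes);
    }

    // The long-running cycle: streams are reused, never destroyed here.
    for (int cycle = 0; cycle < 1000; ++cycle) {
        for (int s = 0; s < kNumStreams; ++s) {
            cudaMemcpyAsync(devBuf[s], hostBuf[s], kChunkBytes,
                            cudaMemcpyHostToDevice, streams[s]);
            process<<<blocks, 256, 0, streams[s]>>>(devBuf[s], n);
            cudaMemcpyAsync(hostBuf[s], devBuf[s], kChunkBytes,
                            cudaMemcpyDeviceToHost, streams[s]);
        }
        for (int s = 0; s < kNumStreams; ++s)
            cudaStreamSynchronize(streams[s]);  // end-of-cycle sync point
    }

    // Tear down once, at shutdown.
    for (int s = 0; s < kNumStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFreeHost(hostBuf[s]);
        cudaFree(devBuf[s]);
    }
    return 0;
}
```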

My questions are:

  1. What causes the impact of InfluxDB queries and other processes on my CUDA project, and how can I reduce it?
  2. Is there something better I can do with my streams while still keeping them fully asynchronous?

This is my first experience with CUDA, so I am interested in understanding whether there is a better approach to using streams, and what sorts of CPU processes are known to have a negative effect on a CUDA process.

I am using CUDA Toolkit 12.8 and Studio Driver 576.52. Only my solution and Unity are running on the NVIDIA card.

Many thanks.

Moved from CUDA GDB to NSYS.

Can you please provide more details on your OS?

Given you are using Unity, it is highly likely that the GPU GR engine (2D/3D/compute) and the Asynchronous Copy Engines are being time-sliced to work on other contexts, which can look like a duration spike.

The NSYS GPU context switch trace can show whether the GR engine is being time-sliced, but there is no support for tracing Asynchronous Copy Engine context switches.

Unity's influence is something I have considered. I tested with Unity on the integrated graphics card, but with that card maxed out it caused even more delays. We also tried without Unity, with data streamed via EtherCAT, and we got similar results in terms of spikes. I have of course adjusted the NVIDIA settings and various process performance options to reduce the delays. I noticed that processes on the CPU have a substantial effect: for example, during development, if I launch the application containing the CUDA project from VS I get spikes, but if I launch the executable directly I don't, even with VS open.

When profiling I observed that ACE2 was maxed out while ACE0 was not doing much. Is this something that can be balanced through settings?

Another concern: as I increase the amount of data, I launch more streams. The processing time increases, and with it the probability and magnitude of the spikes. I understand that preparing the streams carries overhead, but reducing the number of streams would mean increasing the number of inputs per stream, and the asynchronous copies would not change in number. Is there another approach you would suggest?
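
To make the trade-off concrete, here is a hypothetical sketch of a batched variant I am weighing (not my current code; the batch size, input size, and kernel are placeholders): several inputs share one contiguous pinned buffer per stream, so one large cudaMemcpyAsync replaces many small per-input copies, and fewer streams can cover the same total work:

```cpp
// Hypothetical batched variant: pack several inputs into one contiguous
// pinned buffer per stream, so one large copy replaces many small ones.
#include <cuda_runtime.h>

constexpr int    kInputsPerStream = 8;     // hypothetical batch size
constexpr size_t kInputFloats     = 4096;  // floats per input, placeholder
constexpr size_t kBatchFloats     = kInputsPerStream * kInputFloats;

__global__ void processBatch(float* batch, size_t floatsPerInput) {
    // One block per input; threads stride over that input's elements.
    float* input = batch + blockIdx.x * floatsPerInput;
    for (size_t i = threadIdx.x; i < floatsPerInput; i += blockDim.x)
        input[i] *= 2.0f;                  // stand-in for the real kernel
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float *hostBatch, *devBatch;
    cudaMallocHost(&hostBatch, kBatchFloats * sizeof(float));  // pinned
    cudaMalloc(&devBatch, kBatchFloats * sizeof(float));

    // One large H2D copy for the whole batch instead of one per input.
    cudaMemcpyAsync(devBatch, hostBatch, kBatchFloats * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    processBatch<<<kInputsPerStream, 256, 0, stream>>>(devBatch, kInputFloats);
    // One large D2H copy brings all the results back together.
    cudaMemcpyAsync(hostBatch, devBatch, kBatchFloats * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaFreeHost(hostBatch);
    cudaFree(devBatch);
    cudaStreamDestroy(stream);
    return 0;
}
```

The idea would be that fewer, larger transfers cut per-call overhead, at the cost of waiting for a whole batch before its first result is available — I don't know yet whether that trade-off helps in my case.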

In terms of OS, I have seen it on a laptop and two PCs with different NVIDIA cards, all running Windows 11. The most urgent project runs on a PC with Windows 11 version 24H2, an NVIDIA GeForce RTX 4080 SUPER, and Intel® UHD Graphics 770.

@dofek FYI