CUDA processing time spikes

I am working on a CUDA project that uses streams; the project is part of a larger, complex solution. I noticed processing time spikes in the CUDA part's execution time, measured as the sum of the asynchronous copy to device memory, the kernel launches, and the results copied back.

  • Initially I was creating and destroying streams (cudaStreamDestroy) at each cycle, but I had to remove that and reuse the streams instead, as it was creating processing time spikes that doubled in duration as the continuous running time grew (e.g. 15 ms at 10 min, 30 ms at 20 min), reaching magnitudes of up to a few seconds. (A minimal sketch of the reuse pattern follows this list.)
  • Using Nsight Systems I noticed the spikes tend to happen as I call cudaMemcpyAsync, so with more streams the probability of a spike increases. I wasn't able to reproduce the spikes under Nsight Compute, since it slows execution down and reduces the number of cycles.
  • The spikes are somehow linked to the number of streams I launch and to the other processes running on the CPU and the integrated graphics card. Their magnitude scales with the average processing time.
  • Increasing the number of streams has a greater impact on processing time than increasing the amount of data processed in each stream.
  • I mitigated most of the spikes by reducing the load on the CPU and GPU, but I am now left with a 25 ms spike (the usual processing time is 2 ms, rising to 10 ms as I increase the number of streams) whenever an InfluxDB query that moves data between buckets runs at the same time. Writing to the database is not an issue; only moving data between buckets is.
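
For reference, this is roughly the reuse pattern I ended up with — a minimal sketch, not my actual code; the stream count, buffer sizes, and kernel are placeholders. The sketch also keeps the host buffers pinned (cudaMallocHost), since cudaMemcpyAsync on pageable memory degrades to a staged, partly synchronous copy:

```cpp
// Minimal sketch: streams and buffers created once, reused every cycle.
// Stream count, sizes, and the kernel are placeholders.
#include <cuda_runtime.h>
#include <vector>

constexpr int kNumStreams = 4;            // placeholder stream count
constexpr size_t kChunkBytes = 1 << 20;   // 1 MiB per stream, placeholder

__global__ void process(float* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;           // stand-in for the real kernel
}

int main() {
    std::vector<cudaStream_t> streams(kNumStreams);
    std::vector<float*> hostBuf(kNumStreams), devBuf(kNumStreams);
    const size_t n = kChunkBytes / sizeof(float);
    const unsigned blocks = (unsigned)((n + 255) / 256);

    // Create streams and buffers ONCE, outside the processing loop.
    for (int s = 0; s < kNumStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        // Pinned host memory keeps cudaMemcpyAsync truly asynchronous.
        cudaMallocHost(&hostBuf[s], kChunkBytes);
        cudaMalloc(&devBuf[s], kChunkBytes);
    }

    // The long-running cycle: streams are reused, never destroyed here.
    for (int cycle = 0; cycle < 1000; ++cycle) {
        for (int s = 0; s < kNumStreams; ++s) {
            cudaMemcpyAsync(devBuf[s], hostBuf[s], kChunkBytes,
                            cudaMemcpyHostToDevice, streams[s]);
            process<<<blocks, 256, 0, streams[s]>>>(devBuf[s], n);
            cudaMemcpyAsync(hostBuf[s], devBuf[s], kChunkBytes,
                            cudaMemcpyDeviceToHost, streams[s]);
        }
        for (int s = 0; s < kNumStreams; ++s)
            cudaStreamSynchronize(streams[s]);  // end-of-cycle sync point
    }

    // Tear down once, at shutdown.
    for (int s = 0; s < kNumStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFreeHost(hostBuf[s]);
        cudaFree(devBuf[s]);
    }
    return 0;
}
```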

My questions are:

  1. What causes the impact of InfluxDB queries and other processes on my CUDA project, and how can I reduce it?
  2. Is there something better I can do with my streams while still keeping them fully asynchronous?

This is my first experience with CUDA, so I am interested in understanding whether there is a better approach to using streams, and what sorts of CPU processes are known to have a negative effect on a CUDA process.

I am using CUDA Toolkit 12.8 and Studio Driver 576.52. Only my solution and Unity are running on the NVIDIA card.

Many thanks.

Moved from CUDA GDB to NSYS.

Can you please provide more details on your OS?

Given you are using Unity, it is highly likely that the GPU GR engine (2D/3D/compute) and the Asynchronous Copy Engines are being time-sliced to work on other contexts, which can look like a duration spike.

The NSYS GPU context switch trace can show whether the GR engine is being time-sliced, but there is no support for tracing Asynchronous Copy Engine context switches.

Unity's influence is something I have considered. I tested with Unity on the integrated graphics card, but with that card maxed out it caused even more delays. We also tried without Unity, with data streamed via EtherCAT, and we got similar results in terms of spikes. I have of course adjusted the NVIDIA settings and various process performance options to reduce the delays. I noticed that processes on the CPU have a substantial effect: for example, during development, if I launch the application containing the CUDA project from VS I get spikes, but if I launch the executable directly I don't, even with VS open.

When profiling I observed that ACE2 was maxed out while ACE0 was not doing much. Is this something that can be balanced through settings?

Another concern: as I increase the amount of data, I launch more streams. The processing time increases, and with it the probability and magnitude of the spikes. I understand that preparing the streams carries overhead, but reducing the number of streams would mean increasing the number of inputs per stream, and the asynchronous copies would not change in number. Is there another approach you would suggest?
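
To make the trade-off concrete, here is a hypothetical sketch of a batched variant I am weighing (not my current code; the batch size, input size, and kernel are placeholders): several inputs share one contiguous pinned buffer per stream, so one large cudaMemcpyAsync replaces many small per-input copies, and fewer streams can cover the same total work:

```cpp
// Hypothetical batched variant: pack several inputs into one contiguous
// pinned buffer per stream, so one large copy replaces many small ones.
#include <cuda_runtime.h>

constexpr int    kInputsPerStream = 8;     // hypothetical batch size
constexpr size_t kInputFloats     = 4096;  // floats per input, placeholder
constexpr size_t kBatchFloats     = kInputsPerStream * kInputFloats;

__global__ void processBatch(float* batch, size_t floatsPerInput) {
    // One block per input; threads stride over that input's elements.
    float* input = batch + blockIdx.x * floatsPerInput;
    for (size_t i = threadIdx.x; i < floatsPerInput; i += blockDim.x)
        input[i] *= 2.0f;                  // stand-in for the real kernel
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float *hostBatch, *devBatch;
    cudaMallocHost(&hostBatch, kBatchFloats * sizeof(float));  // pinned
    cudaMalloc(&devBatch, kBatchFloats * sizeof(float));

    // One large H2D copy for the whole batch instead of one per input.
    cudaMemcpyAsync(devBatch, hostBatch, kBatchFloats * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    processBatch<<<kInputsPerStream, 256, 0, stream>>>(devBatch, kInputFloats);
    // One large D2H copy brings all the results back together.
    cudaMemcpyAsync(hostBatch, devBatch, kBatchFloats * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    cudaFreeHost(hostBatch);
    cudaFree(devBatch);
    cudaStreamDestroy(stream);
    return 0;
}
```

The idea would be that fewer, larger transfers cut per-call overhead, at the cost of waiting for a whole batch before its first result is available — I don't know yet whether that trade-off helps in my case.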

In terms of OS, I have seen it on a laptop and two PCs with different NVIDIA cards, all running Windows 11. The most urgent project runs on a PC with Windows 11 version 24H2, an NVIDIA GeForce RTX 4080 SUPER, and Intel® UHD Graphics 770.

@dofek FYI