Hi everyone,
I’m working on a project using Holoscan (in C++), and to integrate MatX with it, I’m using the CUDA stream passed between operators (since v2.9) like this:
```cpp
auto cuda_stream = op_input.receive_cuda_stream("input", true, false);
```
In one of my operators (the one right after an InferenceOp), I notice that `cudaStreamSynchronize(cuda_stream);` takes around 30 ms to complete.
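For context, here is a simplified sketch of the operator in question (the class name, port name, and the MatX body are placeholders for what my real operator does):

```cpp
#include <cuda_runtime.h>
#include <holoscan/holoscan.hpp>

class PostInferenceOp : public holoscan::Operator {
 public:
  HOLOSCAN_OPERATOR_FORWARD_ARGS(PostInferenceOp)

  void setup(holoscan::OperatorSpec& spec) override {
    spec.input<holoscan::TensorMap>("input");
  }

  void compute(holoscan::InputContext& op_input, holoscan::OutputContext& op_output,
               holoscan::ExecutionContext& context) override {
    auto in_message = op_input.receive<holoscan::TensorMap>("input").value();

    // The operator's internal stream, synchronized against any streams found
    // on the "input" port (Holoscan >= v2.9).
    auto cuda_stream = op_input.receive_cuda_stream("input", true, false);

    // ... MatX work launched on cuda_stream ...

    cudaStreamSynchronize(cuda_stream);  // this is the call that takes ~30 ms
  }
};
```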
My current hypothesis is that `cudaStreamSynchronize` is waiting for the previous stream's operations to finish, but the InferenceOp itself only takes about 20 ms to compute, so I don't fully understand where the extra time is coming from.
Does anyone know what could cause this extra delay in my `cudaStreamSynchronize` call?
Thanks!
Hi Valentin,
For the CUDA stream handling feature, each operator by default has its own internal CUDA stream. Calling `receive_cuda_stream` finds any streams on a given input port and synchronizes them to the operator's dedicated internal stream before returning that internal stream to the user. It will also automatically emit the operator's internal CUDA stream on all output ports of the operator. This ensures that any tensors with pending upstream work are ready for use within the operator once `receive_cuda_stream` has returned, and it notifies downstream operators that they may need to synchronize on that provided stream.

This would explain why you are seeing synchronization occur, but I don't know what the source of the extra delay would be (30 ms vs. the expected 20 ms). How were you measuring the 20 ms time? If you profile the app with Nsight Systems, do you see unexpected dead time after the InferenceOp kernels have completed?
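If Nsight is not convenient, one quick host-side check is to record a CUDA event on the stream at the very start of your operator's compute() and time how long that event takes to complete. That isolates how much upstream work was still pending on the stream when your operator started. A rough sketch (the function name is just a placeholder):

```cpp
#include <chrono>
#include <cstdio>

#include <cuda_runtime.h>

// Measures how long the host waits for everything already enqueued on
// `cuda_stream` (e.g. the upstream InferenceOp work) to finish.
void report_pending_work_ms(cudaStream_t cuda_stream) {
  cudaEvent_t pending_done;
  cudaEventCreate(&pending_done);

  // The event completes once all work enqueued on the stream so far is done.
  cudaEventRecord(pending_done, cuda_stream);

  auto t0 = std::chrono::steady_clock::now();
  cudaEventSynchronize(pending_done);
  auto t1 = std::chrono::steady_clock::now();

  std::printf("pending upstream work on this stream: %.2f ms\n",
              std::chrono::duration<double, std::milli>(t1 - t0).count());

  cudaEventDestroy(pending_done);
}
```

If that wait is close to 30 ms, the time really is being spent on work still pending on the stream; if it is much smaller, the delay is coming from somewhere else in your operator.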
The actual synchronization over the streams found on the input when calling `receive_cuda_stream` uses `cudaEventRecord` and `cudaStreamWaitEvent`, as in the code here.
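Conceptually, that per-stream synchronization follows the standard event pattern, something along these lines (a simplified illustration, not the actual SDK code):

```cpp
#include <cuda_runtime.h>

// Make `internal_stream` wait, on the device and without blocking the host,
// for all work currently enqueued on `found_stream`.
void order_after(cudaStream_t found_stream, cudaStream_t internal_stream) {
  cudaEvent_t ev;
  cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);

  cudaEventRecord(ev, found_stream);            // mark the current tail of the upstream work
  cudaStreamWaitEvent(internal_stream, ev, 0);  // internal stream waits for that point

  cudaEventDestroy(ev);  // safe: the dependency was captured by cudaStreamWaitEvent
}
```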
There is a separate `receive_cuda_streams` (with an s) that returns a `vector<cudaStream_t>` of the streams found on that input port. This version, by contrast, does NOT do any synchronization, and the user must handle it manually as needed. This latter version also does not automatically publish any stream ID on the output ports of the operator. The user will need to call `set_cuda_stream` as needed to send a stream ID to downstream operators to indicate that they may need to synchronize on it.
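In that manual mode, a compute() body would look roughly like this (a sketch only; check the SDK headers for the exact signatures and return types, and `my_stream_` stands for whatever stream your operator owns and launches its work on):

```cpp
void compute(holoscan::InputContext& op_input, holoscan::OutputContext& op_output,
             holoscan::ExecutionContext& context) override {
  // Streams found on the "input" port; unlike receive_cuda_stream, nothing is
  // synchronized for you and no stream is published on the output ports.
  auto found_streams = op_input.receive_cuda_streams("input");

  for (const auto& upstream : found_streams) {
    // Order my_stream_ after each found stream yourself, e.g. with the
    // cudaEventRecord / cudaStreamWaitEvent pattern sketched above.
    (void)upstream;  // placeholder for the actual synchronization
  }

  // ... launch this operator's work on my_stream_ ...

  // Explicitly tell downstream operators which stream they may need to
  // synchronize on before using the emitted data.
  op_output.set_cuda_stream(my_stream_, "output");
}
```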