Unexpected Synchronization Behavior in Windows vs. Linux for CUDA Async Operations with Multiple Streams

When working with asynchronous operations (memory copies and kernel execution) across multiple streams, I observed a significant difference in behavior between Linux and Windows.

Linux Behavior (Expected):

  1. All async operations start immediately after their respective API calls without waiting for cudaStreamSynchronize() or cudaDeviceSynchronize().
  2. Operations on different streams run concurrently, and we can wait for their completion using cudaStreamSynchronize() or cudaDeviceSynchronize().

Windows Behavior (Unexpected):

  1. Async operations do not start immediately after their API calls. Instead, they only begin execution after calling cudaStreamSynchronize() or cudaDeviceSynchronize().
  2. All operations run concurrently only when using cudaDeviceSynchronize(). However, if cudaStreamSynchronize() is used for all streams, operations execute sequentially instead of concurrently.

Sample Code:
cudaStream_t stream[3];
CUDA_RT_CALL(cudaStreamCreateWithFlags(&stream[0], cudaStreamNonBlocking));
CUDA_RT_CALL(cudaStreamCreateWithFlags(&stream[1], cudaStreamNonBlocking));
CUDA_RT_CALL(cudaStreamCreateWithFlags(&stream[2], cudaStreamNonBlocking));

for (size_t i = 0; i < 16; i++) {
    // Start some async operations
    CUDA_RT_CALL(cudaMemcpyAsync(d_data, h_data, TEST_SIZE, cudaMemcpyHostToDevice, stream[0]));
    someHeavyKernel<<<1, 1024, 0, stream[1]>>>(d_out, d_out, TEST_SIZE);
    CUDA_RT_CALL(cudaMemcpyAsync(h_out, d_out, TEST_SIZE, cudaMemcpyDeviceToHost, stream[2]));

    // Optionally, sleep for a while
    std::this_thread::sleep_for(std::chrono::milliseconds(1));

    // Use Device Synchronize
    CUDA_RT_CALL(cudaDeviceSynchronize());

    // OR use Stream Synchronize (uncomment to test)
    // CUDA_RT_CALL(cudaStreamSynchronize(stream[1]));
    // CUDA_RT_CALL(cudaStreamSynchronize(stream[0]));
    // CUDA_RT_CALL(cudaStreamSynchronize(stream[2]));
}

Experimental Observations:
Windows (with cudaStreamSynchronize())
Figure 1: Operations do not run concurrently.
Figure 2: With a sleep added after the async API calls, operations start only after the sleep, not immediately after the API calls.

Windows (with cudaDeviceSynchronize())
Figure 3: All operations run concurrently.
Figure 4: With a sleep added, operations still wait until after the sleep to start, but they do run concurrently.

Linux (with cudaStreamSynchronize())
Figure 5: All operations run concurrently.
Figure 6: With a sleep added, operations start concurrently right after their API calls, without waiting for cudaStreamSynchronize().

Linux (with cudaDeviceSynchronize())
Figure 7: All operations run concurrently.
Figure 8: With a sleep added, operations start concurrently right after their API calls, without waiting for cudaDeviceSynchronize().

(I'm a new user, so the forum does not allow me to upload all of the Nsight screenshots.)

Would appreciate any insights on why Windows behaves differently and how to make async execution consistent across both platforms.

Here is Figure 2: Windows, cudaStreamSynchronize() with sleep() in between.

Here is Figure 6: Linux, cudaStreamSynchronize() with sleep() in between.

Figure 4: Windows with cudaDeviceSynchronize(); all operations run concurrently, but they start only after sleep().

Can you clarify whether you are using the WDDM driver or the TCC driver on Windows? If the WDDM driver, have you tried to repeat your experiments on a GPU supported by the TCC driver?
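If you are not sure which driver model is active, it can be queried programmatically. Here is a minimal sketch (not from the original post) using the `tccDriver` field of `cudaDeviceProp`, which the runtime sets to 1 under the TCC driver and 0 under WDDM (or on non-Windows platforms):

```cuda
// Query whether device 0 runs under the TCC or the WDDM driver model.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: %s driver model\n", prop.name, prop.tccDriver ? "TCC" : "WDDM");
    return 0;
}
```

The same information is also shown in the `nvidia-smi` summary table next to each GPU.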

To overcome the high overhead inherent in the WDDM driver, the CUDA driver may use batching of commands sent to the GPU, instead of sending each command individually right away. The release of each batch is based on internal heuristics not publicly documented in detail. This may explain your observations (at least in part).
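A workaround sometimes suggested for WDDM command batching is to call `cudaStreamQuery()` on each stream right after queuing work: the query does not block, but it tends to prompt the driver to flush its pending command batch to the GPU. A sketch, assuming the same `stream[]` setup as the sample code in the question (behavior is heuristic and not guaranteed):

```cuda
// Queue the async work as before ...
cudaMemcpyAsync(d_data, h_data, TEST_SIZE, cudaMemcpyHostToDevice, stream[0]);
someHeavyKernel<<<1, 1024, 0, stream[1]>>>(d_out, d_out, TEST_SIZE);
cudaMemcpyAsync(h_out, d_out, TEST_SIZE, cudaMemcpyDeviceToHost, stream[2]);

// ... then nudge the WDDM driver to submit the queued commands now.
// cudaStreamQuery() returns cudaErrorNotReady while work is pending,
// which is expected here; the return value is deliberately ignored.
for (int s = 0; s < 3; s++)
    (void)cudaStreamQuery(stream[s]);
```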

The TCC driver on Windows should behave similarly to the Linux driver, but it is not supported on all GPUs. Microsoft introduced the WDDM driver model to give the operating system maximum control over GPUs (potentially) involved in driving the GUI; its primary design goal is the stability of consumer systems, not performance. The CUDA driver working on top of WDDM has little direct control over the GPU and therefore has to use various workarounds to achieve good performance, which can include the batching of commands.


Thank you very much. I'm using an RTX 6000 Ada Generation, which defaults to the WDDM driver model. After switching to TCC mode using nvidia-smi.exe, the issue was resolved. As you mentioned, with the TCC driver on Windows the behavior is now similar to Linux; in Nsight, both timelines appear almost identical.
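For anyone else hitting this: the switch can be done with nvidia-smi's `-dm` (driver model) option. A sketch, assuming device index 0; run from an elevated (Administrator) prompt, and note that the GPU must not be driving a display and a reboot is required for the change to take effect:

```shell
# Set the driver model for GPU 0: 0 = WDDM, 1 = TCC.
nvidia-smi -i 0 -dm 1
```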
