Unexpected Synchronization Behavior in Windows vs. Linux for CUDA Async Operations with Multiple Streams

When working with asynchronous operations (memory copies and kernel execution) across multiple streams, I observed a significant difference in behavior between Linux and Windows.

Linux Behavior (Expected):

  1. All async operations start immediately after their respective API calls without waiting for cudaStreamSynchronize() or cudaDeviceSynchronize().
  2. Operations on different streams run concurrently, and we can wait for their completion using cudaStreamSynchronize() or cudaDeviceSynchronize().

Windows Behavior (Unexpected):

  1. Async operations do not start immediately after their API calls. Instead, they only begin execution after calling cudaStreamSynchronize() or cudaDeviceSynchronize().
  2. All operations run concurrently only when using cudaDeviceSynchronize(). However, if cudaStreamSynchronize() is used for all streams, operations execute sequentially instead of concurrently.

Sample Code:
cudaStream_t stream[3];
CUDA_RT_CALL(cudaStreamCreateWithFlags(&stream[0], cudaStreamNonBlocking));
CUDA_RT_CALL(cudaStreamCreateWithFlags(&stream[1], cudaStreamNonBlocking));
CUDA_RT_CALL(cudaStreamCreateWithFlags(&stream[2], cudaStreamNonBlocking));

for (size_t i = 0; i < 16; i++) {
    // Start some async operations
    CUDA_RT_CALL(cudaMemcpyAsync(d_data, h_data, TEST_SIZE, cudaMemcpyHostToDevice, stream[0]));
    someHeavyKernel<<<1, 1024, 0, stream[1]>>>(d_out, d_out, TEST_SIZE);
    CUDA_RT_CALL(cudaMemcpyAsync(h_out, d_out, TEST_SIZE, cudaMemcpyDeviceToHost, stream[2]));

    // Optionally, sleep for a while
    std::this_thread::sleep_for(std::chrono::milliseconds(1));

    // Use Device Synchronize
    CUDA_RT_CALL(cudaDeviceSynchronize());

    // OR use Stream Synchronize (uncomment to test)
    // CUDA_RT_CALL(cudaStreamSynchronize(stream[1]));
    // CUDA_RT_CALL(cudaStreamSynchronize(stream[0]));
    // CUDA_RT_CALL(cudaStreamSynchronize(stream[2]));
}

Experimental Observations:
Windows (with cudaStreamSynchronize())
Figure 1: Operations do not run concurrently.
Figure 2: With a sleep added after the async API calls, operations start only after the sleep, not immediately after the API calls.

Windows (with cudaDeviceSynchronize())
Figure 3: All operations run concurrently.
Figure 4: With a sleep added, operations still wait until after the sleep to start, but they do run concurrently.

Linux (with cudaStreamSynchronize())
Figure 5: All operations run concurrently.
Figure 6: With a sleep added, operations start concurrently right after their API calls, without waiting for cudaStreamSynchronize().

Linux (with cudaDeviceSynchronize())
Figure 7: All operations run concurrently.
Figure 8: With a sleep added, operations start concurrently right after their API calls, without waiting for cudaDeviceSynchronize().

(I'm a new user, so the forum does not allow me to upload all of the Nsight screenshots.)

Would appreciate any insights on why Windows behaves differently and how to make async execution consistent across both platforms.

Here is Figure 2: Windows, cudaStreamSynchronize() with sleep() in between.

Here is Figure 6: Linux, cudaStreamSynchronize() with sleep() in between.

Figure 4: Windows with cudaDeviceSynchronize(); all operations run concurrently, but they start only after sleep().

Can you clarify whether you are using the WDDM driver or the TCC driver on Windows? If the WDDM driver, have you tried to repeat your experiments on a GPU supported by the TCC driver?
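If you are not sure which driver model is active, it can be queried programmatically. Here is a minimal sketch (not from the original post) using the `tccDriver` field of `cudaDeviceProp`, which the runtime sets to 1 under the TCC driver and 0 under WDDM (or on non-Windows platforms):

```cuda
// Query whether device 0 runs under the TCC or the WDDM driver model.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: %s driver model\n", prop.name, prop.tccDriver ? "TCC" : "WDDM");
    return 0;
}
```

The same information is also shown in the `nvidia-smi` summary table next to each GPU.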

To overcome the high overhead inherent in the WDDM driver, the CUDA driver may use batching of commands sent to the GPU, instead of sending each command individually right away. The release of each batch is based on internal heuristics not publicly documented in detail. This may explain your observations (at least in part).
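A workaround sometimes suggested for WDDM command batching is to call `cudaStreamQuery()` on each stream right after queuing work: the query does not block, but it tends to prompt the driver to flush its pending command batch to the GPU. A sketch, assuming the same `stream[]` setup as the sample code in the question (behavior is heuristic and not guaranteed):

```cuda
// Queue the async work as before ...
cudaMemcpyAsync(d_data, h_data, TEST_SIZE, cudaMemcpyHostToDevice, stream[0]);
someHeavyKernel<<<1, 1024, 0, stream[1]>>>(d_out, d_out, TEST_SIZE);
cudaMemcpyAsync(h_out, d_out, TEST_SIZE, cudaMemcpyDeviceToHost, stream[2]);

// ... then nudge the WDDM driver to submit the queued commands now.
// cudaStreamQuery() returns cudaErrorNotReady while work is pending,
// which is expected here; the return value is deliberately ignored.
for (int s = 0; s < 3; s++)
    (void)cudaStreamQuery(stream[s]);
```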

The TCC driver on Windows should behave similarly to the Linux driver, but it is not supported on all GPUs. Microsoft introduced the WDDM driver model to give the operating system maximum control over GPUs (potentially) involved in driving the GUI; its primary design goal is the stability of consumer systems, not performance. The CUDA driver working on top of WDDM has little direct control over the GPU and therefore has to use various workarounds to achieve good performance, which can include the batching of commands.


Thank you very much. I'm using an RTX 6000 Ada Generation, which defaults to the WDDM driver model. After switching to TCC mode using nvidia-smi.exe, the issue was resolved. As you mentioned, with the TCC driver on Windows the behavior is now similar to Linux; in Nsight, both timelines appear almost identical.
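For anyone else hitting this: the switch can be done with nvidia-smi's `-dm` (driver model) option. A sketch, assuming device index 0; run from an elevated (Administrator) prompt, and note that the GPU must not be driving a display and a reboot is required for the change to take effect:

```shell
# Set the driver model for GPU 0: 0 = WDDM, 1 = TCC.
nvidia-smi -i 0 -dm 1
```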
