When working with asynchronous operations (memory copies and kernel execution) across multiple streams, I observed a significant difference in behavior between Linux and Windows.
Linux Behavior (Expected):
- All async operations start immediately after their respective API calls, without waiting for cudaStreamSynchronize() or cudaDeviceSynchronize().
- Operations on different streams run concurrently, and we can wait for their completion using cudaStreamSynchronize() or cudaDeviceSynchronize().
Windows Behavior (Unexpected):
- Async operations do not start immediately after their API calls. Instead, they only begin execution once cudaStreamSynchronize() or cudaDeviceSynchronize() is called.
- Operations run concurrently only when cudaDeviceSynchronize() is used. If cudaStreamSynchronize() is called on each stream instead, the operations execute sequentially rather than concurrently.
Sample Code:
    // Needs <chrono> and <thread> for the optional sleep below.
    cudaStream_t stream[3];
    for (int s = 0; s < 3; s++) {
        CUDA_RT_CALL(cudaStreamCreateWithFlags(&stream[s], cudaStreamNonBlocking));
    }

    for (size_t i = 0; i < 16; i++) {
        // Start some async operations, one per stream
        CUDA_RT_CALL(cudaMemcpyAsync(d_data, h_data, TEST_SIZE, cudaMemcpyHostToDevice, stream[0]));
        someHeavyKernel<<<1, 1024, 0, stream[1]>>>(d_out, d_out, TEST_SIZE);
        CUDA_RT_CALL(cudaGetLastError());  // catch kernel launch errors
        CUDA_RT_CALL(cudaMemcpyAsync(h_out, d_out, TEST_SIZE, cudaMemcpyDeviceToHost, stream[2]));

        // Optionally, sleep for a while
        std::this_thread::sleep_for(std::chrono::milliseconds(1));

        // Use Device Synchronize
        CUDA_RT_CALL(cudaDeviceSynchronize());

        // OR Use Stream Synchronize (uncomment to test)
        // CUDA_RT_CALL(cudaStreamSynchronize(stream[1]));
        // CUDA_RT_CALL(cudaStreamSynchronize(stream[0]));
        // CUDA_RT_CALL(cudaStreamSynchronize(stream[2]));
    }
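For anyone who wants to reproduce the "does work start before the synchronize call?" question without Nsight, it can also be checked from the host with an event. The following is only a minimal sketch under stated assumptions: spinKernel is a hypothetical stand-in for someHeavyKernel, CHECK is a hypothetical helper macro (not the CUDA_RT_CALL used above), and the cycle count and sleep duration are arbitrary values chosen so that the kernel should finish well within the sleep.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <cuda_runtime.h>

// Hypothetical error-check helper for this sketch only.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s failed: %s\n", #call,                   \
                    cudaGetErrorString(err_));                          \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Hypothetical stand-in kernel: busy-waits for roughly `cycles` GPU clock cycles.
__global__ void spinKernel(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    cudaStream_t stream;
    cudaEvent_t done;
    CHECK(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));
    CHECK(cudaEventCreateWithFlags(&done, cudaEventDisableTiming));

    // Enqueue a short kernel and an event that completes right after it.
    spinKernel<<<1, 1, 0, stream>>>(1000000LL);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(done, stream));

    // Sleep on the host for much longer than the kernel should need.
    std::this_thread::sleep_for(std::chrono::milliseconds(200));

    // Query without blocking: did the work complete while the host slept?
    cudaError_t status = cudaEventQuery(done);
    if (status == cudaSuccess) {
        printf("Work finished during the sleep (started before any synchronize).\n");
    } else if (status == cudaErrorNotReady) {
        printf("Work still pending after the sleep (it had not run yet).\n");
    } else {
        fprintf(stderr, "cudaEventQuery failed: %s\n", cudaGetErrorString(status));
        return EXIT_FAILURE;
    }

    // Clean up; the synchronize call guarantees completion in either case.
    CHECK(cudaStreamSynchronize(stream));
    CHECK(cudaEventDestroy(done));
    CHECK(cudaStreamDestroy(stream));
    return 0;
}
```

Based on the behavior described above, I would expect the first message on Linux and the second on Windows, but that is a prediction from the Nsight timelines, not something this sketch has been verified to print.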
Experimental Observations:
Windows (with cudaStreamSynchronize())
Figure 1: The operations do not run concurrently.
Figure 2: With a sleep added after the async API calls, operations start only after the sleep, not immediately after the API calls.
Windows (with cudaDeviceSynchronize())
Figure 3: All operations run concurrently.
Figure 4: With the sleep, operations still wait until after the sleep to start, but they run concurrently.
Linux (with cudaStreamSynchronize())
Figure 5: All operations run concurrently.
Figure 6: With the sleep, operations start concurrently right after their API calls, without waiting for cudaStreamSynchronize().
Linux (with cudaDeviceSynchronize())
Figure 7: All operations run concurrently.
Figure 8: With the sleep, operations start concurrently right after their API calls, without waiting for cudaDeviceSynchronize().
(I'm a new user, so the system does not allow me to upload all of the Nsight screenshots.)
I would appreciate any insight into why Windows behaves differently and how to make async execution consistent across both platforms.