Default CUDA stream copy ops surprisingly block copy ops on other non-blocking CUDA streams

Hi experts,

We are working on LLM inference and noticed something surprising in our model inference stack. From our observation, copy operations (e.g. torch.Tensor.to()/.cpu()/.tolist(), with either non_blocking=True or False) block other copy operations running on CUDA streams that were created with the cudaStreamNonBlocking flag.

Per the CUDA documentation, we were not expecting this, since the default CUDA stream should only block other CUDA streams that are not non-blocking.

Here is a quick reproduction script:

import ctypes

import threading
from ctypes import c_int, c_uint, c_void_p

import torch

# Load CUDA runtime library
try:
    cuda_runtime = ctypes.CDLL("libcudart.so")  # Linux
except OSError:
    try:
        cuda_runtime = ctypes.CDLL("cudart64_*.dll")  # Windows
    except OSError:
        cuda_runtime = ctypes.CDLL("libcudart.dylib")  # macOS


# Define CUDA constants
cudaStreamNonBlocking = 0x01


def is_stream_non_blocking(pytorch_stream: torch.cuda.Stream) -> bool:
    """Check if a PyTorch CUDA stream is non-blocking."""
    try:
        # Get the raw CUDA stream handle
        stream_ptr = pytorch_stream.cuda_stream

        # Prepare arguments for cudaStreamGetFlags
        flags = c_uint()

        # Call cudaStreamGetFlags
        result = cuda_runtime.cudaStreamGetFlags(
            c_void_p(stream_ptr), ctypes.byref(flags)
        )

        if result != 0:  # cudaSuccess = 0
            print(f"Error getting stream flags: {result}")
            return False

        return bool(flags.value & cudaStreamNonBlocking)

    except Exception as e:
        print(f"Error checking stream flags: {e}")
        return False


def h2d_func():
    stream_h2d = torch.cuda.Stream()
    t = torch.ones([10000], device=torch.device("cpu"), pin_memory=True)
    with torch.cuda.stream(stream_h2d):
        print(f"stream_h2d is non-blocking: {is_stream_non_blocking(stream_h2d)}")
        for i in range(100000000):
            cuda_t = t.to("cuda", non_blocking=True)


def main() -> None:
    h2d_thread = threading.Thread(target=h2d_func, args=())
    h2d_thread.start()
    list_t = torch.empty([1])
    cuda_t = torch.arange(1000000, device=torch.device("cuda"))
    stream_main = torch.cuda.Stream()
    print(
        f"Current stream is non-blocking: {is_stream_non_blocking(torch.cuda.current_stream())}"
    )
    # with torch.cuda.stream(stream_main):
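    # With the stream context left commented out, the D2H copy below runs on the default stream.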
    for i in range(10000000):
        for j in range(100):
            t = cuda_t.repeat_interleave(10)
            t = cuda_t.repeat_interleave(10)
        list_t = t.to('cpu', non_blocking=True)

    h2d_thread.join()
    print("h2d thread is done")


if __name__ == "__main__":
    main()

Result:

stream_h2d is non-blocking: True
Current stream is non-blocking: False

From the GPU trace, we can confirm that there is a blocking issue.

If we instead run all copy operations on non-default CUDA streams, we observe that the copies run in parallel:

    with torch.cuda.stream(stream_main):
        for i in range(10000000):
            for j in range(100):
                t = cuda_t.repeat_interleave(10)
                t = cuda_t.repeat_interleave(10)
            list_t = t.to('cpu', non_blocking=True)

My question is: what is causing the blocking here?

Word of advice: Never use the CUDA default stream in production code.

Understood that this is the recommended practice, but I am more interested in learning why this is the case, especially when the CUDA documentation says the default stream should not block other non-blocking CUDA streams.
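
For reference, following that practice on our side would look roughly like the sketch below: every op is issued on a created stream, and an event expresses the one cross-stream dependency. Names and sizes are illustrative, not from our actual stack.

    import torch

    s_copy = torch.cuda.Stream()     # created stream for transfers
    s_compute = torch.cuda.Stream()  # created stream for kernels

    host_buf = torch.ones(1 << 20, pin_memory=True)

    with torch.cuda.stream(s_copy):
        dev_buf = host_buf.to("cuda", non_blocking=True)  # H2D issued on s_copy
        copy_done = s_copy.record_event()                  # marks completion of the copy

    with torch.cuda.stream(s_compute):
        s_compute.wait_event(copy_done)  # wait only on the copy, not on the whole device
        result = dev_buf * 2.0

    torch.cuda.synchronize()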

The NVIDIA folks here in the forum can probably help you with the language lawyering.

I frequently suggest that folks asking PyTorch-related questions ask them on a PyTorch forum such as discuss.pytorch.org.

Thanks @Robert_Crovella

Quick question on the legacy default CUDA stream behavior: does it block copy operations on other (non-blocking) CUDA streams? I played around with it locally; for normal kernel launch/execution the legacy default stream doesn't seem to block other non-blocking streams, but copy operations still seem to block each other.
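
Roughly what I tried locally (a sketch with arbitrary sizes, not my exact test code): a background thread issues H2D copies on a created non-blocking stream, while the main thread issues either kernels or D2H copies on the default stream.

    import threading

    import torch

    def h2d_loop():
        # Background H2D copies on a created (non-blocking) stream, as in the repro above.
        s = torch.cuda.Stream()
        src = torch.ones(1 << 22, pin_memory=True)
        with torch.cuda.stream(s):
            for _ in range(100000):
                src.to("cuda", non_blocking=True)

    threading.Thread(target=h2d_loop, daemon=True).start()

    a = torch.randn(4096, 4096, device="cuda")
    pinned = torch.empty(1 << 22, pin_memory=True)

    for _ in range(1000):
        a @ a  # kernels on the default stream: these seem to overlap with the H2D copies
    for _ in range(1000):
        # D2H copy on the default stream: here the H2D copies appear to stall
        pinned.copy_(a.view(-1)[: 1 << 22], non_blocking=True)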

generally, CUDA created streams don’t block activity in other created streams. This is true whether we are talking about kernels or copy ops.

My general advice when I am teaching CUDA programming is, once we’ve learned sufficient concurrency topics, to use only created streams to structure your work.

I actually have difficulty just parsing your question, but with respect to the legacy default stream, and leaving the cudaStreamNonBlocking flag out of the picture, any fundamental CUDA teaching material I am familiar with teaches that a work item (kernel, copy op) issued into the legacy default stream is synchronizing with respect to other device activity. It does not matter what category or characteristics that other device activity has. Synchronizing means:

  • all operations of any type, issued to that device prior to the work item issued to the legacy default stream, must complete prior to the start of execution of the work item issued to the legacy default stream
  • all operations of any type, issued to that device after the work item that was issued into the legacy default stream, must wait until that work item completes before they can begin.

Another, perhaps simpler way of saying this is that a work item issued into the legacy default stream cannot overlap with any other work item issued to that device.
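
To illustrate that rule directly against the runtime API, here is a minimal sketch (ctypes purely for brevity; default-flag, i.e. blocking, streams; arbitrary sizes; error checking omitted):

    import ctypes
    from ctypes import byref, c_int, c_size_t, c_void_p

    rt = ctypes.CDLL("libcudart.so")
    cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost = 1, 2
    N = 1 << 22  # 4 MiB per transfer

    # Two streams created with default flags, i.e. "blocking" streams.
    s1, s2 = c_void_p(), c_void_p()
    rt.cudaStreamCreate(byref(s1))
    rt.cudaStreamCreate(byref(s2))

    h, d = c_void_p(), c_void_p()
    rt.cudaMallocHost(byref(h), c_size_t(N))  # pinned host buffer
    rt.cudaMalloc(byref(d), c_size_t(N))

    rt.cudaMemcpyAsync(d, h, c_size_t(N), c_int(cudaMemcpyHostToDevice), s1)           # on s1
    rt.cudaMemcpyAsync(h, d, c_size_t(N), c_int(cudaMemcpyDeviceToHost), c_void_p(0))  # legacy default stream
    rt.cudaMemcpyAsync(d, h, c_size_t(N), c_int(cudaMemcpyHostToDevice), s2)           # on s2
    # In a profiler, the default-stream copy should start only after the s1 copy
    # finishes, and the s2 copy should start only after the default-stream copy finishes.
    rt.cudaDeviceSynchronize()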

Once we add the cudaStreamNonBlocking flag into the picture, the documentation seems pretty clear - the aforementioned synchronization is gone. I haven't studied test cases carefully looking for discrepancies, and I can't recall noticing any in the past, but there may be specialized corner cases when the copy op in question is very small, say 64KB or less. I think we are getting in the weeds here.

I personally would not start out on any such exploration in a pytorch environment, because I don’t have it memorized how every pytorch API converts into a sequence of CUDA API calls, so that I can infer expected behavior. I could perhaps use a profiler to determine the sequence in such a case, but for me personally, that’s a relatively inconvenient way to proceed. But if you wanted to offer up a CUDA C++ runtime or driver API call sequence (what I consider this forum to be focused on, less so every other implementation of CUDA wrapped in another language) that behaved strangely, I’m sure folks here would be interested.
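
For what it's worth, a reduced call sequence mirroring the PyTorch repro might look roughly like the sketch below (ctypes standing in for the C++ runtime calls; stream flags, sizes, and iteration counts are illustrative, and error checking is omitted). Whether this standalone sequence shows the same serialization under a profiler is exactly the kind of thing worth checking.

    import ctypes
    import threading
    from ctypes import byref, c_int, c_size_t, c_uint, c_void_p

    rt = ctypes.CDLL("libcudart.so")
    cudaStreamNonBlocking = 0x01
    cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost = 1, 2
    N = 1 << 20  # 1 MiB per transfer

    def alloc_pair():
        h, d = c_void_p(), c_void_p()
        rt.cudaMallocHost(byref(h), c_size_t(N))  # pinned host buffer
        rt.cudaMalloc(byref(d), c_size_t(N))
        return h, d

    def h2d_loop():
        # Mirrors h2d_func: H2D copies issued into a created non-blocking stream.
        s = c_void_p()
        rt.cudaStreamCreateWithFlags(byref(s), c_uint(cudaStreamNonBlocking))
        h, d = alloc_pair()
        for _ in range(10000):
            rt.cudaMemcpyAsync(d, h, c_size_t(N), c_int(cudaMemcpyHostToDevice), s)
        rt.cudaStreamSynchronize(s)

    t = threading.Thread(target=h2d_loop)
    t.start()

    # Mirrors main(): D2H copies issued into the legacy default stream (handle 0).
    h, d = alloc_pair()
    for _ in range(10000):
        rt.cudaMemcpyAsync(h, d, c_size_t(N), c_int(cudaMemcpyDeviceToHost), c_void_p(0))
    rt.cudaDeviceSynchronize()
    t.join()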