Default CUDA stream copy ops surprisingly block copy ops on other non-blocking CUDA streams

Hi experts,

We are working on LLM inference and noticed something surprising in our model inference stack. From our observation, copy operations (e.g. torch.Tensor.to()/.cpu()/.tolist(), with either non_blocking=True or False) block other copy operations running on CUDA streams that were created with the cudaStreamNonBlocking flag.

Per the CUDA documentation, we were not expecting this, since the default CUDA stream should only block other CUDA streams that are not non-blocking.

Here is a quick reproduction script:

import ctypes

import threading
from ctypes import c_int, c_uint, c_void_p

import torch

# Load CUDA runtime library
try:
    cuda_runtime = ctypes.CDLL("libcudart.so")  # Linux
except OSError:
    try:
        cuda_runtime = ctypes.CDLL("cudart64_*.dll")  # Windows
    except OSError:
        cuda_runtime = ctypes.CDLL("libcudart.dylib")  # macOS


# Define CUDA constants
cudaStreamNonBlocking = 0x01


def is_stream_non_blocking(pytorch_stream: torch.cuda.Stream) -> bool:
    """Check if a PyTorch CUDA stream is non-blocking."""
    try:
        # Get the raw CUDA stream handle
        stream_ptr = pytorch_stream.cuda_stream

        # Prepare arguments for cudaStreamGetFlags
        flags = c_uint()

        # Call cudaStreamGetFlags
        result = cuda_runtime.cudaStreamGetFlags(
            c_void_p(stream_ptr), ctypes.byref(flags)
        )

        if result != 0:  # cudaSuccess = 0
            print(f"Error getting stream flags: {result}")
            return False

        return bool(flags.value & cudaStreamNonBlocking)

    except Exception as e:
        print(f"Error checking stream flags: {e}")
        return False


def h2d_func():
    stream_h2d = torch.cuda.Stream()
    t = torch.ones([10000], device=torch.device("cpu"), pin_memory=True)
    with torch.cuda.stream(stream_h2d):
        print(f"stream_h2d is non-blocking: {is_stream_non_blocking(stream_h2d)}")
        for i in range(100000000):
            cuda_t = t.to("cuda", non_blocking=True)


def main() -> None:
    h2d_thread = threading.Thread(target=h2d_func, args=())
    h2d_thread.start()
    list_t = torch.empty([1])
    cuda_t = torch.arange(1000000, device=torch.device("cuda"))
    stream_main = torch.cuda.Stream()
    print(
        f"Current stream is non-blocking: {is_stream_non_blocking(torch.cuda.current_stream())}"
    )
    # with torch.cuda.stream(stream_main):
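    # With the stream context left commented out, the D2H copy below runs on the default stream.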
    for i in range(10000000):
        for j in range(100):
            t = cuda_t.repeat_interleave(10)
            t = cuda_t.repeat_interleave(10)
        list_t = t.to('cpu', non_blocking=True)

    h2d_thread.join()
    print("h2d thread is done")


if __name__ == "__main__":
    main()

Result:

stream_h2d is non-blocking: True
Current stream is non-blocking: False

From the GPU trace, we can confirm that there is a blocking issue.

If we instead run all copy operations on non-default CUDA streams, we observe that the copies run in parallel:

    with torch.cuda.stream(stream_main):
        for i in range(10000000):
            for j in range(100):
                t = cuda_t.repeat_interleave(10)
                t = cuda_t.repeat_interleave(10)
            list_t = t.to('cpu', non_blocking=True)

My question is: what is causing the blocking here?

Word of advice: Never use the CUDA default stream in production code.

Understood that this is the recommended practice, but I am more interested in learning why this is the case, especially when the CUDA documentation says the default stream should not block other non-blocking CUDA streams.
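
For reference, following that practice on our side would look roughly like the sketch below: every op is issued on a created stream, and an event expresses the one cross-stream dependency. Names and sizes are illustrative, not from our actual stack.

    import torch

    s_copy = torch.cuda.Stream()     # created stream for transfers
    s_compute = torch.cuda.Stream()  # created stream for kernels

    host_buf = torch.ones(1 << 20, pin_memory=True)

    with torch.cuda.stream(s_copy):
        dev_buf = host_buf.to("cuda", non_blocking=True)  # H2D issued on s_copy
        copy_done = s_copy.record_event()                  # marks completion of the copy

    with torch.cuda.stream(s_compute):
        s_compute.wait_event(copy_done)  # wait only on the copy, not on the whole device
        result = dev_buf * 2.0

    torch.cuda.synchronize()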

The NVIDIA folks here in the forum can probably help you with the language lawyering.

I frequently suggest that folks asking PyTorch-related questions ask them on a PyTorch forum such as discuss.pytorch.org.

Thanks @Robert_Crovella

Quick question on the legacy default CUDA stream behavior: does it block copy operations on other (non-blocking) CUDA streams? I played around with it locally; for normal kernel launch/execution the legacy default stream doesn't seem to block other non-blocking streams, but copy operations still seem to block each other.
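
Roughly what I tried locally (a sketch with arbitrary sizes, not my exact test code): a background thread issues H2D copies on a created non-blocking stream, while the main thread issues either kernels or D2H copies on the default stream.

    import threading

    import torch

    def h2d_loop():
        # Background H2D copies on a created (non-blocking) stream, as in the repro above.
        s = torch.cuda.Stream()
        src = torch.ones(1 << 22, pin_memory=True)
        with torch.cuda.stream(s):
            for _ in range(100000):
                src.to("cuda", non_blocking=True)

    threading.Thread(target=h2d_loop, daemon=True).start()

    a = torch.randn(4096, 4096, device="cuda")
    pinned = torch.empty(1 << 22, pin_memory=True)

    for _ in range(1000):
        a @ a  # kernels on the default stream: these seem to overlap with the H2D copies
    for _ in range(1000):
        # D2H copy on the default stream: here the H2D copies appear to stall
        pinned.copy_(a.view(-1)[: 1 << 22], non_blocking=True)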

generally, CUDA created streams don’t block activity in other created streams. This is true whether we are talking about kernels or copy ops.

My general advice when I am teaching CUDA programming is, once we’ve learned sufficient concurrency topics, to use only created streams to structure your work.

I actually have difficulty just parsing your question, but with respect to the legacy default stream, and leaving the cudaStreamNonBlocking flag out of the picture, any fundamental CUDA teaching material I am familiar with teaches that a work item (kernel, copy op) issued into the legacy default stream is synchronizing with respect to other device activity. It does not matter what category or characteristics that other device activity has. Synchronizing means:

  • all operations of any type, issued to that device prior to the work item issued to the legacy default stream, must complete prior to the start of execution of the work item issued to the legacy default stream
  • all operations of any type, issued to that device after the work item that was issued into the legacy default stream, must wait until that work item completes before they can begin.

Another, perhaps simpler way of saying this is that a work item issued into the legacy default stream cannot overlap with any other work item issued to that device.
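
To illustrate that rule directly against the runtime API, here is a minimal sketch (ctypes purely for brevity; default-flag, i.e. blocking, streams; arbitrary sizes; error checking omitted):

    import ctypes
    from ctypes import byref, c_int, c_size_t, c_void_p

    rt = ctypes.CDLL("libcudart.so")
    cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost = 1, 2
    N = 1 << 22  # 4 MiB per transfer

    # Two streams created with default flags, i.e. "blocking" streams.
    s1, s2 = c_void_p(), c_void_p()
    rt.cudaStreamCreate(byref(s1))
    rt.cudaStreamCreate(byref(s2))

    h, d = c_void_p(), c_void_p()
    rt.cudaMallocHost(byref(h), c_size_t(N))  # pinned host buffer
    rt.cudaMalloc(byref(d), c_size_t(N))

    rt.cudaMemcpyAsync(d, h, c_size_t(N), c_int(cudaMemcpyHostToDevice), s1)           # on s1
    rt.cudaMemcpyAsync(h, d, c_size_t(N), c_int(cudaMemcpyDeviceToHost), c_void_p(0))  # legacy default stream
    rt.cudaMemcpyAsync(d, h, c_size_t(N), c_int(cudaMemcpyHostToDevice), s2)           # on s2
    # In a profiler, the default-stream copy should start only after the s1 copy
    # finishes, and the s2 copy should start only after the default-stream copy finishes.
    rt.cudaDeviceSynchronize()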

Once we add the cudaStreamNonBlocking flag into the picture, the documentation seems pretty clear - the aforementioned synchronization is gone. I haven't studied test cases carefully looking for discrepancies, and I can't recall noticing any in the past, but there may be specialized corner cases when the copy op in question is very small, say 64KB or less. I think we are getting in the weeds here.

I personally would not start out on any such exploration in a pytorch environment, because I don’t have it memorized how every pytorch API converts into a sequence of CUDA API calls, so that I can infer expected behavior. I could perhaps use a profiler to determine the sequence in such a case, but for me personally, that’s a relatively inconvenient way to proceed. But if you wanted to offer up a CUDA C++ runtime or driver API call sequence (what I consider this forum to be focused on, less so every other implementation of CUDA wrapped in another language) that behaved strangely, I’m sure folks here would be interested.
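
For what it's worth, a reduced call sequence mirroring the PyTorch repro might look roughly like the sketch below (ctypes standing in for the C++ runtime calls; stream flags, sizes, and iteration counts are illustrative, and error checking is omitted). Whether this standalone sequence shows the same serialization under a profiler is exactly the kind of thing worth checking.

    import ctypes
    import threading
    from ctypes import byref, c_int, c_size_t, c_uint, c_void_p

    rt = ctypes.CDLL("libcudart.so")
    cudaStreamNonBlocking = 0x01
    cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost = 1, 2
    N = 1 << 20  # 1 MiB per transfer

    def alloc_pair():
        h, d = c_void_p(), c_void_p()
        rt.cudaMallocHost(byref(h), c_size_t(N))  # pinned host buffer
        rt.cudaMalloc(byref(d), c_size_t(N))
        return h, d

    def h2d_loop():
        # Mirrors h2d_func: H2D copies issued into a created non-blocking stream.
        s = c_void_p()
        rt.cudaStreamCreateWithFlags(byref(s), c_uint(cudaStreamNonBlocking))
        h, d = alloc_pair()
        for _ in range(10000):
            rt.cudaMemcpyAsync(d, h, c_size_t(N), c_int(cudaMemcpyHostToDevice), s)
        rt.cudaStreamSynchronize(s)

    t = threading.Thread(target=h2d_loop)
    t.start()

    # Mirrors main(): D2H copies issued into the legacy default stream (handle 0).
    h, d = alloc_pair()
    for _ in range(10000):
        rt.cudaMemcpyAsync(h, d, c_size_t(N), c_int(cudaMemcpyDeviceToHost), c_void_p(0))
    rt.cudaDeviceSynchronize()
    t.join()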