Multiple streams in CUDA graph capture

Hi there!

I have found that capturing a CUDA graph that involves multiple streams is not straightforward. Here is a minimal reproduction:

import torch

device = "cuda"

stream1 = torch.cuda.Stream()

graph = torch.cuda.CUDAGraph()

a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
c = torch.empty(1024, 1024, device=device)

_ = torch.matmul(a, b)  # warm-up outside the graph (e.g. so cuBLAS is initialized before capture)

with torch.cuda.graph(graph):
    for _ in range(32):
        a.copy_(torch.matmul(a, torch.randn(1024, 1024, device=device)))
        with torch.cuda.stream(stream1):  # launch the second matmul on a side stream
            b.copy_(torch.matmul(b, torch.randn(1024, 1024, device=device)))
        c.copy_(a + b)

graph.replay()

torch.cuda.synchronize()

The point is that the two matmul kernels are completely independent, so running them on different streams should let their arithmetic overlap and reduce the overall latency. Once both kernels have finished, their results (a and b) must be summed to produce the tensor c.

However, when I run the code, it fails with the following error:

b.copy_(torch.matmul(b, torch.randn(1024, 1024, device=device)))
RuntimeError: CUDA error: operation not permitted when stream is capturing

This suggests that launching work on stream1 during capture is not allowed. As a check, I removed the line “with torch.cuda.stream(stream1):” and the code then captured and replayed successfully.
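
With everything on the capture stream, the capture body looks like this (the same code minus the side-stream context) and it runs without error:

with torch.cuda.graph(graph):
    for _ in range(32):
        a.copy_(torch.matmul(a, torch.randn(1024, 1024, device=device)))
        b.copy_(torch.matmul(b, torch.randn(1024, 1024, device=device)))
        c.copy_(a + b)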

Is there really no way to capture and replay a CUDA graph that uses multiple streams?
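
Or does the side stream have to be explicitly forked from and joined back to the capture stream, so that its work is fenced by event dependencies on the capturing stream? I am imagining something like the sketch below, where the wait_stream fork/join calls are my own guess and not something I have verified:

with torch.cuda.graph(graph):
    for _ in range(32):
        # fork: stream1 waits on the capture stream, which (I assume) is what
        # lets it take part in the ongoing capture
        stream1.wait_stream(torch.cuda.current_stream())
        a.copy_(torch.matmul(a, torch.randn(1024, 1024, device=device)))
        with torch.cuda.stream(stream1):
            b.copy_(torch.matmul(b, torch.randn(1024, 1024, device=device)))
        # join: the capture stream waits on stream1 before c consumes b
        torch.cuda.current_stream().wait_stream(stream1)
        c.copy_(a + b)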

Here is a recent discussion that may be of interest. I wouldn't be able to comment on the PyTorch side of this. I usually suggest that folks asking PyTorch questions may get better help on a PyTorch forum such as discuss.pytorch.org; there are NVIDIA experts who patrol that forum.