Hi there!
I've found that capturing a CUDA graph that involves multiple streams is not easy.
import torch

device = "cuda"
stream1 = torch.cuda.Stream()
graph = torch.cuda.CUDAGraph()

a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
c = torch.empty(1024, 1024, device=device)

# warm-up call before capturing
_ = torch.matmul(a, b)

with torch.cuda.graph(graph):
    for _ in range(32):
        # chain for a, on the capture stream
        a.copy_(torch.matmul(a, torch.randn(1024, 1024, device=device)))
        # chain for b, on a side stream
        with torch.cuda.stream(stream1):
            b.copy_(torch.matmul(b, torch.randn(1024, 1024, device=device)))
    # combine the two results
    c.copy_(a + b)

graph.replay()
torch.cuda.synchronize()
The point is that the two matmul chains are completely independent, so running each of them on a different stream should let their arithmetic overlap and reduce the overall latency. After both chains have run, their results (a and b) must be summed to produce tensor c.
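Outside of graph capture, my understanding is that this kind of overlap is achieved with the usual fork/join stream pattern; here is a rough sketch of what I mean (w_a and w_b are just placeholder weight tensors for illustration):

import torch

device = "cuda"
s1 = torch.cuda.Stream()

a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
w_a = torch.randn(1024, 1024, device=device)  # placeholder weights
w_b = torch.randn(1024, 1024, device=device)

# fork the side stream before launching the independent work
s1.wait_stream(torch.cuda.current_stream())

# a's chain on the default stream
a.copy_(torch.matmul(a, w_a))

# b's chain on the side stream, potentially overlapping with a's
with torch.cuda.stream(s1):
    b.copy_(torch.matmul(b, w_b))

# join: the default stream waits for s1 before the dependent add
torch.cuda.current_stream().wait_stream(s1)
c = a + b
torch.cuda.synchronize()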
However, when I run the code, it raises an error like this:
b.copy_(torch.matmul(b, torch.randn(1024, 1024, device=device)))
RuntimeError: CUDA error: operation not permitted when stream is capturing
This suggests that issuing work on stream1 during capture is not allowed, so I tested the code again after removing the line “with torch.cuda.stream(stream1):”, and it ran successfully.
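I suspect the side stream has to be explicitly forked from, and joined back to, the capturing stream with wait_stream(), something like the sketch below, but I haven't been able to confirm whether this is actually supported during capture:

import torch

device = "cuda"
stream1 = torch.cuda.Stream()
graph = torch.cuda.CUDAGraph()

a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
c = torch.empty(1024, 1024, device=device)

_ = torch.matmul(a, b)  # warm-up before capturing

with torch.cuda.graph(graph):
    for _ in range(32):
        # fork: the side stream waits on the capturing stream so that its
        # work (hopefully) becomes part of the same capture
        stream1.wait_stream(torch.cuda.current_stream())

        a.copy_(torch.matmul(a, torch.randn(1024, 1024, device=device)))
        with torch.cuda.stream(stream1):
            b.copy_(torch.matmul(b, torch.randn(1024, 1024, device=device)))

        # join: the capturing stream waits on the side stream again before
        # anything that depends on b
        torch.cuda.current_stream().wait_stream(stream1)
    c.copy_(a + b)

graph.replay()
torch.cuda.synchronize()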
Is there really no way to capture/replay a CUDA graph that uses multiple streams?