I’m trying to use CUDA Graphs to improve a project built on PyTorch.
When I run the examples provided in “Accelerating PyTorch with CUDA Graphs | PyTorch”, they work fine.
However, when I profiled them with
nsys profile -t cuda -s none --cpuctxsw=none python <filename.py>
it failed with return code 139 (128 + 11, i.e. SIGSEGV, a segmentation fault).
Only the first example fails, and when I removed the backward part it worked fine (see the forward-only sketch after the full script below). Since I’m familiar with neither PyTorch nor nsys, I’m not sure whether something is wrong in nsys, or whether PyTorch is producing a wrong result while pretending everything is fine.
My guess is that something goes wrong around PyTorch’s autograd when combined with CUDA Graphs. (The second example only uses CUDA Graphs in the forward pass, not in the loss or backward.)
python code:
import torch

N, D_in, H, D_out = 640, 4096, 2048, 1024
model = torch.nn.Sequential(torch.nn.Linear(D_in, H),
                            torch.nn.Dropout(p=0.2),
                            torch.nn.Linear(H, D_out),
                            torch.nn.Dropout(p=0.1)).cuda()
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Placeholders used for capture
static_input = torch.randn(N, D_in, device='cuda')
static_target = torch.randn(N, D_out, device='cuda')

# warmup
# Uses static_input and static_target here for convenience,
# but in a real setting, because the warmup includes optimizer.step()
# you must use a few batches of real data.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        optimizer.zero_grad(set_to_none=True)
        y_pred = model(static_input)
        loss = loss_fn(y_pred, static_target)
        loss.backward()
        optimizer.step()
torch.cuda.current_stream().wait_stream(s)

# capture
g = torch.cuda.CUDAGraph()
# Sets grads to None before capture, so backward() will create
# .grad attributes with allocations from the graph's private pool
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_y_pred = model(static_input)
    static_loss = loss_fn(static_y_pred, static_target)
    static_loss.backward()
    optimizer.step()

real_inputs = [torch.rand_like(static_input) for _ in range(10)]
real_targets = [torch.rand_like(static_target) for _ in range(10)]

for data, target in zip(real_inputs, real_targets):
    # Fills the graph's input memory with new data to compute on
    static_input.copy_(data)
    static_target.copy_(target)
    # replay() includes forward, backward, and step.
    # You don't even need to call optimizer.zero_grad() between iterations
    # because the captured backward refills static .grad tensors in place.
    g.replay()

# Params have been updated. static_y_pred, static_loss, and .grad
# attributes hold values from computing on this iteration's data.
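For reference, this is roughly what I mean by “removed the backward part”: capturing only the forward pass and the loss. (A sketch reusing model, loss_fn, static_input, static_target, real_inputs, and real_targets from the script above; g_fwd is just my name for the second graph.) This is the variant that profiles fine under the same nsys command:

# Same setup and warmup as in the full script; only the capture differs:
# forward + loss only, no backward() and no optimizer.step().
g_fwd = torch.cuda.CUDAGraph()
with torch.cuda.graph(g_fwd):
    static_y_pred = model(static_input)
    static_loss = loss_fn(static_y_pred, static_target)

for data, target in zip(real_inputs, real_targets):
    static_input.copy_(data)
    static_target.copy_(target)
    g_fwd.replay()  # recomputes forward and loss in place; params never change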
other information:
nsys version: NVIDIA Nsight Systems version 2023.1.2.43-32377213v0
information shown in nvidia-smi: NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2
python version: Python 3.11.8
torch version: 2.2.1+cu121 (installed from PyPI)
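One thing I might try next to narrow this down (a sketch, not verified): restrict collection to the replay loop via the CUDA profiler API, so nsys is idle while the graph is being captured. torch.cuda.profiler wraps cudaProfilerStart/cudaProfilerStop, and nsys honors them when run with --capture-range=cudaProfilerApi:

import torch.cuda.profiler as profiler

# Run with:
#   nsys profile -t cuda -s none --cpuctxsw=none --capture-range=cudaProfilerApi python <filename.py>
# Graph capture happens before profiler.start(), so nsys is not yet collecting.
profiler.start()  # cudaProfilerStart(): collection begins here
for data, target in zip(real_inputs, real_targets):
    static_input.copy_(data)
    static_target.copy_(target)
    g.replay()
profiler.stop()   # cudaProfilerStop(): collection ends

If the segfault only occurs when capture itself is traced, that would point at nsys’s interaction with stream capture rather than at PyTorch.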