nsys profile fails when using PyTorch CUDA graphs

I’m trying to use CUDA graphs to improve the performance of a project built on PyTorch.

When I run the examples provided in Accelerating PyTorch with CUDA Graphs | PyTorch, they work fine.

However, when I profiled the script with

nsys profile -t cuda -s none --cpuctxsw=none  python <filename.py>

it failed with return code 139.
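For readers unsure what 139 means: shells conventionally report 128 plus the signal number when a process is killed by a signal, so 139 corresponds to signal 11 (SIGSEGV, a segmentation fault). A minimal sketch to decode such a status is below; the helper name `decode_exit_status` is hypothetical and only for illustration.

```python
import signal

def decode_exit_status(code):
    """Return the signal name if a shell-style exit status encodes a fatal signal, else None."""
    if code > 128:
        try:
            # Shell convention: status = 128 + signal number
            return signal.Signals(code - 128).name
        except ValueError:
            # Not a signal number known on this platform
            return None
    return None

print(decode_exit_status(139))  # SIGSEGV
```

This only says the profiled process (or the profiler's injected code) crashed with a segmentation fault; it does not say which component is at fault.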

Only the first example fails, and when I removed the backward part, it worked fine. Because I’m not familiar with either PyTorch or nsys, I’m not sure whether something is wrong in nsys, or whether PyTorch produces a wrong result but pretends everything is fine.
My guess is that something goes wrong around autograd in PyTorch when CUDA graphs are used. (The second example only uses a CUDA graph for the forward pass, not for the loss or the backward pass.)

python code:

import torch

N, D_in, H, D_out = 640, 4096, 2048, 1024
model = torch.nn.Sequential(torch.nn.Linear(D_in, H),
                            torch.nn.Linear(H, D_out)).cuda()
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Placeholders used for capture
static_input = torch.randn(N, D_in, device='cuda')
static_target = torch.randn(N, D_out, device='cuda')

# warmup
# Uses static_input and static_target here for convenience,
# but in a real setting, because the warmup includes optimizer.step()
# you must use a few batches of real data.
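s = torch.cuda.Stream()
# The side stream must wait for work already queued on the current stream
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        optimizer.zero_grad(set_to_none=True)
        y_pred = model(static_input)
        loss = loss_fn(y_pred, static_target)
        loss.backward()
        optimizer.step()
# Rejoin the current stream before capture
torch.cuda.current_stream().wait_stream(s)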

# capture
g = torch.cuda.CUDAGraph()
# Sets grads to None before capture, so backward() will create
# .grad attributes with allocations from the graph's private pool
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_y_pred = model(static_input)
    static_loss = loss_fn(static_y_pred, static_target)
    static_loss.backward()
    optimizer.step()

real_inputs = [torch.rand_like(static_input) for _ in range(10)]
real_targets = [torch.rand_like(static_target) for _ in range(10)]

for data, target in zip(real_inputs, real_targets):
    # Fills the graph's input memory with new data to compute on
    static_input.copy_(data)
    static_target.copy_(target)
    # replay() includes forward, backward, and step.
    # You don't even need to call optimizer.zero_grad() between iterations
    # because the captured backward refills static .grad tensors in place.
    g.replay()
    # Params have been updated. static_y_pred, static_loss, and .grad
    # attributes hold values from computing on this iteration's data.

other information:
nsys version: NVIDIA Nsight Systems version 2023.1.2.43-32377213v0
information shown by nvidia-smi: NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2
python version: Python 3.11.8
torch version: torch 2.2.1+cu121 pypi_0 pypi

@jyi can you please respond to this issue?

Hello, CUDA 12.2 is not supported by Nsight Systems 2023.1.2. Could you retry with the latest version of Nsight Systems?

Thanks a lot!
After upgrading to NVIDIA Nsight Systems version 2024.4.1.61-244134315967v0, everything works fine.
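For anyone who wants to check their installed release before upgrading, here is a small sketch that parses the version string printed by `nsys --version` and compares it to a known-good release. The threshold `KNOWN_GOOD` is an assumption: it is simply the release reported as working in this thread, not an official minimum for CUDA 12.2. The helper name `parse_nsys_version` is hypothetical.

```python
import re

def parse_nsys_version(text):
    """Extract (year, major, minor) from a string like the output of `nsys --version`."""
    m = re.search(r"version\s+(\d+)\.(\d+)\.(\d+)", text)
    if m is None:
        raise ValueError(f"no version found in: {text!r}")
    return tuple(int(p) for p in m.groups())

# Assumption: the release reported as working in this thread, not an official minimum.
KNOWN_GOOD = (2024, 4, 1)

reported = parse_nsys_version("NVIDIA Nsight Systems version 2023.1.2.43-32377213v0")
print(reported, reported < KNOWN_GOOD)  # (2023, 1, 2) True
```

Tuple comparison is used so that versions are compared numerically component by component rather than as strings.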

How to upgrade nsys: follow this; just download the .deb package and extract it.
It seems that the CUDA toolkit didn’t contain the right version of nsys? I had installed toolkit 12.1, which contained nvcc 12.1, nvprof 12.1.105, ncu 2023.1.1 and nsys 2023.1.1, while the CUDA version reported by the driver is 12.2 (shown in nvidia-smi: NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2).
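Since a toolkit-bundled nsys and a separately downloaded one can coexist, it may help to confirm which binary is actually picked up from PATH. The sketch below is one way to do that; the helper name `tool_version` is hypothetical, and it assumes each tool accepts `--version` (which nsys, nvcc and ncu do).

```python
import shutil
import subprocess

def tool_version(name):
    """Return (path, first line of `--version` output) for a tool on PATH, or None if absent."""
    path = shutil.which(name)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True, check=False)
    output = (result.stdout or result.stderr).strip()
    first_line = output.splitlines()[0] if output else ""
    return path, first_line

# Show which binaries would be run and what version each reports
for tool in ("nsys", "nvcc", "ncu"):
    print(tool, tool_version(tool))
```

If the printed path still points into the old toolkit directory after the upgrade, the new nsys has to be called by its full path or placed earlier in PATH.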