I’m trying to use CUDA Graphs to improve a project built on PyTorch.
When I run the examples provided in “Accelerating PyTorch with CUDA Graphs | PyTorch”, they work fine.
However, when I profiled them with
nsys profile -t cuda -s none --cpuctxsw=none python <filename.py>
it failed with return code 139 (128 + 11, i.e. SIGSEGV, a segmentation fault).
Only the first example fails, and when I removed the backward part it worked fine (see the forward-only sketch after the full script below). Since I’m familiar with neither PyTorch nor nsys, I’m not sure whether something is wrong in nsys, or whether PyTorch is producing a wrong result while pretending everything is fine.
My guess is that something goes wrong around PyTorch’s autograd when combined with CUDA Graphs. (The second example only uses CUDA Graphs in the forward pass, not in the loss or backward.)
python code:
import torch

N, D_in, H, D_out = 640, 4096, 2048, 1024
model = torch.nn.Sequential(torch.nn.Linear(D_in, H),
                            torch.nn.Dropout(p=0.2),
                            torch.nn.Linear(H, D_out),
                            torch.nn.Dropout(p=0.1)).cuda()
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Placeholders used for capture
static_input = torch.randn(N, D_in, device='cuda')
static_target = torch.randn(N, D_out, device='cuda')

# warmup
# Uses static_input and static_target here for convenience,
# but in a real setting, because the warmup includes optimizer.step()
# you must use a few batches of real data.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        optimizer.zero_grad(set_to_none=True)
        y_pred = model(static_input)
        loss = loss_fn(y_pred, static_target)
        loss.backward()
        optimizer.step()
torch.cuda.current_stream().wait_stream(s)

# capture
g = torch.cuda.CUDAGraph()
# Sets grads to None before capture, so backward() will create
# .grad attributes with allocations from the graph's private pool
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_y_pred = model(static_input)
    static_loss = loss_fn(static_y_pred, static_target)
    static_loss.backward()
    optimizer.step()

real_inputs = [torch.rand_like(static_input) for _ in range(10)]
real_targets = [torch.rand_like(static_target) for _ in range(10)]

for data, target in zip(real_inputs, real_targets):
    # Fills the graph's input memory with new data to compute on
    static_input.copy_(data)
    static_target.copy_(target)
    # replay() includes forward, backward, and step.
    # You don't even need to call optimizer.zero_grad() between iterations
    # because the captured backward refills static .grad tensors in place.
    g.replay()

# Params have been updated. static_y_pred, static_loss, and .grad
# attributes hold values from computing on this iteration's data.
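For reference, this is roughly what I mean by “removed the backward part”: capturing only the forward pass and the loss. (A sketch reusing model, loss_fn, static_input, static_target, real_inputs, and real_targets from the script above; g_fwd is just my name for the second graph.) This is the variant that profiles fine under the same nsys command:

# Same setup and warmup as in the full script; only the capture differs:
# forward + loss only, no backward() and no optimizer.step().
g_fwd = torch.cuda.CUDAGraph()
with torch.cuda.graph(g_fwd):
    static_y_pred = model(static_input)
    static_loss = loss_fn(static_y_pred, static_target)

for data, target in zip(real_inputs, real_targets):
    static_input.copy_(data)
    static_target.copy_(target)
    g_fwd.replay()  # recomputes forward and loss in place; params never change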
other information:
nsys version: NVIDIA Nsight Systems version 2023.1.2.43-32377213v0
information shown in nvidia-smi: NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2
python version: Python 3.11.8
torch version: 2.2.1+cu121 (installed from PyPI)
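One thing I might try next to narrow this down (a sketch, not verified): restrict collection to the replay loop via the CUDA profiler API, so nsys is idle while the graph is being captured. torch.cuda.profiler wraps cudaProfilerStart/cudaProfilerStop, and nsys honors them when run with --capture-range=cudaProfilerApi:

import torch.cuda.profiler as profiler

# Run with:
#   nsys profile -t cuda -s none --cpuctxsw=none --capture-range=cudaProfilerApi python <filename.py>
# Graph capture happens before profiler.start(), so nsys is not yet collecting.
profiler.start()  # cudaProfilerStart(): collection begins here
for data, target in zip(real_inputs, real_targets):
    static_input.copy_(data)
    static_target.copy_(target)
    g.replay()
profiler.stop()   # cudaProfilerStop(): collection ends

If the segfault only occurs when capture itself is traced, that would point at nsys’s interaction with stream capture rather than at PyTorch.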