Cannot get CUDA Python backtrace

tangweikai420 · April 18, 2023, 8:25am

Hello everyone! The documentation says nsys can collect a CUDA Python backtrace like this:

However, when I use nsys myself, I only see Python frames with no other detail, like this:

So what should I do?
I am profiling remotely: nsys 2023.2.1 (Windows x64) runs on my PC, and the target is a Linux machine with Python 3.9 and CUDA 11.1.

hwilper · April 18, 2023, 7:02pm

How did you launch Nsys? You have to use the --cudabacktrace option:

--cudabacktrace

Options: all, none, kernel, memory, sync, other
Default: none

When tracing CUDA APIs, enable the collection of a backtrace when a CUDA API is invoked. Significant runtime overhead may occur. Values may be combined using ‘,’. Each value except ‘none’ may be appended with a threshold after ‘:’. Threshold is duration, in nanoseconds, that CUDA APIs must execute before backtraces are collected, e.g. ‘kernel:500’. Default value for each threshold is 1000ns (1us). Note: CPU sampling must be enabled. Note: Not available on IBM Power targets.
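
The value syntax described in the help text (categories joined with ',' and an optional ':threshold' in nanoseconds) can be sketched as a small Python helper. This is only an illustration: the function name `cudabacktrace_arg` is hypothetical and not part of nsys, and it does nothing more than assemble the argument string from the categories listed in the option help.

```python
from typing import Mapping, Optional

# Categories taken from the option help text quoted in this thread.
VALID_CATEGORIES = {"all", "none", "kernel", "memory", "sync", "other"}


def cudabacktrace_arg(thresholds_ns: Mapping[str, Optional[int]]) -> str:
    """Build a --cudabacktrace argument string (hypothetical helper).

    Keys are categories; values are thresholds in nanoseconds, or None to
    keep the default threshold for that category.
    """
    parts = []
    for category, threshold in thresholds_ns.items():
        if category not in VALID_CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        if threshold is None:
            parts.append(category)
        elif category == "none":
            # The help text says every value except 'none' may take a threshold.
            raise ValueError("'none' does not take a threshold")
        else:
            parts.append(f"{category}:{threshold}")
    return "--cudabacktrace=" + ",".join(parts)


# Backtraces for kernel launches longer than 500 ns, plus memory APIs
# with the default threshold.
print(cudabacktrace_arg({"kernel": 500, "memory": None}))
```

Because the help text notes that CPU sampling must be enabled, the resulting argument would normally be passed together with `-s cpu`.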

tangweikai420 · April 18, 2023, 7:18pm

Thanks for your reply! In fact, I use the nsys GUI:

Is there any problem with these settings? Should I enable “collect python backtrace samples”? If I enable it, the nsys GUI crashes immediately every time I try to open the report.

tangweikai420 · April 19, 2023, 10:24am

I also found that if I use the nsys CLI on the target system instead of the remote Windows GUI, like this:
nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s cpu -o nsight_report -f true --cudabacktrace=true -x true python3.9 main.py
I do get some backtrace details, like this:

So is something wrong with remote profiling?

hwilper · April 19, 2023, 3:25pm

Initially, since you said you were launching from the GUI, I thought you might need to ask for process-tree-wide analysis, but you have already done that.

You don’t need Python backtrace samples. That option is Python runtime sampling, which isn’t what you are looking for here.

@rknight do you have any suggestions?

tangweikai420 · April 19, 2023, 3:56pm

Okay, something is even stranger…
Just now I profiled a deep learning application that uses Python multithreading, still running nsys on my Windows PC with the target on Linux (i.e. remote mode), and with “Collect CUDA Backtrace” enabled:

The report shows the CUDA Python backtrace now! But a warning occurred:

And I cannot get any CUDA kernel info:

However, without “Collect CUDA Backtrace”, everything seemed to be fine:

I used to enable a lot of profiling options, such as osrt, cudnn, and backtrace, because I thought they might sometimes be useful and do not add much overhead. From now on, if I find a report I cannot explain, I will enable only “collect CUDA trace” and “collect CPU IP/backtrace samples”, which may avoid a lot of strange problems, and then enable the other options one by one to see which one causes the problem.
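
A minimal sketch of that one-by-one approach for the CLI, assuming the trace names from the `-t` list used earlier in this thread (cuda, nvtx, osrt, cudnn, cublas). The helper name `incremental_commands` and the `report_N` output names are hypothetical; the sketch only prints the commands and does not run them.

```python
from typing import List

# Baseline trace plus the extra traces to enable one at a time.
# All names come from the -t list used in the CLI command in this thread.
BASELINE = ["cuda"]
EXTRAS = ["nvtx", "osrt", "cudnn", "cublas"]


def incremental_commands(app: List[str]) -> List[List[str]]:
    """Return nsys command lines that add one trace option per run."""
    commands = []
    for n in range(len(EXTRAS) + 1):
        traces = BASELINE + EXTRAS[:n]
        commands.append([
            "nsys", "profile",
            "-t", ",".join(traces),
            "-s", "cpu",                     # CPU IP/backtrace sampling
            "-o", f"report_{len(traces)}",   # hypothetical report name per run
            "-f", "true",
            *app,
        ])
    return commands


for cmd in incremental_commands(["python3.9", "main.py"]):
    print(" ".join(cmd))
```

Comparing the reports from consecutive runs shows which added option first makes the report look wrong.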

rknight · April 19, 2023, 5:28pm

This sounds like a bug in nsys to me. @dofek Can you take a look?

rknight · April 19, 2023, 5:30pm

Sounds like another bug :-(.

tangweikai420, would it be possible to get access to your workload so we can reproduce this issue and hopefully fix it?

tangweikai420 · April 19, 2023, 6:15pm

Thanks for your reply! I would like to share my workload.
My workload depends on a DL framework called DGL, and the Python multithreading usage is here. In short, every time a DataLoader is initialized, it creates a separate thread for data processing and transfer.
Here is the code:

import dgl
import torch
from ogb.nodeproppred import DglNodePropPredDataset

dataset = DglNodePropPredDataset('ogbn-arxiv')
device = 'cuda:0'

graph, node_labels = dataset[0]
graph = dgl.add_reverse_edges(graph)
graph.ndata['label'] = node_labels[:, 0]

node_features = graph.ndata['feat']
num_features = node_features.shape[1]
num_classes = (node_labels.max() + 1).item()

idx_split = dataset.get_idx_split()
train_nids = idx_split['train']
valid_nids = idx_split['valid']
test_nids = idx_split['test']

sampler = dgl.dataloading.NeighborSampler([4, 4], prefetch_node_feats=['feat'], prefetch_labels=['label'])

train_dataloader = dgl.dataloading.DataLoader(
    # The following arguments are specific to DGL's DataLoader.
    graph,              # The graph
    train_nids,         # The node IDs to iterate over in minibatches
    sampler,            # The neighbor sampler
    device=device,      # Put the sampled MFGs on CPU or GPU
    # The following arguments are inherited from PyTorch DataLoader.
    batch_size=1024,    # Batch size
    shuffle=True,       # Whether to shuffle the nodes for every epoch
    drop_last=False,    # Whether to drop the last incomplete batch
    num_workers=0,       # Number of sampler processes
)

import torch.nn as nn
import torch.nn.functional as F
from dgl.nn.pytorch import SAGEConv

class Model(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super(Model, self).__init__()
        self.conv1 = SAGEConv(in_feats, h_feats, aggregator_type='mean')
        self.conv2 = SAGEConv(h_feats, num_classes, aggregator_type='mean')
        self.h_feats = h_feats

    def forward(self, mfgs, x):
        h_dst = x[:mfgs[0].num_dst_nodes()]
        h = self.conv1(mfgs[0], (x, h_dst))
        h = F.relu(h)
        h_dst = h[:mfgs[1].num_dst_nodes()]
        h = self.conv2(mfgs[1], (h, h_dst))
        return h

model = Model(num_features, 128, num_classes).to(device)

opt = torch.optim.Adam(model.parameters())

def train():
    for epoch in range(4):
        model.train()
        # nvtx
        torch.cuda.nvtx.range_push("epoch")
        for step, (input_nodes, output_nodes, mfgs) in enumerate(train_dataloader):
            inputs = mfgs[0].srcdata['feat']
            # batch nodes label
            labels = mfgs[-1].dstdata['label']
            predictions = model(mfgs, inputs)
            loss = F.cross_entropy(predictions, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
        #     break
        # break
        torch.cuda.nvtx.range_pop()

train()

I think it is hard to tell whether nsys causes the issue. Maybe the issue has to do with DGL, or most likely I did something wrong myself… At least I can now get a report that I can explain. As I said above,

I will enable only “collect CUDA trace” and “collect CPU IP/backtrace samples”, which may avoid a lot of strange problems, and then enable the other options one by one to see which one causes the problem.

So if the issue really can be reproduced, I’m glad to have made a small contribution.
Thanks in advance!

TzahMazuz · April 23, 2023, 8:14am

Hi there, first I would like to thank you for reaching out with these issues.

The CLI command you mentioned:
nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s cpu -o nsight_report -f true --cudabacktrace=true -x true python3.9 main.py
is not complete for what you want. You need to add --python-backtrace=cuda for the CUDA backtraces to include Python backtraces.

Not being able to see CUDA kernel info: we’ll look into it.

Not being able to see the Python backtrace: we are aware of that bug and it is being worked on. For now you can use the following workaround:
Create a config.ini file and put it in the same folder as your nsys executable.
Add CudaBacktraceDepth=180 to the config file. This option will probably increase the overhead, but it should let you see the Python backtraces within the CUDA API backtrace’s tooltip.

Nsys GUI crash when “Collect python backtrace samples” is enabled: we’ll look into it as soon as possible.
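
For reference, the corrected command line can be assembled as in the Python sketch below. It only combines the flags already quoted in this thread with --python-backtrace=cuda; the function name `nsys_command` is hypothetical, and the sketch prints the command rather than running it.

```python
import shlex
from typing import List


def nsys_command(script: str = "main.py") -> List[str]:
    """Return the CLI command from this thread with --python-backtrace=cuda added."""
    return [
        "nsys", "profile",
        "-w", "true",
        "-t", "cuda,nvtx,osrt,cudnn,cublas",
        "-s", "cpu",
        "-o", "nsight_report",
        "-f", "true",
        "--cudabacktrace=true",
        "--python-backtrace=cuda",  # the flag missing from the original command
        "-x", "true",
        "python3.9", script,
    ]


# Print a shell-safe version of the command for copy and paste.
print(shlex.join(nsys_command()))
```

On the target machine, the same list could be passed to `subprocess.run(nsys_command(), check=True)` to start the profile from a script.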

tangweikai420 · April 23, 2023, 8:26am

In fact, if I use --python-backtrace=cuda in the CLI on the target and then open the report with the nsys GUI on my PC, the GUI crashes after a while (or immediately).

TzahMazuz · April 23, 2023, 9:13am

That is probably because of the tooltip bug. As I said, if you increase CudaBacktraceDepth to a high enough value (using the config.ini file), it shouldn’t crash.
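
A minimal sketch of the config.ini workaround in Python. It assumes the file is a plain list of key=value lines, which is all the reply above states; the helper name `set_cuda_backtrace_depth` is hypothetical, and the demo writes into a temporary directory rather than a real nsys installation.

```python
import tempfile
from pathlib import Path


def set_cuda_backtrace_depth(nsys_dir: Path, depth: int = 180) -> Path:
    """Write CudaBacktraceDepth=<depth> into config.ini inside nsys_dir.

    Assumes config.ini holds plain key=value lines. Other existing lines
    are kept; an earlier CudaBacktraceDepth line is replaced.
    """
    config = Path(nsys_dir) / "config.ini"
    lines = config.read_text().splitlines() if config.exists() else []
    lines = [ln for ln in lines
             if not ln.strip().startswith("CudaBacktraceDepth=")]
    lines.append(f"CudaBacktraceDepth={depth}")
    config.write_text("\n".join(lines) + "\n")
    return config


# Demo in a temporary directory; a real run would pass the folder that
# contains the nsys executable instead.
with tempfile.TemporaryDirectory() as tmp:
    path = set_cuda_backtrace_depth(Path(tmp))
    print(path.read_text())
```

On a real system, the folder could be located with something like `Path(shutil.which("nsys")).resolve().parent`, assuming nsys is on the PATH of the machine where the file is needed.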

system · May 7, 2023, 9:13am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
