Cannot get CUDA Python backtrace

Hello everyone! The documentation says nsys can collect a CUDA Python backtrace like this:

However, when I use nsys myself, I only see Python frames with no other detail, like this:

So what should I do?
I am profiling remotely with nsys 2023.2.1 (Windows-x64); the target is on Linux, with Python 3.9 and CUDA 11.1.

How did you launch nsys? You have to use the --cudabacktrace option:


Options: all, none, kernel, memory, sync, other

When tracing CUDA APIs, enable the collection of a backtrace when a CUDA API is invoked. Significant runtime overhead may occur. Values may be combined using ‘,’. Each value except ‘none’ may be appended with a threshold after ‘:’. Threshold is duration, in nanoseconds, that CUDA APIs must execute before backtraces are collected, e.g. ‘kernel:500’. Default value for each threshold is 1000ns (1us). Note: CPU sampling must be enabled. Note: Not available on IBM Power targets.
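As a minimal sketch of the value syntax described above, the helper below assembles a `--cudabacktrace` argument from category/threshold pairs. The function name `cudabacktrace_arg` is hypothetical and not part of nsys; only the category names and the `category:threshold` form come from the option text quoted above.

```python
# Hypothetical helper (not part of nsys): builds the value for --cudabacktrace
# from the categories and optional nanosecond thresholds quoted above.
VALID_CATEGORIES = {"all", "none", "kernel", "memory", "sync", "other"}


def cudabacktrace_arg(spec):
    """spec maps a category to a threshold in ns, or to None for the default."""
    parts = []
    for category, threshold_ns in spec.items():
        if category not in VALID_CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        if category == "none" and threshold_ns is not None:
            raise ValueError("'none' does not accept a threshold")
        if threshold_ns is None:
            parts.append(category)
        else:
            parts.append(f"{category}:{threshold_ns}")
    return "--cudabacktrace=" + ",".join(parts)


# Kernel launches with a 500 ns threshold, memory APIs with the default one.
print(cudabacktrace_arg({"kernel": 500, "memory": None}))
```

The resulting string would be passed on the nsys command line together with CPU sampling, which the option text says must be enabled.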

Thanks for your reply! In fact, I use the nsys GUI:

Is anything wrong with these settings? Should I enable “Collect python backtrace samples”? If I enable it, the nsys GUI crashes immediately every time I try to open the report.

I also found that if I use the nsys CLI on the target system instead of the remote Windows GUI, like this:
nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s cpu -o nsight_report -f true --cudabacktrace=true -x true python3.9
I can get some backtrace details like:

So is something wrong with remote profiling?

Initially since you said you were calling from the GUI I thought maybe you needed to ask for process-tree wide analysis, but you have done that.

You don’t need Python backtrace samples; that option is Python runtime sampling, which isn’t what you are looking for here.

@rknight do you have any suggestions?

Okay, something even stranger happened…
Just now, I tried to profile a multi-threaded Python deep learning application, still using nsys on my Windows PC with the target on Linux (i.e. remote mode). With “Collect CUDA Backtrace” enabled:

The report shows the CUDA Python backtrace now! But a warning occurred:

And I cannot get any CUDA kernel info:

However, without “Collect CUDA Backtrace”, everything seemed to be fine:

I used to enable many profiling options, such as osrt, cudnn, and backtrace, because I thought they might sometimes be useful and did not add much overhead. From now on, if I find a report unexplainable, I will enable only “collect CUDA trace” and “collect CPU IP/backtrace samples”, which may avoid a lot of strange problems, and then enable the other options one by one to see which one causes the problem.
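One way to picture the "enable options one by one" approach is the sketch below, which generates a sequence of trace lists that starts from a minimal baseline and adds one extra trace per run. The function name `incremental_trace_sets` is hypothetical, and "one by one" is read here as cumulative (each run keeps the previously added traces); the trace names are the ones used earlier in this thread.

```python
def incremental_trace_sets(baseline, extras):
    """Yield comma-joined trace lists: the baseline, then one more extra each time."""
    enabled = list(baseline)
    yield ",".join(enabled)
    for extra in extras:
        enabled.append(extra)
        yield ",".join(enabled)


# Each printed value is what would be passed to `nsys profile -t` for one run.
for traces in incremental_trace_sets(["cuda"], ["nvtx", "osrt", "cudnn", "cublas"]):
    print(traces)
```

The first run whose report looks wrong points at the trace that was added last.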

This sounds like a bug in nsys to me. @dofek Can you take a look?

Sounds like another bug :-(.

tangweikai420, would it be possible to get access to your workload so we can reproduce this issue and hopefully fix it?

Thanks for your reply! I would like to share my workload.
My workload depends on a DL framework called DGL, and the Python multi-thread usage is here. In short, every time the DataLoader is initialized, it creates a separate thread for data processing and transfer.
Here is the code:

import dgl
import torch
from ogb.nodeproppred import DglNodePropPredDataset

dataset = DglNodePropPredDataset('ogbn-arxiv')
device = 'cuda:0'

graph, node_labels = dataset[0]
graph = dgl.add_reverse_edges(graph)
graph.ndata['label'] = node_labels[:, 0]

node_features = graph.ndata['feat']
num_features = node_features.shape[1]
num_classes = (node_labels.max() + 1).item()

idx_split = dataset.get_idx_split()
train_nids = idx_split['train']
valid_nids = idx_split['valid']
test_nids = idx_split['test']

sampler = dgl.dataloading.NeighborSampler([4, 4], prefetch_node_feats=['feat'], prefetch_labels=['label'])

train_dataloader = dgl.dataloading.DataLoader(
    # The following arguments are specific to DGL's DataLoader.
    graph,              # The graph
    train_nids,         # The node IDs to iterate over in minibatches
    sampler,            # The neighbor sampler
    device=device,      # Put the sampled MFGs on CPU or GPU
    # The following arguments are inherited from PyTorch DataLoader.
    batch_size=1024,    # Batch size
    shuffle=True,       # Whether to shuffle the nodes for every epoch
    drop_last=False,    # Whether to drop the last incomplete batch
    num_workers=0,      # Number of sampler processes
)

import torch.nn as nn
import torch.nn.functional as F
from dgl.nn.pytorch import SAGEConv

class Model(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super(Model, self).__init__()
        self.conv1 = SAGEConv(in_feats, h_feats, aggregator_type='mean')
        self.conv2 = SAGEConv(h_feats, num_classes, aggregator_type='mean')
        self.h_feats = h_feats

    def forward(self, mfgs, x):
        h_dst = x[:mfgs[0].num_dst_nodes()]
        h = self.conv1(mfgs[0], (x, h_dst))
        h = F.relu(h)
        h_dst = h[:mfgs[1].num_dst_nodes()]
        h = self.conv2(mfgs[1], (h, h_dst))
        return h

model = Model(num_features, 128, num_classes).to(device)

opt = torch.optim.Adam(model.parameters())

def train():
    for epoch in range(4):
        # nvtx
        for step, (input_nodes, output_nodes, mfgs) in enumerate(train_dataloader):
            inputs = mfgs[0].srcdata['feat']
            # batch nodes label
            labels = mfgs[-1].dstdata['label']
            predictions = model(mfgs, inputs)
            loss = F.cross_entropy(predictions, labels)
        #     break
        # break


I think it is hard to tell whether nsys causes the issue. Maybe the issue has to do with DGL, or most likely I did something wrong myself… At least I can now get an explainable nsys report. As I said above,

I will only enable “collect CUDA trace”, “collect CPU IP/backtrace samples”, which may solve a lot of strange problems, and then enable other options one-by-one to see what causes the problem.

So if the issue really can be reproduced, I’m glad to have made a small contribution.
Thanks in advance!

Hi there, first I would like to thank you for reaching out with these issues.

  • The CLI command you mentioned: nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s cpu -o nsight_report -f true --cudabacktrace=true -x true python3.9
    This is not complete for what you want: you need to set --python-backtrace=cuda for the CUDA backtraces to include the Python backtrace.

  • Not being able to see CUDA kernel info, we’ll look into it.

  • Not being able to see the Python backtrace: we are aware of that bug and it’s being worked on. For now you can use the following workaround:
    Create a config.ini file and put it in the same folder where your nsys executable is located.
    Add CudaBacktraceDepth=180 to the config file. This option will probably increase the overhead, but it should let you see the Python backtraces within the CUDA API backtrace’s tooltip.

  • Nsys GUI crash when “Collect python backtrace samples” is enabled, we’ll look into it asap.
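The first and third points above can be sketched together in Python. This is only an illustration under stated assumptions: the function names `nsys_command` and `write_workaround_config` are hypothetical, the script name `train.py` stands in for whatever follows `python3.9` in the original command, and the target directory must be the folder that actually contains the nsys executable. Only the flags, the `config.ini` file name, and the `CudaBacktraceDepth=180` line come from this thread.

```python
from pathlib import Path


def nsys_command(script, output="nsight_report"):
    """Command line from this thread, plus --python-backtrace=cuda."""
    return [
        "nsys", "profile",
        "-w", "true",
        "-t", "cuda,nvtx,osrt,cudnn,cublas",
        "-s", "cpu",
        "-o", output,
        "-f", "true",
        "--cudabacktrace=true",
        "--python-backtrace=cuda",  # the flag missing from the original command
        "-x", "true",
        "python3.9", script,        # `script` is a placeholder, not from the thread
    ]


def write_workaround_config(nsys_dir, depth=180):
    """Write config.ini with CudaBacktraceDepth into nsys_dir.

    Note: this overwrites any config.ini already present in that folder.
    """
    config = Path(nsys_dir) / "config.ini"
    config.write_text(f"CudaBacktraceDepth={depth}\n")
    return config


# Print the command instead of running it, so it can be inspected first.
print(" ".join(nsys_command("train.py")))
```

The list returned by `nsys_command` could be handed to `subprocess.run` on the target, and `write_workaround_config` would be called once with the nsys install folder before profiling.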

In fact, if I use --python-backtrace=cuda in the CLI on the target and then open the report with the nsys GUI on my PC, the GUI crashes after a while (or immediately).

That is probably because of the tooltip bug. As I said, if you increase CudaBacktraceDepth to a high enough value (using the config.ini file), it shouldn’t crash.
