Profiling a simple deep learning code: no Python backtrace + cannot use cudnn trace

Hi,

I am discovering Nsys for profiling deep learning models, so I tried a simple test with this code:

import torch
import torch.nn as nn

class MyModule(torch.nn.Module):
    def __init__(self):
        super(MyModule, self).__init__()
        self.qkv = nn.Linear(128, 384)

    def forward(self, xQuery):
        torch.cuda.nvtx.range_push("linear")
        qkv=self.qkv(xQuery) # Linear(128, 3*128)
        torch.cuda.nvtx.range_pop()
        torch.cuda.nvtx.range_push("chunk")
        qkv=qkv.chunk(3,dim=-1)
        torch.cuda.nvtx.range_pop()
        torch.cuda.nvtx.range_push("clone")
        q=qkv[0].clone()
        torch.cuda.nvtx.range_pop()
        torch.cuda.nvtx.range_push("permute")
        q = q.permute(0, 3, 1, 2)  # B, C, H, W
        torch.cuda.nvtx.range_pop()
        return q

torch.cuda.cudart().cudaProfilerStart()

torch.cuda.nvtx.range_push("initModel")
model = MyModule()
torch.cuda.nvtx.range_pop()
device=torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
torch.cuda.nvtx.range_push("ModelAndDataToGPU")
model.cuda()
xQuery = torch.randn(1,259, 259,128).to(device)
xFocalMaps=[torch.randn(1, 259, 259, 128).to(device), torch.randn(1, 130, 130, 128).to(device), torch.randn(1,65, 65, 128).to(device)]
torch.cuda.nvtx.range_pop()

for i in range(10):
    torch.cuda.nvtx.range_push(f"iteration{i}")
    result = model.forward(xQuery)
    torch.cuda.nvtx.range_pop()

torch.cuda.cudart().cudaProfilerStop()

To do this I found this command line, which seems outdated:

nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s cpu  --capture-range=cudaProfilerApi --stop-on-range-end=true --cudabacktrace=true -x true -o my_profile python main.py

So I modified it to fit my code and removed the parameters that did not work:

nsys profile -w true -t cuda,nvtx,cublas -s cpu  --capture-range=cudaProfilerApi -x true -o icarusComparisonProfiler .\venv\scripts\python.exe icarusTest.py --python-sampling-frequency=1000 --python-sampling=true --python-backtrace=cuda --cudabacktrace=true

According to User Guide :: Nsight Systems Documentation I should be able to see a python backtrace in the report, yet there is not any python backtrace.

Also, I see that cudnn is not a valid argument for the --trace option. Is this normal? It seems that it was working in the past.

I am using PyTorch. Do I need to install the nvtx package in my Python distribution for annotation?

I am using NSYS 2023.3.1
Python 3.11.6

Thank you !

Your screenshot seems to show that you were running this on a Windows target.

You have placed the Python backtrace options after your application name. The CLI assumes that any option after the application name is an argument to the application, rather than an argument to nsys.
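To make the ordering rule concrete, here is a minimal Python sketch that assembles the command so that every nsys option comes before the application. The helper `build_nsys_command` is hypothetical (it is not part of nsys or any library); the flags and file names are the ones already used in this thread.

```python
def build_nsys_command(nsys_options, application, application_args=()):
    """Hypothetical helper: put all nsys options BEFORE the application.

    Mirrors the usage line: nsys profile [<args>] [application] [<application args>]
    """
    return ["nsys", "profile", *nsys_options, application, *application_args]


# Options taken from the command lines posted in this thread.
nsys_options = [
    "-t", "cuda,nvtx,cublas",
    "--capture-range=cudaProfilerApi",
    "--python-sampling=true",
    "--python-sampling-frequency=1000",
    "-o", "icarusComparisonProfiler",
]

command = build_nsys_command(
    nsys_options,
    r".\venv\scripts\python.exe",  # the application being profiled
    ["icarusTest.py"],             # everything from here on goes to the application
)

# Print the command for inspection; it is not executed here.
print(" ".join(command))
```

Anything appended after `icarusTest.py` would be handed to the Python script rather than to nsys, which is why the sampling options had no effect in the original command.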

I am also confused as to why you are not seeing cudnn as a valid trace option, it certainly should be. That first command line should work.

Oh damn, I feel so dumb, that was obvious. Of course the Python backtrace is fully working now, but the cudnn trace is still not working:

nsys profile -w true -t cuda,nvtx,cublas,cudnn -s cpu  --capture-range=cudaProfilerApi -x true -o icarusComparisonProfiler --python-nvtx-annotations .\annotations.json --python-sampling-frequency=1000 --python-sampling=true .\venv\scripts\python.exe icarusTest.py
Illegal --trace argument 'cudnn'
Possible --trace values are one or more of 'cuda', 'nvtx', 'cublas', 'cublas-verbose', 'cusparse', 'cusparse-verbose', 'nvvideo', 'opengl', 'opengl-annotations', 'vulkan', 'vulkan-annotations', 'dx11', 'dx11-annotations', 'dx12', 'dx12-annotations', 'wddm' or 'none'

usage: nsys profile [<args>] [application] [<application args>]
Try 'nsys profile --help' for more information.

Can you help me with the cudnn thing ?

Thank you !

@skottapalli can you run a quick test on this and make sure that cudnn is working as expected? I would ask one of the Windows team, but they have today off.

Hi @hwilper, did you have time to run the test?
Thank you !

Let me ping Sneha again.

@Monkey.py - we don’t support cudnn tracing on Windows targets. It is available on all the Linux targets.

Thank you for your answer. I will have to use a Linux distribution, I guess. I tried on WSL, but it seems that CUDA tracing is not available on WSL, is it?

CUDA tracing on WSL is not supported yet due to inherent limitations. We are trying to overcome those in a future release of Nsight Systems.

Unfortunately, this means that you will need to use a Linux distribution. I apologize for the inconvenience.

@skottapalli

Hi, I set up an Ubuntu Linux dual boot to get access to the cudnn trace. I tested with the following code:

import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.linear = nn.Linear(128, 64)
    def forward(self,x):
        return self.linear(x)

torch.cuda.cudart().cudaProfilerStart()

torch.cuda.nvtx.range_push("initModel")
model= MyModel()
torch.cuda.nvtx.range_pop()

device=torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

torch.cuda.nvtx.range_push("ModelAndDataToGPU")
model.cuda()
xQuery = torch.randn(1,259, 259,128).to(device)
torch.cuda.nvtx.range_pop()

for i in range(10):
    torch.cuda.nvtx.range_push(f"iteration{i}")
    result = model.forward(xQuery)
    torch.cuda.nvtx.range_pop()
torch.cuda.cudart().cudaProfilerStop()

And I use the following command to profile:

sudo nsys profile -w true -t cuda,nvtx,cublas,cudnn -s cpu  --capture-range=cudaProfilerApi -x true -o icarusComparisonProfiler --python-sampling-frequency=1000 --python-sampling=true ./venv/bin/python ./Icarus/icarus/profiling/profCudnn.py

As expected, no error shows up when I run the command, yet when I look at the profiling report there is still no cudnn trace.

Am I missing something? The trace should be there, right?

Could you share the nsys-rep file, if you can, to help me debug?

Do you get cudnn traces if you remove the --capture-range=cudaProfilerApi? Do you see any warning or errors in the diagnostics page of the report in the GUI? It would be helpful to get the nsys-rep file from you so that I can check a few more things.

@skottapalli

You will find attached a .zip containing the python script, the report generated with and without --capture-range=cudaProfilerApi (see the readMe.txt which provides the command used to generate the files).

I tried removing the --capture-range option, but there is still no cudnn trace.
Any idea what could cause the issue?

In the report I have the error “Analysis 4589 00:02.643 No cuDNN events collected. Does the process use cuDNN?”. I suppose that a Linear layer should be using the cudnn library, right?

I am not very familiar with the torch model and its use of cudnn. Could you try profiling a cudnn sample that we know uses cudnn? This will help us isolate whether there is a problem with cudnn tracing in nsys or whether your original app is not using cudnn.

I will try to do that. I am not very familiar with using CUDA and cudnn directly; I just installed the toolkit to code a sample. If you have a sample to provide, I will be glad to use it.

Okay, after many experiments (building torch from source, using nsys on the CUDA and cudnn libraries in C++, …) I figured out that torch does not use the cudnn library for linear and sigmoid layers, but it does use it for conv2d, for example, and with a conv2d layer the trace appears. So it turns out that I had been searching for a cudnn trace for functions that do not call cudnn …
Anyway, thank you for your help and your time! You’ve helped me a lot.
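For anyone landing here later, this is a minimal sketch of a test script along the lines of the conclusion above: it swaps the Linear layer for a Conv2d, which in my runs was the kind of layer that produced cudnn events. The layer sizes and input shape are arbitrary assumptions chosen for illustration, and whether conv2d actually goes through cudnn depends on your PyTorch build and on `torch.backends.cudnn` being enabled, so treat it as a starting point rather than a guaranteed reproducer.

```python
import torch
import torch.nn as nn

use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

# Assumption: a small Conv2d stands in for "a layer that calls cuDNN".
model = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1).to(device)
x = torch.randn(1, 3, 64, 64, device=device)

# Reports whether this PyTorch build can use cuDNN at all.
print("cuDNN available:", torch.backends.cudnn.is_available())

# NVTX ranges and the profiler API are only touched when CUDA is present,
# so the script still runs (without annotations) on a CPU-only machine.
if use_cuda:
    torch.cuda.cudart().cudaProfilerStart()

for i in range(10):
    if use_cuda:
        torch.cuda.nvtx.range_push(f"iteration{i}")
    with torch.no_grad():
        out = model(x)
    if use_cuda:
        torch.cuda.nvtx.range_pop()

if use_cuda:
    torch.cuda.synchronize()  # make sure all kernels finish inside the capture range
    torch.cuda.cudart().cudaProfilerStop()

print(tuple(out.shape))
```

Profiling it on Linux with the same `-t cuda,nvtx,cublas,cudnn` command used earlier in the thread should then show whether the cudnn row appears in the report.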