How to get full profiling with Nsight Systems for a particular process

Hi,

I am using DeepStream 6.4 and trying to profile my custom pipeline, including all of its elements, to find the bottleneck and fix it. I also created a topic on the DeepStream forum, which I link below for context.
I have a server.py that reads the configuration of multiple cameras and creates a process that builds the pipeline.
My problem is that I cannot see how much time each pipeline element takes.
This is the command I used: /opt/nvidia/nsight-systems/2024.3.1/bin/nsys profile --trace=cuda,cudnn,cublas,osrt,nvtx --python-backtrace=cuda --python-sampling=true -d 120 --delay=60 python3 server.py

I will share the profile report as well.
Nvidia_forums_nsight_system.zip (1.2 MB)

What I want is to see how much time each element of the pipeline takes. I need help with this quickly.

DeepStream topic link: Profiling Nsight system with deepstream-6.4

Please go through the topic above to understand the setup properly.

Hi debjit.adak,
Nsight Systems captures traces of the target app and its child processes.
You wrote:

This is the command I used: /opt/nvidia/nsight-systems/2024.3.1/bin/nsys profile --trace=cuda,cudnn,cublas,osrt,nvtx --python-backtrace=cuda --python-sampling=true -d 120 --delay=60 python3 server.py

The command line above is good, with one caveat. To collect Python backtraces for CUDA API calls, you must enable the CUDA backtrace collection feature by adding --cudabacktrace=all. Note that this feature may incur significant overhead. You can opt to collect backtraces only for specific types of CUDA API calls. See the CLI profile command switch options for more details.

Looking at the report file you attached, it seems that:

  1. The profile command line is different from the one in your post. See attached screenshot.
  2. CUDA API calls and NVTX annotations were not collected.

I suggest you try to capture a report with the command line you posted above, possibly with the modification I suggested.

Doron

@dofek I added what you suggested. My command is now:
/opt/nvidia/nsight-systems/2024.3.1/bin/nsys profile --trace=cuda,cudnn,cublas,osrt,nvtx --cudabacktrace=all --python-backtrace=cuda --python-sampling=true -d 120 --delay=60 python3 server.py
I am attaching the new report as well. In this report I still don't see any CUDA or NVTX information.
new_report.zip (10.1 MB)

I have no idea why CUDA API calls and NVTX annotations are not being collected. Please check the new report, captured with the corrected command, and suggest what to try next.
Can you help me with this?

Are the child processes created using fork()?
Can you try to add the line
multiprocessing.set_start_method('spawn')
somewhere at the beginning of your code and try to profile it?

No! server.py creates the main process, and pipeline creation runs in a child process under it. For the processes I am using Python multiprocessing.

multiprocessing.set_start_method('spawn')
When I add this at the beginning of my code as you suggested, I get this error: "RuntimeError: context has already been set"

Hi @debjit.adak
Can you please make sure that you put the line multiprocessing.set_start_method('spawn') at the very top of your Python code? It should be before any imports (besides multiprocessing, of course).
Also, you can try changing the line to multiprocessing.set_start_method('spawn', force=True)

@Guy_Sz @dofek

I did what you said and added multiprocessing.set_start_method('spawn', force=True), but I now get a similar error:

RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

Right, my bad. The set_start_method() must be called inside the if __name__ == '__main__' clause. Can you try that?
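For reference, here is a minimal sketch of that idiom. The names camera_label, build_pipeline and main are hypothetical stand-ins for whatever server.py actually does per camera; only the multiprocessing calls are real APIs.

```python
import multiprocessing
import os


def camera_label(camera_id: int) -> str:
    """Hypothetical helper: a readable name for one camera."""
    return f"camera-{camera_id}"


def build_pipeline(camera_id: int) -> None:
    """Hypothetical stand-in for the per-camera pipeline worker."""
    print(f"{camera_label(camera_id)} running in pid {os.getpid()}")


def main() -> None:
    # One child process per camera; two cameras are assumed for illustration.
    workers = [
        multiprocessing.Process(target=build_pipeline, args=(camera_id,))
        for camera_id in range(2)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()


if __name__ == "__main__":
    # With 'spawn', each child re-imports this module under a different
    # __name__, so this guarded block runs only in the parent process.
    multiprocessing.set_start_method("spawn")
    main()
```

Because everything that starts processes sits behind the guard, the re-import done by each spawned child does not try to start processes again, which is what the bootstrapping error complains about.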

If you see the previous error, "RuntimeError: context has already been set", it means that the start method had already been set. Note that this can happen implicitly: for example, multiprocessing.get_start_method() sets the start method as a side effect. datasets.load_dataset() also sets the start method.
So the call to set_start_method() must come before all of that.
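A small sketch of how to check for this, assuming only the standard library. The helper names start_method_is_fixed and try_set_spawn are made up for illustration. Passing allow_none=True lets you query the start method without fixing it.

```python
import multiprocessing


def start_method_is_fixed() -> bool:
    """Return True if a start method has already been fixed in this process.

    allow_none=True keeps the query free of side effects; a plain
    get_start_method() call fixes the platform default.
    """
    return multiprocessing.get_start_method(allow_none=True) is not None


def try_set_spawn() -> str:
    """Try to select 'spawn' and report what happened."""
    try:
        multiprocessing.set_start_method("spawn")
        return "set"
    except RuntimeError as exc:
        # Raised when something has already fixed the start method.
        return f"failed: {exc}"


if __name__ == "__main__":
    print("fixed before query:", start_method_is_fixed())
    multiprocessing.get_start_method()  # fixes the default as a side effect
    print("fixed after query:", start_method_is_fixed())
    print(try_set_spawn())
    # force=True overrides a start method that was fixed earlier.
    multiprocessing.set_start_method("spawn", force=True)
    print("now using:", multiprocessing.get_start_method())
```

Calling start_method_is_fixed() right before set_start_method() in your own script can help locate which import or call fixed the method first.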

@Guy_Sz
Hey, I am facing the same issue: nsys is not profiling the CUDA kernels and APIs in the subprocesses being launched.
However, given my setup and requirements, we have to use fork() and cannot use spawn().
I am using the command below, but it does not help:

nsys profile --trace cuda -o arxiv_gpu_shm --force-overwrite true  --trace-fork-before-exec true python node_clas.py --dataset ogbn-arxiv --epoch 1

I have tried two nsys versions:

NVIDIA Nsight Systems version 2024.2.1.106-242134037904v0
NVIDIA Nsight Systems version 2022.1.3.3-1c7b5f7

Attached file for reference:
arxiv_gpu_shm.nsys-rep.zip (489.4 KB)