Issue running Nsys profiler in docker and analysing the results

Hello,

We have an application based on the holoscan:v3.4.0-igpu Docker image, running on a Jetson Orin with Ubuntu 22.04. Our goal is to profile the application to check CUDA usage and verify that our models run in parallel.

I was trying to follow NSight Systems Profiling - NVIDIA Docs.
The first issue is that nsys is not found in the container. Since I have it on the host (nsys: /usr/local/bin/nsys), I tried to mount it in our docker run command with -v /usr/local/bin/nsys:/usr/local/bin/nsys

Once it was mounted, the library libbpf.so.1 could not be found:
nsys: error while loading shared libraries: libbpf.so.1: cannot open shared object file: No such file or directory
I installed it with apt install -y libbpf1 inside the container.

I then got this error:
Error: The CLI executable is in the ‘target-linux-tegra-armv8’ directory in your installation.
Modify the executable in your command to be a symbolic link pointing to ‘target-linux-tegra-armv8/nsys’.
This was easily fixed by mounting /opt/nvidia/nsight-systems as a volume in the container.

I can now run the command /opt/nvidia/nsight-systems/2024.5.4/target-linux-tegra-armv8/nsys profile -t cuda,nvtx,osrt -o profiler -f true -d 3 python script.py
It records fine but generates a .qdstrm file instead of a .nsys-rep file. I can’t open this file with nsys-ui (it keeps loading indefinitely).

I was able to obtain the .nsys-rep file by using the QdstrmImporter utility, as detailed in the User Guide — Nsight Systems.
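For reference, the conversion step could look roughly like the sketch below. It assumes the 2024.5.4 install path from the command above, and that QdstrmImporter lives in the host-linux-tegra-armv8 directory and accepts --input-file / --output-file; check `QdstrmImporter --help` on your install. The `run` helper and `DRY_RUN` variable are illustrative only, so that the command is printed rather than executed by default.

```shell
# Sketch only: paths assume the 2024.5.4 install mentioned in this thread.
NSYS_HOME="${NSYS_HOME:-/opt/nvidia/nsight-systems/2024.5.4}"
# Assumed location of the importer inside the installation.
IMPORTER="${NSYS_HOME}/host-linux-tegra-armv8/QdstrmImporter"
INPUT="${INPUT:-profiler.qdstrm}"
# Derive the report name by swapping the extension.
OUTPUT="${INPUT%.qdstrm}.nsys-rep"

# Hypothetical helper: DRY_RUN=1 (default) prints the command, DRY_RUN=0 runs it.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

run "$IMPORTER" --input-file "$INPUT" --output-file "$OUTPUT"
```

Set DRY_RUN=0 once the printed command looks right for your paths.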

After opening it with nsys-ui, my question is: how can I verify that our models execute in parallel? I can see the GXF responsible for the inference running across multiple streams, but each block is on only one worker thread at a time. How can I verify that we are using our resources optimally?

Thank you for reaching out.

Mounting /usr/local/bin/nsys mounts only the main symlink to our application rather than all of the files it needs. This causes the problems you have already noticed: the missing libbpf.so.1 (which is delivered as part of the Nsight Systems installation), the report file not being generated because the host-linux-tegra-armv8 directory is missing, and the wrong location for our files ("Error: The CLI executable is in…").

Could you try mounting the whole host /opt/nvidia directory into your container (-v /opt/nvidia:/opt/nvidia) and repeating the test?
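A minimal sketch of that mount, with a quick check that nsys starts inside the container, might look like this. The image tag is the one given in the question (your full registry path may differ), the nsys path is the 2024.5.4 one from your command, and `--runtime nvidia` is an assumption about your existing docker run flags. The `run` helper and `DRY_RUN` variable are illustrative; by default the command is only printed.

```shell
# Assumptions: image tag and nsys version are taken from the question.
IMAGE="${IMAGE:-holoscan:v3.4.0-igpu}"
NSYS_DIR="${NSYS_DIR:-/opt/nvidia}"
NSYS_BIN="${NSYS_DIR}/nsight-systems/2024.5.4/target-linux-tegra-armv8/nsys"

# Hypothetical helper: DRY_RUN=1 (default) prints the command, DRY_RUN=0 runs it.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Mount the whole /opt/nvidia tree at the same path, then ask nsys for its version.
run docker run --rm --runtime nvidia \
  -v "${NSYS_DIR}:${NSYS_DIR}" \
  "$IMAGE" "$NSYS_BIN" --version
```

If the version prints without a library or path error, the same -v option can be merged into your usual docker run command.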

On the other problem: what does your python script.py do? Could you provide a screenshot of the Nsight Systems timeline rows containing the workload that you are concerned is not running in parallel, so that I can better understand the issue?

Additionally, you can check SM occupancy. Run with the --gpu-metrics-device=all option and look at the SM Active row under the GPU section. This shows how much of the GPU’s Streaming Multiprocessor capacity is actively being used.
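As a rough sketch, the profiling command from the question with the GPU metrics option added could look like the following. The second command is an optional addition not discussed above: it assumes the `cuda_gpu_trace` report of `nsys stats` is available in your version, and lists kernel start times, durations and streams so overlapping kernels can be spotted in text form. GPU metrics collection may need elevated privileges on some setups. The `run` helper and `DRY_RUN` variable are illustrative; by default the commands are only printed.

```shell
# Assumption: nsys path and report name are taken from the question.
NSYS="${NSYS:-/opt/nvidia/nsight-systems/2024.5.4/target-linux-tegra-armv8/nsys}"
REPORT="${REPORT:-profiler}"

# Hypothetical helper: DRY_RUN=1 (default) prints the command, DRY_RUN=0 runs it.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Same trace options as before, plus GPU metrics sampling for the SM Active row.
run "$NSYS" profile -t cuda,nvtx,osrt --gpu-metrics-device=all \
  -o "$REPORT" -f true -d 3 python script.py

# Optional: text summary of GPU activity per kernel (start, duration, stream).
run "$NSYS" stats --report cuda_gpu_trace "${REPORT}.nsys-rep"
```

Kernels from different streams whose time ranges overlap in that listing are executing concurrently on the GPU.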