Can't get GPU Metrics with Nsight Systems

On a DGX box with 8 A100 GPUs, I'm trying to profile a program inside a Docker container. Specifically, I want to make sure that the inter-GPU communication flows through NVLink, so I profile with the --gpu-metrics-device flag. My command is:

sudo docker run --name "hy_yeast_multi_0" --rm --cap-add=SYS_ADMIN -v $(pwd):/workspace nvlink nsys profile --gpu-metrics-device=all --force-overwrite true --output=/workspace/profile_report_multi_gpu.nsys-rep /workspace/galactose_rdmeode1.9_test_MultiGPU.py -id 0 -t 10 -g 11.1 -gpus "2,3"
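As a sanity check before the full run, it can help to confirm the NVLink topology and which devices nsys can collect GPU metrics from. A minimal sketch, run on the host (or inside the container, assuming nsys and the GPUs are visible there):

nvidia-smi topo -m
nsys profile --gpu-metrics-device=help

The first command prints the inter-GPU connectivity matrix (NV# entries indicate NVLink), and the help value makes nsys list the GPUs that support GPU metrics collection.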

However, I can’t see GPU metrics in the profiling result file, as shown below.

Any help would be highly appreciated. Thanks!

Okay, all I am seeing there is the top-level information. Did you open the CPU and Processes overview lanes to see the data inside?

Yes, I did. It looks like this:

Can you look in the diagnostics/warnings pane and see if there is anything there?

@rknight

Sure! This is what I saw:


Thanks!

Hi hongyili000, can you share your nsys-rep file?

Sure! Here is the link: profile_report_multi_gpu.nsys-rep - Google Drive. Thanks!

I don’t see anything in your nsys-rep file that would suggest an issue. However, your version of nsys is relatively old. Can you upgrade to the latest version of Nsight Systems and try the collection again? You can find the latest version at Nsight Systems - Get Started | NVIDIA Developer

Sure! I downloaded and installed Nsight Systems 2024.5.1 inside the Docker container. I then ran the command

sudo docker run --name "hy_yeast_multi_0" --rm --cap-add=SYS_ADMIN -v $(pwd):/workspace nvlink nsys profile --gpu-metrics-device=all --force-overwrite true --output=/workspace/profile_report_multi_gpu.nsys-rep /workspace/galactose_rdmeode1.9_test_MultiGPU.py -id 0 -t 10 -g 11.1

and got:


Thanks!
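One follow-up check that may be worth doing is confirming that the container actually invokes the newly installed 2024.5.1 binary rather than an older nsys elsewhere on the PATH. A sketch using the same nvlink image:

sudo docker run --rm nvlink nsys --version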

I’m not sure, but you may have DCGM running on the host, which would prevent nsys GPU metrics from getting access to the GPUs' hardware counters.

See NVIDIA DCGM | NVIDIA Developer
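On a DGX host, DCGM typically runs as the nvidia-dcgm service, and the GPU hardware counters can only be used by one client at a time. A sketch of checking for the service and stopping it (assuming a systemd-based host):

sudo systemctl status nvidia-dcgm
sudo systemctl stop nvidia-dcgm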

Ok, thanks! When I ran the command sudo service nvidia-dcgm status, I got:

After DCGM is stopped, try the

nsys profile --gpu-metrics-devices=help

command again to see if any GPUs are available to profile using GPU metrics.

Sure! This is what I get:

Thanks!

What else could potentially prevent the GPU metrics from being collected?