Using Nsight Compute in Docker

I pulled the Docker container with this command:
docker pull nvcr.io/nvidia/pytorch:19.03-py3
and ran it with
sudo nvidia-docker run -it nvcr.io/nvidia/pytorch:19.03-py3
I want to use the Nsight Compute CLI inside the container:
/usr/local/cuda/NsightCompute-2019.1/nv-nsight-cu-cli -o output /opt/conda/bin/python ****
but I get this error:
==ERROR== The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. See the following link for instructions to enable permissions and get more information: https://developer.nvidia.com/NVSOLN1000

Any ideas?

AFAIK, there are no restrictions on running from a Linux container. (Windows containers do restrict profiling).

You should follow the link in the error message. It is an alias that should resolve to

https://developer.nvidia.com/nvidia-development-tools-solutions-ERR_NVGPUCTRPERM-permission-issue-performance-counters

There you will find the reasons you are hitting this issue and instructions for gaining access to the GPU performance counters (restricted by recently released NVIDIA drivers).

See

  • “Solutions for this issue”
  • “Administration” > “Command Line Control - Linux Only”

I tried these 3 commands on the host:

systemctl isolate multi-user
modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia-vgpu-vfio nvidia
modprobe nvidia NVreg_RestrictProfilingToAdminUsers=0

Then I entered the Docker container, but the error is still there. The NVIDIA driver version is 418.56, which I think matches “418.43 or later”.
If I try the 3 commands inside the container instead, it says “modprobe not found”, even though I believe I am root in the container.
So, what should I do?
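
One way to confirm whether the modprobe option actually took effect is to read the driver's parameter file. The sketch below assumes the loaded driver exposes /proc/driver/nvidia/params with an RmProfilingAdminOnly entry (1 means restricted, 0 means unrestricted). The function name check_profiling_restriction is made up for illustration, and whether the file is visible inside the container depends on how the container runtime mounts it.

```shell
# Report whether the loaded NVIDIA driver restricts GPU performance
# counters to admin users. The optional argument overrides the params
# path (assumption: the driver exposes RmProfilingAdminOnly there).
check_profiling_restriction() {
    params_file="${1:-/proc/driver/nvidia/params}"
    if [ ! -r "$params_file" ]; then
        echo "unknown: $params_file not readable (is the nvidia module loaded?)"
        return 2
    fi
    # Lines look like "RmProfilingAdminOnly: 1"; pick out the value.
    value=$(awk -F': *' '$1 == "RmProfilingAdminOnly" {print $2}' "$params_file")
    case "$value" in
        0) echo "unrestricted"; return 0 ;;
        1) echo "restricted";   return 1 ;;
        *) echo "unknown: RmProfilingAdminOnly not found"; return 2 ;;
    esac
}
```

Calling check_profiling_restriction with no argument on the host right after the modprobe step, and again inside the container, shows whether the setting was applied at all before looking for container-specific causes.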

@suntao2012

Try:

apt update && apt install kmod

I updated and installed kmod, but I still cannot enable the permission with modprobe nvidia NVreg_RestrictProfilingToAdminUsers=0

It shows: modprobe: FATAL: Module nvidia not found in directory /lib/modules/5.3.0-61-generic

Please help me!

I am not yet certain why the above steps, run outside the container, would fail to enable you to profile within the container. I would suggest two things to try:

  • can you profile on the same system outside of the container (without being root)?
  • can you try setting the permissions permanently using the /etc/modprobe.d file, reboot, and check if that solves it within the container?

Alternatively, a file containing 'options nvidia "NVreg_RestrictProfilingToAdminUsers=0"' may be saved to /etc/modprobe.d

[5] On some systems (or when using a deb to install), it may be necessary to rebuild the initrd after writing a configuration file to /etc/modprobe.d[6]
[6] When rebuilding the initrd, running “update-initramfs -u” is also required.
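
The permanent setting described above could be sketched roughly as follows. The file name nvidia-profiling.conf and the function name write_profiling_conf are arbitrary choices for illustration, not something the NVIDIA page prescribes. The initrd rebuild and reboot are left as comments because they apply to the host only and update-initramfs is specific to Debian/Ubuntu.

```shell
# Write a modprobe options file that lifts the profiling restriction.
# The optional argument overrides the target directory (default: /etc/modprobe.d).
# The file name nvidia-profiling.conf is a hypothetical choice.
write_profiling_conf() {
    conf_dir="${1:-/etc/modprobe.d}"
    conf_file="$conf_dir/nvidia-profiling.conf"
    printf '%s\n' 'options nvidia "NVreg_RestrictProfilingToAdminUsers=0"' > "$conf_file"
    echo "$conf_file"
}

# On the host, as root (not executed by this snippet):
#   write_profiling_conf
#   update-initramfs -u    # rebuild the initrd (Debian/Ubuntu)
#   reboot
```

This must be done on the host, not in the container: the container shares the host's kernel and driver module, which is also why modprobe inside the container cannot find the nvidia module.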

Finally, be advised that Nsight Compute 2019.1 has known issues profiling PyTorch newer than 19.07. Given that you use 19.03, it might be fine, but you might also want to consider updating to Nsight Compute 2020.1, in which this issue has been resolved.
https://docs.nvidia.com/nsight-compute/ReleaseNotes/index.html#updates-2020-1

Linking the new Nsight Compute blog post for running in containers here for future reference: https://developer.nvidia.com/blog/using-nsight-compute-in-containers/
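
For readers who cannot change the driver setting on the host, another option that may work is to start the container with the SYS_ADMIN capability, so that root inside the container counts as an admin user for the performance counters. This is a sketch under that assumption and has not been verified against this exact image and driver. The helper name profiling_run_cmd is made up, and it only prints the command so it can be reviewed before running.

```shell
# Print a docker command line that starts the container with the
# SYS_ADMIN capability added (assumption: this is enough for root in
# the container to access the GPU performance counters).
profiling_run_cmd() {
    image="${1:-nvcr.io/nvidia/pytorch:19.03-py3}"
    echo "sudo nvidia-docker run -it --cap-add=SYS_ADMIN $image"
}

profiling_run_cmd
```

Note that --cap-add=SYS_ADMIN grants the container broad privileges, so it is better suited to a development machine than to a shared one.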