Dear All,
We would like to make GPU profiling available to all of the researchers using our high-performance computing cluster. When we run:
ncu --target-processes all -o profile python3 code.py
We see:
==ERROR== ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters on the target device 0. For instructions on enabling permissions and to get more information see https://developer.nvidia.com/ERR_NVGPUCTRPERM
Following those instructions, we created a conf file in /etc/modprobe.d:
[root@X ~]# cat /etc/modprobe.d/nvidia.conf
options nvidia NVreg_RestrictProfilingToAdminUsers=0
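(As a sanity check, the NVIDIA page above says that once the module loads, /proc/driver/nvidia/params should report RmProfilingAdminOnly: 0. We also wanted to rule out another file in /etc/modprobe.d overriding our option. A quick sketch of both checks; the grep patterns are our own guesses:)

```shell
# List every nvidia-related line under /etc/modprobe.d, to rule out a
# stray blacklist or 'install nvidia /bin/false' entry shadowing ours.
grep -rn nvidia /etc/modprobe.d/ 2>/dev/null || echo "no nvidia entries found"

# Once the module loads again, per NVIDIA's ERR_NVGPUCTRPERM page this
# should report 0 (left commented out since the driver is currently down):
# grep RmProfilingAdminOnly /proc/driver/nvidia/params
```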
However, upon reboot:
[root@X ~]# nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
/var/log/messages says:
Dec 4 11:35:09 holygpu7c1309 kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 234
Dec 4 11:35:09 holygpu7c1309 kernel: NVRM: The NVIDIA probe routine was not called for 4 device(s).
Dec 4 11:35:09 holygpu7c1309 kernel: NVRM: This can occur when a driver such as:
NVRM: nouveau, rivafb, nvidiafb or rivatv
NVRM: was loaded and obtained ownership of the NVIDIA device(s).
Dec 4 11:35:09 holygpu7c1309 kernel: NVRM: Try unloading the conflicting kernel module (and/or
NVRM: reconfigure your kernel without the conflicting
NVRM: driver(s)), then try loading the NVIDIA kernel module
NVRM: again.
Dec 4 11:35:09 holygpu7c1309 kernel: NVRM: No NVIDIA devices probed.
Dec 4 11:35:09 holygpu7c1309 kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 234
Dec 4 11:35:09 holygpu7c1309 systemd-udevd[4770]: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \ -f 1) 255'' failed with exit code 1.
Dec 4 11:35:09 holygpu7c1309 systemd-udevd[4775]: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \ -f 1) 255'' failed with exit code 1.
Dec 4 11:35:09 holygpu7c1309 systemd-udevd[4770]: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \ -f 1) 255'' failed with exit code 1.
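Our reading (which may be wrong) is that the systemd-udevd mknod failures are a downstream symptom rather than a cause: since the module never loaded, nvidia-frontend never appears in /proc/devices, so the command substitution inside the udev rule expands to nothing and mknod fails. This reproduces that lookup:

```shell
# Same lookup the udev rule performs; on this node it finds nothing
# because the NVIDIA module never registered a character major number.
grep nvidia-frontend /proc/devices || echo "nvidia-frontend major not registered"
```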
This is Rocky Linux release 8.7 (Green Obsidian), and our drivers are:
[root@X modprobe.d]# rpm -qa | grep nvidia
nvidia-xconfig-535.104.12-1.el8.x86_64
nvidia-libXNVCtrl-devel-535.104.12-1.el8.x86_64
nvidia-driver-cuda-535.104.12-1.el8.x86_64
nvidia-libXNVCtrl-535.104.12-1.el8.x86_64
nvidia-kmod-common-535.104.12-1.el8.noarch
nvidia-driver-devel-535.104.12-1.el8.x86_64
nvidia-persistenced-535.104.12-1.el8.x86_64
nvidia-driver-cuda-libs-535.104.12-1.el8.x86_64
nvidia-settings-535.104.12-1.el8.x86_64
dnf-plugin-nvidia-2.0-1.el8.noarch
nvidia-driver-535.104.12-1.el8.x86_64
nvidia-driver-NvFBCOpenGL-535.104.12-1.el8.x86_64
nvidia-driver-NVML-535.104.12-1.el8.x86_64
kmod-nvidia-latest-dkms-535.104.12-1.el8.x86_64
nvidia-driver-libs-535.104.12-1.el8.x86_64
nvidia-modprobe-535.104.12-1.el8.x86_64
How can we check whether there really is a kernel module conflict? Is there another way to solve this problem?
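For context, these are the only conflict checks we know to try, sketched from the module names in the NVRM message above (the fallback messages are ours, and we are not sure these are the right diagnostics):

```shell
# Is one of the drivers NVRM names actually loaded right now?
grep -E 'nouveau|rivafb|nvidiafb|rivatv' /proc/modules \
    || echo "no conflicting module loaded"

# On dracut-based systems nouveau is usually also kept out of the
# initramfs via the kernel command line; check whether that is in place:
grep -o 'rd.driver.blacklist=[^ ]*' /proc/cmdline \
    || echo "no rd.driver.blacklist on cmdline"
```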
Thank you!