I installed the NVIDIA kmod driver on a CentOS 7.5 machine with A100 GPUs, and I am getting this error: nvidia-container-cli: initialization error: cuda error: unknown error
The dmesg logs also show: NVRM: installed in this system is not supported by the NVIDIA 396.82 driver release.
Can you advise which driver version we should use, or whether there is a way to fix this error?
nvidia-container information from nvidia-container-cli -k -d /dev/tty info
-- WARNING, the following logs are for debugging purposes only --
I0520 21:14:52.146938 87864 nvc.c:281] initializing library context (version=1.0.1, build=038fb92d00c94f97d61492d4ed1f82e981129b74)
I0520 21:14:52.147026 87864 nvc.c:255] using root /
I0520 21:14:52.147030 87864 nvc.c:256] using ldcache /etc/ld.so.cache
I0520 21:14:52.147034 87864 nvc.c:257] using unprivileged user 10041:10042
W0520 21:14:52.149093 87865 nvc.c:186] failed to set inheritable capabilities
W0520 21:14:52.149171 87865 nvc.c:187] skipping kernel modules load due to failure
I0520 21:14:52.149785 87866 driver.c:133] starting driver service
I0520 21:14:52.161217 87864 driver.c:233] driver service terminated with signal 15
nvidia-container-cli: initialization error: cuda error: unknown error
Kernel version from uname -a
Linux datasciairflowa100gpuworkerc1-dsm1-1 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Any relevant kernel output lines from dmesg
NVRM: installed in this system is not supported by the NVIDIA 396.82 driver release.
Driver information from nvidia-smi -a
==============NVSMI LOG==============
Timestamp : Thu May 20 11:58:52 2021
Driver Version : 460.73.01
CUDA Version : 11.2
Attached GPUs : 8
GPU 00000000:00:04.0
Product Name : A100-SXM4-40GB
Product Brand : NVIDIA
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1323820022403
GPU UUID : GPU-5c2da4bc-dc76-ea02-156e-00422cb46cab
Minor Number : 0
VBIOS Version : 92.00.36.00.04
MultiGPU Board : No
Board ID : 0x4
GPU Part Number : 692-2G506-0200-002
Inforom Version
Image Version : G506.0200.00.04
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization Mode : Pass-Through
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x00
Device : 0x04
Domain : 0x0000
Device Id : 0x20B010DE
Bus Id : 00000000:00:04.0
Sub System Id : 0x134F10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : N/A
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 40536 MiB
Used : 0 MiB
Free : 40536 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 1 MiB
Free : 65535 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 640 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 37 C
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
GPU Max Operating Temp : 85 C
GPU Target Temperature : N/A
Memory Current Temp : 37 C
Memory Max Operating Temp : 95 C
Power Readings
Power Management : Supported
Power Draw : 61.62 W
Power Limit : 400.00 W
Default Power Limit : 400.00 W
Enforced Power Limit : 400.00 W
Min Power Limit : 100.00 W
Max Power Limit : 400.00 W
Clocks
Graphics : 1095 MHz
SM : 1095 MHz
Memory : 1215 MHz
Video : 990 MHz
Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Default Applications Clocks
Graphics : 1095 MHz
Memory : 1215 MHz
Max Clocks
Graphics : 1410 MHz
SM : 1410 MHz
Memory : 1215 MHz
Video : 1290 MHz
Max Customer Boost Clocks
Graphics : 1410 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes : None
Docker version from docker version
Client:
Version: 18.09.1
API version: 1.39
Go version: go1.10.6
Git commit: 4c52b90
Built: Wed Jan 9 19:35:01 2019
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 18.09.1
API version: 1.39 (minimum version 1.12)
Go version: go1.10.6
Git commit: 4c52b90
Built: Wed Jan 9 19:06:30 2019
OS/Arch: linux/amd64
Experimental: false
NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
xorg-x11-drv-nvidia-libs-396.82-1.el7.x86_64
libnvidia-container1-1.0.1-1.x86_64
nvidia-docker2-2.0.3-1.docker18.09.1.ce.noarch
nvidia-container-runtime-2.0.0-1.docker18.09.1.x86_64
xorg-x11-drv-nvidia-devel-396.82-1.el7.x86_64
nvidia-container-toolkit-1.0.5-2.x86_64
nvidia-kmod-396.82-2.el7.x86_64
xorg-x11-drv-nvidia-396.82-1.el7.x86_64
xorg-x11-drv-nvidia-gl-396.82-1.el7.x86_64
libnvidia-container-tools-1.0.1-1.x86_64
NVIDIA container library version from nvidia-container-cli -V
version: 1.0.1
build date: 2019-01-15T23:26+0000
build revision: 038fb92d00c94f97d61492d4ed1f82e981129b74
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-36)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
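
One thing that stands out in the output above: the installed RPMs (nvidia-kmod and the xorg-x11-drv-nvidia packages) are all at 396.82, while nvidia-smi reports driver version 460.73.01. That may mean two driver installs coexist on this host, though that is only a guess from the logs. Below is a minimal Python sketch that flags this kind of mismatch by comparing saved `rpm -qa` output against saved `nvidia-smi -a` output. The function names and the package-name pattern are my own assumptions for illustration and are not part of any NVIDIA tool.

```python
import re

# Assumed naming scheme: driver RPMs look like the ones listed above,
# e.g. "nvidia-kmod-396.82-2.el7.x86_64" or "xorg-x11-drv-nvidia-libs-396.82-1.el7.x86_64".
DRIVER_PKG = re.compile(
    r"^(nvidia-kmod|xorg-x11-drv-nvidia(?:-[a-z]+)?)-(\d+(?:\.\d+)+)-"
)


def driver_package_versions(rpm_lines):
    """Map driver package name to its version, skipping non-driver packages."""
    versions = {}
    for line in rpm_lines:
        match = DRIVER_PKG.match(line.strip())
        if match:
            versions[match.group(1)] = match.group(2)
    return versions


def smi_driver_version(smi_text):
    """Extract the 'Driver Version' field from nvidia-smi -a output, or None."""
    match = re.search(r"Driver Version\s*:\s*(\S+)", smi_text)
    return match.group(1) if match else None


def find_mismatches(rpm_lines, smi_text):
    """Return sorted (package, version) pairs whose version differs from nvidia-smi's."""
    reported = smi_driver_version(smi_text)
    packages = driver_package_versions(rpm_lines)
    return sorted((name, ver) for name, ver in packages.items() if ver != reported)


if __name__ == "__main__":
    # Values copied from the report above.
    rpm_output = [
        "xorg-x11-drv-nvidia-libs-396.82-1.el7.x86_64",
        "libnvidia-container1-1.0.1-1.x86_64",
        "nvidia-kmod-396.82-2.el7.x86_64",
    ]
    smi_output = "Driver Version : 460.73.01\nCUDA Version : 11.2\n"
    for name, version in find_mismatches(rpm_output, smi_output):
        print(f"{name} is {version}, but nvidia-smi reports "
              f"{smi_driver_version(smi_output)}")
```

Any package this prints is worth checking before changing anything else, since the dmesg line names the 396.82 release specifically.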