Hello,
I’m trying to upgrade the NVIDIA driver on my DGX Station (V100) from nvidia-driver-470-server to nvidia-driver-525-server by following this guide, without success.
I cannot find a clear reference for driver branch compatibility with the DGX Station (V100). Is the R525 branch incompatible with my machine? What else could be the problem?
I’m using DGX OS 5.4.2.
Thanks in advance
I have the 525 series installed via the guide you linked.
What is the error that you’re seeing?
When I run
apt install -y --reinstall nvidia-peer-memory-dkms
I get
nv_peer_mem.ko:
Running module version sanity check.
- Original module
- No original module exists within this kernel
- Installation
- Installing to /lib/modules/5.4.0-146-generic/updates/dkms/
depmod...
DKMS: install completed.
modprobe: ERROR: could not insert 'nv_peer_mem': Invalid argument
Note that the module is there:
❯ l /lib/modules/5.4.0-146-generic/updates/dkms/
total 4,4M
-rw-r--r-- 1 root root 15K apr 3 10:24 auxiliary.ko
-rw-r--r-- 1 root root 87K apr 3 10:24 ib_cm.ko
-rw-r--r-- 1 root root 522K apr 3 10:24 ib_core.ko
-rw-r--r-- 1 root root 229K apr 3 10:24 ib_ipoib.ko
-rw-r--r-- 1 root root 37K apr 3 10:24 ib_umad.ko
-rw-r--r-- 1 root root 195K apr 3 10:24 ib_uverbs.ko
-rw-r--r-- 1 root root 76K apr 3 10:24 iw_cm.ko
-rw-r--r-- 1 root root 2,2M apr 3 10:24 mlx5_core.ko
-rw-r--r-- 1 root root 586K apr 3 10:24 mlx5_ib.ko
-rw-r--r-- 1 root root 21K apr 3 10:24 mlx_compat.ko
-rw-r--r-- 1 root root 176K apr 3 10:24 mlxdevm.ko
-rw-r--r-- 1 root root 40K apr 3 10:24 mlxfw.ko
-rw-r--r-- 1 root root 20K apr 5 10:21 nv_peer_mem.ko
-rw-r--r-- 1 root root 170K apr 3 10:24 rdma_cm.ko
-rw-r--r-- 1 root root 45K apr 3 10:24 rdma_ucm.ko
dmesg:
[mer apr 5 10:51:59 2023] NVRM: API mismatch: the client has the version 525.85.12, but
NVRM: this kernel module has the version 470.161.03. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
[mer apr 5 10:51:59 2023] NVRM: API mismatch: the client has the version 525.85.12, but
NVRM: this kernel module has the version 470.161.03. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
[mer apr 5 10:52:21 2023] nv_peer_mem: Unknown symbol nvidia_p2p_cap_persistent_pages (err -2)
[mer apr 5 10:52:21 2023] nv_peer_mem: disagrees about version of symbol nvidia_p2p_dma_unmap_pages
[mer apr 5 10:52:21 2023] nv_peer_mem: Unknown symbol nvidia_p2p_dma_unmap_pages (err -22)
[mer apr 5 10:52:21 2023] nv_peer_mem: disagrees about version of symbol nvidia_p2p_dma_map_pages
[mer apr 5 10:52:21 2023] nv_peer_mem: Unknown symbol nvidia_p2p_dma_map_pages (err -22)
[mer apr 5 10:52:21 2023] nv_peer_mem: disagrees about version of symbol nvidia_p2p_free_dma_mapping
[mer apr 5 10:52:21 2023] nv_peer_mem: Unknown symbol nvidia_p2p_free_dma_mapping (err -22)
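Actually, solved: we just had to reboot. As the dmesg output shows, the old 470.161.03 kernel module was still loaded while the user-space components were already at 525.85.12, so the new modules (including nv_peer_mem, which was built against 525) could not be loaded until the machine restarted.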
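For anyone hitting the same thing, here is a rough sketch of how to check for this mismatch before reinstalling nvidia-peer-memory-dkms. It compares the version of the nvidia module that is currently loaded with the version of the module installed on disk for the running kernel. The helper name compare_versions is made up for illustration, and the script assumes the module is named nvidia and exposes its version under /sys/module/nvidia/version, which may not hold on every setup.

```shell
#!/bin/sh
# Sketch: detect a stale NVIDIA kernel module after a driver upgrade.
# Assumption: the kernel module is called "nvidia" and publishes its
# version in /sys/module/nvidia/version while it is loaded.

# Hypothetical helper: prints a verdict for a running/on-disk version pair.
compare_versions() {
    running="$1"
    ondisk="$2"
    if [ -z "$running" ]; then
        echo "nvidia module not loaded"
    elif [ "$running" = "$ondisk" ]; then
        echo "match: $running"
    else
        echo "mismatch: loaded $running, installed $ondisk"
    fi
}

# Version of the module currently loaded in the kernel (empty if not loaded).
loaded_version=$(cat /sys/module/nvidia/version 2>/dev/null)

# Version of the module file installed on disk for the running kernel.
installed_version=$(modinfo -F version nvidia 2>/dev/null)

compare_versions "$loaded_version" "$installed_version"
```

If this reports a mismatch, rebooting (as we did) is the simplest way to get the new module loaded before retrying the nv_peer_mem install.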