Unable to install nvidia-driver on Ubuntu 20.04 with V100 GPUs - "parse error in symbol dump file"

I have a set of V100 GPUs in a private data center. Previously, this system ran Ubuntu 18.04 with the NVIDIA drivers installed. After upgrading to Ubuntu 20.04, I am unable to install the NVIDIA drivers.

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Attempts to install different versions of the nvidia-driver package all fail the same way. The install fails with a "Bad return status for module build on kernel" error, which indicates a failed invocation of make(1). I’ve tried to apt remove the nvidia packages to get the system back to something like a clean state, and I’ve tried various nvidia-driver builds, including nvidia-driver-550 as well as the driver version shown here. I always get the same failure.

nvidia-bug-report.log.gz (233.8 KB)

Reading package lists... Done
Building dependency tree
Reading state information... Done
nvidia-driver-535 is already the newest version (535.161.08-0ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 52 not upgraded.
2 not fully installed or removed.
After this operation, 0 B of additional disk space will be used.
Do you want to continue? [Y/n] y
Setting up nvidia-dkms-535 (535.161.08-0ubuntu1) ...
update-initramfs: deferring update (trigger activated)

A modprobe blacklist file has been created at /etc/modprobe.d to prevent Nouveau
from loading. This can be reverted by deleting the following file:
/etc/modprobe.d/nvidia-graphics-drivers.conf

A new initrd image has also been created. To revert, please regenerate your
initrd by running the following command after deleting the modprobe.d file:
`/usr/sbin/initramfs -u`

*****************************************************************************
*** Reboot your computer and verify that the NVIDIA graphics driver can   ***
*** be loaded.                                                            ***
*****************************************************************************

INFO:Enable nvidia
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/dell_latitude
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/lenovo_thinkpad
DEBUG:Parsing /usr/share/ubuntu-drivers-common/quirks/put_your_quirks_here
Removing old nvidia-535.161.08 DKMS files...

------------------------------
Deleting module version: 535.161.08
completely from the DKMS tree.
------------------------------
Done.
Loading new nvidia-535.161.08 DKMS files...
Building for 5.15.0-101-generic
Building for architecture x86_64
Building initial module for 5.15.0-101-generic
Error! Bad return status for module build on kernel: 5.15.0-101-generic (x86_64)
Consult /var/lib/dkms/nvidia/535.161.08/build/make.log for more information.
dpkg: error processing package nvidia-dkms-535 (--configure):
 installed nvidia-dkms-535 package post-installation script subprocess returned error exit status 10
dpkg: dependency problems prevent configuration of nvidia-driver-535:
 nvidia-driver-535 depends on nvidia-dkms-535 (= 535.161.08-0ubuntu1); however:
  Package nvidia-dkms-535 is not configured yet.

dpkg: error processing package nvidia-driver-535 (--configure):
 dependency problems - leaving unconfigured
No apport report written because the error message indicates its a followup error from a previous failure.
Processing triggers for initramfs-tools (0.136ubuntu6.7) ...
update-initramfs: Generating /boot/initrd.img-5.15.0-101-generic
Errors were encountered while processing:
 nvidia-dkms-535
 nvidia-driver-535
E: Sub-process /usr/bin/dpkg returned an error code (1)

Looking into the make.log file, I always find the error at the end of the log:

make -f ./scripts/Makefile.modpost
  sed 's/\.ko$/\.o/' /var/lib/dkms/nvidia/535.161.08/build/modules.order | scripts/mod/modpost -m -a  -o /var/lib/dkms/nvidia/535.161.08/build/Module.symvers -e -i Module.symvers -i /usr/src/ofa_kernel/default/Module.symvers   -T -
FATAL: modpost: parse error in symbol dump file
make[2]: *** [scripts/Makefile.modpost:133: /var/lib/dkms/nvidia/535.161.08/build/Module.symvers] Error 1
make[1]: *** [Makefile:1830: modules] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-5.15.0-101-generic'
make: *** [Makefile:82: modules] Error 2

Here are the GPUs in this system:

$ lspci -nnk | grep -i nvid
1f:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1db5] (rev a1)
        Subsystem: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1249]
        Kernel modules: nvidiafb, nouveau
65:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1db5] (rev a1)
        Subsystem: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1249]
        Kernel modules: nvidiafb, nouveau
b6:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1db5] (rev a1)
        Subsystem: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1249]
        Kernel modules: nvidiafb, nouveau
df:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1db5] (rev a1)
        Subsystem: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] [10de:1249]
        Kernel modules: nvidiafb, nouveau

Did you manage to fix it? I encountered the same problem on Ubuntu 20.04.

Haven’t found a fix to this yet.

I’ve had a similar failure with driver version 470 failing the modpost step under kernels 5.15 and 6.5. As in your case, the nodes I inherited in our datacentre had OFED drivers installed from ancient times (note the ofa_kernel include in the modpost command). Moving /usr/src/ofa_kernel out of the way, and afterwards finding the offending Mellanox OFED driver package and properly removing it, worked for me. In the end, the Module.symvers provided by the OFED drivers was incompatible with any of the kernels I was running. I know this is very late, but I hope it is helpful.
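
For anyone landing here later, here is a rough sketch of how one might confirm the same cause before moving anything. The two helper names are made up for illustration. The five-field check is an assumption about what modpost in 5.x kernels expects (CRC, symbol, module, export type, namespace, separated by tabs); I have not checked it against every kernel version, so treat a non-zero count as a hint rather than proof.

```shell
#!/bin/sh
# Sketch only: both helper names below are hypothetical, not part of DKMS or the driver.

# Print every Module.symvers path that modpost was handed via "-i" in a DKMS make.log.
list_symvers_inputs() {
    awk '/scripts\/mod\/modpost/ {
        for (i = 1; i < NF; i++) if ($i == "-i") print $(i + 1)
    }' "$1" | sort -u
}

# Count lines that have fewer than five tab-separated fields.
# Assumption: a symbol dump written for a much older kernel lacks the last field.
count_short_symvers_lines() {
    awk -F '\t' 'NF < 5 { n++ } END { print n + 0 }' "$1"
}

# Example use, with the paths from this thread:
#   list_symvers_inputs /var/lib/dkms/nvidia/535.161.08/build/make.log
#   count_short_symvers_lines /usr/src/ofa_kernel/default/Module.symvers
#   dpkg -S /usr/src/ofa_kernel     # shows which installed package owns the directory
```

If the extra Module.symvers turns out to be the stale one, removing the owning package (or moving the directory aside, as described above) and then re-running `sudo dpkg --configure -a` should retry the DKMS build.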