Hello all,
I am trying to install the GPU driver for an A100 40GB in a VMware VM. GPU passthrough appears to be working, since lshw -c display shows the following:
[root@localhost ~]# lshw -c display
*-display
description: VGA compatible controller
product: SVGA II Adapter
vendor: VMware
physical id: f
bus info: pci@0000:00:0f.0
version: 00
width: 32 bits
clock: 33MHz
capabilities: vga_controller bus_master cap_list rom
configuration: driver=vmwgfx latency=64
resources: irq:16 ioport:1070(size=16) memory:e8000000-efffffff memory:fe000000-fe7fffff memory:c0400000-c0407fff
*-display
description: 3D controller
product: GA100 [GRID A100 PCIe 40GB]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:13:00.0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm bus_master cap_list
configuration: driver=nvidia latency=248
resources: irq:16 memory:fc000000-fcffffff memory:e4000000-e5ffffff
Running lspci -v | grep -i nvidia shows:
[root@localhost ~]# lspci -v | grep -i nvidia
13:00.0 3D controller: NVIDIA Corporation GA100 [GRID A100 PCIe 40GB] (rev a1)
Subsystem: NVIDIA Corporation Device 145f
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia
The following entries are from /var/log/messages:
Aug 26 15:03:56 localhost kernel: resource sanity check: requesting [mem 0xfc700000-0xfd6fffff], which spans more than PCI Bus 0000:13 [mem 0xfc000000-0xfcffffff]
Aug 26 15:03:56 localhost kernel: caller os_map_kernel_space.part.6+0xbe/0xc0 [nvidia] mapping multiple BARs
Aug 26 15:03:56 localhost kernel: NVRM: GPU 0000:13:00.0: RmInitAdapter failed! (0x24:0xffff:1209)
Aug 26 15:03:56 localhost kernel: NVRM: GPU 0000:13:00.0: rm_init_adapter failed, device minor number 0
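To make the first log line concrete, here is a minimal sketch that compares the requested memory range with the PCI bus window, using the addresses from the log above. The function names range_fits_window and overrun_mib are made up for illustration. It only does the arithmetic and does not explain why the driver asks for that range.

```shell
# Print "fits" if [req_start, req_end] lies inside [win_start, win_end], else "spans".
# Arguments are hex addresses as printed in the kernel log.
range_fits_window() {
    req_start=$(( $1 )); req_end=$(( $2 ))
    win_start=$(( $3 )); win_end=$(( $4 ))
    if [ "$req_start" -ge "$win_start" ] && [ "$req_end" -le "$win_end" ]; then
        echo "fits"
    else
        echo "spans"
    fi
}

# Print how many MiB the requested end address lies past the window end.
overrun_mib() {
    echo $(( ( ($1) - ($2) ) / 1048576 ))
}

# Requested mapping vs. the window of PCI Bus 0000:13, both taken from the log.
range_fits_window 0xfc700000 0xfd6fffff 0xfc000000 0xfcffffff   # prints "spans"
overrun_mib 0xfd6fffff 0xfcffffff                               # prints 7
```

So the driver requests a 16 MiB mapping that starts 7 MiB into a 16 MiB window and therefore runs 7 MiB past its end, which matches the "spans more than PCI Bus 0000:13" message.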
I first tried installing CUDA Toolkit 12.6 on RHEL 8.4. The installation went smoothly and nvcc --version prints version information, but nvidia-smi reports “No devices were found”.
I tried blacklisting nouveau, but the result was the same.
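For anyone reproducing this, a quick way to confirm that the nouveau blacklist took effect is to check whether the module is still loaded. The sketch below is a hypothetical helper (nouveau_state is a made-up name) that reads lsmod-style output on stdin, so it can also be run against saved output.

```shell
# Read lsmod-style output on stdin and report whether a module named
# exactly "nouveau" appears in the first column.
nouveau_state() {
    if awk '$1 == "nouveau" { found = 1 } END { exit !found }'; then
        echo "loaded"
    else
        echo "not loaded"
    fi
}

# Check the running system (prints "not loaded" if lsmod is unavailable).
lsmod 2>/dev/null | nouveau_state
```

Matching on the first column avoids a false positive from modules that merely list nouveau in their "Used by" column.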
Then I thought it might be a kernel version mismatch (4.18-553), so I switched to another VMware VM running RHEL 7.9 with kernel 3.10-1160 and installed only the driver, but the result was the same.
I also tried installing DKMS, updating GCC, and running chmod 777 on the driver, but none of these worked.
I tried the drivers for both 12.6 and 11.4.
The following logs might be useful. What should I do now?
nvidia-bug-report.log.gz (585.1 KB)
nvidia-installer.log (279.8 KB)