RTX 4080 Fans Will Not Spin

I’ve done all I can to debug this issue, so I’m seeking support.
I’m running an RTX 4080 under ESXi, passed through via PCIe to an Ubuntu Server VM.

  • CPU - EPYC 7302P
  • Motherboard - Supermicro MBD-H12SSL-NT-O

It was a major pain to get the drivers to work in the first place, but I eventually got them “working”:

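Thu Apr 13 20:05:21 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:0B:00.0 Off |                  N/A |
|ERR!   28C    P0    32W / 320W |      1MiB / 16376MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+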
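
The ERR! in the Fan column can also be checked in machine-readable form, which is easier to script and compare between attempts. The following is a minimal sketch, assuming the `fan.speed`, `temperature.gpu`, and `power.draw` query fields are available on this driver; the helper names `fan_field_ok`, `query_gpu`, and `check_fan` are made up for illustration, and the exact text the driver prints when the fan cannot be read is not assumed (anything that is not a percentage is treated as unreadable).

```shell
# Returns 0 if a fan.speed field looks like a percentage (e.g. "30 %"),
# 1 for anything else (error or not-available markers, empty string).
fan_field_ok() {
    # Drop spaces and one trailing '%', then require a non-empty run of digits.
    v=$(printf '%s' "$1" | tr -d ' ')
    v=${v%\%}
    case "$v" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Query fan, temperature, and power for one GPU as a single CSV line.
# $1: GPU index (defaults to 0)
query_gpu() {
    nvidia-smi --query-gpu=fan.speed,temperature.gpu,power.draw \
               --format=csv,noheader --id="${1:-0}"
}

# Report whether the driver can read the fan speed for GPU 0.
check_fan() {
    fan=$(query_gpu 0 | cut -d',' -f1)
    if fan_field_ok "$fan"; then
        echo "fan readable: $fan"
    else
        echo "fan not readable: $fan"
    fi
}
```

Running `check_fan` inside the VM before and after each change gives a quick yes/no on whether the driver can see the fan at all.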

This was installed with the open-source kernel module. With the proprietary kernel module, the GPU was never recognized.

cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX Open Kernel Module for x86_64  525.89.02  Release Build  (dvs-builder@U16-F11-34-6)  Wed Feb  1 23:19:51 UTC 2023
GCC version:  gcc version 11.3.0 (Ubuntu 11.3.0-1ubuntu1~22.04)

nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

It was installed with the following options:

cat /etc/modprobe.d/blacklist-nvidia-nouveau.conf

blacklist nouveau
options nouveau modeset=0

cat /etc/modprobe.d/nvidia.conf

options nvidia NVreg_OpenRmEnableUnsupportedGpus=1
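
A modprobe option only matters if the loaded module actually picked it up, so it can be worth confirming that at runtime. This is a rough sketch, assuming the driver exposes its parameters in `/proc/driver/nvidia/params` as `Name: value` lines without the `NVreg_` prefix; the helper name `nvidia_param` is hypothetical.

```shell
# Print the value the loaded nvidia module reports for one parameter.
# $1: parameter name without the NVreg_ prefix
# $2: params file (defaults to the driver's procfs entry)
nvidia_param() {
    file=${2:-/proc/driver/nvidia/params}
    # Split each line on ": " and print the value for the matching name.
    awk -F': *' -v key="$1" '$1 == key { print $2 }' "$file"
}

# Example (run inside the VM with the module loaded):
#   nvidia_param OpenRmEnableUnsupportedGpus
```

If the value does not match the file in /etc/modprobe.d, one possible cause is that the module is loaded from the initramfs, in which case the initramfs may need to be regenerated before the option takes effect.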

The ESXi VM has the following advanced options set:

pciPassthru.use64bitMMIO = TRUE
pciPassthru.64bitMMIOSizeGB = 64
pciPassthru.msiEnabled = FALSE
hypervisor.cpuid.v0 = FALSE

I’m able to run programs that use the GPU, but the GPU fans refuse to spin, as the ERR! in the Fan column of the nvidia-smi output above shows. So far I’ve tried:

  • Spoofing Xorg with various versions of coolgpus, making the modifications needed to get it to run. It always errors out when setting the fan speed, with an “Unknown Error”.
  • Controlling the fans through IPMI, both with superfans-gpu-controller and manually. This does not work either. FANA, which I’m guessing is the peripheral (GPU) fan, stays at 0 RPM even with manual raw commands and with the IPMI GUI set to full speed.
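
For anyone trying to reproduce the two attempts above by hand, here is a small sketch. It assumes an X server is reachable on display `:0` (which is what coolgpus sets up), that `nvidia-settings` and `ipmitool` are installed, and that `ipmitool sdr type Fan` prints pipe-separated lines with the reading in the fifth column. The helper names `set_fan_speed` and `fan_rpm` are made up for illustration.

```shell
# Request manual fan control and a target speed through nvidia-settings.
# $1: target speed in percent, $2: X display (defaults to :0)
set_fan_speed() {
    nvidia-settings -c "${2:-:0}" \
        -a "[gpu:0]/GPUFanControlState=1" \
        -a "[fan:0]/GPUTargetFanSpeed=$1"
}

# Extract the RPM reading for one sensor from `ipmitool sdr type Fan` output.
# $1: sensor name (e.g. FANA); the sensor listing is read from stdin.
# Prints the number, or "n/a" if the sensor has no numeric reading.
fan_rpm() {
    awk -F'|' -v name="$1" '
        { gsub(/^[ \t]+|[ \t]+$/, "", $1) }   # trim the sensor name column
        $1 == name {
            split($5, f, " ")                  # e.g. " 0 RPM" -> f[1] = "0"
            if (f[1] ~ /^[0-9]+$/) print f[1]; else print "n/a"
        }
    '
}

# Examples:
#   set_fan_speed 60
#   ipmitool sdr type Fan | fan_rpm FANA
```

One thing this makes easy to check: if the card's fans are wired to the card itself rather than to the FANA header, a 0 RPM reading on FANA would be expected regardless of the IPMI settings, and only the nvidia-settings path would apply.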

The GPU fans do spin when the VM is off. I’m out of ideas on how to fix the fan issue. I’ve read pretty much every topic related to driver errors and ESXi, so I’m looking for some additional support. Next I will retry the installation on Ubuntu Desktop in ESXi.
nvidia-bug-report.log.gz (495.7 KB)