NVIDIA L40S keeps dropping from nvidia-smi

I built a server for work so we can run a Triton Inference Server, but after weeks of troubleshooting I’m stumped… Here are the specs of the server:

CPU: AMD Ryzen Threadripper 7970X
MOBO: ASUS Pro WS TRX50 SAGE WIFI BIOS Version 1106
GPU: NVIDIA L40S, NVIDIA P1000 (used for display only)
PSU: Corsair AX 1600i
RAM: Crucial 192GB DDR5 6000
OS: Ubuntu 24.04.1 LTS

I have the most recent driver installed, 575.57.08 (I’ve tried other driver versions as well), along with CUDA Toolkit 12.9. I have nouveau blacklisted and Secure Boot disabled. In the BIOS I have Resizable BAR enabled and IOMMU set to enabled. When the machine first boots, nvidia-smi shows the L40S, but it very quickly drops and disappears. When I run lspci, the L40S still shows up as a 3D controller, so it’s still detected by the system. I have power management in the OS set to performance and Persistence Mode set to on for the L40S. I’ve tried different PCIe slots, with and without the P1000, and changing various BIOS settings. I don’t believe this is a hardware issue, but I’m open to suggestions. I can wipe and redo anything I need to on this machine, so let me know. This is my output for dmesg | grep -i nvidia:

itadmin@BETA:~$ sudo dmesg | grep -i nvidia
[sudo] password for itadmin:
[ 3.599791] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:80/0000:80:01.3/0000:82:00.1/sound/card0/input9
[ 3.599862] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:80/0000:80:01.3/0000:82:00.1/sound/card0/input10
[ 3.599921] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:80/0000:80:01.3/0000:82:00.1/sound/card0/input11
[ 3.599996] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:80/0000:80:01.3/0000:82:00.1/sound/card0/input12
[ 3.835444] nvidia: loading out-of-tree module taints kernel.
[ 3.835452] nvidia: module license 'NVIDIA' taints kernel.
[ 3.835456] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 3.835457] nvidia: module license taints kernel.
[ 3.995438] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[ 4.013438] nvidia 0000:82:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
[ 4.232520] nvidia 0000:41:00.0: enabling device (0000 -> 0002)
[ 4.284554] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 575.57.08 Sat May 24 07:21:16 UTC 2025
[ 4.303175] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 575.57.08 Sat May 24 06:52:56 UTC 2025
[ 4.306044] [drm] [nvidia-drm] [GPU ID 0x00008200] Loading driver
[ 4.934610] [drm] Initialized nvidia-drm 0.0.0 for 0000:82:00.0 on minor 1
[ 4.964757] nvidia 0000:82:00.0: vgaarb: deactivate vga console
[ 6.300675] [drm] [nvidia-drm] [GPU ID 0x00004100] Loading driver
[ 6.309250] [drm] Initialized nvidia-drm 0.0.0 for 0000:41:00.0 on minor 2
[ 12.306111] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[ 244.016491] WARNING: CPU: 20 PID: 4085 at /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv.c:5027 nvidia_dev_put+0xb1/0xc0 [nvidia]
[ 244.016669] Modules linked in: rfcomm cmac algif_hash algif_skcipher af_alg nf_conntrack_netlink xt_nat veth nvidia_uvm(POE) xt_MASQUERADE bridge stp llc xt_set ip_set nft_chain_nat nf_nat xfrm_user xfrm_algo ccm snd_seq_dummy snd_hrtimer overlay qrtr bnep ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nvidi_drm(POE) nvidia_modeset(POE) nf_tables libcrc32c binfmt_misc nls_iso8859_1 nvidia(POE) amd_atl intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd mt7925e mt7925_common mt792x_lib snd_hda_codec_realtek kvm_amd mt76_connac_lib snd_hda_codec_generic mt76 snd_hda_codec_hdmi snd_hda_scodec_component kvm crct10dif_pclmul snd_hda_intel polyval_clmulni mac80211 snd_intel_dspcfg snd_seq_midi polyval_generic btusb snd_intel_sdw_acpi ghash_clmulni_intel btrtl snd_seq_midi_event sha256_ssse3 btintel snd_hda_codec snd_rawmidi sha1_ssse3 btbcm aesni_intel snd_hda_core btmtk
[ 244.016784] RIP: 0010:nvidia_dev_put+0xb1/0xc0 [nvidia]
[ 244.016983] nvidia_close+0x1ab/0x280 [nvidia]
[ 244.017238] WARNING: CPU: 16 PID: 3855 at /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv.c:5027 nvidia_dev_put+0xb1/0xc0 [nvidia]
[ 244.017402] Modules linked in: rfcomm cmac algif_hash algif_skcipher af_alg nf_conntrack_netlink xt_nat veth nvidia_uvm(POE) xt_MASQUERADE bridge stp llc xt_set ip_set nft_chain_nat nf_nat xfrm_user xfrm_algo ccm snd_seq_dummy snd_hrtimer overlay qrtr bnep ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nvidi_drm(POE) nvidia_modeset(POE) nf_tables libcrc32c binfmt_misc nls_iso8859_1 nvidia(POE) amd_atl intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd mt7925e mt7925_common mt792x_lib snd_hda_codec_realtek kvm_amd mt76_connac_lib snd_hda_codec_generic mt76 snd_hda_codec_hdmi snd_hda_scodec_component kvm crct10dif_pclmul snd_hda_intel polyval_clmulni mac80211 snd_intel_dspcfg snd_seq_midi polyval_generic btusb snd_intel_sdw_acpi ghash_clmulni_intel btrtl snd_seq_midi_event sha256_ssse3 btintel snd_hda_codec snd_rawmidi sha1_ssse3 btbcm aesni_intel snd_hda_core btmtk
[ 244.017497] RIP: 0010:nvidia_dev_put+0xb1/0xc0 [nvidia]
[ 244.017676] nvidia_close+0x1ab/0x280 [nvidia]
[ 244.018152] WARNING: CPU: 7 PID: 4085 at /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv.c:5027 nvidia_dev_put+0xb1/0xc0 [nvidia]
[ 244.018359] Modules linked in: rfcomm cmac algif_hash algif_skcipher af_alg nf_conntrack_netlink xt_nat veth nvidia_uvm(POE) xt_MASQUERADE bridge stp llc xt_set ip_set nft_chain_nat nf_nat xfrm_user xfrm_algo ccm snd_seq_dummy snd_hrtimer overlay qrtr bnep ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nvidi_drm(POE) nvidia_modeset(POE) nf_tables libcrc32c binfmt_misc nls_iso8859_1 nvidia(POE) amd_atl intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd mt7925e mt7925_common mt792x_lib snd_hda_codec_realtek kvm_amd mt76_connac_lib snd_hda_codec_generic mt76 snd_hda_codec_hdmi snd_hda_scodec_component kvm crct10dif_pclmul snd_hda_intel polyval_clmulni mac80211 snd_intel_dspcfg snd_seq_midi polyval_generic btusb snd_intel_sdw_acpi ghash_clmulni_intel btrtl snd_seq_midi_event sha256_ssse3 btintel snd_hda_codec snd_rawmidi sha1_ssse3 btbcm aesni_intel snd_hda_core btmtk
[ 244.018475] RIP: 0010:nvidia_dev_put+0xb1/0xc0 [nvidia]
[ 244.018689] nvidia_close+0x1ab/0x280 [nvidia]
[ 244.018943] WARNING: CPU: 6 PID: 3855 at /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv.c:5027 nvidia_dev_put+0xb1/0xc0 [nvidia]
[ 244.019120] Modules linked in: rfcomm cmac algif_hash algif_skcipher af_alg nf_conntrack_netlink xt_nat veth nvidia_uvm(POE) xt_MASQUERADE bridge stp llc xt_set ip_set nft_chain_nat nf_nat xfrm_user xfrm_algo ccm snd_seq_dummy snd_hrtimer overlay qrtr bnep ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nvidi_drm(POE) nvidia_modeset(POE) nf_tables libcrc32c binfmt_misc nls_iso8859_1 nvidia(POE) amd_atl intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd mt7925e mt7925_common mt792x_lib snd_hda_codec_realtek kvm_amd mt76_connac_lib snd_hda_codec_generic mt76 snd_hda_codec_hdmi snd_hda_scodec_component kvm crct10dif_pclmul snd_hda_intel polyval_clmulni mac80211 snd_intel_dspcfg snd_seq_midi polyval_generic btusb snd_intel_sdw_acpi ghash_clmulni_intel btrtl snd_seq_midi_event sha256_ssse3 btintel snd_hda_codec snd_rawmidi sha1_ssse3 btbcm aesni_intel snd_hda_core btmtk
[ 244.019233] RIP: 0010:nvidia_dev_put+0xb1/0xc0 [nvidia]
[ 244.019413] nvidia_close+0x1ab/0x280 [nvidia]
[ 244.022000] WARNING: CPU: 2 PID: 4416 at /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv.c:5027 nvidia_dev_put+0xb1/0xc0 [nvidia]
[ 244.022165] Modules linked in: rfcomm cmac algif_hash algif_skcipher af_alg nf_conntrack_netlink xt_nat veth nvidia_uvm(POE) xt_MASQUERADE bridge stp llc xt_set ip_set nft_chain_nat nf_nat xfrm_user xfrm_algo ccm snd_seq_dummy snd_hrtimer overlay qrtr bnep ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nvidi_drm(POE) nvidia_modeset(POE) nf_tables libcrc32c binfmt_misc nls_iso8859_1 nvidia(POE) amd_atl intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd mt7925e mt7925_common mt792x_lib snd_hda_codec_realtek kvm_amd mt76_connac_lib snd_hda_codec_generic mt76 snd_hda_codec_hdmi snd_hda_scodec_component kvm crct10dif_pclmul snd_hda_intel polyval_clmulni mac80211 snd_intel_dspcfg snd_seq_midi polyval_generic btusb snd_intel_sdw_acpi ghash_clmulni_intel btrtl snd_seq_midi_event sha256_ssse3 btintel snd_hda_codec snd_rawmidi sha1_ssse3 btbcm aesni_intel snd_hda_core btmtk
[ 244.022256] RIP: 0010:nvidia_dev_put+0xb1/0xc0 [nvidia]
[ 244.022436] nvidia_close+0x1ab/0x280 [nvidia]
[ 244.022825] WARNING: CPU: 2 PID: 4416 at /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv.c:5027 nvidia_dev_put+0xb1/0xc0 [nvidia]
[ 244.022999] Modules linked in: rfcomm cmac algif_hash algif_skcipher af_alg nf_conntrack_netlink xt_nat veth nvidia_uvm(POE) xt_MASQUERADE bridge stp llc xt_set ip_set nft_chain_nat nf_nat xfrm_user xfrm_algo ccm snd_seq_dummy snd_hrtimer overlay qrtr bnep ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nvidi_drm(POE) nvidia_modeset(POE) nf_tables libcrc32c binfmt_misc nls_iso8859_1 nvidia(POE) amd_atl intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd mt7925e mt7925_common mt792x_lib snd_hda_codec_realtek kvm_amd mt76_connac_lib snd_hda_codec_generic mt76 snd_hda_codec_hdmi snd_hda_scodec_component kvm crct10dif_pclmul snd_hda_intel polyval_clmulni mac80211 snd_intel_dspcfg snd_seq_midi polyval_generic btusb snd_intel_sdw_acpi ghash_clmulni_intel btrtl snd_seq_midi_event sha256_ssse3 btintel snd_hda_codec snd_rawmidi sha1_ssse3 btbcm aesni_intel snd_hda_core btmtk
[ 244.023102] RIP: 0010:nvidia_dev_put+0xb1/0xc0 [nvidia]
[ 244.023284] nvidia_close+0x1ab/0x280 [nvidia]
[ 244.165087] WARNING: CPU: 36 PID: 3636 at /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv.c:5027 nvidia_dev_put+0xb1/0xc0 [nvidia]
[ 244.165296] Modules linked in: rfcomm cmac algif_hash algif_skcipher af_alg nf_conntrack_netlink xt_nat veth nvidia_uvm(POE) xt_MASQUERADE bridge stp llc xt_set ip_set nft_chain_nat nf_nat xfrm_user xfrm_algo ccm snd_seq_dummy snd_hrtimer overlay qrtr bnep ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nvidi_drm(POE) nvidia_modeset(POE) nf_tables libcrc32c binfmt_misc nls_iso8859_1 nvidia(POE) amd_atl intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd mt7925e mt7925_common mt792x_lib snd_hda_codec_realtek kvm_amd mt76_connac_lib snd_hda_codec_generic mt76 snd_hda_codec_hdmi snd_hda_scodec_component kvm crct10dif_pclmul snd_hda_intel polyval_clmulni mac80211 snd_intel_dspcfg snd_seq_midi polyval_generic btusb snd_intel_sdw_acpi ghash_clmulni_intel btrtl snd_seq_midi_event sha256_ssse3 btintel snd_hda_codec snd_rawmidi sha1_ssse3 btbcm aesni_intel snd_hda_core btmtk
[ 244.165428] RIP: 0010:nvidia_dev_put+0xb1/0xc0 [nvidia]
[ 244.165631] nvidia_close+0x1ab/0x280 [nvidia]
[ 244.202993] WARNING: CPU: 36 PID: 3803 at /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv.c:5027 nvidia_dev_put+0xb1/0xc0 [nvidia]
[ 244.203165] Modules linked in: rfcomm cmac algif_hash algif_skcipher af_alg nf_conntrack_netlink xt_nat veth nvidia_uvm(POE) xt_MASQUERADE bridge stp llc xt_set ip_set nft_chain_nat nf_nat xfrm_user xfrm_algo ccm snd_seq_dummy snd_hrtimer overlay qrtr bnep ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nvidi_drm(POE) nvidia_modeset(POE) nf_tables libcrc32c binfmt_misc nls_iso8859_1 nvidia(POE) amd_atl intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd mt7925e mt7925_common mt792x_lib snd_hda_codec_realtek kvm_amd mt76_connac_lib snd_hda_codec_generic mt76 snd_hda_codec_hdmi snd_hda_scodec_component kvm crct10dif_pclmul snd_hda_intel polyval_clmulni mac80211 snd_intel_dspcfg snd_seq_midi polyval_generic btusb snd_intel_sdw_acpi ghash_clmulni_intel btrtl snd_seq_midi_event sha256_ssse3 btintel snd_hda_codec snd_rawmidi sha1_ssse3 btbcm aesni_intel snd_hda_core btmtk
[ 244.203277] RIP: 0010:nvidia_dev_put+0xb1/0xc0 [nvidia]
[ 244.203458] nvidia_close+0x1ab/0x280 [nvidia]
[ 268.768680] WARNING: CPU: 0 PID: 7081 at /var/lib/dkms/nvidia/575.57.08/build/nvidia/nv.c:5103 nvidia_dev_put_uuid+0x55/0x60 [nvidia]
[ 268.768848] Modules linked in: rfcomm cmac algif_hash algif_skcipher af_alg nf_conntrack_netlink xt_nat veth nvidia_uvm(POE) xt_MASQUERADE bridge stp llc xt_set ip_set nft_chain_nat nf_nat xfrm_user xfrm_algo ccm snd_seq_dummy snd_hrtimer overlay qrtr bnep ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nvidi_drm(POE) nvidia_modeset(POE) nf_tables libcrc32c binfmt_misc nls_iso8859_1 nvidia(POE) amd_atl intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd mt7925e mt7925_common mt792x_lib snd_hda_codec_realtek kvm_amd mt76_connac_lib snd_hda_codec_generic mt76 snd_hda_codec_hdmi snd_hda_scodec_component kvm crct10dif_pclmul snd_hda_intel polyval_clmulni mac80211 snd_intel_dspcfg snd_seq_midi polyval_generic btusb snd_intel_sdw_acpi ghash_clmulni_intel btrtl snd_seq_midi_event sha256_ssse3 btintel snd_hda_codec snd_rawmidi sha1_ssse3 btbcm aesni_intel snd_hda_core btmtk
[ 268.768953] RIP: 0010:nvidia_dev_put_uuid+0x55/0x60 [nvidia]
[ 268.769126] nvUvmInterfaceUnregisterGpu+0x2d/0x90 [nvidia]
[ 268.769276] uvm_gpu_release_locked+0x6d/0x70 [nvidia_uvm]
[ 268.769295] uvm_va_space_destroy+0x5f0/0x7a0 [nvidia_uvm]
[ 268.769312] uvm_release.isra.0+0x83/0x170 [nvidia_uvm]
[ 268.769327] uvm_release_entry.part.0.isra.0+0x54/0xa0 [nvidia_uvm]
[ 268.769344] uvm_release_entry+0x2d/0x40 [nvidia_uvm]
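One thing worth noting about the log above: `grep -i nvidia` only keeps lines that contain the word "nvidia", so driver messages prefixed with `NVRM:` that never mention it (Xid event lines, for instance) can be filtered out. The two untimestamped `NVRM:` lines look like the tail of a longer message whose start was dropped this way. Below is a rough Python sketch of a wider filter. The function names and the sample Xid line are made up for illustration and are not taken from this machine's log.

```python
import re

# Keep anything from the driver: lines mentioning "nvidia" or prefixed "NVRM".
GPU_LINE = re.compile(r"nvidia|nvrm", re.IGNORECASE)
# Xid events are usually logged as "NVRM: Xid (PCI:<address>): <code>, ...".
XID = re.compile(r"Xid \((?P<dev>[^)]*)\): (?P<code>\d+)")


def filter_gpu_lines(lines):
    """Return the kernel log lines that come from the NVIDIA driver."""
    return [ln for ln in lines if GPU_LINE.search(ln)]


def xid_events(lines):
    """Return (device, code) pairs for every Xid event found in the lines."""
    events = []
    for ln in lines:
        m = XID.search(ln)
        if m:
            events.append((m.group("dev"), int(m.group("code"))))
    return events


# Illustrative input only: the second line is a hypothetical Xid message.
sample = [
    "[ 3.835444] nvidia: loading out-of-tree module taints kernel.",
    "[ 100.000000] NVRM: Xid (PCI:0000:41:00): 79, GPU has fallen off the bus.",
    "[ 1.000000] usb 1-1: new high-speed USB device",
]
for line in filter_gpu_lines(sample):
    print(line)
print(xid_events(sample))
```

To run it against the real log, save the output of `sudo dmesg` to a file and pass its lines to `filter_gpu_lines`. Any Xid codes that turn up can be looked up in NVIDIA's Xid documentation.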

I figured out the issue was overheating. I didn’t suspect this at first because the logs didn’t point to overheating, and the card would overheat even while sitting idle, which is wild.
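For anyone who hits the same thing, logging the GPU temperature right after boot, before the card drops, would probably have pointed at this sooner. Here is a minimal Python sketch that polls `nvidia-smi` using its `--query-gpu` CSV output. The helper names and the 80 °C limit in the usage note are assumptions for illustration, not values from NVIDIA or from this build.

```python
import subprocess

# nvidia-smi query: one CSV row per GPU, no header, no unit suffixes.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_temps(csv_text):
    """Parse rows like '0, NVIDIA L40S, 45' into (index, name, temp_c) tuples.

    temp_c is None when the driver reports a non-numeric value such as [N/A].
    """
    readings = []
    for row in csv_text.strip().splitlines():
        index, rest = row.split(",", 1)
        name, temp = rest.rsplit(",", 1)
        temp = temp.strip()
        readings.append((int(index), name.strip(), int(temp) if temp.isdigit() else None))
    return readings


def hot_gpus(readings, limit_c):
    """Return the readings at or above limit_c (an assumed threshold)."""
    return [r for r in readings if r[2] is not None and r[2] >= limit_c]


def read_temps():
    """Run nvidia-smi once and return the parsed readings."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_temps(out)
```

Calling `hot_gpus(read_temps(), 80)` in a loop with a short sleep and appending the result to a file gives a temperature trail leading up to the drop. Once the card has dropped, `nvidia-smi` will likely stop reporting it, so the last logged readings are the ones to look at.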