Problems with A100 and Ubuntu 22.04

Hello,

I've tried to get an A100 (driver) running on Ubuntu 22.04 Desktop, with no luck.

My setup:
Mainboard: ASROCK ROMED8-2T with AMD EPYC 7H12 CPU. 120GB Reg ECC Memory.

What I did:
Installed and upgraded Ubuntu 22.04 Desktop.

A100 is recognized as PCIe device:

test@test-desktop:~$ lspci | grep NVIDIA
45:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)

Recognized by the kernel:

test@test-desktop:~$ sudo dmesg | grep nvidia
[sudo] password for test:
[ 5.051881] audit: type=1400 audit(1701790560.353:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1138 comm="apparmor_parser"
[ 5.051885] audit: type=1400 audit(1701790560.353:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1138 comm="apparmor_parser"

No NVIDIA kernel module is loaded so far:

test@test-desktop:~$ lsmod | grep nvidia
test@test-desktop:~$

Only the Nouveau driver is loaded:

test@test-desktop:~$ lsmod | grep nouveau
nouveau 2830336 0
mxm_wmi 16384 1 nouveau
drm_ttm_helper 16384 1 nouveau
ttm 110592 2 drm_ttm_helper,nouveau
drm_display_helper 212992 1 nouveau
drm_kms_helper 249856 5 ast,drm_display_helper,nouveau
i2c_algo_bit 16384 2 ast,nouveau
video 73728 1 nouveau
wmi 40960 3 video,mxm_wmi,nouveau
drm 700416 9 drm_kms_helper,ast,drm_shmem_helper,drm_display_helper,drm_ttm_helper,ttm,nouveau
test@test-desktop:~$
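
The two lsmod checks above can also be scripted. The following is a minimal Python sketch, assuming the usual /proc/modules layout (the file lsmod reads), where the module name is the first whitespace-separated field. The function name loaded_gpu_modules and the list of module names are illustrative choices, not part of any NVIDIA tool.

```python
from __future__ import annotations

from pathlib import Path

# Modules of interest: the open-source nouveau driver and the NVIDIA modules.
GPU_MODULES = ("nouveau", "nvidia", "nvidia_modeset", "nvidia_drm", "nvidia_uvm")


def loaded_gpu_modules(proc_modules_text: str) -> list[str]:
    """Return the GPU-related modules found in /proc/modules-style text."""
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in GPU_MODULES if name in loaded]


if __name__ == "__main__":
    path = Path("/proc/modules")
    if path.exists():
        found = loaded_gpu_modules(path.read_text())
        print(found if found else "no nouveau/nvidia modules loaded")
```

If both nouveau and nvidia show up at the same time, that would be worth noting in a report, since only one of them should normally drive the card.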

Then I installed nvidia-driver-535 via the Additional Drivers section in Ubuntu.

Since then, dmesg reports the following:

[ 5.252953] nvidia: loading out-of-tree module taints kernel.
[ 5.252967] nvidia: module license 'NVIDIA' taints kernel.
[ 5.252968] Disabling lock debugging due to kernel taint
[ 5.258482] RAPL PMU: API unit is 2^-32 Joules, 1 fixed counters, 163840 ms ovfl timer
[ 5.258486] RAPL PMU: hw unit of domain package 2^-16 Joules
[ 5.265896] cryptd: max_cpu_qlen set to 1000
[ 5.273192] ipmi_si IPI0001:00: IPMI message handler: Found new BMC (man_id: 0x00c1d6, prod_id: 0x1000, dev_id: 0x20)
[ 5.276241] AVX2 version of gcm_enc/dec engaged.
[ 5.276511] AES CTR mode by8 optimization enabled
[ 5.311639] loop8: detected capacity change from 0 to 8
[ 5.411840] nvidia-nvlink: Nvlink Core is being initialized, major device number 234

[ 5.413378] nvidia 0000:45:00.0: enabling device (0000 -> 0002)
[ 5.460593] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.129.03 Thu Oct 19 18:56:32 UTC 2023
[ 5.511993] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 535.129.03 Thu Oct 19 18:42:12 UTC 2023
[ 5.528831] [drm] [nvidia-drm] [GPU ID 0x00004500] Loading driver
[ 5.534989] SVM: TSC scaling supported
[ 5.534993] kvm: Nested Virtualization enabled
[ 5.534995] SVM: kvm: Nested Paging enabled
[ 5.534996] SEV supported: 509 ASIDs
[ 5.535043] SVM: Virtual VMLOAD VMSAVE supported
[ 5.535044] SVM: Virtual GIF supported
[ 5.535044] SVM: LBR virtualization supported
[ 5.554977] MCE: In-kernel MCE decoding enabled.
[ 5.557821] EDAC amd64: MCT channel count: 8
[ 5.558002] EDAC MC0: Giving out device to module amd64_edac controller F17h_M30h: DEV 0000:00:18.3 (INTERRUPT)
[ 5.558004] EDAC amd64: F17h_M30h detected (node 0).
[ 5.558008] EDAC MC: UMC0 chip selects:
[ 5.558009] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.558010] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.558014] EDAC MC: UMC1 chip selects:
[ 5.558015] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.558016] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.558020] EDAC MC: UMC2 chip selects:
[ 5.558020] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.558021] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.558025] EDAC MC: UMC3 chip selects:
[ 5.558026] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.558027] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.558030] EDAC MC: UMC4 chip selects:
[ 5.558031] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.558032] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.558035] EDAC MC: UMC5 chip selects:
[ 5.558036] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.558037] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.558040] EDAC MC: UMC6 chip selects:
[ 5.558041] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.558042] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.558045] EDAC MC: UMC7 chip selects:
[ 5.558046] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.558047] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.558048] EDAC amd64: using x16 syndromes.
[ 5.558061] EDAC PCI0: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.0 (POLLED)
[ 5.558063] AMD64 EDAC driver v3.5.0
[ 5.564019] intel_rapl_common: Found RAPL domain package
[ 5.564021] intel_rapl_common: Found RAPL domain core
[ 5.576863] bnxt_en 0000:42:00.0 eno1np0: NIC Link is Up, 1000 Mbps full duplex, Flow control: none
[ 5.576869] bnxt_en 0000:42:00.0 eno1np0: EEE is not active
[ 5.576872] bnxt_en 0000:42:00.0 eno1np0: FEC autoneg off encoding: None
[ 5.634776] ipmi_si IPI0001:00: IPMI kcs interface initialized
[ 5.637594] ipmi_ssif: IPMI SSIF Interface driver
[ 6.774955] mlx5_core 0000:c1:00.0 enp193s0f0np0: Link down
[ 6.837380] NVRM: Xid (PCI:0000:45:00): 120, pid=‘’, name=, GSP task exception: illegal instruction (cause:0x2) @ pc:0x55ad2dc, task:1
[ 6.837391] NVRM: Reported by libos task:0 v2.0 [0] @ ts:1701791560
[ 6.837393] NVRM: RISC-V CSR State:
[ 6.837394] NVRM: mstatus:0x000000001e000000 mscratch:0x0000000000000000 mie:0x0000000000000880 mip:0x0000000000000000
[ 6.837396] NVRM: mepc:0x00000000055ad2dc mbadaddr:0x00000000056a2608 mcause:0x0000000000000002
[ 6.837397] NVRM: RISC-V GPR State:
[ 6.837399] NVRM: ra:0x00000000050a325c sp:0x00000000056a1d30 gp:0x0000000000000000 tp:0x0000000000000000
[ 6.837400] NVRM: a0:0x8000000000178190 a1:0x8000000000194c78 a2:0x0000000000000000 a3:0x0000000000c4cd90
[ 6.837401] NVRM: a4:0x179dfa6817bc9ce0 a5:0x00000000055ad2dc a6:0x000000000000004e a7:0x00000000041a54e8
[ 6.837403] NVRM: s0:0x00000000056a1dd0 s1:0x0000000000000000 s2:0x8000000000178190 s3:0x800000000017c894
[ 6.837404] NVRM: s4:0x8000000000194c78 s5:0x00000000056a1d38 s6:0x0000000000000000 s7:0x000000000417f000
[ 6.837406] NVRM: s8:0x800000000017c990 s9:0x0000000000000056 s10:0x0000000000000056 s11:0x80000000002a8cd0
[ 6.837407] NVRM: t0:0x0000000000000012 t1:0x000000000558f244 t2:0x000000000015c47b t3:0x8000000000099630
[ 6.837408] NVRM: t4:0x0000000000000033 t5:0x0000000000d8992d t6:0x0000000000000000
[ 6.837410] NVRM: Stack Trace:
[ 6.837411] NVRM: 0x00000000055ad2dc
[ 6.837412] NVRM: 0x00000000050c9e50
[ 6.837412] NVRM: 0x0000000004ba2ca8
[ 6.837413] NVRM: 0x0000000004b7621c
[ 6.837414] NVRM: 0x0000000005685d04
[ 6.837415] NVRM: PC Trace:
[ 6.837416] NVRM: 0x000000000568b064 0x000000000568d490 0x000000000568be08
[ 6.837418] NVRM: External I/O Register State:
[ 6.837419] NVRM: 0x00111360:0x00000000 0x00111364:0x00000000 0x00111368:0x00000000 0x0011136c:0x00000000
[ 6.837421] NVRM: 0x001112b4:0x00040000 0x001112b8:0x00000000 0x001112bc:0x00000000 0x00111344:0x11100000
[ 6.837422] NVRM: 0x00110008:0x00000010 0x0011010c:0x00000000 0x00110118:0x00011122 0x00110110:0x13f5d1e2
[ 6.837424] NVRM: 0x00110128:0x00000000 0x00110114:0x0000d320 0x0011011c:0x000003a0
[ 6.837425] NVRM: ------------[ end crash report ]------------
[ 6.837430] NVRM: GPU0 GSP RPC buffer contains function 4098 (GSP_RUN_CPU_SEQUENCER) and data 0x00000000000001ea 0x0000000000003fe2.
[ 6.837433] NVRM: GPU0 RPC history (CPU -> GSP):
[ 6.837434] NVRM: entry function data0 data1 ts_start ts_end duration actively_polling
[ 6.837435] NVRM: 0 73 SET_REGISTRY 0x0000000000000000 0x0000000000000000 0x00060bc5395ce801 0x0000000000000000 y
[ 6.837439] NVRM: -1 72 GSP_SET_SYSTEM_INFO 0x0000000000000000 0x0000000000000000 0x00060bc5395ce7fc 0x0000000000000000
[ 6.837442] NVRM: GPU0 RPC event history (CPU <- GSP):
[ 6.837443] NVRM: entry function data0 data1 ts_start ts_end duration during_incomplete_rpc
[ 6.837444] NVRM: 0 4098 GSP_RUN_CPU_SEQUENCER 0x00000000000001ea 0x0000000000003fe2 0x00060bc5396ba02e 0x00060bc5396bb483 5205us y
[ 6.837619] NVRM: Xid (PCI:0000:45:00): 140, pid=‘’, name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691462, LTC:0, MMU:0, PCIE:0
[ 6.838662] NVRM: GPU 0000:45:00.0: RmInitAdapter failed! (0x62:0x62:2393)
[ 6.839897] NVRM: GPU 0000:45:00.0: rm_init_adapter failed, device minor number 0
[ 6.840045] [drm:nv_drm_load [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00004500] Failed to allocate NvKmsKapiDevice
[ 6.840236] [drm:nv_drm_probe_devices [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00004500] Failed to register device
[ 6.969608] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 7.012611] nvidia-uvm: Loaded the UVM driver, major device number 510.
[ 7.257844] kauditd_printk_skb: 35 callbacks suppressed
[ 7.257847] audit: type=1400 audit(1701791560.565:44): apparmor="DENIED" operation="capable" class="cap" profile="/usr/lib/snapd/snap-confine" pid=1563 comm="snap-confine" capability=12 capname="net_admin"
[ 7.257858] audit: type=1400 audit(1701791560.565:45): apparmor="DENIED" operation="capable" class="cap" profile="/usr/lib/snapd/snap-confine" pid=1563 comm="snap-confine" capability=38 capname="perfmon"
[ 7.489475] mlx5_core 0000:c1:00.1 enp193s0f1np1: Link down
[ 7.501088] NVRM: GPU 0000:45:00.0: RmInitAdapter failed! (0x62:0x40:2393)
[ 7.501694] NVRM: GPU 0000:45:00.0: rm_init_adapter failed, device minor number 0
[ 7.579078] NVRM: GPU 0000:45:00.0: RmInitAdapter failed! (0x62:0x40:2393)
[ 7.579673] NVRM: GPU 0000:45:00.0: rm_init_adapter failed, device minor number 0
[ 7.658277] NVRM: GPU 0000:45:00.0: RmInitAdapter failed! (0x62:0x40:2393)
[ 7.658842] NVRM: GPU 0000:45:00.0: rm_init_adapter failed, device minor number 0
[ 7.735722] NVRM: GPU 0000:45:00.0: RmInitAdapter failed! (0x62:0x40:2393)
[ 7.736347] NVRM: GPU 0000:45:00.0: rm_init_adapter failed, device minor number 0
[ 9.084872] audit: type=1400 audit(1701791562.393:46): apparmor="DENIED" operation="capable" class="cap" profile="/usr/lib/snapd/snap-confine" pid=2223 comm="snap-confine" capability=12 capname="net_admin"
[ 9.084885] audit: type=1400 audit(1701791562.393:47): apparmor="DENIED" operation="capable" class="cap" profile="/usr/lib/snapd/snap-confine" pid=2223 comm="snap-confine" capability=38 capname="perfmon"
[ 9.964979] rfkill: input handler disabled

Any thoughts on this?

Any hint is highly appreciated!

Thanks in advance!

Thomas

Since the A100 has no fans of its own, external cooling is required. Is there enough airflow through the A100?
Is this a VM?

Hi generix!

Thanks for the quick reply!

The cooling provides more than twice the airflow required by the spec.
The OS (Ubuntu) runs on bare metal; no VM is involved.

Not looking good: the GSP firmware is failing, so either the firmware or the GPU is broken. Please try switching to the -open driver in Software & Updates.
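
To pull the relevant lines out of a long dmesg dump like the one above, a small script can help. This is a minimal Python sketch, assuming the Xid line format seen in this thread ("NVRM: Xid (PCI:...): <code>, ..."). The function names and the file name dmesg.txt are made up for illustration.

```python
import re
import sys

# Matches lines such as:
#   NVRM: Xid (PCI:0000:45:00): 120, pid=..., GSP task exception: ...
XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<code>\d+),")


def xid_events(dmesg_text):
    """Return (pci_address, xid_code) tuples in order of appearance."""
    return [(m.group("bus"), int(m.group("code"))) for m in XID_RE.finditer(dmesg_text)]


def count_init_failures(dmesg_text):
    """Count how often the driver reported a failed adapter initialisation."""
    return dmesg_text.count("RmInitAdapter failed!")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): sudo dmesg > dmesg.txt; python3 xid_scan.py dmesg.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        text = fh.read()
    for bus, code in xid_events(text):
        print(f"Xid {code} on PCI {bus}")
    print(f"RmInitAdapter failures: {count_init_failures(text)}")
```

Run against the first log in this thread, it would list Xid 120 and Xid 140 for 0000:45:00, which are the lines that point at the GSP firmware.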

I installed the open driver as suggested, but it doesn’t work either.

Kernel messages are:

[ 5.010737] nvidia: loading out-of-tree module taints kernel.
[ 5.015493] audit: type=1400 audit(1701869242.314:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libreoffice-xpdfimport" pid=1140 comm="apparmor_parser"
[ 5.015522] audit: type=1400 audit(1701869242.314:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libreoffice-senddoc" pid=1138 comm="apparmor_parser"
[ 5.015632] audit: type=1400 audit(1701869242.314:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lsb_release" pid=1131 comm="apparmor_parser"
[ 5.015796] audit: type=1400 audit(1701869242.314:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1132 comm="apparmor_parser"
[ 5.015802] audit: type=1400 audit(1701869242.314:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1132 comm="apparmor_parser"
[ 5.015840] audit: type=1400 audit(1701869242.314:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libreoffice-oosplash" pid=1137 comm="apparmor_parser"
[ 5.016187] audit: type=1400 audit(1701869242.314:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=1135 comm="apparmor_parser"
[ 5.065793] fbcon: astdrmfb (fb0) is primary device
[ 5.080192] nvidia-nvlink: Nvlink Core is being initialized, major device number 234

[ 5.084550] nvidia 0000:45:00.0: enabling device (0000 -> 0002)
[ 5.093824] Console: switching to colour frame buffer device 240x67
[ 5.120668] ast 0000:44:00.0: [drm] fb0: astdrmfb frame buffer device
[ 5.136583] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 535.129.03 Release Build (dvs-builder@U16-I3-B15-1-1) Thu Oct 19 18:54:01 UTC 2023
[ 5.148635] RAPL PMU: API unit is 2^-32 Joules, 1 fixed counters, 163840 ms ovfl timer
[ 5.148640] RAPL PMU: hw unit of domain package 2^-16 Joules
[ 5.152993] cryptd: max_cpu_qlen set to 1000
[ 5.161711] AVX2 version of gcm_enc/dec engaged.
[ 5.161839] AES CTR mode by8 optimization enabled
[ 5.221719] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64 535.129.03 Release Build (dvs-builder@U16-I3-B15-1-1) Thu Oct 19 18:46:10 UTC 2023
[ 5.311653] [drm] [nvidia-drm] [GPU ID 0x00004500] Loading driver
[ 5.330364] SVM: TSC scaling supported
[ 5.330366] kvm: Nested Virtualization enabled
[ 5.330366] SVM: kvm: Nested Paging enabled
[ 5.330368] SEV supported: 509 ASIDs
[ 5.330418] SVM: Virtual VMLOAD VMSAVE supported
[ 5.330419] SVM: Virtual GIF supported
[ 5.330419] SVM: LBR virtualization supported
[ 5.334716] ipmi_si IPI0001:00: IPMI message handler: Found new BMC (man_id: 0x00c1d6, prod_id: 0x1000, dev_id: 0x20)
[ 5.375704] MCE: In-kernel MCE decoding enabled.
[ 5.380868] EDAC amd64: MCT channel count: 8
[ 5.381000] EDAC MC0: Giving out device to module amd64_edac controller F17h_M30h: DEV 0000:00:18.3 (INTERRUPT)
[ 5.381002] EDAC amd64: F17h_M30h detected (node 0).
[ 5.381006] EDAC MC: UMC0 chip selects:
[ 5.381006] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.381008] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.381011] EDAC MC: UMC1 chip selects:
[ 5.381012] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.381013] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.381017] EDAC MC: UMC2 chip selects:
[ 5.381017] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.381018] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.381021] EDAC MC: UMC3 chip selects:
[ 5.381022] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.381023] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.381026] EDAC MC: UMC4 chip selects:
[ 5.381026] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.381027] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.381030] EDAC MC: UMC5 chip selects:
[ 5.381031] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.381032] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.381035] EDAC MC: UMC6 chip selects:
[ 5.381035] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.381036] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.381039] EDAC MC: UMC7 chip selects:
[ 5.381040] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.381041] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.381041] EDAC amd64: using x16 syndromes.
[ 5.381050] EDAC PCI0: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.0 (POLLED)
[ 5.381051] AMD64 EDAC driver v3.5.0
[ 5.390973] intel_rapl_common: Found RAPL domain package
[ 5.390975] intel_rapl_common: Found RAPL domain core
[ 5.505400] ipmi_si IPI0001:00: IPMI kcs interface initialized
[ 5.507675] ipmi_ssif: IPMI SSIF Interface driver
[ 5.781232] bnxt_en 0000:42:00.0 eno1np0: NIC Link is Up, 1000 Mbps full duplex, Flow control: none
[ 5.781238] bnxt_en 0000:42:00.0 eno1np0: EEE is not active
[ 5.781240] bnxt_en 0000:42:00.0 eno1np0: FEC autoneg off encoding: None
[ 6.549208] NVRM gpuHandleSanityCheckRegReadError_GM107: Possible bad register read: addr: 0x1101c4, regvalue: 0xbadf510c, error code: Unknown SYS_PRI_ERROR_CODE
[ 6.549404] NVRM kgspHealthCheck_TU102: ****************************** GSP-CrashCat Report *******************************
[ 6.549417] NVRM gpuGenGidData_FWCLIENT: GSP Static Info has not been initialized yet for UUID
[ 6.549419] NVRM: Xid (PCI:0000:45:00): 120, pid=‘’, name=, GSP task exception: store access fault (cause:0x7) @ pc:0x557f984, task:1
[ 6.549424] NVRM: Reported by libos task:0 v2.0 [0] @ ts:1701869243
[ 6.549426] NVRM: RISC-V CSR State:
[ 6.549427] NVRM: mstatus:0x000000001e000000 mscratch:0x0000000000000000 mie:0x0000000000000880 mip:0x0000000000000000
[ 6.549428] NVRM: mepc:0x000000000557f984 mbadaddr:0x0000000005720000 mcause:0x0000000000000007
[ 6.549429] NVRM: RISC-V GPR State:
[ 6.549430] NVRM: ra:0x0000000004b6ac24 sp:0x00000000056a17e0 gp:0x0000000000000000 tp:0x0000000000000000
[ 6.549431] NVRM: a0:0x0000000005714000 a1:0x0000000000000000 a2:0x0000000000010000 a3:0x0000000000010000
[ 6.549432] NVRM: a4:0x0000000005720000 a5:0x0000000005724000 a6:0x0000000000000002 a7:0x0000000004ac9a78
[ 6.549434] NVRM: s0:0x00000000056a1840 s1:0x0000000000000000 s2:0x0000000000010000 s3:0x0000000005714000
[ 6.549435] NVRM: s4:0x8000000000178190 s5:0x000000000000ffd0 s6:0x0000000000000000 s7:0x0000000000001002
[ 6.549436] NVRM: s8:0x0000000004a950e0 s9:0x0000000004a931b8 s10:0x0000000005714030 s11:0x0000000000000000
[ 6.549437] NVRM: t0:0x0000000000000012 t1:0x00000000000000bc t2:0x0000000000000200 t3:0x0000000000000001
[ 6.549438] NVRM: t4:0x0000000004a27000 t5:0x000000000417f000 t6:0x0000000000000000
[ 6.549439] NVRM: Stack Trace:
[ 6.549440] NVRM: 0x000000000557f984
[ 6.549441] NVRM: 0x0000000004b6ac24
[ 6.549441] NVRM: 0x0000000004b6e02c
[ 6.549442] NVRM: 0x00000000051bf60c
[ 6.549443] NVRM: 0x00000000054bbe68
[ 6.549444] NVRM: 0x00000000054bce8c
[ 6.549444] NVRM: 0x00000000054b89a4
[ 6.549445] NVRM: 0x00000000050a2fe0
[ 6.549445] NVRM: 0x00000000050c9e50
[ 6.549446] NVRM: 0x0000000004ba2ca8
[ 6.549447] NVRM: 0x0000000004b7621c
[ 6.549448] NVRM: 0x0000000005685d04
[ 6.549448] NVRM: PC Trace:
[ 6.549450] NVRM: 0x000000000568b064 0x000000000568d490 0x000000000568be08 0x0000000004021464 0x000000000568bc2c
[ 6.549451] NVRM: 0x000000000568b340 0x0000000004021464 0x000000000568b284 0x000000000568bc1c 0x000000000568b10c
[ 6.549452] NVRM: 0x000000000568bc00 0x000000000568d7ac 0x000000000568ba88 0x000000000568b1c0 0x000000000568ba50
[ 6.549453] NVRM: 0x000000000568b340 0x0000000004021464 0x000000000568b284 0x000000000568baa8 0x000000000568b1c0
[ 6.549454] NVRM: 0x000000000568ba50 0x000000000568b340 0x0000000004021464 0x000000000568b284 0x000000000568baa8
[ 6.549456] NVRM: 0x000000000568b1c0 0x000000000568ba50 0x000000000568b340 0x0000000004021464 0x000000000568b284
[ 6.549457] NVRM: 0x000000000568baa8 0x000000000568b1c0 0x000000000568ba50 0x000000000568b340 0x0000000004021464
[ 6.549458] NVRM: 0x000000000568b284 0x000000000568baa8 0x000000000568b1c0 0x000000000568ba50 0x000000000568b340
[ 6.549459] NVRM: 0x0000000004021464 0x000000000568b284 0x000000000568baa8 0x000000000568b1c0 0x000000000568ba50
[ 6.549460] NVRM: 0x000000000568b340
[ 6.549461] NVRM: External I/O Register State:
[ 6.549462] NVRM: 0x00111360:0x00000000 0x00111364:0xbadf510c 0x00111368:0x009a0584 0x0011136c:0x00000000
[ 6.549464] NVRM: 0x001112b4:0x00040000 0x001112b8:0x00000000 0x001112bc:0x00000000 0x00111344:0x11100000
[ 6.549465] NVRM: 0x00110008:0x00000010 0x0011010c:0x00000000 0x00110118:0x00011602 0x00110110:0x13fd47f0
[ 6.549466] NVRM: 0x00110128:0x00000000 0x00110114:0x0000bf00 0x0011011c:0x00000f00
[ 6.549467] NVRM: ------------[ end crash report ]------------
[ 6.549472] NVRM: GPU0 GSP RPC buffer contains function 73 (SET_REGISTRY) and data 0x0000000000000000 0x0000000000000000.
[ 6.549475] NVRM: GPU0 RPC history (CPU -> GSP):
[ 6.549476] NVRM: entry function data0 data1 ts_start ts_end duration actively_polling
[ 6.549477] NVRM: 0 73 SET_REGISTRY 0x0000000000000000 0x0000000000000000 0x00060bd74fad8609 0x0000000000000000 y
[ 6.549480] NVRM: -1 72 GSP_SET_SYSTEM_INFO 0x0000000000000000 0x0000000000000000 0x00060bd74fad8603 0x0000000000000000
[ 6.549482] NVRM: GPU0 RPC event history (CPU <- GSP):
[ 6.549483] NVRM: entry function data0 data1 ts_start ts_end duration during_incomplete_rpc
[ 6.549655] NVRM: Xid (PCI:0000:45:00): 140, pid=‘’, name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691462, LTC:0, MMU:0, PCIE:0
[ 6.549658] NVRM kgspHealthCheck_TU102: **********************************************************************************
[ 6.549661] NVRM nvCheckOkFailedNoLog: Check failed: Reset required [NV_ERR_RESET_REQUIRED] (0x00000062) returned from rpcRecvPoll(pGpu, pRpc, NV_VGPU_MSG_EVENT_GSP_INIT_DONE) @ kernel_gsp.c:3616
[ 6.549663] NVRM nvAssertOkFailedNoLog: Assertion failed: Reset required [NV_ERR_RESET_REQUIRED] (0x00000062) returned from kgspWaitForRmInitDone(pGpu, pKernelGsp) @ kernel_gsp_tu102.c:445
[ 6.549668] NVRM kgspInitRm_IMPL: cannot bootstrap riscv/gsp: 0x62
[ 6.549673] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 6.551008] NVRM: GPU 0000:45:00.0: RmInitAdapter failed! (0x62:0x62:1660)
[ 6.552240] NVRM: GPU 0000:45:00.0: rm_init_adapter failed, device minor number 0
[ 6.552472] [drm:nv_drm_load [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00004500] Failed to allocate NvKmsKapiDevice
[ 6.552746] [drm:nv_drm_probe_devices [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00004500] Failed to register device
[ 6.586149] loop15: detected capacity change from 0 to 8
[ 6.621442] nvidia-uvm: Loaded the UVM driver, major device number 510.
[ 6.928434] mlx5_core 0000:c1:00.0 enp193s0f0np0: Link down
[ 7.223771] NVRM gpuClearFbhubPoisonIntrForBug2924523_GA100_KERNEL: FBHUB Interrupt detected = 0x80. Clearing it.
[ 7.225422] NVRM tmrSetCurrentTime_GV100: ERROR: Write to PTIMER attempted even though Level 0 PLM is disabled.
[ 7.225426] NVRM nvAssertFailedNoLog: Assertion failed: 0 @ timer_gv100.c:80
[ 7.295388] NVRM kgspInitRm_IMPL: unexpected WPR2 already up, cannot proceed with booting gsp
[ 7.295394] NVRM kgspInitRm_IMPL: (the GPU is likely in a bad state and may need to be reset)
[ 7.295399] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 7.296397] NVRM: GPU 0000:45:00.0: RmInitAdapter failed! (0x62:0x40:1660)
[ 7.297008] NVRM: GPU 0000:45:00.0: rm_init_adapter failed, device minor number 0
[ 7.304210] NVRM tmrSetCurrentTime_GV100: ERROR: Write to PTIMER attempted even though Level 0 PLM is disabled.
[ 7.304217] NVRM nvAssertFailedNoLog: Assertion failed: 0 @ timer_gv100.c:80
[ 7.333083] kauditd_printk_skb: 35 callbacks suppressed
[ 7.333086] audit: type=1400 audit(1701869244.634:44): apparmor="DENIED" operation="capable" class="cap" profile="/snap/snapd/20290/usr/lib/snapd/snap-confine" pid=1604 comm="snap-confine" capability=12 capname="net_admin"
[ 7.333093] audit: type=1400 audit(1701869244.634:45): apparmor="DENIED" operation="capable" class="cap" profile="/snap/snapd/20290/usr/lib/snapd/snap-confine" pid=1604 comm="snap-confine" capability=38 capname="perfmon"
[ 7.372503] NVRM kgspInitRm_IMPL: unexpected WPR2 already up, cannot proceed with booting gsp
[ 7.372509] NVRM kgspInitRm_IMPL: (the GPU is likely in a bad state and may need to be reset)
[ 7.372513] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 7.373589] NVRM: GPU 0000:45:00.0: RmInitAdapter failed! (0x62:0x40:1660)
[ 7.374148] NVRM: GPU 0000:45:00.0: rm_init_adapter failed, device minor number 0
[ 7.381352] NVRM tmrSetCurrentTime_GV100: ERROR: Write to PTIMER attempted even though Level 0 PLM is disabled.
[ 7.381359] NVRM nvAssertFailedNoLog: Assertion failed: 0 @ timer_gv100.c:80
[ 7.449714] NVRM kgspInitRm_IMPL: unexpected WPR2 already up, cannot proceed with booting gsp
[ 7.449720] NVRM kgspInitRm_IMPL: (the GPU is likely in a bad state and may need to be reset)
[ 7.449724] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 7.450663] NVRM: GPU 0000:45:00.0: RmInitAdapter failed! (0x62:0x40:1660)
[ 7.451237] NVRM: GPU 0000:45:00.0: rm_init_adapter failed, device minor number 0
[ 7.458164] NVRM tmrSetCurrentTime_GV100: ERROR: Write to PTIMER attempted even though Level 0 PLM is disabled.
[ 7.458171] NVRM nvAssertFailedNoLog: Assertion failed: 0 @ timer_gv100.c:80
[ 7.526312] NVRM kgspInitRm_IMPL: unexpected WPR2 already up, cannot proceed with booting gsp
[ 7.526319] NVRM kgspInitRm_IMPL: (the GPU is likely in a bad state and may need to be reset)
[ 7.526324] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 7.529486] NVRM: GPU 0000:45:00.0: RmInitAdapter failed! (0x62:0x40:1660)
[ 7.530127] NVRM: GPU 0000:45:00.0: rm_init_adapter failed, device minor number 0
[ 7.585325] mlx5_core 0000:c1:00.1 enp193s0f1np1: Link down
[ 9.735592] rfkill: input handler disabled
[ 9.771804] NVRM tmrSetCurrentTime_GV100: ERROR: Write to PTIMER attempted even though Level 0 PLM is disabled.
[ 9.771810] NVRM nvAssertFailedNoLog: Assertion failed: 0 @ timer_gv100.c:80
[ 9.839639] NVRM kgspInitRm_IMPL: unexpected WPR2 already up, cannot proceed with booting gsp
[ 9.839646] NVRM kgspInitRm_IMPL: (the GPU is likely in a bad state and may need to be reset)
[ 9.839651] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 9.840722] NVRM: GPU 0000:45:00.0: RmInitAdapter failed! (0x62:0x40:1660)
[ 9.841252] NVRM: GPU 0000:45:00.0: rm_init_adapter failed, device minor number 0
[ 9.848589] NVRM tmrSetCurrentTime_GV100: ERROR: Write to PTIMER attempted even though Level 0 PLM is disabled.
[ 9.848595] NVRM nvAssertFailedNoLog: Assertion failed: 0 @ timer_gv100.c:80
[ 9.918224] NVRM kgspInitRm_IMPL: unexpected WPR2 already up, cannot proceed with booting gsp
[ 9.918230] NVRM kgspInitRm_IMPL: (the GPU is likely in a bad state and may need to be reset)
[ 9.918234] NVRM RmInitAdapter: Cannot initialize GSP firmware RM
[ 9.919227] NVRM: GPU 0000:45:00.0: RmInitAdapter failed! (0x62:0x40:1660)
[ 9.919799] NVRM: GPU 0000:45:00.0: rm_init_adapter failed, device minor number 0

I strongly suspect the GPU is broken. You could check whether this only affects the GSP by disabling the firmware (switch back to the normal driver and set the kernel parameter nvidia.NVreg_EnableGpuFirmware=0), but I guess it’s better to contact your vendor.
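
After rebooting with nvidia.NVreg_EnableGpuFirmware=0, it may be worth confirming that the driver actually picked the setting up. The following is a minimal Python sketch, assuming the loaded driver lists its module parameters in /proc/driver/nvidia/params as "Name: value" lines. That path and format are an assumption to verify on your system, and the helper name read_nvidia_param is made up.

```python
from pathlib import Path

# Assumed location of the driver's parameter listing; check it exists on your system.
PARAMS_FILE = Path("/proc/driver/nvidia/params")


def read_nvidia_param(params_text, name):
    """Return the value of `name` from 'Name: value' lines, or None if absent."""
    for line in params_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == name:
            return value.strip()
    return None


if __name__ == "__main__":
    if PARAMS_FILE.exists():
        value = read_nvidia_param(PARAMS_FILE.read_text(), "EnableGpuFirmware")
        print(f"EnableGpuFirmware = {value}")
    else:
        print("no params file found (nvidia module not loaded?)")
```

A reported value of 0 would indicate that the parameter took effect. If the same Xid errors still appear in dmesg afterwards, that would be one more data point to pass on to the vendor.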

Thanks generix for your help. I’ll contact my vendor.

Regards
Thomas
