Hello,
I've tried to get an A100 (driver) running on Ubuntu 22.04 Desktop with no luck.
My setup:
Mainboard: ASRock ROMED8-2T with an AMD EPYC 7H12 CPU and 120GB of registered ECC memory.
What I did:
Installed and upgraded Ubuntu 22.04 Desktop.
The A100 is recognized as a PCIe device:
test@test-desktop:~$ lspci | grep NVIDIA
45:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
Recognized by the kernel:
test@test-desktop:~$ sudo dmesg | grep nvidia
[sudo] password for test:
[ 5.051881] audit: type=1400 audit(1701790560.353:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1138 comm="apparmor_parser"
[ 5.051885] audit: type=1400 audit(1701790560.353:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1138 comm="apparmor_parser"
No NVIDIA kernel module is loaded so far:
test@test-desktop:~$ lsmod | grep nvidia
test@test-desktop:~$
Only the Nouveau driver is loaded:
test@test-desktop:~$ lsmod | grep nouveau
nouveau 2830336 0
mxm_wmi 16384 1 nouveau
drm_ttm_helper 16384 1 nouveau
ttm 110592 2 drm_ttm_helper,nouveau
drm_display_helper 212992 1 nouveau
drm_kms_helper 249856 5 ast,drm_display_helper,nouveau
i2c_algo_bit 16384 2 ast,nouveau
video 73728 1 nouveau
wmi 40960 3 video,mxm_wmi,nouveau
drm 700416 9 drm_kms_helper,ast,drm_shmem_helper,drm_display_helper,drm_ttm_helper,ttm,nouveau
test@test-desktop:~$
Then I installed nvidia-driver-535 via the Additional Drivers section in Ubuntu.
Since then, dmesg reports the following:
[ 5.252953] nvidia: loading out-of-tree module taints kernel.
[ 5.252967] nvidia: module license 'NVIDIA' taints kernel.
[ 5.252968] Disabling lock debugging due to kernel taint
[ 5.258482] RAPL PMU: API unit is 2^-32 Joules, 1 fixed counters, 163840 ms ovfl timer
[ 5.258486] RAPL PMU: hw unit of domain package 2^-16 Joules
[ 5.265896] cryptd: max_cpu_qlen set to 1000
[ 5.273192] ipmi_si IPI0001:00: IPMI message handler: Found new BMC (man_id: 0x00c1d6, prod_id: 0x1000, dev_id: 0x20)
[ 5.276241] AVX2 version of gcm_enc/dec engaged.
[ 5.276511] AES CTR mode by8 optimization enabled
[ 5.311639] loop8: detected capacity change from 0 to 8
[ 5.411840] nvidia-nvlink: Nvlink Core is being initialized, major device number 234
[ 5.413378] nvidia 0000:45:00.0: enabling device (0000 -> 0002)
[ 5.460593] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 535.129.03 Thu Oct 19 18:56:32 UTC 2023
[ 5.511993] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 535.129.03 Thu Oct 19 18:42:12 UTC 2023
[ 5.528831] [drm] [nvidia-drm] [GPU ID 0x00004500] Loading driver
[ 5.534989] SVM: TSC scaling supported
[ 5.534993] kvm: Nested Virtualization enabled
[ 5.534995] SVM: kvm: Nested Paging enabled
[ 5.534996] SEV supported: 509 ASIDs
[ 5.535043] SVM: Virtual VMLOAD VMSAVE supported
[ 5.535044] SVM: Virtual GIF supported
[ 5.535044] SVM: LBR virtualization supported
[ 5.554977] MCE: In-kernel MCE decoding enabled.
[ 5.557821] EDAC amd64: MCT channel count: 8
[ 5.558002] EDAC MC0: Giving out device to module amd64_edac controller F17h_M30h: DEV 0000:00:18.3 (INTERRUPT)
[ 5.558004] EDAC amd64: F17h_M30h detected (node 0).
[ 5.558008] EDAC MC: UMC0 chip selects:
[ 5.558009] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.558010] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.558014] EDAC MC: UMC1 chip selects:
[ 5.558015] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.558016] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.558020] EDAC MC: UMC2 chip selects:
[ 5.558020] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.558021] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.558025] EDAC MC: UMC3 chip selects:
[ 5.558026] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.558027] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.558030] EDAC MC: UMC4 chip selects:
[ 5.558031] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.558032] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.558035] EDAC MC: UMC5 chip selects:
[ 5.558036] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.558037] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.558040] EDAC MC: UMC6 chip selects:
[ 5.558041] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.558042] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.558045] EDAC MC: UMC7 chip selects:
[ 5.558046] EDAC amd64: MC: 0: 8192MB 1: 8192MB
[ 5.558047] EDAC amd64: MC: 2: 0MB 3: 0MB
[ 5.558048] EDAC amd64: using x16 syndromes.
[ 5.558061] EDAC PCI0: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.0 (POLLED)
[ 5.558063] AMD64 EDAC driver v3.5.0
[ 5.564019] intel_rapl_common: Found RAPL domain package
[ 5.564021] intel_rapl_common: Found RAPL domain core
[ 5.576863] bnxt_en 0000:42:00.0 eno1np0: NIC Link is Up, 1000 Mbps full duplex, Flow control: none
[ 5.576869] bnxt_en 0000:42:00.0 eno1np0: EEE is not active
[ 5.576872] bnxt_en 0000:42:00.0 eno1np0: FEC autoneg off encoding: None
[ 5.634776] ipmi_si IPI0001:00: IPMI kcs interface initialized
[ 5.637594] ipmi_ssif: IPMI SSIF Interface driver
[ 6.774955] mlx5_core 0000:c1:00.0 enp193s0f0np0: Link down
[ 6.837380] NVRM: Xid (PCI:0000:45:00): 120, pid='', name=, GSP task exception: illegal instruction (cause:0x2) @ pc:0x55ad2dc, task:1
[ 6.837391] NVRM: Reported by libos task:0 v2.0 [0] @ ts:1701791560
[ 6.837393] NVRM: RISC-V CSR State:
[ 6.837394] NVRM: mstatus:0x000000001e000000 mscratch:0x0000000000000000 mie:0x0000000000000880 mip:0x0000000000000000
[ 6.837396] NVRM: mepc:0x00000000055ad2dc mbadaddr:0x00000000056a2608 mcause:0x0000000000000002
[ 6.837397] NVRM: RISC-V GPR State:
[ 6.837399] NVRM: ra:0x00000000050a325c sp:0x00000000056a1d30 gp:0x0000000000000000 tp:0x0000000000000000
[ 6.837400] NVRM: a0:0x8000000000178190 a1:0x8000000000194c78 a2:0x0000000000000000 a3:0x0000000000c4cd90
[ 6.837401] NVRM: a4:0x179dfa6817bc9ce0 a5:0x00000000055ad2dc a6:0x000000000000004e a7:0x00000000041a54e8
[ 6.837403] NVRM: s0:0x00000000056a1dd0 s1:0x0000000000000000 s2:0x8000000000178190 s3:0x800000000017c894
[ 6.837404] NVRM: s4:0x8000000000194c78 s5:0x00000000056a1d38 s6:0x0000000000000000 s7:0x000000000417f000
[ 6.837406] NVRM: s8:0x800000000017c990 s9:0x0000000000000056 s10:0x0000000000000056 s11:0x80000000002a8cd0
[ 6.837407] NVRM: t0:0x0000000000000012 t1:0x000000000558f244 t2:0x000000000015c47b t3:0x8000000000099630
[ 6.837408] NVRM: t4:0x0000000000000033 t5:0x0000000000d8992d t6:0x0000000000000000
[ 6.837410] NVRM: Stack Trace:
[ 6.837411] NVRM: 0x00000000055ad2dc
[ 6.837412] NVRM: 0x00000000050c9e50
[ 6.837412] NVRM: 0x0000000004ba2ca8
[ 6.837413] NVRM: 0x0000000004b7621c
[ 6.837414] NVRM: 0x0000000005685d04
[ 6.837415] NVRM: PC Trace:
[ 6.837416] NVRM: 0x000000000568b064 0x000000000568d490 0x000000000568be08
[ 6.837418] NVRM: External I/O Register State:
[ 6.837419] NVRM: 0x00111360:0x00000000 0x00111364:0x00000000 0x00111368:0x00000000 0x0011136c:0x00000000
[ 6.837421] NVRM: 0x001112b4:0x00040000 0x001112b8:0x00000000 0x001112bc:0x00000000 0x00111344:0x11100000
[ 6.837422] NVRM: 0x00110008:0x00000010 0x0011010c:0x00000000 0x00110118:0x00011122 0x00110110:0x13f5d1e2
[ 6.837424] NVRM: 0x00110128:0x00000000 0x00110114:0x0000d320 0x0011011c:0x000003a0
[ 6.837425] NVRM: ------------[ end crash report ]------------
[ 6.837430] NVRM: GPU0 GSP RPC buffer contains function 4098 (GSP_RUN_CPU_SEQUENCER) and data 0x00000000000001ea 0x0000000000003fe2.
[ 6.837433] NVRM: GPU0 RPC history (CPU -> GSP):
[ 6.837434] NVRM: entry function data0 data1 ts_start ts_end duration actively_polling
[ 6.837435] NVRM: 0 73 SET_REGISTRY 0x0000000000000000 0x0000000000000000 0x00060bc5395ce801 0x0000000000000000 y
[ 6.837439] NVRM: -1 72 GSP_SET_SYSTEM_INFO 0x0000000000000000 0x0000000000000000 0x00060bc5395ce7fc 0x0000000000000000
[ 6.837442] NVRM: GPU0 RPC event history (CPU <- GSP):
[ 6.837443] NVRM: entry function data0 data1 ts_start ts_end duration during_incomplete_rpc
[ 6.837444] NVRM: 0 4098 GSP_RUN_CPU_SEQUENCER 0x00000000000001ea 0x0000000000003fe2 0x00060bc5396ba02e 0x00060bc5396bb483 5205us y
[ 6.837619] NVRM: Xid (PCI:0000:45:00): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691462, LTC:0, MMU:0, PCIE:0
[ 6.838662] NVRM: GPU 0000:45:00.0: RmInitAdapter failed! (0x62:0x62:2393)
[ 6.839897] NVRM: GPU 0000:45:00.0: rm_init_adapter failed, device minor number 0
[ 6.840045] [drm:nv_drm_load [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00004500] Failed to allocate NvKmsKapiDevice
[ 6.840236] [drm:nv_drm_probe_devices [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x00004500] Failed to register device
[ 6.969608] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 7.012611] nvidia-uvm: Loaded the UVM driver, major device number 510.
[ 7.257844] kauditd_printk_skb: 35 callbacks suppressed
[ 7.257847] audit: type=1400 audit(1701791560.565:44): apparmor="DENIED" operation="capable" class="cap" profile="/usr/lib/snapd/snap-confine" pid=1563 comm="snap-confine" capability=12 capname="net_admin"
[ 7.257858] audit: type=1400 audit(1701791560.565:45): apparmor="DENIED" operation="capable" class="cap" profile="/usr/lib/snapd/snap-confine" pid=1563 comm="snap-confine" capability=38 capname="perfmon"
[ 7.489475] mlx5_core 0000:c1:00.1 enp193s0f1np1: Link down
[ 7.501088] NVRM: GPU 0000:45:00.0: RmInitAdapter failed! (0x62:0x40:2393)
[ 7.501694] NVRM: GPU 0000:45:00.0: rm_init_adapter failed, device minor number 0
[ 7.579078] NVRM: GPU 0000:45:00.0: RmInitAdapter failed! (0x62:0x40:2393)
[ 7.579673] NVRM: GPU 0000:45:00.0: rm_init_adapter failed, device minor number 0
[ 7.658277] NVRM: GPU 0000:45:00.0: RmInitAdapter failed! (0x62:0x40:2393)
[ 7.658842] NVRM: GPU 0000:45:00.0: rm_init_adapter failed, device minor number 0
[ 7.735722] NVRM: GPU 0000:45:00.0: RmInitAdapter failed! (0x62:0x40:2393)
[ 7.736347] NVRM: GPU 0000:45:00.0: rm_init_adapter failed, device minor number 0
[ 9.084872] audit: type=1400 audit(1701791562.393:46): apparmor="DENIED" operation="capable" class="cap" profile="/usr/lib/snapd/snap-confine" pid=2223 comm="snap-confine" capability=12 capname="net_admin"
[ 9.084885] audit: type=1400 audit(1701791562.393:47): apparmor="DENIED" operation="capable" class="cap" profile="/usr/lib/snapd/snap-confine" pid=2223 comm="snap-confine" capability=38 capname="perfmon"
[ 9.964979] rfkill: input handler disabled
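Since the log above is long, here is a minimal Python sketch that pulls out only the lines that seem to matter: the two Xid events (120 and 140) and the repeated RmInitAdapter failures. The names summarize, XID_RE, INIT_FAIL_RE and SAMPLE are my own and not part of any NVIDIA tool, and the regular expressions are only assumed to fit the line format shown in this post.

```python
import re
from collections import Counter

# Hypothetical patterns, written to match the NVRM lines quoted in this post.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),\s*(.*)")
INIT_FAIL_RE = re.compile(
    r"NVRM: GPU ([0-9a-fA-F:.]+): RmInitAdapter failed! \(([^)]*)\)"
)


def summarize(dmesg_text):
    """Return (xid_events, init_failures) found in dmesg text.

    xid_events is a list of (pci_id, xid_code, detail) tuples;
    init_failures counts each (gpu, status) pair.
    """
    xids = []
    failures = Counter()
    for line in dmesg_text.splitlines():
        m = XID_RE.search(line)
        if m:
            xids.append((m.group(1), int(m.group(2)), m.group(3).strip()))
            continue
        m = INIT_FAIL_RE.search(line)
        if m:
            failures[(m.group(1), m.group(2))] += 1
    return xids, failures


# A few lines copied from the dmesg output above.
SAMPLE = """\
[ 6.837380] NVRM: Xid (PCI:0000:45:00): 120, pid='', name=, GSP task exception: illegal instruction (cause:0x2) @ pc:0x55ad2dc, task:1
[ 6.837619] NVRM: Xid (PCI:0000:45:00): 140, pid='', name=, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691462, LTC:0, MMU:0, PCIE:0
[ 6.838662] NVRM: GPU 0000:45:00.0: RmInitAdapter failed! (0x62:0x62:2393)
[ 7.501088] NVRM: GPU 0000:45:00.0: RmInitAdapter failed! (0x62:0x40:2393)
[ 7.579078] NVRM: GPU 0000:45:00.0: RmInitAdapter failed! (0x62:0x40:2393)
"""

if __name__ == "__main__":
    xids, failures = summarize(SAMPLE)
    for bus, code, detail in xids:
        print(f"Xid {code} on {bus}: {detail}")
    for (gpu, status), count in failures.items():
        print(f"RmInitAdapter failed on {gpu} with {status} ({count}x)")
```

Run on the full log, this should reduce everything to the GSP task exception (Xid 120), the ECC report (Xid 140), and the RmInitAdapter failures that follow them.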
Any thoughts on this?
Any hints are highly appreciated!
Thanks in advance!
Thomas