I have attempted to install CUDA on Ubuntu 22.04.1 LTS using the network installer (and also the non-network installer) from the NVIDIA CUDA download site. The installation appears to complete without errors, but after rebooting there is only a black screen. I can log into the computer remotely through ssh, and functions that do not involve the display or GPU seem to operate normally, with three exceptions: the computer pauses periodically, the Xorg process is hung and cannot be killed, and the EDID of the monitor cannot be read (even though the driver correctly identifies the make and model of the monitor). I have tried two different monitors with the same results.
I have run the NVIDIA bug report script, but I do not know what to do with its output. Here is the output of “nvidia-smi”:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce … On | 00000000:2B:00.0 Off | N/A |
| 34% 54C P0 117W / 350W | 0MiB / 12288MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
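For anyone comparing driver branches across reports like this one, the header row of the nvidia-smi output above can be parsed mechanically. The following is a minimal Python sketch, not an official tool: the names `HEADER_RE`, `parse_versions`, and `driver_branch` are made up for illustration, and the pattern assumes the header layout shown above. Note that the "CUDA Version" field reports the highest CUDA version the driver supports, which is not necessarily the installed toolkit version.

```python
import re

# Pattern modeled on the header row shown above; the layout is an assumption
# and may differ between nvidia-smi releases.
HEADER_RE = re.compile(
    r"NVIDIA-SMI\s+(?P<smi>[\d.]+)\s+"
    r"Driver Version:\s+(?P<driver>[\d.]+)\s+"
    r"CUDA Version:\s+(?P<cuda>[\d.]+)"
)


def parse_versions(smi_text):
    """Return a dict with 'smi', 'driver' and 'cuda' version strings, or None."""
    match = HEADER_RE.search(smi_text)
    if match is None:
        return None
    return match.groupdict()


def driver_branch(driver_version):
    """Return the major branch number of a driver version, e.g. 520 for 520.61.05."""
    return int(driver_version.split(".")[0])
```

Calling `parse_versions` on the text captured from `nvidia-smi` and passing the `driver` field to `driver_branch` gives a number that can be compared directly with the branches mentioned in other reports.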
I have looked at other similar forum topics but have not been able to identify an applicable solution from them.
Thank you for your assistance.
It appears to have something to do with “nvidia-modeset.” The kernel log below shows the relevant messages, including a kernel Oops:
[ 126.296138] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67d:0:0:1120
[ 134.295611] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:0:0:1129
[ 142.296506] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:1:0:1129
[ 174.296894] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67d:0:0:1120
[ 182.296507] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:0:0:1129
[ 190.296053] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:1:0:1129
[ 222.293297] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67d:0:0:1120
[ 230.292507] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:0:0:1129
[ 238.291752] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:1:0:1129
[ 242.643248] INFO: task nvidia-modeset/:895 blocked for more than 120 seconds.
[ 242.643251] Tainted: P OE 5.15.0-53-generic #59-Ubuntu
[ 242.643252] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 242.643253] task:nvidia-modeset/ state:D stack: 0 pid: 895 ppid: 2 flags:0x00004000
[ 242.643256] Call Trace:
[ 242.643257]
[ 242.643258] __schedule+0x24e/0x590
[ 242.643262] ? __schedule+0x256/0x590
[ 242.643265] schedule+0x69/0x110
[ 242.643266] schedule_timeout+0x103/0x140
[ 242.643268] ? schedule_timeout+0x103/0x140
[ 242.643269] __down_common+0xf1/0x150
[ 242.643271] __down+0x1d/0x30
[ 242.643273] down+0x51/0x70
[ 242.643275] nvkms_kthread_q_callback+0x78/0x110 [nvidia_modeset]
[ 242.643293] _main_loop+0x8c/0x140 [nvidia_modeset]
[ 242.643311] ? nvkms_sema_up+0x20/0x20 [nvidia_modeset]
[ 242.643328] kthread+0x12a/0x150
[ 242.643330] ? set_kthread_struct+0x50/0x50
[ 242.643332] ret_from_fork+0x22/0x30
[ 242.643334]
[ 270.291250] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67d:0:0:1120
[ 278.292347] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:0:0:1129
[ 286.293433] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:1:0:1129
[ 318.295444] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67d:0:0:1120
[ 326.295600] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:0:0:1129
[ 334.295474] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67e:1:0:1129
[ 363.479162] INFO: task nvidia-modeset/:895 blocked for more than 241 seconds.
[ 363.479165] Tainted: P OE 5.15.0-53-generic #59-Ubuntu
[ 363.479166] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 363.479167] task:nvidia-modeset/ state:D stack: 0 pid: 895 ppid: 2 flags:0x00004000
[ 363.479169] Call Trace:
[ 363.479171]
[ 363.479172] __schedule+0x24e/0x590
[ 363.479176] ? __schedule+0x256/0x590
[ 363.479179] schedule+0x69/0x110
[ 363.479181] schedule_timeout+0x103/0x140
[ 363.479183] ? schedule_timeout+0x103/0x140
[ 363.479184] __down_common+0xf1/0x150
[ 363.479186] __down+0x1d/0x30
[ 363.479187] down+0x51/0x70
[ 363.479190] nvkms_kthread_q_callback+0x78/0x110 [nvidia_modeset]
[ 363.479208] _main_loop+0x8c/0x140 [nvidia_modeset]
[ 363.479226] ? nvkms_sema_up+0x20/0x20 [nvidia_modeset]
[ 363.479243] kthread+0x12a/0x150
[ 363.479245] ? set_kthread_struct+0x50/0x50
[ 363.479246] ret_from_fork+0x22/0x30
[ 363.479249]
[ 422.293629] nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device Ancor Communications Inc ASUS VH222H (HDMI-0)
[ 478.288992] nvidia-modeset: ERROR: GPU:0: Idling display engine timed out: 0x0000c67d:0:0:341
[ 484.304576] INFO: task nvidia-modeset/:895 blocked for more than 362 seconds.
[ 484.304579] Tainted: P OE 5.15.0-53-generic #59-Ubuntu
[ 484.304580] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 484.304581] task:nvidia-modeset/ state:D stack: 0 pid: 895 ppid: 2 flags:0x00004000
[ 484.304584] Call Trace:
[ 484.304585]
[ 484.304586] __schedule+0x24e/0x590
[ 484.304589] ? __schedule+0x256/0x590
[ 484.304591] schedule+0x69/0x110
[ 484.304592] schedule_timeout+0x103/0x140
[ 484.304594] ? schedule_timeout+0x103/0x140
[ 484.304596] __down_common+0xf1/0x150
[ 484.304598] __down+0x1d/0x30
[ 484.304599] down+0x51/0x70
[ 484.304601] nvkms_kthread_q_callback+0x78/0x110 [nvidia_modeset]
[ 484.304613] _main_loop+0x8c/0x140 [nvidia_modeset]
[ 484.304624] ? nvkms_sema_up+0x20/0x20 [nvidia_modeset]
[ 484.304635] kthread+0x12a/0x150
[ 484.304638] ? set_kthread_struct+0x50/0x50
[ 484.304639] ret_from_fork+0x22/0x30
[ 484.304642]
[ 558.289900] nvidia-modeset: ERROR: GPU:0: Display engine push buffer channel allocation failed: 0x65 (Call timed out [NV_ERR_TIMEOUT])
[ 558.290182] nvidia-modeset: ERROR: GPU:0: Failed to allocate display engine core DMA push buffer
[ 569.084185] audit: type=1400 audit(1669393016.266:64): apparmor="DENIED" operation="capable" profile="/usr/lib/snapd/snap-confine" pid=2037 comm="snap-confine" capability=12 capname="net_admin"
[ 569.084193] audit: type=1400 audit(1669393016.266:65): apparmor="DENIED" operation="capable" profile="/usr/lib/snapd/snap-confine" pid=2037 comm="snap-confine" capability=38 capname="perfmon"
[ 569.132860] BUG: kernel NULL pointer dereference, address: 0000000000000070
[ 569.132862] #PF: supervisor read access in kernel mode
[ 569.132863] #PF: error_code(0x0000) - not-present page
[ 569.132864] PGD 0 P4D 0
[ 569.132866] Oops: 0000 [#1] SMP NOPTI
[ 569.132867] CPU: 5 PID: 1959 Comm: Xorg Tainted: P OE 5.15.0-53-generic #59-Ubuntu
[ 569.132869] Hardware name: Micro-Star International Co., Ltd. MS-7C91/MAG B550 TOMAHAWK MAX WIFI (MS-7C91), BIOS 2.10 05/25/2022
[ 569.132870] RIP: 0010:_nv002557kms+0x18/0x70 [nvidia_modeset]
[ 569.132885] Code: ff c6 44 24 2f 01 eb af 66 2e 0f 1f 84 00 00 00 00 00 41 54 55 49 89 fc 53 89 d5 41 b8 04 00 00 00 ba 02 01 02 00 48 83 ec 10 <8b> 46 70 8b 3d 1f fa 0b 00 48 8d 4c 24 0c 89 ee 89 44 24 0c e8 af
[ 569.132886] RSP: 0018:ffffb53a421c3cd0 EFLAGS: 00010282
[ 569.132887] RAX: 0000000000000000 RBX: 0000000020020000 RCX: ffff92e3480c8780
[ 569.132888] RDX: 0000000000020102 RSI: 0000000000000000 RDI: ffff92e36bcb7008
[ 569.132889] RBP: 0000000000010012 R08: 0000000000000004 R09: 0000000000000000
[ 569.132889] R10: ffff92e3513f8000 R11: 0000000000000000 R12: ffff92e36bcb7008
[ 569.132890] R13: ffff92e36bcb70a0 R14: 0000000000000fff R15: 0000000000010011
[ 569.132891] FS: 00007f26a316da80(0000) GS:ffff92f22eb40000(0000) knlGS:0000000000000000
[ 569.132891] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 569.132892] CR2: 0000000000000070 CR3: 000000012a456000 CR4: 0000000000750ee0
[ 569.132893] PKRU: 55555554
[ 569.132893] Call Trace:
[ 569.132894]
[ 569.132896] ? _nv002556kms+0xb3/0x150 [nvidia_modeset]
[ 569.132907] ? _nv002329kms+0x4da/0x730 [nvidia_modeset]
[ 569.132918] ? _nv000451kms+0xa0/0xa0 [nvidia_modeset]
[ 569.132927] ? _copy_from_user+0x2e/0x70
[ 569.132929] ? _nv000451kms+0xa0/0xa0 [nvidia_modeset]
[ 569.132938] ? _nv000638kms+0x34/0x50 [nvidia_modeset]
[ 569.132947] ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
[ 569.132956] ? nvkms_ioctl+0x108/0x180 [nvidia_modeset]
[ 569.132965] ? nvidia_frontend_unlocked_ioctl+0x58/0x90 [nvidia]
[ 569.133112] ? __x64_sys_ioctl+0x95/0xd0
[ 569.133114] ? do_syscall_64+0x5c/0xc0
[ 569.133117] ? entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 569.133120]
[ 569.133120] Modules linked in: rfcomm ccm nvidia_uvm(POE) cmac algif_hash algif_skcipher af_alg nvidia_drm(POE) intel_rapl_msr bnep intel_rapl_common nvidia_modeset(POE) snd_hda_codec_realtek snd_hda_codec_generic edac_mce_amd nvidia(POE) ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_pcm nls_iso8859_1 mt7921e drm_kms_helper snd_seq_midi mt76_connac_lib snd_seq_midi_event snd_rawmidi mt76 kvm crct10dif_pclmul mac80211 snd_seq cec ghash_clmulni_intel btusb aesni_intel btrtl snd_seq_device btbcm btintel crypto_simd cryptd rc_core snd_timer bluetooth cfg80211 rapl fb_sys_fops syscopyarea snd sysfillrect input_leds ecdh_generic wmi_bmof sysimgblt ccp k10temp joydev ecc soundcore libarc4 mac_hid sch_fq_codel ipmi_devintf ipmi_msghandler msr parport_pc ppdev lp parport drm ramoops reed_solomon pstore_blk mtd pstore_zone efi_pstore ip_tables x_tables autofs4 hid_generic usbhid hid r8169 nvme ahci xhci_pci gpio_amdpt
[ 569.133150] crc32_pclmul i2c_piix4 nvme_core realtek libahci xhci_pci_renesas wmi gpio_generic
[ 569.133155] CR2: 0000000000000070
[ 569.133156] ---[ end trace 4f29f0e439ed8fcc ]---
[ 569.240991] RIP: 0010:_nv002557kms+0x18/0x70 [nvidia_modeset]
[ 569.241007] Code: ff c6 44 24 2f 01 eb af 66 2e 0f 1f 84 00 00 00 00 00 41 54 55 49 89 fc 53 89 d5 41 b8 04 00 00 00 ba 02 01 02 00 48 83 ec 10 <8b> 46 70 8b 3d 1f fa 0b 00 48 8d 4c 24 0c 89 ee 89 44 24 0c e8 af
[ 569.241009] RSP: 0018:ffffb53a421c3cd0 EFLAGS: 00010282
[ 569.241010] RAX: 0000000000000000 RBX: 0000000020020000 RCX: ffff92e3480c8780
[ 569.241011] RDX: 0000000000020102 RSI: 0000000000000000 RDI: ffff92e36bcb7008
[ 569.241012] RBP: 0000000000010012 R08: 0000000000000004 R09: 0000000000000000
[ 569.241012] R10: ffff92e3513f8000 R11: 0000000000000000 R12: ffff92e36bcb7008
[ 569.241013] R13: ffff92e36bcb70a0 R14: 0000000000000fff R15: 0000000000010011
[ 569.241014] FS: 00007f26a316da80(0000) GS:ffff92f22eb40000(0000) knlGS:0000000000000000
[ 569.241015] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 569.241015] CR2: 0000000000000070 CR3: 000000012a456000 CR4: 0000000000750ee0
[ 569.241016] PKRU: 55555554
[ 569.712129] audit: type=1400 audit(1669393016.894:66): apparmor="DENIED" operation="capable" profile="/usr/lib/snapd/snap-confine" pid=2116 comm="snap-confine" capability=12 capname="net_admin"
[ 569.712136] audit: type=1400 audit(1669393016.894:67): apparmor="DENIED" operation="capable" profile="/usr/lib/snapd/snap-confine" pid=2116 comm="snap-confine" capability=38 capname="perfmon"
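A log this long is easier to compare with other reports once it is condensed. The sketch below is one possible way to do that in Python, assuming only the message formats quoted in the log above; the function name `summarize` and both patterns are illustrative and are not part of any NVIDIA tooling. It counts nvidia-modeset messages by type, records the longest hung-task wait, and notes whether a NULL pointer dereference was logged.

```python
import re
from collections import Counter

# Patterns modeled on the messages quoted above (assumed formats).
MODESET_RE = re.compile(r"nvidia-modeset: (ERROR|WARNING): GPU:\d+: (.+)")
HUNG_RE = re.compile(r"blocked for more than (\d+) seconds")


def summarize(log_text):
    """Return (message counts, longest hung-task wait in seconds, oops seen)."""
    counts = Counter()
    longest_hang = 0
    oops_seen = False
    for line in log_text.splitlines():
        modeset = MODESET_RE.search(line)
        if modeset:
            # Keep the text before the first colon so that differing hex
            # codes do not split one kind of message into many buckets.
            message = modeset.group(2).split(":")[0]
            counts[(modeset.group(1), message)] += 1
            continue
        hung = HUNG_RE.search(line)
        if hung:
            longest_hang = max(longest_hang, int(hung.group(1)))
        if "BUG: kernel NULL pointer dereference" in line:
            oops_seen = True
    return counts, longest_hang, oops_seen
```

Feeding it the saved output of `dmesg` (for example, read from a file) yields a short summary that can be pasted alongside the full log.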
Are you using HDMI? I had the black screen problem with HDMI, and it was caused by the latest 520 drivers that are needed for CUDA 11.8 (though I don’t know what the logs said in my case). Maybe try an older CUDA version: 11.6.2 (with driver 510.84) worked for me. The newest production branch driver, 515.86.01, is also supposed to solve the problem, but I didn’t try it because of CUDA dependencies.
Yes, I am using HDMI. I found a post on another forum which mentioned that if one leaves the monitor unplugged while the X server is starting up on boot, and plugs it in only after the X server has started, the monitor and X server will work. I have tried this and it does work. This points to some kind of race condition, or to components that are sensitive to the order in which they are started.
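When applying this workaround over ssh, it helps to know when the X server process actually exists before plugging the monitor in. Here is a rough Python sketch of that check, under several assumptions: the process name is "Xorg" (as in the kernel log above), the machine exposes the Linux /proc filesystem, and the helper names `running_command_names` and `wait_for_process` are invented for this example. The presence of the process does not prove that the server has finished initializing, so a short extra delay before plugging in may still be sensible.

```python
import os
import time


def running_command_names(proc_root="/proc"):
    """Return the set of command names of running processes (Linux only)."""
    names = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are process IDs
        try:
            with open(os.path.join(proc_root, entry, "comm")) as handle:
                names.add(handle.read().strip())
        except OSError:
            continue  # the process exited between listing and reading
    return names


def wait_for_process(name="Xorg", timeout=120.0, interval=1.0,
                     list_names=running_command_names, sleep=time.sleep):
    """Poll until a process called `name` exists; return False on timeout."""
    waited = 0.0
    while True:
        if name in list_names():
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
```

Running `wait_for_process()` from an ssh session right after boot returns `True` once an Xorg process shows up, which is the cue to connect the HDMI cable. The `list_names` and `sleep` parameters exist only so the polling logic can be exercised without a real X server.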
Because I plan to use the computer remotely most of the time and only need a monitor for local debugging, a low-resolution (1920x1080) HDMI monitor is sufficient, and I don’t plan to use a DisplayPort monitor. I hope this is fixed in the 520 drivers soon, but for now I can use the workaround.
Thanks.