CUDA illegal instruction on RTX 3090, Ubuntu 20.04, PyTorch 1.11, CUDA 11.3, cuDNN 8.2.0

Started suddenly today, had been working correctly until now.

No overclocking, power limit at 250W, temp around 60 C.

NVIDIA driver 510.73.05, also tested with 470
Linux kernel 5.13.0-48-generic, also tested with 5.13.0-44-generic

Software env

pytorch 1.11.0 py3.9_cuda11.3_cudnn8.2.0_0 pytorch
cudatoolkit 11.3.1 h2bc3f7f_2

When the crash happens, PyTorch outputs “RuntimeError: CUDA Error: an illegal instruction was encountered”

Syslog records the following (Xid 13 errors, until finally Xid 62):

Jun 10 12:36:18 U109 kernel: [  602.770376] NVRM: GPU at PCI:0000:01:00: GPU-8a35fa6d-3e3c-ab21-659b-78fef4318cf5
Jun 10 12:36:18 U109 kernel: [  602.770379] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.770386] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Global Exception on (GPC 0, TPC 0, SM 1): Multiple Warp Errors
Jun 10 12:36:18 U109 kernel: [  602.770391] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x5047b0=0xd0009 0x5047b4=0x4 0x5047a8=0xc81eb60 0x5047ac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.770435] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x504f30=0x0 0x504f34=0x20 0x504f28=0xc81eb60 0x504f2c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.770464] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 0, TPC 1, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.770469] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x504fb0=0x90009 0x504fb4=0x20 0x504fa8=0xc81eb60 0x504fac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.770513] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x505730=0x0 0x505734=0x20 0x505728=0xc81eb60 0x50572c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.770541] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 0, TPC 2, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.770546] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x5057b0=0xe0009 0x5057b4=0x20 0x5057a8=0xc81eb60 0x5057ac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.770589] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 0, TPC 3, SM 0): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.770593] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x505f30=0x10009 0x505f34=0x20 0x505f28=0xc81eb60 0x505f2c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.770622] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 0, TPC 3, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.770627] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x505fb0=0xe0009 0x505fb4=0x20 0x505fa8=0xc81eb60 0x505fac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.770669] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 0, TPC 4, SM 0): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.770674] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x506730=0xe0009 0x506734=0x20 0x506728=0xc81eb60 0x50672c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.770701] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 0, TPC 4, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.770706] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x5067b0=0x70009 0x5067b4=0x20 0x5067a8=0xc81eb60 0x5067ac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.770746] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 1, TPC 0, SM 0): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.770751] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x50c730=0xf0009 0x50c734=0x20 0x50c728=0xc81eb60 0x50c72c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.770777] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x50c7b0=0x0 0x50c7b4=0x20 0x50c7a8=0xc81eb60 0x50c7ac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.770816] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x50cf30=0x0 0x50cf34=0x20 0x50cf28=0xc81eb60 0x50cf2c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.770841] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 1, TPC 1, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.770846] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Global Exception on (GPC 1, TPC 1, SM 1): Multiple Warp Errors
Jun 10 12:36:18 U109 kernel: [  602.770850] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x50cfb0=0x50009 0x50cfb4=0x24 0x50cfa8=0xc81eb60 0x50cfac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.770890] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x50d730=0x0 0x50d734=0x20 0x50d728=0xc81eb60 0x50d72c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.770915] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 1, TPC 2, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.770920] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Global Exception on (GPC 1, TPC 2, SM 1): Multiple Warp Errors
Jun 10 12:36:18 U109 kernel: [  602.770924] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x50d7b0=0xd0009 0x50d7b4=0x24 0x50d7a8=0xc81eb60 0x50d7ac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.770963] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 1, TPC 3, SM 0): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.770967] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Global Exception on (GPC 1, TPC 3, SM 0): Multiple Warp Errors
Jun 10 12:36:18 U109 kernel: [  602.770972] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x50df30=0x9 0x50df34=0x24 0x50df28=0xc81eb60 0x50df2c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.770996] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 1, TPC 3, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.771001] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x50dfb0=0xe0009 0x50dfb4=0x20 0x50dfa8=0xc81eb60 0x50dfac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.771041] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x50e730=0x0 0x50e734=0x20 0x50e728=0xc81eb60 0x50e72c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.771066] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 1, TPC 4, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.771070] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x50e7b0=0x80009 0x50e7b4=0x20 0x50e7a8=0xc81eb60 0x50e7ac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.771110] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x50ef30=0x0 0x50ef34=0x20 0x50ef28=0xc81eb60 0x50ef2c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.771135] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 1, TPC 5, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.771140] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x50efb0=0x10009 0x50efb4=0x20 0x50efa8=0xc81eb60 0x50efac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.771181] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x514730=0x0 0x514734=0x20 0x514728=0xc81eb60 0x51472c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.771206] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x5147b0=0x0 0x5147b4=0x20 0x5147a8=0xc81eb60 0x5147ac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.771245] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 2, TPC 1, SM 0): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.771250] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x514f30=0x70009 0x514f34=0x20 0x514f28=0xc81eb60 0x514f2c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.771275] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 2, TPC 1, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.771298] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x514fb0=0xd0009 0x514fb4=0x20 0x514fa8=0xc81eb60 0x514fac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.771356] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x515730=0x0 0x515734=0x20 0x515728=0xc81eb60 0x51572c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.771398] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 2, TPC 2, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.771421] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Global Exception on (GPC 2, TPC 2, SM 1): Multiple Warp Errors
Jun 10 12:36:18 U109 kernel: [  602.771443] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x5157b0=0x20009 0x5157b4=0x24 0x5157a8=0xc81eb60 0x5157ac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.771501] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x515f30=0x0 0x515f34=0x20 0x515f28=0xc81eb60 0x515f2c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.771543] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 2, TPC 3, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.771566] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x515fb0=0x10009 0x515fb4=0x20 0x515fa8=0xc81eb60 0x515fac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.771623] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x516730=0x0 0x516734=0x20 0x516728=0xc81eb60 0x51672c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.771666] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 2, TPC 4, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.771689] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x5167b0=0xc0009 0x5167b4=0x20 0x5167a8=0xc81eb60 0x5167ac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.771790] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 2, TPC 5, SM 0): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.771816] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x516f30=0x150009 0x516f34=0x20 0x516f28=0xc81eb60 0x516f2c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.771879] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 2, TPC 5, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.771914] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x516fb0=0x70009 0x516fb4=0x20 0x516fa8=0xc81eb60 0x516fac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.772000] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 3, TPC 0, SM 0): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.772028] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x51c730=0xc0009 0x51c734=0x20 0x51c728=0xc81eb60 0x51c72c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.772114] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x51c7b0=0x0 0x51c7b4=0x20 0x51c7a8=0xc81eb60 0x51c7ac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.772222] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x51cf30=0x0 0x51cf34=0x20 0x51cf28=0xc81eb60 0x51cf2c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.772278] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 3, TPC 1, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.772304] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Global Exception on (GPC 3, TPC 1, SM 1): Multiple Warp Errors
Jun 10 12:36:18 U109 kernel: [  602.772330] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x51cfb0=0xc0009 0x51cfb4=0x24 0x51cfa8=0xc81eb60 0x51cfac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.772403] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 3, TPC 2, SM 0): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.772430] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x51d730=0x70009 0x51d734=0x20 0x51d728=0xc81eb60 0x51d72c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.772485] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 3, TPC 2, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.772512] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x51d7b0=0xf0009 0x51d7b4=0x20 0x51d7a8=0xc81eb60 0x51d7ac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.772587] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x51df30=0x0 0x51df34=0x20 0x51df28=0xc81eb60 0x51df2c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.772641] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 3, TPC 3, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.772668] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x51dfb0=0x50009 0x51dfb4=0x20 0x51dfa8=0xc81eb60 0x51dfac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.772743] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x51e730=0x0 0x51e734=0x20 0x51e728=0xc81eb60 0x51e72c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.772798] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 3, TPC 4, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.772825] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x51e7b0=0xd0009 0x51e7b4=0x20 0x51e7a8=0xc81eb60 0x51e7ac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.772898] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x51ef30=0x0 0x51ef34=0x20 0x51ef28=0xc81eb60 0x51ef2c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.772954] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 3, TPC 5, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.772980] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x51efb0=0xa0009 0x51efb4=0x20 0x51efa8=0xc81eb60 0x51efac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.773058] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x524730=0x0 0x524734=0x20 0x524728=0xc81eb60 0x52472c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.773113] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 4, TPC 0, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.773140] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x5247b0=0x90009 0x5247b4=0x20 0x5247a8=0xc81eb60 0x5247ac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.773214] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 4, TPC 1, SM 0): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.773241] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x524f30=0x1a0009 0x524f34=0x20 0x524f28=0xc81eb60 0x524f2c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.773296] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 4, TPC 1, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.773322] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x524fb0=0xe0009 0x524fb4=0x20 0x524fa8=0xc81eb60 0x524fac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.773395] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 4, TPC 2, SM 0): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.773422] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x525730=0x80009 0x525734=0x20 0x525728=0xc81eb60 0x52572c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.773477] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 4, TPC 2, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.773503] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Global Exception on (GPC 4, TPC 2, SM 1): Multiple Warp Errors
Jun 10 12:36:18 U109 kernel: [  602.773529] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x5257b0=0x20009 0x5257b4=0x24 0x5257a8=0xc81eb60 0x5257ac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.773603] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x525f30=0x0 0x525f34=0x20 0x525f28=0xc81eb60 0x525f2c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.773659] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 4, TPC 3, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.773685] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Global Exception on (GPC 4, TPC 3, SM 1): Multiple Warp Errors
Jun 10 12:36:18 U109 kernel: [  602.773710] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x525fb0=0xa0009 0x525fb4=0x24 0x525fa8=0xc81eb60 0x525fac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.773785] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x526730=0x0 0x526734=0x20 0x526728=0xc81eb60 0x52672c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.773841] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 4, TPC 4, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.773867] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x5267b0=0xe0009 0x5267b4=0x20 0x5267a8=0xc81eb60 0x5267ac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.773941] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 4, TPC 5, SM 0): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.773968] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Global Exception on (GPC 4, TPC 5, SM 0): Multiple Warp Errors
Jun 10 12:36:18 U109 kernel: [  602.773993] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x526f30=0x50009 0x526f34=0x24 0x526f28=0xc81eb60 0x526f2c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.774049] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 4, TPC 5, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.774075] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Global Exception on (GPC 4, TPC 5, SM 1): Multiple Warp Errors
Jun 10 12:36:18 U109 kernel: [  602.774101] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x526fb0=0x80009 0x526fb4=0x24 0x526fa8=0xc81eb60 0x526fac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.774177] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x52c730=0x0 0x52c734=0x20 0x52c728=0xc81eb60 0x52c72c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.774234] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x52c7b0=0x0 0x52c7b4=0x20 0x52c7a8=0xc81eb60 0x52c7ac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.774309] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x52cf30=0x0 0x52cf34=0x20 0x52cf28=0xc81eb60 0x52cf2c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.774364] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 5, TPC 1, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.774391] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Global Exception on (GPC 5, TPC 1, SM 1): Multiple Warp Errors
Jun 10 12:36:18 U109 kernel: [  602.774416] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x52cfb0=0xd0009 0x52cfb4=0x24 0x52cfa8=0xc81eb60 0x52cfac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.774490] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x52d730=0x0 0x52d734=0x20 0x52d728=0xc81eb60 0x52d72c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.774544] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 5, TPC 2, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.774570] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Global Exception on (GPC 5, TPC 2, SM 1): Multiple Warp Errors
Jun 10 12:36:18 U109 kernel: [  602.774594] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x52d7b0=0xc0009 0x52d7b4=0x24 0x52d7a8=0xc81eb60 0x52d7ac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.774654] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x52df30=0x0 0x52df34=0x20 0x52df28=0xc81eb60 0x52df2c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.774696] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 5, TPC 3, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.774719] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x52dfb0=0xe0009 0x52dfb4=0x20 0x52dfa8=0xc81eb60 0x52dfac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.774776] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x52e730=0x0 0x52e734=0x20 0x52e728=0xc81eb60 0x52e72c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.774818] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 5, TPC 4, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.774841] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x52e7b0=0x80009 0x52e7b4=0x20 0x52e7a8=0xc81eb60 0x52e7ac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.774898] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x52ef30=0x0 0x52ef34=0x20 0x52ef28=0xc81eb60 0x52ef2c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.774940] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 5, TPC 5, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.774963] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Global Exception on (GPC 5, TPC 5, SM 1): Multiple Warp Errors
Jun 10 12:36:18 U109 kernel: [  602.774985] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x52efb0=0x40009 0x52efb4=0x24 0x52efa8=0xc81eb60 0x52efac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.775044] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x534730=0x0 0x534734=0x20 0x534728=0xc81eb60 0x53472c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.775086] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 6, TPC 0, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.775109] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x5347b0=0xe0009 0x5347b4=0x20 0x5347a8=0xc81eb60 0x5347ac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.775166] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x534f30=0x0 0x534f34=0x20 0x534f28=0xc81eb60 0x534f2c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.775208] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 6, TPC 1, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.775231] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Global Exception on (GPC 6, TPC 1, SM 1): Multiple Warp Errors
Jun 10 12:36:18 U109 kernel: [  602.775253] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x534fb0=0x50009 0x534fb4=0x24 0x534fa8=0xc81eb60 0x534fac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.775310] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x535730=0x0 0x535734=0x20 0x535728=0xc81eb60 0x53572c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.775352] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 6, TPC 2, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.775375] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x5357b0=0x20009 0x5357b4=0x20 0x5357a8=0xc81eb60 0x5357ac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.775432] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x535f30=0x0 0x535f34=0x20 0x535f28=0xc81eb60 0x535f2c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.775474] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 6, TPC 3, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.775496] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Global Exception on (GPC 6, TPC 3, SM 1): Multiple Warp Errors
Jun 10 12:36:18 U109 kernel: [  602.775518] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x535fb0=0xb0009 0x535fb4=0x24 0x535fa8=0xc81eb60 0x535fac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.775575] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x536730=0x0 0x536734=0x20 0x536728=0xc81eb60 0x53672c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.775617] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 6, TPC 4, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.775639] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x5367b0=0x70009 0x5367b4=0x20 0x5367a8=0xc81eb60 0x5367ac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.775695] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x536f30=0x0 0x536f34=0x20 0x536f28=0xc81eb60 0x536f2c=0x1174
Jun 10 12:36:18 U109 kernel: [  602.775753] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 6, TPC 5, SM 1): Illegal Instruction Encoding
Jun 10 12:36:18 U109 kernel: [  602.775788] NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics Exception: ESR 0x536fb0=0xa0009 0x536fb4=0x20 0x536fa8=0xc81eb60 0x536fac=0x1174
Jun 10 12:36:18 U109 kernel: [  602.788301] NVRM: Xid (PCI:0000:01:00): 13, pid=3036, Graphics Exception: ChID 0018, Class 0000c7c0, Offset 00000000, Data 00000000
Jun 10 12:36:20 U109 kernel: [  604.235706] NVRM: Xid (PCI:0000:01:00): 62, pid=538, 0000(0000) 00000000 00000000
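
A log this long is easier to read once it is summarized. The following is a minimal sketch, not an NVIDIA tool, that counts Xid codes and lists which SMs reported a warp exception. The regular expressions assume the exact line layout shown above, and the names `summarize_xid`, `XID_RE` and `SM_RE` are made up for this example.

```python
import re
from collections import Counter

# Assumed layout: "NVRM: Xid (PCI:...): <code>, pid=<pid>, <message>", as in the log above.
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<xid>\d+), pid=(?P<pid>\d+), (?P<msg>.*)"
)
# Location and reason part of the "SM Warp Exception" / "SM Global Exception" messages.
SM_RE = re.compile(r"on \(GPC (\d+), TPC (\d+), SM (\d+)\): (.+)")


def summarize_xid(lines):
    """Return (count per Xid code, count per (GPC, TPC, SM, reason) warp exception)."""
    xid_counts = Counter()
    sm_errors = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if not match:
            continue  # not an Xid line
        xid_counts[int(match.group("xid"))] += 1
        msg = match.group("msg")
        location = SM_RE.search(msg)
        if location and "Warp Exception" in msg:
            gpc, tpc, sm, reason = location.groups()
            sm_errors[(int(gpc), int(tpc), int(sm), reason)] += 1
    return xid_counts, sm_errors


# Three lines copied from the log above, with the timestamp prefix shortened.
sample = [
    "kernel: NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 1): Illegal Instruction Encoding",
    "kernel: NVRM: Xid (PCI:0000:01:00): 13, pid=538, Graphics SM Global Exception on (GPC 0, TPC 0, SM 1): Multiple Warp Errors",
    "kernel: NVRM: Xid (PCI:0000:01:00): 62, pid=538, 0000(0000) 00000000 00000000",
]
xid_counts, sm_errors = summarize_xid(sample)
print(dict(xid_counts))
print(dict(sm_errors))
```

To run it on a real log, pass an open file instead of `sample`, for example `summarize_xid(open("/var/log/syslog"))`; the path is an assumption and depends on the distribution.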

After this, nvidia-smi fails until I disconnect power for 30 min and then reboot. Then nvidia-smi looks correct, but running the program (diffudiver.py from the htoyryla/minidiffusion repository on GitHub) results in a crash again.

On another machine with practically the same setup, the program runs fine (as it had done on this one as well until today).

Nvidia bug report after boot

nvidia-bug-report.log.gz (375.2 KB)

Nvidia bug report after experiencing an illegal instruction

nvidia-bug-report.log 3.gz (444.9 KB)

The GPU goes into an error state, which I suspect is caused by the memory overheating. Please monitor temperatures and check airflow.
Otherwise, the GPU or its memory might be faulty, or the heatsink might need repasting.
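
One way to monitor temperatures is to poll nvidia-smi from a script. This is a rough sketch, assuming nvidia-smi is on the PATH; the helper names `parse_row` and `read_gpu` are invented for the example. Note that `temperature.memory` may come back as N/A on cards that do not expose a memory sensor, in which case the core temperature alone says little about memory overheating.

```python
import subprocess

# Query fields listed by "nvidia-smi --help-query-gpu".
FIELDS = ["temperature.gpu", "temperature.memory", "power.draw"]


def parse_row(row):
    """Turn one CSV row from nvidia-smi into a dict; unsupported fields become None."""
    values = [value.strip() for value in row.split(",")]
    readings = {}
    for name, value in zip(FIELDS, values):
        try:
            readings[name] = float(value)
        except ValueError:
            readings[name] = None  # e.g. "[N/A]" when the sensor is not exposed
    return readings


def read_gpu(index=0):
    """Query nvidia-smi once for the GPU with the given index."""
    cmd = [
        "nvidia-smi",
        f"--id={index}",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_row(result.stdout.strip().splitlines()[0])
```

Calling `read_gpu()` every few seconds while the workload runs, and logging the result with a timestamp, makes it possible to line up temperature and power draw with the time of a crash in syslog.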

I see the same thing on a 3090 Ti running at ~53 degrees C. I get ~10 crashes per day.

I can run the same code on a 3070 (running at 70 degrees) for days at a time and I’ve never seen a crash.

Does anyone know what this is?

In my case, the card was broken and I got it replaced.

Thanks for following up! I suspected a hardware problem and am trying to do a warranty return but I can’t repro the crash on demand.

Try using gpu-burn to reproduce a hw fault.
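
If the fault only shows up occasionally, it can help to run the stress test or the failing program in a loop and record when it first fails. The sketch below is a generic harness, not part of gpu-burn; `run_until_failure` is a made-up name, and the command is whatever you want to repeat.

```python
import subprocess
import sys
import time


def run_until_failure(cmd, max_runs=100, pause_s=0.0):
    """Run cmd repeatedly; return (run_number, returncode) of the first failure, else None."""
    for run in range(1, max_runs + 1):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return run, result.returncode
        time.sleep(pause_s)  # optional cool-down between runs
    return None


# Stand-in command that always succeeds; replace it with the real workload,
# e.g. [sys.executable, "diffudiver.py"] with whatever arguments it needs.
outcome = run_until_failure([sys.executable, "-c", "pass"], max_runs=2)
print("first failure:", outcome)
```

A non-zero exit together with new Xid lines in syslog at the same time gives a concrete record to attach to a warranty claim.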

Maybe the PyTorch version is not compatible. I have worked on the same GPU for video analytics in an Anaconda environment, and it works with PyTorch 1.9.1; my other specs were cudatoolkit 11.1.x and cuDNN 8.2.x. Try that combination; it may work.

I tried multiple envs, each with a different combination of versions, and also different Nvidia drivers. Also, the setup had worked correctly for some time.

I got it replaced so no need to speculate anymore.