I have a server with 8 × A6000 GPUs. When I run my PyTorch script and try to send SIGKILL to the process (because a core dump happened…), nvidia-smi shows an error.
Here is the dmesg log:
[4397357.620899] show_signal_msg: 8 callbacks suppressed
[4397357.620906] pt_main_thread[3780598]: segfault at 4e4 ip 00007fb10aaee578 sp 00007fad2fff4b00 error 4
[4397357.620909] pt_main_thread[3780634]: segfault at 618 ip 00007fb10aaee578 sp 00007facfd7e7b00 error 4
[4397357.620917] Code: 0f 6f d0 66 0f 74 d1 66 0f d7 f2 85 f6 0f 84 97 00 00 00 4d 8b 94 24 b0 00 00 00 31 d2 f3 0f bc d6 48 63 d2 48 01 ca 4c 21 ca <41> 3b 3c 92 75 6a 48 83 c3 04 49 39 dd 0f 85 75 ff ff ff 48 8b 7d
[4397357.620918] Code: 0f 6f d0 66 0f 74 d1 66 0f d7 f2 85 f6 0f 84 97 00 00 00 4d 8b 94 24 b0 00 00 00 31 d2 f3 0f bc d6 48 63 d2 48 01 ca 4c 21 ca <41> 3b 3c 92 75 6a 48 83 c3 04 49 39 dd 0f 85 75 ff ff ff 48 8b 7d
[4397436.093794] pt_main_thread[3790101]: segfault at fc ip 00007ff154aee578 sp 00007fed3c7d5b00 error 4
[4397436.093794] pt_main_thread[3790035]: segfault at 138 ip 00007ff154aee578 sp 00007fedd0f57b00 error 4
[4397436.093794] pt_main_thread[3790078]: segfault at 1cc ip 00007ff154aee578 sp 00007fed47fecb00 error 4
[4397436.093820] in _XLAC.cpython-38-x86_64-linux-gnu.so[7ff1529de000+bd85000]
[4397436.093820] in _XLAC.cpython-38-x86_64-linux-gnu.so[7ff1529de000+bd85000]
[4397436.093819] in _XLAC.cpython-38-x86_64-linux-gnu.so[7ff1529de000+bd85000]
[4397436.093830] Code: 0f 6f d0 66 0f 74 d1 66 0f d7 f2 85 f6 0f 84 97 00 00 00 4d 8b 94 24 b0 00 00 00 31 d2 f3 0f bc d6 48 63 d2 48 01 ca 4c 21 ca <41> 3b 3c 92 75 6a 48 83 c3 04 49 39 dd 0f 85 75 ff ff ff 48 8b 7d
[4397436.093832] Code: 0f 6f d0 66 0f 74 d1 66 0f d7 f2 85 f6 0f 84 97 00 00 00 4d 8b 94 24 b0 00 00 00 31 d2 f3 0f bc d6 48 63 d2 48 01 ca 4c 21 ca <41> 3b 3c 92 75 6a 48 83 c3 04 49 39 dd 0f 85 75 ff ff ff 48 8b 7d
[4397436.093832] Code: 0f 6f d0 66 0f 74 d1 66 0f d7 f2 85 f6 0f 84 97 00 00 00 4d 8b 94 24 b0 00 00 00 31 d2 f3 0f bc d6 48 63 d2 48 01 ca 4c 21 ca <41> 3b 3c 92 75 6a 48 83 c3 04 49 39 dd 0f 85 75 ff ff ff 48 8b 7d
[4397657.178006] pt_main_thread[3817775]: segfault at aa8 ip 00007f68610ee62e sp 00007f6449fd8b00 error 6
[4397657.178020] Code: 66 0f d7 d1 85 d2 0f 84 1c 01 00 00 48 8b 7d 80 48 89 c6 e8 54 ca fa fd 8b 13 49 8b 8c 24 b0 00 00 00 4d 8b 84 24 a8 00 00 00 <89> 14 81 e9 48 ff ff ff 49 8b b4 24 a8 00 00 00 49 8b 94 24 b0 00
[4398300.174679] pcieport 0000:32:02.0: pciehp: Slot(102): Card not present
[4398300.826730] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[4398300.826757] NVRM: GPU at PCI:0000:35:00: GPU-e271a214-eaca-ee28-ed3b-24381651d261
[4398300.826780] {1}[Hardware Error]: event severity: recoverable
[4398300.826783] NVRM: GPU Board Serial Number: 1322823068859
[4398300.826814] {1}[Hardware Error]: Error 0, type: recoverable
[4398300.826815] NVRM: Xid (PCI:0000:35:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[4398300.826846] {1}[Hardware Error]: section_type: PCIe error
[4398300.826848] NVRM: GPU 0000:35:00.0: GPU has fallen off the bus.
[4398300.826875] {1}[Hardware Error]: port_type: 4, root port
[4398300.826876] NVRM: GPU 0000:35:00.0: GPU serial number is 1322823068859.
[4398300.826906] {1}[Hardware Error]: version: 3.0
[4398300.826932] {1}[Hardware Error]: command: 0x0547, status: 0x4010
[4398300.826965] {1}[Hardware Error]: device_id: 0000:30:02.0
[4398300.826995] {1}[Hardware Error]: slot: 1
[4398300.827017] {1}[Hardware Error]: secondary_bus: 0x31
[4398300.827044] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x347a
[4398300.827078] {1}[Hardware Error]: class_code: 060400
[4398300.827105] {1}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0013
[4398300.827195] pcieport 0000:30:02.0: AER: aer_status: 0x00004000, aer_mask: 0x00100020
[4398300.827232] pcieport 0000:30:02.0: [14] CmpltTO (First)
[4398300.827264] pcieport 0000:30:02.0: AER: aer_layer=Transaction Layer, aer_agent=Requester ID
[4398300.827299] pcieport 0000:30:02.0: AER: aer_uncor_severity: 0x00463010
[4398300.827426] nvidia 0000:34:00.0: AER: can't recover (no error_detected callback)
[4398300.827435] snd_hda_intel 0000:34:00.1: AER: can't recover (no error_detected callback)
[4398300.827462] nvidia 0000:35:00.0: AER: can't recover (no error_detected callback)
[4398301.318436] snd_hda_codec_hdmi hdaudioC1D0: Unable to sync register 0x4f0800. -5
[4398301.318465] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD buf size -1
[4398301.318470] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD buf size -1
[4398301.318475] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD buf size -1
[4398301.318478] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD buf size -1
[4398301.318483] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD buf size -1
[4398301.318487] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD buf size -1
[4398301.318491] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD buf size -1
[4398301.318494] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD buf size -1
[4398301.318498] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD buf size -1
[4398301.318502] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD buf size -1
[4398301.318506] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD buf size -1
[4398301.318509] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD buf size -1
[4398301.318513] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD buf size -1
[4398301.318517] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD buf size -1
[4398301.318520] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD buf size -1
[4398301.318524] snd_hda_codec_hdmi hdaudioC1D0: HDMI: invalid ELD buf size -1
[4398301.518536] pci 0000:35:00.1: AER: can't recover (no error_detected callback)
[4398301.518573] nvidia 0000:36:00.0: AER: can't recover (no error_detected callback)
[4398301.518577] snd_hda_intel 0000:36:00.1: AER: can't recover (no error_detected callback)
[4398301.518595] nvidia 0000:37:00.0: AER: can't recover (no error_detected callback)
[4398301.518598] snd_hda_intel 0000:37:00.1: AER: can't recover (no error_detected callback)
[4398301.518601] switchtec 0000:31:00.1: AER: can't recover (no error_detected callback)
[4398301.518857] pcieport 0000:30:02.0: AER: device recovery failed
[4398301.518920] NVRM: Attempting to remove device 0000:35:00.0 with non-zero usage count!
Here is the log file generated by nvidia-bug-report.sh:
nvidia-bug-report.log.gz (4.0 MB)
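
In case it helps anyone triaging a similar log, here is a minimal sketch of how the `NVRM: Xid` lines could be pulled out of dmesg output like the one above. The regular expression is only an assumption based on the line format in this particular log, and `find_xid_events` is a hypothetical helper name, not part of any NVIDIA tool.

```python
import re

# Assumed line format, taken from the log in this post:
# [4398300.826815] NVRM: Xid (PCI:0000:35:00): 79, pid='<unknown>', name=<unknown>, ...
XID_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid "
    r"\(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+),\s*(?P<rest>.*)"
)


def find_xid_events(lines):
    """Return (timestamp, pci_address, xid_code, message) for each Xid line."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append(
                (
                    float(match.group("ts")),
                    match.group("pci"),
                    int(match.group("xid")),
                    match.group("rest").strip(),
                )
            )
    return events


# Two lines copied from the log above; only the second is an Xid event.
sample = [
    "[4398300.826757] NVRM: GPU at PCI:0000:35:00: GPU-e271a214-eaca-ee28-ed3b-24381651d261",
    "[4398300.826815] NVRM: Xid (PCI:0000:35:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.",
]

for ts, pci, xid, message in find_xid_events(sample):
    print(f"t={ts} gpu={pci} xid={xid}: {message}")
```

The same function could be fed the lines of a saved dmesg dump to list which GPUs reported an Xid and when.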