Nsight connection always causes NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus

Whenever I try to set up an Nsight connection to the GPU, either locally or remotely via ssh, it always fails with the following dump.

[ 923.840870] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 237
[ 938.117729] NVRM: GPU at PCI:0000:01:00: GPU-530e75e4-0bf0-b59b-7145-986190ed5a7e
[ 938.117733] NVRM: GPU Board Serial Number:
[ 938.117735] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[ 938.117741] NVRM: GPU at 00000000:01:00.0 has fallen off the bus.
[ 938.117741] NVRM: GPU is on Board .
[ 938.118971] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.

nvidia-bug-report.log.gz (1.3 MB)
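
For anyone who wants to spot these events automatically, here is a minimal Python sketch that pulls Xid lines out of kernel log text. The function name `find_xid_events` and the regular expression are illustrative assumptions based only on the line format shown in the dump above, not part of any NVIDIA tool.

```python
import re

# Matches kernel log lines of the form shown in the dump above, e.g.
# [ 938.117735] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
XID_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: Xid \((?P<pci>[^)]+)\): (?P<code>\d+),\s*(?P<msg>.*)$"
)


def find_xid_events(log_text):
    """Return a list of (timestamp, pci_id, xid_code, message) tuples."""
    events = []
    for line in log_text.splitlines():
        m = XID_RE.match(line.strip())
        if m:
            events.append((float(m["ts"]), m["pci"], int(m["code"]), m["msg"]))
    return events


# Sample input copied from the dump in this thread.
sample = """\
[ 923.840870] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 237
[ 938.117735] NVRM: Xid (PCI:0000:01:00): 79, GPU has fallen off the bus.
[ 938.117741] NVRM: GPU at 00000000:01:00.0 has fallen off the bus.
"""

for ts, pci, code, msg in find_xid_events(sample):
    print(f"{ts:.6f}s  {pci}  Xid {code}: {msg}")
```

To use it on a live system, pass it the text of `dmesg` (which may need root) instead of the sample string.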

You don’t seem to have a monitor connected, so the X server is continuously starting and stopping in quick succession. Please add
Option "AllowEmptyInitialConfiguration" "true"
to the device section of your xorg.conf.

I have set AllowEmptyInitialConfiguration to True. After a reboot, I still hit the same issue as before.

The following is my xorg.conf as a reference.

cat /etc/X11/xorg.conf

# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig: version 418.87.00

Section "ServerLayout"
Identifier "Layout0"
Screen 0 "Screen0" 0 0
InputDevice "Keyboard0" "CoreKeyboard"
InputDevice "Mouse0" "CorePointer"
EndSection

Section "Files"
EndSection

Section "Module"
Load "dbe"
Load "extmod"
Load "type1"
Load "freetype"
Load "glx"
EndSection

Section "InputDevice"

# generated from default
Identifier     "Mouse0"
Driver         "mouse"
Option         "Protocol" "auto"
Option         "Device" "/dev/psaux"
Option         "Emulate3Buttons" "no"
Option         "ZAxisMapping" "4 5"

EndSection

Section "InputDevice"

# generated from default
Identifier     "Keyboard0"
Driver         "kbd"

EndSection

Section "Monitor"
Identifier "Monitor0"
VendorName "Unknown"
ModelName "Unknown"
Option "DPMS"
EndSection

Section "Device"
Identifier "Device0"
Driver "nvidia"
VendorName "NVIDIA Corporation"
EndSection

Section "Screen"
Identifier "Screen0"
Device "Device0"
Monitor "Monitor0"
DefaultDepth 24
Option "AllowEmptyInitialConfiguration" "True"
SubSection "Display"
Depth 24
EndSubSection
EndSection

nvidia-bug-report.log.gz (1.2 MB)
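
Since the advice above mentions the device section while the posted file sets the option in the Screen section, a quick way to see where an option actually lives is sketched below. The helper `find_option` is a hypothetical name, and the parser is deliberately simplified (it only tracks top-level `Section` blocks and matches option names case-insensitively). As far as I know, the NVIDIA driver README says its X config options may go in either the Screen or the Device section, so treat this as a sanity check rather than a diagnosis.

```python
import re

SECTION_RE = re.compile(r'^\s*Section\s+"([^"]+)"', re.IGNORECASE)
END_RE = re.compile(r'^\s*EndSection\b', re.IGNORECASE)
OPTION_RE = re.compile(r'^\s*Option\s+"([^"]+)"(?:\s+"([^"]*)")?', re.IGNORECASE)


def normalize_quotes(text):
    # Forum software often turns straight quotes into curly ones; undo that.
    return text.replace("\u201c", '"').replace("\u201d", '"')


def find_option(conf_text, option_name):
    """Return (section_name, value) pairs for every place option_name is set."""
    hits = []
    section = None
    for raw in normalize_quotes(conf_text).splitlines():
        line = raw.split("#", 1)[0]  # drop trailing comments
        m = SECTION_RE.match(line)
        if m:
            section = m.group(1)
            continue
        if END_RE.match(line):
            section = None
            continue
        m = OPTION_RE.match(line)
        if m and section is not None and m.group(1).lower() == option_name.lower():
            # Options inside a SubSection are attributed to the enclosing Section.
            hits.append((section, m.group(2)))
    return hits


# Shortened excerpt of the xorg.conf posted above (curly quotes left in on purpose).
sample_conf = """
Section "Device"
Identifier "Device0"
Driver "nvidia"
EndSection

Section "Screen"
Identifier "Screen0"
Option \u201cAllowEmptyInitialConfiguration\u201d \u201cTrue\u201d
SubSection "Display"
Depth 24
EndSubSection
EndSection
"""

print(find_option(sample_conf, "AllowEmptyInitialConfiguration"))
```

On a real system you would read `/etc/X11/xorg.conf` and pass its contents in place of `sample_conf`.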

Please also create an nvidia-bug-report.log in the working state, i.e. before running Nsight.

Attached is a log generated just after rebooting, without connecting the Nsight tools. nvidia-bug-report.log.poweron.gz (735.5 KB)

I also observed the following when I set my card to persistence mode:

$ sudo nvidia-smi --persistence-mode=1
Enabled persistence mode for GPU 00000000:01:00.0.
All done.

Just a few seconds later, the GPU was lost, as seen in kmsg.
Here is a log for that scenario. nvidia-bug-report.log.gz (821.6 KB)

On a desktop system, Xid 79 is caused 99% of the time by either overheating or insufficient power. The temperature looks fine. The 1050 is bus-powered, which leads me to the question: does the PCIe slot you’ve plugged it into support 75 W? This should be standard, but some mainboards have slots that supply less.
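
To keep an eye on temperature and power while reproducing the problem, something like the sketch below can help. It parses one line of output from `nvidia-smi --query-gpu=name,temperature.gpu,power.draw,power.limit --format=csv,noheader,nounits`. The helper names and the sample values are made up for illustration and are not taken from the bug reports in this thread. Some boards report bracketed placeholders such as `[N/A]` for the power fields, and a software reading cannot prove what the slot really delivers, so treat this as a rough check only.

```python
import subprocess

QUERY = "name,temperature.gpu,power.draw,power.limit"


def parse_gpu_status(csv_line):
    """Parse one 'name, temperature, power.draw, power.limit' CSV line."""
    name, temp, draw, limit = [field.strip() for field in csv_line.split(",")]

    def to_float(value):
        # Bracketed placeholders such as "[N/A]" mean the field is not reported.
        try:
            return float(value)
        except ValueError:
            return None

    return {
        "name": name,
        "temp_c": to_float(temp),
        "power_draw_w": to_float(draw),
        "power_limit_w": to_float(limit),
    }


def power_headroom(status):
    """Return limit minus draw in watts, or None if either value is missing."""
    draw, limit = status["power_draw_w"], status["power_limit_w"]
    if draw is None or limit is None:
        return None
    return limit - draw


def query_gpu_status():
    """Run nvidia-smi and parse its first output line (requires the driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_status(out.splitlines()[0])


# Hypothetical sample line, not a real measurement from this thread.
status = parse_gpu_status("Example GPU, 41, 52.5, 75.0")
print(status, power_headroom(status))
```

Calling `query_gpu_status()` in a loop just before connecting Nsight would show whether the readings change right before the GPU drops off the bus.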

Thanks for the info, I will check the power supply first.

While you’re at it, you might check if reseating the card in its slot helps, i.e. pull it out, put it properly back in.