GPU has fallen off the bus

I am rendering scenes in Blender through an automated Blender Python script that randomizes the scene and saves each render to disk. My GPU crashes with the error “Unable to determine the device handle for GPU” while the render script is running. Notably, Blender crashes at exactly the same point each time (when the script is rendering a scene for the 18th time in its render loop), and nvidia-smi throws the “Unable to determine the device handle for GPU” error at the same time.
Here’s the output of sudo data.txt (3.4 MB)

And this is the output of nvidia-debugdump --list

Found 1 NVIDIA devices
Error: nvmlDeviceGetHandleByIndex(): Unknown Error
FAILED to get details on GPU (0x0): Unknown Error

Please suggest any fixes for this.

@amrits @generix Please help!! Much appreciated.

You’re running into Xid 79, “GPU has fallen off the bus”. Please monitor GPU temperatures and check your PSU.
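To monitor temperature and power draw while the render loop runs, something like the sketch below could help. It assumes nvidia-smi is on the PATH; the function names, the log file name, and the one-second interval are made up for illustration. Each sample is flushed to disk right away so the last reading before the crash is kept.

```python
import subprocess
import time

# Fields requested from nvidia-smi, in the order they appear in each CSV line.
QUERY_FIELDS = ["timestamp", "temperature.gpu", "power.draw", "clocks.sm"]


def parse_sample(line):
    """Split one CSV line from nvidia-smi into a dict keyed by field name."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(QUERY_FIELDS, values))


def read_sample():
    """Query the first GPU once; raises CalledProcessError if nvidia-smi fails."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sample(result.stdout.splitlines()[0])


def log_samples(path, interval_s=1.0):
    """Append samples to a file until nvidia-smi stops answering."""
    with open(path, "a") as log:
        while True:
            try:
                sample = read_sample()
            except subprocess.CalledProcessError:
                # nvidia-smi failing here usually means the GPU is gone.
                log.write("nvidia-smi failed, GPU no longer reachable\n")
                log.flush()
                break
            log.write(", ".join(sample[f] for f in QUERY_FIELDS) + "\n")
            log.flush()  # keep the last reading even if the machine hangs
            time.sleep(interval_s)


# Hypothetical usage, started in a second terminal before the render:
# log_samples("gpu_samples.csv", interval_s=1.0)
```

A power draw close to the card's limit in the last lines before the failure would point at the PSU rather than at temperature.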

@generix I did. The GPU core and memory temperatures stayed below 42 deg C when the crash happened. I doubt it’s the PSU, as I am able to run the same Blender script to render on the same system under Windows (dual boot).

The Linux driver clocks more aggressively than the Windows driver. Please run nvidia-smi -lgc 300,1500 as root to limit the clocks, then run your Blender script again. If the GPU no longer falls off the bus with the clocks limited, the most likely cause is the PSU not being able to maintain power during usage spikes.
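If you want to script that test, a rough sketch is below. It must be run as root, and it assumes nvidia-smi is on the PATH; the helper names and the Blender command line in the usage comment (scene.blend, randomize_and_render.py) are placeholders, not taken from your setup. The clock lock is reset afterwards with nvidia-smi -rgc.

```python
import subprocess

# Resets the GPU clocks to their default behaviour.
RESET_CLOCKS_CMD = ["nvidia-smi", "-rgc"]


def lock_clocks_cmd(min_mhz, max_mhz):
    """Build the nvidia-smi command that locks GPU clocks to a MHz range."""
    return ["nvidia-smi", "-lgc", f"{min_mhz},{max_mhz}"]


def run_with_limited_clocks(render_cmd, min_mhz=300, max_mhz=1500):
    """Lock the clocks, run the render command, then reset the clocks.

    Returns the exit code of the render command.
    """
    subprocess.run(lock_clocks_cmd(min_mhz, max_mhz), check=True)
    try:
        return subprocess.run(render_cmd).returncode
    finally:
        # check=False: if the GPU fell off the bus, the reset will fail too.
        subprocess.run(RESET_CLOCKS_CMD, check=False)


# Hypothetical usage (file names are placeholders):
# run_with_limited_clocks(
#     ["blender", "-b", "scene.blend", "-P", "randomize_and_render.py"]
# )
```

If the render loop gets past the 18th scene with the clocks limited, that supports the power spike explanation.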


Okay! Thanks a lot!