Tesla P40 Unable to determine the device handle for GPU 0000:01:00.0: Unknown Error

Hi, I am using a Tesla P40. Last month I could run my PyTorch CUDA program, but when I ran it again today the GPU was lost. I have restarted many times. After every restart I can use the nvidia-smi command, but as soon as I run my PyTorch program the GPU is lost and I get "Unable to determine the device handle". I have reinstalled CUDA and the drivers many times, but it didn't help. I generated a log with sudo nvidia-bug-report.sh, but the log is complex and I can't find out what the problem is. Here is my log:
nvidia-bug-report.log (1.8 MB)
Can you help me?

It’s falling off the bus, I suspect due to overheating since you’re running it in a desktop mainboard/case. The Tesla needs additional cooling.
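
To confirm that the card really fell off the bus, the kernel log (or the nvidia-bug-report.log) can be searched for NVRM Xid messages; Xid 79 is the code NVIDIA documents as "GPU has fallen off the bus". Below is a minimal Python sketch, assuming the common "NVRM: Xid (PCI:...): <code>, ..." line layout. The exact wording varies between driver versions, and find_xid_events is a hypothetical helper name, not part of any NVIDIA tool.

```python
import re

# Assumed line layout, e.g.:
#   NVRM: Xid (PCI:0000:01:00): 79, pid=1234, GPU has fallen off the bus.
# The text after the code differs between driver versions, so only the
# PCI address and the numeric Xid code are captured.
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")


def find_xid_events(log_text):
    """Return a list of (pci_address, xid_code, full_line) tuples."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group(1), int(match.group(2)), line.strip()))
    return events


def fell_off_bus(events):
    """True if any event carries Xid 79 (GPU has fallen off the bus)."""
    return any(code == 79 for _, code, _ in events)
```

The function takes plain text, so it can be fed the output of dmesg or the contents of the bug report log read from disk.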

Hi, thanks so much. I was able to run my program months ago. We use a fan to cool the card, and the machine is in our computer room, where the temperature is below 20 degrees; we were able to run large GPU programs there. I am wondering whether this was caused by reinstalling the drivers. I also changed some mainboard settings in the past, but I can't remember what I set. Now the GPU falls off the bus while PyTorch is loading data: it hangs right at the start, during data loading. I don't think the temperature is high, since training has not started yet and it is only loading data. However, another program that uses little GPU memory, the PyTorch MNIST example, runs fine. Each time I restart the computer I can use the nvidia-smi command again. What do you mean by a desktop mainboard/case? Should I use service lightdm stop to stop the desktop? Are there any other tips for my case besides additional cooling? Are any mainboard settings needed?

You should monitor temperatures to find out if there’s something wrong with cooling, e.g.
nvidia-smi -q -l 1 -d TEMPERATURE >temperature.txt
will create a log that you can check after the crash.
A second possible cause is insufficient power, i.e. a failing PSU.
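
Since both temperature and power are suspects, it may help to log them together in CSV form and look at the last samples before the crash. One way to produce such a log is
nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw --format=csv,noheader,nounits -l 1 >gpu_log.csv
The following Python sketch summarizes a log of that shape. The file name gpu_log.csv and the helper names parse_gpu_log and summarize are assumptions made for illustration.

```python
import csv
import io


def parse_gpu_log(text):
    """Parse rows of 'timestamp, temperature.gpu, power.draw'.

    Rows that are truncated (e.g. the last line written before a crash)
    or contain non-numeric values such as '[N/A]' are skipped.
    """
    samples = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != 3:
            continue
        timestamp, temp, power = (field.strip() for field in row)
        try:
            samples.append((timestamp, float(temp), float(power)))
        except ValueError:
            continue
    return samples


def summarize(samples):
    """Return the last sample plus peak temperature and power, or None."""
    if not samples:
        return None
    hottest = max(samples, key=lambda s: s[1])
    return {
        "last": samples[-1],
        "max_temp": hottest[1],
        "max_temp_at": hottest[0],
        "max_power": max(s[2] for s in samples),
    }


if __name__ == "__main__":
    # gpu_log.csv is the hypothetical file written by the command above.
    with open("gpu_log.csv") as handle:
        print(summarize(parse_gpu_log(handle.read())))
```

A temperature that is still low in the last sample would point away from overheating and toward power delivery or the PCIe connection.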

Hi, thank you. There was a lot of dust stuck in the GPU card and the mainboard. I cleaned them, and now it seems to work normally.