My computer has eight GPU cards. If they run for several days, some of them will fall down.
If I restart the computer, they work well again. I haven’t found any function to reset the device.
Does anyone know how to reset the device? Thank you for your kindness.
CUDA doesn’t provide any explicit function for resetting the device… the only way to do that would be the good ol’ manual reboot.
I think this type of function could be really useful…
In my experience (on the Tesla C870, Quadro FX 5600…), my GPU cards didn’t exactly fall down (…excuse the pun :rolleyes: ), but I did run into really strange behavior after using them for prolonged durations…
especially with device memory… there I have observed some really weird behavior (even after making sure I had freed all the allocated memory within my code!!)
But a manual reboot fixes all this!! This is kind of worrying, and it’s a serious hit on the reliability of using GPUs!!
Has anyone else faced similar problems?
Any comments/solutions/possibilities for this “strange behavior” from NVIDIA?
I have never run into this. You might want to run memtestG80 on your system just to be sure you don’t have problematic memory. (You might need to specify a large number of iterations if this problem only shows up after a while.)
In the case of the 8 GPU system, do you know what your card temperature is? You might be accumulating errors from overheating cards during long jobs.
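If it helps, here is a minimal sketch of how you could log card temperatures during a long job. It assumes your driver ships an `nvidia-smi` that supports the `--query-gpu` option, and the 85 °C warning threshold is an arbitrary illustrative value, not an NVIDIA recommendation. The function names are made up for this example.

```python
import subprocess

# Arbitrary illustrative threshold (assumption), not an official limit.
WARN_CELSIUS = 85


def parse_temperatures(csv_text):
    """Parse 'index, temperature' CSV lines into {gpu_index: temp_celsius}."""
    temps = {}
    for line in csv_text.strip().splitlines():
        index, temp = (field.strip() for field in line.split(","))
        if not temp.isdigit():
            # Some cards report no temperature sensor; skip those lines.
            continue
        temps[int(index)] = int(temp)
    return temps


def read_temperatures():
    """Ask nvidia-smi for the current temperature of every GPU."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temperatures(result.stdout)


def hot_gpus(temps, limit=WARN_CELSIUS):
    """Return the sorted indices of GPUs at or above the limit."""
    return sorted(i for i, t in temps.items() if t >= limit)


if __name__ == "__main__":
    temps = read_temperatures()
    for index in sorted(temps):
        print(f"GPU {index}: {temps[index]} C")
    for index in hot_gpus(temps):
        print(f"warning: GPU {index} is at or above {WARN_CELSIUS} C")
```

Running this from cron or a loop every few minutes and keeping the output would show whether the cards that fail are also the ones that run hottest.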
The problem you describe also happens to me, and it has had me stuck for a long time. If I restart the computer, the problem disappears.
I think NVIDIA should improve the performance of GPUs and provide a function to reset the device instead of requiring a restart of the computer.
It’s too soon to blame NVIDIA. I think, as Seibert said, temperature could be the culprit.
Yeah, I think it would be a good idea to run memtestG80 and make sure nothing is wrong with my other hardware!!
But I have been trying to download the memtestG80 files (for 64-bit Linux) and it just doesn’t send me the confirmation mail… or I guess it’s just too slow…
Does anyone have the files? Then I could download them right away instead of having to wait for the simtk website… thanks in advance
Scratch that… I got it :)