overheating or what?


I left a torch7 nn training run on CIFAR-10 going last night when I got tired of waiting and went to bed. In the morning, the computation with cunn had stopped with an error:

cublas runtime error : an internal operation failed at /home/mattik/torch/extra/cutorch/lib/THC/THCBlas.cu:246
stack traceback:
[C]: in function ‘v’
/home/mattik/torch/install/share/lua/5.1/nn/THNN.lua:110: in function ‘SpatialConvolutionMM_updateOutput’
…ik/torch/install/share/lua/5.1/nn/SpatialConvolution.lua:79: in function <…ik/torch/install/share/lua/5.1/nn/SpatialConvolution.lua:76>

This is on an Asus GL553W laptop with a GTX960, running Ubuntu 16.04 and CUDA 7.5.
Is this overheating, or some other problem? The vents are clean. The fan was of course running very loudly, and the laptop was not running on battery. The display hadn’t glitched (it is probably driven by the Intel GPU), but every program using CUDA failed. A reboot helped, and things now work normally.



I do not see how this can be diagnosed based on the information provided. If every program using CUDA failed subsequently, it seems that corruption of the NVIDIA driver stack occurred.
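One thing that might narrow it down: the NVIDIA kernel driver reports GPU faults to the kernel log as lines containing `NVRM: Xid`. Below is a minimal sketch for pulling those lines out of kernel log text. The function names are made up for illustration, and it assumes `dmesg` is readable by your user. Since `dmesg` only covers the current boot, after a reboot you would have to feed it lines from a saved log instead (on Ubuntu, typically /var/log/kern.log).

```python
import re
import subprocess

# Matches e.g. "NVRM: Xid (PCI:0000:01:00): 79, ..." and captures the Xid code.
XID_PATTERN = re.compile(r"NVRM: Xid \([^)]*\): (\d+)")


def find_xid_events(lines):
    """Return (xid_code, line) for every kernel log line that reports an Xid."""
    events = []
    for line in lines:
        match = XID_PATTERN.search(line)
        if match:
            events.append((int(match.group(1)), line.rstrip("\n")))
    return events


def read_kernel_log():
    """Read the current boot's kernel log via dmesg (may need root on some systems)."""
    result = subprocess.run(
        ["dmesg"], stdout=subprocess.PIPE, universal_newlines=True, check=True
    )
    return result.stdout.splitlines()
```

For example, `find_xid_events(read_kernel_log())` lists any Xid reports from the current boot, and `find_xid_events(open("/var/log/kern.log"))` does the same for a saved log. An empty list does not prove the GPU was healthy, it only means the driver logged nothing.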

That happens very rarely in my experience, especially on Linux. Pretty much the only scenario I have seen is hitting the OS watchdog timer limit for CUDA kernels multiple times in quick succession, with certain driver versions.

Are you able to reproduce the failure?
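If you do try to reproduce it, logging the GPU temperature during the run would help confirm or rule out overheating. Here is a rough sketch that polls `nvidia-smi` and appends each sample to a file, flushing after every line so the log survives a crash. The function names, the log file name, and the 5-second interval are arbitrary choices of mine. It assumes `nvidia-smi` is on the PATH and that your GPU and driver report `temperature.gpu`; unsupported fields are stored as `None`.

```python
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "timestamp,temperature.gpu,utilization.gpu"


def _to_int(field):
    """Convert a numeric field, or return None for values like '[Not Supported]'."""
    try:
        return int(field)
    except ValueError:
        return None


def parse_gpu_sample(line):
    """Parse one CSV line produced by nvidia-smi for the QUERY fields above."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 3:
        raise ValueError("unexpected nvidia-smi output: %r" % line)
    return {
        "timestamp": parts[0],
        "temperature_c": _to_int(parts[1]),
        "utilization_pct": _to_int(parts[2]),
    }


def log_gpu_samples(path="gpu_temp.log", interval_s=5):
    """Append one tab-separated sample to `path` every `interval_s` seconds."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + QUERY,
        "--format=csv,noheader,nounits",
        "-l", str(interval_s),
    ]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, universal_newlines=True)
    with open(path, "a") as log:
        for line in proc.stdout:
            line = line.strip()
            if not line:
                continue
            sample = parse_gpu_sample(line)
            log.write("{timestamp}\t{temperature_c}\t{utilization_pct}\n".format(**sample))
            log.flush()  # keep the last samples on disk if the machine hangs
```

Running `log_gpu_samples()` in a second terminal while the training job is going gives you the last temperatures seen before the failure. Stop it with Ctrl-C.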