Hi, I am sorry if this comes across as a PyTorch question, but I suspect that the tools I need to understand this issue are CUDA-based.
I have a network. During training, at random times, the code stalls. There are no errors, and everything except Spyder (my IDE), which becomes unresponsive, continues working fine. As far as I can tell from waiting a long time, the code never resumes, yet Task Manager does not list it as unresponsive. By littering the training loop with ‘print’ commands, I have narrowed the offending lines down to the commands that move tensors from CPU to GPU.
Additional info:
The GPU is GeForce GTX 1060
While the code was stalled, I took this screenshot from task manager:
(so, it does not seem like it is overflowing)
I have tried starting a separate training process while another one is stalled. The Python code executes fine until it gets to the training part, and then nothing happens there either. But, again, I don’t know if that is actually indicative of anything.
When I kill the process and start again, it is often able to run without any kind of restart of the machine.
I am not here to cry ‘bug’, but would appreciate some advice on how to understand this behavior, so I can adjust my code to avoid it in the future? Of course this laptop is just for early development (I’ll move the code to a gpu cluster later), but I should still be able to run small versions of my experiment here :) Also, I need to know that I won’t bring the problem with me to the large hardware.
I have narrowed the offending lines down to the commands for moving tensors from cpu to gpu.
And those lines are? Have you inquired about this problem with PyTorch support (presumably forum or mailing list)?
When you abnormally terminate an application that uses the Intel Fortran runtime with Ctrl-C, it is normal to get this kind of a stack dump. libifcoremd.dll is one of the variants of the Intel Fortran run-time library. if = Intel Fortran, md = multi-threaded (re-entrant), dynamic linking, non-debug.
I haven’t checked whether it’s one or the other in particular, just that it happens after the data loader has returned xtemp, ytemp, but before the data is actually passed to the network. These two lines are the only thing happening in between.
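One way to tell which of the two transfers stalls is to bracket each one with its own log line that is flushed immediately, so the last line written before the hang names the step. The following is a minimal sketch using only the standard library. `traced` is a hypothetical helper name, and the commented usage assumes the two lines are `.to(device)` calls on xtemp and ytemp, which the thread does not show.

```python
import sys
import time
from contextlib import contextmanager


@contextmanager
def traced(label, stream=None):
    """Log entry and exit of a step, flushing each line at once.

    If the program stalls inside the block, the "[enter]" line is the
    last thing written, which identifies the step that hung.
    """
    out = stream if stream is not None else sys.stderr
    print(f"[enter] {label}", file=out, flush=True)
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"[leave] {label} after {elapsed:.3f}s", file=out, flush=True)


# Hypothetical usage inside the training loop (names are assumptions):
#
# with traced("xtemp to gpu"):
#     xtemp = xtemp.to(device)
# with traced("ytemp to gpu"):
#     ytemp = ytemp.to(device)
```

Because GPU work can be queued asynchronously, a stall may only surface at the next synchronization point. Calling `torch.cuda.synchronize()` inside each block should make the timing reflect the transfer itself rather than a later step.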
Yes, I have also reached out on the pytorch forum, but have not had anyone answer there (yet).
I don’t see a call to the CUDA API. In thirteen years of CUDA programming I have not encountered a hanging call to cudaMemcpy() or cudaMemcpyAsync(). I therefore consider it very likely that the issue is in the software stack between this high-level code and the actual calls to CUDA memcpy functions. I am guessing that’s something like three software layers of separation? My recommendation would be to drill down methodically, software layer by software layer. I am further guessing that the root cause is some kind of synchronization issue, i.e. a deadlock or a livelock.
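As a first step in drilling down from the Python side, the standard library's faulthandler module can dump the stack of every Python thread when a block of code runs longer than expected. This shows where each thread is waiting, which helps when a deadlock is suspected. The sketch below is only an illustration: `hang_watchdog` is a hypothetical helper name and the timeout is an arbitrary assumption. It only shows Python-level frames, not what happens inside native code.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s=60.0, file=None):
    """Dump all Python thread stacks if the block exceeds timeout_s.

    The dump repeats every timeout_s seconds until the block exits.
    `file` must be a real file with a file descriptor; some IDE
    consoles replace sys.stderr with an object that lacks one, in
    which case an opened log file should be passed instead.
    """
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)
    try:
        yield
    finally:
        # Always disarm the timer, even if the block raised.
        faulthandler.cancel_dump_traceback_later()


# Hypothetical usage around one training step (names are assumptions):
#
# with open("hang_trace.log", "w") as log:
#     with hang_watchdog(timeout_s=120.0, file=log):
#         run_one_training_step()
```

If the dump shows the main thread sitting in the transfer call while another thread (for example a data-loader worker) is blocked too, that points toward a synchronization problem above the CUDA layer rather than in the copy itself.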
I know nothing about PyTorch other than that it exists. Their bug database lists an issue which based on my superficial perusal looks similar to OP’s situation. Might be worth checking out: