How do I debug this? (pytorch stalls when moving tensor to GPU?)

mikkelsen.kaare · March 2, 2021, 9:23pm

Hi, I am sorry if this comes across as a PyTorch question, but I suspect that the tools I need to understand this issue are CUDA-based.

I have a network. During training, at random times, the code stalls. There are no errors, and everything except Spyder (my IDE) continues working fine (Spyder becomes unresponsive). As far as I can see (from waiting a long time), the code never resumes, but Task Manager does not list it as unresponsive. By littering the training loop with ‘print’ commands, I have narrowed the offending lines down to the commands for moving tensors from cpu to gpu.

Additional info:
The GPU is a GeForce GTX 1060

While the code was stalled, I took this screenshot from task manager:
[Image: Screenshot 2021-03-01 142136]
(so, it does not seem like it is overflowing)

I have tried starting a separate training process while another one is stalled. The Python code executes fine up to the training part, but then nothing happens there either. Again, I don’t know if that is actually indicative of anything.

When I kill the process and start again, it is often able to run without any kind of restart of the machine.

I am not here to cry ‘bug’, but would appreciate some advice on how to understand this behavior, so I can adjust my code to avoid it in the future? Of course this laptop is just for early development (I’ll move the code to a gpu cluster later), but I should still be able to run small versions of my experiment here :) Also, I need to know that I won’t bring the problem with me to the large hardware.

mikkelsen.kaare · March 2, 2021, 9:26pm

As a new user, I am not allowed to include more than one upload, but I also have the following:

When I run the code directly from the Anaconda prompt, I can kill the process, and I get these ‘errors’:

(I have no idea if this is helpful, sorry)

njuffa · March 2, 2021, 9:49pm

“I have narrowed the offending lines down to the commands for moving tensors from cpu to gpu.”

And those lines are? Have you inquired about this problem with PyTorch support (presumably forum or mailing list)?

When you abnormally terminate an application that uses the Intel Fortran runtime with Ctrl-C, it is normal to get this kind of a stack dump. libifcoremd.dll is one of the variants of the Intel Fortran run-time library. if = Intel Fortran, md = multi-threaded (re-entrant), dynamic linking, non-debug.

mikkelsen.kaare · March 2, 2021, 10:06pm

Hi, the lines are:

        xbatch=xtemp.to(cuda)
        ybatch=ytemp.to(cuda)

(with cuda = torch.device('cuda:0'))

I haven’t checked whether it’s one or the other in particular, just that it happens after the data loader has returned xtemp, ytemp, but before the data is actually passed to the network. These two lines are the only thing happening in between.
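To tell the two transfers apart, one option is to log immediately before and after each step and flush the output, so that the last message survives a hang. The following is only a minimal sketch: `traced` is a hypothetical helper (not part of PyTorch), and the optional `sync` argument is where something like `torch.cuda.synchronize` could be passed, assuming PyTorch with CUDA, so that asynchronously queued GPU work has finished before the "done" line is printed.

```python
import sys
import time


def traced(label, fn, *args, sync=None, out=None, **kwargs):
    """Call fn(*args, **kwargs), logging start and finish with an immediate flush.

    `traced` is a hypothetical helper for illustration only.
    """
    out = sys.stderr if out is None else out
    print(f"[start] {label}", file=out, flush=True)
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    if sync is not None:
        # Assumption: the caller passes e.g. torch.cuda.synchronize here.
        sync()
    elapsed = time.perf_counter() - t0
    print(f"[done ] {label} in {elapsed:.3f}s", file=out, flush=True)
    return result
```

In the loop above this might be used as xbatch = traced("xtemp.to", xtemp.to, cuda, sync=torch.cuda.synchronize), and likewise for ytemp. If a "[start]" line appears without its matching "[done ]" line, that step is the one that stalled.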

Yes, I have also reached out on the pytorch forum, but have not had anyone answer there (yet).

njuffa · March 2, 2021, 11:17pm

I don’t see a call to the CUDA API. In thirteen years of CUDA programming I have not encountered a hanging call to cudaMemcpy() or cudaMemcpyAsync(). I therefore consider it very likely that the issue is in the software stack between this high-level code and actual calls to CUDA memcpy functions. I am guessing that’s something like three software layers of separation? My recommendation would be to drill down methodically, software layer by software layer. I am further guessing that the root cause is some kind of synchronization issue, i.e. a deadlock or a livelock.

I know nothing about PyTorch other than that it exists. Their bug database lists an issue which based on my superficial perusal looks similar to OP’s situation. Might be worth checking out:

using multi thread lead to gpu stuck with GPU-util 100% · Issue #22259 · pytorch/pytorch · GitHub
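When a deadlock between threads is suspected, a generic first step on the Python layer is to dump the stack of every thread once a step has taken suspiciously long. The sketch below uses the standard-library faulthandler module for that; `hang_watchdog` and the timeout value are hypothetical choices for illustration, not something taken from PyTorch or from the linked issue. It only shows Python-level frames, so it helps with the top software layer, not with what happens inside native code.

```python
import faulthandler
import sys
import time
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s, file=None):
    """Dump the tracebacks of all Python threads to `file` every `timeout_s`
    seconds for as long as the body of the `with` block is still running.

    `hang_watchdog` is a hypothetical helper for illustration only.
    """
    target = sys.stderr if file is None else file
    # The watchdog runs in a separate native thread, so it can still report
    # when the main thread is blocked.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)
    try:
        yield
    finally:
        # Disarm the watchdog once the step has completed (or raised).
        faulthandler.cancel_dump_traceback_later()
```

Wrapping the suspect step, for example the two .to(cuda) lines, in "with hang_watchdog(60):" would then print where each thread is sitting if the step is still running after a minute. faulthandler writes to a file descriptor, so passing an open log file is a reasonable choice when the console's stderr is not a regular stream.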

mikkelsen.kaare · March 3, 2021, 7:28am

ok, I’ll try that :)
