Training "never finishes" or system crashes using PyTorch - GPU has memory allocated but always has 0% utilization using DataLoader

marco_uc · January 21, 2023, 11:45pm

My neural network training either “never finishes” or crashes the system (memory reaches its limit, or a DataLoader worker is killed). I’m using PyTorch with CUDA and a DataLoader; the GPU has memory allocated but always shows 0% utilization. I’ve tried several batch sizes and, in the DataLoader, different settings for the number of workers, shuffle (true or false), and pin_memory (true or false). In the tests I’ve done, I can’t use more than 1 worker, whether I increase or decrease the batch size. I’m in the dark and would be very grateful for any help, thanks. I am using an NVIDIA V100 SXM2.

marco_uc · January 22, 2023, 11:22pm

I was able to solve part of the problem by following the instructions on this page, PyTorch - NCC @ Durham, and then the instructions on another page about HDF5 file loading with multiprocessing (I can’t post that link because I’m limited to one link per post).
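The post does not say what those HDF5 instructions were. One commonly recommended pattern for reading a single file from several DataLoader workers is to open the file lazily inside each worker process instead of in the dataset constructor. Below is a minimal, library-agnostic sketch of that idea; the class name `LazyFileDataset` and the parameters `open_fn` and `read_fn` are hypothetical and do not come from the post, PyTorch, or h5py.

```python
import os


class LazyFileDataset:
    """Map-style dataset that opens its backing file once per process.

    Hypothetical sketch: `open_fn` and `read_fn` are illustrative names.
    """

    def __init__(self, path, length, open_fn, read_fn):
        self.path = path
        self.length = length
        self.open_fn = open_fn   # e.g. lambda p: h5py.File(p, "r")
        self.read_fn = read_fn   # e.g. lambda f, i: f["images"][i] (key is assumed)
        self._handle = None      # deliberately NOT opened in __init__
        self._owner_pid = None

    def _file(self):
        # Open on first access, and reopen if a fork copied this object
        # (and its parent-owned handle) into a different process.
        pid = os.getpid()
        if self._handle is None or self._owner_pid != pid:
            self._handle = self.open_fn(self.path)
            self._owner_pid = pid
        return self._handle

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        return self.read_fn(self._file(), index)

    def __getstate__(self):
        # Drop the open handle if the dataset is pickled for a spawned worker.
        state = self.__dict__.copy()
        state["_handle"] = None
        state["_owner_pid"] = None
        return state
```

An object with `__len__` and `__getitem__` like this can be passed to a DataLoader as a map-style dataset. If workers are started with the spawn method, `open_fn` and `read_fn` would need to be module-level functions rather than lambdas so that they can be pickled.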
Now there is no DataLoader worker error and no memory overflow or system crash. Each DataLoader worker uses a thread to carry out its loads, but the GPU is still at 0% utilization despite having a certain amount of memory allocated to it. Even when I vary the batch size, a single epoch does not complete in the time it takes in another environment. In fact, I still haven’t been able to complete a single epoch in any test so far: I wait at most 40 minutes per test, whereas in the other environment the training completes much faster. Perhaps there is some bottleneck related to the GPU. Can you help me, please?
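One way to narrow down a symptom like this is to measure, over a handful of batches, how long the loop waits for the next batch versus how long the training step itself takes. The helper below is a minimal sketch of that measurement; `profile_loop`, `batches`, and `step` are hypothetical names, and it assumes nothing about the model or dataset in the post.

```python
import time


def profile_loop(batches, step, max_batches=20):
    """Return (seconds waiting for data, seconds in step, batches seen)."""
    wait_s, step_s, seen = 0.0, 0.0, 0
    it = iter(batches)
    while seen < max_batches:
        t0 = time.perf_counter()
        try:
            batch = next(it)          # time spent blocked on the loader
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)                   # forward/backward/optimizer work
        t2 = time.perf_counter()
        wait_s += t1 - t0
        step_s += t2 - t1
        seen += 1
    return wait_s, step_s, seen
```

With PyTorch, `batches` would be the DataLoader and `step` a function that runs one training iteration. Because CUDA kernels are launched asynchronously, `step` should call `torch.cuda.synchronize()` before returning so that GPU time is actually counted. If the wait time dominates, the input pipeline is the likely bottleneck rather than the GPU; if the step time dominates while utilization stays at 0%, it is worth checking that both the model and the batches were moved to the CUDA device.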
