[Environment]
OS : Ubuntu 22.04
Device : GeForce RTX 4090
NVIDIA driver : 535.129.03
CUDA : 12.1 (cuda_12.1.r12.1, checked with the "nvcc -V" command)
Python : 3.10.12
PyTorch : 2.1.0
[Problem that is occurring]
During model training, my desktop freezes at random times. It is not just the Python process that hangs; Ubuntu itself freezes completely and does not respond until I hold down the power button to force a reboot.
The freeze timing changes depending on the PyTorch DataLoader's num_workers setting. For example, with num_workers=0 training freezes after 7 epochs, but with num_workers=2 it freezes in the middle of the first epoch.
Similarly, changing batch_size also changes when the freeze occurs.
VRAM is 24 GB and RAM is 128 GB, so I do not think this is a resource shortage.
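To back up the "not a resource shortage" claim with data, one option is to log available system RAM during training so a slow leak can be ruled out. A minimal Linux-only sketch (the helper names here are my own, not from the training code):

```python
# Minimal sketch (Linux only): log available system RAM so a slow
# memory leak can be ruled out before the machine locks up.
# Helper names are my own, not from the original training code.
import time

def read_meminfo():
    """Parse /proc/meminfo into {field: value_in_kB}."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.strip().split()[0])  # values are in kB
    return info

def available_gib():
    """Currently available RAM in GiB."""
    return read_meminfo()["MemAvailable"] / (1024 ** 2)

if __name__ == "__main__":
    # Call this once per training step (or from a background thread)
    # and tee the output to a file, so the last reading survives a freeze.
    print(f"{time.strftime('%H:%M:%S')} available RAM: {available_gib():.1f} GiB")
```

If the logged value stays high right up to the last entry before a freeze, RAM exhaustion really can be ruled out.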
Since "torch.cuda.is_available()" returns True, the GPU is recognized. Also, the calculation "torch.zeros([2,2]).cuda() * 2" completed without the PC freezing.
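A single tiny op passing does not put the card under sustained load, so it may not trip a marginal power supply or cooling problem. A hedged sketch of a longer stress loop (assuming only that torch is installed; the function is my own, not part of the training code):

```python
# Hedged sketch: hammer the GPU with dense matmuls for a while.
# If the whole desktop freezes here too, the training code is
# likely innocent and the fault lies in the driver/hardware stack.
import time
import torch

def stress(seconds=60.0, size=4096):
    """Run dense matmuls for `seconds`; returns the iteration count."""
    dev = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(size, size, device=dev)
    deadline = time.time() + seconds
    iters = 0
    while time.time() < deadline:
        b = a @ a  # result discarded; we only care about the load
        iters += 1
    if dev == "cuda":
        torch.cuda.synchronize()  # make sure all queued work finished
    return iters
```

While it runs, watching temperature and power draw in `nvidia-smi` can help; transient power spikes on high-end cards like the RTX 4090 are a commonly reported stressor for undersized or aging PSUs.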
Training was run with the code below.
The same code runs fine in Kaggle notebooks and Colab, so I don't think the bug is in the code itself.
I've tried various NVIDIA driver and CUDA versions, but the same freeze occurs.
I also tried the Kaggle Docker image, but the same freeze occurs.
I don't know whether the problem is the hardware itself, a bad driver, or an OS misconfiguration. If anyone has encountered the same issue, please help.
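One concrete thing to check after the forced reboot is the previous boot's kernel log (e.g. `sudo journalctl -k -b -1`) for NVIDIA "Xid" messages: on GPU faults the driver logs an Xid error code that narrows down driver vs. hardware causes. A small sketch for filtering such lines (the function name is my own):

```python
# Sketch: pull NVRM "Xid" lines out of kernel log text. The NVIDIA
# driver logs these on GPU faults; the Xid number identifies the
# failure class (see NVIDIA's Xid error documentation).
import re

def find_xid_errors(log_text):
    """Return all log lines containing an NVRM Xid event."""
    return [line for line in log_text.splitlines()
            if re.search(r"NVRM: Xid", line)]

# Example use (hypothetical wiring):
#   import subprocess
#   text = subprocess.run(["journalctl", "-k", "-b", "-1"],
#                         capture_output=True, text=True).stdout
#   for line in find_xid_errors(text):
#       print(line)
```

A hard freeze that leaves no Xid entry at all often points away from a plain software bug and toward power delivery, RAM, or motherboard issues, since the kernel died before it could write anything.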