My desktop freezes at random times while training with PyTorch

[Environment]
OS : Ubuntu 22.04
Device : GeForce RTX 4090
Nvidia-driver : 535.129.03
cuda : cuda_12.1.r12.1 (checked with the “nvcc -V” command)

python : 3.10.12
torch : 2.1.0

[Problem that is occurring]
During model training, my desktop freezes at random times. It is not just the Python process that hangs; Ubuntu itself freezes completely, and it won’t come back unless I hold down the power button.

The freeze timing changes depending on the PyTorch DataLoader’s num_workers setting. For example, if I set num_workers to 0, training freezes after 7 epochs, but if I set num_workers to 2, it freezes partway through the first epoch.
Similarly, if I change batch_size, the freeze timing also changes.
VRAM is 24 GB and RAM is 128 GB, so I do not consider this to be a resource shortage.

Since “torch.cuda.is_available()” returns True, the GPU is recognized. Also, the calculation “torch.zeros([2,2]).cuda() * 2” completed without the PC freezing.

Training was run using this code.
It works in Kaggle notebooks and Colab, so there shouldn’t be any bugs in the code itself.

I’ve tried various NVIDIA driver and CUDA versions, but the same freeze occurs.
I also tried the Kaggle Docker image, with the same result.

I don’t know whether the problem is the hardware itself, a bad driver, or a bad OS setting. Please help if anyone has encountered the same problem.
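One way to narrow down hardware versus driver, as a minimal sketch only: after the forced reboot, look in the kernel log of the previous boot for NVIDIA “Xid” error lines, which the driver writes when it detects a GPU fault. This assumes systemd-journald keeps a persistent journal (so that “journalctl -k -b -1” has something to show) and that the driver managed to log anything before the freeze, which is not guaranteed with a hard lockup. The function names below are made up for illustration.

```python
import re
import subprocess

# Matches NVIDIA kernel-driver error lines of the form
# "NVRM: Xid (PCI:0000:01:00): 79, pid=1234, GPU has fallen off the bus."
XID_PATTERN = re.compile(
    r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)(?:, (?P<detail>.*))?"
)


def find_xid_events(log_text):
    """Return (device, xid_code, detail) tuples for every Xid line in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            detail = (match["detail"] or "").strip()
            events.append((match["device"], int(match["code"]), detail))
    return events


def previous_boot_kernel_log():
    """Read the kernel messages of the previous boot (the one that froze).

    Assumption: a persistent journal is enabled; root may be required.
    """
    result = subprocess.run(
        ["journalctl", "-k", "-b", "-1", "--no-pager"],
        capture_output=True,
        text=True,
        check=False,  # a missing previous boot just yields empty output
    )
    return result.stdout
```

Calling find_xid_events(previous_boot_kernel_log()) after a freeze returns an empty list if nothing was logged. An empty list does not rule out a hardware fault, but a recurring Xid code gives something concrete to search for or to report to the vendor.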

@Kinosuke Have you solved this problem? I am experiencing a similar issue.

My environment is:

  • OS : Ubuntu 22.04
  • Device : Two GeForce RTX 4090
  • Nvidia-driver : 545.23.08
  • cuda : V12.3.107
  • python : 3.10.12
  • torch : 2.1.2+cu121

The system randomly freezes during training and has to be physically rebooted. The issue occurs regardless of the num_workers setting and is observed both with a single GPU and with two GPUs under DDP.
I have tried different combinations of Nvidia drivers (530, 535, and 545), CUDA versions (11.8, 12.1, and 12.3), and PyTorch versions (2.0.1 and 2.1.2), but all exhibit the same problem.
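Since the machine needs a hard reset each time, one thing that might help is recording GPU temperature and power draw right up to the freeze. The following is a rough sketch, not a tested diagnostic: it polls nvidia-smi once per second and forces every reading to disk with fsync so the last lines survive the reset. The file name, the interval, and the function names are assumptions for illustration; the query fields are standard “nvidia-smi --query-gpu” fields.

```python
import os
import subprocess
import time

QUERY_FIELDS = [
    "timestamp",
    "index",
    "temperature.gpu",
    "power.draw",
    "utilization.gpu",
    "memory.used",
]


def parse_gpu_rows(csv_text):
    """Turn nvidia-smi CSV output (no header) into a list of dicts keyed by field."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        if len(values) == len(QUERY_FIELDS):  # ignore malformed or partial lines
            rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpus():
    """Ask nvidia-smi for one CSV line per GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def log_forever(path="gpu_telemetry.csv", interval_s=1.0):
    """Append readings and sync them to disk so they survive a hard freeze."""
    with open(path, "a") as log_file:
        while True:
            log_file.write(query_gpus())
            log_file.flush()
            os.fsync(log_file.fileno())  # push the data past the OS cache
            time.sleep(interval_s)
```

Running log_forever() in a second terminal while training, then passing the file contents to parse_gpu_rows after the reboot, shows what each GPU reported in its final seconds. A reading that looks normal right before the freeze would point away from overheating, though it would not identify the cause by itself.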

@taemin.cho

Your issue is very similar to mine.

The same issue occurred when I trained under Windows on the same machine. Also, when I ran GPU benchmark software, the software would stop.

I sent the desktop to the store where I purchased it for repair, and they acknowledged the hardware failure and replaced the 4090 at no charge.

I believe your GPU may have a similar hardware failure. I recommend that you have it looked at as soon as possible, as it can be replaced free of charge within a year of purchase.

Hi,
I have experienced the same issue, and we have exactly the same setup (OS, drivers, GPU, etc.). One thing I recently found is that, in my case, the CPU cooler is not reading the CPU temperature correctly, so it does not ramp up even when the machine is under very heavy load. My fans don’t make any noise when I stress test my CPU, and I think that is why the desktop freezes: the CPU locks up at high temperature.
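For anyone who wants to check whether their machine has the same cooling problem, here is a small sketch that reads the temperatures the Linux kernel exposes under /sys/class/thermal (the values there are in millidegrees Celsius). Which zone corresponds to the CPU package differs between machines, so treat the zone names as something to inspect and not as a given; the function name is just for illustration.

```python
from pathlib import Path


def read_thermal_zones(base_dir="/sys/class/thermal"):
    """Return {"zone_name:zone_type": degrees_celsius} for each readable zone."""
    readings = {}
    for zone in sorted(Path(base_dir).glob("thermal_zone*")):
        try:
            zone_type = (zone / "type").read_text().strip()
            millidegrees = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            continue  # skip zones that are missing files or report garbage
        readings[f"{zone.name}:{zone_type}"] = millidegrees / 1000.0
    return readings


if __name__ == "__main__":
    for name, celsius in read_thermal_zones().items():
        print(f"{name}: {celsius:.1f} C")
```

Running this in a loop during a CPU stress test should show the temperature climbing. If it climbs toward the throttling range while the fans stay silent, that would be consistent with the cooler not reacting to load.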

My temporary solution is to set my liquid cooler to always run at maximum capacity through an API in my dual-boot Windows installation. Please let me know if you have found a better solution; it would be very much appreciated.