4090 devices lost after a random number of training epochs

I am on Ubuntu 22.02, 4090, driver 535, CUDA 12.2.
PyTorch training runs fine at first, but after a certain number of epochs the training crashes.
When I then run nvidia-smi, it reports that no devices were found.
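
To narrow down when the devices disappear, a small check like the one below could be run between epochs. This is only a sketch under assumptions: the helper names `looks_like_device_loss` and `check_gpus` are hypothetical, the marker string is my assumption about the wording nvidia-smi prints in this state, and `nvidia-smi -L` (list GPUs) is used as a lightweight probe.

```python
import subprocess

# Assumed marker: the wording nvidia-smi prints when it sees no GPUs.
NO_DEVICE_MARKER = "no devices were found"


def looks_like_device_loss(returncode: int, output: str) -> bool:
    """Return True if an nvidia-smi result suggests the GPUs disappeared."""
    # Either a non-zero exit code or the marker text counts as a failure.
    return returncode != 0 or NO_DEVICE_MARKER in output.lower()


def check_gpus() -> bool:
    """Run `nvidia-smi -L` and report whether the GPUs still look present."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # Missing binary or a hung driver call is treated as "not healthy".
        return False
    combined = result.stdout + result.stderr
    return not looks_like_device_loss(result.returncode, combined)
```

Calling `check_gpus()` at the end of each epoch and logging the result with a timestamp might help match the moment of failure against the kernel log (for example the output of `dmesg`).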

Sounds like same as