RTX 4090 lost (no devices found) after a random number of training epochs

I am on Ubuntu 22.04 with an RTX 4090, driver 535, and CUDA 12.2.
PyTorch training runs fine at first, but after a varying number of epochs the training crashes.
When I then run nvidia-smi, it reports that no devices were found.
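
For anyone hitting the same thing, here is a rough sketch of how the kernel log could be scanned for NVIDIA Xid messages right after the crash. The function name `find_xid_events` is only illustrative, and I am assuming the usual `NVRM: Xid (PCI:...): <code>, ...` line format; the code just extracts the PCI address and Xid number and does not interpret them.

```python
import re

# Assumed format of NVIDIA Xid lines in the kernel log, for example:
# "NVRM: Xid (PCI:0000:01:00): 79, pid=1234, GPU has fallen off the bus."
XID_PATTERN = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)")


def find_xid_events(log_text):
    """Return a list of (pci_address, xid_code) pairs found in kernel log text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group(1), int(match.group(2))))
    return events


if __name__ == "__main__":
    import sys

    # Read kernel log text from stdin and print any Xid events found.
    for pci_address, xid_code in find_xid_events(sys.stdin.read()):
        print(f"{pci_address}: Xid {xid_code}")
```

Saved as a script (say `xid_scan.py`, a hypothetical name), it can be fed the kernel log with `sudo dmesg | python3 xid_scan.py` to see whether an Xid was logged around the time the GPU disappeared.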

This sounds like the same issue as:
https://forums.developer.nvidia.com/t/unable-to-determine-the-device-handle-for-gpu000000-0-unknown-error/270060/2?u=generix