I hit the same error on 8x RTX 5090.
Same config and same code! But it works on 8x L20!
os: Linux hello-SY8108G-D12R-G4 6.11.0-17-generic #17~24.04.2-Ubuntu SMP PREEMPT_DYNAMIC Mon Jan 20 22:48:29 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
cuda:(sdxl_ft) root@hello-SY8108G-D12R-G4:~/home/train/kohya_ss# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:23:50_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0
nvidia-smi: NVIDIA-SMI 570.169 Driver Version: 570.169 CUDA Version: 12.8
ERROR:
[rank0]:[W715 21:26:00.385246677 CUDAGuardImpl.h:119] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)
[rank1]:[W715 21:26:00.385298941 CUDAGuardImpl.h:119] Warning: CUDA warning: an illegal memory access was encountered (function destroyEvent)
W0715 21:26:01.212000 47771 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 47850 closing signal SIGTERM
E0715 21:26:01.326000 47771 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: -6) local_rank: 0 (pid: 47849) of binary: /root/anaconda3/envs/sdxl_ft/bin/python3.10
Traceback (most recent call last):
Thanks for the update. Hope that Nvidia will release a fix soon for their buggy Linux drivers…
It seems new production drivers come out around the 17th or 18th of each month. 🤞 (No idea when CUDA toolkit 13.1 will be out, though.)
I searched the PyTorch GitHub issues for this, but I still get the error! Do you guys think my error is a CUDA toolkit bug rather than a PyTorch bug? I am not sure!
Hi guys, I solved my problem by installing the PyTorch nightly build:
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu129
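
For anyone who wants to sanity-check their environment after the install, here is a minimal sketch that prints the installed torch build and compares each GPU's compute capability against the architectures the wheel was compiled for. The helper names `arch_supported` and `report` are made up for this example. As far as I know the RTX 5090 reports capability (12, 0), so the list should contain `sm_120`. This is only a quick check, not a diagnosis of the illegal memory access above: an exact `sm_XY` match is a simplification, since a build can also run on a newer GPU through PTX.

```python
def arch_supported(capability, arch_list):
    """True if arch_list has an sm_XY entry matching the (major, minor) capability."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def report():
    # Imported here so the helper above can be used without torch installed.
    import torch

    print("torch:", torch.__version__, "| built against CUDA:", torch.version.cuda)
    if not torch.cuda.is_available():
        print("CUDA is not available in this environment")
        return
    # Architectures this torch build was compiled for, e.g. ['sm_90', 'sm_120', ...]
    arch_list = torch.cuda.get_arch_list()
    print("compiled architectures:", arch_list)
    for i in range(torch.cuda.device_count()):
        cap = torch.cuda.get_device_capability(i)
        name = torch.cuda.get_device_name(i)
        print(f"GPU {i}: {name}, capability {cap}, "
              f"exact match: {arch_supported(cap, arch_list)}")


if __name__ == "__main__":
    try:
        report()
    except ImportError:
        print("torch is not installed in this environment")
```

Run it inside the same conda env you train in (`sdxl_ft` in my case) so it reports the torch build that kohya_ss actually uses.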