Hello. After upgrading my system from 2x Titan RTX by replacing one of them with an RTX 6000 Ada Generation card, I started seeing various CUDA runtime errors while training neural networks with PyTorch (the same scripts work fine on the Titan). The errors are not consistent; the two most frequent are “CUDA error: an illegal memory access was encountered” and “CUDA error: unspecified launch failure”. They can occur anywhere from a few minutes to a few hours into training. The reported stack traces differ as well, and setting CUDA_LAUNCH_BLOCKING=1 did not help localize the errors. They seem very sporadic.
After days of investigation, NVIDIA support advised me to run the official NVIDIA cleanup tool, reinstall the drivers, and run some tests. The tests could only be run in WDDM mode (I run Windows). I switched modes and ran the tests for a while, and they found no problems with the card. I then ran my PyTorch training scripts in WDDM mode, and after 24 hours no errors had appeared. After I switched the RTX 6000 Ada back to TCC mode, the errors appeared after only a few minutes, again and again.
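For anyone trying to reproduce this: CUDA_LAUNCH_BLOCKING=1 only takes effect if it is set before CUDA is initialized in the process. Below is a minimal sketch of one way to set it from inside a script. The helper name `describe_devices` is hypothetical and not part of my training code, and the snippet falls back gracefully when PyTorch or a GPU is not present.

```python
import os

# Set the variable before anything can initialize CUDA in this process;
# changing it after the CUDA context exists has no effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

try:
    import torch  # imported only after the variable is set
except ImportError:
    torch = None  # PyTorch is not installed in this environment


def describe_devices():
    """Return one line per visible CUDA device (hypothetical helper)."""
    if torch is None or not torch.cuda.is_available():
        return []
    return [
        f"cuda:{i} {torch.cuda.get_device_name(i)}"
        for i in range(torch.cuda.device_count())
    ]
```

Setting the variable in the shell before launching Python is equivalent. Note that synchronous launches slow training down, which may itself change how often a timing-dependent error shows up.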
Switching the new card to WDDM mode seems to help (at least, it runs much longer without errors in this mode), so I was wondering whether this could indicate a problem with the driver's TCC mode. Searching for similar reports, I found many mentions of this mysterious problem on the PyTorch forums (some of them on Linux systems as well), but this particular post, although it is about MATLAB, looks strikingly similar to what I described above:
Running in WDDM mode is less efficient (Windows uses some VRAM and compute for its own purposes), so TCC is preferable. I have also been wondering whether the fact that the card works fine in WDDM mode rules out a defective card and perhaps points to a bug in the driver?
Any clues on how to make the card work as well as the Titan RTXs in the same machine would be appreciated. So far I have tried everything I could think of, including swapping the cards (6000 and Titan) between their slots, but TCC mode still produces errors.
Below is the output of the nvidia-smi tool:
C:\Windows\System32>nvidia-smi
Mon Jun 12 10:31:09 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.98                 Driver Version: 535.98       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX 6000 Ada Gene...  WDDM  | 00000000:06:00.0 Off |                  Off |
| 40%   25C    P8              12W / 300W |    591MiB / 49140MiB |      8%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA TITAN RTX              TCC  | 00000000:0A:00.0 Off |                  N/A |
| 41%   23C    P8              16W / 280W |      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
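
In case it is useful for checking which driver model each GPU is currently in without reading the table by eye, here is a minimal sketch that uses nvidia-smi's CSV query output. The function names `parse_driver_models` and `query_driver_models` are my own, and the sample string in the usage note only mirrors the two cards listed above; it is not captured output. On Linux the `driver_model.current` field is reported as not applicable.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; driver_model.current is WDDM or TCC on Windows.
QUERY_FIELDS = "index,name,driver_model.current"


def parse_driver_models(csv_text):
    """Parse `--format=csv,noheader` output into a list of dicts."""
    rows = []
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    for record in reader:
        if len(record) != 3:
            continue  # skip blank or malformed lines
        index, name, model = (field.strip() for field in record)
        rows.append({"index": int(index), "name": name, "driver_model": model})
    return rows


def query_driver_models():
    """Run nvidia-smi (must be on PATH) and return the parsed rows."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_driver_models(result.stdout)
```

For example, `parse_driver_models("0, NVIDIA RTX 6000 Ada Generation, WDDM\n1, NVIDIA TITAN RTX, TCC\n")` returns one dict per card, which makes it easy to log the driver model at the start of each training run and correlate it with the failures.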