We are seeing an issue with only one of the two Volta GPUs in a system. The reported error is "CUDA error: illegal memory access was encountered." The code runs fine on GPU 1 but fails on GPU 0.
isg@dg19c:~$ uname -a
Linux dg19c 4.15.0-118-generic #119-Ubuntu SMP Tue Sep 8 12:30:01 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
isg@dg19c:~$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:62:00.0 Off |                  Off |
| N/A   41C    P0    43W / 300W |     11MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                  Off |
| N/A   41C    P0    43W / 300W |     11MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
nvidia-bug-report.log (509 Bytes)
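One way to narrow this down might be to run the same command once per GPU with only that device visible, so a failure can be tied to the physical card rather than to the application's own device selection. The sketch below is only an illustration, not what was actually run: the helper names `run_on_gpu` and `check_gpus` and the `./my_app` command are hypothetical. It relies on the `CUDA_VISIBLE_DEVICES` and `CUDA_DEVICE_ORDER` environment variables; setting the latter to `PCI_BUS_ID` should make the indices line up with the ones `nvidia-smi` prints.

```python
import os
import subprocess


def run_on_gpu(cmd, gpu_index):
    """Run cmd with a single GPU visible and return its exit code."""
    env = dict(os.environ)
    # Enumerate devices in PCI bus order so the index matches nvidia-smi.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # Hide every GPU except the one under test.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return subprocess.run(cmd, env=env).returncode


def check_gpus(cmd, gpu_indices):
    """Map each GPU index to the exit code of cmd run on that GPU alone."""
    return {i: run_on_gpu(cmd, i) for i in gpu_indices}


# Example usage (hypothetical command, replace with the failing program):
#   results = check_gpus(["./my_app"], [0, 1])
#   for gpu, code in results.items():
#       print(f"GPU {gpu}: exit code {code}")
```

If the program fails only when GPU 0 is the sole visible device, that would point toward the card or its slot rather than the code, though it would not rule out a latent bug that happens to surface on one device.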