Hi, I have a problem with my RTX 2070.
I can’t train TensorFlow models on it. Every time I try to train a model on the GPU I get a “nan” loss (on the CPU I don’t have this problem). I also get the same problem with nvidia-docker when I run the TF-docker examples.
(Screenshot: https://sun9-8.userapi.com/impf/gGLZLZR8i6pw_tLMiveMO6lPEsL82SkYEoXIhA/MVjUFDyBDzo.jpg?size=0x0&quality=90&proxy=1&sign=dd594952260f338228f2a6ee73c3186b)
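For reference, the TF-docker examples I mentioned are roughly the standard Keras MNIST tutorial, i.e. something like this sketch (not my exact code):

import tensorflow as tf

# Standard Keras MNIST example, roughly what the TF tutorials / docker examples run.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5)

On the GPU the loss comes out as “nan”; on the CPU the same thing trains normally.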
Tue Oct 27 21:23:57 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.95.01    Driver Version: 440.95.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070    Off  | 00000000:07:00.0  On |                  N/A |
|  0%   52C    P8    21W / 185W |    361MiB /  7979MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1105      G   /usr/lib/xorg/Xorg                           133MiB |
|    0      2239      G   cinnamon                                      51MiB |
|    0      2657      G   ...AAAAAAAAAAAACAAAAAAAAAA= --shared-files   174MiB |
+-----------------------------------------------------------------------------+
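In case it helps with diagnosing this, a quick check of whether TF actually sees the card and which CUDA/cuDNN versions the installed wheel was built against (assuming TF 2.3+ for tf.sysconfig.get_build_info(); the “CUDA Version: 10.2” in nvidia-smi above is only the highest version the 440.95.01 driver supports, not what TF itself links against):

import tensorflow as tf

print("TF version:", tf.__version__)
print("GPUs visible to TF:", tf.config.list_physical_devices("GPU"))

# Which CUDA / cuDNN the wheel was compiled against (available in TF >= 2.3).
build = tf.sysconfig.get_build_info()
print("Built against CUDA :", build.get("cuda_version"))
print("Built against cuDNN:", build.get("cudnn_version"))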
uname -a
Linux COMP 5.4.0-48-generic #52~18.04.1-Ubuntu SMP Thu Sep 10 12:50:22 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux