TensorFlow GPU - GPU detected but never used, and computer crashes on Windows 10 - RTX 2070

Hello,

I have an issue on my computer (GL704G W - Win10 Pro - RTX 2070). My setup:

  • Win10 Pro 64 bits
  • cuda_10.0.130_411.31
  • cudnn-10.0
  • python-3.6.8-amd64
  • vc_redist.x64
  • pip install tensorflow-gpu==1.10.0
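
As a sanity check on the list above: as far as I know, each official prebuilt tensorflow-gpu wheel is built against one specific CUDA version. A minimal sketch of that lookup follows. `expected_cuda` is a hypothetical helper, and the table covers only part of the 1.x line, taken from TensorFlow's "tested build configurations" page:

```python
# Hypothetical helper: CUDA version the official tensorflow-gpu 1.x wheels
# were built against, per TensorFlow's "tested build configurations" table.
def expected_cuda(tf_version):
    """Return the CUDA version string for a tensorflow-gpu 1.x release, or None."""
    parts = tf_version.split(".")
    major, minor = int(parts[0]), int(parts[1])
    if major != 1:
        return None  # this sketch only covers the 1.x line
    if 5 <= minor <= 12:
        return "9.0"
    if 13 <= minor <= 14:
        return "10.0"
    return None  # outside the range covered by this sketch


for version in ("1.10.0", "1.14.0"):
    print(version, "->", expected_cuda(version))
```

Under that table, tensorflow-gpu==1.10.0 would expect CUDA 9.0 rather than the CUDA 10.0 toolkit listed, while 1.13 and 1.14 match CUDA 10.0.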

I tested the TF GPU with:

import tensorflow as tf
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

returns:

WARNING: Logging before flag parsing goes to stderr.
W0711 16:04:51.333560 12692 deprecation_wrapper.py:119] From C:\Users\Manuel\PycharmProjects\testCUDA\cuda.py:2: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

W0711 16:04:51.334558 12692 deprecation_wrapper.py:119] From C:\Users\Manuel\PycharmProjects\testCUDA\cuda.py:2: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

2019-07-11 16:04:51.368338: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-07-11 16:04:51.376758: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library nvcuda.dll
2019-07-11 16:04:52.693641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.44
pciBusID: 0000:01:00.0
2019-07-11 16:04:52.702086: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-07-11 16:04:52.708494: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1763] Adding visible gpu devices: 0
2019-07-11 16:04:53.357195: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-11 16:04:53.363858: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0
2019-07-11 16:04:53.367232: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N
2019-07-11 16:04:53.371312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6315 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5
2019-07-11 16:04:53.390774: I tensorflow/core/common_runtime/direct_session.cc:296] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5

Another test:

import tensorflow as tf
tf.test.is_built_with_cuda()

returns:

True

And:

import tensorflow as tf
tf.test.is_gpu_available(cuda_only=False, min_cuda_compute_capability=None)

returns:

True

But when I run any TensorFlow GPU code, only the CPU is used, and after a long time my computer crashes and reboots.
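
The session test above creates no ops, so its log shows only the device mapping. One way to check whether ops really land on the GPU is to pin a small matmul to GPU:0 with soft placement disabled, then count the per-op placement lines printed to the console. A minimal sketch, assuming TensorFlow 1.13 or later (where `tf.compat.v1` exists). `run_pinned_matmul` and `count_placements` are hypothetical helper names, and the placement line format is assumed from typical `log_device_placement` output:

```python
import re
from collections import Counter

# Matches per-op placement lines such as
#   "MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0"
# The "Device mapping" lines contain "->" and are skipped below.
_PLACEMENT = re.compile(r"/device:(GPU|CPU):\d+\s*$")


def count_placements(log_text):
    """Count how many ops were placed on GPU vs CPU in a placement log."""
    counts = Counter()
    for line in log_text.splitlines():
        if "->" in line:
            continue  # device mapping line, not an op placement
        match = _PLACEMENT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def run_pinned_matmul():
    """Run a 2x2 matmul pinned to GPU:0; raises if it cannot be placed there."""
    import tensorflow as tf  # imported lazily; assumes TF >= 1.13

    tf1 = tf.compat.v1
    with tf.Graph().as_default():
        with tf.device("/device:GPU:0"):
            a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
            product = tf.matmul(a, a)
        # allow_soft_placement=False: no silent fallback to the CPU.
        config = tf1.ConfigProto(log_device_placement=True,
                                 allow_soft_placement=False)
        with tf1.Session(config=config) as sess:
            return sess.run(product)
```

Call `run_pinned_matmul()` from a console, copy the log output into a string, and pass it to `count_placements` to see the GPU/CPU split.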

Examples from: https://github.com/tensorflow/models.git

What is the problem? What did I install incorrectly?

Best,
Manuel

OK. I completely uninstalled Python, CUDA 10 and the libraries.

Then I followed this guide: https://www.pugetsystems.com/labs/hpc/How-to-Install-TensorFlow-with-GPU-Support-on-Windows-10-Without-Installing-CUDA-UPDATED-1419/

on my GL704G W - Win10 Pro - RTX 2070.

Now I can run keras/examples/deep_dream.py, but I get this error:

(tf-gpu) PS C:\Users\Manuel\demo\examples-keras> python .\deep_dream.py .\chien.png drm
Using TensorFlow backend.
WARNING:tensorflow:From C:\Users\Manuel\Anaconda3\envs\tf-gpu\lib\site-packages\tensorflow\python\framework\op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-07-16 17:15:30.733448: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2019-07-16 17:15:32.158666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.44
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.59GiB
2019-07-16 17:15:32.167946: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-07-16 17:15:32.683032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-16 17:15:32.687889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0
2019-07-16 17:15:32.690815: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N
2019-07-16 17:15:32.694737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6319 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
Model loaded.
WARNING:tensorflow:Variable += will be deprecated. Use variable.assign_add if you want assignment to the variable value or 'x = x + y' if you want a new python Tensor object.
Processing image shape (255, 255)
..Loss value at 0 : 1.0290155
..Loss value at 1 : 0.9662107
..Loss value at 2 : 0.92942023
..Loss value at 3 : 0.92720234
..Loss value at 4 : 0.93417346
..Loss value at 5 : 0.93499184
..Loss value at 6 : 0.93060154
..Loss value at 7 : 0.93919677
..Loss value at 8 : 0.9740615
..Loss value at 9 : 0.9510974
..Loss value at 10 : 0.95132095
..Loss value at 11 : 0.924647
..Loss value at 12 : 0.9184734
..Loss value at 13 : 0.9356839
..Loss value at 14 : 0.94341934
..Loss value at 15 : 0.9621455
..Loss value at 16 : 0.93943894
..Loss value at 17 : 0.93967533
..Loss value at 18 : 0.9238926
..Loss value at 19 : 0.95155436
Processing image shape (357, 357)
2019-07-16 17:15:47.761198: E tensorflow/stream_executor/cuda/cuda_driver.cc:981] failed to synchronize the stop event: CUDA_ERROR_ILLEGAL_INSTRUCTION: an illegal instruction was encountered
2019-07-16 17:15:47.768278: E tensorflow/stream_executor/cuda/cuda_timer.cc:55] Internal: error destroying CUDA event in context 000001D46CD2E8C0: CUDA_ERROR_ILLEGAL_INSTRUCTION: an illegal instruction was encountered
2019-07-16 17:15:47.778170: E tensorflow/stream_executor/cuda/cuda_timer.cc:60] Internal: error destroying CUDA event in context 000001D46CD2E8C0: CUDA_ERROR_ILLEGAL_INSTRUCTION: an illegal instruction was encountered
2019-07-16 17:15:47.785700: F tensorflow/stream_executor/cuda/cuda_dnn.cc:194] Check failed: status == CUDNN_STATUS_SUCCESS (7 vs. 0)Failed to set cuDNN stream.

I found nothing about “Failed to set cuDNN stream.”

Can someone help me?

The problem is that some code that TF ran on the GPU is doing something illegal. It’s impossible to say what that is exactly based on the posting here. The cudnn failure is just a side-effect of that: once the illegal operation happens on the GPU, any further attempts to use the GPU will fail.

On Windows, it's possible that you are hitting a WDDM TDR timeout on some TF kernel.
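
To judge whether a TDR timeout is plausible, one can look at the current TDR settings. A minimal sketch, assuming the documented registry location `HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers` and the value names `TdrDelay`, `TdrDdiDelay` and `TdrLevel`. `read_tdr_settings` is a hypothetical helper name; it only reads the registry and changes nothing:

```python
import sys

# Registry location of the Windows TDR (Timeout Detection and Recovery) settings.
TDR_KEY = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
TDR_VALUES = ("TdrDelay", "TdrDdiDelay", "TdrLevel")


def read_tdr_settings():
    """Return the TDR registry values as a dict, or None when not on Windows.

    A value of None for a name means it is not set in the registry,
    in which case Windows falls back to its built-in default.
    """
    if not sys.platform.startswith("win"):
        return None
    import winreg  # Windows-only standard library module

    settings = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TDR_KEY) as key:
        for name in TDR_VALUES:
            try:
                value, _value_type = winreg.QueryValueEx(key, name)
            except FileNotFoundError:
                value = None  # value not present under the key
            settings[name] = value
    return settings


print(read_tdr_settings())
```

Microsoft documents a default TdrDelay of 2 seconds when the value is absent, so a single GPU kernel that runs longer than that can trigger a driver reset.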

You might want to ask questions about TF on a TF support forum.

Hello,

Thanks for your help. I installed TensorFlow GPU 1.14 with CUDA 10.

Sometimes it works, and sometimes the computer crashes.

Hello, I just found a troubling fact. When the laptop is not plugged in, it runs more slowly and it is stable. This means that when the GPU is power-limited, it works.

If I plug in the charger, TF model detection generates a lot of false detections, then freezes the computer until the WDDM TDR timeout.

I do not understand why computing power would change the behavior of Tensorflow.
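
One way to compare the two cases is to log GPU temperature, power draw and clock speed during a run, once on battery and once on the charger. A minimal sketch, assuming `nvidia-smi` is on PATH (on Windows it may need to be added manually). `parse_smi_line` and `read_gpu_sample` are hypothetical helper names:

```python
import subprocess

# Fields requested from nvidia-smi, in the order they appear in each CSV line.
QUERY_FIELDS = ("temperature.gpu", "power.draw", "clocks.sm", "utilization.gpu")


def parse_smi_line(line):
    """Parse one CSV line from nvidia-smi into a dict of floats (None if unavailable)."""
    values = [field.strip() for field in line.split(",")]
    sample = {}
    for name, raw in zip(QUERY_FIELDS, values):
        try:
            sample[name] = float(raw)
        except ValueError:
            sample[name] = None  # e.g. "[N/A]" when the GPU does not report it
    return sample


def read_gpu_sample():
    """Query the first GPU once; requires nvidia-smi on PATH."""
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        universal_newlines=True,
    )
    return parse_smi_line(output.splitlines()[0])
```

Calling `read_gpu_sample()` about once per second while the model runs would show whether the freezes line up with high temperature, high power draw, or higher clocks.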

@numael Did your issue get resolved? I am having the same issue.

Hello. I am not certain yet, but this seems to be a hardware problem. The GPU seems defective. Maybe a cooling problem…