== Misbehavior ==
- running the hello-world TensorFlow 2 examples leads to the following error (a minimal reproduction sketch follows the excerpt):
2020-06-27 14:45:27.411523: E tensorflow/core/kernels/gpu_utils.cc:85] Detected cudnn out-of-bounds write in convolution buffer! This is likely a cudnn bug. We will skip this algorithm in the future, but your GPU state may already be corrupted, leading to incorrect results. Within Google, no action is needed on your part. Outside of Google, please ensure you’re running the latest version of cudnn. If that doesn’t fix the problem, please file a bug with this full error message and we’ll contact nvidia.
2020-06-27 14:45:27.411571: E tensorflow/core/kernels/gpu_utils.cc:93] Redzone mismatch in RHS redzone of buffer 0x7fd474cbc000 at offset 1950752; expected ffffffffffffffff but was ff00ff04ff02ff.
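The script in the transcript further below (7-estimators-multi-gpus.py) is not reproduced here; a minimal sketch of that kind of "hello world" multi-GPU run, assuming tf.distribute.MirroredStrategy and the built-in MNIST data (model and optimizer choices are illustrative, not the exact contents of the tutorial script):

import tensorflow as tf

# Minimal sketch of the kind of run that triggers the errors.
# Assumes TensorFlow 2.x with two visible GPUs.
strategy = tf.distribute.MirroredStrategy()  # use both 2080 Tis

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = (x_train / 255.0)[..., tf.newaxis].astype("float32")

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu",
                               input_shape=(28, 28, 1)),  # conv op goes through cuDNN
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

model.fit(x_train, y_train, epochs=10, batch_size=512)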
== Setup ==
- ubuntu 20.04 LTS
- 2x 2080TI
- nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:68:00.0 Off |                  N/A |
| 30%   45C    P2    63W / 250W |  10916MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:6B:00.0 Off |                  N/A |
| 23%   37C    P8    30W / 250W |  10916MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    100689      C   python                                     10905MiB |
|    1    100689      C   python                                     10905MiB |
+-----------------------------------------------------------------------------+
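Note that nvidia-smi shows nearly all memory on both GPUs held by the single python process; that is TensorFlow's default allocate-almost-everything behavior, not part of the failure itself. For reference when reading the output, a small sketch (assuming TF 2.x) that makes allocations grow on demand instead:

import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing ~11 GiB per card up
# front; this only changes what nvidia-smi reports, it is not a fix for
# the Xid 13 / cuDNN redzone errors described here.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)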
======= dmesg
[ 4515.542279] NVRM: GPU at PCI:0000:68:00: GPU-1f9c5af8-b64d-9416-f95f-80db850bf412
[ 4515.542287] NVRM: GPU Board Serial Number:
[ 4515.542295] NVRM: Xid (PCI:0000:68:00): 13, pid=95682, Graphics SM Warp Exception on (GPC 1, TPC 0, SM 0): Illegal Instruction Encoding
[ 4515.550829] NVRM: Xid (PCI:0000:68:00): 13, pid=95682, Graphics Exception: ESR 0x50c730=0xf0009 0x50c734=0x20 0x50c728=0x4c1eb72 0x50c72c=0x174
[ 4515.557725] NVRM: Xid (PCI:0000:68:00): 13, pid=95682, Graphics SM Warp Exception on (GPC 2, TPC 0, SM 0): Illegal Instruction Encoding
[ 4515.562919] NVRM: Xid (PCI:0000:68:00): 13, pid=95682, Graphics Exception: ESR 0x514730=0xf0009 0x514734=0x20 0x514728=0x4c1eb72 0x51472c=0x174
[ 4515.569213] NVRM: Xid (PCI:0000:68:00): 43, pid=95682, Ch 00000008
[ 4710.367956] NVRM: GPU at PCI:0000:68:00: GPU-1f9c5af8-b64d-9416-f95f-80db850bf412
[ 4710.367963] NVRM: GPU Board Serial Number:
[ 4710.367971] NVRM: Xid (PCI:0000:68:00): 13, pid=100731, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 0): Illegal Instruction Encoding
[ 4710.376348] NVRM: Xid (PCI:0000:68:00): 13, pid=100731, Graphics SM Global Exception on (GPC 0, TPC 0, SM 0): Multiple Warp Errors
[ 4710.383058] NVRM: Xid (PCI:0000:68:00): 13, pid=100731, Graphics Exception: ESR 0x504730=0xf0009 0x504734=0x24 0x504728=0x4c1eb72 0x50472c=0x174
[ 4710.388808] NVRM: Xid (PCI:0000:68:00): 13, pid=100731, Graphics SM Warp Exception on (GPC 0, TPC 0, SM 1): Illegal Instruction Encoding
[ 4710.393239] NVRM: Xid (PCI:0000:68:00): 13, pid=100731, Graphics SM Global Exception on (GPC 0, TPC 0, SM 1): Multiple Warp Errors
[ 4710.397273] NVRM: Xid (PCI:0000:68:00): 13, pid=100731, Graphics Exception: ESR 0x5047b0=0x10009 0x5047b4=0x24 0x5047a8=0x4c1eb72 0x5047ac=0x174
[ 4710.400301] NVRM: Xid (PCI:0000:68:00): 13, pid=100731, Graphics SM Warp Exception on (GPC 0, TPC 1, SM 0): Illegal Instruction Encoding
[ 4710.403195] NVRM: Xid (PCI:0000:68:00): 13, pid=100731, Graphics SM Global Exception on (GPC 0, TPC 1, SM 0): Multiple Warp Errors
[ 4710.406097] NVRM: Xid (PCI:0000:68:00): 13, pid=100731, Graphics Exception: ESR 0x504f30=0xc0009 0x504f34=0x24 0x504f28=0x4c1eb72 0x504f2c=0x174
[ 4710.408410] NVRM: Xid (PCI:0000:68:00): 13, pid=100731, Graphics SM Warp Exception on (GPC 0, TPC 1, SM 1): Illegal Instruction Encoding
[ 4710.410346] NVRM: Xid (PCI:0000:68:00): 13, pid=100731, Graphics SM Global Exception on (GPC 0, TPC 1, SM 1): Multiple Warp Errors
[ 4710.412242] NVRM: Xid (PCI:0000:68:00): 13, pid=100731, Graphics Exception: ESR 0x504fb0=0x30009 0x504fb4=0x24 0x504fa8=0x4c1eb72 0x504fac=0x174
[ 4710.414263] NVRM: Xid (PCI:0000:68:00): 13, pid=100731, Graphics SM Warp Exception on (GPC 0, TPC 2, SM 0): Illegal Instruction Encoding
[ 4710.416157] NVRM: Xid (PCI:0000:68:00): 13, pid=100731, Graphics SM Global Exception on (GPC 0, TPC 2, SM 0): Multiple Warp Errors
======= further detail:
(cv) shoeman@skynet:~/Downloads/tf-keras-tutorial$ python 7-estimators-multi-gpus.py
2020-06-27 14:45:21.051668: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-06-27 14:45:21.107273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:68:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-06-27 14:45:21.109144: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties:
pciBusID: 0000:6b:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-06-27 14:45:21.109320: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-27 14:45:21.110388: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-27 14:45:21.111551: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-27 14:45:21.111813: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-27 14:45:21.113145: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-27 14:45:21.113839: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-27 14:45:21.116685: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-27 14:45:21.125528: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1
2020-06-27 14:45:21.125875: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-06-27 14:45:21.151513: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2100000000 Hz
2020-06-27 14:45:21.154983: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fd8b8000b60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-27 14:45:21.155040: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-06-27 14:45:21.598152: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55ad1c0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-06-27 14:45:21.598219: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-06-27 14:45:21.598240: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-06-27 14:45:21.601324: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:68:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-06-27 14:45:21.603669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties:
pciBusID: 0000:6b:00.0 name: GeForce RTX 2080 Ti computeCapability: 7.5
coreClock: 1.545GHz coreCount: 68 deviceMemorySize: 10.76GiB deviceMemoryBandwidth: 573.69GiB/s
2020-06-27 14:45:21.603753: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-27 14:45:21.603808: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-27 14:45:21.603842: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-06-27 14:45:21.603875: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-06-27 14:45:21.603908: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-06-27 14:45:21.603940: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-06-27 14:45:21.603973: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-27 14:45:21.612423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1
2020-06-27 14:45:21.612511: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-06-27 14:45:21.616981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-27 14:45:21.617006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108] 0 1
2020-06-27 14:45:21.617017: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0: N N
2020-06-27 14:45:21.617025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 1: N N
2020-06-27 14:45:21.621273: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10203 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:68:00.0, compute capability: 7.5)
2020-06-27 14:45:21.623360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10203 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:6b:00.0, compute capability: 7.5)
Epoch 1/10
2020-06-27 14:45:25.960138: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-06-27 14:45:26.308760: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-27 14:45:27.411523: E tensorflow/core/kernels/gpu_utils.cc:85] Detected cudnn out-of-bounds write in convolution buffer! This is likely a cudnn bug. We will skip this algorithm in the future, but your GPU state may already be corrupted, leading to incorrect results. Within Google, no action is needed on your part. Outside of Google, please ensure you’re running the latest version of cudnn. If that doesn’t fix the problem, please file a bug with this full error message and we’ll contact nvidia.
2020-06-27 14:45:27.411571: E tensorflow/core/kernels/gpu_utils.cc:93] Redzone mismatch in RHS redzone of buffer 0x7fd474cbc000 at offset 1950752; expected ffffffffffffffff but was ff00ff04ff02ff.
2020-06-27 14:45:27.517115: E tensorflow/core/kernels/gpu_utils.cc:85] Detected cudnn out-of-bounds write in convolution buffer! This is likely a cudnn bug. We will skip this algorithm in the future, but your GPU state may already be corrupted, leading to incorrect results. Within Google, no action is needed on your part. Outside of Google, please ensure you’re running the latest version of cudnn. If that doesn’t fix the problem, please file a bug with this full error message and we’ll contact nvidia.
2020-06-27 14:45:27.517159: E tensorflow/core/kernels/gpu_utils.cc:93] Redzone mismatch in LHS redzone of buffer 0x7fd474eee000 at offset 2173952; expected ffffffffffffffff but was 30ff00ff3cff7aff.
2020-06-27 14:45:27.554475: E tensorflow/core/kernels/gpu_utils.cc:85] Detected cudnn out-of-bounds write in convolution buffer! This is likely a cudnn bug. We will skip this algorithm in the future, but your GPU state may already be corrupted, leading to incorrect results. Within Google, no action is needed on your part. Outside of Google, please ensure you’re running the latest version of cudnn. If that doesn’t fix the problem, please file a bug with this full error message and we’ll contact nvidia.
2020-06-27 14:45:27.554520: E tensorflow/core/kernels/gpu_utils.cc:93] Redzone mismatch in RHS redzone of buffer 0x7fd474eee000 at offset 16392; expected ffffffffffffffff but was ffffffffffffdfff.
2020-06-27 14:45:27.902493: E tensorflow/core/kernels/gpu_utils.cc:85] Detected cudnn out-of-bounds write in convolution buffer! This is likely a cudnn bug. We will skip this algorithm in the future, but your GPU state may already be corrupted, leading to incorrect results. Within Google, no action is needed on your part. Outside of Google, please ensure you’re running the latest version of cudnn. If that doesn’t fix the problem, please file a bug with this full error message and we’ll contact nvidia.
2020-06-27 14:45:27.902541: E tensorflow/core/kernels/gpu_utils.cc:93] Redzone mismatch in LHS redzone of buffer 0x7fd47703a800 at offset 2921504; expected ffffffffffffffff but was ff00ff00ff00ff00.
2020-06-27 14:45:27.931378: E tensorflow/core/kernels/gpu_utils.cc:85] Detected cudnn out-of-bounds write in convolution buffer! This is likely a cudnn bug. We will skip this algorithm in the future, but your GPU state may already be corrupted, leading to incorrect results. Within Google, no action is needed on your part. Outside of Google, please ensure you’re running the latest version of cudnn. If that doesn’t fix the problem, please file a bug with this full error message and we’ll contact nvidia.
2020-06-27 14:45:27.931428: E tensorflow/core/kernels/gpu_utils.cc:93] Redzone mismatch in LHS redzone of buffer 0x7fd476028800 at offset 8005632; expected ffffffffffffffff but was ff00ff00ff00ff00.
118/118 [==============================] - 1s 10ms/step - accuracy: 0.6611 - loss: 0.9435
Epoch 2/10
118/118 [==============================] - 1s 9ms/step - accuracy: 0.8058 - loss: 0.5285
Epoch 3/10
118/118 [==============================] - 1s 9ms/step - accuracy: 0.8310 - loss: 0.4682
Epoch 4/10
118/118 [==============================] - 1s 9ms/step - accuracy: 0.8481 - loss: 0.4247
Epoch 5/10
118/118 [==============================] - 1s 9ms/step - accuracy: 0.8616 - loss: 0.3883
Epoch 6/10
118/118 [==============================] - 1s 9ms/step - accuracy: 0.8709 - loss: 0.3588
Epoch 7/10
118/118 [==============================] - 1s 9ms/step - accuracy: 0.8774 - loss: 0.3377
Epoch 8/10
118/118 [==============================] - 1s 9ms/step - accuracy: 0.2608 - loss: 0.0661
Epoch 9/10
118/118 [==============================] - 1s 9ms/step - accuracy: 0.0999 - loss: 1.1921e-07
Epoch 10/10
109/118 [==========================>...] - ETA: 0s - accuracy: 0.1005 - loss: 1.1921e-07
2020-06-27 14:45:39.726098: E tensorflow/stream_executor/cuda/cuda_driver.cc:910] failed to synchronize the stop event: CUDA_ERROR_ILLEGAL_INSTRUCTION: an illegal instruction was encountered
2020-06-27 14:45:39.726121: E tensorflow/stream_executor/gpu/gpu_timer.cc:55] Internal: Error destroying CUDA event: CUDA_ERROR_ILLEGAL_INSTRUCTION: an illegal instruction was encountered
2020-06-27 14:45:39.726127: E tensorflow/stream_executor/gpu/gpu_timer.cc:60] Internal: Error destroying CUDA event: CUDA_ERROR_ILLEGAL_INSTRUCTION: an illegal instruction was encountered
2020-06-27 14:45:39.726139: I tensorflow/stream_executor/cuda/cuda_driver.cc:763] failed to allocate 8B (8 bytes) from device: CUDA_ERROR_ILLEGAL_INSTRUCTION: an illegal instruction was encountered
2020-06-27 14:45:39.726161: E tensorflow/stream_executor/stream.cc:5485] Internal: Failed to enqueue async memset operation: CUDA_ERROR_ILLEGAL_INSTRUCTION: an illegal instruction was encountered
2020-06-27 14:45:39.726181: W tensorflow/core/kernels/gpu_utils.cc:69] Failed to check cudnn convolutions for out-of-bounds reads and writes with an error message: ‘Failed to load in-memory CUBIN: CUDA_ERROR_ILLEGAL_INSTRUCTION: an illegal instruction was encountered’; skipping this check. This only means that we won’t check cudnn for out-of-bounds reads and writes. This message will only be printed once.
2020-06-27 14:45:39.726187: I tensorflow/stream_executor/cuda/cuda_driver.cc:763] failed to allocate 8B (8 bytes) from device: CUDA_ERROR_ILLEGAL_INSTRUCTION: an illegal instruction was encountered
2020-06-27 14:45:39.726195: I tensorflow/stream_executor/stream.cc:4963] [stream=0x55c6d20,impl=0x55b9050] did not memzero GPU location; source: 0x7fd7437fc020
^C
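Since the error message asks to confirm the cuDNN version, here is a quick sanity check of what the environment reports (a sketch; tf.sysconfig.get_build_info() only exists on newer TF 2.x releases, hence the guard):

import tensorflow as tf

# Versions as seen by this environment; note the driver-level CUDA version
# reported by nvidia-smi (10.2) is independent of the toolkit TF links
# against (libcudart.so.10.1 in the log above).
print("TF:", tf.__version__)
print("GPUs:", tf.config.list_physical_devices("GPU"))

build_info = getattr(tf.sysconfig, "get_build_info", None)
if build_info is not None:
    info = build_info()
    print("built against CUDA:", info.get("cuda_version"))
    print("built against cuDNN:", info.get("cudnn_version"))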
======= gpu burn