I have recently begun working remotely on a Deep Learning machine, with a pair of Titan RTX GPUs (24GB RAM each), running Ubuntu 18.04. The machine is brand new, and everything was working fine for about 10 days, but I am currently experiencing intermittent errors when running my ML training jobs. I typically get errors of the form:
2020-06-12 00:14:01.824110: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2020-06-12 00:14:01.824142: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
2020-06-12 00:14:01.824177: E tensorflow/stream_executor/cuda/cuda_driver.cc:1045] failed to enqueue async memcpy from host to device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; GPU dst: 0x7f16c3b35300; host src: 0x7f1688606e00; size: 512=0x200
As you can see, I am using TensorFlow, specifically TensorFlow 2.2.0 (I tried rolling back to 2.1.0, but the same errors occurred). I understand that, because of CUDA's asynchronous nature, the printed error may not reflect the real, underlying error, but running my training script with CUDA_LAUNCH_BLOCKING=1 returns no consistent errors (a sketch of how I set that variable is included after the sample output below).
A few of the CUDA samples I have run also return errors, for example matrixMulCUBLAS:
[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "TITAN RTX" with compute capability 7.5
GPU Device 0: "TITAN RTX" with compute capability 7.5
MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
CUDA error at matrixMulCUBLAS.cpp:258 code=13(CUBLAS_STATUS_EXECUTION_FAILED) "cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, matrix_size.uiWB, matrix_size.uiHA, matrix_size.uiWA, &alpha, d_B, matrix_size.uiWB, d_A, matrix_size.uiWA, &beta, d_C, matrix_size.uiWB)"
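For reference, this is roughly how I set CUDA_LAUNCH_BLOCKING when testing. It is only a sketch (the tiny matmul stands in for my actual training step), but the variable has to be in the environment before TensorFlow initializes CUDA, which is why I set it before the import:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import tensorflow as tf

# Tiny GPU op standing in for a training step; with blocking launches, any
# error should be reported on the op that actually failed rather than later.
with tf.device("/GPU:0"):
    x = tf.random.normal([1024, 1024])
    y = tf.matmul(x, x)
print(float(tf.reduce_mean(y)))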
While my training code is running, the machine's CPU usage sits at 100%, or even slightly higher (how is that even possible?). This happens even when running jobs with small batch sizes, and I don't understand what could be causing it. The script in question runs without issue on a Windows machine I have available, which has a single GPU, and also on Google Colab.
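One thing I am considering, just to check whether the CPU saturation comes from TensorFlow's own thread pools rather than my input pipeline, is pinning the thread counts. A rough sketch (the counts are arbitrary, and this has to run before any ops execute):

import tensorflow as tf

# Limit TF's CPU thread pools (values chosen arbitrarily); this should be
# called before any op runs, otherwise TF may refuse to change the setting.
tf.config.threading.set_intra_op_parallelism_threads(4)
tf.config.threading.set_inter_op_parallelism_threads(2)

print(tf.config.threading.get_intra_op_parallelism_threads())
print(tf.config.threading.get_inter_op_parallelism_threads())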
I have tried running cuda-memcheck with my script, but it runs the script incredibly slowly (28 sec per training step, as opposed to 0.06 sec without it), and the CPU shoots up to 100%.
When I first started using the machine, TensorFlow complained about not being able to find CUDA libraries such as libcublas, which I fixed by installing CUDA according to the instructions on the TensorFlow website. In my ~/.profile I set LD_LIBRARY_PATH as:
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.1/extras/CUPTI/lib64
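After setting these, I sanity-check that TensorFlow can actually load the CUDA libraries and sees both cards with something along these lines:

import tensorflow as tf

print(tf.__version__)                          # 2.2.0 here
print(tf.test.is_built_with_cuda())            # expect True
print(tf.config.list_physical_devices("GPU"))  # expect both Titan RTX cards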
The TensorFlow CUDA installation instructions for Ubuntu 18.04 specified the 340 driver, but when I run nvidia-smi I see the 440 driver listed. To find all available NVIDIA drivers I run apt-cache search nvidia | grep -P '^nvidia-(driver-)?[0-9]+\s' and get:
nvidia-331 - Transitional package for nvidia-331
nvidia-346 - Transitional package for nvidia-346
nvidia-352 - Transitional package for nvidia-361
nvidia-361 - Transitional package for nvidia-367
nvidia-367 - Transitional package for nvidia-375
nvidia-375 - Transitional package for nvidia-384
nvidia-driver-390 - NVIDIA driver metapackage
nvidia-340 - NVIDIA binary driver - version 340.108
nvidia-driver-418 - Transitional package for nvidia-driver-430
nvidia-driver-430 - Transitional package for nvidia-driver-440
nvidia-driver-435 - NVIDIA driver metapackage
nvidia-driver-440 - NVIDIA driver metapackage
nvidia-driver-450 - NVIDIA driver metapackage
nvidia-384 - Transitional package for nvidia-driver-418
nvidia-driver-410 - NVIDIA driver metapackage
So I'm wondering whether this is a driver conflict, a CUDA library issue, or - given that everything was working fine for about 10 days - a hardware issue (I sincerely hope not).
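For the hardware hypothesis, one thing I plan to try is putting a sustained load on each card separately, to see whether the illegal-address errors follow a particular GPU. A rough sketch of what I have in mind:

import tensorflow as tf

# Hammer each GPU in turn with large matmuls; if the CUDA_ERROR_ILLEGAL_ADDRESS
# errors only ever show up on one of the two cards, that points at hardware.
for gpu in ("/GPU:0", "/GPU:1"):
    with tf.device(gpu):
        x = tf.random.normal([4096, 4096])
        for _ in range(200):
            x = tf.matmul(x, x) / 4096.0  # rescale so values stay finite
    print(gpu, "ok, mean =", float(tf.reduce_mean(x)))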
Many thanks
UPDATE
I was able to run cuda-memcheck with a batch size of 16 (at 16 sec per training step), and I immediately got:
========= Invalid __global__ read of size 4
========= at 0x00000f20 in volta_scudnn_128x64_relu_interior_nn_v1
========= by thread (101,0,0) in block (73,1,0)
========= Address 0x5f3e44e09250 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x346) [0x2af0b6]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x1697329]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x16973b7]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x16cd705]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x1025adb]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x1025afe]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xa6048e]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x95901d]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xdcb3d]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xdd03f]
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 (cudnnConvolutionForward + 0x2ca) [0xde27a]
A bit after that I see
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuEventSynchronize.
Apparently this corresponds to a segmentation fault?
The error message also later includes the TensorFlow stacktrace, which prints
Internal: cuDNN launch failure : input shape([16,256,1,1039]) filter shape([1,16,256,1024])
This seems to relate to a tf.keras.layers.Conv1D layer in my code. Bafflingly, I am currently running the training script again with the exact same parameters, without any errors.
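In case it helps, this is roughly how I am trying to reproduce that single layer in isolation. The mapping from the logged cuDNN shapes to the layer parameters is my guess (input [16,256,1,1039] read as batch 16, 256 channels, length 1039, and filter [1,16,256,1024] read as kernel size 16, 256 input channels, 1024 filters), and the real layer in my model may differ in padding or strides:

import numpy as np
import tensorflow as tf

# Guessed reconstruction of the failing layer from the cuDNN error shapes.
layer = tf.keras.layers.Conv1D(filters=1024, kernel_size=16, padding="same")

x = np.random.randn(16, 1039, 256).astype(np.float32)  # (batch, steps, channels)
with tf.device("/GPU:0"):
    y = layer(x)
print(y.shape)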