Intermittent CUDA_ERROR_ILLEGAL_ADDRESS error on Ubuntu 18.04 with TensorFlow 2.2.0

I have recently begun working remotely on a Deep Learning machine, with a pair of Titan RTX GPUs (24GB RAM each), running Ubuntu 18.04. The machine is brand new, and everything was working fine for about 10 days, but I am currently experiencing intermittent errors when running my ML training jobs. I typically get errors of the form:

2020-06-12 00:14:01.824110: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2020-06-12 00:14:01.824142: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
2020-06-12 00:14:01.824177: E tensorflow/stream_executor/cuda/cuda_driver.cc:1045] failed to enqueue async memcpy from host to device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered; GPU dst: 0x7f16c3b35300; host src: 0x7f1688606e00; size: 512=0x200

As you can see I am using TensorFlow, specifically TensorFlow 2.2.0 (I tried rolling back to 2.1.0, but the same errors occurred). I understand that due to CUDA’s async nature the printed error might not reflect the real, deeper error, but running my training script with CUDA_LAUNCH_BLOCKING=1 returns no consistent errors. A few of the CUDA samples I have run also return errors, for example matrixMulCUBLAS:

[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "TITAN RTX" with compute capability 7.5

GPU Device 0: "TITAN RTX" with compute capability 7.5

MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
CUDA error at matrixMulCUBLAS.cpp:258 code=13(CUBLAS_STATUS_EXECUTION_FAILED) "cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, matrix_size.uiWB, matrix_size.uiHA, matrix_size.uiWA, &alpha, d_B, matrix_size.uiWB, d_A, matrix_size.uiWA, &beta, d_C, matrix_size.uiWB)"
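For the CUDA_LAUNCH_BLOCKING=1 run mentioned above, a minimal sketch of setting the variable from inside the script rather than the shell. It assumes the variable has to be in place before CUDA is initialized, so it is set before TensorFlow is imported; the import is left as a comment so the snippet runs on any machine.

```python
import os

# Set the variable before CUDA is initialized; doing it before the
# TensorFlow import is the safe choice.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import tensorflow as tf  # import only after the variable is set

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```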

While running my training code the machine’s CPU is at 100%, or even slightly higher (how is that even possible?). This happens even when running jobs with small batch sizes. I don’t understand what could be happening there. The script in question runs without issue on a Windows machine I have available, which has 1 GPU, and also on Google Colab.

I have tried running cuda-memcheck with my script, but it runs the script incredibly slowly (28sec per training step, as opposed to 0.06sec without it), and the CPU shoots up to 100%.

When I first started using the machine TensorFlow complained about not being able to find CUDA libraries like libcublas, which I fixed by installing CUDA according to the instructions on the TensorFlow website. In my ~/.profile I set LD_LIBRARY_PATH as:

export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.1/extras/CUPTI/lib6
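Since a mistyped entry in LD_LIBRARY_PATH is easy to miss, here is a small sketch that lists any entries that do not exist as directories on disk. The helper name missing_library_dirs is hypothetical, and the check only verifies that each directory exists, not that the expected libraries are inside it.

```python
import os

def missing_library_dirs(path_value):
    """Return the entries of a colon-separated path list that are not directories."""
    return [d for d in path_value.split(":") if d and not os.path.isdir(d)]

# Check the value seen by the current process.
missing = missing_library_dirs(os.environ.get("LD_LIBRARY_PATH", ""))
for entry in missing:
    print("not a directory:", entry)
```

Run it from the same shell that launches the training script, so it sees the same environment.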

The TensorFlow Ubuntu 18.04 CUDA installation instructions specified the CUDA 340 driver. When I run nvidia-smi, however, I see the 440 driver listed. To find all available NVIDIA drivers I run apt-cache search nvidia | grep -P '^nvidia-(driver-)?[0-9]+\s' and get

nvidia-331 - Transitional package for nvidia-331
nvidia-346 - Transitional package for nvidia-346
nvidia-352 - Transitional package for nvidia-361
nvidia-361 - Transitional package for nvidia-367
nvidia-367 - Transitional package for nvidia-375
nvidia-375 - Transitional package for nvidia-384
nvidia-driver-390 - NVIDIA driver metapackage
nvidia-340 - NVIDIA binary driver - version 340.108
nvidia-driver-418 - Transitional package for nvidia-driver-430
nvidia-driver-430 - Transitional package for nvidia-driver-440
nvidia-driver-435 - NVIDIA driver metapackage
nvidia-driver-440 - NVIDIA driver metapackage
nvidia-driver-450 - NVIDIA driver metapackage
nvidia-384 - Transitional package for nvidia-driver-418
nvidia-driver-410 - NVIDIA driver metapackage

So I’m wondering if this is a driver conflict. Or perhaps it’s a CUDA library issue? Or perhaps - given that everything was working fine for about 10 days - it’s a hardware issue (I sincerely hope not).
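To see which driver version each GPU actually reports, something like the following sketch could be used. It assumes nvidia-smi is on the PATH and relies on its --query-gpu and --format=csv,noheader options; the helper names parse_gpu_csv and query_gpus are hypothetical. If nvidia-smi is missing or fails, it simply returns an empty list.

```python
import shutil
import subprocess

def parse_gpu_csv(text):
    """Parse 'name, driver_version' CSV lines into (name, version) tuples."""
    rows = []
    for line in text.strip().splitlines():
        name, version = [field.strip() for field in line.split(",", 1)]
        rows.append((name, version))
    return rows

def query_gpus():
    """Ask nvidia-smi for the name and driver version of every GPU."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version",
             "--format=csv,noheader"],
            stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
            universal_newlines=True, check=True)
    except (subprocess.CalledProcessError, OSError):
        return []
    return parse_gpu_csv(result.stdout)

for name, version in query_gpus():
    print(name, "driver", version)
```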

Many thanks

UPDATE

I was able to run cuda-memcheck with a batch size of 16 (at 16sec per training step), and I immediately got

========= Invalid __global__ read of size 4
=========     at 0x00000f20 in volta_scudnn_128x64_relu_interior_nn_v1
=========     by thread (101,0,0) in block (73,1,0)
=========     Address 0x5f3e44e09250 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x346) [0x2af0b6]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x1697329]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x16973b7]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x16cd705]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x1025adb]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x1025afe]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xa6048e]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0x95901d]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xdcb3d]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 [0xdd03f]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcudnn.so.7 (cudnnConvolutionForward + 0x2ca) [0xde27a]

A bit after that I see

========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuEventSynchronize.

Apparently this means a Segmentation fault?

The error message also later includes the TensorFlow stacktrace, which prints

Internal: cuDNN launch failure : input shape([16,256,1,1039]) filter shape([1,16,256,1024])

This seems to relate to a tf.keras.layers.Conv1D in my code. Bafflingly I am currently running the training script again with the exact same params without any errors.
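To make sense of the shapes in that message, here is a rough sketch that decodes them. It assumes the input is reported as NCHW and the filter as (height, width, in_channels, out_channels), with 'valid' padding and float32 data; none of that is stated in the log itself, and the helper names are hypothetical.

```python
def conv_output_width(input_width, filter_width, stride=1):
    """Output width of a convolution with 'valid' padding (an assumption here)."""
    return (input_width - filter_width) // stride + 1

def describe(input_shape, filter_shape, bytes_per_element=4):
    # Assumed layouts: input is NCHW, filter is (height, width, in, out).
    n, c, h, w = input_shape
    fh, fw, c_in, c_out = filter_shape
    if c != c_in:
        raise ValueError("input channels %d != filter in_channels %d" % (c, c_in))
    out_h = h - fh + 1
    out_w = conv_output_width(w, fw)
    return {
        "output_shape": (n, c_out, out_h, out_w),
        "input_bytes": n * c * h * w * bytes_per_element,
        "output_bytes": n * c_out * out_h * out_w * bytes_per_element,
    }

# Shapes copied from the cuDNN launch failure message.
info = describe((16, 256, 1, 1039), (1, 16, 256, 1024))
print(info)
```

Under these assumptions, the layer is a Conv1D with kernel size 16, 256 input channels and 1024 filters applied to 1039 time steps. The tensors involved are tens of megabytes, far below the 24GB on each card, so this does not look like a plain out-of-memory situation.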

Hi Chris,
"an illegal memory access was encountered"
I think the log line above is the hint to your problem. This error usually occurs when there is a mismatch between the amount of memory your code declares and the memory addresses the system actually needs to access. So check whether the sizes you use when initializing your layers are smaller than the sizes the layers are called with.
Hope it goes well.

Thank you very much for your response. Unfortunately I am getting the same intermittent errors with some simple TensorFlow benchmark code:

# Adapted from https://stackoverflow.com/q/58441514/795131
import tensorflow as tf
import numpy as np
from time import time
import argparse

def timeit(func, iterations, *args):
    t0 = time()
    for _ in range(iterations):
        func(*args)
    print("Time/iter: %.4f sec" % ((time() - t0) / iterations))

def make_data(batch_shape):
    return np.random.randn(*batch_shape), np.random.randint(0, 256, (batch_shape[0], batch_shape[1]))

class TFModel(tf.keras.Model):

    def __init__(self):
        super().__init__()

        self.input_expand = tf.keras.layers.Conv1D(filters=1024, kernel_size=1)

        self.rnn1 = tf.keras.layers.GRU(
            units=1024,
            return_sequences=True,
            return_state=True,
            stateful=True)

        self.rnn2 = tf.keras.layers.GRU(
            units=1024,
            return_sequences=True,
            stateful=True)

        self.dense1 = tf.keras.layers.Dense(1024, activation='relu')
        self.dense2 = tf.keras.layers.Dense(256, activation='relu')

    def call(self, inputs):
        batch_size = tf.shape(inputs)[0]
        inputs = self.input_expand(inputs)
        (rnn_frames, rnn_state) = self.rnn1(inputs)
        rnn_frames = self.rnn2(rnn_frames, rnn_state)
        out = self.dense1(rnn_frames)
        return self.dense2(out)

def get_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument('--batch_size', type=int, default=32)
    parser.add_argument('--time_steps', type=int, default=16)
    parser.add_argument('--features', type=int, default=64)
    parser.add_argument('--iters', type=int, default=200)
    return parser.parse_args()

def main():
    args = get_arguments()
    batch_shape = (args.batch_size, args.time_steps, args.features)
    X, y = make_data(batch_shape)
    opt = tf.optimizers.Adam()
    compute_loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    model = TFModel()
    model.compile(optimizer=opt, loss=compute_loss)
    model(X)
    model.reset_states()
    timeit(model.train_on_batch, args.iters, X, y)

if __name__ == '__main__':
    main()

Sometimes it runs fine, sometimes I get an error which starts with

2020-06-15 23:35:42.086146: E tensorflow/stream_executor/cuda/cuda_driver.cc:910] failed to synchronize the stop event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2020-06-15 23:35:42.086173: E tensorflow/stream_executor/gpu/gpu_timer.cc:55] Internal: Error destroying CUDA event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2020-06-15 23:35:42.086178: E tensorflow/stream_executor/gpu/gpu_timer.cc:60] Internal: Error destroying CUDA event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2020-06-15 23:35:42.086207: I tensorflow/stream_executor/cuda/cuda_driver.cc:763] failed to allocate 8B (8 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2020-06-15 23:35:42.086212: E tensorflow/stream_executor/stream.cc:5485] Internal: Failed to enqueue async memset operation: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2020-06-15 23:35:42.086219: W tensorflow/core/kernels/gpu_utils.cc:69] Failed to check cudnn convolutions for out-of-bounds reads and writes with an error message: 'Failed to load in-memory CUBIN: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered'; skipping this check. This only means that we won't check cudnn for out-of-bounds reads and writes. This message will only be printed once.
2020-06-15 23:35:42.086225: I tensorflow/stream_executor/cuda/cuda_driver.cc:763] failed to allocate 8B (8 bytes) from device: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2020-06-15 23:35:42.086258: I tensorflow/stream_executor/stream.cc:4963] [stream=0x564cc35d6380,impl=0x564cc35d53c0] did not memzero GPU location; source: 0x7f330bffd020

And ends with

tensorflow.python.framework.errors_impl.InternalError:  cuDNN launch failure : input shape([32,64,1,16]) filter shape([1,1,64,1024])
	 [[node tf_model/conv1d/conv1d (defined at tf_benchmarks.py:39) ]] [Op:__inference_train_function_3928]

Function call stack:
train_function

Larger batch sizes (say 128) seem to always cause this error, but also sometimes smaller ones. The same code runs fine every time on Colab and on my local machine. So my suspicion is that indeed it’s a CUDA installation issue. I doubt it’s a hardware problem, but it could be of course. But before investigating that I’m going to purge CUDA and re-install it.
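Because the failures are intermittent, it may help to put a number on them before and after the re-install. Below is a minimal sketch that runs a command several times and reports the fraction of runs that exit with a non-zero status. It assumes the benchmark above is saved as tf_benchmarks.py (the name shown in the traceback) and that a crash produces a non-zero exit code; failure_rate and sweep are hypothetical helper names.

```python
import subprocess
import sys

def failure_rate(cmd, runs=5):
    """Run cmd `runs` times and return the fraction of non-zero exits."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            failures += 1
    return failures / runs

def sweep(script="tf_benchmarks.py", batch_sizes=(16, 32, 64, 128), runs=5):
    """Print the failure rate of the benchmark for each batch size."""
    for batch_size in batch_sizes:
        cmd = [sys.executable, script, "--batch_size", str(batch_size)]
        rate = failure_rate(cmd, runs)
        print("batch_size=%d failure rate: %.2f" % (batch_size, rate))

# Usage (uncomment on the affected machine):
# sweep()
```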