simpleCUBLAS kernel execution error

After trying (and failing) to get TensorFlow working on our cluster, I think there is an issue with cuBLAS. Running the simpleCUBLAS sample provided with the CUDA toolkit gives a kernel execution error. I am not sure whether this is a driver installation error or a CUDA library problem. The admins of our cluster recently performed a partial update of the NVIDIA drivers, moving some nodes from 396.26 to 410.104. The interesting thing is that the error below occurs on nodes with either driver, 396.26 or 410.104. Is this a problem with the drivers or with the cuda-toolkit install?

$ /software/cuda-toolkit/9.0.176/samples/7_CUDALibraries/simpleCUBLAS/simpleCUBLAS
GPU Device 0: "Tesla V100-PCIE-16GB" with compute capability 7.0

simpleCUBLAS test running…
!!! kernel execution error.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   25C    P0    26W / 250W |     23MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   21C    P0    24W / 250W |     23MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     10885      G   /usr/bin/X                                    22MiB |
|    1     10885      G   /usr/bin/X                                    22MiB |
+-----------------------------------------------------------------------------+

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   22C    P0    27W / 250W |     24MiB / 12198MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 00000000:82:00.0 Off |                    0 |
| N/A   25C    P0    26W / 250W |     24MiB / 12198MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     13166      G   /usr/bin/X                                    24MiB |
|    1     13166      G   /usr/bin/X                                    24MiB |
+-----------------------------------------------------------------------------+
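
One way to narrow down the driver-versus-toolkit question might be to compare each node's driver against the minimum driver the toolkit requires. The sketch below does that for the two driver versions shown above. The minimum-driver table is an assumption based on my reading of NVIDIA's CUDA release notes for Linux and should be double-checked there; the names `MIN_LINUX_DRIVER`, `parse_version` and `driver_supports` are made up for illustration.

```python
# Assumed minimum Linux driver per CUDA toolkit version.
# These values are an assumption taken from NVIDIA's CUDA release notes;
# verify them against the notes for the exact toolkit build in use.
MIN_LINUX_DRIVER = {
    "9.0": "384.81",
    "9.2": "396.26",
    "10.0": "410.48",
}


def parse_version(text):
    """Turn a dotted version such as '410.104' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def driver_supports(driver, cuda):
    """Return True if the driver is at least the assumed minimum for this toolkit."""
    return parse_version(driver) >= parse_version(MIN_LINUX_DRIVER[cuda])


if __name__ == "__main__":
    # The two driver versions reported by nvidia-smi on our nodes.
    for driver in ("396.26", "410.104"):
        verdict = "new enough" if driver_supports(driver, "9.0") else "too old"
        print(driver, verdict, "for CUDA 9.0")
```

If the table is right, both 396.26 and 410.104 are new enough for CUDA 9.0, so a plain version mismatch would not by itself explain the failure on either node.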

Here is the TensorFlow error. It occurs with both versions 1.12.2 and 1.13.1, on CUDA 9.0.

$ python regression.py
1.12.2


Layer (type)                 Output Shape              Param #
dense (Dense)                (None, 64)                640
dense_1 (Dense)              (None, 64)                4160
dense_2 (Dense)              (None, 1)                 65
Total params: 4,865
Trainable params: 4,865
Non-trainable params: 0
None
2019-05-17 13:55:30.451749: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla P100-PCIE-12GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:02:00.0
totalMemory: 11.91GiB freeMemory: 11.60GiB
2019-05-17 13:55:30.585480: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: Tesla P100-PCIE-12GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:82:00.0
totalMemory: 11.91GiB freeMemory: 11.60GiB
2019-05-17 13:55:30.585555: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1
2019-05-17 14:02:32.396076: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-17 14:02:32.396104: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1
2019-05-17 14:02:32.396110: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N N
2019-05-17 14:02:32.396113: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N N
2019-05-17 14:02:32.397192: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11227 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:02:00.0, compute capability: 6.0)
2019-05-17 14:02:32.398159: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11227 MB memory) -> physical GPU (device: 1, name: Tesla P100-PCIE-12GB, pci bus id: 0000:82:00.0, compute capability: 6.0)
2019-05-17 14:03:39.684369: E tensorflow/stream_executor/cuda/cuda_blas.cc:652] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "regression.py", line 78, in <module>
    example_result = model.predict(example_batch)
  File "/home/roverst/software/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1878, in predict
    self, x, batch_size=batch_size, verbose=verbose, steps=steps)
  File "/home/roverst/software/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_arrays.py", line 326, in predict_loop
    batch_outs = f(ins_batch)
  File "/home/roverst/software/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 2988, in __call__
    run_metadata=self.run_metadata)
  File "/home/roverst/software/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/home/roverst/software/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(10, 9), b.shape=(9, 64), m=10, n=64, k=9
[[{{node dense/MatMul}} = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_dense_input_0_0/_21, dense/MatMul/ReadVariableOp)]]
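
To take Keras out of the picture, the failing call can probably be reduced to a single float32 matmul with the shapes from the error message (10x9 times 9x64). This is a minimal sketch assuming the TensorFlow 1.x graph API (`tf.placeholder`, `tf.Session`); the function name `run_gemm` and the `gpu_index` parameter are hypothetical. Placeholders are used instead of constants so the product is less likely to be folded ahead of time on the CPU and never reach cuBLAS.

```python
A_SHAPE = (10, 9)   # a.shape from the error message
B_SHAPE = (9, 64)   # b.shape from the error message


def run_gemm(gpu_index=0):
    """Run one float32 matmul pinned to the given GPU and return the result."""
    # Imported lazily; this sketch assumes a TensorFlow 1.x install.
    import tensorflow as tf

    graph = tf.Graph()
    with graph.as_default():
        a = tf.placeholder(tf.float32, shape=A_SHAPE)
        b = tf.placeholder(tf.float32, shape=B_SHAPE)
        # Pin the matmul to one GPU so each device can be tested separately.
        with tf.device("/device:GPU:%d" % gpu_index):
            product = tf.matmul(a, b)

    # Feed all-ones matrices; the values do not matter for the launch itself.
    feed = {
        a: [[1.0] * A_SHAPE[1] for _ in range(A_SHAPE[0])],
        b: [[1.0] * B_SHAPE[1] for _ in range(B_SHAPE[0])],
    }
    with tf.Session(graph=graph) as sess:
        return sess.run(product, feed_dict=feed)
```

Calling `run_gemm(0)` and `run_gemm(1)` on each node would show whether the bare matmul fails the same way. If it does, the problem sits below Keras, which would be consistent with the simpleCUBLAS failure.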