Error during training with RTX 3090 in the TLT docker (works fine with RTX 2070): failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED

Hello,

I have a new PC with an RTX 3090. I use the same docker image, the same driver, etc., except that I installed Ubuntu 20.04 instead of 18.04.

On both PCs I use the same image: nvcr.io/nvidia/tlt-streamanalytics v2.0_py3 (image ID eefcf17a7830, 7.15 GB).
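
In case it helps, here is a small check I can run inside the container to confirm which TensorFlow build it ships, whether that build has CUDA support, and which GPU (with its compute capability) TensorFlow actually sees. This is only a sketch assuming the container's TensorFlow 1.x:

# Quick sanity check inside the tlt-streamanalytics container (TF 1.x assumed).
import tensorflow as tf
from tensorflow.python.client import device_lib

print("TF version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
for dev in device_lib.list_local_devices():
    if dev.device_type == "GPU":
        # physical_device_desc includes the compute capability reported in the logs below
        print(dev.physical_device_desc)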

Here it is not OK with the RTX 3090:

2021-01-02 08:30:58.533444: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-02 08:30:58.533471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-01-02 08:30:58.533476: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-01-02 08:30:58.535300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22321 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3090, pci bus id: 0000:17:00.0, compute capability: 8.6)
2021-01-02 08:31:18.684258: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-01-02 08:32:24.439148: E tensorflow/stream_executor/cuda/cuda_blas.cc:429] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2021-01-02 08:32:24.439190: E tensorflow/stream_executor/cuda/cuda_blas.cc:2437] Internal: failed BLAS call, see log for details
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas xGEMMBatched launch failed : a.shape=[14,3,3], b.shape=[14,3,3], m=3, n=3, k=3, batch_size=14
[[{{node CompositeTransform_6/CompositeTransform_5/CompositeTransform_4/CompositeTransform_3/CompositeTransform_2/CompositeTransform_1/CompositeTransform/RandomFlip/MatMul}}]]
[[resnet18_nopool_bn_detectnet_v2/block_4b_bn_2/AssignMovingAvg/_4229]]
(1) Internal: Blas xGEMMBatched launch failed : a.shape=[14,3,3], b.shape=[14,3,3], m=3, n=3, k=3, batch_size=14
[[{{node CompositeTransform_6/CompositeTransform_5/CompositeTransform_4/CompositeTransform_3/CompositeTransform_2/CompositeTransform_1/CompositeTransform/RandomFlip/MatMul}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:
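
To isolate the failure from the TLT training pipeline, a batched matmul with the same shapes as in the error (a.shape=[14,3,3], b.shape=[14,3,3]) can be run in the same container. This is only a minimal sketch assuming the container's TensorFlow 1.x session API; on the RTX 3090 I would expect it to raise the same InternalError if cuBLAS cannot execute on this GPU:

# Minimal repro sketch (assumption: run inside the same container, TF 1.x API).
# tf.matmul on rank-3 tensors dispatches to the cuBLAS batched GEMM that fails above.
import numpy as np
import tensorflow as tf

a = tf.constant(np.random.rand(14, 3, 3).astype(np.float32))
b = tf.constant(np.random.rand(14, 3, 3).astype(np.float32))
c = tf.matmul(a, b)  # batched matmul, same shapes as the failing RandomFlip/MatMul op

with tf.Session() as sess:
    print(sess.run(c))  # expected to fail with CUBLAS_STATUS_EXECUTION_FAILED on the RTX 3090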

root@1d7804e9546b:/workspace/hd/download/kitti# nvidia-smi
Sat Jan 2 10:16:12 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:17:00.0 Off |                  N/A |
|  0%   30C    P8     7W / 350W |     20MiB / 24265MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+



Here everything is OK with the RTX 2070:

2021-01-02 09:30:17.770648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-02 09:30:17.770679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-01-02 09:30:17.770706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-01-02 09:30:17.771106: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-02 09:30:17.771679: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-02 09:30:17.772102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6587 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
2021-01-02 09:30:42.527198: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-01-02 09:30:42.797878: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x6999c30
2021-01-02 09:30:42.798018: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-01-02 09:30:43.165349: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-01-02 09:30:43.645751: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-01-02 09:30:48,061 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 0/120: loss: 0.09842 Time taken: 0:00:00 ETA: 0:00:00
2021-01-02 09:30:48,061 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 1.905
2021-01-02 09:31:03,557 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 15.321
2021-01-02 09:31:15,719 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 28.778
2021-01-02 09:31:28,145 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 28.169
2021-01-02 09:31:42,020 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 25.226
2021-01-02 09:31:53,953 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 29.330
2021-01-02 09:32:05,958 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 29.155

root@19c89778e12a:/workspace/hd/download/kitti# nvidia-smi
Sat Jan 2 09:29:39 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070     On  | 00000000:01:00.0  On |                  N/A |
| 13%   54C    P0    50W / 175W |    287MiB /  7981MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Thank you

laurent

Please refer to the existing topic: ERROR: failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED