Error during training with RTX 3090 in the TLT docker (works fine with RTX 2070): failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED

Hello,

I have a new PC with an RTX 3090. I use the same docker image, the same driver, etc., except that I installed Ubuntu 20.04 instead of 18.04.

On both PCs I use the same image: nvcr.io/nvidia/tlt-streamanalytics v2.0_py3 (image ID eefcf17a7830, 7.15 GB).
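
In case it helps, here is a small check I can run inside the container to confirm which TensorFlow build it ships, whether that build has CUDA support, and which GPU (with its compute capability) TensorFlow actually sees. This is only a sketch assuming the container's TensorFlow 1.x:

# Quick sanity check inside the tlt-streamanalytics container (TF 1.x assumed).
import tensorflow as tf
from tensorflow.python.client import device_lib

print("TF version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
for dev in device_lib.list_local_devices():
    if dev.device_type == "GPU":
        # physical_device_desc includes the compute capability reported in the logs below
        print(dev.physical_device_desc)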

Here it is not OK with the RTX 3090:

2021-01-02 08:30:58.533444: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-02 08:30:58.533471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-01-02 08:30:58.533476: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-01-02 08:30:58.535300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22321 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3090, pci bus id: 0000:17:00.0, compute capability: 8.6)
2021-01-02 08:31:18.684258: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-01-02 08:32:24.439148: E tensorflow/stream_executor/cuda/cuda_blas.cc:429] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2021-01-02 08:32:24.439190: E tensorflow/stream_executor/cuda/cuda_blas.cc:2437] Internal: failed BLAS call, see log for details
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas xGEMMBatched launch failed : a.shape=[14,3,3], b.shape=[14,3,3], m=3, n=3, k=3, batch_size=14
[[{{node CompositeTransform_6/CompositeTransform_5/CompositeTransform_4/CompositeTransform_3/CompositeTransform_2/CompositeTransform_1/CompositeTransform/RandomFlip/MatMul}}]]
[[resnet18_nopool_bn_detectnet_v2/block_4b_bn_2/AssignMovingAvg/_4229]]
(1) Internal: Blas xGEMMBatched launch failed : a.shape=[14,3,3], b.shape=[14,3,3], m=3, n=3, k=3, batch_size=14
[[{{node CompositeTransform_6/CompositeTransform_5/CompositeTransform_4/CompositeTransform_3/CompositeTransform_2/CompositeTransform_1/CompositeTransform/RandomFlip/MatMul}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:
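
To isolate the failure from the TLT training pipeline, a batched matmul with the same shapes as in the error (a.shape=[14,3,3], b.shape=[14,3,3]) can be run in the same container. This is only a minimal sketch assuming the container's TensorFlow 1.x session API; on the RTX 3090 I would expect it to raise the same InternalError if cuBLAS cannot execute on this GPU:

# Minimal repro sketch (assumption: run inside the same container, TF 1.x API).
# tf.matmul on rank-3 tensors dispatches to the cuBLAS batched GEMM that fails above.
import numpy as np
import tensorflow as tf

a = tf.constant(np.random.rand(14, 3, 3).astype(np.float32))
b = tf.constant(np.random.rand(14, 3, 3).astype(np.float32))
c = tf.matmul(a, b)  # batched matmul, same shapes as the failing RandomFlip/MatMul op

with tf.Session() as sess:
    print(sess.run(c))  # expected to fail with CUBLAS_STATUS_EXECUTION_FAILED on the RTX 3090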

root@1d7804e9546b:/workspace/hd/download/kitti# nvidia-smi
Sat Jan 2 10:16:12 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:17:00.0 Off |                  N/A |
|  0%   30C    P8     7W / 350W |     20MiB / 24265MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+



Here everything is OK with the RTX 2070:

2021-01-02 09:30:17.770648: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-02 09:30:17.770679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-01-02 09:30:17.770706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-01-02 09:30:17.771106: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-02 09:30:17.771679: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-02 09:30:17.772102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6587 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
2021-01-02 09:30:42.527198: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-01-02 09:30:42.797878: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x6999c30
2021-01-02 09:30:42.798018: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-01-02 09:30:43.165349: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-01-02 09:30:43.645751: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-01-02 09:30:48,061 [INFO] /usr/local/lib/python3.6/dist-packages/modulus/hooks/task_progress_monitor_hook.pyc: Epoch 0/120: loss: 0.09842 Time taken: 0:00:00 ETA: 0:00:00
2021-01-02 09:30:48,061 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 1.905
2021-01-02 09:31:03,557 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 15.321
2021-01-02 09:31:15,719 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 28.778
2021-01-02 09:31:28,145 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 28.169
2021-01-02 09:31:42,020 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 25.226
2021-01-02 09:31:53,953 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 29.330
2021-01-02 09:32:05,958 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 29.155

root@19c89778e12a:/workspace/hd/download/kitti# nvidia-smi
Sat Jan 2 09:29:39 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 2070     On  | 00000000:01:00.0  On |                  N/A |
| 13%   54C    P0    50W / 175W |    287MiB /  7981MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Thank you

laurent

Please refer to the existing topic: ERROR: failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED