ERROR: failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED

I tried to train a YOLOv3 model using TLT and got the following error:

failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED

System setup:

  • 2 x GeForce RTX 3090
  • Driver Version: 455.38
  • CUDA Version: 11.1
  • tlt-streamanalytics:v2.0_py3
  • cuda:11.0-base
  • Output of nvidia-smi (within container)
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38       Driver Version: 455.38       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:24:00.0 Off |                  N/A |
|  0%   32C    P8    22W / 350W |     10MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 3090    Off  | 00000000:2D:00.0  On |                  N/A |
|  0%   31C    P8    30W / 350W |    289MiB / 24265MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

  • Output of nvcc --version (within container)
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
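
One detail that may be relevant: nvcc inside the container reports CUDA release 10.0, while the RTX 3090 is a compute capability 8.6 device. Below is a minimal sketch of how the two could be compared. It is not part of TLT; the helper names and the lookup table are my own, and the table only lists the minimum toolkit releases that, to my knowledge, first added support for each architecture.

```python
import re

# Minimum CUDA toolkit release that can target each compute capability.
# Assumption: this table is hand-written for illustration and is not exhaustive.
MIN_CUDA_FOR_CC = {
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere (A100)
    (8, 6): (11, 1),  # Ampere (e.g. GeForce RTX 3090)
}


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from the 'release X.Y' part of `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))


def toolkit_supports(compute_capability, toolkit_release):
    """Return True if the toolkit release is new enough for the given device."""
    required = MIN_CUDA_FOR_CC.get(compute_capability)
    if required is None:
        raise KeyError("compute capability %r is not in the table" % (compute_capability,))
    return toolkit_release >= required


# Values taken from this post: the nvcc output above and compute capability 8.6.
nvcc_output = "Cuda compilation tools, release 10.0, V10.0.130"
toolkit = parse_nvcc_release(nvcc_output)
print("toolkit release:", toolkit)
print("supports compute capability 8.6:", toolkit_supports((8, 6), toolkit))
```

With the values from this setup the check returns False, which suggests the CUDA 10.0 libraries in the container may not include kernels built for this GPU. Whether that is the actual cause of the cuBLAS failure is an assumption I have not confirmed.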

The command that causes the error:

!tlt-train yolo -e $SPECS_DIR/yolo_train_resnet18_kitti.txt \
                -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                -k $KEY \
                -m $USER_EXPERIMENT_DIR/pretrained_resnet18/tlt_pretrained_object_detection_vresnet10/resnet_10.hdf5 \
                --gpus 2
  • Error log:
2020-12-21 22:25:56.433979: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-12-21 22:25:56.433972: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-12-21 22:25:58.263005: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-12-21 22:25:58.263816: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-12-21 22:25:58.278119: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:58.279960: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce RTX 3090 major: 8 minor: 6 memoryClockRate(GHz): 1.695
pciBusID: 0000:24:00.0
2020-12-21 22:25:58.279982: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-12-21 22:25:58.280777: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:58.280969: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-12-21 22:25:58.281865: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce RTX 3090 major: 8 minor: 6 memoryClockRate(GHz): 1.695
pciBusID: 0000:2d:00.0
2020-12-21 22:25:58.281888: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-12-21 22:25:58.281897: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-12-21 22:25:58.282155: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-12-21 22:25:58.282889: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-12-21 22:25:58.283329: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-12-21 22:25:58.283847: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-12-21 22:25:58.284106: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-12-21 22:25:58.284255: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-12-21 22:25:58.285234: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-12-21 22:25:58.286123: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-12-21 22:25:58.286353: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-12-21 22:25:58.286483: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:58.287331: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:58.288087: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-12-21 22:25:58.288111: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-12-21 22:25:58.288794: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-12-21 22:25:58.288914: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:58.290581: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:58.291544: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 1
2020-12-21 22:25:58.291566: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-12-21 22:25:58.931030: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-21 22:25:58.931065: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2020-12-21 22:25:58.931070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2020-12-21 22:25:58.931327: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:58.932122: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:58.932886: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:58.933609: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22128 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3090, pci bus id: 0000:24:00.0, compute capability: 8.6)
Using TensorFlow backend.
2020-12-21 22:25:58,934 [INFO] iva.yolo.scripts.train: Loading experiment spec at /workspace/tlt-experiments/code/yolo/specs/yolo_train_resnet18_kitti.txt.
2020-12-21 22:25:58,935 [INFO] /usr/local/lib/python3.6/dist-packages/iva/yolo/utils/spec_loader.pyc: Merging specification from /workspace/tlt-experiments/code/yolo/specs/yolo_train_resnet18_kitti.txt
2020-12-21 22:25:58.936614: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-21 22:25:58.936645: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      1 
2020-12-21 22:25:58.936651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 1:   N 
2020-12-21 22:25:58.936860: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:58.937640: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:58.938393: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:58.939310: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21845 MB memory) -> physical GPU (device: 1, name: GeForce RTX 3090, pci bus id: 0000:2d:00.0, compute capability: 8.6)
Using TensorFlow backend.
2020-12-21 22:25:58,940 [INFO] iva.yolo.scripts.train: Loading experiment spec at /workspace/tlt-experiments/code/yolo/specs/yolo_train_resnet18_kitti.txt.
2020-12-21 22:25:58,941 [INFO] /usr/local/lib/python3.6/dist-packages/iva/yolo/utils/spec_loader.pyc: Merging specification from /workspace/tlt-experiments/code/yolo/specs/yolo_train_resnet18_kitti.txt
2020-12-21 22:25:58,945 [INFO] iva.yolo.scripts.train: Loading pretrained weights. This may take a while...
2020-12-21 22:25:58,958 [INFO] iva.yolo.scripts.train: Loading pretrained weights. This may take a while...
2020-12-21 22:25:59,212 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2020-12-21 22:25:59,212 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2020-12-21 22:25:59,212 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2020-12-21 22:25:59,212 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 24, io threads: 48, compute threads: 24, buffered batches: 4
2020-12-21 22:25:59,212 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 3868, number of sources: 1, batch size per gpu: 5, steps: 774
2020-12-21 22:25:59,216 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2020-12-21 22:25:59,216 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2020-12-21 22:25:59,217 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2020-12-21 22:25:59,217 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 24, io threads: 48, compute threads: 24, buffered batches: 4
2020-12-21 22:25:59,217 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 3868, number of sources: 1, batch size per gpu: 5, steps: 774
2020-12-21 22:25:59,288 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2020-12-21 22:25:59,292 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2020-12-21 22:25:59.311274: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:59.312086: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce RTX 3090 major: 8 minor: 6 memoryClockRate(GHz): 1.695
pciBusID: 0000:24:00.0
2020-12-21 22:25:59.312155: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:59.312865: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: GeForce RTX 3090 major: 8 minor: 6 memoryClockRate(GHz): 1.695
pciBusID: 0000:2d:00.0
2020-12-21 22:25:59.312884: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-12-21 22:25:59.312922: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-12-21 22:25:59.312935: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-12-21 22:25:59.312946: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-12-21 22:25:59.312957: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-12-21 22:25:59.312967: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-12-21 22:25:59.312978: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-12-21 22:25:59.313036: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:59.313831: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:59.314545: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:59.314611: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:59.316322: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: GeForce RTX 3090 major: 8 minor: 6 memoryClockRate(GHz): 1.695
pciBusID: 0000:24:00.0
2020-12-21 22:25:59.316392: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:59.316397: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:59.317877: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1
2020-12-21 22:25:59.317900: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: GeForce RTX 3090 major: 8 minor: 6 memoryClockRate(GHz): 1.695
pciBusID: 0000:2d:00.0
2020-12-21 22:25:59.317919: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-12-21 22:25:59.317948: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-12-21 22:25:59.317962: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-12-21 22:25:59.317972: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-12-21 22:25:59.317983: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-12-21 22:25:59.317993: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-12-21 22:25:59.318004: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-12-21 22:25:59.318062: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:59.318829: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:59.319586: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:59.320344: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-21 22:25:59.321052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1
2020-12-21 22:25:59,468 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: True - shard 0 of 1
2020-12-21 22:25:59,470 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: True - shard 0 of 1
2020-12-21 22:25:59,473 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2020-12-21 22:25:59,473 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
2020-12-21 22:25:59,474 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2020-12-21 22:25:59,474 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
Weights for those layers can not be loaded: ['expand_conv1', 'expand_conv1_bn', 'expand_conv1_lrelu']
STOP trainig now and check the pre-train model if this is not expected!
2020-12-21 22:26:16,348 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2020-12-21 22:26:16,348 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2020-12-21 22:26:16,348 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2020-12-21 22:26:16,349 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 24, io threads: 48, compute threads: 24, buffered batches: 4
2020-12-21 22:26:16,349 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 682, number of sources: 1, batch size per gpu: 16, steps: 43
2020-12-21 22:26:16,371 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
Weights for those layers can not be loaded: ['expand_conv1', 'expand_conv1_bn', 'expand_conv1_lrelu']
STOP trainig now and check the pre-train model if this is not expected!
2020-12-21 22:26:16,535 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: False - shard 0 of 1
2020-12-21 22:26:16,538 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2020-12-21 22:26:16,538 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
2020-12-21 22:26:16,681 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2020-12-21 22:26:16,681 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2020-12-21 22:26:16,681 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2020-12-21 22:26:16,681 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 24, io threads: 48, compute threads: 24, buffered batches: 4
2020-12-21 22:26:16,681 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 682, number of sources: 1, batch size per gpu: 16, steps: 43
2020-12-21 22:26:16,703 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2020-12-21 22:26:16,869 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: False - shard 0 of 1
2020-12-21 22:26:16,873 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2020-12-21 22:26:16,873 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
Input (InputLayer)              (5, 3, 1152, 1440)   0                                            
__________________________________________________________________________________________________
conv1 (Conv2D)                  (5, 64, 576, 720)    9408        Input[0][0]                      
__________________________________________________________________________________________________
bn_conv1 (BatchNormalization)   (5, 64, 576, 720)    256         conv1[0][0]                      
__________________________________________________________________________________________________
activation_1 (Activation)       (5, 64, 576, 720)    0           bn_conv1[0][0]                   
__________________________________________________________________________________________________
block_1a_conv_1 (Conv2D)        (5, 64, 288, 360)    36864       activation_1[0][0]               
__________________________________________________________________________________________________
block_1a_bn_1 (BatchNormalizati (5, 64, 288, 360)    256         block_1a_conv_1[0][0]            
__________________________________________________________________________________________________
block_1a_relu_1 (Activation)    (5, 64, 288, 360)    0           block_1a_bn_1[0][0]              
__________________________________________________________________________________________________
block_1a_conv_2 (Conv2D)        (5, 64, 288, 360)    36864       block_1a_relu_1[0][0]            
__________________________________________________________________________________________________
block_1a_conv_shortcut (Conv2D) (5, 64, 288, 360)    4096        activation_1[0][0]               
__________________________________________________________________________________________________
block_1a_bn_2 (BatchNormalizati (5, 64, 288, 360)    256         block_1a_conv_2[0][0]            
__________________________________________________________________________________________________
block_1a_bn_shortcut (BatchNorm (5, 64, 288, 360)    256         block_1a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_1 (Add)                     (5, 64, 288, 360)    0           block_1a_bn_2[0][0]              
                                                                 block_1a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_1a_relu (Activation)      (5, 64, 288, 360)    0           add_1[0][0]                      
__________________________________________________________________________________________________
block_2a_conv_1 (Conv2D)        (5, 128, 144, 180)   73728       block_1a_relu[0][0]              
__________________________________________________________________________________________________
block_2a_bn_1 (BatchNormalizati (5, 128, 144, 180)   512         block_2a_conv_1[0][0]            
__________________________________________________________________________________________________
block_2a_relu_1 (Activation)    (5, 128, 144, 180)   0           block_2a_bn_1[0][0]              
__________________________________________________________________________________________________
block_2a_conv_2 (Conv2D)        (5, 128, 144, 180)   147456      block_2a_relu_1[0][0]            
__________________________________________________________________________________________________
block_2a_conv_shortcut (Conv2D) (5, 128, 144, 180)   8192        block_1a_relu[0][0]              
__________________________________________________________________________________________________
block_2a_bn_2 (BatchNormalizati (5, 128, 144, 180)   512         block_2a_conv_2[0][0]            
__________________________________________________________________________________________________
block_2a_bn_shortcut (BatchNorm (5, 128, 144, 180)   512         block_2a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_2 (Add)                     (5, 128, 144, 180)   0           block_2a_bn_2[0][0]              
                                                                 block_2a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_2a_relu (Activation)      (5, 128, 144, 180)   0           add_2[0][0]                      
__________________________________________________________________________________________________
block_3a_conv_1 (Conv2D)        (5, 256, 72, 90)     294912      block_2a_relu[0][0]              
__________________________________________________________________________________________________
block_3a_bn_1 (BatchNormalizati (5, 256, 72, 90)     1024        block_3a_conv_1[0][0]            
__________________________________________________________________________________________________
block_3a_relu_1 (Activation)    (5, 256, 72, 90)     0           block_3a_bn_1[0][0]              
__________________________________________________________________________________________________
block_3a_conv_2 (Conv2D)        (5, 256, 72, 90)     589824      block_3a_relu_1[0][0]            
__________________________________________________________________________________________________
block_3a_conv_shortcut (Conv2D) (5, 256, 72, 90)     32768       block_2a_relu[0][0]              
__________________________________________________________________________________________________
block_3a_bn_2 (BatchNormalizati (5, 256, 72, 90)     1024        block_3a_conv_2[0][0]            
__________________________________________________________________________________________________
block_3a_bn_shortcut (BatchNorm (5, 256, 72, 90)     1024        block_3a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_3 (Add)                     (5, 256, 72, 90)     0           block_3a_bn_2[0][0]              
                                                                 block_3a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_3a_relu (Activation)      (5, 256, 72, 90)     0           add_3[0][0]                      
__________________________________________________________________________________________________
block_4a_conv_1 (Conv2D)        (5, 512, 72, 90)     1179648     block_3a_relu[0][0]              
__________________________________________________________________________________________________
block_4a_bn_1 (BatchNormalizati (5, 512, 72, 90)     2048        block_4a_conv_1[0][0]            
__________________________________________________________________________________________________
block_4a_relu_1 (Activation)    (5, 512, 72, 90)     0           block_4a_bn_1[0][0]              
__________________________________________________________________________________________________
block_4a_conv_2 (Conv2D)        (5, 512, 72, 90)     2359296     block_4a_relu_1[0][0]            
__________________________________________________________________________________________________
block_4a_conv_shortcut (Conv2D) (5, 512, 72, 90)     131072      block_3a_relu[0][0]              
__________________________________________________________________________________________________
block_4a_bn_2 (BatchNormalizati (5, 512, 72, 90)     2048        block_4a_conv_2[0][0]            
__________________________________________________________________________________________________
block_4a_bn_shortcut (BatchNorm (5, 512, 72, 90)     2048        block_4a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_4 (Add)                     (5, 512, 72, 90)     0           block_4a_bn_2[0][0]              
                                                                 block_4a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_4a_relu (Activation)      (5, 512, 72, 90)     0           add_4[0][0]                      
__________________________________________________________________________________________________
expand_conv1 (Conv2D)           (5, 512, 36, 45)     2359296     block_4a_relu[0][0]              
__________________________________________________________________________________________________
expand_conv1_bn (BatchNormaliza (5, 512, 36, 45)     2048        expand_conv1[0][0]               
__________________________________________________________________________________________________
expand_conv1_lrelu (LeakyReLU)  (5, 512, 36, 45)     0           expand_conv1_bn[0][0]            
__________________________________________________________________________________________________
yolo_conv1_1 (Conv2D)           (5, 256, 36, 45)     131072      expand_conv1_lrelu[0][0]         
__________________________________________________________________________________________________
yolo_conv1_1_bn (BatchNormaliza (5, 256, 36, 45)     1024        yolo_conv1_1[0][0]               
__________________________________________________________________________________________________
yolo_conv1_1_lrelu (LeakyReLU)  (5, 256, 36, 45)     0           yolo_conv1_1_bn[0][0]            
__________________________________________________________________________________________________
yolo_conv1_2 (Conv2D)           (5, 512, 36, 45)     1179648     yolo_conv1_1_lrelu[0][0]         
__________________________________________________________________________________________________
yolo_conv1_2_bn (BatchNormaliza (5, 512, 36, 45)     2048        yolo_conv1_2[0][0]               
__________________________________________________________________________________________________
yolo_conv1_2_lrelu (LeakyReLU)  (5, 512, 36, 45)     0           yolo_conv1_2_bn[0][0]            
__________________________________________________________________________________________________
yolo_conv1_3 (Conv2D)           (5, 256, 36, 45)     131072      yolo_conv1_2_lrelu[0][0]         
__________________________________________________________________________________________________
yolo_conv1_3_bn (BatchNormaliza (5, 256, 36, 45)     1024        yolo_conv1_3[0][0]               
__________________________________________________________________________________________________
yolo_conv1_3_lrelu (LeakyReLU)  (5, 256, 36, 45)     0           yolo_conv1_3_bn[0][0]            
__________________________________________________________________________________________________
yolo_conv1_4 (Conv2D)           (5, 512, 36, 45)     1179648     yolo_conv1_3_lrelu[0][0]         
__________________________________________________________________________________________________
yolo_conv1_4_bn (BatchNormaliza (5, 512, 36, 45)     2048        yolo_conv1_4[0][0]               
__________________________________________________________________________________________________
yolo_conv1_4_lrelu (LeakyReLU)  (5, 512, 36, 45)     0           yolo_conv1_4_bn[0][0]            
__________________________________________________________________________________________________
yolo_conv1_5 (Conv2D)           (5, 256, 36, 45)     131072      yolo_conv1_4_lrelu[0][0]         
__________________________________________________________________________________________________
yolo_conv1_5_bn (BatchNormaliza (5, 256, 36, 45)     1024        yolo_conv1_5[0][0]               
__________________________________________________________________________________________________
yolo_conv1_5_lrelu (LeakyReLU)  (5, 256, 36, 45)     0           yolo_conv1_5_bn[0][0]            
__________________________________________________________________________________________________
yolo_conv2 (Conv2D)             (5, 128, 36, 45)     32768       yolo_conv1_5_lrelu[0][0]         
__________________________________________________________________________________________________
yolo_conv2_bn (BatchNormalizati (5, 128, 36, 45)     512         yolo_conv2[0][0]                 
__________________________________________________________________________________________________
yolo_conv2_lrelu (LeakyReLU)    (5, 128, 36, 45)     0           yolo_conv2_bn[0][0]              
__________________________________________________________________________________________________
upsample0 (UpSampling2D)        (5, 128, 72, 90)     0           yolo_conv2_lrelu[0][0]           
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (5, 384, 72, 90)     0           upsample0[0][0]                  
                                                                 block_3a_relu[0][0]              
__________________________________________________________________________________________________
yolo_conv3_1 (Conv2D)           (5, 128, 72, 90)     49152       concatenate_1[0][0]              
__________________________________________________________________________________________________
yolo_conv3_1_bn (BatchNormaliza (5, 128, 72, 90)     512         yolo_conv3_1[0][0]               
__________________________________________________________________________________________________
yolo_conv3_1_lrelu (LeakyReLU)  (5, 128, 72, 90)     0           yolo_conv3_1_bn[0][0]            
__________________________________________________________________________________________________
yolo_conv3_2 (Conv2D)           (5, 256, 72, 90)     294912      yolo_conv3_1_lrelu[0][0]         
__________________________________________________________________________________________________
yolo_conv3_2_bn (BatchNormaliza (5, 256, 72, 90)     1024        yolo_conv3_2[0][0]               
__________________________________________________________________________________________________
yolo_conv3_2_lrelu (LeakyReLU)  (5, 256, 72, 90)     0           yolo_conv3_2_bn[0][0]            
__________________________________________________________________________________________________
yolo_conv3_3 (Conv2D)           (5, 128, 72, 90)     32768       yolo_conv3_2_lrelu[0][0]         
__________________________________________________________________________________________________
yolo_conv3_3_bn (BatchNormaliza (5, 128, 72, 90)     512         yolo_conv3_3[0][0]               
__________________________________________________________________________________________________
yolo_conv3_3_lrelu (LeakyReLU)  (5, 128, 72, 90)     0           yolo_conv3_3_bn[0][0]            
__________________________________________________________________________________________________
yolo_conv3_4 (Conv2D)           (5, 256, 72, 90)     294912      yolo_conv3_3_lrelu[0][0]         
__________________________________________________________________________________________________
yolo_conv3_4_bn (BatchNormaliza (5, 256, 72, 90)     1024        yolo_conv3_4[0][0]               
__________________________________________________________________________________________________
yolo_conv3_4_lrelu (LeakyReLU)  (5, 256, 72, 90)     0           yolo_conv3_4_bn[0][0]            
__________________________________________________________________________________________________
yolo_conv3_5 (Conv2D)           (5, 128, 72, 90)     32768       yolo_conv3_4_lrelu[0][0]         
__________________________________________________________________________________________________
yolo_conv3_5_bn (BatchNormaliza (5, 128, 72, 90)     512         yolo_conv3_5[0][0]               
__________________________________________________________________________________________________
yolo_conv3_5_lrelu (LeakyReLU)  (5, 128, 72, 90)     0           yolo_conv3_5_bn[0][0]            
__________________________________________________________________________________________________
yolo_conv4 (Conv2D)             (5, 64, 72, 90)      8192        yolo_conv3_5_lrelu[0][0]         
__________________________________________________________________________________________________
yolo_conv4_bn (BatchNormalizati (5, 64, 72, 90)      256         yolo_conv4[0][0]                 
__________________________________________________________________________________________________
yolo_conv4_lrelu (LeakyReLU)    (5, 64, 72, 90)      0           yolo_conv4_bn[0][0]              
__________________________________________________________________________________________________
upsample1 (UpSampling2D)        (5, 64, 144, 180)    0           yolo_conv4_lrelu[0][0]           
__________________________________________________________________________________________________
concatenate_2 (Concatenate)     (5, 192, 144, 180)   0           upsample1[0][0]                  
                                                                 block_2a_relu[0][0]              
__________________________________________________________________________________________________
yolo_conv5_1 (Conv2D)           (5, 64, 144, 180)    12288       concatenate_2[0][0]              
__________________________________________________________________________________________________
yolo_conv5_1_bn (BatchNormaliza (5, 64, 144, 180)    256         yolo_conv5_1[0][0]               
__________________________________________________________________________________________________
yolo_conv5_1_lrelu (LeakyReLU)  (5, 64, 144, 180)    0           yolo_conv5_1_bn[0][0]            
__________________________________________________________________________________________________
yolo_conv5_2 (Conv2D)           (5, 128, 144, 180)   73728       yolo_conv5_1_lrelu[0][0]         
__________________________________________________________________________________________________
yolo_conv5_2_bn (BatchNormaliza (5, 128, 144, 180)   512         yolo_conv5_2[0][0]               
__________________________________________________________________________________________________
yolo_conv5_2_lrelu (LeakyReLU)  (5, 128, 144, 180)   0           yolo_conv5_2_bn[0][0]            
__________________________________________________________________________________________________
yolo_conv5_3 (Conv2D)           (5, 64, 144, 180)    8192        yolo_conv5_2_lrelu[0][0]         
__________________________________________________________________________________________________
yolo_conv5_3_bn (BatchNormaliza (5, 64, 144, 180)    256         yolo_conv5_3[0][0]               
__________________________________________________________________________________________________
yolo_conv5_3_lrelu (LeakyReLU)  (5, 64, 144, 180)    0           yolo_conv5_3_bn[0][0]            
__________________________________________________________________________________________________
yolo_conv5_4 (Conv2D)           (5, 128, 144, 180)   73728       yolo_conv5_3_lrelu[0][0]         
__________________________________________________________________________________________________
yolo_conv5_4_bn (BatchNormaliza (5, 128, 144, 180)   512         yolo_conv5_4[0][0]               
__________________________________________________________________________________________________
yolo_conv5_4_lrelu (LeakyReLU)  (5, 128, 144, 180)   0           yolo_conv5_4_bn[0][0]            
__________________________________________________________________________________________________
yolo_conv5_5 (Conv2D)           (5, 64, 144, 180)    8192        yolo_conv5_4_lrelu[0][0]         
__________________________________________________________________________________________________
yolo_conv5_5_bn (BatchNormaliza (5, 64, 144, 180)    256         yolo_conv5_5[0][0]               
__________________________________________________________________________________________________
yolo_conv5_5_lrelu (LeakyReLU)  (5, 64, 144, 180)    0           yolo_conv5_5_bn[0][0]            
__________________________________________________________________________________________________
yolo_conv1_6 (Conv2D)           (5, 512, 36, 45)     1179648     yolo_conv1_5_lrelu[0][0]         
__________________________________________________________________________________________________
yolo_conv3_6 (Conv2D)           (5, 256, 72, 90)     294912      yolo_conv3_5_lrelu[0][0]         
__________________________________________________________________________________________________
yolo_conv5_6 (Conv2D)           (5, 128, 144, 180)   73728       yolo_conv5_5_lrelu[0][0]         
__________________________________________________________________________________________________
yolo_conv1_6_bn (BatchNormaliza (5, 512, 36, 45)     2048        yolo_conv1_6[0][0]               
__________________________________________________________________________________________________
yolo_conv3_6_bn (BatchNormaliza (5, 256, 72, 90)     1024        yolo_conv3_6[0][0]               
__________________________________________________________________________________________________
yolo_conv5_6_bn (BatchNormaliza (5, 128, 144, 180)   512         yolo_conv5_6[0][0]               
__________________________________________________________________________________________________
yolo_conv1_6_lrelu (LeakyReLU)  (5, 512, 36, 45)     0           yolo_conv1_6_bn[0][0]            
__________________________________________________________________________________________________
yolo_conv3_6_lrelu (LeakyReLU)  (5, 256, 72, 90)     0           yolo_conv3_6_bn[0][0]            
__________________________________________________________________________________________________
yolo_conv5_6_lrelu (LeakyReLU)  (5, 128, 144, 180)   0           yolo_conv5_6_bn[0][0]            
__________________________________________________________________________________________________
conv_big_object (Conv2D)        (5, 21, 36, 45)      10773       yolo_conv1_6_lrelu[0][0]         
__________________________________________________________________________________________________
conv_mid_object (Conv2D)        (5, 21, 72, 90)      5397        yolo_conv3_6_lrelu[0][0]         
__________________________________________________________________________________________________
conv_sm_object (Conv2D)         (5, 21, 144, 180)    2709        yolo_conv5_6_lrelu[0][0]         
__________________________________________________________________________________________________
bg_permute (Permute)            (5, 36, 45, 21)      0           conv_big_object[0][0]            
__________________________________________________________________________________________________
md_permute (Permute)            (5, 72, 90, 21)      0           conv_mid_object[0][0]            
__________________________________________________________________________________________________
sm_permute (Permute)            (5, 144, 180, 21)    0           conv_sm_object[0][0]             
__________________________________________________________________________________________________
bg_anchor (YOLOAnchorBox)       (5, 1, 4860, 6)      0           conv_big_object[0][0]            
__________________________________________________________________________________________________
bg_reshape (Reshape)            (5, 1, 4860, 7)      0           bg_permute[0][0]                 
__________________________________________________________________________________________________
md_anchor (YOLOAnchorBox)       (5, 1, 19440, 6)     0           conv_mid_object[0][0]            
__________________________________________________________________________________________________
md_reshape (Reshape)            (5, 1, 19440, 7)     0           md_permute[0][0]                 
__________________________________________________________________________________________________
sm_anchor (YOLOAnchorBox)       (5, 1, 77760, 6)     0           conv_sm_object[0][0]             
__________________________________________________________________________________________________
sm_reshape (Reshape)            (5, 1, 77760, 7)     0           sm_permute[0][0]                 
__________________________________________________________________________________________________
encoded_bg (Concatenate)        (5, 1, 4860, 13)     0           bg_anchor[0][0]                  
                                                                 bg_reshape[0][0]                 
__________________________________________________________________________________________________
encoded_md (Concatenate)        (5, 1, 19440, 13)    0           md_anchor[0][0]                  
                                                                 md_reshape[0][0]                 
__________________________________________________________________________________________________
encoded_sm (Concatenate)        (5, 1, 77760, 13)    0           sm_anchor[0][0]                  
                                                                 sm_reshape[0][0]                 
__________________________________________________________________________________________________
encoded_detections (Concatenate (5, 1, 102060, 13)   0           encoded_bg[0][0]                 
                                                                 encoded_md[0][0]                 
                                                                 encoded_sm[0][0]                 
==================================================================================================
Total params: 12,535,423
Trainable params: 12,510,655
Non-trainable params: 24,768
__________________________________________________________________________________________________
2020-12-21 22:26:19,539 [INFO] iva.yolo.scripts.train: Number of images in the training dataset:	  3868
Epoch 1/120
2020-12-21 22:26:26.687360: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-12-21 22:26:27.256256: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-12-21 22:27:08.895203: E tensorflow/stream_executor/cuda/cuda_blas.cc:429] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2020-12-21 22:27:08.895235: E tensorflow/stream_executor/cuda/cuda_blas.cc:2437] Internal: failed BLAS call, see log for details
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 51, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo/scripts/train.py", line 239, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo/scripts/train.py", line 183, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_arrays.py", line 154, in fit_loop
    outs = f(ins)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas xGEMMBatched launch failed : a.shape=[5,3,3], b.shape=[5,3,3], m=3, n=3, k=3, batch_size=5
	 [[{{node CompositeTransform_6/CompositeTransform_5/CompositeTransform_4/CompositeTransform_3/CompositeTransform_2/CompositeTransform_1/CompositeTransform/RandomFlip/MatMul}}]]
	 [[cond_6/cond/SliceReplace/ListDiff/Switch/_3151]]
  (1) Internal: Blas xGEMMBatched launch failed : a.shape=[5,3,3], b.shape=[5,3,3], m=3, n=3, k=3, batch_size=5
	 [[{{node CompositeTransform_6/CompositeTransform_5/CompositeTransform_4/CompositeTransform_3/CompositeTransform_2/CompositeTransform_1/CompositeTransform/RandomFlip/MatMul}}]]
0 successful operations.
0 derived errors ignored.
2020-12-21 22:27:09.439647: E tensorflow/stream_executor/cuda/cuda_blas.cc:429] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
2020-12-21 22:27:09.439680: E tensorflow/stream_executor/cuda/cuda_blas.cc:2437] Internal: failed BLAS call, see log for details
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 51, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo/scripts/train.py", line 239, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo/scripts/train.py", line 183, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_arrays.py", line 154, in fit_loop
    outs = f(ins)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas xGEMMBatched launch failed : a.shape=[5,3,3], b.shape=[5,3,3], m=3, n=3, k=3, batch_size=5
	 [[{{node CompositeTransform_6/CompositeTransform_5/CompositeTransform_4/CompositeTransform_3/CompositeTransform_2/CompositeTransform_1/CompositeTransform/RandomFlip/MatMul}}]]
	 [[cond_6/cond/SliceReplace/ListDiff/Switch/_3151]]
  (1) Internal: Blas xGEMMBatched launch failed : a.shape=[5,3,3], b.shape=[5,3,3], m=3, n=3, k=3, batch_size=5
	 [[{{node CompositeTransform_6/CompositeTransform_5/CompositeTransform_4/CompositeTransform_3/CompositeTransform_2/CompositeTransform_1/CompositeTransform/RandomFlip/MatMul}}]]
0 successful operations.
0 derived errors ignored.
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[44497,1],0]
  Exit code:    1
--------------------------------------------------------------------------
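
For triage, the relevant facts can be pulled out of the log above with a few regular expressions. This is a minimal sketch, not part of TLT: the function name `summarize` and the patterns are assumptions based only on the lines shown in this post, and the sample text is copied from the log with timestamps trimmed.

```python
import re

# Three lines copied from the log above (timestamps trimmed).
LOG = """\
I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
E tensorflow/stream_executor/cuda/cuda_blas.cc:429] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
  (0) Internal: Blas xGEMMBatched launch failed : a.shape=[5,3,3], b.shape=[5,3,3], m=3, n=3, k=3, batch_size=5
"""


def summarize(log):
    """Return (cuBLAS library version, cuBLAS status, (m, n, k, batch_size)).

    Any part that is not found in the log is returned as None.
    """
    lib = re.search(r"libcublas\.so\.(\d+(?:\.\d+)*)", log)
    status = re.search(r"failed to run cuBLAS routine: (\w+)", log)
    gemm = re.search(r"m=(\d+), n=(\d+), k=(\d+), batch_size=(\d+)", log)
    return (
        lib.group(1) if lib else None,
        status.group(1) if status else None,
        tuple(int(g) for g in gemm.groups()) if gemm else None,
    )


print(summarize(LOG))
```

On this log it reports that libcublas.so.10.0 was loaded and that the failing call is a tiny batched 3x3 GEMM, while nvidia-smi reports CUDA Version 11.1 for the driver. Whether that version difference is the cause is not established here.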

What I have tried:

Ran the first code block with the following content:

import tensorflow as tf

# TF 1.x: allocate GPU memory on demand instead of reserving it all at startup.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)

Referred to the threads "Tlt-augment execution error occurs"
and "Tlt train error: Value 'sm_86' is not defined for option 'gpu-name'".
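
The 'sm_86' message in the second thread title is the kind of error a CUDA toolkit gives when it predates the GPU architecture it is asked to target. A minimal sketch of that version check follows. The table `MIN_CUDA_FOR_ARCH` and the helper names are hypothetical and written for illustration; the minimum versions are my understanding of the CUDA release history, not something taken from TLT.

```python
# Assumption: first CUDA toolkit release able to target each architecture.
MIN_CUDA_FOR_ARCH = {
    "sm_70": (9, 0),   # Volta
    "sm_75": (10, 0),  # Turing
    "sm_80": (11, 0),  # Ampere (A100)
    "sm_86": (11, 1),  # Ampere (GeForce RTX 30 series, e.g. RTX 3090)
}


def parse_version(text):
    """Turn a version string such as '11.1' into a comparable tuple (11, 1)."""
    return tuple(int(part) for part in text.split("."))


def toolkit_knows_arch(cuda_version, arch):
    """True if the given CUDA toolkit version is new enough to target `arch`."""
    return parse_version(cuda_version) >= MIN_CUDA_FOR_ARCH[arch]


for version in ("10.0", "11.0", "11.1"):
    print(version, toolkit_knows_arch(version, "sm_86"))
```

Under these assumptions, only the 11.1 entry passes for sm_86, which would be consistent with a container built against CUDA 10.0 libraries having trouble on an RTX 3090.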