Detectnet_v2 training core dumped error

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) : **x86_64 GPU machine**
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) : Detectnet_v2
• TLT Version (Please run "tlt info --verbose" and share "docker_tag" here)
• Training spec file (If you have one, please share it here) :
detectnet_v2_train_config.txt (11.1 KB)

• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I'm getting a core dump error during training.
Attached is the full log
training_log.txt (42.0 KB)
and the config file for reference.
.tao_mounts.json file:
.tao_mounts.json (802 Bytes)

COMMAND: tao detectnet_v2 train -k tao_encode -n detectnet_v2_resnet18 -r /home/soundarrajan/detectnet_v2/result/training -e /home/soundarrajan/detectnet_v2/config/detectnet_v2_train_config.txt --log_file /home/soundarrajan/detectnet_v2/logs/training_log.txt

ERROR:

INFO:tensorflow:Graph was finalized.
2022-06-06 12:34:52,779 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2022-06-06 12:34:54,226 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2022-06-06 12:34:54,743 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2022-06-06 12:35:00,359 [INFO] tensorflow: Saving checkpoints for step-0.
2022-06-06 12:35:15.946288: F tensorflow/core/kernels/cuda_solvers.cc:94] Check failed: cusolverDnCreate(&cusolver_dn_handle) == CUSOLVER_STATUS_SUCCESS Failed to create cuSolverDN instance.
[2e033a5e779a:00072] *** Process received signal ***
[2e033a5e779a:00072] Signal: Aborted (6)
[2e033a5e779a:00072] Signal code:  (-6)
[2e033a5e779a:00072] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef10)[0x7f2a55138f10]
[2e033a5e779a:00072] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f2a55138e87]
[2e033a5e779a:00072] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f2a5513a7f1]
[2e033a5e779a:00072] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x82f75b4)[0x7f29cebea5b4]
[2e033a5e779a:00072] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow10CudaSolverC1EPNS_15OpKernelContextE+0x102)[0x7f29cab3d042]
[2e033a5e779a:00072] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow18MatrixInverseOpGpuIfE12ComputeAsyncEPNS_15OpKernelContextESt8functionIFvvEE+0x147)[0x7f29ca1f9d27]
[2e033a5e779a:00072] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice12ComputeAsyncEPNS_13AsyncOpKernelEPNS_15OpKernelContextESt8functionIFvvEE+0xeb)[0x7f29c5b0f69b]
[2e033a5e779a:00072] [ 7] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0xf9617d)[0x7f29c5b7317d]
[2e033a5e779a:00072] [ 8] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0xf97c6f)[0x7f29c5b74c6f]
[2e033a5e779a:00072] [ 9] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7f29c5c24791]
[2e033a5e779a:00072] [10] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7f29c5c21df8]
[2e033a5e779a:00072] [11] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df)[0x7f2a530236df]
[2e033a5e779a:00072] [12] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f2a54ee26db]
[2e033a5e779a:00072] [13] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f2a5521b61f]
[2e033a5e779a:00072] *** End of error message ***
Aborted (core dumped)

Kindly help me fix this issue so that the model trains successfully.

Are you using nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3 or nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 ?

Hi @Morganh,

Output for COMMAND: tao info --verbose

Configuration of the TAO Toolkit Instance

dockers:
        nvidia/tao/tao-toolkit-tf:
                v3.22.05-tf1.15.5-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. augment
                                2. bpnet
                                3. classification
                                4. dssd
                                5. faster_rcnn
                                6. emotionnet
                                7. efficientdet
                                8. fpenet
                                9. gazenet
                                10. gesturenet
                                11. heartratenet
                                12. lprnet
                                13. mask_rcnn
                                14. multitask_classification
                                15. retinanet
                                16. ssd
                                17. unet
                                18. yolo_v3
                                19. yolo_v4
                                20. yolo_v4_tiny
                                21. converter
                v3.22.05-tf1.15.4-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. detectnet_v2
        nvidia/tao/tao-toolkit-pyt:
                v3.22.05-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. speech_to_text
                                2. speech_to_text_citrinet
                                3. speech_to_text_conformer
                                4. action_recognition
                                5. pointpillars
                                6. pose_classification
                                7. spectro_gen
                                8. vocoder
                v3.21.11-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. text_classification
                                2. question_answering
                                3. token_classification
                                4. intent_slot_classification
                                5. punctuation_and_capitalization
        nvidia/tao/tao-toolkit-lm:
                v3.22.05-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. n_gram
format_version: 2.0
toolkit_version: 3.22.05
published_date: 05/25/2022

Can you share the result of $nvidia-smi ?

Hi @Morganh,

Please find the required output for the COMMAND: nvidia-smi

Mon Jun  6 18:54:14 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08    Driver Version: 510.73.08    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:2D:00.0 Off |                  N/A |
|  0%   36C    P8    12W / 300W |   9967MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2173      G   /usr/lib/xorg/Xorg                  9MiB |
|    0   N/A  N/A      2348      G   /usr/bin/gnome-shell                6MiB |
|    0   N/A  N/A      5681      C   /usr/bin/python3                  825MiB |
|    0   N/A  N/A     26781      C   /usr/bin/python3                 9121MiB |
+-----------------------------------------------------------------------------+

Please try the following to check if it works. Thanks.
export TF_FORCE_GPU_ALLOW_GROWTH=true
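A minimal sketch of applying this suggestion before re-running the training command (note: with the `tao` launcher, the variable must also be visible inside the container — whether it is forwarded automatically or needs an entry in `.tao_mounts.json` is an assumption to verify against the launcher docs):

```shell
# Let TensorFlow grow GPU memory on demand instead of pre-allocating it all;
# this can avoid cuSolver/cuBLAS handle-creation failures on a busy GPU.
export TF_FORCE_GPU_ALLOW_GROWTH=true
echo "$TF_FORCE_GPU_ALLOW_GROWTH"
# -> true
```

Then re-run the same `tao detectnet_v2 train ...` command in this shell.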

Hi @Morganh,

No luck; I'm still getting the same error even after setting the suggested environment variable.

Full training log:
training_log.txt (42.0 KB)

I cannot reproduce the error while training with KITTI dataset.

I suggest you free the GPU memory and retry:
$ sudo kill -9 5681 26781
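The PIDs above come from the Processes table of `nvidia-smi`. A small sketch of extracting compute-process PIDs from that table automatically, here run against a saved copy of the two process lines from the earlier output (the parsing is an assumption about the table layout; on a live machine you would pipe `nvidia-smi` itself):

```shell
# Sample of the two "C" (compute) rows from the nvidia-smi Processes table above
smi_procs='    0   N/A  N/A      5681      C   /usr/bin/python3                  825MiB
    0   N/A  N/A     26781      C   /usr/bin/python3                 9121MiB'

# Field 4 is the PID, field 5 the process type; keep only compute processes
pids=$(echo "$smi_procs" | awk '$5 == "C" { print $4 }' | tr '\n' ' ')
echo "$pids"
# -> 5681 26781
```

Those PIDs can then be passed to `sudo kill -9` as suggested, freeing the ~10 GB of GPU memory they hold.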

Hi @Morganh,

Please try with the dataset I'm using: Tao toolkit container not installing - #19 by soundarrajan

The pipeline I'm following is: PASCAL VOC dataset (17,216 images) → KITTI format → TFRecord.

The KITTI dataset seems huge, around 13 GB, right? PASCAL VOC is only about 2 GB.

Please kill the previous processes and retry.

Hi @Morganh ,

Now I'm getting the error below:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
  (0) Not found: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_5211}} /home/soundarrajan/detectnet_v2/data/kitti/data//home/soundarrajan/detectnet_v2/data/kitti/data/2009_004303..jpg; No such file or directory
	 [[{{node AssetLoader/ReadFile}}]]
	 [[data_loader_out]]
  (1) Not found: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_5211}} /home/soundarrajan/detectnet_v2/data/kitti/data//home/soundarrajan/detectnet_v2/data/kitti/data/2009_004303..jpg; No such file or directory
	 [[{{node AssetLoader/ReadFile}}]]
	 [[data_loader_out]]
	 [[data_loader_out/_3767]]
0 successful operations.
0 derived errors ignored.

But I am able to locate the 2009_004303.jpg file in the directory.
COMMAND: locate 2009_004303
/home/soundarrajan/dataset/kitti/data/2009_004303.jpg
/home/soundarrajan/dataset/kitti/labels/2009_004303.txt
/home/soundarrajan/detectnet_v2/data/kitti/data/2009_004303.jpg
/home/soundarrajan/detectnet_v2/data/kitti/labels/2009_004303.txt

Attached full log:
training_log.txt (51.2 KB)

Please modify the path below. The path is not right, or not necessary:
image_directory_path: "/home/soundarrajan/detectnet_v2/data/kitti/data"

Hi @Morganh,

The above-mentioned path is correct; that's where the dataset is present.

I am able to locate it:
COMMAND: locate 2009_004303
/home/soundarrajan/dataset/kitti/data/2009_004303.jpg
/home/soundarrajan/dataset/kitti/labels/2009_004303.txt
/home/soundarrajan/detectnet_v2/data/kitti/data/2009_004303.jpg
/home/soundarrajan/detectnet_v2/data/kitti/labels/2009_004303.txt

The problem, I think, is that the image path is parsed twice:

tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
(0) Not found: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_5211}} /home/soundarrajan/detectnet_v2/data/kitti/data//home/soundarrajan/detectnet_v2/data/kitti/data/2009_004303.jpg; No such file or directory
[[{{node AssetLoader/ReadFile}}]]

Attached tao_mounts file
.tao_mounts.json (802 Bytes)

Could you kindly check whether anything needs to change in .tao_mounts.json?

Yes. I am afraid you set an extra path when you generated the tfrecord files. Please modify the spec so that the path is not parsed twice.
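The doubled path in the NotFoundError is consistent with a plain string join of the `image_directory_path` prefix and an already-absolute path baked into the tfrecords. A minimal sketch of that failure mode (how TAO joins the two internally is an assumption; the variable names here are illustrative only):

```shell
# Prefix from image_directory_path in the training spec
prefix="/home/soundarrajan/detectnet_v2/data/kitti/data"
# Absolute path already stored per-frame in the tfrecords
# (because image_dir_name in dataset_convert was also absolute)
stored="/home/soundarrajan/detectnet_v2/data/kitti/data"

# Naive join reproduces the broken double path from the error log
joined="${prefix}/${stored}"
echo "$joined"
# -> /home/soundarrajan/detectnet_v2/data/kitti/data//home/soundarrajan/detectnet_v2/data/kitti/data
```

Making the directory names in the dataset_convert spec relative to `root_directory_path` avoids the duplication.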

Hi @Morganh,

Can you please help me identify where I'm making that mistake?

COMMAND for TFRecord conversion:
tao detectnet_v2 dataset_convert -v -d /home/soundarrajan/detectnet_v2/config/data_convert_config_spec.txt -o /home/soundarrajan/detectnet_v2/result/tfrecord --log_file /home/soundarrajan/detectnet_v2/logs/dataset_convert_log.txt -v

The tfrecords were generated successfully.

Log file:
dataset_convert_log.txt (5.8 KB)

Config file:
data_convert_config_spec.txt (410 Bytes)

Modify

image_dir_name: "/home/soundarrajan/detectnet_v2/data/kitti/data"
label_dir_name: "/home/soundarrajan/detectnet_v2/data/kitti/labels"

to

image_dir_name: "data"
label_dir_name: "labels"

Hi @Morganh,

Thanks, it worked!
I have updated the paths as below:

kitti_config {
  root_directory_path: "/home/soundarrajan/detectnet_v2/dataset"
  image_dir_name: "kitti/data"
  label_dir_name: "kitti/labels"
  partition_mode: "random"
  num_partitions: 2
  image_extension: ".jpg"
  val_split: 20
  num_shards: 10
}
image_directory_path: "/home/soundarrajan/detectnet_v2/dataset"
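With the corrected spec, joining `root_directory_path` and the now-relative `image_dir_name` yields a single clean path; a quick sketch of that check (the join semantics are my assumption of what dataset_convert does):

```shell
# Values from the corrected kitti_config above
root_directory_path="/home/soundarrajan/detectnet_v2/dataset"
image_dir_name="kitti/data"

# Relative dir name joins onto the root exactly once
resolved="${root_directory_path}/${image_dir_name}"
echo "$resolved"
# -> /home/soundarrajan/detectnet_v2/dataset/kitti/data
```

The same resolved directory is what `image_directory_path` should point at in the training spec.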

One more question: is it possible to do transfer learning on a custom dataset starting from an SSD MobileNet V2 tflite model pretrained on COCO 2017?
If so, please share a reference link to explore.

No, currently it is not supported.

Hi @Morganh,

Is the tflite framework not supported, or the SSD MobileNet V2 architecture?

Is there a model/framework support matrix I can check?

TAO supports the SSD MobileNet V2 architecture. See SSD — TAO Toolkit 3.22.05 documentation.
But TAO does not support 3rd-party tflite pre-trained models.