Detectnet_v2 training core dumped error

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) : **x86_64 GPU machine**
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) : Detectnet_v2
• TLT Version (Please run "tlt info --verbose" and share "docker_tag" here)
• Training spec file (If you have one, please share it here) :
detectnet_v2_train_config.txt (11.1 KB)

• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I'm getting a core dump error during training.
Attached is the full log
training_log.txt (42.0 KB)
and the config file for reference.
.tao_mounts.json file:
.tao_mounts.json (802 Bytes)

COMMAND: tao detectnet_v2 train -k tao_encode -n detectnet_v2_resnet18 -r /home/soundarrajan/detectnet_v2/result/training -e /home/soundarrajan/detectnet_v2/config/detectnet_v2_train_config.txt --log_file /home/soundarrajan/detectnet_v2/logs/training_log.txt

ERROR:

INFO:tensorflow:Graph was finalized.
2022-06-06 12:34:52,779 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2022-06-06 12:34:54,226 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2022-06-06 12:34:54,743 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2022-06-06 12:35:00,359 [INFO] tensorflow: Saving checkpoints for step-0.
2022-06-06 12:35:15.946288: F tensorflow/core/kernels/cuda_solvers.cc:94] Check failed: cusolverDnCreate(&cusolver_dn_handle) == CUSOLVER_STATUS_SUCCESS Failed to create cuSolverDN instance.
[2e033a5e779a:00072] *** Process received signal ***
[2e033a5e779a:00072] Signal: Aborted (6)
[2e033a5e779a:00072] Signal code:  (-6)
[2e033a5e779a:00072] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef10)[0x7f2a55138f10]
[2e033a5e779a:00072] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f2a55138e87]
[2e033a5e779a:00072] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f2a5513a7f1]
[2e033a5e779a:00072] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x82f75b4)[0x7f29cebea5b4]
[2e033a5e779a:00072] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow10CudaSolverC1EPNS_15OpKernelContextE+0x102)[0x7f29cab3d042]
[2e033a5e779a:00072] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow18MatrixInverseOpGpuIfE12ComputeAsyncEPNS_15OpKernelContextESt8functionIFvvEE+0x147)[0x7f29ca1f9d27]
[2e033a5e779a:00072] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice12ComputeAsyncEPNS_13AsyncOpKernelEPNS_15OpKernelContextESt8functionIFvvEE+0xeb)[0x7f29c5b0f69b]
[2e033a5e779a:00072] [ 7] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0xf9617d)[0x7f29c5b7317d]
[2e033a5e779a:00072] [ 8] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0xf97c6f)[0x7f29c5b74c6f]
[2e033a5e779a:00072] [ 9] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7f29c5c24791]
[2e033a5e779a:00072] [10] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7f29c5c21df8]
[2e033a5e779a:00072] [11] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df)[0x7f2a530236df]
[2e033a5e779a:00072] [12] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f2a54ee26db]
[2e033a5e779a:00072] [13] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f2a5521b61f]
[2e033a5e779a:00072] *** End of error message ***
Aborted (core dumped)

Kindly help me fix this issue so that the model trains successfully.

Are you using nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3 or nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3 ?

Hi @Morganh,

Output for COMMAND: tao info --verbose

Configuration of the TAO Toolkit Instance

dockers:
        nvidia/tao/tao-toolkit-tf:
                v3.22.05-tf1.15.5-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. augment
                                2. bpnet
                                3. classification
                                4. dssd
                                5. faster_rcnn
                                6. emotionnet
                                7. efficientdet
                                8. fpenet
                                9. gazenet
                                10. gesturenet
                                11. heartratenet
                                12. lprnet
                                13. mask_rcnn
                                14. multitask_classification
                                15. retinanet
                                16. ssd
                                17. unet
                                18. yolo_v3
                                19. yolo_v4
                                20. yolo_v4_tiny
                                21. converter
                v3.22.05-tf1.15.4-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. detectnet_v2
        nvidia/tao/tao-toolkit-pyt:
                v3.22.05-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. speech_to_text
                                2. speech_to_text_citrinet
                                3. speech_to_text_conformer
                                4. action_recognition
                                5. pointpillars
                                6. pose_classification
                                7. spectro_gen
                                8. vocoder
                v3.21.11-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. text_classification
                                2. question_answering
                                3. token_classification
                                4. intent_slot_classification
                                5. punctuation_and_capitalization
        nvidia/tao/tao-toolkit-lm:
                v3.22.05-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. n_gram
format_version: 2.0
toolkit_version: 3.22.05
published_date: 05/25/2022

Can you share the result of $nvidia-smi ?

Hi @Morganh,

Please find the required output for the COMMAND: nvidia-smi

Mon Jun  6 18:54:14 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08    Driver Version: 510.73.08    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:2D:00.0 Off |                  N/A |
|  0%   36C    P8    12W / 300W |   9967MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2173      G   /usr/lib/xorg/Xorg                  9MiB |
|    0   N/A  N/A      2348      G   /usr/bin/gnome-shell                6MiB |
|    0   N/A  N/A      5681      C   /usr/bin/python3                  825MiB |
|    0   N/A  N/A     26781      C   /usr/bin/python3                 9121MiB |
+-----------------------------------------------------------------------------+

Please try the following to check if it works. Thanks.
export TF_FORCE_GPU_ALLOW_GROWTH=true
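A minimal sketch of applying this suggestion before re-running the training command (note: with the `tao` launcher, the variable must also be visible inside the container — whether it is forwarded automatically or needs an entry in `.tao_mounts.json` is an assumption to verify against the launcher docs):

```shell
# Let TensorFlow grow GPU memory on demand instead of pre-allocating it all;
# this can avoid cuSolver/cuBLAS handle-creation failures on a busy GPU.
export TF_FORCE_GPU_ALLOW_GROWTH=true
echo "$TF_FORCE_GPU_ALLOW_GROWTH"
# -> true
```

Then re-run the same `tao detectnet_v2 train ...` command in this shell.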

Hi @Morganh,

No luck; I'm still getting the same error even after setting the suggested environment variable.

Full training log:
training_log.txt (42.0 KB)

I cannot reproduce the error while training with KITTI dataset.

I suggest you free the GPU memory and retry:
$ sudo kill -9 5681 26781
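The PIDs above come from the Processes table of `nvidia-smi`. A small sketch of extracting compute-process PIDs from that table automatically, here run against a saved copy of the two process lines from the earlier output (the parsing is an assumption about the table layout; on a live machine you would pipe `nvidia-smi` itself):

```shell
# Sample of the two "C" (compute) rows from the nvidia-smi Processes table above
smi_procs='    0   N/A  N/A      5681      C   /usr/bin/python3                  825MiB
    0   N/A  N/A     26781      C   /usr/bin/python3                 9121MiB'

# Field 4 is the PID, field 5 the process type; keep only compute processes
pids=$(echo "$smi_procs" | awk '$5 == "C" { print $4 }' | tr '\n' ' ')
echo "$pids"
# -> 5681 26781
```

Those PIDs can then be passed to `sudo kill -9` as suggested, freeing the ~10 GB of GPU memory they hold.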

Hi @Morganh,

Please try with the dataset I'm using: Tao toolkit container not installing - #19 by soundarrajan

The pipeline I'm following is: PASCAL VOC dataset (17,216 images) → KITTI format → TFRecord.

The KITTI dataset seems huge, around 13 GB, right? PASCAL VOC is only about 2 GB.

Please kill the previous processes and retry.

Hi @Morganh ,

Now I'm getting the error below:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
  (0) Not found: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_5211}} /home/soundarrajan/detectnet_v2/data/kitti/data//home/soundarrajan/detectnet_v2/data/kitti/data/2009_004303..jpg; No such file or directory
	 [[{{node AssetLoader/ReadFile}}]]
	 [[data_loader_out]]
  (1) Not found: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_5211}} /home/soundarrajan/detectnet_v2/data/kitti/data//home/soundarrajan/detectnet_v2/data/kitti/data/2009_004303..jpg; No such file or directory
	 [[{{node AssetLoader/ReadFile}}]]
	 [[data_loader_out]]
	 [[data_loader_out/_3767]]
0 successful operations.
0 derived errors ignored.

But I am able to locate the 2009_004303.jpg file in the directory.
COMMAND: locate 2009_004303
/home/soundarrajan/dataset/kitti/data/2009_004303.jpg
/home/soundarrajan/dataset/kitti/labels/2009_004303.txt
/home/soundarrajan/detectnet_v2/data/kitti/data/2009_004303.jpg
/home/soundarrajan/detectnet_v2/data/kitti/labels/2009_004303.txt

Attached full log:
training_log.txt (51.2 KB)

Please modify the path below. The path is not right, or not necessary:
image_directory_path: "/home/soundarrajan/detectnet_v2/data/kitti/data"

Hi @Morganh,

The above-mentioned path is correct; that's where the dataset is present.

I am able to locate it:
COMMAND: locate 2009_004303
/home/soundarrajan/dataset/kitti/data/2009_004303.jpg
/home/soundarrajan/dataset/kitti/labels/2009_004303.txt
/home/soundarrajan/detectnet_v2/data/kitti/data/2009_004303.jpg
/home/soundarrajan/detectnet_v2/data/kitti/labels/2009_004303.txt

The problem, I think, is that the image path is parsed twice:

tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
(0) Not found: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_5211}} /home/soundarrajan/detectnet_v2/data/kitti/data//home/soundarrajan/detectnet_v2/data/kitti/data/2009_004303.jpg; No such file or directory
[[{{node AssetLoader/ReadFile}}]]

Attached tao_mounts file
.tao_mounts.json (802 Bytes)

Could you kindly check whether anything needs to change in .tao_mounts.json?

Yes. I am afraid you set an extra path when you generated the tfrecord files. Please modify the spec so that the path is not parsed twice.
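The doubled path in the NotFoundError is consistent with a plain string join of the `image_directory_path` prefix and an already-absolute path baked into the tfrecords. A minimal sketch of that failure mode (how TAO joins the two internally is an assumption; the variable names here are illustrative only):

```shell
# Prefix from image_directory_path in the training spec
prefix="/home/soundarrajan/detectnet_v2/data/kitti/data"
# Absolute path already stored per-frame in the tfrecords
# (because image_dir_name in dataset_convert was also absolute)
stored="/home/soundarrajan/detectnet_v2/data/kitti/data"

# Naive join reproduces the broken double path from the error log
joined="${prefix}/${stored}"
echo "$joined"
# -> /home/soundarrajan/detectnet_v2/data/kitti/data//home/soundarrajan/detectnet_v2/data/kitti/data
```

Making the directory names in the dataset_convert spec relative to `root_directory_path` avoids the duplication.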

Hi @Morganh,

Can you please help me identify where I'm making that mistake?

COMMAND for TFRecord conversion:
tao detectnet_v2 dataset_convert -v -d /home/soundarrajan/detectnet_v2/config/data_convert_config_spec.txt -o /home/soundarrajan/detectnet_v2/result/tfrecord --log_file /home/soundarrajan/detectnet_v2/logs/dataset_convert_log.txt -v

The tfrecords were generated successfully.

Log file:
dataset_convert_log.txt (5.8 KB)

Config file:
data_convert_config_spec.txt (410 Bytes)

Modify

image_dir_name: "/home/soundarrajan/detectnet_v2/data/kitti/data"
label_dir_name: "/home/soundarrajan/detectnet_v2/data/kitti/labels"

to

image_dir_name: "data"
label_dir_name: "labels"

Hi @Morganh,

Thanks, it worked!
I have updated the paths as below:

kitti_config {
  root_directory_path: "/home/soundarrajan/detectnet_v2/dataset"
  image_dir_name: "kitti/data"
  label_dir_name: "kitti/labels"
  partition_mode: "random"
  num_partitions: 2
  image_extension: ".jpg"
  val_split: 20
  num_shards: 10
}
image_directory_path: "/home/soundarrajan/detectnet_v2/dataset"
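With the corrected spec, joining `root_directory_path` and the now-relative `image_dir_name` yields a single clean path; a quick sketch of that check (the join semantics are my assumption of what dataset_convert does):

```shell
# Values from the corrected kitti_config above
root_directory_path="/home/soundarrajan/detectnet_v2/dataset"
image_dir_name="kitti/data"

# Relative dir name joins onto the root exactly once
resolved="${root_directory_path}/${image_dir_name}"
echo "$resolved"
# -> /home/soundarrajan/detectnet_v2/dataset/kitti/data
```

The same resolved directory is what `image_directory_path` should point at in the training spec.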

One more question: is it possible to do transfer learning on a custom dataset starting from an SSD MobileNet V2 tflite model pretrained on COCO 2017?
If so, please share a reference link to explore.

No, currently it is not supported.

Hi @Morganh,

Is the tflite framework not supported, or the SSD MobileNet V2 architecture?

Is there a model/framework support matrix I can check?

TAO supports the SSD MobileNet V2 architecture. See SSD — TAO Toolkit 3.22.05 documentation.
But TAO does not support 3rd-party tflite pre-trained models.