Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc) : x86_64 GPU machine
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) : Detectnet_v2
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file(If have, please share here) :
detectnet_v2_train_config.txt (11.1 KB)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
I’m getting a core dump error during training.
Attached the full log
training_log.txt (42.0 KB)
and config file for reference.
.tao_mounts.json file:
.tao_mounts.json (802 Bytes)
COMMAND: tao detectnet_v2 train -k tao_encode -n detectnet_v2_resnet18 -r /home/soundarrajan/detectnet_v2/result/training -e /home/soundarrajan/detectnet_v2/config/detectnet_v2_train_config.txt --log_file /home/soundarrajan/detectnet_v2/logs/training_log.txt
ERROR:
INFO:tensorflow:Graph was finalized.
2022-06-06 12:34:52,779 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2022-06-06 12:34:54,226 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2022-06-06 12:34:54,743 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2022-06-06 12:35:00,359 [INFO] tensorflow: Saving checkpoints for step-0.
2022-06-06 12:35:15.946288: F tensorflow/core/kernels/cuda_solvers.cc:94] Check failed: cusolverDnCreate(&cusolver_dn_handle) == CUSOLVER_STATUS_SUCCESS Failed to create cuSolverDN instance.
[2e033a5e779a:00072] *** Process received signal ***
[2e033a5e779a:00072] Signal: Aborted (6)
[2e033a5e779a:00072] Signal code: (-6)
[2e033a5e779a:00072] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef10)[0x7f2a55138f10]
[2e033a5e779a:00072] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f2a55138e87]
[2e033a5e779a:00072] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f2a5513a7f1]
[2e033a5e779a:00072] [ 3] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(+0x82f75b4)[0x7f29cebea5b4]
[2e033a5e779a:00072] [ 4] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow10CudaSolverC1EPNS_15OpKernelContextE+0x102)[0x7f29cab3d042]
[2e033a5e779a:00072] [ 5] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so(_ZN10tensorflow18MatrixInverseOpGpuIfE12ComputeAsyncEPNS_15OpKernelContextESt8functionIFvvEE+0x147)[0x7f29ca1f9d27]
[2e033a5e779a:00072] [ 6] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice12ComputeAsyncEPNS_13AsyncOpKernelEPNS_15OpKernelContextESt8functionIFvvEE+0xeb)[0x7f29c5b0f69b]
[2e033a5e779a:00072] [ 7] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0xf9617d)[0x7f29c5b7317d]
[2e033a5e779a:00072] [ 8] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0xf97c6f)[0x7f29c5b74c6f]
[2e033a5e779a:00072] [ 9] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x281)[0x7f29c5c24791]
[2e033a5e779a:00072] [10] /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7f29c5c21df8]
[2e033a5e779a:00072] [11] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df)[0x7f2a530236df]
[2e033a5e779a:00072] [12] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f2a54ee26db]
[2e033a5e779a:00072] [13] /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f2a5521b61f]
[2e033a5e779a:00072] *** End of error message ***
Aborted (core dumped)
Kindly help me fix the issue so I can train the model successfully.
Hi @Morganh,
Output for COMMAND: tao info --verbose
Configuration of the TAO Toolkit Instance
dockers:
nvidia/tao/tao-toolkit-tf:
v3.22.05-tf1.15.5-py3:
docker_registry: nvcr.io
tasks:
1. augment
2. bpnet
3. classification
4. dssd
5. faster_rcnn
6. emotionnet
7. efficientdet
8. fpenet
9. gazenet
10. gesturenet
11. heartratenet
12. lprnet
13. mask_rcnn
14. multitask_classification
15. retinanet
16. ssd
17. unet
18. yolo_v3
19. yolo_v4
20. yolo_v4_tiny
21. converter
v3.22.05-tf1.15.4-py3:
docker_registry: nvcr.io
tasks:
1. detectnet_v2
nvidia/tao/tao-toolkit-pyt:
v3.22.05-py3:
docker_registry: nvcr.io
tasks:
1. speech_to_text
2. speech_to_text_citrinet
3. speech_to_text_conformer
4. action_recognition
5. pointpillars
6. pose_classification
7. spectro_gen
8. vocoder
v3.21.11-py3:
docker_registry: nvcr.io
tasks:
1. text_classification
2. question_answering
3. token_classification
4. intent_slot_classification
5. punctuation_and_capitalization
nvidia/tao/tao-toolkit-lm:
v3.22.05-py3:
docker_registry: nvcr.io
tasks:
1. n_gram
format_version: 2.0
toolkit_version: 3.22.05
published_date: 05/25/2022
Can you share the result of $ nvidia-smi ?
Hi @Morganh,
Please find the required output for the COMMAND: nvidia-smi
Mon Jun 6 18:54:14 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08 Driver Version: 510.73.08 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:2D:00.0 Off | N/A |
| 0% 36C P8 12W / 300W | 9967MiB / 11264MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2173 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 2348 G /usr/bin/gnome-shell 6MiB |
| 0 N/A N/A 5681 C /usr/bin/python3 825MiB |
| 0 N/A N/A 26781 C /usr/bin/python3 9121MiB |
+-----------------------------------------------------------------------------+
Please try the below to check if it works. Thanks.
export TF_FORCE_GPU_ALLOW_GROWTH=true
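Side note: an environment variable exported on the host shell does not automatically reach the process inside the TAO container. If it has to be passed through, the launcher's mounts file accepts an Envs list — a hedged fragment below; please verify the key names against your TAO launcher documentation before relying on it:

```json
{
    "Envs": [
        {
            "variable": "TF_FORCE_GPU_ALLOW_GROWTH",
            "value": "true"
        }
    ]
}
```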
Hi @Morganh,
No luck; I still get the same error even after setting the suggested environment variable.

training full log:
training_log.txt (42.0 KB)
I cannot reproduce the error while training with the KITTI dataset.
I suggest you free the GPU memory and retry.
$ sudo kill -9 5681 26781
Hi @Morganh,
Please try with the dataset I’m using: Tao toolkit container not installing - #19 by soundarrajan
PASCAL VOC dataset (17,216 images) → KITTI format → TFRecord is the pipeline I’m following.
The KITTI dataset seems to be huge, around 13 GB, right? But PASCAL VOC is only about 2 GB.
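As a side note, the VOC → KITTI label step above can be sketched as follows (a minimal sketch: for 2D detection training, only the class name and the four bbox fields are meaningful, and the remaining KITTI fields can be zero-filled; `voc_box_to_kitti` is a hypothetical helper, not a TAO function):

```python
def voc_box_to_kitti(cls_name, xmin, ymin, xmax, ymax):
    # KITTI label line has 15 space-separated fields:
    # type truncated occluded alpha x1 y1 x2 y2 h w l X Y Z rotation_y
    # Everything after the 2D bbox is zero-filled for 2D detection.
    return (f"{cls_name} 0.00 0 0.00 "
            f"{xmin:.2f} {ymin:.2f} {xmax:.2f} {ymax:.2f} "
            f"0.00 0.00 0.00 0.00 0.00 0.00 0.00")

# Example: one VOC-style box written as one KITTI label line
print(voc_box_to_kitti("car", 48, 240, 195, 371))
```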
Please kill the previous processes and retry.
Hi @Morganh ,
Now I’m getting the error below:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
(0) Not found: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_5211}} /home/soundarrajan/detectnet_v2/data/kitti/data//home/soundarrajan/detectnet_v2/data/kitti/data/2009_004303..jpg; No such file or directory
[[{{node AssetLoader/ReadFile}}]]
[[data_loader_out]]
(1) Not found: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_5211}} /home/soundarrajan/detectnet_v2/data/kitti/data//home/soundarrajan/detectnet_v2/data/kitti/data/2009_004303..jpg; No such file or directory
[[{{node AssetLoader/ReadFile}}]]
[[data_loader_out]]
[[data_loader_out/_3767]]
0 successful operations.
0 derived errors ignored.
But I am able to locate the 2009_004303.jpg file in the directory:
COMMAND: locate 2009_004303
/home/soundarrajan/dataset/kitti/data/2009_004303.jpg
/home/soundarrajan/dataset/kitti/labels/2009_004303.txt
/home/soundarrajan/detectnet_v2/data/kitti/data/2009_004303.jpg
/home/soundarrajan/detectnet_v2/data/kitti/labels/2009_004303.txt
Attached full log:
training_log.txt (51.2 KB)
Please modify the path below. The path is not right, or not necessary.
image_directory_path: "/home/soundarrajan/detectnet_v2/data/kitti/data"
Hi @Morganh,
The above-mentioned path is correct, and that is where the dataset is present.
I am able to locate it:
COMMAND: locate 2009_004303
/home/soundarrajan/dataset/kitti/data/2009_004303.jpg
/home/soundarrajan/dataset/kitti/labels/2009_004303.txt
/home/soundarrajan/detectnet_v2/data/kitti/data/2009_004303.jpg
/home/soundarrajan/detectnet_v2/data/kitti/labels/2009_004303.txt
The problem is that the image path is parsed twice, I think:
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
(0) Not found: {{function_node __inference_Dataset_map__map_func_set_random_wrapper_5211}} /home/soundarrajan/detectnet_v2/data/kitti/data//home/soundarrajan/detectnet_v2/data/kitti/data/2009_004303.jpg; No such file or directory
[[{{node AssetLoader/ReadFile}}]]
Attached tao_mounts file
.tao_mounts.json (802 Bytes)
Kindly check: is there anything to change in tao_mounts?
Yes. I am afraid you set an extra path when you generated the tfrecord files. Please modify it so that the path is not parsed twice.
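To illustrate the suspected doubling — a sketch assuming the tfrecords stored absolute frame paths and the loader prefixes them with image_directory_path (the variable names are illustrative, not TAO internals):

```python
import os.path

image_directory_path = "/home/soundarrajan/detectnet_v2/data/kitti/data"

# If dataset_convert stored an absolute path in the tfrecord...
frame_in_tfrecord = "/home/soundarrajan/detectnet_v2/data/kitti/data/2009_004303.jpg"

# ...a plain prefix concatenation reproduces the doubled path from the error
# message (note the "data//home/..." seam):
print(image_directory_path + "/" + frame_in_tfrecord)

# If the tfrecord stores only the name relative to image_directory_path,
# the join resolves correctly:
print(os.path.join(image_directory_path, "2009_004303.jpg"))
```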
Hi @Morganh,
Can you please help identify where I’m making that mistake?
COMMAND for TFRecord conversion:
tao detectnet_v2 dataset_convert -v -d /home/soundarrajan/detectnet_v2/config/data_convert_config_spec.txt -o /home/soundarrajan/detectnet_v2/result/tfrecord --log_file /home/soundarrajan/detectnet_v2/logs/dataset_convert_log.txt -v
TFRecords were generated successfully.
Log file:
dataset_convert_log.txt (5.8 KB)
Config file:
data_convert_config_spec.txt (410 Bytes)
Hi @Morganh,
Thanks, it worked!
I have updated the path as below:
kitti_config {
  root_directory_path: "/home/soundarrajan/detectnet_v2/dataset"
  image_dir_name: "kitti/data"
  label_dir_name: "kitti/labels"
  partition_mode: "random"
  num_partitions: 2
  image_extension: ".jpg"
  val_split: 20
  num_shards: 10
}
image_directory_path: "/home/soundarrajan/detectnet_v2/dataset"
One more question: is it possible to do transfer learning on an SSD MobileNet V2 TFLite model pretrained on the COCO 2017 dataset, using a custom dataset?
If possible, please share a reference link to explore.
No, currently it is not supported.
Hi @Morganh,
Is the TFLite framework not supported, or is the SSD MobileNet V2 architecture not supported?
Is there any model/framework matrix to check?
TAO supports the SSD MobileNet V2 architecture. See SSD — TAO Toolkit 3.22.05 documentation.
But TAO does not support 3rd-party TFLite pre-trained models.