I’m trying to run the TLT YOLOv4 example on a GPU cluster with a Singularity image.
After building the Singularity image, training, pruning, and retraining all work with no problem,
but when I then try to run inference on the testing images with "yolo_v4 inference -i /test_folder -o /output_folder -e xxx.txt -m /weights/yolov4_resnet18_epoch_050.tlt -k key", it gives the following error:
0%| | 0/940 [00:00<?, ?it/s]Floating point exception (core dumped)
Using TensorFlow backend.
Traceback (most recent call last):
File "/usr/local/bin/yolo_v4", line 8, in <module>
sys.exit(main())
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/entrypoint/yolo_v4.py", line 12, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
The specs are:
• Hardware: GPU cluster containing 4 NVIDIA Ampere A100 GPUs
• Network Type: YOLOv4
• tlt info :
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 2.0
toolkit_version: 3.21.11
• The training spec files are the same as the sample ones: yolo_v4_train_resnet18_kitti.txt and yolo_v4_retrain_resnet18_kitti.txt
The Singularity image is built by bootstrapping the Docker image nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3.
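For reference, a Singularity image bootstrapped from that Docker tag can be described with a definition file along these lines (a minimal sketch; the file name is illustrative and requires network access to nvcr.io at build time):

```
Bootstrap: docker
From: nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3
```

The image would then be built with something like `singularity build tlt.sif tlt.def` on a machine where building is permitted.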
If the question is regarding the docker build tag…
Otherwise, there is no Docker daemon running on the GPU cluster as far as I know.
There is no Docker daemon running on the cluster (for security reasons), so it is not possible to run that command there…
But I tried it on a local machine with a 2080 Ti GPU, where I can run with Docker, and I got the same error.
I’m currently running inference on the testing set of the KITTI detection images (Download)… it does not have label files for the testing set, and I’m not sure whether label files matter when running inference.
The same error is shown when running inference on the training set, even though training shows no problem. I would assume that if some labels had false or negative values, the error would occur during training.
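To rule out bad label values like the ones mentioned above, the KITTI label files can be scanned for negative or inverted bounding boxes. This is a minimal sketch (the function name and checks are my own, not part of TAO; KITTI labels keep the 2D bbox in columns 5–8):

```python
import glob
import os
import sys


def check_kitti_labels(label_dir):
    """Return (file, line_no, message) tuples for suspicious KITTI boxes.

    KITTI detection labels store the 2D bounding box as
    xmin, ymin, xmax, ymax in fields 5-8 of each line.
    Flags negative coordinates and zero/negative-area boxes.
    """
    problems = []
    for path in sorted(glob.glob(os.path.join(label_dir, "*.txt"))):
        with open(path) as f:
            for lineno, line in enumerate(f, 1):
                fields = line.split()
                if len(fields) < 8:
                    problems.append((path, lineno, "too few fields"))
                    continue
                xmin, ymin, xmax, ymax = map(float, fields[4:8])
                if xmin < 0 or ymin < 0 or xmax <= xmin or ymax <= ymin:
                    problems.append(
                        (path, lineno, f"bad bbox {xmin} {ymin} {xmax} {ymax}")
                    )
    return problems


if __name__ == "__main__":
    label_dir = sys.argv[1] if len(sys.argv) > 1 else "labels"
    for path, lineno, msg in check_kitti_labels(label_dir):
        print(f"{path}:{lineno}: {msg}")
```

Running it over the training label folder would show whether any files contain the kind of values suspected here.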
For the public KITTI dataset, can you follow the official TAO Jupyter notebook to train and run inference? We have not seen the issue you mention above.
Also, can you try with fewer images when you run "yolo_v4 inference -i /test_folder -o /output_folder -e xxx.txt -m /weights/yolov4_resnet18_epoch_050.tlt -k key" to see if the issue persists? If so, please share several label files. If you can reproduce the issue with only 1 or 2 images/labels, that would be even better. Then you can check what is wrong.
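The suggestion above — reproducing with only 1 or 2 images — can be scripted with a small helper like this (a minimal sketch; the function name, extensions, and paths are assumptions, not part of TAO):

```python
import os
import shutil


def make_mini_test_set(image_dir, out_dir, n=2, exts=(".png", ".jpg", ".jpeg")):
    """Copy the first n images from image_dir into out_dir.

    out_dir can then be passed to `yolo_v4 inference -i` to check
    whether the error still occurs with a minimal input set.
    """
    os.makedirs(out_dir, exist_ok=True)
    picked = sorted(
        f for f in os.listdir(image_dir) if f.lower().endswith(exts)
    )[:n]
    for name in picked:
        shutil.copy(os.path.join(image_dir, name), os.path.join(out_dir, name))
    return picked
```

If the error still reproduces on the two-image folder, those specific images (and their labels, if any) are the ones worth sharing.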