TLT (with singularity) yolov4 inference error

good day,

I’m trying to run the TLT yolov4 example on a GPU cluster with a Singularity image.
After building the Singularity image, training, pruning, and retraining all work with no problem,
but when I try to run inference on the test images with "yolo_v4 inference -i /test_folder -o /output_folder -e xxx.txt -m /weights/yolov4_resnet18_epoch_050.tlt -k key" it gives the following error:

  0%|          | 0/940 [00:00<?, ?it/s]Floating point exception (core dumped)
Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/bin/yolo_v4", line 8, in <module>
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/entrypoint/", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/", line 296, in launch_job
AssertionError: Process run failed.

The specs are:
• Hardware: GPU cluster contains 4 GPUs Ampere A100
• Network Type: Yolo_v4
• tlt info :
dockers: [‘nvidia/tao/tao-toolkit-tf’, ‘nvidia/tao/tao-toolkit-pyt’, ‘nvidia/tao/tao-toolkit-lm’]
format_version: 2.0
toolkit_version: 3.21.11
• The training spec files are the same as the sample ones: yolo_v4_train_resnet18_kitti.txt and yolo_v4_retrain_resnet18_kitti.txt

any idea what might be causing the error?

Can you share the output of $ docker ps ?
I want to check which docker image you are running.

the singularity image is built by bootstrap docker:
if the question is regarding the docker build tag…
otherwise, there is no docker daemon running on the GPU cluster as far as i know

Can you log in to the docker container and try to run it there?
$ docker run --runtime=nvidia -it --rm -v yourlocalfolder:dockerfolder /bin/bash

Then run inference again.

Also, please check your label files. Negative values of x1, y1, x2, y2 are not expected.
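One way to do that label check is a small script like the one below. This is only a sketch, not something shipped with TLT/TAO: `find_bad_boxes` is a made-up helper name, and it assumes the labels are plain KITTI-format text files where fields 5 to 8 of each line are the box corners x1, y1, x2, y2.

```python
from pathlib import Path


def find_bad_boxes(label_dir):
    """Return (file name, line number, reason) for every suspicious box.

    Assumes KITTI-format labels: type, truncated, occluded, alpha,
    then x1, y1, x2, y2 at indices 4..7 of each whitespace-split line.
    """
    problems = []
    for label_file in sorted(Path(label_dir).glob("*.txt")):
        lines = label_file.read_text().splitlines()
        for line_no, line in enumerate(lines, start=1):
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            if len(fields) < 8:
                problems.append((label_file.name, line_no, "too few fields"))
                continue
            try:
                x1, y1, x2, y2 = (float(v) for v in fields[4:8])
            except ValueError:
                problems.append((label_file.name, line_no, "non-numeric box"))
                continue
            if min(x1, y1, x2, y2) < 0:
                problems.append((label_file.name, line_no, "negative coordinate"))
            elif x2 <= x1 or y2 <= y1:
                problems.append((label_file.name, line_no, "empty or inverted box"))
    return problems
```

Calling `find_bad_boxes` on the label folder and printing the returned list shows which files and lines to look at. The empty or inverted box check is an extra assumption on top of the negative value check mentioned above.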

hello, sorry for the late reply

there is no docker daemon running on the cluster (for security reasons), it is not possible to run that command there…
But I tried it on a local machine with a 2080Ti GPU, where I can run with docker, and I got the same error.

I’m currently running inference on the testing set of the KITTI detection images. It does not have label files for testing, and I’m not sure if label files would matter when running inference.
The same error is shown when running inference on the training set, even though training does not show any problem. I would assume that if some labels had erroneous negative values, the error would occur during training.

For the public KITTI dataset, can you follow the official TAO Jupyter notebook to train and run inference? We did not see the issue you mentioned above.

Also, can you run "yolo_v4 inference -i /test_folder -o /output_folder -e xxx.txt -m /weights/yolov4_resnet18_epoch_050.tlt -k key" on fewer images to see if the issue is still there? If yes, please share several label files. If you can reproduce the issue with only 1 or 2 images/labels, that will be better. Then you can check what is wrong.
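To cut the test set down, something like the following could copy a handful of images into a separate folder, which can then be passed to -i instead of /test_folder. It is a rough sketch only: `copy_subset` is a hypothetical helper, and the list of image suffixes is an assumption about what is in the folder.

```python
import shutil
from pathlib import Path

# Assumed image extensions; adjust to match the actual test folder.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def copy_subset(src_dir, dst_dir, count=2, start=0):
    """Copy `count` images, in sorted name order from index `start`, into dst_dir.

    Returns the names of the copied files so a failing batch can be identified.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    images = sorted(
        p for p in Path(src_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    chosen = images[start:start + count]
    for image in chosen:
        shutil.copy2(image, dst / image.name)
    return [image.name for image in chosen]
```

Changing `start` while keeping `count` small lets you step through the folder in batches until the one that triggers the crash turns up.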
