I’m trying to run the TLT YOLOv4 example on a GPU cluster with a Singularity image.
After building the Singularity image, training, pruning, and retraining all work with no problem,
but when I then try to run inference on the testing images with "yolo_v4 inference -i /test_folder -o /output_folder -e xxx.txt -m /weights/yolov4_resnet18_epoch_050.tlt -k key", it gives the following error:
0%| | 0/940 [00:00<?, ?it/s]Floating point exception (core dumped)
Using TensorFlow backend.
Traceback (most recent call last):
File "/usr/local/bin/yolo_v4", line 8, in <module>
sys.exit(main())
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/entrypoint/yolo_v4.py", line 12, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
The specs are:
• Hardware: GPU cluster containing 4 NVIDIA Ampere A100 GPUs
• Network Type: YOLOv4
• tlt info :
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 2.0
toolkit_version: 3.21.11
• The training spec files are the same as the sample ones: yolo_v4_train_resnet18_kitti.txt and yolo_v4_retrain_resnet18_kitti.txt
The Singularity image is built by bootstrapping the Docker image nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3.
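For reference, a Singularity image bootstrapped from that Docker tag can be described with a definition file along these lines (a minimal sketch; the file name is illustrative and requires network access to nvcr.io at build time):

```
Bootstrap: docker
From: nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3
```

The image would then be built with something like `singularity build tlt.sif tlt.def` on a machine where building is permitted.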
If the question is regarding the docker build tag…
Otherwise, there is no Docker daemon running on the GPU cluster as far as I know.
There is no Docker daemon running on the cluster (for security reasons), so it is not possible to run that command there…
But I tried it on a local machine with a 2080 Ti GPU, where I can run with Docker, and I got the same error.
I’m currently running inference on the testing set of the KITTI detection images (Download)… it does not have label files for the testing set, and I’m not sure whether label files matter when running inference.
The same error is shown when running inference on the training set, even though training shows no problem. I would assume that if some labels had false or negative values, the error would occur during training.
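To rule out bad label values like the ones mentioned above, the KITTI label files can be scanned for negative or inverted bounding boxes. This is a minimal sketch (the function name and checks are my own, not part of TAO; KITTI labels keep the 2D bbox in columns 5–8):

```python
import glob
import os
import sys


def check_kitti_labels(label_dir):
    """Return (file, line_no, message) tuples for suspicious KITTI boxes.

    KITTI detection labels store the 2D bounding box as
    xmin, ymin, xmax, ymax in fields 5-8 of each line.
    Flags negative coordinates and zero/negative-area boxes.
    """
    problems = []
    for path in sorted(glob.glob(os.path.join(label_dir, "*.txt"))):
        with open(path) as f:
            for lineno, line in enumerate(f, 1):
                fields = line.split()
                if len(fields) < 8:
                    problems.append((path, lineno, "too few fields"))
                    continue
                xmin, ymin, xmax, ymax = map(float, fields[4:8])
                if xmin < 0 or ymin < 0 or xmax <= xmin or ymax <= ymin:
                    problems.append(
                        (path, lineno, f"bad bbox {xmin} {ymin} {xmax} {ymax}")
                    )
    return problems


if __name__ == "__main__":
    label_dir = sys.argv[1] if len(sys.argv) > 1 else "labels"
    for path, lineno, msg in check_kitti_labels(label_dir):
        print(f"{path}:{lineno}: {msg}")
```

Running it over the training label folder would show whether any files contain the kind of values suspected here.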
For the public KITTI dataset, can you follow the official TAO Jupyter notebook to train and run inference? We have not seen the issue you mention above.
Also, can you try with fewer images when you run "yolo_v4 inference -i /test_folder -o /output_folder -e xxx.txt -m /weights/yolov4_resnet18_epoch_050.tlt -k key" to see if the issue persists? If so, please share several label files. If you can reproduce the issue with only 1 or 2 images/labels, that would be even better. Then you can check what is wrong.
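The suggestion above — reproducing with only 1 or 2 images — can be scripted with a small helper like this (a minimal sketch; the function name, extensions, and paths are assumptions, not part of TAO):

```python
import os
import shutil


def make_mini_test_set(image_dir, out_dir, n=2, exts=(".png", ".jpg", ".jpeg")):
    """Copy the first n images from image_dir into out_dir.

    out_dir can then be passed to `yolo_v4 inference -i` to check
    whether the error still occurs with a minimal input set.
    """
    os.makedirs(out_dir, exist_ok=True)
    picked = sorted(
        f for f in os.listdir(image_dir) if f.lower().endswith(exts)
    )[:n]
    for name in picked:
        shutil.copy(os.path.join(image_dir, name), os.path.join(out_dir, name))
    return picked
```

If the error still reproduces on the two-image folder, those specific images (and their labels, if any) are the ones worth sharing.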